Review of https://github.com/mlabonne/llm-course

Running LLMs
Running LLMs can be resource-intensive, but there are flexible ways to use them, from hosted APIs to fully local setups.
APIs vs Local Models: Private APIs (OpenAI, Google, Anthropic) are fastest to integrate, while open-weight models, whether hosted via OpenRouter, Hugging Face, or Together AI or run locally, allow for more customization and privacy.
Prompt Engineering: Techniques like zero-shot, few-shot, Chain-of-Thought, and ReAct greatly influence quality.
Structured Outputs: Tools like Outlines and JSON schemas can guide LLM responses into clean, usable formats.
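A minimal sketch of how these prompting styles differ in the prompt itself; call_llm is a hypothetical stand-in for whichever API or local model is in use, not anything from the course.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: wire up any API client or local model here."""
    raise NotImplementedError

# Zero-shot: just ask.
zero_shot = "Classify the sentiment of this review: 'The battery died in a day.'"

# Few-shot: worked examples teach the model the expected format.
few_shot = (
    "Review: 'Great screen, fast shipping.' -> positive\n"
    "Review: 'Stopped working after a week.' -> negative\n"
    "Review: 'The battery died in a day.' ->"
)

# Chain-of-Thought: request intermediate reasoning before the final label.
cot = ("Think step by step about the review below, then give a one-word "
       "sentiment label.\nReview: 'The battery died in a day.'")

# Structured output: constrain the response to JSON, then parse it.
structured = ('Return only a JSON object of the form {"sentiment": '
              '"positive" | "negative"} for: \'The battery died in a day.\'')
# json.loads(call_llm(structured)) would parse it; validate before trusting it.
```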
Building a Vector Storage
Vector databases form the foundation of Retrieval Augmented Generation (RAG) systems.
Document ingestion and splitting: Use structured loaders and semantic text splitters (LangChain provides many).
Embeddings: Task-specific embedding models improve semantic retrieval accuracy.
Vector databases: Tools like Chroma, Pinecone, Milvus, FAISS, Annoy, etc. efficiently store and search embeddings.
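To make the flow concrete, here is a minimal ingestion-to-search sketch using sentence-transformers and FAISS; the model name and the toy documents are assumptions for illustration, not recommendations from the course.

```python
import faiss
from sentence_transformers import SentenceTransformer

docs = ["LLMs predict the next token.",
        "FAISS searches dense vectors efficiently.",
        "RAG grounds answers in retrieved text."]

# Embed the chunks; normalized vectors let inner product act as cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)

# Exact inner-product index over the embedding dimension.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Retrieve the top-2 chunks for a query.
query = model.encode(["How does retrieval grounding work?"],
                     normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([docs[i] for i in ids[0]])
```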
Retrieval Augmented Generation
RAG grounds LLM responses in knowledge retrieved at query time; a minimal end-to-end step is sketched after this list.
Orchestrators: Frameworks like LangChain and LlamaIndex streamline RAG pipelines.
Retrievers & Memory: Advanced retrievers (CoRAG, HyDE) and context memory systems boost relevance.
Evaluation: Tools like Ragas and DeepEval assess retrieval precision and answer quality.
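Stripped of any framework, a single RAG step looks roughly like this; retrieve and call_llm are hypothetical hooks into the vector index and model described above.

```python
def retrieve(question: str, k: int = 3) -> list[str]:
    """Hypothetical: top-k chunks from the vector index built earlier."""
    ...

def call_llm(prompt: str) -> str:
    """Hypothetical: any hosted or local model."""
    ...

def rag_answer(question: str) -> str:
    # Stuff retrieved chunks into the prompt and constrain the model to them.
    context = "\n\n".join(retrieve(question))
    prompt = ("Answer using ONLY the context below. "
              "If the answer is not in the context, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return call_llm(prompt)
```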
Advanced RAG
For production-grade systems, RAG can integrate structured databases, APIs, and even programmatic optimizations.
Query Construction: Translate user intent into SQL or graph queries.
Agents & Tools: Combine LLMs with external APIs and interpreters for more powerful reasoning.
Post-processing: Techniques like re-ranking, RAG-fusion, and classification refine the final output (an RRF sketch follows this list).
DSPy: Allows programmatic prompt and weight optimization.
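As a concrete post-processing example, Reciprocal Rank Fusion (the merging step behind RAG-fusion) combines the rankings produced for several paraphrases of the same query; this is the standard RRF formula, not code from the course.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Rankings from three paraphrases of the same user question:
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2"]])
# -> d2 first: it ranks highly in all three lists.
```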
Agents
Agents bring autonomy to LLMs, enabling them to reason, take actions, and learn from the results.
Core Loop: Thought → Action → Observation (a minimal version is sketched after this list).
Frameworks: LangGraph (design and visualization of workflows), LlamaIndex (data-augmented agents with RAG), or smolagents (a beginner-friendly, lightweight option).
Multi-Agent Systems: Experimental frameworks like CrewAI (role-based team orchestration), AutoGen (conversation-driven multi-agent systems), and OpenAI Agents SDK (production-ready with strong OpenAI model integration).
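A framework-free sketch of the Thought → Action → Observation loop; call_llm and the single calculator tool are hypothetical, and real frameworks add planning, memory, and safer tool execution.

```python
import re

# Demo-only tool registry; eval with stripped builtins is NOT production-safe.
TOOLS = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def call_llm(transcript: str) -> str:
    """Hypothetical: returns either 'Action: tool[input]' or 'Final: answer'."""
    ...

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)          # Thought + chosen Action
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        match = re.match(r"Action: (\w+)\[(.*)\]", step)
        if match:
            tool, arg = match.groups()
            observation = TOOLS[tool](arg)   # Observation fed back to the model
            transcript += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted."
```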
Inference Optimization
Optimizing inference is key to reducing latency and cost.
Flash Attention: Cuts attention memory from quadratic to linear in sequence length by computing attention in tiles without materializing the full attention matrix; compute stays quadratic but runs much faster in practice.
Key-Value Cache Improvements: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) shrink the KV cache by sharing key/value heads across query heads, cutting memory use and bandwidth during decoding.
Speculative Decoding: A small draft model proposes several tokens that the larger model verifies in a single forward pass, increasing throughput without changing the output distribution (a simplified sketch follows).
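A simplified greedy sketch of speculative decoding; draft_propose and target_check are hypothetical model wrappers, and production implementations accept or reject draft tokens probabilistically rather than by exact match.

```python
def draft_propose(prefix: list[int], n: int) -> list[int]:
    """Hypothetical: n greedy tokens from the small draft model."""
    ...

def target_check(prefix: list[int], proposed: list[int]) -> list[int]:
    """Hypothetical: the large model's greedy token at each proposed position,
    computed in ONE batched forward pass (this is the source of the speedup)."""
    ...

def speculative_step(prefix: list[int], n_draft: int = 4) -> list[int]:
    proposed = draft_propose(prefix, n_draft)
    verified = target_check(prefix, proposed)
    accepted: list[int] = []
    for p, v in zip(proposed, verified):
        if p != v:               # first disagreement: keep the target's token
            accepted.append(v)
            break
        accepted.append(p)       # agreement: draft token accepted "for free"
    return prefix + accepted     # always advances by at least one token
```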
Deploying LLMs
Deployment strategies vary based on scale and privacy needs.
Local: Tools like LM Studio, Ollama, oobabooga, and kobold.cpp run models privately on your own hardware.
Prototyping: Gradio and Streamlit enable quick, interactive demos.
Server & Edge: Large-scale serving uses frameworks like vLLM and TGI, while MLC LLM supports mobile/web deployments.
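For the server path, vLLM's offline API is compact; the model id below is an arbitrary example, and any Hugging Face checkpoint you can host will do.

```python
from vllm import LLM, SamplingParams

# Continuous batching and PagedAttention happen under the hood.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed model; any HF id works
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```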
Securing LLMs
Security is an emerging priority in LLM systems.