top of page

Building LLM-powered Applications

Prep Time:

Cook Time:

Serves:

Level:

AI

About the Recipe

Ingredients

Preparation

Review of https://github.com/mlabonne/llm-course?utm_source=chatgpt.com


LLM Engineer Roadmap by mlabonne
LLM Engineer Roadmap by mlabonne

  1. Running LLMs

Running LLMs can be resource-intensive, but there are flexible ways to use them from APIs to local setups.

  • APIs vs Local Models: Private APIs (OpenAI, Google, Anthropic,) are fast to integrate, while open-source options (OpenRouter, Hugging Face, Together AI, etc.) allow for customization and privacy.

  • Prompt Engineering: Techniques like zero-shot, few-shot, Chain-of-Thought, and ReAct greatly influence quality.

  • Structured Outputs: Tools like Outlines and JSON schemas can guide LLM responses into clean, usable formats.


  1. Building a Vector Storage

Vector databases form the foundation of Retrieval Augmented Generation (RAG) systems.

  • Document ingestion and splitting: Use structured loaders and semantic text splitters (LangChain provides many).

  • Embeddings: Task-specific embedding models improve semantic retrieval accuracy.

  • Vector databases: Tools like  Chroma, Pinecone, Milvus, FAISS, Annoy, etc. efficiently store and search embeddings.


  1. Retrieval Augmented Generation

RAG enhances LLM responses with real-time knowledge retrieval.

  • Orchestrators: Frameworks like LangChain and LlamaIndex  streamline RAG pipelines.

  • Retrievers & Memory: Advanced retrievers (CoRAG, HyDE) and context memory systems boost relevance.

  • Evaluation: Tools like Ragas and DeepEval assess retrieval precision and answer quality.


  1. Advanced RAG

For production-grade systems, RAG can integrate structured databases, APIs, and even programmatic optimizations.

  • Query Construction: Translate user intent into SQL or graph queries.

  • Agents & Tools: Combine LLMs with external APIs and interpreters for more powerful reasoning.

  • Post-processing: Techniques like re-ranking, RAG-fusion,, and classification refine the final output.

  • DSPy: Allows programmatic prompt and weight optimization.


  1. Agents

Agents bring autonomy to LLMs — enabling them to reason, take actions, and learn from results.

  • Core Loop: Thought → Action → Observation.

  • Frameworks: LangGraph (design and visualization of workflows), LlamaIndex (data-augmented agents with RAG), or smolagents (beginner-friendly, lightweight option)

  • Multi-Agent Systems: Experimental frameworks like CrewAI (role-based team orchestration), AutoGen (conversation-driven multi-agent systems), and OpenAI Agents SDK (production-ready with strong OpenAI model integration).


  1. Inference Optimization

Optimizing inference is key to reducing latency and cost.

  • Flash Attention: Reduces attention complexity from quadratic to linear.

  • Key-Value Cache Improvements: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) make attention reuse more efficient.

  • Speculative Decoding: Smaller models draft text that larger models refine — increasing speed.


  1. Deploying LLMs

Deployment strategies vary based on scale and privacy needs.

  • Local: Tools like LM Studio, Ollama, oobabooga, kobold.cpp run models privately.

  • Prototyping: Gradio and Streamlit enable quick, interactive demos.

  • Server & Edge: Large-scale serving uses frameworks like vLLM and TGI, while MLC LLM supports mobile/web deployments.


  1. Securing LLMs

Security is an emerging priority in LLM systems.

  • Prompt Hacking: Attacks like injection, data leaks, and jailbreaks can manipulate outputs.

  • Backdoors: Training data poisoning can compromise models at the source.

  • Defensive Testing: using red teaming and checks like garak and observe them in production (with a framework like langfuse)

bottom of page