Review of https://github.com/mlabonne/llm-course

Running LLMs
Running LLMs can be resource-intensive, but there are flexible ways to use them, from hosted APIs to fully local setups.
APIs vs Local Models: Private APIs (OpenAI, Google, Anthropic) are fastest to integrate, while open-weight models, whether hosted via OpenRouter, Hugging Face, or Together AI or run locally, allow for more customization and privacy.
Prompt Engineering: Techniques like zero-shot, few-shot, Chain-of-Thought, and ReAct greatly influence quality.
Structured Outputs: Tools like Outlines and JSON schemas can guide LLM responses into clean, usable formats.
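A minimal sketch of how these prompting styles differ in the prompt itself; call_llm is a hypothetical stand-in for whichever API or local model is in use, not anything from the course.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: wire up any API client or local model here."""
    raise NotImplementedError

# Zero-shot: just ask.
zero_shot = "Classify the sentiment of this review: 'The battery died in a day.'"

# Few-shot: worked examples teach the model the expected format.
few_shot = (
    "Review: 'Great screen, fast shipping.' -> positive\n"
    "Review: 'Stopped working after a week.' -> negative\n"
    "Review: 'The battery died in a day.' ->"
)

# Chain-of-Thought: request intermediate reasoning before the final label.
cot = ("Think step by step about the review below, then give a one-word "
       "sentiment label.\nReview: 'The battery died in a day.'")

# Structured output: constrain the response to JSON, then parse it.
structured = ('Return only a JSON object of the form {"sentiment": '
              '"positive" | "negative"} for: \'The battery died in a day.\'')
# json.loads(call_llm(structured)) would parse it; validate before trusting it.
```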
Building a Vector Storage
Vector databases form the foundation of Retrieval Augmented Generation (RAG) systems.
Document ingestion and splitting: Use structured loaders and semantic text splitters (LangChain provides many).
Embeddings: Task-specific embedding models improve semantic retrieval accuracy.
Vector databases: Tools like Chroma, Pinecone, Milvus, FAISS, Annoy, etc. efficiently store and search embeddings.
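To make the flow concrete, here is a minimal ingestion-to-search sketch using sentence-transformers and FAISS; the model name and the toy documents are assumptions for illustration, not recommendations from the course.

```python
import faiss
from sentence_transformers import SentenceTransformer

docs = ["LLMs predict the next token.",
        "FAISS searches dense vectors efficiently.",
        "RAG grounds answers in retrieved text."]

# Embed the chunks; normalized vectors let inner product act as cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)

# Exact inner-product index over the embedding dimension.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Retrieve the top-2 chunks for a query.
query = model.encode(["How does retrieval grounding work?"],
                     normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([docs[i] for i in ids[0]])
```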
Retrieval Augmented Generation
RAG grounds LLM responses in knowledge retrieved at query time; a minimal end-to-end step is sketched after this list.
Orchestrators: Frameworks like LangChain and LlamaIndex streamline RAG pipelines.
Retrievers & Memory: Advanced retrievers (CoRAG, HyDE) and context memory systems boost relevance.
Evaluation: Tools like Ragas and DeepEval assess retrieval precision and answer quality.
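Stripped of any framework, a single RAG step looks roughly like this; retrieve and call_llm are hypothetical hooks into the vector index and model described above.

```python
def retrieve(question: str, k: int = 3) -> list[str]:
    """Hypothetical: top-k chunks from the vector index built earlier."""
    ...

def call_llm(prompt: str) -> str:
    """Hypothetical: any hosted or local model."""
    ...

def rag_answer(question: str) -> str:
    # Stuff retrieved chunks into the prompt and constrain the model to them.
    context = "\n\n".join(retrieve(question))
    prompt = ("Answer using ONLY the context below. "
              "If the answer is not in the context, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return call_llm(prompt)
```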
Advanced RAG
For production-grade systems, RAG can integrate structured databases, APIs, and even programmatic optimizations.
Query Construction: Translate user intent into SQL or graph queries.
Agents & Tools: Combine LLMs with external APIs and interpreters for more powerful reasoning.
Post-processing: Techniques like re-ranking, RAG-fusion, and classification refine the final output (an RRF sketch follows this list).
DSPy: Allows programmatic prompt and weight optimization.
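As a concrete post-processing example, Reciprocal Rank Fusion (the merging step behind RAG-fusion) combines the rankings produced for several paraphrases of the same query; this is the standard RRF formula, not code from the course.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Rankings from three paraphrases of the same user question:
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2"]])
# -> d2 first: it ranks highly in all three lists.
```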
Agents
Agents bring autonomy to LLMs, enabling them to reason, take actions, and learn from the results.
Core Loop: Thought → Action → Observation (a minimal version is sketched after this list).
Frameworks: LangGraph (design and visualization of workflows), LlamaIndex (data-augmented agents with RAG), or smolagents (a beginner-friendly, lightweight option).
Multi-Agent Systems: Experimental frameworks like CrewAI (role-based team orchestration), AutoGen (conversation-driven multi-agent systems), and OpenAI Agents SDK (production-ready with strong OpenAI model integration).
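A framework-free sketch of the Thought → Action → Observation loop; call_llm and the single calculator tool are hypothetical, and real frameworks add planning, memory, and safer tool execution.

```python
import re

# Demo-only tool registry; eval with stripped builtins is NOT production-safe.
TOOLS = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def call_llm(transcript: str) -> str:
    """Hypothetical: returns either 'Action: tool[input]' or 'Final: answer'."""
    ...

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)          # Thought + chosen Action
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        match = re.match(r"Action: (\w+)\[(.*)\]", step)
        if match:
            tool, arg = match.groups()
            observation = TOOLS[tool](arg)   # Observation fed back to the model
            transcript += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted."
```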
Inference Optimization
Optimizing inference is key to reducing latency and cost.
Flash Attention: Cuts attention memory from quadratic to linear in sequence length by computing attention in tiles without materializing the full attention matrix; compute stays quadratic but runs much faster in practice.
Key-Value Cache Improvements: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) shrink the KV cache by sharing key/value heads across query heads, cutting memory use and bandwidth during decoding.
Speculative Decoding: A small draft model proposes several tokens that the larger model verifies in a single forward pass, increasing throughput without changing the output distribution (a simplified sketch follows).
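A simplified greedy sketch of speculative decoding; draft_propose and target_check are hypothetical model wrappers, and production implementations accept or reject draft tokens probabilistically rather than by exact match.

```python
def draft_propose(prefix: list[int], n: int) -> list[int]:
    """Hypothetical: n greedy tokens from the small draft model."""
    ...

def target_check(prefix: list[int], proposed: list[int]) -> list[int]:
    """Hypothetical: the large model's greedy token at each proposed position,
    computed in ONE batched forward pass (this is the source of the speedup)."""
    ...

def speculative_step(prefix: list[int], n_draft: int = 4) -> list[int]:
    proposed = draft_propose(prefix, n_draft)
    verified = target_check(prefix, proposed)
    accepted: list[int] = []
    for p, v in zip(proposed, verified):
        if p != v:               # first disagreement: keep the target's token
            accepted.append(v)
            break
        accepted.append(p)       # agreement: draft token accepted "for free"
    return prefix + accepted     # always advances by at least one token
```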
Deploying LLMs
Deployment strategies vary based on scale and privacy needs.
Local: Tools like LM Studio, Ollama, oobabooga, and kobold.cpp run models privately on your own hardware.
Prototyping: Gradio and Streamlit enable quick, interactive demos.
Server & Edge: Large-scale serving uses frameworks like vLLM and TGI, while MLC LLM supports mobile/web deployments.
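For the server path, vLLM's offline API is compact; the model id below is an arbitrary example, and any Hugging Face checkpoint you can host will do.

```python
from vllm import LLM, SamplingParams

# Continuous batching and PagedAttention happen under the hood.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed model; any HF id works
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```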
Securing LLMs
Security is an emerging priority in LLM systems.