Retrieval-Augmented Generation (RAG)
What Is Retrieval-Augmented Generation?
Retrieval-Augmented Generation (RAG) is an AI architecture that combines a retrieval system with a large language model, giving the model access to external knowledge at query time rather than relying solely on what it memorized during training. Instead of asking “what do you know about X?”, a RAG system asks “here are the relevant documents about X - now answer the question.”
The result: AI products that can answer questions about your proprietary data, recent events, or specialized knowledge bases - without fine-tuning, and with far fewer hallucinations, because answers are grounded in retrieved sources.
How RAG Works
A typical RAG pipeline has two phases:
Ingestion (offline, rerun when documents change):
- Split your documents into chunks (e.g., 500-token segments with overlap)
- Embed each chunk using an embedding model
- Store the vectors in a vector database alongside the original text
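The ingestion steps above can be sketched in a few lines. This is a minimal, self-contained illustration: `embed()` is a toy bag-of-words stand-in for a real embedding model, and the "vector database" is just an in-memory list - both are assumptions made so the example runs without external services.

```python
# Ingestion sketch: chunk documents, embed each chunk, store vector + text.
from collections import Counter

def chunk(text, size=50, overlap=10):
    """Split text into word-based chunks with overlap.

    Real pipelines chunk by tokens (e.g., 500 tokens) using the
    model's tokenizer; words are a rough proxy here.
    """
    words = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

def embed(text):
    """Toy embedding: lowercase word counts. Real systems call an embedding model."""
    return Counter(text.lower().split())

def ingest(documents):
    """Build an in-memory 'vector store': a list of (vector, original_text) pairs."""
    store = []
    for doc in documents:
        for c in chunk(doc):
            store.append((embed(c), c))
    return store
```

Note how each stored entry keeps the original text next to its vector - retrieval searches the vectors, but it is the text that gets injected into the prompt.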
Query (real-time):
- Embed the user’s question
- Search the vector database for the top-k most similar chunks
- Inject those chunks into the LLM prompt as context
- The LLM generates an answer grounded in the retrieved material
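The query phase can be sketched the same way. Again this is a toy, self-contained illustration: the bag-of-words `embed()` and the tiny inline store are assumptions; production systems use a real embedding model plus a vector database with approximate nearest-neighbor search, and the final prompt would be sent to an LLM.

```python
# Query sketch: embed the question, rank stored chunks by cosine
# similarity, take the top-k, and inject them into the prompt.
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, store, k=2):
    """Return the top-k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, chunks):
    """Inject retrieved chunks into the LLM prompt as grounding context."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

store = [(embed(t), t) for t in [
    "The refund policy allows returns within 30 days.",
    "Shipping takes 5 business days within the US.",
    "Support is available by email 24/7.",
]]
chunks = retrieve("What is the refund policy?", store, k=1)
prompt = build_prompt("What is the refund policy?", chunks)
```

The grounding happens in `build_prompt`: the model is asked to answer from the retrieved text rather than from whatever it memorized during training.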
RAG vs Fine-Tuning
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Best for | Injecting knowledge | Changing model behavior/style |
| Data freshness | Easy to update | Requires retraining |
| Cost | Low setup, per-query retrieval | High upfront training cost |
| Hallucination | Reduced (grounded in sources) | Still possible |
| Citations | Easy to provide | Not inherent |
Most AI products that need to answer questions from a knowledge base should start with RAG. Fine-tuning is for problems that RAG can’t solve - like changing the model’s tone, making it refuse certain topics, or teaching it specialized terminology.
Common RAG Failure Modes
- Chunking too large or too small: Chunks that are too large dilute the prompt with irrelevant text; chunks that are too small strip a passage of the surrounding context needed to interpret it
- Poor retrieval: The right answer exists in the database but doesn’t rank in the top-k retrieved chunks
- Conflicting information: Multiple chunks say different things; the model can’t reconcile them
- Context stuffing: Too many chunks exceed the context window or dilute the relevant signal
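A simple guard against the last failure mode is a context budget: add retrieved chunks in relevance order only while they fit. This sketch approximates token count by word count and uses an assumed 3,000-token budget; a real system would count with the model's own tokenizer.

```python
# Context-budget guard against "context stuffing": keep the most relevant
# chunks that fit, drop the rest rather than overflow the context window.
def fit_to_budget(chunks, max_tokens=3000):
    """chunks is assumed sorted by relevance, best first."""
    kept, used = [], 0
    for c in chunks:
        cost = len(c.split())  # crude proxy; use a real tokenizer in practice
        if used + cost > max_tokens:
            break
        kept.append(c)
        used += cost
    return kept
```

Because the chunks arrive best-first, cutting from the tail sacrifices the least relevant material - which also keeps the signal-to-noise ratio of the prompt high.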
Key Takeaway
RAG is the most practical way to build AI products on top of proprietary or domain-specific knowledge. It’s faster to implement than fine-tuning, cheaper, and keeps your knowledge base up-to-date without retraining. Master RAG before considering fine-tuning - the vast majority of AI product use cases don’t require anything more.