How to Build a RAG Pipeline
A step-by-step guide to building a Retrieval-Augmented Generation system: chunking, embeddings, vector databases, retrieval, and evaluation.
Prepare and Chunk Your Documents
Document chunking is the most consequential decision in a RAG pipeline. The model can only answer well if the retrieval step surfaces the right chunks - and retrieval quality depends almost entirely on how you split your documents.
Why chunking matters: An embedding model converts a chunk of text into a single vector. If a chunk contains two unrelated topics, the vector is a noisy average of both - it will retrieve poorly for queries about either topic. If chunks are too small, the retrieved context lacks the surrounding information the model needs to answer fully.
Chunking strategies:
| Strategy | Best for | Chunk size | Trade-off |
|---|---|---|---|
| Fixed-size | Simple prose, mixed documents | 256–512 tokens | Fast; may split sentences mid-thought |
| Sentence-level | FAQs, Q&A pairs, structured lists | 1–3 sentences | High precision; very short context |
| Paragraph-level | Articles, reports, documentation | 150–400 tokens | Natural boundaries; variable size |
| Section-level | Technical docs, book chapters | 500–1,500 tokens | Rich context; diluted embedding signal |
| Recursive | Any structured document | Configurable | Best general-purpose; LangChain's default splitter |
Chunk overlap: Add 10–20% overlap between adjacent chunks (e.g., last 50 tokens of chunk N become the first 50 tokens of chunk N+1). This preserves context at boundaries and is critical for documents where a sentence at the end of one chunk provides essential context for the beginning of the next.
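The overlap rule above can be sketched in a few lines. This is a minimal illustration that approximates tokens with whitespace splitting; a production pipeline would use the embedding model's actual tokenizer (e.g. tiktoken), and the function name and defaults here are illustrative, not from any library.

```python
def chunk_with_overlap(text, chunk_size=400, overlap=50):
    """Split text into fixed-size chunks with overlapping boundaries.

    Whitespace tokens stand in for real tokenizer tokens; with
    chunk_size=400 and overlap=50, the last 50 tokens of each chunk
    are repeated at the start of the next (12.5% overlap).
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Note the early `break`: without it, the final window would produce a tiny trailing chunk consisting almost entirely of overlap.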
Metadata enrichment: Store metadata alongside each chunk - document title, section heading, page number, URL, last updated date. This enables metadata filtering during retrieval (“only return chunks from documents updated in the last 90 days”) and provides source citations in the model’s answer.
Document cleaning: Strip HTML tags, headers/footers, page numbers, and boilerplate before chunking. Noise in the input text degrades embedding quality for the entire chunk.
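As a sketch of the HTML-stripping step, the standard library's `html.parser` can extract text content without a third-party dependency. This minimal version keeps all text nodes; a real cleaner would also drop `<script>`/`<style>` content, repeated headers/footers, and boilerplate.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw):
    stripper = TagStripper()
    stripper.feed(raw)
    # Collapse the whitespace left behind by removed tags
    return " ".join("".join(stripper.parts).split())
```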
Generate Embeddings
An embedding model converts text into a dense vector (typically 768–3,072 dimensions) that encodes semantic meaning. Texts with similar meanings produce vectors that are close together in this high-dimensional space - this is what enables similarity search.
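"Close together" is usually measured with cosine similarity: the cosine of the angle between two vectors, which is 1.0 for identical directions and near 0 for unrelated ones. A minimal pure-Python version (the toy 3-dimensional vectors in the test are for illustration only; real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors.

    1.0 = same direction (semantically similar),
    0.0 = orthogonal (unrelated).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```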
Embedding model options:
| Model | Dimensions | Context | Cost | Best for |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3,072 | 8,191 tokens | $0.13/1M tokens | Highest quality; general purpose |
| OpenAI text-embedding-3-small | 1,536 | 8,191 tokens | $0.02/1M tokens | 90% of large quality at 15% cost |
| Cohere Embed v3 | 1,024 | 512 tokens | $0.10/1M tokens | Strong multilingual support |
| BGE-M3 (open-source) | 1,024 | 8,192 tokens | Free (self-hosted) | Best open-source; multilingual |
| E5-large-v2 (open-source) | 1,024 | 512 tokens | Free (self-hosted) | Strong on retrieval tasks |
Embedding cost calculation: For 1 million chunks at an average of 400 tokens each, that is 400 million tokens to embed. At OpenAI text-embedding-3-small pricing ($0.02/1M tokens), the full indexing cost is $8.00. Re-embedding after a document update costs only the tokens in the changed documents.
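The cost arithmetic is simple enough to keep as a helper so you can re-run it when chunk counts or prices change:

```python
def indexing_cost_usd(num_chunks, avg_tokens_per_chunk, price_per_million_tokens):
    """One-time cost to embed an entire corpus, in USD."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# The worked example above: 1M chunks x 400 tokens at $0.02 per 1M tokens
cost = indexing_cost_usd(1_000_000, 400, 0.02)  # $8.00
```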
Critical rule: Never mix embeddings from different models in the same index. Vectors from different models live in incompatible spaces - similarity scores between them are meaningless. If you switch embedding models, you must regenerate all embeddings from scratch.
Store in a Vector Database
A vector database stores your embeddings and enables approximate nearest-neighbor (ANN) search - finding the k most similar vectors to a query vector in milliseconds, even across millions of records.
Vector database comparison:
| Database | Hosting | Scale | Hybrid search | Best for |
|---|---|---|---|---|
| Pinecone | Managed only | 100M+ vectors | Yes (sparse+dense) | Fastest path to production |
| pgvector (Supabase/RDS) | Managed or self-hosted | <10M vectors | Limited | Existing Postgres users |
| Weaviate | Managed or self-hosted | 100M+ vectors | Yes | Complex filtering, multi-tenancy |
| Qdrant | Managed or self-hosted | 100M+ vectors | Yes | Open-source, advanced filtering |
| Chroma | Self-hosted | <1M vectors | No | Local development, prototyping |
| Milvus | Self-hosted | Billions | Yes | Enterprise scale |
Recommended path for a startup:
- Prototype (0–100K chunks): Chroma locally, or pgvector on Supabase (free tier)
- Early production (100K–5M chunks): Pinecone Starter or pgvector on Supabase Pro
- Scaled production (5M+ chunks): Pinecone, Weaviate, or Qdrant
Schema design: Each record in the vector index should contain:
- id: unique identifier for the chunk
- embedding: the vector (float array)
- text: the original chunk text (for passing to the LLM as context)
- metadata: source document ID, title, URL, section, page, last updated date
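A concrete record following this schema might look like the dict below. All values are hypothetical, and the embedding is truncated to three numbers for readability; a real vector has 768–3,072 dimensions.

```python
# Hypothetical record ready for upsert into a vector index.
record = {
    "id": "doc-42#chunk-7",               # unique per chunk
    "embedding": [0.012, -0.034, 0.051],  # truncated for illustration
    "text": "Refunds are processed within 5 business days of approval.",
    "metadata": {
        "doc_id": "doc-42",
        "title": "Billing FAQ",
        "url": "https://example.com/billing-faq",
        "section": "Refunds",
        "page": 3,
        "updated_at": "2024-11-02",
    },
}
```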
Build the Retrieval Layer
The retrieval layer takes a user query, converts it to an embedding, and returns the most relevant chunks. This is where most RAG pipelines fail - not at the LLM step.
Basic retrieval pipeline:
- Embed the user query using the same model used to embed documents
- Run ANN search against the vector index for top-k results (k=5–20 is typical)
- Apply metadata filters if relevant (e.g., restrict to a specific document set or date range)
- Return chunks with their text and source metadata
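The four steps above can be sketched as one function. This brute-force version scores every record with cosine similarity so it stays self-contained; a real deployment would hand steps 2–3 to the vector database's ANN query with a filter expression, and would call the indexing-time embedding model to produce `query_embedding`.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def retrieve(query_embedding, index, k=5, metadata_filter=None):
    """Filter, score, and return the top-k records.

    `index` is a list of records shaped like the schema above;
    `metadata_filter` is an optional predicate over record metadata.
    """
    candidates = index
    if metadata_filter is not None:
        candidates = [r for r in index if metadata_filter(r["metadata"])]
    ranked = sorted(candidates,
                    key=lambda r: cosine(query_embedding, r["embedding"]),
                    reverse=True)
    return ranked[:k]
```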
Beyond basic retrieval - improving recall:
| Technique | What it does | When to use |
|---|---|---|
| Hybrid search | Combines dense (semantic) + sparse (BM25 keyword) retrieval | Queries with specific terms, names, codes |
| Query expansion | LLM generates 3–5 related queries; results are union of all | Short or ambiguous user queries |
| HyDE (Hypothetical Document Embeddings) | LLM generates a hypothetical answer; embed that for retrieval | When query phrasing differs from document phrasing |
| Re-ranking | Cross-encoder model re-scores top-k results for relevance | When precision matters more than speed |
| Parent-child chunking | Retrieve small child chunks; return larger parent chunk as context | Balances embedding precision with context richness |
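Hybrid search needs a way to merge the dense and sparse result lists; a common choice is Reciprocal Rank Fusion (RRF), which needs only the two rankings, not their raw scores. A minimal sketch:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of chunk IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) across the lists it appears in,
    so IDs ranked highly by multiple retrievers rise to the top.
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```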
Retrieval parameters:
- k (top-k): Start with k=5. Increase to k=10–20 if recall@k is low but watch for context window limits.
- Similarity threshold: Filter out chunks with a similarity score below ~0.75 to avoid retrieving loosely related noise; the right cutoff varies by embedding model and similarity metric, so tune it on your own data.
- MMR (Maximal Marginal Relevance): Penalizes redundant chunks - useful when multiple chunks say the same thing and take up context window space.
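MMR can be implemented as a greedy loop: at each step, pick the candidate with the best trade-off between query relevance and redundancy with what is already selected. A sketch, assuming you have precomputed query-chunk and chunk-chunk similarities (the data structures here are illustrative, not a library API):

```python
def mmr_select(candidates, query_sim, pair_sim, k=5, lambda_=0.7):
    """Greedy Maximal Marginal Relevance over candidate chunk IDs.

    query_sim[d]           : similarity of chunk d to the query
    pair_sim[frozenset((a, b))] : similarity between chunks a and b
    lambda_ trades relevance (1.0) against diversity (0.0).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            redundancy = max((pair_sim[frozenset((d, s))] for s in selected),
                             default=0.0)
            return lambda_ * query_sim[d] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In the test below, chunks "a" and "b" are near-duplicates, so MMR skips "b" in favor of the less relevant but non-redundant "c".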
Connect to Your LLM
Once you have retrieved the relevant chunks, construct a prompt that grounds the model's answer in those chunks and reduces hallucination by instructing the model not to answer beyond the retrieved context.
Standard RAG prompt structure:
System: You are a helpful assistant. Answer the user's question using only
the provided context. If the context does not contain enough information
to answer the question, say "I don't have enough information to answer that."
Do not use any knowledge outside the provided context.
Context:
[CHUNK 1] Source: {source_1} | {chunk_text_1}
[CHUNK 2] Source: {source_2} | {chunk_text_2}
...
User: {user_query}
Citation tracking: Assign a reference ID to each retrieved chunk and instruct the model to cite sources in its answer (e.g., “[1]”, “[2]”). Parse the citations from the response and map them back to source documents. This enables:
- Debugging: trace any hallucination to the exact retrieval failure that caused it
- User trust: show users where each claim comes from
- Quality monitoring: measure what fraction of claims are actually grounded in retrieved chunks
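Parsing the bracketed markers back out of the answer is a small regex job. This sketch assumes the "[1]", "[2]" citation format described above, with marker [n] referring to the n-th retrieved chunk (1-indexed):

```python
import re

def extract_citations(answer, sources):
    """Map markers like [1] in the answer back to source records.

    `sources` is the ordered list of retrieved chunks placed in the
    prompt; out-of-range markers are ignored. Returns the cited
    sources in ascending marker order, deduplicated.
    """
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    return [sources[i - 1] for i in cited if 0 < i <= len(sources)]
```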
Context window management: If your retrieved chunks exceed the model’s context window, apply a truncation strategy:
- Prioritize by similarity score (keep the highest-scoring chunks)
- Summarize lower-priority chunks to fit more sources in the window
- Use a model with a larger context window for document-heavy queries
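The first strategy, keeping the highest-scoring chunks that fit the budget, can be sketched as follows. Token counts are approximated with whitespace splitting here; in production, count with the model's own tokenizer.

```python
def fit_to_budget(chunks, max_tokens):
    """Keep the highest-scoring chunks that fit the context budget.

    `chunks` is a list of (similarity_score, text) pairs; chunks are
    admitted greedily in descending score order.
    """
    kept, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        n = len(text.split())  # rough token count
        if used + n <= max_tokens:
            kept.append((score, text))
            used += n
    return kept
```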
Evaluate and Iterate
RAG pipeline evaluation requires measuring retrieval quality and generation quality separately. Mixing them into a single end-to-end accuracy score makes it impossible to identify which component is failing.
Retrieval metrics:
| Metric | Formula | Target |
|---|---|---|
| Recall@k | % of queries where the relevant chunk appears in top-k | >85% at k=5 |
| Precision@k | % of top-k results that are actually relevant | >60% at k=5 |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | >0.7 |
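Both retrieval metrics are a few lines to compute, assuming a labeled test set that maps each query to its single relevant chunk ID (queries with multiple relevant chunks need a slightly generalized version):

```python
def recall_at_k(results, relevant, k=5):
    """Fraction of queries whose relevant chunk appears in the top-k.

    results[q]  : ranked list of retrieved chunk IDs for query q
    relevant[q] : ID of the labeled relevant chunk for query q
    """
    hits = sum(1 for q in relevant if relevant[q] in results[q][:k])
    return hits / len(relevant)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant result (0 if absent)."""
    total = 0.0
    for q in relevant:
        ranked = results[q]
        if relevant[q] in ranked:
            total += 1.0 / (ranked.index(relevant[q]) + 1)
    return total / len(relevant)
```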
Generation metrics:
| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | % of claims in answer that are grounded in retrieved context | >90% |
| Answer relevance | Does the answer actually address the question? | >85% |
| Context utilization | % of retrieved chunks actually used in the answer | >50% |
Evaluation frameworks: RAGAS (open-source) automates all of these metrics using an LLM-as-judge approach. It requires a test set of (query, relevant document, expected answer) triples - start with 50–100 manually labeled examples.
Iteration priority:
- If recall@k < 80%: fix chunking, add hybrid search, or try query expansion
- If faithfulness < 85%: tighten the system prompt, reduce k, or add a grounding check step
- If answer relevance is low: improve query understanding (classify intent, expand ambiguous queries)
- If latency is high: reduce k, cache frequent queries, or switch to a faster embedding model
Key Takeaway
A RAG pipeline is not a single model choice - it is a system with six components, each of which can independently cause failures. The highest-leverage investment is chunking strategy and retrieval quality: if the wrong chunks are retrieved, no amount of prompt engineering will produce a correct answer. Start with a simple pipeline (fixed-size chunking, OpenAI embeddings, pgvector, top-5 retrieval), measure recall@k and faithfulness on a labeled test set before shipping, and add complexity - hybrid search, re-ranking, query expansion - only where the metrics show a specific gap.