What is RAG, and Why Should You Care?
If you've worked with LLMs, you've hit the wall: the model doesn't know your data. Fine-tuning is expensive and slow. RAG (Retrieval-Augmented Generation) is the practical solution — inject relevant context directly into the prompt at inference time.
The architecture is elegant: chunk your documents, embed them, store in a vector database, and retrieve the top-k most relevant chunks when a user asks a question.
Simple in theory. The details are where it gets interesting.
Step 1: Chunking Strategy
Your first decision is how to chunk your documents. I made the mistake of using fixed-size chunks initially.
The problem: a sentence would get split mid-thought, destroying its meaning. The solution I landed on:
- Semantic chunking — split on paragraph boundaries first
- Overlap — 10–15% overlap between chunks so context isn't lost at boundaries
- Size ceiling — cap at ~500 tokens to fit within embedding model limits
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ".", " "]
)
chunks = splitter.split_text(document)
Step 2: Embedding Model Choice
I tested two approaches:
- OpenAI
text-embedding-ada-002— excellent quality, costs money sentence-transformers/all-MiniLM-L6-v2— fast, free, runs locally
For a production system handling sensitive documents, the local model was the obvious winner. Quality was ~85% of OpenAI's, which was acceptable for the use case.
Step 3: Vector Store
I used ChromaDB for local development — easy to set up, no infra required. For production, the choice depends on scale:
- ChromaDB — great for local, small-to-medium scale
- Pinecone — managed, scales easily
- Weaviate — open source, self-hostable
Step 4: Retrieval + Reranking
Basic retrieval (top-k by cosine similarity) worked, but I noticed false positives — chunks that were semantically similar but not actually relevant.
Adding a reranking step with a cross-encoder model significantly improved precision:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([[query, chunk] for chunk in retrieved_chunks])
What I'd Do Differently
- Metadata filtering — store document source, date, type as metadata and filter before embedding search
- Hybrid search — combine BM25 (keyword) with vector search for better recall
- Eval from the start — I built the system first and evaluated later. Big mistake. Define your eval metrics before building.
Final Thoughts
RAG is not magic. It's a pipeline, and every stage has failure modes. The teams shipping reliable RAG systems are the ones who treat it as an engineering problem, not a prompt engineering problem.
Start simple, measure everything, improve iteratively.