Different ways to build a RAG system, in plain English
Naive RAG, hybrid retrieval, reranking, multi-vector. The main approaches to building a RAG system — without the jargon.
If you’ve read about RAG (retrieval-augmented generation), you may have noticed that no two articles describe it the same way. “Naive RAG”, “hybrid retrieval”, “reranking”, “agentic RAG”… it’s a lot.
The good news: there are really only a handful of stacks. They’re all the same shape — retrieve, then generate — just with different levels of sophistication. Here’s a tour, in plain English.
1. Naive RAG (the starting point)
The simplest version. You chop your documents into chunks, turn each chunk into a “vector” (think: a numerical fingerprint of its meaning), and store them. When a question comes in, you turn the question into a vector too, find the chunks whose vectors are closest, and hand them to the AI as context.
When it works: clean documents, simple questions, small to medium knowledge bases.
When it breaks: when chunks split across important boundaries, when questions need exact keywords (product codes, names), or when the most relevant chunk isn’t the most similar-looking one.
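If you like seeing it in code, here is a minimal sketch of the retrieve step. The `embed()` function is a toy stand-in (it just hashes words into a vector) so the example runs on its own; in a real system you would call an actual embedding model, and the chunks would come from your own documents.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: hash words into a fixed-size vector.
    In practice you would call an embedding model (hosted or local); this just
    keeps the sketch runnable with no dependencies beyond numpy."""
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# The "index": every chunk stored alongside its vector fingerprint.
chunks = [
    "To update the firmware, open Settings and choose Check for updates.",
    "The warranty covers hardware faults for two years from purchase.",
    "Battery life depends on screen brightness and background apps.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are closest to the question's vector."""
    q = embed(question)
    scored = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# The retrieved chunks then get pasted into the prompt for the language model.
print(retrieve("How do I update the firmware?"))
```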
2. Hybrid retrieval
Pure vector search is great at meaning but bad at exact matches. Search “ACME-9000 firmware update” and a vector model might happily return general firmware articles, while missing the one document that literally contains “ACME-9000.”
Hybrid retrieval fixes this by combining two engines:
- Keyword search (think: classic search-engine matching on the literal words) for exact terms.
- Vector search for meaning and paraphrase.
Then you blend the results. In practice this gives noticeably better retrieval, especially when your data has product names, codes, acronyms, or proper nouns.
Cost: a bit more infrastructure (you need both indexes), a bit more tuning to balance the two.
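One common way to blend the two result lists is reciprocal rank fusion: each document gets credit for ranking well in either engine, and documents that both engines like float to the top. A minimal sketch, assuming the keyword and vector engines each hand you a ranked list of document IDs (the document IDs below are invented for illustration):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Blend several ranked lists of document IDs into one.
    Each document scores 1 / (k + rank) in every list it appears in, so a strong
    showing in either engine is enough to push it up the fused list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two engines for "ACME-9000 firmware update".
keyword_hits = ["doc-acme-9000-guide", "doc-firmware-faq"]
vector_hits = ["doc-firmware-faq", "doc-general-updates", "doc-acme-9000-guide"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```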
3. Reranking
Even with good retrieval, the top results aren’t always in the right order. Reranking adds a second pass: you pull, say, the top 30 candidates from your retrieval step, then run them through a more expensive model that re-orders them by how well they actually answer the question.
Think of it as: retrieval casts a wide net, the reranker picks the best fish.
Cost: an extra model call per query, so latency and cost go up. Quality usually goes up more.
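If you are curious what that second pass looks like, here is a sketch using a cross-encoder from the sentence-transformers library. The model name is just one popular public reranker, and the candidate chunks are assumed to come from whatever retrieval step you already have; any reranking model or API slots in the same way.

```python
from sentence_transformers import CrossEncoder

# One popular public reranking model; swap in whatever model or API you use.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score every (question, chunk) pair with the cross-encoder and keep the best few."""
    scores = reranker.predict([(question, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```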
4. Multi-vector / parent-document retrieval
Here’s a subtle but important trick. Vectors work best on small, focused chunks. But when you actually answer a question, you often want a larger surrounding context to be in front of the AI.
The pattern:
- Index small focused chunks (good for matching the question).
- Retrieve small chunks but return the larger parent document or section that contains them.
So you get the precision of small chunks for matching, and the richness of larger context for answering.
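The bookkeeping behind this is simple: each small chunk remembers which parent section it was cut from, and after matching you swap the chunks for their parents. A minimal sketch, with toy data standing in for a real index (the matching itself works exactly as in the naive sketch above):

```python
# Full parent sections, keyed by an ID.
parent_sections = {
    "installation": "FULL TEXT of the Installation section ...",
    "troubleshooting": "FULL TEXT of the Troubleshooting section ...",
}

# Each small chunk remembers which parent it was cut from.
chunk_to_parent = {
    "Plug in the device before running the installer.": "installation",
    "If the light blinks red, hold the reset button for ten seconds.": "troubleshooting",
}

def expand_to_parents(matched_chunks: list[str]) -> list[str]:
    """Swap matched small chunks for their full parent sections, without duplicates."""
    parent_ids: list[str] = []
    for chunk in matched_chunks:
        parent_id = chunk_to_parent[chunk]
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parent_sections[pid] for pid in parent_ids]

# After vector search returns small chunks, hand the parents to the model instead.
print(expand_to_parents(["If the light blinks red, hold the reset button for ten seconds."]))
```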
5. Graph-augmented RAG
Some kinds of knowledge are connected — people work at companies, companies own products, products have features. A pure chunk-and-search approach treats every chunk as independent, which loses those relationships.
Graph-augmented RAG stores not just chunks but also entities and the connections between them. When you ask a question, the system can traverse the graph: “what’s the latest version of the product owned by the team that ships every Tuesday?”
When it works: structured domains (research, legal, supply chain) where relationships matter as much as text.
Cost: building and maintaining the graph is significant. Often only worth it for high-value, structured use cases.
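To make the idea concrete, here is a toy sketch with the graph stored as plain dictionaries. Real systems use a proper graph database and extract entities automatically; the entities, relations, and values below are invented purely for illustration.

```python
# Entities and relationships stored explicitly, so a question can be answered by
# walking edges rather than hoping a single chunk contains the whole answer.
edges = {
    ("team-atlas", "ships_on"): "Tuesday",
    ("team-atlas", "owns"): "product-orion",
    ("product-orion", "latest_version"): "4.2",
}

def follow(entity: str, relation: str):
    """Walk one edge out of an entity, if it exists."""
    return edges.get((entity, relation))

# "What's the latest version of the product owned by the team that ships every Tuesday?"
team = next(e for (e, r), v in edges.items() if r == "ships_on" and v == "Tuesday")
product = follow(team, "owns")
print(follow(product, "latest_version"))  # -> "4.2"
```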
6. Agentic RAG
This is the “RAG meets agents” approach. Instead of doing one retrieval and answering, the system can:
- Decide whether to retrieve at all (some questions don’t need it).
- Search multiple times with refined queries if the first results aren’t useful.
- Combine information from different sources.
- Ask follow-up questions to the user when the request is ambiguous.
It’s the difference between a Q&A bot and a research assistant. Worth its own longer explanation.
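For a sense of the control flow, here is a rough sketch. `ask_llm()` and `search()` are hypothetical stubs standing in for your language model and your retriever, and the prompt format is made up; the interesting part is the loop that decides whether and how to search again.

```python
def ask_llm(prompt: str) -> str:
    return "..."  # placeholder: call your language model here

def search(query: str) -> list[str]:
    return []     # placeholder: call your retrieval step here

def answer(question: str, max_rounds: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(max_rounds):
        # 1. Let the model decide whether it needs (more) retrieval at all.
        decision = ask_llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Reply SEARCH:<query> if you need more information, or ANSWER if not."
        )
        if not decision.startswith("SEARCH:"):
            break
        # 2. Search again with the refined query and keep what comes back.
        refined_query = decision.removeprefix("SEARCH:").strip()
        evidence.extend(search(refined_query))
    # 3. Generate the final answer from everything gathered.
    return ask_llm(f"Answer using this evidence.\nQuestion: {question}\nEvidence: {evidence}")
```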
Which stack should you pick?
A rough decision tree:
- Just starting? Naive RAG. Get to a working system end-to-end first.
- Seeing misses on exact terms? Add hybrid retrieval.
- Top results often “almost right” but in the wrong order? Add a reranker.
- Documents have important structure (chapters, sections)? Try parent-document retrieval.
- Questions span relationships across entities? Consider graph-augmented.
- Questions are open-ended and need multi-step reasoning? Look at agentic RAG.
You don’t have to pick one and stick with it. Most production systems combine two or three approaches and add complexity only when the evaluation numbers say it’s worth it.
The one thing that doesn’t change
Whichever stack you pick, the iron rule of RAG is the same: measure everything. Without a labelled evaluation set, every change is a guess. The fanciest stack in the world won’t save a system that nobody is grading.
Need help putting an LLM system into production?
Get in touch