From RAG demo to production: the checklist nobody gave you
Most RAG systems look great in a notebook and embarrassing in production. Here is the gap nobody warns you about, and the checklist that closes it.
A working RAG demo and a working RAG product are two completely different artifacts. The demo answers ten cherry-picked questions in a Jupyter notebook. The product gets thousands of real questions a day from people who don’t know what RAG is and don’t care about your retrieval recall. The gap between them is where most projects quietly fail.
This is the short list of things that have to be solved before a RAG system is genuinely production-ready. Not “checked off in a Notion doc” — solved.
1. You need a real evaluation set, not vibes
Eyeballing the answers your assistant gives is not evaluation. It is the most expensive form of QA ever invented, and it falls apart the moment anything changes in the system.
You need a labelled evaluation set with, at minimum:
- Inputs: real or realistic user questions, including the messy ones.
- Expected behaviours: the kind of answer that’s correct — not necessarily an exact string, but a rubric you or a judge model can check against.
- Edge cases: out-of-scope questions, ambiguous ones, ones that should refuse to answer.
A few hundred items is plenty to start. The point is that when you change the chunking strategy, the embedding model, or the prompt, you get a number back — and you can decide whether the number went the right way.
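As a starting point, here is a minimal sketch of what an eval item and the harness around it can look like. `answer_fn` and `judge_fn` are stand-ins for your own pipeline and your rubric check (human or judge model), and the example questions are invented:

```python
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    question: str                       # real or realistic user input, messy ones included
    rubric: str                         # what a correct answer has to contain or do
    expected_behaviour: str = "answer"  # "answer", "refuse", or "clarify"
    tags: list[str] = field(default_factory=list)

EVAL_SET = [
    EvalItem("How do I rotate an API key?",
             rubric="Points to the account settings page and mentions the propagation delay."),
    EvalItem("What do you think of our competitor's pricing?",
             rubric="Declines politely: the question is out of scope for a docs assistant.",
             expected_behaviour="refuse",
             tags=["out-of-scope"]),
]

def run_eval(eval_set, answer_fn, judge_fn):
    """answer_fn: question -> answer text; judge_fn: (answer, item) -> bool."""
    passed = [judge_fn(answer_fn(item.question), item) for item in eval_set]
    return sum(passed) / len(passed)   # one number you can compare across changes
```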
2. Separate retrieval quality from generation quality
The single most common debugging mistake in RAG is staring at a bad answer and assuming the model is dumb. Half the time the model is fine and retrieval handed it the wrong documents.
Measure them separately:
- Retrieval metrics: recall@k, MRR, and “does the right document appear in the top N” for your eval set.
- Generation metrics: faithfulness (does the answer actually reflect the retrieved context?), answer relevance, and refusal-when-appropriate.
If retrieval recall is at 60%, no prompt engineering will save you. Fix retrieval first.
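The retrieval side needs no framework. A sketch, assuming each eval item carries the ids of the documents that should have been retrieved and `retrieve_fn` is whatever your stack exposes as ranked search:

```python
def hit_at_k(relevant_ids, retrieved_ids, k=5):
    """Did any relevant document appear in the top k results?"""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])

def reciprocal_rank(relevant_ids, retrieved_ids):
    """1/rank of the first relevant document, 0.0 if it never shows up."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def retrieval_report(eval_items, retrieve_fn, k=5):
    """eval_items: list of (query, set_of_relevant_doc_ids); retrieve_fn: query -> ranked doc ids."""
    hits, rr = [], []
    for query, relevant in eval_items:
        retrieved = retrieve_fn(query)
        hits.append(hit_at_k(relevant, retrieved, k))
        rr.append(reciprocal_rank(relevant, retrieved))
    return {f"recall@{k}": sum(hits) / len(hits), "mrr": sum(rr) / len(rr)}
```

Generation-side metrics usually need a judge model or a human rubric; the point is that the two reports stay separate, so you know which half of the system to fix.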
3. Chunking is a product decision, not a default
Most teams pick a default chunk size from a tutorial and move on. That default is almost always wrong for the actual documents they have.
Things that matter:
- Semantic boundaries: chunking mid-sentence or mid-table destroys meaning.
- Document structure: headings, lists, and code blocks should travel with their surrounding context.
- Overlap: a small amount usually helps recall; too much wastes tokens and hurts precision.
Run the same evaluation set with three different chunking strategies before you commit. The right number is data-dependent, not universal.
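A sketch of one candidate strategy and the comparison loop. `chunk_fixed_512`, `chunk_by_heading`, and `build_index` are hypothetical stand-ins, and `retrieval_report` is the retrieval metric from earlier:

```python
def chunk_by_paragraph(text, max_chars=1200, overlap=200):
    """Naive chunker that keeps paragraph boundaries intact and carries a small overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = current[-overlap:]          # tail of the previous chunk as overlap
        current = (current + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks

# Same eval set, same metric, three strategies; let the numbers decide:
# for name, chunker in [("fixed-512", chunk_fixed_512),
#                       ("paragraph", chunk_by_paragraph),
#                       ("heading-aware", chunk_by_heading)]:
#     index = build_index(chunker(doc) for doc in corpus)
#     print(name, retrieval_report(eval_items, index.search))
```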
4. Hybrid retrieval is almost always worth it
Pure vector search misses exact matches: product codes, names, version numbers, acronyms. Pure keyword search misses paraphrase and intent. Combining them — BM25 plus dense vectors, with a reranker on top — consistently outperforms either alone on real-world content.
Don’t over-engineer it. Start with BM25 + dense + a small cross-encoder reranker on the top 30 results. Measure. Tune from there.
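The fusion step itself is small. A sketch using reciprocal rank fusion, with `bm25_search`, `dense_search`, and `rerank_score` standing in for your actual retrievers and cross-encoder:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc ids by summing 1/(k + rank); a standard, tuning-free fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, bm25_search, dense_search, rerank_score, candidates=30, final_k=5):
    """Fuse keyword and vector results, then let a cross-encoder re-order the top candidates."""
    fused = reciprocal_rank_fusion([bm25_search(query), dense_search(query)])[:candidates]
    reranked = sorted(fused, key=lambda doc_id: rerank_score(query, doc_id), reverse=True)
    return reranked[:final_k]
```

Rank fusion sidesteps the fact that BM25 scores and cosine similarities live on different scales, which is why it is a sensible first choice before anything fancier.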
5. Latency is a product feature, not an afterthought
If your assistant takes nine seconds to answer, users will leave before they ever see the quality. Concretely, watch:
- Time to first token for the streamed answer.
- End-to-end latency at p50, p95, p99 — the tail matters.
- Per-stage breakdown — retrieval, rerank, generation. You can’t optimise what you can’t see.
Caching identical queries, running the keyword and vector retrievers in parallel, and routing easy queries to a smaller generation model are all standard moves.
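None of that tuning is possible without the per-stage numbers. A sketch of the instrumentation using only the standard library; `retrieve`, `rerank`, and `generate` in the usage comment stand in for your own pipeline functions:

```python
import math
import time
from contextlib import contextmanager

timings: dict[str, list[float]] = {}

@contextmanager
def stage(name):
    """Record wall-clock seconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(name, []).append(time.perf_counter() - start)

def percentile(values, p):
    """Nearest-rank percentile: good enough for a latency dashboard."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(p / 100 * len(ordered)) - 1)]

# Inside the request handler:
# with stage("retrieval"):  docs = retrieve(query)
# with stage("rerank"):     docs = rerank(query, docs)
# with stage("generation"): answer = generate(query, docs)
# print({name: round(percentile(vals, 95), 3) for name, vals in timings.items()})
```

In production you would push these measurements to your metrics backend rather than a module-level dict, but the breakdown is the part that matters.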
6. Observability isn’t optional
The day after you ship, someone will ask: “why did it answer this way?” If your only answer is “I don’t know,” you have a problem that will compound.
At minimum, log:
- The full prompt sent to the model, including retrieved context.
- The retrieval candidates and their scores.
- The model’s response and any tool calls.
- A trace ID linking it all to the user’s session.
LangSmith, LangFuse, and LangWatch all do this. Pick one, wire it in on day one, and don’t argue about which is best until you actually have a month of traces.
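If you wire it yourself before (or alongside) one of those tools, the record itself is small. In this sketch, `sink` is a placeholder for wherever the trace ends up: a log file, a queue, or a tracing backend's ingest endpoint:

```python
import json
import time
import uuid

def log_trace(*, session_id, query, prompt, candidates, response, tool_calls=None, sink=print):
    """Emit one structured record per request; `candidates` is a list of (doc_id, score) pairs."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "session_id": session_id,          # links the trace back to the user's session
        "timestamp": time.time(),
        "query": query,
        "prompt": prompt,                  # the full prompt, retrieved context included
        "retrieval": [{"doc_id": d, "score": s} for d, s in candidates],
        "response": response,
        "tool_calls": tool_calls or [],
    }
    sink(json.dumps(record, default=str))
```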
7. Refusals are a feature
A RAG system that confidently makes things up is worse than one that says “I don’t know.” Out-of-domain questions, ambiguous ones, and ones where retrieval comes back empty should produce explicit refusals or clarifying questions — not hallucinated answers.
Add these cases to your evaluation set explicitly: refusal behaviour is something you measure, not something you hope for.
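Two small pieces make that concrete: a guard that forces the refusal path when retrieval comes back empty or weak, and a cheap check the eval harness can run (a judge model is more robust, but string markers catch regressions for free). The 0.3 threshold and the marker phrases are illustrative, not recommendations:

```python
REFUSAL_MARKERS = ("i don't know", "i can't answer that", "not covered in the documentation")

def should_refuse(candidates, min_score=0.3):
    """Route to a refusal or clarifying question when retrieval is empty or weak."""
    return not candidates or max(score for _, score in candidates) < min_score

def looks_like_refusal(answer: str) -> bool:
    """Crude marker check used by the eval harness."""
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

# In the eval loop: items expected to refuse pass only if looks_like_refusal(answer) is True,
# and items expected to answer fail if the system refuses anyway.
```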
8. Plan for the model upgrade you haven’t done yet
Frontier models change every few months. Embedding models change too. Anything you build that hard-codes a model version, a token limit, or a specific embedding dimension will rot.
- Abstract the model behind a thin interface.
- Version your embeddings — the next model will have a different dimension.
- Keep your evaluation set so you can re-run it after an upgrade and quantify the change.
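A sketch of what the thin interface and versioned embeddings can look like in practice; the Protocol shapes and the naming scheme are illustrative, not a prescription:

```python
from typing import Protocol

class Generator(Protocol):
    """Application code calls this; only one adapter per vendor knows about the SDK."""
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class Embedder(Protocol):
    model_id: str     # the provider's model name
    dimension: int    # stored alongside the vectors so indexes can't be silently mixed
    def embed(self, texts: list[str]) -> list[list[float]]: ...

def collection_name(base: str, embedder: Embedder) -> str:
    """Namespace the vector index by embedding model and dimension: an upgrade means
    re-embedding into a fresh collection, never mutating the old one in place."""
    return f"{base}__{embedder.model_id}__{embedder.dimension}d"
```

Keeping the old collection around until the new one passes the same evaluation set makes the upgrade reversible.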
The shortest version
If you can’t answer these three questions, you are not in production:
- How would you know if quality regressed?
- Where is retrieval failing today?
- What does a bad answer look like, and how often does it happen?
Everything above is the long version of getting good answers to those three.
Need help putting an LLM system into production?
Get in touch