Shipping RAG to production without the 2am pages

Retrieval-augmented generation demos beautifully and breaks quietly in production. Here is the architecture we reach for to make RAG boring, in the best possible way.
Start with evaluation, not retrieval
The first thing we build on a RAG project isn't the pipeline; it's the way we measure it. Without a golden set of questions and expected answers, "is it good?" becomes an argument instead of a number. We assemble a few dozen representative queries early and score every change against them.
That harness turns vague worries into a dashboard. When retrieval quality drops after a chunking change, we see it the same day instead of in a support ticket.
If you can't measure a retrieval change, you're not improving the system; you're rearranging it.
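A minimal sketch of such a harness, assuming retrieval is scored by whether an expected source document shows up in the top k results. The names `GoldenCase`, `hit_rate_at_k`, and `toy_retrieve`, along with the sample questions and document ids, are hypothetical and stand in for whatever retriever and golden set a project actually has. A fuller harness would likely score the generated answers as well.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    question: str
    expected_source: str  # id of the document that should be retrieved


def hit_rate_at_k(
    cases: list[GoldenCase],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of golden questions whose expected source is in the top k."""
    if not cases:
        raise ValueError("golden set is empty")
    hits = sum(
        1 for case in cases if case.expected_source in retrieve(case.question, k)
    )
    return hits / len(cases)


# Toy keyword index so the sketch runs end to end (hypothetical data).
TOY_INDEX = {
    "refund": ["billing-faq", "terms"],
    "password": ["account-help"],
}


def toy_retrieve(question: str, k: int) -> list[str]:
    """Stand-in retriever: return doc ids whose keyword appears in the question."""
    results: list[str] = []
    for keyword, doc_ids in TOY_INDEX.items():
        if keyword in question.lower():
            results.extend(doc_ids)
    return results[:k]


golden_set = [
    GoldenCase("How do I get a refund?", "billing-faq"),
    GoldenCase("How do I reset my password?", "account-help"),
    GoldenCase("Where is my invoice?", "billing-faq"),  # the toy index misses this
]

score = hit_rate_at_k(golden_set, toy_retrieve, k=5)
print(f"hit rate @5: {score:.2f}")
```

Running the same function before and after a chunking or retrieval change gives one number to compare, which is the point of building the harness first.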
Chunking is a product decision
How you split documents shapes everything downstream. Too large and you bury the answer in noise; too small and you lose the context that makes it meaningful. We tune chunk size and overlap against the eval set rather than guessing, and we keep structural metadata (section, source, recency) alongside every chunk.
- Preserve headings and hierarchy so a chunk knows where it came from.
- Attach metadata you can filter on at query time.
- Re-rank candidates before they reach the model: retrieve broadly for recall first, then re-rank for precision.
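The chunking and metadata points above can be sketched as follows. This is an illustration under stated assumptions, not a recommended configuration: it splits on whitespace-separated words rather than model tokens, and the `Chunk` fields, the `policy.md` source, the date, and the tiny `size` and `overlap` values are all hypothetical. In practice the size and overlap would be tuned against the eval set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    source: str   # where the chunk came from
    section: str  # heading the chunk sat under
    updated: str  # ISO date, kept so recency can be filtered at query time


def split_words(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the section
    return windows


def chunk_document(
    sections: list[tuple[str, str]],  # (heading, body) pairs
    source: str,
    updated: str,
    size: int = 200,
    overlap: int = 40,
) -> list[Chunk]:
    """Chunk each section separately so every chunk keeps its heading."""
    chunks = []
    for heading, body in sections:
        for window in split_words(body.split(), size, overlap):
            chunks.append(Chunk(" ".join(window), source, heading, updated))
    return chunks


def filter_chunks(
    chunks: list[Chunk],
    section: Optional[str] = None,
    updated_since: Optional[str] = None,
) -> list[Chunk]:
    """Metadata filter applied at query time; ISO dates compare as strings."""
    return [
        c for c in chunks
        if (section is None or c.section == section)
        and (updated_since is None or c.updated >= updated_since)
    ]


sections = [
    ("Refunds", "Refunds are issued within five business days of approval"),
    ("Shipping", "Orders ship within two days"),
]
chunks = chunk_document(sections, "policy.md", "2024-01-15", size=6, overlap=2)
refund_chunks = filter_chunks(chunks, section="Refunds", updated_since="2024-01-01")
```

Chunking per section, instead of over the whole document, is one way to stop a chunk from straddling two headings and losing track of where it came from.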
Guardrails and graceful failure
Production RAG fails safely. When confidence is low or no good context is found, the system should say "I don't know" rather than confidently inventing an answer. We surface citations on every response so users can verify, and we log the retrieved context for every generation so we can debug after the fact.
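One way this could look in code, as a sketch only. It assumes the retriever or re-ranker returns a relevance score where higher is better, and that `generate` is whatever function calls the model. The `Retrieved` and `Answer` types, the `answer_with_guardrails` function, the 0.5 threshold, and the sample passages are all hypothetical; the threshold in particular is a value to tune against the eval set.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("rag")


@dataclass
class Retrieved:
    source: str
    text: str
    score: float  # relevance score, higher is better (assumed)


@dataclass
class Answer:
    text: str
    citations: list[str]


def answer_with_guardrails(
    question: str,
    retrieved: list[Retrieved],
    generate: Callable[[str, list[str]], str],
    min_score: float = 0.5,
) -> Answer:
    """Answer only from sufficiently relevant context, otherwise abstain."""
    usable = [r for r in retrieved if r.score >= min_score]

    # Log everything that was retrieved, including rejected passages,
    # so a bad answer can be traced back to its context later.
    logger.info(json.dumps({
        "question": question,
        "retrieved": [
            {"source": r.source, "score": r.score, "text": r.text}
            for r in retrieved
        ],
        "used": [r.source for r in usable],
    }))

    if not usable:
        # No good context: abstain without calling the model.
        return Answer("I don't know.", [])

    text = generate(question, [r.text for r in usable])
    return Answer(text, [r.source for r in usable])


def echo_first_passage(question: str, context: list[str]) -> str:
    """Stand-in for a model call: returns the first context passage."""
    return context[0]


retrieved = [
    Retrieved("billing-faq", "Refunds take five business days.", 0.82),
    Retrieved("terms", "See section 4.", 0.31),
]
good = answer_with_guardrails(
    "How long do refunds take?", retrieved, echo_first_passage
)
weak = answer_with_guardrails(
    "What is the office dress code?", retrieved[1:], echo_first_passage
)
```

Here `good` cites only the passage that cleared the threshold, while `weak` abstains with no citations because nothing relevant was found.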
What this buys you
The payoff isn't a flashier demo; it's a system you trust enough to leave running. Fewer surprises, explainable answers, and a clear path to improve when something drifts. That's what lets your team sleep through the night.
Key takeaways
- Build the eval harness before the pipeline.
- Tune chunking against data, not intuition.
- Always cite sources, and fall back to "I don't know" when good context is missing.
- Log retrieved context for every generation.
Working on something like this? Tell us about it. It's exactly the kind of problem we love.
Frequently asked questions
What is retrieval-augmented generation (RAG)?
RAG is a technique where a language model retrieves relevant documents from a knowledge base and uses them as context to generate grounded, citable answers. It reduces hallucination and lets a model answer questions about private or up-to-date data it wasn't trained on.
Why does RAG break in production?
RAG demos work on clean questions, but real users ask messy ones. Without evaluation, good chunking, re-ranking, and guardrails, retrieval quality degrades and the model produces confident but wrong answers. The fix is measurement and safe failure, not a bigger model.
How do you evaluate a RAG system?
You assemble a golden set of representative questions with expected answers, then score every change (chunking, retrieval, prompts) against it. This turns subjective debate into an objective metric you can track over time.