RAG Is Easy to Demo and Hard to Ship
Retrieval-augmented generation has a dirty secret: the first version takes an afternoon, and it works. You chunk some documents, embed them, drop them in a vector store, retrieve the top matches, stuff them into a prompt, and the model answers questions about your data. It feels like magic.
Then you point it at real documents, real users, and real questions — and the magic curdles into "well, that's confidently wrong." I've shipped RAG over a five-figure document corpus at work, and almost everything I learned was in the gap between the afternoon demo and the thing people could rely on.
Here's the unglamorous version nobody puts in the tutorial.
The demo lies to you in three ways
1. Your demo questions are too kind. In the demo, you ask questions whose answers sit neatly in one paragraph. Real users ask questions whose answers are spread across four documents, or aren't in the corpus at all, or assume context the documents never state. The demo tests retrieval on easy mode.
2. Your demo corpus is too clean. Ten tidy PDFs is not a corpus. Ten thousand documents — duplicates, near-duplicates, outdated versions, scanned tables, three files that contradict each other — is a corpus. Retrieval quality that looked perfect at small scale degrades the moment the haystack gets real.
3. The model covers for you. A strong model can paper over mediocre retrieval by leaning on what it already knows. That feels great until it confidently fills a gap with something plausible and false. Good demo, dangerous product.
Retrieval is the whole game
Here's the line I wish someone had told me on day one: if the right chunk isn't retrieved, the model cannot answer correctly — full stop. The generation step is downstream of retrieval. You can have the best model in the world and it won't matter, because it never saw the relevant text.
So I stopped obsessing over prompts and started measuring retrieval directly. The question that matters isn't "did the answer sound good?" It's "was the chunk that contains the answer actually in the top results?" Once I had a small set of real questions with known-correct source passages, I could finally see retrieval quality instead of guessing at it. That eval set is the single highest-leverage thing I built.
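Here is a minimal sketch of that measurement, not the exact harness described above. It assumes each eval item pairs a question with the ID of the chunk known to contain the answer, and that your retriever can be wrapped as a function returning ranked chunk IDs. The names `EvalItem`, `hit_rate_at_k`, and `retrieve` are hypothetical.

```python
from typing import Callable

# Hypothetical eval item: (question, ID of the chunk known to hold the answer).
EvalItem = tuple[str, str]


def hit_rate_at_k(
    eval_set: list[EvalItem],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose gold chunk appears in the top-k results.

    `retrieve(question, k)` is an assumed wrapper around whatever retriever
    you use; it should return chunk IDs ordered best-first.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, gold_id in eval_set:
        # Slice defensively in case the retriever returns more than k IDs.
        if gold_id in retrieve(question, k)[:k]:
            hits += 1
    return hits / len(eval_set)
```

Run it before and after every retrieval change, with the same eval set and the same k, so the number tracks retrieval quality rather than how good the final answer happened to sound.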
Chunking is quietly where it lives or dies
Chunking sounds like a config detail. It's not — it's a design decision that quietly sets your ceiling.
Chunk too big and each one is a muddle of topics, so the embedding is an average of everything and matches nothing well. Chunk too small and you shred the context — the answer's first half lands in chunk 12 and its second half in chunk 13, and you retrieve neither cleanly.
What worked for me was chunking along the document's own structure — sections, headings, natural boundaries — instead of blindly slicing every N characters. A chunk should be about one thing. And a little overlap between chunks keeps a sentence that straddles a boundary from being lost. None of this is glamorous. All of it moved the numbers.
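A rough sketch of structure-first chunking with overlap, under some loud assumptions: the documents use Markdown-style `#` headings, paragraphs are separated by blank lines, and overlap is counted in whole paragraphs. The function names and the `max_chars` default are illustrative, not tuned values.

```python
import re

# Assumption: headings look like Markdown ("# Title", "## Section").
HEADING = re.compile(r"^#{1,6}[ \t]+.+$", re.MULTILINE)


def split_sections(text: str) -> list[str]:
    """Split text at heading lines, keeping each heading with its body."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def chunk_section(section: str, max_chars: int = 1200, overlap: int = 1) -> list[str]:
    """Pack paragraphs into chunks of roughly max_chars.

    The last `overlap` paragraphs of a chunk are repeated at the start of the
    next one. A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = (current[-overlap:] if overlap > 0 else []) + [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_document(text: str, max_chars: int = 1200, overlap: int = 1) -> list[str]:
    """Chunk section by section, so no chunk spans two headings."""
    return [
        chunk
        for section in split_sections(text)
        for chunk in chunk_section(section, max_chars, overlap)
    ]
```

Because chunks never cross a heading, each one stays about one thing, and the paragraph overlap only applies inside a section that is too long to keep whole.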
The fixes that actually held
Once I could measure retrieval, the improvements stopped being vibes:
- Hybrid search. Pure semantic (vector) search misses exact terms — part numbers, error codes, names. Pairing it with old-fashioned keyword search caught the cases embeddings fumbled. The combination beat either alone, consistently.
- Re-ranking. Retrieve a generous set of candidates, then run a second, more careful scoring pass over just those candidates to re-order them and keep only the best few. Cheap to add, and it cleaned up the "right answer was at rank 8" problem.
- Teaching it to say 'I don't know.' I'd rather the system admit the corpus doesn't cover something than invent an answer. Instructing the model to ground itself strictly in the retrieved text — and to decline when that text is thin — traded a little coverage for a lot of trust. Worth it every time.
- Showing its sources. Every answer cites the chunks it came from. This does two things: it lets users verify the answer, and it lets me debug a bad answer by seeing exactly what the model was handed.
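The four fixes above can be sketched together in a few lines. This is one possible shape, not the setup described in this post: it assumes reciprocal rank fusion as the way to combine keyword and vector rankings (the post does not say how they were combined), a hypothetical `score(question, chunk_id)` function standing in for whatever re-ranker you use, and a prompt wording that is purely illustrative.

```python
from collections import defaultdict
from typing import Callable


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked ID lists into one.

    Each list contributes 1 / (k + rank) per chunk; k=60 is a commonly used
    default, not a tuned value.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Re-order candidates with a second-pass scorer and keep the best few.

    `score` is a hypothetical stand-in for your re-ranking model.
    """
    ranked = sorted(candidates, key=lambda cid: score(question, cid), reverse=True)
    return ranked[:keep]


def build_prompt(question: str, chunks: dict[str, str], chunk_ids: list[str]) -> str:
    """Label each chunk with its ID so answers can cite it, and tell the
    model to decline when the sources do not cover the question."""
    sources = "\n\n".join(f"[{cid}]\n{chunks[cid]}" for cid in chunk_ids)
    return (
        "Answer using only the sources below. Cite the source IDs you used "
        "in square brackets. If the sources do not contain the answer, "
        "reply exactly: I don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

A typical call would fuse the keyword and vector rankings, re-rank the fused candidates, then pass only the survivors to `build_prompt`. Keeping the chunk IDs in the prompt is what makes the citations checkable later, both by users and by whoever is debugging a bad answer.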
What I'd tell my afternoon-one self
RAG isn't a model problem; it's a retrieval and data problem wearing an LLM costume. The teams who win at it aren't the ones with the fanciest model — they're the ones who treat their corpus like a product: cleaned, chunked thoughtfully, measured honestly, and kept fresh.
So build the afternoon demo, absolutely — it's a great way to feel the shape of the thing. But the moment you want people to rely on it, put down the prompt-tinkering and go build an eval set. Measure whether the right chunk shows up. Everything good flows from there.
The magic was never the generation. It was always quietly the retrieval.
Tags: RAG, LLMs, Data Engineering, Production ML