We’ve built RAG systems for twenty different clients now, across industries from legal to healthcare to e-commerce. Every project starts with the same pitch: “We’ll just hook up a vector database to an LLM and let users ask questions about our data.” Every project quickly discovers it’s not that simple.
The first lesson is that chunking strategy matters more than model choice. We’ve seen teams spend weeks evaluating LLMs while using naive fixed-size chunking on their documents. Switching from 512-token chunks to semantically aware chunking improved retrieval accuracy by 30% on one project — more than any model swap ever did.
The second lesson is that retrieval quality is everything. If the retriever surfaces the wrong chunks, no amount of prompt engineering will save you. We now spend at least 40% of every RAG project on retrieval tuning: hybrid search (combining vector similarity with keyword matching), re-ranking retrieved results, and building evaluation datasets to measure retrieval quality over time.
The third lesson is about evaluation. You need an automated way to measure whether your RAG system is actually answering questions correctly. We build golden datasets of question-answer pairs for every deployment, then run regression tests on every change to the chunking, embedding, or retrieval pipeline.
Finally, production RAG systems need guardrails: citation verification (does the answer actually come from the retrieved sources?), hallucination detection, and graceful fallbacks when the system doesn’t have enough context to answer confidently. These aren’t nice-to-haves; they’re table stakes for any system users will trust.
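As a concrete illustration of citation verification, here’s a crude sketch: for each sentence of the answer, require that enough of its words appear in at least one retrieved source, and trigger the fallback otherwise. The overlap threshold is an assumption; real systems use an NLI model or an LLM judge rather than word overlap:

```python
import re

def cited_by_sources(answer, sources, min_overlap=0.5):
    # Check every answer sentence is "grounded": at least min_overlap of its
    # words appear in some retrieved source. A crude proxy for entailment.
    source_words = [set(re.findall(r"\w+", s.lower())) for s in sources]
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        words = set(re.findall(r"\w+", sentence.lower()))
        if not words:
            continue
        best = max((len(words & sw) / len(words) for sw in source_words), default=0.0)
        if best < min_overlap:
            return False  # unsupported sentence: decline to answer or regenerate
    return True
```

The useful property is the failure mode: an ungrounded sentence anywhere in the answer blocks the whole response, so the system falls back to “I don’t have enough information” instead of shipping a hallucination.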