Most RAG systems we’ve audited fail in the same place: the retrieval is fine, the model is fine, but nobody measured what “fine” means, so the team had no idea their assistant was confidently wrong on 12% of answers. Here’s what we changed on a 200,000-document internal knowledge base to get the hallucination rate consistently under 2%.
The pipeline, end to end
- Chunking with structure. Markdown headings, code-block boundaries and page breaks all become natural cut points. We never split a code block. We always carry the section title into the chunk metadata (sketched after this list).
- Hybrid retrieval. BM25 + dense embeddings, both scored, results merged with reciprocal rank fusion (see the fusion sketch below). Pure vector search misses too many exact-term queries (product SKUs, error codes, API names).
- Rerank. A small cross-encoder reranks the top 50 down to the top 8. Cheap, fast, and the single biggest accuracy lift in the whole pipeline.
- Metadata filtering. Tenant, document type, freshness, ACL — all enforced in the retrieval layer, not the prompt. Never trust the model to filter for security.
- Strict context envelope. Each retrieved chunk is wrapped with its document title, URL and last-updated date (see the envelope sketch below). The model is told to cite the chunk by ID for every claim, and to refuse if it can’t.
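To make the chunking rule concrete, here’s a minimal sketch of heading-aware splitting. It’s a simplification of what we run: code-fence and page-break handling are omitted, and the `max_chars` budget and field names are illustrative assumptions, not our production values.

```python
def chunk_markdown(doc_text, max_chars=1500):
    """Split on markdown headings; every chunk carries its section title as metadata."""
    chunks, title, buf = [], "untitled", []
    for line in doc_text.splitlines():
        if line.startswith("#"):                       # heading = natural cut point
            if buf:
                chunks.append({"section": title, "text": "\n".join(buf)})
            title, buf = line.lstrip("# ").strip(), []
        else:
            buf.append(line)
        if sum(len(l) for l in buf) > max_chars:       # oversize section: cut, keep the title
            chunks.append({"section": title, "text": "\n".join(buf)})
            buf = []
    if buf:
        chunks.append({"section": title, "text": "\n".join(buf)})
    return chunks
```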
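The fusion step is also just a few lines. This is a generic reciprocal rank fusion sketch, not our exact code: the `k=60` damping constant is the conventional default and the chunk IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists; earlier ranks contribute larger 1/(k+rank) scores."""
    scores = defaultdict(float)
    for results in result_lists:                      # e.g. [bm25_hits, dense_hits]
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)  # best fused score first

bm25_hits  = ["chunk-17", "chunk-03", "chunk-42"]     # exact-term matches (SKUs, error codes)
dense_hits = ["chunk-03", "chunk-99", "chunk-17"]     # semantic matches
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# chunk-03 and chunk-17 rise to the top because both retrievers agree on them
```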
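And here is roughly what the context envelope looks like before it reaches the model. The field names and prompt wording below are illustrative; the point is that every chunk carries its provenance, and the instruction to cite or refuse lives next to the context.

```python
def wrap_chunk(chunk):
    # chunk: dict with id, title, url, updated, text (field names are assumptions)
    return (
        f"[chunk {chunk['id']}] {chunk['title']}\n"
        f"source: {chunk['url']} (last updated {chunk['updated']})\n"
        f"{chunk['text']}"
    )

def build_prompt(question, chunks):
    context = "\n\n".join(wrap_chunk(c) for c in chunks)
    return (
        "Answer using only the context below. Cite the chunk ID for every claim.\n"
        "If no chunk supports an answer, reply exactly: \"I don't have enough context.\"\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```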
The eval harness
You can’t improve what you don’t measure, and “the team thinks it feels better” is not a measurement. We run three eval sets in CI on every prompt or pipeline change:
- Golden set — 200 hand-curated Q&A with exact expected citations. Hard fail in CI if any of them regress (see the sketch after this list).
- LLM-judged set — 1,000 synthetic questions, scored by a stronger model with a structured rubric (faithfulness, citation correctness, answer completeness, refusal appropriateness).
- Adversarial set — questions designed to bait hallucination: out-of-scope, contradictory sources, near-duplicate documents.
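As a concrete illustration of the golden-set gate, here’s the shape of the CI check we mean. The `answer_fn` entry point and the JSON layout of the golden file are assumptions for the sketch; what matters is the hard exit code on any regression.

```python
import json
import sys

def run_golden_set(answer_fn, path="golden_set.json"):
    """answer_fn(question) -> (answer_text, [cited_chunk_ids]); hypothetical pipeline entry point."""
    failures = []
    for case in json.load(open(path)):
        answer, citations = answer_fn(case["question"])
        # A case passes only if every expected citation is actually cited
        if not set(case["expected_citations"]) <= set(citations):
            failures.append(case["id"])
    if failures:
        print(f"golden set: {len(failures)} regressions: {failures}")
        sys.exit(1)   # hard fail the CI job
    print("golden set: all cases pass")
```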
If your RAG demo doesn’t have a refusal rate, it’s lying to you.
What moved the needle
- Rerank. Hallucination dropped from ~7% to ~3% on its own.
- Refuse-when-unsure prompt. Explicit instruction to answer “I don’t have enough context” when no chunk supports the claim. Got us under 2%.
- Citation-required output schema. Forced JSON with a `citations` array of chunk IDs. The model can’t emit an answer without referencing retrieved context (see the schema sketch after this list).
- Freshness boost. Documents updated in the last 90 days get a small score bump: cheap to add, big quality lift on a corpus where things change.
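The forced schema can be as simple as a JSON Schema that rejects any output without at least one citation. The field names below are illustrative, not our production schema, and refusal handling is left out of the sketch.

```python
from jsonschema import validate, ValidationError

ANSWER_SCHEMA = {
    "type": "object",
    "required": ["answer", "citations"],
    "properties": {
        "answer": {"type": "string"},
        "citations": {              # chunk IDs the answer is grounded in
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1,          # an answer with no citations is rejected outright
        },
    },
    "additionalProperties": False,
}

def accept(model_output: dict) -> bool:
    try:
        validate(model_output, ANSWER_SCHEMA)
        return True
    except ValidationError:
        return False                # retry the generation, or fall back to a refusal
```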
What didn’t
- Bigger embedding model. Marginal gain, double the cost.
- Going from top-5 to top-20 context. More tokens, more chances to confuse the model, no measurable accuracy lift after rerank.
- “Self-RAG” loops where the model decides if it needs more context. Slow, expensive, and worse than a deterministic threshold on the rerank score (sketched below).
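For contrast, the deterministic alternative is a couple of lines: refuse whenever the best rerank score is below a cutoff. The 0.3 value here is a placeholder assumption; tune it against the adversarial eval set.

```python
RERANK_REFUSAL_THRESHOLD = 0.3   # illustrative value, not ours; tune on the adversarial set

def should_refuse(reranked_chunks):
    """reranked_chunks: list of (chunk_id, cross_encoder_score), best score first."""
    return not reranked_chunks or reranked_chunks[0][1] < RERANK_REFUSAL_THRESHOLD
```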
Most RAG quality wins are in the boring parts: chunking, rerank, citations and a real eval harness. Switching foundation models is almost never the answer.
If you’re running RAG in production and don’t have a number for your hallucination rate, that’s the place to start. Talk to us and we’ll help you instrument it.