Retrieval Is Not Grounding: Building RAG That Stays Honest
Fetching the right documents is necessary but not sufficient. Grounding (answers that are actually entailed by the retrieved evidence) is a separate property you have to design for and measure. Here is how I think about the gap, and the evaluation that closes it.
Retrieval-augmented generation is often described as "the model looks things up before it answers." That framing hides the bug that bites most teams in production: retrieving the right documents and producing a grounded answer are two different things. A system can fetch a perfect passage and still write a confident sentence that the passage does not support. Retrieval is a search problem. Grounding is a generation property. You have to engineer and measure both.
This is the single most useful distinction I have for debugging RAG systems, so it is worth making precise.
The gap between retrieval and grounding
Retrieval asks: did we put the right evidence in the context window? Grounding asks: is the answer we generated actually entailed by that evidence? The two failure modes are independent:
- Good retrieval, bad grounding. The correct passage is right there, and the model still paraphrases it into something subtly false, merges it with a half-remembered fact from pretraining, or over-generalizes a hedged claim into a definitive one.
- Bad retrieval, "good" grounding. The model faithfully grounds its answer in the wrong passage. It is honest about evidence that does not actually answer the question.
If you only measure retrieval (recall@k, MRR), you are blind to the first mode. If you only eyeball outputs, you miss the second, because an answer that is faithful to the wrong passage still reads as well supported. The fix is to treat grounding as a first-class, separately measured property.
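For reference, the two retrieval metrics named above fit in a few lines. This is a minimal sketch: the function names and the input shape (a ranked list of document ids plus a set of labeled relevant ids per query) are assumptions for illustration, not a fixed API.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the labeled relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0  # assumption: a query with no labeled relevant docs scores 0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

Both numbers describe only what reached the context window. Neither one looks at the generated answer, which is why grounding needs its own metric.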
Where grounding actually breaks
In practice, ungrounded answers cluster into a few recurring shapes:
- Synthesis drift. The model combines two retrieved chunks into a claim that neither one makes on its own.
- Pretraining leakage. When retrieval is weak, the model quietly falls back on parametric knowledge, which is exactly when it is least likely to be current or correct.
- Confidence inflation. Source says "may be associated with"; answer says "causes."
- Citation theater. The answer cites a source that is plausibly on-topic but does not contain the specific claim.
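Of these shapes, confidence inflation is the easiest to flag mechanically. The sketch below is a crude lexical heuristic, not an entailment check: it assumes a small hand-picked hedge list (`HEDGES`, hypothetical) and only asks whether the source hedges while the answer sentence does not.

```python
import re
from typing import Set

# Assumption: a tiny illustrative hedge lexicon; a real one would be larger.
HEDGES = ("may", "might", "could", "suggests", "associated with", "possibly", "preliminary")


def hedges_in(text: str) -> Set[str]:
    """Return the hedge terms from HEDGES that occur in text as whole words."""
    lowered = text.lower()
    return {h for h in HEDGES if re.search(r"\b" + re.escape(h) + r"\b", lowered)}


def looks_inflated(source_span: str, answer_sentence: str) -> bool:
    """True when the source span hedges but the answer sentence carries no hedge at all."""
    return bool(hedges_in(source_span)) and not hedges_in(answer_sentence)
```

A hit is a prompt for review rather than a verdict. The heuristic will miss paraphrased hedges and will misfire on words like the month "May".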
A useful mental test
For every sentence in the answer, ask: which retrieved span entails this, and would a careful reader agree? If you cannot point at a span, that sentence is not grounded; the model is improvising.
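A rough way to run this test automatically is to look for the span that best covers each sentence. The sketch below uses token overlap, which is only a stand-in for entailment: high overlap does not prove support, but a sentence with no overlapping span is a strong hint of improvisation. The function names and the `min_overlap` default are assumptions.

```python
import re
from typing import List, Optional, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; no stemming or stop-word removal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_span(
    sentence: str, spans: List[str], min_overlap: float = 0.6
) -> Optional[Tuple[int, float]]:
    """Return (span index, overlap) for the best-covering span, or None if none clears min_overlap."""
    sent_tokens = tokens(sentence)
    if not sent_tokens:
        return None
    best: Optional[Tuple[int, float]] = None
    for i, span in enumerate(spans):
        # Overlap = share of the sentence's tokens that also appear in the span.
        overlap = len(sent_tokens & tokens(span)) / len(sent_tokens)
        if best is None or overlap > best[1]:
            best = (i, overlap)
    if best is None or best[1] < min_overlap:
        return None
    return best
```

Sentences that come back as `None` are the ones to read by hand first.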
Designing for grounding
Most grounding wins are architectural, not prompt-level. The patterns that have held up for me:
- Make citations structural, not decorative. Require the model to attach a source id to each claim, and render answers so an unsupported sentence is visually obvious. If a claim cannot cite a span, that is a signal, not a cosmetic issue.
- Constrain the surface area. Shorter, well-scoped answers are easier to ground than sprawling ones. "Answer only from the provided context, and say so when it is insufficient" is a real engineering constraint, enforced and tested, not a hopeful sentence in a prompt.
- Separate retrieval quality from answer quality. Rerank aggressively so the top spans are genuinely relevant, then evaluate grounding given those spans. Conflating the two makes every regression ambiguous.
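To make the first pattern concrete, here is a minimal sketch of a citation audit. It assumes the model is asked to mark claims with ids like `[S1]` placed before the sentence's closing punctuation; that format, the naive sentence splitter, and the `audit_citations` name are all illustrative assumptions.

```python
import re
from typing import Dict, List

CITATION = re.compile(r"\[(S\d+)\]")  # assumption: citations look like [S1], [S2], ...


def audit_citations(answer: str, sources: Dict[str, str]) -> List[dict]:
    """Per sentence, list the cited source ids and flag missing or unknown citations."""
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    report = []
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        report.append({
            "sentence": sentence,
            "cited": cited,
            "uncited": not cited,  # no source id attached at all
            "unknown_ids": [c for c in cited if c not in sources],  # id not in retrieved set
        })
    return report
```

This only checks that a citation exists and points at a retrieved source. Whether the cited span actually contains the claim is the job of the entailment check, so the audit catches missing citations but not citation theater.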
Measuring grounding
Here is the part teams skip. You can approximate grounding cheaply with a claim-level entailment check: split the answer into claims, and for each one ask whether the retrieved context entails it.
```python
def grounding_score(answer: str, context: str, judge) -> float:
    """Fraction of answer claims entailed by the retrieved context."""
    # split_into_claims is assumed to be defined elsewhere (e.g. a sentence splitter).
    claims = split_into_claims(answer)
    if not claims:
        # No claims means nothing is unsupported, so the score is vacuously 1.0.
        return 1.0
    supported = 0
    for claim in claims:
        verdict = judge.entails(premise=context, hypothesis=claim)
        supported += int(verdict.label == "entailed")
    return supported / len(claims)
```

The judge can be a smaller model, a natural-language-inference classifier, or, for a golden set, a human. What matters is that grounding becomes a number you can regression-test, not a vibe. Wire it into a gate: if mean grounding on your evaluation set drops below a threshold, the build does not ship.
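The gate itself can be very small. A minimal sketch, assuming you already have one grounding score per evaluation example; the `grounding_gate` name and the 0.9 default threshold are placeholders, not recommendations.

```python
from statistics import mean
from typing import Iterable


def grounding_gate(scores: Iterable[float], threshold: float = 0.9) -> bool:
    """Return True when mean grounding over the evaluation set meets the threshold."""
    scores = list(scores)
    if not scores:
        # An empty evaluation set should fail loudly rather than pass silently.
        raise ValueError("evaluation set is empty; refusing to pass the gate")
    return mean(scores) >= threshold
```

In CI, a wrapper script would compute the scores, call `grounding_gate`, and exit non-zero when it returns `False`. Tracking the per-example scores as well as the mean makes it easier to see which questions regressed.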
This is where my information-retrieval background keeps paying off. Classical IR never let you get away with "looks good": you had a labeled set, a metric, and an ablation. RAG deserves the same discipline. Recall@k tells you whether the evidence was available; a grounding score tells you whether the model used it honestly.
What I do in practice
A pipeline I trust usually has four measurable stages: retrieve, rerank, answer, verify. Each one has its own metric and its own failure signature, so when quality drops I can tell which stage moved. The verify stage (the grounding check above, plus a refusal path when context is insufficient) is the cheapest reliability upgrade most RAG systems are missing.
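The four stages and the refusal path can be sketched as one function with the stages passed in as callables. Everything here is illustrative: the `RagResult` fields, the refusal wording, and the `min_grounding` default are assumptions, and the stage functions stand in for whatever retriever, reranker, generator, and grounding scorer you use.

```python
from dataclasses import dataclass
from typing import Callable, List

REFUSAL = "I can't answer that from the provided context."  # hypothetical wording


@dataclass
class RagResult:
    answer: str
    spans: List[str]
    grounding: float
    refused: bool


def run_pipeline(
    question: str,
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    score_grounding: Callable[[str, List[str]], float],
    min_grounding: float = 0.9,
) -> RagResult:
    """Retrieve, rerank, answer, verify; refuse when evidence or grounding is insufficient."""
    spans = rerank(question, retrieve(question))
    if not spans:
        # Refusal path: nothing to ground on, so the generator is never called.
        return RagResult(REFUSAL, [], 0.0, True)
    draft = generate(question, spans)
    grounding = score_grounding(draft, spans)
    if grounding < min_grounding:
        # Verify stage failed: refuse, but keep the spans and score for logging.
        return RagResult(REFUSAL, spans, grounding, True)
    return RagResult(draft, spans, grounding, False)
```

Keeping the stages as separate callables means each one can be swapped or measured on its own, and the returned record shows whether a refusal came from empty retrieval or from a failed verify step.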
Retrieval gets you candidate evidence. Grounding is the promise that the answer is true to that evidence. Build for the second one explicitly, measure it, and "the model hallucinated" turns from a mystery into a number you can move.