Notes on chunking: smaller isn't always better

Spent the afternoon re-running a chunking ablation because a retrieval metric drifted and I wanted to know why. The folklore is "smaller chunks → better recall." That is true right up until it isn't.

Tiny chunks score well on recall@k because some fragment of the answer almost always lands in the top-k. But fragments are hard to ground: the model gets a sentence with no surrounding context and either refuses or improvises the missing half. Bigger chunks ground better and rerank worse. What matters is where the boundary falls, not how big the chunk is: split on structure (sections, list items, function bodies) instead of a fixed token count, and the same average size suddenly behaves.
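To make the boundary point concrete, here is a minimal sketch, not the splitter I actually ran. It assumes whitespace tokens stand in for real tokenizer tokens and that blank lines mark structural boundaries; the function names and the toy document are made up for illustration.

```python
import re


def fixed_window_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` whitespace tokens, ignoring document structure."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def structural_chunks(text: str, max_tokens: int) -> list[str]:
    """Split on blank lines, then greedily merge adjacent blocks while the
    merged chunk stays within max_tokens. A single block larger than
    max_tokens is kept whole rather than cut mid-block."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for block in blocks:
        n = len(block.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Toy document (invented): three short sections separated by blank lines.
doc = (
    "Install\nRun the installer and accept the defaults.\n\n"
    "Configure\nSet the port in the config file before first start.\n\n"
    "Troubleshoot\nIf the port is taken, pick another one and restart."
)

fixed = fixed_window_chunks(doc, 10)
structural = structural_chunks(doc, 12)

for name, chunks in (("fixed", fixed), ("structural", structural)):
    print(name)
    for chunk in chunks:
        print("  ", repr(chunk))
```

Both strategies produce three chunks averaging ten tokens here, but the fixed window's second chunk starts mid-section with "the port in the config file" and ends on the next section's heading, while every structural chunk starts at its own heading.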

What I'm taking from this round:

Measure retrieval and grounding separately, or you optimize chunk size against the wrong target.
"Semantic" chunking that respects document structure beat fixed-window chunking at the same token budget. The win was boundaries, not embeddings.
Overlap papers over bad boundaries and inflates your index. Fix the boundary first.
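On the index-inflation point, the arithmetic is easy to sketch. This assumes plain fixed-size sliding windows where each window starts `size - overlap` tokens after the previous one; `window_count` is a hypothetical helper and the sizes below are illustrative, not numbers from my runs.

```python
import math


def window_count(n_tokens: int, size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover n_tokens when each window
    holds `size` tokens and shares `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= size:
        return 1
    stride = size - overlap
    return math.ceil((n_tokens - size) / stride) + 1


# Illustrative only: a 1000-token document with 200-token windows.
for overlap in (0, 50, 100):
    print(overlap, window_count(1000, 200, overlap))
```

For long documents the chunk count grows by roughly size / (size - overlap), so 50% overlap about doubles the index without moving a single boundary to a better place.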

Next: a small eval that scores chunking strategies on grounding-given-retrieval, not just recall. If a strategy retrieves the right region but the model can't ground in it, that's a chunking failure I currently can't see.