Re-reading BM25 in the age of embeddings

Went back to the BM25 literature this week, partly out of nostalgia for my IR research days, partly because a hybrid retriever I'm tuning keeps reminding me how good the old lexical baseline is.

The thing that holds up: BM25 encodes assumptions that dense retrieval quietly relies on too. Term saturation (the tenth occurrence of a word matters less than the second). Length normalization (a long document shouldn't win just by being long). These aren't artifacts of a pre-neural era; they're properties of how relevance actually works. Embeddings learn fuzzy versions of the same intuitions, and they miss exact-match and rare-term cases that BM25 nails for free.
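
For anyone who hasn't looked at the formula in a while, here is a minimal sketch of both assumptions in Python. It follows the standard Okapi BM25 form with the usual defaults (k1 = 1.2, b = 0.75) and a common non-negative IDF variant. The function names and the pre-tokenized input are my own choices for illustration, not any particular library's API.

```python
import math
from collections import Counter


def tf_weight(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for one term in one document.

    k1 controls saturation: the weight approaches (k1 + 1) as tf grows.
    b controls length normalization: b=0 ignores length, b=1 normalizes fully.
    """
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query.

    Assumes `query` is a list of tokens and `docs` is a non-empty list of
    token lists; tokenization is left out of this sketch.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger IDF; the "1 +" keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf_weight(tf[term], len(d), avg_len, k1, b)
        scores.append(score)
    return scores
```

With an average-length document, going from one occurrence to two adds far more weight than going from nine to ten, and the same term count in a document twice the average length scores lower than in one half the average length. Those two behaviors are the whole trick.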

Which is why hybrid keeps winning: dense retrieval for "means the same thing," lexical for "is literally this identifier." The interesting engineering question isn't which one to use but how to combine them. Rank fusion, which merges the two result lists by position rather than by score, tends to be more robust than score interpolation, because you never have to make two incomparable score scales agree.
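
One common way to do rank fusion is reciprocal rank fusion (RRF). The sketch below is just one instance of the idea, not necessarily what any given system ships. It assumes each retriever returns a list of document ids ordered best first; the function name, the string ids, and the tie-break by id are my own illustrative choices, and k = 60 is the constant usually quoted for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in
    (rank starts at 1). Only positions are used, never raw scores, so
    the retrievers' score scales do not need to be comparable.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical outputs from a dense and a lexical retriever.
dense = ["a", "b", "c"]
lexical = ["c", "a", "d"]
fused_ranking = reciprocal_rank_fusion([dense, lexical])
```

Documents that both retrievers rank highly rise to the top, while a document that only one retriever found still makes the list. Score interpolation would instead need a weight and some normalization to reconcile cosine similarities with BM25 scores, and both tend to need retuning per corpus.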

Reading old papers is underrated. The field rediscovers their lessons every couple of years under new names. Cheaper to just read them.