An eval harness before the feature
Rule I keep relearning: build the evaluation before the feature, not after.
It feels slower. You sit down to add a capability and instead spend the first hour assembling twenty labeled examples and a metric. But the harness pays for itself almost immediately: the moment you have a number, "is this better?" stops being an argument and becomes a measurement. You stop shipping changes that feel like improvements and quietly aren't.
The minimum viable harness is smaller than people think:
- A handful of representative inputs with known-good outputs (twenty beats zero by a lot).
- One metric that actually correlates with the thing you care about.
- A way to run it on every change and a threshold that blocks regressions.
That's it. It doesn't need a framework. The discipline is the product, not the tooling.
The deeper reason is that AI features fail silently. A broken API throws; a subtly worse retriever just gets a little wronger and nobody notices until a user does. The eval is the smoke detector. Install it before you start cooking.