An eval harness before the feature

Rule I keep relearning: build the evaluation before the feature, not after.

It feels slower. You sit down to add a capability and instead spend the first hour assembling twenty labeled examples and a metric. But the harness pays for itself almost immediately: the moment you have a number, "is this better?" stops being an argument and becomes a measurement. You stop shipping changes that feel like improvements and quietly aren't.

The minimum viable harness is smaller than people think:

A handful of representative inputs with known-good outputs (twenty beats zero by a lot).
One metric that actually correlates with the thing you care about.
A way to run it on every change and a threshold that blocks regressions.

That's it. It doesn't need a framework. The discipline is the product, not the tooling.

The deeper reason is that AI features fail silently. A broken API throws; a subtly worse retriever just gets a little wronger and nobody notices until a user does. The eval is the smoke detector. Install it before you start cooking.