Seeing Isn't Measuring: Fixing the Design-to-Code Plateau
AI can build a component from a screenshot. It still can't tell you whether it matched. The fix is the same one that applies to every agentic loop: separate the measurement from the judgment.
Articles & Journal
I think in public. Long-form articles work through the engineering ideas in depth; the journal is a dated log of what I'm building and learning along the way: retrieval, multi-agent systems, evaluation, and applied machine learning.
AI can build a component from a screenshot. It still can't tell you whether it matched. The fix is the same one that applies to every agentic loop: separate the measurement from the judgment.
Anthropic's new Claude Tag turns @Claude into a persistent, shared teammate inside Slack. Here's what it actually does, how it's governed, and why the form factor matters more than the feature list.
The unit of agentic work has moved from the prompt to the loop. A working definition, an anatomy of what a loop is made of, and the one principle that separates loops you can trust from agents that agree with themselves.
/loop runs a prompt or a slash command on a cadence inside your open Claude Code session, so you stop re-asking 'did it finish yet?' and let the machine do the polling. This is what it is, the three ways to run it, where it earns its keep, and the session-scoped limits that tell you when to reach for durable automation instead.
Running a coding agent unattended is tempting and mostly a trap. The thing that makes it safe isn't a better prompt. It's an external, deterministic gate the model cannot talk its way past. Here is the principle, a working pipeline that embodies it, and the failure modes that matter.
Fetching the right documents is necessary but not sufficient. Grounding (answers that are actually entailed by the retrieved evidence) is a separate property you have to design for and measure. Here is how I think about the gap, and the evaluation that closes it.
When an agent system misbehaves, the instinct is to add more instructions. Usually the real fix is structural: explicit states, hard guards, and small tools with narrow contracts. Reliability is an architecture decision, not a prompting one.
Spent the afternoon re-running a chunking ablation because a retrieval metric drifted and I wanted to know why. The folklore is "smaller chunks → better recall." That is true right…
Rule I keep relearning: build the evaluation before the feature, not after. It feels slower. You sit down to add a capability and instead spend the first hour assembling twenty lab…
Went back to the BM25 literature this week, partly out of nostalgia from my IR research days, partly because a hybrid retriever I'm tuning keeps reminding me how good the old lexic…