Feedback Loops Are Very Important For Coding Agents

Ask people how to get good code out of agents and they’ll talk about prompts, context, which model to use. Those are all important, but I think there’s another important thing which gets too little attention: what agents can check their work against.

That thing is the feedback loop you drop your agents into. A lot of it is tests, which means you already know much of what makes it effective. The test pyramid, red-green-refactor, all that good old TDD stuff.

But you probably need to change some of your practices. The equations haven’t changed, but the inputs have, and they add up to some new answers.

If you’re building software with agents, you should be writing a lot of tests. A lot of tests.

Why we test as little as possible

Most engineers I know are happy to agree that automated testing is “very important” and then do as little of it as they can get away with. A nice unit test suite. A few integration tests. Maybe some end-to-end testing if they’re feeling orthodox. That instinct is the rational response to a real cost. Tests are boring and slow to write, but the real burden is maintaining them for as long as they live.

That calculation leans on two assumptions. The first assumption is that whoever wrote the code mostly knows whether it’s right. They held the thing in their head while they built it, they can reason about it, and when they finish a function they have a decent sense of whether it works before a single test runs. Tests are a backstop to a judgment that’s already pretty good.

The second assumption is that tests cost real, scarce human time.

You can see what I’m getting at by now. Neither assumption holds for agents.

Agents are dumb

Agents don’t reliably know whether their own code is correct or hopelessly inadequate. They’ll tell you a feature is finished, and when you go to look, the feature is broken. I spent some time de-vibing a thoroughly vibe-coded codebase and the most common problem by far was exactly that: agents declaring victory on work they’d never tested.

I don’t think this is really news to most of you. It’s well studied. Point agents at a real task and they’ll claim success far more often than they have any right to, and stay just as sure of themselves even when the test results are sitting right in front of them. When you ask a model to check and fix its own reasoning with nothing external to go on, it can improve somewhat, but results are limited. Sometimes it gets worse. A lot of the early “look, it can correct itself” results turned out to be quietly handing the model the answer key so it knew when to stop; take the answer key away and the effect starts to evaporate.

So agents can’t trust their own sense of “this works.” And if they can’t trust themselves, they have to get that signal from somewhere outside themselves: something that can tell them, honestly, whether the code does what it’s supposed to.

It’s worth mentioning that this is a “for now,” not a law of nature. Models are slowly getting better at judging their own work, and some of the newer ways of training them lean more on the judgment of other models than on hard pass/fail signals. The gap will narrow, but today it is what we have to work with. It’s not like tests are getting any more expensive to write, though.

Code is cheap now

On to assumption #2: tests are expensive to write and maintain. For agents, writing a test is cheap, and keeping it green through later changes is cheap too. All that boring, slow, never-ending upkeep, the thing that actually made tests expensive, is exactly the work they’ll do for you without complaint and the sort of work they’re actually good at.

So both forces push the same way. You need tests more than you used to, because the agents writing your code can’t tell when they’re wrong. And you pay far less for them, because those same agents will write and maintain them for you. That should change what you think a reasonable amount of testing is. There’s no longer a good reason to skimp.

None of this lets you off the hook. The agents write the tests, but they don’t decide how testing works in your project. That part is still yours. You set the scopes, draw the mocking boundaries, and define what makes a test real instead of theater. Today’s models won’t do it for you, and agents without that framework reinvent their own conventions every time they touch the suite.

Feedback agents can actually use

Anyway, a big pile of tests is just raw material. What the agents need from it is feedback they can act on while they work, and that takes more than just having a lot of tests.

The first thing that makes feedback usable is scope. A test that reports “something, somewhere, is broken” barely helps; the agents need to know which piece broke. Agents build the way we do, a piece at a time, so the check that helps is one aimed at the piece in hand, able to run while that piece is what’s being worked on. A broad end-to-end test can’t do that. It can’t even run until the whole system exists, and when it does go red it points at everything at once. Bugs turn up at every scope, so you want to be able to check at every scope: the function, the seam where components meet, and the whole user journey.
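To make "every scope" concrete, here is a minimal sketch in Python. Everything in it (line_total, Cart, checkout) is a made-up example, not code from a real project. It shows one check aimed at a single function, one at the seam between two components, and one at the whole journey. Each check can run as soon as the piece it covers exists.

```python
# Hypothetical example: a tiny cart, checked at three scopes.
from typing import Callable, Dict


def line_total(unit_price_cents: int, quantity: int) -> int:
    """Function scope: price of one cart line, in cents."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    return unit_price_cents * quantity


class Cart:
    """Component that depends on a price lookup (the seam)."""

    def __init__(self, price_lookup: Callable[[str], int]) -> None:
        self._price_lookup = price_lookup
        self._items: Dict[str, int] = {}

    def add(self, sku: str, quantity: int = 1) -> None:
        self._items[sku] = self._items.get(sku, 0) + quantity

    def total(self) -> int:
        return sum(
            line_total(self._price_lookup(sku), qty)
            for sku, qty in self._items.items()
        )


def checkout(cart: Cart, balance_cents: int) -> int:
    """Journey scope: pay for the cart, return the remaining balance."""
    total = cart.total()
    if total > balance_cents:
        raise RuntimeError("insufficient funds")
    return balance_cents - total


def test_function_scope() -> None:
    # Runnable the moment line_total exists.
    assert line_total(250, 3) == 750


def test_seam_scope() -> None:
    # Checks that Cart and the price lookup agree with each other.
    prices = {"apple": 50, "pear": 75}
    cart = Cart(prices.__getitem__)
    cart.add("apple", 2)
    cart.add("pear")
    assert cart.total() == 175


def test_journey_scope() -> None:
    # Needs every piece; a failure here says less about where the bug is.
    cart = Cart({"apple": 50}.__getitem__)
    cart.add("apple", 4)
    assert checkout(cart, 500) == 300
```

If test_journey_scope goes red while the other two stay green, the agents know to look at checkout and not at the arithmetic. That narrowing is the whole point of having a check at each scope.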

Speed matters too, further down the list. The agents run these checks on a loop, so one that comes back in seconds keeps them moving while one that takes ten minutes drags every iteration. Next to scope it’s a tuning detail, but at the rate agents work it adds up.

The second is that tests aren’t the only kind of feedback, or the fastest. A type checker or a linter can tell the agents they’ve gone wrong the moment they write the line, before anything runs. For the things you can’t pin down in code, like whether a screen actually looks right, the agents can open a browser, take a screenshot, and see for themselves. VLMs aren’t that good at this yet, but they’re getting better all the time, with Mythos-class models approaching something close to competency and design taste. What matters is that they can get an answer they can act on, at the scope they’re working at. A test is just the most familiar way to give them one.

That’s why I called this a post about feedback loops, not about tests.

Another (adversarial) agent is a feedback loop too, and one of the most valuable you have. Point a fresh one at the work and with the right prompt it will tell you things a pass or fail can’t: whether the code is more complex than it needs to be, and whether it follows the rules you set. It can even exercise the user-facing boundaries (API, GUI, etc.) and check that the code really does what you want. None of that is grounded the way a test is, but it doesn’t need to be. A reviewer with no stake in the code earns its keep on exactly the fuzzy questions no assertion will ever answer.

Make the tests count

With coverage basically free, the scarce thing is tests that mean something. A suite of ten thousand tests is worth nothing to you or to the agents if a green run doesn’t actually tell you the code works, and there’s a specific way agents make a green run worthless.

Ask agents to test something they can’t actually reach, and they’ll often write a test anyway, one that passes without checking the thing you care about. The classic version is a regression test asserting that some offending line, a bad handler say, no longer appears in the source. It proves the text is gone, not that the behavior went with it. Agents aren’t being lazy or sneaky. They want to hand you a passing test, and when they can’t test the real behavior, they test the next best thing. It’s not their fault the next best thing is virtually worthless.
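Here is a rough sketch of the difference, in Python. The setup is invented for illustration: a hypothetical parse_quantity whose original bug was a bare `except:` that swallowed bad input and returned 0. In the "still broken" version the handler has been reworded, so the offending text is gone while the behavior is unchanged. The hollow check only looks at the text. The behavioral check calls the function.

```python
# Hypothetical sources for the same function, kept as strings so that both
# kinds of check can be run against both versions.
FIXED_SOURCE = '''
def parse_quantity(raw):
    return int(raw)
'''

STILL_BROKEN_SOURCE = '''
def parse_quantity(raw):
    try:
        return int(raw)
    except ValueError:
        return 0
'''


def load(source: str):
    """Execute the source and return its parse_quantity function."""
    namespace = {}
    exec(source, namespace)
    return namespace["parse_quantity"]


def hollow_check(source: str) -> bool:
    """Passes whenever the offending text is absent from the source."""
    return "except:" not in source


def behavioral_check(source: str) -> bool:
    """Passes only if bad input is actually rejected."""
    parse_quantity = load(source)
    try:
        parse_quantity("abc")
    except ValueError:
        return True
    return False


for name, source in [("fixed", FIXED_SOURCE), ("still broken", STILL_BROKEN_SOURCE)]:
    print(name, "| hollow:", hollow_check(source), "| behavioral:", behavioral_check(source))
```

The hollow check is green for both versions. Only the behavioral check notices that the second one still turns garbage into a zero.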

This is mostly an affordance problem. Agents reach for a fake test when they have no honest way to write a real one, usually because the scope they need isn’t there: asked for a regression test on a UI behavior with no end-to-end harness, they fall back to poking at the source. Give them the harness and the fakes mostly disappear. Modern models know the difference between a real test and a hollow one, and given a genuine option (and a little bit of encouragement) they take it.

Flakes: the silent killer

Another way to wreck a feedback loop is to let it lie at random. A flaky test, one that passes or fails depending on timing or ordering or the phase of the moon, is worse than no test at all, and agents are uniquely bad at dealing with them.

You hit a flaky test, swear at it, and remember it’s unreliable. Agents rerun it, watch it go green, and move on, so flakes never get fixed. They pile up, and every flake the suite collects teaches the agents the same lesson: a red result doesn’t necessarily mean anything. Once they’ve absorbed that, they start waving off real failures too. “Oh, that test failed? This suite is super flaky, I’m sure it works fine really.” The agents have learned to ignore the feedback loop, and the loop is no longer a loop at all.

That’s why flakes matter so much more here than they used to (not that they didn’t matter before; if you allow flaky tests in your codebase I hate you). The whole point of the loop is that the agents can trust it more than they trust themselves, and a flake takes that away. So treat a flake as a real bug. No retries, no quarantine, no nudging the timeout up until it goes quiet. Find the nondeterminism and kill it. If you can’t, have a serious think about your testing strategy.
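One common source of nondeterminism is the wall clock, so here is a minimal sketch of killing it instead of papering over it. Session and FakeClock are hypothetical names made up for this example. The idea is to pass the clock in as a dependency, so the test controls time exactly and never needs a sleep or a generous timeout.

```python
import time
from typing import Callable


class Session:
    """Hypothetical session that expires ttl_seconds after creation."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Production code uses the real clock; tests pass in a fake one.
        self._clock = clock
        self._expires_at = clock() + ttl_seconds

    def is_expired(self) -> bool:
        return self._clock() >= self._expires_at


class FakeClock:
    """Test double: time only moves when the test says so."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def test_session_expiry_is_deterministic() -> None:
    clock = FakeClock()
    session = Session(ttl_seconds=30, clock=clock)

    assert not session.is_expired()

    clock.advance(29)
    assert not session.is_expired()  # one second left

    clock.advance(1)
    assert session.is_expired()  # exactly at the deadline
```

A version of this test built on time.sleep would depend on how busy the machine happens to be. With the clock injected, it gives the same answer on every run, which is what lets the agents treat a red result as real.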

Where that leaves us

Strip it back and the whole argument is short. Agents can’t tell whether their own work is any good, so you give them something that can. That something is a fast, honest, tightly scoped feedback loop, and most of it is made of tests. Now that tests are cheap, you can afford as much of that loop as you like, and there’s no longer a good excuse to skimp. What’s left for you is making the loop effective, because the better the loop, the less you have to check their work yourself.

Maybe one day the models won’t need it. They’ll judge their own work as well as they produce it, and a lot of this goes back to being optional (or more likely they’ll just write good test suites without telling you). We’re not there yet. Until we are, agents are only as good as the loop you put them in, so build one that pulls its weight.

- omegastick