
An AI agent that has to make the build pass

Most AI code generation works on a charming little principle I’ll call generate-and-hope. The model writes the code, the model stops at the closing brace, and whether the thing actually compiles is left as an exercise for you. For a snippet you paste into an editor, fine. For a whole generated command, that’s just outsourcing the disappointment.

go-tool-base does something I’m rather happier with: the AI has to make the build pass before it’s allowed to claim it’s done.

Generate and hope

The usual shape of AI code generation is this. You ask for code, the model produces it, and the model’s job ends at the closing brace. Whether it compiles, whether the tests pass, whether the imports even resolve, none of that has been checked. The model produced something that looks right. You find out whether it is right when you build it.

For a snippet you paste into an editor, that’s perfectly fine. The compiler tells you in a second. But go-tool-base’s generator, driven by gtb generate command --script or --prompt, produces a whole command: the implementation, its tests, the lot. “Generate and hope” at that scale means handing the user a project that may or may not build, and quietly making them the one who finds out which.

Drafting is only step one

So the generator doesn’t stop at drafting. Writing the first version of the implementation and its tests is step one of two. Step two is an autonomous repair agent.

Once the draft is on the filesystem, a separate agent takes over. It’s an LLM running in a loop, but a loop aimed at one narrow, checkable job: make this project build and pass its tests. It isn’t asked to be creative. It’s asked to get to green.

A fixed set of tools, and no shell

The agent is not handed a shell. It’s given a fixed, defined set of tools and nothing else. Three of them let it explore and edit the project: list_dir, read_file, write_file. Four more let it check the project and fix what the checks turn up:

  • go_build runs the build and captures the compiler errors.
  • go_test runs the tests and captures the failures.
  • go_get resolves a missing dependency.
  • golangci_lint runs the project’s linter.

That restriction is the design, not a limitation of it. The agent can’t delete arbitrary files, can’t reach the network, can’t run anything that isn’t on the list. It has exactly what it needs to make code compile and nothing it would need to do damage. Its file writes are confined to the project directory by an explicit path check, so even write_file can’t go wandering up into /etc. A coding agent you’d actually let near a filesystem is one whose abilities are an allowlist, not a denylist. (I keep coming back to that principle through this series… safety as a boundary you draw, not a behaviour you hope for.)
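To make the allowlist idea concrete, here is a minimal sketch of what such a boundary could look like in Go. It is not go-tool-base’s actual code: the function names confine and authorize, the error value, and the /work/project path are all hypothetical, and only the seven tool names come from the description above. The sketch also assumes a purely lexical path check, so it does not account for symlinks.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// errOutsideRoot is a hypothetical error for paths that escape the project.
var errOutsideRoot = errors.New("path escapes project root")

// allowedTools is the closed set of tool names listed in the post.
// Anything not in this map is refused; there is no denylist to get wrong.
var allowedTools = map[string]bool{
	"list_dir": true, "read_file": true, "write_file": true,
	"go_build": true, "go_test": true, "go_get": true, "golangci_lint": true,
}

// authorize rejects any tool call whose name is not on the allowlist.
func authorize(tool string) error {
	if !allowedTools[tool] {
		return fmt.Errorf("tool %q is not on the allowlist", tool)
	}
	return nil
}

// confine resolves a tool-supplied path against root and rejects anything
// that would land outside it. The check is lexical only (assumption): a real
// implementation might also want to resolve symlinks before comparing.
func confine(root, requested string) (string, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}
	target := requested
	if !filepath.IsAbs(target) {
		// Relative paths are interpreted relative to the project root.
		target = filepath.Join(absRoot, target)
	}
	target = filepath.Clean(target)

	rel, err := filepath.Rel(absRoot, target)
	if err != nil {
		return "", err
	}
	// A relative path that starts by climbing out of root is an escape.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errOutsideRoot
	}
	return target, nil
}

func main() {
	root := "/work/project" // hypothetical project directory
	for _, p := range []string{"cmd/hello/main.go", "../../etc/passwd", "/etc/passwd"} {
		_, err := confine(root, p)
		fmt.Printf("%-20s allowed=%v\n", p, err == nil)
	}
	fmt.Println("shell allowed:", authorize("shell") == nil)
}
```

The shape is the point: both checks answer “is this on the list?” rather than “is this one of the known-bad things?”, so anything nobody thought of is refused by default.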

The loop

The repair loop is a ReAct loop, the same reason-act-observe shape as the tool-calling loop, only this time pointed at a goal:

  1. The draft is on disk.
  2. Verify: run go_build and go_test.
  3. If verification failed, read the error output: the compiler error or the failing test.
  4. Reason about the cause: an undefined variable, a missing import, a wrong signature.
  5. Act: call write_file to patch the code, or go_get to add the dependency.
  6. Loop. Steps two to five repeat until the project is green, or the agent hits its step limit, which defaults to 15.

What makes this work is treating the error output as feedback rather than as a failure to log and walk away from. A compiler error is the single most useful sentence you can hand a model that’s trying to fix code. It says what’s wrong, and usually where. The loop feeds it straight back in, and the model fixes against it.
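The control flow of that loop is small enough to sketch. What follows is an illustration under stated assumptions, not the generator’s implementation: verifyFunc stands in for running go_build and go_test, repairFunc stands in for one model turn that reads the error and calls write_file or go_get, and counting one verify-then-repair pass as a “step” is my simplification. Only the default limit of 15 comes from the description above.

```go
package main

import "fmt"

// verifyFunc stands in for go_build plus go_test (hypothetical type).
// It returns the captured error output, or "" when everything passes.
type verifyFunc func() string

// repairFunc stands in for one model turn (hypothetical type): it receives
// the error output and applies a patch, e.g. via write_file or go_get.
type repairFunc func(errorOutput string) error

// defaultMaxSteps is the step limit quoted in the post.
const defaultMaxSteps = 15

// repairLoop alternates verify and repair until verification passes or the
// step budget is spent. It reports whether the project ended up green and
// how many repair steps were used.
func repairLoop(verify verifyFunc, repair repairFunc, maxSteps int) (bool, int, error) {
	for step := 0; step < maxSteps; step++ {
		output := verify()
		if output == "" {
			return true, step, nil // green: "done" means the checks passed
		}
		// The error text is the feedback: hand it straight to the repairer.
		if err := repair(output); err != nil {
			return false, step, err
		}
	}
	// Budget spent: one last check decides what gets reported.
	return verify() == "", maxSteps, nil
}

func main() {
	// Simulated project with two problems that get fixed one at a time.
	pending := []string{
		"main.go:12:2: undefined: cfg",
		`main.go:4:2: "os" imported and not used`,
	}
	verify := func() string {
		if len(pending) == 0 {
			return ""
		}
		return pending[0]
	}
	repair := func(errorOutput string) error {
		fmt.Println("fixing:", errorOutput)
		pending = pending[1:]
		return nil
	}

	green, steps, err := repairLoop(verify, repair, defaultMaxSteps)
	fmt.Println("green:", green, "steps:", steps, "err:", err)
}
```

Note that success is only ever returned from a verify call, never from the repairer saying it is finished. That is the whole difference from generate-and-hope.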

Verification changes what “done” means

Here’s the real shift, and the agent’s own documentation puts it well: the agent “doesn’t just say it fixed a bug; it uses a Test tool to verify the fix before reporting success.”

A generate-and-hope model reports success when it finishes writing. It has no idea whether the code works, and it isn’t really claiming otherwise. “Done” means “I produced text”. The repair agent reports success when go_build and go_test actually pass. “Done” means “the build is green”. Those are two completely different claims, and only the second is worth anything to the person who asked for the command.

That’s the line between an AI that’s a creative writer and an AI that’s a collaborator you can hand a task to. And when the agent can’t reach green, when it spends its whole step budget and the project is still broken, the generator fails safely: it leaves the best-attempt code in place, commented out so the project still compiles, and tells the user what to finish by hand. There’s also an --agentless flag for anyone who’d rather have a plain single-shot retry than the multi-step agent. The default, though, is the agent, because the default should be code that’s been checked.

Where this leaves us

Most AI code generation generates and hopes: the model writes code and the user discovers whether it works. For a whole generated command, that pushes a may-or-may-not-build project onto the user.

go-tool-base’s generator drafts the command and then hands it to an autonomous repair agent. The agent has a fixed set of tools (explore and edit the project, build it, test it, lint it, fetch dependencies) and no shell at all, with file writes confined to the project directory. It runs a ReAct loop, reading each error and patching against it, until the build is green or it exhausts its steps. The point is what “done” comes to mean: not “the model finished writing”, but “the build passes”. Only one of those is a claim worth trusting.
