AI on PHP Boy Scout

Supporting a provider, or actually using it

Sat, 02 May 2026 00:00:00 +0000

If your CLI tool talks to an AI model, you don’t want to hard-wire one vendor. So you reach for a single client interface over several providers, which is the right call. The trap is the next step: build that interface on only what every provider has in common, and you quietly throw away the very features that made you want a particular provider in the first place. rust-tool-base’s rtb-ai refuses to make that trade.

The pull toward one interface

If your CLI tool talks to an AI model, hard-wiring one vendor is a poor bet. One user has an Anthropic key, another an OpenAI key. Someone’s on Gemini. Someone runs Ollama locally because their data can’t leave the building. Someone points at an OpenAI-compatible endpoint from a provider you’ve never heard of. You don’t want a separate code path for each, so you want one AiClient that all of them slot behind.

rtb-ai gets that unification from the genai crate, which already speaks to Anthropic, OpenAI, Gemini, Ollama and OpenAI-compatible endpoints. One interface, five providers, the tool author picks one in config. The Go sibling makes the same bet: go-tool-base’s chat package also unifies several providers, behind an interface deliberately kept to four methods. So far this is the obvious design, and if it were the whole design there’d be nothing to write about.

What “unified” quietly costs you

Here’s the catch in any unified interface. It can only expose what every provider behind it has in common.

The common subset is plain chat. Messages go in, text comes out, optionally streamed token by token. That’s real and it’s useful and every provider does it. But the common subset is also the floor, and the features that make a particular provider worth choosing are almost never on the floor. They’re the things only that provider does.

Anthropic is the sharp example, because it has three features that matter and not one of them is common-subset.

Prompt caching. You can mark the stable parts of a request, the system prompt and the tool list, as cacheable. The provider keeps them warm, and on the next turn you aren’t billed to re-send and re-process text that didn’t change. On a long agent loop, where the same large system prompt rides along on every single turn, that’s a substantial saving in both cost and latency.

Extended thinking. The model works through a hard problem in a visible, budgeted reasoning pass before it commits to an answer, and you can see that reasoning.

Citations. Structured references back to source material in the response.

A client built strictly on the common subset can’t express any of those. It has no field for them, because four of the five providers wouldn’t know what to do with the field. So a purely lowest-common-denominator client would “support” Anthropic and then use it badly, leaving its best features unreachable. Support as a checkbox, not as the point.

The escape hatch

rtb-ai’s answer is to not choose. It runs two implementations under one interface.

For OpenAI, Gemini, Ollama and OpenAI-compatible endpoints, calls route through genai, the unified path. For Anthropic, every method drops to a direct reqwest implementation straight against the Messages API. Same AiClient on the surface, a different implementation underneath, selected by which provider the config names.

And the request type has deliberate room for the difference:

pub struct ChatRequest {
    pub system: Option<String>,
    pub messages: Vec<Message>,
    pub temperature: Option<f32>,
    pub max_tokens: Option<u32>,
    /// Anthropic-only: enables prompt caching at every stable point.
    /// Ignored on non-Anthropic providers.
    pub cache_control: bool,
    /// Anthropic-only: extended-thinking budget. `None` disables.
    /// Ignored on non-Anthropic providers.
    pub thinking: Option<ThinkingMode>,
}

Set cache_control and the Anthropic-direct path inserts cache breakpoints at the three stable points: the system prompt, the tool list, and the first message. Set thinking and it adds the thinking block, and streaming surfaces a separate ThinkingToken event so you can show the reasoning apart from the answer. On a non-Anthropic provider, both fields are simply ignored. The interface carries them; only the implementation that understands them acts on them.
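The shape is easier to see in miniature. What follows is a sketch of the pattern in Go (the sibling framework’s language), not rtb-ai’s Rust, and every name in it (Client, Plan, NewClient, ThinkingBudget) is made up for illustration. It only shows the dispatch: one interface, two implementations chosen by provider name, and request fields that one implementation acts on and the other ignores.

```go
package main

import "fmt"

// ChatRequest carries the common fields plus two opt-in, provider-specific
// ones. The field names are illustrative, not rtb-ai's.
type ChatRequest struct {
	System         string
	Messages       []string
	CacheControl   bool // Anthropic-only; ignored elsewhere
	ThinkingBudget int  // Anthropic-only; 0 disables
}

// Client is the single interface the caller sees. Plan returns a description
// of what would be sent, so the sketch needs no network.
type Client interface {
	Plan(req ChatRequest) string
}

// unifiedClient stands in for the common-subset path.
type unifiedClient struct{ provider string }

func (c unifiedClient) Plan(req ChatRequest) string {
	// The provider-specific fields are deliberately not read here.
	return fmt.Sprintf("%s: plain chat, %d message(s)", c.provider, len(req.Messages))
}

// anthropicDirect stands in for the provider-specific path.
type anthropicDirect struct{}

func (anthropicDirect) Plan(req ChatRequest) string {
	out := fmt.Sprintf("anthropic: chat, %d message(s)", len(req.Messages))
	if req.CacheControl {
		out += ", cache breakpoints"
	}
	if req.ThinkingBudget > 0 {
		out += fmt.Sprintf(", thinking budget %d", req.ThinkingBudget)
	}
	return out
}

// NewClient picks the implementation from the configured provider name.
func NewClient(provider string) Client {
	if provider == "anthropic" {
		return anthropicDirect{}
	}
	return unifiedClient{provider: provider}
}

func main() {
	req := ChatRequest{Messages: []string{"hello"}, CacheControl: true, ThinkingBudget: 1024}
	for _, p := range []string{"anthropic", "ollama"} {
		fmt.Println(NewClient(p).Plan(req))
	}
}
```

The caller builds the same request both times. Only the implementation that understands the extra fields does anything with them.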

A hatch, not a leak

It’s worth being precise about why this isn’t the thing it superficially resembles, which is a leaky abstraction.

A leaky abstraction is one where implementation details bleed through that you didn’t intend and can’t reason about. The abstraction quietly fails to abstract, and you’re left guessing which provider you’re really talking to.

This is the opposite of that. The two Anthropic-only fields aren’t a leak. They’re named, documented as Anthropic-only, inert everywhere else, and right there in the public type for anyone to see. The interface is uniform for the common case and deliberately, visibly non-uniform at exactly the points where uniformity would have cost you the good features. You opt into provider-specifics by setting a field. You stay fully portable by leaving it at its default. Nothing bleeds; you decide.

The same design line explains what does stay in the unified path. Structured output, chat_structured::<T>, sends a JSON Schema derived from your Rust type with the request and validates the reply against it before handing you a typed T. That’s a portability win that costs nothing across providers, so it belongs in the common interface. The split isn’t “Anthropic versus the rest”. It’s “features that are free to unify go in the unified path; features that aren’t get a designed door”. Prompt caching and extended thinking get the door, because flattening them away would be the expensive kind of convenient.

To sum up

A CLI tool that integrates AI wants one client over several providers, and a unified interface can only expose what those providers share. The shared floor is plain chat, and the features worth choosing a provider for, like Anthropic’s prompt caching, extended thinking and citations, are almost never on the floor.

rtb-ai keeps both. genai provides the unified path across five providers; an Anthropic-direct reqwest path drops below the abstraction for the features genai can’t reach, and ChatRequest carries the Anthropic-only fields openly, ignored elsewhere. Uniform where uniformity is free, with a designed escape hatch where it isn’t. That’s the difference between supporting a provider and actually using it.

Testing code that calls an LLM: yes, you actually can

Wed, 08 Apr 2026 00:00:00 +0000

“You can’t test code that calls an AI.” I’ve heard it said with great confidence, and it’s half right, which is the most dangerous kind of right. You genuinely can’t assert on what a non-deterministic model says. But the model isn’t your code, and the bits sitting either side of it most certainly are.

“You can’t test AI code”

It’s a fair worry. Your command calls an LLM. The LLM returns something slightly different every run. A test that asserts response == "..." is broken before you’ve finished typing it. So the conclusion arrives quickly: the AI path can’t be tested, leave it uncovered.

Which is a shame, because the AI call is usually the riskiest line in the whole command.

The conclusion is also wrong. It mistakes “I can’t test the model” for “I can’t test my code”. The model is not your code. Your code is the two pieces sitting on either side of it.

Your code is a prompt and a handler

Strip the command down to what it actually does:

1. It builds a prompt. It assembles a system prompt, the user’s input, perhaps some context, and sends it.
2. The model does something. This is not your code.
3. It takes the response and does something with it. It parses it, branches on it, prints it, stores it.

Steps one and three are entirely yours, and entirely deterministic. The same inputs build the same prompt and handle the same response the same way, every single time. That’s testable. Step two is the only part that isn’t, and step two was never yours to test in the first place.

So the job is to pin step two to a known value, and then test one and three properly.

Test the prompt: snapshot it

Step one produces a prompt, and a prompt is just a string, which means you can pin it.

Both frameworks lean on snapshot testing here. go-tool-base uses a golden-file approach: the prompt your code generates is recorded to a file, and the test re-generates it and compares against that file. rust-tool-base does the same with insta, snapshotting the request body the client would send.

The reason this matters is that the prompt is load-bearing and quietly easy to break. You refactor how context gets assembled. Without noticing, you’ve changed the wording, or the ordering, or dropped a line the model was leaning on. Nothing fails to compile. The behaviour just drifts, silently.

A snapshot test catches exactly that. It fails, it shows you the diff between the old prompt and the new one, and it makes you stop and make a decision. Was this change intended? If yes, you accept the new snapshot and move on. If no, you’ve just caught a bug before it shipped. Either way the prompt never changes by accident, which for AI code is most of the battle.
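A golden-file check is small enough to sketch in full. This is a minimal version using only the standard library, and it assumes nothing about how go-tool-base actually implements its own: BuildPrompt, CheckGolden and the update switch are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// BuildPrompt is a stand-in for whatever assembles your real prompt.
func BuildPrompt(system, context, input string) string {
	return system + "\n\nContext:\n" + context + "\n\nUser:\n" + input + "\n"
}

// CheckGolden compares got with the recorded snapshot at path. With update
// set, it records got as the new snapshot instead of comparing.
func CheckGolden(path, got string, update bool) error {
	if update {
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return err
		}
		return os.WriteFile(path, []byte(got), 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read golden file: %w", err)
	}
	if string(want) != got {
		return fmt.Errorf("prompt changed\n--- recorded\n%s\n--- generated\n%s", want, got)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "golden")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	golden := filepath.Join(dir, "testdata", "analyse.golden")

	prompt := BuildPrompt("You are a log analyst.", "service=api", "Why did it crash?")
	fmt.Println("record:", CheckGolden(golden, prompt, true))
	fmt.Println("unchanged:", CheckGolden(golden, prompt, false))

	// A refactor that silently drops the context line is caught as a diff.
	drifted := BuildPrompt("You are a log analyst.", "", "Why did it crash?")
	fmt.Println("drifted:", CheckGolden(golden, drifted, false))
}
```

In a real suite the update switch would usually be a test flag, so accepting a new snapshot is an explicit act rather than something that happens on its own.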

Test the handler: mock the response

Step three needs a response to handle, and in a unit test you don’t get that response from the real model. You supply it.

go-tool-base ships generated mocks for the ChatClient interface. A test builds a mock client, tells it “when Ask is called, return this canned value”, and runs the command against it:

mockClient := mock_chat.NewMockChatClient(t)
mockClient.EXPECT().
	Ask(mock.Anything, mock.Anything, mock.AnythingOfType("*main.Analysis")).
	RunAndReturn(func(_ context.Context, _ string, target any) error {
		*(target.(*Analysis)) = Analysis{Severity: "critical"}
		return nil
	})

Because the interface is only four methods, that mock is trivial to set up and complete by construction. rust-tool-base takes the same idea one layer down: HTTP-bound tests use wiremock, which stands up a fake server returning a canned response body. The client makes a real HTTP request; it just goes to a fake endpoint the test controls.

Either way, step two is now fixed to a value you chose, which makes step three deterministic. And that unlocks the tests that actually matter: given a malformed response, does the command fail gracefully? Given a rate-limit error, an empty answer, a field missing? Those are the cases a live model almost never hands you on demand, and a mock hands you every time, on the first run.
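Here is roughly what those ugly-case tests look like with a hand-written fake in place of the generated mock. It’s a sketch under assumptions: Asker is a one-method slice of the interface, and ExitCode, fakeAsker and ErrRateLimited are hypothetical names, not go-tool-base’s.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
)

// Asker is just enough of a chat client for this sketch.
type Asker interface {
	Ask(ctx context.Context, prompt string, target any) error
}

type Analysis struct {
	Severity string `json:"severity"`
}

// fakeAsker plays the model: it returns a canned error, or unmarshals a
// canned body into target.
type fakeAsker struct {
	body string
	err  error
}

func (f fakeAsker) Ask(_ context.Context, _ string, target any) error {
	if f.err != nil {
		return f.err
	}
	return json.Unmarshal([]byte(f.body), target)
}

var ErrRateLimited = errors.New("rate limited")

// ExitCode is the handler under test: 0 for fine, 1 for critical, 2 when the
// analysis itself could not be trusted.
func ExitCode(ctx context.Context, c Asker, log string) (int, error) {
	var a Analysis
	if err := c.Ask(ctx, "Analyse this log file: "+log, &a); err != nil {
		return 2, fmt.Errorf("analysis failed: %w", err)
	}
	switch a.Severity {
	case "critical":
		return 1, nil
	case "":
		return 2, errors.New("analysis failed: response had no severity")
	default:
		return 0, nil
	}
}

// check runs the handler against one canned model behaviour.
func check(f fakeAsker) (int, error) {
	return ExitCode(context.Background(), f, "boom")
}

func main() {
	cases := []fakeAsker{
		{body: `{"severity":"critical"}`},
		{body: `{"severity":"low"}`},
		{body: `{}`},       // field missing
		{body: `not json`}, // malformed response
		{err: ErrRateLimited},
	}
	for _, c := range cases {
		code, err := check(c)
		fmt.Println(code, err)
	}
}
```

Every one of those cases runs on the first attempt, with no network, and three of the five are ones a live model would rarely produce when you needed it to.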

This is, incidentally, the same discipline as the test-mocking work elsewhere in the framework: the dependency is injected, so the test gets to decide what it does.

What you deliberately don’t test

One honest boundary. None of this tests whether the model gives good answers. That question is real, but it’s a different activity (evaluations, run as their own suite) and not something to mix into the unit tests.

The unit suite’s job is your code: that it builds a sound prompt, and that it handles every shape of response correctly, including the ugly ones. Keep that well away from “is the model clever today”. A unit test that depends on the model being clever is a unit test that fails when the weather changes, and a flaky test just teaches people to ignore the whole suite.

What it comes down to

Code that calls an LLM is testable; the model is not, and those are different statements. Your code is a prompt builder and a response handler, both deterministic, with the model sat in between.

go-tool-base and rust-tool-base converge on the same approach. Snapshot the prompt, with golden files or insta, so a refactor can’t change what you send without a test noticing. Mock the response, with generated ChatClient mocks or a wiremock server, so tests run with no network and you can feed in the malformed and error cases a real model won’t reliably produce. Leave “are the answers any good” to a separate evaluation suite. Test the two halves you own, and the non-determinism in the middle stops being an excuse to leave the riskiest line uncovered.

The AI provider that isn't an API

Mon, 06 Apr 2026 00:00:00 +0000

go-tool-base’s chat package puts five AI providers behind one interface. Four of them are exactly what you’d guess: HTTP calls to OpenAI, Claude, Gemini, and anything OpenAI-compatible. The fifth one isn’t an API at all. It shells out to a binary.

That sounds like a slightly mad thing to want, right up until you’ve worked somewhere the network says no.

The fifth provider shells out

The chat package speaks to five providers through one ChatClient interface. Four of them are what you’d expect: HTTP requests to OpenAI, to Claude, to Gemini, to any OpenAI-compatible endpoint. The tool author picks one in config, and the rest of the code never knows the difference.

The fifth, ProviderClaudeLocal, is different in kind. It doesn’t make an HTTP request at all. It shells out. It runs the claude CLI binary as a child process, passes the prompt in, and reads the answer back from the binary’s output.

That sounds like an odd thing to want until you’ve been stuck in the environment it was built for.

Why you’d want that

Picture a corporate network with its egress locked right down. Outbound HTTPS to api.anthropic.com is blocked by policy. A tool built on go-tool-base that uses AI would simply fall over there. It tries to reach the API, there’s no route, and that’s the end of the feature.

But the developer at that machine has the claude CLI installed, and has run claude login. That binary is permitted. It’s an approved, managed tool, and it has its own sanctioned path out. The direct API call is blocked; the claude command is not.

ProviderClaudeLocal is what bridges those two facts. If your tool’s AI calls go through that already-blessed binary instead of straight at the API, they work, in an environment where the direct call cannot. That’s the whole reason the provider exists. It isn’t faster (a real API call has lower latency) and it isn’t more capable. It’s for the place where the API call simply isn’t an option, and “isn’t an option” is a surprisingly common place to find yourself inside a large organisation.

What it costs, honestly

It’s worth being straight about the trade, because ProviderClaudeLocal is the reduced-capability provider.

It doesn’t do tool calling. It doesn’t do parallel tools. It doesn’t stream. Those need a live, structured connection to the model’s API, and a subprocess that runs once and prints an answer is not that. What it does support is plain chat and structured output, the latter through the binary’s own --json-schema flag.

So the honest positioning, and the package’s documentation says exactly this, is: prefer the API providers when you can reach them, because they’re lower latency and feature-complete. Reach for ProviderClaudeLocal when API access is restricted. You accept the narrower capability set as the price of working at all. For a tool whose AI feature is “answer a question” or “return a structured analysis”, that price is often nothing you’d even notice. For one built on an agentic tool-calling loop, it’s a real limitation, and you’d know to expect it.

How it stays behind the same interface

Here’s the part that makes it pleasant rather than a special case to maintain. Despite being a subprocess and not an API, ProviderClaudeLocal is still a ChatClient. Your feature code calls Chat and Ask exactly the way it would for any other provider.

Everything that makes a subprocess provider awkward stays inside the provider. Spawning the binary, feeding it the prompt, parsing its output, capturing stderr and surfacing it when the binary exits non-zero, and threading multi-turn continuity through session identifiers passed back on the next call with --resume: all of that is the provider’s problem, and all of it sits behind the interface. The code in your tool that uses AI doesn’t know, and has no way to find out, that this particular provider is a child process rather than an HTTPS call.
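To make that concrete, here is a stripped-down sketch of a subprocess-backed client. It is not ProviderClaudeLocal’s source. Chatter, subprocessClient and the runner type are hypothetical, the prompt going in on stdin is my assumption, and the binary’s real flags are left out rather than guessed at. The runner is injectable so the sketch can be exercised without the binary installed.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// Chatter is a one-method stand-in for the chat interface.
type Chatter interface {
	Chat(ctx context.Context, prompt string) (string, error)
}

// runner executes a binary and reports stdout, stderr and the exit error.
type runner func(ctx context.Context, bin string, args []string, stdin string) (string, string, error)

// execRunner is the real thing: a child process with both streams captured.
func execRunner(ctx context.Context, bin string, args []string, stdin string) (string, string, error) {
	cmd := exec.CommandContext(ctx, bin, args...)
	cmd.Stdin = strings.NewReader(stdin)
	var out, errBuf bytes.Buffer
	cmd.Stdout, cmd.Stderr = &out, &errBuf
	err := cmd.Run()
	return out.String(), errBuf.String(), err
}

// cannedRunner returns fixed output, for demonstration and tests.
func cannedRunner(stdout, stderr string, err error) runner {
	return func(context.Context, string, []string, string) (string, string, error) {
		return stdout, stderr, err
	}
}

var errExit = errors.New("exit status 1")

// subprocessClient satisfies Chatter by shelling out. In real use, run would
// be execRunner and args would hold the binary's actual flags.
type subprocessClient struct {
	bin  string
	args []string
	run  runner
}

func (c subprocessClient) Chat(ctx context.Context, prompt string) (string, error) {
	stdout, stderr, err := c.run(ctx, c.bin, c.args, prompt)
	if err != nil {
		// Surface stderr so a non-zero exit is diagnosable by the caller.
		return "", fmt.Errorf("%s failed: %w: %s", c.bin, err, strings.TrimSpace(stderr))
	}
	return strings.TrimSpace(stdout), nil
}

// ask is what feature code does: it sees a Chatter and nothing else.
func ask(c Chatter, prompt string) (string, error) {
	return c.Chat(context.Background(), prompt)
}

func main() {
	ok := subprocessClient{bin: "claude", run: cannedRunner("An answer.\n", "", nil)}
	fmt.Println(ask(ok, "hello"))

	bad := subprocessClient{bin: "claude", run: cannedRunner("", "not logged in\n", errExit)}
	fmt.Println(ask(bad, "hello"))
}
```

The ask function is the point. It takes the interface, and nothing about it would change if the value behind it made an HTTPS call instead.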

That’s a unified interface genuinely earning its place. It’s easy to put a uniform face on four things that already work the same way underneath. The real test of the abstraction is whether something that works in a completely different way, a subprocess instead of a socket, can still slot in without the caller changing a line. Here it can. You swap one config value, and a tool that talked to an API now talks through a binary, and nothing downstream so much as blinks.

The bottom line

go-tool-base’s chat package puts five providers behind one ChatClient interface, and ProviderClaudeLocal is the one that isn’t an API. It runs the locally installed, pre-authenticated claude CLI as a subprocess.

It exists for the locked-down environment where outbound HTTPS to the AI API is blocked but the claude binary is allowed: there, AI features keep working where a direct call would fail. The trade is a narrower capability set (no tool calling, no streaming, plain chat and structured output only), so you prefer the API providers when you can reach them and fall back to this when you can’t. And because it’s still a ChatClient, all the subprocess machinery stays hidden, and your code uses it without knowing it’s there. That last part is the real test of an abstraction: a provider that works in an entirely different way still slots in unchanged.

AI conversations you can resume

Sat, 04 Apr 2026 00:00:00 +0000

An AI conversation is, fundamentally, its own history. The model’s next answer depends on everything said so far. And a CLI tool, by its very nature, forgets everything the moment it exits. Put those two facts together and you get the problem: run an AI command, exit, run it again, and you’re talking to someone who’s never met you.

A CLI forgets everything

A long-running service keeps its state in memory for as long as it runs. A CLI tool doesn’t get that luxury. It starts, does one thing, exits. The next invocation is a brand-new process with no memory of the last one.

For most commands that’s exactly right, and you wouldn’t want it any other way. But an AI conversation is a different kind of beast, because a conversation is its history. The model’s next answer depends on everything said so far. Run an AI command, exit, run it again, and you’ve started a fresh conversation with someone who’s never met you. For an interactive assistant, or any AI workflow that unfolds across several invocations, that’s plainly the wrong behaviour. The user expects to pick up where they left off.

Save and restore

The chat package handles this through a PersistentChatClient interface. Like streaming, it’s an optional capability discovered with a type assertion, sitting beside the four-method core rather than bloating it. A client that supports persistence also satisfies this interface:

if pc, ok := client.(chat.PersistentChatClient); ok {
	snapshot, err := pc.Save()
	// store the snapshot somewhere
}

A snapshot is a serialisable value that captures the conversation. You store it. Next run, you load it, Restore it onto a fresh client, re-register your tools, and call Chat again. “Where were we?” works, because the model is handed back the whole history.

A snapshot is opinionated about what it carries

The interesting part is what a snapshot does and doesn’t contain, because that’s a series of deliberate decisions.

It carries the messages, the system prompt, the model name, and tool metadata: the names, descriptions and parameter schemas of the tools that were registered.

It does not carry tool handlers. Handlers are code, not data; you can’t serialise a function meaningfully, so after a restore you re-register them with SetTools. The snapshot remembers that a tool called read_file existed and what its shape was; it doesn’t try to remember the Go function behind it.

And it does not carry API tokens. This is the one to dwell on. A snapshot is a file. A file gets synced, backed up, copied between machines, attached to a support ticket by a user trying to be helpful. A snapshot that carried the API key would be a credential leak the moment it left the laptop it was made on. So the snapshot never contains a token, at all. On restore, the client picks the credential up again the ordinary way, from the environment or the keychain. The conversation and the secret are kept in separate places on purpose, and only one of them is ever in the file.
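Those three decisions can be read straight off a type. This is a sketch of what such a snapshot could look like, not the package’s actual definition: Snapshot, Tool, Handler and Reattach are hypothetical names, and the JSON layout is assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// Handler is code, so it can't travel in a file.
type Handler func(args json.RawMessage) (string, error)

// Tool pairs serialisable metadata with a handler that is never serialised.
type Tool struct {
	Name        string          `json:"name"`
	Description string          `json:"description"`
	Parameters  json.RawMessage `json:"parameters"`
	Handler     Handler         `json:"-"`
}

// Snapshot has no credential field at all: there is nothing to leak.
type Snapshot struct {
	Model    string    `json:"model"`
	System   string    `json:"system"`
	Messages []Message `json:"messages"`
	Tools    []Tool    `json:"tools"`
}

// Reattach re-registers handlers by tool name after a restore.
func Reattach(s *Snapshot, handlers map[string]Handler) error {
	for i := range s.Tools {
		h, ok := handlers[s.Tools[i].Name]
		if !ok {
			return fmt.Errorf("no handler registered for tool %q", s.Tools[i].Name)
		}
		s.Tools[i].Handler = h
	}
	return nil
}

func demoHandler(json.RawMessage) (string, error) { return "file contents", nil }

func demoSnapshot() Snapshot {
	return Snapshot{
		Model:  "example-model",
		System: "You are a careful assistant.",
		Messages: []Message{
			{Role: "user", Content: "Where were we?"},
			{Role: "assistant", Content: "Reading main.go."},
		},
		Tools: []Tool{{
			Name:        "read_file",
			Description: "Read a file from the project",
			Parameters:  json.RawMessage(`{"type":"object"}`),
			Handler:     demoHandler,
		}},
	}
}

// roundTrip serialises and deserialises, as a save followed by a load would.
func roundTrip(s Snapshot) (Snapshot, string, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return Snapshot{}, "", err
	}
	var restored Snapshot
	if err := json.Unmarshal(raw, &restored); err != nil {
		return Snapshot{}, "", err
	}
	return restored, string(raw), nil
}

func main() {
	restored, raw, err := roundTrip(demoSnapshot())
	if err != nil {
		panic(err)
	}
	fmt.Println(raw)
	fmt.Println("handler survived the file:", restored.Tools[0].Handler != nil)
	fmt.Println("reattach:", Reattach(&restored, map[string]Handler{"read_file": demoHandler}))
}
```

The metadata makes the trip and the function doesn’t, which is why the restore step has to hand the handlers back in.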

Encrypted at rest, if you want it

The package ships a FileStore that writes snapshots as JSON files, with 0600 permissions in a 0700 directory, and it can encrypt them. Pass WithEncryption a 32-byte key and snapshots are written with AES-256-GCM.

That option exists because a conversation can hold sensitive content even when it holds no credential. The log a user pasted in for analysis, the source file they asked the model to review, the internal details tucked into their questions: none of that is an API key, and all of it might be something you’d rather not have sitting in plain JSON in a backup somewhere. Encryption at rest covers it.
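For anyone who hasn’t used it, AES-256-GCM in Go’s standard library is only a handful of lines. This sketch shows the general technique rather than FileStore’s internals: Seal and Open are hypothetical names, and prepending the nonce to the ciphertext is a common layout that I’m assuming here, not one the package documents.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD, insisting on a 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("key must be 32 bytes for AES-256")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts plaintext and returns nonce||ciphertext (assumed layout).
func Seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after it.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// Open reverses Seal. GCM authenticates, so tampering is an error, not garbage.
func Open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := Seal(key, []byte(`{"messages":["a pasted log line"]}`))
	if err != nil {
		panic(err)
	}
	plain, err := Open(key, sealed)
	fmt.Println(string(plain), err)

	sealed[len(sealed)-1] ^= 1 // flip one bit
	_, err = Open(key, sealed)
	fmt.Println("tampered file rejected:", err != nil)
}
```

The authenticated part matters as much as the secret part. A snapshot that has been altered on disk fails to open instead of restoring a conversation nobody had.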

The FileStore is also careful about the snapshot identifiers it’s handed. An ID has to be a canonical UUID, and the resolved file path is checked to lie inside the store directory, so a snapshot ID arriving from an untrusted source (a CLI flag, a request payload) can’t be bent into a path-traversal that reads or writes somewhere it shouldn’t. Persisting conversations adds a small filesystem surface, and the store treats it as exactly that.
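Both checks are cheap to write down. The following is a sketch of the idea, assuming a lowercase hyphenated form counts as “canonical” and a .json file per snapshot. SnapshotPath is a hypothetical name, and the real store may well do this differently.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// canonicalUUID accepts only the lowercase 8-4-4-4-12 hex form (assumed).
var canonicalUUID = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// SnapshotPath turns an untrusted ID into a file path inside storeDir, or
// refuses.
func SnapshotPath(storeDir, id string) (string, error) {
	// Check one: the ID must look exactly like a UUID, which already rules
	// out slashes and dots.
	if !canonicalUUID.MatchString(id) {
		return "", errors.New("snapshot id is not a canonical UUID")
	}
	root, err := filepath.Abs(storeDir)
	if err != nil {
		return "", err
	}
	p := filepath.Join(root, id+".json")

	// Check two: belt and braces, the resolved path must sit under the root.
	rel, err := filepath.Rel(root, p)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errors.New("snapshot path escapes the store directory")
	}
	return p, nil
}

func main() {
	for _, id := range []string{
		"123e4567-e89b-12d3-a456-426614174000",
		"../../etc/passwd",
		"123E4567-E89B-12D3-A456-426614174000",
	} {
		p, err := SnapshotPath("snapshots", id)
		fmt.Printf("%q -> %q, %v\n", id, p, err)
	}
}
```

Either check alone would stop the obvious attack. Having both means a later change to one of them doesn’t quietly reopen the hole.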

The short version

A CLI tool forgets everything between invocations, which is correct for most commands and wrong for an AI conversation, because a conversation is its history.

go-tool-base’s chat package lets you persist one. PersistentChatClient saves a snapshot you can store and restore later, picking the conversation back up where it ended. The snapshot is deliberate about its contents: messages, system prompt and tool metadata yes; tool handlers no, because they’re code you re-register; API tokens never, because a snapshot is a file and a file travels. The built-in FileStore can encrypt snapshots at rest with AES-256-GCM and validates snapshot IDs against path traversal. Resumable conversations, without the conversation file turning into a place secrets leak from.

An AI agent that has to make the build pass

Thu, 02 Apr 2026 00:00:00 +0000

Most AI code generation works on a charming little principle I’ll call generate-and-hope. The model writes the code, the model stops at the closing brace, and whether the thing actually compiles is left as an exercise for you. For a snippet you paste into an editor, fine. For a whole generated command, that’s just outsourcing the disappointment.

go-tool-base does something I’m rather happier with: the AI has to make the build pass before it’s allowed to claim it’s done.

Generate and hope

The usual shape of AI code generation is this. You ask for code, the model produces it, and the model’s job ends at the closing brace. Whether it compiles, whether the tests pass, whether the imports even resolve, none of that has been checked. The model produced something that looks right. You find out whether it is right when you build it.

For a snippet you paste into an editor, that’s perfectly fine. The compiler tells you in a second. But go-tool-base’s generator, driven by gtb generate command --script or --prompt, produces a whole command: the implementation, its tests, the lot. “Generate and hope” at that scale means handing the user a project that may or may not build, and quietly making them the one who finds out which.

Drafting is only step one

So the generator doesn’t stop at drafting. Writing the first version of the implementation and its tests is step one of two. Step two is an autonomous repair agent.

Once the draft is on the filesystem, a separate agent takes over. It’s an LLM running in a loop, but a loop aimed at one narrow, checkable job: make this project build and pass its tests. It isn’t asked to be creative. It’s asked to get to green.

A fixed set of tools, and no shell

The agent is not handed a shell. It’s given a fixed, defined set of tools and nothing else. Three of them let it explore and edit the project: list_dir, read_file, write_file. Four of them let it verify the project:

go_build runs the build and captures the compiler errors.
go_test runs the tests and captures the failures.
go_get resolves a missing dependency.
golangci_lint runs the project’s linter.

That restriction is the design, not a limitation of it. The agent can’t delete arbitrary files, can’t reach the network, can’t run anything that isn’t on the list. It has exactly what it needs to make code compile and nothing it would need to do damage. Its file writes are confined to the project directory by an explicit path check, so even write_file can’t go wandering up into /etc. A coding agent you’d actually let near a filesystem is one whose abilities are an allowlist, not a denylist. (I keep coming back to that principle through this series… safety as a boundary you draw, not a behaviour you hope for.)
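An allowlist is a very literal thing in code: a map, and a refusal for anything not in it. This is a sketch with only two of the tools, using names from the list above but implementations I’ve made up. It assumes Go 1.20 or later for filepath.IsLocal, which is a lexical check and doesn’t by itself defend against symlinks.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// tool is anything the agent may call, given the project root and arguments.
type tool func(root string, args map[string]string) (string, error)

// confine rejects absolute paths and anything that climbs out with "..".
func confine(root, rel string) (string, error) {
	if !filepath.IsLocal(rel) {
		return "", fmt.Errorf("path %q is outside the project", rel)
	}
	return filepath.Join(root, rel), nil
}

func readFile(root string, args map[string]string) (string, error) {
	full, err := confine(root, args["path"])
	if err != nil {
		return "", err
	}
	b, err := os.ReadFile(full)
	return string(b), err
}

func writeFile(root string, args map[string]string) (string, error) {
	full, err := confine(root, args["path"])
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(full, []byte(args["content"]), 0o644); err != nil {
		return "", err
	}
	return "wrote " + args["path"], nil
}

// allowlist is the agent's entire world. There is no "shell" entry to find.
var allowlist = map[string]tool{
	"read_file":  readFile,
	"write_file": writeFile,
}

// Dispatch runs a tool the model asked for, if and only if it is listed.
func Dispatch(root, name string, args map[string]string) (string, error) {
	t, ok := allowlist[name]
	if !ok {
		return "", fmt.Errorf("tool %q is not on the allowlist", name)
	}
	return t(root, args)
}

func main() {
	root, err := os.MkdirTemp("", "project")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	fmt.Println(Dispatch(root, "write_file", map[string]string{"path": "cmd/hello.go", "content": "package cmd\n"}))
	fmt.Println(Dispatch(root, "write_file", map[string]string{"path": "../../etc/passwd", "content": "x"}))
	fmt.Println(Dispatch(root, "shell", map[string]string{"cmd": "rm -rf /"}))
}
```

Whatever the model asks for, the set of things that can happen is the set of keys in that map.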

The loop

The repair loop is a ReAct loop, the same reason-act-observe shape as the tool-calling loop, only this time pointed at a goal:

1. The draft is on disk.
2. Verify: run go_build and go_test.
3. If verification failed, read the error logs, the compiler error or the failing test.
4. Reason about the cause: an undefined variable, a missing import, a wrong signature.
5. Act: call write_file to patch the code, or go_get to add the dependency.
6. Loop. Steps two to five repeat until the project is green, or the agent hits its step limit, which defaults to 15.

What makes this work is treating the error output as feedback rather than as a failure to log and walk away from. A compiler error is the single most useful sentence you can hand a model that’s trying to fix code. It says what’s wrong, and usually where. The loop feeds it straight back in, and the model fixes against it.
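Stripped of the model, the control flow is a bounded loop that stops on green. Here’s a toy sketch of it. Verifier, Fixer and Repair are hypothetical names, and the “project” is a queue of error messages where each fix clears one, which is rather kinder than real life.

```go
package main

import "fmt"

// Verifier runs the checks and returns the error output, or "" when green.
type Verifier func() string

// Fixer is handed the error output and acts on it. In the real loop this is
// where the model reasons and calls its tools.
type Fixer func(errorOutput string)

// Repair verifies, feeds each failure back to fix, and gives up after
// maxSteps fixes. It reports whether the project ended green and how many
// fix steps were spent.
func Repair(verify Verifier, fix Fixer, maxSteps int) (green bool, steps int) {
	for steps = 0; steps < maxSteps; steps++ {
		out := verify()
		if out == "" {
			return true, steps
		}
		fix(out) // the error output is the feedback, not a dead end
	}
	// Budget spent: one last check decides what gets reported.
	return verify() == "", steps
}

// toyProject fails with each queued error in turn until all are fixed.
func toyProject(errs []string) (Verifier, Fixer) {
	pending := append([]string(nil), errs...)
	verify := func() string {
		if len(pending) == 0 {
			return ""
		}
		return pending[0]
	}
	fix := func(string) {
		if len(pending) > 0 {
			pending = pending[1:]
		}
	}
	return verify, fix
}

func main() {
	verify, fix := toyProject([]string{
		"undefined: logText",
		`missing import "context"`,
		"wrong signature for Run",
	})
	green, steps := Repair(verify, fix, 15)
	fmt.Println("green:", green, "steps:", steps)
}
```

Note what Repair returns. It never reports success because it finished trying, only because the last verification came back clean.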

Verification changes what “done” means

Here’s the real shift, and the agent’s own documentation puts it well: the agent “doesn’t just say it fixed a bug; it uses a Test tool to verify the fix before reporting success.”

A generate-and-hope model reports success when it finishes writing. It has no idea whether the code works, and it isn’t really claiming otherwise. “Done” means “I produced text”. The repair agent reports success when go_build and go_test actually pass. “Done” means “the build is green”. Those are two completely different claims, and only the second is worth anything to the person who asked for the command.

That’s the line between an AI that’s a creative writer and an AI that’s a collaborator you can hand a task to. And when the agent can’t reach green, when it spends its whole step budget and the project is still broken, the generator fails safely: it leaves the best-attempt code in place, commented out so the project still compiles, and tells the user what to finish by hand. There’s also an --agentless flag for anyone who’d rather have a plain single-shot retry than the multi-step agent. The default, though, is the agent, because the default should be code that’s been checked.

Where this leaves us

Most AI code generation generates and hopes: the model writes code and the user discovers whether it works. For a whole generated command, that pushes a may-or-may-not-build project onto the user.

go-tool-base’s generator drafts the command and then hands it to an autonomous repair agent. The agent has a fixed set of tools (explore and edit the project, build it, test it, lint it, fetch dependencies) and no shell at all, with file writes confined to the project directory. It runs a ReAct loop, reading each error and patching against it, until the build is green or it exhausts its steps. The point is what “done” comes to mean: not “the model finished writing”, but “the build passes”. Only one of those is a claim worth trusting.

Stop regex-ing the LLM's prose

Tue, 31 Mar 2026 00:00:00 +0000

Ask an LLM a question and it hands you back prose. Lovely to read, miserable to program against. You wanted the one number buried in the middle of it, and now you’re writing a regular expression to fish a word out of three well-written paragraphs that phrase themselves slightly differently every single time you run them.

There’s a much better way, and it’s the difference between forever interpreting an LLM and actually building on one.

The problem with a paragraph

You ask an LLM to analyse a log file and tell you the severity of what it found and a suggested fix. It comes back with three well-written paragraphs. Somewhere in there is the word “critical”, and somewhere is the fix.

Your program now has to extract those two facts from prose, and prose has no contract. The next run, the model phrases it differently. It leads with a caveat. It says “severe” where last time it said “critical”. It puts the fix first. Anything that worked by finding “critical” in the text is now quietly wrong, and you didn’t change a line. Parsing free text for structured facts is a game you lose slowly.

What you actually wanted was never a paragraph. It was a value: a thing with a severity field and a fix field, that you can branch on and store and pass around like any other.

Ask for the struct, not the prose

go-tool-base’s chat package draws the line with two methods. Chat gives you text. Ask gives you a struct.

You define the Go type you want back:

type Analysis struct {
	Severity string `json:"severity"`
	Fix      string `json:"fix"`
}

var result Analysis
err := client.Ask(ctx, "Analyse this log file: "+logText, &result)

The framework generates a JSON Schema from that struct, sends it to the model as the required response format, and unmarshals the reply straight into result. You never lay a finger on the prose. You get result.Severity and result.Fix, typed, ready to use. If you want the model’s answer to drive a switch statement, this is the method that lets it.

The struct is the schema is the contract

The detail that makes this hold up over time: you don’t write the schema. The struct is the schema.

The framework derives the JSON Schema from your type. In go-tool-base that’s GenerateSchema[T](); in rust-tool-base the schema comes from your Rust type through schemars. (Yes, there’s a Rust sibling now. I’ll introduce it properly in a few weeks, but it keeps gatecrashing these posts because the two frameworks deliberately share ideas.) Either way there’s one definition, your type, and the schema is just a projection of it.

That matters, because otherwise two things have to agree. There’s the schema you tell the model to obey, and there’s the type you unmarshal the answer into. Hand-write the schema and those two can drift: add a field to the struct, forget to add it to the schema, and the model is never told to produce it, so it silently never appears. Deriving the schema from the type collapses the two into one. They can’t disagree, because there’s only one of them.
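To show why the two can’t drift, here is a deliberately tiny schema derivation using reflection. It is not GenerateSchema and handles nowhere near what a real generator does (no nesting, no slices, no optional fields). SchemaFor and jsonType are hypothetical names for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

type Analysis struct {
	Severity string `json:"severity"`
	Fix      string `json:"fix"`
}

// jsonType maps a Go kind to a JSON Schema type name (flat types only).
func jsonType(k reflect.Kind) string {
	switch k {
	case reflect.String:
		return "string"
	case reflect.Bool:
		return "boolean"
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
		return "integer"
	case reflect.Float32, reflect.Float64:
		return "number"
	default:
		return "object"
	}
}

// SchemaFor projects a flat struct into a JSON Schema object. Every exported
// field becomes a required property, named by its json tag.
func SchemaFor(v any) map[string]any {
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	props := map[string]any{}
	required := []string{}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if !f.IsExported() {
			continue
		}
		name := strings.Split(f.Tag.Get("json"), ",")[0]
		if name == "-" {
			continue
		}
		if name == "" {
			name = f.Name
		}
		props[name] = map[string]any{"type": jsonType(f.Type.Kind())}
		required = append(required, name)
	}
	return map[string]any{"type": "object", "properties": props, "required": required}
}

func main() {
	out, err := json.MarshalIndent(SchemaFor(Analysis{}), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Add a field to Analysis and it appears in the schema on the next run, with nobody having to remember to put it there.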

Both frameworks, with one extra step in Rust

go-tool-base does this with Ask and a ResponseSchema set on the client config. rust-tool-base does it with chat_structured::<T>, where T is any type that is deserialisable and implements JsonSchema.

rust-tool-base adds one step worth calling out. Before it deserialises the model’s reply into your T, it validates the raw response against the schema with a JSON Schema validator. That splits the failure into two distinct, named cases: the response didn’t match the schema, or it matched the schema but still wouldn’t deserialise. A model that returns subtly wrong JSON fails loudly and specifically, with an error that tells you which of those happened, instead of quietly handing you a zero-valued struct that you end up debugging an hour later.
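The two-stage failure is worth seeing side by side. This sketch is in Go, to match the other examples here, and it is not rust-tool-base’s code: the validate function is a hand-rolled stand-in for a real JSON Schema validator, and the error names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

var (
	ErrSchemaMismatch = errors.New("response does not match schema")
	ErrDeserialize    = errors.New("response matched schema but did not deserialise")
)

type Analysis struct {
	Severity string `json:"severity"`
	Score    int    `json:"score"`
}

// validate checks the raw reply against a schema that requires a string
// "severity" and a number "score". A real validator would be driven by the
// derived schema rather than written by hand.
func validate(raw []byte) error {
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return fmt.Errorf("%w: %v", ErrSchemaMismatch, err)
	}
	if _, ok := doc["severity"].(string); !ok {
		return fmt.Errorf("%w: severity must be a string", ErrSchemaMismatch)
	}
	if _, ok := doc["score"].(float64); !ok {
		return fmt.Errorf("%w: score must be a number", ErrSchemaMismatch)
	}
	return nil
}

// Parse validates first and deserialises second, naming which step failed.
func Parse(raw []byte) (Analysis, error) {
	if err := validate(raw); err != nil {
		return Analysis{}, err
	}
	var a Analysis
	if err := json.Unmarshal(raw, &a); err != nil {
		return Analysis{}, fmt.Errorf("%w: %v", ErrDeserialize, err)
	}
	return a, nil
}

// kind names the failure case for display.
func kind(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrSchemaMismatch):
		return "schema mismatch"
	case errors.Is(err, ErrDeserialize):
		return "deserialise"
	default:
		return "other"
	}
}

func main() {
	for _, raw := range []string{
		`{"severity":"critical","score":90}`,
		`{"score":90}`,                        // field missing
		`{"severity":"critical","score":1.5}`, // a number, but not an int
	} {
		_, err := Parse([]byte(raw))
		fmt.Printf("%s -> %s\n", raw, kind(err))
	}
}
```

The middle case is the one to notice. A bare json.Unmarshal would accept it and hand back an empty Severity, which is exactly the zero-valued struct you’d be debugging an hour later.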

When you’d reach for it

The line is simple, and it’s about who reads the answer.

If a human reads the answer, prose is right. Chat, free text, let the model write well. A summary, an explanation, an interactive reply: leave all of those as prose.

If a program consumes the answer, you want a value. Classification, extraction, a code review scored out of a hundred with a list of issues, a yes-or-no with reasons: anything where the next thing that happens is your code branching on the result. There, Ask and chat_structured turn the LLM from something you have to interpret into something that returns a value, and a typed value is a thing you can actually build on.

To sum up

An LLM returns prose by default, and prose has no contract, so a program that picks structured facts out of it breaks the moment the model rephrases.

Structured output asks for the value instead. You define a struct, the framework derives a JSON Schema from it, the model is constrained to that shape, and you get a typed result. go-tool-base’s Ask and rust-tool-base’s chat_structured both work this way, with the schema derived from your type so the schema and the type can’t drift; rust-tool-base additionally validates the response against the schema before deserialising. Use it whenever the answer feeds code rather than a human. Ask is one of the four methods that make up go-tool-base’s small chat interface, and it’s the one that makes an LLM safe to program against.

Letting the AI call your Go functions

Sun, 29 Mar 2026 00:00:00 +0000

An AI that can only produce text can describe your system. An AI that can call your Go functions can actually operate it. That gap, between describing and doing, is the difference between a chatbot and something genuinely useful, and crossing it comes down to one fiddly mechanism: tool-calling, and the loop that drives it.

Talking about the system versus operating it

Wire an AI provider into a CLI command and you get something that can talk. Ask it a question, get a paragraph back. Useful, up to a point.

But notice the ceiling. An AI that can only generate text can describe things. It can tell you what it would do. What it can’t do is look at the actual current state of your system, or take a real action, because it has no hands. It’s reasoning in a vacuum about a world it can’t reach out and touch.

The thing that gives it hands is tool-calling. You hand the AI a set of functions it’s allowed to call. Now, mid-conversation, it can decide it needs to read that file before it can answer, or run that query, or check that status, and actually go and do it, and then reason about the real result. The AI stops describing your system and starts operating it.

The loop is the hard part

Tool-calling has a shape, and the shape is a loop. The literature calls it ReAct: Reason, Act, Observe.

The AI reasons about the prompt and decides whether it needs a tool.
If it does, it acts, asking for a specific tool with specific arguments.
Your code runs the tool and feeds the result back. The AI observes that result.
Round again. Reason about the new information, maybe call another tool, maybe several. Keep going until the AI has what it needs and produces a final text answer with no more tool calls.

Conceptually simple. Tedious and error-prone to implement by hand every single time: parsing the model’s tool-call requests, dispatching to the right function, marshalling arguments in and results out, feeding observations back in the exact format the provider expects, knowing when to stop, and not looping forever if the model gets itself stuck.

That orchestration is pure plumbing, and it’s identical for every tool and every command. So you can probably guess what’s coming: go-tool-base’s chat package owns it. You don’t write the loop. You write the tools.

Defining a tool

A chat.Tool is four things: a name, a description, a parameter schema, and a handler. The description is what the AI reads to decide whether to use the tool, so it’s worth writing well. The schema describes the arguments, and you don’t hand-write it. You write a tagged Go struct and let it generate:

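// ReadFileParams holds the arguments the model supplies when it calls the tool.
type ReadFileParams struct {
	// The jsonschema_description tag becomes the field's description in the schema.
	Path string `json:"path" jsonschema_description:"Relative path to the file"`
}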

The struct is the contract. The framework derives the JSON Schema the AI is given straight from those tags, so the schema and the Go type the handler receives can’t drift apart, because they share a single source. The handler is then just an ordinary Go function that takes those parameters and returns a result.

You register your tools with SetTools, call Chat, and that’s the whole of your involvement. The framework runs the ReAct loop and Chat returns the AI’s final text answer once the loop settles.
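
Here's roughly what the four parts look like side by side. Treat it as a sketch under assumptions: the Tool field names, the Handler signature and the typed adapter are mine, not chat.Tool's real definition, and the schema is written out by hand only to keep the example short.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Tool mirrors the four parts named above. The field names are assumptions.
type Tool struct {
	Name        string
	Description string
	Schema      map[string]any
	Handler     func(args json.RawMessage) (string, error)
}

type ReadFileParams struct {
	Path string `json:"path" jsonschema_description:"Relative path to the file"`
}

// typed adapts a handler that takes a parameter struct into one that takes
// the raw JSON arguments the model sends.
func typed[P any](fn func(P) (string, error)) func(json.RawMessage) (string, error) {
	return func(raw json.RawMessage) (string, error) {
		var p P
		if err := json.Unmarshal(raw, &p); err != nil {
			return "", fmt.Errorf("bad arguments: %w", err)
		}
		return fn(p)
	}
}

func readFileTool() Tool {
	return Tool{
		Name:        "read_file",
		Description: "Read a file from the project and return its contents.",
		// Hand-written here for brevity; the framework generates this from the tags.
		Schema: map[string]any{
			"type": "object",
			"properties": map[string]any{
				"path": map[string]any{"type": "string", "description": "Relative path to the file"},
			},
		},
		Handler: typed(func(p ReadFileParams) (string, error) {
			return "would read " + p.Path, nil // a real handler would open the file
		}),
	}
}

func main() {
	out, err := readFileTool().Handler(json.RawMessage(`{"path":"README.md"}`))
	fmt.Println(out, err)
}
```

The handler body never sees JSON. It gets a ReadFileParams, which is the point of letting the struct be the contract.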

Two details that show it was built for real use

A couple of decisions in the loop tell you it’s meant for production, not a demo.

Tool errors don’t abort the conversation. When a handler returns an error, the framework doesn’t crash the loop. It hands the error back to the AI as a string, as just another observation. That’s deliberate, and it’s right. A real agent should be able to call a tool, watch it fail, and react: try different arguments, take a different route, or tell the user it couldn’t manage it. A loop that aborted on the first tool error would be far more brittle than the model driving it.

The loop is bounded. There’s a MaxSteps limit, default 20. An AI that gets confused could otherwise call tools forever, and a CLI command that never returns is a worse failure than a wrong answer. The cap guarantees the command terminates. The agent gets room to genuinely work a problem across many steps, but not infinite room to flail about in.
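
Both decisions fit in a few lines once you see the loop written down. This is a sketch, not the chat package's code: the model is a scripted fake, the step, model and handler types are invented for the example, and returning an error when the cap is hit is my assumption about what "bounded" should look like.

```go
package main

import (
	"errors"
	"fmt"
)

// step is what the (fake) model returns each turn: a tool request or a final answer.
type step struct {
	tool, arg string // tool == "" means the model is finished
	answer    string
}

// model stands in for a provider call; it sees the transcript so far.
type model func(transcript []string) step

type handler func(arg string) (string, error)

var errMaxSteps = errors.New("max steps reached")

// runLoop drives reason, act, observe until the model answers or the cap is hit.
func runLoop(m model, tools map[string]handler, prompt string, maxSteps int) (string, error) {
	transcript := []string{"user: " + prompt}
	for i := 0; i < maxSteps; i++ {
		s := m(transcript) // reason
		if s.tool == "" {
			return s.answer, nil
		}
		var obs string
		if h, ok := tools[s.tool]; !ok { // act
			obs = "error: unknown tool " + s.tool
		} else if out, err := h(s.arg); err != nil {
			obs = "error: " + err.Error() // a failure is just another observation
		} else {
			obs = out
		}
		transcript = append(transcript, "tool "+s.tool+": "+obs) // observe
	}
	return "", errMaxSteps
}

func demoTools() map[string]handler {
	return map[string]handler{
		"read_file": func(path string) (string, error) {
			if path != "README.md" {
				return "", fmt.Errorf("%s: no such file", path)
			}
			return "# mytool", nil
		},
	}
}

// scripted tries a wrong path first, recovers, then answers.
func scripted(transcript []string) step {
	switch len(transcript) {
	case 1:
		return step{tool: "read_file", arg: "missing.txt"}
	case 2:
		return step{tool: "read_file", arg: "README.md"}
	default:
		return step{answer: "last observation: " + transcript[len(transcript)-1]}
	}
}

func main() {
	answer, err := runLoop(scripted, demoTools(), "what is this project?", 20)
	fmt.Println(answer, err)
}
```

The scripted model asks for a file that isn't there, sees the error come back as an observation, and tries again. A model that never stops asking runs into the cap instead of hanging the command.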

There’s also parallel tool execution: when the model asks for several tools in a single step (three independent file reads, say) the framework runs them concurrently rather than one after another, because there’s no reason to make the AI sit and wait out a sequence of things that don’t depend on each other.
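
A minimal sketch of that concurrent step, assuming the same kind of invented call and handler types as any toy loop would use (none of these names come from the chat package). Each goroutine writes to its own slot, so the observations come back in the order the model asked for them.

```go
package main

import (
	"fmt"
	"sync"
)

type call struct{ tool, arg string }

type handler func(arg string) (string, error)

// runParallel executes every call from one model step concurrently and
// returns the observations in request order.
func runParallel(calls []call, tools map[string]handler) []string {
	obs := make([]string, len(calls))
	var wg sync.WaitGroup
	for i, c := range calls {
		wg.Add(1)
		go func(i int, c call) {
			defer wg.Done()
			h, ok := tools[c.tool]
			if !ok {
				obs[i] = "error: unknown tool " + c.tool
				return
			}
			out, err := h(c.arg)
			if err != nil {
				obs[i] = "error: " + err.Error()
				return
			}
			obs[i] = out
		}(i, c)
	}
	wg.Wait()
	return obs
}

func demoTools() map[string]handler {
	return map[string]handler{
		"read_file": func(path string) (string, error) {
			if path == "" {
				return "", fmt.Errorf("empty path")
			}
			return "contents of " + path, nil
		},
	}
}

func main() {
	calls := []call{{"read_file", "a.go"}, {"read_file", "b.go"}, {"read_file", "c.go"}}
	for _, o := range runParallel(calls, demoTools()) {
		fmt.Println(o)
	}
}
```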

Boiling it down

A text-only AI can describe your system; an AI that can call your functions can operate it. Bridging that gap means tool-calling, and tool-calling means the ReAct loop (reason, act, observe, repeat) whose orchestration is fiddly, identical every time, and not a problem worth solving twice.

go-tool-base’s chat package runs the loop for you. You define chat.Tool values (name, description, a tagged parameter struct that generates its own schema, a handler), call SetTools and Chat, and get the final answer. Tool errors go back to the AI as observations so it can recover, and a MaxSteps cap guarantees the command always terminates. You write Go functions. The framework turns them into things an agent can reach for.

Nobody reads the manual

Sun, 29 Mar 2026 00:00:00 +0000

Let me describe the actual lifecycle of a user meeting your CLI tool, because it’s a bit humbling. They run it. It doesn’t quite do what they expected. They run it again with --help. They get a wall of monospaced flag descriptions, skim it, don’t find the thing they wanted, and either give up or go and ask a human who already knows.

Your documentation might be magnificent. It doesn’t matter, because the user never reached it.

The manual loses on location, not quality

That’s the lifecycle, and notice exactly where it breaks. The docs could answer their precise question in full and it still wouldn’t help, because they’re on a website, in another window, behind a search box, and the user is here, in the terminal, mid-task. The docs lost not on quality but on location. They simply weren’t where the work was.

go-tool-base’s answer starts with a decision about location: the documentation gets embedded into the binary itself. Your docs/ folder ships inside the tool, the same way its default config does. Wherever the tool is installed, the docs are right there alongside it, no network, no browser. That embedding is what makes everything else possible, and there are two things built on top of it.

A browser, in the terminal

The first is the docs command, and it’s not --help with extra steps. It launches a proper Terminal User Interface, built on Bubble Tea.

It has a sidebar, structured from the project’s own zensical.toml or mkdocs.yml, so the docs are a navigable tree rather than one flat scroll. Markdown renders with real formatting through Glamour (colour, tables, lists, headings) instead of collapsing into monospaced soup. There’s live search across every page, regex included.

Compared with man and --help, the difference isn’t a nicer coat of paint. man gives you linear scrolling and grep; this gives you a structured tree, rich rendering and real search. It’s the documentation experience a modern developer expects, except it followed the tool into the terminal instead of demanding the user leave it.

A documentation assistant that won’t make things up

The second thing built on the embedded docs is the one I find genuinely transformative: docs ask.

The user doesn’t navigate anything. They just ask:

mytool docs ask "how do I point this at a self-hosted server?"

and get a direct, specific answer. Under the hood, the framework collates the tool’s embedded markdown and hands it to the configured AI provider (Claude, OpenAI, Gemini, Claude Local, any OpenAI-compatible endpoint) as the context for the question.

Now, “an AI answers questions about my tool” should immediately make you nervous, and the correct thing to be nervous about is hallucination. An AI that confidently invents a flag that doesn’t exist, or describes behaviour the tool simply doesn’t have, is worse than no assistant at all, because the user trusts it.

This is where embedding the docs pays off a second time, and it’s why I keep stressing that the corpus is closed. The model is instructed to answer only from the tool’s actual documentation, and the context it’s handed is exactly that documentation and nothing else. It isn’t drawing on a vague memory of similar tools from its training data. It’s answering from this tool’s real, shipped, version-matched docs. The corpus is small, closed and authoritative, which is the combination that keeps the answers honest. “Zero hallucination by design” isn’t a slogan about the model. It’s a property of bounding what the model is allowed to look at, which is the same instinct I leaned on with the mcp command: the safety comes from the boundary you drew, not from trusting the AI to behave itself.
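
Bounding what the model sees is mostly a matter of how the prompt gets assembled. Here's a sketch of that step, assuming nothing about go-tool-base's internals: buildPrompt and the instruction wording are mine, and an in-memory fstest.MapFS stands in for the embedded docs folder (an embed.FS would slot into the same fs.FS parameter).

```go
package main

import (
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// buildPrompt collates every markdown file under docs into a single context
// block and tells the model to use nothing else.
func buildPrompt(docs fs.FS, question string) (string, error) {
	var b strings.Builder
	b.WriteString("Answer only from the documentation below. ")
	b.WriteString("If the answer is not there, say so.\n\n")
	err := fs.WalkDir(docs, ".", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || !strings.HasSuffix(path, ".md") {
			return nil // skip directories and non-markdown assets
		}
		data, err := fs.ReadFile(docs, path)
		if err != nil {
			return err
		}
		fmt.Fprintf(&b, "--- %s ---\n%s\n\n", path, data)
		return nil
	})
	if err != nil {
		return "", err
	}
	b.WriteString("Question: " + question)
	return b.String(), nil
}

// demoDocs is a stand-in for the docs folder shipped inside the binary.
func demoDocs() fs.FS {
	return fstest.MapFS{
		"docs/index.md":  {Data: []byte("# mytool")},
		"docs/server.md": {Data: []byte("Set server.url to use a self-hosted server.")},
		"docs/logo.png":  {Data: []byte{0x89}},
	}
}

func main() {
	prompt, err := buildPrompt(demoDocs(), "how do I point this at a self-hosted server?")
	if err != nil {
		panic(err)
	}
	fmt.Println(prompt)
}
```

Everything the model is given comes from that one filesystem, which is what makes the corpus closed in practice rather than in principle.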

There’s a nice second-order effect, too. The answer is always about the version of the tool the user actually has, because the docs were embedded into that build. No mismatch between a website documenting the latest release and the slightly older binary sitting on the user’s machine.

The upshot

Documentation usually loses to --help not on quality but on location: it’s in a browser, and the user is in the terminal. go-tool-base embeds the docs into the binary and surfaces them two ways: a docs command that’s a real TUI browser with a sidebar, rich markdown and search, and docs ask, which answers natural-language questions using the embedded docs as context.

Because that context is the tool’s own closed, shipped documentation and the model is told to use nothing else, the assistant stays grounded, and it’s always describing the exact version the user is holding. The fix for unread documentation was never to write more of it. It was to put it where the work happens and let it answer back.

An AI interface that fits on one screen

Fri, 27 Mar 2026 00:00:00 +0000

The moment you decide a CLI tool should talk to an LLM, there’s a strong gravitational pull towards reaching for LangChain, or one of its many relatives. It’s the obvious move. It’s also, for most CLI work, a bit like hiring a removals firm to carry a single box up the stairs.

Let me explain why go-tool-base went the other way, and what “the other way” actually looks like.

The instinct, and why it overshoots

When you add AI to a tool, the instinct is to reach for the big general-purpose framework. LangChain and its relatives are capable, and they exist for a real need: orchestrating complex multi-step AI applications, with retrieval pipelines, memory stores, chains of calls, whole fleets of agents.

Now look at what a CLI tool actually needs from an LLM. It needs to send a prompt and get text back. Sometimes it wants structured data back instead of prose. Sometimes it wants to let the model call a few of the tool’s own functions. That’s pretty much the whole list.

Pulling in a framework built to orchestrate retrieval and agent swarms in order to do that is a poor trade. You take on a large new vocabulary of concepts, a wide dependency surface, and a great deal of abstraction you’ll never touch, all to perform three or four operations. The framework isn’t wrong. It’s just answering a far bigger question than the one a CLI tool is asking.

What go-tool-base chose instead

go-tool-base didn’t reach for a framework. The decision is on the record in its own design notes: before a single line was written, LangChain Go, go-openai, Vercel’s AI SDK and around ten other options were evaluated, and not one of them matched what a CLI framework actually needs. So the chat package was built deliberately small.

How small? The entire core ChatClient interface is four methods:

type ChatClient interface {
 Add(prompt string) error
 Chat(ctx context.Context, prompt string) (string, error)
 Ask(question string, target any) error
 SetTools(tools []Tool) error
}

Add appends a message to the conversation. Chat sends a prompt and returns text. Ask sends a prompt and returns a typed Go struct, the model’s answer unmarshalled straight into a value you defined. SetTools hands the model a set of your own functions it’s allowed to call. That’s the whole surface. Downstream code that uses AI never holds anything larger than this, and never has to know which provider is behind it.

The package’s own documentation has a word for this: right-sized. Large enough to solve genuine provider-abstraction complexity, small enough that the full interface fits on a single screen.

“Thin” is not the same as “does little”

This is the part worth being precise about, because “four methods” can sound like “barely does anything”, and that’s the wrong read entirely.

Behind those four methods sits genuinely awkward work. Five providers (OpenAI, Claude, Gemini, a locally installed claude binary, and any OpenAI-compatible endpoint), each with a different wire API, all normalised behind the one interface. A tool-calling loop. Structured output via JSON Schema, made to behave consistently across providers that each express it differently. Error normalisation. Token chunking.

The point of a thin abstraction is not that there’s little underneath it. It’s that the interface stays small while the implementation quietly absorbs the complexity. Four methods on the surface; five provider integrations and a tool-calling loop below the waterline. The thinness is a property of what the caller sees, not of what the package does. A reach-for-LangChain decision gets that backwards: it exposes the caller to all the machinery, whether or not the caller will ever need it.

The core stays small even as features grow

There’s a neat detail in how chat keeps the interface from creeping. The package also supports streaming responses and conversation persistence, both of which are real features with real surface area. Neither of them is in the four-method core.

Instead they’re separate, optional interfaces. A streaming-capable client also satisfies StreamingChatClient; a persistable one also satisfies PersistentChatClient. Code that wants those capabilities does a type assertion to ask for them, and code that doesn’t simply never sees them. So the common path stays four methods forever. New capabilities arrive as opt-in interfaces alongside the core, not as new methods bolted onto it. The thing that fits on one screen keeps fitting on one screen.
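
The pattern is plain Go, so a sketch is short. The interface names come from the package; everything else here is assumed for illustration, including the Stream method and its signature, and ChatClient is trimmed to a single method to keep the example small.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// ChatClient is trimmed to one method here to keep the sketch short.
type ChatClient interface {
	Chat(ctx context.Context, prompt string) (string, error)
}

// StreamingChatClient is the optional capability. The Stream signature is an assumption.
type StreamingChatClient interface {
	ChatClient
	Stream(ctx context.Context, prompt string, onToken func(string)) error
}

type plainClient struct{}

func (plainClient) Chat(_ context.Context, p string) (string, error) {
	return "echo: " + p, nil
}

type streamingClient struct{ plainClient }

// Stream fakes token-by-token delivery by splitting the reply on whitespace.
func (s streamingClient) Stream(ctx context.Context, p string, onToken func(string)) error {
	reply, err := s.Chat(ctx, p)
	if err != nil {
		return err
	}
	for _, tok := range strings.Fields(reply) {
		onToken(tok)
	}
	return nil
}

// reply streams when the client can, and falls back to one Chat call when it can't.
// The bool reports which path was taken.
func reply(ctx context.Context, c ChatClient, prompt string) (string, bool, error) {
	if sc, ok := c.(StreamingChatClient); ok {
		var tokens []string
		err := sc.Stream(ctx, prompt, func(t string) { tokens = append(tokens, t) })
		return strings.Join(tokens, " "), true, err
	}
	text, err := c.Chat(ctx, prompt)
	return text, false, err
}

func main() {
	ctx := context.Background()
	for _, c := range []ChatClient{plainClient{}, streamingClient{}} {
		text, streamed, err := reply(ctx, c, "hello there")
		fmt.Println(text, streamed, err)
	}
}
```

The caller holds a ChatClient either way. Only the code that cares about streaming ever mentions the wider interface.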

Extensible without forking, testable without a network

Two more properties keep the package small without making it limiting.

It’s extensible. The provider list isn’t closed. A RegisterProvider call lets any package contribute a new provider, and chat.New will route to it. You add a backend without forking pkg/chat or sending a patch upstream.
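
A registry like that is a small thing to build. What follows is a sketch of the general shape only: the Factory type, the config map and both function signatures are assumptions, not the real RegisterProvider or chat.New, and ChatClient is cut down to one context-free method.

```go
package main

import (
	"fmt"
	"sync"
)

// ChatClient is reduced to a single method for the sake of the example.
type ChatClient interface {
	Chat(prompt string) (string, error)
}

// Factory builds a client from provider-specific settings. Its shape is an assumption.
type Factory func(cfg map[string]string) (ChatClient, error)

var (
	mu        sync.RWMutex
	providers = map[string]Factory{}
)

// RegisterProvider is a hypothetical stand-in for the framework call of the same name.
func RegisterProvider(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	providers[name] = f
}

// New routes to whichever factory was registered under name.
func New(name string, cfg map[string]string) (ChatClient, error) {
	mu.RLock()
	f, ok := providers[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown provider %q", name)
	}
	return f(cfg)
}

// echoClient is a toy backend contributed from "outside" the core package.
type echoClient struct{ prefix string }

func (e echoClient) Chat(p string) (string, error) { return e.prefix + p, nil }

func init() {
	RegisterProvider("echo", func(cfg map[string]string) (ChatClient, error) {
		return echoClient{prefix: cfg["prefix"]}, nil
	})
}

func main() {
	c, err := New("echo", map[string]string{"prefix": "> "})
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Chat("hello"))
}
```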

And it’s testable. The package ships generated mocks. A downstream tool’s AI features can be tested against a mock ChatClient returning canned responses, with no network, no API key, and no flakiness. Because the interface is four methods, that mock is trivial to set up and complete by construction. A sprawling framework interface is a sprawling thing to fake; a four-method one is not. (I’ll come back to testing AI code properly in a later post, because it deserves a whole article of its own.)
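
To show how little there is to fake, here's a hand-rolled double for the four-method interface. The package ships generated mocks; this isn't one of them. fakeClient, Triage and shouldBlockRelease are invented for the example, and Tool is a placeholder type.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
)

// Tool is a placeholder so the interface compiles on its own.
type Tool struct{ Name string }

// ChatClient is the four-method interface shown earlier.
type ChatClient interface {
	Add(prompt string) error
	Chat(ctx context.Context, prompt string) (string, error)
	Ask(question string, target any) error
	SetTools(tools []Tool) error
}

// fakeClient returns canned responses and records what it was asked.
type fakeClient struct {
	chatReply string
	askJSON   string
	prompts   []string
}

var _ ChatClient = (*fakeClient)(nil) // compile-time check that the fake is complete

func (f *fakeClient) Add(p string) error {
	f.prompts = append(f.prompts, p)
	return nil
}

func (f *fakeClient) Chat(_ context.Context, p string) (string, error) {
	f.prompts = append(f.prompts, p)
	return f.chatReply, nil
}

func (f *fakeClient) Ask(q string, target any) error {
	f.prompts = append(f.prompts, q)
	return json.Unmarshal([]byte(f.askJSON), target)
}

func (f *fakeClient) SetTools([]Tool) error { return nil }

// Triage and shouldBlockRelease are the code under test: ordinary logic
// that branches on a typed answer.
type Triage struct {
	Severity string `json:"severity"`
}

func shouldBlockRelease(c ChatClient, diff string) (bool, error) {
	var t Triage
	if err := c.Ask("Classify this diff:\n"+diff, &t); err != nil {
		return false, err
	}
	return t.Severity == "high", nil
}

func main() {
	fake := &fakeClient{askJSON: `{"severity":"high"}`}
	block, err := shouldBlockRelease(fake, "- old\n+ new")
	fmt.Println(block, err, len(fake.prompts))
}
```

No network, no key, and the test exercises the branch you actually care about: what your code does once the answer arrives.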

The right size

When a CLI tool needs AI, the instinct is a large framework like LangChain. For orchestrating retrieval pipelines and agent swarms, that’s exactly the right tool. For sending a prompt, getting a struct back, and letting the model call a few functions, it’s enormous overkill.

go-tool-base’s chat package is the deliberate alternative, chosen only after LangChain Go and a dozen others were weighed up and rejected. Its core ChatClient interface is four methods. Underneath sit five normalised providers, a tool-calling loop, structured output and error handling, but the caller sees four methods and never learns which provider is active. Streaming and persistence are opt-in interfaces beside the core, not additions to it. It extends without forking and tests without a network. Right-sized: the complexity is real, but it lives under the interface rather than in it.

Your CLI is already an AI tool

Thu, 19 Mar 2026 00:00:00 +0000

“Make it work with AI” has become one of those requests that lands on a developer’s desk with a thud and not much further detail attached. My instinct, the first time, was to brace for a big lump of integration work… a bespoke adapter for this assistant, another for that one, a treadmill of little wrappers stretching off into the distance.

Turns out I’d already done most of the work. So have you, if your CLI tool is any good. Let me explain what I mean.

You already described your capabilities

Stop and think for a second about what a well-built CLI tool actually is. It’s a set of named operations, each with a human-readable description, each taking a set of typed, named, documented parameters. You wrote all of that already, because a CLI without it is unusable by people.

Now look at what an AI assistant needs in order to call a tool. A set of named operations. A description of each, so it knows when to reach for them. A typed parameter schema for each, so it knows how to call them.

It’s the same list! A good CLI is already, structurally, a description of a set of capabilities. The information an AI agent needs isn’t extra work you have to go and do. It’s work you finished the moment your --help output was any good.

The only thing missing is a translator. Something that takes “this is a CLI” and presents it as “this is a set of tools an AI can call”.

MCP is that translator, and it’s a standard

The temptation, when you want your tool to be AI-usable, is to sit down and write an integration. A little adapter for Claude Desktop. Another for Cursor. Another for whatever turns up next month. Each one a bespoke wrapper, each one a thing to maintain, and the list never stops growing because new assistants keep appearing. That’s the treadmill I was bracing for.

The Model Context Protocol exists to kill that list. MCP is an open standard for how an AI assistant discovers and calls external tools. Implement it once and your tool works with every assistant that speaks it. Write once, not once-per-client.

So go-tool-base implements it once, in the framework, for everyone. (That’s rather the theme of this whole series, if you hadn’t spotted it yet… do the annoying thing once, properly, in a place where every tool inherits it.)

The mcp command, and the mapping it does for free

Every tool built on go-tool-base inherits a built-in mcp command. Run it:

mytool mcp

and the tool starts a JSON-RPC server over standard I/O, speaking MCP. That’s the whole user-facing surface. One command.

Behind it, the framework walks your Cobra command tree and maps it straight onto MCP tool definitions:

Each command becomes a tool.
Each command’s short description becomes the tool’s description, the text the AI reads to decide whether this is the tool it wants.
Each command’s flags and arguments become the tool’s JSON Schema parameters.

There’s no second schema to write and then keep in sync (and we all know how well “keep these two things aligned by hand” tends to go). The command tree is the schema. Add a new command to your CLI and it’s a new tool for the agent, automatically, with the description and flags you already gave it. Nobody has to remember to update an MCP manifest, because there’s no separate MCP manifest to forget about.
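
The mapping itself is a tree walk. This sketch avoids importing Cobra so it stands alone: Command and Flag are minimal stand-ins for a Cobra command tree, and the underscore-joined tool names and the decision to skip the root command are my assumptions, not necessarily what go-tool-base does. The name, description and inputSchema keys are the ones MCP uses for a tool definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Flag and Command are minimal stand-ins for a Cobra command tree.
type Flag struct {
	Name, Type, Usage string
}

type Command struct {
	Name, Short string
	Flags       []Flag
	Sub         []*Command
}

// ToolDef carries the three fields an MCP tool definition needs.
type ToolDef struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	InputSchema map[string]any `json:"inputSchema"`
}

// toTools walks the tree depth-first and emits one tool per command below the root.
func toTools(root *Command) []ToolDef {
	var out []ToolDef
	var walk func(c *Command, prefix string)
	walk = func(c *Command, prefix string) {
		name := prefix + c.Name
		props := map[string]any{}
		for _, f := range c.Flags {
			props[f.Name] = map[string]any{"type": f.Type, "description": f.Usage}
		}
		out = append(out, ToolDef{
			Name:        name,
			Description: c.Short,
			InputSchema: map[string]any{"type": "object", "properties": props},
		})
		for _, s := range c.Sub {
			walk(s, name+"_")
		}
	}
	for _, s := range root.Sub {
		walk(s, "")
	}
	return out
}

func demoTree() *Command {
	return &Command{Name: "mytool", Sub: []*Command{
		{Name: "sync", Short: "Synchronise local state", Flags: []Flag{
			{Name: "dry-run", Type: "boolean", Usage: "Show changes without applying them"},
		}},
		{Name: "config", Short: "Manage configuration", Sub: []*Command{
			{Name: "get", Short: "Print a config value", Flags: []Flag{
				{Name: "key", Type: "string", Usage: "Config key to read"},
			}},
		}},
	}}
}

func main() {
	out, err := json.MarshalIndent(toTools(demoTree()), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Add a command to demoTree and the tool list grows by one. There is nowhere else to update.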

Configuring an assistant to use it

On the assistant’s side it’s just as undramatic. You tell your AI client (Claude Desktop, Cursor, anything MCP-aware) to launch mytool mcp. From then on the assistant:

Starts your tool in MCP mode when it boots.
Discovers every command as a callable tool.
Calls the right one, with the right parameters, when a user’s request needs it.

Your CLI tool has quietly become something the AI can pick up and use, mid-conversation, on its own initiative.

The safety property worth noticing

Now, “let an AI run things on my machine” is rightly a sentence that makes people nervous. It makes me nervous, and I built the thing. So it’s worth noticing the constraint sitting quietly in this design.

The AI can only call what you defined. The tools it sees are exactly the commands in your tree, and the parameters it can pass are exactly the flags and arguments you declared, validated against the JSON Schema generated from them.

It can’t invent a command. It can’t pass a parameter you never defined. The boundary of what the agent can do is the boundary of what your CLI does, and you drew that boundary already, back when you built the tool. Exposing the CLI over MCP doesn’t widen the surface one inch. It just makes the existing surface reachable. The AI isn’t running things. It’s running your commands, the ones you wrote, tested and shipped, and nothing else.
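
Stripped of the JSON Schema machinery, that boundary is a lookup. A toy version, with the surface type and check method invented for illustration and real schema validation reduced to "is this name declared":

```go
package main

import "fmt"

// surface maps each tool name to the parameter names it declares.
type surface map[string]map[string]bool

// check rejects any call that names an undeclared tool or parameter.
func (s surface) check(tool string, args map[string]any) error {
	params, ok := s[tool]
	if !ok {
		return fmt.Errorf("unknown tool %q", tool)
	}
	for name := range args {
		if !params[name] {
			return fmt.Errorf("tool %q has no parameter %q", tool, name)
		}
	}
	return nil
}

func demoSurface() surface {
	return surface{"sync": {"dry-run": true}}
}

func main() {
	s := demoSurface()
	fmt.Println(s.check("sync", map[string]any{"dry-run": true})) // allowed
	fmt.Println(s.check("sync", map[string]any{"force": true}))   // undeclared flag
	fmt.Println(s.check("rm", nil))                               // undeclared command
}
```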

The gist

A CLI tool, built properly, is already a structured description of a set of capabilities: named operations, descriptions, typed parameters. Which is also exactly what an AI agent needs in order to call a tool. The gap between the two is only a translator, and writing a bespoke one per assistant is a treadmill you don’t need to step onto.

go-tool-base puts the translator in the framework. Every tool gets an mcp command that serves the command tree over the Model Context Protocol… commands become tools, descriptions become descriptions, flags become JSON Schema parameters, with no second schema to maintain. Point any MCP-aware assistant at it and your CLI is an agent-callable tool, bounded to exactly the commands you shipped.

You did the hard part when you built a good CLI. MCP just opens the door you’d already framed.
