<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Testing on PHP Boy Scout</title><link>https://blog-570662.gitlab.io/categories/testing/</link><description>Recent content in Testing on PHP Boy Scout</description><generator>Hugo -- gohugo.io</generator><language>en-gb</language><copyright>Matt Cockayne</copyright><lastBuildDate>Thu, 16 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog-570662.gitlab.io/categories/testing/index.xml" rel="self" type="application/rss+xml"/><item><title>The test-mocking pattern that races</title><link>https://blog-570662.gitlab.io/the-test-mocking-pattern-that-races/</link><pubDate>Thu, 16 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog-570662.gitlab.io/the-test-mocking-pattern-that-races/</guid><description>&lt;img src="https://blog-570662.gitlab.io/the-test-mocking-pattern-that-races/cover-the-test-mocking-pattern-that-races.png" alt="Featured image of post The test-mocking pattern that races" /&gt;&lt;p&gt;I&amp;rsquo;m going to tell you about a bug go-tool-base shipped, because it&amp;rsquo;s one of those bugs that&amp;rsquo;s so reasonable-looking you&amp;rsquo;ll find it in textbooks, conference talks, and an awful lot of otherwise excellent Go code. We had it too. It passed every test on my laptop, every single time, and then quietly fell over on CI while blaming an innocent bystander.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s the classic Go trick for mocking a dependency, and it races.&lt;/p&gt;
&lt;h2 id="a-pattern-that-looks-completely-reasonable"&gt;A pattern that looks completely reasonable
&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s a thing you need to do constantly in Go tests: stop a function from really shelling out. It calls &lt;code&gt;exec.LookPath&lt;/code&gt; to find a binary, or &lt;code&gt;exec.Command&lt;/code&gt; to run one, and your test very much does not want it touching the real &lt;code&gt;$PATH&lt;/code&gt; or spawning a real process.&lt;/p&gt;
&lt;p&gt;The Go community has a well-worn answer. Hoist the function into a package-level variable, call &lt;em&gt;that&lt;/em&gt;, and let tests reassign it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// production code&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;execLookPath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LookPath&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;findTool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;execLookPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;sometool&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// test&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;TestFindTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;testing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;old&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;execLookPath&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;defer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;execLookPath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;old&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;execLookPath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#34;/fake/path&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// ...assert...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It&amp;rsquo;s tidy. No interface to thread through, no constructor to change. You&amp;rsquo;ll find it in a great deal of Go code, including some very respectable Go code indeed. go-tool-base had it too.&lt;/p&gt;
&lt;p&gt;And it works. It works on your machine, it works in code review, it works the first hundred times CI runs it. Which is precisely what makes it dangerous, because it&amp;rsquo;s wrong, and it&amp;rsquo;s just been biding its time.&lt;/p&gt;
&lt;h2 id="add-one-line-and-it-detonates"&gt;Add one line and it detonates
&lt;/h2&gt;&lt;p&gt;Go&amp;rsquo;s &lt;code&gt;t.Parallel()&lt;/code&gt; is more or less free performance. Mark your tests with it and the runner overlaps them instead of plodding through one at a time. On a package with a few hundred tests it&amp;rsquo;s a real, worthwhile speed-up, so naturally you reach for it.&lt;/p&gt;
&lt;p&gt;Now picture two tests, both using the pattern above, both marked &lt;code&gt;t.Parallel()&lt;/code&gt;. They run concurrently. Test A assigns its fake to &lt;code&gt;execLookPath&lt;/code&gt;. Test B assigns &lt;em&gt;its&lt;/em&gt; fake to &lt;code&gt;execLookPath&lt;/code&gt;. Test A reads &lt;code&gt;execLookPath&lt;/code&gt; expecting its own fake. Two goroutines, one variable, writes and reads with nothing synchronising them. That&amp;rsquo;s a textbook data race, and the textbook is right: the behaviour is undefined. Test A might see B&amp;rsquo;s fake. The deferred restore might land in the wrong order and leave the variable pointing at a fake after both tests have finished, poisoning a third one for good measure.&lt;/p&gt;
&lt;p&gt;The truly nasty part is the &lt;em&gt;intermittency&lt;/em&gt;. Whether the race actually bites depends on goroutine scheduling, which depends on machine load and core count. Your laptop running eight tests at once might never lose the coin-toss. A CI runner under load, scheduling differently, loses it and fails a test that has nothing obviously to do with the change in the commit. You re-run the pipeline, it passes, everyone shrugs and moves on. A test suite that fails one run in twenty trains your team to ignore it, and an ignored CI failure is worse than no CI at all.&lt;/p&gt;
&lt;p&gt;I can tell you this one from direct, slightly embarrassed experience, because go-tool-base shipped exactly this bug and CI caught it the honest way: green on the laptop, red on the runner, with the failure cheerfully pointing at innocent bystander tests rather than the global that was actually the culprit. &lt;code&gt;go test -race&lt;/code&gt; will name it for you if you crank the parallelism up high enough to lose the toss reliably&amp;hellip; but you have to go looking, and you only go looking once it&amp;rsquo;s already ruined an afternoon.&lt;/p&gt;
&lt;h2 id="the-fix-isnt-synchronisation-its-structure"&gt;The fix isn&amp;rsquo;t synchronisation, it&amp;rsquo;s structure
&lt;/h2&gt;&lt;p&gt;The instinct is to slap a mutex around the variable. Resist it. A mutex makes the race &lt;em&gt;defined&lt;/em&gt;, but it doesn&amp;rsquo;t make the design any good. You&amp;rsquo;ve still got global mutable state, you&amp;rsquo;ve just queued the fight instead of cancelling it. And tests that serialise on a shared lock aren&amp;rsquo;t really parallel any more, so you&amp;rsquo;ve also handed back the speed-up you came for in the first place.&lt;/p&gt;
&lt;p&gt;The real fix is to not have a shared variable at all. The dependency was always an &lt;em&gt;input&lt;/em&gt; to the code; the package-level var was just a way of avoiding saying so out loud. So say it. Inject it.&lt;/p&gt;
&lt;p&gt;A struct field:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Finder&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;lookPath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// defaults to exec.LookPath&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;Finder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lookPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;sometool&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or a functional option, if you&amp;rsquo;d rather keep the zero value clean. Either way, each test constructs its &lt;em&gt;own&lt;/em&gt; &lt;code&gt;Finder&lt;/code&gt; with its &lt;em&gt;own&lt;/em&gt; fake. There&amp;rsquo;s no shared variable, so there&amp;rsquo;s no race, and &lt;code&gt;t.Parallel()&lt;/code&gt; is free again because the tests genuinely don&amp;rsquo;t touch each other.&lt;/p&gt;
&lt;p&gt;go-tool-base wrote this straight into its standing rules: no package-level mocking hooks, full stop. Dependencies come in through struct fields, functional options, or config fields. (The same injection discipline that makes &lt;a class="link" href="https://blog-570662.gitlab.io/props-the-container-that-does-the-heavy-lifting/" &gt;Props&lt;/a&gt; so testable, applied one rung further down.) And to stop everyone hand-rolling the same &lt;code&gt;exec&lt;/code&gt; fakes, there&amp;rsquo;s a small internal package, &lt;code&gt;internal/exectest&lt;/code&gt;, with ready-made &lt;code&gt;LookPath&lt;/code&gt; and &lt;code&gt;CommandContext&lt;/code&gt; doubles you construct per-test. The pattern is gone, and the door it came in through is shut.&lt;/p&gt;
&lt;h2 id="the-rule-worth-taking-away"&gt;The rule worth taking away
&lt;/h2&gt;&lt;p&gt;A package-level variable that tests reassign is shared mutable state. It reads as a harmless convenience because in a single-threaded test run it behaves like one. &lt;code&gt;t.Parallel()&lt;/code&gt; is the thing that reveals it was never harmless, only unobserved.&lt;/p&gt;
&lt;p&gt;The general lesson is older than Go: &lt;strong&gt;if a value is an input to your code, make it an input.&lt;/strong&gt; Smuggling it in as a global is borrowing test-time convenience against a debt that comes due, with interest, the day someone wants their tests to run in parallel. Pay cash. Inject the dependency.&lt;/p&gt;
&lt;h2 id="worth-remembering"&gt;Worth remembering
&lt;/h2&gt;&lt;p&gt;Mocking via a reassignable package-level variable is a beloved Go shortcut and a latent data race. It survives because single-threaded test runs hide it; &lt;code&gt;t.Parallel()&lt;/code&gt; exposes it as intermittent, bystander-blaming CI flake that&amp;rsquo;s miserable to trace. A mutex only makes the bad design &lt;em&gt;defined&lt;/em&gt;. The fix is structural: inject the dependency as a struct field or functional option, so each test owns its own double and there&amp;rsquo;s no shared state to race over. go-tool-base banned the global-hook pattern outright and ships &lt;code&gt;internal/exectest&lt;/code&gt; so nobody&amp;rsquo;s tempted back to it.&lt;/p&gt;
&lt;p&gt;If a piece of code depends on something, let it &lt;em&gt;say&lt;/em&gt; so in its signature. Your future self, staring at a CI failure that flatly refuses to reproduce, will thank you.&lt;/p&gt;</description></item><item><title>Testing code that calls an LLM: yes, you actually can</title><link>https://blog-570662.gitlab.io/testing-code-that-calls-an-llm/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog-570662.gitlab.io/testing-code-that-calls-an-llm/</guid><description>&lt;img src="https://blog-570662.gitlab.io/testing-code-that-calls-an-llm/cover-testing-code-that-calls-an-llm.png" alt="Featured image of post Testing code that calls an LLM: yes, you actually can" /&gt;&lt;p&gt;&amp;ldquo;You can&amp;rsquo;t test code that calls an AI.&amp;rdquo; I&amp;rsquo;ve heard it said with great confidence, and it&amp;rsquo;s half right, which is the most dangerous kind of right. You genuinely can&amp;rsquo;t assert on what a non-deterministic model says. But the model isn&amp;rsquo;t your code, and the bits sitting either side of it most certainly are.&lt;/p&gt;
&lt;h2 id="you-cant-test-ai-code"&gt;&amp;ldquo;You can&amp;rsquo;t test AI code&amp;rdquo;
&lt;/h2&gt;&lt;p&gt;It&amp;rsquo;s a fair worry. Your command calls an LLM. The LLM returns something slightly different every run. A test that asserts &lt;code&gt;response == &amp;quot;...&amp;quot;&lt;/code&gt; is broken before you&amp;rsquo;ve finished typing it. So the conclusion arrives quickly: the AI path can&amp;rsquo;t be tested, leave it uncovered.&lt;/p&gt;
&lt;p&gt;Which is a shame, because the AI call is usually the riskiest line in the whole command.&lt;/p&gt;
&lt;p&gt;The conclusion is also wrong. It mistakes &amp;ldquo;I can&amp;rsquo;t test the model&amp;rdquo; for &amp;ldquo;I can&amp;rsquo;t test my code&amp;rdquo;. The model is not your code. Your code is the two pieces sitting on either side of it.&lt;/p&gt;
&lt;h2 id="your-code-is-a-prompt-and-a-handler"&gt;Your code is a prompt and a handler
&lt;/h2&gt;&lt;p&gt;Strip the command down to what it actually does:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It builds a prompt. It assembles a system prompt, the user&amp;rsquo;s input, perhaps some context, and sends it.&lt;/li&gt;
&lt;li&gt;The model does something. This is not your code.&lt;/li&gt;
&lt;li&gt;It takes the response and does something with it. It parses it, branches on it, prints it, stores it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Steps one and three are entirely yours, and entirely deterministic. The same inputs build the same prompt and handle the same response the same way, every single time. That&amp;rsquo;s testable. Step two is the only part that isn&amp;rsquo;t, and step two was never yours to test in the first place.&lt;/p&gt;
&lt;p&gt;So the job is to pin step two to a known value, and then test one and three properly.&lt;/p&gt;
&lt;h2 id="test-the-prompt-snapshot-it"&gt;Test the prompt: snapshot it
&lt;/h2&gt;&lt;p&gt;Step one produces a prompt, and a prompt is just a string, which means you can pin it.&lt;/p&gt;
&lt;p&gt;Both frameworks lean on snapshot testing here. go-tool-base uses a golden-file approach: the prompt your code generates is recorded to a file, and the test re-generates it and compares against that file. rust-tool-base does the same with &lt;code&gt;insta&lt;/code&gt;, snapshotting the request body the client would send.&lt;/p&gt;
&lt;p&gt;The reason this matters is that the prompt is load-bearing and quietly easy to break. You refactor how context gets assembled. Without noticing, you&amp;rsquo;ve changed the wording, or the ordering, or dropped a line the model was leaning on. Nothing fails to compile. The behaviour just drifts, silently.&lt;/p&gt;
&lt;p&gt;A snapshot test catches exactly that. It fails, it shows you the diff between the old prompt and the new one, and it makes you stop and make a decision. Was this change intended? If yes, you accept the new snapshot and move on. If no, you&amp;rsquo;ve just caught a bug before it shipped. Either way the prompt never changes by accident, which for AI code is most of the battle.&lt;/p&gt;
&lt;h2 id="test-the-handler-mock-the-response"&gt;Test the handler: mock the response
&lt;/h2&gt;&lt;p&gt;Step three needs a response to handle, and in a unit test you don&amp;rsquo;t get that response from the real model. You supply it.&lt;/p&gt;
&lt;p&gt;go-tool-base ships generated mocks for the &lt;code&gt;ChatClient&lt;/code&gt; interface. A test builds a mock client, tells it &amp;ldquo;when &lt;code&gt;Ask&lt;/code&gt; is called, return &lt;em&gt;this&lt;/em&gt; canned value&amp;rdquo;, and runs the command against it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;mockClient&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mock_chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;NewMockChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;mockClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EXPECT&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;Ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Anything&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Anything&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AnythingOfType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;*main.Analysis&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;RunAndReturn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;Analysis&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Analysis&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#34;critical&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Because &lt;a class="link" href="https://blog-570662.gitlab.io/an-ai-interface-that-fits-on-one-screen/" &gt;the interface is only four methods&lt;/a&gt;, that mock is trivial to set up and complete by construction. rust-tool-base takes the same idea one layer down: HTTP-bound tests use &lt;code&gt;wiremock&lt;/code&gt;, which stands up a fake server returning a canned response body. The client makes a real HTTP request; it just goes to a fake endpoint the test controls.&lt;/p&gt;
&lt;p&gt;Either way, step two is now fixed to a value you chose, which makes step three deterministic. And that unlocks the tests that actually matter: given a malformed response, does the command fail gracefully? Given a rate-limit error, an empty answer, a field missing? Those are the cases a live model almost never hands you on demand, and a mock hands you every time, on the first run.&lt;/p&gt;
&lt;p&gt;This is, incidentally, the same discipline as &lt;a class="link" href="https://blog-570662.gitlab.io/the-test-mocking-pattern-that-races/" &gt;the test-mocking work elsewhere in the framework&lt;/a&gt;: the dependency is injected, so the test gets to decide what it does.&lt;/p&gt;
&lt;h2 id="what-you-deliberately-dont-test"&gt;What you deliberately don&amp;rsquo;t test
&lt;/h2&gt;&lt;p&gt;One honest boundary. None of this tests whether the model gives &lt;em&gt;good&lt;/em&gt; answers. That question is real, but it&amp;rsquo;s a different activity (evaluations, run as their own suite) and not something to mix into the unit tests.&lt;/p&gt;
&lt;p&gt;The unit suite&amp;rsquo;s job is your code: that it builds a sound prompt, and that it handles every shape of response correctly, including the ugly ones. Keep that well away from &amp;ldquo;is the model clever today&amp;rdquo;. A unit test that depends on the model being clever is a unit test that fails when the weather changes, and a flaky test just teaches people to ignore the whole suite.&lt;/p&gt;
&lt;h2 id="what-it-comes-down-to"&gt;What it comes down to
&lt;/h2&gt;&lt;p&gt;Code that calls an LLM is testable; the model is not, and those are different statements. Your code is a prompt builder and a response handler, both deterministic, with the model sat in between.&lt;/p&gt;
&lt;p&gt;go-tool-base and rust-tool-base converge on the same approach. Snapshot the prompt, with golden files or &lt;code&gt;insta&lt;/code&gt;, so a refactor can&amp;rsquo;t change what you send without a test noticing. Mock the response, with generated &lt;code&gt;ChatClient&lt;/code&gt; mocks or a &lt;code&gt;wiremock&lt;/code&gt; server, so tests run with no network and you can feed in the malformed and error cases a real model won&amp;rsquo;t reliably produce. Leave &amp;ldquo;are the answers any good&amp;rdquo; to a separate evaluation suite. Test the two halves you own, and the non-determinism in the middle stops being an excuse to leave the riskiest line uncovered.&lt;/p&gt;</description></item><item><title>BDD where it earns its place, and nowhere else</title><link>https://blog-570662.gitlab.io/bdd-where-it-earns-its-place/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://blog-570662.gitlab.io/bdd-where-it-earns-its-place/</guid><description>&lt;img src="https://blog-570662.gitlab.io/bdd-where-it-earns-its-place/cover-bdd-where-it-earns-its-place.png" alt="Featured image of post BDD where it earns its place, and nowhere else" /&gt;&lt;p&gt;I have a slightly complicated relationship with BDD. I&amp;rsquo;ve watched it turn a tangled test suite into something the whole team could read and reason about, and I&amp;rsquo;ve watched it turn a perfectly good unit test into a paragraph of ceremonial English that nobody benefits from. So when go-tool-base brought in Cucumber-style BDD, the interesting decision wasn&amp;rsquo;t adopting it. It was being ruthless about where &lt;em&gt;not&lt;/em&gt; to.&lt;/p&gt;
&lt;h2 id="two-tests-that-hurt-for-different-reasons"&gt;Two tests that hurt for different reasons
&lt;/h2&gt;&lt;p&gt;Most of go-tool-base&amp;rsquo;s tests are ordinary table-driven Go tests, and they&amp;rsquo;re absolutely fine. A function, a slice of input/expected pairs, a loop. Nobody needs Gherkin to understand a parser test.&lt;/p&gt;
&lt;p&gt;But two areas were genuinely painful, and they were painful in the same way: the &lt;em&gt;test&lt;/em&gt; had become harder to understand than the &lt;em&gt;thing it was testing&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The first was &lt;code&gt;pkg/controls&lt;/code&gt;, the &lt;a class="link" href="https://blog-570662.gitlab.io/lifecycle-management-for-long-running-go-services/" &gt;service-lifecycle package&lt;/a&gt;. It runs a small state machine (Unknown, Running, Stopping, Stopped) with signal handling, health monitoring, restart policies and graceful shutdown all woven through it. The integration tests for graceful shutdown had grown to over three hundred lines of imperative goroutine and channel coordination. They worked. But reviewing them was a slog, and a test you can&amp;rsquo;t review with confidence is a test you can&amp;rsquo;t trust when it fails. The behaviour being checked, &amp;ldquo;when a shutdown signal arrives mid-startup, the controller stops cleanly&amp;rdquo;, was a simple sentence buried under a heap of synchronisation scaffolding.&lt;/p&gt;
&lt;p&gt;The second was the CLI itself. &lt;code&gt;init&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;doctor&lt;/code&gt; are user &lt;em&gt;workflows&lt;/em&gt;. &amp;ldquo;Given a config file with a custom value, when I run init, then the custom value survives the merge.&amp;rdquo; That&amp;rsquo;s already a Given/When/Then; it just happened to be written out as Go.&lt;/p&gt;
&lt;h2 id="godog-and-the-line-i-drew"&gt;Godog, and the line I drew
&lt;/h2&gt;&lt;p&gt;Godog is the official Go implementation of Cucumber. You write &lt;code&gt;.feature&lt;/code&gt; files in plain Gherkin and bind each step to a Go function. The shutdown scenario stops being three hundred lines of channels and becomes this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gherkin" data-lang="gherkin"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;Scenario:&lt;/span&gt;&lt;span class="nf"&gt; graceful shutdown completes within the deadline
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt; Given &lt;/span&gt;&lt;span class="nf"&gt;a controller with two registered services
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt; &lt;/span&gt;&lt;span class="k"&gt;When &lt;/span&gt;&lt;span class="nf"&gt;a shutdown signal is received
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt; &lt;/span&gt;&lt;span class="k"&gt;Then &lt;/span&gt;&lt;span class="nf"&gt;both services stop in registration order
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt; &lt;/span&gt;&lt;span class="k"&gt;And &lt;/span&gt;&lt;span class="nf"&gt;the controller reports a clean shutdown
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The goroutine choreography doesn&amp;rsquo;t vanish, of course. It moves into the step definitions, written once and reused. What changes is that the &lt;em&gt;scenario&lt;/em&gt; is now readable by someone who&amp;rsquo;s never opened the file before, including someone from an ops team who&amp;rsquo;ll never write a line of Go but absolutely has opinions about how shutdown should behave.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the part I want to dwell on, because it&amp;rsquo;s the part most BDD adoptions get wrong. The first design decision written down for this work was: &lt;strong&gt;strategic, not universal.&lt;/strong&gt; Use Godog &lt;em&gt;only&lt;/em&gt; where BDD adds clarity. Keep table-driven Go tests as the baseline everywhere else.&lt;/p&gt;
&lt;p&gt;That sounds obvious written down. It is not obvious in practice, because BDD has a gravitational pull. Once a team has feature files, there&amp;rsquo;s a powerful urge to express &lt;em&gt;everything&lt;/em&gt; as feature files, for consistency. And that&amp;rsquo;s how you end up with Gherkin scenarios for a pure function (&lt;code&gt;Given the number 2, When I double it, Then I get 4&lt;/code&gt;) which is pure ceremony. You&amp;rsquo;ve wrapped a one-line table test in a paragraph of English and a step-definition indirection, and made it actively worse.&lt;/p&gt;
&lt;p&gt;The honest test for whether BDD belongs is this: &lt;strong&gt;is this test a narrative, or is it a matrix?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A matrix is the same logic with many input/output pairs. That&amp;rsquo;s a table-driven test, that&amp;rsquo;s most unit tests, and Gherkin actively harms them. A narrative is a sequence of steps where the &lt;em&gt;ordering&lt;/em&gt; and the &lt;em&gt;state between steps&lt;/em&gt; is the thing under test, and that&amp;rsquo;s where Gherkin pays for itself. Lifecycle transitions are narratives. A user running three commands in sequence is a narrative. Doubling a number is not.&lt;/p&gt;
&lt;p&gt;go-tool-base drew that line and stuck to it. Feature files live in &lt;code&gt;features/&lt;/code&gt; at the project root, where a non-Go developer can find and read them. Step definitions live in &lt;code&gt;test/e2e/&lt;/code&gt;, kept well away from the unit tests. And the unit tests stayed exactly what they were, because they were already the right tool.&lt;/p&gt;
&lt;h2 id="made-to-fit-not-bolted-on"&gt;Made to fit, not bolted on
&lt;/h2&gt;&lt;p&gt;A couple of smaller decisions kept the BDD layer from feeling like a foreign object.&lt;/p&gt;
&lt;p&gt;It runs under &lt;code&gt;go test&lt;/code&gt;. There&amp;rsquo;s no separate Cucumber runner to install or remember. A &lt;code&gt;godog.TestSuite&lt;/code&gt; is invoked from an ordinary &lt;code&gt;TestFeatures(t *testing.T)&lt;/code&gt;, so the BDD scenarios run in the same &lt;code&gt;go test ./...&lt;/code&gt; as everything else. CI didn&amp;rsquo;t need a new concept bolted onto it.&lt;/p&gt;
&lt;p&gt;And the CLI end-to-end tests build the &lt;code&gt;gtb&lt;/code&gt; binary &lt;em&gt;once&lt;/em&gt; and reuse it across every scenario. Compiling a binary per scenario would make the suite slow enough that people would quietly start skipping it, and a test suite people skip is just decoration. Build once, test many.&lt;/p&gt;
&lt;h2 id="stepping-back"&gt;Stepping back
&lt;/h2&gt;&lt;p&gt;go-tool-base brought in Godog for BDD, but the decision worth writing about is the restraint. BDD was applied to exactly two things: the service-lifecycle state machine, where a 300-line goroutine tangle became a four-line scenario anyone can review, and CLI workflows, which are Given/When/Then by their very nature. Everywhere else, table-driven Go tests remained the baseline, because wrapping a matrix test in Gherkin makes it worse, not better.&lt;/p&gt;
&lt;p&gt;The useful rule: BDD fits a &lt;em&gt;narrative&lt;/em&gt;, ordered steps with meaningful state in between, and fights a &lt;em&gt;matrix&lt;/em&gt;. Adopt it as a scalpel for the narratives. Resist the pull to turn it into a religion.&lt;/p&gt;</description></item></channel></rss>