It came out of an audit. I’d recently pointed a small army of review agents at the whole go-tool-base codebase, back before that became a political problem, and one of the findings was that a subcommand could quietly skip the framework’s own start-up code. My first reaction was the dangerous one: surely not… we switched that on ages ago. So I asked for a second pair of eyes on the exact line.
There was no line. I was certain I’d enabled it, but I had simply never done it.
What the start-up hook is for
go-tool-base is built on cobra, the library most
Go command-line tools are built on. In cobra, a command can carry a
PersistentPreRunE: a function that runs before the command itself, and that, the
name strongly implies, persists down to the command’s children. Think of it as the
“before you do anything, get the tool ready” step.
go-tool-base uses exactly one of them, on the root command, to do all the
unglamorous setup: load and merge configuration, set up logging, ask about telemetry
the first time you run, wire up the telemetry collector, and check whether there’s a
newer release to install. Everything the tool does afterwards leans on that having
happened. By the time your actual command runs, props.Config is meant to be
populated and the collector is meant to exist.
The reasonable assumption (the one I’d made, anyway) is that “persistent” means it cascades.
Define it once at the root and every mytool foo bar three levels down gets it for
free.
“Persistent” promises less than it says
Here is the catch, and it is a good one to file away if you ever build a command
tree. cobra runs only the nearest PersistentPreRunE it finds, walking up from
the command you actually invoked. If a subcommand defines its own, that one runs and
the root’s does not. Not as well as. Instead of. There’s no warning and no error; the
child’s hook simply wins, and the parent’s is passed over in silence.
So the moment any command below the root declared its own PersistentPreRunE, the
entire start-up for that branch, the config, the logging, the telemetry, the update
check, would just not happen. props.Config would be nil. The collector would be
nil. The first you’d hear of it is a nil-pointer panic a long way from the cause, or,
worse, no panic at all and a tool running happily unconfigured.
EnableTraverseRunHooks is cobra’s opt-in to the behaviour most people assume is
already the default: run every PersistentPreRunE from the root down to the leaf, in
order. I’d assumed it was the default. It is not, and I’d never turned it on.
A landmine nobody had stepped on
The saving grace was that nothing was actually broken yet. In go-tool-base’s own command tree, the root is the only command that defines a persistent pre-run, so “root to leaf” and “nearest only” happen to produce the identical result. The flag being off changed nothing I could observe.
The bug was latent. It was a trap laid for the first person to do something entirely
reasonable: add a PersistentPreRunE to one of their own subcommands. go-tool-base
is a framework other tools are built on, so that person was never going to be me. The
instant a downstream author did the obvious thing, their config and telemetry would
vanish for that branch of their tool and nothing would tell them why.
That is the kind of bug I least like shipping. It compiled. It passed the tests. It would have demoed perfectly. And it sat there waiting to hand a stranger a debugging session with no breadcrumbs, for the crime of using a standard cobra feature the obvious way.
One line, and a note for whoever’s next
The fix is the line I was so sure already existed (root.go):
// Run every parent PersistentPreRunE from root to leaf rather than only the
// closest one. Without this, a downstream subcommand that defines its own
// PersistentPreRunE silently shadows the root bootstrap (config load,
// telemetry, update check) for that subtree. With it set, the framework
// bootstrap always runs first and a child hook runs after it.
cobra.EnableTraverseRunHooks = true
With it set, cobra runs the root start-up first and then the child’s hook, in order, so a downstream command adds to the setup instead of replacing it.
I didn’t want to stop there, because the next author to add a child hook still
deserves to understand the ordering. So the change also drops a one-time debug line if
it spots any command in the tree carrying its own PersistentPreRunE
(the same file),
saying out loud what’s going on:
l.Debug("a downstream command defines its own PersistentPreRunE; " +
"it runs AFTER the framework bootstrap (config load, telemetry, update check), not instead of it")
And, belt and braces, the collector now defaults to a no-op so the few paths that do
legitimately return early, like init and help, still satisfy the “always
non-nil” promise the rest of the code relies on. The whole thing shipped with a pair
of regression tests that assert the bootstrap really does run when a child hook is
present, and that it runs first. It’s all written up in a short
spec
and landed in one commit.
Trust, but grep
There are two things worth taking away. The cobra one is portable: if you rely on
PersistentPreRunE cascading down a command tree, set EnableTraverseRunHooks,
because “persistent” means less than it sounds and the nearest hook wins by default.
The other is the one I keep having to relearn. The settings I’m most certain about are the ones I never check, precisely because the certainty is what stops me looking. Somewhere along the line I’d promoted “I meant to” straight to “I did”, with nothing in between… and then defended it out loud before I’d even gone to look. A review agent is good at exactly that blind spot: it has no memory of intending to do something, only the code in front of it. The best thing the audit turned up wasn’t a clever bug. It was a flag that was never there. Leaving the campsite better than you found it has to include the traps nobody’s stepped on yet.
