The cobra hook I was sure I'd enabled

It came out of an audit. I’d recently pointed a small army of review agents at the whole go-tool-base codebase, back before that became a political problem, and one of the findings was that a subcommand could quietly skip the framework’s own start-up code. My first reaction was the dangerous one: surely not… we switched that on ages ago. So I asked for a second pair of eyes on the exact line.

There was no line. I was certain I’d enabled it, but I had simply never done it.

What the start-up hook is for

go-tool-base is built on cobra, the library most Go command-line tools are built on. In cobra, a command can carry a PersistentPreRunE: a function that runs before the command itself, and that, the name strongly implies, persists down to the command’s children. Think of it as the “before you do anything, get the tool ready” step.

go-tool-base uses exactly one of them, on the root command, to do all the unglamorous setup: load and merge configuration, set up logging, ask about telemetry the first time you run, wire up the telemetry collector, and check whether there’s a newer release to install. Everything the tool does afterwards leans on that having happened. By the time your actual command runs, props.Config is meant to be populated and the collector is meant to exist.

The reasonable assumption (the one I’d made, anyway) is that “persistent” means it cascades. Define it once at the root and every mytool foo bar three levels down gets it for free.

“Persistent” promises less than it says

Here is the catch, and it is a good one to file away if you ever build a command tree. cobra runs only the nearest PersistentPreRunE it finds, walking up from the command you actually invoked. If a subcommand defines its own, that one runs and the root’s does not. Not as well as. Instead of. There’s no warning and no error; the child’s hook simply wins, and the parent’s is passed over in silence.

So the moment any command below the root declared its own PersistentPreRunE, the entire start-up for that branch, the config, the logging, the telemetry, the update check, would just not happen. props.Config would be nil. The collector would be nil. The first you’d hear of it is a nil-pointer panic a long way from the cause, or, worse, no panic at all and a tool running happily unconfigured.

EnableTraverseRunHooks is cobra’s opt-in to the behaviour most people assume is already the default: run every PersistentPreRunE from the root down to the leaf, in order. I’d assumed it was the default. It is not, and I’d never turned it on.

A landmine nobody had stepped on

The saving grace was that nothing was actually broken yet. In go-tool-base’s own command tree, the root is the only command that defines a persistent pre-run, so “root to leaf” and “nearest only” happen to produce the identical result. The flag being off changed nothing I could observe.

The bug was latent. It was a trap laid for the first person to do something entirely reasonable: add a PersistentPreRunE to one of their own subcommands. go-tool-base is a framework other tools are built on, so that person was never going to be me. The instant a downstream author did the obvious thing, their config and telemetry would vanish for that branch of their tool and nothing would tell them why.

That is the kind of bug I least like shipping. It compiled. It passed the tests. It would have demoed perfectly. And it sat there waiting to hand a stranger a debugging session with no breadcrumbs, for the crime of using a standard cobra feature the obvious way.

One line, and a note for whoever’s next

The fix is the line I was so sure already existed (root.go):

// Run every parent PersistentPreRunE from root to leaf rather than only the
// closest one. Without this, a downstream subcommand that defines its own
// PersistentPreRunE silently shadows the root bootstrap (config load,
// telemetry, update check) for that subtree. With it set, the framework
// bootstrap always runs first and a child hook runs after it.
cobra.EnableTraverseRunHooks = true

With it set, cobra runs the root start-up first and then the child’s hook, in order, so a downstream command adds to the setup instead of replacing it.

I didn’t want to stop there, because the next author to add a child hook still deserves to understand the ordering. So the change also drops a one-time debug line if it spots any command in the tree carrying its own PersistentPreRunE (the same file), saying out loud what’s going on:

l.Debug("a downstream command defines its own PersistentPreRunE; " +
	"it runs AFTER the framework bootstrap (config load, telemetry, update check), not instead of it")

And, belt and braces, the collector now defaults to a no-op so the few paths that do legitimately return early, like init and help, still satisfy the “always non-nil” promise the rest of the code relies on. The whole thing shipped with a pair of regression tests that assert the bootstrap really does run when a child hook is present, and that it runs first. It’s all written up in a short spec and landed in one commit.

Trust, but grep

There are two things worth taking away. The cobra one is portable: if you rely on PersistentPreRunE cascading down a command tree, set EnableTraverseRunHooks, because “persistent” means less than it sounds and the nearest hook wins by default.

The other is the one I keep having to relearn. The settings I’m most certain about are the ones I never check, precisely because the certainty is what stops me looking. Somewhere along the line I’d promoted “I meant to” straight to “I did”, with nothing in between… and then defended it out loud before I’d even gone to look. A review agent is good at exactly that blind spot: it has no memory of intending to do something, only the code in front of it. The best thing the audit turned up wasn’t a clever bug. It was a flag that was never there. Leaving the campsite better than you found it has to include the traps nobody’s stepped on yet.