Lifecycle management for when your CLI grows up into a service

There’s a moment in the life of a lot of CLI tools where they stop being a CLI tool. Nobody quite decides it. It just happens. Someone needs the thing to also expose a little HTTP endpoint, or poll a queue, or run a scheduler, so it grows a serve command… and the honest command-line utility you wrote is suddenly a long-running service wearing a CLI as a hat.

And a service needs a whole pile of production plumbing that a one-shot command never did.

The command that stops being a command

go-tool-base is CLI-first. It is not CLI-only, and the reason is a pattern I’ve watched play out more times than I can count.

A tool starts its life as an honest command-line utility. It runs, it does its thing, it exits. Then someone needs it to expose a small HTTP endpoint. Or poll a queue. Or run a scheduler. So it grows a serve command, or a run command, and the moment it does, the thing that was a CLI tool is now a long-running service that happens to have a CLI bolted on the front.

And a long-running service needs a whole category of plumbing a one-shot command never did. It has to start things up in a sensible order. It has to shut them down gracefully when someone sends a SIGTERM, finishing in-flight work rather than dropping it on the floor. It has to tell an orchestrator whether it’s alive, and whether it’s ready. It has to do something sensible when one of its internal services quietly falls over at 3am.

Hand-rolled, that’s a few hundred lines of goroutine choreography, channel-wrangling and signal handling that every such tool reinvents, slightly differently and slightly wrong each time. It’s the first-afternoon problem all over again, just turning up later in the project’s life. So go-tool-base ships it: pkg/controls.

A controller and the things it controls

The model is small. A Controller manages any number of services, each of which satisfies a Controllable interface, which at heart is just a StartFunc and a StopFunc. An HTTP server, a background worker, a scheduler, anything with a “begin” and an “end”.

You register your services with the controller and it owns their collective lifecycle. They share a common set of channels (errors, OS signals, health, control messages) so the whole set can react together. A SIGTERM doesn’t get caught by one service off in a corner; it reaches the controller, and the controller takes everything down in order, each StopFunc handed a context with a deadline so that one sulking service can’t wedge the whole shutdown forever.
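To make the deadline-bounded shutdown concrete, here is a minimal sketch of the idea, not the actual pkg/controls implementation. The Service type, stopAll and runDemo are hypothetical names invented for illustration, and stopping in reverse registration order is an assumption about what "in order" means. Each stop call gets its own context with a timeout, and the loop moves on when that deadline passes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Service is a hypothetical stand-in for a Controllable: a name and a stop
// function. It is not the real pkg/controls type.
type Service struct {
	Name string
	Stop func(ctx context.Context) error
}

// stopAll stops services in reverse registration order (an assumption),
// giving each one its own deadline so a stuck Stop cannot block the rest.
func stopAll(services []Service, perService time.Duration) []error {
	var errs []error
	for i := len(services) - 1; i >= 0; i-- {
		svc := services[i]
		ctx, cancel := context.WithTimeout(context.Background(), perService)
		done := make(chan error, 1) // buffered so a late Stop never blocks on send
		go func() { done <- svc.Stop(ctx) }()
		select {
		case err := <-done:
			if err != nil {
				errs = append(errs, fmt.Errorf("%s: %w", svc.Name, err))
			}
		case <-ctx.Done():
			// The service ignored its deadline; record that and move on.
			errs = append(errs, fmt.Errorf("%s: %w", svc.Name, ctx.Err()))
		}
		cancel()
	}
	return errs
}

// runDemo registers three fake services, one of which never returns from Stop.
func runDemo() (stopped []string, errs []error) {
	record := func(name string) func(context.Context) error {
		return func(context.Context) error {
			stopped = append(stopped, name)
			return nil
		}
	}
	services := []Service{
		{Name: "http", Stop: record("http")},
		{Name: "worker", Stop: func(context.Context) error {
			select {} // ignores its deadline and hangs forever
		}},
		{Name: "scheduler", Stop: record("scheduler")},
	}
	errs = stopAll(services, 50*time.Millisecond)
	return stopped, errs
}

func main() {
	stopped, errs := runDemo()
	fmt.Println("stopped cleanly:", stopped)
	for _, err := range errs {
		fmt.Println("shutdown error:", err)
	}
}
```

In this sketch the hung worker costs the shutdown 50 milliseconds and one reported error, and the http service still gets its turn. A real controller would also have to decide what to do with the goroutine that never returned, which this sketch simply abandons.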

That ordering and timeout handling is the bit nobody enjoys writing and everybody needs. Centralising it means a tool that adds a second service later inherits correct coordinated shutdown for free, rather than discovering on its first production SIGTERM that it only half shuts down.

Probes, because something is usually watching

If the service ends up in Kubernetes (and a lot of them do) the orchestrator wants to ask two different questions, and they really are different questions.

Liveness: are you alive, or are you wedged and in need of a kill? Readiness: are you alive and able to take traffic right now? A service can quite easily be live but not ready… still warming a cache, still waiting on a dependency. Conflate the two and you get yourself killed during a slow startup, or sent traffic before you can actually serve it.

controls keeps them separate. You attach a WithLiveness probe and a WithReadiness probe to a service, each just a function returning a health report, and the controller exposes them. The tool answers Kubernetes honestly, in Kubernetes’ own terms, without you hand-wiring two more HTTP handlers.
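The difference between the two questions is easy to see in a small standard-library sketch. This is not how pkg/controls wires its probes: the Report and Probe types, the app struct and the /healthz and /readyz paths are all assumptions made for illustration (the paths are a common convention, nothing more). Each probe is just a function returning a report, and a tiny adapter turns an unhealthy report into a 503.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Report is a hypothetical health report; the real pkg/controls type may differ.
type Report struct {
	Healthy bool
	Detail  string
}

// Probe is any function that produces a report on demand.
type Probe func() Report

// probeHandler maps a report to 200 OK or 503 Service Unavailable.
func probeHandler(p Probe) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		r := p()
		if !r.Healthy {
			http.Error(w, r.Detail, http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, r.Detail)
	})
}

// app is a toy service that is live from the start but ready only once
// its (imaginary) cache has been warmed.
type app struct {
	ready atomic.Bool
}

// mux exposes the two probes on separate, assumed paths.
func (a *app) mux() *http.ServeMux {
	m := http.NewServeMux()
	m.Handle("/healthz", probeHandler(func() Report {
		return Report{Healthy: true, Detail: "alive"}
	}))
	m.Handle("/readyz", probeHandler(func() Report {
		if !a.ready.Load() {
			return Report{Healthy: false, Detail: "warming cache"}
		}
		return Report{Healthy: true, Detail: "ready"}
	}))
	return m
}

// statusOf calls a handler in-process and returns the status code it wrote.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	a := &app{}
	m := a.mux()
	fmt.Println("before warm-up:", statusOf(m, "/healthz"), statusOf(m, "/readyz"))
	a.ready.Store(true)
	fmt.Println("after warm-up: ", statusOf(m, "/healthz"), statusOf(m, "/readyz"))
}
```

Before warm-up the liveness check passes while readiness returns 503, which is exactly the "live but not ready" state: the orchestrator should hold traffic back, not kill the process.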

Self-healing, but only if you ask

The last piece is what happens when a service fails. A worker’s StartFunc returns an error. Health checks start failing. In a hand-rolled setup this is where you either crash the whole process or write yourself a bespoke restart loop.

controls has a supervisor that can restart a failed service for you, and the important word in that sentence is can. It’s off by default. A service is only supervised if you hand it a RestartPolicy at registration:

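controls.WithRestartPolicy(controls.RestartPolicy{
    MaxRestarts:            5,                // give up after this many restart attempts
    InitialBackoff:         time.Second,      // delay before the first restart
    MaxBackoff:             30 * time.Second, // ceiling for the growing delay
    HealthFailureThreshold: 3,                // consecutive failed health checks tolerated
})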

With a policy in place, the controller restarts the service if its StartFunc errors out, or if it racks up more consecutive health-check failures than the threshold allows. Restarts back off exponentially, from InitialBackoff up to a MaxBackoff ceiling, so a service that’s failing because its database is down doesn’t sit there hammering that database flat with a tight restart loop. MaxRestarts caps the attempts, because a service that’s failed five times in a row is not going to be rescued by a sixth go, and at that point honest failure beats a thrashing pretence of health.
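As a rough picture of what that backoff schedule looks like, here is a small sketch that computes the delay before each restart. It is a guess at the shape, not the supervisor's real code: the local policy type only mirrors the field names from the snippet above, and the doubling factor and the absence of jitter are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// policy is a local stand-in that borrows field names from RestartPolicy.
// It is not the pkg/controls type.
type policy struct {
	MaxRestarts    int
	InitialBackoff time.Duration
	MaxBackoff     time.Duration
}

// backoffFor returns the delay before restart attempt n (1-based). The delay
// doubles on each attempt (an assumed factor) and is capped at MaxBackoff.
func (p policy) backoffFor(attempt int) time.Duration {
	d := p.InitialBackoff
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= p.MaxBackoff {
			return p.MaxBackoff
		}
	}
	if d > p.MaxBackoff {
		return p.MaxBackoff
	}
	return d
}

// schedule lists the delay before each restart the policy permits.
func (p policy) schedule() []time.Duration {
	out := make([]time.Duration, 0, p.MaxRestarts)
	for n := 1; n <= p.MaxRestarts; n++ {
		out = append(out, p.backoffFor(n))
	}
	return out
}

func main() {
	p := policy{MaxRestarts: 5, InitialBackoff: time.Second, MaxBackoff: 30 * time.Second}
	fmt.Println(p.schedule()) // [1s 2s 4s 8s 16s]
}
```

Under these assumptions the five permitted restarts are spread over about half a minute, and the 30-second ceiling only comes into play if MaxRestarts is raised past five.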

Opt-in matters here. Automatic restarts are exactly right for a resilient daemon and exactly wrong for a tool where a failure should stop the line and get a human’s attention. The framework doesn’t make that call for you. It gives you the supervisor and lets you point it at the services that genuinely want it.

The bottom line

A surprising number of CLI tools become long-running services the day they grow a serve command, and the day they do, they need coordinated startup, graceful ordered shutdown, real liveness and readiness probes, and a considered answer to a service falling over. That’s a few hundred lines of fiddly, easy-to-get-wrong plumbing.

pkg/controls provides it: a Controller over Controllable services with shared channels and deadline-bounded graceful shutdown, separate Kubernetes-style liveness and readiness probes, and an opt-in supervisor that restarts failed services with exponential backoff and a restart ceiling. Your tool can start as a command and grow into a daemon without that growth turning into a rewrite.

CLI-first, but not stuck there.