Infrastructure on PHP Boy Scout

Two bugs that taught me the rules

Wed, 20 May 2026 00:00:00 +0000

Some bugs are interesting because they’re subtle. These two were interesting because they were the exact opposite… in each case the tool had a hard rule I simply didn’t know about, and its error message couldn’t be bothered to tell me what that rule was. Both came out of building the infrastructure toolchain, both cost me a good deal more time than they had any right to, and both are the sort of thing that looks blindingly obvious the moment you know it and utterly baffling until you do.

So here they are, written down, partly to save you the bother and partly so I don’t go and forget them myself.

Bug one: the rule-less job that skips your merge requests

The cicd gate components, in their first cut, shipped with no rules: block. They were dead simple jobs: lint, scan, validate. No conditions, because they should just always run. Obviously.

They ran on branch pipelines. On merge requests, they didn’t run at all! The gates that were the entire point of the components were simply absent from the one place you’d most want to see them… the merge request.

The cause is a GitLab CI rule that’s remarkably easy to go years without ever learning: a job with no rules: block runs only on branch and tag pipelines. It does not run on merge-request pipelines. So “no conditions” doesn’t mean “runs everywhere” at all. It means “runs everywhere except a merge request”, which is about the least intuitive default I can think of.

The fix is faintly absurd, and that’s exactly what makes it stick. You add an unconditional rule: rules: [{ when: on_success }]. The content of that rule does precisely nothing. It always matches. What actually matters is that the job now has a rules: block at all, because merely having one is what makes a job eligible for merge-request pipelines. A rule whose content is meaningless, added solely so the block exists. That’s the fix. I’ll admit I stared at it for a moment.
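
As a minimal sketch of that fix (the job name and script here are illustrative, not the real component):

```yaml
# Hypothetical gate job. Only the rules: block is the point.
tofu-fmt:
  script:
    - tofu fmt -check -recursive
  rules:
    # Always matches. Its content changes nothing; its presence is what
    # makes the job eligible for merge-request pipelines.
    - when: on_success
```

One side effect worth knowing: a job with unrestricted rules is eligible for both branch and merge-request pipelines, so a push to a branch with an open merge request can start two pipelines unless a workflow: rules block picks one.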

Bug two: the import block that only works at the root

The second one came from terraform-aws-security-baseline. The account-hardening module needed to adopt a resource that already existed in the account, which is exactly what OpenTofu’s import {} block is for. So an import block went into the account-hardening module, right next to the resource it was adopting. The natural home for it, surely.

OpenTofu disagreed, and rejected it outright. The rule: an import block is only allowed in the root module. It can’t live inside a child module. A module that wants one of its own resources imported can’t declare that import itself… the import has to be declared up at the root, and the root caller does the adopting.

The fix was to take the import block out of the module and document caller-side adoption instead. The module describes the resource, and the root configuration that calls the module is where the import actually lives.

Two unrelated bugs, in two completely different tools, and the same shape sitting underneath both of them.

In each case the tool has a hard structural rule. Where a block is allowed to live. What makes a job eligible for a particular kind of pipeline. And in each case the tool signalled that something was wrong (a rejected block, a set of jobs missing from the pipeline) without telling me which rule I’d broken, so the obvious next move (debugging my own logic) was the wrong move entirely. There was nothing wrong with the logic. The thing was simply in a place the tool doesn’t allow, or missing a block the tool quietly insists on.

The lasting lesson here isn’t the two specific rules, useful as they are to know. It’s the reflex. When something that should obviously work just doesn’t, and the error is unhelpful, stop debugging your logic and start suspecting a structural rule about where something is allowed to be, or whether a thing is eligible in the first place. GitLab CI and OpenTofu both have a handful of these, and you mostly learn them the hard way, by tripping over them. Knowing the shape of the category at least means the next one costs you an hour instead of a whole afternoon.

Worth remembering

Two bugs from building the toolchain, one shape. A GitLab CI job with no rules: block runs on branches and tags but silently not on merge requests, and the fix is an unconditional rules: block whose content does nothing and whose mere existence is the entire point. An OpenTofu import block gets rejected inside a child module, because imports are only legal at the root, so the caller adopts and the module just describes.

Neither error named the rule it was enforcing, and that’s the category to watch for. When sound logic fails against an unhelpful error, suspect a structural rule about where a thing may live or whether it’s even eligible… not a bug in what you actually wrote. It’ll save you an afternoon. It certainly cost me a couple.

Reviewed, then applied

Mon, 18 May 2026 00:00:00 +0000

The genuinely dangerous moment in infrastructure-as-code isn’t the apply. It’s the gap between the plan a human read and approved, and the change that actually runs a moment later. If those two are different computations (and by default they are) then nobody really reviewed the thing that touched your account. The infra repo closes that gap from both ends.

The gap between “reviewed” and “ran”

Here’s the moment in infrastructure-as-code where things go wrong.

Someone opens a merge request. CI runs tofu plan and the output is there to review: these three resources change, this one is destroyed. A human reads it, decides it’s correct, approves, merges. Then apply runs.

The trap is in what apply actually applies. If apply does its own fresh tofu plan and then applies that, the change that runs is not necessarily the change that was reviewed. State can have moved. A provider can have drifted. Someone else can have applied something in between. The reviewed plan and the applied change are two separate computations done at two different moments, and every difference between those moments is a change nobody looked at.

infra closes that gap from both ends.

Plan as an artifact

The first end is making the reviewed plan and the applied plan the same object.

The tofu-plan component runs the plan and saves it. It writes tfplan.cache, OpenTofu’s binary plan file, as a CI artifact. It also writes tfplan.json, which GitLab renders as a plan widget right in the merge request: the add, change and destroy summary, there to review without leaving the MR.

The tofu-apply component then does not re-plan. It applies that saved tfplan.cache. And OpenTofu itself enforces the safety net: applying a stale plan file, one captured against a state that has since moved, is rejected by the tool. So what reaches the account is provably the plan that was reviewed, or it’s nothing at all. There’s no third option where something unreviewed slips through.
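
A rough sketch of what a plan job along these lines could look like. The real tofu-plan component isn’t shown in this post, so the stage name and the jq step are assumptions; the create, update and delete counts follow the summary shape GitLab documents for its Terraform report, and the job assumes jq is available in the image.

```yaml
# Sketch only: not the actual tofu-plan component.
tofu-plan:
  stage: plan   # assumed stage name
  script:
    # Save the binary plan so apply can use this exact object later.
    - tofu plan -out=tfplan.cache
    # Reduce the plan to create/update/delete counts for the MR widget.
    - |
      tofu show -json tfplan.cache | jq '
        [.resource_changes[]?.change.actions[]?]
        | { create: (map(select(. == "create")) | length),
            update: (map(select(. == "update")) | length),
            delete: (map(select(. == "delete")) | length) }
      ' > tfplan.json
  artifacts:
    paths:
      - tfplan.cache
    reports:
      terraform: tfplan.json
```

The apply side is then just tofu apply tfplan.cache, with no plan step of its own.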

Applying is a human decision

The second end is when apply runs.

infra is trunk-based: it dropped the develop branch and works on main. But a naive trunk setup auto-applies every push to main, which means there’s no human gate at all, just whatever the last merge happened to contain.

So the gate is built explicitly. releaser-pleaser keeps a release merge request open against main. Ordinary merges to main run plans but apply nothing. The apply happens only when a person merges the release MR. Merging it cuts a release tag, and the tag pipeline is what runs tofu-apply, against the plan banked by the latest main pipeline.

The effect is that the act of applying to the account is the deliberate, visible act of merging the release request. Nothing reaches the account because a commit landed. It reaches the account because a person decided a release should go out and merged it. (Which, after the accidental v2.0.0 that kicked off the whole GitLab move, is a discipline I’d freshly relearned the value of.)
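
In GitLab CI terms, “only the tag pipeline applies” can be sketched roughly like this. The stage name is an assumption, and how the tfplan.cache banked by the main pipeline reaches the tag pipeline isn’t described here, so that part is left out.

```yaml
# Sketch only: not the actual tofu-apply component.
tofu-apply:
  stage: apply   # assumed stage name
  rules:
    # Run only in tag pipelines, that is, when a release was cut.
    - if: $CI_COMMIT_TAG
  script:
    # Applies the saved plan. No fresh plan is computed here.
    - tofu apply tfplan.cache
```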

The guard on the gate

There’s one more piece, because a gate is only as good as its precondition.

A verify-main-plan job blocks the release MR from being mergeable unless the latest main pipeline is green. You can’t cut a release, and therefore can’t apply, on top of a main whose plan didn’t even succeed. The human gate has its own gate: the thing you’re about to merge has to be standing on a known-good plan before you’re allowed to merge it.

The bottom line

The risk in infrastructure-as-code is the gap between the plan a human reviewed and the change that runs, because a re-plan at apply time is a different computation from the one that was approved.

infra closes it twice over. tofu-plan saves the plan as a tfplan.cache artifact and renders it as a merge-request widget; tofu-apply applies that exact artifact, and OpenTofu rejects it outright if the state has moved underneath it. And applying is gated on a human merging a releaser-pleaser release request, not on a push, with a verify-main-plan check making sure that request can only be merged on top of a green plan. What gets applied is what was reviewed, when a person decided it should be.

One graph, not micro-stacks

Sun, 17 May 2026 00:00:00 +0000

Once an infrastructure repo has a few concerns in it (account hardening, the security baseline, the signing stack still to come) there’s a steady pressure to split them into separate stacks with separate state, and Terragrunt is right there to help you do it. The infra repo keeps everything in one OpenTofu graph instead. The reason comes down to who enforces your dependency ordering: the engine, or you.

The pressure to split

The infra repo’s src/ has several concerns in it, and more coming, the signing stack among them. Once a repo reaches that point, there’s a steady pressure to split: one stack per concern, each with its own state file.

It’s an appealing pressure. Separate stacks feel modular. Each apply touches less, so the blast radius of any one run is smaller. And Terragrunt exists, popular and well-regarded, precisely to orchestrate a fleet of separate stacks. The path is well trodden.

infra didn’t take it. src/ is a single OpenTofu root stack: each concern is a module block, in its own main.<concern>.tf file, all sharing one state and one graph.

What one graph gives you

The thing a single graph gives you is engine-enforced truth about ordering and data.

Inside one OpenTofu graph, the tool builds the full dependency DAG itself. When the signing stack needs a value the security baseline produced, you reference it directly, module.baseline.something, and OpenTofu guarantees two things: the baseline is created before the thing that depends on it, and the value handed across is the current one from this same apply. Ordering and data-passing aren’t things you arranged. They’re facts the engine checks and enforces, every plan, every apply.

What splitting costs

Split src/ into per-concern stacks with separate state, and that guarantee is the thing you spend.

Now one stack reads another’s outputs through terraform_remote_state. That’s a lookup of a snapshot: the other stack’s last applied state, whatever it was, whenever that was. It’s not a live edge in a graph. Ordering is no longer enforced by the engine either; it becomes something you arrange yourself, in CI stage sequencing or in Terragrunt’s own dependency blocks.

That’s the trade, stated plainly. You give up a strong, engine-checked guarantee, and you buy back a weaker, hand-arranged imitation of it. Terragrunt is a good tool for managing that weaker world tidily. But the question worth asking first is whether you should be in the weaker world at all.

When splitting is genuinely right

This isn’t an argument that splitting is always wrong. Separate states genuinely earn their place when concerns have different change cadences, different access boundaries, or different teams owning them: when you actively want an apply of one to be unable to touch another, and you want different people holding different state.

infra has none of those. It’s a single account, a single operator, one cohesive set of concerns. The only thing splitting would buy here is a smaller per-apply blast radius, and that’s better handled by reviewing the plan before it applies, which the next post is about, than by fragmenting the dependency graph. So src/ stays one graph, and Terragrunt was considered and deliberately not adopted.

If ordering between graphs is ever needed

If infra ever does genuinely need more than one stack, the plan isn’t Terragrunt. It’s to keep each stack a single strong graph internally, and to sequence the stacks with CI stages. Keep the engine-enforced guarantee where it’s strongest, inside each graph, and reach for hand-arranged ordering only at the one seam where it’s unavoidable.
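
A hypothetical sketch of that seam, assuming two stacks that each live in their own directory with their own state (the directory and stage names are made up for illustration):

```yaml
# Hypothetical layout: two root stacks, sequenced by CI stages.
stages:
  - baseline
  - signing

apply-baseline:
  stage: baseline
  script:
    - tofu -chdir=stacks/baseline apply tfplan.cache

apply-signing:
  # Starts only after every job in the baseline stage has succeeded.
  stage: signing
  script:
    - tofu -chdir=stacks/signing apply tfplan.cache
```

The ordering between the two stacks is arranged by hand here, which is exactly the weaker guarantee; inside each stack the graph still does the work.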

Boiling it down

A multi-concern infrastructure repo feels like it should be split into per-concern stacks, and Terragrunt is right there to manage the result. infra keeps src/ as one OpenTofu graph instead.

Inside one graph, OpenTofu enforces dependency ordering and passes current values across module boundaries as checked facts. Split into separate states and that becomes a terraform_remote_state snapshot lookup plus ordering you arrange by hand: a weaker version of what you gave up. Splitting is right when concerns have different cadences, boundaries or owners; for a single-account, single-operator repo none of that applies, so the strong guarantee is worth keeping, and Terragrunt is the tool for a problem infra chose not to have.

CI you include, not copy

Sat, 16 May 2026 00:00:00 +0000

Every infrastructure repo runs the same CI: lint the OpenTofu, scan it, validate it, plan, apply. The first repo, you write that .gitlab-ci.yml by hand. The second, you copy it. By the third, you’ve got three copies of the same pipeline quietly drifting apart, which is the exact problem you’d never tolerate in application code. The cicd repo is the fix, and it’s just the library-first instinct pointed at the pipeline.

The `.gitlab-ci.yml` you keep copying

The infrastructure repos in this series all run the same CI gate jobs: format and validate the OpenTofu, lint it, scan it for security issues and secrets, and on the deploy side, plan and apply.

The first repo, you write that .gitlab-ci.yml by hand. The second repo needs the same jobs, so you copy it. The third repo, you copy it again. Now there are three copies of the same pipeline, and they do what copies always do. They drift. A fix you make in one repo’s CI doesn’t reach the other two. A tightened scan rule lands in the repo you were working in and nowhere else. It’s the copy-paste problem, exactly as it shows up in application code, just written in YAML and therefore that bit easier to pretend isn’t code.

GitLab has a feature for exactly this

GitLab CI/CD Components are the answer to that problem. A component is a reusable, versioned piece of pipeline that you publish, and other projects pull in with an include: pinned to a version:

```yaml
include:
  - component: gitlab.com/phpboyscout/cicd/tofu-lint@v0.5.0
```

That’s a library import, but for pipeline code. The component has a defined interface, a version, and a home in GitLab’s CI/CD Catalog. A consuming repo includes it instead of carrying its own copy, and when the component improves, the consumer moves a version pin rather than re-copying YAML.
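
The “defined interface” part shows up on the consumer side as inputs. A small sketch, with input names that are purely hypothetical (the real ones are whatever the component’s spec declares):

```yaml
include:
  - component: gitlab.com/phpboyscout/cicd/tofu-lint@v0.5.0
    inputs:
      # Hypothetical input names, for illustration only.
      stage: lint
      working_directory: src
```

Upgrading is then a one-line change to the @v0.5.0 pin.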

Why a monorepo of components

The cicd repo holds all of the components together: tofu-lint, tofu-security, tofu-validate, tofu-plan, tofu-apply, and more. One project, not one project per component.

That’s a deliberate call, and the reason is how GitLab versions things. A version is a tag, and a tag belongs to a project. A component’s version is its project’s tag. So a monorepo of components, versioned together as one tag stream, is the natural unit: a consumer pins @v0.5.0 and gets a known-good set of components that were tested together, rather than juggling a separate version for each one.

Authoring discipline

A component is a file under templates/, and it opens with a spec: inputs: block: the typed inputs, their defaults, the component’s public interface.

The discipline that keeps the library usable is that a component must be consumer-agnostic. It never hardcodes a token, and it never names a particular consumer’s variable. Inputs have sensible defaults, and a consuming repo overrides them. A component that reaches out and assumes something about the repo including it is a component that works in one repo and surprises the next. An authoring guide in the repo keeps that consistent across everyone who adds a component.
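
To make that concrete, here is a sketch of what a consumer-agnostic component file might look like. The file name, input names and script are assumptions for illustration, not the contents of the real tofu-lint component.

```yaml
# templates/tofu-lint.yml (hypothetical contents)
spec:
  inputs:
    stage:
      type: string
      default: test
    working_directory:
      type: string
      default: "."
---
tofu-lint:
  stage: $[[ inputs.stage ]]
  script:
    # Everything repo-specific arrives through an input, never a
    # hardcoded path or a consumer's own variable name.
    - tflint --chdir="$[[ inputs.working_directory ]]"
  rules:
    - when: on_success
```

The defaults mean a consumer can include it with no inputs at all and override only what differs.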

The self-test you cannot fully write

The cicd repo tests its own components with a self-test pipeline. It’s worth knowing where that self-test stops.

When a repo tests its own components by running them in child pipelines, GitLab masks $CI_PIPELINE_SOURCE as parent_pipeline. A component’s rules:, which often branch on the pipeline source to behave differently for a merge request than for a branch or a tag, therefore can’t be exercised honestly by the self-test: the source they’d branch on has been flattened. The self-test covers what it can, and the component rules: are, in the end, validated by real consumers using them for real. That’s a genuine limit, and naming it is better than pretending the self-test proves more than it does. (It’s also, not coincidentally, the exact rules: quirk that bit me in one of the two bugs I closed the series with.)
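
A sketch of the kind of rules: that the self-test can’t reach (the job itself is hypothetical):

```yaml
# Hypothetical component job whose behaviour depends on the pipeline source.
tofu-plan:
  script:
    - tofu plan -out=tfplan.cache
  rules:
    # Meant for merge requests. In a self-test child pipeline
    # $CI_PIPELINE_SOURCE is "parent_pipeline", so this never matches there.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Meant for ordinary pushes. Same problem in a child pipeline.
    - if: $CI_PIPELINE_SOURCE == "push"
```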

The same instinct, again

This blog keeps circling the same instinct. go-tool-base exists because the same CLI scaffolding kept getting rewritten, so it was extracted into a library. cicd is that instinct pointed at the pipeline: the same gate jobs kept getting copied between repos, so they were extracted into a versioned, included library.

Stop copy-pasting. Publish, version, include. It’s true for CLI code, and it turns out to be just as true for the YAML that builds and ships it.

The gist

Every infrastructure repo needs the same CI, and copying the .gitlab-ci.yml between them produces copies that drift apart. GitLab CI/CD Components fix it: reusable, versioned pipeline that a repo include:s and pins, instead of carrying its own copy.

cicd is a monorepo of those components, versioned together as one tag stream, because GitLab tags a project and a component’s version is its project’s tag. Components are authored consumer-agnostic, with typed spec: inputs: and no hardcoded assumptions, and their rules: are validated by real use because the self-test can’t see the pipeline source. It’s the library-first instinct, applied to CI: publish it once, include it everywhere, fix it in one place.

One image for the whole toolchain

Fri, 15 May 2026 00:00:00 +0000

Every CI gate job across the infrastructure repos reaches for the same pile of tools: OpenTofu, tflint, trivy, checkov, gitleaks, terraform-docs, the AWS CLI. Installing that pile per job is both slow and quietly dangerous, because nothing pins it consistently. infra-tools is the obvious fix (one image, one source of truth for versions), but two of its build decisions are less obvious and worth a look: it publishes with crane instead of a second build, and it deliberately lets its own vulnerability scan fail.

The same pile of tools, in every repo

Every infrastructure repo in this series runs the same CI gate jobs: format and validate the OpenTofu, lint it, scan it for security problems and secrets, check the docs. Those jobs need a specific set of tools, and it’s the same set in every repo.

Install them per job and you pay twice. You pay in time, because every pipeline downloads and installs the whole set again. And you pay in drift, because unless every repo pins every tool identically, the repos slowly diverge on which version of trivy or tflint they actually run, and a check that passes in one repo fails in another for no reason anyone can see.

One image, one source of truth

infra-tools is the answer: a single Debian-based container image with the whole toolchain baked in. Every CI job in every repo uses it with one image: line.

The real value isn’t the convenience. It’s that the image is the one place tool versions are pinned. The Go-based tools are pinned in a mise.toml. checkov, which has no mise plugin, is pinned in a requirements file installed with pipx. The AWS CLI is pinned by a build argument. Three mechanisms, because the tools come from three kinds of source, but one image, and every pin wired to Renovate so a version bump arrives as a reviewable merge request. There’s exactly one answer to “what version of trivy does the toolchain use”, and it lives here.

Publishing with crane, not a second build

A build-pipeline detail that took a real bug to discover.

The pipeline builds the image with kaniko, which builds images without a privileged Docker daemon, something that matters a great deal on shared CI runners. Then it scans the image, then it publishes it.

The obvious way to write the publish stage is “build the image and push it”. But kaniko has no mode for “just push this tarball I already built”. A second kaniko invocation re-executes the entire Dockerfile from the top, including a second mise install, which makes a fresh round of calls to GitHub’s API to fetch tools. GitHub’s anonymous API limit is low and shared by IP, so on a CI runner that second install reliably trips a 403 rate-limit. (Yes, another 403. They do get everywhere.)

So the publish stage doesn’t rebuild. It uses crane to push the exact image tarball the build stage already produced. The image is built once. And because the published bytes are the same bytes the scan stage scanned, there’s no gap between “the image we checked” and “the image we shipped”.
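
A sketch of the build-once, push-the-tarball shape. The job names, image references and tag scheme are assumptions rather than the repo’s actual pipeline; the kaniko and crane flags are the standard ones for saving and pushing a tarball.

```yaml
# Sketch only: names, images and tags are illustrative.
build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # Build to a tarball and push nothing yet.
    - >-
      /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
      --no-push
      --tar-path image.tar
  artifacts:
    paths:
      - image.tar

publish:
  stage: publish
  image:
    name: gcr.io/go-containerregistry/crane:debug
    entrypoint: [""]
  script:
    - crane auth login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # Push the exact tarball the build stage produced. No Dockerfile runs here.
    - crane push image.tar "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```

The scan stage sits between the two and reads the same image.tar artifact.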

Soft-failing the scanner on purpose

The decision that looks wrong until you see the reasoning: the pipeline scans the image with trivy, and trivy is allowed to fail without failing the pipeline.

A vulnerability scanner that doesn’t gate the build sounds like a scanner switched off. It isn’t. It’s a scanner pointed at something it can’t helpfully gate.

The tools in the image are prebuilt Go binaries. trivy inspects them, reads the version of the Go runtime each was compiled with, and reports every known CVE in that Go runtime. Those findings are real, but they aren’t mine to fix. The only fix is the upstream tool rebuilding itself against a patched Go. With seven such tools in the image, at any given moment one of them is usually a little behind on its Go version.

A hard gate would mean the image becomes unpublishable whenever any single upstream lags, over a CVE in code I don’t own and can’t patch. That’s not a security control; it’s a way to be unable to ship. So the scan job is marked allow_failure. The findings stay fully visible, and the residual count is genuinely useful as a metric for how far behind upstream the toolchain has drifted. It just doesn’t block shipping an image whose only “vulnerabilities” are other people’s build timelines.
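
Roughly, in GitLab CI terms (the job name, tarball name and severity filter are assumptions, and the job assumes trivy is on the PATH of whatever image runs it):

```yaml
# Sketch only: thresholds and names are illustrative.
scan:
  stage: scan
  script:
    # A non-zero exit marks the job as failed when findings exist...
    - trivy image --input image.tar --severity HIGH,CRITICAL --exit-code 1
  # ...but the pipeline carries on, with the job shown as a warning.
  allow_failure: true
```

Keeping the non-zero exit code is what keeps the findings visible: the job still goes amber, it just doesn’t stop the publish stage.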

What it comes down to

The infrastructure repos all run the same CI gate jobs, needing the same tools, so infra-tools bakes the whole toolchain into one image and pins every version in one place, wired to Renovate.

Two build choices are worth copying. The publish stage uses crane to push the already-built, already-scanned tarball, because a second kaniko build would re-run mise install and hit GitHub’s anonymous rate limit, and because pushing the scanned bytes means shipping exactly what was checked. And the trivy scan is deliberately allow_failure, because it reports Go-runtime CVEs in prebuilt upstream binaries that no change to this repo can fix, so a hard gate would only make the image unshippable over someone else’s lag.

A 403 you can't fix in IAM

Thu, 14 May 2026 00:00:00 +0000

The OIDC post explained the handshake that lets a GitLab pipeline deploy to AWS with no stored key. This is the story of the first time I got it wrong, and spent an afternoon fixing the wrong thing. The error was a flat 403 from AWS, and the maddening part is that no amount of editing the IAM policy was ever going to fix it.

A 403 on the first real run

The OIDC post covered the handshake: GitLab CI mints a signed token, AWS exchanges it for short-lived credentials against a role whose trust policy names the pipeline. During the GitLab migration I wired exactly that up for the infra repo, including a trust policy condition meant to let merge-request pipelines run a plan.

The first merge request that should have produced a plan didn’t. The tofu-plan job ran, failed, and the error from AWS was a flat AccessDenied. A 403.

The instinct, and why it wastes an afternoon

The instinct on an IAM 403 is immediate and almost always right: the policy’s wrong, so go and edit the policy. Tighten the condition. Loosen the condition. Check the wildcard. Re-read the sub pattern character by character.

All of that was wasted, and it was wasted for a reason that took me far too long to see. The trust policy wasn’t matching the wrong value. It was matching a value that does not exist. No amount of editing a condition makes it match a thing that’s never present.

What is actually in the token

GitLab’s OIDC token has a sub claim that encodes the pipeline’s context, and part of that encoding is a ref_type. I’d assumed ref_type could be branch, tag, or mr, because a pipeline can certainly be a branch pipeline, a tag pipeline, or a merge-request pipeline. So the trust policy, for the plan job, matched a sub containing ref_type:mr.

That assumption was wrong. GitLab’s ref_type is branch or tag. That’s the entire set. There is no mr.

A merge-request pipeline doesn’t run against a merge-request ref. It runs against the source branch. So its token’s sub carries ref_type:branch, like any other branch pipeline. The trust policy condition asked for ref_type:mr, GitLab never puts mr in a token, the condition was therefore never true, and every merge-request pipeline got a 403. Forever, until the policy stopped asking for a claim that isn’t real.

The fix, and the lesson worth more than the fix

The fix is small once it’s visible: match ref_type:branch and narrow it down by branch name or project path instead. An afternoon of policy edits, and the actual change is one word.

The lesson is the part worth keeping. When an OIDC trust fails, the useful question is never “is my policy clever enough”. It’s “what’s actually in the token”. An OIDC trust policy can only ever match the claims the identity provider genuinely asserts, and the gap between what a provider asserts and what you assumed it asserts is precisely where this class of bug lives.

So the move, when an OIDC handshake 403s, is to get hold of a real token and decode it. Look at the actual sub, the actual claims, the actual values. Match what’s there. A 403 that survives every sensible edit to the policy is usually not a policy that’s too loose or too strict. It’s a policy matching a claim that was never going to be in the token.
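
One way to do that is a throwaway job that prints the token’s claims. This is a sketch under assumptions: the job and variable names are made up, the audience is a placeholder, and the image is assumed to have base64 and jq. It prints only the decoded payload, never the signed token itself.

```yaml
# Hypothetical debugging job. Remove it once the trust policy matches.
inspect-oidc-token:
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.com   # assumed audience; use what your role expects
  script:
    - |
      # A JWT is header.payload.signature; take the payload (base64url).
      payload=$(printf '%s' "$GITLAB_OIDC_TOKEN" | cut -d. -f2 | tr '_-' '/+')
      # base64url drops padding, so restore it before decoding.
      while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
      # Print only the claims a trust policy is likely to care about.
      printf '%s' "$payload" | base64 -d | jq '{sub, aud, ref_type, ref}'
```

Run from a merge-request pipeline, the sub comes out in the form project_path:<group>/<project>:ref_type:branch:ref:<source-branch>, which is the string the trust policy actually has to match.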

The habit it left behind

I wired an OIDC trust policy to let merge-request pipelines plan, by matching a sub claim with ref_type:mr. The first real merge request got a 403, and no edit to the policy fixed it, because GitLab’s ref_type is only ever branch or tag. A merge-request pipeline runs on a branch ref, so the mr value the policy demanded was never in any token.

The fix was one word. The habit it left behind is the valuable bit: when an OIDC trust fails, stop editing the policy and go and read a real token. A trust policy can only match what the provider actually asserts, and “what I assumed it asserts” is where the 403 was hiding the whole time. (If this shape of bug feels familiar by the end of the series, that’s not an accident: I come back to it with two more from exactly the same family.)

Routing security findings without the noise

Tue, 12 May 2026 00:00:00 +0000

Turning on GuardDuty and Security Hub gives you threat detection. It also gives you a firehose. And an alert system that dutifully forwards everything in that firehose isn’t monitoring, it’s a very efficient way of training your team to ignore alerts. So the alerts module’s real job isn’t detection at all. It’s deciding what’s actually worth interrupting a human for, and the interesting part is everything it deliberately throws away.

Detection is the easy half

Switching on threat detection in an AWS account is a few resources. GuardDuty, Security Hub with its standards, IAM Access Analyzer: the security baseline does exactly that. From then on, the account is generating findings.

And it generates a lot of them. Plenty are low-severity, informational, or simply the normal texture of a cloud account. If you wire every finding to an email or a pager, you haven’t built monitoring. You’ve built noise. And noise has a specific failure mode: people stop reading it, and the one finding that genuinely mattered scrolls past unread alongside two hundred that didn’t.

So the valuable work isn’t detection. It’s routing: deciding what’s worth interrupting a human for, and letting the rest sit quietly in a console for whenever someone reviews it.

Forward the severe, leave the rest

The alerts module routes findings with EventBridge rules into an SNS topic that emails out. The rules are deliberately picky. GuardDuty findings are forwarded only at severity 7 and above. Security Hub findings are forwarded only at HIGH and CRITICAL.

Everything below those thresholds isn’t discarded. It’s still in GuardDuty and Security Hub, where someone doing a review will see it. It just doesn’t get to interrupt anyone’s day. The threshold is the line between “look at this now” and “look at this sometime”.

The duplicate you would otherwise send twice

Here’s the subtle one, and it’s the kind of thing you only find by looking closely at where findings come from.

Security Hub is an aggregator. It pulls findings in from other services, GuardDuty among them. So a single GuardDuty finding can show up in two places: in GuardDuty itself, and again in Security Hub as an aggregated copy.

A rule on GuardDuty findings and a rule on Security Hub HIGH/CRITICAL findings would therefore both fire for the same underlying GuardDuty finding. One event, two emails. Do that across an account and a meaningful fraction of your alert volume is just the same findings counted twice, which is its own kind of noise.

So the Security Hub rule explicitly excludes findings whose ProductName is GuardDuty, with an anything-but match. GuardDuty findings come through the GuardDuty rule. The Security Hub rule handles everything Security Hub adds that GuardDuty didn’t already report. One finding, one alert, regardless of how many services it passed through.
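
The two patterns, sketched. EventBridge event patterns are JSON; they are written as YAML here purely for readability, and the module’s real patterns aren’t shown in this post, so treat the field names as my best reading of the documented GuardDuty and Security Hub event shapes rather than a copy of the module.

```yaml
# Rule 1: GuardDuty findings at severity 7 and above.
source: ["aws.guardduty"]
detail-type: ["GuardDuty Finding"]
detail:
  severity:
    - numeric: [">=", 7]
---
# Rule 2: Security Hub HIGH/CRITICAL, minus anything GuardDuty produced.
source: ["aws.securityhub"]
detail-type: ["Security Hub Findings - Imported"]
detail:
  findings:
    Severity:
      Label: ["HIGH", "CRITICAL"]
    ProductName:
      # The anything-but match is what removes the duplicate.
      - anything-but: "GuardDuty"
```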

Two tripwires on the root account

Findings are about threats the detectors recognise. The module adds two alarms about something simpler: the root account doing anything at all.

One CloudWatch alarm fires on a root console sign-in. The other fires on any root API call that isn’t a console login. In a well-run AWS account, the root user does almost nothing after initial setup: day-to-day work happens through roles. So root activity isn’t a “finding” to be assessed for severity. It’s a tripwire. Any of it, in an account that should be silent, is worth an immediate look, and the two alarms say so directly.

Why a quiet alert stream matters here

This is monitoring for the account that’s going to hold the release-signing key, and that raises the stakes on getting the routing right.

If a key-bearing account ever does come under attack, the alert that says so has to be seen. An alert stream that’s mostly noise and duplicates is, functionally, no alerting at all, because the people who’d act on it have long since tuned it out. Routing the stream down to “severe, deduplicated, plus root tripwires” is what keeps it something a human will still read on the day it finally matters.

The short version

GuardDuty and Security Hub make detection easy. The hard, valuable part is routing: forwarding what deserves to interrupt someone and leaving the rest in a console.

The alerts module forwards GuardDuty at severity 7-plus and Security Hub at HIGH/CRITICAL, and it drops the duplicate that aggregation creates by excluding GuardDuty-sourced findings from the Security Hub rule, so one finding is one alert. Two CloudWatch alarms act as tripwires on root-account activity, which should be near-zero. For the account that will hold the signing key, a quiet, trustworthy alert stream isn’t a nicety. It’s the difference between monitoring and theatre.

Why I hand-rolled every module

Sun, 10 May 2026 00:00:00 +0000

There are well-known community module libraries for AWS: Cloud Posse, the terraform-aws-modules collection, plenty more. Both terraform-aws-bootstrap and terraform-aws-security-baseline use almost none of them. Every sub-module is hand-rolled from raw AWS resources, and before you accuse me of not-invented-here syndrome (a perfectly fair first guess), hear me out, because the same evaluation kept landing the same way for a real reason.

The promise of a wrapper module

The community module ecosystem makes an appealing offer. Don’t write raw aws_s3_bucket and aws_s3_bucket_policy and aws_s3_bucket_public_access_block and the rest. Call a tested, popular module, pass it a handful of inputs, and get a correct, well-configured bucket. Less code in your repo, and the code you don’t write has been exercised by thousands of other users.

For a lot of infrastructure that’s a genuinely good deal, and I take it often. For the two infrastructure modules in this series, I took it almost never. Every sub-module is built from raw AWS resources. That wasn’t a reflex. It was the same evaluation, made over and over, landing the same way.

What kept going wrong

For each place a wrapper module could have fitted, I looked at the wrapper. And the recurring finding was one of two things. Either using the wrapper correctly, with all the overrides my posture needed, came to more configuration than the raw resources would have. Or the wrapper’s abstraction leaked the instant I needed something it hadn’t anticipated, and I was now writing code to fight it.

The CloudTrail bucket, concretely

The clearest example is the bucket that holds CloudTrail logs.

There are popular modules that set up CloudTrail and bundle an S3 bucket for the logs. Convenient. But that bundled bucket isn’t the bucket I want. It doesn’t carry lifecycle { prevent_destroy = true }, and its bucket policy is weaker than the one the state bucket taught me to want: TLS-only, SSE-KMS-only, wrong-key-denied.

So to use the wrapper I had two options. Accept a weaker audit-log bucket than the rest of the account, which rather defeats the point of an audit log. Or fight the wrapper: disable its bucket, create my own, wire it back in. Fighting the wrapper is more work than simply writing the fifty-odd lines of raw aws_s3_bucket plus policy that give me exactly the posture I’d already designed once. The wrapper didn’t save code. It added a negotiation.

A wrapper is a deal, and deals have terms

This isn’t an argument that community modules are bad. It’s an argument about when the deal is good.

A wrapper module is a good deal while its abstraction holds: while what it assumes you want matches what you want. The moment you need something it didn’t anticipate, the deal inverts. Now you’re working against the abstraction, and an abstraction you’re fighting costs more than no abstraction at all. (Regular readers will recognise that line from the LangChain argument; it’s the same principle in a very different language.)

Infrastructure that holds signing keys is precisely the case where you need to control the specifics: every encryption setting, every lifecycle rule, every line of every bucket policy. That’s a domain where wrapper abstractions leak fast, because the whole job is the details the wrapper smoothed over.

The cost, paid on purpose

Hand-rolling isn’t free. It’s more lines of HCL in the repo, up front, than a one-line module call.

What those lines buy is worth the price for this kind of infrastructure. There’s no transitive module-version churn to track. There’s no abstraction between me and the resource when something behaves oddly. And every line is one I can read, and defend, in a security review, because I wrote it and it says exactly what it does. For a foundation that will hold the most sensitive key in the system, “readable and mine” beats “short and someone else’s”.

That’s a deliberate trade, not a universal rule. For an internal tool on a deadline, reach for the wrapper. For the security-critical base of everything else, the raw resources won every time I checked.

To sum up

The community module ecosystem offers less code that more people have tested, and for plenty of infrastructure that’s the right call. For terraform-aws-bootstrap and terraform-aws-security-baseline it almost never was, because each wrapper turned out to be more configuration than the raw resources once my posture was accounted for, or it leaked the moment I needed a specific.

The CloudTrail log bucket is the pattern in miniature: the bundled bucket lacked prevent_destroy and a strong policy, so using the wrapper meant either a weaker bucket or fighting the module. A wrapper is a good deal while its abstraction holds and a bad one the moment you fight it, and security-critical foundation infrastructure is all specifics. Hand-rolling cost more lines and bought code I can read and defend. For this, that was the trade worth making.

Hardening the account that will hold the keys

Sat, 09 May 2026 00:00:00 +0000

Bootstrapping the account got it ready: somewhere to store state, an identity to deploy as, enough for the next tofu apply to run. Ready is not the same as safe. An account with no audit trail, nothing watching it, and no considered way for a human to get in is fine for experimenting and absolutely not where you put the most sensitive key in the system. So before the signing key goes anywhere near it, the account gets a security baseline.

Ready is not the same as safe

The bootstrap post ended with an account that was ready: it had somewhere to store state and a CI identity to deploy as. The next tofu apply could run.

Ready is not safe. That account still has no audit trail, so nobody could tell you afterwards what happened in it. It has no threat detection, so nothing is watching. Its defaults are AWS’s defaults, which are not a security posture. There’s no considered way for a human to get in. An account in that condition is fine for experimenting. It’s not somewhere you put the most sensitive key in the whole system.

So before the signing key is anywhere near it, the account gets a security baseline.

The baseline, in one downstream stack

terraform-aws-security-baseline is that baseline, and it’s exactly the downstream stack the bootstrap post promised: applied through the automation role bootstrap created, not bootstrapped specially.

It’s six sub-modules, each behind an enable_* toggle: account-hardening (IAM password policy, account-wide S3 public-access blocking, default EBS encryption), audit-logging (a multi-region CloudTrail with log-file validation), aws-config, threat-detection (GuardDuty, Security Hub, IAM Access Analyzer), alerts, and operator-role. Together they turn a bare account into one that records what happens, watches for trouble, and controls who gets in.
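
The toggle pattern is the familiar count-on-a-module idiom. A minimal sketch for one of the six, with a hypothetical module path and a default I’ve assumed:

```hcl
variable "enable_audit_logging" {
  type    = bool
  default = true # assumed default, for illustration
}

module "audit_logging" {
  source = "./modules/audit-logging" # hypothetical path
  # Zero instances when the toggle is off, one when it is on.
  count = var.enable_audit_logging ? 1 : 0
}
```

Because count turns the module into a list, anything that reads its outputs has to index it, for example module.audit_logging[0], and cope with the list being empty.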

Most of those are the expected baseline. The operator role is the one worth slowing down on, because it’s built backwards from how people usually think about an admin role.

The operator role, and the inversion

InfraAdmin is the human way into the account: the role a person assumes to do operator work. Two things define it.

The trust policy decides who may assume it. It trusts only the account root principal, and it requires multi-factor authentication: the assume call must carry aws:MultiFactorAuthPresent, and aws:MultiFactorAuthAge bounds how recently that MFA was performed. No MFA, no role. So far this is a careful but ordinary admin role.
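
A trust policy along those lines might look like the sketch below. The one-hour MFA age bound and the resource names are assumptions of mine; the post only says that a bound exists.

```hcl
data "aws_caller_identity" "current" {}

variable "mfa_max_age_seconds" {
  type    = number
  default = 3600 # hypothetical bound on how stale the MFA may be
}

data "aws_iam_policy_document" "infra_admin_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    # Trust only principals in this account, via the account root principal.
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }

    # No MFA on the calling session, no role.
    condition {
      test     = "Bool"
      variable = "aws:MultiFactorAuthPresent"
      values   = ["true"]
    }

    # And the MFA must be recent.
    condition {
      test     = "NumericLessThan"
      variable = "aws:MultiFactorAuthAge"
      values   = [tostring(var.mfa_max_age_seconds)]
    }
  }
}

resource "aws_iam_role" "infra_admin" {
  name               = "InfraAdmin"
  assume_role_policy = data.aws_iam_policy_document.infra_admin_trust.json
}
```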

The inversion is a second, separate inline policy, and it’s almost entirely Deny. It denies, using NotAction, anything where aws:RequestedRegion falls outside an allowed set of regions. The role’s power comes from an admin grant. This inline policy fences that power.

That’s the part worth holding onto. People picture an admin role as a list of what it can do. This one is better understood by what it cannot: it cannot act outside its permitted regions, full stop. A fat-fingered command, or a compromised session, cannot quietly spin resources up in some region nobody’s watching. The fence is as much the point of the role as the grant is.

The carve-out, because honesty

There’s a fiddly detail, and it’s the kind of thing that makes the region fence real rather than theoretical.

Some AWS services are global. IAM, CloudFront, Route 53 and friends have no region of their own: their calls go to a single global endpoint, and for IAM AWS documents aws:RequestedRegion as always reading us-east-1, wherever you think you’re working. A naive region-deny whose allowed set leaves that region out would therefore deny calls to IAM, and you’d lock yourself out of the very service you manage access with. (A close cousin of the kind of self-inflicted lockout I’ll come back to in a later post.)

So the Deny carries explicit carve-outs for the global services. It isn’t elegant, and it can’t be: the global-versus-regional split is just a fact of AWS, and a correct region fence has to account for it. The carve-out list is the honest cost of the control working.
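
Put together, the fence and its carve-outs could be sketched like this. The allowed region and the exact carve-out list are assumptions; the real list will be longer than the three services named above.

```hcl
variable "allowed_regions" {
  type    = list(string)
  default = ["eu-west-2"] # hypothetical allowed set
}

data "aws_iam_policy_document" "region_fence" {
  statement {
    sid    = "DenyOutsideAllowedRegions"
    effect = "Deny"

    # The carve-outs: global services are exempt from the fence.
    # Illustrative subset only.
    not_actions = [
      "iam:*",
      "cloudfront:*",
      "route53:*",
    ]
    resources = ["*"]

    # Everything else is denied when the request targets another region.
    condition {
      test     = "StringNotEquals"
      variable = "aws:RequestedRegion"
      values   = var.allowed_regions
    }
  }
}

resource "aws_iam_role_policy" "region_fence" {
  name   = "region-fence"
  role   = "InfraAdmin" # the operator role's name, as used in this post
  policy = data.aws_iam_policy_document.region_fence.json
}
```

An explicit Deny beats any Allow, so it doesn’t matter how generous the admin grant is: the fence wins.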

Harden the room, then move the keys in

There’s an order to all of this, and the order is the argument.

The account that will hold the signing key has to be audited before the key arrives, so that from day one every call against it is in CloudTrail. It has to be watched before the key arrives, so GuardDuty is already looking. It has to be access-controlled before the key arrives, so the only human path in is MFA-gated and region-fenced.

You don’t move something valuable into a room and then think about locks. You build the room, fit the locks, check they work, and then move the valuable thing in. The security baseline is fitting the locks. The signing key comes later, into a room already built for it.

Worth remembering

Bootstrapping an account makes it ready for the next deploy. It does not make it safe to hold anything that matters. terraform-aws-security-baseline is the downstream stack that closes that gap: audit logging, AWS Config, threat detection, alerting, account hardening, and an operator role, applied through the CI role bootstrap created.

The operator role is the piece to study. It’s MFA-gated on the way in, and then fenced by a separate, almost-all-Deny inline policy that confines it to permitted regions, with carve-outs for the global services that have no region. An admin role defined as much by its fence as its grant. Harden the room first; the keys move in afterwards.

No access keys in CI

Fri, 08 May 2026 00:00:00 +0000

A long-lived AWS access key, sitting in a CI system, is just about the single credential I’d most like to be rid of. It’s powerful, it never expires unless someone remembers to rotate it (nobody remembers to rotate it), and it lives in one of the most attractive targets in the whole supply chain. For infrastructure that’s eventually going to hold a release-signing key, it’s exactly the wrong place to start. So the phpboyscout infrastructure has no AWS access key in CI at all. None.

The access key you don’t want

A CI pipeline that runs tofu apply against AWS needs AWS credentials. The traditional way to give it some is an IAM user with an access key pair, pasted into the CI system as a masked variable.

Look at what that key is. It’s long-lived: it works until someone remembers to rotate it, and rotating it is a chore, so mostly nobody does. It’s powerful: it can apply infrastructure, so it can do nearly anything. And it’s sitting in a CI system, which is one of the most attractive targets in your whole supply chain. You’ve taken your highest-value credential and stored a permanent copy of it in a place built for running automated jobs.

For infrastructure that’s going to hold a release-signing key, that’s precisely the wrong starting point. So the phpboyscout infrastructure has no AWS access key in CI at all. Not a well-guarded one. None.

Federation instead of a stored secret

The replacement is OIDC federation, and the shape of it is worth walking through, because it’s genuinely different from “a secret, but better”.

A modern CI platform can mint an OIDC token. GitLab does this with an id_tokens: block: at job time, GitLab issues a short-lived JSON Web Token, signed by GitLab, that asserts a set of facts. This is project X. This is pipeline Y. This is running on ref Z, of this type.

AWS can consume that. The sts:AssumeRoleWithWebIdentity call takes such a token and, if it satisfies an IAM role’s trust policy, returns short-lived AWS credentials for that role. The trust policy is where the control lives: it names GitLab as a trusted token issuer, and it constrains the token’s sub claim so that only the specific project, and the specific refs, you intend can assume the role.

Put it together: the pipeline asks GitLab for a token, hands it to AWS, and gets back credentials that last about an hour and are scoped to one role. Nothing long-lived is stored anywhere. The credential exists only for the job that needs it, and it can’t be stolen from a CI variable store, because it was never in one.

Two halves of one handshake

That handshake is built by two of the repos in this series, each owning one side.

terraform-aws-bootstrap builds the AWS half, in its automation-iam module: it registers GitLab as an OIDC identity provider in the account, and it creates the automation role with the trust policy that decides which pipelines may assume it.

The CI components build the consuming half: the id_tokens: block that asks GitLab for the JWT, and then simply letting the AWS provider’s own credential chain perform the exchange. The pipeline doesn’t call sts by hand. It presents the token; the SDK does the rest.
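
The AWS half might be sketched as follows. The project path in the sub claim is hypothetical, and the aud value has to match whatever the job’s id_tokens: block requests; I’ve assumed the GitLab URL itself.

```hcl
variable "gitlab_url" {
  type    = string
  default = "https://gitlab.com"
}

variable "allowed_sub" {
  type = string
  # Hypothetical project and ref; only pipelines matching this may assume the role.
  default = "project_path:example-group/example-project:ref_type:branch:ref:main"
}

locals {
  # Condition keys are prefixed with the issuer host, without the scheme.
  gitlab_host = trimprefix(var.gitlab_url, "https://")
}

# Register GitLab as a trusted token issuer.
# Depending on the AWS provider version, thumbprint_list may also be required.
resource "aws_iam_openid_connect_provider" "gitlab" {
  url            = var.gitlab_url
  client_id_list = [var.gitlab_url]
}

data "aws_iam_policy_document" "automation_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.gitlab.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "${local.gitlab_host}:aud"
      values   = [var.gitlab_url]
    }

    # This is where the control lives: which project, which refs.
    condition {
      test     = "StringLike"
      variable = "${local.gitlab_host}:sub"
      values   = [var.allowed_sub]
    }
  }
}
```

The sub condition is the one to review hardest. A wildcard that’s too wide quietly hands the role to pipelines you never meant to trust.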

The gotcha: don’t set a profile

There’s one quiet way to break this, and a stack can look completely correct while doing it.

The AWS SDK finds credentials by walking a chain of sources in order. The web-identity path, the one that uses the OIDC token, is one link in that chain. It triggers off two environment variables, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, which the CI sets up automatically.

But if the aws provider block has a hardcoded profile = "...", the SDK takes the profile link of the chain instead, and never reaches the web-identity link. A profile line is the sort of thing that ends up in a provider block from someone’s local development setup, where it’s exactly right. Committed and run in CI, it silently short-circuits the federation. The pipeline either fails to find credentials, or finds the wrong ones.

The rule is simple once you know it: the provider block that runs in CI must not name a profile. Leave the chain free to find the web identity. It’s the kind of bug that teaches you to be precise about which link of the credential chain you’re actually relying on.
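
In provider-block terms, the rule looks something like this. The region is a placeholder, and the commented-out profile name is invented for illustration.

```hcl
variable "aws_region" {
  type    = string
  default = "eu-west-2" # hypothetical
}

provider "aws" {
  region = var.aws_region

  # Deliberately no profile here. With AWS_ROLE_ARN and
  # AWS_WEB_IDENTITY_TOKEN_FILE present in the job environment,
  # the default credential chain is free to find the web identity.
  #
  # profile = "my-laptop-profile" # right on a laptop, wrong in CI
}
```

If a profile is wanted locally, the AWS_PROFILE environment variable on the developer’s machine does the same job without ever being committed.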

The bottom line

Giving CI an AWS access key means storing your most powerful, longest-lived credential in one of your most exposed systems. OIDC federation removes it entirely. The CI platform mints a short-lived signed token, AWS exchanges it via AssumeRoleWithWebIdentity for hour-long credentials against a role whose trust policy names the exact pipeline, and nothing permanent is stored.

terraform-aws-bootstrap builds the AWS side, the identity provider and the trust policy; the CI components build the consuming side, the token request. The one trap is a hardcoded profile in the provider block, which short-circuits the SDK’s credential chain before it reaches the web-identity path. Get that right, and a pipeline deploys to AWS as a verifiable, short-lived identity, with no key to steal.

The chicken-and-egg of remote state

Wed, 06 May 2026 00:00:00 +0000

Here’s a puzzle that every infrastructure-as-code setup hits exactly once, right at the very beginning, and then never again. An OpenTofu stack stores its state in a backend. The bootstrap stack I wrote about last time has a particular job, and part of that job is to create the backend that remote state lives in. So where does the bootstrap stack store its own state, on the very first run, before it’s built the place state is supposed to go?

Where does the state of the thing that makes the state store live?

That’s the puzzle, and it’s a real ordering deadlock rather than a riddle.

An OpenTofu stack keeps a state file, and for anything shared that state file lives in a remote backend: on AWS, an S3 bucket. Fine. But the bootstrap stack has a particular job, and part of that job is to create the S3 bucket that remote state lives in.

So walk through the first run. Bootstrap has never been applied. The state bucket doesn’t exist, because creating it is what bootstrap is for. Bootstrap needs somewhere to store its own state. The only place that would make sense is the bucket it’s about to create, which isn’t there yet. The thing that builds the state store can’t store its state in the state store.

Run local, then migrate

The way out is a two-step that OpenTofu supports directly.

Bootstrap starts configured with a local backend: backend "local" {}. State is just a file on the operator’s machine. With that in place, the first tofu apply runs. It creates the S3 bucket and the KMS key, and records all of it in the local state file.

Now the bucket exists. So the backend configuration is rewritten to point at it: an s3 backend block naming the new bucket. Then tofu init -migrate-state. OpenTofu sees the backend has changed, picks up the local state file, and copies it into the S3 bucket. From that point on, bootstrap’s own state lives in the bucket that bootstrap created. The egg has laid the chicken.
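
The two steps amount to swapping one backend block for another. A sketch, with a hypothetical bucket name, key and region:

```hcl
terraform {
  # Step 1: the first apply runs against local state.
  # backend "local" {}

  # Step 2: once the bucket exists, replace the block above with this one
  # and run `tofu init -migrate-state` to copy the local state across.
  backend "s3" {
    bucket  = "example-tofu-state"          # hypothetical bucket name
    key     = "bootstrap/terraform.tfstate" # hypothetical state key
    region  = "eu-west-2"                   # hypothetical region
    encrypt = true
  }
}
```

Only one backend block can be active at a time, which is why the first appears commented out here.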

The local backend was a scaffold. It existed for exactly one apply, to break the ordering deadlock, and then the state moved off it and it was never used again.

It happened twice

The infra repo actually did this migration twice, and the second time is the proof that the pattern is general rather than a one-off trick.

The first migration was the one above: local to S3, at the very start. The second came later, during the move from GitHub to GitLab. GitLab offers a managed HTTP state backend, and infra chose to use it. So the backend block was rewritten again, this time from s3 to http, and tofu init -migrate-state ran again, copying the state from the S3 bucket to GitLab’s backend.
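
For the second hop, the backend block might look like the sketch below. The project ID and state name are placeholders, and credentials are assumed to come from the TF_HTTP_USERNAME and TF_HTTP_PASSWORD environment variables rather than the file.

```hcl
terraform {
  backend "http" {
    # Hypothetical project ID (12345) and state name (infra).
    address        = "https://gitlab.com/api/v4/projects/12345/terraform/state/infra"
    lock_address   = "https://gitlab.com/api/v4/projects/12345/terraform/state/infra/lock"
    unlock_address = "https://gitlab.com/api/v4/projects/12345/terraform/state/infra/lock"
    lock_method    = "POST"
    unlock_method  = "DELETE"
  }
}
```

After the rewrite it’s the same command as before, tofu init -migrate-state, with S3 now the source rather than the destination.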

The same move, twice, against three different backends. That’s the useful lesson hiding in the chicken-and-egg story. State is portable. The backend is just where you currently keep it, not a property of the stack itself, and moving it is a routine, supported operation rather than surgery.

Why this is the honest answer, not a hack

It’s easy to look at “apply once with a local backend, then migrate” and feel it’s a bit of a smell, a workaround for something that should have been cleaner.

It isn’t. It’s the honest answer to a real ordering problem, and the alternatives are worse.

The obvious alternative is to create the state bucket by hand, in the console, before running bootstrap at all. But then the most important bucket in the account is unmanaged. It exists outside every OpenTofu graph, nobody’s code describes it, its encryption and policy and prevent_destroy are whatever someone clicked that day, and it drifts. The local-then-migrate dance avoids exactly that. The bucket is created by bootstrap, described in code, and tracked in bootstrap’s own state from its very first apply. It’s managed from birth.

The chicken-and-egg isn’t a flaw to be embarrassed about. It’s just the shape of the problem when a stack has to build its own foundations, and OpenTofu’s -migrate-state is the supported tool for exactly that shape.

Pulling it together

Every OpenTofu stack needs a backend to store state, and the bootstrap stack’s job is to create the backend, so on its first run the bucket it needs doesn’t yet exist.

The resolution is to run bootstrap once with a local backend, let that apply create the bucket and key, then rewrite the backend configuration and tofu init -migrate-state the state into the bucket bootstrap just made. The infra repo did it twice, local to S3 and later S3 to GitLab, which shows the real point: state is portable, and the backend is just where you keep it. Doing it this way, rather than hand-creating the bucket, is what keeps that critical bucket managed in code from its very first day.

A state bucket that defends itself

Sat, 02 May 2026 00:00:00 +0000

OpenTofu’s remote state file is, quietly, the most sensitive thing in an infrastructure repo. It’s a plain JSON document listing every resource you manage, every ID, and, depending on your providers, the odd secret in clear text. So the S3 bucket that holds it can’t just be a bucket. It has to actively defend itself, on three separate fronts.

The most sensitive file in the repo

OpenTofu, like Terraform, keeps a state file: a JSON document recording every resource the stack manages, its real-world ID, and its attributes. It’s how the tool knows what already exists. It’s also, quietly, the most sensitive file in the whole repo. It can hold resource identifiers an attacker would value, and depending on the providers in play it can hold secret values in clear text.

Three bad things can happen to it. It can be deleted, and now the tool has forgotten everything it manages. It can be read by someone who shouldn’t. It can be corrupted by two runs writing at once. The bucket that holds remote state has to defend against all three, and terraform-aws-bootstrap’s state-backend module is built around doing exactly that.

The DynamoDB lock table is gone

Start with the corruption problem, because the answer changed recently.

The long-standing pattern for remote state on AWS was an S3 bucket plus a DynamoDB table. S3 held the state; the DynamoDB table held a lock, so two apply runs couldn’t write at once. Everyone who’s done Terraform on AWS has provisioned that table, probably more times than they’d care to count.

OpenTofu 1.10 made it unnecessary. The S3 backend gained use_lockfile, which does the locking with a small lock object in the same bucket, using S3’s conditional-write support. No separate table. The state backend is now genuinely one bucket and one key, with the lock living beside the state. It’s one fewer resource to create, one fewer thing to pay for, and one fewer moving part to reason about. The module takes the new path, and the DynamoDB table simply isn’t there.
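
On the consuming side, that comes down to one extra line in the backend block. Bucket, key and region here are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket  = "example-tofu-state"      # hypothetical bucket name
    key     = "infra/terraform.tfstate" # hypothetical state key
    region  = "eu-west-2"               # hypothetical region
    encrypt = true

    # Lock with an object beside the state, not a DynamoDB table.
    use_lockfile = true
    # Note what is absent: no dynamodb_table argument.
  }
}
```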

A bucket you can’t delete by accident

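Deletion is guarded with lifecycle { prevent_destroy = true } on the bucket. With that set, OpenTofu refuses to produce a plan that would destroy the bucket. A stray tofu destroy, a changed bucket name that forces a replacement: both fail loudly instead of quietly taking the state bucket with them. The one thing the guard can’t see is the resource block being deleted outright, because the lifecycle setting disappears along with it, so a refactor that drops or renames the block still needs a human reading the plan.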

This is also why the state-backend module is hand-rolled from raw aws_s3_bucket resources rather than wrapping a community module like terraform-aws-modules/s3-bucket. prevent_destroy has to sit on the actual resource, and a lifecycle block isn’t something you can pass into a wrapper module as an input. Hand-rolling the bucket keeps prevent_destroy somewhere you can put it and, just as importantly, somewhere the next reader can see it. (There’s a whole post coming on why I hand-rolled every module; this is one of the reasons in miniature.)
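
The guard itself is tiny. A minimal sketch, with a hypothetical variable for the bucket name:

```hcl
variable "state_bucket_name" {
  type = string # hypothetical input
}

resource "aws_s3_bucket" "state" {
  bucket = var.state_bucket_name

  lifecycle {
    # Must be a literal on the resource itself; it cannot be fed in
    # through a variable or a module input.
    prevent_destroy = true
  }
}
```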

Reject anything encrypted wrong

Confidentiality is the subtle one, because the obvious control isn’t enough.

The bucket has a default encryption configuration: server-side encryption with the customer-managed KMS key. But default encryption is a default. A client making a PutObject call can override it per request, asking for plain AES256 or a different KMS key, and S3 will honour the override.

So the module doesn’t rely on the default. The bucket policy explicitly denies the upload it doesn’t want. It denies any request not over TLS. It denies any PutObject that isn’t using SSE-KMS. And it denies any PutObject that names the wrong KMS key. The default encryption config says “this is what you get if you don’t ask”; the bucket policy says “and you’re not allowed to ask for anything else”. State can only ever land encrypted, in transit and at rest, under the one key the module controls.
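
Those three denials could be written roughly as below. The variable names are hypothetical, and the choice of plain StringNotEquals is an assumption: it also rejects uploads that send no encryption headers at all, so the writer must name SSE-KMS and the key explicitly. The IfExists variants would let header-less uploads fall through to the default instead.

```hcl
variable "state_bucket_id" {
  type = string # hypothetical input
}

variable "state_bucket_arn" {
  type = string # hypothetical input
}

variable "state_kms_key_arn" {
  type = string # hypothetical input
}

data "aws_iam_policy_document" "state_bucket" {
  statement {
    sid       = "DenyInsecureTransport"
    effect    = "Deny"
    actions   = ["s3:*"]
    resources = [var.state_bucket_arn, "${var.state_bucket_arn}/*"]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }

  statement {
    sid       = "DenyNonKmsUploads"
    effect    = "Deny"
    actions   = ["s3:PutObject"]
    resources = ["${var.state_bucket_arn}/*"]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "s3:x-amz-server-side-encryption"
      values   = ["aws:kms"]
    }
  }

  statement {
    sid       = "DenyWrongKmsKey"
    effect    = "Deny"
    actions   = ["s3:PutObject"]
    resources = ["${var.state_bucket_arn}/*"]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "s3:x-amz-server-side-encryption-aws-kms-key-id"
      values   = [var.state_kms_key_arn]
    }
  }
}

resource "aws_s3_bucket_policy" "state" {
  bucket = var.state_bucket_id
  policy = data.aws_iam_policy_document.state_bucket.json
}
```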

One small companion setting: bucket_key_enabled. With per-object SSE-KMS, every object operation is also a KMS API call, which costs money and can throttle. An S3 Bucket Key collapses those into far fewer KMS calls, cutting request traffic to KMS by up to 99 per cent on AWS’s own figure. It’s a one-line setting the module turns on and most people forget exists.
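
For reference, the setting lives on the bucket’s encryption configuration, next to the default it accompanies. A sketch with hypothetical variable names:

```hcl
variable "state_bucket_id" {
  type = string # hypothetical input
}

variable "state_kms_key_arn" {
  type = string # hypothetical input
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = var.state_bucket_id

  rule {
    # The default: what an upload gets if it doesn't ask.
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.state_kms_key_arn
    }

    # The one-liner: use an S3 Bucket Key to cut KMS request volume.
    bucket_key_enabled = true
  }
}
```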

In short

Remote state is the most sensitive file an infrastructure repo has, and the bucket that holds it has to defend against deletion, disclosure and corruption.

terraform-aws-bootstrap’s state backend handles corruption with OpenTofu 1.10’s use_lockfile, dropping the old DynamoDB lock table entirely. It guards deletion with prevent_destroy, which is also why the bucket is hand-rolled rather than wrapped. And it guards confidentiality with a bucket policy that denies non-TLS traffic and denies any upload not encrypted with the right KMS key, because default encryption is only a default and a client can override it. The state bucket isn’t just a place to put state. It’s built to refuse every wrong thing that could happen to it.

The bootstrap that does almost nothing

Fri, 01 May 2026 00:00:00 +0000

A brand-new AWS account is a slightly nerve-wracking thing. It can do almost anything, it’s hardened against almost nothing, and the list of stuff you ought to set up before you trust it with anything real is long. The natural instinct is to write one big “set up the account” module that does the whole list in a single apply. I want to talk you out of that, because the bootstrap module I’m happiest with does almost nothing, on purpose.

The first-apply problem

A brand-new AWS account is not ready for anything serious. Before you’d responsibly run real infrastructure into it, you want an account baseline: a password policy, account-wide S3 public-access blocking, default EBS encryption, CloudTrail, AWS Config, GuardDuty, alerting, a sensible human operator role. It’s a long list, and all of it matters.

The instinct, faced with that list, is to write one big “set up the account” module and have it do everything. One tofu apply, a fully prepared account, done.

That instinct is worth resisting, and terraform-aws-bootstrap resists it deliberately.

Three things, and a hard line

terraform-aws-bootstrap does three things:

state-backend, an S3 bucket and a customer-managed KMS key to hold remote Terraform state.
automation-iam, an OIDC identity provider and an IAM role that CI assumes to apply everything else.
nuke-config, which renders an aws-nuke configuration scoped to the account, for tearing a throwaway account back down.

That’s the whole module. Account hardening, CloudTrail, AWS Config, GuardDuty, the operator role, the alerting: none of it is in here. And it’s not absent by accident. The README has a section headed “what’s deliberately NOT in scope” that lists those exclusions out loud. The boundary is written down, because the boundary is the design.

Why the line is exactly there

The reason the line sits where it does is the most useful idea in the module.

Everything bootstrap excludes belongs in a separate stack, applied through the automation role bootstrap creates. Bootstrap’s only job is to get the account to the point where the next tofu apply can run properly: somewhere to store state, and an identity to run as. Once those two things exist, hardening the account isn’t a special bootstrapping act. It’s just another apply, done the normal way: in CI, reviewed, versioned, deployed through the role.

So the account baseline doesn’t need to be bundled into the bootstrap. It needs to be downstream of it. Bootstrap builds the on-ramp; it doesn’t also have to be the motorway.

A narrow module stays re-runnable

There’s a practical payoff to the narrowness, and it’s about fear.

Bootstrap is the one stack that can’t be applied through CI, because it’s what creates the CI identity in the first place. It runs locally, by a human, rarely. That’s exactly the kind of operation you want to be small, boring, and safe to repeat.

A bootstrap module that also did account hardening would be a large, stateful thing managing dozens of resources. Re-running it would be a held-breath operation. Keeping it to three concerns keeps it the opposite: a small stack you can read top to bottom, re-run without anxiety, and reason about completely. The narrowness isn’t minimalism for its own sake. It’s what keeps the one human-applied stack trustworthy.

The boundary is the feature

It’s tempting to judge a module by how much it does. A bootstrap module is the case where that’s exactly backwards. Its value is in how cleanly it stops.

terraform-aws-bootstrap does the bare minimum to make an account ready for the next apply, writes down everything it refuses to do, and hands off to a downstream stack for all of it. The next post follows the trickiest of its three jobs: the state backend has a genuine chicken-and-egg problem, because it has to store Terraform state in a bucket Terraform hasn’t created yet.

Where this leaves us

A fresh AWS account needs a long list of things before it’s safe, and the obvious move is one big module that does the lot. terraform-aws-bootstrap deliberately does only three: a state backend, a CI identity, and an account-scrub config. Everything else is written down as out of scope.

The boundary is the design. The excluded work belongs in a downstream stack applied through the CI role bootstrap creates, so hardening is just a normal reviewed apply rather than a bootstrapping special case. And keeping the one human-run, locally-applied stack small is what keeps it safe to re-run. A bootstrap module is judged by where it stops.

A signing key needs somewhere to live

Sun, 26 Apr 2026 00:00:00 +0000

I left a door open a couple of posts ago, and it’s been quietly bothering me ever since. When I wrote about verifying your own downloads, I was honest that a checksum sitting next to the binary only catches accidents. Anyone who can compromise the release platform can swap the binary and the checksum together, and the tool will happily verify one fake against the other.

Closing that gap needs a signature. And a signature, it turns out, needs a surprising amount of infrastructure standing behind it. This is the first post about building that.

The door the last post left open

A while back I wrote about verifying your own downloads: go-tool-base’s self-update command now checks the SHA-256 of every binary it downloads against the release’s published checksums.txt before installing it.

That post was honest about its own ceiling. A checksum file hosted next to the binary it describes shares a trust root with that binary. Both come from the same release, on the same platform. Corruption, truncation, a CDN serving a stale object: a same-origin checksum catches all of those, because they’re accidents and the checksum wasn’t part of the accident. What it can’t catch is an attacker who’s compromised the release platform itself. Someone who can replace the binary can replace checksums.txt in the same breath, and the tool will cheerfully verify the malicious download against the malicious checksum and call it good.

The post named the fix and then deferred it: a signature whose trust root sits somewhere the release platform can’t reach. “That’s the next phase of this work.” This series is that phase.

What a signature actually needs

It’s worth being precise about why a signature helps where a checksum doesn’t, because it’s easy to wave the word “signature” around and assume it settles everything.

A signature closes the gap only under two conditions. The verifying key, the public half, must reach the user by a path the release platform doesn’t control. And the signing key, the private half, must live somewhere the release platform can’t reach.

The second condition is the one people skip. If the signing key sits in the same CI system that builds the release, you’ve gained almost nothing. An attacker who owns the CI owns the key, and a key they own will sign whatever they hand it. The signature verifies perfectly and means precisely nothing. A signature is only worth the distance between the signing key and the thing being signed. Put them in the same place and the distance is zero.

So the signing key has to live in a different security domain from the release pipeline. Not a different folder. A different account, with a different blast radius, that the release platform has no standing access to.
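The two conditions can be written down as a predicate. This is a toy model, not a security tool: it collapses "can the release platform reach it?" into a comparison of security-domain names, and every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SigningSetup:
    # Security domain that delivers the public (verifying) key to the user.
    verifying_key_channel: str
    # Security domain that holds the private (signing) key.
    signing_key_home: str


def survives_compromise(setup: SigningSetup, compromised_domain: str) -> bool:
    """True only if neither half of the key pair sits in the compromised domain."""
    return (
        setup.verifying_key_channel != compromised_domain
        and setup.signing_key_home != compromised_domain
    )


RELEASE = "release-platform"  # hypothetical domain names throughout

# Signing key stored in the same CI system that builds the release.
key_in_ci = SigningSetup(
    verifying_key_channel="already-installed-tool",
    signing_key_home=RELEASE,
)

# Signing key held in a separate account the release platform cannot reach.
key_elsewhere = SigningSetup(
    verifying_key_channel="already-installed-tool",
    signing_key_home="separate-cloud-account",
)

in_ci_is_safe = survives_compromise(key_in_ci, RELEASE)          # False
elsewhere_is_safe = survives_compromise(key_elsewhere, RELEASE)  # True
```

The first setup fails for exactly the reason given above: with the key in the release platform's own domain, the distance between the key and the thing being signed is zero.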

“Just sign the binary” is not a small feature

That reframes a line item that sounds tiny. “Sign the release binary” unpacks into a list:

there must be a private signing key;
it must live outside the release platform, in its own security domain;
it must be access-controlled, audited, and protected from exfiltration;
only the release pipeline may ask it to sign, and only by proving a short-lived, federated identity, never by holding a copy of the key.

That’s not a feature you bolt onto a CLI. That’s infrastructure.

The shape of it: a cloud account, with the key held in a managed key service so the private key material never exists as a file on a disk that anyone, me included, can copy. The release pipeline authenticates to that account as itself, briefly, and asks the key service to produce a signature. The key never moves.

But an account you’re going to trust with a signing key is itself something you have to get right first. An account with a weak baseline, no audit trail, and long-lived credentials lying around is not a safe home for the most security-sensitive key in the whole system. Before the key can move in, the house has to be built and the locks have to actually work.

What this series builds

So this turned into a rather longer project than “add a signature”, and the series follows it in order.

It starts with bootstrapping a fresh AWS account: the deliberately minimal first tofu apply, and the remote state backend that has a genuine chicken-and-egg problem. Then the credential question, which is the heart of it: how a CI pipeline deploys to AWS with no stored access key at all. Then hardening the account, so it’s genuinely safe to hold something valuable. Then the discipline of deploying changes to it: plans reviewed before they’re applied. Then the shared tooling that makes all of it repeatable.

Every one of those pieces exists for the same reason. The signing key needs somewhere to live, and somewhere safe is not a default you’re handed. It’s a thing you build, deliberately, before you have anything worth protecting in it.

The series ends where the verifying-downloads post pointed: a signing service whose key the release platform can’t touch, so a self-updating tool can finally verify that the binary it’s about to become is genuinely the one I published.

The upshot

go-tool-base’s self-update verifies downloads against a checksum, and a same-origin checksum stops accidents but not a compromise of the release platform. The fix is a signature, and a signature is only worth the distance between its signing key and the release pipeline.

Holding that key safely means a private key that never leaves a managed key service, in a separate cloud account, reached only by a short-lived federated identity. That’s infrastructure, and a safe account is something you build before you trust it with anything. The rest of this series builds it, piece by piece, right up to the signing service itself.

The `.gitlab-ci.yml` you keep copying