Infrastructure on PHP Boy Scout

One graph, not micro-stacks

Sun, 17 May 2026 00:00:00 +0000

Once an infrastructure repo has a few concerns in it (account hardening, the security baseline, the signing stack still to come) there’s a steady pressure to split them into separate stacks with separate state, and Terragrunt is right there to help you do it. The infra repo keeps everything in one OpenTofu graph instead. The reason comes down to who enforces your dependency ordering: the engine, or you.

The pressure to split

The infra repo’s src/ has several concerns in it, and more coming, the signing stack among them. Once a repo reaches that point, there’s a steady pressure to split: one stack per concern, each with its own state file.

It’s an appealing pressure. Separate stacks feel modular. Each apply touches less, so the blast radius of any one run is smaller. And Terragrunt exists, popular and well-regarded, precisely to orchestrate a fleet of separate stacks. The path is well trodden.

infra didn’t take it. src/ is a single OpenTofu root stack: each concern is a module block, in its own main.<concern>.tf file, all sharing one state and one graph.

What one graph gives you

The thing a single graph gives you is engine-enforced truth about ordering and data.

Inside one OpenTofu graph, the tool builds the full dependency DAG itself. When the signing stack needs a value the security baseline produced, you reference it directly, module.baseline.something, and OpenTofu guarantees two things: the baseline is created before the thing that depends on it, and the value handed across is the current one from this same apply. Ordering and data-passing aren’t things you arranged. They’re facts the engine checks and enforces, every plan, every apply.
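As a minimal sketch of that edge, assuming a baseline module that exposes an output and a signing module that consumes it. The module paths, the output name alerts_topic_arn and the matching input are hypothetical, chosen for illustration rather than taken from the infra repo:

```hcl
# main.baseline.tf
# Hypothetical module path and output name, for illustration only.
module "baseline" {
  source = "./modules/baseline"
}

# main.signing.tf
# Referencing module.baseline.alerts_topic_arn creates an edge in the graph:
# OpenTofu orders the baseline resources behind that output ahead of anything
# in the signing module that uses the value.
module "signing" {
  source = "./modules/signing"

  alerts_topic_arn = module.baseline.alerts_topic_arn
}
```

Nothing here declares an order. The reference is the order.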

What splitting costs

Split src/ into per-concern stacks with separate state, and that guarantee is the thing you spend.

Now one stack reads another’s outputs through terraform_remote_state. That’s a lookup of a snapshot: the other stack’s last applied state, whatever it was, whenever that was. It’s not a live edge in a graph. Ordering is no longer enforced by the engine either; it becomes something you arrange yourself, in CI stage sequencing or in Terragrunt’s own dependency blocks.
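For contrast, a sketch of what the split version of that hand-off could look like from the consuming stack’s side. The bucket, key, region and output name are hypothetical placeholders; the data source keeps its terraform_ prefix under OpenTofu:

```hcl
# In a separate signing stack, with its own state.
# Bucket, key and region below are hypothetical placeholders.
data "terraform_remote_state" "baseline" {
  backend = "s3"

  config = {
    bucket = "example-state-bucket"
    key    = "baseline/terraform.tfstate"
    region = "eu-west-2"
  }
}

locals {
  # Whatever the baseline stack last wrote to its state file.
  # Nothing in this plan checks that the baseline is up to date.
  alerts_topic_arn = data.terraform_remote_state.baseline.outputs.alerts_topic_arn
}
```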

That’s the trade, stated plainly. You give up a strong, engine-checked guarantee, and you buy back a weaker, hand-arranged imitation of it. Terragrunt is a good tool for managing that weaker world tidily. But the question worth asking first is whether you should be in the weaker world at all.

When splitting is genuinely right

This isn’t an argument that splitting is always wrong. Separate states genuinely earn their place when concerns have different change cadences, different access boundaries, or different teams owning them: when you actively want an apply of one to be unable to touch another, and you want different people holding different state.

infra has none of those. It’s a single account, a single operator, one cohesive set of concerns. The only thing splitting would buy here is a smaller per-apply blast radius, and that’s better handled by reviewing the plan before it applies, which the next post is about, than by fragmenting the dependency graph. So src/ stays one graph, and Terragrunt was considered and deliberately not adopted.

If ordering between graphs is ever needed

If infra ever does genuinely need more than one stack, the plan isn’t Terragrunt. It’s to keep each stack a single strong graph internally, and to sequence the stacks with CI stages. Keep the engine-enforced guarantee where it’s strongest, inside each graph, and reach for hand-arranged ordering only at the one seam where it’s unavoidable.

Boiling it down

A multi-concern infrastructure repo feels like it should be split into per-concern stacks, and Terragrunt is right there to manage the result. infra keeps src/ as one OpenTofu graph instead.

Inside one graph, OpenTofu enforces dependency ordering and passes current values across module boundaries as checked facts. Split into separate states and that becomes a terraform_remote_state snapshot lookup plus ordering you arrange by hand: a weaker version of what you gave up. Splitting is right when concerns have different cadences, boundaries or owners; for a single-account, single-operator repo none of that applies, so the strong guarantee is worth keeping, and Terragrunt is the tool for a problem infra chose not to have.

Routing security findings without the noise

Tue, 12 May 2026 00:00:00 +0000

Turning on GuardDuty and Security Hub gives you threat detection. It also gives you a firehose. And an alert system that dutifully forwards everything in that firehose isn’t monitoring, it’s a very efficient way of training your team to ignore alerts. So the alerts module’s real job isn’t detection at all. It’s deciding what’s actually worth interrupting a human for, and the interesting part is everything it deliberately throws away.

Detection is the easy half

Switching on threat detection in an AWS account is a few resources. GuardDuty, Security Hub with its standards, IAM Access Analyzer: the security baseline does exactly that. From then on, the account is generating findings.

And it generates a lot of them. Plenty are low-severity, informational, or simply the normal texture of a cloud account. If you wire every finding to an email or a pager, you haven’t built monitoring. You’ve built noise. And noise has a specific failure mode: people stop reading it, and the one finding that genuinely mattered scrolls past unread alongside two hundred that didn’t.

So the valuable work isn’t detection. It’s routing: deciding what’s worth interrupting a human for, and letting the rest sit quietly in a console for whenever someone reviews it.

Forward the severe, leave the rest

The alerts module routes findings with EventBridge rules into an SNS topic that emails out. The rules are deliberately picky. GuardDuty findings are forwarded only at severity 7 and above. Security Hub findings are forwarded only at HIGH and CRITICAL.

Everything below those thresholds isn’t discarded. It’s still in GuardDuty and Security Hub, where someone doing a review will see it. It just doesn’t get to interrupt anyone’s day. The threshold is the line between “look at this now” and “look at this sometime”.
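A sketch of what the GuardDuty half of that could look like, assuming the AWS provider’s EventBridge resources. The resource names and topic name are hypothetical, and the SNS topic policy and email subscription are left out to keep it short:

```hcl
# Hypothetical names throughout; not copied from the alerts module.
resource "aws_sns_topic" "security_alerts" {
  name = "security-alerts"
}

resource "aws_cloudwatch_event_rule" "guardduty_high" {
  name        = "guardduty-high-severity"
  description = "GuardDuty findings at severity 7 and above"

  event_pattern = jsonencode({
    source        = ["aws.guardduty"]
    "detail-type" = ["GuardDuty Finding"]
    detail = {
      # EventBridge numeric matching: forward only severity >= 7.
      severity = [{ numeric = [">=", 7] }]
    }
  })
}

resource "aws_cloudwatch_event_target" "guardduty_to_sns" {
  rule = aws_cloudwatch_event_rule.guardduty_high.name
  arn  = aws_sns_topic.security_alerts.arn
}
```

Findings below the threshold never match the pattern, so they simply stay in the GuardDuty console.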

The duplicate you would otherwise send twice

Here’s the subtle one, and it’s the kind of thing you only find by looking closely at where findings come from.

Security Hub is an aggregator. It pulls findings in from other services, GuardDuty among them. So a single GuardDuty finding can show up in two places: in GuardDuty itself, and again in Security Hub as an aggregated copy.

A rule on GuardDuty findings and a rule on Security Hub HIGH/CRITICAL findings would therefore both fire for the same underlying GuardDuty finding. One event, two emails. Do that across an account and a meaningful fraction of your alert volume is just the same findings counted twice, which is its own kind of noise.

So the Security Hub rule explicitly excludes findings whose ProductName is GuardDuty, with an anything-but match. GuardDuty findings come through the GuardDuty rule. The Security Hub rule handles everything Security Hub adds that GuardDuty didn’t already report. One finding, one alert, regardless of how many services it passed through.
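A sketch of a Security Hub rule with that exclusion, again with hypothetical resource names. The field names follow the finding format Security Hub emits in its imported-findings events:

```hcl
# Hypothetical rule name; the pattern is the part that matters.
resource "aws_cloudwatch_event_rule" "securityhub_high" {
  name        = "securityhub-high-critical"
  description = "Security Hub HIGH/CRITICAL findings, excluding GuardDuty copies"

  event_pattern = jsonencode({
    source        = ["aws.securityhub"]
    "detail-type" = ["Security Hub Findings - Imported"]
    detail = {
      findings = {
        Severity = { Label = ["HIGH", "CRITICAL"] }
        # anything-but: match every product except GuardDuty, whose
        # findings are expected to arrive through their own rule.
        ProductName = [{ "anything-but" = "GuardDuty" }]
      }
    }
  })
}
```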

Two tripwires on the root account

Findings are about threats the detectors recognise. The module adds two alarms about something simpler: the root account doing anything at all.

One CloudWatch alarm fires on a root console sign-in. The other fires on any root API call that isn’t a console login. In a well-run AWS account, the root user does almost nothing after initial setup: day-to-day work happens through roles. So root activity isn’t a “finding” to be assessed for severity. It’s a tripwire. Any of it, in an account that should be silent, is worth an immediate look, and the two alarms say so directly.
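One common way to build tripwires like these is a pair of CloudWatch Logs metric filters over the CloudTrail log group, each feeding an alarm. The sketch below assumes that approach; the variable names, metric namespace and filter patterns are mine, not lifted from the module:

```hcl
variable "cloudtrail_log_group_name" {
  description = "CloudWatch Logs group that CloudTrail delivers to (assumed to exist)."
  type        = string
}

variable "alerts_topic_arn" {
  description = "SNS topic the alarms notify (assumed to exist)."
  type        = string
}

locals {
  # Illustrative filter patterns: one for root console sign-ins,
  # one for any other root API activity.
  root_patterns = {
    "root-console-login" = "{ $.userIdentity.type = \"Root\" && $.eventName = \"ConsoleLogin\" }"
    "root-api-call"      = "{ $.userIdentity.type = \"Root\" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != \"AwsServiceEvent\" && $.eventName != \"ConsoleLogin\" }"
  }
}

resource "aws_cloudwatch_log_metric_filter" "root" {
  for_each = local.root_patterns

  name           = each.key
  log_group_name = var.cloudtrail_log_group_name
  pattern        = each.value

  metric_transformation {
    name      = each.key
    namespace = "Security/RootActivity" # hypothetical namespace
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "root" {
  for_each = local.root_patterns

  alarm_name          = each.key
  namespace           = "Security/RootActivity"
  metric_name         = aws_cloudwatch_log_metric_filter.root[each.key].metric_transformation[0].name
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching" # silence is the normal state
  alarm_actions       = [var.alerts_topic_arn]
}
```

A threshold of one is the point: there is no severity to weigh, only the fact that root did something.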

Why a quiet alert stream matters here

This is monitoring for the account that’s going to hold the release-signing key, and that raises the stakes on getting the routing right.

If a key-bearing account ever does come under attack, the alert that says so has to be seen. An alert stream that’s mostly noise and duplicates is, functionally, no alerting at all, because the people who’d act on it have long since tuned it out. Routing the stream down to “severe, deduplicated, plus root tripwires” is what keeps it something a human will still read on the day it finally matters.

The short version

GuardDuty and Security Hub make detection easy. The hard, valuable part is routing: forwarding what deserves to interrupt someone and leaving the rest in a console.

The alerts module forwards GuardDuty at severity 7-plus and Security Hub at HIGH/CRITICAL, and it drops the duplicate that aggregation creates by excluding GuardDuty-sourced findings from the Security Hub rule, so one finding is one alert. Two CloudWatch alarms act as tripwires on root-account activity, which should be near-zero. For the account that will hold the signing key, a quiet, trustworthy alert stream isn’t a nicety. It’s the difference between monitoring and theatre.

Why go-tool-base left GitHub for GitLab

Mon, 11 May 2026 00:00:00 +0000

A botched version bump made me stop and actually look at where go-tool-base lived, and I didn’t much like what I saw. GitHub had spent months quietly falling over, and when Mitchell Hashimoto (GitHub user #1299, no less) publicly walked Ghostty off the platform, it stopped feeling like just my problem. I’ve been a GitLab fan for years, so the move was less a leap and more an overdue nudge. This is the why, not the how.

It started with a wrong number

Every migration has a trigger, and mine was embarrassingly small. A commit landed on main carrying a BREAKING CHANGE: footer it didn’t really deserve. Semantic-release did exactly what it’s told to do with that footer: it cut a major version. go-tool-base lurched from the v1 line straight to v2.0.0, and a chain of things that keyed off the version went sideways with it.

It was fixable. It wasn’t a disaster. But it was the kind of small, stupid breakage that makes you stop and actually look at your setup instead of just patching it and moving on. And when I looked, the version bump wasn’t the thing that bothered me. It was everything around it.

The platform had been quietly failing

I’d been losing time to GitHub for months. Not dramatically. No single outage you’d write home about, just a steady drip of Actions queues that wouldn’t drain, pull requests that wouldn’t merge, the occasional morning where the thing simply wasn’t there. You absorb it. You re-run the job. You make a coffee and try again. You tell yourself it’s a blip.

The trouble with a steady drip is that you stop counting it. It becomes weather.

The canary left the mine

Then, in late April, Mitchell Hashimoto (co-founder of HashiCorp, creator of Vagrant, Terraform and the Ghostty terminal) published Ghostty Is Leaving GitHub, and The Register picked it up a day later under the headline “GitHub ‘no longer a place for serious work’”.

This is not a man with a casual relationship to GitHub. He is, by his own account, user #1299, and joined in February 2008. He called it “the place that has made me the most happy”. And he still wrote this:

“This is no longer a place for serious work if it just blocks you out for hours per day, every day.”

The detail that landed hardest for me wasn’t a quote, it was a habit. He’d kept a journal for a month, marking an “X” on every day a GitHub outage had cost him working time. Almost every day had an X. Reading that, I realised I’d been having the same month. I’d just never been disciplined enough to write it down. He’d turned my vague “it’s been flaky lately” into a row of crosses on a calendar.

“I want to ship software and it doesn’t want me to ship software.”

When the person who’s been on the platform for eighteen years and loves it says that out loud, it stops being your private grumble. It’s the canary, and the canary has stopped singing.

Why GitLab, and not just “somewhere else”

Being annoyed at GitHub is a reason to leave. It is not, on its own, a reason to pick a destination. The destination has to be a positive choice.

For me GitLab was an easy one, because I’ve been a fan for years. Long enough, in fact, to have also been a reliable grumbler about their pricing tiers, which is how you know it’s a real relationship and not a honeymoon. What I’ve always rated is the model: GitLab treats source hosting, CI/CD, the package registry, releases and Pages as one integrated product, not a marketplace of bolted-on parts you assemble yourself.

That integration is the actual prize. On the old setup, “CI” meant a folder of separate GitHub Actions workflow files, each pinned, each its own little world. On GitLab it’s a single .gitlab-ci.yml pipeline with proper stages (lint, test, security, docs, release) and the release stage talks to the built-in package registry and Pages without me wiring up a single external credential. The CI job that builds the project can authenticate to the things the project needs because they’re the same platform.

There’s a second-order benefit too. A migration is a rare licence to fix things you’d never otherwise touch. Moving gave me the cover to reset go-tool-base’s versioning cleanly (back to a sensible v0.x line, the accidental v2.0.0 left behind as a cautionary tale) and to move the module path to its new home in one deliberate change rather than a thousand apologetic ones.

What I’m not going to claim

I’m not going to tell you GitHub is finished, or that GitLab never has a bad day, because it does, everyone does. This isn’t a teardown. GitHub gave go-tool-base a perfectly good home for its first year, and the archived mirror is still sitting there, read-only, pointing anyone who finds it at the new place.

What changed is simpler than a grand verdict. The friction crossed a line, someone I respect said the quiet part loudly enough that I couldn’t keep filing it under “weather”, and the place I’d have moved to anyway was sitting right there with a better model. Sometimes the prudent move and the move you secretly wanted turn out to be the same move, and you just need a wrong version number to give you permission.

Boiling it down

go-tool-base moved from GitHub to GitLab in May 2026. The proximate cause was a self-inflicted version-bump mess; the real cause was months of GitHub unreliability that I’d stopped consciously noticing until Mitchell Hashimoto’s very public departure named it for me. GitLab was a positive pick, not just an escape hatch: its integrated CI/CD, registry, releases and Pages are one product rather than a kit, and that integration is genuinely worth having. The migration also bought a clean versioning restart as a bonus.

If you’ve been absorbing a steady drip of friction and telling yourself it’s normal: try the calendar trick. Mark the X’s for a month. The page will tell you something you already half-know.

Why I hand-rolled every module

Sun, 10 May 2026 00:00:00 +0000

There are well-known community module libraries for AWS: Cloud Posse, the terraform-aws-modules collection, plenty more. Both terraform-aws-bootstrap and terraform-aws-security-baseline use almost none of them. Every sub-module is hand-rolled from raw AWS resources, and before you accuse me of not-invented-here syndrome (a perfectly fair first guess), hear me out, because the same evaluation kept landing the same way for a real reason.

The promise of a wrapper module

The community module ecosystem makes an appealing offer. Don’t write raw aws_s3_bucket and aws_s3_bucket_policy and aws_s3_bucket_public_access_block and the rest. Call a tested, popular module, pass it a handful of inputs, and get a correct, well-configured bucket. Less code in your repo, and the code you don’t write has been exercised by thousands of other users.

For a lot of infrastructure that’s a genuinely good deal, and I take it often. For the two infrastructure modules in this series, I took it almost never. Every sub-module is built from raw AWS resources. That wasn’t a reflex. It was the same evaluation, made over and over, landing the same way.

What kept going wrong

For each place a wrapper module could have fitted, I looked at the wrapper. And the recurring finding was one of two things. Either using the wrapper correctly, with all the overrides my posture needed, came to more configuration than the raw resources would have. Or the wrapper’s abstraction leaked the instant I needed something it hadn’t anticipated, and I was now writing code to fight it.

The CloudTrail bucket, concretely

The clearest example is the bucket that holds CloudTrail logs.

There are popular modules that set up CloudTrail and bundle an S3 bucket for the logs. Convenient. But that bundled bucket isn’t the bucket I want. It doesn’t carry lifecycle { prevent_destroy = true }, and its bucket policy is weaker than the one the state bucket taught me to want: TLS-only, SSE-KMS-only, wrong-key-denied.

So to use the wrapper I had two options. Accept a weaker audit-log bucket than the rest of the account, which rather defeats the point of an audit log. Or fight the wrapper: disable its bucket, create my own, wire it back in. Fighting the wrapper is more work than simply writing the fifty-odd lines of raw aws_s3_bucket plus policy that give me exactly the posture I’d already designed once. The wrapper didn’t save code. It added a negotiation.

A wrapper is a deal, and deals have terms

This isn’t an argument that community modules are bad. It’s an argument about when the deal is good.

A wrapper module is a good deal while its abstraction holds: while what it assumes you want matches what you want. The moment you need something it didn’t anticipate, the deal inverts. Now you’re working against the abstraction, and an abstraction you’re fighting costs more than no abstraction at all. (Regular readers will recognise that line from the LangChain argument; it’s the same principle in a very different language.)

Infrastructure that holds signing keys is precisely the case where you need to control the specifics: every encryption setting, every lifecycle rule, every line of every bucket policy. That’s a domain where wrapper abstractions leak fast, because the whole job is the details the wrapper smoothed over.

The cost, paid on purpose

Hand-rolling isn’t free. It’s more lines of HCL in the repo, up front, than a one-line module call.

What those lines buy is worth the price for this kind of infrastructure. There’s no transitive module-version churn to track. There’s no abstraction between me and the resource when something behaves oddly. And every line is one I can read, and defend, in a security review, because I wrote it and it says exactly what it does. For a foundation that will hold the most sensitive key in the system, “readable and mine” beats “short and someone else’s”.

That’s a deliberate trade, not a universal rule. For an internal tool on a deadline, reach for the wrapper. For the security-critical base of everything else, the raw resources won every time I checked.

To sum up

The community module ecosystem offers less code that more people have tested, and for plenty of infrastructure that’s the right call. For terraform-aws-bootstrap and terraform-aws-security-baseline it almost never was, because each wrapper turned out to be more configuration than the raw resources once my posture was accounted for, or it leaked the moment I needed a specific.

The CloudTrail log bucket is the pattern in miniature: the bundled bucket lacked prevent_destroy and a strong policy, so using the wrapper meant either a weaker bucket or fighting the module. A wrapper is a good deal while its abstraction holds and a bad one the moment you fight it, and security-critical foundation infrastructure is all specifics. Hand-rolling cost more lines and bought code I can read and defend. For this, that was the trade worth making.

Hardening the account that will hold the keys

Sat, 09 May 2026 00:00:00 +0000

Bootstrapping the account got it ready: somewhere to store state, an identity to deploy as, enough for the next tofu apply to run. Ready is not the same as safe. An account with no audit trail, nothing watching it, and no considered way for a human to get in is fine for experimenting and absolutely not where you put the most sensitive key in the system. So before the signing key goes anywhere near it, the account gets a security baseline.

Ready is not the same as safe

The bootstrap post ended with an account that was ready: it had somewhere to store state and a CI identity to deploy as. The next tofu apply could run.

Ready is not safe. That account still has no audit trail, so nobody could tell you afterwards what happened in it. It has no threat detection, so nothing is watching. Its defaults are AWS’s defaults, which are not a security posture. There’s no considered way for a human to get in. An account in that condition is fine for experimenting. It’s not somewhere you put the most sensitive key in the whole system.

So before the signing key is anywhere near it, the account gets a security baseline.

The baseline, in one downstream stack

terraform-aws-security-baseline is that baseline, and it’s exactly the downstream stack the bootstrap post promised: applied through the automation role bootstrap created, not bootstrapped specially.

It’s six sub-modules, each behind an enable_* toggle: account-hardening (IAM password policy, account-wide S3 public-access blocking, default EBS encryption), audit-logging (a multi-region CloudTrail with log-file validation), aws-config, threat-detection (GuardDuty, Security Hub, IAM Access Analyzer), alerts, and operator-role. Together they turn a bare account into one that records what happens, watches for trouble, and controls who gets in.
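A toggle like that is usually a boolean variable driving count on the module block. A minimal sketch, assuming that is how the enable_* flags are wired; the variable name follows the enable_* pattern, but the module path is hypothetical:

```hcl
variable "enable_threat_detection" {
  description = "Whether to create the threat-detection sub-module."
  type        = bool
  default     = true
}

module "threat_detection" {
  source = "./modules/threat-detection" # hypothetical path

  # Zero instances when the toggle is off, one when it is on.
  count = var.enable_threat_detection ? 1 : 0
}
```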

Most of those are the expected baseline. The operator role is the one worth slowing down on, because it’s built backwards from how people usually think about an admin role.

The operator role, and the inversion

InfraAdmin is the human way into the account: the role a person assumes to do operator work. Two things define it.

The trust policy decides who may assume it. It trusts only the account root principal, and it requires multi-factor authentication: the assume call must carry aws:MultiFactorAuthPresent, and aws:MultiFactorAuthAge bounds how recently that MFA was performed. No MFA, no role. So far this is a careful but ordinary admin role.
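A sketch of a trust policy in that shape, using the AWS provider’s policy-document data source. The one-hour MFA age bound is an assumption for illustration; the source doesn’t state the real value:

```hcl
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "infra_admin_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    # Trust only this account's root principal.
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }

    # The caller must have authenticated with MFA...
    condition {
      test     = "Bool"
      variable = "aws:MultiFactorAuthPresent"
      values   = ["true"]
    }

    # ...and recently. 3600 seconds is an assumed bound.
    condition {
      test     = "NumericLessThan"
      variable = "aws:MultiFactorAuthAge"
      values   = ["3600"]
    }
  }
}

resource "aws_iam_role" "infra_admin" {
  name               = "InfraAdmin"
  assume_role_policy = data.aws_iam_policy_document.infra_admin_trust.json
}
```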

The inversion is a second, separate inline policy, and it’s almost entirely Deny. It denies, using NotAction, anything where aws:RequestedRegion falls outside an allowed set of regions. The role’s power comes from an admin grant. This inline policy fences that power.

That’s the part worth holding onto. People picture an admin role as a list of what it can do. This one is better understood by what it cannot: it cannot act outside its permitted regions, full stop. A fat-fingered command, or a compromised session, cannot quietly spin resources up in some region nobody’s watching. The fence is as much the point of the role as the grant is.

The carve-out, because honesty

There’s a fiddly detail, and it’s the kind of thing that makes the region fence real rather than theoretical.

Some AWS services are global. IAM, CloudFront, Route 53 and friends have no region of their own: their calls go to a single global endpoint, so aws:RequestedRegion reports that endpoint’s region (us-east-1 for IAM) rather than one you chose. A naive region-deny that doesn’t happen to allow that region would therefore deny calls to IAM, and you’d lock yourself out of the very service you manage access with. (A close cousin of the kind of self-inflicted lockout I’ll come back to in a later post.)

So the Deny carries explicit carve-outs for the global services. It isn’t elegant, and it can’t be: the global-versus-regional split is just a fact of AWS, and a correct region fence has to account for it. The carve-out list is the honest cost of the control working.
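Roughly, the fence and its carve-outs could look like the sketch below. The allowed region and the exact carve-out list are assumptions; only IAM, CloudFront and Route 53 are named in the text, and a real list would likely be longer:

```hcl
variable "allowed_regions" {
  description = "Regions the operator role may act in."
  type        = list(string)
  default     = ["eu-west-2"] # hypothetical
}

data "aws_iam_policy_document" "region_fence" {
  statement {
    sid    = "DenyOutsideAllowedRegions"
    effect = "Deny"

    # NotAction: the Deny covers everything except these global services.
    # Illustrative list, not the module's actual carve-outs.
    not_actions = [
      "iam:*",
      "cloudfront:*",
      "route53:*",
    ]
    resources = ["*"]

    condition {
      test     = "StringNotEquals"
      variable = "aws:RequestedRegion"
      values   = var.allowed_regions
    }
  }
}

resource "aws_iam_role_policy" "region_fence" {
  name   = "region-fence"
  role   = "InfraAdmin"
  policy = data.aws_iam_policy_document.region_fence.json
}
```

Because this is an explicit Deny, it wins over whatever the admin grant allows, which is what makes it a fence rather than a suggestion.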

Harden the room, then move the keys in

There’s an order to all of this, and the order is the argument.

The account that will hold the signing key has to be audited before the key arrives, so that from day one every call against it is in CloudTrail. It has to be watched before the key arrives, so GuardDuty is already looking. It has to be access-controlled before the key arrives, so the only human path in is MFA-gated and region-fenced.

You don’t move something valuable into a room and then think about locks. You build the room, fit the locks, check they work, and then move the valuable thing in. The security baseline is fitting the locks. The signing key comes later, into a room already built for it.

Worth remembering

Bootstrapping an account makes it ready for the next deploy. It does not make it safe to hold anything that matters. terraform-aws-security-baseline is the downstream stack that closes that gap: audit logging, AWS Config, threat detection, account hardening, and an operator role, applied through the CI role bootstrap created.

The operator role is the piece to study. It’s MFA-gated on the way in, and then fenced by a separate, almost-all-Deny inline policy that confines it to permitted regions, with carve-outs for the global services that have no region. An admin role defined as much by its fence as its grant. Harden the room first; the keys move in afterwards.

The chicken-and-egg of remote state

Wed, 06 May 2026 00:00:00 +0000

Here’s a puzzle that every infrastructure-as-code setup hits exactly once, right at the very beginning, and then never again. An OpenTofu stack stores its state in a backend. The bootstrap stack I wrote about last time has a particular job, and part of that job is to create the backend that remote state lives in. So where does the bootstrap stack store its own state, on the very first run, before it’s built the place state is supposed to go?

Where does the state of the thing that makes the state store live?

That’s the puzzle, and it’s a real ordering deadlock rather than a riddle.

An OpenTofu stack keeps a state file, and for anything shared that state file lives in a remote backend: on AWS, an S3 bucket. Fine. But the bootstrap stack has a particular job, and part of that job is to create the S3 bucket that remote state lives in.

So walk through the first run. Bootstrap has never been applied. The state bucket doesn’t exist, because creating it is what bootstrap is for. Bootstrap needs somewhere to store its own state. The only place that would make sense is the bucket it’s about to create, which isn’t there yet. The thing that builds the state store can’t store its state in the state store.

Run local, then migrate

The way out is a two-step that OpenTofu supports directly.

Bootstrap starts configured with a local backend: backend "local" {}. State is just a file on the operator’s machine. With that in place, the first tofu apply runs. It creates the S3 bucket and the KMS key, and records all of it in the local state file.

Now the bucket exists. So the backend configuration is rewritten to point at it: an s3 backend block naming the new bucket. Then tofu init -migrate-state. OpenTofu sees the backend has changed, picks up the local state file, and copies it into the S3 bucket. From that point on, bootstrap’s own state lives in the bucket that bootstrap created. The egg has laid the chicken.
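The two stages, sketched as they might appear in the backend configuration. The bucket name, key and region are hypothetical:

```hcl
# Step 1: before the bucket exists. State is a file on the operator's machine.
#
# terraform {
#   backend "local" {}
# }

# Step 2: once the first apply has created the bucket, replace the block
# above with this one, then run: tofu init -migrate-state
terraform {
  backend "s3" {
    bucket  = "example-state-bucket" # hypothetical name
    key     = "bootstrap/terraform.tfstate"
    region  = "eu-west-2"
    encrypt = true
  }
}
```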

The local backend was a scaffold. It existed for exactly one apply, to break the ordering deadlock, and then the state moved off it and it was never used again.

It happened twice

The infra repo actually did this migration twice, and the second time is the proof that the pattern is general rather than a one-off trick.

The first migration was the one above: local to S3, at the very start. The second came later, during the move from GitHub to GitLab. GitLab offers a managed HTTP state backend, and infra chose to use it. So the backend block was rewritten again, this time from s3 to http, and tofu init -migrate-state ran again, copying the state from the S3 bucket to GitLab’s backend.
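For the second hop, the backend block might look something like this sketch. <PROJECT_ID> and <STATE_NAME> are placeholders, and credentials are assumed to come from the TF_HTTP_USERNAME and TF_HTTP_PASSWORD environment variables rather than the file:

```hcl
terraform {
  backend "http" {
    # <PROJECT_ID> and <STATE_NAME> are placeholders, not real values.
    address        = "https://gitlab.com/api/v4/projects/<PROJECT_ID>/terraform/state/<STATE_NAME>"
    lock_address   = "https://gitlab.com/api/v4/projects/<PROJECT_ID>/terraform/state/<STATE_NAME>/lock"
    unlock_address = "https://gitlab.com/api/v4/projects/<PROJECT_ID>/terraform/state/<STATE_NAME>/lock"
    lock_method    = "POST"
    unlock_method  = "DELETE"
    retry_wait_min = 5
  }
}
```

With the block rewritten, the same tofu init -migrate-state copies the state across.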

The same move, twice, against three different backends. That’s the useful lesson hiding in the chicken-and-egg story. State is portable. The backend is just where you currently keep it, not a property of the stack itself, and moving it is a routine, supported operation rather than surgery.

Why this is the honest answer, not a hack

It’s easy to look at “apply once with a local backend, then migrate” and feel it’s a bit of a smell, a workaround for something that should have been cleaner.

It isn’t. It’s the honest answer to a real ordering problem, and the alternatives are worse.

The obvious alternative is to create the state bucket by hand, in the console, before running bootstrap at all. But then the most important bucket in the account is unmanaged. It exists outside every OpenTofu graph, nobody’s code describes it, its encryption and policy and prevent_destroy are whatever someone clicked that day, and it drifts. The local-then-migrate dance avoids exactly that. The bucket is created by bootstrap, described in code, and tracked in bootstrap’s own state from its very first apply. It’s managed from birth.

The chicken-and-egg isn’t a flaw to be embarrassed about. It’s just the shape of the problem when a stack has to build its own foundations, and OpenTofu’s -migrate-state is the supported tool for exactly that shape.

Pulling it together

Every OpenTofu stack needs a backend to store state, and the bootstrap stack’s job is to create the backend, so on its first run the bucket it needs doesn’t yet exist.

The resolution is to run bootstrap once with a local backend, let that apply create the bucket and key, then rewrite the backend configuration and tofu init -migrate-state the state into the bucket bootstrap just made. The infra repo did it twice, local to S3 and later S3 to GitLab, which shows the real point: state is portable, and the backend is just where you keep it. Doing it this way, rather than hand-creating the bucket, is what keeps that critical bucket managed in code from its very first day.

A state bucket that defends itself

Sat, 02 May 2026 00:00:00 +0000

OpenTofu’s remote state file is, quietly, the most sensitive thing in an infrastructure repo. It’s a plain JSON document listing every resource you manage, every ID, and, depending on your providers, the odd secret in clear text. So the S3 bucket that holds it can’t just be a bucket. It has to actively defend itself, on three separate fronts.

The most sensitive file in the repo

OpenTofu, like Terraform, keeps a state file: a JSON document recording every resource the stack manages, its real-world ID, and its attributes. It’s how the tool knows what already exists. It’s also, quietly, the most sensitive file in the whole repo. It can hold resource identifiers an attacker would value, and depending on the providers in play it can hold secret values in clear text.

Three bad things can happen to it. It can be deleted, and now the tool has forgotten everything it manages. It can be read by someone who shouldn’t. It can be corrupted by two runs writing at once. The bucket that holds remote state has to defend against all three, and terraform-aws-bootstrap’s state-backend module is built around doing exactly that.

The DynamoDB lock table is gone

Start with the corruption problem, because the answer changed recently.

The long-standing pattern for remote state on AWS was an S3 bucket plus a DynamoDB table. S3 held the state; the DynamoDB table held a lock, so two apply runs couldn’t write at once. Everyone who’s done Terraform on AWS has provisioned that table, probably more times than they’d care to count.

OpenTofu 1.10 made it unnecessary. The S3 backend gained use_lockfile, which does the locking with a small lock object in the same bucket, using S3’s conditional-write support. No separate table. The state backend is now genuinely one bucket and one key, with the lock living beside the state. It’s one fewer resource to create, one fewer thing to pay for, and one fewer moving part to reason about. The module takes the new path, and the DynamoDB table simply isn’t there.
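On the consuming side, opting in is one argument in the backend block. A sketch with a hypothetical bucket name, key and region:

```hcl
terraform {
  backend "s3" {
    bucket  = "example-state-bucket" # hypothetical name
    key     = "infra/terraform.tfstate"
    region  = "eu-west-2"
    encrypt = true

    # Lock with an object beside the state file; no dynamodb_table argument.
    use_lockfile = true
  }
}
```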

A bucket you can’t delete by accident

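Deletion is guarded with lifecycle { prevent_destroy = true } on the bucket. With that set, OpenTofu refuses to produce a plan that would destroy the bucket. A stray tofu destroy, or a change that would force the bucket to be replaced: both fail loudly instead of quietly taking the state bucket with them. The guard only holds while the resource block that carries it is still in the configuration, though. Delete the block outright and prevent_destroy is deleted with it.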

This is also why the state-backend module is hand-rolled from raw aws_s3_bucket resources rather than wrapping a community module like terraform-aws-modules/s3-bucket. prevent_destroy has to sit on the actual resource, and a lifecycle block isn’t something you can pass into a wrapper module as an input. Hand-rolling the bucket keeps prevent_destroy somewhere you can put it and, just as importantly, somewhere the next reader can see it. (There’s a whole post coming on why I hand-rolled every module; this is one of the reasons in miniature.)

Reject anything encrypted wrong

Confidentiality is the subtle one, because the obvious control isn’t enough.

The bucket has a default encryption configuration: server-side encryption with the customer-managed KMS key. But default encryption is a default. A client making a PutObject call can override it per request, asking for plain AES256 or a different KMS key, and S3 will honour the override.

So the module doesn’t rely on the default. The bucket policy explicitly denies the upload it doesn’t want. It denies any request not over TLS. It denies any PutObject that isn’t using SSE-KMS. And it denies any PutObject that names the wrong KMS key. The default encryption config says “this is what you get if you don’t ask”; the bucket policy says “and you’re not allowed to ask for anything else”. State can only ever land encrypted, in transit and at rest, under the one key the module controls.
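
The three denials can be sketched as a policy document. This is an illustration under assumptions rather than the module's actual policy: the bucket ARN, key ARN, statement IDs and output file name are made up, and the script simply renders the JSON so it can be read in one place. The condition keys themselves (aws:SecureTransport, s3:x-amz-server-side-encryption and s3:x-amz-server-side-encryption-aws-kms-key-id) are standard AWS ones.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical identifiers, for illustration only.
BUCKET_ARN="arn:aws:s3:::example-tofu-state"
KMS_KEY_ARN="arn:aws:kms:eu-west-2:111122223333:key/00000000-0000-0000-0000-000000000000"

# Render a bucket policy containing the three denials described above.
render_state_bucket_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": ["${BUCKET_ARN}", "${BUCKET_ARN}/*"],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    },
    {
      "Sid": "DenyNonKmsUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "${BUCKET_ARN}/*",
      "Condition": {
        "StringNotEquals": { "s3:x-amz-server-side-encryption": "aws:kms" }
      }
    },
    {
      "Sid": "DenyWrongKmsKey",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "${BUCKET_ARN}/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption-aws-kms-key-id": "${KMS_KEY_ARN}"
        }
      }
    }
  ]
}
EOF
}

render_state_bucket_policy > state-bucket-policy.json
```

Each statement is a Deny, so it wins over any Allow granted elsewhere, which is what makes the policy a floor rather than a suggestion.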

One small companion setting: bucket_key_enabled. With per-object SSE-KMS, every object operation is also a KMS API call, which costs money and can throttle. An S3 Bucket Key collapses those into far fewer KMS calls; AWS’s own figure is a reduction in KMS request costs of up to 99 per cent. It’s a one-line setting the module turns on and most people forget exists.

In short

Remote state is the most sensitive file an infrastructure repo has, and the bucket that holds it has to defend against deletion, disclosure and corruption.

terraform-aws-bootstrap’s state backend handles corruption with OpenTofu 1.10’s use_lockfile, dropping the old DynamoDB lock table entirely. It guards deletion with prevent_destroy, which is also why the bucket is hand-rolled rather than wrapped. And it guards confidentiality with a bucket policy that denies non-TLS traffic and denies any upload not encrypted with the right KMS key, because default encryption is only a default and a client can override it. The state bucket isn’t just a place to put state. It’s built to refuse every wrong thing that could happen to it.

The bootstrap that does almost nothing

Fri, 01 May 2026 00:00:00 +0000

A brand-new AWS account is a slightly nerve-wracking thing. It can do almost anything, it’s hardened against almost nothing, and the list of stuff you ought to set up before you trust it with anything real is long. The natural instinct is to write one big “set up the account” module that does the whole list in a single apply. I want to talk you out of that, because the bootstrap module I’m happiest with does almost nothing, on purpose.

The first-apply problem

A brand-new AWS account is not ready for anything serious. Before you’d responsibly run real infrastructure into it, you want an account baseline: a password policy, account-wide S3 public-access blocking, default EBS encryption, CloudTrail, AWS Config, GuardDuty, alerting, a sensible human operator role. It’s a long list, and all of it matters.

The instinct, faced with that list, is to write one big “set up the account” module and have it do everything. One tofu apply, a fully prepared account, done.

That instinct is worth resisting, and terraform-aws-bootstrap resists it deliberately.

Three things, and a hard line

terraform-aws-bootstrap does three things:

state-backend, an S3 bucket and a customer-managed KMS key to hold remote Terraform state.
automation-iam, an OIDC identity provider and an IAM role that CI assumes to apply everything else.
nuke-config, which renders an aws-nuke configuration scoped to the account, for tearing a throwaway account back down.

That’s the whole module. Account hardening, CloudTrail, AWS Config, GuardDuty, the operator role, the alerting: none of it is in here. And it’s not absent by accident. The README has a section headed “what’s deliberately NOT in scope” that lists those exclusions out loud. The boundary is written down, because the boundary is the design.

Why the line is exactly there

The reason the line sits where it does is the most useful idea in the module.

Everything bootstrap excludes belongs in a separate stack, applied through the automation role bootstrap creates. Bootstrap’s only job is to get the account to the point where the next tofu apply can run properly: somewhere to store state, and an identity to run as. Once those two things exist, hardening the account isn’t a special bootstrapping act. It’s just another apply, done the normal way: in CI, reviewed, versioned, deployed through the role.

So the account baseline doesn’t need to be bundled into the bootstrap. It needs to be downstream of it. Bootstrap builds the on-ramp; it doesn’t also have to be the motorway.

A narrow module stays re-runnable

There’s a practical payoff to the narrowness, and it’s about fear.

Bootstrap is the one stack that can’t be applied through CI, because it’s what creates the CI identity in the first place. It runs locally, by a human, rarely. That’s exactly the kind of operation you want to be small, boring, and safe to repeat.

A bootstrap module that also did account hardening would be a large, stateful thing managing dozens of resources. Re-running it would be a held-breath operation. Keeping it to three concerns keeps it the opposite: a small stack you can read top to bottom, re-run without anxiety, and reason about completely. The narrowness isn’t minimalism for its own sake. It’s what keeps the one human-applied stack trustworthy.

The boundary is the feature

It’s tempting to judge a module by how much it does. A bootstrap module is the case where that’s exactly backwards. Its value is in how cleanly it stops.

terraform-aws-bootstrap does the bare minimum to make an account ready for the next apply, writes down everything it refuses to do, and hands off to a downstream stack for all of it. The next post follows the trickiest of its three jobs: the state backend has a genuine chicken-and-egg problem, because it has to store Terraform state in a bucket Terraform hasn’t created yet.

Where this leaves us

A fresh AWS account needs a long list of things before it’s safe, and the obvious move is one big module that does the lot. terraform-aws-bootstrap deliberately does only three: a state backend, a CI identity, and an account-scrub config. Everything else is written down as out of scope.

The boundary is the design. The excluded work belongs in a downstream stack applied through the CI role bootstrap creates, so hardening is just a normal reviewed apply rather than a bootstrapping special case. And keeping the one human-run, locally-applied stack small is what keeps it safe to re-run. A bootstrap module is judged by where it stops.

A signing key needs somewhere to live

Sun, 26 Apr 2026 00:00:00 +0000

I left a door open a couple of posts ago, and it’s been quietly bothering me ever since. When I wrote about verifying your own downloads, I was honest that a checksum sitting next to the binary only catches accidents. Anyone who can compromise the release platform can swap the binary and the checksum together, and the tool will happily verify one fake against the other.

Closing that gap needs a signature. And a signature, it turns out, needs a surprising amount of infrastructure standing behind it. This is the first post about building that.

The door the last post left open

A while back I wrote about verifying your own downloads: go-tool-base’s self-update command now checks the SHA-256 of every binary it downloads against the release’s published checksums.txt before installing it.

That post was honest about its own ceiling. A checksum file hosted next to the binary it describes shares a trust root with that binary. Both come from the same release, on the same platform. Corruption, truncation, a CDN serving a stale object: a same-origin checksum catches all of those, because they’re accidents and the checksum wasn’t part of the accident. What it can’t catch is an attacker who’s compromised the release platform itself. Someone who can replace the binary can replace checksums.txt in the same breath, and the tool will cheerfully verify the malicious download against the malicious checksum and call it good.
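
The ceiling is easy to demonstrate. What follows is a shell sketch, not go-tool-base’s actual implementation: the file name tool is hypothetical, verify_download is a helper invented for the example, and it assumes GNU sha256sum and a checksums.txt in the usual "hash, two spaces, filename" layout.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Verify one downloaded file against a checksums.txt in the usual
# "<sha256>  <filename>" format. Returns non-zero on mismatch.
verify_download() {
  local file="$1" checksums="$2"
  local expected actual
  expected="$(awk -v name="$(basename "$file")" '$2 == name { print $1 }' "$checksums")"
  [ -n "$expected" ] || return 1
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  [ "$expected" = "$actual" ]
}

workdir="$(mktemp -d)"
cd "$workdir"

# A "release": one binary plus a same-origin checksum file.
printf 'genuine build\n' > tool
sha256sum tool > checksums.txt
verify_download tool checksums.txt && echo "genuine: verified"

# An accident (here, truncation) is caught...
printf 'genuine buil' > tool
verify_download tool checksums.txt || echo "corrupted: rejected"

# ...but an attacker who controls the release replaces both files together.
printf 'malicious build\n' > tool
sha256sum tool > checksums.txt
verify_download tool checksums.txt && echo "malicious: verified"
```

Run as written, the three checks report verified, rejected, verified. The last line is the whole problem: the check is only as trustworthy as the place checksums.txt came from.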

The post named the fix and then deferred it: a signature whose trust root sits somewhere the release platform can’t reach. “That’s the next phase of this work.” This series is that phase.

What a signature actually needs

It’s worth being precise about why a signature helps where a checksum doesn’t, because it’s easy to wave the word “signature” around and assume it settles everything.

A signature closes the gap only under two conditions. The verifying key, the public half, must reach the user by a path the release platform doesn’t control. And the signing key, the private half, must live somewhere the release platform can’t reach.

The second condition is the one people skip. If the signing key sits in the same CI system that builds the release, you’ve gained almost nothing. An attacker who owns the CI owns the key, and a key they own will sign whatever they hand it. The signature verifies perfectly and means precisely nothing. A signature is only worth the distance between the signing key and the thing being signed. Put them in the same place and the distance is zero.

So the signing key has to live in a different security domain from the release pipeline. Not a different folder. A different account, with a different blast radius, that the release platform has no standing access to.

“Just sign the binary” is not a small feature

That reframes a line item that sounds tiny. “Sign the release binary” unpacks into a list:

there must be a private signing key;
it must live outside the release platform, in its own security domain;
it must be access-controlled, audited, and protected from exfiltration;
only the release pipeline may ask it to sign, and only by proving a short-lived, federated identity, never by holding a copy of the key.

That’s not a feature you bolt onto a CLI. That’s infrastructure.

The shape of it: a cloud account, with the key held in a managed key service so the private key material never exists as a file on a disk that anyone, me included, can copy. The release pipeline authenticates to that account as itself, briefly, and asks the key service to produce a signature. The key never moves.
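
To make that shape concrete, here is a minimal sketch of what the pipeline’s side of the conversation could look like. It rests on assumptions: that the managed key service is AWS KMS, that the key is an asymmetric ECC signing key reachable under a hypothetical alias such as alias/release-signing, and that short-lived credentials for the signing account are already present in the environment. The function names are invented for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the raw 32-byte SHA-256 digest of a file.
write_digest() {
  local input="$1" output="$2"
  openssl dgst -sha256 -binary "$input" > "$output"
}

# Ask AWS KMS to sign a digest. Only the digest travels to the service and
# only the signature comes back; the private key never leaves KMS.
# Assumes short-lived credentials for the signing account are in the environment.
sign_digest() {
  local key_id="$1" digest_file="$2"
  aws kms sign \
    --key-id "$key_id" \
    --message "fileb://${digest_file}" \
    --message-type DIGEST \
    --signing-algorithm ECDSA_SHA_256 \
    --query Signature --output text
}

# Example (hypothetical key alias and file names):
#   write_digest ./tool ./tool.sha256.bin
#   sign_digest alias/release-signing ./tool.sha256.bin > ./tool.sig.b64
```

Nothing in the script ever holds key material. The most a compromised pipeline can do is ask for signatures while its short-lived credentials are still valid, and every one of those requests is an API call the signing account can log.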

But an account you’re going to trust with a signing key is itself something you have to get right first. An account with a weak baseline, no audit trail, and long-lived credentials lying around is not a safe home for the most security-sensitive key in the whole system. Before the key can move in, the house has to be built and the locks have to actually work.

What this series builds

So this turned into a rather longer project than “add a signature”, and the series follows it in order.

It starts with bootstrapping a fresh AWS account: the deliberately minimal first tofu apply, and the remote state backend that has a genuine chicken-and-egg problem. Then the credential question, which is the heart of it: how a CI pipeline deploys to AWS with no stored access key at all. Then hardening the account, so it’s genuinely safe to hold something valuable. Then the discipline of deploying changes to it: plans reviewed before they’re applied. Then the shared tooling that makes all of it repeatable.

Every one of those pieces exists for the same reason. The signing key needs somewhere to live, and somewhere safe is not a default you’re handed. It’s a thing you build, deliberately, before you have anything worth protecting in it.

The series ends where the verifying-downloads post pointed: a signing service whose key the release platform can’t touch, so a self-updating tool can finally verify that the binary it’s about to become is genuinely the one I published.

The upshot

go-tool-base’s self-update verifies downloads against a checksum, and a same-origin checksum stops accidents but not a compromise of the release platform. The fix is a signature, and a signature is only worth the distance between its signing key and the release pipeline.

Holding that key safely means a private key that never leaves a managed key service, in a separate cloud account, reached only by a short-lived federated identity. That’s infrastructure, and a safe account is something you build before you trust it with anything. The rest of this series builds it, piece by piece, right up to the signing service itself.

Pre-populating Neo4J using Kubernetes Init Containers and neo4j-admin import

Wed, 15 Jul 2020 00:00:00 +0000

Recently there has been an uptick in the use of Neo4j by the Data Scientists. This is a good thing! They want to use the right tool for the job. However, we need to run it inside our k8s cluster as a portable, readable data source that has been dynamically populated from a pile of data in a combination of PostgreSQL and MongoDB.

This isn’t a problem for them working locally: they install and spin up a local copy of Neo4j and can interact with it quite happily. They even realised that they can generate CSVs from PostgreSQL and MongoDB and then import them, blindingly fast, into Neo4j using the bundled neo4j-admin tool. Fantastic!

At least until they come to want to run their Neo instance inside our k8s cluster. That’s where I step in and turn them aside from creating their own custom neo4j image with a bespoke entry point that loads all the data for them in some crazy threaded bash scripting!

“No, No, No!” I tell them. “It’s far easier to just add an init container to your pod, that will preload the data before Neo starts up”.

Init containers, if you haven’t come across them before, are a special type of container that lives inside a k8s pod and is set to run BEFORE your main container runs. In this case it means we can easily sequence a bash script to run the neo4j-admin import before Neo4j is even started. And here is how we did it!

The script

The data scientists had been using Neo4j 3.5.x locally because they had a need for the graph algorithms plugin (https://github.com/neo4j-contrib/neo4j-graph-algorithms) which at the time they were looking didn’t support Neo4j 4.x. The plugin is now deprecated and its replacement (https://github.com/neo4j/graph-data-science) thankfully supports 3.5.x and 4.x.

As Neo4j 4.x introduces a lot of new features and improves performance, I recommended we switch to using that. This meant a refactor of their bash script for neo4j-admin, as there are some very subtle differences and a few caveats to work with. This is what they came up with:

#!/bin/bash
# Database name defaults to "neo4j"; pass a different name as the first argument.
DBNAME="neo4j"
if [ "$#" -eq 1 ]; then
  DBNAME="$1"
fi

# extract data from SQL
python3 extract_data.py

# remove old db for rebuild
rm -rf "/data/databases/$DBNAME"

# NODE_DIR, EDGE_DIR and DATA_DIR are read from the environment
neo4j-admin import \
  --database="$DBNAME" \
  --delimiter="|" \
  --nodes=Protein="${NODE_DIR}/nodes_protein_header.csv,${DATA_DIR}/nodes_proteins.csv" \
  --nodes=UniProtKB="${NODE_DIR}/nodes_uniprot_header.csv,${DATA_DIR}/nodes_uniprot.csv" \
  --relationships=HAS_AMINO_ACID_SEQUENCE="${EDGE_DIR}/edges_protein_sequence_header.csv,${DATA_DIR}/edges_protein_sequence.csv" \
  --relationships=HAS_AMINO_ACID_SEQUENCE="${EDGE_DIR}/edges_chembl_protein_biotherapeutic_molregno_header.csv,${DATA_DIR}/edges_chembl_protein_biotherapeutic_molregno.csv" \
  --skip-bad-relationships=true \
  --skip-duplicate-nodes=true

The import command here is significantly shortened for example purposes, as the original is about 120 lines long. As you can see, it’s pretty straightforward. They had another script, extract_data.py, that I won’t bore you with; suffice to say that it pulled out all the data they wanted from PostgreSQL and MongoDB and saved it to disk as CSV files in the relevant directories.

Great, it worked on their local version!

The Dockerfile

FROM neo4j:latest
ENV NEO4JLABS_PLUGINS='["graph-data-science"]'
RUN apt update && apt install -y python3
WORKDIR /srv
COPY src /srv/src
COPY headers /srv/headers

The plan is always to keep it simple. We have one image that we can run for both the init container and the main container. This Dockerfile gives a vanilla Neo4j instance with Python and our scripts for extracting the data loaded into it.

The k8s Manifest

apiVersion: v1
kind: Pod
metadata:
  name: neo4j
spec:
  containers:
    - name: neo4j
      env:
        - name: NEO4J_AUTH
          value: neo4j/password
      image: registry.example.com/phpboyscout/rnd_graph:latest
      imagePullPolicy: Always
      volumeMounts:
        - mountPath: /data
          name: neo4j
          subPath: data
  initContainers:
    - name: importer
      args:
        - neo4j_import.sh
      command:
        - /bin/bash
      env:
        - name: DATA_DIR
          value: /import/data
        - name: HEADER_DIR
          value: /srv/headers
      image: registry.example.com/phpboyscout/rnd_graph:latest
      imagePullPolicy: Always
      stdin: true
      workingDir: /srv/src
      volumeMounts:
        - mountPath: /data
          name: neo4j
          subPath: data
        - mountPath: /import
          name: neo4j
          subPath: import
  volumes:
    - name: neo4j
      persistentVolumeClaim:
        claimName: neo4j

Now we can pull it all together with our k8s manifest. Here we have our default neo4j container, which we pass our default authentication details to, and an init container that runs our neo4j_import.sh script. Both containers mount /data from the same persistent volume claim, and the init container also mounts /import from it.

And now we get to…

Troubleshooting

So right off the bat it didn’t work! No surprises there but here are a few things that caused us some issues and how we resolved them.

Database offline

At first glance everything seemed to work. Until we tried to connect to the neo4j database with the default UI, at which point we were presented with the error message

Database "neo4j" is unavailable, its status is "offline."

This took a little sleuthing, and shelling into the neo4j container to look at the /var/debug.log file, which gives significantly more useful information about what’s going on with the server. First we were getting stack traces that contained messages like

Component 'org.neo4j.kernel.impl.transaction.log.files.TransactionLogFiles@59d6a4d1'
was successfully initialized, but failed to start. Please see the attached cause 
exception "/data/transactions/neo4j/neostore.transaction.db.0"

From experience this sounded like a permissions issue and, lo and behold, checking the files on the filesystem showed that because the import script was run as root, the database files were owned by root. We resolved this by adding

chown -R neo4j:neo4j /data/

to the bottom of the import script. Next we were presented with an error that looked like

2020-07-14 16:56:33.919+0000 WARN [o.n.k.d.Database] [neo4j] Exception occurred while
starting the database. Trying to stop already started components. Mismatching store id.

This one seems like it would be an obvious one to google, and I did come up with a few pages that seemed to describe what was happening to me, but they gave varied solutions, from starting and stopping the server and running neo4j-admin unbind in between, to deleting various files. It seemed very strange, because we had tested this with the 3.5.17 version of Neo and it worked fine.

The solution we needed was to wipe the slate clean properly. The line in our script to remove the previous build of the db

# remove old db for rebuild
rm -rf "/data/databases/$DBNAME"

just didn’t cut it. It turns out that because the 4.x version of Neo4j supports multiple databases, the import command writes additional information to the system database and transactions database in the form of some identifiers for each database. If you don’t do something to clear that value for the database you are building, it won’t match up when the server starts and you get a declaration of Mismatching store id.

I’m not sure if the developers are aware of this flaw, so in the meantime we have to expand our cleanup to:

# clean up for fresh import
rm -rf /data/databases/*
rm -rf /data/transactions/*

removing the neo4j, system and store_lock entries and the transaction logs from the data store. This solved the problem: the server was able to start and we could connect to the neo4j database successfully.

It’s not an ideal solution. I can foresee situations we will have to work around once multiple databases are needed and are built separately and independently from each other, but it will suffice for now.
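
Put together, the two fixes from this section bracket the import. The sketch below is one way to package them rather than the exact script we ran: the function names are made up, the data root is a parameter purely so it can be tried against a scratch directory, and the ownership fix is skipped when there is no neo4j user on the machine.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Wipe every database and transaction log so the import starts clean.
# The data root defaults to /data, the mount path used in the manifest.
clean_data_root() {
  local data_root="${1:-/data}"
  # ":?" aborts if data_root is empty, so this can never expand to "/databases/*".
  rm -rf "${data_root:?}/databases/"* "${data_root:?}/transactions/"*
}

# Hand ownership back to the neo4j user after importing as root.
# Skipped when the user does not exist (for example when testing off-cluster).
fix_ownership() {
  local data_root="${1:-/data}"
  if id neo4j >/dev/null 2>&1; then
    chown -R neo4j:neo4j "$data_root"
  fi
}

# Intended order inside the import script:
#   clean_data_root /data
#   neo4j-admin import ...
#   fix_ownership /data
```

The cleanup goes before the import and the chown after it, because the import itself is what creates the root-owned files.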

Malloc(): Error message goes here

Once it was up and running we noticed that we were getting lots of restarts on the main neo4j container. A quick look at the stdout log showed each restart ending with something that looked like

malloc(): corrupted top size

Instantly this looks like an issue with memory sizing inside the container for the JVM. Thankfully the team at Neo4j have accounted for this and give you a nice tool in the form of

neo4j-admin memrec

which interrogates the databases and suggests some sensible values you can set. In our case the output looked like


# Memory settings recommendation from neo4j-admin memrec:
#
# Assuming the system is dedicated to running Neo4j and has 376.6GiB of memory,
# we recommend a heap size of around 31g, and a page cache of around 331500m,
# and that about 22400m is left for the operating system, and the native memory
# needed by Lucene and Netty.
#
# Tip: If the indexing storage use is high, e.g. there are many indexes or most
# data indexed, then it might advantageous to leave more memory for the
# operating system.
#
# Tip: Depending on the workload type you may want to increase the amount
# of off-heap memory available for storing transaction state.
# For instance, in case of large write-intensive transactions
# increasing it can lower GC overhead and thus improve performance.
# On the other hand, if vast majority of transactions are small or read-only
# then you can decrease it and increase page cache instead.
#
# Tip: The more concurrent transactions your workload has and the more updates
# they do, the more heap memory you will need. However, don't allocate more
# than 31g of heap, since this will disable pointer compression, also known as
# "compressed oops", in the JVM and make less effective use of the heap.
#
# Tip: Setting the initial and the max heap size to the same value means the
# JVM will never need to change the heap size. Changing the heap size otherwise
# involves a full GC, which is desirable to avoid.
#
# Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=31g
dbms.memory.heap.max_size=31g
dbms.memory.pagecache.size=331500m
#
# It is also recommended turning out-of-memory errors into full crashes,
# instead of allowing a partially crashed database to continue running:
#dbms.jvm.additional=-XX:+ExitOnOutOfMemoryError
#
# The numbers below have been derived based on your current databases located at: '/var/lib/neo4j/data/databases'.
# They can be used as an input into more detailed memory analysis.
# Total size of lucene indexes in all databases: 0k
# Total size of data and native indexes in all databases: 17300m

So how do we get these values into the container? Thankfully this is handled for you in the form of environment variables you can pass into the Docker image. A bit of a google and I found this little snippet, which is a goldmine for telling us how to translate settings into environment variables.

# Env variable naming convention:
# - prefix NEO4J_
# - double underscore char '__' instead of single underscore '_' char in the setting name
# - underscore char '_' instead of dot '.' char in the setting name
# Example:
# NEO4J_dbms_tx__log_rotation_retention__policy env variable to set
# dbms.tx_log.rotation.retention_policy setting
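
The convention is mechanical enough to script. Here is a small bash sketch of it; the function name to_neo4j_env_name is made up for the example, and the order of the two substitutions matters, because underscores have to be doubled before the dots are turned into underscores.

```shell
#!/usr/bin/env bash

# Convert a neo4j.conf setting name to the Docker image's env variable name:
# '_' becomes '__', then '.' becomes '_', then the NEO4J_ prefix is added.
to_neo4j_env_name() {
  local setting="$1"
  setting="${setting//_/__}"
  setting="${setting//./_}"
  printf 'NEO4J_%s\n' "$setting"
}

# The three settings recommended by memrec, plus the example from the snippet.
for s in dbms.memory.heap.initial_size \
         dbms.memory.heap.max_size \
         dbms.memory.pagecache.size \
         dbms.tx_log.rotation.retention_policy; do
  to_neo4j_env_name "$s"
done
```

For dbms.tx_log.rotation.retention_policy this gives NEO4J_dbms_tx__log_rotation_retention__policy, matching the example in the snippet.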

As for getting the variables into the container, you could do this from the pod and inject them in. In this case, because the data we are going to be using is reasonably stable and tested, we decided to stick them into the Dockerfile with the ENV directive.

ENV NEO4J_dbms_memory_heap_initial__size 31g
ENV NEO4J_dbms_memory_heap_max__size 31g
ENV NEO4J_dbms_memory_pagecache_size 331500m

And so far we haven’t had a restart yet!