AWS on PHP Boy Scout

A 403 you can't fix in IAM

Thu, 14 May 2026 00:00:00 +0000

The OIDC post explained the handshake that lets a GitLab pipeline deploy to AWS with no stored key. This is the story of the first time I got it wrong, and spent an afternoon fixing the wrong thing. The error was a flat 403 from AWS, and the maddening part is that no amount of editing the IAM policy was ever going to fix it.

A 403 on the first real run

The OIDC post covered the handshake: GitLab CI mints a signed token, AWS exchanges it for short-lived credentials against a role whose trust policy names the pipeline. During the GitLab migration I wired exactly that up for the infra repo, including a trust policy condition meant to let merge-request pipelines run a plan.

The first merge request that should have triggered tofu-plan didn’t run it. The job failed, and the error from AWS was a flat AccessDenied. A 403.

The instinct, and why it wastes an afternoon

The instinct on an IAM 403 is immediate and almost always right: the policy’s wrong, so go and edit the policy. Tighten the condition. Loosen the condition. Check the wildcard. Re-read the sub pattern character by character.

All of that was wasted, and it was wasted for a reason that took me far too long to see. The trust policy wasn’t matching the wrong value. It was matching a value that does not exist. No amount of editing a condition makes it match a thing that’s never present.

What is actually in the token

GitLab’s OIDC token has a sub claim that encodes the pipeline’s context, and part of that encoding is a ref_type. I’d assumed ref_type could be branch, tag, or mr, because a pipeline can certainly be a branch pipeline, a tag pipeline, or a merge-request pipeline. So the trust policy, for the plan job, matched a sub containing ref_type:mr.

That assumption was wrong. GitLab’s ref_type is branch or tag. That’s the entire set. There is no mr.

A merge-request pipeline doesn’t run against a merge-request ref. It runs against the source branch. So its token’s sub carries ref_type:branch, like any other branch pipeline. The trust policy condition asked for ref_type:mr, GitLab never puts mr in a token, the condition was therefore never true, and every merge-request pipeline got a 403. Forever, until the policy stopped asking for a claim that isn’t real.
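To make the mismatch concrete, here is a minimal sketch in Python. It assumes the sub layout GitLab documents (project_path:…:ref_type:…:ref:…), uses a made-up project path, and stands in for IAM’s StringLike with a simple wildcard match, so treat it as an illustration rather than a policy simulator.

```python
from fnmatch import fnmatchcase


def gitlab_sub(project_path: str, ref_type: str, ref: str) -> str:
    """Build a sub claim in the layout GitLab documents for its ID tokens."""
    return f"project_path:{project_path}:ref_type:{ref_type}:ref:{ref}"


def string_like(pattern: str, value: str) -> bool:
    """Rough stand-in for IAM StringLike: case-sensitive '*' and '?' wildcards."""
    return fnmatchcase(value, pattern)


# Hypothetical project path, purely for illustration.
broken = "project_path:example-group/infra:ref_type:mr:ref:*"
fixed = "project_path:example-group/infra:ref_type:branch:ref:*"

# A merge-request pipeline runs on the source branch, so this is what it presents.
mr_pipeline_sub = gitlab_sub("example-group/infra", "branch", "feature/x")

print(string_like(broken, mr_pipeline_sub))  # False
print(string_like(fixed, mr_pipeline_sub))   # True
```

No value of the trailing wildcard rescues the first pattern, because the part that fails is the literal mr in the middle.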

The fix, and the lesson worth more than the fix

The fix is small once it’s visible: match ref_type:branch and narrow it down by branch name or project path instead. An afternoon of policy edits, and the actual change is one word.

The lesson is the part worth keeping. When an OIDC trust fails, the useful question is never “is my policy clever enough”. It’s “what’s actually in the token”. An OIDC trust policy can only ever match the claims the identity provider genuinely asserts, and the gap between what a provider asserts and what you assumed it asserts is precisely where this class of bug lives.

So the move, when an OIDC handshake 403s, is to get hold of a real token and decode it. Look at the actual sub, the actual claims, the actual values. Match what’s there. A 403 that survives every sensible edit to the policy is usually not a policy that’s too loose or too strict. It’s a policy matching a claim that was never going to be in the token.
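Decoding a token needs nothing more than base64 and JSON. The sketch below assumes you already have the raw JWT as a string (in a job it would come from whatever variable your id_tokens: block names); it builds a stand-in token with made-up claims so it runs offline, and it deliberately does not verify the signature, because the goal is only to read what’s there.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_claims(jwt: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    _header, payload, _signature = jwt.split(".")
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


# Stand-in token with illustrative claims; a real one is signed by GitLab.
claims = {
    "iss": "https://gitlab.com",
    "sub": "project_path:example-group/infra:ref_type:branch:ref:feature/x",
    "ref_type": "branch",
}
token = ".".join([
    b64url(json.dumps({"alg": "RS256", "typ": "JWT"}).encode()),
    b64url(json.dumps(claims).encode()),
    "signature-not-checked",
])

print(decode_claims(token)["sub"])
```

Thirty seconds with the output of something like this would have replaced my afternoon of policy edits.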

The habit it left behind

I wired an OIDC trust policy to let merge-request pipelines plan, by matching a sub claim with ref_type:mr. The first real merge request got a 403, and no edit to the policy fixed it, because GitLab’s ref_type is only ever branch or tag. A merge-request pipeline runs on a branch ref, so the mr value the policy demanded was never in any token.

The fix was one word. The habit it left behind is the valuable bit: when an OIDC trust fails, stop editing the policy and go and read a real token. A trust policy can only match what the provider actually asserts, and “what I assumed it asserts” is where the 403 was hiding the whole time. (If this shape of bug feels familiar by the end of the series, that’s not an accident: I come back to it with two more from exactly the same family.)

Routing security findings without the noise

Tue, 12 May 2026 00:00:00 +0000

Turning on GuardDuty and Security Hub gives you threat detection. It also gives you a firehose. And an alert system that dutifully forwards everything in that firehose isn’t monitoring, it’s a very efficient way of training your team to ignore alerts. So the alerts module’s real job isn’t detection at all. It’s deciding what’s actually worth interrupting a human for, and the interesting part is everything it deliberately throws away.

Detection is the easy half

Switching on threat detection in an AWS account is a few resources. GuardDuty, Security Hub with its standards, IAM Access Analyzer: the security baseline does exactly that. From then on, the account is generating findings.

And it generates a lot of them. Plenty are low-severity, informational, or simply the normal texture of a cloud account. If you wire every finding to an email or a pager, you haven’t built monitoring. You’ve built noise. And noise has a specific failure mode: people stop reading it, and the one finding that genuinely mattered scrolls past unread alongside two hundred that didn’t.

So the valuable work isn’t detection. It’s routing: deciding what’s worth interrupting a human for, and letting the rest sit quietly in a console for whenever someone reviews it.

Forward the severe, leave the rest

The alerts module routes findings with EventBridge rules into an SNS topic that emails out. The rules are deliberately picky. GuardDuty findings are forwarded only at severity 7 and above. Security Hub findings are forwarded only at HIGH and CRITICAL.

Everything below those thresholds isn’t discarded. It’s still in GuardDuty and Security Hub, where someone doing a review will see it. It just doesn’t get to interrupt anyone’s day. The threshold is the line between “look at this now” and “look at this sometime”.

The duplicate you would otherwise send twice

Here’s the subtle one, and it’s the kind of thing you only find by looking closely at where findings come from.

Security Hub is an aggregator. It pulls findings in from other services, GuardDuty among them. So a single GuardDuty finding can show up in two places: in GuardDuty itself, and again in Security Hub as an aggregated copy.

A rule on GuardDuty findings and a rule on Security Hub HIGH/CRITICAL findings would therefore both fire for the same underlying GuardDuty finding. One event, two emails. Do that across an account and a meaningful fraction of your alert volume is just the same findings counted twice, which is its own kind of noise.

So the Security Hub rule explicitly excludes findings whose ProductName is GuardDuty, with an anything-but match. GuardDuty findings come through the GuardDuty rule. The Security Hub rule handles everything Security Hub adds that GuardDuty didn’t already report. One finding, one alert, regardless of how many services it passed through.
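As a sketch of how the two rules divide the work, here are event patterns of roughly that shape written as Python dictionaries, with a deliberately tiny matcher that understands only the three operators used here. The matcher is my simplification for illustration, not EventBridge’s real engine, and the sample events are trimmed to the fields the patterns look at.

```python
GUARDDUTY_RULE = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}

SECURITYHUB_RULE = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {"findings": {
        "Severity": {"Label": ["HIGH", "CRITICAL"]},
        "ProductName": [{"anything-but": "GuardDuty"}],
    }},
}


def leaf_matches(alternative, value) -> bool:
    """Match one pattern alternative: exact value, numeric >=, or anything-but."""
    if isinstance(alternative, dict):
        if "numeric" in alternative:
            op, bound = alternative["numeric"]
            return op == ">=" and isinstance(value, (int, float)) and value >= bound
        if "anything-but" in alternative:
            return value != alternative["anything-but"]
        return False
    return alternative == value


def matches(pattern: dict, event: dict) -> bool:
    """Simplified model: every pattern key must match; lists match on any element."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        values = event[key] if isinstance(event[key], list) else [event[key]]
        if isinstance(expected, dict):
            if not any(isinstance(v, dict) and matches(expected, v) for v in values):
                return False
        elif not any(leaf_matches(alt, v) for alt in expected for v in values):
            return False
    return True


native = {"source": "aws.guardduty", "detail-type": "GuardDuty Finding",
          "detail": {"severity": 8}}
aggregated_copy = {"source": "aws.securityhub",
                   "detail-type": "Security Hub Findings - Imported",
                   "detail": {"findings": [{"ProductName": "GuardDuty",
                                            "Severity": {"Label": "HIGH"}}]}}

print(matches(GUARDDUTY_RULE, native))             # True
print(matches(SECURITYHUB_RULE, aggregated_copy))  # False
```

The native finding alerts once through the GuardDuty rule, and its aggregated copy is turned away by the anything-but clause.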

Two tripwires on the root account

Findings are about threats the detectors recognise. The module adds two alarms about something simpler: the root account doing anything at all.

One CloudWatch alarm fires on a root console sign-in. The other fires on any root API call that isn’t a console login. In a well-run AWS account, the root user does almost nothing after initial setup: day-to-day work happens through roles. So root activity isn’t a “finding” to be assessed for severity. It’s a tripwire. Any of it, in an account that should be silent, is worth an immediate look, and the two alarms say so directly.
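The distinction the two alarms draw can be sketched as a classifier over CloudTrail records. This is an illustration of the logic rather than the module’s actual filter; the field names are CloudTrail’s, and the exclusion of service-initiated events is my assumption, borrowed from the usual root-activity filter.

```python
from typing import Optional


def root_tripwire(record: dict) -> Optional[str]:
    """Classify a CloudTrail record as root 'console-sign-in', 'api-call', or None."""
    identity = record.get("userIdentity", {})
    if identity.get("type") != "Root":
        return None
    # Assumption: activity AWS services perform on the account's behalf is not
    # a person using root, so it should not trip either alarm.
    if "invokedBy" in identity or record.get("eventType") == "AwsServiceEvent":
        return None
    if record.get("eventName") == "ConsoleLogin":
        return "console-sign-in"
    return "api-call"


sign_in = {"userIdentity": {"type": "Root"}, "eventName": "ConsoleLogin",
           "eventType": "AwsConsoleSignIn"}
api_call = {"userIdentity": {"type": "Root"}, "eventName": "CreateAccessKey",
            "eventType": "AwsApiCall"}
role_call = {"userIdentity": {"type": "AssumedRole"}, "eventName": "CreateBucket",
             "eventType": "AwsApiCall"}

print(root_tripwire(sign_in))    # console-sign-in
print(root_tripwire(api_call))   # api-call
print(root_tripwire(role_call))  # None
```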

Why a quiet alert stream matters here

This is monitoring for the account that’s going to hold the release-signing key, and that raises the stakes on getting the routing right.

If a key-bearing account ever does come under attack, the alert that says so has to be seen. An alert stream that’s mostly noise and duplicates is, functionally, no alerting at all, because the people who’d act on it have long since tuned it out. Routing the stream down to “severe, deduplicated, plus root tripwires” is what keeps it something a human will still read on the day it finally matters.

The short version

GuardDuty and Security Hub make detection easy. The hard, valuable part is routing: forwarding what deserves to interrupt someone and leaving the rest in a console.

The alerts module forwards GuardDuty at severity 7-plus and Security Hub at HIGH/CRITICAL, and it drops the duplicate that aggregation creates by excluding GuardDuty-sourced findings from the Security Hub rule, so one finding is one alert. Two CloudWatch alarms act as tripwires on root-account activity, which should be near-zero. For the account that will hold the signing key, a quiet, trustworthy alert stream isn’t a nicety. It’s the difference between monitoring and theatre.

Hardening the account that will hold the keys

Sat, 09 May 2026 00:00:00 +0000

Bootstrapping the account got it ready: somewhere to store state, an identity to deploy as, enough for the next tofu apply to run. Ready is not the same as safe. An account with no audit trail, nothing watching it, and no considered way for a human to get in is fine for experimenting and absolutely not where you put the most sensitive key in the system. So before the signing key goes anywhere near it, the account gets a security baseline.

Ready is not the same as safe

The bootstrap post ended with an account that was ready: it had somewhere to store state and a CI identity to deploy as. The next tofu apply could run.

Ready is not safe. That account still has no audit trail, so nobody could tell you afterwards what happened in it. It has no threat detection, so nothing is watching. Its defaults are AWS’s defaults, which are not a security posture. There’s no considered way for a human to get in. An account in that condition is fine for experimenting. It’s not somewhere you put the most sensitive key in the whole system.

So before the signing key is anywhere near it, the account gets a security baseline.

The baseline, in one downstream stack

terraform-aws-security-baseline is that baseline, and it’s exactly the downstream stack the bootstrap post promised: applied through the automation role bootstrap created, not bootstrapped specially.

It’s six sub-modules, each behind an enable_* toggle: account-hardening (IAM password policy, account-wide S3 public-access blocking, default EBS encryption), audit-logging (a multi-region CloudTrail with log-file validation), aws-config, threat-detection (GuardDuty, Security Hub, IAM Access Analyzer), alerts, and operator-role. Together they turn a bare account into one that records what happens, watches for trouble, and controls who gets in.

Most of those are the expected baseline. The operator role is the one worth slowing down on, because it’s built backwards from how people usually think about an admin role.

The operator role, and the inversion

InfraAdmin is the human way into the account: the role a person assumes to do operator work. Two things define it.

The trust policy decides who may assume it. It trusts only the account root principal, and it requires multi-factor authentication: the assume call must carry aws:MultiFactorAuthPresent, and aws:MultiFactorAuthAge bounds how recently that MFA was performed. No MFA, no role. So far this is a careful but ordinary admin role.
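A trust policy of that shape looks roughly like the following, rendered from Python so the pieces are easy to see. The account ID and the one-hour MFA age are placeholders I’ve picked for the example, not the module’s real values.

```python
import json


def operator_trust_policy(account_id: str, max_mfa_age_seconds: int) -> dict:
    """Trust policy: account root principal only, and only with recent MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {
                # The caller's session must have been authenticated with MFA...
                "Bool": {"aws:MultiFactorAuthPresent": "true"},
                # ...and that MFA must be younger than the chosen bound.
                "NumericLessThan": {"aws:MultiFactorAuthAge": str(max_mfa_age_seconds)},
            },
        }],
    }


# Placeholder account ID and a hypothetical one-hour bound.
print(json.dumps(operator_trust_policy("111122223333", 3600), indent=2))
```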

The inversion is a second, separate inline policy, and it’s almost entirely Deny. It denies, using NotAction, anything where aws:RequestedRegion falls outside an allowed set of regions. The role’s power comes from an admin grant. This inline policy fences that power.

That’s the part worth holding onto. People picture an admin role as a list of what it can do. This one is better understood by what it cannot: it cannot act outside its permitted regions, full stop. A fat-fingered command, or a compromised session, cannot quietly spin resources up in some region nobody’s watching. The fence is as much the point of the role as the grant is.

The carve-out, because honesty

There’s a fiddly detail, and it’s the kind of thing that makes the region fence real rather than theoretical.

Some AWS services are global. IAM, CloudFront, Route 53 and friends aren’t regional: their API calls are served from a single endpoint (us-east-1 for most of them), so aws:RequestedRegion never reflects the region you think you’re working in. A naive region-deny whose allowed list leaves that region out would therefore deny calls to IAM, and you’d lock yourself out of the very service you manage access with. (A close cousin of the kind of self-inflicted lockout I’ll come back to in a later post.)

So the Deny carries explicit carve-outs for the global services. It isn’t elegant, and it can’t be: the global-versus-regional split is just a fact of AWS, and a correct region fence has to account for it. The carve-out list is the honest cost of the control working.
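Put together, the fence and its carve-outs come out as a single Deny statement. The sketch below builds one in Python; the regions and the carve-out list are examples I’ve chosen, and a real list would need every global service the role actually touches.

```python
import json


def region_fence(allowed_regions, global_service_actions) -> dict:
    """Deny everything outside the allowed regions, except the carved-out actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideAllowedRegions",
            "Effect": "Deny",
            # NotAction: the Deny applies to every action EXCEPT these.
            "NotAction": sorted(global_service_actions),
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": sorted(allowed_regions)}
            },
        }],
    }


# Illustrative values only.
fence = region_fence(
    allowed_regions=["eu-west-2", "eu-west-1"],
    global_service_actions=["iam:*", "cloudfront:*", "route53:*"],
)
print(json.dumps(fence, indent=2))
```

Because an explicit Deny beats any Allow, this one statement holds regardless of how broad the role’s admin grant is.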

Harden the room, then move the keys in

There’s an order to all of this, and the order is the argument.

The account that will hold the signing key has to be audited before the key arrives, so that from day one every call against it is in CloudTrail. It has to be watched before the key arrives, so GuardDuty is already looking. It has to be access-controlled before the key arrives, so the only human path in is MFA-gated and region-fenced.

You don’t move something valuable into a room and then think about locks. You build the room, fit the locks, check they work, and then move the valuable thing in. The security baseline is fitting the locks. The signing key comes later, into a room already built for it.

Worth remembering

Bootstrapping an account makes it ready for the next deploy. It does not make it safe to hold anything that matters. terraform-aws-security-baseline is the downstream stack that closes that gap: audit logging, AWS Config, threat detection, alerting, account hardening, and an operator role, applied through the CI role bootstrap created.

The operator role is the piece to study. It’s MFA-gated on the way in, and then fenced by a separate, almost-all-Deny inline policy that confines it to permitted regions, with carve-outs for the global services that have no region. An admin role defined as much by its fence as its grant. Harden the room first; the keys move in afterwards.

No access keys in CI

Fri, 08 May 2026 00:00:00 +0000

A long-lived AWS access key, sitting in a CI system, is just about the single credential I’d most like to be rid of. It’s powerful, it never expires unless someone remembers to rotate it (nobody remembers to rotate it), and it lives in one of the most attractive targets in the whole supply chain. For infrastructure that’s eventually going to hold a release-signing key, it’s exactly the wrong place to start. So the phpboyscout infrastructure has no AWS access key in CI at all. None.

The access key you don’t want

A CI pipeline that runs tofu apply against AWS needs AWS credentials. The traditional way to give it some is an IAM user with an access key pair, pasted into the CI system as a masked variable.

Look at what that key is. It’s long-lived: it works until someone remembers to rotate it, and rotating it is a chore, so mostly nobody does. It’s powerful: it can apply infrastructure, so it can do nearly anything. And it’s sitting in a CI system, which is one of the most attractive targets in your whole supply chain. You’ve taken your highest-value credential and stored a permanent copy of it in a place built for running automated jobs.

For infrastructure that’s going to hold a release-signing key, that’s precisely the wrong starting point. So the phpboyscout infrastructure has no AWS access key in CI at all. Not a well-guarded one. None.

Federation instead of a stored secret

The replacement is OIDC federation, and the shape of it is worth walking through, because it’s genuinely different from “a secret, but better”.

A modern CI platform can mint an OIDC token. GitLab does this with an id_tokens: block: at job time, GitLab issues a short-lived JSON Web Token, signed by GitLab, that asserts a set of facts. This is project X. This is pipeline Y. This is running on ref Z, of this type.

AWS can consume that. The sts:AssumeRoleWithWebIdentity call takes such a token and, if it satisfies an IAM role’s trust policy, returns short-lived AWS credentials for that role. The trust policy is where the control lives: it names GitLab as a trusted token issuer, and it constrains the token’s sub claim so that only the specific project, and the specific refs, you intend can assume the role.

Put it together: the pipeline asks GitLab for a token, hands it to AWS, and gets back credentials that last about an hour and are scoped to one role. Nothing long-lived is stored anywhere. The credential exists only for the job that needs it, and it can’t be stolen from a CI variable store, because it was never in one.
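For a feel of where the control lives, here is a sketch of a trust policy for a GitLab-federated role, built in Python. The account ID, project path and audience are placeholders, and gitlab.com as the issuer host is an assumption; a self-hosted instance would use its own hostname in both the provider ARN and the condition keys.

```python
import json


def gitlab_trust_policy(account_id: str, project_path: str, audience: str,
                        gitlab_host: str = "gitlab.com") -> dict:
    """Trust policy letting branch pipelines of one project assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{gitlab_host}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # The token must have been minted for this audience...
                "StringEquals": {f"{gitlab_host}:aud": audience},
                # ...and its sub must name this project on a branch ref.
                "StringLike": {
                    f"{gitlab_host}:sub":
                        f"project_path:{project_path}:ref_type:branch:ref:*"
                },
            },
        }],
    }


# Placeholder values for illustration.
policy = gitlab_trust_policy("111122223333", "example-group/infra", "https://gitlab.com")
print(json.dumps(policy, indent=2))
```

Tightening the trailing wildcard to a single branch name is how you would reserve the role for, say, applies from main.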

Two halves of one handshake

That handshake is built by two of the repos in this series, each owning one side.

terraform-aws-bootstrap builds the AWS half, in its automation-iam module: it registers GitLab as an OIDC identity provider in the account, and it creates the automation role with the trust policy that decides which pipelines may assume it.

The CI components build the consuming half: the id_tokens: block that asks GitLab for the JWT, and then simply letting the AWS provider’s own credential chain perform the exchange. The pipeline doesn’t call sts by hand. It presents the token; the SDK does the rest.

The gotcha: don’t set a profile

There’s one quiet way to break this, and a stack can look completely correct while doing it.

The AWS SDK finds credentials by walking a chain of sources in order. The web-identity path, the one that uses the OIDC token, is one link in that chain. It triggers off environment variables the CI sets up automatically: AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.

But if the aws provider block has a hardcoded profile = "...", the SDK takes the profile link of the chain instead, and never reaches the web-identity link. A profile line is the sort of thing that ends up in a provider block from someone’s local development setup, where it’s exactly right. Committed and run in CI, it silently short-circuits the federation. The pipeline either fails to find credentials, or finds the wrong ones.

The rule is simple once you know it: the provider block that runs in CI must not name a profile. Leave the chain free to find the web identity. It’s the kind of bug that teaches you to be precise about which link of the credential chain you’re actually relying on.
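A rule like that is easy to guard with a small check in the pipeline. The sketch below is a hypothetical lint, not something the CI components ship: it finds provider "aws" blocks by brace counting and flags any profile argument inside them. It is a text scan rather than a real HCL parser, so treat a clean result as a hint and not a proof.

```python
import re

PROVIDER_AWS = re.compile(r'provider\s+"aws"\s*\{')
PROFILE_LINE = re.compile(r"^\s*profile\s*=", re.MULTILINE)


def provider_blocks(hcl: str):
    """Yield the body of each provider "aws" block, found by counting braces."""
    for match in PROVIDER_AWS.finditer(hcl):
        depth, start = 1, match.end()
        for index in range(start, len(hcl)):
            if hcl[index] == "{":
                depth += 1
            elif hcl[index] == "}":
                depth -= 1
                if depth == 0:
                    yield hcl[start:index]
                    break


def has_hardcoded_profile(hcl: str) -> bool:
    """True if any aws provider block sets a profile argument."""
    return any(PROFILE_LINE.search(body) for body in provider_blocks(hcl))


local_style = 'provider "aws" {\n  region  = "eu-west-2"\n  profile = "personal"\n}\n'
ci_safe = 'provider "aws" {\n  region = "eu-west-2"\n}\n'

print(has_hardcoded_profile(local_style))  # True
print(has_hardcoded_profile(ci_safe))      # False
```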

The bottom line

Giving CI an AWS access key means storing your most powerful, longest-lived credential in one of your most exposed systems. OIDC federation removes it entirely. The CI platform mints a short-lived signed token, AWS exchanges it via AssumeRoleWithWebIdentity for hour-long credentials against a role whose trust policy names the exact pipeline, and nothing permanent is stored.

terraform-aws-bootstrap builds the AWS side, the identity provider and the trust policy; the CI components build the consuming side, the token request. The one trap is a hardcoded profile in the provider block, which short-circuits the SDK’s credential chain before it reaches the web-identity path. Get that right, and a pipeline deploys to AWS as a verifiable, short-lived identity, with no key to steal.

The chicken-and-egg of remote state

Wed, 06 May 2026 00:00:00 +0000

Here’s a puzzle that every infrastructure-as-code setup hits exactly once, right at the very beginning, and then never again. An OpenTofu stack stores its state in a backend. The bootstrap stack I wrote about last time has a particular job, and part of that job is to create the backend that remote state lives in. So where does the bootstrap stack store its own state, on the very first run, before it’s built the place state is supposed to go?

Where does the state of the thing that makes the state store live?

That’s the puzzle, and it’s a real ordering deadlock rather than a riddle.

An OpenTofu stack keeps a state file, and for anything shared that state file lives in a remote backend: on AWS, an S3 bucket. Fine. But the bootstrap stack has a particular job, and part of that job is to create the S3 bucket that remote state lives in.

So walk through the first run. Bootstrap has never been applied. The state bucket doesn’t exist, because creating it is what bootstrap is for. Bootstrap needs somewhere to store its own state. The only place that would make sense is the bucket it’s about to create, which isn’t there yet. The thing that builds the state store can’t store its state in the state store.

Run local, then migrate

The way out is a two-step that OpenTofu supports directly.

Bootstrap starts configured with a local backend: backend "local" {}. State is just a file on the operator’s machine. With that in place, the first tofu apply runs. It creates the S3 bucket and the KMS key, and records all of it in the local state file.

Now the bucket exists. So the backend configuration is rewritten to point at it: an s3 backend block naming the new bucket. Then tofu init -migrate-state. OpenTofu sees the backend has changed, picks up the local state file, and copies it into the S3 bucket. From that point on, bootstrap’s own state lives in the bucket that bootstrap created. The egg has laid the chicken.

The local backend was a scaffold. It existed for exactly one apply, to break the ordering deadlock, and then the state moved off it and it was never used again.

It happened twice

The infra repo actually did this migration twice, and the second time is the proof that the pattern is general rather than a one-off trick.

The first migration was the one above: local to S3, at the very start. The second came later, during the move from GitHub to GitLab. GitLab offers a managed HTTP state backend, and infra chose to use it. So the backend block was rewritten again, this time from s3 to http, and tofu init -migrate-state ran again, copying the state from the S3 bucket to GitLab’s backend.

The same move, twice, against three different backends. That’s the useful lesson hiding in the chicken-and-egg story. State is portable. The backend is just where you currently keep it, not a property of the stack itself, and moving it is a routine, supported operation rather than surgery.

Why this is the honest answer, not a hack

It’s easy to look at “apply once with a local backend, then migrate” and feel it’s a bit of a smell, a workaround for something that should have been cleaner.

It isn’t. It’s the honest answer to a real ordering problem, and the alternatives are worse.

The obvious alternative is to create the state bucket by hand, in the console, before running bootstrap at all. But then the most important bucket in the account is unmanaged. It exists outside every OpenTofu graph, nobody’s code describes it, its encryption and policy and prevent_destroy are whatever someone clicked that day, and it drifts. The local-then-migrate dance avoids exactly that. The bucket is created by bootstrap, described in code, and tracked in bootstrap’s own state from its very first apply. It’s managed from birth.

The chicken-and-egg isn’t a flaw to be embarrassed about. It’s just the shape of the problem when a stack has to build its own foundations, and OpenTofu’s -migrate-state is the supported tool for exactly that shape.

Pulling it together

Every OpenTofu stack needs a backend to store state, and the bootstrap stack’s job is to create the backend, so on its first run the bucket it needs doesn’t yet exist.

The resolution is to run bootstrap once with a local backend, let that apply create the bucket and key, then rewrite the backend configuration and tofu init -migrate-state the state into the bucket bootstrap just made. The infra repo did it twice, local to S3 and later S3 to GitLab, which shows the real point: state is portable, and the backend is just where you keep it. Doing it this way, rather than hand-creating the bucket, is what keeps that critical bucket managed in code from its very first day.

A state bucket that defends itself

Sat, 02 May 2026 00:00:00 +0000

OpenTofu’s remote state file is, quietly, the most sensitive thing in an infrastructure repo. It’s a plain JSON document listing every resource you manage, every ID, and, depending on your providers, the odd secret in clear text. So the S3 bucket that holds it can’t just be a bucket. It has to actively defend itself, on three separate fronts.

The most sensitive file in the repo

OpenTofu, like Terraform, keeps a state file: a JSON document recording every resource the stack manages, its real-world ID, and its attributes. It’s how the tool knows what already exists. It’s also, quietly, the most sensitive file in the whole repo. It can hold resource identifiers an attacker would value, and depending on the providers in play it can hold secret values in clear text.

Three bad things can happen to it. It can be deleted, and now the tool has forgotten everything it manages. It can be read by someone who shouldn’t. It can be corrupted by two runs writing at once. The bucket that holds remote state has to defend against all three, and terraform-aws-bootstrap’s state-backend module is built around doing exactly that.

The DynamoDB lock table is gone

Start with the corruption problem, because the answer changed recently.

The long-standing pattern for remote state on AWS was an S3 bucket plus a DynamoDB table. S3 held the state; the DynamoDB table held a lock, so two apply runs couldn’t write at once. Everyone who’s done Terraform on AWS has provisioned that table, probably more times than they’d care to count.

OpenTofu 1.10 made it unnecessary. The S3 backend gained use_lockfile, which does the locking with a small lock object in the same bucket, using S3’s conditional-write support. No separate table. The state backend is now genuinely one bucket and one key, with the lock living beside the state. It’s one fewer resource to create, one fewer thing to pay for, and one fewer moving part to reason about. The module takes the new path, and the DynamoDB table simply isn’t there.
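The idea behind lock-by-conditional-write fits in a few lines. This is an in-memory model in Python, not the S3 API or OpenTofu’s implementation: the store only has to promise that a write which says “create this only if it doesn’t exist” succeeds for exactly one caller. The .lock suffix and the owner strings are illustrative.

```python
class ConditionalStore:
    """In-memory stand-in for a bucket that supports create-only-if-absent writes."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key: str, body: str) -> bool:
        """Write the object only if the key is free; report whether we wrote it."""
        if key in self._objects:
            return False
        self._objects[key] = body
        return True

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)


def acquire_lock(store: ConditionalStore, state_key: str, owner: str) -> bool:
    """Take the lock by creating a small lock object beside the state object."""
    return store.put_if_absent(state_key + ".lock", owner)


def release_lock(store: ConditionalStore, state_key: str) -> None:
    store.delete(state_key + ".lock")


bucket = ConditionalStore()
print(acquire_lock(bucket, "bootstrap/terraform.tfstate", "run-a"))  # True
print(acquire_lock(bucket, "bootstrap/terraform.tfstate", "run-b"))  # False
release_lock(bucket, "bootstrap/terraform.tfstate")
print(acquire_lock(bucket, "bootstrap/terraform.tfstate", "run-b"))  # True
```

The second run loses the race without any second service being involved, which is the whole reason the DynamoDB table can go.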

A bucket you can’t delete by accident

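Deletion is guarded with lifecycle { prevent_destroy = true } on the bucket. With that set, OpenTofu refuses to produce a plan that would destroy the bucket. A stray tofu destroy, or a change to an argument that would force the bucket to be replaced: both fail loudly instead of quietly taking the state bucket with them. The one thing it can’t catch is the resource block itself being deleted, because prevent_destroy lives in that block and goes with it, so a refactor that drops or renames the resource still needs a careful read of the plan.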

This is also why the state-backend module is hand-rolled from raw aws_s3_bucket resources rather than wrapping a community module like terraform-aws-modules/s3-bucket. prevent_destroy has to sit on the actual resource, and a lifecycle block isn’t something you can pass into a wrapper module as an input. Hand-rolling the bucket keeps prevent_destroy somewhere you can put it and, just as importantly, somewhere the next reader can see it. (There’s a whole post coming on why I hand-rolled every module; this is one of the reasons in miniature.)

Reject anything encrypted wrong

Confidentiality is the subtle one, because the obvious control isn’t enough.

The bucket has a default encryption configuration: server-side encryption with the customer-managed KMS key. But default encryption is a default. A client making a PutObject call can override it per request, asking for plain AES256 or a different KMS key, and S3 will honour the override.

So the module doesn’t rely on the default. The bucket policy explicitly denies the upload it doesn’t want. It denies any request not over TLS. It denies any PutObject that isn’t using SSE-KMS. And it denies any PutObject that names the wrong KMS key. The default encryption config says “this is what you get if you don’t ask”; the bucket policy says “and you’re not allowed to ask for anything else”. State can only ever land encrypted, in transit and at rest, under the one key the module controls.
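Those three denials translate into three statements. The sketch below builds a bucket policy of that shape in Python; the bucket name and key ARN are placeholders, and the statement names are mine rather than the module’s.

```python
import json


def state_bucket_policy(bucket: str, kms_key_arn: str) -> dict:
    """Bucket policy that refuses non-TLS access and wrongly encrypted uploads."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    objects_arn = f"{bucket_arn}/*"

    def deny(sid, action, resource, condition):
        return {"Sid": sid, "Effect": "Deny", "Principal": "*",
                "Action": action, "Resource": resource, "Condition": condition}

    return {
        "Version": "2012-10-17",
        "Statement": [
            # Any request that did not arrive over TLS.
            deny("DenyInsecureTransport", "s3:*", [bucket_arn, objects_arn],
                 {"Bool": {"aws:SecureTransport": "false"}}),
            # Any upload that is not asking for SSE-KMS.
            deny("DenyNonKmsUploads", "s3:PutObject", objects_arn,
                 {"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}}),
            # Any upload that names a key other than ours.
            deny("DenyWrongKmsKey", "s3:PutObject", objects_arn,
                 {"StringNotEquals": {
                     "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn}}),
        ],
    }


# Placeholder bucket name and key ARN.
policy = state_bucket_policy(
    "example-tofu-state",
    "arn:aws:kms:eu-west-2:111122223333:key/00000000-0000-0000-0000-000000000000",
)
print(json.dumps(policy, indent=2))
```

One consequence worth knowing, as I understand IAM’s negated operators: an upload that omits the encryption headers also fails these checks, so a client has to ask for the right key explicitly rather than lean on the bucket default.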

One small companion setting: bucket_key_enabled. With per-object SSE-KMS, every object operation is also a KMS API call, which costs money and can throttle. An S3 Bucket Key collapses those into far fewer KMS calls; AWS’s own figure is a reduction in KMS request costs of up to 99 per cent. It’s a one-line setting the module turns on and most people forget exists.

In short

Remote state is the most sensitive file an infrastructure repo has, and the bucket that holds it has to defend against deletion, disclosure and corruption.

terraform-aws-bootstrap’s state backend handles corruption with OpenTofu 1.10’s use_lockfile, dropping the old DynamoDB lock table entirely. It guards deletion with prevent_destroy, which is also why the bucket is hand-rolled rather than wrapped. And it guards confidentiality with a bucket policy that denies non-TLS traffic and denies any upload not encrypted with the right KMS key, because default encryption is only a default and a client can override it. The state bucket isn’t just a place to put state. It’s built to refuse every wrong thing that could happen to it.

The bootstrap that does almost nothing

Fri, 01 May 2026 00:00:00 +0000

A brand-new AWS account is a slightly nerve-wracking thing. It can do almost anything, it’s hardened against almost nothing, and the list of stuff you ought to set up before you trust it with anything real is long. The natural instinct is to write one big “set up the account” module that does the whole list in a single apply. I want to talk you out of that, because the bootstrap module I’m happiest with does almost nothing, on purpose.

The first-apply problem

A brand-new AWS account is not ready for anything serious. Before you’d responsibly run real infrastructure into it, you want an account baseline: a password policy, account-wide S3 public-access blocking, default EBS encryption, CloudTrail, AWS Config, GuardDuty, alerting, a sensible human operator role. It’s a long list, and all of it matters.

The instinct, faced with that list, is to write one big “set up the account” module and have it do everything. One tofu apply, a fully prepared account, done.

That instinct is worth resisting, and terraform-aws-bootstrap resists it deliberately.

Three things, and a hard line

terraform-aws-bootstrap does three things:

state-backend, an S3 bucket and a customer-managed KMS key to hold remote Terraform state.
automation-iam, an OIDC identity provider and an IAM role that CI assumes to apply everything else.
nuke-config, which renders an aws-nuke configuration scoped to the account, for tearing a throwaway account back down.

That’s the whole module. Account hardening, CloudTrail, AWS Config, GuardDuty, the operator role, the alerting: none of it is in here. And it’s not absent by accident. The README has a section headed “what’s deliberately NOT in scope” that lists those exclusions out loud. The boundary is written down, because the boundary is the design.

Why the line is exactly there

The reason the line sits where it does is the most useful idea in the module.

Everything bootstrap excludes belongs in a separate stack, applied through the automation role bootstrap creates. Bootstrap’s only job is to get the account to the point where the next tofu apply can run properly: somewhere to store state, and an identity to run as. Once those two things exist, hardening the account isn’t a special bootstrapping act. It’s just another apply, done the normal way: in CI, reviewed, versioned, deployed through the role.

So the account baseline doesn’t need to be bundled into the bootstrap. It needs to be downstream of it. Bootstrap builds the on-ramp; it doesn’t also have to be the motorway.

A narrow module stays re-runnable

There’s a practical payoff to the narrowness, and it’s about fear.

Bootstrap is the one stack that can’t be applied through CI, because it’s what creates the CI identity in the first place. It runs locally, by a human, rarely. That’s exactly the kind of operation you want to be small, boring, and safe to repeat.

A bootstrap module that also did account hardening would be a large, stateful thing managing dozens of resources. Re-running it would be a held-breath operation. Keeping it to three concerns keeps it the opposite: a small stack you can read top to bottom, re-run without anxiety, and reason about completely. The narrowness isn’t minimalism for its own sake. It’s what keeps the one human-applied stack trustworthy.

The boundary is the feature

It’s tempting to judge a module by how much it does. A bootstrap module is the case where that’s exactly backwards. Its value is in how cleanly it stops.

terraform-aws-bootstrap does the bare minimum to make an account ready for the next apply, writes down everything it refuses to do, and hands off to a downstream stack for all of it. The next post follows the trickiest of its three jobs: the state backend has a genuine chicken-and-egg problem, because it has to store OpenTofu state in a bucket OpenTofu hasn’t created yet.

Where this leaves us

A fresh AWS account needs a long list of things before it’s safe, and the obvious move is one big module that does the lot. terraform-aws-bootstrap deliberately does only three: a state backend, a CI identity, and an account-scrub config. Everything else is written down as out of scope.

The boundary is the design. The excluded work belongs in a downstream stack applied through the CI role bootstrap creates, so hardening is just a normal reviewed apply rather than a bootstrapping special case. And keeping the one human-run, locally-applied stack small is what keeps it safe to re-run. A bootstrap module is judged by where it stops.