CI/CD on PHP Boy Scout

Two bugs that taught me the rules

Wed, 20 May 2026 00:00:00 +0000

Some bugs are interesting because they’re subtle. These two were interesting because they were the exact opposite… in each case the tool had a hard rule I simply didn’t know about, and its error message couldn’t be bothered to tell me what that rule was. Both came out of building the infrastructure toolchain, both cost me a good deal more time than they had any right to, and both are the sort of thing that looks blindingly obvious the moment you know it and utterly baffling until you do.

So here they are, written down, partly to save you the bother and partly so I don’t go and forget them myself.

Bug one: the rule-less job that skips your merge requests

The cicd gate components, in their first cut, shipped with no rules: block. They were dead simple jobs: lint, scan, validate. No conditions, because they should just always run. Obviously.

They ran on branch pipelines. On merge requests, they didn’t run at all! The gates that were the entire point of the components were simply absent from the one place you’d most want to see them… the merge request.

The cause is a GitLab CI rule that’s remarkably easy to go years without ever learning: a job with no rules: block runs only on branch and tag pipelines. It does not run on merge-request pipelines. So “no conditions” doesn’t mean “runs everywhere” at all. It means “runs everywhere except a merge request”, which is about the least intuitive default I can think of.

The fix is faintly absurd, and that’s exactly what makes it stick. You add an unconditional rule: rules: [{ when: on_success }]. The content of that rule does precisely nothing. It always matches. What actually matters is that the job now has a rules: block at all, because merely having one is what makes a job eligible for merge-request pipelines. A rule whose content is meaningless, added solely so the block exists. That’s the fix. I’ll admit I stared at it for a moment.
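
As a minimal sketch of the before and after, assuming a made-up job name and script rather than the real component's contents:

```yaml
# Illustrative jobs only; the names and the script are assumptions.

# Before: no rules: block, so the job joins branch and tag pipelines only.
lint-without-rules:
  script:
    - tflint

# After: one unconditional rule. It always matches, and merely having a
# rules: block makes the job eligible for merge-request pipelines as well.
lint-with-rules:
  script:
    - tflint
  rules:
    - when: on_success
```

Whether a merge-request pipeline gets created at all still depends on the rest of the configuration, workflow: rules in particular, so treat this as the shape of the fix rather than a complete pipeline.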

Bug two: the import block that only works at the root

The second one came from terraform-aws-security-baseline. The account-hardening module needed to adopt a resource that already existed in the account, which is exactly what OpenTofu’s import {} block is for. So an import block went into the account-hardening module, right next to the resource it was adopting. The natural home for it, surely.

OpenTofu disagreed, and rejected it outright. The rule: an import block is only allowed in the root module. It can’t live inside a child module. A module that wants one of its own resources imported can’t declare that import itself… the import has to be declared up at the root, and the root caller does the adopting.

The fix was to take the import block out of the module and document caller-side adoption instead. The module describes the resource, and the root configuration that calls the module is where the import actually lives.

The shape they share

Two unrelated bugs, in two completely different tools, and the same shape sitting underneath both of them.

In each case the tool has a hard structural rule. Where a block is allowed to live. What makes a job eligible for a particular kind of pipeline. And in each case the error told me the tool was unhappy without telling me which rule I’d broken, so the obvious next move (debugging my own logic) was the wrong move entirely. There was nothing wrong with the logic. The thing was simply in a place the tool doesn’t allow, or missing a block the tool quietly insists on.

The lasting lesson here isn’t the two specific rules, useful as they are to know. It’s the reflex. When something that should obviously work just doesn’t, and the error is unhelpful, stop debugging your logic and start suspecting a structural rule about where something is allowed to be, or whether a thing is eligible in the first place. GitLab CI and OpenTofu both have a handful of these, and you mostly learn them the hard way, by tripping over them. Knowing the shape of the category at least means the next one costs you an hour instead of a whole afternoon.

Worth remembering

Two bugs from building the toolchain, one shape. A GitLab CI job with no rules: block runs on branches and tags but silently not on merge requests, and the fix is an unconditional rules: block whose content does nothing and whose mere existence is the entire point. An OpenTofu import block gets rejected inside a child module, because imports are only legal at the root, so the caller adopts and the module just describes.

Neither error named the rule it was enforcing, and that’s the category to watch for. When sound logic fails against an unhelpful error, suspect a structural rule about where a thing may live or whether it’s even eligible… not a bug in what you actually wrote. It’ll save you an afternoon. It certainly cost me a couple.

Reviewed, then applied

Mon, 18 May 2026 00:00:00 +0000

The genuinely dangerous moment in infrastructure-as-code isn’t the apply. It’s the gap between the plan a human read and approved, and the change that actually runs a moment later. If those two are different computations (and by default they are) then nobody really reviewed the thing that touched your account. The infra repo closes that gap from both ends.

The gap between “reviewed” and “ran”

Here’s the moment in infrastructure-as-code where things go wrong.

Someone opens a merge request. CI runs tofu plan and the output is there to review: these three resources change, this one is destroyed. A human reads it, decides it’s correct, approves, merges. Then apply runs.

The trap is in what apply actually applies. If apply does its own fresh tofu plan and then applies that, the change that runs is not necessarily the change that was reviewed. State can have moved. A provider can have drifted. Someone else can have applied something in between. The reviewed plan and the applied change are two separate computations done at two different moments, and every difference between those moments is a change nobody looked at.

infra closes that gap from both ends.

Plan as an artifact

The first end is making the reviewed plan and the applied plan the same object.

The tofu-plan component runs the plan and saves it. It writes tfplan.cache, OpenTofu’s binary plan file, as a CI artifact. It also writes tfplan.json, which GitLab renders as a plan widget right in the merge request: the add, change and destroy summary, there to review without leaving the MR.

The tofu-apply component then does not re-plan. It applies that saved tfplan.cache. And OpenTofu itself enforces the safety net: applying a stale plan file, one captured against a state that has since moved, is rejected by the tool. So what reaches the account is provably the plan that was reviewed, or it’s nothing at all. There’s no third option where something unreviewed slips through.
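
A rough sketch of that hand-off, not the components' actual source. The job names are assumptions, both jobs sit in one pipeline purely for brevity, and producing tfplan.json for the widget is left out:

```yaml
# Sketch only: job names and the single-pipeline layout are assumptions.
plan:
  script:
    - tofu init
    # Save the binary plan so apply can reuse exactly this computation.
    - tofu plan -out=tfplan.cache
  artifacts:
    paths:
      - tfplan.cache

apply:
  needs:
    - plan
  script:
    - tofu init
    # No re-plan: apply the saved file. OpenTofu refuses it if the plan is stale.
    - tofu apply tfplan.cache
```

The point to notice is what the apply job doesn't contain. There’s no second tofu plan, so there’s no second computation to differ from the first.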

Applying is a human decision

The second end is when apply runs.

infra is trunk-based: it dropped the develop branch and works on main. But a naive trunk setup auto-applies every push to main, which means there’s no human gate at all, just whatever the last merge happened to contain.

So the gate is built explicitly. releaser-pleaser keeps a release merge request open against main. Ordinary merges to main run plans but apply nothing. The apply happens only when a person merges the release MR. Merging it cuts a release tag, and the tag pipeline is what runs tofu-apply, against the plan banked by the latest main pipeline.

The effect is that the act of applying to the account is the deliberate, visible act of merging the release request. Nothing reaches the account because a commit landed. It reaches the account because a person decided a release should go out and merged it. (Which, after the accidental v2.0.0 that kicked off the whole GitLab move, is a discipline I’d freshly relearned the value of.)
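
In rules: terms, the gate might look something like this. It’s a sketch under assumptions: the job body is illustrative, and fetching the plan banked by the latest main pipeline is elided entirely.

```yaml
# Sketch only: assumes tfplan.cache has already been fetched from the
# latest main pipeline, which is not shown here.
tofu-apply:
  rules:
    # Tag pipelines only. Merging the release MR is what creates the tag.
    - if: $CI_COMMIT_TAG
  script:
    - tofu init
    - tofu apply tfplan.cache
```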

The guard on the gate

There’s one more piece, because a gate is only as good as its precondition.

A verify-main-plan job blocks the release MR from being mergeable unless the latest main pipeline is green. You can’t cut a release, and therefore can’t apply, on top of a main whose plan didn’t even succeed. The human gate has its own gate: the thing you’re about to merge has to be standing on a known-good plan before you’re allowed to merge it.

The bottom line

The risk in infrastructure-as-code is the gap between the plan a human reviewed and the change that runs, because a re-plan at apply time is a different computation from the one that was approved.

infra closes it twice over. tofu-plan saves the plan as a tfplan.cache artifact and renders it as a merge-request widget; tofu-apply applies that exact artifact, and OpenTofu rejects it outright if the state has moved underneath it. And applying is gated on a human merging a releaser-pleaser release request, not on a push, with a verify-main-plan check making sure that request can only be merged on top of a green plan. What gets applied is what was reviewed, when a person decided it should be.

CI you include, not copy

Sat, 16 May 2026 00:00:00 +0000

Every infrastructure repo runs the same CI: lint the OpenTofu, scan it, validate it, plan, apply. The first repo, you write that .gitlab-ci.yml by hand. The second, you copy it. By the third, you’ve got three copies of the same pipeline quietly drifting apart, which is the exact problem you’d never tolerate in application code. The cicd repo is the fix, and it’s just the library-first instinct pointed at the pipeline.

The .gitlab-ci.yml you keep copying

The infrastructure repos in this series all run the same CI gate jobs: format and validate the OpenTofu, lint it, scan it for security issues and secrets, and on the deploy side, plan and apply.

The first repo, you write that .gitlab-ci.yml by hand. The second repo needs the same jobs, so you copy it. The third repo, you copy it again. Now there are three copies of the same pipeline, and they do what copies always do. They drift. A fix you make in one repo’s CI doesn’t reach the other two. A tightened scan rule lands in the repo you were working in and nowhere else. It’s the copy-paste problem, exactly as it shows up in application code, just written in YAML and therefore that bit easier to pretend isn’t code.

GitLab has a feature for exactly this

GitLab CI/CD Components are the answer to that problem. A component is a reusable, versioned piece of pipeline that you publish, and other projects pull in with an include: pinned to a version:

include:
 - component: gitlab.com/phpboyscout/cicd/tofu-lint@v0.5.0

That’s a library import, but for pipeline code. The component has a defined interface, a version, and a home in GitLab’s CI/CD Catalog. A consuming repo includes it instead of carrying its own copy, and when the component improves, the consumer moves a version pin rather than re-copying YAML.

Why a monorepo of components

The cicd repo holds all of the components together: tofu-lint, tofu-security, tofu-validate, tofu-plan, tofu-apply, and more. One project, not one project per component.

That’s a deliberate call, and the reason is how GitLab versions things. A version is a tag, and a tag belongs to a project. A component’s version is its project’s tag. So a monorepo of components, versioned together as one tag stream, is the natural unit: a consumer pins @v0.5.0 and gets a known-good set of components that were tested together, rather than juggling a separate version for each one.

Authoring discipline

A component is a file under templates/, and it opens with a spec: inputs: block: the typed inputs, their defaults, the component’s public interface.

The discipline that keeps the library usable is that a component must be consumer-agnostic. It never hardcodes a token, and it never names a particular consumer’s variable. Inputs have sensible defaults, and a consuming repo overrides them. A component that reaches out and assumes something about the repo including it is a component that works in one repo and surprises the next. An authoring guide in the repo keeps that consistent across everyone who adds a component.
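
To make that concrete, here’s a guess at the shape of such a file. The input names, defaults and script are hypothetical, not the real tofu-lint component:

```yaml
# templates/tofu-lint.yml (hypothetical contents)
spec:
  inputs:
    stage:
      type: string
      default: test
    working_directory:
      type: string
      default: "."
---
tofu-lint:
  stage: $[[ inputs.stage ]]
  script:
    # Nothing here names a consumer's token or variable; it all comes from inputs.
    - cd "$[[ inputs.working_directory ]]"
    - tflint
```

A consumer would then override only what it needs, with envs/prod standing in for whatever path that repo actually uses:

```yaml
include:
  - component: gitlab.com/phpboyscout/cicd/tofu-lint@v0.5.0
    inputs:
      working_directory: envs/prod
```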

The self-test you cannot fully write

The cicd repo tests its own components with a self-test pipeline. It’s worth knowing where that self-test stops.

When a repo tests its own components by running them in child pipelines, GitLab masks $CI_PIPELINE_SOURCE as parent_pipeline. A component’s rules:, which often branch on the pipeline source to behave differently for a merge request than for a branch or a tag, therefore can’t be exercised honestly by the self-test: the source they’d branch on has been flattened. The self-test covers what it can, and the component rules: are, in the end, validated by real consumers using them for real. That’s a genuine limit, and naming it is better than pretending the self-test proves more than it does. (It’s also, not coincidentally, the exact rules: quirk that bit me in one of the two bugs I closed the series with.)
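
A small illustration of why, using a made-up job rather than one of the real components:

```yaml
# Illustrative job only; not a real component.
example-gate:
  script:
    - echo "runs for merge requests and tags"
  rules:
    # True in a consumer's merge-request pipeline.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # True in a consumer's tag pipeline.
    - if: $CI_COMMIT_TAG
# Inside the self-test's child pipeline $CI_PIPELINE_SOURCE is "parent_pipeline",
# so the first rule can't be shown to behave as it would for a consumer.
```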

The same instinct, again

This blog keeps circling the same instinct. go-tool-base exists because the same CLI scaffolding kept getting rewritten, so it was extracted into a library. cicd is that instinct pointed at the pipeline: the same gate jobs kept getting copied between repos, so they were extracted into a versioned, included library.

Stop copy-pasting. Publish, version, include. It’s true for CLI code, and it turns out to be just as true for the YAML that builds and ships it.

The gist

Every infrastructure repo needs the same CI, and copying the .gitlab-ci.yml between them produces copies that drift apart. GitLab CI/CD Components fix it: reusable, versioned pipeline that a repo include:s and pins, instead of carrying its own copy.

cicd is a monorepo of those components, versioned together as one tag stream, because GitLab tags a project and a component’s version is its project’s tag. Components are authored consumer-agnostic, with typed spec: inputs: and no hardcoded assumptions, and their rules: are validated by real use because the self-test can’t see the pipeline source. It’s the library-first instinct, applied to CI: publish it once, include it everywhere, fix it in one place.

One image for the whole toolchain

Fri, 15 May 2026 00:00:00 +0000

Every CI gate job across the infrastructure repos reaches for the same pile of tools: OpenTofu, tflint, trivy, checkov, gitleaks, terraform-docs, the AWS CLI. Installing that pile per job is both slow and quietly dangerous, because nothing pins it consistently. infra-tools is the obvious fix (one image, one source of truth for versions), but two of its build decisions are less obvious and worth a look: it publishes with crane instead of a second build, and it deliberately lets its own vulnerability scan fail.

The same pile of tools, in every repo

Every infrastructure repo in this series runs the same CI gate jobs: format and validate the OpenTofu, lint it, scan it for security problems and secrets, check the docs. Those jobs need a specific set of tools, and it’s the same set in every repo.

Install them per job and you pay twice. You pay in time, because every pipeline downloads and installs the whole set again. And you pay in drift, because unless every repo pins every tool identically, the repos slowly diverge on which version of trivy or tflint they actually run, and a check that passes in one repo fails in another for no reason anyone can see.

One image, one source of truth

infra-tools is the answer: a single Debian-based container image with the whole toolchain baked in. Every CI job in every repo uses it with one image: line.

The real value isn’t the convenience. It’s that the image is the one place tool versions are pinned. The Go-based tools are pinned in a mise.toml. checkov, which has no mise plugin, is pinned in a requirements file installed with pipx. The AWS CLI is pinned by a build argument. Three mechanisms, because the tools come from three kinds of source, but one image, and every pin wired to Renovate so a version bump arrives as a reviewable pull request. There’s exactly one answer to “what version of trivy does the toolchain use”, and it lives here.

Publishing with crane, not a second build

A build-pipeline detail that took a real bug to discover.

The pipeline builds the image with kaniko, which builds images without a privileged Docker daemon, something that matters a great deal on shared CI runners. Then it scans the image, then it publishes it.

The obvious way to write the publish stage is “build the image and push it”. But kaniko has no mode for “just push this tarball I already built”. A second kaniko invocation re-executes the entire Dockerfile from the top, including a second mise install, which makes a fresh round of calls to GitHub’s API to fetch tools. GitHub’s anonymous API limit is low and shared by IP, so on a CI runner that second install reliably trips a 403 rate-limit. (Yes, another 403. They do get everywhere.)

So the publish stage doesn’t rebuild. It uses crane to push the exact image tarball the build stage already produced. The image is built once. And because the published bytes are the same bytes the scan stage scanned, there’s no gap between “the image we checked” and “the image we shipped”.
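
Sketched as pipeline YAML, the build-once, push-the-tarball arrangement looks roughly like this. The job images, the image.tar path, the tag and the use of GitLab’s own registry variables are all assumptions on my part, not a copy of the infra-tools pipeline:

```yaml
# Sketch only: images, tar path, tag and registry are assumptions.
stages:
  - build
  - scan
  - publish

build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # Build once, write a tarball, push nothing.
    - >-
      /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --no-push
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
      --tar-path image.tar
  artifacts:
    paths:
      - image.tar

publish:
  stage: publish
  image:
    name: gcr.io/go-containerregistry/crane:debug
    entrypoint: [""]
  script:
    - crane auth login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # Push the exact tarball the build stage produced; no Dockerfile is re-run.
    - crane push image.tar "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```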

Soft-failing the scanner on purpose

The decision that looks wrong until you see the reasoning: the pipeline scans the image with trivy, and trivy is allowed to fail without failing the pipeline.

A vulnerability scanner that doesn’t gate the build sounds like a scanner switched off. It isn’t. It’s a scanner pointed at something it can’t helpfully gate.

The tools in the image are prebuilt Go binaries. trivy inspects them, reads the version of the Go runtime each was compiled with, and reports every known CVE in that Go runtime. Those findings are real, but they aren’t mine to fix. The only fix is the upstream tool rebuilding itself against a patched Go. With seven such tools in the image, at any given moment one of them is usually a little behind on its Go version.

A hard gate would mean the image becomes unpublishable whenever any single upstream lags, over a CVE in code I don’t own and can’t patch. That’s not a security control; it’s a way to be unable to ship. So the scan is allow_failure. The findings stay fully visible, and the residual count is genuinely useful as a metric for how far behind upstream the toolchain has drifted. It just doesn’t block shipping an image whose only “vulnerabilities” are other people’s build timelines.
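
As a sketch, assuming trivy is already on the job’s PATH and the build stage left an image.tar artifact behind (both assumptions, as is the job name):

```yaml
# Sketch only: job name and tar path are assumptions.
scan:
  script:
    # --exit-code 1 makes trivy fail the job when it finds anything...
    - trivy image --input image.tar --exit-code 1
  # ...and allow_failure keeps that failure visible without blocking the pipeline.
  allow_failure: true
```

The job still goes amber when trivy finds something, which is the point: the signal stays, the veto goes.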

What it comes down to

The infrastructure repos all run the same CI gate jobs, needing the same tools, so infra-tools bakes the whole toolchain into one image and pins every version in one place, wired to Renovate.

Two build choices are worth copying. The publish stage uses crane to push the already-built, already-scanned tarball, because a second kaniko build would re-run mise install and hit GitHub’s anonymous rate limit, and because pushing the scanned bytes means shipping exactly what was checked. And the trivy scan is deliberately allow_failure, because it reports Go-runtime CVEs in prebuilt upstream binaries that no change to this repo can fix, so a hard gate would only make the image unshippable over someone else’s lag.

A 403 you can’t fix in IAM

Thu, 14 May 2026 00:00:00 +0000

The OIDC post explained the handshake that lets a GitLab pipeline deploy to AWS with no stored key. This is the story of the first time I got it wrong, and spent an afternoon fixing the wrong thing. The error was a flat 403 from AWS, and the maddening part is that no amount of editing the IAM policy was ever going to fix it.

A 403 on the first real run

The OIDC post covered the handshake: GitLab CI mints a signed token, AWS exchanges it for short-lived credentials against a role whose trust policy names the pipeline. During the GitLab migration I wired exactly that up for the infra repo, including a trust policy condition meant to let merge-request pipelines run a plan.

The first merge request that should have produced a plan didn’t. The tofu-plan job failed, and the error from AWS was a flat AccessDenied. A 403.

The instinct, and why it wastes an afternoon

The instinct on an IAM 403 is immediate and almost always right: the policy’s wrong, so go and edit the policy. Tighten the condition. Loosen the condition. Check the wildcard. Re-read the sub pattern character by character.

All of that was wasted, and it was wasted for a reason that took me far too long to see. The trust policy wasn’t matching the wrong value. It was matching a value that does not exist. No amount of editing a condition makes it match a thing that’s never present.

What is actually in the token

GitLab’s OIDC token has a sub claim that encodes the pipeline’s context, and part of that encoding is a ref_type. I’d assumed ref_type could be branch, tag, or mr, because a pipeline can certainly be a branch pipeline, a tag pipeline, or a merge-request pipeline. So the trust policy, for the plan job, matched a sub containing ref_type:mr.

That assumption was wrong. GitLab’s ref_type is branch or tag. That’s the entire set. There is no mr.

A merge-request pipeline doesn’t run against a merge-request ref. It runs against the source branch. So its token’s sub carries ref_type:branch, like any other branch pipeline. The trust policy condition asked for ref_type:mr, GitLab never puts mr in a token, the condition was therefore never true, and every merge-request pipeline got a 403. Forever, until the policy stopped asking for a claim that isn’t real.

The fix, and the lesson worth more than the fix

The fix is small once it’s visible: match ref_type:branch and narrow it down by branch name or project path instead. An afternoon of policy edits, and the actual change is one word.
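
For notation only, here’s the shape of that one-word change, with the trust policy’s condition rendered as YAML rather than the JSON IAM actually stores. The project path is a placeholder, and the gitlab.com:sub key assumes the identity provider is registered for gitlab.com:

```yaml
# Notation only: IAM stores this as JSON; my-group/my-project is a placeholder.

# Before: can never match, because GitLab never issues ref_type:mr.
before:
  Condition:
    StringLike:
      "gitlab.com:sub": "project_path:my-group/my-project:ref_type:mr:ref:*"

# After: matches what a merge-request pipeline's token actually carries.
after:
  Condition:
    StringLike:
      "gitlab.com:sub": "project_path:my-group/my-project:ref_type:branch:ref:*"
```

The trailing wildcard lets any branch of that project through, so narrowing it down means swapping it for a branch name or a tighter pattern.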

The lesson is the part worth keeping. When an OIDC trust fails, the useful question is never “is my policy clever enough”. It’s “what’s actually in the token”. An OIDC trust policy can only ever match the claims the identity provider genuinely asserts, and the gap between what a provider asserts and what you assumed it asserts is precisely where this class of bug lives.

So the move, when an OIDC handshake 403s, is to get hold of a real token and decode it. Look at the actual sub, the actual claims, the actual values. Match what’s there. A 403 that survives every sensible edit to the policy is usually not a policy that’s too loose or too strict. It’s a policy matching a claim that was never going to be in the token.

The habit it left behind

I wired an OIDC trust policy to let merge-request pipelines plan, by matching a sub claim with ref_type:mr. The first real merge request got a 403, and no edit to the policy fixed it, because GitLab’s ref_type is only ever branch or tag. A merge-request pipeline runs on a branch ref, so the mr value the policy demanded was never in any token.

The fix was one word. The habit it left behind is the valuable bit: when an OIDC trust fails, stop editing the policy and go and read a real token. A trust policy can only match what the provider actually asserts, and “what I assumed it asserts” is where the 403 was hiding the whole time. (If this shape of bug feels familiar by the end of the series, that’s not an accident: I come back to it with two more from exactly the same family.)

Why go-tool-base left GitHub for GitLab

Mon, 11 May 2026 00:00:00 +0000

A botched version bump made me stop and actually look at where go-tool-base lived, and I didn’t much like what I saw. GitHub had spent months quietly falling over, and when Mitchell Hashimoto (GitHub user #1299, no less) publicly walked Ghostty off the platform, it stopped feeling like just my problem. I’ve been a GitLab fan for years, so the move was less a leap and more an overdue nudge. This is the why, not the how.

It started with a wrong number

Every migration has a trigger, and mine was embarrassingly small. A commit landed on main carrying a BREAKING CHANGE: footer it didn’t really deserve. Semantic-release did exactly what it’s told to do with that footer: it cut a major version. go-tool-base lurched from the v1 line straight to v2.0.0, and a chain of things that keyed off the version went sideways with it.

It was fixable. It wasn’t a disaster. But it was the kind of small, stupid breakage that makes you stop and actually look at your setup instead of just patching it and moving on. And when I looked, the version bump wasn’t the thing that bothered me. It was everything around it.

The platform had been quietly failing

I’d been losing time to GitHub for months. Not dramatically. No single outage you’d write home about, just a steady drip of Actions queues that wouldn’t drain, pull requests that wouldn’t merge, the occasional morning where the thing simply wasn’t there. You absorb it. You re-run the job. You make a coffee and try again. You tell yourself it’s a blip.

The trouble with a steady drip is that you stop counting it. It becomes weather.

The canary left the mine

Then, in late April, Mitchell Hashimoto (co-founder of HashiCorp, creator of Vagrant, Terraform and the Ghostty terminal) published Ghostty Is Leaving GitHub, and The Register picked it up a day later under the headline “GitHub ‘no longer a place for serious work’”.

This is not a man with a casual relationship to GitHub. He’s, by his own account, user #1299, joined February 2008. He called it “the place that has made me the most happy”. And he still wrote this:

“This is no longer a place for serious work if it just blocks you out for hours per day, every day.”

The detail that landed hardest for me wasn’t a quote, it was a habit. He’d kept a journal for a month, marking an “X” on every day a GitHub outage had cost him working time. Almost every day had an X. Reading that, I realised I’d been having the same month. I’d just never been disciplined enough to write it down. He’d turned my vague “it’s been flaky lately” into a row of crosses on a calendar.

“I want to ship software and it doesn’t want me to ship software.”

When the person who’s been on the platform for eighteen years and loves it says that out loud, it stops being your private grumble. It’s the canary, and the canary has stopped singing.

Why GitLab, and not just “somewhere else”

Being annoyed at GitHub is a reason to leave. It is not, on its own, a reason to pick a destination. The destination has to be a positive choice.

For me GitLab was an easy one, because I’ve been a fan for years. Long enough, in fact, to have also been a reliable grumbler about their pricing tiers, which is how you know it’s a real relationship and not a honeymoon. What I’ve always rated is the model: GitLab treats source hosting, CI/CD, the package registry, releases and Pages as one integrated product, not a marketplace of bolted-on parts you assemble yourself.

That integration is the actual prize. On the old setup, “CI” meant a folder of separate GitHub Actions workflow files, each pinned, each its own little world. On GitLab it’s a single .gitlab-ci.yml pipeline with proper stages (lint, test, security, docs, release) and the release stage talks to the built-in package registry and Pages without me wiring up a single external credential. The CI job that builds the project can authenticate to the things the project needs because they’re the same platform.

There’s a second-order benefit too. A migration is a rare licence to fix things you’d never otherwise touch. Moving gave me the cover to reset go-tool-base’s versioning cleanly (back to a sensible v0.x line, the accidental v2.0.0 left behind as a cautionary tale) and to move the module path to its new home in one deliberate change rather than a thousand apologetic ones.

What I’m not going to claim

I’m not going to tell you GitHub is finished, or that GitLab never has a bad day, because it does, everyone does. This isn’t a teardown. GitHub gave go-tool-base a perfectly good home for its first year, and the archived mirror is still sitting there, read-only, pointing anyone who finds it at the new place.

What changed is simpler than a grand verdict. The friction crossed a line, someone I respect said the quiet part loudly enough that I couldn’t keep filing it under “weather”, and the place I’d have moved to anyway was sitting right there with a better model. Sometimes the prudent move and the move you secretly wanted turn out to be the same move, and you just need a wrong version number to give you permission.

Boiling it down

go-tool-base moved from GitHub to GitLab in May 2026. The proximate cause was a self-inflicted version-bump mess; the real cause was months of GitHub unreliability that I’d stopped consciously noticing until Mitchell Hashimoto’s very public departure named it for me. GitLab was a positive pick, not just an escape hatch: its integrated CI/CD, registry, releases and Pages are one product rather than a kit, and that integration is genuinely worth having. The migration also bought a clean versioning restart as a bonus.

If you’ve been absorbing a steady drip of friction and telling yourself it’s normal: try the calendar trick. Mark the X’s for a month. The page will tell you something you already half-know.

No access keys in CI

Fri, 08 May 2026 00:00:00 +0000

A long-lived AWS access key, sitting in a CI system, is just about the single credential I’d most like to be rid of. It’s powerful, it never expires unless someone remembers to rotate it (nobody remembers to rotate it), and it lives in one of the most attractive targets in the whole supply chain. For infrastructure that’s eventually going to hold a release-signing key, it’s exactly the wrong place to start. So the phpboyscout infrastructure has no AWS access key in CI at all. None.

The access key you don’t want

A CI pipeline that runs tofu apply against AWS needs AWS credentials. The traditional way to give it some is an IAM user with an access key pair, pasted into the CI system as a masked variable.

Look at what that key is. It’s long-lived: it works until someone remembers to rotate it, and rotating it is a chore, so mostly nobody does. It’s powerful: it can apply infrastructure, so it can do nearly anything. And it’s sitting in a CI system, which is one of the most attractive targets in your whole supply chain. You’ve taken your highest-value credential and stored a permanent copy of it in a place built for running automated jobs.

For infrastructure that’s going to hold a release-signing key, that’s precisely the wrong starting point. So the phpboyscout infrastructure has no AWS access key in CI at all. Not a well-guarded one. None.

Federation instead of a stored secret

The replacement is OIDC federation, and the shape of it is worth walking through, because it’s genuinely different from “a secret, but better”.

A modern CI platform can mint an OIDC token. GitLab does this with an id_tokens: block: at job time, GitLab issues a short-lived JSON Web Token, signed by GitLab, that asserts a set of facts. This is project X. This is pipeline Y. This is running on ref Z, of this type.

AWS can consume that. The sts:AssumeRoleWithWebIdentity call takes such a token and, if it satisfies an IAM role’s trust policy, returns short-lived AWS credentials for that role. The trust policy is where the control lives: it names GitLab as a trusted token issuer, and it constrains the token’s sub claim so that only the specific project, and the specific refs, you intend can assume the role.

Put it together: the pipeline asks GitLab for a token, hands it to AWS, and gets back credentials that last about an hour and are scoped to one role. Nothing long-lived is stored anywhere. The credential exists only for the job that needs it, and it can’t be stolen from a CI variable store, because it was never in one.

Two halves of one handshake

That handshake is built by two of the repos in this series, each owning one side.

terraform-aws-bootstrap builds the AWS half, in its automation-iam module: it registers GitLab as an OIDC identity provider in the account, and it creates the automation role with the trust policy that decides which pipelines may assume it.

The CI components build the consuming half: they declare the id_tokens: block that asks GitLab for the JWT, and then simply let the AWS provider’s own credential chain perform the exchange. The pipeline doesn’t call sts by hand. It presents the token; the SDK does the rest.
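
A minimal sketch of that consuming half, assuming a token name, audience, role ARN and file path that are all placeholders rather than what the real components use:

```yaml
# Sketch only: token name, audience, role ARN and file path are assumptions.
tofu-plan:
  id_tokens:
    GITLAB_OIDC_TOKEN:
      # Must match the audience the AWS identity provider was registered with.
      aud: https://gitlab.com
  variables:
    # The two variables the SDK's web-identity step looks for.
    AWS_ROLE_ARN: "arn:aws:iam::123456789012:role/automation"
    AWS_WEB_IDENTITY_TOKEN_FILE: /tmp/web-identity-token
  script:
    # Hand the JWT to the SDK by writing it where the variable points.
    - echo "$GITLAB_OIDC_TOKEN" > "$AWS_WEB_IDENTITY_TOKEN_FILE"
    - tofu init
    - tofu plan
```

Nothing in the script mentions sts. The provider finds the two variables, reads the token file, and makes the AssumeRoleWithWebIdentity call itself.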

The gotcha: don’t set a profile

There’s one quiet way to break this, and a stack can look completely correct while doing it.

The AWS SDK finds credentials by walking a chain of sources in order. The web-identity path, the one that uses the OIDC token, is one link in that chain. It triggers off two environment variables, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, which the CI sets up automatically.

But if the aws provider block has a hardcoded profile = "...", the SDK takes the profile link of the chain instead, and never reaches the web-identity link. A profile line is the sort of thing that ends up in a provider block from someone’s local development setup, where it’s exactly right. Committed and run in CI, it silently short-circuits the federation. The pipeline either fails to find credentials, or finds the wrong ones.

The rule is simple once you know it: the provider block that runs in CI must not name a profile. Leave the chain free to find the web identity. It’s the kind of bug that teaches you to be precise about which link of the credential chain you’re actually relying on.

The bottom line

Giving CI an AWS access key means storing your most powerful, longest-lived credential in one of your most exposed systems. OIDC federation removes it entirely. The CI platform mints a short-lived signed token, AWS exchanges it via AssumeRoleWithWebIdentity for hour-long credentials against a role whose trust policy names the exact pipeline, and nothing permanent is stored.

terraform-aws-bootstrap builds the AWS side, the identity provider and the trust policy; the CI components build the consuming side, the token request. The one trap is a hardcoded profile in the provider block, which short-circuits the SDK’s credential chain before it reaches the web-identity path. Get that right, and a pipeline deploys to AWS as a verifiable, short-lived identity, with no key to steal.
