Process isolation won't save you from the filesystem

I had a test that passed every single time I ran it on its own, and failed maybe one run in five when I ran the whole suite. The failure was always the same: the self-update test downloaded a release archive, went to extract it, and found the archive corrupt. Half-written. As if something had been scribbling in the file while the test was reading it. Something had.

The comfort I was leaning on

The self-update tests are heavier than a unit test wants to be. They stand up a fake release, download the artefact, verify its checksum, extract it, swap a binary. Real files, real I/O. So they’d been built to run as separate processes, not just separate threads, each one its own little world.

And I’d quietly filed that under “solved”. Separate processes don’t share an address space. One can’t reach into another’s memory and corrupt a value mid-read. That whole category of data race, the kind you reach for a mutex to fix, simply can’t happen across a process boundary. So I’d stopped thinking about concurrency in these tests at all, because I’d convinced myself the isolation was total.

It wasn’t total. It was isolation of memory, and I’d let myself hear it as isolation of everything.

Two processes, one path

The thing two processes very much do still share is the filesystem. And the self-update flow, sensibly, caches its download rather than re-fetching it. The default cache directory is computed from the tool’s name and the release version, in crates/rtb-update/src/flow.rs:

pub fn cache_dir_for(tool_name: &str, version: &str) -> PathBuf {
    let base = directories::ProjectDirs::from("", "", tool_name)
        .map_or_else(std::env::temp_dir, |p| p.cache_dir().to_path_buf());
    base.join("update").join(version)
}

Read that with two parallel test processes in mind. They’re testing the same tool, against the same fake release tag. So tool_name matches and version matches, which means cache_dir_for hands both of them the identical path. Two processes, isolated in every way that involves memory, both downloading and extracting into one shared directory on disk, at the same time. One writes the archive while the other is partway through reading it, and you get exactly the corrupt half-written file the test kept tripping over.

Process isolation did nothing here, because the contention was never in memory. It was on a path string that came out the same for both of them.
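
To make the collision concrete, here is a minimal sketch of the same idea using only the standard library. It is a simplified stand-in, not the real function: the base directory is passed in as a parameter instead of coming from directories-rs, the layout under it is assumed, and the tool name "mytool" and tag "v1.2.3" are made up for illustration.

```rust
use std::path::{Path, PathBuf};

/// Simplified stand-in for `cache_dir_for`. The base directory is an
/// argument here (an assumption for this sketch) so no external crate
/// is needed; the real function asks the platform for it.
fn cache_dir_for(base: &Path, tool_name: &str, version: &str) -> PathBuf {
    base.join(tool_name).join("update").join(version)
}

fn main() {
    let base = std::env::temp_dir();

    // Two test processes, same tool, same fake release tag...
    let first = cache_dir_for(&base, "mytool", "v1.2.3");
    let second = cache_dir_for(&base, "mytool", "v1.2.3");

    // ...compute the same directory. Nothing in memory is shared,
    // but the path is a pure function of its inputs.
    assert_eq!(first, second);

    // Only a different input yields a different directory.
    let other = cache_dir_for(&base, "mytool", "v1.2.4");
    assert_ne!(first, other);

    println!("both processes would use {}", first.display());
}
```

Nothing in there knows or cares which process is calling it, which is the whole problem.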

The fix is to stop sharing the path

Once it’s framed as “they share a path”, the fix writes itself: don’t share the path. Give each invocation its own cache directory. The updater builder already had the seam for it, and the doc comment now says exactly why it’s there, in crates/rtb-update/src/updater.rs:

/// Tools call this when they want isolation per-invocation
/// (e.g. CI runners, tests with parallel processes) or to honour
/// a user-supplied `--cache-dir` flag.
pub fn cache_dir(mut self, cache_dir: impl Into<PathBuf>) -> Self {
    self.cache_dir = Some(cache_dir.into());
    self
}

Each test now builds its updater with cache_dir(its_own_tempdir), so two parallel processes land on two different directories and never meet. No lock, no serialisation, no clever cross-process file mutex. Just the realisation that the shared thing was a directory, and the cure for shared mutable state is usually to stop sharing it, not to guard it.
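
Roughly what that looks like, as a sketch rather than the real test code: the `Updater` struct below is a pared-down hypothetical that models only the cache-directory seam, and `effective_cache_dir` and `unique_test_dir` are helper names invented for this example. The unique directory is built from the process id plus a counter so the sketch stays standard-library only; in a real suite the `tempfile` crate’s `TempDir` is the more usual choice, since it also cleans up after itself.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, pared-down updater: only the cache-dir override is modelled.
#[derive(Default)]
struct Updater {
    cache_dir: Option<PathBuf>,
}

impl Updater {
    /// Same shape as the builder method quoted above.
    fn cache_dir(mut self, cache_dir: impl Into<PathBuf>) -> Self {
        self.cache_dir = Some(cache_dir.into());
        self
    }

    /// Invented helper: the override wins, otherwise fall back to a
    /// shared default (hard-coded here purely for illustration).
    fn effective_cache_dir(&self) -> PathBuf {
        self.cache_dir.clone().unwrap_or_else(|| {
            std::env::temp_dir().join("mytool").join("update").join("v1.2.3")
        })
    }
}

/// Invented helper: create a directory unique to this process and this call.
fn unique_test_dir() -> std::io::Result<PathBuf> {
    static NEXT: AtomicU64 = AtomicU64::new(0);
    let n = NEXT.fetch_add(1, Ordering::Relaxed);
    let name = format!("updater-test-{}-{}", std::process::id(), n);
    let dir = std::env::temp_dir().join(name);
    std::fs::create_dir_all(&dir)?;
    Ok(dir)
}

fn main() -> std::io::Result<()> {
    let a = Updater::default().cache_dir(unique_test_dir()?);
    let b = Updater::default().cache_dir(unique_test_dir()?);

    // Two invocations, two directories: nothing left to race on.
    assert_ne!(a.effective_cache_dir(), b.effective_cache_dir());

    // Tidy up the directories this sketch created.
    std::fs::remove_dir_all(a.effective_cache_dir())?;
    std::fs::remove_dir_all(b.effective_cache_dir())?;
    Ok(())
}
```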

The fix that turned out to be a feature

The part I’m quietly pleased about is that this didn’t stay a test-only hack. The override I needed to isolate the tests is exactly the override a real tool wants for its own reasons. A CI runner doing self-update wants a writable cache path it controls, not wherever directories-rs decides the system cache lives. A user might reasonably want to point the whole thing somewhere specific. That’s a --cache-dir flag, and cache_dir() is precisely the hook you’d wire it to.

So the thing I added to stop a flaky test is the same thing a downstream tool reaches for to expose --cache-dir. The test forced the seam to exist, and the seam was worth having anyway. I’ll take that trade every time over a fix that only the test suite benefits from.
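
For the downstream side, a sketch of how a tool might pull a --cache-dir value out of its arguments before handing it to cache_dir(). This is an assumption about how a tool could do it, not code from the project: `cache_dir_flag` is an invented name, the parsing is hand-rolled to keep the example dependency-free, and a real CLI would more likely lean on its argument parser.

```rust
use std::path::PathBuf;

/// Invented helper: return the value of `--cache-dir <path>` or
/// `--cache-dir=<path>` if either form appears in the arguments.
fn cache_dir_flag(args: &[String]) -> Option<PathBuf> {
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if arg == "--cache-dir" {
            // Space-separated form: the value is the next argument, if any.
            return iter.next().map(PathBuf::from);
        }
        if let Some(value) = arg.strip_prefix("--cache-dir=") {
            return Some(PathBuf::from(value));
        }
    }
    None
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    match cache_dir_flag(&args) {
        // In a real tool, this is where the value would go to `cache_dir()`.
        Some(dir) => println!("cache dir override: {}", dir.display()),
        None => println!("no override, using the default cache dir"),
    }
}
```

When the flag is absent the function returns `None`, so the tool simply skips the builder call and the computed default stays in force.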

What it comes down to

I’d treated “separate processes” as a synonym for “can’t race”, and it isn’t. Processes don’t share memory, so the memory races are gone. They absolutely still share the filesystem, the network, every named resource the OS will hand to anyone who asks for it by the same name. My two test processes computed the same cache path from the same tool and tag, and raced on the files in it, and no amount of address-space isolation was ever going to touch that.

Shared mutable state on disk is still shared mutable state. The fix wasn’t a bigger hammer; it was giving each process its own directory and letting the isolation I thought I already had actually be true.
