Two telemetry events, one mangled line

A line in a log file that no parser would touch. Not a wrong value, not a missing field. Half of one telemetry event spliced into the middle of another, like two people typing into the same text box at once. Which, it turns out, is pretty much exactly what had happened.

A format with exactly one rule

rust-tool-base writes its telemetry to a file as JSONL: one JSON object per line, newline at the end, next object on the next line. It’s a lovely format to work with precisely because it has one rule, and the rule is simple. Every line is a complete object. Honour that and you can tail it, grep it, stream it into anything. Break it once and the whole file is suspect, because now a reader can’t trust that a line is a line.

So the one job the file sink has, beyond writing the right bytes, is to never let two events end up sharing a line.

“Appending is atomic, though, isn’t it?”

The mental model I started with, and I suspect I’m not alone, was this: open the file with O_APPEND, write the serialised event, and the operating system tacks it onto the end atomically. Two writers can’t tread on each other because each write goes to wherever the end currently is, no questions asked. I’d half-remembered O_APPEND as the thing that makes concurrent appending safe, full stop.

It’s half true, and the half that’s missing is the half that bit me.

O_APPEND does guarantee one thing: the seek-to-end and the write happen as a unit, so you never get the classic lost-update where two writers compute the same offset and clobber each other. Good. What it does not guarantee, on POSIX, is that a single write() of arbitrary size is atomic with respect to other writers. That atomicity has a ceiling, and the ceiling is PIPE_BUF: 4096 bytes on Linux. Under it, a write is all-or-nothing against other writes to the same file. Over it, the kernel is entirely within its rights to split your write into chunks, and another writer’s bytes can land in the gap between them.

The fat event that went over the edge

For a long time nothing went wrong, which is the most dangerous way for a bug like this to behave. A typical event, a command name, a duration, a status, an attribute or two, serialises to a few hundred bytes. Comfortably under four kilobytes, so comfortably inside the atomic window. Hundreds of them a day, never a problem.

Then an event came along with a lot of attributes on it, and its serialised form sailed past 4 KiB. Two of those emitted at roughly the same moment, both over the line, and O_APPEND did the only thing it had ever promised: it put each write at the end. It said nothing about not interleaving the bytes on the way, because past PIPE_BUF that was never on offer. One spliced line, one file a parser would now choke on.

The fix isn’t a bigger write, it’s a smaller gate

You can’t buy your way out of this with a bigger buffer, because there’s no buffer size that’s reliably atomic above PIPE_BUF. The fix is to stop relying on the kernel for mutual exclusion you can do yourself: serialise the events through a lock, so only one write is ever in flight at a time. The FileSink carries a mutex for exactly that, and the doc comment on it is the whole post in a paragraph, from crates/rtb-telemetry/src/sink.rs:

pub struct FileSink {
    path: PathBuf,
    // Serialises concurrent `emit` calls. Shared across `Clone`s of
    // the same `FileSink` so multiple handles to the same path also
    // serialise correctly.
    gate: Arc<tokio::sync::Mutex<()>>,
}

If you don’t write Rust day to day (the primer has the rest of the basics): tokio::sync::Mutex is an async-aware lock, .await is where a task waits its turn for that lock without blocking the whole thread, and the Arc wrapper is shared ownership. That Arc is the load-bearing bit: it means every clone of the FileSink points at the same gate, rather than each getting its own lock that guards nothing.

The detail I like is where the lock sits. The event is serialised to a string first, outside the critical section, because turning an event into JSON is the expensive part and there’s no reason to hold the gate while you do it. Only then does emit take the lock, and it holds it across the whole open-write-flush, so no other emit can interleave a single byte:

// Serialise the line outside the critical section.
let mut line = serde_json::to_string(&redacted)?;
line.push('\n');

let _guard = self.gate.lock().await;
// ...create parent dir, open with append(true)...
f.write_all(line.as_bytes()).await?;
f.flush().await?;

write_all makes sure the whole line goes out as one logical write from our side, and the gate makes sure ours is the only one happening. The 4 KiB cliff is still there in the kernel. We just never walk near it any more, because we’ve serialised the writers ourselves rather than hoping the OS would.

The bit even the lock can’t fix

There is however a genuine limit, and the comment is upfront about it. The mutex lives in the process. Two FileSinks in two different processes, both pointed at the same file, are back to relying on O_APPEND alone, and back under the 4 KiB ceiling. The lock can’t reach across a process boundary, so it doesn’t pretend to. The guidance there is the older, duller, correct one: give each process its own file and aggregate them somewhere else. Don’t have two processes fighting over one log file and expect the filesystem to referee.

What it comes down to

O_APPEND is a real guarantee, just a much smaller one than its name talks you into. It keeps your write at the end of the file, and it keeps concurrent writes from interleaving only while they stay under PIPE_BUF, which on Linux is 4096 bytes. A fat JSON event slides straight over that and takes your file’s one rule with it.

The fix was never exotic. Serialise the line, take a mutex, do the write under it, and the interleave can’t happen because there’s only ever one writer at a time. The POSIX manual had all of this written down long before I went and learned it the interesting way, which is, I’m told, how most people meet PIPE_BUF too.