okay, i have so much to talk about.
storage layer for kova is shipped. it's a from-scratch vector DB in rust, and the storage layer is the bit where every decision has to survive a kill -9 and come back the same. that constraint touches every file, every fsync, every byte format
vectors live in an mmap file. fixed stride (a tiny per-slot header plus dim*4 bytes for the data), and every slot self-describes its id and a present flag. no sidecar index file, no separate id->offset map. the cost is that open is O(N): you walk every slot to rebuild the in-memory id_to_slot. the win is there's no two-file atomicity problem: the data IS the index, can't drift from itself. for the workloads i care about (millions of vectors, opened rarely), that's a great trade
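here's roughly what that open-time walk looks like. it's a sketch, not kova's actual format: the 16-byte header (u64 id, one present byte, padding) and the names stride, write_slot and rebuild_id_to_slot are all assumptions, and a plain byte slice stands in for the mmap.

```rust
use std::collections::HashMap;

// hypothetical slot layout (an assumption, not the real on-disk format):
// [ id: u64 LE | present: u8 | 7 bytes padding | dim * 4 bytes of f32 LE ]
const HEADER_LEN: usize = 16;

/// Bytes per slot: fixed header plus the vector payload.
fn stride(dim: usize) -> usize {
    HEADER_LEN + dim * 4
}

/// Write one slot in place and mark it present.
fn write_slot(buf: &mut [u8], dim: usize, slot: usize, id: u64, vector: &[f32]) {
    assert_eq!(vector.len(), dim);
    let base = slot * stride(dim);
    buf[base..base + 8].copy_from_slice(&id.to_le_bytes());
    buf[base + 8] = 1; // present flag
    for (i, x) in vector.iter().enumerate() {
        let off = base + HEADER_LEN + i * 4;
        buf[off..off + 4].copy_from_slice(&x.to_le_bytes());
    }
}

/// Walk every slot once and rebuild id -> slot. O(N) in the number of slots.
fn rebuild_id_to_slot(buf: &[u8], dim: usize) -> HashMap<u64, usize> {
    let mut map = HashMap::new();
    for (slot, chunk) in buf.chunks_exact(stride(dim)).enumerate() {
        if chunk[8] == 1 {
            let mut id_bytes = [0u8; 8];
            id_bytes.copy_from_slice(&chunk[0..8]);
            map.insert(u64::from_le_bytes(id_bytes), slot);
        }
    }
    map
}
```

the map is derived entirely from the slots, which is the whole point: there's nothing else on disk for it to disagree with.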
file growth is doubling, and never shrinks. doubling so the cost of growing amortises to O(1) per insert instead of a remap on every write. never shrinks because truncating a file under a live mapping means the next touch of a page past the new end is SIGBUS, exactly the kind of bug i shipped once and found much later. deletes don't shrink the file either; freed slots go on a free-list and the next insert reuses them, so a delete-heavy workload stays compact without anyone running vacuum
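the allocation policy fits in a few lines. this is a sketch with made-up names (SlotAllocator, alloc, release); the real thing also has to extend the file and remap when capacity grows, which is only signalled here through a bool.

```rust
/// Hypothetical slot allocator: capacity doubles and never shrinks,
/// and freed slots are reused before the file is asked to grow.
struct SlotAllocator {
    capacity: usize,   // slots the file currently has room for
    next_fresh: usize, // first slot that has never been handed out
    free: Vec<usize>,  // slots released by deletes
}

impl SlotAllocator {
    fn new(initial_capacity: usize) -> Self {
        SlotAllocator {
            capacity: initial_capacity.max(1),
            next_fresh: 0,
            free: Vec::new(),
        }
    }

    /// Returns (slot, grew). `grew` means the caller must extend the file
    /// to the new capacity before writing to the slot.
    fn alloc(&mut self) -> (usize, bool) {
        // Reuse a freed slot first so deletes never force growth.
        if let Some(slot) = self.free.pop() {
            return (slot, false);
        }
        let mut grew = false;
        if self.next_fresh == self.capacity {
            self.capacity *= 2;
            grew = true;
        }
        let slot = self.next_fresh;
        self.next_fresh += 1;
        (slot, grew)
    }

    /// A delete hands its slot back; the file size is untouched.
    fn release(&mut self, slot: usize) {
        self.free.push(slot);
    }
}
```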
metadata is the opposite shape, so it gets opposite treatment. variable-size (open key-value bags), cold (only read for the final k candidates, not on every graph edge), small in aggregate. so the metadata store keeps the whole map in memory and persists via atomic_write (tmp -> fsync -> rename -> dirsync) on mutation. full-file snapshot, no mmap, no sidecar, no free-list. forcing mmap onto variable-size data would mean a separate id->(offset, length) sidecar plus a free-list for resize-on-update, which is a B-tree in disguise. different access patterns deserve different storage strategies; the same hammer for both is just dogma
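for reference, the atomic_write sequence can be sketched with nothing but std. assumptions: a unix-like system (opening a directory to fsync it doesn't work on windows), a fixed ".tmp" suffix, and a signature i made up for illustration; kova's own atomic_write may well differ.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Sketch of tmp -> fsync -> rename -> dirsync. After this returns Ok,
/// a reader sees either the old file or the new one, never a mix.
fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let dir = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    let mut tmp_name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?
        .to_os_string();
    tmp_name.push(".tmp"); // hypothetical naming scheme
    let tmp = dir.join(tmp_name);

    // 1. write the new contents to a temp file in the same directory
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    // 2. fsync the data before the name changes
    f.sync_all()?;
    drop(f);

    // 3. rename over the target; atomic within one filesystem on POSIX
    fs::rename(&tmp, path)?;
    // 4. fsync the directory so the rename itself survives a crash
    File::open(dir)?.sync_all()?;
    Ok(())
}
```

the temp file has to live in the same directory as the target: rename is only atomic within a single filesystem.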
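the WAL is segmented (64 MB rotation), with CRC32 + length framing per record, torn-tail recovery on the active segment, and multi-segment replay in LSN order. the segmentation makes truncate cheap: nothing gets rewritten, you just delete the superseded segment files, and the cost doesn't depend on how many records they hold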
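a sketch of the framing and the torn-tail scan. the exact frame here ([len u32 LE][crc u32 LE][payload]) is an assumption, the function names are made up, and the CRC is a slow bitwise CRC-32 (the common IEEE polynomial) so the example has no dependencies; a real WAL would use a table-driven or hardware implementation.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Append one framed record: length, checksum of the payload, payload.
fn encode_record(out: &mut Vec<u8>, payload: &[u8]) {
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Scan a segment and return the valid records plus the offset where the
/// valid prefix ends. Anything after that offset is a torn tail.
fn replay(segment: &[u8]) -> (Vec<&[u8]>, usize) {
    let mut records = Vec::new();
    let mut pos = 0;
    while segment.len() - pos >= 8 {
        let len = read_u32(segment, pos) as usize;
        let crc = read_u32(segment, pos + 4);
        // Assumption: empty payloads are never written, so a zero length
        // means zero-filled space rather than a record.
        if len == 0 {
            break;
        }
        let start = pos + 8;
        let end = match start.checked_add(len) {
            Some(e) if e <= segment.len() => e,
            _ => break, // payload runs past the end of the file
        };
        let payload = &segment[start..end];
        if crc32(payload) != crc {
            break; // corrupt or half-written record
        }
        records.push(payload);
        pos = end;
    }
    (records, pos)
}
```

on the active segment, recovery would truncate the file to the returned offset and keep appending from there.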
all of that gets composed by Shard under strict log-then-mutate. every insert moves through three phases: validate, commit (wal.append -> wal.sync), apply (index + metadata). phase 3 failures panic. once the record is synced the WAL is truth; returning an error to the caller at that point would be a lie, and the retry it invites is how the same record ends up in the log twice. postgres ships this. rocksdb ships this
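the three phases, sketched with in-memory stand-ins. the Wal and Shard below are toys i made up for illustration (a Vec for the log, a HashMap for the index, no metadata store, infallible append/sync); only the ordering is the point.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum InsertError {
    WrongDim { expected: usize, got: usize },
    DuplicateId(u64),
}

/// Toy WAL: LSNs start at 1, and "sync" just records the durable LSN.
struct Wal {
    records: Vec<(u64, Vec<f32>)>,
    durable_lsn: u64,
}

impl Wal {
    fn append(&mut self, id: u64, v: &[f32]) -> u64 {
        self.records.push((id, v.to_vec()));
        self.records.len() as u64
    }
    fn sync(&mut self) {
        self.durable_lsn = self.records.len() as u64;
    }
}

struct Shard {
    dim: usize,
    wal: Wal,
    index: HashMap<u64, Vec<f32>>,
}

impl Shard {
    fn new(dim: usize) -> Self {
        Shard {
            dim,
            wal: Wal { records: Vec::new(), durable_lsn: 0 },
            index: HashMap::new(),
        }
    }

    fn insert(&mut self, id: u64, v: &[f32]) -> Result<u64, InsertError> {
        // phase 1: validate. every rejectable condition is checked here,
        // before anything touches the log.
        if v.len() != self.dim {
            return Err(InsertError::WrongDim { expected: self.dim, got: v.len() });
        }
        if self.index.contains_key(&id) {
            return Err(InsertError::DuplicateId(id));
        }
        // phase 2: commit. once sync returns, the insert has happened.
        let lsn = self.wal.append(id, v);
        self.wal.sync();
        // phase 3: apply. a failure here is a bug, not a caller error,
        // so it panics instead of returning Err.
        let prev = self.index.insert(id, v.to_vec());
        assert!(prev.is_none(), "apply diverged from WAL at lsn {}", lsn);
        Ok(lsn)
    }
}
```

anything that can legitimately fail belongs in phase 1, so that a rejected insert never leaves a record behind.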
the real centerpiece is Shard::checkpoint. on paper it's "snapshot the index, truncate the WAL." in practice it's six phases that have to commit atomically across multiple files: vacuum tombstones, fsync the WAL to capture a durable LSN, stream the HNSW graph to a snapshot file, write the manifest, truncate the WAL, delete the old snapshot. anywhere in those six phases the process can die, and every kill window has to have a defined answer
the whole thing rests on this: the manifest is the only commit point. a tiny atomic_write of a few bytes, names which snapshot generation is live. everything before it is staging. everything after it is best-effort cleanup. kill before the manifest commits, reopen sees the old world and replays the full WAL. kill after, reopen sees the new world and replays WAL only past the checkpoint LSN
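the reopen side of that rule is small enough to sketch. the Manifest fields and the RecoveryPlan type are assumptions for illustration (the real manifest format isn't shown here), and LSNs are assumed to start at 1.

```rust
/// Hypothetical manifest contents: which generation is live, and the
/// durable LSN captured when that snapshot was taken.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Manifest {
    generation: u64,
    checkpoint_lsn: u64,
}

#[derive(Debug, PartialEq)]
struct RecoveryPlan {
    snapshot: Option<String>, // file to load, if any checkpoint ever committed
    replay_after_lsn: u64,    // replay only WAL records with a larger LSN
}

impl RecoveryPlan {
    fn should_replay(&self, record_lsn: u64) -> bool {
        record_lsn > self.replay_after_lsn
    }
}

fn snapshot_name(generation: u64) -> String {
    format!("graph.{}.snapshot", generation)
}

/// Recovery looks only at the manifest. Staged files that the manifest
/// does not name are ignored, however complete they look.
fn plan_recovery(manifest: Option<Manifest>) -> RecoveryPlan {
    match manifest {
        // no checkpoint ever committed: empty index, replay the whole WAL
        None => RecoveryPlan { snapshot: None, replay_after_lsn: 0 },
        Some(m) => RecoveryPlan {
            snapshot: Some(snapshot_name(m.generation)),
            replay_after_lsn: m.checkpoint_lsn,
        },
    }
}
```

a kill before the manifest write leaves the old manifest on disk, so the same function hands back the old plan; there's no special crash branch to get wrong.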
the snapshot itself is also atomic_write'd (streaming variant, so we don't buffer the whole graph in memory before the rename), but you can't commit snapshot + manifest as one operation: POSIX has no atomic_rename_n. so the problem splits naturally. write the big data file atomically and fsync it. then write the tiny pointer file atomically. the manifest exists precisely so that multi-file durability has a single commit point above the data, instead of trying to make the data itself self-committing. and keeping them separate means the checkpoint pays for one snapshot rewrite per checkpoint, not per manifest update; the pointer stays cheap to update even when the thing it points at is huge
and snapshots are named graph.{N}.snapshot, not graph.snapshot, for the same reason. a single overwrite has a half-written window where the file is partial but the manifest still points at it. generation numbers let the old file stay valid right up until the manifest atomically swaps the pointer. same shape as postgres rewriting a table into a fresh relfilenode and only swapping the catalog pointer at commit. names are cheap, ordering is not
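generation names also make the cleanup phase easy to reason about. a sketch, with hypothetical helper names: given a directory listing and the live generation from the manifest, everything else matching the pattern is garbage, including a newer generation left behind by a checkpoint that died before its manifest write.

```rust
/// Parse N out of "graph.N.snapshot"; anything else is not a snapshot.
fn parse_generation(name: &str) -> Option<u64> {
    name.strip_prefix("graph.")?
        .strip_suffix(".snapshot")?
        .parse()
        .ok()
}

/// Snapshot files that are safe to delete once `live` is the committed
/// generation. Non-snapshot files are never returned.
fn stale_snapshots<'a>(names: &[&'a str], live: u64) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|n| matches!(parse_generation(n), Some(g) if g != live))
        .collect()
}
```

since deletion happens after the commit point, a crash halfway through this list only leaves extra files for the next pass to pick up.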
300 SIGKILL torture iterations. 55,697 acked inserts. 962 checkpoints. zero data loss. 234 tests green.