95 ns per vector vs. 412 ns: Elasticsearch simdvec vs. jvector, scoring 32,500 float32 vectors on x86.
But the number isn't really the story here. The story is that at scale, once the data no longer fits in L3 cache, memory latency dominates compute, and simdvec is built for exactly that.
At this scale, the kernel spends more time waiting for data than processing it. Prefetch too early and you've wasted the slot. Too late and you're stalling anyway. The window is narrow and simdvec times it right.
Explicit prefetch instructions on x86 pull the next vectors into cache while the current batch is still scoring. On ARM, interleaved loads do the same job differently. Either way, the pipeline stays fed.
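To make the x86 idea concrete, here is a minimal sketch of software prefetching in a scoring loop. It is not simdvec's code: the scalar dot product stands in for a real SIMD kernel, PREFETCH_DISTANCE and the function names are hypothetical, and it assumes GCC or Clang for the __builtin_prefetch builtin.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many vectors ahead to prefetch.
   Too small and the data arrives late; too large and it may be
   evicted before it is scored. */
#define PREFETCH_DISTANCE 4
#define CACHE_LINE_BYTES 64

/* Scalar stand-in for a SIMD dot-product kernel. */
static float dot(const float *a, const float *b, size_t dims) {
    float sum = 0.0f;
    for (size_t i = 0; i < dims; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Scores `count` contiguous vectors of `dims` floats against `query`.
   While vector v is being scored, the cache lines of vector
   v + PREFETCH_DISTANCE are requested so they are (ideally) resident
   by the time the loop reaches them. */
static void score_batch(const float *query, const float *vectors,
                        size_t count, size_t dims, float *scores) {
    const size_t floats_per_line = CACHE_LINE_BYTES / sizeof(float);
    for (size_t v = 0; v < count; v++) {
        if (v + PREFETCH_DISTANCE < count) {
            const float *next = vectors + (v + PREFETCH_DISTANCE) * dims;
            for (size_t i = 0; i < dims; i += floats_per_line) {
                /* rw = 0 (read), locality = 3 (keep in all cache levels) */
                __builtin_prefetch(next + i, 0, 3);
            }
        }
        scores[v] = dot(query, vectors + v * dims, dims);
    }
}
```

The prefetch is only a hint, so the scores are identical with or without it. What changes is how often the scoring loop stalls on memory, and the right distance depends on vector size and hardware.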
Hardware counters show what that buys: 139K L1 cache misses drop to 19K. That's where the 4x gap lives.
The simdvec team wrote up the full architecture and benchmarks, including their hardware counter data.