We made co-located TCP up to 7x faster by adding a copy, not removing one.
That sentence should bother you. Every performance engineer is trained to drive copies toward zero. So when we built bpf_sock_splice_pair(), a new BPF kfunc that splices two TCP sockets on the same machine (think service-mesh sidecars, loopback RPC, co-scheduled microservices), our first design did exactly that: a single direct user-to-user copy, the theoretical minimum for an unmodified sockets API.
It was elegant. It was also the wrong tradeoff.
A single copy forces the sender to write straight into the receiver's buffer, which means both sides have to meet at the same instant. That synchronous rendezvous quietly kills batching. The sender can never run ahead, so throughput is capped by handshake latency instead of memory bandwidth.
The fix is a lesson queueing theory has taught for decades: to let a producer outrun a consumer, you need a buffer between them. That buffer costs a second copy, and the second copy is the price of decoupling. Decoupling enables batching, and batching amortizes per-message overhead. Owning an in-kernel ring also lets the receiver busy-poll, which is what finally cracks loopback latency.
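To make the decoupling concrete, here is a toy user-space model in Python. It is not the patchset's code: the names Ring, push, drain, and send_burst are hypothetical, and a bytearray stands in for the in-kernel ring. It only illustrates the shape of the idea, where copy 1 lets the sender run ahead and copy 2 lets the receiver pick up a whole batch in one pass.

```python
class Ring:
    """Toy byte ring standing in for an in-kernel buffer (hypothetical model)."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0  # total bytes ever written (advanced by the sender)
        self.tail = 0  # total bytes ever read (advanced by the receiver)

    def free(self):
        """Bytes of space currently available to the sender."""
        return self.capacity - (self.head - self.tail)

    def push(self, msg):
        """Copy 1: sender's bytes into the ring. Returns False if it does not fit."""
        n = len(msg)
        if n > self.free():
            return False
        start = self.head % self.capacity
        first = min(n, self.capacity - start)      # bytes before the wrap point
        self.buf[start:start + first] = msg[:first]
        self.buf[0:n - first] = msg[first:]        # remainder wraps to the front
        self.head += n
        return True

    def drain(self):
        """Copy 2: everything queued so far out to the receiver, in one pass."""
        n = self.head - self.tail
        start = self.tail % self.capacity
        first = min(n, self.capacity - start)
        out = bytes(self.buf[start:start + first]) + bytes(self.buf[0:n - first])
        self.tail += n
        return out


def send_burst(ring, messages):
    """Sender runs ahead of the receiver: queue messages until the ring is full."""
    sent = 0
    for m in messages:
        if not ring.push(m):
            break
        sent += 1
    return sent


ring = Ring(capacity=4096)
msgs = [bytes([i]) * 1024 for i in range(6)]  # six 1 KB messages
queued = send_burst(ring, msgs)  # the receiver has not run at all yet
batch = ring.drain()             # a single receiver pass collects the backlog
```

With a 4096-byte ring, the sender queues four 1 KB messages without the receiver being present, and one drain call returns all four. In the single-copy design there is nowhere to park those bytes, so each message would need its own rendezvous with the receiver.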
The result, measured with netperf at a realistic 1 KB request-response size:
- Loopback TCP_RR: 106k to 713k transactions/sec (6.7x)
- Container TCP_RR: 100k to 705k transactions/sec (7.0x)
- No application changes. No new address family. Just BPF pairing ordinary TCP sockets.
We also benchmarked it against AF_SMC's shared-memory loopback, which independently arrived at the same "buffering enables batching" conclusion. Our two-copy ring still comes out ahead of its three-copy path.
The full design story, the dead end we walked into first, and a comparison with AF_SMC:
multikernel.io/2026/06/11/bp…
The patchset is up as an RFC on the BPF and netdev lists. Reviews and benchmarks welcome.
#LinuxKernel #eBPF #Networking #Performance #TCP #SystemsEngineering #OpenSource