Good question.
The short answer is that I don't view low latency and scalability as separate problems.
The architecture is built around independent silos that can scale horizontally without introducing large shared bottlenecks. A few examples:
• Workloads are sharded across independent units
• Tokio workers are pinned to dedicated CPU cores
• CPU isolation and IRQ affinity tuning
• RPS/XPS tuning and multi-queue NIC utilization
• Strategic use of io_uring (via compio) and XDP/eBPF where they provide measurable benefits
• TCP processing kept as close to the execution path as possible
• BBR with the fq qdisc, plus extensive network-stack tuning
• Continuous profiling and latency instrumentation
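To make the sharding bullet concrete, here is a minimal sketch of the idea using only the Rust standard library. It is not the actual implementation: `shard_for` and `run_sharded` are hypothetical names, plain OS threads stand in for pinned Tokio workers, and core pinning is left out because the standard library has no affinity API. The point is only that each shard owns its own queue and state, so nothing is locked or shared across shards.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc;
use std::thread;

/// Map a key to one of `shards` independent units (hypothetical helper).
fn shard_for<K: Hash>(key: &K, shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % shards as u64) as usize
}

/// Spawn one worker per shard. Each worker owns its channel and its state,
/// so there is no shared map and no lock on the hot path.
fn run_sharded(shards: usize, keys: &[&str]) -> Vec<HashMap<String, u64>> {
    let mut senders = Vec::with_capacity(shards);
    let mut handles = Vec::with_capacity(shards);

    for _ in 0..shards {
        let (tx, rx) = mpsc::channel::<String>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Shard-local state: only this thread ever touches it.
            let mut counts: HashMap<String, u64> = HashMap::new();
            for key in rx {
                *counts.entry(key).or_insert(0) += 1;
            }
            counts
        }));
    }

    // Route every key to exactly one shard.
    for key in keys {
        let idx = shard_for(key, shards);
        senders[idx]
            .send(key.to_string())
            .expect("shard worker exited early");
    }

    // Dropping the senders closes the channels so the workers can finish.
    drop(senders);
    handles
        .into_iter()
        .map(|h| h.join().expect("shard worker panicked"))
        .collect()
}

fn main() {
    let keys = ["alice", "bob", "alice", "carol", "bob", "alice"];
    let per_shard = run_sharded(4, &keys);
    for (i, counts) in per_shard.iter().enumerate() {
        println!("shard {i}: {counts:?}");
    }
}
```

Because a key always hashes to the same shard, scaling out means raising the shard count rather than growing any single worker. A real deployment would also need a plan for moving keys when the shard count changes, which this sketch ignores.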
The goal isn't just a lower average latency number. It's keeping latency predictable as load increases. Once we're operating at this level, the biggest challenges are usually contention (locks in particular), cache locality, scheduler migrations, and cross-core communication rather than raw CPU horsepower.
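A quick illustration of why the average is the wrong number to watch. This is a sketch with made-up data, not output from the real system: `percentile`, `mean`, and `synthetic_samples` are hypothetical helpers, and the samples assume 98% of requests take 100 µs while 2% stall for 50 ms.

```rust
use std::time::Duration;

/// Nearest-rank percentile; `p` is a whole percent in 1..=100.
/// Sorts the slice in place.
fn percentile(samples: &mut [Duration], p: u32) -> Option<Duration> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    samples.sort_unstable();
    // Integer ceiling of (p / 100) * n, which avoids float rounding.
    let rank = (samples.len() * p as usize + 99) / 100;
    Some(samples[rank.max(1) - 1])
}

/// Arithmetic mean of the samples.
fn mean(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let total: Duration = samples.iter().sum();
    Some(total / samples.len() as u32)
}

/// Made-up workload: 980 fast requests and 20 stalled ones.
fn synthetic_samples() -> Vec<Duration> {
    let mut samples = Vec::with_capacity(1000);
    samples.extend(std::iter::repeat(Duration::from_micros(100)).take(980));
    samples.extend(std::iter::repeat(Duration::from_millis(50)).take(20));
    samples
}

fn main() {
    let mut samples = synthetic_samples();
    println!("mean: {:?}", mean(&samples));
    println!("p50:  {:?}", percentile(&mut samples, 50));
    println!("p99:  {:?}", percentile(&mut samples, 99));
}
```

With this data the mean comes out around 1.1 ms, which describes no request that actually happened: the median is 100 µs and the p99 is 50 ms. That gap is why the instrumentation tracks percentiles under load instead of a single average.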
In practice, I'd rather add another independent shard than make an existing one bigger.
P.S. I also have a healthy distrust of the happy path. Reality has a habit of finding edge cases we forgot to imagine, so smaller failure domains tend to age better than giant shared systems.
Interesting project. How are you handling scalability with low latency?