Oxia is a modern cloud-native ZooKeeper replacement that scales 10x more 🔥
Pulsar replaced ZK with Oxia to scale to 10 million topics. Here's how & why 👇
ZooKeeper:
• ❌ Horizontal scaling doesn't work for writes.
Any write has to reach quorum before being acknowledged. The more nodes you add, the slower the write becomes.
• ❌ Vertical scaling also has its limits due to the snapshotting process.
Snapshotting is when ZooKeeper serializes its whole in-memory state into one giant binary blob to disk. It is necessary in order to speed up recovery on failure.
But doing it exhausts a lot of CPU and disk IO on the (single) ensemble leader in a spiky way. That includes traversing the whole tree end-to-end, serializing it, performing checksums, etc.
It is said this becomes very inefficient past 2GBs of state.
Here's why Oxia doesn't suffer from this problem
💡 SHARDING
Systems like ZK or etcd do not partition their data - they store everything in each node.
This is intentional because it makes global linearizability easier (e.g multi-key transactions)
But it also limits their scalability: etcd, for example, officially recommends a max state of 2-4GB for optimal performance.
The key insight is that a Kafka-like system (e.g Pulsar) does not need this type of limiting guarantee.
Oxia was designed from day one to focus on per-key linearizability, and not care about global linearizability. This is the fundamental trade-off that lets it scale so much.
💡 Conceptually, this is analogous to how Kafka can scale to many GB/s whereas a single database table cannot. Kafka has many WALs, the DB table has just one. Kafka can't give you ordering across partitions, but it can within the same partition.
🔥 SCALE
It's said the first production cut of Oxia let Pulsar scale to 10 million topics (10x the previous limit).
It's also said a single Oxia shard leader can deliver ~100k ops/s with a 80/20 read/write mix.
Oxia scales linearly by adding more storage nodes and shards, so it's not unthinkable to imagine it scaling to ~1M ops/s across 100s of GBs of state. (that's what it was designed for)
🤔 HOW DOES IT WORK?
It resembles ZooKeeper Kafka a fair amount. The system consists of
• a single Coordinator (like a Kafka Controller)
• many Data Nodes that host Shards. (like Kafka Brokers and Partitions)
• Kubernetes as a linearizable metadata store (like ZooKeeper)
A Shard has a single leader and multiple followers, depending on its replication factor.
Any Data Node can store multiple shards - it can act as a leader for some and as a follower for others.
For a write to to be acknowledged, a quorum of the data nodes is required -- i.e a majority of a shard's nodes have to acknowledge the write. 👌
Shards store both a log (WAL) and a LSM-backed key-value store (KV) called Pebble (by CockroachDB).
1. The WAL in the leader receives the new write
2. When the record is confirmed to be durably replicated across shards - it updates the KV.
☁️The Coordinator represents a stateless control plane.
It relies on Kubernetes for consensus. It maintains the single source of truth for overall cluster membership/topology in a k8s ConfigMap.
Yes, a simple Kubernetes ConfigMap is used as a linearizable metadata store. It's really just etcd under the hood.
The key reason it works is because the ConfigMap is updated on rare events - only on failures, rebalancing, etc..
The Coordinator continuously health checks the Data Nodes and handles things like shard assignments, shard rebalances, etc., by:
• 1) persisting the state in the ConfigMap
• 2) taking action to reconcile the actual state with the desired one
I find Oxia's design really elegant in its simplicity.
Oxia was released in 2023 by StreamNative and started incubating in the CNCF around 2025. Keep an eye on this project, and like if you learned something. ✌️