Spent a few weekend hours today thinking about how Uber's data plane would look if it were built on Vitess from scratch.
If
@Uber would be designing its core data plane on vitess from scratch today, I don't think their engineers gonna think about which database to even pick . They probably gonna start with “how do we control query routing under extreme, adversarial concurrency.”
People massively underestimate how hostile uber’s workload actually is. This is not just high throughput system, it’s a continuous multi-actor contention on fast-moving shared state.
They got drivers streaming location updates in every few seconds, dispatch constantly recomputing matches, eta systems recalculating on movement, surge pricing reacting to localized spikes, riders hammering refresh, and payment with fraud systems branching asynchronously on the same logical entities.
So I was thinking that Mysql itself is probably not gonna be the bottleneck here. As Innodb can already handle absurd scale if engineered correctly, but I think the real bottleneck becomes the coordination layer deciding where queries execute, how much fanout is allowed, and how much cross-shard amplification you tolerate before tail latency explodes.
That’s why vitess is so so interesting. VTGate isn't actually just a proxy, it’s basically a distributed query router and control plane. We got vindexes defining ownership, routing rules defining latency, and every scatter query becoming a direct attack to p99s.
At that level I don't think you are scaling the storage, you are scaling decision-making around data placement and access paths.
And once you start seeing it that way, shard key design won't look like a schema problem and starts becoming a first-class systems problem.
Hashing purely on trip_id looks beautiful on paper because distribution becomes easy, but it completely destroys spatial locality which is catastrophic for dispatch and eta workloads where most reads are geo-bounded and latency sensitive.
Now imagine vtgate fanning queries across huge shard sets just to answer what should’ve been a local query, and suddenly the slowest shard dictates latency for the entire request. But pure geo-sharding isn't stable either because real-world demand is uneven and adversarial. Airports, concerts, rain, holidays, rush hours, all of them create violent localized write storms that can melt the hotspots instantly. So the only design that’s probably gonna survive long term is hybrid partitioning which is a coarse geo partitioning combined with intra-region hashing so traffic stays operationally local while load still distributes evenly. But even then, relational elegance kinda dies at scale. Cross-shard joins become latency amplifiers, so the architecture has to lean aggressively into denormalization, materialized projections, and query-specific data layouts.
Complexity gets pushed into write paths and async pipelines so reads remain local and predictable. I mean like, this is exactly why uber’s append-only schemaless approach fits so naturally here. Immutable events avoid hot-row contention, align perfectly with replication streams, and let downstream systems like billing, fraud detection, analytics, and notifications consume state transitions independently without tight synchronous coupling. The datastore stops behaving like “rows in tables” and starts behaving more like a distributed log of evolving system state.
And the hardest part of systems like this is probably not raw performance, it’s operability under continuous change. Cities grow unevenly, traffic patterns shift, new product features introduce entirely new access patterns, and hotspots appear unpredictably.
Vitess’s real superpower is that it lets you continuously reshape the physical topology of data without freezing the product or taking the system offline. Online resharding, filtered replication, traffic switching, and vtctl workflows make it possible to split shards, migrate keyspaces, and rebalance load while live reads and writes continue flowing. But this abstraction layer is also where tiny inefficiencies compound into massive production problems.
Something as subtle as vtgate including non-plan-affecting directive comments in plan cache keys can fragment the cache under high-cardinality workloads where requests carry different runtime annotations. Logically identical queries end up generating separate cache entries, wasting cpu on repeated planning and quietly degrading tail latency inside the router itself. Fixing that sounds simple until you realize you’re being forced to define what “the same query” even means inside a distributed execution engine. And that’s kinda the deeper point here: I think uber can scale on vitess, the hardest engineering problems are not really in storage engines or even in sharding itself. They’re in building an abstraction layer powerful enough to hide distribution most of the time, but precise enough that when the abstraction leaks, it doesn’t take correctness, latency, or operability down with it.
Another thing that I think becomes super interesting is how reads and writes stop being symmetric at this scale.
Most engineers learn databases through CRUD applications where reads and writes are treated as almost equal operations. But in a system like Uber, the cost of a read and the cost of a write are completely different economic decisions.
A write is usually local. VTGate computes the vindex, finds the owner shard, forwards the request, and you're done.
A read can become arbitrarily expensive.
Now, The moment a query requires data owned by multiple shards, you're paying coordination costs across the network. You're waiting on multiple replicas. You're merging results. You're dealing with replica lag. You're dealing with inconsistent snapshots. You're dealing with partial failures. A single innocent-looking query can suddenly become a distributed systems problem.
That's why I think the biggest mindset shift at this scale is that you're no longer designing tables. You're designing access patterns.
Every table, every index, every materialized view, every denormalized projection is basically an optimization for a future query you know is gonna happen millions of times per day.
And tbh that's where a lot of distributed database discussions get simplified too much.
People ask "Can this database scale?" But that's not really the question. The question is that "Can the access patterns scale?". Because I've seen architectures where the storage layer was perfectly fine but the query patterns were fundamentally unscalable.
The database wasn't failing.
The engineers were accidentally asking impossible questions.
I also think people underestimate how much of the engineering effort goes into protecting the system from other engineers.
Imagine hundreds or thousands of developers shipping features independently.
Somebody adds a dashboard query.
Someone adds a reporting endpoint.
Someone adds a support workflow.
Someone adds a new recommendation system.
Every one of those features introduces new access patterns.
Some are local.
Some are gonna fan out across half the topology.
Some are gonna look harmless in staging and become a disaster in production.
Which is why systems like Vitess are fascinating to me.
They're not just solving sharding. They're creating guardrails around sharding.
They're giving engineers a way to think about a distributed database without needing every application engineer to become a distributed systems expert.
And that's probably the hardest part.
Not building a system that scales today.
Building a system where thousands of future decisions made by engineers you've never met don't accidentally destroy the scalability you spent years building.