Agreed, shared storage (Aurora, AlloyDB, HorizonDB, Neon, etc.) is the dominant design for cloud OLTP. It's not the best in every scenario, but it is for most use cases. You can push a lot of work (replication, full page writes, dirty page writes, etc.) into the storage layer.
There is a new era of data tech that is effectively "__ on object storage":
Turbopuffer is "vector search on object storage"
WarpStream is "Kafka on object storage"
Neon is "Postgres on Object Storage" or “Postgres on S3”.
It doesn’t mean that every read and write goes directly to S3. That would be incredibly slow. I’m saying that a Postgres database in the "Postgres on object storage" category can be faster than one in the "Postgres on a cluster of servers with NVMe disks" category.
No one is claiming that S3 is faster than NVMe, but Postgres on S3 (with low-latency storage in between) can be faster than Postgres running on NVMe with HA on. HA is important here: without HA you don’t get durable writes, so it would be an unfair comparison.
While Neon runs on S3, calls to S3 are almost never on the path of transactional reads or writes. Writes are sent to a consensus service and streamed to S3 asynchronously. So the claim can be expanded: Postgres running on disaggregated storage that implements a low-latency tier on top of S3 is faster than Postgres with HA running on NVMe.
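To make the shape of that write path concrete, here is a toy sketch in Python. It is not Neon's implementation: `ToyWalService`, its method names, and the three-replica majority rule are all assumptions made up for illustration. The only point it shows is that a commit is acknowledged once a quorum of log replicas has the record, while the upload to object storage happens later, off the commit path.

```python
class ToyWalService:
    """Hypothetical model: quorum-acked log in front of an object store."""

    def __init__(self, replicas=3):
        self.logs = [[] for _ in range(replicas)]  # in-memory log replicas
        self.quorum = replicas // 2 + 1            # simple majority (assumed)
        self.committed = []                        # records acked to the client
        self.object_store = []                     # stand-in for S3

    def commit(self, record, reachable=None):
        """Acknowledge a write only if a majority of replicas can take it."""
        if reachable is None:
            targets = list(range(len(self.logs)))
        else:
            targets = list(reachable)
        if len(targets) < self.quorum:
            return False  # no quorum: the write is not acknowledged
        for i in targets:
            self.logs[i].append(record)
        self.committed.append(record)
        return True  # note: the object store was never touched here

    def flush_to_object_store(self):
        """Background step: upload committed records not yet in the store."""
        pending = self.committed[len(self.object_store):]
        self.object_store.extend(pending)
        return len(pending)


svc = ToyWalService()
svc.commit("insert row 1")                 # acked by all three replicas
svc.commit("insert row 2", reachable=[0])  # rejected, only one replica
svc.flush_to_object_store()                # async upload, after the ack
```

In this model the commit latency depends only on the quorum round trip, which is why the speed of the object store itself does not show up in transaction latency.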
We are not the only ones making this claim. For example, AWS Aurora says "Aurora has 5x the throughput of MySQL and 3x of PostgreSQL with full PostgreSQL and MySQL compatibility."
So why does disaggregating compute allow for higher throughput on Postgres and potentially lower latency as well? The reason is that we can offload a number of CPU and IO operations down to storage. We just published a blog post on how we can turn off full page writes, which dramatically reduces WAL volume and saves CPU cycles on the Postgres node.
In many scenarios this may be a wash: many workloads are not write-throughput bound, so Postgres checkpoints and full page writes don’t impact overall throughput. However, this is general-purpose enough to impact a large swath of workloads. It’s also important to mention that scaling write throughput is more important, since Postgres is a single-writer system and you can’t scale writes with read replicas.
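A back-of-envelope sketch of why full page writes matter for WAL volume. With `full_page_writes` on, stock Postgres logs the entire page image the first time a page is modified after a checkpoint, and the default page size is 8 KB. The function name `wal_bytes` and the workload numbers below are hypothetical, chosen only to show the arithmetic, and the model ignores details such as hole removal and `wal_compression`, so treat it as a rough upper bound rather than a measurement.

```python
PAGE_SIZE = 8192  # default Postgres block size in bytes


def wal_bytes(updates, record_bytes, distinct_pages, full_page_writes=True):
    """Rough WAL volume for one checkpoint interval (illustrative model)."""
    base = updates * record_bytes
    if not full_page_writes:
        return base
    # The first change to each page after a checkpoint carries a full image.
    # A page cannot get a first change without at least one update.
    first_touches = min(distinct_pages, updates)
    return base + first_touches * PAGE_SIZE


# Hypothetical workload: 1M small updates spread over 200k distinct pages.
with_fpw = wal_bytes(1_000_000, 100, 200_000, full_page_writes=True)
without_fpw = wal_bytes(1_000_000, 100, 200_000, full_page_writes=False)
print(with_fpw, without_fpw, with_fpw / without_fpw)
```

The more widely the updates are scattered across pages between checkpoints, the larger the full-page share becomes, which is the case where moving this work into the storage layer pays off most.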
So is Postgres on S3 faster than Postgres on NVMe? We believe it can and will be. Postgres with disaggregated storage and several kernel performance optimizations has higher throughput than stock Postgres on NVMe with HA implemented via sync replication.
We'll share more, including latency impact, as we gather insights after rollout. There is a lot to learn here about whether saving CPU on full page writes can have a material impact on latency under high throughput. The idea is that if CPU is all used up, freeing up some CPU will help both latency and throughput - but we'll see!
The statement is indeed provocative, but far from “shock value marketing” as some of the responses claim.