Yes, separating storage from compute can save a lot. I have seen companies adopt a hot and cold architecture in their warehouses as well, simply because the requirement is to query only the last 1 or 2 years of data.
The truth about Big Data that vendors don't want you to know 👇
I'm re-reading this classic blog post from MotherDuck CEO Jordan Tigani titled "Big Data is Dead".
Jordan was a founding engineer on Google BigQuery, so this is a high-signal, experience-driven blog post.
💎 Here are my favorite gems from it:
💡 1. People using BigQuery didn't have big data (at all)
The vast majority of users had less than 1TB of TOTAL storage. There were many thousands of customers with half a TB. The median storage was much less than 100GB.
Customers also followed a power law distribution - largest customer had double the second largest customer, who had double the third largest, etc. It quickly got to small numbers at high percentiles.
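To get a feel for how fast that halving pattern shrinks, here is a minimal sketch. The 1024 TB starting size and the 20-customer cutoff are made-up numbers for illustration, not figures from the blog post.

```python
def halving_sizes(largest_tb: float, n_customers: int) -> list[float]:
    """Storage per customer, ranked largest first, when each customer
    holds half as much as the customer ranked just above it."""
    return [largest_tb / (2 ** rank) for rank in range(n_customers)]


# Hypothetical numbers: the largest customer stores 1024 TB.
sizes = halving_sizes(largest_tb=1024.0, n_customers=20)

for position in (1, 5, 11, 20):
    print(f"customer #{position}: {sizes[position - 1]:.4f} TB")
```

Under these assumptions, the 11th largest customer is already down to 1 TB, and everyone after that is in GB territory.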
Industry analysts like Gartner/Forrester also confirmed this idea with Jordan: most enterprises don't have that much data.
And of the customers that had a lot of data?
They queried SIGNIFICANTLY LESS than what they had. 👇
💡 2. Storing a lot != Querying a lot
This is my favorite takeaway from the piece. 🏆
It's the idea that even if you're storing many TBs of historical data, you're probably still querying a few GBs.
This is because the value of data drops exponentially with age. You care about what happened a few weeks ago, not 2 years ago.
Last month might have 5% of your total data but serve 80% of queries.
Here's data that proves it:
• Jordan analyzed BigQuery usage and found that 90% of queries processed LESS THAN 100 MB of data.
• S3 is built off this very same principle. They scale their massive distributed system of hard drives by co-locating older, cold data with newer, hot data.
The time-decay property of data means that the working set sizes are way smaller than the total set.
Even if you have a 1000TB table with 10 years' worth of data - you may only access the last day's data, which wouldn't be more than 50GB compressed. My laptop can do that easy. 🙂
Modern computing also has tons of tricks to avoid scanning all the data (column projection, partition pruning, segment elimination, predicate pushdown).
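Two of those tricks, partition pruning and column projection, can be shown with a toy model. This is only a sketch: the table layout, column names, and per-column sizes are hypothetical, and a real engine works on file metadata rather than a Python dict.

```python
from datetime import date, timedelta


def build_table(days, last_day, gb_per_column):
    """One partition per day, ending at last_day; each partition maps
    column name -> size in GB (hypothetical sizes)."""
    return {last_day - timedelta(days=i): dict(gb_per_column) for i in range(days)}


def scanned_gb(table, columns, first_day, last_day):
    """GB read by a query that needs `columns` for days in [first_day, last_day]."""
    total = 0.0
    for day, column_sizes in table.items():
        if first_day <= day <= last_day:  # partition pruning: skip other days
            total += sum(column_sizes[c] for c in columns)  # column projection
    return total


COLUMNS = {"user_id": 10.0, "event": 20.0, "payload": 70.0}  # GB per day, assumed
TODAY = date(2024, 1, 1)
table = build_table(days=3650, last_day=TODAY, gb_per_column=COLUMNS)

full_scan = scanned_gb(table, COLUMNS, date.min, date.max)
pruned_scan = scanned_gb(table, ["user_id", "event"], TODAY, TODAY)
print(f"full scan: {full_scan:,.0f} GB, last day with 2 columns: {pruned_scan:,.0f} GB")
```

With these made-up sizes, the table holds 365,000 GB but the query for the last day touches only 30 GB.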
Separating compute and storage allowed customers to keep storing data without having to scale up unnecessary compute for it. 🏆 (ty Snowflake)
Kafka with KIP-405 is a great example of this.
The best logical conclusion from this?
💡 If you use scalable object stores (e.g. S3), you can probably get away with just running one node for compute.
If nothing changes, your compute requirements remain the same while your total data set continues to grow.
In the past, this meant you had to deploy more instances just to store the data. Today, you can scale up only the storage layer. You probably won't need to scale to distributed processing at all to match your workload.
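A back-of-the-envelope sketch of that difference, assuming a node can handle 10 TB and the queried working set stays at 50 GB while the total grows. Both numbers are assumptions for illustration, not measurements.

```python
import math


def coupled_nodes(total_tb, tb_per_node):
    """Storage and compute live together: node count follows TOTAL data."""
    return max(1, math.ceil(total_tb / tb_per_node))


def decoupled_compute_nodes(working_set_tb, tb_per_node):
    """Storage lives in an object store: compute follows the WORKING SET only."""
    return max(1, math.ceil(working_set_tb / tb_per_node))


WORKING_SET_TB = 0.05  # assumed: the data queries actually touch stays flat
TB_PER_NODE = 10.0     # assumed: capacity of a single node

for total_tb in (10.0, 100.0, 1000.0):
    coupled = coupled_nodes(total_tb, TB_PER_NODE)
    decoupled = decoupled_compute_nodes(WORKING_SET_TB, TB_PER_NODE)
    print(f"{total_tb:>6.0f} TB stored -> coupled: {coupled} nodes, decoupled: {decoupled} node(s)")
```

In this toy model the coupled cluster grows from 1 to 100 nodes as the data grows 100x, while the decoupled setup stays at a single compute node.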
Critically, this also allows you to completely outsource all the complicated storage bits to something like S3.
Object stores solve replication, durability, availability, and hot-spot management for you (for cheaper).
WarpStream/Bufstream/Tansu in the Kafka ecosystem are great examples of this.
💡 Big Data is here, it's just not evenly distributed
Yes, data size is growing in the world. But it's mostly stored in a few big tech companies - the rest don't see that much data at all.
My experience matches this very well, yet the companies I worked for still had many thousands of customers.
💡 It's not the data size - it's you that's the problem
Back in the big data days, orgs had trouble getting actionable insights from their data. This was blamed on the data's size and solved by new distributed infra software that could handle the data.
The orgs migrated their legacy systems to the new ones... and found they still couldn't make sense of the data.
The size was never the problem.
💡 The Big Data Tax
Companies pay a hefty price for enterprise-scale infrastructure.
Not to mention the organizational cost of managing yet another big data system, which is way more hidden.
It consumes engineering bandwidth: learning the system's nuances and configs, establishing monitoring, establishing processes around deployments and upgrades, attaining operational expertise on how to manage it, creating runbooks, testing it, debugging it, adopting its clients and API, using its UI, keeping up with its ecosystem, etc.
In conclusion?
Don't follow the bandwagon. ✌️