IMO one of the hardest open and most valuable problems in databases is how to make the performance of filtering and aggregating dynamically typed JSON approach the performance of a static schema.
It's conceptually similar to the problem of JIT compiling Javascript and making it as fast as native code in the browser.
Here's why:
* In a columnstore, you get to see all the data at ingest time, and can optimize (in large batches) how it's stored. Simple JSON structures that actually have a consistent schema should literally == the performance of a static schema.
* But most databases are quite lazy in their optimizer and query execution engine, both of which are usually designed for static typesystems. Instead of rebuilding those to defer type assumptions to runtime, databases tend to generate slower runtime code (w/ extra branching and so on). The optimal implementation generates perfect static operators per large batch. I'm not aware of any commercial systems that do this.
* And of course, real-world JSON is rarely consistent. At least in Braintrust's workload, we see extremely high cardinality data (e.g. lists of json objects where each object has a unique set of keys). So figuring out how to store that efficiently is wild. I don't want to make fun of commercial systems but if you see things like "JSON is just a string" or "max JSON paths" that means that the system is not going to handle real workloads.