I just can’t get over how neat CXL type 3 is.
Imagine having a 1TB bucket of memory.
But! Instead of 1TB of DDR5, you have a tiered CXL accelerator. To the OS, it *looks* like regular memory, and you address it the same way.
Maybe your accelerator is actually 100GB of DDR5 and ~1TB of high-bandwidth flash. The first 100GB is your buffer, and a little controller slowly flushes it out to the flash behind it.
Many, many workloads are not hammering RAM enough for you to notice.
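To put a rough number on "not hammering RAM enough", here is a back-of-the-envelope sketch. It assumes a simple model where the DDR5 buffer fills at the difference between the incoming write rate and the controller's flush rate. The function name and every rate below are made up for illustration; they are not specs of any real device.

```cpp
#include <limits>

// Hypothetical model: seconds until the fast buffer fills, given a sustained
// write rate and the rate at which the controller drains it to flash.
// Returns infinity when the controller keeps up (the buffer never fills).
double seconds_until_full(double buffer_gb,
                          double write_gb_per_s,
                          double flush_gb_per_s) {
    if (write_gb_per_s <= flush_gb_per_s) {
        return std::numeric_limits<double>::infinity();
    }
    return buffer_gb / (write_gb_per_s - flush_gb_per_s);
}
```

With a 100GB buffer, an assumed 5 GB/s of sustained writes, and an assumed 3 GB/s flush rate, `seconds_until_full(100.0, 5.0, 3.0)` gives 50 seconds of headroom. Any workload whose sustained write rate stays under the flush rate never hits the wall at all.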
Wait! You could get even more clever.
With regular memory, bouncing cachelines between CPU cores is annoying. Often, you’ll program your way around this (avoiding a shared counter) by having each thread maintain a temporary local state with occasional global syncs.
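For reference, the usual software workaround looks something like this. It's a minimal sketch, assuming C++17 and standard threads; `count_with_local_batches` and `flush_every` are names I made up for the example.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: each of `num_threads` threads counts `per_thread`
// events, but only touches the shared counter once per `flush_every` events.
std::uint64_t count_with_local_batches(unsigned num_threads,
                                       std::uint64_t per_thread,
                                       std::uint64_t flush_every) {
    std::atomic<std::uint64_t> global{0};  // the cacheline everyone would fight over
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&global, per_thread, flush_every] {
            std::uint64_t local = 0;  // thread-private, no sharing
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                ++local;
                if (local == flush_every) {
                    // Occasional global sync: one contended write per batch.
                    global.fetch_add(local, std::memory_order_relaxed);
                    local = 0;
                }
            }
            // Final sync for whatever is left in the last partial batch.
            global.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return global.load(std::memory_order_relaxed);
}
```

The total is exact once every thread has joined, but any reader peeking at the counter mid-run sees a value that lags by up to one batch per thread. That staleness is the price of not bouncing the line on every increment.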
But if we have a custom CXL type 3 memory device, that slow global merge could be implemented in hardware instead. You’d never have cores fighting over the same cacheline, because the shared counter would be local to the CXL device!
Aka, a remote atomic!
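Here is a toy software model of that idea, just to make the shape concrete. To be loud about it: this is not a CXL API, and there is no real device here. A background thread stands in for the device-side controller, every name (`ModelDeviceCounter`, `post_add`, `run_model`) is hypothetical, and I'm assuming C++17 and 64-byte cachelines. In this model each mailbox line still moves between one core and the "device" thread; the point is only that cores never contend with each other.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One mailbox per core, padded so no two cores share a line (64 bytes assumed).
struct alignas(64) Mailbox {
    std::atomic<std::uint64_t> pending{0};
};

// Toy stand-in for a device-resident counter. The "device" thread owns the
// total; cores only post increments to their own mailbox.
class ModelDeviceCounter {
public:
    explicit ModelDeviceCounter(std::size_t cores)
        : mailboxes_(cores), device_([this] { run(); }) {}

    ~ModelDeviceCounter() { stop(); }

    // Called by core `id`: post the increment and move on.
    void post_add(std::size_t id, std::uint64_t delta) {
        mailboxes_[id].pending.fetch_add(delta, std::memory_order_relaxed);
    }

    // Stop the device thread, merge anything still pending, return the total.
    // Callers must have joined their worker threads first.
    std::uint64_t stop() {
        if (device_.joinable()) {
            running_.store(false, std::memory_order_release);
            device_.join();
            drain();
        }
        return total_;
    }

private:
    // The merge step that would live in hardware on a real device.
    void drain() {
        for (auto& m : mailboxes_) {
            total_ += m.pending.exchange(0, std::memory_order_relaxed);
        }
    }

    void run() {
        while (running_.load(std::memory_order_acquire)) {
            drain();
            std::this_thread::yield();
        }
    }

    std::vector<Mailbox> mailboxes_;
    std::atomic<bool> running_{true};
    std::uint64_t total_ = 0;  // touched only by the device thread, or after it joins
    std::thread device_;       // declared last so it starts after the other members
};

// Hypothetical driver: `cores` threads each post `adds_per_core` increments.
std::uint64_t run_model(std::size_t cores, std::uint64_t adds_per_core) {
    ModelDeviceCounter dev(cores);
    std::vector<std::thread> workers;
    for (std::size_t id = 0; id < cores; ++id) {
        workers.emplace_back([&dev, id, adds_per_core] {
            for (std::uint64_t i = 0; i < adds_per_core; ++i) dev.post_add(id, 1);
        });
    }
    for (auto& w : workers) w.join();
    return dev.stop();
}
```

Note that the cores never read the counter back in the hot path. They fire off an add and keep going, which is exactly what makes it feel like a remote atomic rather than a shared variable.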
This is essentially the concept of NDP (near-data processing), and of course there are much fancier algorithms you can do with it; that’s just one example. But you can imagine, especially with database-style operations, how much bandwidth you could save by not having to round-trip to the CPU and back for every operation.
Imagine if your RAM could run a regex for you! We’re getting really close to that world.