I’ve been watching the G2 layer and custom solutions a bit.
$PENG MemoryAI is one of the first clean product expressions here: a CXL-based KV Cache Server for agentic inference, long context, and memory-bound decode.
$ALAB says Leo is deployed in Penguin’s KV Cache Server with SMART Modular memory, showing 3.6x memory expansion, 75% higher GPU utilization, and 2x inference throughput.
Let's see how Penguin Solutions and Astera Labs execute here👀
With modern agentic workloads and long context windows, a common bottleneck in serving LLMs at scale is finding room for all the KV cache. Luckily, KV cache can be extended beyond HBM into larger, slower tiers of memory.
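To put rough numbers on why the KV cache outgrows HBM, here is a back-of-the-envelope sketch. The model shape (80 layers, 8 KV heads, head dim 128, FP16) and the serving load (16 concurrent 128k-token sequences) are hypothetical values picked for illustration, not figures from any vendor.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size for a decoder-only transformer.

    The leading factor of 2 covers keys and values: each layer stores
    one K and one V vector per KV head for every cached token.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape (assumption, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

# Cost of caching a single token, FP16 (2 bytes per element).
per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1)

# Hypothetical load: 16 concurrent sequences at 128k tokens each.
total = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM,
                       seq_len=128_000, batch_size=16)

print(f"per token: {per_token / 1024:.0f} KiB")  # per token: 320 KiB
print(f"total:     {total / 1024**3:.0f} GiB")   # total:     625 GiB
```

Under those assumptions the cache alone needs hundreds of GiB, which is the kind of footprint that pushes it out of HBM and into the lower tiers.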
Nvidia uses the following naming convention to describe the tiers:
🟠 G1 (HBM): highest bandwidth but (relatively) small
🟠 G2 (host DRAM): still quite fast (traverses PCIe) and an order of magnitude larger than G1
🟠 G3 (SSD/NVMe): slower, shared across the entire node
🟠 G4 (shared network storage): slowest, effectively unlimited in size
At GTC 2026, in a historic partnership with SpaceXAnthropicAI, Jensen announced the newest tier, G5: a Starlink-attached HDD array in low Earth orbit.
Excited to see what G6 will be.