Now let's do the math on what current systems actually use...
Current system for a 6GB model:
OS kernel: ~500MB
Runtime overhead: ~500MB
Memory addressing/pointers: ~1GB
Padding/alignment waste: ~500MB
Cache inefficiency: ~1GB
Actual weights: ~2.5GB
That's 3.5GB out of 6GB that isn't weights. That's 58% waste.
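To check that arithmetic, here is a quick sketch that adds up the rough estimates listed above and works out the share that isn't weights. The dictionary keys are just labels made up for this example, and the figures are the same approximations as in the list, not measurements.

```python
# Rough memory budget from the list above, in GB (estimates, not measurements).
budget_gb = {
    "os_kernel": 0.5,
    "runtime_overhead": 0.5,
    "addressing_pointers": 1.0,
    "padding_alignment": 0.5,
    "cache_inefficiency": 1.0,
    "weights": 2.5,
}

total_gb = sum(budget_gb.values())
overhead_gb = total_gb - budget_gb["weights"]  # everything that is not weights
waste_pct = 100 * overhead_gb / total_gb

print(f"total: {total_gb} GB, overhead: {overhead_gb} GB, waste: {waste_pct:.0f}%")
```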
What does a system actually need to run LLM inference?
Just these things:
-Store the weights (the numbers)
-Read them in order (sequentially)
-Multiply them (basic math)
-Output a result (one token at a time)
That's it. That's all an LLM inference engine fundamentally does.
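The four steps above can be sketched as a toy in a few lines of Python. This is only an illustration under heavy assumptions: a single matrix-vector multiply stands in for the whole model, the function names `matvec` and `next_token` are invented for this example, and the weights are tiny hand-picked numbers rather than a real checkpoint.

```python
from array import array

def matvec(weights, rows, cols, x):
    """Multiply a row-major weight matrix by vector x, reading weights in order."""
    out = []
    for r in range(rows):
        base = r * cols          # start of this row in the flat buffer
        acc = 0.0
        for c in range(cols):    # sequential read: base, base+1, base+2, ...
            acc += weights[base + c] * x[c]
        out.append(acc)
    return out

def next_token(weights, vocab_size, dim, x):
    """Score every token in the vocabulary and return the index of the best one."""
    logits = matvec(weights, vocab_size, dim, x)
    return max(range(vocab_size), key=logits.__getitem__)

# Step 1: store the weights as one contiguous block of 32-bit floats.
# Hypothetical 3-token vocabulary with a 2-dimensional input.
weights = array("f", [1, 0,
                      0, 1,
                      1, 1])

# Steps 2-4: read in order, multiply, output one token.
token = next_token(weights, vocab_size=3, dim=2, x=[2.0, 3.0])
print(token)
```

The flat `array` is the point of the sketch: the weights sit in one block and the loop walks through them front to back, with no pointers to chase.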