How it hits 56,000 tok/s
No monolithic FSM.
A microcode ROM sequences modular datapath actuators - matvec, attention, RMSNorm, exp, sampler - over one true dual-port scratchpad that also holds the persistent KV cache.
1 block · 24-dim · 4 heads · ctx 16 · Q5.11, bit-exact to Python.