running a 35b model on a 16gb card. not in theory, right now.
qwen3.6-35b-a3b, q4_k_m, on a 4080 super.
64.70 tok/s, cold start.
the trick: -ngl 99 offloads every layer to the gpu, then -ncmoe 20 keeps the moe expert tensors of the first 20 layers on the cpu. that's what lets a 19.7gb model run on a 16gb card (13.4gb of vram used, ~2.5 free).
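quick sanity check on those numbers, as a rough sketch. it treats "gb" loosely (no gb vs gib distinction) and ignores that part of the 13.4gb in use is kv cache and compute buffers rather than weights, so the cpu-side figure is a lower bound, not a measurement. the function name is made up for illustration.

```python
# numbers taken from the post
MODEL_GB = 19.7        # q4_k_m model size
VRAM_TOTAL_GB = 16.0   # nominal card size
VRAM_USED_GB = 13.4    # vram reported in use while running


def min_cpu_resident_gb(model_gb: float, vram_used_gb: float) -> float:
    """lower bound on the weights that must live in system ram.

    even if every byte of used vram were model weights, anything
    beyond that has to sit on the cpu side.
    """
    return max(0.0, model_gb - vram_used_gb)


overflow = MODEL_GB - VRAM_TOTAL_GB            # how far the model overshoots the card
on_cpu = min_cpu_resident_gb(MODEL_GB, VRAM_USED_GB)

print(f"model overshoots the card by about {overflow:.1f} gb")
print(f"at least about {on_cpu:.1f} gb of weights sit in system ram")
```

so the expert tensors held back on the cpu account for at least about 6.3gb, which is why the rest fits with headroom to spare.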
64k context, flash attention on, kv cache at q4_0.
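for anyone wanting to try the same settings, here's a hedged sketch that assembles them into a command line. assumptions, none of which are stated in the post: the runtime is llama.cpp, the binary is llama-server, "64k" means 65536 tokens, the gguf filename is a placeholder, and your build accepts `-fa on` (older builds take a bare `-fa` instead). check `--help` on your build before trusting any of it.

```python
import shlex


def build_llama_cmd(model_path: str, binary: str = "llama-server") -> list[str]:
    """assemble an argv list mirroring the settings described in the post."""
    return [
        binary,
        "-m", model_path,   # path to the gguf file (placeholder below)
        "-ngl", "99",       # offload every layer to the gpu
        "-ncmoe", "20",     # keep moe expert tensors of the first 20 layers on the cpu
        "-c", "65536",      # 64k context, assuming 64 * 1024 tokens
        "-fa", "on",        # flash attention; bare "-fa" on older builds
        "-ctk", "q4_0",     # quantize the k cache
        "-ctv", "q4_0",     # quantize the v cache
    ]


# hypothetical filename, swap in your own
cmd = build_llama_cmd("qwen3.6-35b-a3b-q4_k_m.gguf")
print(shlex.join(cmd))
```

if vram is tight on your card, raising the -ncmoe value moves more expert tensors to the cpu at the cost of speed; lowering it does the opposite.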
this is just my current number, almost certainly not tuned to the ceiling. what are you pulling on your 16gb card?