Llama 3.2 1B Instruct on my RTX 4060 Ti 8GB.
1.24B params, all active. dense transformer. 771 MB on disk (Q4_K_M). this is the speed ceiling test.
228.2 tok/s at baseline. nearly 2x the next fastest model I've tested (Gemma E2B at 117.8). the GPU is barely loaded: 2.1 GB VRAM used, 5.8 GB free. prompt eval hits 16,791 tok/s at 8K context.
generation speed degrades to 171.4 tok/s at 24K context (-24.9% from baseline).
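for anyone checking the math, here is a minimal sketch of how the percentages fall out of the raw numbers. the helper name `pct_change` and the variable names are mine, made up for illustration; only the three tok/s figures come from the runs above.

```python
def pct_change(baseline: float, measured: float) -> float:
    """Relative change from baseline in percent (negative means slower)."""
    return (measured - baseline) / baseline * 100.0


# tok/s figures quoted in this post
baseline_tps = 228.2  # Llama 3.2 1B at baseline
long_ctx_tps = 171.4  # same model at 24K context
gemma_tps = 117.8     # Gemma E2B, the next fastest model tested

drop = pct_change(baseline_tps, long_ctx_tps)
speedup = baseline_tps / gemma_tps

print(f"24K context vs baseline: {drop:.1f}%")
print(f"vs Gemma E2B: {speedup:.2f}x")
```

so "nearly 2x" is about 1.94x, and the 24K slowdown rounds to -24.9%.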
I ran the same 6 quality tests as every other model. 4/6 passed.
what works: code generation (produced a working memoized fibonacci with type hints and docstring), system prompt adherence (clean uppercase pirate), hallucination resistance (correctly said "I couldn't find any information" about a fictional study), format switching (bullets to table).
what breaks: JSON compliance (added markdown fences despite "no fences" instruction), logic puzzle (interpreted the surgeon riddle as a medical ethics discussion, 8 numbered points about beneficence and the trolley problem, never figured out the surgeon is the mother).
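a rough sketch of how a JSON compliance check like this could be scored. this is not the exact harness I ran; `check_json_output` and its return keys are hypothetical names, and the rule "valid JSON but wrapped in fences still fails" is an assumption based on the "no fences" instruction.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence marker, built here to avoid a literal fence

# Matches output that is entirely wrapped in one fenced block,
# with an optional language tag such as "json" after the opening fence.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def check_json_output(raw: str) -> dict:
    """Report whether the output is fenced, whether its body parses as JSON,
    and whether it passes a strict "raw JSON only" rule."""
    match = FENCE_RE.match(raw)
    body = match.group(1) if match else raw
    try:
        json.loads(body)
        valid = True
    except json.JSONDecodeError:
        valid = False
    fenced = match is not None
    return {"fenced": fenced, "valid_json": valid, "passed": valid and not fenced}


fenced_reply = FENCE + 'json\n{"name": "test"}\n' + FENCE
print(check_json_output(fenced_reply))
print(check_json_output('{"name": "test"}'))
```

splitting `fenced` from `valid_json` keeps the failure mode visible: the model can produce parseable JSON and still fail the instruction.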
no thinking mode. every token is content. this is why code gen works here but fails on Qwen3 8B and GLM, where reasoning consumed all 2,048 tokens without producing output.
for me this is clear: 1.24B params can follow instructions and produce structured output, but can't reason. for high-throughput classification, simple completions, or speed benchmarking, nothing touches it on this hardware.