Local LLM Cheat Sheet: 16GB Edition (4.13.26)
Most people building Hermes or OpenClaw agents are still paying per token for tasks a $0 local model could handle.
Here's every model worth running on a 16GB Mac Mini (or any device with similar RAM), what it's actually good for, and the honest take on its limits:
Class A | Power
- Qwen3.5 9B: always-on idea generation, drafting, long reasoning. Graeme's choice for all-day loops
- Llama 3.1 8B: only pick this when you need 128K context to feed a long doc or full codebase
Class B | Balanced
- Qwen2.5 Coder 7B: best coding model at this RAM level, 128K for large files
- DeepSeek R1 Distill 7B: offline reasoning chains, research tasks, structured analysis
- Gemma 3 4B: capable all-rounder, 128K context, leaves headroom for other apps
- Phi-4 Mini Reasoning: logic problems and math, hard ceiling at 16K context
- Qwen2.5 7B: multilingual generalist, fallback when Qwen3.5 9B is overkill
Class C | Efficient
- Qwen3.5 2B: fast summaries, tagging, rewrites. Realistic secondary agent alongside a Class A model
- DeepSeek R1 Distill 1.5B: logic checks and self-critique passes, not full R1 quality but earns its place
- Qwen2.5 Coder 1.5B: autocomplete and short scripts, a fast code helper, not a code reasoner
- Phi-3.5 Mini: long-doc chat at 2.4GB, the lightest realistic way on this list to work with long documents fully offline
Class D | Micro
- Qwen3.5 0.8B: yes/no classification, keyword routing, binary decisions only. Will hallucinate on anything harder
- Qwen3 0.6B: fastest token gen at ~98 tokens/s, trivial labels and toy experiments, don't trust it for anything that matters
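One way to put the tiers to work is a small router that sends each task to the smallest model the list above says can handle it. This is a minimal sketch, not a tested agent setup: the task labels (`classify`, `summarize`, and so on) and the `pick_model` helper are hypothetical, and the model names are plain strings copied from the list, not runtime-specific model tags.

```python
# Hypothetical task labels mapped to models named in the cheat sheet.
# The strings are display names, not tags for any particular runtime.
ROUTES = {
    "classify": "Qwen3.5 0.8B",            # Class D: yes/no and keyword routing
    "summarize": "Qwen3.5 2B",             # Class C: summaries, tagging, rewrites
    "autocomplete": "Qwen2.5 Coder 1.5B",  # Class C: short scripts
    "code": "Qwen2.5 Coder 7B",            # Class B: coding on larger files
    "math": "Phi-4 Mini Reasoning",        # Class B: logic problems and math
    "long_doc": "Llama 3.1 8B",            # Class A: long documents
    "draft": "Qwen3.5 9B",                 # Class A: drafting, long reasoning
}

# Assumption: unknown tasks fall back to the strongest general model.
DEFAULT_MODEL = "Qwen3.5 9B"


def pick_model(task: str) -> str:
    """Return the model name for a task label, or the default if unknown."""
    return ROUTES.get(task.strip().lower(), DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("classify", "code", "brainstorm"):
        print(f"{task} -> {pick_model(task)}")
```

Falling back to the Class A model for unknown tasks trades speed for safety, since the micro models are only trusted for binary decisions.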
Full breakdown in the cheat sheet ↓