Testing the new Gemma 4 12B (QAT) vision and OCR capabilities locally with LM Studio.
# The setup:
- GPU: NVIDIA RTX 4060 (8GB VRAM)
- CPU: Intel i7
- Runner: LM Studio
- Config: 32k context, 38 layers offloaded, Flash Attention enabled
- Speed: ~14 tokens/sec decode throughput
# The test:
I gave it a screenshot of Google AI Studio.
Prompt: "clone this. give me a single html file"
# The result:
A solid one shot replication. It successfully mapped out the layout, recognized the UI text, and structured the divs correctly, with only minor differences from the original. Results available at the end of the video.
Quite capable for a 12B model running on budget consumer hardware. A gpu that costs only $300.
# Why the architecture under the hood is notable:
Unlike traditional models that rely on heavy, separate vision and audio encoders, Gemma 4 12B uses a unified, encoder free architecture.
It bypasses separate multi stage encoders.
Uses a 35M parameter vision embedder to project raw 48x48 pixel patches directly to the LLM hidden dimension.
Local multimodal development is becoming highly accessible on standard hardware.
If you've spun up Gemma 4 12B locally, what setup are you using and what kind of throughput are you seeing?
i just ran Google's brand new Unsloth Gemma4 12B dense GGUF on my RTX 4060 using llama.cpp CUDA 13.2
21 tokens per second. on a budget consumer GPU. locally.
no API. no cloud. no subscription.
and the benchmarks are absolutely cooked
# first let's talk architecture because this is genuinely different
every multimodal model you've used has a frozen vision encoder frozen audio encoder LLM backbone glued together
Gemma 4 12B is different
it's a single decoder only transformer. that's it. vision? raw 48×48 pixel patches → one matmul → projected directly into the LLM
audio? raw 16kHz signal sliced into 40ms frames → linear projection → same LLM input space
no encoder tax. no latency penalty. no fragmented memory
to put the encoder savings in perspective:
old Gemma 4 26B approach:
- 550M param vision encoder (frozen)
- 300M param audio encoder (frozen)
- LLM backbone
Gemma 4 12B:
- 35M param vision embedder (a single matmul)
- no audio encoder at all
- LLM backbone handles EVERYTHING 550M → 35M for vision alone. that's a 15x reduction
this is why the gemma-4-12b-it-Q4_K_M.gguf is just 6.6 GBs!!!
and it has 256K native context context
# Benchmarks:
AIME 2026 (math olympiad): 77.5%
GPQA Diamond (expert science): 78.8% LiveCodeBench v6 (real code): 72%
Codeforces ELO: 1659
MMLU Pro: 77.2%
MATH-Vision: 79.7%
BigBench Extra Hard: 53%
inference → llama.cpp, LM Studio, vLLM, SGLang
llamacpp flags:
-m "gemma-4-12b-it-Q4_K_M.gguf" -ngl 99 -c 8000 -v --port 8080
Available on huggingface now! Link below