champteta

champteta

5,175 Photos and videos

Tweets

Pinned Tweet

champteta @musicncode

Jan 29

0:25

18,718

AlexAImaginator

champteta retweeted

AlexAImaginator

@TraffAlex

11h

🖥️ Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner — no Docker, no Python env, no cloud. ━━━ 8-16GB VRAM ━━━ 🔹 Gemma 4-12B (Google) • Smartest model in this size class — competes with stuff 2× bigger • Unsloth's MTP GGUFs: 162 tok/s vs 52 tok/s normal (3× speedup) • Minimum 8GB VRAM recommended for Q4_K_M quant • GGUF → huggingface.co/unsloth/gemma… 🔹 LFM2.5-8B-A1B (LiquidAI) • Hybrid MoE, only 1B active params — absurdly fast for its size • Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget • GGUF → huggingface.co/LiquidAI/LFM2… ━━━ 16-32GB VRAM ━━━ 🔹 Qwen3.6-27B (Qwen) • Scored 1.00 on tool-efficiency benchmarks — best local agent available • 40 deterministic tasks, 32k/128k context needle tests — all passed • GGUF → huggingface.co/unsloth/Qwen3… • MTP version (faster) → huggingface.co/unsloth/Qwen3… 🔹 Qwopus3.6-27B-v2 (Jackrong) • Best quantization of Qwen3.6-27B — topped 5 agent & coding benchmarks (1200 samples) • If you're running Q4, this is the one to grab • GGUF → huggingface.co/Jackrong/Qwop… • MTP version → huggingface.co/Jackrong/Qwop… 🔹 Gemma 4-31B QAT (Google/Unsloth) • QAT variant with MTP draft head: 76-125 tok/s (1.67× speedup) • Excellent for multi-agent / subagent workflows • GGUF → huggingface.co/unsloth/gemma… 🔹 Nex-N2-Mini (Nex AGI) • Post-train of Qwen3.5-35B-A3B — MoE with only 3B active params • Fits on 16GB VRAM, overflow loads from system RAM • Adaptive thinking saves ~20% tokens with no quality loss • For deep multi-step reasoning, nothing in this size comes close • GGUF → huggingface.co/sjakek/Nex-N2… ━━━ Quick Picks ━━━ • 16GB all-rounder → Gemma 4-12B with MTP GGUFs • 32GB all-rounder → Qwen3.6-27B / Qwopus-v2 • Agents & tool use → Qwen3.6-27B or Qwopus Q4 • Deep reasoning → Nex-N2-Mini (MoE, fits 16GB ) • Tight budget → LFM2.5-8B-A1B • Cheapest full build: 1× used RTX 3090 (24GB) rest of PC ≈ $1000-1500 ━━━ Setup on Windows ━━━ 1. Download llama.cpp → github.com/ggml-org/llama.cp… (latest .zip) 2. Extract to any folder (e.g. C:\llama.cpp) 3. Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance) 4. Run one of the commands below depending on your hardware ━━━ Launch Commands ━━━ SINGLE GPU — Standard model (no MTP): llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja SINGLE GPU — MTP model (faster inference): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU — Split across two cards: llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ --tensor-split 0.55,0.45 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU MTP Vision (multimodal): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ --tensor-split 0.60,0.40 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja ^ --mmproj C:\models\mmproj-F16.gguf ━━━ Parameter Breakdown ━━━ -m <path> Path to your .gguf model file. Change this to wherever you downloaded it. --ctx-size 180000 Context window in tokens. 180k = huge context for long conversations or big codebases. Reduce to 32768 or 65536 if you don't need long context — uses less VRAM. --flash-attn on Flash Attention — dramatically speeds up inference and reduces VRAM usage. Works on RTX 30xx/40xx/50xx. Always enable this. --cache-type-k q4_0 / --cache-type-v q4_0 Quantizes the KV cache (key/value attention cache) to 4-bit. This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory. Quality impact is minimal — this is a free performance win. --batch-size 1024 / --ubatch-size 512 batch-size = how many tokens are processed in one forward pass (throughput). ubatch-size = micro-batch actually sent to the GPU per step. Higher = faster prompt processing but needs more VRAM. If you run out of VRAM, lower these (e.g. 512/256). -ngl 100 Number of layers to offload to GPU. 100 = all layers on GPU (full offload). This is what you want if the model fits in your VRAM. If it doesn't fit, reduce this (e.g. -ngl 40) — remaining layers run on CPU/RAM. --tensor-split 0.55,0.45 How to split model layers across multiple GPUs. Values are ratios. 0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%. Adjust based on your VRAM — give more to the card with more memory. Example: 0.70,0.30 for a 24GB 12GB setup. Not needed for single GPU setups. --main-gpu 0 Which GPU handles the batch computation (the "orchestrator"). Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers. Minor performance impact — usually just leave it at 0. -np 1 Number of parallel slots (concurrent requests). 1 = one user at a time. Increase to 2-4 if you want multiple clients connected simultaneously. Each extra slot uses additional VRAM for its own KV cache. --port 8080 Which port the server listens on. Change if port 8080 is busy. --jinja Enables Jinja2 template processing — required for proper chat formatting. Most modern models expect this. Always include it. --spec-type draft-mtp Enables Multi-Token Prediction (MTP) speculative decoding. Only works with MTP GGUF models (downloaded separately). The model predicts multiple tokens at once and verifies them — big speed boost. --spec-draft-n-max 3 How many tokens the MTP draft head proposes per step. 3 is a good default. Higher = potentially faster but more VRAM and may reduce quality. --mmproj <path> Path to the multimodal projector file (for vision models). Enables image understanding — paste screenshots into the web chat. Only needed if you want vision capabilities. Omit for text-only use. ━━━ Your Hardware → Your Command ━━━ Single GPU (8-24GB VRAM): Use the "Single GPU" command. Change -m to your model path. 8GB card → Gemma 4-12B Q4 or LFM2.5-8B 12GB card → Gemma 4-12B Q5/Q6 16GB card → Gemma 4-31B QAT Q4 or Nex-N2-Mini 24GB card → Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6 Dual GPU: Use the "Dual GPU" command. Adjust --tensor-split based on your VRAM ratio. 24GB 24GB → --tensor-split 0.50,0.50 24GB 12GB → --tensor-split 0.70,0.30 24GB 8GB → --tensor-split 0.75,0.25 Want speed? Use MTP versions of models with the "MTP" commands. Want vision? Add --mmproj with the projector file from the model's HuggingFace repo. 5. Once running, you get: • Web chat UI → http://localhost:8080 • OpenAI-compatible API → http://localhost:8080/v1 • Playground → http://localhost:8080/playground ━━━ Why /v1 API Is the Killer Feature ━━━ One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible — every tool that speaks OpenAI just works. No custom code, no glue layer. Works out of the box with: • IDEs: Cursor, Continue, Windsurf, Cline, Roo Code • CLI tools: aider, Open Interpreter, OpenCode • Frameworks: LangChain, LlamaIndex, LiteLLM • Any OpenAI SDK (Python, Node, Go, Rust) Why this beats cloud APIs: • 100% private — code never leaves your machine • $0 per token — no rate limits, no quotas, no surprise bills • Works fully offline • Zero telemetry, no training on your data • Swap models by dropping in a different .gguf — no app changes needed • Run 32k–128k context windows without burning money Good combos: • Cursor Qwopus-v2 → near-frontier quality, zero API cost • Continue Qwen3.6-27B → best local coding agent • aider Gemma 4-12B MTP → 162 tok/s, feels instant • OpenCode Nex-N2-Mini → deep reasoning on 16GB Set any OpenAI-compatible client to your local endpoint: set OPENAI_API_KEY=sk-dummy (any non-empty string works) set OPENAI_BASE_URL=http://localhost:8080/v1 # every OpenAI-compatible tool now hits your local GPU Shoutouts: @0xSero @rS_alonewolf @witcheer @UnslothAI @LottoLabs

unsloth/gemma-4-12b-it-GGUF · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

102

1,163

142,808

champteta

champteta @musicncode

it's enough for this wide eyed wanderer that we got this far

champteta

champteta @musicncode

there is a calm surrender to the rush of day

champteta

champteta @musicncode

hammad ke chakkar main apna openrouter balance exhaust kardia

champteta

champteta @musicncode

nothing left to lose means there is so much more to gain

champteta

champteta @musicncode

kia baat hai yaar open.spotify.com/track/1elGw…

Now We Are Free

Hans Zimmer, Klaus Badelt, Lisa Gerrard, Gavin Greenaway, The Lyndhurst Orchestra · Gladiator - Music From The Motion Picture · Song · 2000

open.spotify.com

117

champteta

champteta @musicncode

can i ask my next wife to use this for entry song rather than something local

104

champteta

champteta @musicncode

11h

look what they took from us

champteta @musicncode

11h

AI took my job but for a brief moment I had the perfect lazy vim setup for julia

champteta

champteta @musicncode

11h

kia yaad karadia open.spotify.com/track/3OJK0…

Flight

Hans Zimmer · Man of Steel (Original Motion Picture Soundtrack) [Deluxe Edition] · Song · 2013

open.spotify.com

champteta

champteta @musicncode

11h

AI took my job but for a brief moment I had the perfect lazy vim setup for julia

181

champteta

champteta @musicncode

12h

this format isn't it man everytime i try to watch a game i realize 8 teams who finish 3rd will be in the next round and it kills the joy.

champteta

champteta @musicncode

12h

i am on unprecedented levels of hopium right now so if you are not into that stuff it's time to block me.

123

champteta

champteta @musicncode

12h

when it gets dark you light the spark

champteta

champteta @musicncode

13h

Cloud computing now (set up ssh on old laptop that doesn't have a working screen)

champteta

champteta @musicncode

13h

Boht khoobsurat hawa chal rahi hai

george

champteta retweeted

george

@StokeyyG2

14h

Turkish fans were left OUTRAGED as they went to watch their World Cup game against Australia. Only to find out the venue were playing a PES game 6 minutes in…😭😭😭

0:43

199

984

33,016

4,651,180

champteta

champteta @musicncode

13h

Trent on loan

Brian

@Bri_an2

14h

All replaced ✅

North Bank Nadim

champteta retweeted

North Bank Nadim @NorthBankNadim

19h

To be fair, I’d be more concerned about how to tell your kids that you spent your adult life picking up your phone to type the word ‘Arsenal’ at any opportunity out of desperation and your life lacking any actual purpose or meaning.

SK🇧🇻

@sk_citeh

Jun 13

How do I tell to my kids that Qatar vs Switzerland pulled a bigger crowd than Arsenal at their own Emirates stadium when we played against them😭

549

10,549

champteta

champteta @musicncode

13h

Love it when old friend messages me that they were listening to a song and it reminded them of me.

135

WelBeast

champteta retweeted

WelBeast

@WelBeast

14h

So dating older women is the answer?

Formula 1

@F1

15h

HE’S DONE IT!!! 🤩 LEWIS HAMILTON WINS THE BARCELONA-CATALUNYA GRAND PRIX!!! 🏆🎉 #F1 #BarcelonaGP

340

1,113

11,204

391,890