fractional cmo zima.media 3K clients served. building with local ai. let’s turn your business vibe and code it into an app

Joined September 2018
843 Photos and videos
Pinned Tweet
Day in the life of an AI strategy expert.
3
350
How is this legal Epstein, I mean @EpsonAmerica there’s no incentive to put your spyware on my network after connecting. Anyone else fed up with garbage printers?
4
Zima retweeted
🖥️ Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner — no Docker, no Python env, no cloud. ━━━ 8-16GB VRAM ━━━ 🔹 Gemma 4-12B (Google) • Smartest model in this size class — competes with stuff 2× bigger • Unsloth's MTP GGUFs: 162 tok/s vs 52 tok/s normal (3× speedup) • Minimum 8GB VRAM recommended for Q4_K_M quant • GGUF → huggingface.co/unsloth/gemma… 🔹 LFM2.5-8B-A1B (LiquidAI) • Hybrid MoE, only 1B active params — absurdly fast for its size • Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget • GGUF → huggingface.co/LiquidAI/LFM2… ━━━ 16-32GB VRAM ━━━ 🔹 Qwen3.6-27B (Qwen) • Scored 1.00 on tool-efficiency benchmarks — best local agent available • 40 deterministic tasks, 32k/128k context needle tests — all passed • GGUF → huggingface.co/unsloth/Qwen3… • MTP version (faster) → huggingface.co/unsloth/Qwen3… 🔹 Qwopus3.6-27B-v2 (Jackrong) • Best quantization of Qwen3.6-27B — topped 5 agent & coding benchmarks (1200 samples) • If you're running Q4, this is the one to grab • GGUF → huggingface.co/Jackrong/Qwop… • MTP version → huggingface.co/Jackrong/Qwop… 🔹 Gemma 4-31B QAT (Google/Unsloth) • QAT variant with MTP draft head: 76-125 tok/s (1.67× speedup) • Excellent for multi-agent / subagent workflows • GGUF → huggingface.co/unsloth/gemma… 🔹 Nex-N2-Mini (Nex AGI) • Post-train of Qwen3.5-35B-A3B — MoE with only 3B active params • Fits on 16GB VRAM, overflow loads from system RAM • Adaptive thinking saves ~20% tokens with no quality loss • For deep multi-step reasoning, nothing in this size comes close • GGUF → huggingface.co/sjakek/Nex-N2… ━━━ Quick Picks ━━━ • 16GB all-rounder → Gemma 4-12B with MTP GGUFs • 32GB all-rounder → Qwen3.6-27B / Qwopus-v2 • Agents & tool use → Qwen3.6-27B or Qwopus Q4 • Deep reasoning → Nex-N2-Mini (MoE, fits 16GB ) • Tight budget → LFM2.5-8B-A1B • Cheapest full build: 1× used RTX 3090 (24GB) rest of PC ≈ $1000-1500 ━━━ Setup on Windows ━━━ 1. Download llama.cpp → github.com/ggml-org/llama.cp… (latest .zip) 2. Extract to any folder (e.g. C:\llama.cpp) 3. Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance) 4. Run one of the commands below depending on your hardware ━━━ Launch Commands ━━━ SINGLE GPU — Standard model (no MTP): llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja SINGLE GPU — MTP model (faster inference): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU — Split across two cards: llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ --tensor-split 0.55,0.45 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU MTP Vision (multimodal): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ --tensor-split 0.60,0.40 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja ^ --mmproj C:\models\mmproj-F16.gguf ━━━ Parameter Breakdown ━━━ -m <path> Path to your .gguf model file. Change this to wherever you downloaded it. --ctx-size 180000 Context window in tokens. 180k = huge context for long conversations or big codebases. Reduce to 32768 or 65536 if you don't need long context — uses less VRAM. --flash-attn on Flash Attention — dramatically speeds up inference and reduces VRAM usage. Works on RTX 30xx/40xx/50xx. Always enable this. --cache-type-k q4_0 / --cache-type-v q4_0 Quantizes the KV cache (key/value attention cache) to 4-bit. This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory. Quality impact is minimal — this is a free performance win. --batch-size 1024 / --ubatch-size 512 batch-size = how many tokens are processed in one forward pass (throughput). ubatch-size = micro-batch actually sent to the GPU per step. Higher = faster prompt processing but needs more VRAM. If you run out of VRAM, lower these (e.g. 512/256). -ngl 100 Number of layers to offload to GPU. 100 = all layers on GPU (full offload). This is what you want if the model fits in your VRAM. If it doesn't fit, reduce this (e.g. -ngl 40) — remaining layers run on CPU/RAM. --tensor-split 0.55,0.45 How to split model layers across multiple GPUs. Values are ratios. 0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%. Adjust based on your VRAM — give more to the card with more memory. Example: 0.70,0.30 for a 24GB 12GB setup. Not needed for single GPU setups. --main-gpu 0 Which GPU handles the batch computation (the "orchestrator"). Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers. Minor performance impact — usually just leave it at 0. -np 1 Number of parallel slots (concurrent requests). 1 = one user at a time. Increase to 2-4 if you want multiple clients connected simultaneously. Each extra slot uses additional VRAM for its own KV cache. --port 8080 Which port the server listens on. Change if port 8080 is busy. --jinja Enables Jinja2 template processing — required for proper chat formatting. Most modern models expect this. Always include it. --spec-type draft-mtp Enables Multi-Token Prediction (MTP) speculative decoding. Only works with MTP GGUF models (downloaded separately). The model predicts multiple tokens at once and verifies them — big speed boost. --spec-draft-n-max 3 How many tokens the MTP draft head proposes per step. 3 is a good default. Higher = potentially faster but more VRAM and may reduce quality. --mmproj <path> Path to the multimodal projector file (for vision models). Enables image understanding — paste screenshots into the web chat. Only needed if you want vision capabilities. Omit for text-only use. ━━━ Your Hardware → Your Command ━━━ Single GPU (8-24GB VRAM): Use the "Single GPU" command. Change -m to your model path. 8GB card → Gemma 4-12B Q4 or LFM2.5-8B 12GB card → Gemma 4-12B Q5/Q6 16GB card → Gemma 4-31B QAT Q4 or Nex-N2-Mini 24GB card → Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6 Dual GPU: Use the "Dual GPU" command. Adjust --tensor-split based on your VRAM ratio. 24GB 24GB → --tensor-split 0.50,0.50 24GB 12GB → --tensor-split 0.70,0.30 24GB 8GB → --tensor-split 0.75,0.25 Want speed? Use MTP versions of models with the "MTP" commands. Want vision? Add --mmproj with the projector file from the model's HuggingFace repo. 5. Once running, you get: • Web chat UI → http://localhost:8080 • OpenAI-compatible API → http://localhost:8080/v1 • Playground → http://localhost:8080/playground ━━━ Why /v1 API Is the Killer Feature ━━━ One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible — every tool that speaks OpenAI just works. No custom code, no glue layer. Works out of the box with: • IDEs: Cursor, Continue, Windsurf, Cline, Roo Code • CLI tools: aider, Open Interpreter, OpenCode • Frameworks: LangChain, LlamaIndex, LiteLLM • Any OpenAI SDK (Python, Node, Go, Rust) Why this beats cloud APIs: • 100% private — code never leaves your machine • $0 per token — no rate limits, no quotas, no surprise bills • Works fully offline • Zero telemetry, no training on your data • Swap models by dropping in a different .gguf — no app changes needed • Run 32k–128k context windows without burning money Good combos: • Cursor Qwopus-v2 → near-frontier quality, zero API cost • Continue Qwen3.6-27B → best local coding agent • aider Gemma 4-12B MTP → 162 tok/s, feels instant • OpenCode Nex-N2-Mini → deep reasoning on 16GB Set any OpenAI-compatible client to your local endpoint: set OPENAI_API_KEY=sk-dummy (any non-empty string works) set OPENAI_BASE_URL=http://localhost:8080/v1 # every OpenAI-compatible tool now hits your local GPU Shoutouts: @0xSero @rS_alonewolf @witcheer @UnslothAI @LottoLabs
66
175
1,795
239,245
14
61
179
2,200
Zima retweeted
This guy created Linux and beat Microsoft to the best operating system He doesn’t mind being ruthlessly honest snd doing the dirty work because nothing mild ever wins Lesson in that for Opensource AI btw
18
17
297
13,352
Zima retweeted
Gemma 4 now runs 2x faster with MTP GGUFs! Run locally on just 6GB RAM. ⚡️ MTP enables Google Gemma 4 run ~1.4–2.2× faster with no accuracy loss. Gemma 4 12B MTP can run at 162 t/s vs. 52 t/s without MTP. 31B reaches 101 t/s. GGUFs Guide: unsloth.ai/docs/models/mtp
60
256
2,154
212,731
Zima retweeted
Introducing the Hermes Agent Profile Builder You can now build a complete profile in the dashboard with full control over identity/name/description, model/provider, built-in optional skills, skills-hub installs, and MCP servers in one easy flow
171
224
3,019
531,062
Zima retweeted
I was a technical co-founder at three startups. Went through YC, raised millions, one company exited for 9 figures, yada yada... Fable 5 is an entire product team in a box. CTO, VP of engineering, product manager, senior dev, UX expert, UI designer, even sales/marketing... It does it all. And at a much higher level than most humans. Around the clock, multiple threads running simultaneously. I see people saying that token-based billing is a sign the AI bubble is nearing a peak. I don't see an AI bubble. I see a salaried employee bubble. After using Fable 5 I can confidently say: I'd much rather spend $200K on tokens for Fable 5 than hire a single human that does one of those roles (and only on weekdays when they're feeling focused, motivated, etc.) If token billing means choosing between cutting AI usage vs. cutting employees... Every CFO knows which line item goes first. But this tool only produces value in the right hands. The wrong employee will run up a token bill with nothing to show for it. The right employee will turn tokens into bottom-line improvements. If you're an employee, focus on proving you know how to turn tokens into value for your company. OR relationship-max harder than ever. If you're an entrepreneur, you have an entire expert product, marketing, and sales team at your fingertips for a fraction of the cost of the human equivalent. Use it to help other people at scale and you will win big. As an investor... this keeps me very bullish on AI. If you have capital to allocate, you want to put it into an industry seeing exponential growth. And you want to take advantage of all the moments that people lose sight of the big picture. This genie's not going back in the bottle. In fact, it only just escaped.
28
22
429
29,561
Behind us, a partially burned American flag that my brother Legend snatched from a group of talibs in Kabul. In front of us, a We Fight Monsters flag that we took to the world’s most notorious island, little st James aka Epstein Island. Three men committed to finding truth, exposing injustice, and shining light into the dark places. If you’re unfamiliar with my story, check out Shawn Ryan Show episode 178 - if you’re unfamiliar with the story of this flag and the island, and why it matters today - check out Shawn Ryan Show episode 311. Either way - now is the time more than any other in American history for us to quit buying into the manufactured division that’s keeping us separated and start coming together to deal with the ACTUAL problems in our great nation. These are OUR streets, OUR neighborhoods and OUR children - and it’s time we take them back.
10
31
140
1,989
Zima retweeted
This is what the UK spyware proposal means. There must be government spyware on every mobile device. It shall watch everything that happens, including always watching the screen, looking for things the government disapproves of. When anything is flagged by the software as something the government doesn't like, the software must block it from being sent or displayed (in realtime). The user of the device must not be able to shut this watching and blocking off. The only way to shut it off would be to ask the government or its proxies to do so for you, at their discretion. Therefore the whole device must be locked down. Administrator rights and the decision of what software or operating system to run or not to run must be taken from the owner/user and handed to the government and its proxies. Apple and Google are themselves working hard to lock down the devices they are involved in to shut out competition and establish a duopoly. The UK government says it is "working closely" with Apple and Google and currently they synchronise and coordinate their communication on this subject. The UK government is now proposing to mandate what would otherwise be illegal anti-competitive practices. @GrapheneOS on the Apple and Google duopoly: x.com/GrapheneOS/status/2053… Statement from @signalapp x.com/signalapp/status/20640… @ReclaimTheNetHQ on the state spyware: reclaimthenet.org/starmer-ca… The government announcement: gov.uk/government/news/new-p…

Our statement on the UK government’s demand that all content on all devices sold or used in the country be scanned, on the presumption of nudity, using a dystopian combination of age verification and content scanning. This proposal will not safeguard children. It endangers us all. signal.org/blog/pdfs/2026-06…
241
3,419
14,225
1,522,102
Zima retweeted
Problem Solved….
Community note
The video shows a solved Rubik's cube scrambled using a specific sequence of moves then returned to solved by reversing them; it does not solve randomly scrambled cubes. boatos.org/english/is-it-… ruwix.com/the-rubiks-cub…
147
3,925
37,387
5,113,719
Zima retweeted
Local AI hardware = capacity × bandwidth × software stack - Capacity tells you what fits - Bandwidth tells you how hard the box can breathe - The software stack tells you how much of the spec sheet you can actually cash out. Hardware by Memory Bandwidth - Mac Studio M3 Ultra: up to 512GB @ 819 GB/s - RTX PRO 6000 Blackwell: 96GB @ 1792 GB/s - RTX 5090: 32GB @ 1792 GB/s - RTX 4090: 24GB @ 1008 GB/s - RX 7900 XTX: 24GB @ 960 GB/s - Radeon PRO W7900: 48GB @ 864 GB/s - AMD Radeon AI PRO R9700: 32GB @ 640 GB/s - Intel Arc Pro B65: 32GB @ ~608 GB/s - Tenstorrent Wormhole n300: 24GB @ 576 GB/s - Tenstorrent Blackhole p150: 32GB @ 512 GB/s 800G - MacBook Pro M5 Max: 460-614 GB/s - MacBook Pro M5 Pro: 307 GB/s - DGX Spark: 128GB @ 273 GB/s (coherent CUDA) - Mac mini M4 Pro: 273 GB/s - Ryzen AI Max / Strix Halo: ~256 GB/s (~96GB usable GPU) - MacBook Air M5: 153 GB/s - Snapdragon X2 Elite: 152-228 GB/s - Intel Lunar Lake: 136 GB/s - Snapdragon X Elite: 135 GB/s - Mac mini M4: 120 GB/s - Arc Pro B60: 24GB @ ~456 GB/s Verdict - GPUs are still the bandwidth kings - Apple wins: stupid amounts of memory, don’t want to shard across GPUs - Apple loses: when raw tokens/sec & concurrency matter more - DGX Spark: coherent memory NVIDIA stack - Strix Halo / Ryzen AI Max: first real x86 unified-memory contender - Tenstorrent: fully OSS stack, excited to see this mature Fitting ≠ serving Even if it fits, you still pay for - bandwidth during decode - KV cache growth - dequantization - batching concurrency - scheduler quality - framework overhead The only mental model that matters: 1. What must fit? 2. What bandwidth tier do I need? 3. What software stack can actually deliver it? In short: - NVIDIA → fastest raw speed - Apple Studio M3 Ultra → biggest one-box memory - Strix Halo → first real x86 unified - DGX Spark → coherent NVIDIA dev appliance - AMD / Intel Arc → rising alternatives - Tenstorrent → fully opensource stack Do ask: “which bottleneck am I buying?” Not: “which hardware is best?”
102
187
1,243
130,124
Zima retweeted
Steve Jobs's bedroom when he was still living with his parents, 1976. Apple 1 boxes stored on the right.
20
33
302
10,669
Zima retweeted
REVEALED: Audi R8 Successor! Meet the Audi Nuvolari - a brand-new, mid-engined 1,000hp hybrid supercar! 🔥 4-litre twin-turbo V8 and 3 motors! 🔥 1000hp & 730Nm! 🔥 0- 60mph in 2.6 seconds! 🔥 Full carbon exterior! 🔥 £500,000! 🔥 Limited to 499 units! Audi has the Temerario in its sights! But would you pick the Nuvolari over the Lambo? 🤔 Let us know in the comments!
371
173
2,406
2,186,231
Zima retweeted
Audi has officially unveiled its first hybrid supercar called the Audi Nuvolari. • Starting price: $697,000 • 0-60mph: 2.5s • 978 hp • 499 units will be built • Fastest production vehicle in Audi history • Top speed: Over 217 mph (350 km/h) • New Audi signature paint color: Titanium • Three electric motors each produce 110 kW • Two oil-cooled electric motors mounted on the front axle • 7.3 kWh lithium-ion battery • Fully electric driving capability in E-Hybrid mode • Hybrid powertrain combines a 4-liter twin-turbo V8 with three axial-flux electric motors • Formula 1-derived prepreg autoclave carbon manufacturing process • Brake-by-wire braking system • Braking system capable of absorbing up to 2.8 megawatts of energy Deliveries begin in the first half of 2027. More photos in the thread below:
689
324
4,478
665,343
Zima retweeted
Replying to @googlegemma
Thank you Google Deepmind for constantly releasing open models! 🌟 We made Dynamic GGUFs so you can run Gemma 4 12B more efficiently: huggingface.co/unsloth/gemma…
24
59
971
42,072
Zima retweeted
Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license. Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇
404
1,790
12,366
3,179,547
RT @Hesamation: Mythos, regulate something. make no mistakes.
602
Zima retweeted
Make sure your business is ready to accept AI agents as customers!
166
214
1,541
206,150
Zima retweeted
It's finally happening, we're getting an EU bicycle highway system. I'm looking forward to my short 47 hour ride from Frankfurt to Paris. This is the best day of my life. Have a wonderful day, Wolfgang
The EU is making it easier than ever to ditch the car keys and hop on your bike. The upcoming World Bicycle Day on Wednesday is the perfect occasion to do so. With our European Declaration on Cycling, we are committed to boosting this experience for you:
25
17
349
16,415
Zima retweeted
I agree with this. When everyone is spinning GPT and Claude for answers. Thinking differently and maturely is the underrated asset.
6
6
38
1,439