NeilXbt

@neil_xbt

Cloud GPU instances cost $30,000/year for what the ThinkStation PGX runs locally with no recurring cost! 128GB unified memory. GB10 Grace Blackwell Superchip. 200-billion-parameter models running on a 1.2 kg device smaller than a Mac mini. The same silicon as the NVIDIA DGX Spark. The same CUDA ecosystem. vLLM, TensorRT, FlashInfer, and NVIDIA Container Toolkit all running natively. Every AI framework, optimization library, and model deployment tool is built for CUDA first. CPU and GPU share the same 128GB memory pool with no PCIe copying. Apple Silicon has more memory bandwidth: the Mac Studio M4 Ultra hits 800 GB/s versus the PGX's 273 GB/s. But none of the production-grade tooling runs on a Mac. The gap between developers running quantized models on Apple Silicon for local inference and developers running the full CUDA AI stack on a device that fits on their desk is not capability or cost. It is ecosystem access and the hardware that finally makes it personal. Read through this article to understand why the ThinkStation PGX is the one to have! Follow @neil_xbt for more local AI and hardware intelligence that tracks when the personal compute threshold shifts significantly.

leopardracer

@leopardracer

19h

x.com/i/article/206608595176…

MX3 Dev

@Mx3Dev

Replying to @Mx3Dev @vllm_project

Large Scale Serving
→ Fixed multi-node Ray data-parallel serving hang with multiple API servers
Build & CI
→ Stopped installing quarantined flashinfer-jit-cache in Docker
→ Normalized NIXL wheel install to fix CUDA import errors.

LuisV8

@itsLuisV8

Replying to @SpaceTimeViking @NVIDIAAI

Hey, wanted to ask u for help: running your Gemma-4-26B-A4B Uncensored NVFP4 DFlash on a DGX Spark GB10 🙏 Main model flies, but the DFlash drafter dies on `cutlass FP4 gemm failed to init on sm120/sm121`, so flex_attention asserts and I'm stuck on flash_attn (~62 tok/s). What **driver and flashinfer versions** are you running to get the cutlass-FP4 flex path stable on GB10? Chasing your 150. 🚀

mrgam retweeted

Manling Li

@ManlingLi_

Jun 12

Kernel Agent by @dogacel0 (in my Agent AI class) ranked #1 in the MLSys 2026 FlashInfer AI Kernel Generation Contest. @dogacel0 is continuing to build more efficient GPU kernels with AI agents. He will also give some talks on speculative decoding, coming up soon. Stay tuned!

Doğaç

@dogacel0

Jun 9

Testing Mythos for GPU kernel generation. I will test it on 3 kernels: DSA, GDN, and MoE routing. Let's see how it performs against Opus 4.7, which previously won the contest against humans in the DSA track.

Sakura Yuki

@sakurayukiai

Jun 13

Replying to @mr_r0b0t @vllm_project

modelopt_mixed FlashInfer is the real gold here. running mixed FP8/NVFP4 on Blackwell without pipeline faults means serving massive MoEs like DeepSeek V4 Pro on RTX 50-series is actually viable.

susun

@SuJinYan123

Jun 13

x.com/i/article/206577812338…

MX3 Dev

@Mx3Dev

Jun 13

Replying to @Mx3Dev @vllm_project

DeepSeek-V4 optimizes sparse metadata, adds TRTLLM-gen kernel, and detaches from torch.compile. Model Runner V2 now defaults for Llama and Mistral, gains FlashInfer sampler and CUDA graph breaks.

Harishkumar Pillai

@harishpillai30

Jun 13

LLM serving performance lab
Built a local vLLM serving baseline on RTX 5070 Ti:
- GPU/CUDA/PyTorch/vLLM environment check
- local Qwen2.5 smoke test
- documented a real FlashInfer/SM 12.x startup failure
- fixed it with a sampler workaround
- verified the OpenAI-compatible endpoint

Jetha Chan

@jetha

Jun 13

day 5 - did some benchmarking and yeah it doesn't make sense to move off triton for bf16 especially for gemma 3. for nvfp4 though it is flashinfer/fa2 or bust of course. vLLM ready to go, verifying SGLang now - once that's green will lodge PRs

Jetha Chan

@jetha

Jun 12

day 4 - took it from just 31B to the rest of the Gemma 4 ladder: E4B, 12B, 26B-A4B all serving full NVFP4 KV now (up to 3.6× vs bf16), plus Gemma 3 12B. also got Gemma 4 off the Triton fallback on consumer Blackwell entirely...

Katja Sirazitdinova @katjasrz

Jun 12

On June 24th I’ll be talking about FlashInfer kernels in JAX at AI Systems DevLabs in Sunnyvale. Hoping to meet many JAX developers at the event

Doğaç

@dogacel0

Jun 12

Replying to @elliotarledge

Wanna adopt mine to support kernelbench? I want to extend it to other formats. Currently it is flashinfer-style and claude-first. One run is about 8 hours, depletes 50% of 5hr credits with fable. github.com/Dogacel/auto-gpu-…

Winner 🏆 (Agent-only) MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track with an average speedup of 34.93x - Dogacel/auto-gpu-kernel

github.com
