A $1,499 AMD box can load a 235B parameter model. That headline has 6,800 likes and everyone's celebrating the death of Nvidia's pricing moat.

But capacity isn't the bottleneck. Bandwidth is. And nobody's posting about that. Here's what the real numbers say:

1/ THE MEMORY CAPACITY ILLUSION

Strix Halo's Ryzen AI Max 395 gives you 128GB unified memory, 96-110GB addressable as VRAM. An RTX 5090 gives you 32GB. On paper, this is a 3x memory advantage at half the price.

But memory capacity determines what you can load. Memory bandwidth determines how fast it generates tokens. Strix Halo pushes roughly 256 GB/s. Apple's M3 Ultra does 800 GB/s. The DGX Spark's GB10 does 273 GB/s with CUDA's optimized stack on top.

Independent benchmark (Qwen3.5-27B IQ4, same model, same workload):
- AMD Strix Halo (~$2,500): ~16 tok/s decode
- Apple Mac Studio M3 Ultra (~$5,000): ~40 tok/s decode
- NVIDIA DGX Spark (~$3,999): ~17 tok/s decode, but 1,939 tok/s prefill

Strix Halo wins on $/token loaded. It loses on raw decode speed by a factor of 2.5x against Apple (16 vs. 40 tok/s).

2/ WHY MOE MODELS MAKE THE GAP INVISIBLE

The "gotcha" tweet everyone's sharing: someone running Qwen 3.6-35B-A3B at Q8, 131K context, 40-50 tok/s on Strix Halo. Sounds incredible. But that's an MoE with only 3B active parameters per forward pass. The 35B sits in memory, but only 3B gets computed. Of course it's fast.

This is the dirty secret of the local AI hardware moment: MoE models make every box look good because they minimize active computation. Run a dense 70B model where all parameters fire every token, and the bandwidth cliff appears. Strix Halo drops to single-digit tok/s on dense models that the M3 Ultra handles at usable speed.

The capacity-versus-bandwidth gap isn't a spec sheet footnote. It's the difference between "I can technically load it" and "I can actually use it for production work."

3/ THE SOFTWARE STACK TAX

Every Strix Halo review includes a sentence that should worry you: "ROCm or Vulkan?" This isn't a preference question. It's an admission that the AMD software stack is fragmented enough that users must choose between two incomplete implementations, benchmark both, and hope one doesn't break on the next model they pull.

NVIDIA's CUDA isn't faster because it's magic. It's faster because it's predictable. You install it, it works, the numbers are reproducible. Apple's MLX reached the same reliability threshold in 18 months. AMD's ROCm has been "almost there" for five years.

The real TCO of a Strix Halo isn't $1,499 plus electricity. It's $1,499 plus the hours you spend in ROCm/Vulkan Discord channels debugging why llama.cpp segfaults on your quant config. That time has a price, and for consultants billing $150/hr, it eats the hardware savings fast.

4/ THE BUSINESS MODEL INSIGHT NOBODY'S FRAMING RIGHT

The most valuable tweet in this entire wave isn't the Lisa Su demo or the spec comparisons. It's the consultant who turned $2,800/month cloud bills into $8 electricity costs and watched consulting margins jump from 30% to 80-90%.

The pitch that closes deals isn't "it's cheaper." It's: "Your data physically lives in your office. Not OpenAI's, not mine." Lawyers, healthcare, finance (the clients who can't touch cloud AI) sign on that single sentence.

Local inference doesn't disrupt cloud AI pricing. It creates a new service category: data-sovereignty AI consulting, where the moat isn't model access (anyone can download Qwen) or hardware (anyone can buy a Strix Halo) but the workflow integration and the trust relationship. The box is commodity. The integration is the product.

BUT HERE'S WHAT EVERYONE'S MISSING:

The Strix Halo narrative assumes hardware commoditization is the endgame. It's not. The next 18 months will be a software ecosystem war disguised as a hardware price war. AMD can match Nvidia on memory capacity today. It cannot match CUDA's developer experience without a multi-year ecosystem investment that no amount of $1,499 boxes can substitute for.

Apple understood this. That's why they built MLX instead of betting on raw specs. The M3 Ultra's 800 GB/s bandwidth matters, but MLX "just working" matters more for adoption.

The companies building local AI businesses on Strix Halo today are making a bet that AMD's software stack will mature faster than their patience runs out. Some will win that bet. Many will end up with $1,499 paperweights running Q4 quants at 8 tok/s, wondering why the demo looked so much better than reality.

The question isn't whether $1,499 can run a 235B model. It's whether the generation that grows up on local AI will accept "tinker with ROCm" as the price of sovereignty, or whether predictability wins over capacity every time, the same way it did when CUDA killed OpenCL a decade ago.

History doesn't repeat. But it benchmarks.
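The capacity-versus-bandwidth argument in the thread above can be checked with back-of-envelope arithmetic. This is a rough sketch, not a benchmark. It assumes decode is purely memory-bound, that every active weight is read once per generated token, and that KV-cache traffic and compute are ignored. The bits-per-weight figures (about 4.25 for a 4-bit quant, about 8.5 for Q8) are assumptions, and `decode_ceiling_tok_s` is a hypothetical helper name.

```python
# Back-of-envelope decode ceiling: tok/s <= memory bandwidth / bytes read per token.
# Assumption: decode is memory-bound and each active weight is read once per token.

def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode speed for a given bandwidth and active model size."""
    # Billions of parameters * bits / 8 = gigabytes read per generated token.
    gb_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_per_token

# Bandwidth figures (GB/s) as quoted in the thread.
BOXES = {"Strix Halo": 256.0, "DGX Spark": 273.0, "M3 Ultra": 800.0}

# (active parameters in billions, assumed bits per weight)
MODELS = {
    "dense 27B @ ~4.25 bpw": (27.0, 4.25),
    "dense 70B @ ~4.25 bpw": (70.0, 4.25),
    "MoE, 3B active @ ~8.5 bpw": (3.0, 8.5),
}

for model, (params_b, bpw) in MODELS.items():
    for box, bandwidth in BOXES.items():
        ceiling = decode_ceiling_tok_s(bandwidth, params_b, bpw)
        print(f"{model:28s} {box:11s} <= {ceiling:6.1f} tok/s")
```

Under these assumptions the dense 27B ceiling on Strix Halo comes out just under 18 tok/s, in line with the ~16 tok/s quoted, and a dense 70B model falls below 7 tok/s. The MoE case reads only about 3B weights per token, so its ceiling sits far higher on every box.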
Replying to @reprompting
Undoubtedly! Have you also tried OpenCL, kernel matrix multiplication offloading on GPU?
OpenCL and WebGL are outside gov control. CUDA and WebGPU are not. They cannot unplug the first two. If they try, there's basically an army of gamers who will get pissed off that games break.
Studying OpenCL. Damn, I'm bad at matrices and have kept running away from 3D, but I guess there's no escaping it.
Try to optimize OpenCL kernels with Opus ☠️
Writing OpenCL without @-Binding is way too much of a hassle.
Adéníyì retweeted
Vortex: OpenCL Compatible RISC-V GPGPU vortex.cc.gatech.edu/

Replying to @bindureddy
Build on OpenCL instead of CUDA. Build on WebGL instead of WebGPU. Or what happened to Anthropic can happen to ur tech. There are things inside CUDA and WebGPU that are not safe for a free world to build on.
LLM perf as of today with updated drivers, llama.cpp on snapdragon x2ee, 48GB @ 228GB/s:
- Qwen3.6-35B-A3B (MXFP4, ~20 GB): ~190 t/s prefill, ~31 t/s decode (OpenCL)
- Qwen3.6-27B (Q4_0 MTP, ~16 GB): ~50 t/s prefill, ~12 t/s decode (OpenCL)
- Qualcomm's Qwen3-4B w4a16: PP ~2167 t/s, TG ~29 t/s (my own NPU engine)
I was looking into the Minisforum M2, and it turns out the 356H isn't a straightforward successor to the 285H 🐸💦 On Geekbench 6's OpenCL test it scores only about half of what the 285H does.
Jonathan Leiva 🇨🇱#Kaiser2030 retweeted
Someone set loose two AI agents with $1,000 each and 48 hours to trade on Polymarket. Claude: 1,322% to $14,216 OpenClaw: liquidated to zero in under 48h.
Apparently, when you do SmartRender on the GPU you get passed an OpenCL or CUDA handler, so with wgpu it ends up being a pseudo-GPU path.
Jun 12
Processing speed in the OpenCL environment has doubled 💦💦
ggml-org/llama.cpp b9603 shipped OpenCL kernels for q5_0/q5_1 on Adreno GPUs. Edge inference on Qualcomm hardware just became viable. axolotl-ai-cloud/axolotl also branched for DeepSeek V4 fine-tuning support.