Jay.TL

Jay.TL

@JayTL00

19m

A $1,499 AMD box can load a 235B parameter model. That headline has 6,800 likes and everyone's celebrating the death of NVIDIA's pricing moat. But capacity isn't the bottleneck. Bandwidth is. And nobody's posting about that. Here's what the real numbers say:

1/ THE MEMORY CAPACITY ILLUSION

Strix Halo's Ryzen AI Max+ 395 gives you 128GB unified memory, 96-110GB addressable as VRAM. An RTX 5090 gives you 32GB. On paper, this is a 3x memory advantage at half the price.

But memory capacity determines what you can load. Memory bandwidth determines how fast it generates tokens. Strix Halo pushes roughly 256 GB/s. Apple's M3 Ultra does 800 GB/s. The DGX Spark's GB10 does 273 GB/s with CUDA's optimized stack on top.

Independent benchmark (Qwen3.5-27B IQ4, same model, same workload):
- AMD Strix Halo (~$2,500): ~16 tok/s decode
- Apple Mac Studio M3 Ultra (~$5,000): ~40 tok/s decode
- NVIDIA DGX Spark (~$3,999): ~17 tok/s decode, but 1,939 tok/s prefill

Strix Halo wins on $ per GB loaded. It loses on decode speed by a factor of 2.5x against Apple, and at these prices it still costs about 1.25x more per tok/s generated.

2/ WHY MOE MODELS MAKE THE GAP INVISIBLE

The "gotcha" tweet everyone's sharing: someone running Qwen3.6-35B-A3B at Q8, 131K context, 40-50 tok/s on Strix Halo. Sounds incredible. But that's an MoE with only 3B active parameters per forward pass. The 35B sits in memory, but only 3B gets computed. Of course it's fast.

This is the dirty secret of the local AI hardware moment: MoE models make every box look good because they minimize active computation. Run a dense 70B model where all parameters fire every token, and the bandwidth cliff appears. Strix Halo drops to single-digit tok/s on dense models that the M3 Ultra handles at usable speed.

The capacity-versus-bandwidth gap isn't a spec sheet footnote. It's the difference between "I can technically load it" and "I can actually use it for production work."

3/ THE SOFTWARE STACK TAX

Every Strix Halo review includes a sentence that should worry you: "ROCm or Vulkan?" This isn't a preference question. It's an admission that the AMD software stack is fragmented enough that users must choose between two incomplete implementations, benchmark both, and hope one doesn't break on the next model they pull.

NVIDIA's CUDA isn't faster because it's magic. It's faster because it's predictable. You install it, it works, the numbers are reproducible. Apple's MLX reached the same reliability threshold in 18 months. AMD's ROCm has been "almost there" for five years.

The real TCO of a Strix Halo isn't $1,499 plus electricity. It's $1,499 plus the hours you spend in ROCm/Vulkan Discord channels debugging why llama.cpp segfaults on your quant config. That time has a price, and for consultants billing $150/hr, it eats the hardware savings fast.

4/ THE BUSINESS MODEL INSIGHT NOBODY'S FRAMING RIGHT

The most valuable tweet in this entire wave isn't the Lisa Su demo or the spec comparisons. It's the consultant who turned $2,800/month cloud bills into $8 electricity costs and watched consulting margins jump from 30% to 80-90%.

The pitch that closes deals isn't "it's cheaper." It's: "Your data physically lives in your office. Not OpenAI's, not mine." Lawyers, healthcare, finance: the clients who can't touch cloud AI sign on that single sentence.

Local inference doesn't disrupt cloud AI pricing. It creates a new service category: data-sovereignty AI consulting, where the moat isn't model access (anyone can download Qwen) or hardware (anyone can buy a Strix Halo) but workflow integration and the trust relationship. The box is commodity. The integration is the product.

BUT HERE'S WHAT EVERYONE'S MISSING:

The Strix Halo narrative assumes hardware commoditization is the endgame. It's not. The next 18 months will be a software ecosystem war disguised as a hardware price war. AMD can match NVIDIA on memory capacity today. It cannot match CUDA's developer experience without a multi-year ecosystem investment that no amount of $1,499 boxes can substitute for.

Apple understood this. That's why they built MLX instead of betting on raw specs. The M3 Ultra's 800 GB/s bandwidth matters, but MLX "just working" matters more for adoption.

The companies building local AI businesses on Strix Halo today are making a bet that AMD's software stack will mature faster than their patience runs out. Some will win that bet. Many will end up with $1,499 paperweights running Q4 quants at 8 tok/s, wondering why the demo looked so much better than reality.

The question isn't whether $1,499 can run a 235B model. It's whether the generation that grows up on local AI will accept "tinker with ROCm" as the price of sovereignty, or whether predictability wins over capacity every time, the same way it did when CUDA killed OpenCL a decade ago.

History doesn't repeat. But it benchmarks.
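The bandwidth argument in 1/ can be checked with back-of-envelope arithmetic. What follows is a minimal sketch, not the thread author's method. It assumes decode is purely memory-bound (each generated token streams all weights from memory once) and that an IQ4-class quant averages about 4.25 bits per weight. Both are assumptions added here for illustration. Prices, bandwidths and measured tok/s are the ones quoted in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str
    price_usd: float              # price quoted in the thread
    bandwidth_gb_s: float         # memory bandwidth quoted in the thread
    measured_decode_tok_s: float  # decode speed quoted in the thread


# Assumption (not from the thread): IQ4-class quants average ~4.25 bits/weight.
ASSUMED_BITS_PER_WEIGHT = 4.25
PARAMS_BILLION = 27.0  # Qwen3.5-27B, treated as a dense model


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_gb: float) -> float:
    """Upper bound on decode speed if every token reads active_gb once."""
    return bandwidth_gb_s / active_gb


def usd_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Hardware dollars paid per token/second of decode throughput."""
    return price_usd / tok_s


BOXES = [
    Box("AMD Strix Halo", 2500.0, 256.0, 16.0),
    Box("Apple Mac Studio M3 Ultra", 5000.0, 800.0, 40.0),
    Box("NVIDIA DGX Spark", 3999.0, 273.0, 17.0),
]


def report() -> None:
    model_gb = weights_gb(PARAMS_BILLION, ASSUMED_BITS_PER_WEIGHT)
    for box in BOXES:
        ceiling = decode_ceiling_tok_s(box.bandwidth_gb_s, model_gb)
        cost = usd_per_tok_s(box.price_usd, box.measured_decode_tok_s)
        print(
            f"{box.name}: ceiling {ceiling:.1f} tok/s, "
            f"measured {box.measured_decode_tok_s:.0f} tok/s, "
            f"${cost:.0f} per tok/s"
        )


if __name__ == "__main__":
    report()
```

Under these assumptions the ceilings come out near 18, 56 and 19 tok/s, so the measured 16, 40 and 17 tok/s all sit within roughly 70-90% of what bandwidth alone allows. The cost column is where the 2.5x speed gap shrinks to about 1.25x in dollars per tok/s between Strix Halo and the M3 Ultra.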

@Sincerely, Stepper @lilo_spec

13h

Replying to @reprompting

Undoubtedly! Have you also tried offloading the matrix-multiplication kernels to the GPU with OpenCL?

Yukiharu YABUKI🍥 retweeted

matsuu

@matsuu

15h

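"There were cases where agents fell for phishing too." I suppose that means it's better to put some limits on how much you rely on agents. Things like asset management and automating chores look risky. / "Can AI agents fall for phishing scams too? US security firm tests with OpenClaw …" htn.to/3AMSsYNWbZ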

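Can AI agents fall for phishing scams too? US security firm tests with OpenClaw. The results...

AI agents are a hot topic these days. Some people are trying to streamline their work by letting agents that run in a local environment operate their PCs. But if an AI agent falls for a phishing scam, the consequences could be serious. According to a test report published on June 9 (local time) by US security firm Varonis, there were cases where agents fell for phishing too.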

itmedia.co.jp

Lambda Rick 🏴‍☠️/acc

@benrayfield

18h

Replying to @ThinkWiselyMatt @jrysana

OpenCL and WebGL are outside gov control. CUDA and WebGPU are not. They cannot unplug the first two. If they try, there's basically an army of gamers who will get pissed off that games break.

seed_value @30years_over

Jun 13

Studying OpenCL, damn it. I'm bad at matrices and have kept running away from 3D, but I guess there's no escaping it.

Krzysztof Gonia

@kgonia7

Jun 13

Replying to @TheOperatorPro @scaling01

Try to optimize OpenCL kernels with Opus ☠️

leohirano @_leohirano

Jun 13

Writing OpenCL without @-Binding is way too much of a hassle

Adéníyì retweeted

Oscar Broekema 🇳🇱 @obr2021

Jun 12

Vortex: OpenCL Compatible RISC-V GPGPU vortex.cc.gatech.edu/

Lambda Rick 🏴‍☠️/acc

@benrayfield

Jun 13

Replying to @bindureddy

Build on OpenCL instead of CUDA. Build on WebGL instead of WebGPU. Or what happened to Anthropic can happen to your tech. There are things inside CUDA and WebGPU that are not safe for a free world to build on.

StrongEngineer_

@hotschmoe

Jun 12

LLM perf as of today with updated drivers, llama.cpp on Snapdragon X2EE, 48 GB @ 228 GB/s:
- Qwen3.6-35B-A3B (MXFP4, ~20 GB): ~190 t/s prefill, ~31 t/s decode (OpenCL)
- Qwen3.6-27B (Q4_0 MTP, ~16 GB): ~50 t/s prefill, ~12 t/s decode (OpenCL)
- Qualcomm's Qwen3-4B w4a16: PP ~2167 t/s, TG ~29 t/s (my own NPU engine)
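The decode figures in this post can be compared with what 228 GB/s allows in principle. This is a rough sketch under stated assumptions, not the poster's own analysis. It assumes decode is memory-bound, that a dense model reads its whole file once per token, and that an MoE model reads only the share of the file matching its active parameters (3B of 35B here). It ignores shared layers, KV-cache traffic and any effect of MTP. The helper names are hypothetical.

```python
def active_gb_per_token(file_gb: float, total_params_b: float,
                        active_params_b: float) -> float:
    """Bytes read per token, assuming reads scale with active parameters."""
    return file_gb * active_params_b / total_params_b


def ceiling_tok_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Memory-bound upper limit on decode speed."""
    return bandwidth_gb_s / gb_per_token


def fraction_of_ceiling(measured_tok_s: float, ceiling: float) -> float:
    """How much of the bandwidth-limited ceiling a measurement reaches."""
    return measured_tok_s / ceiling


BANDWIDTH_GB_S = 228.0  # figure quoted in the post

# (label, file size GB, total params B, active params B, measured decode t/s)
RUNS = [
    ("Qwen3.6-27B dense, Q4_0", 16.0, 27.0, 27.0, 12.0),
    ("Qwen3.6-35B-A3B MoE, MXFP4", 20.0, 35.0, 3.0, 31.0),
]


def report() -> None:
    for label, file_gb, total_b, active_b, measured in RUNS:
        per_token = active_gb_per_token(file_gb, total_b, active_b)
        ceiling = ceiling_tok_s(BANDWIDTH_GB_S, per_token)
        share = fraction_of_ceiling(measured, ceiling)
        print(f"{label}: ceiling {ceiling:.1f} t/s, "
              f"measured {measured:.0f} t/s ({share:.0%} of ceiling)")


if __name__ == "__main__":
    report()
```

With these inputs the dense 27B run lands at roughly 84% of its bandwidth ceiling, while the MoE run reaches only about a quarter of its much higher ceiling. That suggests something other than raw weight bandwidth limits the MoE case, although the simplifications above make this a rough reading.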

重藤六🐸月2.5万PVガジェットブロガー

@shigetoroku1010

Jun 12

I was looking into the Minisforum M2, and it turns out the 356H isn't a straight successor to the 285H 🐸💦 In Geekbench 6 OpenCL, it apparently scores only about half of what the 285H does.

Jonathan Leiva 🇨🇱#Kaiser2030 retweeted

PolyBackTest

@polybacktest

Jun 12

Someone set loose two AI agents with $1,000 each and 48 hours to trade on Polymarket. Claude: up 1,322% to $14,216. OpenClaw: liquidated to zero in under 48h.

❱ ぽち。 @potistudio

Jun 12

When you do SmartRender on the GPU, you get handed an OpenCL or CUDA handle, so apparently with wgpu it ends up being a fake GPU path.

Mug @MugLab3

Jun 12

Processing speed in the OpenCL environment has doubled 💦💦

RazzReport @RazzReport

Jun 12

Replying to @RazzReport @vllm_project @LiteLLM

ggml-org/llama.cpp b9603 shipped OpenCL kernels for q5_0/q5_1 on Adreno GPUs. Edge inference on Qualcomm hardware just became viable. axolotl-ai-cloud/axolotl also branched for DeepSeek V4 fine-tuning support.
