A $1,499 AMD box can load a 235B-parameter model. That headline has 6,800 likes and everyone's celebrating the death of Nvidia's pricing moat.
But capacity isn't the bottleneck. Bandwidth is. And nobody's posting about that.
Here's what the real numbers say:
1/ THE MEMORY CAPACITY ILLUSION
Strix Halo's Ryzen AI Max 395 gives you 128GB unified memory, 96-110GB addressable as VRAM. An RTX 5090 gives you 32GB. On paper, this is a 3x memory advantage at half the price.
But memory capacity determines what you can load. Memory bandwidth determines how fast it generates tokens, because every decoded token has to stream the active weights out of memory. Strix Halo pushes roughly 256 GB/s. Apple's M3 Ultra does roughly 800 GB/s. The DGX Spark's GB10 does 273 GB/s with CUDA's optimized stack on top.
Independent benchmark (Qwen3.5-27B IQ4, same model, same workload):
- AMD Strix Halo (~$2,500): ~16 tok/s decode
- Apple Mac Studio M3 Ultra (~$5,000): ~40 tok/s decode
- NVIDIA DGX Spark (~$3,999): ~17 tok/s decode, but 1,939 tok/s prefill
Strix Halo wins on dollars per GB of model loaded. On decode it loses to Apple by 2.5x in raw speed (16 vs. 40 tok/s), and still by 1.25x per dollar (~$156 vs. ~$125 per tok/s of decode).
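To make the price-versus-speed comparison concrete, here is a minimal sketch that divides each box's approximate price by its decode rate, using only the figures from the benchmark list above. The metric name `usd_per_tok_s` is my own label, not something from the benchmark.

```python
# Approximate prices and decode rates, copied from the benchmark list above.
boxes = {
    "AMD Strix Halo": {"price_usd": 2500, "decode_tok_s": 16},
    "Apple Mac Studio M3 Ultra": {"price_usd": 5000, "decode_tok_s": 40},
    "NVIDIA DGX Spark": {"price_usd": 3999, "decode_tok_s": 17},
}


def usd_per_tok_s(price_usd, decode_tok_s):
    """Hardware dollars paid for each token/second of decode throughput."""
    return price_usd / decode_tok_s


cost = {name: usd_per_tok_s(**spec) for name, spec in boxes.items()}

# Cheapest decode throughput first.
for name, value in sorted(cost.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${value:.0f} per tok/s of decode")

# Raw speed gap versus price-adjusted gap, Apple against Strix Halo.
speed_gap = (
    boxes["Apple Mac Studio M3 Ultra"]["decode_tok_s"]
    / boxes["AMD Strix Halo"]["decode_tok_s"]
)
cost_gap = cost["AMD Strix Halo"] / cost["Apple Mac Studio M3 Ultra"]
print(f"speed gap: {speed_gap}x, cost gap: {cost_gap}x")
```

With these inputs the raw decode gap is 2.5x and the price-adjusted gap is 1.25x, so the two numbers are worth keeping separate. Prefill is ignored here, which understates the DGX Spark.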
2/ WHY MOE MODELS MAKE THE GAP INVISIBLE
The "gotcha" tweet everyone's sharing: someone running Qwen 3.6-35B-A3B at Q8, 131K context, 40-50 tok/s on Strix Halo. Sounds incredible. But that's an MoE with only 3B active parameters per forward pass. The 35B sits in memory, but only about 3B is read and computed for each token. Of course it's fast.
This is the dirty secret of the local AI hardware moment: MoE models make every box look good because they shrink the weights that have to be read per token. Run a dense 70B model where all parameters fire every token, and the bandwidth cliff appears. Strix Halo drops to single-digit tok/s on dense models that the M3 Ultra handles at usable speed.
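A back-of-envelope way to see the cliff: if decode is purely bandwidth-bound, the ceiling on tok/s is memory bandwidth divided by the bytes of active weights read per token. The sketch below is a rough estimate under that assumption, not a benchmark. The bits-per-weight values are round-number assumptions (real Q8 and 4-bit quants carry some overhead), and it ignores KV-cache reads, compute limits, and software overhead, so real numbers land below these ceilings.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Upper bound on decode tok/s if each token streams all active weights once."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    active_gb = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / active_gb


# Bandwidth figures quoted in this thread (GB/s).
BANDWIDTH_GB_S = {"Strix Halo": 256, "DGX Spark": 273, "M3 Ultra": 800}

# (active parameters in billions, ASSUMED bits per weight)
MODELS = {
    "MoE, 3B active, Q8": (3, 8.0),
    "dense 27B, ~4-bit": (27, 4.0),
    "dense 70B, ~4-bit": (70, 4.0),
}

for model, (params_b, bits) in MODELS.items():
    for box, bandwidth in BANDWIDTH_GB_S.items():
        ceiling = decode_ceiling_tok_s(bandwidth, params_b, bits)
        print(f"{model:>20} on {box:<10}: <= {ceiling:5.1f} tok/s")
```

Under these assumptions a dense 70B tops out in the single digits on Strix Halo and in the low 20s on the M3 Ultra, while a 3B-active MoE has a ceiling far above what anyone reports, which is why MoE demos hide the gap.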
The capacity-versus-bandwidth gap isn't a spec sheet footnote. It's the difference between "I can technically load it" and "I can actually use it for production work."
3/ THE SOFTWARE STACK TAX
Every Strix Halo review includes a sentence that should worry you: "ROCm or Vulkan?" This isn't a preference question. It's an admission that the AMD software stack is fragmented enough that users must choose between two incomplete implementations, benchmark both, and hope one doesn't break on the next model they pull.
NVIDIA's CUDA isn't faster because it's magic. It's faster because it's predictable. You install it, it works, the numbers are reproducible. Apple's MLX reached the same reliability threshold in 18 months. AMD's ROCm has been "almost there" for five years.
The real TCO of a Strix Halo isn't $1,499 plus electricity. It's $1,499 plus the hours you spend in ROCm/Vulkan Discord channels debugging why llama.cpp segfaults on your quant config. That time has a price, and for consultants billing $150/hr, it eats the hardware savings fast.
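How fast is "fast"? A minimal sketch: divide the hardware price gap by the hourly rate to get the debugging hours that erase the savings. Comparing against the DGX Spark and Mac Studio prices quoted earlier is my choice of baseline, not something the $1,499 headline specifies.

```python
def breakeven_hours(cheaper_usd, pricier_usd, hourly_rate_usd):
    """Hours of unbilled debugging that cancel out the hardware price gap."""
    return (pricier_usd - cheaper_usd) / hourly_rate_usd


STRIX_HALO_USD = 1499   # headline price from this thread
HOURLY_RATE_USD = 150   # consultant rate from this thread

# Alternative boxes and approximate prices quoted earlier in the thread.
alternatives = {"NVIDIA DGX Spark": 3999, "Apple Mac Studio M3 Ultra": 5000}

for name, price in alternatives.items():
    hours = breakeven_hours(STRIX_HALO_USD, price, HOURLY_RATE_USD)
    print(f"vs {name}: savings gone after {hours:.1f} hours of debugging")
```

At $150/hr the gap to a DGX Spark is gone in under 17 hours, roughly two working days of driver and quant troubleshooting.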
4/ THE BUSINESS MODEL INSIGHT NOBODY'S FRAMING RIGHT
The most valuable tweet in this entire wave isn't the Lisa Su demo or the spec comparisons. It's the consultant who turned $2,800/month cloud bills into $8 electricity costs and watched consulting margins jump from 30% to 80-90%.
The pitch that closes deals isn't "it's cheaper." It's: "Your data physically lives in your office. Not OpenAI's, not mine." Lawyers, healthcare, finance — the clients who can't touch cloud AI — sign on that single sentence.
Local inference doesn't disrupt cloud AI pricing. It creates a new service category: data-sovereignty AI consulting, where the moat isn't model access (anyone can download Qwen) or hardware (anyone can buy a Strix Halo) but the workflow integration and the trust relationship behind it. The box is commodity. The integration is the product.
BUT HERE'S WHAT EVERYONE'S MISSING:
The Strix Halo narrative assumes hardware commoditization is the endgame. It's not. The next 18 months will be a software ecosystem war disguised as a hardware price war. AMD can match Nvidia on memory capacity today. It cannot match CUDA's developer experience without a multi-year ecosystem investment that no amount of $1,499 boxes can substitute for.
Apple understood this. That's why they built MLX instead of betting on raw specs. The M3 Ultra's 800 GB/s bandwidth matters, but MLX "just working" matters more for adoption.
The companies building local AI businesses on Strix Halo today are making a bet that AMD's software stack will mature faster than their patience runs out. Some will win that bet. Many will end up with $1,499 paperweights running Q4 quants at 8 tok/s, wondering why the demo looked so much better than reality.
The question isn't whether $1,499 can run a 235B model. It's whether the generation that grows up on local AI will accept "tinker with ROCm" as the price of sovereignty — or whether predictability wins over capacity every time, the same way it did when CUDA killed OpenCL a decade ago.
History doesn't repeat. But it benchmarks.