The 128GB number is the part everyone's repeating. The number that actually decides whether you'd use this box is 256.
That's the memory bandwidth, in GB/s. A 5090 moves about 1,800. An H100 moves 3,350. Local token speed is bound by how fast weights get read out of memory, and this APU reads them at roughly a seventh the rate of a 5090.
So the headline leans on a trick it doesn't mention. Qwen3 235B runs here at about 11 tokens a second, which sounds impossible on 256 GB/s until you notice the model is mixture-of-experts: 235B total, ~22B active per token. The chip only reads the ~22B it needs for each token. The "235B" on the slide is a storage stat. The 22B is the speed stat.
Run something dense and the trick drops. Llama 3.3 70B, where every parameter fires on every token, does about 5 tokens a second on the same box. Readable. Not something you sit in front of for eight hours.
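A rough way to sanity-check those two speeds, as a sketch and not a benchmark: if decoding is purely bandwidth-bound, the ceiling on tokens per second is memory bandwidth divided by the bytes of weights read per token. The 4-bit quantization (0.5 bytes per parameter) is an assumption, not something the pitch states, and the function name is made up for illustration.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_b, bytes_per_param=0.5):
    """Upper bound on decode speed if every token must read all active weights once.

    bandwidth_gb_s:  memory bandwidth in GB/s (256 for this APU, per the post)
    active_params_b: parameters touched per token, in billions
    bytes_per_param: ASSUMED 4-bit quantization, i.e. 0.5 bytes per parameter
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Qwen3 235B is mixture-of-experts: ~22B active per token
print(round(decode_ceiling_tok_s(256, 22), 1))  # 23.3

# Llama 3.3 70B is dense: all 70B fire on every token
print(round(decode_ceiling_tok_s(256, 70), 1))  # 7.3
```

Under those assumptions the ceilings land near 23 and 7 tokens a second. The reported 11 and 5 sit under them, which is what you'd expect once real-world overhead is counted. Swap in a different quantization and the ceilings move, but the MoE model still comes out ahead of the dense one.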
That 3x win over a 5080 lives in the same place. A 5080 has 16GB of VRAM and can't hold a 235B model at all, so it spills to system memory and crawls. The APU wins that matchup on capacity. Change the test to a model that fits in 16GB and the 5080 walks away on speed.
Now look at the workload in the pitch: point Claude Code at localhost. Agentic coding is the worst possible fit for a bandwidth-starved box. One task is dozens of sequential model round trips, each waiting on the last, each streaming at 11 tokens a second. The exact use case used to sell the $5,280 in savings is the one that exposes the bottleneck.
The same Qwen3 235B runs at 1,500 tokens a second on a Cerebras wafer. That's the real comparison: 1,500 versus 11, and how much of your day goes to watching the slow one think.
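To put that gap in wall-clock terms, here is a back-of-envelope sketch. The task shape is hypothetical: 30 sequential round trips at 400 generated tokens each is an assumption picked for illustration, not a measurement, and the estimate ignores prompt processing, tool execution and network time.

```python
def task_decode_seconds(round_trips, tokens_per_trip, tok_per_s):
    """Time spent purely generating tokens when each round trip waits on the last."""
    return round_trips * tokens_per_trip / tok_per_s


ROUND_TRIPS = 30       # ASSUMED: sequential model calls in one agentic task
TOKENS_PER_TRIP = 400  # ASSUMED: tokens generated per call

local = task_decode_seconds(ROUND_TRIPS, TOKENS_PER_TRIP, 11)       # this box
wafer = task_decode_seconds(ROUND_TRIPS, TOKENS_PER_TRIP, 1500)     # Cerebras figure

print(round(local))  # 1091 seconds, about 18 minutes
print(wafer)         # 8.0 seconds
```

Because the calls are sequential, the time simply adds up: the same hypothetical task is about 18 minutes of generation on the box and 8 seconds on the wafer.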
The box is a real deal for what it is. A quiet, private, $1,800 machine that runs big open models at conversational speed for one person. The frontier stack it's sold as replacing answers at 50 to 100 tokens a second with quality no open 235B matches yet. It pays for itself in 9 months only if your time is worth nothing per token.
AMD CEO LISA SU HELD A MINI PC ON STAGE THAT RUNS A 235B MODEL AND REPLACES YOUR $440/MONTH AI STACK
amd's ryzen ai max+ 395 is the first x86 chip that runs a 200 billion parameter model on one piece of silicon. cpu and gpu share 128gb of unified memory, no separate graphics card needed
the gmktec evo-x2 runs qwen3 235b fully, deepseek v3 comfortably and llama 3.3 70b with headroom. on linux you get 110gb of usable vram out of 128gb
amd claimed the chip beat an nvidia rtx 5080 by more than 3x on deepseek r1 inference. a lunchbox sized pc outrunning a $1,000 discrete gpu on a real ai workload
a heavy ai user pays $200 for claude code max, $200 for chatgpt pro, $20 for cursor and $20 for gemini. that's $5,280 a year and the box pays itself off in 9 to 10 months
install ollama, pull the model, point claude code at localhost. same interface, nothing leaves the machine, nothing costs per request