i'm running a 397 billion parameter model on an amd ai max box that sits on my desk and pulls less power than a gaming laptop.
the model is Nex-N2-Pro, 397B-A17B, the open-weight release people are putting next to gpt-5.5 on coding. i have it quantized to IQ1_M, 1.75 bits per weight, roughly 90gb of weights loaded into the 128gb of unified memory on amd's strix halo igpu.
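back-of-envelope for that size, as a rough sketch: it assumes every weight costs exactly the 1.75 bit average, which real quant files don't do (some tensors are usually kept at higher precision, plus metadata), so the file on disk lands a bit above this figure.

```python
# Rough size estimate for a quantized model.
# Assumption: a flat average of `bits_per_weight` across all parameters.

def quantized_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 397e9      # 397B total parameters
BITS_PER_WEIGHT = 1.75    # nominal IQ1_M average
UNIFIED_MEMORY_GB = 128

size_gb = quantized_size_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
print(f"{size_gb:.1f} GB")                      # 86.8 GB
print(f"fits: {size_gb < UNIFIED_MEMORY_GB}")   # fits: True
```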
watch the gpu in this recording. it spikes, it sustains, it does not fall over. that is the part the spec sheets never show you: not just that a 400b model loads, but that an integrated graphics chip holds the load and generates token after token, stable, no crash, no thermal cliff.
and it is not a slideshow. roughly 18 tokens a second, faster than you can read. a frontier scale model producing usable output, fully local. no datacenter, no rented h100s, no api key, no permission.
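why that speed is plausible at all, as a rough sketch: it assumes the A17B in the name means about 17b active parameters per token, that the 1.75 bit average applies to those active weights, and that each active weight is read once per token. it ignores kv cache reads and everything else.

```python
# Rough memory-read estimate per generated token for an MoE model.
# Assumptions: ~17B active params per token, flat 1.75 bits per weight,
# each active weight read once per token; KV cache traffic is ignored.

ACTIVE_PARAMS = 17e9
TOTAL_PARAMS = 397e9
BITS_PER_WEIGHT = 1.75
TOKENS_PER_SECOND = 18

def read_rate_gb_s(params: float) -> float:
    """GB/s of weight reads needed to sustain TOKENS_PER_SECOND."""
    bytes_per_token = params * BITS_PER_WEIGHT / 8
    return bytes_per_token * TOKENS_PER_SECOND / 1e9

moe_rate = read_rate_gb_s(ACTIVE_PARAMS)
dense_rate = read_rate_gb_s(TOTAL_PARAMS)  # if all 397B were read per token

print(f"moe:   {moe_rate:.1f} GB/s")    # moe:   66.9 GB/s
print(f"dense: {dense_rate:.0f} GB/s")  # dense: 1563 GB/s
```

the gap between those two numbers is the whole reason a sparse model this size is usable on a desktop memory bus.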
three years ago a model this size meant a server room and a budget to match. tonight it is a quiet box on my desk.
this is the accessible tier almost nobody benchmarks honestly, and it is further along than the timeline thinks.
the full breakdown is coming: rocm vs vulkan on this chip, and this little amd box head to head against the nvidia equivalent.
stay tuned.
the framework strix halo i posted yesterday is fully alive now. ubuntu, rocm 7.2.1, llama.cpp built against both rocm and vulkan, the entire local ai stack running on amd's gfx1151 igpu with 128gb of unified memory.
and it's already loaded with three models:
>Qwen3.6-35B-A3B at Q8, the new moe, 37gb
>Nex-N2-mini at Q8, 37gb
>Nex-N2-Pro, the 397 billion parameter one, at IQ1_M, 91gb across five shards
that last one still doesn't feel real. a 397b model sitting on my desk in a box that sips power off a normal wall socket.
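quick sanity check on those three sizes against the 128gb pool, as a sketch: the sizes are the ones listed above, and the headroom figure ignores kv cache, context buffers, and whatever the os keeps for itself, so treat it as an upper bound.

```python
# Memory budget check for the three models listed above.
# Sizes are the file sizes from the post; headroom is an upper bound
# because KV cache, buffers, and OS usage are not accounted for.

UNIFIED_MEMORY_GB = 128

models_gb = {
    "Qwen3.6-35B-A3B Q8": 37,
    "Nex-N2-mini Q8": 37,
    "Nex-N2-Pro IQ1_M": 91,
}

total_gb = sum(models_gb.values())
all_resident_at_once = total_gb <= UNIFIED_MEMORY_GB
headroom_gb = {name: UNIFIED_MEMORY_GB - gb for name, gb in models_gb.items()}

print(f"total: {total_gb} GB, all at once: {all_resident_at_once}")
# total: 165 GB, all at once: False
for name, free in headroom_gb.items():
    print(f"{name}: {free} GB left when loaded alone")
```

so each one fits on its own, and the two 37gb models fit together, but all three cannot sit in memory at the same time.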
i've already run the first benchmarks and the numbers genuinely caught me off guard, both rocm vs vulkan on this chip and this little amd box against the nvidia equivalent. holding the full breakdown for its own post.
stay tuned. the accessible tier is way further along than the timeline thinks.