The default frame for the last two years has been more gigawatts: bigger clusters, more power, more concrete poured. Stanford's @HazyResearch spent that same window measuring in the opposite direction, and came up with an intelligence-per-watt metric: task accuracy divided by mean power draw, measured per query on real workloads.
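To make the metric concrete, here is a minimal sketch of accuracy divided by mean power draw. The `QueryRun` record, the field names, and the choice to average power per query before averaging across queries are my assumptions for illustration, not the Hazy Research measurement harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QueryRun:
    """One measured query (hypothetical record layout)."""
    correct: bool                 # did the model answer this query correctly?
    power_samples_w: list[float]  # power readings in watts taken during the query


def intelligence_per_watt(runs: list[QueryRun]) -> float:
    """Task accuracy divided by mean power draw across the measured queries."""
    accuracy = mean(1.0 if r.correct else 0.0 for r in runs)
    # Assumption: average power within each query first, then across queries.
    mean_power_w = mean(mean(r.power_samples_w) for r in runs)
    return accuracy / mean_power_w
```

With one correct and one incorrect query that each average 20 W, accuracy is 0.5 and the metric is 0.5 / 20 = 0.025 per watt.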
Across their 2023 to 2025 measurement window, intelligence-per-watt improved 5.3x. The two sources of gain multiply: roughly 3.1x came from better models and 1.7x from better hardware (3.1 × 1.7 ≈ 5.3). On single-turn chat and reasoning, 88.7% of queries could be answered correctly by a local model under 20B active parameters. Local accelerators still trailed cloud silicon by 1.4 to 7.4x on the same workload, but a hybrid router that sends easy queries local and hard ones to the cloud cut energy, compute, and cost by 60 to 80% against a batched cloud baseline. The win is in the routing.
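A rough sketch of the routing idea, under loud assumptions: the difficulty score, the 0.5 threshold, and the per-query energy figures are all hypothetical placeholders, and this is not the router from the study. It only shows the shape of the decision and how savings against an all-cloud baseline would be computed.

```python
def route(difficulty: float, threshold: float = 0.5) -> str:
    """Send easy queries local and hard ones to the cloud.

    `difficulty` is a hypothetical score in [0, 1]; how to estimate it
    is the hard part and is not shown here.
    """
    return "local" if difficulty < threshold else "cloud"


def energy_saving(difficulties: list[float],
                  local_j: float,
                  cloud_j: float,
                  threshold: float = 0.5) -> float:
    """Fraction of energy saved versus sending every query to the cloud.

    `local_j` and `cloud_j` are assumed per-query energy costs in joules.
    """
    routed = sum(
        local_j if route(d, threshold) == "local" else cloud_j
        for d in difficulties
    )
    baseline = cloud_j * len(difficulties)
    return 1.0 - routed / baseline
```

For example, with made-up costs of 1 J local and 4 J cloud, routing two easy queries and one hard one spends 6 J instead of 12 J, a 50% saving. Real savings depend on how well the difficulty estimate tracks whether the local model actually gets the answer right.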
NVIDIA's DGX Spark put 128GB of unified memory and a petaFLOP at FP4 on a desktop, and open-weight families like Qwen3, gpt-oss, Gemma, and Granite now trail frontier cloud models by 6 to 12 months on most personal-AI tasks rather than years.
MoE decouples capacity from per-token compute, which works for batched cloud serving, but on a single-user device most experts sit cold across queries, so you pay in memory for capacity you rarely touch. The Hazy Research team argues local-first models should look different: dense, small active footprint, quantization-aware, trained with local serving as a real objective rather than an afterthought.
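The memory-versus-compute tradeoff can be sketched with back-of-envelope arithmetic. The model sizes below are hypothetical, and the 2 FLOPs per active parameter per token figure is a common rule of thumb for a forward pass, not a measurement.

```python
def weight_memory_gb(total_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold all weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params


# Hypothetical MoE: 100B total parameters, 10B active per token, 4-bit weights.
moe_mem = weight_memory_gb(100e9, 4)   # 50.0 GB resident in memory
moe_flops = flops_per_token(10e9)      # compute tracks the 10B active params

# Hypothetical dense model with the same 10B active parameters, 4-bit weights.
dense_mem = weight_memory_gb(10e9, 4)  # 5.0 GB resident in memory
dense_flops = flops_per_token(10e9)    # same per-token compute as the MoE
```

Under these assumptions both models cost the same compute per token, but the MoE needs ten times the memory, which is the capacity a single-user device pays for and rarely touches.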
Power, not chips, is the binding constraint on most AI buildout right now, and almost every public argument about it is denominated in gigawatts and $$$ instead of useful work per watt. A metric that ties accuracy to energy changes which number you push on. My read: one of the frontiers in compute is a smarter router that decides, query by query, whether local or cloud earns the watt. How many gigawatts we need isn't the only question in the power debate.