The true answer is layered I think:
(1) Hardware economics diverge sharply. DeepSeek V4 is explicitly optimized for & runs on Huawei Ascend 950/950PR clusters with native CANN framework support and close co-design. This bypasses (or dramatically reduces) NVIDIA’s high ASPs and ~60-75% gross margins on GPUs. US providers remain heavily NVIDIA-dependent for peak performance and ecosystem maturity; even with custom silicon progress (Google TPU, Amazon Inferentia/Trainium, Microsoft Maia), most frontier API players carry that premium. Huawei Ascend is not yet universally superior in perf/watt or absolute throughput, but the effective cost per FLOP for DeepSeek’s workloads is structurally lower - partly domestic supply chain, partly national strategic pricing
(2) Architectural & systems-level inference superiority. DeepSeek’s MoE designs with extreme sparsity (reports of 27:1 activation ratios in discussion), Multi-head Latent Attention (MLA) slashing KV cache ~93%, speculative decoding, FP8, and custom redundant-expert deployment create a much lower theoretical and realized floor per token. These are not marginal tweaks; they are co-designed with the serving stack. US labs have strong caching and quantization, but fewer are pushing sparsity and KV-cache innovations at the same intensity across the board.
(3) Pricing as strategic investment, not pure margin play. DeepSeek (ecosystem-aligned) appears to treat API pricing partly as customer acquisition and data/feedback flywheel fuel - classic land-grab economics. US labs balance margin signaling to investors/LPs, R&D recovery, and willingness-to-pay segmentation. Developer discourse confirms the pricing feels “too good,” raising questions of short-term cash burn vs long-term moat building
It is both hardware/stack asymmetry & genuine optimization excellence, enabled by China’s full-stack sovereignty push. US players face real cost pressures and ecosystem lock-in; DeepSeek exploits the opening with talent density in systems programming and a willingness to price for adoption.
This is not sustainable purely on subsidies forever, but it is durable enough to bifurcate the market.
Either the Americans are lying about their inference margins, or Deepseek’s GPU util team is full of actual demigods. Which one is it?