Nemotron 3 Full Breakdown
With the help of Joey Conway from @NVIDIAAI, getting into the specifics of why Nemotron 3 is kind of a big deal
Biggest headlines with Nemotron 3: Hybrid Mamba-Transformer, Latent MoE, and MTP
Hybrid Mamba-Transformer goes straight at the attention mechanism, whose cost grows quadratically with sequence length, to make the overhead sub-quadratic. Rather than quantizing the KV cache or swapping out attention heads, NVIDIA chose to mix in Mamba-2 layers
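To make the sub-quadratic point concrete, here is a toy sketch. It is not Mamba-2 (which uses input-dependent parameters and a hardware-aware scan); it is just a scalar linear recurrence with made-up constants `a`, `b`, `c`, next to a count of how many token pairs causal attention has to score. All function names here are hypothetical.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One step per token and a fixed-size state, so work grows linearly
    with sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pair_count(n):
    """Causal attention scores each token against itself and every earlier token."""
    return n * (n + 1) // 2


def scan_step_count(n):
    """The recurrence above touches each token exactly once."""
    return n


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, attention_pair_count(n), scan_step_count(n))
```

The state `h` never grows with the sequence, while attention keeps a KV cache that grows with every token.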
Latent MoE pushes further on sparsity by down-projecting the hidden dimension for the expert computation, so there is less math and less memory movement between HBM and SRAM. The savings are big, and NVIDIA made a conscious choice to spend the surplus on more experts
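A back-of-the-envelope sketch of that trade, under stated assumptions: each expert is a plain two-matrix FFN, the down- and up-projections are shared and paid once per token, and every dimension below is a made-up illustration, not Nemotron 3's actual configuration.

```python
def standard_moe_macs(d_model, d_ff, top_k):
    """Per-token multiply-accumulates when each active expert maps
    d_model -> d_ff -> d_model."""
    return top_k * 2 * d_model * d_ff


def latent_moe_macs(d_model, d_latent, d_ff, top_k):
    """Per-token multiply-accumulates when experts run in a smaller latent space.

    Assumes one shared down-projection (d_model -> d_latent) and one shared
    up-projection (d_latent -> d_model), each paid once per token.
    """
    shared_proj = 2 * d_model * d_latent
    experts = top_k * 2 * d_latent * d_ff
    return shared_proj + experts


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the shape of the trade-off.
    base = standard_moe_macs(d_model=4096, d_ff=2048, top_k=4)
    same_k = latent_moe_macs(d_model=4096, d_latent=1024, d_ff=2048, top_k=4)
    more_k = latent_moe_macs(d_model=4096, d_latent=1024, d_ff=2048, top_k=14)
    print(base, same_k, more_k)
```

With these toy numbers, the latent version at the same `top_k` does well under half the math, and you can activate 14 experts instead of 4 for the same per-token budget. The same reasoning applies to weight bytes moved from HBM, since each expert's matrices shrink with `d_latent`.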
Finally, MTP, or multi-token prediction: the model is trained to predict several future tokens at once, which gives it a richer training signal, with the option to use those extra predictions for speculative decoding during inference
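The speculative decoding side can be sketched like this, assuming greedy decoding: the MTP heads draft a few tokens, and the main model keeps them only for as long as they match what it would have emitted itself. `accept_draft` and `toy_next_token` are hypothetical names, and a real implementation would verify the whole draft in a single batched forward pass rather than one token at a time.

```python
def accept_draft(draft_tokens, verify_next_token, context):
    """Greedy verification of drafted tokens.

    Keeps drafted tokens while they match the main model's own next-token
    choice. On the first mismatch, emits the main model's token and stops.
    """
    accepted = []
    ctx = list(context)
    for tok in draft_tokens:
        target = verify_next_token(ctx)
        if tok != target:
            accepted.append(target)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def toy_next_token(ctx):
    """Stand-in for the main model: always predicts last token + 1 (mod 10)."""
    return (ctx[-1] + 1) % 10


if __name__ == "__main__":
    # Draft gets the first two tokens right and the third wrong.
    print(accept_draft([4, 5, 9], toy_next_token, [1, 2, 3]))
```

Because every emitted token is one the main model would have picked anyway, the output matches plain greedy decoding; the win is producing several tokens per verification step when the draft is good.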
Oh, also the model adopts the new OpenMDW 1.1 License