As part of Nemotron, we're releasing a new Math dataset, made by rendering webpages using Lynx and then using an LLM to rewrite the result into LaTeX.
Our models got much better at math when we started using this dataset. We hope it's helpful to the community.
We just released Nemotron-CC-Math
Equations on the web aren't just LaTeX: they're in MathML, <pre> tags, inline, even images. Code shows up in just as many ways. Most parsers drop it.
Nemotron-CC-Math (133B tokens) reprocesses CommonCrawl math pages to capture math equations and code reliably.
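
To make the "many ways" point concrete, here is a rough Python sketch that counts the places math can hide in a page. This is not the Nemotron-CC-Math pipeline (that one renders with Lynx and rewrites with an LLM); the `MathCarrierCounter` class, the `count_math_carriers` helper, and the regex heuristics are all made up for illustration, using only the standard library.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed heuristic: inline TeX written as $...$ or \( ... \). Illustrative only.
INLINE_TEX = re.compile(r"\$[^$\n]+\$|\\\(.+?\\\)")


class MathCarrierCounter(HTMLParser):
    """Hypothetical helper: tallies the HTML constructs that may carry math."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "math":            # MathML markup
            self.counts["mathml"] += 1
        elif tag == "pre":           # preformatted blocks (ASCII math, code)
            self.counts["pre"] += 1
        elif tag == "img":           # equation images often keep TeX in alt text
            alt = dict(attrs).get("alt") or ""
            if "\\" in alt or INLINE_TEX.search(alt):
                self.counts["image"] += 1

    def handle_data(self, data):
        # Inline TeX sitting directly in the page text
        found = len(INLINE_TEX.findall(data))
        if found:
            self.counts["inline_tex"] += found


def count_math_carriers(html):
    """Return a dict mapping carrier kind to the number of occurrences."""
    parser = MathCarrierCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)
```

A text extractor that only keeps visible text would typically lose the `mathml` and `image` cases entirely, which is the gap the post describes.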