HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT....
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest...
arxiv.org