Our first task is to massively expand SMOL. Anyone who contributes significant volunteer translations or post-edits will get on the arXiv paper in the next refresh!
This is a space for grassroots collaboration. It doubles as a directory of speakers of such languages, so you can talk and collaborate directly with community members.
🚨 New machine translation dataset alert! 🚨 We expanded the language coverage of WMT24 from 9 to 55 en->xx language pairs by collecting new reference translations for 46 languages, in a dataset called WMT24++.
Paper: arxiv.org/abs/2502.12404v1
Data: huggingface.co/datasets/goog…
Finally, if you are a speaker of any SMOL languages, please take a look at the data and tell me what you think. Despite the quality checks, I am sure that some of the deliveries have issues, and I would love to understand and/or fix them. We are in this together!
I would also like to thank the FAIR lab for being an academic leader in open-sourcing work with low-resource languages, including NLLB and Flores. Thank you for helping make the academic community feel collaborative!
I would like to thank our native-language consultants and translators -- too numerous to name -- for their invaluable help along the way. Several entire languages in SMOL only exist because of volunteer contributions!
SMOL has two sub-sources: SMOL-Doc, a document-level set, and SMOL-Sent, a sentence-level set. They join the token-level GATITOS to cover three levels of granularity!
And that's just out-of-the-box finetuning. We know that the community can think of cleverer ways to train on SMOL. Multiway parallel data is tricky to use without overfitting.
Finetuning of Gemini 2.0 Flash on SMOL yields average improvements of about 4.0 ChrF, with some languages -- including Ewe, Kokborok, Manipuri, Ga, and Dombe -- seeing gains of over 20 ChrF.
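For readers less familiar with ChrF: it is a character n-gram F-score, reported here on a 0-100 scale. Below is a minimal sketch of the idea, assuming n-gram orders 1 to 6, beta = 2, and whitespace stripped before counting. The function names are hypothetical, and this is not the scoring code behind the numbers above; an established implementation such as sacreBLEU may differ in details.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified ChrF: average n-gram precision and recall, then combine with F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders longer than either string.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + b2) * p * r / (b2 * p + r)
```

Because it works on characters rather than words, a score like this still gives partial credit for near-miss word forms, which is one reason it is commonly used for morphologically rich and low-resource languages.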
SMOL comprises sentences and documents carefully selected for the biggest “Bang for Buck” ratio. It includes 6.1M translated tokens—and if you’ve been in this field a while you know that’s a lot!