Releasing Indic Gemma 7B/2B instruction-tuned models for 9 Indian languages: Navarasa
We are thrilled to share Navarasa, a set of Gemma 7B and 2B instruction-tuned models for 9 Indian languages. This is perhaps the first open instruction-tuned Indic model trained on 9 Indian languages, with English included as well.
Navarasa is a Gemma 7B and 2B SFT model built on the Gemma 7B and 2B base models. Last week we released the Telugu Gemma 7B/2B SFT model, trained on curated Telugu datasets from Telugu LLM Labs, and we observed really good performance compared to Llama2-based models.
So we thought: why don't we scale the Gemma 7B and 2B models up to multiple Indian languages? We went ahead and tested tokenization for the following 9 Indian languages and English.
1. Hindi
2. Telugu
3. Tamil
4. Malayalam
5. Kannada
6. Gujarati
7. Bengali
8. Punjabi
9. Odia
10. English
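The post does not say how the tokenizers were tested. One common check is "fertility", the average number of tokens a tokenizer produces per word, since a tokenizer that splits a language into many small pieces makes training and inference more expensive for that language. The sketch below shows the idea under stated assumptions: `tokens_per_word` and `char_tokenize` are hypothetical names, the character-level tokenizer is only a stand-in so the example runs without downloading a model, and the sample sentences are illustrative rather than taken from the training data.

```python
from typing import Callable, Dict, List


def tokens_per_word(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def char_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): one token per non-space character.
    # Swap in a real tokenizer's tokenize function to get meaningful numbers.
    return [ch for ch in text if not ch.isspace()]


# Illustrative sample sentences (assumption), one per language to compare.
samples: Dict[str, str] = {
    "English": "How are you",
    "Hindi": "आप कैसे हैं",
}

# Lower values mean the tokenizer represents the language more compactly.
fertility = {
    lang: tokens_per_word(text, char_tokenize) for lang, text in samples.items()
}

for lang, value in fertility.items():
    print(f"{lang}: {value:.2f} tokens per word")
```

To run the same comparison against a real model, one option is to pass the `tokenize` method of a Hugging Face tokenizer in place of `char_tokenize`.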
We found the model to have the following capabilities (X represents any of the Indian languages listed above):
1. Instruction and Input in Native X language, Output in Native X language.
2. Instruction and Input in English, with a prompt to respond in Native X language, Output in Native X language.
3. Instruction in Native X language, Input in English language, and Output in Native X language.
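The three modes above can be sketched as prompts. The exact template used for training is not given in this post, so the Alpaca-style layout below (`### Instruction`, `### Input`, `### Response`) is an assumption, as are the `build_prompt` helper and the Hindi example sentences; Hindi stands in for X here.

```python
# Assumed Alpaca-style template; the post does not specify the real one.
ALPACA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Fill the assumed template; the model's answer would follow the last header."""
    return ALPACA_TEMPLATE.format(instruction=instruction, input=input_text)


prompts = {
    # 1. Instruction and input both in the native language (Hindi).
    "native_native": build_prompt(
        "निम्नलिखित प्रश्न का उत्तर दीजिए।",  # "Answer the following question."
        "भारत की राजधानी क्या है?",  # "What is the capital of India?"
    ),
    # 2. Instruction and input in English, asking for a Hindi response.
    "english_to_native": build_prompt(
        "Answer the following question in Hindi.",
        "What is the capital of India?",
    ),
    # 3. Instruction in the native language, input in English.
    "native_instruction_english_input": build_prompt(
        "निम्नलिखित वाक्य का हिंदी में अनुवाद कीजिए।",  # "Translate the following sentence into Hindi."
        "The weather is pleasant today.",
    ),
}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}")
```

In all three cases the expected output is in the native language; only the language of the instruction and input changes.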
Training Details:
1. Hardware: a single A100 machine; training took approx. 36 hours for the 7B model and 15 hours for the 2B model.
2. Platform: E2E Networks Limited
We have shared details on the datasets, along with examples of reasoning, translation, and question answering with context, in our blog post.
This work would not have been possible without the huge community effort across different languages. A huge shout-out to each contributor for their work over the past few months, which showcases the true power of OSS. The contributors for each language are:
1. Hindi: @SarvamAI
2. Telugu: Telugu LLM Labs
3. Tamil: @abhinand58
4. Kannada: @adarshxs and the team at Tensonic
5. Malayalam: Vishnu Prasad J
6. Odia: @OdiaGenAI
7. Gujarati: Adarsh Shirawalmath and the team at Tensonic
8. Punjabi: HydraIndicLM
9. Bengali: HydraIndicLM
Special thanks to @unslothai for simplifying the training and inference processes!
As we release these models, our next step is to create romanized datasets. We are also working hard on evaluation datasets so that we can benchmark the models and improve on them.
This work was done in collaboration with @ramsri_goutham as part of the Telugu LLM Labs independent initiative.
Blog post:
shorturl.at/jBQWY
Codebase:
shorturl.at/elxBF