Joined January 2022
It has come to our attention that many researchers and developers are confused by the existence of both zh-hk (Hong Kong Chinese) and yue (Cantonese) on Common Voice, and don't know which one to use. Our short answer is: use yue, and do NOT use zh-hk. Reasons below.
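CanCLID's latest project, 粵語辭叢 (jyutjyu.com/), a Cantonese dictionary aggregator site, is now officially live! It currently covers 11 Cantonese dictionaries with more than 260,000 entries in total, and more will be added over time. Bug reports and feedback are welcome. We hope it helps Cantonese teachers and students around the world 🥳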
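The Zoeng Jyut Gaai speech dataset gets another major update: 75.71 hours of The Deer and the Cauldron (《鹿鼎記》) data have been added, bringing the total to 188.25 hours! As a showcase, we trained a Zoeng Jyut Gaai TTS system that anyone with a Hugging Face account can play with for free! huggingface.co/spaces/laubon…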
Our Zoeng Jyut Gaai speech dataset had 126k downloads last month 😱🤩🥳 That makes it one of the top 100 most downloaded datasets on Hugging Face! We appreciate everyone's support, and more updates are on the way!
粵語計算語言學基礎建設組 CanCLID retweeted
😼SMOL DATA ALERT! 😼Announcing SMOL, a professionally translated dataset for 115 very low-resource languages! Paper: arxiv.org/pdf/2502.12301 Huggingface: huggingface.co/datasets/goog…
The last subset of the Zoeng Jyut Gaai speech dataset, 《走進毛澤東的黃昏歲月》 (The Final Days of Mao Ze Dong), is now fully uploaded. Together with the earlier Romance of the Three Kingdoms and Water Margin subsets, we now have 112 hours of high-quality speech data! huggingface.co/datasets/CanC…
This dataset cost us close to $5,000 USD in total, spent on hiring annotators and on building and buying tools. The project budget is now used up, so we are not adding new data for the time being. If you are interested in sponsoring or donating so that CanCLID can keep producing high-quality datasets, for example by adding The Heaven Sword and Dragon Saber (《倚天屠龍記》) or The Deer and the Cauldron (《鹿鼎記》), please reach out to us by DM!
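The Zoeng Jyut Gaai dataset gets its biggest update yet: 38.62 hours of Zoeng Jyut Gaai narrating Water Margin (《水滸傳》) have been added. Together with the existing Romance of the Three Kingdoms data, the total duration reaches 104.64 hours! The HF repository has also been officially renamed to CanCLID/zoengjyutgaai huggingface.co/datasets/CanC… and the homepage now shows the latest statistics canclid.github.io/zoengjyutg… Please share and support us so we can keep producing quality datasets!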
粵語計算語言學基礎建設組 CanCLID retweeted
I contributed a chapter titled "Ideologically Driven Divergence in Cantonese Vernacular Writing Practices" to J-F Dupré's forthcoming book "The Politics of Language in Hong Kong", releasing Dec 2024. It is part of a new book series on Hong Kong research. routledge.com/9781032648453
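The Zoeng Jyut Gaai storytelling speech dataset is officially complete! It has 66 hours of high-quality Cantonese speech data in total. Even if you are not building AI technology, you can download the webm audio and srt subtitles and simply listen to them as stories. It can also be used for linguistics and literature research. Dataset homepage: canclid.github.io/zoengjyutg…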
As an example use case, this is a TTS (text-to-speech) demo trained on this dataset. You can have anything you like read aloud in Zoeng Jyut Gaai's voice! huggingface.co/spaces/laubon…
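The Hugging Face repository contains: 1. all source audio as webm 2. the srt subtitles for each episode 3. wav files segmented by the subtitles and resampled, suitable for direct use as training data 4. metadata.csv, the master data file compiled from the subtitles. If you are not sure how to download it with git or Hugging Face, feel free to leave a question in the comments.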
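For readers who want to work with the srt subtitles directly, here is a minimal sketch of reading cue timings from SRT text, which is the information needed to cut source audio into per-sentence clips. It is not the script CanCLID used to build the wav files; the `Cue` class, the `parse_srt` function, and the sample text are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert the four captured timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into a list of cues (hypothetical helper)."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                fields = match.groups()
                # Everything after the timing line is the subtitle text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append(Cue(_seconds(*fields[:4]), _seconds(*fields[4:]), text))
                break
    return cues


# Made-up sample for illustration, not taken from the dataset.
sample = """1
00:00:00,000 --> 00:00:02,500
話說天下大勢

2
00:00:02,500 --> 00:00:05,000
分久必合,合久必分
"""

cues = parse_srt(sample)
for cue in cues:
    print(cue.start, cue.end, cue.text)
```

Each cue's `start` and `end` (in seconds) can then be passed to an audio tool of your choice to extract the matching clip.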
Free Cantonese subtitle (SRT) generator! More accurate than Subanana! Please share and spread the word! huggingface.co/spaces/laubon…
The best Cantonese subtitle generator currently available: give it audio (.mp3, .wav, etc.) and it automatically outputs an SRT file. Free and open source, and more accurate than Subanana! Contributions and feedback are welcome! github.com/hon9kon9ize/yuesu…
Common Voice 19.0 is released, and Cantonese now has 209 validated hours of recordings! Thank you all for your support! This growth in data will soon show up in downstream speech applications, and we look forward to more high-quality Cantonese speech models!
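CanCLID's latest project makes its debut: currently the only free, open-source Cantonese TTS dataset on the web, Zoeng Jyut Gaai narrating Romance of the Three Kingdoms: huggingface.co/datasets/laub… Work on this dataset has only just started, so it has just 55 minutes so far. If you want to help and speed up its expansion, please contact us!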
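This dataset is drawn from Romance of the Three Kingdoms (《三國演義》), the best-known work of Zoeng Jyut Gaai, the late storyteller who was the most famous in Guangzhou zh.wikipedia.org/wiki/… Besides TTS, it can also serve as an ASR test set or as pre-training data for speech models, for example github.com/AlienKevin/canton…. Our goal is to finish all 157 episodes, more than 70 hours of recordings in total. Please support us!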
粵語計算語言學基礎建設組 CanCLID retweeted
Wondering how Google Translate or SenseChat got so much #Cantonese data? With a good classifier, millions of sentences can be extracted from Hong Kong materials. Here's a rule-based implementation: aclanthology.org/2024.eurali… @Can_CLID
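As a rough idea of what a rule-based approach can look like, here is a minimal sketch that labels a sentence by counting marker characters. It is not the implementation from the linked paper: the marker sets, the label names, and the `classify` function are hypothetical and far too small for real use.

```python
# Hypothetical marker sets, chosen for illustration only.
# Characters typical of written Cantonese:
CANTONESE_MARKERS = set("嘅咗哋嚟喺冇啲咁嗰唔")
# Characters typical of Standard Written Chinese:
WRITTEN_CHINESE_MARKERS = set("的們這那沒嗎")


def classify(sentence: str) -> str:
    """Label a sentence by which marker characters it contains."""
    canto = sum(ch in CANTONESE_MARKERS for ch in sentence)
    written = sum(ch in WRITTEN_CHINESE_MARKERS for ch in sentence)
    if canto and written:
        return "mixed"
    if canto:
        return "cantonese"
    if written:
        return "written_chinese"
    # No marker from either set: the sentence could be read either way.
    return "neutral"


for s in ["我哋而家唔得閒", "我們現在沒有時間", "今日天氣好好"]:
    print(s, classify(s))
```

A filter built this way would keep only the sentences labelled `cantonese` when extracting data from mixed Hong Kong text.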