Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89928
Full metadata record
DC field: value (language)
dc.contributor.advisor: 李宏毅 (zh_TW)
dc.contributor.advisor: Hung-Yi Lee (en)
dc.contributor.author: 江宥呈 (zh_TW)
dc.contributor.author: Iu-thing Kang (en)
dc.date.accessioned: 2023-09-22T16:42:55Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-09-22
dc.date.issued: 2023
dc.date.submitted: 2023-08-04
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89928
dc.description.abstract: Advances in machine learning have made spoken language identification a research field with wide-ranging applications. Spoken language identification is the task of automatically identifying the language spoken in an audio recording. It plays an important role in the data mining pipelines of many speech processing applications, including automatic speech recognition, speech synthesis, and multilingual speech translation. In practice, however, researchers frequently face the problem of imbalanced datasets. Public spoken language identification datasets have traditionally been biased toward high-resource languages such as English, Mandarin, and Spanish, leaving other languages with comparatively few resources. These datasets also tend to have uneven data distributions, with some languages represented by far more data than others. Such imbalance can harm model training and performance.

This study therefore aims to address dataset imbalance in spoken language identification. We collected channel lists for different languages, crawled the corresponding videos, preprocessed them with voice activity detection, and built the VoxCentum dataset. It contains 13,072 hours of speech across 137 languages, most of which have more than 100 hours of data. The dataset also covers languages native to Taiwan, including Taiwanese, Hakka, and sixteen indigenous languages, some of which appear in a multilingual spoken language identification model for the first time. Using this dataset, we evaluated publicly available spoken language identification models, and the results confirm that imbalanced data has a substantial impact on model performance. A balanced dataset produced by our collection method generalizes well, and its model performance transfers effectively to other datasets. We further explored contrastive learning and the use of language family labels as auxiliary targets. The experiments show that multitask learning combining cross-entropy and contrastive losses improves both model performance and generalization.

In summary, this study tackles problems in spoken language identification by collecting a balanced dataset, developing language identification models, and exploring contrastive learning. Our results highlight the substantial impact of imbalanced data on model performance and provide a balanced dataset with good generalization for further research and development.
(zh_TW)
dc.description.abstract: The development of machine learning has made spoken language identification a research field with wide-ranging applications. Spoken language identification refers to the task of automatically identifying the language used in speech. It plays a crucial role in the data mining pipelines of various speech processing applications, including automatic speech recognition, text-to-speech synthesis, and multilingual speech translation.

However, in practice, researchers often face the challenge of imbalanced datasets. Traditionally, publicly available spoken language identification datasets tend to focus on high-resource languages such as English, Chinese, and Spanish, resulting in a relative lack of resources for other languages. Furthermore, these datasets often exhibit imbalanced data distributions, with certain languages having significantly more data than others. Such data imbalance can negatively impact model training and performance.

Therefore, this study aims to address the problem of dataset imbalance in spoken language identification. We collected channel lists for different languages and crawled the corresponding videos, which were then preprocessed using voice activity detection (a sketch of this step follows this abstract) to construct the VoxCentum dataset. This dataset consists of 13,072 hours of audio data from 137 languages, with most languages having more than 100 hours of data. In addition, it includes languages of Taiwan, such as Taiwanese, Hakka, and sixteen indigenous languages, some of which are included in a multilingual spoken language identification model for the first time. We evaluated existing publicly available spoken language identification models on this dataset and confirmed the significant impact of imbalanced data on model performance. Furthermore, our data collection method yields a balanced dataset with good generalization, allowing model performance to transfer effectively to other datasets. We also explored contrastive learning and the use of language family labels as auxiliary objectives. Experimental results show that combining cross-entropy and contrastive loss in multitask learning (sketched after the table of contents below) improves both model performance and generalization ability.

In summary, this study addresses dataset imbalance in spoken language identification through the collection of a balanced dataset, the development of spoken language identification models, and the exploration of contrastive learning. Our findings highlight the significant impact of imbalanced data on model performance and provide a balanced dataset with good generalization capabilities for further research and development.
(en)
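The voice-activity-detection step described in the abstract can be made concrete with a short example. This is a minimal sketch, not the thesis's actual pipeline: the record does not name the VAD tool or its parameters, so the code assumes the open-source Silero VAD and 16 kHz mono input, and the file paths are placeholders.

```python
# Minimal sketch of VAD-based preprocessing for crawled audio,
# assuming Silero VAD (https://github.com/snakers4/silero-vad).
# The thesis does not specify its exact VAD tool or thresholds here;
# paths and parameters below are illustrative placeholders.
import torch

SAMPLE_RATE = 16000  # assumed mono 16 kHz input

# Load the pre-trained Silero VAD model and its helper functions.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, _, collect_chunks = utils

def extract_speech(in_path: str, out_path: str) -> float:
    """Keep only the speech regions of one crawled clip.

    Returns the number of speech seconds retained, which a crawling
    pipeline could accumulate per language to track balance.
    """
    wav = read_audio(in_path, sampling_rate=SAMPLE_RATE)
    # Timestamps of detected speech, expressed in samples.
    stamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)
    if not stamps:
        return 0.0  # no speech detected; skip this clip
    speech = collect_chunks(stamps, wav)  # concatenate speech-only audio
    save_audio(out_path, speech, sampling_rate=SAMPLE_RATE)
    return sum(s["end"] - s["start"] for s in stamps) / SAMPLE_RATE

if __name__ == "__main__":
    kept = extract_speech("crawled_clip.wav", "speech_only.wav")
    print(f"retained {kept:.1f} s of speech")
```

Dropping non-speech before training keeps the per-language hour counts meaningful, which matters when the goal is a balanced corpus.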
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T16:42:55Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2023-09-22T16:42:55Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Motivation
1.2 Research Directions
1.3 Contributions
1.4 Thesis Organization
Chapter 2: Background
2.1 Artificial Neural Networks
2.1.1 Deep Neural Networks
2.2 Time-Delay Neural Networks
2.2.1 Convolutional Layers
2.2.2 Network Architecture
2.3 Long-Tailed Distributions
2.3.1 Re-sampling
2.3.2 Re-weighting
2.3.3 Data Augmentation
2.3.3.1 Transfer-Based Augmentation
2.4 Contrastive Learning
Chapter 3: Collection of the Balanced Multilingual Dataset VoxCentum and Its Impact on Spoken Language Identification Performance
3.1 Introduction
3.2 Related Work
3.3 Training Dataset: VoxCentum
3.3.1 Dataset Collection
3.3.2 Dataset Preprocessing
3.3.2.1 Voice Activity Detection
3.3.2.2 Spoken Language Identification Filtering
3.3.2.3 Data-Driven Filtering
3.4 Dataset Analysis
3.4.1 Language Family Analysis
3.4.2 Geographic Distribution
3.4.3 Speaker Analysis
3.5 Spoken Language Identification Experimental Setup
3.5.1 Data-Driven Filtering
3.5.2 Impact of a Balanced Dataset on Models
3.5.3 Evaluating Existing Models on VoxCentum
3.5.4 Training Details
3.6 Experimental Results and Discussion
3.6.1 Data-Driven Filtering
3.6.2 Impact of a Balanced Dataset on Models
3.6.3 Evaluating Existing Models on VoxCentum
3.7 Chapter Summary
Chapter 4: Enhancing Spoken Language Identification with Contrastive Learning
4.1 Introduction
4.2 Related Work
4.3 Experimental Setup
4.3.1 Baseline Methods
4.3.2 Multitask Learning
4.3.3 Contrastive Loss Function
4.4 Experimental Results and Discussion
4.4.1 Main Results
4.4.2 Contrastive Loss Coefficient α
4.4.3 Language Embedding Vectors
4.5 Chapter Summary
Chapter 5: Conclusion and Future Work
5.1 Summary and Contributions
5.2 Future Work
References
Appendix A: Complete Language List of VoxCentum
A.1 Complete Language List of VoxCentum
Appendix B: Detailed Experimental Results
B.1 Detailed Results of Evaluating Existing Models on VoxCentum
B.2 Detailed Results of Different Training Objectives on VoxCentum
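Sections 4.3.2, 4.3.3, and 4.4.2 of the outline above describe multitask training that combines cross-entropy with a contrastive loss weighted by a coefficient α. The record does not give the exact formula, so the following is a hedged sketch assuming a standard supervised contrastive term (in the style of Khosla et al.) added to cross-entropy, L = L_CE + α · L_con; the temperature, α, embedding dimension, and batch contents are illustrative.

```python
# Hedged sketch of the multitask objective suggested by the abstract and
# Chapter 4 of the outline: cross-entropy plus an α-weighted contrastive
# term over language (or language-family) labels. The exact loss used in
# the thesis is not reproduced in this record; this follows the common
# supervised-contrastive formulation (Khosla et al.) as an example.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(emb: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull same-label embeddings together, push different labels apart.

    Assumes the batch contains at least one positive pair per anchor
    that is counted; anchors with no positives are skipped.
    """
    z = F.normalize(emb, dim=1)            # unit-norm embeddings
    sim = z @ z.t() / temperature          # pairwise cosine similarities
    # Exclude each sample's similarity with itself.
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Positives: other samples sharing the same label.
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0
    per_anchor = -(log_prob.masked_fill(~pos, 0.0).sum(dim=1)[valid]
                   / pos_counts[valid])
    return per_anchor.mean()

def multitask_loss(logits, emb, labels, alpha: float = 0.5):
    """L = L_CE + alpha * L_contrastive (alpha value is a placeholder)."""
    return F.cross_entropy(logits, labels) + \
        alpha * supervised_contrastive_loss(emb, labels)

if __name__ == "__main__":
    logits = torch.randn(8, 137)        # e.g. 137 VoxCentum languages
    emb = torch.randn(8, 192)           # embedding dim is a placeholder
    labels = torch.randint(0, 4, (8,))  # toy labels with repeats
    print(multitask_loss(logits, emb, labels, alpha=0.5).item())
```

Here α trades off the classification objective against the structure of the embedding space; per section 4.4.2 the thesis studies its effect empirically, and the labels can be language IDs or, for the auxiliary objective mentioned in the abstract, language family IDs.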
dc.language.iso: zh_TW
dc.subject: 語音語言辨識 (zh_TW)
dc.subject: 網路爬蟲 (zh_TW)
dc.subject: ECAPA-TDNN (zh_TW)
dc.subject: 對比式學習 (zh_TW)
dc.subject: spoken language identification (en)
dc.subject: web scraping (en)
dc.subject: ECAPA-TDNN (en)
dc.subject: contrastive learning (en)
dc.title: 中低資源語言之語音語料蒐集及語言辨識之分析研究 (zh_TW)
dc.title: Corpus Collection and Analysis of Speech Language Identification for Medium-to-Low Resource Languages (en)
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 王新民;曹昱;陳尚澤 (zh_TW)
dc.contributor.oralexamcommittee: Hsin-Min Wang;Yu Tsao;Shang-Tse Chen (en)
dc.subject.keyword: 語音語言辨識, 網路爬蟲, ECAPA-TDNN, 對比式學習 (zh_TW)
dc.subject.keyword: spoken language identification, web scraping, ECAPA-TDNN, contrastive learning (en)
dc.relation.page: 64
dc.identifier.doi: 10.6342/NTU202302571
dc.rights.note: 同意授權(限校園內公開) (authorized for on-campus access only)
dc.date.accepted: 2023-08-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電機工程學系 (Department of Electrical Engineering)
Appears in Collections: Department of Electrical Engineering

Files in This Item:
File: ntu-111-2.pdf | Size: 1.46 MB | Format: Adobe PDF
Access is restricted to NTU campus IPs; off-campus users should use the library's VPN service.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
