Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89928
| Title: | 中低資源語言之語音語料蒐集及語言辨識之分析研究 Corpus Collection and Analysis of Speech Language Identification for Medium-to-Low Resource Languages |
| Author: | 江宥呈 Iu-thing Kang |
| Advisor: | 李宏毅 Hung-Yi Lee |
| Keywords: | spoken language identification, web scraping, ECAPA-TDNN, contrastive learning |
| Year of publication: | 2023 |
| Degree: | Master's |
| Abstract: | Advances in machine learning have made spoken language identification a research field with wide-ranging applications. Spoken language identification is the task of automatically identifying the language used in speech. It plays a crucial role in the data mining pipelines of various speech processing applications, including automatic speech recognition, text-to-speech synthesis, and multilingual speech translation. In practice, however, researchers often face imbalanced datasets. Publicly available spoken language identification datasets have traditionally focused on high-resource languages such as English, Mandarin Chinese, and Spanish, leaving other languages relatively under-resourced. These datasets also tend to have uneven data distributions, with some languages having significantly more data than others. Such imbalance can negatively affect model training and performance.
This study therefore aims to address dataset imbalance in spoken language identification. We collected channel lists for different languages, crawled the corresponding videos, and preprocessed them with voice activity detection to construct the VoxCentum dataset. The dataset contains 13,072 hours of speech from 137 languages, with most languages having more than 100 hours of data. It also includes languages of Taiwan, such as Taiwanese, Hakka, and sixteen indigenous languages, some of which are included in a multilingual spoken language identification model for the first time. We used this dataset to evaluate existing publicly available spoken language identification models and confirmed that imbalanced data has a significant impact on model performance. By contrast, the balanced dataset produced by our data collection method generalizes well, and model performance transfers effectively to other datasets. We also explored contrastive learning and the use of language family labels as auxiliary objectives. Experimental results show that multitask learning that combines cross-entropy and contrastive loss improves both model performance and generalization ability.
In summary, this study addresses dataset imbalance in spoken language identification by collecting a balanced dataset, developing spoken language identification models, and exploring contrastive learning. Our findings highlight the significant impact of imbalanced data on model performance and provide a balanced dataset with good generalization capabilities for further research and development. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89928 |
| DOI: | 10.6342/NTU202302571 |
| Full-text authorization: | Authorized (open to on-campus access only) |
| Appears in collections: | Department of Electrical Engineering |
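The abstract reports that combining cross-entropy and contrastive loss in a multitask setup improved performance. Below is a minimal sketch, in plain Python, of one way such a combined objective can be computed. It is not the thesis's implementation: the supervised contrastive formulation, the generic `group_labels` argument (which could hold language or language family labels), the `weight` and `temperature` parameters, and all function names are assumptions made for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def _normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive(embeddings, group_labels, temperature=0.1):
    """Supervised contrastive loss over a batch (assumed formulation).

    Examples sharing a group label are treated as positives; anchors
    without any positive in the batch are skipped.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and group_labels[j] == group_labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other example.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def combined_loss(logits_batch, embeddings, lang_labels, group_labels,
                  weight=0.5, temperature=0.1):
    """Mean cross-entropy plus a weighted contrastive term (hypothetical weighting)."""
    ce = sum(cross_entropy(l, y)
             for l, y in zip(logits_batch, lang_labels)) / len(lang_labels)
    con = supervised_contrastive(embeddings, group_labels, temperature)
    return ce + weight * con
```

In this sketch, the classification term is driven by language labels while the contrastive term pulls together embeddings that share a group label. Setting `weight=0` reduces the objective to plain cross-entropy, which makes it straightforward to compare the two setups.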
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-111-2.pdf (access restricted to NTU campus IP addresses; off campus, please use the VPN remote-connection service) | 1.46 MB | Adobe PDF |
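Unless their copyright terms are specifically indicated otherwise, all items in this system are protected by copyright, with all rights reserved.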
