Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93806

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李宏毅 | zh_TW |
| dc.contributor.advisor | Hung-Yi Lee | en |
| dc.contributor.author | 王式珩 | zh_TW |
| dc.contributor.author | Shih-Heng Wang | en |
| dc.date.accessioned | 2024-08-08T16:18:58Z | - |
| dc.date.available | 2024-08-09 | - |
| dc.date.copyright | 2024-08-08 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-07-17 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93806 | - |
| dc.description.abstract | 語音自監督學習模型在各種語音處理任務中展示了卓越的能力。使用語音自監督模型連續特徵訓練模型雖然性能強大,但卻受限於其高計算和存儲成本。另一方面,雖然使用語音自監督模型離散化特徵訓練模型的性能有所下降,卻可通過去重複化以及字節對編碼,大量降低了傳輸和存儲成本,並提高了輸入序列的訓練效率。為了提升使用語音自監督模型離散化特徵訓練自動語音識別模型中的性能,我們提出了一種新穎的融合機制,整合了兩種離散特徵。這種融合機制保留了離散特徵的所有優點,同時通過整合離散特徵的互補信息來增強模型的性能。此外,我們還探索了「自增強」離散特徵,它對單一連續特徵進行轉換,消除了融合機制對多個語音自監督學習模型的依賴,還進一步降低了推理成本。在包括LibriSpeech和ML-SUPERB在內的基準測試上的實驗結果顯示,與非融合的基準比較對象相比,我們提出的方法有高達19%和24%的字符錯誤率相對進步量,證明了我們方法的有效性。 | zh_TW |
| dc.description.abstract | Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but suffer from high computational and storage costs. Discrete SSL representations, although weaker in performance, reduce transmission and storage costs and improve training efficiency by shortening the input sequence through de-duplication and subword modeling. To boost the ASR performance of discrete representations, we introduce a novel fusion mechanism that integrates two discrete representations. The fusion mechanism preserves all the benefits of discrete representations while enhancing performance by integrating their complementary information. Additionally, we explore "self-augmented" discrete representations, which apply transformations to a single continuous SSL representation, eliminating the fusion mechanism's dependency on multiple SSL models and further decreasing inference cost. Experimental results on benchmarks including LibriSpeech and ML-SUPERB show up to 19% and 24% relative character error rate improvements over the non-fusion baseline, validating the effectiveness of the proposed methods. (A minimal, hypothetical sketch of the de-duplication and subword-modeling step appears after this metadata table.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-08T16:18:58Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-08-08T16:18:58Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 目次
口試委員會審定書 i
誌謝 ii
中文摘要 iii
英文摘要 iv
一、導論 1
1.1 研究動機 1
1.2 研究方向 4
1.3 主要貢獻 4
1.4 章節安排 5
二、背景知識 6
2.1 轉換器(Transformer) 6
2.1.1 起源與發展 6
2.1.2 轉換器架構 6
2.1.3 應用與影響 9
2.2 語音自監督學習語音模型 10
2.3 離散化 12
2.4 去重複化以及字節對編碼 13
2.5 Interspeech 2024 離散語音單元語音處理挑戰 14
三、離散化語音自監督模型特徵於語音識別之應用 19
3.1 語音自監督模型特徵離散化流程 19
3.2 離散化語音自監督模型特徵訓練 21
3.3 離散化語音自監督模型特徵融合機制 27
3.4 自增強特徵 29
四、實驗設定 32
4.1 訓練 & 測試資料集 32
4.2 測量指標 34
4.3 模型、離散化及訓練參數 35
4.4 比較對象 38
4.5 本章流程總結 39
五、實驗結果與分析 41
5.1 實驗結果 41
5.2 實驗分析 42
5.3 章節總結 46
六、結論與展望 51
參考文獻 52 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 語音辨識 | zh_TW |
| dc.subject | 自監督學習 | zh_TW |
| dc.subject | 離散特徵 | zh_TW |
| dc.subject | Discretized representation | en |
| dc.subject | ASR | en |
| dc.subject | Self-supervised learning | en |
| dc.title | 離散化語音自監督模型特徵用於多語言語音辨識 | zh_TW |
| dc.title | Discretized Speech Self-Supervised Model Representation for Multilingual Automatic Speech Recognition | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 曹昱;王新民;陳尚澤;賴穎暉;李琳山 | zh_TW |
| dc.contributor.oralexamcommittee | Tsao Yu;Hsin-Min Wang;Shang-Tse Chen;Ying-Hui Lai;Lin-shan Lee | en |
| dc.subject.keyword | 語音辨識,自監督學習,離散特徵, | zh_TW |
| dc.subject.keyword | ASR,Self-supervised learning,Discretized representation, | en |
| dc.relation.page | 58 | - |
| dc.identifier.doi | 10.6342/NTU202401809 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2024-07-18 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
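The abstract above notes that discrete SSL units cut costs by shortening the input sequence through de-duplication and subword modeling. The following is a minimal, hypothetical Python sketch of that preprocessing idea only; it is not code from the thesis, and the function names, toy unit IDs, and simple greedy pair-merging stand in for whatever clustering and BPE tooling the thesis actually uses.

```python
# Illustrative sketch of de-duplication + a toy BPE-style merge over
# discrete speech unit IDs (e.g., k-means cluster indices of SSL features).
# All names and values here are hypothetical, not taken from the thesis.
from collections import Counter
from itertools import groupby


def deduplicate(units):
    """Collapse consecutive repeated unit IDs: [5, 5, 5, 9, 9, 2] -> [5, 9, 2]."""
    return [u for u, _ in groupby(units)]


def bpe_merge(units, num_merges=2):
    """Greedily merge the most frequent adjacent pair into a new subword symbol."""
    units = [str(u) for u in units]
    for _ in range(num_merges):
        pair_counts = Counter(zip(units, units[1:]))
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merged, i = [], 0
        while i < len(units):
            if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
                merged.append(a + "_" + b)  # fused subword token
                i += 2
            else:
                merged.append(units[i])
                i += 1
        units = merged
    return units


if __name__ == "__main__":
    frame_units = [5, 5, 5, 9, 9, 2, 2, 2, 5, 5, 9, 2]  # hypothetical frame-level unit IDs
    dedup = deduplicate(frame_units)
    print("deduplicated:", dedup)
    print("after merges:", bpe_merge(dedup))
```

Both steps shorten the token sequence an ASR model must consume, which is the efficiency benefit the abstract attributes to discrete representations.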
| Appears in Collections: | 電信工程學研究所 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf | 3.91 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
