Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91260
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李琳山 | zh_TW |
dc.contributor.advisor | Lin-Shan Lee | en |
dc.contributor.author | 馮啟倫 | zh_TW |
dc.contributor.author | Chi-Luen Feng | en |
dc.date.accessioned | 2023-12-20T16:11:37Z | - |
dc.date.available | 2023-12-21 | - |
dc.date.copyright | 2023-12-20 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-09-29 | - |
dc.identifier.citation | [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025, 2015.
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
[5] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. CoRR, abs/1904.05862, 2019.
[6] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. CoRR, abs/2006.11477, 2020.
[7] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. CoRR, abs/2106.07447, 2021.
[8] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4052–4056, 2014.
[9] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333, 2018.
[10] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
[11] Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6419–6423. IEEE, 2020.
[12] Yu-An Chung, Hao Tang, and James Glass. Vector-quantized autoregressive predictive coding. arXiv preprint arXiv:2005.08392, 2020.
[13] Andy T. Liu, Shang-Wen Li, and Hung-yi Lee. TERA: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2351–2366, 2021.
[14] Yu-An Chung, Yonatan Belinkov, and James Glass. Similarity analysis of self-supervised speech representations. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3040–3044, 2021.
[15] Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. Layer-wise analysis of a self-supervised speech representation model. CoRR, abs/2107.04734, 2021.
[16] Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, and Xiangzhan Yu. UniSpeech-SAT: Universal speech representation learning with speaker aware pre-training. CoRR, abs/2110.05752, 2021.
[17] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. arXiv preprint arXiv:2110.13900, 2021.
[18] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification dataset. CoRR, abs/1706.08612, 2017.
[19] Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. SUPERB: Speech processing universal performance benchmark. CoRR, abs/2105.01051, 2021.
[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[21] Mitchell McLaren, Luciana Ferrer, Diego Castán, and Aaron D. Lawson. The speakers in the wild (SITW) speaker recognition database. In INTERSPEECH, 2016.
[22] Colleen Richey, María Auxiliadora Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson, Mahesh Kumar Nandwana, Allen R. Stauffer, Julien van Hout, Paul Gamble, Jeff Hetherly, Cory Stephenson, and Karl Ni. Voices obscured in complex environmental settings (VOiCES) corpus. CoRR, abs/1804.05053, 2018.
[23] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. An unsupervised autoregressive model for speech representation learning. In Interspeech, pages 146–150, 2019.
[24] Alexander H. Liu, Yu-An Chung, and James Glass. Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406, 2020.
[25] Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. Unsupervised pretraining transfers well across languages. In ICASSP, pages 7414–7418, 2020.
[26] Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. In ICLR, 2020.
[27] Heng-Jui Chang, Shu-wen Yang, and Hung-yi Lee. DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7087–7091. IEEE, 2022.
[28] Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, and Soujanya Poria. Towards multimodal sarcasm detection (an _obviously_ perfect paper). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4619–4629, Florence, Italy, July 2019. Association for Computational Linguistics.
[29] Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57, 2014.
[30] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[31] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. CoRR, abs/1904.02882, 2019.
[32] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[33] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
[34] Ning Ding, Sheng-wei Tian, and Long Yu. A multimodal fusion method for sarcasm detection based on late fusion. Multimedia Tools and Applications, 81(6):8597–8616, 2022.
[35] Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 284–288, 2016.
[36] Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. Layer-wise analysis of a self-supervised speech representation model. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 914–921. IEEE, 2021.
[37] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines. 2020.
[38] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A large-scale multilingual dataset for speech research. ArXiv, abs/2012.03411, 2020. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91260 | - |
dc.description.abstract | 自監督式學習為近年深度學習領域中的重要技術。該訓練框架於不同領域皆取得卓越的成果,如電腦視覺領域、自然語言處理領域和語音領域。
語音訊號中帶有文字資訊與非文字資訊;非文字資訊主要由語者資訊構成,而語者資訊之中則包含韻律資訊。雖然非文字資訊已可透過自監督式學習模型捕捉,我們卻不清楚其背後的原理機制。本篇論文即以上述的語者資訊和韻律資訊作為切入點。於語者資訊的研究中,我們發現自監督式學習模型會透過其輸出的語音特徵中對應輸入訊號無聲音部分的片段汲取語者資訊,且實驗顯示該項發現能讓我們在不增加運算時間的狀況下提升既有自監督式學習模型的表現。於韻律資訊部分,我們透過 15 個語音自監督式學習模型和 3 個韻律相關任務,驗證自監督式學習模型能將韻律資訊鑲嵌於語音特徵之中;此外,實驗顯示模型傾向將韻律資訊儲存在較前面的層數中,且自監督式學習模型能處理預訓練時未見語言的韻律資訊。綜上所述,本論文以實驗驗證自監督式學習模型如何處理非文字資訊,並根據其機制給出具體改進模型的建議。 | zh_TW |
dc.description.abstract | Self-supervised learning (SSL) has become an important technique in deep learning in recent years. This training framework has achieved excellent results across fields such as computer vision, natural language processing, and speech.
A speech signal carries both textual and non-textual information. Non-textual information consists mainly of speaker information, which in turn includes prosodic information. Although SSL models are known to capture non-textual information, previous studies have not investigated the underlying mechanism. This thesis takes speaker information and prosodic information as its two points of entry. In the study of speaker information, we find that SSL models extract speaker information through the output feature frames corresponding to the silent parts of the input speech signal, and experiments show that this finding lets us improve the performance of existing SSL models without adding computation time. For prosodic information, experiments with 15 speech SSL models on 3 prosody-related tasks verify that SSL models embed prosodic information in their speech features. The experiments further show that the models tend to store prosodic information in earlier layers and can handle the prosody of languages unseen during pre-training. In summary, this thesis provides experimental evidence of how SSL models handle non-textual information and offers concrete suggestions for improving the models based on the observed mechanisms. | en |
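The abstract states its probing claims at a high level; the following is a minimal sketch (not code from the thesis) of how such experiments are commonly set up. The HuBERT checkpoint from torchaudio, the 320-sample frame hop, and the fixed energy threshold standing in for a proper voice activity detector are all illustrative assumptions.

```python
import torch
import torchaudio

# Any SSL upstream exposing per-layer hidden states would do; HUBERT_BASE is
# one readily available choice in torchaudio.
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def silence_mask(waveform: torch.Tensor, hop: int = 320, threshold: float = 1e-4) -> torch.Tensor:
    """Mark a feature frame as 'silent' when the mean energy of the ~20 ms of
    16 kHz audio it covers falls below a fixed threshold (a crude VAD stand-in).
    waveform has shape (1, num_samples)."""
    frames = waveform[0].unfold(0, hop, hop)      # (num_frames, hop)
    return frames.pow(2).mean(dim=1) < threshold  # True where the signal is silent

@torch.no_grad()
def layerwise_embeddings(waveform: torch.Tensor, silent_only: bool) -> list:
    """Return one pooled embedding per transformer layer, averaging either
    over the silence-aligned frames only or over all frames."""
    layers, _ = model.extract_features(waveform)  # list of (1, T, D), one per layer
    mask = silence_mask(waveform)
    pooled = []
    for feat in layers:
        t = min(feat.shape[1], mask.shape[0])     # conv frontend may drop a frame
        keep = mask[:t] if silent_only else torch.ones(t, dtype=torch.bool)
        if not keep.any():                        # no silent frame found: fall back
            keep = torch.ones(t, dtype=torch.bool)
        pooled.append(feat[0, :t][keep].mean(dim=0))
    return pooled
```

Training a small linear classifier on these pooled vectors, layer by layer and once per pooling mode, would give the two comparisons the abstract describes: silence-only versus all-frame accuracy for the speaker finding, and the accuracy-versus-layer curve for where information is stored.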
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-12-20T16:11:37Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-12-20T16:11:37Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Thesis Committee Certification i
Acknowledgements ii
Abstract (Chinese) iv
Abstract (English) v
Table of Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Problem Definition 3
1.3 Main Contributions 4
1.4 Thesis Organization 4
Chapter 2 Background 6
2.1 Deep Learning 6
2.1.1 Overview of Deep Learning 6
2.1.2 Transformers 9
2.2 Self-supervised Learning
2.2.1 Overview of Self-supervised Learning 13
2.2.2 Applications of Self-supervised Learning in Speech 14
2.3 Tasks Related to Non-textual Information in Speech 17
2.3.1 Non-textual Information in Speech: Speaker Tasks 17
2.3.2 Non-textual Information in Speech: Prosody Tasks 18
2.4 Chapter Summary 19
Chapter 3 Speaker Information in Speech Self-supervised Learning Models 21
3.1 Introduction 21
3.2 Relationship between Speaker Information and Position in the Speech Signal 24
3.2.1 Experimental Setup 24
3.2.2 Experimental Results 26
3.3 Relationship between the Silence-aligned Segments of SSL Output Features and the Speaker Identification and Speaker Verification Tasks 28
3.3.1 Experimental Setup 28
3.3.2 Experimental Results 30
3.4 Verifying the Link between Speaker Information and the Silence-aligned Segments of SSL Output Features: Segmentation Tests and Gradient-based Importance Tests 32
3.4.1 Feature Segmentation Tests for Speaker Information: Experimental Setup 32
3.4.2 Feature Segmentation Tests for Speaker Information: Experimental Results 33
3.4.3 Gradient-based Importance Analysis: Experimental Setup 36
3.4.4 Gradient-based Importance Analysis: Experimental Results 37
3.5 Impact of the Silence-aligned Segments of SSL Output Features on Speaker-related Task Performance 40
3.5.1 Improvement Experiments on Speaker Identification: Experimental Setup 41
3.5.2 Improvement Experiments on Speaker Identification: Experimental Results 41
3.5.3 Analysis and Improvement Experiments on Three Datasets: Experimental Setup 43
3.5.4 Analysis and Improvement Experiments on Three Datasets: Experimental Results 44
3.6 Chapter Summary 45
Chapter 4 Prosodic Information in Speech Self-supervised Learning Models 48
4.1 Introduction 48
4.2 Experimental Setup 50
4.2.1 Overview of the Experimental Framework 50
4.2.2 Overview of the Prosody Discrimination Task 53
4.2.3 Overview of the Prosody Reconstruction Task 54
4.2.4 Overview of the Prosody Prediction Task 55
4.2.5 Parameter and Baseline Settings 56
4.3 Experimental Results
4.3.1 Prosody Discrimination Results 57
4.3.2 Prosody Reconstruction Results 59
4.3.3 Prosody Prediction Results 60
4.4 Analysis Experiments 61
4.4.1 Layer-wise Analysis 61
4.4.2 Relationship between Prosodic Information and Low-level Information 63
4.4.3 Cross-lingual Inference Capability 64
4.5 Chapter Summary 65
Chapter 5 Conclusion and Future Work 67
5.1 Research Contributions and Discussion 67
5.2 Future Work 68
References 69 | - |
dc.language.iso | zh_TW | - |
dc.title | 自監督式學習模型於語音中語者與韻律資訊的分析與應用 | zh_TW |
dc.title | Analysis and Application of Self-supervised Learning in Speaker and Prosody Information in Speech | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | Master | - |
dc.contributor.oralexamcommittee | 李宏毅;賴穎暉;陳尚澤;王新民 | zh_TW |
dc.contributor.oralexamcommittee | Hung-Yi Lee;Ying-Hui Lai;Shang-Tse Chen;Hsin-Min Wang | en |
dc.subject.keyword | 自監督式學習, 語者資訊, 韻律資訊 | zh_TW |
dc.subject.keyword | Self-supervised learning, Speaker information, Prosodic information | en |
dc.relation.page | 74 | - |
dc.identifier.doi | 10.6342/NTU202301673 | - |
dc.rights.note | Not authorized | - |
dc.date.accepted | 2023-10-03 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | - |
Appears in Collections: | Graduate Institute of Networking and Multimedia
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf (currently not authorized for public access) | 4.71 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.