Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88249

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李琳山 | zh_TW |
| dc.contributor.advisor | Lin-Shan Lee | en |
| dc.contributor.author | 孟妍 | zh_TW |
| dc.contributor.author | Yen Meng | en |
| dc.date.accessioned | 2023-08-09T16:12:29Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-08-09 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-07-23 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88249 | - |
| dc.description.abstract | 自監督式學習 (Self-supervised Learning) 的技術在語音處理領域上已有相當成功的發展。透過在大量未標註之語料上的預訓練 (Pre-training),自監督式語音模型 (Self-Supervised Speech Models) 能學習到語音中蘊含的各種語言知識與語言元素,如語音內容、語者特徵等,因而使自監督式語音模型經微調 (Fine-tuning) 在少量有標註之資料後,能夠在各類語音下游任務上均取得不錯的性能 (Performance) 表現。在大型自監督式語音模型崛起並取得壓倒性優勢後,為了使自監督式語音模型能夠更方便容易地被各界訓練及使用,壓縮自監督式語音模型的研究變得更為重要。先前的研究多集中在壓縮模型本身的大小;卻未曾注意到另一個可能的方向,壓縮時間軸上之序列,將其長度縮短,也可有效減少模型的運算負擔。這就是本論文的研究主軸:透過壓縮語音信號在時間軸上之序列長度,來降低自監督式語音模型之運算負擔。
由於不同類別的下游任務有不同的性質,本論文首先探討了各種下游任務對輸入的語音表徵 (Speech Representation) 的採樣率 (Sampling Rate),亦即單位時間內所需表徵總數,的敏感程度。本論文的研究並包括了在時間軸上進行固定間距次採樣 (Fixed-length Subsampling) 及可變間距次採樣 (Variable-length Subsampling) 兩種不同的壓縮序列長度的思維。本研究發現,如能使用適當的次採樣技術來壓縮序列長度,不僅可以顯著加快預訓練及推論的速度,而且有機會在固定採樣率下,提高特定下游任務的整體表現;本研究也證實了可變間距次採樣的技術在較高的序列壓縮比 (Compression Ratio) 的目標下,可以獲得特別好的性能表現,尤其是在與語音內容相關、對採樣率較敏感之任務上。本論文也發現,如果我們能夠取得語音中的近似音素邊界,並使用此近似邊界進行次採樣,即使次採樣後的平均採樣率低至 10 Hz,也仍能夠保有,甚至超越原本未經壓縮時間序列之模型的性能表現。 | zh_TW |
| dc.description.abstract | Self-supervised learning has achieved considerable success in speech processing. By pre-training on a large unlabeled speech dataset, self-supervised speech models can learn the underlying structure, knowledge, and information in speech, such as content and speaker characteristics, enabling the models to achieve good performance on various downstream speech tasks after fine-tuning on only a small amount of labeled data. With the rise of large-scale self-supervised speech models and their overwhelming advantages, research on compressing self-supervised speech models has become increasingly important to make them easier to train and use in various domains.
While previous research has primarily focused on compressing the model size, shortening the signal representation sequences along the time axis is another effective way to reduce the computational load in speech processing, yet this direction has been largely overlooked. Therefore, the main focus of this thesis is to analyze the possibility of compressing the length of the signal representation sequences along the time axis to reduce the computational cost of self-supervised speech models. As different downstream tasks have different properties, this work first investigates how sensitive individual downstream tasks are to the sampling rate of the signal representations. It then studies both fixed-length subsampling and variable-length subsampling along the time axis in self-supervised learning. We find that subsampling the signal representation sequences while training self-supervised speech models not only significantly speeds up pre-training and inference, but may also improve the overall performance of specific downstream tasks in certain scenarios. We also find that variable-length subsampling performs particularly well at relatively high sequence compression ratios, especially for tasks related to speech content, which are more sensitive to the representation sampling rate. Additional experiments show that, given approximate phone boundaries, subsampling based on these boundaries can reduce the average sampling rate to as low as 10 Hz while matching or even outperforming the original model without sequence compression. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-09T16:12:29Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-08-09T16:12:29Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員審定書 i
致謝 iii
摘要 v
Abstract vii
目錄 ix
圖目錄 xv
表目錄 xvii
第一章 導論 1
1.1 研究動機 1
1.2 研究方向 3
1.3 研究貢獻 3
1.4 章節安排 4
第二章 背景知識 5
2.1 深層類神經網路 5
2.1.1 簡介 5
2.1.2 卷積式類神經網路 8
2.1.3 遞迴式類神經網路 9
2.1.4 專注機制 10
2.1.5 轉換器 13
2.1.6 鏈結式時序分類器 17
2.2 自監督式語音表徵學習 20
2.2.1 簡介 20
2.2.2 自監督式語音模型 21
2.2.2.1 模型架構簡介 21
2.2.2.2 生成式方法 22
2.2.2.3 對比式方法 23
2.2.2.4 預測式方法 24
2.2.3 自監督式語音模型的壓縮 25
2.2.3.1 簡介 25
2.2.3.2 知識蒸餾 26
2.2.4 自監督式語音模型之評比 28
2.2.4.1 語音下游任務 28
2.2.4.2 自監督式語音表徵之衡量基準 29
2.2.4.3 模型之運算負擔衡量 30
2.3 次採樣 31
2.3.1 簡介 31
2.3.2 常用方法介紹 31
2.4 本章總結 32
第三章 輸入採樣率對下游任務的影響之初步分析 35
3.1 簡介 35
3.2 實驗模型 36
3.3 實驗方法 37
3.4 實驗設置 38
3.5 實驗結果 39
3.6 本章總結 41
第四章 固定間距次採樣於自監督式模型進行序列壓縮 43
4.1 簡介 43
4.2 模型架構 44
4.3 訓練方法 45
4.4 實驗設置 49
4.4.1 次採樣設定 49
4.4.2 訓練細節 49
4.4.3 下游任務 50
4.5 實驗結果 51
4.5.1 下游任務表現 51
4.5.2 不同方法訓練之損失比較 53
4.5.3 模型之運行效率 55
4.6 本章總結 56
第五章 可變間距次採樣於自監督式模型進行序列壓縮 59
5.1 可變間距次採樣 59
5.1.1 簡介 59
5.1.2 背景與動機 60
5.1.3 相關研究 61
5.2 語音分割的取得 62
5.2.1 簡介 62
5.2.2 監督式語音分割 62
5.2.2.1 簡介 62
5.2.2.2 強制對齊 63
5.2.3 非監督式語音分割 64
5.2.3.1 簡介 64
5.2.3.2 經平滑化之HuBERT離散單元 64
5.2.3.3 非監督式語音辨識模型預測之音素邊界 66
5.2.4 以語音分割執行輸出表徵次採樣之初步實驗 67
5.2.4.1 實驗框架 67
5.2.4.2 實驗設置 67
5.2.4.3 討論與分析 68
5.3 基於連續整合發放機制之可變間距次採樣 69
5.3.1 簡介 69
5.3.2 連續整合發放機制 70
5.3.2.1 運作方式 71
5.3.2.2 訓練方法 72
5.3.3 連續整合發放機制作為次採樣方法 73
5.4 模型架構 74
5.5 訓練方法 75
5.5.1 簡介 75
5.5.2 基數引導訓練 76
5.5.3 分割引導訓練 77
5.6 實驗設置 79
5.6.1 訓練細節 79
5.6.2 下游任務 81
5.7 實驗結果 81
5.7.1 下游任務表現 81
5.7.2 模型運行效率 83
5.8 語音分割品質對可變間距次採樣之影響 84
5.8.1 簡介 84
5.8.2 評量方式 85
5.8.3 結果分析與討論 86
5.9 本章總結 89
第六章 結論 91
6.1 研究貢獻與討論 91
6.2 未來展望 93
參考文獻 95 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 降低運算負擔 | zh_TW |
| dc.subject | 次採樣 | zh_TW |
| dc.subject | 自監督式學習 | zh_TW |
| dc.subject | 序列壓縮 | zh_TW |
| dc.subject | Sequence Compression | en |
| dc.subject | Self-supervised Learning | en |
| dc.subject | Subsampling | en |
| dc.subject | Computational Load Reduction | en |
| dc.title | 經知識蒸餾之自監督式語音模型所生成之信號表徵序列之壓縮 | zh_TW |
| dc.title | Signal Representation Sequence Compression for Distilled Self-Supervised Speech Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 王新民;賴穎暉;李宏毅;陳尚澤 | zh_TW |
| dc.contributor.oralexamcommittee | Hsin-Min Wang;Ying-Hui Lai;Hung-yi Lee;Shang-Tse Chen | en |
| dc.subject.keyword | 自監督式學習,序列壓縮,次採樣,降低運算負擔, | zh_TW |
| dc.subject.keyword | Self-supervised Learning,Sequence Compression,Subsampling,Computational Load Reduction, | en |
| dc.relation.page | 103 | - |
| dc.identifier.doi | 10.6342/NTU202301448 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2023-07-24 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
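The abstract above describes compressing the frame-level representation sequence along the time axis, either at fixed intervals or over variable-length segments derived from approximate phone boundaries. As a rough, non-authoritative illustration of these two subsampling regimes (not code from the thesis), the Python sketch below average-pools a hypothetical 50 Hz frame sequence; all names, rates, shapes, and boundary positions are illustrative assumptions.

```python
import numpy as np

# Hypothetical frame-level representations from an upstream SSL model:
# 50 Hz frame rate (20 ms per frame), hidden size 768, 10 s of speech.
frames = np.random.randn(500, 768)

def fixed_length_subsample(x, factor):
    """Average-pool every `factor` consecutive frames (fixed-length subsampling)."""
    n = (len(x) // factor) * factor
    return x[:n].reshape(-1, factor, x.shape[1]).mean(axis=1)

def variable_length_subsample(x, boundaries):
    """Average frames within each segment delimited by `boundaries`
    (variable-length subsampling, e.g. from approximate phone boundaries)."""
    segments = np.split(x, boundaries)  # boundaries are frame indices
    return np.stack([seg.mean(axis=0) for seg in segments if len(seg) > 0])

# Fixed-length subsampling: 50 Hz -> 25 Hz (compression ratio 2).
compressed_fixed = fixed_length_subsample(frames, factor=2)

# Variable-length subsampling: one vector per assumed phone segment; with
# roughly ten segments per second, the average rate is on the order of 10 Hz.
phone_boundaries = np.arange(50, 500, 50)  # placeholder boundary indices
compressed_var = variable_length_subsample(frames, phone_boundaries)

print(compressed_fixed.shape, compressed_var.shape)  # (250, 768) (10, 768)
```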
Appears in Collections: 電信工程學研究所
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-111-2.pdf | 3.9 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.