Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90193

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李宏毅 | zh_TW |
| dc.contributor.advisor | Hung-yi Lee | en |
| dc.contributor.author | 陳子晴 | zh_TW |
| dc.contributor.author | Zih-Ching Chen | en |
| dc.date.accessioned | 2023-09-22T17:47:59Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-09-22 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-08-11 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90193 | - |
| dc.description.abstract | 本研究的目標是探索輕量化微調方法在自監督語音模型中的應用,以更有效率地使用自監督式語音模型。研究表明,自監督學習對於各種語音任務都有著很大的潛力,可以透過微調的方式被應用於不同的下游語音任務中。然而,傳統的微調方法在處理數百萬個參數的自監督學習模型時存在著參數使用效率低的問題。為了解決這個問題,我們引入了附加器,這是一種在自然語言處理中常用的輕量級模塊,來讓自監督式預訓練語音模型更好且更有效率地被應用到下游任務當中。
在本研究中,我們將自監督式預訓練語音模型的參數凍結,僅對附加器部分的參數進行微調。考慮到目前對於附加器在自監督語音任務中的有效性缺乏研究,我們透過在預訓練的語音自監督學習模型中添加不同的附加器模塊來填補這一空白。具體而言,我們將不同的輕量化微調方法應用於基於SUPERB基準的自監督語音模型。我們提出了一個附加器框架,用於處理多個下游語音處理任務,例如語音識別、分類和說話者識別。透過這項研究,我們希望能夠有效利用輕量化微調方法來提升語音模型的性能,並為語音處理領域中的多個下游任務提供更好的解決方案。 | zh_TW |
| dc.description.abstract | In this study, we aim to explore efficient fine-tuning methods for self-supervised speech representation learning. Recent research has demonstrated the potential of self-supervised learning for various speech tasks. However, traditional fine-tuning approaches suffer from inefficiency in parameter usage when dealing with large-scale self-supervised models. To address this issue, we introduce adapters, lightweight modules commonly used in natural language processing, so that pre-trained self-supervised speech models can be applied to downstream tasks more effectively and efficiently.
Our approach involves freezing the parameters of the self-supervised learning model and fine-tuning only the adapter modules for downstream tasks. Given the lack of research on the effectiveness of adapters in self-supervised speech tasks, we fill this gap by incorporating different adapter modules into pre-trained speech self-supervised learning models. Specifically, we apply different efficient fine-tuning methods, including adapter fine-tuning and prompt fine-tuning, to self-supervised speech models based on the SUPERB benchmark. We propose an adapter framework that can handle multiple downstream speech processing tasks, such as speech recognition, classification, and speaker identification. Through this research, we aim to leverage efficient fine-tuning methods to enhance the performance of speech models and to provide better solutions for multiple downstream tasks in the field of speech processing. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T17:47:59Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-09-22T17:47:59Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 致謝
摘要
Abstract
目錄
圖目錄
表目錄
符號列表
第一章 簡介
1.1 語音自監督學習的背景和動機
1.2 附加器的概述和作用
1.3 輕量化語音微調技術之探討
1.4 研究目標和貢獻
第二章 文獻探討
2.1 轉換器的基本概念
2.1.1 注意力層
2.1.2 全連接層
2.1.3 殘差連接和歸一化
2.1.4 轉換器層
2.2 自監督學習在語音處理中的應用
2.3 附加器的基本概念
第三章 輕量化微調方法於自監督式語音任務
3.1 前置知識
3.2 附加器之架構設計和實現
3.2.0.1 Houlsby附加器
3.2.0.2 LoRA
3.2.0.3 AdapterBias
3.2.0.4 BitFit
3.2.0.5 卷積神經網路附加器
3.2.0.6 附加器的可疊加性
3.3 實驗的可比性
3.3.1 語音下游任務
3.3.1.1 語音識別
3.3.1.2 語者適應
3.3.1.3 語音分類
第四章 實驗結果與討論
4.1 實驗設置和數據集介紹
4.2 不同附加器模塊的效果比較
4.2.1 CHAPTER 與更多參數量的 Houlsby 附加器之比較
4.2.2 不同訓練目標的自監督式上游模型
4.3 實驗結果分析與討論
4.3.1 低資源適應的穩定性
4.3.1.1 有效調整方法的學習率魯棒性
4.3.2 實驗結果與討論
4.3.2.1 Houlsby附加器的效能分析
4.3.2.2 不同類型任務的效能分析
第五章 結論
參考文獻 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 自監督式語音模型 | zh_TW |
| dc.subject | 預訓練模型 | zh_TW |
| dc.subject | 輕量化微調方法 | zh_TW |
| dc.subject | 附加器 | zh_TW |
| dc.subject | Adapter | en |
| dc.subject | Parameter-efficient Fine-tuning | en |
| dc.subject | Pre-trained Model | en |
| dc.subject | Self-supervised speech model | en |
| dc.title | 輕量化微調於自監督式語音模型之探討 | zh_TW |
| dc.title | Exploring Parameter-Efficient Tuning in Self-supervised Speech Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 王新民;曹昱;陳尚澤 | zh_TW |
| dc.contributor.oralexamcommittee | Hsin-Min Wang;Yu Tsao;Shang-Tse Chen | en |
| dc.subject.keyword | 輕量化微調方法, 附加器, 預訓練模型, 自監督式語音模型 | zh_TW |
| dc.subject.keyword | Parameter-efficient Fine-tuning, Adapter, Pre-trained Model, Self-supervised speech model | en |
| dc.relation.page | 49 | - |
| dc.identifier.doi | 10.6342/NTU202303836 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2023-08-12 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
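
The abstracts above describe the core recipe of the thesis: freeze the pre-trained self-supervised speech model and fine-tune only small adapter modules for each downstream task. As a rough illustration only (this is not the thesis implementation; the class names, bottleneck size, and the stand-in backbone below are assumptions made for the sketch), a minimal PyTorch example of a Houlsby-style bottleneck adapter attached to a frozen encoder might look like this:

```python
# Minimal sketch, assuming a PyTorch environment; names are illustrative only.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Start near the identity so training begins close to the frozen model.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedLayer(nn.Module):
    """Wraps one frozen encoder layer and applies a trainable adapter to its output."""

    def __init__(self, frozen_layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.layer = frozen_layer
        self.adapter = BottleneckAdapter(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


if __name__ == "__main__":
    hidden_dim = 768  # typical hidden size of a Base-sized speech encoder
    # Stand-in for a pre-trained Transformer encoder (e.g., HuBERT / wav2vec 2.0).
    backbone = nn.Sequential(*[nn.Linear(hidden_dim, hidden_dim) for _ in range(4)])

    # Freeze every backbone parameter; only adapter parameters remain trainable.
    for p in backbone.parameters():
        p.requires_grad = False

    model = nn.Sequential(*[AdaptedLayer(layer, hidden_dim) for layer in backbone])
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # only adapters are updated

    features = torch.randn(2, 100, hidden_dim)  # (batch, frames, hidden_dim)
    output = model(features)
    print(output.shape, "trainable params:", sum(p.numel() for p in trainable))
```

In a SUPERB-style experiment the stand-in backbone would be replaced by an actual pre-trained encoder, and a task-specific downstream head (for example, a CTC head for speech recognition) would be trained together with the adapters.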
Appears in Collections: 電信工程學研究所
Files in This Item:

| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf (access restricted to NTU campus IPs; use the VPN service from off campus) | 1.34 MB | Adobe PDF |
