Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91351
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李宏毅 | zh_TW |
dc.contributor.advisor | Hung-yi Lee | en |
dc.contributor.author | 陳宣叡 | zh_TW |
dc.contributor.author | Hsuan-Jui Chen | en |
dc.date.accessioned | 2024-01-03T16:14:04Z | - |
dc.date.available | 2024-01-04 | - |
dc.date.copyright | 2024-01-03 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-12-13 | - |
dc.identifier.citation | [1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, 2020.
[2] H. Cai, C. Gan, T. Wang, Z. Zhang, et al. Once-for-all: Train one network and specialize it for efficient deployment. In ICLR, 2020.
[3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, pages 4960–4964. IEEE, 2016.
[4] H.-J. Chang, S.-w. Yang, and H.-y. Lee. DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT. In ICASSP, pages 7087–7091. IEEE, 2022.
[5] X. Chang, T. Maekaku, P. Guo, J. Shi, Y.-J. Lu, A. S. Subramanian, T. Wang, S.-w. Yang, Y. Tsao, H.-y. Lee, et al. An exploration of self-supervised pretrained representations for end-to-end speech recognition. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 228–235. IEEE, 2021.
[6] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
[7] L. Dong and B. Xu. CIF: Continuous integrate-and-fire for end-to-end speech recognition. In ICASSP, pages 6079–6083. IEEE, 2020.
[8] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[9] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
[10] C.-I. J. Lai, Y. Zhang, A. H. Liu, S. Chang, Y.-L. Liao, Y.-S. Chuang, K. Qian, S. Khurana, D. Cox, and J. Glass. PARP: Prune, adjust and re-prune for self-supervised speech recognition. Advances in Neural Information Processing Systems, 34:21256–21272, 2021.
[11] Y. Lee, K. Jang, J. Goo, Y. Jung, and H. R. Kim. FitHuBERT: Going thinner and deeper for knowledge distillation of speech self-supervised models. In Interspeech, pages 3588–3592, 2022.
[12] A. H. Liu, W.-N. Hsu, M. Auli, and A. Baevski. Towards end-to-end unsupervised speech recognition. arXiv preprint arXiv:2204.02492, 2022.
[13] Y. Meng, H.-J. Chen, J. Shi, S. Watanabe, P. Garcia, H.-y. Lee, and H. Tang. On compressing sequences for self-supervised speech models. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 1128–1135, 2023.
[14] Y. Miao, M. Gowayyed, and F. Metze. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In ASRU, pages 167–174. IEEE, 2015.
[15] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In ICASSP, pages 5206–5210. IEEE, 2015.
[16] A. Pasad, J.-C. Chou, and K. Livescu. Layer-wise analysis of a self-supervised speech representation model. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 914–921. IEEE, 2021.
[17] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[18] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler. Efficient transformers: A survey. ACM Computing Surveys (CSUR), 2020.
[19] H.-S. Tsai, H.-J. Chang, W.-C. Huang, Z. Huang, et al. SUPERB-SG: Enhanced speech processing universal performance benchmark for semantic and generative capabilities. In ACL, 2021.
[20] V. Vanhoucke, M. Devin, and G. Heigold. Multiframe deep neural networks for acoustic modeling. In ICASSP, pages 7582–7585. IEEE, 2013.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[22] A. Vyas, W.-N. Hsu, M. Auli, and A. Baevski. On-demand compute reduction with stochastic wav2vec 2.0. In Interspeech, pages 3048–3052, 2022.
[23] R. Wang, Q. Bai, J. Ao, L. Zhou, et al. LightHuBERT: Lightweight and configurable speech representation learning with once-for-all hidden-unit BERT. In Interspeech, pages 1686–1690, 2022.
[24] F. Wu, K. Kim, J. Pan, K. J. Han, K. Q. Weinberger, and Y. Artzi. Performance-efficiency trade-offs in unsupervised pre-training for speech recognition. In ICASSP, pages 7667–7671. IEEE, 2022.
[25] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, et al. SUPERB: Speech processing universal performance benchmark. In Interspeech, pages 1194–1198, 2021.
[26] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91351 | - |
dc.description.abstract | 自監督式語音模型(self-supervised speech models)在現今多項語音下游任務中達到了最先進的結果,同時展現了其在不同下游任務的泛用性,為了降低自監督式語音模型的運算量以在不同的裝置運算限制之下運行,多種不同的技術被應用來降低自監督式語音模型的運算成本,其中序列壓縮法利用語音模型的特性,以減少序列長度的方式降低運算量。本論文提出各項任務泛用序列壓縮法,讓單一的預訓練模型能根據下游任務需求動態的改變其序列壓縮率。
首先,本論文將所提出之各項任務序列壓縮法應用在兩種自監督式語音模型:知識蒸餾模型及對比式預訓練模型上,並將結果驗證在SUPERB基準中的多項語音下游任務當中。所提出之方法將前作預訓練模型所使用的單一序列壓縮率擴展到連續可用的壓縮率區間,同時將驗證的序列壓縮率進一步推進到了最大48倍的壓縮率。 接著,為了更進一步節省使用網格搜尋(grid search)尋找最佳結果所帶來的額外運算量,本論文實驗同時優化下游任務模型及上游預訓練模型壓縮率,比較此設定所得之結果和以網格搜尋所得之最佳結果間的差異,初步驗證了所提出之框架在不需要網格搜尋的前提下亦能找到最佳下游任務結果。 | zh_TW |
dc.description.abstract | Self-supervised speech models achieve state-of-the-art results on many speech downstream tasks, demonstrating their generalizability across tasks. To operate under the computational constraints of different devices, several techniques have been applied to lower the computational cost of self-supervised speech models; among them, sequence compression exploits properties of speech models to reduce computation by shortening the sequence length. This thesis proposes a once-for-all sequence compression method for self-supervised speech models that enables a single pre-trained model to change its sequence compression rate on demand at inference time.
First, the thesis applies the proposed once-for-all sequence compression method to two self-supervised speech models, a knowledge distillation model and a contrastively pre-trained model, and evaluates the results on several downstream tasks from the SUPERB benchmark. The proposed method extends the single sequence compression rate used in prior work to a continuous range of operating compression rates and pushes the upper limit of sequence compression to 48 times. To further avoid the extra computation of searching for the best configuration by grid search, the thesis experiments with tuning the upstream compression rate jointly with the downstream model. Comparing the results of this adaptive compression rate learning with the overall best results obtained by grid search shows that the proposed framework can find near-optimal downstream results without a grid search. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-01-03T16:14:04Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-01-03T16:14:04Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Oral Examination Committee Approval Certificate i
Abstract (Chinese) iii
Abstract v
Table of Contents vii
List of Figures xi
List of Tables xv
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Methodology 2
1.3 Main Contributions 3
1.4 Thesis Organization 3
Chapter 2 Background 5
2.1 Deep Neural Networks 5
2.1.1 Overview 5
2.1.2 Convolutional Neural Networks 6
2.1.3 Transformer Networks 8
2.2 Self-supervised Learning for Speech 9
2.2.1 Overview 9
2.2.2 Contrastive Pre-trained Models 10
2.2.3 Knowledge Distillation Models 11
2.2.4 Speech Downstream Tasks 13
2.2.4.1 Sequence-to-Sequence 15
2.2.4.2 Connectionist Temporal Classification 15
2.2.4.3 Sequence-level Aggregation 16
2.2.4.4 Sequence-level Matching 17
2.3 Sequence Compression Methods 18
2.3.1 Overview 18
2.3.2 Continuous Integrate-and-Fire 20
2.3.3 Variable-stride Sequence Compression 21
2.3.4 Time and Space Complexity 23
Chapter 3 Once-for-all Sequence Compression 25
3.1 Once-for-all Sequence Compression for Self-supervised Speech Models 25
3.1.1 Overview 25
3.1.2 General Model Architecture 27
3.1.3 Adjustable Subsampling Layers 29
3.1.4 Model Pre-training 32
3.1.5 Model Evaluation 33
3.2 Once-for-all Sequence Compression for Knowledge Distillation Models 34
3.2.1 Overview 34
3.2.2 Model Architecture 35
3.2.3 Model Configuration 35
3.2.4 Experimental Results and Analysis 36
3.2.4.1 Content-related Tasks 37
3.2.4.2 Speaker-related Tasks 38
3.2.4.3 Semantics-related Tasks 39
3.2.4.4 Paralinguistics-related Tasks 40
3.2.4.5 Semantic and Generative Tasks 40
3.2.5 Computational Cost Analysis 41
3.3 Once-for-all Sequence Compression for Contrastive Pre-trained Models 43
3.3.1 Overview 43
3.3.2 Model Architecture 43
3.3.3 Model Configuration 44
3.3.4 Experimental Results and Analysis 46
3.3.4.1 Content-related Tasks 46
3.3.4.2 Speaker-related Tasks 47
3.3.4.3 Semantics-related Tasks 47
3.3.4.4 Paralinguistics-related Tasks 48
3.3.4.5 Semantic and Generative Tasks 49
3.3.5 Computational Cost Analysis 50
3.4 Chapter Summary 51
Chapter 4 A Study of Adaptive Sequence Compression Rates 53
4.1 Adaptive Sequence Compression Rate Method 53
4.1.1 Overview 53
4.1.2 Jointly Optimizing the Downstream Task and the Compression Rate 54
4.1.3 Analysis of Continuity and Differentiability 55
4.2 Experimental Setup and Result Analysis 56
4.2.1 Model Configuration and Initial Compression Rate 56
4.2.2 Experimental Results and Analysis 57
4.3 Chapter Summary 59
Chapter 5 Conclusion and Future Work 61
5.1 Contributions and Discussion 61
5.2 Future Work 62
References 65 | - |
dc.language.iso | zh_TW | - |
dc.title | 用於自監督式語音模型之各項任務泛用序列壓縮法 | zh_TW |
dc.title | Once-for-all Sequence Compression for Self-supervised Speech Models | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 王新民;曹昱;陳尚澤 | zh_TW |
dc.contributor.oralexamcommittee | Hsin-Min Wang;Yu Tsao;Shang-Tse Chen | en |
dc.subject.keyword | 自監督式學習, 序列壓縮法, 各項任務泛用訓練 | zh_TW |
dc.subject.keyword | self-supervised learning, sequence compression, once-for-all training | en |
dc.relation.page | 68 | - |
dc.identifier.doi | 10.6342/NTU202304502 | - |
dc.rights.note | Authorization granted (worldwide open access) | - |
dc.date.accepted | 2023-12-14 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Graduate Institute of Communication Engineering | - |
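The abstract above describes an upstream speech model whose sequence compression rate can be selected on demand at inference time, with a single pre-trained model covering a whole range of rates. The following PyTorch sketch is only a rough, hypothetical illustration of that idea: the module names, the average-pooling subsampling operator, the layer split, and the candidate rates are assumptions made for this example and are not taken from the thesis.

```python
# Minimal, hypothetical sketch of on-demand sequence compression for a
# self-supervised speech encoder. Average pooling with a configurable stride
# stands in for the subsampling step; the actual thesis model may use a
# different operator (e.g. continuous integrate-and-fire) and layout.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class AdjustableSubsample(nn.Module):
    """Shortens a (batch, time, dim) feature sequence by an integer rate."""

    def forward(self, x: torch.Tensor, rate: int) -> torch.Tensor:
        if rate == 1:
            return x
        batch, time, dim = x.shape
        pad = (-time) % rate                      # pad so time divides evenly
        x = F.pad(x, (0, 0, 0, pad))              # zero-pad along the time axis
        return x.reshape(batch, -1, rate, dim).mean(dim=2)  # mean-pool windows


class CompressibleEncoder(nn.Module):
    """Transformer-style encoder whose compression rate is chosen per call."""

    def __init__(self, dim: int = 256, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.lower = nn.TransformerEncoder(layer, num_layers=n_layers // 2)
        self.subsample = AdjustableSubsample()
        self.upper = nn.TransformerEncoder(layer, num_layers=n_layers // 2)

    def forward(self, feats: torch.Tensor, rate: int = 2) -> torch.Tensor:
        hidden = self.lower(feats)             # full-resolution lower layers
        hidden = self.subsample(hidden, rate)  # shorten the sequence by `rate`
        return self.upper(hidden)              # upper layers run on fewer frames


if __name__ == "__main__":
    encoder = CompressibleEncoder()
    feats = torch.randn(2, 96, 256)            # dummy batch of frame features
    # Once-for-all style training would sample a rate per batch so that one
    # set of weights supports a range of rates at inference (rates are made up).
    for _ in range(3):
        rate = random.choice([1, 2, 4, 8])
        reps = encoder(feats, rate=rate)
        print(rate, reps.shape)                # time axis shrinks as rate grows
```

Sampling the rate per batch during pre-training, as in the loop above, is what would let one model serve every rate at inference; a fixed-rate model would instead hard-code a single `rate`.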
Appears in Collections: | Graduate Institute of Communication Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 4.9 MB | Adobe PDF | View/Open |