Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89971

Full metadata record (DC field, value, language):
dc.contributor.advisor (zh_TW): 李宏毅
dc.contributor.advisor (en): Hung-yi Lee
dc.contributor.author (zh_TW): 黃世丞
dc.contributor.author (en): Shih-Cheng Huang
dc.date.accessioned: 2023-09-22T16:53:18Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-09-22
dc.date.issued: 2023
dc.date.submitted: 2023-08-08
dc.identifier.citation:
[1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016.
[2] A. Antoniou, H. Edwards, and A. Storkey. How to train your maml. arXiv preprint arXiv:1810.09502, 2018.
[3] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] D. Bahdanau, K. H. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.
[5] T. Bansal, R. Jha, and A. McCallum. Learning to few-shot learn across diverse natural language classification tasks. arXiv preprint arXiv:1911.03863, 2019.
[6] Y. Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009.
[7] R. Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
[10] C.-L. Fu, Z.-C. Chen, Y.-R. Lee, and H.-y. Lee. Adapterbias: Parameter-efficient token-dependent representation shift for adapters in nlp tasks. arXiv preprint arXiv:2205.00305, 2022.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[12] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[13] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[14] K. Kawaguchi. A multithreaded software model for backpropagation neural network applications. The University of Texas at El Paso, 2000.
[15] T. Khot, A. Sabharwal, and P. Clark. Scitail: A textual entailment dataset from science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[18] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[19] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[20] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[21] B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
[22] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
[23] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych. Adapterfusion: Nondestructive task composition for transfer learning. arXiv preprint arXiv:2005.00247, 2020.
[24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
[26] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
[27] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
[28] P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
[29] A. C. Stickland and I. Murray. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning, pages 5986–5995. PMLR, 2019.
[30] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. Advances in neural information processing systems, 28, 2015.
[31] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[33] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.
[34] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multitask benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[35] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. Awadallah, and J. Gao. List: Lite prompted self-training makes parameter-efficient few-shot learners. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2262–2281, 2022.
[36] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao. Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models. arXiv preprint arXiv:2205.12410, 2022.
[37] S. Wu, H. R. Zhang, and C. Ré. Understanding and improving information transfer in multi-task learning. arXiv preprint arXiv:2005.00944, 2020.
[38] Q. Ye, B. Y. Lin, and X. Ren. Crossfit: A few-shot learning challenge for cross-task generalization in nlp. arXiv preprint arXiv:2104.08835, 2021.
[39] E. B. Zaken, S. Ravfogel, and Y. Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
[40] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89971
dc.description.abstract (zh_TW):
隨著預訓練語言模型(Pre-trained Language Model)的參數量變得越來越大,輕量化微調(Parameter-Efficient Fine-tuning)顯得更為重要,但在少樣本學習(Few-Shot Learning)的情境下進行輕量化微調的效果卻遠遠不及微調整個預訓練模型。

為了解決這個問題,本研究提出在進行輕量化微調前加入一個稱為「引導」(Priming)的訓練過程來強化預訓練語言模型的輕量化微調效果,並且在一個包含 160 個不同自然語言處理任務的少樣本資料集上驗證了本方法的有效性。相較於直接進行輕量化微調,經過引導的模型在 ARG(Average Relative Gain)分數上達到了近 30% 的進步量,其表現也超越了其他的輕量化微調基準模型。

除此之外,我們針對引導模型的方法進行了系統性的實驗,分析了在引導階段使用不同訓練演算法和訓練不同參數對於引導效果的影響,並找出最有效的引導方法。本研究的結果將能有效增強輕量化微調在少樣本學習上的表現,並使得大型預訓練語言模型的微調和使用更加有效率。
dc.description.abstract (en):
As the parameter size of pre-trained language models (PLMs) continues to grow, parameter-efficient fine-tuning becomes increasingly important. However, in the few-shot learning setting, parameter-efficient fine-tuning performs far worse than fine-tuning the entire pre-trained model.

To address this issue, this study proposes a training process called "priming" that strengthens the pre-trained language model before downstream parameter-efficient fine-tuning. The effectiveness of this method was verified on a few-shot dataset consisting of 160 different NLP tasks. Compared to performing parameter-efficient fine-tuning directly, the primed model achieved an improvement of nearly 30% in ARG (Average Relative Gain) and outperformed other parameter-efficient fine-tuning baselines.

In addition, we conducted systematic experiments to analyze the impact of different upstream training algorithms and different sets of upstream trainable parameters, and to identify the most effective priming method. The results of this study enhance the performance of parameter-efficient fine-tuning in few-shot learning and make the fine-tuning and use of large-scale pre-trained language models more efficient.
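The abstract reports an improvement of nearly 30% in ARG (Average Relative Gain) over direct parameter-efficient fine-tuning. As a point of reference only, a commonly used definition of this metric (for example, in the CrossFit benchmark of reference [38]) averages each task's relative improvement over a reference method; the notation below is a hedged sketch of that idea, not necessarily the exact formula used in the thesis:

$$ \mathrm{ARG} \;=\; \frac{100\%}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{s_t^{\mathrm{method}} - s_t^{\mathrm{ref}}}{s_t^{\mathrm{ref}}} $$

where $\mathcal{T}$ is the set of few-shot evaluation tasks, $s_t^{\mathrm{method}}$ is the score of the evaluated method on task $t$ (here, priming followed by parameter-efficient fine-tuning), and $s_t^{\mathrm{ref}}$ is the score of the reference method on the same task (here, parameter-efficient fine-tuning without priming).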
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T16:53:18Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2023-09-22T16:53:18Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Table of Contents
Acknowledgements  iii
Abstract (Chinese)  v
Abstract (English)  vii
Table of Contents  ix
List of Figures  xiii
List of Tables  xv
Chapter 1  Introduction  1
  1.1 Research Motivation  1
  1.2 Research Direction  3
  1.3 Main Contributions  3
  1.4 Thesis Organization  3
Chapter 2  Background  5
  2.1 Deep Neural Networks  5
    2.1.1 Fundamentals  5
    2.1.2 Transformer Networks  6
    2.1.3 Adapter Modules  11
  2.2 Multi-Task Learning  14
  2.3 Meta-Learning  14
    2.3.1 Fundamentals  14
    2.3.2 Model-Agnostic Meta-Learning  15
Chapter 3  Training Framework for Transformer Models with Adapters  17
  3.1 Task Overview  17
    3.1.1 Datasets  17
  3.2 Model Architectures  19
    3.2.1 BERT  19
    3.2.2 BART  20
  3.3 Model Training Framework  22
    3.3.1 Overview  22
    3.3.2 Upstream Training Stage  24
    3.3.3 Downstream Fine-Tuning Stage  27
    3.3.4 Specific Combinations of Trainable Parameters  28
  3.4 Experimental Setup  28
    3.4.1 Hyperparameters  28
    3.4.2 Adapters  29
    3.4.3 Evaluation Metrics  29
Chapter 4  Preliminary Experiments on GLUE  31
  4.1 Model Training Methods  31
    4.1.1 Multi-Task Learning  31
    4.1.2 Model-Agnostic Meta-Learning  32
  4.2 Experimental Results  33
    4.2.1 Performance with MNLI as the Test Task  33
    4.2.2 Effect of Removing Training Tasks  34
    4.2.3 Performance with Other Tasks as Downstream Test Tasks  34
  4.3 Summary  35
Chapter 5  Analysis and Discussion of the Main Experimental Results  37
  5.1 Overview  37
  5.2 Experimental Results  37
  5.3 Upstream Training Methods  38
  5.4 Combinations of Upstream Trainable Parameters  40
  5.5 Tasks  41
  5.6 Summary  42
Chapter 6  Conclusion and Future Work  43
  6.1 Contributions and Discussion  43
  6.2 Future Work  43
References  45
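The table of contents covers adapter modules (Section 2.1.3) and a training framework in which only a small set of parameters is updated on top of a BERT or BART backbone (Chapter 3). As a rough, hypothetical illustration of this kind of parameter-efficient module (the exact adapter design and placement used in the thesis are not specified on this page), a Houlsby-style bottleneck adapter in the spirit of reference [12] can be sketched in PyTorch as follows; the names BottleneckAdapter, bottleneck_size, and mark_adapters_trainable are mine, and the hidden size of 768 simply matches BERT-base/BART-base.

# Minimal sketch of a bottleneck adapter (cf. reference [12]); illustrative only, not the thesis code.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual connection."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual path keeps the adapter close to an identity map at initialization.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def mark_adapters_trainable(model: nn.Module) -> None:
    """Freeze the backbone; leave only adapter (and, typically, task-head) parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) or ("classifier" in name)

if __name__ == "__main__":
    adapter = BottleneckAdapter(hidden_size=768)  # hidden size of BERT-base / BART-base
    x = torch.randn(2, 16, 768)                   # (batch, sequence length, hidden size)
    print(adapter(x).shape)                       # torch.Size([2, 16, 768])

During downstream few-shot fine-tuning, only such inserted modules (and the task head) would be updated, which is what makes the approach parameter-efficient; the priming step described in the abstract would additionally train some of these parameters upstream (for example with multi-task learning or MAML) before the downstream stage.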
dc.language.iso: zh_TW
dc.subject (zh_TW): 少樣本學習
dc.subject (zh_TW): 自然語言處理
dc.subject (zh_TW): 附加器
dc.subject (zh_TW): 輕量化微調
dc.subject (zh_TW): 元學習
dc.subject (zh_TW): 多任務學習
dc.subject (en): meta-learning
dc.subject (en): natural language processing
dc.subject (en): few-shot learning
dc.subject (en): multi-task learning
dc.subject (en): adapter
dc.subject (en): parameter-efficient fine-tuning
dc.title (zh_TW): 輕量化微調的預訓練語言模型引導之系統性分析
dc.title (en): Systematic Analysis of Pre-trained Language Model Priming for Parameter-efficient Fine-tuning
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee (zh_TW): 李琳山;曹昱;蔡宗翰
dc.contributor.oralexamcommittee (en): Lin-shan Lee; Yu Tsao; Tzong-Han Tsai
dc.subject.keyword (zh_TW): 自然語言處理, 附加器, 輕量化微調, 元學習, 多任務學習, 少樣本學習
dc.subject.keyword (en): natural language processing, adapter, parameter-efficient fine-tuning, meta-learning, multi-task learning, few-shot learning
dc.relation.page: 48
dc.identifier.doi: 10.6342/NTU202302478
dc.rights.note: 同意授權 (open access worldwide)
dc.date.accepted: 2023-08-09
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in Collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in This Item:
File            Size      Format
ntu-111-2.pdf   1.48 MB   Adobe PDF