Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92537
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 劉邦鋒 | zh_TW |
dc.contributor.advisor | Pangfeng Liu | en |
dc.contributor.author | 邱麒羽 | zh_TW |
dc.contributor.author | Chi-Yu Chiu | en |
dc.date.accessioned | 2024-04-02T16:12:57Z | - |
dc.date.available | 2024-04-03 | - |
dc.date.copyright | 2024-04-02 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-03-28 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92537 | - |
dc.description.abstract | 近年來,Transformer 已成為語言模型中的重要架構,並在許多自然語言處理任務中取得了高效能。然而,由於模型尺寸大且推理時間長,高效部署 Transformer 架構面臨挑戰。因此,模型壓縮技術對於減少模型大小和推理時間變得非常重要。
權重剪枝是一種重要的模型壓縮技術,它會移除模型中的部分權重。然而,剪枝 Transformer 模型面臨一項挑戰:剪枝後,Transformer 模型需要重複整個訓練過程,包括在大型通用資料集上的預訓練和在小型下游資料集上的微調,才能恢復準確度。整個訓練過程需要很長的時間和大量的運算資源。為了應對這項挑戰,我們提出了一種結合知識蒸餾的剪枝方法,在恢復準確度的同時避免長時間的重新訓練。我們使用 2:4 剪枝作為基本剪枝方法。2:4 剪枝是 NVIDIA 提出的方法,在權重矩陣每一行的每四個連續元素中保留絕對值較大的兩個。我們將 2:4 剪枝推廣為 N:M 剪枝,即在權重矩陣每一行的每 M 個連續元素中保留絕對值較大的 N 個。知識蒸餾是另一種模型壓縮方法,它讓小模型(稱為學生)向大模型(稱為教師)學習。在我們的方法中,我們使用 N:M 剪枝將模型統一剪枝為 N:M 結構,接著透過知識蒸餾在下游資料集上進行兩階段微調。使用我們的方法,剪枝後的模型只需使用下游資料集即可達到相當的準確度,並且比傳統的重新訓練花費少得多的時間。我們使用 DistilBERT 在 SQuAD 和 GLUE 資料集上進行實驗。實驗結果顯示,1:4 結構的 DistilBERT 在 SQuAD v1.1 和 SQuAD v2.0 資料集上可以達到相當的準確度,與原始密集模型相比,推理速度提高了 1.7 倍。 | zh_TW |
dc.description.abstract | In recent years, the Transformer has become an important architecture for language models and has achieved high performance on many natural language processing tasks. However, deploying Transformer architectures efficiently is challenging because of their large model size and long inference time. Therefore, model compression techniques are important for reducing model size and inference time.
Weight pruning is a prominent model compression technique that removes some of the weights in a model. However, pruning Transformer models faces a challenge: after pruning, a Transformer model must repeat the whole training process, including pre-training on a large general-purpose dataset and fine-tuning on a small downstream dataset, to recover its accuracy. This full training process takes a long time and substantial computational resources. To address this challenge, we propose a pruning method combined with knowledge distillation that avoids the long re-training time while recovering accuracy. We use 2:4 pruning as our basic pruning method. 2:4 pruning, proposed by NVIDIA, keeps the two elements with the larger absolute values among every four consecutive elements in each row of a weight matrix. We generalize 2:4 pruning to N:M pruning, which keeps the N elements with the largest absolute values among every M consecutive elements in each row of a weight matrix. Knowledge distillation is another model compression method in which a small model, referred to as the student, learns from a large model, referred to as the teacher. In our method, we use N:M pruning to uniformly prune the model into an N:M structure. Next, we perform two-stage fine-tuning on the downstream dataset with knowledge distillation. With our method, the pruned models achieve comparable accuracy using only downstream datasets and take much less time than traditional retraining. We run our experiments on the SQuAD and GLUE datasets using DistilBERT. The experimental results show that DistilBERT with a 1:4 structure achieves comparable accuracy on the SQuAD v1.1 and SQuAD v2.0 datasets and a 1.7x inference speedup compared to the original dense model. | en |
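To make the two techniques described in the abstract concrete, the sketch below shows a minimal, hypothetical PyTorch implementation of N:M magnitude pruning and a temperature-scaled distillation loss. This is an illustrative assumption, not the thesis's actual code: the function names (`nm_prune_mask`, `distillation_loss`) and the hyperparameter values (`alpha`, `temperature`) are made up for the example.

```python
# Minimal sketch (assumption, not the thesis implementation): N:M magnitude
# pruning of a weight matrix plus a temperature-scaled distillation loss.
import torch
import torch.nn.functional as F


def nm_prune_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the N largest-magnitude values in every M consecutive elements of each row.

    Assumes the number of columns is divisible by M, which holds for typical
    Transformer weight matrices.
    """
    rows, cols = weight.shape
    groups = weight.abs().reshape(rows, cols // m, m)   # split each row into M-element groups
    topk = groups.topk(n, dim=-1).indices               # positions of the N largest magnitudes per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)                        # mark kept positions with 1.0
    return mask.reshape(rows, cols)


def distillation_loss(student_logits, teacher_logits, labels,
                      alpha: float = 0.5, temperature: float = 2.0):
    """Weighted sum of a soft-target KL term (teacher) and a hard-target CE term (labels)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


# Example: prune a linear layer into a 1:4 structure (keep 1 of every 4 weights).
layer = torch.nn.Linear(768, 768, bias=False)
mask = nm_prune_mask(layer.weight.data, n=1, m=4)
layer.weight.data *= mask
print(f"fraction of weights kept: {mask.mean().item():.2f}")   # ~0.25 for 1:4
```

In a scheme like the one the abstract describes, the N:M structure would have to be preserved during the two-stage fine-tuning; one simple way to do that is to re-apply the fixed mask to the weights after each optimizer step.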
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-04-02T16:12:57Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-04-02T16:12:57Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | 口試委員審定書 i
致謝 ii
摘要 iii
Abstract v
Contents vii
List of Figures ix
List of Tables x
Chapter 1 Introduction 1
Chapter 2 Background And Related Works 5
2.1 Transformer Model 5
2.2 Masked Language Modeling 6
2.3 Unstructured Pruning 6
2.4 Structured Pruning 7
2.5 2:4 Pruning 7
2.6 Knowledge Distillation 8
2.7 Combine Pruning and Knowledge Distillation 10
Chapter 3 Scheme 11
3.1 Two-Stage Knowledge Distillation 12
3.1.1 Masked Language Modeling Distillation 12
3.1.2 Adapter-Aware Task Distillation 13
Chapter 4 Evaluation 15
4.1 Experimental Settings 15
4.1.1 Dataset 15
4.1.2 Model 16
4.1.3 Implementation 17
4.1.4 Training Details 17
4.2 Value of α 18
4.3 Performance of Fine-tuning Methods 19
4.4 Comparison With Prior Works 24
4.4.1 Speedup of Inference 25
Chapter 5 Conclusion 28
References 30 | - |
dc.language.iso | en | - |
dc.title | 語言模型的有效壓縮結合剪枝和知識蒸餾 | zh_TW |
dc.title | Effective Compression of Language Models by Combining Pruning and Knowledge Distillation | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | 碩士 | - |
dc.contributor.oralexamcommittee | 吳真貞;洪鼎詠 | zh_TW |
dc.contributor.oralexamcommittee | Jan-Jan Wu;Ding-Yong Hong | en |
dc.subject.keyword | 模型壓縮, 知識蒸餾, 權重剪枝, 語言模型 | zh_TW |
dc.subject.keyword | Model Compression, Knowledge Distillation, Weight Pruning, Language Model | en |
dc.relation.page | 34 | - |
dc.identifier.doi | 10.6342/NTU202400814 | - |
dc.rights.note | 未授權 | - |
dc.date.accepted | 2024-03-29 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
Appears in Collections: | 資訊網路與多媒體研究所
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-112-2.pdf (currently not authorized for public access) | 847.61 kB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.