Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92537
Title: | 語言模型的有效壓縮結合剪枝和知識蒸餾 Effective Compression of Language Models by Combining Pruning and Knowledge Distillation |
Authors: | 邱麒羽 Chi-Yu Chiu |
Advisor: | 劉邦鋒 Pangfeng Liu |
Keyword: | Model Compression, Knowledge Distillation, Weight Pruning, Language Model |
Publication Year: | 2024 |
Degree: | Master's |
Abstract: | In recent years, the Transformer has become an important architecture for language models and has achieved high performance on many natural language processing tasks. However, deploying Transformer architectures efficiently is challenging because of their large model size and long inference time. Model compression techniques have therefore become important for reducing model size and inference time. Weight pruning is a prominent model compression technique that removes some of the weights in a model. However, pruning Transformer models faces a challenge: after pruning, a Transformer model must repeat the whole training process, including pre-training on a large general-purpose dataset and fine-tuning on a small downstream dataset, to recover its accuracy. This full training process takes a long time and considerable computational resources. To address this challenge, we propose a pruning method combined with knowledge distillation that avoids long re-training while recovering accuracy. We use 2:4 pruning as our basic pruning method. 2:4 pruning, proposed by NVIDIA, keeps the two largest absolute values among every four consecutive elements in each row of a weight matrix. We generalize 2:4 pruning to N:M pruning, which keeps the N largest absolute values among every M consecutive elements in each row of a weight matrix. Knowledge distillation is another model compression method, in which a small model (the student) learns from a large model (the teacher). In our method, we first use N:M pruning to uniformly prune the model into an N:M structure, and then apply two-stage fine-tuning with knowledge distillation on the downstream dataset. With our method, the pruned models achieve comparable accuracy using only downstream datasets and take much less time than traditional re-training. We run experiments on the SQuAD and GLUE datasets using DistilBERT. The experimental results show that DistilBERT with a 1:4 structure achieves comparable accuracy on the SQuAD v1.1 and SQuAD v2.0 datasets and a 1.7x inference speedup compared to the original dense model. |
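To make the N:M pruning rule described in the abstract concrete, here is a minimal sketch in PyTorch. It is an illustrative reconstruction, not the thesis' code: the function name `nm_prune` and the dense-masking approach (rather than NVIDIA's sparse tensor-core format) are assumptions made for exposition.

```python
# Illustrative N:M magnitude pruning (assumed implementation, not from the thesis):
# keep the N largest-magnitude weights in every M consecutive elements of each row
# of a weight matrix and zero out the rest.
import torch

def nm_prune(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a copy of `weight` pruned to an N:M sparse pattern along each row."""
    rows, cols = weight.shape
    assert cols % m == 0, "row length must be divisible by M"
    # Group every M consecutive elements within each row.
    groups = weight.reshape(rows, cols // m, m)
    # Rank elements in each group by absolute value; keep the indices of the top N.
    idx = groups.abs().argsort(dim=-1, descending=True)
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, idx[..., :n], 1.0)
    # Zero everything outside the top N and restore the original shape.
    return (groups * mask).reshape(rows, cols)

# Example: 2:4 pruning keeps two values out of every four consecutive elements.
w = torch.randn(768, 768)
w_pruned = nm_prune(w, n=2, m=4)
```

With n=2, m=4 this reproduces NVIDIA's 2:4 pattern; with n=1, m=4 it gives the 1:4 structure evaluated on DistilBERT in the abstract.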
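The abstract also describes recovering accuracy through fine-tuning with knowledge distillation on the downstream dataset. A common form of the distillation objective is sketched below as an assumption for illustration; the thesis' exact losses, temperature, and weighting are not specified in this record. In the setting described above, the dense model presumably acts as the teacher and its N:M-pruned copy as the student.

```python
# Illustrative knowledge-distillation objective (assumed form, not the thesis' exact loss):
# the pruned student matches the teacher's softened output distribution while also
# fitting the downstream task labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft-target loss: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target loss: ordinary cross-entropy on the downstream labels.
    hard = F.cross_entropy(student_logits, labels)
    # Weighted combination of the two terms.
    return alpha * soft + (1 - alpha) * hard
```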
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92537 |
DOI: | 10.6342/NTU202400814 |
Fulltext Rights: | Not authorized |
Appears in Collections: | Graduate Institute of Networking and Multimedia |
Files in This Item:
File | Size | Format
---|---|---
ntu-112-2.pdf (Restricted Access) | 847.61 kB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.