Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97758
Title: 無安全性資料下緩解大型語言模型微調造成之安全性減損方法探討
A Study on Methods for Mitigating Safety Degradation Caused by Fine-Tuning Large Language Models without Safety Data
Authors: 樊樺
Hua Farn
Advisor: 李宏毅
Hung-Yi Lee
Keywords: Large Language Model, Catastrophic Forgetting, Safety, Model Merging
Publication Year: 2025
Degree: Master's
Abstract: With the increasing adoption of large language models (LLMs) across a wide range of natural language processing (NLP) tasks, model safety has become a central concern. To reduce the risk of harmful or inappropriate responses, most modern LLMs undergo human preference alignment during training so that their outputs better reflect human values and usage norms. However, recent studies have shown that this alignment can degrade significantly after fine-tuning on downstream tasks, even when the fine-tuning data itself is entirely benign. This phenomenon is commonly attributed to catastrophic forgetting, in which a model loses previously acquired capabilities while learning a new task.
In response to this issue, this study proposes a model merging strategy, which interpolates between the parameters of the pre- and post-fine-tuned models to restore the original safety alignment. This method requires no additional safety data or extra model training, making it both simple to implement and resource-efficient.
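The abstract describes merging as parameter interpolation between the safety-aligned base model and the fine-tuned model. Below is a minimal sketch of that idea, assuming PyTorch-style state dicts; the helper name merge_state_dicts, the interpolation weight alpha, and the toy modules are illustrative assumptions, not the thesis's actual implementation:

```python
import torch.nn as nn

def merge_state_dicts(base_state, finetuned_state, alpha=0.5):
    """Interpolate parameters: theta_merged = alpha * theta_base + (1 - alpha) * theta_finetuned.

    alpha = 0 keeps the fine-tuned weights unchanged;
    alpha = 1 fully restores the safety-aligned base model.
    """
    return {
        name: alpha * base_state[name] + (1.0 - alpha) * finetuned_state[name]
        for name in base_state
    }

# Toy demonstration: small linear layers stand in for the base and fine-tuned LLMs.
base = nn.Linear(4, 4)
finetuned = nn.Linear(4, 4)
merged = nn.Linear(4, 4)
merged.load_state_dict(
    merge_state_dicts(base.state_dict(), finetuned.state_dict(), alpha=0.3)
)
```

In this reading, alpha trades safety restoration against downstream gains (a larger alpha recovers more of the original alignment), so in practice it would presumably be chosen on a validation set.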
To analyze the impact of these techniques in depth, this study conducts experiments on four representative downstream tasks: logical reasoning, medical consultation, code generation, and tool use. Each fine-tuned model is comprehensively evaluated for both task performance and safety, with additional comparisons across model sizes and architectures. The results show that while regularization-based baselines offer moderate improvements in some cases, model merging is consistently more stable at preserving safety and instruction-following ability while retaining strong task performance.
Overall, this study demonstrates that model merging is a practical and scalable approach for maintaining safety in LLMs, especially in low-resource settings where high-quality safety data is unavailable. It provides a promising direction for future research on low-cost alignment and capability preservation.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97758
DOI: 10.6342/NTU202501519
Fulltext Rights: Authorized (open access worldwide)
Embargo Lift Date: 2025-07-17
Appears in Collections: Department of Electrical Engineering

Files in This Item:
ntu-113-2.pdf (10.82 MB, Adobe PDF)

