Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97758
Title: 無安全性資料下緩解大型語言模型微調造成之安全性減損方法探討
A Study on Methods for Mitigating Safety Degradation Caused by Fine-Tuning Large Language Models without Safety Data
Authors: 樊樺
Hua Farn
Advisor: 李宏毅
Hung-Yi Lee
Keywords: Large Language Model, Catastrophic Forgetting, Safety, Model Merging
Publication Year: 2025
Degree: Master's
Abstract: With the increasing adoption of large language models (LLMs) across a wide range of natural language processing (NLP) tasks, model safety has become a central concern. To reduce the risk of harmful or inappropriate responses, most modern LLMs undergo human preference alignment during training so that their outputs better reflect human values and usage norms. However, recent studies have shown that this alignment can degrade significantly after fine-tuning on downstream tasks, even when the fine-tuning data itself is entirely benign. This phenomenon is commonly attributed to catastrophic forgetting, in which a model loses previously acquired capabilities while learning a new task.
In response to this issue, this study proposes a model merging strategy, which interpolates between the parameters of the pre- and post-fine-tuned models to restore the original safety alignment. This method requires no additional safety data or extra model training, making it both simple to implement and resource-efficient.
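The abstract describes merging as parameter interpolation between the safety-aligned base model and the fine-tuned model. Below is a minimal sketch of that idea, assuming PyTorch-style state dicts; the helper name merge_state_dicts, the interpolation weight alpha, and the toy modules are illustrative assumptions, not the thesis's actual implementation:

```python
import torch.nn as nn

def merge_state_dicts(base_state, finetuned_state, alpha=0.5):
    """Interpolate parameters: theta_merged = alpha * theta_base + (1 - alpha) * theta_finetuned.

    alpha = 0 keeps the fine-tuned weights unchanged;
    alpha = 1 fully restores the safety-aligned base model.
    """
    return {
        name: alpha * base_state[name] + (1.0 - alpha) * finetuned_state[name]
        for name in base_state
    }

# Toy demonstration: small linear layers stand in for the base and fine-tuned LLMs.
base = nn.Linear(4, 4)
finetuned = nn.Linear(4, 4)
merged = nn.Linear(4, 4)
merged.load_state_dict(
    merge_state_dicts(base.state_dict(), finetuned.state_dict(), alpha=0.3)
)
```

In this reading, alpha trades safety restoration against downstream gains (a larger alpha recovers more of the original alignment), so in practice it would presumably be chosen on a validation set.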
To analyze the impact of these techniques in depth, this study conducts experiments on four representative downstream tasks: logical reasoning, medical consultation, code generation, and tool use. Each fine-tuned model is comprehensively evaluated for both task performance and safety, with additional comparisons across model sizes and architectures. The results show that while regularization-based baselines offer moderate improvements in some cases, model merging is consistently more stable at preserving safety and instruction-following ability while retaining strong task performance.
Overall, this study demonstrates that model merging is a practical and scalable approach for maintaining safety in LLMs, especially in low-resource settings where high-quality safety data is unavailable. It provides a promising direction for future research on low-cost alignment and capability preservation.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97758
DOI: 10.6342/NTU202501519
Fulltext Rights: Authorized (open access worldwide)
Embargo Lift Date: 2025-07-17
Appears in Collections: Department of Electrical Engineering

Files in This Item:
ntu-113-2.pdf (10.82 MB, Adobe PDF)

