NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101326
Title: Exploring Mitigation Strategies for Catastrophic Forgetting in Training Speech-to-Speech Spoken Language Models
Authors: Chi-Yuan Hsiao (蕭淇元)
Advisor: Hung-yi Lee (李宏毅)
Keyword: Spoken Language Model, Spoken Question Answering, Catastrophic Forgetting, Continual Learning, Model Merging
Publication Year: 2025
Degree: Master's
Abstract: We present a speech-to-speech Spoken Language Model (SLM) that adopts single-path, token-level interleaving of text and speech. A three-stage continual-learning pipeline of automatic speech recognition (ASR), text-to-speech synthesis (TTS), and spoken question answering (SQA) progressively adapts a pre-trained text-only Large Language Model (LLM) to the speech modality. The stage-wise distribution shift, however, triggers severe catastrophic forgetting. We therefore benchmark four mitigation strategies: model merging, discounting the LoRA scaling factor, experience replay, and L2 regularization. Experiments show that experience replay is most effective at retaining textual knowledge and ASR accuracy, whereas L2 regularization best preserves speech naturalness with only a modest drop in text performance. Combining either of them with model merging or LoRA scaling yields additional, though smaller, gains. These findings shed light on the plasticity–stability trade-off in multimodal continual learning and provide practical guidelines for building robust and efficient SLM training pipelines.
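The core operation behind each of the four mitigation strategies named in the abstract can be sketched in miniature. The following toy Python uses scalar "weights" and invented function names purely for illustration; it is an assumption-laden sketch, not the thesis's actual implementation or hyperparameters:

```python
def merge_models(w_old, w_new, alpha=0.5):
    """Model merging: per-parameter linear interpolation between the
    weights before and after a training stage (alpha weights the new model)."""
    return {k: (1 - alpha) * w_old[k] + alpha * w_new[k] for k in w_old}

def scaled_lora_weight(w_base, delta, scale=1.0):
    """Discounted LoRA scaling: shrinking `scale` pulls the adapted weight
    back toward the base model, trading plasticity for stability."""
    return w_base + scale * delta

def l2_anchor_penalty(params, anchor, lam=1e-2):
    """L2 regularization toward the previous stage's weights, penalizing
    the parameter drift that drives forgetting."""
    return lam * sum((p - a) ** 2 for p, a in zip(params, anchor))

def replay_batch(current_batch, replay_buffer, k=2):
    """Experience replay: mix k samples from earlier stages into each
    current-stage training batch."""
    return current_batch + replay_buffer[:k]
```

In a real pipeline these would operate on full parameter tensors and dataset samples; the sketch only shows why each knob (merge ratio, LoRA scale, penalty weight, replay count) trades plasticity against stability.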
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101326
DOI: 10.6342/NTU202600026
Fulltext Rights: Authorized (open access worldwide)
Embargo lift date: 2026-01-17
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
File: ntu-114-1.pdf (9.82 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
