  1. NTU Theses and Dissertations Repository
  2. 電機資訊學院 (College of Electrical Engineering and Computer Science)
  3. 資訊工程學系 (Department of Computer Science and Information Engineering)
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101148
Title: 弭平領域差距與改善模型弱點:運用合成數據提升大型語言與語音模型
Bridging Domain Gaps and Fixing Model Weakness: Leveraging Synthetic Data for Large Language and Speech Model Improvement
Authors: 蘇軒
Hsuan Su
Advisor: 陳尚澤
Shang-Tse Chen
Co-Advisor: 李宏毅
Hung-yi Lee
Keyword: 合成數據, 大型語言模型, 自動語音識別, 公平性
Synthetic Data, Large Language Model, Automatic Speech Recognition, Fairness
Publication Year : 2025
Degree: 博士
Abstract: 本論文聚焦於大規模語言模型與自動語音辨識等基礎模型在自然語言與語音處理領域的應用,儘管此類模型在多種任務中展現出顯著成效,當面臨與訓練分佈差異極大的新領域或族群資料時,效能仍易受限。為克服此挑戰,本論文結合合成資料生成與優化技術,無需大量真實目標域資料,即能有效提升模型的穩健性與公平性。

首先,我們提出一個零樣本的自動語音辨識調適流程,透過大規模語言模型生成特定領域文本,再使用文字轉語音系統將其轉換為語音,成功在公開資料集中達成平均 28.7% 的相對詞錯率下降,且完全不依賴任何真實目標域資料。

接著,我們提出「合成轉真實」方法,針對合成與真實資料間的分佈落差進行優化。此方法利用任務向量運算與參數空間調整,系統性地降低合成資料帶來的偏差,並在實際應用中取得最高可達 30% 的效能提升。

最後,我們研發一套具備弱點覺察的合成資料生成策略,能經由互動式測試自動找出模型的偏見與性能缺口,並針對這些弱點進行精準的合成資料生成。此策略顯著降低如性別偏見等問題,同時維持模型的整體表現。

總體而言,本論文所提出的一系列方法展現了在資源受限情況下,如何透過合成資料與優化機制,兼顧效能與公平性地強化基礎模型,進一步推動其在多元且敏感的真實場域中更可靠地落地應用。
Foundation models, including large language models (LLMs) and automatic speech recognition (ASR) systems, have achieved impressive performance across a range of natural language and speech processing tasks. However, their effectiveness often diminishes when confronted with data from novel domains or demographic groups that diverge from their training distributions. This thesis addresses these challenges by integrating synthetic data generation with innovative optimization techniques to enhance model robustness and fairness without relying on extensive real in-domain datasets.

First, we propose a novel zero-shot ASR adaptation pipeline that leverages LLMs to generate domain-specific text prompts, which are subsequently transformed into speech using a Text-to-Speech (TTS) system. This approach significantly improves ASR performance on unseen domains, achieving an average relative word error rate reduction of 28.7% on the SLURP dataset without any real target domain data.
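The pipeline above chains an LLM, a TTS system, and ASR fine-tuning. A minimal sketch of the data-generation stage follows; `generate_domain_text`, the toy LLM, and the toy TTS are hypothetical stand-ins for illustration, not the thesis implementation:

```python
# Hypothetical sketch of the zero-shot adaptation pipeline's data stage:
# an LLM produces domain-specific transcripts, and a TTS system turns
# each transcript into paired synthetic audio for ASR fine-tuning.

def generate_domain_text(llm, domain, n):
    # Prompt the LLM for n utterances typical of the target domain.
    prompt = f"Write a sentence a user might say about {domain}."
    return [llm(prompt) for _ in range(n)]

def build_synthetic_corpus(llm, tts, domain, n):
    texts = generate_domain_text(llm, domain, n)
    # Pair each synthesized waveform with its generated transcript.
    return [(tts(t), t) for t in texts]

# Toy stand-ins so the sketch runs end to end.
toy_llm = lambda prompt: "play some jazz music"
toy_tts = lambda text: b"<waveform:" + text.encode() + b">"

corpus = build_synthetic_corpus(toy_llm, toy_tts, "music", 3)
```

In the actual pipeline, the resulting (audio, transcript) pairs would be fed to standard ASR fine-tuning in place of real target-domain recordings.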

Second, we introduce the SYN2REAL framework to mitigate the synthetic-to-real gap—a distributional mismatch between synthetic and real data. By employing task vector arithmetic and parameter-space optimization, SYN2REAL systematically reduces synthetic artifacts, leading to performance improvements of up to 30% in real-world ASR applications.
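Task vector arithmetic treats the difference between fine-tuned and base parameters as a direction that can be manipulated in parameter space. The sketch below illustrates the general idea with models as flat dicts of floats; the specific interpolation used here is illustrative, not the exact SYN2REAL recipe:

```python
# Illustrative sketch of task-vector arithmetic in parameter space.
# A "task vector" is the parameter delta introduced by fine-tuning;
# combining vectors lets us steer a synthetic-data adaptation toward
# the direction observed on real data.

def task_vector(base, finetuned):
    # Direction in parameter space introduced by fine-tuning.
    return {k: finetuned[k] - base[k] for k in base}

def apply_vector(base, vec, scale=1.0):
    return {k: base[k] + scale * vec[k] for k in base}

base = {"w": 1.0, "b": 0.0}
syn_ft = {"w": 1.6, "b": 0.4}   # fine-tuned on synthetic speech
real_ft = {"w": 1.4, "b": 0.1}  # fine-tuned on real speech (related domain)

v_syn = task_vector(base, syn_ft)
v_real = task_vector(base, real_ft)
# Keep the synthetic adaptation but pull it halfway toward the
# real-data direction, reducing synthetic artifacts.
v_adj = {k: v_syn[k] + (v_real[k] - v_syn[k]) * 0.5 for k in base}
adapted = apply_vector(base, v_adj)
```

Because the arithmetic happens purely in parameter space, no additional training data is needed at combination time, which is what makes the approach attractive when real in-domain data is scarce.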

Lastly, we develop a weakness-aware synthetic data generation strategy that iteratively probes foundation models to identify and mitigate biases and performance gaps. This framework refines data sampling to target specific model weaknesses, effectively reducing issues such as gender bias while preserving overall model performance.
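One simple way to make data generation weakness-aware is to allocate the synthetic-data budget in proportion to each group's measured error. The sketch below is a hedged illustration of that idea; the group names, error rates, and proportional-allocation rule are assumptions for demonstration, not the thesis's exact strategy:

```python
# Illustrative sketch of weakness-aware generation: probe per-group
# error rates, then weight the synthetic-data budget toward the
# weakest groups so targeted data closes the gap.

def weakness_weights(error_rates):
    # Normalize errors so groups with higher error get more budget.
    total = sum(error_rates.values())
    return {g: e / total for g, e in error_rates.items()}

def plan_generation(error_rates, budget):
    weights = weakness_weights(error_rates)
    return {g: round(budget * p) for g, p in weights.items()}

# Hypothetical probe result: word error rate measured per group.
probe = {"male": 0.10, "female": 0.30}
plan = plan_generation(probe, budget=100)
```

Here the group with three times the error receives three times the generation budget, so subsequent fine-tuning disproportionately targets the model's weak spot while the stronger group's data is still refreshed.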

Collectively, these contributions demonstrate a resource-efficient pathway to enhance the adaptability and fairness of foundation models, paving the way for their more reliable deployment in diverse and sensitive real-world scenarios.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101148
DOI: 10.6342/NTU202504762
Fulltext Rights: 未授權 (full text not authorized for public access)
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
ntu-114-1.pdf (Restricted Access), 2.01 MB, Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
