Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933

| Title: | Sample-Efficient Cross-Speaker Emotion Transfer in Text-to-Speech via Fine-Tuning with Speaker-Emotion Disentanglement |
| Author: | Yin-Ting Tsai (蔡尹婷) |
| Advisor: | Tsung-Ren Huang (黃從仁) |
| Keywords: | cross-speaker emotion transfer, text-to-speech, fine-tuning, speaker–emotion disentanglement, orthogonal constraint |
| Publication Year: | 2025 |
| Degree: | Master |
| Abstract: | Cross-speaker emotion transfer (CSET) in text-to-speech (TTS) synthesis aims to generate speech that preserves the timbre of a target speaker while conveying the emotion of a reference utterance. Existing multi-speaker, multi-emotion TTS systems are typically trained from scratch on large-scale, well-annotated emotional corpora, making data collection and computation costly. To lower this barrier, we propose a fine-tuning framework that injects emotion control into a pre-trained multi-speaker TTS backbone. The framework leverages the highly expressive speaker representation space learned by recent multi-speaker TTS models: an external emotion module supplies emotion embeddings that are fused in the decoder pre-net, while a gradient-reversal layer combined with orthogonal losses disentangles speaker and emotion representations and enhances inter-emotion separability. The design is modular and model-agnostic, so it can be attached to any advanced multi-speaker backbone at low additional training cost. Extensive experiments on the English portion of the Emotional Speech Dataset (ESD) show that the proposed fine-tuning strategy generates speech with natural and clearly distinguishable emotional expression. For seen speakers, the model maintains high timbre fidelity; for unseen speakers under the zero-shot setting, timbre similarity still has room for improvement relative to the pre-trained backbone. Nevertheless, the emotion recognition accuracy of the synthesized speech reaches approximately 90% of the ground-truth level, indicating robust cross-speaker emotion transfer capability. This work thus offers an efficient and practical route to cross-speaker emotion transfer in TTS. (An illustrative code sketch of the gradient-reversal and orthogonal-constraint idea follows this metadata table.) |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933 |
| DOI: | 10.6342/NTU202504632 |
| Full-Text License: | Authorized (access restricted to on-campus use) |
| Electronic Full-Text Release Date: | 2025-11-27 |
| Appears in Collections: | Master's Program in Statistics |
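
The abstract describes an adversarial gradient-reversal layer (GRL) plus an orthogonal constraint for separating speaker and emotion information. The snippet below is a minimal, hypothetical PyTorch sketch of that general recipe; `EmotionModule`, `grad_reverse`, `orthogonal_loss`, the embedding dimension, and the cosine-based penalty are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of GRL-based speaker-emotion disentanglement (assumed design, not the thesis code).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; scales gradients by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


class EmotionModule(nn.Module):
    """Hypothetical external emotion module: yields an emotion embedding for the
    decoder pre-net and an adversarial speaker prediction made through the GRL."""

    def __init__(self, n_emotions, n_speakers, dim=256):
        super().__init__()
        self.emotion_table = nn.Embedding(n_emotions, dim)
        self.speaker_adv = nn.Linear(dim, n_speakers)  # adversarial speaker classifier

    def forward(self, emotion_id, lambd=1.0):
        e = self.emotion_table(emotion_id)                     # (B, dim) emotion embedding
        spk_logits = self.speaker_adv(grad_reverse(e, lambd))  # GRL pushes speaker info out of e
        return e, spk_logits


def orthogonal_loss(emotion_emb, speaker_emb):
    """Squared cosine similarity between emotion and speaker embeddings,
    driving the two representations toward orthogonality."""
    e = F.normalize(emotion_emb, dim=-1)
    s = F.normalize(speaker_emb, dim=-1)
    return (e * s).sum(dim=-1).pow(2).mean()
```

During fine-tuning, one would presumably add the adversarial speaker cross-entropy (computed on `spk_logits`) and `orthogonal_loss` to the backbone's usual reconstruction loss, while the emotion embedding `e` is fused into the decoder pre-net as the abstract states.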
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (access restricted to NTU campus IPs; off-campus users should connect via the VPN service) | 4.63 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
