NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933
Title: 利用能解耦說話人—情感的微調在語音合成中實現樣本高效的跨說話人情感遷移
Sample-Efficient Cross-Speaker Emotion Transfer in Text-to-Speech via Fine-Tuning with Speaker-Emotion Disentanglement
Authors: 蔡尹婷
Yin-Ting Tsai
Advisor: 黃從仁
Tsung-Ren Huang
Keyword: cross-speaker emotion transfer, text-to-speech, fine-tuning, speaker–emotion disentanglement, orthogonal constraint
Publication Year: 2025
Degree: Master's
Abstract: Cross-speaker emotion transfer (CSET) in text-to-speech (TTS) synthesis aims to generate speech that preserves the timbre of a target speaker while conveying the emotion of a reference utterance. Existing multi-speaker, multi-emotion TTS systems are typically trained from scratch on large-scale, well-annotated emotional corpora, making data collection and computation costly. To lower this barrier, we propose a fine-tuning framework that injects emotion control into a pre-trained multi-speaker TTS backbone. Leveraging the highly expressive speaker representation space learned by recent multi-speaker TTS models, an external emotion module supplies emotion embeddings that are fused in the decoder pre-net. A gradient-reversal layer, together with orthogonal losses, disentangles speaker and emotion representations and enhances inter-emotion separability. The framework is model-agnostic and can be attached to any advanced multi-speaker backbone with minimal additional training cost. Extensive experiments on the English Emotional Speech Dataset (ESD) show that the proposed fine-tuning strategy generates speech with natural and distinguishable emotional expressiveness. For seen speakers, the model maintains high timbre fidelity; for unseen speakers in the zero-shot setting, timbre similarity still leaves room for improvement relative to the pre-trained backbone, yet the emotion recognition accuracy of the synthesized speech reaches approximately 90% of the ground-truth level, indicating robust cross-speaker emotion transfer. This work provides an efficient and practical solution for cross-speaker emotion transfer in TTS.
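The abstract names two disentanglement ingredients: a gradient-reversal layer (an adversarial speaker classifier on the emotion branch) and an orthogonality constraint between speaker and emotion embeddings. The sketch below shows one common PyTorch realization of these two ingredients only; it is illustrative, and the module names, dimensions, loss form, and the point of fusion with the TTS backbone are assumptions rather than the thesis implementation.

# Minimal sketch (not the thesis code): gradient-reversal layer plus an
# orthogonality penalty for speaker-emotion disentanglement.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


def orthogonality_loss(speaker_emb, emotion_emb):
    """One common orthogonality penalty: squared entries of the batch
    cross-correlation between speaker and emotion embeddings."""
    s = F.normalize(speaker_emb, dim=-1)   # (batch, speaker_dim)
    e = F.normalize(emotion_emb, dim=-1)   # (batch, emotion_dim)
    return (s.transpose(0, 1) @ e).pow(2).mean()


class EmotionModule(nn.Module):
    """Hypothetical emotion branch: looks up an emotion embedding and, through
    the reversed gradient, discourages speaker information from leaking into it."""
    def __init__(self, emo_dim=256, n_speakers=10, n_emotions=5):
        super().__init__()
        self.emotion_table = nn.Embedding(n_emotions, emo_dim)
        self.speaker_adversary = nn.Linear(emo_dim, n_speakers)

    def forward(self, emotion_id, speaker_id, lambd=1.0):
        emo = self.emotion_table(emotion_id)
        # Adversarial speaker classifier sits on the reversed-gradient path.
        spk_logits = self.speaker_adversary(grad_reverse(emo, lambd))
        adv_loss = F.cross_entropy(spk_logits, speaker_id)
        return emo, adv_loss

In a fine-tuning loop, the adversarial and orthogonality terms would typically be added with small weights to the backbone's reconstruction loss while most backbone parameters stay frozen; the exact weighting and fusion point in the decoder pre-net are design choices the abstract does not specify.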
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933
DOI: 10.6342/NTU202504632
Fulltext Rights: Authorized (public access limited to campus)
Embargo Lift Date: 2025-11-27
Appears in Collections: 統計碩士學位學程 (Master's Program in Statistics)

Files in This Item:
File: ntu-114-1.pdf (4.63 MB, Adobe PDF); access limited to NTU IP range


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
