Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933
Full metadata record (DC field: value [language])
dc.contributor.advisor: 黃從仁 [zh_TW]
dc.contributor.advisor: Tsung-Ren Huang [en]
dc.contributor.author: 蔡尹婷 [zh_TW]
dc.contributor.author: Yin-Ting Tsai [en]
dc.date.accessioned: 2025-11-26T16:09:13Z
dc.date.available: 2025-11-27
dc.date.copyright: 2025-11-26
dc.date.issued: 2025
dc.date.submitted: 2025-11-03
dc.identifier.citation:
[1] Cho, S., & Lee, S. Y. (2021). Multi-speaker emotional text-to-speech synthesizer. arXiv preprint arXiv:2112.03557.
[2] Zhou, K., Sisman, B., & Li, H. (2021). Limited data emotional voice conversion leveraging text-to-speech: Two-stage sequence-to-sequence training. arXiv preprint arXiv:2103.16809.
[3] Pan, S., & He, L. (2021). Cross-speaker style transfer with prosody bottleneck in neural speech synthesis. arXiv preprint arXiv:2107.12562.
[4] Xie, Q., Li, T., Wang, X., Wang, Z., Xie, L., Yu, G., & Wan, G. (2022, December). Multi-speaker multi-style text-to-speech synthesis with single-speaker single-style training data scenarios. In 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) (pp. 66-70). IEEE.
[5] Wang, Y., Stanton, D., Zhang, Y., Ryan, R. S., Battenberg, E., Shor, J., ... & Saurous, R. A. (2018, July). Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning (pp. 5180-5189). PMLR.
[6] Li, T., Wang, X., Xie, Q., Wang, Z., & Xie, L. (2022). Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1448-1460.
[7] Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., ... & Wei, F. (2021). SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint arXiv:2110.07205.
[8] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018, April). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5329-5333). IEEE.
[9] Wu, P., Pan, J., Xu, C., Zhang, J., Wu, L., Yin, X., & Ma, Z. (2021). Cross-speaker emotion transfer based on speaker condition layer normalization and semi-supervised training in text-to-speech. arXiv preprint arXiv:2110.04153.
[10] Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., & Schuller, B. W. (2023). Dawn of the transformer era in speech emotion recognition: Closing the valence gap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10745-10759.
[11] Ganin, Y., & Lempitsky, V. (2015, June). Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (pp. 1180-1189). PMLR.
[12] Liu, R., Liang, K., Hu, D., Li, T., Yang, D., & Li, H. (2025). Noise robust cross-speaker emotion transfer in TTS through knowledge distillation and orthogonal constraint. IEEE Transactions on Audio, Speech and Language Processing.
[13] Ranasinghe, K., Naseer, M., Hayat, M., Khan, S., & Khan, F. S. (2021). Orthogonal projection loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12333-12343).
[14] Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, 17022-17033.
[15] Zhou, K., Sisman, B., Liu, R., & Li, H. (2022). Emotional voice conversion: Theory, databases and ESD. Speech Communication, 137, 1-18.
[16] Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., & Saruwatari, H. (2022). UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. arXiv preprint arXiv:2204.02152.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933
dc.description.abstract: 跨說話人情緒轉移(Cross-speaker Emotion Transfer, CSET)可於文字轉語音(Text-to-Speech, TTS)中,將來源語音的情緒遷移至目標說話人。然而,現有多說話人多情緒 TTS 系統大多需依賴大規模且標註完整的情緒語料從頭訓練,資料取得與計算成本皆高。為降低門檻,本文提出一套「微調多說話人語音合成模型」之方法。核心概念為利用近年多說話人語音合成模型所提供之高表達力說話人表徵空間,透過情緒模組注入情緒嵌入,同時結合梯度反轉層(Gradient Reversal Layer)與正交約束(Orthogonal Constraint)完成說話人–情緒解耦,並增進情緒間的可分性。本方法整體設計採模組形式,使系統可靈活替換為任何先進的多說話人骨幹並保有低訓練成本優勢。我們在多說話人英文情緒語料庫 ESD 上進行廣泛實驗。客觀評估結果顯示,所提出之微調策略能有效生成情緒自然、可辨識性高的語音。在見過說話人(seen speaker)條件下,模型可維持良好的音色保真度;然而,在未見說話人(unseen speaker, zero-shot)條件下,音色相似度相較預訓練骨幹仍有提升空間,但情緒辨識準確率仍可達真實語音(Ground Truth, GT)約九成水準,展現穩健的跨說話人情緒生成能力,為跨說話人情緒 TTS 提供一條高效率且具實用性的解決途徑。[zh_TW]
dc.description.abstract: Cross-speaker emotion transfer (CSET) in text-to-speech (TTS) synthesis aims to generate speech that preserves the timbre of a target speaker while conveying the emotion of a reference utterance. Existing multi-speaker, multi-emotion TTS systems are typically trained from scratch on large-scale, well-annotated emotional corpora, making data collection and computation costly. To lower this barrier, we propose a fine-tuning framework that injects emotion control into a pre-trained multi-speaker TTS backbone. Leveraging the highly expressive speaker representation space learned by recent multi-speaker TTS models, an external emotion module supplies emotion embeddings that are fused in the decoder pre-net. A gradient reversal layer, together with orthogonal losses, disentangles speaker and emotion representations and enhances inter-emotion separability. The framework is model-agnostic and can be attached to any advanced multi-speaker backbone with minimal additional training cost. Extensive experiments on the English Emotional Speech Dataset (ESD) demonstrate that the proposed fine-tuning strategy generates speech with natural and distinguishable emotional expressiveness. For seen speakers, the model maintains high timbre fidelity; for unseen speakers under the zero-shot setting, timbre similarity still leaves room for improvement relative to the pre-trained backbone, yet the emotion recognition accuracy of the synthesized speech reaches approximately 90% of the ground-truth level, indicating robust cross-speaker emotion transfer capability. This work provides an efficient and practical solution for cross-speaker emotion transfer in TTS. [en]
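The disentanglement mechanism described in the abstract combines two standard ingredients: a gradient reversal layer (Ganin & Lempitsky, 2015; reference [11]) that adversarially removes speaker information from the emotion embedding, and an orthogonality constraint between speaker and emotion embeddings (cf. references [12], [13]). The following is a minimal PyTorch sketch of how these pieces could fit together; the module names, dimensions, and the way embeddings are produced are illustrative assumptions, not the thesis's actual implementation.

# Minimal sketch (PyTorch) of a gradient reversal layer plus an orthogonality
# penalty between speaker and emotion embeddings. All names, shapes, and defaults
# below are illustrative assumptions, not the thesis's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lambda in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class EmotionModule(nn.Module):
    # Hypothetical emotion module: maps an emotion ID to an embedding and exposes an
    # adversarial speaker classifier behind the gradient reversal layer.
    def __init__(self, n_emotions=5, n_speakers=10, dim=256):
        super().__init__()
        self.emotion_table = nn.Embedding(n_emotions, dim)
        self.speaker_probe = nn.Linear(dim, n_speakers)  # adversarial head

    def forward(self, emotion_id, lambd=1.0):
        emo = self.emotion_table(emotion_id)                        # (batch, dim)
        spk_logits = self.speaker_probe(grad_reverse(emo, lambd))   # (batch, n_speakers)
        return emo, spk_logits

def orthogonality_loss(emo_emb, spk_emb):
    # Penalize cosine alignment between emotion and speaker embeddings so the two
    # representations are pushed toward orthogonality.
    emo = F.normalize(emo_emb, dim=-1)
    spk = F.normalize(spk_emb, dim=-1)
    return (emo * spk).sum(dim=-1).abs().mean()

During fine-tuning, the cross-entropy loss on spk_logits and the orthogonality term would be weighted and added to the backbone's synthesis loss; because the gradient is reversed, minimizing the speaker-probe loss drives the emotion embedding to carry as little speaker information as possible, which is the adversarial disentanglement the abstract describes.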
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-26T16:09:13Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-11-26T16:09:13Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Oral Examination Committee Approval Certificate
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Limitations of Existing Methods
  1.3 Motivation and Challenges
  1.4 Thesis Organization
Chapter 2 Methodology
  2.1 System Overview
  2.2 Pre-trained Multi-speaker TTS
  2.3 Emotion Module
  2.4 Gradient Reversal Layer
  2.5 Orthogonal Loss
  2.6 Orthogonal Projection Loss
  2.7 Overall Loss Function
  2.8 Run-time Inference
Chapter 3 Experimental Procedure
  3.1 Dataset
  3.2 Experimental Setup
  3.3 Ablation Studies
  3.4 Evaluation Metrics
    3.4.1 Emotion Classification Accuracy
    3.4.2 Timbre Cosine Similarity
Chapter 4 Results and Discussion
  4.1 Training and Synthesis for Seen Speakers
    4.1.1 Analysis of Module Training Behavior
    4.1.2 Composition of Seen-Speaker Evaluation Samples
    4.1.3 Seen Speaker: Emotion Evaluation
    4.1.4 Seen Speaker: Timbre Evaluation
  4.2 Cross-Speaker Emotion Transfer for Unseen Speakers
    4.2.1 Composition of Unseen-Speaker (Zero-shot) Evaluation Samples
    4.2.2 Unseen Speaker, Zero-shot: Emotion Evaluation
    4.2.3 Unseen Speaker, Zero-shot: Timbre Evaluation
  4.3 Ablation Analysis
    4.3.1 Emotion Evaluation of Ablation Syntheses
    4.3.2 Timbre Evaluation of Ablation Syntheses
    4.3.3 Overall Analysis and Discussion
Chapter 5 Conclusion
References
Appendix A: Model Architecture and Hyperparameters
Appendix B: Supplementary Sampling Experiments (All, Zero-N)
Appendix C: Supplementary Sampling Experiments for the Ablation Analysis
Appendix D: Objective Evaluation of Synthesized Speech Naturalness (MOS)
dc.language.iso: zh_TW
dc.subject: 跨說話人情緒轉移
dc.subject: 文字轉語音
dc.subject: 微調
dc.subject: 說話人–情緒解耦
dc.subject: 正交約束
dc.subject: cross-speaker emotion transfer
dc.subject: text-to-speech
dc.subject: fine-tuning
dc.subject: speaker–emotion disentanglement
dc.subject: orthogonal constraint
dc.title: 利用能解耦說話人—情感的微調在語音合成中實現樣本高效的跨說話人情感遷移 [zh_TW]
dc.title: Sample-Efficient Cross-Speaker Emotion Transfer in Text-to-Speech via Fine-Tuning with Speaker-Emotion Disentanglement [en]
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士 (Master's) [zh_TW]
dc.contributor.oralexamcommittee: 李宏毅;王新民 [zh_TW]
dc.contributor.oralexamcommittee: Hung-Yi Lee; Hsin-Min Wang [en]
dc.subject.keyword: 跨說話人情緒轉移, 文字轉語音, 微調, 說話人–情緒解耦, 正交約束 [zh_TW]
dc.subject.keyword: cross-speaker emotion transfer, text-to-speech, fine-tuning, speaker–emotion disentanglement, orthogonal constraint [en]
dc.relation.page: 48
dc.identifier.doi: 10.6342/NTU202504632
dc.rights.note: 同意授權(限校園內公開) (authorization granted; campus-only access)
dc.date.accepted: 2025-11-03
dc.contributor.author-college: 共同教育中心 (Center for General Education)
dc.contributor.author-dept: 統計碩士學位學程 (Master's Program in Statistics)
dc.date.embargo-lift: 2025-11-27
Appears in Collections: 統計碩士學位學程 (Master's Program in Statistics)

Files in This Item:
File: ntu-114-1.pdf (access restricted to NTU campus IP addresses; use the library's VPN service for off-campus access)
Size: 4.63 MB
Format: Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
