Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 黃從仁 | zh_TW |
| dc.contributor.advisor | Tsung-Ren Huang | en |
| dc.contributor.author | 蔡尹婷 | zh_TW |
| dc.contributor.author | Yin-Ting Tsai | en |
| dc.date.accessioned | 2025-11-26T16:09:13Z | - |
| dc.date.available | 2025-11-27 | - |
| dc.date.copyright | 2025-11-26 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-11-03 | - |
| dc.identifier.citation | [1] Cho, S., & Lee, S. Y. (2021). Multi-speaker emotional text-to-speech synthesizer. arXiv preprint arXiv:2112.03557. [2] Zhou, K., Sisman, B., & Li, H. (2021). Limited data emotional voice conversion leveraging text-to-speech: Two-stage sequence-to-sequence training. arXiv preprint arXiv:2103.16809. [3] Pan, S., & He, L. (2021). Cross-speaker style transfer with prosody bottleneck in neural speech synthesis. arXiv preprint arXiv:2107.12562. [4] Xie, Q., Li, T., Wang, X., Wang, Z., Xie, L., Yu, G., & Wan, G. (2022, December). Multi-speaker multi-style text-to-speech synthesis with single-speaker single-style training data scenarios. In 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) (pp. 66-70). IEEE. [5] Wang, Y., Stanton, D., Zhang, Y., Ryan, R. S., Battenberg, E., Shor, J., ... & Saurous, R. A. (2018, July). Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning (pp. 5180-5189). PMLR. [6] Li, T., Wang, X., Xie, Q., Wang, Z., & Xie, L. (2022). Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1448-1460. [7] Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., ... & Wei, F. (2021). SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint arXiv:2110.07205. [8] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018, April). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5329-5333). IEEE. [9] Wu, P., Pan, J., Xu, C., Zhang, J., Wu, L., Yin, X., & Ma, Z. (2021). Cross-speaker emotion transfer based on speaker condition layer normalization and semi-supervised training in text-to-speech. arXiv preprint arXiv:2110.04153. [10] Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., & Schuller, B. W. (2023). Dawn of the transformer era in speech emotion recognition: Closing the valence gap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10745-10759. [11] Ganin, Y., & Lempitsky, V. (2015, June). Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (pp. 1180-1189). PMLR. [12] Liu, R., Liang, K., Hu, D., Li, T., Yang, D., & Li, H. (2025). Noise robust cross-speaker emotion transfer in TTS through knowledge distillation and orthogonal constraint. IEEE Transactions on Audio, Speech and Language Processing. [13] Ranasinghe, K., Naseer, M., Hayat, M., Khan, S., & Khan, F. S. (2021). Orthogonal projection loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12333-12343). [14] Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, 17022-17033. [15] Zhou, K., Sisman, B., Liu, R., & Li, H. (2022). Emotional voice conversion: Theory, databases and ESD. Speech Communication, 137, 1-18. [16] Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., & Saruwatari, H. (2022). UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. arXiv preprint arXiv:2204.02152. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100933 | - |
| dc.description.abstract | 跨說話人情緒轉移(Cross-speaker Emotion Transfer, CSET)可於文字轉語音(Text-to-Speech, TTS)中,將來源語音的情緒遷移至目標說話人。然而,現有多說話人多情緒 TTS 系統大多需依賴大規模且標註完整的情緒語料從頭訓練,資料取得與計算成本皆高。為降低門檻,本文提出一套「微調多說話人語音合成模型」之方法。核心概念為利用近年多說話人語音合成模型所提供之高表達力說話人表徵空間,透過情緒模組注入情緒嵌入,同時結合梯度反轉層(Gradient Reversal Layer)與正交約束(Orthogonal Constraint)完成說話人–情緒解耦,並增進情緒間的可分性。本方法整體設計採模組形式,使系統可靈活替換為任何先進的多說話人骨幹並保有低訓練成本優勢。我們在多說話人英文情緒語料庫 ESD 上進行廣泛實驗。客觀評估結果顯示,所提出之微調策略能有效生成情緒自然、可辨識性高的語音。在見過說話人(seen speaker)條件下,模型可維持良好的音色保真度;然而,在未見說話人(unseen speaker, zero-shot)條件下,音色相似度相較預訓練骨幹仍有提升空間,但情緒辨識準確率仍可達真實語音(Ground Truth, GT)約九成水準,展現穩健的跨說話人情緒生成能力,為跨說話人情緒 TTS 提供一條高效率且具實用性的解決途徑。 | zh_TW |
| dc.description.abstract | Cross-speaker emotion transfer (CSET) in text-to-speech (TTS) synthesis aims to generate speech that preserves the timbre of a target speaker while conveying the emotion contained in a reference utterance. Existing multi-speaker, multi-emotion TTS systems are typically trained from scratch on large-scale, well-annotated emotional corpora, making data collection and computation costly. To lower this barrier, we propose a fine-tuning framework that injects emotion control into a pre-trained multi-speaker TTS backbone. The framework leverages the highly expressive speaker representation space learned by recent multi-speaker TTS models: an external emotion module supplies emotion embeddings that are fused into the decoder pre-net, while a gradient-reversal layer together with orthogonal losses disentangles speaker and emotion representations and enhances inter-emotion separability. The framework is model-agnostic and can be attached to any advanced multi-speaker backbone with minimal additional training cost. Extensive experiments on the English Emotional Speech Dataset (ESD) demonstrate that the proposed fine-tuning strategy generates speech with natural and distinguishable emotional expressiveness. For seen speakers, the model maintains high timbre fidelity; for unseen speakers under the zero-shot setting, timbre similarity still leaves room for improvement relative to the pre-trained backbone. Nevertheless, the emotion recognition accuracy of the synthesized speech reaches approximately 90% of the ground-truth level, indicating robust cross-speaker emotion transfer capability. This work provides an efficient and practical solution for cross-speaker emotion transfer in TTS. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-26T16:09:13Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-11-26T16:09:13Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員會審定書;誌謝;摘要;Abstract;目次;圖次;表次;第一章 緒論;1.1 研究背景;1.2 現有方法之局限;1.3 研究動機與挑戰;1.4 論文架構;第二章 研究方法;2.1 系統總覽;2.2 預訓練多說話人 TTS;2.3 情緒模組;2.4 梯度反轉層;2.5 正交損失;2.6 正交投影損失;2.7 總體損失函數;2.8 執行時推論;第三章 實驗流程;3.1 資料集;3.2 實驗設置;3.3 消融實驗;3.4 評估指標;3.4.1 情緒分類準確度;3.4.2 音色餘弦相似度;第四章 實驗結果與討論;4.1 見過說話人(Seen Speaker)的訓練與合成;4.1.1 模組訓練行為分析;4.1.2 Seen Speaker 評估樣本組成;4.1.3 Seen Speaker:情緒評估;4.1.4 Seen Speaker:音色評估;4.2 未見說話人(Unseen Speaker)的跨說話人情緒轉移;4.2.1 Unseen Speaker, Zero-shot 評估樣本組成;4.2.2 Unseen Speaker, Zero-shot:情緒評估;4.2.3 Unseen Speaker, Zero-shot:音色評估;4.3 消融實驗分析;4.3.1 消融實驗合成語音之情緒評估;4.3.2 消融實驗合成語音之音色評估;4.3.3 綜合分析與討論;第五章 結論;參考文獻;附錄 A:模型架構與超參數;附錄 B:All, Zero-N 抽樣補充實驗;附錄 C:消融實驗分析抽樣補充實驗;附錄 D:合成語音自然度客觀評估(MOS) | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 跨說話人情緒轉移 | - |
| dc.subject | 文字轉語音 | - |
| dc.subject | 微調 | - |
| dc.subject | 說話人–情緒解耦 | - |
| dc.subject | 正交約束 | - |
| dc.subject | cross-speaker emotion transfer | - |
| dc.subject | text-to-speech | - |
| dc.subject | fine-tuning | - |
| dc.subject | speaker–emotion disentanglement | - |
| dc.subject | orthogonal constraint | - |
| dc.title | 利用能解耦說話人—情感的微調在語音合成中實現樣本高效的跨說話人情感遷移 | zh_TW |
| dc.title | Sample-Efficient Cross-Speaker Emotion Transfer in Text-to-Speech via Fine-Tuning with Speaker-Emotion Disentanglement | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 李宏毅;王新民 | zh_TW |
| dc.contributor.oralexamcommittee | Hung-Yi Lee;Hsin-Min Wang | en |
| dc.subject.keyword | 跨說話人情緒轉移,文字轉語音,微調,說話人–情緒解耦,正交約束 | zh_TW |
| dc.subject.keyword | cross-speaker emotion transfer,text-to-speech,fine-tuning,speaker–emotion disentanglement,orthogonal constraint | en |
| dc.relation.page | 48 | - |
| dc.identifier.doi | 10.6342/NTU202504632 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2025-11-03 | - |
| dc.contributor.author-college | 共同教育中心 | - |
| dc.contributor.author-dept | 統計碩士學位學程 | - |
| dc.date.embargo-lift | 2025-11-27 | - |
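
The abstract above describes the core of the fine-tuning recipe: an external emotion module injects emotion embeddings into the decoder pre-net of a pre-trained multi-speaker TTS backbone, while a gradient reversal layer (GRL) and orthogonality-based losses disentangle speaker and emotion representations and improve inter-emotion separability. The snippet below is a minimal, illustrative PyTorch-style sketch of those two disentanglement ingredients only; every name and hyperparameter in it (`GradReverse`, `orthogonality_loss`, `orthogonal_projection_loss`, `lambda_grl`, the embedding sizes) is an assumption made for illustration, not the thesis's actual implementation.

```python
# Illustrative sketch only (assumed PyTorch-style code, not the thesis's implementation):
# a gradient reversal layer plus two orthogonality-style losses for
# speaker-emotion disentanglement.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in backward."""

    @staticmethod
    def forward(ctx, x, lambda_grl):
        ctx.lambda_grl = lambda_grl
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the emotion features.
        return -ctx.lambda_grl * grad_output, None


def grad_reverse(x, lambda_grl=1.0):
    return GradReverse.apply(x, lambda_grl)


def orthogonality_loss(speaker_emb, emotion_emb):
    """Penalize alignment between speaker and emotion embeddings (cosine^2 -> 0)."""
    cos = F.cosine_similarity(speaker_emb, emotion_emb, dim=-1)
    return (cos ** 2).mean()


def orthogonal_projection_loss(features, labels):
    """OPL-style loss over a mini-batch: pull same-emotion features together,
    push different-emotion features toward orthogonality (|cosine| -> 0)."""
    feats = F.normalize(features, dim=-1)                      # (B, D)
    sim = feats @ feats.t()                                    # pairwise cosine similarities
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=features.device)
    same_mask = same - eye                                     # same class, excluding self
    diff_mask = 1.0 - same
    s = (sim * same_mask).sum() / same_mask.sum().clamp(min=1.0)
    d = (sim.abs() * diff_mask).sum() / diff_mask.sum().clamp(min=1.0)
    return (1.0 - s) + d


# Usage sketch: an adversarial speaker classifier sees the emotion features only
# through the GRL, so gradients that would make them speaker-predictive are
# reversed during fine-tuning.
if __name__ == "__main__":
    B, D, n_speakers = 8, 256, 10                              # assumed toy sizes
    emotion_emb = torch.randn(B, D, requires_grad=True)
    speaker_emb = torch.randn(B, D)
    emo_labels = torch.randint(0, 5, (B,))
    spk_labels = torch.randint(0, n_speakers, (B,))

    speaker_classifier = nn.Linear(D, n_speakers)
    adv_logits = speaker_classifier(grad_reverse(emotion_emb, lambda_grl=1.0))
    loss = (F.cross_entropy(adv_logits, spk_labels)
            + orthogonality_loss(speaker_emb, emotion_emb)
            + orthogonal_projection_loss(emotion_emb, emo_labels))
    loss.backward()
```

The design intent of such a setup is that emotion embeddings remain useful for expressive synthesis while carrying as little speaker information as possible, which is one common way to realize the speaker-emotion disentanglement the abstract refers to.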
Appears in Collections: 統計碩士學位學程
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (access restricted to NTU campus IP addresses; off-campus users should connect via the VPN service) | 4.63 MB | Adobe PDF |
Unless their copyright terms are explicitly stated otherwise, all items in this repository are protected by copyright, with all rights reserved.
