Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93471
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 李琳山 | zh_TW
dc.contributor.advisor | Lin-Shan Lee | en
dc.contributor.author | 林則仰 | zh_TW
dc.contributor.author | Tse-Yang Lin | en
dc.date.accessioned | 2024-08-01T16:18:00Z | -
dc.date.available | 2024-08-02 | -
dc.date.copyright | 2024-08-01 | -
dc.date.issued | 2024 | -
dc.date.submitted | 2024-07-24 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93471 | -
dc.description.abstract | 語音合成(Speech Synthesis)技術是指將文句(Text)作為輸入並由機器將之轉換為語音(Speech)信號的任務,在許多生活中的應用都能看到這項技術的身影,如語音助理、機器翻譯系統等。受惠於現今發展完整的深層類神經網路強大的能力,這項技術在生成語音的品質上已非常接近真實人聲。然而,這些由機器產生的語音雖然品質很高,卻仍與真實的人聲在實際聽感上有明顯的差異。造成這個現象的原因為現今語音合成模型所產生的語音信號多為「朗讀」風格,在訓練時就利用大量高品質的朗讀錄音進行學習,導致這些語音與一般現實日常對話的語音仍有顯著的不同,因為後者之中常帶有豐富的個人口語習慣或情緒等。這樣較為自然的語音也被稱為自發性語音(Spontaneous Speech)。有鑑於此,本論文對自發性語音合成系統做了深入的探討,針對資料標註短缺的問題以及維持生成語音的品質兩個方向提出改善方法,同時提升模型的強健度以及降低訓練的難度及成本,並達成以非監督式學習(Unsupervised Learning)的方法來完成對自發性語音合成模型的訓練。
本論文首先提出自發性語音特徵自動分類架構,來解決過往相關研究中需要大量人力成本對自發性特徵的類別進行標註的問題。此一架構以語音自監督式模型為核心,將每一段帶有自發性特徵的語音片段都轉換為獨立的向量,並利用分群演算法對這些向量進行適當的分類,以得到自發性語音的虛擬標籤(Pseudo Label)。利用分類架構得到的標註資料,本論文提出兩種方法來使模型學習到如何生成語音當中帶有自發性特徵的部分,分別是自發性特徵預測器以及風格轉換模型,這兩種方法成功將自發性特徵混入生成的語音之中。此外,利用自發性語音訓練語音合成模型時,由於搜集得來的自發性語音品質往往不如特別錄製的朗讀式語料,容易導致模型生成的語音品質明顯下滑。本論文也針對這個問題進行改善,研究發現若是在訓練過程中加入預訓練的過程,或是在計算損失函數時額外考慮一種一致性損失,都能夠有效地克服品質下降的情形,並提升模型的強健性。 | zh_TW
dc.description.abstract | Speech synthesis technology, which converts text input into speech signals, is widely used in everyday applications such as voice assistants and machine translation systems. Thanks to the powerful deep neural networks available today, this technology can generate speech whose quality is nearly indistinguishable from real human voices. Despite this high quality, however, machine-generated speech still sounds noticeably different from real human speech. The discrepancy arises mainly because current speech synthesis models are trained on "read speech" recorded under controlled conditions, and therefore lack the naturalness of the spontaneous speech people produce in everyday conversation, which is rich in personal idiosyncrasies and emotion.

This thesis investigates spontaneous speech synthesis, addressing two challenges: the scarcity of labeled data and the preservation of speech quality. We propose approaches that improve model robustness and reduce training difficulty and cost, enabling spontaneous speech synthesis models to be trained with unsupervised learning.

First, we present an automatic classification architecture for spontaneous speech features, which reduces the human labor needed to categorize them. Built around a self-supervised speech model, the architecture transforms each speech segment containing spontaneous features into a vector; these vectors are then grouped by a clustering algorithm to produce pseudo labels for spontaneous speech. Using these labels, we propose two methods for learning and incorporating spontaneous features into the generated speech: a spontaneous feature predictor and a style transfer model.

Moreover, speech synthesis models trained on spontaneous speech often suffer a significant degradation in quality, because collected spontaneous recordings are generally of lower quality than read speech recorded under controlled conditions. This thesis addresses the issue by incorporating a pre-training phase or an additional consistency loss during training, which effectively mitigates the degradation and improves model robustness. | en
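The pseudo-labeling pipeline summarized in the abstract, embedding each spontaneous-feature segment with a self-supervised speech model and clustering the embeddings, can be illustrated with a short sketch. The sketch below is not the author's implementation: the HuBERT checkpoint, the mean pooling over frames, the number of clusters, and the library, file, and variable names (transformers, torchaudio, scikit-learn) are assumptions made only for illustration.

```python
# Minimal sketch (not the author's code) of the pseudo-labeling idea in the
# abstract: embed each candidate spontaneous-feature segment with a
# self-supervised speech model, then cluster the embeddings into pseudo labels.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel

# Assumed self-supervised encoder; the thesis itself does not name a checkpoint here.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def embed_segment(path: str) -> torch.Tensor:
    """Return one mean-pooled embedding vector for a speech segment."""
    wav, sr = torchaudio.load(path)                      # (channels, samples)
    wav = torchaudio.functional.resample(wav, sr, 16_000)
    wav = wav.mean(dim=0, keepdim=True)                  # mono, shape (1, samples)
    with torch.no_grad():
        hidden = model(wav).last_hidden_state            # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)                 # (768,)

# Hypothetical list of segments previously marked as containing spontaneous
# features (e.g., filled pauses); file names are placeholders.
segment_paths = ["segments/seg_0001.wav", "segments/seg_0002.wav"]
X = torch.stack([embed_segment(p) for p in segment_paths]).numpy()

# Cluster the segment embeddings; k=8 is an arbitrary illustrative choice.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
pseudo_labels = kmeans.labels_  # one pseudo label per spontaneous segment
```

The resulting cluster indices then play the role of the pseudo labels described in the abstract, standing in for manually annotated spontaneous-feature categories.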
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-01T16:18:00Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2024-08-01T16:18:00Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Acknowledgements i
Chinese Abstract iii
Abstract v
Table of Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Directions 3
1.3 Thesis Organization 4
Chapter 2 Background 5
2.1 Deep Neural Networks 5
2.1.1 Models and Principles 5
2.1.2 Convolutional Neural Networks 9
2.1.3 Recurrent Neural Networks 10
2.1.4 Attention Mechanism 12
2.1.5 Transformer 13
2.2 End-to-End Speech Synthesis 16
2.2.1 Autoregressive Speech Synthesis Models 18
2.2.2 Non-Autoregressive Speech Synthesis Models 19
2.3 Self-Supervised Learning 22
2.3.1 Overview of Self-Supervised Learning 22
2.3.2 Applications of Self-Supervised Learning in Speech Technology 23
2.4 Spontaneous Speech Generation 24
2.4.1 Introduction 24
2.4.2 Spontaneous Feature Annotation for Spontaneous Speech Synthesis 25
2.5 Chapter Summary 26
Chapter 3 An Automatic Classification Framework for Spontaneous Speech Features 27
3.1 Introduction 27
3.2 Related Work 28
3.2.1 Datasets with Spontaneous Feature Annotations 28
3.3 Proposed Methods 30
3.3.1 Automatic Classification Framework for Spontaneous Features 31
3.3.1.1 Forced Alignment for Spontaneous Features 31
3.3.1.2 Self-Supervised Learning Models for Spontaneous Features 32
3.3.1.3 k-Means Algorithm 34
3.4 Experiments 36
3.4.1 Experimental Setup 36
3.4.2 Results and Discussion 37
3.5 Chapter Summary 44
Chapter 4 An Unsupervised Spontaneous Chinese Speech Synthesis System 45
4.1 Introduction 45
4.2 Related Work 47
4.2.1 Spontaneous Speech Synthesis Models 47
4.3 Proposed Methods 48
4.3.1 Spontaneous Feature Predictor 48
4.3.2 Style Transfer Model 50
4.3.3 Spontaneous Features in Non-Autoregressive Speech Generation 51
4.3.3.1 Multi-Speaker Speech Synthesis Model 51
4.3.3.2 Consistency Loss 52
4.4 Experiments 54
4.4.1 Experimental Setup 54
4.4.1.1 Datasets 54
4.4.1.2 Model Configuration 55
4.4.1.3 Objective Evaluation Metrics 56
4.4.1.4 Subjective Evaluation Metrics 56
4.4.2 Results and Discussion 58
4.5 Chapter Summary 60
Chapter 5 Conclusion and Future Work 63
5.1 Contributions and Discussion 63
5.2 Future Work 64
References 67 | -
dc.language.iso | zh_TW | -
dc.subject | 非監督式學習 | zh_TW
dc.subject | 自監督式模型 | zh_TW
dc.subject | 自發性語音 | zh_TW
dc.subject | Self-supervised Model | en
dc.subject | Unsupervised Learning | en
dc.subject | Spontaneous Speech | en
dc.title | 自發性華語語音合成之非監督式學習 | zh_TW
dc.title | Spontaneous Chinese Speech Synthesis by Unsupervised Learning | en
dc.type | Thesis | -
dc.date.schoolyear | 112-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 賴穎暉;王新民;曹昱;陳尚澤;李宏毅 | zh_TW
dc.contributor.oralexamcommittee | Ying-Hui Lai;Hsin-Min Wang;Yu Tsao;Shang-Tse Chen;Hung-Yi Lee | en
dc.subject.keyword | 自發性語音,自監督式模型,非監督式學習 | zh_TW
dc.subject.keyword | Spontaneous Speech,Self-supervised Model,Unsupervised Learning | en
dc.relation.page | 74 | -
dc.identifier.doi | 10.6342/NTU202402056 | -
dc.rights.note | 同意授權(限校園內公開) (authorized; on-campus access only) | -
dc.date.accepted | 2024-07-26 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 資訊工程學系 | -
Appears in Collections: 資訊工程學系

Files in This Item:
File | Size | Format
ntu-112-2.pdf | 2.03 MB | Adobe PDF
Access restricted to NTU campus IP addresses (off-campus users should connect through the NTU VPN service).


Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.
