Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93286

Full metadata record

dc.contributor.advisor: 李琳山 [zh_TW]
dc.contributor.advisor: Lin-shan Lee [en]
dc.contributor.author: 許博竣 [zh_TW]
dc.contributor.author: Po-chun Hsu [en]
dc.date.accessioned: 2024-07-23T16:41:25Z
dc.date.available: 2024-07-24
dc.date.copyright: 2024-07-23
dc.date.issued: 2024
dc.date.submitted: 2024-07-22
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93286
dc.description.abstract [zh_TW]:
In recent years, with the progress of deep learning, many speech generation models have demonstrated outstanding performance. Despite these impressive achievements, the development of speech generation technology has been accompanied by ever greater demands on computational and data resources, which limits its efficiency. This thesis aims to address the challenges of efficient speech generation from several aspects, including computational efficiency, data efficiency, and the application of speech generation to data-efficient speech self-supervised learning (SSL).
We first focus on the computational efficiency of speech generation. We propose a highly compressed non-autoregressive neural vocoder that significantly reduces model size and the computational resources required for training. By combining the improved architecture with an additional post-filter, the proposed model achieves real-time inference and high-quality speech output without relying on GPU acceleration. The model not only performs remarkably well in generating 44 kHz speech but also sets a new benchmark for efficient speech synthesis.
Next, we explore autoregressive generation mechanisms and improve their inference efficiency. We introduce two novel methods, Frequency-wise Autoregressive Generation (FAR) and Bit-wise Autoregressive Generation (BAR), which perform autoregressive generation along different domains. These methods greatly improve inference speed while maintaining good speech quality. Beyond neural vocoders, the proposed techniques may also apply to other speech generation tasks, serving autoregressive models for faster inference and non-autoregressive models for higher output quality, thereby broadening their impact.
We then shift the focus to data efficiency, addressing the high cost of collecting labeled data for text-guided voice conversion. We introduce reinforcement learning (RL) and reinforcement learning from human feedback (RLHF) to enhance the expressiveness of the generated speech. Our approach reduces the reliance on large labeled datasets and improves the model's ability to handle text descriptions of complex styles and to produce expressive speech, yielding significant improvements in both objective and subjective evaluations.
Finally, we broaden the scope of this research by using speech generation techniques to improve data efficiency in speech SSL. We augment a low-resource pre-training corpus with synthetic speech produced by a high-quality text-to-speech system, reducing the need for large amounts of real-world speech data. The proposed approach shows that synthetic data can effectively complement real data, achieving competitive performance with far fewer resources.
Overall, this thesis makes substantial contributions to improving the efficiency of speech generation and its applications in speech processing. We introduce novel architectures, generation methods, and learning paradigms to address computational and data efficiency challenges, laying the foundation for future advances in the field.
dc.description.abstract [en]:
Speech generation models have achieved outstanding performance with the advancement of deep learning in recent years. Despite the remarkable achievements, the development of speech generation technology has also been accompanied by greater demands on computational and data resources, resulting in limitations in its efficiency. This thesis aims to address the challenge of achieving efficient speech generation from various aspects, including computational efficiency, data efficiency, and its application for data-efficient speech self-supervised learning (SSL).
We first focus on the computational efficiency of speech generation. We propose a highly compressed non-autoregressive neural vocoder, significantly reducing model size and computational resources for training. By integrating the improved architecture with an additional post-filter, the proposed model achieves high-quality speech output with real-time inference capabilities without relying on GPU acceleration. This model not only demonstrates superior performance in generating 44 kHz speech but also sets a new benchmark for efficient speech synthesis.
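As a rough illustration of the compression idea described above, the sketch below shows affine coupling flows that all reuse a single small network, which is one way the parameter count of a flow-based vocoder can be cut sharply. The layer sizes, the number of flows, and the absence of spectrogram conditioning are simplifications chosen for illustration, not the architecture used in this thesis.

```python
import torch
import torch.nn as nn

class SharedCouplingFlow(nn.Module):
    """Toy flow-based generator in which every coupling layer reuses one network.

    Sharing a single WaveNet-like network across all flows is one way to shrink
    a WaveGlow-style vocoder drastically; sizes, flow count, and the missing
    spectrogram conditioning here are illustrative assumptions only.
    """

    def __init__(self, channels=8, hidden=64, n_flows=6):
        super().__init__()
        # One shared network instead of one network per flow step.
        self.shared_net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=3, padding=1),  # -> (log_s, t)
        )
        self.n_flows = n_flows

    def forward(self, z):
        # z: (batch, channels, time) noise that the flows transform into audio channels.
        for _ in range(self.n_flows):
            za, zb = z.chunk(2, dim=1)
            log_s, t = self.shared_net(za).chunk(2, dim=1)
            zb = zb * torch.exp(log_s) + t      # affine coupling transform
            z = torch.cat([zb, za], dim=1)      # swap halves so both get transformed
        return z

print(SharedCouplingFlow()(torch.randn(1, 8, 100)).shape)  # torch.Size([1, 8, 100])
```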
Next, we explore autoregressive generation mechanisms and enhance inference efficiency. We introduce innovative methods, Frequency-wise Autoregressive Generation (FAR) and Bit-wise Autoregressive Generation (BAR), which perform the autoregressive processes in different domains. These methods drastically improve inference speed while maintaining high speech quality. Besides neural vocoders, the proposed techniques are versatile and have the potential to be applied to other speech generation tasks, including autoregressive models for efficient inference and non-autoregressive models for better quality, thereby broadening their impact.
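The following sketch illustrates the general idea behind frequency-wise autoregressive generation: the output is split into frequency bands, each band is generated conditioned on the bands already produced, and all time steps within a band are computed in parallel. The band count, network shapes, and conditioning interface are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class FrequencyWiseAR(nn.Module):
    """Toy frequency-wise autoregressive generator.

    The output is split into n_bands frequency bands; band k is predicted from
    the conditioning features plus bands 0..k-1, and every time step within a
    band is produced in parallel. Band count and layer shapes are assumptions
    for illustration, not the configuration used in the thesis.
    """

    def __init__(self, n_bands=4, band_dim=20, cond_dim=80):
        super().__init__()
        self.n_bands = n_bands
        # One small conv net per band; its input grows with the bands already generated.
        self.band_nets = nn.ModuleList(
            nn.Conv1d(cond_dim + k * band_dim, band_dim, kernel_size=3, padding=1)
            for k in range(n_bands)
        )

    @torch.no_grad()
    def generate(self, cond):
        # cond: (batch, cond_dim, time), e.g. a mel spectrogram.
        bands = []
        for k in range(self.n_bands):
            inp = torch.cat([cond] + bands, dim=1)   # condition on lower bands
            bands.append(self.band_nets[k](inp))     # all time steps at once
        return torch.cat(bands, dim=1)               # (batch, n_bands * band_dim, time)

print(FrequencyWiseAR().generate(torch.randn(2, 80, 50)).shape)  # torch.Size([2, 80, 50])
```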
Then, we shift the focus to data efficiency, addressing the high costs associated with collecting labeled data for text-guided voice conversion. We introduce reinforcement learning (RL) and reinforcement learning from human feedback (RLHF) to enhance the expressiveness of generated speech. Our approach reduces the dependency on large, labeled datasets and improves the model's ability to handle text descriptions of complex speech styles and generate expressive speech, achieving significant improvements in both objective and subjective evaluations.
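A minimal sketch of how reinforcement learning can be attached to prompt-guided generation is given below: the generator is treated as a policy over discrete speech units, a black-box scorer stands in for the reward (for example a text-speech similarity model or a learned human-preference model), and a REINFORCE-style update raises the likelihood of highly rewarded samples. The `policy` and `reward_fn` interfaces and the toy usage at the end are hypothetical and are not the specific RL/RLHF procedure developed in the thesis.

```python
import torch
import torch.nn as nn

def reinforce_step(policy, reward_fn, prompt_emb, optimizer, baseline=0.0):
    """One REINFORCE update for prompt-guided generation (generic sketch).

    `policy` is assumed to map a style-prompt embedding to logits over discrete
    speech units, and `reward_fn` scores how well the sampled units match the
    prompt (e.g. a text-speech similarity model or a learned preference model).
    """
    logits = policy(prompt_emb)                       # (batch, length, n_units)
    dist = torch.distributions.Categorical(logits=logits)
    units = dist.sample()                             # sampled unit sequence
    reward = reward_fn(units, prompt_emb)             # (batch,), higher is better
    log_prob = dist.log_prob(units).sum(dim=-1)       # sequence log-likelihood
    loss = -((reward - baseline) * log_prob).mean()   # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()

# Toy usage: a linear "policy" over 16-dim prompts, 50 units with a vocabulary of
# 100, and a placeholder reward that always returns 1.
policy = nn.Sequential(nn.Linear(16, 50 * 100), nn.Unflatten(-1, (50, 100)))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
reinforce_step(policy, lambda units, emb: torch.ones(units.shape[0]), torch.randn(4, 16), opt)
```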
Lastly, we extend the scope of our research to improve data efficiency in speech SSL with speech generation techniques. By leveraging synthetic speech data generated from a high-quality text-to-speech system, we augment the low-resource pre-training corpus, reducing the need for extensive real-world speech data. The proposed approach demonstrates that synthetic data can effectively supplement real data, enabling competitive performance with significantly fewer resources.
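The data-mixing idea can be summarized in a few lines: keep whatever real speech is available and top up the pre-training corpus with utterances synthesized from text. In the sketch below, `tts_synthesize` is a hypothetical function standing in for any high-quality TTS system, and the synthetic-to-real ratio is an arbitrary illustrative value rather than the setting used in the experiments.

```python
import random

def build_pretraining_corpus(real_wavs, texts, tts_synthesize, synth_ratio=0.7, seed=0):
    """Mix limited real speech with TTS output for SSL pre-training (sketch).

    `tts_synthesize(text)` is a hypothetical call returning the path of a
    waveform synthesized from `text`; the synthetic-to-real ratio is an
    illustrative choice, not the setting used in the experiments.
    """
    random.seed(seed)
    corpus = list(real_wavs)                                  # keep all real speech
    n_synth = int(len(corpus) * synth_ratio / (1.0 - synth_ratio))
    for text in random.sample(texts, min(n_synth, len(texts))):
        corpus.append(tts_synthesize(text))                   # add synthetic utterances
    random.shuffle(corpus)
    return corpus
```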
Overall, this thesis makes substantial contributions to enhancing the efficiency of speech generation and its applications in speech processing. We introduce novel architectures, generation methods, and learning paradigms that address computational and data efficiency challenges, setting the stage for future advancements in the field.
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-23T16:41:25Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2024-07-23T16:41:25Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Acknowledgements i
Abstract (Chinese) iii
Abstract v
Contents vii
List of Figures xiii
List of Tables xvii
Chapter 1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2 Background 9
2.1 Neural Vocoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Autoregressive Neural Vocoder . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Non-Autoregressive Neural Vocoder . . . . . . . . . . . . . . . . . 12
2.1.4 Efficiency, Quality, and Stability of Neural Vocoders . . . . . . . . 13
2.2 Text-Guided Speech Generation . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Text-Guided Text-to-Speech Synthesis . . . . . . . . . . . . . . . . 17
2.2.3 Text-Guided Voice Conversion . . . . . . . . . . . . . . . . . . . . 18
2.2.4 Data Resources of Text-Guided Speech Generation . . . . . . . . . 22
2.3 Self-Supervised Learning in Speech Processing . . . . . . . . . . . . 22
2.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Data Resources of Self-Supervised Learning in Speech Processing . 24
Chapter 3 Computational Efficiency in Speech Generation: Better Architecture Design for Non-Autoregressive Neural Vocoder 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 WaveGlow: A Flow-Based Neural Vocoder . . . . . . . . . . . . . 27
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 Highly Compressed WaveGlow . . . . . . . . . . . . . . . . . . . . 31
3.3.2 WaveNet-Based Post-Filter . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Model Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.1 Speed and Computational Cost . . . . . . . . . . . . . . . . . . . . 38
3.5.2 Audio Quality Comparison . . . . . . . . . . . . . . . . . . . . . . 39
3.5.3 High-Fidelity Audio Generation . . . . . . . . . . . . . . . . . . . 40
3.5.4 Text-to-Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 4 Computational Efficiency in Speech Generation: Applying Non-Autoregressive Mechanism for Efficient Autoregressive Generation 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Rethinking the Direction for Autoregressive Generation . . . . . . . 50
4.2.2 Frequency-wise Autoregressive Generation (FAR) . . . . . . . . . . 52
4.2.3 Bit-wise Autoregressive Generation (BAR) . . . . . . . . . . . . . 54
4.2.4 Proposed Vocoder Architecture . . . . . . . . . . . . . . . . . . . . 56
4.2.5 Post-filtering for Posterior Sampling . . . . . . . . . . . . . . . . . 59
4.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 Acoustic Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.3 Model Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Objective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.2 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.3 Generalization Evaluation . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 5 Data Efficiency in Speech Generation: Improving Text-Guided Voice Conversion with Reinforcement Learning and Human Feedback 93
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Diffusion-Based Text-to-Audio Generation . . . . . . . . . . . . . . 96
5.2.2 Reinforcement Learning in Speech Processing . . . . . . . . . . . . 97
5.2.3 Reinforcement Learning and Human Feedback in Audio Generation 98
5.3 Text-Guided Voice Conversion . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Modifying a Pre-Trained Model for Text-Guided Voice Conversion . 100
5.3.2 Duration Model Fine-Tuning . . . . . . . . . . . . . . . . . . . . . 104
5.4 Improving Model Performance with Reinforcement Learning and Human Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Reward Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.2 Denoising Diffusion Policy Optimization . . . . . . . . . . . . . . 107
5.4.3 Reinforcement Learning from Human Feedback . . . . . . . . . . . 109
5.5 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5.1 PromptSpeech Dataset . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5.2 Emotion Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.5.3 Accent Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.5.4 Sound Event Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6.2 Models and Training . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.6.3 Evaluation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.7.1 Objective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.7.2 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Chapter 6 Enhancing Data Efficiency in Self-Supervised Learning with Speech Generation 129
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.1 Self-Supervised Learning with Unpaired Text Data for Speech Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.2 Speech Generation for Data Augmentation . . . . . . . . . . . . . . 132
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3.2 Extracting Discrete Speech Representation . . . . . . . . . . . . . . 134
6.3.3 Text-to-speech Modules . . . . . . . . . . . . . . . . . . . . . . . . 135
6.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4.2 Comparing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.4.3 Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.5.1 Performance without Synthetic Data . . . . . . . . . . . . . . . . . 142
6.5.2 Performance of Off-the-Shelf TTS Methods . . . . . . . . . . . . . 143
6.5.3 Performance of the Proposed System . . . . . . . . . . . . . . . . . 145
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Chapter 7 Conclusion 151
7.1 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.1.1 Achievements of Each Chapter . . . . . . . . . . . . . . . . . . . . 152
7.1.2 Impact to Subsequent Works . . . . . . . . . . . . . . . . . . . . . 153
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2.1 Exploring Applications and New Domains for Redesigned Autoregressive Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2.2 Expanding Applications of Reinforcement Learning in Speech Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.2.3 Extending Iterative Training for Self-Supervised Learning and Speech Generation . . . . . . . . . . . . . . . . . . . . . 157
References 159
-
dc.language.isoen-
dc.subject語音生成zh_TW
dc.subject資料效率zh_TW
dc.subject運算效率zh_TW
dc.subjectData Efficiencyen
dc.subjectSpeech Generationen
dc.subjectComputational Efficiencyen
dc.title高效率語音生成: 運算效率、資料效率及其在語音自監督學習中的應用zh_TW
dc.titleEfficient Speech Generation: Computational Efficiency, Data Efficiency, and Its Application in Speech Self-Supervised Learningen
dc.typeThesis-
dc.date.schoolyear112-2-
dc.description.degree博士-
dc.contributor.coadvisor李宏毅zh_TW
dc.contributor.coadvisorHung-yi Leeen
dc.contributor.oralexamcommittee林守德;王新民;曹昱;Abdelrahman Mohamedzh_TW
dc.contributor.oralexamcommitteeShou-De Lin;Hsin-Min Wang;Yu Tsao;Abdelrahman Mohameden
dc.subject.keyword語音生成,運算效率,資料效率,zh_TW
dc.subject.keywordSpeech Generation,Computational Efficiency,Data Efficiency,en
dc.relation.page193-
dc.identifier.doi10.6342/NTU202401934-
dc.rights.note同意授權(全球公開)-
dc.date.accepted2024-07-22-
dc.contributor.author-college電機資訊學院-
dc.contributor.author-dept電信工程學研究所-