Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94116
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李琳山 | zh_TW |
dc.contributor.advisor | Lin-shan Lee | en |
dc.contributor.author | 黃淞楓 | zh_TW |
dc.contributor.author | Sung-Feng Huang | en |
dc.date.accessioned | 2024-08-14T16:46:59Z | - |
dc.date.available | 2024-08-15 | - |
dc.date.copyright | 2024-08-14 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-07-31 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94116 | - |
dc.description.abstract | 在這篇博士論文中,我們提出了一項關於推進文字轉語音(TTS)系統實現有效個人化的全面研究,特別是在需要少量樣本學習和零樣本學習的情境中。核心重點在於開發TTS系統中語音克隆和跨語言適應的新方法,旨在解決有限數據下快速適應的迫切需求。

在第一章中,我們為研究奠定了基礎,強調了為個別語者和語言定制TTS系統的重要性,並由此展開對少量樣本學習、語音克隆和跨語言TTS適應的探索。我們的目標是以最少的訓練樣本,有效解決將TTS系統適應至新語者和新語言所面臨的挑戰。

第二章對相關文獻進行了全面回顧,深入探討了TTS技術的演進,包括端到端神經TTS模型、聲碼器技術、多語言TTS以及各種適應方法。值得注意的是,我們考察了TTS與元學習的交叉點,特別是在跨語言語音克隆的背景下,這為我們大部分研究提供了基礎。

第三章詳細介紹了我們研究的架構基礎,以Transformer模塊和FastSpeech 2為中心。我們探索了FastSpeech 2在多語者和多語言應用中的適應方式,以及針對訓練期間未見領域的微調。

在第四章中,我們介紹了Meta-TTS,一種用於少量樣本語者適應TTS的創新元學習方法。該章深入探討了Meta-TTS的訓練方法、微調策略和一整套評估指標。通過廣泛的實驗,我們展示了Meta-TTS在僅使用有限數據進行語音克隆時,仍能達到高度語者相似性和自然度。

第五章將焦點轉向透過自適應結構化剪枝實現的個人化、輕量化TTS。我們提出了自適應結構化剪枝,這是一種在微調TTS模型時提高參數效率和計算速度的方法。這項技術與Meta-TTS互補,專注於提升微調效率和模型性能,對於高效且有效的TTS系統至關重要。

在第六章中,我們探討了使用可轉移音素嵌入的少量樣本跨語言TTS,提出了一種超越現有轉移學習方法的新方法。該章包括嚴謹的實驗評估,展示了音素嵌入轉移在語言適應任務中的有效性。

論文在第七章作結,我們總結了這項研究對TTS領域的重要貢獻。我們的工作不僅推進了語音克隆和跨語言適應的最新技術,也為個人化TTS應用開闢了新途徑,特別是在資源受限的情境中。 | zh_TW |
dc.description.abstract | In this dissertation, we present a comprehensive study on advancing text-to-speech (TTS) systems towards effective personalization, especially in scenarios requiring few-shot and zero-shot learning. The core focus lies in developing novel approaches for voice cloning and cross-lingual adaptation in TTS systems, aiming to address the pressing need for rapid adaptability with limited data.
In the initial chapter, we establish the groundwork for the study, emphasizing the significance of customizing TTS systems for individual voices and languages. This sets the stage for our exploration into few-shot learning, voice cloning, and cross-lingual TTS adaptation. Our objective is to tackle the challenges of adapting TTS systems to new voices and languages efficiently, using minimal training samples.

Chapter 2 provides a thorough review of the relevant literature, tracing the evolution of TTS technology through an in-depth discussion of end-to-end neural TTS models, vocoder technologies, multilingual TTS, and various adaptation methods. Notably, we examine the intersection of TTS and meta-learning, particularly in the context of cross-lingual voice cloning, which forms the basis for much of our research.

Chapter 3 details the architectural foundations of our research, centered on the Transformer block and FastSpeech 2. We explore the adaptations of FastSpeech 2 for multi-speaker and multilingual applications and its fine-tuning for domains not encountered during training.

In Chapter 4, we introduce Meta-TTS, an innovative meta-learning approach for few-shot speaker-adaptive TTS. This chapter details the Meta-TTS training methodology, fine-tuning strategies, and a comprehensive set of evaluation metrics. Through extensive experiments, we demonstrate the effectiveness of Meta-TTS in achieving high speaker similarity and naturalness in voice cloning with limited data.

Chapter 5 shifts focus to personalized, lightweight TTS. We present adaptive structured pruning, a method that improves parameter efficiency and computational speed when fine-tuning TTS models. This technique complements Meta-TTS, focusing on fine-tuning efficiency and model performance, both crucial for efficient and effective TTS systems.

In Chapter 6, we explore few-shot cross-lingual TTS using transferable phoneme embedding, proposing a new methodology that surpasses existing transfer learning approaches. This chapter includes a rigorous experimental evaluation showcasing the effectiveness of phoneme embedding transfer in language adaptation tasks.

The dissertation concludes with Chapter 7, where we summarize the significant contributions of this research to the field of TTS. Our work not only advances the state of the art in voice cloning and cross-lingual adaptation but also opens new avenues for personalized TTS applications, particularly in resource-constrained settings. | en |
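As context for the Meta-TTS work summarized above, the sketch below illustrates the generic MAML-style two-loop structure (inner-loop adaptation on a small support set, outer-loop update of a meta-initialization) on a toy few-shot regression problem. It is a minimal illustrative sketch, not the dissertation's Meta-TTS implementation: the sine-wave tasks, the linear-in-features model, and every hyperparameter below are assumptions chosen only to keep the example self-contained and runnable (using the first-order MAML approximation).

```python
# Minimal first-order MAML sketch for few-shot regression (illustrative only).
# NOT the dissertation's Meta-TTS code: the toy sine-wave "tasks", the
# linear-in-features model, and all hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Each task is a sine wave with its own amplitude and phase
    (standing in for a speaker with its own voice characteristics)."""
    amp, phase = rng.uniform(0.5, 2.0), rng.uniform(0, np.pi)
    def sample(n):
        x = rng.uniform(-5, 5, size=n)
        return x, amp * np.sin(x + phase)
    return sample

def features(x):
    # Fixed basis so the toy model is linear in its parameters w.
    return np.stack([np.sin(x), np.cos(x), np.ones_like(x)], axis=1)

def mse_and_grad(w, x, y):
    """Mean-squared error and its gradient for y_hat = features(x) @ w."""
    phi = features(x)
    err = phi @ w - y
    return np.mean(err ** 2), 2.0 * phi.T @ err / len(x)

w_meta = np.zeros(3)                       # meta-initialization (what MAML learns)
inner_lr, outer_lr, inner_steps = 0.05, 0.01, 5

for it in range(2001):                     # outer loop over sampled tasks
    sample = sample_task()
    x_sup, y_sup = sample(5)               # 5-shot support set (adaptation data)
    x_qry, y_qry = sample(20)              # query set (evaluates the adapted model)

    # Inner loop: adapt a copy of the meta-parameters on the support set.
    w = w_meta.copy()
    for _ in range(inner_steps):
        _, g = mse_and_grad(w, x_sup, y_sup)
        w -= inner_lr * g

    # First-order MAML outer update: apply the query-set gradient of the
    # adapted parameters directly to the meta-initialization.
    qry_loss, g_qry = mse_and_grad(w, x_qry, y_qry)
    w_meta -= outer_lr * g_qry

    if it % 500 == 0:
        print(f"iter {it}: query loss after {inner_steps}-step adaptation = {qry_loss:.3f}")
```

Per the abstract, Meta-TTS applies this two-loop idea to a TTS model rather than a toy regressor: the inner loop adapts to a few utterances from a new speaker, while the outer loop learns an initialization from which that adaptation reaches high speaker similarity quickly.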
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-14T16:46:59Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-08-14T16:46:59Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | 致謝 i
摘要 iii
Abstract v
Contents viii
List of Figures xiv
List of Tables xviii
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Few-Shot and Zero-Shot Learning 2
1.3 TTS Customization Tasks 3
1.3.1 Customizing the Voice — Voice Cloning 4
1.3.2 Customizing the Language — Cross-Lingual Language Adaptation 6
1.3.2.1 Multilingual TTS 6
1.3.2.2 Cross-Lingual TTS Adaptation 7
1.4 Towards Real-World Cases 8
1.4.1 Problems to be Improved 9
1.4.2 Contribution of This Dissertation 11
1.5 Organization of the Dissertation 13
Chapter 2 Related Work 14
2.1 TTS Technology 14
2.1.1 Background Overview 15
2.1.2 End-to-End Neural TTS Models 16
2.1.2.1 Autoregressive TTS 16
2.1.2.2 Non-Autoregressive TTS 17
2.1.3 Vocoder 18
2.2 Multilingual TTS 20
2.3 TTS Adaptation 21
2.3.1 Voice Cloning 22
2.3.2 Cross-Lingual Voice Cloning 24
2.3.3 Cross-Lingual Speaker Adaptation 25
2.3.4 Cross-Lingual Language Adaptation 26
2.4 Meta-Learning for TTS 27
2.4.1 Meta-Learning for Cross-Lingual Voice Cloning 28
2.4.2 Neural Architecture Search for TTS 29
2.4.3 Meta-Learning for Voice Cloning 29
2.5 Model Compression 30
Chapter 3 Model Architecture 33
3.1 Transformer Block 34
3.2 FastSpeech 2 36
3.3 Multi-Speaker FastSpeech 2 36
3.4 Multilingual FastSpeech 2 38
3.5 FastSpeech 2 Adaptation for Unseen Domains via Fine-Tuning 39
3.5.1 Fine-Tuning Multi-Speaker FastSpeech 2 40
3.5.2 Fine-Tuning Multilingual FastSpeech 2 41
3.6 Summary 43
Chapter 4 Meta-TTS: Meta-Learning for Few-shot Speaker Adaptive Text-to-Speech 45
4.1 Introduction 46
4.2 Voice Cloning Task Formulation 48
4.3 Proposed Approach — Meta-TTS 48
4.3.1 An Introduction to MAML for Regression Tasks 49
4.3.2 Meta-TTS Training 51
4.3.3 Fine-tuning 54
4.3.4 Other Variants 54
4.3.4.1 Shared Embedding Initialization Before Speaker Adaptation 54
4.3.4.2 Fine-tuning Different Modules 56
4.4 Evaluation Metrics 57
4.4.1 Subjective Evaluation 58
4.4.1.1 Speaker Similarity (SMOS) 58
4.4.1.2 Naturalness (MOS) 58
4.4.2 Objective Evaluation — Neural Metrics with d-vectors 58
4.4.2.1 d-vector 59
4.4.2.2 Speaker Similarity 59
4.4.2.3 Speaker Verification 60
4.4.2.4 Synthesized Speech Detection 61
4.4.2.5 Naturalness MOS Prediction 61
4.5 Experimental Setups of Few-shot Voice Cloning 62
4.5.1 Datasets 63
4.5.2 Training 63
4.5.3 Inference 64
4.6 Few-Shot Voice Cloning Results 65
4.6.1 Speaker Similarity 65
4.6.2 Speaker Verification 69
4.6.3 Visualizing d-vector Utterance Embedding 74
4.6.4 Synthesized Speech Detection 75
4.6.5 Naturalness 80
4.7 Comparing With Zero-Shot Baseline 82
4.7.1 Model 83
4.7.2 Training 84
4.7.3 Inference 84
4.7.4 Experimental Results 84
4.8 More Analysis 87
4.8.1 Over-Fitting Analysis 87
4.8.2 Speaker Embedding Initialization 89
4.8.3 Upscale the Amount of Training Speakers 90
4.9 Summary 91
Chapter 5 Personalized Lightweight Text-To-Speech: Voice Cloning With Adaptive Structured Pruning 92
5.1 Introduction 93
5.2 Proposed Pruning Approach 95
5.2.1 Structured and Unstructured Pruning 96
5.2.2 Pruning with L0-Regularization 96
5.2.3 Structured Pruning FastSpeech 2 With L0-Regularization 98
5.2.4 Optimizing Adaptive Structured Pruning Masks 99
5.2.5 Inference 99
5.2.6 Compare: Knowledge Distillation 100
5.2.7 Pruning Pipelines 101
5.3 Experimental Setups 102
5.3.1 Subjective and Objective Evaluation Metrics 103
5.4 Experimental Results 104
5.4.1 Subjective Evaluation Results 104
5.4.2 Objective Evaluation Results 105
5.4.3 Inference Acceleration 107
5.4.4 Comparing With Meta-TTS 107
5.4.4.1 Integrating Learnable Structured Pruning With Meta-TTS 108
5.5 Summary 109
Chapter 6 Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding 111
6.1 Introduction 113
6.2 Proposed Approach 115
6.2.1 Settings 116
6.2.2 Baseline: Transfer Learning From Multilingual TTS 117
6.2.3 Proposed: Phoneme Embedding Transfer 118
6.2.3.1 Phoneme Query Extraction 118
6.2.3.2 Codebook Module 119
6.2.3.3 Pre-Training and Fine-Tuning 119
6.2.4 Compare With Meta-TTS 120
6.3 Experimental Setups 121
6.3.1 Dataset 122
6.3.2 Training Setups 122
6.4 Experimental Results 123
6.4.1 Few-Shot Language Adaptation 123
6.4.1.1 Evaluation Settings 123
6.4.1.2 Evaluation Results 124
6.4.2 Feature Selection 125
6.4.3 Analysis of Codebook Attention by Phoneme Mapping Discovery 127
6.4.4 Analysis of Pre-Training Language Settings 128
6.5 Summary 129
Chapter 7 Conclusion 130
References 133 | - |
dc.language.iso | en | - |
dc.title | 個人化文字轉語音:針對新語者、口音及語言的快速輕量化少樣本適應 | zh_TW |
dc.title | Personalized Text-to-Speech Synthesis: Lightweight and Fast Few-Shot Adaptation for Unseen Speakers, Accents and Languages | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | 博士 | - |
dc.contributor.coadvisor | 李宏毅 | zh_TW |
dc.contributor.coadvisor | Hung-yi Lee | en |
dc.contributor.oralexamcommittee | 王新民;曹昱;林守德;陳尚澤 | zh_TW |
dc.contributor.oralexamcommittee | Hsin-Min Wang;Yu Tsao;Shou-De Lin;Shang-Tse Chen | en |
dc.subject.keyword | 文字轉語音生成,少樣本適應,元學習,適應性模型剪枝,語者適應,跨語言適應 | zh_TW |
dc.subject.keyword | Text-to-Speech Synthesis, Few-Shot Adaptation, Meta-Learning, Adaptive Model Pruning, Speaker Adaptation, Cross-Lingual Adaptation | en |
dc.relation.page | 151 | - |
dc.identifier.doi | 10.6342/NTU202402788 | - |
dc.rights.note | 同意授權(限校園內公開) | - |
dc.date.accepted | 2024-08-02 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 電信工程學研究所 | - |
Appears in Collections: | 電信工程學研究所
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-2.pdf Access restricted to NTU campus IP addresses (use the VPN service for off-campus access) | 7.06 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.