Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91729

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李宏毅 | zh_TW |
| dc.contributor.advisor | Hung-Yi Lee | en |
| dc.contributor.author | 劉廷緯 | zh_TW |
| dc.contributor.author | Ting-Wei Liu | en |
| dc.date.accessioned | 2024-02-22T16:26:56Z | - |
| dc.date.available | 2024-03-12 | - |
| dc.date.copyright | 2024-02-22 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-01-25 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91729 | - |
| dc.description.abstract | 在快速發展的數位語音處理(DSP)領域中,在近年來越發先進的深度神經網絡(DNNs)以及耗費大量資源後得到的人工標註數據的推動下,監督式學習(Supervised Learning)迎來了突破性的發展。然而,由於需要昂貴的人工標註數據以及每個新任務都需要從頭開始訓練深度神經網絡的開銷,使得監督式學習的發展遇到了瓶頸。在資源有限的情況下,傳統的監督式方法容易受限。相較之下,人類從大量未標記數據中自我學習卻自然且高效。舉例來說,一個孩子可以通過僅僅接觸和互動,幾乎不需要明確的指導,就能學習一種語言的基礎知識,掌握語法和詞匯。而機器則完全相反,即使經過數千小時的人工標註數據訓練,也可能無法完全掌握一個人類語言中的語境變化。我們人類展現了將經驗輕鬆編碼成記憶或常識的驚人能力,因此我們能高效地透過回憶,重複使用我們學過的背景知識,以此來面對新的任務。因此,人類可以通過極少的明確指導,識別以前從未見過的對象,掌握不熟悉的技能或學習新語言。這種人類獨有的學習能力使我們能夠非常高效地獲取新知識和技能。為了使機器能夠更貼近並效仿人類的學習方式,我們針對過往語音模型需要用到大量的標註資料這種問題提出方法,希望能有效的利用未標註的語音數據來規避傳統監督式學習方法會遇到的難處。本論文將重心放在自監督式學習(Self-Supervised Learning, SSL)於語音處理上的應用,旨在開發如何高效利用能輕易取得的未標註語音數據來預訓練(Pre-train)模型。我們藉由自監督式學習,開發出高效且可重複使用的自監督式模型。我們所提出的方法使用了比以往更少量的標註資料,且單一模型能一次勝任多種任務,更能增強不同的語音處理任務的效能,特別是在低資源(Low-resource)與零資源(Zero-Resource)的情況下。除此之外,我們也提出新的自監督式學習設計與訓練方針,在有限的電腦運算資源下,能更高效率的預訓練自監督式學習模型。 | zh_TW |
| dc.description.abstract | In the rapidly evolving digital speech processing (DSP) domain, supervised learning, driven by advanced deep neural networks (DNNs) and a wealth of task-specific labeled data, has made significant progress. However, the need for large labeled datasets and the overhead of training DNNs from scratch for every application make this approach resource-intensive, and the conventional supervised paradigm is especially limiting when resources are scarce. In contrast, humans are naturally efficient at self-learning from vast amounts of unlabeled data. For instance, a child can learn the basics of a language, grasping syntax and vocabulary, through mere exposure and interaction with little explicit instruction. Machines, in sharp contrast, require thousands of hours of labeled data to achieve proficiency in speech processing tasks, and even then they may not fully grasp the nuances and contextual variations inherent in human speech. Humans exhibit a remarkable ability to effortlessly encode experiences into memories or common sense, efficiently reusing, recalling, and reapplying learned background knowledge when faced with new tasks. As a result, with minimal explicit instruction, humans can recognize previously unseen objects, master unfamiliar skills, or learn new languages. This unique learning capability allows us to acquire new knowledge and skills very efficiently.
Drawing inspiration from human learning, this thesis pivots from the label-intensive supervised learning paradigm towards self-supervised learning (SSL) algorithms, also referred to in the literature as self-supervised representation learning (SSRL). We aim to leverage unlabeled speech data to circumvent the challenges associated with the conventional supervised learning scheme. In particular, we focus on developing and studying the pre-training of SSRL models from large amounts of unlabeled speech data. The core aim is to develop versatile, reusable, and efficient representations to enhance different digital speech processing (DSP) tasks, especially in low-resource and even zero-resource scenarios. In the first part of the thesis, we demonstrate learning zero-resource tasks with no downstream supervision. We first discover discrete linguistic units from speech without using any labels under an autoencoder reconstruction setting. We find that the proposed representation automatically separates speech content from speaker style and is sufficient to cover the linguistic content of a given language. Therefore, we can perform unsupervised voice conversion (VC) for low-resource languages with zero labels. In the ZeroSpeech 2019 Challenge, we achieved outstanding representation performance with a very low bitrate. In the second part of the thesis, we show how an SSRL model allows us to achieve better speech processing performance with less labeled data. Previous speech representation methods learn by conditioning on past frames and predicting information about future frames. In contrast, our method encodes the current frame by jointly conditioning on past and future contexts through a single auxiliary task: masked reconstruction on the time axis. Experimental results show that our representation improves performance for phoneme classification while outperforming other approaches. In a low-resource setting, with minimal fine-tuning and a fraction of labeled data (0.1%), we remarkably surpass conventional methods that rely on fully labeled datasets (100%). In the third part of the thesis, we demonstrate learning a single model that can be transferred to a wide range of speech tasks. We present an SSRL model that learns by applying a reconstruction loss on three orthogonal axes: time, frequency, and magnitude. This allows the model to capture the rich information in speech, aiming for a versatile model across varied tasks and domains. In contrast, previous work often learns using a single auxiliary task such as contrastive prediction, autoregressive prediction, or masked reconstruction. Experimental results show that the proposed representation benefits several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We achieve robust performance, improving upon surface features and outperforming previous models. We show that our speech representations are transferable not only across downstream tasks but also to datasets not seen during pre-training, thus increasing reusability and efficiency. In the fourth part, we study the dynamics of the SSRL pre-training process, investigate model design choices, and benchmark our models on the widely recognized SUPERB benchmark. We first investigate the effect of pre-training on different amounts of data and pre-training on various acoustic features.
We analyze different model sizes and find that smaller models are stronger representation learners, while larger models are more effective for downstream fine-tuning. We evaluate our models with the SUPERB benchmarking protocol and further demonstrate the feasibility of adopting one pre-trained model for many speech tasks. In the last part of the thesis, we investigate the underlying factors contributing to the success of SSRL models in speech processing. We first demonstrate that a well-designed SSRL pretext task can benefit downstream performance. Interestingly, under a constrained parameter budget, slimmer models surpass conventional small models. Under a limited computational budget, enlarging the model size yields larger performance gains than increasing the data size. Moreover, given a fixed model size and computing budget, the size of the unlabeled dataset remains vital, as performance suffers when iterating over a small data pool. Additionally, given a fixed training budget, we observe a valley curve in loss and performance as a function of model size, indicating an optimal model size for a given compute budget. Finally, under a limited computational budget, we pre-train TERA with a new architectural design and the optimal model size, which results in TERA achieving superior performance compared to HuBERT and wav2vec 2.0 under comparable settings. Our findings shed light on the delicate balance between model and data size and illuminate the complex dynamics of training speech SSRL models, offering guidance for future research on the next generation of SSRL models in resource-constrained scenarios. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-02-22T16:26:56Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-02-22T16:26:56Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 誌謝 i
中文摘要 iii
Abstract iv
1 Introduction 1
1.1 Motivation 1
1.2 Contribution 3
1.3 Thesis Organization 5
2 Background 7
2.1 Speech Processing 7
2.1.1 Speech Processing Networks 7
2.1.2 Supervised Speech Processing 11
2.1.3 Limitations of Supervised Speech Processing 14
2.2 Self-Supervised Representation Learning for Speech 15
2.2.1 Overview 15
2.2.2 Advantages of Self-Supervised Learning for Speech 19
2.3 Evaluation of Speech Representations 22
2.3.1 Linear Probing Models 22
2.3.2 Linear Concatenate Models 23
2.3.3 Hidden Layer Models 23
2.3.4 Hybrid Modeling ASR 24
2.3.5 End-to-End ASR 25
3 Autoencoder Representations: Towards Zero Supervision Downstream Learning 27
3.1 Introduction 28
3.2 Related Work 29
3.2.1 Continuous Speech Segment Representations 29
3.2.2 Voice Conversion with Continuous Representations 30
3.2.3 Discrete Speech Representations 30
3.2.4 Influence and Impact to Subsequent Work 31
3.3 Method 31
3.3.1 Discrete Linguistic Unit Representations 32
3.3.2 Target-Guided Adversarial Learning with GAN 35
3.4 Experimental Setup 38
3.5 Results 40
3.5.1 Degree of Disentanglement 40
3.5.2 Subjective and Objective Evaluation 41
3.5.3 Encoding Dimension Analysis 42
3.5.4 The Zero Resource Speech Challenge Competition 44
3.6 Summary 44
4 Bidirectional Representations: Improved Performance with Less Supervision 46
4.1 Introduction 47
4.2 Related Work 49
4.2.1 CPC: Contrastive Predictive Coding 49
4.2.2 APC: Autoregressive Predictive Coding 49
4.2.3 vq-wav2vec: Discrete Representations 50
4.2.4 Influence and Impact to Subsequent Work 50
4.3 Method 51
4.3.1 Overview 51
4.3.2 Model Architecture 52
4.3.3 Masked Acoustic Modeling 54
4.3.4 Incorporating with Downstream Task 55
4.4 Experimental Setup 56
4.4.1 Model Implementation 56
4.4.2 Proposed Model Settings 57
4.4.3 Comparing with Other Representations 58
4.4.4 Downstream Task Settings 58
4.5 Results 59
4.5.1 Improved Downstream Performance 59
4.5.2 Low-resource Downstream Performance 59
4.6 Summary 60
5 Universal Representations: A Single Model for Different Tasks and Domains 62
5.1 Introduction 63
5.2 Related Work 66
5.2.1 Contrastive Approaches 67
5.2.2 Predictive Approaches 69
5.2.3 Generative Approaches 69
5.2.4 Influence and Impact to Subsequent Work 72
5.3 Method 73
5.3.1 Alteration on Data 73
5.3.2 Self-Supervised Pre-training Algorithm 77
5.3.3 Incorporating with Downstream Tasks 80
5.4 Experimental Setup 81
5.4.1 Datasets 82
5.4.2 Phoneme Classification Setup 83
5.4.3 Keyword Spotting Setup 84
5.4.4 Speaker Classification Setup 84
5.4.5 Hybrid DNN/HMM ASR Setup 85
5.4.6 Training Downstream Tasks 86
5.5 Results 87
5.5.1 The Effect of Different Alterations 87
5.5.2 Comparison of Recent Speech Representation Approaches 91
5.5.3 Analysis on TERA 97
5.5.4 Applying Speech Representations for ASR 101
5.5.5 Applying Speech Pre-training for ASR 103
5.5.6 Transferring to TIMIT 106
5.6 Summary 108
6 Assessing Pre-training Dynamics: Experimental Analysis and Benchmark Evaluation 109
6.1 Introduction 109
6.2 Method 110
6.2.1 Analysis on TERA 110
6.2.2 The SUPERB Benchmark 111
6.3 Experimental Setup 112
6.3.1 Analysis Setup 112
6.3.2 SUPERB Setup 113
6.4 Results 114
6.4.1 Pre-training on More Data 114
6.4.2 Pre-training on Different Acoustic Features 116
6.4.3 Pre-training on Different Network Depth 118
6.4.4 SUPERB Benchmark Results 118
6.5 Summary 119
7 Efficient Representation Training: Success Factors of Self-Supervised Learning for Speech, on the Factors of Model Size, Data Size, and Computational Budget 121
7.1 Introduction 122
7.2 Related Work 126
7.2.1 Studies on Model Architecture Analysis 126
7.2.2 Studies on Model and Data Size Analysis 127
7.2.3 Studies on Layer-wise Analysis 127
7.2.4 Studies on Understanding Self-Supervised Learning 128
7.3 Experimental Setup 129
7.3.1 Self-Supervised Representation Learning Models 129
7.3.2 Pre-training Details 131
7.3.3 Downstream Evaluation Methods 133
7.4 Results and Analysis 134
7.4.1 The Design of SSRL Models 134
7.4.2 The Effect of Model and Data Size 138
7.4.3 Data Volume vs. Iteration Frequency 139
7.4.4 Model Size vs. Data Size: FLOPS Valley Curves 140
7.4.5 Training TERA Slim with the Optimal Model Size 146
7.5 Conclusion 147
8 Conclusion 149
8.1 Thesis Summary 149
8.1.1 Achievement of Each Chapter 150
8.1.2 Influence and Impact to Subsequent Work 152
8.2 Future Work 155
8.2.1 Expanding the Horizons of SSRL in Low-Resource Environments 155
8.2.2 Explainable Universal Speech Representations 156
8.2.3 Optimizing Model Size for Computational Efficiency 156
References 158 | - |
| dc.language.iso | en | - |
| dc.subject | 語音處理 | zh_TW |
| dc.subject | 自監督式學習 | zh_TW |
| dc.subject | 低資源 | zh_TW |
| dc.subject | 預訓練 | zh_TW |
| dc.subject | 表徵學習 | zh_TW |
| dc.subject | representation learning | en |
| dc.subject | speech processing | en |
| dc.subject | self-supervised learning | en |
| dc.subject | low-resource | en |
| dc.subject | pre-training | en |
| dc.title | 更高效的語音處理:低資源情境下的自監督式學習 | zh_TW |
| dc.title | Speech Processing with Higher Efficiency: Self-Supervised Learning for Low-Resource Scenarios | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-1 | - |
| dc.description.degree | 博士 | - |
| dc.contributor.oralexamcommittee | 李琳山;王新民;孫紹華;簡仁宗;林守德;陳信希 | zh_TW |
| dc.contributor.oralexamcommittee | Lin-shan Lee;Hsin-Min Wang;Shao-Hua Sun;Jen-Tzung Chien;Shou-De Lin;Hsin-Hsi Chen | en |
| dc.subject.keyword | 語音處理,自監督式學習,低資源,預訓練,表徵學習 | zh_TW |
| dc.subject.keyword | speech processing, self-supervised learning, low-resource, pre-training, representation learning | en |
| dc.relation.page | 179 | - |
| dc.identifier.doi | 10.6342/NTU202400216 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2024-01-29 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
Appears in Collections: 資訊工程學系
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-1.pdf | 13.13 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
