Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91788
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李宏毅 | zh_TW |
dc.contributor.advisor | Hung-yi Lee | en |
dc.contributor.author | 劉達融 | zh_TW |
dc.contributor.author | Da-Rong Liu | en |
dc.date.accessioned | 2024-02-22T16:43:52Z | - |
dc.date.available | 2024-02-23 | - |
dc.date.copyright | 2024-02-22 | - |
dc.date.issued | 2022 | - |
dc.date.submitted | 2023-08-17 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91788 | - |
dc.description.abstract | 由於深度學習的發展,語音辨識系統在近期已取得了不錯的成果,但在許多情況下語音辨識系統並不總是能表現得這麼好。首先,語音辨識系統的訓練依賴於大量的配對語音和文字資料,這對於全球超過95%的低資源語言是難以獲得的。其次,不同語境情境下使用的文字有不同的偏好,也可能出現特定情境下的罕見字詞,使得語音辨識系統無法在這些情境中表現得很好。在本論文中,我們希望提高語音辨識系統在不同語言、不同語境的適用性。
因為搜集大量未標註的資料相比於搜集大量的配對資料更容易,我們是否有可能基於非配對的語音和文字資料訓練一個無監督語音辨識系統呢?如果這個技術獲得成功,將可以使訓練語音辨識系統的成本大幅下降,讓低資源語言也可以享有高品質的語音辨識。
本論文是世界上第一次成功的無監督語音辨識嘗試,我們提出了一種兩階段迭代框架來實現無監督語音辨識系統。在框架的第一階段,文字先藉由辭典轉換成音素序列,然後生成對抗網路被用來找到未標註語音到音素序列的對應關係。在框架的第二階段,我們引入一個隱藏式馬可夫模型,以生成對抗網路的輸出進行訓練,進一步提高表現並為下一次生成對抗網路訓練提供更好的音素分割。
本論文探索了不同的生成對抗網路架構。首先,受到廣泛使用的語音單元技術的啟發,我們提出從語音生成離散語音單元並通過生成對抗網路將其對應到音素,成功地實現無監督音素辨識。然而,我們發現上述無監督方法的表現受到所生成的離散語音單元的品質限制。為了解決這個問題,我們進一步提出了不依賴於這些離散語音單元的新生成對抗網路架構。最終,我們的迭代框架可以在基準語料庫 TIMIT 上達到36.71%的音素錯誤率,在2021年以前這是 TIMIT 上無監督音素錯誤率最低的紀錄。
接下來,為了提高語音辨識系統在不同語境情境中的適用性,本論文研究了使用文字語境資訊來提高語音辨識系統表現的方法。過去的相關研究主要專注於使用通訊錄列表等做為語境資訊,並用在數位助理相關的語音辨識任務上;本論文的目標是以社群媒體上使用者上傳的影片說明來提升影片內容的語音辨識正確率。不同於過去的研究,我們需要辨識更多樣化的內容,且語音辨識系統需要使用長篇文字段落當作語境資訊。我們所提出的模型包含注意模型,用於總結資訊,以及指針網路,用於在文字語境資訊中選擇正確的罕見字詞。提出的模型在已經使用上萬小時配對語音和文字資料進行訓練的商用系統上,可以達到5%的相對字錯誤率進步。 | zh_TW |
dc.description.abstract | Automatic speech recognition (ASR) has achieved remarkable performance thanks to the development of deep learning, but it does not perform equally well in all settings. First, training an ASR system relies on large amounts of paired speech and text data, which are difficult to obtain for the low-resource languages that make up at least 95% of the world's languages. Second, an ASR system cannot easily adapt to different contextual scenarios, because word preferences vary across contexts and context-specific rare words appear. In this thesis, we aim to improve the applicability of ASR systems across languages and contextual scenarios.
Since it is easier to collect large amounts of unlabeled data than large amounts of paired data, is it possible to train an unsupervised speech recognition system from unpaired speech and text? If this technology succeeds, the cost of training ASR systems will drop sharply, and low-resource languages will also be able to enjoy high-quality speech recognition.
This thesis presents the world's first successful attempt at unsupervised speech recognition: a two-stage iterative framework for unsupervised ASR. In the first stage, text is transformed into phoneme sequences with a lexicon, and a generative adversarial network (GAN) is employed to find the mapping from unannotated speech to phoneme sequences. In the second stage, a hidden Markov model (HMM) is trained on the GAN's output, further improving performance and providing better phoneme segmentation for the next iteration of GAN training.
Different GAN architectures are explored. Inspired by the widely used technique of discovering acoustic tokens from speech, we first propose a GAN architecture that generates discrete acoustic tokens from speech and learns their mapping to phonemes, successfully achieving unsupervised phoneme recognition. However, we find that the performance of this approach is limited by the quality of the generated discrete acoustic tokens. To address this issue, we propose new GAN architectures that do not rely on these tokens. Our iterative framework achieves a phoneme error rate of 36.71% on the TIMIT benchmark, which was the lowest unsupervised phoneme error rate reported on TIMIT before 2021.
Next, to improve the applicability of ASR systems in different contexts, this thesis studies how textual contextual information can be used to enhance ASR performance. Previous work has mainly used contact lists as contextual information for speech recognition tasks related to digital assistants. This thesis instead aims to improve speech recognition of video content by using the video descriptions uploaded by users on social media; unlike previous studies, the ASR system must recognize far more diverse content and must use long text paragraphs as contextual information. Our proposed model consists of an attention model that summarizes the contextual information and a pointer network that selects the correct rare words from it. The proposed model achieves a 5% relative word error rate improvement over a commercial system trained on tens of thousands of hours of paired speech and text data. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-02-22T16:43:52Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-02-22T16:43:52Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | 誌謝 i
中文摘要 iii
Abstract v
1 Introduction 1
1.1 How to Learn an ASR System for Low-resource Languages? 2
1.2 How to Adapt ASR to Different Contextual Scenarios? 8
1.3 Thesis Guide 9
2 Related Work 11
2.1 Automatic Speech Recognition (ASR) 11
2.1.1 Traditional ASR 12
2.1.2 End-to-end ASR 15
2.1.3 Challenges in ASR 16
2.2 Unsupervised ASR 16
2.2.1 Distribution Matching 17
2.2.2 Unsupervised Phoneme Segmentation 19
2.2.3 Speech Representation Learning 20
2.2.4 Acoustic Token Discovery 21
2.2.5 Existing Approach 22
2.3 Contextual ASR 23
2.3.1 Non-textual Contextual Information 24
2.3.2 Textual Contextual Information 24
3 Unsupervised ASR - Overall Framework 26
3.1 Framework Overview 26
3.2 GAN Training 28
3.2.1 Phoneme Segmentation Module 29
3.2.2 Generator 30
3.2.3 Preprocessing of Real Phoneme Sequence 30
3.2.4 Discriminator 31
3.2.5 Optimization Formulation of GAN Training 31
3.3 Self Re-training 32
3.4 Chapter Summary 32
4 Unsupervised ASR - GAN Architecture 1: Using Discrete Acoustic Tokens as an Intermediate Step 35
4.1 Introduction 35
4.2 The Proposed GAN Architecture 37
4.2.1 Audio2Vec 37
4.2.2 K-means Clustering 39
4.2.3 Lookup Table Module 39
4.3 Experimental Setup 41
4.3.1 Audio Data 41
4.3.2 Lexicon and Text Data 41
4.3.3 Experimental Setting 42
4.4 Experimental Result 43
4.4.1 Analysis on Number of Clusters 43
4.4.2 Unsupervised Phoneme Recognition 44
4.4.3 Comparison with Supervised Approaches 45
4.5 Chapter Summary 46
5 Unsupervised ASR - GAN Architecture 2: Discarding Discrete Acoustic Tokens and Using End-to-end Model 48
5.1 The Proposed GAN Architecture 48
5.1.1 Segment-wise Generator 51
5.1.2 Frame-wise Generator 51
5.1.3 Gumbel-Softmax 53
5.2 Experimental Setup 55
5.2.1 Dataset 55
5.2.2 Training Setting 56
5.3 Experimental Result 57
5.3.1 Comparing Segment-wise and Frame-wise Generator 57
5.3.2 Discussion of the Capacity of the Frame-wise Generator 59
5.3.3 Using Gumbel-Softmax in Generator 61
5.3.4 Comparing Discriminator Architecture 63
5.3.5 Error Analysis 64
5.4 Chapter Summary 66
6 Unsupervised ASR - Self Re-training 68
6.1 From the Limitation of GAN Training to Self Re-training 68
6.2 Experimental Setup 69
6.3 Experimental Result 70
6.3.1 Effectiveness of Self Re-training 70
6.3.2 Compared to Previous Works 74
6.3.3 Robustness of Self Re-training 78
6.4 Chapter Summary 78
7 Unsupervised ASR - Looking towards the Future by Analyzing the Current State-of-the-art Model 79
7.1 Introduction 79
7.2 Main Differences from Wav2vecU to Our Proposed Framework 80
7.2.1 Difference1: Feature 80
7.2.2 Difference2: Segmentation Refinement 81
7.2.3 Difference3: Unsupervised Phoneme Segmentation Method 82
7.3 Potential Issues of Wav2vecU 85
7.3.1 Training Stability Issue 85
7.3.2 Generalizability Issue 87
7.4 Chapter Summary 94
8 Contextual ASR - Improving ASR with Textual Contextual Information 96
8.1 Introduction 96
8.2 Contextual Language Model 98
8.2.1 Attention Model 99
8.2.2 Hybrid Pointer Network 100
8.3 Experiments 102
8.3.1 Language Modeling Performance 104
8.3.2 ASR performance 104
8.4 Analysis 106
8.5 Chapter Summary 109
9 Conclusion 111
9.1 Thesis Summary 111
9.2 Future Work 115
9.2.1 Unsupervised ASR without the Use of a Lexicon 115
9.2.2 Utilizing Multiple Modalities as Contextual Information 116
References 118 | - |
dc.language.iso | en | - |
dc.title | 提高語音辨識系統的適用性:無監督語音辨識和語境語音辨識 | zh_TW |
dc.title | Improving the Applicability of Automatic Speech Recognition (ASR) Systems: Unsupervised ASR and Contextual ASR | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | 博士 | - |
dc.contributor.oralexamcommittee | 李琳山;林軒田;王新民;簡仁宗 | zh_TW |
dc.contributor.oralexamcommittee | Lin-shan Lee;Hsuan-Tien Lin;Hsin-Min Wang;Jen-Tzung Chien | en |
dc.subject.keyword | 語音辨識,無監督式學習,語境語音辨識,生成對抗網路,語言模型 | zh_TW |
dc.subject.keyword | speech recognition, unsupervised learning, contextual ASR, generative adversarial network, language model | en |
dc.relation.page | 144 | - |
dc.identifier.doi | 10.6342/NTU202304180 | - |
dc.rights.note | 同意授權(全球公開) | - |
dc.date.accepted | 2023-08-17 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 電信工程學研究所 | - |
Appears in Collections: | 電信工程學研究所 |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 5.31 MB | Adobe PDF | View/Open |