
DSpace

The institutional repository system DSpace is dedicated to preserving digital content of all kinds (e.g., text, images, PDF) and making it easy to access.

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79234
Full metadata record

DC Field | Value | Language
dc.contributor.advisor | 李宏毅 (Hung-yi Lee) | -
dc.contributor.author | Shun-Po Chuang | en
dc.contributor.author | 莊舜博 | zh_TW
dc.contributor.author | f04942141 | -
dc.date.accessioned | 2022-11-23T08:56:19Z | -
dc.date.available | 2022-02-16 | -
dc.date.available | 2022-11-23T08:56:19Z | -
dc.date.copyright | 2022-02-16 | -
dc.date.issued | 2022 | -
dc.date.submitted | 2022-01-22 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79234 | -
dc.description.abstract | In recent years, with the rise of deep learning, more and more tasks have adopted fully end-to-end models, which can outperform traditional cascaded systems while simplifying development. However, end-to-end models require large amounts of labeled data for training, and annotation is time-consuming and costly, so some tasks still suffer from data scarcity. This thesis investigates the data scarcity problem through two tasks: code-switching and speech translation. Code-switching data mostly occurs in daily conversation or private messages, which makes collection difficult, so very few public datasets are available; the thesis first studies how to train a code-switching language model without any code-switching data. Speech translation requires paired speech and translations, which are far rarer than the speech-transcript pairs used for speech recognition or the bilingual text pairs used for machine translation, so the thesis also discusses how to effectively exploit additional unpaired data to improve model performance when paired data is limited. Finally, current end-to-end speech models decode autoregressively; autoregressive decoding provides strong language-modeling ability but is slow, which hinders real-world deployment under limited resources. The thesis therefore also investigates non-autoregressive models for code-switching and speech translation, aiming for good performance at higher decoding speed. | zh_TW
dc.description.provenance | Made available in DSpace on 2022-11-23T08:56:19Z (GMT). No. of bitstreams: 1. U0001-2101202201531800.pdf: 4050260 bytes, checksum: 5fc5dd62d4647e2566221ba14a40932c (MD5). Previous issue date: 2022. | en
dc.description.tableofcontents |
Abstract (Chinese)
Abstract (English)
1 Introduction
  1.1 Motivation
  1.2 Task Description
    1.2.1 Code-Switching
    1.2.2 Speech-to-text Translation
    1.2.3 Non-autoregressive Model
  1.3 Organization of the Thesis
2 Related Work
  2.1 Code-Switching Language Model
  2.2 Speech Recognition
  2.3 Speech-to-Text Translation
  2.4 Word Embedding as Learning Target
  2.5 Non-Autoregressive Model
3 Train Code-Switching Language Model without Using Code-Switching Data
  3.1 Introduction
  3.2 Proposed Approaches
    3.2.1 RNN-based Code-Switching Language Model
    3.2.2 Constraints on Output Projection Matrix
    3.2.3 Output Projection Matrix Normalization
  3.3 Experimental Setup
    3.3.1 Corpus
    3.3.2 Pseudo Code-Switching Training Data
    3.3.3 Evaluation Metrics
    3.3.4 Implementation
  3.4 Experimental Results
    3.4.1 Language Modeling
    3.4.2 Visualization
    3.4.3 Unsupervised Bilingual Word Translation
    3.4.4 Sentence Generation
  3.5 Summary
4 Non-Autoregressive Code-Switching ASR Model
  4.1 Introduction
  4.2 Proposed Approaches
    4.2.1 Mask-CTC
    4.2.2 Using Pinyin as Output Target
    4.2.3 Word Embedding Label Smoothing Regularization
    4.2.4 Projection Matrix Regularization
  4.3 Experimental Setup
    4.3.1 Data
    4.3.2 Model
  4.4 Experimental Results
    4.4.1 Proposed Pinyin Decoder and Regularization Methods
    4.4.2 Low-Resource Scenario
    4.4.3 Ablation Studies
  4.5 Summary
5 Improve Speech-to-Text Translation Model by Bringing Additional Semantic Context
  5.1 Introduction
  5.2 Model Descriptions
    5.2.1 Automatic Speech Recognition
    5.2.2 End-to-End Speech Translation
  5.3 Proposed Approaches
    5.3.1 Cosine Distance
    5.3.2 Cosine Softmax
  5.4 Experimental Setup
    5.4.1 Dataset
    5.4.2 Model Parameters
  5.5 Experimental Results on ASR
    5.5.1 460hrs ASR Results
    5.5.2 100hrs ASR Results
    5.5.3 Compatibility with SpecAugment
  5.6 Experimental Results on Speech-to-Text Translation
    5.6.1 Cascaded System
    5.6.2 End-to-End System
  5.7 Summary
6 Non-Autoregressive Speech-to-Text Translation Model
  6.1 Introduction
  6.2 Proposed Approaches
    6.2.1 CTC-based NAR-ST Model
    6.2.2 CTC-based Multitask NAR-ST Model
    6.2.3 Reordering Evaluation: Kendall's Tau Distance
  6.3 Knowledge Distillation
  6.4 Experimental Setup
    6.4.1 Data Preprocessing
    6.4.2 Model Setup
    6.4.3 Gradient-based Visualization
  6.5 Experiment
    6.5.1 Translation Quality and Speed
    6.5.2 Word Order Analysis
  6.6 Summary
7 Conclusions
References
dc.language.iso | en | -
dc.title | 對於語碼轉換和語音翻譯任務之資料稀缺性與非自回歸模型研究 | zh_TW
dc.title | Investigate the data scarcity issue and non-autoregressive model in Code-switching and Speech-to-text translation | en
dc.date.schoolyear | 110-1 | -
dc.description.degree | 博士 (Doctoral) | -
dc.contributor.author-orcid | 0000-0003-0720-2732 | -
dc.contributor.oralexamcommittee | 李琳山 (Lin-shan Lee), 王新民 (Hsin-Min Wang), 曹昱 (Yu Tsao), 陳信希 (Hsin-Hsi Chen), 王和盛 | -
dc.subject.keyword | 語音翻譯, 語碼轉換, 資料稀缺性, 非自回歸模型 | zh_TW
dc.subject.keyword | Speech Translation, Code-Switching, Data scarcity, non-autoregressive model | en
dc.relation.page | 119 | -
dc.identifier.doi | 10.6342/NTU202200129 | -
dc.rights.note | Authorization granted (open access worldwide) | -
dc.date.accepted | 2022-01-22 | -
dc.contributor.author-college | College of Electrical Engineering and Computer Science (電機資訊學院) | zh_TW
dc.contributor.author-dept | Graduate Institute of Communication Engineering (電信工程學研究所) | zh_TW
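
The non-autoregressive decoding that the abstract contrasts with autoregressive decoding can be made concrete by the CTC-style collapse step behind models such as Mask-CTC (Chapter 4) and the CTC-based NAR-ST model (Chapter 6). The sketch below is a minimal illustration only, not code from the thesis; the blank index and the toy frame labels are assumptions chosen for the demo.

```python
# Illustrative sketch (assumed toy labels and blank index; not thesis code).
# CTC-style non-autoregressive decoding predicts a label for every acoustic
# frame in one parallel pass, then collapses the frame sequence: merge
# consecutive repeats, drop blanks. No left-to-right token loop is needed.

BLANK = 0  # assumed index of the CTC blank symbol

def ctc_collapse(frame_labels: list[int], blank: int = BLANK) -> list[int]:
    """Merge consecutive duplicate labels, then remove blanks."""
    out: list[int] = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Per-frame argmax output for 10 frames of a toy utterance:
print(ctc_collapse([0, 3, 3, 0, 0, 5, 5, 5, 0, 7]))  # -> [3, 5, 7]
```

Because every frame label comes from a single parallel prediction pass, decoding cost does not grow with an output-token-by-token loop, which is the speedup the abstract refers to.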
Appears in collections: Graduate Institute of Communication Engineering

Files in this item:
File | Size | Format
U0001-2101202201531800.pdf | 3.96 MB | Adobe PDF
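
The provenance field above records an MD5 checksum for this bitstream. A downloaded copy can be verified against it with a short script; this is a generic hashlib recipe under the assumption that the PDF sits in the working directory under its original name, not a tool provided by the repository.

```python
# Minimal integrity check against the MD5 checksum recorded in
# dc.description.provenance. Assumes the PDF was downloaded to the working
# directory under its original name.
import hashlib

EXPECTED_MD5 = "5fc5dd62d4647e2566221ba14a40932c"  # from the provenance field

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large PDFs are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of("U0001-2101202201531800.pdf") == EXPECTED_MD5)
```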


Except where otherwise indicated by their own copyright terms, all items in this repository are protected by copyright, with all rights reserved.
