Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/56250

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李琳山(Lin-shan Lee) | |
| dc.contributor.author | Alexander H. Liu | en |
| dc.contributor.author | 劉浩然 | zh_TW |
| dc.date.accessioned | 2021-06-16T05:20:31Z | - |
| dc.date.available | 2020-08-04 | |
| dc.date.copyright | 2020-08-04 | |
| dc.date.issued | 2020 | |
| dc.date.submitted | 2020-07-28 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/56250 | - |
| dc.description.abstract | 隨著機器學習以及相關硬體技術之發展進步,基於深層學習的端到端語音辨識系統(End-to-end Speech Recognition System)逐漸開始取代傳統多模組式系統。相對簡單的模型架構與純數據引導(Data Driven)的訓練方法雖然是端到端語音辨識的優勢所在,卻也無可避免的造成了其技術發展最大的隱憂:過度倚賴基於大量人工標註之語音文字配對資料的監督式學習(Supervised Learning)。近年學界業界均意識到此一問題,並開始著手於降低端到端語音辨識對於人工標註資料倚賴之研究。有鑑於此,本論文提出三種使用不同資源來改善上述缺點之半監督式(Semi-supervised)端到端語音辨識方法。 第一種方法利用無語音對應之純文本所訓練得到的詞嵌入(Word Embedding)作為引導,提出針對序列到序列(Sequence-to-sequence)語音辨識模型之正規化(Regularization),將詞嵌入所攜帶的語意資訊作為語音辨識訓練時額外的目標。此外,更進一步將詞嵌入加入文字解碼過程,使語音辨識模型輸出受詞嵌入空間之相對關係影響,以期產生更符合文意之辨識結果。實驗結果顯示詞嵌入能夠簡單有效的使語音辨識系統受益於純文字資料,需要付出的運算資源成本極低,且能夠與既有的純文字技術相容。 第二種方法在語音辨識模型與語言模型間引入了對抗式訓練(Adversarial Learning),將原本語音辨識中輸出結果需要通過語言模型修正的過程提前至訓練階段,使語音辨識模型能夠直接從純文字資料中學習語言知識。實驗結果顯示該方法能夠讓端到端語音辨識受益於大量的純文字資料,在經標註之語音對應文字資料有限的情況下顯著提升辨識率。 第三種方法則提出了語音辨識合成串連框架,從大量未經人工標註之純語音資料中學習語音表徵(Speech Representation),並利用少部份經標註資料完成語音表徵與音素之一一對應。經對應的語音表徵能夠有效降低監督式訓練對於語音辨識之重要性。實驗證實了在僅使用20分鐘以內標註資料的情況下,使用語音表徵之語音辨識模型可以將語音辨識錯誤率有效降低。 | zh_TW |
| dc.description.abstract | With the rapid progress of machine learning and the related hardware, end-to-end speech recognition systems based on deep learning have gradually begun to replace traditional multi-module systems. Although end-to-end speech recognition has clear advantages, such as a relatively simple model architecture and purely data-driven training, it also has a major weakness: excessive reliance on supervised learning over large amounts of manually labeled paired speech and text data. In recent years, both the research community and industry have recognized this problem and have begun working on reducing the reliance of end-to-end speech recognition on manually labeled data. In view of this, this thesis proposes three semi-supervised end-to-end speech recognition methods that use different resources to address the above shortcoming. The first method uses word embeddings trained on plain text, with no corresponding speech, as a guide, and proposes a regularization for sequence-to-sequence speech recognition models that takes the semantic information carried by the word embeddings as an additional training target. In addition, the word embeddings are further incorporated into the text decoding process, so that the output of the speech recognition model is influenced by the relative structure of the word embedding space, in order to produce recognition results that better fit the textual context. Experimental results show that word embeddings offer a simple yet effective way for the speech recognition system to benefit from plain text data, at very low additional computational cost, while remaining compatible with existing text-only techniques. The second method introduces adversarial training between the speech recognition model and a language model, moving the language-model correction of recognition outputs, normally applied after decoding, forward into the training stage, so that the speech recognition model can learn linguistic knowledge directly from plain text data. Experimental results show that this method allows end-to-end speech recognition to benefit from large amounts of plain text data and significantly improves recognition accuracy when labeled paired speech and text data is limited. The third method proposes a cascading recognition-synthesis framework, which learns speech representations from large amounts of unlabeled speech data and uses a small amount of labeled data to establish a one-to-one mapping between the speech representations and phonemes. The mapped speech representations can effectively reduce the importance of supervised training for speech recognition. Experiments show that, with no more than 20 minutes of labeled data, a speech recognition model using these speech representations can effectively reduce the recognition error rate. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-16T05:20:31Z (GMT). No. of bitstreams: 1 U0001-2707202015442200.pdf: 10309472 bytes, checksum: 13ce2c892819e1b11166ec125d00b933 (MD5) Previous issue date: 2020 | en |
| dc.description.tableofcontents | Oral Examination Committee Approval Certificate i Chinese Abstract ii English Abstract iv 1. Introduction 1 1.1 Motivation 1 1.2 Research Directions and Contributions 3 1.3 Thesis Organization 4 2. Background 6 2.1 Deep Neural Networks 6 2.1.1 Model and Principles 6 2.1.2 Convolutional Neural Networks 9 2.1.3 Recurrent Neural Networks 10 2.2 Sequence-to-Sequence Learning 12 2.2.1 Encoder-Decoder Architecture 12 2.2.2 Attention Mechanism 14 2.3 Neural-Network-Based End-to-End Speech Recognition Models 16 2.3.1 Sequence-to-Sequence Speech Recognition Models 16 2.3.2 Connectionist Temporal Classification Models 18 2.4 Decoding for End-to-End Speech Recognition 19 2.4.1 Greedy Decoding 19 2.4.2 Beam Search 20 2.4.3 Language Models 21 3. Semi-Supervised Method Based on Word-Embedding Regularization and Decoding 23 3.1 Introduction 23 3.2 Related Work 24 3.2.1 Word Embeddings 24 3.2.2 Existing Semi-Supervised Speech Recognition Techniques 25 3.3 Proposed Method 25 3.3.1 Standard Sequence-to-Sequence Speech Recognition Decoding 25 3.3.2 Word-Embedding Regularization for Sequence-to-Sequence Speech Recognition 26 3.3.3 Decoding Fused with Word Embeddings 28 3.3.4 Summary of the Method 30 3.4 Experiments 31 3.4.1 Experimental Setup 31 3.4.2 Results and Discussion 33 3.5 Chapter Summary 38 4. Semi-Supervised Method Based on a Critic Language Model and Adversarial Training 39 4.1 Introduction 39 4.2 Related Work 42 4.2.1 Semi-Supervised Speech Recognition Techniques 42 4.3 Proposed Method 44 4.3.1 Critic Language Model 44 4.3.2 Adversarial Training of the Critic Language Model and the Speech Recognition Model 47 4.4 Experiments 49 4.4.1 Experimental Setup 49 4.4.2 Results 49 4.5 Chapter Summary 53 5. A Cascading Recognition-Synthesis Framework toward Unsupervised Learning 54 5.1 Introduction 54 5.2 Related Work 56 5.2.1 Speech Representations 56 5.2.2 Related Work Combining Speech Recognition and Speech Synthesis 56 5.3 The Cascading Recognition-Synthesis Framework 58 5.3.1 Speech Autoencoder 59 5.3.2 Acoustic Clustering Mechanism 61 5.3.3 Temporal Segmentation of Sequential Representations 63 5.3.4 Mapping Speech Representations to Phonemes 64 5.3.5 Sequential Quantized-Representation Autoencoder 66 5.4 Experiments 68 5.4.1 Experimental Setup 68 5.4.2 Results 69 5.5 Chapter Summary 75 6. Conclusion and Outlook 76 6.1 Overall Comparison of the Semi-Supervised Methods and Contributions of This Thesis 76 6.2 Summary and Future Outlook 78 References 80 | |
| dc.language.iso | zh-TW | |
| dc.subject | 半監督式學習 | zh_TW |
| dc.subject | 語音辨識 | zh_TW |
| dc.subject | Speech Recognition | en |
| dc.subject | Semi-supervised Learning | en |
| dc.title | 半監督式學習之端到端語音辨識及辨識合成串連框架 | zh_TW |
| dc.title | Semi-supervised End-to-end Speech Recognition and a Cascading Recognition-Synthesis Framework | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 108-2 | |
| dc.description.degree | Master's | |
| dc.contributor.oralexamcommittee | 李宏毅(Hung-yi Lee),鄭秋豫(Chiu-yu Tseng),簡仁宗(Jen-Tzung Chien),王小川(Hsiao-Chuan Wang) | |
| dc.subject.keyword | 語音辨識,半監督式學習, | zh_TW |
| dc.subject.keyword | Speech Recognition, Semi-supervised Learning | en |
| dc.relation.page | 87 | |
| dc.identifier.doi | 10.6342/NTU202001916 | |
| dc.rights.note | Paid authorization | |
| dc.date.accepted | 2020-07-29 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| Appears in Collections: | 資訊工程學系 (Department of Computer Science and Information Engineering) | |
Files in this item:
| File | Size | Format |
|---|---|---|
| U0001-2707202015442200.pdf (restricted access) | 10.07 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
