Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/1180

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李琳山 | |
| dc.contributor.author | Yu-Hsuan Wang | en |
| dc.contributor.author | 王育軒 | zh_TW |
| dc.date.accessioned | 2021-05-12T09:33:48Z | - |
| dc.date.available | 2019-07-26 | |
| dc.date.available | 2021-05-12T09:33:48Z | - |
| dc.date.copyright | 2018-07-26 | |
| dc.date.issued | 2018 | |
| dc.date.submitted | 2018-07-19 | |
| dc.identifier.citation | [1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
[2] Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee, “Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” in Interspeech 2016, 2016, pp. 765–769. [3] Zhizheng Wu and Simon King, “Investigating gated recurrent networks for speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5140–5144. [4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105. [5] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur, “Recurrent neural network based language model,” in Interspeech, 2010, vol. 2, p. 3. [6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649. [7] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. [8] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012. [9] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [10] Thomas Kemp and Alex Waibel, “Unsupervised training of a speech recognizer: recent experiments,” in Eurospeech, 1999. [11] Lori Lamel, Jean-Luc Gauvain, and Gilles Adda, “Lightly supervised and unsupervised acoustic model training,” Computer Speech & Language, vol. 16, no. 1, pp. 115–129, 2002.
[12] Alex S Park and James R Glass, “Unsupervised pattern discovery in speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186–197, 2008. [13] Gautam K Vallabha, James L McClelland, Ferran Pons, Janet F Werker, and Shigeaki Amano, “Unsupervised learning of vowel categories from infant-directed speech,” Proceedings of the National Academy of Sciences, vol. 104, no. 33, pp. 13273–13278, 2007. [14] Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, “Speech enhancement based on deep denoising autoencoder,” in Interspeech, 2013, pp. 436–440. [15] Andrej Karpathy, Justin Johnson, and Li Fei-Fei, “Visualizing and understanding recurrent networks,” arXiv preprint arXiv:1506.02078, 2015. [16] Alex Krizhevsky and Geoffrey E Hinton, “Using very deep autoencoders for content-based image retrieval,” in ESANN, 2011. [17] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky, “A hierarchical neural autoencoder for paragraphs and documents,” arXiv preprint arXiv:1506.01057, 2015. [18] Dong Yu and Michael L Seltzer, “Improved bottleneck features using pretrained deep neural networks,” in Interspeech, 2011, vol. 237, p. 240. [19] Frantisek Grézl, Martin Karafiát, Stanislav Kontár, and Jan Cernocky, “Probabilistic and bottle-neck features for LVCSR of meetings,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, 2007, vol. 4, pp. IV–757. [20] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010. [21] Steven Davis and Paul Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[22] Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel, “Extracting deep bottleneck features using stacked auto-encoders,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 3377–3381. [23] Matthew D Zeiler and Rob Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833. [24] Cheng-Tao Chung, Chun-an Chan, and Lin-shan Lee, “Unsupervised spoken term detection with spoken queries by multi-level acoustic patterns with varying model granularity,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 7814–7818. [25] Chia-ying Lee and James Glass, “A nonparametric bayesian approach to acoustic model discovery,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012, pp. 40–49. [26] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013. [27] Quoc Le and Tomas Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188–1196. [28] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh, “VQA: Visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433. [29] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever, “Exploiting similarities among languages for machine translation,” CoRR, vol. abs/1309.4168, 2013. [30] Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
[31] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013, pp. 1139–1147. [32] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [33] Jeffrey L Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990. [34] Yoshua Bengio, Patrice Simard, and Paolo Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994. [35] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014. [36] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112. [37] Ronald J Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992. [38] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, “Trust region policy optimization,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1889–1897. [39] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017. [40] Peter W Glynn, “Likelihood ratio gradient estimation for stochastic systems,” Communications of the ACM, vol. 33, no. 10, pp. 75–84, 1990. 
[41] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng, “Towards end-to-end reinforcement learning of dialogue agents for information access,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, vol. 1, pp. 484–495. [42] Barret Zoph and Quoc V Le, “Neural architecture search with reinforcement learning,” ICLR, 2017. [43] Okko Räsänen, “Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level,” in CogSci, 2014. [44] Herman Kamper, Aren Jansen, and Sharon Goldwater, “Unsupervised word segmentation and lexicon discovery using acoustic word embeddings,” IEEE Transactions on Audio, Speech and Language Processing, 2016. [45] Paul Michel, Okko Räsänen, Roland Thiollière, and Emmanuel Dupoux, “Improving phoneme segmentation with recurrent neural networks,” CoRR, vol. abs/1608.00508, 2016. [46] Yossi Adi, Joseph Keshet, Emily Cibelli, and Matthew Goldrick, “Sequence segmentation using joint rnn and structured prediction models,” arXiv preprint arXiv:1610.07918, 2016. [47] Dac-Thang Hoang and Hsiao-Chuan Wang, “Blind phone segmentation based on spectral change detection using legendre polynomial approximation,” The Journal of the Acoustical Society of America, vol. 137, no. 2, pp. 797–805, 2015. [48] Yu Qiao, Naoya Shimomura, and Nobuaki Minematsu, “Unsupervised optimal phoneme segmentation: Objectives, algorithm and comparisons,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 2008, pp. 3989–3992. [49] Chun-an Chan, Unsupervised Spoken Term Detection with Spoken Queries, Ph.D. thesis, National Taiwan University, 2012.
[50] David RH Miller, Michael Kleber, Chia-Lin Kao, Owen Kimball, Thomas Colthurst, Stephen A Lowe, Richard M Schwartz, and Herbert Gish, “Rapid and accurate spoken term detection,” in Eighth Annual Conference of the International Speech Communication Association, 2007. [51] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart, “Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1,” chapter Distributed Representations, pp. 77–109. MIT Press, Cambridge, MA, USA, 1986. [52] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, “A neural probabilistic language model,” Journal of machine learning research, vol. 3, no. Feb, pp. 1137–1155, 2003. [53] Zih-Wei Lin, “Personalized linguistic processing: Language modeling and understanding,” M.S. thesis, National Taiwan University, 2017. [54] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, “Linguistic regularities in continuous space word representations,” in HLT-NAACL, 2013, vol. 13, pp. 746–751. [55] Chia-Hao Shen, “Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” M.S. thesis, National Taiwan University, 2017. [56] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [57] Jennifer Fox Drexler, “Deep unsupervised learning from speech,” M.S. thesis, Massachusetts Institute of Technology, 2016. [58] Okko Johannes Räsänen, Unto Kalervo Laine, and Toomas Altosaar, “An improved speech segmentation quality measure: the r-value,” in Interspeech, 2009, pp. 1851–1854. [59] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever, “An empirical exploration of recurrent network architectures,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 2342–2350.
[60] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1764–1772. [61] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei, “End-to-end learning of action detection from frame glimpses in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2678–2687. [62] Yoshua Bengio, Nicholas Léonard, and Aaron Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013. [63] Yu-Hsuan Wang, Cheng-Tao Chung, and Hung-yi Lee, “Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries,” INTERSPEECH, 2017. [64] Tanja Schultz, “Globalphone: a multilingual speech and text database developed at karlsruhe university,” in INTERSPEECH, 2002. [65] Yaodong Zhang and James R Glass, “Unsupervised spoken keyword spotting via segmental dtw on gaussian posteriorgrams,” in Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on. IEEE, 2009, pp. 398–403. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/handle/123456789/1180 | - |
| dc.description.abstract | In natural language processing, Word2Vec represents a word as a real-valued vector of fixed dimensionality that carries semantic information: semantically similar words lie close to each other in the vector space, and the semantic relations among these vectors exhibit a translation (vector-offset) property. Audio Word2Vec, analogously, represents a spoken word (the audio signal of a word) as a fixed-dimensional real-valued vector that carries information about phonetic structure. Although previously proposed audio word vectors can be trained in an unsupervised framework, the training audio must be annotated with word boundaries in advance.
In this thesis, we raise audio word vectors from the spoken-word level to the utterance level. The proposed model jointly trains spoken-word segmentation and audio word vector learning, so that the two mutually reinforce each other. By introducing a segmentation gate into a sequence-to-sequence autoencoder, this thesis proposes the Segmental Sequence-to-Sequence Autoencoder (SSAE), which is trained with deep reinforcement learning. In this way, an utterance can be automatically segmented into a sequence of spoken words and then converted into a sequence of audio word vectors. Experiments on word segmentation and spoken term detection over four languages (English, Czech, French, and German) show that the proposed SSAE outperforms previous methods. In addition to the SSAE, this thesis also analyzes an internal signal of recurrent neural networks, the gate activation signal, and finds that under unsupervised learning this signal correlates strongly with boundaries of phonetic structure (such as phoneme boundaries) in the input audio, so it can be applied broadly to recurrent neural network models trained without supervision. | zh_TW |
| dc.description.provenance | Made available in DSpace on 2021-05-12T09:33:48Z (GMT). No. of bitstreams: 1 ntu-107-R04922167-1.pdf: 5334856 bytes, checksum: 3fed14842a786016d2676427bc9f9802 (MD5) Previous issue date: 2018 | en |
| dc.description.tableofcontents | Acknowledgements i
Chinese Abstract iv
Chapter 1: Introduction 2
1.1 Research Background and Motivation 2
1.2 Research Direction 5
1.3 Related Work 5
1.4 Contributions 7
1.5 Thesis Organization 8
Chapter 2: Background 9
2.1 Recurrent Neural Networks under Unsupervised Learning 9
2.1.1 Neural Networks 9
2.1.2 Neural Network Training 11
2.1.3 Recurrent Neural Networks 16
2.1.4 Autoencoders 17
2.2 Reinforcement Learning 19
2.2.1 Markov Decision Processes 19
2.2.2 Introduction to Reinforcement Learning 20
2.2.3 Policy-Based Reinforcement Learning 22
2.2.4 Reinforcement Learning with Recurrent Neural Networks 25
2.3 Audio Segmentation 27
2.4 Spoken Term Detection 31
2.5 Audio Word2Vec 32
2.5.1 Introduction to Word Vectors 32
2.5.2 Introduction to Audio Word Vectors and Their Applications 36
Chapter 3: Correlation Analysis between Gate Activation Signals and Phoneme Boundaries 42
3.1 Gated Recurrent Neural Networks 42
3.2 Gate Activation Signals and Phoneme Boundaries 44
3.2.1 Model Overview 45
3.2.2 Experimental Setup 46
3.3 Preliminary Experiments 47
3.3.1 Gate Activation Signals and Their Means 47
3.3.2 Experimental Results 47
3.3.3 Temporal Difference of the Mean Gate Activation Signal 49
3.4 Phoneme Segmentation Experiments 50
3.4.1 Applying Gate Activation Signals to Phoneme Segmentation 50
3.4.2 Recurrent Predictor Model 50
3.4.3 Recurrent Predictor Model with Gate Activation Signals 51
3.4.4 Performance Evaluation 52
3.4.5 Results across Different Gates 53
3.4.6 Results across Different Models 55
3.5 Summary 59
Chapter 4: Preliminary Study on Segmental Audio Word2Vec 61
4.1 Segmental Sequence-to-Sequence Autoencoder 61
4.1.1 Segmental Audio Word Vectors 61
4.1.2 Segmentation Gate 62
4.1.3 Reset Mechanism 62
4.1.4 Segmental Sequence-to-Sequence Training 63
4.2 Segmental Audio Word Vectors with End-to-End Training 63
4.2.1 End-to-End Training 63
4.2.2 Straight-Through Estimator 65
4.2.3 Loss Function Design 66
4.3 Experiments 67
4.3.1 Experimental Setup 67
4.3.2 Results and Discussion 67
4.4 Summary 71
Chapter 5: Segmental Audio Word2Vec Based on Reinforcement Learning 73
5.1 Training Segmental Audio Word Vectors with Reinforcement Learning 73
5.2 Rewards for Training Segmental Audio Word Vectors 76
5.2.1 Reward Design 76
5.2.2 Reward Baseline 78
5.3 Two-Stage Iterative Training 79
5.4 Applying Segmental Audio Word Vectors to Spoken Term Detection 82
5.5 Experiments 84
5.5.1 Experimental Setup 84
5.5.2 Preliminary Experiments 85
5.5.3 Word Segmentation Experiments 88
5.5.4 Spoken Term Detection Experiments 101
5.6 Summary 105
Chapter 6: Conclusion and Future Work 107
6.1 Main Contributions 107
6.2 Future Directions 108
References 109 | |
| dc.language.iso | zh-TW | |
| dc.subject | 語音詞向量 | zh_TW |
| dc.subject | 非督導式學習 | zh_TW |
| dc.subject | 口述語彙偵測 | zh_TW |
| dc.subject | Audio Word2Vec | en |
| dc.subject | Unsupervised Learning | en |
| dc.subject | Spoken Term Detection | en |
| dc.title | 分段式語音詞向量:將語句信號自動表示為語音詞向量序列 | zh_TW |
| dc.title | Segmental Audio Word2Vec: Representing Utterances as Sequences of Audio Word Vectors | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 106-2 | |
| dc.description.degree | Master | |
| dc.contributor.oralexamcommittee | 林智仁,陳信希,李宏毅 | |
| dc.subject.keyword | 非督導式學習,口述語彙偵測,語音詞向量 | zh_TW |
| dc.subject.keyword | Unsupervised Learning, Spoken Term Detection, Audio Word2Vec | en |
| dc.relation.page | 119 | |
| dc.identifier.doi | 10.6342/NTU201801733 | |
| dc.rights.note | Authorized for release (open access worldwide) | |
| dc.date.accepted | 2018-07-20 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
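The segmentation-gate idea described in the abstract — a per-frame gate decides where an utterance is cut into spoken words, and each segment becomes one audio word vector — can be illustrated with a minimal sketch. Everything below is an illustration, not the thesis's implementation: mean-pooling stands in for the SSAE's RNN encoder, the gate values are hand-made rather than learned, and the function name and 0.5 threshold are assumptions.

```python
def segment_and_embed(features, gate, threshold=0.5):
    """Cut `features` (a list of frame vectors) wherever `gate`
    crosses `threshold`, then mean-pool each segment into one vector.
    Mean-pooling is a stand-in for an RNN segment encoder."""
    boundaries = [t for t in range(1, len(features)) if gate[t] >= threshold]
    starts = [0] + boundaries
    ends = boundaries + [len(features)]
    segments = [features[s:e] for s, e in zip(starts, ends)]

    def mean_pool(segment):
        dim = len(segment[0])
        return [sum(frame[d] for frame in segment) / len(segment)
                for d in range(dim)]

    return segments, [mean_pool(seg) for seg in segments]

# A toy "utterance": 10 frames of 4-dimensional acoustic features,
# with a hand-made gate that fires at frames 3 and 6.
feats = [[float(4 * t + d) for d in range(4)] for t in range(10)]
gate = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
segs, embs = segment_and_embed(feats, gate)
# Two gate firings split the utterance into three segments,
# giving one fixed-dimensional vector per segment.
```

In the thesis the gate is learned jointly with the encoder (via reinforcement learning), whereas here it is fixed; the sketch only shows how a gate signal turns an utterance into a sequence of segment vectors.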
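The gate-activation analysis summarized in the abstract (Chapter 3 in the table of contents) reports that the mean gate activation signal of a gated RNN correlates with phoneme boundaries in the input audio. A minimal sketch of turning that observation into an unsupervised boundary detector might look as follows; the jump threshold `delta` and the synthetic signal are assumptions for illustration, not values from the thesis.

```python
def gate_boundaries(gate_means, delta=0.2):
    """Return frame indices where the mean gate activation signal
    jumps by at least `delta` over the previous frame; such jumps
    are taken as candidate phoneme-boundary positions."""
    return [t for t in range(1, len(gate_means))
            if gate_means[t] - gate_means[t - 1] >= delta]

# Synthetic mean update-gate signal with two sharp rises.
signal = [0.10, 0.10, 0.15, 0.60, 0.55, 0.50, 0.52, 0.90, 0.85, 0.80]
# gate_boundaries(signal) -> [3, 7]
```

No labels are needed: the gate means come from a network trained without supervision, which is what makes the signal attractive as a boundary cue.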
| Appears in Collections: | Department of Computer Science and Information Engineering | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-107-1.pdf | 5.21 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
