  1. NTU Theses and Dissertations Repository
  2. College of Electrical Engineering and Computer Science
  3. Department of Electrical Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/16104

Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 鄭士康 (Shyh-Kang Jeng)
dc.contributor.author: Yu-Chin Shih (en)
dc.contributor.author: 施羽芩 (zh_TW)
dc.date.accessioned: 2021-06-07T18:01:10Z
dc.date.copyright: 2012-08-10
dc.date.issued: 2012
dc.date.submitted: 2012-08-06
dc.identifier.citation:
[1] M. P. Lewis, Ethnologue: Languages of the World, 16th ed. Dallas, Tex.: SIL International, 2009.
[2] M. A. Zissman and K. M. Berkling, 'Automatic language identification,' Speech Communication, vol. 35, pp. 115-124, 2001.
[3] B. Ma, C. Guan, H. Li, and C.-H. Lee, 'Multilingual speech recognition with language identification,' Proc. ICSLP, pp. 505-508, 2002.
[4] V. W. Zue and J. R. Glass, 'Conversational interfaces: advances and challenges,' Proc. IEEE, vol. 88, pp. 1166-1180, 2000.
[5] A. Waibel, P. Geutner, L. M. Tomokiyo, T. Schultz, and M. Woszczyna, 'Multilinguality in speech and spoken language systems,' Proc. IEEE, vol. 88, pp. 1297-1313, 2000.
[6] P. Dai, U. Iurgel, and G. Rigoll, 'A novel feature combination approach for spoken document classification with support vector machines,' Proc. Multimedia Information Retrieval Workshop, 2003.
[7] E. Ambikairajah, H. Li, L. Wang, B. Yin, and V. Sethu, 'Language identification: a tutorial,' IEEE Circuits and Systems Magazine, vol. 11, pp. 82-108, 2011.
[8] H. Li, B. Ma, and C.-H. Lee, 'A vector space modeling approach to spoken language identification,' IEEE Trans. Audio, Speech, and Language Process., vol. 15, pp. 271-284, 2007.
[9] J. B. Allen, 'How do humans process and recognize speech?,' IEEE Trans. Speech and Audio Proc., vol. 2, 1994.
[10] Y. Muthusamy, K. Berkling, T. Arai, R. Cole, and E. Barnard, 'A comparison of approaches to automatic language identification using telephone speech,' Proc. Eurospeech, vol. 2, pp. 1307-1310, 1993.
[11] T. J. Hazen and V. W. Zue, 'Segment-based automatic language identification,' J. Acoust. Soc. Amer., vol. 101, pp. 2323-2331, 1997.
[12] M. A. Zissman and E. Singer, 'Automatic language identification of telephone speech messages using phoneme recognition and N-gram modeling,' Proc. ICASSP, vol. 1, pp. 305-308, 1994.
[13] M. Penagarikano, A. Varona, L. J. Rodriguez-Fuentes, and G. Bordel, 'Improved modeling of cross-decoder phone co-occurrences in SVM-based phonotactic language recognition,' IEEE Trans. Audio, Speech, and Language Process., vol. 19, pp. 2348-2363, 2011.
[14] J. L. Gauvain, A. Messaoudi, and H. Schwenk, 'Language recognition using phone lattices,' Proc. ICSLP, pp. 1215-1218, 2004.
[15] J. Hamm and D. D. Lee, 'Grassmann discriminant analysis: a unifying view on subspace-based learning,' Proc. ICML, pp. 376-383, 2008.
[16] I. J. Good, 'The population frequencies of species and the estimation of population parameters,' Biometrika, vol. 40, pp. 237-264, 1953.
[17] L. F. Lamel and J.-L. Gauvain, 'Cross-lingual experiments with phone recognition,' Proc. ICASSP, vol. 2, pp. 507-510, 1993.
[18] M. A. Zissman, 'Comparison of four approaches to automatic language identification of telephone speech,' IEEE Trans. Speech and Audio Proc., vol. 4, pp. 31-44, 1996.
[19] T. J. Hazen and V. W. Zue, 'Automatic language identification using a segment-based approach,' Proc. Eurospeech, vol. 2, pp. 1303-1306, 1993.
[20] T. J. Hazen and V. W. Zue, 'Recent improvements in an approach to segment-based automatic language identification,' Proc. ICASSP, vol. 4, pp. 1883-1886, 1994.
[21] F. S. Richardson and W. M. Campbell, 'Language recognition with discriminative keyword selection,' Proc. ICASSP, pp. 4145-4148, 2008.
[22] T. Mikolov, O. Plchot, O. Glembek, P. Matejka, L. Burget, and J. Cernocky, 'PCA-based feature extraction for phonotactic language recognition,' Proc. Odyssey, pp. 251-255, 2010.
[23] M. Soufifar, M. Kockmann, L. Burget, O. Plchot, O. Glembek, and T. Svendsen, 'iVector approach to phonotactic language recognition,' Proc. Interspeech, pp. 2913-2916, 2011.
[24] E. Oja and T. Kohonen, 'The subspace learning algorithm as a formalism for pattern recognition and neural networks,' Proc. IEEE Int. Conf. on Neural Networks, vol. 1, pp. 277-284, 1988.
[25] A. Bjoerck and G. H. Golub, 'Numerical methods for computing angles between linear subspaces,' Math. Computation, vol. 27, pp. 579-594, 1973.
[26] I. J. Good, 'Some applications of the singular decomposition of a matrix,' Technometrics, vol. 11, pp. 823-831, 1969.
[27] S.-W. Kim and R. P. W. Duin, 'An empirical comparison of kernel-based and dissimilarity-based feature spaces,' Proc. SSPR&SPR, pp. 559-568, 2010.
[28] J. Hamm, 'Subspace-Based Learning with Grassmann Kernels,' Ph.D. Dissertation, 2008.
[29] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman and Hall, 1986.
[30] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK book (for HTK version 3.4), 2006.
[31] C.-H. Lee, F. K. Soong, and B.-H. Juang, 'A segment model based approach to speech recognition,' Proc. ICASSP, vol. 1, pp. 501-541, 1988.
[32] B. Hayes and C. Wilson, 'A maximum entropy model of phonotactics and phonotactic learning,' Linguistic Inquiry, vol. 39, pp. 379-440, 2008.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/16104
dc.description.abstract (zh_TW): This thesis proposes a novel subspace-based method for phonotactic automatic language recognition. The approach consists of two main parts: a feature representation for speech signals and subspace-based learning algorithms. The former exploits the contextual relationships and constraints among phones in a speech signal: through decoding by an automatic speech recognizer, likelihood computation for each phone in the decoded phone sequence, and feature concatenation, phone frames rich in phonetic information are extracted. The extracted phone frames are assumed to lie in a low-dimensional feature subspace in which the structure of each utterance is almost completely preserved, so each utterance can further be represented as a fixed-dimensional subspace. The latter measures the similarity or distance between two utterances (subspaces) with non-Euclidean metrics, performs feature processing with distance-based or kernel-based discriminant analysis, and finally classifies with a back-end classifier such as k-nearest neighbors. Experiments on the OGI-TS and NIST LRE 2005 corpora show that the proposed method outperforms vector space modeling based methods in equal error rate.
dc.description.abstract (en): This thesis presents a novel subspace-based approach to phonotactic language recognition. The framework is divided into two parts: speech feature representation and subspace-based learning algorithms. First, the phonetic information and the contextual relationships possessed by spoken utterances are retrieved more abundantly through likelihood computation and feature concatenation during decoding by an automatic speech recognizer. The extracted phone frames are assumed to reside in a lower-dimensional eigen-subspace in which the structure of the data is approximately captured, so each utterance can be further represented by a fixed-dimensional linear subspace. Second, to measure the similarity between two utterances, suitable non-Euclidean metrics are explored and applied to linear discriminant analysis in two mechanisms, distance-based and kernel-based learning, followed by a back-end classifier such as the k-nearest neighbor (KNN) classifier. The results of experiments on the OGI-TS and NIST LRE 2005 databases demonstrate that the proposed framework outperforms the well-known vector space modeling based method in equal error rate (EER).
dc.description.provenance: Made available in DSpace on 2021-06-07T18:01:10Z (GMT). No. of bitstreams: 1. ntu-101-R99921045-1.pdf: 2581385 bytes, checksum: 52835548c34d92667ee399bcebc11d47 (MD5). Previous issue date: 2012 (en)
dc.description.tableofcontents:
口試委員會審定書 (Oral Defense Committee Certification) i
誌謝 (Acknowledgements) iii
摘要 (Chinese Abstract) v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
1.1 Motivations 1
1.2 Background Knowledge 3
1.2.1 Verification and Identification 4
1.2.2 Evaluation Measures 5
1.2.3 Statistical Spoken Language Recognition Systems 8
1.2.4 Five Levels of Information 10
1.2.5 Why Phonotactic Information? 12
1.3 Literature Survey and Contributions 13
1.4 Organization of the Thesis 16
Chapter 2 Previous Work 17
2.1 Language Modeling Based Methods 17
2.1.1 Parallel Phone Recognition (PPR) 18
2.1.2 Phone Recognition Followed by Language Modeling (PRLM) 19
2.1.3 Parallel PRLM (PPRLM) 21
2.2 Vector Space Modeling (VSM) Based Methods 23
2.3 Cross-Decoder Phone Co-occurrences 25
2.4 iVectors 25
Chapter 3 Proposed Method 27
3.1 Data Representation in Subspace 27
3.1.1 Phonotactic Feature Extraction 27
3.1.2 Frame Concatenation 29
3.1.3 Subspace Generation 30
3.2 Subspace Learning 33
3.2.1 The Grassmann Metric and Kernel 33
3.2.2 Dissimilarity-based Learning Scheme 37
3.2.3 Kernel-based Learning Scheme 38
Chapter 4 Experiments 41
4.1 Experimental Setup 41
4.1.1 Corpora 41
4.1.2 Feature Extraction 44
4.1.3 Universal Phone Set (UPS) 45
4.1.4 Universal Phone Recognizer (UPR) 47
4.2 Experiments on ASLR 48
4.2.1 The Baseline 48
4.2.2 The Subspace-based Method: Eigenphones 51
4.2.3 The Subspace-based Method: Contextual Windows 54
4.2.4 Comparisons 55
Chapter 5 Conclusions 59
Reference 61
The Author’s Publication 65
dc.language.iso: en
dc.subject: 基於子空間學習法 (subspace-based learning) (zh_TW)
dc.subject: 語言辨識 (language recognition) (zh_TW)
dc.subject: language recognition (en)
dc.subject: subspace-based learning (en)
dc.title: 基於子空間之口說語言辨識 (Subspace-based Spoken Language Recognition) (zh_TW)
dc.title: Subspace-based Spoken Language Recognition (en)
dc.type: Thesis
dc.date.schoolyear: 100-2
dc.description.degree: 碩士 (Master)
dc.contributor.coadvisor: 王新民 (Hsin-Min Wang)
dc.contributor.oralexamcommittee: 蔡偉和 (Wei-Ho Tsai)
dc.subject.keyword: 語言辨識, 基於子空間學習法 (language recognition, subspace-based learning) (zh_TW)
dc.subject.keyword: language recognition, subspace-based learning (en)
dc.relation.page: 65
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2012-08-06
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 電機工程學研究所 (Graduate Institute of Electrical Engineering) (zh_TW)
Appears in Collections: Department of Electrical Engineering

Files in This Item:
File: ntu-101-1.pdf (restricted access)
Size: 2.52 MB
Format: Adobe PDF