Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48146

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 李琳山 | |
| dc.contributor.author | Liang-Che Sun | en |
| dc.contributor.author | 孫良哲 | zh_TW |
| dc.date.accessioned | 2021-06-15T06:47:19Z | - |
| dc.date.available | 2013-07-06 | |
| dc.date.copyright | 2011-07-06 | |
| dc.date.issued | 2011 | |
| dc.date.submitted | 2011-06-07 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48146 | - |
| dc.description.abstract | 在強健性語音辨識的領域中,時域濾波器(Temporal Filter)是一個常見且相當有效的技術。在過去已經發展很成熟的著名技術包括相對頻譜濾波器(Relative Spectra, RASTA),以及基於主成分分析(Principal Component Analysis, PCA)和線性鑑別分析(Linear Discriminant Analysis, LDA)所設計的資料導向(Data-driven)時域濾波器。這些技術主要是針對語音參數的時間序列(Time Trajectories)或是調變頻譜(Modulation Spectrum)設計濾波器,進而使得語音信號中的雜訊能得到有效的抑制;然而這些傳統技術的缺點在於不能隨著外在雜訊環境的不同來作調整,因而難以在所有的雜訊環境下都有很好的表現。
本論文所提出的調變頻譜等化法 (Modulation Spectrum Equalization) 則可以視為一種可適性的時域濾波器,亦即我們可以對在不同雜訊環境下錄音的語句得到不同的濾波器頻率響應,因而能夠有效改善在各種不同雜訊環境下的辨識結果。在這些技術當中,我們首先藉由傅利葉轉換將語音參數的時間序列轉換至調變頻譜,而我們所提出的技術均是直接利用信號在調變頻譜上的分佈情形來做設計。在頻譜分佈等化法 (Spectral Histogram Equalization, SHE) 中,我們先利用乾淨的訓練語料,統計它們的調變頻譜機率分佈作為參考分佈,接著將測試語句的調變頻譜機率分佈等化至此參考分佈。而在雙頻帶頻譜分佈等化法(Two-band Spectral Histogram Equalization, 2B-SHE)中,我們利用調變頻譜上低頻和高頻通常帶有不同的語音資訊這項特色,將測試語句中的低頻和高頻部分,分別等化至不同的參考分佈,進而得到比頻譜分佈等化法更佳的辨識結果。而在量值比例等化法(Magnitude Ratio Equalization, MRE)中,我們則將測試語句在調變頻譜上的量值比例等化至由乾淨語料所計算出的量值比例參考值。我們在英文連續數字語料 (Aurora 2) 和英文連續大字彙語料 (Aurora 4) 上的實驗發現,我們所提出的技術相較於傳統的時域濾波器技術在辨識率上有明顯的提昇,而且我們所提出的技術也可以和一些知名的倒頻譜正規化法作有效的結合以進一步提昇辨識正確率。而除了在辨識率上的呈現外,我們也從許多不同的面向來探討辨識率進步的原因,包含這些技術所求出的濾波器頻率響應、雜訊在調變頻譜上的行為、不同音素的辨識率,以及調變頻譜的距離等。 | zh_TW |
| dc.description.abstract | We propose novel approaches for equalizing the modulation spectrum for robust feature extraction in speech recognition. In these cases the temporal trajectories of the
feature parameters are first transformed into the magnitude modulation spectrum. In spectral histogram equalization (SHE) and two-band spectral histogram equalization (2B-SHE), we simply equalize the histogram of the modulation spectrum for each utterance to a reference histogram obtained from clean training data, or perform this equalization with two sub-bands on the modulation spectrum. In magnitude ratio equalization (MRE), we define the magnitude ratio of lower to higher modulation frequency components for each utterance, and equalize this to a reference value obtained from clean training data. These approaches can be viewed as temporal filters that are adapted to each testing utterance. Experiments performed on the Aurora 2 and 4 corpora for small and large vocabulary tasks indicate that significant performance improvements are achievable for all noise conditions (additive or convolutional, different noise types, and different SNR values). We also show that additional improvements are obtainable when these approaches are integrated with cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), or higher-order cepstral moment normalization (HOCMN). We analyze and discuss reasons why such improvements are achievable from different viewpoints with different sets of data, including adaptive temporal filtering, noise behavior on the modulation spectrum, phoneme types, and modulation spectrum distance. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-15T06:47:19Z (GMT). No. of bitstreams: 1 ntu-100-F90942011-1.pdf: 2084000 bytes, checksum: 09c1091f244faf7a3fdb943d4b7bcc33 (MD5) Previous issue date: 2011 | en |
| dc.description.tableofcontents | Chinese Abstract i
English Abstract iii
Contents v
List of Figures ix
List of Tables xi
1. Introduction 1
1.1. Background of Robust Speech Recognition 1
1.2. Primary Achievements of this Dissertation 4
1.3. Chapter Outline 6
2. Background Review and Experimental Setup 7
2.1. Introduction 7
2.2. Feature Normalization Approaches 7
2.2.1. Cepstral Mean and Variance Normalization (CMVN) 8
2.2.2. Histogram Equalization (HEQ) 8
2.2.3. Higher-Order Cepstral Moment Normalization (HOCMN) 10
2.3. Temporal Filtering Approaches 12
2.3.1. Relative Spectra (RASTA) 12
2.3.2. PCA/LDA-derived Temporal Filtering 14
2.4. Experimental Setup 18
2.4.1. Aurora 2 Corpus 19
2.4.2. Aurora 4 Corpus 22
2.4.3. Feature Extraction 25
2.4.4. Noise Analysis 26
2.4.5. Baseline Results 29
2.4.6. Significance Testing 36
2.5. Summary 37
3. Modulation Spectrum Equalization Techniques 39
3.1. Introduction 39
3.2. Notation Definition and the Modulation Spectrum 39
3.3. Spectral Histogram Equalization (SHE) 42
3.4. Two-band Spectral Histogram Equalization (2B-SHE) 44
3.5. Magnitude Ratio Equalization (MRE) 45
3.6. The Overall Framework of the Proposed Approach 48
3.7. Time and Frequency Domain Behavior Analysis 50
3.8. Analytical Interpretation 54
3.9. Discussion of Related Works 58
3.10. Summary 60
4. Experimental Results on Aurora 2 Task 63
4.1. Introduction 63
4.2. Directly Applying Modulation Spectrum Equalization to MFCCs 63
4.3. Modulation Spectrum Equalization with CMVN 65
4.4. Considerations for Short Utterances 67
4.5. Performance Analysis of Different Cut-off Frequency fc and Weighted Power p in 2B-SHE and MRE with CMVN 70
4.6. Noise Type and SNR Analysis 76
4.7. 2B-SHE and MRE with Other Feature Normalization Techniques 79
4.8. Performance for Proposed Approaches with Multi-condition Training 81
4.9. Summary 82
5. Experimental Results on Aurora 4 Task 83
5.1. Introduction 83
5.2. Proposed Approaches with CMVN, HEQ, and HOCMN 83
5.3. Noise Type Analysis 86
5.4. Discussion: Small and Large Vocabulary Tasks and Noise Conditions 88
5.5. Phoneme Type Analysis 89
5.6. Modulation Spectrum Distance Analysis 91
5.7. Summary 94
6. Conclusion 95
6.1. Concluding Remarks 95
6.2. Future Work 96
Bibliography 99 | |
| dc.language.iso | en | |
| dc.subject | 分佈等化法 | zh_TW |
| dc.subject | 語音參數正規化 | zh_TW |
| dc.subject | 調變頻譜 | zh_TW |
| dc.subject | 強健性語音參數抽取 | zh_TW |
| dc.subject | 時域濾波器 | zh_TW |
| dc.subject | feature normalization | en |
| dc.subject | histogram equalization | en |
| dc.subject | temporal filter | en |
| dc.subject | robust feature extraction | en |
| dc.subject | modulation spectrum | en |
| dc.title | 基於調變頻譜等化法之強健性語音辨識技術 | zh_TW |
| dc.title | Modulation Spectrum Equalization for Improved Robust Speech Recognition | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 99-2 | |
| dc.description.degree | 博士 | |
| dc.contributor.oralexamcommittee | 洪一平,吳家麟,陳銘憲,林宗男,廖婉君,鄭士康 | |
| dc.subject.keyword | 語音參數正規化,調變頻譜,強健性語音參數抽取,時域濾波器,分佈等化法 | zh_TW |
| dc.subject.keyword | feature normalization,modulation spectrum,robust feature extraction,temporal filter,histogram equalization | en |
| dc.relation.page | 106 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2011-06-07 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 電信工程學研究所 | zh_TW |
| Appears in Collections: | 電信工程學研究所 | |
Files in this item:

| File | Size | Format |
|---|---|---|
| ntu-100-1.pdf (restricted access) | 2.04 MB | Adobe PDF |

All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
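The equalization procedure described in the English abstract (transform a feature trajectory to its magnitude modulation spectrum, equalize the magnitude statistics to references drawn from clean training data, then invert with the original phase) can be sketched as follows. This is a minimal illustrative sketch under stated assumptions, not the dissertation's implementation: the rank-based quantile mapping, the name `equalize_trajectory`, and the synthetic random "clean" reference statistics are all assumptions for the demo.

```python
# SHE-style modulation spectrum equalization of one feature trajectory:
# (1) FFT the trajectory to its modulation spectrum, (2) map its magnitude
# distribution onto a reference distribution (order-statistic mapping),
# (3) inverse-FFT using the original phase.
import numpy as np

def equalize_trajectory(traj, reference_magnitudes):
    """Equalize the magnitude distribution of traj's modulation spectrum
    to that of reference_magnitudes (same length as the half-spectrum)."""
    spectrum = np.fft.rfft(traj)
    mag, phase = np.abs(spectrum), np.angle(spectrum)
    # Quantile mapping: the k-th smallest utterance magnitude is replaced
    # by the k-th smallest reference magnitude.
    order = np.argsort(mag)
    new_mag = np.empty_like(mag)
    new_mag[order] = np.sort(reference_magnitudes)[:len(mag)]
    # Keep the original phase so the temporal structure is preserved.
    return np.fft.irfft(new_mag * np.exp(1j * phase), n=len(traj))

rng = np.random.default_rng(0)
noisy = rng.normal(size=128)  # stand-in for one noisy cepstral trajectory
# Stand-in "clean" reference magnitudes (in practice: clean-corpus statistics).
reference = np.abs(np.fft.rfft(rng.normal(size=128)))
cleaned = equalize_trajectory(noisy, reference)
```

In a full system this mapping would be applied per cepstral coefficient and per utterance, with the reference statistics estimated once from the clean training set; 2B-SHE would split the mapping into low- and high-frequency modulation bands before equalizing each against its own reference.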
