Robust Speech Recognition with Two-dimensional Frame-and-feature Weighting and Modulation Spectrum Normalization

Yang Chang; 張暘

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65632

Title:	Robust Speech Recognition with Two-dimensional Frame-and-feature Weighting and Modulation Spectrum Normalization
Authors:	Yang Chang 張暘
Advisor:	Lin-shan Lee (李琳山)
Keyword:	speech processing, speech recognition, robustness
Publication Year:	2012
Degree:	Master's
Abstract:	This thesis first gives an overview of algorithms for robust speech recognition, such as Cepstral Mean Subtraction (CMS), Cepstral Mean and Variance Normalization (CMVN), Histogram Equalization (HEQ), and Higher Order Cepstral Moment Normalization (HOCMN). It also introduces Aurora 2 and Aurora 4, the international standard corpora in the field of robust speech recognition, and reports baseline experimental results on both. Entropy-based feature weighting and confusion-based feature weighting consider the discriminating ability of the individual feature parameters and assign them different weights during recognition, emphasizing the parameters that discriminate better. The confusion-based method additionally uses a confusion matrix to account for the errors that are likely to occur between phonemes. These methods can be applied directly to Mel-frequency cepstral features and can also be combined with many existing robustness techniques. SVM-based frame weighting uses a Support Vector Machine (SVM) as the classifier and, based on the energy distribution and harmonicity estimation of each frame, divides the test data into reliable frames and unreliable frames. Because recognition should rely more on the reliable frames, the SVM score is used to give reliable frames larger weights and unreliable frames smaller weights. Finally, confusion-based feature weighting and SVM-based frame weighting are combined into two-dimensional frame-and-feature weighted Viterbi decoding, which assigns different weights to different feature parameters and to different frames. This approach combines the advantages of the two methods and yields a further improvement.
In this paper we propose a new approach of two-dimensional frame-and-feature weighted Viterbi decoding performed at the recognizer back-end for robust speech recognition. The frame weighting is based on a Support Vector Machine (SVM) classifier considering the energy distribution and cross-correlation spectrum of the frame. The basic idea is that voiced frames with higher harmonicity are in general more reliable than other frames in noisy speech and therefore should be weighted higher. The feature weighting is based on an entropy measure considering confusion between phoneme classes. The basic idea is that the scores obtained with more discriminating features, which cause less confusion between phonemes, should be weighted higher. These two weighting schemes on the two dimensions, frames and features, are then integrated in Viterbi decoding. Significant improvements were achieved in extensive experiments performed with the Aurora 4 testing environment for all types of noise and all SNR values.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65632
Fulltext Rights:	Paid license
Appears in Collections:	Graduate Institute of Communication Engineering
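
The two-dimensional weighting described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the exact formulation, so the sketch assumes single diagonal-covariance Gaussian emissions per state, assumes that each feature weight multiplies that dimension's log-likelihood, and assumes that each frame weight scales the whole emission score of its frame. The function name `weighted_viterbi` and all parameter names are hypothetical. In the thesis the frame weights come from an SVM score and the feature weights from an entropy measure over phoneme confusions; here both are simply passed in as arrays.

```python
import numpy as np

def weighted_viterbi(obs, means, variances, log_trans, log_init,
                     frame_w, feat_w):
    """Viterbi decoding with per-frame and per-feature weights (illustrative).

    obs:        (T, D) feature vectors
    means:      (S, D) state means (assumed diagonal Gaussians)
    variances:  (S, D) state variances
    log_trans:  (S, S) log transition probabilities, [previous, current]
    log_init:   (S,)   log initial state probabilities
    frame_w:    (T,)   frame weights, larger for more reliable frames
    feat_w:     (D,)   feature weights, larger for more discriminating features
    """
    T = obs.shape[0]
    S = means.shape[0]

    # Per-dimension Gaussian log-likelihoods, shape (T, S, D).
    diff = obs[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])

    # Weight along the feature axis, sum over features, then weight each frame.
    emis = frame_w[:, None] * (log_lik * feat_w[None, None, :]).sum(axis=2)

    delta = log_init + emis[0]              # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # rows: previous, cols: current
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emis[t]

    # Trace back the best state sequence.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())
```

With all weights set to 1 this reduces to ordinary Viterbi decoding. Setting a feature weight to 0 removes that dimension from the score, and setting a frame weight to 0 makes the decoder rely only on the transition probabilities at that frame.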

Files in This Item:

File	Size	Format
ntu-101-1.pdf Restricted Access	9.88 MB	Adobe PDF
