Unsupervised Spoken Term Detection with Spoken Queries (以口語查詢之非督導式口語詞彙偵測)

Chun-an Chan; 詹竣安

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/64129

Title:	Unsupervised Spoken Term Detection with Spoken Queries (以口語查詢之非督導式口語詞彙偵測)
Author:	Chun-an Chan (詹竣安)
Advisor:	Lin-shan Lee (李琳山)
Keywords:	spoken term detection, information retrieval
Year of publication:	2012
Degree:	Doctoral
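Abstract:	As multimedia and network technologies mature, the number of spoken documents carrying knowledge and information grows rapidly every day. From meetings, courses, and lectures to the wide variety of videos on the Internet, much information exists in audio-visual form; finding the audio is then enough to find the useful audio-visual content. The key technology in spoken document retrieval is spoken term detection, whose goal is to find segments of the spoken documents that exactly match the query term entered by the user. Query terms are usually given as text, so searching utterances with text queries inevitably relies on automatic speech recognition. The recent rise of smartphones has opened new possibilities for spoken document search: because typing on a phone is less convenient than on a computer, the spoken query has become a new and common input form, prompting much research on unsupervised spoken term detection. In this setting the spoken query can be matched against the spoken documents directly at the signal level, without necessarily performing speech recognition. Such methods are no longer limited by the many problems of speech recognition and need no manually annotated training data; searching for similar terms remains feasible even when no speech recognition system exists, or when the language has no written form. In this dissertation, we propose two classes of unsupervised spoken term detection methods: methods based on dynamic time warping (DTW) and model-based methods. We identify two main problems in conventional segmental DTW: it cannot handle large speaking rate distortion, and its computational cost is too high. We propose slope-constrained DTW to address the speaking rate problem, then represent the speech signal with acoustic segments instead of speech frames and perform DTW over those segments (segment-based DTW), which greatly reduces the required computation. With a further two-stage spoken term detection scheme, we achieve better detection performance than conventional methods in very little time. Applying pseudo relevance feedback (PRF) improves detection accuracy further. We also propose a procedure for generating acoustic segment models (ASMs), which describe the distribution in acoustic space of recurring speech patterns. Using the ASMs, we convert documents into model sequences and then use the spoken query to find similar model-sequence segments; this both describes the documents through higher-level linguistic structure and substantially reduces search time. Likewise, within the PRF framework we design a pseudo likelihood ratio test to verify whether the collected spoken term candidates are correct. The performance of these methods opens a new direction for unsupervised speech search, namely approaches based on hidden Markov models, so that many techniques that have matured in speech recognition may be applied in this area in the future. Finally, we evaluate systems that integrate the DTW-based and model-based methods within the PRF framework; experiments show that Mean Average Precision improves by 14.2% using 23% of the time.
Unsupervised spoken term detection (STD) with spoken queries is a new and important topic in multimedia retrieval. Unsupervised approaches, which need no annotated data, bypass various problems in speech recognition, particularly recognition errors under different acoustic and linguistic conditions. Such approaches even make searching for spoken terms possible in low-resource languages and in languages without a writing system. In this dissertation, we propose several techniques to address unsupervised STD with spoken queries.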
We propose two improved DTW-based approaches to handle the speaking rate distortion and computational efficiency issues of the conventional segmental DTW approach. The Slope-Constrained Dynamic Time Warping (SC-DTW) approach is developed to handle the speaking rate distortion problem. The segment-based DTW approach is devised to reduce the computational burden. The cascade of these two approaches, together with the Weighted Pseudo Similarity of SC-DTW approach in the Pseudo Relevance Feedback (PRF) framework, shows significant improvement in both detection performance and efficiency. We also propose two model-based approaches for unsupervised STD. We design procedures to construct a set of Acoustic Segment Models (ASMs) that describe the patterns and structures of the target language. In this way, signal trajectory modeling techniques can be leveraged through the ASMs. Using the ASMs, we propose the Document State Matching (DSM) approach to match spoken queries to the ASM states in the documents. The Duration-Constrained Viterbi algorithm is developed as part of the DSM approach. In addition, a Pseudo Likelihood Ratio approach is proposed to verify the hypotheses in the PRF framework. Experimental results show that the model-based approaches achieve comparable detection performance in much less computation time. Our migration from DTW-based approaches to model-based approaches creates the possibility of leveraging well-developed model-based speech processing techniques in unsupervised STD. Finally, we tested various configurations for integrating the approaches in our system. With the combined model-based and DTW-based approaches, an absolute Mean Average Precision improvement of 14.2% was achieved using only 23% of the CPU time on the Mandarin broadcast news corpus.
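The idea of constraining the warping slope can be illustrated with a minimal sketch. It is not the dissertation's SC-DTW: it assumes a textbook local step pattern (diagonal, two-by-one, one-by-two) that keeps the local slope between 1/2 and 2, uses unweighted Euclidean frame distances, and lets the match start and end anywhere in the document. The names `frame_distance` and `slope_constrained_dtw` are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def slope_constrained_dtw(query, document):
    """Find the best match of `query` inside `document`.

    Both arguments are sequences of feature vectors. Returns
    (best_cost, end_index), where end_index is the document frame
    at which the best-matching segment ends.

    Allowed moves (query frames, document frames): (1, 1), (2, 1), (1, 2),
    so the path can never run purely horizontally or vertically.
    """
    n, m = len(query), len(document)
    inf = math.inf
    dist = [[frame_distance(q, x) for x in document] for q in query]
    acc = [[inf] * m for _ in range(n)]

    # Free start: the first query frame may align with any document frame.
    for j in range(m):
        acc[0][j] = dist[0][j]

    for i in range(1, n):
        for j in range(1, m):
            best = acc[i - 1][j - 1]  # diagonal move
            if i >= 2:
                # Two query frames consumed while the document advances one.
                best = min(best, acc[i - 2][j - 1] + dist[i - 1][j])
            if j >= 2:
                # Two document frames consumed while the query advances one.
                best = min(best, acc[i - 1][j - 2] + dist[i][j - 1])
            acc[i][j] = best + dist[i][j]

    # Free end: pick the document frame with the lowest accumulated cost.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end], end


if __name__ == "__main__":
    query = [[0.0], [1.0], [2.0]]
    document = [[5.0], [0.0], [1.0], [2.0], [5.0]]
    print(slope_constrained_dtw(query, document))
```

Because every move advances both sequences, a document segment shorter than half or longer than twice the query length cannot be aligned, which is one common way to bound the effect of speaking rate differences.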
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/64129
Full-text authorization:	Paid license
Appears in collections:	Graduate Institute of Communication Engineering

Files in this item:

File	Size	Format
ntu-101-1.pdf (not authorized for public access)	2.7 MB	Adobe PDF

Unless their copyright terms are specifically stated otherwise, items in the system are protected by copyright, with all rights reserved.
