Robust Unsupervised Topic-Based Language Model Adaptation

Aaron Heidel; 何恩

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/28351

Title:	基於主題分析的強健性語言模型調適 (Robust Unsupervised Topic-Based Language Model Adaptation)
Author:	Aaron Heidel (何恩)
Advisor:	李琳山 (Lin-shan Lee)
Keywords:	speech recognition, language model adaptation, topic modeling (latent semantic analysis), story segmentation, unsupervised adaptation
Year of publication:	2007
Degree:	Master's
Abstract:	We present a novel topic mixture-based language model adaptation approach that uses Latent Dirichlet Allocation (LDA). We use Probabilistic Latent Semantic Analysis (PLSA) to automatically cluster a heterogeneous training corpus into latent topics, and then use the resulting topic-document assignments to initialize an LDA model. Using the final LDA model, we construct fine-grained topic-specific corpora at the utterance level and train a topic language model on each. During language model adaptation under an N-best rescoring framework, these topic LMs are interpolated with a background (that is, topic-independent) language model. We describe several techniques that make LDA topic inference more robust to first-pass recognition errors, and we demonstrate the effectiveness of metadata-based story segmentation combined with show-level language model adaptation. In experiments on multi-genre Mandarin Chinese data from the U.S. Department of Defense GALE Project, the approach improved on other state-of-the-art language model adaptation schemes.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/28351
Full-text license:	Paid license
Appears in collections:	Department of Computer Science and Information Engineering
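
The interpolation and N-best rescoring step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not give the exact interpolation form, so linear interpolation is assumed here, the language models are toy unigram dictionaries rather than n-gram models, the topic weights are set by hand rather than inferred by LDA, and every name (`interpolate`, `rescore_nbest`, `bg_weight`, `lm_scale`, the toy vocabulary) is hypothetical.

```python
import math

FLOOR = 1e-9  # assumed probability floor for words missing from a model


def interpolate(word, background, topic_lms, topic_weights, bg_weight=0.5):
    """Adapted probability, assuming linear interpolation:
    P(w) = bg_weight * P_bg(w) + (1 - bg_weight) * sum_k weight_k * P_k(w)
    """
    topic_prob = sum(
        weight * lm.get(word, FLOOR)
        for lm, weight in zip(topic_lms, topic_weights)
    )
    return bg_weight * background.get(word, FLOOR) + (1.0 - bg_weight) * topic_prob


def rescore_nbest(nbest, background, topic_lms, topic_weights, lm_scale=1.0):
    """Sort hypotheses by acoustic log-score plus scaled adapted LM log-prob."""
    scored = []
    for words, acoustic_logp in nbest:
        lm_logp = sum(
            math.log(interpolate(w, background, topic_lms, topic_weights))
            for w in words
        )
        scored.append((acoustic_logp + lm_scale * lm_logp, words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in scored]


# Toy unigram models (illustrative numbers only, not taken from the thesis).
background = {"the": 0.4, "bank": 0.2, "river": 0.2, "loan": 0.2}
finance = {"the": 0.3, "bank": 0.3, "loan": 0.35, "river": 0.05}
nature = {"the": 0.3, "bank": 0.2, "river": 0.45, "loan": 0.05}

# Two hypotheses with identical acoustic scores, so the adapted LM decides.
nbest = [
    (["the", "bank", "river"], -10.0),
    (["the", "bank", "loan"], -10.0),
]

# Topic weights standing in for an LDA-inferred topic mixture.
best_finance = rescore_nbest(nbest, background, [finance, nature], [0.9, 0.1])[0]
best_nature = rescore_nbest(nbest, background, [finance, nature], [0.1, 0.9])[0]
```

With the mixture weighted toward the finance topic, the hypothesis ending in "loan" ranks first; with the mixture weighted toward the nature topic, the one ending in "river" does. This is the effect topic-based adaptation aims for when acoustic evidence alone cannot separate hypotheses.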

Files in this item:

File	Size	Format
ntu-96-1.pdf (not currently authorized for public access)	1.35 MB	Adobe PDF

