Information Preservation and Its Applications to Natural Language Processing

Ruey-Cheng Chen; 陳瑞呈

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6381

Title:	Information Preservation and Its Applications to Natural Language Processing (資訊保存與自然語言處理的應用)
Author:	Ruey-Cheng Chen (陳瑞呈)
Advisor:	Jieh Hsiang (項潔)
Keywords:	information theory, information preservation, induction principle, unsupervised word segmentation, static index pruning, entropy optimization
Year of publication:	2013
Degree:	Doctoral
Abstract:	In this dissertation, we motivate a mathematical concept, called information preservation, in the context of probabilistic modeling. Our approach provides a common ground for relating various optimization principles, such as the maximum and minimum entropy methods. In this framework, we make explicit the assumption that model induction is a directed process toward some reference hypothesis. To verify this theory, we conducted extensive empirical studies on unsupervised word segmentation and static index pruning. In unsupervised word segmentation, our approach significantly boosted the segmentation accuracy of an ordinary compression-based method and achieved performance comparable to several state-of-the-art methods in terms of efficiency and effectiveness. For static index pruning, the proposed information-based measure achieved state-of-the-art performance, and did so more efficiently than the other methods. Our approach to model induction has also led to new discoveries, such as a new regularization method for cluster analysis. We expect that this deepened understanding of the induction principles may produce new methodologies for probabilistic modeling, and eventually lead to breakthroughs in natural language processing.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6381
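Full-text authorization:	Authorized (open to the public worldwide)
Appears in collections:	Department of Computer Science and Information Engineering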

Files in this item:

File	Size	Format
ntu-102-1.pdf	806.4 kB	Adobe PDF

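Unless their copyright terms are specifically stated otherwise, all items in this system are protected by copyright, with all rights reserved.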
