Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/5881
Title: | Concept Representation and Its Application (概念表徵及其應用) |
Author: | Chi-Hsin Yu (游基鑫) |
Advisor: | Hsin-Hsi Chen (陳信希) |
Keywords: | Concept Representation, Continuation, Commonsense Knowledge, Word Sense Disambiguation (WSD), Ranking |
Publication Year: | 2013 |
Degree: | Doctoral |
Abstract: | In this dissertation, we propose a definition of concept, derive from it a framework for building concept representations for a system, and apply the framework to two applications: commonsense knowledge classification and word sense disambiguation (WSD). In addition, we verify two hypotheses related to knowledge extraction: that commonsense knowledge appears in texts, and that a small-scale collection of web documents is sufficient to support important natural language processing (NLP) tasks. Finally, we present preprocessing results for ClueWeb09, a web-scale dataset, in the hope of providing more usable resources for other researchers.
Our definition of concept meets three criteria: it is inherently computational, it contains no undefined terms, and it has built-in properties that can be analyzed by humans or by the intelligent system itself to understand the system's internal structures. We define a concept as a continuation, which is a temporary state in the process of concept computation; this temporary state is interpreted within the framework of the evolutionary language game. Based on this definition, we divide concept representation into static and dynamic parts, and we use machine learning theory to examine several aspects of the system theoretically.
In commonsense knowledge classification, we adopt the vector space model to build the representation and show how our definition of concept interprets a typical machine learning process. In WSD, we apply our framework further to introduce two new notions, context appropriateness and concept fitness, and use them to build new WSD algorithms.
With a view to using automatic knowledge extraction to build concept representations in the future, we examine two basic questions: the content of knowledge and the size of knowledge sources. We find that even commonsense knowledge is recorded in texts, which supports the web as a good source for extracting human knowledge. Using the word ordering error task, we indirectly verify that although ClueWeb09 covers only a small part of the web, it yields results comparable to those obtained with the larger Google Web 5-gram dataset and can therefore support important NLP tasks. These two findings give us confidence to extract knowledge from a smaller dataset to build concept representations.
Lastly, we preprocess the English and Chinese web pages in ClueWeb09 and produce resources for researchers, including (1) a POS-tagged, phrase-chunked, and partly parsed English dataset, (2) a segmented and POS-tagged Chinese dataset with discourse markers identified, and (3) the NTU Chinese POS 5-gram dataset. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/5881 |
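Full-Text Authorization: | Authorized (open access worldwide) |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in this item:
File | Size | Format | |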
---|---|---|---|
ntu-102-1.pdf | 1.26 MB | Adobe PDF | View/Open |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.