Automatic Training Corpora Acquisition for Document Classification

Chun-Yi Chi; 紀均易

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/37924

Title:	文件分類中自動訓練資料收集法 (Automatic Training Corpora Acquisition for Document Classification)
Author:	Chun-Yi Chi (紀均易)
Advisor:	Pu-Jen Cheng (鄭卜壬)
Keywords:	文件分類, 訓練資料, Document Classification, Training Data
Year of publication:	2008
Degree:	Master's
Abstract:	Document classification has been a typical problem in several fields for many years. However, most previous work assumes that the corpora are explicitly labeled and well classified. In this work, we concentrate on the automatic acquisition of good-quality training data. We propose mining approaches that collect training data from a given unlabeled corpus or from the web. The proposed approaches are fully automatic and require only that humans define the classes in advance. In our work, the concept behind a class name can be captured by comparing the class with the other classes; this is the common concept among classes. Moreover, we can iteratively discover discriminative concepts within each class. By finding both common concepts and discriminative concepts, we can acquire high-quality training data. The evaluation gives empirical evidence that the classifiers thus created achieve promising accuracy. In short, the automatic acquisition of good-quality training data by our proposed methods is the primary contribution of this work.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/37924
Full-text license:	Paid license
Appears in collections:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
ntu-97-1.pdf (not currently authorized for public access)	3.42 MB	Adobe PDF

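Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.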