Chinese Relation Extraction by Semi-Supervised Learning for Knowledge Base Expansion

Yu-Ju Chen; 陳昱儒

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51780

Title:	Chinese Relation Extraction by Semi-Supervised Learning for Knowledge Base Expansion (半監督式學習於中文關係抽取以擴充知識庫之研究)
Author:	Yu-Ju Chen (陳昱儒)
Advisor:	Jane Yung-jen Hsu (許永真)
Keywords:	Relation Extraction, Multiple Instance Learning, Knowledge Base
Year of publication:	2015
Degree:	Master's
Abstract:	Relation extraction learns, from text, concept pairs that stand in a semantic relation; for example, the pair (Taipei, Taiwan) is related by "... is located in ...". This thesis investigates relation extraction as an approach to expanding a commonsense knowledge base. Supervised learning is well developed but needs a large labeled training set to perform well, and such a set is often expensive to prepare. As an alternative, distant supervision, a semi-supervised learning method, has been used to extract relations from unlabeled corpora. For a given relation in a knowledge base, a set of concept pairs already labeled with that relation serves as seeds; every sentence in a large unlabeled corpus that mentions a seed pair is automatically given the pair's relation as a weak label, and the labeled sentences become the training set. This heuristic labels a large amount of data quickly, but the labels can be quite noisy: when the source of the sentences is not correlated with the knowledge base, the automatic labeling is unreliable. To reduce the learning errors caused by wrong labels, we add the multiple instance learning assumption to distant supervision instead of assuming that every sentence is labeled correctly. Multiple instance learning trains a binary classifier on bags of instances. Each bag is labeled +1 or -1; a bag labeled +1 contains at least one +1 instance, while a bag labeled -1 contains only -1 instances. We put the sentences that mention the same concept pair into one bag and use multiple instance learning to classify unseen bags. We conducted experiments on relation extraction in Chinese, using concept pairs in ConceptNet, a commonsense knowledge base, as the seeds for labeling and text from the Sinica Corpus (Academia Sinica Balanced Corpus) as training data. The results of single-instance learning are compared with those of several multiple instance learning algorithms. The experiments extracted new concept pairs for four relations: AtLocation, CapableOf, HasProperty, and IsA. The study showed that a knowledge base can be improved with another corpus using the proposed approach.
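The labeling and bagging steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the seed pair, the toy sentences, the candidate pairs, the substring matching, and the function names `build_bags`, `weak_label`, and `bag_prediction` are all hypothetical. In particular, labeling every non-seed pair as -1 is an assumption made for illustration; the abstract does not say how negative bags were formed.

```python
from collections import defaultdict

# Hypothetical seed pair for one relation (AtLocation); not taken from ConceptNet.
SEED_PAIRS = {("台北", "台灣")}

# Hypothetical toy sentences standing in for a corpus.
SENTENCES = [
    "台北位於台灣北部",
    "他從台北飛回台灣",
    "高雄是一座港口城市",
    "高雄和日本有直航班機",
]

# Hypothetical candidate concept pairs to look for in the sentences.
CANDIDATE_PAIRS = [("台北", "台灣"), ("高雄", "日本")]


def build_bags(sentences, pairs):
    """Group sentences into bags keyed by the concept pair they mention.

    A sentence "mentions" a pair here if both concepts occur as substrings,
    which is a simplification of real concept matching.
    """
    bags = defaultdict(list)
    for sentence in sentences:
        for head, tail in pairs:
            if head in sentence and tail in sentence:
                bags[(head, tail)].append(sentence)
    return dict(bags)


def weak_label(bags, seed_pairs):
    """Weakly label each bag: +1 if its pair is a seed, else -1 (assumption)."""
    return {pair: (1 if pair in seed_pairs else -1) for pair in bags}


def bag_prediction(instance_predictions):
    """Multiple instance assumption: a bag is +1 iff at least one instance is +1."""
    return 1 if any(p == 1 for p in instance_predictions) else -1


bags = build_bags(SENTENCES, CANDIDATE_PAIRS)
labels = weak_label(bags, SEED_PAIRS)
```

Under this setup the second toy sentence mentions the seed pair without expressing the location relation, which is the kind of noisy weak label the bag-level assumption is meant to tolerate: the bag stays positive as long as one of its sentences is.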
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51780
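Full-text license:	Paid license
Appears in collections:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	1.48 MB	Adobe PDF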
