Detection, Disambiguation, and Argument Identification of Chinese Discourse Connectives (中文篇章連接詞偵測、消歧、及論元辨識)

Yong-Siang Shih; 施詠翔

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/53748

Title:	Detection, Disambiguation, and Argument Identification of Chinese Discourse Connectives (中文篇章連接詞偵測、消歧、及論元辨識)
Author:	Yong-Siang Shih (施詠翔)
Advisor:	Hsin-Hsi Chen (陳信希)
Keywords:	Natural Language Processing, Chinese Discourse Analysis, Discourse Connective Recognition, Discourse Relation Disambiguation, Discourse Connective Argument Identification
Year of publication:	2015
Degree:	Master's
Abstract:	Discourse relations describe how textual units are logically connected to one another. Analyzing the discourse structure of a text helps us better understand its meaning, so discourse analysis has many potential applications, such as natural language interfaces and large-scale content analysis. Although English discourse corpora have long been available to researchers, large-scale Chinese discourse corpora were not released until recently. In addition, Chinese discourse analysis raises many issues of its own: Chinese has a wider variety of discourse connectives, parallel connectives made up of several discontiguous words are common, and sentence structures are more complex, all of which make it harder to identify discourse structure correctly. Discourse connectives are important clues for identifying discourse relations in Chinese texts. However, connectives are themselves ambiguous, which makes extracting true connectives a challenge. In this thesis, we investigate four tasks regarding explicit discourse relations that are signaled by discourse connectives. First, we extract explicit discourse connectives by finding possible connectives in the text. Second, we resolve linking ambiguities among connective components. Third, we disambiguate the discourse relation type for each connective. Finally, we extract the arguments of each discourse connective.
Several features are proposed to train Logistic Regression classifiers that distinguish discourse from non-discourse usages of connective candidates and identify the relation type of each connective. Additionally, we rank the connective candidates and develop a greedy algorithm to resolve linking ambiguities. Finally, argument identification is formulated as a sequence labeling problem, and Conditional Random Fields are used to determine the argument boundaries. Beyond explicit discourse relations, implicit relations still require further investigation. Building on these components, an end-to-end discourse parser for Chinese may be constructed in future studies.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/53748
Full-text license:	Paid license
Appears in collections:	Department of Computer Science and Information Engineering
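
The abstract says that connective candidates are ranked and that a greedy algorithm then resolves linking ambiguities, but this record does not give the details. The following is only a minimal sketch of one plausible reading, not the thesis's actual method: it assumes each candidate carries a score from some ranking model, and it accepts candidates in descending score order as long as they do not reuse a token already claimed by an accepted candidate. The `Candidate` type, the `resolve_linking` function, the scores, and the token positions are all hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Candidate(NamedTuple):
    """Hypothetical connective candidate (not a structure from the thesis)."""
    words: Tuple[str, ...]      # component words, e.g. ("雖然", "但是")
    positions: Tuple[int, ...]  # token index of each component in the sentence
    score: float                # assumed output of some ranking model


def resolve_linking(candidates: List[Candidate]) -> List[Candidate]:
    """Greedily keep the best-scoring candidates that share no token."""
    used: Set[int] = set()
    accepted: List[Candidate] = []
    # Visit candidates from highest to lowest score.
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        # Accept only if none of its tokens is already claimed.
        if used.isdisjoint(cand.positions):
            accepted.append(cand)
            used.update(cand.positions)
    # Return the accepted candidates in text order.
    return sorted(accepted, key=lambda c: c.positions)


# Made-up candidates: "雖然...但是" (although...but) competes with its own
# components read as single connectives, and "因為...所以" (because...so)
# competes with a standalone "所以" (so).
candidates = [
    Candidate(("雖然", "但是"), (0, 5), 0.9),
    Candidate(("雖然",), (0,), 0.6),
    Candidate(("但是",), (5,), 0.7),
    Candidate(("因為", "所以"), (8, 12), 0.4),
    Candidate(("所以",), (12,), 0.8),
]
selected = resolve_linking(candidates)
for cand in selected:
    print(cand.words, cand.positions)
```

With these made-up scores, the paired reading of 雖然 and 但是 is kept because it outranks both single-word readings, while the standalone 所以 outranks the 因為...所以 pair and blocks it. A greedy pass like this is simple and fast, but it can miss a globally better combination of candidates.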

Files in this item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	616.27 kB	Adobe PDF
