Event Veridicality in Chinese (中文事件真實性判斷)

Yu-Yun Chang; 張瑜芸

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72973

Title:	Event Veridicality in Chinese (中文事件真實性判斷)
Author:	Yu-Yun Chang (張瑜芸)
Advisor:	Shu-Kai Hsieh (謝舒凱)
Keywords:	event, veridicality, pragmatics, linguistic features, machine learning models
Year of publication:	2019
Degree:	Doctoral
Abstract:	The central goal of this dissertation is to build a Chinese corpus annotated with readers' veridicality judgments of news events (Chinese PragBank) and to identify linguistic features that allow machine learning models to predict veridicality automatically. A reader's veridicality judgment is whether the reader views an event described in a sentence as happening or not. For instance, given 'The FBI alleged in court documents that Zazi had admitted having a handwritten recipe for explosives on his computer', do people believe that Zazi had a handwritten recipe for explosives? And what do people infer if the sentence is instead 'According to the FBI agents, there is relatively little evidence that Zazi had a handwritten recipe for explosives'? Automatically classifying the veridicality of events is important for sifting through the ever-growing amount of information appearing online.
However, most current information extraction systems work roughly at the clause level and would extract 'Zazi had a handwritten recipe for explosives' from both sentences above, even though readers judge the two differently. This dissertation aims at a better understanding and characterization of the context in which events are embedded, and of how that context shapes human judgments of event veridicality. A veridicality dataset currently exists for English (English PragBank) but not for Chinese. The Chinese corpus built here is used to explore linguistic features specific to Chinese texts and to incorporate those features into two machine learning models: a Maximum Entropy (MaxEnt) model and a Convolutional Neural Network (CNN) model. The goal is to explore how linguistic cues derived from linguistic theory can help models learn pragmatics, and whether English and Chinese readers differ when making veridicality judgments of news events. The study finds that English and Chinese readers behave differently with respect to some linguistic features. For example, if the speaker of an event is an authority (e.g., 'The White House' or 'The Judge'), Chinese speakers in Taiwan have lower confidence that the event happened than English speakers do. Other features (e.g., modality markers, tense and aspect, and statistical figures) also affect veridicality judgments. When these features are applied in model training, the evaluation results show that deep learning models perform better when trained on data with linguistic features.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72973
DOI:	10.6342/NTU201901524
Full-text license:	Paid license
Appears in collections:	Graduate Institute of Linguistics

Files in this item:

File	Size	Format
ntu-108-1.pdf (not authorized for public access)	2.5 MB	Adobe PDF

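Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.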
