Adversarial Training for Zero-Shot Cross-Lingual and Cross-Domain Textual Entailment Recognition

Ching Huang; 黃晴

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21418

Title:	Adversarial Training for Zero-Shot Cross-Lingual and Cross-Domain Textual Entailment Recognition (對抗式零樣本學習於跨語言及跨領域之文字蘊含識別)
Author:	Ching Huang (黃晴)
Advisor:	陳信希 (Hsin-Hsi Chen)
Keywords:	Textual Entailment Recognition, Natural Language Inference, Zero-shot Learning, Cross-lingual Learning, Cross-domain Learning, Adversarial Learning, MultiNLI Corpus
Publication year:	2019
Degree:	Master's
Abstract:	Deep neural networks have recently achieved strong performance on a wide range of natural language processing tasks, including textual entailment recognition (TER), the focus of this thesis. TER, also known as natural language inference (NLI), is a classic natural language processing task: given a “premise” sentence, the goal is to determine whether a “hypothesis” sentence is (1) definitely true, (2) definitely false, or (3) unrelated. TER datasets such as the Stanford Natural Language Inference (SNLI) corpus and the Multi-Genre Natural Language Inference (MultiNLI) corpus provide a large amount of manually annotated data, which makes it possible to train complex deep learning models. However, these corpora contain only English examples, so languages other than English often lack sufficient annotated data for TER. This study therefore aims to build a cross-lingual TER model from the existing English NLI corpora. The multilingual pre-trained sentence representation BERT, proposed by Google, has considerably alleviated this problem: with BERT as the pre-trained sentence representation, zero-shot cross-lingual learning can be applied successfully to TER.
This thesis proposes an adversarial training approach for zero-shot cross-lingual TER that further narrows the performance gap between the source (training) language and the target (test) language. Building on the success of zero-shot cross-lingual training, we extend the setting with a second adversarial training mechanism for zero-shot TER that is both cross-lingual and cross-domain. With this mechanism, TER models can also make use of unlabeled, out-of-domain, non-English training instances. Experimental results confirm that, in both scenarios, adversarial training further improves models built on pre-trained BERT sentence representations.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21418
DOI:	10.6342/NTU201900744
Full-text authorization:	Not authorized
Appears in department:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
ntu-108-1.pdf (not currently authorized for public access)	769.34 kB	Adobe PDF

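All items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.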
