Unsupervised Text Summarization and Topic Modeling using Generative Adversarial Networks

Yau-Shian Wang; 王耀賢

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/629

Title:	Unsupervised Text Summarization and Topic Modeling using Generative Adversarial Networks (以生成對抗網路達成非監督式文章摘要及主題模型)
Author:	Yau-Shian Wang (王耀賢)
Advisor:	Lin-shan Lee (李琳山)
Keywords:	Unsupervised learning, Text summarization, Topic model, GAN
Year of publication:	2019
Degree:	Master's
Abstract:	With the rise of the Internet, people leave all kinds of data online. Because most of this data is unlabeled, unsupervised learning, which trains on unlabeled data, has become an important research topic in recent years. In this thesis, we use Generative Adversarial Networks (GANs) to explore the potential of unsupervised learning in natural language processing, focusing on two topics. The first is unpaired abstractive text summarization: training a machine to write abstractive summaries without parallel data that pairs each training document with a human-written summary. We treat the summary as the latent representation of a document auto-encoder and use a GAN to constrain this latent representation to be human-readable. Given only a relatively small number of human-written summaries, which need not correspond to the training documents, as examples for the discriminator, the machine can learn how humans write summaries. We evaluate the proposed model on English and Chinese news summarization datasets, and the results support the feasibility of this approach. The second topic is unsupervised topic modeling, where the goal is for a machine to automatically discover latent topics close to those a human would recognize. Unlike prior topic models, which model a text as generated from a mixture of several fine-grained sub-topics, we use InfoGAN to model a text as generated from a discrete code that controls the high-level topic and a continuous vector that controls the variation within that topic. Experiments show that, compared with prior work, our model substantially improves both unsupervised text classification and the quality of the keywords extracted for each topic.
URI:	http://tdr.lib.ntu.edu.tw/handle/123456789/629
DOI:	10.6342/NTU202000333
Full-text authorization:	Authorized (open access worldwide)
Appears in collections:	Graduate Institute of Networking and Multimedia

Files in this item:

File	Size	Format
ntu-108-1.pdf	2.93 MB	Adobe PDF

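Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.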
