Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74572
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 鄭卜壬(Pu-Jen Cheng) | |
dc.contributor.author | Jing-Ting Huang | en |
dc.contributor.author | 黃敬庭 | zh_TW |
dc.date.accessioned | 2021-06-17T08:43:26Z | - |
dc.date.available | 2019-08-13 | |
dc.date.copyright | 2019-08-13 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-07 | |
dc.identifier.citation | [1] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.
[2] Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In NAACL-HLT.
[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
[4] William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2557–2563. Association for Computational Linguistics.
[5] Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. 2011. Model-portability experiments for textual temporal analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 271–276, Stroudsburg, PA, USA. Association for Computational Linguistics.
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[7] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74572 | - |
dc.description.abstract | The amount of information in modern society has far exceeded what humans can process. With the emergence of massive amounts of text data, it is increasingly important to filter and identify the content and topics of information efficiently with the help of machines. However, topic classification requires labeled data for training, so to satisfy the need for automatic topic understanding we must find a way to generate such training data.

We observed articles recently published by Taiwanese online news media and identified several points we can build on. First, some types of news keep reappearing; we infer that these stories more easily attract readers' attention or lend themselves to hype. If we can identify such articles, the value of our work becomes more significant, because related news on these topics is likely to appear again in the near future. Second, even among articles on the same topic, the people, events, times, places, and objects involved differ each time, so filtering by traditional word matching is impractical.

Based on these two observations, we propose generating, from the small number of seed sentences we have, a dataset for detecting the corresponding topic, and we define a higher-level topic over these sentences that covers the individual events reported each time. By manually searching for and collecting a small number of positive seed sentences, we develop a method for producing both positive and negative training data, propose a model that makes effective use of the generated data, and conduct a series of experiments for validation.

To carry out this method, we first collected 5 to 10 years of past news articles from major news outlets in order to build a corpus from which negative data can be sampled. We then focused on several previously observed topics and defined the higher-level topics to be detected, so as to produce the datasets.

The pipeline proposed in this thesis consists of two parts: the first collects data and performs sampling and generation; the second uses the generated data to train a topic classifier. Experimental results show that the generated data yields a positive performance gain for the model, and that with these data the model clearly learns the topics originally defined for data generation. | zh_TW |
dc.description.abstract | The quantity of data today has already surpassed what a person can handle, and filtering enormous amounts of text data with the help of machines is becoming more and more important. However, classification of text data requires enough labeled data to support training, so if we want topics to be recognized automatically, we must find a way to obtain such training data.

Before starting this work, we examined many news articles from Taiwanese online media and found some patterns we can take advantage of. First, news about certain topics tends to appear over and over again. We assume that these topics are more tempting and thus attract more readers and hype up emotions. This makes them a good starting point: because news on these topics is likely to appear again in the near future, our work can have immediate impact. Second, even though such news recurs, the actors and minor details are usually completely different each time, so a traditional word-matching model is unlikely to work well on this task.

Based on the above, we propose an approach that uses the few seed sentences we obtain from past news on these events to generate training data. Furthermore, for events of a similar kind we define a higher-level topic that includes them all. We generate positive and negative training examples from the seed sentences and propose a model that takes full advantage of the generated dataset on the classification task. A series of experiments is also conducted to measure the capability of our approach.

To realize our approach, we crawled news articles published by public news media during the past 5 to 10 years to build a corpus from which we can sample negative data. Then, for each higher-level topic, we generate positive and negative datasets. Our main approach can be divided into two parts: the first is the retrieval of seed sentences and the generation of the dataset, and the second is the training of the classifier using the generated data. Our experimental results show that the generation and augmentation we apply help boost performance on this task. | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T08:43:26Z (GMT). No. of bitstreams: 1 ntu-108-R06944049-1.pdf: 1340744 bytes, checksum: 03a488a772c533fc907a97e3886e05db (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Chinese Abstract II
Abstract III
Contents IV
List of Figures VI
List of Tables VII
Motivation 1
1.1 Information needs on news topics 1
1.2 Higher-level topic 2
1.3 Target 2
1.4 Difficulty on training the classifier 4
1.5 Our goal 4
Related work 5
2.1 Data augmentation 5
Problem definition 6
3.1 Definition of a topic 6
3.2 Problem definition 7
Methodology 8
4.1 Overview 8
4.2 Retrieval of seed sentences 9
4.3 Synonyms 9
4.4 Sentence generation 9
4.5 Composition of training dataset 10
4.6 Classifier 11
4.7 Embedding layer & bi-GRU 11
4.8 Attention layer 12
4.9 Loss function 13
Experiments 14
5.1 News corpus and topics 14
5.2 Model and training details 15
5.3 Accuracy on different datasets 15
5.4 Performance boost on data augmentation 16
5.5 Visualization of data 17
5.6 Impact on attention with additional loss function 19
5.7 Examples on attention output 20
5.8 Detecting the consistent target 21
Conclusions and future work 23
6.1 Conclusions 23
6.2 Future work 23
Bibliography 24 | |
dc.language.iso | en | |
dc.title | 對於有限資料主題分類的對抗性訓練方式 | zh_TW |
dc.title | Towards Adversarial Training for Data-limited Topic Classification | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 盧文祥(Wen-Hsiang Lu),王正豪(Jenq-Haur Wang) | |
dc.subject.keyword | 自然語言處理,對抗訓練,遞歸神經網路,主題分類問題,資料增強,注意力機制, | zh_TW |
dc.subject.keyword | Natural Language Processing,Adversarial Training,Recurrent Neural Network,Topic Classification,Data Augmentation,Attention Mechanism, | en |
dc.relation.page | 24 | |
dc.identifier.doi | 10.6342/NTU201902560 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2019-08-07 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | zh_TW |
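The data-generation pipeline outlined in the abstract (positives produced by synonym substitution over a few seed sentences, negatives sampled from a background news corpus) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the function names, the toy synonym table, and the substitution probability are all assumptions made here for clarity.

```python
import random

# Toy synonym table: the abstract mentions a synonym step but does not
# specify its source, so this dictionary is purely illustrative.
SYNONYMS = {
    "protest": ["demonstration", "rally"],
    "price": ["cost", "fare"],
}

def augment(tokens, synonyms, p=0.5, rng=None):
    """Create one augmented positive example by probabilistically
    swapping tokens for one of their listed synonyms."""
    rng = rng or random.Random(0)
    return [
        rng.choice(synonyms[t]) if t in synonyms and rng.random() < p else t
        for t in tokens
    ]

def build_dataset(seed_sentences, corpus, synonyms, n_aug=2, seed=0):
    """Positives: augmented copies of the seed sentences (label 1).
    Negatives: sentences sampled from the background corpus (label 0)."""
    rng = random.Random(seed)
    positives = [
        augment(s, synonyms, rng=rng)
        for s in seed_sentences
        for _ in range(n_aug)
    ]
    negatives = rng.sample(corpus, min(len(positives), len(corpus)))
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]
```

The resulting labeled pairs would then feed a classifier such as the bi-GRU with attention listed in the table of contents; that model itself is not sketched here.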
Appears in Collections: | Graduate Institute of Networking and Multimedia |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 1.31 MB | Adobe PDF |
Unless otherwise indicated in their copyright terms, all items in the system are protected by copyright, with all rights reserved.