Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74572
Title: | Towards Adversarial Training for Data-limited Topic Classification |
Authors: | Jing-Ting Huang 黃敬庭 |
Advisor: | 鄭卜壬(Pu-Jen Cheng) |
Keywords: | Natural Language Processing, Adversarial Training, Recurrent Neural Network, Topic Classification, Data Augmentation, Attention Mechanism |
Publication Year: | 2019 |
Degree: | Master's |
Abstract: | The amount of information in modern society has far exceeded what humans can process. With the emergence of massive quantities of text data, it is increasingly important to filter and identify the content and topics of that information efficiently with the help of machines. However, topic classification requires labeled data for training; to satisfy the need for automatic topic understanding, we must have a way to generate such training data.
We observed articles recently published by Taiwanese online news media and found several patterns we can take advantage of. First, news on certain topics tends to appear over and over again; we infer that such news is more likely to attract readers' attention or to stir up emotions. If we can identify these articles first, our work will be more valuable, since related news on these topics may well appear again in the near future. Second, even among articles on the same topic, the people, events, times, places, and objects involved differ each time, so filtering by traditional word matching is impractical. Based on these two observations, we propose generating a dataset for detecting a given topic from the small number of seed sentences we possess, and we define a higher-level topic for these sentences that subsumes the individual events reported each time. Starting from a few manually collected positive seed sentences, we develop a method for producing both positive and negative training data, propose a model that can make full use of the generated data, and conduct a series of experiments to validate the approach.
To realize this approach, we first crawled news articles published by major news media over the past 5 to 10 years to build a corpus from which negative examples can be sampled. We then selected several of the previously observed topics, defined the higher-level topics to be detected, and produced the corresponding datasets. The proposed pipeline consists of two parts: the first collects the data and handles sampling and generation, and the second trains a topic classifier on the generated data. Experimental results show that the generated and augmented data yields a clear performance improvement, and that with this data the model can learn the topics originally defined for data generation. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74572 |
DOI: | 10.6342/NTU201902560 |
Fulltext Rights: | Paid authorization |
Appears in Collections: | Graduate Institute of Networking and Multimedia |
Files in This Item:
File | Size | Format
---|---|---
ntu-108-1.pdf (Restricted Access) | 1.31 MB | Adobe PDF