Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/69582
Title: A Hybrid Word-Character Approach to Abstractive Summarization (基於字與詞混合方法之抽象摘要研究)
Authors: Chieh-Teng Chang (張介騰)
Advisor: Jane Yung-jen Hsu (許永真)
Keyword: Abstractive Summarization, Neural Networks, Natural Language Processing, Encoder-Decoder Framework
Publication Year: 2018
Degree: Master's
Abstract: Automatic abstractive text summarization is an important and challenging research topic in natural language processing. Among widely used languages, Chinese has a special property: a single Chinese character carries rich information comparable to that of a word. Existing Chinese text summarization methods adopt either purely character-based or purely word-based representations, and so fail to fully exploit the information carried by both. To capture the essence of articles accurately, we propose a hybrid word-character approach (HWC) that preserves the advantages of both word-based and character-based representations. We evaluate HWC by applying it to two existing methods and find that it achieves state-of-the-art performance on the widely used LCSTS dataset, surpassing the previous best by a margin of 24 ROUGE points. In addition, we identify an issue in the LCSTS dataset and offer a script that removes overlapping pairs (a summary and a short text) to create a clean dataset for the community. The proposed HWC approach also achieves the best performance on the new, clean LCSTS dataset.
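The abstract mentions a script that removes overlapping pairs from LCSTS but does not describe it. The following is only a minimal sketch of one way such a filter could look, not the author's script. It assumes that "overlapping" means the same (short text, summary) pair appears in both the training split and the evaluation split, and that pairs are already loaded as Python tuples. The function name `remove_overlapping_pairs` and the whitespace normalization are hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set, Tuple

# A pair is (short_text, summary). This layout is an assumption for
# illustration and is not the on-disk format of LCSTS.
Pair = Tuple[str, str]


def remove_overlapping_pairs(train: Iterable[Pair], test: Iterable[Pair]) -> List[Pair]:
    """Return the training pairs that do not also occur in the test split.

    Assumption: two pairs "overlap" when both the short text and the
    summary match exactly after stripping surrounding whitespace.
    """
    # Build a set of normalized test pairs for constant-time membership checks.
    held_out: Set[Pair] = {(text.strip(), summary.strip()) for text, summary in test}
    # Keep the original (unnormalized) training pairs that are not in the test set.
    return [
        (text, summary)
        for text, summary in train
        if (text.strip(), summary.strip()) not in held_out
    ]


if __name__ == "__main__":
    train_pairs = [("text A", "summary A"), ("text B", "summary B")]
    test_pairs = [("text B", "summary B")]
    print(remove_overlapping_pairs(train_pairs, test_pairs))
```

Exact matching is the strictest definition of overlap. If the dataset issue involves near-duplicates or pairs that share only the short text, the membership test would need to change accordingly.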
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/69582
DOI: 10.6342/NTU201801093
Fulltext Rights: Paid license (有償授權)
Appears in Collections: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)

Files in This Item:
File: ntu-107-1.pdf (Restricted Access)
Size: 678.2 kB
Format: Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

© NTU Library All Rights Reserved