Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38414
Title: Hot Topic Extraction with Timeline Analysis and Multidimensional Sentence Modeling (以時間分析與多維度語句呈現為基礎之熱門話題萃取)
Authors: Kuan-Yu Chen (陳冠宇)
Advisor: 曹承礎
Keyword: Topic Detection, Hot Topic Extraction, TF*PDF, Hot Term, Multidimensional Sentence Vector
Publication Year: 2005
Degree: Master's
Abstract: Topic detection is a subfield of Topic Detection and Tracking (TDT), which develops techniques for searching, organizing, and structuring news-oriented text from broadcast news media. This work focuses on detecting "hot" topics, meaning topics that are discussed and reported frequently within a given period of time. Prior work on hot topic extraction introduced a term-weighting scheme called TF*PDF, which extracts the "hot terms" that describe hot topics. That approach has two problems: (1) because TF*PDF weights a term solely by its term frequency and proportional document frequency, the extracted hot terms can be unreliable; (2) a single sentence vector is not sufficient to represent the meaning of a sentence.
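The TF*PDF weighting mentioned above can be illustrated with a minimal sketch. It assumes the commonly cited formulation, in which a term's weight is the sum over news channels of its normalized frequency in the channel multiplied by the exponential of the fraction of that channel's documents containing it. The thesis may use a variant. The function name `tf_pdf` and the `channels` input layout are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_pdf(channels):
    """Compute TF*PDF-style weights (illustrative sketch, not the thesis code).

    channels: dict mapping a channel name to a list of documents,
              where each document is a list of already-tokenized terms.
    Returns a dict mapping each term to its summed weight over channels.
    """
    weights = Counter()
    for docs in channels.values():
        if not docs:
            continue
        # Frequency of each term in this channel.
        term_freq = Counter(tok for doc in docs for tok in doc)
        # Number of documents in this channel that contain each term.
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        # Normalize term frequencies by their Euclidean norm within the channel.
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        for term, f in term_freq.items():
            weights[term] += (f / norm) * math.exp(doc_freq[term] / len(docs))
    return dict(weights)
```

Because the weight depends only on how often and how widely a term appears, a term that is always common in the news scores as high as one that suddenly surges, which is the unreliability the abstract describes.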
We propose a hot topic extraction system that addresses both problems. First, we extract hot terms by recording how each term's usage varies over a timeline; in other words, tracking the life cycle of a term helps us decide whether it is a genuine hot term that describes a hot topic. Second, we use multidimensional sentence vectors to represent the information in a sentence. Finally, we group the sentences of all news reports into clusters, each of which represents a news topic. Experimental results show that these two improvements not only raise the quality of each cluster but also extract most of the actual hot topics within a period of time.
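The abstract does not say how a term's life cycle is measured, so the following is only a hypothetical sketch of the general idea of timeline analysis. It compares a term's peak daily share of documents with its average daily share. The `burst_score` function, its input layout, and the choice of a peak-to-mean ratio are all assumptions made for illustration, not the method used in the thesis.

```python
from collections import Counter


def burst_score(dated_docs, term):
    """Peak-to-mean ratio of a term's daily document share (illustrative only).

    dated_docs: list of (date, tokens) pairs, where `date` is any hashable
                label such as "2005-01-03" and `tokens` is a list or set.
    Returns 0.0 if the term never occurs; otherwise a value >= 1.0, where
    1.0 means perfectly even usage and larger values mean a sharper peak.
    """
    # Total number of documents published on each date.
    docs_per_day = Counter(date for date, _ in dated_docs)
    # Number of documents per date that mention the term.
    hits_per_day = Counter(date for date, tokens in dated_docs if term in tokens)
    if not hits_per_day:
        return 0.0
    shares = [hits_per_day[d] / docs_per_day[d] for d in docs_per_day]
    mean_share = sum(shares) / len(shares)
    return max(shares) / mean_share
```

Under this assumed measure, a term used evenly every day scores 1.0, while a term concentrated on a few days scores higher, which is one simple way to separate steady background vocabulary from terms tied to a short-lived event.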
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38414
Fulltext Rights: Fee-based license (有償授權)
Appears in Collections: Department of Information Management

Files in This Item:
File: ntu-94-1.pdf (Restricted Access)
Size: 395.09 kB
Format: Adobe PDF