Hot Topic Extraction with Timeline Analysis and Multidimensional Sentence Modeling (以時間分析與多維度語句呈現為基礎之熱門話題萃取)

Kuan-Yu Chen; 陳冠宇

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38414

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	曹承礎
dc.contributor.author	Kuan-Yu Chen	en
dc.contributor.author	陳冠宇	zh_TW
dc.date.accessioned	2021-06-13T16:32:52Z	-
dc.date.available	2005-07-14
dc.date.copyright	2005-07-14
dc.date.issued	2005
dc.date.submitted	2005-07-11
dc.identifier.citation	[1] James Allan. Introduction to Topic Detection and Tracking. In Topic Detection and Tracking: Event-based Information Organization, J. Allan, ed. Kluwer Academic, Boston, MA, pp. 1-16 [2] James Allan, Victor Lavrenko, and Margaret E. Connell. A Month to Topic Detection and Tracking in Hindi. ACM Transactions on Asian Language Information Processing (TALIP), Vol. 2, 2003, pp. 85-100 [3] James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998 [4] James Allan, Ron Papka, and Victor Lavrenko. On-line New Event Detection and Tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998, pp. 37-45 [5] The 2004 Topic Detection and Tracking (TDT2004) Task Definition and Evaluation Plan, http://www.nist.gov/speech/tests/tdt/ [6] TDT 2004: Annotation Manual Version 1.2 - August 2004, http://www.nist.gov/speech/tests/tdt/ [7] Hans Peter Luhn. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development 1 (4), 1957, pp. 309-317 [8] Noreault, T., McGill, M., and Koll, M. B. (1981). A performance evaluation of similarity measures, document term weighting schemes and representations in a Boolean environment, pp. 57-76, Butterworths [9] Sparck-Jones, K. Index Term Weighting. Information Storage and Retrieval, Vol. 9, No. 11, 1973, pp. 619-633 [10] Salton, G. and Yang, C. S. On the Specification of Term Values in Automatic Indexing. Journal of Documentation, 1973, pp. 351-372 [11] Nagao, M., Mizutani, M., and Ikeda, H. An Automated Method of the Extraction of Important Words from Japanese Scientific Documents. Trans. of IPSJ, 17(2), 1976, pp. 110-117 [12] Toru Hisamitsu and Jun-ichi Tsujii. Measuring Term Representativeness. SCIE 2002 [13] Toru Hisamitsu, Yoshiki Niwa, and Jun-ichi Tsujii. A method of measuring term representativeness: baseline method using co-occurrence distribution. Proceedings of the 17th conference on Computational linguistics, Vol. 1, 2000, pp. 320-326 [14] Toru Hisamitsu and Yoshiki Niwa. A Measure of Term Representativeness Based on the Number of Co-occurring Salient Words. In Proceedings of the 19th International Conference on Computational Linguistics, 2002 [15] Hee-soo Kim, Ikkyu Choi, and Minkoo Kim. Refining term weights of documents using term dependencies. In Proceedings of the 27th annual international conference on Research and development in information retrieval, 2004, Sheffield, United Kingdom [16] Bun, K.K., Ishizuka, M. Topic Extraction from News Archive Using TF*PDF Algorithm. In Proceedings of the 3rd International Conference on Web Information Systems Engineering, 2002 [17] Khoo Khyou Bun and Mitsuru Ishizuka. Emerging Topic Tracking System. Lecture Notes in Computer Science, 2001 [18] Chien Chin Chen, Yao-Tsung Chen, Yeali Sun, and Meng Chang Chen. Life Cycle Modeling of News Events Using Aging Theory. In Proceedings of the 14th European Conference on Machine Learning (ECML2003), 2003, pp. 47-59 [19] Russell Swan and James Allan. Automatic generation of overview timelines. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 2001, pp. 49-56 [20] M. Franz, A. Ittycheriah, J. S. McCarley, and T. Ward. First story detection: Combining similarity and novelty based approaches. In Topic Detection and Tracking Workshop Report, 2001 [21] Yiming Yang, Tom Pierce, and Jaime Carbonell. A Study on Retrospective and On-Line Event Detection. In Proceedings of the 21st annual international ACM SIGIR conference on Research and Development in Information Retrieval, 1998, pp. 28-36 [22] J. Allan, R. Papka, and V. Lavrenko. Online New Event Detection and Tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and Development in Information Retrieval, 1998, pp. 37-45 [23] Hai Leong Chieu and Yoong Keok Lee. Query based event extraction along a timeline. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004, pp. 425-432 [24] Russell Swan and James Allan. Extracting significant time varying features from text. In Proceedings of the eighth international conference on Information and knowledge management, 1999, pp. 38-45 [25] J. Kleinberg. Bursty and hierarchical structure in streams. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 23-26 [26] Martin Porter. An Algorithm for Suffix Stripping. In Program, vol. 14, no. 3, 1980 [27] Giridhar Kumaran and James Allan. Text Classification and Named Entities for New Event. In Proceedings of the 27th Annual ACM SIGIR Conference, 2004, pp. 297-304 [28] Naoaki Okazaki, Yutaka Matsuo, Naohiro Matsumura, and Mitsuru Ishizuka. Sentence Extraction by Spreading Activation with Refined Similarity Measure. In Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference, 2003, pp. 407-411 [29] Andreas Hotho, Steffen Staab, and Gerd Stumme. Text Clustering Based on Background Knowledge. Technical Report No. 425. Institute of Applied Informatics and Formal Description Methods AIFB [30] Yiming Yang, Jian Zhang, Jaime Carbonell, and Chun Jin. Topic-conditioned Novelty Detection. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp. 688-693 [31] WordNet, http://www.cogsci.princeton.edu/~wn/
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38414	-
dc.description.abstract	主題偵測(Topic Detection)是主題偵測與追蹤(Topic Detection and Tracking)裡其中一個研究領域，該領域試著從新聞媒體裡，進行搜尋、組織及建構文字形式的新聞資料。我們的研究為如何偵測「熱門」的主題(Hot Topic)。所謂熱門的主題，是在某一段時間之內，它會被很多人常常討論與報導。在之前的研究裡，可透過TFPDF計算文字權重的式子，找到描述熱門主題的「熱門字」 (Hot Term)。不過它仍然會有一些問題存在：(一)只以字的出現頻率和文獻比例頻率為基礎的TFPDF，萃取熱門字會導致不可靠的結果;(二)只用單一的句子向量並不足以表達句子的涵義。因此我們提出了改良的熱門主題的萃取系統，來解決上述的兩個問題。首先，我們透過紀錄字在時間上的使用變化，來萃取熱門字;也就是說，追蹤任一個字的生命週期，可以幫助我們來分辨它是否為足以描述「熱門」主題的字。之後，我們使用多維度的句子向量，來描述句子的資訊。最後，我們對所有新聞報導裡的句子進行叢集(cluster)，而每一個叢集代表著一個新聞話題。透過以上兩個流程的改善，根據實驗結果顯示，不但增進了每一個叢集的品質，也能夠萃取出一段時間內所包含的熱門主題。	zh_TW
dc.description.abstract	Topic detection is part of the Topic Detection and Tracking field, which seeks to develop technologies that search, organize, and structure news-oriented textual materials from various broadcast news media. We are interested in detecting “hot” topics, that is, topics that are frequently discussed and reported within a given period of time. Prior work on hot topic extraction introduced a term-weighting scheme called TF*PDF, which extracts “hot” terms that can describe hot topics. This approach has two problems. First, the results are unreliable when the weight is determined solely by term frequency and document frequency. Second, a single sentence vector is insufficient to represent the meaning of a sentence. We propose a hot topic extraction system that aims to solve both problems. First, we extract hot terms by capturing how their usage varies along a timeline. In other words, tracking the life cycle of a term helps us determine whether it is a genuine hot term that describes a hot topic. Second, we use multi-dimensional sentence vectors to represent the information in a sentence. Finally, we group the sentences of the news reports into clusters, each of which represents a news topic. Experimental results show that these two improvements not only raise the quality of each cluster but also extract most of the actual hot topics over a period of time.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T16:32:52Z (GMT). No. of bitstreams: 1 ntu-94-R92725001-1.pdf: 404570 bytes, checksum: e712c65af4ba5f414152e5316fe6607b (MD5) Previous issue date: 2005	en
dc.description.tableofcontents	Chapter 1 Introduction, p. 1; 1.1 Motivation, p. 1; 1.2 Objective, p. 3; 1.3 Organization, p. 3; Chapter 2 Literature Review, p. 4; 2.1 The Tasks of TDT Program, p. 4; 2.1.1 The Definition of “Topic”, p. 4; 2.1.2 Topic Tracking, p. 5; 2.1.3 Topic Detection, p. 5; 2.1.4 New Event Detection, p. 5; 2.1.5 Story Segmentation, p. 5; 2.1.6 Story Link Detection, p. 6; 2.2 Term-Weighting Schemes, p. 6; 2.2 Topic Extraction with TF*PDF, p. 11; 2.3 Event Detection with Temporal Information, p. 13; Chapter 3 System Design, p. 20; 3.1 System Architecture, p. 20; 3.2 Text Preprocessing, p. 21; 3.3 Hot Term Generator, p. 22; 3.4 Sentence Modeling, p. 27; Chapter 4 Experiment Analysis, p. 32; 4.1 Data Source, p. 32; 4.2 System Parameter, p. 32; 4.3 Term Weighting Analysis, p. 33; 4.3 Sentence Clustering Analysis, p. 42; Chapter 5 Conclusion and Future Work, p. 50; 5.1 Conclusion, p. 50; 5.2 Future Work, p. 51; Bibliography, p. 53
dc.language.iso	en
dc.subject	多維度句子向量	zh_TW
dc.subject	主題偵測	zh_TW
dc.subject	熱門主題萃取	zh_TW
dc.subject	詞頻與文獻比例頻率	zh_TW
dc.subject	熱門字	zh_TW
dc.subject	Hot Topic Extraction	en
dc.subject	Topic Detection	en
dc.subject	Multidimensional Sentence Vector	en
dc.subject	Hot Term	en
dc.subject	TF*PDF	en
dc.title	以時間分析與多維度語句呈現為基礎之熱門話題萃取	zh_TW
dc.title	Hot Topic Extraction with Timeline Analysis and Multidimensional Sentence Modeling	en
dc.type	Thesis
dc.date.schoolyear	93-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	蔡益坤,林俊叡
dc.subject.keyword	主題偵測, 熱門主題萃取, 詞頻與文獻比例頻率, 熱門字, 多維度句子向量	zh_TW
dc.subject.keyword	Topic Detection, Hot Topic Extraction, TF*PDF, Hot Term, Multidimensional Sentence Vector	en
dc.relation.page	55
dc.rights.note	Paid license
dc.date.accepted	2005-07-11
dc.contributor.author-college	管理學院	zh_TW
dc.contributor.author-dept	資訊管理學研究所	zh_TW
Appears in Collections:	Department of Information Management

Files in This Item:

File	Size	Format
ntu-94-1.pdf (not authorized for public access)	395.09 kB	Adobe PDF

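Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.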