從文獻中擷取Metadata (Extract Metadata From Literature)

Zong-Xun Yang; 楊宗勳

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/39433

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	翁昭旼
dc.contributor.author	Zong-Xun Yang	en
dc.contributor.author	楊宗勳	zh_TW
dc.date.accessioned	2021-06-13T17:28:25Z	-
dc.date.available	2004-10-19
dc.date.copyright	2004-10-19
dc.date.issued	2004
dc.date.submitted	2004-10-13
dc.identifier.citation	[1] S. Lawrence, C. L. Giles and K. Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999. [2] K. D. Bollacker, S. Lawrence and C. L. Giles. CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In 2nd International ACM Conference on Autonomous Agents, pages 116-123, 1998. [3] A. K. McCallum, K. Nigam, J. Rennie and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, Volume 3, pages 127-163, 2000. [4] G. R. Thoma. Automating the production of bibliographic records for MEDLINE. Internal R&D report, CBE, LHNCBC, NLM, September 2001. [5] D. X. Le, J. Kim, G. F. Pearson and G. R. Thoma. Automated labeling of zones from scanned documents. Proc. 1999 Symposium on Document Image Understanding Technology, pages 219-226, 1999. [6] D. Besagni and A. Belaid. Citation recognition for scientific publications in digital libraries. In Proc. of the First International Workshop on Document Image Analysis for Libraries, 2004. [7] A. Belaid. Recognition of table of contents for electronic library consulting. International Journal on Document Analysis and Recognition, 4:35-45, 2001. [8] K. Seymore, A. McCallum and R. Rosenfeld. Learning hidden Markov model structure for information extraction. In Proc. of AAAI 99 Workshop on Machine Learning for Information Extraction, pages 37-42, 1999. [9] G. D. Zhou, J. Zhang, J. Su, D. Shen and C. L. Tan. Recognizing names in biomedical texts: a machine learning approach. Oxford University Press, 2004. [10] G. D. Zhou and J. Su. Named entity recognition using an HMM-based chunk tagger. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, pages 473-480, July 2002. [11] H. L. Chieu and H. T. Ng. Named entity recognition: a maximum entropy approach using global information. National University of Singapore Press, 2002. [12] J. Connan and C. W. Omlin. Bibliography extraction with hidden Markov model. Technical Report US-CS-TR-00-6, Department of Computer Science, University of Stellenbosch, February 2000. [13] V. Borkar, K. Deshmukh and S. Sarawagi. Automatic segmentation of text into structured records. In Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, 2001. [14] H. Han, C. L. Giles, E. Manavoglu and Hongyuan Zha. Automatic document metadata extraction using support vector machines. In Proc. of the 2003 Joint Conference on Digital Libraries, 2003. [15] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proc. of IEEE, 77(2):257-285, February 1989. [16] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, UC Berkeley, Computer Science Division, July 2002. [17] Search Engine World. http://www.searchengineworld.com/spy/stopwords.htm, 2002. [18] http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words. [19] Defense Virtual Library. http://dvl.dtic.mil/stop_list.html, 2000. [20] W. B. Frakes and R. Baeza-Yates. Information retrieval: data structures & algorithms. Prentice Hall PTR, June 1992. [21] Time and date.com. http://www.timeanddate.com, 2004. [22] World Travel Guide. http://www.cityguide.travel-guides.com [23] http://www.cesus.gov/genealogy/names [24] The Java Tutorial. http://java.sun.com/docs/books/tutorial/extra/regex/index.html [25] R. Grishman. Information Extraction and Speech Recognition. In Proc. of the Broadcast News Transcription and Understanding Workshop, pages 159-165, 1998.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/39433	-
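dc.description.abstract	With the rapid growth of the Internet, electronic literature spreads quickly and is easy both to publish and to obtain. As a result the volume of literature has grown enormously, but most of it is scattered across the Internet, so finding related work is time-consuming and laborious. A system that organizes interrelated literature on the Internet would let users look up related references with ease. This thesis focuses on the Header and reference sections of a document, because these two parts supply much of its basic information, such as title, author, publisher, and publication date. Such information is well suited to organizing literature: it lets us view documents along different dimensions and classify, cluster, and search them. The first task in organizing literature is to extract its Metadata. This research turns unstructured document data into structured data and assigns meaning to it. The work has three stages. The first analyzes the features of the text and clusters the text by those features. The second uses a Machine Learning algorithm to segment the clustered text appropriately and assign each segment the Metadata label that matches its meaning. Finally, the resulting structured data are stored in a database for later use.	zh_TW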
dc.description.abstract	With the rapid development of the Internet, electronic literature is disseminated quickly, and it has become very convenient both to publish and to obtain. This has greatly increased the amount of literature. Because most of it is scattered in disorder across the network, researchers spend considerable time searching for relevant articles. A system that organizes related literature on the network would make it easy to query relevant references, which would be a great benefit to users. This thesis mainly examines the header and reference sections of documents, because these two parts provide a large amount of basic information about a document, such as title, author, publisher, and publication date. This information is well suited to organizing literature: it allows us to observe, search, and classify documents along different dimensions. To organize literature, the primary task is to organize its metadata. This research converts unstructured literature into structured data and assigns meaning to it. The work is divided into three stages. In the first stage, we analyze the features of each token and cluster tokens according to those features. In the second stage, machine learning algorithms segment the clustered tokens and extract metadata from the segments. Finally, we store the resulting structured data in a database to facilitate future use.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T17:28:25Z (GMT). No. of bitstreams: 1 ntu-93-R91548051-1.pdf: 410848 bytes, checksum: ed4c0efed433d5ba958584169d08f622 (MD5) Previous issue date: 2004	en
dc.description.tableofcontents	List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Background; 1.2 Research Motivation and Objectives; 1.3 Thesis Organization; Chapter 2 Literature Review; 2.1 Related Systems; 2.1.1 CiteSeer[1,2]; 2.1.2 Cora[3]; 2.1.3 MARS[4]; 2.2 Difficulties of Metadata Extraction; 2.3 Core Techniques; 2.3.1 Rule-Based Method; 2.3.2 Machine Learning Method; 2.3.3 Word Clustering + Machine Learning Method; Chapter 3 System Architecture and Methods; 3.1 System architecture; 3.2 Hidden Markov Model; 3.3 Viterbi Decoding; 3.4 Internal structure state; 3.5 Feature and word clustering; 3.5.1 Reference; 3.5.2 Header; 3.6 Training Phase; 3.7 Laplace Smoothing; 3.8 Modified Viterbi Decoding; 3.9 Database; Chapter 4 Experimental Data and Discussion; 4.1 Experimental Data; 4.2 Experimental Results; Chapter 5 Conclusions and Future Work; References
dc.language.iso	zh-TW
dc.subject	hidden markov model	zh_TW
dc.subject	metadata	zh_TW
dc.subject	word clustering	zh_TW
dc.title	從文獻中擷取Metadata	zh_TW
dc.title	Extract Metadata From Literature	en
dc.type	Thesis
dc.date.schoolyear	93-1
dc.description.degree	Master's
dc.contributor.oralexamcommittee	蔣以仁,陳中明
dc.subject.keyword	metadata, word clustering, hidden markov model	zh_TW
dc.relation.page	44
dc.rights.note	Paid license
dc.date.accepted	2004-10-14
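dc.contributor.author-college	College of Engineering	zh_TW
dc.contributor.author-dept	Graduate Institute of Biomedical Engineering	zh_TW
Appears in Collections:	Graduate Institute of Biomedical Engineering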
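
The abstract's first two stages (cluster tokens by feature, then segment and label them with a learned model) can be illustrated with a small sketch. The table of contents names a Hidden Markov Model, Viterbi decoding, and Laplace smoothing, so the sketch uses those, but everything else is an assumption made for illustration: the five feature classes, the three labels, the toy training references, and the function names `cluster`, `train`, `viterbi`, and `label_tokens` do not come from the thesis, and the thesis's modified Viterbi decoding is not reproduced here.

```python
import math
import re
from collections import defaultdict

# Hypothetical label set and feature classes, chosen only for this sketch.
STATES = ["author", "title", "date"]
SYMBOLS = ["YEAR", "INITIAL", "CAP_END", "CAPWORD", "WORD"]


def cluster(token):
    """Stage 1: map a raw token to a coarse feature class."""
    if re.fullmatch(r"(19|20)\d{2}\.?", token):
        return "YEAR"
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"
    if token[:1].isupper():
        return "CAP_END" if token.endswith(".") else "CAPWORD"
    return "WORD"


def train(sequences, alpha=1.0):
    """Count-based HMM estimates with add-alpha (Laplace) smoothing, as log-probabilities."""
    start_c, trans_c, emit_c = defaultdict(float), defaultdict(float), defaultdict(float)
    for seq in sequences:
        prev = None
        for token, state in seq:
            if prev is None:
                start_c[state] += 1
            else:
                trans_c[prev, state] += 1
            emit_c[state, cluster(token)] += 1
            prev = state
    k, m = len(STATES), len(SYMBOLS)
    start = {s: math.log((start_c[s] + alpha) / (len(sequences) + alpha * k)) for s in STATES}
    trans, emit = {}, {}
    for s in STATES:
        out_total = sum(trans_c[s, t] for t in STATES)
        emit_total = sum(emit_c[s, o] for o in SYMBOLS)
        for t in STATES:
            trans[s, t] = math.log((trans_c[s, t] + alpha) / (out_total + alpha * k))
        for o in SYMBOLS:
            emit[s, o] = math.log((emit_c[s, o] + alpha) / (emit_total + alpha * m))
    return start, trans, emit


def viterbi(obs, start, trans, emit):
    """Stage 2: most probable label sequence for a sequence of feature classes."""
    if not obs:
        return []
    score = [{s: start[s] + emit[s, obs[0]] for s in STATES}]
    back = []
    for o in obs[1:]:
        prev, row, ptr = score[-1], {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: prev[s] + trans[s, t])
            row[t] = prev[best] + trans[best, t] + emit[t, o]
            ptr[t] = best
        score.append(row)
        back.append(ptr)
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy labeled references (invented, heavily shortened) used only to fit the sketch.
TOY_TRAINING = [
    [("K.", "author"), ("Murphy.", "author"), ("Dynamic", "title"),
     ("networks.", "title"), ("2002.", "date")],
    [("A.", "author"), ("Belaid.", "author"), ("Recognition", "title"),
     ("of", "title"), ("contents.", "title"), ("2001.", "date")],
]
START, TRANS, EMIT = train(TOY_TRAINING)


def label_tokens(tokens):
    """Cluster each token, then decode the label sequence."""
    return viterbi([cluster(t) for t in tokens], START, TRANS, EMIT)


print(label_tokens(["G.", "Thoma.", "Automating", "records.", "2001."]))
# → ['author', 'author', 'title', 'title', 'date']
```

Decoding over feature classes rather than raw words keeps the emission table small, and the smoothing term `alpha` gives unseen class/label pairs a nonzero probability, so a reference containing a pattern absent from the training data can still be decoded.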

Files in This Item:

File	Size	Format
ntu-93-1.pdf (not authorized for public access)	401.22 kB	Adobe PDF

Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.