  1. NTU Theses and Dissertations Repository
  2. College of Engineering
  3. Institute of Biomedical Engineering
Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/39433
Full metadata record
DC field: value [language]
dc.contributor.advisor: 翁昭旼
dc.contributor.author: Zong-Xun Yang [en]
dc.contributor.author: 楊宗勳 [zh_TW]
dc.date.accessioned: 2021-06-13T17:28:25Z
dc.date.available: 2004-10-19
dc.date.copyright: 2004-10-19
dc.date.issued: 2004
dc.date.submitted: 2004-10-13
dc.identifier.citation:
[1] S. Lawrence, C. L. Giles and K. Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999.
[2] K. D. Bollacker, S. Lawrence and C. L. Giles. CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In 2nd International ACM Conference on Autonomous Agents, pages 116-123, 1998.
[3] A. K. McCallum, K. Nigam, J. Rennie and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, Volume 3, pages 127-163, 2000.
[4] G. R. Thoma. Automating the production of bibliographic records for MEDLINE. Internal R&D report, CBE, LHNCBC, NLM, September 2001.
[5] D. X. Le, J. Kim, G. F. Pearson and G. R. Thoma. Automated labeling of zones from scanned documents. In Proc. of the 1999 Symposium on Document Image Understanding Technology, pages 219-226, 1999.
[6] D. Besagni and A. Belaid. Citation recognition for scientific publications in digital libraries. In Proc. of the First International Workshop on Document Image Analysis for Libraries, 2004.
[7] A. Belaid. Recognition of table of contents for electronic library consulting. International Journal on Document Analysis and Recognition, 4:35-45, 2001.
[8] K. Seymore, A. McCallum and R. Rosenfeld. Learning hidden Markov model structure for information extraction. In Proc. of the AAAI 99 Workshop on Machine Learning for Information Extraction, pages 37-42, 1999.
[9] G. D. Zhou, J. Zhang, J. Su, D. Shen and C. L. Tan. Recognizing names in biomedical texts: a machine learning approach. Oxford University Press, 2004.
[10] G. D. Zhou and J. Su. Named entity recognition using an HMM-based chunk tagger. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, pages 473-480, July 2002.
[11] H. L. Chieu and H. T. Ng. Named entity recognition: a maximum entropy approach using global information. National University of Singapore Press, 2002.
[12] J. Connan and C. W. Omlin. Bibliography extraction with hidden Markov models. Technical Report US-CS-TR-00-6, Department of Computer Science, University of Stellenbosch, February 2000.
[13] V. Borkar, K. Deshmukh and S. Sarawagi. Automatic segmentation of text into structured records. In Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, 2001.
[14] H. Han, C. L. Giles, E. Manavoglu and H. Zha. Automatic document metadata extraction using support vector machines. In Proc. of the 2003 Joint Conference on Digital Libraries, 2003.
[15] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, February 1989.
[16] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. thesis, UC Berkeley, Computer Science Division, July 2002.
[17] Search Engine World. http://www.searchengineworld.com/spy/stopwords.htm, 2002.
[18] http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words.
[19] Defense Virtual Library. http://dvl.dtic.mil/stop_list.html, 2000.
[20] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures & Algorithms. Prentice Hall PTR, June 1992.
[21] Time and date.com. http://www.timeanddate.com, 2004.
[22] World Travel Guide. http://www.cityguide.travel-guides.com
[23] http://www.cesus.gov/genealogy/names
[24] The Java Tutorial. http://java.sun.com/docs/books/tutorial/extra/regex/index.html
[25] R. Grishman. Information extraction and speech recognition. In Proc. of the Broadcast News Transcription and Understanding Workshop, pages 159-165, 1998.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/39433
dc.description.abstract: With the rapid growth of the Internet, electronic literature spreads quickly, and both publishing and obtaining it have become very convenient. As a result the volume of literature has grown enormously, but most of it is scattered across the network, which makes finding relevant articles time-consuming and laborious. A system that organizes interrelated literature on the network would let users look up related references easily.
This thesis focuses on the header and reference sections of a document, because these two parts supply much of its basic information, such as title, author, publisher, and publication date. This information is well suited to organizing literature: it lets us view, classify, cluster, and search documents along many different dimensions.
The first task in organizing literature is to extract its metadata. This research converts unstructured document text into structured data and assigns meaning to it. The work has three stages. The first stage analyzes the features of each token and clusters tokens according to those features. The second stage applies a machine learning algorithm to segment the clustered tokens appropriately and to label each segment with the metadata field that matches its meaning. The final stage stores the resulting structured data in a database for later use. [en]
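The first two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes the machine learning step is a hidden Markov model trained with Laplace smoothing and decoded with the Viterbi algorithm, as the table of contents and keywords suggest. The token classes in `CLASSES`, the helper names (`cluster`, `log_table`, `train`, `viterbi`), and the one-line training sample are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical token classes; the thesis's real feature set is not given in this record.
CLASSES = ["YEAR", "INITIAL", "NUMBER", "CAPWORD", "WORD"]

def cluster(token):
    """Stage 1: map a raw token to a coarse class based on surface features."""
    if re.fullmatch(r"(19|20)\d\d[.,]?", token):
        return "YEAR"
    if re.fullmatch(r"[A-Z]\.,?", token):
        return "INITIAL"
    if re.fullmatch(r"\d[\d:()\-]*[.,]?", token):
        return "NUMBER"
    if token[:1].isupper():
        return "CAPWORD"
    return "WORD"

def log_table(counts, rows, cols, alpha=1.0):
    """Laplace smoothing: log((count + alpha) / (row total + alpha * len(cols)))."""
    table = {}
    for r in rows:
        total = sum(counts[r, c] for c in cols) + alpha * len(cols)
        for c in cols:
            table[r, c] = math.log((counts[r, c] + alpha) / total)
    return table

def train(sequences):
    """Estimate start, transition and emission tables from (token, state) sequences."""
    states = sorted({state for seq in sequences for _, state in seq})
    start, trans, emit = Counter(), Counter(), Counter()
    for seq in sequences:
        start["<s>", seq[0][1]] += 1
        for i, (token, state) in enumerate(seq):
            emit[state, cluster(token)] += 1
            if i > 0:
                trans[seq[i - 1][1], state] += 1
    return (states,
            log_table(start, ["<s>"], states),
            log_table(trans, states, states),
            log_table(emit, states, CLASSES))

def viterbi(tokens, model):
    """Stage 2: return the most likely metadata label for each token."""
    states, log_start, log_trans, log_emit = model
    obs = [cluster(t) for t in tokens]
    best = {s: log_start["<s>", s] + log_emit[s, obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p = max(states, key=lambda q: prev[q] + log_trans[q, s])
            best[s] = prev[p] + log_trans[p, s] + log_emit[s, o]
            ptr[s] = p
        backpointers.append(ptr)
    path = [max(states, key=best.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy training data (hypothetical): one hand-labelled reference string.
train_data = [[("S.", "author"), ("Lawrence.", "author"), ("Digital", "title"),
               ("libraries.", "title"), ("1999.", "date")]]
model = train(train_data)
tokens = "S. Lawrence. Digital libraries. 1999.".split()
path = viterbi(tokens, model)
for token, state in zip(tokens, path):
    print(f"{state:>6}  {token}")
```

Decoding over token classes rather than raw words keeps the emission table small, so even a few labelled references give every state a nonzero probability for every class. The third stage would write each labelled segment to a database table; that step is omitted here.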
dc.description.provenance: Made available in DSpace on 2021-06-13T17:28:25Z (GMT). No. of bitstreams: 1
ntu-93-R91548051-1.pdf: 410848 bytes, checksum: ed4c0efed433d5ba958584169d08f622 (MD5)
Previous issue date: 2004 [en]
dc.description.tableofcontents:
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Related Systems
2.1.1 CiteSeer[1,2]
2.1.2 Cora[3]
2.1.3 MARS[4]
2.2 Difficulties of Metadata Extraction
2.3 Core Techniques
2.3.1 Rule-Based Method
2.3.2 Machine Learning Method
2.3.3 Word Clustering + Machine Learning Method
Chapter 3 System Architecture and Methods
3.1 System Architecture
3.2 Hidden Markov Model
3.3 Viterbi Decoding
3.4 Internal Structure State
3.5 Features and Word Clustering
3.5.1 Reference
3.5.2 Header
3.6 Training Phase
3.7 Laplace Smoothing
3.8 Modified Viterbi Decoding
3.9 Database
Chapter 4 Experimental Data and Discussion
4.1 Experimental Data
4.2 Experimental Results
Chapter 5 Conclusions and Future Work
References
dc.language.iso: zh-TW
dc.subject: hidden markov model [zh_TW]
dc.subject: metadata [zh_TW]
dc.subject: word clustering [zh_TW]
dc.title: 從文獻中擷取Metadata [zh_TW]
dc.title: Extract Metadata From Literature [en]
dc.type: Thesis
dc.date.schoolyear: 93-1
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 蔣以仁, 陳中明
dc.subject.keyword: metadata, word clustering, hidden markov model [zh_TW]
dc.relation.page: 44
dc.rights.note: Paid license
dc.date.accepted: 2004-10-14
dc.contributor.author-college: College of Engineering [zh_TW]
dc.contributor.author-dept: Institute of Biomedical Engineering [zh_TW]
Appears in collections: Institute of Biomedical Engineering

Files in this item:
File: ntu-93-1.pdf (not authorized for public access)
Size: 401.22 kB
Format: Adobe PDF


Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.
