Free-DOM: A Method for Extracting and Structuring Important Information in Loosely Structured Documents

Wen-Ting Wang; 王文廷

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/32877

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	項潔(Jieh Hsiang)
dc.contributor.author	Wen-Ting Wang	en
dc.contributor.author	王文廷	zh_TW
dc.date.accessioned	2021-06-13T04:17:54Z	-
dc.date.available	2007-07-28
dc.date.copyright	2006-07-28
dc.date.issued	2006
dc.date.submitted	2006-07-24
dc.identifier.citation	[1] M. K. Bergman. “The Deep Web: Surfacing Hidden Value”, The Journal of Electronic Publishing, 7(1), 2001. Available at: http://www.press.umich.edu/jep/07-01/bergman.html (accessed 13 April 2006). [2] J. Friedl. “Mastering Regular Expressions”, O'Reilly, ISBN 0-596-00289-0, 1997. [3] K. Larson, M. Czerwinski. “Web Page Design: Implications of Memory, Structure and Scent for Information Retrieval”, Proceedings of CHI'98 Human Factors in Computing Systems, pp. 25-32, 1998. [4] A. Sahuguet, F. Azavant. “WysiWyg Web Wrapper Factory (W4F)”, Proceedings of WWW Conference, 1999. Available at: http://db.cis.upenn.edu/W4F/ (accessed 8 April 2006). [5] A. Sahuguet, F. Azavant. “Building Intelligent Web Applications Using Lightweight Wrappers”, Data Knowl. Eng., 36(3), pp. 283-316, 2001. [6] K. Thompson. “Regular expression search algorithm”, CACM, 11(6), pp. 419-422, 1968. [7] R. Wilensky, Y. Arens. “PHRAN: A Knowledge-Based Natural Language Understander”, Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics, pp. 117-121, 1980. [8] Yuh-Pyng Shieh, Wen-Ting Wang, Jieh Hsiang. “Free-DOM: A Mechanism for Structuring Important Information in Loosely Structured Documents” (in Chinese), Ming Chuan University 2006 International Academic Conference, pp. 33-34, 2006. [9] Document Object Model (DOM): http://www.w3.org/DOM/ [10] Dynamic HTML (DHTML): http://www.w3schools.com/dhtml/default.asp [11] Extensible Markup Language (XML): http://www.w3.org/XML/ [12] Google: http://www.google.com [13] HTML-DOM: http://www.w3schools.com/htmldom/default.asp [14] Hyper Text Markup Language (HTML): http://www.w3.org/MarkUp/ [15] Object oriented: http://java.sun.com/docs/books/tutorial/java/concepts/object.html [16] Perl: http://www.perl.org/ [17] Really Simple Syndication (RSS): http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html [18] TIDY: http://tidy.sourceforge.net/ [19] XML-DOM: http://www.w3schools.com/dom/default.asp [20] XSL Transformations (XSLT): http://www.w3.org/TR/xslt [21] Weather information (Yahoo): http://tw.weather.yahoo.com/ [22] National Taiwan University Library: http://www.ntu.edu.tw [23] Stock market (Yahoo): http://tw.stock.yahoo.com [24] Stock market (Yam): http://stock.yam.com [25] English-English dictionary (Princeton): http://wordnet.princeton.edu/perl/webwn [26] English-Chinese dictionary (Yahoo): http://tw.dictionary.yahoo.com/ [27] National Repository of Cultural Heritage: http://nrch.cca.gov.tw/ccahome/ [28] News site (China Times): http://www.chinatimes.com.tw [29] Online bookstore (Kingstone): http://www.kingstone.com.tw/ [30] Online bookstore (Books.com.tw): http://www.books.com.tw/ [31] Online bookstore (Eslite): http://www.eslite.com.tw/ [32] Motor vehicle and driver online services (MVDIS): https://www.mvdis.gov.tw/wps/portal [33] Exchange rate page (Cathay United Bank): https://www.cathaybk.com.tw/cathaybk/personal_info07.asp [34] System URL: http://freedom.csie.org/ or http://freedom.arping.idv.tw
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/32877	-
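dc.description.abstract	The vast majority of data on the World Wide Web (WWW) is presented as HTML (HyperText Markup Language) documents, and value-added applications of web data must build on this vast document collection. Because an HTML document interleaves content with layout and presentation markup and carries no description of semantic structure, the clues to its important information are not found in the tags. HTML documents are therefore loosely structured, both semantically and structurally, which makes data extraction and data manipulation in loosely structured documents especially important problems. From observation of deep-web pages, one can assume that articles on the same website share a similar layout style and that the important information within those articles is also laid out consistently. Free-DOM mainly targets documents of this kind. For data extraction from loosely structured documents, regular expressions provide a rich and precise extraction mechanism. For data manipulation, the Document Object Model (DOM) provides an important mechanism for handling structured documents. Free-DOM uses regular expressions to extract the important data from loosely structured documents (free text) and then applies the DOM concept to structure the extracted data. To support value-added applications of web data, this thesis designs Free-DOM to extract and structure the important information in loosely structured documents, so that it can be manipulated from a programming language or output directly as a structured document in XML (Extensible Markup Language) format for subsequent manipulation through the DOM.	zh_TW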
dc.description.abstract	Most documents available over the World Wide Web are written in or transformed into HTML. However, HTML is a loosely structured language that mixes presentational style with content. It is therefore important to design ways to extract data from HTML documents. In this thesis we propose a method for this purpose, Free-DOM (a Free-text Document Object Model). Free-DOM is aimed at extracting data from HTML documents that share a similar presentational format. It uses regular expressions to capture the structure of the format to be extracted, and the concept of the DOM (Document Object Model) to manipulate the extracted data. Free-DOM thus provides an extraction-and-manipulation language for free-text documents. It can be used from programming languages (such as C++) as a library for pre-processing and manipulating documents, and it also works as a server-side scripting language for building value-added applications over the World Wide Web. We show the effectiveness of our method through several examples.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T04:17:54Z (GMT). No. of bitstreams: 1 ntu-95-R93922073-1.pdf: 3534941 bytes, checksum: 8cbdf674302c9715390f1845154c9d31 (MD5) Previous issue date: 2006	en
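dc.description.tableofcontents	Chapter 1 Introduction 7; 1.1 Motivation 7; 1.2 Background 9; 1.3 Objectives 9; 1.4 Thesis Organization 10; Chapter 2 Related Technologies 11; 2.1 Document Object Model 11; 2.2 Tag Path Locating 12; 2.3 Object Query Language 12; 2.4 Introduction to Regular Expressions 13; 2.5 Summary and Comparison of Related Technologies 14; Chapter 3 Free-DOM Design 17; 3.1 How Free-DOM Works 17; 3.2 Origins of Regular Expressions and Syntax Overview 19; 3.3 Writing Free-DOM Configuration Files 27; Chapter 4 Implementation and Applications 41; 4.1 Meta-Search 41; 4.2 Personal Information Gathering 45; 4.3 Batch Retrieval of Large Amounts of Data 46; Chapter 5 Conclusions and Future Work 47; 5.1 Discussion and Conclusions 47; 5.2 Future Work 49; References 50; Appendix: User Manual 51; A. Free-DOM Web Server Installation and Usage 51; B. Free-DOM Library Usage 55; C. Free-DOM Plug-in Installation and Usage 56; D. Free-DOM Configuration File Directive Reference 57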
dc.language.iso	zh-TW
dc.subject	XML	zh_TW
dc.subject	Regular expression (正規表達式)	zh_TW
dc.subject	DOM	zh_TW
dc.subject	Data extraction (資料萃取)	zh_TW
dc.subject	Document object model	en
dc.subject	XML	en
dc.subject	Regular expression	en
dc.subject	Data extraction	en
dc.subject	DOM	en
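dc.title	Free-DOM: A Method for Extracting and Structuring Important Information in Loosely Structured Documents (Free-DOM：萃取鬆散文件中的重要資訊並結構化之方法)	zh_TW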
dc.title	Free-DOM: A Free-text Document Object Model	en
dc.type	Thesis
dc.date.schoolyear	94-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	陳光華(Kuang-Hua Chen),謝育平(Yuh-Pyng Shieh)
dc.subject.keyword	DOM,XML,Data extraction (資料萃取),Regular expression (正規表達式)	zh_TW
dc.subject.keyword	DOM,Document object model,Data extraction,Regular expression,XML	en
dc.relation.page	63
dc.rights.note	Paid license
dc.date.accepted	2006-07-25
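dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW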
Appears in Collections:	Department of Computer Science and Information Engineering
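
The abstract describes a two-step idea: a regular expression captures the fields of interest in similarly formatted HTML, and the captured values are then arranged as a tree that can be manipulated or serialized as XML. The following is only a minimal Python sketch of that general idea, not Free-DOM's own configuration language or library interface, neither of which appears in this record. The sample HTML, the pattern, the element names, and the function `extract_to_tree` are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: two entries that share one presentational layout.
HTML = """
<tr><td><b>Book A</b></td><td>Author A</td></tr>
<tr><td><b>Book B</b></td><td>Author B</td></tr>
"""

# Hypothetical pattern: each named group marks one field to extract.
RECORD = re.compile(
    r"<tr><td><b>(?P<title>[^<]+)</b></td><td>(?P<author>[^<]+)</td></tr>"
)


def extract_to_tree(html: str) -> ET.Element:
    """Turn every regex match into a <book> node with one child per field."""
    root = ET.Element("books")
    for match in RECORD.finditer(html):
        book = ET.SubElement(root, "book")
        for name, value in match.groupdict().items():
            ET.SubElement(book, name).text = value.strip()
    return root


tree = extract_to_tree(HTML)

# The extracted data can now be navigated like any other element tree...
titles = [book.findtext("title") for book in tree.findall("book")]

# ...or serialized as XML for downstream tools.
xml_text = ET.tostring(tree, encoding="unicode")
print(titles)
print(xml_text)
```

Naming the capture groups keeps the pattern and the resulting structure in one place: adding a field to the pattern adds a child element without any other code change.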

Files in This Item:

File	Size	Format
ntu-95-1.pdf (not authorized for public access)	3.45 MB	Adobe PDF

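Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.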