Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38112
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 曹承礎(Seng-Cho Chou) | |
dc.contributor.author | Shu-Fu Wu | en |
dc.contributor.author | 吳書福 | zh_TW |
dc.date.accessioned | 2021-06-13T16:26:33Z | - |
dc.date.available | 2006-07-20 | |
dc.date.copyright | 2005-07-20 | |
dc.date.issued | 2005 | |
dc.date.submitted | 2005-07-15 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38112 | - |
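dc.description.abstract | The development of information technology has placed us in an environment of information overload. Traditional keyword search can no longer meet our needs, and we have begun to look for tools that support queries along multiple dimensions. Data warehouse systems can store and analyze numerical data but cannot handle document-type files. This study therefore explores building a new document warehouse system in the hope of solving both problems.
In this study, we introduce a method for automatically extracting metadata and show how to build a complete document warehouse system. We predefine fifteen kinds of metadata as fifteen classes and use the support vector machine (SVM) classification algorithm: based on the classification rules obtained by training the SVM, we determine the class of each sentence in an incoming document and then convert the document into a tagged XML format. Next, we apply a multidimensional star schema to build the document warehouse system, using the metadata to assist the process of loading documents into the warehouse. Finally, we combine online analytical processing (OLAP) with programs we wrote ourselves to provide the tools needed to analyze the multidimensional document warehouse. The experimental results show that the SVM algorithm yields highly accurate classification rules. Applying these rules helps us analyze document content and fully identify the metadata in a document. The prototype system we built also demonstrates the workflow and components of a document warehouse system, providing a basis and reference for system construction. The back-end OLAP and multidimensional query tools let us find and analyze documents from multiple angles and uncover the information hidden in them. | zh_TW |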
dc.description.abstract | The development and growth of information technologies have led to a situation known as “information overload”. We have therefore begun to look for new tools that let us query from multidimensional perspectives rather than rely on traditional keyword-based search engines. Data warehouse systems can store and analyze numerical data but lack the ability to deal with document collections. To solve both problems, we build a new document warehouse system.
In this paper, we describe an automatic metadata extraction method and build a document warehouse system. We define 15 kinds of metadata as 15 classes. Using support vector machines (SVMs), we train 15 classifiers to extract metadata from a new document. Each sentence in the document is saved in XML format together with its corresponding metadata. Next, we use a star schema to build a multidimensional document warehouse system. Metadata supports the process of loading documents into the document warehouse. We also provide client-side tools such as OLAP, a cube browser, and an MDX query interface. Our experiments show that SVMs can achieve high classification performance and that most metadata can be extracted from a document with the SVM classifiers. The prototype system built in this paper also demonstrates the fundamental components and processes of a document warehouse system. The OLAP and multidimensional query tools let users search and analyze documents from multiple points of view. | en |
dc.description.provenance | Made available in DSpace on 2021-06-13T16:26:33Z (GMT). No. of bitstreams: 1 ntu-94-R92725009-1.pdf: 954312 bytes, checksum: 3c4223771fb732bd71536dfd7f43a1e1 (MD5) Previous issue date: 2005 | en |
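dc.description.tableofcontents | Acknowledgements i; Abstract ii
Chapter 1 Introduction 1; Section 1 Research Motivation and Background 1; Section 2 Research Objectives 3; Section 3 Thesis Organization 6
Chapter 2 Literature Review 7; Section 1 Text Classification Techniques 7; 2-1-1 What Is Text Classification 7; 2-1-2 Class Label Assignment 8; 2-1-3 Document Representation 9; 2-1-4 Feature Selection 13; 2-1-5 Term Weighting 14; Section 2 Support Vector Machine Techniques 16; 2-2-1 Support Vector Machines and Structural Risk Minimization 16; 2-2-2 Linear Hard-Margin SVMs 17; 2-2-3 Soft-Margin SVMs 19; 2-2-4 Non-Linear SVMs 20; 2-2-5 Asymmetric Misclassification Cost 22; Section 3 Data Warehouse Techniques 22; 2-3-1 What Is a Data Warehouse 22; 2-3-2 Building a Data Warehouse 24; 2-3-3 Concept Hierarchy 28; 2-3-4 Data Cube 30; 2-3-5 Online Analytical Processing (OLAP) Tools 31
Chapter 3 System Model Construction 33; Section 1 Definitions of Terms 33; 3-1-1 Metadata 33; 3-1-2 Document Warehouse System 34; 3-1-3 Base Cube vs. Virtual Cube 35; Section 2 System Assumptions 36; 3-2-1 Use of the Plain Text Document Format 36; 3-2-2 Use of an SVM Text Classification Algorithm 37; 3-2-3 Information Dimension 37; Section 3 System Architecture 38; 3-3-1 System Overview 38; 3-3-2 System Architecture and Components 41; 3-3-3 System Workflow 44
Chapter 4 Prototype System Architecture and Implementation 48; Section 1 Scenario Description 48; Section 2 Development Tools 50; Section 3 Prototype System Architecture 50; Section 4 Paper Processing and Metadata Extraction 51; Section 5 Document Format and Document Warehouse Design 57; 4-5-1 Document Format 58; 4-5-2 Document Warehouse Design 59; 4-5-3 Concept Hierarchy Design 64; 4-5-4 Paper Loading 66; Section 6 Paper Analysis and Presentation 67; 4-6-1 Building Dimensions and Data Cubes 67; 4-6-2 Client-Side User Interface 70; Section 7 Analysis and Discussion 75
Chapter 5 Conclusions and Future Work 76; Section 1 Conclusions 76; Section 2 Future Work 77
References 79 | |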
dc.language.iso | zh-TW | |
dc.title | 應用文件分類技術於多維度文件倉儲系統 | zh_TW |
dc.title | Applying Text Classification Techniques in Multidimensional Document Warehouse System | en |
dc.type | Thesis | |
dc.date.schoolyear | 93-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 李瑞庭(Anthony J. T. Lee),陳炳宇(Robin Bing-Yu Chen) | |
dc.subject.keyword | Support Vector Machine, Metadata Extraction, Document Warehouse, Online Analytical Processing | zh_TW |
dc.subject.keyword | Support Vector Machine, Metadata Extraction, Document Warehouse System, Online Analytical Processing | en |
dc.relation.page | 81 | |
dc.rights.note | Paid license | |
dc.date.accepted | 2005-07-15 | |
dc.contributor.author-college | College of Management | zh_TW |
dc.contributor.author-dept | Graduate Institute of Information Management | zh_TW |
Appears in Collections: | Department of Information Management |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-94-1.pdf (Restricted Access) | 931.95 kB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.