Applying Text Classification Techniques in Multidimensional Document Warehouse System

Shu-Fu Wu; 吳書福

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38112

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	曹承礎(Seng-Cho Chou)
dc.contributor.author	Shu-Fu Wu	en
dc.contributor.author	吳書福	zh_TW
dc.date.accessioned	2021-06-13T16:26:33Z	-
dc.date.available	2006-07-20
dc.date.copyright	2005-07-20
dc.date.issued	2005
dc.date.submitted	2005-07-15
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38112	-
dc.description.abstract	資訊科技的發展，讓我們處於資訊過載的環境，傳統的關鍵字搜尋已經無法滿足我們的需求，我們開始尋找可以從多維度做查詢的工具。資料倉儲的系統提供儲存、分析數字的能力，卻無法處理文件類型的檔案，因此本研究探討建立全新的文件倉儲系統，希望能解決上述的兩個問題。在本研究中，我們介紹自動化擷取元資料的方法以及如何建立完整的文件倉儲系統。我們預先定義十五種元資料為十五種類別，使用支撐向量法的分類演算法，根據訓練支撐向量法所得的分類規則，從新進的文件裡找出每一個句子所屬的類別，再將文件轉換為標記的XML格式。接著，我們應用多維度的星狀架構建立文件倉儲系統，以元資料輔助文件載入文件倉儲的流程。最後搭配線上分析處理與自行撰寫的程式，提供分析多維度文件倉儲所需的工具。實驗的結果，證明利用支撐向量法的演算法，可以得到高度準確性的分類規則。應用這一些分類規則，可以協助我們分析文件的內容，完整找出文件裡的元資料。同時，我們建立的雛形系統，展示文件倉儲系統的運作流程與組成元素，提供系統建構的基礎與參考。後端的線上分析處理與多維度的查詢工具，讓我們可以從多個角度，尋找文件、分析文件，挖掘隱藏在文件裡的資訊。	zh_TW
dc.description.abstract	The development and growth of information technologies have created a situation known as “information overload”. Therefore, we look for new tools that let us query from multidimensional perspectives rather than rely on traditional keyword-based search engines. Data warehouse systems can store and analyze numerical data but cannot handle document collections. To address both problems, we build a new document warehouse system. In this thesis, we describe an automatic metadata extraction method and the construction of a document warehouse system. We define 15 kinds of metadata as 15 classes and use support vector machines (SVMs) to create 15 classifiers that extract metadata from a new document. The sentences in the document, tagged with their corresponding metadata, are saved in XML format. Next, we use a star schema to build a multidimensional document warehouse system, with the metadata supporting the process of loading documents into the warehouse. We also provide client-side tools such as OLAP, a cube browser, and an MDX query interface. Our experiments show that SVMs achieve high classification performance and that the SVM classifiers can extract most metadata from a document. The prototype system built in this thesis also demonstrates the fundamental components and processes of a document warehouse system. The OLAP and multidimensional query tools let users search and analyze documents from multiple perspectives.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T16:26:33Z (GMT). No. of bitstreams: 1 ntu-94-R92725009-1.pdf: 954312 bytes, checksum: 3c4223771fb732bd71536dfd7f43a1e1 (MD5) Previous issue date: 2005	en
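dc.description.tableofcontents	Acknowledgments; Abstract; Chapter 1 Introduction: Section 1 Research Motivation and Background; Section 2 Research Objectives; Section 3 Thesis Organization; Chapter 2 Literature Review: Section 1 Text Classification Techniques (2-1-1 What Is Text Classification; 2-1-2 Class Label Assignment; 2-1-3 Document Representation; 2-1-4 Feature Selection; 2-1-5 Term Weighting); Section 2 Support Vector Machine Techniques (2-2-1 Support Vector Machines and Structural Risk Minimization; 2-2-2 Linear Hard-Margin SVMs; 2-2-3 Soft-Margin SVMs; 2-2-4 Non-Linear SVMs; 2-2-5 Asymmetric Misclassification Cost); Section 3 Data Warehouse Techniques (2-3-1 What Is a Data Warehouse; 2-3-2 Building a Data Warehouse; 2-3-3 Concept Hierarchy; 2-3-4 Data Cube; 2-3-5 Online Analytical Processing (OLAP) Tools); Chapter 3 System Model Construction: Section 1 Definitions of Terms (3-1-1 Metadata; 3-1-2 Document Warehouse System; 3-1-3 Base Cube vs. Virtual Cube); Section 2 System Assumptions (3-2-1 Plain Text Document Format; 3-2-2 SVM-Based Text Classification Algorithm; 3-2-3 Information Dimension); Section 3 System Architecture (3-3-1 System Overview; 3-3-2 System Architecture and Components; 3-3-3 System Workflow); Chapter 4 Prototype System Architecture and Implementation: Section 1 Scenario Description; Section 2 Development Tools; Section 3 Prototype System Architecture; Section 4 Paper Processing and Metadata Extraction; Section 5 Document Format and Document Warehouse Design (4-5-1 Document Format; 4-5-2 Document Warehouse Design; 4-5-3 Concept Hierarchy Design; 4-5-4 Paper Loading Method); Section 6 Paper Analysis and Presentation (4-6-1 Building Dimensions and Data Cubes; 4-6-2 Client-Side User Interface); Section 7 Analysis and Discussion; Chapter 5 Conclusions and Future Work: Section 1 Conclusions; Section 2 Future Work; References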
dc.language.iso	zh-TW
dc.subject	元資料擷取	zh_TW
dc.subject	文件倉儲	zh_TW
dc.subject	線上分析處理	zh_TW
dc.subject	支撐向量法	zh_TW
dc.subject	Document Warehouse System	en
dc.subject	Online Analytical Processing	en
dc.subject	Metadata Extraction	en
dc.subject	Support Vector Machine	en
dc.title	應用文件分類技術於多維度文件倉儲系統	zh_TW
dc.title	Applying Text Classification Techniques in Multidimensional Document Warehouse System	en
dc.type	Thesis
dc.date.schoolyear	93-2
dc.description.degree	Master
dc.contributor.oralexamcommittee	李瑞庭(Anthony J. T. Lee),陳炳宇(Robin Bing-Yu Chen)
dc.subject.keyword	支撐向量法,元資料擷取,文件倉儲,線上分析處理	zh_TW
dc.subject.keyword	Support Vector Machine,Metadata Extraction,Document Warehouse System,Online Analytical Processing	en
dc.relation.page	81
dc.rights.note	Paid authorization
dc.date.accepted	2005-07-15
dc.contributor.author-college	管理學院	zh_TW
dc.contributor.author-dept	資訊管理學研究所	zh_TW
Appears in Collections:	Department of Information Management

Files in This Item:

File	Size	Format
ntu-94-1.pdf (not authorized for public access)	931.95 kB	Adobe PDF


Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.