Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/24635

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳玲玲(Ling-Ling Wu) | |
| dc.contributor.author | Feng-Yueh Lu | en |
| dc.contributor.author | 陸鳳玥 | zh_TW |
| dc.date.accessioned | 2021-06-08T05:34:19Z | - |
| dc.date.copyright | 2005-02-04 | |
| dc.date.issued | 2005 | |
| dc.date.submitted | 2005-01-31 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/24635 | - |
| dc.description.abstract | 企業知識庫每天要處理數以萬計之文件資料,無論是外部競爭者網頁、產業分析、客戶需求;或是內部財務報表、技術文件、專利文件,這些皆為企業經營之必要資訊。這樣龐大的資料量無論是在收集、過濾、以及分門別類歸檔皆十分耗費時間與人力資源。企業由此產生對於文件自動分類之需求。如何利用自動化技術,快速有效協助人工分類,以應付大量待分類文件之需求,儼然已成為現今資訊服務與知識管理之重要課題。
企業知識庫之類別架構是否合乎企業需求,收錄之訓練文章是否具代表性,文件分類標準是否一致,這些都足以影響文件分類成果。此外,如何選擇關鍵詞彙使得文件分類工作處理更有效率,分類系統要對待分類之文件有什麼程度之瞭解,如何在速度與正確性之間取得平衡點,這些因素都需要納入文件自動分類系統建置考量。 本研究以漢語平衡語料庫為實驗對象,實作一個文件自動分類系統。並比較機器學習方法與非機器學習方法之分類成效。另外,評估分類系統對於待處理文件之認識程度不同,會對分類成效產生之影響。同時,並應用評估語料庫相似度之統計方法於文件類別上,作為預先定義類別或是收錄文章是否適當之初步評估。 | zh_TW |
| dc.description.abstract | Corporate knowledge bases must process tens of thousands of text documents every day. These include external material such as competitors’ web pages, industry analyses, and customer requirements, and internal material such as financial statements, technical documents, and patent documents, all of which are essential to business operations. Collecting, filtering, and filing this volume of data consumes considerable time and labor, which creates a demand for automatic text classification. How to use automated techniques to assist manual classification quickly and effectively, and to keep up with the large number of documents awaiting classification, has become an important issue in information services and knowledge management.
The appropriateness of the knowledge base's category hierarchy, the representativeness of the training texts in each class, and the consistency of the classification criteria all affect classification performance. In addition, how key terms are selected, how much the system needs to know about the texts to be classified, and how to balance speed against accuracy should all be considered when building an automatic text classification system. In this research, an automatic text classification system is implemented using texts from the Sinica Corpus. Machine learning methods are compared with non-machine learning methods, and the effect of varying how much the system knows about the texts is measured. Furthermore, statistical methods for measuring corpus similarity and homogeneity are applied to the classes as a preliminary check on whether the predefined classes, or the texts assigned to them, are appropriate. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-08T05:34:19Z (GMT). No. of bitstreams: 1 ntu-94-R91725013-1.pdf: 307837 bytes, checksum: 3e8c6aa4b0820be63aac4b47f4d1c482 (MD5) Previous issue date: 2005 | en |
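| dc.description.tableofcontents | Chapter 1 Introduction: Section 1 Research Background and Motivation; Section 2 Research Objectives; Section 3 Thesis Organization
Chapter 2 Literature Review: Section 1 Definition of Automatic Text Classification; Section 2 Overview of Automatic Text Classification Tasks (2-2-1 Automatic Text Classification and Automatic Text Clustering; 2-2-2 Applications of Text Classification); Section 3 Word Segmentation (2-3-1 Rule-Based Methods; 2-3-2 Statistical Methods); Section 4 Vector Space Model (2-4-1 Document Representation; 2-4-2 Term Weighting; 2-4-3 Document Similarity Computation); Section 5 Dimensionality Reduction (2-5-1 Improved Term Significance Formula; 2-5-2 Chi-Square Test; 2-5-3 Likelihood Ratio Test; 2-5-4 Mutual Information); Section 6 Machine Learning Algorithms (2-6-1 Naïve Bayes; 2-6-2 K-Nearest Neighbors; 2-6-3 Decision Trees)
Chapter 3 Research Method: Section 1 Research Framework; Section 2 Experimental Corpus; Section 3 Selection of Classification Algorithms; Section 4 Construction of the Classification System (3-4-1 Corpus Preprocessing and Partitioning; 3-4-2 Feature Selection Methods; 3-4-2-1 Longest-Word-First Segmentation Principle; 3-4-2-2 PAT-tree; 3-4-3 Building the Classification Model); Section 5 Experimental Evaluation (3-5-1 Measuring Class Homogeneity and Similarity)
Chapter 4 Experimental Results: Section 1 Experimental Procedure; Section 2 Experimental Data; Section 3 Analysis of Experimental Results
Chapter 5 Conclusions and Suggestions: Section 1 Research Conclusions; Section 2 Suggestions for Future Research
References
Appendix 1 Academia Sinica Part-of-Speech Tag Reference Table
Appendix 2 Experimental Data | |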
| dc.language.iso | zh-TW | |
| dc.subject | 語料庫同質性 | zh_TW |
| dc.subject | 語料庫相似性 | zh_TW |
| dc.subject | 文件自動分類 | zh_TW |
| dc.subject | 機器學習 | zh_TW |
| dc.subject | automatic text classification | en |
| dc.subject | corpus homogeneity | en |
| dc.subject | corpus similarity | en |
| dc.subject | machine learning | en |
| dc.title | 文件自動分類技術與成效評估之探討 | zh_TW |
| dc.title | A Study on the Techniques and the Evaluation of Automatic Text Classification | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 93-1 | |
| dc.description.degree | Master's | |
| dc.contributor.coadvisor | 高照明(Zhao-Ming Gao) | |
| dc.contributor.oralexamcommittee | 陳柏琳(Berlin Chen),劉昭麟(Chao-Lin Liu) | |
| dc.subject.keyword | 文件自動分類,機器學習,語料庫相似性,語料庫同質性 | zh_TW |
| dc.subject.keyword | automatic text classification, machine learning, corpus similarity, corpus homogeneity | en |
| dc.relation.page | 62 | |
| dc.rights.note | Not authorized | |
| dc.date.accepted | 2005-02-01 | |
| dc.contributor.author-college | 管理學院 | zh_TW |
| dc.contributor.author-dept | 資訊管理學研究所 | zh_TW |
| Appears in Collections: | Department of Information Management | |
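
The abstract above mentions applying corpus similarity and homogeneity statistics to document classes as a preliminary check on the class definitions. The record does not say which statistic the thesis uses, so the following is only a minimal sketch of one common choice, a chi-square statistic over word frequencies. The function name `chi_square_similarity` and the `top_n` parameter are hypothetical, and the inputs are assumed to be lists of already-segmented tokens.

```python
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the two
    token lists combined. Lower values mean closer word distributions.

    Illustrative assumption only: the thesis may use a different measure.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must contain at least one token")
    total = size_a + size_b

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    joint = counts_a + counts_b

    statistic = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both corpora drew words from one distribution.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        statistic += (counts_a[word] - expected_a) ** 2 / expected_a
        statistic += (counts_b[word] - expected_b) ** 2 / expected_b
    return statistic


if __name__ == "__main__":
    class_x = "stock market price stock bank rate".split()
    class_y = "match team score team goal match".split()
    # Different vocabularies give a larger statistic than identical ones.
    print(chi_square_similarity(class_x, class_x))
    print(chi_square_similarity(class_x, class_y))
```

Under the same assumptions, a rough homogeneity check for a single class would split its documents into two random halves and compare the halves with the same function; a class whose halves differ as much as two separate classes do may be poorly defined.
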
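Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-94-1.pdf (not authorized for public access) | 300.62 kB | Adobe PDF | |
Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.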
