Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/34349

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 翁崇雄 | |
| dc.contributor.author | Jih-Jeng Huang | en |
| dc.contributor.author | 黃日鉦 | zh_TW |
| dc.date.accessioned | 2021-06-13T06:04:16Z | - |
| dc.date.available | 2006-06-26 | |
| dc.date.copyright | 2006-06-26 | |
| dc.date.issued | 2006 | |
| dc.date.submitted | 2006-06-17 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/34349 | - |
| dc.description.abstract | 隨著儲存技術、資料庫及資料倉儲的快速發展,企業廣泛的運用此些科技來掘取有用的資訊。為了有效掘取隱藏在資料庫/資料倉儲中的有用知識,在整個知識發現(knowledge discovery in databases, KDD)的流程中,資料挖掘(data mining)技術扮演著重要的角色。資料挖掘為知識發現流程中之核心功能且可視為從一大量資料中,掘取有效、重要、有趣資訊及知識的一連串反覆及互動之流程。而資料挖掘的任務主要可分為分類分析、迴歸分析、誤差偵測分析、集群分析、關聯規則分析及序列分析等。本論文主要的目的為探討及修正資料挖掘中的分類技術。
有鑑於傳統分類模型的缺點,本論文提出三個主要的修正模型。此三個模型主要是根據基因規劃(GP)的方法來結合函數基礎(function-based)及歸納基礎(induction-based)等分類技術之優點而成。第一個模型為單純利用基因規劃法來進行分類問題。第二個提出之模型為IF-THEN規則的基因規劃法(IF-THEN GP),此法乃根據“分開並各個擊破”的原則所成。再者,為了有效結合函數基礎和歸納基礎法之優點,本論文提出兩階段基因規劃法(2SGP)用於分類問題上。 此外,利用兩個信用計分之標桿(benchmark)資料庫來測試本論文所提出之三個模型的正確率,並與傳統分類方法進行正確率之比較。根據實證之結果顯示,本論文所提出之三個分類模型均明顯優於傳統分類模型。因此,本論文可嘗試作為傳統相關學術研究的基礎與參考,甚至可協助實務界進行更精確的資料挖掘,以期能提昇學術界與實務界之研究與決策品質。 | zh_TW |
| dc.description.abstract | With the rapid development of storage technology, databases and data warehouses are widely employed by enterprises to extract useful information for supply chain management (SCM), enterprise resource planning (ERP), and customer relationship management (CRM). In order to effectively extract the useful knowledge hidden in databases and data warehouses, data mining technology is highlighted in the process of knowledge discovery in databases (KDD).
Data mining can be considered the core of KDD: an iterative and interactive process of extracting valid, nontrivial, and interesting information and knowledge from large amounts of data. The tasks of data mining can be divided into classification, regression, deviation detection, clustering, association rules, and sequential patterns. In this dissertation, the problem of data classification is highlighted. Three models are developed to address the problems of conventional classification models; all three incorporate the advantages of the discriminant-based and the induction-based methods and build on genetic programming (GP). The first model employs GP directly to build a classification model. We employ GP because it can automatically and heuristically determine the adequate discriminant functions and the valid attributes simultaneously; in addition, unlike artificial neural networks (ANNs), which are suited only for large data sets, GP can perform well even on small data sets. The second model, called IF-THEN rule genetic programming (IF-THEN GP), is based on the principle of “divide and conquer”: a threshold is set so that the indiscernible records are retrained with GP to form a second discriminant function, and further discriminant functions are obtained in the same manner. In order to combine the advantages of the discriminant-based and the induction-based methods, the third model we propose is two-stage genetic programming (2SGP), which integrates the function-based and the induction-based methods into a hybrid model. First, the IF-THEN rules are derived using GP; next, the reduced data are fed into GP again to form the discriminant function, which provides the capability of forecasting.
In addition, we used two credit-scoring data sets to test the effectiveness of the proposed models and to compare them with conventional methods, including the multi-layer perceptron (MLP), classification and regression trees (CART), C4.5, rough sets, and logistic regression (LR). On the basis of the numerical results, we conclude that the proposed methods outperform the other models and are more suitable for real-life classification problems. | en |
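The abstract's first model (a discriminant function evolved by GP, with the class assigned by the sign of the function) can be sketched as follows. This is a minimal illustration only, not the dissertation's implementation: the two-attribute synthetic data, the operator set, the tree depth, and the elitist mutation-only loop are all simplifying assumptions made here for brevity.

```python
import random

random.seed(42)

# Hypothetical toy data (an assumption for illustration): two attributes,
# class 1 exactly when x0 - x1 > 0.2.  The thesis uses real credit data.
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x0 - x1 > 0.2 else 0 for x0, x1 in data]

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}
TERMINALS = ['x0', 'x1', 'const']

def random_tree(depth=3):
    """Grow a random expression tree: a float, a variable name, or (op, l, r)."""
    if depth == 0 or random.random() < 0.3:
        choice = random.choice(TERMINALS)
        return random.uniform(-1, 1) if choice == 'const' else choice
    return (random.choice(list(OPS)),
            random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x0, x1):
    """Recursively evaluate an expression tree at one data point."""
    if isinstance(tree, float):
        return tree
    if tree == 'x0':
        return x0
    if tree == 'x1':
        return x1
    op, left, right = tree
    return OPS[op](evaluate(left, x0, x1), evaluate(right, x0, x1))

def fitness(tree):
    """Classification accuracy when predicting class 1 iff f(x) > 0."""
    hits = sum((evaluate(tree, x0, x1) > 0) == bool(y)
               for (x0, x1), y in zip(data, labels))
    return hits / len(data)

def mutate(tree):
    """Replace a random subtree with a freshly grown one."""
    if random.random() < 0.3 or not isinstance(tree, tuple):
        return random_tree(depth=2)
    return (tree[0], mutate(tree[1]), mutate(tree[2]))

# Evolutionary loop with a simple elitist strategy: keep the 20 best trees
# each generation and refill the population with mutants of the survivors.
population = [random_tree() for _ in range(60)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    population = population[:20] + [mutate(random.choice(population[:20]))
                                    for _ in range(40)]

best = max(population, key=fitness)
print('best accuracy:', round(fitness(best), 2))
```

A full GP would also use crossover between trees and penalize tree size; this sketch keeps only the parts needed to show how a discriminant function and its attributes are searched for at the same time.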
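The evaluation step — scoring a baseline such as logistic regression by its hit rate on a credit data set — can be illustrated with a stdlib-only sketch. The data here are a synthetic stand-in (the benchmark credit-scoring sets are not reproduced), and the batch gradient-descent logistic regression below is a plain textbook version, not the implementation compared in the dissertation; names such as `X`, `y`, and `accuracy` are ours.

```python
import math
import random

random.seed(7)

# Hypothetical stand-in for a credit-scoring set: two features in [0, 1],
# "good" (1) when a noisy linear score exceeds a cutoff (an assumption).
X = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(300)]
y = [1 if x0 + 0.5 * x1 + random.gauss(0, 0.1) > 0.75 else 0 for x0, x1 in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression fitted by batch gradient descent on the log-loss.
w = [0.0, 0.0]
b = 0.0
step = 0.5
n = len(X)
for _ in range(2000):
    gw, gb = [0.0, 0.0], 0.0
    for (x0, x1), t in zip(X, y):
        err = sigmoid(w[0] * x0 + w[1] * x1 + b) - t  # prediction error
        gw[0] += err * x0
        gw[1] += err * x1
        gb += err
    w[0] -= step * gw[0] / n
    w[1] -= step * gw[1] / n
    b -= step * gb / n

def accuracy():
    """Hit rate: predict class 1 when the fitted probability exceeds 0.5."""
    hits = sum((sigmoid(w[0] * x0 + w[1] * x1 + b) > 0.5) == bool(t)
               for (x0, x1), t in zip(X, y))
    return hits / len(X)

print('LR accuracy:', round(accuracy(), 2))
```

Each candidate model (MLP, CART, C4.5, rough sets, a GP classifier) would be scored by the same hit-rate measure on held-out data; the dissertation's comparison is of exactly this form, with real benchmark sets and train/test splits.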
| dc.description.provenance | Made available in DSpace on 2021-06-13T06:04:16Z (GMT). No. of bitstreams: 1 ntu-95-D91725010-1.pdf: 324979 bytes, checksum: 1aff6cbad3f25e643825ab7ecb7d9d6c (MD5) Previous issue date: 2006 | en |
| dc.description.tableofcontents | CONTENTS
Chapter 1 Introduction … 1
  1.1 Background … 1
  1.2 Motivation … 3
  1.3 Purpose … 5
  1.4 Organization … 6
Chapter 2 Overview of the Classification Models … 7
  2.1 Logistic regression … 7
  2.2 Artificial neural network … 10
  2.3 Rough sets … 13
  2.4 Statement of the classification problems … 16
    2.4.1 The problem of irrelevant variables … 16
    2.4.2 The problem of complexity … 19
    2.4.3 The problem of intelligible rules … 20
    2.4.4 Three genetic programming based models … 20
Chapter 3 Genetic Algorithms and Genetic Programming … 22
  3.1 Concepts of genetic algorithms … 22
    3.1.1 Genetic operators … 23
    3.1.2 Elitist strategy and stopping criterion … 24
  3.2 Genetic programming … 25
Chapter 4 IF-THEN Genetic Programming Model … 29
Chapter 5 Two-Stage Genetic Programming (2SGP) … 34
Chapter 6 Implementation: A Credit Scoring Model … 38
  6.1 Data sets … 38
  6.2 Numerical analysis … 39
  6.3 Summary … 44
Chapter 7 Discussions and Conclusions … 46
  7.1 Discussions … 46
  7.2 Conclusions … 48
  7.3 Further research … 48
Appendix … 51
References … 52 | |
| dc.language.iso | en | |
| dc.subject | 資料挖掘 | zh_TW |
| dc.subject | 分類模型 | zh_TW |
| dc.subject | 知識發現 | zh_TW |
| dc.subject | 基因規劃 | zh_TW |
| dc.subject | 信用計分 | zh_TW |
| dc.subject | logistic regression | en |
| dc.subject | Classification models | en |
| dc.subject | genetic programming | en |
| dc.subject | artificial neural networks (ANNs) | en |
| dc.subject | decision tree | en |
| dc.subject | rough sets | en |
| dc.title | 資料挖掘對分類問題之研究-基因規劃法之啓發 | zh_TW |
| dc.title | Data Mining for the Classification Problem-The Inspiration of Genetic Programming | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 94-2 | |
| dc.description.degree | 博士 | |
| dc.contributor.coadvisor | 曾國雄 | |
| dc.contributor.oralexamcommittee | 古永嘉,劉祥熹,陳靜枝 | |
| dc.subject.keyword | 知識發現,資料挖掘,分類模型,基因規劃,信用計分, | zh_TW |
| dc.subject.keyword | Classification models,genetic programming,artificial neural networks (ANNs),decision tree,rough sets,logistic regression, | en |
| dc.relation.page | 57 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2006-06-18 | |
| dc.contributor.author-college | 管理學院 | zh_TW |
| dc.contributor.author-dept | 資訊管理學研究所 | zh_TW |
| Appears in Collections: | 資訊管理學系 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-95-1.pdf (restricted access) | 317.36 kB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
