Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57722

Full metadata record
| Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳靜枝(Ching-Chin Chern) | |
| dc.contributor.author | Huai-De Peng | en |
| dc.contributor.author | 彭懷德 | zh_TW |
| dc.date.accessioned | 2021-06-16T06:59:56Z | - |
| dc.date.available | 2024-07-15 | |
| dc.date.copyright | 2014-08-01 | |
| dc.date.issued | 2014 | |
| dc.date.submitted | 2014-07-16 | |
| dc.identifier.citation | [1] Ackoff, R. L., 'From Data to Wisdom', Journal of Applied Systems Analysis, Vol. 16, 1989, pp 3-9.
[2] Angiulli, F., G. Ianni, and L. Palopoli, 'On the complexity of inducing categorical and quantitative association rules', Theoretical Computer Science, Vol. 314, no. 1-2, 2004, pp 217-249.
[3] Arockiaraj, M. C., 'Application of Data Mining Technique in Invasion Recognition', IOSR Journal of Computer Engineering, Vol. 10, no. 3, 2013, pp 20-23.
[4] Cano, J. R., F. Herrera, and M. Lozano, 'Stratification for scaling up evolutionary prototype selection', Pattern Recognition Letters, Vol. 26, no. 7, 2005, pp 953-963.
[5] Cano, J. R., F. Herrera, and M. Lozano, 'Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability', Data & Knowledge Engineering, Vol. 60, 2006, pp 90-108.
[6] Chen, M. C., L. S. Chen, C. C. Hsu, and W. R. Zeng, 'An information granulation based data mining approach for classifying imbalanced data', Information Sciences, Vol. 178, 2008, pp 3214-3227.
[7] Collett, S. Why Big Data Is a Big Deal. Computerworld, 2011 November.
[8] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, 'From Data Mining to Knowledge Discovery in Databases', AI Magazine, Vol. 17, no. 3, 1996, pp 37-54.
[9] Fayyad, U. M. and K. B. Irani, 'Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning', International Joint Conference on Artificial Intelligence, 1993, pp 1022-1029.
[10] Good, I. J., 'The Population Frequencies of Species and the Estimation of Population Parameters', Biometrika, Vol. 40, no. 3-4, 1953, pp 237-264.
[11] Gungor, Z. and A. Unler, 'K-harmonic means data clustering with simulated annealing heuristic', Applied Mathematics and Computation, Vol. 184, 2007, pp 199-209.
[12] Hall, M. A. and G. Holmes, 'Benchmarking Attribute Selection Techniques for Discrete Class Data Mining', IEEE Transactions on Knowledge and Data Engineering, Vol. 15, no. 6, 2003, pp 1437-1447.
[13] Harbert, T. Big Data, Big Jobs? Computerworld, 2013 January.
[14] Hoffmann, L. Looking Back at Big Data. Communications of the ACM, 2013 April.
[15] Hopkins, B. Beyond the Hype of Big Data. CIO, 2011 October.
[16] Jackson, J. The Big Promise of Big Data. CIO, 2012 March.
[17] Jacobs, A. The Pathologies of Big Data. Communications of the ACM, 2009 August.
[18] Karim, M. R., C. F. Ahmed, B. S. Jeong, and H. J. Choi, 'An Efficient Distributed Programming Model for Mining Useful Patterns in Big Datasets', IETE Technical Review, Vol. 30, no. 1, 2013, pp 53-63.
[19] KDD Cup 1999 Data Set, UCI Machine Learning Repository. Available: http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data.
[20] Kwon, O. and J. M. Sim, 'Effects of data set features on the performances of classification algorithms', Expert Systems with Applications, Vol. 40, 2013, pp 1847-1857.
[21] Lamont, J. Big data has big implications for knowledge management. KMWorld, 2012 April.
[22] Lamont, J. In the Realm of Big Data. KMWorld, 2013 April.
[23] Liu, B., W. Hsu, and Y. Ma, 'Integrating Classification and Association Rule Mining', Knowledge Discovery and Data Mining, 1998, pp 80-86.
[24] Moore, G. E., 'Cramming More Components onto Integrated Circuits', Electronics, Vol. 38, no. 8, 1965, pp 114-117.
[25] Nash, K. S. How Big Data Can Reduce Big Risk. CIO, 2012 March.
[26] Olavsrud, T. How to Be Ready for Big Data. CIO, 2012 March.
[27] Olavsrud, T. Big Data Causes Concern and Big Confusion. CIO, 2012 February.
[28] Reid, C. Can Big Data Fix Book Marketing? Publishers Weekly, 2012 May.
[29] Sahoo, A. K., M. J. Zuo, and M. K. Tiwari, 'A data clustering algorithm for stratified data partitioning in artificial neural network', Expert Systems with Applications, Vol. 39, 2012, pp 7004-7014.
[30] Shannon, C. E., 'A Mathematical Theory of Communication', The Bell System Technical Journal, Vol. 27, no. 3, 1948, pp 379-423.
[31] Skinner, D. Big Data: It's not Just Big. HPC Source, 2013 ISC'13 Special Edition.
[32] Song, M., H. Yang, S. H. Siadat, and M. Pechenizkiy, 'A comparative study of dimensionality reduction techniques to enhance trace clustering performance', Expert Systems with Applications, Vol. 40, 2013, pp 3722-3737.
[33] Stackpole, B. 5 things IT should do to prepare for big data. Computerworld US, 2012 February.
[34] Yang, Q. and X. D. Wu, '10 Challenging Problems in Data Mining Research', International Journal of Information Technology & Decision Making, Vol. 5, no. 4, 2006, pp 597-604.
[35] Ye, Y., Q. Wu, J. Z. Huang, M. K. Ng, and X. Li, 'Stratified sampling for feature subspace selection in random forests for high dimensional data', Pattern Recognition, Vol. 46, 2013, pp 769-787.
[36] Zhang, Y. and S. Bhattacharyya, 'Genetic programming in classifying large-scale data: an ensemble method', Information Sciences, Vol. 163, 2004, pp 85-101. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57722 | - |
| dc.description.abstract | 海量資料近年來無論在業界或學界中都是非常熱門的話題,其資料的特性不僅數量龐大、來源紛雜,同時資料會不停地新增成長,基於這些特性而使得這些資料比起過往的資料內容更加難以分析,原有的資料探勘方式應用在海量資料時有非常大的可能遭遇到無法適用的狀況,特別是在執行時間上極有可能因為海量資料的特性而不能夠即時有效的產生分析結果,甚至可能因為資料的物件或屬性的數量過大造成完全無法取得結果。
在本篇研究中,我們使用基於關聯性規則的分類法作為資料探勘的分類方式,在不改動原有資料探勘方法的前提下,透過資料的選擇、前處理以及產生分類器結果後的評估、整合來解決所遭遇到的海量資料問題。 我們提出的方法分為兩個部分,首先是在初始狀態下對資料進行有目的的啟發式抽樣方法,使得抽樣出來的資料能夠足以代表整個海量資料的母體,再針對屬性的部分計算各個屬性分別的鑑別力與重要性,從中選擇出重要的屬性來做為後續資料探勘所用。針對資料分布型態的不同,我們可以視需求使用適當的方法調整抽樣的比率,使得某些特定的稀有分類資料能夠有相對應的分類規則能夠使用。第二部分則是特別處理資料成長增加的問題,首先使用初始狀態的方式分別對舊有資料以及新進資料進行抽樣並建立分類器,再透過新進資料與舊有資料的整合,將舊有與新進的分類器合併,重新驗證分類器中的規則,刪除不必要的規則並將其餘規則重新排序,成為最終調整過後的分類器並得以應用在資料上。 本研究所提出的方法應用在海量資料下的資料探勘時,透過實驗的結果能夠得知產生的結果與使用所有資料時能夠有相似的準確率,但能夠有效的減少所需要的執行時間,使得分析結果能夠迅速的產生,並將其結果應用在其他資料上。 | zh_TW |
| dc.description.abstract | Big data has become a highly popular topic in both industry and academia. It is characterized by extreme scale, diverse data sources, and continual growth. These characteristics make big data harder to analyze, and classic data mining techniques are often infeasible on it. In particular, processing time may be too long to produce analytic results when they are needed, and mining may fail to produce any result at all when the numbers of objects and attributes are too large.
In this study, we use classification based on association rules as our data mining technique. Without changing the existing data mining method, we address the problems of big data through data preparation, integration, and evaluation. The proposed algorithm consists of two parts. The first part is a heuristic sampling method applied at the initial phase: it samples data that are representative of the big-data population and then selects the attributes that are important and discriminative, so that the sampled result can be used by subsequent data mining techniques. To handle skewed class distributions, an undersampling method can be applied so that specific rare classes still obtain corresponding classification rules. The second part deals with the incremental problem. We apply the initial-phase sampling to both the preliminary data and the incremental data to build a classifier for each, merge the two datasets, and use the merged data to verify the combined classifier. After pruning invalid rules and re-ranking the remaining ones, we obtain the final modified classifier, which can then be applied to other data in the population. Experimental results show that, in a big data environment, the proposed algorithm achieves accuracy comparable to using the whole dataset while significantly reducing processing time, so that analytic results can be obtained in time for further applications. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-16T06:59:56Z (GMT). No. of bitstreams: 1 ntu-103-R01725017-1.pdf: 1808444 bytes, checksum: 44ea252ac3e053c77c67999ad935f023 (MD5) Previous issue date: 2014 | en |
| dc.description.tableofcontents | Contents iv
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Objective 6
1.3 Scope 7
Chapter 2 Literature Review 9
2.1 Big Data 9
2.2 Data Mining 11
2.3 Sampling 13
2.4 Conclusion 16
Chapter 3 Problem Description 18
3.1 Data Preparation Step for Data Mining Methods 19
3.2 Data Preparation under Big Data Environment 21
3.2.1 Big Data 21
3.2.2 Sampling 23
3.2.3 Attribute Selection 24
3.3 Problem Statement 25
3.3.1 Invasion Detection of Internet Packets 27
3.3.2 Conditions of Records under Big Data 29
3.3.3 Conditions of Attributes under Big Data 30
3.4 Summary 31
Chapter 4 The Heuristic Big-Data Sampling Algorithm (HBDSA) 33
4.1 Data and Attribute Sampling Algorithm (DASA) 35
4.1.1 Step 1: Data Sampling 36
4.1.2 Step 2: Attribute Selection 42
4.1.3 Step 3: Undersampling to Generate Rules of Rare Classes 45
4.2 Incremental Classifier Modifying Algorithm (ICMA) 47
4.2.1 Attributes Change between Original and Incremental Data 49
4.2.2 Classes Change between Original and Incremental Data 50
4.2.3 Combining Original and Incremental Classifiers and Data 50
4.2.4 Scenario 1: Without class and attributes change 51
4.2.5 Scenario 2: Removing an attribute 55
4.2.6 Scenario 3: Adding an attribute 57
4.2.7 Scenario 4: Removing a class 60
4.2.8 Scenario 5: Adding a class 63
4.3 Time Complexity of DASA and ICMA 66
Chapter 5 Computation Analysis 68
5.1 Data Description and Experiment Environment 68
5.2 Experiments for the DASA 70
5.2.1 Comparison between Sample and Population 70
5.2.2 Data with Uniform Class Distribution 74
5.2.3 Data with Long Tail Class Distribution 76
5.3 Experiments for the ICMA 79
5.3.1 Scenario 1: No changes in Classes 80
5.3.2 Scenario 2: Removing a Class 83
5.3.3 Scenario 3: Adding a Class 86
5.3.4 Scenario 4: Removing and Adding a Class 89
5.3.5 Effect over Certain Periods 92
5.4 Summary 94
Chapter 6 Conclusion and Future Work 96
6.1 Conclusion 96
6.2 Future Work 98
Reference 99
Appendix A Accuracy of Classifiers in Incremental Population 103 | |
| dc.language.iso | en | |
| dc.subject | 海量資料 | zh_TW |
| dc.subject | 資料探勘 | zh_TW |
| dc.subject | 增量式資料 | zh_TW |
| dc.subject | 資料分類 | zh_TW |
| dc.subject | 屬性選擇 | zh_TW |
| dc.subject | 資料準備 | zh_TW |
| dc.subject | 資料抽樣 | zh_TW |
| dc.subject | Big Data | en |
| dc.subject | Attribute Selection | en |
| dc.subject | Data Sampling | en |
| dc.subject | Data Preparation | en |
| dc.subject | Data Classification | en |
| dc.subject | Data Mining | en |
| dc.subject | Incremental Data | en |
| dc.title | 海量資料下資料探勘的啟發式抽樣資料準備方法 | zh_TW |
| dc.title | A Heuristic Data Sampling Approach for Data Mining Preparation under Big Data Environment | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 102-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 魏志平(Chih-Ping Wei),陳建錦(Chien Chin Chen),楊錦生(Chin-Sheng Yang) | |
| dc.subject.keyword | 海量資料,增量式資料,資料探勘,資料分類,資料準備,資料抽樣,屬性選擇, | zh_TW |
| dc.subject.keyword | Big Data,Incremental Data,Data Mining,Data Classification,Data Preparation,Data Sampling,Attribute Selection, | en |
| dc.relation.page | 103 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2014-07-17 | |
| dc.contributor.author-college | 管理學院 | zh_TW |
| dc.contributor.author-dept | 資訊管理學研究所 | zh_TW |
| Appears in Collections: | 資訊管理學系 | |
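The first phase described in the abstract combines representative sampling with attribute selection. The thesis's actual DASA procedure is not reproduced in this record, so the following is only a minimal illustrative sketch under assumed details: the function names, the `rare_floor` parameter (a per-class minimum so rare classes keep enough records to generate rules), and the use of information gain as the discriminative-power measure are all assumptions, not the author's implementation.

```python
import math
import random
from collections import Counter, defaultdict

def stratified_sample(records, labels, rate, rare_floor=5, seed=0):
    """Sample a fraction `rate` of each class, but keep at least
    `rare_floor` records of any rare class (hypothetical parameter)
    so that rules for rare classes can still be learned."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)
    sample, sample_labels = [], []
    for lab, recs in by_class.items():
        k = max(int(len(recs) * rate), min(rare_floor, len(recs)))
        for rec in rng.sample(recs, k):
            sample.append(rec)
            sample_labels.append(lab)
    return sample, sample_labels

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, attr):
    """Discriminative power of attribute index `attr`: H(Y) - H(Y|attr).
    Attributes with higher gain would be kept for subsequent mining."""
    groups = defaultdict(list)
    for rec, lab in zip(records, labels):
        groups[rec[attr]].append(lab)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

For example, with 100 records of a common class and 6 of a rare one, a 10% rate yields 10 common samples while the floor keeps 5 rare ones; an attribute whose value perfectly predicts the class scores gain 1.0 bit on a balanced two-class set, and a constant attribute scores 0.0.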
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-103-1.pdf (Restricted Access) | 1.77 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
