NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57722
Full metadata record
DC Field / Value / Language
dc.contributor.advisor陳靜枝(Ching-Chin Chern)
dc.contributor.authorHuai-De Pengen
dc.contributor.author彭懷德zh_TW
dc.date.accessioned2021-06-16T06:59:56Z-
dc.date.available2024-07-15
dc.date.copyright2014-08-01
dc.date.issued2014
dc.date.submitted2014-07-16
dc.identifier.citation[1] Ackoff, R. L., 'From Data to Wisdom', Journal of Applies Systems Analysis, Vol. 16, 1989, pp 3-9.
[2] Angiulli, F., G. Ianni, and L. Palopoli, 'On the complexity of inducing categorical and quantitative association rules', Theoretical Computer Science, Vol. 314, no. 1-2, 2004, pp 217-249.
[3] Arockiaraj, M. C., 'Application of Data Mining Technique in Invasion Recognition', IOSR Journal of Computer Engineering, Vol. 10, no. 3, 2013, pp 20-23.
[4] Cano, J. R., F. Herrera, and M. Lozano, 'Stratification for scaling up evolutionary prototype selection', Pattern Recognition Letters, Vol. 26, no. 7, 2005, pp 953-963.
[5] Cano, J. R., F. Herrera, and M. Lozano, 'Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability', Data & Knowledge Engineering, Vol. 60, 2006, pp 90-108.
[6] Chen, M. C., L. S. Chen, C. C. Hsu, and W. R. Zeng, 'An information granulation based data mining approach for classifying imbalanced data', Information Sciences, Vol. 178, 2008, pp 3214-3227.
[7] Collett, S. Why Big Data Is a Big Deal. Computerworld, 2011 November.
[8] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, 'From Data Mining to Knowledge Discovery in Databases', AI Magazine, Vol. 17, no. 3, 1996, pp 37-54.
[9] Fayyad, U. M. and K. B. Irani, 'Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning', International Joint Conferences on Artificial Intelligence, 1993, pp 1022-1029.
[10] Good, I. J., 'The Population Frequencies of Species and the Estimation of Population Parameters', Biometrika, Vol. 40, no. 3-4, 1953, pp 237-264.
[11] Gungor, Z. and A. Unler, 'K-harmonic means data clustering with simulated annealing heuristic', Applied Mathematics and Computation, Vol. 184, 2007, pp 199-209.
[12] Hall, M. A. and G. Holmes, 'Benchmarking Attribute Selection Techniques for Discrete Class Data Mining', IEEE Transactions on Knowledge and Data Engineering, Vol. 15, no. 6, 2003, pp 1437-1447.
[13] Harbert, T. Big Data, Big Jobs? Computerworld, 2013 January.
[14] Hoffmann, L. Looking Back at Big Data. Communications of the ACM, 2013 April.
[15] Hopkins, B. Beyond the Hype of Big Data. CIO, 2011 October.
[16] Jackson, J. The Big Promise of Big Data. CIO, 2012 March.
[17] Jacobs, A. The Pathologies of Big Data. Communications of the ACM, 2009 August.
[18] Karim, M. R., C. F. Ahmed, B. S. Jeong, and H. J. Choi, 'An Efficient Distributed Programming Model for Mining Useful Patterns in Big Datasets', IETE Technical Review, Vol. 30, no. 1, 2013, pp 53-63.
[19] KDD Cup 1999 Data Set, UCI Machine Learning Repository. Available: http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data.
[20] Kwon, O. and J. M. Sim, 'Effects of data set features on the performances of classification algorithms', Expert Systems with Applications, Vol. 40, 2013, pp 1847-1857.
[21] Lamont, J. Big data has big implications for knowledge management. KMWorld, 2012 April.
[22] Lamont, J. In the Realm of Big Data. KMWorld, 2013 April.
[23] Liu, B., W. Hsu, and Y. Ma, 'Integrating Classification and Association Rule Mining', Knowledge Discovery and Data Mining, 1998, pp 80-86.
[24] Moore, G. E., 'Cramming More Components onto Integrated Circuits', Electronics, Vol. 38, no. 8, 1965, pp 114-117.
[25] Nash, K. S. How Big Data Can Reduce Big Risk. CIO, 2012 March.
[26] Olavsrud, T. How to Be Ready for Big Data. CIO, 2012 March.
[27] Olavsrud, T. Big Data Causes Concern and Big Confusion. CIO, 2012 February.
[28] Reid, C. Can Big Data Fix Book Marketing. Publishers Weekly, 2012 May.
[29] Sahoo, A. K., M. J. Zuo, and M. K. Tiwari, 'A data clustering algorithm for stratified data partitioning in artificial neural network', Expert Systems with Applications, Vol. 39, 2012, pp 7004-7014.
[30] Shannon, C. E., 'A Mathematical Theory of Communication', The Bell System Technical Journal, Vol. 27, no. 3, 1948, pp 379-423.
[31] Skinner, D. Big Data: It's not Just Big. HPC Source, 2013 ISC'13 Special Edition.
[32] Song, M., H. Yang, S. H. Siadat, and M. Pechenizkiy, 'A comparative study of dimensionality reduction techniques to enhance trace clustering performance', Expert Systems with Applications, Vol. 40, 2013, pp 3722-3737.
[33] Stackpole, B. 5 things IT should do to prepare for big data. Computerworld US, 2012 February.
[34] Yang, Q. and X. D. Wu, '10 Challenging Problems in Data Mining Research', International Journal of Information Technology & Decision Making, Vol. 5, no. 4, 2006, pp 597-604.
[35] Ye, Y., Q. Wu, J. Z. Huang, M. K. Ng, and X. Li, 'Stratified sampling for feature subspace selection in random forests for high dimensional data', Pattern Recognition, Vol. 46, 2013, pp 769-787.
[36] Zhang, Y. and S. Bhattacharyya, 'Genetic programming in classifying large-scale data: an ensemble method', Information Sciences, Vol. 163, 2004, pp 85-101.
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57722-
dc.description.abstract海量資料近年來無論在業界或學界中都是非常熱門的話題,其資料的特性不僅數量龐大、來源紛雜,同時資料會不停地新增成長,基於這些特性而使得這些資料比起過往的資料內容更加難以分析,原有的資料探勘方式應用在海量資料時有非常大的可能遭遇到無法適用的狀況,特別是在執行時間上極有可能因為海量資料的特性而不能夠即時有效的產生分析結果,甚至可能因為資料的物件或屬性的數量過大造成完全無法取得結果。
在本篇研究中,我們使用基於關聯性規則的分類法作為資料探勘的分類方式,在不改動原有資料探勘方法的前提下,透過資料的選擇、前處理以及產生分類器結果後的評估、整合來解決所遭遇到的海量資料問題。
我們提出的方法分為兩個部分,首先是在初始狀態下對資料進行有目的的啟發式抽樣方法,使得抽樣出來的資料能夠足以代表整個海量資料的母體,再針對屬性的部分計算各個屬性分別的鑑別力與重要性,從中選擇出重要的屬性來做為後續資料探勘所用。針對資料分布型態的不同,我們可以視需求使用適當的方法調整抽樣的比率,使得某些特定的稀有分類資料能夠有相對應的分類規則能夠使用。第二部分則是特別處理資料成長增加的問題,首先使用初始狀態的方式分別對舊有資料以及新進資料進行抽樣並建立分類器,再透過新進資料與舊有資料的整合,將舊有與新進的分類器合併,重新驗證分類器中的規則,刪除不必要的規則並將其餘規則重新排序,成為最終調整過後的分類器並得以應用在資料上。
本研究所提出的方法應用在海量資料下的資料探勘時,透過實驗的結果能夠得知產生的結果與使用所有資料時能夠有相似的準確率,但能夠有效的減少所需要的執行時間,使得分析結果能夠迅速的產生,並將其結果應用在其他資料上。
zh_TW
dc.description.abstractBig data has become a highly popular topic in both industry and academia. It has several defining characteristics: extreme scale, diverse data sources, and continuous, incremental growth. These characteristics make big data harder to analyze, and classic data mining techniques often become infeasible. In particular, processing may take too long to generate analytic results in time, and it may fail to produce any result at all when the number of objects and attributes is too large.
In this study, we use classification based on association rules as our data mining technique. Without changing the existing data mining method, we address the problems of big data through data preparation, integration, and evaluation.
The proposed algorithm consists of two parts. The first part is a heuristic sampling method applied in the initial phase: it samples data that are representative of the big data population and then selects attributes that are important and discriminative, so that the sampled result can be used by subsequent data mining techniques. To handle skewed class distributions, an undersampling method can be applied to specific rare classes so that corresponding rules are generated for them. The second part deals with the incremental-data problem. We apply the initial-phase sampling to both the preliminary data and the incremental data, build a classifier for each, merge the two datasets, and use the merged data to verify the combined classifier. After pruning invalid rules and re-ranking the remaining ones, we obtain the final modified classifier, which can then be applied to other data in the population.
Applying the proposed algorithm to data mining in a big data environment, we can generate results whose accuracy is comparable to that obtained using the whole dataset, while the processing time is significantly reduced, so analytic results can be obtained in time for further applications.
en
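The two-step preparation the abstract describes (class-proportional sampling with special handling for rare classes, then attribute selection by discriminative power) can be sketched roughly as follows. This is a minimal illustrative sketch only: the function names, the `rare_floor` parameter, and the use of information gain as the discrimination measure are assumptions for illustration, not details taken from the thesis.

```python
# Illustrative sketch, not the thesis's actual HBDSA/DASA implementation.
import math
import random
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def stratified_sample(records, labels, rate=0.1, rare_floor=5, seed=0):
    """Sample each class at `rate`, but keep at least `rare_floor` records
    of every class so rare classes can still yield classification rules."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)
    sample = []
    for lab, recs in by_class.items():
        k = max(min(rare_floor, len(recs)), round(rate * len(recs)))
        sample.extend((r, lab) for r in rng.sample(recs, k))
    return sample

def information_gain(records, labels, attr_index):
    """Entropy reduction of the labels after splitting on one attribute;
    higher values indicate a more discriminative attribute."""
    base = entropy(labels)
    groups = defaultdict(list)
    for rec, lab in zip(records, labels):
        groups[rec[attr_index]].append(lab)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder
```

In this sketch the sampled, attribute-filtered data would then be handed to the association-rule classifier; the incremental part of the algorithm (merging old and new classifiers) is not shown.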
dc.description.provenanceMade available in DSpace on 2021-06-16T06:59:56Z (GMT). No. of bitstreams: 1
ntu-103-R01725017-1.pdf: 1808444 bytes, checksum: 44ea252ac3e053c77c67999ad935f023 (MD5)
Previous issue date: 2014
en
dc.description.tableofcontentsContents iv
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Objective 6
1.3 Scope 7
Chapter 2 Literature Review 9
2.1 Big Data 9
2.2 Data Mining 11
2.3 Sampling 13
2.4 Conclusion 16
Chapter 3 Problem Description 18
3.1 Data Preparation Step for Data Mining Methods 19
3.2 Data Preparation under Big Data Environment 21
3.2.1 Big Data 21
3.2.2 Sampling 23
3.2.3 Attribute Selection 24
3.3 Problem Statement 25
3.3.1 Invasion Detection of Internet Packets 27
3.3.2 Conditions of Records under Big Data 29
3.3.3 Conditions of Attributes under Big Data 30
3.4 Summary 31
Chapter 4 The Heuristic Big-Data Sampling Algorithm (HBDSA) 33
4.1 Data and Attribute Sampling Algorithm (DASA) 35
4.1.1 Step 1: Data Sampling 36
4.1.2 Step 2: Attribute Selection 42
4.1.3 Step 3: Undersampling to Generate Rules of Rare Classes 45
4.2 Incremental Classifier Modifying Algorithm (ICMA) 47
4.2.1 Attributes Change between Original and Incremental Data 49
4.2.2 Classes Change between Original and Incremental Data 50
4.2.3 Combining Original and Incremental Classifiers and Data 50
4.2.4 Scenario 1: Without class and attributes change 51
4.2.5 Scenario 2: Removing an attribute 55
4.2.6 Scenario 3: Adding an attribute 57
4.2.7 Scenario 4: Removing a class 60
4.2.8 Scenario 5: Adding a class 63
4.3 Time Complexity of DASA and ICMA 66
Chapter 5 Computation Analysis 68
5.1 Data Description and Experiment Environment 68
5.2 Experiments for the DASA 70
5.2.1 Comparison between Sample and Population 70
5.2.2 Data with Uniform Class Distribution 74
5.2.3 Data with Long Tail Class Distribution 76
5.3 Experiments for the ICMA 79
5.3.1 Scenario 1: No changes in Classes 80
5.3.2 Scenario 2: Removing a Class 83
5.3.3 Scenario 3: Adding a Class 86
5.3.4 Scenario 4: Removing and Adding a Class 89
5.3.5 Effect over Certain Periods 92
5.4 Summary 94
Chapter 6 Conclusion and Future Work 96
6.1 Conclusion 96
6.2 Future Work 98
Reference 99
Appendix A Accuracy of Classifiers in Incremental Population 103
dc.language.isoen
dc.subject海量資料zh_TW
dc.subject資料探勘zh_TW
dc.subject增量式資料zh_TW
dc.subject資料分類zh_TW
dc.subject屬性選擇zh_TW
dc.subject資料準備zh_TW
dc.subject資料抽樣zh_TW
dc.subjectBig Dataen
dc.subjectAttribute Selectionen
dc.subjectData Samplingen
dc.subjectData Preparationen
dc.subjectData Classificationen
dc.subjectData Miningen
dc.subjectIncremental Dataen
dc.title海量資料下資料探勘的啟發式抽樣資料準備方法zh_TW
dc.titleA Heuristic Data Sampling Approach for Data Mining Preparation under Big Data Environmenten
dc.typeThesis
dc.date.schoolyear102-2
dc.description.degree碩士
dc.contributor.oralexamcommittee魏志平(Chih-Ping Wei),陳建錦(Chien Chin Chen),楊錦生(Chin-Sheng Yang)
dc.subject.keyword海量資料,增量式資料,資料探勘,資料分類,資料準備,資料抽樣,屬性選擇,zh_TW
dc.subject.keywordBig Data,Incremental Data,Data Mining,Data Classification,Data Preparation,Data Sampling,Attribute Selection,en
dc.relation.page103
dc.rights.note有償授權
dc.date.accepted2014-07-17
dc.contributor.author-college管理學院zh_TW
dc.contributor.author-dept資訊管理學研究所zh_TW
Appears in Collections:資訊管理學系

Files in This Item:
File: ntu-103-1.pdf (Restricted Access)
Size: 1.77 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
