多維度資料庫中頻繁數值樣式之分散式探勘

Yen-Chi Chen; 陳彥琦

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/35573

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	李瑞庭(Anthony JT Lee)
dc.contributor.author	Yen-Chi Chen	en
dc.contributor.author	陳彥琦	zh_TW
dc.date.accessioned	2021-06-13T06:59:07Z	-
dc.date.available	2014-07-27
dc.date.copyright	2011-07-27
dc.date.issued	2011
dc.date.submitted	2011-07-22
dc.identifier.citation	[1] R. C. Agarwal, C.C. Aggarwal, V.V.V. Prasad, A tree projection algorithm for generation of frequent itemsets, Journal of Parallel Distributed Computing, Vol. 61, No. 3, 2001, pp. 350-371. [2] J. Ayres, J.E. Gehrke, T. Yiu, J. Flannick, Sequential pattern mining using a bitmap representation, Proceedings of the ACM SIGMOD International Conference on Knowledge Discovery in Database, 2002, pp. 429-435. [3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, Proceedings of the International Conference on Very Large Data Bases, 1994, pp. 487-499. [4] R. Agrawal, R. Srikant, Mining sequential patterns, Proceedings of the International Conference on Data Engineering, 1995, pp. 3-14. [5] A. Baratloo, M. Karaul, Z. Kedem, P. Wyckoff, Charlotte: Metacomputing on the web, Future Generation Computer Systems, Vol. 15, No. 5-6, 1999, pp. 559-570. [6] R. J. Bayardo, Efﬁciently mining long patterns from databases, Proceedings of the International Conference on Management of Data, 1998, pp. 85-93. [7] D Burdick, M Calimlim, J Flannick, MAFIA: A maximal frequent itemset algorithm, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 11, 2005, pp. 1490-1504. [8] G. E. Blelloch, Scans as primitive parallel operations, IEEE Transactions on Computers, Vol. 38, No. 11, 1989, pp. 1526-1538. [9] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, Proceedings of the 6th Symposium on Operating Systems Design and Implementation, 2004, pp. 137-150. [10] L. Di-Jorio, A. Laurent, M. Teisseire, Mining frequent gradual itemsets from large databases, Lecture Notes in Computer Science, Vol. 5772, 2009, pp. 297-308. [11] T. D. T. Do, A. Laurent, A. Termier, PGLCM: Efficient parallel mining of closed frequent gradual itemsets, Proceedings of International Conference on Data Mining, 2010, pp. 138 -147. [12] S. Gorlatch, Systemic efficient parallelization of scan and other list homomorphism, Lecture Notes in Computer Science, Vol. 1124, 1996, pp. 401-408. [13] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, MA, USA, 1999. [14] H. Li, Y. Wang, D. Zhang, M. Zhang, E. Y. Chang, PFP: Parallel FP-growth for query recommendation, Proceedings of Conference on Recommender Systems, 2008, pp. 107-114. [15] E.H. Han, G. Karypis, V. Kumar, Scalable parallel data mining for association rules, IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 3, 2000, pp. 337-352. [16] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.C. Hsu, FreeSpan: Frequent pattern-projected sequential pattern mining, Proceedings of International Conference on Knowledge Discovery and Data Mining, 2000, pp. 355-359. [17] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000, pp. 1-12. [18] L. Huston, R. Sukthankar, R. Wickremesinghe, M. Satyanarayanan, G. R. Ganger, E. Riedel, A. Ailamaki, Diamond: A storage architecture for early discard in interactive search, Proceedings of the USENIX File and Storage Technologies FAST Conference, 2004, pp. 73-86. [19] H. G. Kayacık, A. N. Zincir-Heywood, M. I. Heywood, Selecting features for intrusion detection: A feature relevance analysis on KDD 99 intrusion detection Datasets, Proceedings of the Third Annual Conference on Privacy, Security and Trust, 2005. [20] E. Keogh, J. Lin, A. Fu, HOT SAX: Efficiently finding the most unusual time series subsequence, Proceedings of the Fifth IEEE International Conference on Data Mining, 2005, pp. 226-233. [21] R. E. Ladner, M. J. Fischer, Parallel prefix computation, Journal of the ACM, Vol. 27, No. 4, 1980, pp. 831-838. [22] J. Lin, E. Keogh, Group SAX: Extending the notion of contrast sets to time series and multimedia data, Proceedings of 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2006, pp. 284-296. [23] M. Leleu, C. Rigotti, J. Boulicaut, G. Euvrard, GO-SPADE: Mining sequential patterns over datasets with consecutive repetitions, Proceedings of International Conference on Machine Learning and Data Mining, 2003, pp. 293-306. [24] J. Lin, E. Keogh, S. Lonardi and B. Chiu, A symbolic representation of time series, with implications for streaming algorithms, Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003, pp. 2-11. [25] K. G. Mohammed J. Zaki, GenMax: An efficient algorithm for mining maximal frequent itemsets, Data Mining and Knowledge Discovery, Vol. 11, No. 3, 2005, pp. 223 - 242. [26] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Discovering frequent closed itemsets for association rules, Proceeding of the 7th International Conference on Database Theory, 1999, pp 398–416. [27] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, Proceedings of the 17th International Conference on Data Engineering, 2001, pp. 215-224. [28] E. Riedel, C. Faloutsos, G. A. Gibson, D. Nagle, Active disks for large-scale data processing, IEEE Computer, Vol. 34, No. 6, 2001, pp. 68-74. [29] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, Proceedings of the 21st VLDB Conference, 1995, pp. 432-444. [30] R. Srikant, R. Agrawal, Mining sequential patterns: Generalizations and performance improvements, Proceedings of the 5th International Conference on Extending Database Technology, 1996, pp. 3-17. [31] L. G. Valiant, A bridging model for parallel computation, Communications of the ACM, Vol. 33, No. 8, 1998, pp. 103-111. [32] M.J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Machine Learning, Vol. 42, No. 1-2, 2001, pp. 31-60. [33] M.J. Zaki, K. Gouda, Fast vertical mining using diffsets, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 326-335.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/35573	-
dc.description.abstract	已有許多方法能從符號式資料庫中探勘頻繁樣式，對於數值資料的處理，則先利用分界點將數值資料轉換成符號式的資料，然後再利用之前所提出的方法，從轉換後的符號式資料中探勘頻繁樣式。然而分界點選定的不同，常會造成樣式的不同。因此，最好發展一個方法直接從數值資料庫中探勘數值樣式。從資料庫中探勘數值樣式，是個需要大量運算的工作；而雲端運算的架構正好適合此類運算密集的應用。就我們所知，目前尚未有人提出在雲端架構上探勘數值樣式的方法。因此，在本篇論文中，我們提出一個有效率的演算法，利用MapReduce架構從多維度的數值資料庫中，探勘頻繁數值樣式。我們所提出的方法，共包含三個MapReduce的工作。首先，每個運算實體分頭掃描資料庫，以找出所有長度為一的頻繁樣式。接著，每個運算實體掃描每個長度為一的頻繁樣式的投影資料庫，以取得有助於負載平衡規劃的資訊，並將探勘的工作平均分配給各個運算實體。最後，每個運算實體以深度優先的方式遞迴產生所有頻繁樣式。在探勘過程中，我們利用兩個加速策略來產生相近份額的諸多任務，以便得到平衡的負載，使得各個運算實體能夠平行而互不干擾的進行探勘工作。因此，我們所提出的方法能夠有效率地從多維度資料庫中探勘出頻繁數值樣式。根據實驗結果顯示，我們所提出的方法較改良式的Partition 演算法，在執行速度與擴充性上皆有較佳的表現。	zh_TW
dc.description.abstract	Most previously proposed methods mine frequent patterns from symbolic databases, where numerical data may be transformed numerical data into a symbolic representation by a list of breakpoints. However, using different breakpoints to transform numerical data into a symbolic representation can result in different patterns. Thus, it is better to directly mine numerical patterns without any transformation. Mining numerical patterns from databases is one of the most computation-intensive applications, which can be dealt with by using the cloud computing infrastructure. To the best of our knowledge, there is no algorithm dedicated to mining frequent numerical patterns on the cloud. Therefore, in this thesis, we propose an efficient algorithm for mining frequent numerical patterns in a multi-dimensional database on the MapReduce framework, where every transaction in the database contains multiple numerical values. The proposed algorithm is composed of three MapReduce jobs. The first job is to mine frequent patterns of length one (1-patterns) in parallel. The second job gathers the information about frequent 1-patterns mined, and utilizes the information to divide the mining process into nearly-equally sized tasks. The third job distributes these tasks to different worker instances, each of which recursively mines frequent patterns in a depth-first search (DFS) manner until no more frequent patterns can be found. During the mining process, we employ two effective speedup strategies to form tasks of nearly-equal size and balance the workload, and an approach to divide a multi-dimensional database into independent partitions so that the mining tasks can be performed independently and parallelly. Therefore, the proposed method can efficiently mine frequent numerical patterns in a multi-dimensional database. The experimental results show that the DNM algorithm outperforms the modified Partition-Apriori algorithm in orders of magnitude.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T06:59:07Z (GMT). No. of bitstreams: 1 ntu-100-R98725011-1.pdf: 1016116 bytes, checksum: 1c739cef93854ce8c79f0b8a32bea1d7 (MD5) Previous issue date: 2011	en
dc.description.tableofcontents	Table of Contents i List of Figures ii List of Tables iii Chapter 1 Introduction 1 Chapter 2 Preliminary Concepts and Problem Definitions 5 Chapter 3 The Proposed Method 8 3.1 Generating frequent 1-patterns 8 3.2 Generating frequent super-patterns 10 3.3 The algorithm 11 3.4 Speedup strategies 17 3.4.1 Dimension ordering 17 3.4.2 Task assignment 19 3.5 An example 20 Chapter 4 Performance Evaluation 24 4.1 Synthetic datasets 25 4.2 Performance evaluation on synthetic datasets 25 4.2 Performance evaluation on real datasets 30 Chapter 5 Concluding Remarks and Future Work 35 List of Figures Figure 1. A sliding window example. 9 Figure 2. The DNM algorithm. 11 Figure 3. The GenerateOnePattern procedure. 12 Figure 4. The PDBPeek procedure. 13 Figure 5. The DNMGrow procedure. 14 Figure 6. The DNMExtend procedure. 15 Figure 7. Sorting dimensions in descending order. 19 Figure 8. Sorting dimensions in ascending order. 19 Figure 9. Runtime versus minimum support. 26 Figure 10. Runtime versus number of worker instances. 27 Figure 11. Speedup versus number of worker instances. 28 Figure 12. Runtime versus number of transactions. 28 Figure 13. Runtime versus average transaction length. 29 Figure 14. Runtime versus tolerance. 30 Figure 15. Runtime versus minimum support for real dataset. 33 List of Tables Table 1. An example multi-dimensional database. 5 Table 2. Another example database. 18 Table 3. An example of scarf-knitting hashing. 20 Table 4. Frequent 1-patterns. 22 Table 5. Task list. 23 Table 6. Parameters in synthetic-dataset experiments. 25 Table 7. Types of attacks in the real dataset. 32
dc.language.iso	en
dc.subject	多維度資料庫	zh_TW
dc.subject	雲端運算	zh_TW
dc.subject	資料探勘	zh_TW
dc.subject	數值樣式	zh_TW
dc.subject	頻繁樣式	zh_TW
dc.subject	multi-dimensional database	en
dc.subject	data mining	en
dc.subject	frequent pattern	en
dc.subject	numerical pattern	en
dc.subject	Cloud computing	en
dc.title	多維度資料庫中頻繁數值樣式之分散式探勘	zh_TW
dc.title	Distributedly Mining Frequent Numerical Patterns in Multi-Dimensional Databases	en
dc.type	Thesis
dc.date.schoolyear	99-2
dc.description.degree	碩士
dc.contributor.oralexamcommittee	陳彥良,劉敦仁
dc.subject.keyword	雲端運算,數值樣式,頻繁樣式,多維度資料庫,資料探勘,	zh_TW
dc.subject.keyword	Cloud computing,numerical pattern,frequent pattern,multi-dimensional database,data mining,	en
dc.relation.page	50
dc.rights.note	有償授權
dc.date.accepted	2011-07-22
dc.contributor.author-college	管理學院	zh_TW
dc.contributor.author-dept	資訊管理學研究所	zh_TW
顯示於系所單位：	資訊管理學系

文件中的檔案：

檔案	大小	格式
ntu-100-1.pdf 未授權公開取用	992.3 kB	Adobe PDF

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。