基於雲端運算模式之巨量資訊分析

Chun-Chieh Chen; 陳俊傑

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78090

Full metadata record

???org.dspace.app.webui.jsptag.ItemTag.dcfield???	Value	Language
dc.contributor.advisor	陳銘憲(Ming-Syan Chen)
dc.contributor.author	Chun-Chieh Chen	en
dc.contributor.author	陳俊傑	zh_TW
dc.date.accessioned	2021-07-11T14:41:55Z	-
dc.date.available	2026-08-22
dc.date.copyright	2016-10-26
dc.date.issued	2016
dc.date.submitted	2016-08-19
dc.identifier.citation	[1] Apache Hadoop, http://hadoop.apache.org/. [2] Apache Mahout, http://mahout.apache.org/. [3] Apache Spark: Lightning-fast Cluster Computing, https://spark.incubator.apache.org/. [4] Weka, http://weka.org/. [5] Apache Hama, https://hama.apache.org/. [6] S4: Distributed Stream Computing Platform, https://incubator.apache.org/s4/. [7] Twister: Iterative MapReduce, https://iterative.mapreduce.org. [8] Samza, https://samza.incubator.apache.org/. [9] Apache Storm: Distributed and Fault-tolerant Realtime Computation, https://storm.apache.org/. [10] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic Subspace Clustering Of High Dimensional Data For Data Mining Applications. Proceedings of International Conference on Management of Data (SIGMOD’98), pages 94–105, 1998. [11] Rakesh Agrawal and Ramakrishnan Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering (ICDE’95), pages 3–14, 1995. [12] Ene Alina, Sungjin Im, and Benjamin Moseley. Fast Clustering Using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), pages 681–689, 2011. [13] Guilherme Andrade, Gabriel Ramos, Daniel Madeira, Rafael Sachetto, Renato Ferreira, and Leonardo Rocha. G-Dbscan: A Gpu Accelerated Algorithm for Density-based Clustering. In International Conference on Computational Science (ICCS’13), pages 369–378, 2013. [14] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jorg Sander.ぴ OPTICS: Ordering Points to Identify the Clustering Structure. ACM Sigmod Record (SIGMOD’99), pages 49–60, 1999. [15] Domenica Arlia and Massimo Coppola. Experiments in Parallel Clustering with DBSCAN. Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing (Euro-Par’01), pages 326–331, 2001. [16] Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu. Sequential PAttern Mining using a Bitmap Representation. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02), pages 429–435, 2002. [17] Iyad Batal, Hamed Valizadegan, Gregory F Cooper, and Milos Hauskrecht. A Temporal Pattern Mining Approach for Classifying Electronic Health Record Data. Transactions on Intelligent Systems and Technology (TIST’13), page 63, 2013. [18] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*tree: An Efficient and Robust Access Method for Points and Rectangles. Proceedings of the International Conference on Management of Data (SIGMOD’90), pages 322–331, 1990. [19] Christian Bohm, Robert Noll, Claudia Plant, and Bianca Wackersreuther. Density-basedぴ Clustering Using Graphics Processors. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09), pages 661–670, 2009. [20] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. Haloop: Efficient Iterative Data Processing on Large Clusters. Proceedings of the VLDB Endowment (PVLDB’10), pages 285–296, 2010. [21] Chun-Chieh Chen, Chi-Yao Tseng, and Ming-Syan Chen. Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud. IEEE International Congress on Big Data (BigData Congress’13), pages 310–317, 2013. [22] Yi-Hong Chu, Jen-Wei Huang, Kun-Ta Chuang, De-Nian Yang, and Ming-Syan Chen. Density Conscious Subspace Clustering for High-Dimensional Data. IEEE Transactions on Knowledge and Data Engineering (TKDE’10), pages 16–30, 2010. [23] Bi-Ru Dai and I-Chang Lin. Efficient Map/Reduce-based DBSCAN Algorithm with Optimized Data Partition. IEEE 5th International Conference on Cloud Computing (CLOUD’12), pages 59–66, 2012. [24] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Conference on Symp. on Opearting Systems Design and Implementation (OSDI”04), pages 16–30, 2004. [25] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM (CACM’08), pages 107–113, 2008. [26] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. Twister: a Runtime for Iterative Mapreduce. Proceeding of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC’10), pages 810–818, 2010. [27] Martin Ester, Hans peter Kriegel, Jorg S, and Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’96), pages 226–231, 1996. [28] Rui Maximo Esteves, Rui Pais, and Chunming Rong. K-Means Clustering in the Cloud– a Mahout Test. IEEE Workshops of International Conference on Advanced Information Networking and Applications, pages 514–519, 2011. [29] Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He, and Qiong Luo. Frequent Itemset Mining on Graphics Processors. Proceedings of the 5th International Workshop on Data Management on New Hardware (DaMoN’09), pages 34–42, 2009. [30] Robson Leonardo Ferreira Cordeiro, Caetano Traina, Junior, Agma Juci Machado Traina, Julio Lopez, U. Kang, and Christos Faloutsos. Clustering Very Large Multi-dimensional´ Datasets with MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), pages 690–698, 2011. [31] Volker Gaede and Oliver Gunther. Multidimensional Access Methods.ぴ ACM Computing Surveys (CSUR’98), pages 170–231, 1998. [32] Antonio Gomariz, Manuel Campos, Roque Marin, and Bart Goethals. ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), pages 50–61, 2013. [33] R.C Gonzalez and R.E. Woods. Digital Image Processing. Pearson Education, 2008. [34] Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. Building LinkedIn’s Real-time Activity Data Pipeline. IEEE Data Engineering Bulletin (Data Eng. Bull.’12), pages 33–45, 2012. [35] Valerie Guralnik and George Karypis. Parallel Tree-Projection-based Sequence Mining Algorithms. Parallel Computing (PARALLEL COMPUT’04), pages 443–472, 2004. [36] Jiawei Han, Jian Pei, Behzad Mortazavi-Asl, Qiming Chen, Umeshwar Dayal, and MeiChun Hsu. FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00), pages 355–359, 2000. [37] Jiawei Han, Jian Pei, and Xifeng Yan. Sequential Pattern Mining by Pattern-Growth: Principles and Extension. Foundations and Advances in Data Mining, Springer-Verlag, pages 183–220, 2005. [38] John A. Hartigan. Clustering Algorithms. Wiley, 1975. [39] Alexander Hinneburg, Er Hinneburg, and Daniel A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. Proceedings of the 4rd International Conference on Knowledge Discovery and Data Mining (KDD’98), pages 58–65, 1998. [40] Joshua Ho, Lior Lukov, and Sanjay Chawla. Sequential Pattern Mining with Constraints on Large Protein Databases. Proceedings of the 12th International Conference on Management of Data (COMAD’05), pages 89–100, 2005. [41] Jen-Wei Huang, Su-Chen Lin, and Ming-Syan Chen. DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud. 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’10), pages 27–34, 2010. [42] Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou, and Ming-Syan Chen. A General Model for Sequential Pattern Mining with a Progressive Database. IEEE Transactions on Knowledge and Data Engineering (TKDE’08), pages 1153–1167, 2008. [43] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. ACM SIGOPS Operating Systems Review (SIGOPS’07), pages 59–72, 2007. [44] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys (CSUR’99), pages 264–323, 1999. [45] Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. Scalable Density-Based Distributed Clustering. In Knowledge Discovery in Databases (PKDD’04), pages 231–244. 2004. [46] Xiaonan Ji, James Bailey, and Guozhu Dong. Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints. Knowledge and Information Systems (KAIS’07), pages 259–286, 2007. [47] Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: A Distributed Messaging System for Log Processing. NetDB Workshop, 2011. [48] YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman. Scalable Clustering Algorithm for N-Body Simulations in a SharedNothing Cluster. Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM’10), pages 132–150, 2010. [49] VanceChiang-Chi Liao and Ming-Syan Chen. DFSP: a Depth-First SPelling Algorithm for Sequential Pattern Mining of Biological Sequences. Knowledge and Information Systems (KAIS’14), pages 623–639, 2014. [50] Woong-Kee Loha and Hwanjo Yub. Fast Density-based Clustering through Dataset Partition using Graphics Processing Units. Information Sciences, pages 94–112, 2014. [51] Congnan Luo and SoonM. Chung. A Scalable Algorithm for Mining Maximal Frequent Sequences Using a Sample. Knowledge and Information Systems (KAIS’08), pages 149– 179, 2008. [52] Nizar R. Mabroukeh and C. I. Ezeife. A Taxonomy of Sequential Pattern Mining Algorithms. ACM Computing Surveys (CSUR’10), pages 1–41, 2010. [53] Rashmi V Mane. A Comparative Study of SPAM and PrefixSpan Sequential Pattern Mining Algorithm for Protein Sequences. Advances in Computing, Communication, and Control (ICAC3’13), pages 147–155, 2013. [54] Iris Miliaraki, Klaus Berberich, Rainer Gemulla, and Spyros Zoupanos. Mind the Gap: Large-scale Frequent Sequence Mining. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD’13), pages 797–808, 2013. [55] Panagiotis Papapetrou, George Kollios, Stan Sclaroff, and Dimitrios Gunopulos. Mining Frequent Arrangements of Temporal Intervals. Knowledge and Information Systems (KAIS’09), pages 133–171, 2009. [56] M. Parimala and S. Sathiyabama. SPMLS: An Efficient Sequential Pattern Mining Algorithm with candidate Generation and Frequency Testing. International Journal on Computer Science and Engineering (IJCSE’12), page 601, 2012. [57] Jung-Me Park, Carl G. Looney, and Hui-Chuan Chen. Fast Connected Component Labeling Algorithm Using A Ddivide and Conquer Technique. Computers and Their Applications, pages 4–7, 2000. [58] Jian Pei, Jiawei Han, Behzad Mortazavi-asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei chun Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01), pages 215–224, 2001. [59] Adam Perer and Fei Wang. Frequence: Interactive Mining and Visualization of Temporal Frequent Event Sequences. Proceedings of the 19th ACM International Conference on Intelligent User Interfaces (IUI’14), pages 153–162, 2014. [60] Majed Sahli, Essam Mansour, and Panos Kalnis. ACME: A Scalable Parallel System for Extracting Frequent Patterns From a Very Long Sequence. The VLDB Journal (VLDBJ’14), pages 871–893, 2014. [61] Gholamhosein Sheikholeslami, Surojit Chatterjee, and Aidong Zhang. WaveCluster: A Multi-Resolution. Clustering Approach for Very Large Spatial Databases. Proceedings of the 24th International Conference on Very Large Data Bases (VLDB’98), 1998. [62] Bai-En Shie, Hui-Fang Hsiao, and VincentS. Tseng. Efficient Algorithms for Discovering High Utility User Behavior Patterns in Mobile Commerce Environments. Knowledge and Information Systems (KAIS’13), pages 363–387, 2013. [63] Ramakrishnan Srikant and Rakesh Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. Proceedings of the 5th International Conference on Extending Database Technology (EDBT’96), pages 3–17, 1996. [64] Tom White. Hadoop: The Definitive Guide. O’Reilly Media, 2009. [65] Ke Wang, Yabo Xu, and Jeffrey Xu Yu. Scalable Sequential Pattern Mining for Biological Sequences. Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM’04), pages 178–187, 2004. [66] Shen Wang and Haimonti Dutta. PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework. Center for Computational Learning Systems Technical Report, 2011. [67] Wei Wang, Jiong Yang, and Richard R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97), pages 186–195, 1997. [68] Xueqiang Wang, Jing Wang, Tengjiao Wang, Hongyan Li, and Dongqing Yang. Parallel Sequential Pattern Mining by Transaction Decomposition. International Conference on Fuzzy Systems and Knowledge Discovery (FSKD’10), pages 1746–1750, 2010. [69] Benjamin Welton, Evan Samanas, and Barton P Miller. Mr. Scan: Extreme Scale Densitybased Clustering using a Tree-based Network of GPGPU Nodes. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13), page 84, 2013. [70] Lilian Weng, Filippo Menczer, and Yong-Yeol Ahn. Virality Prediction and Community Structure in Social Networks. Scientific Reports, 2013. [71] Kesheng Wu, Ekow J Otoo, and Arie Shoshani. Compressing Bitmap Indexes for Faster Search Operations. Proceedings of 14th International Conference on Scientific and Statistical Database Management (SSDBM’02), pages 99–108, 2002. [72] Xiaowei Xu, Jochen Jager, and Hans-Peter Kriegel. A Fast Parallel Clustering Algorithmぴ for Large Spatial Databases. Data Mining and Knowledge Discovery (DMKD’99), pages 263–290, 1999. [73] Dongjin Yu, Wei Wu, Suhang Zheng, and Zhixiang Zhu. BIDE-Based Parallel Mining of Frequent Closed Sequences with MapReduce. Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP’12), pages 177–186, 2012. [74] Dongjin Yu, Qi Zhu, Jianhua Shao, Chang Li, Youwei Yuan, and Wanqing Li. Parallel Execution of Data-intensive Web Services Based on Data-flow Constructs and I/O Operation Ratio. International Journal of Database Theory and Application (IJDTA’14), pages 129–138, 2014. [75] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI’12), pages 2–2, 2012. [76] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica. Discretized Streams: An Efficient and Fault-tolerant Model for Stream Processing on Large Clusters. Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Ccomputing (HotCloud’12), pages 215–224, 2012. [77] Mohammed J. Zaki. Efficient Enumeration of Frequent Sequences. Proceedings of the 7th ACM International Conference on Information and Knowledge Management (CIKM’98), pages 68–75, 1998. [78] Mohammed J. Zaki. Parallel Sequence Mining on Shared-Memory Machines. Journal of Parallel and Distributed Computing (JPDC’01), pages 401–426, 2001. [79] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery (DMKD’97), pages 141–182, 1997. [80] Qiankun Zhao and Sourav S Bhowmick. Sequential Pattern Matching: A Survey. ITechnical Report (CAIS Nayang Technological University Singapore), pages 1–26, 2003. [81] Weizhong Zhao, Huifang Ma, and Qing He. Parallel K-Means Clustering Based on MapReduce. Proceedings of the 1st International Conference on Cloud Computing (CloudCom’09), pages 674–679, 2009.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/78090	-
dc.description.abstract	隨著網路應用與行動裝置的快速發展，使用者每日不斷地的產生巨量的數據資料。這一些龐大的數據資料可以透過巨量資料分析來獲得更多的洞見，提供用以更深入不同領域的專業知識，提升該領域的發展。然而，傳統的資料分析演算法皆是在單一電腦上作設計之考量。在處理巨量資料時，便可能無法滿足資料分析時所需要的硬體資源，例如：記憶體容量限制、硬碟空間容量與不同的計算模式下所需的硬體需求…等。考量這種種的因素皆增加了我們在設計巨量資料分析演算法時的困難度。因此，我們針對植基於雲端運算下之巨量資料探勘演算法在設計上所面臨的問題作深入地探討。本篇論文中，主要面臨三個大挑戰。第一，資料延展性 (Data Scalability)：在雲端運算的環境中，輸入的資料必須先轉換為鍵 (Key) 與值 (Value) 的資料表示方法。這對於傳統資料探勘演算法而言，是十分不直覺的設計。因此，這樣的設計考量將造成研究人員於設計一個具高度可擴展性的巨量資料探勘演算法時之複雜度。第二，負載平衡 (Load Balancing)：資料輸入後通常被分割成許多個資料區塊 (Data Partition) 接著透過雲端運算的平台將每一個資料區塊分配給不同的機器作資料分析。然而，在不同的機器上，所要負擔的工作負載可能不盡相同。這樣的情況將導致巨量資料探勘演算法設計的困難度。第三，資料通訊上之先天限制 (Constraints of Data Communication)：現今雲端運算的平台，其計算模式是由數個階段 (Phase) 所構成，例如：map 階段與 reducer 階段。然而，在同一個階段中，不同的機器之間不能直接作通訊。因此，在設計我們的分散式巨量資料探勘演算法時，雲端運算環境與其計算模式 (Computing Model) 的先天限制必須被考量。在本論文中，我們指出在雲端運算平台中，傳統資料探勘演算法在設計上所面臨的挑戰並針對這些挑戰提出適合的雲端計算模式與其相對應的巨量資料探勘演算法。我們首先提出兩個植基於 MapReduce 計算模式 (MapReduce Model) 的分散式資料分群演算法 (Distributed Data Clustering Algorithm)，分別為在雲端運算環境中以密度為基礎的分散式分群演算法 (Distributed Density-based Clustering) 與以格網為基礎的分散式分群演算法 (Distributed Grid-based Clustering)。所提出的演算法將巨量的輸入資料拆成數個大小幾乎相同的資料區塊並在分散式資料探勘處理時，盡量減少資料通訊上的成本負擔。如此一來，便可以更有效的對巨量的資料集作有效的分群。在第 3 章中，我們發現 MapReduce 計算模式下，諸如遞迴此類的探勘演算法將會遭遇到下述情況。例如：在同一個資料處理階段中，難以達到資料共享之目的且 MapReduce 計算模式無法反覆地處理部份的分析結果，其需要透過額外階段來完成部份分析結果之處理。為了克服這個問題，我們針對一個此類別的探勘演算法-頻繁循序樣式分析 (Frequent Sequential PAttern Mining) 提出了分散式的頻繁循序樣式探勘演算法 (Distributed Frequent Sequential PAttern Mining Algorithm in the Cloud (SPAMC))。我們採用反覆式 MapReduce 計算模式 (Iterative MapReduce Model)，克服上述問題，以高效率的處理巨量的循序資料。接下來，在第 4 章中，根據前一章節的實驗結果，我們發現即使採用了反覆式 MapReduce 計算模式，但是在頻繁循序式樣探勘(Frequent Sequential Pattern Mining) 中，反覆初始 MapReduce 的成本太高。因此，我們提出了一個新的頻繁循序式樣分析-採用均勻分散式詞彙循序樹之演算法 (Distributed Frequent Sequential PAttern Mining in the Cloud -Uniform Distributed Lexical sequence Tree (SPAMC-UDLT))。該演算法可於新提出的串流式 MapReduce 計算模式 (Streaming MapReduce Model) 中運行，以解決資料重載 (Data Reloading)、工作負載不平衡 (Unbalanced Workload) 與反覆式 MapReduce 計算模式中 mapper 之間缺乏有效通訊方法 (Lack of Communication between mappers) 之問題。在最後章節中，我們探討一些可能需要大量計算資源的資料探勘演算法模組作討論。為了克服計算資源的需求，我們設計一個運行於異質性雲端運算 (Heterogeneous cloud computing) 環境中高度可延展之以密度為基礎的分散式分群演算法 (Highly Scalable Distributed Density-based Clustering (HiClus))，其可透過異質性雲端運算的優勢，利用 GPU 上成千上百的執行緒 (thread) 作平行的資料分析運算，以滿足此類演算法需要大量計算資源的需求。在本論文中實驗結果顯示，我們所提出的數個分散式巨量資料探勘演算法與新的計算模式皆可以達到非常高的資料延展性與加速在巨量資料集下的資料探勘處理。	zh_TW
dc.description.abstract	The amount of user-generated data from Internet applications has grown to be very vast, and big data analysis of this data can provide insights into domain knowledge. However, the traditional analysis approach of using a single machine may be inadequate for big data, given its high resource requirement. Limitations in hardware such as memory space, disk space, and computing model usually increase the difficulty of algorithm design in big data analysis. The challenges faced for mining big data in the cloud are threefold. (1) Data Scalability: in cloud computing, the mining algorithms have to frame in terms of keys and values, which is not natural for data analysis. This makes to design a high scalability algorithm much more complicated, (2) Load Balancing: the input data is usually split into many data partitions, and then each data partition is analyzed on a different machine in cloud computing. However, differing workloads on different machines leads the difficulty of algorithm design, and (3) Constraints of Data Communication: machines cannot directly communicate with each other in the same phase (map phase or reduce phase). Thus, we need to consider the natural limitation of cloud computing model in our distributed mining algorithm design. In this dissertation, we point out the design challenges pertaining to the cloud and propose corresponding mining algorithms with appropriate computing models for mining large datasets. InChapter2, we first devise distributed clustering algorithms using MapReduce model and propose two distributed clustering algorithms, namely Distributed Density-based Clustering (DDC) and Distributed Grid-based Clustering (DGC), which split a large dataset into small partitions of almost equal size and minimize the communication cost to efficiently cluster a large dataset. In Chapter 3, we found that the recursive algorithms in the MapReduce model may suffer from the inability on sharing data within the same processing phase. Besides, the MapReduce job cannot deal with the intermediate sub-results on distributed machines in one MapReduce round. To overcome this problem, we propose a distributed frequent sequential pattern mining algorithm, Sequential PAttern Mining Algorithm in the Cloud (SPAMC) adopting in iterative MapReduce model to deal with large amounts of data efficiently. Chapter4 is based on the results of Chapter3. Even if the iterative MapReduce model can process frequent sequential pattern mining algorithms, the initialization cost of MapReduce jobs is still too high. Besides, there are extension challenges, including data reloading, unbalanced workload, and lack of communication between mappers, associated with the iterative MapReduce model. To address these challenges mentioned above, we propose a novel distributed frequent sequential pattern mining algorithm, Sequential Pattern Mining in the Cloud -Uniform Distributed Lexical sequence Tree (SPAMCUDLT) with streaming MapReduce model to mining frequent sequential patterns. Finally, in Chapter 5, we explore a few components of mining algorithms that may demand high computational power. To solve this problem, we designed a distributed density-based clustering algorithm in heterogeneous infrastructure, namely Highly scalable Density-based Clustering with heterogeneous cloud (HiClus), which can exploit thousands of GPU threads to parallel perform mining tasks and take advantage of heterogeneous cloud computing to solve data-intensity problem. The experimental results indicate that our proposed distributed mining algorithms and computing models achieve extremely high scalability and speed up the mining processing in big data.	en
dc.description.provenance	Made available in DSpace on 2021-07-11T14:41:55Z (GMT). No. of bitstreams: 1 ntu-105-D96944011-1.pdf: 2102712 bytes, checksum: b542e419afce79356d9052773141ad3c (MD5) Previous issue date: 2016	en
dc.description.tableofcontents	1 Introduction 10 1.1 Motivation and Significance of the Research . . . . . . . . . . . . . . . . . . . 10 1.2 Overview of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.1 Reducing Communication and Merging Overheads for Distributed Clustering Algorithms on the Cloud . . . . . . . . . . . . . . . . . . . . . 13 1.2.2 Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.2.3 Distributed and Scalable Sequential Pattern Mining through Stream Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.2.4 HiClus: Highly Scalable Density-based Clustering with Heterogeneous Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 16 2 Reducing Communication and Merging Overheads for Distributed Clustering Algorithms on the Cloud 17 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1 Overview of DBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.2 Overview of Grid-based Clustering . . . . . . . . . . . . . . . . . . . 23 2.3.3 Overview of MapReduce Model . . . . . . . . . . . . . . . . . . . . . 24 2.4 Distributed Density-based Clustering (DDC) . . . . . . . . . . . . . . . . . . . 25 2.4.1 Partition Phase of DDC . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.2 Clustering Phase of DDC . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.4.3 Merging Phase of DDC . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5 Distributed Grid-based Clustering (DGC) . . . . . . . . . . . . . . . . . . . . 28 2.5.1 Partition Phase of DGC . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.5.2 Clustering Phase of DGC . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.5.3 Merging Phase of DGC . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.6.1 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.6.2 Correctness of DDC . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.7.1 Comparison with Existing Methods . . . . . . . . . . . . . . . . . . . 33 2.7.2 Performance and Scalability . . . . . . . . . . . . . . . . . . . . . . . 34 2.7.3 Effect to number of Machines . . . . . . . . . . . . . . . . . . . . . . 35 2.7.4 Experience on different Cloud-based Algorithms . . . . . . . . . . . . 36 2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3 Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud 42 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.1 Review of SPAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3 SPAMC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.3.1 Framework of SPAMC . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.3.2 Scanning Phase of SPAMC . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.3 Mining Phase of SPAMC . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.4.1 Comparison with Existing Methods . . . . . . . . . . . . . . . . . . . 56 3.4.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.3 Sensitivity to Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4 Distributed and Scalable Sequential Pattern Mining through Stream Processing 64 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2.1 Sequential Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.2.3 Highlights on MapReduce Models . . . . . . . . . . . . . . . . . . . . 71 4.3 Review on SPAM and SPAMC . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3.1 SPAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3.2 SPAMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4 SPAMC-UDLT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4.1 Scanning Phase of SPAMC-UDLT . . . . . . . . . . . . . . . . . . . . 82 4.4.2 Mining Phase of SPAMC-UDLT . . . . . . . . . . . . . . . . . . . . . 83 4.5 Algorithm Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.5.1 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.5.2 Space Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.6.1 Datasets and System Environment . . . . . . . . . . . . . . . . . . . . 94 4.6.2 Comparison with Existing Methods on Single Machine . . . . . . . . . 94 4.6.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.6.4 Speedup of SPAMC-UDLT . . . . . . . . . . . . . . . . . . . . . . . . 96 4.6.5 Real Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.6.6 Resource Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.6.7 Sensitivity to Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5 HiClus: Highly Scalable Density-based Clustering with Heterogeneous Cloud 103 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.3.1 Hybrid MapReduce Model . . . . . . . . . . . . . . . . . . . . . . . . 107 5.3.2 NVidia GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . 108 5.4 HiClus Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.4.1 Adaptive Distributed KD-Tree (DistKD-tree) . . . . . . . . . . . . . . 109 5.4.2 Distributed Partition Phase . . . . . . . . . . . . . . . . . . . . . . . . 110 5.4.3 Distributed Clustering Phase . . . . . . . . . . . . . . . . . . . . . . . 112 5.4.4 Merging Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.5.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.5.2 Comparison to Previous Works . . . . . . . . . . . . . . . . . . . . . . 115 5.5.3 Effect to number of Machines with GPU . . . . . . . . . . . . . . . . 115 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 6 Conclusion and Future Work 118 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Bibliography 123
dc.language.iso	en
dc.subject	頻繁循序式樣探勘	zh_TW
dc.subject	MapReduce	zh_TW
dc.subject	雲端運算	zh_TW
dc.subject	巨量資料	zh_TW
dc.subject	GPGPU	zh_TW
dc.subject	分群	zh_TW
dc.subject	資料探勘	zh_TW
dc.subject	分散式演算法	zh_TW
dc.subject	Big Data	en
dc.subject	Cloud Computing	en
dc.subject	MapReduce	en
dc.subject	Distributed Algorithm	en
dc.subject	Data Mining	en
dc.subject	Clustering	en
dc.subject	Frequent Sequential Patterns Mining	en
dc.subject	GPGPU	en
dc.title	基於雲端運算模式之巨量資訊分析	zh_TW
dc.title	Big Data Analytic in the Cloud Computing Models	en
dc.type	Thesis
dc.date.schoolyear	104-2
dc.description.degree	博士
dc.contributor.oralexamcommittee	王凡(Farn Wang),葉彌妍(Mi-Yen Yeh),陳維超(Wei-Chao Chen),沈之涯(Chih-Ya Shen)
dc.subject.keyword	巨量資料,雲端運算,MapReduce,分散式演算法,資料探勘,分群,頻繁循序式樣探勘,GPGPU,	zh_TW
dc.subject.keyword	Big Data,Cloud Computing,MapReduce,Distributed Algorithm,Data Mining,Clustering,Frequent Sequential Patterns Mining,GPGPU,	en
dc.relation.page	133
dc.identifier.doi	10.6342/NTU201602425
dc.rights.note	有償授權
dc.date.accepted	2016-08-20
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	資訊網路與多媒體研究所	zh_TW
dc.date.embargo-lift	2026-08-22	-
Appears in Collections:	資訊網路與多媒體研究所

Files in This Item:

File	Size	Format
ntu-105-D96944011-1.pdf Restricted Access	2.05 MB	Adobe PDF	View/Open

Show simple item record

DSpace JSPUI

DSpace preserves and enables easy and open access to all types of digital content including text, images, moving images, mpegs and data sets