Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45747
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 林守德(Shou-De Lin) | |
dc.contributor.author | Hung-Che Lai | en |
dc.contributor.author | 賴弘哲 | zh_TW |
dc.date.accessioned | 2021-06-15T04:45:28Z | - |
dc.date.available | 2010-08-09 | |
dc.date.copyright | 2010-08-09 | |
dc.date.issued | 2010 | |
dc.date.submitted | 2010-08-06 | |
dc.identifier.citation | [1] L. Akoglu, P. Chau, C. Faloutsos, U. Kang, K. Maruhashi, M. McGlohon, P. Stancioli, and C. E. Tsourakakis. Project Pegasus, 2009.
[2] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
[3] B. Cai and W. Xue. X-RIME: Hadoop based large scale social network analysis, 2009. Available at http://xrime.sourceforge.net/.
[4] R. Chen, X. Weng, B. He, and M. Yang. Large graph processing in the cloud. SIGMOD (demo), 2010.
[5] R. Chen, X. Weng, B. He, M. Yang, C. B., and L. X. On the efficiency and programmability of large graph processing in the cloud. Microsoft Research Tech Report, 2010.
[6] J. Cohen. Graph twiddling in a MapReduce world. Computing in Science and Engineering, 11(4):29–41, 2009.
[7] T. Cormen. Introduction to Algorithms. The MIT Press, 2001.
[8] D. Cutting. Apache Hadoop, 2007. Available at http://hadoop.apache.org/.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[11] P. Erdős and A. Rényi. On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pages 17–61, 1960.
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 29–43, New York, NY, USA, 2003. ACM.
[13] D. Gregor and A. Lumsdaine. The Parallel BGL: A generic library for distributed graph computations. Parallel Object-Oriented Scientific Computing (POOSC), 2005.
[14] J. Hebert. Hadoop examples - PageRank, 2007. Available at http://code.google.com/p/canopy-clustering/.
[15] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. Radius plots for mining tera-byte scale graphs: Algorithms, patterns, and observations. SIAM International Conference on Data Mining, 2010.
[16] U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system. IEEE International Conference on Data Mining, pages 229–238, 2009.
[17] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM, 2010.
[18] J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. Knowledge Discovery in Databases: PKDD 2005, pages 133–145, 2005.
[19] J. Lin. The curse of Zipf and limits to parallelization: A look at the stragglers problem in MapReduce. Proc. LSDS-IR, 2009.
[20] J. Lin and C. Dyer. Data-intensive text processing with MapReduce. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, pages 1–2. Association for Computational Linguistics, 2009.
[21] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SPAA '09: Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, pages 48–48, New York, NY, USA, 2009. ACM.
[22] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD '10: Proceedings of the 36th SIGMOD International Conference on Management of Data, New York, NY, USA, 2010. ACM.
[23] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.
[25] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[26] T. Schank. Algorithmic aspects of triangle-based network analysis. Ph.D. thesis, University of Karlsruhe, February 2007.
[27] J. Scott. Social Network Analysis: A Handbook. Sage Publications, second edition, 2000.
[28] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM '06: Proceedings of the Sixth International Conference on Data Mining, pages 613–622, Washington, DC, USA, 2006. IEEE Computer Society.
[29] J. Venner. Pro Hadoop. Apress, Berkeley, CA, USA, 2009. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45747 | - |
dc.description.abstract | The goal of large-scale graph mining is to gain a deep understanding of a network's structure and the patterns hidden within it. When processing large graphs, the enormous time and space complexity degrades the efficiency of analysis. Cloud computing is currently being applied to this efficiency problem, and MapReduce is the most popular cloud computing framework today. In this work, we propose a MapReduce-based framework for efficient large-scale graph analysis, called the MapReduce Graph Mining Framework (MGMF), which comprises basic functions, MapReduce algorithms, and optimization methods. We classify graph mining algorithms into four categories and, on top of this framework, design and implement each category of algorithm efficiently and completely.
PEGASUS is the current state-of-the-art MapReduce-based graph analysis tool; MGMF's PageRank implementation runs 3 to 20 times faster than PEGASUS, and the framework MGMF provides is more efficient and supports a wider variety of algorithms. Experiments conducted on large real-world networks with over one billion links show that MGMF is both efficient and scalable. In addition, we examine the computational capability of a single machine to further explore when cloud computing is the appropriate choice. Our MGMF implementation is released to the public as an open-source tool; the library and test data are available at http://mslab.csie.ntu.edu.tw/~noahsark/MGMF/. | zh_TW |
dc.description.abstract | Graph mining is a popular technique for discovering the hidden structure or important instances in a graph. However, when dealing with large-scale graphs with billions of entities, it generally raises concerns about computational efficiency. Cloud computing, on the other hand, is often resorted to as a feasible way to ease the computational burden. In this work, we present the MapReduce Graph Mining Framework (MGMF), an open-source graph mining library that aims to provide a robust and efficient MapReduce-based graph mining tool. We categorize graph mining algorithms into four categories and design a different MapReduce framework for each. Compared with PEGASUS, the state-of-the-art library for graph mining on MapReduce, our experimental results show that our package is 3 to 20 times more efficient, and that we cover a more complete set of graph mining algorithms. We also validate our framework on billion-scale networks to show that it scales with the number of machines. Finally, we explore the capability of a single machine on large-scale graph mining problems, to provide more insight into when a cloud computing cluster is warranted for graph-mining algorithms. We have made our implementation available as an open-source library at http://mslab.csie.ntu.edu.tw/~noahsark/MGMF/. | en |
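The abstract names PageRank as the headline workload compared against PEGASUS. As a rough illustration of how PageRank maps onto the MapReduce model (a plain-Python simulation, not the thesis's MGMF or Hadoop code; the graph, function names, and damping constant are illustrative assumptions), one iteration can be phrased as a map over edges that emits rank shares, followed by a reduce that sums the shares arriving at each node:

```python
from collections import defaultdict

# Toy graph as an adjacency list: node -> list of out-neighbours.
graph = {1: [2, 3], 2: [3], 3: [1]}
ranks = {n: 1.0 / len(graph) for n in graph}  # uniform initial ranks
DAMPING = 0.85

def map_phase(graph, ranks):
    """Map: each node splits its rank evenly among its out-neighbours."""
    for node, neighbours in graph.items():
        share = ranks[node] / len(neighbours)
        for dst in neighbours:
            yield dst, share  # key-value pair shuffled to the reducer

def reduce_phase(pairs, n_nodes):
    """Reduce: sum the incoming shares per node, then apply damping."""
    sums = defaultdict(float)
    for dst, share in pairs:
        sums[dst] += share
    return {n: (1 - DAMPING) / n_nodes + DAMPING * s for n, s in sums.items()}

for _ in range(20):  # iterate until roughly converged
    ranks = reduce_phase(map_phase(graph, ranks), len(graph))

print(sorted(ranks.items()))
```

On a cluster, the `yield` in the map phase corresponds to emitting key-value pairs that the framework shuffles by destination node, so each reducer sees all rank shares for one node; the whole iteration is one MapReduce job, repeated until convergence.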
dc.description.provenance | Made available in DSpace on 2021-06-15T04:45:28Z (GMT). No. of bitstreams: 1 ntu-99-R97922028-1.pdf: 1539892 bytes, checksum: c95b3b543d283263e0483652c1a29a1b (MD5) Previous issue date: 2010 | en |
dc.description.tableofcontents | Acknowledgements i
Abstract (in Chinese) ii
Abstract iii
1 Introduction 1
2 Backgrounds 5
2.1 MapReduce, Hadoop, and File Input Format 5
2.2 PEGASUS: Matrix-Vector Multiplication on MapReduce 7
3 MGMF: MapReduce Graph Mining Framework 11
3.1 MapReduce-based algorithms for MVM: One-Stage MVM and Fast MVM 15
3.2 MVM algorithms: Traverse All and Traverse Partial 19
3.3 One-Hop Algorithms 21
3.4 Multi-Hop Algorithms 22
4 Experiments 25
4.1 Evaluating a Traverse All algorithm (PageRank) 27
4.2 Evaluating a Traverse Partial algorithm (BFS) 29
4.3 Evaluating a One-Hop algorithm (Out Degree Distribution) 29
4.4 Evaluating a Multi-Hop algorithm (Betweenness Centrality) 30
5 Discussions 36
5.1 The Capability of Single Process 36
5.2 The Efficiency of File Input Format 40
6 Related Works 43
6.1 MPI-based approach 43
6.2 BSP-based approach 44
6.3 Breadth First Search and PageRank on MapReduce 44
7 Conclusion 46
Bibliography 48 | |
dc.language.iso | zh-TW | |
dc.title | 基於MapReduce的大型網路探勘之應用與評估 | zh_TW |
dc.title | Exploiting and Evaluating MapReduce for Large-scale Graph Mining | en |
dc.type | Thesis | |
dc.date.schoolyear | 98-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 林智仁(Chih-Jen Lin),劉邦峰(Pang-Feng Liu),洪士灝(Shih-Hao Hung) | |
dc.subject.keyword | Cloud Computing, High Performance, Large-Scale Graph Mining, MapReduce, Software Framework | zh_TW |
dc.subject.keyword | Cloud Computing,High Performance,Large-Scale Graph Mining,MapReduce,Software Framework, | en |
dc.relation.page | 52 | |
dc.rights.note | Paid-access authorization | |
dc.date.accepted | 2010-08-06 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-99-1.pdf (currently not authorized for public access) | 1.5 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.