Optimization of Hadoop System Configuration Parameters

Ye-Qi Zhuo; 卓也琦

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54518

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	廖世偉
dc.contributor.author	Ye-Qi Zhuo	en
dc.contributor.author	卓也琦	zh_TW
dc.date.accessioned	2021-06-16T03:01:41Z	-
dc.date.available	2025-06-29
dc.date.copyright	2015-08-11
dc.date.issued	2015
dc.date.submitted	2015-07-02
dc.identifier.citation	[1] Apache hadoop nextgen mapreduce(yarn). http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarnsite/YARN.html. [2] Hadoop. https://hadoop.apache.org/. [3] Mape. http://en.wikipedia.org/wiki/Mean_absolute_percentage_error. [4] S. G. R. S. Alexander Zien, Nicole Kramer. The feature importance ranking measure. arXiv:0906.4258v1, 2009. [5] S. B.A.Kitchenham, L.M.Pickard and M.J.Shepperd. What accuracy statistics really measure. [6] J. Bennett and S. Lanning. The netflix prize. In Proceedings of KDD cup and workshop, page 35, 2007. [7] Z. W. C. Weng, M. Li and X. Lu. Automatic performance tuning for the virtualized cluster system. In Distributed Computing Systems ICDCS’09. 29th IEEE International Conference on, 2009. [8] H. A. Carneiro and E. Mylonakis. Google trends: a web-based tool for real-time surveillance of disease outbreaks. Clinical infectious diseases, 49:1557–1564, 2009. [9] M. S. D. R. Jones and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13:455–492, 1998. [10] D.Heger. Hadoop performance tuning-a pragmatic & iterative approach. CMG Journal, pages 97–113, 2013. [11] G. L. N. B. L. D. F. B. C. e. a. H. Herodotou, H. Lim. Starfish: A self-tuning system for big data analytics. In CIDR, pages 261–272, 2011. [12] M. Hall. Hadoop: From open source project to big data ecosystem. 2010. [13] H. Herodotou. Hadoop performance models. Proc. of the VLDB Endowment, 4:1111–1122, 2011. [14] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. arXiv preprint arXiv:1106.0940, 2011. [15] S. B. Joshi. Apache hadoop performance-tuning methodologies and best practices. In Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, 2012. [16] L.Breiman. Random forests. Machine learning, 2001. [17] J. Lin and C. Dyer. Data-intensive text processing with mapreduce. Synthesis Lectures on Human Language Technologies, 3, 2010. [18] H. H. Liu. Software performance and scalability: a quantitative approach, volume 7. John Wiley & Sons, 2011. [19] P. G. Louis Wehenkel, Damien Ernst. Ensembles of extremely randomized trees and some generic applications. RTE-VT workshop, 2006. [20] A. J. B. N. B. Rizvandi, A. Y. Zomaya and J. Taheri. On modeling dependency between mapreduce configuration parameters and total execution time. arXiv preprint arXiv:1203.0651, 2012. [21] D. E. P. Geurts and L. Wehenkel. Extremely randomized trees. Machine learning, 63, 2006. [22] J. D. T. X. S. Huang, J. Huang and B. Huang. The hibench benchmark suite: Characterization of the mapreduce-based data analysis. Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on, 2010. [23] K. L. S. Islam, J. Keung and A. Liu. Empirical prediction models for adaptive resource provisioning in the cloud. Future Generation Comp., 28, 2012. [24] B. Selman and C. P. Gomes. Hillclimbing search. Encyclopedia of Cognitive Science, 2006. [25] S.Joshi. Hadoop tuning guide. Advanced Micro Devices, 2012. [26] T. K. M. S. T. Hothorn, P. Buhlmann and B. Hofner. Model-based boosting 2.0. The Journal of Machine Learning Research, 11, 2010. [27] T. White. Hadoop: The definitive guide. O’Reilly Media, Inc, 2012. [28] Y. Z. C. H. X. Liu, J. Han and X. He. Implementing webgis on hadoop: A case study of improving small file i/o performance on hdfs. In Cluster Computing and Workshops, 2009. [29] T. Ye and S. Kalyanaraman. A recursive random search algorithm for large-scale network parameter configuration. Department of Electrical, Computer and System Engineering. [30] T. Ye and S. Kalyanaraman. A recursive random search algorithm for large-scale network parameter configuration. ACM SIGMETRICS Performance Evaluation Review, 31, 2003.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54518	-
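dc.description.abstract	In the current era of big data, the Hadoop system plays a crucial role in analyzing and applying large data sets. We want to tune the Hadoop system parameters to their best state without spending more on hardware upgrades. The topic of my master's thesis is therefore the optimization of Hadoop system parameters, and the performance goal targeted here is reducing the execution time of a single job. I adopt a three-stage model: (1) identify, among the many parameters, those with the greatest impact on the system, observing map and reduce separately and selecting 20 parameters as the main parameters to tune; (2) build a prediction model for execution time: using these 20 parameters, collect more job execution times together with their corresponding parameter values as the basis for modeling, apply machine learning methods to build the model, and select the most suitable three-layer model; (3) build an optimization model for the system: each optimization step randomly selects parameters within the configured parameter ranges and feeds them into the previously built prediction model to predict the execution time, and the optimization model I designed eventually finds the parameter combination with the shortest execution time. I selected four programs in total and used them to validate the combination of methods above.	zh_TW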
dc.description.abstract	Hadoop has become very popular in recent years. It is a software framework for the distributed processing of large-scale data sets on a cluster of machines using the MapReduce programming model. However, Hadoop users still face two essential challenges in managing a Hadoop system: (1) tuning the parameters appropriately; and (2) dealing with the dozens of configuration parameters that affect its performance. This paper focuses on optimizing Hadoop MapReduce job performance. Our approach has two key models: Prediction and Optimization. The Prediction model estimates the execution time of a MapReduce job, and the Optimization model searches for approximately optimal configuration parameters by invoking the Prediction model repeatedly. This analytical method chooses approximately optimal configuration parameters to improve users' job performance. Besides configuration parameter tuning, the relevance of each parameter and the evaluation of our methods are also discussed in this paper. Our work may provide users with a better method for improving Hadoop system performance while saving hardware resources.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T03:01:41Z (GMT). No. of bitstreams: 1 ntu-104-R02922142-1.pdf: 1429886 bytes, checksum: 36a1d7a25d2c80284e313512388c3dfe (MD5) Previous issue date: 2015	en
dc.description.tableofcontents	Chapter 1 Introduction (p. 1); Chapter 2 Related Work (p. 2); Chapter 3 Hadoop MapReduce System (p. 6); 3.1 Architecture of Hadoop MapReduce System (p. 6); 3.2 Execution Flow of a MapReduce Job (p. 7); 3.3 Classification of Configuration Parameters (p. 8); Chapter 4 Design of Experiments (p. 12); 4.1 Configuration Parameters for Modeling (p. 12); 4.2 The Architecture of Predictor and Optimizer (p. 15); 4.3 Design of Predictor (p. 16); 4.4 Design of Optimizer (p. 20); Chapter 5 Evaluation (p. 24); 5.1 Performance of Predictor (p. 24); 5.2 Performance of Optimizer (p. 26); 5.3 Comparison to Default Configuration (p. 27); Chapter 6 Conclusion and Future Work (p. 29); Bibliography (p. 31)
dc.language.iso	en
dc.subject	optimization (優化)	zh_TW
dc.subject	system (系統)	zh_TW
dc.subject	predictor	en
dc.subject	tuning	en
dc.subject	optimization	en
dc.title	Hadoop System Parameter Optimization (Hadoop系統參數優化)	zh_TW
dc.title	Optimization of Hadoop System Configuration Parameters	en
dc.type	Thesis
dc.date.schoolyear	103-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	蘇中才,杜憶萍,黃維中
dc.subject.keyword	system, optimization	zh_TW
dc.subject.keyword	tuning, optimization, predictor	en
dc.relation.page	33
dc.rights.note	Paid authorization
dc.date.accepted	2015-07-02
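dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW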
Appears in Collections:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	1.4 MB	Adobe PDF


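Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.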