Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57520
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 林智仁(Chih-Jen Lin) | |
dc.contributor.author | Chieh-Yen Lin | en |
dc.contributor.author | 林玠言 | zh_TW |
dc.date.accessioned | 2021-06-16T06:49:41Z | - |
dc.date.available | 2014-07-29 | |
dc.date.copyright | 2014-07-29 | |
dc.date.issued | 2014 | |
dc.date.submitted | 2014-07-24 | |
dc.identifier.citation | [1] B. E. Boser, I. Guyon, and V. Vapnik, “A training algorithm for optimal margin classifiers,” in COLT, 1992.
[2] C. Cortes and V. Vapnik, “Support-vector networks,” MLJ, vol. 20, pp. 273–297, 1995.
[3] G.-X. Yuan, C.-H. Ho, and C.-J. Lin, “Recent advances of large-scale linear classification,” PIEEE, vol. 100, pp. 2584–2603, 2012.
[4] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” CACM, vol. 51, pp. 107–113, 2008.
[5] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.
[6] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[7] M. Snir and S. Otto, MPI—The Complete Reference: The MPI Core. Cambridge, MA, USA: MIT Press, 1998.
[8] C.-J. Lin and J. J. Moré, “Newton's method for large-scale bound constrained problems,” SIAM J. Optim., vol. 9, pp. 1100–1127, 1999.
[9] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin, “Distributed Newton method for regularized logistic regression,” Dept. of Computer Science, Natl. Taiwan Univ., Tech. Rep., 2014.
[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: a library for large linear classification,” JMLR, vol. 9, pp. 1871–1874, 2008.
[11] T. White, Hadoop: The Definitive Guide, 2nd ed. O'Reilly Media, 2010.
[12] D. Borthakur, “HDFS architecture guide,” 2008.
[13] C.-J. Lin, R. C. Weng, and S. S. Keerthi, “Trust region Newton method for large-scale logistic regression,” JMLR, vol. 9, pp. 627–650, 2008.
[14] O. L. Mangasarian, “A finite Newton method for classification,” Optimization Methods and Software, vol. 17, no. 5, pp. 913–929, 2002.
[15] M. Odersky, L. Spoon, and B. Venners, Programming in Scala. Artima, 2008.
[16] M. Odersky, P. Altherr, V. Cremet, B. Emir, S. Micheloud, N. Mihaylov, M. Schinz, E. Stenman, and M. Zenger, “The Scala language specification,” 2004.
[17] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford, “A reliable effective terascale linear learning system,” JMLR, vol. 15, pp. 1111–1133, 2014. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57520 | - |
dc.description.abstract | 對於大規模分類問題之學習,羅吉斯回歸與線性支持向量機都是相當有用的方法。然而,此兩種模型的分散式實作,並沒有被徹底及完整地研究。另外,因為典型的映射化簡架構對於機器學習的迭代法之實作遭受到計算效率的瓶頸,所以叢集式記憶體內的運算平台─Spark在最近數年內逐漸嶄露頭角。由於Spark對於資料處理與分析的能力,此平台成為一個被廣泛使用的架構。在這篇論文裡,我們提出牛頓法之分散式演算法,並實作於Spark上。我們點出與分析會強烈影響計算效能與溝通時間的細節,並對這些問題提出解決辦法。最後,在經過謹慎的考量與研究後,我們將此論文中提出的演算法實作為一個有效率並且公開的工具以供使用。 | zh_TW |
dc.description.abstract | Logistic regression and linear SVM are useful methods for large-scale classification. However, their distributed implementations have not been well studied. Recently, because of the inefficiency of the MapReduce framework on iterative algorithms, Spark, an in-memory cluster-computing platform, has been proposed and has emerged as a popular framework for large-scale data processing and analytics. In this work, we consider a distributed Newton method for solving logistic regression as well as linear SVM and implement it on Spark. We carefully examine many implementation issues that significantly affect the running time and propose our solutions. After conducting thorough empirical investigations, we release an efficient and easy-to-use tool for the Spark community. | en |
dc.description.provenance | Made available in DSpace on 2021-06-16T06:49:41Z (GMT). No. of bitstreams: 1 ntu-103-R01944006-1.pdf: 1498278 bytes, checksum: f9de5a3fd54e6b9780fed3fb9778a384 (MD5) Previous issue date: 2014 | en |
dc.description.tableofcontents | 口試委員會審定書 (Thesis Committee Certification) i
中文摘要 (Abstract in Chinese) ii
ABSTRACT iii
LIST OF FIGURES vi
LIST OF TABLES vii
I. Introduction 1
II. Apache Spark 3
2.1 Hadoop Distributed File System 3
2.2 Resilient Distributed Datasets 4
2.3 Lineage and Fault Tolerance of Spark 5
III. Logistic Regression, Support Vector Machines and Distributed Newton Method 6
3.1 Logistic Regression and Linear SVM 6
3.2 A Trust Region Newton Method 7
3.3 Distributed Algorithm 8
IV. Implementation Design 11
4.1 Loop Structure 12
4.2 Data Encapsulation 14
4.3 Using mapPartitions Rather Than map 15
4.4 Caching Intermediate Information or not 17
4.5 Using Broadcast Variables 18
4.6 The Cost of the reduce Function 19
V. Related Works 21
5.1 LR Solver in MLlib 21
5.2 MPI LIBLINEAR 22
VI. Experiments 23
6.1 Different Loop Structures 24
6.2 Encapsulation 25
6.3 mapPartitions and map 25
6.4 Broadcast Variables and the coalesce Function 26
6.5 Analysis on Scalability 27
6.6 Comparing with MLlib 28
6.7 Comparison of Spark LIBLINEAR and MPI LIBLINEAR 29
VII. Discussions and Conclusions 33
APPENDICES 34
BIBLIOGRAPHY 40 | |
dc.language.iso | en | |
dc.title | 大規模羅吉斯回歸與線性支持向量機在Spark上之應用 | zh_TW |
dc.title | Large-scale Logistic Regression and Linear Support Vector Machines Using Spark | en |
dc.type | Thesis | |
dc.date.schoolyear | 102-2 | |
dc.description.degree | 碩士 (Master) | |
dc.contributor.oralexamcommittee | 林軒田(Hsuan-Tien Lin),李育杰(Yuh-Jye Lee) | |
dc.subject.keyword | 大規模學習,分散式運算,羅吉斯回歸,支持向量機,牛頓法 | zh_TW |
dc.subject.keyword | large-scale learning, distributed computing, logistic regression, support vector machine, Newton method | en |
dc.relation.page | 41 | |
dc.rights.note | 有償授權 (paid authorization) | |
dc.date.accepted | 2014-07-24 | |
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | zh_TW |
dc.contributor.author-dept | 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) | zh_TW |
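The record above describes a distributed trust region Newton method for L2-regularized logistic regression on Spark (Sections 3.1–3.3 and 4.3 of the table of contents). As a hedged illustration only, not the thesis's actual Spark LIBLINEAR code, the following NumPy sketch shows the standard objective and gradient such a solver works with, and a serial simulation of the partition-wise gradient accumulation that a `mapPartitions` + `reduce` pipeline would perform; all function names here are hypothetical.

```python
import numpy as np

def sigmoid(t):
    # Logistic function 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-t))

def lr_obj_grad(w, X, y, C):
    """Objective and gradient of L2-regularized logistic regression:
    f(w) = w'w / 2 + C * sum_i log(1 + exp(-y_i * w'x_i))."""
    z = y * (X @ w)                                  # margins y_i * w'x_i
    obj = 0.5 * (w @ w) + C * np.sum(np.log1p(np.exp(-z)))
    grad = w + C * (X.T @ ((sigmoid(z) - 1.0) * y))
    return obj, grad

def partitioned_grad(w, partitions, C):
    """Gradient computed block by block, mimicking how a mapPartitions
    step emits one partial loss gradient per data partition and a reduce
    step sums them; the regularization term is added once at the end."""
    total = np.zeros_like(w)
    for X_p, y_p in partitions:                      # one (features, labels) block per partition
        z = y_p * (X_p @ w)
        total += X_p.T @ ((sigmoid(z) - 1.0) * y_p)
    return w + C * total
```

Because the loss is a sum over instances, summing per-partition gradients reproduces the full-data gradient exactly, so each iteration only needs to ship one d-dimensional vector per partition across the network rather than the data itself.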
Appears in Collections: | 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-103-1.pdf (currently not authorized for public access) | 1.46 MB | Adobe PDF |
Except where their copyright terms are otherwise noted, all items in this repository are protected by copyright, with all rights reserved.