Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48826
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 林智仁(Chih-Jen Lin) | |
dc.contributor.author | Chien-Chih Wang | en |
dc.contributor.author | 王建智 | zh_TW |
dc.date.accessioned | 2021-06-15T11:09:53Z | - |
dc.date.available | 2018-02-08 | |
dc.date.copyright | 2017-02-08 | |
dc.date.issued | 2016 | |
dc.date.submitted | 2016-10-11 | |
dc.identifier.citation | A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15:1111–1133, 2014.
P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.
Y. Bian, X. Li, M. Cao, and Y. Liu. Bundle CDN: a highly parallelized approach for large-scale l1-regularized logistic regression. In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2013.
L. Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8), 1991.
L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, pages 177–186, 2010.
R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
O. Chapelle and D. Erhan. Improved preconditioner for Hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22:3207–3220, 2010.
A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust-region Methods. Society for Industrial and Applied Mathematics, Philadelphia, 2000.
J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NIPS) 25, 2012.
Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145:451–482, 2014.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.
G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996.
I. J. Goodfellow, D. Warde-Farley, P. Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, J. Bergstra, F. Bastien, and Y. Bengio. Pylearn2: a machine learning research library, 2013.
I. Griva, S. G. Nash, and A. Sofer. Linear and Nonlinear Optimization. SIAM, second edition, 2009.
G. E. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
F.-L. Huang, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Iterative scaling and coordinate descent methods for maximum entropy. Journal of Machine Learning Research, 11:815–848, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_journal.pdf.
S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
R. Kiros. Training neural networks with stochastic Hessian-free optimization, 2013. arXiv preprint arXiv:1301.3641.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a. MNIST database available at http://yann.lecun.com/exdb/mnist/.
Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag, 1998b. URL http://leon.bottou.org/papers/lecun-98x.
P. Li. An empirical evaluation of four algorithms for multi-class classification: MART, ABC-MART, Robust LogitBoost, and ABC-LogitBoost, 2010. arXiv preprint arXiv:1001.1020.
C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.
G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007. URL http://leon.bottou.org/papers/loosli-canu-bottou-2006.
D. Mahajan, S. S. Keerthi, and S. Sundararajan. A distributed block coordinate descent method for training l1-regularized linear classifiers. Journal of Machine Learning Research, 2016. To appear.
O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5):913–929, 2002.
O. L. Mangasarian. Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research, 7:1517–1530, 2006.
J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.
J. J. Moré. The Levenberg-Marquardt algorithm: implementation and theory. In G. Watson, editor, Numerical Analysis, number 630 in Lecture Notes in Mathematics, pages 105–116, New York, 1978. Springer-Verlag.
P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan. SparkNet: training deep networks in Spark. arXiv preprint arXiv:1511.06051, 2015.
J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 265–272, 2011.
J. Park, D. Kang, J. Kim, J. T. Kwok, and I. W. Tsang. SVDD-based pattern denoising. Neural Computation, 19(7):1919–1938, 2007.
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1058–1066, 2013.
C.-C. Wang, C.-H. Huang, and C.-J. Lin. Subsampled Hessian Newton methods for supervised learning. Neural Computation, 27:1766–1795, 2015. URL http://www.csie.ntu.edu.tw/~cjlin/papers/sub_hessian/sample_hessian.pdf.
C.-C. Wang, S. Sellamanickam, D. Mahajan, S. S. Keerthi, and C.-J. Lin. Distributed Newton methods for deep learning. Technical report, National Taiwan University, 2016.
R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, 1964.
M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48826 | - |
dc.description.abstract | Newton methods can be applied to many supervised learning problems, but for large-scale data, optimizing with the Hessian matrix is very time consuming. In this thesis, our goal is to make Newton methods practical for large-scale real-world problems. The first part of this thesis concerns subsampled Hessian matrices. In recent research, subsampled Hessian methods have been proposed to reduce computation time by using only a subset of the data to compute the Hessian matrix. Unfortunately, we find that in some situations the subsampled Hessian method converges more slowly than the standard Newton method, because the update direction obtained from the subsampled Hessian is much less accurate than that from the full Hessian. In this thesis, we propose techniques to improve existing subsampled Hessian methods. The main idea is that, in order to better minimize the objective function, we solve an additional two-variable subproblem to adjust the update direction obtained from the subsampled Hessian, and we provide a theoretical proof of convergence. Experiments on logistic regression, linear support vector machines, maximum entropy, and deep neural networks confirm that the proposed techniques significantly reduce the training time of subsampled Hessian optimization. For large-scale data classification, the method proposed in this part becomes an effective alternative to the Newton method. In the second part of this thesis, we study Newton methods for training deep neural networks on distributed systems. Because a large number of weights lie between any two adjacent layers, deep learning involves a difficult non-convex optimization problem. For training on large-scale data, distributed training is necessary, but it makes the evaluation of the objective function, gradient, and Hessian expensive; in particular, communication and synchronization become the bottleneck. In this thesis, we propose a novel distributed Newton method for deep learning. First, we store the Jacobian matrix in a distributed environment and approximate the Hessian with a diagonalization method, so that no communication between machines is needed. Second, to save synchronization time, we terminate the procedure even if some machines have not finished their conjugate gradient computations. Third, we use the subsampled Hessian from the first part to reduce each machine's computation and communication time. Experiments in this part confirm that the proposed distributed Newton method can be applied efficiently to deep learning. | zh_TW |
dc.description.abstract | Newton methods can be applied in many supervised learning approaches. However, for large-scale data, the use of the whole Hessian matrix can be time-consuming. In this thesis, we aim to make Newton methods practically viable for various large-scale scenarios. The first part of this thesis is about subsampled Newton methods. Recently, subsampled Newton methods have been proposed to reduce the computational time by using only a subset of data for calculating an approximation of the Hessian matrix. Unfortunately, we find that in some situations the running speed is worse than the standard Newton method because cheaper but less accurate search directions are used. In this thesis, we propose some novel techniques to improve the existing subsampled Hessian Newton method. The main idea is to solve a 2-dimensional sub-problem per iteration to adjust the search direction to better minimize the second-order approximation of the function value. We prove the theoretical convergence of the proposed method. Experiments on logistic regression, linear SVM, maximum entropy, and deep networks indicate that our techniques significantly reduce the running time of the subsampled Hessian Newton method. The resulting algorithm becomes a compelling alternative to the standard Newton method for large-scale data classification.
The second part of this thesis is about distributed Newton methods for deep learning. Deep learning involves a difficult non-convex optimization problem because of the large number of weights between any two adjacent layers of a deep structure. In large-scale scenarios, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and synchronization cost becomes the bottleneck. In this thesis, we propose a novel distributed Newton method for deep learning. First, to reduce the communication cost, we consider storing the Jacobian matrix in a distributed environment, and then propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Third, we consider subsampled Hessian for reducing running time as well as communication cost. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. | en |
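The two-variable subproblem mentioned in the abstract can be sketched numerically: given a cheap but less accurate direction d1 (here, from a noisy stand-in for a subsampled Hessian) and a second direction d2, choose the combination β1·d1 + β2·d2 that minimizes the second-order model of the objective, which reduces to a 2×2 linear system. The sketch below is a minimal illustration on a synthetic quadratic, not the thesis's implementation; the perturbed curvature matrix `Hs` and the use of steepest descent as the second direction are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Synthetic strictly convex quadratic f(w) = 0.5 w^T Q w - c^T w,
# standing in for a regularized training objective.
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)          # true Hessian (constant for a quadratic)
c = rng.standard_normal(n)

w = np.zeros(n)
grad = Q @ w - c                 # gradient at the current iterate

# Crude stand-in for a subsampled Hessian: the true Hessian plus a
# symmetric positive perturbation (assumption for illustration only).
B = rng.standard_normal((n, n))
Hs = Q + 0.3 * (B @ B.T) / n

d1 = np.linalg.solve(Hs, -grad)  # cheap, less accurate Newton-like direction
d2 = -grad                       # second direction (here: steepest descent)

# Two-variable subproblem: minimize the second-order model
#   q(d) = 0.5 d^T Q d + grad^T d   over  d = b1*d1 + b2*d2,
# which is a 2x2 linear system in (b1, b2).
D = np.column_stack([d1, d2])    # n x 2 basis of the search subspace
M = D.T @ Q @ D                  # 2x2 reduced Hessian
r = D.T @ grad
beta = np.linalg.solve(M, -r)
d = D @ beta                     # adjusted search direction

def q(p):
    """Second-order model of the objective along direction p."""
    return 0.5 * p @ Q @ p + grad @ p

# Since the subspace contains d1, the combined direction is never worse.
print("model(d1) =", q(d1), " model(d) =", q(d))
```

In the thesis's setting, the second direction would typically come from the previous iteration, and the curvature matrix would be accessed only through Hessian-vector products inside conjugate gradient rather than formed explicitly; the dense solve above is purely for clarity.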
dc.description.provenance | Made available in DSpace on 2021-06-15T11:09:53Z (GMT). No. of bitstreams: 1 ntu-105-D98922007-1.pdf: 1646123 bytes, checksum: 36d24c03eb79ff5f60f42583e6907011 (MD5) Previous issue date: 2016 | en |
dc.description.tableofcontents | Chinese Abstract
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER
I. Introduction
1.1 Subsampled Newton Methods for Empirical Risk Minimization
1.2 Distributed Newton Methods for Deep Neural Networks
II. Subsampled Newton Methods for Empirical Risk Minimization
2.1 Subsampled Hessian Newton-CG Method and Its Practical Performance
2.2 Modified Subsampled-Hessian Newton Directions
2.2.1 Relations with Prior Works of Using Second Directions
2.3 Convergence
2.3.1 Convergence of Using (2.9)
2.3.2 Convergence of Using (3.61)
2.4 Examples: Logistic Regression, l2-loss Linear SVM, Maximum Entropy and Deep Neural Networks
2.4.1 Logistic Regression
2.4.2 l2-loss Linear SVM
2.4.3 Maximum Entropy
2.4.4 Deep Neural Networks
2.5 Experiments
2.5.1 Logistic Regression and l2-loss Linear SVM
2.5.2 Maximum Entropy for Multi-Class Classification
2.5.3 Deep Learning for Multi-Class Classification
2.6 Conclusions
2.7 Using a Subset of Data for Calculating the Gradient
2.8 Some Variants of the Proposed Method
2.8.1 Selection of dk
2.8.2 Using Non-Negative β1 and β2
2.8.3 Experiments
2.9 Details of Applying the Proposed Method to Maximum Entropy
2.10 Details of Hessian-free Approaches for Neural Networks
III. Distributed Newton Methods for Deep Neural Networks
3.1 Hessian-free Newton Method for Deep Learning
3.1.1 Feedforward Networks
3.1.2 Hessian-free Newton Method
3.2 Distributed Training for Deep Neural Networks
3.2.1 Variable Partition and Diagonal Gauss-Newton Matrix Approximation
3.2.2 Subsampled Hessian Newton Methods and Synchronization Between Partitions
3.2.3 Other Implementation Techniques
3.2.4 Summary of the Procedure
3.3 Existing Optimization Methods for Training Neural Networks
3.3.1 Stochastic Gradient Methods
3.4 Experiments
3.4.1 Analysis of Distributed Newton Methods
3.4.2 Comparison on Multi-class Testing Accuracy with Stochastic Gradient Methods and Support Vector Machines
3.4.3 A Comparison on AUC Value with Stochastic Gradient Methods
3.5 Conclusions
3.6 List of Symbols
BIBLIOGRAPHY | |
dc.language.iso | zh-TW | |
dc.title | Applications of Newton Methods in Machine Learning | zh_TW |
dc.title | Newton Methods For Machine Learning | en |
dc.type | Thesis | |
dc.date.schoolyear | 105-1 | |
dc.description.degree | Ph.D. | |
dc.contributor.oralexamcommittee | 莊永裕(Yung-Yu Chuang),林軒田(Hsuan-Tien Lin),李宏毅(Hung-Yi Lee),李育杰(Yuh-Jye Lee),王鈺強(Yu-Chiang Wang) | |
dc.subject.keyword | subsampled Hessian, deep neural networks, multi-class classification, large-scale learning | zh_TW |
dc.subject.keyword | subsampled Hessian, deep neural networks, multi-class classification, large-scale classification | en |
dc.relation.page | 105 | |
dc.identifier.doi | 10.6342/NTU201603656 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2016-10-12 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-105-1.pdf (currently not authorized for public access) | 1.61 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their licensing terms.