Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100949

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 林智仁 | zh_TW |
| dc.contributor.advisor | Chih-Jen Lin | en |
| dc.contributor.author | 黃子軒 | zh_TW |
| dc.contributor.author | Zih-Syuan Huang | en |
| dc.date.accessioned | 2025-11-26T16:13:11Z | - |
| dc.date.available | 2025-11-27 | - |
| dc.date.copyright | 2025-11-26 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-10-28 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100949 | - |
| dc.description.abstract | 擬牛頓法因能利用二階資訊而無需實際計算黑森矩陣,在最佳化中已被證明相當有效。然而,將其應用於隨機最佳化,特別是在深度學習情境中,仍面臨挑戰,原因在於梯度估計存在雜訊以及目標函數具非凸性。在本文中,我們提出了一種專為訓練深度神經網路設計的隨機有限記憶 BFGS(LBFGS)方法。我們的方法引入了一種新穎的曲率選擇策略,透過更新頻率機制來挑選曲率最大的曲率對,有效解決隨機性與非凸性帶來的問題。此外,我們結合了動量方法,以進一步提升收斂速度。實驗結果顯示,在標準的凸與非凸影像分類基準資料集上,我們的方法不僅顯著優於現有的隨機 LBFGS(oLBFGS)方法,還能與廣泛使用的深度學習最佳化方法(如動量隨機梯度下降(SGDM)、Adam 與 Shampoo)表現相當。 | zh_TW |
| dc.description.abstract | Quasi-Newton methods have proven effective for optimization because they exploit second-order information without explicitly computing Hessian matrices. However, adapting them to stochastic optimization, particularly in deep learning, remains challenging due to noisy gradient estimates and nonconvex objectives. In this thesis, we propose a stochastic limited-memory BFGS (LBFGS) optimizer designed specifically for training deep neural networks. Our method introduces a novel curvature selection strategy that uses an update frequency mechanism to select the curvature pairs exhibiting the highest curvature, effectively addressing the difficulties caused by stochasticity and nonconvexity. Additionally, we integrate momentum to speed up convergence. Experimental results demonstrate that our approach significantly outperforms an existing stochastic LBFGS method (oLBFGS) and remains competitive with widely used deep learning optimizers such as SGD with momentum (SGDM), Adam, and Shampoo on standard convex and nonconvex image classification benchmarks. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-26T16:13:11Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-11-26T16:13:11Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員會審定書 i; 誌謝 ii; 摘要 iii; Abstract iv; Contents v; List of Figures vii; List of Tables viii; 1 Introduction 1; 2 Quasi-Newton Methods 4; 3 Limited-memory BFGS Methods 11; 4 Stochastic LBFGS Methods 17; 5 Proposed Stochastic LBFGS Method 23; 6 Experiments 27; Bibliography 37; Appendix 43; A Proof of Rank-Two Update Form for BFGS and DFP 44; B Hyperparameter Settings 47 | - |
| dc.language.iso | en | - |
| dc.subject | 深度學習 | - |
| dc.subject | 隨機最佳化 | - |
| dc.subject | 擬牛頓法 | - |
| dc.subject | Deep Learning | - |
| dc.subject | Stochastic Optimization | - |
| dc.subject | Quasi-Newton Methods | - |
| dc.title | 隨機擬牛頓方法用於深度神經網路的訓練 | zh_TW |
| dc.title | Stochastic Quasi-Newton Methods for Training Deep Neural Networks | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 (Master's) | - |
| dc.contributor.oralexamcommittee | 李育杰;李靜沛 | zh_TW |
| dc.contributor.oralexamcommittee | Yuh-Jye Lee;Ching-pei Lee | en |
| dc.subject.keyword | 深度學習,隨機最佳化,擬牛頓法 | zh_TW |
| dc.subject.keyword | Deep Learning, Stochastic Optimization, Quasi-Newton Methods | en |
| dc.relation.page | 50 | - |
| dc.identifier.doi | 10.6342/NTU202504473 | - |
| dc.rights.note | 同意授權(全球公開) (Authorized: worldwide open access) | - |
| dc.date.accepted | 2025-10-29 | - |
| dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | - |
| dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | - |
| dc.date.embargo-lift | 2025-11-27 | - |
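The abstract describes the proposed optimizer only at a high level. For orientation, the following is a minimal sketch of what a stochastic L-BFGS step with momentum generally looks like, assuming the classical two-loop recursion over curvature pairs (s_i, y_i). The curvature-selection rule below (evict the stored pair with the smallest curvature y^T s / s^T s once memory is full) is a hypothetical stand-in for the thesis's update-frequency mechanism, whose details this record does not give; all function names and the momentum form are likewise assumptions for illustration.

```python
# Illustrative sketch only -- NOT the thesis's actual algorithm. The eviction
# rule, the momentum form, and all names here are assumptions.
import numpy as np

def two_loop_direction(grad, pairs):
    """Standard L-BFGS two-loop recursion: returns an approximation of
    -H_k @ grad built from stored curvature pairs (s_i, y_i)."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):                 # newest pair first
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    if pairs:                                    # initial scaling H0 = gamma * I
        s, y = pairs[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(pairs, reversed(alphas)):   # oldest pair first
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q

def keep_high_curvature(pairs, s, y, memory=10, eps=1e-10):
    """Hypothetical selection rule standing in for the thesis's
    update-frequency mechanism: store only positive-curvature pairs
    (y^T s > 0) and evict the lowest-curvature pair when memory is full."""
    if y @ s > eps:
        pairs.append((s, y))
    if len(pairs) > memory:
        worst = min(range(len(pairs)),
                    key=lambda i: (pairs[i][1] @ pairs[i][0])
                                  / (pairs[i][0] @ pairs[i][0]))
        pairs.pop(worst)                         # preserves recency ordering
    return pairs

def step(w, velocity, grad, pairs, lr=0.1, beta=0.9):
    """One optimizer step: heavy-ball momentum applied to the quasi-Newton
    direction (this combination is assumed for illustration)."""
    direction = two_loop_direction(grad, pairs)
    velocity = beta * velocity + direction
    return w + lr * velocity, velocity
```

In a stochastic setting, s is the parameter displacement between iterates, and y is typically formed from gradients evaluated on the same mini-batch at both iterates so the pair stays consistent despite sampling noise; the record does not say whether the thesis adopts this convention.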
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-114-1.pdf | 805.48 kB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
