Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101409
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 林智仁 [zh_TW]
dc.contributor.advisor: Chih-Jen Lin [en]
dc.contributor.author: 盧智寶 [zh_TW]
dc.contributor.author: Zhi-Bao Lu [en]
dc.date.accessioned: 2026-01-27T16:37:51Z
dc.date.available: 2026-01-28
dc.date.copyright: 2026-01-27
dc.date.issued: 2025
dc.date.submitted: 2026-01-14
dc.identifier.citation:
[1] Yu-Chen Lin, Si-An Chen, Jie-Jyun Liu, and Chih-Jen Lin. Linear classifier: An often-forgotten baseline for text classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1876–1888, 2023. Short paper.
[2] Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[4] Ji Zhu and Hui Zou. Variable selection for the linear support vector machine. In Ke Chen and Lipo Wang, editors, Trends in Neural Computation, pages 35–59. Springer Berlin Heidelberg, 2007.
[5] Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison of optimization methods and software for large-scale L1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010.
[6] Wei-Lin Chiang, Yu-Sheng Li, Ching-Pei Lee, and Chih-Jen Lin. Limited-memory common-directions method for distributed L1-regularized linear classification. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2018.
[7] Antoine Dedieu, Rahul Mazumder, and Haoyue Wang. Solving L1-regularized SVMs and related linear programs: Revisiting the effectiveness of column and constraint generation. Journal of Machine Learning Research, 23(1), 2022.
[8] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[9] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[10] Robert Moore and John DeNero. L1 and L2 regularization for multiclass hinge loss models. In Symposium on Machine Learning in Speech and Language Processing, 2011.
[11] Ben J. Marafino, W. John Boscardin, and R. Adams Dudley. Efficient and sparse feature selection for biomedical text classification via the elastic net: Application to ICU risk stratification from nursing notes. Journal of Biomedical Informatics, 54:114–120, 2015.
[12] Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference (WWW), pages 993–1002, 2018.
[13] Sujay Khandagale, Han Xiao, and Rohit Babbar. Bonsai: Diverse and shallow trees for extreme multi-label classification. Machine Learning, 109:2099–2119, 2020.
[14] Hsiang-Fu Yu, Kai Zhong, Jiong Zhang, Wei-Cheng Chang, and Inderjit S. Dhillon. PECOS: Prediction for enormous and correlated output spaces. Journal of Machine Learning Research, 23(98):1–32, 2022.
[15] Jiong Zhang, Yau-Shian Wang, Wei-Cheng Chang, Wei Li, Jyun-Yu Jiang, Cho-Jui Hsieh, and Hsiang-Fu Yu. Build faster with less: A journey to accelerate sparse model building for semantic matching in product search. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 4960–4966, 2023.
[16] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.
[17] He-Zhe Lin, Cheng-Hung Liu, and Chih-Jen Lin. Exploring space efficiency in a tree-based linear model for extreme multi-label classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 16245–16260, 2024.
[18] Leonardo Galli and Chih-Jen Lin. A study on truncated Newton methods for linear classification. IEEE Transactions on Neural Networks and Learning Systems, 33(7):2828–2841, 2022.
[19] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989.
[20] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.
[21] Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In Advances in Neural Information Processing Systems, volume 32, 2019.
[22] Po-Wei Wang, Ching-Pei Lee, and Chih-Jen Lin. The common-directions method for regularized empirical risk minimization. Journal of Machine Learning Research, 20(58):1–49, 2019.
[23] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.
[24] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[25] Wei-Lin Chiang, Mu-Chu Lee, and Chih-Jen Lin. Parallel dual coordinate descent method for large-scale linear classification in multi-core environments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[26] Hsiang-Fu Yu, Fang-Lan Huang, and Chih-Jen Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, 2011.
[27] Chia-Hua Ho and Chih-Jen Lin. Large-scale linear support vector regression. Journal of Machine Learning Research, 13:3323–3348, 2012.
[28] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, 1995.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101409
dc.description.abstract [zh_TW]: 對於傳統的線性模型而言,廣泛使用L2正則項的模型通常被認為是稠密的。因此,關於在什麼時候L2正則化線性模型可以得到稀疏解的問題很少受到關注。在這篇論文中,我們嚴謹地證明了對於L2正則化的支持向量分類/迴歸問題,當資料的特徵向量具有稀疏性時,其理論上的最優解可以為稀疏的。令人意外的是,我們觀察到某些最佳化方法無法保留這項稀疏性,反而產生稠密的數值解,導致不必要的儲存空間。我們透過詳細的分析來解釋這個現象。特別值得注意的是,我們新奇地發現某些投影梯度法在解決對偶問題時,會自然地產生比其他最佳化方法更稀疏的數值解。透過使用這些適合的算法,可以將模型的儲存空間減少多達50%,對於大規模的工業應用來說具有相當顯著的效益。
dc.description.abstract [en]: For traditional linear models with the widely used L2 regularizer, it is often assumed that the resulting models are dense. As a result, little attention has been paid to when the optimal solution of an L2-regularized problem can actually be sparse. In this work, we rigorously prove that for L2-regularized support vector classification/regression, the theoretical optimum can indeed be sparse when the data have sparse feature values. Surprisingly, we observe that some optimization methods fail to preserve this sparsity and instead produce fully dense numerical solutions, leading to unnecessary storage overhead. We explain this phenomenon through a detailed analysis. In particular, we present the novel finding that certain projected gradient methods for solving the dual problem naturally yield sparser numerical solutions than other optimization algorithms. By applying suitable algorithms that preserve numerical sparsity, model storage can be reduced by up to 50%, which is highly advantageous for large-scale industrial applications.
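To make the observation above concrete, the following is a minimal, hypothetical sketch (an illustrative setup, not an experiment from the thesis, which works with LIBLINEAR and LibMultiLabel): it trains an L2-regularized linear SVM on a toy sparse dataset with scikit-learn's LIBLINEAR-backed LinearSVC, once with the primal Newton-type solver (dual=False) and once with dual coordinate descent (dual=True), and counts the nonzero weights. Each instance carries a private indicator feature, so a dual solver, which maintains w as a combination of support vectors (w = sum_i alpha_i y_i x_i), keeps the private weights of non-support instances at exactly zero, whereas a primal solver touches every instance from the first iteration and tends to leave small nonzero values behind.

    # Hypothetical illustration (assumed setup, not from the thesis): compare the
    # weight density of L2-regularized linear SVMs trained by a primal solver and
    # by dual coordinate descent, both via scikit-learn's LIBLINEAR wrapper.
    import numpy as np
    import scipy.sparse as sp
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_per_class = 200
    n = 2 * n_per_class

    # Feature 0 separates the classes; features 1..n are "private" indicators,
    # each appearing in exactly one training instance.
    f0 = np.concatenate([rng.uniform(1.0, 2.0, n_per_class),
                         rng.uniform(-2.0, -1.0, n_per_class)])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    X = sp.hstack([sp.csr_matrix(f0.reshape(-1, 1)),
                   sp.identity(n, format="csr")], format="csr")

    for use_dual in (False, True):  # False: primal Newton-type solver; True: dual coordinate descent
        clf = LinearSVC(C=1.0, dual=use_dual, tol=1e-6, max_iter=100000)
        clf.fit(X, y)
        nnz = np.count_nonzero(clf.coef_)
        print(f"dual={use_dual}: {nnz}/{X.shape[1]} nonzero weights")

Exact counts depend on C, the stopping tolerance, and the data; the point is only that the two numerical solutions can differ substantially in density even though they solve the same L2-regularized problem.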
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-27T16:37:51Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2026-01-27T16:37:51Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Contents
口試委員會審定書 i
誌謝 ii
摘要 iii
Abstract iv
Contents v
List of Figures viii
List of Tables ix
1 Introduction 1
2 A Surprising Observation: The Weight Density Varies Among Optimizers 4
3 Investigation of the Weight Density on More Optimizers 6
3.1 Introduction of the Solvers 6
3.2 Weight Densities Using Different Solvers 9
4 Theoretical Analysis on the Sparsity of Optimal w∗ 10
5 Investigation: Why Different Density Among Different Solvers? 14
5.1 Newton and LBFGS Use All Training Instances 14
5.2 SGD Can Represent w by Only a Subset of Data 15
5.3 Dual-based Methods Represent w by Support Vectors 16
6 Experiments on Multi-label Classification 19
6.1 Experimental Settings 19
6.2 Model Size Comparison 20
6.3 Comparison of Testing Set Performance 21
7 Conclusions 22
Bibliography 23
Appendix 26
A Notation 27
B Data Statistics 28
B.1 Binary Data Sets 28
B.2 Multi-label Data Sets 28
C Experimental Settings for Binary Classification Problems 30
C.1 LIBLINEAR Stopping Condition 30
C.2 Settings for Comparing the Weight Density 30
D Proofs
D.1 Proof of Theorem 1 32
D.2 Proof of Theorem 2 32
E Recovering the Theoretical Sparsity from Numerical Solutions 35
F Experimental Settings for LibMultiLabel 37
G Extension to L1-loss Support Vector Classification 39
G.1 Theoretical Sparsity of w∗ 39
G.2 The Weight Density of the Numerical Solutions Obtained from Different Solvers 40
G.3 Model Size Reduction for L1-loss in Extreme Multi-label Classification 41
H Extension to Logistic Loss 43
I Extension to Support Vector Regression 44
I.1 Problem Formulation 44
I.2 Theoretical Analysis on the Sparsity of Optimal w∗ in Support Vector Regression 46
I.3 Comparison of the Numerical Weight Density under Primal- and Dual-based Solvers 49
dc.language.iso: en
dc.subject: 線性分類
dc.subject: 權重密度
dc.subject: 極端多標籤分類
dc.subject: 對偶問題
dc.subject: 支持向量機
dc.subject: Linear Classification
dc.subject: Weight Density
dc.subject: Extreme Multi-label Classification
dc.subject: Dual problem
dc.subject: Support vector machines
dc.title [zh_TW]: 關於 L2 正則化線性分類與迴歸的權重密度
dc.title [en]: On the Weight Density of L2-Regularized Linear Classification and Regression
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee [zh_TW]: 李育杰;李靜沛
dc.contributor.oralexamcommittee [en]: Yuh-Jye Lee;Ching-pei Lee
dc.subject.keyword [zh_TW]: 線性分類,權重密度,極端多標籤分類,對偶問題,支持向量機
dc.subject.keyword [en]: Linear Classification, Weight Density, Extreme Multi-label Classification, Dual problem, Support vector machines
dc.relation.page: 49
dc.identifier.doi: 10.6342/NTU202600011
dc.rights.note: 同意授權(限校園內公開) (consent to authorize; access limited to campus)
dc.date.accepted: 2026-01-15
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2026-01-28
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-114-1.pdf (access limited to NTU IP range) | 1.45 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
