NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97662
Full metadata record
(DC field: value [language])
dc.contributor.advisor: 潘建興 [zh_TW]
dc.contributor.advisor: Frederick Kin Hing Phoa [en]
dc.contributor.author: 王愛琳 [zh_TW]
dc.contributor.author: Ai-Lin Wang [en]
dc.date.accessioned: 2025-07-09T16:17:55Z
dc.date.available: 2025-07-10
dc.date.copyright: 2025-07-09
dc.date.issued: 2025
dc.date.submitted: 2025-06-30
dc.identifier.citation:
[1] M. Ai, J. Yu, H. Zhang, and H. Wang. Optimal subsampling algorithms for big data regressions. Statistica Sinica, 2021.
[2] A. Atkinson, A. Donev, and R. Tobias. Optimum experimental designs, with SAS, volume 34. OUP Oxford, 2007.
[3] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’06, pages 1127–1136, USA, 2006. Society for Industrial and Applied Mathematics.
[4] W. Fithian and T. Hastie. Local case-control sampling: Efficient subsampling in imbalanced data sets. The Annals of Statistics, 42(5):1693–1724, 2014.
[5] S. G. Gilmour and L. A. Trinca. Optimum design of experiments for statistical inference. Journal of the Royal Statistical Society, Series C: Applied Statistics, 61(3):345–401, 2012.
[6] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2525–2534. PMLR, 2018.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] P. Ma, J. Huang, and N. Zhang. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika, 102:631–645, 2015.
[9] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research, 16(1):861–911, 2015.
[10] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[11] F. Pukelsheim. Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006.
[12] M. Quiroz, R. Kohn, M. Villani, and M.-N. Tran. Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114(526):831–843, 2019.
[13] E. R. Rahman, R. A. Shuvo, M. H. K. Mehedi, M. S. Hossain, and A. A. Rasel. Distributed computing for big data analytics: Challenges and opportunities. ResearchGate, 2022.
[14] G. Vaughan. Efficient big data model selection with applications to fraud detection. International Journal of Forecasting, 36(3):1116–1127, 2020.
[15] H. Wang and Y. Ma. Optimal subsampling for quantile regression in big data. Biometrika, 108(1):99–112, 2021.
[16] H. Wang, M. Yang, and J. Stufken. Information-based optimal subdata selection for big data linear regression. arXiv preprint arXiv:1710.10382, 2017.
[17] H. Wang, R. Zhu, and P. Ma. Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113(522):829–844, 2018.
[18] R. Xie, Z. Wang, S. Bai, P. Ma, and W. Zhong. Online decentralized leverage score sampling for streaming multidimensional time series. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2301–2311. PMLR, 2019.
[19] Y. Yao and H. Wang. Optimal subsampling for softmax regression. Statistical Papers, 60(2):585–599, 2019.
[20] Y. Yao and H. Wang. A review on optimal subsampling methods for massive datasets. Journal of Data Science, 19(1):151–172, 2021.
[21] Y. Yao, J. Zou, and H. Wang. Model constraints independent optimal subsampling probabilities for softmax regression. Journal of Statistical Planning and Inference, 225:188–201, 2023.
[22] J. Yu, M. Ai, and Z. Ye. A review on design inspired subsampling for big data. Statistical Papers, 65(2):467–510, 2023.
[23] H. Zhang and H. Wang. Distributed subdata selection for big data via sampling-based approach. Computational Statistics & Data Analysis, 153:107072, 2021.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97662
dc.description.abstract: In optimal subsampling methods, the absence of label information poses a significant challenge to estimating sampling probabilities, particularly in classification tasks, since conventional methods typically rely on complete response data. To address this, this work proposes a semi-supervised A-/L-optimal subsampling framework for the softmax regression model. The method derives theoretically optimal subsampling probabilities under the baseline constraint and further examines their statistical role in balancing the categorical response distribution. We also consider subsampling methods that are invariant to the choice of constraint, proposing a strategy that minimizes the asymptotic mean squared prediction error (MSPE) so that the constructed subsampling probabilities are more robust to model constraints. The theoretical results are validated on both simulated and real data, showing that the method improves prediction accuracy and computational efficiency. [zh_TW]
dc.description.abstract: Missing label information presents a significant challenge for optimal subsampling methods, which typically rely on complete response data to compute sampling probabilities. In this study, we propose a semi-supervised A-/L-optimal subsampling framework for softmax regression that effectively addresses this issue. We derive the optimal subsampling probabilities under the baseline constraint and highlight their role in balancing categorical responses. In addition, we explore constraint-invariant subsampling by minimizing the asymptotic mean squared prediction error (MSPE), enabling the construction of per-observation subsampling probabilities that are robust to the choice of model constraint. Our theoretical findings are supported by simulations and real-data applications, demonstrating improvements in both prediction accuracy and computational efficiency. [en]
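The A-/L-optimal scheme described in the abstract follows the common two-step recipe in this literature: fit a pilot estimate on a small uniform subsample, score every observation, draw a subsample with probabilities proportional to those scores, and refit with inverse-probability weights. The sketch below is illustrative only: the function names are hypothetical, the gradient-norm scores are a simplified stand-in for the thesis's actual A-/L-optimal probabilities, and the baseline constraint is imposed by pinning the last class's coefficients at zero.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, w=None, lr=0.5, iters=300):
    """Weighted softmax-regression fit by plain gradient descent.
    The baseline constraint fixes the last class's coefficients at zero."""
    n, d = X.shape
    k = Y.shape[1]
    w = np.ones(n) if w is None else w
    B = np.zeros((d, k))
    for _ in range(iters):
        P = softmax(X @ B)
        grad = X.T @ ((P - Y) * w[:, None]) / w.sum()
        B -= lr * grad
        B[:, -1] = 0.0  # baseline constraint on the last class
    return B

def aopt_subsample(X, Y, r, rng):
    """Two-step subsampling: pilot fit on a uniform subsample of size r,
    then r draws with probabilities proportional to per-point gradient
    norms (a crude stand-in for the A-/L-optimal probabilities)."""
    n = X.shape[0]
    pilot = rng.choice(n, size=r, replace=False)
    B0 = fit_softmax(X[pilot], Y[pilot])
    P = softmax(X @ B0)
    scores = np.linalg.norm(P - Y, axis=1) * np.linalg.norm(X, axis=1)
    probs = scores / scores.sum()
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # inverse-probability weights keep the subsample estimator roughly unbiased
    return fit_softmax(X[idx], Y[idx], w=1.0 / probs[idx])
```

Note that the residual term ‖P − Y‖ in the scores requires labels; in the semi-supervised setting studied in the thesis it would have to be replaced by a label-free surrogate for the unlabeled points, so this sketch assumes fully labeled data.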
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-09T16:17:55Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-07-09T16:17:55Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Acknowledgements i
摘要 ii
Abstract iii
Contents iv
List of Figures vi
List of Tables vii
Chapter 1 Introduction 1
1.1 Challenges of Massive Data in Statistical Modeling 1
1.2 Softmax Regression for Multi-Class Classification 2
1.3 Research Objectives and Approach 4
1.4 Significance and Scope 4
Chapter 2 Literature Review 6
2.1 Subsampling for Massive Datasets 6
2.2 Optimal Designs in Subsampling 7
2.3 Subsampling in Regression Models 8
Chapter 3 Optimal Design in Subsampling for Softmax Regression 10
3.1 Model and Subsampling Framework 10
3.2 Subsampling Under Baseline Constraint 11
3.3 Optimal Subsampling with Incomplete Labels 15
Chapter 4 Simulation and Discussion 17
4.1 Simulation Design 17
4.1.1 Data Generation 17
4.1.2 Subsampling Methods 19
4.2 Results 20
4.3 Discussion 21
References 21
Appendix A — Assumptions and Theoretical Proofs 26
dc.language.iso: en
dc.subject: 半監督式最佳化抽樣 [zh_TW]
dc.subject: 歸一化指數函式 [zh_TW]
dc.subject: 預測均方誤差 [zh_TW]
dc.subject: Mean squared prediction error [en]
dc.subject: Semi-supervised optimal subsampling [en]
dc.subject: Softmax regression [en]
dc.title: 最佳化半監督式歸一化指數函式抽樣 [zh_TW]
dc.title: Optimal Semi-Supervised Subsampling for Softmax Regression [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 林澤 [zh_TW]
dc.contributor.coadvisor: Che Lin [en]
dc.contributor.oralexamcommittee: 陳瑞彬;張明中 [zh_TW]
dc.contributor.oralexamcommittee: Ray-Bing Chen;Ming-Chung Chang [en]
dc.subject.keyword: 半監督式最佳化抽樣, 歸一化指數函式, 預測均方誤差 [zh_TW]
dc.subject.keyword: Semi-supervised optimal subsampling, Softmax regression, Mean squared prediction error [en]
dc.relation.page: 28
dc.identifier.doi: 10.6342/NTU202501310
dc.rights.note: 未授權 (Not authorized for public access)
dc.date.accepted: 2025-06-30
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資料科學學位學程 (Data Science Degree Program)
dc.date.embargo-lift: N/A
Appears in Collections: 資料科學學位學程 (Data Science Degree Program)

Files in this item:
ntu-113-2.pdf (restricted: not authorized for public access), 3.63 MB, Adobe PDF


Unless otherwise indicated in their copyright terms, all items in this repository are protected by copyright, with all rights reserved.
