NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97662
Full metadata record
(DC field: value [language])
dc.contributor.advisor: 潘建興 [zh_TW]
dc.contributor.advisor: Frederick Kin Hing Phoa [en]
dc.contributor.author: 王愛琳 [zh_TW]
dc.contributor.author: Ai-Lin Wang [en]
dc.date.accessioned: 2025-07-09T16:17:55Z
dc.date.available: 2025-07-10
dc.date.copyright: 2025-07-09
dc.date.issued: 2025
dc.date.submitted: 2025-06-30
dc.identifier.citation:
[1] M. Ai, J. Yu, H. Zhang, and H. Wang. Optimal subsampling algorithms for big data regressions. Statistica Sinica, 2021.
[2] A. Atkinson, A. Donev, and R. Tobias. Optimum experimental designs, with SAS, volume 34. OUP Oxford, 2007.
[3] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’06, pages 1127–1136, USA, 2006. Society for Industrial and Applied Mathematics.
[4] W. Fithian and T. Hastie. Local case-control sampling: Efficient subsampling in imbalanced data sets. The Annals of Statistics, 42(5):1693–1724, 2014.
[5] S. G. Gilmour and L. A. Trinca. Optimum design of experiments for statistical inference. Journal of the Royal Statistical Society, Series C: Applied Statistics, 61(3):345–401, 2012.
[6] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2525–2534. PMLR, 2018.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] P. Ma, J. Huang, and N. Zhang. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika, 102:631–645, 2015.
[9] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research, 16(1):861–911, 2015.
[10] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[11] F. Pukelsheim. Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006.
[12] M. Quiroz, R. Kohn, M. Villani, and M.-N. Tran. Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114(526):831–843, 2019.
[13] E. R. Rahman, R. A. Shuvo, M. H. K. Mehedi, M. S. Hossain, and A. A. Rasel. Distributed computing for big data analytics: Challenges and opportunities. ResearchGate, 2022.
[14] G. Vaughan. Efficient big data model selection with applications to fraud detection. International Journal of Forecasting, 36(3):1116–1127, 2020.
[15] H. Wang and Y. Ma. Optimal subsampling for quantile regression in big data. Biometrika, 108(1):99–112, 2021.
[16] H. Wang, M. Yang, and J. Stufken. Information-based optimal subdata selection for big data linear regression. arXiv preprint arXiv:1710.10382, 2017.
[17] H. Wang, R. Zhu, and P. Ma. Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113(522):829–844, 2018.
[18] R. Xie, Z. Wang, S. Bai, P. Ma, and W. Zhong. Online decentralized leverage score sampling for streaming multidimensional time series. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2301–2311. PMLR, 2019.
[19] Y. Yao and H. Wang. Optimal subsampling for softmax regression. Statistical Papers, 60(2):585–599, 2019.
[20] Y. Yao and H. Wang. A review on optimal subsampling methods for massive datasets. Journal of Data Science, 19(1):151–172, 2021.
[21] Y. Yao, J. Zou, and H. Wang. Model constraints independent optimal subsampling probabilities for softmax regression. Journal of Statistical Planning and Inference, 225:188–201, 2023.
[22] J. Yu, M. Ai, and Z. Ye. A review on design inspired subsampling for big data. Statistical Papers, 65(2):467–510, 2023.
[23] H. Zhang and H. Wang. Distributed subdata selection for big data via sampling-based approach. Computational Statistics & Data Analysis, 153:107072, 2021.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97662
dc.description.abstract: In optimal subsampling methods, the absence of label information poses a significant challenge to estimating sampling probabilities, particularly in classification tasks, since conventional methods typically rely on complete response data. To address this, this work proposes a semi-supervised A-/L-optimal subsampling framework for the softmax regression model. The method derives theoretically optimal subsampling probabilities under the baseline constraint and further examines their statistical role in balancing the categorical response distribution. We also consider subsampling methods that are invariant to the choice of constraint, proposing a strategy that minimizes the asymptotic mean squared prediction error (MSPE) so that the constructed subsampling probabilities are more robust to model constraints. The theoretical results are validated on both simulated and real data, showing that the method improves prediction accuracy and computational efficiency. [zh_TW]
dc.description.abstract: Missing label information presents a significant challenge for optimal subsampling methods, which typically rely on complete response data to compute sampling probabilities. In this study, we propose a semi-supervised A-/L-optimal subsampling framework for softmax regression that effectively addresses this issue. We derive the optimal subsampling probabilities under the baseline constraint and highlight their role in balancing categorical responses. In addition, we explore constraint-invariant subsampling by minimizing the asymptotic mean squared prediction error (MSPE), enabling the construction of per-observation subsampling probabilities that are robust to the choice of model constraint. Our theoretical findings are supported by simulations and real-data applications, demonstrating improvements in both prediction accuracy and computational efficiency. [en]
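The A-/L-optimal scheme described in the abstract follows the common two-step recipe in this literature: fit a pilot estimate on a small uniform subsample, score every observation, draw a subsample with probabilities proportional to those scores, and refit with inverse-probability weights. The sketch below is illustrative only: the function names are hypothetical, the gradient-norm scores are a simplified stand-in for the thesis's actual A-/L-optimal probabilities, and the baseline constraint is imposed by pinning the last class's coefficients at zero.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, w=None, lr=0.5, iters=300):
    """Weighted softmax-regression fit by plain gradient descent.
    The baseline constraint fixes the last class's coefficients at zero."""
    n, d = X.shape
    k = Y.shape[1]
    w = np.ones(n) if w is None else w
    B = np.zeros((d, k))
    for _ in range(iters):
        P = softmax(X @ B)
        grad = X.T @ ((P - Y) * w[:, None]) / w.sum()
        B -= lr * grad
        B[:, -1] = 0.0  # baseline constraint on the last class
    return B

def aopt_subsample(X, Y, r, rng):
    """Two-step subsampling: pilot fit on a uniform subsample of size r,
    then r draws with probabilities proportional to per-point gradient
    norms (a crude stand-in for the A-/L-optimal probabilities)."""
    n = X.shape[0]
    pilot = rng.choice(n, size=r, replace=False)
    B0 = fit_softmax(X[pilot], Y[pilot])
    P = softmax(X @ B0)
    scores = np.linalg.norm(P - Y, axis=1) * np.linalg.norm(X, axis=1)
    probs = scores / scores.sum()
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # inverse-probability weights keep the subsample estimator roughly unbiased
    return fit_softmax(X[idx], Y[idx], w=1.0 / probs[idx])
```

Note that the residual term ‖P − Y‖ in the scores requires labels; in the semi-supervised setting studied in the thesis it would have to be replaced by a label-free surrogate for the unlabeled points, so this sketch assumes fully labeled data.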
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-09T16:17:55Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-07-09T16:17:55Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Acknowledgements i
摘要 ii
Abstract iii
Contents iv
List of Figures vi
List of Tables vii
Chapter 1 Introduction 1
1.1 Challenges of Massive Data in Statistical Modeling 1
1.2 Softmax Regression for Multi-Class Classification 2
1.3 Research Objectives and Approach 4
1.4 Significance and Scope 4
Chapter 2 Literature Review 6
2.1 Subsampling for Massive Datasets 6
2.2 Optimal Designs in Subsampling 7
2.3 Subsampling in Regression Models 8
Chapter 3 Optimal Design in Subsampling for Softmax Regression 10
3.1 Model and Subsampling Framework 10
3.2 Subsampling Under Baseline Constraint 11
3.3 Optimal Subsampling with Incomplete Labels 15
Chapter 4 Simulation and Discussion 17
4.1 Simulation Design 17
4.1.1 Data Generation 17
4.1.2 Subsampling Methods 19
4.2 Results 20
4.3 Discussion 21
References 21
Appendix A — Assumptions and Theoretical Proofs 26
dc.language.iso: en
dc.subject: 半監督式最佳化抽樣 [zh_TW]
dc.subject: 歸一化指數函式 [zh_TW]
dc.subject: 預測均方誤差 [zh_TW]
dc.subject: Mean squared prediction error [en]
dc.subject: Semi-supervised optimal subsampling [en]
dc.subject: Softmax regression [en]
dc.title: 最佳化半監督式歸一化指數函式抽樣 [zh_TW]
dc.title: Optimal Semi-Supervised Subsampling for Softmax Regression [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 林澤 [zh_TW]
dc.contributor.coadvisor: Che Lin [en]
dc.contributor.oralexamcommittee: 陳瑞彬;張明中 [zh_TW]
dc.contributor.oralexamcommittee: Ray-Bing Chen;Ming-Chung Chang [en]
dc.subject.keyword: 半監督式最佳化抽樣, 歸一化指數函式, 預測均方誤差 [zh_TW]
dc.subject.keyword: Semi-supervised optimal subsampling, Softmax regression, Mean squared prediction error [en]
dc.relation.page: 28
dc.identifier.doi: 10.6342/NTU202501310
dc.rights.note: 未授權 (Not authorized for public access)
dc.date.accepted: 2025-06-30
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資料科學學位學程 (Data Science Degree Program)
dc.date.embargo-lift: N/A
Appears in Collections: 資料科學學位學程 (Data Science Degree Program)

Files in this item:
ntu-113-2.pdf (restricted: not authorized for public access), 3.63 MB, Adobe PDF


Unless otherwise indicated in their copyright terms, all items in this repository are protected by copyright, with all rights reserved.
