NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99667
Full metadata record
DC Field / Value / Language
dc.contributor.advisor: 陳由常 (zh_TW)
dc.contributor.advisor: Yu-Chang Chen (en)
dc.contributor.author: 廖明祐 (zh_TW)
dc.contributor.author: Ming-You Liao (en)
dc.date.accessioned: 2025-09-17T16:19:01Z
dc.date.available: 2025-09-18
dc.date.copyright: 2025-09-17
dc.date.issued: 2025
dc.date.submitted: 2025-08-08
dc.identifier.citation:
Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. In Proceedings of the International Conference on Database Theory (ICDT), pages 420–434. Springer, 2001.
Tamaz Amiranashvili, David Lüdke, Hongwei Bran Li, Bjoern Menze, and Stefan Zachow. Learning shape reconstruction from sparse measurements with neural implicit functions. In International Conference on Medical Imaging with Deep Learning, pages 22–34. PMLR, 2022.
Borja Balle, Giovanni Cherubin, and Jamie Hayes. Reconstructing training data with informed adversaries. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1138–1156. IEEE, 2022.
Lukas Biewald. Experiment tracking with weights and biases. https://www.wandb.com/, 2020. Software available from wandb.com.
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. ISBN 9780387310732.
Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 123–132, 2021.
Gon Buzaglo, Niv Haim, Gilad Yehudai, Gal Vardi, Yakir Oz, Yaniv Nikankin, and Michal Irani. Deconstructing data reconstruction: Multiclass, weight decay and general losses. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2307.01827. arXiv:2307.01827.
George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, Pacific Grove, CA, 2nd edition, 2002.
Dheeru Dua and Casey Graff. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2017.
Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.
Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891, 2020.
Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pages 17–32, 2014.
Matthew Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1322–1333. ACM, 2015.
Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. An open source AutoML benchmark, 2019. URL https://arxiv.org/abs/1907.00909. arXiv:1907.00909.
Niv Haim, Gilad Yehudai, Gal Vardi, Ohad Shamir, and Michal Irani. Reconstructing training data from trained neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2206.07758. arXiv:2206.07758.
Agus Hartoyo, Dominika Ciupek, Maciej Malawski, and Alessandro Crimi. Data reconstruction from machine learning models via inverse estimation and Bayesian inference. Scientific Reports, 15(1):13856, 2025.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009. ISBN 9780387848570.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, and Han-Jia Ye. Representation learning for tabular data: A comprehensive survey. arXiv preprint arXiv:2504.16109, 2025.
Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation fixes dataset reconstruction attacks. arXiv preprint arXiv:2302.01428, 2023.
Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. URL https://arxiv.org/abs/1906.05890. arXiv:1906.05890.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013. URL http://arxiv.org/abs/1306.2597.
Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need, June 2021. URL https://arxiv.org/abs/2106.03253. arXiv:2106.03253.
Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani. A detailed analysis of the KDD Cup 99 data set. In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pages 1–6. IEEE, 2009.
Mark Vero, Mislav Balunović, Dimitar I. Dimitrov, and Martin Vechev. TabLeak: Tabular data leakage in federated learning. arXiv preprint arXiv:2210.01785, 2022.
Jinsung Yoon, Yao Zhang, James Jordon, and Mihaela van der Schaar. VIME: Extending the success of self- and semi-supervised learning to tabular domain. Advances in Neural Information Processing Systems, 33:11033–11043, 2020.
Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. Advances in Neural Information Processing Systems, 32, 2019.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99667
dc.description.abstract:
近年研究指出,即使外界無法直接存取原始輸入資料,神經網路仍可能在參數中隱含並洩漏訓練樣本,這對於常含敏感資訊的表格型資料而言,構成重大的隱私威脅。雖然已有多種資料重建攻擊方法被提出,但大多仰賴大量模型內部資訊,如梯度、邏輯值或預測結果,使其在實務應用中受到限制。

本論文探討一個核心問題:是否僅依賴已訓練完成模型的最終權重,就能還原其訓練樣本。我們的研究基於近期一項理論,該理論指出,當以梯度下降法訓練 ReLU 神經網路時,模型實際上隱性地在解一個最大邊界問題,因此提供了一個利用 Karush-Kuhn-Tucker (KKT) 條件的方式,從模型參數中重建訓練資料。我們將此基於 KKT 的方法應用於表格式資料的情境,並系統性地評估其有效性。

在合成與真實資料集上的實驗結果顯示,部分訓練資料點確實能被準確地重建。我們進一步分析發現,重建效果受到多項關鍵因素影響,包括:資料本身的訊號變異數、輸出類別數量,以及神經網路的深度與寬度。特別地,在訊號變異數較低、類別數較多,且模型結構偏深時,重建攻擊的效果最為顯著。

此外,我們透過統計方法描繪資料洩漏的結構性特徵,指出雖然重建結果近似原始資料,但仍存在系統性的偏差。我們也展示了此攻擊方法在現實環境中的操作可行性,即便無法存取真實資料,攻擊者仍能利用模型分類邊界,有效篩選出準確度高的重建樣本。總體而言,本研究探討將基於 KKT 條件的資料重建方法應用於表格型資料的可行性,並透過系統性實驗評估其成效與限制,進一步揭示神經網路最終權重中潛在的資料洩漏風險。
(zh_TW)
dc.description.abstract:
Recent studies have shown that neural network models can memorize and leak training data, even when the original inputs are not directly accessible. This poses a serious privacy concern for tabular datasets, which often contain sensitive information. While existing work has demonstrated various reconstruction attacks, many of these approaches require access to extensive model information, such as gradients or logits.

This thesis investigates whether training records can be recovered using only the final weights of a trained model. We build on a recent framework which posits that training a ReLU neural network with gradient descent implicitly solves a maximum-margin problem. This connection enables the use of the Karush-Kuhn-Tucker (KKT) conditions of this optimization problem to reconstruct potential training data from the model parameters. We adapt this KKT-based approach to the tabular data setting and systematically evaluate its effectiveness.

Through controlled experiments on both synthetic and real-world datasets, we demonstrate that a subset of training instances can indeed be reconstructed with high fidelity. Our analysis reveals several key factors influencing this vulnerability: reconstruction is most effective when the data has low signal variance, when the number of output classes is high, and, notably, when the network architecture is deep rather than wide.

Furthermore, we characterize the nature of the leakage through statistical testing, finding that the reconstructions are close approximations but contain systematic bias. Finally, we demonstrate the practical viability of this attack by showing that an attacker can use the model's classification margin to reliably identify these high-fidelity reconstructions without access to the ground truth. Overall, this study investigates the applicability of KKT-based data reconstruction methods to tabular data. Through systematic experiments, we evaluate both the feasibility and the limitations of this approach, highlighting the potential risks of data leakage embedded in the final weights of neural networks.
(en)
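The maximum-margin connection summarized in the abstract can be made concrete with a short worked sketch. The display below follows the binary-classification setting of Lyu and Li (2020) and the weight-only reconstruction scheme of Haim et al. (2022), both listed in the citation field above; the symbols (parameters \theta, homogeneous ReLU network \Phi, dual coefficients \lambda_i, candidate count m) are generic notation for illustration and do not reproduce the thesis's own multiclass formulation from Chapter 3.

\min_{\theta}\ \tfrac{1}{2}\lVert\theta\rVert_2^2
\quad\text{s.t.}\quad y_i\,\Phi(\theta; x_i) \ge 1,\qquad i = 1,\dots,n,\qquad y_i \in \{\pm 1\}.

Gradient descent on a logistic-type loss converges in direction to a KKT point of this problem, at which stationarity and complementary slackness hold:

\theta \;=\; \sum_{i=1}^{n} \lambda_i\, y_i\, \nabla_{\theta}\Phi(\theta; x_i),
\qquad \lambda_i \ge 0,
\qquad \lambda_i\bigl(y_i\,\Phi(\theta; x_i) - 1\bigr) = 0.

An attacker who sees only the final weights \theta can therefore search over candidate points x_1,\dots,x_m and coefficients \lambda_1,\dots,\lambda_m \ge 0 that make the stationarity residual small:

L_{\mathrm{rec}} \;=\; \Bigl\lVert\, \theta \;-\; \sum_{j=1}^{m} \lambda_j\, y_j\, \nabla_{\theta}\Phi(\theta; x_j) \,\Bigr\rVert_2^2 .

Candidates that drive L_{\mathrm{rec}} toward zero tend to coincide with training points that sit on the margin, which is consistent with the abstract's observation that only a subset of the training set is reconstructed with high fidelity.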
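The abstract's last point, that the model's classification margin alone lets an attacker screen for high-fidelity reconstructions, can be illustrated with a minimal PyTorch sketch. It assumes a trained classifier model that returns class logits and a list of reconstructed feature tensors candidates; the function names (logit_margin, select_reconstructions), the ranking rule, and the keep_ratio cutoff are illustrative assumptions rather than the selection procedure evaluated in the thesis.

import torch

def logit_margin(model, x):
    # Margin of one candidate: top logit minus the runner-up logit.
    # Small values indicate points near the model's decision boundary.
    with torch.no_grad():
        logits = model(x.unsqueeze(0)).squeeze(0)
    top2 = torch.topk(logits, k=2).values
    return (top2[0] - top2[1]).item()

def select_reconstructions(model, candidates, keep_ratio=0.1):
    # Rank reconstructed candidates by their margin under the trained model
    # and keep the fraction closest to the decision boundary, where the
    # recoverable (margin) training points are expected to lie.
    scores = [logit_margin(model, x) for x in candidates]
    order = sorted(range(len(candidates)), key=scores.__getitem__)
    k = max(1, int(keep_ratio * len(candidates)))
    return [candidates[i] for i in order[:k]]

In practice the 10% cutoff is only a placeholder; an attacker would tune it or compare each candidate's margin against the smallest margins observed across many reconstruction runs.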
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-17T16:19:01Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2025-09-17T16:19:01Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
摘要 ii
Abstract iv
Contents vi
List of Figures x
List of Tables xvii
Chapter 1 Introduction 1
Chapter 2 Literature Review 5
2.1 Model Memorization 5
2.2 Tabular Data Reconstruction 6
2.2.1 Model Inversion 6
2.2.2 Gradient-Based Reconstruction Methods 7
2.2.3 Bayesian Inference Approaches 7
2.3 Summary 8
Chapter 3 Methodology 9
3.1 Binary Reconstruction Method 9
3.1.1 Theoretical Foundation 9
3.1.2 Reconstruction Loss Formulation 11
3.2 Multiclass Reconstruction Method 12
3.2.1 Theoretical Foundation 12
3.2.2 Reconstruction Loss Formulation 13
3.3 Summary 14
Chapter 4 Research Questions and Experimental Design 15
4.1 Research Questions (RQs) 15
4.2 Experimental Procedures 16
4.2.1 Initial Exploration of Reconstruction Behavior and Evaluation Framework (for RQ1) 16
4.2.2 Baseline Comparison (for RQ2) 17
4.2.3 Controlled Experiments on Influential Factors (for RQ3) 17
4.2.4 Generalization to Real-World Datasets (for RQ4) 18
4.2.5 Statistical Property Analysis (for RQ5) 18
4.2.6 Simulating a Practical Attack Scenario (for RQ6) 19
4.3 Evaluation Metrics 19
Chapter 5 Datasets and Implementation Details 21
5.1 Synthetic Data Generation 21
5.1.1 Experimental Setup for Synthetic Data 23
5.2 Real-World Datasets 24
5.3 Implementation Details 25
5.3.1 Experimental Setups per Research Question 26
5.3.2 Training Procedure 27
5.3.3 Reconstruction Optimization Procedure 28
Chapter 6 Experiment Results 30
6.1 RQ1: The Evaluation Framework 30
6.2 RQ2: Efficacy against Baselines 35
6.3 RQ3: Identification of Influential Factors 38
6.3.1 Impact of Data Characteristics 38
6.3.1.1 Effect of Signal Variance 38
6.3.1.2 Effect of the Number of Classes 39
6.3.1.3 Effect of Training Size 43
6.3.2 Impact of Model Architecture 45
6.3.2.1 Effect of Hidden Layer Width 45
6.3.2.2 Effect of Hidden Layer Depth 47
6.4 RQ4: Real-World Generalizability 49
6.4.1 Numerical-Only Datasets 49
6.4.2 Mixed-Type Datasets 53
6.5 RQ5: Statistical Characterization 56
6.5.1 Unbiasedness Analysis 56
6.5.2 Consistency Analysis 58
6.6 RQ6: Practical Feasibility 60
Chapter 7 Discussion 62
7.1 Rethinking the Threat Model 62
7.2 Privacy Trade-offs in Data and Model Design 63
7.3 Limitations of Preprocessing as a Security Measure 64
7.4 The Practicality of Attacks 64
Chapter 8 Conclusion 66
8.1 Future Work 67
References 69
Appendix A — Further Introduction 74
A.1 Implementation Details 74
A.2 Class-wise Reconstruction Visualization 75
A.3 Full Class-wise Visualizations for Real Datasets 78
A.3.1 Higgs Boson 78
A.3.2 MSLR 80
A.3.3 Gas Concentrations 83
A.3.4 Helena 87
dc.language.iso: en
dc.subject: 資料重建 (zh_TW)
dc.subject: 神經網路模型 (zh_TW)
dc.subject: 機器學習 (zh_TW)
dc.subject: 表格資料 (zh_TW)
dc.subject: 資料隱私 (zh_TW)
dc.subject: KKT-based 重建方法 (zh_TW)
dc.subject: Machine Learning (en)
dc.subject: Tabular Data (en)
dc.subject: Data Privacy (en)
dc.subject: Neural Network Model (en)
dc.subject: KKT-based Reconstruction (en)
dc.subject: Data Reconstruction (en)
dc.title: 多類別神經網路的訓練資料重建方法:針對表格式資料的實證研究 (zh_TW)
dc.title: Data Reconstruction from Multi-Class Neural Networks: An Empirical Study on Tabular Data (en)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 盧信銘 (zh_TW)
dc.contributor.coadvisor: Hsin-Min Lu (en)
dc.contributor.oralexamcommittee: 陳柏安;陳建錦 (zh_TW)
dc.contributor.oralexamcommittee: Po-An Chen;Chien-Chin Chen (en)
dc.subject.keyword: 機器學習,資料重建,資料隱私,表格資料,KKT-based 重建方法,神經網路模型 (zh_TW)
dc.subject.keyword: Machine Learning,Data Reconstruction,Data Privacy,Tabular Data,KKT-based Reconstruction,Neural Network Model (en)
dc.relation.page: 97
dc.identifier.doi: 10.6342/NTU202501907
dc.rights.note: 同意授權(限校園內公開) (authorized for campus-only access)
dc.date.accepted: 2025-08-12
dc.contributor.author-college: 社會科學院 (College of Social Sciences)
dc.contributor.author-dept: 經濟學系 (Department of Economics)
dc.date.embargo-lift: 2030-08-01
Appears in Collections: 經濟學系 (Department of Economics)

Files in This Item:
File: ntu-113-2.pdf (restricted access; not authorized for public use)
Size: 10.36 MB
Format: Adobe PDF


All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
