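Improving k-Anonymization Algorithm to Enhance Machine Learning Performance of Anonymous Data through Feature Importance and Margin Preservation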

李丞彥; Cheng-Yen Lee

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89134

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	張瑞益	zh_TW
dc.contributor.advisor	Ray-I Chang	en
dc.contributor.author	李丞彥	zh_TW
dc.contributor.author	Cheng-Yen Lee	en
dc.date.accessioned	2023-08-16T17:16:23Z	-
dc.date.available	2023-11-09	-
dc.date.copyright	2023-08-16	-
dc.date.issued	2023	-
dc.date.submitted	2023-08-08	-
dc.identifier.citation	Executive Yuan, Republic of China (Taiwan). Personal Data Protection Act, 2023. Available: https://law.moj.gov.tw/LawClass/LawAll.aspx?PCode=I0050021. European Commission. General Data Protection Regulation, 2018. Available: https://gdpr-info.eu/. Ministry of Health and Welfare, Republic of China (Taiwan). National Health Insurance Research Database, 2000. Available: https://nhird.nhri.edu.tw/. European Commission. Smart grids and meters. European Commission Energy. Available: https://energy.ec.europa.eu/topics/markets-and-consumers/smart-grids-and-meters_en. 林瑞珠, 朱丹丹. Information security and privacy protection regulations needed to promote smart meter deployment. Taiwan Energy Journal (臺灣能源期刊), 5(4):315–330, 2018. 曾郁凱. Applications of bounded error and blockchain to IoT data security protection. Master’s thesis, National Taiwan University, 2021. V. Buterin. Ethereum, 2014. Available: https://ethereum.org/en/. N. Szabo. Smart contract, 1994. Available: https://en.wikipedia.org/wiki/Smart_contract. J. Benet. InterPlanetary File System, 2014. Available: https://en.wikipedia.org/wiki/InterPlanetary_File_System. L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557–570, 2002. A. Jović, K. Brkić, N. Bogunović. A review of feature selection methods with applications. International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), (38):1200–1205, 2015. J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, A. W.-C. Fu. Utility-based anonymization using local recoding. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06), page 785, New York, NY, USA, 2006. L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):571–588, 2002. J.-L. Lin, M.-C. Wei. An efficient clustering method for k-anonymization. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society (PAIS ’08), pages 46–50, New York, NY, USA, 2008. K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Mondrian multidimensional k-anonymity. In 22nd International Conference on Data Engineering (ICDE ’06), pages 25–25, Atlanta, GA, USA, 2006. K. El Emam, F. K. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, J. Bottomley. A globally optimal k-anonymity method for the de-identification of health data. Journal of the American Medical Informatics Association, 16:670–682, 2009. L. Breiman. Random forests. Machine Learning, 45:5–32, 2001. T. Chen, C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), pages 785–794, San Francisco, CA, USA, 2016. D. Slijepčević, M. Henzl, L. D. Klausner, T. Dam, P. Kieseberg, M. Zeppelzauer. k-anonymity in practice: How generalisation and suppression affect machine learning classifiers. Computers & Security, 111:102488, 2021. B. Becker, R. Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20. C. Nugent. California housing prices. Kaggle, 2017. Available: https://www.kaggle.com/datasets/camnugent/california-housing-prices. T.-S. Lim. Contraceptive method choice. UCI Machine Learning Repository, 1997. DOI: https://doi.org/10.24432/C59W2D. M. Elter. Mammographic mass. UCI Machine Learning Repository, 2007. DOI: https://doi.org/10.24432/C53K6Z.	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89134	-
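dc.description.abstract	Machine learning applications are flourishing and creating immeasurable commercial value across many fields, and high-quality data is key to training good models. However, the use of personal data carries a risk of privacy infringement, so jurisdictions and organizations continue to refine regulations governing how data controllers collect and access data, such as the Personal Data Protection Act amended by Taiwan's Executive Yuan in 2023 and the General Data Protection Regulation enacted by the European Commission (EC) in 2018. These regulations require accessed personal data to be de-identified, which reduces the information it contains and poses a major challenge for machine learning applications. Authorization for the use of personal data must be transparent and rigorous, so application requirements are usually clarified first, and only then are ways sought to collect the data and obtain authorization from data subjects; the smart meter deployments that many countries promote in the public interest of energy conservation and carbon reduction are one example. To improve data-sharing efficiency and thereby uncover the latent value of data, previous research proposed the Bounded-Error Data Privacy Protection (BEDPP) framework to realize privacy-preserving IoT data-sharing services. That framework, however, still lacks a clear metric for quantifying the degree of de-identification, which is one of the application scenarios this study explores. k-anonymity is a common de-identification technique that is suitable for integration into the BEDPP framework. However, previous studies have all focused on optimizing overall error metrics and have not taken into account the machine learning objective or the high heterogeneity of feature importance across application scenarios. Taking the training of classification models on anonymized data as an example, this study therefore either distributes error within the k-anonymity algorithm using feature importance (FI) as weights, or first groups the data by the target feature to preserve the margin between classes (margin preserving, MP) and then anonymizes each group before merging them, in order to improve the machine learning performance of anonymized data. Experimental results show that, compared with anonymizing data using the original k-anonymity algorithms and then performing classification with the machine learning models adopted in this study, k-anonymization that considers feature importance improves machine learning performance by up to 10.7% on average, while margin-preserving grouping followed by per-group anonymization and merging improves it by up to 17.40% on average.	zh_TW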
dc.description.abstract	Machine learning applications have been proliferating and creating incalculable business value across various fields, and high-quality data plays a crucial role in training high-performing models. However, the use of personal data poses a risk of privacy infringement. Organizations worldwide therefore continue to refine regulations governing how data controllers collect and access data. Examples include Taiwan's Personal Data Protection Act, amended by the Executive Yuan in 2023, and the European Commission's General Data Protection Regulation, introduced in 2018. These regulations require accessed personal data to be de-identified, which reduces the information it contains and presents a significant challenge for machine learning applications. Authorization for the use of personal data should be transparent and rigorous, so it is common to clarify the application requirements first and then seek ways to collect data with the consent of data subjects. One example is the deployment of smart meters, which various countries promote in the public interest of energy conservation and carbon reduction. To enhance data-sharing efficiency and unlock the potential value of data, previous studies proposed the Bounded-Error Data Privacy Protection (BEDPP) framework for privacy-preserving IoT data-sharing services. However, this framework lacks a clear indicator for quantifying the level of de-identification, which is one of the application contexts explored in this research. k-anonymity is a common de-identification method. However, previous research has mainly focused on optimizing holistic error metrics without considering machine learning objectives or the high heterogeneity of feature importance across application contexts. Our research therefore takes training a classification model on anonymized data as an example. In our k-anonymity algorithms, errors are either allocated using feature importance as weights, or the data are first clustered by the target feature to preserve class margins and each cluster is then anonymized before the clusters are merged. These approaches aim to improve the performance of machine learning models trained on anonymized data. Compared with anonymizing data using the original k-anonymity algorithms and then performing classification, the experimental results show that anonymization that considers feature importance improves model performance by up to 10.7% on average, and margin preservation before anonymization improves it by up to 17.40% on average.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T17:16:23Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2023-08-16T17:16:23Z (GMT). No. of bitstreams: 0	en
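dc.description.tableofcontents	Abstract (Chinese); Abstract (English); Table of Contents; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Research Background; 1.2 Research Motivation; 1.3 Research Objectives; Chapter 2 Literature Review; 2.1 Feature Selection Algorithms; 2.2 k-Anonymity Algorithms; 2.2.1 k-NN Clustering-Based Anonymization; 2.2.2 Top-Down Greedy Anonymization; 2.2.3 Mondrian Anonymization; 2.2.4 Optimal Lattice Anonymization; 2.3 Machine Learning Models; 2.3.1 k-Nearest Neighbors; 2.3.2 Support Vector Machine; 2.3.3 Random Forest; 2.3.4 eXtreme Gradient Boosting; 2.4 Usability of k-Anonymity Algorithms for Machine Learning; Chapter 3 Research Method Design; 3.1 k-Anonymity Algorithms Considering Feature Importance; 3.1.1 Measuring Feature Importance; 3.1.2 CBAFI; 3.1.3 TDGAFI; 3.1.4 MAFI; 3.1.5 OLAFI; 3.2 k-Anonymization Workflow Combined with Margin Preservation; Chapter 4 Experimental Results and Discussion; 4.1 Datasets; 4.2 Improving k-Anonymity Algorithms with Feature Importance to Enhance Machine Learning Performance on Anonymized Data; 4.2.1 Characteristics of k-Anonymity Algorithms Suited to Incorporating Feature Importance; 4.2.2 Characteristics of Datasets Suited to Feature-Importance-Aware Anonymization; 4.2.3 Classification Solutions for Anonymized Data Suited to Incorporating Feature Importance; 4.3 Combining Margin Preservation with k-Anonymity Algorithms to Enhance Machine Learning Performance on Anonymized Data; 4.3.1 Characteristics of k-Anonymity Algorithms Suited to Margin Preservation; 4.3.2 Characteristics of Machine Learning Models Suited to Data Anonymized with Margin Preservation; 4.3.3 Settings of Parameter k Suited to Data Anonymized with Margin Preservation; Chapter 5 Conclusions and Future Work; 5.1 Conclusions; 5.2 Future Work; References	-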
dc.language.iso	zh_TW	-
dc.subject	De-identification	zh_TW
dc.subject	Data Privacy	zh_TW
dc.subject	k-Anonymity	zh_TW
dc.subject	Feature Importance	zh_TW
dc.subject	Machine Learning	zh_TW
dc.subject	k-Anonymity	en
dc.subject	Data Privacy	en
dc.subject	Feature Importance	en
dc.subject	De-identification	en
dc.subject	Machine Learning	en
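dc.title	Improving k-Anonymization Algorithm to Enhance Machine Learning Performance of Anonymous Data through Feature Importance and Margin Preservation	zh_TW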
dc.title	Improving k-Anonymization Algorithm to Enhance Machine Learning Performance of Anonymous Data through Feature Importance and Margin Preservation	en
dc.type	Thesis	-
dc.date.schoolyear	111-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	丁肇隆;王家輝;尹邦嚴	zh_TW
dc.contributor.oralexamcommittee	Chao-Lung Ting;Chia-Hui Wang;Peng-Yeng Yin	en
dc.subject.keyword	De-identification, k-Anonymity, Data Privacy, Feature Importance, Machine Learning	zh_TW
dc.subject.keyword	De-identification, k-Anonymity, Data Privacy, Feature Importance, Machine Learning	en
dc.relation.page	39	-
dc.identifier.doi	10.6342/NTU202303509	-
dc.rights.note	Authorization granted (open access worldwide)	-
dc.date.accepted	2023-08-09	-
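dc.contributor.author-college	College of Engineering	-
dc.contributor.author-dept	Department of Engineering Science and Ocean Engineering	-
Appears in Collections:	Department of Engineering Science and Ocean Engineering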
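
The abstract describes two ideas: allocating anonymization error according to feature importance, and grouping records by the target class before anonymizing each group (margin preservation). The following is a minimal Python sketch of how the two could be combined. It is not the thesis's CBAFI, TDGAFI, MAFI, or OLAFI implementation. It assumes numeric quasi-identifiers, a simplified Mondrian-style median split, and importance weights supplied by the caller; the function names `mondrian_partition`, `k_anonymize`, and `margin_preserving_anonymize` are hypothetical.

```python
from collections import defaultdict
from statistics import median


def mondrian_partition(rows, k, weights, spans):
    """Split rows into groups of at least k rows using weighted median cuts."""
    def score(d):
        values = [r[d] for r in rows]
        return weights[d] * (max(values) - min(values)) / spans[d]

    # Try features from the highest weighted range to the lowest. Assumption:
    # cutting on a feature narrows its interval, so it keeps more detail.
    for d in sorted(range(len(weights)), key=score, reverse=True):
        if score(d) == 0:
            break
        cut = median(r[d] for r in rows)
        left = [r for r in rows if r[d] < cut]
        right = [r for r in rows if r[d] >= cut]
        if len(left) >= k and len(right) >= k:
            return (mondrian_partition(left, k, weights, spans)
                    + mondrian_partition(right, k, weights, spans))
    return [rows]  # no allowable cut: rows form one equivalence class


def k_anonymize(rows, k, weights=None):
    """Replace each row by the (min, max) intervals of its equivalence class."""
    if len(rows) < k:
        raise ValueError("need at least k rows to form a k-anonymous group")
    n_dims = len(rows[0])
    weights = weights or [1.0] * n_dims
    # Range of each feature over the input, used to normalize the scores.
    spans = [(max(r[d] for r in rows) - min(r[d] for r in rows)) or 1.0
             for d in range(n_dims)]
    released = []
    for group in mondrian_partition(rows, k, weights, spans):
        box = tuple((min(r[d] for r in group), max(r[d] for r in group))
                    for d in range(n_dims))
        released.extend([box] * len(group))
    return released


def margin_preserving_anonymize(rows, labels, k, weights=None):
    """Anonymize each class separately, then merge the released records."""
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)
    merged = []
    for label, group in by_label.items():
        merged += [(box, label) for box in k_anonymize(group, k, weights)]
    return merged
```

Calling `margin_preserving_anonymize(rows, labels, k=2, weights=[0.8, 0.2])` returns one `(intervals, label)` pair per input record, and no equivalence class mixes records from different classes. In this sketch a class with fewer than `k` records raises `ValueError`, and leaving `weights` unset treats all features equally. How the thesis measures feature importance is not stated in this record, so the weights here are placeholders.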

Files in This Item:

File	Size	Format
ntu-111-2.pdf	1.4 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.