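Improving k-Anonymization Algorithm to Enhance Machine Learning Performance of Anonymous Data through Feature Importance and Margin Preservation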

Cheng-Yen Lee (李丞彥)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89134

Title:	Improving k-Anonymization Algorithm to Enhance Machine Learning Performance of Anonymous Data through Feature Importance and Margin Preservation
Author:	Cheng-Yen Lee (李丞彥)
Advisor:	Ray-I Chang (張瑞益)
Keywords:	De-identification, k-Anonymity, Data Privacy, Feature Importance, Machine Learning
Publication Year:	2023
Degree:	Master's
Abstract:	Machine learning applications are proliferating and creating substantial business value across many fields, and high-quality data is key to training good models. However, the use of personal data carries a risk of privacy infringement, so jurisdictions worldwide continue to refine the regulations that govern how data controllers collect and access data. Examples include Taiwan's Personal Data Protection Act, amended by the Executive Yuan in 2023, and the European Commission's General Data Protection Regulation, introduced in 2018. These regulations require accessed personal data to be de-identified, which reduces the information it contains and poses a significant challenge for machine learning applications. Authorization to use personal data must be transparent and rigorous, so it is common to clarify the application requirements first and only then seek ways to collect data and obtain the consent of data subjects.
One example is the deployment of smart meters, which many countries promote in the public interest of energy conservation and carbon reduction. To improve data-sharing efficiency and unlock the potential value of data, previous studies proposed the Bounded-Error Data Privacy Protection (BEDPP) framework for privacy-preserving IoT data sharing services. However, that framework lacks a clear indicator for quantifying the degree of de-identification, which is one of the application contexts explored in this research. k-anonymity is a common de-identification method and is suitable for incorporation into the BEDPP framework. However, previous research has focused on optimizing overall error metrics without considering the machine learning objective or the fact that feature importance varies widely across application contexts. This research therefore takes training a classification model on anonymized data as an example and modifies the k-anonymity algorithm in two ways: by distributing error across features using feature importance (FI) as weights, or by first clustering the data on the target feature to preserve the margins between classes (margin preserving, MP), then anonymizing each cluster and merging the results. Both approaches aim to improve the performance of machine learning models trained on anonymized data. Compared with anonymizing the data using the original k-anonymity algorithm and then performing classification with the machine learning models adopted in this study, the experiments show that anonymization weighted by feature importance improves average model performance by up to 10.7%, and margin-preserving clustering followed by per-cluster anonymization improves it by up to 17.40%.
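The two strategies named in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm: it assumes a simple greedy microaggregation scheme (groups of at least k records, each replaced by its group centroid) as a stand-in for k-anonymization, numeric quasi-identifiers only, and hypothetical function names (`weighted_sq_dist`, `k_anonymize`, `k_anonymize_by_class`). Feature importance enters as weights in the distance used to form groups, so more important features tend to be distorted less; margin preservation is approximated by anonymizing each target class separately and merging the results.

```python
from collections import defaultdict


def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with one weight per feature (assumed FI weights)."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def k_anonymize(rows, k, weights):
    """Greedy microaggregation sketch: every output row is shared by >= k records.

    rows    -- list of equal-length tuples of numeric quasi-identifiers
    k       -- minimum group size
    weights -- per-feature weights; a larger weight keeps that feature tighter
    """
    if len(rows) < k:
        raise ValueError("need at least k records to form one group")
    remaining = list(range(len(rows)))
    groups = []
    # Peel off groups of exactly k while enough records remain for a valid last group.
    while len(remaining) >= 2 * k:
        seed = remaining[0]
        remaining.sort(key=lambda i: weighted_sq_dist(rows[seed], rows[i], weights))
        groups.append(remaining[:k])  # the seed and its k-1 nearest neighbours
        remaining = remaining[k:]
    groups.append(remaining)  # the last group holds between k and 2k-1 records
    n_features = len(rows[0])
    out = [None] * len(rows)
    for group in groups:
        centroid = tuple(
            sum(rows[i][j] for i in group) / len(group) for j in range(n_features)
        )
        for i in group:
            out[i] = centroid  # generalize each record to its group centroid
    return out


def k_anonymize_by_class(rows, labels, k, weights):
    """Margin-preserving variant (sketch): anonymize each class separately, then merge."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    out = [None] * len(rows)
    for indices in by_class.values():
        anonymized = k_anonymize([rows[i] for i in indices], k, weights)
        for i, row in zip(indices, anonymized):
            out[i] = row
    return out


if __name__ == "__main__":
    rows = [(1.0, 10.0), (2.0, 50.0), (1.1, 52.0), (2.1, 11.0)]
    labels = ["a", "b", "b", "a"]
    # Weighting only the first feature groups records 0 and 2, mixing classes.
    print(k_anonymize(rows, 2, (1.0, 0.0)))
    # Partitioning by class first keeps records of different classes apart.
    print(k_anonymize_by_class(rows, labels, 2, (1.0, 0.0)))
```

In this sketch the weights would come from any feature-importance estimate computed beforehand; how the thesis derives and applies them is not specified in the abstract. Note that the per-class variant requires every class to contain at least k records, otherwise `k_anonymize` raises `ValueError`.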
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89134
DOI:	10.6342/NTU202303509
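Full-text authorization:	Authorized (open access worldwide)
Appears in collections:	Department of Engineering Science and Ocean Engineering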

Files in this item:

File	Size	Format
ntu-111-2.pdf	1.4 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
