Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99523
| Title: | 評估不平衡資料對於深度學習分類模型的影響 Evaluating the Impact of Imbalanced Data on Deep Learning Classification Models |
| Author: | 祁詠婕 Yung-Chieh Chi |
| Advisor: | 蔡政安 Chen-An Tsai |
| Keywords: | Imbalanced Data, SMOTE, Random Forest, LASSO, LSTM, MLP, Classification Prediction, Telecom Churn, Online Shopping Behavior |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | In real-world applications, classification problems often involve class imbalance, especially in customer churn prediction, fraud detection, and medical diagnosis. Imbalanced data biases a classification model toward the majority class and weakens its ability to recognize the minority class. This study investigates the impact of imbalanced data on prediction models and compares the performance of different data-balancing techniques combined with machine learning and deep learning models.
The study uses two real datasets, the "Online Shoppers' Purchase Intention" dataset and the "Telecom Customer Churn" dataset, together with data simulated using make_classification, and designs experimental conditions with different imbalance levels, feature combinations, and label error rates. The models compared are Random Forest (RF), LASSO, Long Short-Term Memory (LSTM), and Multilayer Perceptron (MLP), each paired with no oversampling, SMOTE, or SMOTE combined with Tomek Link and ENN.
The experimental results show that MLP performs better at minority-class prediction (Sensitivity), and that pairing it with SMOTE+Tomek Link+ENN significantly improves Sensitivity. LASSO is conservative in Accuracy but stable in F1-score and Balanced AC1, which suits applications that emphasize simplicity and generalization. RF excels in overall accuracy and majority-class prediction (Specificity) but remains biased in its predictions for the minority class. In addition, the RF+LASSO ensemble is stable in Accuracy and Specificity, while RF+MLP_permute performs well in F1-score, Sensitivity, and Balanced AC1, indicating stronger predictive ability for the minority class. These results provide concrete guidance on model selection and technique pairing for imbalanced data, with practical value for data science, behavioral analysis, and industry forecasting. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99523 |
| DOI: | 10.6342/NTU202501637 |
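| Full-Text Authorization: | Authorized (access limited to campus) |
| Electronic Full-Text Release Date: | 2030-06-30 |
| Appears in Collections: | Master's Program in Statistics |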
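
The abstract's central point, that a model favoring the majority class can look accurate while failing on the minority class, can be illustrated with a small worked example. The sketch below is not taken from the thesis: the 95:5 class split, both sets of predictions, and the helper names `confusion_counts` and `summarize` are hypothetical, chosen only to show how Accuracy, Sensitivity, Specificity, and F1-score diverge on imbalanced labels. Balanced AC1 is omitted because the abstract does not define it.

```python
# Illustrative sketch only: labels and predictions are made up,
# not results from the thesis.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def summarize(y_true, y_pred, positive=1):
    """Compute the metrics named in the abstract (except Balanced AC1)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = safe_div(tp, tp + fn)   # recall on the minority class
    specificity = safe_div(tn, tn + fp)   # recall on the majority class
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    accuracy = safe_div(tp + tn, len(y_true))
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical imbalanced labels: 95 majority (0) and 5 minority (1).
y_true = [0] * 95 + [1] * 5

# Classifier A always predicts the majority class.
majority_only = [0] * 100

# Classifier B finds 4 of the 5 minority cases at the cost of 10 false alarms.
minority_aware = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

baseline = summarize(y_true, majority_only)
alternative = summarize(y_true, minority_aware)

for name, scores in (("majority_only", baseline), ("minority_aware", alternative)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this made-up case the majority-only classifier has the higher Accuracy yet zero Sensitivity and F1-score, which is why the study reports minority-class metrics alongside Accuracy.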
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (not authorized for public access) | 3.46 MB | Adobe PDF | View/Open |
Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.
