Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99523

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 蔡政安 | zh_TW |
| dc.contributor.advisor | Chen-An Tsai | en |
| dc.contributor.author | 祁詠婕 | zh_TW |
| dc.contributor.author | Yung-Chieh Chi | en |
| dc.date.accessioned | 2025-09-10T16:33:09Z | - |
| dc.date.available | 2025-09-11 | - |
| dc.date.copyright | 2025-09-10 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-11 | - |
| dc.identifier.citation | Bao, F., Wu, Y., Li, Z., Li, Y., Liu, L., & Chen, G. (2020). Effect improved for high-dimensional and unbalanced data anomaly detection model based on KNN-SMOTE-LSTM. Complexity, 2020(1), 9084704.
Bhatnagar, A., & Srivastava, S. (2025). Customer Churn Prediction: A Machine Learning Approach with Data Balancing for Telecom Industry. International Journal of Computing, 24(1), 9-18. https://doi.org/10.47839/ijc.24.1.3873
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1-22.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2011). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning (Vol. 1). MIT Press, Cambridge.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221-232.
Ouf, S., Mahmoud, K. T., & Abdel-Fattah, M. A. (2024). A proposed hybrid framework to improve the accuracy of customer churn prediction in telecom industry. Journal of Big Data, 11(1), 70.
Reichheld, F. F., & Sasser, W. E. (1990). Zero defections: Quality comes to services. Harvard Business Review, 68(5), 105-111.
Sakar, C. O., Polat, S. O., Katircioglu, M., & Kastro, Y. (2019). Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications, 31(10), 6893-6908.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04), 687-719.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267-288. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99523 | - |
| dc.description.abstract | 在現實應用中,分類問題常遇到類別不平衡的情況,尤其在顧客流失預測、詐欺偵測及醫療診斷等領域。此類不平衡資料會導致分類模型偏向多數類別,影響少數類別的辨識能力。本研究旨在探討不平衡資料對預測模型的影響,並比較不同資料平衡技術與機器學習/深度學習模型的效能。
本研究選擇兩個真實資料集:「線上購物者購買意圖資料集」與「電信客戶流失資料集」,並加入使用make_classification模擬的資料,設計不同不平衡程度、特徵組合及標記錯誤率條件進行分析。比較的模型包括隨機森林(RF)、LASSO、長短期記憶網路(LSTM)及多層感知機(MLP),並分別搭配無處理(No Oversampling)、SMOTE及SMOTE結合Tomek Link與ENN等資料平衡處理方法。 實驗結果顯示:MLP模型在少數類別預測(Sensitivity)表現較佳,尤其搭配SMOTE+Tomek Link+ENN處理時,能顯著提升Sensitivity;LASSO在Accuracy方面表現保守,但在F1-score與Balanced AC1指標上穩定,適合強調簡潔性與泛化能力的應用;RF在整體準確率與多數類預測(Specificity)方面表現優異,但對少數類別的預測仍存在偏誤。此外,RF+LASSO的ensemble模型在Accuracy與Specificity上穩定,而RF+MLP_permute則在F1-score、Sensitivity及Balanced AC1方面表現出色,顯示其對少數類別預測能力更強。 本研究結果提供了不平衡資料處理中的模型選擇與技術搭配的具體指引,對於資料科學、行為分析及產業預測等領域具有重要應用價值。 | zh_TW |
| dc.description.abstract | In real-world applications, classification problems often involve class imbalance, especially in areas such as customer churn prediction, fraud detection, and medical diagnosis. Imbalanced data bias a classification model toward the majority class and weaken its ability to recognize the minority class. This study investigates the impact of imbalanced data on prediction models and compares the performance of different data balancing techniques combined with machine learning and deep learning models.
This study uses two real data sets, the "Online Shoppers' Purchase Intention Data Set" and the "Telecom Customer Churn Data Set", together with data simulated using make_classification, and designs conditions with different imbalance levels, feature combinations, and label error rates for analysis. The compared models are Random Forest (RF), LASSO, Long Short-Term Memory (LSTM), and Multilayer Perceptron (MLP), each paired with data balancing treatments such as No Oversampling, SMOTE, and SMOTE combined with Tomek Link and ENN. The experimental results show that the MLP model performs better at predicting the minority class (Sensitivity), and that pairing it with SMOTE+Tomek Link+ENN significantly improves Sensitivity; LASSO is conservative in Accuracy but stable in F1-score and Balanced AC1, which suits applications that emphasize simplicity and generalization ability; RF performs well in overall Accuracy and in majority-class prediction (Specificity) but remains biased in predicting the minority class. In addition, the RF+LASSO ensemble model is stable in Accuracy and Specificity, while RF+MLP_permute performs well in F1-score, Sensitivity, and Balanced AC1, indicating a stronger ability to predict the minority class. These results provide concrete guidance on model selection and technique pairing for imbalanced data, with application value in fields such as data science, behavioral analysis, and industry forecasting. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-10T16:33:09Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-09-10T16:33:09Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Oral Examination Committee Certification I
Acknowledgements II
Chinese Abstract III
English Abstract IV
Table of Contents V
List of Figures VII
List of Tables VIII
Chapter 1 Introduction 1
Section 1 Research Background 1
Section 2 Research Objectives 3
Section 3 Research Workflow 4
Chapter 2 Literature Review 6
Section 1 Data Sources 6
1. Online Shoppers Purchasing Intention Dataset 6
2. Telco Customer Churn 7
Section 2 Machine Learning Models 8
1. Random Forest (RF) 8
2. Lasso Regression 9
3. Long Short-Term Memory (LSTM) 9
4. Multilayer Perceptron (MLP) 9
Chapter 3 Research Methods 11
Section 1 Data Sources 11
1. Data Simulation 11
2. Online Shoppers Purchasing Intention Dataset 12
3. Telco Customer Churn 13
Section 2 Data Balancing Methods 15
1. SMOTE (Synthesized Minority Oversampling Technique) 15
2. Tomek Links 16
3. ENN (Edited Nearest Neighborhood) 17
Section 3 Machine Learning Models 19
1. Random Forest (RF) 19
2. Lasso Regression 21
3. Long Short-Term Memory (LSTM) 21
4. Multilayer Perceptron (MLP) 24
Section 4 Model Evaluation and Comparison 27
1. Confusion Matrix 27
2. Optimal Cutoff 29
3. 10-fold Cross Validation 30
4. Ensemble Method 31
Chapter 4 Results 33
Section 1 Simulation Study Results 33
1. Analysis and Overall Review of Simulation Experiment Results 33
2. Comparative Analysis of Model Integration 40
Section 2 Real Data Results 42
1. Online Shoppers Purchasing Intention Dataset 42
2. Telco Customer Churn 47
3. Ensemble Model Integration Performance 54
Section 3 Variable Importance Results 57
1. Online Shoppers Purchasing Intention Datasets 57
2. Telco Customer Churn 59
Chapter 5 Conclusions and Recommendations 62
Section 1 Research Conclusions 62
Section 2 Research Limitations 63
Section 3 Future Work and Recommendations 64
References 66
Appendix 68 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | LSTM | zh_TW |
| dc.subject | MLP | zh_TW |
| dc.subject | 分類預測 | zh_TW |
| dc.subject | 電信客戶流失 | zh_TW |
| dc.subject | 線上購物行為 | zh_TW |
| dc.subject | SMOTE | zh_TW |
| dc.subject | 不平衡資料 | zh_TW |
| dc.subject | Random Forest | zh_TW |
| dc.subject | LASSO | zh_TW |
| dc.subject | Online Shopping Behavior | en |
| dc.subject | Telecom Churn | en |
| dc.subject | Classification Prediction | en |
| dc.subject | MLP | en |
| dc.subject | LSTM | en |
| dc.subject | LASSO | en |
| dc.subject | Random Forest | en |
| dc.subject | SMOTE | en |
| dc.subject | Imbalanced Data | en |
| dc.title | 評估不平衡資料對於深度學習分類模型的影響 | zh_TW |
| dc.title | Evaluating the Impact of Imbalanced Data on Deep Learning Classification Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 薛慧敏;陳錦華 | zh_TW |
| dc.contributor.oralexamcommittee | Huey-Miin Hsueh;Jin-Hua Chen | en |
| dc.subject.keyword | 不平衡資料,SMOTE,Random Forest,LASSO,LSTM,MLP,分類預測,電信客戶流失,線上購物行為 | zh_TW |
| dc.subject.keyword | Imbalanced Data,SMOTE,Random Forest,LASSO,LSTM,MLP,Classification Prediction,Telecom Churn,Online Shopping Behavior | en |
| dc.relation.page | 107 | - |
| dc.identifier.doi | 10.6342/NTU202501637 | - |
| dc.rights.note | Authorization granted (access restricted to campus) | - |
| dc.date.accepted | 2025-07-15 | - |
| dc.contributor.author-college | 共同教育中心 | - |
| dc.contributor.author-dept | 統計碩士學位學程 | - |
| dc.date.embargo-lift | 2030-06-30 | - |
| Appears in Collections: | 統計碩士學位學程 | |
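
The abstract names SMOTE as one of the data balancing treatments. As a rough illustration of the idea only, and not the thesis's implementation (which this record does not show), the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, the parameters `k` and `seed`, and the toy points are all hypothetical, and the sampling scheme is simplified relative to the published algorithm.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class points.

    Simplified SMOTE-style interpolation (an assumption, not the thesis code):
    pick a random minority point, pick one of its k nearest minority
    neighbours, and place a new point somewhere on the segment between them.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours are all other minority points, nearest first.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority_points, n_new=5, k=2, seed=42)
for point in new_points:
    print(point)
```

Because every synthetic point lies on a segment between two existing minority points, oversampling of this kind densifies the minority region rather than duplicating records. Cleaning steps such as Tomek Link or ENN would then be applied to the resampled data; they are not sketched here.
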
Files in this item:
| File | Size | Format | |
|---|---|---|---|
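| ntu-113-2.pdf (not authorized for public access) | 3.46 MB | Adobe PDF | View/Open |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.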
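
The abstract reports Accuracy, Sensitivity, Specificity, and F1-score, and notes that imbalanced data push a model toward the majority class. A minimal sketch, using hypothetical toy labels rather than any data from the thesis, shows why Accuracy alone can hide that bias: a classifier that always predicts the majority class scores high Accuracy with zero Sensitivity. The helper names `confusion_counts` and `summarize` are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(y_true, y_pred, positive=1):
    """Compute Accuracy, Sensitivity, Specificity, and F1 from labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = safe_div(tp, tp + fn)  # recall on the minority class
    specificity = safe_div(tn, tn + fp)  # recall on the majority class
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical toy data: 90 majority cases (0) and 10 minority cases (1).
y_true = [0] * 90 + [1] * 10
# Classifier A always predicts the majority class.
always_majority = [0] * 100
# Classifier B finds 8 of the 10 minority cases at the cost of 10 false alarms.
minority_aware = [0] * 80 + [1] * 10 + [1] * 8 + [0] * 2

baseline = summarize(y_true, always_majority)
improved = summarize(y_true, minority_aware)
print("always majority:", baseline)
print("minority aware: ", improved)
```

In this toy case the second classifier has slightly lower Accuracy than the first but far higher Sensitivity and F1-score, which is the kind of trade-off the abstract's metric comparison is concerned with.
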
