Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79970

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 藍俊宏(Jakey Blue) | |
| dc.contributor.author | Yan-Cheng Liu | en |
| dc.contributor.author | 劉晏誠 | zh_TW |
| dc.date.accessioned | 2022-11-23T09:19:02Z | - |
| dc.date.available | 2026-08-01 | |
| dc.date.available | 2022-11-23T09:19:02Z | - |
| dc.date.copyright | 2021-08-04 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-07-25 | |
| dc.identifier.citation | Abdi, H., Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459. Ahsan, M., Mashuri, M., Kuswanto, H., Prastyo, D. D., Khusna, H. (2018). Multivariate control chart based on PCA mix for variable and attribute quality characteristics. Production & Manufacturing Research, 6(1), 364-384. Amini, P., Ahmadinia, H., Poorolajal, J., Amiri, M. M. (2016). Evaluating the high risk groups for suicide: a comparison of logistic regression, support vector machine, decision tree and artificial neural network. Iranian Journal of Public Health, 45(9), 1179. Barua, S., Islam, M. M., Yao, X., Murase, K. (2012). MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425. Batista, G. E., Prati, R. C., Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29. Beyan, C., Fisher, R. (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition, 48(5), 1653-1672. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. Brijain, M., Patel, R., Kushik, M., Rana, K. (2014). A survey on decision tree algorithm for classification. International Journal of Engineering Development and Research, 2(1), 1-5. Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 475-482). Springer, Berlin, Heidelberg. Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357. Chawla, N. V., Lazarevic, A., Hall, L. O., Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 107-119). Springer, Berlin, Heidelberg. Chen, T., Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). Cortes, C., Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. Cramer, J. S. (2002). The origins of logistic regression (Tinbergen Institute Discussion Paper; No. 2002-119/4). Tinbergen Institute. Davis, J., Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (pp. 233-240). De Maesschalck, R., Jouan-Rimbaud, D., Massart, D. L. (2000). The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1), 1-18. Denil, M., Trappenberg, T. (2010). Overlap versus imbalance. In Canadian Conference on Artificial Intelligence (pp. 220-231). Springer. Du, B., Liu, C., Zhou, W., Hou, Z., Xiong, H. (2016, August). Catch me if you can: Detecting pickpocket suspects from large-scale transit records. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 87-96). Freund, Y., Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML (Vol. 96, pp. 148-156). Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189-1232. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014). Generative adversarial nets. arXiv preprint arXiv:1406.2661. Gorsuch, R. L. (2013). Factor Analysis (2nd Ed.). Hillsdale, NJ: Erlbaum. He, H., Bai, Y., Garcia, E. A., Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE. Hinton, G. E., Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507. Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844. Hu, S., Liang, Y., Ma, L., He, Y. (2009). MSMOTE: Improving classification performance when training data is imbalanced. In 2009 2nd International Workshop on Computer Science and Engineering (Vol. 2, pp. 13-17). IEEE. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 3146-3154. Khoshgoftaar, T. M., Van Hulse, J., Napolitano, A. (2010). Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 41(3), 552-568. Kingma, D. P., Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. López, V., Fernández, A., García, S., Palade, V., Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141. Loyola-González, O., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A., García-Borroto, M. (2016). Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing, 175, 935-947. Luke, S. (2020, July 20). Credit Card Fraud Detection. Towards Data Science. https://towardsdatascience.com/credit-card-fraud-detection-9bc8db79b956 Machová, K., Puszta, M., Barčák, F., Bednár, P. (2006). A comparison of the bagging and the boosting methods using the decision trees classifiers. Computer Science and Information Systems, 3(2), 57-72. Mani, I., Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of Workshop on Learning From Imbalanced Datasets (Vol. 126). United States: ICML. Maroco, J., Silva, D., Rodrigues, A., Guerreiro, M., Santana, I., de Mendonça, A. (2011). Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests. BMC Research Notes, 4(1), 1-14. Peng, C.-Y. J., Lee, K. L., Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting. The Journal of Educational Research, 96(1), 3-14. Peterson, L. E. (2009). K-nearest neighbor. Scholarpedia, 4(2), 1883. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106. Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443-1471. Semerci, M., Cemgil, A. T., Sankur, B. (2018). An intelligent cyber security system against DDoS attacks in SIP networks. Computer Networks, 136, 137-154. Shelke, M. S., Deshmukh, P. R., Shandilya, V. K. (2017). A review on imbalanced data handling using undersampling and oversampling technique. Int J Recent Trends in Eng Res, 3, 444-449. Sun, Y., Wong, A. K., Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687-719. Sun, Y., Xu, L., Guo, L., Li, Y., Wang, Y. (2019). A Comparison Study of VAE and GAN for Software Fault Prediction. In International Conference on Algorithms and Architectures for Parallel Processing (pp. 82-96). Springer, Cham. Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6(11), 769–772. Tuba, E., Mrkela, L., Tuba, M. (2016). Support vector machine parameter tuning using firefly algorithm. In 2016 26th International Conference Radioelektronika (RADIOELEKTRONIKA) (pp. 413-418). IEEE. Tuba, E., Stanimirovic, Z. (2017). Elephant herding optimization algorithm for support vector machine parameters tuning. In 2017 9th International Conference on Electronics, Computers and Artificial Intelligence (ECAI) (pp. 1-4). IEEE. UCI Machine Learning Repository. (1995, December). Connectionist Bench (Vowel Recognition - Deterding Data) Data Set [Data set]. https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Vowel+Recognition+-+Deterding+Data) UCI Machine Learning Repository. (1989, June 1). Abalone Data Set [Data set]. https://archive.ics.uci.edu/ml/datasets/abalone. Université Libre de Bruxelles (ULB) Machine Learning Group. (2018, March 23). Credit Card Fraud Detection (Version 3) [Data set]. https://www.kaggle.com/mlg-ulb/creditcardfraud. Weiss, G. M. (1995). Learning with rare cases and small disjuncts. In Machine Learning Proceedings 1995 (pp. 558-565). Elsevier. Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3), 408-421. Xie, Y., Zhang, T. (2018). Imbalanced learning for fault diagnosis problem of rotating machinery based on generative adversarial networks. In 2018 37th Chinese Control Conference (CCC) (pp. 6017-6022). IEEE. Yi, X., Walia, E., Babyn, P. (2019). Generative adversarial network in medical imaging: A review. Medical Image Analysis, 58, 101552. Yijing, L., Haixiang, G., Xiao, L., Yanan, L., Jinling, L. (2016). Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowledge-Based Systems, 94, 88-104. Zhou, L. (2013). Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowledge-Based Systems, 41, 16-25. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79970 | - |
| dc.description.abstract | Class imbalance is a common and difficult problem across industries when classifying instances. When machine learning models are trained on such data, they fail to learn the target of interest, which is typically a small set of defective samples. Three remedies are commonly used for the classification difficulties caused by imbalanced classes: model-level learning that emphasizes misclassified samples, data resampling, and generation of synthetic minority-class samples. Each has shortcomings: resampling tends to cause overfitting or to discard informative samples; emphasizing misclassified samples yields limited improvement on extremely imbalanced data; and synthetic generation may produce samples that resemble the majority class. Given these limitations, this study proposes a generation method based on the Principal Component-based Mahalanobis Distance (PCMD), which builds a space specific to each minority-class sample before generating new minority samples. The standardized data are first reduced by PCA; then, with each minority sample as the center, the Mahalanobis distance of every observation to that center is computed, and a chi-square test filters the data to update the distribution. Finally, the shortest (or second-shortest) distance to a majority-class sample serves as a constraint on generation, so that the synthetic samples resemble the minority class while accounting for the majority class. Five classification models are then applied: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LGBM), and predictive performance is benchmarked against existing generation methods: SMOTE, ADASYN, VAE, and GAN. The results show that the proposed PCMD method outperforms the existing generation methods on every model, with LR achieving the highest recall and XGBoost giving the most stable results across datasets. Keywords: imbalanced data; principal component analysis; Mahalanobis distance; data augmentation; data resampling; machine learning; yield analytics | zh_TW |
| dc.description.provenance | Made available in DSpace on 2022-11-23T09:19:02Z (GMT). No. of bitstreams: 1 U0001-2207202123215800.pdf: 2710187 bytes, checksum: 51179f27a41f408d500f45874385595a (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | Acknowledgements i; Abstract (in Chinese) ii; Abstract iii; Chapter 1 Introduction 1; 1.1 Background 1; 1.2 Motivation and Objectives 2; 1.3 Thesis Organization 5; Chapter 2 Literature Review 6; 2.1 Analysis of the Class Imbalance Problem 8; 2.2 Data Sampling Techniques 9; 2.2.2 Undersampling 11; 2.3 Data Augmentation Models 13; 2.3.1 Variational Autoencoder (VAE) 13; 2.3.2 Generative Adversarial Network (GAN) 14; 2.4 Classification Models 16; 2.4.1 Logistic Regression (LR) 16; 2.4.2 Support Vector Machine (SVM) 16; 2.4.3 Decision Tree (DT) and Random Forest (RF) 17; 2.4.4 Gradient Boosting Tree-based Models 18; 2.5 Evaluation Metrics 20; Chapter 3 Principal Component-based Mahalanobis Distance Data Generation Technique 22; 3.1 The Principal-Component Mahalanobis Distance Method 24; 3.2 Distribution Update 25; 3.3 Data Generation and Reconstruction 27; 3.3.1 Linear Generation from the Nearest Minority Samples 27; 3.3.2 Converting the Nearest (or Nearest + Second-Nearest) Majority Samples into Minority Samples 28; 3.3.3 Multivariate Normal Data Generation 29; 3.4 Undersampling the Majority Class 31; 3.5 Comparison of PCMD with Existing Methods 32; Chapter 4 Case Studies 33; 4.1 Comparison of Imbalanced Data with Different Imbalance Ratios 34; 4.1.1 Datasets 34; 4.1.2 Comparison of Results 34; 4.2 Applying PCMD to Highly Imbalanced Data 36; 4.2.1 PCMD Parameter Settings 36; 4.2.2 Resampled Datasets 36; 4.2.3 Comparison of Results 39; 4.3 Applying PCMD to Benchmark Data 47; 4.3.1 Vowel Dataset 47; 4.3.2 Abalone Dataset 50; 4.3.3 Credit Card Fraud Dataset 52; 4.3.4 Overall Comparison 54; Chapter 5 Conclusions and Suggestions 55; References 57; Appendix 61 | |
| dc.language.iso | zh-TW | |
| dc.subject | 良率分析 | zh_TW |
| dc.subject | 不平衡資料 | zh_TW |
| dc.subject | 主成分分析 | zh_TW |
| dc.subject | 馬氏距離 | zh_TW |
| dc.subject | 資料增強 | zh_TW |
| dc.subject | 資料重抽樣 | zh_TW |
| dc.subject | 機器學習 | zh_TW |
| dc.subject | resampling | en |
| dc.subject | yield analytics | en |
| dc.subject | machine learning | en |
| dc.subject | imbalanced data | en |
| dc.subject | principal component analysis | en |
| dc.subject | Mahalanobis distance | en |
| dc.subject | data augmentation | en |
| dc.title | 發展先進資料增強技術以分析良莠比例失衡資料 | zh_TW |
| dc.title | On Development of Advanced Data Augmentation Technique for Imbalanced Data Analytics | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | Master | |
| dc.contributor.oralexamcommittee | 李尉彰(Hsin-Tsai Liu),許嘉裕(Chih-Yang Tseng) | |
| dc.subject.keyword | 不平衡資料,主成分分析,馬氏距離,資料增強,資料重抽樣,機器學習,良率分析 | zh_TW |
| dc.subject.keyword | imbalanced data,principal component analysis,Mahalanobis distance,data augmentation,resampling,machine learning,yield analytics | en |
| dc.relation.page | 62 | |
| dc.identifier.doi | 10.6342/NTU202101678 | |
| dc.rights.note | Authorized (worldwide open access) | |
| dc.date.accepted | 2021-07-26 | |
| dc.contributor.author-college | College of Engineering | zh_TW |
| dc.contributor.author-dept | Institute of Industrial Engineering | zh_TW |
| dc.date.embargo-lift | 2026-08-01 | - |
| Appears in Collections: | Institute of Industrial Engineering |
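The PCMD generation procedure summarized in the abstract (standardize, project with PCA, filter by Mahalanobis distance against a chi-square cutoff, then generate minority samples bounded by the nearest majority-class distance) can be loosely sketched as follows. This is an illustrative sketch, not the thesis' exact algorithm: the `pcmd_oversample` function, its parameters, the use of a single minority center, and the empirical-quantile stand-in for the chi-square test are all assumptions for the sake of a runnable example.

```python
import numpy as np

def pcmd_oversample(X_min, X_maj, n_new, n_components=2, alpha=0.95, seed=None):
    """Illustrative PCMD-style oversampler (a sketch, not the thesis' method)."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])

    # 1. Standardize using the pooled mean and standard deviation.
    mu, sd = X_all.mean(0), X_all.std(0) + 1e-12
    Z_min, Z_maj = (X_min - mu) / sd, (X_maj - mu) / sd

    # 2. PCA via SVD of the standardized pooled data (mean is ~0 by construction).
    _, _, Vt = np.linalg.svd(np.vstack([Z_min, Z_maj]), full_matrices=False)
    P = Vt[:n_components].T                    # projection matrix, shape (d, k)
    T_min, T_maj = Z_min @ P, Z_maj @ P

    def d2(T, c, cov_inv):
        # Squared Mahalanobis distance of each row of T to center c.
        return np.einsum('ij,jk,ik->i', T - c, cov_inv, T - c)

    c = T_min.mean(0)
    cov = np.cov(T_min.T) + 1e-6 * np.eye(n_components)
    cov_inv = np.linalg.inv(cov)

    # 3. "Distribution update": drop minority points beyond a cutoff and refit.
    #    (The thesis uses a chi-square test; an empirical quantile stands in.)
    dist = d2(T_min, c, cov_inv)
    kept = T_min[dist <= np.quantile(dist, alpha)]
    c = kept.mean(0)
    cov = np.cov(kept.T) + 1e-6 * np.eye(n_components)
    cov_inv = np.linalg.inv(cov)

    # 4. The nearest majority sample bounds how far generated points may stray.
    bound = d2(T_maj, c, cov_inv).min()

    # 5. Sample a multivariate normal fitted to the filtered minority points,
    #    rejecting candidates that cross the majority boundary.
    out = np.empty((0, n_components))
    for _ in range(200):
        cand = rng.multivariate_normal(c, cov, size=max(n_new, 64))
        out = np.vstack([out, cand[d2(cand, c, cov_inv) < bound]])
        if len(out) >= n_new:
            break
    T_new = out[:n_new]

    # 6. Map back: invert the PCA projection (onto the PC subspace), unstandardize.
    return (T_new @ P.T) * sd + mu
```

On well-separated clusters, the boundary in step 4 is large and rejection is rare; on overlapping classes, it tightens and keeps synthetic points inside minority territory, which is the intuition behind constraining generation by the majority class.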
Files in This Item:
| File | Size | Format |
|---|---|---|
| U0001-2207202123215800.pdf (available online after 2026-08-01) | 2.65 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
