Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91867
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 藍俊宏 | zh_TW |
dc.contributor.advisor | Jakey Blue | en |
dc.contributor.author | 吳亭緯 | zh_TW |
dc.contributor.author | Ting-Wei Wu | en |
dc.date.accessioned | 2024-02-26T16:11:04Z | - |
dc.date.available | 2024-02-27 | - |
dc.date.copyright | 2024-02-22 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-01-30 | - |
dc.identifier.citation | Barua, S., Islam, M. M., Yao, X., & Murase, K. (2012). MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.
Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.
Batuwita, R., & Palade, V. (2010). Efficient resampling methods for training support vector machines with imbalanced datasets. The 2010 International Joint Conference on Neural Networks (IJCNN).
Boateng, E. Y., & Abaye, D. A. (2019). A review of the logistic regression model with emphasis on medical research. Journal of Data Analysis and Information Processing, 7(4), 190-207.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2012). DBSMOTE: density-based synthetic minority over-sampling technique. Applied Intelligence, 36, 664-684.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
Dong, X., Yu, Z., Cao, W., Shi, Y., & Ma, Q. (2020). A survey on ensemble learning. Frontiers of Computer Science, 14, 241-258.
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. International Conference on Intelligent Computing.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
Kaur, H., Pannu, H. S., & Malhi, A. K. (2019). A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys, 52(4), Article 79.
Last, F., Douzas, G., & Bacao, F. (2017). Oversampling for imbalanced learning based on K-Means and SMOTE. arXiv:1711.00837. Retrieved November 01, 2017, from https://ui.adsabs.harvard.edu/abs/2017arXiv171100837L
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.
Ling, C. X., & Sheng, V. S. (2008). Cost-sensitive learning and the class imbalance problem. Encyclopedia of Machine Learning, 2011, 231-235.
Liu, Y., An, A., & Huang, X. (2006). Boosting prediction accuracy on imbalanced datasets with SVM ensembles. Advances in Knowledge Discovery and Data Mining: 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12, 2006, Proceedings 10.
Mukherjee, M., & Khushi, M. (2021). SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for nominal and continuous features. Applied System Innovation, 4(1), 18.
Musa, A. B. (2014). Logistic regression classification for uncertain data. Research Journal of Mathematical and Statistical Sciences, ISSN 2320-6047.
Oommen, T., Baise, L. G., & Vogel, R. M. (2011). Sampling bias and class imbalance in maximum-likelihood logistic regression. Mathematical Geosciences, 43, 99-120.
Peng, C.-Y. J., Lee, K. L., & Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting. The Journal of Educational Research, 3-14.
Rezvani, S., & Wang, X. (2023). A broad review on class imbalance learning techniques. Applied Soft Computing, 110415.
Salunkhe, U. R., & Mali, S. N. (2016). Classifier ensemble design for imbalanced data classification: a hybrid approach. Procedia Computer Science, 85, 725-732.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04), 687-719.
Tomek, I. (1976). Two modifications of CNN.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3), 408-421.
Zhang, P., Jia, Y., & Shang, Y. (2022). Research and application of XGBoost in imbalanced data. International Journal of Distributed Sensor Networks, 18(6), 15501329221106935.
黃鍾承. (2021). 發展資料不平衡與類別變數限制下的生產良率分類模型. 國立臺灣大學.
劉晏誠. (2021). 發展先進資料增強技術以分析良莠比例失衡資料. 國立臺灣大學. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91867 | - |
dc.description.abstract | 不平衡資料集在製程良率分析、金融交易、醫療影像分析等多種領域的資料處理中構成了顯著的挑戰,特別是當面對含有連續和類別變數的混合型資料時,傳統的資料增強技術常常無法充分解決類別間的不平衡問題,導致各類預測模型的偏差。為了解決這一問題,本研究提出了一種創新的資料增強框架,該框架結合了SMOTE(Synthetic Minority Over-sampling Technique)或PCMD(Principal Component Mahalanobis Distance)方法對連續變數進行上採樣,再搭配連續變數和類別變數採取多種不同方法建立模型,藉此探究少數類別中連續變數和類別變數的關聯,以模型預測的方式達成類別變數的上採樣,完整地生成了具連續與類別型態變數的輸入資料,進而提升模型的預測能力。
本研究首先證明了在製程良率資料集上,與傳統方法相比,提出的方法能夠有效提高對少數類別的識別率並改善模型的整體性能。本方法不僅在數量上平衡了類別分布,同時也在品質上保證了增強資料與原始資料之間的相關性和一致性。實驗結果顯示,該方法在模型泛化能力和預測能力上均有顯著提升。 進一步地,本框架的應用不侷限於製程良率分析,其可擴展性使其在醫學、金融等其它需要處理非平衡混合型資料的場景中同樣具有潛力。本研究為處理包含連續與類別型態變數的不平衡資料集提供了一種新的途徑,展現了在各種不同應用情境下的潛在價值和廣泛適用性。 | zh_TW |
dc.description.abstract | Imbalanced datasets pose a significant challenge in data processing across fields such as process yield analysis, financial transactions, and medical image analysis, especially for mixed data containing both continuous and categorical variables. Traditional data augmentation techniques often fail to adequately address the imbalance between classes, biasing downstream predictive models. To address this issue, this study introduces a data augmentation framework that applies SMOTE (Synthetic Minority Over-sampling Technique) or PCMD (Principal Component Mahalanobis Distance) to oversample the continuous variables, and then builds models, using several alternative strategies, that relate the continuous and categorical variables within the minority class. Oversampling of the categorical variables is thus achieved through model prediction, so that complete synthetic inputs containing both continuous and categorical variables are generated, enhancing the predictive capability of downstream models. | en |
This research first demonstrates on process yield datasets that, compared to traditional methods, the proposed approach improves the identification rate of minority classes and the overall model performance. The method not only balances the class distribution quantitatively but also preserves the relevance and consistency between the augmented and the original data. Experimental results show notable gains in model generalization and predictive capability. Furthermore, the framework is not limited to process yield analysis; it extends naturally to other scenarios that require handling imbalanced mixed-type data, such as medicine and finance. This study therefore provides a new avenue for dealing with imbalanced datasets containing both continuous and categorical variables, and demonstrates potential value and wide applicability across application scenarios. | en |
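The two-stage augmentation described in the abstract can be sketched minimally: SMOTE-style interpolation generates continuous features for new minority samples, and a predictive step then assigns categorical features to them. This is an illustrative sketch only, not the thesis implementation: PCMD is omitted, a simple 1-NN rule stands in for the thesis's model-based categorical prediction, and all data and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_continuous(X_min, n_new, k=3, rng=rng):
    """Generate n_new synthetic minority samples by interpolating each
    randomly chosen base sample toward one of its k nearest neighbors."""
    n = len(X_min)
    # pairwise distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]                # k nearest neighbors per sample
    base = rng.integers(0, n, size=n_new)            # random base sample
    nbr = nn[base, rng.integers(0, k, size=n_new)]   # random neighbor of each base
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1]
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

def predict_categorical(X_orig, y_cat, X_syn):
    """Assign a categorical value to each synthetic row via 1-NN on the
    original minority data -- a stand-in for model-based prediction."""
    d = np.linalg.norm(X_syn[:, None, :] - X_orig[None, :, :], axis=-1)
    return y_cat[np.argmin(d, axis=1)]

# toy minority class: two continuous features plus one binary categorical feature
X_min = np.array([[0.0, 0.0], [1.0, 0.2], [0.2, 1.0], [1.1, 1.0]])
y_cat = np.array([0, 0, 1, 1])

X_syn = smote_continuous(X_min, n_new=6)             # stage 1: continuous part
c_syn = predict_categorical(X_min, y_cat, X_syn)     # stage 2: categorical part
print(X_syn.shape, c_syn.shape)
```

In the thesis the second stage is a trained classifier (with one-hot encoding variants compared in Chapter 3); the sketch only shows where that prediction plugs into the pipeline.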
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-02-26T16:11:03Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-02-26T16:11:04Z (GMT). No. of bitstreams: 0 | en |
dc.description.provenance | Item withdrawn by admin ntu (admin@lib.ntu.edu.tw) on 2024-05-16T02:09:12Z Item was in collections: 奈米工程與科學學位學程 (ID: 28fd8ceb-c29f-48d8-8a1a-ba562e188f44) No. of bitstreams: 1 ntu-112-1.pdf: 2658305 bytes, checksum: 0f7af38abee45a72719be84d02152f7f (MD5) | en |
dc.description.provenance | Item reinstated by admin ntu (admin@lib.ntu.edu.tw) on 2024-05-16T02:10:24Z Item was in collections: 奈米工程與科學學位學程 (ID: 28fd8ceb-c29f-48d8-8a1a-ba562e188f44) No. of bitstreams: 1 ntu-112-1.pdf: 2658305 bytes, checksum: 0f7af38abee45a72719be84d02152f7f (MD5) | en |
dc.description.tableofcontents | Acknowledgements i
Chinese Abstract ii
Abstract iii
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
1.1 Problem Background 1
1.2 Research Motivation and Objectives 3
1.3 Thesis Organization 4
Chapter 2 Literature Review 5
2.1 Data-Level Methods 5
2.1.1 Oversampling 5
2.1.2 Undersampling 8
2.1.3 Oversampling + Undersampling 8
2.2 Algorithm-Level Methods 9
2.2.1 Cost-Sensitive Learning 9
2.2.2 Ensemble Learning 9
2.3 Hybrid Methods 10
2.4 Classification Models 11
2.5 Model Evaluation Metrics 13
2.6 Summary of Literature Review 15
Chapter 3 Research Methods 17
3.1 Method Architecture 19
3.2 Oversampling Methods for Generating Categorical Variables 25
3.2.1 Integrated prediction of categorical variables followed by one-hot encoding 25
3.2.2 Prediction of categorical variables followed by one-hot encoding 26
3.2.3 One-hot encoding of categorical variables followed by prediction 28
3.2.4 Integrated one-hot encoding of categorical variables followed by prediction 33
3.2.5 Summary table of oversampling methods 35
Chapter 4 Case Studies 36
4.1 Case 1: Chemical Process Yield Analysis 36
4.1.1 Dataset Description and Characteristics 36
4.1.2 Data Modeling and Prediction 38
4.1.3 SMOTE + Categorical-Variable Prediction: Model Evaluation and Comparison 40
4.1.4 PCMD + Categorical-Variable Prediction: Model Evaluation and Comparison 60
4.1.5 Overall Model Evaluation and Discussion 79
4.2 Case 2: Bank Term Deposit Prediction 97
4.2.1 Dataset Description and Characteristics 97
4.2.2 Data Modeling and Prediction 98
4.2.3 SMOTE + Categorical-Variable Prediction: Model Evaluation and Comparison 99
4.2.4 PCMD + Categorical-Variable Prediction: Model Evaluation and Comparison 106
4.2.5 Overall Model Evaluation and Discussion 110
Chapter 5 Conclusions and Suggestions 114
5.1 Summary of Research Results 114
5.2 Future Research Directions 117
References 118 | - |
dc.language.iso | zh_TW | - |
dc.title | 發展連續混類別變數之資料增強技術與分析框架 - 以製程良率分析為例 | zh_TW |
dc.title | On the Development of Data Augmentation Techniques and Analytical Framework for Continuous & Categorical Type Variables - A Case Study on Process Yield Analysis | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | 碩士 | - |
dc.contributor.oralexamcommittee | 許嘉裕;高鈺婷 | zh_TW |
dc.contributor.oralexamcommittee | Chia-Yu Hsu;Yu-Ting Kao | en |
dc.subject.keyword | 不平衡資料,類別變數,製程良率分析,資料增強技術,資料上採樣,機器學習, | zh_TW |
dc.subject.keyword | Imbalanced Data,Categorical Variable,Yield Analysis,Data Augmentation,Oversampling,Machine Learning, | en |
dc.relation.page | 119 | - |
dc.identifier.doi | 10.6342/NTU202400362 | - |
dc.rights.note | 同意授權(限校園內公開) | - |
dc.date.accepted | 2024-02-01 | - |
dc.contributor.author-college | 重點科技研究學院 | - |
dc.contributor.author-dept | 奈米工程與科學學位學程 | - |
dc.date.embargo-lift | 2029-01-30 | - |
Appears in Collections: | 奈米工程與科學學位學程 |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf Currently not authorized for public access | 2.6 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.