Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85445

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 藍俊宏(Jakey Blue) | |
| dc.contributor.author | Da-Rui Yen | en |
| dc.contributor.author | 閻大瑞 | zh_TW |
| dc.date.accessioned | 2023-03-19T23:16:42Z | - |
| dc.date.copyright | 2022-07-19 | |
| dc.date.issued | 2022 | |
| dc.date.submitted | 2022-07-15 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85445 | - |
| dc.description.abstract | In the manufacturing industry, every product must pass various quality tests before shipment to ensure it meets acceptance standards. With the rise of data science, building classification models on production history datasets to predict product quality, in place of traditional quality inspection, has become increasingly popular. For mass-produced products, however, the yield is very high, meaning defective units are vastly outnumbered by good units in the production history dataset. This extreme data imbalance sharply degrades the predictive performance of ordinary classification models. Worse still, production history data are mostly categorical or string-valued and usually must be encoded into binary variables before analysis, while most classifiers implicitly assume continuous inputs, so their performance suffers even further. Ensemble learning is one of the methods most often used to handle imbalanced datasets. In this study we improve existing ensemble learning algorithms and propose three stacking schemes (SCV-1, SCV-2, and SCV-3) and two balance cascade architectures (BC-1 and BC-2). We also establish a complete imbalanced data analysis procedure that combines oversampling and undersampling in the preprocessing stage to greatly alleviate the imbalance, applies SMBO (sequential model-based optimization) to the many hyperparameters of the ensemble classifiers, and incorporates a multi-channel architecture to reduce resampling bias and increase the diversity of the classification models. The case study analyzes production history data collected by a Taiwanese panel manufacturer; because the yield of the mature process is extremely high, the imbalance ratio reaches roughly 1:1000. The results show that, compared with a single ensemble learning model, SCV-1 and BC-2 obtain higher precision at comparable recall, while SCV-2 and SCV-3 improve recall and precision simultaneously. Higher recall means the classification model identifies more defective panels, and higher precision keeps the number of panels that must be re-inspected by aging tests as small as possible. Based on these results, the proposed modified ensemble learning architectures and imbalanced data analysis procedure should save part of the cost of panel aging tests. | zh_TW |
| dc.description.abstract | In modern manufacturing industries, products must pass quality inspection before shipping to customers. With the rise of data science, using production history data to predict product quality with classification models is expected to replace traditional quality examination. However, the yield of mature products is typically very high, so defective units are vastly outnumbered by non-defective ones. Moreover, the production history is usually recorded as categorical strings and must be encoded into binary variables before analysis. Under these twin constraints of extreme imbalance and binary inputs, conventional classifiers and regressors perform poorly. Ensemble learning is a popular approach to imbalanced data. In this thesis, we modify existing ensemble learning algorithms and propose three stacking schemes (SCV-1, SCV-2, and SCV-3) and two balance cascade frameworks (BC-1 and BC-2). We also set up a complete procedure for imbalanced data analysis: a hybrid of oversampling and undersampling in the preprocessing stage reduces the imbalance significantly; SMBO (sequential model-based optimization) tunes the hyperparameters of the ensemble classifiers; and a multi-channel architecture reduces the bias of data resampling and increases the diversity of the classification models. (Illustrative code sketches of these building blocks follow the metadata table below.) Production history data collected by a Taiwanese panel manufacturer serve as the case study. Because the product is mature and mass-produced, the imbalance ratio can easily reach 1:1000. The analytical results show that SCV-1 and BC-2 achieve higher precision than a single ensemble classifier at comparable recall, while SCV-2 and SCV-3 improve both recall and precision simultaneously. Higher recall means the model identifies more defective panels; higher precision minimizes the number of panels that must be re-inspected by aging tests. Accordingly, the proposed modified ensemble learning architectures and the complete imbalanced data analysis procedure can significantly reduce the cost of the panel aging test. | en |
| dc.description.provenance | Made available in DSpace on 2023-03-19T23:16:42Z (GMT). No. of bitstreams: 1 U0001-1506202216005300.pdf: 1468930 bytes, checksum: e16182a8efe4df35000d2f5f79264d28 (MD5) Previous issue date: 2022 | en |
| dc.description.tableofcontents | Abstract (Chinese); Abstract (English); List of Figures; List of Tables; List of Algorithms; List of Abbreviations. Chapter 1 Introduction: 1.1 The Data Imbalance Problem; 1.2 Binary Variables; 1.3 Research Objectives; 1.4 Thesis Organization. Chapter 2 Literature Review: 2.1 Data Resampling Techniques (2.1.1 Oversampling; 2.1.2 Undersampling; 2.1.3 Combining Multiple Resampling Techniques); 2.2 Algorithm-Level Improvements (2.2.1 Cost-Sensitive Learning; 2.2.2 Modified Classification Algorithms); 2.3 Ensemble Learning (2.3.1 Boosting; 2.3.2 Combining Ensemble Learning with Data Resampling; 2.3.3 Larger Ensemble Architectures); 2.4 Directions of This Study (2.4.1 Bayesian Optimization; 2.4.2 Data Resampling Variance). Chapter 3 Modified Stacking and Balance Cascade Architectures: 3.1 The Imbalanced Data Analysis Framework (3.1.1 Resampling Combinations; 3.1.2 Classification Models; 3.1.3 The Multi-Channel Architecture); 3.2 Stacking (3.2.1 SCV-1; 3.2.2 SCV-2; 3.2.3 SCV-3); 3.3 Balance Cascade Classifiers (3.3.1 BC-1; 3.3.2 BC-2). Chapter 4 Case Study: 4.1 The Production History Dataset (4.1.1 Dataset Characteristics; 4.1.2 Variable Selection; 4.1.3 Label Relabeling); 4.2 Data Analysis Results (4.2.1 Model Evaluation Criteria; 4.2.2 Existing Classification Models; 4.2.3 The Three Stacking Schemes; 4.2.4 The Two Balance Cascade Architectures); 4.3 Discussion (4.3.1 Resampling Combinations; 4.3.2 Overall Comparison of Classification Models; 4.3.3 Comparison of Stacking Schemes; 4.3.4 Balance Cascade Architectures). Chapter 5 Conclusions and Recommendations: 5.1 Conclusions; 5.2 Future Research Directions. References. Appendix A: Prediction Performance of Stacking with Different Base Learners. Appendix B: Prediction Performance of Balance Cascade with Different Numbers of Iterations. | |
| dc.language.iso | zh-TW | |
| dc.subject | 二元變數 | zh_TW |
| dc.subject | 資料重抽樣 | zh_TW |
| dc.subject | 集成學習 | zh_TW |
| dc.subject | 堆疊法 | zh_TW |
| dc.subject | 平衡級聯 | zh_TW |
| dc.subject | 資料不平衡 | zh_TW |
| dc.subject | 良率預測 | zh_TW |
| dc.subject | Yield Prediction | en |
| dc.subject | Data Imbalance | en |
| dc.subject | Binary Variable | en |
| dc.subject | Data Resampling | en |
| dc.subject | Ensemble Learning | en |
| dc.subject | Stacking | en |
| dc.subject | Balance Cascade | en |
| dc.title | 發展二元變數與不平衡資料限制下之水平與垂直堆疊式集成學習架構 | zh_TW |
| dc.title | Development of the Horizontal and Vertical Stacking Structure for the Imbalanced Data with Binary Variables | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 110-2 | |
| dc.description.degree | Master's | |
| dc.contributor.oralexamcommittee | 洪一薰 (I-Hsuan Ethan Hong), 陳家正 (Chia-Cheng Chen) | |
| dc.subject.keyword | 資料不平衡,二元變數,資料重抽樣,集成學習,堆疊法,平衡級聯,良率預測 | zh_TW |
| dc.subject.keyword | Data Imbalance, Binary Variable, Data Resampling, Ensemble Learning, Stacking, Balance Cascade, Yield Prediction | en |
| dc.relation.page | 92 | |
| dc.identifier.doi | 10.6342/NTU202200960 | |
| dc.rights.note | Authorized (open access worldwide) | |
| dc.date.accepted | 2022-07-18 | |
| dc.contributor.author-college | 共同教育中心 | zh_TW |
| dc.contributor.author-dept | 統計碩士學位學程 | zh_TW |
| dc.date.embargo-lift | 2022-07-19 | - |
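
The abstracts describe a preprocessing stage that one-hot encodes the categorical production history and then combines oversampling with undersampling to soften an imbalance of roughly 1:1000. Below is a minimal sketch of such a hybrid pipeline using scikit-learn and imbalanced-learn; the dataset, route structure, and sampling ratios are all illustrative assumptions, since the panel manufacturer's data are not public.

```python
# Sketch of the preprocessing stage: one-hot encode categorical
# production-history strings, then soften the imbalance with SMOTE
# oversampling followed by random undersampling. Synthetic data stands
# in for the (non-public) panel dataset.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
n = 50_000
# Hypothetical route history: 10 process steps, 5 candidate tools per step.
X_cat = rng.integers(0, 5, size=(n, 10)).astype(str)
y = (rng.random(n) < 0.001).astype(int)  # roughly 1:1000 defect ratio

# Binary (one-hot) variables, matching the thesis's categorical constraint.
X = OneHotEncoder(sparse_output=False, handle_unknown="ignore").fit_transform(X_cat)

# Raise the minority to 10% of the majority, then cut the majority down to
# a 1:2 ratio. Note that SMOTE interpolates, so synthetic rows take
# fractional values on the one-hot dummies; that mismatch with binary data
# is part of what motivates the thesis's modified architectures.
X_over, y_over = SMOTE(sampling_strategy=0.1, random_state=0).fit_resample(X, y)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.5, random_state=0).fit_resample(X_over, y_over)
print(np.bincount(y), "->", np.bincount(y_res))
```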
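The three SCV schemes are modified stacking architectures; their exact designs live in the thesis body, but the classic out-of-fold stacking pattern they extend can be sketched as follows (the base learners and meta-learner below are assumed choices, not the thesis's):

```python
# Classic stacking with out-of-fold meta-features, the textbook pattern
# behind the SCV-1/2/3 variants.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

bases = [RandomForestClassifier(random_state=0),
         GradientBoostingClassifier(random_state=0)]

# Level 0: out-of-fold probabilities, so the meta-learner never sees a
# prediction made on a base learner's own training folds.
meta_tr = np.column_stack([
    cross_val_predict(b, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for b in bases])
meta_te = np.column_stack([
    b.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for b in bases])

# Level 1: the meta-learner combines the base-model probabilities.
pred = LogisticRegression().fit(meta_tr, y_tr).predict(meta_te)
print("precision:", precision_score(y_te, pred, zero_division=0),
      "recall:", recall_score(y_te, pred))
```

The out-of-fold step is what keeps the meta-learner honest: training it on in-sample base predictions would overstate the base models' reliability, which matters even more when positives are this rare.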
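BC-1 and BC-2 build on the balance cascade idea: train a sequence of classifiers on balanced subsets and drop the majority samples each round's ensemble already handles, so later rounds concentrate on the hard majority cases. A simplified sketch follows (the original algorithm also adjusts decision thresholds to control the false-positive rate, which is omitted here):

```python
# Minimal balance-cascade sketch: each round trains on a balanced subset,
# then discards majority samples the current ensemble classifies correctly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, weights=[0.98], random_state=0)
rng = np.random.default_rng(0)
min_idx = np.flatnonzero(y == 1)   # fixed minority pool
maj_idx = np.flatnonzero(y == 0)   # shrinking majority pool

models = []
for _ in range(5):
    take = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, take])
    models.append(DecisionTreeClassifier(max_depth=4, random_state=0)
                  .fit(X[idx], y[idx]))
    # Keep only majority samples the ensemble still gets wrong
    # (a majority-voted label of 1 on a true-0 sample is a mistake).
    votes = np.mean([m.predict(X[maj_idx]) for m in models], axis=0)
    maj_idx = maj_idx[votes >= 0.5]
    if len(maj_idx) < len(min_idx):   # not enough hard cases left
        break

def predict(X_new):
    """Majority vote over the cascade's members."""
    return (np.mean([m.predict(X_new) for m in models], axis=0) >= 0.5).astype(int)

print(len(models), "rounds;", predict(X[:5]))
```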
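Finally, the abstracts mention tuning the ensembles' many hyperparameters with SMBO. Here is a small sketch using hyperopt's Tree-structured Parzen Estimator, one common SMBO implementation; the classifier and search space are placeholders, not the thesis's actual configuration:

```python
# SMBO sketch with hyperopt's TPE: fit a surrogate over past trials and
# pick the next hyperparameters sequentially.
from hyperopt import STATUS_OK, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=0)

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_depth=int(params["max_depth"]),
                                 random_state=0)
    # Score with F1 so both precision and recall matter under imbalance.
    f1 = cross_val_score(clf, X, y, cv=3, scoring="f1").mean()
    return {"loss": -f1, "status": STATUS_OK}  # fmin minimizes the loss

space = {"n_estimators": hp.quniform("n_estimators", 50, 300, 50),
         "max_depth": hp.quniform("max_depth", 3, 12, 1)}

best = fmin(objective, space, algo=tpe.suggest, max_evals=20)
print("best hyperparameters:", best)
```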
Appears in Collections: 統計碩士學位學程

Files in this item:
| File | Size | Format | |
|---|---|---|---|
| U0001-1506202216005300.pdf | 1.43 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
