NTU Theses and Dissertations Repository › College of Engineering › Institute of Industrial Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81374

Full metadata record (DC field: value, language)
dc.contributor.advisor: 藍俊宏 (Jyun-Hong Lan)
dc.contributor.author: Chung-Cheng Huang (en)
dc.contributor.author: 黃鍾承 (zh_TW)
dc.date.accessioned: 2022-11-24T03:46:20Z
dc.date.available: 2026-07-13
dc.date.available: 2022-11-24T03:46:20Z
dc.date.copyright: 2021-07-23
dc.date.issued: 2021
dc.date.submitted: 2021-07-15
dc.identifier.citation:
[1] Barandela, R., Sánchez, J. S., García, V., Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849-851.
[2] Bayle, P., Bayle, A., Janson, L., Mackey, L. (2020). Cross-validation confidence intervals for test error. arXiv preprint arXiv:2007.12671.
[3] Beyan, C., Fisher, R. (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition, 48(5), 1653-1672.
[4] Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.
[5] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
[6] Cateni, S., Colla, V., Vannucci, M. (2014). A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing, 135, 32-41.
[7] Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[8] Ghosh, S., Rana, A., Kansal, V. (2020). Evaluating the impact of sampling-based nonlinear manifold detection model on software defect prediction problem. In Smart Intelligent Computing and Applications (pp. 141-152). Springer, Singapore.
[9] Cortes, C., Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
[10] Denil, M., Trappenberg, T. (2010, May). Overlap versus imbalance. In Canadian Conference on Artificial Intelligence (pp. 220-231). Springer, Berlin, Heidelberg.
[11] Drucker, H. (1997, July). Improving regressors using boosting techniques. In ICML (Vol. 97, pp. 107-115).
[12] Estabrooks, A., Jo, T., Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 18-36.
[13] Fan, W., Stolfo, S. J., Zhang, J., Chan, P. K. (1999, June). AdaCost: misclassification cost-sensitive boosting. In ICML (Vol. 99, pp. 97-105).
[14] Finch, H. (2005). Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science, 3(1), 85-100.
[15] Freund, Y., Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
[16] Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F. (2011). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
[17] Gu, Q., Zhu, L., Cai, Z. (2009, October). Evaluation measures of the classification performance of imbalanced data sets. In International Symposium on Intelligence Computation and Applications (pp. 461-471). Springer, Berlin, Heidelberg.
[18] Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.
[19] Han, H., Wang, W. Y., Mao, B. H. (2005, August). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Springer, Berlin, Heidelberg.
[20] Hartmann, W. M. (2004, June). Dimension reduction vs. variable selection. In International Workshop on Applied Parallel Computing (pp. 931-938). Springer, Berlin, Heidelberg.
[21] Hart, P. (1968). The condensed nearest neighbor rule (Corresp.). IEEE Transactions on Information Theory, 14(3), 515-516.
[22] He, H., Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
[23] Huang, J., Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3), 299-310.
[24] Japkowicz, N. (2001, June). Concept-learning in the presence of between-class and within-class imbalances. In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 67-77). Springer, Berlin, Heidelberg.
[25] Japkowicz, N., Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429-449.
[26] Jian, C., Gao, J., Ao, Y. (2016). A new sampling method for classifying imbalanced data based on support vector machine ensemble. Neurocomputing, 193, 115-122.
[27] Joshi, M. V. (2002). Learning classifier models for predicting rare phenomena. University of Minnesota.
[28] Khan, A. A., Moyne, J. R., Tilbury, D. M. (2008). Virtual metrology and feedback control for semiconductor manufacturing processes using recursive partial least squares. Journal of Process Control, 18(10), 961-974.
[29] Khreich, W., Granger, E., Miri, A., Sabourin, R. (2010). Iterative Boolean combination of classifiers in the ROC space: An application to anomaly detection with HMMs. Pattern Recognition, 43(8), 2732-2752.
[30] Liao, T. W. (2008). Classification of weld flaws with imbalanced class data. Expert Systems with Applications, 35(3), 1041-1052.
[31] Ling, C. X., Li, C. (1998, August). Data mining for direct marketing: Problems and solutions. In KDD (Vol. 98, pp. 73-79).
[32] Liu, X. Y., Wu, J., Zhou, Z. H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
[33] Longadge, R., Dongre, S. (2013). Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707.
[34] López, V., Fernández, A., García, S., Palade, V., Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141.
[35] Ma, Y., Guo, L., Cukic, B. (2007). A statistical framework for the prediction of fault-proneness. In Advances in Machine Learning Applications in Software Engineering (pp. 237-263). IGI Global.
[36] Mani, I., Zhang, I. (2003, August). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of Workshop on Learning from Imbalanced Datasets (Vol. 126). United States: ICML.
[37] Moepya, S. O., Akhoury, S. S., Nelwamondo, F. V. (2014, December). Applying cost-sensitive classification for financial fraud detection under high class-imbalance. In 2014 IEEE International Conference on Data Mining Workshop (pp. 183-192). IEEE.
[38] Napierala, K., Stefanowski, J. (2016). Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of Intelligent Information Systems, 46(3), 563-597.
[39] Nguyen, H. M., Cooper, E. W., Kamei, K. (2011). Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), 4-21.
[40] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.
[41] Prati, R. C., Batista, G. E., Monard, M. C. (2004, April). Class imbalances versus class overlapping: an analysis of a learning system behavior. In Mexican International Conference on Artificial Intelligence (pp. 312-321). Springer, Berlin, Heidelberg.
[42] Rennie, J. D., Shih, L., Teevan, J., Karger, D. R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the 20th International Conference on Machine Learning (pp. 616-623).
[43] Cateni, S., Colla, V., Vannucci, M. (2014). A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing, 135, 32-41.
[44] Saeys, Y., Inza, I., Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507-2517.
[45] Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., Platt, J. C. (1999, December). Support vector method for novelty detection. In NIPS (Vol. 12, pp. 582-588).
[46] Smith, M. R., Martinez, T., Giraud-Carrier, C. (2014). An instance level analysis of data complexity. Machine Learning, 95(2), 225-256.
[47] Stehman, S. V. (1997). Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62(1), 77-89.
[48] Sun, Y., Wong, A. K., Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04), 687-719.
[49] Susto, G. A., Beghi, A., De Luca, C. (2011, September). A virtual metrology system for predicting CVD thickness with equipment variables and qualitative clustering. In ETFA 2011 (pp. 1-4). IEEE.
[50] Tavallaee, M., Stakhanova, N., Ghorbani, A. A. (2010). Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(5), 516-524.
[51] Tilouche, S., Bassetto, S., Nia, V. P. (2014, September). Classification algorithms for virtual metrology. In 2014 IEEE International Conference on Management of Innovation and Technology (pp. 495-499). IEEE.
[52] Weiss, G. M. (2004). Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter, 6(1), 7-19.
[53] Weiss, G. M., Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315-354.
[54] Weiss, S. M., Kapouleas, I. (1989, August). An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In IJCAI (Vol. 89, pp. 781-787).
[55] Wilson, D. R., Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1-34.
[56] Yang, Q., Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(04), 597-604.
[57] Zhou, L. (2013). Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowledge-Based Systems, 41, 16-25.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81374
dc.description.abstract: Fault detection and prediction systems are a critical analytical component of many advanced manufacturing processes; they typically determine whether a product or work-in-process is normal or abnormal by classifying or clustering the data. Traditionally, yield classification models built with machine learning algorithms can achieve very good results. However, class imbalance is a common characteristic of real-world data: in high-tech manufacturing, for example, defective products may occur at a rate of only one in a thousand, or even one in a million. Most algorithms that optimize overall classification accuracy therefore tend simply to predict every sample as a good product, attaining very high accuracy without actually learning the differences between the classes and ignoring the extremely high cost of misclassifying a defective product; such models have no practical value. Recent literature addresses this problem mainly through data augmentation or model parameter tuning. Some studies first analyze the characteristics of the data, such as the imbalance ratio, density, overlap between classes, or the presence of sub-groups of different sizes within a class, before augmenting the data; however, these augmentation methods are all built on numerical variables. This thesis extends that foundation to non-numerical variables, addressing how to augment data when, for example, all features are binary. We use the Hamming distance to compute the similarity between binary features and propose a novel oversampling method based on the interaction between the minority class and the majority data: after reducing the noise and overlap in the data, new samples are generated between minority-class instances, or placed so as to avoid confusion with the majority class. Finally, by combining undersampling with control of the class balance in the training set, this study experimentally analyzes many combinations of common models and training sets produced by different augmentation methods. The results show that models trained on oversampled training sets perform better; for extremely imbalanced data consisting entirely of categorical variables, the proposed method also shows that changes to the training set have a more significant effect on the final metrics, while the choice of model has a relatively small influence. (zh_TW)
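The record does not include the thesis's full algorithm, so the following is only a minimal illustrative sketch of the general idea the abstract describes: oversampling all-binary minority data using Hamming distance. The function names and the neighbour-voting generation rule (in the spirit of SMOTE-N, Chawla et al. [7]) are assumptions for illustration, not the author's exact method.

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary vectors: number of differing bits."""
    return int(np.sum(a != b))

def oversample_binary(X_min, n_new, k=3, rng=None):
    """Generate synthetic binary minority samples (illustrative sketch).

    For each new sample: pick a random minority instance, find its k
    Hamming-nearest minority neighbours, then set each bit by majority
    vote among the seed and its neighbours (ties broken randomly).
    """
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(n)
        dists = np.array([hamming(X_min[i], X_min[j]) for j in range(n)])
        dists[i] = d + 1                      # exclude the seed itself
        nbrs = np.argsort(dists)[:k]          # k nearest minority neighbours
        group = np.vstack([X_min[i:i + 1], X_min[nbrs]])
        votes = group.sum(axis=0)             # per-bit count of ones
        half = group.shape[0] / 2
        bits = np.where(votes > half, 1,
               np.where(votes < half, 0, rng.integers(0, 2, d)))
        new_rows.append(bits)
    return np.array(new_rows)
```

Because the synthetic bits come only from minority-class neighbourhoods, new samples land "between" minority instances in Hamming space; a fuller implementation would also screen out noisy seeds and regions that overlap the majority class, as the abstract describes.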
dc.description.provenance: Made available in DSpace on 2022-11-24T03:46:20Z (GMT). No. of bitstreams: 1. U0001-1307202101122500.pdf: 6023329 bytes, checksum: 3a280779bc7c86123fa292d9979fa9a6 (MD5). Previous issue date: 2021 (en)
dc.description.tableofcontents:
Abstract (Chinese) i
Abstract iii
Chapter 1 Introduction 1
 Section 1 Research Motivation and Objectives 1
 Section 2 Research Framework 3
Chapter 2 Literature Review 4
 Section 1 Imbalanced Data 4
  (1) Data Types and Application Domains 4
  (2) Common Problems and Challenges 5
 Section 2 Cost-Sensitive Learning and Feature Selection 8
 Section 3 Data Augmentation 9
  (1) Oversampling and Undersampling 9
  (2) Hybrid Sampling 11
 Section 4 Models and Algorithms 14
  (1) Base Classifiers 14
  (2) Ensemble Learning 14
  (3) Strengths, Weaknesses, and Evaluation Metrics 15
 Section 5 Summary of the Literature Review 17
Chapter 3 Research Methods 18
 Section 1 Pre-assessment of Data Characteristics 22
 Section 2 Data Augmentation Method 25
 Section 3 Model Evaluation Framework 30
Chapter 4 Case Study 33
 Section 1 Dataset Characteristics 36
 Section 2 Data Preprocessing 43
 Section 3 Data Modeling 50
 Section 4 Comparison of Prediction Results and Correlation Analysis 71
Chapter 5 Conclusions and Suggestions 74
 Section 1 Research Results 74
 Section 2 Future Research Directions 76
References 79
Appendix A 84
Appendix B 91
(zh_TW)
dc.language.iso: zh-TW
dc.subject: 機器學習模型 (machine learning models) (zh_TW)
dc.subject: 不平衡資料 (imbalanced data) (zh_TW)
dc.subject: 類別變數 (categorical variables) (zh_TW)
dc.subject: 資料重採樣 (data resampling) (zh_TW)
dc.subject: 資料增能擴充 (data augmentation) (zh_TW)
dc.subject: 錯誤偵測與分類 (fault detection and classification) (zh_TW)
dc.subject: oversampling (en)
dc.subject: machine learning algorithms (en)
dc.subject: fault detection and classification (en)
dc.subject: data augmentation (en)
dc.subject: imbalanced data (en)
dc.subject: categorical variables (en)
dc.subject: undersampling (en)
dc.title: 發展資料不平衡與類別變數限制下的生產良率分類模型 (zh_TW)
dc.title: On the Development of Production Yield Classification Model Under Imbalanced Data and Categorical Variable Constraints (en)
dc.date.schoolyear: 109-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 洪一薰 (Hsin-Tsai Liu), 陳正剛 (Chih-Yang Tseng)
dc.subject.keyword: 不平衡資料, 類別變數, 資料重採樣, 資料增能擴充, 錯誤偵測與分類, 機器學習模型 (zh_TW)
dc.subject.keyword: imbalanced data, categorical variables, oversampling, undersampling, data augmentation, fault detection and classification, machine learning algorithms (en)
dc.relation.page: 97
dc.identifier.doi: 10.6342/NTU202101424
dc.rights.note: Authorized (restricted to on-campus access)
dc.date.accepted: 2021-07-16
dc.contributor.author-college: 工學院 (College of Engineering) (zh_TW)
dc.contributor.author-dept: 工業工程學研究所 (Institute of Industrial Engineering) (zh_TW)
dc.date.embargo-lift: 2026-07-13
Appears in Collections: Institute of Industrial Engineering (工業工程學研究所)

Files in This Item:
File: U0001-1307202101122500.pdf (restricted access; not publicly available)
Size: 5.88 MB
Format: Adobe PDF

