Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91867
Title: | 發展連續混類別變數之資料增強技術與分析框架 - 以製程良率分析為例 On the Development of Data Augmentation Techniques and Analytical Framework for Continuous & Categorical Type Variables - A Case Study on Process Yield Analysis |
Authors: | 吳亭緯 Ting-Wei Wu |
Advisor: | 藍俊宏 Jakey Blue |
Keyword: | Imbalanced Data, Categorical Variable, Yield Analysis, Data Augmentation, Oversampling, Machine Learning |
Publication Year: | 2024 |
Degree: | Master's |
Abstract: | Imbalanced datasets pose a significant challenge in fields such as process yield analysis, financial transactions, and medical image analysis. The challenge is sharper for mixed data containing both continuous and categorical variables, where traditional data augmentation techniques often fail to correct the class imbalance and leave predictive models biased. To address this, the study proposes a data augmentation framework that first oversamples the continuous variables with SMOTE (Synthetic Minority Over-sampling Technique) or PCMD (Principal Component Mahalanobis Distance). It then fits models, using several different methods, on the continuous and categorical variables to capture how the two are related within the minority class. The categorical variables are oversampled by model prediction, so the generated input data contain both continuous and categorical variables, which in turn improves the predictive capability of the models.
The study first demonstrates on process yield datasets that, compared with traditional methods, the proposed approach raises the identification rate of the minority class and improves overall model performance. The method balances the class distribution in quantity and also preserves, in quality, the relevance and consistency between the augmented data and the original data. Experimental results indicate a notable improvement in both the generalization and the predictive capability of the models.
The framework is not limited to process yield analysis. Its extensibility makes it potentially useful in other settings that involve imbalanced mixed data, such as medicine and finance. The study thus offers a new approach to imbalanced datasets containing both continuous and categorical variables. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91867 |
DOI: | 10.6342/NTU202400362 |
Fulltext Rights: | Authorization granted (access restricted to campus) |
Appears in Collections: | Degree Program in Nanoengineering and Nanoscience (奈米工程與科學學位學程) |
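The two-stage idea in the abstract (oversample the continuous variables, then assign categorical values by model prediction) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes plain SMOTE-style interpolation for the continuous part and a k-nearest-neighbour majority vote as a stand-in for the predictive model, and it does not cover PCMD. All function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def nearest(points, query, k, skip=None):
    """Indices of the k points closest to query (Euclidean), ignoring index skip."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: math.dist(points[i], query),
    )
    return order[:k]


def smote_continuous(cont, n_new, k, rng):
    """SMOTE-style step: interpolate between a minority row and one of its
    k nearest minority neighbours. Needs at least two minority rows."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(cont))
        j = rng.choice(nearest(cont, cont[i], k, skip=i))
        gap = rng.random()  # position along the segment from row i to row j
        synthetic.append([a + gap * (b - a) for a, b in zip(cont[i], cont[j])])
    return synthetic


def predict_categorical(cont, cat, query, k):
    """Assumed stand-in model: majority vote over the k nearest original rows."""
    votes = Counter(cat[i] for i in nearest(cont, query, k))
    return votes.most_common(1)[0][0]


def augment_minority(cont, cat, n_new, k=2, seed=0):
    """Generate n_new synthetic minority rows with continuous and categorical parts."""
    rng = random.Random(seed)
    new_cont = smote_continuous(cont, n_new, k, rng)
    new_cat = [predict_categorical(cont, cat, row, k) for row in new_cont]
    return new_cont, new_cat


# Toy minority-class data: two continuous features and one categorical feature.
cont = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [5.0, 6.0], [5.1, 6.2], [4.9, 5.8]]
cat = ["tool_A", "tool_A", "tool_A", "tool_B", "tool_B", "tool_B"]
new_cont, new_cat = augment_minority(cont, cat, n_new=4)
```

In this sketch each synthetic row receives a categorical value consistent with the original minority rows closest to it, which is one simple way to keep the continuous and categorical parts related. With several categorical variables, the prediction step would be repeated per variable or replaced by a stronger classifier.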
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf Restricted Access | 2.6 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.