Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102141
| Title: | 整合幾何保留生成模型與序貫估計之處理類別不平衡資料方法 An Integrated Methodology for Class Imbalanced Data Combining Geometry-Preserving Generative Modeling and Sequential Estimation |
| Author: | 許予綸 Yu-Lun Hsu |
| Advisor: | 王彥雯 Charlotte Wang |
| Keywords: | Imbalanced Learning, Tabular Data, Generative Models, Generative Adversarial Networks, Geometry Preserving, Synthetic Data, Sequential Estimation, Statistical Association Inference, Unbiased Estimation, Controlled Estimation Error, Rare Event Classification |
| Year of publication: | 2026 |
| Degree: | Master's |
| Abstract: | Biomedical data in real-world applications frequently exhibit severe class imbalance, substantial feature heterogeneity, and limited sample sizes, all of which undermine the validity of statistical inference and the predictive performance of classification models. Existing strategies for addressing class imbalance have significant limitations. Traditional data-level approaches often generate unrepresentative synthetic samples when the minority class distribution is nonlinear, multimodal, or skewed, which distorts decision boundaries and biases estimates. Algorithm-level adjustments face challenges that depend on the method chosen. For example, weighted logistic regression partially corrects intercept bias but often underestimates the estimator's variance, resulting in inadequate confidence interval (CI) coverage and distorted inferences about feature effects. Cost-sensitive learning relies on misclassification costs that are difficult to specify, particularly when the user has no prior knowledge or lacks relevant domain knowledge. Consequently, achieving both valid statistical inference and effective identification of the minority class under severe imbalance remains a critical unresolved issue.
This study proposes a novel integrated methodology that combines Generative Adversarial Networks (GANs) from machine learning with sequential estimation from traditional statistics to address two main challenges in imbalanced data analysis: the lack of representativeness in synthetic data and the failure of inference in traditional statistical models. First, we construct a geometry-preserving generative model. This model applies K-means clustering to the majority class data to identify representative centroids that serve as geometric anchors. It then maps the original data into an inner product space to capture the relative geometric relationships between the two classes. We use a Wasserstein GAN with Spectral Normalization (WGAN-SN) to learn the distribution of the geometric affinity matrix. A numerically stable inverse mapping is then performed via truncated Singular Value Decomposition (SVD) to reconstruct high-fidelity minority samples with mixed continuous and categorical features. This generation process preserves the distributional shape, tail behavior, and multimodal structure of the minority class.
To ensure rigorous inference, we employ sequential estimation to control the width of confidence intervals at a specified confidence level, integrated with Adaptive Shrinkage Estimation (ASE) for robust feature selection. At the same time, we use a D-optimal design to dynamically select the most informative samples for fitting the logistic regression model. This strategy prevents the inclusion of low-quality synthetic data and ensures that the asymptotic statistical properties of the final estimators are preserved.
We validated the proposed method through simulation studies and two real-world datasets: the Cervical Cancer (Risk Factors) Dataset and CDC health data. The results demonstrate that, compared to weighted logistic regression, SMOTE, and standard GAN-based imbalanced learning methods, our approach offers clear advantages across several dimensions: (1) generating highly realistic samples that precisely preserve the characteristics of the original minority distribution; (2) providing unbiased regression coefficient estimates and stable feature selection; (3) yielding reliable confidence interval estimates with high coverage rates; and (4) substantially improving minority class identification.
In conclusion, this study establishes an integrated methodology that achieves both inferential validity and superior classification performance for minority classes. It offers a theoretically grounded and practically viable solution for classification studies involving rare-event detection and for association studies exploring risk factors in medicine, biomedicine, and public health. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102141 |
| DOI: | 10.6342/NTU202600366 |
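| Full-text authorization: | Authorized (campus access only) |
| Electronic full-text release date: | 2030-02-15 |
| Appears in collections: | Institute of Health Data Analytics and Statistics (健康數據拓析統計研究所) |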
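The geometric step described in the abstract (K-means anchors on the majority class, an inner-product affinity matrix, and an inverse mapping via truncated SVD) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the abstract does not specify the exact mapping, so the sketch assumes the affinity matrix is simply the inner product of each minority sample with each centroid, handles continuous features only, and omits the WGAN-SN stage that would generate new affinity rows. The function names `kmeans_centroids`, `affinity`, and `truncated_svd_inverse` are hypothetical.

```python
import numpy as np


def kmeans_centroids(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper); returns k centroids of X."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies the rows, so X itself is never modified.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every sample to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def affinity(X, anchors):
    """Assumed affinity: inner product of each sample with each anchor, shape (n, k)."""
    return X @ anchors.T


def truncated_svd_inverse(A, anchors, rank=None, rel_tol=1e-10):
    """Map affinities back to feature space with a truncated-SVD pseudoinverse.

    Solves A ~= X @ anchors.T for X. Singular values below rel_tol * s_max
    (or beyond `rank`) are dropped to keep the inversion numerically stable.
    """
    U, s, Vt = np.linalg.svd(anchors.T, full_matrices=False)  # anchors.T is (d, k)
    keep = s > rel_tol * s[0]
    if rank is not None:
        keep[rank:] = False  # singular values are sorted in descending order
    pinv = (Vt[keep].T / s[keep]) @ U[:, keep].T  # shape (k, d)
    return A @ pinv


# Round trip on toy data: majority anchors, minority affinities, reconstruction.
rng = np.random.default_rng(1)
X_majority = rng.normal(size=(200, 3))
X_minority = rng.normal(loc=2.0, size=(20, 3))

anchors = kmeans_centroids(X_majority, k=5)
A = affinity(X_minority, anchors)
X_rebuilt = truncated_svd_inverse(A, anchors)
print("max reconstruction error:", np.abs(X_rebuilt - X_minority).max())
```

Under these assumptions the round trip is exact (up to floating-point error) whenever the number of anchors is at least the feature dimension and the centroid matrix has full column rank; passing a smaller `rank` trades reconstruction accuracy for stability when the centroids are nearly collinear.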
Files in this item:
| File | Size | Format | |
|---|---|---|---|
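| ntu-114-1.pdf (not authorized for public access) | 41.71 MB | Adobe PDF | View/Open |
Unless their copyright terms are specifically stated, all items in the system are protected by copyright, with all rights reserved.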
