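A Novel Hybrid Framework for Imbalanced Classification: Integrating a Bayesian Neural Network-Variational Autoencoder Generative Model and Subject-Weighted Logistic Regression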

黃郁庭; Yu-Ting Huang

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99898

Title:	A Novel Hybrid Framework for Imbalanced Classification: Integrating a Bayesian Neural Network-Variational Autoencoder Generative Model and Subject-Weighted Logistic Regression
Author:	黃郁庭 Yu-Ting Huang
Advisor:	王彥雯 Charlotte Wang
Keywords:	Bayesian Neural Networks, Generative Artificial Intelligence, Imbalanced Data, Synthetic Data Generation, Variational Autoencoder, Weighted Logistic Regression
Publication Year:	2025
Degree:	Master's
Abstract:	Analyzing imbalanced data remains a significant challenge in practical applications, especially in biomedical and public health domains, where data are difficult to collect or rare events are of primary interest. These conditions make model training and reliable prediction harder. Traditional approaches such as oversampling and undersampling often fail to capture the underlying data distribution, so classification performance varies from case to case. In some research areas, such as severe industrial accidents, disease diagnosis, and genomic studies, class imbalance is especially severe and calls for more effective techniques. This study proposes a hybrid framework that combines a novel generative artificial intelligence approach with weighted logistic regression to address the classification of imbalanced data.
Specifically, the generative model uses Bayesian Neural Networks (BNNs) as the encoder and decoder of a Variational Autoencoder (VAE) to generate minority-class samples that augment the original dataset and move it toward class balance. A subject-weighted logistic regression model is then fitted, with weights that reflect the representativeness of the generated samples and the importance of the training samples to the classification decision boundary, with the aim of improving classification and prediction performance in biomedical and public health research. The framework is evaluated on both simulated and real-world public health datasets. The simulation experiments cover a range of imbalance structures and levels of data complexity, while the real datasets include use cases such as viral infections, cervical cancer, and earthquake-related health impacts. Results show that traditional oversampling methods such as SMOTE perform adequately in structurally simple settings but are clearly limited under more complex conditions (skewed distributions, strong feature covariance, and mixed categorical and continuous variables) and on real-world data. In contrast, the proposed BNNVAE model with sample weighting, particularly when combined with a realism-aware weighting strategy, maintains overall accuracy while significantly improving the identification of minority-class instances. It achieves the most stable and highest F1 score, geometric mean (G-mean), and balanced accuracy across all settings.
Overall, the proposed method shows high application potential and is suitable for extension to a variety of scenarios. By using Bayesian neural networks to estimate the data distribution more effectively, the model reduces reliance on large amounts of real training data, whose scarcity is a common limitation in real-world applications. Compared with traditional techniques, the framework can generate more representative synthetic data and, through sample weighting, build interpretable classification models, which increases its practical value and applicability in biomedical and public health research.
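The second stage described in the abstract (fit a logistic regression in which generated samples carry their own weights, then score it with F1, G-mean, and balanced accuracy) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the toy Gaussian data, the jitter-based stand-in for the BNN-VAE generator, the fixed weight of 0.5 for synthetic samples (the realism-aware weights are not specified in the abstract), and the helper names fit_weighted_logistic, predict, imbalance_metrics, and draw.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_weighted_logistic(X, y, w, lr=0.5, epochs=300):
    """Batch gradient descent on the weighted negative log-likelihood."""
    beta = [0.0] * (len(X[0]) + 1)  # beta[0] is the intercept
    total_w = sum(w)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi, wi in zip(X, y, w):
            p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
            err = wi * (p - yi)  # each sample's gradient is scaled by its weight
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b - lr * g / total_w for b, g in zip(beta, grad)]
    return beta


def predict(beta, X, threshold=0.5):
    return [
        int(sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi))) >= threshold)
        for xi in X
    ]


def imbalance_metrics(y_true, y_pred):
    """F1, G-mean, and balanced accuracy, with class 1 as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
    }


def draw(n, mean, rng):
    # Toy 2-D Gaussian data with unit variance (illustrative only).
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]


rng = random.Random(0)
majority = draw(200, 0.0, rng)  # class 0
minority = draw(10, 2.5, rng)   # class 1, real samples

# Placeholder generator, NOT the BNN-VAE: jitter randomly chosen real minority points.
n_synth = len(majority) - len(minority)
synthetic = [[v + rng.gauss(0.0, 0.3) for v in rng.choice(minority)] for _ in range(n_synth)]

X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + n_synth)
# Hypothetical weights: real samples count fully, synthetic samples count half.
w = [1.0] * (len(majority) + len(minority)) + [0.5] * n_synth

beta = fit_weighted_logistic(X, y, w)

# Evaluate on fresh real data only, keeping the original imbalance.
test_X = draw(200, 0.0, rng) + draw(20, 2.5, rng)
test_y = [0] * 200 + [1] * 20
print(imbalance_metrics(test_y, predict(beta, test_X)))
```

Down-weighting synthetic samples is one simple way to keep generated points from dominating the fit. A realism-aware scheme would presumably replace the constant 0.5 with a per-sample score, but how the thesis computes that score is not stated in this record.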
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99898
DOI:	10.6342/NTU202503128
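Full-Text Authorization:	Authorized (open to the public worldwide)
Electronic Full-Text Release Date:	2030-07-31
Appears in Collections:	Institute of Health Data Analytics and Statistics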

Files in This Item:

File	Size	Format
ntu-113-2.pdf (available online after 2030-07-31)	13.21 MB	Adobe PDF

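Unless their copyright terms are specifically indicated otherwise, all items in this system are protected by copyright, with all rights reserved.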
