Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99898

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 王彥雯 | zh_TW |
| dc.contributor.advisor | Charlotte Wang | en |
| dc.contributor.author | 黃郁庭 | zh_TW |
| dc.contributor.author | Yu-Ting Huang | en |
| dc.date.accessioned | 2025-09-19T16:14:06Z | - |
| dc.date.available | 2025-09-20 | - |
| dc.date.copyright | 2025-09-19 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-31 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99898 | - |
| dc.description.abstract | 在實際應用中分析不平衡資料(imbalanced data)始終是一項重大挑戰,尤其是在資料取得困難或關注罕見事件發生的生醫及公共衛生領域,更加劇了模型訓練與結果預測的困難及挑戰。傳統方法如過採樣(oversampling)與欠採樣(undersampling),常無法準確捕捉資料的分布特性,導致分類效能依情況而異。在某些研究議題上,如:嚴重工安事件的發生、疾病診斷與基因研究等,類別不平衡問題尤為嚴重,迫切需要更有效的技術來因應。
本研究提出一種混成式框架,透過新穎的生成式人工智慧(generative artificial intelligence)方法與加權羅吉斯迴歸(weighted logistic regression)來解決不平衡資料的分類問題。在提出的生成式人工智慧模型中,結合貝氏神經網路(Bayesian Neural Networks, BNNs)作為變分自編碼器(Variational Autoencoder, VAE)中的編碼器與解碼器架構,用以生成少數類別的樣本,作為擴充原始資料集,以達到資料中類別平衡的可能性。隨後再考量生成樣本的代表性與訓練集樣本對分類決策邊界(decision boundary)的重要性,透過樣本重要性進行加權建構加權羅吉斯迴歸(weighted logistic regression)來完成不平衡資料分類的任務,以提升在生醫與公共衛生實務研究上不平衡資料的分類與預測效能。 本研究分別於模擬資料與實際公共衛生資料進行實證分析。模擬實驗涵蓋多種不平衡結構與資料複雜度,實際資料則涵蓋病毒感染、子宮頸癌與地震災害等應用場域。結果顯示,傳統過採樣方法(如 SMOTE)在結構簡單的情境中表現尚可,但在分布偏態、共變異特徵明顯、類別與連續變數混合的複雜設定,或於真實資料中,其成效明顯受限。相較之下,本研究所提出之 BNNVAE 模型結合樣本加權機制,尤其是在搭配 realism-aware 權重策略後,不僅能維持整體準確率,更能顯著提升少數類別的識別能力,於 F1 分數(F1 Score)、幾何平均數(G-mean)與平衡準確率(Balanced Accuracy)表現最為穩定且優異。綜合而言,本研究方法展現高度應用潛力,適合推廣至多種場景。 本研究運用貝氏神經網路模型以更有效地估計資料分布,進而減少對大量真實訓練資料的依賴,以解決部分實務應用情境資料量較少的問題。與傳統方法相比,本模型能生成更具代表性的合成資料,並透過樣本加權以建構具可解釋性的分類模型,以增加生醫與公共衛生研究領域的應用價值與應用潛力。 | zh_TW |
| dc.description.abstract | Analyzing imbalanced data remains a significant challenge in practical applications, especially in biomedical and public health domains where data collection is difficult or rare events are of primary interest. These conditions make it harder to train models and obtain reliable predictions. Traditional approaches such as oversampling and undersampling often fail to capture the underlying data distribution accurately, leading to inconsistent classification performance. In some research areas, such as severe industrial accidents, disease diagnosis, and genomic studies, class imbalance is especially severe and calls for more effective solutions.
This study proposes a hybrid framework that integrates a novel generative artificial intelligence approach with weighted logistic regression to address the classification of imbalanced data. Specifically, we introduce a generative model that uses Bayesian Neural Networks (BNNs) as the encoder and decoder of a Variational Autoencoder (VAE), enabling the generation of minority-class samples to augment the original dataset and promote class balance. We then fit a subject-weighted logistic regression model in which samples are assigned weights based on their representativeness and proximity to the classification decision boundary, thereby improving classification and prediction performance in biomedical and public health research.
We conduct empirical evaluations on both simulated and real-world public health datasets. The simulation experiments cover various imbalance structures and levels of data complexity, while the real datasets cover viral infection, cervical cancer, and earthquake hazard applications. Results show that traditional oversampling methods such as SMOTE perform adequately in simple settings but are limited under more complex conditions, such as skewed distributions, strong feature covariance, and mixed categorical and continuous variables, as well as on real-world data. In contrast, the proposed BNNVAE model, particularly when combined with a realism-aware sample weighting strategy, maintains overall accuracy while significantly improving the identification of minority-class instances. It achieves the most stable and strongest F1 score, geometric mean (G-mean), and balanced accuracy across all settings. Overall, the proposed method demonstrates high applicability and generalizability across diverse scenarios.
By leveraging Bayesian neural networks, our model estimates data distributions more effectively and reduces reliance on large amounts of real training data, which addresses the limited sample sizes common in real-world applications. Compared with traditional techniques, the proposed framework generates more representative synthetic data and, through sample reweighting, yields interpretable classification models, enhancing its practical value and applicability in biomedical and public health research. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-19T16:14:06Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-09-19T16:14:06Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Oral Examination Committee Certification i Acknowledgements iii Chinese Abstract v ABSTRACT vii CONTENTS ix LIST OF FIGURES xiii LIST OF TABLES xvii Chapter 1 Introduction 1 1.1 Imbalanced Classification 1 1.2 Generative Artificial Intelligence 4 1.2.1 Background of Generative Approaches 4 1.2.2 Generative Methods for Tabular Data 5 1.3 Motivation and Aim 7 Chapter 2 Method 11 2.1 Variational Autoencoder (VAE) 11 2.2 Bayesian Neural Network (BNN) 12 2.3 Proposed Method: Bayesian Neural Network-based Variational Autoencoder (BNNVAE) 15 2.4 Proposed Method: Weighted Logistic Regression 21 Chapter 3 Simulation 27 3.1 Simulation 1: Two-Dimensional Simple Case 27 3.1.1 Dataset Description and Experimental Setting 27 3.1.2 Evaluation of Synthetic Data Quality 29 3.1.3 Classification Results and Comparisons 36 3.2 Simulation 2: Multivariate Normal Distribution 42 3.2.1 Dataset Description and Experimental Setting 42 3.2.2 Evaluation of Synthetic Data Quality 44 3.2.3 Classification Results and Comparisons 49 3.3 Simulation 3: Gamma Distribution 54 3.3.1 Dataset Description and Experimental Setting 54 3.3.2 Evaluation of Synthetic Data Quality 56 3.3.3 Classification Results and Comparisons 61 3.4 Simulation 4: Hybrid Distribution 66 3.4.1 Dataset Description and Experimental Setting 66 3.4.2 Evaluation of Synthetic Data Quality 69 3.4.3 Classification Results and Comparisons 76 Chapter 4 Real Application 83 4.1 Real-World Virus Data 83 4.1.1 Dataset Description and Experimental Setting 83 4.1.2 Evaluation of Synthetic Data Quality 84 4.1.3 Classification Results and Comparisons 87 4.2 Real Biomedical Data 90 4.2.1 Dataset Description and Experimental Setting 90 4.2.2 Evaluation of Synthetic Data Quality 91 4.2.3 Classification Results and Comparisons 95 4.3 Real Hazard Data 98 4.3.1 Dataset Description and Experimental Setting 98 4.3.2 Evaluation of Synthetic Data Quality 99 4.3.3 Classification Results and Comparisons 103 Chapter 5 Discussion 107 Chapter 6 Conclusion 111 REFERENCE 113 APPENDIX 121 | - |
| dc.language.iso | en | - |
| dc.subject | 貝氏神經網路 | zh_TW |
| dc.subject | 生成式人工智慧 | zh_TW |
| dc.subject | 不平衡資料 | zh_TW |
| dc.subject | 合成資料生成 | zh_TW |
| dc.subject | 變分自編碼器 | zh_TW |
| dc.subject | 加權邏輯斯迴歸 | zh_TW |
| dc.subject | Imbalanced Data | en |
| dc.subject | Bayesian Neural Networks | en |
| dc.subject | Weighted Logistic Regression | en |
| dc.subject | Variational Autoencoder | en |
| dc.subject | Synthetic Data Generation | en |
| dc.subject | Generative Artificial Intelligence | en |
| dc.title | 不平衡分類的新穎混成框架:整合貝氏神經網路–變分自編碼器生成模型與樣本加權邏輯斯迴歸 | zh_TW |
| dc.title | A Novel Hybrid Framework for Imbalanced Classification: Integrating a Bayesian Neural Network-Variational Autoencoder Generative Model and Subject-Weighted Logistic Regression | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 陳佩君;范怡琴;杜裕康 | zh_TW |
| dc.contributor.oralexamcommittee | PEI-CHUN CHEN;YI-CHIN FAN;YU-KANG TU | en |
| dc.subject.keyword | 貝氏神經網路,生成式人工智慧,不平衡資料,合成資料生成,變分自編碼器,加權邏輯斯迴歸 | zh_TW |
| dc.subject.keyword | Bayesian Neural Networks, Generative Artificial Intelligence, Imbalanced Data, Synthetic Data Generation, Variational Autoencoder, Weighted Logistic Regression | en |
| dc.relation.page | 147 | - |
| dc.identifier.doi | 10.6342/NTU202503128 | - |
| dc.rights.note | Authorization granted (open access worldwide) | - |
| dc.date.accepted | 2025-08-01 | - |
| dc.contributor.author-college | College of Public Health | - |
| dc.contributor.author-dept | Institute of Health Data Analytics and Statistics | - |
| dc.date.embargo-lift | 2030-07-31 | - |
| Appears in collections: | Institute of Health Data Analytics and Statistics | |
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (publicly available online after 2030-07-31) | 13.21 MB | Adobe PDF | |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
