NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98432

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | 徐永豐 | zh_TW
dc.contributor.advisor | Yung-Fong Hsu | en
dc.contributor.author | 何庭妤 | zh_TW
dc.contributor.author | Ting-Yu Ho | en
dc.date.accessioned | 2025-08-14T16:05:40Z
dc.date.available | 2025-08-15
dc.date.copyright | 2025-08-14
dc.date.issued | 2025
dc.date.submitted | 2025-07-30
dc.identifier.citation |
Batchelder, W. H., & Anders, R. (2012). Cultural consensus theory: Comparing different concepts of cultural truth. Journal of Mathematical Psychology, 56(5), 316–332. https://doi.org/10.1016/j.jmp.2012.06.002
Batchelder, W. H., & Romney, A. K. (1986). The statistical analysis of a general Condorcet model for dichotomous choice situations. In B. Grofman & G. Owen (Eds.), Information pooling and group decision making (pp. 103–112). JAI Press.
Batchelder, W. H., & Romney, A. K. (1988). Test theory without an answer key. Psychometrika, 53(1), 71–92. https://doi.org/10.1007/BF02294195
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/BF00058655
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Chapman and Hall/CRC. https://doi.org/10.1201/9781315139470
Chen, C., Liaw, A., & Breiman, L. (2004). Using random forest to learn imbalanced data (No. 666). University of California, Berkeley. https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Collell, G., Prelec, D., & Patil, K. R. (2018). A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data. Neurocomputing, 275, 330–340. https://doi.org/10.1016/j.neucom.2017.08.035
Dal Pozzolo, A., Caelen, O., & Bontempi, G. (2015). When is undersampling effective in unbalanced classification tasks? In A. Appice, P. P. Rodrigues, V. Santos Costa, C. Soares, J. Gama, & A. Jorge (Eds.), Machine learning and knowledge discovery in databases (pp. 200–215). Springer. https://doi.org/10.1007/978-3-319-23528-8_13
D’Andrade, R. G. (1981). The cultural part of cognition. Cognitive Science, 5(3), 179–195. https://doi.org/10.1016/S0364-0213(81)80012-2
Denwood, M. J. (2016). runjags: An R package providing interface utilities, model templates, parallel computing methods and additional distributions for MCMC models in JAGS. Journal of Statistical Software, 71(9), 1–25. https://doi.org/10.18637/jss.v071.i09
Drummond, C., & Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. Workshop on Learning from Imbalanced Datasets II, 11(1–8). http://www.eiti.uottawa.ca/~nat/Workshop2003/drummondc.pdf
Elkan, C. (2001). The foundations of cost-sensitive learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, 973–978.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. International Conference on Machine Learning, 96, 148–156.
Grofman, B., & Owen, G. (1986). Review essay: Condorcet models, avenues for further research. In B. Grofman & G. Owen (Eds.), Information pooling and group decision making (pp. 93–102). JAI Press.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284. https://doi.org/10.1109/TKDE.2008.239
Johnson, J. M., & Khoshgoftaar, T. M. (2021). Output thresholding for ensemble learners and imbalanced big data. 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI), 1449–1454. https://doi.org/10.1109/ICTAI52525.2021.00230
Karabatsos, G., & Batchelder, W. H. (2003). Markov chain estimation for test theory without an answer key. Psychometrika, 68(3), 373–389. https://doi.org/10.1007/BF02294733
Kelly, M., Longjohn, R., & Nottingham, K. (2023). The UCI machine learning repository [Data set]. University of California, Irvine. https://archive.ics.uci.edu
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. https://www.r-project.org/doc/Rnews/Rnews_2002-3.pdf
Lipton, Z. C., Elkan, C., & Naryanaswamy, B. (2014). Optimal thresholding of classifiers to maximize F1 measure. In T. Calders, F. Esposito, E. Hüllermeier, & R. Meo (Eds.), Machine learning and knowledge discovery in databases (pp. 225–239). Springer. https://doi.org/10.1007/978-3-662-44851-9_15
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. Proceedings of the 22nd International Conference on Machine Learning, 625–632. https://doi.org/10.1145/1102351.1102430
Provost, F. (2000). Machine learning from imbalanced data sets 101 [Extended abstract]. Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets. https://cdn.aaai.org/Workshops/2000/WS-00-05/WS00-05-001.pdf
Provost, F., & Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42(3), 203–231. https://doi.org/10.1023/A:1007601015854
Romney, A. K., & Weller, S. C. (1984). Predicting informant accuracy from patterns of recall among individuals. Social Networks, 6(1), 59–77. https://doi.org/10.1016/0378-8733(84)90004-2
Romney, A. K., Weller, S. C., & Batchelder, W. H. (1986). Culture as consensus: A theory of culture and informant accuracy. American Anthropologist, 88(2), 313–338. https://doi.org/10.1525/aa.1986.88.2.02a00020
Scrucca, L., Fraley, C., Murphy, T. B., & Raftery, A. E. (2023). Model-based clustering, classification, and density estimation using mclust in R. Chapman and Hall/CRC. https://doi.org/10.1201/9781003277965
Sheng, V. S., & Ling, C. X. (2006). Thresholding for making classifiers cost-sensitive. Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, 476–481. https://cdn.aaai.org/AAAI/2006/AAAI06-076.pdf
Spelmen, V. S., & Porkodi, R. (2018). A review on handling imbalanced data. 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), 1–11. https://doi.org/10.1109/ICCTCT.2018.8551020
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259. https://doi.org/10.1016/S0893-6080(05)80023-1
Wu, D. (2024). RSBID: Resampling strategies for binary imbalanced datasets [R package version 0.0.2.0000]. https://github.com/dongyuanwu/RSBID
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98432
dc.description.abstract | 在分類任務中,不平衡資料是一個常見的挑戰。當類別分布不均時,模型通常在多數類別上有較好的預測表現,卻難以正確識別少數類別,而這些少數類別往往是實務應用中更關注的類別。隨機森林(Random Forest, RF)透過多數決整合多棵決策樹的預測結果,以提升整體分類表現。然而,由於決策樹本身在不平衡資料上容易傾向多數類別,RF 的多數決機制也會延續此傾向。廣義 Condorcet 模型(General Condorcet Model, GCM)是 1980 年代中期由 Batchelder 等人提出的一種資料整合模型。相較於 RF 採用的多數決策略,GCM 進一步考慮了決策樹的回答偏誤(如 RF 預測傾向多數類別)以及能力。因此,本研究將 RF 中的多數決整合步驟替換為 GCM,期望能改善 RF 在不平衡資料上的表現。此外,從分類錯誤成本不對稱的觀點來看,閾值移動(調整模型預測機率的分類門檻)是一種直接對應該問題的方法,而我們觀察到在合理限制下的 GCM 可視為閾值移動。本研究比較了 GCM、幾種閾值移動方法(如移動至先驗機率、基於模型表現動態調整),以及主流的重新平衡方法(如合成少數類別過採樣技術,Synthetic Minority Over-sampling Technique, SMOTE,以及 Balanced Random Forest, BRF)。結果顯示,各方法在不同評估指標上展現不同優勢:移動至先驗機率雖在 G-mean 上表現最佳,但其 F1 分數表現最差;SMOTE 則呈現相反趨勢;而 GCM 和 BRF 則在 G-mean 與 F1 分數間取得較佳的平衡。其中 BRF 整體較平均,GCM 則有較高 G-mean,適合對敏感度要求較高、但不希望過度犧牲精確率的情境。 | zh_TW
dc.description.abstract | Class imbalance is a common challenge in classification tasks. When class distributions are skewed, models usually perform better on the majority class while struggling to identify the minority class, which is often of greater interest in real-world applications. Random Forest (RF), which aggregates the predictions of multiple decision trees through majority rule, aims to enhance overall classification performance. However, since individual decision trees tend to be biased toward the majority class in imbalanced datasets, RF's majority rule inherits this bias. The General Condorcet Model (GCM), developed by Batchelder and colleagues in the mid-1980s, is an information pooling model. Unlike majority rule, the GCM takes into account response bias (e.g., the tendency toward the majority class) and competence. This study therefore replaces the majority-rule step in RF with the GCM, aiming to improve RF's performance on imbalanced datasets. Moreover, from a cost-sensitive perspective, threshold-moving, which adjusts the probability cutoff used to assign classes, is a direct and intuitive approach, and we observed that under reasonable restrictions the GCM can be interpreted as a form of threshold-moving. This study compares the GCM with several threshold-moving techniques (e.g., prior-based and performance-based adjustments) and popular rebalancing methods (e.g., Synthetic Minority Over-sampling Technique, SMOTE, and Balanced Random Forest, BRF). Results indicate that these methods exhibit varying strengths across evaluation metrics. While the prior-based approach achieves the highest G-mean, it yields the worst F1 score; SMOTE shows the opposite pattern. Both GCM and BRF offer a better trade-off between G-mean and F1 score. Of the two, BRF performs more evenly across metrics, while GCM attains a higher G-mean, making it suitable for applications that require high sensitivity without overly compromising precision. | en
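The abstract's observation that a suitably restricted GCM acts as threshold-moving can be made concrete. The following is a minimal sketch, assuming the dichotomous GCM of Batchelder and Romney (1988) with the trees treated as conditionally independent informants sharing a common competence $D$ and guessing bias $g$; the thesis's exact restrictions may differ. Writing $Z_k \in \{0,1\}$ for the true class of case $k$ and $X_{ik}$ for tree $i$'s vote,

$$P(X_{ik}=1 \mid Z_k=1) = D + (1-D)g \equiv h, \qquad P(X_{ik}=1 \mid Z_k=0) = (1-D)g \equiv f.$$

If $s_k$ of $n$ trees vote for class 1 and $\pi = P(Z_k=1)$ is the prior, the posterior odds are

$$\frac{P(Z_k=1 \mid s_k)}{P(Z_k=0 \mid s_k)} = \frac{\pi}{1-\pi}\left(\frac{h}{f}\right)^{s_k}\left(\frac{1-h}{1-f}\right)^{n-s_k},$$

which exceeds 1 exactly when the vote fraction $s_k/n$ exceeds a fixed cutoff determined by $D$, $g$, and $\pi$. Majority rule is the special case with the cutoff pinned at $0.5$; the GCM in effect moves that threshold.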
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-14T16:05:40Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-08-14T16:05:40Z (GMT). No. of bitstreams: 0 | en
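As a companion to the comparison described in the abstract, here is a minimal R sketch (not the thesis code) of prior-based threshold-moving with the randomForest package (Liaw & Wiener, 2002). The data frames train and test, and the binary factor y with levels "neg" and "pos" ("pos" being the minority class), are illustrative assumptions.

    ## Prior-based threshold-moving with randomForest: a sketch, not the
    ## thesis implementation. Assumes data frames `train` and `test` with a
    ## binary factor outcome `y`, levels c("neg", "pos").
    library(randomForest)

    set.seed(1)
    fit <- randomForest(y ~ ., data = train)

    ## Fraction of trees voting "pos" for each test case
    p_pos <- predict(fit, newdata = test, type = "prob")[, "pos"]

    ## Default aggregation: majority rule, i.e., a 0.5 cutoff
    pred_majority <- factor(ifelse(p_pos > 0.5, "pos", "neg"),
                            levels = levels(train$y))

    ## Prior-based threshold-moving: cut at the training prior of "pos"
    prior_pos  <- mean(train$y == "pos")
    pred_moved <- factor(ifelse(p_pos > prior_pos, "pos", "neg"),
                         levels = levels(train$y))

    ## G-mean and F1, treating "pos" as the class of interest
    gmean_f1 <- function(truth, pred) {
      tp <- sum(pred == "pos" & truth == "pos")
      fp <- sum(pred == "pos" & truth == "neg")
      fn <- sum(pred == "neg" & truth == "pos")
      tn <- sum(pred == "neg" & truth == "neg")
      sens <- tp / (tp + fn)   # sensitivity (recall)
      spec <- tn / (tn + fp)   # specificity
      prec <- tp / (tp + fp)   # precision
      c(G_mean = sqrt(sens * spec),
        F1     = 2 * prec * sens / (prec + sens))
    }

    gmean_f1(test$y, pred_moved)

Lowering the cutoff from 0.5 to the minority-class prior trades precision for sensitivity, which matches the abstract's finding that the prior-based approach maximizes G-mean at the expense of F1.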
dc.description.tableofcontents |
口試委員會審定書 i
誌謝 ii
摘要 iii
Abstract v
目次 vii
圖次 ix
表次 x
第一章 緒論 1
第二章 文獻回顧 7
第一節 RF 以及它在不平衡資料上的預測偏向 7
第二節 GCM 的結構和估計 9
第三節 過去對於不平衡資料的處理方法 15
第三章 模擬實驗 21
第一節 資料集 21
第二節 評估指標 22
第三節 實驗設定 23
第四章 結果 26
第一節 可靠度圖(reliability plot) 26
第二節 閾值移動方法比較 28
第三節 閾值移動以及重新平衡方法比較 31
第五章 討論 35
參考文獻 38
附錄 42
dc.language.iso | zh_TW
dc.subject | 廣義 Condorcet 模型 | zh_TW
dc.subject | 類別不平衡 | zh_TW
dc.subject | 整合策略 | zh_TW
dc.subject | 閾值移動 | zh_TW
dc.subject | 隨機森林 | zh_TW
dc.subject | Random Forest | en
dc.subject | threshold-moving | en
dc.subject | aggregation strategy | en
dc.subject | General Condorcet Model | en
dc.subject | class imbalance | en
dc.title | 引入廣義 Condorcet 模型以提升隨機森林在不平衡資料上的表現 | zh_TW
dc.title | Incorporating the General Condorcet Model to Improve Random Forest Performance on Imbalanced Data | en
dc.type | Thesis
dc.date.schoolyear | 113-2
dc.description.degree | 碩士
dc.contributor.oralexamcommittee | 蔡政安;黃從仁 | zh_TW
dc.contributor.oralexamcommittee | Chen-An Tsai;Tsung-Ren Huang | en
dc.subject.keyword | 類別不平衡,廣義 Condorcet 模型,隨機森林,閾值移動,整合策略 | zh_TW
dc.subject.keyword | class imbalance, General Condorcet Model, Random Forest, threshold-moving, aggregation strategy | en
dc.relation.page | 42
dc.identifier.doi | 10.6342/NTU202502381
dc.rights.note | 同意授權(限校園內公開)
dc.date.accepted | 2025-07-31
dc.contributor.author-college | 理學院
dc.contributor.author-dept | 心理學系
dc.date.embargo-lift | 2030-07-27
Appears in Collections: 心理學系 (Department of Psychology)

Files in this item:
File | Size | Format
ntu-113-2.pdf (restricted access; not available to the public) | 2.08 MB | Adobe PDF


All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
