Incorporating the General Condorcet Model to Improve Random Forest Performance on Imbalanced Data

Ting-Yu Ho (何庭妤)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98432

Title:	Incorporating the General Condorcet Model to Improve Random Forest Performance on Imbalanced Data
Author:	Ting-Yu Ho (何庭妤)
Advisor:	Yung-Fong Hsu (徐永豐)
Keywords:	class imbalance, General Condorcet Model, Random Forest, threshold-moving, aggregation strategy
Publication year:	2025
Degree:	Master's
Abstract:	Class imbalance is a common challenge in classification tasks. When class distributions are skewed, models usually perform better on the majority class while struggling to identify the minority class, which is often the class of greater interest in real-world applications. Random Forest (RF) aggregates the predictions of multiple decision trees by majority rule to improve overall classification performance. However, because individual decision trees tend to be biased toward the majority class on imbalanced data, the majority rule of RF inherits this bias. The General Condorcet Model (GCM), proposed by Batchelder et al. in the mid-1980s, is an information pooling model. Compared with majority rule, the GCM additionally takes into account each decision tree's response bias (e.g., the tendency to predict the majority class) and competency. This study therefore replaces the majority-rule aggregation step in RF with the GCM, aiming to improve RF's performance on imbalanced data. Moreover, from the perspective of asymmetric misclassification costs, threshold-moving (adjusting the classification threshold applied to the model's predicted probabilities) is a direct response to the problem. We observed that, under reasonable restrictions, the GCM can be interpreted as a form of threshold-moving.
This study compares the GCM with several threshold-moving techniques (e.g., prior-based and performance-based adjustments) and popular rebalancing methods (e.g., Synthetic Minority Over-sampling Technique, SMOTE, and Balanced Random Forest, BRF). Results indicate that these methods exhibit varying strengths across different evaluation metrics. While the prior-based approach achieves the highest G-mean, it yields the worst F1 score; SMOTE shows the opposite pattern. Both GCM and BRF offer a better trade-off between G-mean and F1 score. Among them, BRF performs more evenly across metrics, while GCM has a higher G-mean. Thus, GCM may be suitable for applications that require high sensitivity without overly compromising precision.
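
As a rough illustration of the ideas named in the abstract, the sketch below contrasts majority rule with prior-based threshold-moving on per-sample tree votes and scores both with G-mean and F1. It is a minimal sketch, not the thesis's implementation: the function names, the toy votes, and the labels are hypothetical, and the GCM itself is not implemented here. G-mean is taken as the square root of sensitivity times specificity, and F1 as the harmonic mean of precision and sensitivity.

```python
from math import sqrt

def vote_fraction(tree_votes):
    """Fraction of trees voting for the minority (positive) class."""
    return sum(tree_votes) / len(tree_votes)

def classify(fractions, threshold):
    """Predict 1 when the minority vote fraction reaches the threshold."""
    return [1 if f >= threshold else 0 for f in fractions]

def g_mean_and_f1(y_true, y_pred):
    """Return (G-mean, F1) for binary labels with 1 as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = sqrt(sensitivity * specificity)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return g_mean, f1

# Hypothetical toy data: 10 samples, 2 of them in the minority class,
# each scored by 5 trees (1 = vote for the minority class).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
votes = [
    [1, 1, 1, 0, 0],  # minority sample, 3 of 5 trees agree
    [1, 1, 0, 0, 0],  # minority sample, only 2 of 5 trees agree
    [1, 1, 0, 0, 0],  # majority sample that attracts 2 minority votes
] + [[0, 0, 0, 0, 0] for _ in range(7)]

fractions = [vote_fraction(v) for v in votes]
prior = sum(y_true) / len(y_true)  # minority-class prior used as threshold

pred_majority = classify(fractions, 0.5)   # majority rule
pred_prior = classify(fractions, prior)    # threshold moved to the prior

scores_majority = g_mean_and_f1(y_true, pred_majority)
scores_prior = g_mean_and_f1(y_true, pred_prior)
print("majority rule (G-mean, F1):", scores_majority)
print("prior threshold (G-mean, F1):", scores_prior)
```

Lowering the threshold recovers the second minority sample at the cost of one false positive, which is the sensitivity versus precision trade-off the abstract describes. The toy numbers only illustrate the mechanism and say nothing about the results reported in the thesis.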
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98432
DOI:	10.6342/NTU202502381
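Full-text authorization:	Authorized (access restricted to campus)
Electronic full-text release date:	2030-07-27
Appears in collections:	Department of Psychology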

Files in this item:

File	Size	Format
ntu-113-2.pdf (not authorized for public access)	2.08 MB	Adobe PDF


Unless a copyright license is specifically indicated, all items in this system are protected by copyright, with all rights reserved.
