Efficient Training for Imbalance Data and Threshold Analysis for Self-training

Chun-Min Chang; 張峻銘

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8937

Title:	Efficient Training for Imbalance Data and Threshold Analysis for Self-training
Author:	Chun-Min Chang (張峻銘)
Advisor:	Shou-De Lin (林守德)
Keywords:	semi-supervised learning, self-training, imbalanced data, kddcup 08
Year of publication:	2009
Degree:	Master's
Abstract:	This thesis proposes two methods for classification on imbalanced data. The first is a self-training method that has a smaller parameter space than earlier self-training approaches while improving performance. For each classifier, a prediction-confidence threshold is trained separately on the labeled training data and used to identify high-confidence unlabeled examples; the union of the examples selected by all classifiers is given pseudo-labels and added to the labeled data for retraining. This reduces the time spent selecting parameters while performing about as well as more complex parameter settings. The second is an efficient training method for imbalanced data: starting from fast down-sampling, a bootstrap-like procedure brings the model close to the one obtained with up-sampling, and because less training data is used, training is faster. Experiments on the extremely imbalanced KDD Cup 2008 data show that, for self-training, the proposed parameter selection performs slightly better than previous methods, and that the efficient training method is about 1.3 times as fast as using up-sampling directly (about 0.75 times its training time) with little difference in AUC.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8937
Full-text authorization:	Authorized (open access worldwide)
Appears in department:	Graduate Institute of Networking and Multimedia
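
The per-classifier thresholding described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `pick_threshold` and `select_confident`, the classifier names `clf_a` and `clf_b`, the toy scores, and the precision-based criterion with `min_precision` are all hypothetical. The abstract says only that a confidence threshold is trained per classifier on labeled data and that the selected unlabeled examples are combined by union; it does not say how the threshold is chosen.

```python
from typing import Dict, Optional, Sequence, Set


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   min_precision: float = 0.9) -> Optional[float]:
    """Return the lowest score cutoff whose precision on labeled data
    is at least min_precision, or None if no cutoff qualifies.

    Assumption: precision is used as the selection criterion here only
    for illustration; the abstract does not name the criterion.
    """
    best = None
    for cutoff in sorted(set(scores), reverse=True):
        # Labels of the labeled examples this cutoff would accept.
        picked = [y for s, y in zip(scores, labels) if s >= cutoff]
        precision = sum(picked) / len(picked)
        if precision >= min_precision:
            best = cutoff  # remember the lowest cutoff that meets the target
    return best


def select_confident(unlabeled_scores: Dict[str, Sequence[float]],
                     thresholds: Dict[str, Optional[float]]) -> Set[int]:
    """Union of unlabeled indices that any classifier scores at or above
    its own threshold."""
    chosen: Set[int] = set()
    for name, scores in unlabeled_scores.items():
        cutoff = thresholds.get(name)
        if cutoff is None:
            continue  # this classifier has no usable threshold
        chosen |= {i for i, s in enumerate(scores) if s >= cutoff}
    return chosen


# Toy data: positive-class confidence scores from two hypothetical classifiers.
labels = [1, 1, 0, 0]
labeled_scores = {"clf_a": [0.9, 0.8, 0.4, 0.2], "clf_b": [0.7, 0.3, 0.6, 0.1]}
unlabeled_scores = {"clf_a": [0.85, 0.1, 0.5], "clf_b": [0.2, 0.75, 0.5]}

# One threshold per classifier, learned independently on the labeled data.
thresholds = {name: pick_threshold(s, labels) for name, s in labeled_scores.items()}

# Indices of unlabeled examples that would receive a pseudo-label
# (assumed positive here) before retraining on the enlarged labeled set.
pseudo_positive = select_confident(unlabeled_scores, thresholds)
```

Because each threshold is fitted independently, the search is one-dimensional per classifier instead of a joint grid over all classifiers, which is one plausible reading of the "smaller parameter space" claim.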

Files in this item:

File	Size	Format
ntu-98-1.pdf	302.96 kB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
