Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8937
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 林守德(Shou-De Lin) | |
dc.contributor.author | Chun-Min Chang | en |
dc.contributor.author | 張峻銘 | zh_TW |
dc.date.accessioned | 2021-05-20T20:04:35Z | - |
dc.date.available | 2014-08-20 | |
dc.date.available | 2021-05-20T20:04:35Z | - |
dc.date.copyright | 2009-08-20 | |
dc.date.issued | 2009 | |
dc.date.submitted | 2009-08-17 | |
dc.identifier.citation | [1] Luca Didaci and Fabio Roli. Using co-training and self-training in semi-supervised multiple classifier systems. SSPR/SPR 2006, pages 522-530. [2] G. M. Weiss. Mining with rarity - problems and solutions: a unifying framework. SIGKDD Explorations, 6(1):7-19, 2004. [3] A. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145-1159, 1997. [4] F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42:203-231, 2001. [5] Rong Zhang and Alexander I. Rudnicky. A new data selection principle for semi-supervised incremental learning. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06), volume 2, pages 780-783, 2006. [6] R. C. Holte, L. E. Acker, and B. W. Porter. Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 813-818, 1989. [7] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179-186. Morgan Kaufmann, 1997. [8] R. G. Swensson. Unified measurement of observer performance in detecting and localizing target objects on images. Medical Physics, 23:1709-1725, 1996. [9] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm [10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear [11] Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen, and Shou-De Lin. Learning to improve area-under-FROC for imbalanced medical data classification using an ensemble method. SIGKDD Explorations, 10(2):43-46, December 2008. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8937 | - |
dc.description.abstract | This thesis proposes two methods for classification on imbalanced data. First, we propose a self-training scheme that uses a smaller parameter space than previous self-training methods while improving performance: on the labeled training data we tune a prediction-confidence threshold for each classifier separately, use these thresholds to identify high-confidence unlabeled data, take the union of the selected instances, assign them pseudo labels, and add them to the labeled set for retraining. This not only reduces parameter-selection time but also matches the performance of more elaborate parameter searches. Second, we propose an efficient training method for imbalanced data: starting from fast down-sampling, a bootstrap-like procedure pushes the model toward the one obtained with up-sampling; since less data is used, training speed improves. We run experiments on the extremely imbalanced KDD Cup 2008 data. The results show that in self-training our parameter-selection method performs slightly better than previous methods, and in efficiency the proposed method is 1.3 times as fast as direct up-sampling, with little difference in AUC. | zh_TW |
dc.description.abstract | We propose two methods to address classification on imbalanced data. First, we propose a self-training scheme with a smaller parameter space and better performance: using the labeled data, we train a confidence threshold for each classifier, identify high-confidence unlabeled instances with these thresholds, and assign them pseudo labels for retraining. This scheme reduces parameter-tuning time and improves performance. Second, we propose an efficient training method for imbalanced data: starting from down-sampling and applying a bootstrap-like procedure, the model approximates the one trained with up-sampling, and using less training data reduces training time. We run experiments on the KDD Cup 2008 data. The results show that our threshold-based self-training performs better, and the approximated model matches the performance of up-sampling at only 0.75 times its training time. | en |
dc.description.provenance | Made available in DSpace on 2021-05-20T20:04:35Z (GMT). No. of bitstreams: 1 ntu-98-R96944012-1.pdf: 310232 bytes, checksum: 8d14934b850fb3146f6f3664693a6094 (MD5) Previous issue date: 2009 | en |
dc.description.tableofcontents | Abstract (Chinese) ii; Abstract iii; List of Figures v; List of Tables vi; Chapter 1 1; 1.1 Background and Motivation 1; Chapter 2 7; 2.1 Semi-supervised Learning 7; 2.2 Training on Imbalanced Data 9; Chapter 3 11; 3.1 Threshold Analysis for Self-training 11; 3.2 Efficient Training for Imbalanced Data 18; Chapter 4 21; 4.1 Experimental Data 21; 4.2 Evaluation Methods 22; 4.3 Confidence Threshold Exploitation in Self-training of MCS 23; 4.4 Approximate Up-sampling from Down-sampling 26; Chapter 5 30; Bibliography 31 | |
dc.language.iso | zh-TW | |
dc.title | 對不平衡的資料有效率的訓練和自我訓練的門檻分析 | zh_TW |
dc.title | Efficient Training for Imbalance Data and Threshold Analysis for Self-training | en |
dc.type | Thesis | |
dc.date.schoolyear | 97-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 林軒田(Hsuan-Tien Lin),林智仁(Chih-Jen Lin),王傑智(Chieh-Chih Wang) | |
dc.subject.keyword | semi-supervised learning, self-training, imbalanced data | zh_TW |
dc.subject.keyword | semi-supervised learning, self-training, imbalanced data, kddcup 08 | en |
dc.relation.page | 32 | |
dc.rights.note | Authorization granted (open access worldwide) | |
dc.date.accepted | 2009-08-17 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | zh_TW |
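The abstract's first method, threshold-based self-training, can be sketched as follows. This is a hypothetical minimal illustration, not the thesis' implementation: the toy `CentroidClassifier`, the threshold rule (the smallest confidence magnitude above which the classifier makes no mistakes on the labeled data), and all names here are assumptions for demonstration only.

```python
import statistics

class CentroidClassifier:
    """Toy 1-D classifier (hypothetical stand-in): confidence is the gap
    between the distances to the negative and positive class centroids."""
    def fit(self, xs, ys):
        self.pos_c = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
        self.neg_c = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
        return self

    def confidence(self, x):
        # Positive sign favors class 1; the magnitude is how confident we are.
        return abs(x - self.neg_c) - abs(x - self.pos_c)

    def predict(self, x):
        return 1 if self.confidence(x) > 0 else 0

def pick_threshold(clf, xs, ys):
    # Per-classifier threshold tuned on labeled data alone: the smallest
    # confidence magnitude above which the classifier is always correct.
    wrong = [abs(clf.confidence(x)) for x, y in zip(xs, ys) if clf.predict(x) != y]
    return max(wrong, default=0.0)

def self_train(classifiers, lx, ly, ux, rounds=3):
    """Union the high-confidence selections of every classifier, pseudo-label
    them, add them to the labeled pool, and retrain."""
    lx, ly, ux = list(lx), list(ly), list(ux)
    for _ in range(rounds):
        selected = {}  # union over all classifiers: instance -> pseudo label
        for clf in classifiers:
            clf.fit(lx, ly)
            t = pick_threshold(clf, lx, ly)
            for x in ux:
                if abs(clf.confidence(x)) > t:  # high-confidence unlabeled data
                    selected[x] = clf.predict(x)
        if not selected:
            break
        lx += list(selected)
        ly += list(selected.values())
        ux = [x for x in ux if x not in selected]
    for clf in classifiers:
        clf.fit(lx, ly)
    return classifiers
```

For instance, `self_train([CentroidClassifier()], [0, 1, 2, 8, 9, 10], [0, 0, 0, 1, 1, 1], [0.5, 9.5])` absorbs both unlabeled points in the first round; the appeal described in the abstract is that each classifier needs only one tuned threshold rather than a joint grid search.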
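The second method, approximating the up-sampled model from cheap down-sampled training runs, might look like the following bootstrap-style sketch. The model averaging, the toy perceptron, and every name here are illustrative assumptions; the thesis trains LIBSVM/LIBLINEAR models and has its own approximation procedure.

```python
import random

def train_perceptron(xs, ys, epochs=1000, lr=0.1):
    """Cheap 1-D linear learner (perceptron); returns weights (w, b)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = 1 if w * x + b > 0 else 0
            w += lr * (y - pred) * x
            b += lr * (y - pred)
    return w, b

def approx_upsampling(pos, neg, rounds=5, seed=0):
    """Bootstrap-style approximation: repeatedly down-sample the majority
    class to the minority size, train a cheap model on each balanced set,
    and average the models."""
    rng = random.Random(seed)
    ws, bs = [], []
    for _ in range(rounds):
        sub = rng.sample(neg, len(pos))  # fast down-sampling of the majority
        xs, ys = pos + sub, [1] * len(pos) + [0] * len(sub)
        w, b = train_perceptron(xs, ys)
        ws.append(w)
        bs.append(b)
    return sum(ws) / rounds, sum(bs) / rounds  # averaged model

def predict(model, x):
    w, b = model
    return 1 if w * x + b > 0 else 0
```

The design point matching the abstract: each round trains on roughly twice the minority-class size instead of twice the majority-class size, so the total cost stays well below that of up-sampling while the averaged model sees many different majority subsets.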
Appears in Collections: | Graduate Institute of Networking and Multimedia |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-98-1.pdf | 302.96 kB | Adobe PDF | View/Open |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.