Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88156

Full metadata record (each entry: DC field, value, language tag)
dc.contributor.advisor: 林智仁 [zh_TW]
dc.contributor.advisor: Chih-Jen Lin [en]
dc.contributor.author: 林育任 [zh_TW]
dc.contributor.author: Yu-Jen Lin [en]
dc.date.accessioned: 2023-08-08T16:33:01Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-08
dc.date.issued: 2023
dc.date.submitted: 2023-07-17
dc.identifier.citation:
Janez Brank, Marko Grobelnik, Natasa Milic-Frayling, and Dunja Mladenic. Training text classifiers with SVM on very few positive examples. Technical Report MSR-TR-2003-34, Microsoft Corp, 2003.
Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon-Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, Japinder Singh, and Inderjit S Dhillon. Extreme multi-label learning for semantic matching in product search. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2021.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871--1874, 2008.
Rong-En Fan and Chih-Jen Lin. A study on threshold selection for multi-label classification. Technical report, Department of Computer Science, National Taiwan University, 2007.
Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 855--864, 2016.
Haixiang Guo, Yijing Li, Jennifer Shang, Mingyun Gu, Yuanyue Huang, and Bing Gong. Learning from class-imbalanced data: Review of methods and applications. Expert Systems With Applications, 73:220--239, 2017.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263--1284, 2009.
Kalina Jasinska, Krzysztof Dembczynski, Robert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hullermeier. Extreme f-measure maximization using sparse probability estimates. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 1435--1444, 2016.
Justin M. Johnson and Taghi M. Khoshgoftaar. Deep learning and thresholding with class-imbalanced big data. In Proceedings of the 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 755--762, 2019.
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361--397, 2004.
Li-Chung Lin, Cheng-Hung Liu, Chih-Ming Chen, Kai-Chin Hsu, I-Feng Wu, Ming-Feng Tsai, and Chih-Jen Lin. On the use of unrealistic predictions in hundreds of papers evaluating graph representations. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2022.
Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), pages 225--239, 2014.
Eneldo Loza Mencía and Johannes Fürnkranz. Efficient multilabel classification algorithms for large-scale problems in the legal domain. In Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia, editors, Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of Language, pages 192--215. Springer Berlin Heidelberg, 2010.
James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1101--1111, 2018.
Shameem A. Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing f-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, volume 27, 2014.
Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 701--710, 2014.
Ignazio Pillai, Giorgio Fumera, and Fabio Roli. Threshold optimisation for multi-label classifiers. Pattern Recognition, 46(7):2055--2065, 2013.
Foster Provost. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI Workshop on Imbalanced Data Sets, pages 1--3, 2000.
Erik Schultheis, Marek Wydmuch, Rohit Babbar, and Krzysztof Dembczynski. On missing labels, long-tails and propensities in extreme multi-label classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 1547--1557, 2022.
Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), 2007.
Aixin Sun, Ee-Peng Lim, and Ying Liu. On strategies for imbalanced text classification using SVM: A comparative study. Decision Support Systems, 48(1):191--201, 2009.
Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW), pages 1067--1077, 2015.
Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pages 817--826, 2009.
Gang Wu and Edward Y. Chang. Class-boundary alignment for imbalanced dataset learning. In ICML Workshop on Learning from Imbalanced Data Sets II, pages 49--56, 2003.
Yiming Yang. A study on thresholding strategies for text categorization. In W. Bruce Croft, David J. Harper, Donald H. Kraft, and Justin Zobel, editors, Proceedings of the 24th ACM International Conference on Research and Development in Information Retrieval, pages 137--145, New Orleans, US, 2001. ACM Press, New York, US.
Hsiang-Fu Yu, Kai Zhong, Jiong Zhang, Wei-Cheng Chang, and Inderjit S. Dhillon. PECOS: Prediction for enormous and correlated output spaces. Journal of Machine Learning Research, 23(98):1--32, 2022.
Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183--3234, 2010.
Jiong Zhang, Wei-Cheng Chang, Hsiang-Fu Yu, and Inderjit S. Dhillon. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. In Advances in Neural Information Processing Systems, volume 34, pages 7267--7280, 2021.
Arkaitz Zubiaga. Enhancing navigation on Wikipedia with social tags. In Proceedings of Wikimania, 2009.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88156
dc.description.abstract: 在多標籤分類任務中,標籤出現次數間的不平衡是個常見的問題。對於出現次數稀少的標籤來說,用來產生二元預測值的預設閥值往往不是最佳的。然而,在過去的文獻中已觀察到直接透過最佳化 F 值來選取新閥值容易造成過擬合。在此篇論文中,我們解釋了為什麼藉由調整閥值來最佳化 F 值以及類似的評價指標時特別容易過擬合。接下來,我們分析了 FBR 啟發法 —— 一個既有的對於此過擬合的解法。我們為其成功之處提供了解釋,但也點出 FBR 的潛在問題。針對所發現的問題,我們提出了一個新技巧,在閥值最佳化時對 F 值做平滑化處理。我們以理論證明,如果選取了恰當的參數,平滑化可為調整後的閥值帶來良好的性質。延續平滑化的概念,我們更進一步提出同時最佳化微觀平均 F 與巨觀平均 F 的方法。其享有平滑化所帶來的好處,但是更為輕量化,不需要調整額外的超參數。我們在文字與節點分類的資料集上驗證了新的方法的有效性,其一致的超越了 FBR 啟發法。 [zh_TW]
dc.description.abstract: In multi-label classification, the imbalance between labels is often a concern. For a label that seldom occurs, the default threshold used to generate binarized predictions of that label is usually sub-optimal. However, directly tuning the threshold to optimize F-measure has been observed to overfit easily. In this work, we explain why tuning the thresholds for rare labels to optimize F-measure (and similar metrics) is particularly prone to overfitting. Then, we analyze the FBR heuristic, a previous technique proposed to address the overfitting issue. We explain its success but also point out its potential problems. We then propose a new technique based on smoothing the F-measure when tuning the threshold. We theoretically prove that, with proper parameters, smoothing results in desirable properties of the tuned threshold. Based on the idea of smoothing, we further propose jointly optimizing micro-F and macro-F as a lightweight alternative free from extra hyperparameters. Our methods are empirically evaluated on text and node classification datasets. The results show that our methods consistently outperform the FBR heuristic. [en]
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-08T16:33:01Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2023-08-08T16:33:01Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
口試委員會審定書 (Committee Certification) i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
1 Introduction 1
2 Preliminaries 4
2.1 Problem Setting 4
2.2 Evaluation Metrics 5
2.3 Previous Works 6
3 Why FBR Works and Its Issues 9
3.1 An Interesting Behavior of FBR 9
3.2 The Benefit of Lowering the Threshold 11
3.3 Issues of the FBR Heuristic 13
3.4 A New Variant of SCutFBR 14
4 Smoothed F-measure 15
4.1 Selecting a and b 16
4.2 Comparing Smoothed F-measure with the FBR Heuristic 18
4.3 Micro-F as Smoothed F-measure 21
5 Experiments and Analyses 23
5.1 Experimental Settings 23
5.2 Main Results 24
5.3 Trade-off between Micro-F and Macro-F 26
5.4 Occurrences of Figure 3.1 and Figure 3.2 27
5.5 Improvements for Labels of Different Rarity 28
5.6 Discussion on Selecting a and b 29
5.7 Effect of the Bias Term of SVM 32
6 Conclusion 34
Bibliography 35
A Proofs 39
A.1 Proof of Theorem 1 39
A.2 Proof of Corollary 2 40
A.3 Proof of Theorem 3 40
A.4 Proof of Theorem 4 41
A.5 Proof of Theorem 5 43
B Implementation Details of micromacro 44
C Details of Main Experiments 46
D Details of Auxiliary Experiments 49
D.1 Occurrences of Figure 3.2 49
D.2 Occurrences of Figure 3.1 49
dc.language.iso: en
dc.subject: 多標籤分類 [zh_TW]
dc.subject: 閾值調整 [zh_TW]
dc.subject: 支持向量機 [zh_TW]
dc.subject: 節點分類 [zh_TW]
dc.subject: 文本分類 [zh_TW]
dc.subject: 稀有標籤 [zh_TW]
dc.subject: F值 [zh_TW]
dc.subject: F-measure [en]
dc.subject: multi-label classification [en]
dc.subject: threshold adjustment [en]
dc.subject: support vector machines [en]
dc.subject: text classification [en]
dc.subject: node classification [en]
dc.subject: rare labels [en]
dc.title: 多標籤分類中對稀有標籤的閥值調整策略之討論 [zh_TW]
dc.title: On the Thresholding Strategies for Rare Labels in Multi-label Classification [en]
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 林軒田;蔡銘峰 [zh_TW]
dc.contributor.oralexamcommittee: Hsuan-Tien Lin;Ming-Feng Tsai [en]
dc.subject.keyword: 多標籤分類,F值,稀有標籤,文本分類,節點分類,支持向量機,閾值調整 [zh_TW]
dc.subject.keyword: multi-label classification,F-measure,rare labels,text classification,node classification,support vector machines,threshold adjustment [en]
dc.relation.page: 49
dc.identifier.doi: 10.6342/NTU202301612
dc.rights.note: 同意授權 (authorization granted; open access worldwide)
dc.date.accepted: 2023-07-18
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
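The abstract describes tuning a per-label decision threshold to maximize F-measure, and notes that doing so directly overfits for rare labels. The sketch below is an illustrative reconstruction of that plain (unsmoothed) tuning step, not the thesis's implementation; the function names `tune_threshold` and `f1` and the sample data are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_threshold(scores, labels):
    """Exhaustively pick the threshold on decision values `scores`
    that maximizes F1 against binary `labels` (1 = label present)."""
    best_t, best_f = 0.0, -1.0
    for t in sorted(set(scores)):  # each distinct score is a candidate cut
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f = f1(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# A rare label with a single positive example: the tuned threshold lands
# exactly on the lone positive's score and reports a perfect F1 of 1.0
# on the tuning data -- the overfitting behavior the thesis's smoothed
# F-measure is designed to temper.
t, f = tune_threshold([0.9, 0.2, 0.1, 0.05], [1, 0, 0, 0])
```

This illustrates why, with very few positives, the unregularized maximization is fragile: many thresholds tie at or near F1 = 1 on the tuning set while generalizing poorly.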
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File | Size | Format
ntu-111-2.pdf | 1.7 MB | Adobe PDF

