Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88156

Full metadata record (each entry: DC field, value, language tag)
dc.contributor.advisor: 林智仁 [zh_TW]
dc.contributor.advisor: Chih-Jen Lin [en]
dc.contributor.author: 林育任 [zh_TW]
dc.contributor.author: Yu-Jen Lin [en]
dc.date.accessioned: 2023-08-08T16:33:01Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-08
dc.date.issued: 2023
dc.date.submitted: 2023-07-17
dc.identifier.citation:
Janez Brank, Marko Grobelnik, Natasa Milic-Frayling, and Dunja Mladenic. Training text classifiers with SVM on very few positive examples. Technical Report MSR-TR-2003-34, Microsoft Corp, 2003.
Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon-Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, Japinder Singh, and Inderjit S Dhillon. Extreme multi-label learning for semantic matching in product search. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2021.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871--1874, 2008.
Rong-En Fan and Chih-Jen Lin. A study on threshold selection for multi-label classification. Technical report, Department of Computer Science, National Taiwan University, 2007.
Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 855--864, 2016.
Haixiang Guo, Yijing Li, Jennifer Shang, Mingyun Gu, Yuanyue Huang, and Bing Gong. Learning from class-imbalanced data: Review of methods and applications. Expert Systems With Applications, 73:220--239, 2017.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263--1284, 2009.
Kalina Jasinska, Krzysztof Dembczynski, Robert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hullermeier. Extreme f-measure maximization using sparse probability estimates. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 1435--1444, 2016.
Justin M. Johnson and Taghi M. Khoshgoftaar. Deep learning and thresholding with class-imbalanced big data. In Proceedings of the 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 755--762, 2019.
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361--397, 2004.
Li-Chung Lin, Cheng-Hung Liu, Chih-Ming Chen, Kai-Chin Hsu, I-Feng Wu, Ming-Feng Tsai, and Chih-Jen Lin. On the use of unrealistic predictions in hundreds of papers evaluating graph representations. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2022.
Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), pages 225--239, 2014.
Eneldo Loza Mencía and Johannes Fürnkranz. Efficient multilabel classification algorithms for large-scale problems in the legal domain. In Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia, editors, Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of Language, pages 192--215. Springer Berlin Heidelberg, 2010.
James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1101--1111, 2018.
Shameem A. Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing f-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, volume 27, 2014.
Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 701--710, 2014.
Ignazio Pillai, Giorgio Fumera, and Fabio Roli. Threshold optimisation for multi-label classifiers. Pattern Recognition, 46(7):2055--2065, 2013.
Foster Provost. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI Workshop on Imbalanced Data Sets, pages 1--3, 2000.
Erik Schultheis, Marek Wydmuch, Rohit Babbar, and Krzysztof Dembczynski. On missing labels, long-tails and propensities in extreme multi-label classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 1547--1557, 2022.
Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), 2007.
Aixin Sun, Ee-Peng Lim, and Ying Liu. On strategies for imbalanced text classification using SVM: A comparative study. Decision Support Systems, 48(1):191--201, 2009.
Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW), pages 1067--1077, 2015.
Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pages 817--826, 2009.
Gang Wu and Edward Y. Chang. Class-boundary alignment for imbalanced dataset learning. In ICML Workshop on Learning from Imbalanced Data Sets II, pages 49--56, 2003.
Yiming Yang. A study on thresholding strategies for text categorization. In W. Bruce Croft, David J. Harper, Donald H. Kraft, and Justin Zobel, editors, Proceedings of the 24th ACM International Conference on Research and Development in Information Retrieval, pages 137--145, New Orleans, US, 2001. ACM Press, New York, US.
Hsiang-Fu Yu, Kai Zhong, Jiong Zhang, Wei-Cheng Chang, and Inderjit S. Dhillon. PECOS: Prediction for enormous and correlated output spaces. Journal of Machine Learning Research, 23(98):1--32, 2022.
Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183--3234, 2010.
Jiong Zhang, Wei-Cheng Chang, Hsiang-Fu Yu, and Inderjit S. Dhillon. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. In Advances in Neural Information Processing Systems, volume 34, pages 7267--7280, 2021.
Arkaitz Zubiaga. Enhancing navigation on Wikipedia with social tags. In Proceedings of Wikimania, 2009.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88156
dc.description.abstract: 在多標籤分類任務中,標籤出現次數間的不平衡是個常見的問題。對於出現次數稀少的標籤來說,用來產生二元預測值的預設閥值往往不是最佳的。然而,在過去的文獻中已觀察到直接透過最佳化 F 值來選取新閥值容易造成過擬合。在此篇論文中,我們解釋了為什麼藉由調整閥值來最佳化 F 值以及類似的評價指標時特別容易過擬合。接下來,我們分析了 FBR 啟發法 —— 一個既有的對於此過擬合的解法。我們為其成功之處提供了解釋,但也點出 FBR 的潛在問題。針對所發現的問題,我們提出了一個新技巧,在閥值最佳化時對 F 值做平滑化處理。我們以理論證明,如果選取了恰當的參數,平滑化可為調整後的閥值帶來良好的性質。延續平滑化的概念,我們更進一步提出同時最佳化微觀平均 F 與巨觀平均 F 的方法。其享有平滑化所帶來的好處,但是更為輕量化,不需要調整額外的超參數。我們在文字與節點分類的資料集上驗證了新的方法的有效性,其一致的超越了 FBR 啟發法。 [zh_TW]
dc.description.abstract: In multi-label classification, the imbalance between labels is often a concern. For a label that seldom occurs, the default threshold used to generate binarized predictions of that label is usually sub-optimal. However, directly tuning the threshold to optimize F-measure has been observed to overfit easily. In this work, we explain why tuning the thresholds for rare labels to optimize F-measure (and similar metrics) is particularly prone to overfitting. Then, we analyze the FBR heuristic, a previous technique proposed to address the overfitting issue. We explain its success but also point out its potential problems. We then propose a new technique based on smoothing the F-measure when tuning the threshold. We theoretically prove that, with proper parameters, smoothing results in desirable properties of the tuned threshold. Based on the idea of smoothing, we further propose jointly optimizing micro-F and macro-F as a lightweight alternative free from extra hyperparameters. Our methods are empirically evaluated on text and node classification datasets. The results show that our methods consistently outperform the FBR heuristic. [en]
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-08T16:33:01Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2023-08-08T16:33:01Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
口試委員會審定書 (Committee Certification) i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
1 Introduction 1
2 Preliminaries 4
2.1 Problem Setting 4
2.2 Evaluation Metrics 5
2.3 Previous Works 6
3 Why FBR Works and Its Issues 9
3.1 An Interesting Behavior of FBR 9
3.2 The Benefit of Lowering the Threshold 11
3.3 Issues of the FBR Heuristic 13
3.4 A New Variant of SCutFBR 14
4 Smoothed F-measure 15
4.1 Selecting a and b 16
4.2 Comparing Smoothed F-measure with the FBR Heuristic 18
4.3 Micro-F as Smoothed F-measure 21
5 Experiments and Analyses 23
5.1 Experimental Settings 23
5.2 Main Results 24
5.3 Trade-off between Micro-F and Macro-F 26
5.4 Occurrences of Figure 3.1 and Figure 3.2 27
5.5 Improvements for Labels of Different Rarity 28
5.6 Discussion on Selecting a and b 29
5.7 Effect of the Bias Term of SVM 32
6 Conclusion 34
Bibliography 35
A Proofs 39
A.1 Proof of Theorem 1 39
A.2 Proof of Corollary 2 40
A.3 Proof of Theorem 3 40
A.4 Proof of Theorem 4 41
A.5 Proof of Theorem 5 43
B Implementation Details of micromacro 44
C Details of Main Experiments 46
D Details of Auxiliary Experiments 49
D.1 Occurrences of Figure 3.2 49
D.2 Occurrences of Figure 3.1 49
dc.language.iso: en
dc.subject: 多標籤分類 [zh_TW]
dc.subject: 閾值調整 [zh_TW]
dc.subject: 支持向量機 [zh_TW]
dc.subject: 節點分類 [zh_TW]
dc.subject: 文本分類 [zh_TW]
dc.subject: 稀有標籤 [zh_TW]
dc.subject: F值 [zh_TW]
dc.subject: F-measure [en]
dc.subject: multi-label classification [en]
dc.subject: threshold adjustment [en]
dc.subject: support vector machines [en]
dc.subject: text classification [en]
dc.subject: node classification [en]
dc.subject: rare labels [en]
dc.title: 多標籤分類中對稀有標籤的閥值調整策略之討論 [zh_TW]
dc.title: On the Thresholding Strategies for Rare Labels in Multi-label Classification [en]
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 林軒田;蔡銘峰 [zh_TW]
dc.contributor.oralexamcommittee: Hsuan-Tien Lin;Ming-Feng Tsai [en]
dc.subject.keyword: 多標籤分類,F值,稀有標籤,文本分類,節點分類,支持向量機,閾值調整 [zh_TW]
dc.subject.keyword: multi-label classification,F-measure,rare labels,text classification,node classification,support vector machines,threshold adjustment [en]
dc.relation.page: 49
dc.identifier.doi: 10.6342/NTU202301612
dc.rights.note: 同意授權 (authorization granted; open access worldwide)
dc.date.accepted: 2023-07-18
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
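The abstract describes tuning a per-label decision threshold to maximize F-measure, and notes that doing so directly overfits for rare labels. The sketch below is an illustrative reconstruction of that plain (unsmoothed) tuning step, not the thesis's implementation; the function names `tune_threshold` and `f1` and the sample data are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_threshold(scores, labels):
    """Exhaustively pick the threshold on decision values `scores`
    that maximizes F1 against binary `labels` (1 = label present)."""
    best_t, best_f = 0.0, -1.0
    for t in sorted(set(scores)):  # each distinct score is a candidate cut
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f = f1(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# A rare label with a single positive example: the tuned threshold lands
# exactly on the lone positive's score and reports a perfect F1 of 1.0
# on the tuning data -- the overfitting behavior the thesis's smoothed
# F-measure is designed to temper.
t, f = tune_threshold([0.9, 0.2, 0.1, 0.05], [1, 0, 0, 0])
```

This illustrates why, with very few positives, the unregularized maximization is fragile: many thresholds tie at or near F1 = 1 on the tuning set while generalizing poorly.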
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File | Size | Format
ntu-111-2.pdf | 1.7 MB | Adobe PDF

