Understanding and Mitigating Spurious Correlations in Text Classification

周寬; Oscar Chew Kuan

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90214

Title:	Understanding and Mitigating Spurious Correlations in Text Classification (剖析文本分類任務中的虛假關係)
Author:	周寬 (Oscar Chew Kuan)
Advisor:	林軒田 (Hsuan-Tien Lin)
Keywords:	Deep Learning, Natural Language Processing, Word Embeddings, Text Classification, Spurious Correlation
Year of publication:	2023
Degree:	Master's
Abstract:	Recent research has shown that deep learning models tend to exploit spurious correlations that exist in the training set but do not hold in general, which lets them achieve seemingly good performance. For instance, a sentiment classifier may erroneously learn that the token "performance" is associated with positive movie reviews. A classifier that relies on such spurious correlations suffers a substantial performance drop when it is deployed on out-of-distribution, real-world data. In this work, we study how models come to learn spurious correlations from a new perspective, neighborhood analysis. The analysis reveals that spurious correlations in the training set cause word embeddings that are semantically unrelated to the label to cluster with label-relevant embeddings, so the model can no longer tell the two apart. Building on this analysis, we design a metric for detecting spurious tokens and propose a family of regularization methods, NFL (doN't Forget your Language), to mitigate spurious correlations in text classification. Experiments show that NFL effectively prevents erroneous clusters and significantly improves the robustness of classifiers.
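URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90214
DOI:	10.6342/NTU202303731
Full-text authorization:	Authorized (publicly available worldwide)
Appears in department:	Department of Computer Science and Information Engineering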
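
The neighborhood analysis mentioned in the abstract can be pictured with a small sketch: look up a token's nearest neighbors in the embedding space before and after fine-tuning and see whether a label-irrelevant token has drifted toward sentiment words. This is only an illustration under stated assumptions, not the thesis's implementation: the two-dimensional vectors are made-up toy values, the helper names `cosine` and `nearest_neighbors` are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(token, embeddings, k=2):
    """Return the k tokens whose embeddings are most similar to `token`."""
    query = embeddings[token]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != token
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Toy embeddings (assumed values, for illustration only).
# Before fine-tuning, "performance" sits near another neutral film word.
pretrained = {
    "performance": [1.0, 0.0],
    "acting": [0.9, 0.1],
    "great": [0.0, 1.0],
    "wonderful": [0.1, 0.9],
}

# After fine-tuning on data with a spurious correlation, "performance"
# has drifted toward the positive-sentiment words.
finetuned = dict(pretrained, performance=[0.2, 0.9])

before = nearest_neighbors("performance", pretrained)
after = nearest_neighbors("performance", finetuned)
print("before fine-tuning:", before)
print("after fine-tuning: ", after)
```

In this toy setup the neighbor list of "performance" shifts from neutral film vocabulary to sentiment words, which is the kind of erroneous clustering the abstract describes. How the thesis turns such neighborhoods into a detection metric, and how NFL regularizes against the drift, is specified in the full text rather than here.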

Files in this item:

File	Size	Format
ntu-111-2.pdf	976.66 kB	Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
