NTU Theses and Dissertations Repository

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90214
Title: Understanding and Mitigating Spurious Correlations in Text Classification (剖析文本分類任務中的虛假關係)
Authors: Oscar Chew Kuan (周寬)
Advisor: Hsuan-Tien Lin (林軒田)
Keyword: Deep Learning, Natural Language Processing, Word Embeddings, Text Classification, Spurious Correlation
Publication Year: 2023
Degree: Master's (碩士)
Abstract: Recent research has revealed that deep learning models tend to exploit spurious correlations that exist in the training set but do not hold in general. For instance, a sentiment classifier may erroneously learn that the token "performance" is associated with positive movie reviews, and a classifier that relies on such correlations suffers a sharp drop in performance when deployed on out-of-distribution data. In this work, we examine how models acquire spurious correlations through a novel perspective called neighborhood analysis. The analysis reveals that spurious correlations cause the embeddings of words that are semantically unrelated to the label to cluster with those of label-relevant words, so that the model can no longer tell them apart. Building on this analysis, we design a metric to detect spurious tokens and propose a family of regularization methods, NFL (doN't Forget your Language), to mitigate spurious correlations in text classification. Experiments show that NFL effectively prevents erroneous clustering and significantly improves the robustness of classifiers.
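
As a rough illustration of the neighborhood analysis mentioned in the abstract, the Python sketch below inspects the nearest neighbors of a token's embedding by cosine similarity and counts how many are label-bearing sentiment words. This is a minimal sketch on toy vectors, not the thesis's actual metric or the NFL regularizer; the nearest_neighbors helper, the token list, and the synthetic embeddings are assumptions made for illustration only.

    # Illustrative neighborhood analysis: for a candidate token, list the
    # nearest neighbors of its embedding by cosine similarity and check how
    # many are label-bearing sentiment words. A label-irrelevant token whose
    # neighborhood is dominated by sentiment words hints that the classifier
    # has tied it to the label spuriously.
    import numpy as np

    def nearest_neighbors(query, embeddings, k=5):
        """Return the k tokens whose embeddings are most cosine-similar to `query`."""
        q = embeddings[query]
        q = q / np.linalg.norm(q)
        scores = {
            tok: float(vec @ q / np.linalg.norm(vec))
            for tok, vec in embeddings.items()
            if tok != query
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        base_pos = rng.normal(size=16)  # shared direction for the "positive" cluster
        embeddings = {
            # Genuinely label-relevant (positive-sentiment) tokens.
            "great": base_pos + 0.1 * rng.normal(size=16),
            "excellent": base_pos + 0.1 * rng.normal(size=16),
            "wonderful": base_pos + 0.1 * rng.normal(size=16),
            # A label-irrelevant token that has drifted into the positive cluster.
            "performance": base_pos + 0.2 * rng.normal(size=16),
            # Tokens unrelated to the label.
            "table": rng.normal(size=16),
            "window": rng.normal(size=16),
        }
        sentiment_words = {"great", "excellent", "wonderful"}
        neighbors = nearest_neighbors("performance", embeddings, k=3)
        overlap = sum(tok in sentiment_words for tok, _ in neighbors)
        print(neighbors)
        print(f"{overlap}/3 nearest neighbors are sentiment words -> possible spurious tie")

In a real setting the toy dictionary would be replaced by the fine-tuned classifier's own input embeddings (or contextual representations), and the neighbor-overlap count is just one simple way a detection score could be built; the thesis's own metric may differ.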
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90214
DOI: 10.6342/NTU202303731
Fulltext Rights: Authorized (open access worldwide)
Appears in Collections: Department of Computer Science and Information Engineering (資訊工程學系)

Files in This Item:
File: ntu-111-2.pdf
Size: 976.66 kB
Format: Adobe PDF


