NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Department of Computer Science and Information Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96904
Title: 教導文字生成模型以誤導文字分類器之研究
Fast, fluent, fooling. Teaching machines to mislead text classifiers
Author: 馬安德 (Antoni Maciąg)
Advisor: 許永貞 (Jane Yung-jen Hsu)
Co-advisor: 陳縕儂 (Vivian Yun-Nung Chen)
Keywords: Machine Learning Security, Natural Language Processing, Adversarial Attacks, semantic similarity, text-to-text, sentence-level, learnable paraphrasing, textual perturbation
Publication Year: 2024
Degree: Master's
Abstract:
Adversarial Machine Learning is a field dedicated to exploring vulnerabilities in Machine Learning models and crafting data samples that humans perceive as natural but that mislead the predictions of the targeted models. Adversarial attacks are used to test a model's behavior and to improve its robustness by retraining it on the obtained adversarial samples. Such samples are usually created by introducing small perturbations to non-adversarial samples, such as real-world images or human-written text.

It has been demonstrated that, without adversarial retraining, neural models are highly susceptible to attacks. As such models see increasing use in real-life applications, it becomes increasingly important to guarantee their robustness and to prevent both failures and exploitation by malicious actors.

Textual data (Natural Language Processing, NLP) presents unique challenges for the development of adversarial attacks. Perturbations are always noticeable (unlike in image data) and often fail to achieve the goals of naturalness and semantic preservation. Changing or destroying the semantics invalidates the sample's ground-truth annotation and makes it unusable for testing or retraining the model. Not only are these goals difficult to achieve, but universal and reliable metrics for evaluating them are also lacking. In trainable models, such metrics directly influence the model's behavior.

Based on a review of the existing literature on adversarial attacks in NLP, the thesis argues that the field is unjustifiably dominated by word-level and character-level attacks, such as synonym replacement. The disadvantages of such frameworks include: rigidity, as only a narrow class of valid perturbations is allowed; non-learnability, as observations from prior successful attacks cannot be used to guide future ones; unnaturalness and ungrammaticality of the resulting samples; and computational cost, which stems from searching a vast space of possible perturbations to a piece of text.
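The computational-cost point can be illustrated with a toy calculation. The sentence length and synonym counts below are hypothetical, chosen only to show how quickly the search space of word-level attacks grows:

```python
# Illustration of the search-space growth for word-level synonym attacks.
# All numbers here are hypothetical.

def perturbation_space(num_words: int, synonyms_per_word: int) -> int:
    """Each word can stay unchanged or be swapped for one of its synonyms,
    so the number of candidate sentences is (synonyms + 1) ** num_words."""
    return (synonyms_per_word + 1) ** num_words

# A short 10-word sentence with 4 synonym candidates per word already
# yields millions of candidate perturbations to search or score.
print(perturbation_space(10, 4))  # 5**10 = 9765625
```

Even heuristic search over a space like this must score many candidates with the target model, which is what makes these attacks expensive.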

The thesis introduces the term learnable paraphrasing attacks. Such attacks involve a text-to-text model that learns the behavior of the target model and applies perturbations aimed at misleading it. The perturbations are unconstrained, as long as they preserve semantics and naturalness. The thesis posits that such attacks can overcome the limitations of word-level and character-level attacks; nonetheless, existing work on learnable paraphrasing attacks is limited.

The thesis presents experiments on learnable paraphrasing attacks, conducted to further the research effort in that area. It identifies previously unreported challenges, both specific to the method and common to textual adversarial attacks in general, and proposes solutions to them.

It is found that existing methods of evaluating semantic preservation are not robust to negations and contradictions. This vulnerability is not restricted to adversarial attacks; it affects any task in which semantic similarity must be evaluated automatically and mutually contradictory pieces of text should be considered dissimilar. Two solutions are proposed: a heuristic combining text-embedding cosine similarity with a textual-entailment model, and a second solution utilizing a Large Language Model.
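The shape of the first proposed heuristic might be sketched as follows. The embedding vectors and the contradiction probability below are stand-ins for the outputs of a real sentence encoder and NLI model, which the sketch does not implement:

```python
# Sketch of a contradiction-aware similarity heuristic: cosine similarity
# of text embeddings, penalized by an NLI model's contradiction probability.
# Embeddings and probabilities here are hypothetical stand-ins.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_score(emb_a, emb_b, contradiction_prob, penalty=1.0):
    """Cosine similarity alone scores "the movie was great" and "the movie
    was not great" as near-identical; subtracting a term driven by a
    textual-entailment model's contradiction probability corrects that."""
    return cosine(emb_a, emb_b) - penalty * contradiction_prob

# Hypothetical embeddings that are nearly identical despite a negation:
original = [0.90, 0.10, 0.30]
negated = [0.88, 0.12, 0.31]
print(semantic_score(original, negated, contradiction_prob=0.0))   # high
print(semantic_score(original, negated, contradiction_prob=0.95))  # low
```

In practice the embeddings would come from a sentence encoder and the contradiction probability from an NLI classifier; the penalty weight is a tunable assumption.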

The thesis draws attention to similarities between the tasks of adversarial attack and style transfer, and shows that the former may degenerate into the latter if semantic preservation is evaluated inappropriately.

Experiments are conducted on combining different attack evaluation metrics into a single reward that guides the model's behavior. Using the weighted harmonic mean rather than the arithmetic mean is proposed as a simple improvement that provides clearer signals to the trained model and reduces human effort in hyperparameter tuning.
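The effect of this change can be shown with a minimal sketch. The metric names, scores, and weights below are illustrative, not taken from the thesis:

```python
# Comparing arithmetic vs. weighted harmonic mean as a combined reward.
# Scores and weights are illustrative only.

def arithmetic_mean(scores, weights):
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def harmonic_mean(scores, weights, eps=1e-9):
    # The harmonic mean collapses toward zero if any component is near
    # zero, so the model cannot trade one objective away entirely.
    return sum(weights) / sum(w / max(s, eps) for w, s in zip(weights, scores))

# An attack that fools the classifier but destroys semantics:
# [attack success, semantic preservation, fluency]
scores = [1.0, 0.05, 0.9]
weights = [1.0, 1.0, 1.0]
print(arithmetic_mean(scores, weights))  # 0.65: looks deceptively good
print(harmonic_mean(scores, weights))    # ~0.14: correctly near-failing
```

The arithmetic mean lets a high attack-success score mask a collapsed semantic-preservation score; the harmonic mean does not, which is the clearer training signal the thesis argues for.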

The problem of paraphrase diversity in learnable paraphrasing attacks is identified for the first time. It is shown that existing systems do not encourage diversity, causing the attacker to collapse to a narrow range of paraphrasing methods, which reduces the attacks' usefulness for adversarial retraining. Generator-discriminator-style training is identified as a potential solution, with two necessary modifications: quality filtering of adversarial examples, and balancing adversarial data with non-adversarial data.
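The two modifications could be sketched as a data-preparation step for the retraining loop. The tuple formats, quality threshold, and mixing ratio below are assumptions for illustration only:

```python
# Sketch of the two modifications for adversarial retraining:
# quality filtering of adversarial examples and balancing with clean data.
# Data shapes, the threshold, and the ratio are illustrative assumptions.
import random

def prepare_retraining_set(adversarial, clean, quality_threshold=0.7,
                           adv_ratio=0.5, seed=0):
    """adversarial: list of (text, label, quality_score) tuples.
    clean: list of (text, label) tuples. Assumes 0 < adv_ratio <= 1.
    Keeps only high-quality adversarial examples, then mixes in clean
    data so adversarial examples make up about `adv_ratio` of the set."""
    rng = random.Random(seed)
    kept = [(t, y) for t, y, q in adversarial if q >= quality_threshold]
    n_clean = round(len(kept) * (1 - adv_ratio) / adv_ratio)
    mixed = kept + rng.sample(clean, min(n_clean, len(clean)))
    rng.shuffle(mixed)
    return mixed
```

Filtering prevents low-quality (semantics-breaking) adversarial examples from poisoning the retraining labels, and the ratio keeps the retrained model from overfitting to adversarial inputs at the expense of clean-data accuracy.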

The method developed in the thesis yields adversarial examples that are more grammatical and natural than those of previously applied methods. The attacker successfully finds and exploits vulnerabilities in the target model, and retraining with the obtained examples eliminates these vulnerabilities.

The experimental code is shared as a publicly available repository, allowing its use in future experiments and supporting reproducibility of the results.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96904
DOI: 10.6342/NTU202404720
Full-text license: Authorized (open access worldwide)
Electronic full-text release date: 2025-09-01
Appears in collections: Department of Computer Science and Information Engineering

Files in this item:
ntu-113-1.pdf (25.96 MB, Adobe PDF)


Items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
