NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96904
Full metadata record
DC Field  Value  Language
dc.contributor.advisor  許永貞  zh_TW
dc.contributor.advisor  Jane Yung-jen Hsu  en
dc.contributor.author  馬安德  zh_TW
dc.contributor.author  Antoni Maciąg  en
dc.date.accessioned  2025-02-24T16:29:20Z  -
dc.date.available  2025-09-01  -
dc.date.copyright  2025-02-24  -
dc.date.issued  2024  -
dc.date.submitted  2025-01-15  -
dc.identifier.uri  http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96904  -
dc.description.abstract  (zh_TW)
Please note that, as the author of this work is not a native speaker of Mandarin Chinese, this version of the abstract is based on an automatic translation of the English original.

對抗性機器學習是一個研究機器學習模型漏洞,並生成被人類認為自然,但能誤導目標模型預測的數據樣本的領域。對抗性攻擊被用於測試模型,以觀察其行為,並通過在生成的對抗性樣本上重新訓練模型來提升其強健性。這類樣本通常是通過對非對抗性樣本(例如真實世界的圖像或人類撰寫的文本)引入微小擾動而生成的。

研究顯示,若未進行對抗性重新訓練,模型對攻擊極為敏感。隨著此類模型在現實應用中的使用日益增加,確保其強健性並防止失敗及遭到惡意行為者利用也變得越來越重要。

文本數據(自然語言處理,NLP)在開發對抗性攻擊方面帶來了獨特的挑戰。擾動往往是顯而易見的(與圖像數據不同),且常常無法實現自然性和語義保留的目標。改變或破壞語義會使樣本的真實標註失效,從而使其無法用於測試或重新訓練模型。這些目標不僅難以實現,還缺乏通用且可靠的評估指標。在可訓練模型中,這些指標對模型行為有直接影響。

根據對現有關於自然語言處理中對抗性攻擊文獻的綜合分析,本論文主張該領域被詞級和字元級攻擊(例如同義詞替換)所主導是不合理的。此類框架的缺點包括:剛性,僅允許有限類別的有效擾動;不可學習性,無法利用先前成功攻擊的觀察來指導未來的攻擊;生成樣本的不自然性和不合語法性;以及因需要在廣大的文本擾動空間中進行搜索而導致的資源消耗。

本論文提出了可學習改寫攻擊(learnable paraphrasing attacks)的概念。此類攻擊使用一種文本到文本的模型,該模型能學習目標模型的行為,並施加針對性擾動以誤導目標模型。這些擾動是沒有限制的,只要能保留語義和自然性即可。本論文主張,此類攻擊能克服詞級和字元級攻擊的局限性。然而,目前關於可學習改寫攻擊的研究仍然有限。

本論文展示了針對可學習改寫攻擊的實驗,旨在推動該領域的研究進展。實驗中發現了一些此前未被提及的與該方法及文本對抗性攻擊普遍相關的挑戰,並提出了應對這些挑戰的解決方案。

研究發現,現有的評估語義保留的方法對否定和矛盾並不具備強健性。這一漏洞不僅限於對抗性攻擊,還影響到任何需要自動評估語義相似性的任務,其中若比較的文本片段彼此矛盾,則會被認為是不相似的。為此,本論文提出了兩種解決方案:一種是結合詞向量的餘弦相似度和文字蘊涵模型的啟發式方法;另一種則利用大型語言模型來實現。

本論文指出對抗性攻擊與風格遷移任務之間的相似性,並顯示若語義保留的評估方法不當,對抗性攻擊會退化成風格遷移。

本論文實驗將不同攻擊評估指標結合成單一獎勵訊號以引導模型行為。提出使用加權調和平均數代替算術平均數,作為一種簡單的改進方法,旨在為訓練模型提供更清晰的信號,並減少在超參數調整中的人力成本。

可學習改寫攻擊中的改寫多樣性問題首次被識別出來。研究表明,現有系統未能促進多樣性,導致攻擊者僅使用狹窄範圍的改寫方法,從而降低了對抗性重新訓練的效用。生成器-分類器風格的訓練被認為是一種潛在解決方案。此類訓練需要進行兩項必要的修改:對對抗性樣本進行質量篩選,以及平衡對抗性數據與非對抗性數據。

所開發的方法相比以往應用的方法,能生成語法性和自然性較高的對抗性樣本。攻擊者成功發現並利用了目標模型的漏洞。在使用這些樣本進行重新訓練後,這些漏洞被消除。

實驗程式碼已作為公開的專案分享,便於未來的實驗使用和結果的可重現性。
dc.description.abstract  (en)
Adversarial Machine Learning is a field dedicated to exploring vulnerabilities in Machine Learning models and to crafting data samples that are perceived as natural by humans while misleading the predictions of the targeted models. Adversarial attacks are used to probe a model's behavior and to improve its robustness by retraining on the obtained adversarial samples. Such samples are usually created by introducing small perturbations to non-adversarial samples, such as images of the real world or human-written text.

It has been demonstrated that without adversarial retraining, neural models are highly susceptible to attacks. With the increasing usage of such models in real-life applications, it is also increasingly important to guarantee their robustness, and prevent failures as well as exploitation by malicious actors.

Textual data (the domain of Natural Language Processing, NLP) presents unique challenges for the development of Adversarial Attacks. Perturbations are always noticeable (unlike in image data) and often fail to achieve the goals of naturality and semantic preservation. Changing or destroying the semantics invalidates the sample's ground-truth annotation and makes it unusable for testing or retraining the model. Not only are these goals challenging to achieve, but universal and reliable metrics for their evaluation are also lacking. When the attack itself is a trainable model, such metrics directly shape its behavior, since they act as the training signal.

Based on a review of the existing literature on Adversarial Attacks in NLP, the thesis argues that the field is unjustifiably dominated by word-level and character-level attacks, such as synonym replacement. The disadvantages of such frameworks include: rigidity, meaning that only a narrow class of valid perturbations is allowed; non-learnability, meaning the inability to use observations from prior successful attacks to guide future attacks; unnaturality and ungrammaticality of the resulting samples; and computational cost, resulting from searching a vast space of possible perturbations of a piece of text.

The thesis introduces the term learnable paraphrasing attacks. Such attacks involve a text-to-text model that learns the behavior of the target model and applies perturbations aimed at misleading it. The perturbations are unconstrained, as long as they preserve semantics and naturality. The thesis posits that such attacks can overcome the limitations of word-level and character-level attacks. Nonetheless, existing work on learnable paraphrasing attacks is limited.
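
To make the formulation concrete, the following minimal sketch (in Python, using the Hugging Face transformers pipeline API) shows a single attack step: a seq2seq paraphraser proposes a rewrite and succeeds when the victim classifier's prediction flips. Both model names are illustrative placeholders, not the thesis's actual components; a real attack would also fold semantic-preservation and naturality scores into the training reward.

    from transformers import pipeline

    # Placeholder models, NOT the ones used in the thesis: any seq2seq
    # paraphraser and any text classifier can play these roles.
    paraphraser = pipeline("text2text-generation", model="t5-base")
    victim = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

    def attack_step(original: str, true_label: str) -> bool:
        # The attacker proposes a rewrite of the input...
        candidate = paraphraser("paraphrase: " + original)[0]["generated_text"]
        # ...and succeeds if the victim's prediction flips away from the
        # ground-truth label of the original sample.
        return victim(candidate)[0]["label"] != true_label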

The thesis presents experiments on learnable paraphrasing attacks, conducted to further the research effort in that area. It identifies previously unmentioned challenges related to the method, and to textual adversarial attacks in general, and proposes solutions to those challenges.

It is found that existing methods of evaluating semantic preservation are not robust to negations and contradictions. This vulnerability is not restricted to adversarial attacks, but rather affects any task in which semantic similarity needs to be automatically evaluated and the compared pieces of text are considered dissimilar if they contradict each other. Two solutions are proposed: a heuristic involving text embedding cosine similarity and an entailment model, and another solution utilizing a Large Language Model.
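
As an illustration of the first solution, the sketch below combines sentence-embedding cosine similarity with a natural language inference (NLI) check, so that a contradiction (for example, an inserted negation) zeroes the score even when the embeddings stay close. The checkpoints are common public models chosen for the example, not necessarily those used in the thesis, whose full heuristic additionally includes a length score (Section 7.7.2.5).

    from sentence_transformers import SentenceTransformer, util
    from transformers import pipeline

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
    nli = pipeline("text-classification", model="roberta-large-mnli")

    def semantic_preservation(original: str, paraphrase: str) -> float:
        cosine = util.cos_sim(embedder.encode(original),
                              embedder.encode(paraphrase)).item()
        # RoBERTa-MNLI pair input format: premise </s></s> hypothesis.
        verdict = nli(f"{original} </s></s> {paraphrase}")[0]["label"]
        # A contradiction overrides a high embedding similarity.
        return 0.0 if verdict == "CONTRADICTION" else max(cosine, 0.0)

    # "The movie was great." vs "The movie was not great." scores near zero
    # here, while plain cosine similarity rates the pair as highly similar.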

The thesis draws attention to similarities between the tasks of adversarial attacks and style transfer, and it is shown that the former may degenerate into the latter if semantic preservation is evaluated inappropriately.

Experiments are conducted on combining different attack evaluation metrics into a single reward guiding the behavior of the model. Using the weighted harmonic mean rather than the arithmetic mean is proposed as a simple improvement that provides clearer signals to the trained model and reduces human effort in hyperparameter tuning.
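
The intuition is that the harmonic mean collapses toward zero as soon as any single score does, so the attacker cannot entirely trade one objective (for example, grammaticality) away to maximize the others. A small numerical sketch, with illustrative weights:

    def weighted_harmonic_mean(scores, weights, eps=1e-9):
        # H = (sum of weights) / (sum over k of weight_k / score_k);
        # eps guards against division by zero when a score is exactly 0.
        return sum(weights.values()) / sum(
            w / max(scores[k], eps) for k, w in weights.items())

    scores = {"attack": 0.9, "semantics": 0.8, "grammar": 0.05}
    weights = {"attack": 1.0, "semantics": 1.0, "grammar": 1.0}

    arithmetic = sum(scores.values()) / len(scores)     # ~0.58: looks fine
    harmonic = weighted_harmonic_mean(scores, weights)  # ~0.13: penalized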

The problem of paraphrase diversity in learnable paraphrasing attacks is identified for the first time. It is shown that existing systems do not encourage diversity, causing the attacker to collapse to a narrow range of paraphrasing methods, which reduces their usefulness for adversarial retraining. Generator-discriminator-style training is identified as a potential solution, with two necessary modifications: quality filtering of the adversarial examples, and balancing adversarial data with non-adversarial data.
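
Purely as an illustration of how those two modifications could fit together when assembling the retraining set (the quality threshold and the 1:1 ratio below are assumptions, not the thesis's reported settings):

    import random

    def build_retraining_set(adversarial, clean, quality_threshold=0.7):
        # 1. Quality filtering: a low-quality adversarial example keeps its
        #    original ground-truth label while no longer deserving it, which
        #    injects label noise into the retraining data.
        kept = [ex for ex in adversarial if ex["quality"] >= quality_threshold]
        # 2. Balancing: mix in an equal amount of non-adversarial data so the
        #    victim does not overfit to the attacker's paraphrasing style.
        mixed = kept + random.sample(clean, min(len(kept), len(clean)))
        random.shuffle(mixed)
        return mixed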

The method developed in the thesis yields adversarial examples of higher grammaticality and naturality than previously applied methods. The attacker successfully finds and exploits vulnerabilities in the target model, and retraining on the obtained examples eliminates those vulnerabilities.

The experimental code is shared in a publicly available repository, enabling future experiments and reproduction of the results.
dc.description.provenance  (en)
Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-24T16:29:20Z. No. of bitstreams: 0
dc.description.provenance  (en)
Made available in DSpace on 2025-02-24T16:29:20Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents
Acknowledgements iii
Abstract v
摘要 ix
Contents xi
List of Figures xvii
List of Tables xix
Chapter 1 Introduction 1
1.1 Definition of adversarial attacks on Machine Learning models 1
1.2 Adversarial attack development with malicious intent 2
1.3 Adversarial attack development in academia 2
1.4 Structure and purpose of this work 2
Chapter 2 Attacks in different modalities 5
2.1 Foundational work in Computer Vision 5
2.2 Differences in Natural Language Processing as compared to Computer Vision 6
2.2.1 “Discrete” data domain 6
2.2.2 Risk of changing or destroying the semantics 7
Chapter 3 Attack goals and evaluation metrics 9
3.1 Perturbation control 9
3.1.1 Character-based 10
3.1.2 Word-based 10
3.1.2.1 Number of perturbed words 10
3.1.2.2 Embedding-based 10
3.1.3 Sentence-based 11
3.1.3.1 Embedding-based 12
3.1.3.2 Textual entailment 12
3.1.3.3 With Large Language Models 13
3.2 Grammaticality and naturality 13
Chapter 4 Classification of attacks, with examples from the literature 15
4.1 By access to the model: white-box, black-box, and no-box 15
4.1.1 White-box versus black-box 15
4.1.2 Surrogate models 16
4.1.3 No-box 16
4.2 By granularity 16
4.2.1 Character-level 17
4.2.2 Word-level 17
4.2.2.1 Common framework 18
4.2.2.2 Target word search methods 18
4.2.2.3 Perturbation methods 18
4.2.2.4 Constraints 19
4.2.3 Sentence-level 19
4.2.3.1 Concatenation and insertion 19
4.2.3.2 Perturbation in the sentence latent space 20
4.2.3.3 Perturbation by controllable generation 20
4.2.3.4 Prompt-attacking LLMs 21
4.2.3.5 Learnable paraphrasing 21
Chapter 5 Under-exploration of sentence-level attacks 23
5.1 Domination of sub-sentence-level attacks 23
5.2 Shortcomings of sub-sentence attacks and addressing them with learnable paraphrasing 24
5.2.1 Rigidity of the sub-sentence framework 24
5.2.2 Potential loss of grammaticality 25
5.2.3 Computational cost and non-learnability 26
Chapter 6 Learnable paraphrasing attacks 29
6.1 Paraphrasing method: the seq2seq framework 29
6.1.1 Sequence-to-sequence tasks 29
6.1.2 The seq2seq encoder-decoder framework 30
6.1.3 The Transformer architecture 30
6.1.4 Variational Autoencoders 31
6.1.5 Training seq2seq models 31
6.1.5.1 With parallel corpora 31
6.1.5.2 Without parallel corpora 32
6.1.5.3 Pre-training and fine-tuning 33
6.2 Analogies to style transfer 34
6.3 Previous work on learnable paraphrasing attacks 35
6.3.1 Training objectives 36
6.3.2 Observations 37
6.3.2.1 Instability of the training 37
6.3.2.2 Lack of balance between the scores 37
6.3.2.3 Loss of grammaticality 37
6.3.3 Conclusion 38
Chapter 7 Experiments 39
7.1 Glossary 39
7.2 Experimental setup 40
7.2.1 Overview and goals 40
7.2.2 Dataset 40
7.2.2.1 Length filtering 40
7.2.2.2 Label filtering 41
7.2.2.3 Filtering statistics 41
7.2.3 Victim model 42
7.2.4 Attacker model 42
7.2.5 Training method 43
7.2.5.1 Direct Preference Optimization 43
7.2.5.2 Decoding 44
7.2.5.3 Temperature control 44
7.2.5.4 Black-box attack 45
7.2.5.5 Subsetting in the training phase 45
7.2.6 Experimental procedure 45
7.3 Echo training 46
7.3.1 Echo as the tuned model 46
7.3.2 Echo training procedure 46
7.3.3 Results 47
7.4 Base experiment 48
7.4.1 Goals 48
7.4.2 Reward assignment 48
7.4.3 Demonstrating the need to balance the scores 48
7.5 Combining the scores into the reward 49
7.5.1 Means and weights 49
7.5.2 Improved balancing with weighted harmonic mean 51
7.6 Label flipping with negations 52
7.6.1 Examples 52
7.6.2 Repeatability of the paraphrases 53
7.6.3 Traditional semantic similarity evaluation as the point of failure 54
7.7 Experimentation on methods for evaluating semantic similarity 55
7.7.1 Dataset and evaluation 55
7.7.1.1 Sample types 55
7.7.1.2 Human-assigned scores 56
7.7.1.3 Evaluation 56
7.7.2 Tested methods 57
7.7.2.1 Embedding cosine similarity 57
7.7.2.2 BertScore 57
7.7.2.3 Reconstruction loss 58
7.7.2.4 Entailment probability 59
7.7.2.5 Heuristic solution: embedding, entailment label, and a length score 61
7.7.2.6 Large Language Model 63
7.7.3 Observations and conclusion 64
7.8 Training with the proposed semantic similarity evaluation 65
7.8.1 Metrics 65
7.8.2 Successful exploits 66
7.8.3 Grammaticality and naturality 68
7.9 Retraining the victim 72
7.9.1 Goals 72
7.9.2 Selection of adversarial examples 72
7.9.3 Training and evaluation method 72
7.9.4 Results: successful retraining 75
7.10 Paraphrase diversity issue not noted in the literature 76
References 79
Appendix — Implementation details 87
dc.language.iso  en  -
dc.subject  可學習改寫攻擊  zh_TW
dc.subject  對抗性攻擊  zh_TW
dc.subject  自然語言處理  zh_TW
dc.subject  文字擾動  zh_TW
dc.subject  機器學習安全  zh_TW
dc.subject  語義相似性  zh_TW
dc.subject  句級  zh_TW
dc.subject  文本到文本  zh_TW
dc.subject  Machine Learning  en
dc.subject  textual perturbation  en
dc.subject  learnable paraphrasing  en
dc.subject  sentence-level  en
dc.subject  text-to-text  en
dc.subject  semantic similarity  en
dc.subject  Adversarial Attacks  en
dc.subject  Natural Language Processing  en
dc.title  教導文字生成模型以誤導文字分類器之研究  zh_TW
dc.title  Fast, fluent, fooling. Teaching machines to mislead text classifiers  en
dc.type  Thesis  -
dc.date.schoolyear  113-1  -
dc.description.degree  碩士 (Master's)  -
dc.contributor.coadvisor  陳縕儂  zh_TW
dc.contributor.coadvisor  Vivian Yun-Nung Chen  en
dc.contributor.oralexamcommittee  陳尚澤;蔡宗翰;李育杰  zh_TW
dc.contributor.oralexamcommittee  Shang-Tse Chen;Tzong-Han Tsai;Yuh-Jye Lee  en
dc.subject.keyword  機器學習安全,自然語言處理,對抗性攻擊,語義相似性,文本到文本,句級,可學習改寫攻擊,文字擾動  zh_TW
dc.subject.keyword  Machine Learning,Natural Language Processing,Adversarial Attacks,semantic similarity,text-to-text,sentence-level,learnable paraphrasing,textual perturbation  en
dc.relation.page  88  -
dc.identifier.doi  10.6342/NTU202404720  -
dc.rights.note  同意授權(全球公開) (consent granted: worldwide open access)  -
dc.date.accepted  2025-01-15  -
dc.contributor.author-college  電機資訊學院 (College of Electrical Engineering and Computer Science)  -
dc.contributor.author-dept  資訊工程學系 (Department of Computer Science and Information Engineering)  -
dc.date.embargo-lift  2025-09-01  -
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File  Size  Format
ntu-113-1.pdf  25.96 MB  Adobe PDF


All items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.
