Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84736
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 林守德(Shou-De Lin) | |
dc.contributor.author | Li-Jen Liu | en |
dc.contributor.author | 劉力仁 | zh_TW |
dc.date.accessioned | 2023-03-19T22:23:00Z | - |
dc.date.copyright | 2022-09-29 | |
dc.date.issued | 2022 | |
dc.date.submitted | 2022-09-05 | |
dc.identifier.citation | K. Bennani-Smires, C. Musat, A. Hossmann, M. Baeriswyl, and M. Jaggi. Simple unsupervised keyphrase extraction using sentence embeddings. arXiv preprint arXiv:1801.04470, 2018. F. Bianchi, S. Terragni, and D. Hovy. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv preprint arXiv:2004.03974, 2020. F. Bianchi, S. Terragni, D. Hovy, D. Nozza, and E. Fersini. Cross-lingual contextualized topic models with zero-shot learning. arXiv preprint arXiv:2004.07737, 2020. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003. R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289, 2020. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007. M. Coavoux, S. Narayan, and S. B. Cohen. Privacy-preserving neural representations of text. arXiv preprint arXiv:1808.09408, 2018. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. Y. Gallina, F. Boudin, and B. Daille. Large-scale evaluation of keyphrase extraction models. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pages 271–278, 2020. M. Grootendorst. KeyBERT: Minimal keyword extraction with BERT, 2020. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. K. Lang. NewsWeeder: Learning to filter netnews. In Proceedings of the 12th International Machine Learning Conference (ML95), 1995. Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR, 2014. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models, 2016. X. Pan, M. Zhang, S. Ji, and M. Yang. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1314–1331. IEEE, 2020. E. Papagiannopoulou and G. Tsoumakas. A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(2):e1339, 2020. J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018. N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019. R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017. C. Song and A. Raghunathan. Information leakage in embedding models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 377–390, 2020. W. Webber, A. Moffat, and J. Zobel. A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS), 28(4):1–38, 2010. X. Zhang, J. J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In NIPS, 2015. Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84736 | - |
dc.description.abstract | Document embedding learning is an important technique for transforming documents into low-dimensional vectors. A document's embedding preserves the document's complete semantics and has brought success to a wide range of natural language processing applications. These successes, however, have also drawn the attention of malicious attackers. In one type of attack, the attacker tries to reverse-engineer a document embedding back to its original input words, or to sensitive words, and thereby pry into the information behind the vector. In this thesis, we extend the previous setting to a more general form and provide two kinds of information that increase the interpretability of an embedding. First, we assume the attacker assigns a preference score to every word in a document, and our goal is to return a word sequence whose order matches the attacker's preference ranking. Second, even given such a word sequence, the meaning behind it remains hard to grasp, so we borrow the strengths of topic models to extract coherent semantics from the target. To achieve these goals, we combine a neural topic model with ranking optimization. Comprehensive experiments show that our design performs well both at returning word sequences in the attacker's preferred order and at providing coherent and diverse topics, which means an attacker can easily understand the characteristics behind a document embedding across various datasets and common embedding models. | zh_TW |
dc.description.abstract | Document representation learning has become an important technique for embedding rich document context into low-dimensional vectors. The embeddings preserve the complete semantics of documents and have led to great success in various NLP applications. Nonetheless, this success has also attracted the attention of malicious adversaries. In one branch of attacks, the adversary tries to reverse-engineer an embedding back to its content words or sensitive keywords to pry into the information behind it. In our work, we extend the previous setting to a more general one and provide two types of information that both increase the interpretability of the embeddings. First, we assume an adversary has his or her own preference for the information in a document, and our goal is to retrieve the sequence of words that corresponds to that preference. Second, even if we could precisely retrieve a sequence of words representing a document, it would still be hard for a human to grasp the idea behind them; we therefore borrow the advantages of topic models to acquire coherent semantics from the targets. To achieve these goals, we combine the mechanisms of a neural topic model and ranking optimization. Through comprehensive experiments, our design shows promising results in capturing the sequence of the adversary's preferred words and in providing coherent and diverse topics, so that the adversary can easily grasp the characteristics of unknown embeddings across various datasets and off-the-shelf embedding models (see the illustrative sketches after this metadata table). | en |
dc.description.provenance | Made available in DSpace on 2023-03-19T22:23:00Z (GMT). No. of bitstreams: 1 U0001-0109202215464400.pdf: 1296316 bytes, checksum: 2911b98cadac7fd38f70b0d77d36dbb9 (MD5) Previous issue date: 2022 | en |
dc.description.tableofcontents | 1. Introduction 1 2. Related Work 4 2.1 Embedding Models for textual data 4 2.2 Privacy issues of document embedding 5 3. Problem Definition 7 4. Methodology 10 4.1 Topic extraction 10 4.2 Embedding Fusion and Prediction 12 4.3 Ranking-aware Optimization 13 5. Experiments 16 5.1 Experiment Setup 16 5.2 Attack Performance Analysis 19 5.3 Cross Domain 22 5.4 Ablation Study 23 5.5 Empirical Case Study 24 6. Conclusion 28 | |
dc.language.iso | zh-TW | |
dc.title | 文本嵌入向量逆向攻擊與主題式語意分解模型 | zh_TW |
dc.title | Embedding Inversion Attack of Documents with Topic-aware Semantic Decoder Model | en |
dc.type | Thesis | |
dc.date.schoolyear | 110-2 | |
dc.description.degree | Master's | |
dc.contributor.coadvisor | 葉彌妍(Mi-Yen Yeh) | |
dc.contributor.oralexamcommittee | 李政德(Cheng-Te Li), 陳縕儂(Yun-Nung Chen), 陳尚澤(Shang-Tse Chen) | |
dc.subject.keyword | Embedding Inversion Attack, Document Embedding, Topic Model, Ranking | zh_TW |
dc.subject.keyword | Embedding Inversion Attack, Document Embedding, Topic Model, Learning to Rank | en |
dc.relation.page | 35 | |
dc.identifier.doi | 10.6342/NTU202203066 | |
dc.rights.note | Access authorized (restricted to on-campus use) | |
dc.date.accepted | 2022-09-06 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Data Science Degree Program | zh_TW |
dc.date.embargo-lift | 2022-09-29 | - |
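The abstracts above describe combining a neural topic model with ranking optimization so that an attacker can recover a preference-ordered word sequence from a captured document embedding. Below is a minimal, hypothetical sketch of the ranking half of such a pipeline: a scorer network maps an embedding to per-word relevance scores and is trained with a ListNet-style listwise loss (Cao et al., 2007, cited in this record). All names (`InversionScorer`, `listnet_loss`), shapes, and hyperparameters are illustrative assumptions, not the thesis's actual implementation, and the topic-model component is omitted.

```python
# Hypothetical sketch only, not the thesis's code. Assumes PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InversionScorer(nn.Module):
    """Maps a captured document embedding to a relevance score per vocabulary word."""

    def __init__(self, embed_dim: int, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),  # one score per candidate word
        )

    def forward(self, doc_emb: torch.Tensor) -> torch.Tensor:
        return self.net(doc_emb)  # shape: (batch, vocab_size)


def listnet_loss(pred: torch.Tensor, pref: torch.Tensor) -> torch.Tensor:
    """ListNet-style listwise loss (Cao et al., 2007): KL divergence between the
    top-one probability distributions induced by predicted and preference scores.
    (KL differs from the original cross-entropy only by a constant in the target.)"""
    return F.kl_div(F.log_softmax(pred, dim=-1),
                    F.softmax(pref, dim=-1),
                    reduction="batchmean")


# Toy training run on random stand-ins. In the attack setting, doc_emb would come
# from an off-the-shelf encoder (e.g. Sentence-BERT) and pref would hold the
# adversary's per-word preference scores for each training document.
torch.manual_seed(0)
embed_dim, vocab_size = 768, 5000
model = InversionScorer(embed_dim, vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
doc_emb = torch.randn(32, embed_dim)
pref = torch.randn(32, vocab_size)
for _ in range(100):
    opt.zero_grad()
    listnet_loss(model(doc_emb), pref).backward()
    opt.step()
# Word ids sorted by predicted relevance, most preferred first.
ranked = model(doc_emb).argsort(dim=-1, descending=True)
```

A listwise loss fits this setting better than a pointwise one because the attack's stated goal is the order of the retrieved words, not their absolute scores.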
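The reference list includes Webber, Moffat, and Zobel's rank-biased overlap (RBO), a similarity measure for indefinite rankings. It is a natural candidate for scoring how well a retrieved word ranking matches the adversary's true preference order, though this record does not state the thesis's actual evaluation metric. A self-contained sketch of the plain, prefix-truncated form of RBO:

```python
def rbo(ranking_a: list, ranking_b: list, p: float = 0.9) -> float:
    """Prefix-truncated rank-biased overlap (Webber et al., 2010).

    Computes (1 - p) * sum_d p^(d-1) * |A_d ∩ B_d| / d over prefixes of
    depth d, where A_d and B_d are the top-d items of each ranking; p
    controls how heavily the top ranks are weighted. This is the truncated
    lower-bound form, without the paper's extrapolation to infinite depth.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        total += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * total


# Rankings that agree at the top score higher; prints ~0.244 here
# (truncated RBO over short lists is a lower bound, not normalized to 1).
print(rbo(["topic", "privacy", "bert"], ["topic", "privacy", "rank"]))
```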
Appears in Collections: | Data Science Degree Program |
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-0109202215464400.pdf (access restricted to NTU campus IPs; off campus, please use the VPN remote-access service) | 1.27 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.