Malware Family Classification System based on Attention-based Characteristic Execution Sequence

Ting-Yi Chen; 陳廷易

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74062

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	孫雅麗(Yea-Li Sun)
dc.contributor.author	Ting-Yi Chen	en
dc.contributor.author	陳廷易	zh_TW
dc.date.accessioned	2021-06-17T08:18:30Z	-
dc.date.available	2023-08-18
dc.date.copyright	2019-08-18
dc.date.issued	2019
dc.date.submitted	2019-08-13
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74062	-
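dc.description.abstract	In recent years the volume of newly produced malware has grown rapidly, causing heavy losses for individuals and businesses worldwide. Understanding the intent and purpose of malware and extracting its key execution behaviors would greatly help malware detection and defense. This thesis proposes an automated system for identifying the important execution-sequence behaviors of malware. Built on a recurrent neural network combined with a self-attention mechanism, it analyzes the sequence of Windows API call invocations recorded while malware runs, learns and captures the relationships within the sequence, and automatically identifies whether each API call invocation is a key element of the characteristic malicious execution activity and thus reflects the malicious intent. The system comprises three functional modules that we designed: the Embedder, which encodes API call invocations; the Encoder, which computes the importance of each API call invocation within the malware's execution sequence; and the Filter, which selects the important execution sequence of the malware. With these three modules we can build a pipeline for malware analysis and for classifying family behavioral patterns. The important execution sequences output by the system let security researchers quickly obtain a semantic interpretation of a malware sample's characteristic execution pattern and correctly judge whether an executable is malicious; they also make it possible to compare the similarity between the characteristics of different malware samples in order to classify or cluster them. Our experimental results not only demonstrate that each functional module is more effective than alternative designs but also show the system's ability to recognize behavioral characteristics, so that unknown malware can be effectively classified by behavioral pattern and malware family. In addition, we visualize the important execution sequences of malware and analyze the relationship between different behavioral patterns and the characteristic execution patterns of malware families, revealing behavioral diversity among variants of the same family and the sharing of the same behaviors across different families.	zh_TW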
dc.description.abstract	In recent years, the volume of malicious software (malware) has grown rapidly, causing substantial losses for individuals and businesses around the world. Understanding the intention of malware and extracting its key execution behaviors considerably helps in detecting and defending against it. This research proposes an automated system for identifying important execution sequence behavior. The architecture is based on a recurrent neural network combined with a self-attention mechanism. It analyzes the sequence of Windows API call invocations recorded at runtime and captures the relationships between them, in order to automatically identify whether each API call invocation is a characteristic call in the malicious behavioral activity and therefore reflects the malicious intention. The proposed system contains three functional modules: the Embedder, which vectorizes API call invocations; the Encoder, which calculates the importance of each API call invocation in the execution profiles; and the Filter, which extracts the important API call invocations from the malware. With these three modules, we establish a pipeline for malware analysis and family classification. The important API call invocations output by the system allow security analysts to quickly obtain a semantic interpretation of the characteristic execution pattern and to classify or cluster malware by calculating a similarity score. Our experiments not only demonstrate that the proposed functional modules are more effective than alternative methods but also show the system's ability to recognize behavioral features, which allows it to classify unseen malware correctly into its family. Additionally, we visualize the important API call invocations of the malware and analyze the relationship between different behavioral patterns and family characteristic execution patterns. We found that variants within a malware family exhibit diverse behaviors and that the same behavioral pattern can appear in many different families.	en
dc.description.provenance	Made available in DSpace on 2021-06-17T08:18:30Z (GMT). No. of bitstreams: 1 ntu-108-R06725035-1.pdf: 3240179 bytes, checksum: a5df4d4a725fc16abc07905282c0391d (MD5) Previous issue date: 2019	en
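dc.description.tableofcontents	Acknowledgements; Chinese Abstract; ABSTRACT; Table of Contents; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Research Motivation; 1.2 Research Objectives; 1.3 Research Method; 1.4 Research Contributions; Chapter 2 Background and Related Work; 2.1 Background; 2.1.1 Dynamic Malware Profiling; 2.1.2 Winnowing; 2.1.3 RasMMA; 2.1.4 Embedding; 2.1.5 Recurrent Neural Network; 2.1.6 Self-Attention Mechanism; 2.2 Related Work; 2.2.1 Malware Detection Based on Dynamic Analysis; 2.2.2 Malware Feature Extraction; 2.2.3 Malware Detection Based on Machine Learning Analysis; 2.2.4 Malware Family Classification Based on Neural Networks; Chapter 3 Methodology and System Architecture; 3.1 Dynamic Execution Behavior Profiling and Analysis System; 3.2 Embedder Module; 3.3 Encoder Module; 3.4 Filter Module; 3.5 Restoring Important Execution Sequences to Text; 3.6 Representing Important Execution Sequences in Vector Space; Chapter 4 Experiments; 4.1 Experimental Data Preprocessing; 4.1.1 Dynamic Malware Profiling; 4.1.2 Malware Family Labeling; 4.1.3 RasMMA; 4.1.4 Execution Profile Length Analysis and Processing; 4.1.5 Dataset Splitting; 4.2 Important Execution Sequence Evaluation and Method Comparison; 4.2.1 Embedder Method Comparison; 4.2.2 Encoder Method Comparison; 4.2.3 Filter Threshold Comparison; 4.3 Malware Family and Behavioral Pattern Classification; 4.3.1 Experimental Method; 4.3.2 Evaluation Method; 4.3.3 Results and Discussion; 4.4 Loner Malware Family Classification; 4.4.1 Loner Dataset Preparation; 4.4.2 Experimental Method; 4.4.3 Evaluation Method; 4.4.4 Results and Discussion; 4.5 Case Visualization and Discussion; 4.5.1 Visualization of Allaple Family Behavioral Patterns; 4.5.2 Visualization of Allaple Loner Family Classification; Chapter 5 Conclusion; References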
dc.language.iso	zh-TW
dc.title	A Malware Family Classification System That Generates Important Execution Sequences Based on a Self-Attention Mechanism	zh_TW
dc.title	Malware Family Classification System based on Attention-based Characteristic Execution Sequence	en
dc.type	Thesis
dc.date.schoolyear	107-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	李漢銘(Hahn-Ming Lee),陳孟彰(Meng-Chang Chen),蕭舜文(Shun-Wen Hsiao)
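dc.subject.keyword	Malware,Dynamic behavior analysis,Execution sequence analysis,Family classification,Self-attention mechanism,Recurrent neural network,Important execution sequence visualization	zh_TW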
dc.subject.keyword	Self-Attention,Recurrent Neural Network,Behavioral pattern,Malware family classification,Malware analysis,Dynamic analysis,Important API calls visualization	en
dc.relation.page	80
dc.identifier.doi	10.6342/NTU201903569
dc.rights.note	Paid authorization
dc.date.accepted	2019-08-14
dc.contributor.author-college	College of Management	zh_TW
dc.contributor.author-dept	Graduate Institute of Information Management	zh_TW
Appears in Collections:	Department of Information Management

Files in This Item:

File	Size	Format
ntu-108-1.pdf (not currently authorized for public access)	3.16 MB	Adobe PDF


Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.
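
The three-module pipeline named in the abstract (Embedder, Encoder, Filter) can be illustrated with a minimal sketch. This is not the thesis implementation: the thesis combines a recurrent neural network with self-attention and trains its modules, whereas the sketch below omits the recurrent network, uses random untrained embeddings, and assumes a simple "attention received" score with a uniform-share threshold. All function names, the example API call names, and the threshold rule are hypothetical choices made for illustration only.

```python
import numpy as np

def embed(call_ids, table):
    """Embedder stand-in (assumption): look up one vector per API call invocation."""
    return table[call_ids]

def self_attention_importance(x):
    """Encoder stand-in (assumption): scaled dot-product self-attention weights,
    averaged over query positions so each invocation gets one importance score."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights.mean(axis=0)  # average attention each position receives

def filter_important(calls, importance, threshold):
    """Filter stand-in (assumption): keep invocations scoring at or above the threshold."""
    return [c for c, s in zip(calls, importance) if s >= threshold]

# Illustrative vocabulary and execution profile; the embeddings are random, not trained.
rng = np.random.default_rng(0)
vocab = ["CreateFile", "WriteFile", "RegSetValue", "Sleep"]
table = rng.normal(size=(len(vocab), 8))
calls = ["Sleep", "CreateFile", "WriteFile", "Sleep", "RegSetValue"]
ids = np.array([vocab.index(c) for c in calls])

importance = self_attention_importance(embed(ids, table))
# Hypothetical threshold: keep calls that receive at least a uniform share of attention.
important_calls = filter_important(calls, importance, threshold=1.0 / len(calls))
```

Because every row of the attention matrix sums to 1, the importance scores also sum to 1, which makes a uniform-share threshold a convenient placeholder; the thesis itself compares Filter thresholds experimentally (Section 4.2.3), and those values are not reproduced here.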