Classification on Tor Traffic with Machine Learning Algorithms (以機器學習演算法分類洋蔥網路流量)

Hong-Bo Lu; 陸弘博

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65794

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	林宗男(Tsung-Nan Lin)
dc.contributor.author	Hong-Bo Lu	en
dc.contributor.author	陸弘博	zh_TW
dc.date.accessioned	2021-06-17T00:12:12Z	-
dc.date.available	2020-02-18
dc.date.copyright	2020-02-18
dc.date.issued	2020
dc.date.submitted	2020-02-14
dc.identifier.citation	[1] F. Baker, B. Foster, and C. Sharp, “Cisco architecture for lawful intercept in IP networks,” Internet Engineering Task Force, RFC 3924, 2004. [2] R. Dingledine, N. Mathewson, and P. Syverson, “Tor: The second-generation onion router,” in Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13, SSYM’04, pp. 21–21, Berkeley, CA, USA: USENIX Association, 2004. [3] A. Lashkari, G. Gil, M. Mamun, and A. Ghorbani, “Characterization of Tor Traffic using Time based Features,” in 3rd International Conference on Information Systems Security and Privacy, vol. 1, pp. 253–262, January 2017. [4] A. Johnson, A. Smith, A. Catarineu et al., “Traffic - Tor Metrics,” Tor Project, https://metrics.torproject.org/bandwidth-flags.html. Accessed 3 Dec. 2019. [5] J. Wales et al., “Machine Learning,” Wikipedia, https://en.wikipedia.org/wiki/Machine_learning. Accessed 3 Dec. 2019. [6] W. McCulloch and W. Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” The Bulletin of Mathematical Biophysics, 1943. [7] R. Kohavi and F. Provost, “Glossary of terms,” Machine Learning, vol. 30, no. 2–3, pp. 271–274, 1998. [8] T. Mitchell, Machine Learning, McGraw Hill, 1997, p. 2. [9] S. Harnad, “The Annotation Game: On Turing (1950) on Computing, Machinery, and Intelligence,” in R. Epstein and G. Peters (eds.), The Turing Test Sourcebook: Philosophical and Methodological Issues in the Quest for the Thinking Computer, Kluwer, 2008. [10] F. Chollet et al., “Keras,” GitHub repository. [11] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” JMLR, vol. 12, pp. 2825–2830, 2011. [12] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [13] T. Nguyen and G. Armitage, “A Survey of Techniques for Internet Traffic Classification using Machine Learning,” IEEE Communications Surveys & Tutorials, vol. 10, no. 4, Fourth Quarter 2008. [14] M. Kim and A. Anpalagan, “Tor Traffic Classification from Raw Packet Header using Convolutional Neural Network,” in Proc. of IEEE Intl. Conf. on Knowledge Innovation and Invention (ICKII), Jeju, South Korea, Jul. 2018. [15] A. Cuzzocrea, F. Martinelli, F. Mercaldo, and G. Vercelli, “Tor traffic analysis and detection via machine learning techniques,” in 2017 IEEE International Conference on Big Data (Big Data), pp. 4474–4480, December 2017. [16] G. He, M. Yang, J. Luo, and X. Gu, “Inferring application type information from Tor encrypted traffic,” in Advanced Cloud and Big Data (CBD), 2014 Second International Conference on, IEEE, 2014, pp. 220–227. [17] M. AlSabah, K. Bauer, and I. Goldberg, “Enhancing Tor’s performance using real-time traffic classification,” in Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS ’12, pp. 73–84, New York, NY, USA: ACM. [18] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, pp. 321–357, 2002. [19] T. Nguyen and G. Armitage, “Synthetic sub-flow pairs for timely and stable IP traffic identification,” in Proc. Australian Telecommunication Networks and Application Conference, Melbourne, Australia, December 2006.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65794	-
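dc.description.abstract	With the emergence of encrypted packets and new applications, analyzing Internet traffic has become increasingly difficult. We address this problem by proposing three categories of powerful features that machine learning algorithms can use, together with a standard operating procedure (SOP) for building classifiers. The three proposed feature categories are direction changes, the number of packets before a direction change, and the number of bytes before a direction change. Direction changes relate to how frequently a flow changes direction; the number of packets before a direction change is the number of packets a flow accumulates before it changes direction; the number of bytes before a direction change is the number of bytes a flow accumulates before it changes direction. Comparing the average recall of classifiers trained with our proposed features against classifiers trained with features proposed in previous work, the neural network improves from 43.34% to 58.29%, random forest from 77.73% to 82.97%, k-nearest neighbors from 55.17% to 73.93%, XGB from 77.62% to 81.91%, support vector machine from 17.17% to 41.94%, LGB from 80.92% to 85.19%, and decision tree from 72.03% to 82.53%. The proposed SOP starts from Tor Pcap files. Before extracting flows from them, we first filter out noise and certain specific packets. The extracted flows are further split into shorter flows, and we then compute the features of these shorter flows. Before feeding the features to a machine learning algorithm, we apply some additional processing to them. Feeding the processed features to the algorithm is the last step of the SOP. What distinguishes this SOP is its flexibility: how packets are filtered, how flows are split, and how features are processed can all be adjusted. Therefore, any machine learning algorithm can use this SOP to train a satisfactory classifier. The average recall actually achieved by each algorithm is as follows: neural network 95.65%, random forest 92.72%, k-nearest neighbors 84.03%, XGB 93.18%, support vector machine 90.49%, LGB 94.37%, and decision tree 89.43%. Our contributions are: (1) we propose three categories of powerful features that help machine learning algorithms train classifiers; (2) we develop an SOP for training classifiers of Tor traffic. With the proposed features and SOP, Internet service providers and the Tor network can substantially improve user experience without harming user privacy. We hope this lets Tor attract more users and thereby become a more secure overlay network. Tor users can remain fully anonymous to adversaries who control or observe only part of the network, including hackers and oppressive governments, and the network as a whole can stay uncongested even when Tor carries heavy traffic.	zh_TW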
dc.description.abstract	Traffic classification has long been an important task on the Internet because of its applications in systems such as Quality of Service (QoS) mechanisms and Security Information and Event Management (SIEM) tools. Over the past few decades, however, traffic classification has become more difficult, because more encrypted packets and packets from new applications flow through the Internet. One reason for this is the increasing use of the Tor network. As people become aware of the potential dangers of surfing the Internet, more of them choose to use Tor Browser instead. What sets Tor Browser apart from prevalent browsers (for example, Chrome and Firefox) is that it provides anonymity to its users. For example, when using Tor we do not have to worry that the websites we browse will save cookies to track our activities. Tor also resists network surveillance. People living under oppressive regimes can use Tor to comment on sensitive topics without being blocked or tracked by their governments. This anonymity appeals to users but makes traffic classification much more difficult. So, if Internet service providers (ISPs) want to provide their customers with fast and safe services, Tor is an overlay network they must keep an eye on. Besides ISPs, if the Tor network itself wants to provide its users with a better environment, it should be capable of classifying the traffic flowing through it. With this in mind, we classify Tor traffic into eight categories (audio, browsing, chat, file-transfer, mail, P2P, video, and VoIP) using machine learning algorithms. In this thesis: (1) We propose three categories of powerful features for machine learning algorithms to train classifiers. (2) We develop a standard operating procedure (SOP) to build classifiers for Tor traffic. Through these efforts, we aim to make traffic classification of the Tor network no longer a difficult problem, and to further improve the performance of Tor and the whole Internet.	en
dc.description.provenance	Made available in DSpace on 2021-06-17T00:12:12Z (GMT). No. of bitstreams: 1 ntu-109-R05942097-1.pdf: 6016919 bytes, checksum: 7c2bfd0a7667f177143c98cde5a20697 (MD5) Previous issue date: 2020	en
dc.description.tableofcontents	Chinese Abstract; English Abstract; Chapter 1 Introduction; Chapter 2 Background Information (2.1 Tor: The Second-Generation Onion Router; 2.2 Machine Learning Algorithms); Chapter 3 Related Works; Chapter 4 Data Preprocessing (4.1 Dataset; 4.2 Filtering; 4.3 Flow; 4.4 Truncate Flows; 4.5 Delete Short Flows; 4.6 Calculate Features; 4.7 Feature Selection); Chapter 5 Experiments (5.1 Imbalance Data Problem; 5.2 Mirror Flow; 5.3 N-fold Cross Validation; 5.4 Measurement; 5.5 Results); Chapter 6 Analysis of the Results (6.1 Analysis of Proposed Features; 6.2 Analysis of Max_num_pkts; 6.3 Feature Importance; 6.4 Relations between Parameters); Chapter 7 Conclusion; Bibliography
dc.language.iso	en
dc.subject	Tor (onion network)	zh_TW
dc.subject	traffic control	zh_TW
dc.subject	security information and event management	zh_TW
dc.subject	traffic classification	zh_TW
dc.subject	machine learning	zh_TW
dc.subject	Security Information and Event Management (SIEM)	en
dc.subject	traffic classification	en
dc.subject	The Second-Generation Onion Router (Tor)	en
dc.subject	Quality of Service (QoS)	en
dc.subject	machine learning	en
dc.title	以機器學習演算法分類洋蔥網路流量	zh_TW
dc.title	Classification on Tor Traffic with Machine Learning Algorithms	en
dc.type	Thesis
dc.date.schoolyear	108-1
dc.description.degree	Master's
dc.contributor.oralexamcommittee	謝宏昀(Hung-Yun Hsieh),陳俊良(Jiann-Liang Chen)
dc.subject.keyword	Tor (onion network),traffic control,security information and event management,traffic classification,machine learning	zh_TW
dc.subject.keyword	The Second-Generation Onion Router (Tor),Quality of Service (QoS),Security Information and Event Management (SIEM),traffic classification,machine learning	en
dc.relation.page	102
dc.identifier.doi	10.6342/NTU202000479
dc.rights.note	Paid license
dc.date.accepted	2020-02-15
dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Communication Engineering	zh_TW
Appears in Collections:	Graduate Institute of Communication Engineering
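
The three feature categories described in the abstract (direction changes, packets before a direction change, and bytes before a direction change) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `Packet` record, the `split_flow` helper with its `max_pkts` parameter, the use of a mean as the summary statistic, and the choice to count the final run of packets are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Packet:
    """Hypothetical minimal packet record: direction and size only."""
    outgoing: bool  # True = client-to-server (assumed convention)
    size: int       # packet length in bytes


def split_flow(flow: List[Packet], max_pkts: int) -> List[List[Packet]]:
    """Cut a flow into consecutive sub-flows of at most max_pkts packets.

    max_pkts is a hypothetical parameter standing in for the SOP's
    adjustable flow-splitting step.
    """
    return [flow[i:i + max_pkts] for i in range(0, len(flow), max_pkts)]


def direction_features(flow: List[Packet]) -> Dict[str, float]:
    """Summarise runs of consecutive same-direction packets in one flow."""
    if not flow:
        return {"direction_changes": 0,
                "mean_pkts_per_run": 0.0,
                "mean_bytes_per_run": 0.0}
    run_pkts = [1]               # packets in each same-direction run
    run_bytes = [flow[0].size]   # bytes in each same-direction run
    for prev, cur in zip(flow, flow[1:]):
        if cur.outgoing == prev.outgoing:
            run_pkts[-1] += 1
            run_bytes[-1] += cur.size
        else:
            # Direction flipped: start a new run.
            run_pkts.append(1)
            run_bytes.append(cur.size)
    return {
        "direction_changes": len(run_pkts) - 1,
        # Assumption: the last run is included even though no
        # direction change follows it.
        "mean_pkts_per_run": mean(run_pkts),
        "mean_bytes_per_run": mean(run_bytes),
    }
```

For instance, a flow of two outgoing packets, three incoming packets, and one outgoing packet has two direction changes and runs of 2, 3, and 1 packets. Applying `direction_features` to each piece returned by `split_flow` would give one feature row per shorter flow, which is the order of steps the abstract describes.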

Files in This Item:

File	Size	Format
ntu-109-1.pdf (not authorized for public access)	5.88 MB	Adobe PDF

Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.