Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/83107

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 楊佳玲 | zh_TW |
| dc.contributor.advisor | Chia-Lin Yang | en |
| dc.contributor.author | 賴宥儒 | zh_TW |
| dc.contributor.author | You-Ru Lai | en |
| dc.date.accessioned | 2023-01-08T17:07:13Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-01-06 | - |
| dc.date.issued | 2022 | - |
| dc.date.submitted | 2022-11-25 | - |
| dc.identifier.citation | G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative filtering,” 2003.
B. Acun, M. Murphy, X. Wang, J. Nie, C. Wu, and K. Hazelwood, “Understanding training efficiency of deep learning recommendation models at scale,” in Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture, HPCA ’21, IEEE Computer Society, 2021.
P. Covington, J. Adams, and E. Sargin, “Deep neural networks for YouTube recommendations,” in Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, Association for Computing Machinery, 2016.
M. Zhao, N. Agarwal, A. Basant, B. Gedik, S. Pan, M. Ozdal, R. Komuravelli, J. Pan, T. Bao, H. Lu, S. Narayanan, J. Langman, K. Wilfong, H. Rastogi, C.-J. Wu, C. Kozyrakis, and P. Pol, “Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA ’22, Association for Computing Machinery, 2022.
A. Eisenman, M. Naumov, D. Gardner, M. Smelyanskiy, S. Pupyrev, K. Hazelwood, A. Cidon, and S. Katti, “Bandana: Using non-volatile memory for storing deep learning models,” in Proceedings of the 2nd Machine Learning and Systems Conference, MLSys ’19, 2019.
W. Zhao, D. Xie, R. Jia, Y. Qian, R. Ding, M. Sun, and P. Li, “Distributed hierarchical GPU parameter server for massive scale deep learning ads systems,” in Proceedings of the 3rd Machine Learning and Systems Conference, MLSys ’20, 2020.
J. Axboe, A. D. Brunelle, and N. Scott, “blktrace(8) - Linux man page.” https://linux.die.net/man/8/blktrace, 2006.
J. Axboe, “fio - Flexible I/O tester.” https://fio.readthedocs.io/en/latest/fio_doc.html, 2017.
M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, “Deep learning recommendation model for personalization and recommendation systems,” 2019.
D. Mudigere, Y. Hao, J. Huang, Z. Jia, A. Tulloch, S. Sridharan, X. Liu, M. Ozdal, J. Nie, J. Park, L. Luo, J. A. Yang, L. Gao, D. Ivchenko, A. Basant, Y. Hu, J. Yang, E. K. Ardestani, X. Wang, R. Komuravelli, C.-H. Chu, S. Yilmaz, H. Li, J. Qian, Z. Feng, Y. Ma, J. Yang, E. Wen, H. Li, L. Yang, C. Sun, W. Zhao, D. Melts, K. Dhulipala, K. Kishore, T. Graf, A. Eisenman, K. K. Matam, A. Gangidi, G. J. Chen, M. Krishnan, A. Nayak, K. Nair, B. Muthiah, M. Khorashadi, P. Bhattacharya, P. Lapukhov, M. Naumov, A. Mathews, L. Qiao, M. Smelyanskiy, B. Jia, and V. Rao, “Software-hardware co-design for fast and scalable training of deep learning recommendation models,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA ’22, Association for Computing Machinery, 2022.
M. Gorman, “Swap management.” https://www.kernel.org/doc/gorman/html/understand/understand014.html, 2007.
D. P. Bovet and M. Cesati, “Page frame reclaiming,” in Understanding the Linux Kernel, ch. 17, O’Reilly Media, 2005.
SSDFans, “SSD 核心技術: FTL,” in 深入淺出 SSD: 固態存儲核心技術、原理與實戰, ch. 4, 機械工業出版社, 2018.
CriteoLabs, “Download terabyte click logs.” https://labs.criteo.com/2013/12/download-terabyte-click-logs-2/, 2013.
M. Adnan, Y. E. Maboud, D. Mahajan, and P. J. Nair, “Accelerating recommendation system training by leveraging popular choices,” in Proceedings of the VLDB Endowment, VLDB ’21, VLDB Endowment, 2021.
D. Kalamkar, E. Georganas, S. Srinivasan, J. Chen, M. Shiryaev, and A. Heinecke, “Optimizing deep learning recommender systems training on CPU cluster architectures,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20, IEEE Press, 2020.
X. Sun, H. Wan, Q. Li, C. Yang, T. Kuo, and C. Xue, “RM-SSD: In-storage computing for large-scale recommendation inference,” in Proceedings of the 2022 IEEE International Symposium on High-Performance Computer Architecture, HPCA ’22, IEEE Computer Society, 2022.
L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril, A. Firoozshahian, K. Hazelwood, B. Jia, H.-H. S. Lee, M. Li, B. Maher, D. Mudigere, M. Naumov, M. Schatz, M. Smelyanskiy, X. Wang, B. Reagen, C.-J. Wu, M. Hempstead, and X. Zhang, “RecNMP: Accelerating personalized recommendation with near-memory processing,” in Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA ’20, IEEE Press, 2020.
P. Rubin, D. MacKenzie, and S. Kemp, “dd(1) - Linux man page.” https://linux.die.net/man/1/dd, 2010.
M. Bjørling, J. González, and P. Bonnet, “LightNVM: The Linux Open-Channel SSD subsystem,” in Proceedings of the 15th USENIX Conference on File and Storage Technologies, FAST ’17, USENIX Association, 2017.
M. Bjørling, A. Aghayev, H. Holmberg, A. Ramesh, D. L. Moal, G. R. Ganger, and G. Amvrosiadis, “ZNS: Avoiding the block interface tax for flash-based SSDs,” in Proceedings of the 2021 USENIX Annual Technical Conference, USENIX ATC ’21, USENIX Association, 2021.
Y. Kwon, Y. Lee, and M. Rhu, “TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, Association for Computing Machinery, 2019.
H. Wan, X. Sun, Y. Cui, C.-L. Yang, T.-W. Kuo, and C. J. Xue, “FlashEmbedding: Storing embedding tables in SSD for large-scale recommender systems,” in Proceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems, APSys ’21, Association for Computing Machinery, 2021.
M. Wilkening, U. Gupta, S. Hsia, C. Trippel, C.-J. Wu, D. Brooks, and G.-Y. Wei, “RecSSD: Near data processing for solid state drive based recommendation inference,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’21, Association for Computing Machinery, 2021.
D. Skourtis, D. Achlioptas, N. Watkins, C. Maltzahn, and S. Brandt, “Flash on rails: Consistent flash performance through redundancy,” in Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC ’14, USENIX Association, 2014.
S. Yan, H. Li, M. Hao, M. H. Tong, S. Sundararaman, A. A. Chien, and H. S. Gunawi, “Tiny-Tail Flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs,” in Proceedings of the 15th USENIX Conference on File and Storage Technologies, FAST ’17, USENIX Association, 2017.
J. He, S. Kannan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “The unwritten contract of solid state drives,” in Proceedings of the Twelfth European Conference on Computer Systems, EuroSys ’17, Association for Computing Machinery, 2017. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/83107 | - |
| dc.description.abstract | 推薦系統廣泛用於提供個人化的建議,而推薦系統訓練時所需的記憶體容量持續成長。使用 swap 將固態硬碟 (SSD) 作為記憶體的延伸,可以緩解訓練時的記憶體需求。由於使用 swap 會引入額外的讀寫延遲,並且考慮到模型重新訓練及部署的週期性,訓練推薦系統時使用 swap 必須關注其對整體效率的影響。
在本文中,我們觀察到使用 swap 會使訓練時間增加達 2 ~ 5 倍。經過分析,我們歸納出以下影響訓練效率的原因:1. 推薦系統訓練時的記憶體操作是不規律的,造成記憶體利用率低且 swap 次數多,因此讀寫量很大。2. 讀寫請求的大小多數小於 32KB,不利於 SSD 內部頻寬的利用。3. SSD 的讀寫頻寬隨整體寫入量增加而下降,主要是受到 SSD 內部垃圾回收機制的影響。同時,我們使用 fio 模擬讀寫行為並探討改善 SSD 讀寫效率的方式,實驗結果如下:1. 改變讀寫大小至 128KB,有 1.75 倍的頻寬提升;2. 改變寫入模式為順序寫,寫入效能有 4.37 倍的提升。最後,我們提供下列二個於推薦系統訓練時使用 swap 的建議:1. 聚集更多鄰近時間使用的 swap 資料,以較大的讀寫大小操作,並在換出 (swap out) 記憶體時以順序寫的方式操作 SSD。2. 採用 Open Channel SSD 或 ZNS SSD 作為 swap 裝置,讓系統可依需求安排 SSD 的資料讀寫及垃圾回收機制,以此提升效能。 | zh_TW |
| dc.description.abstract | The deep learning recommendation model (DLRM) is widely used for providing personalized suggestions, and the memory capacity required for DLRM training keeps growing. Using swap, which turns an SSD into an extension of main memory, can alleviate the DRAM capacity demand of training. At the same time, swapping introduces additional I/O latency, and given the cycle time of model retraining and redeployment, training a DLRM with swap must account for the impact on overall efficiency.
In this thesis, we find that training time becomes 2 to 5 times longer when swap is used. Based on our analysis, we summarize the factors that influence training efficiency as follows. 1. The memory access pattern during DLRM training is irregular, which causes low memory utilization and a large number of swap operations; the resulting I/O volume is therefore huge. 2. Most I/O requests are smaller than 32 KB, which is unfavorable for utilizing the internal bandwidth of the SSD. 3. As the cumulative write volume increases, the SSD read/write bandwidth decreases, mainly because of the SSD's internal garbage collection (GC) activity. In addition, we use fio to simulate the observed I/O behavior and experiment with ways to improve SSD I/O efficiency (see the fio job sketch after the metadata table below). The results are as follows. 1. Increasing the I/O request size to 128 KB yields a 1.75x bandwidth improvement. 2. Changing the write pattern to sequential writes yields a 4.37x improvement in write bandwidth. Finally, we offer two suggestions for using swap in DLRM training. 1. Aggregate swap data that will be used close together in time, so that it can be read and written at a larger request size, and use sequential writes when swapping out memory. 2. Adopt an Open-Channel SSD or a ZNS SSD as the swap device, which lets the host schedule reads, writes, and GC according to demand, thereby improving performance. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-01-08T17:07:13Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-01-08T17:07:13Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 iii
Abstract iv
Contents vi
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
Chapter 2 Background 5
2.1 Deep Learning Recommendation Model (DLRM) 5
2.2 Linux Memory Management and Swapping System 7
2.3 SSD Architecture 8
Chapter 3 Motivation 11
3.1 DLRM Training Time and I/O Volume 11
Chapter 4 Workload Analysis 15
4.1 Experimental Setup 15
4.2 Access Pattern and Utilization of Swapped-in Pages 18
4.3 I/O Behavior Analysis 22
4.4 Latency Analysis 31
4.5 FIO Simulation 39
4.6 Discussion 43
Chapter 5 Related Works 45
5.1 DLRM Optimization 45
5.2 SSD Performance Optimization 48
Chapter 6 Conclusion 50
References 51 | - |
| dc.language.iso | en | - |
| dc.subject | 置換系統 | zh_TW |
| dc.subject | 深度學習推薦系統 | zh_TW |
| dc.subject | 垃圾回收機制 | zh_TW |
| dc.subject | 讀寫特徵 | zh_TW |
| dc.subject | 固態硬碟 | zh_TW |
| dc.subject | Garbage Collection | en |
| dc.subject | DLRM | en |
| dc.subject | Swapping system | en |
| dc.subject | SSD | en |
| dc.subject | I/O characteristic | en |
| dc.title | 深度學習推薦系統訓練之記憶體置換行為分析 | zh_TW |
| dc.title | A Swapping Behavior Analysis of Deep Learning Recommendation System Training | en |
| dc.title.alternative | A Swapping Behavior Analysis of Deep Learning Recommendation System Training | - |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-1 | - |
| dc.description.degree | 碩士 (Master's) | - |
| dc.contributor.oralexamcommittee | 鄭湘筠;陳依蓉 | zh_TW |
| dc.contributor.oralexamcommittee | Hsiang-Yun Cheng;Yi-Jung Chen | en |
| dc.subject.keyword | 深度學習推薦系統,置換系統,固態硬碟,讀寫特徵,垃圾回收機制 | zh_TW |
| dc.subject.keyword | DLRM,Swapping system,SSD,I/O characteristic,Garbage Collection | en |
| dc.relation.page | 55 | - |
| dc.identifier.doi | 10.6342/NTU202210058 | - |
| dc.rights.note | 同意授權(限校園內公開) (authorization granted; access limited to campus) | - |
| dc.date.accepted | 2022-11-25 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
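The abstract reports that raising the I/O request size to 128 KB and switching to sequential writes markedly improved SSD bandwidth in fio-based simulations. As a minimal sketch of how such a comparison could be set up with fio, the job file below contrasts 4 KB random writes (roughly page-sized swap-out traffic) with 128 KB sequential writes; the device path, queue depth, and runtime are illustrative assumptions, not the configuration actually used in the thesis (Section 4.5 of the full text).

```ini
; Hypothetical fio job sketch: small random writes vs. large sequential writes.
; /dev/nvme0n1, iodepth=32, and runtime=60 are assumptions for illustration.
; WARNING: writing to a raw device is destructive; point this at a scratch disk.
[global]
ioengine=libaio
direct=1
filename=/dev/nvme0n1
time_based=1
runtime=60
group_reporting=1

; Baseline: 4 KB random writes, similar in size to unaggregated swapped-out pages.
[randwrite-4k]
rw=randwrite
bs=4k
iodepth=32
stonewall

; Candidate: 128 KB sequential writes, the pattern suggested for aggregated swap-out.
[seqwrite-128k]
rw=write
bs=128k
iodepth=32
stonewall
```

Comparing the bandwidth fio reports for the two jobs approximates the request-size and access-pattern sensitivity described in the abstract; note that, as the thesis also observes, sustained writes eventually trigger garbage collection, so results on a heavily written device will differ from those on a freshly trimmed one.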
| Appears in Collections: | 資訊網路與多媒體研究所 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| U0001-1121221117414063.pdf Access restricted to NTU campus IP addresses (please use the VPN service for off-campus access) | 3.45 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.