Improving the Scalability of Breadth-First Search on GPUs via Frontier Queue Decentralization

鄭博修; Po-Hsiu Cheng

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89129

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	郭斯彥	zh_TW
dc.contributor.advisor	Sy-Yen Kuo	en
dc.contributor.author	鄭博修	zh_TW
dc.contributor.author	Po-Hsiu Cheng	en
dc.date.accessioned	2023-08-16T17:15:09Z	-
dc.date.available	2023-11-09	-
dc.date.copyright	2023-08-16	-
dc.date.issued	2023	-
dc.date.submitted	2023-08-01	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89129	-
dc.description.abstract	圖（Graph）是一種常見的資料結構，在導航、語音辨識和推薦系統等方面具有廣泛的應用。其中，廣度優先搜索（BFS）是探索圖中節點的基本算法，在獲取各種graph性質的方面起著至關重要的作用。圖形處理器（GPU）為常用的硬體加速器，具備卓越的計算能力和存儲容量。現在已有許多BFS演算法被移植到GPU上以提高效能，例如並行BFS（PBFS）演算法。本研究主要為一種改進傳統PBFS演算法的可擴展性之方法。採用了去中心化前沿佇列（Decentralized Frontier Queue）、即時佇列排空（Real-time Queue Draining）、兩級鄰居訪問（Two-level Neighbor Visiting）及狀態陣列原子掃描（Atomic Status Array Scanning）等設計。這些機制成功緩解GPU上的爭用（Contention）、降低記憶體消耗、解決負載不平衡問題，在實現了具競爭力的運行速度的同時，改進了PBFS演算法在GPU上的可擴展性。本論文介紹了此方法的設計、評估，以及未來改進的方向。	zh_TW
dc.description.abstract	Graphs are a common data structure, widely used in navigation, speech recognition, and recommendation systems. Breadth-First Search (BFS) is a fundamental algorithm for graph traversal and plays a crucial role in obtaining various graph properties. The Graphics Processing Unit (GPU) is a commonly used hardware accelerator with remarkable computing power and storage capacity. Many BFS algorithms, such as the Parallel BFS (PBFS) algorithm, have been ported to GPUs to improve performance. This study proposes an approach to improving the scalability of the traditional PBFS algorithm. The approach combines four designs: Decentralized Frontier Queue, Real-time Queue Draining, Two-level Neighbor Visiting, and Atomic Status Array Scanning. These mechanisms alleviate contention on GPUs, reduce memory consumption, and resolve load imbalance. The result is competitive execution speed together with improved scalability of the PBFS algorithm on GPUs. This thesis presents the design, its evaluation, and directions for future work.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T17:15:09Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2023-08-16T17:15:09Z (GMT). No. of bitstreams: 0	en
dc.description.tableofcontents	Verification Letter from the Oral Examination Committee; Acknowledgements; 摘要 (Abstract in Chinese); Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction; Chapter 2 Background; 2.1 Graphics Processing Unit; 2.1.1 GPU Architecture; 2.1.2 Compute Unified Device Architecture; 2.1.3 Memory Hierarchy; 2.1.4 Atomic Operations; 2.1.5 Synchronization; 2.2 Graph; 2.2.1 Compressed Sparse Row; 2.3 Breadth-First Search; Chapter 3 Related Works; 3.1 Parallel Breadth-First Search; 3.2 Advanced Related Works; 3.2.1 First BFS on GPU; 3.2.2 Hierarchical Frontier Queue; 3.2.3 Virtual Warp-centric BFS; 3.2.4 Prefix-sum on Frontier Queue; 3.2.5 Enterprise; 3.2.6 Gunrock; 3.2.7 Specialized Concurrent Queue; 3.2.8 XBFS; Chapter 4 Challenges; 4.1 Centralized Frontier Queue; 4.2 Scalability Issues; 4.2.1 Atomic Operations; 4.2.2 Queue Size; 4.2.3 Scalability Concerns; Chapter 5 Methodology; 5.1 Decentralized Frontier Queue; 5.2 Real-time Queue Draining; 5.3 Implementation; 5.3.1 Execution Flow; 5.3.2 SA Scanning; 5.3.3 Two-level Neighbor Visiting; Chapter 6 Evaluation; 6.1 Experimental Configuration; 6.2 Scalability; 6.3 Performance; 6.4 Sub-queue Capacity; Chapter 7 Conclusion; References	-
dc.language.iso	en	-
dc.subject	平行計算	zh_TW
dc.subject	圖形處理器	zh_TW
dc.subject	廣度優先搜尋	zh_TW
dc.subject	GPU	en
dc.subject	parallel computing	en
dc.subject	breadth-first search	en
dc.title	去中心化前沿佇列提升廣度優先搜尋在圖形處理器上的可擴展性	zh_TW
dc.title	Improving the Scalability of Breadth-First Search on GPUs via Frontier Queue Decentralization	en
dc.type	Thesis	-
dc.date.schoolyear	111-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	雷欽隆;顏嗣鈞;陳英一;林振緯	zh_TW
dc.contributor.oralexamcommittee	Chin-Laung Lei;Hsu-chun Yen;Ing-Yi Chen;Jenn-Wei Lin	en
dc.subject.keyword	圖形處理器,平行計算,廣度優先搜尋	zh_TW
dc.subject.keyword	GPU,parallel computing,breadth-first search	en
dc.relation.page	46	-
dc.identifier.doi	10.6342/NTU202302373	-
dc.rights.note	Not authorized	-
dc.date.accepted	2023-08-04	-
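dc.contributor.author-college	College of Electrical Engineering and Computer Science	-
dc.contributor.author-dept	Graduate Institute of Electronics Engineering	-
Appears in Collections:	Graduate Institute of Electronics Engineering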

Files in This Item:

File	Size	Format
ntu-111-2.pdf (not authorized for public access)	926.58 kB	Adobe PDF

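Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.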