Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81868

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 洪士灝(Shih-Hao Hung) | |
| dc.contributor.author | Yu-Jen Huang | en |
| dc.contributor.author | 黃昱仁 | zh_TW |
| dc.date.accessioned | 2022-11-25T03:05:25Z | - |
| dc.date.available | 2026-09-01 | |
| dc.date.copyright | 2021-11-12 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-09-02 | |
| dc.identifier.citation | [1] Official documentation: https://developer.nvidia.com/blog/gpudirect-storage/. [2] Official documentation: https://docs.python.org/3/library/multiprocessing.html. [3] Official GitHub discussion: https://github.com/NVIDIA/DALI/issues/2588#issuecomment-756101353. [4] Official GitHub issue replies: https://github.com/NVIDIA/DALI/issues/2255#issuecomment-758511816. [5] Official GitHub reference: https://github.com/NVIDIA/apex. [6] Official release blog: https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-accelerated-storage-technology/. [7] PyTorch official documentation: https://pytorch.org/docs/stable/cuda.html#memory-management. [8] PyTorch official documentation: https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading. [9] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org. [10] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. Werneck Krauss Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8):1301–1309, Aug. 2019. [11] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning, 2014. [12] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky. NVIDIA A100 tensor core GPU: Performance and innovation. IEEE Micro, 41(02):29–35, Mar. 2021. [13] D. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. pages 1237–1242, 07 2011. [14] T. Gale, P. Tredak, S. Layton, A. Ivanov, and S. Panev. Official NVIDIA DALI GitHub. https://github.com/nvidia/dali. [15] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, and J. H. Saltz. Patch-based convolutional neural network for whole slide tissue image classification, 2016. [16] J. D. Ianni, R. E. Soans, S. Sankarapandian, R. V. Chamarthi, D. Ayyagari, T. G. Olsen, M. J. Bonham, C. C. Stavish, K. Motaparthi, C. J. Cockerell, T. A. Feeser, and J. B. Lee. Tailored for Real-World: A Whole Slide Image Classification System Validated on Uncurated Multi-Site Data Emulating the Prospective Pathology Workload. Scientific Reports, 10(1):3217, Dec. 2020. [17] M. James, M. Tom, P. Groeneveld, and V. Kibardin. ISPD 2020 physical mapping of neural networks on a wafer-scale deep learning accelerator. In Proceedings of the 2020 International Symposium on Physical Design, ISPD '20, pages 145–149, New York, NY, USA, 2020. Association for Computing Machinery. [18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. [19] S. Kirk, Y. Lee, P. Kumar, J. Filippini, B. Albertina, M. Watson, K. Rieger-Christ, and J. Lemmerman. Radiology Data from The Cancer Genome Atlas Lung Squamous Cell Carcinoma [TCGA-LUSC] collection, 2016. Type: dataset. [20] H. Mikami, H. Suganuma, P. Uchupala, Y. Tanaka, and Y. Kageyama. ImageNet/ResNet-50 training in 224 seconds. 11 2018. [21] J. Mohan, A. Phanishayee, A. Raniwala, and V. Chidambaram. Analyzing and mitigating data stalls in DNN training. CoRR, abs/2007.06775, 2020. [22] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queue, 6(2):40–53, Mar. 2008. [23] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis. CuPy: A NumPy-compatible library for NVIDIA GPU calculations. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017. [24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. [25] H. Pinckaers, W. Bulten, J. van der Laak, and G. Litjens. Detection of prostate cancer in whole-slide images through end-to-end training with image-level labels, 2020. [26] H. Pinckaers, B. van Ginneken, and G. Litjens. Streaming convolutional neural networks for end-to-end learning with multi-megapixel images. arXiv e-prints, page arXiv:1911.04432, Nov. 2019. [27] M. Satyanarayanan, A. Goode, B. Gilbert, J. Harkes, and D. Jukic. OpenSlide: A vendor-neutral software foundation for digital pathology. Journal of Pathology Informatics, 4(1):27, 2013. [28] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Yamazaki Vincent. Chainer: A deep learning framework for accelerating the research cycle. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2002–2011. ACM, 2019. [29] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4):600–612, Apr. 2004. [30] C. Yang and G. Cong. Accelerating data loading in deep neural network training. CoRR, abs/1910.01196, 2019. [31] Y. You, Z. Zhang, C. Hsieh, and J. Demmel. 100-epoch ImageNet training with AlexNet in 24 minutes. CoRR, abs/1709.05011, 2017. [32] Q. Zhang, Z. Han, F. Yang, Y. Zhang, Z. Liu, M. Yang, and L. Zhou. Retiarii: A deep learning exploratory-training framework. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 919–936. USENIX Association, Nov. 2020. [33] M. Zolnouri, X. Li, and V. P. Nia. Importance of Data Loading Pipeline in Training Deep Neural Networks. arXiv:2005.02130 [cs], Apr. 2020. arXiv: 2005.02130. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81868 | - |
| dc.description.abstract | "In recent years, high-resolution raw medical slide images have become increasingly popular in deep learning. On one hand, such high-resolution images provide a considerable level of detail, allowing trained models to reach very good accuracy; on the other hand, they supply an abundant amount of training data. However, high-resolution images cause poor training performance, because the CPU spends a large amount of time on image augmentation. We tried replacing the CPU with the GPU for this work, but this creates another difficulty: rotating a high-resolution image requires a very large amount of memory, a major pain point for GPUs, whose memory is scarce. For this reason, such workloads could not previously be entrusted to the GPU with confidence. To address these difficulties, we propose a rotation algorithm that is both very fast and memory-efficient. The core idea is to split the original large image into many small tiles and operate on those tiles, such that the result is identical to operating directly on the large image. In our experiments, when rotating an image of size (40000, 40000, 3), we saved 90% of the memory and achieved a 60x speedup compared with the CPU-based baseline." | zh_TW |
| dc.description.provenance | Made available in DSpace on 2022-11-25T03:05:25Z (GMT). No. of bitstreams: 1 U0001-0306202117542500.pdf: 6238888 bytes, checksum: 1dad183b99cc41d880b80486bae0dfc3 (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | Acknowledgements i; Abstract (Chinese) ii; Abstract iii; 1 Introduction 1; 2 Background and Related Work 4; 2.1 Training in Whole-Slide on the GPU 4; 2.2 Image Augmentation on the GPU 5; 3 Methodology 7; 3.1 Tile-based Augmentation on the GPU 7; 3.2 Implementation Details with Python and C++ 9; 3.2.1 The Python Version 10; 3.2.2 The C++ Version 12; 3.3 Tile-Based Rotation Algorithm 13; 3.4 Keeping the GPU Busy 15; 4 Experimental Results 19; 4.1 Experimental Setup 19; 4.2 Implementation Methods 20; 4.3 Latency Comparison 22; 4.4 GPU Memory Usage Comparison 24; 4.5 Image Correctness 26; 5 Conclusion and Future Work 30; Appendices 30; Bibliography 31 | |
| dc.language.iso | en | |
| dc.subject | 效能調校 | zh_TW |
| dc.subject | 串流小圖 | zh_TW |
| dc.subject | 圖形處理器平行計算 | zh_TW |
| dc.subject | Performance Tuning | en |
| dc.subject | Streaming-Tile | en |
| dc.subject | GPU Parallel Programming | en |
| dc.title | 記憶體節約之超高解析度圖形旋轉演算法及其效能優化 | zh_TW |
| dc.title | Memory-Saving Streaming Tile Rotation Algorithm on Large Scale Medical Image | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | 碩士 (Master's) | |
| dc.contributor.oralexamcommittee | 施吉昇(Hsin-Tsai Liu),梁文耀(Chih-Yang Tseng),張原豪,葉肇元 | |
| dc.subject.keyword | 串流小圖,圖形處理器平行計算,效能調校, | zh_TW |
| dc.subject.keyword | Streaming-Tile,GPU Parallel Programming,Performance Tuning, | en |
| dc.relation.page | 34 | |
| dc.identifier.doi | 10.6342/NTU202100950 | |
| dc.rights.note | Authorization granted (open access worldwide) | |
| dc.date.accepted | 2021-09-06 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| dc.date.embargo-lift | 2026-09-01 | - |
| Appears in Collections: | Department of Computer Science and Information Engineering | |
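The abstract describes a tile-based rotation whose per-tile results assemble into exactly the same image as rotating the whole image at once. The thesis targets arbitrary-angle rotation on the GPU; as a minimal illustration of the tiling idea, here is a CPU sketch for the exact 90-degree case (the function name, NumPy usage, and 256-pixel default tile size are our assumptions, not taken from the thesis):

```python
import numpy as np

def rotate90_tiled(image, tile=256):
    """Hypothetical sketch: rotate an (H, W, C) image 90 degrees
    counter-clockwise tile by tile, so only one small tile needs to be
    processed (e.g. resident on a memory-scarce GPU) at any moment.
    The assembled output equals np.rot90(image) exactly."""
    h, w = image.shape[:2]
    out = np.empty((w, h) + image.shape[2:], dtype=image.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            block = image[y:y + tile, x:x + tile]  # edge tiles may be smaller
            bh, bw = block.shape[:2]
            # A 90-degree CCW rotation maps source pixel (y+dy, x+dx) to
            # output pixel (w-1-(x+dx), y+dy), so the rotated tile lands
            # in this contiguous output window:
            out[w - x - bw:w - x, y:y + bh] = np.rot90(block)
    return out
```

In a streaming GPU version, each tile would be uploaded, rotated, and written back in turn, bounding peak device memory to one tile rather than the full (40000, 40000, 3) image; this is only a sketch of the tiling principle, not the thesis's implementation.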
Files in This Item:
| File | Size | Format |
|---|---|---|
| U0001-0306202117542500.pdf (available online after 2026-09-01) | 6.09 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
