Enabling Cross-Layer Designs for Ultra-Scale Unified Memory Systems (基於超大規模統一記憶體系統之跨層級設計)

Che-Wei Tsao; 曹哲維

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21660

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	郭大維(Tei-Wei Kuo)
dc.contributor.author	Che-Wei Tsao	en
dc.contributor.author	曹哲維	zh_TW
dc.date.accessioned	2021-06-08T03:41:32Z	-
dc.date.copyright	2019-07-10
dc.date.issued	2019
dc.date.submitted	2019-06-26
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21660	-
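dc.description.abstract	With the arrival of the big data era, the memory capacity that applications require is growing faster than DRAM can supply, and large DRAM also brings very high leakage current and hardware cost. To address this problem, heterogeneous memory (Unified Memory) has been proposed to replace pure-DRAM main memory in data-intensive applications. Heterogeneous memory typically consists of a small amount of high-speed DRAM and a large amount of low-cost non-volatile memory, among which flash memory is the most mature. However, flash memory has a much higher access latency than DRAM, so to keep main-memory access performance from being severely affected, DRAM is usually used as a cache for flash memory: recently and frequently used data is kept in DRAM, while less frequently used data is stored in flash memory. The main memory level must balance access performance against maintenance cost. We propose two cross-layer designs for CPU-based heterogeneous main memory: the first study focuses on how to optimize access performance and reduce management and maintenance cost inside the memory module, while the second uses information exchange between the operating system and the memory module to further reduce maintenance cost and optimize access performance. In addition, our third study is a heterogeneous main memory design for GPUs that uses scheduling information to move data in advance between flash memory and DRAM. A series of experiments shows that our designs effectively improve access performance while keeping maintenance cost in check.	zh_TW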
dc.description.abstract	In the big data era, data-intensive applications demand ever more DRAM main-memory capacity, but frequent DRAM refresh, high leakage power, and high unit cost make scaling up DRAM capacity a serious design problem. To address this issue, a unified memory system, which is a heterogeneous memory system, has become a possible alternative to DRAM as main memory in some data-intensive applications. A unified memory system that consists of a small, high-speed DRAM and a large, low-cost non-volatile memory (i.e., flash memory) suffers a serious performance penalty when accessing data stored in flash memory because of the large performance gap between DRAM and flash memory. However, there is limited room to adopt a complex caching algorithm for using DRAM as a cache for flash memory, because a complex caching algorithm would itself add too much overhead to every request to the unified memory system. In this thesis, we present two cross-layer designs that boost the performance of CPU-based unified memory systems by minimizing cache management overhead and reducing the frequency of flash memory accesses. The first design improves performance at the device level, and the second is a hardware and software co-design based on interaction between the OS and device layers. In addition, we present a cross-layer design for GPU-based unified memory systems that improves the performance of the GPU device. A series of experiments was conducted on popular benchmarks, and the results demonstrate that the proposed designs can effectively improve the performance of unified memory systems.	en
dc.description.provenance	Made available in DSpace on 2021-06-08T03:41:32Z (GMT). No. of bitstreams: 1 ntu-108-D02944011-1.pdf: 4287715 bytes, checksum: e8759b06a59e84628724270c18da655d (MD5) Previous issue date: 2019	en
dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements (in Chinese); Acknowledgements; Abstract (in Chinese); Abstract; 1 Introduction; 2 Background and Motivation; 2.1 Background and System Architecture; 2.2 Motivational Observations and Analyses; 2.2.1 Effective Access Time; 2.2.2 DRAM Cache Hit Ratio and Memory Access Behavior; 2.3 Research Motivation; 3 Distillation: A Light-Weight Data Separation Design to Boost Performance of NVDIMM Main Memory; 3.1 Design Overview; 3.2 Design Concept; 3.3 Design Details; 3.3.1 Counter-Based CLOCK & Pre-move Operation for Hot Area Management; 3.3.2 Flash-Aware LRU & Page Grouper for Cold Area Management; 3.3.3 Distillation Algorithm for Warm Area Management; 3.4 Performance Evaluation; 3.4.1 Experimental Setup; 3.4.2 Experimental Results; 3.5 Summary; 4 Cross-Layer Cache: Rethinking the Interactivity of OS and Device Layers in Memory Management; 4.1 Background and Motivation; 4.2 Design Overview; 4.3 Design Concept; 4.4 Design Details; 4.4.1 Keep List; 4.4.2 Adaptive Adjusting Lists; 4.4.3 Data Grouper; 4.5 Performance Evaluation; 4.5.1 Experimental Setup; 4.5.2 Experimental Results; 4.6 Summary; 5 Scheduling-aware Prefetching: Enabling the PCIe SSD to Extend the Global Memory of GPU Device; 5.1 Background and Motivation; 5.2 Scheduling-Aware Prefetching; 5.2.1 Design Overview; 5.2.2 Internal GPU device information: Warp Scheduling; 5.3 Performance Evaluation; 5.3.1 Experimental Setup; 5.3.2 Experimental Results; 5.4 Summary; 6 Conclusion; Bibliography
dc.language.iso	zh-TW
dc.title	基於超大規模統一記憶體系統之跨層級設計	zh_TW
dc.title	Enabling Cross-Layer Designs for Ultra-Scale Unified Memory Systems	en
dc.type	Thesis
dc.date.schoolyear	107-2
dc.description.degree	Doctoral
dc.contributor.coadvisor	張原豪(Yuan-Hao Chang)
dc.contributor.oralexamcommittee	曾煜棋(Yu-Chee Tseng),徐慰中(Wei-Chung Hsu),施吉昇(Chi-Sheng Shih),洪士灝(Shih-Hao Hung)
dc.subject.keyword	Flash Memory, Memory Module	zh_TW
dc.subject.keyword	Flash Memory, NVDIMM	en
dc.relation.page	75
dc.identifier.doi	10.6342/NTU201901049
dc.rights.note	Not authorized
dc.date.accepted	2019-06-27
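dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Networking and Multimedia	zh_TW
Appears in Collections:	Graduate Institute of Networking and Multimedia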

Files in this item:

File	Size	Format
ntu-108-1.pdf (not authorized for public access)	4.19 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.