Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99788
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 李佳翰 | zh_TW
dc.contributor.advisor | Jia-Han Li | en
dc.contributor.author | 邱昱翰 | zh_TW
dc.contributor.author | Yu-Han Chiu | en
dc.date.accessioned | 2025-09-17T16:41:08Z | -
dc.date.available | 2025-09-18 | -
dc.date.copyright | 2025-09-17 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-07-14 | -
dc.identifier.citation | [1] Advanced Micro Devices, Inc. AMD Instinct MI300A APU Datasheet, 2023. Accessed: 2025-04-28.
[2] U. Agrawal, P. Etingov, and R. Huang. Advanced performance metrics and their application to the sensitivity analysis for model validation and calibration. IEEE Transactions on Power Systems, 36(5):4503–4512, 2021.
[3] A. Ahmad, M. Kermanshah, K. Leahy, Z. Serlin, H. C. Siu, M. Mann, C.-I. Vasile, R. Tron, and C. Belta. Accelerating proximal policy optimization learning using task prediction for solving games with delayed rewards. arXiv preprint arXiv:2411.17861, 2024.
[4] X. Chu, D. Hofstätter, S. Ilager, S. Talluri, D. Kampert, D. Podareanu, D. Duplyakin, I. Brandic, and A. Iosup. Generic and ML workloads in an HPC datacenter: Node energy, job failures, and node-job analysis. In 2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS), pages 710–719. IEEE, 2024.
[5] Conan7882. GoogLeNet-Inception: TensorFlow implementation of GoogLeNet. https://github.com/conan7882/GoogLeNet-Inception, 2021. Accessed: 2025-05-28.
[6] R. Curtis, T. Shedd, and E. B. Clark. Performance comparison of five data center server thermal management technologies. In 2023 39th Semiconductor Thermal Measurement, Modeling & Management Symposium (SEMI-THERM), pages 1–9. IEEE, 2023.
[7] D. Donnellan and A. Lawrence. Annual outage analysis 2024: The causes and impacts of IT and data center outages (executive summary). Technical Report 131, Uptime Institute Intelligence, New York, NY, Mar. 2024. Executive Summary.
[8] D. Donnellan, A. Lawrence, D. Bizo, P. Judge, J. O'Brien, J. Davis, M. Smolaks, J. Williams-George, and R. Weinschenk. Uptime Institute global data center survey 2024. Technical Report 146M, Uptime Institute Intelligence, New York, NY, July 2024. Keynote Report.
[9] A. Heimerson, J. Sjölund, R. Brännvall, J. Gustafsson, and J. Eker. Adaptive control of data center cooling using deep reinforcement learning. In 2022 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C), pages 1–6. IEEE, 2022.
[10] D. A. Kamakshi, M. Fojtik, B. Khailany, S. Kudva, Y. Zhou, and B. H. Calhoun. Modeling and analysis of power supply noise tolerance with fine-grained GALS adaptive clocks. In 2016 22nd IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 75–82. IEEE, 2016.
[11] S. Khadirsharbiyani, J. Kotra, K. Rao, and M. Kandemir. Data convection: A GPU-driven case study for thermal-aware data placement in 3D DRAMs. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(1):1–25, 2022.
[12] A. Krzywaniak, P. Czarnul, and J. Proficz. Dynamic GPU power capping with online performance tracing for energy-efficient GPU computing using DEPO tool. Future Generation Computer Systems, 145:396–414, 2023.
[13] D. Lee, S. Koo, I. Jang, and J. Kim. Comparison of deep reinforcement learning and PID controllers for automatic cold shutdown operation. Energies, 15(8):2834, 2022.
[14] J. Luo, C. Paduraru, O. Voicu, Y. Chervonyi, S. Munns, J. Li, C. Qian, P. Dutta, J. Q. Davis, N. Wu, et al. Controlling commercial cooling systems using reinforcement learning. arXiv preprint arXiv:2211.07357, 2022.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[16] J. Moreno-Valenzuela. A class of proportional-integral with anti-windup controllers for DC–DC buck power converters with saturating input. IEEE Transactions on Circuits and Systems II: Express Briefs, 67(1):157–161, 2019.
[17] NVIDIA Corporation. Operating temperature range on Jetson Nano (forum post). https://forums.developer.nvidia.com/t/operating-temperature-range-on-jetson-nano/73555/6, 2019. Accessed 20 Jun 2025.
[18] NVIDIA Corporation. NVAPI reference documentation: GPU thermal control interface. https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__gputhermal.html, 2024. Accessed 20 Jun 2025.
[19] NVIDIA Corporation. NVIDIA H100 Tensor Core GPU Datasheet, 2024. Accessed: 2025-04-28.
[20] NVIDIA Corporation. TensorRT Best Practices Guide: GPU power consumption, power throttling, GPU temperature and thermal throttling. https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html#gpu-temperature-and-thermal-throttling, 2025. Version 10.11.0, last updated 2025-03-30; accessed 2025-05-16.
[21] G. Ostrouchov, D. Maxwell, R. A. Ashraf, C. Engelmann, M. Shankar, and J. H. Rogers. GPU lifetimes on Titan supercomputer: Survival analysis and reliability. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14, 2020.
[22] A. Overgaard, B. K. Nielsen, C. S. Kallesøe, and J. D. Bendtsen. Reinforcement learning for mixing loop control with flow variable eligibility trace. In 2019 IEEE Conference on Control Technology and Applications (CCTA), pages 1043–1048. IEEE, 2019.
[23] T. Patki, Z. Frye, H. Bhatia, F. Di Natale, J. Glosli, H. Ingolfsson, and B. Rountree. Comparing GPU power and frequency capping: A case study with the MuMMI workflow. In 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pages 31–39. IEEE, 2019.
[24] B. Ramakrishnan, C. Turner, H. Alissa, D. Trieu, F. Rivera, L. Melton, M. Rao, S. Chigullapalli, T. Getachew, V. Prodanovic, et al. Understanding the impact of data center liquid cooling on energy and performance of machine learning and artificial intelligence workloads. Journal of Electronic Packaging, 147(2):021003, 2025.
[25] A. Silvestri, D. Coraci, S. Brandi, A. Capozzoli, E. Borkowski, J. Köhler, D. Wu, M. N. Zeilinger, and A. Schlueter. Real building implementation of a deep reinforcement learning controller to enhance energy efficiency and indoor temperature control. Applied Energy, 368:123447, 2024.
[26] H. Wang, Y. Ye, J. Zhang, and B. Xu. A comparative study of 13 deep reinforcement learning based energy management methods for a hybrid electric vehicle. Energy, 266:126497, 2023.
[27] R. Wang, X. Zhang, X. Zhou, Y. Wen, and R. Tan. Toward physics-guided safe deep reinforcement learning for green data center cooling control. In 2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS), pages 159–169. IEEE, 2022.
[28] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, Oct. 2020. Association for Computational Linguistics.
[29] S. Xu, Y. Fu, Y. Wang, Z. Yang, C. Huang, Z. O'Neill, Z. Wang, and Q. Zhu. Efficient and assured reinforcement learning-based building HVAC control with heterogeneous expert-guided training. Scientific Reports, 15(1):7677, 2025.
[30] Z. Yang, K. Adamek, and W. Armour. Part-time power measurements: nvidia-smi's lack of attention. arXiv preprint arXiv:2312.02741, 2023.
[31] Z. Yang, K. Merrick, L. Jin, and H. A. Abbass. Hierarchical deep reinforcement learning for continuous action control. IEEE Transactions on Neural Networks and Learning Systems, 29(11):5174–5184, 2018.
[32] D. Zhao, S. Samsi, J. McDonald, B. Li, D. Bestor, M. Jones, D. Tiwari, and V. Gadepally. Sustainable supercomputing for AI: GPU power capping at HPC scale. In Proceedings of the 2023 ACM Symposium on Cloud Computing, pages 588–596, 2023.
[33] Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99788 | -
dc.description.abstract | 隨著資料中心 GPU 運算密度急遽提升,冷卻系統若發生瞬時失效,晶片溫度易於數秒內突破風險門檻,導致韌體降速與硬體壽命衰減。本研究針對「秒級功率–溫度動態」提出一套 GPU 動態功率限制框架,核心以 Proximal Policy Optimization (PPO) 強化學習取代傳統 Anti-windup PID,並優化觀測與動作設計以提升安全性與泛化能力。方法上:(1)觀測空間僅保留溫度旗標、功率上限、即時功率及上一步功率增量四項特徵,排除易致策略僵化的風扇轉速;(2)動作空間以連續 Power-Limit 微幅調節 (±50 W、兩秒一更新) 為唯一致動;(3)獎勵函數採「超溫懲罰+功率獎勵」分層加權,確保安全溫度優先;(4)用 Generalized Advantage Estimation 緩解功率–溫度回饋延遲。
實驗於單張 RTX 4070 SUPER(100–275 W)平台進行。首先,在風扇交錯配置的驗證中,移除風扇特徵後的 RL 策略能依實際溫度即時上、下調功率,成功避免「轉速—功率」硬式對映造成的泛化失效。其次,在散熱裝置失效場景,RL 將溫度拉回目標帶的上升時間、安定時間分別較 anti-windup PID 縮短 79 % 與 60 %,功率波動亦由 ±20 W 降至 ±10 W;在散熱恢復場景,RL 僅 30 s 即將 Power-Limit 恢復至上限,較 PID 再縮短 23 %。進一步於 LLaMA-7B LoRA 微調、電腦視覺 ResNet CIFAR-100 及 Seq2Seq 翻譯三項高負載工作實測,策略皆能在 98–100 % GPU 使用率下維持核心溫度於 85 °C 保護門檻以下,並保持功率穩定。
綜合結果顯示,本研究提出之 RL 架構在面對散熱失效與極端環境時,兼具更快的溫度抑制速度與更平滑的功率輸出,相較傳統 PID 可顯著降低超溫風險與功率雜訊,並具跨任務、跨框架的部署彈性。此成果為資料中心提供一條不依賴額外硬體的軟體式 GPU 熱—功率協同控制解決方案,對提升高密度 AI 訓練伺服器的韌性與能源效率具實務價值。
zh_TW
dc.description.abstract | As GPU compute density in data centers continues to climb, a sudden failure of the cooling system can push chip temperatures beyond safe limits within seconds, triggering firmware throttling and accelerating hardware degradation. This study proposes a real-time GPU dynamic power-limiting framework tailored to "second-scale power-temperature dynamics." The framework replaces a conventional anti-windup PID controller with a reinforcement-learning (RL) agent based on Proximal Policy Optimization (PPO) and refines both the observation and action spaces to maximize safety and generalization capability.
Methodologically, we (1) confine the observation space to four highly relevant features (temperature flag, power limit, instantaneous power draw, and previous power-limit increment) while deliberately excluding fan-speed signals that cause policy overfitting; (2) define the action space as a single continuous adjustment of the power limit (±50 W every 2 s); (3) design a reward function that combines high-priority over-temperature penalties with secondary power rewards to enforce a "safety-first" policy; and (4) apply Generalized Advantage Estimation to mitigate the multi-second delay between power changes and thermal feedback.
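The four-feature observation, bounded power-limit action, and layered reward described above can be sketched roughly as follows. This is an illustrative sketch, not the thesis implementation: the abstract specifies only the structure (the safety term dominates) and the hardware bounds (100–275 W range, ±50 W steps, 85 °C threshold), so the penalty weight and the power-reward scaling below are assumed values.

```python
def observation(temp_c, power_limit_w, power_draw_w, prev_delta_w, t_max=85.0):
    """Four-feature observation: over-temperature flag, current power limit,
    instantaneous power draw, and the previous power-limit increment.
    Fan speed is deliberately excluded."""
    return (float(temp_c > t_max), power_limit_w, power_draw_w, prev_delta_w)

def clip_action(power_limit_w, delta_w, lo=100.0, hi=275.0, max_step=50.0):
    """Single continuous action: a power-limit adjustment bounded to ±50 W,
    with the result kept inside the card's 100-275 W range."""
    delta_w = max(-max_step, min(max_step, delta_w))
    return max(lo, min(hi, power_limit_w + delta_w))

def reward(temp_c, power_limit_w, t_max=85.0, p_max=275.0, penalty=10.0):
    """Layered reward: an over-temperature penalty that dominates a secondary
    reward for keeping the power limit high. `penalty` is an assumed weight."""
    if temp_c > t_max:
        return -penalty * (temp_c - t_max)  # safety term takes priority
    return power_limit_w / p_max            # secondary power reward in [0, 1]
```

With this shape, any over-temperature step is strictly worse than any safe step, which is one simple way to realize the "safety-first" weighting the abstract describes.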
Experiments are conducted on a single NVIDIA RTX 4070 SUPER (adjustable from 100 W to 275 W). In tests with staggered fan configurations, the RL policy, trained without fan-speed inputs, adaptively raises or lowers power based solely on actual temperature, avoiding the hard-coded "fan speed → power" mapping that undermines generalization. Under a cooling-failure scenario, the RL controller shortens rise time and settling time by 79 % and 60 %, respectively, compared with the anti-windup PID, while halving power oscillations from ±20 W to ±10 W. When cooling is restored, the RL agent raises the power limit to its maximum in 30 s, 23 % faster than PID, without any temperature violations. Further validation on three high-load tasks (LLaMA-7B LoRA fine-tuning, ResNet CIFAR-100 training, and Seq2Seq translation, spanning the PyTorch and TensorFlow frameworks) shows the policy maintains core temperature below the 85 °C firmware throttle threshold at 98–100 % GPU utilization with stable power output.
Overall, the proposed RL architecture delivers faster thermal suppression and smoother power control than a traditional PID across both cooling-failure and extreme-environment scenarios. It significantly lowers over-temperature risk and power noise while remaining task- and framework-agnostic, providing a purely software-based solution for GPU thermal-power co-management that enhances the resilience and energy efficiency of high-density AI training servers.
en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-17T16:41:08Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-09-17T16:41:08Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Acknowledgements ii
Abstract (in Chinese) iii
Abstract iv
Contents vi
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 GPU demand and thermal challenges 1
1.1.2 Model-level and unit-level variation and intra-rack temperature differences 2
1.2 Motivation 3
1.2.1 Cooling-failure risk and the need for proactive power limiting 3
1.2.2 Fan-failure experiment: observing built-in thermal protection behavior 4
1.3 Objectives 6
1.4 Thesis organization 6
Chapter 2 Literature Review 8
2.1 Comparison of control strategies 8
2.2 Observation space 9
2.3 Action space 10
2.3.1 Applicable scenarios for continuous and discrete action spaces 10
2.3.2 Power limiting and clock limiting 11
2.4 Reward function 12
2.5 Handling thermal-delay effects 13
Chapter 3 Methodology 14
3.1 Observation-space design 15
3.1.1 Selection of observation-space metrics 15
3.1.2 Sensor-data reliability and error estimation 16
3.2 Action-space design 16
3.2.1 Continuous versus discrete actions 16
3.2.2 Power limiting versus frequency limiting 17
3.3 Reward-function design 18
3.4 RL Training Details 19
3.4.1 Proximal Policy Optimization 19
3.4.2 Action–Temperature Delay and GAE 20
3.5 Conclusions on the GPU dynamic power-limiting experimental design 21
Chapter 4 Experimental Results and Discussion 23
4.1 Experimental environment 23
4.2 Evaluation criteria 23
4.3 Anti-windup PID baseline controller 25
4.4 Diagnostic experiments on policy rigidity induced by the fan-speed feature 28
4.5 Final RL evaluation after removing the fan-speed feature 29
4.5.1 Generalization validation 29
4.5.2 Performance evaluation of the final RL policy 31
4.6 Summary of performance metrics and discussion 35
Chapter 5 Conclusions and Future Work 38
5.1 Conclusions 38
5.2 Future work 39
5.3 Closing remarks 40
References 41
Appendix A: Building a trivariate polynomial approximation model of fan speed, core temperature, and power limit 46
A.1 Data collection 46
A.2 Cubic polynomial model construction 46
Appendix B: Reinforcement Learning vs. PID: Experimental Results on a Multi-GPU Server 49
B.1 Experimental environment 49
B.2 Server behavior under burn-in testing without protection 49
B.3 Scenario of Rapid Temperature Rise 50
B.4 Scenario of Fan Failure and Recovery under Burn Load 51
B.4.1 RL Controller 51
B.4.2 PID Controller 51
B.5 Performance of the RL Controller Across Diverse AI Workloads 52
B.5.1 Object Detection under PyTorch 53
B.5.2 Llama LoRA under PyTorch 54
B.5.3 Language Modeling under TensorFlow 55
dc.language.iso | zh_TW | -
dc.subject | 動態功率限制 | zh_TW
dc.subject | 強化學習 | zh_TW
dc.subject | 散熱失效韌性 | zh_TW
dc.subject | 顯示卡熱管理 | zh_TW
dc.subject | Dynamic Power Capping | en
dc.subject | Reinforcement Learning | en
dc.subject | Cooling Failure Resilience | en
dc.subject | GPU Thermal Management | en
dc.title | 基於強化學習實現動態功率限制之顯示卡熱保護機制 | zh_TW
dc.title | A Reinforcement Learning Based Dynamic Power Limiting Thermal Protection Mechanism for GPU | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 張瑞益;何昭慶;陳詩雯;林宏軒 | zh_TW
dc.contributor.oralexamcommittee | Ray-I Chang;Chao-Ching Ho;Shih-Wen Chen;Hung-Hsuan Lin | en
dc.subject.keyword | 強化學習,動態功率限制,顯示卡熱管理,散熱失效韌性 | zh_TW
dc.subject.keyword | Reinforcement Learning,Dynamic Power Capping,GPU Thermal Management,Cooling Failure Resilience | en
dc.relation.page | 56 | -
dc.identifier.doi | 10.6342/NTU202501824 | -
dc.rights.note | 同意授權(限校園內公開) (authorized; campus-only access) | -
dc.date.accepted | 2025-07-15 | -
dc.contributor.author-college | 工學院 (College of Engineering) | -
dc.contributor.author-dept | 工程科學及海洋工程學系 (Department of Engineering Science and Ocean Engineering) | -
dc.date.embargo-lift | 2030-07-14 | -
Appears in collections: 工程科學及海洋工程學系 (Department of Engineering Science and Ocean Engineering)

Files in this item:
File | Size | Format
ntu-113-2.pdf (restricted; not publicly available) | 5.86 MB | Adobe PDF