Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90166
Title: | Non-stationary Extreme Bandits: An Optimization Case Study |
Authors: | Po-Ju Wu |
Advisor: | Tian-Li Yu |
Keyword: | multi-armed bandits, order statistics, non-stationary, extreme value, reinforcement learning, machine learning, real-valued optimization, Monte-Carlo tree search, hyperparameter optimization, covariance matrix adaptation evolution strategy |
Publication Year : | 2023 |
Degree: | Master's |
Abstract: | In engineering, scientific, and financial domains, we frequently face the challenge of making decisions under limited resources. This challenge can be framed as the multi-armed bandit problem, a fundamental model in reinforcement learning. Existing research focuses primarily on optimizing the expected reward in stationary scenarios, yet many real-world problems are non-stationary, or call for maximizing the extreme (highest attainable) reward rather than the expected reward. To address this, we developed an algorithm based on order statistics, paired with adaptive distribution models, to optimize resource allocation toward extreme rewards in non-stationary environments. We applied the algorithm to three problems: real-valued optimization, Monte-Carlo tree search, and hyperparameter optimization for deep learning models. We also compared it with classical multi-armed bandit algorithms. In real-valued optimization, using the covariance matrix adaptation evolution strategy (CMA-ES) as the optimizer, our algorithm showed an advantage on the CEC2005 multimodal benchmark functions under limited sampling budgets. For Monte-Carlo tree search, we designed a reward function and ran experiments to assess the algorithm's robustness across different scenarios; it showed a statistically significant advantage when the reward function's value range was unconstrained. For hyperparameter optimization, we devised a framework that incorporates a state-of-the-art architecture and, on the computer vision training problems we tested, achieved higher test-set accuracy than the original version. |
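To make the extreme-bandit setting concrete, here is a minimal illustrative sketch, not the thesis's order-statistics algorithm: a max-value variant of UCB that allocates pulls to the arm whose best observed reward plus an exploration bonus is largest, which is the kind of baseline such work compares against. The function names and the toy arms are assumptions for illustration only.

```python
import math
import random

def max_bandit(pull, n_arms, budget, c=1.0):
    """Max-UCB-style heuristic for extreme bandits: each round, pick the
    arm maximizing (best reward seen so far + exploration bonus).
    `pull(a)` returns one stochastic reward for arm `a`."""
    best = [float("-inf")] * n_arms   # best reward observed per arm
    counts = [0] * n_arms             # number of pulls per arm
    overall_best = float("-inf")
    for t in range(budget):
        if t < n_arms:                # initialization: pull every arm once
            a = t
        else:                         # exploit max, explore rarely-pulled arms
            a = max(range(n_arms),
                    key=lambda i: best[i] + c * math.sqrt(math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        best[a] = max(best[a], r)
        overall_best = max(overall_best, r)
    return overall_best

# Toy usage: arm 1 has a heavier upper tail, so its extremes dominate
# even though arm 0 has the higher mean.
random.seed(0)
arms = [lambda: random.gauss(0.5, 0.1), lambda: random.gauss(0.3, 0.5)]
print(max_bandit(lambda a: arms[a](), n_arms=2, budget=200))
```

Note the contrast with classical UCB: the index uses the per-arm maximum rather than the empirical mean, which is why heavy-tailed (high-variance) arms win under an extreme-value objective.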
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90166 |
DOI: | 10.6342/NTU202303980 |
Fulltext Rights: | Authorized (open access worldwide) |
Appears in Collections: | Department of Electrical Engineering |
Files in This Item:
File | Size | Format
---|---|---
ntu-111-2.pdf | 8.72 MB | Adobe PDF