Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90166
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 于天立 | zh_TW |
dc.contributor.advisor | Tian-Li Yu | en |
dc.contributor.author | 吳柏儒 | zh_TW |
dc.contributor.author | Po-Ju Wu | en |
dc.date.accessioned | 2023-09-22T17:41:28Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-09-22 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-14 | - |
dc.identifier.citation | Joannes Vermorel and Mehryar Mohri. “Multi-armed bandit algorithms and empirical evaluation”. In: European Conference on Machine Learning. Springer, 2005, pp. 437–448.
Sattar Vakili and Qing Zhao. “Mean-variance and value at risk in multi-armed bandit problems”. In: 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2015, pp. 1330–1335.
Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. “Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems”. In: Journal of Machine Learning Research 7.6 (2006).
Eric M. Schwartz, Eric T. Bradlow, and Peter S. Fader. “Customer acquisition via display advertising using multi-armed bandit experiments”. In: Marketing Science 36.4 (2017), pp. 500–522.
Sofía S. Villar, Jack Bowden, and James Wason. “Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges”. In: Statistical Science: A Review Journal of the Institute of Mathematical Statistics 30.2 (2015), p. 199.
Nícollas Silva, Heitor Werneck, Thiago Silva, Adriano C. M. Pereira, and Leonardo Rocha. “Multi-armed bandits in recommendation systems: A survey of the state-of-the-art and future directions”. In: Expert Systems with Applications 197 (2022), p. 116669.
Nicolò Cesa-Bianchi and Paul Fischer. “Finite-time regret bounds for the multiarmed bandit problem”. In: ICML. Vol. 98. Citeseer, 1998, pp. 100–108.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem”. In: Machine Learning 47 (2002), pp. 235–256.
Volodymyr Kuleshov and Doina Precup. “Algorithms for multi-armed bandit problems”. In: arXiv preprint arXiv:1402.6028 (2014).
Peter Auer. “Using confidence bounds for exploitation-exploration trade-offs”. In: Journal of Machine Learning Research 3.Nov (2002), pp. 397–422.
Olivier Chapelle and Lihong Li. “An empirical evaluation of Thompson sampling”. In: Advances in Neural Information Processing Systems 24 (2011).
Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. “Contextual bandits with linear payoff functions”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011, pp. 208–214.
Aleksandrs Slivkins. “Contextual bandits with similarity information”. In: Proceedings of the 24th Annual Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2011, pp. 679–702.
Omar Besbes, Yonatan Gur, and Assaf Zeevi. “Stochastic multi-armed-bandit problem with non-stationary rewards”. In: Advances in Neural Information Processing Systems 27 (2014).
Robin Allesiardo, Raphaël Féraud, and Odalric-Ambrym Maillard. “The non-stationary stochastic multi-armed bandit problem”. In: International Journal of Data Science and Analytics 3 (2017), pp. 267–283.
John Langford and Tong Zhang. “The epoch-greedy algorithm for multi-armed bandits with side information”. In: Advances in Neural Information Processing Systems 20 (2007).
Shie Mannor and Ohad Shamir. “From bandits to experts: On the value of side-observations”. In: Advances in Neural Information Processing Systems 24 (2011).
Nicolò Cesa-Bianchi and Gábor Lugosi. “Combinatorial bandits”. In: Journal of Computer and System Sciences 78.5 (2012), pp. 1404–1422.
Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. “Learning diverse rankings with multi-armed bandits”. In: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 784–791.
Arnab Maiti, Vishakha Patil, and Arindam Khan. “Multi-armed bandits with bounded arm-memory: Near-optimal guarantees for best-arm identification and regret minimization”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 19553–19565.
Xiaoguang Huo and Feng Fu. “Risk-aware multi-armed bandit problem with application to portfolio selection”. In: Royal Society Open Science 4.11 (2017), p. 171377.
Afef Ben Hadj Alaya-Feki, Eric Moulines, and Alain LeCornec. “Dynamic spectrum access with non-stationary multi-armed bandit”. In: 2008 IEEE 9th Workshop on Signal Processing Advances in Wireless Communications. IEEE, 2008, pp. 416–420.
Alexandra Carpentier and Michal Valko. “Extreme bandits”. In: Advances in Neural Information Processing Systems 27 (2014).
Vincent A. Cicirello and Stephen F. Smith. “The max k-armed bandit: A new model of exploration applied to search heuristic selection”. In: Proceedings of the Twentieth National Conference on Artificial Intelligence. Vol. 3, 2005, pp. 1355–1361.
Sujay Bhatt, Ping Li, and Gennady Samorodnitsky. “Extreme bandits using robust statistics”. In: IEEE Transactions on Information Theory 69.3 (2022), pp. 1761–1776.
Dorian Baudry, Yoan Russac, and Emilie Kaufmann. “Efficient algorithms for extreme bandits”. In: arXiv preprint arXiv:2203.10883 (2022).
Ponnuthurai N. Suganthan, Nikolaus Hansen, Jing J. Liang, Kalyanmoy Deb, Ying-Ping Chen, Anne Auger, and Santosh Tiwari. “Problem definitions and evaluation criteria for the CEC 2005 special session on real-parameter optimization”. In: KanGAL Report 2005005 (2005).
Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. “Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES)”. In: Evolutionary Computation 11.1 (2003), pp. 1–18.
Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. “A survey of Monte Carlo tree search methods”. In: IEEE Transactions on Computational Intelligence and AI in Games 4.1 (2012), pp. 1–43.
Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. “Hyperband: A novel bandit-based approach to hyperparameter optimization”. In: The Journal of Machine Learning Research 18.1 (2017), pp. 6765–6816.
Stefan Falkner, Aaron Klein, and Frank Hutter. “BOHB: Robust and efficient hyperparameter optimization at scale”. In: International Conference on Machine Learning. PMLR, 2018, pp. 1437–1446.
Levente Kocsis and Csaba Szepesvári. “Bandit based Monte-Carlo planning”. In: European Conference on Machine Learning. Springer, 2006, pp. 282–293.
Aurélien Garivier and Eric Moulines. “On upper-confidence bound policies for switching bandit problems”. In: International Conference on Algorithmic Learning Theory. Springer, 2011, pp. 174–188.
J. P. Royston. “Algorithm AS 177: Expected normal order statistics (exact and approximate)”. In: Journal of the Royal Statistical Society. Series C (Applied Statistics) 31.2 (1982), pp. 161–165.
Christian L. Müller, Benedikt Baumgartner, and Ivo F. Sbalzarini. “Particle swarm CMA evolution strategy for the optimization of multi-funnel landscapes”. In: 2009 IEEE Congress on Evolutionary Computation, 2009, pp. 2685–2692. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90166 | - |
dc.description.abstract | 在工程、科學和財務領域,我們經常遇到在有限的資源下做出選擇的問題,這種問題可歸結為多臂吃角子老虎機問題,也是強化學習的基礎。現有的研究主要集中在穩態場景的期望報酬優化,但實際上許多問題並非穩態,或是追求極值報酬而非期望報酬。我們開發了一種以順序統計量為基礎的演算法,並配合自適應分佈模型,以優化在非穩態場景下追求極值之資源分配。我們將此演算法應用於實數數值優化、蒙地卡羅樹搜尋以及深度學習模型超參數優化等三種問題,並與目前經典的多臂吃角子老虎機演算法做比較。在實數數值優化中,我們使用自適應共變異數矩陣演化策略作為優化器,在CEC2005的多模態基準函數上做測試,在有限的抽樣預算下我們的演算法具有優勢。在蒙地卡羅樹搜尋中,我們設計了獎勵函數,透過實驗測試演算法在不同場景下之穩健性,在獎勵函數值域不受限制的場景下我們的演算法具有統計上的優勢。而在超參數優化問題上,我們設計了一個架構結合目前之最新架構,並且在我們測試的電腦視覺訓練問題上能取得較原版更高之測試集準確度。 | zh_TW |
dc.description.abstract | In engineering, scientific, and financial domains, we frequently encounter the challenge of making decisions with limited resources. This challenge can be framed as the multi-armed bandit problem, a fundamental concept in reinforcement learning. Existing research primarily focuses on optimizing the expected reward in stationary scenarios. However, many real-world problems exhibit non-stationarity or prioritize attaining the highest possible reward rather than the expected reward. To address this, we developed an algorithm that leverages order statistics and adaptive distribution models to optimize resource allocation in pursuit of the highest possible reward in non-stationary environments. We applied our algorithm to three problems: real-valued optimization, Monte-Carlo tree search, and hyperparameter optimization for deep learning models, and compared it with classical multi-armed bandit algorithms.
In real-valued optimization, we used the covariance matrix adaptation evolution strategy (CMA-ES) as the optimizer and tested our algorithm on the multimodal benchmark functions from CEC2005, where it showed advantages under limited sampling budgets. For Monte-Carlo tree search, we designed a reward function and conducted experiments to assess the robustness of our algorithm in various scenarios; it demonstrated a statistical advantage in scenarios where the reward function's value range was unconstrained. For hyperparameter optimization, we devised a framework that incorporates state-of-the-art architectures and, on the computer vision training problems we tested, achieved higher test-set accuracy than the original version. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T17:41:28Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-09-22T17:41:28Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements i
Chinese Abstract ii
Abstract iii
1 Introduction 1
1.1 Motivation 3
1.1.1 Non-stationary Reward Distributions 3
1.1.2 Optimization Objectives 3
1.1.3 Unknown Reward Range 4
1.2 Thesis Objective 4
1.3 Roadmap 5
2 Overview 6
2.1 MAB Algorithms 6
2.1.1 ε-greedy Algorithm 6
2.1.2 Upper Confidence Bound (UCB) 8
2.1.3 UCB-Normal 8
2.1.4 Thompson Sampling (TS) 9
2.1.5 Sliding-window UCB (SW-UCB) 10
2.2 Extreme Bandit Algorithms 11
2.2.1 Quantile of Maxima (QoMax) 12
2.2.2 MaxMedian 13
3 StatMax 15
3.1 Overview 15
3.2 Adaptive Surrogate Distribution 17
3.3 Order Statistics 17
3.4 Budget Aware 18
4 Case Studies 19
4.1 Real-valued Optimization 19
4.1.1 Benchmark Functions 20
4.1.2 Definition of Benchmark Functions 21
4.1.3 Optimizer 23
4.1.4 Integration with StatMax 25
4.1.5 Experiments 27
4.2 Monte Carlo Tree Search 30
4.2.1 Benchmark Functions 31
4.2.2 Integration with StatMax 31
4.2.3 Experiments 34
4.3 Hyperparameter Optimization 38
4.3.1 Integration with StatMax 39
4.3.2 Experiments 40
5 Conclusion 42
Bibliography 43
A Learning curve of CEC2005 benchmark functions 46 | - |
dc.language.iso | en | - |
dc.title | 非穩態下之極值多臂吃角子老虎機於數值優化領域之探討 | zh_TW |
dc.title | Non-stationary Extreme Bandits: An Optimization Case Study | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 王勝德;陳穎平 | zh_TW |
dc.contributor.oralexamcommittee | Sheng-De Wang;Ying-Ping Chen | en |
dc.subject.keyword | 多臂吃角子老虎機,順序統計量,非穩態,極值,強化學習,機器學習,實數優化,蒙地卡羅樹,超參數優化,自適應共變異數矩陣演化策略 | zh_TW |
dc.subject.keyword | multi-armed bandits,order statistics,non-stationary,extreme value,reinforcement learning,machine learning,real-valued optimization,Monte-Carlo tree search,hyperparameter optimization,covariance matrix adaptation evolution strategy | en |
dc.relation.page | 52 | - |
dc.identifier.doi | 10.6342/NTU202303980 | - |
dc.rights.note | Authorized (open access worldwide) | - |
dc.date.accepted | 2023-08-14 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Electrical Engineering | - |
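As a rough, self-contained illustration of the extreme-bandit objective described in the abstract above (maximizing the single highest reward observed within a fixed sampling budget, rather than the cumulative reward targeted by classical bandit policies), the sketch below uses hypothetical Gaussian arms and a simple ε-greedy rule on the running maximum. All arm parameters and policy settings here are illustrative assumptions; this is not the thesis's StatMax algorithm, nor any of the cited baselines such as UCB, QoMax, or MaxMedian.

```python
# Toy extreme-bandit sketch (illustrative only; not StatMax).
# Goal: maximize the best single reward seen within the budget.
import numpy as np

rng = np.random.default_rng(0)

def pull(arm):
    # Hypothetical arms: identical means, different tail widths.
    scales = [0.5, 1.0, 2.0]
    return rng.normal(loc=0.0, scale=scales[arm])

def max_greedy(budget, n_arms=3, explore_per_arm=10, eps=0.1):
    """Round-robin exploration, then mostly replay the arm whose
    empirical maximum is largest (epsilon-greedy on the running max)."""
    history = [[] for _ in range(n_arms)]
    best = -np.inf
    for t in range(budget):
        if t < explore_per_arm * n_arms:
            arm = t % n_arms                    # forced exploration phase
        elif rng.random() < eps:
            arm = int(rng.integers(n_arms))     # occasional re-exploration
        else:
            arm = int(np.argmax([max(h) for h in history]))
        reward = pull(arm)
        history[arm].append(reward)
        best = max(best, reward)
    return best

print("best reward found:", max_greedy(budget=500))
```

In this toy setting the heaviest-tailed arm is the one worth replaying, which is the intuition behind extreme-bandit policies; per the abstract, the thesis's approach additionally models the arms' reward distributions adaptively and compares them via order statistics.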
Appears in Collections: | Department of Electrical Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf | 8.72 MB | Adobe PDF | View/Open |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.