Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17368

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 林軒田(Hsuan-Tien Lin) | |
| dc.contributor.author | Ku-Chun Chou | en |
| dc.contributor.author | 周谷駿 | zh_TW |
| dc.date.accessioned | 2021-06-08T00:09:08Z | - |
| dc.date.copyright | 2013-08-23 | |
| dc.date.issued | 2013 | |
| dc.date.submitted | 2013-08-09 | |
| dc.identifier.citation | Bibliography
[1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[3] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249–2257, 2011.
[4] K.-C. Chou, C.-K. Chiang, H.-T. Lin, and C.-J. Lu. Pseudo-reward for linear contextual bandits. Submitted and under review, 2013.
[5] K.-C. Chou and H.-T. Lin. Balancing between estimated reward and uncertainty during news article recommendation for ICML 2012 Exploration and Exploitation Challenge. Technical report, 2012.
[6] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. Journal of Machine Learning Research - Proceedings Track, 15:208–214, 2011.
[7] C.-C. Wang, S. R. Kulkarni, and H. V. Poor. Bandit problems with side observations. Computing Research Repository, abs/cs/050, 2005.
[8] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. Journal of Machine Learning Research - Proceedings Track, 19:359–376, 2011.
[9] D. Glowacka, T. Ruotsalo, K. Konuyshkova, K. Athukorala, S. Kaski, and G. Jacucci. Directing exploratory search: reinforcement learning from user interactions with keywords. In Proceedings of the 2013 International Conference on Intelligent User Interfaces, IUI '13, pages 117–128, New York, NY, USA, 2013. ACM.
[10] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[11] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.
[12] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670, 2010.
[13] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, pages 297–306, 2011.
[14] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[15] A. Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5):195–198, 1943.
[16] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.
[17] Y. Yue, S. A. Hong, and C. Guestrin. Hierarchical exploration for accelerating contextual bandits. In ICML, 2012. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17368 | - |
| dc.description.abstract | 在此論文中我們研究網路線上廣告播放、推薦系統中衍生的情境式拉霸問題(Contextual Bandit Problem)。線性上信賴界(Linear Upper Confidence Bound)是目前解決這類問題領先的演算法之一,它使用線性回歸向環境的片面回饋學習並且更新其內部的模型。因為拉霸問題的特性,線性上信賴界只使用片面回饋作為學習的資料,但這導致它可能會很緩慢地達到最佳的效能。我們在此提出以虛擬回饋(Pseudo-reward)來改善線性上信賴界,並且發現只要使用適當的虛擬回饋以及搭配恰當的遺忘機制,就可以快速地達到較佳效能,並且同時兼顧線性上信賴界於理論上的保證。除此之外,因為使用了虛擬回饋,我們還可以設計出可以快速做出預測的演算法。這種快速的預測能力在許多網路應用都具有價值,因為預測速度和網頁載入速度息息相關。最後我們使用了人造資料以及雅虎新聞提供的巨型真實世界資料來驗證我們所提出的演算法之優越性。 | zh_TW |
| dc.description.abstract | We study the contextual bandit problem that arises in many real-world applications such as advertising, recommendations, and other Web applications. One leading algorithm for the contextual bandit problem is the linear upper confidence bound (LINUCB) approach, which is based on updating internal linear regression models with the partial feedback from the environment. Because it updates with only the partial feedback, LINUCB can be slow in converging to the optimal performance. In this work, we study techniques that improve LINUCB by updating the linear regression models with some additional feedback called the pseudo-reward. By choosing a proper pseudo-reward formula and implementing a forgetting mechanism to avoid being overly biased by the pseudo-rewards, we propose an improved algorithm that matches the regret guarantee of LINUCB in theory. Furthermore, we design a variant of the proposed algorithm that can be significantly more efficient than LINUCB during action selection, which directly implies faster response times in many applications. Extensive experimental results on both artificial data and the benchmark Yahoo! News recommendation data show that the proposed algorithm enjoys better performance than LINUCB and other contextual bandit algorithms. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-08T00:09:08Z (GMT). No. of bitstreams: 1 ntu-102-R99922095-1.pdf: 510824 bytes, checksum: 7bec39883c8eefcfe4caa468b0d74c30 (MD5) Previous issue date: 2013 | en |
| dc.description.tableofcontents | 中文摘要 iii
Abstract v
1 Introduction 1
2 Problem Settings 5
3 Preliminaries and Related Works 7
3.1 Baseline Algorithms 8
3.2 LinUCB 9
4 Pseudo-reward Algorithms 13
4.1 Introducing LINPRUCB 13
5 Experiments 17
5.1 Simulations 17
5.2 Real World Benchmark Data 19
6 Conclusions 23
Bibliography 25 | |
| dc.language.iso | en | |
| dc.title | 以虛擬回饋演算法解決線性情境式拉霸問題 | zh_TW |
| dc.title | Pseudo-reward Algorithms for Linear Contextual Bandit Problems | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 101-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 林智仁(Chih-Jen Lin),李育杰(Yuh-Jye Lee),呂及人(Chi-Jen Lu) | |
| dc.subject.keyword | 情境式拉霸問題,線性上信賴界,線性回歸, | zh_TW |
| dc.subject.keyword | contextual bandit problem,linear upper confidence bound,linear regression, | en |
| dc.relation.page | 26 | |
| dc.rights.note | 未授權 | |
| dc.date.accepted | 2013-08-09 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| Appears in Collections: | 資訊工程學系 | |
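The abstract in the record above describes LINUCB-style learning (per-arm ridge-regression estimates plus an upper-confidence-bound term) augmented with pseudo-rewards and a forgetting mechanism. As a rough, non-authoritative illustration of that idea only, the Python sketch below uses each unchosen arm's own current estimate as its pseudo-reward and a constant down-weighting factor as a stand-in for forgetting; these two choices, along with the names `LinUCBArm`, `play_round`, and `pseudo_weight`, are assumptions made for illustration and are not the thesis's actual LINPRUCB formulas.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression with an upper-confidence-bound score."""

    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # regularized covariance matrix X^T X + I
        self.b = np.zeros(d)      # response vector X^T y
        self.alpha = alpha        # exploration parameter

    def estimate(self, x):
        theta = np.linalg.solve(self.A, self.b)       # ridge estimate of the weights
        return float(x @ theta)

    def ucb(self, x):
        width = np.sqrt(float(x @ np.linalg.solve(self.A, x)))
        return self.estimate(x) + self.alpha * width  # estimated reward + confidence width

    def update(self, x, reward, weight=1.0):
        # weight < 1 down-weights pseudo-rewards so they can be "forgotten"
        self.A += weight * np.outer(x, x)
        self.b += weight * reward * x


def play_round(arms, contexts, env_reward, pseudo_weight=0.3):
    """Pick the arm with the highest UCB, learn from its real reward only, and
    feed down-weighted pseudo-rewards (here: each arm's own estimate) to the rest."""
    chosen = max(range(len(arms)), key=lambda i: arms[i].ucb(contexts[i]))
    arms[chosen].update(contexts[chosen], env_reward(chosen))   # real partial feedback
    for i, arm in enumerate(arms):
        if i != chosen:
            pseudo = arm.estimate(contexts[i])                  # placeholder pseudo-reward formula
            arm.update(contexts[i], pseudo, weight=pseudo_weight)
    return chosen
```

The sketch only conveys the general mechanism the abstract mentions: the chosen arm learns from the real partial feedback, while the remaining arms receive additional, weaker feedback so their models do not stay cold; the exact pseudo-reward formula and forgetting schedule used by LINPRUCB are specified in the thesis itself.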
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-102-1.pdf (restricted access) | 498.85 kB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
