Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17368

Full metadata record (DC field: value [language])
dc.contributor.advisor: 林軒田 (Hsuan-Tien Lin)
dc.contributor.author: Ku-Chun Chou [en]
dc.contributor.author: 周谷駿 [zh_TW]
dc.date.accessioned: 2021-06-08T00:09:08Z
dc.date.copyright: 2013-08-23
dc.date.issued: 2013
dc.date.submitted: 2013-08-09
dc.identifier.citation: Bibliography
[1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[3] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249–2257, 2011.
[4] K.-C. Chou, C.-K. Chiang, H.-T. Lin, and C.-J. Lu. Pseudo-reward for linear contextual bandits. Submitted and under review, 2013.
[5] K.-C. Chou and H.-T. Lin. Balancing between estimated reward and uncertainty during news article recommendation for ICML 2012 Exploration and Exploitation Challenge. Technical report, 2012.
[6] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. Journal of Machine Learning Research - Proceedings Track, 15:208–214, 2011.
[7] C.-C. Wang, S. R. Kulkarni, and H. V. Poor. Bandit problems with side observations. Computing Research Repository, abs/cs/050, 2005.
[8] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. Journal of Machine Learning Research - Proceedings Track, 19:359–376, 2011.
[9] D. Glowacka, T. Ruotsalo, K. Konyushkova, K. Athukorala, S. Kaski, and G. Jacucci. Directing exploratory search: reinforcement learning from user interactions with keywords. In Proceedings of the 2013 International Conference on Intelligent User Interfaces, IUI '13, pages 117–128, New York, NY, USA, 2013. ACM.
[10] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[11] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.
[12] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670, 2010.
[13] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, pages 297–306, 2011.
[14] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[15] A. Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5):195–198, 1943.
[16] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.
[17] Y. Yue, S. A. Hong, and C. Guestrin. Hierarchical exploration for accelerating contextual bandits. In ICML, 2012.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17368
dc.description.abstract: In this thesis we study the contextual bandit problem that arises in online advertising and recommender systems on the Web. The Linear Upper Confidence Bound (LinUCB) algorithm is one of the leading methods for this problem: it learns from the environment's partial feedback via linear regression and updates its internal model accordingly. Owing to the nature of bandit problems, LinUCB can use only partial feedback as training data, which may make it slow to reach optimal performance. We propose improving LinUCB with pseudo-rewards, and find that a properly chosen pseudo-reward combined with a suitable forgetting mechanism quickly attains better performance while preserving LinUCB's theoretical guarantees. Moreover, the use of pseudo-rewards also lets us design an algorithm that makes predictions quickly; such fast prediction is valuable in many Web applications, because prediction speed is closely tied to page-loading speed. Finally, we use artificial data and a large real-world dataset provided by Yahoo! News to demonstrate the superiority of our proposed algorithms. [zh_TW]
dc.description.abstract: We study the contextual bandit problem that arises in many real-world applications such as advertising, recommendation, and other Web applications. One leading algorithm for contextual bandits is the linear upper confidence bound (LINUCB) approach, which is based on updating internal linear regression models with the partial feedback from the environment. Because it updates with only this partial feedback, LINUCB can be slow in converging to the optimal performance. In this work, we study techniques that improve LINUCB by updating the linear regression models with some additional feedback called the pseudo-reward. By choosing a proper pseudo-reward formula and implementing a forgetting mechanism to avoid being overly biased by the pseudo-rewards, we propose an improved algorithm that matches the regret guarantee of LINUCB in theory. Furthermore, we design a variant of the proposed algorithm that can be significantly more efficient than LINUCB during action selection, which directly implies faster response times in many applications. Extensive experimental results on both artificial data and the benchmark Yahoo! News recommendation data show that the proposed algorithm enjoys better performance than LINUCB and other contextual bandit algorithms. [en]
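The abstract describes the algorithm without giving its formulas, so a brief sketch may help orient readers. Below is a minimal Python sketch of the standard LinUCB baseline (reference [12] in the bibliography above), together with a hypothetical `pseudo_update` method illustrating the pseudo-reward-plus-forgetting idea; the method name, the `decay` constant, and its update formulas are illustrative assumptions, not the thesis's actual LINPRUCB rules, which are developed in Chapter 4 of the thesis.

```python
import numpy as np

class LinUCB:
    """Per-arm ridge regression with an upper-confidence-bound bonus
    (standard LinUCB, Li et al. 2010, reference [12])."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # A accumulates x x^T on top of an identity (ridge/Tikhonov) prior;
        # b accumulates reward-weighted context vectors.
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, contexts):
        """Pick the arm maximizing estimated reward plus exploration bonus.
        contexts: one feature vector per arm."""
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]  # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Partial feedback: only the chosen arm's model sees the true reward.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

    def pseudo_update(self, arm, x, pseudo_reward, decay=0.9):
        # Hypothetical sketch of the pseudo-reward idea from the abstract:
        # feed an estimated (not observed) reward to an arm that was not
        # played, and decay the accumulated statistics so that guessed
        # feedback cannot dominate the model (the "forgetting mechanism").
        # These formulas are assumptions, not the thesis's LINPRUCB update.
        self.A[arm] = decay * self.A[arm] + np.outer(x, x)
        self.b[arm] = decay * self.b[arm] + pseudo_reward * x
```

In each round one would call `select`, observe the reward of the chosen arm, call `update` for that arm, and, in the pseudo-reward variant, `pseudo_update` for some of the unchosen arms.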
dc.description.provenance: Made available in DSpace on 2021-06-08T00:09:08Z (GMT). No. of bitstreams: 1. ntu-102-R99922095-1.pdf: 510824 bytes, checksum: 7bec39883c8eefcfe4caa468b0d74c30 (MD5). Previous issue date: 2013 [en]
dc.description.tableofcontents:
i
中文摘要 (Chinese abstract) iii
Abstract v
1 Introduction 1
2 Problem Settings 5
3 Preliminaries and Related Works 7
  3.1 Baseline Algorithms 8
  3.2 LinUCB 9
4 Pseudo-reward Algorithms 13
  4.1 Introducing LINPRUCB 13
5 Experiments 17
  5.1 Simulations 17
  5.2 Real World Benchmark Data 19
6 Conclusions 23
Bibliography 25
dc.language.iso: en
dc.title: 以虛擬回饋演算法解決線性情境式拉霸問題 [zh_TW]
dc.title: Pseudo-reward Algorithms for Linear Contextual Bandit Problems [en]
dc.type: Thesis
dc.date.schoolyear: 101-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 林智仁 (Chih-Jen Lin), 李育杰 (Yuh-Jye Lee), 呂及人 (Chi-Jen Lu)
dc.subject.keyword: 情境式拉霸問題 (contextual bandit problem), 線性上信賴界 (linear upper confidence bound), 線性回歸 (linear regression) [zh_TW]
dc.subject.keyword: contextual bandit problem, linear upper confidence bound, linear regression [en]
dc.relation.page: 26
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2013-08-09
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
ntu-102-1.pdf, 498.85 kB, Adobe PDF (未授權公開取用: not authorized for public access)