Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92037

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳沛遠 | zh_TW |
| dc.contributor.advisor | Pei-Yuan Wu | en |
| dc.contributor.author | 賴彥儒 | zh_TW |
| dc.contributor.author | Yen-Ru Lai | en |
| dc.date.accessioned | 2024-03-04T16:13:33Z | - |
| dc.date.available | 2024-03-05 | - |
| dc.date.copyright | 2024-03-04 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-02-08 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92037 | - |
| dc.description.abstract | 離線強化學習方法為從一個固定的數據集中學習策略,但通常需要大量的有標籤數據。由於有標籤數據往往需要人工進行標注,有標籤的數據集通常非常昂貴。相反,無標籤的數據往往成本較低。這種情況凸顯了在離線強化學習中找到有效使用無標籤數據的重要性。在本文中,我們提出了一種利用無標籤數據的離線強化學習方法,並給出了理論保證。我們提出了在再生核希爾伯特空間(RKHS)中的各種特徵值衰減條件,這些條件確定了該算法的複雜性。總的來說,我們的工作提供了一種利用無標籤數據優勢的離線強化學習方法,同時保持理論保證。 | zh_TW |
| dc.description.abstract | Offline reinforcement learning (RL) learns policies from a fixed dataset but often requires large amounts of data. The challenge arises when labeled datasets are expensive, especially when rewards must be provided by human labelers for large datasets. In contrast, unlabeled data tends to be less expensive. This situation highlights the importance of finding effective ways to use unlabeled data in offline RL, especially when labeled data is limited or costly to obtain. In this thesis, we present an algorithm that exploits unlabeled data in offline RL with kernel function approximation and provide a theoretical guarantee. We present various eigenvalue decay conditions of the kernel, which determine the complexity of the algorithm. In summary, our work offers a promising approach for exploiting the advantages of unlabeled data in offline RL while maintaining theoretical guarantees. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-03-04T16:13:33Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-03-04T16:13:33Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Tables ix
Denotation xi
Chapter 1 Introduction 1
Chapter 2 Related Works 5
Chapter 3 Background 9
3.1 Episodic Markov Decision Process 9
3.2 Assumption of Offline Data 11
3.3 Reproducing Kernel Hilbert Space 11
3.4 Pessimistic Value Iteration and Kernel Setting 14
Chapter 4 Unsupervised Data Sharing 15
4.1 Pessimistic Reward Estimation 15
4.2 Theoretical Analysis 19
Chapter 5 Conclusion 25
References 27
Appendix A — Pessimistic Value Iteration 33
Appendix B — Proof of Main Result 37
B.1 Proof of Proposition 4.1 37
B.2 Proof of Theorem 4.4 40
B.3 Proof of Proposition 4.6 55
B.4 Proof of Corollary 4.9 55
Appendix C — Sufficient Lemma 61 | - |
| dc.language.iso | en | - |
| dc.subject | 離線強化學習 | zh_TW |
| dc.subject | 資料分享 | zh_TW |
| dc.subject | 函數逼近 | zh_TW |
| dc.subject | 誤差分析 | zh_TW |
| dc.subject | Offline Reinforcement Learning | en |
| dc.subject | Regret Analysis | en |
| dc.subject | Function Approximation | en |
| dc.subject | Data Sharing | en |
| dc.title | 利用核函數逼近在離線強化學習中的未標記數據共享 | zh_TW |
| dc.title | Leveraging Unlabeled Data Sharing through Kernel Function Approximation in Offline Reinforcement Learning | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-1 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 林軒田;陳子立 | zh_TW |
| dc.contributor.oralexamcommittee | Hsuan-Tien Lin;Tzu-Li Chen | en |
| dc.subject.keyword | 離線強化學習, 資料分享, 函數逼近, 誤差分析 | zh_TW |
| dc.subject.keyword | Offline Reinforcement Learning, Data Sharing, Function Approximation, Regret Analysis | en |
| dc.relation.page | 64 | - |
| dc.identifier.doi | 10.6342/NTU202400542 | - |
| dc.rights.note | Not authorized | - |
| dc.date.accepted | 2024-02-14 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Graduate Institute of Communication Engineering | - |
Appears in Collections: Graduate Institute of Communication Engineering
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-112-1.pdf (restricted access) | 470.45 kB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.