Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48832
Full metadata record (DC field: value [language])
dc.contributor.advisor: 許永真 (Yung-Jen Hsu)
dc.contributor.author: Wei-Lun Luo [en]
dc.contributor.author: 羅偉倫 [zh_TW]
dc.date.accessioned: 2021-06-15T11:09:58Z
dc.date.available: 2020-08-21
dc.date.copyright: 2020-08-21
dc.date.issued: 2020
dc.date.submitted: 2020-08-17
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48832
dc.description.abstract: 最佳化交易執行在整個金融交易的流程中是一個針對如何執行交易訊號的重要課題,已有其他研究證實它能夠劇烈的影響交易策略的獲利能力。而在近幾年當中,由於電子交易所的興起,有很多研究採用強化學習來做最佳化交易執行,並證實其表現比傳統金融方式還要好。然而,這些方法並沒有完整的考量風險與獲利間的平衡,使得訓練完的機器只追求獲利。這樣的狀況會導致我們對機器的表現有錯誤的判斷標準,並喪失交易執行策略的多樣性。因此,在這篇論文當中,我們提出了兩種以風險為基礎的獎勵設計來解決以上兩個問題。第一種做法是將原本的獎勵對市場波動度做正規化,其結果也證明了這種做法能透過給予機器較真實的回饋來提昇整體策略的獲利能力以及穩定度,而這種做法同時可以應用在其他使用強化學習的金融交易上。我們的另一種獎勵設計是針對風險,使用交易單的執行比率來取代標準差,這種做法會使得獎勵較為緊密,對機器來說較好訓練,另外,與之搭配的是一個由多目標馬可夫決策過程組成的框架,可以讓策略同時考量獲利與風險。在這樣的設計下,結果顯示我們的做法能夠對風險跟獲利間的平衡做出更好的詮釋。整體上來說,有了這兩種方法,我們可以先訓練出一個更好策略,再針對這個策略做出分化,使得交易員能夠針對不同的投資者以及商品做出更彈性的交易執行。 [zh_TW]
dc.description.abstract: Optimal trading execution, the problem of how trading signals are carried out in a financial trading pipeline, has been shown to strongly influence the profitability of a trading strategy. In recent years, with the rise of electronic exchanges, studies have applied data-driven methods such as reinforcement learning (RL) to this problem and achieved better performance than traditional financial methods. However, these methods do not fully account for the trade-off between risk and return, so the trained RL agent pursues profit alone. This leads to a misleading measure of agent performance and a lack of diversity among execution strategies. In this thesis, we propose two risk-based reward-shaping methods to address these problems. The first normalizes the reward by market volatility; our results show that this more realistic feedback makes the agent more profitable and more robust, and the idea also applies to other RL-based financial trading tasks. The second replaces the standard-deviation-based risk term with the executed inventory ratio, a dense reward that is easier to learn from, and combines it with a multi-objective Markov decision process (MOMDP) framework that considers profit and risk jointly. Under this design, our results exhibit a better interpretation of the risk-return trade-off than previous work. Overall, these two methods first yield a better-performing RL agent and then diversify the execution strategies through the risk reward and the MOMDP framework, giving traders a flexible way to handle trading signals for different investors and financial assets. [en] (A minimal sketch of these two reward-shaping ideas follows the metadata record below.)
dc.description.provenance: Made available in DSpace on 2021-06-15T11:09:58Z (GMT). No. of bitstreams: 1; U0001-1308202010323300.pdf: 9463285 bytes, checksum: 904c9d446ce4fb57779e56459abd87f0 (MD5). Previous issue date: 2020 [en]
dc.description.tableofcontents:
Oral examination committee approval certificate iii
Acknowledgements v
Abstract (Chinese) vii
Abstract ix
1 Introduction 1
2 Preliminaries 5
2.1 Order book 5
2.1.1 Order types 6
2.1.2 Characteristic 7
2.2 Match Engine 7
2.2.1 Interface 8
2.2.2 Rules 8
2.2.3 Limitations 9
2.3 Risk 9
3 Related Work 11
3.1 Financial Methods 11
3.2 Data-driven Methods 12
3.2.1 Reinforcement Learning 12
3.2.2 Deep Reinforcement Learning 12
3.2.3 Risk-sensitive Reinforcement Learning 12
4 Methodology 15
4.1 Environment 15
4.1.1 Discretization of time horizon 16
4.2 MDP Formulation 17
4.2.1 State 17
4.2.2 Action 19
4.2.3 Reward 20
4.2.4 Algorithm 22
4.3 Multi-objective MDP formulation 25
4.3.1 Design of the reward 25
4.3.2 Algorithm 28
5 Results 31
5.1 Dataset 31
5.2 Experiment Procedure 32
5.3 Evaluation Metrics 32
5.4 Baselines 34
5.5 DQN results 35
5.5.1 The original training procedure 35
5.5.2 The revised training procedure 36
5.5.3 Results compared with reward shaping 37
5.6 Results for different risk appetites 40
5.6.1 TD error punishment methods 40
5.6.2 MO-DQN 43
6 Conclusion 47
6.1 Summary of contributions 47
6.2 Future Work 48
Bibliography 51
dc.language.iso: en
dc.subject: 強化學習 [zh_TW]
dc.subject: 多目標馬可夫決策過程 [zh_TW]
dc.subject: 最優交易成本 [zh_TW]
dc.subject: 最佳化交易執行 [zh_TW]
dc.subject: 獎勵設計 [zh_TW]
dc.subject: Limit order placement [en]
dc.subject: Multi-objective MDP [en]
dc.subject: Reward shaping [en]
dc.subject: Reinforcement Learning [en]
dc.subject: Optimal trading execution [en]
dc.title: 以風險設計強化學習之獎勵並應用於最佳化交易執行 [zh_TW]
dc.title: Risk-based Reward Shaping Reinforcement Learning for Optimal Trading Execution [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 呂育道 (Yuh-Dauh Lyuu), 張智星 (Jyh-Shing Jang), 吳毅成 (I-Chen Wu), 王釧茹 (Chuan-Ju Wang)
dc.subject.keyword: 最優交易成本, 最佳化交易執行, 強化學習, 獎勵設計, 多目標馬可夫決策過程 [zh_TW]
dc.subject.keyword: Limit order placement, Optimal trading execution, Reinforcement Learning, Reward shaping, Multi-objective MDP [en]
dc.relation.page: 54
dc.identifier.doi: 10.6342/NTU202003207
dc.rights.note: 有償授權 (paid authorization required)
dc.date.accepted: 2020-08-18
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
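
The abstract above describes two risk-based reward-shaping schemes: normalizing the profit reward by market volatility, and a dense risk reward based on the executed inventory ratio, combined with the profit objective through a multi-objective MDP. The following is a minimal illustrative sketch in Python of those ideas, not the thesis implementation; the function names, the per-step inputs, and the linear scalarization of the two objectives are assumptions made here for illustration.

def volatility_normalized_reward(raw_reward: float, volatility: float, eps: float = 1e-8) -> float:
    """Scheme 1 (assumed form): regularize the per-step profit reward by market
    volatility, so that gains earned in a calm market count for more than the
    same gains in a volatile one."""
    return raw_reward / (volatility + eps)

def inventory_ratio_risk_reward(executed_qty: float, total_qty: float) -> float:
    """Scheme 2 (assumed form): a dense risk signal, the executed inventory ratio
    (fraction of the parent order already filled), used in place of a
    standard-deviation risk term."""
    return executed_qty / total_qty

def scalarized_momdp_reward(profit_r: float, risk_r: float, w_profit: float, w_risk: float) -> float:
    """Multi-objective view: treat (profit, risk) as a reward vector and collapse
    it with a preference weight; linear scalarization is an assumption of this
    sketch, not necessarily the thesis's MOMDP algorithm."""
    return w_profit * profit_r + w_risk * risk_r

# Usage example with hypothetical per-step values; a risk-averse preference puts
# more weight on the execution-ratio objective.
profit = volatility_normalized_reward(raw_reward=0.8, volatility=0.02)
risk = inventory_ratio_risk_reward(executed_qty=600, total_qty=1000)
combined = scalarized_momdp_reward(profit, risk, w_profit=0.3, w_risk=0.7)

Sweeping the preference weights is one way such a framework can yield execution strategies for different risk appetites (cf. Section 5.6 in the table of contents); the sketch only shows the per-step reward computation.
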
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
U0001-1308202010323300.pdf (9.24 MB, Adobe PDF): 未授權公開取用 (not authorized for public access)

