Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48832
Full metadata record (DC field: value [language])
dc.contributor.advisor: 許永真 (Yung-Jen Hsu)
dc.contributor.author: Wei-Lun Luo [en]
dc.contributor.author: 羅偉倫 [zh_TW]
dc.date.accessioned: 2021-06-15T11:09:58Z
dc.date.available: 2020-08-21
dc.date.copyright: 2020-08-21
dc.date.issued: 2020
dc.date.submitted: 2020-08-17
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48832
dc.description.abstract: 最佳化交易執行在整個金融交易的流程中是一個針對如何執行交易訊號的重要課題,已有其他研究證實它能夠劇烈的影響交易策略的獲利能力。而在近幾年當中,由於電子交易所的興起,有很多研究採用強化學習來做最佳化交易執行,並證實其表現比傳統金融方式還要好。然而,這些方法並沒有完整的考量風險與獲利間的平衡,使得訓練完的機器只追求獲利。這樣的狀況會導致我們對機器的表現有錯誤的判斷標準,並喪失交易執行策略的多樣性。因此,在這篇論文當中,我們提出了兩種以風險為基礎的獎勵設計來解決以上兩個問題。第一種做法是將原本的獎勵對市場波動度做正規化,其結果也證明了這種做法能透過給予機器較真實的回饋來提昇整體策略的獲利能力以及穩定度,而這種做法同時可以應用在其他使用強化學習的金融交易上。我們的另一種獎勵設計是針對風險,使用交易單的執行比率來取代標準差,這種做法會使得獎勵較為緊密,對機器來說較好訓練,另外,與之搭配的是一個由多目標馬可夫決策過程組成的框架,可以讓策略同時考量獲利與風險。在這樣的設計下,結果顯示我們的做法能夠對風險跟獲利間的平衡做出更好的詮釋。整體上來說,有了這兩種方法,我們可以先訓練出一個更好策略,再針對這個策略做出分化,使得交易員能夠針對不同的投資者以及商品做出更彈性的交易執行。 [zh_TW]
dc.description.abstract: Optimal trading execution, the problem of how trading signals are carried out in a financial trading pipeline, has been shown to strongly influence the profitability of a trading strategy. In recent years, with the rise of electronic exchanges, studies have applied data-driven methods such as reinforcement learning (RL) to this problem and achieved better performance than traditional financial methods. However, these methods do not fully account for the trade-off between risk and return, so the trained RL agent pursues profit alone. This leads to a misleading measure of agent performance and a lack of diversity among execution strategies. In this thesis, we propose two risk-based reward-shaping methods to address these problems. The first normalizes the reward by market volatility; our results show that this more realistic feedback makes the agent more profitable and more robust, and the idea also applies to other RL-based financial trading tasks. The second replaces the standard-deviation-based risk term with the executed inventory ratio, a dense reward that is easier to learn from, and combines it with a multi-objective Markov decision process (MOMDP) framework that considers profit and risk jointly. Under this design, our results exhibit a better interpretation of the risk-return trade-off than previous work. Overall, these two methods first yield a better-performing RL agent and then diversify the execution strategies through the risk reward and the MOMDP framework, giving traders a flexible way to handle trading signals for different investors and financial assets. [en] (A minimal sketch of these two reward-shaping ideas follows the metadata record below.)
dc.description.provenance: Made available in DSpace on 2021-06-15T11:09:58Z (GMT). No. of bitstreams: 1; U0001-1308202010323300.pdf: 9463285 bytes, checksum: 904c9d446ce4fb57779e56459abd87f0 (MD5). Previous issue date: 2020 [en]
dc.description.tableofcontents:
Oral examination committee approval certificate iii
Acknowledgements v
Abstract (Chinese) vii
Abstract ix
1 Introduction 1
2 Preliminaries 5
2.1 Order book 5
2.1.1 Order types 6
2.1.2 Characteristic 7
2.2 Match Engine 7
2.2.1 Interface 8
2.2.2 Rules 8
2.2.3 Limitations 9
2.3 Risk 9
3 Related Work 11
3.1 Financial Methods 11
3.2 Data-driven Methods 12
3.2.1 Reinforcement Learning 12
3.2.2 Deep Reinforcement Learning 12
3.2.3 Risk-sensitive Reinforcement Learning 12
4 Methodology 15
4.1 Environment 15
4.1.1 Discretization of time horizon 16
4.2 MDP Formulation 17
4.2.1 State 17
4.2.2 Action 19
4.2.3 Reward 20
4.2.4 Algorithm 22
4.3 Multi-objective MDP formulation 25
4.3.1 Design of the reward 25
4.3.2 Algorithm 28
5 Results 31
5.1 Dataset 31
5.2 Experiment Procedure 32
5.3 Evaluation Metrics 32
5.4 Baselines 34
5.5 DQN results 35
5.5.1 The original training procedure 35
5.5.2 The revised training procedure 36
5.5.3 Results compared with reward shaping 37
5.6 Results for different risk appetites 40
5.6.1 TD error punishment methods 40
5.6.2 MO-DQN 43
6 Conclusion 47
6.1 Summary of contributions 47
6.2 Future Work 48
Bibliography 51
dc.language.iso: en
dc.subject: 強化學習 [zh_TW]
dc.subject: 多目標馬可夫決策過程 [zh_TW]
dc.subject: 最優交易成本 [zh_TW]
dc.subject: 最佳化交易執行 [zh_TW]
dc.subject: 獎勵設計 [zh_TW]
dc.subject: Limit order placement [en]
dc.subject: Multi-objective MDP [en]
dc.subject: Reward shaping [en]
dc.subject: Reinforcement Learning [en]
dc.subject: Optimal trading execution [en]
dc.title: 以風險設計強化學習之獎勵並應用於最佳化交易執行 [zh_TW]
dc.title: Risk-based Reward Shaping Reinforcement Learning for Optimal Trading Execution [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 呂育道 (Yuh-Dauh Lyuu), 張智星 (Jyh-Shing Jang), 吳毅成 (I-Chen Wu), 王釧茹 (Chuan-Ju Wang)
dc.subject.keyword: 最優交易成本, 最佳化交易執行, 強化學習, 獎勵設計, 多目標馬可夫決策過程 [zh_TW]
dc.subject.keyword: Limit order placement, Optimal trading execution, Reinforcement Learning, Reward shaping, Multi-objective MDP [en]
dc.relation.page: 54
dc.identifier.doi: 10.6342/NTU202003207
dc.rights.note: 有償授權 (paid authorization required)
dc.date.accepted: 2020-08-18
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
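
The abstract above describes two risk-based reward-shaping schemes: normalizing the profit reward by market volatility, and a dense risk reward based on the executed inventory ratio, combined with the profit objective through a multi-objective MDP. The following is a minimal illustrative sketch in Python of those ideas, not the thesis implementation; the function names, the per-step inputs, and the linear scalarization of the two objectives are assumptions made here for illustration.

def volatility_normalized_reward(raw_reward: float, volatility: float, eps: float = 1e-8) -> float:
    """Scheme 1 (assumed form): regularize the per-step profit reward by market
    volatility, so that gains earned in a calm market count for more than the
    same gains in a volatile one."""
    return raw_reward / (volatility + eps)

def inventory_ratio_risk_reward(executed_qty: float, total_qty: float) -> float:
    """Scheme 2 (assumed form): a dense risk signal, the executed inventory ratio
    (fraction of the parent order already filled), used in place of a
    standard-deviation risk term."""
    return executed_qty / total_qty

def scalarized_momdp_reward(profit_r: float, risk_r: float, w_profit: float, w_risk: float) -> float:
    """Multi-objective view: treat (profit, risk) as a reward vector and collapse
    it with a preference weight; linear scalarization is an assumption of this
    sketch, not necessarily the thesis's MOMDP algorithm."""
    return w_profit * profit_r + w_risk * risk_r

# Usage example with hypothetical per-step values; a risk-averse preference puts
# more weight on the execution-ratio objective.
profit = volatility_normalized_reward(raw_reward=0.8, volatility=0.02)
risk = inventory_ratio_risk_reward(executed_qty=600, total_qty=1000)
combined = scalarized_momdp_reward(profit, risk, w_profit=0.3, w_risk=0.7)

Sweeping the preference weights is one way such a framework can yield execution strategies for different risk appetites (cf. Section 5.6 in the table of contents); the sketch only shows the per-step reward computation.
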
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
U0001-1308202010323300.pdf (9.24 MB, Adobe PDF): 未授權公開取用 (not authorized for public access)

