Using LDA and BERTopic Models to Classify Financial News and Predict Stock Returns - Evidence from TSMC

葉浩霖; Hao-Lin Yeh

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94218

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	楊睿中	zh_TW
dc.contributor.advisor	Jui-Chung Yang	en
dc.contributor.author	葉浩霖	zh_TW
dc.contributor.author	Hao-Lin Yeh	en
dc.date.accessioned	2024-08-15T16:16:53Z	-
dc.date.available	2024-08-16	-
dc.date.copyright	2024-08-15	-
dc.date.issued	2024	-
dc.date.submitted	2024-08-05	-
dc.identifier.citation	Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin, “A Neural Probabilistic Language Model,” Journal of Machine Learning Research, March 2003, 3, 1137–1155. Blei, David M., Andrew Y. Ng, and Michael I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 2003, 3, 993–1022. Campello, Ricardo J. G. B., Davoud Moulavi, and Joerg Sander, “Density-Based Clustering Based on Hierarchical Density Estimates,” Advances in Knowledge Discovery and Data Mining, 2013, pp. 160–172. Chen, Kuan Chen, Chung I Lin, and Hong Ming Chen, “Relationship between News Sentiment Indicator and the Taiwan Weighted Stock Index,” Journal of Social Sciences and Philosophy, 2021, 33 (3), 383–423. Churchill, Rob and Lisa Singh, “The Evolution of Topic Modeling,” ACM Computing Surveys, 2022, 54 (10s), 1–35. Cowles, Alfred, “Can Stock Market Forecasters Forecast?,” Econometrica, 1933, 1 (3), 309–324. Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, 1990, 41 (6), 391–407. Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), 1977, 39 (1), 1–38. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv preprint arXiv:1810.04805, 2019. Dieng, Adji B., Francisco J. R. Ruiz, and David M. Blei, “Topic Modeling in Embedding Spaces,” Transactions of the Association for Computational Linguistics, 2020, 8, 439–453. Faccini, Renato, Rastin Matin, and George Skiadopoulos, “Dissecting Climate Risks: Are They Reflected in Stock Prices?,” Journal of Banking & Finance, 2023, 155, 106948.
Friedman, Jerome H., Trevor Hastie, and Rob Tibshirani, “Regularization Paths for Generalized Linear Models via Coordinate Descent,” Journal of Statistical Software, 2010, 33 (1), 1–22. Grootendorst, Maarten, “BERTopic: Neural Topic Modeling with a Class-based TF-IDF Procedure,” arXiv preprint arXiv:2203.05794, 2022. Gu, Shihao, Bryan Kelly, and Dacheng Xiu, “Empirical Asset Pricing via Machine Learning,” The Review of Financial Studies, February 2020, 33 (5), 2223–2273. Hochreiter, Sepp and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, November 1997, 9 (8), 1735–1780. Hofmann, Thomas, “Probabilistic Latent Semantic Indexing,” Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57. Jiang, Fuwei, Joshua Lee, Xiumin Martin, and Guofu Zhou, “Manager Sentiment and Stock Returns,” Journal of Financial Economics, 2019, 132 (1), 126–149. Jordan, Michael I., Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, “An Introduction to Variational Methods for Graphical Models,” Machine Learning, 1999, 37, 183–233. Kingma, Diederik P. and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2017. Ku, Lun‐Wei and Hsin‐Hsi Chen, “Mining Opinions from the Web: Beyond Relevance Retrieval,” Journal of the American Society for Information Science and Technology, October 2007, 58 (12), 1838–1850. Kullback, Solomon and Richard A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, 1951, 22 (1), 79–86. Li, Peng-Hsuan, Tsu-Jui Fu, and Wei-Yun Ma, “Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER,” arXiv preprint arXiv:1908.11046, 2020. Lin, Chenghua and Yulan He, “Joint Sentiment/Topic model for Sentiment Analysis,” Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 375–384. Loughran, Tim and Bill McDonald, “When Is a Liability Not a Liability? 
Textual Analysis, Dictionaries, and 10‐Ks,” Journal of Finance, February 2011, 66 (1), 35–65. McInnes, Leland, John Healy, and James Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv preprint arXiv:1802.03426, 2020. Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781, 2013. Prim, Robert C., “Shortest Connection Networks and Some Generalizations,” The Bell System Technical Journal, 1957, 36 (6), 1389–1401. Rahmadeyan, Akhas and Mustakim, “Long Short-Term Memory and Gated Recurrent Unit for Stock Price Prediction,” Procedia Computer Science, 2024, 234, 204–212. Seventh Information Systems International Conference (ISICO 2023). Reimers, Nils and Iryna Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks,” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, November 2019. Röder, Michael, Andreas Both, and Alexander Hinneburg, “Exploring the Space of Topic Coherence Measures,” WSDM 2015 - Proceedings of the 8th ACM International Conference on Web Search and Data Mining, February 2015, pp. 399–408. Sharpe, William F., “The Sharpe Ratio,” The Best of The Journal of Portfolio Management, 1998, pp. 169–178. Tang, Wenjin, Hui Bu, Yuan Zuo, and Junjie Wu, “Unlocking the Power of the Topic Content in News Headlines: BERTopic for Predicting Chinese Corporate Bond Defaults,” Finance Research Letters, 2024, 62, 105062. Tetlock, Paul C., “Giving Content to Investor Sentiment: The Role of Media in the Stock Market,” The Journal of Finance, 2007, 62 (3), 1139–1168. Tibshirani, Robert, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society. Series B (Methodological), 1996, 58 (1), 267–288. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. 
Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention Is All You Need,” arXiv preprint arXiv:1706.03762, 2017.	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94218	-
dc.description.abstract	本研究以「台積電」為關鍵字，選取2016年至2023年間自由時報電子報和中時新聞網的財經新聞，其中2016年至2021年之新聞為訓練資料，2022年至2023年為測試資料。經自動化篩選和清理後，使用潛在狄利克里分配（Latent Dirichlet Allocation，LDA）(Blei et al., 2003)和BERTopic(Grootendorst, 2022)模型進行新聞分類，並結合情緒分數，通過迴歸分析和最小絕對壓縮挑選運算子（Least Absolute Shrinkage and Selection Operator，LASSO）(Tibshirani, 1996)迴歸找出顯著主題類別，進而使用長短期記憶（Long Short-Term Memory，LSTM）(Hochreiter and Schmidhuber, 1997)模型訓練和預測台積電隔日股票一日報酬（收盤價與開盤價差距），並設計交易策略以評估不同方法的交易效果。研究結果顯示，使用BERTopic分類財經新聞，並通過迴歸和LASSO迴歸選取顯著主題後，再以LSTM進行訓練和預測，其交易結果最佳，投資報酬率達56%。本研究證明BERTopic能有效處理自動化篩選的新聞資料，並結合傳統迴歸和深度學習方法，成功應用於股票交易策略上；相比之下，雖然LDA能進行新聞分類，但無法用自動化篩選的新聞資料預測股票報酬的漲跌。	zh_TW
dc.description.abstract	This study uses "TSMC" as the keyword to select financial news published by the Liberty Times and China Times News Network from 2016 to 2023. News from 2016 to 2021 serves as training data, and news from 2022 to 2023 serves as test data. After automated screening and cleaning, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and BERTopic (Grootendorst, 2022) models are employed for news classification. The resulting topics are combined with sentiment scores, and significant topic categories are identified through regression analysis and Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996) regression. A Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) model is then trained to predict TSMC's next-day one-day stock return (the difference between the closing and opening prices). A trading strategy is designed to evaluate the performance of the different methods. The results show that classifying financial news with BERTopic, selecting significant topics through regression and LASSO regression, and then training and predicting with LSTM yields the best trading results, with a return on investment of 56%. This study demonstrates that BERTopic can effectively handle relatively coarse, automatically screened news data and, combined with traditional regression and deep learning methods, can be successfully applied to stock trading strategies. In contrast, while LDA can classify news, it cannot predict the rise and fall of stock returns from automatically screened news data.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-15T16:16:53Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2024-08-15T16:16:53Z (GMT). No. of bitstreams: 0	en
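dc.description.tableofcontents	Oral Examination Committee Certification i; Acknowledgements ii; Abstract (Chinese) iii; Abstract iv; Table of Contents vi; List of Figures ix; List of Tables x; Chapter 1 Introduction 1; Chapter 2 Literature Review 4; 2.1 Development of Topic Models 4; 2.2 Financial Market Research Using Text Mining Techniques 6; Chapter 3 Introduction to Topic Models: Latent Dirichlet Allocation (LDA) and BERTopic 8; 3.1 Latent Dirichlet Allocation (LDA) 8; 3.1.1 Joint Probability Distribution of Latent Dirichlet Allocation (LDA) 8; 3.1.2 Inferring the Posterior Distribution 11; 3.2 BERTopic 14; 3.2.1 SBERT (Sentence-BERT) 15; 3.2.2 Uniform Manifold Approximation and Projection (UMAP) 19; 3.2.3 Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) 23; 3.2.4 Class-Based Term Frequency–Inverse Document Frequency (c-TF-IDF) 25; Chapter 4 Research Methods 27; 4.1 Text Data Description and Processing 27; 4.1.1 News Screening Process 27; 4.1.2 Text Cleaning Process 28; 4.1.3 Word Segmentation 29; 4.1.4 Stopwords 29; 4.2 Sentiment Scores 30; 4.3 Data Splitting and Topic Model Parameter Settings 31; 4.4 Stock Data 31; 4.5 Empirical Model Specification 32; 4.6 Deep Learning Model: Long Short-Term Memory (LSTM) 34; 4.7 Investment Strategy 40; Chapter 5 Empirical Results 42; 5.1 Model Classification Results 42; 5.1.1 Latent Dirichlet Allocation (LDA) Model Classification Results 42; 5.1.2 BERTopic Model Classification Results 43; 5.2 Regression Results 43; 5.2.1 Models 1 and 2 43; 5.2.2 Models 3 and 4 44; 5.3 Prediction Results 47; 5.4 Investment Strategy Performance 49; Chapter 6 Results and Future Outlook 53; 6.1 Conclusions 53; 6.2 Future Outlook 54; References 55; Appendix A: Tables 60	-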
dc.language.iso	zh_TW	-
dc.title	使用 LDA 和 BERTopic 模型分類財經新聞並預測股票報酬－以台積電為例	zh_TW
dc.title	Using LDA and BERTopic Models to Classify Financial News and Predict Stock Returns - Evidence from TSMC	en
dc.type	Thesis	-
dc.date.schoolyear	112-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	陳由常;林昌平	zh_TW
dc.contributor.oralexamcommittee	Yu-Chang Chen;Chang-Ping Lin	en
dc.subject.keyword	BERTopic, 潛在狄利克里分配, 主題模型, 股票報酬預測, 台積電	zh_TW
dc.subject.keyword	BERTopic, LDA, topic model, stock return prediction, TSMC	en
dc.relation.page	67	-
dc.identifier.doi	10.6342/NTU202403036	-
dc.rights.note	Not authorized	-
dc.date.accepted	2024-08-07	-
dc.contributor.author-college	College of Social Sciences	-
dc.contributor.author-dept	Department of Economics	-
Appears in Collections:	Department of Economics

Files in This Item:

File	Size	Format
ntu-112-2.pdf (not currently authorized for public access)	4.16 MB	Adobe PDF

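Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.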