Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/83920
Full metadata record
DC field: value [language]
dc.contributor.advisor: 盧信銘 (Hsin-Min Lu)
dc.contributor.author: Huan-Hsun Yen [en]
dc.contributor.author: 顏煥勳 [zh_TW]
dc.date.accessioned: 2023-03-19T21:23:34Z
dc.date.copyright: 2022-07-05
dc.date.issued: 2022
dc.date.submitted: 2022-07-02
dc.identifier.citation:
Akbik, A., Blythe, D., & Vollgraf, R. (2018). Contextual string embeddings for sequence labeling. Proceedings of the 27th International Conference on Computational Linguistics.
Basu, S., Ma, X., & Briscoe-Tran, H. (2022). Measuring multidimensional investment opportunity sets with 10-K text. The Accounting Review, 97(1), 51-73.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
Bonsall, S. B., Leone, A. J., Miller, B. P., & Rennekamp, K. (2017). A plain English measure of financial reporting readability. Journal of Accounting and Economics, 63(2-3), 329-357.
Brown, N. C., Crowley, R. M., & Elliott, W. B. (2020). What are you saying? Using topic to detect financial misreporting. Journal of Accounting Research, 58(1), 237-291.
Brown, S. V., & Tucker, J. W. (2011). Large-sample evidence on firms' year-over-year MD&A modifications. Journal of Accounting Research, 49(2), 309-346.
Campbell, J. L., Chen, H., Dhaliwal, D. S., Lu, H.-m., & Steele, L. B. (2014). The information content of mandatory risk factor disclosures in corporate filings. Review of Accounting Studies, 19(1), 396-455.
Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. Advances in Neural Information Processing Systems, 28.
Cohen, L., Malloy, C., & Nguyen, Q. (2020). Lazy prices. The Journal of Finance, 75(3), 1371-1415.
Core, J. E. (2001). A review of the empirical disclosure literature: discussion. Journal of Accounting and Economics, 31(1-3), 441-456.
Cornegruta, S., Bakewell, R., Withey, S., & Montana, G. (2016). Modelling radiological language with bidirectional long short-term memory networks. arXiv preprint arXiv:1609.08409.
Davis, A. K., & Tama-Sweet, I. (2012). Managers' use of language across alternative disclosure outlets: earnings press releases versus MD&A. Contemporary Accounting Research, 29(3), 804-837.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Dyer, T., Lang, M., & Stice-Lawrence, L. (2017). The evolution of 10-K textual disclosure: Evidence from Latent Dirichlet Allocation. Journal of Accounting and Economics, 64(2-3), 221-245.
Ertugrul, M., Lei, J., Qiu, J., & Wan, C. (2017). Annual report readability, tone ambiguity, and the cost of borrowing. Journal of Financial and Quantitative Analysis, 52(2), 811-836.
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6), 602-610.
Guay, W., Samuels, D., & Taylor, D. (2016). Guiding through the fog: Financial statement complexity and voluntary disclosure. Journal of Accounting and Economics, 62(2-3), 234-269.
Gunning, R. (1952). Technique of Clear Writing.
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Hope, O.-K., Hu, D., & Lu, H. (2016). The benefits of specific risk-factor disclosures. Review of Accounting Studies, 21(4), 1005-1045.
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
Lee, Y. J. (2012). The effect of quarterly report readability on information efficiency of stock prices. Contemporary Accounting Research, 29(4), 1137-1170.
Li, F. (2008). Annual report readability, current earnings, and earnings persistence. Journal of Accounting and Economics, 45(2-3), 221-247.
Li, F. (2010). The information content of forward-looking statements in corporate filings—A naïve Bayesian machine learning approach. Journal of Accounting Research, 48(5), 1049-1102.
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35-65.
Loughran, T., & McDonald, B. (2014). Measuring readability in financial disclosures. The Journal of Finance, 69(4), 1643-1671.
Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54(4), 1187-1230.
Loughran, T., & McDonald, B. (2020). Textual analysis in finance. Available at SSRN 3470272.
Luo, L., Yang, Z., Yang, P., Zhang, Y., Wang, L., Lin, H., & Wang, J. (2018). An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics, 34(8), 1381-1388.
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Miller, B. P. (2010). The effects of reporting complexity on small and large investor trading. The Accounting Review, 85(6), 2107-2143.
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2021). Deep learning-based text classification: A comprehensive review. ACM Computing Surveys (CSUR), 54(3), 1-40.
Narayanan, S., Achan, P., Rangan, P. V., & Rajan, S. P. (2021). Unified concept and assertion detection using contextual multi-task learning in a clinical decision support system. Journal of Biomedical Informatics, 122, 103898.
Petersen, M. A. (2004). Information: Hard and soft.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
Salehinejad, H., Sankar, S., Barfett, J., Colak, E., & Valaee, S. (2017). Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078.
SEC. (2021). About EDGAR. https://www.sec.gov/edgar/about
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for document classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
陳妍秀. (2018). Full-text extraction and performance evaluation of financial report items. National Taiwan University, Taipei. https://hdl.handle.net/11296/2nww2b
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/83920
dc.description.abstract: With the rapid development of data analytics and the rapid growth of financial data, more and more research focuses on applying text analysis to financial reports. The annual 10-K and quarterly 10-Q filings required by the U.S. Securities and Exchange Commission (SEC) have long been popular subjects for financial text analysis researchers. However, individual items within these filings are often studied and discussed separately, and because the quality of item extraction affects the results of downstream research, accurately extracting the specific items researchers need becomes an important problem. In our experiments, we propose an attention-based BiLSTM-CRF model to extract items from 10-Q filings. First, we designed manual annotation rules and produced an annotated 10-Q item-extraction dataset for the training process. Second, we built an attention-based BiLSTM-CRF model. It is an end-to-end model that requires neither complex feature engineering nor task-specific background knowledge: we feed in all the words of a 10-Q filing, and the model outputs a tagging decision for each line in sequence. Our results show that the attention-based BiLSTM-CRF model performs well both when extracting all items and when extracting specific items. [zh_TW]
dc.description.abstract: With the rapid development of data analysis and the high-speed growth of financial data, more and more researchers are focusing on financial report text analysis. The 10-K and 10-Q reports required by the United States Securities and Exchange Commission (SEC) have long been popular sources for researchers in financial text analysis. However, individual items in these reports are often discussed separately, and because the quality of the extracted items affects subsequent results and analysis, extracting items accurately is an important issue. In this study, we propose an attention-based BiLSTM-CRF model to extract items from 10-Q reports. We developed human annotation rules and an annotated 10-Q dataset for model training, and built an attention-based BiLSTM-CRF model for item extraction. The model is end-to-end, requiring neither hand-crafted features nor task-specific knowledge: it takes the tokens of a 10-Q document as input and outputs a predicted tag for each line. Our experimental results show that the attention-based BiLSTM-CRF model performs well on both all-item extraction and selected-item extraction. [en]
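The tagging setup the abstract describes, where a CRF layer chooses one BIO label per line of the filing, can be sketched with a plain Viterbi decoder. Everything below is illustrative: the transition and emission scores are made-up toy numbers, not the learned parameters of the thesis model.

```python
# Toy sketch of CRF-style Viterbi decoding over per-line BIO tags
# ("B" = first line of an item, "I" = inside an item, "O" = outside).
# All scores are made-up illustrative numbers, not learned weights.

TAGS = ["B", "I", "O"]

# Hypothetical transition scores; "I" is strongly discouraged right
# after "O", which is what keeps the decoded sequence well-formed BIO.
TRANS = {
    ("B", "B"): -1.0, ("B", "I"): 1.0,  ("B", "O"): -0.5,
    ("I", "B"): -1.0, ("I", "I"): 1.0,  ("I", "O"): -0.5,
    ("O", "B"): 0.5,  ("O", "I"): -5.0, ("O", "O"): 0.5,
}

def viterbi(emissions):
    """emissions: one {tag: score} dict per line; returns the best tag path."""
    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (emissions[0][t], [t]) for t in TAGS}
    for em in emissions[1:]:
        nxt = {}
        for t in TAGS:
            prev, (score, path) = max(
                ((p, best[p]) for p in TAGS),
                key=lambda kv: kv[1][0] + TRANS[(kv[0], t)],
            )
            nxt[t] = (score + TRANS[(prev, t)] + em[t], path + [t])
        best = nxt
    return max(best.values(), key=lambda v: v[0])[1]

# A heading-looking line followed by two content lines decodes to B, I, I.
lines = [
    {"B": 2.0, "I": 0.0, "O": 0.5},  # e.g. an item heading line
    {"B": 0.0, "I": 1.5, "O": 0.2},  # body text
    {"B": 0.0, "I": 1.5, "O": 0.2},  # body text
]
print(viterbi(lines))
```

The point of the CRF layer over an independent per-line classifier is exactly the transition table: it lets the decoder reject locally plausible but globally invalid sequences such as an "I" line with no preceding "B".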
dc.description.provenance: Made available in DSpace on 2023-03-19T21:23:34Z (GMT). No. of bitstreams: 1.
U0001-2806202217222000.pdf: 15401506 bytes, checksum: 6149624326b71f5efedfe756765ea406 (MD5)
Previous issue date: 2022 [en]
dc.description.tableofcontents:
CONTENTS
Acknowledgements / Abstract (Chinese) / Abstract / Contents / List of Figures / List of Tables
Chapter 1 Introduction
Chapter 2 Literature Review
  2.1 Text Analysis for Financial Reports
    2.1.1 Text Measures and Tasks
    2.1.2 Text Preprocessing for Financial Report Text Analytics
    2.1.3 Summary
  2.2 Sequence Models for Text Labeling
    2.2.1 Conditional Random Fields (CRF)
    2.2.2 Recurrent Neural Network (RNN)
    2.2.3 Long Short-Term Memory Network (LSTM)
    2.2.4 Bidirectional Long Short-Term Memory (Bi-LSTM)
    2.2.5 BiLSTM-CRF
    2.2.6 Attention-Based BiLSTM-CRF
    2.2.7 Summary
Chapter 3 Research Gaps and Research Questions
  3.1 Research Gaps
  3.2 Research Questions
Chapter 4 Data
  4.1 Data Overview
  4.2 Data Preprocessing
  4.3 Annotation Process
    4.3.1 Annotation Methods
    4.3.2 Annotation Descriptive Statistics
Chapter 5 Methodology
  5.1 Text Representation
  5.2 Bi-LSTM Model Structure
  5.3 BiLSTM-CRF
  5.4 Attention-Based BiLSTM-CRF
  5.5 Baseline Models
    5.5.1 Regular Expression
    5.5.2 Conditional Random Fields (CRF)
    5.5.3 BiLSTM-CRF-HC & BiLSTM-CRF-HC-SBERT
Chapter 6 Results & Discussion
Chapter 7 Error Analysis
  7.1 Table of Contents Misidentified as Content
  7.2 Lack of Item Title
    7.2.1 Item 1 and O Misidentified
    7.2.2 Lacking "Item" Keywords in Title Misidentification
  7.3 Manual Annotation Errors
  7.4 Signature Misidentified
  7.5 Part Two and Part One Block Misidentified
  7.6 Special 10-Q Report Format
    7.6.1 Title Mislabeled
    7.6.2 Lack of Certain Item
    7.6.3 Consolidated Statement
  7.7 Repeated Headers
  7.8 "Item" Keyword Appears in the Text
Chapter 8 Conclusion and Future Work
References

LIST OF FIGURES
Figure 1 Recurrent Neural Network Structure
Figure 2 A Long Short-Term Memory Cell (Huang et al., 2015)
Figure 3 Example of Bi-LSTM Architecture (Cornegruta et al., 2016)
Figure 4 Document-Level Att-BiLSTM-CRF Model (Luo et al., 2018)
Figure 5 Attention-Based BiLSTM-CRF Multitask Learning Model (Narayanan et al., 2021)
Figure 6 Distribution of the Number of Reports per Year
Figure 7 Average Word Length of 10-Q Items
Figure 8 Words of Each Item Changing per Year
Figure 9 Bi-LSTM Model Structure
Figure 10 BiLSTM-CRF Model Structure
Figure 11 Attention-Based BiLSTM-CRF Model Structure
Figure 12 Attention Weights in Selected Example Lines

LIST OF TABLES
Table 1 Section Definitions in 10-Q Filings
Table 2 Overview of Related Research in Financial Report Text Analysis
Table 3 Summary of Financial Report Preprocessing Procedures
Table 4 Summary of Item Extraction Methods for 10-K and 10-Q Reports
Table 5 Frequently Used Items in 10-Q Filings
Table 6 Examples of 10-Q Manual Annotation
Table 7 10-Q Annotation Rules
Table 8 Descriptive Statistics of BIO Tags
Table 9 Word Length of 10-Q Items
Table 10 Line Length of 10-Q Items
Table 11 Item Transition Probability Matrix from A to B (Panel A)
Table 12 Item Transition Probability Matrix from A to B (Panel B)
Table 13 Simplified Rules of the Regular Expression Baseline
Table 14 Summary of Model Performance
Table 15 Precision, Recall, and F1-Score of Each Item
Table 16 Probability of the Word "Item" Appearing at the Start of Item Content
Table 17 F1-Score for Selected Items
Table 18 Examples of Table of Contents Misidentified as Content
Table 19 Examples of Item 1 and O Misidentified
Table 20 Examples of Lacking "Item" Keywords in Title Misidentification
Table 21 Examples of Manual Annotation Errors
Table 22 Examples of Signature Misidentified
Table 23 Examples of Part Two and Part One Block Misidentified
Table 24 Examples of Title Mislabeled
Table 25 Examples of Lack of Certain Item
Table 26 Examples of Consolidated Statement
Table 27 Examples of Repeated Headers
Table 28 Examples of "Item" Keyword Appearing in the Text
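The table of contents names a regular-expression baseline (Section 5.5.1) for locating item headings. A minimal sketch of that idea follows; the pattern is hypothetical and is not the thesis's actual rule set (which its Table 13 summarizes).

```python
import re

# Hypothetical heading pattern in the spirit of a regex baseline for
# 10-Q item extraction; the thesis's actual rules may differ.
ITEM_HEADING = re.compile(r"^\s*item\s+(\d+[a-z]?)\s*[.:-]?\s*(.*)$", re.IGNORECASE)

def find_item_headings(lines):
    """Return (line_index, item_number, title) for heading-like lines."""
    hits = []
    for i, line in enumerate(lines):
        m = ITEM_HEADING.match(line)
        if m:
            hits.append((i, m.group(1).upper(), m.group(2).strip()))
    return hits

filing = [
    "PART I - FINANCIAL INFORMATION",
    "Item 1. Financial Statements",
    "See Item 3 below for market risk disclosures.",
    "Item 2. Management's Discussion and Analysis",
]
print(find_item_headings(filing))
```

A pattern like this also fires on table-of-contents entries and on body lines that happen to begin with "Item", which is exactly the failure mode the error-analysis chapter catalogs (Sections 7.1 and 7.8) and part of the motivation for the sequence-labeling model.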
dc.language.iso: en
dc.subject: End-to-end [zh_TW]
dc.subject: 10-Q Report [zh_TW]
dc.subject: Natural Language Processing [zh_TW]
dc.subject: Item Extraction [zh_TW]
dc.subject: Conditional Random Fields [zh_TW]
dc.subject: Bidirectional Long Short-Term Memory [zh_TW]
dc.subject: Attention [zh_TW]
dc.subject: Natural Language Processing [en]
dc.subject: End-to-end [en]
dc.subject: Attention [en]
dc.subject: Bi-LSTM [en]
dc.subject: CRF [en]
dc.subject: Item Extraction [en]
dc.subject: 10-Q Report [en]
dc.title: Item Extraction for U.S. SEC 10-Q Quarterly Reports [zh_TW]
dc.title: Item Extraction for SEC 10-Q Reports [en]
dc.type: Thesis
dc.date.schoolyear: 110-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 莊皓鈞 (Hao-Chun Chuang), 簡宇泰 (Yu-Tai Chien)
dc.subject.keyword: 10-Q Report, Natural Language Processing, Item Extraction, Conditional Random Fields, Bidirectional Long Short-Term Memory, Attention, End-to-end [zh_TW]
dc.subject.keyword: 10-Q Report, Natural Language Processing, Item Extraction, CRF, Bi-LSTM, Attention, End-to-end [en]
dc.relation.page: 69
dc.identifier.doi: 10.6342/NTU202201181
dc.rights.note: Not authorized
dc.date.accepted: 2022-07-05
dc.contributor.author-college: College of Management [zh_TW]
dc.contributor.author-dept: Graduate Institute of Information Management [zh_TW]
Appears in collections: Department of Information Management

Files in this item:
File / Size / Format
U0001-2806202217222000.pdf (restricted public access), 15.04 MB, Adobe PDF


All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.

Contact information
No. 1, Sec. 4, Roosevelt Rd., Da'an Dist., Taipei 10617, Taiwan (R.O.C.)
Tel: (02) 33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved