Item Extraction for Annual Financial Report: Annotation and Evaluation (財報項目全文的擷取和效能評估)

Yen-Hsiu Chen; 陳妍秀

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70779

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	盧信銘(Hsin-Min Lu)
dc.contributor.author	Yen-Hsiu Chen	en
dc.contributor.author	陳妍秀	zh_TW
dc.date.accessioned	2021-06-17T04:38:11Z	-
dc.date.available	2023-08-09
dc.date.copyright	2018-08-09
dc.date.issued	2018
dc.date.submitted	2018-08-07
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70779	-
dc.description.abstract	近年來文字分析在財報中的應用相當廣泛，但是研究者感興趣的議題只有特定幾項，再加上財報項目擷取的品質好壞會影響後續分析的結果，因此本實驗提出以機器學習的模型來解決此任務。在第一階段進行財報的人工標記，蒐集訓練資料，以反映報表真實情況；第二階段針對訓練資料集設計六種不同的特徵，並運用條件隨機域讓模型自行根據學到的潛在規則進行文字序列標記。根據本實驗結果可以發現使用條件隨機域的方式進行全文的項目擷取，可以有效地提升擷取準確度，確保分析前的資料品質。而在這之中，項目標題文字對於標記的結果影響較大，項目編號和 item 此字較無任何影響。	zh_TW
dc.description.abstract	Textual analysis is widely applied to financial reports. However, researchers are typically interested in only a few specific items, and the quality of item extraction affects the results of subsequent analysis. This study therefore proposes a machine learning model for extracting items from 10-K reports. In the first stage, financial reports are manually labeled to collect training data that reflect the actual state of the reports. In the second stage, six different features are designed for the training dataset, and a conditional random field (CRF) is used so that the model labels text sequences according to the latent rules it learns. The experimental results show that using a CRF for full-text item extraction effectively improves extraction accuracy and helps ensure data quality before analysis. Among the features, the item title text has the larger influence on the labeling results, whereas the item number and the word "item" have little influence.	en
dc.description.provenance	Made available in DSpace on 2021-06-17T04:38:11Z (GMT). No. of bitstreams: 1 ntu-107-R05725037-1.pdf: 2081260 bytes, checksum: 7b477e7c898a7c1c29f96e19308b4fbd (MD5) Previous issue date: 2018	en
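dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements; Abstract (Chinese); Abstract (English); Chapter 1 Introduction; 1.1. Research Background and Motivation; 1.2. Research Objectives; 1.3. Research Structure; Chapter 2 Literature Review; 2.1. Introduction to Textual Analysis of Financial Reports; 2.1.1. Applications; 2.1.2. Data Preprocessing; 2.1.3. Summary; 2.2. Models for Text Sequence Data; 2.2.1. Statistical Models; 2.2.2. Neural Networks; 2.2.3. Hybrid Models; 2.2.4. Summary; 2.3. Feature Design for Conditional Random Fields; Chapter 3 Data Overview and Data Processing; 3.1. Data Sources and Background; 3.2. Data Processing; 3.2.1. Data Collection; 3.2.2. Data Preprocessing; Chapter 4 Methodology; 4.1. Research Process; 4.2. Labeling Scheme; 4.3. Manual Labeling Process; 4.3.1. Labeling Workflow; 4.3.2. File Preparation; 4.3.3. Quality Control; 4.3.4. Descriptive Statistics of Labels; 4.4. Feature Construction; 4.4.1. Position Feature; 4.4.2. Sentence First-Word Feature; 4.4.3. Sentence Unigram and Bigram Features; 4.4.4. Title Feature; 4.4.5. Table-of-Contents Counter-Indicator Feature; 4.4.6. Special Reference Feature; 4.5. Model Training; 4.5.1. CRF Model Parameter Settings; 4.5.2. Baseline Design; 4.6. Model Evaluation; Chapter 5 Experimental Results; 5.1. Model Parameter Tuning; 5.2. Model Evaluation; 5.3. Error Analysis; 5.3.1. Confusion among Item 6, Item 8, Item 15, and O; 5.3.2. No Title or the Word "Item"; 5.3.3. Body Text Referring to Other Sections; 5.3.4. Report Content That Is Difficult to Judge; 5.3.5. Manual Labeling Errors; Chapter 6 Conclusions and Recommendations; 6.1. Experimental Conclusions and Recommendations; 6.2. Research Contributions; 6.3. Future Research Directions; Chapter 7 References; Appendix 1; Appendix 2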
dc.language.iso	zh-TW
dc.subject	項目擷取	zh_TW
dc.subject	10-K 財報	zh_TW
dc.subject	序列標記	zh_TW
dc.subject	條件隨機域	zh_TW
dc.subject	人工標記	zh_TW
dc.subject	CRF	en
dc.subject	10-K Report	en
dc.subject	Sequence Labeling	en
dc.subject	Manual Tagging	en
dc.subject	Item Extraction	en
dc.title	財報項目全文的擷取和效能評估	zh_TW
dc.title	Item Extraction for Annual Financial Report: Annotation and Evaluation	en
dc.type	Thesis
dc.date.schoolyear	106-2
dc.description.degree	Master's (碩士)
dc.contributor.oralexamcommittee	余峻瑜(Jiun-Yu Yu),洪為璽(Wei-Hsi Hung)
dc.subject.keyword	人工標記,條件隨機域,項目擷取,序列標記,10-K 財報	zh_TW
dc.subject.keyword	Manual Tagging,CRF,Item Extraction,Sequence Labeling,10-K Report	en
dc.relation.page	61
dc.identifier.doi	10.6342/NTU201802638
dc.rights.note	Paid license (有償授權)
dc.date.accepted	2018-08-08
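dc.contributor.author-college	College of Management (管理學院)	zh_TW
dc.contributor.author-dept	Graduate Institute of Information Management (資訊管理學研究所)	zh_TW
Appears in Collections:	Department of Information Management (資訊管理學系)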
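
The abstract describes labeling text sequences with a conditional random field over hand-designed features. As a rough illustration only, the sketch below builds per-sentence feature dictionaries loosely modeled on four of the feature types named in the table of contents (position, first word, unigrams and bigrams, title text), plus the item number and the word "item" mentioned in the abstract. The function names, the regular expression, and the short `ITEM_TITLES` list are assumptions made for this example; they are not taken from the thesis.

```python
import re

# Hypothetical subset of 10-K item titles, used only for this example.
ITEM_TITLES = {
    "1": "business",
    "1a": "risk factors",
    "7": "management's discussion and analysis",
    "8": "financial statements and supplementary data",
}

# Matches a sentence that begins with "Item <number>", e.g. "Item 1A."
ITEM_RE = re.compile(r"^\s*item\s+(\d{1,2}[ab]?)\b", re.IGNORECASE)


def sentence_features(sentences, i):
    """Return a feature dict for sentence i (illustrative features only)."""
    text = sentences[i]
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9']+", lowered)
    match = ITEM_RE.match(text)
    feats = {
        # Relative position of the sentence in the document, in [0, 1].
        "position": round(i / max(len(sentences) - 1, 1), 1),
        # First word of the sentence.
        "first_word": tokens[0] if tokens else "",
        # Whether the sentence opens with the word "item" and a number.
        "starts_with_item": match is not None,
        "item_number": match.group(1).lower() if match else "",
        # Whether any known item title appears in the sentence.
        "has_title_text": any(t in lowered for t in ITEM_TITLES.values()),
    }
    # Unigram and bigram indicator features.
    for tok in tokens:
        feats["uni=" + tok] = True
    for a, b in zip(tokens, tokens[1:]):
        feats["bi=" + a + "_" + b] = True
    return feats


def document_features(sentences):
    """Return one feature dict per sentence, in document order."""
    return [sentence_features(sentences, i) for i in range(len(sentences))]


if __name__ == "__main__":
    doc = [
        "Item 1A. Risk Factors",
        "Our business depends on key suppliers.",
        "See Item 7 for further discussion.",
    ]
    for feats in document_features(doc):
        print(feats["position"], feats["starts_with_item"], feats["item_number"])
```

A list of such dictionaries per document, paired with one label per sentence (an item label or `O`), is the kind of input a CRF toolkit would be trained on. Note that the third sentence mentions "Item 7" without opening with it, so `starts_with_item` stays false; in-text references of this kind correspond to one of the error categories listed in the table of contents.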

Files in This Item:

File	Size	Format
ntu-107-1.pdf (not authorized for public access)	2.03 MB	Adobe PDF


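Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.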