Reliable Readability for 10-K Reports: Reducing Noise in Readability by Learning-Based Text Tidying

吳琦艾; Chi-Ai Wu

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92361

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	盧信銘	zh_TW
dc.contributor.advisor	Hsin-Min Lu	en
dc.contributor.author	吳琦艾	zh_TW
dc.contributor.author	Chi-Ai Wu	en
dc.date.accessioned	2024-03-21T16:47:44Z	-
dc.date.available	2024-10-31	-
dc.date.copyright	2024-03-21	-
dc.date.issued	2023	-
dc.date.submitted	2023-10-05	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92361	-
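dc.description.abstract	With the rise of 10-K text analysis, how to compute readability scores accurately, and which text cleaning method for 10-K files is most effective, have become crucial questions. We propose a machine-learning-based text tidying method (LTT-Better) that uses a Bi-LSTM model to clean text for readability computation. Most readability formulas assume that the text contains only complete sentences, with no headings or page numbers, and few original 10-K files meet this condition. LTT-Better uses Bi-LSTM to remove unnecessary characters from 10-Ks, reducing noise and improving the quality of text cleaning. When LTT-Better replaces traditional text cleaning, most readability measures are statistically closer to those of manually cleaned 10-K reports. We further conduct an empirical study using 10-Ks from 1994 to 2022 to investigate whether readability-induced information uncertainty affects stock price volatility after the 10-K filing date. Our experimental results show that, compared with traditional rule-based text cleaning, readability measures from LTT-Better achieve higher t-values in most cases. Moreover, when the regression model includes both the Fog index from traditional text cleaning and the LTT-Better Fog index, both are significant, and the LTT-Better Fog index has the higher t-value. Our findings show that LTT-Better is an effective method when a study needs to clean 10-K reports for readability analysis. Future research should apply this cleaning method to 10-K files before analyzing their linguistic features. In addition, we offer researchers recommendations on which readability formulas to use with different text cleaning methods.	zh_TW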
dc.description.abstract	With the growth of 10-K text analysis, it has become essential to determine how to compute readability scores reliably and which text preparation methods for 10-K files are effective. We propose the Better Learning-Based Text Tidying (LTT-Better) approach, which uses Bi-LSTM models to prepare text for readability computation. Most readability measures assume correct sentence boundaries and text chunks free of headings and dangling page numbers, conditions that the original 10-K files rarely satisfy. LTT-Better uses Bi-LSTM to remove unnecessary text chunks, reducing noise and improving both text preparation and text analysis of 10-K reports. When LTT-Better is used instead of traditional rule-based preparation, the majority of readability measures are statistically closer to those computed from human-prepared 10-Ks. We further estimate empirical models, using 10-Ks from 1994 to 2022, to investigate whether readability-induced information uncertainty contributes to stock price volatility after the filing date. Our empirical results show that, compared with rule-based text preparation, readability measures from LTT-Better achieve a higher t-value in most cases. Moreover, when a regression model contains both the rule-based Fog index and the LTT-Better Fog index, both are significant, with the LTT-Better Fog index achieving the higher t-value. Our findings suggest that LTT-Better is a promising approach to preparing 10-K reports for readability analysis. Future research should apply such an approach to 10-Ks before analyzing their linguistic attributes. We also give researchers guidance on which readability measures to use in future research.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-03-21T16:47:44Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2024-03-21T16:47:44Z (GMT). No. of bitstreams: 0	en
dc.description.tableofcontents	Acknowledgements; Chinese Abstract; Abstract; List of Figures; List of Tables; 1 Introduction; 2 Literature Review; 2.1 Text Analysis of Financial Report; 2.2 Readability Measures; 2.3 Preparing Financial Reports for Text Analysis; 3 Methodology; 3.1 Research Testbed; 3.2 Text Preparation Approaches; 3.2.1 Rule-Based Approach (RB); 3.2.2 Learning-Based Text Tidying (LTT) and Better Learning-Based Text Tidying (LTT-Better); 3.3 Reliable Readability; 3.4 Experimental Design; 3.4.1 First Experiment: Paired t-Test; 3.4.2 Second Experiment: Regression; 4 Experimental Results; 4.1 Summary Statistics; 4.2 Paired t-Test Result; 4.3 Regression Results; 5 Conclusion; Reference; A Summary Statistics of Reproduced Regression; B Assumptions for Statistical Tests	-
dc.language.iso	en	-
dc.subject	Text Analysis	zh_TW
dc.subject	Text Cleaning	zh_TW
dc.subject	Machine Learning	zh_TW
dc.subject	Readability	zh_TW
dc.subject	Financial Statements	zh_TW
dc.subject	10-K	zh_TW
dc.subject	Text Analysis	en
dc.subject	Readability	en
dc.subject	10-K	en
dc.subject	Bi-LSTM	en
dc.subject	Text Preparation	en
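dc.title	Constructing a Reliable Readability Measure for 10-K Financial Reports: Reducing Noise in Readability with Machine-Learning-Based Text Cleaning	zh_TW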
dc.title	Reliable Readability for 10-K Reports: Reducing Noise in Readability by Learning-Based Text Tidying	en
dc.type	Thesis	-
dc.date.schoolyear	112-1	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	張景宏;簡宇泰	zh_TW
dc.contributor.oralexamcommittee	Ching-Hung Chang;Yu-Tai Chien	en
dc.subject.keyword	10-K, Financial Statements, Readability, Text Analysis, Text Cleaning, Machine Learning	zh_TW
dc.subject.keyword	10-K, Readability, Text Analysis, Text Preparation, Bi-LSTM	en
dc.relation.page	59	-
dc.identifier.doi	10.6342/NTU202304256	-
dc.rights.note	Authorization granted (access restricted to campus)	-
dc.date.accepted	2023-10-11	-
dc.contributor.author-college	College of Management	-
dc.contributor.author-dept	Department of Information Management	-
dc.date.embargo-lift	2024-10-31	-
Appears in Collections:	Department of Information Management
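
The abstract notes that readability formulas assume clean sentence boundaries, so stray headings and page numbers distort the score. The following is a minimal sketch of that effect using the Gunning Fog index. The vowel-group syllable counter, the sample text, and the `drop_noise_lines` filter are illustrative assumptions. The filter is a toy rule-based stand-in and is not the Bi-LSTM-based LTT-Better method described in the thesis.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fog_index(text: str) -> float:
    """Gunning Fog = 0.4 * (words per sentence + 100 * share of 3+ syllable words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def drop_noise_lines(raw: str) -> str:
    """Hypothetical toy filter: drop bare page numbers and unpunctuated headings."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit():
            continue  # blank line or dangling page number
        if not stripped.endswith((".", "!", "?")):
            continue  # heading-like line without sentence-ending punctuation
        kept.append(stripped)
    return " ".join(kept)


# Made-up excerpt with a heading and a dangling page number.
RAW = """ITEM 7 MANAGEMENT DISCUSSION AND ANALYSIS
Revenue increased during the year.
42
Costs were stable."""

noisy_score = fog_index(RAW)                    # heading merges into sentence 1
clean_score = fog_index(drop_noise_lines(RAW))  # only the two real sentences

print(f"Fog with noise: {noisy_score:.2f}")
print(f"Fog after filtering: {clean_score:.2f}")
```

On this made-up excerpt the unfiltered text scores higher (roughly 18.0 versus 11.6), because the heading is absorbed into the first sentence and adds long words. A learned tagger such as the one the thesis proposes would replace the hand-written rules in `drop_noise_lines`.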

Files in This Item:

File	Size	Format
ntu-112-1.pdf Access restricted to NTU campus IP addresses (off campus, please use the VPN service)	924.94 kB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.