NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52121
Title: 基於深度學習方法之中文句法校正模型
Deep Learning-based Chinese Grammar Correction
Authors: Hsiang-Che Hsu
徐祥哲
Advisor: 盧信銘 (Hsin-Ming Lu)
Keyword: 中文文法校正, 深度學習, 自然語言處理
Chinese Grammar Error Correction, Deep Learning, Natural Language Processing
Publication Year: 2020
Degree: Master's (碩士)
Abstract: Chinese Grammar Error Correction (CGEC) is mainly used to detect whether the wording of the sentences in a text conforms to commonly accepted grammatical structure. In practice, besides detecting obvious mistakes such as typos, it also covers removing four error types: redundant words, missing words, bad word selection, and word ordering (disorder) errors. Research in this area has been developing for nearly 20 years, but deep learning approaches have only been adopted recently. This study combines deep learning methods and applies the Transformer model to the grammar correction problem. In our experiments, we use Bidirectional Encoder Representations from Transformers (BERT) for the grammar error classification task and the Copy-Augmented Architecture for the correct sentence generation task. For training, we propose a self-designed pretraining task and pretrain on a large corpus into which we embed grammatical errors ourselves; we then fine-tune and test the models on the data of the three Natural Language Processing Techniques for Educational Applications shared tasks from 2014 to 2016, and evaluate them with the metrics specified by the shared tasks (accuracy and F1 score). The experimental results show that our method substantially improves prediction performance compared with other related methods. In addition, we discuss how this method differs from other deep learning approaches under different models.
Chinese Grammar Error Correction (CGEC) is commonly used to detect whether the word sequence of a sentence is consistent with commonly accepted grammar. CGEC usually covers detection tasks such as typos, redundant words, missing words, word selection errors, and word ordering errors. While CGEC has been studied for nearly 20 years, few studies have adopted deep learning for it. In this thesis, we focus on two CGEC tasks: grammar error classification and correct sentence generation. To improve prediction performance on these tasks, we propose to incorporate the Transformer, a deep learning architecture for natural language processing, into our models. We tackle the grammar error classification task with Bidirectional Encoder Representations from Transformers (BERT) and address the correct sentence generation task with the Copy-Augmented Architecture. We also modify the pretraining and fine-tuning process to further improve prediction performance. In the pretraining stage, we adopt a new task that predicts error types from inputs with injected errors, created from large-scale corpora including Jin Yong novels, the Chinese Wikipedia corpus, and Taiwan Yahoo News from 2005 to 2011. In the fine-tuning stage, we use the Natural Language Processing Techniques for Educational Applications (NLPTEA) 2014-2016 shared task data to fine-tune our models. According to the experimental results, our pretrained BERT models outperformed the other approaches proposed in the shared tasks, and the Copy-Augmented Architecture performed better with more pretraining data. In conclusion, we find that adding the new pretraining task significantly enhances model performance. When training with different data sizes in the pretraining stage, the models achieve their best performance with approximately 5 million simulated training sentences.
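
The abstract above describes a pretraining step that injects the four CGEC error types (redundant words, missing words, bad word selection, disorder) into correct sentences to produce labelled training pairs. The following Python fragment is a minimal, hypothetical sketch of that idea under stated assumptions: the confusion set, toy corpus, and function name inject_error are illustrative inventions, not the thesis implementation or its data.

# Hypothetical sketch (not the thesis code): corrupt correct sentences with
# the four CGEC error types to build (input, error-type) pretraining pairs.
import random

# Toy word-selection confusion set; a real setup would need a far larger one.
CONFUSION = {"的": ["得", "地"], "在": ["再"], "做": ["作"]}

def inject_error(sentence):
    """Return (corrupted_sentence, error_type) for one correct sentence."""
    chars = list(sentence)
    error_type = random.choice(["redundant", "missing", "selection", "disorder"])
    i = random.randrange(len(chars))
    if error_type == "redundant":            # duplicate a character
        chars.insert(i, chars[i])
    elif error_type == "missing":            # drop a character
        del chars[i]
    elif error_type == "selection":          # swap in a confusable character
        candidates = CONFUSION.get(chars[i])
        if candidates:
            chars[i] = random.choice(candidates)
        else:
            error_type = "correct"           # nothing confusable at this position
    else:                                    # disorder: swap two adjacent characters
        j = i + 1 if i + 1 < len(chars) else i - 1
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars), error_type

corpus = ["我在圖書館看書", "他的報告寫得很好"]   # toy stand-in for the corpora named above
pretraining_pairs = [inject_error(s) for s in corpus]
print(pretraining_pairs)

Each resulting (corrupted sentence, error type) pair could serve as an input/label example for the error-type classification pretraining task; the thesis applies this idea at a much larger scale, reporting that roughly 5 million simulated sentences worked best.
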
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52121
DOI: 10.6342/NTU202002631
Fulltext Rights: 有償授權 (licensed for a fee)
Appears in Collections: 資訊管理學系 (Department of Information Management)

Files in This Item:
File: U0001-0708202014232100.pdf (Restricted Access)
Size: 3.05 MB
Format: Adobe PDF


