A Study on Punctuation in Chinese Classic Texts (中文古籍標點研究)

蔡念成; Nian-Cheng Tsai

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90180

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	項潔	zh_TW
dc.contributor.advisor	Hsiang Jieh	en
dc.contributor.author	蔡念成	zh_TW
dc.contributor.author	Nian-Cheng Tsai	en
dc.date.accessioned	2023-09-22T17:44:51Z	-
dc.date.available	2023-11-09	-
dc.date.copyright	2023-09-22	-
dc.date.issued	2023	-
dc.date.submitted	2023-08-10	-
dc.identifier.citation	[1] Taiwan History Digital Library (台灣歷史數位圖書館). http://thdl.ntu.edu.tw/ [2] Sturgeon, D. (Ed.). (2011). Chinese Text Project (中國哲學書電子化計劃). http://ctext.org [3] OpenAI. (2022). ChatGPT. https://chat.openai.com [4] Hu, R., Li, S., & Zhu, Y. (2021). Knowledge representation and sentence segmentation of ancient Chinese based on deep language models. Journal of Chinese Information Processing, 35(4), 8-15. [5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. [6] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. [7] 王东波, 刘畅, 朱子赫, 刘江峰, 胡昊天, 沈思, & 李斌. (2022). SikuBERT 与 SikuRoBERTa: 面向数字人文的《四库全书》预训练模型构建及应用研究. 图书馆论坛, 42(6), 31-43. [8] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90180	-
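dc.description.abstract	Sentence segmentation and punctuation are important for understanding the meaning of a text. The modern concept of punctuation marks was introduced from Western countries in recent times, so Chinese classical texts usually carry no punctuation, which makes them quite difficult to understand. A large amount of text in THDL-古契書 remains unsegmented. This dataset is the land deed collection of the Taiwan History Digital Library (THDL), which gathers 40,428 land documents from historical Taiwan; 10,492 of them are unsegmented or only partially segmented. Because the volume of text is so large, information technology is needed to assist with segmentation. After trying several segmentation tools, we found that the results fell short of expectations: the tools could not correctly segment documents with special vocabulary or special formats, or documents containing Japanese kana. We therefore had to train a reliable segmentation model of our own for THDL-古契書. The experiments in this study center on fine-tuning the pre-trained SikuBERT model for segmentation or punctuation tasks. Besides training a segmentation model on the already segmented documents in THDL-古契書, to show that doing so is necessary we also trained a segmentation model on ctext texts, which span the four traditional categories (經史子集: classics, histories, masters, and collections), and compared the two models' segmentation results on whole documents, special vocabulary, and Japanese kana. The results show that the model trained on THDL-古契書 significantly outperforms the one trained on ctext texts, indicating that designing separate models for different genres of classical Chinese is still worthwhile. Other classical Chinese texts also need segmentation or punctuation. We therefore fine-tuned SikuBERT on ctext texts for the punctuation task and combined that model with the previously trained land deed and ctext segmentation models into a classical Chinese segmentation and punctuation tool that lets users segment or punctuate texts in batches.	zh_TW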
dc.description.abstract	Sentence segmentation and punctuation play crucial roles in understanding the meaning of texts. However, Chinese classic texts typically lack punctuation marks, which makes them quite challenging to understand. The THDL (Taiwan History Digital Library) database consists of three collections. One of them is the collection of Taiwanese Land Deeds (古契書), which gathers a total of 40,428 old land deeds in Taiwan, of which 10,492 had not been punctuated. Because of the massive amount of text, assistance from information technology is needed to segment the documents. After trying some sentence segmentation tools, we found that the results were not as expected, especially for documents containing special vocabulary, special styles, or Japanese kana characters. Therefore, we had to train reliable segmentation models ourselves. This research focuses on fine-tuning the pre-trained model SikuBERT for sentence segmentation or punctuation tasks. Besides training a segmentation model on pre-segmented Taiwanese Land Deeds, to demonstrate its necessity, we also used Chinese classic texts from ctext to train another sentence segmentation model, and compared the segmentation results of the two models. The evaluations show that the model trained on Taiwanese Land Deeds significantly outperforms the model trained on ctext texts. This implies that training distinct models for different styles of Chinese classical texts still holds significance. Other Chinese classic texts also need sentence segmentation and punctuation. Therefore, we developed a tool for sentence segmentation and punctuation, containing a segmentation model trained on Taiwanese Land Deeds and both segmentation and punctuation models trained on ctext texts. This tool lets users perform batch sentence segmentation and punctuation on various Chinese classical texts.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T17:44:51Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2023-09-22T17:44:51Z (GMT). No. of bitstreams: 0	en
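dc.description.tableofcontents	Acknowledgements; Abstract (Chinese); ABSTRACT; CONTENTS; LIST OF FIGURES; LIST OF TABLES; Chapter 1 Introduction; 1.1 Motivation; 1.2 Thesis Organization; Chapter 2 Overview of Classical Chinese Segmentation Tools; 2.1 ChatGPT; 2.2 古詩文斷句; 2.3 Research Objectives; Chapter 3 Research Methods and Experiments; 3.1 BERT; 3.1.1 Introduction to BERT; 3.1.2 The SikuBERT Pre-trained Model and Fine-tuning for Downstream Tasks; 3.2 Evaluation Criteria; 3.2.1 Precision, Recall, F1-score; 3.2.2 Precision-Recall Curve; 3.2.3 Binary versus Multi-class Classification; 3.3 t-SNE; 3.4 Experiment Overview; 3.4.1 Experimental Procedure; 3.4.2 From BERT Input to Merged Output; Chapter 4 Datasets and Data Cleaning; 4.1 The THDL-古契書 Dataset; 4.1.1 Introduction to THDL-古契書; 4.1.2 Table Contents of THDL-古契書; 4.2 Cleaning and Preprocessing of THDL-古契書; 4.2.1 Character Replacement; 4.2.2 Identifying Segmentation Marks; 4.2.3 Identifying Segmented Documents; 4.3 The ctext Dataset; 4.3.1 ctext Data Collection; 4.3.2 ctext Data Cleaning; Chapter 5 Segmentation Results on THDL-古契書; 5.1 Segmentation Results on THDL-古契書; 5.1.1 Overview; 5.1.2 Results on the THDL-古契書 Test Set; 5.2 Comparison With and Without Named-Entity Tags; 5.3 Comparison with the ctext Model; 5.3.1 Comparison on THDL-古契書 Special Vocabulary; 5.3.2 Comparison on the THDL-古契書 Validation Set Containing Japanese Kana; 5.3.3 Comparison of t-SNE Visualizations; Chapter 6 Classical Chinese Segmentation and Punctuation Tool; 6.1 Goals of the Tool; 6.2 Results of the ctext Punctuation Model; 6.3 Introduction to the Tool; Chapter 7 Conclusion and Discussion; 7.1 Conclusion; 7.2 Limitations; 7.3 Future Work; Bibliography	-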
dc.language.iso	zh_TW	-
dc.subject	自動斷句	zh_TW
dc.subject	古地契	zh_TW
dc.subject	BERT	zh_TW
dc.subject	中文古籍	zh_TW
dc.subject	自然語言處理	zh_TW
dc.subject	Chinese Classic Texts	en
dc.subject	BERT	en
dc.subject	Automatic Sentence Segmentation	en
dc.subject	Natural Language Processing	en
dc.subject	old Land Deeds	en
dc.title	中文古籍標點研究	zh_TW
dc.title	A Study on Punctuation in Chinese classic texts	en
dc.type	Thesis	-
dc.date.schoolyear	111-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	蔡宗翰;謝育平	zh_TW
dc.contributor.oralexamcommittee	Tzong-Han Tsai;Yuh-Pyng Shieh	en
dc.subject.keyword	古地契,中文古籍,自然語言處理,自動斷句,BERT	zh_TW
dc.subject.keyword	old Land Deeds,Chinese Classic Texts,Natural Language Processing,Automatic Sentence Segmentation,BERT	en
dc.relation.page	70	-
dc.identifier.doi	10.6342/NTU202304034	-
dc.rights.note	Not authorized	-
dc.date.accepted	2023-08-12	-
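dc.contributor.author-college	College of Electrical Engineering and Computer Science	-
dc.contributor.author-dept	Department of Computer Science and Information Engineering	-
Appears in Collections:	Department of Computer Science and Information Engineering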
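
The abstract describes fine-tuning SikuBERT for sentence segmentation but the record does not say how the task is encoded. One common framing, offered here only as an assumption, is per-character labeling: punctuation is stripped from already segmented text, each remaining character is labeled 1 if a break follows it and 0 otherwise, and predictions are scored with precision, recall, and F1 on the break positions. The sketch below illustrates that framing in plain Python. The names `BREAK_MARKS`, `to_labels`, `apply_labels`, and `boundary_prf` are hypothetical, the mark set is illustrative rather than the thesis's own list, and the sample sentence comes from the Analects, not from the land deeds.

```python
# Illustrative set of break marks (an assumption; the thesis defines its own).
BREAK_MARKS = set("，。、；：？！")


def to_labels(punctuated: str) -> tuple[str, list[int]]:
    """Strip break marks and label each remaining character:
    1 if a break follows the character, 0 otherwise."""
    chars: list[str] = []
    labels: list[int] = []
    for ch in punctuated:
        if ch in BREAK_MARKS:
            if labels:  # ignore a mark that appears before any character
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def apply_labels(text: str, labels: list[int], mark: str = "。") -> str:
    """Re-insert a single break mark after every character labeled 1."""
    return "".join(ch + (mark if lab == 1 else "") for ch, lab in zip(text, labels))


def boundary_prf(gold: list[int], pred: list[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over break positions (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    raw, gold = to_labels("學而時習之，不亦說乎。有朋自遠方來，不亦樂乎。")
    print(raw)
    print(gold)
    print(apply_labels(raw, gold))
```

Under this framing, a token-classification model would be trained to predict the label list from the unpunctuated string; a punctuation model would use one label per mark type instead of a single break label.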

Files in this item:

File	Size	Format
ntu-111-2.pdf (not authorized for public access)	3.52 MB	Adobe PDF


Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.