Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90120
Title: | A Seq-to-Seq Approach for Chinese Financial Data-to-Text Generation (基於序列到序列方法的中文財經數據轉文本生成) |
Author: | Chen-Wei Wu (吳晨瑋) |
Advisor: | Yuh-Jzer Joung (莊裕澤) |
Keywords: | Data-to-Text Generation, Natural Language Generation, Chinese Financial Text, Financial Context, Large Language Pre-trained Model |
Publication Year: | 2023 |
Degree: | Master's |
Abstract: | Data-to-text generation has attracted increasing attention in recent years: its emphasis on fluent and factually accurate text makes it well suited to news automation, and the field has developed rapidly in English-centric natural language research. In Chinese, however, it is still in its early stages, largely because the structure of the source data varies greatly across application domains, so obtaining domain-specific training data requires substantial manual annotation. Moreover, when the source data contains a large amount of numerical information, using language models to understand and generate those values remains challenging. To address these issues, this thesis proposes an any-shot training approach for data-to-text generation and applies numeracy-augmented pre-training to the language model to improve Chinese financial data-to-text performance.
The thesis introduces mT5-family pre-trained models into a sequence-to-sequence framework and trains and validates them on Chinese financial statements from the Taiwan Economic Journal database (TEJ+), with target texts augmented by ChatGPT, constituting the first study of data-to-text generation in the Chinese financial domain. Two further methods, Two-Step Pre-Training and Numerical Representation Augmentation, are employed to improve the model's comprehension and generation of numerical information.
Finally, under both automatic and human evaluation, the lightweight model architecture proposed in this thesis achieves performance comparable to ChatGPT on text quality and surpasses ChatGPT on the numerical precision of the generated results. |
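The pipeline described above feeds structured financial records into a sequence-to-sequence model as text. A minimal sketch of the two preprocessing ideas the abstract names is shown below; the field names, tag format, and digit-splitting scheme are illustrative assumptions, not the thesis's actual implementation.

```python
# Hypothetical sketch of data-to-text preprocessing. The record fields
# and <tag> markup are assumptions for illustration only.

def linearize_record(record: dict) -> str:
    """Flatten a key-value record into a tagged source string that a
    seq-to-seq model (e.g. mT5) can consume as plain text."""
    return " ".join(f"<{key}> {value}" for key, value in record.items())

def split_digits(number: str) -> str:
    """Space-separate each character of a numeric string so the
    tokenizer sees individual digits -- one common way to augment
    a model's numerical representation."""
    return " ".join(number)

record = {
    "company": "TSMC",
    "period": "2023Q1",
    "revenue": split_digits("508.63"),
}
print(linearize_record(record))
# <company> TSMC <period> 2023Q1 <revenue> 5 0 8 . 6 3
```

The linearized string becomes the encoder input; the target is the reference financial news sentence. Digit splitting trades longer sequences for finer-grained numeric tokens.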
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90120 |
DOI: | 10.6342/NTU202303640 |
Full-Text License: | Authorized (access restricted to campus) |
Appears in Collections: | Department of Information Management |
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf (currently not authorized for public access) | 3.82 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.