T-BERT: A Taiwanese Language Model Pretraining BERT Model on Languages in Taiwan

Ming-Yen Chung (鍾明諺)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454

Title:	T-BERT: A Taiwanese Language Model Pretraining BERT Model on Languages in Taiwan (T-BERT:臺灣語言模型–以臺灣在地語言預訓練BERT模型)
Author:	Ming-Yen Chung (鍾明諺)
Advisor:	Shih-Wei Liao (廖世偉)
Keywords:	natural language processing, BERT, language model, news classification, Taiwan
Publication Year:	2020
Degree:	Master's
Abstract:	Pretrained language models have brought significant improvements to numerous natural language processing (NLP) tasks. Many communities have released models pretrained on a variety of languages and on different domain corpora within each language, typically as variants of Bidirectional Encoder Representations from Transformers (BERT). However, the BERT models released for Taiwan's languages are pretrained only on Chinese Wikipedia. Although a released model can be further pretrained on domain corpora, its vocabulary is not well suited to Taiwan's languages. In this thesis, we propose T-BERT, a language model based on BERT but pretrained on Taiwan's three main languages, Taiwanese Mandarin (Traditional Chinese), Taiwan Southern Min, and Taiwanese Hakka, with a vocabulary generated specifically for these three languages. We create two downstream tasks, LTN (自由時報, Liberty Times) news classification and iCorpus (臺華平行新聞語料庫, a Taiwanese-Mandarin parallel news corpus) news classification, to evaluate model performance on Taiwanese Mandarin and Taiwan Southern Min, respectively. We show that T-BERT outperforms both Chinese BERT and multilingual BERT (mBERT), both officially released by Google, on the two downstream tasks, with an F1 score 16.88% higher on the iCorpus dataset and 0.32% higher on the LTN dataset.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
DOI:	10.6342/NTU202004194
Full-Text Authorization:	Not authorized
Appears in Collections:	Department of Computer Science and Information Engineering

Files in This Item:

File	Size	Format
U0001-3108202018150100.pdf (not currently authorized for public access)	1.13 MB	Adobe PDF

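Unless their copyright terms are specifically indicated otherwise, all items in this system are protected by copyright, with all rights reserved.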
