Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
Title: | T-BERT: A Taiwanese Language Model Pretraining BERT Model on Languages in Taiwan |
Authors: | Ming-Yen Chung (鍾明諺) |
Advisor: | Shih-Wei Liao (廖世偉) |
Keyword: | natural language processing, language model, BERT, news classification, Taiwan |
Publication Year: | 2020 |
Degree: | Master |
Abstract: | Pretrained language models have brought significant improvements to numerous Natural Language Processing (NLP) tasks, and many communities have released models pretrained on a variety of languages and on different domain corpora within each language, typically variants of the Bidirectional Encoder Representations from Transformers (BERT). However, the existing BERT models available for Taiwan's languages are pretrained only on Chinese Wikipedia. Even though the released models can be further pretrained on domain corpora, their vocabularies are also not well suited to Taiwan's languages. In this thesis, we propose T-BERT, a language model based on BERT but pretrained on Taiwan's three main languages: Taiwanese Mandarin (Traditional Chinese), Taiwan Southern Min, and Taiwanese Hakka, with a vocabulary generated specifically for these three languages. We create two downstream tasks, LTN (Liberty Times Net) news classification and iCorpus (Taiwanese-Mandarin parallel news corpus) news classification, to evaluate the model's performance on Taiwanese Mandarin and Taiwan Southern Min, respectively. We show that T-BERT outperforms both the Chinese BERT and multilingual BERT (mBERT) models officially released by Google on these two downstream tasks, improving the F1 score by 16.88% on the iCorpus dataset and by 0.32% on the LTN dataset. |
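The record itself contains no code; as a rough illustration of the recipe the abstract describes (building a vocabulary from Taiwan-language corpora, then pretraining BERT from scratch with masked language modeling), the sketch below uses the Hugging Face `tokenizers` and `transformers` libraries. The corpus file names, vocabulary size, and training hyperparameters are placeholders and are not the values used in the thesis.

```python
# Minimal sketch (not the thesis's actual pipeline) of the two steps the
# abstract describes: (1) train a WordPiece vocabulary on Taiwan-language
# corpora, (2) pretrain a BERT model from scratch with masked language modeling.
# All file names and hyperparameters below are illustrative placeholders.
import os
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# (1) Build a vocabulary from the combined corpora of Taiwanese Mandarin,
#     Taiwan Southern Min, and Taiwanese Hakka (placeholder file names).
corpus_files = ["mandarin.txt", "southern_min.txt", "hakka.txt"]
wp_tokenizer = BertWordPieceTokenizer()
wp_tokenizer.train(
    files=corpus_files,
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
os.makedirs("tbert_vocab", exist_ok=True)
wp_tokenizer.save_model("tbert_vocab")  # writes tbert_vocab/vocab.txt

# (2) Pretrain BERT from scratch using the new vocabulary.
tokenizer = BertTokenizerFast(vocab_file="tbert_vocab/vocab.txt")
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="pretrain_corpus.txt",  # placeholder combined corpus
    block_size=128,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tbert",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```

For the downstream news-classification tasks, the pretrained checkpoint would typically be loaded into a sequence-classification head (e.g. `BertForSequenceClassification`) and fine-tuned on the labeled LTN or iCorpus examples, with F1 computed on a held-out split; the abstract above reports the thesis's F1 gains for those two tasks.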
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454 |
DOI: | 10.6342/NTU202004194 |
Fulltext Rights: | Not authorized |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | Access
---|---|---|---
U0001-3108202018150100.pdf | 1.13 MB | Adobe PDF | Restricted Access