NTU Theses and Dissertations Repository

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
Title: T-BERT:臺灣語言模型–以臺灣在地語言預訓練BERT模型
T-BERT: A Taiwanese Language Model Pretraining BERT Model on Languages in Taiwan
Authors: Ming-Yen Chung (鍾明諺)
Advisor: Shih-Wei Liao (廖世偉)
Keyword: natural language processing, language model, BERT, news classification, Taiwan
Publication Year: 2020
Degree: Master's
Abstract: Pretrained language models have brought significant improvements to numerous Natural Language Processing (NLP) tasks. Many communities have released models pretrained on a variety of languages as well as different domain corpora within each language, typically employing variants of the Bidirectional Encoder Representations from Transformers (BERT) model. However, the existing BERT models available for Taiwan's languages are pretrained only on Chinese Wikipedia. Even though the released models can be further pretrained on domain corpora, their vocabularies are also not suitable for Taiwan's languages.
In this thesis, we propose T-BERT, a language model based on BERT but pretrained on Taiwan's three main languages: Taiwanese Mandarin (Traditional Chinese), Taiwan Southern Min, and Taiwanese Hakka, with a vocabulary generated specifically for these three languages.
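As an illustration of how a dedicated vocabulary for these three languages could be produced, the following is a minimal sketch using the Hugging Face tokenizers library; the corpus file names, vocabulary size, and output directory are assumptions for illustration, not details taken from the thesis.

```python
# Minimal sketch (not the thesis's released pipeline): train a WordPiece
# vocabulary covering Taiwanese Mandarin, Taiwan Southern Min, and
# Taiwanese Hakka text, for use with a BERT-style model.
import os

from tokenizers import BertWordPieceTokenizer

# Hypothetical corpus files, one document per line; the names are placeholders.
corpus_files = [
    "mandarin_news.txt",      # Taiwanese Mandarin (Traditional Chinese)
    "southern_min_news.txt",  # Taiwan Southern Min
    "hakka_news.txt",         # Taiwanese Hakka
]

tokenizer = BertWordPieceTokenizer(
    lowercase=False,            # keep case for any Latin-script text
    handle_chinese_chars=True,  # treat each CJK character as its own token
)

tokenizer.train(
    files=corpus_files,
    vocab_size=30000,           # assumed size, not a value taken from the thesis
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, which a BERT tokenizer and pretraining script can load.
os.makedirs("t-bert-vocab", exist_ok=True)
tokenizer.save_model("t-bert-vocab")
```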
We create two downstream tasks, LTN news classification and iCorpus news classification, to evaluate the model's performance on Taiwanese Mandarin and Taiwan Southern Min, respectively.
We show that T-BERT outperforms both Chinese BERT and multilingual BERT (mBERT), both released officially by Google, on the two downstream tasks, with a 16.88% higher F1 score on the iCorpus dataset and a 0.32% higher F1 score on the LTN dataset.
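To make the evaluation setup concrete, the sketch below shows one way a pretrained checkpoint could be fine-tuned for news classification and scored with F1, using the Hugging Face transformers library and scikit-learn. The checkpoint path ./t-bert, the label count, the macro averaging, and the dataset arguments are assumptions rather than details from the thesis.

```python
# Minimal sketch (assumed setup, not the thesis's exact code): fine-tune a
# pretrained BERT-style checkpoint for news-topic classification and report
# macro F1 on the evaluation split.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)


def compute_metrics(eval_pred):
    """Macro-averaged F1; the averaging choice is an assumption."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}


def finetune_news_classifier(train_dataset, eval_dataset,
                             checkpoint="./t-bert", num_labels=10):
    """Fine-tune `checkpoint` on tokenized news articles with integer labels.

    `checkpoint` and `num_labels` are placeholders; the datasets would hold
    LTN or iCorpus articles already encoded with the matching tokenizer.
    """
    model = BertForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="t-bert-news",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    return trainer.evaluate()  # metrics dict includes "eval_f1"
```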
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
DOI: 10.6342/NTU202004194
Fulltext Rights: Not authorized
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
File: U0001-3108202018150100.pdf (Restricted Access)
Size: 1.13 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.