Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
Full metadata record
(DC field: value [language])
dc.contributor.advisor: 廖世偉 (Shih-Wei Liao)
dc.contributor.author: Ming-Yen Chung [en]
dc.contributor.author: 鍾明諺 [zh_TW]
dc.date.accessioned: 2021-06-08T02:49:15Z
dc.date.copyright: 2020-09-03
dc.date.issued: 2020
dc.date.submitted: 2020-09-01
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
dc.description.abstract: Pretrained language models have shown clear improvements on many natural language processing tasks, and many communities have released models pretrained on text from different domains. BERT has become a standard model in the NLP community. However, the BERT models released for Taiwan's languages are pretrained only on Chinese Wikipedia text. Even if we further pretrain a released model, its vocabulary is still not suited to Taiwan's languages.
In this thesis we propose T-BERT, a model based on BERT but pretrained on three of Taiwan's languages: Taiwanese Mandarin (Traditional Chinese), Taiwan Southern Min, and Taiwanese Hakka, using a vocabulary generated specifically for these three languages.
We build two downstream tasks, Liberty Times Net (自由時報) news classification and iCorpus (臺華平行新聞語料庫) news classification, to evaluate the model on Taiwanese Mandarin and Taiwan Southern Min respectively.
We show that T-BERT outperforms both the Chinese BERT and multilingual BERT models released by Google on these two downstream tasks, with an F1 score 16.88% higher on iCorpus news classification and 0.32% higher on Liberty Times Net news classification. [zh_TW]
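The vocabulary mentioned in the abstract (one generated specifically for the three target languages) is the kind of artifact produced by training a subword tokenizer on the pretraining corpora. The following is a minimal sketch only, assuming the Hugging Face tokenizers library; the corpus file names, vocabulary size, and output directory are placeholders, not the thesis's actual tooling or settings.

```python
# Sketch: train a BERT-style WordPiece vocabulary from raw text corpora.
# All paths and the vocabulary size below are illustrative placeholders.
from tokenizers import BertWordPieceTokenizer

corpus_files = [
    "mandarin_corpus.txt",      # Taiwanese Mandarin (Traditional Chinese)
    "southern_min_corpus.txt",  # Taiwan Southern Min
    "hakka_corpus.txt",         # Taiwanese Hakka
]

tokenizer = BertWordPieceTokenizer(lowercase=False)  # keep characters/case as-is
tokenizer.train(
    files=corpus_files,
    vocab_size=30000,           # illustrative; the thesis's size is not given here
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("tbert_vocab")  # writes vocab.txt usable by BERT tokenizers
```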
dc.description.abstract: Pretrained language models have brought significant improvements on numerous Natural Language Processing (NLP) tasks. Many communities have released models pretrained on a variety of languages as well as different domain corpora within each language, typically employing variants of the Bidirectional Encoder Representations from Transformers (BERT). However, existing BERT models for Taiwan's languages are pretrained only on Chinese Wikipedia. Even though we can further pretrain the released model on domain corpora, its vocabulary is also not suitable for Taiwan's languages.
In this thesis, we propose T-BERT, a language model based on BERT but pretrained on Taiwan's three main languages: Taiwanese Mandarin (Traditional Chinese), Taiwan Southern Min, and Taiwanese Hakka, with a vocabulary generated for these three languages.
We create two downstream tasks, LTN news classification and iCorpus news classification, to evaluate the model's performance on Taiwanese Mandarin and Taiwan Southern Min, respectively.
We show that the T-BERT model outperforms both Chinese BERT and multilingual BERT (mBERT), both released officially by Google, on the two downstream tasks, with a 16.88% higher F1 score on the iCorpus dataset and a 0.32% higher F1 score on the LTN dataset. [en]
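The downstream evaluation described in the abstracts is standard text classification scored with F1. As a hedged sketch of how such a fine-tuning run could be set up (not the thesis's actual configuration), the example below uses the Hugging Face transformers Trainer; the checkpoint path, toy dataset, label count, F1 averaging, and hyperparameters are placeholders.

```python
# Sketch: fine-tune a BERT-style checkpoint for news-topic classification
# and score it with F1. Checkpoint path, data, and settings are placeholders.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/tbert-checkpoint"  # hypothetical local T-BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=8)

# Toy stand-ins for the LTN / iCorpus news data: text plus an integer topic label.
train_ds = Dataset.from_dict({"text": ["..."], "label": [0]})
eval_ds = Dataset.from_dict({"text": ["..."], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}  # averaging is a guess

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tbert-news", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,              # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports eval_f1 among other metrics
```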
dc.description.provenance: Made available in DSpace on 2021-06-08T02:49:15Z (GMT). No. of bitstreams: 1. U0001-3108202018150100.pdf: 1157984 bytes, checksum: e8c3e37fe78dbf379f3d0d2b85e18347 (MD5). Previous issue date: 2020 [en]
dc.description.tableofcontents:
口試委員會審定書 (Oral Examination Committee Certification) i
誌謝 (Acknowledgements) ii
摘要 (Chinese Abstract) iii
Abstract iv
1 Introduction 1
2 Background 3
2.1 Languages of Taiwan 3
2.1.1 Taiwanese Mandarin 3
2.1.2 Taiwan Southern Min 4
2.1.3 Taiwanese Hakka 4
2.2 Word Tokenization 4
2.2.1 Byte Pair Encoding 6
2.2.2 WordPiece 6
2.3 BERT 7
2.3.1 Transformer Architecture 8
2.3.2 BERT Architecture 8
3 Related Work 11
3.1 Pretrained Language Models 11
3.2 Pretrain BERT on Domain Specific Corpus 12
3.3 Multilingual Pretrained Model 12
3.4 Pretrained Chinese BERT 13
4 Method 14
4.1 Vocabulary 14
4.2 Model Architecture 15
4.3 Optimization 16
4.4 Pretraining Setup 16
5 Experiments 17
5.1 Pretraining Data 17
5.1.1 Mandarin 17
5.1.2 Taiwanese 18
5.1.3 Hakka 18
5.2 Pretraining Curve 18
5.3 Downstream Task 19
5.3.1 Liberty Times Net News Classification 19
5.3.2 iCorpus News Classification 20
5.4 Finetuning on Downstream Tasks 20
5.4.1 Liberty Times Net News Classification 22
5.4.2 iCorpus News Classification 22
5.5 Discussion 23
6 Conclusion and Future Work 24
Bibliography 25
dc.language.iso: en
dc.title: T-BERT:臺灣語言模型–以臺灣在地語言預訓練BERT模型 [zh_TW]
dc.title: T-BERT: A Taiwanese Language Model Pretraining BERT Model on Languages in Taiwan [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: 戴敏育 (Min-Yuh Day), 張富傑 (Fu-Chieh Chang), 傅為剛 (Wei-Kang Fu)
dc.subject.keyword: 自然語言處理 (natural language processing), 語言模型 (language model), 臺灣 (Taiwan), 新聞分類 (news classification) [zh_TW]
dc.subject.keyword: natural language processing, BERT, language model, news classification, Taiwan [en]
dc.relation.page: 29
dc.identifier.doi: 10.6342/NTU202004194
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2020-09-01
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: U0001-3108202018150100.pdf (not authorized for public access)
Size: 1.13 MB
Format: Adobe PDF