Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
Full metadata record
(DC field: value [language])
dc.contributor.advisor: 廖世偉 (Shih-Wei Liao)
dc.contributor.author: Ming-Yen Chung [en]
dc.contributor.author: 鍾明諺 [zh_TW]
dc.date.accessioned: 2021-06-08T02:49:15Z
dc.date.copyright: 2020-09-03
dc.date.issued: 2020
dc.date.submitted: 2020-09-01
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/20454
dc.description.abstract: Pretrained language models have shown clear improvements on many natural language processing tasks, and many communities have released models pretrained on text from different domains. BERT has become a standard model in the NLP community. However, the BERT models released for Taiwan's languages are pretrained only on Chinese Wikipedia text. Even if we further pretrain a released model, its vocabulary is still not suited to Taiwan's languages.
In this thesis we propose T-BERT, a model based on BERT but pretrained on three of Taiwan's languages: Taiwanese Mandarin (Traditional Chinese), Taiwan Southern Min, and Taiwanese Hakka, using a vocabulary generated specifically for these three languages.
We build two downstream tasks, Liberty Times Net (自由時報) news classification and iCorpus (臺華平行新聞語料庫) news classification, to evaluate the model on Taiwanese Mandarin and Taiwan Southern Min respectively.
We show that T-BERT outperforms both the Chinese BERT and multilingual BERT models released by Google on these two downstream tasks, with an F1 score 16.88% higher on iCorpus news classification and 0.32% higher on Liberty Times Net news classification. [zh_TW]
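The vocabulary mentioned in the abstract (one generated specifically for the three target languages) is the kind of artifact produced by training a subword tokenizer on the pretraining corpora. The following is a minimal sketch only, assuming the Hugging Face tokenizers library; the corpus file names, vocabulary size, and output directory are placeholders, not the thesis's actual tooling or settings.

```python
# Sketch: train a BERT-style WordPiece vocabulary from raw text corpora.
# All paths and the vocabulary size below are illustrative placeholders.
from tokenizers import BertWordPieceTokenizer

corpus_files = [
    "mandarin_corpus.txt",      # Taiwanese Mandarin (Traditional Chinese)
    "southern_min_corpus.txt",  # Taiwan Southern Min
    "hakka_corpus.txt",         # Taiwanese Hakka
]

tokenizer = BertWordPieceTokenizer(lowercase=False)  # keep characters/case as-is
tokenizer.train(
    files=corpus_files,
    vocab_size=30000,           # illustrative; the thesis's size is not given here
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("tbert_vocab")  # writes vocab.txt usable by BERT tokenizers
```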
dc.description.abstract: Pretrained language models have brought significant improvements on numerous Natural Language Processing (NLP) tasks. Many communities have released models pretrained on a variety of languages as well as different domain corpora within each language, typically employing variants of the Bidirectional Encoder Representations from Transformers (BERT). However, existing BERT models for Taiwan's languages are pretrained only on Chinese Wikipedia. Even though we can further pretrain the released model on domain corpora, its vocabulary is also not suitable for Taiwan's languages.
In this thesis, we propose T-BERT, a language model based on BERT but pretrained on Taiwan's three main languages: Taiwanese Mandarin (Traditional Chinese), Taiwan Southern Min, and Taiwanese Hakka, with a vocabulary generated for these three languages.
We create two downstream tasks, LTN news classification and iCorpus news classification, to evaluate the model's performance on Taiwanese Mandarin and Taiwan Southern Min, respectively.
We show that the T-BERT model outperforms both Chinese BERT and multilingual BERT (mBERT), both released officially by Google, on the two downstream tasks, with a 16.88% higher F1 score on the iCorpus dataset and a 0.32% higher F1 score on the LTN dataset. [en]
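The downstream evaluation described in the abstracts is standard text classification scored with F1. As a hedged sketch of how such a fine-tuning run could be set up (not the thesis's actual configuration), the example below uses the Hugging Face transformers Trainer; the checkpoint path, toy dataset, label count, F1 averaging, and hyperparameters are placeholders.

```python
# Sketch: fine-tune a BERT-style checkpoint for news-topic classification
# and score it with F1. Checkpoint path, data, and settings are placeholders.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/tbert-checkpoint"  # hypothetical local T-BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=8)

# Toy stand-ins for the LTN / iCorpus news data: text plus an integer topic label.
train_ds = Dataset.from_dict({"text": ["..."], "label": [0]})
eval_ds = Dataset.from_dict({"text": ["..."], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}  # averaging is a guess

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tbert-news", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,              # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports eval_f1 among other metrics
```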
dc.description.provenance: Made available in DSpace on 2021-06-08T02:49:15Z (GMT). No. of bitstreams: 1. U0001-3108202018150100.pdf: 1157984 bytes, checksum: e8c3e37fe78dbf379f3d0d2b85e18347 (MD5). Previous issue date: 2020 [en]
dc.description.tableofcontents:
口試委員會審定書 (Oral Examination Committee Certification) i
誌謝 (Acknowledgements) ii
摘要 (Chinese Abstract) iii
Abstract iv
1 Introduction 1
2 Background 3
2.1 Languages of Taiwan 3
2.1.1 Taiwanese Mandarin 3
2.1.2 Taiwan Southern Min 4
2.1.3 Taiwanese Hakka 4
2.2 Word Tokenization 4
2.2.1 Byte Pair Encoding 6
2.2.2 WordPiece 6
2.3 BERT 7
2.3.1 Transformer Architecture 8
2.3.2 BERT Architecture 8
3 Related Work 11
3.1 Pretrained Language Models 11
3.2 Pretrain BERT on Domain Specific Corpus 12
3.3 Multilingual Pretrained Model 12
3.4 Pretrained Chinese BERT 13
4 Method 14
4.1 Vocabulary 14
4.2 Model Architecture 15
4.3 Optimization 16
4.4 Pretraining Setup 16
5 Experiments 17
5.1 Pretraining Data 17
5.1.1 Mandarin 17
5.1.2 Taiwanese 18
5.1.3 Hakka 18
5.2 Pretraining Curve 18
5.3 Downstream Task 19
5.3.1 Liberty Times Net News Classification 19
5.3.2 iCorpus News Classification 20
5.4 Finetuning on Downstream Tasks 20
5.4.1 Liberty Times Net News Classification 22
5.4.2 iCorpus News Classification 22
5.5 Discussion 23
6 Conclusion and Future Work 24
Bibliography 25
dc.language.iso: en
dc.title: T-BERT:臺灣語言模型–以臺灣在地語言預訓練BERT模型 [zh_TW]
dc.title: T-BERT: A Taiwanese Language Model Pretraining BERT Model on Languages in Taiwan [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: 戴敏育 (Min-Yuh Day), 張富傑 (Fu-Chieh Chang), 傅為剛 (Wei-Kang Fu)
dc.subject.keyword: 自然語言處理 (natural language processing), 語言模型 (language model), 臺灣 (Taiwan), 新聞分類 (news classification) [zh_TW]
dc.subject.keyword: natural language processing, BERT, language model, news classification, Taiwan [en]
dc.relation.page: 29
dc.identifier.doi: 10.6342/NTU202004194
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2020-09-01
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: U0001-3108202018150100.pdf (not authorized for public access)
Size: 1.13 MB
Format: Adobe PDF