Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65360
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 陳縕儂(Yun-Nung Chen) | |
dc.contributor.author | Ting-Yun Chang | en |
dc.contributor.author | 張婷雲 | zh_TW |
dc.date.accessioned | 2021-06-16T23:38:33Z | - |
dc.date.available | 2020-03-17 | |
dc.date.copyright | 2020-03-17 | |
dc.date.issued | 2020 | |
dc.date.submitted | 2020-02-19 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65360 | - |
dc.description.abstract | 本篇論文旨在探究語境詞嵌入中的詞義訊息,並藉由人類可讀的自然語言呈現之。在語境詞嵌入提升了許多自然語言處理領域課題的同時,詞表徵本身習得了哪些資訊仍舊是開放性的議題。本論文探索不同的預訓練語境詞嵌入,探究能否從中提取出合適的詞定義。
明確而言,給定一個多義的單詞、包含這個單詞的例句,以及其在辭典裡的定義,我們想知道透過受測的語境詞嵌入,模型能產生多合適的詞定義。我們提出一個能兼容不同詞嵌入的架構,訓練其在兩個語義連續空間中的映射,亦即語境詞嵌入空間和詞定義表徵空間。此算法將原本的生成問題轉為分類問題,旨在避免自然語言生成的困難,如此大幅提升了實驗結果,同時維持能以人類自然語言呈現詞定義的性質。我們也藉由一詞多義辨別的子任務來證實所提出架構的功效。

將問題轉換的另一個目標為提供一個更合理的方式來探究預訓練的語境詞嵌入:此前的研究往往透過訓練不良的解碼器產生不當的詞定義,故而無法清楚反映受測詞嵌入本身的問題。相對地,我們的架構從人所撰寫的辭典中取得詞定義,而我們的映射模型作為探測器,用以評估預訓練語境詞嵌入的語言知識含量及不足之處。我們發現 BERT 模型較 ELMo 模型似含有更充足的詞義訊息,並列出兩者共有的問題。我們的觀察或有助於了解在語境詞嵌入中,哪些資訊被捕捉,哪些資訊有所遺失。 | zh_TW |
dc.description.abstract | The main purpose of this thesis is to investigate the sense information encoded in contextualized word representations through human-readable definitions. As contextualized word embeddings have boosted many downstream NLP tasks compared with traditional static ones, what has been learned in these representations remains an open question. In this thesis, we explore different kinds of contextualized word embeddings to see whether suitable definitions can be distilled from these pretrained representations.
Specifically, given a multi-sensed target word to be defined, the context containing the target word, and the ground-truth definition from the dictionary, we would like to see how well the evaluated contextualized embeddings of the target word can produce acceptable definitions. We propose a framework that readily incorporates different embedding types, in which the algorithm learns a mapping between two semantically continuous spaces: the space of word representations and the space of definitions. The algorithm reformulates the traditional definition modeling task, avoiding the difficulty of natural language generation by transforming it into a classification task; this significantly improves performance while maintaining the ability to provide human-readable definitions.

The main goal of our reformulation is to provide a more reasonable way to assess the given pretrained contextualized word embeddings, as the unsatisfying definitions previously generated from a crippled decoder cannot clearly reflect the problems of the evaluated word embeddings. Instead, our framework retrieves definitions from a well-written dictionary, and our mapping serves as a probe to explore the inherent linguistic knowledge and the limitations that lie in the pretrained contextualized word embeddings. We found that BERT seems to be more sense-informative than ELMo, and we list some shortcomings shared by both. Our observations may help better understand what is captured and what is lost in contextualized representations. | en |
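To make the retrieval-as-classification reformulation above concrete, the sketch below is a minimal illustration of the idea only, not code from the thesis: the dimensions, the choice of a single linear layer, and names such as `DefinitionProbe` and `definition_bank` are all hypothetical. A small learned mapping projects a contextualized target-word embedding into the definition-embedding space, and the predicted definition is the nearest entry in a precomputed bank of dictionary-definition embeddings:

```python
import torch
import torch.nn as nn

class DefinitionProbe(nn.Module):
    """Maps a contextualized word embedding into the definition-embedding
    space; definition retrieval then reduces to nearest-neighbor search,
    i.e., a classification over the dictionary entries."""
    def __init__(self, word_dim: int, def_dim: int):
        super().__init__()
        self.proj = nn.Linear(word_dim, def_dim)  # the learned mapping (probe)

    def forward(self, word_vec: torch.Tensor) -> torch.Tensor:
        return self.proj(word_vec)

# Toy setup (hypothetical sizes): a 768-d contextualized embedding of the
# target word in its context (e.g., from BERT), and 1,000 dictionary
# definitions pre-encoded into 512-d vectors by a sentence encoder.
probe = DefinitionProbe(word_dim=768, def_dim=512)
word_vec = torch.randn(768)
definition_bank = torch.randn(1000, 512)

# Map the word embedding into definition space and pick the most similar
# definition; during training, a cross-entropy loss over these similarity
# scores would push the correct definition's score up.
mapped = probe(word_vec)
scores = torch.cosine_similarity(mapped.unsqueeze(0), definition_bank, dim=-1)
print("retrieved definition index:", scores.argmax().item())
```

Because both the word embeddings and the definition embeddings stay frozen while only the small mapping is trained, retrieval quality mostly reflects what the pretrained representations encode, which is what makes such a mapping usable as a probe.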
dc.description.provenance | Made available in DSpace on 2021-06-16T23:38:33Z (GMT). No. of bitstreams: 1 ntu-109-R06922168-1.pdf: 1662955 bytes, checksum: 2d5fef90b50e4224601c0e039164cbbd (MD5) Previous issue date: 2020 | en |
dc.description.tableofcontents | 1 Introduction
  1.1 Motivation
  1.2 Main Contributions
  1.3 Thesis Structure
2 Background
  2.1 Recurrent Neural Models
    2.1.1 Recurrent Neural Network (RNN)
    2.1.2 Gated Recurrent Unit (GRU)
  2.2 Transformer
    2.2.1 Multi-Head Attention
    2.2.2 Positional Encoding
  2.3 Word Representation
    2.3.1 Static Word Embedding
    2.3.2 Contextualized Word Embedding
3 Dataset
4 Related Work
  4.1 Definition Modeling
  4.2 Probes into Representations
  4.3 Word Sense in Contextualized Representations
5 Word Sense Explanation
  5.1 Model Objective
  5.2 Mapping Architecture
  5.3 Incorporating POS Tagging
  5.4 Reverse Mapping
6 Experiments
  6.1 Definition Retrieval
    6.1.1 Tasks
    6.1.2 Baseline Models
  6.2 Results
    6.2.1 Quantitative Results
    6.2.2 Analysis
  6.3 Word Sense Selection in Context
  6.4 Experimental Details
    6.4.1 Evaluation Metrics
    6.4.2 Hyperparameter Tuning
    6.4.3 Details of Pretrained Representations
    6.4.4 Fine-Tuning BERT
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
    7.2.1 Zero-Shot Definition Generation
    7.2.2 External Linguistic Knowledge
    7.2.3 Evaluation Metric and Pretrained Sentence Encoder
Bibliography | |
dc.language.iso | en | |
dc.title | 以辭典定義探究語境詞嵌入 | zh_TW |
dc.title | Probing Contextualized Word Embedding with Definition | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-2 | |
dc.description.degree | Master |
dc.contributor.oralexamcommittee | 古倫維(Lun-Wei Ku),黃挺豪(Ting-Hao Huang) | |
dc.subject.keyword | 類神經網絡,詞嵌入,消歧義 | zh_TW |
dc.subject.keyword | neural networks, word embedding, word sense disambiguation | en |
dc.relation.page | 50 | |
dc.identifier.doi | 10.6342/NTU202000521 | |
dc.rights.note | Paid-access authorization |
dc.date.accepted | 2020-02-20 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-109-1.pdf (not currently authorized for public access) | 1.62 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.