NTU Theses and Dissertations Repository › College of Management › Department of Information Management
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70433
Full metadata record
dc.contributor.advisor: 盧信銘 (Hsin-Min Lu)
dc.contributor.author: 曾千蕙 (Qian-Hui Zeng)
dc.date.accessioned: 2021-06-17T04:28:05Z
dc.date.available: 2020-08-16
dc.date.copyright: 2018-08-16
dc.date.issued: 2018
dc.date.submitted: 2018-08-13
dc.identifier.citation:
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 238-247). Baltimore, Maryland.
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137-1155.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
Bruni, E., Tran, N.-K., & Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49, 1-47.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Erk, K. (2012). Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10), 635-653.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web (pp. 406-414).
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.
Guo, J., Che, W., Wang, H., & Liu, T. (2014). Revisiting embedding features for simple semi-supervised learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 110-120).
Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665-695.
Hoffman, M., Bach, F. R., & Blei, D. M. (2010). Online learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems (pp. 856-864).
Lai, S., Liu, K., He, S., & Zhao, J. (2016). How to generate a good word embedding. IEEE Intelligent Systems, 31(6), 5-14.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25(2-3), 259-284.
Liu, Y., Liu, Z., Chua, T.-S., & Sun, M. (2015). Topical word embeddings. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 2418-2424). Austin, Texas.
Burgess, C., & Lund, K. (1997). Modelling parsing constraints with high-dimensional context space. Language and Cognitive Processes, 12(2-3), 177-210.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
Luong, T., Socher, R., & Manning, C. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (pp. 104-113).
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Paper presented at the International Conference on Learning Representations (ICLR) Workshop Track.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1-28.
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
Rohde, D. L., Gonnerman, L. M., & Plaut, D. C. (2006). An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8, 627-633.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1631-1642).
Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 384-394). Uppsala, Sweden.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70433
dc.description.abstract: Word vectorization (also known as word embedding or distributional word representation) is a family of approaches that convert words into fixed-length vectors, and is widely used in text mining and natural language processing. However, few studies have systematically compared the performance of these methods. This study compared eight word vectorization methods built on different techniques, including matrix factorization, topic models, and neural networks. We compared their performance using both intrinsic and extrinsic evaluations. Intrinsic evaluations examined the association, similarity, and analogy relationships between word vectors; for extrinsic evaluation we used a named entity recognition (NER) task. In the intrinsic evaluations, neural-network-based methods such as continuous bag-of-words (CBOW) and Skip-gram performed best, followed by GloVe, a method that extracts latent vectors from a word-context matrix. Methods that rely on document-level information, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), did not perform well in our comparison. In the extrinsic evaluation, Skip-gram and HAL, a relatively simple matrix factorization method, brought the largest improvement in NER performance, while LDA and CBOW brought the least. These results imply that the ranking of methods under intrinsic evaluations may be inconsistent with their ranking under extrinsic evaluations. Future studies could therefore include more extrinsic evaluation tasks in order to characterize the relationship between intrinsic and extrinsic performance.
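The similarity and analogy checks used in the intrinsic evaluations reduce to simple vector arithmetic over the learned embeddings. A minimal sketch with hypothetical 4-dimensional toy vectors (real embeddings are typically 100-300 dimensional; none of these numbers come from the thesis):

```python
import numpy as np

# Hypothetical toy embedding table, for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.0, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

def cosine(u, v):
    """Cosine similarity, the standard measure in word-similarity benchmarks."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, emb):
    """3CosAdd: answer 'a is to b as c is to ?' by maximizing cos(b - a + c, d),
    excluding the three query words from the candidate set."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(cosine(emb["king"], emb["queen"]))   # a score in [-1, 1]
print(analogy("man", "king", "woman", emb))  # prints "queen"
```

Benchmarks such as SimLex-999 and the Google analogy set (cited in this record) score an embedding by how well these quantities correlate with human judgments or recover the held-out word.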
dc.description.provenance: Made available in DSpace on 2021-06-17T04:28:05Z (GMT). No. of bitstreams: 1. ntu-107-R05725004-1.pdf: 2533319 bytes, checksum: c32076f76f81c940b8c7e9fc18c6615d (MD5). Previous issue date: 2018.
dc.description.tableofcontents:
Oral Examination Committee Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
1. Introduction 1
2. Literature Review 3
2.1 Matrix Factorization on co-occurrence matrix 3
2.2 Topic Models 8
2.3 Neural-Network-Based models 9
2.4 Related works 12
2.5 Summary 13
3. Comparison of Word Vectorization Methods 14
3.1 Word Vectorization Methods 14
3.2 Training Corpus 16
3.3 Intrinsic Evaluations 16
3.4 Extrinsic Evaluations 19
4. Experimental Results 22
5. Discussion 32
5.1 Performances between different tasks 32
5.2 Performances between different methods 34
6. Conclusions 36
Reference 37
Appendix A. Visualization of Word Vectors 40
dc.language.iso: en
dc.subject: Word embedding
dc.subject: Word vectorization
dc.subject: Topic model
dc.subject: Distributional word representation
dc.subject: Matrix factorization
dc.subject: Neural network language model
dc.title: 詞向量化方法之比較與應用 (Word Vectorization Methods: Comparisons and Applications)
dc.type: Thesis
dc.date.schoolyear: 106-2
dc.description.degree: Master
dc.contributor.oralexamcommittee: 施人英 (Jen-Ying Shih), 吳家齊 (Chia-Chi Wu)
dc.subject.keyword: Word vectorization, Word embedding, Topic model, Distributional word representation, Matrix factorization, Neural network language model
dc.relation.page: 44
dc.identifier.doi: 10.6342/NTU201803234
dc.rights.note: Paid authorization
dc.date.accepted: 2018-08-14
dc.contributor.author-college: College of Management
dc.contributor.author-dept: Graduate Institute of Information Management
Appears in Collections: Department of Information Management

Files in this item:
ntu-107-1.pdf (2.47 MB, Adobe PDF): not authorized for public access
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
