NTU Theses and Dissertations Repository › College of Management › Department of Information Management
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70433
Full metadata record
dc.contributor.advisor: 盧信銘 (Hsin-Min Lu)
dc.contributor.author: 曾千蕙 (Qian-Hui Zeng)
dc.date.accessioned: 2021-06-17T04:28:05Z
dc.date.available: 2020-08-16
dc.date.copyright: 2018-08-16
dc.date.issued: 2018
dc.date.submitted: 2018-08-13
dc.identifier.citation:
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 238-247). Baltimore, Maryland.
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137-1155.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
Bruni, E., Tran, N.-K., & Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49, 1-47.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Erk, K. (2012). Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10), 635-653.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web (pp. 406-414).
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.
Guo, J., Che, W., Wang, H., & Liu, T. (2014). Revisiting embedding features for simple semi-supervised learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 110-120).
Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665-695.
Hoffman, M., Bach, F. R., & Blei, D. M. (2010). Online learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems (pp. 856-864).
Lai, S., Liu, K., He, S., & Zhao, J. (2016). How to generate a good word embedding. IEEE Intelligent Systems, 31(6), 5-14.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25(2-3), 259-284.
Liu, Y., Liu, Z., Chua, T.-S., & Sun, M. (2015). Topical word embeddings. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 2418-2424). Austin, Texas.
Burgess, C., & Lund, K. (1997). Modelling parsing constraints with high-dimensional context space. Language and Cognitive Processes, 12(2-3), 177-210.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
Luong, T., Socher, R., & Manning, C. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (pp. 104-113).
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Paper presented at the International Conference on Learning Representations (ICLR) Workshop Track.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1-28.
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
Rohde, D. L., Gonnerman, L. M., & Plaut, D. C. (2006). An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8, 627-633.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1631-1642).
Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 384-394). Uppsala, Sweden.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70433
dc.description.abstract: Word vectorization (also known as word embedding or distributional word representation) is a family of approaches that convert words into fixed-length vectors, and is widely used in text mining and natural language processing. However, few studies have systematically compared the performance of these methods. This study compared eight word vectorization methods built on different techniques, including matrix factorization, topic models, and neural networks. We compared their performance using both intrinsic and extrinsic evaluations. Intrinsic evaluations examined the association, similarity, and analogy relationships between word vectors; for extrinsic evaluation we used a named entity recognition (NER) task. In the intrinsic evaluations, neural-network-based methods such as continuous bag-of-words (CBOW) and Skip-gram performed best, followed by GloVe, a method that extracts latent vectors from a word-context matrix. Methods that rely on document-level information, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), did not perform well in our comparison. In the extrinsic evaluation, Skip-gram and HAL, a relatively simple matrix factorization method, brought the largest improvement in NER performance, while LDA and CBOW brought the least. These results imply that the ranking of methods under intrinsic evaluations may be inconsistent with their ranking under extrinsic evaluations. Future studies could therefore include more extrinsic evaluation tasks in order to characterize the relationship between intrinsic and extrinsic performance.
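The similarity and analogy checks used in the intrinsic evaluations reduce to simple vector arithmetic over the learned embeddings. A minimal sketch with hypothetical 4-dimensional toy vectors (real embeddings are typically 100-300 dimensional; none of these numbers come from the thesis):

```python
import numpy as np

# Hypothetical toy embedding table, for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.0, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

def cosine(u, v):
    """Cosine similarity, the standard measure in word-similarity benchmarks."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, emb):
    """3CosAdd: answer 'a is to b as c is to ?' by maximizing cos(b - a + c, d),
    excluding the three query words from the candidate set."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(cosine(emb["king"], emb["queen"]))   # a score in [-1, 1]
print(analogy("man", "king", "woman", emb))  # prints "queen"
```

Benchmarks such as SimLex-999 and the Google analogy set (cited in this record) score an embedding by how well these quantities correlate with human judgments or recover the held-out word.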
dc.description.provenance: Made available in DSpace on 2021-06-17T04:28:05Z (GMT). No. of bitstreams: 1. ntu-107-R05725004-1.pdf: 2533319 bytes, checksum: c32076f76f81c940b8c7e9fc18c6615d (MD5). Previous issue date: 2018.
dc.description.tableofcontents:
Oral Examination Committee Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
1. Introduction 1
2. Literature Review 3
2.1 Matrix Factorization on co-occurrence matrix 3
2.2 Topic Models 8
2.3 Neural-Network-Based models 9
2.4 Related works 12
2.5 Summary 13
3. Comparison of Word Vectorization Methods 14
3.1 Word Vectorization Methods 14
3.2 Training Corpus 16
3.3 Intrinsic Evaluations 16
3.4 Extrinsic Evaluations 19
4. Experimental Results 22
5. Discussion 32
5.1 Performances between different tasks 32
5.2 Performances between different methods 34
6. Conclusions 36
Reference 37
Appendix A. Visualization of Word Vectors 40
dc.language.iso: en
dc.subject: Word embedding
dc.subject: Word vectorization
dc.subject: Topic model
dc.subject: Distributional word representation
dc.subject: Matrix factorization
dc.subject: Neural network language model
dc.title: 詞向量化方法之比較與應用 (Word Vectorization Methods: Comparisons and Applications)
dc.type: Thesis
dc.date.schoolyear: 106-2
dc.description.degree: Master
dc.contributor.oralexamcommittee: 施人英 (Jen-Ying Shih), 吳家齊 (Chia-Chi Wu)
dc.subject.keyword: Word vectorization, Word embedding, Topic model, Distributional word representation, Matrix factorization, Neural network language model
dc.relation.page: 44
dc.identifier.doi: 10.6342/NTU201803234
dc.rights.note: Paid authorization
dc.date.accepted: 2018-08-14
dc.contributor.author-college: College of Management
dc.contributor.author-dept: Graduate Institute of Information Management
Appears in Collections: Department of Information Management

Files in this item:
ntu-107-1.pdf (2.47 MB, Adobe PDF): not authorized for public access
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
