Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49752
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 丁肇隆,張瑞益 | |
dc.contributor.author | Chien-Ching Chiu | en |
dc.contributor.author | 邱建晴 | zh_TW |
dc.date.accessioned | 2021-06-15T11:46:05Z | - |
dc.date.available | 2021-08-13 | |
dc.date.copyright | 2016-08-25 | |
dc.date.issued | 2016 | |
dc.date.submitted | 2016-08-14 | |
dc.identifier.citation | [1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In 2nd International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2006 - 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, 2006.
[2] A. Benczúr, I. Bíró, K. Csalogány, and T. Sarlós. Web spam detection via commercial intent analysis. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 89–92. ACM, 2007.
[3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430. ACM, 2007.
[4] K. Chellapilla and D. M. Chickering. Improving cloaking detection using search query popularity and monetizability. In AIRWeb, pages 17–23, 2006.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, pages 1–6. ACM, 2004.
[7] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 170–177. ACM, 2005.
[8] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
[9] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, pages 576–587. VLDB Endowment, 2004.
[10] G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society, volume 1, page 12. Amherst, MA, 1986.
[11] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.
[12] N. Immorlica, K. Jain, M. Mahdian, and K. Talwar. Click fraud resistant methods for learning click-through rates. In Internet and Network Economics, pages 34–45. Springer, 2005.
[13] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] P. Kolari, A. Java, T. Finin, T. Oates, and A. Joshi. Detecting spam blogs: A machine learning approach. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 1351. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2006.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li. BrowseRank: letting web users vote for page importance. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 451–458. ACM, 2008.
[19] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[20] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, volume 2, page 3, 2010.
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
[22] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
[23] G. Mishne, D. Carmel, R. Lempel, et al. Blocking blog spam with language model disagreement. In AIRWeb, volume 5, pages 1–6, 2005.
[24] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th international conference on World Wide Web, pages 83–92. ACM, 2006.
[25] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[26] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. 1999.
[27] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[28] J. Piskorski, M. Sydow, and D. Weiss. Exploring linguistic features for web spam detection: a preliminary study. In Proceedings of the 4th international workshop on Adversarial information retrieval on the web, pages 25–28. ACM, 2008.
[29] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
[30] R. Řehůřek and P. Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[32] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-dependent deep neural networks. In Interspeech, pages 437–440, 2011.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] N. Spirin and J. Han. Survey on web spam detection: principles and algorithms. ACM SIGKDD Explorations Newsletter, 13(2):50–64, 2012.
[35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[36] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. ICML (3), 28:1139–1147, 2013.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[38] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[39] Y. Zhang and B. Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49752 | - |
dc.description.abstract | 本論文提出一套基於卷積神經網路的文章過濾系統,針對痞客邦網站的部落格文章進行過濾。文章經本論文提出之系統過濾後可使讀者有更優質的閱讀體驗,也讓研究者有更純淨的繁體中文語料庫做為研究資源。
文章使用預先訓練的詞向量表進行編碼,編碼後訓練卷積神經網路對文章擷取特徵並分類,網路所輸出的分數可以對文章分類,或做為文章優劣程度的指標,其錯誤率為 8.8%,有著比統計模型的 13.7% 更好的成效。我們提供了卷積神經網路之於繁體中文文章分類的訓練方法。 在本論文使用的卷積神經網路之中,我們發現,卷積層中所擷取的特徵,與文章中重要的關鍵字有著高度的相關性。另一方面,文章經卷積與降採樣後的結果,可以直接轉做其他分類工作的輸入特徵,效果優於部分統計特徵。 | zh_TW |
dc.description.abstract | This thesis proposes a blog spam filtering system based on a convolutional neural network (CNN), aimed at filtering blog posts on Pixnet. Articles filtered by the proposed system not only give readers a better reading experience, but also provide researchers with a cleaner traditional Chinese corpus as a research resource.
The CNN is trained on a Pixnet blog dataset, with articles encoded using pre-trained word vectors, for spam/non-spam classification. The score output by the CNN can be used either to classify an article or as an index of its spam level, and it outperforms statistical classification methods (an error rate of 8.8% versus 13.7%). The CNN configuration for training a traditional Chinese text classifier is reported in detail. One observation from our experiments is that the features extracted by each filter in the convolutional layer are highly relevant to important keywords in the articles. In addition, the descriptors extracted from our CNN achieve acceptable performance on another text classification task, outperforming both a roughly tuned CNN and the bag-of-words method. | en
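The pipeline the abstract describes (embed words with pre-trained vectors, convolve with multiple filter widths, max-pool over time, classify) follows the sentence-CNN design of Kim [13]. Below is a minimal NumPy sketch of the feature-extraction stage only; the vocabulary, dimensions, and random weights are illustrative stand-ins, not values from the thesis, and a random table replaces the pre-trained word2vec embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8            # embedding dimension (illustrative)
FILTER_WIDTHS = [2, 3]  # words covered by each convolution filter
N_FILTERS = 4           # filters per width

# Stand-in embedding table; the thesis uses pre-trained word2vec vectors.
vocab = {"free": 0, "click": 1, "buy": 2, "travel": 3, "food": 4}
emb = rng.normal(size=(len(vocab), EMB_DIM))

# Random filter weights; in practice these are learned by backpropagation.
filters = {w: rng.normal(size=(N_FILTERS, w * EMB_DIM)) for w in FILTER_WIDTHS}

def cnn_features(tokens):
    """Embed tokens, apply 1-D convolutions, then max-pool each filter over time."""
    x = emb[[vocab[t] for t in tokens]]              # (seq_len, EMB_DIM)
    feats = []
    for w in FILTER_WIDTHS:
        W = filters[w]                               # (N_FILTERS, w * EMB_DIM)
        windows = np.stack([x[i:i + w].ravel()       # sliding windows of w words
                            for i in range(len(tokens) - w + 1)])
        conv = np.maximum(windows @ W.T, 0.0)        # ReLU, (n_windows, N_FILTERS)
        feats.append(conv.max(axis=0))               # max over time per filter
    return np.concatenate(feats)                     # fixed-length descriptor

f = cnn_features(["free", "click", "buy", "travel"])
print(f.shape)  # (8,) = 2 widths * 4 filters, independent of article length
```

Because max-pooling collapses the time axis, articles of any length map to the same fixed-length descriptor, which is what allows the pooled features to be reused as inputs to other classifiers, as the abstract notes.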
dc.description.provenance | Made available in DSpace on 2021-06-15T11:46:05Z (GMT). No. of bitstreams: 1 ntu-105-R03525087-1.pdf: 9401275 bytes, checksum: 1ac3b0a63b38efdb5780e53e2f384758 (MD5) Previous issue date: 2016 | en |
dc.description.tableofcontents | 誌謝 iii
摘要 v
Abstract vii
1 緒論 1
1.1 源起 1
1.2 論文架構 3
2 相關研究 5
2.1 Spam 偵測與自然語言處理 5
2.2 深度學習與卷積神經網路 7
3 部落格文章前處理 11
3.1 中文斷詞 12
3.1.1 隱藏馬可夫模型 12
3.1.2 Viterbi 演算法 13
3.2 詞向量 14
3.2.1 Word2vec 14
3.2.2 Image2vec 18
4 卷積神經網路 21
4.1 網路結構 22
4.1.1 卷積層 24
4.1.2 降採樣層 26
4.1.3 嵌入層 27
4.2 訓練 28
4.2.1 後傳遞演算法 30
4.2.2 訓練參數 31
5 實驗結果與討論 35
5.1 資料集 35
5.2 特徵設計 36
5.2.1 文章長度的影響 36
5.2.2 文章訊息量的影響 38
5.2.3 文章編碼方式的影響 38
5.3 敏感度分析 39
5.3.1 關鍵字與停用字遮擋 39
5.3.2 關鍵字的卷積特徵 42
5.4 CNN 特徵 43
6 結論與未來展望 51
Reference 53
A 痞客邦網站資料格式 59
A.1 部落格文章完整內容範例 59
A.2 痞客邦文章分類一覽 66
B Google AdSense 內容政策 67 | |
dc.language.iso | zh-TW | |
dc.title | 以卷積神經網路分析部落格社群網站垃圾文章 | zh_TW |
dc.title | Spam Filtering on Social Media Posts Using Convolutional Neural Networks | en |
dc.type | Thesis | |
dc.date.schoolyear | 104-2 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 黃乾綱,張恆華,呂承諭 | |
dc.subject.keyword | 社群網站,垃圾文章偵測,卷積神經網路,深度學習, | zh_TW |
dc.subject.keyword | Social network,Spam detection,Convolutional neural network,Deep learning, | en |
dc.relation.page | 68 | |
dc.identifier.doi | 10.6342/NTU201602340 | |
dc.rights.note | 有償授權 | |
dc.date.accepted | 2016-08-15 | |
dc.contributor.author-college | 工學院 | zh_TW |
dc.contributor.author-dept | 工程科學及海洋工程學研究所 | zh_TW |
Appears in Collections: | 工程科學及海洋工程學系
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-105-1.pdf (currently not authorized for public access) | 9.18 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.