An efficient crawling algorithm for large-scale real-time social stream data collection based on popularity prediction

Shih-En Chou; 周世恩

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52080

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	黃乾綱(Chien-Kang Huang)
dc.contributor.author	Shih-En Chou	en
dc.contributor.author	周世恩	zh_TW
dc.date.accessioned	2021-06-15T14:07:20Z	-
dc.date.available	2017-08-25
dc.date.copyright	2015-08-25
dc.date.issued	2015
dc.date.submitted	2015-08-20
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52080	-
dc.description.abstract	社群網路近年來改變了我們的溝通方式，累積巨量人類行為活動資料，吸引許多新興研究主題與社群網路行為分析結合。進行問題分析的過程中往往需要一個龐大的數據量，最近更朝向時域上分析，每隔一段時間必須對特定的研究標的做一次快照，熱門的訊息尤需要更密集的快照以洞察使用者行為隨著時間上變化。受限於這些社群網路有複雜的網絡，以及爬蟲對於數據存取量和頻率限制，對於多數機構的數據採集部門而言並不容易，且於資料取得之效能上無法進行有效優化。為了取得即時且足夠的資料，必須高頻率對社群網路存取，不僅浪費網路資源，亦增加社群網路的負荷。此外，目前社群網路隱私政策不允許不同單位共享數據，Facebook甚至透過加密的ID來保護使用者資料。這些限制增加單一研究機構與其他機構共享數據的難度，無法利用現有的爬行調度算法與其他機構分配資料收集方式。在本文中，我們提出了一種新爬行排序演算法，考慮用戶過去的行為，隨著收集的資料越多，越能預測該收集標的是否熱門以及有更多文章發布。所設計的演算法可以解決大型垂直爬行資源分配與動態網頁無法通過一般的爬蟲採用的問題。在本研究中，我們運用單位資源內收集的訊息熱度來評估爬行性能。實驗結果呈現我們的演算法在收集社群網路99.5%熱門的訊息能最高節省40%爬蟲網路呼叫次數。	zh_TW
dc.description.abstract	Social media has greatly changed the way we communicate, and a huge amount of social behavior data is recorded and accumulated as a result. This data is now widely applied to emerging research topics in combination with social behavior analysis. More recently, time-domain analysis has become especially popular for investigating behavior change: researchers take snapshots of a particular subject in the network at regular intervals, and popular messages (posts) need more frequent snapshots so that changes in user behavior over time can be tracked precisely. Scraping social networking sites such as Twitter and Facebook is not an easy task for the data acquisition departments of most institutions, because these sites often have complex structures and restrict the amount and frequency of data they release to crawlers. To get more snapshots, groups often consume more computation power and network resources, and even increase the load on OSN (Online Social Network) sites. In addition, current privacy control policies do not allow different groups to share data with one another. These constraints make it difficult for an individual research group to collect sufficient data, whether by using existing crawl scheduling algorithms or by collaborating with other partners. In this paper, we propose a novel crawl-ordering algorithm that allows our crawlers to focus on popular content by collecting and analyzing user behavior. The designed crawler also addresses the problems of large-scale vertical crawling and of dynamic web pages. The performance of the crawl-ordering algorithm is evaluated with purpose-designed metrics. The experimental results show that the algorithm can save up to 40% of requests while crawling the top 99.5% of the popular social stream.	en
dc.description.provenance	Made available in DSpace on 2021-06-15T14:07:20Z (GMT). No. of bitstreams: 1 ntu-104-R01525052-1.pdf: 6636210 bytes, checksum: 2ab40b70a7d4a5c398f0f3fda0c7f65a (MD5) Previous issue date: 2015	en
dc.description.tableofcontents	Oral Examination Committee Certification #; Acknowledgements i; Chinese Abstract ii; ABSTRACT iii; CONTENTS iv; LIST OF FIGURES vi; LIST OF TABLES viii; Chapter 1 Introduction 1; 1.1 Background 1; 1.2 Problem 2; 1.3 Solution 3; 1.4 Scope and organization of this thesis 4; Chapter 2 Related Works 5; 2.1 Web crawling 5; 2.2 Demand of user behavior analysis 6; 2.3 Data source selection and privacy issue 7; 2.4 Challenge of crawling and the corresponding solution 10; 2.5 Temporal Analysis 13; 2.6 Popularity Prediction 13; Chapter 3 System Overview 15; 3.1 Crawling Architecture 15; 3.2 Scheduler Implementation 18; 3.3 Fetch Posts 19; 3.4 Problem Definition 22; 3.5 Selected Features 22; 3.6 Ranking strategy 25; 3.6.1 Random crawling 25; 3.6.2 Ranked by overall engagement 25; 3.6.3 Ranked by average engagement 26; 3.6.4 Ranked by predicted-engagement 26; 3.7 Learning Algorithm 27; 3.8 Recrawl Strategies and the Freshness Metric 28; 3.8.1 Model publication frequency 28; 3.8.2 Freshness and Delay 29; Chapter 4 Experiment 31; 4.1 Data Analysis 31; 4.2 Feature Importance 33; 4.3 Measurements 37; 4.4 Crawling performance 39; 4.5 Comparison of static crawling and dynamic crawling 44; Chapter 5 Conclusion 48; REFERENCE 50
dc.language.iso	en
dc.subject	資訊檢索	zh_TW
dc.subject	社群網路	zh_TW
dc.subject	網路爬蟲設計	zh_TW
dc.subject	行為分析	zh_TW
dc.subject	Behavior Analysis	en
dc.subject	Social Network	en
dc.subject	Information Retrieval	en
dc.subject	Crawler Design	en
dc.title	基於預測熱門度之大規模即時社群爬蟲演算法分析與設計	zh_TW
dc.title	An efficient crawling algorithm for large-scale real-time social stream data collection based on popularity prediction	en
dc.type	Thesis
dc.date.schoolyear	103-2
dc.description.degree	Master
dc.contributor.oralexamcommittee	張瑞益(Ray-I Chang),鄭卜壬(Pu-Jen Cheng),陳信希(Hsin-Hsi Chen)
dc.subject.keyword	社群網路,網路爬蟲設計,資訊檢索,行為分析	zh_TW
dc.subject.keyword	Social Network, Crawler Design, Information Retrieval, Behavior Analysis	en
dc.relation.page	52
dc.rights.note	Paid authorization
dc.date.accepted	2015-08-20
dc.contributor.author-college	工學院	zh_TW
dc.contributor.author-dept	工程科學及海洋工程學研究所	zh_TW
Appears in Collections:	Department of Engineering Science and Ocean Engineering

Files in This Item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	6.48 MB	Adobe PDF

Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.