Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52080
Full metadata record
DC field: value [language]
dc.contributor.advisor: 黃乾綱 (Chien-Kang Huang)
dc.contributor.author: Shih-En Chou [en]
dc.contributor.author: 周世恩 [zh_TW]
dc.date.accessioned: 2021-06-15T14:07:20Z
dc.date.available: 2017-08-25
dc.date.copyright: 2015-08-25
dc.date.issued: 2015
dc.date.submitted: 2015-08-20
dc.identifier.citation:
[1] Facebook Inc. (2014, 2015/6/15). Facebook Reports Fourth Quarter and Full Year 2014 Results. Available: http://investor.fb.com/releasedetail.cfm?ReleaseID=893395
[2] C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao, 'User interactions in social networks and their implications,' in Proceedings of the 4th ACM European conference on Computer systems, 2009, pp. 205-218.
[3] D. Horowitz and S. D. Kamvar, 'The anatomy of a large-scale social search engine,' in Proceedings of the 19th international conference on World wide web, 2010, pp. 431-440.
[4] J. Teevan, D. Ramage, and M. R. Morris, '# TwitterSearch: a comparison of microblog search and web search,' in Proceedings of the fourth ACM international conference on Web search and data mining, 2011, pp. 35-44.
[5] J. Cho, H. Garcia-Molina, and L. Page, 'Efficient crawling through URL ordering,' 1998.
[6] G. Pant, P. Srinivasan, and F. Menczer, 'Crawling the web,' in Web Dynamics, ed: Springer, 2004, pp. 153-177.
[7] (2015/06/04). Web crawler. Available: https://en.wikipedia.org/wiki/Web_crawler
[8] R. Zafarani and H. Liu, 'Behavior Analysis in Social Media,' ed: IEEE COMPUTER SOC 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1314 USA, 2014.
[9] C.-I. Wong, K.-Y. Wong, K.-W. Ng, W. Fan, and K.-H. Yeung, 'Design of a Crawler for Online Social Networks Analysis.'
[10] A. Yakushev and S. Mityagin, 'Social networks mining for analysis and modeling drugs usage,' Procedia Computer Science, vol. 29, pp. 2462-2471, 2014.
[11] Y. Zhang and M. Pennacchiotti, 'Predicting purchase behaviors from social media,' in Proceedings of the 22nd international conference on World Wide Web, 2013, pp. 1521-1532.
[12] H. Kwak, C. Lee, H. Park, and S. Moon, 'What is Twitter, a social network or a news media?,' in Proceedings of the 19th international conference on World wide web, 2010, pp. 591-600.
[13] F. Erlandsson, R. Nia, H. Johnson, and S. F. Wu, 'Making social interactions accessible in online social networks,' Inf. Services and Use, vol. 33, pp. 113-117, 2013.
[14] D. Shen, H. Wang, Z. Jiang, and J. Cao, 'A high efficient incremental microblog crawler: design and implementation,' J Inf Comput Sci, vol. 10, pp. 1731-1747, 2013.
[15] D. Shestakov, 'Intelligent Web Crawling,' IEEE Intelligent Informatics Bulletin, vol. 14, pp. 5-7, 2013.
[16] E. Ferrara, P. De Meo, G. Fiumara, and R. Baumgartner, 'Web data extraction, applications and techniques: A survey,' Knowledge-Based Systems, vol. 70, pp. 301-323, 2014.
[17] J. Cho and H. Garcia-Molina, 'Parallel crawlers,' in Proceedings of the 11th international conference on World Wide Web, 2002, pp. 124-135.
[18] D. H. Chau, S. Pandit, S. Wang, and C. Faloutsos, 'Parallel crawling for online social networks,' in Proceedings of the 16th international conference on World Wide Web, 2007, pp. 1283-1284.
[19] K. Kim, K. Kim, K. Lee, T. Kim, and W. Cho, 'Design and implementation of web crawler based on dynamic web collection cycle,' in Information Networking (ICOIN), 2012 International Conference on, 2012, pp. 562-566.
[20] S. Mali and B. Meshram, 'Focused web crawler with revisit policy,' in Proceedings of the International Conference & Workshop on Emerging Trends in Technology, 2011, pp. 474-479.
[21] D. Yadav, A. Sharma, J. Gupta, N. Garg, and A. Mahajan, 'Architecture for Parallel Crawling and Algorithm for Change Detection in Web Pages,' in Information Technology,(ICIT 2007). 10th International Conference on, 2007, pp. 258-264.
[22] J. Cho and H. Garcia-Molina, 'Synchronizing a database to improve freshness,' in Acm Sigmod Record, 2000, pp. 117-128.
[23] E. Zhong, W. Fan, Y. Zhu, and Q. Yang, 'Modeling the dynamics of composite social networks,' in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013, pp. 937-945.
[24] R. Horincar, B. Amann, and T. Artières, 'Online Change Estimation Models for Dynamic Web Resources,' in Web Engineering, ed: Springer, 2012, pp. 395-410.
[25] J. Lehmann, B. Gonçalves, J. J. Ramasco, and C. Cattuto, 'Dynamical classes of collective attention in twitter,' in Proceedings of the 21st international conference on World Wide Web, 2012, pp. 251-260.
[26] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi, 'On the evolution of user interaction in facebook,' in Proceedings of the 2nd ACM workshop on Online social networks, 2009, pp. 37-42.
[27] L. Ostroumova, I. Bogatyy, A. Chelnokov, A. Tikhonov, and G. Gusev, 'Crawling Policies Based on Web Page Popularity Prediction,' in Advances in Information Retrieval, ed: Springer, 2014, pp. 100-111.
[28] D. Lefortier, L. Ostroumova, E. Samosvat, and P. Serdyukov, 'Timely crawling of high-quality ephemeral new content,' presented at the Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, San Francisco, California, USA, 2013.
[29] D. Lefortier, L. Ostroumova, E. Samosvat, and P. Serdyukov, 'Timely crawling of high-quality ephemeral new content,' in Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, 2013, pp. 745-750.
[30] S. Gao, J. Ma, and Z. Chen, 'Effective and effortless features for popularity prediction in microblogging network,' in Proceedings of the companion publication of the 23rd international conference on World wide web companion, 2014, pp. 269-270.
[31] (2015/06/04). Graph and Ads API Rate Limiting. Available: https://developers.facebook.com/docs/marketing-api/api-rate-limiting#troubleshooting
[32] R. Sutton. Do Facebook Graph API calls using field expansion count differently against the rate limits than batch calls. Available: http://stackoverflow.com/questions/14626689/do-facebook-graph-api-calls-using-field-expansion-count-differently-against-the/18472015#18472015
[33] M. Liu, R. Cai, M. Zhang, and L. Zhang, 'User browsing behavior-driven web crawling,' in Proceedings of the 20th ACM international conference on Information and knowledge management, 2011, pp. 87-92.
[34] Z. Yang, J. Guo, K. Cai, J. Tang, J. Li, L. Zhang, et al., 'Understanding retweeting behaviors in social networks,' presented at the Proceedings of the 19th ACM international conference on Information and knowledge management, Toronto, ON, Canada, 2010.
[35] C.-C. Chang and C.-J. Lin, 'LIBSVM: A library for support vector machines,' ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, p. 27, 2011.
[36] P. Kolari, T. Finin, and A. Joshi, 'SVMs for the Blogosphere: Blog Identification and Splog Detection,' in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006, pp. 92-99.
[37] (2015/6/10). Random forest. Available: https://en.wikipedia.org/wiki/Random_forest
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52080
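dc.description.abstract: In recent years social networks have changed the way we communicate and have accumulated vast amounts of human behavioral data, drawing many emerging research topics toward social network behavior analysis. Such analysis usually requires a large volume of data, and recent work has moved toward time-domain analysis, in which a snapshot of a given research target must be taken at regular intervals; popular messages in particular need denser snapshots to reveal how user behavior changes over time. Because these social networks have complex structures and limit the volume and frequency of data that crawlers may access, collection is difficult for the data acquisition departments of most institutions, and the efficiency of data acquisition cannot be optimized effectively. Obtaining timely and sufficient data requires accessing the social network at high frequency, which wastes network resources and increases the load on the social network. In addition, current social network privacy policies do not allow different organizations to share data, and Facebook even protects user data with encrypted IDs. These restrictions make it harder for a single research institution to share data with others, so it cannot use existing crawl scheduling algorithms to divide data collection among institutions. In this thesis we propose a new crawl ordering algorithm that takes users' past behavior into account; the more data is collected, the better it can predict whether a collection target is popular and whether it will publish more posts. The algorithm addresses resource allocation in large-scale vertical crawling and the problem of dynamic web pages that ordinary crawlers cannot handle. We evaluate crawling performance by the popularity of the messages collected per unit of resource. Experimental results show that, when collecting 99.5% of the popular messages on the social network, our algorithm saves up to 40% of crawler network calls. [translated from zh_TW]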
dc.description.abstract: Social media has greatly changed the way we communicate, and a huge amount of social behavior data is recorded and accumulated as a result. This data is now widely applied to emerging research issues in combination with social behavior analysis. More recently, time-domain analysis has become especially popular for investigating behavior change: researchers take snapshots of a particular subject in the network at regular intervals, and popular messages (posts) need more frequent snapshots to capture precisely how user behavior changes over time. Scraping social networking sites such as Twitter and Facebook is not an easy task for the data acquisition departments of most institutions, since these sites often have complex structures and restrict the amount and frequency of the data they release to ordinary crawlers. To get more snapshots, groups often consume more computation power and network resources, and may even increase the load on OSN (Online Social Network) sites. In addition, current privacy control policies do not allow different groups to share data with one another. These constraints make it challenging for an individual research group to collect sufficient data by using existing crawl scheduling algorithms or by collaborating with other partners. In this paper, we propose a “Novel Crawling Ordering Algorithm”, which allows our crawlers to focus on popular content by collecting and analyzing user behaviors. The designed crawler also addresses the problems of large-scale vertical crawling and dynamic web pages. The performance of our crawling ordering algorithm is evaluated with purpose-designed metrics. The experimental results show that the algorithm can save up to 40% of requests while crawling the top 99.5% of the popular social stream. [en]
dc.description.provenance: Made available in DSpace on 2021-06-15T14:07:20Z (GMT). No. of bitstreams: 1
ntu-104-R01525052-1.pdf: 6636210 bytes, checksum: 2ab40b70a7d4a5c398f0f3fda0c7f65a (MD5)
Previous issue date: 2015 [en]
dc.description.tableofcontents:
Oral Examination Committee Certification #
Acknowledgements i
Chinese Abstract ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vi
LIST OF TABLES viii
Chapter 1 Introduction 1
1.1 Background 1
1.2 Problem 2
1.3 Solution 3
1.4 Scope and organization of this thesis 4
Chapter 2 Related Works 5
2.1 Web crawling 5
2.2 Demand of user behavior analysis 6
2.3 Data source selection and privacy issue 7
2.4 Challenge of crawling and the corresponding solution 10
2.5 Temporal Analysis 13
2.6 Popularity Prediction 13
Chapter 3 System Overview 15
3.1 Crawling Architecture 15
3.2 Scheduler Implementation 18
3.3 Fetch Posts 19
3.4 Problem Definition 22
3.5 Selected Features 22
3.6 Ranking strategy 25
3.6.1 Random crawling 25
3.6.2 Ranked by overall engagement 25
3.6.3 Ranked by average engagement 26
3.6.4 Ranked by predicted-engagement 26
3.7 Learning Algorithm 27
3.8 Recrawl Strategies and the Freshness Metric 28
3.8.1 Model publication frequency 28
3.8.2 Freshness and Delay 29
Chapter 4 Experiment 31
4.1 Data Analysis 31
4.2 Feature Importance 33
4.3 Measurements 37
4.4 Crawling performance 39
4.5 Comparison of static crawling and dynamic crawling 44
Chapter 5 Conclusion 48
REFERENCE 50
dc.language.iso: en
dc.title: 基於預測熱門度之大規模即時社群爬蟲演算法分析與設計 [zh_TW]
dc.title: An efficient crawling algorithm for large-scale real-time social stream data collection based on popularity prediction [en]
dc.type: Thesis
dc.date.schoolyear: 103-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 張瑞益 (Ray-I Chang), 鄭卜壬 (Pu-Jen Cheng), 陳信希 (Hsin-Hsi Chen)
dc.subject.keyword: Social networks, web crawler design, information retrieval, behavior analysis [translated from zh_TW]
dc.subject.keyword: Social Network, Crawler Design, Information Retrieval, Behavior Analysis [en]
dc.relation.page: 52
dc.rights.note: Paid license
dc.date.accepted: 2015-08-20
dc.contributor.author-college: College of Engineering [translated from zh_TW]
dc.contributor.author-dept: Graduate Institute of Engineering Science and Ocean Engineering [translated from zh_TW]
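Appears in collections: Department of Engineering Science and Ocean Engineering

Files in this item:
File: ntu-104-1.pdf (not currently authorized for public access)
Size: 6.48 MB
Format: Adobe PDF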
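
The abstract in this record describes ordering crawl targets so that a limited number of requests goes to popular content, and the table of contents lists "Ranked by average engagement" among the ranking strategies. As a rough illustration only, the Python sketch below ranks targets by the mean engagement of their past posts and keeps as many as a request budget allows. The `Source` class, its fields, the `budget` parameter, and the sample numbers are all assumptions made for this example; the record does not give the features or the prediction model actually used in the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A crawl target (hypothetical structure, not taken from the thesis)."""
    name: str
    # Engagement counts observed on this source's past posts (assumed data).
    engagements: list[int] = field(default_factory=list)

    def average_engagement(self) -> float:
        # Sources with no history score 0.0, so they sort last.
        if not self.engagements:
            return 0.0
        return sum(self.engagements) / len(self.engagements)


def order_by_average_engagement(sources: list[Source], budget: int) -> list[Source]:
    """Return at most `budget` sources, highest average past engagement first."""
    ranked = sorted(sources, key=lambda s: s.average_engagement(), reverse=True)
    return ranked[:max(budget, 0)]


# Illustrative data only.
sources = [
    Source("page_a", [10, 30, 20]),  # average 20.0
    Source("page_b", [500, 700]),    # average 600.0
    Source("page_c"),                # no history yet
    Source("page_d", [90]),          # average 90.0
]
selected = [s.name for s in order_by_average_engagement(sources, budget=2)]
print(selected)  # ['page_b', 'page_d']
```

With a budget of two requests, the sketch spends them on the two sources whose earlier posts drew the most engagement on average. A predicted-engagement ranking, as named in the table of contents, would replace `average_engagement` with the output of a trained model.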