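An Efficient Crawling Algorithm for Large-Scale Real-Time Social Stream Data Collection Based on Popularity Prediction (基於預測熱門度之大規模即時社群爬蟲演算法分析與設計)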

Shih-En Chou; 周世恩

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52080

Title:	An Efficient Crawling Algorithm for Large-Scale Real-Time Social Stream Data Collection Based on Popularity Prediction (基於預測熱門度之大規模即時社群爬蟲演算法分析與設計)
Author:	Shih-En Chou (周世恩)
Advisor:	Chien-Kang Huang (黃乾綱)
Keywords:	Social Network, Crawler Design, Information Retrieval, Behavior Analysis
Year of publication:	2015
Degree:	Master's
Abstract:	Social networks have changed the way we communicate and have accumulated vast amounts of data on human behavior, which many emerging research topics now combine with social behavior analysis. Such analysis typically requires a large volume of data, and recent work increasingly focuses on the time domain: a snapshot of each research target must be taken at regular intervals, and popular messages (posts) need denser snapshots to reveal how user behavior changes over time. Collecting data from social networking sites such as Twitter and Facebook is not easy for the data acquisition departments of most institutions, because these sites have complex structures and restrict both the amount of data crawlers may access and how often they may access it. As a result, data retrieval is hard to optimize. To obtain timely and sufficient data, groups must query the sites at high frequency, which consumes computation and network resources and also increases the load on the OSN (Online Social Network) sites. In addition, current privacy policies do not allow different groups to share data with one another, and Facebook even protects user data with encrypted IDs. These restrictions make it difficult for an individual research group to share data with other institutions, or to divide the collection work with partners using existing crawl scheduling algorithms. In this work, we propose a novel crawl ordering algorithm that takes users' past behavior into account so that crawlers focus on popular content: the more data is collected, the better the algorithm can predict whether a target is popular and will publish more posts. The designed crawler also addresses resource allocation in large-scale vertical crawling and the handling of dynamic web pages that general-purpose crawlers cannot collect. We evaluate crawling performance by the popularity of the messages collected per unit of resource. The experimental results show that the algorithm saves up to 40% of crawler requests when crawling the top 99.5% of popular messages in the social stream.
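The abstract does not state the ordering rule itself, so the following is only a minimal sketch of the general idea it describes: rank crawl targets by a popularity estimate learned from past observations, spend a limited request budget on the top-ranked targets, and score the crawl by popularity collected per request. The exponential moving average used as the predictor is an assumption standing in for the thesis's method, and `Target`, `update`, `crawl_order`, `popularity_per_request`, `alpha`, and the sample numbers are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Target:
    """A crawl target such as a page or account (hypothetical structure)."""
    name: str
    predicted: float = 0.0  # running estimate of popularity per visit

    def update(self, observed: float, alpha: float = 0.5) -> None:
        # Exponential moving average of observed popularity.
        # Assumption: a simple stand-in for the thesis's predictor.
        self.predicted = alpha * observed + (1 - alpha) * self.predicted


def crawl_order(targets, budget):
    """Return the `budget` targets with the highest predicted popularity."""
    return heapq.nlargest(budget, targets, key=lambda t: t.predicted)


def popularity_per_request(collected, requests):
    """Popularity gathered per request spent (assumed form of the metric)."""
    return collected / requests if requests else 0.0


# Made-up observation history: popularity seen on two earlier visits.
history = {"a": [10, 12], "b": [100, 80], "c": [1, 0]}
targets = [Target(name) for name in history]
for t in targets:
    for observed in history[t.name]:
        t.update(observed)

# With a budget of two requests, only the two most promising targets are crawled.
chosen = crawl_order(targets, budget=2)
```

In this toy setup each new observation refines the estimate, which mirrors the abstract's claim that prediction improves as more data is collected. A real scheduler would also have to respect the rate limits of the site being crawled.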
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52080
Full-text license:	Paid license
Appears in department:	Department of Engineering Science and Ocean Engineering

Files in this item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	6.48 MB	Adobe PDF


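Unless their copyright terms are specifically indicated otherwise, all items in the system are protected by copyright, with all rights reserved.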
