Study on Web Page Similarity Based on Layout Feature and Content Structure

Chung-Yi Lin; 林忠毅

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50128

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	張瑞益
dc.contributor.author	Chung-Yi Lin	en
dc.contributor.author	林忠毅	zh_TW
dc.date.accessioned	2021-06-15T12:30:24Z	-
dc.date.available	2016-08-24
dc.date.copyright	2016-08-24
dc.date.issued	2016
dc.date.submitted	2016-08-05
dc.identifier.citation	[1] Lee, Brian, et al. 'Designing with interactive example galleries.' Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2010.
[2] Takama, Yasufumi, and Noriaki Mitsuhashi. 'Visual similarity comparison for Web page retrieval.' Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on. IEEE, 2005.
[3] Yang, Wuu. 'Identifying syntactic differences between two programs.' Software: Practice and Experience 21.7 (1991): 739-755.
[4] Zhai, Yanhong, and Bing Liu. 'Web data extraction based on partial tree alignment.' Proceedings of the 14th international conference on World Wide Web. ACM, 2005.
[5] Kim, Yeonjung, et al. 'Web information extraction by HTML tree edit distance matching.' Convergence Information Technology, 2007. International Conference on. IEEE, 2007.
[6] Ferrara, Emilio, and Robert Baumgartner. 'Automatic wrapper adaptation by tree edit distance matching.' Combinations of Intelligent Methods and Applications. Springer Berlin Heidelberg, 2011. 41-54.
[7] Zhai, Yanhong, and Bing Liu. 'Structured data extraction from the web based on partial tree alignment.' Knowledge and Data Engineering, IEEE Transactions on 18.12 (2006): 1614-1628.
[8] Cruz, Isabel F., et al. 'Measuring structural similarity among web documents: preliminary results.' Electronic Publishing, Artistic Imaging, and Digital Typography. Springer Berlin Heidelberg, 1998. 513-524.
[9] Joshi, Sachindra, et al. 'A bag of paths model for measuring structural similarity in Web documents.' Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
[10] Wong, Wai-ching, and Ada Wai-Chee Fu. 'Finding Structure and Characteristics of Web Documents for Classification.' ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. No. s 1. 2000.
[11] Buttler, David, Ling Liu, and Calton Pu. 'A fully automated object extraction system for the World Wide Web.' Distributed Computing Systems, 2001. 21st International Conference on. IEEE, 2001.
[12] Masek, William J., and Michael S. Paterson. 'A faster algorithm computing string edit distances.' Journal of Computer and System Sciences 20.1 (1980): 18-31.
[13] Alpuente, María, and Daniel Romero. 'A visual technique for web pages comparison.' Electronic Notes in Theoretical Computer Science 235 (2009): 3-18.
[14] Mitra, Anasua. Implementation and Empirical Performance Evaluation Of Semantic Web Page Segmentation Algorithms. Diss. JADAVPUR UNIVERSITY, 2014.
[15] Vadrevu, Srinivas, Fatih Gelgi, and Hasan Davulcu. 'Semantic partitioning of web pages.' Web Information Systems Engineering–WISE 2005. Springer Berlin Heidelberg, 2005. 107-118.
[16] Vadrevu, Srinivas, and Emre Velipasaoglu. 'Identifying primary content from web pages and its application to web search ranking.' Proceedings of the 20th international conference companion on World wide web. ACM, 2011.
[17] Kovacevic, Milo, et al. 'Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification.' Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, 2002.
[18] Cai, Deng, et al. VIPS: a vision-based page segmentation algorithm. Microsoft technical report, MSR-TR-2003-79, 2003.
[19] Campus, Northern Cyprus. 'Vision based page segmentation: Extended and improved algorithm.' (2012).
[20] Sanoja, Andrés, and Stéphane Gançarski. 'Yet another hybrid segmentation tool.' (2012): https-ipres.
[21] Kang, Jinbeom, and Joongmin Choi. 'Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction.' J. UCS 14.11 (2008): 1893-1910.
[22] Liu, Wei, Xiaofeng Meng, and Weiyi Meng. 'Vision-based web data records extraction.' Proc. 9th International Workshop on the Web and Databases. 2006.
[23] Li, Longzhuang, Yonghuai Liu, and Abel Obregon. 'Visual segmentation-based data record extraction from web documents.' Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on. IEEE, 2007.
[24] Chen, Yu, Wei-Ying Ma, and Hong-Jiang Zhang. 'Detecting web page structure for adaptive viewing on small form factor devices.' Proceedings of the 12th international conference on World Wide Web. ACM, 2003.
[25] Hashimoto, Yasunari, and Takeo Igarashi. 'Retrieving web page layouts using sketches to support example-based web design.' 2nd Eurographics Workshop on Sketch-Based Interfaces and Modeling. 2005.
[26] Bohunsky, Paul, and Wolfgang Gatterbauer. 'Visual structure-based web page clustering and retrieval.' Proceedings of the 19th international conference on World wide web. ACM, 2010.
[27] Della Penna, Giuseppe, Daniele Magazzeni, and Sergio Orefice. 'Visual extraction of information from web pages.' Journal of Visual Languages & Computing 21.1 (2010): 23-32.
[28] Song, Ruihua, et al. 'Learning block importance models for web pages.' Proceedings of the 13th international conference on World Wide Web. ACM, 2004.
[29] https://digital-loom.com/articles/why-your-website-has-lifespan-pet-goldfish-and-what-do-about-it
[30] https://en.wikipedia.org/wiki/Information_retrieval
[31] https://developers.google.com/webmasters/mobile-sites/mobile-seo/#choose_your_mobile_configuration
[32] Kohlschütter, Christian, and Wolfgang Nejdl. 'A densitometric approach to web page segmentation.' Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008.
[33] Kohlschütter, Christian, Peter Fankhauser, and Wolfgang Nejdl. 'Boilerplate detection using shallow text features.' Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010.
[34] Liu, Wei, Xiaofeng Meng, and Weiyi Meng. 'Vide: A vision-based approach for deep web data extraction.' IEEE Transactions on Knowledge and Data Engineering 22.3 (2010): 447-460.
[35] https://w3c.github.io/html/, 2015
[36] Gardner, Brett S. 'Responsive web design: Enriching the user experience.' Sigma Journal: Inside the Digital Ecosystem 11.1 (2011): 13-19.
[37] http://www.w3schools.com/js/js_htmldom.asp
[38] http://www.w3schools.com/html/html5_semantic_elements.asp
[39] http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
[40] http://xtremewebsites.com/google-change-now-rewards-mobile-optimized-websites/
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50128	-
dc.description.abstract	由於近幾年行動裝置的興盛，行動裝置上網已成為最普遍的上網方式。為了因應這個趨勢，Ethan Marcotte提出了響應式網頁設計(Responsive Web Design) [36]使網站達到使用者友善(user-friendly)的目的，讓各種大小的行動裝置都可以順利閱讀網站內容。其中百分之74的行動裝置使用者更喜歡造訪使用者友善的網站，同時在Google的研究 [31]中亦指出，使用者友善的網站有助於提升搜尋排名及吸引使用者目光，可見其重要性。然而，目前仍有百分之85的網站沒有達到使用者友善的目標，導致搜尋排行不斷下降並且流失更多客戶。這些網站的困境是目前網站重建多採用人工重新設計，非常耗時；若能設計一個系統，自動化選擇適合的網站樣板來做網站重建，將可以大幅減少需要重新設計網頁的時間，提升效率。而此系統最關鍵的問題在於，如何挑選出最符合客戶需求的網站樣版？研究[2]指出，網站重建前後的視覺呈現是否相似，對於使用者的專注力及閱讀速度有非常大的影響；因此，為了不影響使用者體驗(User experience)，本論文將基於網頁的佈局排版(Layout)跟視覺呈現，從大量的樣版中挑選出最相似於原網站的樣板。本論文利用許多真實資料的實驗，驗證了我們方法的效率並且可以準確地從大量樣板中找出相似的結果。一個網站通常會由多個網頁所組成，本論文先專注於研究網頁樣版(Web Page Template)之間的相似度問題，未來可以基於這些研究，進一步去探討網站之間的相似度問題。	zh_TW
dc.description.abstract	In recent years, mobile devices have become the most common means of accessing the Internet. In response, Ethan Marcotte proposed Responsive Web Design (RWD) [36], a design approach that makes websites user-friendly on mobile devices of any screen size. According to Google statistics, 74 percent of mobile device users prefer user-friendly websites because they are readable on mobile devices. Google's search guide [31] also reports that a user-friendly website can improve its search ranking and attract users. However, 85 percent of websites are still not user-friendly, so their search rankings keep falling. The main difficulty is that websites are usually rebuilt by manual redesign, which is time-consuming and inefficient. A system that automatically selects an appropriate website template would significantly improve the efficiency of website rebuilding. The critical problem for such a system is how to select the template that best matches the customer's needs. The study [2] indicates that drastic changes in the visual appearance of Web pages have a negative influence on readers. To avoid degrading the user experience, this thesis proposes a system that efficiently picks out the templates most similar to the original page in layout features and visual appearance. Experiments on real-world data show that the approach is effective and finds similar templates precisely. This thesis focuses on the similarity between Web pages. Since a website consists of multiple Web pages, the proposed method can be extended to measure the similarity between websites.	en
dc.description.provenance	Made available in DSpace on 2021-06-15T12:30:24Z (GMT). No. of bitstreams: 1 ntu-105-R03525086-1.pdf: 3827524 bytes, checksum: 823dcebef01b6de179a4aa167678b620 (MD5) Previous issue date: 2016	en
dc.description.tableofcontents	Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Objective and Contributions
  1.3 Thesis Organization
Chapter 2 Related Works
  2.1 Web Page Segmentation
  2.2 Layout Information and Related Applications
  2.3 DOM-based Similarity
Chapter 3 Web Page Similarity
  3.1 Phase 1: Semantic Segmentation
    3.1.1 Web Page Segmentation
    3.1.2 Region Locate Model
  3.2 Phase 2: Layout Similarity
    3.2.1 Prevention Model
    3.2.2 Determine Similar Layout Pattern
    3.2.3 Filter Model
  3.3 Phase 3: Content Similarity
    3.3.1 Translation Model
    3.3.2 Determine the Content Similarity
  3.4 Ranking Method and Results
Chapter 4 Evaluation of Different Methods
  4.1 Experiments Setup
    4.1.1 Data Sets
    4.1.2 Performance Measures
    4.1.3 Methods Introduction
  4.2 Experiment on Method 1 (Evaluation goes through entire DOM of the Web page)
    4.2.1 Experiment Procedure
    4.2.2 Experimental Results
  4.3 Experiment on Method 2 (Evaluation goes through entire, translated DOM of the Web page)
    4.3.1 Experiment Procedure
    4.3.2 Experimental Results
  4.4 Experiment on Method 3 (Evaluation goes through three phases)
    4.4.1 Experiment Procedure
    4.4.2 Experimental Results
  4.5 Experiment on Method 4 (Evaluation goes through three phases with translated DOM)
    4.5.1 Experiment Procedure
    4.5.2 Experimental Results
Chapter 5 Conclusion and Future Works
References
dc.language.iso	en
dc.subject	樹編輯距離	zh_TW
dc.subject	網頁相似度	zh_TW
dc.subject	文件物件模型樹	zh_TW
dc.subject	網頁分割	zh_TW
dc.subject	網頁佈局的視覺資訊	zh_TW
dc.subject	Web page similarity	en
dc.subject	visual information of web page layout	en
dc.subject	Web page segmentation	en
dc.subject	tree edit distance	en
dc.subject	DOM (Document Object Model) tree	en
dc.title	基於版型特徵跟內容結構的網頁相似度研究	zh_TW
dc.title	Study on Web Page Similarity Based on Layout Feature and Content Structure	en
dc.type	Thesis
dc.date.schoolyear	104-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	林正偉,王家輝,丁肇隆
dc.subject.keyword	文件物件模型樹,網頁相似度,樹編輯距離,網頁分割,網頁佈局的視覺資訊	zh_TW
dc.subject.keyword	DOM (Document Object Model) tree,Web page similarity,tree edit distance,Web page segmentation,visual information of web page layout	en
dc.relation.page	61
dc.identifier.doi	10.6342/NTU201601939
dc.rights.note	Paid license
dc.date.accepted	2016-08-05
dc.contributor.author-college	工學院	zh_TW
dc.contributor.author-dept	工程科學及海洋工程學研究所	zh_TW
Appears in Collections:	Department of Engineering Science and Ocean Engineering

Files in this item:

File	Size	Format
ntu-105-1.pdf (not authorized for public access)	3.74 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.