Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50128
Full metadata record
DC Field, Value, Language
dc.contributor.advisor: 張瑞益
dc.contributor.author: Chung-Yi Lin [en]
dc.contributor.author: 林忠毅 [zh_TW]
dc.date.accessioned: 2021-06-15T12:30:24Z
dc.date.available: 2016-08-24
dc.date.copyright: 2016-08-24
dc.date.issued: 2016
dc.date.submitted: 2016-08-05
dc.identifier.citation:
[1] Lee, Brian, et al. 'Designing with interactive example galleries.' Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2010.
[2] Takama, Yasufumi, and Noriaki Mitsuhashi. 'Visual similarity comparison for Web page retrieval.' Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on. IEEE, 2005.
[3] Yang, Wuu. 'Identifying syntactic differences between two programs.' Software: Practice and Experience 21.7 (1991): 739-755.
[4] Zhai, Yanhong, and Bing Liu. 'Web data extraction based on partial tree alignment.' Proceedings of the 14th international conference on World Wide Web. ACM, 2005.
[5] Kim, Yeonjung, et al. 'Web information extraction by HTML tree edit distance matching.' Convergence Information Technology, 2007. International Conference on. IEEE, 2007.
[6] Ferrara, Emilio, and Robert Baumgartner. 'Automatic wrapper adaptation by tree edit distance matching.' Combinations of Intelligent Methods and Applications. Springer Berlin Heidelberg, 2011. 41-54.
[7] Zhai, Yanhong, and Bing Liu. 'Structured data extraction from the web based on partial tree alignment.' Knowledge and Data Engineering, IEEE Transactions on 18.12 (2006): 1614-1628.
[8] Cruz, Isabel F., et al. 'Measuring structural similarity among web documents: preliminary results.' Electronic Publishing, Artistic Imaging, and Digital Typography. Springer Berlin Heidelberg, 1998. 513-524.
[9] Joshi, Sachindra, et al. 'A bag of paths model for measuring structural similarity in Web documents.' Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
[10] Wong, Wai-ching, and Ada Wai-Chee Fu. 'Finding Structure and Characteristics of Web Documents for Classification.' ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. No. s 1. 2000.
[11] Buttler, David, Ling Liu, and Calton Pu. 'A fully automated object extraction system for the World Wide Web.' Distributed Computing Systems, 2001. 21st International Conference on. IEEE, 2001.
[12] Masek, William J., and Michael S. Paterson. 'A faster algorithm computing string edit distances.' Journal of Computer and System sciences 20.1 (1980): 18-31.
[13] Alpuente, María, and Daniel Romero. 'A visual technique for web pages comparison.' Electronic Notes in Theoretical Computer Science 235 (2009): 3-18.
[14] Mitra, Anasua. Implementation and Empirical Performance Evaluation Of Semantic Web Page Segmentation Algorithms. Diss. JADAVPUR UNIVERSITY, 2014.
[15] Vadrevu, Srinivas, Fatih Gelgi, and Hasan Davulcu. 'Semantic partitioning of web pages.' Web Information Systems Engineering–WISE 2005. Springer Berlin Heidelberg, 2005. 107-118.
[16] Vadrevu, Srinivas, and Emre Velipasaoglu. 'Identifying primary content from web pages and its application to web search ranking.' Proceedings of the 20th international conference companion on World wide web. ACM, 2011.
[17] Kovacevic, Milo, et al. 'Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification.' Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, 2002.
[18] Cai, Deng, et al. VIPS: a vision-based page segmentation algorithm. Microsoft technical report, MSR-TR-2003-79, 2003.
[19] Campus, Northern Cyprus. 'Vision based page segmentation: Extended and improved algorithm.' (2012).
[20] Sanoja, Andrés, and Stéphane Gançarski. 'Yet another hybrid segmentation tool.' (2012): https-ipres.
[21] Kang, Jinbeom, and Joongmin Choi. 'Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction.' J. UCS 14.11 (2008): 1893-1910.
[22] Liu, Wei, Xiaofeng Meng, and Weiyi Meng. 'Vision-based web data records extraction.' Proc. 9th International Workshop on the Web and Databases. 2006.
[23] Li, Longzhuang, Yonghuai Liu, and Abel Obregon. 'Visual segmentation-based data record extraction from web documents.' Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on. IEEE, 2007.
[24] Chen, Yu, Wei-Ying Ma, and Hong-Jiang Zhang. 'Detecting web page structure for adaptive viewing on small form factor devices.' Proceedings of the 12th international conference on World Wide Web. ACM, 2003.
[25] Hashimoto, Yasunari, and Takeo Igarashi. 'Retrieving web page layouts using sketches to support example-based web design.' 2nd Eurographics Workshop on Sketch-Based Interfaces and Modeling. 2005.
[26] Bohunsky, Paul, and Wolfgang Gatterbauer. 'Visual structure-based web page clustering and retrieval.' Proceedings of the 19th international conference on World wide web. ACM, 2010.
[27] Della Penna, Giuseppe, Daniele Magazzeni, and Sergio Orefice. 'Visual extraction of information from web pages.' Journal of Visual Languages & Computing 21.1 (2010): 23-32.
[28] Song, Ruihua, et al. 'Learning block importance models for web pages.' Proceedings of the 13th international conference on World Wide Web. ACM, 2004.
[29] https://digital-loom.com/articles/why-your-website-has-lifespan-pet-goldfish-and-what-do-about-it
[30] https://en.wikipedia.org/wiki/Information_retrieval
[31] https://developers.google.com/webmasters/mobile-sites/mobile-seo/#choose_your_mobile_configuration
[32] Kohlschütter, Christian, and Wolfgang Nejdl. 'A densitometric approach to web page segmentation.' Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008.
[33] Kohlschütter, Christian, Peter Fankhauser, and Wolfgang Nejdl. 'Boilerplate detection using shallow text features.' Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010.
[34] Liu, Wei, Xiaofeng Meng, and Weiyi Meng. 'Vide: A vision-based approach for deep web data extraction.' IEEE Transactions on Knowledge and Data Engineering 22.3 (2010): 447-460.
[35] https://w3c.github.io/html/, 2015
[36] Gardner, Brett S. 'Responsive web design: Enriching the user experience.' Sigma Journal: Inside the Digital Ecosystem 11.1 (2011): 13-19.
[37] http://www.w3schools.com/js/js_htmldom.asp
[38] http://www.w3schools.com/html/html5_semantic_elements.asp
[39] http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
[40] http://xtremewebsites.com/google-change-now-rewards-mobile-optimized-websites/
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50128
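dc.description.abstract: With the rapid growth of mobile devices in recent years, mobile access has become the most common way to go online. In response to this trend, Ethan Marcotte proposed Responsive Web Design [36], which makes websites user-friendly so that their content can be read comfortably on mobile devices of any size. Seventy-four percent of mobile device users prefer to visit user-friendly websites, and Google's research [31] also indicates that user-friendly websites help improve search ranking and attract users' attention, which shows their importance. However, 85 percent of websites still fall short of being user-friendly, so their search rankings keep dropping and they lose more customers. The difficulty these websites face is that rebuilding a website currently relies mostly on manual redesign, which is very time-consuming; a system that automatically selects a suitable website template for the rebuild could greatly reduce the time needed to redesign pages and improve efficiency. The key question for such a system is how to pick the template that best meets the customer's needs. The study [2] indicates that whether the visual presentation is similar before and after a rebuild has a very large effect on users' attention and reading speed. Therefore, so as not to affect the user experience, this thesis uses the layout and visual presentation of Web pages to find, among a large number of templates, those most similar to the original website. Experiments on a large amount of real data verify the efficiency of our method and show that it can accurately find similar results from a large set of templates. A website usually consists of multiple Web pages; this thesis first focuses on the similarity between Web page templates, and future work can build on this research to study the similarity between websites. [translated from zh_TW]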
dc.description.abstract: In recent years, mobile devices have become the most common means of accessing the Internet. In response to their rise, Ethan Marcotte proposed a design approach called Responsive Web Design (RWD) [36] to make websites user-friendly on mobile devices. According to Google statistics, 74 percent of mobile device users prefer user-friendly websites because of their readability on mobile devices. Google's search guide [31] also reports that a user-friendly website can improve its search ranking and attract users. However, 85 percent of websites are still not user-friendly, so their search rankings keep falling. The main difficulty is that websites are usually rebuilt manually, which is time-consuming and inefficient. A system that automatically selects an appropriate website template would significantly improve the efficiency of website rebuilding. The critical problem for such a system is how to select the template that best matches the customer's needs. The study [2] indicates that drastic changes in the visual appearance of Web pages have a negative influence on readers. So as not to affect the user experience, this thesis proposes a system that efficiently ranks templates by how similar their layout features and visual appearance are to those of the original page. Our experimental results indicate the effectiveness of our approach and show that it can find similar templates precisely. This thesis focuses on the similarity of Web pages. Since a website consists of multiple Web pages, the proposed method can be extended to measure the similarity between websites. [en]
dc.description.provenance: Made available in DSpace on 2021-06-15T12:30:24Z (GMT). No. of bitstreams: 1
ntu-105-R03525086-1.pdf: 3827524 bytes, checksum: 823dcebef01b6de179a4aa167678b620 (MD5)
Previous issue date: 2016 [en]
dc.description.tableofcontents:
Acknowledgements
Abstract (Chinese)
ABSTRACT
Table of Contents
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Objective and Contributions
1.3 Thesis Organization
Chapter 2 Related Works
2.1 Web Page Segmentation
2.2 Layout Information and Related Applications
2.3 DOM-based Similarity
Chapter 3 Web Page Similarity
3.1 Phase 1: Semantic Segmentation
3.1.1 Web Page Segmentation
3.1.2 Region Locate Model
3.2 Phase 2: Layout Similarity
3.2.1 Prevention Model
3.2.2 Determine Similar Layout Pattern
3.2.3 Filter Model
3.3 Phase 3: Content Similarity
3.3.1 Translation Model
3.3.2 Determine the Content Similarity
3.4 Ranking Method and Results
Chapter 4 Evaluation of Different Methods
4.1 Experiments Setup
4.1.1 Data Sets
4.1.2 Performance Measures
4.1.3 Methods Introduction
4.2 Experiment on Method 1 (Evaluation goes through entire DOM of the Web page)
4.2.1 Experiment Procedure
4.2.2 Experimental Results
4.3 Experiment on Method 2 (Evaluation goes through entire, translated DOM of the Web page)
4.3.1 Experiment Procedure
4.3.2 Experimental Results
4.4 Experiment on Method 3 (Evaluation goes through three phases)
4.4.1 Experiment Procedure
4.4.2 Experimental Results
4.5 Experiment on Method 4 (Evaluation goes through three phases with translated DOM)
4.5.1 Experiment Procedure
4.5.2 Experimental Results
Chapter 5 Conclusion and Future Works
REFERENCES
dc.language.iso: en
dc.subject: tree edit distance (樹編輯距離)
dc.subject: Web page similarity (網頁相似度)
dc.subject: DOM (Document Object Model) tree (文件物件模型樹)
dc.subject: Web page segmentation (網頁分割)
dc.subject: visual information of web page layout (網頁佈局的視覺資訊)
dc.title: 基於版型特徵跟內容結構的網頁相似度研究 [zh_TW]
dc.title: Study on Web Page Similarity Based on Layout Feature and Content Structure [en]
dc.type: Thesis
dc.date.schoolyear: 104-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 林正偉, 王家輝, 丁肇隆
dc.subject.keyword: 文件物件模型樹, 網頁相似度, 樹編輯距離, 網頁分割, 網頁佈局的視覺資訊 [zh_TW]
dc.subject.keyword: DOM (Document Object Model) tree, Web page similarity, tree edit distance, Web page segmentation, visual information of web page layout [en]
dc.relation.page: 61
dc.identifier.doi: 10.6342/NTU201601939
dc.rights.note: Paid authorization (有償授權)
dc.date.accepted: 2016-08-05
dc.contributor.author-college: College of Engineering (工學院)
dc.contributor.author-dept: Graduate Institute of Engineering Science and Ocean Engineering (工程科學及海洋工程學研究所)
Appears in collections: Department of Engineering Science and Ocean Engineering (工程科學及海洋工程學系)

Files in this item:
File, Size, Format
ntu-105-1.pdf (not authorized for public access), 3.74 MB, Adobe PDF


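Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.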
