應用多模態大型語言模型於網路爬蟲場域 (Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping)

黃冠綸; Guan-Lun Huang

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102210

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	莊裕澤	zh_TW
dc.contributor.advisor	Yuh-Jzer Joung	en
dc.contributor.author	黃冠綸	zh_TW
dc.contributor.author	Guan-Lun Huang	en
dc.date.accessioned	2026-04-08T16:19:17Z	-
dc.date.available	2026-04-09	-
dc.date.copyright	2026-04-08	-
dc.date.issued	2026	-
dc.date.submitted	2026-03-19	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102210	-
dc.description.abstract	隨著大型語言模型 (LLM) 的發展，對網路資料需求越發增加，而資料產生的速度也隨著網路流量增長而增加。然而，在資料需求與供給爆炸式增長的趨勢下，網路資料的爬取方式卻沒有顯著的進展。由於網站前後端分工架構興起，且前端網站程式碼越發複雜多變等原因，導致專家需為每個網站客製化爬蟲程式。爲此本研究 Webscraper 嘗試結合多模態大語言模型 (MLLM)的網頁瀏覽能力、工具使用能力及 LLM 的程式生成執行的能力等，讓 MLLM 能夠與網頁互動，自主決定爬取網頁時機並調用相關工具爬取網頁原始碼產生結構化資料以及可重複利用的程式碼。我們發現許多網頁設計採用 index-content 模式，如新聞網站、購物網站、社群媒體、影音平台等，除資料具有極高附加價值，也顯示 index-content 模式被廣泛運用於網頁設計。Webscraper 使用 Anthropic 的桌面自動化代理框架 Computer use 作為瀏覽網頁模組，並開發網頁爬蟲工具供 Computer use 調用，透過五階段的流程提示詞爬取 index-content 類型的網頁。實驗結果顯示，在新聞領域僅使用流程化的 Prompt 的 Webscraper 即能賦予 Computer use 較佳的爬蟲能力。此外，使用本實驗開發的工具能進一步提升爬蟲的準確率。最後，我們將 Webscraper 運用於購物網站的爬蟲任務，實驗結果也顯示該架構不只針對新聞領域有效，以此驗證本架構的泛化能力。	zh_TW
dc.description.abstract	With the development of large language models (LLMs), the demand for large volumes of high-quality web data has grown significantly. At the same time, the rate at which data is generated has increased with rising internet traffic. However, despite this explosive growth in both demand and supply, web scraping techniques have seen little advancement. The rise of frontend-backend separation in web architecture, together with increasingly complex and diverse frontend codebases, requires experts to customize a scraper for each individual website. To address this, our research introduces Webscraper, a system that leverages the webpage browsing and tool-use capabilities of Multimodal Large Language Models (MLLMs), along with the ability of LLMs to generate and execute code. This enables MLLMs to interact with webpages, autonomously decide when and how to scrape content, invoke tools to retrieve raw HTML, and generate structured data as well as reusable code. We observed that many websites follow an index-content design pattern (common in news websites, e-commerce platforms, social media, and video platforms), which indicates both the high value of the data and the widespread adoption of this pattern. Webscraper uses Anthropic’s desktop automation agent framework Computer Use as the web browsing module, and we developed a web scraping tool that Computer Use can invoke. The scraping process is guided by a five-stage prompting procedure tailored to index-content webpages. Experimental results show that, in the news domain, Webscraper significantly enhances the scraping capabilities of Computer Use even when it relies only on the prompt-driven procedure. Using our custom-developed tools further improves scraping accuracy. Finally, we applied Webscraper to e-commerce sites and found that the architecture is effective beyond the news domain, which supports its generalizability.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-04-08T16:19:17Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2026-04-08T16:19:17Z (GMT). No. of bitstreams: 0	en
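dc.description.tableofcontents	Acknowledgements i; Abstract (Chinese) ii; Abstract iii; Table of Contents v; List of Figures viii; List of Tables x; Chapter 1 Introduction 1; 1.1 Research Background 1; 1.2 Research Motivation and Objectives 2; 1.3 Thesis Organization 3; Chapter 2 Literature Review 4; 2.1 Overview of Web Scraping 4; 2.2 Development of Multimodal Large Language Models and Web Browsing 8; 2.3 Web Information Extraction 13; 2.4 Summary 17; Chapter 3 Methodology 18; 3.1 Problem Definition 18; 3.2 Research Framework 21; 3.2.1 System Architecture Overview 21; 3.2.2 System Workflow Overview 22; 3.2.3 Web Browsing Module 25; 3.2.4 Scraping Tools 27; Chapter 4 Results 30; 4.1 Data Sources 30; 4.2 Disclaimer 33; 4.2.1 Server Load Considerations 33; 4.2.2 Non-Intrusive Collection 33; 4.3 Evaluation Metrics 34; 4.3.1 Rouge-L 34; 4.3.2 Correct 37; 4.4 Experimental Setup 37; 4.4.1 Computeruse Settings 38; 4.4.2 Webscraper Settings 38; 4.5 Experimental Results 39; 4.5.1 Stability Test 40; 4.5.2 Temporal Stability 41; 4.5.3 Results Across All Websites 42; 4.5.4 Results on Other Index-Content Websites: The Case of Shopping Websites 43; 4.6 Comparison with Other BrowserAgents 46; 4.7 Error Analysis 47; 4.7.1 Summary of Webscraper Failure Causes 48; 4.7.2 Summary of Computeruse Failure Causes 51; 4.7.3 Summary of Shared Failure Causes 53; Chapter 5 Conclusion 60; 5.1 Research Findings 60; 5.2 Contributions and Innovations 61; 5.2.1 Integrating Web Browsing and Web Information Extraction 61; 5.2.2 Developing a Prompt Workflow and Tools for Index-Content Webpages 61; 5.2.3 Automated Generation of Scraper Code 61; 5.3 Limitations 62; 5.3.1 Insufficient Testing on Index-Content Websites 62; 5.3.2 Limited Generality of Application 62; 5.4 Future Work 64; 5.4.1 End-to-End Web Scraping 64; 5.4.2 MCP 65; References 66; Appendix A: Prompt Design 72	-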
dc.language.iso	zh_TW	-
dc.subject	大型語言模型	-
dc.subject	自主網頁瀏覽	-
dc.subject	資料擷取	-
dc.subject	資料爬蟲	-
dc.subject	提示詞工程	-
dc.subject	Large Language Model	-
dc.subject	Autonomous Web Navigation	-
dc.subject	Web Information Extraction	-
dc.subject	Web Scraping	-
dc.subject	Prompt Engineering	-
dc.title	應用多模態大型語言模型於網路爬蟲場域	zh_TW
dc.title	Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping	en
dc.type	Thesis	-
dc.date.schoolyear	114-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	陳建錦;楊立偉;魏志平;林俊叡	zh_TW
dc.contributor.oralexamcommittee	Chien-Chin Chen;Li-Wei Yang;Chih-Ping Wei;June-Ray Lin	en
dc.subject.keyword	大型語言模型,自主網頁瀏覽,資料擷取,資料爬蟲,提示詞工程	zh_TW
dc.subject.keyword	Large Language Model,Autonomous Web Navigation,Web Information Extraction,Web Scraping,Prompt Engineering	en
dc.relation.page	79	-
dc.identifier.doi	10.6342/NTU202600867	-
dc.rights.note	Not authorized	-
dc.date.accepted	2026-03-20	-
dc.contributor.author-college	College of Management	-
dc.contributor.author-dept	Department of Information Management	-
dc.date.embargo-lift	N/A	-
Appears in Collections:	Department of Information Management

Files in This Item:

File	Size	Format
ntu-114-2.pdf (not authorized for public access)	11.72 MB	Adobe PDF

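Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.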