Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping (應用多模態大型語言模型於網路爬蟲場域)

黃冠綸; Guan-Lun Huang

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102210

Title:	Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping (應用多模態大型語言模型於網路爬蟲場域)
Author:	Guan-Lun Huang (黃冠綸)
Advisor:	Yuh-Jzer Joung (莊裕澤)
Keywords:	Large Language Model, Autonomous Web Navigation, Web Information Extraction, Web Scraping, Prompt Engineering
Year of publication:	2026
Degree:	Master's
Abstract:	As large language models (LLMs) develop, demand for large volumes of high-quality web data keeps growing, and data is generated faster as internet traffic rises. Yet despite this explosive growth in both demand and supply, web scraping techniques have seen little advancement. Because of the rise of frontend-backend separation in web architecture and increasingly complex and varied frontend code, experts must write a custom scraper for each website. To address this, this research introduces Webscraper, a system that combines the web browsing and tool-use capabilities of multimodal large language models (MLLMs) with the ability of LLMs to generate and execute code. This lets an MLLM interact with webpages, decide autonomously when to scrape, and invoke tools that scrape the page source to produce structured data and reusable code. We observed that many websites follow an index-content design pattern, including news websites, e-commerce platforms, social media, and video platforms; the data on these sites has high added value, and the pattern is widely used in web design. Webscraper uses Computer Use, Anthropic's desktop automation agent framework, as its web browsing module, and we developed a web scraping tool that Computer Use can invoke. Scraping is guided by a five-stage prompting procedure tailored to index-content webpages. Experimental results show that, in the news domain, Webscraper with only the procedural prompts already gives Computer Use better scraping capability, and that adding our custom-developed tool further improves scraping accuracy. Finally, we applied Webscraper to scraping tasks on e-commerce sites; the results show that the architecture is effective beyond the news domain, which we take as validation of its generalizability.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102210
DOI:	10.6342/NTU202600867
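Full-text authorization:	Not authorized
Electronic full-text release date:	N/A
Appears in collections:	Department of Information Management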
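
The index-content pattern named in the abstract can be illustrated with a small sketch: an index page lists links to content pages, and each content page is turned into one structured record. This is only an illustration of the pattern under stated assumptions, not the thesis's five-stage prompts, its scraping tool, or any MLLM component. The class names, the `item` CSS class, the sample HTML, and the `fetch` callback are all hypothetical; a real site would need its own selectors and a real HTTP client.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attr.get("href"):
            self.links.append(attr["href"])


class TitleBodyExtractor(HTMLParser):
    """Extract the text of the first <h1> and of every <p> element."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag whose text is currently being captured
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.title:
            self._current = "h1"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def scrape_index(index_html, base_url, css_class="item"):
    """Index step: return absolute URLs of the content pages."""
    parser = LinkCollector(css_class)
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.links]


def scrape_content(content_html, url):
    """Content step: turn one content page into a structured record."""
    parser = TitleBodyExtractor()
    parser.feed(content_html)
    body = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"url": url, "title": parser.title.strip(), "body": body}


def scrape_site(index_html, base_url, fetch):
    """Run the index step, then the content step on every linked page."""
    urls = scrape_index(index_html, base_url)
    return [scrape_content(fetch(url), url) for url in urls]


# Hypothetical in-memory pages standing in for a real news site.
INDEX_HTML = (
    '<ul><li><a class="item" href="/news/1">A</a></li>'
    '<li><a class="item" href="/news/2">B</a></li>'
    '<li><a href="/about">About</a></li></ul>'
)
PAGES = {
    "https://example.com/news/1": "<h1>First story</h1><p>Alpha.</p><p>Beta.</p>",
    "https://example.com/news/2": "<h1>Second story</h1><p>Gamma.</p>",
}

records = scrape_site(INDEX_HTML, "https://example.com/", PAGES.__getitem__)
for record in records:
    print(record)
```

Passing `fetch` as a parameter keeps the sketch runnable offline; hand-written selectors like the ones hard-coded here are the per-site customization that the abstract says Webscraper aims to avoid.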

Files in this item:

File	Size	Format
ntu-114-2.pdf (not authorized for public access)	11.72 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
