Visual Document Synthesis for Information Extraction Tasks

江子涵; Zi-Han Jiang

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97881

Title:	Visual Document Synthesis for Information Extraction Tasks (original title: 用於資訊提取任務的視覺文件生成)
Authors:	江子涵 Zi-Han Jiang
Advisor:	陳祝嵩 Chu-Song Chen
Keyword:	synthetic document generation, layout generation, visual information extraction
Publication Year:	2025
Degree:	Master's
Abstract:	Although Large Language Models (LLMs) and Multimodal LLMs (MLLMs) have advanced the field of visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging because document layouts are highly diverse and training data is scarce. Existing synthetic document generation methods aim to mitigate this issue but often fall short, since they either depend on manually crafted layouts and templates or use rule-based methods that constrain layout variety. In addition, current layout generation techniques typically focus on geometric structure without integrating meaningful textual content, which limits their ability to produce documents with intricate content-layout relationships. To overcome these obstacles, we introduce the Relation-rIch visual Document GEnerator (RIDGE), a two-stage framework. First, the Content Generation stage employs LLMs to create document content in a Hierarchical Structure Text format that encodes entity categories and their relationships. Second, the Content-driven Layout Generation stage produces diverse and realistic layouts guided by the textual content. Training this stage requires only readily obtainable OCR outputs and no manual annotations. Through extensive experiments, we show that RIDGE significantly improves the performance of document understanding models across multiple VIE benchmarks.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97881
DOI:	10.6342/NTU202501745
Fulltext Rights:	Authorization granted (access restricted to campus)
Embargo Lift Date:	2025-07-22
Appears in Collections:	Department of Computer Science and Information Engineering
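
Illustrative sketch (not part of the thesis record): the abstract describes document content written in a Hierarchical Structure Text format that carries entity categories and their relationships. This record does not specify that format, so the snippet below is only a minimal sketch under stated assumptions: each line is written as `category: text`, indentation marks a parent-child relation, and the category names `header`, `question`, and `answer` are hypothetical placeholders. The names `Entity`, `parse_hierarchy`, and `SAMPLE` are likewise invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: int
    category: str  # hypothetical label such as "question" or "answer"
    text: str


def parse_hierarchy(source: str, indent: int = 2):
    """Parse indented 'category: text' lines into entities and parent-child links.

    This is an assumed toy format, not the format defined in the thesis.
    """
    entities = []
    links = []   # (parent_id, child_id) pairs
    stack = []   # stack[d] holds the id of the latest entity at depth d
    for line in source.splitlines():
        if not line.strip():
            continue  # skip blank lines
        stripped = line.lstrip(" ")
        depth = (len(line) - len(stripped)) // indent
        category, _, text = stripped.partition(":")
        entity = Entity(len(entities), category.strip(), text.strip())
        entities.append(entity)
        del stack[depth:]  # drop entities at this depth or deeper
        if stack:
            links.append((stack[-1], entity.id))  # link to nearest ancestor
        stack.append(entity.id)
    return entities, links


# Hypothetical sample content: a header followed by two question-answer pairs.
SAMPLE = "\n".join([
    "header: Invoice",
    "question: Name",
    "  answer: Alice",
    "question: Date",
    "  answer: 2025-01-01",
])

if __name__ == "__main__":
    entities, links = parse_hierarchy(SAMPLE)
    for parent_id, child_id in links:
        print(entities[parent_id].text, "->", entities[child_id].text)
```

In this toy representation the entity list supplies the category labels and the link list supplies the relations, which are the two kinds of annotation a VIE model would need from a synthetic document.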

Files in This Item:

File	Size	Format
ntu-113-2.pdf (access limited to the NTU IP range)	27.76 MB	Adobe PDF
