NTU Theses and Dissertations Repository

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97881
Full metadata record (DC field: value [language])
dc.contributor.advisor: 陳祝嵩 [zh_TW]
dc.contributor.advisor: Chu-Song Chen [en]
dc.contributor.author: 江子涵 [zh_TW]
dc.contributor.author: Zi-Han Jiang [en]
dc.date.accessioned: 2025-07-21T16:07:11Z
dc.date.available: 2025-07-22
dc.date.copyright: 2025-07-21
dc.date.issued: 2025
dc.date.submitted: 2025-07-17
dc.identifier.citation[1] D. M. Arroyo, J. Postels, and F. Tombari. Variational transformer networks for layout generation. In CVPR, pages 13642–13652, 2021.
[2] S. Biswas, P. Riba, J. Lladós, and U. Pal. Docsynth: A layout guided approach for controllable document image synthesis. In International Conference on Document Analysis and Recognition (ICDAR), 2021.
[3] M. Carbonell, P. Riba, M. Villegas, A. Fornés, and J. Lladós. Named entity recognition and relation extraction with graph neural networks in semi structured documents. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 9622–9627, 2021.
[4] J. Chen, Y. Huang, T. Lv, L. Cui, Q. Chen, and F. Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. In Computer Vision–ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part V, page 386–402, Berlin, Heidelberg, 2024. Springer-Verlag.
[5] Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024.
[6] J. Cho, A. Zala, and M. Bansal. Visual programming for step-by-step text-to-image generation and evaluation. NeurIPS, 36, 2024.
[7] B. Davis, B. Morse, B. Price, C. Tensmeyer, C. Wigington, and V. Morariu. Endto-end document recognition and understanding with dessurt. In Computer Vision– ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, page 280–296, Berlin, Heidelberg, 2022. Springer-Verlag.
[8] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[9] W. Feng, W. Zhu, T.-J. Fu, V. Jampani, A. R. Akula, X. He, S. Basu, X. E. Wang, and W. Y. Wang. LayoutGPT: Compositional visual planning and generation with large language models. In NeurIPS, 2023.
[10] A. Gemelli, S. Biswas, E. Civitelli, J. Lladós, and S. Marinai. Doc2graph: A task agnostic document understanding framework based on graph neural networks. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, page 329–344, Berlin, Heidelberg, 2022. Springer-Verlag.
[11] A. W. Harley, A. Ufkes, and K. G. Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 991–995. IEEE, 2015.
[12] L. He, Y. Lu, J. Corring, D. Florencio, and C. Zhang. Diffusion-based document layout generation. In Document Analysis and Recognition - ICDAR 2023, pages 361–378. Springer Nature Switzerland, 2023.
[13] T. Hong, D. Kim, M. Ji, W. Hwang, D. Nam, and S. Park. Bros: A pre-trained language model focusing on text and layout for better key information extraction from documents. In AAAI, volume 36, pages 10767–10775, 2022.
[14] A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, J. Zhang, Q. Jin, F. Huang, and J. Zhou. mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3096–3120, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics.
[15] E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
[16] Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. In ACM MM, pages 4083–4091, 2022.
[17] Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. Jawahar. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019.
[18] M. Hui, Z. Zhang, X. Zhang, W. Xie, Y. Wang, and Y. Lu. Unifying layout generation with a decoupled diffusion model. In CVPR, pages 1942–1951, 2023.
[19] N. Inoue, K. Kikuchi, E. Simo-Serra, M. Otani, and K. Yamaguchi. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation. In CVPR, pages 10167– 10176, 2023.
[20] G. Jaume, H. K. Ekenel, and J.-P. Thiran. Funsd: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019.
[21] P. Jia, C. Li, Y. Yuan, Z. Liu, Y. Shen, B. Chen, X. Chen, Y. Zheng, D. Chen, J. Li, X. Xie, S. Zhang, and B. Guo. Cole: A hierarchical generation framework for multilayered and editable graphic design, 2024.
[22] Z. Jiang, J. Guo, S. Sun, H. Deng, Z. Wu, V. Mijovic, Z. J. Yang, J.-G. Lou, and D. Zhang. Layoutformer++: Conditional graphic layout generation via constraint serialization and decoding space restriction. In CVPR, pages 18403–18412, 2023.
[23] Z. Jiang, S. Sun, J. Zhu, J.-G. Lou, and D. Zhang. Coarse-to-fine generative modeling for graphic layouts. In AAAI, volume 36, pages 1096–1103, 2022.
[24] A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori. Layoutvae: Stochastic scene layout generation from a label set. In ICCV, pages 9895–9904, 2019.
[25] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park. Ocr-free document understanding transformer. In ECCV, 2022.
[26] J. Kuang, W. Hua, D. Liang, M. Yang, D. Jiang, B. Ren, and X. Bai. Visual information extraction in the wild: practical dataset and end-to-end solution. In International Conference on Document Analysis and Recognition, pages 36–53. Springer, 2023.
[27] F. Li, A. Liu, W. Feng, H. Zhu, Y. Li, Z. Zhang, J. Lv, X. Zhu, J. Shen, Z. Lin, and J. Shao. Relation-aware diffusion model for controllable poster layout generation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, page 1249–1258, 2023.
[28] J. Li, T. Xu, J. Zhang, A. Hertzmann, and J. Yang. LayoutGAN: Generating graphic layouts with wireframe discriminator. In ICLR, 2019.
[29] Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and X. Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In CVPR, pages 26763–26773, 2024.
[30] L. Lian, B. Li, A. Yala, and T. Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. Transactions on Machine Learning Research, 2024. Featured Certification.
[31] J. Lin, J. Guo, S. Sun, Z. J. Yang, J.-G. Lou, and D. Zhang. Layoutprompter: Awaken the design ability of large language models. In NeurIPS, 2023.
[32] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
[33] C. Luo, Y. Shen, Z. Zhu, Q. Zheng, Z. Yu, and C. Yao. Layoutllm: Layout instruction tuning with large language models for document understanding. In CVPR, pages 15630–15640, 2024.
[34] M. Mathew, D. Karatzas, and C. Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021.
[35] S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, and H. Lee. Cord: a consolidated receipt dataset for post-ocr parsing. In Workshop on Document Intelligence at NeurIPS 2019, 2019.
[36] A. G. Patil, O. Ben-Eliezer, O. Perel, and H. Averbuch-Elor. Read: Recursive autoencoders for document layout generation. In CVPRW, pages 544–545, 2020.
[37] Q. Peng, Y. Pan, W. Wang, B. Luo, Z. Zhang, Z. Huang, Y. Cao, W. Yin, Y. Chen, Y. Zhang, S. Feng, Y. Sun, H. Tian, H. Wu, and H. Wang. ERNIE-layout: Layout knowledge enhanced pre-training for visually-rich document understanding. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3744–3756, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.
[38] P. P. Ray. Chatgpt: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3:121–154, 2023.
[39] Š. Šimsa, M. Šulc, M. Uřičář, Y. Patel, A. Hamdi, M. Kocián, M. Skalickỳ, J. Matas, A. Doucet, M. Coustaty, et al. Docile benchmark for document information localization and extraction. In International Conference on Document Analysis and Recognition, pages 147–166. Springer, 2023.
[40] Z. Tang, C. Wu, J. Li, and N. Duan. LayoutNUWA: Revealing the hidden layout expertise of large language models. In ICLR, 2024.
[41] Y. Tu, Y. Guo, H. Chen, and J. Tang. LayoutMask: Enhance text-layout interaction in multi-modal pre-training for document understanding. In A. Rogers, J. BoydGraber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15200– 15212, Toronto, Canada, July 2023. Association for Computational Linguistics.
[42] D. Wang, N. Raman, M. Sibue, Z. Ma, P. Babkin, S. Kaur, Y. Pei, A. Nourbakhsh, and X. Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8529–8548, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics.
[43] J. Wang, C. Liu, L. Jin, G. Tang, J. Zhang, S. Zhang, Q. Wang, Y. Wu, and M. Cai. Towards robust visual information extraction in real world: new dataset and novel solution. In AAAI, volume 35, pages 2738–2745, 2021.
[44] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[45] J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, NeurIPS, volume 35, pages 24824–24837. Curran Associates, Inc., 2022.
[46] Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1192–1200, 2020.
[47] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, and F. Wei. XFUND: A benchmark dataset for multilingual visually rich form understanding. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[48] Y. Xu, Y. Xu, T. Lv, L. Cui, F. Wei, G. Wang, Y. Lu, D. Florencio, C. Zhang, W. Che, M. Zhang, and L. Zhou. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579–2591, Online, Aug. 2021. Association for Computational Linguistics.
[49] J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, J. Zhang, Q. Jin, L. He, X. Lin, and F. Huang. UReader: Universal OCR-free visually-situated language understanding with multimodal large language model. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, Singapore, Dec. 2023. Association for Computational Linguistics.
[50] W. Yu, C. Zhang, H. Cao, W. Hua, B. Li, H. Chen, M. Liu, M. Chen, J. Kuang, M. Cheng, et al. Icdar 2023 competition on structured text extraction from visuallyrich document images. In International Conference on Document Analysis and Recognition, pages 536–552. Springer, 2023.
-
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97881
dc.description.abstract: 儘管大型語言模型(LLMs)和多模態大型語言模型(MLLMs)推進了視覺文件理解(VDU)領域的發展,但從關係豐富的文件中提取視覺資訊 (VIE) 仍然是一個複雜的挑戰,這源於文件排版的極大多樣性以及訓練資料的稀缺性。現有的合成文件生成方法雖旨在緩解這個問題,但往往效果不佳,因為它們經常依賴人工標註的排版模板,或者使用基於固定規則的方法來生成文件,因而限制了文件排版的多樣性。此外,目前的排版生成技術通常專注於幾何結構,而沒有整合有意義的文字內容,阻礙了它們生成具有複雜的文字內容與排版交互關係的文件。為了克服這些障礙,我們引入了關係豐富的視覺文件生成器(RIDGE)。這是一個兩階段的框架,首先,我們的內容生成階段採用 LLMs 來創建以階層結構文字格式來表達的文件內容,隱含了實體類別以及實體關係的資訊。其次,我們的內容驅動排版生成階段在文字內容的引導下產生多樣化且逼真的文件排版,訓練此模型僅需使用容易獲得的 OCR 資料,無需任何人工標註。通過廣泛的實驗,我們驗證了 RIDGE 在多個 VIE 基準測試中顯著提升了文件理解模型的表現。 [zh_TW]
dc.description.abstract: Although Large Language Models (LLMs) and Multimodal LLMs (MLLMs) have advanced the field of visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains a complex challenge, stemming from the vast diversity of document layouts and the scarcity of training data. Existing synthetic document generation methods aim to mitigate this issue but often fall short, since they either depend on manually crafted layouts and templates or use rule-based methods that constrain layout variety. Additionally, current layout generation techniques typically focus on geometric structures without integrating meaningful textual content, limiting their ability to produce documents with intricate content-layout relationships. To overcome these obstacles, we introduce the Relation-rIch visual Document GEnerator (RIDGE), a two-stage framework designed to bridge these gaps. First, our Content Generation stage employs LLMs to create document content in a Hierarchical Structure Text format that explicitly encodes entity categories and their relationships. Second, our Content-driven Layout Generation stage produces diverse and realistic layouts guided by textual content; this stage is trained using only readily obtainable OCR data, with no manual annotation required. Through extensive experiments, we show that RIDGE significantly improves the performance of document understanding models across multiple VIE benchmarks. [en] (An illustrative sketch of such a hierarchical content representation follows this metadata record.)
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-21T16:07:11Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-07-21T16:07:11Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Related Work 5
Chapter 3 Method 9
3.1 Content Generation 10
3.2 Content-driven Layout Generation (CLGM) 11
3.2.1 Document Layout Serialization 11
3.2.2 Layout Self-Supervised Learning 12
3.2.3 Document Rendering 15
3.3 Hierarchical Structure Learning 16
Chapter 4 Experiments 19
4.1 Implementation Details 19
4.1.1 Models 19
4.1.2 Datasets 20
4.2 Evaluation Setup 21
4.3 Fine-tuning MLLMs with RIDGE 22
4.3.1 Fine-tuning with RIDGE Alone 23
4.3.2 Fine-tuning with Existing Real-world Datasets 24
4.4 Domain-Specific Document Generation 25
4.5 Applied RIDGE on LayoutLMv3 27
4.6 Ablation Study 28
4.7 Interpretability 29
4.8 Layout & Content Evaluation 30
Chapter 5 Discussions 31
5.1 One-Stage & Two-Stage Generation 31
5.2 Limitation 36
Chapter 6 Conclusion 37
Chapter 7 Supplementary Material 39
7.1 Content Generation Prompts 39
7.2 Ground Truth Correction for SROIE– 40
7.3 More Examples of Images Generated by RIDGE 41
7.4 Introduction to VIE Methods 42
References 49
dc.language.iso: en
dc.subject: 合成文件生成 [zh_TW]
dc.subject: 視覺資訊提取 [zh_TW]
dc.subject: 排版生成 [zh_TW]
dc.subject: synthetic document generation [en]
dc.subject: visual information extraction [en]
dc.subject: layout generation [en]
dc.title: 用於資訊提取任務的視覺文件生成 [zh_TW]
dc.title: Visual Document Synthesis for Information Extraction Tasks [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 陳縕儂;張天豪 [zh_TW]
dc.contributor.oralexamcommittee: Yun-Nung Chen;Tien-Hao Chang [en]
dc.subject.keyword: 合成文件生成, 排版生成, 視覺資訊提取 [zh_TW]
dc.subject.keyword: synthetic document generation, layout generation, visual information extraction [en]
dc.relation.page: 56
dc.identifier.doi: 10.6342/NTU202501745
dc.rights.note: 同意授權(限校園內公開) (authorized, restricted to on-campus access)
dc.date.accepted: 2025-07-17
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2025-07-22
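
The abstract above outlines a two-stage design: an LLM-driven Content Generation stage that expresses document content in a Hierarchical Structure Text format encoding entity categories and relationships, followed by a Content-driven Layout Generation stage conditioned on that text. The Python snippet below is a minimal, purely illustrative sketch of what such a category-tagged entity hierarchy and its text serialization could look like; the class and function names are hypothetical, and this is not the thesis's actual format or code.

# Illustrative sketch only (hypothetical names): a hierarchical, relation-annotated
# content representation for a synthetic document, serialized as indented
# "structure text" that a layout-generation stage could consume.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """A text entity with a category label and child entities (e.g. a question linked to its answer)."""
    text: str
    category: str                          # e.g. "header", "question", "answer"
    children: List["Entity"] = field(default_factory=list)


def to_structure_text(entity: Entity, depth: int = 0) -> str:
    """Serialize an entity tree into indented, category-tagged lines."""
    lines = [f"{'  ' * depth}[{entity.category}] {entity.text}"]
    for child in entity.children:
        lines.append(to_structure_text(child, depth + 1))
    return "\n".join(lines)


if __name__ == "__main__":
    receipt = Entity("ACME STORE RECEIPT", "header", children=[
        Entity("Date", "question", children=[Entity("2025-07-17", "answer")]),
        Entity("Total", "question", children=[Entity("$12.50", "answer")]),
    ])
    print(to_structure_text(receipt))

Running the example prints one indented line per entity, pairing each text span with its category and nesting it under its parent; in a pipeline like the one the abstract describes, a content-driven layout stage would take such a serialization as conditioning input and predict a bounding box for each entity.
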
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File: ntu-113-2.pdf (access restricted to NTU campus IP addresses; off-campus users should connect via the library's VPN service)
Size: 27.76 MB
Format: Adobe PDF

