Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101907

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 莊永裕 | zh_TW |
| dc.contributor.advisor | Yung-Yu Chuang | en |
| dc.contributor.author | 許家銓 | zh_TW |
| dc.contributor.author | Chia-Chuan Hsu | en |
| dc.date.accessioned | 2026-03-05T16:39:13Z | - |
| dc.date.available | 2026-03-06 | - |
| dc.date.copyright | 2026-03-05 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-02-05 | - |
| dc.identifier.citation | Q. Chen, T. Zhang, C. Wang, X. He, D. Wang, and T. Liu. Attribution analysis meets model editing: Advancing knowledge correction in vision language models with VisEdit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2168–2176, 2025.
S. Cheng, B. Tian, Q. Liu, X. Chen, Y. Wang, H. Chen, and N. Zhang. Can we edit multimodal large language models? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Y. Du, K. Jiang, Z. Gao, C. Shi, Z. Zheng, S. Qi, and Q. Li. MMKE-Bench: A multimodal editing benchmark for diverse visual knowledge. In International Conference on Learning Representations (ICLR), 2025.
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
A. Gupta, S. Baskaran, and G. Anumanchipalli. Rebuilding ROME: Resolving model collapse during sequential model editing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21738–21744, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics.
H. Huang, H. Zhong, T. Yu, Q. Liu, S. Wu, L. Wang, and T. Tan. VLKEB: A large vision-language model knowledge editing benchmark. Advances in Neural Information Processing Systems, 37:9257–9280, 2024.
Z. Jiang, J. Chen, B. Zhu, T. Luo, Y. Shen, and X. Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25004–25014, 2025.
K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan. GeoChat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831–27840, 2024.
Y. Li, L. Wang, T. Wang, X. Yang, J. Luo, Q. Wang, Y. Deng, W. Wang, X. Sun, H. Li, et al. STAR: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1832–1849, 2025.
H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, June 2024.
J. Luo, Z. Pang, Y. Zhang, T. Wang, L. Wang, B. Dang, J. Lao, J. Wang, J. Chen, Y. Tan, et al. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv preprint arXiv:2406.10100, 2024.
K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022.
C. Pang, J. Wu, J. Li, Y. Liu, J. Sun, W. Li, X. Weng, S. Wang, L. Feng, G.-S. Xia, et al. H2RSVLM: Towards helpful and honest remote sensing large vision language model. CoRR, 2024.
Z. Shi, B. Wang, C. Si, Y. Wu, J. Kim, and H. Pfister. DualEdit: Dual editing for knowledge updating in vision-language models. In Conference on Language Modeling (COLM), 2025.
Z. Zeng, L. Gu, X. Yang, Z. Duan, Z. Shi, and M. Wang. Visual-oriented fine-grained knowledge editing for multimodal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2491–2500, 2025.
Y. Zhan, Z. Xiong, and Y. Yuan. SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS Journal of Photogrammetry and Remote Sensing, 221:64–77, 2025.
J. Zhang, M. Khayatkhoei, P. Chhikara, and F. Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. In International Conference on Learning Representations (ICLR), 2025.
W. Zhang, M. Cai, T. Zhang, Y. Zhuang, and X. Mao. EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101907 | - |
| dc.description.abstract | 本論文探討在領域專屬遙測視覺問答(VQA)中,不經重新訓練的關係知識編輯。透過在 STAR 子集上的 attention-ratio 診斷分析,我們指出存在定位—推理的解耦:模型即使能關注到正確區域,仍可能產生帶偏誤的關係標籤。接著,我們將關係推理改寫為純文字情境任務並套用 ROME 式局部更新,揭示共軛干擾(conjugate interference)以及多次編輯順序對穩定性的高度敏感。最後的遷移測試顯示,語言端的語意修正難以可靠轉移到多模態 VQA 推理中,突顯跨模態泛化的關鍵限制。 | zh_TW |
| dc.description.abstract | This thesis studies relationship knowledge editing without retraining for domain-specific remote-sensing VQA. Using a curated STAR subset and attention-ratio diagnostics, we show a grounding–reasoning decoupling: models often attend to the correct regions yet still produce biased relationship labels. We then cast relationship reasoning as a text-only scenario task and apply ROME-style localized updates, revealing conjugate interference and strong sensitivity to multi-edit order. Finally, transfer tests indicate that language-side semantic edits do not reliably carry over to multimodal VQA inference, highlighting key limits of cross-modal generalization. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-03-05T16:39:12Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-03-05T16:39:13Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from Oral Examination Committee i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Problem Statement 2
1.3 Research Approach 3
1.4 Thesis Contributions 4
1.5 Thesis Organization 5
Chapter 2 Related Work 7
2.1 Remote Sensing Datasets and Relation-Aware RS-VLMs 8
2.2 Grounding Versus Reasoning Decoupling in Multimodal LLMs 9
2.3 Model Editing and Sequential Stability 10
2.4 Multimodal Editing and Cross-Modal Transfer 11
2.5 Positioning of This Thesis 12
Chapter 3 Datasets 15
3.1 Overview of Datasets 15
3.2 Limitations of Directly Using FIT-RS and FIT-RSRC for Diagnosis 16
3.3 Curated STAR Subset for Controlled Analysis 17
Chapter 4 Diagnosis: Grounding vs. Semantics 19
4.1 Diagnostic Framing: Grounding vs. Relationship Semantics 19
4.2 Grounding Analysis via Attention Ratio 20
4.2.1 Formal Definition of Relative Attention 21
4.2.2 Calculation of Entity-Specific Attention Ratio 21
4.2.3 Final Relationship Grounding Metric 22
4.3 Interpretation of Grounding Results 22
4.4 Relationship Accuracy and Bias Analysis 24
4.5 Diagnostic Conclusion 24
Chapter 5 Method 27
5.1 Problem Formulation: Text-Only Relationship Scenarios 27
5.1.1 Text-Only Scenario Construction 27
5.1.2 Query, Answer Space, and Relationship Families 28
5.2 Target Representation and Editing Location 29
5.2.1 Validation via Causal Tracing 30
5.2.2 Target Token and Layer Selection 32
5.3 ROME-Based Editing Procedure 32
5.3.1 Editing as a Rank-One Update 33
5.3.2 Key Representation for Relationship Editing 34
5.3.3 Value Specification via Target Relationship Labels 34
5.3.4 Editing Regime and Scope Clarification 34
5.4 Multi-Case Editing Strategy 35
5.4.1 Sequential Editing per Relationship Family 36
5.4.2 Pre-Edit Evaluation and Conditional Skipping 36
5.4.3 Constraint Recalculation Across Edits 37
5.5 Scope and Methodological Boundaries 38
5.6 Summary 38
Chapter 6 Experiments 39
6.1 Experimental Setup for Text-only Evaluation 39
6.2 Baseline Bias in Text-only Relationship Reasoning 40
6.3 Method Evolution for Multi-case Editing 41
6.4 Single-sided Steering and Conjugate Interference 42
6.5 Editing Order Sensitivity and Interaction-dependent Stability 44
6.6 Replication Across Relationship Families 46
Chapter 7 Transfer: VQA Evaluation 49
7.1 Evaluation Setup for Transfer Analysis 49
7.2 Transfer Results on the Curated Relationship Subset 50
7.3 Transfer Results on FIT-RSRC 52
7.4 Interpretation and Discussion 53
Chapter 8 Conclusion 55
8.1 Conclusion 55
8.2 Limitations and Future Work 56
8.2.1 From Binary Conjugates to Complex Relational Spaces 57
8.2.2 Overcoming Multimodal Inference Inertia 57
8.2.3 Enhancing Sequential Editing Stability 57
References 59
Appendix A: Causal Tracing Results of Other Relationship Families 63 | - |
| dc.language.iso | en | - |
| dc.subject | 模型編輯 | - |
| dc.subject | 多模態學習 | - |
| dc.subject | 遙測影像 | - |
| dc.subject | 視覺問答 | - |
| dc.subject | 跨模態遷移 | - |
| dc.subject | Model Editing | - |
| dc.subject | Multimodal Learning | - |
| dc.subject | Remote Sensing | - |
| dc.subject | Visual Question Answering | - |
| dc.subject | Cross-Modal Transferability | - |
| dc.title | 不經重新訓練的關係知識編輯方法:以FIT-RSRC領域專屬視覺問答為例 | zh_TW |
| dc.title | Editing Relationship Knowledge Without Retraining: A Case Study on Domain-Specific VQA (FIT-RSRC) | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 (Master) | - |
| dc.contributor.oralexamcommittee | 吳賦哲;葉正聖 | zh_TW |
| dc.contributor.oralexamcommittee | Fu-Che Wu;Jeng-Sheng Yeh | en |
| dc.subject.keyword | 模型編輯,多模態學習,遙測影像,視覺問答,跨模態遷移 | zh_TW |
| dc.subject.keyword | Model Editing,Multimodal Learning,Remote Sensing,Visual Question Answering,Cross-Modal Transferability | en |
| dc.relation.page | 67 | - |
| dc.identifier.doi | 10.6342/NTU202600393 | - |
| dc.rights.note | 未授權 (Not authorized) | - |
| dc.date.accepted | 2026-02-08 | - |
| dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | - |
| dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | - |
| dc.date.embargo-lift | N/A | - |
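
Two techniques named in the abstract, attention-ratio grounding diagnostics (thesis Sections 4.2.1–4.2.3) and ROME-style rank-one updates (Section 5.3.1), are not spelled out in this record. The LaTeX sketch below gives commonly used forms of both for orientation only; the symbols A, R_e, I, W, k_*, v_*, and C are assumptions of this sketch, and the thesis's exact definitions may differ.

```latex
% Illustrative sketch only; not the thesis's exact definitions.
% Attention-ratio grounding (a generic form): A(t,i) is the attention
% weight from answer token t to image token i, averaged over heads at a
% chosen layer; R_e is the set of image tokens inside entity e's region;
% I is the set of all image tokens.
\[
  \mathrm{ratio}(t, e) \;=\;
    \frac{\sum_{i \in R_e} A(t, i)}{\sum_{i \in I} A(t, i)}
\]
% ROME rank-one update (Meng et al., 2022, cited above): W is the MLP
% down-projection at the edited layer, k_* the key computed from the
% edit prompt, v_* the value optimized to make the model emit the target
% relationship label, and C an estimate of E[k k^T] over covering text.
\[
  \hat{W} \;=\; W \;+\; \Lambda\,(C^{-1} k_{*})^{\top},
  \qquad
  \Lambda \;=\; \frac{v_{*} - W k_{*}}{(C^{-1} k_{*})^{\top} k_{*}}
\]
```

The ROME form shown is the one the citation list attributes to Meng et al. (2022); whether the thesis modifies it (for example, via the constraint recalculation of Section 5.4.3) cannot be determined from this record alone.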
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (restricted, not publicly available) | 1.73 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
