NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98519

Full metadata record (DC field [language]: value)
dc.contributor.advisor [zh_TW]: 鄭文皇
dc.contributor.advisor [en]: Wen-Huang Cheng
dc.contributor.author [zh_TW]: 黃康洋
dc.contributor.author [en]: Kang-Yang Huang
dc.date.accessioned: 2025-08-14T16:25:50Z
dc.date.available: 2025-08-15
dc.date.copyright: 2025-08-14
dc.date.issued: 2025
dc.date.submitted: 2025-08-01
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98519
dc.description.abstract [zh_TW]:
生成式人工智慧透過運用多樣化的輸入條件,徹底革新了多媒體內容的創作方式。然而,隨著這些模型日益進步,檢測由人工智慧生成的內容,特別是深度偽造(DeepFake),變得愈發困難。對深偽技術日益增加的關注,使得檢測方法,尤其是多模態大型語言模型在辨識深偽內容方面的效能,成為研究重點。多模態大型語言模型不僅能透過提供決策解釋來提升深偽檢測的透明度,區分真實與合成內容的過程同時也是對其感知與推理能力的嚴格考驗。
為了應對這些挑戰,我們提出了 MMIDBench,一個精心設計、全面評估多模態大型語言模型能力的多模態基準。MMIDBench 涵蓋多種最先進的深偽生成模型,橫跨影像、影片與音訊,包含 6 種不同的深偽任務。該基準包含 10k 道題目,涵蓋二元選擇、多選題及開放式問答等多種題型,能夠對多模態大型語言模型進行深入評估。我們利用 MMIDBench 評測了 5 款閉源多模態大型語言模型,揭示了它們在深偽檢測上的優勢與現階段的侷限。
dc.description.abstract [en]:
Generative artificial intelligence has revolutionized how multimedia content is created by utilizing diverse input conditions. However, as these models become more advanced, detecting AI-generated content, particularly DeepFakes, has grown increasingly challenging. Rising concerns over DeepFakes have heightened interest in detection methods, specifically the effectiveness of multimodal large language models (MLLMs) in identifying them. MLLMs not only improve the transparency of DeepFake detection by providing explanations for their decisions, but the process of distinguishing authentic from synthetic content also serves as a robust test of their perceptual and reasoning skills.
To address these challenges, we introduce MMIDBench, a comprehensive multimodal benchmark meticulously crafted to assess the capabilities of MLLMs. MMIDBench features a variety of state-of-the-art DeepFake generative models spanning images, videos, and audio, encompassing 6 distinct DeepFake tasks. The benchmark comprises 10k questions, including binary, multiple-choice, and open-ended formats, enabling an in-depth assessment of MLLMs. We evaluated 5 proprietary MLLMs with MMIDBench, revealing both their strengths and current limitations in DeepFake detection.
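For readers unfamiliar with benchmarks of this kind, the following is a minimal Python sketch of how one benchmark item with binary, multiple-choice, or open-ended questions might be represented and scored. The schema, field names, task label, and exact-match scoring rule here are illustrative assumptions for exposition only, not the thesis's actual MMIDBench implementation:

    # Hypothetical sketch of an MMIDBench-style entry; all names and the
    # scoring rule are assumptions, not the thesis's actual schema.
    from dataclasses import dataclass
    from typing import Literal, Optional

    QType = Literal["binary", "multiple-choice", "open-ended"]

    @dataclass
    class BenchmarkQuestion:
        media_path: str      # image, video, or audio clip under test
        modality: Literal["image", "video", "audio"]
        task: str            # e.g., one of the six DeepFake tasks
        qtype: QType
        prompt: str          # question shown to the MLLM
        answer: Optional[str]  # gold answer; None for open-ended items

    def score_closed_form(pred: str, gold: str) -> float:
        """Exact-match accuracy for binary and multiple-choice items."""
        return float(pred.strip().lower() == gold.strip().lower())

    # Usage: a binary question on a suspected face-swap video
    # (hypothetical example item).
    q = BenchmarkQuestion(
        media_path="videos/sample_001.mp4",
        modality="video",
        task="face swapping",
        qtype="binary",
        prompt="Is this video real or AI-generated? Answer 'real' or 'fake'.",
        answer="fake",
    )
    print(score_closed_form("Fake", q.answer))  # 1.0

Open-ended items would need a different scorer (the thesis's TOC points to an assertion-based evaluation for this); the sketch above covers only the closed-form formats.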
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-14T16:25:50Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-08-14T16:25:50Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
Contents vi
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 DeepFake Generation 5
2.2 DeepFake Detection 6
2.3 Multimodal Large Language Models 8
2.4 Synthetic Data Detection Benchmark 9
Chapter 3 MMIDBench 13
3.1 Overview of MMIDBench 13
3.2 Real Data and Deepfake Synthesis 18
3.3 Multimodal Artifacts Annotation 24
3.4 Assertion-Based Evaluation and Question Generation 33
Chapter 4 Experiments 35
4.1 Baselines 35
4.2 Assertion-Based Evaluation 36
4.3 Interpretable DeepFake Detection Results 40
Chapter 5 Conclusion 50
References 51
dc.language.iso: en
dc.subject [zh_TW]: 多模態大型語言模型
dc.subject [zh_TW]: 可解釋性
dc.subject [zh_TW]: 深度偽造檢測
dc.subject [zh_TW]: 深度偽造生成
dc.subject [zh_TW]: 影像編輯
dc.subject [zh_TW]: 語音合成
dc.subject [zh_TW]: 擴散模型
dc.subject [zh_TW]: 生成式人工智慧
dc.subject [en]: Diffusion Models
dc.subject [en]: Multimodal Large Language Models
dc.subject [en]: Voice Synthesis
dc.subject [en]: Interpretability
dc.subject [en]: Deepfake Detection
dc.subject [en]: Deepfake Generation
dc.subject [en]: Image Editing
dc.subject [en]: Generative Artificial Intelligence
dc.title [zh_TW]: 多模態大型語言模型之可解釋深度偽造檢測基準
dc.title [en]: MMIDBench: Multimodal Interpretable Deepfake Detection Benchmark for Multimodal Large Language Models
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee [zh_TW]: 莊永裕; 帥宏翰; 黃敬群; 簡韶逸
dc.contributor.oralexamcommittee [en]: Yung-Yu Chuang; Hong-Han Shuai; Ching-Chun Huang; Shao-Yi Chien
dc.subject.keyword [zh_TW]: 多模態大型語言模型, 可解釋性, 深度偽造檢測, 深度偽造生成, 影像編輯, 語音合成, 擴散模型, 生成式人工智慧
dc.subject.keyword [en]: Multimodal Large Language Models, Interpretability, Deepfake Detection, Deepfake Generation, Image Editing, Voice Synthesis, Diffusion Models, Generative Artificial Intelligence
dc.relation.page: 69
dc.identifier.doi: 10.6342/NTU202503386
dc.rights.note: 同意授權(全球公開) (authorized for worldwide open access)
dc.date.accepted: 2025-08-06
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
dc.date.embargo-lift: 2025-08-15
Appears in collections: 資訊網路與多媒體研究所

Files in this item:
File: ntu-113-2.pdf | Size: 12.54 MB | Format: Adobe PDF


Except where their copyright terms are otherwise indicated, all items in this repository are protected by copyright, with all rights reserved.
