NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98519

Full metadata record (DC field [language]: value)
dc.contributor.advisor [zh_TW]: 鄭文皇
dc.contributor.advisor [en]: Wen-Huang Cheng
dc.contributor.author [zh_TW]: 黃康洋
dc.contributor.author [en]: Kang-Yang Huang
dc.date.accessioned: 2025-08-14T16:25:50Z
dc.date.available: 2025-08-15
dc.date.copyright: 2025-08-14
dc.date.issued: 2025
dc.date.submitted: 2025-08-01
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98519
dc.description.abstract [zh_TW]:
生成式人工智慧透過運用多樣化的輸入條件,徹底革新了多媒體內容的創作方式。然而,隨著這些模型日益進步,檢測由人工智慧生成的內容,特別是深度偽造(DeepFake),變得愈發困難。對深偽技術日益增加的關注,使得檢測方法,尤其是多模態大型語言模型在辨識深偽內容方面的效能,成為研究重點。多模態大型語言模型不僅能透過提供決策解釋來提升深偽檢測的透明度,區分真實與合成內容的過程同時也是對其感知與推理能力的嚴格考驗。
為了應對這些挑戰,我們提出了 MMIDBench,一個精心設計、全面評估多模態大型語言模型能力的多模態基準。MMIDBench 涵蓋多種最先進的深偽生成模型,橫跨影像、影片與音訊,包含 6 種不同的深偽任務。該基準包含 10k 道題目,涵蓋二元選擇、多選題及開放式問答等多種題型,能夠對多模態大型語言模型進行深入評估。我們利用 MMIDBench 評測了 5 款閉源多模態大型語言模型,揭示了它們在深偽檢測上的優勢與現階段的侷限。
dc.description.abstract [en]:
Generative artificial intelligence has revolutionized how multimedia content is created by utilizing diverse input conditions. However, as these models become more advanced, detecting AI-generated content, particularly DeepFakes, has grown increasingly challenging. Rising concerns over DeepFakes have heightened interest in detection methods, specifically the effectiveness of multimodal large language models (MLLMs) in identifying them. MLLMs not only improve the transparency of DeepFake detection by providing explanations for their decisions, but the process of distinguishing authentic from synthetic content also serves as a robust test of their perceptual and reasoning skills.
To address these challenges, we introduce MMIDBench, a comprehensive multimodal benchmark meticulously crafted to assess the capabilities of MLLMs. MMIDBench features a variety of state-of-the-art DeepFake generative models spanning images, videos, and audio, encompassing 6 distinct DeepFake tasks. The benchmark comprises 10k questions, including binary, multiple-choice, and open-ended formats, enabling an in-depth assessment of MLLMs. We evaluated 5 proprietary MLLMs with MMIDBench, revealing both their strengths and current limitations in DeepFake detection.
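For readers unfamiliar with benchmarks of this kind, the following is a minimal Python sketch of how one benchmark item with binary, multiple-choice, or open-ended questions might be represented and scored. The schema, field names, task label, and exact-match scoring rule here are illustrative assumptions for exposition only, not the thesis's actual MMIDBench implementation:

    # Hypothetical sketch of an MMIDBench-style entry; all names and the
    # scoring rule are assumptions, not the thesis's actual schema.
    from dataclasses import dataclass
    from typing import Literal, Optional

    QType = Literal["binary", "multiple-choice", "open-ended"]

    @dataclass
    class BenchmarkQuestion:
        media_path: str      # image, video, or audio clip under test
        modality: Literal["image", "video", "audio"]
        task: str            # e.g., one of the six DeepFake tasks
        qtype: QType
        prompt: str          # question shown to the MLLM
        answer: Optional[str]  # gold answer; None for open-ended items

    def score_closed_form(pred: str, gold: str) -> float:
        """Exact-match accuracy for binary and multiple-choice items."""
        return float(pred.strip().lower() == gold.strip().lower())

    # Usage: a binary question on a suspected face-swap video
    # (hypothetical example item).
    q = BenchmarkQuestion(
        media_path="videos/sample_001.mp4",
        modality="video",
        task="face swapping",
        qtype="binary",
        prompt="Is this video real or AI-generated? Answer 'real' or 'fake'.",
        answer="fake",
    )
    print(score_closed_form("Fake", q.answer))  # 1.0

Open-ended items would need a different scorer (the thesis's TOC points to an assertion-based evaluation for this); the sketch above covers only the closed-form formats.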
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-14T16:25:50Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-08-14T16:25:50Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
Contents vi
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 DeepFake Generation 5
2.2 DeepFake Detection 6
2.3 Multimodal Large Language Models 8
2.4 Synthetic Data Detection Benchmark 9
Chapter 3 MMIDBench 13
3.1 Overview of MMIDBench 13
3.2 Real Data and Deepfake Synthesis 18
3.3 Multimodal Artifacts Annotation 24
3.4 Assertion-Based Evaluation and Question Generation 33
Chapter 4 Experiments 35
4.1 Baselines 35
4.2 Assertion-Based Evaluation 36
4.3 Interpretable DeepFake Detection Results 40
Chapter 5 Conclusion 50
References 51
dc.language.iso: en
dc.subject [zh_TW]: 多模態大型語言模型
dc.subject [zh_TW]: 可解釋性
dc.subject [zh_TW]: 深度偽造檢測
dc.subject [zh_TW]: 深度偽造生成
dc.subject [zh_TW]: 影像編輯
dc.subject [zh_TW]: 語音合成
dc.subject [zh_TW]: 擴散模型
dc.subject [zh_TW]: 生成式人工智慧
dc.subject [en]: Diffusion Models
dc.subject [en]: Multimodal Large Language Models
dc.subject [en]: Voice Synthesis
dc.subject [en]: Interpretability
dc.subject [en]: Deepfake Detection
dc.subject [en]: Deepfake Generation
dc.subject [en]: Image Editing
dc.subject [en]: Generative Artificial Intelligence
dc.title [zh_TW]: 多模態大型語言模型之可解釋深度偽造檢測基準
dc.title [en]: MMIDBench: Multimodal Interpretable Deepfake Detection Benchmark for Multimodal Large Language Models
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee [zh_TW]: 莊永裕; 帥宏翰; 黃敬群; 簡韶逸
dc.contributor.oralexamcommittee [en]: Yung-Yu Chuang; Hong-Han Shuai; Ching-Chun Huang; Shao-Yi Chien
dc.subject.keyword [zh_TW]: 多模態大型語言模型, 可解釋性, 深度偽造檢測, 深度偽造生成, 影像編輯, 語音合成, 擴散模型, 生成式人工智慧
dc.subject.keyword [en]: Multimodal Large Language Models, Interpretability, Deepfake Detection, Deepfake Generation, Image Editing, Voice Synthesis, Diffusion Models, Generative Artificial Intelligence
dc.relation.page: 69
dc.identifier.doi: 10.6342/NTU202503386
dc.rights.note: 同意授權(全球公開) (authorized for worldwide open access)
dc.date.accepted: 2025-08-06
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
dc.date.embargo-lift: 2025-08-15
Appears in collections: 資訊網路與多媒體研究所

Files in this item:
File: ntu-113-2.pdf | Size: 12.54 MB | Format: Adobe PDF


Except where their copyright terms are otherwise indicated, all items in this repository are protected by copyright, with all rights reserved.
