Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85069
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 陳祝嵩(Chu-Song Chen) | |
dc.contributor.author | Chang-Sung Sung | en |
dc.contributor.author | 宋昶松 | zh_TW |
dc.date.accessioned | 2023-03-19T22:41:39Z | - |
dc.date.copyright | 2022-08-30 | |
dc.date.issued | 2022 | |
dc.date.submitted | 2022-08-15 | |
dc.identifier.citation | [1] Deepfakes repo. https://github.com/deepfakes/faceswap. [2] FaceSwap repo. https://github.com/marekkowalski/faceswap. [3] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen. MesoNet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–7. IEEE, 2018. [4] T. Afouras, A. Owens, J. S. Chung, and A. Zisserman. Self-supervised learning of audio-visual objects from video. In European Conference on Computer Vision, pages 208–224. Springer, 2020. [5] S. Agarwal, H. Farid, O. Fried, and M. Agrawala. Detecting deepfake videos from phoneme-viseme mismatches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 660–661, 2020. [6] R. Arandjelovic and A. Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–617, 2017. [7] R. Arandjelovic and A. Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018. [8] O. Arriaga, M. Valdenegro-Toro, and P. Plöger. Real-time convolutional neural networks for emotion and gender classification. arXiv preprint arXiv:1710.07557, 2017. [9] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, pages 1021–1030, 2017. [10] L. Chai, D. Bau, S.-N. Lim, and P. Isola. What makes fake images detectable? Understanding properties that generalize. In European Conference on Computer Vision, pages 103–120. Springer, 2020. [11] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020. [12] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255, 2020. [13] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. [14] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017. [15] K. Chugh, P. Gupta, A. Dhall, and R. Subramanian. Not made for each other: Audio-visual dissonance-based deepfake detection and localization. In Proceedings of the 28th ACM International Conference on Multimedia, pages 439–447, 2020. [16] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3444–3453. IEEE, 2017. [17] J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, pages 87–103. Springer, 2016. [18] J. S. Chung and A. Zisserman. Out of time: Automated lip sync in the wild. In Asian Conference on Computer Vision, pages 251–263. Springer, 2016. [19] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou. RetinaFace: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5203–5212, 2020. [20] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer. The DeepFake Detection Challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397, 2020. [21] D. Doukhan, J. Carrive, F. Vallet, A. Larcher, and S. Meignier. An open-source speaker gender detection framework for monitoring gender equality. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5214–5218. IEEE, 2018. [22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014. [23] Z. Gu, Y. Chen, T. Yao, S. Ding, J. Li, F. Huang, and L. Ma. Spatiotemporal inconsistency learning for deepfake video detection. In Proceedings of the 29th ACM International Conference on Multimedia, pages 3473–3481, 2021. [24] H. Guo, S. Hu, X. Wang, M.-C. Chang, and S. Lyu. Eyes tell all: Irregular pupil shapes reveal GAN-generated faces. arXiv preprint arXiv:2109.00162, 2021. [25] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010. [26] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742. IEEE, 2006. [27] A. Haliassos, K. Vougioukas, S. Petridis, and M. Pantic. Lips don't lie: A generalisable and robust approach to face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5039–5049, 2021. [28] D. Hazarika, S. Poria, R. Zimmermann, and R. Mihalcea. Conversational transfer learning for emotion recognition. Information Fusion, 65:1–12, 2021. [29] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. [30] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. [31] K. Hoover, S. Chaudhuri, C. Pantofaru, M. Slaney, and I. Sturdy. Putting a face to the voice: Fusing audio and visual signals across a video to determine speakers. arXiv preprint arXiv:1706.00079, 2017. [32] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017. [33] L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In CVPR, 2020. [34] T. Jung, S. Kim, and K. Kim. DeepVision: Deepfakes detection using human eye blinking pattern. IEEE Access, 8:83144–83154, 2020. [35] H. Khalid, S. Tariq, M. Kim, and S. S. Woo. FakeAVCeleb: A novel audio-video multimodal deepfake dataset. arXiv preprint arXiv:2108.05080, 2021. [36] B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems, 31, 2018. [37] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–42, 2015. [38] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen. Advancing high fidelity identity swapping for forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5074–5083, 2020. [39] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5001–5010, 2020. [40] Y. Li and S. Lyu. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656, 2018. [41] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020. [42] Z. Liu, X. Qi, and P. H. Torr. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8060–8069, 2020. [43] J. Lyons, D. Y.-B. Wang, Gianluca, H. Shteingart, E. Mavrinac, Y. Gaurkar, W. Watcharawisetkul, S. Birch, L. Zhihe, J. Hölzl, J. Lesinskis, H. Almér, C. Lord, and A. Stark. jameslyons/python_speech_features: release v0.6.1, Jan. 2020. [44] I. Masi, A. Killekar, R. M. Mascarenhas, S. P. Gurudatt, and W. AbdAlmageed. Two-branch recurrent network for isolating deepfakes in videos. In European Conference on Computer Vision, pages 667–684. Springer, 2020. [45] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264(5588):746–748, 1976. [46] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha. Emotions don't lie: An audio-visual deepfake detection method using affective cues. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2823–2832, 2020. [47] P. Morgado, N. Vasconcelos, and I. Misra. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12475–12486, 2021. [48] A. Nagrani, S. Albanie, and A. Zisserman. Learnable PINs: Cross-modal embeddings for person identity. In Proceedings of the European Conference on Computer Vision (ECCV), pages 71–88, 2018. [49] A. Nagrani, S. Albanie, and A. Zisserman. Seeing voices and hearing faces: Cross-modal biometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8427–8436, 2018. [50] C. S. Ooi, K. P. Seng, L. M. Ang, and L. W. Chew. A new approach of audio emotion recognition. Expert Systems with Applications, 41(13):5858–5869, 2014. [51] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018. [52] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019. [53] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI), 3(1):80–87, 2019. [54] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015. [55] C. Sheng, M. Pietikäinen, Q. Tian, and L. Liu. Cross-modal self-supervised learning for lip reading: When contrastive learning meets adversarial training. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2456–2464, 2021. [56] K. Shiohara and T. Yamasaki. Detecting deepfakes with self-blended images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18720–18729, 2022. [57] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [58] S. Skoog Waller, M. Eriksson, and P. Sörqvist. Can you hear my age? Influences of speech rate and speech spontaneity on estimation of speaker age. Frontiers in Psychology, 6:978, 2015. [59] M. Tan and Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019. [60] J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019. [61] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016. [62] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5200–5204. IEEE, 2016. [63] A. Van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. [64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. [65] D. Verma. Multi-modal for movie genre prediction. https://github.com/dh1105/multi-modal-movie-genre-prediction, 2021. [66] xkaple01. Multimodal classification. https://github.com/xkaple01/multimodal-classification, 2019. [67] X. Yang, Y. Li, H. Qi, and S. Lyu. Exposing GAN-synthesized faces using landmark locations. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, pages 113–118, 2019. [68] H. Zhao, W. Zhou, D. Chen, T. Wei, W. Zhang, and N. Yu. Multi-attentional deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2185–2194, 2021. [69] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, and W. Xia. Learning self-consistency for deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15023–15033, 2021. [70] Y. Zheng, J. Bao, D. Chen, M. Zeng, and F. Wen. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15044–15054, 2021. [71] Y. Zhou and S.-N. Lim. Joint audio-visual deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14800–14809, 2021. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85069 | - |
dc.description.abstract | 近期因為deepfake(深度偽造)的興起及濫用,產生出一些危害到社會上的問題,例如惡意偽造影片造成他人名聲的損毀或是散播不當的假訊息。雖然目前偽造偵測方法能對於已知偽造方法達到一定的成效,並且有些研究發現透過先驗知識或是預訓練的方式也能對未見過的偽造方法達到很好的效果,但這些方法可能侷限於常見的影片壓縮,或是在進行預訓練時,資料所需要的標註成本很高昂。因此在本篇論文中,我們提出了AVM-FFD架構,一種能夠對於未知的偽造方法也能保有良好的偵測能力的模型。AVM-FFD主要著重在判別影像及聲音之間的一致性,並透過相互之間的特徵關係作為線索,判別出是否有被偽造過。架構包含兩個和空間時序相關的特徵抽取模型分別用於影像及聲音上,會預先訓練在影像聲音之間的對應任務上,因此對於影像和聲音的對應關係俱有一定的了解。接著後面加上一個時序相關的辨識模型透過前面提取的特徵進行辨識,為了不過擬合在某些特定的偽造瑕疵上,我們會鎖住特徵抽取模型的參數,只訓練最後的辨識模型在偽造資料上。最後我們透過實驗情境,測試在未見過的偽造種類以及未見過的資料集上,證實了我們的方法確實有效,並能達到良好的偵測效果。 | zh_TW |
dc.description.abstract | Recently, the growth and abuse of deepfakes has created problems that threaten society, such as maliciously fabricated videos that damage reputations or spread false information. Although recent forgery detection methods achieve reasonable results on seen forgery methods, and with prior knowledge or pretraining can reach a certain level of accuracy even on unseen ones, these methods may be limited by common audio or video compression, or by the high annotation cost of the data required for pretraining. In this thesis, we propose AVM-FFD, a framework for detecting forgeries that maintains good detection capability on unseen forgery methods. AVM-FFD determines whether a forgery has occurred by judging the consistency between sound and face, using the relationship between their features as a cue. First, two spatio-temporal feature extraction networks, one for video and one for audio, are pretrained on an audio-visual matching (AVM) task to build a rich representation of the correspondence between audio and visual information. A temporal classifier network then decides whether the video has been manipulated from the representations extracted by these networks. To avoid overfitting to manipulation-specific artifacts, we freeze the feature extraction networks and train only the final classifier network on forged data. Experiments on unseen forgery categories and unseen datasets show that our approach is effective and achieves state-of-the-art performance. | en |
dc.description.provenance | Made available in DSpace on 2023-03-19T22:41:39Z (GMT). No. of bitstreams: 1 U0001-1208202218040400.pdf: 1728906 bytes, checksum: f5251c4c77bd41b54cf842ae3dae4453 (MD5) Previous issue date: 2022 | en |
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i 摘要 iii Abstract v Contents vii List of Figures ix List of Tables xi Chapter 1 Introduction 1 Chapter 2 Related Work 5 2.1 Face Forgery Detection 5 2.2 Audio-Visual Self-supervised Learning 6 2.2.1 Audio-Visual Temporal Synchronisation 7 2.2.2 Audio-Visual Correspondence 7 Chapter 3 Method 9 3.1 Phase 1: Self-supervised Learning by Audio-Visual Matching Task 9 3.1.1 Audio-Visual Temporal Synchronization (AVTS) 10 3.1.2 Audio-Visual Correspondence (AVC) 12 3.2 Phase 2: Audio-Visual Matching for Face Forgery Detection 13 Chapter 4 Result and Discussion 15 4.1 Dataset 15 4.2 Implementation Details 17 4.3 Performance Comparison 18 4.3.1 Cross-manipulation generalization 19 4.3.2 Cross-dataset generalization 19 4.3.3 Cross-modality forgery detection 21 4.3.4 Robustness to common corruptions 22 4.4 Ablation Study 24 Chapter 5 Conclusion 29 References 31 | |
dc.language.iso | en | |
dc.title | 使用影音一致性的自監督式學習增強臉部偽造辨識 | zh_TW |
dc.title | Improved Face Forgery Detection with Self-supervised Audio-Visual Consistency-based Pretraining | en |
dc.type | Thesis | |
dc.date.schoolyear | 110-2 | |
dc.description.degree | 碩士 (Master's) | |
dc.contributor.coadvisor | 陳駿丞(Jun-Cheng Chen) | |
dc.contributor.oralexamcommittee | 王新民(Hsin-Min Wang),葉倚任(Yi-Ren Yeh),賴尚宏(Shang-Hong Lai) | |
dc.subject.keyword | 自監督式學習, 影像聲音輸入, 臉部偽造辨識 | zh_TW |
dc.subject.keyword | Self-supervised Learning, Audio-Visual, Face Forgery Detection | en |
dc.relation.page | 39 | |
dc.identifier.doi | 10.6342/NTU202202353 | |
dc.rights.note | 同意授權(限校園內公開) (authorized; access limited to campus) | |
dc.date.accepted | 2022-08-16 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資料科學學位學程 | zh_TW |
dc.date.embargo-lift | 2022-08-30 | - |
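The two-phase recipe summarized in the abstract above (pretrain audio-visual feature extractors on a matching task, then freeze them and train only a classifier on forged data) can be sketched as follows. This is a minimal, framework-free illustration of the freezing idea only; `Module`, its scalar parameters, and the training step are hypothetical stand-ins, not the thesis's actual networks.

```python
# Sketch of AVM-FFD's phase-2 training setup: pretrained extractors are
# frozen, so a gradient step updates only the classifier. All names here
# are hypothetical stand-ins for illustration.
from dataclasses import dataclass, field

@dataclass
class Module:
    """A toy stand-in for a network: a flat list of scalar parameters."""
    params: list = field(default_factory=list)
    trainable: bool = True

    def step(self, grads, lr=0.1):
        # One gradient-descent step; frozen modules are left untouched.
        if not self.trainable:
            return
        self.params = [p - lr * g for p, g in zip(self.params, grads)]

# Phase 1 (assumed already done): extractors carry pretrained weights
# from the audio-visual matching task.
video_encoder = Module(params=[0.5, -0.2])
audio_encoder = Module(params=[0.3, 0.1])

# Phase 2: freeze both extractors; only the classifier learns.
video_encoder.trainable = False
audio_encoder.trainable = False
classifier = Module(params=[0.0, 0.0])

for module in (video_encoder, audio_encoder, classifier):
    module.step(grads=[1.0, 1.0])  # one (fake) training step

print(video_encoder.params)  # unchanged: [0.5, -0.2]
print(classifier.params)     # updated:   [-0.1, -0.1]
```

In a real implementation the same freeze would typically be expressed by disabling gradient tracking on the extractor parameters (e.g. setting `requires_grad = False` in PyTorch) so that the optimizer updates only the classifier's weights.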
Appears in Collections: | 資料科學學位學程 |
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-1208202218040400.pdf (access limited to NTU IP range) | 1.69 MB | Adobe PDF | View/Open
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.