NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85069
Full metadata record (listed as DC field: value [language])
dc.contributor.advisor: 陳祝嵩 (Chu-Song Chen)
dc.contributor.author: Chang-Sung Sung [en]
dc.contributor.author: 宋昶松 [zh_TW]
dc.date.accessioned: 2023-03-19T22:41:39Z
dc.date.copyright: 2022-08-30
dc.date.issued: 2022
dc.date.submitted: 2022-08-15
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85069
dc.description.abstract: Recently, the rise and abuse of deepfakes has caused problems that harm society, such as maliciously forged videos that damage reputations or spread false information. Although current forgery-detection methods achieve reasonable results against known forgery methods, and some studies find that prior knowledge or pretraining can also work well against unseen forgery methods, these approaches may be limited by common video compression, or by the high annotation cost of the data needed for pretraining. In this thesis we therefore propose AVM-FFD, a model that retains good detection ability even for unknown forgery methods. AVM-FFD focuses on judging the consistency between video and audio, using the relationship between their features as a cue to decide whether a clip has been forged. The architecture contains two spatio-temporal feature-extraction models, one for video and one for audio, which are first pretrained on an audio-visual matching task so that they acquire an understanding of the correspondence between the two modalities. A temporal recognition model is then attached, which classifies using the extracted features. To avoid overfitting to particular forgery artifacts, we freeze the parameters of the feature extractors and train only the final classifier on forged data. Finally, experiments on unseen forgery types and unseen datasets confirm that our method is effective and achieves good detection performance. [zh_TW]
dc.description.abstract: Recently, due to the growth and abuse of deepfakes, problems that threaten society have emerged, such as maliciously made fake videos that damage others' reputations or spread false information. Although recent forgery-detection methods achieve reasonable results on seen forgeries, and with prior knowledge or pretraining can reach a certain level of accuracy on unseen forgeries, these methods may be limited by audio or video compression, or by the annotations required for pretraining. In this thesis, we propose AVM-FFD, a framework for detecting forgeries that maintains good detection capability on unseen forgery methods. The main idea of AVM-FFD is to decide whether a forgery has occurred by judging the consistency between the audio and the face, using the relationship between their features as evidence. First, two spatio-temporal feature-extraction networks are pretrained on an audio-visual matching (AVM) task, in order to build a rich representation of the relationship between audio and visual information. A temporal classifier network then determines whether the video has been manipulated, using the representations extracted by the feature-extraction networks. To avoid overfitting to manipulation-specific artifacts, we freeze the feature-extraction networks and train only the final classifier on forged data. Experiments on unseen forgery categories and unseen datasets show that our approach is effective and achieves state-of-the-art performance. [en]
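To make the two-phase procedure described in the abstract concrete, here is a minimal sketch of one plausible realization in PyTorch: an InfoNCE-style contrastive loss for the phase-1 audio-visual matching pretraining, and a small temporal classifier trained on frozen features in phase 2. Everything here is an assumption for illustration: the loss form, the GRU classifier, and all names (avm_pretrain_loss, ForgeryClassifier, phase2_step). The record summarizes AVM-FFD only at a high level and does not specify these details.

```python
# Sketch only: the record does not specify the loss, architecture, or names.
import torch
import torch.nn as nn
import torch.nn.functional as F

def avm_pretrain_loss(v_feat, a_feat, temperature=0.07):
    """Phase 1 (assumed form): contrastive audio-visual matching.

    v_feat, a_feat: (B, D) clip embeddings from the two spatio-temporal
    extractors; row i of each tensor comes from the same real video.
    """
    v = F.normalize(v_feat, dim=-1)
    a = F.normalize(a_feat, dim=-1)
    logits = v @ a.t() / temperature                   # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: each video must match its own audio, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

class ForgeryClassifier(nn.Module):
    """Phase 2 (assumed form): temporal classifier over frozen A/V features."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(2 * feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, v_seq, a_seq):
        # v_seq, a_seq: (B, T, feat_dim) per-segment features.
        h, _ = self.rnn(torch.cat([v_seq, a_seq], dim=-1))
        return self.head(h[:, -1]).squeeze(-1)         # real-vs-fake logit

def phase2_step(video_enc, audio_enc, clf, frames, audio, labels, opt):
    """One training step: extractors stay frozen, only the classifier learns."""
    with torch.no_grad():                              # keep extractors fixed
        v_seq, a_seq = video_enc(frames), audio_enc(audio)
    loss = F.binary_cross_entropy_with_logits(clf(v_seq, a_seq), labels.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Freezing both extractors means only the small classifier ever sees forged data, which is the regularization the abstract credits for generalization to unseen manipulation methods.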
dc.description.provenance: Made available in DSpace on 2023-03-19T22:41:39Z (GMT). No. of bitstreams: 1; U0001-1208202218040400.pdf: 1728906 bytes, checksum: f5251c4c77bd41b54cf842ae3dae4453 (MD5). Previous issue date: 2022. [en]
dc.description.tableofcontents:
  Verification Letter from the Oral Examination Committee i
  摘要 (Abstract in Chinese) iii
  Abstract v
  Contents vii
  List of Figures ix
  List of Tables xi
  Chapter 1 Introduction 1
  Chapter 2 Related Work 5
    2.1 Face Forgery Detection 5
    2.2 Audio-Visual Self-supervised Learning 6
      2.2.1 Audio-Visual Temporal Synchronisation 7
      2.2.2 Audio-Visual Correspondence 7
  Chapter 3 Method 9
    3.1 Phase 1: Self-supervised Learning by Audio-Visual Matching Task 9
      3.1.1 Audio Visual Temporal Synchronization (AVTS) 10
      3.1.2 Audio Visual Correspondence (AVC) 12
    3.2 Phase 2: Audio-Visual Matching for Face Forgery Detection 13
  Chapter 4 Result and Discussion 15
    4.1 Dataset 15
    4.2 Implementation Details 17
    4.3 Performance Comparison 18
      4.3.1 Cross-manipulation generalization 19
      4.3.2 Cross-dataset generalization 19
      4.3.3 Cross-modality forgery detection 21
      4.3.4 Robustness to common corruptions 22
    4.4 Ablation Study 24
  Chapter 5 Conclusion 29
  References 31
dc.language.iso: en
dc.subject: 自監督式學習 (self-supervised learning) [zh_TW]
dc.subject: 影像聲音輸入 (audio-visual input) [zh_TW]
dc.subject: 臉部偽造辨識 (face forgery detection) [zh_TW]
dc.subject: Face Forgery Detection [en]
dc.subject: Audio-Visual [en]
dc.subject: Self-supervised Learning [en]
dc.title: 使用影音一致性的自監督式學習增強臉部偽造辨識 [zh_TW]
dc.title: Improved Face Forgery Detection with Self-supervised Audio-Visual Consistency-based Pretraining [en]
dc.type: Thesis
dc.date.schoolyear: 110-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 陳駿丞 (Jun-Cheng Chen)
dc.contributor.oralexamcommittee: 王新民 (Hsin-Min Wang), 葉倚任 (Yi-Ren Yeh), 賴尚宏 (Shang-Hong Lai)
dc.subject.keyword: 自監督式學習, 影像聲音輸入, 臉部偽造辨識 [zh_TW]
dc.subject.keyword: Self-supervised Learning, Audio-Visual, Face Forgery Detection [en]
dc.relation.page: 39
dc.identifier.doi: 10.6342/NTU202202353
dc.rights.note: Authorized for release (access restricted to campus)
dc.date.accepted: 2022-08-16
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資料科學學位學程 (Data Science Degree Program) [zh_TW]
dc.date.embargo-lift: 2022-08-30
Appears in collections: 資料科學學位學程 (Data Science Degree Program)

Files in this item:

File: U0001-1208202218040400.pdf
Size: 1.69 MB
Format: Adobe PDF
Access: Restricted to NTU campus IP addresses (off-campus users should connect via the library's VPN service)


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
