Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94349
Full metadata record
dc.contributor.advisor: 張智星 [zh_TW]
dc.contributor.advisor: Jyh-Shing Jang [en]
dc.contributor.author: 鄭世朋 [zh_TW]
dc.contributor.author: Shih-Peng Cheng [en]
dc.date.accessioned: 2024-08-15T16:58:42Z
dc.date.available: 2024-08-16
dc.date.copyright: 2024-08-15
dc.date.issued: 2024
dc.date.submitted: 2024-07-23
dc.identifier.citation: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Adam Roberts, Colin Raffel, Katherine Lee, Michael Matena, Noam Shazeer, Peter J Liu, Sharan Narang, Wei Li, and Yanqi Zhou. Exploring the limits of transfer learning with a unified text-to-text transformer. 2019.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International conference on machine learning, pages 4055–4064. PMLR, 2018.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, Sebastian Stüker, and Alexander Waibel. Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377, 2019.
Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018.
Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2299–2309, 2023.
Venkatesh S Kadandale, Juan F Montesinos, and Gloria Haro. Vocalist: An audio-visual synchronisation model for lips and voices. arXiv preprint arXiv:2204.02090, 2022.
Miao Liu, Jing Wang, Xinyuan Qian, and Haizhou Li. Audio-visual temporal forgery detection using embedding-level fusion and multi-dimensional contrastive loss. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3889–3898, 2019.
Haisheng Su, Weihao Gan, Wei Wu, Yu Qiao, and Junjie Yan. Bsn++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 2602–2610, 2021.
Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, pages 3–11. Springer, 2018.
Taiyi Su, Hanli Wang, and Lei Wang. Multi-level content-aware boundary detection for temporal action proposal generation. IEEE Transactions on Image Processing, 2023.
Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization. In 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–10. IEEE, 2022.
Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-nms: Improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pages 5561–5569, 2017.
Zhixi Cai, Shreya Ghosh, Abhinav Dhall, Tom Gedeon, Kalin Stefanov, and Munawar Hayat. Glitch in the matrix: A large scale benchmark for content driven audio-visual forgery detection and localization. Computer Vision and Image Understanding, 236:103818, 2023.
Marcelo Gennari do Nascimento, Roger Fawcett, and Victor Adrian Prisacariu. Dsconv: Efficient convolution operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2019.
Rui Zhang, Hongxia Wang, Mingshan Du, Hanqing Liu, Yang Zhou, and Qiang Zeng. Ummaformer: A universal multimodal-adaptive transformer framework for temporal forgery localization. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8749–8759, 2023.
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, and Kalin Stefanov. Av-deepfake1m: A large-scale llm-driven audio-visual deepfake dataset. arXiv preprint arXiv:2311.15308, 2023.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 251–263. Springer, 2017.
Komal Chugh, Parul Gupta, Abhinav Dhall, and Ramanathan Subramanian. Not made for each other: Audio-visual dissonance-based deepfake detection and localization. In Proceedings of the 28th ACM international conference on multimedia, pages 439–447, 2020.
Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in neural information processing systems, 31, 2018.
Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR, 2021.
Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos De Oliveira, Arnaldo Candido Junior, Anderson da Silva Soares, Sandra Maria Aluisio, and Moacir Antonelli Ponti. Sc-glowtts: An efficient zero-shot multi-speaker text-to-speech model. arXiv preprint arXiv:2104.05557, 2021.
Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T Tan, and Haizhou Li. Seeing what you said: Talking face generation guided by a lip reading expert. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14653–14662, 2023.
KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM international conference on multimedia, pages 484–492, 2020.
Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.
Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir A Ponti. Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In International Conference on Machine Learning, pages 2709–2720. PMLR, 2022.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94349
dc.description.abstract: 對於音視頻時序深度偽造偵測任務,以往的方法未能取得令人滿意的結果。特別是對於新發布的內容導向的部分深度偽造數據集:AV-Deepfake1M,所提出的方法需要同時利用並整合影像和聲音的資訊,並準確定位只佔整個視頻一小部分的深度偽造片段。在這項工作中,我們研究了哪些元件對解決這一任務是有效的。通過從AV-Deepfake1M中抽取子集以便在資源有限的情況下進行測試,我們對損失函數、邊界檢測模組和多模態融合方法進行了研究。提出了一個不需要預訓練模型且優於當前最先進的影音時序偽造偵測方法的架構。 [zh_TW]
dc.description.abstract: For the audio-visual temporal deepfake localization task, previous methods have not yielded satisfactory results, especially on the newly released content-driven partial deepfake dataset AV-Deepfake1M. Methods for this task must simultaneously utilize and integrate information from both video and audio, and accurately localize deepfake segments that constitute only a small proportion of the entire video. In this work, we investigate which components are effective for addressing this task. Testing on a subset sampled from AV-Deepfake1M, which allows experimentation under limited computational resources, we conduct studies on the loss function, the boundary detection module, and cross-modality fusion methods. Without the need for pre-trained encoders, we propose an architecture that outperforms the current state-of-the-art multi-modal temporal forgery localization methods. [en]
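The abstract refers to fusing frame-level video and audio information before localization. As a minimal sketch only, assuming PyTorch, 256-dimensional features aligned to a common frame rate, and torch.nn.MultiheadAttention for the attention blocks (the class name CrossModalFusion and all hyper-parameters below are illustrative assumptions, not the architecture described in the thesis), a bidirectional cross-attention fusion step could look like:

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        # Hypothetical bidirectional cross-attention fusion of per-frame video and
        # audio features; names, dimensions, and layer choices are assumptions.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries attend to audio
            self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries attend to video
            self.norm_v = nn.LayerNorm(dim)
            self.norm_a = nn.LayerNorm(dim)

        def forward(self, video, audio):
            # video, audio: (batch, frames, dim), assumed aligned to a common frame rate
            v_att, _ = self.v2a(video, audio, audio)
            a_att, _ = self.a2v(audio, video, video)
            v = self.norm_v(video + v_att)  # residual connection per modality
            a = self.norm_a(audio + a_att)
            return torch.cat([v, a], dim=-1)  # (batch, frames, 2*dim) fused sequence

    # Example: fuse 512 frames of 256-d video and audio features.
    fusion = CrossModalFusion()
    fused = fusion(torch.randn(2, 512, 256), torch.randn(2, 512, 256))

A frame classifier or boundary detection head would then consume the fused sequence; the thesis's actual MC-TDL design is described in the full text.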
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-15T16:58:41Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2024-08-15T16:58:42Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Abstract (Chinese) i
Abstract (English) iii
Table of Contents v
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
1.1 Research Overview and Motivation 1
1.2 Research Contributions 2
1.3 Thesis Organization 2
Chapter 2 Literature Review 3
2.1 Deep Learning Models for Images 3
2.1.1 Convolutional Neural Networks 3
2.1.1.1 Convolution Layer 4
2.1.1.2 Causal Convolution 4
2.1.1.3 Dilated Convolution 5
2.1.2 Temporal Convolutional Networks 5
2.1.3 Residual Networks 6
2.1.4 Squeeze-and-Excitation Networks 6
2.2 Transformer 7
2.2.1 Self-attention 7
2.2.2 Cross-attention 7
2.3 Boundary Detection Networks 8
2.3.1 Boundary Matching Network (BMN) 8
2.3.2 Boundary Sensitive Network++ (BSN++) 9
2.3.3 Multi-Level Content-Aware Boundary Detection (MCBD) 10
2.4 Audio-Visual Temporal Deepfake Detection Models 12
2.4.1 BA-TFD 12
2.4.1.1 Feature Extraction 12
2.4.1.2 Boundary Detection 12
2.4.1.3 Modality Fusion 12
2.4.2 BA-TFD++ 13
2.4.3 Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss 14
2.4.4 UMMAFormer 15
2.5 Loss Functions 17
2.5.1 Binary Logistic Loss 17
2.5.2 Focal Loss 17
2.5.3 Contrastive Loss 18
Chapter 3 Methodology 21
3.1 Task Definition 21
3.2 Model Architecture 22
3.2.1 Feature Extraction 22
3.2.1.1 Video Encoder 22
3.2.1.2 Audio Encoder 23
3.2.1.3 Cross-Attention and Self-Attention Fusion 23
3.2.1.4 Overall Feature Extraction Structure 23
3.2.2 Boundary Detection 24
3.2.2.1 Segment-Level Boundary Detection Module 24
3.2.2.2 Frame-Level Boundary Detection Module 25
3.2.2.3 Overall Boundary Detection Structure 26
3.3 Overall Model Structure 26
3.3.1 Inference 27
3.4 Loss Functions 29
3.4.1 Frame Classifier 29
3.4.2 Frame Contrastive Loss 30
3.4.3 Boundary Detection Module Loss 30
3.4.4 Overall Loss 30
Chapter 4 Experimental Setup 31
4.1 Dataset 31
4.1.1 Deepfake Video Generation Procedure 32
4.1.1.1 Content Manipulation 32
4.1.1.2 Audio Manipulation 32
4.1.1.3 Visual Manipulation 32
4.2 Implementation Details 34
4.3 Evaluation Metrics 35
Chapter 5 Experimental Results 37
5.1 Temporal Deepfake Localization 37
5.2 Ablation Study 39
5.2.1 Loss Function 39
5.2.2 Attention 40
5.2.3 Frame-level Boundary Module 41
5.2.4 Single Modality Model 42
5.2.5 Different Levels of Structure 43
Chapter 6 Conclusion and Future Work 45
6.1 Conclusion 45
6.2 Future Work 45
References 47
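Sections 2.5.2, 2.5.3, and 3.4.2 of the outline above refer to the focal loss and the contrastive loss. For orientation only, the standard formulations from the works cited in this record (Lin et al., 2017; Hadsell et al., 2006) are $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ and $L(D, Y) = (1 - Y)\,\tfrac{1}{2} D^2 + Y\,\tfrac{1}{2} \max(0, m - D)^2$, where $p_t$ is the predicted probability of the ground-truth class, $D$ is the distance between a pair of embeddings, $Y = 1$ marks a dissimilar pair, and $m$ is the margin; how the thesis adapts these losses (for example the frame contrastive loss of Section 3.4.2) is detailed in the full text, not in this record.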
dc.language.iso: zh_TW
dc.subject: 深度偽造檢測 [zh_TW]
dc.subject: 影音時序偽造偵測 [zh_TW]
dc.subject: 深度學習 [zh_TW]
dc.subject: 注意力機制 [zh_TW]
dc.subject: 邊界偵測 [zh_TW]
dc.subject: boundary detection [en]
dc.subject: Attention [en]
dc.subject: Deep learning [en]
dc.subject: Deepfake detection [en]
dc.subject: Audio-Visual temporal deepfake localization [en]
dc.title: 多層次內容感知影音時序深偽偵測 [zh_TW]
dc.title: Multi-Level Content-Aware Audio-Visual Temporal Deepfake Localization (MC-TDL) [en]
dc.type: Thesis
dc.date.schoolyear: 112-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 陳駿丞;林仁俊 [zh_TW]
dc.contributor.oralexamcommittee: Jun-Cheng Chen;Jen-Chun Lin [en]
dc.subject.keyword: 影音時序偽造偵測, 深度偽造檢測, 邊界偵測, 注意力機制, 深度學習 [zh_TW]
dc.subject.keyword: Audio-Visual temporal deepfake localization, Deepfake detection, boundary detection, Attention, Deep learning [en]
dc.relation.page: 51
dc.identifier.doi: 10.6342/NTU202402060
dc.rights.note: 同意授權(限校園內公開) (Authorized; access limited to campus network)
dc.date.accepted: 2024-07-23
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊網路與多媒體研究所
Appears in Collections: 資訊網路與多媒體研究所

Files in This Item:
File: ntu-112-2.pdf | Size: 9.07 MB | Format: Adobe PDF (access limited to the NTU IP range)

