Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93284
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 丁建均 | zh_TW
dc.contributor.advisor | Jian-Jiun Ding | en
dc.contributor.author | 顏子鈞 | zh_TW
dc.contributor.author | Gan Chee Kim | en
dc.date.accessioned | 2024-07-23T16:40:39Z | -
dc.date.available | 2024-07-24 | -
dc.date.copyright | 2024-07-23 | -
dc.date.issued | 2024 | -
dc.date.submitted | 2024-07-17 | -
dc.identifier.citation | [1] A. F. Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.
[2] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. International journal of computer vision, 92:1–31, 2011.
[3] Y. Blau and T. Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6228–6237, 2018.
[4] M. Choi, H. Kim, B. Han, N. Xu, and K. M. Lee. Channel attention is all you need for video frame interpolation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10663–10671, 2020.
[5] X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.
[6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[7] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5515–5524, 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[9] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[11] P. Hu, S. Niklaus, S. Sclaroff, and K. Saenko. Many-to-many splatting for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3553–3562, 2022.
[12] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou. Real-time intermediate flow estimation for video frame interpolation. In European Conference on Computer Vision, pages 624–642. Springer, 2022.
[13] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9000–9008, 2018.
[14] X. Jin, L. Wu, G. Shen, Y. Chen, J. Chen, J. Koo, and C.-h. Hahm. Enhanced bi-directional motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5049–5057, 2023.
[15] L. Kong, B. Jiang, D. Luo, W. Chu, X. Huang, Y. Tai, C. Wang, and J. Yang. Ifrnet: Intermediate feature refine network for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1969–1978, 2022.
[16] H. Lee, T. Kim, T.-y. Chung, D. Pak, Y. Ban, and S. Lee. Adacof: Adaptive collaboration of flows for video frame interpolation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5316–5325, 2020.
[17] C. Liu, G. Zhang, R. Zhao, and L. Wang. Sparse global matching for video frame interpolation with large motion. arXiv preprint arXiv:2404.06913, 2024.
[18] Y.-L. Liu, Y.-T. Liao, Y.-Y. Lin, and Y.-Y. Chuang. Deep video frame interpolation using cyclic frame generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8794–8802, 2019.
[19] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE international conference on computer vision, pages 4463–4471, 2017.
[20] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[21] L. Lu, R. Wu, H. Lin, J. Lu, and J. Jia. Video frame interpolation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3532–3542, 2022.
[22] S. Meister, J. Hur, and S. Roth. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[23] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung. Phase-based frame interpolation for video. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1410–1418, 2015. doi: 10.1109/CVPR.2015.7298747.
[24] S. Meyer, A. Djelouah, B. McWilliams, A. Sorkine-Hornung, M. Gross, and C. Schroers. Phasenet for video frame interpolation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 498–507, 2018. doi:10.1109/CVPR.2018.00059.
[25] C. Montgomery. Xiph.org video test media (derf’s collection), the xiph open source community. Online, 1994. URL https://media.xiph.org/video/derf.
[26] S. Niklaus and F. Liu. Context-aware synthesis for video frame interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1710, 2018.
[27] S. Niklaus and F. Liu. Softmax splatting for video frame interpolation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5437–5446, 2020.
[28] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 670–679, 2017.
[29] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE international conference on computer vision, pages 261–270, 2017.
[30] J. Park, K. Ko, C. Lee, and C.-S. Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 109–125. Springer, 2020.
[31] J. Park, C. Lee, and C.-S. Kim. Asymmetric bilateral motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14539–14548, 2021.
[32] J. Park, J. Kim, and C.-S. Kim. Biformer: Learning bilateral motion estimation via bilateral transformer for 4k video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1568–1577, 2023.
[33] F. Reda, J. Kontkanen, E. Tabellion, D. Sun, C. Pantofaru, and B. Curless. Film: Frame interpolation for large motion. In European Conference on Computer Vision, pages 250–266. Springer, 2022.
[34] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
[35] Z. Shi, X. Xu, X. Liu, J. Chen, and M.-H. Yang. Video frame interpolation transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17482–17491, 2022.
[36] H. Sim, J. Oh, and M. Kim. Xvfi: extreme video frame interpolation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14489–14498, 2021.
[37] E. Simoncelli and W. Freeman. The steerable pyramid: a flexible architecture for multi-scale derivative computation. In Proceedings, International Conference on Image Processing, volume 3, pages 444–447, 1995. doi: 10.1109/ICIP.1995.537667.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[39] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8934–8943, 2018.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[42] C.-Y. Wu, N. Singhal, and P. Krahenbuhl. Video compression through image interpolation. In Proceedings of the European conference on computer vision (ECCV), pages 416–431, 2018.
[43] H. Wu, X. Zhang, W. Xie, Y. Zhang, and Y. Wang. Boost video frame interpolation via motion adaptation. arXiv preprint arXiv:2306.13933, 2023.
[44] X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu. Zooming slowmo: Fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3370–3379, 2020.
[45] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick. Early convolutions help transformers see better. Advances in neural information processing systems, 34:30392–30400, 2021.
[46] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127:1106–1125, 2019.
[47] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[48] R. Zabih and J. I. Woodfill. Non-parametric local transforms for computing visual correspondence. In European Conference on Computer Vision, 1994. URL https://api.semanticscholar.org/CorpusID:703552.
[49] G. Zhang, Y. Zhu, H. Wang, Y. Chen, G. Wu, and L. Wang. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5682–5692, 2023.
[50] C. Zhou, J. Liu, J. Tang, and G. Wu. Video frame interpolation with densely queried bilateral correlation. arXiv preprint arXiv:2304.13596, 2023.
[51] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 286–301. Springer, 2016.
-
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93284 | -
dc.description.abstract | 影片幀插值(VFI)是一項基於影片前後上下文來生成中間幀的任務,以提升影片的品質和幀率。隨著高解析度影片的普及,VFI技術需要更深入的研究來因應這一趨勢。與此同時,維持VFI在處理低解析度影片時的性能也同樣重要,以確保其在廣泛的影片格式中的通用性。

雖然近期的VFI研究有不俗的成果,但多數方法都過度針對特定數據集進行優化。例如,某些方法在低解析度或運動規模較小的數據集(如Vimeo90K,解析度:448 × 256)表現非常出色,但對於高解析度或運動規模較大的數據集(如Xiph的2K/4K解析度及SNU-FILM的困難/極端類別)則表現欠佳。相反,一些專為4K解析度影片設計的插值方法在低解析度的情況下可能缺乏細節。這種權衡存在的原因是神經網路需要更大的搜索空間來找到兩幀之間較大的運動偏移,而較大的搜索空間也可能導致更高的錯誤率,進而限縮預留給處理小運動量的神經網路參數。

為此,我們提出了一種新穎的架構,先針對較大的運動偏移自適應地進行全局運動估測,再進行局部運動估測來優化較小的運動細節。自注意力網路(Transformer)的注意力(Attention)機制在識別影像補丁(image patch)之間的對應關係方面非常強大;受其啟發,我們的方法在運動估測框架中巧妙地運用注意力矩陣,進一步挖掘出正確的雙向光流(bi-directional optical flow)。

實驗結果顯示,我們提出的方法在高解析度與運動規模較大的數據集能達到頂尖的水準,同時在低解析度數據集亦可維持不錯的結果。這證明了我們的方法能夠處理多元解析度的影片,即使在具有挑戰性的情況下也能有效保留細節。
zh_TW
dc.description.abstract | Video Frame Interpolation (VFI) aims to synthesize intermediate frames between consecutive frames, enhancing video quality and frame rate. With the widespread adoption of high-resolution video, it is crucial for VFI technology to undergo further research and development to accommodate this trend. At the same time, maintaining performance on low-resolution video remains equally important to ensure versatility across a wide range of video formats.

While recent VFI methods achieve impressive results, many are overly optimized for particular datasets. Specifically, some perform very well on low-resolution or small-motion datasets (e.g., Vimeo90K, resolution 448 × 256) but struggle with high-resolution or large-motion datasets (e.g., Xiph at 2K/4K resolution and the hard/extreme classes of SNU-FILM). Conversely, methods designed for 4K interpolation may lack detail in low-resolution scenarios. This trade-off exists because the network requires a larger search window to capture large motion, which leads to a higher error rate and leaves less neural capacity for estimating small motion.

To address this issue, we propose a novel architecture that adaptively performs global motion estimation for large motions, followed by local motion estimation to refine smaller, detailed motions. Inspired by the Transformer's attention mechanism, which excels at identifying correspondences between image patches, we remodel the attention matrix to uncover bi-directional optical flow within our motion estimation framework.

Experimental results show that our method achieves state-of-the-art performance on high-resolution and large-motion datasets while still delivering satisfactory results on low-resolution datasets. This versatility indicates that our method can handle both high-definition content and low-resolution videos, effectively preserving fine details even in challenging scenarios.
en
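
The abstract describes recovering bi-directional optical flow from a Transformer attention matrix, with global motion estimated first and local refinement afterwards. As a rough, hedged illustration only (not the thesis's actual ATM-VFI implementation, which is not reproduced in this record), the PyTorch sketch below shows one common way an attention matrix over image patches can be turned into bi-directional flow: a softmax-normalized patch correlation followed by a soft-argmax over patch-centre coordinates. The function name, patch size, and temperature are illustrative assumptions.

```python
# Illustrative sketch only: converting a patch-to-patch attention matrix into
# bi-directional optical flow via a soft-argmax. Names, shapes, patch size and
# temperature are assumptions, not the thesis's released code.
import torch
import torch.nn.functional as F

def attention_to_flow(feat0, feat1, patch=8, temperature=0.1):
    """feat0, feat1: (B, C, H, W) feature maps of two consecutive frames.
    Returns (flow_0to1, flow_1to0) at patch resolution, measured in pixels."""
    B, C, H, W = feat0.shape
    h, w = H // patch, W // patch
    # Split each feature map into non-overlapping patch tokens: (B, h*w, C*patch*patch)
    tok0 = F.unfold(feat0, patch, stride=patch).transpose(1, 2)
    tok1 = F.unfold(feat1, patch, stride=patch).transpose(1, 2)
    tok0, tok1 = F.normalize(tok0, dim=-1), F.normalize(tok1, dim=-1)
    # Patch-to-patch similarity (the "attention matrix"), softmax-normalized per direction
    sim = tok0 @ tok1.transpose(1, 2) / temperature       # (B, h*w, h*w)
    attn_01 = sim.softmax(dim=-1)                          # frame 0 -> frame 1
    attn_10 = sim.transpose(1, 2).softmax(dim=-1)          # frame 1 -> frame 0
    # Pixel coordinates of each patch centre, flattened to (h*w, 2) as (x, y)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float() * patch + patch / 2
    # Soft-argmax: expected matched position minus own position = displacement (flow)
    flow_01 = (attn_01 @ grid - grid).reshape(B, h, w, 2).permute(0, 3, 1, 2)
    flow_10 = (attn_10 @ grid - grid).reshape(B, h, w, 2).permute(0, 3, 1, 2)
    return flow_01, flow_10

# Example usage with random features standing in for a real encoder's output
if __name__ == "__main__":
    f0, f1 = torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128)
    flow_01, flow_10 = attention_to_flow(f0, f1)
    print(flow_01.shape)  # torch.Size([1, 2, 16, 16])
```

In the thesis itself, a coarse global pass of this kind is said to be complemented by windowed local estimation and subsequent refinement and up-sampling (see Chapters 4.1 to 4.3 in the table of contents below); those details are beyond what the abstract states and are not reconstructed here.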
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-23T16:40:39Z
No. of bitstreams: 0
en
dc.description.provenance | Made available in DSpace on 2024-07-23T16:40:39Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Acknowledgement (致謝) - i
Mandarin Abstract(中文摘要)- ii
Abstract - iii
List of Figures - vi
List of Tables - ix
1 Introduction - 1
2 Related Work - 5
2.1 Flow-based VFI method - 5
2.1.1 Optical Flow and Warping Methods - 6
2.1.2 Motion Estimation - 10
2.2 Kernel-based VFI method - 12
2.3 Phase-based VFI method - 16
2.3.1 Prior Knowledge - 17
2.3.2 Methods - 18
2.4 Loss Functions for VFI - 20
2.4.1 Frame Reconstruction Loss - 21
2.4.2 Optical Flow Supervision loss - 24
2.5 Miscellaneous - 25
2.5.1 Techniques for Enhancing VFI Performance - 25
2.5.2 Activation Functions - 26
2.5.3 Dilated Convolution - 29
3 Review of Transformer - 31
3.1 The Attention Mechanism - 32
3.2 Swin Transformer - 34
3.3 Conditional Positional Encodings - 36
4 Proposed Method: ATM-VFI - 37
4.1 Multi-scale Feature Extraction and Fusion - 38
4.2 ATMFormer - 40
4.3 Joint Enhancement and Up-sampling for Feature and Motion - 44
4.4 Miscellaneous - 44
5 Experiments - 46
5.1 Network Configurations - 46
5.2 Loss Functions - 47
5.3 Datasets - 48
5.4 Training Procedure - 49
6 Experimental Result - 50
6.1 Quantitative Comparison - 50
6.2 Qualitative Comparison - 51
6.3 Ablation Study - 53
6.3.1 Effectiveness of ATMFormer - 54
6.3.2 Impact of Training Data - 55
6.3.3 Cross-Scale Feature Fusion - 56
6.3.4 Impact of Feature Enhancement - 56
6.3.5 Influence of Global Motion Estimation and Window Size - 56
6.3.6 Complexity Comparison Between Local and Global Motion Estimation - 57
7 Conclusion - 62
References - 64
-
dc.language.iso | en | -
dc.subject | 影片幀插值 | zh_TW
dc.subject | 自注意力網路 | zh_TW
dc.subject | 光流 | zh_TW
dc.subject | 可適性運動估測 | zh_TW
dc.subject | 注意力矩陣 | zh_TW
dc.subject | 深度學習 | zh_TW
dc.subject | Optical flow | en
dc.subject | Video frame interpolation | en
dc.subject | Deep learning | en
dc.subject | Attention matrix | en
dc.subject | Adaptive motion estimation | en
dc.subject | Transformer | en
dc.title | 基於注意力機制引導的運動感知於通用影片幀插值 | zh_TW
dc.title | Versatile Video Frame Interpolation via Attention-to-Motion of Transformer | en
dc.type | Thesis | -
dc.date.schoolyear | 112-2 | -
dc.description.degree | 碩士 | -
dc.contributor.oralexamcommittee | 孫紹華;簡鳳村;張榮吉 | zh_TW
dc.contributor.oralexamcommittee | Shao-Hua Sun;Feng-Tsun Chien;Rong-Chi Chang | en
dc.subject.keyword | 影片幀插值,自注意力網路,光流,可適性運動估測,注意力矩陣,深度學習 | zh_TW
dc.subject.keyword | Video frame interpolation,Transformer,Optical flow,Adaptive motion estimation,Attention matrix,Deep learning | en
dc.relation.page | 70 | -
dc.identifier.doi | 10.6342/NTU202401839 | -
dc.rights.note | 同意授權(限校園內公開) | -
dc.date.accepted | 2024-07-18 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 電信工程學研究所 | -
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
File | Size | Format
ntu-112-2.pdf | 23.86 MB | Adobe PDF
Access restricted to NTU campus IP addresses (off-campus users may connect via the NTU VPN service).