Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93284
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 丁建均 | zh_TW
dc.contributor.advisor | Jian-Jiun Ding | en
dc.contributor.author | 顏子鈞 | zh_TW
dc.contributor.author | Gan Chee Kim | en
dc.date.accessioned | 2024-07-23T16:40:39Z | -
dc.date.available | 2024-07-24 | -
dc.date.copyright | 2024-07-23 | -
dc.date.issued | 2024 | -
dc.date.submitted | 2024-07-17 | -
dc.identifier.citation | [1] A. F. Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.
[2] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. International journal of computer vision, 92:1–31, 2011.
[3] Y. Blau and T. Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6228–6237, 2018.
[4] M. Choi, H. Kim, B. Han, N. Xu, and K. M. Lee. Channel attention is all you need for video frame interpolation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10663–10671, 2020.
[5] X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.
[6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[7] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5515–5524, 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[9] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[11] P. Hu, S. Niklaus, S. Sclaroff, and K. Saenko. Many-to-many splatting for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3553–3562, 2022.
[12] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou. Real-time intermediate flow estimation for video frame interpolation. In European Conference on Computer Vision, pages 624–642. Springer, 2022.
[13] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9000–9008, 2018.
[14] X. Jin, L. Wu, G. Shen, Y. Chen, J. Chen, J. Koo, and C.-h. Hahm. Enhanced bi-directional motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5049–5057, 2023.
[15] L. Kong, B. Jiang, D. Luo, W. Chu, X. Huang, Y. Tai, C. Wang, and J. Yang. Ifrnet: Intermediate feature refine network for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1969–1978, 2022.
[16] H. Lee, T. Kim, T.-y. Chung, D. Pak, Y. Ban, and S. Lee. Adacof: Adaptive collaboration of flows for video frame interpolation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5316–5325, 2020.
[17] C. Liu, G. Zhang, R. Zhao, and L. Wang. Sparse global matching for video frame interpolation with large motion. arXiv preprint arXiv:2404.06913, 2024.
[18] Y.-L. Liu, Y.-T. Liao, Y.-Y. Lin, and Y.-Y. Chuang. Deep video frame interpolation using cyclic frame generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8794–8802, 2019.
[19] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE international conference on computer vision, pages 4463–4471, 2017.
[20] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[21] L. Lu, R. Wu, H. Lin, J. Lu, and J. Jia. Video frame interpolation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3532–3542, 2022.
[22] S. Meister, J. Hur, and S. Roth. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[23] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung. Phase-based frame interpolation for video. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1410–1418, 2015. doi: 10.1109/CVPR.2015.7298747.
[24] S. Meyer, A. Djelouah, B. McWilliams, A. Sorkine-Hornung, M. Gross, and C. Schroers. Phasenet for video frame interpolation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 498–507, 2018. doi:10.1109/CVPR.2018.00059.
[25] C. Montgomery. Xiph.org video test media (derf’s collection), the xiph open source community. Online, 1994. URL https://media.xiph.org/video/derf.
[26] S. Niklaus and F. Liu. Context-aware synthesis for video frame interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1710, 2018.
[27] S. Niklaus and F. Liu. Softmax splatting for video frame interpolation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5437–5446, 2020.
[28] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 670–679, 2017.
[29] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE international conference on computer vision, pages 261–270, 2017.
[30] J. Park, K. Ko, C. Lee, and C.-S. Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 109–125. Springer, 2020.
[31] J. Park, C. Lee, and C.-S. Kim. Asymmetric bilateral motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14539–14548, 2021.
[32] J. Park, J. Kim, and C.-S. Kim. Biformer: Learning bilateral motion estimation via bilateral transformer for 4k video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1568–1577, 2023.
[33] F. Reda, J. Kontkanen, E. Tabellion, D. Sun, C. Pantofaru, and B. Curless. Film: Frame interpolation for large motion. In European Conference on Computer Vision, pages 250–266. Springer, 2022.
[34] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
[35] Z. Shi, X. Xu, X. Liu, J. Chen, and M.-H. Yang. Video frame interpolation transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17482–17491, 2022.
[36] H. Sim, J. Oh, and M. Kim. Xvfi: extreme video frame interpolation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14489–14498, 2021.
[37] E. Simoncelli and W. Freeman. The steerable pyramid: a flexible architecture for multi-scale derivative computation. In Proceedings, International Conference on Image Processing, volume 3, pages 444–447, 1995. doi: 10.1109/ICIP.1995.537667.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[39] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8934–8943, 2018.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[42] C.-Y. Wu, N. Singhal, and P. Krahenbuhl. Video compression through image interpolation. In Proceedings of the European conference on computer vision (ECCV), pages 416–431, 2018.
[43] H. Wu, X. Zhang, W. Xie, Y. Zhang, and Y. Wang. Boost video frame interpolation via motion adaptation. arXiv preprint arXiv:2306.13933, 2023.
[44] X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu. Zooming slowmo: Fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3370–3379, 2020.
[45] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick. Early convolutions help transformers see better. Advances in neural information processing systems, 34:30392–30400, 2021.
[46] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127:1106–1125, 2019.
[47] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[48] R. Zabih and J. I. Woodfill. Non-parametric local transforms for computing visual correspondence. In European Conference on Computer Vision, 1994. URL https://api.semanticscholar.org/CorpusID:703552.
[49] G. Zhang, Y. Zhu, H. Wang, Y. Chen, G. Wu, and L. Wang. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5682–5692, 2023.
[50] C. Zhou, J. Liu, J. Tang, and G. Wu. Video frame interpolation with densely queried bilateral correlation. arXiv preprint arXiv:2304.13596, 2023.
[51] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 286–301. Springer, 2016.
-
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93284 | -
dc.description.abstract | 影片幀插值(VFI)是一項基於影片前後上下文來生成中間幀的任務,以提升影片的品質和幀率。隨著高解析度影片的普及,VFI技術需要更深入的研究來因應這一趨勢。與此同時,維持VFI在處理低解析度影片時的性能也同樣重要,以確保其在廣泛的影片格式中的通用性。

雖然近期的VFI研究有不俗的成果,但多數方法都過度針對特定數據集進行優化。例如,某些方法在低解析度或運動規模較小的數據集(如Vimeo90K,解析度:448 × 256)表現非常出色,但對於高解析度或運動規模較大的數據集(如Xiph的2K/4K解析度及SNU-FILM的困難/極端類別)則表現欠佳。相反,一些專為4K解析度影片設計的插值方法在低解析度的情況下可能缺乏細節。這種權衡存在的原因是神經網路需要更大的搜索空間來找到兩幀之間較大的運動偏移,而較大的搜索空間也可能導致更高的錯誤率,進而限縮預留給處理小運動量的神經網路參數。

為此,我們提出了一種新穎的架構,先針對較大的運動偏移自適應地進行全局運動估測,再進行局部運動估測來優化較小的運動細節。自注意力網路(Transformer)的注意力(Attention)機制在識別影像補丁(image patch)之間的對應關係方面非常強大;受其啟發,我們的方法在運動估測框架中巧妙地運用注意力矩陣,進一步挖掘出正確的雙向光流(bi-directional optical flow)。

實驗結果顯示,我們提出的方法在高解析度與運動規模較大的數據集能達到頂尖的水準,同時在低解析度數據集亦可維持不錯的結果。這證明了我們的方法能夠處理多元解析度的影片,即使在具有挑戰性的情況下也能有效保留細節。
zh_TW
dc.description.abstract | Video Frame Interpolation (VFI) aims to synthesize intermediate frames between consecutive frames, enhancing video quality and frame rate. With the widespread adoption of high-resolution video, it is crucial for VFI technology to undergo further research and development to accommodate this trend. At the same time, maintaining performance on low-resolution video remains equally important to ensure versatility across a wide range of video formats.

While recent VFI methods achieve impressive results, many are overly optimized for particular datasets. Specifically, some perform very well on low-resolution or small-motion datasets (e.g., Vimeo90K, resolution 448 × 256) but struggle with high-resolution or large-motion datasets (e.g., Xiph at 2K/4K resolution and the hard/extreme classes of SNU-FILM). Conversely, methods designed for 4K interpolation may lack detail in low-resolution scenarios. This trade-off exists because the network requires a larger search window to capture large motion, which leads to a higher error rate and leaves less neural capacity for estimating small motion.

To address this issue, we propose a novel architecture that adaptively performs global motion estimation for large motions, followed by local motion estimation to refine smaller, detailed motions. Inspired by the Transformer's attention mechanism, which excels at identifying correspondences between image patches, we remodel the attention matrix to uncover bi-directional optical flow within our motion estimation framework.

Experimental results show that our method achieves state-of-the-art performance on high-resolution and large-motion datasets while still delivering satisfactory results on low-resolution datasets. This versatility indicates that our method can handle both high-definition content and low-resolution videos, effectively preserving fine details even in challenging scenarios.
en
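
The abstract describes recovering bi-directional optical flow from a Transformer attention matrix, with global motion estimated first and local refinement afterwards. As a rough, hedged illustration only (not the thesis's actual ATM-VFI implementation, which is not reproduced in this record), the PyTorch sketch below shows one common way an attention matrix over image patches can be turned into bi-directional flow: a softmax-normalized patch correlation followed by a soft-argmax over patch-centre coordinates. The function name, patch size, and temperature are illustrative assumptions.

```python
# Illustrative sketch only: converting a patch-to-patch attention matrix into
# bi-directional optical flow via a soft-argmax. Names, shapes, patch size and
# temperature are assumptions, not the thesis's released code.
import torch
import torch.nn.functional as F

def attention_to_flow(feat0, feat1, patch=8, temperature=0.1):
    """feat0, feat1: (B, C, H, W) feature maps of two consecutive frames.
    Returns (flow_0to1, flow_1to0) at patch resolution, measured in pixels."""
    B, C, H, W = feat0.shape
    h, w = H // patch, W // patch
    # Split each feature map into non-overlapping patch tokens: (B, h*w, C*patch*patch)
    tok0 = F.unfold(feat0, patch, stride=patch).transpose(1, 2)
    tok1 = F.unfold(feat1, patch, stride=patch).transpose(1, 2)
    tok0, tok1 = F.normalize(tok0, dim=-1), F.normalize(tok1, dim=-1)
    # Patch-to-patch similarity (the "attention matrix"), softmax-normalized per direction
    sim = tok0 @ tok1.transpose(1, 2) / temperature       # (B, h*w, h*w)
    attn_01 = sim.softmax(dim=-1)                          # frame 0 -> frame 1
    attn_10 = sim.transpose(1, 2).softmax(dim=-1)          # frame 1 -> frame 0
    # Pixel coordinates of each patch centre, flattened to (h*w, 2) as (x, y)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float() * patch + patch / 2
    # Soft-argmax: expected matched position minus own position = displacement (flow)
    flow_01 = (attn_01 @ grid - grid).reshape(B, h, w, 2).permute(0, 3, 1, 2)
    flow_10 = (attn_10 @ grid - grid).reshape(B, h, w, 2).permute(0, 3, 1, 2)
    return flow_01, flow_10

# Example usage with random features standing in for a real encoder's output
if __name__ == "__main__":
    f0, f1 = torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128)
    flow_01, flow_10 = attention_to_flow(f0, f1)
    print(flow_01.shape)  # torch.Size([1, 2, 16, 16])
```

In the thesis itself, a coarse global pass of this kind is said to be complemented by windowed local estimation and subsequent refinement and up-sampling (see Chapters 4.1 to 4.3 in the table of contents below); those details are beyond what the abstract states and are not reconstructed here.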
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-23T16:40:39Z
No. of bitstreams: 0
en
dc.description.provenance | Made available in DSpace on 2024-07-23T16:40:39Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Acknowledgement (致謝) - i
Mandarin Abstract(中文摘要)- ii
Abstract - iii
List of Figures - vi
List of Tables - ix
1 Introduction - 1
2 Related Work - 5
2.1 Flow-based VFI method - 5
2.1.1 Optical Flow and Warping Methods - 6
2.1.2 Motion Estimation - 10
2.2 Kernel-based VFI method - 12
2.3 Phase-based VFI method - 16
2.3.1 Prior Knowledge - 17
2.3.2 Methods - 18
2.4 Loss Functions for VFI - 20
2.4.1 Frame Reconstruction Loss - 21
2.4.2 Optical Flow Supervision loss - 24
2.5 Miscellaneous - 25
2.5.1 Techniques for Enhancing VFI Performance - 25
2.5.2 Activation Functions - 26
2.5.3 Dilated Convolution - 29
3 Review of Transformer - 31
3.1 The Attention Mechanism - 32
3.2 Swin Transformer - 34
3.3 Conditional Positional Encodings - 36
4 Proposed Method: ATM-VFI - 37
4.1 Multi-scale Feature Extraction and Fusion - 38
4.2 ATMFormer - 40
4.3 Joint Enhancement and Up-sampling for Feature and Motion - 44
4.4 Miscellaneous - 44
5 Experiments - 46
5.1 Network Configurations - 46
5.2 Loss Functions - 47
5.3 Datasets - 48
5.4 Training Procedure - 49
6 Experimental Result - 50
6.1 Quantitative Comparison - 50
6.2 Qualitative Comparison - 51
6.3 Ablation Study - 53
6.3.1 Effectiveness of ATMFormer - 54
6.3.2 Impact of Training Data - 55
6.3.3 Cross-Scale Feature Fusion - 56
6.3.4 Impact of Feature Enhancement - 56
6.3.5 Influence of Global Motion Estimation and Window Size - 56
6.3.6 Complexity Comparison Between Local and Global Motion Estimation - 57
7 Conclusion - 62
References - 64
-
dc.language.iso | en | -
dc.subject | 影片幀插值 | zh_TW
dc.subject | 自注意力網路 | zh_TW
dc.subject | 光流 | zh_TW
dc.subject | 可適性運動估測 | zh_TW
dc.subject | 注意力矩陣 | zh_TW
dc.subject | 深度學習 | zh_TW
dc.subject | Optical flow | en
dc.subject | Video frame interpolation | en
dc.subject | Deep learning | en
dc.subject | Attention matrix | en
dc.subject | Adaptive motion estimation | en
dc.subject | Transformer | en
dc.title | 基於注意力機制引導的運動感知於通用影片幀插值 | zh_TW
dc.title | Versatile Video Frame Interpolation via Attention-to-Motion of Transformer | en
dc.type | Thesis | -
dc.date.schoolyear | 112-2 | -
dc.description.degree | 碩士 | -
dc.contributor.oralexamcommittee | 孫紹華;簡鳳村;張榮吉 | zh_TW
dc.contributor.oralexamcommittee | Shao-Hua Sun;Feng-Tsun Chien;Rong-Chi Chang | en
dc.subject.keyword | 影片幀插值,自注意力網路,光流,可適性運動估測,注意力矩陣,深度學習 | zh_TW
dc.subject.keyword | Video frame interpolation,Transformer,Optical flow,Adaptive motion estimation,Attention matrix,Deep learning | en
dc.relation.page | 70 | -
dc.identifier.doi | 10.6342/NTU202401839 | -
dc.rights.note | 同意授權(限校園內公開) | -
dc.date.accepted | 2024-07-18 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 電信工程學研究所 | -
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
File | Size | Format
ntu-112-2.pdf | 23.86 MB | Adobe PDF
Access restricted to NTU campus IP addresses (off-campus users may connect via the NTU VPN service).