Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93284
Title: | Versatile Video Frame Interpolation via Attention-to-Motion of Transformer |
Author: | Gan Chee Kim (顏子鈞) |
Advisor: | Jian-Jiun Ding (丁建均) |
Keywords: | Video frame interpolation, Transformer, Optical flow, Adaptive motion estimation, Attention matrix, Deep learning |
Publication Year: | 2024 |
Degree: | Master's |
Abstract: | Video Frame Interpolation (VFI) aims to synthesize intermediate frames between consecutive frames, enhancing video quality and frame rate. With the widespread adoption of high-resolution video, VFI technology must undergo further research and development to accommodate this trend. However, maintaining performance on low-resolution video remains equally important for ensuring versatility across a wide range of video formats.

While recent VFI methods achieve impressive results, many are overly optimized for particular datasets. Specifically, some perform very well on low-resolution or small-motion datasets (e.g., Vimeo90K, resolution 448 × 256) but struggle with high-resolution or large-motion datasets (e.g., Xiph at 2K/4K resolution and the hard/extreme classes of SNU-FILM). Conversely, methods designed for 4K interpolation may lack detail in low-resolution scenarios. This trade-off exists because the network requires a larger search window to find large motion offsets between two frames, and a larger search window can lead to a higher error rate while leaving less neural capacity for small-motion estimation.

To address this issue, we propose a novel architecture that adaptively performs global motion estimation for large motions, followed by local motion estimation to refine smaller, detailed motions. Inspired by the Transformer's attention mechanism, which excels at identifying correspondences between image patches, we remodel the attention matrix to uncover bi-directional optical flow within our motion estimation framework.

Experimental results show that our method achieves state-of-the-art performance on high-resolution and large-motion datasets while still delivering satisfactory results on low-resolution datasets. This versatility indicates that our method is capable of handling both high-definition content and low-resolution videos, effectively preserving fine details even in challenging scenarios. |
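The core idea of the abstract, reading a motion field off the cross-frame attention matrix, can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the function name `attention_to_flow`, the flattened patch-grid layout, and the soft-argmax (attention-weighted expected displacement) readout are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_to_flow(feat0, feat1, h, w):
    """Estimate a per-patch flow field from frame 0 to frame 1 via attention.

    feat0, feat1: (h*w, c) patch embeddings of the two frames,
                  flattened row-major from an h-by-w patch grid.
    Returns an (h*w, 2) array of expected (dy, dx) displacements.
    """
    c = feat0.shape[1]
    # Cross-frame attention: each frame-0 patch attends over frame-1 patches.
    attn = softmax(feat0 @ feat1.T / np.sqrt(c))          # (h*w, h*w)
    # Grid coordinates of every patch center, in (row, col) order.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    # Soft-argmax: attention-weighted average of matched coordinates.
    target = attn @ coords
    return target - coords  # displacement = estimated flow
```

When the two frames' patch embeddings match one-to-one, the attention matrix approaches a permutation matrix and the soft-argmax recovers the corresponding shift; running the same computation with the frames swapped yields the reverse-direction flow, giving the bi-directional estimate the abstract describes.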
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93284 |
DOI: | 10.6342/NTU202401839 |
Full-Text Permission: | Authorized (restricted to on-campus access) |
Appears in Collections: | Graduate Institute of Communication Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-2.pdf (restricted to NTU campus IP addresses; use the VPN service for off-campus access) | 23.86 MB | Adobe PDF | View/Open |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.