Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91903
Title: Exploring Attention Zooming-in and Its Applications in Computer Vision
Author: Cheng-Wei Chang (張正威)
Advisor: Hsuan-Tien Lin (林軒田)
Co-advisor: Tyng-Luh Liu (劉庭祿)
Keywords: Deep Learning, Attention Mechanism, Image Classification, One-shot Learning, Action Detection
Publication Year: 2022
Degree: Master
Abstract: In light of the excellent performance of the Transformer in Natural Language Processing (NLP), research efforts to extend this success to Computer Vision (CV) have become widespread. These efforts have produced a large number of Vision Transformer models that consistently exhibit promising results in mainstream CV applications. At the core of the Transformer-based approach is a global operation called self-attention, which plays a vital role in the underlying model in both conceptual and computational terms. Aiming to improve the effectiveness of Transformer models, we introduce the attention zooming-in (AZ) mechanism to enhance the standard self-attention procedure and present a new Transformer-based vision backbone, coined the AZ Transformer. In a principled way, the proposed AZ mechanism explores long-range contextual dependency selectively by zooming in only on those regions relevant to the queries. The idea is inspired by how the human visual system works: humans tend to capture the global context coarsely and then focus on the relevant detailed regions. More specifically, we generalize the self-attention operation with multi-granular features, first employing the coarsest features to capture wide-range attention, and then utilizing this coarsely modeled long-range dependency to adaptively determine the fine-grained features that participate in the attention computation. Boosted by this coarse-to-fine progressive strategy, the AZ Transformers achieve state-of-the-art (SOTA) performance on ImageNet-1K image classification with 224 x 224 training images. The three variants of the AZ Transformer, namely AZ-Tiny, AZ-Small, and AZ-Base, achieve 83.5%, 83.7%, and 84.2% Top-1 accuracy, respectively. Notably, AZ-Tiny surpasses existing SOTA methods by a large margin under a similar or even lower computational cost. For video-based applications, we also apply the AZ mechanism to the one-shot action detection (OSAD) task on the ActivityNet-1.3 dataset and again achieve state-of-the-art results. In summary, our empirical results show that the proposed AZ mechanism improves the efficiency of Transformer-based techniques on computer vision tasks without compromising their performance.
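The abstract describes the AZ mechanism only conceptually: coarse features first capture wide-range attention, and that coarse attention then decides which fine-grained features each query zooms into. Below is a minimal, single-head PyTorch sketch of one possible reading of this coarse-to-fine procedure; the class name CoarseToFineAttention, the average pooling used to build the coarse tokens, and the per-query top-k region selection are illustrative assumptions rather than the thesis implementation.

```python
# Minimal sketch of a coarse-to-fine ("zooming-in") self-attention step,
# written from the abstract's description only. CoarseToFineAttention, the
# average-pooled coarse tokens, and the top-k region selection are assumptions
# for illustration, not the AZ Transformer's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseToFineAttention(nn.Module):
    """Single-head sketch: coarse attention picks regions, fine attention attends within them."""

    def __init__(self, dim: int, pool: int = 4, topk: int = 4):
        super().__init__()
        self.pool = pool                    # down-sampling factor for the coarse stage
        self.topk = topk                    # number of coarse regions each query zooms into
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map; H and W are assumed divisible by `pool`.
        B, H, W, C = x.shape
        p, k = self.pool, self.topk
        q = self.q(x).reshape(B, H * W, C)                                 # fine-grained queries

        # Coarse stage: compare every query against pooled tokens to get wide-range context cheaply.
        coarse = F.avg_pool2d(x.permute(0, 3, 1, 2), p)                    # (B, C, H/p, W/p)
        Hc, Wc = coarse.shape[-2:]
        coarse = coarse.permute(0, 2, 3, 1).reshape(B, Hc * Wc, C)
        coarse_attn = (q @ self.k(coarse).transpose(-2, -1)) * self.scale  # (B, HW, HcWc)

        # Zoom in: each query keeps only its top-k most relevant coarse regions.
        idx = coarse_attn.topk(k, dim=-1).indices                          # (B, HW, k)

        # Gather the fine-grained tokens that lie inside the selected regions.
        fine = x.reshape(B, Hc, p, Wc, p, C).permute(0, 1, 3, 2, 4, 5)     # (B, Hc, Wc, p, p, C)
        fine = fine.reshape(B, Hc * Wc, p * p, C)
        gather_idx = idx[..., None, None].expand(-1, -1, -1, p * p, C)     # (B, HW, k, p*p, C)
        fine = fine.unsqueeze(1).expand(-1, H * W, -1, -1, -1).gather(2, gather_idx)
        fine = fine.reshape(B, H * W, k * p * p, C)

        # Fine stage: standard attention restricted to the zoomed-in tokens.
        attn = (q.unsqueeze(2) @ self.k(fine).transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ self.v(fine)).squeeze(2)             # (B, HW, C)
        return out.reshape(B, H, W, C)


# Toy usage on a 28x28x96 feature map (e.g., an intermediate backbone stage).
if __name__ == "__main__":
    layer = CoarseToFineAttention(dim=96, pool=4, topk=4)
    feats = torch.randn(2, 28, 28, 96)
    print(layer(feats).shape)               # torch.Size([2, 28, 28, 96])
```

The intended efficiency benefit in this sketch mirrors the abstract's argument: the coarse stage scores each query against only (H/p)·(W/p) pooled tokens, and the fine stage attends to just k·p² selected tokens per query instead of all H·W tokens.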
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91903
DOI: 10.6342/NTU202203898
Full-text license: Authorized (open access restricted to campus)
Appears in collections: Department of Computer Science and Information Engineering
Files in this item:
File | Size | Format
---|---|---
ntu-110-2.pdf (access restricted to NTU campus IPs; use the VPN service for off-campus access) | 10.8 MB | Adobe PDF
Except where otherwise noted, items in this repository are protected by copyright, with all rights reserved.