Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91903
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 林軒田 | zh_TW |
dc.contributor.advisor | Hsuan-Tien Lin | en |
dc.contributor.author | 張正威 | zh_TW |
dc.contributor.author | Cheng-Wei Chang | en |
dc.date.accessioned | 2024-02-26T16:22:34Z | - |
dc.date.available | 2024-02-27 | - |
dc.date.copyright | 2024-02-26 | - |
dc.date.issued | 2022 | - |
dc.date.submitted | 2002-01-01 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91903 | - |
dc.description.abstract | 有鑑於 Transformer 在自然語言處理 (Natural Language Processing, NLP) 領域中出色的表現,其相關技術已成為電腦視覺 (Computer Vision, CV) 研究的熱門主題。此趨勢促成許多專家學者提出不同的 Vision Transformer 模型,並且在主流電腦視覺應用中取得優異成績。其中自注意力 (self-attention) 機制為 Transformer 的核心技術,在理論概念和實際運算上皆扮演著關鍵角色。為了改善 Transformer 模型的效能,我們提出注意力聚焦 (attention zooming-in, AZ) 來提升自注意力機制,進而建構出一個新的 Transformer-based 視覺骨幹,並將其命名為 AZ Transformer。注意力聚焦機制的主要概念是藉由選擇性地聚焦在目標相關區域,以此探索遠距全域相關性。其設計靈感源自於人類的視覺系統,人類傾向於先觀察全局,才會將注意力聚焦於關鍵的細節。具體而言,我們在自注意力模塊中使用了多個不同粒度的特徵,先將粗粒度的特徵用於捕捉全局,再利用這種粗略建模的遠距關係,決定後續將參與自注意力計算的細粒度特徵。藉由這種從粗略到細緻的漸進策略,AZ Transformer 在 ImageNet-1K 影像分類上取得最先進的 (state-of-the-art, SOTA) 表現,其中 AZ Transformer 的三種變形分別可達到 83.5%、83.7% 和 84.2% 的準確度,值得注意的是其中最精簡的模型,僅需相近甚至更低的運算量下即可大幅超越其他方法。在視訊分析應用方面,我們的實作將注意力聚焦機制用於解決 ActivityNet-1.3 上的單樣本動作識別 (one-shot action detection, OSAD) 問題,並在此議題取得最先進的表現。綜上所述,實驗證明我們提出的注意力聚焦機制在主要電腦視覺應用中改進了 Transformer 相關方法的效率,且不會降低其表現。 | zh_TW |
dc.description.abstract | In light of the excellent performance of the Transformer in Natural Language Processing (NLP), a widespread wave of research has sought to extend this inspiring success to the field of Computer Vision (CV). The effort has produced a dizzying number of Vision Transformer models that consistently exhibit promising results in various CV applications. At the core of the Transformer-based approach is a global operation called self-attention, which plays a vital role in the underlying model in both the conceptual and computational aspects. Aiming to improve the effectiveness of Transformer models, we introduce the attention zooming-in (AZ) mechanism to improve the standard self-attention procedure and present a new Transformer-based vision backbone, coined the AZ Transformer. In a principled way, the proposed AZ mechanism explores long-range contextual dependency selectively by zooming in on only those regions relevant to the queries. The idea is inspired by how the human visual system works: humans tend to capture the global context coarsely and then focus on the relevant detailed regions. More specifically, we generalize the self-attention operation with multi-granular features, first employing the coarsest features to capture wide-range attention, and then utilizing this coarsely modeled long-range dependency to adaptively determine the attending fine-grained features (an illustrative sketch of this coarse-to-fine procedure is given after the metadata table below). Boosted by the coarse-to-fine progressive strategy, the AZ Transformers achieve state-of-the-art (SOTA) performance on ImageNet-1K image classification with 224 x 224 training images. The three variants of the AZ Transformer, namely AZ-Tiny, AZ-Small, and AZ-Base, achieve 83.5%, 83.7%, and 84.2% Top-1 accuracy, respectively. Notably, AZ-Tiny surpasses existing SOTA methods by a large margin at a similar or even lower computational cost. For video-based applications, we also utilize the AZ mechanism to tackle the one-shot action detection (OSAD) task on the ActivityNet-1.3 dataset, again achieving a state-of-the-art result. In summary, our empirical results support that the proposed AZ mechanism improves the efficiency of Transformer-based techniques on computer vision tasks without compromising their performance. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-02-26T16:22:34Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-02-26T16:22:34Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xi
List of Tables xv
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Our Approach 3
1.3 Thesis Organization 4
Chapter 2 Related Work 5
2.1 Attention Mechanism 5
2.2 Efficient Attention Mechanisms 6
2.3 Transformer-based Vision Backbones 7
2.3.1 Columnar Vision Transformers 8
2.3.2 Pyramid Vision Transformers 10
Chapter 3 Our Method 13
3.1 Model Architecture 13
3.2 Attention Zooming-in Self-attention 15
3.2.1 Window-based Self-attention 15
3.2.2 Attention Zooming-in Mechanism 18
3.2.3 Intra-window Query Grouping 21
3.2.4 Combinational Position Encoding 24
Chapter 4 Experimental Results 27
4.1 Image-based Application 27
4.2 Video-based Application 31
4.3 Ablation Study 33
Chapter 5 Conclusions and Future Work 37
5.1 Conclusions 37
5.2 Future Work 38
References 39 | - |
dc.language.iso | en | - |
dc.title | 探究注意力聚焦機制與其電腦視覺應用 | zh_TW |
dc.title | Exploring Attention Zooming-in and Its Applications in Computer Vision | en |
dc.type | Thesis | - |
dc.date.schoolyear | 110-2 | - |
dc.description.degree | Master | - |
dc.contributor.coadvisor | 劉庭祿 | zh_TW |
dc.contributor.coadvisor | Tyng-Luh Liu | en |
dc.contributor.oralexamcommittee | 李明穗 | zh_TW |
dc.contributor.oralexamcommittee | Ming-Sui Lee | en |
dc.subject.keyword | 深度學習,注意力機制,影像分類,單樣本學習,動作識別 | zh_TW |
dc.subject.keyword | Deep Learning, Attention Mechanism, Image Classification, One-shot Learning, Action Detection | en |
dc.relation.page | 45 | - |
dc.identifier.doi | 10.6342/NTU202203898 | - |
dc.rights.note | Authorized for public access (restricted to on-campus use) | - |
dc.date.accepted | 2022-09-27 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
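To make the coarse-to-fine idea described in the abstract concrete, here is a minimal, hedged sketch of one plausible reading of the attention zooming-in (AZ) mechanism. It is not the thesis implementation: all names (`AZAttention`, `pool_size`, `num_zoom`) are hypothetical, the sketch is single-head, and it omits the window-based attention, intra-window query grouping, and combinational position encoding components listed in the table of contents.

```python
# Hypothetical single-head sketch of "attention zooming-in" (AZ), NOT the
# thesis code: queries first attend to coarse (pooled) tokens to model
# long-range context cheaply, then "zoom in" on the top-k relevant regions
# and run fine-grained attention only inside them.
import torch
import torch.nn as nn


class AZAttention(nn.Module):
    def __init__(self, dim: int, pool_size: int = 4, num_zoom: int = 2):
        super().__init__()
        self.pool_size = pool_size   # side length of each coarse region
        self.num_zoom = num_zoom     # k regions each query zooms into
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map; H and W divisible by pool_size.
        B, H, W, C = x.shape
        p = self.pool_size
        hc, wc = H // p, W // p
        n_regions, region_len = hc * wc, p * p

        q = self.to_q(x).reshape(B, H * W, C)      # fine-grained queries
        k, v = self.to_kv(x).chunk(2, dim=-1)

        # Group fine-grained tokens by region: (B, n_regions, region_len, C).
        def to_regions(t: torch.Tensor) -> torch.Tensor:
            t = t.reshape(B, hc, p, wc, p, C).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, n_regions, region_len, C)

        k_r, v_r = to_regions(k), to_regions(v)

        # Stage 1 (coarse): average-pool the keys of each region; every query
        # attends to all coarse tokens, giving a cheap global view.
        k_coarse = k_r.mean(dim=2)                                  # (B, R, C)
        coarse_attn = q @ k_coarse.transpose(-2, -1) * self.scale   # (B, HW, R)

        # Stage 2 (zoom in): pick the top-k regions per query, gather their
        # fine-grained keys/values, and attend over those tokens only.
        topk = coarse_attn.topk(self.num_zoom, dim=-1).indices      # (B, HW, k)
        idx = topk[..., None, None].expand(-1, -1, -1, region_len, C)
        k_fine = torch.gather(k_r.unsqueeze(1).expand(-1, H * W, -1, -1, -1),
                              2, idx).flatten(2, 3)  # (B, HW, k*region_len, C)
        v_fine = torch.gather(v_r.unsqueeze(1).expand(-1, H * W, -1, -1, -1),
                              2, idx).flatten(2, 3)

        # Standard scaled dot-product attention over the selected tokens.
        attn = (q.unsqueeze(2) @ k_fine.transpose(-2, -1)).squeeze(2) * self.scale
        out = (attn.softmax(dim=-1).unsqueeze(2) @ v_fine).squeeze(2)
        return out.reshape(B, H, W, C)


if __name__ == "__main__":
    x = torch.randn(2, 16, 16, 64)          # toy feature map
    print(AZAttention(dim=64)(x).shape)     # torch.Size([2, 16, 16, 64])
```

The point the sketch illustrates is the source of the claimed efficiency: each query's fine-grained attention covers only num_zoom x pool_size^2 tokens selected via the coarse pass, rather than all H x W tokens as in full self-attention.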
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-110-2.pdf Access restricted to NTU campus IP addresses (off-campus users, please connect via the VPN service) | 10.8 MB | Adobe PDF | View/Open |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.