Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88712
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 廖世偉 | zh_TW |
dc.contributor.advisor | Shih-Wei Liao | en |
dc.contributor.author | 蔣謦宇 | zh_TW |
dc.contributor.author | Ching-Yu Chiang | en |
dc.date.accessioned | 2023-08-15T17:28:26Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-08-15 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-07 | - |
dc.identifier.citation | [1] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019.
[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
[3] A. Bapna, N. Arivazhagan, and O. Firat. Simple, scalable adaptation for neural machine translation. arXiv preprint arXiv:1909.08478, 2019.
[4] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
[5] D. Chen and W. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[6] T. Chen, L. Zhu, C. Ding, R. Cao, Y. Wang, Z. Li, L. Sun, P. Mao, and Y. Zang. SAM fails to segment anything? – SAM-Adapter: Adapting SAM in underperformed scenes: Camouflage, shadow, medical image segmentation, and more, 2023.
[7] Z.-C. Chen, C.-L. Fu, C.-Y. Liu, S.-W. D. Li, and H.-y. Lee. Exploring efficient-tuning methods in self-supervised speech models. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 1120–1127. IEEE, 2023.
[8] B. Deka, Z. Huang, C. Franzen, J. Hibschman, D. Afergan, Y. Li, J. Nichols, and R. Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854, 2017.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[12] B. Ermis, G. Zappella, M. Wistuba, A. Rawal, and C. Archambeau. Continual learning with transformers for image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3774–3781, June 2022.
[13] D. Guo, A. M. Rush, and Y. Kim. Parameter-efficient transfer learning with diff pruning. arXiv preprint arXiv:2012.07463, 2020.
[14] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig. Towards a unified view of parameter-efficient transfer learning. CoRR, abs/2110.04366, 2021.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[16] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[18] R. Karimi Mahabadi, J. Henderson, and S. Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035, 2021.
[19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
[21] D. Li, J. Li, H. Le, G. Wang, S. Savarese, and S. C. H. Hoi. LAVIS: A library for language-vision intelligence, 2022.
[22] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[23] J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[24] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
[25] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
[27] W. Liu, X. Shen, C.-M. Pun, and X. Cun. Explicit visual prompting for low-level structure segmentations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19434–19445, 2023.
[28] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[29] V. Ordonez, G. Kulkarni, and T. Berg. Im2Text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24, 2011.
[30] J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li. ST-Adapter: Parameter-efficient image-to-video transfer learning for action recognition. arXiv preprint arXiv:2206.13559, 2022.
[31] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[32] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[33] J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulić, S. Ruder, K. Cho, and I. Gurevych. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, 2020.
[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
[37] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.
[38] C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[39] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[40] S. Subramanian, L. L. Wang, S. Mehta, B. Bogin, M. van Zuylen, S. Parasa, S. Singh, M. Gardner, and H. Hajishirzi. MedICaT: A dataset of medical images, captions, and textual references. arXiv preprint arXiv:2010.06000, 2020.
[41] Y.-L. Sung, J. Cho, and M. Bansal. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[43] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
[44] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[45] B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2Words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510, 2021.
[46] J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016.
[47] C. Yang, S. Qiao, Q. Yu, X. Yuan, Y. Zhu, A. Yuille, H. Adam, and L.-C. Chen. MOAT: Alternating mobile convolution and attention brings strong vision models. arXiv preprint arXiv:2210.01820, 2022.
[48] Y.-L. Sung, J. Cho, and M. Bansal. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In CVPR, 2022.
[49] E. B. Zaken, S. Ravfogel, and Y. Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
[50] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
[51] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao. Unified vision-language pre-training for image captioning and VQA. CoRR, abs/1909.11059, 2019.
[52] U. Zia, M. Mohsin Riaz, and A. Ghafoor. Transforming remote sensing images to textual descriptions. International Journal of Applied Earth Observation and Geoinformation, 108:102741, 2022. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88712 | - |
dc.description.abstract | 我們旨在探索適用於截圖敘述任務的高效調整方法。最近,圖像敘述生成領域取得了顯著的進展,但在智能手機截圖敘述任務方面的研究相對較少。目前關於手機截圖中用戶行為的數據集和應用案例非常有限。因此,我們的目標是對現有模型進行微調,以應用於截圖敘述任務。然而,由於圖像敘述生成模型中的參數數量龐大,對大型預訓練模型進行微調需要耗費大量時間、計算資源和存儲空間。為了應對這一挑戰,我們提出了一種組合適配器方法的方案,僅需調整模型上的附加模塊。這些方法最初是為視覺或語言任務設計的,我們打算將其應用於解決截圖敘述任務中的類似挑戰。通過凍結圖像敘述生成模型的原參數並僅訓練與方法相關的參數權重,我們可以實現與整體模型微調相當的性能,同時顯著減少參數數量。本研究是對組合適配器方法在截圖敘述任務中有效性進行的首次全面調查。通過實驗和分析,我們旨在提供對於在視覺語言模型中應用適配器的見解,並為截圖敘述任務的高效調整技術的發展做出貢獻。 | zh_TW |
dc.description.abstract | This study explores efficient tuning methods for the screenshot captioning task. Image captioning has recently seen significant advances, but captioning for mobile screens remains relatively under-explored, and datasets and use cases describing user behaviors within product screenshots are notably limited. We therefore fine-tune pre-existing models for the screenshot captioning task. However, fine-tuning large pre-trained models is resource-intensive, requiring considerable time, computational power, and storage because of the vast number of parameters in image captioning models. To tackle this challenge, this study proposes a combination of adapter methods, which requires tuning only small additional modules attached to the model. These methods were originally designed for vision or language tasks, and we apply them to the analogous challenges of screenshot captioning. By freezing the parameters of the image captioning model and training only the adapter weights, performance comparable to fine-tuning the entire model can be achieved while significantly reducing the number of trainable parameters. This study is the first comprehensive investigation of the effectiveness of combining adapters for the screenshot captioning task. Through experiments and analyses, it aims to provide insights into the application of adapters in vision-language models and to contribute to the development of efficient tuning techniques for screenshot captioning. (A minimal code sketch illustrating the adapter idea appears below the metadata table.) | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T17:28:26Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-08-15T17:28:26Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgments ii
摘要 iii
Abstract iv
Contents vi
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Vision-Language Models 5
2.2 Adapter Approaches 7
2.3 Mobile Screenshot Captioning 9
Chapter 3 Methodology 10
3.1 Parameter-Efficient Tuning Approaches 10
3.2 Model Architecture Modification 12
Chapter 4 Evaluation 15
4.1 Experiment Settings 16
4.2 Individual Tuning of Visual and Language Components 16
4.3 Text Decoder and Visual Projection Tuning 19
4.4 Entire Model Tuning 20
4.5 Discussion 22
Chapter 5 Conclusion 26
References 27
Appendix A — Generated Captions 35 | - |
dc.language.iso | en | - |
dc.title | BLIP-適配器:手機截圖敘述生成的參數高效遷移學習 | zh_TW |
dc.title | BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 傅楸善;盧瑞山;黃維中 | zh_TW |
dc.contributor.oralexamcommittee | Chiou-Shann Fuh;Ruei-Shan Lu;Wei-Chung Huang | en |
dc.subject.keyword | 圖像敘述生成,Screen2Words,多模型,機器學習,視覺語言模型,適配器,參數效率,遷移學習, | zh_TW |
dc.subject.keyword | Image Captioning,Screen2Words,Multi-Model,Machine Learning,Vision-Language Model,Adapter,Parameter Efficient,Transfer Learning, | en |
dc.relation.page | 37 | - |
dc.identifier.doi | 10.6342/NTU202301540 | - |
dc.rights.note | Authorization granted (access restricted to campus) | - |
dc.date.accepted | 2023-08-08 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
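The abstract above describes freezing a pre-trained captioning model and training only small adapter modules. The following is a minimal PyTorch sketch of that idea: a bottleneck adapter in the style of Houlsby et al. [16] plus a helper that freezes everything except adapter weights. The class name, bottleneck size, and the "adapter" naming convention are illustrative assumptions, not the thesis's actual BLIP-Adapter implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: down-project, nonlinearity, up-project,
    with a residual connection (in the style of Houlsby et al. [16])."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        # Zero-initialize the up-projection so the adapter starts as an
        # identity mapping and does not perturb the frozen backbone.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def freeze_except_adapters(model: nn.Module) -> list:
    """Freeze all pre-trained parameters; leave trainable only those whose
    name contains 'adapter' (assumes adapter modules are registered under
    attribute names containing 'adapter' -- an illustrative convention)."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    total = sum(p.numel() for p in model.parameters())
    tuned = sum(p.numel() for p in trainable)
    print(f"Tuning {tuned:,} of {total:,} parameters ({100 * tuned / total:.2f}%)")
    return trainable
```

In such a setup, an `Adapter` would typically be inserted after the attention and feed-forward sub-layers of the frozen vision encoder and text decoder, and only the returned parameter list would be passed to the optimizer, e.g. `torch.optim.AdamW(freeze_except_adapters(model), lr=1e-4)`; the learning rate here is a placeholder, not the thesis's setting.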
Appears in Collections: | Department of Computer Science and Information Engineering
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf (access restricted to NTU campus IPs; use the VPN service from off campus) | 5.76 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.