Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88712
Title: | BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning |
Author: | Ching-Yu Chiang (蔣謦宇) |
Advisor: | Shih-Wei Liao (廖世偉) |
Keywords: | Image Captioning, Screen2Words, Multi-Model, Machine Learning, Vision-Language Model, Adapter, Parameter Efficient, Transfer Learning |
Publication Year: | 2023 |
Degree: | Master's |
Abstract: | This study aims to explore efficient tuning methods for the screenshot captioning task. Recently, image captioning has seen significant advancements, but research on captioning tasks for mobile screens remains relatively scarce. Current datasets and use cases describing user behaviors within product screenshots are notably limited. Consequently, we sought to fine-tune pre-existing models for the screenshot captioning task. However, fine-tuning large pre-trained models can be resource-intensive, requiring considerable time, computational power, and storage due to the vast number of parameters in image captioning models. To tackle this challenge, this study proposes a combination of adapter methods, which requires tuning only additional modules attached to the model. These methods were originally designed for vision or language tasks, and our intention is to apply them to address similar challenges in screenshot captioning. By freezing the parameters of the image captioning model and training only the weights associated with these methods, performance comparable to fine-tuning the entire model can be achieved while significantly reducing the number of trained parameters. This study represents the first comprehensive investigation into the effectiveness of combining adapters within the context of the screenshot captioning task. Through our experiments and analyses, this study aims to provide valuable insights into the application of adapters in vision-language models and contribute to the development of efficient tuning techniques for the screenshot captioning task. |
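The freezing strategy the abstract describes, keeping the pre-trained weights fixed and training only small adapter modules, can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed names and dimensions (`Adapter`, `AdaptedBlock`, a 64-unit toy layer), not the thesis's actual BLIP-Adapter implementation; in the real model the frozen part would be the pre-trained BLIP encoder/decoder layers.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """A stand-in for one pre-trained layer, followed by a trainable adapter."""
    def __init__(self, dim):
        super().__init__()
        self.base = nn.Linear(dim, dim)  # hypothetical pre-trained layer
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.base(x))

model = AdaptedBlock(dim=64)

# Freeze every pre-trained weight; leave only the adapter weights trainable.
for name, p in model.named_parameters():
    p.requires_grad = "adapter" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / total: {total}")
```

With the toy sizes above, only the bottleneck adapter's parameters receive gradients, so an optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` updates a small fraction of the model, which is the source of the storage and compute savings the abstract claims.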
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88712 |
DOI: | 10.6342/NTU202301540 |
Full-text access: | Authorized (restricted to campus access) |
Appears in collections: | Department of Computer Science and Information Engineering |
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf (restricted to NTU campus IP; use the VPN service for off-campus access) | 5.76 MB | Adobe PDF | View/Open |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated in their license terms.