Enhancing Mobile Video Captioning: Utilizing Generative Image-to-text Transformers with AITW Dataset

蔡博揚; Po-Yang Tsai

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793

Title:	Enhancing Mobile Video Captioning: Utilizing Generative Image-to-text Transformers with AITW Dataset
Authors:	蔡博揚 Po-Yang Tsai
Advisor:	廖世偉 Shih-Wei Liao
Keyword:	Video Captioning, Android in the Wild, Vision-Language Model, Machine Learning, Fine-Tuning
Publication Year:	2024
Degree:	Master's
Abstract:	This thesis presents an approach to mobile video captioning that uses the Generative Image-to-text Transformer model, trained on the Android in the Wild (AITW) dataset. Mobile recordings are currently summarized through manual review; we instead use machine learning to convert the visual information directly into text. The methodology comprises data preprocessing and three fine-tuning strategies: dual learning rates, additional temporal embeddings, and variable input image resolutions. Experimental results show that these fine-tuning strategies significantly improve the accuracy of the generated captions. The results also highlight the potential of vision-language models to automate the problem-reporting process in mobile applications, substantially reducing time and labor while maintaining high caption accuracy.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793
DOI:	10.6342/NTU202401151
Fulltext Rights:	Authorization granted (open access worldwide)
Embargo Lift Date:	2025-07-17
Appears in Collections:	Department of Computer Science and Information Engineering

Files in This Item:

File	Size	Format
ntu-113-2.pdf	3.22 MB	Adobe PDF
