Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793

Full metadata record (DC field [language]: value)
dc.contributor.advisor [zh_TW]: 廖世偉
dc.contributor.advisor [en]: Shih-Wei Liao
dc.contributor.author [zh_TW]: 蔡博揚
dc.contributor.author [en]: Po-Yang Tsai
dc.date.accessioned: 2025-07-16T16:16:57Z
dc.date.available: 2025-07-17
dc.date.copyright: 2025-07-16
dc.date.issued: 2024
dc.date.submitted: 2024-06-21
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793
dc.description.abstract [zh_TW]: We present an effective method that uses a generative image-to-text Transformer model to caption mobile screen recordings, trained on the Android in the Wild dataset. Summaries of mobile recordings are currently produced by manual review; we use machine learning to convert the visual information directly into text. The approach in this thesis includes data preprocessing and three fine-tuning strategies to improve the model: dual learning rates, an increased number of temporal embeddings, and variable input image resolution. Experimental results show that the fine-tuning methods clearly improve the accuracy of the generated captions and highlight the potential of vision-language models to automate the problem-reporting process in mobile applications, greatly reducing labor and time while providing highly accurate summaries.
dc.description.abstract [en]: This thesis introduces an approach to mobile video captioning that applies the Generative Image-to-text Transformer model to the Android in the Wild dataset. Summarizing mobile screen recordings has traditionally relied on manual review. We address this challenge by employing machine learning techniques to convert visual information directly into text. The methodology includes data preprocessing and three fine-tuning strategies, namely dual learning rates, an increased number of temporal embeddings, and variable input image resolutions, to enhance the model's performance. Comprehensive experiments show that these fine-tuning techniques significantly improve the accuracy of the generated captions. The results highlight the potential of vision-language models to automate the problem-reporting process in mobile applications, substantially reducing time and labor while maintaining high accuracy.
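The dual-learning-rate strategy named in the abstract can be illustrated with optimizer parameter groups. The snippet below is a minimal sketch under stated assumptions, not the thesis code: the TinyCaptioner module, its layer sizes, and the learning-rate values are hypothetical stand-ins for a GIT-style image encoder and text decoder, and PyTorch's AdamW parameter groups are used to give the pretrained visual backbone a smaller step size than the decoder.

import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    # Hypothetical stand-in for a GIT-style captioner: a pretrained image encoder plus a text decoder.
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        # Placeholder "encoder": a 16x16 patch projection standing in for a pretrained ViT backbone.
        self.image_encoder = nn.Sequential(nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.Flatten(2))
        # Placeholder "decoder": a single linear head standing in for a Transformer text decoder.
        self.text_decoder = nn.Linear(dim, vocab_size)

    def forward(self, images):
        feats = self.image_encoder(images).mean(dim=-1)  # pool patch features into one vector per frame
        return self.text_decoder(feats)                  # token logits

model = TinyCaptioner()

# Dual learning rates: one AdamW parameter group per sub-module, so the pretrained
# visual weights move slowly while the decoder adapts faster (rates are illustrative).
optimizer = torch.optim.AdamW(
    [
        {"params": model.image_encoder.parameters(), "lr": 1e-5},
        {"params": model.text_decoder.parameters(), "lr": 1e-4},
    ],
    weight_decay=0.01,
)

In a training loop this optimizer is used exactly like a single-rate one; only the parameter grouping changes, which is why the strategy is inexpensive to try when fine-tuning a pretrained vision-language model.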
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-16T16:16:57Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-07-16T16:16:57Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements ii
摘要 (Chinese abstract) iii
Abstract iv
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 3
2.1 Vision Language Model 4
2.2 Mobile Video Captioning 5
Chapter 3 Methodology 7
3.1 Model Structure 7
3.2 Data Preprocessing 8
3.3 Scheme 10
3.3.1 Dual Learning Rates 10
3.3.2 Number of Temporal Embeddings 11
3.3.3 Input Image Ratio 12
Chapter 4 Evaluation 14
4.1 Experiment Setting 14
4.2 Result 15
4.3 Ablation Study 17
Chapter 5 Conclusion 19
References 20
Appendix A: Examples 26
dc.language.iso: en
dc.subject [zh_TW]: 影片摘要生成 (Video Captioning)
dc.subject [zh_TW]: 微調 (Fine-Tuning)
dc.subject [zh_TW]: 機器學習 (Machine Learning)
dc.subject [zh_TW]: 視覺語言模型 (Vision-Language Model)
dc.subject [zh_TW]: Android in the Wild
dc.subject [en]: Android in the Wild
dc.subject [en]: Fine-Tuning
dc.subject [en]: Video Captioning
dc.subject [en]: Machine Learning
dc.subject [en]: Vision-Language Model
dc.title [zh_TW]: 優化手機影片摘要生成:運用生成式圖片轉文字模型與AITW資料集
dc.title [en]: Enhancing Mobile Video Captioning: Utilizing Generative Image-to-text Transformers with AITW Dataset
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee [zh_TW]: 盧瑞山;傅楸善
dc.contributor.oralexamcommittee [en]: Ruei-Shan Lu; Chiou-Shann Fuh
dc.subject.keyword [zh_TW]: 影片摘要生成, Android in the Wild, 視覺語言模型, 機器學習, 微調
dc.subject.keyword [en]: Video Captioning, Android in the Wild, Vision-Language Model, Machine Learning, Fine-Tuning
dc.relation.page: 33
dc.identifier.doi: 10.6342/NTU202401151
dc.rights.note: 同意授權(全球公開) (authorization granted; open access worldwide)
dc.date.accepted: 2024-06-21
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2025-07-17
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: ntu-113-2.pdf (3.22 MB, Adobe PDF)

