Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793

Full metadata record (DC field [language]: value)
dc.contributor.advisor [zh_TW]: 廖世偉
dc.contributor.advisor [en]: Shih-Wei Liao
dc.contributor.author [zh_TW]: 蔡博揚
dc.contributor.author [en]: Po-Yang Tsai
dc.date.accessioned: 2025-07-16T16:16:57Z
dc.date.available: 2025-07-17
dc.date.copyright: 2025-07-16
dc.date.issued: 2024
dc.date.submitted: 2024-06-21
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793
dc.description.abstract [zh_TW]: We present an effective method that uses a generative image-to-text Transformer model to caption mobile screen recordings, trained on the Android in the Wild dataset. Summaries of mobile recordings are currently produced by manual review; we use machine learning to convert the visual information directly into text. The approach in this thesis includes data preprocessing and three fine-tuning strategies to improve the model: dual learning rates, an increased number of temporal embeddings, and variable input image resolution. Experimental results show that the fine-tuning methods clearly improve the accuracy of the generated captions and highlight the potential of vision-language models to automate the problem-reporting process in mobile applications, greatly reducing labor and time while providing highly accurate summaries.
dc.description.abstract [en]: This thesis introduces an approach to mobile video captioning that applies the Generative Image-to-text Transformer model to the Android in the Wild dataset. Summarizing mobile screen recordings has traditionally relied on manual review. We address this challenge by employing machine learning techniques to convert visual information directly into text. The methodology includes data preprocessing and three fine-tuning strategies, namely dual learning rates, an increased number of temporal embeddings, and variable input image resolutions, to enhance the model's performance. Comprehensive experiments show that these fine-tuning techniques significantly improve the accuracy of the generated captions. The results highlight the potential of vision-language models to automate the problem-reporting process in mobile applications, substantially reducing time and labor while maintaining high accuracy.
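The dual-learning-rate strategy named in the abstract can be illustrated with optimizer parameter groups. The snippet below is a minimal sketch under stated assumptions, not the thesis code: the TinyCaptioner module, its layer sizes, and the learning-rate values are hypothetical stand-ins for a GIT-style image encoder and text decoder, and PyTorch's AdamW parameter groups are used to give the pretrained visual backbone a smaller step size than the decoder.

import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    # Hypothetical stand-in for a GIT-style captioner: a pretrained image encoder plus a text decoder.
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        # Placeholder "encoder": a 16x16 patch projection standing in for a pretrained ViT backbone.
        self.image_encoder = nn.Sequential(nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.Flatten(2))
        # Placeholder "decoder": a single linear head standing in for a Transformer text decoder.
        self.text_decoder = nn.Linear(dim, vocab_size)

    def forward(self, images):
        feats = self.image_encoder(images).mean(dim=-1)  # pool patch features into one vector per frame
        return self.text_decoder(feats)                  # token logits

model = TinyCaptioner()

# Dual learning rates: one AdamW parameter group per sub-module, so the pretrained
# visual weights move slowly while the decoder adapts faster (rates are illustrative).
optimizer = torch.optim.AdamW(
    [
        {"params": model.image_encoder.parameters(), "lr": 1e-5},
        {"params": model.text_decoder.parameters(), "lr": 1e-4},
    ],
    weight_decay=0.01,
)

In a training loop this optimizer is used exactly like a single-rate one; only the parameter grouping changes, which is why the strategy is inexpensive to try when fine-tuning a pretrained vision-language model.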
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-16T16:16:57Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-07-16T16:16:57Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements ii
摘要 (Chinese abstract) iii
Abstract iv
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 3
2.1 Vision Language Model 4
2.2 Mobile Video Captioning 5
Chapter 3 Methodology 7
3.1 Model Structure 7
3.2 Data Preprocessing 8
3.3 Scheme 10
3.3.1 Dual Learning Rates 10
3.3.2 Number of Temporal Embeddings 11
3.3.3 Input Image Ratio 12
Chapter 4 Evaluation 14
4.1 Experiment Setting 14
4.2 Result 15
4.3 Ablation Study 17
Chapter 5 Conclusion 19
References 20
Appendix A: Examples 26
dc.language.iso: en
dc.subject [zh_TW]: 影片摘要生成 (Video Captioning)
dc.subject [zh_TW]: 微調 (Fine-Tuning)
dc.subject [zh_TW]: 機器學習 (Machine Learning)
dc.subject [zh_TW]: 視覺語言模型 (Vision-Language Model)
dc.subject [zh_TW]: Android in the Wild
dc.subject [en]: Android in the Wild
dc.subject [en]: Fine-Tuning
dc.subject [en]: Video Captioning
dc.subject [en]: Machine Learning
dc.subject [en]: Vision-Language Model
dc.title [zh_TW]: 優化手機影片摘要生成:運用生成式圖片轉文字模型與AITW資料集
dc.title [en]: Enhancing Mobile Video Captioning: Utilizing Generative Image-to-text Transformers with AITW Dataset
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee [zh_TW]: 盧瑞山;傅楸善
dc.contributor.oralexamcommittee [en]: Ruei-Shan Lu; Chiou-Shann Fuh
dc.subject.keyword [zh_TW]: 影片摘要生成, Android in the Wild, 視覺語言模型, 機器學習, 微調
dc.subject.keyword [en]: Video Captioning, Android in the Wild, Vision-Language Model, Machine Learning, Fine-Tuning
dc.relation.page: 33
dc.identifier.doi: 10.6342/NTU202401151
dc.rights.note: 同意授權(全球公開) (authorization granted; open access worldwide)
dc.date.accepted: 2024-06-21
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2025-07-17
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: ntu-113-2.pdf (3.22 MB, Adobe PDF)

