Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 廖世偉 | zh_TW |
| dc.contributor.advisor | Shih-Wei Liao | en |
| dc.contributor.author | 蔡博揚 | zh_TW |
| dc.contributor.author | Po-Yang Tsai | en |
| dc.date.accessioned | 2025-07-16T16:16:57Z | - |
| dc.date.available | 2025-07-17 | - |
| dc.date.copyright | 2025-07-16 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-06-21 | - |
| dc.identifier.citation | [1] G. Baechler, S. Sunkara, M. Wang, F. Zubach, H. Mansoor, V. Etter, V. Cărbune, J. Lin, J. Chen, and A. Sharma. ScreenAI: A vision-language model for UI and infographics understanding. arXiv preprint arXiv:2402.04615, 2024.
[2] C. Bai, X. Zang, Y. Xu, S. Sunkara, A. Rastogi, J. Chen, and B. A. y Arcas. UIBert: Learning generic multimodal representations for UI understanding. In International Joint Conference on Artificial Intelligence, 2021.
[3] D. Chen and W. Dolan. Collecting highly parallel data for paraphrase evaluation. In D. Lin, Y. Matsumoto, and R. Mihalcea, editors, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 13–18 Jul 2020.
[5] B. Deka, Z. Huang, C. Franzen, J. Hibschman, D. Afergan, Y. Li, J. Nichols, and R. Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, UIST '17, pages 845–854, New York, NY, USA, 2017. Association for Computing Machinery.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[8] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, Los Alamitos, CA, USA, July 2017. IEEE Computer Society.
[9] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick. Masked autoencoders are scalable vision learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988, 2022.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
[11] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2712–2721. PMLR, 09–15 Jun 2019.
[12] K. Lee, M. Joshi, I. Turc, H. Hu, F. Liu, J. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, and K. Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. arXiv preprint arXiv:2210.03347, 2022.
[13] J. Li, D. Li, C. Xiong, and S. C. H. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022.
[14] Y. Li, J. He, X. Zhou, Y. Zhang, and J. Baldridge. Mapping natural language instructions to mobile UI action sequences. arXiv preprint arXiv:2005.03776, 2020.
[15] Y. P. Y. Lik-Hang Lee and P. Hui. Perceived user reachability in mobile UIs using data analytics and machine learning. International Journal of Human–Computer Interaction, 0(0):1–24, 2024.
[16] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[17] T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[18] E. Liner and R. Miikkulainen. Improving neural network learning through dual variable learning rates. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–7, 2021.
[19] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.
[20] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[21] G. Oliveira dos Santos, E. L. Colombini, and S. Avila. CIDEr-R: Robust consensus-based image description evaluation. In W. Xu, A. Ritter, T. Baldwin, and A. Rahimi, editors, Proceedings of the Seventh Workshop on Noisy User-generated Text (WNUT 2021), pages 351–360, Online, Nov. 2021. Association for Computational Linguistics.
[22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In P. Isabelle, E. Charniak, and D. Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
[23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[24] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[25] A. Radford and K. Narasimhan. Improving language understanding by generative pre-training. 2018.
[26] V. Ramanathan, K. D. Tang, G. Mori, and L. Fei-Fei. Learning temporal embeddings for complex video analysis. 2015 IEEE International Conference on Computer Vision (ICCV), pages 4471–4479, 2015.
[27] C. Rawles, A. Li, D. Rodriguez, O. Riva, and T. P. Lillicrap. AndroidInTheWild: A large-scale dataset for Android device control. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
[30] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers and distillation through attention. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 18–24 Jul 2021.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164. IEEE Computer Society, 2015.
[33] B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2Words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology. ACM, Oct. 2021.
[34] J. Wang, X. Hu, P. Zhang, X. Li, L. Wang, L. Zhang, J. Gao, and Z. Liu. MiniVLM: A smaller and faster vision-language model. arXiv preprint arXiv:2012.06946, 2020.
[35] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. Wang. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
[36] L. Yuan, D. Chen, Y.-L. Chen, N. C. F. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, C. Liu, M. Liu, Z. Liu, Y. Lu, Y. Shi, L. Wang, J. Wang, B. Xiao, Z. Xiao, J. Yang, M. Zeng, L. Zhou, and P. Zhang. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97793 | - |
| dc.description.abstract | 我們提供一個有效的方法,使用生成式圖片和文字的轉換器模型來為手機影片生成摘要,並訓練在Android in the Wild資料集。目前手機錄影都是由人工檢視做摘要,我們使用機器學習直接將視覺的資訊轉成文字。本論文使用的方法包含資料的前處理及三種微調策略來改善模型,包含雙學習率、增加時間序詞嵌入,以及可變輸入圖片解析度。實驗結果顯示微調方法明顯的提高了生成摘要的準確度,並且凸顯視覺語言模型,在手機應用程式中自動化問題報告過程的潛力,大量的減少人力與時間的同時提供高準確度的摘要。 | zh_TW |
| dc.description.abstract | This paper introduces a novel approach to mobile video captioning that uses the Generative Image-to-text Transformer (GIT) model with the Android in the Wild dataset. Summarizing mobile screen recordings has traditionally relied on manual review; we address this by applying machine learning to convert visual information directly into text. The methodology comprises data preprocessing and three fine-tuning strategies: dual learning rates, an increased number of temporal embeddings, and variable input image resolutions. Comprehensive experiments show that these fine-tuning techniques significantly improve the accuracy of the generated captions. The results highlight the potential of vision-language models to automate the problem-reporting process in mobile applications, substantially reducing time and labor while maintaining high accuracy (see the illustrative sketch after this record). | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-16T16:16:57Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-16T16:16:57Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements ii
摘要 iii
Abstract iv
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 3
2.1 Vision Language Model 4
2.2 Mobile Video Captioning 5
Chapter 3 Methodology 7
3.1 Model Structure 7
3.2 Data Preprocessing 8
3.3 Scheme 10
3.3.1 Dual Learning Rates 10
3.3.2 Number of Temporal Embeddings 11
3.3.3 Input Image Ratio 12
Chapter 4 Evaluation 14
4.1 Experiment Setting 14
4.2 Result 15
4.3 Ablation Study 17
Chapter 5 Conclusion 19
References 20
Appendix A — Examples 26 | - |
| dc.language.iso | en | - |
| dc.subject | 影片摘要生成 | zh_TW |
| dc.subject | 微調 | zh_TW |
| dc.subject | 機器學習 | zh_TW |
| dc.subject | 視覺語言模型 | zh_TW |
| dc.subject | Android in the Wild | zh_TW |
| dc.subject | Android in the Wild | en |
| dc.subject | Fine-Tuning | en |
| dc.subject | Video Captioning | en |
| dc.subject | Machine Learning | en |
| dc.subject | Vision-Language Model | en |
| dc.title | 優化手機影片摘要生成:運用生成式圖片轉文字模型與AITW資料集 | zh_TW |
| dc.title | Enhancing Mobile Video Captioning: Utilizing Generative Image-to-text Transformers with AITW Dataset | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 盧瑞山;傅楸善 | zh_TW |
| dc.contributor.oralexamcommittee | Ruei-Shan Lu;Chiou-Shann Fuh | en |
| dc.subject.keyword | 影片摘要生成,Android in the Wild,視覺語言模型,機器學習,微調, | zh_TW |
| dc.subject.keyword | Video Captioning,Android in the Wild,Vision-Language Model,Machine Learning,Fine-Tuning, | en |
| dc.relation.page | 33 | - |
| dc.identifier.doi | 10.6342/NTU202401151 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2024-06-21 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2025-07-17 | - |
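
The abstract above names dual learning rates as one of the fine-tuning strategies. As a minimal sketch only, not taken from the thesis, the snippet below shows how such a scheme is commonly set up in PyTorch by assigning the pretrained vision encoder and the text decoder separate learning rates through optimizer parameter groups; the module names, layer sizes, and learning-rate values are illustrative assumptions.

```python
# Illustrative only: dual learning rates via optimizer parameter groups.
# The module names (image_encoder, text_decoder) and all hyperparameter
# values are assumptions for demonstration, not taken from the thesis.
import torch
from torch import nn


class TinyCaptioner(nn.Module):
    """Stand-in for a GIT-style model: a vision encoder plus a text decoder."""

    def __init__(self) -> None:
        super().__init__()
        self.image_encoder = nn.Linear(768, 256)  # placeholder for the pretrained encoder
        self.text_decoder = nn.Linear(256, 1000)  # placeholder for the caption decoder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.text_decoder(self.image_encoder(x))


model = TinyCaptioner()

# Dual learning rates: a smaller rate for the pretrained encoder,
# a larger rate for the decoder being adapted to the captioning task.
optimizer = torch.optim.AdamW(
    [
        {"params": model.image_encoder.parameters(), "lr": 1e-5},
        {"params": model.text_decoder.parameters(), "lr": 1e-4},
    ],
    weight_decay=0.01,
)

# One dummy optimization step to show the setup runs end to end.
loss = model(torch.randn(4, 768)).mean()
loss.backward()
optimizer.step()
```

The general intuition behind this split is that a small rate on the pretrained encoder helps preserve its learned visual representations, while a larger rate lets the decoder adapt more quickly to the new captioning task.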
| Appears in Collections: | 資訊工程學系 |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 3.22 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
