Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98151

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 廖世偉 | zh_TW |
| dc.contributor.advisor | Shih-Wei Liao | en |
| dc.contributor.author | 蔡佳靜 | zh_TW |
| dc.contributor.author | Chia-Ching Tsai | en |
| dc.date.accessioned | 2025-07-30T16:07:32Z | - |
| dc.date.available | 2025-07-31 | - |
| dc.date.copyright | 2025-07-30 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-03 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98151 | - |
| dc.description.abstract | 本研究旨在開發一套自動化流程,將行動裝置的螢幕錄影轉換為簡潔、連貫且具時間脈絡的自然語言描述,以提升技術支援與錯誤回報的效率。
基於 Generative Image-to-text Transformer (GIT) 架構,本論文提出四項創新技術:
• 光學字元辨識強化模組:整合 PaddleOCR 至 GIT 中,設計專屬文字嵌入層以擷取螢幕文字資訊,並評估加入文字邊界框對描述精確度的影響。
• 雙階段集群分析特徵選擇:第一階段固定 8 幀進行全模型微調;第二階段凍結編碼器及時間嵌入,將任意長度的影片幀特徵透過 K-Means 壓縮為固定大小,顯著降低 GPU 記憶體消耗。
• 動態時間嵌入:對超過原訓練長度的序列,利用線性插值生成任意長度的時間嵌入,以部分恢復因聚類而失去的時間順序資訊。
• 整合優化:結合上述技術,針對不同影片長度與螢幕文字密度進行協同優化。
在 Android in the Wild (AITW) 資料集上的實驗結果顯示:引入光學字元辨識即能提升多項指標;結合雙階段集群分析與動態時間嵌入後,系統在幀數增加時仍保持高效能;對於更長序列,文字邊界框進一步提升描述精確度。整體而言,本方法在平衡運算資源與說明品質上展現優異表現,為行動裝置影片自動化說明提供了新技術途徑。 | zh_TW |
| dc.description.abstract | This study develops an automated pipeline that converts mobile screen recordings into concise, coherent, and temporally grounded natural language descriptions to streamline technical support and bug reporting.
Building on the Generative Image-to-text Transformer (GIT) framework, we propose four key innovations:
• OCR-Enhanced Module: Integrate PaddleOCR into GIT with a dedicated text embedding tower to extract on-screen text, and evaluate the impact of adding bounding boxes on description accuracy.
• Two-Stage K-Means Feature Selection: Stage 1 fine-tunes the full model on fixed 8-frame inputs; Stage 2 freezes the encoder and temporal embeddings, compressing features from arbitrary-length sequences into a fixed-size representation via K-Means, significantly reducing GPU memory usage.
• Dynamic Temporal Embeddings: Use linear interpolation to generate temporal embeddings of arbitrary length for sequences exceeding the original training horizon, partially restoring the temporal order lost by clustering.
• Integrated Optimization: Combine the above techniques to jointly optimize performance across varying video lengths and text densities.
Experiments on the Android in the Wild (AITW) dataset show that OCR integration alone boosts multiple metrics; the two-stage K-Means pipeline with dynamic temporal embeddings maintains high performance for sequences beyond 48 frames; and incorporating bounding box information further improves accuracy on very long sequences. Overall, our approach effectively balances computational efficiency and caption quality, offering a novel technical solution for automated mobile video captioning. (Illustrative sketches of the OCR and clustering/temporal-embedding steps appear after the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-30T16:07:32Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-30T16:07:32Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Vision-Language Models 5
2.2 Mobile Video Captioning 7
2.3 Feature Selection and Token Reduction for Long Video Sequences 8
2.4 Temporal Modeling in Video Captioning 9
2.5 Optical Character Recognition (OCR) 11
Chapter 3 Methodology 13
3.1 Base Model Structure: Generative Image-to-text Transformer (GIT) 13
3.2 Data Preprocessing 15
3.3 Schemes 17
3.3.1 Scheme 1: OCR Module Integration 18
3.3.2 Scheme 2: Efficient Feature Selection via K-Means Clustering, Two-Stage Training, and Dynamic Temporal Embeddings 20
3.3.3 Scheme 3: Joint OCR and K-Means Integration 25
Chapter 4 Evaluation 31
4.1 Experiment Setting 31
4.1.1 Dataset 31
4.1.2 Evaluation Metrics 31
4.1.3 Implementation Details 32
4.1.4 Training Hyperparameters 32
4.1.5 Hardware Environment 33
4.2 Result 33
4.2.1 Baseline Performance and Fixed Input Length Selection 34
4.2.2 Scheme 1: OCR Module Integration 34
4.2.3 Scheme 2: Efficient Temporal Feature Selection via K-Means 36
4.2.4 Scheme 3: Joint OCR and K-Means Integration 39
4.2.5 Performance Summary by Sequence Length 40
4.2.6 Discussion 41
Chapter 5 Conclusion 43
References 47 | - |
| dc.language.iso | en | - |
| dc.subject | 行動裝置螢幕錄影 | zh_TW |
| dc.subject | 自動化影片說明 | zh_TW |
| dc.subject | 光學字元辨識 | zh_TW |
| dc.subject | 生成式影像轉文字 | zh_TW |
| dc.subject | 雙階段集群分析特徵選擇 | zh_TW |
| dc.subject | 動態時間嵌入 | zh_TW |
| dc.subject | Android in the Wild 資料集 | zh_TW |
| dc.subject | Automated video captioning | en |
| dc.subject | Android in the Wild (AITW) dataset | en |
| dc.subject | Dynamic temporal embeddings | en |
| dc.subject | Two-Stage K-Means feature selection | en |
| dc.subject | Generative Image-to-text Transformer (GIT) | en |
| dc.subject | Optical Character Recognition (OCR) | en |
| dc.subject | Mobile screen recording | en |
| dc.title | 整合光學字元辨識與集群分析之動態時序嵌入於行動裝置螢幕錄影影片字幕生成之研究 | zh_TW |
| dc.title | Enhancing Mobile Screen-Recording Video Captioning via Dynamic Temporal Embeddings Integrating OCR and K-Means Clustering | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 張傑帆;陳建嘉;李逸元;傅楸善;葉春超 | zh_TW |
| dc.contributor.oralexamcommittee | Jie-Fan Chang;Jian-Jia Chen;Yi-Yuan Lee;Chiou-Shann Fuh;Chun-Chao Yeh | en |
| dc.subject.keyword | 行動裝置螢幕錄影,自動化影片說明,光學字元辨識,生成式影像轉文字,雙階段集群分析特徵選擇,動態時間嵌入,Android in the Wild 資料集 | zh_TW |
| dc.subject.keyword | Mobile screen recording,Automated video captioning,Optical Character Recognition (OCR),Generative Image-to-text Transformer (GIT),Two-Stage K-Means feature selection,Dynamic temporal embeddings,Android in the Wild (AITW) dataset | en |
| dc.relation.page | 53 | - |
| dc.identifier.doi | 10.6342/NTU202501425 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2025-07-07 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | N/A | - |
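The OCR-enhanced module summarized in the abstract pairs each sampled frame with the text that PaddleOCR reads from it, optionally tagged with bounding boxes. The following is a minimal sketch of that idea only, not the thesis implementation: the helper name `extract_frame_text` is hypothetical, and the result parsing assumes the line layout commonly documented for PaddleOCR 2.x.

```python
# Illustrative sketch: read on-screen text (and optional boxes) from one frame
# with PaddleOCR so it can be serialized for a text embedding layer.
# Assumes the PaddleOCR 2.x result layout; other versions may differ.
from paddleocr import PaddleOCR

ocr_engine = PaddleOCR(use_angle_cls=True, lang="en")  # build once, reuse per frame

def extract_frame_text(frame, with_boxes=False):
    """Return the OCR text of one frame (image path or RGB ndarray) as a string."""
    result = ocr_engine.ocr(frame)
    pieces = []
    # Assumed layout: result[0] is a list of [quad_box, (text, confidence)]
    # entries for the single input image; it may be None when no text is found.
    for quad, (text, _conf) in (result[0] or []):
        if with_boxes:
            xs, ys = [p[0] for p in quad], [p[1] for p in quad]
            # Reduce the 4-point quad to a coarse (x1,y1,x2,y2) tag so the caption
            # model can also attend to rough on-screen positions.
            pieces.append(f"{text} <{min(xs):.0f},{min(ys):.0f},{max(xs):.0f},{max(ys):.0f}>")
        else:
            pieces.append(text)
    return " ".join(pieces)
```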
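The two-stage K-Means feature selection and dynamic temporal embeddings reduce to two small operations: cluster an arbitrary number of per-frame feature vectors down to a fixed budget of centroids, and stretch the learned temporal embedding table to a new length by linear interpolation. The sketch below illustrates both under assumed shapes; the value k = 8 and the function names are illustrative, not taken from the thesis.

```python
# Minimal sketch of two ideas from the abstract: (1) compress (T, D) frame
# features to k centroids with K-Means, and (2) linearly interpolate a learned
# temporal embedding table to an arbitrary length. Shapes and names are
# illustrative assumptions, not the thesis implementation.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def kmeans_compress(frame_feats: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Compress (T, D) per-frame features to (k, D) centroids, ordered by the
    mean frame index of each cluster to keep a coarse temporal ordering."""
    T, _ = frame_feats.shape
    if T <= k:                                   # nothing to compress
        return frame_feats
    feats = frame_feats.detach().cpu().numpy()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    centroids = torch.from_numpy(km.cluster_centers_).to(
        dtype=frame_feats.dtype, device=frame_feats.device)
    # Sort clusters by the average index of their member frames so the
    # compressed sequence still runs roughly from start to end.
    order = np.argsort([np.where(km.labels_ == c)[0].mean() for c in range(k)])
    return centroids[torch.as_tensor(order)]


def interpolate_temporal_embeddings(table: torch.Tensor, new_len: int) -> torch.Tensor:
    """Stretch a learned (L, D) temporal embedding table to (new_len, D) by
    linear interpolation, giving ordered positions to longer sequences."""
    emb = table.t().unsqueeze(0)                                    # (1, D, L)
    emb = F.interpolate(emb, size=new_len, mode="linear", align_corners=True)
    return emb.squeeze(0).t()                                       # (new_len, D)
```

In a two-stage setup like the one described, `kmeans_compress` would run on frozen encoder outputs, and the interpolated embeddings would be added to the compressed features before they reach the caption decoder.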
Appears in Collections: 資訊工程學系
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (Restricted Access) | 9.85 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.