NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98186

Full metadata record
(DC field: value [language])
dc.contributor.advisor: 廖世偉 [zh_TW]
dc.contributor.advisor: Shih-Wei Liao [en]
dc.contributor.author: 高子維 [zh_TW]
dc.contributor.author: Tzu-Wei Kao [en]
dc.date.accessioned: 2025-07-30T16:15:31Z
dc.date.available: 2025-07-31
dc.date.copyright: 2025-07-30
dc.date.issued: 2025
dc.date.submitted: 2025-07-22
dc.identifier.citation:
PaddleOCR Authors. PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle. https://github.com/PaddlePaddle/PaddleOCR, 2020.
C. Bai et al. UIBert: Learning generic multimodal representations for UI understanding. arXiv preprint arXiv:2107.13731, 2021.
I.-H. Chang and S.-W. Liao. Video action localization: A comprehensive approach to mobile critical user journey and automated mobile testing dataset generation, 2024.
Z. Chen, L. Mou, Y. Song, H.-T. Zheng, G. Li, L. Fu, and Y. Ning. ActionBert: Leveraging user interface information for action prediction in GUI. CoRR, abs/2012.12350, 2020.
B. Deka et al. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854. ACM, 2017.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
S. Feng and C. Chen. GIFdroid: Automated replay of visual bug reports for Android apps, 2022.
L. Gomez, I. Neamtiu, T. Azim, and T. Millstein. RERAN: Timing- and touch-sensitive record and replay for Android. In 2013 35th International Conference on Software Engineering (ICSE), pages 72–81, 2013.
H. Guo. A simple algorithm for fitting a Gaussian function [DSP Tips and Tricks]. IEEE Signal Processing Magazine, 28(5):134–137, 2011.
C. Rawles et al. Android in the Wild: A large-scale dataset for Android device control. arXiv preprint arXiv:2307.10088, 2023.
T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection, 2018.
O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Part III, pages 234–241. Springer, 2015.
P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications, 2025.
Gemini Team et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
S. Yamanaka, H. Usuba, and J. Sato. Behavioral differences between tap and swipe: Observations on time, error, touch-point distribution, and trajectory for tap-and-swipe enabled targets. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24), New York, NY, USA, 2024. Association for Computing Machinery.
J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V, 2023.
S. Yu, C. Fang, Z. Tuo, Q. Zhang, C. Chen, Z. Chen, and Z. Su. Vision-based mobile app GUI testing: A survey, 2024.
J. Zhang, J. Wu, Y. Teng, M. Liao, N. Xu, X. Xiao, Z. Wei, and D. Tang. Android in the Zoo: Chain-of-action-thought for GUI agents, 2024.
Z. Zhang and A. Zhang. You only look at screens: Multimodal chain-of-action agents, 2024.
Y. Zhao et al. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.
Z. Zhong et al. UIED: A hybrid tool for GUI element detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1759–1767. ACM, 2020.
Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation, 2016.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98186
dc.description.abstract: 本研究提出一個以「行動思維鏈」(Chain-of-Action-Thought, CoAT)為核心的可擴展行動裝置使用者介面資料集自動化生成流程。該流程透過視覺語言模型模擬真實使用者互動,涵蓋畫面描述、行為推理、動作執行與結果驗證,無需人工參與即可產生高品質的互動資料集。資料集包含原始截圖、標註之介面元件、動作動畫以及詳細的語意推理紀錄。
為驗證資料集之有效性,我們提出一個多模態模型,結合 3D U-Net 用於視覺訊號處理,與 BERT 編碼器處理經由 OCR 擷取的文字資訊。我們以 AITW 與本研究資料集進行多組訓練與測試組合實驗。結果顯示,本資料集能提升動作定位準確性,尤其對滑動操作具優勢。本研究為行動裝置使用者介面測試之中的使用者操作定位提供一套完整解方。 [zh_TW]
dc.description.abstract: This work presents a fully automated pipeline for scalable mobile UI dataset generation, driven by a Chain-of-Action-Thought (CoAT) framework. The pipeline simulates realistic user interactions using a vision-language model to describe screen content, reason through actions, execute commands, and validate outcomes, all without human intervention. The resulting dataset includes raw screenshots, annotated UI elements, action animations, and detailed semantic reasoning traces.
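To make the four-stage loop concrete, here is a minimal Python sketch of one generation step. This is an illustration under stated assumptions, not the thesis's pipeline code: `query_vlm` is a hypothetical stand-in for a vision-language-model client (the abstract does not specify one), while the `adb` screenshot and input commands are real Android Debug Bridge calls.

```python
# Sketch of one Chain-of-Action-Thought (CoAT) step:
# describe screen -> reason -> execute action -> validate outcome.
import subprocess

def run_adb(*args):
    # Run an adb command against the connected device or emulator.
    return subprocess.run(["adb", *args], capture_output=True)

def capture_screenshot(path):
    # `adb exec-out screencap -p` prints the screen as PNG bytes.
    png = run_adb("exec-out", "screencap", "-p").stdout
    with open(path, "wb") as f:
        f.write(png)

def query_vlm(prompt, image_path):
    # Placeholder: plug in a real VLM API client (e.g. Gemini) here.
    raise NotImplementedError

def coat_step(i):
    before = f"step_{i}_before.png"
    capture_screenshot(before)
    # 1. Describe the current screen.
    screen = query_vlm("Describe this mobile screen.", before)
    # 2. Reason about the next user action.
    thought = query_vlm(f"Screen: {screen}\nWhat would a user do next, and why?", before)
    # 3. Emit a concrete command, e.g. "tap 540 960" or "swipe 540 1600 540 400",
    #    and execute it on the device.
    action = query_vlm(f"Thought: {thought}\nAnswer with exactly one command: "
                       "'tap x y' or 'swipe x1 y1 x2 y2'.", before)
    run_adb("shell", "input", *action.split())
    # 4. Validate the outcome against the stated intent.
    after = f"step_{i}_after.png"
    capture_screenshot(after)
    verdict = query_vlm(f"Did '{action}' accomplish: {thought}? Answer yes/no.", after)
    return {"screen": screen, "think": thought, "action": action, "valid": verdict}
```

Each step's record (description, thought, action, verdict, plus the before/after screenshots) corresponds to the dataset fields the abstract lists.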
To demonstrate the effectiveness of the generated dataset, we introduce a multimodal model that combines a 3D U-Net for visual understanding with a BERT encoder for processing textual information extracted via OCR. We evaluate this model across different training and testing configurations, using both AITW and our dataset. Results show that our dataset improves action localization performance, particularly for swipe-based interactions. This work contributes a robust solution for user action localization in mobile GUI testing. [en]
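The two-branch model the abstract describes can likewise be sketched. Below is a minimal PyTorch rendering, assuming a drastically reduced 3D-convolutional encoder in place of a full 3D U-Net, a tiny randomly initialized BERT, and a simple broadcast-and-concatenate fusion; all layer sizes and the fusion scheme are illustrative, not the thesis architecture.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class ActionLocalizer(nn.Module):
    def __init__(self, text_dim=256):
        super().__init__()
        # Visual branch: a (heavily reduced) stand-in for a 3D U-Net
        # encoder over a short stack of screenshots.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Text branch: a tiny randomly initialized BERT; a real system
        # would load pretrained weights via BertModel.from_pretrained.
        cfg = BertConfig(hidden_size=text_dim, num_hidden_layers=2,
                         num_attention_heads=4, intermediate_size=512)
        self.text = BertModel(cfg)
        # 1x1 conv fuses visual features with the broadcast text vector
        # into a single-channel action-localization heatmap.
        self.fuse = nn.Conv2d(32 + text_dim, 1, kernel_size=1)

    def forward(self, frames, token_ids, attention_mask):
        # frames: (B, 3, T, H, W) screenshot stack; token_ids holds OCR text.
        v = self.visual(frames).mean(dim=2)          # pool time -> (B, 32, H, W)
        t = self.text(token_ids, attention_mask=attention_mask).pooler_output
        t = t[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        return self.fuse(torch.cat([v, t], dim=1))   # (B, 1, H, W) heatmap

# Smoke test with random inputs: 4 frames at 128x64, 16 OCR tokens.
model = ActionLocalizer()
heatmap = model(torch.randn(2, 3, 4, 128, 64),
                torch.randint(0, 30522, (2, 16)),
                torch.ones(2, 16, dtype=torch.long))
print(heatmap.shape)  # torch.Size([2, 1, 128, 64])
```

Feeding a multi-frame stack rather than a single screenshot is what lets a temporal model of this shape see motion, which is consistent with the abstract's finding that the dataset helps most on swipe interactions.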
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-30T16:15:31Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-07-30T16:15:31Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 iii
Abstract iv
Contents vi
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Mobile UI Dataset Collection 4
2.2 Performance-Enhanced Techniques of Large Language Model Agents 6
2.3 UI Element Detection and Screen Text Understanding 7
2.4 Mobile GUI Testing with Vision-Language Models 8
Chapter 3 Methodology 11
3.1 Automated Mobile Phone UI Dataset Generation Pipeline 11
3.1.1 Environment Setup 12
3.1.2 Pipeline Structure 13
3.1.3 Pipeline Parameter Setup 14
3.2 Vision Language Model for Action Localization 15
3.2.1 Swipe Action Detection 15
3.2.2 Model Architecture 16
Chapter 4 Evaluation 20
4.1 Mobile Phone UI Dataset Evaluation 20
4.2 Automated Dataset Generation Pipeline CoAT Ablation Study 22
4.3 Model Evaluation 23
Chapter 5 Conclusion 28
References 32
Appendix A — Justification for Swipe Action Design 35
Appendix B — Examples of Our Dataset 37
Appendix C — Prompts to LLM Agents 38
C.1 System Instruction for Execution Agent 38
C.2 Screen Description Prompts 40
C.3 Action Think Prompts 41
C.4 Action Execution Prompts 42
C.5 System Instructions for Validation Agent 43
C.6 Results Validation Prompts 44
dc.language.iso: en
dc.subject: 行動裝置圖像使用者介面測試 [zh_TW]
dc.subject: 行動裝置使用者介面資料集 [zh_TW]
dc.subject: 使用者操作定位 [zh_TW]
dc.subject: 行動思維鏈 [zh_TW]
dc.subject: 視覺語言模型 [zh_TW]
dc.subject: User Action Localization [en]
dc.subject: Mobile GUI testing [en]
dc.subject: Vision-Language Model [en]
dc.subject: Chain-of-Action-Thought [en]
dc.subject: Mobile UI Dataset [en]
dc.title: 運用行動思維鏈及多模態模型進行可擴展自動化行動裝置使用者介面資料集生成及使用者操作定位 [zh_TW]
dc.title: Scalable Automated Mobile UI Dataset Generation Using Chain-of-Action-Thought Framework and Multimodal Models for User Action Localization [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 傅楸善;黃上恩;葉春超;吳昆達 [zh_TW]
dc.contributor.oralexamcommittee: Chiou-Shann Fuh; Shang-En Huang; Chun-Chao Yeh; Kun-Da Wu [en]
dc.subject.keyword: 行動裝置圖像使用者介面測試, 行動裝置使用者介面資料集, 使用者操作定位, 行動思維鏈, 視覺語言模型 [zh_TW]
dc.subject.keyword: Mobile GUI testing, Mobile UI Dataset, User Action Localization, Chain-of-Action-Thought, Vision-Language Model [en]
dc.relation.page: 44
dc.identifier.doi: 10.6342/NTU202502184
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2025-07-23
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: N/A
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: ntu-113-2.pdf (restricted; not available for public access)
Size: 1.94 MB
Format: Adobe PDF