NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97279
Full metadata record (each entry below is listed as DC field: value [language])
dc.contributor.advisor: 廖世偉 [zh_TW]
dc.contributor.advisor: Shih-wei Liao [en]
dc.contributor.author: 張瀷鏵 [zh_TW]
dc.contributor.author: I-Hua Chang [en]
dc.date.accessioned: 2025-04-02T16:16:05Z
dc.date.available: 2025-04-03
dc.date.copyright: 2025-04-02
dc.date.issued: 2024
dc.date.submitted: 2024-07-03
dc.identifier.citation: [1] F-Droid. https://f-droid.org/en/.
[2] C. Bernal-Cárdenas, N. Cooper, K. Moran, O. Chaparro, A. Marcus, and D. Poshyvanyk. Translating video recordings of mobile app usages into replayable scenarios. In Proceedings of the ACM/IEEE 42nd international conference on software engineering, pages 309–321, 2020.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] T. Chai, R. R. Draxler, et al. Root mean square error (RMSE) or mean absolute error (MAE). Geoscientific Model Development Discussions, 7(1):1525–1534, 2014.
[5] J. Chen, A. Swearngin, J. Wu, T. Barik, J. Nichols, and X. Zhang. Extracting replayable interactions from videos of mobile app usage. arXiv preprint arXiv:2207.04165, 2022.
[6] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu. UNITER: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
[7] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pages 424–432. Springer, 2016.
[8] B. Deka, Z. Huang, C. Franzen, J. Hibschman, D. Afergan, Y. Li, J. Nichols, and R. Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845–854, 2017.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] S. Feng and C. Chen. GIFdroid: Automated replay of visual bug reports for Android apps. In Proceedings of the 44th International Conference on Software Engineering, pages 1045–1057, 2022.
[11] L. Gomez, I. Neamtiu, T. Azim, and T. Millstein. RERAN: Timing- and touch-sensitive record and replay for Android. In 2013 35th International Conference on Software Engineering (ICSE), pages 72–81. IEEE, 2013.
[12] Google LLC. Android 13. https://www.android.com/android-13/.
[13] Google LLC. Android Studio. https://developer.android.com/studio.
[14] H. Guo. A simple algorithm for fitting a Gaussian function [DSP tips and tricks]. IEEE Signal Processing Magazine, 28(5):134–137, 2011.
[15] V. Iglovikov and A. Shvets. TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation. arXiv preprint arXiv:1801.05746, 2018.
[16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[17] T. F. Liu, M. Craft, J. Situ, E. Yumer, R. Mech, and R. Kumar. Learning design semantics for mobile apps. In The 31st Annual ACM Symposium on User Interface Software and Technology, UIST ’18, pages 569–579, New York, NY, USA, 2018. ACM.
[18] J. Lu, D. Batra, D. Parikh, and S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.
[19] L. Mariani, A. Mohebbi, M. Pezzè, and V. Terragni. Semantic matching of GUI events for test reuse: are we there yet? In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 177–190, 2021.
[20] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
[21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[22] C. Rawles, A. Li, D. Rodriguez, O. Riva, and T. Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023.
[23] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
[24] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[26] M. Xie, S. Feng, Z. Xing, J. Chen, and C. Chen. UIED: A hybrid tool for GUI element detection. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1655–1659, 2020.
[27] A. Yan, Z. Yang, W. Zhu, K. Lin, L. Li, J. Wang, J. Yang, Y. Zhong, J. McAuley, J. Gao, et al. GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation. arXiv preprint arXiv:2311.07562, 2023.
[28] J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
[29] Y. D. Yasuda, L. E. G. Martins, and F. A. Cappabianco. Autonomous visual navigation for mobile robots: A systematic literature review. ACM Computing Surveys (CSUR), 53(1):1–34, 2020.
[30] S. Yu, C. Fang, Y. Feng, W. Zhao, and Z. Chen. LIRAT: Layout and image recognition driving automated mobile testing of cross-platform. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1066–1069. IEEE, 2019.
[31] G. Zhang, N. Kenta, and W. B. Kleijn. Extending AdamW by leveraging its second moment and magnitude. arXiv preprint arXiv:2112.06125, 2021.
[32] J. Zhang, J. Huang, S. Jin, and S. Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[33] M. Zhang and J. Li. A commentary of GPT-3 in MIT Technology Review 2021. Fundamental Research, 1(6):831–833, 2021.
[34] Y. Zhao, S. Talebipour, K. Baral, H. Park, L. Yee, S. A. Khan, Y. Brun, N. Medvidović, and K. Moran. Avgust: Automating usage-based test generation from videos of app executions. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 421–433, 2022.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97279
dc.description.abstract [zh_TW]:
在行動裝置 GUI 測試中,現有的開源數據集往往由於數據過時和變換有限,缺乏進行有效測試所需的精確性和相關性。本文介紹了一種雙代理視覺語言模型(VLM)系統,用於生成能準確捕捉行動應用互動的高品質數據集。通過結合 Transformer 和 UNet 模型,我們的方法使用 Android 模擬器自動化數據集收集過程,並在精確定位動作和螢幕變換數據方面表現出色。

我們對所提出的自動生成數據集進行了評估,發現其在標記精確性方面超過了 RICO 數據集(85% 對 83%),並且變換幀變異數顯著較低(0.73 對 11.8)。我們結合了 3D UNet 和 Transformer 架構的視覺文字模型,展現出較其他配置更高的準確性(73%),凸顯了整合文字和視覺信息對於行動 GUI 測試的重要性。

這項研究強調了針對現代應用介面開發量身定制的數據集的重要性,並展示了自動化數據集生成以應對行動應用迅速變化的景觀的需求。所提出的視覺文字模型在處理行動 GUI 測試的複雜性方面被證明是有效的,顯示了結合視覺和文字洞察以進行準確分析的潛力。
dc.description.abstract [en]:
In mobile GUI testing, existing open-source datasets often lack the accuracy and relevance needed for effective testing due to outdated data and limited transformations. This paper introduces a novel dual-agent Vision-Language Model (VLM) system that generates a high-quality dataset accurately capturing mobile app interactions. By leveraging a combination of transformer and U-Net models, our approach surpasses previous work in precisely locating actions in mobile screen recordings.

We evaluated the proposed auto-generated dataset and found that it surpassed the animation class of the RICO dataset in labeling accuracy (85% vs. 83%) and achieved a significantly lower variance in transformation frame count (0.73 vs. 11.8). Our vision-textual model, which combines 3D U-Net and transformer architectures, achieved higher accuracy (73%) than other configurations, highlighting the importance of integrating both textual and visual information for mobile GUI testing.

This research underscores the importance of developing datasets tailored to modern app interfaces and demonstrates the need for automated dataset generation to keep pace with the rapidly changing landscape of mobile applications. The proposed vision-textual model proved effective in handling the complexity of mobile GUI testing, showing the potential of combining visual and textual insights for accurate analysis.
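As a concrete illustration of the emulator-driven collection loop summarized in the abstract, the sketch below shows one way such a pipeline could be wired up. It is an assumption-laden toy, not the thesis implementation: it requires `adb` on PATH and a running Android emulator, it substitutes a random tap policy where the thesis uses VLM agents to propose and verify actions, and the output paths, helper functions, and screen dimensions are all hypothetical.

```python
# Minimal sketch (assumptions throughout): automate collection of (before
# screenshot, action label, after screenshot) samples against an Android
# emulator. A placeholder random policy stands in for the VLM agents.
import json
import random
import subprocess
import time
from pathlib import Path

OUT = Path("dataset")          # hypothetical output directory
OUT.mkdir(exist_ok=True)

def screenshot(path: Path) -> None:
    # `adb exec-out screencap -p` streams a PNG of the current screen.
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         check=True, capture_output=True).stdout
    path.write_bytes(png)

def tap(x: int, y: int) -> None:
    # Execute a tap on the connected emulator/device.
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def collect(num_steps: int = 10, width: int = 1080, height: int = 2400) -> None:
    for step in range(num_steps):
        before = OUT / f"{step:04d}_before.png"
        after = OUT / f"{step:04d}_after.png"
        screenshot(before)
        # Placeholder "agent": a random tap. In the paper's setting a VLM
        # would propose the next UI action and a second model would verify it.
        x, y = random.randrange(width), random.randrange(height)
        tap(x, y)
        time.sleep(1.0)                 # let the screen transformation settle
        screenshot(after)
        label = {"step": step, "action": "tap", "x": x, "y": y}
        (OUT / f"{step:04d}.json").write_text(json.dumps(label))

if __name__ == "__main__":
    collect()
```

Recording a short screen clip around each action (for example with `adb shell screenrecord`) would bring the stored samples closer to the video inputs that an action-localization model consumes.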
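The vision-textual model described above, which pairs a 3D U-Net video branch with a transformer over textual input, can likewise be illustrated with a simplified PyTorch sketch. The layer sizes, the single pooled fusion of the two branches, and the normalized (x, y) regression head are assumptions for illustration only; the thesis architecture (Section 3.1.2), including the full U-Net encoder-decoder with skip connections, differs in detail.

```python
# Minimal sketch (not the thesis implementation): fuse a 3D-convolutional
# video encoder (standing in for the 3D U-Net branch) with a transformer
# encoder over text tokens to regress a tap location on screen.
import torch
import torch.nn as nn

class VisionTextualLocalizer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Video branch: 3D convolutions over (channels, frames, height, width).
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),            # -> (B, 64, 1, 1, 1)
        )
        self.video_proj = nn.Linear(64, d_model)
        # Text branch: embedding + transformer encoder over on-screen text tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Fusion head: predict a normalized (x, y) action location.
        self.head = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 2), nn.Sigmoid())

    def forward(self, video, tokens):
        # video: (B, 3, T, H, W); tokens: (B, L) integer ids
        v = self.video_encoder(video).flatten(1)       # (B, 64)
        v = self.video_proj(v)                         # (B, d_model)
        t = self.text_encoder(self.embed(tokens))      # (B, L, d_model)
        t = t.mean(dim=1)                              # pooled text feature
        return self.head(torch.cat([v, t], dim=-1))    # (B, 2) in [0, 1]

if __name__ == "__main__":
    model = VisionTextualLocalizer()
    video = torch.randn(2, 3, 8, 64, 64)       # batch of short screen clips
    tokens = torch.randint(0, 10000, (2, 16))  # tokenized on-screen text
    print(model(video, tokens).shape)          # torch.Size([2, 2])
```

Training such a regression head against ground-truth tap coordinates could use, for example, an L2 or Gaussian-heatmap style objective; the thesis compares loss functions in Section 4.1.5.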
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-04-02T16:16:05Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-04-02T16:16:05Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements
摘要 (Abstract in Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Introduction
Chapter 2 Related Work
2.1 Related Work
2.1.1 GUI Testing: Record and Replay Techniques
2.1.1.1 Semantic Matching and Interaction Classification
2.1.1.2 Video-Based Interaction Analysis and Localization
2.1.2 Hybrid Vision-Language Model
2.1.2.1 U-Net Model
2.1.3 Large Language Models and Visual Grounding
2.1.3.1 Set-of-Mark Prompting (SoM) for Enhanced Visual Grounding
2.1.3.2 Zero-Shot GUI Navigation with Large Vision-Language Models
2.1.4 GUI Action Localization Datasets
Chapter 3 Methodology
3.1 Methodology
3.1.1 Dual-Agent Automated Dataset Collection
3.1.2 Vision-Textual Model
Chapter 4 Evaluation
4.1 Evaluation
4.1.1 Evaluation of Dataset Quality
4.1.1.1 Accuracy of Labeling
4.1.1.2 Spatial Variety
4.1.1.3 Temporal Variety
4.1.2 Validation of Agent Styles in Data Generation
4.1.3 Performance Comparison of Hybrid Architectures
4.1.4 Impact of Dataset Selection on Model Efficacy
4.1.5 Comparison of Loss Functions
Chapter 5 Conclusion
5.1 Conclusion
References
Appendix A: General App Transformation
A.0.1 Understanding Screen Transformations
A.0.1.1 Semantic Analysis of Dual Frames
A.0.1.2 Complex Transformations Requiring Additional Context
A.0.2 Conclusion
A.1 Visual Language Model (VLM) Interaction Pipeline
dc.language.iso: en
dc.subject: 視覺語言模型 [zh_TW]
dc.subject: 影片動作定位 [zh_TW]
dc.subject: GUI測試 [zh_TW]
dc.subject: Video Action Localization [en]
dc.subject: Mobile GUI Testing [en]
dc.subject: Vision-Language Model [en]
dc.title: 分析手機用戶關鍵操作以及自動化手機測試資料集生成 [zh_TW]
dc.title: Video Action Localization: A Comprehensive Approach to Mobile Critical User Journey and Automated Mobile Testing Dataset Generation [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 傅楸善; 盧瑞山 [zh_TW]
dc.contributor.oralexamcommittee: Chiou-Shann Fuh; Ruei-Shan Lu [en]
dc.subject.keyword: GUI測試, 視覺語言模型, 影片動作定位 [zh_TW]
dc.subject.keyword: Mobile GUI Testing, Vision-Language Model, Video Action Localization [en]
dc.relation.page: 37
dc.identifier.doi: 10.6342/NTU202401443
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2024-07-04
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊工程學系
dc.date.embargo-lift: N/A
Appears in Collections: 資訊工程學系

Files in This Item:
ntu-113-2.pdf (4.39 MB, Adobe PDF): not authorized for public access


All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
