Enhancing Mobile Screen-Recording Video Captioning via Dynamic Temporal Embeddings Integrating OCR and K-Means Clustering

蔡佳靜; Chia-Ching Tsai

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98151

Title:	Enhancing Mobile Screen-Recording Video Captioning via Dynamic Temporal Embeddings Integrating OCR and K-Means Clustering
Author:	蔡佳靜 Chia-Ching Tsai
Advisor:	廖世偉 Shih-Wei Liao
Keywords:	Mobile screen recording, Automated video captioning, Optical Character Recognition (OCR), Generative Image-to-text Transformer (GIT), Two-Stage K-Means feature selection, Dynamic temporal embeddings, Android in the Wild (AITW) dataset
Year of publication:	2025
Degree:	Master's
Abstract:	This study develops an automated pipeline that converts mobile screen recordings into concise, coherent, and temporally grounded natural language descriptions to streamline technical support and bug reporting. Building on the Generative Image-to-text Transformer (GIT) framework, we propose four key innovations:
• OCR-Enhanced Module: Integrates PaddleOCR into GIT with a dedicated text embedding tower to extract on-screen text, and evaluates the impact of adding text bounding boxes on description accuracy.
• Two-Stage K-Means Feature Selection: Stage 1 fine-tunes the full model on fixed 8-frame inputs; Stage 2 freezes the encoder and temporal embeddings and compresses the frame features of arbitrary-length videos into a fixed-size representation via K-Means, significantly reducing GPU memory usage.
• Dynamic Temporal Embeddings: Uses linear interpolation to generate temporal embeddings of arbitrary length for sequences exceeding the original training length, partially restoring the temporal order information lost through clustering.
• Integrated Optimization: Combines the above techniques to jointly optimize performance across varying video lengths and on-screen text densities.
Experiments on the Android in the Wild (AITW) dataset show that OCR integration alone boosts multiple metrics; the two-stage K-Means pipeline with dynamic temporal embeddings maintains high performance for sequences beyond 48 frames; incorporating bounding box information further improves accuracy on very long sequences. Overall, our approach effectively balances computational efficiency and caption quality, offering a novel technical solution for automated mobile video captioning.
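The two mechanisms named in the abstract, stretching a trained temporal embedding table by linear interpolation and compressing per-frame features to a fixed count with K-Means, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy dimensions, the evenly spaced centroid initialisation, and the choice to add the embeddings to the features before clustering are all assumptions made for illustration, and plain Python lists stand in for GPU tensors.

```python
from typing import List

Vector = List[float]


def interpolate_temporal_embeddings(trained: List[Vector], target_len: int) -> List[Vector]:
    """Linearly resample a trained temporal embedding table to target_len rows."""
    n = len(trained)
    if target_len == 1 or n == 1:
        return [list(trained[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        # Map the new index t onto the original index range [0, n - 1].
        pos = t * (n - 1) / (target_len - 1)
        lo = min(int(pos), n - 2)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(trained[lo], trained[lo + 1])])
    return out


def kmeans_compress(features: List[Vector], k: int, iters: int = 10) -> List[Vector]:
    """Compress per-frame feature vectors into k centroids (Lloyd's algorithm)."""
    n = len(features)
    if n <= k:
        # Clips with at most k frames pass through unchanged in this sketch.
        return [list(f) for f in features]
    # Deterministic start from evenly spaced frames (an assumption of this sketch).
    centroids = [list(features[i * n // k]) for i in range(k)]
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for f in features:
            dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(f)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy usage: a table trained for 3 positions is stretched to a 5-frame clip.
table = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
stretched = interpolate_temporal_embeddings(table, 5)
# Hypothetical frame features; real ones would come from the image encoder.
frames = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [9.0, 0.0], [9.1, 0.0]]
summed = [[f + e for f, e in zip(fv, ev)] for fv, ev in zip(frames, stretched)]
compact = kmeans_compress(summed, k=2)
print(len(compact))  # 2
```

Whatever the clip length, the decoder in this sketch always receives `k` vectors, which is the property the abstract credits for the reduced GPU memory usage. Centroids carry no explicit order, which is why the abstract describes the interpolated embeddings as only partially restoring temporal information.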
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98151
DOI:	10.6342/NTU202501425
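Full-text authorization:	Not authorized
Electronic full-text release date:	N/A
Appears in department:	Department of Computer Science and Information Engineering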

Files in this item:

File	Size	Format
ntu-113-2.pdf (not authorized for public access)	9.85 MB	Adobe PDF

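Unless their copyright terms are specifically stated otherwise, all items in this system are protected by copyright, with all rights reserved.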
