Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99154

| Title: | Towards Efficient and Robust Long-Video Object Segmentation via VMamba-based Encoder and Intelligent Memory Management |
| Authors: | 洪德易 Te-I Hung |
| Advisor: | 鄭文皇 Wen-Huang Cheng |
| Keyword: | Long-Video Object Segmentation, Efficient Foundation Models, State Space Models, Temporal Robustness, Intelligent Memory Curation |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | This thesis addresses the dual challenges of efficiency and robustness that limit current video object segmentation (VOS) foundation models such as SAM 2 when applied to long-duration videos. While foundation models like SAM 2 demonstrate powerful generalization, their Transformer-based architecture, particularly the image encoder and memory attention module, creates a computational bottleneck. Concurrently, their standard First-In, First-Out (FIFO) memory policy is susceptible to error accumulation from memory contamination, degrading segmentation quality in long videos. To overcome these limitations, this work first focuses on enhancing computational efficiency. We replace the original Transformer-based encoder (Hiera) with a Mamba-based backbone (VMamba), which operates with linear-time complexity and is successfully trained via a two-stage knowledge distillation strategy. To further accelerate inference, we introduce a training-free dynamic redundancy pruning method. This strategy uses cosine similarity to temporarily disregard redundant frames before the memory attention step, reducing computational cost without modifying the memory bank. Having established a more efficient foundation, we then address the challenge of long-term temporal robustness. We design an intelligent memory replacement mechanism that supplants the FIFO policy. This mechanism incorporates a pre-emptive quality filter to reject low-quality or object-absent frames, thus preventing memory contamination. Building on this, we propose two selective replacement policies, based on IoU scores or attention weights, to ensure the memory bank retains the most valuable memory frames. We conducted comprehensive experiments on multiple VOS benchmarks, including DAVIS, LVOS v2, and SA-V. The results validate that our efficiency-focused modifications provide a significant speed-up while maintaining competitive segmentation quality. More importantly, our intelligent memory replacement strategies, particularly the IoU-based method, substantially outperform the FIFO baseline on long-video datasets, achieving performance competitive with or even superior to more complex SAM 2-based methods like SAM2Long. This research confirms that by systematically addressing both efficiency and robustness, our proposed framework can significantly improve the practicality and reliability of SAM 2 for real-world video segmentation tasks. |
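The training-free redundancy pruning described in the abstract can be illustrated with a short sketch. This is not the thesis's actual implementation: the function name `prune_redundant`, the 0.95 similarity threshold, and the use of flat per-frame feature vectors are all assumptions for illustration. The key property shown is that pruning is decided per inference step via pairwise cosine similarity, while the memory bank itself is never modified.

```python
import numpy as np

def prune_redundant(memory_feats, threshold=0.95):
    """Return indices of memory frames to feed into memory attention.

    A frame is temporarily skipped if its feature vector has cosine
    similarity above `threshold` with an already-kept frame. The
    underlying memory bank is left untouched; only this step's
    attention computation sees fewer frames.
    """
    kept = []
    for i, feat in enumerate(memory_feats):
        f = feat / np.linalg.norm(feat)
        redundant = any(
            float(f @ (memory_feats[j] / np.linalg.norm(memory_feats[j]))) > threshold
            for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept
```

With a near-duplicate frame in the bank, only the distinct frames survive pruning; raising the threshold keeps everything, which recovers the original (unpruned) memory attention.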
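The pre-emptive quality filter and the IoU-based selective replacement policy can likewise be sketched in a few lines. Again, this is an illustrative reconstruction, not the thesis's code: the function name `update_memory`, the quality threshold 0.7, the bank size 7, and storing `(frame_id, iou_score)` tuples are assumptions. The sketch captures the two mechanisms the abstract names: low-confidence frames never enter the bank, and when the bank is full the lowest-IoU entry is evicted instead of the oldest.

```python
def update_memory(bank, frame_id, iou_score, max_size=7, quality_tau=0.7):
    """Quality-filtered, IoU-based memory update (replaces FIFO).

    Pre-emptive quality filter: frames whose predicted IoU falls below
    `quality_tau` are rejected outright, so low-quality or object-absent
    frames cannot contaminate the memory bank.
    Selective replacement: once the bank is full, the stored frame with
    the lowest IoU is evicted, but only if the new frame scores higher.
    """
    if iou_score < quality_tau:
        return bank  # reject: likely occlusion or object absence
    if len(bank) < max_size:
        bank.append((frame_id, iou_score))
        return bank
    worst = min(range(len(bank)), key=lambda i: bank[i][1])
    if bank[worst][1] < iou_score:
        bank[worst] = (frame_id, iou_score)  # evict lowest-IoU frame
    return bank
```

The attention-weight variant mentioned in the abstract would have the same shape, with the eviction key computed from each stored frame's accumulated attention weight rather than its predicted IoU.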
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99154 |
| DOI: | 10.6342/NTU202503759 |
| Fulltext Rights: | Not authorized |
| metadata.dc.date.embargo-lift: | N/A |
| Appears in Collections: | Graduate Institute of Networking and Multimedia |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (Restricted Access) | 16.37 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
