Vision-Augmented Skeleton-Text Alignment for Zero-Shot Action Recognition with Memory-Based Contrastive Learning

魏子翔; Zi-Xiang Wei

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98695

Title:	Vision-Augmented Skeleton-Text Alignment for Zero-Shot Action Recognition with Memory-Based Contrastive Learning
Author:	Zi-Xiang Wei (魏子翔)
Advisor:	Jane Yung-Jen Hsu (許永真)
Co-advisor:	Wen-Huang Cheng (鄭文皇)
Keywords:	Zero-Shot Learning, Skeleton-Based Action Recognition, Vision-Language Model, Contrastive Learning, Multimodal Alignment
Publication year:	2025
Degree:	Master's
Abstract:	Existing approaches to zero-shot skeleton-based action recognition mostly rely on fixed class labels or generic textual descriptions, which limits the alignment between skeletal motion and semantic understanding. To address this, we propose Vision-augmented Skeleton-Text Alignment (ViSTA), a dual-VAE framework that uses a vision-language model to generate motion-centric descriptions from animated skeleton sequences. These vision-informed descriptions are fused with the original class labels and embedded by a pre-trained text encoder to form rich semantic representations. ViSTA uses dual VAEs to disentangle semantic from non-semantic (irrelevant) factors, and aligns the two modalities through cross-modal reconstruction and momentum-based contrastive learning. Compared with a strong dual-VAE baseline, ViSTA improves ZSL accuracy by +5.8% on NTU-60, +7.37% on NTU-120, and +4.65% on PKU-MMD, and improves the GZSL harmonic mean by +1.8%, +2.93%, and +1.44% on the same three datasets, respectively.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98695
DOI:	10.6342/NTU202503431
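Full-text authorization:	Authorized (campus access only)
Electronic full-text release date:	2025-08-19
Appears in collections:	Department of Computer Science and Information Engineering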
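
The abstract reports GZSL results as a harmonic mean. As a minimal sketch, assuming the thesis follows the usual generalized zero-shot convention in which the harmonic mean combines accuracy on seen classes and accuracy on unseen classes, the metric can be computed as below. The function name and the example accuracies are hypothetical and are not taken from the thesis.

```python
def gzsl_harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Return H = 2 * s * u / (s + u) for seen accuracy s and unseen accuracy u.

    Assumption: this is the standard GZSL definition; the record itself
    does not spell out the formula.
    """
    if seen_acc < 0 or unseen_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero; define H as 0 to avoid dividing by zero.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to illustrate the metric.
    h = gzsl_harmonic_mean(0.80, 0.40)
    print(f"harmonic mean: {h:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model that does well on seen classes but poorly on unseen ones still scores low, which is why the metric is commonly preferred over a plain average in this setting.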

Files in this item:

File	Size	Format
ntu-113-2.pdf (access restricted to NTU campus IP addresses; off campus, please use the VPN remote-access service)	1.33 MB	Adobe PDF


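Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.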
