Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7511
Title: | 使用合成資料搭配領域適應學習無關視角姿勢特徵進行跨視角動作辨識 (Cross-View Action Recognition Using View-Invariant Pose Feature Learned from Synthetic Data with Domain Adaptation) |
Author: | Yu-Huan Yang 楊侑寰 |
Advisor: | 傅立成 (Li-Chen Fu) |
Keywords: | action recognition, cross-view, synthetic data, domain adaptation |
Publication Year: | 2018 |
Degree: | Master |
Abstract: | Human action understanding from videos has attracted considerable attention in computer vision recently because of its wide range of applications, such as human-robot interaction, smart homes, health care, and surveillance systems. Recognizing human activities from unknown viewpoints remains a challenging problem, since human shapes appear quite different when observed from different viewpoints. In this thesis, we learn a View-Invariant Pose (VIP) feature representation for cross-view action recognition. In addition, considering privacy issues, we adopt depth videos rather than RGB videos as the input to our system. The proposed VIP feature encoder is a deep Convolutional Neural Network (CNN) that transfers human poses from different viewpoints into a shared high-level feature space. Learning such a deep model requires a large corpus of multi-view data, which is very expensive to collect and label. We therefore synthesize a Multi-View Pose (MVP) dataset by fitting human physical models to real motion-capture data in a simulator and then rendering depth images from multiple viewpoints. The VIP feature is learned from the synthetic MVP dataset in an unsupervised way. Moreover, domain adaptation is employed to minimize the domain difference and thus ensure that the model transfers from synthetic data to real data. An action can be considered as a sequence of poses, and its temporal progression is modeled by a Long Short-Term Memory (LSTM) network. In the experiments, our method is applied to two benchmark multi-view 3D human action datasets, where it outperforms several baseline models and achieves promising results compared with several state-of-the-art methods. |
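The abstract describes a pipeline of three components: a deep CNN that encodes each depth frame into a view-invariant pose feature, a domain-adaptation step that aligns synthetic and real features, and an LSTM that models the temporal sequence of pose features for action classification. Below is a minimal sketch of how such pieces could fit together, assuming PyTorch; all class names, layer sizes, and the gradient-reversal domain classifier (one common domain-adaptation technique, not necessarily the one used in the thesis) are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of the pipeline summarized in the abstract.
# Assumptions: PyTorch; layer sizes, class names, and the use of a
# gradient-reversal layer for domain adaptation are all illustrative.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity forward, negated gradient backward.
    A common adversarial domain-adaptation trick (Ganin & Lempitsky, 2015)."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class VIPEncoder(nn.Module):
    """CNN mapping a single-channel depth frame to a pose feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, depth):  # depth: (B, 1, H, W)
        return self.fc(self.conv(depth).flatten(1))


class DomainClassifier(nn.Module):
    """Predicts synthetic vs. real from a pose feature; the reversed
    gradient pushes the encoder toward domain-invariant features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2))

    def forward(self, feat, lambd=1.0):
        return self.net(GradReverse.apply(feat, lambd))


class ActionLSTM(nn.Module):
    """LSTM over per-frame pose features; last hidden state is classified."""
    def __init__(self, feat_dim=256, hidden=128, num_actions=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_actions)

    def forward(self, feats):  # feats: (B, T, feat_dim)
        _, (h, _) = self.lstm(feats)
        return self.cls(h[-1])


# Toy forward pass: 4 depth clips of 16 frames, 64x64 pixels each.
encoder, lstm = VIPEncoder(), ActionLSTM()
clips = torch.randn(4, 16, 1, 64, 64)
feats = encoder(clips.flatten(0, 1)).view(4, 16, -1)  # per-frame features
logits = lstm(feats)  # (4, num_actions) action scores
```

In this style of training, depth frames from both domains pass through the encoder, and the domain classifier's reversed gradient penalizes features that reveal whether a frame is synthetic or real, which is one standard way to realize the "minimize the domain difference" step the abstract mentions.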
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7511 |
DOI: | 10.6342/NTU201802626 |
Full-text Permission: | Authorized (publicly available worldwide) |
Electronic Full-text Release Date: | 2023-08-09 |
Appears in Collections: | Department of Electrical Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-107-1.pdf | 11.3 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.