NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74089
Title: 深度影片解析及其在動作識別中的應用
Deep Video Understanding and its Applications in Action Recognition
Author: Sebastian Agethen (蔡格昇)
Advisor: 徐宏民 (Winston H. Hsu)
Keywords: deep learning, sequence learning, action recognition, action prediction, mixture of experts, incremental learning
Publication Year: 2019
Degree: Doctoral
Abstract:
Videos have become omnipresent in our daily lives with the rise in popularity of portable devices that can browse the internet and record and play back video at any time. A wide range of applications require effective understanding of video content. However, in comparison to image-based learning, videos are inherently *spatio-temporal* in nature.
This brings several challenges. First, the quantity of available data is a multiple of that in image-based learning; as a direct consequence, training models on video becomes far more time-consuming. At the same time, the amount of irrelevant data also grows, so strategies for focusing on relevant content are needed. Second, in addition to spatial reasoning, a video-based application requires the ability to understand and causally connect temporal patterns. This dissertation addresses these central challenges. For the case of data continuously becoming available over time, we propose an expert-based incremental learning method that allows deep models to be trained efficiently. Next, we identify kernel size as critical to the ability to causally reason over temporally spaced phenomena in sequence learning, and propose a multi-kernel solution that improves accuracy while remaining efficient. Furthermore, we enable deep models to learn abstract actions by discarding irrelevant video content; to this end, we propose an attention mechanism that extracts fine-grained features. We choose human action anticipation as the application and evaluate the approach with human pose-based attention masks. Finally, we investigate the application of video-based learning to camera localization and orientation, and propose a regularization scheme to improve results.
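The multi-kernel idea mentioned in the abstract can be illustrated with a toy sketch: several 1-D temporal convolutions with different kernel sizes run in parallel over per-frame features, so a model can relate frames at several temporal spacings at once. This is a minimal illustration under assumed dimensions, not the dissertation's actual architecture; all function names, filter initializations, and the max-pooling readout are hypothetical.

```python
import numpy as np

def temporal_conv(x, kernel):
    """Valid 1-D convolution of per-frame features x (T, D) with a
    temporal kernel (k, D); returns one response per valid position."""
    k, T = kernel.shape[0], x.shape[0]
    return np.array([np.sum(x[t:t + k] * kernel) for t in range(T - k + 1)])

def multi_kernel_features(x, kernel_sizes=(1, 3, 5), rng=None):
    """Apply one (randomly initialized) filter per kernel size in
    parallel, max-pool each response over time, and concatenate, so
    short- and long-range frame relations are captured together."""
    if rng is None:
        rng = np.random.default_rng(0)
    D = x.shape[1]
    feats = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, D)) / np.sqrt(k * D)
        response = temporal_conv(x, kernel)
        feats.append(response.max())      # temporal max-pooling
    return np.array(feats)                # one feature per kernel size

# Example: 10 frames of 4-dimensional features
x = np.arange(40, dtype=float).reshape(10, 4) / 40.0
print(multi_kernel_features(x).shape)     # one output per kernel size
```

In a trained network the filters would of course be learned rather than random; the point of the sketch is only that the parallel kernel sizes give the model simultaneous access to causally related events at different temporal distances.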
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/74089
DOI: 10.6342/NTU201900690
Full-Text Authorization: Paid authorization
Appears in Collections: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)

Files in This Item:
File: ntu-108-1.pdf (currently not authorized for public access)
Size: 11.78 MB
Format: Adobe PDF


All items in this system are protected by copyright, with all rights reserved, unless otherwise indicated.

Contact Information
No. 1, Sec. 4, Roosevelt Rd., Da'an Dist., Taipei 10617, Taiwan (R.O.C.)
Tel: (02) 33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved