Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition

Jing-Ying Chang; 張靖瑩

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45418

Title:	以視覺為基礎之人類動作辨識的演算法及架構分析 Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition
Authors:	Jing-Ying Chang 張靖瑩
Advisor:	陳良基(Liang-Gee Chen)
Keyword:	Video Processing, Human Action Recognition, Behavior Analysis, Tracking, Hardware Architecture, Human Computer Interface, Surveillance System
Publication Year :	2009
Degree:	Ph.D.
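Abstract:	Video-based human action recognition enables many important computer vision applications, including multimedia entertainment, security surveillance, interactive environments, video analysis, and behavioral biometrics. The central challenge for action recognition algorithms and systems is the semantic gap between humans and machines: meaningful object features must be extracted and effective motion models designed so that a computer can correctly interpret the actual meaning of the motion in a scene. The first challenge is feature selection. The features must not only discriminate between actions effectively but also remain obtainable and correctly expressed in real environments affected by noise, occlusion, and shadows, because any error leads to incorrect results in later computation. The second challenge is building a model of human behavior that describes actions precisely and distinguishes between different actions. In addition, how many local and global characteristics the model describes, and the dimensionality of the model itself, must both be decided with regard to the nature of the actions and the amount of test data available. This dissertation uses a controller-free gaming platform and an abandoned-luggage detection system to analyze how to design vision-based human action recognition. It has two parts: the first discusses how to obtain the feature common to most action recognition methods, namely the object trajectory, and the second discusses how to design the action models for the two applications.
The first part covers three modules related to object tracking: an image descriptor, an object tracking algorithm, and multi-object association across multiple cameras. We describe images with the color structure descriptor proposed in MPEG-7, because it can be used for object tracking. To make the descriptor usable in real-time systems, a parallel architecture that shares neighboring pixels is designed for the heavily re-read structuring-block image data, reducing external image data reads by 95%. In addition, localized histograms reduce repeated reads of pixels within the structuring block itself by 75%. The hardware architecture of the color structure descriptor is implemented in UMC 0.18 μm technology with a chip area of 1.37 × 1.37 mm², an operating frequency of 31 MHz, and a power consumption of 89 mW. It processes 256 × 256 frames at 30 frames per second. For the object tracking algorithm we choose the particle filter, which is widely recognized as tracking objects with nonlinear or non-Gaussian motion very well. However, vision-based object features such as color histograms are computationally expensive, so a particle filter combined with color histograms is difficult to use in real-time applications. Exploiting the properties of the particle filter, we parallelize at the particle level to accelerate computation. Because the sparsely utilized color histograms require a large memory for storage, we propose applying the concept of content-addressable memory, which greatly reduces the occupied area, by up to 33.7% of the entire chip. This architecture is implemented in UMC 90 nm technology. The specification targets 720 × 480 frames: with a total tracked-object area of 128 × 128 it achieves 60 frames per second, and with a total area of 64 × 64 it achieves 120 frames per second. The chip area is 1.99 × 1.88 mm², the operating frequency is 200 MHz, and the power consumption is 296 mW.
For public-environment security, we propose a global optimization method to improve the results of spatial multi-object association across multiple cameras. When object features are generated in real environments, no feature extraction method is perfect, and every method suffers some degree of noise. The conventional nearest-neighbor or greedy methods ensure that each individual association has the highest likelihood on its own, but they cannot guarantee that the set of all associations is the most likely from a global viewpoint. Our optimization method uses the concept of the earth mover's distance to find, over every possible pairing, the best solution among all combinations. The method achieves globally optimal association, and because it operates one level above ordinary distance computation, it can be adopted with a variety of object features and distance measures. Compared with the conventional nearest-neighbor or greedy methods, experiments show an average improvement of 10.5% in overall accuracy. The method also has a very low computational cost that is negligible within the whole system: for example, when the views share 26 objects on average, simulation on a 3 GHz CPU processes 3,381 frames per second.
The second part of this dissertation implements the controller-free gaming platform with nonparametric and parametric methods, and the abandoned-luggage detection system with a knowledge-based method. The nonparametric method uses tile-based and motion-vector-based features so that users can mimic volleyball and soccer goalkeeper actions with their whole body to control the characters in the game. Tiles and motion vectors are chosen mainly because they are widely used in video compression, meaning the information is available as soon as a camera system captures a frame. Tiling expresses regional features, approximating the features that human limbs present at different positions. Compared with the earlier temporal-template method, this method performs 13% better on average. For the parametric time-series method, we use the motion trajectories of human joints obtained from the particle filter as action descriptors. Each trajectory is first converted into a symbol sequence, and experiments are run with two trajectory-to-symbol conversion methods and two representations. The results show that combining fixed-duration symbols with symbol histograms works best for these short sports actions, with an average accuracy of 96.7%, which is also 7% better than the preceding nonparametric method.
Abandoned luggage represents a potential threat to public safety, especially in bomb attacks. Identifying which objects are luggage, identifying who the owners are, and determining whether luggage has been abandoned are the three main problems of abandoned-luggage detection. However, most solutions formulate this as an object tracking problem and track every foreground object in the scene, which makes real-time application very difficult in practice. We propose the concept of localized selective tracking: the intersection of sampled foreground masks is first used to determine where a stationary object may be, and then the nearest people around that object are selectively tracked. Because surveillance cameras look down at an angle, heads and shoulders are the least likely to be occluded by other objects, so we use head-and-shoulder contour features to identify and track the owner. On public test data, all luggage-abandonment events are detected; the difference from the ground-truth alarm time ranges from about -2.36 to +6.8 seconds, with an average absolute difference of 2.7 seconds. This dissertation integrates all the core modules used in the action recognition systems above and, based on their computational characteristics and properties, designs high-efficiency hardware architectures and high-accuracy algorithms. It can serve as a reference for later researchers to extend its performance and applications.
Video-based human action recognition technology enables important applications of computer vision, such as multimedia entertainment, surveillance systems, interactive environments, content-based video analysis, and behavioral biometrics. The major challenges of action recognition algorithms and systems lie in the semantic gap, which requires robust feature extraction and effective mathematical action/activity models so that computers can interpret captured videos correctly. The first challenge in action recognition is to choose features that can classify actions well in a variable environment. Several factors can severely limit the applicability of these features in real-world conditions, including noise, occlusions, and shadows. Errors in feature extraction can easily propagate to higher levels. The second challenge is to build models that represent the characteristics of performers; the models should describe actions precisely and distinctively. Effective methods for modeling the representation mathematically are always a key step toward machine intelligence.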
How many regional and global characteristics of an object the models should capture is important for maintaining discriminative power across complex activities or numerous action types. Meanwhile, the dimensionality of the models should be limited to avoid the “curse of dimensionality”; otherwise, a larger number of samples is required to train the models. In this dissertation, two scenarios, controller-free gaming applications and abandoned luggage detection systems, are used to analyze how to derive an effective approach for vision-based human action and activity recognition. This thesis contains two parts: the first discusses processing modules for extracting the key feature, the trajectory, and the second considers model formation for these scenarios. In the first part, three modules are discussed: an image descriptor, a tracking algorithm, and an object correspondence method for multiple-camera tracking environments. The image descriptor is the MPEG-7 color structure descriptor. The descriptor was originally used for image retrieval, but it can also be modified to capture the trajectory of humans [1]. To adopt this descriptor in real-time multimedia applications (30 frames per second), several architecture design techniques are considered. Based on an analysis of histogram accumulation, local histogram observing (LHO) is used to buffer the local structure window for data reuse, and three parallel LHOs are implemented to support real-time operation. Chip area is further saved in the color transformation and the non-linear quantization. The divider in the color transformation is implemented with a lookup table whose area is 36% of that of the original divider. The 255 comparators in the non-linear quantization are folded into one. The implementation occupies 1.37 × 1.37 mm² in UMC 0.18 μm technology. A color-based particle filter is implemented for object tracking. Particle filtering is an effective algorithm for vision-based object tracking.
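The histogram accumulation behind the color structure descriptor mentioned above can be sketched as follows. This is a minimal illustration, assuming the image has already been color-transformed and quantized to integer color indices. The function name `color_structure_histogram`, the list-of-lists image format, and the 2×2 window in the example are assumptions made for brevity (MPEG-7 uses an 8×8 structuring element); the color transformation, the non-linear quantization, and the LHO buffering are not modeled.

```python
def color_structure_histogram(image, num_colors, window=8):
    """Count, for each color, how many window positions contain it at least once.

    `image` is a list of rows of integer color indices in [0, num_colors).
    Unlike a plain histogram, a color is counted once per window position,
    no matter how many pixels of that color the window holds.
    """
    rows, cols = len(image), len(image[0])
    hist = [0] * num_colors
    for top in range(rows - window + 1):
        for left in range(cols - window + 1):
            # Collect the distinct colors inside the current window.
            present = {
                image[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            }
            for color in present:
                hist[color] += 1
    return hist


# Hypothetical 3x3 image with two colors and a 2x2 window (4 window positions).
example = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]
print(color_structure_histogram(example, num_colors=2, window=2))  # [4, 2]
```

Each window position re-reads pixels that its neighbors have already visited, which is the kind of redundancy that buffering the local structure window for data reuse is meant to remove.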
With its probabilistic sampling approach, a particle filter can readily predict the position of an object. However, the computational cost of measuring the similarity of samples is high, which makes the particle filter hard to use in real-time applications. An effective approach that uses estimates of the positions, sizes, and angles of objects is proposed to generate a color histogram as the tracking feature. Several design techniques are employed, such as prioritized finite word length, particle-level parallel operation, and content-addressable memory. The architecture analysis shows that the proposed architecture is very efficient for vision-based real-time applications. The content-addressable-memory technique reduces the storage requirement by 87.5% in terms of chip area. A prototype chip has been designed and verified in UMC’s 90 nm CMOS technology. Experimental results show that the chip can track three objects at 31.35 frames per second on average for all 720×480 sequences, with tracking accuracy higher than 87%. The third module is an object correspondence method. This work proposes a global-optimization approach to spatial object correspondence in distributed surveillance systems. All object correspondence systems need to be calibrated. However, because the environment varies and cameras differ, camera calibration may be imprecise, and the measured similarity between individuals may be incorrect. The concept of earth mover’s distance (EMD) is employed for environments with imprecise feature measurement. The approach solves the problem of mutually exclusive object correspondence, finds the global optimum, allows partial matches, and can be combined with other feature-measurement approaches. Global optimization is achieved by exploring all mutually exclusive match candidates and choosing those that generate the global minimum cost value.
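A small illustration of why a globally optimal, mutually exclusive correspondence can beat greedy matching. This is not the thesis's EMD solver: it assumes equal numbers of objects in both views with unit weights, a special case in which minimizing EMD reduces to an optimal one-to-one assignment, and it brute-forces all permutations, which is only practical for a handful of objects. The function names and the cost matrix are made up for illustration.

```python
from itertools import permutations


def greedy_match(cost):
    """Repeatedly accept the cheapest remaining pair (a purely local choice)."""
    n = len(cost)
    pairs = sorted((cost[i][j], i, j) for i in range(n) for j in range(n))
    used_rows, used_cols, match = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return match


def global_match(cost):
    """Try every one-to-one assignment and keep the one with minimum total cost."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return dict(enumerate(best))


def total_cost(cost, match):
    return sum(cost[i][j] for i, j in match.items())


# Hypothetical feature distances between 2 objects in view A (rows)
# and 2 objects in view B (columns).
cost = [
    [1.0, 2.0],
    [2.0, 10.0],
]
print(total_cost(cost, greedy_match(cost)))  # 11.0
print(total_cost(cost, global_match(cost)))  # 4.0
```

Greedy matching locks in the cheapest pair first and is then forced into the expensive leftover pair, while the exhaustive search accepts two moderately priced pairs for a lower total.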
When EMD with a geometry-based feature is applied to public surveillance datasets, the precision of the EMD-based method exceeds that of the greedy method by 4.3% to 20.8%, with an average of 10.5%. In the second part, a nonparametric approach and a parametric approach are adopted for controller-free gaming applications, and a knowledge-based approach is proposed for abandoned luggage detection systems. The nonparametric approach is tile-based and motion-vector-based; it lets people use their whole body to mimic real Volleyball/GoalKeeper actions to control the character in the game. The idea of introducing motion-vector patterns to action recognition rests on the fact that video compression is now a common function of camera systems and provides an abundant source of motion vectors in the video. Motion vectors represent the dynamics of an environment. By using the motion vectors in the region of an object, this dynamics information can be used to analyze the object's actions. The performance is compared with that of a temporal-template-based approach. Because the feature vectors of the temporal-template-based approach are generated from the entire foreground of a person, no regional information about the foreground is gathered. Therefore, the proposed motion-vector-based approach outperforms the temporal-template-based approach. Another approach for controller-free gaming applications is a parametric time-series approach that uses joint trajectories extracted by the particle filter as action descriptors. Each trajectory is converted into a symbol sequence. Action recognition is performed using all combinations of two distance measures and two dictionaries built from fixed-size or adaptive-size segments. Abandoned luggage represents a potential threat to public safety.
Identifying objects as luggage, identifying the owners of such objects, and determining whether owners have left luggage behind are the three main problems to be solved. However, in crowded areas, solutions that identify and track every object on the chance that it is abandoned luggage are computationally very costly. Accordingly, such methods are difficult to use in real-time applications. The knowledge-based approach uses two techniques to detect abandoned luggage effectively. “Foreground-mask sampling” detects luggage of arbitrary appearance, and “selective tracking” locates and tracks owners by looking only at the neighborhood of the luggage. A probability model based on maximum a posteriori estimation is adopted to generate a confidence score and determine whether luggage has been abandoned deliberately. Experimental results demonstrate that once an owner abandons their luggage and leaves the scene, the alarm fires within a few seconds. The processing speed of the proposed approach is approximately 15 to 20 frames per second, which is sufficient for real-world applications.
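The abstract describes intersecting sampled foreground masks to find where a stationary object may be. A rough sketch of that idea follows, assuming binary foreground masks (1 = foreground) sampled at intervals from some background-subtraction step that is not shown. The function name `static_foreground` and the one-row masks are hypothetical, and the owner tracking and the maximum a posteriori confidence score are omitted.

```python
def static_foreground(masks):
    """Return a mask that is 1 only where every sampled mask is foreground.

    `masks` is a list of equally sized binary masks (lists of rows).
    Moving people cover a pixel in only some samples, so they drop out of
    the intersection; an object left in place stays foreground throughout.
    """
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [int(all(mask[r][c] for mask in masks)) for c in range(cols)]
        for r in range(rows)
    ]


# Three hypothetical one-row masks: column 1 holds a stationary bag,
# while a person moves through columns 0, 2, and 3.
samples = [
    [[1, 1, 0, 0]],
    [[0, 1, 1, 0]],
    [[0, 1, 0, 1]],
]
print(static_foreground(samples))  # [[0, 1, 0, 0]]
```

Only the surviving region would then need nearby people tracked, which is what keeps the workload far below tracking every foreground object in the scene.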
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45418
Fulltext Rights:	Paid authorization
Appears in Collections:	Graduate Institute of Electronics Engineering

Files in This Item:

File	Size	Format
ntu-98-1.pdf Restricted Access	20.8 MB	Adobe PDF
