Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/53213
Title: | 學習影像和動作辨識之代表性特徵值之演算法與架構設計 Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition |
Author: | Kuo-Wei Tseng 曾國維 |
Advisor: | Liang-Gee Chen (陳良基) |
Keywords: | action recognition, feature extraction, k-means clustering, hardware architecture design, training data synthesis |
Publication Year: | 2015 |
Degree: | Master |
Abstract: | Chinese Abstract

In the past decade, research in computer vision has made great progress and has had a significant impact on our lives. Many intelligent devices have been developed on the basis of big-data analysis and machine-learning algorithms. Take Google Glass as an example: this wearable device can capture images of the faces of people nearby and analyze those images to recognize who the people are. Another example is automatic license plate recognition: such a system recognizes the plate of each entering vehicle and records its entry time on a server, so drivers no longer need to press a button and wait for a parking ticket at the gate; to pay, they simply enter their plate number, which is very convenient. These applications, which combine research in computer vision and machine learning, give us a glimpse of the future lifestyle we envision and a way to reach it. Within the field of recognition, extending beyond still images, action recognition is a problem we urgently need to solve, because in the near future intelligent robots will be built to interact with humans, help with daily life, and carry out the most dangerous tasks in our place. To reach this goal, machines must learn the meanings of the objects and actions contained in images and videos, just as humans think.

Video recognition is even more complex than image recognition: a video contains not only pixel intensities and their spatial relations but also temporal information, namely the changes from frame to frame. As technology advances and robots approach our daily lives, computer vision in the video domain is developing rapidly. Many related algorithms have been proposed in recent years, but most have training procedures that are computationally too complex and must be run in the cloud.

In this thesis, we first introduce the background of computer vision and its potential, and then the commonly used visual recognition pipeline: (1) image or video pre-processing, (2) feature extraction, and (3) classification. In our approach we focus on the pre-processing and feature-extraction stages, using simple, well-known algorithms to achieve good results.

K-means clustering is widely used to generate the codebook for BOVW and is well known for its computational speed. Following [8], we use k-means clustering to learn representative patches. Compared with hierarchical deep-learning algorithms for learning representative features, this method needs only a relatively short time, tens of minutes, to achieve good recognition performance on the CIFAR-10 dataset. In our approach we extend the algorithm to video: k-means clustering learns representative volumes (3-D patches) that include information along the time axis. However, because video training data are scarce compared with image data, performance during training suffers; we therefore propose a method that synthesizes new training data from the original data, enabling learning from different datasets.

In summary, we propose an action recognition system based on k-means clustering that can learn better results from different datasets, and we present a hardware design concept for this algorithm; with slight tuning, the architecture applies to both image and video recognition.

English Abstract

In the past decade, computer vision has made great progress and has had a significant impact on our daily lives. Various intelligent devices have been developed based on big-data analysis and machine-learning algorithms. Take Google Glass for example: this wearable device can capture pictures of the people around you and analyze the images to recognize who they are. In some parking lots, vehicle license plate recognition is used for automatic check-in, and no parking token is needed to get through the gate. These applications show us the possibility of achieving a future lifestyle by combining computer vision and machine learning. Thinking further about visual tasks, action recognition must be the top-priority problem to be solved. In the near future, intelligent robots will be invented that can interact with human beings and do the most dangerous jobs for us.
To do so, machines must learn the meanings of images and actions, just as we do. Visual tasks on video are much more complex than those on images: a video sequence contains not only intensity and spatial information but also temporal features, which capture the transformation between frames. With the advancement of technology, intelligent robots will be invented in the near future, so machine vision in the video domain, which lets robots learn about our world, is a vital issue, and action recognition in particular. Several algorithms for video tasks have been proposed in recent years, but their training procedures are too complex.

In this thesis, we first introduce some applications and the commonly used recognition pipeline in the field of computer vision. A general visual recognition pipeline consists of three parts: (i) image/video pre-processing, (ii) feature extraction, and (iii) classification. In our approach, we focus on the pre-processing and feature-extraction parts, using simple algorithms to achieve high performance.

K-means clustering is broadly used for codebook generation in the Bag of Visual Words (BOVW) [6][7] method and is known for its computational speed. The idea in [8] is to use k-means clustering not to learn a codebook from high-level features but to learn representative patches from raw pixel values. In contrast to constructing a hierarchical, deep architecture to learn complex features, this method needs only tens of minutes to train and achieves good performance on the CIFAR-10 dataset. In our approach, we extend the method from the image domain to the video domain, where k-means clusters representative volumes of frames instead of patches. However, the dimensionality of a volume is much larger than that of a patch, and the training set of a video dataset is usually smaller than that of an image dataset, so it is not large enough to train a good k-means model.
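As an illustration of the patch-learning step just described, the following is a minimal, self-contained sketch, not the thesis implementation: the function names, the 6x6 patch size, and the plain Lloyd's k-means loop are illustrative assumptions. It samples contrast-normalized raw-pixel patches and clusters them into a small codebook:

```python
import numpy as np

def extract_patches(images, patch_size=6, n_patches=1000, rng=None):
    """Sample random patches of raw pixel values from grayscale images.

    images: array of shape (n_images, height, width).
    Returns an array of shape (n_patches, patch_size * patch_size),
    with each patch contrast-normalized (zero mean, unit variance).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, h, w = images.shape
    ps = patch_size
    idx = rng.integers(0, n, n_patches)
    ys = rng.integers(0, h - ps + 1, n_patches)
    xs = rng.integers(0, w - ps + 1, n_patches)
    patches = np.stack([images[i, y:y + ps, x:x + ps].ravel()
                        for i, y, x in zip(idx, ys, xs)]).astype(np.float64)
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= patches.std(axis=1, keepdims=True) + 1e-8
    return patches

def kmeans_codebook(patches, k=16, n_iter=20, rng=None):
    """Plain Lloyd's k-means; the final centroids form the codebook."""
    if rng is None:
        rng = np.random.default_rng(0)
    centroids = patches[rng.choice(len(patches), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every patch to its nearest centroid (squared L2 distance).
        dists = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned patches.
        for j in range(k):
            members = patches[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

In a BOVW-style pipeline, the learned centroids serve as the codebook: each patch of a new image is assigned to its nearest centroid, and the image is summarized by the histogram of those assignments before classification.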
Therefore, we propose a method that learns volumes from different datasets to solve this problem. To sum up, an action recognition system based on the k-means clustering method is designed, with which we can learn and extract features from different datasets. Furthermore, we propose a hardware architecture for this algorithm; with slight parameter changes, the architecture can be used for both image and action recognition. |
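The extension from 2-D patches to spatiotemporal volumes can be sketched in the same spirit. This is again a hedged illustration rather than the thesis code; the volume size and sampling scheme are assumptions. Random (t, y, x) blocks are cut from the frame stack and flattened, after which the same kind of k-means codebook learner used for image patches applies unchanged:

```python
import numpy as np

def extract_volumes(video, size=(5, 8, 8), n_volumes=500, rng=None):
    """Sample random spatiotemporal volumes from a grayscale video.

    video: array of shape (n_frames, height, width).
    Each (t, y, x) block is flattened to a vector so that a standard
    k-means clusterer can learn representative volumes carrying
    temporal (frame-to-frame) information as well as spatial structure.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    f, h, w = video.shape
    st, sy, sx = size
    ts = rng.integers(0, f - st + 1, n_volumes)
    ys = rng.integers(0, h - sy + 1, n_volumes)
    xs = rng.integers(0, w - sx + 1, n_volumes)
    vols = np.stack([video[t:t + st, y:y + sy, x:x + sx].ravel()
                     for t, y, x in zip(ts, ys, xs)]).astype(np.float64)
    # Same per-sample contrast normalization as for 2-D patches.
    vols -= vols.mean(axis=1, keepdims=True)
    vols /= vols.std(axis=1, keepdims=True) + 1e-8
    return vols
```

Note how the flattened dimensionality grows: a 5x8x8 volume yields a 320-dimensional vector versus 36 dimensions for a 6x6 image patch, which is exactly why a small video training set can be insufficient for a good k-means model.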
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/53213 |
Full-text authorization: | Paid authorization |
Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in This Item:
File | Size | Format |
---|---|---|
ntu-104-1.pdf (currently not authorized for public access) | 3.48 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated by their specific license terms.