Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/5260
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 徐宏民(Winston Hsu) | |
dc.contributor.author | Yu-Chuan Su | en |
dc.contributor.author | 蘇昱銓 | zh_TW |
dc.date.accessioned | 2021-05-15T17:54:33Z | - |
dc.date.available | 2015-09-05 | |
dc.date.available | 2021-05-15T17:54:33Z | - |
dc.date.copyright | 2014-09-05 | |
dc.date.issued | 2014 | |
dc.date.submitted | 2014-07-24 | |
dc.identifier.citation | Bibliography
[1] D. Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. J. Comput. Sys. Sci., 66(4):671–687, June 2003. [2] T. Ahonen et al. Face recognition with local binary patterns. In ECCV, 2004. [3] H. Bay et al. Surf: Speeded up robust features. In ECCV, 2006. [4] A. Berg, J. Deng, and F.-F. Li. Large scale visual recognition challenge 2010. [5] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. JMLR, 13:281–305, 2012. [6] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pages 245–250, 2001. [7] A. Bosch et al. Representing shape with a spatial pyramid kernel. In ACM CIVR, 2007. [8] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000. [9] V. Chandrasekhar et al. Comparison of local feature descriptors for mobile visual search. In ICIP, 2010. [10] V. Chandrasekhar et al. Compressed histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vision, 96(3):384–399, 2012. [11] C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3), May 2011. [12] D. Chen et al. Residual enhanced visual vectors for on-device image matching. In Asilomar Conference on Signals, Systems, and Computers, 2011. [13] L.-C. Dai et al. Imshare: instantly sharing your mobile landmark images by searchbased reconstruction. In ACM MM, 2012. [14] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012. [15] J. Deng, A. C. Berg, K. Li, and F.-F. Li. What does classifying more than 10,000 image categories tell us? In Proc. of the 11th European Conf. Computer Vision, pages 71–84, 2010. [16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. Imagenet: A large scale hierarchical image database. In Proc. 
IEEE Conf. Computer Vision and Pattern Recognition, pages 248–255, 2009. [17] J. Deng et al. Large scale visual recognition challenge 2012. [18] T. Deselaers, S. Hasan, O. Bender, and H. Ney. A deep learning approach to machine transliteration. In Proceedings of the Fourth Workshop on Statistical Machine Translation, 2009. [19] M. Douze et al. Evaluation of gist descriptors for web-scale image search. In CIVR, 2009. [20] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? JMLR, 11:625–660, 2010. [21] M. Everingham et al. The pascal visual object classes challenge 2007 (voc2007) results, 2007. [22] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge 2009 (voc2009) results. http://www.pascalnetwork.org/challenges/VOC/voc2009/workshop/index.html. [23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June 2008. [24] E. Gavves, C. G. M. Snoek, and A. W. M. Smeulders. Convex reduction of high dimensional kernels for visual classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3610–3617, 2012. [25] B. Girod, V. Chandrasekhar, D. Chen, N.-M. Cheung, R. Grzeszczuk, Y. Reznik, G. Takacs, S. Tsai, and R. Vedantham. Mobile visual search. IEEE Signal Processing Mag., 28(4):61–76, July 2011. [26] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011. [27] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011. [28] G. Griffin et al. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007. [29] J. He, J. Feng, X. Liu, T. Cheng, T.-H. Lin, H. Chung, and S.-F. Chang. 
Mobile product search with bag of hash bits and boundary reranking. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3005–3012, 2012. [30] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2957–2964, 2012. [31] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006. [32] Y. Hong, Q.-N. Li, J.-Y. Jiang, and Z.-W. Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In Proc. 13th IEEE Int. Conf. Computer Vision, pages 906–913, 2011. [33] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3304–3311, 2010. [34] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013. [35] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding, 2013. [36] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR, 2011. [37] A. Joly and O. Buisson. Random maximum margin hashing. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 873–880, 2011. [38] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In Proc. Conf. Advances in Neural Information Processing Systems, pages 657–664, 2003. [39] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep., 2009. [40] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS. 2012. [41] S. Lazebnik, C. Schmid, and J. Ponce. 
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2169–2178, 2006. [42] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012. [43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [44] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area v2. In NIPS, 2007. [45] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009. [46] J.-G. Li et al. Face recognition using feature of integral gabor-haar transformation. In ICIP, 2007. [47] S. Litayem, A. Joly, and N. Boujemaa. Hash-based support vector machines approximation for large scale prediction. In Proc. British Machine Vision Conference, 2012. [48] W. Liu, S.-Q. Ma, D.-C. Tao, J.-Z. Liu, and P. Liu. Semi-supervised sparse metric learning using alternating linearization optimization. In Proc. 16th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pages 1139–1148, 2010. [49] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004. [50] S. Maji and A. C. Berg. Max-margin additive classifiers for detection. In Proc. 11th IEEE Int. Conf. Computer Vision, pages 40–47, 2009. [51] K. Mikolajczyk et al. A comparison of affine region detectors. Int. J. Comput. Vision, 65(1-2):43–72, 2005. [52] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. Int. J. Comput. Vision, 60(1):63–86, 2004. [53] A. Mohamed, G. E. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. Trans. Audio, Speech and Lang. Proc., 20(1):14–22, 2012. [54] A. Oliva and A. Torralba. 
Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, 2001. [55] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Proc. of the 11th European Conf. Computer Vision, pages 143–156, 2010. [56] G.-J. Qi, J.-H. Tang, Z.-J. Zha, T.-S. Chua, and H.-J. Zhang. An efficient sparse metric learning in high-dimensional space via l1-penalized log-determinant regularization. In Proc. 26th Int. Conf. Machine Learning, pages 841–848, 2009. [57] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 2000. [58] O. Russakovsky et al. Large scale visual recognition challenge 2013. [59] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1665–1672, 2011. [60] M. Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, Univ. British Columbia, 2010. [61] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proc. 9th IEEE Int. Conf. Computer Vision, pages 1470–1477, 2003. [62] K. Sohn, D. Y. Jung, H. Lee, and A. O. Hero. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011. [63] Y.-C. Su, T.-H. Chiu, G.-L. Wu, C.-Y. Yeh, F. Wu, and W. Hsu. Flickr-tag prediction using multi-modal fusion and meta information. In ACM MM, 2013. [64] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013. [65] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008. [66] A. Vedaldi and B. Fulkerson. 
Vlfeat: An open and portable library of computer vision algorithms, 2008. [67] J. Wang and S.-F. Chang. Sequential projection learning for hashing with compact codes. In Proc. 27th Int. Conf. Machine Learning, pages 1127–1134, 2010. [68] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3360–3367, 2010. [69] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009. [70] G.-L. Wu, Y.-H. Kuo, T.-H. Chiu, W. H. Hsu, and L. Xie. Scalable mobile video retrieval with sparse projection learning and pseudo label mining. IEEE Multimedia, 20(3):47–57, 2013. [71] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. [72] M.-Q. Xu, X. Zhou, Z. Li, B.-Q. Dai, and T. Huang. Extended hierarchical gaussianization for scene classification. In Proc. 17th IEEE Conf. Image Processing, pages 1837–1840, 2010. [73] X. Yang and K.-T. Cheng. Accelerating surf detector on mobile devices. In ACM MM, 2012. [74] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification. Proc. IEEE, 100(9):2584–2603, 2012. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/5260 | - |
dc.description.abstract | 行動裝置大規模影像辨識是一種於行動裝置上對於圖片/影像內容之場景、物件乃至情境等層面進行語意分析之技術。隨著行動裝置的普及以及各式媒體分享服務的流行,行動裝置大規模影像辨識逐漸受到重視。雖然大規模影像辨識在經過長年的發展之後,已經有許多有效的方法可以運用在現實生活中,但這些方法大多需要大量的運算資源,因此只能運行於高階的伺服器之上。行動裝置由於其物理上的限制,只有相當有限的運算能力及儲存空間,無法使用現有的大規模影像辨識技術。
在本論文中,我們提出兩種實現行動裝置上大規模影像辨識的系統設計以因應不同的使用情境。在系統無法存取無線網路的情況下,我們提出一種新的線性降維方法– Kernel Preserving Projection (KPP)。有別於傳統的降維方法,KPP 在設計時即針對降維後特徵的可分辨性(Separability)進行優化,因此能夠在低維度下提供較好的辨識能力。KPP 同時考慮到行動裝置上的資源限制,採用稀疏線性投射進行降維,以減少運算時所需的儲存空間以及計算量。 當系統有網路連線的能力時,我們提出使用伺服器-客戶架構來提升系統可分辨的語意數量。此一系統架構最大的挑戰在於如何在有限的網路頻寬下確保系統的反應時間。對此,我們提出低頻寬視覺辨識(Low-Bandwidth Recognition)的概念,並且對各種不同的傳輸策略進行實驗以優化視覺辨識頻寬。我們的實驗結果指出,影像縮圖(Thumbnail)能夠同時保存多種影像特徵,是一種有效率的傳輸方式。我們更進一步提出結合影像縮圖以及基於局部視覺特徵之特徵標籤(Feature Signature)以提升低頻寬下之辨識能力。 我們同時對深度學習(Deep Learning)在視覺辨識上應用之特性進行系統性的實驗及探討。深度學習是當前最被看好的視覺辨識演算法,然而深度學習在實際應用上還有許多困難未曾被解決,如參數的選擇以及對訓練資料的大量需求。我們提出使用轉移學習的方法來達成在稀疏資料中進行深度學習,使得深度學習能夠被運用在更廣泛的視覺辨識問題上。我們的實驗同時為參數的選擇提供一些線索,以利於深度學習的實際運用。這些結果對於探討深度學習對行動裝置大規模影像辨識之影響提供了良好的基礎。 | zh_TW |
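The sparse linear projection used by KPP above can be sketched in a few lines. The snippet below only illustrates the mechanism: it fills a sparse matrix with random Gaussian weights at the ~12% density reported for KPP, not the learned KPP projection, and the function names are hypothetical.

```python
import random

def sparse_projection(d_in, d_out, density=0.12, seed=0):
    # Store each output dimension as a short list of (input_index, weight)
    # pairs; only ~density of the d_in entries are non-zero, so both the
    # matrix storage and the projection cost shrink accordingly.
    rng = random.Random(seed)
    nnz = max(1, int(density * d_in))
    return [[(i, rng.gauss(0.0, 1.0)) for i in rng.sample(range(d_in), nnz)]
            for _ in range(d_out)]

def project(x, rows):
    # Map a d_in-dim feature vector x down to len(rows) dimensions,
    # touching only the non-zero weights of each row.
    return [sum(w * x[i] for i, w in row) for row in rows]

P = sparse_projection(d_in=1000, d_out=64)
feat_rng = random.Random(1)
x = [feat_rng.gauss(0.0, 1.0) for _ in range(1000)]
y = project(x, P)
nonzeros = sum(len(row) for row in P)
print(f"{len(x)}-dim -> {len(y)}-dim using {nonzeros} of {1000 * 64} weights")
```

With 1,000 input and 64 output dimensions, only 7,680 of the 64,000 weights are stored (12%), which is the kind of saving that makes on-device storage and computation feasible.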
dc.description.abstract | Scalable mobile visual classification – classifying images/videos in a large semantic space on mobile devices in real time – is an emerging problem, given the paradigm shift towards mobile platforms and the explosive growth of visual data. Despite advances in detecting thousands of concepts on servers, scalability is handicapped on mobile devices because of their severe resource constraints. However, certain emerging applications require such scalable visual classification with prompt response, whether for detecting local contexts (e.g., Google Glass) or for ensuring user satisfaction. In this thesis, we point out the overlooked challenges of scalable mobile visual classification and provide feasible solutions under different resource constraints.
For systems that operate without a mobile network, we propose an unsupervised linear dimension reduction algorithm – kernel preserving projection (KPP) – to reduce the size of the classifiers and the computational cost. We further introduce sparsity into the projection matrix (merely 12% non-zero entries) to ensure its compliance with mobile computing. Experimental results on three public datasets confirm that the proposed method outperforms existing dimension reduction methods; moreover, we can greatly reduce storage consumption and efficiently compute the classification results on the mobile device. When the mobile network is available under limited bandwidth, we propose to adopt a client-server framework to ensure scalability. The main challenge of this framework is the recognition bitrate – the amount of data transmitted to achieve the same recognition performance. We exploit and compare various strategies, such as compact features, feature compression, feature signatures by hashing, and image scaling, to enable low-bitrate mobile visual recognition. We argue that the thumbnail image is a competitive candidate for low-bitrate visual recognition because it carries multiple features at once, and multi-feature fusion becomes more important as the semantic space grows. We further suggest a new strategy that combines a single (local) feature signature with the thumbnail image, achieving a significant bitrate reduction from (on average) 102,570 to 4,661 bytes with merely (overall) 10% performance degradation. We also investigate the properties of the Deep Convolution Network (DCN), which appears to be a promising direction for large-scale visual recognition. These studies serve as the basis of further investigation into how DCNs will affect visual recognition on mobile devices. Our preliminary studies reveal the correlation between the meta-parameters and the performance of a DCN, given the properties of the target problem and data. 
These results lead to a heuristic for meta-parameter selection in future DCN research that does not rely on time-consuming meta-parameter search. We also point out that the lack of training samples limits the usage of DCNs on a wide range of computer vision problems where obtaining training data is difficult. To solve this problem, we propose to adopt transfer learning to learn a better representation of natural images from large image corpora with sufficient labeled samples and diversity. We show that, by means of transfer learning from images to videos, we can learn a frame-based recognizer with only 4k videos – far fewer than the million-scale image data sets required by previous DCN works. | en |
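The bitrate saving claimed in the abstract is easy to check. A tiny helper (the function name is hypothetical) reproduces the arithmetic from the reported averages:

```python
def reduction_pct(before_bytes, after_bytes):
    # Fraction of upload bandwidth saved when the client sends a compact
    # payload (thumbnail + feature signature) instead of the full data.
    return 100.0 * (1.0 - after_bytes / before_bytes)

# Averages reported in the abstract: 102,570 bytes -> 4,661 bytes
saved = reduction_pct(102570, 4661)
print(f"{saved:.1f}% bitrate reduction")  # about 95.5%
```

So the combined thumbnail-plus-signature payload cuts transmission by roughly 95%, at the cost of the quoted 10% overall recognition degradation.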
dc.description.provenance | Made available in DSpace on 2021-05-15T17:54:33Z (GMT). No. of bitstreams: 1 ntu-103-R01922159-1.pdf: 5919928 bytes, checksum: 202306bd8e9fd7eec89721957b30bfa9 (MD5) Previous issue date: 2014 | en |
dc.description.tableofcontents | Contents
口試委員會審定書 (Committee Approval) iii
摘要 (Chinese Abstract) v
Abstract vii
1 Introduction 1
2 Related Work 5
2.1 Scalable Visual Recognition 5
2.2 Mobile Visual Search 7
2.3 Dimension Reduction 7
2.4 Distance Metric Learning 8
2.5 Deep Convolution Network 9
3 Pure Mobile Visual Recognition System 13
3.1 Goal and System Overview 13
3.2 Dimension Reduction by Kernel Preserving Projection (KPP) 15
3.2.1 Projection Learning to Approximate Kernel Matrix 15
3.2.2 Information-Theoretic-Based Regularization 17
3.2.3 Sparse Projection Matrix 18
3.2.4 Learning Cross Dimension Correlations through RBF Kernel 18
3.2.5 Optimization Solver 19
3.3 Experiment 20
3.3.1 Experimental Setup 20
3.3.2 Results 23
3.3.3 The Effect of Sparse Projection Matrix 25
4 Client-Server Based Visual Recognition System 27
4.1 Motivation 27
4.2 Data Sets 30
4.3 Experimental Setup 31
4.3.1 Features 31
4.3.2 Descriptors 33
4.3.3 Feature Extraction 34
4.3.4 Classifier 34
4.3.5 Compression Factor of Images 35
4.4 Multi-Feature Fusion is Important 35
4.5 Image Scaling Reduces Bitrate 39
4.6 Feature Signature Achieves Lower Bitrate 42
4.7 Suggested System Design 46
5 New Visual Recognition Paradigm 49
5.1 Motivation 49
5.2 Transfer Learning with Deep Convolution Network 51
5.2.1 Feature Extraction with Neural Network 52
5.2.2 Mixing Data Sets 52
5.2.3 Transfer Mid-level Features 53
5.3 Data Set 54
5.4 Network Architecture 56
5.5 Experiment – Network Configuration Selection 58
5.5.1 Image Resolution 58
5.5.2 Depth of Architecture 60
5.5.3 Training Data Number and Diversity 61
5.6 Experiment – Video Recognition with Transfer Learning 63
5.6.1 Feature Extraction by DCN 64
5.6.2 Mixing Data Sets 64
5.6.3 Transfer Mid-level Features 66
6 Conclusion 69
Bibliography 71 | |
dc.language.iso | zh-TW | |
dc.title | 行動裝置大規模影像辨識 | zh_TW |
dc.title | Large Scale Mobile Visual Recognition | en |
dc.type | Thesis | |
dc.date.schoolyear | 102-2 | |
dc.description.degree | 碩士 (Master) | |
dc.contributor.oralexamcommittee | 林智仁 (Chih-Jen Lin), 陳文進 (Wen-Chin Chen), 劉庭祿 (Tyng-Luh Liu) | |
dc.subject.keyword | 行動裝置影像辨識, 降維演算法, 低頻寬視覺辨識, 深度學習 | zh_TW |
dc.subject.keyword | Mobile Visual Recognition, Dimension Reduction, Low-bitrate Visual Recognition, Deep Convolution Network | en |
dc.relation.page | 77 | |
dc.rights.note | Authorized (open access worldwide) | |
dc.date.accepted | 2014-07-24 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
Appears in Collections: | 資訊工程學系
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-103-1.pdf | 5.78 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.