Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/53994

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 徐宏民(Winston Hsu) | |
| dc.contributor.author | Peng-Ju Hsieh | en |
| dc.contributor.author | 謝朋儒 | zh_TW |
| dc.date.accessioned | 2021-06-16T02:35:53Z | - |
| dc.date.available | 2015-07-29 | |
| dc.date.copyright | 2015-07-29 | |
| dc.date.issued | 2015 | |
| dc.date.submitted | 2015-07-27 | |
| dc.identifier.citation | [1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):16, 2011.
[2] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345–379, 2010.
[3] B. Bruno, F. Mastrogiovanni, A. Sgorbissa, T. Vernazza, and R. Zaccaria. Analysis of human behavior recognition algorithms based on acceleration data. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 1602–1607. IEEE, 2013.
[4] A. Bulling, U. Blanke, and B. Schiele. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR), 46(3):33, 2014.
[5] A. Catz, A. Tamir, and M. Itzkovich. SCIM – spinal cord independence measure: a new disability scale for patients with spinal cord lesions. Spinal Cord, 36(734):735, 1998.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[7] A. Efros, A. C. Berg, G. Mori, J. Malik, et al. Recognizing action at a distance. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 726–733. IEEE, 2003.
[8] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. Multimedia, IEEE Transactions on, 15(7):1553–1568, 2013.
[9] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 407–414. IEEE, 2011.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
[11] J. Geng, Z. Miao, and X.-P. Zhang. Efficient heuristic methods for multimodal fusion and concept fusion in video concept detection. Multimedia, IEEE Transactions on, 17(4):498–511, 2015.
[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[14] B. Kopp, A. Kunkel, H. Flor, T. Platz, U. Rose, K.-H. Mauritz, K. Gresser, K. L. McCulloch, and E. Taub. The arm motor ability test: reliability, validity, and sensitivity to change of an instrument for assessing disabilities in activities of daily living. Archives of Physical Medicine and Rehabilitation, 78(6):615–620, 1997.
[15] M. Koskela and J. Laaksonen. Convolutional network features for scene recognition. In Proceedings of the ACM International Conference on Multimedia, pages 1169–1172. ACM, 2014.
[16] J. R. Kwapisz, G. M. Weiss, and S. A. Moore. Activity recognition using cell phone accelerometers. ACM SIGKDD Explorations Newsletter, 12(2):74–82, 2011.
[17] O. D. Lara and M. A. Labrador. A survey on human activity recognition using wearable sensors. Communications Surveys & Tutorials, IEEE, 15(3):1192–1209, 2013.
[18] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in Neural Information Processing Systems, pages 1378–1386, 2010.
[19] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 9–14. IEEE, 2010.
[20] Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[21] Y.-Y. Lin, J.-H. Hua, N. C. Tang, M.-H. Chen, and H.-Y. M. Liao. Depth and skeleton associated action recognition without online accessible RGB-D cameras. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2617–2624. IEEE, 2014.
[22] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3177–3184. IEEE, 2011.
[23] K. Matsuo, K. Yamada, S. Ueno, and S. Naito. An attention-based activity recognition for egocentric video. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 565–570. IEEE, 2014.
[24] D. J. Moore, I. Essa, M. H. Hayes III, et al. Exploiting human actions and object context for recognition tasks. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 80–86. IEEE, 1999.
[25] S. Narayan, M. S. Kankanhalli, and K. R. Ramakrishnan. Action and interaction recognition in first-person videos. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 526–532. IEEE, 2014.
[26] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2847–2854. IEEE, 2012.
[27] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 413–420. IEEE, 2009.
[28] N. Ravi, N. Dandekar, P. Mysore, and M. L. Littman. Activity recognition from accelerometer data. In AAAI, volume 5, pages 1541–1546, 2005.
[29] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402. ACM, 2005.
[30] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. Circuits and Systems for Video Technology, IEEE Transactions on, 18(11):1473–1488, 2008.
[31] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. M. Rehg. A scalable approach to activity recognition based on object use. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[32] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan. Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2665–2672. IEEE, 2014.
[33] L. Xia, C.-C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3D joints. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 20–27. IEEE, 2012.
[34] K. Yamada, Y. Sugano, T. Okabe, Y. Sato, A. Sugimoto, and K. Hiraki. Attention prediction in egocentric video using motion and visual saliency. In Advances in Image and Video Technology, pages 277–288. Springer, 2012.
[35] Y. Yan, E. Ricci, G. Liu, and N. Sebe. Egocentric daily activity recognition via multitask clustering. IEEE Transactions on Image Processing, 24(10):2984–2995, 2015.
[36] Y. Zhang, X. Liu, M.-C. Chang, W. Ge, and T. Chen. Spatio-temporal phrases for activity recognition. In Computer Vision – ECCV 2012, pages 707–721. Springer, 2012.
[37] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/53994 | - |
| dc.description.abstract | 現有的第一人稱視角影像動作辨識主要著重在單一模型(例如:偵測互動的物件)來推斷活動類型。然而,由於攝影機與實驗者的視角不一致,影像裡重要的物件可能會被部分遮蔽或者沒有出現,這些因素將導致偵測互動物件的模型準確度大幅降低。再者,我們發現實驗者在何處(where)與如何(how)與物件互動的資訊,在先前的第一人稱視角影像動作辨識研究裡幾乎被忽略。因此,為了解決上述的困難點,我們使用多重中階表徵來提高第一人稱視角影像動作辨識的準確度。具體來說,我們利用多重模型(例如:背景資訊、物件的使用與手部的動作模式)來補足單一模型的不足,並且共同考慮使用者與什麼(what)、在何處(where)以及如何(how)互動的資訊,建立多重模型來進行第一人稱視角影像動作辨識。為了評估我們所提出的多重中階表徵模型,我們收集了新的第一人稱視角影像動作辨識資料庫,其中包含第一人稱視角影像與手部三軸加速度資料。在公開的資料集(ADL)上,我們的方法將辨識準確度從目前最新穎方法的36.8%提升到46.7%;在我們自己收集的資料集上,則從32.5%提升到60.0%。除此之外,我們也進行了一系列的實驗來探討各個模型的相對價值。 | zh_TW |
| dc.description.abstract | Existing approaches for egocentric activity recognition mainly rely on a single modality (e.g., detecting interacting objects) to infer the activity category. However, due to the inconsistency between the camera angle and the subject's visual field, important objects may be partially occluded or missing in the video frames, which sharply degrades object-detection-based models. Moreover, where the objects are and how we interact with them are usually ignored in prior work. To resolve these difficulties, we propose to leverage multiple mid-level representations to improve egocentric activity classification accuracy. Specifically, we utilize multimodal representations (e.g., background context, objects manipulated by the user, and motion patterns of the hands) to compensate for the insufficiency of a single modality, and jointly consider what a subject is interacting with, where the interaction takes place, and how the subject interacts. To evaluate the method, we introduce a new and challenging egocentric activity dataset (ADL+) that contains video and wrist-worn accelerometer data of people performing daily-life activities. Our approach significantly outperforms the state-of-the-art method in classification accuracy on both the ADL dataset (from 36.8% to 46.7%) and our ADL+ dataset (from 32.5% to 60.0%). In addition, we conduct a series of analyses to explore the relative merits of each modality for egocentric activity recognition. (A minimal late-fusion sketch illustrating this multimodal idea is given after the metadata table below.) | en |
| dc.description.provenance | Made available in DSpace on 2021-06-16T02:35:53Z (GMT). No. of bitstreams: 1 ntu-104-R02944011-1.pdf: 8856911 bytes, checksum: 8f2c65f5c894a58177cc0d2f6f1783e0 (MD5) Previous issue date: 2015 | en |
| dc.description.tableofcontents | 口試委員會審定書 i
誌謝 ii
摘要 iii
Abstract iv
1 Introduction 1
2 RELATED WORK 5
2.1 Vision-based features 5
2.2 Sensor-based features 6
3 OBSERVATIONS 8
3.1 Important Objects are Missing 8
3.2 Huge Intra-variances 9
4 PROPOSED METHOD 11
4.1 Object Modality (O) 11
4.2 Scene Modality (S) 13
4.3 Sensor Modality (M) 14
4.4 Recognition and Multimodal Fusion 15
5 DATASET 17
6 EXPERIMENT 19
6.1 Experimental Setup 19
6.2 Evaluation of the ADL Dataset 20
6.3 Evaluation of the ADL+ Dataset 22
7 CONCLUSION 26
Bibliography 27 | |
| dc.language.iso | en | |
| dc.subject | 第一人稱視角辨識 | zh_TW |
| dc.subject | 動作辨識 | zh_TW |
| dc.subject | 特徵融合 | zh_TW |
| dc.subject | Egocentric Video | en |
| dc.subject | Activity Recognition | en |
| dc.subject | Feature Fusion | en |
| dc.title | 利用多重中階表徵進行第一人稱視角影片動作辨識 | zh_TW |
| dc.title | Egocentric Activity Recognition by Leveraging Multiple Mid-level Representations | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 103-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 陳祝嵩,葉梅珍 | |
| dc.subject.keyword | 動作辨識, 第一人稱視角辨識, 特徵融合 | zh_TW |
| dc.subject.keyword | Activity Recognition, Egocentric Video, Feature Fusion | en |
| dc.relation.page | 31 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2015-07-27 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
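
The abstract above describes combining object, scene, and wrist-worn sensor modalities and fusing them for egocentric activity recognition. The snippet below is a minimal late-fusion sketch of that idea, assuming scikit-learn linear SVMs and hand-picked fusion weights; the feature names, weights, and helper functions are illustrative assumptions, not the implementation used in the thesis.

```python
# Illustrative sketch only (assumed, not the thesis code): late fusion of
# per-modality linear SVM scores, in the spirit of the object / scene / sensor
# modalities described in the abstract.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def train_modality_classifier(X_train, y_train):
    """Fit a standardized linear SVM on one modality's mid-level features."""
    scaler = StandardScaler().fit(X_train)
    clf = LinearSVC(C=1.0).fit(scaler.transform(X_train), y_train)
    return scaler, clf


def late_fusion_predict(models, X_tests, weights):
    """Weighted sum of per-modality decision scores, then argmax over classes."""
    fused = None
    for (scaler, clf), X_test, w in zip(models, X_tests, weights):
        scores = w * clf.decision_function(scaler.transform(X_test))
        fused = scores if fused is None else fused + scores
    classes = models[0][1].classes_  # every modality is trained on the same labels
    return classes[np.argmax(fused, axis=1)]


# Hypothetical usage: obj_*, scene_*, imu_* are object, scene, and accelerometer
# feature matrices extracted for the same clips (names are placeholders).
# models = [train_modality_classifier(X, y_train)
#           for X in (obj_train, scene_train, imu_train)]
# y_pred = late_fusion_predict(models, (obj_test, scene_test, imu_test),
#                              weights=(0.4, 0.3, 0.3))
```

Late fusion (combining classifier scores rather than concatenating features) is one standard choice for multimodal recognition, e.g., as surveyed in [2] and [29]; the weights here are arbitrary placeholders.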
| Appears in Collections: | 資訊網路與多媒體研究所 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-104-1.pdf (Restricted Access) | 8.65 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
