Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8254
Full metadata record (DC field: value [language])
dc.contributor.advisor: 徐宏民 (Winston H. Hsu, whsu@ntu.edu.tw)
dc.contributor.author: Zhe-Yu Liu [en]
dc.contributor.author: 劉哲宇 [zh_TW]
dc.date.accessioned: 2021-05-20T00:50:49Z
dc.date.available: 2020-11-01
dc.date.available: 2021-05-20T00:50:49Z
dc.date.copyright: 2020-08-21
dc.date.issued: 2020
dc.date.submitted: 2020-08-14
dc.identifier.citation[1] J. A. Buolamwini. Gender shades: intersectional phenotypic and demographic evaluation of face datasets and gender classifiers. PhD thesis, Massachusetts Institute of Technology, 2017.
[2] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[3] C.-H. Chen, A. Tyagi, A. Agrawal, D. Drover, R. MV, S. Stojanov, and J. M. Rehg. Unsupervised 3d pose estimation with geometric self-supervision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[4] J. Choi, C. Gao, J. C. Messou, and J.-B. Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. In Advances in Neural Information Processing Systems, pages 851–863, 2019.
[5] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018.
[6] D. Drover, R. MV, C.-H. Chen, A. Agrawal, A. Tyagi, and C. Phuoc Huynh. Can 3d pose be learned from 2d projections alone? In The European Conference on Computer Vision (ECCV) Workshops, September 2018.
[7] N. Dvornik, J. Mairal, and C. Schmid. Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision (ECCV), pages 364–380, 2018.
[8] D. Dwibedi, I. Misra, and M. Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1301–1310, 2017.
[9] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 6202–6211, 2019.
[10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1933–1941, 2016.
[11] G. Georgakis, A. Mousavian, A. C. Berg, and J. Kosecka. Synthesizing training data for object detection in indoor scenes. arXiv preprint arXiv:1702.07836, 2017.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[13] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. In ICCV, volume 1, page 5, 2017.
[14] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047–6056, 2018.
[15] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision, pages 793–811. Springer, 2018.
[16] N. Hussein, E. Gavves, and A. W. Smeulders. Timeception for complex action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 254–263, 2019.
[17] H. Idrees, A. R. Zamir, Y.-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[20] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. In The DAVIS Challenge on Video Object Segmentation, 2017.
[21] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017.
[22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
[23] Y. Li, Y. Li, and N. Vasconcelos. Resound: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pages 513–528, 2018.
[24] C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9455–9464, 2018.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[26] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[27] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[28] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
[29] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
[30] Y. Song, R. Shu, N. Kushman, and S. Ermon. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pages 8312–8323, 2018.
[31] K. Soomro and A. R. Zamir. Action recognition in realistic sports videos. In Computer vision in sports, pages 181–208. Springer, 2014.
[32] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[33] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[34] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
[35] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[36] S. Tripathi, S. Chandra, A. Agrawal, A. Tyagi, J. M. Rehg, and V. Chari. Learning to generate synthetic data via compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 461–470, 2019.
[37] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
[38] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
[39] X. Wang, A. Shrivastava, and A. Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2606–2615, 2017.
[40] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.
[41] B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
[42] L. Zhou, C. Xu, and J. J. Corso. Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8254
dc.description.abstract [zh_TW]: A good action recognition model must understand the motion patterns of humans and other objects. However, we find that even many of today's best-performing models rely, to some degree, on static objects in the surrounding scene, rather than on the action itself, to decide which action is occurring. This reliance on particular surrounding objects prevents such models from maintaining their performance when applied to environments with a different object distribution, because many actions, such as "take", are not tied to any fixed object.
In this thesis we call the above problem Fallacious Object Reliance (FOR) and discuss in detail the impact that object representation bias has on many action recognition datasets. We propose several quantitative measures of the severity of the FOR issue, and we propose Adversarial Object Synthesis Training (AdvOST) to mitigate it.
AdvOST trains a neural-network synthesizer that composites images of various objects onto the videos in the training set; the synthesizer must confuse the action recognition model while producing composites plausible enough to pass a neural-network discriminator. This drives the action recognition model to ignore irrelevant static-object cues, thereby alleviating the FOR issue. Our experiments show that AdvOST helps state-of-the-art action recognition models such as I3D and SlowFast achieve better performance on the EPIC-KITCHENS and HMDB51 datasets.
dc.description.abstract [en]: The action recognition task requires agents to understand the motion performed by humans or objects. However, we find that recognition models tend to predict the action from the surrounding static objects instead of from the action itself. This dependency can hurt the robustness of such models when they are applied to new environments with different object distributions, because many action classes (e.g., "take") can be associated with many different objects. In this thesis, we refer to this problem as the Fallacious Object Reliance (FOR) issue and discuss the role that object representation bias plays in different datasets. Based on these observations, we propose several metrics to measure the severity of the FOR issue.
Moreover, we propose a new training procedure called Adversarial Object Synthesis Training (AdvOST) to mitigate the issue. AdvOST trains a synthesizer that pastes objects onto training videos to confuse the classification model, together with a discriminator that regularizes the synthesizer toward natural-looking composites. This forces the action model to ignore unrelated object cues and reduces the FOR issue. Finally, applying AdvOST yields decent accuracy improvements on the EPIC-KITCHENS validation sets with the state-of-the-art I3D and SlowFast models, and consistent accuracy improvements on all three splits of HMDB51 with I3D.
dc.description.provenance: Made available in DSpace on 2021-05-20T00:50:49Z (GMT). No. of bitstreams: 1
U0001-1108202020115200.pdf: 5762396 bytes, checksum: 468ac94f36ffc3204a79de85667a343d (MD5)
Previous issue date: 2020 [en]
dc.description.tableofcontents:
摘要 iii
Abstract iv
1 Introduction 1
2 Related work 4
2.1 Temporal Modeling of Action Recognition 4
2.2 Datasets Bias 4
2.3 Cut-and-paste Synthesis 5
2.4 Adversarial Learning 5
3 Analysis of the Fallacious Object Reliance (FOR) Issue 6
3.1 Object Reliance Level (ORL) 6
3.2 Fallacious Object Reliance 8
4 Proposed Method 11
4.1 AdvOST 11
4.2 Loss Design 13
4.3 Optimization 14
5 Experiment 15
5.1 Datasets 15
5.2 Experiment Details 16
5.3 Results 16
5.4 Ablation Study 18
6 Discussion 20
6.1 Per-class Improvement and Confusion Matrix 20
6.2 Grad-CAM Comparison 21
7 Conclusion 23
Bibliography 24
dc.language.iso: en
dc.title [zh_TW]: 以對抗式物件合成訓練輔助動作辨識模型以減輕其偏差
dc.title [en]: Towards Robust Action Recognition via Adversarial Object Synthesis Training
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 葉梅珍 (Mei-Chen Yeh), 陳永昇 (Yong-Sheng Chen)
dc.subject.keyword [zh_TW]: 動作辨識, 資料集偏差, 對抗式訓練
dc.subject.keyword [en]: Action Recognition, Dataset Bias, Adversarial Training
dc.relation.page: 29
dc.identifier.doi: 10.6342/NTU202003008
dc.rights.note: Authorization granted (open access worldwide)
dc.date.accepted: 2020-08-15
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
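
For readers who want a concrete picture of the AdvOST procedure summarized in the abstracts above, the following is a minimal Python/PyTorch sketch of a three-player loop of this kind: a synthesizer pastes an object crop onto each training clip so as to confuse the action classifier, a discriminator pushes the composites toward looking natural, and the classifier learns to predict the action despite the pasted objects. Everything here (module architectures, the simple alpha-blend compositing, tensor shapes, learning rates, toy data) is an illustrative assumption and is not taken from the thesis.

# Minimal, illustrative sketch (not the thesis code) of adversarial
# object-synthesis training. All modules and shapes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, H, W, K = 2, 8, 32, 32, 10   # batch, frames, height, width, action classes

class Synthesizer(nn.Module):
    # Predicts a per-clip blending opacity for the object crop; a crude
    # stand-in for the learned compositing described in the abstract.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * H * W, 1))

    def forward(self, clip, obj):                 # clip: (B,3,T,H,W), obj: (B,3,H,W)
        alpha = torch.sigmoid(self.net(clip[:, :, 0])).view(-1, 1, 1, 1, 1)
        return (1 - alpha) * clip + alpha * obj.unsqueeze(2)  # paste onto every frame

class Classifier(nn.Module):
    # Toy 3D-conv action classifier standing in for I3D / SlowFast.
    def __init__(self):
        super().__init__()
        self.conv, self.fc = nn.Conv3d(3, 8, 3, padding=1), nn.Linear(8, K)

    def forward(self, clip):
        h = F.adaptive_avg_pool3d(F.relu(self.conv(clip)), 1).flatten(1)
        return self.fc(h)

class Discriminator(nn.Module):
    # Scores whether a single frame looks like a natural (un-pasted) frame.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * H * W, 1))

    def forward(self, frame):
        return self.net(frame)

syn, cls_net, dis = Synthesizer(), Classifier(), Discriminator()
opt_syn = torch.optim.Adam(syn.parameters(), lr=1e-4)
opt_cls = torch.optim.Adam(cls_net.parameters(), lr=1e-4)
opt_dis = torch.optim.Adam(dis.parameters(), lr=1e-4)

clip = torch.rand(B, 3, T, H, W)        # dummy training video clips
obj = torch.rand(B, 3, H, W)            # dummy object crops to paste
label = torch.randint(0, K, (B,))       # dummy ground-truth action labels
real, fake = torch.ones(B, 1), torch.zeros(B, 1)

for step in range(2):                   # toy loop; real training iterates over a dataset
    # 1) Synthesizer: fool the classifier while looking natural to the discriminator.
    composite = syn(clip, obj)
    loss_syn = (-F.cross_entropy(cls_net(composite), label)
                + F.binary_cross_entropy_with_logits(dis(composite[:, :, 0]), real))
    opt_syn.zero_grad(); loss_syn.backward(); opt_syn.step()

    # 2) Discriminator: separate original frames from composited frames.
    composite = syn(clip, obj).detach()
    loss_dis = (F.binary_cross_entropy_with_logits(dis(clip[:, :, 0]), real)
                + F.binary_cross_entropy_with_logits(dis(composite[:, :, 0]), fake))
    opt_dis.zero_grad(); loss_dis.backward(); opt_dis.step()

    # 3) Classifier: predict the action correctly despite the pasted objects.
    loss_cls = F.cross_entropy(cls_net(syn(clip, obj).detach()), label)
    opt_cls.zero_grad(); loss_cls.backward(); opt_cls.step()

In the setting the abstract describes, the toy modules above would be replaced by real backbones such as I3D or SlowFast and trained on datasets such as EPIC-KITCHENS or HMDB51, with the discriminator keeping the synthesized composites plausible rather than degenerate.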
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File | Size | Format
U0001-1108202020115200.pdf | 5.63 MB | Adobe PDF

