Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7267
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 王鈺強 | |
dc.contributor.author | Yan-Bo Lin | en |
dc.contributor.author | 林彥伯 | zh_TW |
dc.date.accessioned | 2021-05-19T17:40:46Z | - |
dc.date.available | 2021-08-12 | |
dc.date.available | 2021-05-19T17:40:46Z | - |
dc.date.copyright | 2019-08-12 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-07-31 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7267 | - |
dc.description.abstract | Audio-visual event localization requires identifying the event label across video frames by jointly observing cross-modal audio and visual information. To address this task, we propose a cross-modal deep learning framework that performs co-attention for video event localization. Our model exploits intra-frame and inter-frame temporal and visual information together with the concurrent audio information, and by jointly observing these three sources it performs co-attention over visual objects. With visual, temporal, and audio information observed across consecutive frames, our model gains the ability to extract spatial/temporal features for improved audio-visual event localization. Moreover, our model produces instance-level visual attention, which identifies the image regions/locations most likely to be emitting the sound and, in scenes containing multiple instances of the same object, picks out the one that is actually making the sound. In our experiments, we compare the proposed co-attention module against state-of-the-art methods on a public benchmark dataset to verify its effectiveness; our results surpass existing methods in accuracy, and the visualization results confirm that the proposed architecture achieves instance-level visual attention. | zh_TW |
dc.description.abstract | Audio-visual event localization requires one to identify the event label across video frames by jointly observing visual and audio information. To address this task, we propose a deep neural network named Audio-Visual Sequence-to-sequence Dual Network (AVSDN). By jointly taking the audio and visual features at each time segment as inputs, our proposed model learns global and local event information in a sequence-to-sequence manner. In addition, we propose a deep learning framework of cross-modality co-attention for audio-visual event localization, which can be applied to existing methods as well as to AVSDN. Our co-attention model is able to exploit intra- and inter-frame visual information, with audio features jointly observed to perform co-attention over these three modalities. With visual, temporal, and audio information observed across consecutive video frames, our model achieves promising capability in extracting informative spatial/temporal features for improved event localization. Moreover, our model is able to produce instance-level attention, which identifies the image regions associated with the sound or event of interest. Experiments on a benchmark dataset confirm the effectiveness of our proposed framework, with ablation studies performed to verify the design of our proposed network model. | en |
dc.description.provenance | Made available in DSpace on 2021-05-19T17:40:46Z (GMT). No. of bitstreams: 1 ntu-108-R06942048-1.pdf: 3417941 bytes, checksum: 30523f6386addbf3f915799a3167bdfa (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | 口試委員會審定書 iii
誌謝 v
Acknowledgements vii
摘要 ix
Abstract xi
1 Introduction 1
2 Related Work 5
3 Proposed Method 7
3.0.1 Notations and Problem Formulation 7
3.0.2 Audio-Visual Sequence-to-sequence Dual Network (AVSDN) 8
3.0.3 Learning Intra and Inter-Frame Visual Representation 12
3.0.4 Cross-Modality Co-Attention for Event Localization 15
4 Experiments 19
4.0.1 Dataset 19
4.0.2 Implementation Details 19
4.0.3 Experiment results 20
4.0.4 Ablation studies 24
Conclusion 29
Bibliography 31 | |
dc.language.iso | en | |
dc.title | 跨模態共注意視聽事件定位 | zh_TW |
dc.title | Cross-Modality Co-Attention for Audio-Visual Event Localization | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 林彥宇,邱維辰 | |
dc.subject.keyword | 視聽特徵,雙模態,跨模態,事件定位,深度學習,機器學習,電腦視覺, | zh_TW |
dc.subject.keyword | Audio-Video Features, Dual Modality, Cross Modality, Event localization, Deep learning, Machine learning, Computer vision | en
dc.relation.page | 37 | |
dc.identifier.doi | 10.6342/NTU201902224 | |
dc.rights.note | Authorization granted (open access worldwide) |
dc.date.accepted | 2019-07-31 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電信工程學研究所 | zh_TW |
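The abstract above describes a cross-modality co-attention mechanism that jointly attends audio and visual features across video segments to localize events and to highlight sounding regions at the instance level. As a reading aid, the following is a minimal PyTorch-style sketch of how such a co-attention block could be wired; the feature dimensions, layer names, default number of event classes, and fusion strategy are illustrative assumptions rather than the exact design of the thesis.

```python
# Illustrative sketch only: not the thesis implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalCoAttention(nn.Module):
    """Audio-guided spatial attention over visual regions plus visual-guided
    temporal attention over audio segments, fused for per-segment prediction."""

    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256, num_classes=29):
        super().__init__()
        # Dimensions are assumed (e.g., VGGish-like audio and CNN region features).
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio, visual):
        # audio:  (B, T, audio_dim)      one feature per temporal segment
        # visual: (B, T, R, visual_dim)  R spatial regions per segment
        a = self.audio_proj(audio)                         # (B, T, H)
        v = self.visual_proj(visual)                       # (B, T, R, H)
        scale = a.size(-1) ** 0.5

        # Audio-guided attention over spatial regions (instance-level attention map).
        region_scores = torch.einsum('bth,btrh->btr', a, v)
        alpha = F.softmax(region_scores / scale, dim=-1)   # (B, T, R)
        v_att = torch.einsum('btr,btrh->bth', alpha, v)    # attended visual feature

        # Visual-guided attention over temporal audio segments.
        temporal_scores = torch.einsum('bth,bsh->bts', v_att, a)
        beta = F.softmax(temporal_scores / scale, dim=-1)  # (B, T, T)
        a_att = torch.einsum('bts,bsh->bth', beta, a)      # attended audio feature

        fused = torch.cat([v_att, a_att], dim=-1)          # (B, T, 2H)
        return self.classifier(fused), alpha               # event logits + spatial attention


# Example call with made-up shapes: 2 videos, 10 segments, 49 regions per segment.
logits, attention = CrossModalCoAttention()(torch.randn(2, 10, 128),
                                            torch.randn(2, 10, 49, 512))
```

In this sketch, the returned `attention` tensor plays the role of the instance-level visual attention discussed in the abstract: for each temporal segment it weights the spatial regions by how strongly they correspond to the concurrent audio, so the most likely sounding region can be read off as the arg-max over regions.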
Appears in Collections: | 電信工程學研究所
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf | 3.34 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.