NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/1199
Full metadata record
DC Field | Value | Language
dc.contributor.advisor: 鄭士康
dc.contributor.author: Jen-Yu Liu (en)
dc.contributor.author: 劉任瑜 (zh_TW)
dc.date.accessioned: 2021-05-12T09:34:07Z
dc.date.available: 2018-08-01
dc.date.available: 2021-05-12T09:34:07Z
dc.date.copyright: 2018-07-26
dc.date.issued: 2018
dc.date.submitted: 2018-07-07
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/handle/123456789/1199
dc.description.abstract: 隨著視訊與音訊串流服務的流行,音樂音訊與視訊是現今最受歡迎的娛樂來源之一。音樂與音樂演奏包含相當豐富的資訊。為了能自動分析這些音訊及視訊以進一步進行檢索或教學,我們會想要使用機器學習來幫助偵測各式音訊及視覺事件。然而,機器學習的方法通常需要相當數量的訓練資料。在音訊及視訊中,標示這些訓練資料並不容易,因為手動標示的過程非常花時間而且乏味。在本論文中,我們探討如何以弱監督的方式,僅使用長片斷層級的標示來訓練偵測模型。我們使用全捲積網路來達到音樂音訊與視訊之事件偵測。首先,使用全捲積網路在時間上偵測音樂音訊事件,如曲風、樂器、情緒等,並且使用樂器演奏資料庫來評估模型的表現。接著,我們將發展一個弱監督的架構來實現視訊中的樂器演奏動作偵測。此學習架構包含兩個輔助模型:聲音偵測模型與物體偵測模型。這兩個輔助模型也只使用長片斷層級的標記來訓練。它們將為動作偵測模型提供監督資訊。我們使用5400個經過手動標記的影像畫面來評估此訓練架構的表現。提出之訓練架構在時間與空間上相當大程度地改進了模型表現。 (zh_TW)
dc.description.abstract: With the growth of audio and video streaming services, music audio and video are among the most popular sources of entertainment today. Music and music performances contain rich information. To automatically analyze such audio and video for retrieval or pedagogical purposes, we may want to use machine learning to detect audio and visual events. However, learning-based methods usually require a large amount of training data, and annotating audio and video is time-consuming and tedious. In this work, we show how to train detection models with only clip-level annotations through weakly-supervised learning, using fully-convolutional networks (FCNs) for event detection in music audio and video. First, we develop FCNs for temporally detecting music audio events such as genres, instruments, and moods, and evaluate them on an instrument dataset. Second, we develop a weakly-supervised framework for detecting instrument-playing actions in videos. The framework involves two auxiliary models, a sound model and an object model, both trained with clip-level annotations only; they provide temporal and spatial supervision for the action model. In total, 5,400 manually annotated frames are used to evaluate the proposed framework, which largely improves performance both temporally and spatially. (en)
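The core idea of the weak supervision described in the abstract is that frame-level predictions from a fully-convolutional network are aggregated into a single clip-level prediction, so that a loss can be computed against the clip-level labels alone. A minimal numpy sketch of that aggregation step follows; the random frame scores, temporal max pooling, and tag count are illustrative assumptions, not the thesis's actual network or pooling choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy frame-level scores with shape (n_frames, n_tags). In the thesis these
# would come from an FCN over the audio; random values here are a stand-in.
frame_scores = rng.normal(size=(100, 4))
frame_probs = sigmoid(frame_scores)

# Weak supervision: only one label per tag is available for the whole clip.
clip_labels = np.array([1.0, 0.0, 1.0, 0.0])

# Aggregate frame-level probabilities to clip level with temporal max
# pooling, so a clip-level loss can drive learning of frame-level outputs.
clip_probs = frame_probs.max(axis=0)

# Binary cross-entropy between clip-level predictions and clip-level labels.
eps = 1e-7
bce = -(clip_labels * np.log(clip_probs + eps)
        + (1.0 - clip_labels) * np.log(1.0 - clip_probs + eps)).mean()

print(clip_probs.shape, bce)
```

Other pooling functions (mean, or learned Gaussian filters as in the thesis's Section 3.3.7-3.3.8) can replace the max here; the trade-offs are examined in the table of contents entries below.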
dc.description.provenance: Made available in DSpace on 2021-05-12T09:34:07Z (GMT). No. of bitstreams: 1; ntu-107-D02921018-1.pdf: 18109349 bytes, checksum: 3518f73a5a7e9c440181317d0bc6526e (MD5). Previous issue date: 2018 (en)
dc.description.tableofcontents:
口試委員會審定書 (Thesis Committee Certification) i
致謝 (Acknowledgements) ii
中文摘要 (Chinese Abstract) iii
Abstract iv
Contents v
List of Figures viii
List of Tables xii
1 Introduction 1
1.1 Contributions 3
1.2 Event detection in music audios 3
1.3 Instrument-playing action detection in music videos 5
1.4 Overview 10
2 Background 11
2.1 Literature survey 11
2.1.1 Detection and classification in audios 11
2.1.2 Detection and classification in videos and images 12
2.1.3 Weakly-supervised learning 14
2.2 Event detection as a multi-label classification problem 15
2.3 Weakly-supervised learning 18
2.4 Fully-convolutional networks 19
2.5 Audio and visual features used 20
2.5.1 Audio features 20
2.5.2 Visual features 21
3 Weakly-supervised music event detection 22
3.1 Proposed method 22
3.1.1 Clip-level Prediction 24
3.1.2 Frame-level model 26
3.2 Data: MagnaTagATune and MedleyDB 27
3.2.1 Data processing 28
3.3 Experiments 29
3.3.1 Metrics for Objective Evaluation 30
3.3.2 Best Performance 30
3.3.3 Effect of Thresholds 31
3.3.4 Effect of Accumulation 32
3.3.5 Effect of Multi-scale Input 32
3.3.6 Resolution and Performance Trade-off 33
3.3.7 Effect of Final Pooling Functions 34
3.3.8 Learned Parameters of Gaussian Filters 35
3.3.9 Comparing with Frame-to-Frame training 37
3.4 Visualization of the frame-level predictions 38
3.4.1 MedleyDB 38
3.4.2 MagnaTagATune 38
3.5 Application to audio event detection 39
3.5.1 Datasets 40
3.5.2 Experiments 41
3.6 Summary 43
4 Weakly-supervised Visual Instrument-playing Action Detection in Videos 49
4.1 Proposed method 50
4.1.1 Instrument-playing actions 50
4.1.2 Increasing supervisions for training the action model 50
4.1.3 Fusion of different modality streams after model training 57
4.2 Experimental setup 58
4.2.1 Models 58
4.2.2 Features 60
4.2.3 Datasets 61
4.2.4 Training 67
4.3 Experiments 68
4.3.1 Performance of the sound model for instrument sound detection 69
4.3.2 Performance of the object model for instrument object detection 70
4.3.3 Performance of the action model for instrument-playing action detection 71
4.3.4 Fusion of different streams after training 82
4.3.5 Learned movements in the action model 83
4.3.6 Analyses and observations 85
4.4 Summary 88
5 Conclusions and discussions 89
A A GUI for displaying the result of music audio event detection 90
B Publications 95
Bibliography 97
dc.language.iso: en
dc.subject: 音樂自動標籤 (zh_TW)
dc.subject: 音樂事件偵測 (zh_TW)
dc.subject: 樂器演奏動作偵測 (zh_TW)
dc.subject: 弱監督學習 (zh_TW)
dc.subject: instrument-playing action detection (en)
dc.subject: music auto-tagging (en)
dc.subject: weakly-supervised learning (en)
dc.subject: music event detection (en)
dc.title: 應用全捲積網路所達成之弱監督音樂音訊及視訊事件偵測 (zh_TW)
dc.title: Weakly-supervised Event Detection for Music Audios and Videos Using Fully-convolutional Networks (en)
dc.type: Thesis
dc.date.schoolyear: 106-2
dc.description.degree: 博士 (Doctoral)
dc.contributor.coadvisor: 楊奕軒
dc.contributor.oralexamcommittee: 張智星, 李宏毅, 蘇黎, 深山覺 (Satoru Fukayama)
dc.subject.keyword: 音樂事件偵測, 樂器演奏動作偵測, 弱監督學習, 音樂自動標籤 (zh_TW)
dc.subject.keyword: music event detection, instrument-playing action detection, weakly-supervised learning, music auto-tagging (en)
dc.relation.page: 110
dc.identifier.doi: 10.6342/NTU201801365
dc.rights.note: Authorized (open access worldwide)
dc.date.accepted: 2018-07-09
dc.contributor.author-college: 電機資訊學院 (zh_TW)
dc.contributor.author-dept: 電機工程學研究所 (zh_TW)
Appears in Collections: 電機工程學系

Files in this item:
ntu-107-1.pdf (17.68 MB, Adobe PDF)

