Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/66127
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李琳山(Lin-shan Lee) | |
dc.contributor.author | Tsung-Han Hsieh | en |
dc.contributor.author | 謝宗翰 | zh_TW |
dc.date.accessioned | 2021-06-17T00:22:45Z | - |
dc.date.available | 2020-02-18 | |
dc.date.copyright | 2020-02-18 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2020-02-11 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/66127 | - |
dc.description.abstract | 在音樂信號處理的領域中,旋律提取一直是很重要的任務。在本論文中,我們提出了一個專為此設計的流線型編碼/解碼器網路模型。我們有兩項技術貢獻。首先,啟發於一個最先進的語意像素分割模型,我們通過在向下池化層和向上池化層之間傳遞池化索引來定位旋律頻率。我們用更少的卷積層與更簡單的卷積模塊就可以達到接近最先進水平的結果。第二,我們提出了一種使用神經網路中瓶頸層來預測每一幀中旋律是否存在的方法,使得我們不需要取閾值,可以用簡單的argmax函數來獲得最終結果。我們的實驗在人聲旋律提取及主旋律提取上都驗證了模型的有效性。 | zh_TW |
dc.description.abstract | Melody extraction from polyphonic musical audio is important for music signal processing. In this paper, we propose a novel streamlined encoder/decoder network designed for the task. We make two technical contributions. First, drawing inspiration from a state-of-the-art model for semantic pixel-wise segmentation, we pass the pooling indices from the pooling layers to the corresponding un-pooling layers to localize the melody in frequency. This lets us achieve results close to the state of the art with far fewer convolutional layers and simpler convolution modules. Second, we propose a way to use the bottleneck layer of the network to estimate whether a melody line exists in each time frame, which makes it possible to use a simple argmax function instead of ad-hoc thresholding to obtain the final melody line. Our experiments on both vocal melody extraction and general melody extraction validate the effectiveness of the proposed model. | en |
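The abstract's two contributions can be sketched in a few lines: max-pooling that records where each maximum came from, un-pooling that scatters values back to exactly those recorded positions (which is what localizes the melody in frequency), and an extra "non-melody" bin so a plain argmax replaces ad-hoc thresholding. The sketch below is a minimal, framework-free illustration; the array sizes, helper names (`max_pool_freq`, `max_unpool_freq`), and the constant non-melody score are invented for the example and are not the thesis's actual network.

```python
import numpy as np

def max_pool_freq(x):
    """Pool pairs of frequency bins; also keep the argmax index of each pair."""
    f, t = x.shape
    pairs = x.reshape(f // 2, 2, t)
    idx = pairs.argmax(axis=1)            # 0 or 1: which bin of the pair won
    pooled = pairs.max(axis=1)
    return pooled, idx

def max_unpool_freq(pooled, idx):
    """Scatter pooled values back to the bins recorded by the pooling indices."""
    f2, t = pooled.shape
    out = np.zeros((f2 * 2, t), dtype=pooled.dtype)
    rows = 2 * np.arange(f2)[:, None] + idx          # original bin of each maximum
    out[rows, np.arange(t)[None, :]] = pooled
    return out

# Toy salience map: (freq_bins, time_frames), values in [0, 1).
salience = np.random.rand(8, 5)
pooled, idx = max_pool_freq(salience)
restored = max_unpool_freq(pooled, idx)   # maxima return to their exact bins

# Frame-wise decision: append a "non-melody" score as one extra bin so a
# plain argmax picks either a pitch bin (0..7) or "no melody" (8) per frame.
nonmelody_score = np.full((1, 5), 0.5)    # stand-in for the bottleneck head's output
decision = np.vstack([restored, nonmelody_score]).argmax(axis=0)
```

In a real PyTorch model, the same index pass-through is what `nn.MaxPool2d(..., return_indices=True)` paired with `nn.MaxUnpool2d` provides.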
dc.description.provenance | Made available in DSpace on 2021-06-17T00:22:45Z (GMT). No. of bitstreams: 1 ntu-108-R06946013-1.pdf: 2116393 bytes, checksum: 922b7edaa2984f588014460102073238 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Contents
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Works
2.1 Deep Salience Model
2.2 SF-NMF-CRNN
2.3 Lu and Su's Model
Chapter 3 Proposed Model
3.1 Model Input
3.2 Encoder and Decoder
3.3 Max-pooling with Rectangle Kernel
3.4 Non-melody Detector and ArgMax Layer
3.5 Weighted BCELoss with Melody/non-melody Ratio
3.6 Model Update
Chapter 4 Experiments
4.1 Baseline Methods and Evaluation Metrics
4.2 Vocal Melody Extraction
4.2.1 Datasets
4.2.2 Experiment Result
4.3 General Melody Extraction
4.3.1 Datasets
4.3.2 Experiment Result
4.4 Experiment in Different Input Representations
4.5 Experimental Detail
4.6 Case Study
Chapter 5 Conclusions and Future Work
Bibliography | |
dc.language.iso | en | |
dc.title | 專為旋律提取設計的流線型編碼器/解碼器架構 | zh_TW |
dc.title | A streamlined encoder/decoder architecture for melody extraction | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-1 | |
dc.description.degree | 碩士 | |
dc.contributor.coadvisor | 楊奕軒(Yi-Hsuan Yang) | |
dc.contributor.oralexamcommittee | 劉奕汶(Yi-Wen Liu),蔡偉和(Wei-Ho Tsai),陳冠宇(Kuan-Yu Chen),尤信程(Shing-Chern You) | |
dc.subject.keyword | 旋律提取,編碼/解碼器 | zh_TW |
dc.subject.keyword | melody extraction, encoder/decoder | en |
dc.relation.page | 44 | |
dc.identifier.doi | 10.6342/NTU202000419 | |
dc.rights.note | 有償授權 | |
dc.date.accepted | 2020-02-12 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資料科學學位學程 | zh_TW |
Appears in Collections: | 資料科學學位學程 |
Files in This Item:
File | Size | Format |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 2.07 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.