Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89938
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 張智星 | zh_TW |
dc.contributor.advisor | Roger Jang | en |
dc.contributor.author | 李學翰 | zh_TW |
dc.contributor.author | Hsueh-Han Lee | en |
dc.date.accessioned | 2023-09-22T16:45:20Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-09-22 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-11 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89938 | - |
dc.description.abstract | 「音樂聲部分離」為音樂資訊檢索領域中重要研究方向,其目標為將一由多部聲源混合而成之音樂訊號,還原回各自混合前的訊號。而音樂聲部分離的子任務「歌曲人聲分離」,則致力於將音樂訊號還原為「人聲」和「伴奏」兩個音軌,即使已有許多研究提出架構達到良好的分離效果,卻都伴隨相當龐大的運算資源與時間,並不適用於即時分離系統的應用,因此如何即時進行伴奏音軌的分離,即為本文研究方向。本文使用音樂聲部分離領域中一輕量模型架構 MMDenseNet,先以遮罩預測、多重擴張卷積、增加模型複雜度等方式提升分離效果,再以縮短模型輸入長度和上下文聚合等方式降低延遲時間,以達到擁有良好分離效果且低延遲之模型。 | zh_TW |
dc.description.abstract | Music source separation (MSS) is an important task in music information retrieval (MIR) that aims to recover the individual source tracks from a mixed musical signal. Its subtask, singing voice separation (SVS), recovers only two tracks: the vocals and the accompaniment. Although several studies have proposed architectures with outstanding separation quality, their heavy computational cost and processing time make them unsuitable for real-time use on edge devices. The main goal of this work is therefore to extract the accompaniment track in real time with limited resources. A lightweight MSS model, MMDenseNet, is used as the basis. Separation quality is improved through mask estimation, multi-dilated convolution, and increased model complexity, while latency is reduced by shortening the model input duration and applying context aggregation, yielding a model that separates well at low latency and can run in real time. (A minimal illustrative sketch of the mask-and-resynthesize idea appears after this metadata record.) | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T16:45:20Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-09-22T16:45:20Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | 摘要 v
Abstract vii
Contents ix
1 Introduction 1
1.1 Singing voice separation (SVS) 1
1.2 Research topic 3
1.3 Contributions 3
1.4 Structure 4
2 Related Work 5
2.1 Traditional Approaches 6
2.1.1 Robust Principal Component Analysis (RPCA) 6
2.1.2 Non-negative Matrix Factorization (NMF) 6
2.1.3 REpeating Pattern Extraction Technique (REPET) 6
2.2 Deep Learning-based Approaches 7
2.2.1 U-Net 7
2.2.2 Demucs 8
2.2.3 Multi-Scale Multi-Band DenseNet (MMDenseNet) 9
2.2.4 Real-Time MDenseNet 11
2.2.5 D3Net 11
2.2.6 Training Target 12
3 Dataset 13
3.1 MUSDB18-HQ 13
3.2 Popular Music 15
4 Method 17
4.1 Problem Definition 17
4.2 Experimental Configuration 18
4.2.1 Hardware Environment 18
4.2.2 Hyperparameters 18
4.2.3 Augmentation 19
4.3 Evaluation Metrics 20
4.3.1 Accuracy 20
4.3.2 Efficiency 21
4.4 Training Target Replacement 22
4.4.1 Mask Estimation 22
4.4.2 Spectral Magnitude Mask (SMM) 23
4.4.3 Phase-Sensitive Mask (PSM) 24
4.5 Accuracy Improvement 25
4.5.1 Multi-Dilated Convolution 25
4.5.2 Double Number of Channels 28
4.6 Efficiency Improvement 31
4.6.1 Shorter Input Duration 31
4.6.2 Context Aggregation 32
5 Result 33
5.1 Training Target 34
5.1.1 Spectral Magnitude Mask (SMM) 34
5.1.2 Phase-Sensitive Mask (PSM) 35
5.2 Accuracy Improvement 36
5.2.1 Multi-Dilated Convolution 36
5.2.2 Double Number of Channels 37
5.3 Efficiency Improvement 38
5.3.1 Shorter Input Duration 39
5.3.2 Context Aggregation 41
6 Conclusion and Future Work 43
6.1 Conclusion 43
6.2 Future Work 44
Bibliography 47
A Models with Different Input Duration 51
A.1 SMM w/ Multi-Dilated Convolution 52
A.2 LogSMM w/ Multi-Dilated Convolution 52
A.3 SMM w/ Multi-Dilated Convolution, 2x Channel 53
A.4 LogSMM w/ Multi-Dilated Convolution, 2x Channel 53
A.5 PSM w/ Multi-Dilated Convolution, 2x Channel 54 | - |
dc.language.iso | en | - |
dc.title | 使用多重擴張卷積 MMDenseNet 於即時歌曲伴奏分離 | zh_TW |
dc.title | Real-Time Accompaniment Extraction with Multi-Dilated Convolution MMDenseNet | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | 碩士 | - |
dc.contributor.oralexamcommittee | 蘇黎;曹昱 | zh_TW |
dc.contributor.oralexamcommittee | Li Su;Yu Tsao | en |
dc.subject.keyword | 音樂聲部分離, 歌曲人聲分離, MMDenseNet, 多重擴張卷積, 頻譜遮罩預測, 即時分離 | zh_TW |
dc.subject.keyword | Music Source Separation, Singing Voice Separation, MMDenseNet, Multi-Dilated Convolution, Spectral Mask Estimation, Real-Time Separation | en |
dc.relation.page | 54 | - |
dc.identifier.doi | 10.6342/NTU202301173 | - |
dc.rights.note | 同意授權(全球公開) | - |
dc.date.accepted | 2023-08-13 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
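The abstract describes accompaniment extraction by estimating a spectral mask over the mixture spectrogram (with multi-dilated convolutions in the estimator) and resynthesizing the masked spectrogram with the mixture phase. Below is a minimal PyTorch sketch of that general mask-and-resynthesize idea, not the thesis's MMDenseNet implementation: the `DilatedMaskEstimator` module, its layer sizes, and the STFT parameters are illustrative assumptions only.

```python
# A minimal sketch (not the thesis implementation) of spectral-mask-based
# accompaniment extraction: a small stack of dilated 2-D convolutions
# predicts a magnitude mask over the mixture spectrogram, the mask is
# applied to the complex STFT, and the audio is resynthesized with the
# mixture phase. All layer sizes and STFT settings here are illustrative.
import torch
import torch.nn as nn

class DilatedMaskEstimator(nn.Module):
    """Toy mask estimator: parallel dilated convolutions (dilations 1/2/4)
    whose outputs are summed, loosely mimicking multi-dilated convolution."""
    def __init__(self, channels: int = 24):
        super().__init__()
        self.inp = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.dilated = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])
        self.out = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, 1, freq, time) magnitude spectrogram
        h = torch.relu(self.inp(mag))
        h = torch.relu(sum(conv(h) for conv in self.dilated))
        return torch.sigmoid(self.out(h))  # mask values in [0, 1]

def extract_accompaniment(mixture: torch.Tensor,
                          model: nn.Module,
                          n_fft: int = 2048,
                          hop: int = 512) -> torch.Tensor:
    """Apply a predicted spectral magnitude mask to a mono mixture signal."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop_length=hop,
                      window=window, return_complex=True)   # (freq, time)
    mag = spec.abs().unsqueeze(0).unsqueeze(0)               # (1, 1, F, T)
    mask = model(mag).squeeze(0).squeeze(0)                  # (F, T)
    masked = mask * spec                                      # keep mixture phase
    return torch.istft(masked, n_fft, hop_length=hop,
                       window=window, length=mixture.shape[-1])

# Usage: 5 seconds of a 44.1 kHz mixture fed through an untrained toy model.
mix = torch.randn(44_100 * 5)
acc = extract_accompaniment(mix, DilatedMaskEstimator())
```

Keeping the mixture phase and predicting only a real-valued mask is what makes targets such as the spectral magnitude mask (SMM) attractive for lightweight, low-latency models; a phase-sensitive mask (PSM) would additionally weight the target by the cosine of the phase difference between the source and the mixture.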
Appears in Collections: | 資訊網路與多媒體研究所
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf | 5.85 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their respective license terms.