NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79437
Full metadata record (DC field: value, language):
dc.contributor.advisor: 許永真 (Yung-Jen Hsu)
dc.contributor.author: Chu-Ying Chan (en)
dc.contributor.author: 詹居穎 (zh_TW)
dc.date.accessioned: 2022-11-23T09:00:26Z
dc.date.available: 2021-11-05
dc.date.available: 2022-11-23T09:00:26Z
dc.date.copyright: 2021-11-05
dc.date.issued: 2021
dc.date.submitted: 2021-10-19
dc.identifier.citation:
R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In 15th International Society for Music Information Retrieval Conference (ISMIR), Oct. 2014.
W. Choi, M. Kim, J. Chung, and S. Jung. LaSAFT: Latent source attentive frequency transformation for conditioned source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 171–175, 2021.
W. Choi, M. Kim, J. Chung, D. Lee, and S. Jung. Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation. In 21st International Society for Music Information Retrieval Conference (ISMIR), Oct. 2020.
A. Défossez, N. Usunier, L. Bottou, and F. Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Los Alamitos, CA, USA, June 2016. IEEE Computer Society.
R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam. Spleeter: A fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020.
Y. Ikemiya, K. Yoshii, and K. Itoyama. Singing voice analysis and editing based on mutually dependent F0 estimation and source separation. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 574–578, 2015.
M. A. Islam, S. Jia, and N. D. B. Bruce. How much position information do convolutional neural networks encode? In International Conference on Learning Representations (ICLR), 2020.
A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar, and T. Weyde. Singing voice separation with deep U-Net convolutional networks. In ISMIR, 2017.
D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015.
J. Liu and Y. Yang. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. CoRR, abs/1807.01898, 2018.
A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave. The 2016 signal separation evaluation campaign. In P. Tichavský, M. Babaie-Zadeh, O. J. Michel, and N. Thirion-Moreau, editors, Latent Variable Analysis and Signal Separation: 13th International Conference, LVA/ICA 2017, Proceedings, pages 323–332, Cham, 2017. Springer International Publishing.
Y. Luo and N. Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
G. Meseguer-Brocal and G. Peeters. Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations. In 20th International Society for Music Information Retrieval Conference (ISMIR), Nov. 2019.
E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. The MUSDB18 corpus for music separation, Dec. 2017.
O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
S. T. Roweis. One microphone source separation. In NIPS, 2000.
D. Samuel, A. Ganeshan, and J. Naradowsky. Meta-learning extractors for music source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 816–820, 2020.
M. Senior. Mixing Secrets for the Small Studio. Routledge, 2011.
D. Stoller, S. Ewert, and S. Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation, 2018.
F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji. Open-Unmix: A reference implementation for music source separation. Journal of Open Source Software, 4(41):1667, 2019.
N. Takahashi and Y. Mitsufuji. D3Net: Densely connected multidilated DenseNet for music source separation, 2021.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
E. Vincent, R. Gribonval, and C. Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
Z. Wang and J. Liu. Translating mathematical formula images to LaTeX sequences using deep neural networks with sequence-level training. International Journal on Document Analysis and Recognition (IJDAR), 24:63–75, 2021.
Y. Wu, H. Mao, and Z. Yi. Audio classification using attention-augmented convolutional neural network. Knowledge-Based Systems, 161:90–100, 2018.
M. Zibulevsky and B. A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4):863–882, Apr. 2001.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79437
dc.description.abstract: Music source separation aims to take a mixture signal composed of multiple sources and separate it back into the individual sources from which it was mixed. Since the advent of deep learning, nearly every model in this field has been deep-learning based. Some models apply a time-frequency transformation and perform the separation on the spectrogram, treating it as an image. Unlike ordinary images, however, absolute position is highly meaningful in a spectrogram: each point's absolute position corresponds to a specific time and frequency. For convolutional models, if the stack of convolutions is not deep enough, the model cannot capture complete positional information. This thesis therefore proposes Pos-LaSAFT, a variant of LaSAFT that uses a 2-D positional encoding to help the model capture absolute positional information and thereby improve performance. All results are evaluated with the SDR metric on the MUSDB18 dataset. Under identical experimental settings, Pos-LaSAFT improves the average SDR over the original LaSAFT by 0.28 dB, with the largest gain, 0.56 dB, on the bass stem. (zh_TW)
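The abstract's core idea is that stacked convolutions alone may not recover absolute time-frequency position, so a 2-D positional encoding is injected into the spectrogram features. As a rough illustration only, and not the thesis's actual code, the following PyTorch sketch shows one common 2-D sinusoidal scheme in which half of the channels encode the frequency index and the other half encode the time index; the exact formulation, injection point, and combination rule (addition vs. concatenation) used by Pos-LaSAFT may differ.

```python
import math
import torch

def positional_encoding_1d(length: int, dim: int) -> torch.Tensor:
    # Standard sinusoidal encoding (Vaswani et al., 2017): sin on even
    # channels, cos on odd channels, geometric frequency progression.
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)      # (L, 1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )                                                                      # (dim/2,)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                              # (L, dim)

def positional_encoding_2d(channels: int, freq_bins: int, frames: int) -> torch.Tensor:
    # Half of the channels encode the frequency axis, half the time axis,
    # so every (frequency, time) cell gets a unique absolute-position signature.
    assert channels % 4 == 0, "channels must be divisible by 4"
    half = channels // 2
    pe = torch.zeros(channels, freq_bins, frames)
    pe_f = positional_encoding_1d(freq_bins, half).t()   # (half, F)
    pe_t = positional_encoding_1d(frames, half).t()      # (half, T)
    pe[:half] = pe_f.unsqueeze(2).expand(half, freq_bins, frames)
    pe[half:] = pe_t.unsqueeze(1).expand(half, freq_bins, frames)
    return pe                                            # (C, F, T)

# Usage: add the encoding to a batch of spectrogram feature maps (B, C, F, T).
features = torch.randn(4, 64, 512, 128)
features = features + positional_encoding_2d(64, 512, 128)  # broadcasts over batch
```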
dc.description.provenance: Made available in DSpace on 2022-11-23T09:00:26Z (GMT). No. of bitstreams: 1. U0001-1510202117015400.pdf: 5573916 bytes, checksum: 2025e69f99f32af209c0140b6c968543 (MD5). Previous issue date: 2021. (en)
dc.description.tableofcontents:
Acknowledgments
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Thesis Objectives
  1.3 Terminology
  1.4 Thesis Organization
Chapter 2 Related Work
  2.1 Masking
  2.2 Generation
Chapter 3 Methodology
  3.1 Base Architecture
    3.1.1 Complex as Channels
    3.1.2 The Conditioned U-Net in LaSAFT
  3.2 Embedding Positional Information
    3.2.1 Positional Encoding
    3.2.2 2-D Positional Encoding
    3.2.3 Pos-LaSAFT
Chapter 4 Experiments
  4.1 Dataset
  4.2 Metrics
  4.3 Experimental Setup
  4.4 Results
Chapter 5 Discussion
  5.1 On the Dataset
  5.2 Variability of Sounds
Chapter 6 Conclusion and Future Work
  6.1 Summary and Contributions
  6.2 Limits of Positional Encoding
  6.3 What's Next?
Bibliography
dc.language.iso: zh-TW
dc.title: 使用二維位置編碼改善以時頻譜為基礎的聲源分離模型 (zh_TW)
dc.title: Pos-LaSAFT: Improving Spectrogram-based Source Separation Model by Leveraging 2-D Positional Encoding (en)
dc.date.schoolyear: 109-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 楊奕軒 (Hsin-Tsai Liu), 陳宏銘 (Chih-Yang Tseng), 楊智淵, 蔡偉和
dc.subject.keyword: 音樂聲部分離, 位置編碼 (zh_TW)
dc.subject.keyword: music source separation, positional encoding (en)
dc.relation.page: 37
dc.identifier.doi: 10.6342/NTU202103763
dc.rights.note: Authorization granted (open access worldwide)
dc.date.accepted: 2021-10-20
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) (zh_TW)
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
U0001-1510202117015400.pdf (5.44 MB, Adobe PDF)
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated by their stated copyright terms.
