Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89987

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張智星 | zh_TW |
| dc.contributor.advisor | Jyh-Shing Jang | en |
| dc.contributor.author | 王俊翔 | zh_TW |
| dc.contributor.author | Chun-Hsiang Wang | en |
| dc.date.accessioned | 2023-09-22T16:57:10Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-09-22 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-08-11 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89987 | - |
| dc.description.abstract | 音樂聲源分離旨在分離出一首歌曲當中的不同音軌,本篇研究著重於對伴奏軌的即時分離。過去的音樂聲源分離模型傾向於提升分離品質,但這使得模型的大小和延遲時間增加,難以在邊緣裝置上進行運算。此外,大多數方法在降低輸入秒數時會明顯降低分離品質。本論文改進Sony早期提出的輕量化模型MMDenseNet,希望在分離品質、延遲時間及空間資源三者間達成平衡。儘管MMDenseNet的參數量很低,但分離品質不理想,且在低延遲時情況下表現不佳。因此,本研究提出了三個改進方向,分別為訓練目標調整、模型架構調整、以及訓練及測試方法改進,試圖在維持空間資源表現的情形下改善模型分離品質與延遲時間。我們使用MUSDB18資料集進行訓練及測試,並使用SDR作為分離品質評估指標,計算延遲時間作為延遲評估指標,使用參數量作為空間資源評估指標。根據實驗結果,調整模型的訓練目標能夠在維持空間資源與延遲時間下提高分離品質(median SDR從11.162提升至13.951)。此外,提出的多種自注意力架構使MMDenseNet在稍微增加空間資源及延遲時間的情況下提升分離品質(median SDR從13.951提升至15.011)。最後,我們提出的漸進式訓練及測試方法使得模型在低延遲下能夠保持良好的分離品質(延遲時間1.19秒時,RTF為0.4031、median SDR 由 13.951提升至14.394)。 | zh_TW |
| dc.description.abstract | Music source separation aims to separate the individual tracks within a song; this study focuses on real-time separation of the accompaniment track. Previous music source separation models have prioritized separation quality, but at the cost of larger models and higher latency, making them difficult to run on edge devices. In addition, most methods degrade markedly in separation quality when the input duration is shortened. This thesis improves MMDenseNet, an early lightweight model proposed by Sony, aiming to strike a balance between separation quality, latency, and spatial resources. Although MMDenseNet has a low parameter count, its separation quality is not ideal and it performs poorly in low-latency settings. This research therefore proposes three directions of improvement: adjusting the training objective, adjusting the model architecture, and improving the training and testing methods, with the goal of improving separation quality and latency while maintaining the spatial-resource footprint. We use the MUSDB18 dataset for training and testing, with SDR as the separation-quality metric, measured latency as the delay metric, and parameter count as the spatial-resource metric. According to the experimental results, adjusting the training objective improves separation quality while keeping spatial resources and latency unchanged (median SDR improves from 11.162 to 13.951). Furthermore, the proposed self-attention architectures let MMDenseNet improve separation quality with only a slight increase in spatial resources and latency (median SDR improves from 13.951 to 15.011). Finally, the proposed progressive training and testing methods allow the model to maintain good separation quality at low latency (at a latency of 1.19 seconds, the RTF is 0.4031 and the median SDR improves from 13.951 to 14.394). (An illustrative SDR computation sketch follows the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T16:57:10Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-09-22T16:57:10Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements v; Chinese Abstract vii; Abstract ix; Contents xi; List of Figures xv; List of Tables xvii; Chapter 1 Introduction 1; 1.1 Research Topic Overview 1; 1.2 Overview of Related Work 1; 1.2.1 Waveform-Domain Methods 1; 1.2.2 Spectrogram-Domain Methods 2; 1.2.3 Hybrid Methods 2; 1.3 Overview and Contributions of the Proposed Method 2; 1.4 Chapter Outline 3; Chapter 2 Literature Review 5; 2.1 MMDenseNet 5; 2.1.1 DenseNet 6; 2.1.2 Multi-Scale DenseNet (MDenseNet) 6; 2.1.3 Multi-band MDenseNet (MMDenseNet) 7; 2.2 Training Objectives for Music Source Separation 8; 2.2.1 Magnitude Spectrogram Prediction 8; 2.2.2 Magnitude Mask Prediction 9; 2.2.3 cIRM-Based Prediction 9; 2.2.4 Separate Prediction of Magnitude and Phase 10; 2.3 Attention Mechanisms 11; 2.3.1 Self-Attention Applied to Music Source Separation 14; 2.3.2 HT Demucs Architecture Overview 15; 2.4 BSRNN Architecture Overview 17; 2.4.1 Band Split Module 17; 2.4.2 Band and Sequence Modeling Module 17; 2.4.3 Mask Estimation Module 17; Chapter 3 Methodology 19; 3.1 Problem Definition 19; 3.2 Analysis of the MMDenseNet Model 20; 3.3 Training Objective Adjustments 21; 3.3.1 Magnitude Mask 21; 3.3.2 Separate Prediction of Magnitude and Phase 22; 3.4 Model Architecture Adjustments 23; 3.4.1 Self-Attention over the Time Axis 23; 3.4.2 Axial Attention 25; 3.4.3 New cIRM MMDenseNet 27; 3.5 Reducing Separation Latency 29; 3.5.1 U-Net-Specific Progressive Testing 30; 3.5.2 U-Net-Specific Progressive Training 32; Chapter 4 Datasets and Experimental Setup 35; 4.1 Dataset Overview 35; 4.1.1 MUSDB18 35; 4.1.2 DSD100 36; 4.1.3 MedleyDB 37; 4.2 Evaluation Metrics 38; 4.2.1 SDR 38; 4.2.2 Latency 39; 4.3 Experimental Settings 40; 4.3.1 Environment 40; 4.3.2 Experiments 40; 4.3.3 Training and Testing Settings 41; Chapter 5 Results and Discussion 43; 5.1 Experiment 1: Effect of Training Objective Adjustments 43; 5.1.1 Separation Quality 43; 5.1.2 Latency 44; 5.1.3 Spatial Resources 44; 5.2 Experiment 2: Effect of Architecture Adjustments 45; 5.2.1 Separation Quality 45; 5.2.2 Latency 46; 5.2.3 Spatial Resources 46; 5.3 Experiment 3: Ablation Study of Time-Axis Self-Attention Architectures 47; 5.4 Experiment 4: Progressive Testing and Progressive Training 49; 5.4.1 Analysis of Progressive Testing Results 49; 5.4.2 Analysis of Progressive Training Combined with Progressive Testing 51; 5.5 Experiment 5: Performance Analysis on Edge Devices 54; 5.6 Overall Analysis 56; Chapter 6 Conclusion 57; 6.1 Conclusions 57; 6.2 Future Work 58; References 61; Appendix A Derivation of the Optimal Latency 65 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 漸進式訓練 | zh_TW |
| dc.subject | U-Net | zh_TW |
| dc.subject | 自注意力機制 | zh_TW |
| dc.subject | MMDenseNet | zh_TW |
| dc.subject | 伴奏分離 | zh_TW |
| dc.subject | MMDenseNet | en |
| dc.subject | U-Net | en |
| dc.subject | progressive training | en |
| dc.subject | self-attention mechanism | en |
| dc.subject | accompaniment separation | en |
| dc.title | 用於伴奏分離的輕量化深度學習模型 | zh_TW |
| dc.title | Lightweight Deep-Learning Models for Accompaniment Separation | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 曹昱;楊奕軒 | zh_TW |
| dc.contributor.oralexamcommittee | Yu Tsao;Yi-Hsuan Yang | en |
| dc.subject.keyword | 伴奏分離,MMDenseNet,自注意力機制,漸進式訓練,U-Net, | zh_TW |
| dc.subject.keyword | accompaniment separation,MMDenseNet,self-attention mechanism,progressive training,U-Net, | en |
| dc.relation.page | 67 | - |
| dc.identifier.doi | 10.6342/NTU202303861 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2023-08-12 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
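The abstract reports separation quality as the median SDR over the MUSDB18 test set. The sketch below is a minimal illustration only, assuming a simple (non-BSSEval) SDR definition and placeholder waveforms; it is not the thesis's evaluation code, which follows the standard museval/BSSEval protocol for MUSDB18, and all variable names here are hypothetical.

```python
# Illustrative only: simplified per-track SDR (dB) and the median over tracks.
# The thesis's reported numbers come from the BSSEval/museval protocol on MUSDB18,
# which differs in detail (framewise evaluation, distortion decomposition).
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-9) -> float:
    """Simple signal-to-distortion ratio in dB for one track."""
    distortion = reference - estimate
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(distortion**2) + eps))

# Hypothetical usage: lists of reference and estimated accompaniment waveforms.
references = [np.random.randn(44100 * 10) for _ in range(3)]            # placeholder audio
estimates = [r + 0.01 * np.random.randn(r.shape[0]) for r in references]
median_sdr = np.median([sdr(r, e) for r, e in zip(references, estimates)])
print(f"median SDR: {median_sdr:.3f} dB")
```

Higher median SDR indicates better accompaniment separation; the thesis reports this metric alongside latency (including RTF) and parameter count.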
Appears in Collections: 資訊網路與多媒體研究所
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf | 4.23 MB | Adobe PDF |