MPHTDemucs: A Multi-Path Architecture for A Cappella Vocal Source Separation and Bleeding Removal

黃宇瑍; Yu-Huan Huang

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101576

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	許永真	zh_TW
dc.contributor.advisor	Jane Yung-Jen Hsu	en
dc.contributor.author	黃宇瑍	zh_TW
dc.contributor.author	Yu-Huan Huang	en
dc.date.accessioned	2026-02-11T16:29:15Z	-
dc.date.available	2026-02-12	-
dc.date.copyright	2026-02-11	-
dc.date.issued	2026	-
dc.date.submitted	2026-02-03	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101576	-
dc.description.abstract	阿卡貝拉（A cappella）是一種全人聲的音樂類型，針對特定作品的翻唱和重混音工程常因缺乏分軌音訊而面臨挑戰。而阿卡貝拉的聲源分離因為各聲部間人聲的高度相似性和節奏重疊性，相較於包含多元樂器的一般音源分離更為困難。此外，阿卡貝拉樂團常因方便性和成本問題，會採用同空間同步錄音的方式，此方式會無可避免的產生麥克風漏音（Bleeding）問題，進一步限制了後製混音的動態調整空間。本研究旨在解決上述問題。針對聲源分離問題我們提出了三種策略，其一是提出了一個名為 Multi-Path HTDemucs（MPHTDemucs）的深度學習架構，基於 Hybrid Transformer Demucs（HTDemucs）的架構基礎上增加多個平行的 U-Net 路徑，並將輸出層依聲部數量獨立切分；其二是基於各聲部的聲音特徵，使用一般音樂資料的預訓練模型，修改其輸出層權重映射方式的遷移式學習策略；其三是利用 ACE Studio 等人聲音樂合成引擎生成的合成資料，建構模型預訓練策略。針對漏音分離問題，我們提出了一個基於房間物理特性以及多聲部排列資料窮舉的物理模擬資料訓練增強框架，利用隨機生成的房間參數和人員位置來模擬實際現場聲音，以及不同聲部組合的漏音配置，把少量資料拓展成龐大且特性多元的模擬資料。實驗結果顯示，聲源分離任務在 jaCappella 資料集上，MPHTDemucs 架構在隨機初始化條件下達到了18.64dB的平均SI-SDRi，相較於原始的 HTDemucs 架構的17.51dB大幅提升了1.13dB；而在遷移式學習初始化下，MPHTDemucs的20.61dB平均SI-SDRi則超過了HTDemucs的19.92dB，有0.69dB提升。證實了模型架構改良和遷移式學習的有效性。在合成資料預訓練上，ACE Studio的擬真合成人聲則相較於隨機初始化模型提升了平均0.1dB的SI-SDRi，證實了合成資料預訓練的有效性，也展示了合成資料在缺乏自動化合成方法下資料量仍然有限的侷限性。在漏音問題任務上，本研究建立的物理模擬資料增強框架能有效訓練模型抑制漏音干擾，客觀指標顯示漏音移除程度有顯著改善，並通過頻譜展示了音訊的保真度和漏音移除的效果。本研究不僅對阿卡貝拉聲源分離和漏音移除問題提供了多個方向的解決方案，論文核心的 MPHTDemucs 架構亦可望能應用於其他聲源分離處理任務上。	zh_TW
dc.description.abstract	A cappella is a genre of music performed entirely by the human voice. Covers and remixes of specific a cappella works are often hampered by the absence of isolated stem tracks. Compared with general music source separation, which involves diverse instruments, separating a cappella sources is considerably harder because the vocal parts are highly similar in timbre and overlap rhythmically. Furthermore, for reasons of convenience and cost, a cappella groups often record simultaneously in a shared space. This inevitably causes microphone bleeding, which further limits the room for dynamic adjustment during post-production mixing. This study addresses both problems. For source separation, we propose three strategies. First, we introduce a deep learning architecture named Multi-Path HTDemucs (MPHTDemucs), which extends Hybrid Transformer Demucs (HTDemucs) with multiple parallel U-Net paths and partitions the output layer independently according to the number of voice parts. Second, we adopt a transfer learning strategy that starts from models pre-trained on general music data and modifies how their output-layer weights are mapped, based on the acoustic characteristics of each vocal part. Third, we build a pre-training strategy on synthetic data generated by singing voice synthesis engines such as ACE Studio. For microphone bleeding, we propose a training data augmentation framework based on room acoustics simulation and exhaustive enumeration of multi-part arrangements. Using randomly generated room parameters and performer positions to simulate live sound, together with bleeding configurations for different combinations of voice parts, the framework expands a small dataset into a large and diverse collection of simulated data. Experimental results show that, for source separation on the jaCappella dataset, MPHTDemucs achieved an average SI-SDRi of 18.64 dB under random initialization, an improvement of 1.13 dB over the original HTDemucs architecture (17.51 dB). With transfer learning initialization, MPHTDemucs achieved an average SI-SDRi of 20.61 dB, surpassing HTDemucs (19.92 dB) by 0.69 dB. These results confirm the effectiveness of both the architectural improvements and the transfer learning strategy. For synthetic data pre-training, the realistic synthetic vocals generated by ACE Studio improved the average SI-SDRi by 0.1 dB over the randomly initialized model. This validates synthetic data pre-training while also showing that, without an automated synthesis method, the amount of synthetic data remains limited. For bleeding removal, the proposed physical simulation data augmentation framework effectively trained the model to suppress bleeding interference. Objective metrics indicate a significant improvement in bleeding removal, and spectrograms illustrate both the preserved audio fidelity and the removal of bleeding. Beyond offering several complementary solutions for a cappella source separation and bleeding removal, the core MPHTDemucs architecture may also be applicable to other source separation tasks.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-02-11T16:29:15Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2026-02-11T16:29:15Z (GMT). No. of bitstreams: 0	en
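dc.description.tableofcontents	Oral Examination Committee Certification i; Acknowledgements ii; Chinese Abstract iii; English Abstract v; Table of Contents vii; List of Figures xi; List of Tables xii; Chapter 1 Introduction 1; 1.1 Research Background 1; 1.2 Current State and Challenges of A Cappella Source Separation 3; 1.3 The Bleeding Problem 4; 1.4 Research Motivation 5; Chapter 2 Related Work 7; 2.1 Music Source Separation Models 7; 2.2 Deep Learning Research on A Cappella 9; 2.3 Bleeding and Cross-Talk Removal Methods 10; 2.4 Summary 12; Chapter 3 Problem Definition 13; 3.1 Core Concepts and Terminology 13; 3.2 Problem 1: Optimizing A Cappella Source Separation 15; 3.3 Problem 2: Bleeding Removal in A Cappella Recordings 17; 3.4 Evaluation Methods and Metrics 18; Chapter 4 Methodology 23; 4.1 MPHTD Model Architecture Design 24; 4.2 Datasets and Preprocessing 31; 4.2.1 The jaCappella Dataset 31; 4.2.2 Synthetic Data Generation 32; 4.2.3 Data Preprocessing and Normalization 34; 4.3 Model Training Strategy 36; 4.3.1 Pre-trained Weight Transfer Method 36; 4.4 Training Configuration and Loss Function 40; 4.5 Data Construction and Training Method for Bleeding Removal in A Cappella Recordings 44; Chapter 5 Experimental Design 49; 5.1 Experimental Objectives and Hypotheses 49; 5.2 Experimental Data and Methods 50; 5.2.1 Dataset Configuration 50; 5.2.2 Evaluation Metrics 52; 5.2.3 Training Configuration and Parameter Settings 53; 5.3 Experimental Design and Procedure 54; 5.3.1 Overall Experimental Procedure 54; 5.3.2 Baseline Model Experiments 55; 5.3.3 Model Architecture Improvement Experiments 56; 5.3.4 Pre-training Strategy Experiments 56; 5.4 Ablation Study Design 58; 5.4.1 Effect of the Number of Paths 58; 5.4.2 Effect of Data Augmentation 59; 5.4.3 Comparison of Training Strategies 59; 5.5 Experimental Design for Bleeding Removal in A Cappella Recordings 60; 5.5.1 Training Data Construction and Settings 60; 5.5.2 Model and Training Parameter Configuration 60; 5.5.3 Training Strategy 61; 5.5.4 Experimental Objectives 61; 5.6 Chapter Summary 62; Chapter 6 Results and Analysis 63; 6.1 Comparison with the jaCappella Team's Baselines 64; 6.2 Effect of Data Augmentation on Model Performance 65; 6.3 Comparison by Number of Songs 66; 6.4 Effect of Batch Size on Model Training 66; 6.5 Effect of Using Pre-trained Models on HTD 67; 6.6 Performance and Analysis of MPHTD 67; 6.6.1 Effect of Different Numbers of Paths 68; 6.6.2 Comparison of the MPHTD and HTD Models 70; 6.7 Effect of Training with Synthetic Data 71; 6.8 Summary of Source Separation Model Performance 72; 6.9 Model Parameter Count and Running Speed 73; 6.10 Bleeding Removal Experimental Results 75; 6.10.1 Objective Metric Evaluation and Model Comparison 75; 6.10.2 Spectrogram Visualization Analysis 78; 6.10.3 Summary of Bleeding Removal Experiments 80; Chapter 7 Conclusion 81; 7.1 Research Contributions 81; 7.2 Limitations and Future Work 82; References 83	-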
dc.language.iso	zh_TW	-
dc.subject	阿卡貝拉	-
dc.subject	聲源分離	-
dc.subject	遷移式學習	-
dc.subject	合成資料	-
dc.subject	多路徑架構	-
dc.subject	漏音消除	-
dc.subject	A Cappella	-
dc.subject	Source Separation	-
dc.subject	Transfer Learning	-
dc.subject	Synthetic Data	-
dc.subject	Multi-Path Architecture	-
dc.subject	Bleeding Removal	-
dc.title	MPHTDemucs：多路徑架構之阿卡貝拉人聲分離與漏音消除	zh_TW
dc.title	MPHTDemucs: A Multi-Path Architecture for A Cappella Vocal Source Separation and Bleeding Removal	en
dc.type	Thesis	-
dc.date.schoolyear	114-1	-
dc.description.degree	Master's	-
dc.contributor.coadvisor	傅立成	zh_TW
dc.contributor.coadvisor	Li-Chen Fu	en
dc.contributor.oralexamcommittee	蔡文傑;楊智淵	zh_TW
dc.contributor.oralexamcommittee	Wenn-Chieh Tsai;Chih-Yuan Yang	en
dc.subject.keyword	阿卡貝拉,聲源分離,遷移式學習,合成資料,多路徑架構,漏音消除	zh_TW
dc.subject.keyword	A Cappella,Source Separation,Transfer Learning,Synthetic Data,Multi-Path Architecture,Bleeding Removal	en
dc.relation.page	88	-
dc.identifier.doi	10.6342/NTU202600508	-
dc.rights.note	Authorization granted (open access worldwide)	-
dc.date.accepted	2026-02-05	-
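dc.contributor.author-college	College of Electrical Engineering and Computer Science	-
dc.contributor.author-dept	Department of Computer Science and Information Engineering	-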
dc.date.embargo-lift	2026-02-12	-
Appears in Collections:	Department of Computer Science and Information Engineering
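
The abstract in this record reports its separation results as average SI-SDRi in dB. The record does not include the thesis's evaluation code, so the following is only a minimal sketch that assumes the commonly used definition of scale-invariant SDR (project the estimate onto the reference, then compare target energy with residual energy) and takes SI-SDRi to be the SI-SDR of the separated output minus the SI-SDR of the unprocessed mixture. The function names `si_sdr` and `si_sdr_improvement`, the `eps` constant, and the sine-wave test signals are all hypothetical and chosen purely for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Assumed definition: the estimate is projected onto the reference, and the
    energy of that projection is compared with the energy of the residual.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    # eps guards against division by zero for a perfect estimate.
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (illustrative only): one target voice plus one interfering voice.
reference = [math.sin(0.10 * n) for n in range(200)]
interferer = [math.sin(0.37 * n + 1.0) for n in range(200)]
mixture = [s + v for s, v in zip(reference, interferer)]
# A hypothetical separator that attenuates the interferer to 10% amplitude.
estimate = [s + 0.1 * v for s, v in zip(reference, interferer)]

print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, reference):.2f} dB")
```

Because the estimate is rescaled before the residual is measured, multiplying a model's output by a constant gain leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.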

Files in This Item:

File	Size	Format
ntu-114-1.pdf	10.05 MB	Adobe PDF


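Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.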