Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101576
Full metadata record
DC field: value (language)
dc.contributor.advisor: 許永真 (zh_TW)
dc.contributor.advisor: Jane Yung-Jen Hsu (en)
dc.contributor.author: 黃宇瑍 (zh_TW)
dc.contributor.author: Yu-Huan Huang (en)
dc.date.accessioned: 2026-02-11T16:29:15Z
dc.date.available: 2026-02-12
dc.date.copyright: 2026-02-11
dc.date.issued: 2026
dc.date.submitted: 2026-02-03
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101576
dc.description.abstract (zh_TW):
阿卡貝拉(A cappella)是一種全人聲的音樂類型,針對特定作品的翻唱和重混音工程常因缺乏分軌音訊而面臨挑戰。而阿卡貝拉的聲源分離因為各聲部間人聲的高度相似性和節奏重疊性,相較於包含多元樂器的一般音源分離更為困難。此外,阿卡貝拉樂團常因方便性和成本問題,會採用同空間同步錄音的方式,此方式會無可避免的產生麥克風漏音(Bleeding)問題,進一步限制了後製混音的動態調整空間。
本研究旨在解決上述問題。針對聲源分離問題我們提出了三種策略,其一是提出了一個名為 Multi-Path HTDemucs(MPHTDemucs)的深度學習架構,基於 Hybrid Transformer Demucs(HTDemucs)的架構基礎上增加多個平行的 U-Net 路徑,並將輸出層依聲部數量獨立切分;其二是基於各聲部的聲音特徵,使用一般音樂資料的預訓練模型,修改其輸出層權重映射方式的遷移式學習策略;其三是利用ACE Studio等人聲音樂合成引擎生成的合成資料,建構模型預訓練策略。針對漏音分離問題,我們提出了一個基於房間物理特性以及多聲部排列資料窮舉的物理模擬資料訓練增強框架,利用隨機生成的房間參數和人員位置來模擬實際現場聲音,以及不同聲部組合的漏音配置,把少量資料拓展成龐大且特性多元的模擬資料。
實驗結果顯示,聲源分離任務在 jaCappella 資料集上,MPHTDemucs 架構在隨機初始化條件下達到了18.64dB的平均SI-SDRi,相較於原始的 HTDemucs 架構的17.51dB大幅提升了1.13dB;而在遷移式學習初始化下,MPHTDemucs的20.61dB平均SI-SDRi則超過了HTDemucs的19.92dB,有0.69dB提升。證實了模型架構改良和遷移式學習的有效性。在合成資料預訓練上,ACE Studio的擬真合成人聲則相較於隨機初始化模型提升了平均0.1dB的SI-SDRi,證實了合成資料預訓練的有效性,也展示了合成資料在缺乏自動化合成方法下資料量仍然有限的侷限性。在漏音問題任務上,本研究建立的物理模擬資料增強框架能有效訓練模型抑制漏音干擾,客觀指標顯示漏音移除程度有顯著改善,並通過頻譜展示了音訊的保真度和漏音移除的效果。本研究不僅對阿卡貝拉聲源分離和漏音移除問題提供了多個方向的解決方案,論文核心的 MPHTDemucs 架構亦可望能應用於其他聲源分離處理任務上。
dc.description.abstract (en):
A cappella is a genre of music performed entirely by the human voice. Covers and remixes of specific a cappella works are often hampered by the absence of isolated stem tracks. Compared with general music source separation, which involves diverse instruments, separating a cappella sources is considerably harder because the vocal parts are highly similar in timbre and overlap rhythmically. Furthermore, for reasons of convenience and cost, a cappella groups often record simultaneously in a shared space. This inevitably causes microphone bleeding, which further limits the room for dynamic adjustment during post-production mixing.
This study addresses both issues. We propose three strategies for the source separation problem. First, we introduce a deep learning architecture named Multi-Path HTDemucs (MPHTDemucs), which builds on Hybrid Transformer Demucs (HTDemucs) by adding multiple parallel U-Net paths and splitting the output layer independently according to the number of voice parts. Second, we adopt a transfer learning strategy that starts from models pre-trained on general music data and modifies how their output-layer weights are mapped, based on the acoustic characteristics of each vocal part. Third, we build a pre-training strategy on synthetic data generated by singing voice synthesis engines such as ACE Studio. For the microphone bleeding problem, we propose a training data augmentation framework based on room acoustics simulation and exhaustive enumeration of multi-part arrangements. By using randomly generated room parameters and singer positions to simulate live sound, together with the bleeding configurations of different part combinations, the framework expands a small dataset into a large and diverse collection of simulated data.
Experimental results show that, for source separation on the jaCappella dataset, MPHTDemucs achieved an average SI-SDRi of 18.64 dB under random initialization, an improvement of 1.13 dB over the original HTDemucs architecture (17.51 dB). With transfer learning initialization, MPHTDemucs achieved an average SI-SDRi of 20.61 dB, surpassing HTDemucs (19.92 dB) by 0.69 dB. These results confirm the effectiveness of both the architectural changes and the transfer learning strategy. For synthetic data pre-training, the realistic synthetic vocals generated by ACE Studio yielded an average SI-SDRi improvement of 0.1 dB over the randomly initialized model. This supports the usefulness of synthetic data pre-training while also showing its limitation: without a fully automated synthesis method, the amount of synthetic data remains small. For the bleeding removal task, the proposed physical simulation data augmentation framework effectively trained the model to suppress bleeding interference. Objective metrics indicate a significant improvement in bleeding removal, and spectrograms illustrate both the preservation of audio fidelity and the effect of bleeding removal. Beyond offering solutions along several directions for a cappella source separation and bleeding removal, the core MPHTDemucs architecture may also be applicable to other audio source separation tasks.
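The SI-SDRi values quoted in the abstract are improvements in scale-invariant signal-to-distortion ratio. As a rough illustration only, the sketch below computes SI-SDR with the commonly used definition (project the estimate onto the reference, then compare the energy of that projection with the energy of the residual) and takes SI-SDRi as the gain over the unprocessed mixture. The function names `si_sdr` and `si_sdr_improvement`, the `eps` guard, and the toy sine signals are assumptions made for this example; this is not the evaluation code used in the thesis.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float],
           eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB (hypothetical helper, common definition)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    # Split the estimate into a target part and a residual part.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate: Sequence[float], mixture: Sequence[float],
                       reference: Sequence[float]) -> float:
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    n_samples = 1000
    # Toy signals: one "voice" plus an interfering "voice" at another pitch.
    voice = [math.sin(2 * math.pi * 5 * n / n_samples) for n in range(n_samples)]
    other = [math.sin(2 * math.pi * 7 * n / n_samples) for n in range(n_samples)]
    mixture = [v + o for v, o in zip(voice, other)]
    # Pretend a separator attenuated the interferer to 10% of its amplitude.
    estimate = [v + 0.1 * o for v, o in zip(voice, other)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, voice):.2f} dB")
```

In this toy case the interferer is orthogonal to the target and is attenuated by a factor of 10 in amplitude, so the improvement should come out close to 20 dB. An average SI-SDRi such as those in the abstract would be obtained by averaging this quantity over voice parts and test excerpts, with the exact averaging protocol being whatever the thesis defines.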
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-02-11T16:29:15Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2026-02-11T16:29:15Z (GMT). No. of bitstreams: 0 (en)
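dc.description.tableofcontents:
Oral Examination Committee Certification i
Acknowledgements ii
Chinese Abstract iii
English Abstract v
Table of Contents vii
List of Figures xi
List of Tables xii
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Current State and Challenges of A Cappella Source Separation 3
1.3 The Bleeding Problem 4
1.4 Research Motivation 5
Chapter 2 Related Work 7
2.1 Music Source Separation Models 7
2.2 Deep Learning Research on A Cappella 9
2.3 Bleeding and Crosstalk Removal Methods 10
2.4 Summary 12
Chapter 3 Problem Definition 13
3.1 Core Concepts and Terminology 13
3.2 Problem 1: Optimizing A Cappella Source Separation 15
3.3 Problem 2: Bleeding Removal in A Cappella Recordings 17
3.4 Evaluation Methods and Metrics 18
Chapter 4 Methodology 23
4.1 MPHTD Model Architecture Design 24
4.2 Datasets and Preprocessing 31
4.2.1 The jaCappella Dataset 31
4.2.2 Synthetic Data Generation 32
4.2.3 Data Preprocessing and Normalization 34
4.3 Model Training Strategy 36
4.3.1 Pre-trained Weight Transfer Method 36
4.4 Training Configuration and Loss Function 40
4.5 Data Construction and Training Method for Bleeding Removal in A Cappella Recordings 44
Chapter 5 Experimental Design 49
5.1 Experimental Objectives and Hypotheses 49
5.2 Experimental Data and Methods 50
5.2.1 Dataset Configuration 50
5.2.2 Evaluation Metrics 52
5.2.3 Training Configuration and Parameter Settings 53
5.3 Experimental Design and Procedure 54
5.3.1 Overall Experimental Procedure 54
5.3.2 Baseline Model Experiments 55
5.3.3 Model Architecture Improvement Experiments 56
5.3.4 Pre-training Strategy Experiments 56
5.4 Ablation Study Design 58
5.4.1 Effect of the Number of Paths 58
5.4.2 Effect of Data Augmentation 59
5.4.3 Comparison of Training Strategies 59
5.5 Experimental Design for Bleeding Removal in A Cappella Recordings 60
5.5.1 Training Data Construction and Settings 60
5.5.2 Model and Training Parameter Configuration 60
5.5.3 Training Strategy 61
5.5.4 Experimental Objectives 61
5.6 Chapter Summary 62
Chapter 6 Results and Analysis 63
6.1 Comparison with the jaCappella Team's Experimental Baselines 64
6.2 Effect of Data Augmentation on Model Performance 65
6.3 Comparison by Number of Songs 66
6.4 Effect of Batch Size on Model Training 66
6.5 Effect of Using Pre-trained Models on HTD 67
6.6 Performance and Analysis of MPHTD 67
6.6.1 Effect of Different Numbers of Paths 68
6.6.2 Comparison of the MPHTD and HTD Models 70
6.7 Effect of Training with Synthetic Data 71
6.8 Summary of Source Separation Model Performance 72
6.9 Model Parameter Count and Running Speed 73
6.10 Experimental Results for Bleeding Removal 75
6.10.1 Objective Metric Evaluation and Model Comparison 75
6.10.2 Spectrogram Visualization Analysis 78
6.10.3 Summary of Bleeding Removal Experiments 80
Chapter 7 Conclusion 81
7.1 Research Contributions 81
7.2 Limitations and Future Work 82
References 83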
dc.language.iso: zh_TW
dc.subject: 阿卡貝拉
dc.subject: 聲源分離
dc.subject: 遷移式學習
dc.subject: 合成資料
dc.subject: 多路徑架構
dc.subject: 漏音消除
dc.subject: A Cappella
dc.subject: Source Separation
dc.subject: Transfer Learning
dc.subject: Synthetic Data
dc.subject: Multi-Path Architecture
dc.subject: Bleeding Removal
dc.title: MPHTDemucs:多路徑架構之阿卡貝拉人聲分離與漏音消除 (zh_TW)
dc.title: MPHTDemucs: A Multi-Path Architecture for A Cappella Vocal Source Separation and Bleeding Removal (en)
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: Master's
dc.contributor.coadvisor: 傅立成 (zh_TW)
dc.contributor.coadvisor: Li-Chen Fu (en)
dc.contributor.oralexamcommittee: 蔡文傑; 楊智淵 (zh_TW)
dc.contributor.oralexamcommittee: Wenn-Chieh Tsai; Chih-Yuan Yang (en)
dc.subject.keyword: 阿卡貝拉, 聲源分離, 遷移式學習, 合成資料, 多路徑架構, 漏音消除 (zh_TW)
dc.subject.keyword: A Cappella, Source Separation, Transfer Learning, Synthetic Data, Multi-Path Architecture, Bleeding Removal (en)
dc.relation.page: 88
dc.identifier.doi: 10.6342/NTU202600508
dc.rights.note: Authorization granted (open access worldwide)
dc.date.accepted: 2026-02-05
dc.contributor.author-college: College of Electrical Engineering and Computer Science
dc.contributor.author-dept: Department of Computer Science and Information Engineering
dc.date.embargo-lift: 2026-02-12
Appears in collections: Department of Computer Science and Information Engineering

Files in this item:
File: ntu-114-1.pdf, Size: 10.05 MB, Format: Adobe PDF