MPHTDemucs: A Multi-Path Architecture for A Cappella Vocal Source Separation and Bleeding Removal

黃宇瑍; Yu-Huan Huang

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101576

Title:	MPHTDemucs: A Multi-Path Architecture for A Cappella Vocal Source Separation and Bleeding Removal
Author:	黃宇瑍 Yu-Huan Huang
Advisor:	許永真 Jane Yung-Jen Hsu
Co-advisor:	傅立成 Li-Chen Fu
Keywords:	A Cappella, Source Separation, Transfer Learning, Synthetic Data, Multi-Path Architecture, Bleeding Removal
Year of publication:	2026
Degree:	Master's
Abstract:	A cappella is a genre of music performed entirely by the human voice. Cover and remix projects for specific a cappella works are often hampered by the lack of isolated stem tracks. Compared with general music source separation, which involves diverse instruments, separating a cappella sources is considerably more difficult because of the high timbral similarity and rhythmic overlap among vocal parts. Furthermore, for reasons of convenience and cost, a cappella groups often record simultaneously in a shared space. This practice inevitably causes microphone bleeding, which further limits the room for dynamic adjustment during post-production mixing. This study addresses both problems. We propose three strategies for the source separation problem.
First, we introduce a deep learning architecture named Multi-Path HTDemucs (MPHTDemucs), which builds upon Hybrid Transformer Demucs (HTDemucs) by adding multiple parallel U-Net paths and partitioning the output layer independently according to the number of voice parts. Second, we employ a transfer learning strategy that starts from models pre-trained on general music data and modifies the weight mapping of the output layer based on the acoustic characteristics of each vocal part. Third, we establish a model pre-training strategy that leverages synthetic data generated by singing voice synthesis engines such as ACE Studio. To address the microphone bleeding issue, we propose a training data augmentation framework based on room acoustics simulation and exhaustive enumeration of multi-part arrangements. By using randomly generated room parameters and performer positions to simulate real-world recording environments, together with bleeding configurations for different combinations of voice parts, this framework expands a small dataset into a large and diverse collection of simulated data. Experimental results show that, for the source separation task on the jaCappella dataset, MPHTDemucs achieved an average SI-SDRi of 18.64 dB under random initialization, an improvement of 1.13 dB over the original HTDemucs architecture (17.51 dB). With transfer learning initialization, MPHTDemucs achieved an average SI-SDRi of 20.61 dB, surpassing HTDemucs' 19.92 dB by 0.69 dB. These results confirm the effectiveness of both the architectural improvements and the transfer learning strategy. Regarding pre-training with synthetic data, the realistic synthetic vocals generated by ACE Studio yielded an average SI-SDRi improvement of 0.1 dB over the randomly initialized model. This validates the efficacy of synthetic data pre-training while also highlighting that data volume remains limited in the absence of fully automated synthesis methods.
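The SI-SDRi figures quoted above can be read against the usual definition of the metric: the scale-invariant SDR of the separated estimate minus that of the unprocessed mixture. The following is a minimal NumPy sketch under that standard definition; it is not the thesis's evaluation code, which this record does not show. The function names and the toy signals (`reference`, `interference`, `estimate`) are hypothetical.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the input mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy signals (assumptions for illustration only, not thesis data).
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = reference + interference           # what a separator receives
estimate = reference + 0.1 * interference    # hypothetical separator output
gain_db = si_sdr_improvement(estimate, reference, mixture)

# Gaps reported in the abstract, recomputed from its own figures.
random_init_gap = round(18.64 - 17.51, 2)    # MPHTDemucs vs. HTDemucs, random init
transfer_gap = round(20.61 - 19.92, 2)       # MPHTDemucs vs. HTDemucs, transfer learning
```

In this toy case the interference is attenuated by a factor of 10 in amplitude, so the improvement comes out close to 20 dB. Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is what "scale-invariant" refers to.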
For the bleeding removal task, the proposed physical simulation data augmentation framework effectively trained the model to suppress bleeding interference. Objective metrics indicate a significant improvement in bleeding removal, and spectrogram analysis further demonstrates both the preservation of audio fidelity and the effectiveness of bleeding removal. This study not only provides solutions from multiple directions for a cappella source separation and bleeding removal, but its core MPHTDemucs architecture also holds potential for application to other audio source separation tasks.
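The augmentation idea described in the abstract (random room parameters and performer positions, combined with exhaustive part arrangements) can be illustrated with a deliberately simplified sketch. It assumes a direct-path-only model with inverse-distance attenuation and an integer sample delay, with no reflections or reverberation, so it is much cruder than a real room acoustics simulator and is not the thesis's framework. All names (`simulate_bleeding`, `augment`, `mic_offset`) and the parameter ranges are hypothetical.

```python
import itertools
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def simulate_bleeding(stems, positions, sample_rate, mic_offset=0.1):
    """Mix dry stems into per-singer microphone signals with direct-path bleed.

    stems: (n_parts, n_samples) dry signals; positions: (n_parts, 2) in meters.
    Each singer is assumed to have a microphone `mic_offset` meters away.
    """
    n_parts, n_samples = stems.shape
    mics = positions + np.array([mic_offset, 0.0])
    out = np.zeros_like(stems, dtype=float)
    for m in range(n_parts):
        for s in range(n_parts):
            dist = max(np.linalg.norm(mics[m] - positions[s]), mic_offset)
            delay = int(round(dist / SPEED_OF_SOUND * sample_rate))
            gain = mic_offset / dist  # 1.0 for the singer's own microphone
            if delay < n_samples:
                out[m, delay:] += gain * stems[s, :n_samples - delay]
    return out

def augment(stems, sample_rate, n_rooms, rng):
    """Yield (mic_signals, clean_targets) over random rooms and all part orderings."""
    n_parts = stems.shape[0]
    for _ in range(n_rooms):
        room = rng.uniform(3.0, 8.0, size=2)  # random width and depth in meters
        positions = rng.uniform(0.5, room - 0.5, size=(n_parts, 2))
        # Exhaustively assign every ordering of the parts to the positions.
        for order in itertools.permutations(range(n_parts)):
            arranged = stems[list(order)]
            yield simulate_bleeding(arranged, positions, sample_rate), arranged

rng = np.random.default_rng(0)
stems = rng.standard_normal((3, 8000))  # stand-in for three dry vocal stems
examples = list(augment(stems, sample_rate=16000, n_rooms=2, rng=rng))
```

With three parts and two random rooms, this yields 2 × 3! = 12 training pairs from a single set of stems, which shows how a small dataset grows under enumeration. A fuller version would replace the direct-path model with simulated room impulse responses.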
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101576
DOI:	10.6342/NTU202600508
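Full-text authorization:	Authorized (open access worldwide)
Electronic full-text release date:	2026-02-12
Appears in collections:	Department of Computer Science and Information Engineering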

Files in this item:

File	Size	Format
ntu-114-1.pdf	10.05 MB	Adobe PDF


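Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.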
