Labeling-Free Automatic Singing Voice Phoneme Annotation and Its Application to Singing Voice Synthesis

Hsuan-Yu Yeh (葉軒瑜)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97777

Title:	Labeling-Free Automatic Singing Voice Phoneme Annotation and Its Application to Singing Voice Synthesis (無需人工標註的自動歌聲音素標註及其之於歌聲合成之應用)
Author:	Hsuan-Yu Yeh (葉軒瑜)
Advisor:	Yi-Hsuan Yang (楊奕軒)
Keywords:	Music, Automatic labeling, Singing voice synthesis, Artificial Intelligence, Audio processing
Year of publication:	2025
Degree:	Master's
Abstract:	Precise phoneme-level and breath annotations are crucial for training high-quality Singing Voice Synthesis (SVS) models. However, obtaining such annotations typically requires labor-intensive manual labeling or relies on automatic methods with notable limitations. Existing approaches often struggle with fine-grained phoneme alignment in singing voice, lack integrated breath detection, or cannot perform unsupervised phoneme segmentation when lyrics are unavailable. To address these challenges, we propose Phonsa, an automatic annotation system for singing voice. Built upon a pretrained Whisper encoder-decoder architecture, Phonsa incorporates a chunked self-attention classifier for phoneme-level prediction, introduces an explicit breath class with a dedicated processing strategy for automatic breath annotation, and enables unsupervised phoneme segmentation through a novel decoding algorithm. In addition, a specially designed boundary token improves the separation of consecutive identical phonemes, thereby enhancing both breath detection and segmentation accuracy. We evaluate Phonsa on Chinese and Japanese singing data, demonstrating superior performance over traditional methods in forced alignment, as well as its novel capabilities in automatic breath detection and unsupervised phoneme segmentation. Importantly, SVS models trained with Phonsa-generated annotations achieve significantly higher perceptual quality in subjective Mean Opinion Score (MOS) evaluations than those trained with MFA-based annotations. The Phonsa system, along with pretrained Chinese and Japanese models, will be released as an open-source toolkit.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97777
DOI:	10.6342/NTU202501378
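Full-text authorization:	Authorized (access restricted to campus)
Full-text release date:	2025-07-17
Appears in collections:	Graduate Institute of Communication Engineering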

Files in this item:

File	Size	Format
ntu-113-2.pdf (access restricted to NTU campus IP addresses; off campus, please use the VPN remote-access service)	640.97 kB	Adobe PDF

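Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.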
