NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Graduate Institute of Communication Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97777
Full metadata record (DC field: value (language))

dc.contributor.advisor: 楊奕軒 (zh_TW)
dc.contributor.advisor: Yi-Hsuan Yang (en)
dc.contributor.author: 葉軒瑜 (zh_TW)
dc.contributor.author: Hsuan-Yu Yeh (en)
dc.date.accessioned: 2025-07-16T16:13:58Z
dc.date.available: 2025-07-17
dc.date.copyright: 2025-07-16
dc.date.issued: 2025
dc.date.submitted: 2025-06-30
dc.identifier.citation:
[1] S. Durand, D. Stoller, and S. Ewert. Contrastive learning-based audio to lyrics alignment for multiple languages. In ICASSP 2023, pages 1–5. IEEE, June 2023.
[2] G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006.
[4] J. He, J. Liu, Z. Ye, R. Huang, C. Cui, H. Liu, and Z. Zhao. RMSSinger: Realistic-music-score based singing voice synthesis, 2023.
[5] J. Koguchi and S. Takamichi. PJS: Phoneme-balanced Japanese singing voice corpus, 2020.
[6] J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis, 2020.
[7] S. Kum, J. Lee, K. L. Kim, T. Kim, and J. Nam. Pseudo-label transfer from frame-level to note-level in a teacher-student framework for singing transcription from polyphonic music, 2022.
[8] M. Kurumi, ATSUYA, Enbun, K. (m), Chiteiko, CrazY, A. Kei, and Tansansui. Oniku Kurumi utagoe database. https://onikuru.info/db-download/, 2023. Accessed: 2024-10-11.
[9] R. Li, Y. Zhang, Y. Wang, Z. Hong, R. Huang, and Z. Zhao. Robust singing voice transcription serves synthesis, 2024.
[10] J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao. DiffSinger: Singing voice synthesis via shallow diffusion mechanism, 2022.
[11] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech 2017, pages 498–502, 2017.
[12] I. Ogawa and M. Morise. Tohoku Kiritan singing database: A singing database for statistical parametric singing synthesis using Japanese pop songs. Acoustical Science and Technology, 42(3):140–145, May 2021.
[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
[14] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu. FastSpeech: Fast, robust and controllable text to speech, 2019.
[15] Y. Ren, X. Tan, T. Qin, J. Luan, Z. Zhao, and T.-Y. Liu. DeepSinger: Singing voice synthesis with data mined from the web, 2020.
[16] S. Rouard, F. Massa, and A. Défossez. Hybrid transformers for music source separation, 2022.
[17] K. Schulze-Forster, C. S. Doire, G. Richard, and R. Badeau. Phoneme level lyrics alignment and text-informed singing voice separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2382–2395, 2021.
[18] L. Strgar and D. Harwath. Phoneme segmentation using self-supervised speech models. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 1067–1073. IEEE, 2023.
[19] S. Takamichi, N. Tanji, and H. Saruwatari. JSUT-song. https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song, 2018. Accessed: 2024-10-11.
[20] Tansansui, Chiteiko, Y. Puu, ATSUYA, Yuuhikou, CrazY, and A. Kei. Oftonp singing voice database distribution site. https://sites.google.com/view/oftn-utagoedb/%E3%83%9B%E3%83%BC%E3%83%A0, 2020. Accessed: 2025-04-29.
[21] J.-Y. Wang, C.-I. Leong, Y.-C. Lin, L. Su, and J.-S. R. Jang. Adapting pretrained speech model for Mandarin lyrics transcription and alignment, 2023.
[22] Y. Wang, X. Wang, P. Zhu, J. Wu, H. Li, H. Xue, Y. Zhang, L. Xie, and M. Bi. Opencpop: A high-quality open source Chinese popular song corpus for singing voice synthesis, 2022.
[23] H. Wei, X. Cao, T. Dan, and Y. Chen. RMVPE: A robust model for vocal pitch estimation in polyphonic music. In INTERSPEECH 2023, pages 5421–5425. ISCA, Aug. 2023.
[24] R. Yamamoto, R. Yoneyama, and T. Toda. NNSVS: A neural network-based singing voice synthesis toolkit, 2023.
[25] L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao. M4Singer: A multi-style, multi-singer and musical score provided Mandarin singing corpus. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[26] Y. Zhang, Z. Jiang, R. Li, C. Pan, J. He, R. Huang, C. Wang, and Z. Zhao. TCSinger: Zero-shot singing voice synthesis with style transfer and multi-level style control. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1960–1975. Association for Computational Linguistics, 2024.
[27] Y. Zhang, C. Pan, W. Guo, R. Li, Z. Zhu, J. Wang, W. Xu, J. Lu, Z. Hong, C. Wang, et al. GTSinger: A global multi-technique singing corpus with realistic music scores for all singing tasks. arXiv preprint arXiv:2409.13832, 2024.
[28] Y. Zhang, H. Xue, H. Li, L. Xie, T. Guo, R. Zhang, and C. Gong. VISinger 2: High-fidelity end-to-end singing voice synthesis enhanced by digital signal processing synthesizer, 2022.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97777
dc.description.abstract (zh_TW):
精確的音素級與換氣標註對於訓練高品質的歌聲合成(Singing Voice Synthesis, SVS)模型至關重要。然而,取得這類標註通常需要耗費大量人力進行手動標註,或依賴具有明顯侷限的自動化方法。現有方法往往難以精細對齊歌聲中的音素,缺乏整合的換氣偵測能力,或在缺乏歌詞的情況下無法執行無監督的音素切分。
為了解決這些問題,我們提出 Phonsa,一個用於歌聲的自動標註系統。Phonsa 建構於預訓練的 Whisper 編碼器-解碼器架構之上,結合具分段自注意力機制的音素級分類器以進行音素預測,並透過明確定義的換氣類別與專門的處理策略來實現自動換氣標註。此外,Phonsa 還透過一項新穎的解碼演算法支援無監督的音素切分。我們也設計了一個特殊的邊界符號,用以提升連續相同音素的區分能力,進而增強換氣偵測與切分的準確性。
我們在中文與日文歌唱資料上評估 Phonsa,結果顯示其在強制對齊任務中相較於傳統方法表現更佳,並具備創新的自動換氣偵測與無監督音素切分能力。更重要的是,使用 Phonsa 自動生成標註訓練的 SVS 模型,在主觀平均意見評分(MOS)中展現出明顯優於使用 MFA 標註所訓練模型的聽感品質。我們將開源 Phonsa 系統以及預訓練的中日文模型。
dc.description.abstract (en):
Precise phoneme-level and breath annotations are crucial for training high-quality Singing Voice Synthesis (SVS) models. However, obtaining such annotations typically requires labor-intensive manual labeling or relies on automatic methods with notable limitations. Existing approaches often struggle with fine-grained phoneme alignment in singing voice, lack integrated breath detection capabilities, or fail to perform unsupervised phoneme segmentation in the absence of lyrics.
To address these challenges, we propose Phonsa, an automatic annotation system for singing voice. Built upon a pretrained Whisper encoder-decoder architecture, Phonsa incorporates a chunked self-attention classifier for phoneme-level prediction, introduces an explicit breath class with a dedicated processing strategy for automatic breath annotation, and enables unsupervised phoneme segmentation through a novel decoding algorithm. In addition, a specially designed boundary token is employed to improve the separation of consecutive identical phonemes, thereby enhancing both breath detection and segmentation accuracy.
We evaluate Phonsa on Chinese and Japanese singing data, demonstrating its superior performance over traditional methods in forced alignment, as well as its novel capabilities in automatic breath detection and unsupervised phoneme segmentation. Importantly, SVS models trained with Phonsa-generated annotations achieve significantly higher perceptual quality in subjective Mean Opinion Score (MOS) evaluations compared to those trained with MFA-based annotations. The Phonsa system, along with pretrained Chinese and Japanese models, will be released as an open-source toolkit.
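The forced-alignment task the abstract describes can be illustrated with a minimal sketch: given framewise phoneme log-probabilities (as a phoneme classifier would emit) and the phoneme sequence derived from the lyrics, a Viterbi pass that only lets each phoneme persist or advance produces a frame-to-phoneme alignment. This is a generic textbook illustration under assumed toy inputs, not the thesis's actual decoding algorithm (which additionally handles breath classes and boundary tokens); the `forced_align` helper is hypothetical.

```python
import numpy as np

def forced_align(log_probs, phoneme_ids):
    """Viterbi forced alignment: assign each frame to one phoneme of the
    given sequence, allowing only stay-or-advance transitions.

    log_probs: (T, V) framewise log-probabilities over the phoneme vocabulary.
    phoneme_ids: length-S list of vocabulary indices in lyric order (T >= S).
    Returns a length-T list of indices into phoneme_ids.
    """
    T, _ = log_probs.shape
    S = len(phoneme_ids)
    emit = log_probs[:, phoneme_ids]        # (T, S): score of slot s at frame t
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)      # 0 = stayed in slot, 1 = advanced
    dp[0, 0] = emit[0, 0]                   # alignment must start at the first phoneme
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            adv = dp[t - 1, s - 1] if s > 0 else -np.inf
            if adv > stay:
                dp[t, s], back[t, s] = adv + emit[t, s], 1
            else:
                dp[t, s] = stay + emit[t, s]
    # Backtrack from the last slot at the last frame (alignment must end there).
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(path[-1] - back[t, path[-1]])
    return [int(s) for s in path[::-1]]
```

With a 5-frame toy posteriorgram whose mass sits on phonemes 0, 0, 1, 1, 2 in turn and `phoneme_ids = [0, 1, 2]`, the sketch maps frames 0 and 1 to the first phoneme, frames 2 and 3 to the second, and frame 4 to the third; phoneme boundaries (and hence durations) are read off where the path index changes.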
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-16T16:13:58Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2025-07-16T16:13:58Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements
摘要 (Chinese Abstract)
Abstract
Contents
List of Figures
List of Tables
Chapter 1  Introduction
Chapter 2  Related Works
  2.1 Singing Voice Phoneme Annotation
  2.2 Singing Voice Pitch Annotation
  2.3 Singing Voice Synthesis
Chapter 3  Method
  3.1 Problem Formulation
  3.2 Proposed Model
  3.3 Loss Functions
    3.3.1 Phoneme Alignment Objectives
    3.3.2 Speech Transcription Objective
    3.3.3 Total Loss
  3.4 Special Phoneme Tokens
  3.5 Phoneme Inference and Alignment Strategy
    3.5.1 Phoneme Sequence Construction
    3.5.2 Framewise Probability Computation
    3.5.3 Task 1: Forced Alignment with Breath Detection
    3.5.4 Task 2: Phoneme Segmentation from Framewise Predictions
Chapter 4  Phonsa Experiment
  4.1 Dataset
  4.2 Data Augmentation
  4.3 Training Setting
  4.4 Evaluation Metrics
    4.4.1 Forced Alignment Evaluation
    4.4.2 Breath Prediction Evaluation
    4.4.3 Phoneme Segmentation from Framewise Predictions Evaluation
  4.5 Main Results
  4.6 Ablation Study
    4.6.1 Overall Comparison
    4.6.2 Homophone Alignment Enhancement
Chapter 5  Evaluating Annotation Utility via Singing Voice Synthesis
  5.1 Motivation and Objectives
  5.2 Singing Voice Synthesis Model Architecture
  5.3 Experimental Setup
  5.4 Subjective Listening Tests
  5.5 Results and Discussion
Chapter 6  Conclusion
References
dc.language.iso: en
dc.subject: 人工智慧 (zh_TW)
dc.subject: 音訊處理 (zh_TW)
dc.subject: 歌聲合成 (zh_TW)
dc.subject: 自動標註 (zh_TW)
dc.subject: 音樂 (zh_TW)
dc.subject: Audio processing (en)
dc.subject: Music (en)
dc.subject: Automatic labeling (en)
dc.subject: Singing voice synthesis (en)
dc.subject: Artificial Intelligence (en)
dc.title: 無需人工標註的自動歌聲音素標註及其之於歌聲合成之應用 (zh_TW)
dc.title: Labeling-Free Automatic Singing Voice Phoneme Annotation and Its Application to Singing Voice Synthesis (en)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 蘇黎;曹昱 (zh_TW)
dc.contributor.oralexamcommittee: Li Su; Yu Tsao (en)
dc.subject.keyword: 音樂, 自動標註, 歌聲合成, 人工智慧, 音訊處理 (zh_TW)
dc.subject.keyword: Music, Automatic labeling, Singing voice synthesis, Artificial Intelligence, Audio processing (en)
dc.relation.page: 50
dc.identifier.doi: 10.6342/NTU202501378
dc.rights.note: Authorization granted (access restricted to campus)
dc.date.accepted: 2025-07-01
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
dc.date.embargo-lift: 2025-07-17
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
ntu-113-2.pdf, 640.97 kB, Adobe PDF. Access restricted to NTU campus IPs (use the VPN service from off campus).


Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.
