NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Graduate Institute of Communication Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97777
Full metadata record (DC field: value (language))

dc.contributor.advisor: 楊奕軒 (zh_TW)
dc.contributor.advisor: Yi-Hsuan Yang (en)
dc.contributor.author: 葉軒瑜 (zh_TW)
dc.contributor.author: Hsuan-Yu Yeh (en)
dc.date.accessioned: 2025-07-16T16:13:58Z
dc.date.available: 2025-07-17
dc.date.copyright: 2025-07-16
dc.date.issued: 2025
dc.date.submitted: 2025-06-30
dc.identifier.citation:
[1] S. Durand, D. Stoller, and S. Ewert. Contrastive learning-based audio to lyrics alignment for multiple languages. In ICASSP 2023, pages 1–5. IEEE, June 2023.
[2] G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006.
[4] J. He, J. Liu, Z. Ye, R. Huang, C. Cui, H. Liu, and Z. Zhao. RMSSinger: Realistic-music-score based singing voice synthesis, 2023.
[5] J. Koguchi and S. Takamichi. PJS: Phoneme-balanced Japanese singing voice corpus, 2020.
[6] J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis, 2020.
[7] S. Kum, J. Lee, K. L. Kim, T. Kim, and J. Nam. Pseudo-label transfer from frame-level to note-level in a teacher-student framework for singing transcription from polyphonic music, 2022.
[8] M. Kurumi, ATSUYA, Enbun, K. (m), Chiteiko, CrazY, A. Kei, and Tansansui. Oniku Kurumi utagoe database. https://onikuru.info/db-download/, 2023. Accessed: 2024-10-11.
[9] R. Li, Y. Zhang, Y. Wang, Z. Hong, R. Huang, and Z. Zhao. Robust singing voice transcription serves synthesis, 2024.
[10] J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao. DiffSinger: Singing voice synthesis via shallow diffusion mechanism, 2022.
[11] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech 2017, pages 498–502, 2017.
[12] I. Ogawa and M. Morise. Tohoku Kiritan singing database: A singing database for statistical parametric singing synthesis using Japanese pop songs. Acoustical Science and Technology, 42(3):140–145, May 2021.
[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
[14] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu. FastSpeech: Fast, robust and controllable text to speech, 2019.
[15] Y. Ren, X. Tan, T. Qin, J. Luan, Z. Zhao, and T.-Y. Liu. DeepSinger: Singing voice synthesis with data mined from the web, 2020.
[16] S. Rouard, F. Massa, and A. Défossez. Hybrid transformers for music source separation, 2022.
[17] K. Schulze-Forster, C. S. Doire, G. Richard, and R. Badeau. Phoneme level lyrics alignment and text-informed singing voice separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2382–2395, 2021.
[18] L. Strgar and D. Harwath. Phoneme segmentation using self-supervised speech models. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 1067–1073. IEEE, 2023.
[19] S. Takamichi, N. Tanji, and H. Saruwatari. JSUT-song. https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song, 2018. Accessed: 2024-10-11.
[20] Tansansui, Chiteiko, Y. Puu, ATSUYA, Yuuhikou, CrazY, and A. Kei. Oftonp singing voice database distribution site. https://sites.google.com/view/oftn-utagoedb/%E3%83%9B%E3%83%BC%E3%83%A0, 2020. Accessed: 2025-04-29.
[21] J.-Y. Wang, C.-I. Leong, Y.-C. Lin, L. Su, and J.-S. R. Jang. Adapting pretrained speech model for Mandarin lyrics transcription and alignment, 2023.
[22] Y. Wang, X. Wang, P. Zhu, J. Wu, H. Li, H. Xue, Y. Zhang, L. Xie, and M. Bi. Opencpop: A high-quality open source Chinese popular song corpus for singing voice synthesis, 2022.
[23] H. Wei, X. Cao, T. Dan, and Y. Chen. RMVPE: A robust model for vocal pitch estimation in polyphonic music. In INTERSPEECH 2023, pages 5421–5425. ISCA, Aug. 2023.
[24] R. Yamamoto, R. Yoneyama, and T. Toda. NNSVS: A neural network-based singing voice synthesis toolkit, 2023.
[25] L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao. M4Singer: A multi-style, multi-singer and musical score provided Mandarin singing corpus. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[26] Y. Zhang, Z. Jiang, R. Li, C. Pan, J. He, R. Huang, C. Wang, and Z. Zhao. TCSinger: Zero-shot singing voice synthesis with style transfer and multi-level style control. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1960–1975. Association for Computational Linguistics, 2024.
[27] Y. Zhang, C. Pan, W. Guo, R. Li, Z. Zhu, J. Wang, W. Xu, J. Lu, Z. Hong, C. Wang, et al. GTSinger: A global multi-technique singing corpus with realistic music scores for all singing tasks. arXiv preprint arXiv:2409.13832, 2024.
[28] Y. Zhang, H. Xue, H. Li, L. Xie, T. Guo, R. Zhang, and C. Gong. VISinger 2: High-fidelity end-to-end singing voice synthesis enhanced by digital signal processing synthesizer, 2022.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97777
dc.description.abstract (zh_TW):
精確的音素級與換氣標註對於訓練高品質的歌聲合成(Singing Voice Synthesis, SVS)模型至關重要。然而,取得這類標註通常需要耗費大量人力進行手動標註,或依賴具有明顯侷限的自動化方法。現有方法往往難以精細對齊歌聲中的音素,缺乏整合的換氣偵測能力,或在缺乏歌詞的情況下無法執行無監督的音素切分。
為了解決這些問題,我們提出 Phonsa,一個用於歌聲的自動標註系統。Phonsa 建構於預訓練的 Whisper 編碼器-解碼器架構之上,結合具分段自注意力機制的音素級分類器以進行音素預測,並透過明確定義的換氣類別與專門的處理策略來實現自動換氣標註。此外,Phonsa 還透過一項新穎的解碼演算法支援無監督的音素切分。我們也設計了一個特殊的邊界符號,用以提升連續相同音素的區分能力,進而增強換氣偵測與切分的準確性。
我們在中文與日文歌唱資料上評估 Phonsa,結果顯示其在強制對齊任務中相較於傳統方法表現更佳,並具備創新的自動換氣偵測與無監督音素切分能力。更重要的是,使用 Phonsa 自動生成標註訓練的 SVS 模型,在主觀平均意見評分(MOS)中展現出明顯優於使用 MFA 標註所訓練模型的聽感品質。我們將開源 Phonsa 系統以及預訓練的中日文模型。
dc.description.abstract (en):
Precise phoneme-level and breath annotations are crucial for training high-quality Singing Voice Synthesis (SVS) models. However, obtaining such annotations typically requires labor-intensive manual labeling or relies on automatic methods with notable limitations. Existing approaches often struggle with fine-grained phoneme alignment in singing voice, lack integrated breath detection capabilities, or fail to perform unsupervised phoneme segmentation in the absence of lyrics.
To address these challenges, we propose Phonsa, an automatic annotation system for singing voice. Built upon a pretrained Whisper encoder-decoder architecture, Phonsa incorporates a chunked self-attention classifier for phoneme-level prediction, introduces an explicit breath class with a dedicated processing strategy for automatic breath annotation, and enables unsupervised phoneme segmentation through a novel decoding algorithm. In addition, a specially designed boundary token is employed to improve the separation of consecutive identical phonemes, thereby enhancing both breath detection and segmentation accuracy.
We evaluate Phonsa on Chinese and Japanese singing data, demonstrating its superior performance over traditional methods in forced alignment, as well as its novel capabilities in automatic breath detection and unsupervised phoneme segmentation. Importantly, SVS models trained with Phonsa-generated annotations achieve significantly higher perceptual quality in subjective Mean Opinion Score (MOS) evaluations compared to those trained with MFA-based annotations. The Phonsa system, along with pretrained Chinese and Japanese models, will be released as an open-source toolkit.
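The forced-alignment task the abstract describes can be illustrated with a minimal sketch: given framewise phoneme log-probabilities (as a phoneme classifier would emit) and the phoneme sequence derived from the lyrics, a Viterbi pass that only lets each phoneme persist or advance produces a frame-to-phoneme alignment. This is a generic textbook illustration under assumed toy inputs, not the thesis's actual decoding algorithm (which additionally handles breath classes and boundary tokens); the `forced_align` helper is hypothetical.

```python
import numpy as np

def forced_align(log_probs, phoneme_ids):
    """Viterbi forced alignment: assign each frame to one phoneme of the
    given sequence, allowing only stay-or-advance transitions.

    log_probs: (T, V) framewise log-probabilities over the phoneme vocabulary.
    phoneme_ids: length-S list of vocabulary indices in lyric order (T >= S).
    Returns a length-T list of indices into phoneme_ids.
    """
    T, _ = log_probs.shape
    S = len(phoneme_ids)
    emit = log_probs[:, phoneme_ids]        # (T, S): score of slot s at frame t
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)      # 0 = stayed in slot, 1 = advanced
    dp[0, 0] = emit[0, 0]                   # alignment must start at the first phoneme
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            adv = dp[t - 1, s - 1] if s > 0 else -np.inf
            if adv > stay:
                dp[t, s], back[t, s] = adv + emit[t, s], 1
            else:
                dp[t, s] = stay + emit[t, s]
    # Backtrack from the last slot at the last frame (alignment must end there).
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(path[-1] - back[t, path[-1]])
    return [int(s) for s in path[::-1]]
```

With a 5-frame toy posteriorgram whose mass sits on phonemes 0, 0, 1, 1, 2 in turn and `phoneme_ids = [0, 1, 2]`, the sketch maps frames 0 and 1 to the first phoneme, frames 2 and 3 to the second, and frame 4 to the third; phoneme boundaries (and hence durations) are read off where the path index changes.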
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-16T16:13:58Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2025-07-16T16:13:58Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements
摘要 (Chinese Abstract)
Abstract
Contents
List of Figures
List of Tables
Chapter 1  Introduction
Chapter 2  Related Works
  2.1 Singing Voice Phoneme Annotation
  2.2 Singing Voice Pitch Annotation
  2.3 Singing Voice Synthesis
Chapter 3  Method
  3.1 Problem Formulation
  3.2 Proposed Model
  3.3 Loss Functions
    3.3.1 Phoneme Alignment Objectives
    3.3.2 Speech Transcription Objective
    3.3.3 Total Loss
  3.4 Special Phoneme Tokens
  3.5 Phoneme Inference and Alignment Strategy
    3.5.1 Phoneme Sequence Construction
    3.5.2 Framewise Probability Computation
    3.5.3 Task 1: Forced Alignment with Breath Detection
    3.5.4 Task 2: Phoneme Segmentation from Framewise Predictions
Chapter 4  Phonsa Experiment
  4.1 Dataset
  4.2 Data Augmentation
  4.3 Training Setting
  4.4 Evaluation Metrics
    4.4.1 Forced Alignment Evaluation
    4.4.2 Breath Prediction Evaluation
    4.4.3 Phoneme Segmentation from Framewise Predictions Evaluation
  4.5 Main Results
  4.6 Ablation Study
    4.6.1 Overall Comparison
    4.6.2 Homophone Alignment Enhancement
Chapter 5  Evaluating Annotation Utility via Singing Voice Synthesis
  5.1 Motivation and Objectives
  5.2 Singing Voice Synthesis Model Architecture
  5.3 Experimental Setup
  5.4 Subjective Listening Tests
  5.5 Results and Discussion
Chapter 6  Conclusion
References
dc.language.iso: en
dc.subject: 人工智慧 (zh_TW)
dc.subject: 音訊處理 (zh_TW)
dc.subject: 歌聲合成 (zh_TW)
dc.subject: 自動標註 (zh_TW)
dc.subject: 音樂 (zh_TW)
dc.subject: Audio processing (en)
dc.subject: Music (en)
dc.subject: Automatic labeling (en)
dc.subject: Singing voice synthesis (en)
dc.subject: Artificial Intelligence (en)
dc.title: 無需人工標註的自動歌聲音素標註及其之於歌聲合成之應用 (zh_TW)
dc.title: Labeling-Free Automatic Singing Voice Phoneme Annotation and Its Application to Singing Voice Synthesis (en)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 蘇黎;曹昱 (zh_TW)
dc.contributor.oralexamcommittee: Li Su; Yu Tsao (en)
dc.subject.keyword: 音樂, 自動標註, 歌聲合成, 人工智慧, 音訊處理 (zh_TW)
dc.subject.keyword: Music, Automatic labeling, Singing voice synthesis, Artificial Intelligence, Audio processing (en)
dc.relation.page: 50
dc.identifier.doi: 10.6342/NTU202501378
dc.rights.note: Authorization granted (access restricted to campus)
dc.date.accepted: 2025-07-01
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
dc.date.embargo-lift: 2025-07-17
Appears in Collections: Graduate Institute of Communication Engineering

Files in This Item:
ntu-113-2.pdf, 640.97 kB, Adobe PDF. Access restricted to NTU campus IPs (use the VPN service from off campus).


Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.
