Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93905
Full metadata record

DC Field | Value | Language
dc.contributor.advisor | 張智星 | zh_TW
dc.contributor.advisor | Jyh-Shing Roger Jang | en
dc.contributor.author | 許育騰 | zh_TW
dc.contributor.author | Yu-Teng Hsu | en
dc.date.accessioned | 2024-08-09T16:19:51Z | -
dc.date.available | 2024-08-10 | -
dc.date.copyright | 2024-08-09 | -
dc.date.issued | 2024 | -
dc.date.submitted | 2024-08-02 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93905 | -
dc.description.abstract | 近年來,將歌聲中的歌手身份轉換成另一位歌手的任務,或稱為歌聲轉換,已經取得了巨大的成功。大多數現有的歌聲轉換系統僅考慮了歌聲的音色轉換,其他資訊則保持不變。然而,這未充分考慮歌手身份的其他方面,特別是體現在歌聲的音高曲線和能量曲線中的歌唱風格。為了解決這個問題,本論文提出了一個任意對多的歌唱風格轉換系統,將一位歌手的音高曲線和能量曲線轉換為另一位歌手的風格。為了實現這個目標,我們利用了兩個類似 AutoVC 具有信息瓶頸的自編碼器,以將歌唱風格與音樂內容區分開來。第一個自編碼器執行音高轉換,而第二個自編碼器則以音高曲線為條件執行能量轉換,以確保兩個曲線之間的一致性。考慮到顫音在歌聲表達中的重要性,我們進一步加入了強調顫音特徵的損失函數,以突顯其作用。實驗結果顯示,我們提出的模型能夠有效地在任意對多的情境下將音高和能量特徵的風格轉換為目標歌手的歌唱風格。 | zh_TW
dc.description.abstract | The task of converting the singer identity of a singing voice to that of another singer, known as singing voice conversion (SVC), has achieved great success in recent years. Most existing SVC systems convert only the timbre of a singing voice, leaving all other information unchanged. This, however, neglects other aspects of singer identity, particularly a singer's singing style, which is reflected in the pitch and energy contours of the singing voice. To address this issue, this thesis proposes an any-to-many singing style conversion system that converts the pitch and energy contours from one singer's style to another's. To achieve this, we utilize two AutoVC-like autoencoders with information bottlenecks to disentangle singing style from musical content. The first performs pitch conversion, while the second performs energy conversion conditioned on the pitch contour to ensure consistency between the two contours. Recognizing the crucial role of vibrato in vocal expression, we further incorporate loss functions that emphasize vibrato features. Experimental results suggest that the proposed model can effectively convert the style of the pitch and energy features to that of the target singer in an any-to-many conversion scenario. | en
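The abstract describes a two-stage pipeline: AutoVC-like autoencoders whose narrow information bottleneck is meant to squeeze singer style out of the content representation, an energy stage conditioned on the pitch contour for consistency, and a loss term that emphasizes vibrato. The NumPy sketch below only illustrates these ideas; the random, untrained projection weights, the 4-8 Hz vibrato band, the 100 fps frame rate, and all dimensions are illustrative assumptions, not the thesis's actual architecture or loss functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, dim=4):
    """Content encoder: project a frame-level contour down to a few
    coefficients -- the narrow 'information bottleneck' intended to
    discard singer-specific style (untrained random weights here)."""
    basis = rng.standard_normal((dim, x.shape[0])) / np.sqrt(x.shape[0])
    return basis @ x                      # (dim,) content code

def decoder(code, spk_emb, n_frames):
    """Decoder: reconstruct a contour from the content code plus a
    target-singer embedding (a random stand-in in this sketch)."""
    z = np.concatenate([code, spk_emb])
    w = rng.standard_normal((n_frames, z.shape[0])) / np.sqrt(z.shape[0])
    return w @ z                          # (n_frames,) converted contour

def vibrato_loss(pred, target, frame_rate=100.0, band=(4.0, 8.0)):
    """Assumed vibrato-emphasis term: compare only the 4-8 Hz band of
    the two contours, where vibrato modulation typically lives."""
    freqs = np.fft.rfftfreq(pred.size, d=1.0 / frame_rate)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    diff = np.fft.rfft(pred)[mask] - np.fft.rfft(target)[mask]
    return float(np.mean(np.abs(diff) ** 2))

# Toy pitch contour: a 220 Hz tone with 6 Hz vibrato, 100 frames/sec.
t = np.arange(200) / 100.0
pitch = 220.0 + 3.0 * np.sin(2 * np.pi * 6.0 * t)
spk = rng.standard_normal(8)              # hypothetical target-singer embedding

# Stage 1: pitch conversion through the bottleneck autoencoder.
converted_pitch = decoder(encoder(pitch), spk, pitch.size)

# Stage 2: energy conversion conditioned on the converted pitch contour,
# so the two contours stay mutually consistent.
energy = 0.5 + 0.1 * np.sin(2 * np.pi * 2.0 * t)
cond_code = np.concatenate([encoder(energy), encoder(converted_pitch)])
converted_energy = decoder(cond_code, spk, energy.size)
```

In a real system the encoder/decoder weights would be trained so that reconstruction plus the band-limited vibrato term shape what survives the bottleneck; here the forward pass only demonstrates the data flow and the conditioning.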
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-09T16:19:51Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2024-08-09T16:19:51Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents:
Oral Defense Committee Certification i
Acknowledgements iii
Abstract (Chinese) v
Abstract vii
Table of Contents ix
List of Figures xiii
List of Tables xv
Chapter 1: Introduction 1
1.1 Research Overview and Motivation 1
1.2 Methods and Contributions 3
1.3 Chapter Outline 3
Chapter 2: Literature Review 5
2.1 Pitch Conversion in Voice Conversion 5
2.1.1 Methods in Related Work 5
2.1.2 Differences between Speech and Singing Voice 6
2.2 Expressive Singing Voice Synthesis 7
2.3 Singing Technique Conversion 8
2.3.1 Methods in Related Work 8
2.3.2 Differences between Singing Technique and Singing Style 10
2.4 Overview of the AutoVC Architecture 10
2.4.1 Content Encoder 11
2.4.2 Speaker Encoder 11
2.4.3 Decoder 12
Chapter 3: Methodology 13
3.1 Pitch Conversion 13
3.1.1 Data Processing 14
3.1.2 Model Architecture 14
3.1.3 Vibrato Modeling 16
3.1.4 Vibrato Amplitude Smoothing 18
3.1.5 Loss Functions 19
3.2 Energy Conversion 20
3.2.1 Data Processing 21
3.2.2 Model Architecture 21
3.2.3 Loss Functions 22
3.3 Single-Stage Conversion 22
3.3.1 Data Processing 23
3.3.2 Model Architecture 23
3.3.3 Loss Functions 24
Chapter 4: Experimental Setup 25
4.1 Datasets 25
4.1.1 Opencpop 25
4.1.2 TONAS 26
4.1.3 M4Singer 26
4.1.4 OpenSinger 27
4.2 Evaluation Metrics 28
4.2.1 Objective Metrics 28
4.2.2 Subjective Metrics 30
4.3 Experimental Environment 31
4.4 Hyperparameter Settings 31
4.5 Experiments 32
Chapter 5: Results and Discussion 35
5.1 Experiment 1: Comparison of Single-Stage and Two-Stage Conversion Models 35
5.2 Experiment 2: Ablation Study on Vibrato Modeling and Vibrato Amplitude Smoothing 38
5.3 Experiment 3: Effect of Providing the Pitch Contour on the Energy Conversion Model 39
5.4 Experiment 4: Analysis of Subjective Evaluation Results 41
5.4.1 Overall Subjective Evaluation Results 42
5.4.2 Subjective Results by Target Singer Gender 43
5.5 Experiment 5: Case Studies in the Any-to-Many Scenario 43
5.5.1 Successful Conversion Cases 43
5.5.2 Failed Conversion Cases 44
5.6 Experiment 6: Comparison of Singing Style Conversion and Vibrato Conversion 45
Chapter 6: Conclusion and Future Work 49
6.1 Conclusion 49
6.2 Future Work 50
References 53
dc.language.iso | zh_TW | -
dc.subject | 音高轉換 | zh_TW
dc.subject | 歌唱風格轉換 | zh_TW
dc.subject | 顫音學習 | zh_TW
dc.subject | 自編碼器 | zh_TW
dc.subject | 能量轉換 | zh_TW
dc.subject | energy conversion | en
dc.subject | autoencoder | en
dc.subject | vibrato learning | en
dc.subject | pitch conversion | en
dc.subject | singing style conversion | en
dc.title | 任意對多歌唱風格轉換 | zh_TW
dc.title | Any-to-many Singing Style Conversion | en
dc.type | Thesis | -
dc.date.schoolyear | 112-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 王新民;曹昱 | zh_TW
dc.contributor.oralexamcommittee | Hsin-Min Wang;Yu Tsao | en
dc.subject.keyword | 歌唱風格轉換,音高轉換,能量轉換,自編碼器,顫音學習 | zh_TW
dc.subject.keyword | singing style conversion,pitch conversion,energy conversion,autoencoder,vibrato learning | en
dc.relation.page | 57 | -
dc.identifier.doi | 10.6342/NTU202402100 | -
dc.rights.note | 同意授權(全球公開) (Authorized: open access worldwide) | -
dc.date.accepted | 2024-08-06 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | -
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-112-2.pdf | 9.58 MB | Adobe PDF


Except where otherwise indicated by their own copyright terms, all items in the repository are protected by copyright, with all rights reserved.
