NTU Theses and Dissertations Repository › 電機資訊學院 (College of Electrical Engineering and Computer Science) › 資料科學學位學程 (Data Science Degree Program)
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84701
Title: 零樣本歌聲轉換與合成的統一模型
A Unified Model for Zero-Shot Singing Voice Conversion and Synthesis
Authors: Jui-Te Wu
吳睿得
Advisor: 蘇黎 (Li Su)
Co-Advisor: 張智星 (Jyh-Shing Jang)
Keywords: singing voice conversion, singing voice synthesis, zero-shot learning, self-supervised learning
Publication Year: 2022
Degree: Master's (碩士)
Abstract: Recent advances in deep learning have not only enabled zero-shot singing voice synthesis and singing voice conversion, but also offered the opportunity to unify these two tasks into one general model. In this thesis we propose a model that unifies both tasks and can generate the singing voice of any target singer from any source singing content in either text or audio format. The model jointly trains a phonetic source encoder for text input and an acoustic source encoder for audio input; through dynamic-programming-based self-supervised learning, the encoders learn during training how to optimally align audio with phonemes. These encoders also map the audio and text data into a similar latent space, so that both singing voice conversion and synthesis can be carried out by the same decoder. The target singer's reference audio is converted into frame-level fragmented information, which is retrieved and reconstructed according to the source content via attention mechanisms. This enables the model, at test time, to generate the voice of target singers unseen during training from either text or audio sources. Both objective and subjective experiments confirm that the proposed model outperforms the previous best any-to-any singing voice conversion and synthesis models.
Recent advances in deep learning not only facilitate the implementation of zero-shot singing voice synthesis (SVS) and singing voice conversion (SVC) tasks, but also provide the opportunity to unify these two tasks into one generalized model. In this paper, we propose such a model that can generate singing voice of any target singer from any source singing content in either text or audio format. The model incorporates self-supervised joint training of the phonetic source encoder and the acoustic source encoder, with an audio-to-phoneme alignment process in each training step, such that these encoders map the audio and text data respectively into a shared, temporally aligned, and singer-agnostic latent space. The target singer's latent representations encoded at different granularity levels are all trained to match the source latent representations sequentially with the attention mechanisms in the decoding stage. This enables the model to generate unseen target singer's voice with fine-grained resolution from either text or audio sources during the inference stage. Both objective and subjective experiments confirmed that the proposed model is competitive with the state-of-the-art SVC and SVS methods.
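The dynamic-programming-based audio-to-phoneme alignment mentioned in the abstract is not detailed on this page. The following is a minimal illustrative sketch of one common DP formulation — a monotonic, no-skip alignment of audio-frame latents to phoneme latents, similar in spirit to DTW/Viterbi alignment. The function name, the squared-Euclidean distance, and the hard (non-differentiable) backtracking are all assumptions for illustration, not the thesis's actual method:

```python
import numpy as np

def monotonic_align(frame_latents, phone_latents):
    """Assign each audio frame to one phoneme index via dynamic programming.

    Constraints (assumed): the alignment is monotonic, covers all phonemes
    in order, and never skips a phoneme. Inputs are (T, d) frame latents
    and (N, d) phoneme latents; returns a length-T array of phoneme indices.
    """
    T, N = len(frame_latents), len(phone_latents)
    # Pairwise squared-Euclidean frame-to-phoneme distances, shape (T, N).
    dist = ((frame_latents[:, None, :] - phone_latents[None, :, :]) ** 2).sum(-1)

    # cost[t, n]: best cumulative cost ending with frame t aligned to phoneme n.
    cost = np.full((T, N), np.inf)
    cost[0, 0] = dist[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = cost[t - 1, n]                              # repeat phoneme n
            advance = cost[t - 1, n - 1] if n > 0 else np.inf  # move to next phoneme
            cost[t, n] = dist[t, n] + min(stay, advance)

    # Backtrack from the final phoneme to recover the per-frame alignment path.
    path = np.empty(T, dtype=int)
    n = N - 1
    path[-1] = n
    for t in range(T - 1, 0, -1):
        if n > 0 and cost[t - 1, n - 1] <= cost[t - 1, n]:
            n -= 1
        path[t - 1] = n
    return path
```

In the actual model the alignment would operate on learned encoder latents inside a self-supervised training loop; this sketch only shows the DP recursion and backtracking that produce a hard frame-to-phoneme alignment.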
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84701
DOI: 10.6342/NTU202203241
Fulltext Rights: Authorized for release (restricted to on-campus access)
Embargo Lift Date: 2022-09-14
Appears in Collections: 資料科學學位學程 (Data Science Degree Program)

Files in This Item:
File: U0001-0709202223252900.pdf
Size: 5 MB
Format: Adobe PDF
Access: limited to NTU IP range