NTU Theses and Dissertations Repository › 電機資訊學院 (College of Electrical Engineering and Computer Science) › 資料科學學位學程 (Data Science Degree Program)
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84701
Title: 零樣本歌聲轉換與合成的統一模型
A Unified Model for Zero-Shot Singing Voice Conversion and Synthesis
Authors: Jui-Te Wu
吳睿得
Advisor: 蘇黎 (Li Su)
Co-Advisor: 張智星 (Jyh-Shing Jang)
Keywords: singing voice conversion, singing voice synthesis, zero-shot learning, self-supervised learning
Publication Year: 2022
Degree: Master's (碩士)
Abstract: Recent advances in deep learning have not only enabled zero-shot singing voice synthesis and singing voice conversion, but also offered the opportunity to unify these two tasks into one general model. In this thesis we propose a model that unifies both tasks and can generate the singing voice of any target singer from any source singing content in either text or audio format. The model jointly trains a phonetic source encoder for text input and an acoustic source encoder for audio input; through dynamic-programming-based self-supervised learning, the encoders learn during training how to optimally align audio with phonemes. These encoders also map the audio and text data into a similar latent space, so that both singing voice conversion and synthesis can be carried out by the same decoder. The target singer's reference audio is converted into frame-level fragmented information, which is retrieved and reconstructed according to the source content via attention mechanisms. This enables the model, at test time, to generate the voice of target singers unseen during training from either text or audio sources. Both objective and subjective experiments confirm that the proposed model outperforms the previous best any-to-any singing voice conversion and synthesis models.
Recent advances in deep learning not only facilitate the implementation of zero-shot singing voice synthesis (SVS) and singing voice conversion (SVC) tasks, but also provide the opportunity to unify these two tasks into one generalized model. In this paper, we propose such a model that can generate singing voice of any target singer from any source singing content in either text or audio format. The model incorporates self-supervised joint training of the phonetic source encoder and the acoustic source encoder, with an audio-to-phoneme alignment process in each training step, such that these encoders map the audio and text data respectively into a shared, temporally aligned, and singer-agnostic latent space. The target singer's latent representations encoded at different granularity levels are all trained to match the source latent representations sequentially with the attention mechanisms in the decoding stage. This enables the model to generate unseen target singer's voice with fine-grained resolution from either text or audio sources during the inference stage. Both objective and subjective experiments confirmed that the proposed model is competitive with the state-of-the-art SVC and SVS methods.
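The dynamic-programming-based audio-to-phoneme alignment mentioned in the abstract is not detailed on this page. The following is a minimal illustrative sketch of one common DP formulation — a monotonic, no-skip alignment of audio-frame latents to phoneme latents, similar in spirit to DTW/Viterbi alignment. The function name, the squared-Euclidean distance, and the hard (non-differentiable) backtracking are all assumptions for illustration, not the thesis's actual method:

```python
import numpy as np

def monotonic_align(frame_latents, phone_latents):
    """Assign each audio frame to one phoneme index via dynamic programming.

    Constraints (assumed): the alignment is monotonic, covers all phonemes
    in order, and never skips a phoneme. Inputs are (T, d) frame latents
    and (N, d) phoneme latents; returns a length-T array of phoneme indices.
    """
    T, N = len(frame_latents), len(phone_latents)
    # Pairwise squared-Euclidean frame-to-phoneme distances, shape (T, N).
    dist = ((frame_latents[:, None, :] - phone_latents[None, :, :]) ** 2).sum(-1)

    # cost[t, n]: best cumulative cost ending with frame t aligned to phoneme n.
    cost = np.full((T, N), np.inf)
    cost[0, 0] = dist[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = cost[t - 1, n]                              # repeat phoneme n
            advance = cost[t - 1, n - 1] if n > 0 else np.inf  # move to next phoneme
            cost[t, n] = dist[t, n] + min(stay, advance)

    # Backtrack from the final phoneme to recover the per-frame alignment path.
    path = np.empty(T, dtype=int)
    n = N - 1
    path[-1] = n
    for t in range(T - 1, 0, -1):
        if n > 0 and cost[t - 1, n - 1] <= cost[t - 1, n]:
            n -= 1
        path[t - 1] = n
    return path
```

In the actual model the alignment would operate on learned encoder latents inside a self-supervised training loop; this sketch only shows the DP recursion and backtracking that produce a hard frame-to-phoneme alignment.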
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84701
DOI: 10.6342/NTU202203241
Fulltext Rights: Authorized for release (restricted to on-campus access)
Embargo Lift Date: 2022-09-14
Appears in Collections: 資料科學學位學程 (Data Science Degree Program)

Files in This Item:
File: U0001-0709202223252900.pdf
Size: 5 MB
Format: Adobe PDF
Access: limited to NTU IP range