Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93568

| Title: | 邁向能進行自然對話的口語語言模型 Towards Building a Spoken Language Model for Natural Dialogue |
| Author: | 伏宇寬 Yu-Kuan Fu |
| Advisor: | 李宏毅 Hung-yi Lee |
| Keywords: | Spoken dialogue, Turn-taking, Language model, Self-supervised speech model, Speech synthesis |
| Year of Publication: | 2024 |
| Degree: | Master's |
| Abstract: | The goal of this research is to build a human-machine spoken dialogue system that can communicate naturally with humans. Most existing voice assistants rely on text as the medium of communication. Although practical, this approach often produces dialogues that lack emotion, laughter, and other uniquely human elements of interaction, making the exchange feel stiff and unnatural. To address this, recent studies have explored a new approach: generating speech signals directly, without relying on text. This not only preserves non-textual information but also produces conversations closer to natural human dialogue. However, without textual guidance, this method faces challenges in understanding semantics and producing appropriate responses. Compared with text dialogue, spoken dialogue exhibits more complex communicative behavior. Besides carrying emotion and other non-verbal cues, one notable difference is that spoken dialogue allows speakers to talk simultaneously (overlapping speech). The training data for spoken dialogue models must therefore record the two speakers' speech on separate channels, but data in this specific format is quite scarce. To overcome this challenge, we propose an automated pipeline that converts speaker-mixed single-channel audio into speaker-separated pseudo-stereo audio. With this method, we expand our training data from the original 2,000 hours to 17,600 hours, covering a wider variety of dialogue topics and speakers. Increasing the training data significantly enhances the semantic coherence of the spoken dialogues our model generates. In addition, we adopt several state-of-the-art speech encoding techniques to discretize speech, which further improves both the semantic coherence of the dialogues and the audio quality. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93568 |
| DOI: | 10.6342/NTU202402276 |
| Full-text Authorization: | Authorized (restricted to on-campus access) |
| Appears in Collections: | Graduate Institute of Communication Engineering |
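The pseudo-stereo construction described in the abstract can be sketched as follows. This is a toy illustration only: `separate_speakers` stands in for a trained neural source-separation model, which the record does not name, so the function and its oracle behavior here are assumptions.

```python
# Toy sketch of the pseudo-stereo pipeline from the abstract:
# speaker-mixed mono audio -> two separated streams -> two-channel audio.
# `separate_speakers` is a hypothetical placeholder for a neural
# source-separation model; here it does "oracle" separation on a toy
# mixture so the example runs without any ML dependencies.

def separate_speakers(mixture, components):
    # A real pipeline would run a trained separation model on `mixture`;
    # here we simply return the known component signals.
    return components[0], components[1]

def to_pseudo_stereo(mixture, separator):
    # Route each separated stream to its own channel, giving a
    # pseudo-stereo recording as a list of (left, right) frames.
    left, right = separator(mixture)
    return list(zip(left, right))

# Two toy "speaker" signals and their single-channel mixture.
spk_a = [0.5, 0.0, 0.5, 0.0]
spk_b = [0.0, 0.3, 0.0, 0.3]
mono_mix = [a + b for a, b in zip(spk_a, spk_b)]

stereo = to_pseudo_stereo(mono_mix, lambda m: separate_speakers(m, (spk_a, spk_b)))
print(stereo[0])  # first (left, right) frame
```

With each speaker on a separate channel, overlapping speech is preserved rather than collapsed, which is the property the training data needs.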
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (restricted to NTU campus IPs; use the VPN service for off-campus access) | 1.7 MB | Adobe PDF |
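The speech discretization mentioned in the abstract is commonly done by clustering frame-level features from a self-supervised encoder and mapping each frame to its nearest centroid, yielding a sequence of discrete units. A minimal sketch of that mapping step, with made-up centroids and features (the record does not specify the actual encoder or codebook):

```python
# Minimal sketch of speech discretization: map each frame-level feature
# vector to the index of its nearest centroid, producing discrete units.
# Real systems cluster features from a self-supervised speech encoder;
# the codebook and features below are invented for illustration.

def quantize(features, centroids):
    def nearest(vec):
        # Squared Euclidean distance to each centroid.
        dists = [sum((v - c) ** 2 for v, c in zip(vec, cen)) for cen in centroids]
        return dists.index(min(dists))
    return [nearest(f) for f in features]

centroids = [[0.0, 0.0], [1.0, 1.0]]           # toy codebook of 2 units
features  = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.0]]  # toy frame features
units = quantize(features, centroids)
print(units)  # discrete unit sequence
```

The resulting unit sequence is what a spoken language model is trained on in place of text tokens.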
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
