Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93568

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李宏毅 | zh_TW |
| dc.contributor.advisor | Hung-yi Lee | en |
| dc.contributor.author | 伏宇寬 | zh_TW |
| dc.contributor.author | Yu-Kuan Fu | en |
| dc.date.accessioned | 2024-08-05T16:37:47Z | - |
| dc.date.available | 2024-08-06 | - |
| dc.date.copyright | 2024-08-05 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-07-28 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93568 | - |
| dc.description.abstract | 本研究的目標是建立一個能夠與人自然對話的人機語音對話系統。市面上的語音助理大多依賴文字作為溝通的中介。這種方法雖然實用,但往往導致對話中缺少情感色彩、笑聲等人類特有的交流元素,使得整體交流顯得生硬而不自然。為了解決這一問題,近期的研究開始探索一種新的途徑:不依賴文字直接生成語音訊號,這樣不僅能有效保留非文字信息,還能模擬出更加接近人類自然對話的效果。
語音對話與文字對話相比,展現了更為複雜的交流行為:除了包含情感等非語言元素外,一個明顯的差異是語音對話允許語者同時發出聲音(即產生重疊語音)。因此,語音對話模型的訓練資料必須將兩個語者的語音分別記錄在不同聲道,但這種特定格式的資料相當稀缺。為了克服這一挑戰,我們提出了一種自動化流程,可以將混合在單一聲道中的語音轉換為分離後的虛擬雙聲道語音。透過這一方法,我們將訓練資料從原先的2,000小時擴充至17,600小時,並且涵蓋了更多樣化的對話主題和語者。增加訓練資料顯著提高了我們的模型在生成語音對話時的語意連貫性。此外,我們還採用了多種最先進的語音編碼技術來進行語音的離散化處理,這不僅進一步提升了對話的語意連貫性,也顯著改善了音質。 | zh_TW |
| dc.description.abstract | The goal of this research is to build a human-machine spoken dialogue system that can communicate naturally with humans. Most existing voice assistants rely on text as the medium of communication. Although practical, this approach often yields dialogues that lack emotion, laughter, and other uniquely human elements of interaction, making the exchange feel stiff and unnatural. To address this issue, recent studies have begun to explore a new approach: generating speech signals directly, without relying on text. This not only preserves non-textual information but also produces interaction closer to natural human conversation. However, without textual guidance, this approach struggles to understand semantics and to provide appropriate responses.
Compared with text dialogue, spoken dialogue exhibits more complex communicative behavior. Beyond carrying emotion and other non-verbal cues, one notable difference is that spoken dialogue allows speakers to talk simultaneously (overlapping speech). The training data for spoken dialogue models must therefore record the two speakers' speech on separate channels, but data in this specific format is scarce. To overcome this challenge, we propose an automated pipeline that converts speaker-mixed single-channel audio into speaker-separated pseudo-stereo audio (a brief, hedged sketch of this idea follows the metadata table below). With this method, we expand our training data from the original 2,000 hours to 17,600 hours, covering a wider variety of dialogue topics and speakers. Increasing the training data significantly improves the semantic coherence of the spoken dialogues our model generates. In addition, we adopt several state-of-the-art speech encoders to discretize speech, which further improves semantic coherence and markedly improves audio quality. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-05T16:37:47Z. No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-08-05T16:37:47Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 目錄
致謝 i
摘要 iii
Abstract v
目錄 vii
圖目錄 xi
表目錄 xiii
第一章 導論 1
1.1 研究動機 1
1.2 研究方法 3
1.3 主要貢獻 4
1.4 章節安排 4
第二章 背景知識 5
2.1 語音對話輪替行為 5
2.2 類神經網路 7
2.2.1 簡介 7
2.2.2 架構 7
2.2.3 卷積類神經網路 8
2.2.4 轉換器 9
2.3 生成式語言模型 11
2.4 語音基石模型 12
2.4.1 簡介 12
2.4.2 掩碼語音預訓練 12
2.5 基於語音的語音生成語言模型 14
2.5.1 簡介 14
2.5.2 離散化語音訊號與語音生成式模型訓練 15
2.5.3 語音還原 15
第三章 研究方法與設置 17
3.1 語音對話模型 17
3.1.1 雙聲道轉換器語言模型 17
3.1.2 模型損失函數 17
3.1.3 模型生成 19
3.2 資料蒐集 20
3.2.1 網路爬蟲 20
3.2.2 語者極化 21
3.2.3 語者分離 22
3.2.4 語者驗證 23
3.3 語音編碼器選擇 24
3.4 實驗參數設置 25
3.4.1 資料集設定 25
3.4.2 虛擬雙聲道資料建立流程 25
3.4.3 模型訓練 27
第四章 實驗結果與分析 29
4.1 聲碼器語音還原自然程度 29
4.2 訓練目標 30
4.3 輪替行為分析 31
4.3.1 活性語音、間隔、重疊、暫停行為分析 31
4.3.2 笑聲頻率分析 33
4.3.3 語速分析 33
4.4 語音對話內容連貫性 34
4.5 分析小結 36
第五章 結論與展望 37
5.1 結論 37
5.2 未來展望 37
參考文獻 39 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 對話輪替行為 | zh_TW |
| dc.subject | 語音對話 | zh_TW |
| dc.subject | 語音生成 | zh_TW |
| dc.subject | 自監督式語音模型 | zh_TW |
| dc.subject | 語言模型 | zh_TW |
| dc.subject | Language model | en |
| dc.subject | Self-supervised speech model | en |
| dc.subject | Speech synthesis | en |
| dc.subject | Turn-taking | en |
| dc.subject | Spoken dialogue | en |
| dc.title | 邁向能進行自然對話的口語語言模型 | zh_TW |
| dc.title | Towards Building a Spoken Language Model for Natural Dialogue | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 賴穎暉;曹昱;陳尚澤;王新民 | zh_TW |
| dc.contributor.oralexamcommittee | Ying-Hui Lai;Yu Tsao;Shang-Tse Chen;Hsin-Min Wang | en |
| dc.subject.keyword | 語音對話,對話輪替行為,語言模型,自監督式語音模型,語音生成 | zh_TW |
| dc.subject.keyword | Spoken dialogue, Turn-taking, Language model, Self-supervised speech model, Speech synthesis | en |
| dc.relation.page | 45 | - |
| dc.identifier.doi | 10.6342/NTU202402276 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2024-07-30 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| Appears in Collections: | 電信工程學研究所 | |
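
The abstract above describes an automated pipeline that converts speaker-mixed single-channel recordings into speaker-separated pseudo-stereo audio. Below is a minimal illustrative sketch of that idea, not the thesis's actual implementation: it assumes SpeechBrain's pretrained SepFormer source-separation and ECAPA-TDNN speaker-verification models as stand-ins for the separation and verification stages listed in the table of contents, and the model identifiers, temporary file names, and the `mono_to_pseudo_stereo` helper are placeholders chosen for the example.

```python
# Hedged sketch of the pseudo-stereo idea described in the abstract: separate a
# speaker-mixed mono recording into two sources, check that the two sources
# belong to different speakers, and write them out as a two-channel file.
# Model identifiers and file names are illustrative, not the thesis's setup.
import torch
import torchaudio
from speechbrain.pretrained import SepformerSeparation, SpeakerRecognition

# The wsj02mix SepFormer checkpoint operates on 8 kHz audio.
separator = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix", savedir="pretrained/sepformer")
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained/ecapa")


def mono_to_pseudo_stereo(mix_path: str, out_path: str, sample_rate: int = 8000) -> None:
    # SepFormer returns estimated sources with shape (batch, time, n_sources).
    est_sources = separator.separate_file(path=mix_path)
    ch0 = est_sources[0, :, 0].detach().cpu().unsqueeze(0)  # (1, time)
    ch1 = est_sources[0, :, 1].detach().cpu().unsqueeze(0)

    # Speaker verification: reject the clip if both separated channels sound
    # like the same speaker, i.e. the mixture was not a two-party dialogue.
    torchaudio.save("tmp_ch0.wav", ch0, sample_rate)
    torchaudio.save("tmp_ch1.wav", ch1, sample_rate)
    score, same_speaker = verifier.verify_files("tmp_ch0.wav", "tmp_ch1.wav")
    if bool(same_speaker):
        raise ValueError(
            f"separated channels match one speaker (score={float(score):.2f})")

    # Stack the two separated sources as the two channels of a pseudo-stereo file.
    stereo = torch.cat([ch0, ch1], dim=0)  # (2, time)
    torchaudio.save(out_path, stereo, sample_rate)


if __name__ == "__main__":
    mono_to_pseudo_stereo("mixed_dialogue.wav", "pseudo_stereo.wav")
```

Whether to discard or re-process clips whose separated channels match the same speaker, and at what verification threshold, is a design choice; the check above only illustrates where speaker verification fits in such a pipeline.
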
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (access restricted to NTU campus IPs; use the VPN service from off campus) | 1.7 MB | Adobe PDF | |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
