Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86964
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李琳山 | zh_TW |
dc.contributor.advisor | Lin-shan Lee | en |
dc.contributor.author | 曾韋誠 | zh_TW |
dc.contributor.author | Wei-Cheng Tseng | en |
dc.date.accessioned | 2023-05-02T17:06:56Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-05-02 | - |
dc.date.issued | 2022 | - |
dc.date.submitted | 2023-01-09 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86964 | - |
dc.description.abstract | 語音品質評估(Speech Quality Assessment)多年來,一直是語音處理(Speech Processing)領域的重要課題。傳統上,經許多人聆聽後所獲得的平均主觀意見分數(Mean Opinion Score)一直是語音品質評估的金科玉律,但由於需舉辦聆聽測驗來獲取許多受測者對於待測語音訊號的主觀評分,因而必須耗費大量的人力與時間。另一方面,確有多項基於模擬人類聽覺系統所發展而來的全參考客觀語音品質評估方法(Full-reference Objective Speech Quality Assessment)被普遍使用,並證實與平均主觀意見分數成高度相關。然而,由於這些方法中需要乾淨真實的參考訊號作為待測訊號的比較對象,使得它們在無法取得參考訊號的情況下無法使用。因此,開發一套無參考客觀語音品質評估方法(No-reference Objective Speech Quality Assessment),也就是不須參考語音訊號,且與平均主觀意見分數的評量結果呈現高度相關的語音品質評量技術,乃成為本研究的主題。
另一方面,近年自監督式學習(Self-supervised Learning)的預訓練(Pre-trained)技術在語音處理領域上已經相當成熟,可以由大規模無標記語料庫中,提取出隱含豐富資訊的特徵向量(Feature Vector)。這些特徵向量被證實能增進多項語音處理任務的表現,如語音辨識、語者辨識、語音翻譯等;只是在無參考客觀語音品質評估上的潛力還未被充分發掘。在本論文中,我們首先分析了自監督式語音表徵用於無參考客觀語音品質評估上的可行性,在實驗中發現,自監督式語音表徵中含有豐富的聲學(Acoustic)資訊及語言(Linguistic)內容的資訊,且能區隔不同品質的語音訊號,說明其可能相當適合用於無參考語音品質評估。接著,我們基於上述結果,提出了一套全新的、基於 HuBERT 表徵的深層(Deep)無參考語音品質評估技術。實驗結果顯示,這套技術全面超越過去使用傳統語音表徵的所有方法,並在不同語言上有更好的泛化能力。最後,我們也使用探測分析(Probing Analysis)更深入理解影響模型表現的因素。 | zh_TW |
dc.description.abstract | Speech quality assessment evaluates the quality of speech signals, and it has been an essential part of speech processing for decades as a way to measure system performance. Conventionally, the mean opinion score (MOS) has been regarded as the gold standard for speech quality assessment, but obtaining it requires listening tests with a large number of human listeners, making it costly and time-consuming. Full-reference objective speech quality assessment approaches have therefore been developed to simulate the human auditory system and have been shown to correlate highly with MOS. However, these approaches require a clean reference signal to compare against the test signal, which renders them unusable when no such signal is available. Accordingly, there is a need for a no-reference objective speech quality assessment method that correlates well with human perception without requiring a reference signal; developing such a method is the main focus of this thesis.
On the other hand, self-supervised pre-trained models, which exploit large-scale unlabeled speech datasets, have emerged in the field of speech processing. These models extract high-level, informative, and compact representation vectors from raw audio inputs, and the extracted representations have been shown to benefit downstream tasks such as speech recognition, speaker verification, speech translation, and spoken language understanding. Nonetheless, the capability of these self-supervised speech representations for speech quality assessment has yet to be thoroughly investigated. In this thesis, we first conduct a preliminary analysis of the feasibility of adopting self-supervised speech representations for speech quality assessment. The results demonstrate that these representations contain rich acoustic and linguistic information and can distinguish audio signals of different quality, suggesting their potential for evaluating speech quality. Building on these findings, we propose a novel deep no-reference objective speech quality assessment model based on HuBERT features. Experimental results show that our model significantly outperforms previous state-of-the-art approaches and generalizes better across languages. Moreover, we conduct several probing analyses to further understand the factors that affect model performance. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-05-02T17:06:56Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-05-02T17:06:56Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Oral Examination Committee Certification
Acknowledgements
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Directions
1.3 Research Contributions
1.4 Thesis Organization
Chapter 2 Background
2.1 Deep Neural Networks
2.1.1 Feed-forward Neural Networks
2.1.2 Convolutional Neural Networks
2.1.3 Recurrent Neural Networks
2.1.4 Attention Mechanisms
2.2 Speech Representations
2.2.1 Supervised Speech Representations
2.2.2 Self-supervised Speech Representations
2.3 Speech Quality Assessment
2.3.1 Subjective Speech Quality Assessment Methods
2.3.2 Objective Speech Quality Assessment Methods
2.4 Chapter Summary
Chapter 3 Deep Learning-Based No-reference Objective Speech Quality Assessment
3.1 Introduction
3.2 Model Architecture
3.3 Related Techniques
3.3.1 Listener-Dependent Networks
3.3.2 Transfer Learning
3.3.3 Attention Mechanisms
3.3.4 Multi-task Learning
3.4 Chapter Summary
Chapter 4 Feasibility Analysis of Self-supervised Speech Representations for No-reference Objective Speech Quality Assessment
4.1 Introduction
4.2 Datasets
4.3 Self-supervised Speech Representations Used
4.4 Experiments on Acoustic Information in Self-supervised Speech Representations
4.4.1 Experimental Setup
4.4.2 Experimental Results
4.5 Experiments on Linguistic Content Information in Self-supervised Speech Representations
4.5.1 Experimental Setup
4.5.2 Experimental Results
4.6 Dimensionality Reduction Analysis Experiments
4.6.1 Experimental Setup
4.6.2 Experimental Results
4.7 Canonical Correlation Analysis Experiments
4.7.1 Experimental Setup
4.7.2 Experimental Results
4.8 Chapter Summary
Chapter 5 No-reference Objective Speech Quality Assessment Model Based on Self-supervised Speech Representations
5.1 Introduction
5.2 Datasets
5.3 Proposed Method
5.3.1 Model Architecture
5.3.2 Training Method
5.4 Baseline Methods
5.4.1 LDNet
5.4.2 NISQAv2
5.5 Evaluation Metrics
5.6 Comparison with Other No-reference Objective Speech Quality Assessment Models
5.6.1 Experimental Results
5.7 Experiments on Generalization across Languages
5.7.1 Experimental Results
5.8 Experiments on Transferability across Corpus Types
5.8.1 Experimental Results
5.9 Experiments on the Effect of Speech Quality on Prediction Performance
5.9.1 Experimental Results
5.10 Chapter Summary
Chapter 6 Conclusion and Future Work
6.1 Research Contributions and Discussion
6.2 Future Work
References | - |
dc.language.iso | zh_TW | - |
dc.title | 基於自監督式語音表徵之無參考客觀語音品質評估 | zh_TW |
dc.title | No-reference Objective Speech Quality Assessment Based on Self-supervised Speech Representations | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-1 | - |
dc.description.degree | Master | - |
dc.contributor.oralexamcommittee | 李宏毅;簡仁宗;陳信宏;王小川;鄭秋豫 | zh_TW |
dc.contributor.oralexamcommittee | Hung-yi Lee;Jen-Tzung Chien;Sin-Horng Chen;Hsiao-Chuan Wang;Chiu-yu Tseng | en |
dc.subject.keyword | 深度學習,自監督式語音表徵,語音處理,語音品質評估,無參考客觀語音品質評估 | zh_TW |
dc.subject.keyword | Deep Learning, Speech Processing, Self-supervised Speech Representation, Speech Quality Assessment, No-reference Objective Speech Quality Assessment | en |
dc.relation.page | 83 | - |
dc.identifier.doi | 10.6342/NTU202300029 | - |
dc.rights.note | Authorized (open access worldwide) | - |
dc.date.accepted | 2023-01-10 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Graduate Institute of Communication Engineering | - |
Appears in Collections: | Graduate Institute of Communication Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-1.pdf | 2.45 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.