Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55183
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李宏毅(Hung-Yi Lee) | |
dc.contributor.author | Chao-I Tuan | en |
dc.contributor.author | 段昭誼 | zh_TW |
dc.date.accessioned | 2021-06-16T03:50:21Z | - |
dc.date.available | 2021-02-20 | |
dc.date.copyright | 2021-02-20 | |
dc.date.issued | 2021 | |
dc.date.submitted | 2021-02-17 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55183 | - |
dc.description.abstract | In this thesis, we propose two novel speech separation model architectures, targeting model compression and speech separation in noisy environments, respectively. By improving existing speech separation models, we aim at a more generalizable speech separation system closer to real-world application scenarios (universal separation). For model compression, motivated by the success of parameter sharing in compressing natural language processing models, we study the effect of parameter sharing on time-domain speech separation models and design parameter sharing strategies tailored to time-domain architectures. Stability evaluation is crucial for a compressed model. Experiments show that our proposed MiTAS compresses nearly 50% of the parameters while preserving the same separation performance, and passes multiple stability evaluations. Model compression moves speech separation toward end users and broader practical deployment. The second research direction of this thesis is improving speech separation in noisy environments. Since speech denoising and speech separation are similar in nature, we propose SADDEL, a unified architecture that merges the two tasks under a multi-task learning framework, so that a single model can perform both speech separation and speech denoising. Experiments show that SADDEL outperforms single-task models, matches realistic scenarios more closely than the compared models, and remains robust in both separation and denoising performance under unseen noise types and noise levels. Applications of speech separation include labeling collected real-world separation data, as well as automatic speech recognition (ASR) and speaker recognition in noisy environments. Extracting speech from mixtures of overlapping voices and background noise is important for a wide range of downstream speech processing systems. | zh_TW |
dc.description.abstract | In this thesis, we propose two novel model architectures for speech separation that target model compression and separation in noisy environments, respectively, to bring speech separation closer to real-world applications. By improving existing speech separation models, we work toward wider generalizability and a step closer to a universal separation system.
Our first research interest is model compression, inspired by the success of parameter sharing in compressing natural language processing models. We investigate the effectiveness of such methods on time-domain speech separation, propose several parameter sharing strategies, and examine the design aspects that lead to a parameter-efficient model. Stability evaluation is essential for a compressed model. Experimental results show that our proposed MiTAS can compress nearly 75% of the model parameters while maintaining the same speech separation performance; moreover, MiTAS passes multiple stability evaluations, indicating its robustness. In summary, MiTAS represents a significant step toward separation on edge devices and enables a wider range of downstream applications. Our second research interest is improving speech separation performance in noisy environments. Since speech separation and speech denoising are similar in nature, we propose SADDEL, a joint speech separation and denoising framework based on a multi-task learning criterion that tackles the two problems simultaneously; under this framework, a single model performs both speech separation and speech denoising. Experimental results demonstrate that SADDEL outperforms comparative speech denoising and speech separation models and exhibits promising results on various noisy separation tasks. Moreover, SADDEL remains robust across different datasets, noise types, and SNR levels. Common applications of speech separation include labeling collected real-world separation data, as well as automatic speech recognition (ASR) and speaker recognition in noisy environments. Extracting speech from a mixture of human voices and background noise is important for a wide range of downstream speech processing systems. Brief illustrative code sketches of the parameter-sharing and multi-task-loss ideas appear after the metadata record below. | en |
dc.description.provenance | Made available in DSpace on 2021-06-16T03:50:21Z (GMT). No. of bitstreams: 1 U0001-0402202121321200.pdf: 2385764 bytes, checksum: a03dfc7647fdbdca2ed6e70298409dce (MD5) Previous issue date: 2021 | en |
dc.description.tableofcontents | Acknowledgements; Abstract (Chinese); Abstract (English); Chapter 1 Introduction: 1.1 Motivation, 1.2 Research Directions, 1.3 Thesis Organization; Chapter 2 Background: 2.1 Deep Neural Networks (2.1.1 Overview, 2.1.2 Model Architecture, 2.1.3 Network Training), 2.2 Neural Networks for Speech Sequences (2.2.1 Overview, 2.2.2 Recurrent Neural Networks (RNN), 2.2.3 Convolutional Neural Networks (CNN), 2.2.4 Depthwise Separable Convolution, 2.2.5 Dilated Convolution), 2.3 Speech Separation (2.3.1 Overview, 2.3.2 Deep-Learning-Based Speech Separation, 2.3.3 Permutation Invariant Training, 2.3.4 Evaluation Metrics), 2.4 Speech Enhancement (2.4.1 Overview, 2.4.2 Evaluation Metrics), 2.5 Chapter Summary; Chapter 3 Model Compression of Speech Separation Models: 3.1 Overview, 3.2 Motivation, 3.3 Method (3.3.1 Model Architecture, 3.3.2 Weight Sharing), 3.4 Experimental Design (3.4.1 Corpora, 3.4.2 Model Overview and Experimental Settings, 3.4.3 Results and Analysis), 3.5 Chapter Summary; Chapter 4 Joint Speech Enhancement and Speech Separation with Multi-task Learning: 4.1 Overview, 4.2 Method (4.2.1 Related Work, 4.2.2 Model Architecture and Loss Function Design), 4.3 Experimental Design (4.3.1 Corpora, 4.3.2 Model Overview and Experimental Settings, 4.3.3 Model Comparison, 4.3.4 Results and Analysis), 4.4 Chapter Summary; Chapter 5 Conclusion and Future Work: 5.1 Contributions and Discussion, 5.2 Future Work; References | |
dc.language.iso | zh-TW | |
dc.title | 語音分離技術研究:模型壓縮與多工學習 | zh_TW |
dc.title | Research of Speech Separation: Network Compression and Multi-task Learning | en |
dc.type | Thesis | |
dc.date.schoolyear | 109-1 | |
dc.description.degree | Master | |
dc.contributor.coadvisor | 曹昱(Yu Tsao) | |
dc.contributor.oralexamcommittee | 蔡宗翰(Tzong-Han Tsai), 李琳山(Lin-shan Lee) | |
dc.subject.keyword | speech separation, model compression, multi-task learning, endpoint applications, speech denoising | zh_TW |
dc.subject.keyword | speech separation, speech denoising, model compression, multi-task learning, endpoint applications | en |
dc.relation.page | 71 | |
dc.identifier.doi | 10.6342/NTU202100540 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2021-02-17 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Data Science Degree Program | zh_TW |
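
The abstract above attributes MiTAS's compression to parameter sharing on a time-domain separation model. As a rough illustration of that idea only (not the thesis's actual MiTAS configuration), the following minimal PyTorch sketch reuses one set of dilated depthwise-separable convolution blocks across every repeat of a Conv-TasNet-style temporal convolutional network; all names and sizes here are hypothetical:

```python
import torch
import torch.nn as nn

class SharedTCN(nn.Module):
    """Hypothetical TCN separator body with cross-repeat weight sharing:
    one set of blocks is reused across all repeats (ALBERT-style), which
    divides the TCN parameter count by `repeats` while keeping the same
    depth and receptive field as an unshared stack."""

    def __init__(self, channels=128, hidden=256, kernel=3,
                 blocks_per_repeat=4, repeats=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        for b in range(blocks_per_repeat):
            d = 2 ** b  # dilation grows exponentially within a repeat
            self.blocks.append(nn.Sequential(
                nn.Conv1d(channels, hidden, 1),        # pointwise expand
                nn.PReLU(),
                nn.Conv1d(hidden, hidden, kernel,      # depthwise, dilated
                          groups=hidden, dilation=d,
                          padding=d * (kernel - 1) // 2),
                nn.PReLU(),
                nn.Conv1d(hidden, channels, 1),        # pointwise project
            ))
        self.repeats = repeats

    def forward(self, x):                  # x: (batch, channels, time)
        for _ in range(self.repeats):      # the same weights every pass
            for block in self.blocks:
                x = x + block(x)           # residual connection
        return x

# Usage sketch: y = SharedTCN()(torch.randn(1, 128, 16000))
```

Because the shared blocks are applied `repeats` times, the computation matches an unshared TCN while the stored weights shrink accordingly; this is the same trade-off ALBERT exploits for transformer layers.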
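For the multi-task direction, the record does not spell out SADDEL's loss, but joint separation-and-denoising systems of this kind are typically trained with a permutation invariant training (PIT) objective on scale-invariant SNR (SI-SNR), treating denoising as single-target "separation" from a noisy mixture. The sketch below shows that standard machinery under those assumptions; it is not the exact SADDEL objective:

```python
import itertools
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; inputs shaped (..., time)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference signal.
    proj = (est * ref).sum(-1, keepdim=True) * ref \
           / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(
        proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def pit_si_snr_loss(est_sources, ref_sources):
    """Negative SI-SNR under the best source permutation.
    est_sources, ref_sources: (batch, n_src, time)."""
    n_src = ref_sources.size(1)
    per_perm = []
    for perm in itertools.permutations(range(n_src)):
        est_perm = est_sources[:, list(perm)]
        per_perm.append(-si_snr(est_perm, ref_sources).mean(dim=1))
    # Pick the permutation with the lowest loss for each mixture.
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()

# Multi-task usage sketch: alternate two-speaker separation batches
# (n_src=2) with denoising batches, where the clean utterance is the
# single target (n_src=1) and PIT reduces to a plain SI-SNR loss.
```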
Appears in Collections: | Data Science Degree Program
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-0402202121321200.pdf (currently not authorized for public access) | 2.33 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.