Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73424
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 簡韶逸(Shao-Yi Chien) | |
dc.contributor.author | Jyun-Yi Wu | en |
dc.contributor.author | 吳俊易 | zh_TW |
dc.date.accessioned | 2021-06-17T07:34:05Z | - |
dc.date.available | 2022-06-12 | |
dc.date.copyright | 2019-06-12 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-05-17 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73424 | - |
dc.description.abstract | In recent years, with the rise of deep learning, many studies on speech denoising have been proposed. A deep-learning denoising method suitable for real-world environments, however, must strike a balance between denoising performance and computational cost. We propose parameter pruning and parameter quantization techniques: pruning removes redundant channels from the deep neural network, while quantization shrinks the overall model by clustering its parameters. Because the two techniques rest on different principles, they can be applied jointly to a suitable speech denoising network to yield an even more compact architecture. Our experiments show that combining pruning and quantization reduces the network to 10.03% of its original size, with drops of only 1.43% in STOI and 3.24% in PESQ relative to the original architecture. Parameter pruning and quantization can therefore be applied effectively to speech denoising systems on devices with limited computational resources. | zh_TW |
dc.description.abstract | Most recent studies on deep-learning-based speech enhancement (SE) have focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in real scenarios. In this study, we propose a novel parameter pruning (PP) technique, which removes redundant channels in a neural network. In addition, a parameter quantization (PQ) technique is applied to reduce the size of a neural network by representing weights with fewer cluster centroids. Because the two techniques are derived from different concepts, PP and PQ can be integrated to provide even more compact SE models. The experimental results show that the PP and PQ techniques produce a compact SE model whose size is only 10.03% of the original model's, with minor performance losses of 1.43% (from 0.70 to 0.69) in STOI and 3.24% (from 1.85 to 1.79) in PESQ. These promising results suggest that the PP and PQ techniques can be used in an SE system on devices with limited storage and computation resources. (A toy code sketch of both techniques is given after the metadata table below.) | en |
dc.description.provenance | Made available in DSpace on 2021-06-17T07:34:05Z (GMT). No. of bitstreams: 1 ntu-108-R05943114-1.pdf: 6210231 bytes, checksum: 0210bd6cd69f6d680a365e117e0d9e76 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | Thesis Committee Certification i
Acknowledgements ii
Chinese Abstract iii
ABSTRACT iv
CONTENTS v
LIST OF FIGURES viii
LIST OF TABLES xi
Chapter 1 Background 1
1.1 Problem of Speech in a Noisy Environment 1
1.2 Traditional Speech Enhancement Approaches 4
1.2.1 Noisy Speech Spectrum Model 4
1.2.2 MMSE Algorithm 7
1.2.3 MAPA Algorithm 7
1.2.4 MLSA Algorithm 8
1.3 Masking-Based Speech Enhancement 9
1.3.1 Ideal Binary Mask (IBM) 10
1.3.2 Ideal Ratio Mask (IRM) 11
1.3.3 Spectral Magnitude Mask (SMM) 11
1.3.4 Complex Ideal Ratio Mask (cIRM) 11
1.4 Fundamentals of Deep Learning 12
1.4.1 Artificial Neural Networks 13
1.4.2 Common Activation Functions 15
1.4.3 Process of Optimization 16
1.5 Introduction to Neural Network Models for Speech Enhancement 23
1.5.1 Deep Neural Network 24
1.5.2 Fully Convolutional Network 27
Chapter 2 Introduction 31
2.1 Motivation 31
2.2 Organization 33
Chapter 3 Speech Enhancement Model and the Proposed PP and PQ Techniques 34
3.1 Speech Enhancement Model 34
3.1.1 Waveform Processing 34
3.1.2 FCN Enhancement System 36
3.2 The Parameter Pruning (PP) Technique 37
3.2.1 FCN-Based Waveform Mapping 37
3.2.2 Definition of Sparsity 38
3.2.3 Channel Pruning 38
3.3 The Parameter Quantization (PQ) Technique 39
3.4 The Integration of PP and PQ 40
Chapter 4 Experiments 42
4.1 Experimental Setup 42
4.2 Experimental Results 43
4.2.1 FCN SE Model 43
4.2.2 Parameter Quantization (PQ) 45
4.2.3 Parameter Pruning (PP) 49
4.2.4 The Integration of PP and PQ 51
4.2.5 Model Comparison 53
Chapter 5 Discussion 55
Chapter 6 Conclusion 57
References 58 | |
dc.language.iso | en | |
dc.title | Compacting Deep Learning Models for Speech Denoising with Parameter Pruning and Quantization Techniques | zh_TW |
dc.title | Increasing Compactness of Deep Learning Based Speech Enhancement Models with Parameter Pruning and Quantization Techniques | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | Master |
dc.contributor.oralexamcommittee | 吳安宇(An-Yeu Wu),曹昱(Yu Tsao) | |
dc.subject.keyword | Compact Architecture, Parameter Pruning, Parameter Quantization, Low Computational Cost | zh_TW |
dc.subject.keyword | Compactness, Parameter Pruning, Parameter Quantization, Low Computational Cost | en |
dc.relation.page | 66 | |
dc.identifier.doi | 10.6342/NTU201900728 | |
dc.rights.note | Paid authorization |
dc.date.accepted | 2019-05-20 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Electronics Engineering | zh_TW |
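The abstracts describe the two compression steps only at a high level: pruning drops whole channels, and quantization replaces individual weights with shared cluster centroids. As a rough illustration of those two ideas (a minimal sketch, not the thesis's actual implementation), the Python snippet below prunes the output channels of a convolution-style weight tensor by their L1 norms and then quantizes the surviving weights with a small k-means codebook. Every name, shape, and hyperparameter here (`prune_channels`, `quantize_weights`, `keep_ratio=0.5`, `n_clusters=16`) is an assumption made for this example, not a value taken from the thesis.

```python
# Toy sketch of channel pruning + weight quantization (hypothetical, not the thesis code).
import numpy as np

def prune_channels(weights, keep_ratio=0.5):
    """Drop output channels whose filters have the smallest L1 norms.

    weights: array of shape (out_channels, in_channels, kernel_size),
    as in a 1-D convolution layer of an FCN-style speech enhancement model.
    """
    l1 = np.abs(weights).sum(axis=(1, 2))        # one sparsity score per output channel
    n_keep = max(1, int(len(l1) * keep_ratio))   # how many channels survive
    keep = np.argsort(l1)[-n_keep:]              # indices of the strongest channels
    return weights[np.sort(keep)]                # preserve original channel order

def quantize_weights(weights, n_clusters=16, n_iters=20):
    """Replace each weight with its nearest of n_clusters centroids (1-D k-means),
    so the layer can be stored as small integer codes plus a tiny codebook."""
    flat = weights.reshape(-1)
    # initialize centroids evenly over the weight range
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # assign each weight to its nearest centroid
        codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        # move each centroid to the mean of its assigned weights
        for k in range(n_clusters):
            members = flat[codes == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids[codes].reshape(weights.shape), codes, centroids

# Usage: a random "layer" of 8 channels, pruned to 4, then quantized to 16 values.
layer = np.random.randn(8, 4, 16)
pruned = prune_channels(layer, keep_ratio=0.5)
quantized, codes, codebook = quantize_weights(pruned, n_clusters=16)
print(pruned.shape, np.unique(quantized).size)   # (4, 4, 16) and at most 16 distinct values
```

After quantization, each weight can be stored as a small integer index into the shared codebook rather than as a full floating-point value; combined with the removed channels, this is the kind of model-size reduction the abstracts quantify.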
Appears in Collections: | Graduate Institute of Electronics Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (currently not authorized for public access) | 6.06 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.