NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15788
Full metadata record (DC field: value [language])
dc.contributor.advisor: 張智星 (Jyh-Shing Jang)
dc.contributor.author: Chih-hao Wang [en]
dc.contributor.author: 王智顥 [zh_TW]
dc.date.accessioned: 2021-06-07T17:52:09Z
dc.date.copyright: 2020-08-07
dc.date.issued: 2020
dc.date.submitted: 2020-08-05
dc.identifier.citation:
[1] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[2] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
[3] N. Brümmer, D. Baum, et al., “ABC system description for NIST SRE 2010,” in Proc. NIST 2010 Speaker Recognition Evaluation Workshop, 2010.
[4] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056, IEEE, 2014.
[5] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115–5119, IEEE, 2016.
[6] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 165–170, IEEE, 2016.
[7] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification.,” in Interspeech, pp. 999–1003, 2017.
[8] L. Yang and R. Jin, “Distance metric learning: A comprehensive survey,” Michigan State University, vol. 2, no. 2, p. 4, 2006.
[9] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 539–546, IEEE, 2005.
[10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699, 2019.
[11] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220, 2017.
[12] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274, 2018.
[13] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
[14] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision, pp. 499–515, Springer, 2016.
[15] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883, IEEE, 2018.
[16] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.
[17] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
[18] C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances.,” in Interspeech, pp. 1487–1491, 2017.
[19] H. Bredin, “Tristounet: triplet loss for speaker turn embedding,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5430–5434, IEEE, 2017.
[20] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” arXiv preprint arXiv:1804.05160, 2018.
[21] G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker recognition: Modular or monolithic?,” in Proc. Interspeech, pp. 1143–1147, 2019.
[22] Z. Li, C. Xu, and B. Leng, “Angular triplet-center loss for multi-view 3d shape retrieval,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8682–8689, 2019.
[23] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5, IEEE, 2017.
[24] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[25] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with sincnet,” in 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028, IEEE, 2018.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[27] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” arXiv preprint arXiv:1803.10963, 2018.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.
[29] T. Abdullah, Y. Bazi, M. M. Al Rahhal, M. L. Mekhalfi, L. Rangarajan, and M. Zuair, “Textrs: Deep bidirectional triplet network for matching text to remote sensing images,” Remote Sensing, vol. 12, no. 3, p. 405, 2020.
[30] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained softmax loss for discriminative face verification,” arXiv preprint arXiv:1703.09507, 2017.
[31] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6419–6423, IEEE, 2020.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15788
dc.description.abstract: Deep learning combined with metric learning has been shown to be effective for discriminating facial features. The success in computer vision of the Additive Angular Margin Softmax Loss, which imposes an angular-margin constraint, has also driven progress in speaker verification. This thesis trains end-to-end speaker verification models with the angle-based Angular Triplet Loss and Angular Triplet Center Loss, uses the Equal Error Rate and the Cprimary defined by NIST SRE as evaluation metrics, and evaluates the models on the public Mandarin speech dataset Aishell1. The best model in this study achieves a relative improvement of 7.4% in average Equal Error Rate and 6.1% in average Cprimary over the Additive Angular Margin Softmax Loss model. [zh_TW]
dc.description.abstract: Deep metric learning has proven to be an effective way to learn discriminative embeddings for face recognition. The success in computer vision of a modified softmax loss, the additive angular margin softmax loss, has also driven progress in training speaker recognition models. We introduce the angular triplet loss and the angular triplet center loss into end-to-end speaker verification. Experiments are conducted on the Aishell1 dataset, with performance measured by equal error rate (EER) and Cprimary. By testing combinations of different loss functions with the angular triplet loss and the angular triplet center loss, our best model achieves a relative improvement of 7.4% in average EER and 6.1% in average Cprimary over the additive angular margin softmax loss. [en]
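The abstract names three margin-based objectives. The following is a minimal sketch, not the thesis's implementation: the scale `s`, the margins, and the batching conventions below are illustrative assumptions. It shows how the additive angular margin softmax baseline and the two proposed angular losses can be written in PyTorch, with Euclidean distances replaced by angles between L2-normalised embeddings.

```python
import torch
import torch.nn.functional as F

def angular_dist(x, y, eps=1e-7):
    # Angle (in radians) between embeddings: arccos of cosine similarity.
    cos = (F.normalize(x, dim=-1) * F.normalize(y, dim=-1)).sum(-1)
    return torch.acos(cos.clamp(-1 + eps, 1 - eps))

def aam_softmax_logits(emb, weight, labels, s=30.0, m=0.2):
    # Additive angular margin softmax (ArcFace-style) baseline: add a
    # margin m to the angle of the target class, then rescale by s.
    cos = F.normalize(emb, dim=-1) @ F.normalize(weight, dim=-1).t()
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    onehot = F.one_hot(labels, num_classes=weight.size(0)).to(theta.dtype)
    return s * torch.cos(theta + m * onehot)  # feed to cross-entropy

def angular_triplet_loss(anchor, positive, negative, margin=0.1):
    # Triplet hinge on angular distance: same-speaker pairs should be
    # closer in angle than different-speaker pairs by at least `margin`.
    return F.relu(angular_dist(anchor, positive)
                  - angular_dist(anchor, negative) + margin).mean()

def angular_triplet_center_loss(emb, labels, centers, margin=0.1):
    # Triplet-center variant: each embedding should lie closer (in angle)
    # to its own class center than to the nearest other-class center.
    pos = angular_dist(emb, centers[labels])                        # (B,)
    all_ang = angular_dist(emb.unsqueeze(1), centers.unsqueeze(0))  # (B, C)
    all_ang = all_ang.scatter(1, labels.unsqueeze(1), float('inf'))
    neg = all_ang.min(dim=1).values          # nearest other-class center
    return F.relu(pos - neg + margin).mean()
```

In the thesis these losses are also combined with each other and with classification objectives (see §4.3 and §5.4 in the table of contents below); the margin and scale values above are placeholders, not the tuned settings of §5.1.2.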
dc.description.provenance: Made available in DSpace on 2021-06-07T17:52:09Z (GMT). No. of bitstreams: 1. U0001-0208202022505600.pdf: 11280531 bytes, checksum: 21a1f5e906e22c08b0649e99d5f91bc3 (MD5). Previous issue date: 2020 [en]
dc.description.tableofcontents:
Acknowledgements
摘要 (Chinese Abstract)
Abstract
1 Introduction
1.1 Topic Overview
1.2 Method Overview
1.3 Chapter Outline
2 Literature Review
2.1 Gaussian Mixture Models
2.2 Neural Networks
2.3 Metric Learning
3 Datasets
3.1 Aishell1
3.2 VoxCeleb1
4 Research Methods
4.1 Acoustic Features
4.1.1 Mel-Frequency Cepstral Coefficients
4.2 Neural Network Architecture
4.2.1 One-Dimensional ResNet28
4.2.2 Attentive Statistics Pooling
4.3 Loss Functions
4.3.1 Softmax Loss
4.3.2 Additive Angular Margin Softmax Loss
4.3.3 Generalized End-to-end Loss
4.3.4 Triplet Loss
4.3.5 Angular Triplet Loss
4.3.6 Angular Triplet Center Loss
4.4 Evaluation Metrics (see the metric sketch after this outline)
4.4.1 Equal Error Rate
4.4.2 Cprimary
5 Experimental Results
5.1 Experimental Setup
5.1.1 Dataset Setup
5.1.2 Loss Function Setup
5.1.3 Neural Network Parameter Setup
5.1.4 Training and Testing Setup
5.1.5 Cprimary Parameter Setup
5.2 VoxCeleb1
5.3 Experiment 1: Single-Loss-Function Models
5.4 Experiment 2: Combined-Loss-Function Models
5.5 Experiment 3: Pre-trained Models
5.6 General Discussion
6 Conclusion
6.1 Conclusions
6.2 Future Work
Bibliography
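Sections 4.4.1 and 4.4.2 of the outline above name the two evaluation metrics used throughout the experiments. As a hedged sketch under the standard NIST-style definitions: the priors and costs in `c_primary` below are assumptions, not necessarily the settings of §5.1.5.

```python
import numpy as np

def error_rates(scores, labels):
    # Miss (false-rejection) and false-alarm rates at every threshold.
    # scores: similarity scores; labels: 1 for target (same-speaker) trials.
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    labels = labels[np.argsort(-scores)]              # sort trials by score, descending
    fnr = 1 - np.cumsum(labels) / labels.sum()        # targets rejected at each cutoff
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # non-targets accepted
    return fnr, fpr

def equal_error_rate(scores, labels):
    # EER: the operating point where miss rate equals false-alarm rate.
    fnr, fpr = error_rates(scores, labels)
    i = np.argmin(np.abs(fnr - fpr))
    return (fnr[i] + fpr[i]) / 2

def c_primary(scores, labels, priors=(0.01, 0.005), c_miss=1.0, c_fa=1.0):
    # NIST SRE-style primary cost: the minimum normalized detection cost,
    # averaged over target priors (SRE16-style values here, an assumption).
    fnr, fpr = error_rates(scores, labels)
    costs = []
    for p in priors:
        dcf = c_miss * p * fnr + c_fa * (1 - p) * fpr
        costs.append(dcf.min() / min(c_miss * p, c_fa * (1 - p)))
    return float(np.mean(costs))
```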
dc.language.iso: zh-TW
dc.subject: 度量學習 (metric learning) [zh_TW]
dc.subject: 語者辨識 (speaker recognition) [zh_TW]
dc.subject: 聲紋辨識 (voiceprint recognition) [zh_TW]
dc.subject: 語者驗證 (speaker verification) [zh_TW]
dc.title: 基於深度學習的端到端語者驗證系統之損失函數的研究 [zh_TW]
dc.title: A Study on Loss Functions in End-to-end DNN-based Speaker Verification [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 李宏毅 (Hung-yi Lee), 廖元甫 (Yuan-fu Liao), 林其翰 (Chi-han Lin)
dc.subject.keyword: 語者辨識 (speaker recognition), 聲紋辨識 (voiceprint recognition), 語者驗證 (speaker verification), 度量學習 (metric learning) [zh_TW]
dc.subject.keyword: speaker recognition, speaker verification, metric learning [en]
dc.relation.page: 54
dc.identifier.doi: 10.6342/NTU202002234
dc.rights.note: 未授權 (not authorized)
dc.date.accepted: 2020-08-05
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) [zh_TW]
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: U0001-0208202022505600.pdf (restricted; not publicly accessible)
Size: 11.02 MB
Format: Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
