Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80011
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 李琳山 (Lin-shan Lee)
dc.contributor.author: Chien-yu Huang (en)
dc.contributor.author: 黃健祐 (zh_TW)
dc.date.accessioned: 2022-11-23T09:20:57Z
dc.date.available: 2021-08-06
dc.date.available: 2022-11-23T09:20:57Z
dc.date.copyright: 2021-08-06
dc.date.issued: 2021
dc.date.submitted: 2021-07-27
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80011
dc.description.abstract: Voice conversion is a technique that modifies certain specific information in a speech signal without changing the linguistic content or phoneme structure of the utterance; the most common case is speaker conversion, which changes the vocal timbre. In recent years, with advances in machine learning and related techniques, any-to-any speaker conversion has become possible, that is, the vocal timbre of an arbitrary utterance can be changed to that of an arbitrary other speaker. With existing techniques, however, successful voice conversion requires clean, noise-free speech as input. On the other hand, as the technology keeps improving, voice conversion could well be used to forge other people's voices. How to strengthen voice conversion in noisy environments, and how to protect our vocal timbre from being "stolen", have therefore become important research directions.

This thesis first analyzes the robustness of existing voice conversion models in noisy environments by adding noise to the input signal and measuring how much the conversion result is distorted. To improve performance, the thesis uses speech enhancement models as a preprocessing step that removes noise from the signal before it is fed to the voice conversion model. It also proposes training the voice conversion model with a denoising loss so that conversion distortion is reduced even without that preprocessing. Experimental results show that both methods effectively improve voice conversion in noisy environments, and that end-to-end denoising improves conversion quality more than representation-level processing.

Next, the thesis proposes three adversarial attacks on voice conversion models: perturbations imperceptible to humans are added to the input signal so that voice conversion no longer succeeds, which serves as a means of protecting personal speech from abuse. The thesis also uses speech enhancement to "defend" against the proposed attacks. Experimental results show that, even when the model parameters are unknown, the proposed attacks can drastically change the model's conversion result so that its speaker timbre differs greatly from the intended one; on the other hand, speech enhancement models not only handle ordinary, humanly perceivable noise but can also, to some extent, remove the tiny adversarial perturbations and improve the conversion results of voice conversion models under attack. (zh_TW)
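A minimal sketch, assuming a PyTorch-style setup, of the speaker-representation attack idea summarized in the abstract: a projected-gradient loop keeps a perturbation within an imperceptibility budget eps while pushing a speaker encoder's embedding away from the speaker's own embedding, so that a voice conversion model relying on that embedding no longer reproduces the original timbre. This is not the thesis's code; `speaker_encoder`, `embedding_attack`, and the hyperparameter values are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def embedding_attack(speaker_encoder, waveform, eps=0.005, alpha=0.001, steps=50):
    """Return `waveform` plus an eps-bounded perturbation that pushes the
    speaker encoder's embedding away from the clean utterance's embedding."""
    speaker_encoder.eval()
    with torch.no_grad():
        clean_emb = speaker_encoder(waveform)                # embedding of the untouched utterance

    delta = torch.zeros_like(waveform, requires_grad=True)   # perturbation, constrained to [-eps, eps]
    for _ in range(steps):
        adv_emb = speaker_encoder(waveform + delta)
        loss = -F.mse_loss(adv_emb, clean_emb)                # negative distance: minimizing it grows the distance
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                # signed-gradient (PGD-style) step
            delta.clamp_(-eps, eps)                           # project back into the imperceptibility ball
        delta.grad.zero_()
    return (waveform + delta).detach()
```

Under this threat model, the perturbed utterance would be shared in place of the clean one, so that any-to-any conversion models using it as the target-speaker reference produce a timbre far from the real speaker; the defense experiments described in the abstract then test whether speech enhancement can strip such perturbations away.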
dc.description.provenance: Made available in DSpace on 2022-11-23T09:20:57Z (GMT). No. of bitstreams: 1. U0001-1607202115262400.pdf: 3419119 bytes, checksum: 865f13db22aa5ed63cd8760eb2c166ca (MD5). Previous issue date: 2021. (en)
dc.description.tableofcontents:
Acknowledgements
Chinese Abstract
English Abstract
Chapter 1: Introduction
  1.1 Research Motivation
  1.2 Research Directions and Contributions
  1.3 Chapter Organization
Chapter 2: Background
  2.1 Deep Neural Networks
    2.1.1 Overview
    2.1.2 Convolutional Neural Networks
    2.1.3 Recurrent Neural Networks
  2.2 Adversarial Attacks
    2.2.1 Overview
    2.2.2 Targeted and Untargeted Attacks
    2.2.3 Black-box and White-box Attacks
  2.3 Voice Conversion
    2.3.1 Overview
    2.3.2 Any-to-any Voice Conversion
  2.4 Speech Enhancement
  2.5 Chapter Summary
Chapter 3: Voice Conversion Performance in Ordinary Noisy Environments
  3.1 Overview
  3.2 Voice Conversion Based on Speech Enhancement
  3.3 Voice Conversion Trained with Denoising Losses
    3.3.1 End-to-end Denoising
    3.3.2 Representation-level Denoising
  3.4 Experimental Setup
    3.4.1 Datasets
    3.4.2 Experiments Cascading with Speech Enhancement Models
    3.4.3 Implementation Details
  3.5 Experimental Results and Analysis
    3.5.1 Evaluation Methods
    3.5.2 Experimental Results
  3.6 Chapter Summary
Chapter 4: Adversarial Attacks on Voice Conversion and Privacy Protection
  4.1 Overview
  4.2 End-to-end Adversarial Attack
  4.3 Adversarial Attack Based on Speaker Representations
  4.4 Adversarial Attack Based on Speaker Representations of the Converted Output
  4.5 Defense Based on Speech Enhancement
  4.6 Protecting Personal Speech with Adversarial Attacks
  4.7 Experimental Setup
    4.7.1 Models and Datasets
    4.7.2 Implementation Details
  4.8 Experimental Results and Analysis
    4.8.1 Evaluation Methods
    4.8.2 Experimental Results
  4.9 Chapter Summary
Chapter 5: Conclusion and Future Work
  5.1 Research Contributions and Discussion
  5.2 Future Work
References
dc.language.iso: zh-TW
dc.subject: 雜訊強健性 (zh_TW)
dc.subject: 對抗式攻擊 (zh_TW)
dc.subject: 語音轉換 (zh_TW)
dc.subject: Noise Robustness (en)
dc.subject: Voice Conversion (en)
dc.subject: Adversarial Attack (en)
dc.title: 語音轉換之雜訊強健性及隱私保護 (zh_TW)
dc.title: Noise Robustness and Privacy Protection in Voice Conversion (en)
dc.date.schoolyear: 109-2
dc.description.degree: 碩士 (Master's)
dc.contributor.author-orcid: 0000-0003-4927-1293
dc.contributor.oralexamcommittee: 鄭秋豫 (Hsin-Tsai Liu), 王小川 (Chih-Yang Tseng), 陳信宏, 簡仁宗, 李宏毅
dc.subject.keyword: 語音轉換, 雜訊強健性, 對抗式攻擊 (zh_TW)
dc.subject.keyword: Voice Conversion, Noise Robustness, Adversarial Attack (en)
dc.relation.page: 99
dc.identifier.doi: 10.6342/NTU202101515
dc.rights.note: 同意授權(全球公開) (authorization granted, open access worldwide)
dc.date.accepted: 2021-07-27
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 電機工程學研究所 (Graduate Institute of Electrical Engineering) (zh_TW)
Appears in Collections: 電機工程學系 (Department of Electrical Engineering)

Files in This Item:
File: U0001-1607202115262400.pdf (3.34 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
