Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80035

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 徐宏民(Winston Hsu) | |
| dc.contributor.author | Po-Yu Wu | en |
| dc.contributor.author | 吳柏鋙 | zh_TW |
| dc.date.accessioned | 2022-11-23T09:22:21Z | - |
| dc.date.available | 2021-08-23 | |
| dc.date.available | 2022-11-23T09:22:21Z | - |
| dc.date.copyright | 2021-08-23 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-08-11 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80035 | - |
| dc.description.abstract | In this thesis we present a practical, flexible, and effective method for inpainting long segments of audio. The framework, named SLAIN, is based on conditional generative adversarial networks and can restore corrupted portions of audio, including various sound effects and instrument recordings. We adapt an architecture originating in style transfer with carefully designed modifications, so that the method operates on undeformed audio spectrograms and is evaluated with respect to human acoustic perception. Moreover, integration with a recent neural vocoder yields output audio quality considerably better than the classical Griffin-Lim algorithm. Besides the reconstruction loss and the adversarial loss, the pre-trained vocoder supplies an additional acoustic loss to guide the model. In experiments on two challenging datasets, human evaluation with the mean opinion score (MOS) shows that our method handles corruptions of flexible length and can inpaint gaps of up to 1 second in 1.5-second audio samples at 44.1 kHz, a common sampling rate. The generated audio scores above 4 out of a maximum of 5 on MOS, indicating the best performance among existing long audio inpainting methods. | zh_TW |
| dc.description.provenance | Made available in DSpace on 2022-11-23T09:22:21Z (GMT). No. of bitstreams: 1 U0001-1607202118463800.pdf: 3482085 bytes, checksum: ca6fbf0f9af3758efa34d6637fec0dbc (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i; Acknowledgements ii; 摘要 iii; Abstract iv; Contents vi; List of Figures viii; List of Tables ix; Chapter 1 Introduction 1; Chapter 2 Related Work 4; Chapter 3 Proposed Methods 8; 3.1 Audio analysis 10; Chapter 4 Experiments 11; 4.1 Results 13; Chapter 5 Conclusion 16; References 17; Appendix A — Further Details 25; A.1 Deformation of the Mel spectrogram 25; A.2 Discussion of anomaly detection 25; A.3 Free-form mask comparison 26; A.3.1 Additional LJSpeech dataset 28; A.4 Failure samples 28; A.5 Training curves 30 | |
| dc.language.iso | en | |
| dc.subject | 平均主觀意見分 | zh_TW |
| dc.subject | 音訊修補 | zh_TW |
| dc.subject | 條件對抗式網路 | zh_TW |
| dc.subject | 聲碼器 | zh_TW |
| dc.subject | 聲學 | zh_TW |
| dc.subject | cGANs | en |
| dc.subject | MOS | en |
| dc.subject | Acoustic | en |
| dc.subject | Vocoder | en |
| dc.subject | Audio Inpainting | en |
| dc.title | 基於條件對抗式網路進行長片段音訊修補 | zh_TW |
| dc.title | SLAIN: A Second Long Audio Inpainting with Conditional GAN. | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | 碩士 (Master) | |
| dc.contributor.coadvisor | 陳文進(Wen-Chin Chen) | |
| dc.contributor.oralexamcommittee | 余能豪(Hsin-Tsai Liu),葉梅珍(Chih-Yang Tseng),陳奕廷 | |
| dc.subject.keyword | 音訊修補,條件對抗式網路,聲碼器,聲學,平均主觀意見分, | zh_TW |
| dc.subject.keyword | Audio Inpainting,cGANs,Vocoder,Acoustic,MOS, | en |
| dc.relation.page | 30 | |
| dc.identifier.doi | 10.6342/NTU202101523 | |
| dc.rights.note | Authorized (worldwide open access) | |
| dc.date.accepted | 2021-08-13 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| Appears in Collections: | 資訊工程學系 (Department of Computer Science and Information Engineering) | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| U0001-1607202118463800.pdf | 3.4 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
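The abstract contrasts the thesis's neural-vocoder output with the classical Griffin-Lim algorithm, which recovers a waveform from a magnitude spectrogram by iterating between the STFT and its inverse. As an illustration only (this is not the thesis's implementation, and the function name, window size, and iteration count are assumptions), a minimal Griffin-Lim sketch using NumPy and SciPy:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, length, fs=44100, nperseg=256, n_iter=32, seed=0):
    """Estimate a phase for the magnitude spectrogram `mag` and return
    a time-domain signal of `length` samples (minimal sketch)."""
    rng = np.random.default_rng(seed)
    # Start from a random phase estimate of the same shape as `mag`.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    x = np.zeros(length)
    for _ in range(n_iter):
        # Invert the current magnitude/phase estimate to a waveform ...
        _, x = istft(mag * phase, fs=fs, nperseg=nperseg)
        # ... keep the length fixed so the next STFT matches `mag`'s shape ...
        x = x[:length] if x.size >= length else np.pad(x, (0, length - x.size))
        # ... then keep only the resulting phase, snapping magnitude back to `mag`.
        _, _, spec = stft(x, fs=fs, nperseg=nperseg)
        phase = np.exp(1j * np.angle(spec))
    return x

# Toy example: a 440 Hz tone, 0.5 s at 44.1 kHz (the sampling rate used in the thesis).
fs = 44100
t = np.arange(int(0.5 * fs)) / fs
y = np.sin(2 * np.pi * 440.0 * t)
_, _, spec = stft(y, fs=fs, nperseg=256)
y_hat = griffin_lim(np.abs(spec), length=y.size, fs=fs)
```

A spectrogram-inpainting model fills the masked region of the magnitude spectrogram; a step like the above (or, as in the thesis, a neural vocoder with generally better perceptual quality) then converts the completed magnitudes back to audio.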
