Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98383

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳沛遠 | zh_TW |
| dc.contributor.advisor | Pei-Yuan Wu | en |
| dc.contributor.author | 鍾乙綾 | zh_TW |
| dc.contributor.author | I-Ling Chung | en |
| dc.date.accessioned | 2025-08-05T16:09:14Z | - |
| dc.date.available | 2025-08-06 | - |
| dc.date.copyright | 2025-08-05 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-29 | - |
| dc.identifier.citation | [1] R. Bellman and R. Kalaba. On adaptive control processes. IRE Transactions on Automatic Control, 4(2):1–9, 1959.
[2] R. S. Bonnici, M. Benning, and C. Saitis. Timbre transfer with variational autoencoding and cycle-consistent adversarial networks. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2022.
[3] L. Comanducci, F. Antonacci, and A. Sarti. Timbre transfer using image-to-image denoising diffusion implicit models. In Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy, November 5-9, 2023, pages 257–263, 2023.
[4] O. Cífka, A. Ozerov, U. Şimşekli, and G. Richard. Self-supervised VQ-VAE for one-shot music style transfer. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 96–100, 2021.
[5] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang. The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech. In 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pages 1–9, 2013.
[6] J. H. Engel, L. Hantrakul, C. Gu, and A. Roberts. DDSP: Differentiable digital signal processing. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[7] Essen Associative Code (EsAC). Essen folk song database.
[8] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse. TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[9] J.-S. R. Jang and H.-R. Lee. A general framework of progressive filtering and its application to query by singing/humming. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):350–358, 2008.
[10] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi. Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In Interspeech, 2019.
[11] J. Kim, J. Salamon, P. Li, and J. Bello. CREPE: A convolutional representation for pitch estimation. In ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[12] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1558–1566. PMLR, 2016.
[13] C. I. Leong, I.-L. Chung, K.-F. Chao, J.-Y. Wang, Y.-H. Yang, and J.-S. R. Jang. Music2Fail: Transfer music to failed recorder style. In 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1–5, 2024.
[14] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. B. Dannenberg, R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang, Z. Wang, Y. Guo, and J. Fu. MERT: Acoustic music understanding model with large-scale self-supervised training. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[15] J. Liu, C. Li, Y. Ren, Z. Zhu, and Z. Zhao. Learning the beauty in songs: Neural singing voice beautifier. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7970–7983, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[16] M. Mancusi, Y. Halychanskyi, K. W. Cheuk, E. Moliner, C.-H. Lai, S. Uhlich, J. Koo, M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro, and Y. Mitsufuji. Latent diffusion bridges for unsupervised musical audio timbre transfer. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025.
[17] C. Myers, L. Rabiner, and A. Rosenberg. Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(6):623–635, 1980.
[18] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson. AutoVC: Zero-shot voice style transfer with only autoencoder loss. In International Conference on Machine Learning, pages 5210–5219. PMLR, 2019.
[19] T. Shibuya, Y. Takida, and Y. Mitsufuji. BigVSAN: Enhancing GAN-based neural vocoders with slicing adversarial network. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10121–10125, 2024.
[20] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[21] L. Wang, S. Huang, S. Hu, J. Liang, and B. Xu. An effective and efficient method for query by humming system based on multi-similarity measurement fusion. In 2008 International Conference on Audio, Language and Image Processing, pages 471–475, 2008.
[22] Y. Wu, Y. He, X. Liu, Y. Wang, and R. B. Dannenberg. Transplayer: Timbre style transfer with flexible timbre control. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[23] E. Yumoto, W. J. Gould, and T. Baer. Harmonics-to-noise ratio as an index of the degree of hoarseness. The Journal of the Acoustical Society of America, 71(6):1544–1550, 1982.
[24] L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao. M4Singer: A multi-style, multi-singer and musical score provided Mandarin singing corpus. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[25] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, 2017. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98383 | - |
| dc.description.abstract | 音色轉換的目標是在保留輸入音訊內容的同時,改變其音色。在本研究中,我們深入探討「失敗音樂音色風格轉換」(failed music timbre transfer)任務,並開發一套能夠進行歌聲轉換為「破音直笛」的音色轉換系統,藉由屬性向量(attribute vector)實現對演奏失敗程度的可控調節。為了解決音色轉換領域中,特別是在「失敗音樂」情境下缺乏客觀評估標準的問題,我們引入一組客觀評估指標:用以捕捉病態聲音特徵的諧波噪音比(Harmonics-to-Noise Ratio, HNR)、用以衡量音高輪廓一致性的動態時間校正距離(Dynamic Time Warping, DTW),以及根據哼唱選歌(Query by Singing/Humming, QbSH)設計的旋律辨識度指標。實驗結果顯示,這些指標與人類感知高度一致,能有效反映可控的演奏失敗程度。我們的研究為音色轉換任務中的表現劣化評估與控制提供了穩健的基礎。 | zh_TW |
| dc.description.abstract | The goal of timbre transfer is to modify the timbre of an input audio signal while preserving its content. In this work, we conduct an in-depth investigation into the failed music timbre transfer task by developing a vocal-to-failed-recorder timbre transfer system with an attribute vector that controls the degree of poor performance. To address the lack of objective evaluation criteria in timbre transfer, particularly for failed music, we introduce a set of objective metrics: Harmonics-to-Noise Ratio (HNR) for capturing pathological sound traits, Dynamic Time Warping (DTW) distance for assessing pitch contour consistency (see the illustrative sketch after the metadata table below), and Query by Singing/Humming (QbSH)-based metrics for quantifying melodic identity preservation. Experiments show that these metrics align well with human perception and effectively reflect the controllable degree of poor performance. Our work offers a robust foundation for evaluating and controlling performance degradation in timbre transfer tasks. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-05T16:09:14Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-05T16:09:14Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Preliminaries 5
2.1 Features of the Failed Recorder 5
2.2 Dynamic Time Warping, DTW 6
2.3 Query by Singing/Humming, QbSH 6
Chapter 3 Related Work 9
3.1 Timbre Transfer with Unpaired Datasets 9
3.2 Objective Metrics for Timbre Transfer 10
Chapter 4 Methodology 11
4.1 Training 11
4.2 Inference 14
Chapter 5 Experiments 17
5.1 Datasets 17
5.2 Implementation Details 18
5.2.1 Data Preprocessing 18
5.2.2 Model Setting 18
5.2.3 Augmentation 19
5.2.4 Vocoder 19
5.2.5 Pitch Extraction 19
5.3 Evaluation 20
5.4 Controllable Degree of Poor Performance 23
5.5 QbSH-based Metrics for Evaluating Melodic Identity Preservation 23
5.5.1 Subjective Evaluation 24
Chapter 6 Analysis 27
6.1 Ablation Study 27
6.2 Mel Spectrogram Analysis 28
6.3 t-SNE Analysis 29
Chapter 7 Conclusion 31
References 33
Appendix A — Effect of Latent Space Manipulation 39 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 音色風格轉換 | zh_TW |
| dc.subject | 失敗音樂 | zh_TW |
| dc.subject | 哼唱選歌 | zh_TW |
| dc.subject | 客觀指標 | zh_TW |
| dc.subject | 屬性向量 | zh_TW |
| dc.subject | attribute vector | en |
| dc.subject | objective metrics | en |
| dc.subject | query by singing/humming | en |
| dc.subject | timbre transfer | en |
| dc.subject | failed music | en |
| dc.title | Vocal2Fail:演奏不佳程度可控的人聲至破音直笛風格轉換 | zh_TW |
| dc.title | Vocal2Fail: Controllable Timbre Transfer and Evaluation for Failed Recorder Style | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master | - |
| dc.contributor.oralexamcommittee | 楊奕軒;蘇黎 | zh_TW |
| dc.contributor.oralexamcommittee | Yi-Hsuan Yang;Li Su | en |
| dc.subject.keyword | 失敗音樂,音色風格轉換,屬性向量,客觀指標,哼唱選歌 | zh_TW |
| dc.subject.keyword | failed music, timbre transfer, attribute vector, objective metrics, query by singing/humming | en |
| dc.relation.page | 40 | - |
| dc.identifier.doi | 10.6342/NTU202502619 | - |
| dc.rights.note | Authorized for release (worldwide open access) | - |
| dc.date.accepted | 2025-07-31 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Graduate Institute of Communication Engineering | - |
| dc.date.embargo-lift | 2025-08-06 | - |
| Appears in Collections: | Graduate Institute of Communication Engineering | |
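To make the DTW pitch-contour metric named in the abstract concrete, here is a minimal sketch in plain NumPy. It is an illustration under stated assumptions, not the thesis's evaluation code: the record does not specify the DTW variant, local cost, or pitch representation, so this sketch assumes a classic unconstrained DTW with an absolute-difference local cost over contours given as MIDI note numbers, and all names and values below are hypothetical.

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Accumulated-cost dynamic time warping distance between two
    1-D pitch contours, using the classic O(n*m) recursion."""
    n, m = len(x), len(y)
    # cost[i, j]: minimal cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # local pitch difference
            cost[i, j] = d + min(cost[i - 1, j],      # x advances, y repeats
                                 cost[i, j - 1],      # y advances, x repeats
                                 cost[i - 1, j - 1])  # both advance
    return float(cost[n, m])

# Hypothetical contours (MIDI note numbers): the source vocal versus the
# converted failed-recorder output; a smaller distance means the melody
# survived the transfer more faithfully.
source_f0 = np.array([60.0, 60.5, 62.0, 64.0, 64.0])
converted_f0 = np.array([60.2, 61.0, 62.5, 63.8])
print(dtw_distance(source_f0, converted_f0))
```

In practice the contours would first be extracted with a pitch tracker such as CREPE [11] (the table of contents lists pitch extraction as a separate implementation step), and the distance could be normalized by path length before comparing clips of different durations.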
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf | 4.21 MB | Adobe PDF |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
