NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92773
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 楊奕軒 | zh_TW
dc.contributor.advisor | Yi-Hsuan Yang | en
dc.contributor.author | 藍雲瀚 | zh_TW
dc.contributor.author | Yun-Han Lan | en
dc.date.accessioned | 2024-06-24T16:05:58Z | -
dc.date.available | 2024-06-25 | -
dc.date.copyright | 2024-06-24 | -
dc.date.issued | 2024 | -
dc.date.submitted | 2024-06-15 | -
dc.identifier.citation | [1] A. Agostinelli, T. I. Denk, Z. Borsos, J. H. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. H. Frank. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
[2] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer. madmom: a new Python audio and music signal processing library. In Proc. ACM Multimedia, pages 1174–1178, 2016.
[3] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov. HTS-AT: A hierarchical token-semantic audio Transformer for sound classification and detection. In Proc. ICASSP, 2022.
[4] K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In Proc. ICASSP, 2024.
[5] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
[6] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez. Simple and controllable music generation. In Proc. NeurIPS, 2023.
[7] A. Défossez. Hybrid spectrogram and waveform source separation. In Proc. ISMIR 2021 Workshop on Music Source Separation, 2021.
[8] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
[9] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780–8794. Curran Associates, Inc., 2021.
[10] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[11] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Popular, classical, and jazz music databases. In Proc. ISMIR, 2002.
[12] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. CNN architectures for large-scale audio classification. In Proc. ICASSP, 2017.
[13] J. Ho and T. Salimans. Classifier-free diffusion guidance. In Proc. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[14] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank, J. Engel, Q. V. Le, W. Chan, Z. Chen, and W. Han. Noise2music: Text-conditioned music generation with diffusion models, 2023.
[15] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi. Fréchet Audio Distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018.
[16] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar. High-fidelity audio compression with improved RVQGAN. In Proc. NeurIPS, 2023.
[17] P. Li, B. Chen, Y. Yao, Y. Wang, A. Wang, and A. Wang. JEN-1: Text-guided universal music generation with omnidirectional diffusion models. arXiv preprint arXiv:2308.04729, 2024.
[18] L. Lin, G. Xia, J. Jiang, and Y. Zhang. Content-based controls for music large language modeling. arXiv preprint arXiv:2310.17162, 2023.
[19] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023.
[20] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
[21] J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria. Mustango: Toward controllable text-to-music generation. arXiv preprint arXiv:2311.08355, 2023.
[22] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park. A bi-directional Transformer for musical chord recognition. In Proc. ISMIR, 2019.
[23] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proc. ISMIR, pages 367–372, 2014.
[24] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. The MUSDB18 corpus for music separation, 2017.
[25] S. Rouard, F. Massa, and A. Défossez. Hybrid Transformers for music source separation. In Proc. ICASSP, 2023.
[26] J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng, et al. Introducing ChatGPT, 2022.
[27] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[29] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan. Music ControlNet: Multiple time-varying controls for music generation. arXiv preprint arXiv:2311.07069, 2023.
[30] S.-L. Wu and Y.-H. Yang. MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1953–1967, 2023.
[31] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In Proc. ICASSP, 2023.
[32] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi. SoundStream: An end-to-end neural audio codec. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 30:495–507, 2021.
[33] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proc. ICCV, 2023.
[34] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
[35] Y. Zhang, Y. Ikemiya, G. Xia, N. Murata, M. Martínez, W.-H. Liao, Y. Mitsufuji, and S. Dixon. MusicMagus: Zero-shot text-to-music editing via diffusion models. arXiv preprint arXiv:2402.06178, 2024.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92773 | -
dc.description.abstract | 現有的文字轉音樂模型能夠產生高品質且多樣化的音樂信號。然而,僅用文字提示無法精確控制生成音樂的時間特徵,如和弦與節奏。為了解決這個問題,我們引入了 MusiConGen,一個基於時序條件控制的 Transformer文字轉音樂模型,基於預訓練的 MusicGen 框架進行構建。本研究之貢獻為提出消費級GPU之高效微調(finetuning)機制,它集成了自動提取的和弦與節奏特徵作為控制信號。在推理(inference)過程中,控制信號可以是從參考音訊信號中提取的音樂特徵,或是使用者定義的符號(symbolic)和弦序列、BPM和文字提示。我們對兩個數據集進行的性能評估——一個來自提取的控制特徵,另一個來自使用者創建的輸入——證明 MusiConGen 能生成與指定時序控制良好對齊的逼真音樂。 | zh_TW
dc.description.abstract | Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted chords and rhythm features as the control signal. During inference, the control can either be musical features extracted from a reference audio signal, or be user-defined symbolic chord sequence, BPM, and textual prompts. Our performance evaluation on two datasets---one derived from extracted features and the other from user-created inputs---demonstrates that MusiConGen can generate realistic music that aligns well with the specified temporal control. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-06-24T16:05:58Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2024-06-24T16:05:58Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Master's thesis acceptance certificate i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
1.1 Overview of text-to-music generation 1
1.2 Concurrent studies of conditional music generation 2
1.3 Our proposed method for the challenges 3

Chapter 2 Background 7
2.1 Transformer model 7
2.1.1 Self-attention mechanism 7
2.1.2 Multi-head attention 9
2.1.3 Transformer Encoder 10
2.1.4 Transformer Decoder 10
2.2 Codec Models for Audio Representation 11
2.3 Classifier-Free Guidance 12
2.4 MusicGen Model 13
2.4.1 Model structure 13
2.4.2 Delay Pattern 14
2.4.3 Pretrained MusicGen Model 14

Chapter 3 Methodology 17
3.1 Representing Temporal & Symbolic Conditions 17
3.2 Finetuning Mechanisms 19

Chapter 4 Experimental Setup 23
4.1 Datasets 23
4.1.1 Training Dataset 23
4.1.2 Evaluation Dataset 23
4.2 Dataset Pre-processing Details 25
4.3 Training Configuration 26
4.4 Evaluation Metrics 27

Chapter 5 Evaluation and Discussion 29
5.1 Temporal Conditions Comparison 29
5.2 Finetuning Mechanism Comparison 30
5.3 Genre Condition Comparison 32
5.4 BPM Analysis 34
5.5 Subjective Evaluation 34

Chapter 6 Conclusion and Future work 39
6.1 Conclusion 39
6.2 Future work 39
6.2.1 Melody, Chord, and Rhythm Conditional Music Generation 40
6.2.2 Melody Accompanying Conditional Music Generation 40

References 41
dc.language.iso | en | -
dc.subject | 音樂 | zh_TW
dc.subject | 生成式模型 | zh_TW
dc.subject | 大型語言模型 | zh_TW
dc.subject | 控制 | zh_TW
dc.subject | Music | en
dc.subject | Generative model | en
dc.subject | Control | en
dc.subject | LLM | en
dc.title | 基於Transformer模型節奏、和弦與文字控制之音樂生成研究 | zh_TW
dc.title | MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation | en
dc.type | Thesis | -
dc.date.schoolyear | 112-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.coadvisor | 鄭皓中 | zh_TW
dc.contributor.coadvisor | Hao-Chung Cheng | en
dc.contributor.oralexamcommittee | 蘇黎;王新民 | zh_TW
dc.contributor.oralexamcommittee | Li Su;Hsin-Min Wang | en
dc.subject.keyword | 音樂,大型語言模型,生成式模型,控制 | zh_TW
dc.subject.keyword | Music,LLM,Generative model,Control | en
dc.relation.page | 45 | -
dc.identifier.doi | 10.6342/NTU202400889 | -
dc.rights.note | 同意授權(限校園內公開) (authorization granted; campus-only access) | -
dc.date.accepted | 2024-06-17 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 資料科學學位學程 | -
dc.date.embargo-lift | 2029-06-12 | -
Appears in Collections: 資料科學學位學程

Files in This Item:
File | Size | Format
ntu-112-2.pdf (restricted access; not publicly available) | 7.88 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
