Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92773
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 楊奕軒 | zh_TW |
dc.contributor.advisor | Yi-Hsuan Yang | en |
dc.contributor.author | 藍雲瀚 | zh_TW |
dc.contributor.author | Yun-Han Lan | en |
dc.date.accessioned | 2024-06-24T16:05:58Z | - |
dc.date.available | 2024-06-25 | - |
dc.date.copyright | 2024-06-24 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-06-15 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92773 | - |
dc.description.abstract | 現有的文字轉音樂模型能夠產生高品質且多樣化的音樂信號。然而,僅用文字提示無法精確控制生成音樂的時間特徵,如和弦與節奏。為了解決這個問題,我們引入了 MusiConGen,一個基於時序條件控制的 Transformer文字轉音樂模型,基於預訓練的 MusicGen 框架進行構建。本研究之貢獻為提出消費級GPU之高效微調(finetuning)機制,它集成了自動提取的和弦與節奏特徵作為控制信號。在推理(inference)過程中,控制信號可以是從參考音訊信號中提取的音樂特徵,或是使用者定義的符號(symbolic)和弦序列、BPM和文字提示。我們對兩個數據集進行的性能評估——一個來自提取的控制特徵,另一個來自使用者創建的輸入——證明 MusiConGen 能生成與指定時序控制良好對齊的逼真音樂。 | zh_TW |
dc.description.abstract | Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features, such as the chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally conditioned, Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically extracted chord and rhythm features as the control signal. During inference, the control can either be musical features extracted from a reference audio signal, or a user-defined symbolic chord sequence, BPM, and textual prompt (see the illustrative sketch after this record). Our performance evaluation on two datasets---one derived from extracted features and the other from user-created inputs---demonstrates that MusiConGen can generate realistic music that aligns well with the specified temporal control. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-06-24T16:05:58Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-06-24T16:05:58Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Master’s thesis acceptance certificate i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
1.1 Overview of text-to-music generation 1
1.2 Concurrent studies of conditional music generation 2
1.3 Our proposed method for the challenges 3
Chapter 2 Background 7
2.1 Transformer model 7
2.1.1 Self-attention mechanism 7
2.1.2 Multi-head attention 9
2.1.3 Transformer encoder 10
2.1.4 Transformer decoder 10
2.2 Codec Models for Audio Representation 11
2.3 Classifier-Free Guidance 12
2.4 MusicGen Model 13
2.4.1 Model structure 13
2.4.2 Delay Pattern 14
2.4.3 Pretrained MusicGen Model 14
Chapter 3 Methodology 17
3.1 Representing Temporal & Symbolic Conditions 17
3.2 Finetuning Mechanisms 19
Chapter 4 Experimental Setup 23
4.1 Datasets 23
4.1.1 Training Dataset 23
4.1.2 Evaluation Dataset 23
4.2 Dataset Pre-processing Details 25
4.3 Training Configuration 26
4.4 Evaluation Metrics 27
Chapter 5 Evaluation and Discussion 29
5.1 Temporal Conditions Comparison 29
5.2 Finetuning Mechanism Comparison 30
5.3 Genre Condition Comparison 32
5.4 BPM Analysis 34
5.5 Subjective Evaluation 34
Chapter 6 Conclusion and Future Work 39
6.1 Conclusion 39
6.2 Future Work 39
6.2.1 Melody, Chord, and Rhythm Conditional Music Generation 40
6.2.2 Melody Accompanying Conditional Music Generation 40
References 41 | - |
dc.language.iso | en | - |
dc.title | 基於Transformer模型節奏、和弦與文字控制之音樂生成研究 | zh_TW |
dc.title | MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | 碩士 (Master's) | - |
dc.contributor.coadvisor | 鄭皓中 | zh_TW |
dc.contributor.coadvisor | Hao-Chung Cheng | en |
dc.contributor.oralexamcommittee | 蘇黎;王新民 | zh_TW |
dc.contributor.oralexamcommittee | Li Su;Hsin-Min Wang | en |
dc.subject.keyword | 音樂, 大型語言模型, 生成式模型, 控制 | zh_TW |
dc.subject.keyword | Music, LLM, Generative model, Control | en |
dc.relation.page | 45 | - |
dc.identifier.doi | 10.6342/NTU202400889 | - |
dc.rights.note | 同意授權(限校園內公開) (consent to release; campus-only access) | - |
dc.date.accepted | 2024-06-17 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 資料科學學位學程 | - |
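To make the inference-time controls described in the abstract concrete, the following is a minimal, hypothetical Python sketch of turning a user-defined symbolic chord sequence and BPM into frame-level conditioning arrays of the kind a temporally conditioned model could consume. It is not code from the thesis or the MusicGen codebase: the 50 Hz frame rate, the chroma-style chord encoding, and the function names `chord_to_chroma` and `build_conditions` are illustrative assumptions.

```python
# Hypothetical sketch (not thesis code): build frame-level chord and beat
# conditions from a user-defined chord sequence and BPM. The 50 Hz frame rate,
# the chroma encoding, and all names here are illustrative assumptions.
import numpy as np

FRAME_RATE = 50  # frames per second (assumed to match the audio-token rate)
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chord_to_chroma(chord: str) -> np.ndarray:
    """Map a chord symbol such as 'C', 'Am', or 'G7' to a 12-dim chroma vector."""
    root = chord[:2] if len(chord) > 1 and chord[1] == "#" else chord[:1]
    quality = chord[len(root):]
    root_idx = PITCH_CLASSES.index(root)
    third = 3 if quality.startswith("m") and not quality.startswith("maj") else 4
    chroma = np.zeros(12, dtype=np.float32)
    for interval in (0, third, 7):                      # root, third, fifth
        chroma[(root_idx + interval) % 12] = 1.0
    if quality.endswith("7") and "maj" not in quality:  # flat seventh for '7'/'m7'
        chroma[(root_idx + 10) % 12] = 1.0
    return chroma


def build_conditions(chords, beats_per_chord, bpm, duration_sec):
    """Return per-frame chord chroma (n_frames x 12) and a beat-pulse track."""
    n_frames = int(duration_sec * FRAME_RATE)
    sec_per_beat = 60.0 / bpm
    sec_per_chord = beats_per_chord * sec_per_beat

    chord_cond = np.zeros((n_frames, 12), dtype=np.float32)
    for t in range(n_frames):
        idx = int((t / FRAME_RATE) / sec_per_chord) % len(chords)
        chord_cond[t] = chord_to_chroma(chords[idx])

    beat_cond = np.zeros(n_frames, dtype=np.float32)
    beat_frames = (np.arange(0.0, duration_sec, sec_per_beat) * FRAME_RATE).astype(int)
    beat_cond[np.minimum(beat_frames, n_frames - 1)] = 1.0  # mark the frame of each beat
    return chord_cond, beat_cond


if __name__ == "__main__":
    chord_cond, beat_cond = build_conditions(
        chords=["C", "Am", "F", "G7"], beats_per_chord=4, bpm=120, duration_sec=10.0)
    print(chord_cond.shape, beat_cond.shape)  # (500, 12) (500,)
```

Per the abstract, the chords and rhythm used as control signals during training are extracted automatically from audio, while arrays like the ones above stand in for the user-defined chord-sequence/BPM path available at inference.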
Appears in Collections: | 資料科學學位學程 |
Files in This Item:
File | Size | Format |
---|---|---|
ntu-112-2.pdf (Restricted Access) | 7.88 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.