NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92773
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 楊奕軒 | zh_TW
dc.contributor.advisor | Yi-Hsuan Yang | en
dc.contributor.author | 藍雲瀚 | zh_TW
dc.contributor.author | Yun-Han Lan | en
dc.date.accessioned | 2024-06-24T16:05:58Z | -
dc.date.available | 2024-06-25 | -
dc.date.copyright | 2024-06-24 | -
dc.date.issued | 2024 | -
dc.date.submitted | 2024-06-15 | -
dc.identifier.citation | [1] A. Agostinelli, T. I. Denk, Z. Borsos, J. H. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. H. Frank. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
[2] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer. madmom: a new Python audio and music signal processing library. In Proc. ACM Multimedia, pages 1174–1178, 2016.
[3] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov. HTS-AT: A hierarchical token-semantic audio Transformer for sound classification and detection. In Proc. ICASSP, 2022.
[4] K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In Proc. ICASSP, 2024.
[5] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
[6] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez. Simple and controllable music generation. In Proc. NeurIPS, 2023.
[7] A. Défossez. Hybrid spectrogram and waveform source separation. In Proc. ISMIR 2021 Workshop on Music Source Separation, 2021.
[8] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
[9] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780–8794. Curran Associates, Inc., 2021.
[10] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[11] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Popular, classical, and jazz music databases. In Proc. ISMIR, 2002.
[12] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. CNN architectures for large-scale audio classification. In Proc. ICASSP, 2017.
[13] J. Ho and T. Salimans. Classifier-free diffusion guidance. In Proc. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[14] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank, J. Engel, Q. V. Le, W. Chan, Z. Chen, and W. Han. Noise2music: Text-conditioned music generation with diffusion models, 2023.
[15] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi. Fréchet Audio Distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018.
[16] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar. High-fidelity audio compression with improved RVQGAN. In Proc. NeurIPS, 2023.
[17] P. Li, B. Chen, Y. Yao, Y. Wang, A. Wang, and A. Wang. JEN-1: Text-guided universal music generation with omnidirectional diffusion models. arXiv preprint arXiv:2308.04729, 2024.
[18] L. Lin, G. Xia, J. Jiang, and Y. Zhang. Content-based controls for music large language modeling. arXiv preprint arXiv:2310.17162, 2023.
[19] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023.
[20] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
[21] J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria. Mustango: Toward controllable text-to-music generation. arXiv preprint arXiv:2311.08355, 2023.
[22] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park. A bi-directional Transformer for musical chord recognition. In Proc. ISMIR, 2019.
[23] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proc. ISMIR, pages 367–372, 2014.
[24] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. The MUSDB18 corpus for music separation, 2017.
[25] S. Rouard, F. Massa, and A. Défossez. Hybrid Transformers for music source separation. In Proc. ICASSP, 2023.
[26] J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng, et al. Introducing ChatGPT, 2022.
[27] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[29] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan. Music ControlNet: Multiple time-varying controls for music generation. arXiv preprint arXiv:2311.07069, 2023.
[30] S.-L. Wu and Y.-H. Yang. MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1953–1967, 2023.
[31] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In Proc. ICASSP, 2023.
[32] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi. SoundStream: An end-to-end neural audio codec. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 30:495–507, 2021.
[33] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proc. ICCV, 2023.
[34] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
[35] Y. Zhang, Y. Ikemiya, G. Xia, N. Murata, M. Martínez, W.-H. Liao, Y. Mitsufuji, and S. Dixon. MusicMagus: Zero-shot text-to-music editing via diffusion models. arXiv preprint arXiv:2402.06178, 2024.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92773 | -
dc.description.abstract | 現有的文字轉音樂模型能夠產生高品質且多樣化的音樂信號。然而,僅用文字提示無法精確控制生成音樂的時間特徵,如和弦與節奏。為了解決這個問題,我們引入了 MusiConGen,一個基於時序條件控制的 Transformer文字轉音樂模型,基於預訓練的 MusicGen 框架進行構建。本研究之貢獻為提出消費級GPU之高效微調(finetuning)機制,它集成了自動提取的和弦與節奏特徵作為控制信號。在推理(inference)過程中,控制信號可以是從參考音訊信號中提取的音樂特徵,或是使用者定義的符號(symbolic)和弦序列、BPM和文字提示。我們對兩個數據集進行的性能評估——一個來自提取的控制特徵,另一個來自使用者創建的輸入——證明 MusiConGen 能生成與指定時序控制良好對齊的逼真音樂。 | zh_TW
dc.description.abstract | Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted chords and rhythm features as the control signal. During inference, the control can either be musical features extracted from a reference audio signal, or be user-defined symbolic chord sequence, BPM, and textual prompts. Our performance evaluation on two datasets---one derived from extracted features and the other from user-created inputs---demonstrates that MusiConGen can generate realistic music that aligns well with the specified temporal control. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-06-24T16:05:58Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2024-06-24T16:05:58Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Master's thesis acceptance certificate i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
1.1 Overview of text-to-music generation 1
1.2 Concurrent studies of conditional music generation 2
1.3 Our proposed method for the challenges 3

Chapter 2 Background 7
2.1 Transformer model 7
2.1.1 Self-attention mechanism 7
2.1.2 Multi-head attention 9
2.1.3 Transformer Encoder 10
2.1.4 Transformer Decoder 10
2.2 Codec Models for Audio Representation 11
2.3 Classifier-Free Guidance 12
2.4 MusicGen Model 13
2.4.1 Model structure 13
2.4.2 Delay Pattern 14
2.4.3 Pretrained MusicGen Model 14

Chapter 3 Methodology 17
3.1 Representing Temporal & Symbolic Conditions 17
3.2 Finetuning Mechanisms 19

Chapter 4 Experimental Setup 23
4.1 Datasets 23
4.1.1 Training Dataset 23
4.1.2 Evaluation Dataset 23
4.2 Dataset Pre-processing Details 25
4.3 Training Configuration 26
4.4 Evaluation Metrics 27

Chapter 5 Evaluation and Discussion 29
5.1 Temporal Conditions Comparison 29
5.2 Finetuning Mechanism Comparison 30
5.3 Genre Condition Comparison 32
5.4 BPM Analysis 34
5.5 Subjective Evaluation 34

Chapter 6 Conclusion and Future work 39
6.1 Conclusion 39
6.2 Future work 39
6.2.1 Melody, Chord, and Rhythm Conditional Music Generation 40
6.2.2 Melody Accompanying Conditional Music Generation 40

References 41
dc.language.iso | en | -
dc.subject | 音樂 | zh_TW
dc.subject | 生成式模型 | zh_TW
dc.subject | 大型語言模型 | zh_TW
dc.subject | 控制 | zh_TW
dc.subject | Music | en
dc.subject | Generative model | en
dc.subject | Control | en
dc.subject | LLM | en
dc.title | 基於Transformer模型節奏、和弦與文字控制之音樂生成研究 | zh_TW
dc.title | MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation | en
dc.type | Thesis | -
dc.date.schoolyear | 112-2 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.coadvisor | 鄭皓中 | zh_TW
dc.contributor.coadvisor | Hao-Chung Cheng | en
dc.contributor.oralexamcommittee | 蘇黎;王新民 | zh_TW
dc.contributor.oralexamcommittee | Li Su;Hsin-Min Wang | en
dc.subject.keyword | 音樂,大型語言模型,生成式模型,控制 | zh_TW
dc.subject.keyword | Music,LLM,Generative model,Control | en
dc.relation.page | 45 | -
dc.identifier.doi | 10.6342/NTU202400889 | -
dc.rights.note | 同意授權(限校園內公開) (authorization granted; campus-only access) | -
dc.date.accepted | 2024-06-17 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 資料科學學位學程 | -
dc.date.embargo-lift | 2029-06-12 | -
Appears in Collections: 資料科學學位學程

Files in This Item:
File | Size | Format
ntu-112-2.pdf (restricted access; not publicly available) | 7.88 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
