Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97721

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 楊奕軒 | zh_TW |
| dc.contributor.advisor | Yi-Hsuan Yang | en |
| dc.contributor.author | 葉咸辰 | zh_TW |
| dc.contributor.author | Hsien-Chen Yeh | en |
| dc.date.accessioned | 2025-07-11T16:21:40Z | - |
| dc.date.available | 2025-07-12 | - |
| dc.date.copyright | 2025-07-11 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-06-30 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97721 | - |
| dc.description.abstract | 基於 Transformer 的模型在符號旋律生成(symbolic melody generation)任務上取得了卓越的成果。然而,由於它們自回歸(autoregressive)的特性,限制了它們在旋律填充(melody inpainting)等任務上的效能。相反地,儘管擴散模型(diffusion models)在處理影像、音訊和影片等連續型資料上非常成功,在符號旋律生成這類離散領域的應用卻相對有限。本論文提出使用潛在語言擴散模型(Latent Language Diffusion Model)進行符號旋律生成,此模型利用語言模型將離散的符號音樂資料編碼(encode)到連續的潛在空間(continuous latent space),使得其適合被連續型擴散模型所處理。這個方法使得我們能夠對連續的潛在表徵進行取樣,包括完成旋律續寫(melody continuation)及旋律填充任務,之後再透過語言解碼器(language decoder)將其轉回離散的符號音樂資料。我們的模型使用 Google Colab(NVIDIA T4 GPU)和 Kaggle Kernels(NVIDIA P100 GPU)等免費和便宜的資源成功訓練,證明了其低運算需求和適合資源受限的環境。我們的評估結果展現出此方法在旋律續寫任務上的優異表現,並且在旋律填充任務上展示出優於自回歸基線模型的成果,同時在兩項任務上皆有更快的取樣速度。 | zh_TW |
| dc.description.abstract | Transformer-based models have achieved remarkable results in symbolic melody generation. However, their autoregressive nature limits their effectiveness in tasks such as melody inpainting. Conversely, diffusion models, while highly successful at modeling continuous data such as images, audio, and video, have seen limited application in discrete domains such as symbolic melody generation. This thesis proposes a Latent Language Diffusion Model for symbolic melody generation, which leverages a language model to encode discrete symbolic music data into a continuous latent space, making it amenable to processing by continuous diffusion models. This approach allows us to sample continuous latent representations for tasks including melody continuation and melody inpainting, and to decode the samples back into discrete symbolic music data via the language decoder. Our model was successfully trained using freely available and low-cost resources such as Google Colab (NVIDIA T4 GPU) and Kaggle Kernels (NVIDIA P100 GPU), demonstrating its low computational requirements and suitability for resource-constrained settings. Our evaluation shows strong performance on the melody continuation task, outperforms autoregressive baselines on the melody inpainting task, and achieves faster inference on both tasks. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-11T16:21:40Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-11T16:21:40Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Abstract (in Chinese)
Abstract (in English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Transformer-Based Methods for Symbolic Melody Generation
2.2 Diffusion-Based Methods for Symbolic Melody Generation
2.3 Symbolic Music Inpainting
Chapter 3 Background
3.1 Diffusion Models
3.1.1 Latent Diffusion Models
3.1.2 Text Diffusion Models
3.1.2.1 Latent Diffusion for Language Generation (LD4LG)
Chapter 4 Methodology
4.1 Data
4.2 Model Architecture
4.2.1 Language Encoder-Decoder Model
4.2.2 Compression and Reconstruction Network
4.2.3 Latent Language Diffusion Model
4.2.4 Training Details
Chapter 5 Experiment
5.1 Unconditional Generation
5.1.1 Evaluation Metrics
5.1.2 Objective Evaluation
5.1.3 Inference Sampling Step Evaluation
5.2 Melody Continuation
5.2.1 Evaluation Metrics
5.2.2 Baselines
5.2.3 Objective Evaluation
5.2.4 Inference Time Evaluation
5.3 Melody Inpainting
5.3.1 Baselines
5.3.2 Objective Evaluation
5.3.3 Subjective Evaluation
5.3.4 Inference Time Evaluation
Chapter 6 Conclusion
Chapter 7 Future Work
References | - |
| dc.language.iso | en | - |
| dc.subject | 旋律填充 | zh_TW |
| dc.subject | 旋律續寫 | zh_TW |
| dc.subject | 潛在語言擴散模型 | zh_TW |
| dc.subject | 符號旋律生成 | zh_TW |
| dc.subject | Melody Inpainting | en |
| dc.subject | Symbolic Melody Generation | en |
| dc.subject | Latent Language Diffusion Model | en |
| dc.subject | Melody Continuation | en |
| dc.title | 用於符號旋律生成之潛在語言擴散模型 | zh_TW |
| dc.title | Latent Language Diffusion Model for Symbolic Melody Generation | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 鄭皓中;蘇黎 | zh_TW |
| dc.contributor.oralexamcommittee | Hao-Chung Cheng;Li Su | en |
| dc.subject.keyword | 符號旋律生成, 潛在語言擴散模型, 旋律續寫, 旋律填充 | zh_TW |
| dc.subject.keyword | Symbolic Melody Generation, Latent Language Diffusion Model, Melody Continuation, Melody Inpainting | en |
| dc.relation.page | 57 | - |
| dc.identifier.doi | 10.6342/NTU202501313 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-07-01 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | 2025-07-12 | - |
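The abstract above describes a three-stage pipeline: a language model encodes discrete melody tokens into a continuous latent space, a diffusion model samples in that space, and a language decoder maps the sample back to tokens. As a purely illustrative sketch (every name, the toy vocabulary, and the hand-written "denoiser" below are hypothetical stand-ins, not the thesis's actual architecture or tokenization), the sampling loop can be pictured as:

```python
import random

# Toy token set standing in for a real symbolic-music vocabulary.
VOCAB = ["C4", "D4", "E4", "F4", "G4", "A4", "B4", "REST"]

def decode(latent):
    """Stand-in for the language decoder: snap each continuous
    coordinate to the nearest vocabulary index."""
    return [VOCAB[min(max(round(x), 0), len(VOCAB) - 1)] for x in latent]

def denoise_step(z, strength=0.3):
    """Hypothetical reverse-diffusion step: nudge each coordinate toward
    the nearest valid token index. A trained model would instead use a
    neural network to predict and remove the noise."""
    def clamp(x):
        return min(max(x, 0.0), float(len(VOCAB) - 1))
    return [x + strength * (round(clamp(x)) - x) for x in z]

def sample_melody(length=8, steps=30, seed=42):
    """Draw Gaussian noise in the latent space, denoise it iteratively,
    then decode the result into discrete melody tokens."""
    rng = random.Random(seed)
    z = [rng.gauss(3.5, 2.5) for _ in range(length)]
    for _ in range(steps):
        z = denoise_step(z)
    return decode(z)

melody = sample_melody()
print(melody)  # a list of 8 tokens drawn from VOCAB
```

The same structure supports inpainting in principle: coordinates corresponding to known tokens are held fixed while only the masked region is denoised, which is what makes the latent-space formulation attractive compared with left-to-right autoregressive decoding.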
| Appears in Collections: | 電信工程學研究所 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 1.11 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.