Latent Language Diffusion Model for Symbolic Melody Generation

Hsien-Chen Yeh (葉咸辰)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97721

Title:	Latent Language Diffusion Model for Symbolic Melody Generation (用於符號旋律生成之潛在語言擴散模型)
Author:	Hsien-Chen Yeh (葉咸辰)
Advisor:	Yi-Hsuan Yang (楊奕軒)
Keywords:	Symbolic Melody Generation, Latent Language Diffusion Model, Melody Continuation, Melody Inpainting
Publication Year:	2025
Degree:	Master's
Abstract:	Transformer-based models have achieved remarkable results in symbolic melody generation. However, their autoregressive nature limits their effectiveness in tasks such as melody inpainting. Conversely, diffusion models, while highly successful at modeling continuous data such as images, audio, and video, have seen limited application in discrete domains such as symbolic melody generation. This thesis proposes using a Latent Language Diffusion Model for symbolic melody generation. The model uses a language model to encode discrete symbolic music data into a continuous latent space, making the data amenable to processing by continuous diffusion models. This approach allows us to sample continuous latent representations, including for melody continuation and melody inpainting, and then decode them back into discrete symbolic music data with the language decoder. Our model was trained successfully using free and low-cost resources such as Google Colab (NVIDIA T4 GPU) and Kaggle Kernels (NVIDIA P100 GPU), demonstrating its low computational requirements and its suitability for resource-constrained settings. Our evaluation shows strong performance on the melody continuation task and better results than autoregressive baselines on the melody inpainting task, with faster sampling on both tasks.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97721
DOI:	10.6342/NTU202501313
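Full-Text License:	Authorization granted (open access worldwide)
Full-Text Release Date:	2025-07-12
Appears in Collections:	Graduate Institute of Communication Engineering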
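
The abstract describes a pipeline of encoding discrete symbols into continuous latents, sampling with a diffusion model, and decoding back to symbols. The following is a minimal sketch of how inpainting could work in such a pipeline, not the thesis's implementation. Everything in it is an assumption made for illustration: the toy vocabulary `VOCAB`, the one-dimensional `encode`/`decode` pair standing in for a language-model encoder and decoder, the cosine noise schedule, the untrained `placeholder_denoiser`, and the strategy of re-imposing the known positions at every reverse step. The record does not state which conditioning mechanism the thesis actually uses.

```python
import math
import random

# Hypothetical toy vocabulary (MIDI pitches of a C major scale).
# A real system would use a learned language-model encoder/decoder instead.
VOCAB = [60, 62, 64, 65, 67, 69, 71, 72]


def encode(tokens):
    """Toy encoder: map each discrete token to a continuous value in [-1, 1]."""
    return [2.0 * VOCAB.index(t) / (len(VOCAB) - 1) - 1.0 for t in tokens]


def decode(latents):
    """Toy decoder: snap each continuous value to the nearest vocabulary entry."""
    out = []
    for z in latents:
        idx = round((z + 1.0) / 2.0 * (len(VOCAB) - 1))
        out.append(VOCAB[min(max(idx, 0), len(VOCAB) - 1)])
    return out


def alpha_bar(t, steps):
    """Assumed cosine schedule: 1 at t = 0 (clean), about 0 at t = steps (noise)."""
    return math.cos(0.5 * math.pi * t / steps) ** 2


def placeholder_denoiser(x, t):
    """Stand-in for a trained network: guesses the clean latent by averaging
    each position with its neighbours. It ignores t and is NOT a trained model."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def inpaint(tokens, mask, steps=50, seed=0):
    """Fill in positions where mask[i] is False; keep positions where it is True."""
    rng = random.Random(seed)
    # Masked positions get a dummy token here; their latent is never used.
    known = encode([tok if m else VOCAB[0] for tok, m in zip(tokens, mask)])
    x = [rng.gauss(0.0, 1.0) for _ in tokens]  # start from pure noise
    for t in range(steps, 0, -1):
        x0_hat = placeholder_denoiser(x, t)
        ab_prev = alpha_bar(t - 1, steps)
        a, s = math.sqrt(ab_prev), math.sqrt(1.0 - ab_prev)
        for i in range(len(x)):
            # Known positions are re-noised from their true latent;
            # unknown positions are re-noised from the denoiser's estimate.
            target = known[i] if mask[i] else x0_hat[i]
            x[i] = a * target + s * rng.gauss(0.0, 1.0)
    return decode(x)


melody = [60, 62, None, None, 67, 69]
mask = [True, True, False, False, True, True]
print(inpaint(melody, mask))
```

Because the noise level reaches zero at the last step, the known positions decode back to their original pitches, while the masked positions are filled from the denoiser's estimate. With a trained denoiser in place of the placeholder, the same loop would condition on notes both before and after the gap, which is the kind of bidirectional context the abstract says autoregressive models struggle to use.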

Files in this item:

File	Size	Format
ntu-113-2.pdf	1.11 MB	Adobe PDF

