Time-Shifted Compound Token Refinement for Symbolic Music Generation

王庭康; Ting-Kang Wang

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97961

Title:	Time-Shifted Compound Token Refinement for Symbolic Music Generation (用於符號音樂生成的時間移位複合令牌表示法)
Author:	王庭康 Ting-Kang Wang
Advisor:	鄭皓中 Hao-Chung Cheng
Co-advisor:	楊奕軒 Yi-Hsuan Yang
Keywords:	Music Generation, Symbolic Music Generation, Music Token Representation
Year of publication:	2025
Degree:	Master's
Abstract:	This study investigates how different sequence representations affect the performance of machine-learning-based symbolic music generation models. The widely used Compound Word representation packs multiple musical events into a single token. This shortens sequences and improves efficiency, but because all events in a token are predicted at the same step, the model cannot effectively capture the dependencies among them, which compromises the richness and coherence of the generated music. To address this limitation, we propose a new arrangement of Compound Word tokens, inspired by the delay pattern that the MusicGen model applies to its Residual Vector Quantization (RVQ) codebooks. By delaying selected event sequences before they are combined into tokens, our method strengthens the model's ability to learn implicit dependencies between musical events without sacrificing computational efficiency. In the experiments, we compare a series of Transformer Decoder-based music generation models across three sequence representations: the traditional REMI (REvamped MIDI) representation, the original Compound Word representation, and our proposed delayed Compound Word representation. Evaluation criteria cover musical coherence, richness, consistency, and inference efficiency. The results show that the delayed representation produces music with clearer structure and musical logic: it outperforms the original Compound Word representation on both subjective and objective metrics and approaches REMI's strength in structure preservation, while keeping inference lean and fast. In summary, this study introduces a sequence representation that is both efficient and portable, and it confirms that representation design has a critical effect on generative model performance. Our findings offer a more flexible and expressive encoding strategy for the design of future symbolic music generation models and provide an extensible basis for applying generative music AI to long-sequence modeling.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97961
DOI:	10.6342/NTU202502077
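Full-text authorization:	Not authorized
Electronic full-text release date:	N/A
Appears in department:	Graduate Institute of Communication Engineering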
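
The delayed arrangement described in the abstract can be illustrated with a minimal sketch. It assumes a MusicGen-style delay in which field k of every compound token is shifted right by a fixed number of steps; the field layout (position, pitch, duration), the delay values, the `PAD` id, and the function names `apply_delay` and `undo_delay` are all hypothetical and are not taken from the thesis, whose exact arrangement the abstract does not specify.

```python
from typing import List, Sequence

PAD = -1  # hypothetical padding id for slots that hold no event yet


def apply_delay(tokens: Sequence[Sequence[int]],
                delays: Sequence[int],
                pad: int = PAD) -> List[List[int]]:
    """Shift field k of every compound token right by delays[k] steps."""
    n_fields = len(delays)
    total_steps = len(tokens) + max(delays)
    out = [[pad] * n_fields for _ in range(total_steps)]
    for t, token in enumerate(tokens):
        for k, value in enumerate(token):
            out[t + delays[k]][k] = value
    return out


def undo_delay(delayed: Sequence[Sequence[int]],
               delays: Sequence[int]) -> List[List[int]]:
    """Invert apply_delay and recover the aligned compound tokens."""
    n_steps = len(delayed) - max(delays)
    return [[delayed[t + d][k] for k, d in enumerate(delays)]
            for t in range(n_steps)]


# Three notes, each a compound token of (position, pitch, duration).
notes = [[0, 60, 4], [4, 64, 4], [8, 67, 8]]
delays = [0, 1, 2]  # assumed: position first, then pitch, then duration

delayed = apply_delay(notes, delays)
for step in delayed:
    print(step)

restored = undo_delay(delayed, delays)
```

Under these assumptions, the pitch of a note is emitted one step after its position and the duration one step after the pitch, so an autoregressive decoder already has the earlier fields of the same note in its context when it predicts the later ones. The sequence grows only by `max(delays)` steps, which is why the compactness of Compound Word tokens is largely preserved.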

Files in this document:

File	Size	Format
ntu-113-2.pdf (not authorized for public access)	1.09 MB	Adobe PDF

Unless their copyright terms are specifically indicated, documents in this system are protected by copyright, with all rights reserved.