Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97961

Full metadata record
| Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 鄭皓中 | zh_TW |
| dc.contributor.advisor | Hao-Chung Cheng | en |
| dc.contributor.author | 王庭康 | zh_TW |
| dc.contributor.author | Ting-Kang Wang | en |
| dc.date.accessioned | 2025-07-23T16:15:32Z | - |
| dc.date.available | 2025-07-24 | - |
| dc.date.copyright | 2025-07-23 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-21 | - |
| dc.identifier.citation | [1] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36:47704–47720, 2023.
[2] H.-W. Dong, K. Chen, S. Dubnov, J. J. McAuley, and T. Berg-Kirkpatrick. Multitrack music transformer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023), pages 1–5. IEEE, 2023.
[3] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv preprint arXiv:1810.12247, 2018.
[4] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang. Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 178–186. AAAI Press, 2021.
[5] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck. Music transformer: Generating music with long-term structure. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019). OpenReview.net, 2019.
[6] Y.-S. Huang and Y.-H. Yang. Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In Proceedings of the 28th ACM International Conference on Multimedia (MM 2020), pages 1180–1188. ACM, 2020.
[7] C. Raffel. Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching. PhD thesis, Columbia University, 2016.
[8] J. Ryu, H.-W. Dong, J. Jung, and D. Jeong. Nested music transformer: Sequentially decoding compound tokens in symbolic music and audio generation. arXiv preprint arXiv:2408.01180, 2024.
[9] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu. MusicBERT: Symbolic music understanding with large-scale pre-training. arXiv preprint arXiv:2106.05630, 2021. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97961 | - |
| dc.description.abstract | 本研究探討基於機器學習之符號音樂生成(symbolic music generation)模型中,不同序列表示法對生成效果之影響。現行常見的Compound Word表示法,由於在進行token預測時,單一token同時代表多個音樂事件(events),導致事件之間潛在的相依性(dependency)無法有效建模,進而影響生成音樂的豐富性與連貫性。為改善此問題,我們提出一種全新的Compound Word排列方法,其靈感源自MusicGen模型中Residual Vector Quantization (RVQ) codebook所提及之delayed pattern概念。我們的排列法透過延遲特定事件序列的組合,有效增強了事件間隱含依賴的建模能力。
在實驗部分,我們採用一系列基於 Transformer Decoder 架構的音樂生成模型進行比較,並選取三種不同序列表示法:傳統的 REMI(REvamped MIDI)表示法、原始 Compound Word 表示法,與本研究提出的延遲式 Compound Word 表示法作為對照。評估指標涵蓋音樂連貫性(coherence)、豐富性(richness)、一致性(consistency)以及模型推論效率等面向。結果顯示,延遲式表示法在維持系統效率的同時,能生成更具結構層次與音樂邏輯的作品,在主觀與客觀指標上皆優於傳統方法,並且在結構保留上的表現接近 REMI 表示法的強項。綜上所述,本研究不僅成功提出一種兼具效率與可攜性的序列表示法,亦驗證了表示設計對生成模型效能的關鍵性影響。我們的發現可為未來符號音樂生成模型之設計,提供更具靈活性與表現力的編碼策略,並對於音樂生成式 AI 在長序列建模上的應用,提供具有延展性的研究基礎。 | zh_TW |
| dc.description.abstract | This study investigates the impact of different sequence representations on the performance of symbolic music generation models based on machine learning. One widely adopted approach, the Compound Word representation, encodes multiple musical events into a single token. While this strategy reduces sequence length and improves efficiency, it also makes it difficult for the model to capture the underlying dependencies between events, which can compromise the richness and coherence of the generated music.
To address this limitation, we propose a novel rearrangement strategy for Compound Word tokens, inspired by the delayed pattern concept introduced in the Residual Vector Quantization (RVQ) codebook of the MusicGen model. By intentionally delaying the combination of specific event sequences, our method strengthens the model's ability to learn implicit dependencies between musical events without sacrificing computational efficiency.
In the experimental phase, we conducted a comprehensive comparison using a series of Transformer Decoder-based music generation models. Three types of sequence representations were evaluated: the traditional REMI (REvamped MIDI) format, the original Compound Word representation, and our proposed Delayed Compound Word representation. Evaluation criteria included musical coherence, richness, consistency, and inference efficiency. Results show that our delayed representation significantly enhances the structural depth and musical logic of the generated output, achieving a quality level that not only surpasses the original Compound Word method but also approaches the structural fidelity of the REMI format, while maintaining a leaner and faster inference process.
In summary, this study introduces an efficient and portable sequence representation that bridges the gap between expressiveness and performance. Our findings underscore the critical role of token design in symbolic music generation and offer a flexible encoding strategy that can inform future architectures. Moreover, the proposed approach provides a scalable foundation for generative AI models tasked with modeling long-form musical sequences. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-23T16:15:32Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-23T16:15:32Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 誌謝 i
中文摘要 ii
英文摘要 iv
目次 vi
圖次 viii
表次 ix
第一章 Introduction 1
第二章 Related Work 4
2.1 Transformer-Based Symbolic Music Generation 4
2.2 Symbolic Music Encoding Schemes 4
2.3 Delay Pattern and Temporal Scheduling in Music Generation 6
第三章 Methodology 8
3.1 Time-shifted Compound Token Representation 8
3.1.1 Compound Token Representation 9
3.1.2 Delay-Pattern Decoding Mechanism 10
3.1.3 Model Architecture and Training Objective 11
3.1.4 Autoregressive Generation with Delay Pattern 14
3.1.5 Benefits and Design Rationale 15
第四章 Experiments 16
4.1 Dataset 16
4.2 Model and Hyperparameters 18
第五章 Results 21
5.1 Objective Evaluation 21
5.1.1 Quantitative Evaluation Metrics 21
5.1.2 Discussion 23
5.2 Subjective Listening Tests 25
5.3 Inference Speed Test 29
5.3.1 Inference Setting 29
5.3.2 Speed Test Analysis 30
5.4 Self-Attention Visualization 32
5.4.1 Self-Attention Visualization 32
第六章 Conclusion and Future Work 34
參考文獻 37 | - |
| dc.language.iso | en | - |
| dc.subject | 音樂生成 | zh_TW |
| dc.subject | 符號音樂生成 | zh_TW |
| dc.subject | 音樂詞元表示法 | zh_TW |
| dc.subject | Music Token Representation | en |
| dc.subject | Music Generation | en |
| dc.subject | Symbolic Music Generation | en |
| dc.title | 用於符號音樂生成的時間移位複合令牌表示法 | zh_TW |
| dc.title | Time-Shifted Compound Token Refinement for Symbolic Music Generation | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.coadvisor | 楊奕軒 | zh_TW |
| dc.contributor.coadvisor | Yi-Hsuan Yang | en |
| dc.contributor.oralexamcommittee | 蘇黎 | zh_TW |
| dc.contributor.oralexamcommittee | Li Su | en |
| dc.subject.keyword | 音樂生成,符號音樂生成,音樂詞元表示法 | zh_TW |
| dc.subject.keyword | Music Generation,Symbolic Music Generation,Music Token Representation | en |
| dc.relation.page | 38 | - |
| dc.identifier.doi | 10.6342/NTU202502077 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2025-07-22 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | N/A | - |
Appears in Collections: 電信工程學研究所
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (Restricted Access) | 1.09 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
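The delayed rearrangement described in the abstract can be sketched roughly as follows. This is a hypothetical illustration of the general delay-pattern idea (as popularized by MusicGen's RVQ codebook interleaving), not the thesis's actual implementation: the function names, the `PAD` placeholder, and the assumption that each compound token is a fixed-length tuple of event fields are all illustrative choices. Field k of each compound token is shifted right by k steps, so an autoregressive decoder predicts later fields conditioned on the earlier fields of preceding positions.

```python
# Hypothetical sketch of a delay-pattern rearrangement for compound tokens.
# Each compound token is a tuple of K event fields (e.g. type, pitch, duration).
PAD = -1  # placeholder id; a real vocabulary would reserve a token for this


def apply_delay_pattern(tokens):
    """Rearrange a list of K-field compound tokens into a delayed layout.

    tokens: list of equal-length tuples, one tuple per time step.
    Returns a list of length len(tokens) + K - 1, where field k of the
    original step t appears at output step t + k (PAD elsewhere).
    """
    if not tokens:
        return []
    k_fields = len(tokens[0])
    length = len(tokens) + k_fields - 1
    out = [[PAD] * k_fields for _ in range(length)]
    for t, tok in enumerate(tokens):
        for k, field in enumerate(tok):
            out[t + k][k] = field
    return [tuple(row) for row in out]


def undo_delay_pattern(delayed, k_fields):
    """Invert apply_delay_pattern, recovering the time-aligned tokens."""
    length = len(delayed) - k_fields + 1
    return [tuple(delayed[t + k][k] for k in range(k_fields))
            for t in range(length)]
```

For two 3-field tokens `(1, 10, 100)` and `(2, 20, 200)`, the delayed layout is `[(1, PAD, PAD), (2, 10, PAD), (PAD, 20, 100), (PAD, PAD, 200)]`: each field stream is staggered by one step, which is what lets the model condition a token's later fields on its earlier ones while keeping the sequence only K - 1 steps longer than the original Compound Word layout.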
