Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98711

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 莊裕澤 | zh_TW |
| dc.contributor.advisor | Yuh-Jzer Joung | en |
| dc.contributor.author | 陳澤暘 | zh_TW |
| dc.contributor.author | Tse-Yang Chen | en |
| dc.date.accessioned | 2025-08-18T16:11:44Z | - |
| dc.date.available | 2025-08-19 | - |
| dc.date.copyright | 2025-08-18 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-08 | - |
| dc.identifier.citation | [1] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, et al. GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
[2] J.-P. Briot. From artificial neural networks to deep learning for music generation: history, concepts and trends. Neural Computing and Applications, 33(1):39–65, 2021.
[3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[4] J. Choi and K. Lee. Pop2Piano: Pop audio-based piano cover generation. In ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[5] C. Donahue, J. Thickstun, and P. Liang. Melody transcription via generative pre-training. In ISMIR, 2022.
[6] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. Engel. MT3: Multi-task multitrack music transcription. arXiv preprint arXiv:2111.03017, 2021.
[7] G. Hadjeres, F. Pachet, and F. Nielsen. DeepBach: A steerable model for Bach chorales generation. In International Conference on Machine Learning, pages 1362–1371. PMLR, 2017.
[8] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang. Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 178–186, 2021.
[9] Y.-S. Huang and Y.-H. Yang. Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1180–1188, 2020.
[10] S. Ji, X. Yang, and J. Luo. A survey on deep learning for symbolic music generation: Representations, algorithms, evaluations, and challenges. ACM Computing Surveys, 56(1):1–39, 2023.
[11] D. P. Kingma. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[12] K. Komiya and Y. Fukuhara. AMT-APC: Automatic piano cover by fine-tuning an automatic music transcription model. arXiv preprint arXiv:2409.14086, 2024.
[13] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[14] H. H. Mao, T. Shin, and G. Cottrell. DeepJ: Style-specific music generation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pages 377–382. IEEE, 2018.
[15] M. Müller, Y. Özer, M. Krause, T. Prätzlich, and J. Driedger. Sync Toolbox: A Python package for efficient, robust, and accurate music synchronization. Journal of Open Source Software, 6(64):3434, 2021.
[16] T. Prätzlich, J. Driedger, and M. Müller. Memory-restricted multiscale dynamic time warping. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 569–573. IEEE, 2016.
[17] I. Sutskever. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014.
[18] H. Takamori, T. Nakatsuka, S. Fukayama, M. Goto, and S. Morishima. Audio-based automatic generation of a piano reduction score by considering the musical structure. In MultiMedia Modeling: 25th International Conference, MMM 2019, Thessaloniki, Greece, January 8–11, 2019, Proceedings, Part II 25, pages 169–181. Springer, 2019.
[19] C.-P. Tan, H. Ai, Y.-H. Chang, S.-H. Guan, and Y.-H. Yang. PiCoGen2: Piano cover generation with transfer learning approach and weakly aligned data. In Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), San Francisco, CA, United States, Nov. 2024.
[20] C.-P. Tan, S.-H. Guan, and Y.-H. Yang. PiCoGen: Generate piano covers with a two-stage approach. In Proceedings of the 2024 International Conference on Multimedia Retrieval, pages 1180–1184, 2024.
[21] H. H. Tan and D. Herremans. Music FaderNets: Controllable music generation based on high-level features via low-level feature modelling. arXiv preprint arXiv:2007.15474, 2020.
[22] K. Toyama, T. Akama, Y. Ikemiya, Y. Takida, W.-H. Liao, and Y. Mitsufuji. Automatic piano transcription with hierarchical frequency-time transformer. arXiv preprint arXiv:2307.04305, 2023.
[23] A. Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
[24] S.-L. Wu and Y.-H. Yang. Compose & Embellish: Well-structured piano performance generation via a two-stage approach. In ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[25] S.-L. Wu and Y.-H. Yang. MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1953–1967, 2023.
[26] T. Y. Yip and C.-j. Chau. Music2MIDI: Pop music to MIDI piano cover generation. In International Conference on Multimedia Modeling, pages 101–113, 2025.
[27] J. Zhao, G. Xia, and Y. Wang. Beat Transformer: Demixed beat and downbeat tracking with dilated self-attention. arXiv preprint arXiv:2209.07140, 2022. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98711 | - |
| dc.description.abstract | 鋼琴翻奏生成(Piano Cover Generation)旨在將一首流行歌曲自動轉換為鋼琴編曲。過去已有眾多深度學習研究探討此任務,其解決方案涵蓋了從模型架構的修改到資料預處理的優化等多個層面。然而,我們觀察到這些模型時常無法確保其輸出與原曲之間的結構一致性。我們推論,其原因在於模型的架構缺乏節拍感知的能力,或是模型無法正確學習複雜的節奏資訊。這些節奏資訊至關重要,因為它不僅主導了鋼琴翻奏與原曲在結構層面上的相似性(如速度、BPM),也直接影響了生成音樂的整體品質。
在本論文中,我們提出了一套名為 Etude 的三階段式架構,其名稱融合了其三大核心模組的英文縮寫:萃取(Extract)、結構化(strucTUralize)與解碼(DEcode)。透過預先提取節奏資訊,並採用一種新穎且高度簡化的、基於 REMI 的 token 表示法,我們的模型確保了生成的翻奏具備正確的歌曲結構,提升了音樂的流暢度與動態表現,並能透過注入指定風格來實現高度可控的生成。最終,在包含人類聽眾的主觀評測中,Etude 的表現大幅超越了所有過去的代表性模型,其生成品質更加接近人類作曲家的水平。 | zh_TW |
| dc.description.abstract | Piano cover generation aims to automatically convert a pop song into a piano arrangement. Numerous deep learning studies have previously addressed this task, with solutions ranging from architectural modifications to optimizations in data preprocessing. However, we observe that these models often fail to ensure structural consistency between their output and the original song. We hypothesize that this is due to a lack of beat-aware capabilities in their architectures, or to an inability of the models to correctly learn complex rhythmic information. This rhythmic information is critical, as it not only governs the structural similarity between the cover and the original song (e.g., tempo and BPM) but also directly impacts the overall quality of the generated piano music.
In this paper, we propose a three-stage architecture, Etude, composed of Extract, strucTUralize, and DEcode stages. By pre-extracting rhythmic information and utilizing a novel, highly simplified REMI-based tokenization, our model ensures the generated covers possess a proper song structure, improves fluency and musical dynamics, and enables highly controllable generation through the injection of specified styles. Finally, in subjective evaluations with human listeners, Etude substantially outperforms all previous models, achieving a quality closer to that of human composers. | en |
| dc.description.tableofcontents | Acknowledgements
Abstract (in Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Background and Motivation
1.2 Research Objectives
Chapter 2: Literature Review
2.1 Symbolic Music Generation
2.1.1 Note Event Sequences
2.1.2 Piano Roll
2.2 Music Style Transfer
2.3 Automatic Piano Cover Generation (APCG)
2.3.1 Pop2Piano
2.3.2 PiCoGen
2.3.3 PiCoGen2
2.3.4 AMT-APC
2.3.5 Music2MIDI
2.4 Summary
Chapter 3: Methodology
3.1 Research Architecture
3.2 Dataset Preprocessing
3.2.1 Beat Information Extraction
3.2.2 Transcription and Alignment
3.2.3 Quantization
3.2.4 Tokenization
3.3 Tiny-REMI Tokens
3.3.1 Token Structure
3.3.2 Encoding
3.3.3 Decoding
3.4 Models
3.4.1 Extractor Model
3.4.2 Decoder Model
3.4.2.1 Bar-wise Mix
3.4.2.2 Style Vector
Chapter 4: Experiments and Evaluation
4.1 Dataset
4.2 Training
4.2.1 Extractor
4.2.2 Decoder
4.3 Inference
4.4 Objective Evaluation
4.4.1 Warp Path Deviation (WPD)
4.4.2 Rhythmic Grid Coherence (RGC)
4.4.3 IOI Pattern Entropy (IPE)
4.5 Subjective Evaluation
4.6 Evaluation Results
4.7 Effect of the Style Vector on Generation
Chapter 5: Conclusion
5.1 Summary
5.2 Contributions
5.3 Limitations
5.4 Future Work
References | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 音樂生成 | zh_TW |
| dc.subject | 自動鋼琴翻奏生成 | zh_TW |
| dc.subject | 可控生成 | zh_TW |
| dc.subject | 自動音樂轉錄 | zh_TW |
| dc.subject | 音樂資訊檢索 | zh_TW |
| dc.subject | Music Information Retrieval (MIR) | en |
| dc.subject | Automatic Music Transcription | en |
| dc.subject | Controllable Generation | en |
| dc.subject | Music Generation | en |
| dc.subject | Automatic Piano Cover Generation | en |
| dc.title | Etude:基於萃取、結構化與解碼的自動鋼琴翻奏生成模型架構 | zh_TW |
| dc.title | Etude: Automatic Piano Cover Generation with a Three-Stage Approach — Extract, strucTUralize, and DEcode | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master | - |
| dc.contributor.oralexamcommittee | 陳建錦;魏志平;楊奕軒;林俊叡 | zh_TW |
| dc.contributor.oralexamcommittee | Chien-Chin Chen;Chih-Ping Wei;Yi-Hsuan Yang;June-Ray Lin | en |
| dc.subject.keyword | 自動鋼琴翻奏生成, 音樂生成, 音樂資訊檢索, 自動音樂轉錄, 可控生成 | zh_TW |
| dc.subject.keyword | Automatic Piano Cover Generation, Music Generation, Music Information Retrieval (MIR), Automatic Music Transcription, Controllable Generation | en |
| dc.relation.page | 56 | - |
| dc.identifier.doi | 10.6342/NTU202503741 | - |
| dc.rights.note | Authorization granted (access restricted to campus) | - |
| dc.date.accepted | 2025-08-12 | - |
| dc.contributor.author-college | College of Management | - |
| dc.contributor.author-dept | Department of Information Management | - |
| dc.date.embargo-lift | 2025-08-19 | - |
| Appears in Collections: | Department of Information Management | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (access limited to the NTU IP range) | 1.56 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
