NTU Theses and Dissertations Repository › College of Management › Department of Information Management
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98711
Title: Etude: Automatic Piano Cover Generation with a Three-Stage Approach — Extract, strucTUralize, and DEcode
Authors: Tse-Yang Chen (陳澤暘)
Advisor: Yuh-Jzer Joung (莊裕澤)
Keywords: Automatic Piano Cover Generation, Music Generation, Music Information Retrieval (MIR), Automatic Music Transcription, Controllable Generation
Publication Year: 2025
Degree: Master's
Abstract: Piano cover generation aims to automatically convert a pop song into a piano arrangement. Numerous deep learning studies have addressed this task, with solutions ranging from architectural modifications to optimizations in data preprocessing. However, we observe that these models often fail to ensure structural consistency between their output and the original song. We hypothesize that this stems from a lack of beat awareness in their architectures, or from an inability to correctly learn complex rhythmic information. This rhythmic information is critical: it not only governs the structural similarity between the cover and the original song (e.g., tempo and BPM), but also directly affects the overall quality of the generated music.

In this paper, we propose a three-stage architecture named Etude, whose name combines the abbreviations of its three core modules: Extract, strucTUralize, and DEcode. By pre-extracting rhythmic information and adopting a novel, highly simplified REMI-based token representation, our model ensures that the generated covers possess a proper song structure, improves musical fluency and dynamics, and enables highly controllable generation through the injection of specified styles. In subjective evaluations with human listeners, Etude substantially outperforms all previous representative models, achieving a quality closer to that of human composers.
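To make the abstract's "simplified REMI-based token representation" concrete, the sketch below serializes notes into a flat Bar/Position/Pitch/Duration token stream in the style of REMI. The thesis does not specify its actual vocabulary here, so the token names, the 16-steps-per-bar grid, and the `Note` structure are illustrative assumptions, not Etude's actual design.

```python
from dataclasses import dataclass

@dataclass
class Note:
    bar: int        # bar index (0-based)
    position: int   # onset within the bar, in 16th-note steps (0-15)
    pitch: int      # MIDI pitch number
    duration: int   # note length, in 16th-note steps

def to_remi_tokens(notes):
    """Serialize notes into a flat REMI-style token list:
    a Bar token opens each occupied bar, then each note contributes
    Position, Pitch, and Duration tokens in temporal order."""
    tokens = []
    current_bar = -1
    for n in sorted(notes, key=lambda n: (n.bar, n.position, n.pitch)):
        if n.bar != current_bar:
            tokens.append("Bar")
            current_bar = n.bar
        tokens.append(f"Position_{n.position}")
        tokens.append(f"Pitch_{n.pitch}")
        tokens.append(f"Duration_{n.duration}")
    return tokens

# Two notes in bar 0 (C4, E4) and one in bar 1 (G4):
notes = [Note(0, 0, 60, 4), Note(0, 8, 64, 4), Note(1, 0, 67, 8)]
print(to_remi_tokens(notes))
```

Because the metrical grid (bar and position) is explicit in the token stream rather than implied by note timings, a decoder conditioned on pre-extracted rhythm can keep the cover's bar structure aligned with the original song, which is the structural-consistency property the abstract emphasizes.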
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98711
DOI: 10.6342/NTU202503741
Fulltext Rights: Authorized (access restricted to campus network)
Embargo lift date: 2025-08-19
Appears in Collections: Department of Information Management

Files in This Item:
File: ntu-113-2.pdf (1.56 MB, Adobe PDF; access limited to NTU IP range)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
