NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/58195
Full metadata record (DC field [language]: value)
dc.contributor.advisor: 張智星 (Jyh-Shing Roger Jang)
dc.contributor.author [en]: Yu-Siang Huang
dc.contributor.author [zh_TW]: 黃郁翔
dc.date.accessioned: 2021-06-16T08:07:59Z
dc.date.available: 2025-07-14
dc.date.copyright: 2020-07-17
dc.date.issued: 2020
dc.date.submitted: 2020-07-15
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/58195
dc.description.abstract [zh_TW]: Music generation differs from image and video generation in several notable ways. First, music is an art of time, so methods for temporal processing are required. Second, notes are not merely ordered in time: neighboring groups of notes form various kinds of musical syntax and structure, such as chords, arpeggios, and scales. In this thesis, within a framework based on self-attention models, we explore how to generate pop piano music several minutes long. We further propose a data pre-processing pipeline with which raw audio can be converted into the MIDI format. To analyze the generated results, we conducted a subjective user study, from which we gained many insights, and we give a comprehensive discussion of the strengths and weaknesses of the model architecture, further validating the effectiveness of the proposed method and clarifying both the power and the limitations of deep learning techniques.
dc.description.abstract [en]: Generating music differs in a few notable ways from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, in polyphonic music notes are often grouped into chords, arpeggios, or melodies, so introducing a sequential ordering of notes into the generative model is critical. In this thesis, we investigated the Transformer framework for generating minutes-long pop piano music. We also proposed a data pre-processing pipeline that collects audio data and converts it to the MIDI format. To evaluate the generated results, we conducted a subjective user study to demonstrate the effectiveness of the proposed method.
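This record does not include the thesis code, but the abstract's point that polyphonic notes must be serialized into an ordered sequence is concrete enough to sketch. The following is a minimal, hypothetical Python illustration of an event-based representation of the kind the table of contents lists as Section 3.1.6 (the Note type, the notes_to_events helper, the 0.125 s grid, and the token names are my assumptions, not the thesis's actual vocabulary):

```python
# Hypothetical sketch of an event-based representation: each note becomes
# NOTE_ON / NOTE_OFF tokens interleaved with quantized TIME_SHIFT tokens,
# so a polyphonic performance reads as one flat sequence a language model
# can consume. Names and the grid size are illustrative only.
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int    # MIDI pitch, 0-127
    start: float  # onset time in seconds
    end: float    # offset time in seconds

def notes_to_events(notes, step=0.125):
    """Flatten transcribed notes into an ordered event sequence."""
    boundaries = []
    for n in notes:
        boundaries.append((n.start, f"NOTE_ON_{n.pitch}"))
        boundaries.append((n.end, f"NOTE_OFF_{n.pitch}"))
    boundaries.sort(key=lambda b: b[0])  # chronological order

    events, clock = [], 0.0
    for t, token in boundaries:
        # advance the clock in quantized steps up to this event's time
        while t - clock >= step:
            events.append("TIME_SHIFT")
            clock += step
        events.append(token)
    return events

# Two overlapping notes (C4 and E4) become one interleaved token stream.
print(notes_to_events([Note(60, 0.0, 0.5), Note(64, 0.25, 0.75)]))
```

In a chord-encoded setup such as the one the title suggests, chord tokens from the symbolic chord recognition step (Section 3.1.5) would presumably be interleaved into the same stream.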
dc.description.provenance [en]: Made available in DSpace on 2021-06-16T08:07:59Z (GMT). No. of bitstreams: 1. U0001-1407202014411900.pdf: 3625414 bytes, checksum: 7150038659da4efd4b3848a351ac4f51 (MD5). Previous issue date: 2020.
dc.description.tableofcontents:
口試委員會審定書 (Thesis Committee Certification)
摘要 (Chinese Abstract)
Abstract
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Problem Statement
    1.2.1 Design of Representation
    1.2.2 Design of Networks
  1.3 Contribution
  1.4 Thesis Organization
2 Related Work
  2.1 Symbolic Music Generation
    2.1.1 Image-modeling Approach
    2.1.2 Language-modeling Approach
  2.2 Transformer
3 Method
  3.1 Data Pre-processing
    3.1.1 Data Collection
    3.1.2 Music Transcription
    3.1.3 Time Quantization
    3.1.4 Data Augmentation
    3.1.5 Symbolic Chord Recognition
    3.1.6 Event Representation
  3.2 Model
    3.2.1 Adaptive Input Representation
    3.2.2 Self-Attention Modules
    3.2.3 Relative Positional Encoding
    3.2.4 Other Modules
    3.2.5 Training with Gradient Checkpointing (see the sketch after this list)
    3.2.6 Model Settings
4 Experiments
  4.1 Experiment Settings
  4.2 Effectiveness of Input Length
  4.3 Effectiveness of Chord Information
  4.4 General Subjective Ratings
  4.5 Generation from Scratch
5 Conclusion and Future Work
Bibliography
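The table of contents names training with gradient checkpointing (Section 3.2.5). A minimal sketch of the idea, assuming a PyTorch implementation (the CheckpointedStack module, all hyper-parameters, and the stock encoder layer below are stand-ins, not the thesis's self-attention blocks with relative positional encoding):

```python
# Hypothetical sketch: gradient checkpointing discards each block's
# activations in the forward pass and recomputes them during backward,
# trading extra compute for the memory needed to train on long sequences.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # stand-in blocks; the thesis's model uses self-attention modules
        # with relative positional encoding (Sections 3.2.2-3.2.3)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):
        for block in self.blocks:
            # run the block under checkpointing: only its inputs are kept;
            # intermediate activations are recomputed on the backward pass
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(1, 1024, 512, requires_grad=True)  # (batch, tokens, dim)
CheckpointedStack()(x).sum().backward()            # exercises the recompute path
```

The cost is roughly one extra forward pass per layer during backward, in exchange for a much lower peak memory, which is what makes training on longer token sequences feasible.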
dc.language.iso: en
dc.subject [zh_TW]: 自注意力機制 (self-attention)
dc.subject [zh_TW]: 流行樂 (pop music)
dc.subject [zh_TW]: 鋼琴 (piano)
dc.subject [zh_TW]: 音樂生成 (music generation)
dc.subject [en]: music generation
dc.subject [en]: pop
dc.subject [en]: piano
dc.subject [en]: Transformer
dc.title [zh_TW]: 使用和弦編碼轉換的流行音樂鋼琴樂曲自動生成
dc.title [en]: Pop Piano Music Generation Using Chord-encoded Transformer
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 楊奕軒 (Yi-Hsuan Yang)
dc.contributor.oralexamcommittee: 蔡銘峰 (Ming-Feng Tsai)
dc.subject.keyword [zh_TW]: 音樂生成, 流行樂, 鋼琴, 自注意力機制 (music generation, pop music, piano, self-attention)
dc.subject.keyword [en]: music generation, pop, piano, Transformer
dc.relation.page: 50
dc.identifier.doi: 10.6342/NTU202001511
dc.rights.note: 有償授權 (paid license)
dc.date.accepted: 2020-07-15
dc.contributor.author-college [zh_TW]: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept [zh_TW]: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
File: U0001-1407202014411900.pdf (3.54 MB, Adobe PDF), access restricted (未授權公開取用)