Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/58195
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 張智星(Jyh-Shing Roger Jang) | |
dc.contributor.author | Yu-Siang Huang | en |
dc.contributor.author | 黃郁翔 | zh_TW |
dc.date.accessioned | 2021-06-16T08:07:59Z | - |
dc.date.available | 2025-07-14 | |
dc.date.copyright | 2020-07-17 | |
dc.date.issued | 2020 | |
dc.date.submitted | 2020-07-15 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/58195 | - |
dc.description.abstract | Music generation differs notably from image and video generation. First, music is an art of time, so methods for temporal sequence processing are required. Second, notes are not merely ordered one after another in time; neighboring groups of notes form various kinds of musical syntax and structure, such as chords, arpeggios, and scales. In this thesis, within the framework of self-attention-based models, we investigate how to generate pop piano music several minutes long. We further propose a data pre-processing pipeline that converts raw audio into the MIDI format. To analyze the generated results, we conducted a subjective user study, which yielded many insights; we also examined the strengths and weaknesses of the model architecture comprehensively, further validating the effectiveness of the proposed method and clarifying both the power and the limitations of deep learning techniques. | zh_TW |
dc.description.abstract | Generating music differs in a few notable ways from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, in polyphonic music notes are often grouped into chords, arpeggios, or melodies, so introducing a sequential ordering of notes into the generative model is critical (see the first illustrative sketch after this metadata table). In this thesis, we investigated the Transformer framework for generating minute-long pop piano music. We also proposed a data pre-processing pipeline that collects audio data and converts it to the MIDI format. To evaluate the generated results, we conducted a subjective user study demonstrating the effectiveness of the proposed method. | en |
dc.description.provenance | Made available in DSpace on 2021-06-16T08:07:59Z (GMT). No. of bitstreams: 1 U0001-1407202014411900.pdf: 3625414 bytes, checksum: 7150038659da4efd4b3848a351ac4f51 (MD5) Previous issue date: 2020 | en |
dc.description.tableofcontents | Oral Defense Committee Certification; Abstract (Chinese); Abstract; Contents; List of Figures; List of Tables; 1 Introduction: 1.1 Motivation, 1.2 Problem Statement (1.2.1 Design of Representation, 1.2.2 Design of Networks), 1.3 Contribution, 1.4 Thesis Organization; 2 Related Work: 2.1 Symbolic Music Generation (2.1.1 Image-modeling Approach, 2.1.2 Language-modeling Approach), 2.2 Transformer; 3 Method: 3.1 Data Pre-processing (3.1.1 Data Collection, 3.1.2 Music Transcription, 3.1.3 Time Quantization, 3.1.4 Data Augmentation, 3.1.5 Symbolic Chord Recognition, 3.1.6 Event Representation), 3.2 Model (3.2.1 Adaptive Input Representation, 3.2.2 Self-Attention Modules, 3.2.3 Relative Positional Encoding, 3.2.4 Other Modules, 3.2.5 Training with Gradient Checkpointing [illustrated in the second sketch after this table], 3.2.6 Model Settings); 4 Experiments: 4.1 Experiment Settings, 4.2 Effectiveness of Input Length, 4.3 Effectiveness of Chord Information, 4.4 General Subjective Ratings, 4.5 Generation from Scratch; 5 Conclusion and Future Work; Bibliography | |
dc.language.iso | en | |
dc.title | 使用和弦編碼轉換的流行音樂鋼琴樂曲自動生成 (Automatic Generation of Pop Piano Music Using a Chord-Encoded Transformer) | zh_TW |
dc.title | Pop Piano Music Generation Using Chord-encoded Transformer | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-2 | |
dc.description.degree | Master's | |
dc.contributor.coadvisor | 楊奕軒(Yi-Hsuan Yang) | |
dc.contributor.oralexamcommittee | 蔡銘峰(Ming-Feng Tsai) | |
dc.subject.keyword | music generation, pop music, piano, self-attention mechanism | zh_TW |
dc.subject.keyword | music generation, pop, piano, Transformer | en |
dc.relation.page | 50 | |
dc.identifier.doi | 10.6342/NTU202001511 | |
dc.rights.note | Paid authorization | |
dc.date.accepted | 2020-07-15 | |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | zh_TW |
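
The abstracts above describe converting transcribed audio into a chord-aware, MIDI-derived event sequence that a Transformer can model as a flat token stream. The following is a minimal, self-contained sketch of that idea; the token names (`CHORD_*`, `NOTE_ON`, `NOTE_OFF`, `TIME_SHIFT`), the tick grid, and the `Note` structure are assumptions made here for illustration, not the thesis's exact representation (Section 3.1.6).

```python
# Minimal sketch: serialize notes and chord labels into one event stream.
# All token names and the quantized tick grid are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Note:
    start: int   # onset, in quantized ticks
    end: int     # offset, in quantized ticks
    pitch: int   # MIDI pitch number (0-127)

def encode_events(notes, chords):
    """Serialize notes and chord labels into one ordered token stream.

    `chords` maps a tick position to a chord label, e.g. {0: "C:maj"}.
    """
    # Collect (tick, priority, token) triples so simultaneous events sort
    # deterministically: chord marker first, then note-offs, then note-ons.
    events = []
    for tick, label in chords.items():
        events.append((tick, 0, f"CHORD_{label}"))
    for n in notes:
        events.append((n.end, 1, f"NOTE_OFF_{n.pitch}"))
        events.append((n.start, 2, f"NOTE_ON_{n.pitch}"))
    events.sort()

    tokens, now = [], 0
    for tick, _, tok in events:
        if tick > now:                      # advance time explicitly
            tokens.append(f"TIME_SHIFT_{tick - now}")
            now = tick
        tokens.append(tok)
    return tokens

if __name__ == "__main__":
    # A C-major triad held for four ticks, then a single G.
    notes = [Note(0, 4, 60), Note(0, 4, 64), Note(4, 8, 67)]
    print(encode_events(notes, {0: "C:maj"}))
```

Feeding such a flat stream to a language-model-style network is what makes the "sequential ordering of notes" mentioned in the abstract explicit.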
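
The table of contents lists "Training with Gradient Checkpointing" (Section 3.2.5). The sketch below shows that general technique in PyTorch, trading compute for memory by recomputing each block's activations during the backward pass; the layer count, dimensions, and use of `nn.TransformerEncoderLayer` are placeholders, not the thesis's actual model.

```python
# Minimal gradient-checkpointing sketch (PyTorch; `use_reentrant=False`
# requires a reasonably recent PyTorch release).
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class TinyEncoder(nn.Module):
    def __init__(self, d_model=512, n_layers=4, use_checkpoint=True):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            if self.use_checkpoint and self.training:
                # Do not store this layer's intermediate activations;
                # recompute them when gradients are needed.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x

model = TinyEncoder().train()
tokens = torch.randn(2, 128, 512)       # (batch, sequence, d_model)
model(tokens).sum().backward()          # backward triggers recomputation
```

With checkpointing enabled, each layer's internal activations are freed after its forward pass, which is what makes training on long event sequences feasible on limited GPU memory.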
Appears in Collections: | Graduate Institute of Networking and Multimedia
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-1407202014411900.pdf (currently not authorized for public access) | 3.54 MB | Adobe PDF |
All items in this system are protected by copyright, with all rights reserved, unless otherwise indicated.