Enhancing Multimodal Graph Neural Networks with Music Understanding Models for Music Recommendation

管晟宇; Cheng-Yu Kuan

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101815

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	曹承礎	zh_TW
dc.contributor.advisor	Seng-Cho Chou	en
dc.contributor.author	管晟宇	zh_TW
dc.contributor.author	Cheng-Yu Kuan	en
dc.date.accessioned	2026-03-04T16:47:39Z	-
dc.date.available	2026-03-05	-
dc.date.copyright	2026-03-04	-
dc.date.issued	2026	-
dc.date.submitted	2026-02-07	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101815	-
dc.description.abstract	在音樂推薦系統的發展中，如何有效地對聲音資訊中的複雜特徵進行建模至關重要。本研究透過結合音樂理解模型與多模態圖神經網路 (GNN) 二者，直接從原始音訊中提取深層語義特徵並嵌入複雜的圖網路中，使模型得以在二分圖 (bipartite graph) 上有效傳遞基於內容的信號，提升音樂推薦系統的效能。此外，面對每日發行的大量新歌曲，音樂推薦系統常受限於嚴重的交互資訊稀缺與流行度偏差 (popularity bias)；本研究藉由引入深層音樂嵌入向量，利用從聲音中提取的資訊對新歌曲進行推薦，使系統能降低對歷史互動數據的依賴，為音樂推薦領域中的「物品冷啟動」 (item cold-start) 問題提供一種解法。在 Music4All 資料集上的實驗結果顯示了內容特徵與推薦效能在不同場景下的關係。在互動數據豐富的「熱啟動」(warm-start) 場景下，純協同過濾模型 LightGCN 的表現優於實驗中其他多模態 GNN，顯示當使用者與物品間的互動信號充足時，加入其他模態資訊輔助的效益有限。然而，音樂特徵在「物品冷啟動」的情境下則十分重要。具體而言，在低互動的資料集上，DRAGON 模型的 Recall@20 指標在加入音樂特徵後顯著增長了 30.80%。本研究證實聲音資訊在音樂推薦系統中是關鍵的信號；若與先進的 GNN 架構進行整合，能大幅改善長尾與小眾音樂的推薦效果，有助於建構完善的多模態推薦系統。	zh_TW
dc.description.abstract	Effectively modeling the complex characteristics in audio content is fundamental to advancing personalized music recommendation systems. To achieve this, this thesis proposes a novel approach that enhances multimodal graph neural networks (GNNs) by integrating music understanding models, specifically MERT and Music2Vec. By extracting deep semantic features directly from raw audio waveforms and embedding these representations into sophisticated graph structures, the model effectively propagates content-based signals across the user-item bipartite graph. Furthermore, this approach provides a solution to the cold-start item recommendation problem, a persistent challenge arising from the high volume of daily releases, which results in severe information scarcity and entrenched popularity bias. By incorporating deep music embeddings, the system can recommend newly released items based on their inherent acoustic properties, reducing the dependency on historical interaction data. Experimental results on the Music4All dataset indicate a nuanced interplay between content features and recommendation performance. In warm-start scenarios, the purely collaborative filtering model LightGCN consistently outperformed all evaluated multimodal GNNs, suggesting that auxiliary modalities may offer limited additional benefit when dense interaction histories provide strong signals. Conversely, music representations become critical in cold-start item recommendation. Notably, the DRAGON architecture achieves a 30.80% improvement in Recall@20 when transitioning from single to dual modalities on the low-count split dataset. Our findings highlight that acoustic information functions as an essential complementary signal which, when integrated through advanced GNN architectures, substantially improves the discovery of long-tail and niche music items.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-03-04T16:47:39Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2026-03-04T16:47:39Z (GMT). No. of bitstreams: 0	en
dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements; Abstract (in Chinese); Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Background; 1.2 Research Motivation; 1.3 Research Objectives; Chapter 2 Literature Review; 2.1 Music Representation Learning; 2.2 Multimodal Recommendation Systems; 2.3 Cold-Start Recommendation Models and Evaluation Frameworks; Chapter 3 Methodology; 3.1 Problem Formulation; 3.2 Music Feature Extraction; 3.2.1 MERT; 3.2.2 Music2Vec; 3.2.3 Textual Feature Extraction with Sentence-BERT; 3.3 Multimodal Graph Neural Networks; 3.3.1 Baseline Architectures: VBPR and LightGCN; 3.3.2 BM3; 3.3.3 DRAGON; 3.3.4 FREEDOM; 3.4 Feature Propagation and Fusion Mechanisms; 3.4.1 Message Passing; 3.4.2 Multimodal Fusion; 3.5 Memory Optimization; Chapter 4 Experiments; 4.1 Datasets; 4.1.1 Data Preprocessing; 4.1.2 Cold-Start Data Split Strategies; 4.2 Evaluation Protocols; 4.2.1 Candidate Sampling; 4.3 Implementation Details; 4.4 Performance Comparison; 4.4.1 Effect of Music Understanding Models; 4.4.2 Effect of Cold-Start Data Split and Sampling Strategies; Chapter 5 Conclusion; References	-
dc.language.iso	en	-
dc.subject	多模態推薦	-
dc.subject	圖神經網路	-
dc.subject	物品冷啟動推薦	-
dc.subject	推薦系統	-
dc.subject	音樂嵌入學習	-
dc.subject	Multimodal recommendation	-
dc.subject	Content-based recommender systems	-
dc.subject	Graph neural networks	-
dc.subject	Cold-start item recommendation	-
dc.subject	Recommender systems	-
dc.subject	Music representation learning	-
dc.title	利用音樂理解模型增強多模態圖神經網絡的音樂推薦	zh_TW
dc.title	Enhancing Multimodal Graph Neural Networks with Music Understanding Models for Music Recommendation	en
dc.type	Thesis	-
dc.date.schoolyear	114-1	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	陳建錦;張伊君	zh_TW
dc.contributor.oralexamcommittee	Chien-Chin Chen;Natalia Chang	en
dc.subject.keyword	多模態推薦,圖神經網路,物品冷啟動推薦,推薦系統,音樂嵌入學習	zh_TW
dc.subject.keyword	Multimodal recommendation,Content-based recommender systems,Graph neural networks,Cold-start item recommendation,Recommender systems,Music representation learning	en
dc.relation.page	50	-
dc.identifier.doi	10.6342/NTU202600712	-
dc.rights.note	Authorization granted (access limited to campus)	-
dc.date.accepted	2026-02-09	-
dc.contributor.author-college	College of Management	-
dc.contributor.author-dept	Department of Information Management	-
dc.date.embargo-lift	2026-03-05	-
Appears in Collections:	Department of Information Management

Files in This Item:

File	Size	Format
ntu-114-1.pdf (access restricted to NTU campus IP addresses; off campus, please use the VPN remote access service)	5.56 MB	Adobe PDF

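Unless their copyright terms are specifically indicated otherwise, all items in the system are protected by copyright, with all rights reserved.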