Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99814

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 葉彌妍 | zh_TW |
| dc.contributor.advisor | Mi-Yen Yeh | en |
| dc.contributor.author | 李宇昂 | zh_TW |
| dc.contributor.author | Yu-Ang Lee | en |
| dc.date.accessioned | 2025-09-18T16:04:56Z | - |
| dc.date.available | 2025-09-19 | - |
| dc.date.copyright | 2025-09-18 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-13 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99814 | - |
| dc.description.abstract | 模型融合是一種從多個預訓練模型中獲得多任務模型的高效方法,無需進一步微調,目前在自然語言處理(Natural Language Processing, NLP)等各個領域都受到廣泛關注。本論文首先探討現有模型融合研究領域的高層次分類架構,並歸納出當前方法面臨的三大核心挑戰:任務衝突、超參數依賴性,以及稀疏化方法的侷限性。針對這些挑戰,我們重訪已發表之基於頻譜截斷及重縮放的STAR (Spectral Truncation and Rescale)方法,透過在頻譜空間中截斷較小的分量來緩解任務衝突問題,並採用自動參數重縮放機制以保持原始矩陣的核範數。STAR無需對原始訓練資料進行額外推理,且對超參數選擇具備良好的穩健性。我們透過在多樣化NLP任務上的大量模型融合實驗驗證STAR的有效性。實驗結果顯示,STAR在不同模型規模下皆表現穩定,在Flan-T5上融合12個模型時,相較於基準方法可提升4.2%的效能。此外,基於我們對任務向量的深入分析,本論文亦提出一個初步的選擇性參數高效微調(parameter-efficient fine-tuning, PEFT)方法,該方法善用不同任務向量間的共通模式。相關程式碼已公開於https://github.com/IBM/STAR。 | zh_TW |
| dc.description.abstract | Model merging is an efficient way of obtaining a multi-task model from several pretrained models without further fine-tuning, and it has gained attention in various domains, including natural language processing (NLP). In this thesis, we develop a high-level taxonomy of model merging research and identify three key challenges in current model merging methods. To address these challenges, we revisit a published method, STAR (Spectral Truncation And Rescale), which mitigates task conflicts by truncating small components in the respective spectral spaces, followed by an automatic parameter rescaling scheme that retains the nuclear norm of the original matrix. STAR requires no additional inference on the original training data and is robust to hyperparameter choice. We demonstrate the effectiveness of STAR through extensive model merging experiments on diverse NLP tasks, and conduct several analyses including hyperparameter sensitivity. Besides model merging, we also propose a preliminary selective parameter-efficient fine-tuning (PEFT) method that leverages common patterns observed across different task vectors. Our code is publicly available at https://github.com/IBM/STAR. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-18T16:04:56Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-09-18T16:04:56Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee ii
誌謝 iii
摘要 v
ABSTRACT vii
CONTENTS ix
LIST OF FIGURES xi
LIST OF TABLES xiii
Chapter 1 Introduction 1
Chapter 2 Background 3
2.1 The Fine-tuning of Large Language Models 3
2.2 Motivation of Model Merging 5
2.3 Pretrained vs. Instruction-tuned Versions 6
Chapter 3 Model Merging: A Literature Survey 9
3.1 Taxonomy of Existing Research 9
3.2 Related Work 11
3.2.1 Notations and Problem Definition 11
3.2.2 Weighted-based 12
3.2.3 Subspace-based 12
3.3 Challenges 14
3.3.1 Task Conflicts 14
3.3.2 Hyperparameter Dependency 15
3.3.3 Limitations of Sparsification Approaches 16
Chapter 4 STAR (Spectral Truncation and Rescale) 17
4.1 Spectral Truncation 18
4.2 Rescale to Restore Matrix Nuclear Norm 19
4.3 Automatic Rank Determination 20
Chapter 5 Results and Discussion 23
5.1 Experimental Setup 23
5.1.1 Baselines and Hyperparameters 23
5.1.2 Models 24
5.1.3 Evaluation Metric 24
5.2 Model Merging Results 25
5.3 Sensitivity Analysis 27
5.4 Further Discussion 27
5.4.1 Optimal Truncation Extent 27
5.4.2 Ablation Study: The effectiveness of Rescale 29
Chapter 6 Extension: PEFT Exploration 31
6.1 Common Patterns Across Task Vectors 31
6.2 Preliminary PEFT results 32
Chapter 7 Conclusion 35
7.1 Summary of Contributions 35
7.2 Future Research Directions 36
7.2.1 SVD-based Model Merging 36
7.2.2 Finalize Our Proposed PEFT Method 37
7.2.3 Model Merging × Active Learning 37
References 39
Appendix A 45
A.1 Bounding ∥Bx∥ 45
A.2 Algorithm of STAR 46
Appendix B 47
B.1 One-shot STAR performs even better than grid-search TIES 47
Appendix C 49
C.1 Details about the fine-tuned models considered 49
C.2 Optimal k for TIES 49 | - |
| dc.language.iso | en | - |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | 模型融合 | zh_TW |
| dc.subject | 奇異值分解 | zh_TW |
| dc.subject | 參數高效微調 | zh_TW |
| dc.subject | 任務向量 | zh_TW |
| dc.subject | Large Language Model | en |
| dc.subject | Task Vector | en |
| dc.subject | Parameter-Efficient Fine-tuning | en |
| dc.subject | SVD | en |
| dc.subject | Model Merging | en |
| dc.title | STAR:大規模融合語言模型的頻譜截斷與重縮放方法 | zh_TW |
| dc.title | STAR: Spectral Truncation and Rescale for Merging Language Models at Scale | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.coadvisor | 林守德 | zh_TW |
| dc.contributor.coadvisor | Shou-De Lin | en |
| dc.contributor.oralexamcommittee | 李宏毅;李政德 | zh_TW |
| dc.contributor.oralexamcommittee | Hung-yi Lee;Cheng-Te Li | en |
| dc.subject.keyword | 大型語言模型, 模型融合, 奇異值分解, 參數高效微調, 任務向量 | zh_TW |
| dc.subject.keyword | Large Language Model, Model Merging, SVD, Parameter-Efficient Fine-tuning, Task Vector | en |
| dc.relation.page | 50 | - |
| dc.identifier.doi | 10.6342/NTU202503925 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2025-08-14 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資料科學學位學程 | - |
| dc.date.embargo-lift | N/A | - |
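
The abstracts above describe STAR's two core operations: truncating small spectral components of each task-vector matrix, then rescaling the retained singular values so that the matrix's nuclear norm is preserved. The snippet below is a minimal NumPy sketch of that idea only; the function name, the energy-based rank rule (`energy_keep`), and the toy averaging step are illustrative assumptions, not the authors' released implementation (available at https://github.com/IBM/STAR).

```python
# Minimal sketch of spectral truncation followed by nuclear-norm rescaling.
# NOT the released STAR code; rank selection and names are assumptions.
import numpy as np

def spectral_truncate_and_rescale(delta_w: np.ndarray, energy_keep: float = 0.9) -> np.ndarray:
    """Truncate small singular components of a task-vector matrix, then rescale
    the kept singular values so the nuclear norm (sum of singular values) of the
    original matrix is preserved."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)

    # Keep the smallest leading set of singular values covering `energy_keep`
    # of the nuclear norm (a stand-in for STAR's automatic rank determination).
    nuclear_norm = s.sum()
    k = int(np.searchsorted(np.cumsum(s) / nuclear_norm, energy_keep) + 1)

    # Rescale the truncated spectrum so it still sums to the original nuclear norm.
    s_kept = s[:k]
    s_rescaled = s_kept * (nuclear_norm / s_kept.sum())

    return (u[:, :k] * s_rescaled) @ vt[:k, :]

# Toy usage: merge two random "task vectors" by averaging their processed versions.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tv_a, tv_b = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
    merged = 0.5 * (spectral_truncate_and_rescale(tv_a) + spectral_truncate_and_rescale(tv_b))
    print(merged.shape)
```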
| Appears in Collections: | 資料科學學位學程 |
Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted access) | 8.26 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.