NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98821
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 廖世偉 | zh_TW
dc.contributor.advisor | Shih-Wei Liao | en
dc.contributor.author | 梁家綸 | zh_TW
dc.contributor.author | Chia-Lun Liang | en
dc.date.accessioned | 2025-08-19T16:19:57Z | -
dc.date.available | 2025-08-20 | -
dc.date.copyright | 2025-08-19 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-08-07 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98821 | -
dc.description.abstract | 隨著使用者創作影片的迅速增加,開發穩健且自動化的影片品質評估(VQA)方法已變得至關重要。儘管大型多模態模型(LMMs)已為無參考影片品質評估(BVQA)帶來了進展,但現行方法仍普遍將品質評估轉化為一個粗略的五級分類問題,因此限制了模型在影片品質上微小差異的辨識能力。在本文中,我們提出 Pali-VQA,一種高效、基於 LMM 的 BVQA 模型,通過引進多級別評分架構來打破此一限制。Pali-VQA 奠基於 PaliGemma 2,將 BVQA 重新定義為一個最多可達 18 個不同評分等級的細緻分類問題。我們採用低秩自適應(LoRA)進行微調,並結合序數迴歸標籤平滑技術,在正則化模型的同時保留評分等級之間內在的順序資訊。儘管我們僅在單一資料集上進行 LoRA 微調,但在四個實景 VQA 基準測試的實驗中,Pali-VQA 取得了具競爭力的表現,足以媲美或超越那些參數量更大、進行完整微調或使用集成方法的模型。此外,當 Pali-VQA 與非 LMM 的深度神經網路(DNN)BVQA 模型 FAST-VQA 進行集成時,Pali-VQA 更在四個基準資料集中的三個上超越了所有先前的模型。我們的研究結果顯示,提升評分等級的數量能顯著提高預測表現,為基於 LMM 的影片品質評估方法提供了一條更經濟、更有效的途徑。 | zh_TW
dc.description.abstract | The exponential increase in user-generated video content necessitates robust automated methods for Video Quality Assessment (VQA). While Large Multimodal Models (LMMs) have propelled advances in Blind VQA (BVQA), current approaches typically frame quality prediction as a coarse, five-level classification task, limiting their ability to discern fine-grained video quality differences. In this paper, we introduce Pali-VQA, an efficient LMM-based BVQA model that addresses this limitation by incorporating a multi-level rating framework. Built on the PaliGemma 2 backbone, Pali-VQA reframes BVQA as a fine-grained classification problem with up to 18 distinct rating levels. We employ Low-Rank Adaptation (LoRA) for fine-tuning and incorporate an ordinal regression label smoothing technique that preserves the inherent ordinal information among rating levels while regularizing the model. Although Pali-VQA is fine-tuned with LoRA on only a single dataset, experiments on four in-the-wild VQA benchmarks show that it achieves competitive performance, matching or outperforming larger, fully fine-tuned, or ensemble models. Moreover, when ensembled with FAST-VQA, a non-LMM Deep Neural Network (DNN) BVQA model, Pali-VQA outperforms all previous top models on three of the four datasets. Our findings demonstrate that increasing the granularity of the rating levels significantly enhances predictive performance, offering a more efficient and effective path to LMM-based video quality assessment. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-19T16:19:57Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-08-19T16:19:57Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | 口試委員審定書 (Oral Examination Committee Approval) i
Acknowledgements iii
摘要 (Chinese Abstract) v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Knowledge-based BVQA Methods 5
2.2 Non-LMM Deep BVQA Methods 6
2.3 LMM-based BVQA Methods 7
Chapter 3 VQA Datasets 9
3.1 LSVQ Dataset 9
3.2 KoNViD-1k Dataset 10
3.3 LIVE-VQC Dataset 11
3.4 Distribution of MOS Values 12
3.5 Statistical Significance of MOS Differences 12
3.5.1 Pairwise Significance Threshold 13
3.5.2 Summary of Thresholds 14
Chapter 4 Methodology 17
4.1 Video Preprocessing 18
4.2 Quality Prompt 18
4.3 MOS-to-Level Mapping 18
4.4 LMM Backbone 19
4.5 Prediction Score Calculation 20
4.6 Multi-Level Rating Framework 21
4.7 Ordinal Regression Label Smoothing 21
Chapter 5 Experiments 25
5.1 Fine-tuning Settings 25
5.2 Evaluation Benchmarks 28
5.3 Evaluation Metrics 28
5.3.1 Pearson Linear Correlation Coefficient (PLCC) 28
5.3.2 Spearman’s Rank Order Correlation Coefficient (SRCC) 29
5.4 Inference Setting 30
5.5 Results 32
5.6 Discussion 35
5.7 Ablation Study 35
5.7.1 Multi-level Rating Scheme 36
5.7.2 Ordinal Regression Label Smoothing 38
Chapter 6 Conclusion 41
References 43
dc.language.iso | en | -
dc.subject | 影片品質評估 | zh_TW
dc.subject | 大型多模態模型 | zh_TW
dc.subject | PaliGemma | zh_TW
dc.subject | 序數迴歸 | zh_TW
dc.subject | Large Multimodal Model | en
dc.subject | PaliGemma | en
dc.subject | Video Quality Assessment | en
dc.subject | Ordinal Regression | en
dc.title | Pali-VQA:基於 PaliGemma 2 的多級別無參考影片品質評估模型 | zh_TW
dc.title | Pali-VQA: A PaliGemma 2-Based Multi-Level Blind Video Quality Assessment Model | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 (Master) | -
dc.contributor.oralexamcommittee | 葉春超;黃鐘揚;黃敬群;吳馬丁 | zh_TW
dc.contributor.oralexamcommittee | Chun-Chao Yeh;Chung-Yang Huang;Ching-Chun Huang;Torbjörn Nordling | en
dc.subject.keyword | 影片品質評估,大型多模態模型,PaliGemma,序數迴歸 | zh_TW
dc.subject.keyword | Video Quality Assessment,Large Multimodal Model,PaliGemma,Ordinal Regression | en
dc.relation.page | 53 | -
dc.identifier.doi | 10.6342/NTU202501694 | -
dc.rights.note | 同意授權(全球公開) (Consent to release, worldwide open access) | -
dc.date.accepted | 2025-08-13 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) | -
dc.date.embargo-lift | 2025-08-20 | -
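
The abstract above describes three technical ingredients that a small worked example can make concrete: binning continuous MOS values into discrete rating levels, smoothing the one-hot level target so that neighbouring levels keep some probability mass (ordinal regression label smoothing), and collapsing predicted level probabilities back into a continuous quality score. The Python sketch below illustrates one plausible form of these steps; the level count of 18 is taken from the abstract, while the MOS range, the distance-based smoothing kernel, and all function names are illustrative assumptions rather than the thesis's actual implementation.

import numpy as np

# Hedged sketch: the 18-level count comes from the abstract; the MOS range,
# the smoothing kernel, and the function names below are assumptions made
# for illustration only.

NUM_LEVELS = 18               # maximum number of rating levels (from the abstract)
MOS_MIN, MOS_MAX = 1.0, 5.0   # assumed MOS range of the training labels

def mos_to_level(mos: float) -> int:
    """Map a continuous MOS onto one of NUM_LEVELS equally wide bins."""
    frac = (mos - MOS_MIN) / (MOS_MAX - MOS_MIN)
    return min(int(frac * NUM_LEVELS), NUM_LEVELS - 1)

def ordinal_soft_labels(level: int, tau: float = 1.0) -> np.ndarray:
    """Soft target that decays with distance from the true level, so adjacent
    levels retain more probability mass than distant ones."""
    dist = np.abs(np.arange(NUM_LEVELS) - level)
    logits = -dist / tau
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def expected_score(level_probs: np.ndarray) -> float:
    """Collapse predicted level probabilities into a continuous score by
    averaging the bin centres, weighted by their probabilities."""
    centres = MOS_MIN + (np.arange(NUM_LEVELS) + 0.5) * (MOS_MAX - MOS_MIN) / NUM_LEVELS
    return float(np.dot(level_probs, centres))

if __name__ == "__main__":
    lvl = mos_to_level(3.7)              # e.g. a MOS of 3.7 falls into bin 12
    target = ordinal_soft_labels(lvl)    # smoothed training target for that bin
    print(lvl, expected_score(target))

In this formulation, a larger level count shrinks the bin width, which is consistent with the abstract's finding that finer rating granularity improves predictive performance.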
Appears in Collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in This Item:
File | Size | Format
ntu-113-2.pdf | 3.26 MB | Adobe PDF


