Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95328
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 廖世偉 (zh_TW)
dc.contributor.advisor: Shih-Wei Liao (en)
dc.contributor.author: 黃薺用 (zh_TW)
dc.contributor.author: Chi-Yung Huang (en)
dc.date.accessioned: 2024-09-05T16:11:33Z
dc.date.available: 2024-09-06
dc.date.copyright: 2024-09-05
dc.date.issued: 2024
dc.date.submitted: 2024-08-07
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95328
dc.description.abstract (zh_TW): This thesis proposes a no-reference video quality assessment (NR-VQA) method based on CLIP and aesthetic criteria. With the rapid development of mobile devices and internet technology, user-generated content (UGC) videos have become a common medium on social media platforms. Because UGC videos are produced under widely varying conditions, however, assessing their quality is a significant challenge. Traditional full-reference video quality assessment (FR-VQA) methods rely on pristine source videos, which are generally unavailable for UGC. This thesis therefore develops and applies a no-reference approach that evaluates quality from the characteristics of the video itself.
This work uses the CLIP (Contrastive Language-Image Pre-training) model to extract high-level aesthetic features and combines them with low-level perceptual features to build a comprehensive video quality assessment model, CA-VQA. A large-scale text-image aesthetic dataset is first constructed from the AVA dataset with multimodal large language models (MLLMs) and used to pre-train CLIP, strengthening its ability to extract aesthetic features. The pre-trained CLIP model is then integrated into the VQA model, which is fine-tuned and evaluated on the KoNViD-1k, YouTube UGC, and LIVE-VQC datasets.
Experimental results show that CA-VQA reaches a PLCC of 0.905 and an SRCC of 0.909 on the KoNViD-1k test set, the best performance among current CLIP-based VQA models. The main contributions are: (1) showing that a CLIP model pre-trained on an IAA dataset performs strongly on video aesthetic quality assessment; (2) proposing CA-VQA, which integrates CLIP into an existing VQA framework in a simple yet effective way and achieves state-of-the-art results on multiple datasets.
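To make the CLIP-based aesthetic branch described in the abstract concrete, the following is a minimal sketch of prompt-based aesthetic scoring of sampled video frames using the Hugging Face transformers CLIP API. The checkpoint, prompts, and mean-pooling over frames are illustrative assumptions; this is not the thesis's CA-VQA architecture or its aesthetically pre-trained weights.

```python
# Minimal sketch: score sampled video frames with CLIP against aesthetic prompts.
# Assumptions (not from the thesis): the openai/clip-vit-base-patch32 checkpoint,
# the two prompts below, and mean-pooling over frames.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a high quality, aesthetically pleasing photo",
           "a low quality, unappealing photo"]

def aesthetic_score(frames: list[Image.Image]) -> float:
    """Softmax probability of the 'high quality' prompt, averaged over frames."""
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_frames, num_prompts)
    probs = logits.softmax(dim=-1)
    return probs[:, 0].mean().item()
```

In CA-VQA this aesthetic signal is fused with low-level perceptual features before a quality-regression head produces the final score; the sketch shows only the CLIP side.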
dc.description.abstract (en): The rapid advancement of mobile devices and internet technology has facilitated the widespread capture and production of videos, making video quality a crucial metric on social media platforms. Evaluating the quality of user-generated content (UGC) videos poses significant challenges due to varied distortions, which makes no-reference video quality assessment (NR-VQA) algorithms essential.
We developed the CA-VQA model, which distinguishes between low-level perceptual factors and aesthetic factors when assessing video quality. By pre-training CLIP on a large-scale text-image aesthetic dataset created from the AVA dataset with MLLMs, we enhanced its ability to extract aesthetic features. CA-VQA integrates CLIP with existing VQA frameworks and achieves a PLCC of 0.905 and an SRCC of 0.909 on the KoNViD-1k test set, the highest performance among current CLIP-based VQA models.
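For reference, the PLCC and SRCC quoted above are the Pearson and Spearman correlations between predicted scores and ground-truth mean opinion scores (MOS). A minimal sketch using scipy.stats follows; the example arrays are placeholder values, not data from the thesis.

```python
# PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation)
# between predicted quality scores and ground-truth MOS values.
from scipy.stats import pearsonr, spearmanr

predicted = [3.1, 4.2, 2.5, 3.8, 4.6]   # model outputs (placeholder values)
mos       = [3.0, 4.4, 2.7, 3.5, 4.8]   # human mean opinion scores (placeholder values)

plcc, _ = pearsonr(predicted, mos)
srcc, _ = spearmanr(predicted, mos)
print(f"PLCC = {plcc:.3f}, SRCC = {srcc:.3f}")
```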
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-05T16:11:33Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2024-09-05T16:11:33Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements i
摘要 (Abstract in Chinese) ii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Handcrafted Feature-Based NR-VQA Models 4
2.2 Deep Learning-Based NR-VQA Models 5
2.3 Vision-Language Based NR-VQA Models 6
2.4 VQA Datasets 7
Chapter 3 Methodology 11
3.1 Custom AVA Dataset 11
3.2 Contrastive Language-Image Pre-training 13
3.3 CLIP Aesthetic Feature Extraction for Video Quality Assessment 15
3.3.1 Aesthetic Feature Extractor 15
3.3.2 Low-Level Perceptual Feature Extractor 16
3.3.3 Quality Regression 17
Chapter 4 Evaluation 19
4.1 Evaluation 19
4.1.1 Setup 19
4.1.2 Performance Evaluation 21
4.2 Ablation Study 27
Chapter 5 Conclusion 29
References 31
dc.language.iso: en
dc.subject: 視覺語言模型 (zh_TW)
dc.subject: 影片品質評估 (zh_TW)
dc.subject: 審美評估 (zh_TW)
dc.subject: Aesthetic Assessment (en)
dc.subject: Vision-Language Model (en)
dc.subject: Video Quality Assessment (en)
dc.title: 增強影片品質評估:基於CLIP和審美標準的無參考影片品質評估方法 (zh_TW)
dc.title: Enhancing Video Quality Assessment: A CLIP-Based Approach for Blind Video Quality Assessment with Aesthetic Criteria (en)
dc.type: Thesis
dc.date.schoolyear: 112-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 傅楸善; 盧瑞山; 李逸元 (zh_TW)
dc.contributor.oralexamcommittee: Chiou-Shann Fuh; Ruei-Shan Lu; Yi-Yuan Lee (en)
dc.subject.keyword: 影片品質評估, 視覺語言模型, 審美評估 (zh_TW)
dc.subject.keyword: Video Quality Assessment, Vision-Language Model, Aesthetic Assessment (en)
dc.relation.page: 36
dc.identifier.doi: 10.6342/NTU202403345
dc.rights.note: 未授權 (not authorized for release)
dc.date.accepted: 2024-08-10
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
ntu-112-2.pdf (968.12 kB, Adobe PDF): not authorized for public access

