Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95328
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 廖世偉 (zh_TW)
dc.contributor.advisor: Shih-Wei Liao (en)
dc.contributor.author: 黃薺用 (zh_TW)
dc.contributor.author: Chi-Yung Huang (en)
dc.date.accessioned: 2024-09-05T16:11:33Z
dc.date.available: 2024-09-06
dc.date.copyright: 2024-09-05
dc.date.issued: 2024
dc.date.submitted: 2024-08-07
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95328
dc.description.abstract (zh_TW): This thesis proposes a no-reference video quality assessment (NR-VQA) method based on CLIP and aesthetic criteria. With the rapid development of mobile devices and internet technology, user-generated content (UGC) videos have become a common medium on social media platforms. Because UGC videos are produced under widely varying conditions, however, assessing their quality is a significant challenge. Traditional full-reference video quality assessment (FR-VQA) methods rely on pristine source videos, which are generally unavailable for UGC. This thesis therefore develops and applies a no-reference approach that evaluates quality from the characteristics of the video itself.
This work uses the CLIP (Contrastive Language-Image Pre-training) model to extract high-level aesthetic features and combines them with low-level perceptual features to build a comprehensive video quality assessment model, CA-VQA. A large-scale text-image aesthetic dataset is first constructed from the AVA dataset with multimodal large language models (MLLMs) and used to pre-train CLIP, strengthening its ability to extract aesthetic features. The pre-trained CLIP model is then integrated into the VQA model, which is fine-tuned and evaluated on the KoNViD-1k, YouTube UGC, and LIVE-VQC datasets.
Experimental results show that CA-VQA reaches a PLCC of 0.905 and an SRCC of 0.909 on the KoNViD-1k test set, the best performance among current CLIP-based VQA models. The main contributions are: (1) showing that a CLIP model pre-trained on an IAA dataset performs strongly on video aesthetic quality assessment; (2) proposing CA-VQA, which integrates CLIP into an existing VQA framework in a simple yet effective way and achieves state-of-the-art results on multiple datasets.
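To make the CLIP-based aesthetic branch described in the abstract concrete, the following is a minimal sketch of prompt-based aesthetic scoring of sampled video frames using the Hugging Face transformers CLIP API. The checkpoint, prompts, and mean-pooling over frames are illustrative assumptions; this is not the thesis's CA-VQA architecture or its aesthetically pre-trained weights.

```python
# Minimal sketch: score sampled video frames with CLIP against aesthetic prompts.
# Assumptions (not from the thesis): the openai/clip-vit-base-patch32 checkpoint,
# the two prompts below, and mean-pooling over frames.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a high quality, aesthetically pleasing photo",
           "a low quality, unappealing photo"]

def aesthetic_score(frames: list[Image.Image]) -> float:
    """Softmax probability of the 'high quality' prompt, averaged over frames."""
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_frames, num_prompts)
    probs = logits.softmax(dim=-1)
    return probs[:, 0].mean().item()
```

In CA-VQA this aesthetic signal is fused with low-level perceptual features before a quality-regression head produces the final score; the sketch shows only the CLIP side.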
dc.description.abstract (en): The rapid advancement of mobile devices and internet technology has facilitated the widespread capture and production of videos, making video quality a crucial metric on social media platforms. Evaluating the quality of user-generated content (UGC) videos poses significant challenges due to varied distortions, which makes no-reference video quality assessment (NR-VQA) algorithms essential.
We developed the CA-VQA model, which distinguishes between low-level perceptual factors and aesthetic factors when assessing video quality. By pre-training CLIP on a large-scale text-image aesthetic dataset created from the AVA dataset with MLLMs, we enhanced its ability to extract aesthetic features. CA-VQA integrates CLIP with existing VQA frameworks and achieves a PLCC of 0.905 and an SRCC of 0.909 on the KoNViD-1k test set, the highest performance among current CLIP-based VQA models.
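For reference, the PLCC and SRCC quoted above are the Pearson and Spearman correlations between predicted scores and ground-truth mean opinion scores (MOS). A minimal sketch using scipy.stats follows; the example arrays are placeholder values, not data from the thesis.

```python
# PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation)
# between predicted quality scores and ground-truth MOS values.
from scipy.stats import pearsonr, spearmanr

predicted = [3.1, 4.2, 2.5, 3.8, 4.6]   # model outputs (placeholder values)
mos       = [3.0, 4.4, 2.7, 3.5, 4.8]   # human mean opinion scores (placeholder values)

plcc, _ = pearsonr(predicted, mos)
srcc, _ = spearmanr(predicted, mos)
print(f"PLCC = {plcc:.3f}, SRCC = {srcc:.3f}")
```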
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-05T16:11:33Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2024-09-05T16:11:33Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements i
摘要 (Abstract in Chinese) ii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Handcrafted Feature-Based NR-VQA Models 4
2.2 Deep Learning-Based NR-VQA Models 5
2.3 Vision-Language Based NR-VQA Models 6
2.4 VQA Datasets 7
Chapter 3 Methodology 11
3.1 Custom AVA Dataset 11
3.2 Contrastive Language-Image Pre-training 13
3.3 CLIP Aesthetic Feature Extraction for Video Quality Assessment 15
3.3.1 Aesthetic Feature Extractor 15
3.3.2 Low-Level Perceptual Feature Extractor 16
3.3.3 Quality Regression 17
Chapter 4 Evaluation 19
4.1 Evaluation 19
4.1.1 Setup 19
4.1.2 Performance Evaluation 21
4.2 Ablation Study 27
Chapter 5 Conclusion 29
References 31
dc.language.iso: en
dc.subject: 視覺語言模型 (zh_TW)
dc.subject: 影片品質評估 (zh_TW)
dc.subject: 審美評估 (zh_TW)
dc.subject: Aesthetic Assessment (en)
dc.subject: Vision-Language Model (en)
dc.subject: Video Quality Assessment (en)
dc.title: 增強影片品質評估:基於CLIP和審美標準的無參考影片品質評估方法 (zh_TW)
dc.title: Enhancing Video Quality Assessment: A CLIP-Based Approach for Blind Video Quality Assessment with Aesthetic Criteria (en)
dc.type: Thesis
dc.date.schoolyear: 112-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 傅楸善; 盧瑞山; 李逸元 (zh_TW)
dc.contributor.oralexamcommittee: Chiou-Shann Fuh; Ruei-Shan Lu; Yi-Yuan Lee (en)
dc.subject.keyword: 影片品質評估, 視覺語言模型, 審美評估 (zh_TW)
dc.subject.keyword: Video Quality Assessment, Vision-Language Model, Aesthetic Assessment (en)
dc.relation.page: 36
dc.identifier.doi: 10.6342/NTU202403345
dc.rights.note: 未授權 (not authorized for release)
dc.date.accepted: 2024-08-10
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
ntu-112-2.pdf (968.12 kB, Adobe PDF): not authorized for public access

