Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94622

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 黃乾綱 | zh_TW |
| dc.contributor.advisor | Chien-Kang Huang | en |
| dc.contributor.author | 許彤 | zh_TW |
| dc.contributor.author | Tung Hsu | en |
| dc.date.accessioned | 2024-08-16T17:08:48Z | - |
| dc.date.available | 2024-08-17 | - |
| dc.date.copyright | 2024-08-16 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-08-07 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94622 | - |
| dc.description.abstract | 人物動作辨識 (Human Action Recognition, HAR) 可謂一具挑戰性之議題,因其在分析影像時,須處理複雜之動態場景、時間序列及視覺訊息。近年影像辨識技術逐漸由過去單一模態轉向跨模態架構,如視覺語言模型 (Vision-Language Model, VLM) 等。過往對跨模態辨識模型之研究與改良,通常涉及在模型中納入巨量訓練資料並對模型進行大幅度微調 (Finetune) 以增進辨識成效,惟其過程含龐大的訓練時間、資金、設備等成本,可能限制學者於該領域之深究,此外,模型適應於不同任務或資料集之通用性亦為當今研究所重視之目標。因此,降低跨模態動作辨識模型之訓練成本,同時提升其通用性成為亟待解決之議題。
本研究提出全新跨模態架構,旨在實現具通用性之人物動作辨識任務,並有效降低欲發展該類模型之隱藏成本。研究方法結合已具一定成熟度之影片描述模型 (Video Captioning Model, VCM) 和大型語言模型 (Large Language Model, LLM),並透過專為此任務設計之提示工程技術,為兩模型之核心效能建構實效連繫。實驗結果顯示,在 UCF-101 資料集上,該架構於零樣本辨識準確率 (Zero-Shot Recognition) 達 73.4%,已優於部分跨模態之人物動作辨識模型,充分反映其可行性及發展潛力。此外,本研究對所提出之架構進行優化分析與驗證,經優化方法之實驗結果證實,該架構可透過遷移學習,對 VCM 進行輕量的訓練,方可於特定任務上將整體效能提升 11.5% 至 14%,亦佐證該種模組化思維之結合方式對後續改良提供相當便捷與良善之空間。本研究成果對影像辨識領域的學術研究及應用提供不同思維角度和參考方案,期望於技術蓬勃發展的時代下探究多種可能性。 | zh_TW |
| dc.description.abstract | Human Action Recognition (HAR) is a challenging task due to the need to handle complex dynamic scenes, temporal sequences, and visual information during image analysis. Recently, image recognition techniques have transitioned from unimodal to cross-modal architectures, such as Vision-Language Models (VLMs). Previous research and improvements on cross-modal recognition models typically involve incorporating vast amounts of training data and extensive fine-tuning to enhance recognition performance. However, this process entails significant costs in terms of training time, financial resources, and equipment, which may hinder deeper exploration by researchers in this field. Moreover, achieving generalizability across different tasks or datasets remains a key objective in current research. Therefore, reducing the training costs of cross-modal action recognition models while enhancing their generalizability is a pressing issue.
This study proposes a novel cross-modal architecture aimed at achieving generalizable human action recognition while effectively reducing the hidden costs of developing such models. The method combines a mature Video Captioning Model (VCM) with a Large Language Model (LLM) and builds an effective link between the core capabilities of the two models through task-specific prompt engineering. Experimental results demonstrate that on the UCF-101 dataset, the proposed architecture achieves a zero-shot recognition accuracy of 73.4%, outperforming some existing cross-modal human action recognition models and reflecting its feasibility and development potential. Additionally, this study conducts optimization analysis and validation of the proposed architecture. Experimental results of the optimization methods confirm that the proposed architecture can improve overall performance by 11.5% to 14% on specific tasks through lightweight training of the VCM via transfer learning. This finding also supports the modular design as a convenient and effective basis for subsequent improvements. The outcomes of this study provide alternative perspectives and reference solutions for academic research and applications in the field of image recognition, with the hope of exploring various possibilities in an era of rapid technological development. (An illustrative sketch of the two-phase captioning-then-classification pipeline is given after the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-16T17:08:48Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-08-16T17:08:48Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 摘要 i
Abstract ii
目次 iv
圖次 vi
表次 vii
第一章 緒論 1
1.1 研究背景 1
1.2 研究動機 1
1.3 研究目的 3
1.4 全文架構說明 4
第二章 文獻探討 5
2.1 人物動作辨識任務 5
2.2 人物動作辨識資料集 6
2.2.1 實驗所使用之資料集—UCF-101 7
2.3 其他相關研究 9
2.3.1 視覺語言模型 9
2.3.2 影片描述模型 9
2.3.3 大型語言模型 11
2.3.4 提示工程 11
2.4 遷移學習與微調 12
第三章 研究方法及基礎架構效能探討 13
3.1 問題定義 13
3.2 研究方法與模型選擇介紹 14
3.2.1 影片描述模型—SwinBERT Model 15
3.2.2 GPT-4 15
3.3 基礎架構之流程設計 16
3.3.1 Phase I:Video Captioning 16
3.3.2 Phase II:Classification 17
3.4 初步實驗結果與討論 18
3.4.1 開發環境 18
3.4.2 初步效能 19
3.4.3 錯誤案例分析及 “Difficult-25” 20
3.4.4 優化分析 25
第四章 針對特定任務之效能優化方法與討論 27
4.1 針對特定任務之效能優化方法 27
4.1.1 優化方向選擇階段 27
4.1.2 資料前處理階段 28
4.1.3 遷移學習階段 30
4.2 優化方法之實驗結果與討論 31
4.2.1 針對特定任務之改良成效 31
4.2.2 所提出架構之綜合討論 35
第五章 結論與未來展望 36
5.1 結論 36
5.2 未來展望 36
REFERENCE 38
APPENDIX 42
Prompt detail 42 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 遷移學習 | zh_TW |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | 零樣本辨識 | zh_TW |
| dc.subject | 提示工程 | zh_TW |
| dc.subject | 人物動作辨識 | zh_TW |
| dc.subject | 影片描述模型 | zh_TW |
| dc.subject | Transfer Learning | en |
| dc.subject | Human Action Recognition | en |
| dc.subject | Prompt Engineering | en |
| dc.subject | Video Captioning Model | en |
| dc.subject | Large Language Model | en |
| dc.subject | Zero-Shot Recognition | en |
| dc.title | 結合跨模態模型與大型語言模型進行人物動作辨識:以 UCF-101 資料集為例 | zh_TW |
| dc.title | Combining Cross-Modal Models with Large Language Models for Human Action Recognition: The Case of the UCF-101 Dataset | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 丁肇隆;張恆華;李曉祺 | zh_TW |
| dc.contributor.oralexamcommittee | Chao-Lung Ting;Herng-Hua Chang;Hsiao-Chi Li | en |
| dc.subject.keyword | 人物動作辨識,提示工程,影片描述模型,大型語言模型,零樣本辨識,遷移學習, | zh_TW |
| dc.subject.keyword | Human Action Recognition,Prompt Engineering,Video Captioning Model,Large Language Model,Zero-Shot Recognition,Transfer Learning, | en |
| dc.relation.page | 42 | - |
| dc.identifier.doi | 10.6342/NTU202403933 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2024-08-11 | - |
| dc.contributor.author-college | 工學院 | - |
| dc.contributor.author-dept | 工程科學及海洋工程學系 | - |
| dc.date.embargo-lift | 2026-07-15 | - |
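
To make the two-phase architecture described in the abstract concrete, the sketch below walks through Phase I (video captioning) and Phase II (LLM-based classification) for the UCF-101 zero-shot setting. It is a minimal illustration under stated assumptions, not the thesis's code: the function names, the prompt wording, the truncated class list, and the stub models are invented for the example; only the caption-then-classify flow, the SwinBERT captioner, and the GPT-4 classifier come from the record itself (abstract and table of contents).

```python
"""Minimal sketch of the caption-then-classify pipeline (assumed names throughout)."""
from typing import Callable, Sequence

# Illustrative subset of labels; UCF-101 defines 101 action classes in total.
UCF101_CLASSES = ("ApplyEyeMakeup", "Basketball", "PlayingGuitar", "Typing")


def build_classification_prompt(caption: str, classes: Sequence[str]) -> str:
    """Phase II prompt: ask the LLM to map a caption onto exactly one class name."""
    return (
        "You are an action recognition assistant.\n"
        f'Video caption: "{caption}"\n'
        f"Candidate actions: {', '.join(classes)}\n"
        "Answer with exactly one candidate action name."
    )


def classify_video(
    video_path: str,
    caption_fn: Callable[[str], str],  # Phase I: video -> caption (e.g. a SwinBERT wrapper)
    llm_fn: Callable[[str], str],      # Phase II: prompt -> answer (e.g. a GPT-4 wrapper)
    classes: Sequence[str] = UCF101_CLASSES,
) -> str:
    """Run both phases and normalise the LLM's free-form answer onto the label set."""
    caption = caption_fn(video_path)                                # Phase I: Video Captioning
    answer = llm_fn(build_classification_prompt(caption, classes))  # Phase II: Classification
    for cls in classes:
        if cls.lower() in answer.lower():
            return cls
    return "Unknown"  # counted as an error in a zero-shot accuracy evaluation


if __name__ == "__main__":
    # Stub models so the sketch runs without a GPU or an API key.
    def demo_captioner(path: str) -> str:
        return "a man is playing an acoustic guitar while sitting on a sofa"

    def demo_llm(prompt: str) -> str:
        return "PlayingGuitar"

    print(classify_video("v_PlayingGuitar_g01_c01.avi", demo_captioner, demo_llm))
```

Passing the captioner and the LLM in as plain callables mirrors the modular design argued for in the abstract: either phase can be swapped or improved (for example, lightweight transfer learning of the VCM) without touching the other.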
| Appears in Collections: | 工程科學及海洋工程學系 | |
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (Restricted Access) | 3.32 MB | Adobe PDF | |
