  1. NTU Theses and Dissertations Repository
  2. College of Electrical Engineering and Computer Science (電機資訊學院)
  3. Graduate Institute of Communication Engineering (電信工程學研究所)
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98913
Title: 提示學習與選擇於弱監督視覺分析 (Prompt Learning and Selection for Weakly-Supervised Visual Analysis)
Authors: 林棋祥 (Ci-Siang Lin)
Advisor: 王鈺強 (Yu-Chiang Frank Wang)
Keywords: artificial intelligence, deep learning, computer vision, image, video
Publication Year: 2025
Degree: Doctoral (博士)
Abstract: With the rapid development of deep learning, several foundation models have been proposed to address fundamental vision and language tasks, and prompt learning has become a prevalent fine-tuning technique for adapting foundation models to downstream tasks. This dissertation aims to advance prompt learning and selection techniques for advanced visual analysis, including interpretable fine-grained recognition (Chapter 1), image semantic segmentation (Chapter 2), and referring video segmentation (Chapter 3). In Chapter 1, we achieve interpretable fine-grained recognition by learning a set of visual prompts that perform attention through a vision transformer and derive discriminative prototypes. In Chapter 2, we improve image semantic segmentation by learning textual background prompts from the CLIP model. Finally, in Chapter 3, our model learns to select the spatial-temporal prompts corresponding to a text query, addressing referring video segmentation based on SAM. Thanks to the rich knowledge embedded in these foundation models, all of the above tasks can be accomplished in a weakly-supervised manner, alleviating expensive annotation costs.
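The abstract's common thread is that only a small set of prompt parameters is trained while the foundation model stays frozen. As a rough, self-contained illustration of this general idea (a toy sketch, not the dissertation's actual models): learnable prompt embeddings are prepended to an input's token sequence, and gradient descent updates only those prompts while the "foundation model" (here, mean-pooling plus a fixed linear head) is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_prompts, n_classes = 8, 5, 2, 3

# Frozen "foundation model": a fixed linear head over mean-pooled tokens.
W = rng.normal(size=(n_classes, d))  # never updated

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(prompts, x):
    """Prepend prompts to the token sequence, mean-pool, apply frozen head."""
    tokens = np.vstack([prompts, x])
    pooled = tokens.mean(axis=0)
    return softmax(W @ pooled)

prompts = rng.normal(scale=0.1, size=(n_prompts, d))  # the ONLY trainable params
x = rng.normal(size=(n_tokens, d))  # token embeddings for one training input
y = 1                               # its (possibly weak) label

lr, n_total = 0.5, n_prompts + n_tokens
losses = []
for _ in range(200):
    p = forward(prompts, x)
    losses.append(-np.log(p[y]))          # cross-entropy loss
    dlogits = p.copy()
    dlogits[y] -= 1.0                     # d(cross-entropy)/d(logits)
    dpooled = W.T @ dlogits               # back through the frozen head
    # Each prompt row contributes 1/n_total to the mean-pooled vector.
    prompts -= lr * np.tile(dpooled / n_total, (n_prompts, 1))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Since the loss is convex in the pooled vector and the pooled vector is affine in the prompts, the loss decreases steadily even though the head `W` is never modified; the same division of labor (frozen backbone, trainable prompts) underlies the visual, textual, and spatial-temporal prompts discussed above.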
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98913
DOI: 10.6342/NTU202504082
Fulltext Rights: 未授權 (not authorized for public access)
Embargo lift date: N/A
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
File: ntu-113-2.pdf, 13.28 MB, Adobe PDF (Restricted Access)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

Contact Information
No. 1, Sec. 4, Roosevelt Rd., Da'an Dist., Taipei 10617, Taiwan (R.O.C.)
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved