Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98913

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 王鈺強 | zh_TW |
| dc.contributor.advisor | Yu-Chiang Frank Wang | en |
| dc.contributor.author | 林棋祥 | zh_TW |
| dc.contributor.author | Ci-Siang Lin | en |
| dc.date.accessioned | 2025-08-20T16:15:50Z | - |
| dc.date.available | 2025-08-21 | - |
| dc.date.copyright | 2025-08-20 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-12 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98913 | - |
| dc.description.abstract | 當前深度學習的快速發展促使多種基礎模型被提出,用以解決視覺與語言的基本任務,而提示學習成為將基礎模型適應下游任務的一種主流微調技術。本論文旨在推進提示學習與選擇技術,以實現高級的視覺分析,包括可解釋的細粒度識別(第 1 章)、圖像語義分割(第 2 章)以及指向式影片分割(第 3 章)。在第 1 章中,我們通過學習一組視覺提示,利用視覺轉換器進行注意力機制並提取具辨識性的原型,實現了可解釋的細粒度識別。在第 2 章中,我們通過從 CLIP 模型中學習文本背景提示來提升圖像語義分割效果。最後,在第 3 章中,我們的模型能夠根據文本查詢選擇對應的時空提示,從而基於 SAM 實現指向式影片分割。得益於這些基礎模型所學到的豐富知識,以上任務都能以弱監督方式完成,減少了高昂的標註成本。 | zh_TW |
| dc.description.abstract | With the rapid development of deep learning, several foundation models have been proposed to address fundamental vision and language tasks, and prompt learning has become a prevalent fine-tuning technique for adapting foundation models to downstream tasks. In this dissertation, we aim to advance prompt learning and selection techniques for advanced visual analysis, including interpretable fine-grained recognition (Chapter 1), image semantic segmentation (Chapter 2), and referring video segmentation (Chapter 3). In Chapter 1, we achieve interpretable fine-grained recognition by learning a set of visual prompts to perform attention within a vision transformer and derive discriminative prototypes. In Chapter 2, we enhance image semantic segmentation by learning textual background prompts from the CLIP model. Lastly, in Chapter 3, our model learns to select the spatial-temporal prompts corresponding to a text query, addressing referring video segmentation based on SAM. Thanks to the rich knowledge learned by these foundation models, the above tasks can be accomplished in a weakly-supervised manner, alleviating expensive annotation costs. (A minimal illustrative sketch of this prompt-learning idea follows the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:15:50Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-20T16:15:50Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 誌謝 (Acknowledgements) i
中文摘要 (Chinese Abstract) iii
Abstract v
Contents vii
List of Figures xi
List of Tables xv
Chapter 1 Visual Prompt Learning for Interpretable Fine-grained Recognition 1
1.1 Introduction 2
1.2 Related Works 6
1.2.1 Explainable Artificial Intelligence (XAI) 6
1.2.2 Fine-Grained Recognition 7
1.3 Methodology 8
1.3.1 Notations and Overview 8
1.3.2 Descriptive Prototype Learning 9
1.3.3 Discriminative Prototype Discovery 12
1.4 Experiments 15
1.4.1 Datasets 15
1.4.2 Implementation Details 17
1.4.3 Interpreting Descriptive Prototypes 18
1.4.4 Discriminative Prototypes for Visual Reasoning 22
1.4.5 Ablation Studies 24
1.5 Conclusion 26
Chapter 2 Textual Prompt Learning for Image Semantic Segmentation 29
2.1 Introduction 30
2.2 Related Works 33
2.2.1 Weakly-Supervised Semantic Segmentation 33
2.2.2 CLIP-based Semantic Segmentation 34
2.2.3 Prompt Learning 35
2.3 Proposed Method 36
2.3.1 Problem Formulation and Model Overview 36
2.3.2 Semantic Prompt Learning for WSSS 37
2.3.2.1 Segment-Label Matching 37
2.3.2.2 Contrastive Prompt Learning 38
2.3.2.3 Prompt-guided Semantic Refinement 40
2.4 Experiments 41
2.4.1 Datasets and Evaluation Metrics 41
2.4.2 Implementation Details 41
2.4.3 Quantitative Comparisons 43
2.4.4 Qualitative Comparisons 46
2.4.5 Ablation Studies 47
2.5 Conclusion 49
Chapter 3 Spatial-Temporal Prompt Selection for Referring Video Segmentation 51
3.1 Introduction 52
3.2 Related Works 56
3.2.1 Referring Image/Video Segmentation 56
3.2.2 Foundation Segmentation Models 57
3.3 Proposed Framework: TENET 58
3.3.1 Problem Definition and Method Overview 58
3.3.2 Temporal Prompt Generation and Analysis 60
3.3.3 Prompt Preference Learning 64
3.4 Experiments 66
3.4.1 Datasets and Implementation Details 66
3.4.2 Quantitative and Qualitative Results 66
3.4.3 Ablation Studies 68
3.5 Conclusion 69
Chapter 4 Conclusion 71
References 73 | - |
| dc.language.iso | en | - |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 人工智慧 | zh_TW |
| dc.subject | 影片 | zh_TW |
| dc.subject | 圖像 | zh_TW |
| dc.subject | 電腦視覺 | zh_TW |
| dc.subject | computer vision | en |
| dc.subject | image | en |
| dc.subject | video | en |
| dc.subject | deep learning | en |
| dc.subject | artificial intelligence | en |
| dc.title | 提示學習與選擇於弱監督視覺分析 | zh_TW |
| dc.title | Prompt Learning and Selection for Weakly-Supervised Visual Analysis | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 博士 (Doctoral) | - |
| dc.contributor.oralexamcommittee | 莊永裕;賴尚宏;邱維辰;陳駿丞 | zh_TW |
| dc.contributor.oralexamcommittee | Yung-Yu Chuang;Shang-Hong Lai;Wei-Chen Chiu;Jun-Cheng Chen | en |
| dc.subject.keyword | 人工智慧,深度學習,電腦視覺,圖像,影片 | zh_TW |
| dc.subject.keyword | artificial intelligence, deep learning, computer vision, image, video | en |
| dc.relation.page | 99 | - |
| dc.identifier.doi | 10.6342/NTU202504082 | - |
| dc.rights.note | 未授權 (Not authorized) | - |
| dc.date.accepted | 2025-08-14 | - |
| dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | - |
| dc.contributor.author-dept | 電信工程學研究所 (Graduate Institute of Communication Engineering) | - |
| dc.date.embargo-lift | N/A | - |
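As a companion to the abstract above, the following is a minimal sketch of how textual prompt learning adapts a frozen CLIP-like model: only a small set of context vectors is optimized while the encoders stay fixed. It is illustrative only and not the dissertation's implementation; `PromptLearner`, `encode_text`, and all dimensions are assumptions, and the toy mean-pooling encoder stands in for a real frozen CLIP text encoder.

```python
# A minimal, generic sketch of CoOp-style textual prompt learning (not the
# dissertation's code). All names, dimensions, and the toy encoder are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, num_classes: int, num_ctx: int = 4, dim: int = 512):
        super().__init__()
        # Learnable context tokens, shared across all classes.
        self.ctx = nn.Parameter(0.02 * torch.randn(num_ctx, dim))
        # Frozen per-class name embeddings (stand-in for tokenized class names).
        self.register_buffer("cls_emb", torch.randn(num_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        # Prepend the shared context to each class's token embeddings:
        # result has shape (num_classes, num_ctx + 1, dim).
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)

def encode_text(tokens: torch.Tensor) -> torch.Tensor:
    # Toy "text encoder": mean-pool tokens and L2-normalize. A real CLIP
    # text encoder would be a frozen transformer.
    return F.normalize(tokens.mean(dim=1), dim=-1)

learner = PromptLearner(num_classes=20)
optimizer = torch.optim.AdamW(learner.parameters(), lr=1e-3)

# Frozen, L2-normalized image features for a toy batch of 8 samples.
image_feats = F.normalize(torch.randn(8, 512), dim=-1)
labels = torch.randint(0, 20, (8,))

optimizer.zero_grad()
# CLIP-style scaled cosine-similarity logits between images and classes.
logits = 100.0 * image_feats @ encode_text(learner()).t()
loss = F.cross_entropy(logits, labels)
loss.backward()   # gradients flow only into the learnable context vectors
optimizer.step()
```

Because gradients reach only the context vectors, adaptation touches a tiny fraction of the parameters, which is what makes prompt learning an inexpensive way to specialize foundation models for downstream tasks such as the weakly-supervised settings studied in this dissertation.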
Appears in Collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted; not authorized for public access) | 13.28 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
