Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88414

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳家麟 | zh_TW |
| dc.contributor.advisor | Ja-Ling Wu | en |
| dc.contributor.author | 林家毅 | zh_TW |
| dc.contributor.author | Chia-Yi Lin | en |
| dc.date.accessioned | 2023-08-15T16:11:25Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-08-15 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-07-26 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88414 | - |
| dc.description.abstract | 在本文中,我們提出一種輕量級方法,透過預訓練視覺語言模型(pre-trained vision-language model)來增強即時語義分割(real-time semantic segmentation)。我們的方法將CLIP文本編碼器(text encoder)與語義分割模型相結合,有效地將文本知識傳遞給分割模型。我們的框架整合了圖像和文本嵌入(image and text embeddings),使視覺和文本資訊可以相互整合。同時,我們引入了可學習的提示嵌入(learnable prompt embedding),以捕捉特定類別(class-specific)的資訊並提升模型對語義的理解能力。為了確保訓練效果,我們設計了一種兩階段的訓練流程,允許語義分割模型在第一階段從固定的文本嵌入中學習,並在第二階段優化提示嵌入。通過實驗和消融研究,我們驗證了這種方法能夠顯著提升即時語義分割模型的性能。 | zh_TW |
| dc.description.abstract | In this paper, we present a lightweight method to enhance real-time semantic segmentation models by leveraging the power of pre-trained vision-language models. Our approach incorporates the CLIP text encoder, which provides rich semantic embeddings for text labels, and effectively transfers this textual knowledge to the segmentation model. The proposed framework integrates image and text embeddings, enabling alignment of visual and textual information. In addition, we introduce learnable prompt embeddings to capture class-specific information and enhance the model's semantic understanding. To ensure effective learning, we devise a two-stage training procedure that allows the segmentation backbone to learn from fixed text embeddings in the first stage and optimize the prompt embeddings in the second stage. Extensive experiments and ablation studies demonstrate that our method significantly improves the performance of real-time semantic segmentation models. (An illustrative sketch of this pipeline appears after the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T16:11:25Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-08-15T16:11:25Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 iii
Abstract iv
Contents v
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Semantic Segmentation 5
2.2 Real-Time Semantic Segmentation 6
2.3 Pre-Trained Vision-Language Model 6
2.4 Prompt Tuning 7
2.5 Semantic Segmentation with Pre-Trained Vision-Language Model 8
Chapter 3 Methodology 9
3.1 Preliminaries: CLIP 9
3.2 Model Architectures 11
3.2.1 Lightweight Segmentation Module (LSM) 11
3.2.2 Textual Guidance Module (TGM) 12
3.2.3 Prompt Optimization Module (POM) 13
3.3 Two-Stage Training Procedure 15
3.3.1 Loss Function 15
3.3.2 First Training Stage 16
3.3.3 Second Training Stage 16
3.4 Inference 17
Chapter 4 Experiments 20
4.1 Experimental Settings 20
4.1.1 Datasets 20
4.1.2 Implementation Details 21
4.2 Main Results 22
4.2.1 Results on Cityscapes 22
4.2.2 Results on CamVid 22
4.2.3 Qualitative Results 23
4.3 Ablation Studies 24
4.3.1 Effectiveness of Two-Stage Training Procedure 24
4.3.2 Number of Learnable Prompt Embeddings 26
4.3.3 Regularization of Prompt Embedding Space 27
Chapter 5 Discussion 29
5.1 Limitations 29
5.2 Future Research Directions 30
Chapter 6 Conclusion 32
References 34 | - |
| dc.language.iso | en | - |
| dc.subject | 兩階段訓練 | zh_TW |
| dc.subject | 語義分割 | zh_TW |
| dc.subject | 即時 | zh_TW |
| dc.subject | 預訓練視覺-語言模型 | zh_TW |
| dc.subject | 文本知識 | zh_TW |
| dc.subject | 提示調整 | zh_TW |
| dc.subject | Real-Time | en |
| dc.subject | Semantic Segmentation | en |
| dc.subject | Two-Stage Training | en |
| dc.subject | Prompt Tuning | en |
| dc.subject | Textual Knowledge | en |
| dc.subject | Pre-Trained Vision-Language Model | en |
| dc.title | 透過預訓練視覺-語言模型之文本知識增強即時語義分割:一種輕量級方法 | zh_TW |
| dc.title | Enhancing Real-Time Semantic Segmentation with Textual Knowledge of Pre-Trained Vision-Language Model: A Lightweight Approach | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | Master | - |
| dc.contributor.oralexamcommittee | 許超雲;陳文進 | zh_TW |
| dc.contributor.oralexamcommittee | Chau-Yun Hsu;Wen-Chin Chen | en |
| dc.subject.keyword | 語義分割, 即時, 預訓練視覺-語言模型, 文本知識, 提示調整, 兩階段訓練 | zh_TW |
| dc.subject.keyword | Semantic Segmentation, Real-Time, Pre-Trained Vision-Language Model, Textual Knowledge, Prompt Tuning, Two-Stage Training | en |
| dc.relation.page | 40 | - |
| dc.identifier.doi | 10.6342/NTU202302018 | - |
| dc.rights.note | Not authorized | - |
| dc.date.accepted | 2023-07-28 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
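The abstract above describes a pipeline in which fixed CLIP text embeddings supervise a lightweight segmentation backbone in a first training stage, after which learnable prompt embeddings are tuned in a second stage. The PyTorch sketch below is a minimal illustration of that idea, not the thesis implementation: every module, shape, and hyperparameter here (`LightweightSegModel`, `EMBED_DIM`, the random stand-ins for the CLIP text embeddings, the learning rates) is a hypothetical assumption.

```python
# Hypothetical sketch of pixel-text alignment with a two-stage schedule,
# loosely following the abstract above. All modules, shapes, and
# hyperparameters are illustrative assumptions, not the thesis code.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES, EMBED_DIM = 19, 512  # e.g. Cityscapes classes, CLIP embedding size

class LightweightSegModel(nn.Module):
    """Stand-in for a real-time backbone that emits per-pixel embeddings."""
    def __init__(self, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)  # (B, EMBED_DIM, H/4, W/4)

def seg_logits(pixel_feats, class_embeds, tau=0.07):
    """Per-pixel class scores as cosine similarity to each class embedding."""
    feats = F.normalize(pixel_feats, dim=1)  # (B, D, H, W)
    return torch.einsum("bdhw,cd->bchw", feats, class_embeds) / tau

# Fixed per-class text embeddings; in the real system these would come from
# the frozen CLIP text encoder, here they are random stand-ins.
text_embeds = F.normalize(torch.randn(NUM_CLASSES, EMBED_DIM), dim=-1)

model = LightweightSegModel()
imgs = torch.randn(2, 3, 64, 64)                     # dummy batch
labels = torch.randint(0, NUM_CLASSES, (2, 16, 16))  # dummy ground truth

# Stage 1: train the segmentation backbone against the fixed text embeddings.
opt1 = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt1.zero_grad()
loss1 = F.cross_entropy(seg_logits(model(imgs), text_embeds), labels)
loss1.backward()
opt1.step()

# Stage 2: hold the backbone features fixed and optimize a small set of
# learnable prompt offsets added to each class embedding.
prompt_embeds = nn.Parameter(torch.zeros(NUM_CLASSES, EMBED_DIM))
opt2 = torch.optim.SGD([prompt_embeds], lr=0.001)
opt2.zero_grad()
tuned = F.normalize(text_embeds + prompt_embeds, dim=-1)
loss2 = F.cross_entropy(seg_logits(model(imgs).detach(), tuned), labels)
loss2.backward()
opt2.step()
```

One appeal of this kind of design is that the text encoder can be run once offline and its (tuned) class embeddings cached, so no text-side computation is needed at inference time, which is consistent with the lightweight, real-time motivation stated in the abstract.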
Appears in Collections: Department of Computer Science and Information Engineering
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-111-2.pdf (not authorized for public access) | 5.72 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
