Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60127
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成(Li-Chen Fu) | |
dc.contributor.author | Chia-Yuan Chang | en |
dc.contributor.author | 張家源 | zh_TW |
dc.date.accessioned | 2021-06-16T09:57:50Z | - |
dc.date.available | 2023-08-19 | |
dc.date.copyright | 2020-09-22 | |
dc.date.issued | 2020 | |
dc.date.submitted | 2020-08-16 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60127 | - |
dc.description.abstract | 近年來,深度卷積神經網絡(CNN)在密集分類問題如影像分割上有相當傑出的結果。語義分割主要是在輸入圖像上對各個像素進行分類;實例分割則是對所有前景物體進行分割並賦予特定遮罩,並區分出不同的物體和類別。全景分割可視為語義分割和實例分割相結合之研究,目標是為圖像中的每個像素預測對應之類別和可數物體的ID。當前的最新研究採用兩階段檢測器來檢測和分割前景對象,並在其上附加另一個網絡分支以進行語義分割。接著,他們將語義分割和實例分割之結果以演算法進行融合,得到全景分割結果。然而,這些研究並未考慮運算所花的時間。
本研究中,我們提出了一種高效的全景分割網絡,以快速的運算速度來解決全景分割任務。基本上,此研究是基於原型遮罩和遮罩係數的簡單線性組合來生成遮罩。語義分割以及實例分割的分支網路僅需要預測遮罩係數,並使用原型網絡分支預測的共享原型遮罩生成結果。此外,為了提高共享原型遮罩的品質,我們採用了一個稱為跨層級注意力融合模塊的模塊,該模塊以注意力機制融合多尺度特徵,從而捕捉彼此之間的長距離關聯性。為了驗證此項研究,我們對具有挑戰性的COCO全景數據集進行了各種實驗。實驗結果中,本研究在GPU上以約51毫秒的快速運算速度得到了極具競爭力的結果。同時,我們以38.9%的PQ勝過所有一階段網路之方法。 | zh_TW
dc.description.abstract | Recently, deep convolutional neural networks (CNNs) have shown outstanding performance in dense classification problems such as segmentation tasks. Semantic segmentation aims to provide a pixel-wise classification of the input image, while instance segmentation segments all foreground objects and distinguishes different object instances. Panoptic segmentation unifies semantic segmentation and instance segmentation into a single scene parsing task, which aims to assign a semantic label and an instance ID to every pixel in an image. The current state-of-the-art studies adopt a two-stage detector which detects and segments the foreground objects, and then attach another network branch to it for semantic segmentation. The two results are then fused into a panoptic segmentation by heuristic merging. However, these studies pay little attention to inference time.
In this work, we propose an Efficient Panoptic Segmentation Network (EPSNet) to tackle the panoptic segmentation task with fast inference speed. Basically, EPSNet generates masks as a simple linear combination of prototype masks and mask coefficients. The light-weight network branches for instance segmentation and semantic segmentation only need to predict mask coefficients, producing masks from the shared prototypes predicted by the prototype network branch. Furthermore, to enhance the quality of the shared prototypes, we adopt a cross-layer attention fusion module, which aggregates multi-scale features with an attention mechanism, helping them capture long-range dependencies between each other. To validate the proposed work, we conducted various experiments on the challenging COCO panoptic dataset. The experimental results show that EPSNet achieves highly promising performance with significantly faster inference speed (51 ms on a GPU). EPSNet also outperforms all one-stage methods with 38.9% PQ. | en
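The mask-assembly step the abstract describes — instance and semantic masks produced as a linear combination of shared prototype masks weighted by predicted coefficients — can be sketched as follows. This is an illustrative sketch only (in the style of YOLACT-like prototype models), not the thesis code; the array shapes, the prototype count, and the sigmoid activation are assumptions.

```python
import numpy as np

def assemble_masks(prototypes, coefficients):
    """Combine shared prototype masks with per-instance mask coefficients.

    prototypes:   (H, W, k) array of k prototype masks shared by all heads
    coefficients: (n, k) array, one k-vector of coefficients per instance
    returns:      (n, H, W) array of soft masks in (0, 1)
    """
    # Linear combination over the prototype dimension, then a sigmoid
    # to squash each per-pixel logit into a soft mask value.
    logits = np.einsum('hwk,nk->nhw', prototypes, coefficients)
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: 4 prototypes on an 8x8 grid, 2 predicted instances.
rng = np.random.default_rng(0)
protos = rng.standard_normal((8, 8, 4))
coeffs = rng.standard_normal((2, 4))
masks = assemble_masks(protos, coeffs)
print(masks.shape)  # (2, 8, 8)
```

Because the prototypes are computed once and shared, each additional head only adds the cost of predicting a small coefficient vector, which is consistent with the fast inference the abstract reports.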
dc.description.provenance | Made available in DSpace on 2021-06-16T09:57:50Z (GMT). No. of bitstreams: 1 U0001-1308202013330200.pdf: 15783752 bytes, checksum: a9001b8a96023865fdcd3e4f86c2a287 (MD5) Previous issue date: 2020 | en |
dc.description.tableofcontents | 口試委員會審定書 i; 誌謝 ii; 中文摘要 iii; ABSTRACT iv; CONTENTS v; LIST OF FIGURES viii; LIST OF TABLES x; 1 Introduction 1; 1.1 Motivation 1; 1.2 Related Work 6; 1.2.1 Detection-based 6; 1.2.2 Bottom-up 7; 1.3 Contribution 8; 1.4 Thesis Organization 9; 2 Preliminaries 11; 2.1 Convolutional Neural Network 11; 2.1.1 Convolutional Layers 12; 2.1.2 Pooling Layers 16; 2.1.3 Activation Function 18; 2.1.4 Up-Sampling Layers 19; 2.1.5 Optimizer 19; 2.1.6 Alex-Net 20; 2.1.7 Residual-Net 21; 2.2 Object Detection Frameworks 23; 2.2.1 YOLO 23; 2.2.2 Faster R-CNN 24; 2.3 Semantic Segmentation Frameworks 25; 2.3.1 SegNet 25; 2.4 Instance Segmentation Frameworks 27; 2.4.1 Mask R-CNN 27; 2.5 Attention Mechanism 28; 3 Efficient Panoptic Segmentation Network 31; 3.1 Efficient Panoptic Segmentation Network 31; 3.1.1 System Overview 32; 3.1.2 Backbone 33; 3.1.3 Protohead 34; 3.1.4 Instance Segmentation Head 35; 3.1.5 Semantic Segmentation Head 37; 3.1.6 Cross-layer Attention Fusion 38; 3.2 Training and Inference 41; 3.2.1 Loss Function 41; 3.2.2 Panoptic Inference 43; 4 Experiments 45; 4.1 Experimental Setup 45; 4.1.1 Datasets 45; 4.1.2 Metrics 48; 4.1.3 Implementation Details 50; 4.2 Ablation Study 51; 4.2.1 Data Augmentation 52; 4.2.2 Cross-layer Attention Fusion 52; 4.2.3 Loss Balance 53; 4.2.4 Prototypes 53; 4.3 Analysis of Semantic Head 53; 4.4 Analysis of Cross-layer Attention Fusion Module 57; 4.5 Comparison with Other Methods on COCO 59; 4.6 Qualitative Results 60; 5 Conclusion 63; REFERENCE 64 | |
dc.language.iso | en | |
dc.title | 結合跨層級注意力特徵融合之高效率全景分割系統 | zh_TW |
dc.title | Efficient Panoptic Segmentation Network with Cross-layer Attention Fusion | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-2 | |
dc.description.degree | 碩士 | |
dc.contributor.author-orcid | 0000-0001-5841-3455 | |
dc.contributor.coadvisor | 蕭培鏞(Pei-Yung Hsiao) | |
dc.contributor.oralexamcommittee | 黃世勳(Shih-Shinh Huang),方瓊瑤(Chiung-Yao Fang),傅楸善(Chiou-Shann Fuh) | |
dc.subject.keyword | 實例分割,語義分割,影像分割,全景分割, | zh_TW |
dc.subject.keyword | Instance segmentation,Semantic segmentation,Panoptic segmentation,Scene understanding | en |
dc.relation.page | 72 | |
dc.identifier.doi | 10.6342/NTU202003237 | |
dc.rights.note | 有償授權 | |
dc.date.accepted | 2020-08-17 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
Appears in Collections: | 資訊工程學系 (Department of Computer Science and Information Engineering)
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-1308202013330200.pdf (currently restricted; no public access authorized) | 15.41 MB | Adobe PDF |
Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.