NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71493
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 陳良基 (Liang-Gee Chen)
dc.contributor.author: Keng-Chi Liu [en]
dc.contributor.author: 劉庚錡 [zh_TW]
dc.date.accessioned: 2021-06-17T06:01:47Z
dc.date.available: 2019-02-12
dc.date.copyright: 2019-02-12
dc.date.issued: 2019
dc.date.submitted: 2019-01-31
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71493
dc.description.abstract [zh_TW]: Developing devices that can perform human-like behaviors based on visual perception is a goal of artificial intelligence, and pixel-level visual cues such as scene parsing are beneficial to such applications. In recent years these tasks have advanced substantially thanks to deep learning; nevertheless, efficiency remains a major issue, where "efficiency" refers to both data collection and computational resource requirements.
Although supervised methods produce remarkable results, they depend on large-scale pixel-level annotations that are time-consuming and expensive to obtain, so reducing this heavy manual effort is a key issue during training. Synthetic data and weakly supervised methods have been proposed to overcome this challenge; unfortunately, the former suffers from severe domain shift while the latter lacks accurate object-boundary information, and most existing weakly supervised work can handle only salient foreground "things". To address this, we propose an auxiliary teacher-student learning framework that trains this hard-to-transfer task using information constructed by adapting auxiliary cues with lower domain discrepancy (e.g., depth) and domain-specific weak annotations (e.g., real appearance); this imperfect information is then integrated effectively through a two-stage voting mechanism we develop.
From the inference-stage perspective, the demand for computational resources has always been the dominant concern: a typical neural network requires large run-time memory and 32-bit floating-point computation. Moreover, unlike classification networks that output only a few categories, the output here must correspond to the input in both dimension and position, which consumes more resources and may not be amenable to the optimization methods in the existing literature, most of which still target classification networks.
In this thesis, considering the practicality and necessity of real-world applications, we aim to design an efficient scene parsing algorithm that jointly accounts for annotation demand, computational complexity, and performance. First, by introducing min-max normalization into the loss function, depth information reduces the domain discrepancy of indoor scenes; in addition, we achieve unsupervised sensor depth-map restoration with a real-to-synthetic reconstruction generator. Second, we propose a scene parsing framework based on auxiliary teacher-student learning with depth adaptation and domain-specific weak supervision, and we train the network with a loss function built on a two-stage integration mechanism to produce more accurate results. The proposed method outperforms the state-of-the-art adaptation method by 14.63% in mean Intersection over Union (mIoU). Finally, we introduce a quantization method for the efficient scene parsing framework that reduces the model size by a factor of 21.9 and the activation size by a factor of 8.2 with only a 1.8% loss in mIoU.
dc.description.abstract [en]: Developing autonomous mobile agents that can perform human-like behaviors based on their visual perception is a goal in the field of artificial intelligence, and pixel-wise visual cues such as scene parsing are beneficial to such high-level applications. Significant improvements in these tasks have been made in recent years owing to the evolution of deep learning. Nevertheless, in addition to accuracy, efficiency remains a major issue. The term "efficiency" here refers to both data collection and computational complexity.
Remarkable scene parsing results from supervised methods rely on numerous pixel-level annotations, which are time-consuming and expensive to obtain. Alleviating this cumbersome manual effort is therefore a crucial issue in the training procedure. Synthetic rendered data and weakly supervised methods have been explored to overcome this challenge; unfortunately, the former suffers from severe domain shift and the latter provides only imprecise information. Moreover, the majority of existing weak-supervision research can handle only salient foreground "things". To address these issues, we employ an auxiliary teacher-student learning framework that trains this hard-to-transfer task with pseudo ground truths constructed by adapting auxiliary cues with lower domain discrepancy (e.g., depth) and leveraging domain-specific information (e.g., real appearance) in weak form. This imperfect information is then integrated effectively by a two-stage voting mechanism we develop.
From the inference-phase perspective, computational complexity has always been the main issue for edge computing. A typical network requires large run-time memory and 32-bit floating-point computation. Furthermore, unlike general classification networks with only a few category outputs, the hourglass network produces an output of the same size and dimension as the input, which costs more resources. Most previous research, however, has focused on classification networks.
In this thesis, considering the practicality and necessity of real-world applications, our goal is to develop an "efficient" scene parsing algorithm with a focus on three objectives: labeling effort, complexity, and performance. First, it is shown that, by introducing min-max normalization into the loss function, depth reduces the domain discrepancy of indoor scenes. Additionally, we argue that the generator for real-to-sim reconstruction is capable of performing unsupervised sensor depth-map restoration. Second, a scene parsing framework is proposed that performs auxiliary teacher-student learning with depth adaptation as well as domain-specific weak-supervision information. We train a network with a loss function that penalizes predictions disagreeing with the highly confident pseudo ground truths provided by a two-stage integration mechanism, so as to produce more accurate segmentations. The proposed method outperforms the state-of-the-art adaptation method by 14.63% in terms of mean Intersection over Union (mIoU). Lastly, we extend an existing method to quantize the target lightweight scene parsing network into ternary weights and low bit-width activations (3-4 bits), reducing the model size by a factor of 21.9 and the activation size by a factor of 8.2 with only a 1.8% mIoU loss.
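The abstract above describes constructing pseudo ground truths by fusing a depth-adapted teacher's predictions with a domain-specific weak cue through a two-stage voting mechanism. The Python sketch below illustrates one plausible form of such an integration; the function names, the 0.8 confidence threshold, and the placement of the min-max normalization on the depth input are illustrative assumptions, not details taken from the thesis.

import numpy as np

IGNORE = 255  # label value excluded from the student's training loss

def min_max_normalize(depth):
    """Illustrative min-max normalization of a depth map to [0, 1]."""
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / max(d_max - d_min, 1e-8)

def two_stage_vote(teacher_prob, weak_prob, tau_conf=0.8):
    """Two-stage integration of two imperfect per-pixel cues.

    teacher_prob, weak_prob: arrays of shape (C, H, W) holding class
    probabilities from a depth-adapted teacher and a domain-specific
    weak-localization cue.
    Stage 1 keeps pixels where both cues vote for the same class;
    stage 2 keeps only the highly confident ones. Everything else is
    marked IGNORE so it does not penalize the student network.
    """
    teacher_label = teacher_prob.argmax(axis=0)
    weak_label = weak_prob.argmax(axis=0)

    agree = teacher_label == weak_label                               # stage 1: agreement vote
    fused_conf = 0.5 * (teacher_prob.max(axis=0) + weak_prob.max(axis=0))
    confident = fused_conf >= tau_conf                                # stage 2: confidence filter

    pseudo_gt = np.full(teacher_label.shape, IGNORE, dtype=np.int64)
    keep = agree & confident
    pseudo_gt[keep] = teacher_label[keep]
    return pseudo_gt

A student network could then be trained with a cross-entropy loss whose ignore index is set to IGNORE, so that only confident, agreed-upon pixels contribute to the penalty, in line with the abstract's description of penalizing disagreement with highly confident pseudo ground truths.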
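The reported 14.63% gain is measured in mean Intersection over Union (mIoU), the standard scene parsing metric. Below is a minimal NumPy sketch of mIoU over dense label maps; the ignore-label convention and per-class averaging follow common practice and are assumptions rather than the thesis's exact evaluation code.

import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean Intersection over Union over classes present in prediction or ground truth."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0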
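The final contribution quantizes the lightweight network to ternary weights and 3-4 bit activations. The sketch below shows threshold-based weight ternarization in the spirit of ternary weight networks together with a uniform k-bit activation quantizer; the 0.7 threshold factor and the [0, 1] clipping range are common choices from the literature and are assumptions here, not the thesis's exact scheme.

import numpy as np

def ternarize_weights(w):
    """Map weights to {-alpha, 0, +alpha} with a mean-magnitude threshold."""
    delta = 0.7 * np.mean(np.abs(w))            # threshold: fraction of mean |w|
    mask = np.abs(w) > delta                    # weights that survive as +/- alpha
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

def quantize_activations(a, bits=4):
    """Uniform k-bit quantization of activations clipped to [0, 1]."""
    a = np.clip(a, 0.0, 1.0)
    levels = 2 ** bits - 1
    return np.round(a * levels) / levels

Storing ternary weights needs only about two bits per value plus one scale per layer, and 3-4 bit activations shrink run-time feature maps accordingly, which is roughly where size reductions of this kind come from.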
dc.description.provenance [en]: Made available in DSpace on 2021-06-17T06:01:47Z (GMT). No. of bitstreams: 1. ntu-108-R05943002-1.pdf: 43584334 bytes, checksum: 3d74d3cd31a1b4af891609dac7faa0da (MD5). Previous issue date: 2019.
dc.description.tableofcontents:
Abstract
1 Introduction
1.1 Introduction
1.2 Design Consideration
1.3 Main Contributions
1.4 Thesis Organization
2 Background
2.1 Overview
2.2 Accuracy-oriented Advancement
2.3 Annotation Effort Alleviation
2.3.1 Weak Supervision
2.3.2 Unsupervised Domain Adaptation
2.3.3 Summary
2.4 Computational Complexity Reduction
2.4.1 Network Structure
2.4.2 Hardware-Friendly Methods for Classification
3 The Proposed Scene Parsing Framework
3.1 Overview
3.2 Proposed Scene Parsing Algorithm
3.2.1 Depth-aware Adaptation
3.2.2 Domain-specific Weak Localization
3.2.3 Mechanism for Cues Integration
3.2.4 Training of Student Network
3.3 Experiment
3.3.1 Dataset
3.3.2 Implementation Detail
3.3.3 Evaluation Metrics
3.3.4 Ablation Study
3.3.5 Result
3.3.6 Discussion
3.4 Unsupervised Depth Restoration via Adaptation and RANSAC Scale Recovering
3.4.1 Scale Recovering
3.4.2 Dataset
3.4.3 Evaluation Metrics
3.4.4 Ablation Study
3.4.5 Result
3.5 Summary
4 Hardware Oriented Design and Analysis
4.1 Overview
4.2 Network Structure
4.3 Related Quantization Methods
4.3.1 Low bit-width Quantization
4.3.2 Binarization
4.3.3 Ternarization
4.4 Proposed Quantization Method
4.5 Experiment
4.6 Bandwidth Issue Discussion
4.7 Summary
5 Conclusion
Bibliography
dc.language.iso: en
dc.subject: 場景解析 (Scene parsing) [zh_TW]
dc.subject: 自適應 (Adaptation) [zh_TW]
dc.subject: 弱監督 (Weak supervision) [zh_TW]
dc.subject: 領域差異 (Domain discrepancy) [zh_TW]
dc.subject: 效率 (Efficiency) [zh_TW]
dc.subject: Domain discrepancy [en]
dc.subject: Adaptation [en]
dc.subject: Weak supervision [en]
dc.subject: Scene Parsing [en]
dc.subject: Efficiency [en]
dc.title: 以低差異領域自適應及特定領域弱注釋實現高效能室內場景解析 [zh_TW]
dc.title: Low Discrepancy Adaptation with Weak Domain-specific Annotations for Efficient Indoor Scene Parsing [en]
dc.type: Thesis
dc.date.schoolyear: 107-1
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 簡韶逸 (Shao-Yi Chien), 楊佳玲 (Chia-Lin Yang), 徐宏民 (Hung-Min Hsu)
dc.subject.keyword: 場景解析, 自適應, 弱監督, 領域差異, 效率 [zh_TW]
dc.subject.keyword: Scene Parsing, Adaptation, Weak supervision, Domain discrepancy, Efficiency [en]
dc.relation.page: 100
dc.identifier.doi: 10.6342/NTU201900349
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2019-01-31
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 電子工程學研究所 (Graduate Institute of Electronics Engineering) [zh_TW]
Appears in collections: 電子工程學研究所 (Graduate Institute of Electronics Engineering)

Files in this item:
File | Size | Format
ntu-108-1.pdf (not authorized for public access) | 42.56 MB | Adobe PDF