Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7266
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 林宗男(Tsung-Nan Lin) | |
dc.contributor.author | Wen-Yu Chang | en |
dc.contributor.author | 張文于 | zh_TW |
dc.date.accessioned | 2021-05-19T17:40:46Z | - |
dc.date.available | 2024-08-05 | |
dc.date.available | 2021-05-19T17:40:46Z | - |
dc.date.copyright | 2019-08-05 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-07-31 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7266 | - |
dc.description.abstract | 梯度爆炸/消失,一直被認為是訓練深層神經網路的一大挑戰。在這篇論文裡,我們發現一種被稱為「神經元消失 (Vanishing Nodes)」的新現象同樣也會使訓練更加困難。當神經網路的深度增加,神經元彼此之間會呈現高度相關。這種行為會導致神經元之間的相似程度提高。也就是隨著神經網路變深,網路內的神經元冗餘程度會提高。我們把這個問題稱為「神經元消失 (Vanishing Nodes)」。可以藉由神經網路的相關參數來對神經元消失的程度做推算;結果可以得出神經元消失的程度與網路深度成正比、與網路寬度成反比。從數值分析的結果呈現出:在反向傳播算法的訓練下,神經元消失的現象會變得更明顯。我們也提出:神經元消失是除了梯度爆炸/消失以外,訓練深層神經網路的另一道難關。 | zh_TW |
dc.description.abstract | It is well known that vanishing/exploding gradients pose a challenge when training deep networks. In this thesis, we show that another phenomenon, called vanishing nodes, also increases the difficulty of training deep neural networks. As the depth of a neural network increases, its hidden nodes exhibit increasingly correlated behavior, which makes these nodes highly similar to one another; the redundancy of hidden nodes therefore grows as the network becomes deeper. We call this problem 'Vanishing Nodes.' The degree of vanishing nodes can be characterized quantitatively from the network parameters and is shown analytically to be proportional to the network depth and inversely proportional to the network width. Numerical results further suggest that vanishing nodes become more pronounced during back-propagation training. Finally, we show that vanishing/exploding gradients and vanishing nodes are two distinct challenges that increase the difficulty of training deep neural networks. (An illustrative numerical sketch of this node-correlation effect is given after the metadata record below.) | en |
dc.description.provenance | Made available in DSpace on 2021-05-19T17:40:46Z (GMT). No. of bitstreams: 1 ntu-108-R06942064-1.pdf: 17746185 bytes, checksum: 20cb8912b917d7d7e3810b4ac5063af0 (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | 摘要 iii
Abstract v
1 Introduction 1
2 Related Work 3
2.1 Difficulties in training deep neural networks 3
2.2 Representation power of deep neural network 3
3 Vanishing Nodes: correlation between hidden nodes 5
3.1 Vanishing Node Indicator 6
3.2 Impacts of back-propagation 14
3.3 Relationship between the VNI and the redundancy of nodes 15
3.4 The vanishing of the representation power 18
3.5 The effect of the orthogonal weight matrices to the representation power 21
3.6 Representation power of residual-like architectures 23
4 Variance propagation of deep neural networks 39
4.1 Comparison of exploding/vanishing gradients and vanishing nodes 39
4.2 Norm-preserving weight initialization 41
4.3 The two obstacles for training deep neural networks 42
5 Experiments 45
5.1 Probability of failed training caused by vanishing nodes 45
5.2 Analyses of failed training caused by vanishing nodes 50
6 Conclusion 53
Bibliography 55 | |
dc.language.iso | en | |
dc.title | 神經元消失:影響深層神經網路之表現能力,並使其難以訓練的新現象 | zh_TW |
dc.title | Vanishing Nodes: The Phenomenon That Affects The Representation Power and The Training Difficulty of Deep Neural Networks | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 林昌鴻,李宏毅,李琳山,吳沛遠 | |
dc.subject.keyword | 深度學習,梯度消失,機器學習理論,表現能力,神經網路架構,網路訓練問題,正交參數初始化,冗餘神經元,隨機矩陣 | zh_TW |
dc.subject.keyword | Deep learning, Vanishing gradient, Learning theory, Representation power, Network architecture, Training difficulty, Orthogonal initialization, Node redundancy, Random matrices | en |
dc.relation.page | 59 | |
dc.identifier.doi | 10.6342/NTU201901446 | |
dc.rights.note | 同意授權(全球公開) | |
dc.date.accepted | 2019-07-31 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電信工程學研究所 | zh_TW |
dc.date.embargo-lift | 2024-08-05 | - |
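
The abstract above describes vanishing nodes as hidden-node correlation (and hence redundancy) that grows with network depth and shrinks with network width. The following is a minimal NumPy sketch of that qualitative claim only; it is not code from the thesis, and the tanh activation, Xavier-style Gaussian initialization, batch size, and the mean absolute pairwise-correlation measure used here are illustrative assumptions rather than the thesis's Vanishing Node Indicator (VNI).

```python
# Minimal NumPy sketch (not the thesis code) of the vanishing-nodes effect:
# hidden nodes of a deep, randomly initialized network become increasingly
# correlated -- i.e. redundant -- as depth grows, and less so as width grows.
import numpy as np

rng = np.random.default_rng(0)


def mean_abs_node_correlation(h):
    """Average absolute off-diagonal correlation between hidden nodes.

    h has shape (num_inputs, width): each column holds one node's responses
    over a batch of inputs. Values near 1 mean the nodes are largely redundant.
    """
    c = np.corrcoef(h, rowvar=False)               # width x width correlation matrix
    off_diag = c[~np.eye(c.shape[0], dtype=bool)]  # keep only off-diagonal entries
    return float(np.abs(off_diag).mean())


def last_layer_correlation(depth, width, num_inputs=256):
    """Propagate a random input batch through `depth` tanh layers with
    Xavier-style Gaussian weights and report the node correlation at the end."""
    h = rng.standard_normal((num_inputs, width))
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        h = np.tanh(h @ w)
    return mean_abs_node_correlation(h)


if __name__ == "__main__":
    # Deeper -> more correlated (redundant) nodes; wider at the same depth -> less so.
    for depth, width in [(5, 100), (50, 100), (50, 400)]:
        corr = last_layer_correlation(depth, width)
        print(f"depth={depth:3d}  width={width:3d}  mean |corr| = {corr:.3f}")
```

Running the sketch should show the mean absolute node correlation at the last layer increasing with depth at a fixed width and decreasing again when the width is enlarged, consistent with the depth-proportional, width-inverse scaling stated in the abstract.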
Appears in Collections: | 電信工程學研究所 |
Files in This Item:
File | Size | Format
---|---|---
ntu-108-1.pdf | 17.33 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.