Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73340
Full metadata record (DC field: value [language])
dc.contributor.advisor: 吳家麟 (Ja-Ling Wu)
dc.contributor.author: Chang-Ti Huang [en]
dc.contributor.author: 黃昌第 [zh_TW]
dc.date.accessioned: 2021-06-17T07:29:21Z
dc.date.available: 2022-07-01
dc.date.copyright: 2019-07-01
dc.date.issued: 2019
dc.date.submitted: 2019-06-18
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73340
dc.description.abstract: Efficient deep learning computation is an important topic: it not only reduces computational cost but also makes it possible to bring artificial intelligence to mobile devices. Regularization is a common approach to model compression, and regularization with the $L_0$ norm is one of the effective methods. Because this norm is defined as the number of non-zero parameters, it is well suited as a sparsity constraint on neural network parameters. That same definition, however, makes the $L_0$ norm discrete and mathematically intractable. An earlier study used the Concrete distribution to emulate binary logic gates and used these gates to decide which parameters should be pruned. This thesis proposes a more reliable framework for emulating binary gates: a regularization term built on mixture distributions. Under the proposed framework, any pair of symmetric distributions converging to $\delta(0)$ and $\delta(1)$ can serve as an approximately binary gate, which then estimates the $L_0$ regularizer and achieves model compression and network reduction. In addition, we derive a reparameterization method for these mixture distributions in this compression setting, so that the proposed deep learning algorithm can be optimized by stochastic gradient descent. Experimental results on the MNIST and CIFAR-10/CIFAR-100 datasets show that the proposed method is highly competitive. [zh_TW]
dc.description.abstract: Efficient deep learning computing has recently received considerable attention: it reduces computational cost and makes model inference feasible on on-chip devices. Regularizing the parameters is a common approach to model compression, and $L_0$ regularization is an effective regularizer because it penalizes non-zero parameters without shrinking large values. However, the combinatorial nature of the $L_0$ norm makes it intractable to optimize directly. A previous work approximated the $L_0$ norm using the Concrete distribution to emulate binary gates, which collectively determine which weights should be pruned. This thesis proposes a more general framework for relaxing binary gates through mixture distributions. Under the proposed framework, any mixture pair of distributions converging to $\delta(0)$ and $\delta(1)$ can be used to construct smoothed binary gates. We further introduce a reparameterization method for mixture distributions to the field of model compression, so that the reparameterized smoothed binary gates support efficient gradient-based optimization within the proposed deep learning algorithm. Extensive experiments show that the method achieves state-of-the-art results in terms of pruned architectures, structured sparsity, and the reduction in floating-point operations (FLOPs). [en]
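As background for the relaxation the abstract refers to, the sketch below illustrates the prior approach it builds on: a binary-Concrete ("hard concrete") gate, reparameterized through uniform noise so that an expected $L_0$ penalty becomes differentiable (Louizos et al., ICLR 2018). This is a minimal illustration only; the hyperparameters (beta, gamma, zeta), the per-unit gating, and the toy loss are assumptions made for the example, not the thesis's mixture-distributed regularizer (MDR).

import torch

def sample_hard_concrete(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # Reparameterized draw of a smoothed binary gate z in [0, 1].
    u = torch.rand_like(log_alpha)                      # u ~ Uniform(0, 1)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma                  # stretch to (gamma, zeta)
    return torch.clamp(s_bar, 0.0, 1.0)                 # clip so the gate can reach exact 0 and 1

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # Differentiable surrogate for the expected number of non-zero gates.
    return torch.sigmoid(log_alpha - beta * torch.log(torch.tensor(-gamma / zeta))).sum()

# Toy usage: gate the output units of a weight matrix and penalize the expected L0 norm.
log_alpha = torch.zeros(256, requires_grad=True)        # one learnable gate per output unit
weight = torch.randn(256, 128, requires_grad=True)
z = sample_hard_concrete(log_alpha)
sparse_weight = z.unsqueeze(1) * weight                 # units whose gate is 0 are pruned
loss = sparse_weight.pow(2).mean() + 1e-3 * expected_l0(log_alpha)
loss.backward()                                         # gradients reach log_alpha through the reparameterization

The thesis generalizes this idea: any mixture pair of distributions converging to $\delta(0)$ and $\delta(1)$ can replace the Concrete gate, with its own reparameterization derived via inverse transform sampling (Chapter 3 in the table of contents below).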
dc.description.provenance: Made available in DSpace on 2021-06-17T07:29:21Z (GMT). No. of bitstreams: 1
ntu-108-R06944017-1.pdf: 13261977 bytes, checksum: 148e43a65bf77c7345d6042dd9c59152 (MD5)
Previous issue date: 2019 [en]
dc.description.tableofcontents: Contents
Thesis Committee Certification i
Acknowledgements ii
Chinese Abstract iii
Abstract iv
1 Introduction 1
2 Constructing the MDR 4
2.1 Gradient estimators 5
2.1.1 The REINFORCE 6
2.1.2 Control variates 7
2.1.3 The reparameterization trick 8
2.1.4 The Concrete relaxation 8
2.2 Relaxing L0 norms through the binary Concrete 9
2.2.1 The Gumbel-max trick 9
2.2.2 The Concrete distribution 10
2.2.3 The binary Concrete 11
2.3 Mixture distributions 13
2.3.1 Mixtures of exponential distributions 15
2.3.2 Mixtures of exponential-uniform distributions 15
2.3.3 Mixtures of power-law function distributions 16
2.4 The Mixture-Distributed Regularization (MDR) 16
3 Optimizing the MDR 19
3.1 Inverse transform sampling 19
3.2 Reparameterization for the MDR 21
3.3 Estimating ζ∗ for testing 25
3.4 Combining the MDR with other regularizers 26
3.5 Group sparsity constraints 27
4 Related Work 29
5 Experiments 32
5.1 The gradient variance of the MDR 32
5.2 Experimental setup 34
5.2.1 Datasets 34
5.2.2 Architectures 35
5.2.3 Implementation details 35
5.3 LeNet-300-100 on MNIST 39
5.4 LeNet-5-Caffe on MNIST 40
5.5 Wide-ResNet on CIFAR 41
6 Conclusions 45
Bibliography 47
dc.language.iso: en
dc.subject: 深度學習 (Deep Learning) [zh_TW]
dc.subject: 模型壓縮 (Model Compression) [zh_TW]
dc.subject: 網路縮減 (Network Reduction) [zh_TW]
dc.subject: 正規化 (Regularization) [zh_TW]
dc.subject: 最佳化 (Optimization) [zh_TW]
dc.subject: Model Compression [en]
dc.subject: Deep Learning [en]
dc.subject: Optimization [en]
dc.subject: Regularization [en]
dc.subject: Network Reduction [en]
dc.title: 基於混合分布正規化之模型壓縮方法 [zh_TW]
dc.title: A Method of Mixture-Distributed Regularization for Model Compression [en]
dc.type: Thesis
dc.date.schoolyear: 107-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 朱威達, 鄭文皇, 胡敏君
dc.subject.keyword: 深度學習, 模型壓縮, 正規化, 網路縮減, 最佳化 (Deep Learning, Model Compression, Network Reduction, Regularization, Optimization) [zh_TW]
dc.subject.keyword: Deep Learning, Model Compression, Network Reduction, Regularization, Optimization [en]
dc.relation.page: 51
dc.identifier.doi: 10.6342/NTU201900792
dc.rights.note: 有償授權 (paid licensing)
dc.date.accepted: 2019-06-19
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) [zh_TW]
Appears in collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in this item:
File: ntu-108-1.pdf (access restricted, not publicly available)
Size: 12.95 MB
Format: Adobe PDF

