Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21388
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 周呈霙 | |
dc.contributor.author | Ching-Hsuan Chang | en |
dc.contributor.author | 張靜萱 | zh_TW |
dc.date.accessioned | 2021-06-08T03:32:43Z | - |
dc.date.copyright | 2019-08-13 | |
dc.date.issued | 2019 | |
dc.date.submitted | 2019-08-08 | |
dc.identifier.citation | 1. Liver Disease Prevention and Treatment Research Foundation (肝病防治學術基金會) (2018). How to read test reports (如何看檢驗報告). Retrieved from https://www.liver.org.tw/knowledgeView.php?cat=4&sid=16
2. 黃昭勳 (2016). A comparison of the characteristics of LASSO and its derived methods (LASSO與其衍生方法之特性比較). Master's thesis, Department of Statistics, National Chengchi University, Taipei. Retrieved from https://hdl.handle.net/11296/egvwd6
3. Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830. (Source of Figure 2.1.)
4. Breaux, H. J. (1967). On stepwise multiple linear regression (No. BRL-1369). Army Ballistic Research Lab, Aberdeen Proving Ground, MD.
5. Cai, X., Huang, A., & Xu, S. (2011). Fast empirical Bayesian LASSO for multiple quantitative trait locus mapping. BMC Bioinformatics, 12(1), 211.
6. Desboulets, L. (2018). A review on variable selection in regression analysis. Econometrics, 6(4), 45.
7. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
8. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
9. Heinze, G., Wallisch, C., & Dunkler, D. (2018). Variable selection: A review and recommendations for the practicing statistician. Biometrical Journal, 60(3), 431-449.
10. Huang, A., Xu, S., & Cai, X. (2013). Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping. BMC Genetics, 14(1), 5.
11. Huang, A., Xu, S., & Cai, X. (2015). Empirical Bayesian elastic net for multiple quantitative trait locus mapping. Heredity, 114(1), 107.
12. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
13. Li, Q., & Lin, N. (2010). The Bayesian elastic net. Bayesian Analysis, 5(1), 151-170.
14. Kim, J., & Jo, I. (2010). Relationship between body mass index and alanine aminotransferase concentration in non-diabetic Korean adults. European Journal of Clinical Nutrition, 64(2), 169.
15. Ma, S., & Huang, J. (2008). Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics, 9(5), 392-403.
16. Mallick, H., & Yi, N. (2013). Bayesian methods for high dimensional linear models. Journal of Biometrics & Biostatistics, 1, 005.
17. Mungreiphy, N. K., Kapoor, S., & Sinha, R. (2011). Association between BMI, blood pressure, and age: Study among Tangkhul Naga tribal males of Northeast India. Journal of Anthropology, 2011.
18. Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.
19. Ramesh, V., Saraswat, S., Choudhury, N., & Gupta, R. K. (1995). Relationship of serum alanine aminotransferase (ALT) to body mass index (BMI) in blood donors: The need to correct ALT for BMI in blood donor screening. Transfusion Medicine, 5(4), 273-274.
20. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM.
21. Schisterman, E. F., Perkins, N. J., Mumford, S. L., Ahrens, K. A., & Mitchell, E. M. (2017). Collinearity and causal diagrams: A lesson on the importance of model specification. Epidemiology, 28(1), 47.
22. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
23. Tukey, J. W. (1949). Comparing individual means in the analysis of variance. Biometrics, 5(2), 99-114.
24. Vatcheva, K. P., Lee, M., McCormick, J. B., & Rahbar, M. H. (2016). Multicollinearity in regression analyses conducted in epidemiologic studies. Epidemiology (Sunnyvale), 6(2).
25. Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.
26. Zhou, D. X. (2013). On grouping effect of elastic net. Statistics & Probability Letters, 83(9), 2108-2112.
27. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21388 | - |
dc.description.abstract | 正則化變數選取方法能對共線性資料進行維度縮減。舉例來說,彈性網(elastic net)就被證明對高度相關的變數群具備整群選入或選出的群體效應(grouping effect)。本研究旨在比較四種正則化變數選取方法:LASSO、彈性網、經驗貝氏 LASSO(empirical Bayesian LASSO,EBLASSO),以及經驗貝氏彈性網(empirical Bayesian elastic net,EBENet),在資料具有不同程度相關性下的選取行為。經由模擬研究,在固定的樣本大小和變數數量下,我有以下發現:
(1)對於高度相關僅存在於真實變數間的資料,彈性網具較好的選擇、估計係數,和預測能力。 (2)對於相關性存在於真實變數和無關變數(irrelevant variables)間的資料,EBLASSO和EBENet在選擇、估計係數,和預測能力方面都是很好的選擇。 (3)一般而言,隨著相關性降低,四種正則化方法在變數選擇及係數估計能力上都會有所提升。 最後,在對真實健康檢查資料集進行正則化方法比較時,發現EBLASSO表現最佳。由於相關性有可能存在於真實變數和無關變數間,所以此結果可呼應上面的模擬結論(2)。另外,觀察資料集中存在高度相關的幾個變數群,可以發現彈性網對於這些變數群(如:身高、體重、除脂肪淨體重)有整群選入的行為,此現象則可呼應彈性網的群體效應定理。因此,本研究認為由於正則化方法在模擬結果和真實資料分析的表現可以相對應,故模擬研究的發現或可做為實際資料分析時選擇變數選取方法的參考。 | zh_TW |
dc.description.abstract | Regularization methods can perform variable selection on collinear data. For example, the elastic net has been proven to have a grouping effect, selecting all or none of a group of highly correlated variables. The objective of this study is to compare the selection behavior of four regularization methods, namely LASSO, elastic net, empirical Bayesian LASSO (EBLASSO), and empirical Bayesian elastic net (EBENet), under different levels of correlation. Through simulation studies at a fixed sample size and number of variables, I found that:
(1) For data in which high correlation exists only among the true variables, the elastic net should be chosen for its better ability to select variables, estimate coefficients, and predict. (2) For data in which correlations exist between true variables and irrelevant variables, EBLASSO and EBENet are good choices regardless of the level of correlation, again in terms of selection, coefficient estimation, and prediction. (3) In general, as the correlations decrease, all four regularization methods improve in variable selection and coefficient estimation. Finally, EBLASSO outperformed the other three regularization methods on a real healthcare dataset. Since correlations may exist there between true and irrelevant variables, this result echoes conclusion (2) above. In addition, for several groups of highly correlated variables in the real dataset, such as height, weight, and lean body mass, the elastic net selected every variable in each group into the model, which matches the grouping-effect theorem. Given this consistency between the simulation results and the real-data analysis, the simulation findings may serve as a reference for choosing a variable selection method in real data analysis. | en |
dc.description.provenance | Made available in DSpace on 2021-06-08T03:32:43Z (GMT). No. of bitstreams: 1 ntu-108-R06h41002-1.pdf: 946079 bytes, checksum: 2699ce37836ebc9623eb7b8e46f2ac4e (MD5) Previous issue date: 2019 | en |
dc.description.tableofcontents | 謝辭 (Acknowledgements) i
摘要 (Chinese Abstract) ii
Abstract iii
List of Figures vi
List of Tables vii
Chapter 1 Introduction 1
Chapter 2 Literature Reviews 3
2.1 LASSO regression 6
2.2 Elastic net regression 8
2.3 Empirical Bayesian LASSO (EBLASSO) regression 9
2.4 Empirical Bayesian elastic net (EBENet) regression 12
2.5 In response to highly correlated predictors 14
Chapter 3 Simulation Study on Correlated Data 19
3.1 Case 1: All predictors are uncorrelated 20
3.2 Case 2: All predictors are correlated 21
3.3 Case 3: Correlations within true predictors 21
3.4 Case 4: Correlations between true predictors and irrelevant predictors 22
3.5 Case 5: Correlations within some true predictors and between some true predictors and irrelevant predictors 23
3.6 Simulation process 24
3.7 Evaluation metrics 26
3.8 Simulation results and discussion 27
Chapter 4 Real Data Analysis 40
4.1 Healthcare dataset and preprocessing 40
4.2 Model building based on different regularization methods 43
4.3 Results and discussion 44
Chapter 5 Conclusion 53
References 55
Appendix 58 | |
dc.language.iso | en | |
dc.title | 正則化變數選取方法應用於高度相關資料之比較 | zh_TW |
dc.title | Comparison of Regularization Methods for Variable Selection in Highly Correlated Data | en |
dc.type | Thesis | |
dc.date.schoolyear | 107-2 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 陳素雲,王偉仲 | |
dc.subject.keyword | 變數選取,正則化方法,群體效應,健檢資料,迴歸資料 | zh_TW |
dc.subject.keyword | variable selection, regularization method, grouping effect, healthcare data, regression data | en |
dc.relation.page | 70 | |
dc.identifier.doi | 10.6342/NTU201902881 | |
dc.rights.note | 未授權 | |
dc.date.accepted | 2019-08-08 | |
dc.contributor.author-college | 共同教育中心 | zh_TW |
dc.contributor.author-dept | 統計碩士學位學程 | zh_TW |
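The grouping effect of the elastic net described in the abstract can be illustrated with a short sketch using scikit-learn (which the thesis's reference list cites). The data, coefficient values, and penalty settings below are invented for demonstration only and are not taken from the thesis:

```python
# Minimal sketch of the elastic net's grouping effect on two nearly
# identical predictors, versus plain LASSO. All numbers are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200

# Two highly correlated "true" predictors plus three irrelevant ones.
z = rng.normal(size=n)
x1 = z + 0.01 * rng.normal(size=n)
x2 = z + 0.01 * rng.normal(size=n)
irrelevant = rng.normal(size=(n, 3))
X = np.column_stack([x1, x2, irrelevant])
y = 3.0 * z + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("LASSO coefficients:      ", np.round(lasso.coef_, 2))
print("Elastic net coefficients:", np.round(enet.coef_, 2))
```

Because the elastic net's penalty is strictly convex (the L2 component), its estimated coefficients for x1 and x2 come out nearly equal, whereas the LASSO tends to split the weight arbitrarily or concentrate it on one of the duplicated columns; this is the behavior the abstract refers to for groups such as height, weight, and lean body mass.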
Appears in Collections: | 統計碩士學位學程 (Master's Program in Statistics) |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf (restricted; currently not available for public access) | 923.91 kB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.