Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21388
Title: | 正則化變數選取方法應用於高度相關資料之比較 (Comparison of Regularization Methods for Variable Selection in Highly Correlated Data) |
Author: | Ching-Hsuan Chang (張靜萱) |
Advisor: | 周呈霙 |
Keywords: | variable selection, regularization method, grouping effect, healthcare data, regression data |
Publication year: | 2019 |
Degree: | Master's |
Abstract: | Regularization methods can perform variable selection on collinear data. For example, the elastic net has been proven to have a grouping effect: it selects all or none of a group of highly correlated variables. This study compares four regularization methods, namely LASSO, elastic net, empirical Bayesian LASSO (EBLASSO), and empirical Bayesian elastic net (EBENet), in terms of their selection behavior under different levels of correlation. Through simulation studies with a fixed sample size and a fixed number of variables, I found the following:
(1) For data in which high correlation exists only among true variables, the elastic net performs better in variable selection, coefficient estimation, and prediction.
(2) For data in which correlations exist between true variables and irrelevant variables, regardless of the level of correlation, EBLASSO and EBENet are good choices in terms of variable selection, coefficient estimation, and prediction.
(3) In general, as the correlations decrease, all four regularization methods improve in variable selection and coefficient estimation.
Finally, EBLASSO outperformed the other three regularization methods on a real health examination dataset. Because correlation may exist between true variables and irrelevant variables in such data, this result is consistent with finding (2). In addition, for several groups of highly correlated variables in the dataset, such as height, weight, and lean body mass, the elastic net selected every variable in the group into the model, which matches the grouping effect theorem. Because the behavior of the methods in the simulations corresponds to their behavior in the real data analysis, the simulation findings may serve as a reference when choosing a variable selection method for real data. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21388 |
DOI: | 10.6342/NTU201902881 |
Full-text authorization: | Not authorized |
Appears in collections: | Master's Program in Statistics |
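
The grouping effect mentioned in the abstract can be illustrated with a small sketch. This is not code from the thesis: it is a minimal pure-Python cyclic coordinate descent for the naive elastic net objective (1/2n)·||y − Xb||² + l1·||b||₁ + (l2/2)·||b||₂², run on made-up toy data with two identical predictors. The function names, the penalty values `l1` and `l2`, and the data are all assumptions chosen for illustration, and the empirical Bayesian methods (EBLASSO, EBENet) are not covered.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(columns, y, l1, l2, sweeps=200):
    """Cyclic coordinate descent for
    (1/2n)*||y - X b||^2 + l1*||b||_1 + (l2/2)*||b||_2^2.
    `columns` is a list of feature columns; l2 = 0 gives the LASSO."""
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)  # y - X b, with b starting at zero
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            scale = sum(v * v for v in col) / n
            # Covariance of column j with the partial residual (b_j removed).
            rho = sum(v * r for v, r in zip(col, residual)) / n + scale * coefs[j]
            new = soft_threshold(rho, l1) / (scale + l2)
            delta = new - coefs[j]
            if delta != 0.0:
                residual = [r - v * delta for r, v in zip(residual, col)]
                coefs[j] = new
    return coefs


# Toy data (hypothetical): two identical predictors plus one unrelated predictor.
x = [-1.5, -0.5, 0.5, 1.5, -1.0, 1.0, 0.0, 0.0]
unrelated = [1.0, -1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # orthogonal to x
y = [2.0 * v for v in x]  # response depends only on the duplicated predictor
columns = [x, list(x), unrelated]

lasso_coefs = elastic_net_cd(columns, y, l1=0.1, l2=0.0)
enet_coefs = elastic_net_cd(columns, y, l1=0.1, l2=0.5)
print("LASSO      :", [round(b, 4) for b in lasso_coefs])
print("elastic net:", [round(b, 4) for b in enet_coefs])
```

In this toy setting the elastic net splits the weight evenly between the two identical predictors, whereas the LASSO run places essentially all of it on whichever copy is updated first. The LASSO solution is not unique here, so that outcome depends on the update order. Both runs leave the unrelated predictor at zero.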
Files in this item:
File | Size | Format | |
---|---|---|---|
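ntu-108-1.pdf (not currently authorized for public access) | 923.91 kB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.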