NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76882
Title: 利用相對重要性階層分群變數排序與選擇建構預測模型及其基因預測應用
(Variable Ranking and Selection for Prediction Model based on Relative Importance Hierarchical Clustering and Its Applications to Gene Prediction)
Author: Jia-Rong Li (李佳蓉)
Advisor: Argon Chen (陳正剛)
Co-advisor: Jyh-Ping Hsu (徐治平)
Keywords: Variable Ranking, Variable Selection, Relative Importance, Relative Weight, Hierarchical Clustering, Gene Prediction
Publication Year: 2020
Degree: Master
Abstract: Variable selection has long been an important and complex issue in data analysis. The first question is how to judge the importance of each explanatory variable to the response variable. Commonly used indicators are the simple correlation coefficient and the standardized regression coefficient. However, when there is collinearity among the explanatory variables, neither indicator can simultaneously account for the direct effect of an individual variable on the response variable and the joint effect it produces together with the other explanatory variables. J. W. Johnson therefore proposed an indicator called the Relative Weight, which computes each variable's relative importance while accounting for the relationships among all the variables. This indicator, however, applies only when the explanatory variables are not perfectly collinear (non-singular) and the sample size exceeds the number of variables (n > p). Shen and Chen then proposed the 'comprehensive' relative importance, built on the Relative Weight, to handle singular or low-rank (n ≤ p) data.

In variable selection, besides each explanatory variable's relative importance to the response variable, the similarity of the explanatory variables' contributions to the response variable must also be considered. Because the relative importance splits a shared contribution evenly among the variables that carry it, variables with highly similar contributions receive nearly equal relative importance. This 'group effect' means that selecting all of these variables into the model at the same time does not improve its explanatory performance. To further explore the intricate relationships among the explanatory variables' contributions to the response variable, Chen deconstructed the relative importance into its components and used them for hierarchical clustering, a method called relative-importance-based hierarchical clustering. Unlike ordinary unsupervised clustering, which groups variables only by their mutual similarity, this method considers both a variable's importance to the response variable and its similarity to the other variables, so that variables with a common structure of contributions to the response variable are clustered into the same group.

This research therefore combines the relative importance with relative-importance-based hierarchical clustering to rank and select variables for building a better prediction model. Because the clustering identifies how similarly variables affect the response variable, it resolves the group effect of the relative importance ranking, so that a variable's own importance and its similarity to other variables can be considered at the same time when ranking and selecting variables for prediction.

This research first defines a distance suitable for relative-importance-based hierarchical clustering and proposes a method for automatically determining the number of clusters, so that the clustering result partitions the variables into reasonable groups. Within each group a representative variable is chosen by relative importance, and the groups are ranked by the relative importance of their representatives; this ordering is defined as the 'grouping structure'. A variable ranking method, the 'candidate variable rule', is then defined from the group ranking and the within-group ranking. During ranking, an indicator is needed to weigh a variable's own importance against its similarity to the variables already selected; this study further uses the components of the relative importance to build such an indicator, which measures a variable's explanatory power relative to the other variables in the model and produces the ranking of variables to enter the prediction model.
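For concreteness, the Relative Weight referenced at the start of the abstract can be sketched in a few lines of NumPy. This is only a minimal illustration of Johnson's published procedure (standardize the variables, form the best-fitting orthogonal counterpart of X from its SVD, and combine the squared loadings with the squared standardized coefficients), not the thesis's code; the function and variable names are ours, and it assumes n > p with no exact collinearity, the very restriction that the comprehensive relative importance is designed to remove.

import numpy as np

def relative_weights(X, y):
    # Johnson's Relative Weights: the share of R^2 attributable to each column of X.
    # Minimal sketch; assumes n > p and no exact collinearity.
    n = len(y)
    # Scale so that Xs'Xs is the correlation matrix and the coefficients are standardized.
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) * np.sqrt(n - 1))
    ys = (y - y.mean()) / (y.std(ddof=1) * np.sqrt(n - 1))
    # Best-fitting orthogonal counterpart Z = P Q' of X, from the SVD X = P diag(d) Q'.
    P, d, Qt = np.linalg.svd(Xs, full_matrices=False)
    Lam = (Qt.T * d) @ Qt            # Lam[j, k] = correlation between x_j and z_k
    beta = Qt.T @ (P.T @ ys)         # standardized coefficients of y regressed on Z
    return (Lam ** 2) @ (beta ** 2)  # relative weight of each x_j; the weights sum to R^2

On standardized data the returned weights are nonnegative and sum to the model R², which is what allows them to be read as shares of explained variance and then ranked or clustered.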
The purpose of this research is to rank and select variables for response variable prediction in high-dimensional data. We combine our method with Ridge Regression, a regularized regression method, to build the prediction model and validate it on high-dimensional gene expression data. We compare the prediction results with those of the relative importance ranking, simple correlation, Lasso, and Elastic Net to verify the advantages of our method in predicting the response variable. Finally, to identify a specific set of variables for the prediction model, we apply frequent pattern mining to the prediction results to find the variable combination with the best explanatory performance for response variable prediction.
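To illustrate how a variable ranking of this kind can feed a ridge-based prediction model and be compared against the regularized baselines named above, here is a small, self-contained scikit-learn sketch on simulated data (not the thesis's data or code). The ranking function below is a deliberately crude stand-in that orders columns by absolute correlation with the response rather than by the candidate variable rule, and names such as rank_variables and the cut-off k = 50 are illustrative assumptions; the frequent-pattern-mining step for the final variable subset is not shown.

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rank_variables(X, y):
    # Stand-in ranking: order columns by |correlation with y|. The thesis instead
    # ranks via relative importance and relative-importance-based hierarchical
    # clustering (the candidate variable rule), which is not reproduced here.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r))

# Simulated p >> n data standing in for a gene expression matrix.
rng = np.random.default_rng(0)
n, p = 120, 1000
X = rng.standard_normal((n, p))
y = X[:, :5] @ rng.standard_normal(5) + 0.5 * rng.standard_normal(n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Ranked subset + Ridge: keep the k top-ranked variables and fit a ridge model on them.
k = 50
top = rank_variables(X_tr, y_tr)[:k]
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X_tr[:, top], y_tr)

# Regularized baselines fitted on all variables.
lasso = LassoCV(cv=5, random_state=0).fit(X_tr, y_tr)
enet = ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.9], random_state=0).fit(X_tr, y_tr)

print("ranked + ridge :", mean_squared_error(y_te, ridge.predict(X_te[:, top])))
print("lasso          :", mean_squared_error(y_te, lasso.predict(X_te)))
print("elastic net    :", mean_squared_error(y_te, enet.predict(X_te)))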
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76882
DOI: 10.6342/NTU202003039
Full-Text Authorization: Not authorized
Appears in Collections: 統計碩士學位學程 (Master's Program in Statistics)

Files in This Item:
File: U0001-1208202002445300.pdf (not authorized for public access)
Size: 4.38 MB
Format: Adobe PDF


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.

Contact Information
No. 1, Sec. 4, Roosevelt Rd., Da'an Dist., Taipei 10617, Taiwan (R.O.C.)
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved