Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88384
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 陳正剛 | zh_TW
dc.contributor.advisor | Argon Chen | en
dc.contributor.author | 華方綾 | zh_TW
dc.contributor.author | Fang-Ling Hua | en
dc.date.accessioned | 2023-08-09T16:49:23Z | -
dc.date.available | 2023-11-10 | -
dc.date.copyright | 2023-08-09 | -
dc.date.issued | 2023 | -
dc.date.submitted | 2023-07-26 | -
dc.identifier.citation | [1] Bartlett, M.S., The statistical conception of mental factors. British Journal of Psychology, 1937. 28(1): p. 97.
[2] Borovecki, F., et al., Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease. Proceedings of the National Academy of Sciences, 2005. 102(31): p. 11023-11028.
[3] Budescu, D.V., Dominance analysis: a new approach to the problem of relative importance of predictors in multiple regression. Psychological Bulletin, 1993. 114(3): p. 542.
[4] Carroll, J.B., An analytical solution for approximating simple structure in factor analysis. Psychometrika, 1953. 18(1): p. 23-38.
[5] Cattell, R.B., The scree test for the number of factors. Multivariate Behavioral Research, 1966. 1(2): p. 245-276.
[6] Chunchan, C., Small-Sample Variable Selection based on Relative Importance Ranking and Its Applications to Gene Expression Analysis. 2021.
[7] Donoho, D.L., High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 2000. 1(2000): p. 32.
[8] Fahrmeir, L., et al., Regression: Models, Methods and Applications. 2013: Springer Berlin Heidelberg.
[9] Ferguson, G.A., The concept of parsimony in factor analysis. Psychometrika, 1954. 19(4): p. 281-290.
[10] Gibson, W., Orthogonal predictors: A possible resolution of the Hoffman-Ward controversy. Psychological Reports, 1962. 11(1): p. 32-34.
[11] Green, P.E., J.D. Carroll, and W.S. DeSarbo, A new measure of predictor variable importance in multiple regression. Journal of Marketing Research, 1978. 15(3): p. 356-360.
[12] Grömping, U., Variable importance in regression models. Wiley Interdisciplinary Reviews: Computational Statistics, 2015. 7(2): p. 137-152.
[13] Harman, H.H., Modern factor analysis. 1976: University of Chicago Press.
[14] Hoerl, A.E. and R.W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970. 12(1): p. 55-67.
[15] Jiarong, L., Variable ranking and selection for prediction model based on relative importance hierarchical clustering and its applications to gene prediction. 2020.
[16] Johnson, J.W., A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate Behavioral Research, 2000. 35(1): p. 1-19.
[17] Johnson, R.M., The minimal transformation to orthonormality. Psychometrika, 1966. 31(1): p. 61-66.
[18] Kaiser, H.F., The varimax criterion for analytic rotation in factor analysis. Psychometrika, 1958. 23(3): p. 187-200.
[19] Kaiser, H.F., The application of electronic computers to factor analysis. Educational and Psychological Measurement, 1960. 20(1): p. 141-151.
[20] Neuhaus, J.O. and C. Wrigley, The quartimax method: An analytic approach to orthogonal simple structure. British Journal of Statistical Psychology, 1954. 7(2): p. 81-91.
[21] Pomeroy, S.L., et al., Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 2002. 415(6870): p. 436-442.
[22] Saeys, Y., I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics, 2007. 23(19): p. 2507-2517.
[23] Saunders, D.R., An analytic method for rotation to orthogonal simple structure. ETS Research Bulletin Series, 1953. 1953(1): p. i-28.
[24] Sen, A. and M. Srivastava, Regression Analysis: Theory, Methods, and Applications. 2012: Springer New York.
[25] Shen, Z. and A. Chen, Comprehensive relative importance analysis and its applications to high dimensional gene expression data analysis. Knowledge-Based Systems, 2020. 203: p. 106120.
[26] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1(2): p. 203-209.
[27] Song, Q., J. Ni, and G. Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Transactions on Knowledge and Data Engineering, 2011. 25(1): p. 1-14.
[28] Spearman, C., "General Intelligence" Objectively Determined and Measured. 1961.
[29] Sydsæter, K., et al., Further mathematics for economic analysis. 2008: Pearson Education.
[30] Thomson, G.H., The definition and measurement of "g" (general intelligence). Journal of Educational Psychology, 1935. 26(4): p. 241.
[31] Thurstone, L.L., Multiple-factor analysis; a development and expansion of The Vectors of Mind. 1947.
[32] Tibshirani, R., Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996. 58(1): p. 267-288.
[33] Yiching, C., Relative importance based hierarchical clustering and its application to gene expression analysis. 2019.
| -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88384 | -
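dc.description.abstract | High-dimensional, small-sample data pose a challenge in data mining. Such data are especially common in bioinformatics: gene expression microarray data, for example, record tens of thousands of human genes while the sample size is only a few hundred. How to identify the key genes among so many in order to predict the likelihood of disease is a major question. From the viewpoint of regression analysis in statistics, this is a variable selection problem. Many methods have been proposed to date, and the crux is how to handle the “collinearity” among variables and the “combined” explanatory power of variables.
To address collinearity, J.W. Johnson (2000) [16] used the “best-fitting orthogonal approximation” proposed by R.M. Johnson (1966) [17] to transform the independent variables (X) into orthogonal intermediate variables (Z), and then computed the Relative Weight of each variable through Z. That method, however, applies only when the sample size (n) exceeds the number of variables (p) and the variables are not linearly dependent, that is, when X is full-rank. Shen and Chen (2020) [25] therefore extended J.W. Johnson's Relative Weight and proposed the Comprehensive Relative Importance method, which applies to numerical data with any n and p. When p is very large and p >> n, however, the matrix computations it requires also become excessively large.
Building on these relative importance methods, this study aims to provide a way to compute variable relative importance within a factor analysis framework for problems where p is very large. Suppose that in a high-dimensional problem the number of independent variables p is large but the rank of the X matrix is only r. This study first uses factor extraction and Varimax rotation from factor analysis to obtain r factors, which replace the p intermediate variables of the best-fitting orthogonal approximation in the original relative importance calculation; the relative importance of each variable is then computed, and the variables are ranked so that the important ones are selected into the regression model. This study also attempts a new factor extraction method that additionally takes into account explanatory power for the dependent variable: the correlation coefficients between the independent variables and the dependent variable are computed and sorted, the r (r << p) variables with the largest correlations are selected, the best-fitting orthogonal approximation is applied to obtain r orthogonal intermediate variables Z, and Varimax rotation is then applied to the r Z variables to obtain “r factors”, from which the relative importance of the variables is computed.
Whereas Shen and Chen (2020) operate throughout in a p-dimensional subspace of the n-dimensional space, the proposed method computes relative importance in an r-dimensional space, which reduces the amount of computation. In addition, Shen and Chen (2020) did not truly resolve the collinearity problem: in their calculation the independent variables explain the variation of the dependent variable through non-orthogonal intermediate variables. The proposed method ensures that the intermediate variables are mutually orthogonal, so they can reasonably be used to allocate relative importance to the original independent variables, which also makes the result more interpretable.
The method is validated with simulated cases and gene expression data. The simulations show that the proposed method recovers the orthogonal factors set in the experiments; under high collinearity, its variable selection performance and computational efficiency are also better than those of the relative importance method of Shen and Chen (2020), achieving the goal of computing variable relative importance by factor analysis. In predictive modeling of gene expression data, the proposed method mostly yields better predictions than traditional variable selection methods and in some cases performs on par with the method of Shen and Chen (2020).
| zh_TW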
dc.description.abstract | High-dimensional data with a small sample size pose a challenge for researchers in data analysis. This is especially true in bioinformatics, where datasets such as gene expression microarray data consist of tens of thousands of features but only hundreds of samples. Hence the question arises: how can we find the key features among those genes for modeling and prediction?
From the perspective of regression analysis in statistics, this is the well-known “variable selection” problem. However, no consensus has been reached on how to treat the “multicollinearity” that often arises in high-dimensional settings.
To address multicollinearity, J.W. Johnson (2000) [16] used the “Best-Fitting Orthogonal Approximation of X” proposed by R.M. Johnson (1966) [17] to transform the independent variables (X) into their orthogonal counterparts (Z) and then calculated the “Relative Weight” of each feature. However, this method is limited to cases where the sample size (n) is greater than the number of features (p) and the design matrix X is full-rank.
Shen and Chen (2020) [25] extended the Relative Weight method of J.W. Johnson (2000) and introduced “Comprehensive Relative Importance (CRI),” which can be applied to any design matrix, irrespective of singularity or size. Nevertheless, its computational cost becomes excessive when p is very large and p >> n.
The purpose of this study is to propose a new framework, inspired by Factor Analysis, that addresses the computational cost of Shen and Chen (2020) in the high-dimensional setting. Assume that p is very large but the rank of X is only r. We first apply Factor Extraction and Varimax Factor Rotation from Factor Analysis to find r factors, which serve as an orthogonal basis for the subspace spanned by the design matrix. These r factors replace the original p intermediate variables Z, the Best-Fitting Orthogonal Approximation of X, in the relative importance calculation. We then rank the variables by relative importance and select the top-ranked ones into the regression model. This study also explores a new Factor Extraction method that accounts for the effect of the predictors on the dependent variable. We calculate the Pearson correlation between each predictor and the dependent variable, sort the values in descending order, and select the r (r << p) variables with the largest correlations. We then transform the selected variables into r orthogonal intermediate variables Z using the Best-Fitting Orthogonal Approximation proposed by R.M. Johnson (1966). Finally, we perform Varimax Rotation on Z and calculate the relative importance.
While Shen and Chen (2020) perform all calculations in the p-dimensional subspace contained in the n-dimensional space, this study works in an r-dimensional subspace to calculate relative importance, which reduces the computational cost. Furthermore, Shen and Chen (2020) use non-orthogonal intermediate variables to explain the variation in the dependent variable, whereas this study ensures that orthogonal intermediate variables are used in the calculation. Hence, our proposed method is more interpretable.
Our proposed method is validated using simulation cases and gene expression microarray datasets. The simulation results show that our method can indeed identify the factors that are designed in the simulations to influence both the dependent and independent variables. Moreover, when predictors are highly correlated, our method selects the truly important variables more reliably and requires less computation than the method of Shen and Chen (2020). Finally, when applied to gene expression datasets for disease prediction, the proposed method is comparable to the method of Shen and Chen (2020) and, in most cases, outperforms other popular feature selection algorithms in classification performance.
| en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-09T16:49:23Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2023-08-09T16:49:23Z (GMT). No. of bitstreams: 0 | en
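dc.description.tableofcontents | Acknowledgements
Abstract (Chinese)
ABSTRACT
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Supervised and Unsupervised Learning
2.1.1 Factor Analysis
2.1.2 Regression Analysis
2.1.3 Regularized Regression
2.2 Relative Importance of Variables
2.2.1 Best-Fitting Orthogonal Approximation (Johnson’s Transformation)
2.2.2 Relative Weight
2.2.3 Relative Importance of Variables for Non-Full-Column-Rank Matrices (Relative Importance)
2.3 Relative Importance Components and Their Simplification
2.3.1 Relative Importance Components
2.3.2 Simplification Method for Relative Importance Components (RICE)
Chapter 3 Variable Relative Importance Based on Factor Analysis
3.1 Geometric Meaning of the Orthogonal Transformation in Factor Analysis
3.1.1 Geometric Meaning of the Factor Space
3.1.2 Factor Analysis Transformation and the Best-Fitting Orthogonal Approximation
3.2 Method for Determining Variable Relative Importance Using Factor Analysis
3.2.1 Factor Extraction
3.2.2 Factor Rotation: Varimax Rotation
3.2.3 Relative Importance Calculation
3.3 Simulation Cases
3.3.1 Orthogonal Factor Verification
3.3.2 Variable Selection
Chapter 4 Application of the Method and Analysis of Results
Chapter 5 Conclusions and Future Research
References
| -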
dc.language.iso | zh_TW | -
dc.title | 基於因素分析與變數相對重要性之變數選擇方法及其應用 | zh_TW
dc.title | Variable Selection based on Factor Analysis and Relative Importance and Its Application | en
dc.type | Thesis | -
dc.date.schoolyear | 111-2 | -
dc.description.degree | Master | -
dc.contributor.oralexamcommittee | 藍俊宏;蔡政安 | zh_TW
dc.contributor.oralexamcommittee | Jakey Blue;Chen-An Tsai | en
dc.subject.keyword | 變數選擇,因素分析,相對重要性,基因表現 | zh_TW
dc.subject.keyword | Variable Selection, Factor Analysis, Relative Importance, Gene Expression | en
dc.relation.page | 80 | -
dc.identifier.doi | 10.6342/NTU202301979 | -
dc.rights.note | Not authorized | -
dc.date.accepted | 2023-07-27 | -
dc.contributor.author-college | College of Engineering | -
dc.contributor.author-dept | Graduate Institute of Industrial Engineering | -
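
Two building blocks named in the abstract, the Relative Weight of J.W. Johnson (2000) and the correlation-based screening of predictors, can be sketched roughly as follows. This is a minimal illustration and not code from the thesis: the function names `unit_scale`, `relative_weights`, and `top_r_by_correlation`, the use of NumPy, the unit-length scaling, and ranking by absolute correlation are all assumptions made here. The relative-weight part also assumes n > p, a full-column-rank X, and no constant columns, matching the limitation stated in the abstract.

```python
import numpy as np


def unit_scale(A):
    """Center each column (or a 1-D vector) and scale it to unit length."""
    A = np.asarray(A, dtype=float)
    A = A - A.mean(axis=0)
    return A / np.linalg.norm(A, axis=0)


def relative_weights(X, y):
    """Relative weights in the style of J.W. Johnson (2000).

    Assumes n > p and a full-column-rank X (illustrative sketch only).
    """
    Xs, ys = unit_scale(X), unit_scale(y)
    # Thin SVD of the scaled predictors: Xs = P diag(d) Qt
    P, d, Qt = np.linalg.svd(Xs, full_matrices=False)
    # Orthonormal columns closest to Xs (R.M. Johnson's 1966 transformation)
    Z = P @ Qt
    # Lam links Z back to the predictors: Xs = Z @ Lam
    Lam = Qt.T @ np.diag(d) @ Qt
    # Coefficients of the scaled response on the orthogonal variables Z
    beta = Z.T @ ys
    # Weight of predictor j: sum over k of Lam[j, k]^2 * beta[k]^2
    return (Lam ** 2) @ (beta ** 2)


def top_r_by_correlation(X, y, r):
    """Indices of the r predictors with the largest |Pearson correlation| with y.

    Using the absolute value is an assumption; the abstract only says the
    correlations are sorted in descending order.
    """
    corr = unit_scale(X).T @ unit_scale(y)
    return np.argsort(-np.abs(corr))[:r]
```

Under these assumptions the weights are non-negative and sum to the R² of the regression of y on X, which is what allows them to be read as shares of explained variance.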
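Appears in Collections: Graduate Institute of Industrial Engineering

Files in This Item:
File | Size | Format
ntu-111-2.pdf (not currently authorized for public access) | 5.98 MB | Adobe PDF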