Relative Importance based Hierarchical Clustering and Its Application to Gene Expression Analysis

陳羿晴; Yi-Ching Chen

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77325

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳正剛	zh_TW
dc.contributor.author	陳羿晴	zh_TW
dc.contributor.author	Yi-Ching Chen	en
dc.date.accessioned	2021-07-10T21:56:29Z	-
dc.date.available	2024-08-06	-
dc.date.copyright	2019-08-07	-
dc.date.issued	2019	-
dc.date.submitted	2002-01-01	-
dc.identifier.citation	[1] A. K. Sen, Regression Analysis: Theory, Methods and Applications (Springer Texts in Statistics). New York: Springer-Verlag, 1990. [2] A. N. Zaied, M. G. Habishy, and M. A. Saleh, Acute Leukemia Classification using Bayesian Networks. 2012, pp. 1419-1426. [3] A. Wai-Ho, K. C. C. Chan, A. K. C. Wong, and W. Yang, "Attribute clustering for grouping, selection, and classification of gene expression data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 83-101, 2005. [4] D. Ghosh and A. M. Chinnaiyan, "Classification and selection of biomarkers in genomic data using LASSO," Journal of Biomedicine & Biotechnology, vol. 2005, no. 2, pp. 147-154, 2005. [5] D. V. Budescu, "Dominance analysis: A new approach to the problem of relative importance of predictors in multiple regression," Psychological Bulletin, vol. 114, no. 3, pp. 542-551, 1993. [6] G. Gamberoni, E. Lamma, F. Riguzzi, S. Storari, and S. Volinia, "Bayesian Networks Learning for Gene Expression Datasets," in Advances in Intelligent Data Analysis VI, Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 109-120. [7] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene Selection for Cancer Classification using Support Vector Machines," Machine Learning, vol. 46, no. 1, pp. 389-422, 2002. [8] J. J. Hughey and A. J. Butte, "Robust meta-analysis of gene expression using the elastic net," Nucleic Acids Research, vol. 43, no. 12, p. e79, 2015. [9] J. W. Johnson, "A Heuristic Method for Estimating the Relative Weight of Predictor Variables in Multiple Regression," Multivariate Behavioral Research, vol. 35, no. 1, pp. 1-19, 2000. [10] J. Zhu and T. Hastie, Classification of Gene Microarrays by Penalized Logistic Regression. 2004, pp. 427-43. [11] N. Friedman, M. Linial, I. Nachman, and D. Pe'er, "Using Bayesian networks to analyze expression data," presented at the Proceedings of the fourth annual international conference on Computational molecular biology, Tokyo, Japan, 2000. [12] P. E. Green, J. Douglas Carroll, and W. Desarbo, A New Measure of Predictor Variable Importance in Multiple Regression. 1978, pp. 356-360. [13] R. M. Johnson, "The minimal transformation to orthonormality," Psychometrika, vol. 31, no. 1, pp. 61-66, 1966. [14] S. Zixin and C. Argon, "Relative importance under low-rank condition and its applications to semiconductor yield analysis," in 2017 International Conference on Decision Support System Technology, Namur, Belgium, 2017, pp. 153-159. [15] T. Hastie, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. (Springer Series in Statistics). New York, NY: Springer-Verlag New York, 2009. [16] T. R. Golub et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, no. 5439, p. 531, 1999. [17] W. A. Gibson, "Orthogonal Predictors: A Possible Resolution of the Hoffman-Ward Controversy," Psychological Reports, vol. 11, no. 1, pp. 32-34, 1962. [18] Y. Oshima et al., "DNA microarray analysis of hematopoietic stem cell-like fractions from individuals with the M2 subtype of acute myeloid leukemia," Leukemia, vol. 17, no. 10, pp. 1990-1997, 2003.	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77325	-
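dc.description.abstract	Variable selection is an enduring topic in data analysis, and collinearity among the independent variables is the main concern when selecting variables in linear models. As early as 1966, R. M. Johnson addressed collinearity by proposing the "best-approximating orthogonal transformation," which transforms the original variables into orthogonal variables. Green (1978) [12] and J. W. Johnson (2000) [9] subsequently extended R. M. Johnson's orthogonal-variable concept to other methods of computing variable importance; J. W. Johnson's method is known as Relative Weight. Because all of these methods inherit the requirements of the transformation, namely that the independent variables have no perfect collinearity (non-singular) and that the number of observations exceeds the number of variables (n > p), Zixin (2017) [14] built on relative weight to propose "Relative Importance," which remains applicable when the variables are collinear or when n ≤ p. This study extends that importance measure to variable clustering. Clustering is an unsupervised learning method for data that have no reference labels; it groups objects by similarity according to the attributes that describe them. Hierarchical clustering is a common choice: its algorithm is intuitive, it needs only a distance measure and a linkage rule to produce a result, and the result is presented as a dendrogram in which the similarity between objects is easy to see. This thesis carries those advantages over to variable clustering. Previous work that combines variable importance with variable clustering aims to select important factors with low homogeneity from each cluster, so that the selected factors predict the outcome more comprehensively. These methods all proceed in two stages: the factors are first clustered by their mutual similarity, and only then is the relationship between each cluster and the outcome analyzed in order to pick out important factors that cover different aspects. Because this procedure uses the same data set twice, this study seeks to merge the two stages into one and thereby reduce the repeated use of resources. The goal is to group factors both by their degree of influence on the outcome and by their similarity to one another, so the relationship between the factors and the outcome and the relationships among the factors are both central to the clustering. Based on the geometric meaning of relative importance, the measure is decomposed into relative importance components, which serve as the basis for computing distances in hierarchical clustering. The components are mutually independent and do not overlap in meaning, and relative importance both carries the supervised notion of regression analysis and accounts for collinearity among the independent variables; this makes it well suited to clustering factors by their influence on the outcome and by their homogeneity in a single pass. The method is validated on simulated cases and on the gene expression and leukemia data provided by Oshima et al. (2003) [18], and the results are compared with those of unsupervised hierarchical clustering and Bayesian networks.	zh_TW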
dc.description.abstract	Variable selection is a long-standing issue in data analysis. When selecting variables for linear models, collinearity among the independent variables is the main consideration. As early as 1966, R. M. Johnson proposed a method that transforms the original variables into orthogonal variables in order to address multicollinearity in linear regression. Green (1978) [12] and J. W. Johnson (2000) [9] subsequently proposed other methods of calculating relative importance based on R. M. Johnson's idea; the method proposed by J. W. Johnson is called "Relative Weight". However, all of these methods are limited by the R. M. Johnson (1966) transformation, which requires the independent variables to be non-singular and the number of samples to exceed the number of variables (n > p). Zixin (2017) [14] therefore proposed a comprehensive relative importance method to overcome the difficulty that relative weights face under the low-rank condition. This study extends relative weights, and the comprehensive method proposed by Zixin, to variable grouping. Clustering is an unsupervised method that is applied to unlabeled data and groups objects according to their similarity across different attributes. One common clustering method is hierarchical clustering, whose advantages are that its algorithm is intuitive and that its result is presented as a tree diagram, which makes the method easy to understand and the results easy to interpret. This research takes advantage of hierarchical clustering for variable grouping. Previous studies that aim to select important factors covering different aspects combine variable importance with variable grouping in order to obtain a more comprehensive variable selection. Those methods proceed in two stages: first, the variables are grouped according to their similarity; second, the relationship between the outcome and each group of variables is analyzed. However, the two-stage approach uses the same data set twice, so this study aims to merge it into a single-pass analysis and avoid reusing resources. The purpose of this study is to group the causes by their importance in explaining the outcome and by their similarity to one another. The relationship between the causes and the outcome and the relationships among the causes themselves are therefore the two focuses of the grouping method. Based on the geometric meaning of relative weights, this study decomposes relative weight into multiple components whose values are additive. In addition, the meanings of these components are independent of one another, so they are a good basis for calculating the distance between variables. Simulated cases and a real gene expression data set provided by Oshima et al. (2003) [18] are used to validate the method, and the results are compared with those of unsupervised hierarchical clustering and Bayesian networks.	en
dc.description.provenance	Made available in DSpace on 2021-07-10T21:56:29Z (GMT). No. of bitstreams: 1 ntu-108-R06546005-1.pdf: 5108153 bytes, checksum: 628cbfeaaadbd046b85a4c0e6421a8b0 (MD5) Previous issue date: 2019	en
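dc.description.tableofcontents	Acknowledgements i; Abstract (Chinese) ii; ABSTRACT iii; Contents v; List of Figures viii; List of Tables xiii; Chapter 1 Introduction 1; 1.1 Research Background 1; 1.2 Research Motivation and Objectives 2; 1.3 Thesis Organization 4; Chapter 2 Literature Review 5; 2.1 Supervised and Unsupervised Learning 5; 2.1.1 Hierarchical Clustering 6; 2.1.2 Regression Analysis 9; 2.1.3 Bayesian Network 12; 2.2 Relative Importance of Variables 13; 2.2.1 Best-Approximating Orthogonal Transformation (Johnson's Transformation) 14; 2.2.2 Relative Weight 16; 2.2.3 Relative Importance of Variables for Matrices without Full Column Rank (Relative Importance) 19; 2.3 Gene Expression and Disease Research 24; 2.3.1 Relationship between Leukemia Types and Gene Expression 24; 2.3.2 Methods for Analyzing Gene Expression Data 25; Chapter 3 Relative Importance based Hierarchical Clustering 26; 3.1 Relative Importance Components and Their Geometric Meaning 26; 3.1.1 Relative Importance Components and Their Geometric Meaning in the General Case 28; 3.1.2 Relative Importance Components and Their Geometric Meaning in the Non-General Case 37; 3.2 Hierarchical Variable Clustering that Considers the Relationship between Independent and Dependent Variables 44; 3.2.1 Hierarchical Variable Clustering via Relative Importance Components 44; 3.2.2 Interpretation of the Clustering Results and Relative Importance of Variable Groups 48; 3.3 Simplification of the Relative Importance Components for n ≪ p Cases 49; Chapter 4 Application and Analysis of Results 57; 4.1 Simulated Cases 57; 4.1.1 n>p 59; 4.1.2 n≤p 64; 4.2 Real Case: Clustering of Genes Related to Leukemia 71; Chapter 5 Conclusions and Future Research 78; References 80	-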
dc.language.iso	zh_TW	-
dc.subject	階層式分群	zh_TW
dc.subject	相對重要性	zh_TW
dc.subject	變數選擇	zh_TW
dc.subject	基因表現	zh_TW
dc.subject	相對權重	zh_TW
dc.subject	Gene Expression	en
dc.subject	Variable Selection	en
dc.subject	Relative Importance	en
dc.subject	Relative Weight	en
dc.subject	Hierarchical Clustering	en
dc.title	變數相對重要性之階層分群方法及其於基因表現資料分析之應用	zh_TW
dc.title	Relative Importance based Hierarchical Clustering and Its Application to Gene Expression Analysis	en
dc.type	Thesis	-
dc.date.schoolyear	107-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	藍俊宏;陳炯年;何明志	zh_TW
dc.contributor.oralexamcommittee	;;	en
dc.subject.keyword	變數選擇,相對重要性,相對權重,階層式分群,基因表現,	zh_TW
dc.subject.keyword	Variable Selection,Relative Importance,Relative Weight,Hierarchical Clustering,Gene Expression	en
dc.relation.page	81	-
dc.identifier.doi	10.6342/NTU201902386	-
dc.rights.note	Not authorized	-
dc.date.accepted	2019-08-05	-
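dc.contributor.author-college	College of Engineering	-
dc.contributor.author-dept	Graduate Institute of Industrial Engineering	-
Appears in Collections:	Graduate Institute of Industrial Engineering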
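
The relative-weight idea summarized in the abstract can be illustrated with a minimal sketch. It computes J. W. Johnson's (2000) relative weights in their standard form for the n > p, non-singular case, keeps the additive terms that make up each weight, and feeds them to an ordinary hierarchical clustering. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function name `relative_weight_components`, the use of the per-variable terms as clustering features, the Euclidean distance, the average linkage, and the simulated data. The thesis's own relative importance components and its low-rank extension may be defined differently.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def relative_weight_components(X, y):
    """Return the p x p matrix C with C[j, k] = lambda_jk**2 * beta_k**2.

    Row sums of C are the relative weights of J. W. Johnson (2000) and
    add up to the R^2 of regressing y on X. Hypothetical helper; assumes
    n > p and a non-singular predictor correlation matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    # Center and scale to unit length so that Z.T @ Z is the correlation matrix.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) * np.sqrt(n))
    yz = (y - y.mean()) / (y.std() * np.sqrt(n))
    R = Z.T @ Z
    evals, V = np.linalg.eigh(R)
    # Lam = R^(1/2): correlations between the predictors and their closest
    # orthonormal counterparts (R. M. Johnson, 1966).
    Lam = V @ np.diag(np.sqrt(evals)) @ V.T
    # beta: coefficients of y regressed on the orthonormal variables.
    beta = np.linalg.solve(Lam, Z.T @ yz)
    return (Lam ** 2) * (beta ** 2)[np.newaxis, :]


# Simulated data (illustrative only): x1 and x2 are nearly collinear,
# x3 is an independent driver of y, and x4 is unrelated to y.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])
y = x1 + x2 + 2.0 * x3 + 0.5 * rng.normal(size=n)

C = relative_weight_components(X, y)
weights = C.sum(axis=1)  # one relative weight per predictor

# Assumed clustering setup: each row of C describes one variable.
tree = linkage(C, method="average", metric="euclidean")
labels = fcluster(tree, t=2, criterion="maxclust")

print("relative weights:", np.round(weights, 3))
print("cluster labels:", labels)
```

Because each row of `C` depends on both the predictor correlations and the regression on the outcome, distances between rows reflect similarity among variables and their relation to the outcome at once, which is the single-pass idea the abstract describes. Passing `tree` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding tree diagram.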

Files in This Item:

File	Size	Format
ntu-107-2.pdf  Not authorized for public access	4.99 MB	Adobe PDF

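Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.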