A novel method for phenotype classification based on similarity clusters for large-scale categorical data

Pei-Zhen Wu; 吳佩真

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49419

Title:	A novel method for phenotype classification based on similarity clusters for large-scale categorical data (以相似度集群的新分類方法-應用於大型類別資料之預測)
Author:	Pei-Zhen Wu (吳佩真)
Advisor:	蕭朱杏
Keywords:	classification, categorical data, Hamming distance, SNP-set, similarity
Year of publication:	2016
Degree:	Master's
Abstract:	Clustering and classification are important problems in data analysis across many scientific fields. Current methodology, however, focuses mostly on numeric observations; few methods are designed to analyze categorical data. In this thesis, we propose a procedure that first clusters categorical variables based on a distance measure and then uses the clusters to classify individuals with binary outcomes, such as disease status. Specifically, we consider applications to single nucleotide polymorphism (SNP) genotype data, where SNPs from different chromosome regions are to be clustered. In the first step, we use the Hamming distance as a measure of dissimilarity between two sets of categorical attributes and construct SNP-sets as clusters. Next, for each individual we calculate the difference in group similarity between the case and control groups as the covariate for the corresponding cluster, determine the prediction accuracy of each cluster, and build a multi-cluster logistic regression model to classify binary phenotypes. We evaluated the procedure with simulations and two real data sets. The first application is the SPECT Heart data set from the UCI Machine Learning Repository, where our procedure was as accurate as lasso and logistic regression, and more accurate than random forest. The second is SNP genotype data from HapMap ENCODE, used to classify the ethnicity of individuals. There, our procedure, lasso, and random forest all predicted the ethnicity group with 100% accuracy. Our result used only one SNP cluster, although the clustering also produced several other SNP clusters with accuracy above 90%. In the simulation studies, which compared our method with the other methods, it achieved only acceptable results, and in several simulation settings it did not perform as well as the other methods. The simulated data were generated from single-SNP effects rather than SNP-set effects, which may favor methods that model single-marker effects.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49419
DOI:	10.6342/NTU201603076
Full-text license:	Paid license
Appears in collections:	Institute of Epidemiology and Preventive Medicine
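
The covariate step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the function names, the toy genotypes, the choice of mean Hamming distance as the group dissimilarity, the sign convention (dissimilarity to controls minus dissimilarity to cases), and the sign-based decision rule are all assumptions made for the example.

```python
# Illustrative sketch only; names and toy data are hypothetical, not from the thesis.

def hamming(a, b):
    """Count the positions at which two equal-length categorical vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def group_dissimilarity(person, group):
    """Mean Hamming distance from one individual to each member of a group."""
    return sum(hamming(person, other) for other in group) / len(group)


def cluster_covariate(person, cases, controls):
    """Dissimilarity to controls minus dissimilarity to cases (assumed sign convention).

    A positive value means the individual is closer to the case group.
    """
    return group_dissimilarity(person, controls) - group_dissimilarity(person, cases)


def predict(person, cases, controls):
    """Assumed single-cluster rule: label by the sign of the covariate."""
    return "case" if cluster_covariate(person, cases, controls) > 0 else "control"


# Toy genotypes (0/1/2 minor-allele counts) on one hypothetical 4-SNP set.
cases = [(2, 2, 1, 0), (2, 1, 1, 0), (2, 2, 1, 1)]
controls = [(0, 0, 1, 2), (0, 1, 1, 2), (0, 0, 0, 2)]
new_person = (2, 2, 0, 0)

covariate = cluster_covariate(new_person, cases, controls)
label = predict(new_person, cases, controls)
print(covariate, label)
```

In the multi-cluster setting the abstract describes, one such covariate per SNP-set would be passed to a logistic regression model; that step, and the clustering of SNPs into sets, are omitted here.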

Files in this item:

File	Size	Format
ntu-105-1.pdf (not authorized for public access)	1.08 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms state otherwise.
