Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49419
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 蕭朱杏
dc.contributor.author | Pei-Zhen Wu | en
dc.contributor.author | 吳佩真 | zh_TW
dc.date.accessioned | 2021-06-15T11:27:49Z
dc.date.available | 2021-08-26
dc.date.copyright | 2016-08-26
dc.date.issued | 2016
dc.date.submitted | 2016-08-17
dc.identifier.citation | Alamuri, M., Surampudi, B. R., & Negi, A. (2014). A survey of distance/similarity measures for categorical data. International Joint Conference on Neural Networks, 1907–1914. IEEE.
Ayeldeen, H., Mahmood, M. A., & Hassanien, A. E. (2015). Effective Classification and Categorization for Categorical Sets: Distance Similarity Measures. Information Systems Design and Intelligent Applications, 359–368. Springer India.
Buttrey, S. E. (1998). Nearest-neighbor classification with categorical variables. Computational Statistics & Data Analysis 28, 157–169.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27.
Chen, L., Ye, Y., Guo, G., & Zhu, J. (2015). Kernel-based linear classification on categorical data. Soft Computing 20, 1–13.
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Unsupervised learning. In The elements of statistical learning, 485–585. Springer New York.
Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence In Medicine 23, 149–169.
Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. European Conference on Machine learning, 4–15. Springer Berlin Heidelberg.
Schmidt, M., Fung, G., & Rosales, R. (2007). Fast optimization methods for l1 regularization: A comparative study and two new approaches. European Conference on Machine Learning, 286–297. Springer Berlin Heidelberg.
Wang, C., Kao, W.-H., & Hsiao, C. K. (2015). Using Hamming distance as information for SNP-sets clustering and testing in disease association studies. PloS One 10, e0135918.
Wessel, J., & Schork, N. J. (2006). Generalized genomic distance–based regression methodology for multilocus association analysis. The American Journal of Human Genetics 79, 792–806.
Zhang, J., Chen, L., & Guo, G. (2013). Projected-prototype based classifier for text categorization. Knowledge-Based Systems 49, 179–189.
Zhou, N., & Wang, L. (2007). Perfect population classification on Hapmap data with a small number of SNPs. International Conference on Neural Information Processing.
Zuk, O., Hechter, E., Sunyaev, S. R., & Lander, E. S. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences 109, 1193–1198.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49419
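dc.description.abstract | Clustering and classification have long been important topics in data analysis across many scientific fields. Existing methods, however, focus on numeric observations, and only a few are designed for categorical data. In this thesis, we propose an analysis procedure that first clusters categorical variables on the basis of a distance measure and then uses the resulting clusters to classify a binary outcome, such as disease status. We give particular attention to applications to single nucleotide polymorphism (SNP) genotype data, clustering SNPs located at different chromosomal positions for analysis. First, we use the Hamming distance as a measure of dissimilarity between sets of categorical variables and construct clusters, that is, SNP-sets. Next, for each individual we compute the difference between the group dissimilarity with the case group and that with the control group, and use it as the covariate for each cluster. We then compute the prediction accuracy of each cluster and build a multi-cluster logistic regression model to classify the binary phenotype. We validate the procedure with simulations and two real data sets. The first application is the SPECT Heart data from the UCI Machine Learning Repository, where the prediction accuracy of our method is as good as that of lasso and logistic regression, and all three outperform random forest. The other data set consists of SNP genotype data from HapMap ENCODE, used to classify individuals' ethnicity. Our method, lasso, and random forest all classify ethnicity with 100% accuracy. In this example we used only one SNP-set, although the clustering also produced several sets with prediction accuracy above 90%. In the simulations, our method achieved only acceptable results, and under some simulation settings it performed worse than the other methods. The simulated data were generated from the effects of individual SNPs rather than of SNP-sets, so such data may be better suited to methods that model individual effects. | zh_TW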
dc.description.abstract | Clustering and classification have been important issues in data analysis in many scientific fields. Current methodology, however, focuses mostly on numeric observations, and few methods are designed to analyze categorical data. In this paper, we propose a procedure that first clusters categorical data based on a similarity measure and then uses the clusters to classify individuals with binary outcomes, such as disease status. Specifically, we consider single nucleotide polymorphism (SNP) genotype data as an application, in the case where SNPs from different chromosome regions are to be clustered. In the first step, we use the Hamming distance as a measure of dissimilarity between two sets of categorical attributes and construct SNP-sets as clusters. Next, we calculate for each individual the difference in group similarity between the case and control groups as a covariate for each corresponding cluster, determine the accuracy of this cluster, and develop a multi-cluster logistic regression model for classification of binary phenotypes. To illustrate our procedure, we applied it to two real data sets. The first application is the UCI Machine Learning Repository SPECT Heart data set, where our procedure was as accurate as lasso and logistic regression, and more accurate than random forest. In the second application, the HapMap ENCODE data, our procedure, lasso, and random forest all predicted the ethnicity group with 100% accuracy. Only one SNP-cluster was used in our result, although the clustering also produced other SNP-clusters with accuracy greater than 90%. We also conducted simulation studies to assess the performance of our method and to compare it with other methods. Our method performed only satisfactorily, and in several simulation settings it did not perform as well as other methods. In the simulations, data were generated based on single-SNP effects rather than cluster effects, which may favor methods that model single-marker effects. | en
dc.description.provenance | Made available in DSpace on 2021-06-15T11:27:49Z (GMT). No. of bitstreams: 1
ntu-105-R03849034-1.pdf: 1109088 bytes, checksum: 2af227459a765155748db15f3cb8a85d (MD5)
Previous issue date: 2016 | en
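dc.description.tableofcontents | Chapter 1 Background
Chapter 2 Methods
Section 1 Clustering categorical variables with a dissimilarity measure
(1) Distance (dissimilarity) between variables
(2) Hierarchical clustering
(3) Determining the number of sets
Section 2 Building the prediction model
(1) Constructing the Hamming distance covariate
(2) Logistic regression model
Chapter 3 Simulation
Section 1 Data generation
Section 2 Validation and comparison
Section 3 Results
Chapter 4 Applications to real data
Section 1 Single-photon emission computed tomography (SPECT) case study
(1) Data background
(2) Analysis results
Section 2 Single nucleotide polymorphism case study (CEU & YRI)
(1) Data background
(2) Data processing
(3) Analysis results
Chapter 5 Conclusions and discussion
References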
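dc.language.iso | zh-TW
dc.subject | similarity | zh_TW
dc.subject | classification analysis | zh_TW
dc.subject | categorical data | zh_TW
dc.subject | Hamming distance | zh_TW
dc.subject | SNP-set | zh_TW
dc.subject | Hamming distance | en
dc.subject | similarity | en
dc.subject | SNP-set | en
dc.subject | classification | en
dc.subject | categorical data | en
dc.title | A new classification method based on similarity clustering: application to prediction for large-scale categorical data | zh_TW
dc.title | A novel method for phenotype classification based on similarity clusters for large-scale categorical data | en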
dc.type | Thesis
dc.date.schoolyear | 104-2
dc.description.degree | Master's
dc.contributor.oralexamcommittee | 盧子彬, 李美賢
dc.subject.keyword | classification analysis, categorical data, Hamming distance, SNP-set, similarity | zh_TW
dc.subject.keyword | classification, categorical data, Hamming distance, SNP-set, similarity | en
dc.relation.page | 46
dc.identifier.doi | 10.6342/NTU201603076
dc.rights.note | Paid license
dc.date.accepted | 2016-08-18
dc.contributor.author-college | College of Public Health | zh_TW
dc.contributor.author-dept | Institute of Epidemiology and Preventive Medicine | zh_TW
Appears in collections: Institute of Epidemiology and Preventive Medicine

Files in this item:
File | Size | Format
ntu-105-1.pdf (not authorized for public access) | 1.08 MB | Adobe PDF
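
The first steps of the procedure summarized in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation, and the record does not give the exact definitions, so several choices here are assumptions: the function names (`hamming`, `cluster_variables`, `group_distance_covariate`), the use of average linkage for the hierarchical clustering, the leave-self-out averaging, the sign convention of the covariate (distance to controls minus distance to cases), and the toy genotype matrix coded 0/1/2. The multi-cluster logistic regression step is not shown; a simple sign rule stands in for the per-cluster accuracy check.

```python
from itertools import combinations


def hamming(a, b):
    """Proportion of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_variables(data, n_clusters):
    """Group the columns of `data` by average-linkage agglomerative clustering."""
    n_vars = len(data[0])
    columns = [[row[j] for row in data] for j in range(n_vars)]
    # Pairwise Hamming distance between variables, compared across individuals.
    dist = {(i, j): hamming(columns[i], columns[j])
            for i, j in combinations(range(n_vars), 2)}
    clusters = [[j] for j in range(n_vars)]

    def linkage(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


def group_distance_covariate(data, labels, cluster, i):
    """Mean distance from individual i to controls minus mean distance to cases."""
    def profile(r):
        return [data[r][j] for j in cluster]

    def mean_dist(group):
        # Assumption: individual i is left out of its own group.
        others = [r for r in range(len(data)) if labels[r] == group and r != i]
        if not others:
            raise ValueError("each group needs at least one other individual")
        return sum(hamming(profile(i), profile(r)) for r in others) / len(others)

    return mean_dist(0) - mean_dist(1)


# Hypothetical toy data: rows are individuals, columns are SNP genotypes (0/1/2).
genotypes = [
    [2, 2, 0, 0],
    [2, 2, 0, 1],
    [2, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 2],
    [0, 0, 1, 1],
]
status = [1, 1, 1, 0, 0, 0]  # 1 = case, 0 = control

snp_sets = cluster_variables(genotypes, n_clusters=2)
covariates = [group_distance_covariate(genotypes, status, snp_sets[0], i)
              for i in range(len(genotypes))]
# Stand-in for the per-cluster accuracy: predict "case" when the covariate is positive.
predicted = [1 if c > 0 else 0 for c in covariates]
accuracy = sum(p == s for p, s in zip(predicted, status)) / len(status)
```

In a fuller pipeline, one covariate per SNP-set would be computed for every individual and passed as predictors to a standard logistic regression routine. For realistic data sizes, a library implementation of hierarchical clustering would be preferable to this pure-Python loop.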