Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55341
Full metadata record
DC field: value (language)
dc.contributor.advisor: 蔡政安
dc.contributor.author: Jhih-Wun Zeng (en)
dc.contributor.author: 曾志文 (zh_TW)
dc.date.accessioned: 2021-06-16T03:57:29Z
dc.date.available: 2018-02-04
dc.date.copyright: 2015-02-04
dc.date.issued: 2014
dc.date.submitted: 2014-12-03
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55341
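dc.description.abstract: In recent years, high-throughput sequencing has developed rapidly and genomic data of all kinds have grown quickly. Detecting single-nucleotide polymorphisms (SNPs) in particular has become much simpler than with earlier methods, for example with the Affymetrix GeneChip, the Illumina BeadChip, or NGS-based SNP calling. A SNP is a single-point variation in the genome. Compared with earlier markers such as simple sequence repeat (SSR) markers, SNP data have two main advantages: they are easy to digitize and they are available in very large numbers. SNPs have therefore become central to association studies and genome-wide association studies (GWAS). SNP data are not flawless, however, because missing data frequently arise during SNP detection. For human SNPs, for example, general-purpose high-density SNP microarrays have been reported to have missing rates and error rates of roughly 0.05% to 1%, and that figure comes from the thoroughly studied human genome; the situation is worse for non-model species or species with less research funding. In the data analyzed in this thesis, the missing rate after SNP calling was 14.5%.
Incomplete marker data affect the results of association analysis. In GWAS in particular, the more complete the SNP data, the more clearly associated regions stand out; with missing values, the results become too sparse and, more seriously, important associated regions may be missed. Association analyses usually begin by filtering out markers (biomarkers) or samples with too many missing values, but this filtering can strongly affect the test results. This motivates imputation, which aims to make the data as complete as possible.
This thesis proposes an imputation method, LDKNN (linkage disequilibrium-based K-nearest neighbor), a new method built on the KNN (K-nearest neighbor) algorithm that incorporates linkage disequilibrium information. It is applied to real rice data with SNPs detected by genotyping by sequencing (GBS), covering natural germplasm and recombinant inbred lines, to examine whether results differ before and after imputation. LDKNN is also compared with KNN, SVM, Beagle 4, and other methods in simulation experiments to assess the strengths and weaknesses of the proposed method relative to the others. (zh_TW)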
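The filtering step mentioned in the abstract above (dropping markers with too many missing calls before association analysis) can be illustrated with a minimal Python sketch. It is not taken from the thesis: the 0/1/2 genotype coding, the use of `None` for a missing call, and the names `missing_rate`, `marker_missing_rates`, `filter_markers`, and `max_missing` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed coding: 0, 1, 2 = copies of the alternate allele; None = missing call.
Genotype = Optional[int]
Matrix = List[List[Genotype]]  # rows = samples, columns = markers


def missing_rate(matrix: Matrix) -> float:
    """Fraction of all genotype calls that are missing."""
    total = sum(len(row) for row in matrix)
    missing = sum(g is None for row in matrix for g in row)
    return missing / total if total else 0.0


def marker_missing_rates(matrix: Matrix) -> List[float]:
    """Missing rate of each marker (column) across samples."""
    n_samples = len(matrix)
    n_markers = len(matrix[0]) if matrix else 0
    return [
        sum(matrix[i][j] is None for i in range(n_samples)) / n_samples
        for j in range(n_markers)
    ]


def filter_markers(matrix: Matrix, max_missing: float) -> Tuple[Matrix, List[int]]:
    """Keep only markers whose missing rate is at most max_missing.

    Returns the reduced matrix and the indices of the markers that were kept.
    """
    rates = marker_missing_rates(matrix)
    keep = [j for j, rate in enumerate(rates) if rate <= max_missing]
    return [[row[j] for j in keep] for row in matrix], keep
```

Filtering of this kind discards every observed call at a dropped marker, which is the loss of information that imputation is meant to avoid.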
dc.description.abstract: Detecting single-nucleotide polymorphisms (SNPs) with high-throughput sequencing technologies has become an efficient and robust strategy for SNP discovery and genome-wide association studies. However, conventional high-throughput genotyping techniques often produce a certain proportion of missing calls. It has long been recognized that failing to account for these missing data can dramatically reduce the power to detect SNPs. A variety of methods have been developed to impute missing genotypes. Methods based on K-nearest neighbors (KNN) and weighted K-nearest neighbors (wtKNN) have received some attention because they consider similarities in haplotype structure. More recently, a number of powerful methods based on hidden Markov models (HMMs) have become popular for SNP imputation. However, these methods are time-consuming or mostly suited to imputing small marker sets, and they cannot exploit the indirect association structure of tightly linked SNPs. In this study, we propose a novel but computationally simple imputation method based on weighted K-nearest neighbors (wtKNN) that takes linkage disequilibrium (LD) into account. We demonstrate the performance of our method for imputing missing SNPs using both genotyping-by-sequencing (GBS) data and simulation studies. In addition, we compare the accuracy and performance of our method with those of competing imputation methods. (en)
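The general idea of a KNN imputation weighted by linkage disequilibrium can be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis's LDKNN algorithm, whose details are not given in this record. Assumed here: genotypes are coded 0/1/2 with `None` for missing; LD is measured as the squared Pearson correlation (r²) of genotype codes; neighbors are scored over a window of flanking markers with hypothetical `match` and `mismatch` scores; and the `k` most similar samples cast similarity-weighted votes. All function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]       # 0, 1, 2 = allele count; None = missing call
Matrix = List[List[Genotype]]  # rows = samples, columns = markers


def ld_r2(matrix: Matrix, a: int, b: int) -> float:
    """Squared Pearson correlation between markers a and b,
    using only samples where both genotypes are observed."""
    pairs = [(r[a], r[b]) for r in matrix if r[a] is not None and r[b] is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic marker carries no LD information
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    return cov * cov / (var_a * var_b)


def impute_one(matrix: Matrix, sample: int, marker: int, window: int = 2,
               k: int = 3, match: float = 1.0, mismatch: float = 0.0) -> Genotype:
    """Impute one missing genotype from the k most similar samples."""
    n_markers = len(matrix[sample])
    lo, hi = max(0, marker - window), min(n_markers, marker + window + 1)
    flank = [m for m in range(lo, hi) if m != marker]
    # Weight each flanking marker by its LD with the target marker.
    weights: Dict[int, float] = {m: ld_r2(matrix, marker, m) for m in flank}
    target = matrix[sample]
    scored = []
    for s, row in enumerate(matrix):
        if s == sample or row[marker] is None:
            continue  # a neighbor must have an observed genotype at the target
        num = den = 0.0
        for m in flank:
            if target[m] is None or row[m] is None:
                continue
            num += weights[m] * (match if target[m] == row[m] else mismatch)
            den += weights[m]
        if den > 0:
            scored.append((num / den, row[marker]))
    if not scored:
        return None  # no usable neighbor; leave the call missing
    scored.sort(key=lambda t: t[0], reverse=True)
    votes: Dict[int, float] = {}
    for sim, genotype in scored[:k]:
        votes[genotype] = votes.get(genotype, 0.0) + sim
    return max(votes, key=votes.get)


def impute(matrix: Matrix, **kwargs) -> Matrix:
    """Return a copy of the matrix with missing calls imputed where possible."""
    return [[impute_one(matrix, i, j, **kwargs) if g is None else g
             for j, g in enumerate(row)]
            for i, row in enumerate(matrix)]
```

In this sketch every imputation is computed from the original matrix, so the order in which missing calls are filled does not affect the result. A flanking marker with low r² contributes little to the similarity, which is one simple way tightly linked SNPs can dominate the choice of neighbors.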
dc.description.provenance: Made available in DSpace on 2021-06-16T03:57:29Z (GMT). No. of bitstreams: 1
ntu-103-R01621206-1.pdf: 7450654 bytes, checksum: 5de6c59489d2979756fff018e80e42da (MD5)
Previous issue date: 2014 (en)
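dc.description.tableofcontents:
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Materials and Methods
2.1 Data sources
2.2 Data preparation
2.3 Imputation methods
2.3.1 K-nearest neighbor (KNN)
2.3.2 Linkage disequilibrium-based K-nearest neighbor (LDKNN)
2.3.3 KNNcatImpute
2.3.4 Support Vector Machine (SVM)
2.3.5 Probabilistic neural networks (PNN)
2.3.6 Beagle 4
2.4 Experimental procedure
Chapter 3 Results and Discussion
Step 1: Find the best combination of match score and mismatch score
Step 2: Find the best combination of W and K
Step 3: Compare the accuracy of LDKNN with other methods
Step 4: Compare LDKNN with other methods on MSSP
Results of the real data analysis
Chapter 4 Conclusions and Future Work
Figures
Tables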
dc.language.iso: zh-TW
dc.title: 利用連鎖失衡加權K最近鄰法於基因型資料填補之研究 (zh_TW)
dc.title: Genotype imputation using LD-based Weighted K Nearest Neighbor (en)
dc.type: Thesis
dc.date.schoolyear: 103-1
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 陳凱儀, 董致韡, 劉力瑜
dc.subject.keyword: imputation, genome-wide association study, linkage disequilibrium, missing values, K-nearest neighbor, single-nucleotide polymorphism (zh_TW)
dc.subject.keyword: imputation, genome-wide association study, linkage disequilibrium, missing, K-nearest neighbor, single nucleotide polymorphism (en)
dc.relation.page: 84
dc.rights.note: Paid authorization
dc.date.accepted: 2014-12-03
dc.contributor.author-college: College of Bioresources and Agriculture (zh_TW)
dc.contributor.author-dept: Graduate Institute of Agronomy (zh_TW)
Appears in collections: Department of Agronomy

Files in this item:
File: ntu-103-1.pdf (not currently authorized for public access)
Size: 7.28 MB
Format: Adobe PDF