Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21614
Title: | 多重對應分析應用於核心種原篩選之研究 A Study on the Selection of Core Collection based on Multiple Correspondence Analysis |
Authors: | Nien-Lun Wu 吳念倫 |
Advisor: | 蔡政安(Chen-An Tsai) |
Keyword: | core collection, multiple correspondence analysis, cluster analysis, genotype data, big data |
Publication Year: | 2019 |
Degree: | Master's |
Abstract: | Since Frankel proposed the concept of the core collection in 1984, researchers have proposed many methods for selecting one, that is, for capturing as much genetic diversity as possible with a small number of accessions. With the rapid development of next-generation sequencing technology, genotype datasets have grown so large that some previously workable methods have become difficult to run. The goal of this thesis is to propose an algorithm that selects a core set of lines from large genotype data, maximizing genetic diversity for a user-defined number of lines. We use the multidimensional coordinates computed by multiple correspondence analysis (MCA) for cluster analysis, and draw on the selection procedure of GenoCore (Jeong et al., 2017), which has the same goal, to form a new selection method for core collections.
We evaluate the proposed method with four criteria of core collection quality: coverage, Shannon's diversity index, mean modified Rogers value, and minimum pairwise modified Rogers value. Computing time is also included as an evaluation criterion. We compare our approach with two commonly used methods, GenoCore and Core Hunter 3, on four SNP datasets of different sizes: rice with 1.5K SNPs, wheat with 14K SNPs, rice with 37K SNPs, and wheat with 820K SNPs. From the simulation results in this thesis, the proposed method generally performs well on the evaluation criteria and can be applied to large genotype data. Compared with GenoCore, our method improves the other quality criteria while maintaining 99% coverage; compared with Core Hunter 3, it reaches high coverage more efficiently. It also completes the analysis of all four datasets within a reasonable time. Future improvements include more precise cluster analysis, since the results suggest that core collection quality may depend on clustering accuracy. In addition, given the strengths of Core Hunter 3 in the results, the selection process could add two criteria, the distance between each entry and its nearest entry and the distance between each accession and its nearest entry, to further increase the distances among entries and the genetic diversity of the core collection. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21614 |
DOI: | 10.6342/NTU201901294 |
Fulltext Rights: | Not authorized |
Appears in Collections: | Department of Agronomy |
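The four quality criteria named in the abstract can be illustrated with a small sketch. This record does not give the thesis's exact formulas, so everything below is an assumption: genotypes are taken to be biallelic SNPs coded 0/1/2 (copies of the alternate allele), coverage is the share of alleles in the full collection that also appear in the core, Shannon's index is averaged over loci, and the "modified Rogers value" is computed as the commonly used modified Rogers distance. The function names are hypothetical and are not taken from GenoCore, Core Hunter 3, or the thesis.

```python
import math
from itertools import combinations


def modified_rogers(x, y):
    """Modified Rogers distance between two accessions (0/1/2 genotype codes)."""
    total = 0.0
    for gx, gy in zip(x, y):
        px, py = gx / 2.0, gy / 2.0  # alternate-allele frequency in each accession
        # Squared frequency differences summed over both alleles (alt and ref).
        total += (px - py) ** 2 + ((1.0 - px) - (1.0 - py)) ** 2
    return math.sqrt(total / (2.0 * len(x)))


def alleles_present(accessions):
    """Set of (locus, allele) pairs observed in at least one accession."""
    present = set()
    for acc in accessions:
        for locus, g in enumerate(acc):
            if g < 2:
                present.add((locus, "ref"))
            if g > 0:
                present.add((locus, "alt"))
    return present


def coverage(core, full):
    """Fraction of the full collection's alleles that the core retains (assumed definition)."""
    full_alleles = alleles_present(full)
    return len(alleles_present(core) & full_alleles) / len(full_alleles)


def shannon_index(accessions):
    """Shannon diversity per locus, averaged over loci (assumed normalisation)."""
    n_loci = len(accessions[0])
    total = 0.0
    for locus in range(n_loci):
        p_alt = sum(acc[locus] for acc in accessions) / (2.0 * len(accessions))
        for p in (p_alt, 1.0 - p_alt):
            if p > 0:
                total -= p * math.log(p)
    return total / n_loci


def pairwise_rogers(core):
    """Mean and minimum modified Rogers distance over all pairs of core entries."""
    dists = [modified_rogers(a, b) for a, b in combinations(core, 2)]
    return sum(dists) / len(dists), min(dists)


# Toy data, invented purely for illustration: 4 accessions x 4 SNPs.
full = [
    [0, 0, 2, 1],
    [2, 0, 0, 1],
    [0, 2, 2, 0],
    [0, 0, 2, 1],
]
core = [full[0], full[1]]

print("coverage:", coverage(core, full))
print("Shannon:", shannon_index(core))
print("mean / min modified Rogers:", pairwise_rogers(core))
```

Under these assumptions, a selection method scores well when the core keeps coverage near 1 while the mean and minimum pairwise distances stay large. Other normalisations of Shannon's index exist, so values from this sketch are not comparable to those reported in the thesis.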
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf Restricted Access | 4.39 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.