Constructing Taiwanese Reference Pangenome (TW-graph) to Improve Read Mapping Rates and KIR Typing

邱顯鈞; Hsien-Chun Chiu

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94397

Title:	Constructing Taiwanese Reference Pangenome (TW-graph) to Improve Read Mapping Rates and KIR Typing (建立臺灣人泛參考基因組提升短序列回貼及KIR分型正確性)
Author:	Hsien-Chun Chiu (邱顯鈞)
Advisor:	Chien-Yu Chen (陳倩瑜)
Keywords:	Taiwanese reference pangenome, Short read mapping, Taiwan Biobank, KIR, SNP
Publication year:	2024
Degree:	Master's
Abstract:	Mainstream personal sequencing technologies typically produce large numbers of short reads that carry no positional information, so the reads must be mapped back to a reference genome. The commonly used reference genomes, hg19 and hg38, are derived from a small number of individuals and lack East Asian participation, so using them as the reference for Taiwanese individuals may introduce bias. This study uses data from the Taiwan Biobank. The variant set, obtained previously by our research team, contains many variant sites specific to Taiwanese people, and it is used here to construct a Taiwanese reference pangenome (TW-graph) that aims to improve the quality of short-read mapping and the accuracy of future applications. Past research has mostly used graph-genome methods to build reference pangenomes, with HISAT2 being the most popular of the commonly used tools. This study therefore uses bcftools to filter variant sites and HISAT2, an aligner based on the graph-genome concept, to build TW-graph together with two control references: an hg38 graph reference genome with no variants added (hg38-graph) and a global reference pangenome that incorporates variant data from the 1000 Genomes Project (1000G-graph).
After the reference pangenomes were built, short-read data from Taiwanese individuals were mapped to each of them and analyzed further, with the mapping rate used for initial interpretation and comparison. The study uses short-read data from 7 Taiwan Biobank samples and 4 non-Taiwan Biobank samples. Compared with the other two references, TW-graph shows a significant improvement in mapping rate: across the 11 samples, the overall mapping rate is about 1% higher than with hg38-graph and about 0.9% higher than with 1000G-graph. TW-graph was then applied to allele typing of the KIR gene family. Among the 44 HPRC samples, the number of uniquely mapped reads shows that BWA-linear performs better than 1000G-graph, which in turn performs better than hg38-graph. For the same eleven Taiwanese samples, TW-graph yields clearly more uniquely mapped reads in the KIR region than the other three controls. Although no ground truth is currently available for evaluation, future experimental data are expected to show whether the increase in mapped reads helps improve the accuracy of KIR allele typing.
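The two metrics the abstract relies on, mapping rate and the number of uniquely mapped reads, can be illustrated with a short Python sketch. This is not the thesis's own analysis code. It assumes alignments are available as SAM text, that secondary and supplementary records are excluded from the read count, and that "uniquely mapped" means the aligner reported a single hit via the optional `NH:i:1` tag (HISAT2 emits this tag; other aligners may not). The function name `mapping_stats` and the toy records are hypothetical.

```python
def mapping_stats(sam_lines):
    """Count primary, mapped, and uniquely mapped records in SAM text.

    Assumption: a record is 'unique' when it carries the NH:i:1 tag.
    """
    total = mapped = unique = 0
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Ignore secondary (0x100) and supplementary (0x800) records so
        # that each read (or mate) is counted once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        # Bit 0x4 marks an unmapped record.
        if flag & 0x4:
            continue
        mapped += 1
        # Optional tags start at the 12th column.
        if "NH:i:1" in fields[11:]:
            unique += 1
    rate = mapped / total if total else 0.0
    return {"total": total, "mapped": mapped,
            "unique": unique, "mapping_rate": rate}


# Toy records (hypothetical, not data from the thesis).
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr19\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr19\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
    "r2\t256\tchr19\t900\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t16\tchr19\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
]

stats = mapping_stats(toy_sam)
print(stats)
```

Running the same function on the alignments produced against each reference and subtracting the resulting `mapping_rate` values would give the kind of difference the abstract reports, under the assumptions stated above.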
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94397
DOI:	10.6342/NTU202401388
Full-text license:	Authorized (open access worldwide)
Appears in department:	Department of Biomechatronics Engineering

Files in this item:

File	Size	Format
ntu-112-2.pdf	1.74 MB	Adobe PDF


Unless their copyright terms are specifically stated otherwise, items in this system are protected by copyright, with all rights reserved.
