Variants Prioritizer for Exome Data Based on Text-Mining (基於文字探勘排序來自外顯子組資料之基因變異點)

Ting-Fu Chen; 陳亭甫

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79098

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	賴飛羆(Feipei Lai)
dc.contributor.author	Ting-Fu Chen	en
dc.contributor.author	陳亭甫	zh_TW
dc.date.accessioned	2021-07-11T15:43:46Z	-
dc.date.available	2023-08-23
dc.date.copyright	2018-08-23
dc.date.issued	2018
dc.date.submitted	2018-08-09
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79098	-
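dc.description.abstract	Next-generation sequencing gives us a fast way to obtain DNA sequence data. After a patient's DNA goes through quality checking, alignment, variant detection, and variant analysis, we obtain single nucleotide variant (SNV) data for the exome, which include all SNVs in the patient's somatic cells. Although gene mutations may cause human disease, affect physiological defenses against pathogens, and alter normal responses to chemicals or drugs, only a few variants are pathogenic or lead to dysregulated physiology and disease. To find variants that are truly pathogenic and related to the clinical symptoms more quickly, we use MViewer, our in-house NGS interpretation aid, for data preprocessing, and a text-mining-based variant ranking tool (Variants Prioritizer) to rank all candidate variants. MViewer integrates annotations of exome variants from multiple databases and filters for the more likely variants using features shared by pathogenic variants, such as a low variant frequency in a particular population, or other criteria. We then assign keywords based on the patient's clinical symptoms and diagnosis; these keywords and the candidate variants serve as input to the Variants Prioritizer developed in this study. The tool analyzes GeneReviews, Online Mendelian Inheritance in Man (OMIM), and PubMed abstracts to judge the relevance between keywords and candidate variants, and outputs a list of candidate variants ordered from most to least relevant, in order to find the variants most likely to cause disease and most related to the patient's clinical signs. After analyzing exome variant data from 32 individuals and their parents, we used MViewer to reduce an average of as many as 30,000 variants to 500 genes, and the Variants Prioritizer also found the target variants: the target variant appeared in the top ten of the candidate list in 87.5% of cases. The disease-causing genes include CD3E, EIF2B5, COQ4, MOCS2, ISPD, NGLY1, COQ4, PEX1, PITX2, KCNQ5, RYR2, SCN8A, PEX7, C3, TRPC6, TNNI3, CPS1, EDN1, SLC22A5, PAH, RAG1, EYA1, IFT122, ATP7A, BRCA1, ELN, FRAS1, COL2A1, ALDH18A1, MAF, SUOX, and CTLA4. These results show that MViewer and the Variants Prioritizer are fast and helpful for the analysis of exome sequencing data.	zh_TW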
dc.description.abstract	Next Generation Sequencing (NGS) provides a rapid way to obtain DNA sequence data. After quality checking, alignment, variant detection, and annotation, we obtain exome data on single nucleotide variants (SNVs), which are variations at a single nucleotide, without any limitation on frequency, that may arise in somatic cells. Variants in human DNA sequences can affect how humans develop diseases and respond to pathogens, chemicals, drugs, and other agents, but only one or very few are expected to be pathogenic or significant for the relevant disorder. To narrow the search for culprit disease genes, we use MViewer, which analyzes SNV data with annotations from different genome databases, as our preprocessing tool to decrease the number of SNVs. We then describe a novel text-mining tool, Variants Prioritizer, which uses articles from GeneReviews and Online Mendelian Inheritance in Man (OMIM) and abstracts from PubMed to rank the SNVs from MViewer against keywords from electronic medical records (EMRs) and identify the most likely disease-causing candidates. After testing on 32 sets of trio sequencing data, MViewer reduced the number of candidates from about 30,000 SNVs on average to 500 candidate genes. Variants Prioritizer placed the target gene in the top 10 of the ranking list in 87.5% of cases. Disease-causing genes identified included CD3E, EIF2B5, COQ4, MOCS2, ISPD, NGLY1, COQ4, PEX1, PITX2, KCNQ5, RYR2, SCN8A, PEX7, C3, TRPC6, TNNI3, CPS1, EDN1, SLC22A5, PAH, RAG1, EYA1, IFT122, ATP7A, BRCA1, ELN, FRAS1, COL2A1, ALDH18A1, MAF, SUOX, and CTLA4. MViewer and Variants Prioritizer are therefore effective aids for interpreting whole exome sequencing (WES) data. Variants Prioritizer is currently used to support clinical exome interpretation.	en
dc.description.provenance	Made available in DSpace on 2021-07-11T15:43:46Z (GMT). No. of bitstreams: 1 ntu-107-R05945024-1.pdf: 2701544 bytes, checksum: c3764c9942da447293f29c767c8b705b (MD5) Previous issue date: 2018	en
dc.description.tableofcontents	Oral Defense Committee Certification; Acknowledgements; Chinese Abstract; ABSTRACT; CONTENTS; LIST OF FIGURES; LIST OF TABLES; Chapter 1 Introduction; 1.1 Background; 1.2 Motivation; 1.3 Objective; 1.4 Related Works; Chapter 2 Architecture; 2.1 Workflow; 2.2 Data Structure and Framework; 2.2.1 Data structure from NGS; 2.2.2 MViewer; 2.2.3 Variants Prioritizer; Chapter 3 Methods; 3.1 Preprocessing - MViewer; 3.2 Build Connections between Genes and Keywords; 3.2.1 Word2Vec; 3.2.2 Okapi BM25 based on OMIM and GeneReviews articles; 3.2.3 Expand the number of documents by using RNN to label PubMed abstracts; 3.2.4 Expand the number of documents by using PubTator to label PubMed abstracts; Chapter 4 Result and Discussion; 4.1 Variants Prioritizer – Web Application; 4.2 Validation Dataset; 4.3 Preprocessing (MViewer); 4.4 Word2Vec Models; 4.4.1 Models Evaluation; 4.4.2 Validation Dataset Rank Result on Word2Vec Models; 4.5 Okapi BM25; 4.5.1 OMIM and GeneReviews; 4.5.2 PubMed Abstracts; 4.5.3 OMIM, GeneReviews and PubMed; 4.5.4 Variants Prioritizer, AMELIE and VarElect; Chapter 5 Conclusion and Future Work; REFERENCE
dc.language.iso	zh-TW
dc.subject	Single Nucleotide Mutation	zh_TW
dc.subject	Next Generation Sequencing Analysis	zh_TW
dc.subject	Exome Sequencing Analysis	zh_TW
dc.subject	Text Mining	zh_TW
dc.subject	Whole Exome Sequencing (WES) Analysis	en
dc.subject	Text-mining	en
dc.subject	Single Nucleotide Variants (SNV)	en
dc.subject	Next Generation Sequencing (NGS) Analysis	en
dc.title	基於文字探勘排序來自外顯子組資料之基因變異點	zh_TW
dc.title	Variants Prioritizer for Exome Data Based on Text-Mining	en
dc.type	Thesis
dc.date.schoolyear	106-2
dc.description.degree	Master's
dc.contributor.coadvisor	李妮鍾(Ni-Chung Lee)
dc.contributor.oralexamcommittee	呂宗謙(Tsung-Chien Lu),胡務亮(Wuh-Liang Hwu),莊仁輝
dc.subject.keyword	Text Mining, Next Generation Sequencing Analysis, Exome Sequencing Analysis, Single Nucleotide Mutation	zh_TW
dc.subject.keyword	Text-mining, Next Generation Sequencing (NGS) Analysis, Whole Exome Sequencing (WES) Analysis, Single Nucleotide Variants (SNV)	en
dc.relation.page	41
dc.identifier.doi	10.6342/NTU201802837
dc.rights.note	Paid license
dc.date.accepted	2018-08-10
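dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Biomedical Electronics and Bioinformatics	zh_TW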
dc.date.embargo-lift	2023-08-23	-
Appears in Collections:	Graduate Institute of Biomedical Electronics and Bioinformatics
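
The ranking step described in the abstract (scoring candidate genes against clinical keywords with Okapi BM25) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names `tokenize` and `bm25_rank`, the parameters `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the toy gene descriptions are all hypothetical choices made for this example. The actual tool draws its documents from OMIM, GeneReviews, and PubMed abstracts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_rank(gene_docs, keywords, k1=1.5, b=0.75):
    """Return gene names ordered by BM25 score of their text against keywords.

    gene_docs maps a gene symbol to free text describing that gene
    (a stand-in for the curated articles the thesis uses).
    """
    tokens = {gene: tokenize(doc) for gene, doc in gene_docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs

    # Document frequency: in how many gene documents each term appears.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    query = tokenize(" ".join(keywords))
    scores = {}
    for gene, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Assumed IDF variant: the "+1 inside the log" form, never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores[gene] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy, hand-written descriptions (hypothetical; not taken from OMIM).
gene_docs = {
    "PAH": "phenylalanine hydroxylase deficiency causes phenylketonuria "
           "and elevated phenylalanine",
    "RYR2": "ventricular tachycardia and arrhythmia triggered by exercise",
    "ATP7A": "copper transport defect with kinky hair and seizures",
}
ranking = bm25_rank(gene_docs, ["phenylketonuria", "elevated phenylalanine"])
print(ranking)
```

In this toy setup only the PAH text shares terms with the keywords, so PAH is ranked first; genes with no matching terms score zero and keep their input order.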

Files in This Item:

File	Size	Format
ntu-107-R05945024-1.pdf (not authorized for public access)	2.64 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.