Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55139
Full metadata record
DC field: value (language)
dc.contributor.advisor: 趙坤茂 (Kun-Mao Chao)
dc.contributor.author: Ching-Tien Wang (en)
dc.contributor.author: 王擎天 (zh_TW)
dc.date.accessioned: 2021-06-16T03:48:39Z
dc.date.available: 2022-12-31
dc.date.copyright: 2020-09-17
dc.date.issued: 2020
dc.date.submitted: 2020-08-14
dc.identifier.citation1. Cheng C-Y, Krishnakumar V, Chan AP, Thibaud-Nissen F, Schobel S, Town CD: Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. The Plant Journal 2016, 89(4):789-804.
2. Stanke M, Waack S: Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 2003, 19(suppl_2):ii215-ii225.
3. Huang G, Liu Z, van der Maaten L, Weinberger KQ: Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 2017. IEEE: 2261-2269.
4. Quang D, Xie X: DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Research 2016, 44(11):e107-e107.
5. Hill ST, Kuintzle R, Teegarden A, Merrill IIIE, Danaee P, Hendrix DA: A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential. Nucleic Acids Research 2018, 46(16):8105-8113.
6. Veltri D, Kamath U, Shehu A: Deep learning improves antimicrobial peptide recognition. Bioinformatics 2018, 34(16):2740-2747.
7. Gao X, Zhang J, Wei Z, Hakonarson H: DeepPolyA: A Convolutional Neural Network Approach for Polyadenylation Site Prediction. IEEE Access 2018, 6:24340-24349.
8. Bretschneider H, Gandhi S, Deshwar AG, Zuberi K, Frey BJ: COSSMO: predicting competitive alternative splice site selection using deep learning. Bioinformatics 2018, 34(13):i429-i437.
9. Jaganathan K, Panagiotopoulou SK, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB: Predicting splicing from primary sequence with deep learning. Cell 2019, 176(3):535-548.
10. Amin MR, Yurovsky A, Tian Y, Skiena S: DeepAnnotator: Genome Annotation with Deep Learning. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics; Washington, DC, USA. Association for Computing Machinery 2018: 254–259.
11. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M et al: The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research 2012, 40(D1):D1202-D1210.
12. Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25(9):1105-1111.
13. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B: AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Research 2006, 34(suppl_2):W435-W439.
14. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology 2011, 29(7):644.
15. Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ: MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant physiology 2014, 164(2):513-524.
16. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith Jr RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD et al: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 2003, 31(19):5654-5666.
17. Haberle V, Stark A: Eukaryotic core promoters and the functional basis of transcription initiation. Nature Reviews Molecular Cell Biology 2018, 19(10):621-637.
18. Korkuc P, Schippers JHM, Walther D: Characterization and identification of cis-regulatory elements in Arabidopsis based on single-nucleotide polymorphism information. Plant physiology 2014, 164(1):181-200.
19. Yu C-P, Lin J-J, Li W-H: Positional distribution of transcription factor binding sites in Arabidopsis thaliana. Scientific Reports 2016, 6:25164.
20. Neve J, Patel R, Wang Z, Louey A, Furger AM: Cleavage and polyadenylation: Ending the message expands gene regulation. RNA Biol 2017, 14(7):865-890.
21. Sherstnev A, Duc C, Cole C, Zacharaki V, Hornyik C, Ozsolak F, Milos PM, Barton GJ, Simpson GG: Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation. Nature structural molecular biology 2012, 19(8):845.
22. Kornblihtt AR, Schor IE, Alló M, Dujardin G, Petrillo E, Muñoz MJ: Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nature Reviews Molecular Cell Biology 2013, 14(3):153-165.
23. Kurihara Y, Makita Y, Kawashima M, Fujita T, Iwasaki S, Matsui M: Transcripts from downstream alternative transcription start sites evade uORF-mediated inhibition of gene expression in Arabidopsis. Proceedings of the National Academy of Sciences 2018, 115(30):7831-7836.
24. Zhang P, Dimont E, Ha T, Swanson DJ, Hide W, Goldowitz D: Relatively frequent switching of transcription start sites during cerebellar development. BMC Genomics 2017, 18(1):461.
25. Ni T, Corcoran DL, Rach EA, Song S, Spana EP, Gao Y, Ohler U, Zhu J: A paired-end sequencing strategy to map the complex landscape of transcription initiation. Nature Methods 2010, 7(7):521-527.
26. Morton T, Petricka J, Corcoran DL, Li S, Winter CM, Carda A, Benfey PN, Ohler U, Megraw M: Paired-end analysis of transcription start sites in Arabidopsis reveals plant-specific promoter signatures. The Plant Cell 2014:tpc-114.
27. Core LJ, Waterfall JJ, Lis JT: Nascent RNA Sequencing Reveals Widespread Pausing and Divergent Initiation at Human Promoters. Science 2008, 322(5909):1845.
28. Ozsolak F, Platt AR, Jones DR, Reifenberger JG, Sass LE, McInerney P, Thompson JF, Bowers J, Jarosz M, Milos PM: Direct RNA sequencing. Nature 2009, 461:814.
29. Harrison PF, Powell DR, Clancy JL, Preiss T, Boag PR, Traven A, Seemann T, Beilharz TH: PAT-seq: a method to study the integration of 3'-UTR dynamics with gene expression in the eukaryotic transcriptome. RNA 2015, 21(8):1502-1510.
30. Krogh A, Mian IS, Haussler D: A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research 1994, 22(22):4768-4778.
31. Haussler, David DK, Eeckman, H MGRF: A generalized hidden Markov model for the recognition of human genes in DNA. In: Proceedings of the International Conference on Intelligent Systems for Molecular Biology, St Louis: 1996. 134-142.
32. Stanke M, Diekhans M, Baertsch R, Haussler D: Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 2008, 24(5):637-644.
33. Stanke M, Morgenstern B: AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research 2005, 33(suppl_2):W465-W467.
34. Holt C, Yandell M: MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 2011, 12(1):491.
35. Korf I: Gene finding in novel genomes. BMC Bioinformatics 2004, 5(1):59.
36. Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M: Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research 2005, 33(20):6494-6506.
37. Chan K-L, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low E-TL: Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data. BMC Bioinformatics 2017, 18(1):1-7.
38. Majoros WH, Pertea M, Salzberg SL: TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 2004, 20(16):2878-2879.
39. Krizhevsky A, Sutskever I, Hinton GE: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems: 2012. 1097-1105.
40. Redmon J, Divvala S, Girshick R, Farhadi A: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition: 2016. 779-788.
41. Glorot X, Bordes A, Bengio Y: Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics: 2011. 315-323.
42. Bridle JS: Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In: Advances in neural information processing systems: 1990. 211-217.
43. He K, Zhang X, Ren S, Sun J: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 2016. 770-778.
44. Xu Y, Kong Q, Huang Q, Wang W, Plumbley MD: Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging. In: Conference of the International Speech Communication Association: 8/20/2017. 2017: 3083-3087.
45. Ismail Fawaz H, Forestier G, Weber J, Idoumghar L, Muller P-A: Deep learning for time series classification: a review. Data Mining and Knowledge Discovery 2019, 33(4):917-963.
46. Siegelmann HT, Sontag ED: On the computational power of neural nets. Journal of computer and system sciences 1995, 50(1):132-150.
47. Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2014): 2014.
48. Kingma DP, Ba JL: Adam: A Method for Stochastic Optimization. In: International Conference on Learning Representations: 1/1/2015. 2015.
49. Ioffe S, Szegedy C: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In: Proceedings of the 32nd International Conference on Machine Learning; Proceedings of Machine Learning Research: Edited by Francis B, David B. PMLR 2015: 448--456.
50. Santurkar S, Tsipras D, Ilyas A, Madry A: How does batch normalization help optimization? In: Advances in Neural Information Processing Systems: 2018. 2483-2493.
51. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 2014, 15(1):1929-1958.
52. Pham V, Bluche T, Kermorvant C, Louradour J: Dropout improves recurrent neural networks for handwriting recognition. In: 2014 14th International Conference on Frontiers in Handwriting Recognition: 2014. IEEE: 285-290.
53. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A: Automatic differentiation in PyTorch. In.; 2017.
54. Snoek J, Larochelle H, Adams RP: Practical bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems: 2012. 2951-2959.
55. Yeo G, Burge CB: Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology 2004, 11(2-3):377-394.
56. Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic acids research 1998, 26(2):544-548.
57. Hetzel J, Duttke SH, Benner C, Chory J: Nascent RNA sequencing reveals distinct features in plant transcription. Proceedings of the National Academy of Sciences 2016, 113(43):12316.
58. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15-21.
59. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular cell 2010, 38(4):576-589.
60. Zhu S, Ye W, Ye L, Fu H, Ye C, Xiao X, Ji Y, Lin W, Ji G, Wu X: PlantAPAdb: A Comprehensive Database for Alternative Polyadenylation Sites in Plants. Plant Physiology 2020, 182(1):228.
61. Akiba T, Sano S, Yanase T, Ohta T, Koyama M: Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery Data Mining: 2019. 2623-2631.
62. Student: The probable error of a mean. Biometrika 1908:1-25.
63. Mann HB, Whitney DR: On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics 1947:50-60.
64. Hothorn T, Hornik K, Hothorn MT: Package ‘exactRankTests’. 2019.
65. Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics 2000, 16(6):276-277.
66. Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD: HMMER web server: 2018 update. Nucleic acids research 2018, 46(W1):W200-W204.
67. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J: Pfam: the protein families database. Nucleic acids research 2014, 42(D1):D222-D230.
68. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A et al: The Pfam protein families database in 2019. Nucleic Acids Research 2018, 47(D1):D427-D432.
69. Hoff KJ, Stanke M: Predicting genes in single genomes with augustus. Current protocols in bioinformatics 2019, 65(1):e57.
70. Crooks GE, Hon G, Chandonia J-M, Brenner SE: WebLogo: a sequence logo generator. Genome Research 2004, 14(6):1188-1190.
71. Gallegos JE, Rose AB: Intron DNA sequences can be more important than the proximal promoter in determining the site of transcript initiation. The Plant Cell 2017:tpc-00020.
72. Alexandrov NN, Troukhan ME, Brover VV, Tatarinova T, Flavell RB, Feldmann KA: Features of Arabidopsis genes and genome discovered using full-length cDNAs. Plant molecular biology 2006, 60(1):69-85.
73. Loke JC, Stahlberg EA, Strenski DG, Haas BJ, Wood PC, Li QQ: Compilation of mRNA polyadenylation signals in Arabidopsis revealed a new signal element and potential secondary structures. Plant physiology 2005, 138(3):1457-1468.
74. Tan M, Le Q: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: International Conference on Machine Learning: 2019. 6105-6114.
75. Zagoruyko S, Komodakis N: Wide Residual Networks. In: British Machine Vision Conference: 1/1/2016. 2016.
76. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J: LSTM: A Search Space Odyssey. IEEE Trans Neural Networks Learn Syst 2017, 28(10):2222-2232.
77. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26(6):841-842.
78. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078-2079.
79. Grisel O, Mueller A, Lars, Gramfort A, Louppe G, Prettenhofer P, Blondel M, Niculae V, Nothman J, Joly A et al: scikit-learn/scikit-learn: Scikit-learn 0.22.2.post1. In., 0.22.2.post1 edn: Zenodo; 2020.
80. Head T, MechCoder, Louppe G, Shcherbatyi I, fcharras, Vinícius Z, cmmalone, Schröder C, nel, Campos N et al: scikit-optimize/scikit-optimize: v0.5.2. In., v0.5.2 edn: Zenodo; 2018.
81. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods 2020, 17(3):261-272.
82. Hunter JD: Matplotlib: A 2D Graphics Environment. Computing in Science Engineering 2007, 9(3):90-95.
83. Waskom M, Botvinnik O, Ostblom J, Lukauskas S, Hobson P, MaozGelbart, Gemperline DC, Augspurger T, Halchenko Y, Cole JB et al: mwaskom/seaborn: v0.9.1 (January 2020). In., v0.9.1 edn: Zenodo; 2020.
84. Walt Svd, Colbert SC, Varoquaux G: The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science Engineering 2011, 13(2):22-30.
85. McKinney W: Data Structures for Statistical Computing in Python. In: 2010 2010. 56-61.
86. venn 0.1.3 [https://pypi.org/project/venn/]
87. He K, Zhang X, Ren S, Sun J: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision: 2015. 1026-1034.
88. Glorot X, Bengio Y: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics: 2010. 249-256.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55139
dc.description.abstract: The structure of a gene helps us understand its function, and it can be obtained through prediction by models such as Augustus. To annotate DNA sequences, these models require prior analysis of the feature composition of the annotation and the design of multiple submodels to detect those features. Deep learning requires no such prior feature analysis and can learn the features it needs, which makes it easy to apply in many fields. The purpose of this study is to build a deep learning model to predict the gene structures of coding genes in Arabidopsis thaliana DNA sequences. Using data from global run-on sequencing and Poly(A)-Test RNA-sequencing, this study cleaned and reannotated the existing transcript data and obtained an annotation containing 977 coding genes. A novel deep learning model and a new loss function are proposed. The results show a median macro F-score of 0.969 for the deep learning model versus 0.957 for Augustus, and statistical tests show that the deep learning model is significantly better than Augustus on the macro F-score. Two post-processing methods are proposed: a boundary post-processing method that handles intron boundaries, and a length filtering method that handles short fragments. After post-processing, the deep learning predictions improved significantly on 9 of 16 metrics; compared with Augustus, they were significantly better on 6 of 16 metrics and significantly worse on 5. These results show that the deep learning model combined with the post-processing methods is competitive with Augustus. In addition, on part of the genome, the post-processed deep learning predictions identified an average of 18,642 gene structures containing known protein domains. Overall, the deep learning model combined with the post-processing methods can serve as an alternative method for predicting the gene structures of coding genes in Arabidopsis thaliana DNA sequences. (zh_TW)
dc.description.abstract: The structure of a gene can help us better understand its function, and it can be predicted by models such as Augustus. To annotate a DNA sequence with these models, the feature composition of the annotation must first be analyzed, and multiple submodels must be designed to detect these features. Deep learning requires no such prior feature analysis and can learn the features it needs, which makes it easy to apply in many fields. The purpose of this thesis is to build a deep-learning-based model that directly predicts the gene structures of coding genes in DNA sequences of Arabidopsis thaliana. An annotation containing 977 coding gene structures was created by using data from global run-on sequencing and Poly(A)-Test RNA-sequencing to reannotate and filter the existing transcripts. A new deep learning model and a new loss function were proposed. The median macro F-score of the deep learning model was 0.969, versus 0.957 for Augustus, and the statistical analysis showed that the deep learning model was significantly better than Augustus on the macro F-score. Two post-processing methods were proposed: a boundary post-processing method that handles intron boundaries, and a length filtering method that filters out short regions. The revised results of the deep learning model improved significantly on 9 of 16 metrics. Compared with Augustus, the revised results were significantly better on 6 of 16 metrics and significantly worse on 5. These results show that the deep learning model with the post-processing procedure is competitive with Augustus. Furthermore, on part of the genome, the revised results of the deep learning model predicted an average of 18,642 gene structures containing known protein domains. Overall, the proposed deep learning model with the post-processing procedure can serve as an alternative method for predicting the gene structures of coding genes in DNA sequences of Arabidopsis thaliana. (en)
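The macro F-score used above to compare the two models is the unweighted mean of per-class F1 scores. A minimal sketch of the computation, assuming three illustrative per-base classes; the class labels and example sequences below are hypothetical, not taken from the thesis:

```python
def macro_f_score(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (macro F-score)."""
    scores = []
    for c in classes:
        # Count true positives, false positives, and false negatives for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Hypothetical per-base labels: E = exon, I = intron, O = intergenic
true_labels = list("EEEIIIOOOE")
pred_labels = list("EEEIIOOOOE")
print(round(macro_f_score(true_labels, pred_labels, "EIO"), 3))
```

Because every class contributes equally regardless of how many bases it covers, the macro average penalizes a model that performs well only on the dominant class (e.g., intergenic regions).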
dc.description.provenance: Made available in DSpace on 2021-06-16T03:48:39Z (GMT). No. of bitstreams: 1; U0001-3107202002312100.pdf: 5895028 bytes, checksum: 6d118bdb3fff60eb435f76f8632e020e (MD5). Previous issue date: 2020 (en)
dc.description.tableofcontents: Thesis Committee Certification ii
Acknowledgements iv
Abstract (Chinese) v
Abstract vi
Table of Contents viii
Table of Figures xii
Table of Tables xiv
Table of Equations xv
Table of Algorithms xvi
List of Abbreviations xvii
Chapter 1. Introduction 1
Chapter 2. The Literature Review 4
2.1 Annotation identification on Arabidopsis thaliana ecotype Col-0 4
2.2 Transcription and splicing in eukaryotes 4
2.3 Alternative TSSs and alternative CSs 5
2.4 Ab initio transcript structure prediction 6
2.5 Deep learning related techniques 8
2.6 Deep learning applications related to sequence annotation 13
Chapter 3. Materials and Methods 15
3.1 The data preparation 15
3.2 The workflow of creating annotation datasets 15
3.3 Label inference methods, loss functions, and model architectures 21
3.4 Hyperparameter optimization procedure, cross-validation, testing, and augmentation 25
3.5 Comparison of results on the testing dataset and potential transcript regions 27
3.6 The post-processing procedure 31
3.7 Training and testing procedure of Augustus 36
Chapter 4. Results 38
4.1 The different settings of the boundary around the gene 38
4.2 The statistical results of the experimental data and transcripts 39
4.3 The statistical results of transcripts and regions after filtering and cleaning 42
4.4 Hyperparameter searching result on the small dataset 45
4.5 Result comparison of deep learning model and Augustus on the testing dataset 46
4.6 The revised result of deep learning model on the testing dataset 51
4.7 Comparison of the revised result of deep learning model and the result of Augustus on the testing dataset and potential transcript regions 55
Chapter 5. Discussion 59
5.1 Different kinds of evidence can affect the percentage of genes supported by evidence 59
5.2 The transcripts have transcription-related evidence around their TSSs 59
5.3 Different upstream distances can have a massive impact on the amount of data and the percentage of genes supported by transcription evidence 60
5.4 Most locations of evidence are near external UTRs 60
5.5 The boundaries of the reannotated transcripts are close to the existing boundaries 61
5.6 The nucleotide compositions around different kinds of sites agree with the previous studies 62
5.7 There is a tradeoff between annotation quality and the number of transcripts, and high-quality data are rare 63
5.8 The hyperparameter optimization can find good hyperparameters in a few trials 63
5.9 The results of deep learning and of Augustus have their respective strengths and weaknesses 65
5.10 The results of deep learning have fragment and boundary problems, and the data in DataTrain and PredictedVal can inform the post-processing procedure 66
5.11 The post-processing procedure can improve the results of the deep learning model 67
5.12 The deep learning model with the post-processing procedure is competitive with Augustus in many respects 67
5.13 The difficulty of getting a good result in each metric 68
5.14 The deep learning model with the post-processing procedure can predict domain-including genes in potential transcript regions 69
5.15 The comparison of other annotation applications 70
5.16 Future work on improving model 71
Chapter 6. Conclusion 74
References 75
Supplementary Figures 86
Figure S1. Examples of transcripts that failed to be reannotated (assuming all evidence is related to the transcript) 87
Figure S2. Examples of annotation at every level, their gene boundaries, and metric results at the base, block, and chain-block levels 88
Figure S3. Examples of annotation and of metrics for distances and site predictions 89
Figure S4. The Venn diagram of the transcripts that passed the filters 90
Supplementary Tables 91
Table S1. The version of tools 92
Table S2. Data source summary 92
Table S3. The names and sources of the datasets (the numbers denote chromosomes) 93
Table S4. Weights and bias initialization (note: fan_in denotes the number of input channels) 93
Table S5. The number of regions on each dataset 94
Table S6. The summary of regions with single exon, regions with multiple exons, regions with no exon (no gene), and all regions 94
Table S7. The statistical results of DS and AS in gene annotation on DataTrain 94
Table S8. Hyperparameter settings and Loss_revision of the post-processing procedures (L indicates length filtering and B indicates boundary post-processing) 95
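Among the post-processing steps listed under Section 3.6, the length filtering method removes predicted regions that are too short to be plausible genes. A minimal sketch of the idea, assuming predicted regions are 1-based inclusive (start, end) intervals; the threshold and intervals below are hypothetical, not values from the thesis:

```python
def length_filter(regions, min_length):
    """Drop predicted regions shorter than min_length.

    min_length is a hypothetical threshold for illustration; the thesis
    describes choosing its cutoff as part of the post-processing setup.
    """
    return [(start, end) for start, end in regions if end - start + 1 >= min_length]

# Illustrative predicted regions as 1-based inclusive (start, end) intervals
predicted = [(100, 105), (200, 1200), (1500, 1520)]
print(length_filter(predicted, 30))
```

Short spurious fragments are a common failure mode of per-base classifiers, since each base is labeled without a hard constraint that neighboring labels form one contiguous gene structure; a length filter is one simple way to suppress them.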
dc.language.iso: en
dc.title: 利用深度學習來預測阿拉伯芥DNA序列中編碼基因的基因結構 (zh_TW)
dc.title: Using deep learning to predict gene structures of the coding genes in DNA sequences of Arabidopsis thaliana (en)
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 林仲彥 (Chung-Yen Lin)
dc.contributor.oralexamcommittee: 張育榮 (Yu-Jung Chang)
dc.subject.keyword: 阿拉伯芥, 資料清洗, 基因註解, 深度學習, 資料後處理 (zh_TW)
dc.subject.keyword: Arabidopsis thaliana, data cleaning, gene annotation, deep learning, post-processing (en)
dc.relation.page: 95
dc.identifier.doi: 10.6342/NTU202002143
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2020-08-15
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 生醫電子與資訊學研究所 (Graduate Institute of Biomedical Electronics and Bioinformatics) (zh_TW)
Appears in Collections: Graduate Institute of Biomedical Electronics and Bioinformatics (生醫電子與資訊學研究所)

Files in this item:
File: U0001-3107202002312100.pdf (currently not authorized for public access)
Size: 5.76 MB; Format: Adobe PDF