Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55139
Title: Using deep learning to predict gene structures of the coding genes in DNA sequences of Arabidopsis thaliana (利用深度學習來預測阿拉伯芥DNA序列中編碼基因的基因結構)
Author: Ching-Tien Wang (王擎天)
Advisor: Kun-Mao Chao (趙坤茂)
Co-advisor: Chung-Yen Lin (林仲彥)
Keywords: Arabidopsis thaliana, data cleaning, gene annotation, deep learning, post-processing
Publication year: 2020
Degree: Master's
Abstract: Knowing the structure of a gene helps us understand its function, and gene structures can be predicted by models such as Augustus. To annotate a DNA sequence, these models require the feature composition of the annotation to be analyzed in advance, and multiple submodels must be designed to detect those features. Deep learning does not require this prior analysis and can learn the features it needs, which makes it easy to apply in many fields. The purpose of this thesis is to build a deep-learning-based model that directly predicts the gene structures of coding genes in DNA sequences of Arabidopsis thaliana. An annotation containing 977 coding gene structures was created by using data from global run-on sequencing and Poly (A)-Test RNA-sequencing to reannotate and filter the existing transcripts. A new deep learning model and a new loss function were proposed. The median macro F-score of the deep learning model was 0.969, compared with 0.957 for Augustus, and statistical testing showed that the deep learning model was significantly better than Augustus on this metric. Two post-processing methods were proposed: the boundary post-processing method, which handles intron boundaries, and the length filtering method, which filters out short regions. After post-processing, the predictions of the deep learning model improved significantly on 9 of 16 metrics. The post-processed predictions were significantly better than Augustus on 6 of 16 metrics and significantly worse than Augustus on 5 of 16 metrics. These results show that the deep learning model with the post-processing procedure is competitive with Augustus. Furthermore, on part of the genome, the post-processed predictions yielded an average of 18642 gene structures that contained known protein domains.
Overall, the proposed deep learning model with the post-processing procedure can be an alternative method for predicting gene structures of coding genes in DNA sequences of Arabidopsis thaliana.
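Note on the metric: the abstract reports a macro F-score but does not define it. The sketch below shows the standard definition (the unweighted mean of per-class F1 scores). The class set (exon, intron, other), the per-base labeling, and the function name macro_f_score are assumptions made for illustration; they are not taken from the thesis, which may compute the score differently.

```python
from collections import Counter


def macro_f_score(true_labels, pred_labels, classes):
    """Unweighted mean of per-class F1 scores over paired labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it was not present
            fn[t] += 1  # missed the true class t
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); treated as 0 if the class never occurs.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical per-base labels: E = exon, I = intron, O = other.
truth = "OOEEEEIIIEEOO"
pred = "OOEEEIIIIEEOO"
score = macro_f_score(truth, pred, classes="EIO")
```

Because every class contributes equally to the mean, a model that mislabels a rare class (for example, short introns) is penalized as much as one that mislabels a common class.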
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55139
DOI: 10.6342/NTU202002143
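Full-text license: Paid license
Appears in collections: Graduate Institute of Biomedical Electronics and Bioinformatics (生醫電子與資訊學研究所)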

Files in this item:
File: U0001-3107202002312100.pdf (not currently authorized for public access)
Size: 5.76 MB
Format: Adobe PDF
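Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.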
