Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93888

| Title: | 機器學習應用於核糖核酸定序及全玻片影像預測癌症種類 Machine Learning Approaches for Cancer Classification Using RNA-Sequencing and Whole Slide Images |
| Author: | 陳宥任 YU-JEN CHEN |
| Advisor: | 陳中平 Chung-Ping Chen |
| Co-advisor: | 魏安祺 An-Chi Wei |
| Keywords: | RNA sequencing (RNA-Seq), machine learning, survival analysis, LASSO, transcription factor, weakly supervised learning, whole slide image |
| Publication year: | 2024 |
| Degree: | Master's |
| Abstract: | RNA sequencing (RNA-Seq) is one of the most direct approaches to studying cancer: it offers insight into the molecular mechanisms of cancer, aids the development of targeted therapies, and improves the accuracy of cancer diagnosis and prognosis. Although machine learning, deep learning, and widely available sequencing technology have advanced the study of cancer RNA-Seq, the resulting models still lack biological interpretability. This study therefore uses data from The Cancer Genome Atlas Program (TCGA) and The Genotype-Tissue Expression (GTEx) project to investigate RNA-Seq data for the most prevalent and deadly cancers worldwide (lung, colorectal, liver, stomach, breast, and prostate cancer), as well as pancreatic cancer, which is notably aggressive and has a poor prognosis. Through survival analysis, the Kaplan-Meier (KM) method, LASSO regression, and transcription factor analysis, we identified biologically meaningful genes that affect patient survival time. Using these key genes, machine learning and deep learning models distinguished healthy samples from cancer samples with over 97% accuracy. We further classified healthy samples from GTEx together with 31 cancer types from TCGA: using 152 transcription factors identified by LASSO, our ensemble one-dimensional convolutional network achieved a recall and F1-score of over 95%. We also applied SHapley Additive exPlanations (SHAP), an explainable artificial intelligence technique, to analyze the contribution of each feature to the model's decisions.
With the increasing prevalence of whole slide images (WSIs) in digital pathology, WSIs can serve as training data for models that assist clinicians in diagnosis. Traditionally, training models on pathological images required substantial manual effort to annotate lesion areas or regions of interest (ROIs). This study used annotation-free prostate WSIs from TCGA and the Prostate cANcer graDe Assessment (PANDA) dataset as training data. Following patch generation, color normalization, removal of manual markings and blurred patches, and feature extraction, we trained the model with multiple instance learning (MIL). The model achieved an accuracy of 95% in distinguishing benign from malignant specimens and an F1-score of 83.2% in predicting the ISUP Gleason grade group. The attention heatmaps generated by the model also showed that the basis of its judgments aligned with pathologists' assessments.
In summary, this study demonstrates that machine learning models can accurately distinguish cancerous from normal tissue using RNA-Seq data and can be applied to pan-cancer classification tasks. The transcription factors identified through feature selection can serve as a basis for discovering potential cancer pathways in future research. Moreover, models trained with multiple instance learning on annotation-free whole slide images can accurately predict whether prostate tissue is benign or malignant, as well as its Gleason grade group. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93888 |
| DOI: | 10.6342/NTU202400900 |
| Full-text authorization: | Not authorized |
| Appears in department/institute: | Graduate Institute of Biomedical Electronics and Bioinformatics (生醫電子與資訊學研究所) |
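
The abstract names the Kaplan-Meier method as one of the steps used to screen survival-related genes. This record does not include the thesis's implementation, so the following is only a minimal sketch of the standard Kaplan-Meier product-limit estimator in plain Python. The function name `kaplan_meier` and the sample follow-up times are hypothetical and chosen for illustration; they are not taken from the thesis or from TCGA.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimate of the survival function.

    durations: follow-up time for each patient.
    events:    1 if the event (death) was observed at that time,
               0 if the patient was censored.
    Returns a list of (time, survival_probability) pairs, one for each
    distinct time at which at least one event was observed.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    # Number of observed events at each time.
    deaths = Counter(t for t, e in zip(durations, events) if e)
    # Number of patients leaving the risk set (event or censoring) at each time.
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # S(t) = S(previous) * (1 - d_t / n_t)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone who exited at t is no longer at risk afterwards.
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up times (in months) and event flags, for illustration only.
example_durations = [5, 8, 8, 12, 15, 20]
example_events = [1, 1, 0, 1, 0, 1]
for time, prob in kaplan_meier(example_durations, example_events):
    print(f"t={time}: S(t)={prob:.3f}")
```

One common way to use such curves for gene screening is to split patients into high-expression and low-expression groups for a gene and compare the two curves. Whether the thesis follows exactly this procedure is not stated in the abstract.
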
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (not authorized for public access) | 20.11 MB | Adobe PDF | |
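Unless their copyright terms are specifically stated otherwise, all items in this system are protected by copyright, with all rights reserved.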
