Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77348
Title: | 建構大腸直腸癌病患之基因預後模型 (Development of a new prognostic gene expression signature for colorectal cancer) |
Author: | Han-Ching Chan (詹涵晴) |
Advisor: | Tzu-Pin Lu (盧子彬) |
Keywords: | gene expression, prognostic signature, prognostic factor, colon and colorectal cancer, machine learning |
Publication year: | 2019 |
Degree: | Master's |
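Abstract: | Background: Colorectal cancer has a very high incidence both in Taiwan and worldwide. As screening rates rise, more patients are diagnosed at an early stage. However, there is currently no clinical consensus on whether early-stage colorectal cancer patients need adjuvant therapy after surgery, and the definition of a patient at high risk of recurrence remains unclear. This study uses gene expression data to identify genes that are differentially expressed between early-stage patients with and without recurrence, and then builds a model that predicts each patient's risk of recurrence, so that high-risk patients can be identified for further treatment.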
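Methods: All gene expression data in this study came from the public Gene Expression Omnibus (GEO) database. Because the training dataset, GSE40967, is imbalanced, we used the Synthetic Minority Over-sampling Technique (SMOTE) to adjust the class proportions and keep the prediction model from favoring the majority class: the minority class was oversampled by adding synthetic samples, and the majority class was reduced by simple random undersampling, yielding balanced data. To select differentially expressed genes, we applied Significance Analysis of Microarrays (SAM), logistic regression, and random forest recursive feature elimination (RFE) in sequence, filtering at each step with a fixed threshold, and then built the prediction model with a Support Vector Machine (SVM). To obtain more stable predictions, we applied the idea of a parallel ensemble: using bagging, we drew a specified proportion of the training data 10 times, ran the same analysis workflow on each draw to build 10 independent prediction models, and determined the final prediction by majority vote.
Results: Each of the 10 models ended up with 11 differentially expressed genes on average, and the union over the 10 models contained 51 differentially expressed genes. On the training set, sensitivity and specificity were 91.18% and 83.33%. In external validation, sensitivity and specificity were 80% and 37.78% for the USA dataset and 91.67% and 32.84% for the Australia dataset. We also compared the workflow with two alternative methods, forward-selection logistic regression and Lasso regression; both performed reasonably well on the training set but could not maintain high sensitivity in external validation.
Conclusion: In this study, we used gene expression data to identify differentially expressed genes and built a prediction model that identifies early-stage colorectal cancer patients at high risk of recurrence, to support the decision on whether a patient needs adjuvant therapy. The relationship between the identified genes and colorectal cancer, and the underlying mechanisms, still need further confirmation, and we hope the findings will be helpful for treatment in the future.
Background: It remains unclear whether patients with early-stage colorectal cancer should receive adjuvant chemotherapy. To be safe, patients usually undergo the treatment, but only some of them benefit, while others suffer side effects without any benefit. To address this issue, this study used gene expression profiles to develop a gene signature that identifies high-risk early-stage colorectal cancer patients, that is, those with a high probability of recurrence. Methods: All datasets analyzed were retrieved from the public domain, namely GSE40967, GSE17536, and GSE14333 from the GEO. First, we applied the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples to address the data imbalance caused by rare recurrence events. Next, a sequential workflow composed of three statistical methods (SAM, logistic regression, and RFE with random forest) was used to obtain differentially expressed genes. In addition, we repeated the above process on bagged training datasets to develop 10 independent SVM prediction models.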
Lastly, the final prediction of the workflow was determined by an ensemble classifier. Results: The 10 prediction models included 51 unique differentially expressed genes and successfully predicted the risk of recurrence within 3 years in the training dataset, with a sensitivity of 91.18% and a specificity of 83.33%. For the validation datasets, the sensitivity and specificity were 80% and 37.78% for the samples from the USA and 91.67% and 32.84% for those from Australia. Conclusion: From public gene expression datasets, we identified a gene signature that can identify early-stage colorectal cancer patients at high risk of recurrence. The prediction models could potentially serve as a tool for deciding whether patients with early-stage colorectal cancer should undergo adjuvant chemotherapy after surgery. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77348 |
DOI: | 10.6342/NTU201901430 |
Full-text authorization: | Not authorized |
Appears in collections: | Institute of Epidemiology and Preventive Medicine |
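The ensemble step and the evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis code: the function names, the toy predictions, and the tie-breaking rule (a tie counts as predicted recurrence) are assumptions made for illustration only, since the abstract does not say how ties among the 10 models were resolved. Labels are assumed to be 1 for recurrence and 0 for no recurrence.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Combine 0/1 votes from several models for one patient.

    Assumption: a tie is resolved as 1 (recurrence).
    """
    counts = Counter(votes)
    return 1 if counts[1] >= counts[0] else 0


def ensemble_predict(per_model_preds: Sequence[Sequence[int]]) -> List[int]:
    """per_model_preds[m][i] is model m's 0/1 prediction for patient i."""
    # zip(*...) regroups the predictions patient by patient.
    return [majority_vote(patient_votes) for patient_votes in zip(*per_model_preds)]


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 3 models, 4 patients (the thesis used 10 models).
toy_preds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
toy_truth = [1, 0, 1, 0]

final = ensemble_predict(toy_preds)
sens, spec = sensitivity_specificity(toy_truth, final)
print(final, sens, spec)
```

In this toy example every true recurrence is caught, at the cost of one false positive. That is the same trade-off as the pattern reported for the external validation sets, where sensitivity is high and specificity is low.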
Files in this item:
File | Size | Format | |
---|---|---|---|
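ntu-108-R06849032-1.pdf (currently not authorized for public access) | 3.45 MB | Adobe PDF |
Unless their copyright terms are specifically stated, all items in the system are protected by copyright, with all rights reserved.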