Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99850
Title: 早期肺癌病患存活基因預測
Predicting Survival-Associated Genes in Early-Stage Lung Cancer
Author: 杜心南
SHIN-NAN DU
Advisor: 盧子彬
TZU-PIN LU
Keywords: lung adenocarcinoma, gene expression analysis, gene expression, survival analysis, machine learning, prognosis
Publication year: 2025
Degree: Master's
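Abstract (translated from the Chinese):
Background:
According to the Cancer Registry Report of the Health Promotion Administration, Ministry of Health and Welfare, lung cancer ranked first among Taiwan's ten leading cancers in 2021 and 2022 (ROC years 110 to 111). According to the Taiwan Cancer Registry database, more than ten thousand people die of lung cancer each year. Non-small cell lung cancer (NSCLC) accounts for as much as 80% of cases, and lung adenocarcinoma is its most common histological type. Most patients have no smoking history, and the proportion of female patients is higher. Lung adenocarcinoma originates from glandular cells of the lung, is located at the lung periphery, and grows more slowly than other cancers. Its main signs are chest tightness, chest pain, and chronic cough; these symptoms are nonspecific and often appear only at a late stage, which makes early diagnosis difficult.
Although patients with early-stage lung adenocarcinoma have a relatively good prognosis after surgical resection, some still die within two years, which suggests an unidentified high-risk group. This study uses multiple public, multinational gene expression datasets to build a survival prediction model for patients with early-stage lung adenocarcinoma that predicts the risk of death within two years. It also examines the gene expression features associated with high mortality in order to identify potential high-risk groups. The aim is to provide additional reference for clinical risk stratification and for planning individualized treatment strategies.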
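Methods:
This study collected six multinational microarray datasets from the NCBI Gene Expression Omnibus (GEO). GSE30219 (France), GSE50081 (Canada), GSE37745 (Netherlands), and GSE19188 (Sweden) were integrated as the training set; GSE3141 (United States) was used for independent external validation; and GSE8894 (South Korea) was used for exploratory data analysis. Based on overall survival time, samples were divided into a high-risk group (death within two years, class=1) and a low-risk group (class=0) to build binary prediction models. To address class imbalance in the training set, the SMOTEENN technique was applied to improve model learning. Data preprocessing included batch-effect inspection (PCA), normalization, and differential expression analysis. Genes in the intersection of those selected by LASSO and random forest served as the final features for support vector machine, logistic regression, and random forest models, whose performance was then tested on the independent external dataset.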
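The two-year labeling rule can be illustrated with a minimal Python sketch. The abstract does not say how survival time is recorded or how patients censored before two years are handled, so the month unit, the exclusion of early-censored patients, and the function name `risk_class` are all assumptions made for illustration.

```python
from typing import Optional

TWO_YEARS_MONTHS = 24.0  # assumption: overall survival is recorded in months


def risk_class(os_months: float, died: bool) -> Optional[int]:
    """Return 1 (high risk), 0 (low risk), or None if the label is undetermined."""
    if died and os_months <= TWO_YEARS_MONTHS:
        return 1  # death observed within two years
    if os_months > TWO_YEARS_MONTHS:
        return 0  # known to have survived past two years
    return None  # censored before two years: excluded here (assumption)


# Hypothetical patients: (overall survival in months, death observed?)
patients = [(10.0, True), (30.0, True), (60.0, False), (12.0, False)]
labels = [risk_class(months, died) for months, died in patients]
print(labels)  # [1, 0, 0, None]
```

A patient who died at 30 months is labeled low risk under this rule, because the death falls outside the two-year window.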
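Results:
The intersection of features selected by LASSO regression and random forest yielded 50 differentially expressed genes as the final model features. In internal validation, the random forest model had the best sensitivity for identifying high-risk patients (87%), followed by the support vector machine (77%) and logistic regression (73%). Overall, the models showed at least moderate discriminative power for high-risk patients on the training set. In external validation on the U.S. dataset (GSE3141), the random forest model reached a sensitivity of 91%, and in the Kaplan-Meier survival analysis the log-rank test between the two groups was statistically significant for both the support vector machine and the random forest.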
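Sensitivity here is the share of truly high-risk patients (class=1) that a model flags as high risk. A minimal Python sketch, using made-up labels rather than data from the thesis:

```python
def sensitivity(y_true, y_pred):
    """Sensitivity (recall for class 1) = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("y_true contains no high-risk (class 1) patients")
    return tp / (tp + fn)


# Hypothetical labels: 1 = high risk, 0 = low risk
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
print(sensitivity(y_true, y_pred))  # 0.75
```

Sensitivity ignores false positives, so a model can score well on it while still mislabeling many low-risk patients.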
結論:
本研究成功建立以基因表現資料為基礎之早期肺腺癌預後預測模型,模型可協助辨識需更積極介入治療之高風險病患,此外,本研究所識別之差異表現基因仍需進一步驗證與肺腺癌之間關聯,期望未來結合臨床變項與擴增樣本規模,有助於提升模型臨床應用價值。
Abstract (English version):
Background:
According to Taiwan's Health Promotion Administration, lung cancer has ranked first among the top ten causes of cancer-related death since 2021. Non-small cell lung cancer (NSCLC) accounts for over 80% of cases, with adenocarcinoma being the most common subtype. Most lung adenocarcinoma patients are non-smokers, and the condition is more prevalent in females. Originating from glandular cells at the lung periphery, adenocarcinoma progresses slowly and often presents with nonspecific symptoms such as chest tightness and chronic cough, which typically appear only in later stages. Although early-stage patients have favorable outcomes after surgery, some still die within two years, suggesting the presence of high-risk subgroups. This study aims to build a survival prediction model using publicly available GEO gene expression datasets to identify patients at high risk of death within two years and to support clinical risk stratification and treatment decisions.
Methods:
This study integrated six multinational microarray datasets from the GEO database to develop prognostic models for early-stage lung adenocarcinoma. The training set combined GSE30219 (France), GSE50081 (Canada), GSE37745 (Netherlands), and GSE19188 (Sweden), while GSE3141 (United States) was used for external validation and GSE8894 (South Korea) for exploratory analysis. Patients were classified into high-risk (death within two years) and low-risk groups. Data preprocessing included PCA-based batch effect assessment, normalization, and differential expression analysis. Fifty genes selected by both LASSO and Random Forest were used as final features to construct Support Vector Machine, Logistic Regression, and Random Forest models, whose performance was further evaluated on the external dataset.
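The feature-selection step keeps only genes chosen by both selectors. A minimal Python sketch of that intersection follows; the gene names and the helper `intersect_features` are hypothetical, and how each selector produces its gene list (for example, nonzero LASSO coefficients or a random forest importance cutoff) is not stated in the abstract.

```python
def intersect_features(lasso_genes, rf_genes):
    """Keep genes chosen by both selectors, preserving the LASSO order."""
    rf_set = set(rf_genes)
    return [gene for gene in lasso_genes if gene in rf_set]


# Hypothetical gene lists from the two selectors
lasso_selected = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
rf_top_ranked = ["GENE_C", "GENE_E", "GENE_A"]

final_features = intersect_features(lasso_selected, rf_top_ranked)
print(final_features)  # ['GENE_A', 'GENE_C']
```

Taking the intersection is conservative: a gene must be supported by a sparse linear selector and by a tree-based selector before it enters the final models.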
Results:
Using the intersection of features selected by LASSO regression and Random Forest, 50 differentially expressed genes were identified as the final model features. In internal validation, the Random Forest model achieved the highest sensitivity (87%) for identifying high-risk patients, followed by Support Vector Machine (77%) and Logistic Regression (73%). Overall, the models showed moderate to strong discriminative power for high-risk patients on the training set. In external validation with the U.S. dataset (GSE3141), the Random Forest model achieved a sensitivity of 91%, and both the Support Vector Machine and Random Forest models showed statistically significant differences in the Kaplan-Meier survival curves based on the log-rank test.
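The log-rank test compares the observed number of deaths in one predicted group with the number expected if both groups shared the same survival curve. The sketch below is a standard two-group log-rank test in plain Python, not the thesis code; the survival times are made up, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)               # at risk, both groups
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        d = sum(1 for ti, e, _ in data if ti == t and e)         # deaths at time t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of deaths in group A
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("variance is zero; the groups cannot be compared")
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value


# Hypothetical follow-up times in months; event flag 1 = death observed
high_risk_times, high_risk_events = [2, 3, 4, 5, 6], [1, 1, 1, 1, 1]
low_risk_times, low_risk_events = [20, 21, 22, 23, 24], [1, 1, 1, 1, 1]

chi2, p_value = logrank_test(high_risk_times, high_risk_events,
                             low_risk_times, low_risk_events)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

In practice a survival analysis library would normally be used; the hand-written version only makes the observed-versus-expected comparison explicit.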
Conclusion:
This study successfully developed a prognostic prediction model for early-stage lung adenocarcinoma based on gene expression data, which can help identify high-risk patients requiring more aggressive therapeutic interventions. Moreover, the differentially expressed genes identified in this study warrant further validation of their association with lung adenocarcinoma. Future work integrating clinical variables and expanding the sample size may further enhance the clinical applicability and value of the model.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99850
DOI: 10.6342/NTU202502522
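Full-text authorization: Authorized (campus access only)
Electronic full-text release date: 2030-07-25
Appears in collections: Institute of Epidemiology and Preventive Medicine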

Files in this item:
File: ntu-113-2.pdf (not authorized for public access)
Size: 1.46 MB
Format: Adobe PDF

