Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99850

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 盧子彬 | zh_TW |
| dc.contributor.advisor | TZU-PIN LU | en |
| dc.contributor.author | 杜心南 | zh_TW |
| dc.contributor.author | SHIN-NAN DU | en |
| dc.date.accessioned | 2025-09-19T16:05:20Z | - |
| dc.date.available | 2025-09-20 | - |
| dc.date.copyright | 2025-09-19 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-25 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99850 | - |
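| dc.description.abstract | Background:
According to the cancer registry reports of the Health Promotion Administration, Ministry of Health and Welfare, lung cancer ranked first among the ten leading cancers in Taiwan from 2021 through 2022. Taiwan Cancer Registry data show that more than 10,000 people die of lung cancer each year. Non-small cell lung cancer (NSCLC) accounts for as much as 80% of cases, and lung adenocarcinoma is its most common histological type; most patients have no smoking history, and the proportion of female patients is relatively high. Lung adenocarcinoma arises from glandular cells, is located at the lung periphery, and grows more slowly than other cancers. Its main signs are chest tightness, chest pain, and chronic cough. These symptoms are nonspecific and often appear only at a late stage, which makes early diagnosis difficult. Although patients with early-stage lung adenocarcinoma have a relatively good prognosis after surgical resection, some still die within two years, suggesting that an unrecognized high-risk subgroup may exist. This study uses multiple multinational public gene expression datasets to build a survival prediction model for patients with early-stage lung adenocarcinoma that predicts the risk of death within two years. It also examines the gene expression features associated with high mortality in order to identify potential high-risk patients, with the aim of informing clinical risk stratification and individualized treatment planning.
Methods: Six multinational microarray datasets were collected from the NCBI Gene Expression Omnibus (GEO). GSE30219 (France), GSE50081 (Canada), GSE37745 (Netherlands), and GSE19188 (Sweden) were integrated as the training set, GSE3141 (United States) was used for independent external validation, and GSE8894 (South Korea) was used for exploratory data analysis. Based on overall survival time, samples were divided into a high-risk group (death within two years, class=1) and a low-risk group (class=0) to build binary prediction models. To address class imbalance in the training set, SMOTEENN was applied to improve model learning. Data preprocessing included batch-effect inspection (PCA), normalization, and differential expression analysis. Genes in the intersection of those selected by LASSO and random forest served as the final features for building support vector machine, logistic regression, and random forest models, whose performance was then evaluated on independent external datasets.
Results: The intersection of the features selected by LASSO regression and random forest yielded 50 differentially expressed genes as the final model features. In internal validation, the random forest model had the highest sensitivity for identifying high-risk patients (87%), followed by the support vector machine (77%) and logistic regression (73%); overall, the models showed at least moderate discrimination of high-risk patients on the training set. In external validation on the United States dataset (GSE3141), the random forest model reached a sensitivity of 91%, and in the Kaplan-Meier survival analysis the log-rank test between the two risk groups was statistically significant for both the support vector machine and the random forest.
Conclusion: This study established a prognostic prediction model for early-stage lung adenocarcinoma based on gene expression data. The model can help identify high-risk patients who need more aggressive treatment. The association between the identified differentially expressed genes and lung adenocarcinoma still requires further validation. Combining clinical variables and enlarging the sample size in future work is expected to improve the clinical value of the model. | zh_TW |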
| dc.description.abstract | Background:
According to the Health Promotion Administration, Ministry of Health and Welfare, Taiwan, lung cancer has ranked first among the ten leading cancers since 2021. Non-small cell lung cancer (NSCLC) accounts for roughly 80% of cases, with adenocarcinoma being the most common subtype. Most lung adenocarcinoma patients are non-smokers, and the condition is more prevalent in females. Originating from glandular cells at the lung periphery, adenocarcinoma progresses slowly and often presents with nonspecific symptoms such as chest tightness, chest pain, and chronic cough, which typically appear only at later stages. Although early-stage patients have favorable outcomes after surgery, some still die within two years, suggesting the presence of unrecognized high-risk subgroups. This study aims to build a survival prediction model using publicly available GEO gene expression datasets to identify patients at high risk of death within two years and to support clinical risk stratification and treatment decisions.
Methods: This study integrated six multinational microarray datasets from the GEO database to develop prognostic models for early-stage lung adenocarcinoma. The training set combined GSE30219 (France), GSE50081 (Canada), GSE37745 (Netherlands), and GSE19188 (Sweden), while GSE3141 (United States) was used for external validation and GSE8894 (South Korea) for exploratory analysis. Patients were classified into high-risk (death within two years) and low-risk groups. Data preprocessing included PCA-based batch effect assessment, normalization, and differential expression analysis. Fifty genes in the intersection of those selected by LASSO and Random Forest were used as final features to construct Support Vector Machine, Logistic Regression, and Random Forest models, whose performance was further evaluated on external datasets.
Results: Using the intersection of features selected by LASSO regression and Random Forest, 50 differentially expressed genes were identified as the final model features. In internal validation, the Random Forest model achieved the highest sensitivity (87%) for identifying high-risk patients, followed by Support Vector Machine (77%) and Logistic Regression (73%). Overall, the models showed moderate or better discrimination of high-risk patients on the training set. In external validation with the U.S. dataset (GSE3141), the Random Forest model achieved a sensitivity of 91%, and for both the Support Vector Machine and Random Forest models the log-rank test showed a statistically significant difference between the Kaplan-Meier survival curves of the two risk groups.
Conclusion: This study developed a prognostic prediction model for early-stage lung adenocarcinoma based on gene expression data, which can help identify high-risk patients requiring more aggressive therapeutic interventions. The association between the differentially expressed genes identified in this study and lung adenocarcinoma warrants further validation. Future work integrating clinical variables and expanding the sample size may further enhance the clinical applicability of the model. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-19T16:05:19Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-09-19T16:05:20Z (GMT). No. of bitstreams: 0 | en |
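| dc.description.tableofcontents | Oral Examination Committee Certification
Chinese Abstract
English Abstract
Chapter 1 Introduction: 1.1 Research Background; 1.2 Research Motivation and Significance; 1.3 Research Objectives
Chapter 2 Materials and Methods: 2.1 Samples; 2.1.1 Data Sources; 2.1.2 Sample Selection; 2.1.3 Data Preprocessing; 2.2 Methods; 2.2.1 Feature Selection Methods; 2.2.2 Prediction Model Construction and Evaluation; 2.2.3 Risk Group Comparison
Chapter 3 Results: 3.1 Description of Data Integration Results; 3.1.1 Data Sources; 3.1.2 Sample Selection; 3.1.3 Data Preprocessing; 3.2 Feature Selection Results; 3.3 Prediction Model Construction; 3.4 External Validation; 3.5 Exploratory Data Analysis; 3.6 Risk Group Comparison
Chapter 4 Conclusions and Discussion: 4.1 Main Findings; 4.2 Study Limitations; 4.3 Public Health and Clinical Implications
References
List of Figures: Figure 1 Study flowchart; Figure 2 Kaplan-Meier survival curves for the training set; Figure 3 PCA scatter plot before batch-effect correction; Figure 4 PCA scatter plot after batch-effect correction; Figure 5 Volcano plot of differentially expressed gene selection; Figure 6 Venn diagram of genes selected by LASSO regression and random forest; Figure 7 Random forest feature importance ranking; Figure 8 Distribution of regression coefficients under the LASSO model; Figure 9 Receiver operating characteristic curves; Figure 10 Kaplan-Meier survival curves by risk group for logistic regression on the United States dataset; Figure 11 Kaplan-Meier survival curves by risk group for the support vector machine on the United States dataset; Figure 12 Kaplan-Meier survival curves by risk group for random forest on the United States dataset; Figure 13 Kaplan-Meier survival curves by risk group for logistic regression on the South Korean dataset; Figure 14 Kaplan-Meier survival curves by risk group for random forest on the South Korean dataset; Figure 15 Kaplan-Meier survival curves by risk group for the support vector machine on the South Korean dataset
List of Tables: Table 1 Description of the original datasets; Table 2 Summary of lung adenocarcinoma samples included in the analysis; Table 3 Differentially expressed genes; Table 4 Important variables selected by LASSO regression and random forest; Table 5 Feature genes used in the final models; Table 6 Model hyperparameter settings; Table 7 Cross-validation evaluation results; Table 8 Model performance on the South Korean GSE8894 external dataset; Table 9 Model performance on the United States GSE3141 external dataset | - |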
| dc.language.iso | zh_TW | - |
| dc.subject | Gene expression analysis | zh_TW |
| dc.subject | Lung adenocarcinoma | zh_TW |
| dc.subject | Machine learning | zh_TW |
| dc.subject | Survival analysis | zh_TW |
| dc.subject | Gene expression | zh_TW |
| dc.subject | survival analysis | en |
| dc.subject | machine learning | en |
| dc.subject | prognosis | en |
| dc.subject | gene expression | en |
| dc.subject | Lung adenocarcinoma | en |
| dc.title | 早期肺癌病患存活基因預測 | zh_TW |
| dc.title | Predicting Survival-Associated Genes in Early-Stage Lung Cancer | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 蕭自宏;王彥雯 | zh_TW |
| dc.contributor.oralexamcommittee | Tzu-Hung Hsiao;CHARLOTTE WANG | en |
| dc.subject.keyword | Lung adenocarcinoma, gene expression analysis, gene expression, survival analysis, machine learning | zh_TW |
| dc.subject.keyword | Lung adenocarcinoma, gene expression, survival analysis, machine learning, prognosis | en |
| dc.relation.page | 45 | - |
| dc.identifier.doi | 10.6342/NTU202502522 | - |
| dc.rights.note | Authorization granted (access limited to campus) | - |
| dc.date.accepted | 2025-07-28 | - |
| dc.contributor.author-college | College of Public Health | - |
| dc.contributor.author-dept | Institute of Epidemiology and Preventive Medicine | - |
| dc.date.embargo-lift | 2030-07-25 | - |
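| Appears in collections: | Institute of Epidemiology and Preventive Medicine | |
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (not authorized for public access) | 1.46 MB | Adobe PDF | View/Open |
Unless their copyright terms are specifically stated, all items in the system are protected by copyright, with all rights reserved.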
