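Combining Imaging and Molecular Biomarkers for the Distinct Classification for the Detection and Long-term Breast Cancer Survival Using Machine Learning Method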

謝淳雅; Chun-Ya Hsieh

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89624

Title:	Combining Imaging and Molecular Biomarkers for the Distinct Classification for the Detection and Long-term Breast Cancer Survival Using Machine Learning Method
Author:	謝淳雅 Chun-Ya Hsieh
Advisor:	陳秀熙 Hsiu-Hsi Chen
Keywords:	machine learning, breast cancer, detection classification, prognostic classification, early detection, prognosis
Year of publication:	2023
Degree:	Master's
Abstract:	Introduction: Although the prognosis of breast cancer has improved greatly, owing to early detection through population-based mammography screening programs implemented worldwide and to advances in treatment targeting a variety of tumor histological and immunohistochemical characteristics, a subset of breast cancer patients still has a suboptimal prognosis. On the other hand, low-risk women whose breast cancer is identified through screening may be subject to overdetection. In this study, we aimed to elucidate early-detection features and prognostic features in order to stratify breast cancer cases into risk groups that can guide clinical intervention for precision health care.
Materials and Methods: Breast cancer cases in Dalarna County, Sweden, diagnosed between 1996 and 2021 and collected using a prospective cohort design were used in this study. Information was collected on age at diagnosis and three generations of prognostic indicators: tumor size, nodal status, and grade (first generation); immunohistochemical characteristics (ER, PR, HER2, and Basal type; second generation); and mammographic appearances (powdery, stellate, circular, crushed stone, casting, and architectural distortion; third generation). Factors associated with the detection mode and breast cancer survival were also collected, including lymphovascular invasion, focality, tumor extent, histological type, the characteristics of in situ lesions, and the modes of treatment and therapy (surgery, chemotherapy, radiotherapy, and hormonal therapy with Tamoxifen). A machine learning design was applied to identify the early-detection classification and the prognostic classification of breast cancer. Five machine learning algorithms (logistic regression, random forest, support vector machine, artificial neural network, and Bayesian neural network) were applied, using detection mode and breast cancer death as the targets for extracting the features of detection and prognosis, respectively. To evaluate the performance of each algorithm, recall, precision, F1 score, the receiver operating characteristic (ROC) curve and the area under the curve (AUC), and the Brier score were used.
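As a minimal sketch of the evaluation metrics named above (recall, precision, F1 score, AUC, and Brier score), the following pure-Python example computes each one from first principles. The labels and predicted probabilities are made-up toy values, not data from the thesis, and the function names and the 0.5 decision threshold are assumptions chosen for illustration.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 score for binary labels (1 = event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not thesis data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
brier = brier_score(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
print(f"Brier={brier:.3f} AUC={auc:.4f}")
```

Note that precision, recall, and F1 depend on the chosen threshold, whereas AUC and the Brier score are computed from the predicted probabilities directly; a lower Brier score indicates better-calibrated, more accurate probabilities.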
A series of Cox regression models with time to breast cancer death as the outcome was used to validate the detection classification and prognostic classification derived from the machine learning design described above.
Results: A total of 5909 women in Dalarna County with breast cancer diagnosed between 1996 and 2021 were enrolled in this study, comprising 2752 screen-detected and 3157 clinically detected cases; 599 breast cancer deaths were identified. For early detection, the top five features were age, tumor size, mammographic appearance, tumor extent, and histological type. For prognosis, the top five features were tumor size, nodal involvement, age at diagnosis, mammographic appearance, and histological type. Of the five machine learning algorithms applied, random forest and artificial neural network performed best on the evaluation metrics. For the early-detection features, the F1 scores for random forest and artificial neural network were 0.92 and 0.90, respectively; the corresponding AUCs were 0.94 and 0.85, and the Brier scores were 0.07 and 0.09. For the prognostic features, the F1 scores for random forest and artificial neural network were 0.63 and 0.305, respectively; the corresponding AUCs were 0.94 and 0.85, and the Brier scores were 0.06 and 0.07. The causal relationships among the features for early detection and prognosis were further examined using a Bayesian network algorithm, in which histological type played an important role in association with both mammographic appearance and immunohistochemical characteristics. The detection classification and prognostic classification derived from random forest and artificial neural network were then used to stratify breast cancer patients by risk.
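One way such a stratification could be set up is to rank patients by their predicted risk and cut the ranking into ten equal-sized groups. The sketch below illustrates this under stated assumptions: the helper name `decile_groups`, the rank-based handling of ties, and the toy risk values are all hypothetical, and the abstract does not specify how the thesis defined its risk groups beyond referring to percentiles and deciles.

```python
from collections import Counter
from typing import List, Sequence


def decile_groups(risks: Sequence[float]) -> List[int]:
    """Assign each predicted risk a decile from 1 (lowest) to 10 (highest)."""
    n = len(risks)
    if n == 0:
        return []
    # Rank patients from lowest to highest predicted risk.
    order = sorted(range(n), key=lambda i: risks[i])
    deciles = [0] * n
    for rank, idx in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 onto groups 1..10.
        deciles[idx] = rank * 10 // n + 1
    return deciles


# Toy predicted risks (hypothetical): 20 distinct values in scrambled order.
toy_risks = [0.05 * ((7 * i) % 20) for i in range(20)]
toy_deciles = decile_groups(toy_risks)
group_sizes = Counter(toy_deciles)

print(toy_deciles)
print(sorted(group_sizes.items()))
```

The resulting decile label could then enter a survival model as a categorical covariate, with one decile chosen as the reference group.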
Using the risk stratification derived from random forest, breast cancer patients with a risk below the 20th percentile had an extremely low risk of breast cancer death compared with those in the 6th decile. The risk of breast cancer death was similar for the 3rd and 4th deciles, and both were close to that of the average-risk group in the 6th decile. The risk of breast cancer death increased for the 7th decile (hazard ratio, HR: 1.88, 95% CI: 0.86, 4.10), 8th decile (HR: 6.60, 95% CI: 3.36, 12.94), 9th decile (HR: 14.48, 95% CI: 7.62, 27.54), and 10th decile (HR: 72.14, 95% CI: 38.46, 135.32), showing a dose-response trend. The results were consistent after adjusting for the rank of detection classification. A similar but less striking gradient across risk deciles was observed when using the risk stratification derived from the artificial neural network.
Conclusion: Using a machine learning design with a series of machine learning algorithms for the early detection and prognosis of breast cancer, we identified early-detection features and prognostic features and elucidated their relationship using the Bayesian neural network. The risk stratification derived from the proposed study design and machine learning algorithms can facilitate the development of precision health care for breast cancer guided by risk level.
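The hazard ratios quoted above can be sanity-checked with a short calculation. The sketch below assumes the intervals are Wald-type 95% confidence intervals computed on the log hazard scale, which is the usual convention for Cox models but is not stated in the abstract. Under that assumption, the geometric mean of the two bounds should reproduce the point estimate, and the interval width gives the standard error of the log hazard ratio. The function name `log_scale_summary` is hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

# (HR, lower bound, upper bound) as reported in the abstract,
# each relative to the 6th-decile reference group.
reported = {
    "7th decile": (1.88, 0.86, 4.10),
    "8th decile": (6.60, 3.36, 12.94),
    "9th decile": (14.48, 7.62, 27.54),
    "10th decile": (72.14, 38.46, 135.32),
}


def log_scale_summary(hr: float, lower: float, upper: float):
    """Return (HR implied by the CI, SE of log HR, Wald z statistic).

    Assumes a symmetric Wald interval on the log scale:
    log(HR) +/- Z_95 * SE.
    """
    log_lower, log_upper = math.log(lower), math.log(upper)
    implied_hr = math.exp((log_lower + log_upper) / 2)
    log_se = (log_upper - log_lower) / (2 * Z_95)
    z = math.log(hr) / log_se
    return implied_hr, log_se, z


summaries = {label: log_scale_summary(*vals) for label, vals in reported.items()}
for label, (implied_hr, log_se, z) in summaries.items():
    print(f"{label}: implied HR={implied_hr:.2f}, SE(log HR)={log_se:.3f}, z={z:.2f}")
```

Under this assumption the reported point estimates agree with their intervals to within rounding, and only the 7th-decile interval includes 1, so that group's excess risk over the reference is not statistically distinguishable at the 5% level.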
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89624
DOI:	10.6342/NTU202303137
Full-text authorization:	Authorized (open access on campus only)
Appears in department:	Institute of Epidemiology and Preventive Medicine

Files in this item:

File	Size	Format
ntu-111-2.pdf (access restricted to NTU campus IP addresses; off campus, please use the VPN service)	8.43 MB	Adobe PDF


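Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.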