NTU Theses and Dissertations Repository › College of Bioresources and Agriculture › Department of Biomechatronics Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101969
Title: Vision-Language Multimodal Network for Colorectal Tumor Segmentation via Integration of Clinical Text Descriptions
Author: Yu-Xin Lin (林育新)
Advisor: Cheng-Ying Chou (周呈霙)
Keywords: Colorectal Cancer, Segmentation, Multi-Modal Learning, Cross-Attention Mechanism, Learning Using Privileged Information (LUPI), Deep Learning
Publication Year: 2026
Degree: Master
Abstract:
Computed tomography plays a critical role in the early screening and treatment planning of colorectal cancer. In recent years, deep learning-based models have gained increasing attention as auxiliary tools for tumor segmentation. However, accurate segmentation remains challenging due to the high morphological heterogeneity of colorectal tumors and their low contrast with surrounding tissues. Although recent studies have shown that incorporating multimodal clinical information, such as tumor location and TNM staging, can substantially improve segmentation performance, these approaches face a fundamental limitation in real-world clinical screening: comprehensive pathological information is typically unavailable at the initial screening stage, creating a critical modality-missing problem.

To address this challenge, this thesis proposes the Multi-Modal Cross-Attention U-Net (MCA-UNet), a framework that integrates a PubMedBERT language encoder with a 3D U-Net backbone through a multi-scale cross-attention mechanism, enabling effective fusion of clinical semantic priors into visual feature extraction. Furthermore, inspired by the Learning Using Privileged Information (LUPI) paradigm, a training strategy is introduced in which clinical features are randomly masked during training, encouraging the model to learn robust image representations even when clinical inputs are partially or entirely absent. At inference time, the model operates with only a generic textual prompt, eliminating the need for patient-specific clinical data.
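The fusion and masking ideas above can be sketched minimally in Python. This is an illustrative, single-head, NumPy-only approximation; the function names and tensor shapes are assumptions for exposition, not code from the thesis:

```python
import numpy as np

def cross_attention(img_feats, txt_feats):
    """Fuse text embeddings into image features (single-head sketch).

    img_feats: (n_voxels, d) flattened 3D image features (queries).
    txt_feats: (n_tokens, d) language-encoder token embeddings (keys/values).
    """
    d = img_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)   # (n_voxels, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return img_feats + weights @ txt_feats          # residual fusion

def mask_text(txt_feats, p_mask, rng):
    """LUPI-style training trick: with probability p_mask, drop the clinical
    text features so the model learns to segment without that modality."""
    if rng.random() < p_mask:
        return np.zeros_like(txt_feats)
    return txt_feats
```

In the actual framework this fusion would occur at multiple encoder scales, and the masked path would receive a generic prompt rather than zeros; the zero-mask here only illustrates the missing-modality training regime.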

Evaluated on an internal National Taiwan University Hospital dataset comprising 541 patients, MCA-UNet achieves a Dice Similarity Coefficient (DSC) of 58.18%, an absolute improvement of approximately 8.2% over the nnU-Net baseline, and attains a Normalized Surface Dice (NSD) of 40.86%. Furthermore, through systematic threshold optimization, the proposed approach achieves a patient-level sensitivity of 84% and specificity of 93.8%, substantially reducing false positives. These results not only validate the effectiveness of clinical semantic priors in guiding visual segmentation but also demonstrate a practical, clinically deployable solution to the modality-missing challenge in automated colorectal cancer screening.
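The reported metrics can be made concrete with a small sketch. The volume-based patient-level decision rule below is an assumption for illustration only; the abstract does not specify the exact thresholding criterion used:

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    # Dice Similarity Coefficient between two binary masks:
    # 2 * |pred ∩ gt| / (|pred| + |gt|)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def patient_level_metrics(volumes, threshold):
    """Hypothetical patient-level rule: call a patient positive if the
    predicted tumor volume exceeds a threshold. Sweeping the threshold
    trades sensitivity against specificity, as in threshold optimization.

    volumes: list of (predicted_volume, has_tumor) pairs per patient.
    """
    tp = sum(v > threshold for v, pos in volumes if pos)
    fn = sum(v <= threshold for v, pos in volumes if pos)
    tn = sum(v <= threshold for v, pos in volumes if not pos)
    fp = sum(v > threshold for v, pos in volumes if not pos)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec
```

A perfect mask overlap yields a DSC of 1.0, disjoint masks yield 0.0; the patient-level helper shows how a single cutoff jointly determines sensitivity and specificity.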
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101969
DOI: 10.6342/NTU202504585
Full-Text Authorization: Authorized (worldwide public access)
Electronic Full-Text Release Date: 2031-01-01
Appears in Collections: Department of Biomechatronics Engineering

Files in This Item:
ntu-114-1.pdf: 11.23 MB, Adobe PDF (publicly available online after 2031-01-01)


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
