  1. NTU Theses and Dissertations Repository
  2. College of Public Health
  3. Institute of Epidemiology and Preventive Medicine
Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99859
Title: Development and Validation of Predictive Models for Tuberculosis Transmission Clusters in Kaohsiung, Taiwan
Original title: 台灣高雄地區結核病預測傳播群聚模型之建立與驗證
Author: Yi-Hsin Chen (陳奕欣)
Advisor: Hsien-Ho Lin (林先和)
Keywords: Tuberculosis, Transmission cluster, Whole-genome sequencing, Predictive model, Machine learning
Year of publication: 2025
Degree: Master's
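Abstract (translated from the Chinese): Background
Tuberculosis (TB) is one of the most serious public health threats worldwide. The World Health Organization (WHO) estimates that there were 10.8 million new TB cases globally in 2023. Because TB has a long latency period, relying solely on patients to seek care often leads to delayed diagnosis and ongoing transmission. The WHO therefore recommends that public health systems actively screen for potential cases and intervene early in order to achieve the global goal of ending TB by 2035. Although many studies have examined risk factors for cluster growth and built predictive models, Taiwan lacks research that combines molecular techniques with epidemiological data for predictive analysis. This study analyzed TB transmission clusters in Kaohsiung, Taiwan, integrating whole-genome sequencing (WGS) data with epidemiological characteristics. Using data from the first two notified cases in each cluster, we built binary classification models to predict whether the cluster would later expand beyond two cases. The results can support the development of an early warning system, strengthen TB control strategies, and help identify high-risk factors for the formation of large transmission clusters.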
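Methods
This study included all culture-positive Mycobacterium tuberculosis cases notified in Kaohsiung, Taiwan, between 2019 and 2023. Whole-genome sequencing was used to analyze single-nucleotide polymorphisms (SNPs) between isolates, and isolates differing by no more than 12 SNPs were assigned to the same transmission cluster. We built a dataset combining WGS and epidemiological characteristics, with the first two notified cases in each cluster as the unit of analysis. A cluster whose cumulative case count eventually reached three or more was defined as a "large cluster"; otherwise it was a "small cluster". To keep prevalent clusters that may already have existed at the start of the study from distorting the predictions, we constructed three datasets: a complete dataset containing all cases, and two subsets containing only clusters that newly emerged after 2019 (followed for two and three years, respectively).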
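The clustering rule described above (isolates within 12 SNPs of each other belong to the same transmission cluster) can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the abstract does not say how the threshold is applied to groups of more than two isolates, so single linkage is assumed here, and the isolate names, the distance table, and the `cluster_isolates` function are hypothetical. The thesis analyses were run in R; Python is used only for illustration.

```python
from itertools import combinations

SNP_THRESHOLD = 12  # from the abstract: isolates differing by <= 12 SNPs are linked


def cluster_isolates(isolates, snp_distance, threshold=SNP_THRESHOLD):
    """Group isolates by single linkage on pairwise SNP distance (an assumption)."""
    parent = {iso: iso for iso in isolates}

    def find(x):
        # Walk up to the root of x's group, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(isolates, 2):
        if snp_distance[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for iso in isolates:
        groups.setdefault(find(iso), []).append(iso)
    # An isolate linked to no other isolate does not form a transmission cluster.
    return [sorted(members) for members in groups.values() if len(members) >= 2]


# Hypothetical isolates and pairwise SNP distances, for illustration only.
isolates = ["A", "B", "C", "D", "E"]
snp_distance = {
    frozenset(("A", "B")): 3, frozenset(("A", "C")): 15,
    frozenset(("A", "D")): 40, frozenset(("A", "E")): 55,
    frozenset(("B", "C")): 10, frozenset(("B", "D")): 42,
    frozenset(("B", "E")): 60, frozenset(("C", "D")): 38,
    frozenset(("C", "E")): 57, frozenset(("D", "E")): 12,
}

clusters = cluster_isolates(isolates, snp_distance)
# Label clusters with the abstract's definition: three or more cases is "large".
labels = ["large" if len(c) >= 3 else "small" for c in clusters]
print(list(zip(clusters, labels)))
```

Under single linkage, A and C end up in the same cluster even though they differ by 15 SNPs, because both are within 12 SNPs of B. A stricter rule that requires every pair in a cluster to be within the threshold would split them.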
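Three models were used for prediction: random forest, logistic regression, and support vector machine (SVM). The data were randomly split into training and test sets at a 7:3 ratio. The modeling pipeline included missing-value imputation, upsampling, and cross-validation, and was repeated 2,000 times to improve model stability and reliability. Finally, variable importance ranking and sensitivity analysis were used to evaluate predictive performance. All analyses were performed in R version 4.5.0.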
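Results
Whole-genome sequencing confirmed 1,197 cases as members of transmission clusters. In total, 325 transmission clusters were identified, of which 126 were large clusters and 199 were small clusters. The two-year follow-up dataset contained 66 large and 24 small clusters; the three-year follow-up dataset contained 48 large and 24 small clusters.
We selected three predictors from the first two notified cases: age, the interval between notification dates, and sputum smear results. Each model was run for 2,000 simulations, and the run whose AUC matched the median AUC was taken as representative of model performance.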
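The idea of repeating a random 7:3 split many times and reporting the run with the median AUC can be sketched as below. This is an illustration under stated assumptions, not the thesis code. The data are synthetic. The "model" is a toy one-feature rule standing in for the random forest, logistic regression, and SVM models. The split is done within each class so that every test set contains both classes, which the abstract does not specify. The imputation, upsampling, and cross-validation steps are omitted. An odd number of runs is used so that the median is an AUC that an actual run produced; the abstract does not say how the median run was chosen among 2,000.

```python
import random


def auc(scores, labels):
    """AUC: chance that a large cluster outscores a small one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def split_70_30(rows, rng, train_fraction=0.7):
    """Random split within each class (assumption) so both sets hold both classes."""
    train, test = [], []
    for label in (True, False):
        group = [r for r in rows if r["large"] == label]
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train += group[:cut]
        test += group[cut:]
    return train, test


def one_run(rows, seed):
    """Split, fit a toy one-feature rule on the training set, return the test AUC."""
    train, test = split_70_30(rows, random.Random(seed))
    large = [r["interval_days"] for r in train if r["large"]]
    small = [r["interval_days"] for r in train if not r["large"]]
    # Toy "model": learn only the direction of the association from training data.
    sign = -1 if sum(large) / len(large) < sum(small) / len(small) else 1
    scores = [sign * r["interval_days"] for r in test]
    return auc(scores, [r["large"] for r in test])


# Synthetic clusters for illustration only; these are not the thesis data.
data_rng = random.Random(0)
rows = [{"interval_days": data_rng.randint(10, 200), "large": True} for _ in range(24)]
rows += [{"interval_days": data_rng.randint(50, 400), "large": False} for _ in range(36)]

N_RUNS = 201  # odd, so the median is an AUC that an actual run produced
aucs = [one_run(rows, seed) for seed in range(N_RUNS)]
median_auc = sorted(aucs)[N_RUNS // 2]
median_seed = aucs.index(median_auc)  # first run that attains the median AUC
print(f"median AUC {median_auc:.3f} from the run with seed {median_seed}")
```

Reporting the median run rather than the best run avoids presenting a lucky split as typical performance.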
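On the complete dataset, logistic regression performed best, with an AUC of 0.690 (95% CI: 0.601–0.776); SVM was next, with an AUC of 0.685 (95% CI: 0.596–0.772); random forest performed worse, with an AUC of 0.623 (95% CI: 0.532–0.711). We then analyzed clusters that newly emerged from 2020 onward. With two years of follow-up, logistic regression had an AUC of 0.643 (95% CI: 0.436–0.850); random forest was close to random guessing at only 0.507 (95% CI: 0.300–0.707); SVM fell between the two, with an AUC of 0.621 (95% CI: 0.328–0.821). In the three-year follow-up dataset, logistic regression had a median AUC of 0.673 (95% CI: 0.449–0.857); random forest again performed poorly, with a median AUC of 0.541 (95% CI: 0.316–0.745); SVM had a median AUC of 0.643 (95% CI: 0.357–0.847). The two- and three-year follow-up datasets were small, which widened the AUC confidence intervals and made predictive performance less stable. Across the three data settings, logistic regression was the most stable and best-performing model overall, followed by SVM, with random forest showing the weakest predictive ability.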
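Variable importance analysis showed that the interval between the notification dates of the first two cases was the most predictive factor. Clusters whose early cases had a short notification interval and a younger mean age were more likely to become large clusters. The predictive effect of sputum smear results was more variable. Sensitivity analysis showed that none of the models' predictive performance amounted to random guessing. Overall, the results indicate that the notification interval and age of the first two cases are key factors in predicting cluster expansion.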
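Conclusion
A model built on three factors (sputum smear positivity, mean age, and notification date interval) has the potential to predict large clusters early. In this study, more than half of TB transmission clusters contained only two cases, indicating that many transmission events can be interrupted early and underscoring the key role of timely case isolation and basic epidemiological investigation in outbreak control. Overall, we should continue to strengthen basic control measures and promptly launch expanded investigations for clusters the model flags as high-risk. A flexible and targeted response strategy can help optimize the use of limited public health resources, speed up outbreak response, and effectively curb disease transmission.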
Abstract (author's English version): Background.
Tuberculosis (TB) is one of the most serious public health threats worldwide. According to the World Health Organization (WHO), there were 10.8 million new TB cases worldwide in 2023. Because TB has a long latency period, relying solely on patients to seek medical attention often leads to delayed diagnosis and continued transmission. The WHO therefore recommends that public health systems proactively identify potential cases and intervene early to achieve the global goal of ending TB by 2035.
Although previous studies have examined risk factors for cluster growth and built predictive models, Taiwan lacks research that combines molecular techniques with epidemiological data for prediction. This study investigated TB transmission clusters in Kaohsiung, Taiwan, by integrating whole-genome sequencing (WGS) data with epidemiological characteristics. Using information from the first two reported cases in each cluster, we developed binary classification models to predict whether a cluster was likely to expand beyond two cases. Our findings can support the development of early warning systems, enhance TB control strategies, and help identify risk factors associated with the formation of large transmission clusters.
Methods.
This study included all culture-positive Mycobacterium tuberculosis cases reported between 2019 and 2023 in Kaohsiung, Taiwan. Whole-genome sequencing (WGS) was used to analyze single-nucleotide polymorphisms (SNPs) in the isolates. Isolates with a pairwise SNP distance of 12 or fewer were assigned to the same transmission cluster. The dataset was created by combining WGS data with epidemiological characteristics, using the first two notified cases in each cluster as the unit of analysis. Clusters with three or more total cases were classified as large clusters, while those with fewer than three were designated as small clusters. Three datasets were constructed: a complete dataset that included all cluster cases, and two subsets restricted to incident clusters after 2019, followed for two and three years, respectively. The two- and three-year follow-up datasets were designed to minimize potential bias from clusters that already existed at the start of the study period. Three modeling approaches were applied: Random Forest (RF), Simple Logistic Regression (SLR), and Support Vector Machine (SVM). The data were randomly split into training and testing sets in a 70:30 ratio. The modeling pipeline included missing-value imputation, upsampling, and cross-validation, and was repeated 2,000 times to improve robustness and stability. Finally, variable importance calculation and sensitivity analysis were conducted to evaluate model performance. All analyses were performed using R version 4.5.0.
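As a rough illustration of the unit of analysis described above, the sketch below turns one cluster's cases into a single model row built from its first two notified cases. The `Case` fields, the `cluster_record` function, and the example values are hypothetical. Encoding the smear result as the number of smear-positive cases among the first two is an assumption; the abstract does not say how this predictor was coded. The thesis analyses were run in R; Python is used only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Case:
    notified: date        # notification date
    age: int              # age in years at notification
    smear_positive: bool  # sputum smear result


def cluster_record(cases):
    """Build one model row from a cluster's cases (a cluster has at least two)."""
    ordered = sorted(cases, key=lambda c: c.notified)
    first, second = ordered[0], ordered[1]
    return {
        "mean_age": (first.age + second.age) / 2,
        "interval_days": (second.notified - first.notified).days,
        # Assumed encoding: how many of the first two cases were smear-positive.
        "smear_positive_count": int(first.smear_positive) + int(second.smear_positive),
        # Outcome from the abstract: three or more total cases is a large cluster.
        "large": len(ordered) >= 3,
    }


# Hypothetical cluster of three cases, listed out of notification order.
example = [
    Case(date(2020, 3, 1), 34, True),
    Case(date(2020, 1, 15), 28, False),
    Case(date(2021, 6, 10), 61, True),
]
row = cluster_record(example)
print(row)
```

Only the two earliest cases feed the predictors; later cases affect the row solely through the `large` outcome label, which is what lets such a model be applied as soon as a second case is notified.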
Results.
A total of 1,197 cases were included as members of transmission clusters based on whole-genome sequencing (WGS) data. We identified 325 clusters; 126 were classified as large and 199 as small clusters. In the two-year follow-up dataset, 66 large and 24 small clusters were included, while the three-year follow-up dataset comprised 48 large and 24 small clusters.
Three predictors were used in this study: the average age of the first two notified cases in each cluster, the time interval between their notification dates, and their sputum smear results.
In the complete dataset, the logistic regression model showed the best performance, with a median AUC of 0.690 (95% CI: 0.601–0.776), followed by the support vector machine (SVM) with a median AUC of 0.685 (95% CI: 0.596–0.772). The random forest model performed the worst, with a median AUC of 0.623 (95% CI: 0.532–0.711). In the two-year follow-up dataset, the logistic regression model again outperformed the others, achieving a median AUC of 0.643 (95% CI: 0.436–0.850). Random forest performed close to chance, with an AUC of 0.507 (95% CI: 0.300–0.707). SVM achieved an AUC of 0.621 (95% CI: 0.328–0.821). For the three-year follow-up dataset, logistic regression maintained the best performance (AUC = 0.673; 95% CI: 0.449–0.857), followed by SVM (AUC = 0.643; 95% CI: 0.357–0.847), while random forest performed poorly (AUC = 0.541; 95% CI: 0.316–0.745). The two- and three-year follow-up datasets were small, which widened the AUC confidence intervals and made predictive performance less stable.
Feature importance analysis indicated that the notification interval between the first two reported cases was the most influential predictor. Clusters with shorter notification intervals and a younger average age were more likely to evolve into large clusters. The predictive contribution of sputum smear status varied across datasets. Sensitivity analyses further indicated that the models' performance was not attributable to chance.
In summary, the notification interval and mean age of the first two reported cases were identified as key predictors of cluster expansion.
Conclusion.
Our study demonstrated that a predictive model incorporating sputum smear positivity, mean age, and notification date interval showed promise for early identification of large tuberculosis transmission clusters.
Expanded investigations can therefore be initiated earlier for clusters that the model identifies as high-risk. Implementing a flexible and targeted response strategy can optimize the allocation of limited public health resources and ultimately reduce the spread of tuberculosis.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99859
DOI: 10.6342/NTU202503740
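Full-text authorization: Authorized (open access worldwide)
Full-text release date: 2025-09-20
Appears in collections: Institute of Epidemiology and Preventive Medicine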

Files in this item:
File           Size       Format
ntu-113-2.pdf  748.06 kB  Adobe PDF


Unless their copyright terms are specifically stated, all items in this system are protected by copyright, with all rights reserved.
