Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99918
| Title: | 建構混合資料的基因網路與差異網路:知識導向與資料驅動的集成方法 Genetic Network and Differential Network Construction with Mixed Data: Knowledge-based and Data-driven Ensemble Approach |
| Author: | 廖振博 Chen-Po Liao |
| Advisor: | 蕭朱杏 Chuhsing Kate Hsiao |
| Keywords: | gene network, differential network, mixed data, nonparametric ensemble network construction, logistic regression, UK Biobank, wearable device |
| Publication Year: | 2025 |
| Degree: | Doctoral |
| Abstract: | With the rapid advancement of high-throughput technologies and multi-omics research, the biomedical field has accumulated a vast amount of high-dimensional, highly heterogeneous data, enabling deeper investigation into the mechanisms of complex diseases. These data are often mixed, comprising both continuous and discrete variables, and frequently deviate markedly from the assumptions of multivariate normality or other exponential family distributions. Because complex diseases are typically regulated jointly by multiple genes and biomolecules, researchers often use network and differential network models to analyze the relationships among variables of interest. However, traditional network models usually assume multivariate normality or apply to only a single data type, which limits their applicability and interpretability when real-world data do not meet these assumptions. Against this background, the study has three primary objectives.
First, a systematic literature review of network models is conducted and organized by data type. The review synthesizes existing undirected graphical models for single and mixed data types, together with their parameter estimation methods and application contexts, and clarifies the challenges and limitations associated with each data type. Examples include the Gaussian graphical model (GGM) for Gaussian data and the Ising model for binary data.
Second, with non-Gaussian or high-dimensional data, conventional data-driven network construction methods are constrained by data quality, estimation approaches, and distributional assumptions, and they often ignore validated prior biological knowledge. To incorporate prior knowledge and relax distributional assumptions, this study proposes a nonparametric ensemble knowledge-based and data-driven (NEKBDD) algorithm for gene network construction, which combines the strengths of knowledge-based and data-driven approaches. The algorithm first infers the node degree distribution from reference networks collected from public knowledge databases, then generates sample degree sequences to construct potential network structures. Next, a confidence matrix computed from experimental data is used for node labeling. Finally, an ensemble method combined with thresholding aggregates the multiple potential networks. Statistical simulations show that, across different association strengths and data distributions, NEKBDD outperforms traditional network estimation methods on multiple performance metrics and yields more robust results. In an application to cancer research, the algorithm successfully reconstructed the JAK-STAT pathway and identified previously validated hub genes and gene-gene relationships, such as those involving PIK3CA, SOS1, and IL22RA2.
Third, this study proposes a method for estimating differential networks with mixed data. By combining the joint density functions of a mixed graphical model with the posterior odds, we show that the interaction coefficient in a logistic regression model is equivalent to the difference in network edge parameters between conditions. This equivalence allows differential network edge estimation to inherit the consistency and asymptotic normality of logistic regression coefficient estimators, making parameter estimation more efficient and enabling statistical testing. The derivation also extends quadratic discriminant analysis (QDA) from its original multivariate normal assumption to more complex mixed data. Furthermore, to account for the different types of edges present in a differential network of mixed data, we include a weight vector (penalty factor) in the model so that different edges receive different degrees of regularization, and we ensemble the results across weight vectors to obtain a measure of uncertainty for each differential network edge. Finally, the method was applied to a chronotype (circadian rhythm) study that integrates genetic and wearable device data from the UK Biobank (Application ID: 134902), identifying hub nodes in the differential network and important differential interactions between genes and physical activity. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99918 |
| DOI: | 10.6342/NTU202502974 |
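| Full-text authorization: | Authorized (open to the public worldwide) |
| Full-text release date: | 2029-07-30 |
| Appears in collections: | Institute of Epidemiology and Preventive Medicine |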
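The final NEKBDD step named in the abstract (an ensemble of potential networks followed by thresholding) can be illustrated with a minimal sketch. The thesis record does not give the aggregation rule, so the edge-frequency average, the 0.5 cutoff, the function name `ensemble_threshold`, and the toy candidate networks below are all assumptions made for illustration, not the author's implementation.

```python
import numpy as np

def ensemble_threshold(candidates, threshold=0.5):
    """Aggregate candidate undirected networks and threshold the result.

    candidates: sequence of (p, p) 0/1 adjacency matrices (hypothetical input).
    threshold:  assumed cutoff on the edge selection frequency.
    """
    stack = np.asarray(candidates, dtype=float)   # shape (B, p, p)
    freq = stack.mean(axis=0)                     # share of candidates containing each edge
    freq = (freq + freq.T) / 2                    # keep the aggregate symmetric
    np.fill_diagonal(freq, 0.0)                   # no self-loops
    adjacency = (freq >= threshold).astype(int)   # keep sufficiently frequent edges
    np.fill_diagonal(adjacency, 0)
    return freq, adjacency

# Three toy candidate networks on three nodes (made-up data).
a1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # edges 0-1 and 1-2
a2 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # edge 0-1 only
a3 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # edges 0-1 and 0-2

freq, adjacency = ensemble_threshold([a1, a2, a3], threshold=0.5)
```

In this toy input only the edge 0-1 appears in every candidate, so it is the only edge that survives the assumed 0.5 cutoff; the other two edges each appear in one of three candidates and are dropped.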
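The equivalence stated in the abstract, between logistic regression interaction coefficients and differences in edge parameters, can be checked numerically in the simplest special case. The sketch below assumes purely Gaussian data with equal class priors, which is the classical QDA setting rather than the mixed graphical model of the thesis. It evaluates the exact log posterior odds, which is the surface a correctly specified logistic regression targets, and recovers its coefficients by least squares instead of fitting labels. The precision matrices `omega0` and `omega1`, the means, and the helper `gaussian_logpdf` are hypothetical. With the pairwise edge parameter written as the negative off-diagonal precision entry, the coefficient on x_j x_k should equal the edge parameter in condition 1 minus that in condition 0.

```python
import numpy as np
from itertools import combinations

def gaussian_logpdf(X, mean, precision):
    """Row-wise log density of N(mean, precision^{-1})."""
    p = mean.size
    diff = X - mean
    _, logdet_prec = np.linalg.slogdet(precision)
    quad = np.einsum("ni,ij,nj->n", diff, precision, diff)
    return 0.5 * (logdet_prec - p * np.log(2 * np.pi) - quad)

# Hypothetical precision matrices for two conditions; only edge (0, 1) differs.
omega0 = np.array([[2.0, 0.5, 0.0], [0.5, 2.0, 0.3], [0.0, 0.3, 2.0]])
omega1 = np.array([[2.0, -0.4, 0.0], [-0.4, 2.0, 0.3], [0.0, 0.3, 2.0]])
mu0 = np.zeros(3)
mu1 = np.array([0.5, -0.2, 0.1])

# Evaluation points; with equal priors the log posterior odds is the log density ratio.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
log_odds = gaussian_logpdf(X, mu1, omega1) - gaussian_logpdf(X, mu0, omega0)

# Design matrix: intercept, main effects, squared terms, pairwise interactions.
pairs = list(combinations(range(3), 2))
design = np.column_stack(
    [np.ones(len(X)), X, X**2] + [X[:, j] * X[:, k] for j, k in pairs]
)
coef, *_ = np.linalg.lstsq(design, log_odds, rcond=None)
interaction = coef[-len(pairs):]            # coefficients on x_j * x_k, j < k

# Edge parameter theta_jk = -omega_jk, so the difference is -(omega1 - omega0)_jk.
edge_diff = np.array([-(omega1[j, k] - omega0[j, k]) for j, k in pairs])
```

Here `interaction` and `edge_diff` agree up to floating-point error: the coefficient for the pair (0, 1) is 0.9 and the other two are zero, matching the single differential edge built into the toy matrices. Estimation from labeled samples, the penalty factors, and the mixed-data case described in the abstract are outside this sketch.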
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (available online after 2029-07-30) | 14.1 MB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
