Category: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/83230
2026-03-11T10:34:14Z

Title: 整合單核苷酸多型性和基因表現以探索台灣乳癌之生物標記與舊藥新用; Integrated Approaches Utilizing Single Nucleotide Polymorphism and Gene Expression for Exploring Biomarkers and Potential Drug Repurposing in Taiwanese Breast Cancers
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96801
Author: 徐于晴; Yu-Ching Hsu
Abstract: Precision medicine aims to provide tailored treatment for individuals by drawing on distinct types of patient information, including genetic, lifestyle, and environmental factors. Realizing it requires two major objectives: disease subtyping and individualized treatment. Identifying molecular biomarkers advances disease subtyping, and the accumulating evidence in pharmacogenomics supports more precise treatment recommendations. In this study, we propose integrated approaches that use single nucleotide polymorphism (SNP) and gene expression data to identify biomarkers and potential drug repurposing strategies for Taiwanese breast cancers. Genome-wide association studies (GWASs) examine hundreds of thousands of genetic variants at once to identify potential disease-associated biomarkers, and polygenic risk score (PRS) analyses use GWAS results to build prediction models for disease risk. Previous breast cancer GWASs were mostly conducted in Caucasian populations, so their results may not apply to other populations. We therefore conducted a large GWAS in the Taiwanese population. A total of 12,480 participants from China Medical University Hospital (CMUH), comprising 2,496 breast cancer cases and 9,984 controls, were included in the GWAS and PRS analyses. We identified 113 SNPs associated with breast cancer, of which 50 are novel.
We also observed statistically significant positive trends in the PRS analyses for all breast cancers and for the luminal A subtype; the odds ratios (95% confidence intervals) for the highest-PRS groups were 5.33 (3.79–7.66) and 3.55 (2.13–6.14), respectively. Large-scale drug response data have recently been profiled across collections of cancer cell lines, and translating this pharmacogenomic knowledge from in vitro to in vivo settings is crucial for advancing drug repurposing. We addressed this by deconvoluting tumors into cancer cell lines and developing a drug response prediction algorithm that uses the deconvolution results. We adopted the architecture of a previously developed deep-learning model for analyzing cell composition to train new tumor deconvolution models, which we call Scaden-CA, one for each of 18 cancer types, including breast cancer. The models were trained and tested on simulated data generated from single-cell RNA-seq data of cancer cell lines and then validated on Cancer Cell Line Encyclopedia (CCLE) bulk RNA-seq data. The Scaden-CA models performed well in model testing (concordance correlation coefficient > 0.9 across all cancers) and model validation (correct deconvolution rate > 70% across most cancers). We further applied the models to real tumor RNA-seq data from The Cancer Genome Atlas (TCGA) and developed a drug response prediction algorithm. TCGA mutation and gene expression data were also used to investigate the mechanisms underlying drug repurposing. To investigate gene expression differences between the Taiwanese population and other populations, and to infer drug repurposing for Taiwanese breast cancers, we first imputed gene expression from the genotype profiles of CMUH breast cancer patients using PrediXcan.
About 78.4% of genes in the information provided by PredictAP showed differences between East Asian and non-Finnish European populations. Certain genes, including breast cancer-associated genes, showed expression disparity in the Taiwanese population, while most genes showed expression patterns similar to those of the East Asian population. These findings may help explain differences in the clinical traits of breast cancers. To infer drug repurposing for Taiwanese breast cancers, we used PredictAP to filter out imputed genes that may be biased by population differences, and then used the remaining genes to retrieve the corresponding breast cancer-gene-drug combinations from the TCGA drug repurposing results. In this way, we can not only identify novel and existing breast cancer-gene-drug combinations but also explore the associations among SNPs, gene expression, and drug repurposing.
2024-01-01T00:00:00Z

Title: 從蛋白質表徵到蛋白質交互作用組推斷的端到端學習; Complete end-to-end learning from protein feature representation to protein interactome inference
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93077
Author: 陳瑜欣; Yu-Hsin Chen
Abstract: Protein complexes are key functional units in cellular processes, and recovering their composition is critical to understanding the mechanistic basis of those processes. Chromatographic fractionation coupled with mass spectrometry (CF-MS) is a high-throughput technique that has significantly advanced protein complex studies by enabling global interactome inference. To infer protein-protein interactions (PPIs) from CF-MS data, current machine learning pipelines usually first infer PPIs from handcrafted CF-MS features and then apply clustering algorithms to form potential protein complexes. While powerful, these methods suffer from the potential bias of handcrafted features, overfitting caused by severely imbalanced data distributions, and false negatives and false positives caused by noise in the CF-MS data itself. To address handcrafted features and data imbalance, I present a balanced end-to-end learning architecture, Software for Prediction of Interactome with Feature-extraction Free Elution Data (SPIFFED), which uses a convolutional neural network (CNN) to integrate feature representation from raw CF-MS data with interactome prediction. SPIFFED outperforms state-of-the-art methods in predicting PPIs under conventional imbalanced training. When trained with balanced data, SPIFFED shows greatly improved sensitivity for true PPIs. Moreover, the ensemble SPIFFED model provides different voting schemes to integrate predicted PPIs from biological replicates or multiple CF-MS datasets, allowing users to select high-confidence interactions according to their CF-MS experimental design.
To reduce prediction errors caused by noise in elution profiles, I further propose another balanced end-to-end learning architecture, Feature Representation Enhancement End-to-end Protein Interaction Inference (FREEPII). FREEPII also uses a CNN as its main architecture. Unlike SPIFFED, FREEPII learns feature representations of individual proteins rather than of protein pairs, reducing computational complexity from O(N^2) to O(N). In addition, FREEPII uses multiple inputs to extend the information available for calculating similarity between proteins, and uses protein embeddings to transfer network-level information about protein complexes into the learned feature representations, rescaling the strength of interactions present in CF-MS data. FREEPII outperforms EPIC and SPIFFED in both PPI classification and PPI clustering. Through visualization, I reveal the advantages of FREEPII in representation learning and classification. In summary, the proposed methods resolve the data compression and bias issues introduced by the feature extraction step, learn more general feature representations, and discover positive interactions more accurately under balanced training. By considering inputs with different properties and network-level information, they effectively reduce the prediction error caused by noise in the elution profiles, outperforming previous analysis methods in both PPI classification and PPI clustering. Finally, through cross prediction, I demonstrate that combining CF-MS data of different resolutions for model training can significantly improve the generalizability of the CNN for PPI classification across experiments.
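The contrast between pair-level and protein-level representations in the abstract above can be illustrated with a minimal sketch. It is not the FREEPII implementation: `encode` is a hypothetical stand-in for a learned encoder (here just L2 normalisation), the elution profiles are made-up numbers, and reading the O(N^2) to O(N) claim as a count of representation computations is an assumption of this example. Each protein is encoded once, and pairs are then scored with an inexpensive cosine similarity.

```python
import math
from itertools import combinations


def encode(profile):
    """Hypothetical stand-in for a learned encoder: L2-normalise the profile."""
    norm = math.sqrt(sum(x * x for x in profile)) or 1.0
    return [x / norm for x in profile]


def score_pairs(profiles):
    """Encode each protein once (N calls), then score pairs by cosine similarity."""
    embeddings = {name: encode(p) for name, p in profiles.items()}
    scores = {}
    for a, b in combinations(sorted(embeddings), 2):
        scores[(a, b)] = sum(x * y for x, y in zip(embeddings[a], embeddings[b]))
    return scores


def encoder_calls(n_proteins):
    """Encoder calls needed: per-protein encoding vs. per-pair encoding."""
    per_protein = n_proteins
    per_pair = n_proteins * (n_proteins - 1) // 2
    return per_protein, per_pair


# Made-up elution profiles: P1 and P2 co-elute, P3 elutes elsewhere.
profiles = {
    "P1": [0.0, 1.0, 4.0, 1.0, 0.0],
    "P2": [0.0, 2.0, 8.0, 2.0, 0.0],
    "P3": [5.0, 1.0, 0.0, 0.0, 0.0],
}
scores = score_pairs(profiles)
```

In this toy setup the co-eluting pair P1/P2 receives a much higher score than P1/P3, and `encoder_calls(1000)` shows 1,000 encoder calls for the per-protein design against 499,500 for a per-pair design. The pairwise similarity step still visits every pair, but it is a cheap dot product rather than a network forward pass.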
2024-01-01T00:00:00Z

Title: 利用未剪切轉錄建構情境依賴的基因調控網路可解釋調控動態及細胞軌跡; Context-Dependent Gene Regulatory Network Explains Regulation Dynamics and Cell Trajectories Using Unspliced Transcripts
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/83235
Author: 杜岳華; Yueh-Hua Tu
Abstract: Gene regulatory networks govern the complex gene expression programs in various biological phenomena, including cell development, cell fate decisions, and oncogenesis. Single-cell techniques provide higher resolution in gene expression than traditional bulk RNA sequencing, but they also incur more noise and sparser expression measurements, which makes inferring gene regulatory networks from such profiles challenging. Inferring a complete gene regulatory network across different cell types is also difficult. Here, we propose to address the problem by constructing context-dependent gene regulatory networks (CDGRN) from single-cell RNA sequencing data. A gene regulatory network is decomposed into subgraphs that correspond to distinct transcriptomic contexts. Each subgraph is composed of the consensus active regulation pairs of transcription factors and their target genes shared by a group of cells. The activity of each regulation pair in different cell groups is inferred by a Gaussian mixture model using both spliced and unspliced transcript expression levels.
We find that the union of gene regulation pairs across all contexts provides sufficient information to reconstruct differentiation trajectories. CDGRN establishes a connection between gene regulation at the molecular level and cell differentiation at the macroscopic level. Functions related to the cell cycle, cell differentiation, or tissue-specific roles are enriched in each context throughout developmental progression. Surprisingly, we observe that the network entropy of CDGRN decreases as differentiation progresses, implying directionality in differentiation. In conclusion, we leverage the advantages of single-cell RNA sequencing to connect molecular regulation with differentiation trajectories. Context-dependent network entropy may indicate the maturity of cells in certain contexts. The CDGRN model is available at https://github.com/yuehhua/CDGRNs.jl
2022-01-01T00:00:00Z
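To make the notion of a decreasing network entropy concrete, here is a minimal sketch. The abstract does not define the entropy used in CDGRN, so this example assumes one common choice, the Shannon entropy of the normalised degree distribution; the function name `degree_entropy` and both toy edge lists are hypothetical and are not taken from CDGRNs.jl.

```python
import math
from collections import Counter


def degree_entropy(edges):
    """Shannon entropy (bits) of how edge endpoints are distributed over nodes.

    Assumed definition for illustration only; the thesis may use another one.
    """
    degree = Counter()
    for regulator, target in edges:
        degree[regulator] += 1
        degree[target] += 1
    total = sum(degree.values())
    return -sum((d / total) * math.log2(d / total) for d in degree.values())


# Toy contexts: regulation spread evenly vs. concentrated on one regulator.
early_context = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
late_context = [("A", "B"), ("A", "C"), ("A", "D")]

early_entropy = degree_entropy(early_context)
late_entropy = degree_entropy(late_context)
```

Under this assumed definition, the evenly connected toy network has higher entropy than the one dominated by a single regulator, which matches the direction of the trend described in the abstract without reproducing its actual measure.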