Complete end-to-end learning from protein feature representation to protein interactome inference

陳瑜欣; Yu-Hsin Chen

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93077

Title:	Complete end-to-end learning from protein feature representation to protein interactome inference
Author:	陳瑜欣 Yu-Hsin Chen
Advisor:	蔡懷寬 Huai-Kuang Tsai
Co-advisors:	呂俊毅 Jun-Yi Leu; 阮雪芬 Hsueh-Fen Juan
Keywords:	Co-fractionation coupled with mass spectrometry analysis, Protein-protein interactions, Protein interactome inference, Convolutional neural network, End-to-end learning, Representation learning
Publication year:	2024
Degree:	Doctoral
Abstract:	Protein complexes are key functional units in cellular processes, and recovering their composition is critical to understanding the mechanistic basis of those processes. Chromatographic fractionation coupled with mass spectrometry (CF-MS) is a high-throughput technique that has significantly advanced protein complex studies by enabling global interactome inference. To infer protein-protein interactions (PPIs) from CF-MS data, current machine learning pipelines usually first infer PPIs from handcrafted CF-MS features and then apply clustering algorithms to form potential protein complexes.
While powerful, these methods suffer from the potential bias of handcrafted features, overfitting caused by severely imbalanced data distributions, and false negatives and false positives caused by noise in the CF-MS data itself. To address the issues of handcrafted features and data imbalance, I present a balanced end-to-end learning architecture, Software for Prediction of Interactome with Feature-extraction Free Elution Data (SPIFFED), which uses a convolutional neural network (CNN) to integrate feature representation from raw CF-MS data with interactome prediction. SPIFFED outperforms state-of-the-art methods in predicting PPIs under conventional imbalanced training. When trained with balanced data, SPIFFED shows greatly improved sensitivity for true PPIs. Moreover, the ensemble SPIFFED model provides different voting schemes to integrate predicted PPIs from biological replicates or from multiple CF-MS datasets, allowing users to select high-confidence interactions according to their CF-MS experimental design. To reduce prediction errors caused by noise in elution profiles, I further propose another balanced end-to-end learning architecture, Feature Representation Enhancement End-to-end Protein Interaction Inference (FREEPII). FREEPII also uses a CNN as its main architecture. Unlike SPIFFED, FREEPII learns feature representations of individual proteins rather than of protein pairs, reducing computational complexity from O(N^2) to O(N). In addition, FREEPII uses multiple inputs to extend the information available for calculating data similarity between proteins, and uses protein embeddings to transfer network-level information about protein complexes into the feature representations, rescaling the strength of interactions present in the CF-MS data. FREEPII outperforms EPIC and SPIFFED in both PPI classification and PPI clustering.
Through visualization, I reveal the advantages of FREEPII in representation learning and classification judgment. Through cross prediction, I demonstrate that combining CF-MS data of different resolutions for model training can significantly improve the generalizability of the CNN for PPI classification across different experiments. In summary, the proposed methods address the data compression and bias issues introduced by the feature extraction step, learn a more general feature representation, and discover positive interactions more accurately under balanced training. By considering inputs with different properties together with network-level information, they effectively reduce the prediction errors caused by noise in the elution profiles, outperforming previous analysis methods in both PPI classification and PPI clustering. Finally, I demonstrate that combining CF-MS data of different resolutions for model training can significantly improve the generalizability of the CNN for PPI classification.
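The complexity reduction mentioned in the abstract (representing individual proteins instead of protein pairs) can be illustrated with a minimal sketch. This is not the FREEPII implementation: `encode` is a hypothetical stand-in that merely L2-normalises an elution profile where the thesis uses a CNN, and the protein names and profiles are invented toy data. The point is only that the expensive encoder runs once per protein (N calls), after which each candidate pair is scored with a cheap dot product.

```python
from itertools import combinations
import math


def encode(profile):
    """Hypothetical stand-in encoder (NOT a CNN): L2-normalise one elution profile."""
    norm = math.sqrt(sum(x * x for x in profile)) or 1.0
    return [x / norm for x in profile]


def pairwise_scores(profiles):
    """Encode each protein once, then score every pair by cosine similarity."""
    # N encoder calls: one per protein, not one per protein pair.
    embeddings = {name: encode(p) for name, p in profiles.items()}
    scores = {}
    # N(N-1)/2 cheap comparisons on the precomputed embeddings.
    for a, b in combinations(sorted(embeddings), 2):
        scores[(a, b)] = sum(x * y for x, y in zip(embeddings[a], embeddings[b]))
    return embeddings, scores


# Toy elution profiles (assumed data): P1 and P2 co-elute, P3 does not.
profiles = {
    "P1": [0.0, 1.0, 3.0, 1.0, 0.0],
    "P2": [0.0, 2.0, 6.0, 2.0, 0.0],
    "P3": [5.0, 1.0, 0.0, 0.0, 0.0],
}
embeddings, scores = pairwise_scores(profiles)
```

A pair-based model would instead pass every one of the N(N-1)/2 candidate pairs through the encoder, which is where the quadratic cost arises.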
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93077
DOI:	10.6342/NTU202401653
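Full-text authorization:	Authorized (on-campus access only)
Full-text release date:	2029-07-10
Appears in Collections:	International Graduate Doctoral Degree Program in Bioinformatics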

Files in This Item:

File	Size	Format
ntu-112-2.pdf (not currently authorized for public access)	3.94 MB	Adobe PDF

Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.