Data simulation using variational autoencoder to benchmark single-cell multi-omics models

蕭如秀; Ru-Xiu Hsiao

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90946

Title:	Data simulation using variational autoencoder to benchmark single-cell multi-omics models
Author:	蕭如秀 Ru-Xiu Hsiao
Advisor:	陳倩瑜 Chien-Yu Chen
Co-advisor:	歐陽彥正 Yen-Jen Oyang
Keywords:	Single-cell sequencing, Single-cell multi-omics, Single-cell multi-omics integration, Variational autoencoder, Data simulation
Publication year:	2022
Degree:	Master's
Abstract:	Single-cell sequencing technology allows researchers to characterize biological processes at single-cell resolution, advancing fields such as developmental biology and cancer biology. Building on this, many experimental technologies have been developed to measure multiple modalities in the same cell simultaneously. For example, single-nucleus chromatin accessibility and mRNA expression sequencing (SNARE-seq) measures both gene expression and chromatin accessibility in the same cell. With these technologies, researchers can annotate cell types and reconstruct cell lineage trajectories more reliably. Many methods now aim to integrate the paired single-cell omics data generated by these technologies using generative models that learn joint latent representations across the modalities of multi-omics data. However, these methods cannot map variation in the latent space back to each modality, which makes it difficult to interpret how the latent representation relates to the original data of each modality. Given this, Hsieh et al. proposed a new method to integrate single-cell gene expression data and chromatin accessibility data, which uses conditional latent representations to isolate the sources of variation in different modalities. However, because single-cell multi-omics datasets are scarce, it is difficult to validate these models systematically, which hinders the development and evaluation of single-cell multi-omics integration models. This study therefore aims to generate simulated datasets that can serve as benchmarks for evaluating data integration models. It simulates single-cell gene expression data and chromatin accessibility data by manipulating the latent representations produced by generative models trained on each modality independently. Passing a manipulated latent representation through the generative component of the model yields the corresponding high-dimensional data, so the simulated dataset encodes the manipulation. Researchers can then evaluate a data integration model from different perspectives by comparing the latent representations it learns from the simulated dataset with the manipulated latent representations. In summary, the software developed in this study implements three different manipulations of single-cell latent representations. The study examined the effectiveness of the three manipulations using t-SNE visualization and demonstrated how the newly generated data can be used to evaluate the performance of the method proposed by Hsieh et al. This software is expected to facilitate the development of data integration models when only a limited number of single-cell multi-omics datasets are available.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90946
DOI:	10.6342/NTU202300518
Full-text authorization:	Not authorized
Appears in collections:	Genome and Systems Biology Degree Program
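
The workflow summarized in the abstract (manipulate a latent representation, then decode it into simulated high-dimensional data) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `ToyDecoder` is a random linear map standing in for a trained VAE decoder, `shift_cluster` is one hypothetical manipulation and is not claimed to be one of the three implemented in the thesis, and the dimensions, labels, and offset are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)


class ToyDecoder:
    """Hypothetical stand-in for a trained VAE decoder (not the thesis model)."""

    def __init__(self, latent_dim, n_features, rng):
        self.W = rng.normal(size=(latent_dim, n_features))
        self.b = rng.normal(size=n_features)

    def __call__(self, z):
        # Softplus keeps the decoded values non-negative.
        return np.logaddexp(0.0, z @ self.W + self.b)


def shift_cluster(z, labels, cluster, offset):
    """Return a copy of z with the cells of one cluster translated by offset."""
    z_new = z.copy()
    z_new[labels == cluster] += offset
    return z_new


# Assumed toy sizes; real data would have thousands of genes and peaks.
latent_dim, n_genes, n_peaks, n_cells = 4, 50, 80, 200

# Placeholder cluster labels and per-modality latent codes. In practice the
# codes would come from encoders trained independently on each modality.
labels = rng.integers(0, 3, size=n_cells)
z_rna = rng.normal(size=(n_cells, latent_dim))
z_atac = rng.normal(size=(n_cells, latent_dim))

decode_rna = ToyDecoder(latent_dim, n_genes, rng)
decode_atac = ToyDecoder(latent_dim, n_peaks, rng)

# Manipulate only the gene expression latent space, then decode both modalities.
offset = np.array([3.0, 0.0, 0.0, 0.0])
z_rna_manip = shift_cluster(z_rna, labels, cluster=1, offset=offset)

sim_rna = decode_rna(z_rna_manip)   # simulated gene expression matrix
sim_atac = decode_atac(z_atac)      # simulated chromatin accessibility matrix
```

Because the manipulated latent codes (`z_rna_manip`) are known, they can serve as a reference when comparing against the latent representation that an integration model learns from `sim_rna` and `sim_atac`.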

Files in this item:

File	Size	Format
ntu-111-2.pdf (not authorized for public access)	34.31 MB	Adobe PDF


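Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.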
