  1. NTU Theses and Dissertations Repository
  2. College of Bioresources and Agriculture
  3. Department of Agronomy
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94611
Title: The effect of feature selection on single cell RNA-Seq data clustering (單細胞序列資料中特徵選取對細胞層級分群的影響)
Authors: Ching-Hsuan Chen (陳靜萱)
Advisor: Chen-An Tsai (蔡政安)
Keyword: Single-cell RNA-seq, Differentially expressed gene, Gaussian copula, Random forest, Feature selection, Cell clustering, Consensus clustering
Publication Year: 2024
Degree: Master's
Abstract: DUBStepR (Determining the Underlying Basis using Stepwise Regression) is a gene selection method designed for single-cell RNA sequencing (scRNA-seq) data. It aims to identify the optimal set of feature genes that maximizes the separation between cell types in the feature space. Across several measures of cell clustering performance, DUBStepR significantly outperforms existing single-cell feature selection methods.
However, we found three potential issues with DUBStepR. First, it does not consider gene expression levels: it captures only gene-gene correlation when screening and ranking genes by importance, and it measures that correlation with the Pearson linear correlation coefficient, which fails to reflect the non-linear structure of real scRNA-seq data. Second, on datasets with a large number of genes, DUBStepR cannot select enough genes, which leads to poor clustering performance. Third, DUBStepR does not select genes evenly: after equal-width partitioning, the genes are unevenly distributed across groups, so applying the same Z-score threshold of 0.7 to every group may be inappropriate. Consequently, DUBStepR cannot reliably select a sufficient number of genes, resulting in poor cell clustering performance.
In this thesis, we propose three modified gene selection methods based on DUBStepR to address these issues. First, RFCell-DUBStepR considers both gene correlation and expression levels: it uses random forests to compute gene importance from the gene expression data together with negative samples generated by random shuffling. Second, Copula-DUBStepR captures non-linear correlations between genes using an estimated Gaussian copula function. Finally, Zqt-DUBStepR retains the genes in the top 5% quantile, accounting for the unequal number of genes across the equal-width groups, so that important genes are retained more evenly.
In simulated data, RFCell-DUBStepR and Copula-DUBStepR outperform the original DUBStepR in scenarios with few (2-4 groups) and many (5-6 groups) cell types, respectively. In real datasets, our three modified methods consistently select more genes and achieve higher average accuracy than DUBStepR. In the Ting dataset, RFCell-DUBStepR selects 2,105 genes, Copula-DUBStepR 488, and Zqt-DUBStepR 529. In the Petropoulos dataset, RFCell-DUBStepR selects 3,966 genes, Copula-DUBStepR 4,602, and Zqt-DUBStepR 403. These results show that RFCell-DUBStepR, Copula-DUBStepR, and Zqt-DUBStepR outperform DUBStepR in cell clustering on both simulated and real datasets.
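The contrast the abstract draws between the Pearson coefficient and a Gaussian copula can be illustrated with a small sketch. The abstract does not say which estimator the thesis uses, so the code below assumes one common choice, the normal-scores (rank-based) correlation: each gene's values are converted to ranks, mapped through the inverse standard normal CDF, and then correlated. All function names and the toy data are hypothetical and are not taken from DUBStepR or the thesis. A Gaussian copula captures monotone non-linear dependence, not arbitrary non-linear relationships.

```python
from math import exp, sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def normal_scores(values):
    """Map values to standard normal quantiles of rank / (n + 1)."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]


def gaussian_copula_corr(x, y):
    """Assumed estimator: Pearson correlation of the normal scores."""
    return pearson(normal_scores(x), normal_scores(y))


# Toy data (hypothetical): gene y is a monotone but strongly
# non-linear function of gene x.
x = [float(v) for v in range(1, 11)]
y = [exp(v) for v in x]

print("Pearson:        ", round(pearson(x, y), 3))
print("Gaussian copula:", round(gaussian_copula_corr(x, y), 3))
```

On this toy pair the Pearson coefficient falls well below 1 even though the two genes are perfectly monotonically related, while the normal-scores correlation recovers the full dependence, because it depends only on the ranks.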
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94611
DOI: 10.6342/NTU202403772
Fulltext Rights: Not authorized
Appears in Collections: Department of Agronomy

Files in This Item:
File: ntu-112-2.pdf (Restricted Access)
Size: 4.5 MB
Format: Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
