Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94611
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 蔡政安 | zh_TW |
dc.contributor.advisor | Chen-An Tsai | en |
dc.contributor.author | 陳靜萱 | zh_TW |
dc.contributor.author | Ching-Hsuan Chen | en |
dc.date.accessioned | 2024-08-16T17:02:24Z | - |
dc.date.available | 2024-08-17 | - |
dc.date.copyright | 2024-08-16 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-08-10 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94611 | - |
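dc.description.abstract | DUBStepR (Determining the Underlying Basis using Stepwise Regression) is a gene selection method designed specifically for single-cell RNA sequencing (scRNA-seq) data. Its goal is to identify the optimal gene set that maximizes the separation between cell types in the feature space. Across multiple cell clustering performance measures, DUBStepR significantly outperforms existing single-cell gene selection methods.
However, we found three potential problems with DUBStepR. First, DUBStepR does not take gene expression levels into account; it captures only gene correlations when screening and ranking genes by importance, and it measures those correlations with the Pearson linear correlation coefficient, which fails to reflect the non-linear structure of real data. Second, DUBStepR cannot select a sufficient number of genes in datasets with a large number of genes, which leads to poor clustering. Third, DUBStepR does not select genes evenly: after equal-width partitioning, genes are unevenly distributed across the groups, so applying the same Z-score threshold of 0.7 to select genes may not be appropriate. As a result, DUBStepR cannot reliably select enough genes, and cell clustering suffers. In this thesis, we propose three improved methods based on DUBStepR to address these problems. First, RFCell-DUBStepR considers both gene correlation and expression level, using a random forest to compute gene importance from the gene expression data and negative samples generated by permutation. Second, Copula-DUBStepR uses a Gaussian copula function to capture non-linear correlations between genes. Finally, Zqt-DUBStepR retains the genes in the top 5% quantile, which accounts for the unequal number of genes in each equal-width group and retains important genes more evenly. In simulated data, RFCell-DUBStepR and Copula-DUBStepR outperform the original DUBStepR in scenarios with few (2-4 groups) and many (5-6 groups) cell types, respectively. In real data, our improved methods consistently achieve higher average accuracy than DUBStepR and select more genes. In the Ting dataset, RFCell-DUBStepR selected 2105 genes, Copula-DUBStepR 488, and Zqt-DUBStepR 529; in the Petropoulos dataset, RFCell-DUBStepR selected 3966 genes, Copula-DUBStepR 4602, and Zqt-DUBStepR 403. The results on simulated and real data show that the proposed RFCell-DUBStepR, Copula-DUBStepR, and Zqt-DUBStepR all outperform DUBStepR in cell clustering performance. | zh_TW |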
dc.description.abstract | DUBStepR (Determining the Underlying Basis using Stepwise Regression) is a gene selection method for single-cell RNA sequencing (scRNA-seq) data. It is designed to identify the optimal set of feature genes that maximizes cell type separation within the feature space. This method outperforms existing single-cell feature selection methods on multiple measures of cell clustering performance.
However, we found three potential issues with DUBStepR. First, it does not consider gene expression. Specifically, it captures only gene correlation during gene importance selection and ranking, using the Pearson correlation coefficient, which is inadequate for the complex, non-linear structures of real scRNA-seq datasets. Second, DUBStepR has poor clustering performance on datasets with many genes. Third, DUBStepR does not select genes evenly. After dividing genes by correlation, the distribution is biased, and using the same Z-score threshold for each group may be inappropriate. Consequently, DUBStepR cannot reliably select enough genes, resulting in poor cell clustering performance. In this paper, we propose three modified gene selection methods based on DUBStepR. First, RFCell-DUBStepR considers both gene correlation and expression levels, using random forests to compute gene importance after generating negative samples through random shuffling. Second, Copula-DUBStepR captures non-linear correlations using the estimated Gaussian copula function. Finally, Zqt-DUBStepR retains the top 5% quantile of grouped genes, ensuring more even retention of important genes. In simulated data, RFCell-DUBStepR and Copula-DUBStepR outperform the original DUBStepR in scenarios with few (2-4 groups) and many (5-6 groups) cell types, respectively. In real datasets, our three modified methods consistently select more genes and achieve higher average accuracy than DUBStepR. In the Ting dataset, RFCell-DUBStepR selects 2,105 genes, Copula-DUBStepR 488, and Zqt-DUBStepR 529. In the Petropoulos dataset, RFCell-DUBStepR selects 3,966 genes, Copula-DUBStepR 4,602, and Zqt-DUBStepR 403. These results demonstrate that RFCell-DUBStepR, Copula-DUBStepR, and Zqt-DUBStepR outperform DUBStepR in cell clustering performance in both simulated and real datasets. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-16T17:02:24Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-08-16T17:02:24Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Oral Examination Committee Certification
Acknowledgements
Chinese Abstract
Abstract
Chapter 1 Introduction
Chapter 2 Literature Review
Chapter 3 Materials and Methods
3.1 Feature Selection Methods
3.1.1 DUBStepR
3.1.2 RFCell-DUBStepR
3.1.3 Copula-DUBStepR
3.1.4 Zqt-DUBStepR
3.2 Clustering Methods
3.2.1 SC3 Clustering
3.2.2 Consensus Clustering, CC
3.3 Performance Evaluation
3.3.1 Adjusted Rand Index, ARI
3.3.2 Normalized Mutual Information, NMI
3.3.3 DE-ratio
3.4 Visualization of Cell Type Clustering Results
Chapter 4 Simulation Study
4.1 Simulation Method
4.2 Parameter Setting
4.3 Results
Chapter 5 Real Datasets Analysis
5.1 Description of Real Datasets
5.2 Ting Dataset Analysis
5.3 Petropoulos Dataset Analysis
Chapter 6 Conclusion
References
Supplementary Materials | - |
dc.language.iso | en | - |
dc.title | 單細胞序列資料中特徵選取對細胞層級分群的影響 | zh_TW |
dc.title | The effect of feature selection on single cell RNA-Seq data clustering | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 薛慧敏;陳琬萍 | zh_TW |
dc.contributor.oralexamcommittee | Hui-Min Hsueh;Wan-Ping Chen | en |
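dc.subject.keyword | Single-cell RNA sequencing, differentially expressed genes, Gaussian copula, random forest, feature selection, cell clustering, consensus clustering | zh_TW |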
dc.subject.keyword | Single-cell RNA-seq, Differentially expressed gene, Gaussian copula, Random forest, Feature selection, Cell clustering, Consensus clustering | en |
dc.relation.page | 58 | - |
dc.identifier.doi | 10.6342/NTU202403772 | - |
dc.rights.note | Not authorized | - |
dc.date.accepted | 2024-08-13 | - |
dc.contributor.author-college | College of Bioresources and Agriculture | - |
dc.contributor.author-dept | Department of Agronomy | - |
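
The abstract states that Copula-DUBStepR replaces the Pearson coefficient with a correlation estimated through a Gaussian copula in order to capture non-linear gene-gene dependence. The record does not describe the estimator itself, so the following is only a minimal sketch under stated assumptions: it uses the common normal-scores approach (rank each gene, map the ranks through the inverse standard normal CDF, then take the Pearson correlation of the transformed values). The function names `pearson`, `normal_scores`, and `gaussian_copula_correlation` and the toy data are hypothetical and are not taken from the thesis or from the DUBStepR package.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def normal_scores(values):
    """Map values to standard-normal scores through their ranks.

    Ties receive the average rank. Ranks are divided by (n + 1) so the
    inverse normal CDF is never evaluated at 0 or 1.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank of the run
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


def gaussian_copula_correlation(x, y):
    """Assumed estimator: Pearson correlation of the normal scores."""
    return pearson(normal_scores(x), normal_scores(y))


# Toy data (hypothetical): gene_b is a monotone but strongly non-linear
# function of gene_a, so the linear coefficient understates the dependence.
gene_a = [float(i) for i in range(1, 11)]
gene_b = [math.exp(v) for v in gene_a]

pearson_r = pearson(gene_a, gene_b)
copula_r = gaussian_copula_correlation(gene_a, gene_b)
print(f"Pearson: {pearson_r:.3f}  normal-scores copula: {copula_r:.3f}")
```

Because the transform depends only on ranks, any strictly increasing relationship between two genes yields the same score vectors and therefore a correlation of 1, whereas the Pearson coefficient on the raw values is pulled down by the curvature. Real scRNA-seq counts are discrete and contain many ties at zero, which this sketch handles only through average ranks; a count-aware estimator may behave differently.
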
Appears in Collections: | Department of Agronomy |
Files in This Item:
File | Size | Format | |
---|---|---|---|
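ntu-112-2.pdf (currently not authorized for public access) | 4.5 MB | Adobe PDF |
Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.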