Feature Selection on Cross-laboratory Prostate Cancer RNA-sequencing Data

Tzung-Chien Hsieh; 謝宗潛

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6671

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	趙坤茂(Kun-Mao Chao)
dc.contributor.author	Tzung-Chien Hsieh	en
dc.contributor.author	謝宗潛	zh_TW
dc.date.accessioned	2021-05-17T09:16:04Z	-
dc.date.available	2013-09-25
dc.date.available	2021-05-17T09:16:04Z	-
dc.date.copyright	2013-09-25
dc.date.issued	2012
dc.date.submitted	2013-08-31
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6671	-
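dc.description.abstract	Over the past few years, RNA-sequencing has become an indispensable tool in transcriptomics research. Because RNA-sequencing experiments are very costly, researchers rarely have enough samples for more sophisticated studies of differential gene expression. Samples produced by different laboratories differ considerably because of differences in laboratory environments, so few studies integrate data from multiple laboratories into a larger dataset. This study investigates feature selection on cross-laboratory data. The experiments use four prostate cancer datasets from different laboratories and apply rank-based normalization to reduce inter-laboratory differences. We first combine three datasets into a training set and use the remaining dataset as the testing set. Random Forest is used to identify differentially expressed genes in the training set, and a support vector machine then builds a classification model from the training set using those genes. The model is then used to predict the classes of the testing set, and the accuracy with and without rank-based normalization is compared. Results show that rank-based normalization effectively improves classification accuracy on the testing set, and that rank-based normalization combined with Random Forest also outperforms Cuffdiff. Besides normalization and the feature selection algorithm, the sequencing machine is an important factor affecting the results: newer machines provide more stable and accurate data, leading to higher classification accuracy.	zh_TW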
dc.description.abstract	Over the past few years, RNA-sequencing has become a revolutionary tool for transcriptomics analysis. Because RNA-sequencing experiments are expensive, researchers often lack enough samples to conduct a comprehensive differential gene expression analysis. Few studies integrate cross-laboratory datasets into one larger dataset, owing to the bias introduced by differing laboratory experimental procedures. In this study, we investigate cross-laboratory feature selection. We consider four prostate cancer RNA-seq datasets from different laboratories or platforms and use rank-based normalization to reduce the bias among them. In our experiments, we combine three datasets into a training set and treat the remaining dataset as the testing set. Random Forest is applied to select differential genes from the training set. We then train a support vector machine on the training set restricted to these differential genes to learn a classification model, and use this model to predict the classes of the testing set restricted to the same list of genes. The predictions are evaluated by balanced accuracy, an unbiased measure. Results show that rank-based normalization improves the performance of cross-laboratory feature selection. Random Forest combined with rank-based normalization also outperforms a well-known tool, Cuffdiff. In addition, we discuss the influence of different sequencing platforms: the sequencing machine is also an important factor affecting the performance of feature selection on cross-laboratory RNA-seq datasets.	en
dc.description.provenance	Made available in DSpace on 2021-05-17T09:16:04Z (GMT). No. of bitstreams: 1 ntu-101-R00922113-1.pdf: 7799123 bytes, checksum: a4ba7cad081a628c1e26a84bb62cefe1 (MD5) Previous issue date: 2012	en
dc.description.tableofcontents	Acknowledgements; Abstract (Chinese); Abstract; 1 Introduction; 1.1 RNA-sequencing technology; 1.2 Feature selection in bioinformatics; 1.3 Prostate cancer; 1.4 Motivation; 2 Materials and data pre-processing; 2.1 NGS data collection; 2.2 Data pre-processing; 2.2.1 Phred quality score; 2.2.2 Quality control; 2.3 Mapping to reference genome; 2.4 Quantifying gene expression value; 2.5 Cross-laboratory normalization; 3 Feature selection and classification; 3.1 Feature selection by Random Forest; 3.1.1 Building decision tree; 3.1.2 Ensemble of trees; 3.1.3 Training procedure; 3.1.4 Measuring feature importance; 3.2 Classification; 3.2.1 Introduction of Support Vector Machine; 3.2.2 Classification using SVM; 3.3 Evaluation; 4 Results; 4.1 Results of performance; 4.2 Influence of cross-laboratory; 4.3 Influence of NGS platforms; 5 Conclusions and future work; Bibliography
dc.language.iso	en
dc.title	應用特徵選取於跨實驗室前列腺癌核醣核酸序列資料	zh_TW
dc.title	Feature Selection on Cross-laboratory Prostate Cancer RNA-sequencing Data	en
dc.type	Thesis
dc.date.schoolyear	101-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	王弘倫(Hung-Lun Wang),陳怡靜(Yi-Ching Chen)
dc.subject.keyword	RNA sequencing, Cross-laboratory feature selection, Prostate cancer	zh_TW
dc.subject.keyword	RNA-sequencing, Cross-laboratory, Feature selection	en
dc.relation.page	38
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2013-09-02
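dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering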

Files in this item:

File	Size	Format
ntu-101-1.pdf	7.62 MB	Adobe PDF


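Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.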