A Robust Re-Rank Approach with Application to Pooling-Based GWA Study Data

Jia-Rou Liu; 劉佳柔

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65811

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	洪弘(Hung Hung)
dc.contributor.author	Jia-Rou Liu	en
dc.contributor.author	劉佳柔	zh_TW
dc.date.accessioned	2021-06-17T00:12:38Z	-
dc.date.available	2017-09-17
dc.date.copyright	2012-09-17
dc.date.issued	2012
dc.date.submitted	2012-07-11
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65811	-
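dc.description.abstract	In recent years, with the rapid development of research technologies, researchers can increasingly obtain datasets containing thousands or tens of thousands of variables at once, so that the sample size becomes very small by comparison. When the number of variables far exceeds the sample size, the t-statistic traditionally used to detect differences between two groups is not well suited because its variance estimates are unstable. The area under the ROC curve (AUC), which is also used to detect two-group differences, is a nonparametric method less constrained by distributional assumptions, yet the high frequency of tied values still makes ranking and selection difficult. To balance power and robustness, it is more appropriate to change the traditional way of assigning ranks and redefine them as ranks across different variables within the same sample. In this study, we propose a re-rank approach based on the 'rank-over-variable' concept, combined with the 'random subset' and 're-rank' techniques, that helps researchers effectively select variables that differ between two groups when the number of variables far exceeds the sample size. To evaluate the method, we conduct simulation analyses based on the GAIN-MDD dataset and verify that, compared with the t-statistic and AUC, the proposed re-rank approach detects variables that truly differ between the two groups more effectively and is less affected by small sample sizes and experimental error. Finally, we apply the new method to a pooling-based genome-wide scan study and detect genes possibly associated with bipolar disorder, providing candidates for further investigation.	zh_TW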
dc.description.abstract	Recently, a growing number of studies have encountered data in which each object has an extremely large number of variables while the available sample size is relatively small. To detect the difference between two populations in this situation, the widely used two-sample t-test is unsuitable because its variance estimates are unstable. Its non-parametric counterpart, the AUC, suffers from tied values and also fails. To improve detection power while retaining robustness, the idea of 'rank-over-variable' is more appropriate for analyzing large-p-small-n datasets. In this study, we propose a robust re-rank approach to overcome the above-mentioned difficulties and reduce the influence of the enormous number of features in the large-p-small-n situation. In particular, we obtain a rank-based statistic for each feature based on the concept of 'rank-over-variable'. The techniques of 'random subset' and 're-rank' are then applied iteratively to rank the features. Finally, the leading features in the resulting ranking list are selected for further research. To evaluate the performance of the proposed re-rank approach, we conduct several simulation studies based on the GAIN-MDD dataset. Compared with the t-statistic and AUC, our re-rank approach identifies more of the pre-defined truly relevant SNPs and is robust to different numbers of pools and to pooling error. Furthermore, we present a real data analysis to explore markers associated with bipolar disorder.	en
dc.description.provenance	Made available in DSpace on 2021-06-17T00:12:38Z (GMT). No. of bitstreams: 1 ntu-101-R99849024-1.pdf: 2840311 bytes, checksum: 2fe249cab5296a4b374a8d1d0e0770b7 (MD5) Previous issue date: 2012	en
dc.description.tableofcontents	Acknowledgements, p. I; Chinese Abstract, p. II; Abstract, p. III; Contents, p. V; List of Figures, p. VI; List of Tables, p. VII; 1 Introduction, p. 1; 2 Inference Procedure, p. 7; 2.1 Re-Rank Approach, p. 7; 2.2 Prior screening for re-rank approach, p. 12; 2.3 Selection of M1, p. 14; 3 Numerical Analysis, p. 18; 3.1 Simulation studies using GAIN-MDD dataset, p. 19; 3.2 Bipolar dataset, p. 24; 4 Discussion, p. 32; Bibliography, p. 36; A The top 100 SNPs from Stage-1, p. 39; B Matlab Code, p. 43
dc.language.iso	en
dc.subject	filter method	zh_TW
dc.subject	random subset	zh_TW
dc.subject	dimension reduction	zh_TW
dc.subject	feature selection	zh_TW
dc.subject	large-p-small-n	zh_TW
dc.subject	rank-over-variable	zh_TW
dc.subject	random subset	en
dc.subject	dimension reduction	en
dc.subject	feature selection	en
dc.subject	filter method	en
dc.subject	rank-over-variable	en
dc.subject	large-p-small-n	en
dc.title	Detecting Differential Expression with a Robust Re-Rank Method and Its Application to Genome-Wide Scan Data from Pooled Samples	zh_TW
dc.title	A Robust Re-Rank Approach with Application to Pooling-Based GWA Study Data	en
dc.type	Thesis
dc.date.schoolyear	100-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	李文宗(Wen-Chung Lee),蕭朱杏(Chuhsing Kate Hsiao),郭柏秀(Po-Hsiu Kuo)
dc.subject.keyword	large-p-small-n,dimension reduction,feature selection,filter method,rank-over-variable,random subset	zh_TW
dc.subject.keyword	large-p-small-n,dimension reduction,feature selection,filter method,rank-over-variable,random subset	en
dc.relation.page	44
dc.rights.note	Paid license
dc.date.accepted	2012-07-11
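dc.contributor.author-college	College of Public Health	zh_TW
dc.contributor.author-dept	Institute of Epidemiology and Preventive Medicine	zh_TW
Appears in Collections:	Institute of Epidemiology and Preventive Medicine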

Files in This Item:

File	Size	Format
ntu-101-1.pdf (not authorized for public access)	2.77 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.