Feature Selection on Cross-laboratory Prostate Cancer RNA-sequencing Data

Tzung-Chien Hsieh; 謝宗潛

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6671

Title:	Feature Selection on Cross-laboratory Prostate Cancer RNA-sequencing Data (應用特徵選取於跨實驗室前列腺癌核醣核酸序列資料)
Author:	Tzung-Chien Hsieh (謝宗潛)
Advisor:	Kun-Mao Chao (趙坤茂)
Keywords:	RNA-sequencing, Cross-laboratory, Feature selection, Prostate cancer
Publication Year:	2012
Degree:	Master's
Abstract:	Over the past few years, RNA-sequencing has become an indispensable tool for transcriptomics research. Because RNA-sequencing experiments are expensive, researchers rarely have enough samples for a comprehensive differential gene expression analysis. Samples produced by different laboratories also differ because of their experimental environments and procedures, so few studies integrate cross-laboratory datasets into a single larger dataset. This study investigates feature selection on cross-laboratory data. We use four prostate cancer RNA-seq datasets from different laboratories or platforms and apply rank-based normalization to reduce the bias among them. In our experiments, we combine three datasets into a training set and use the remaining dataset as the testing set. Random Forest is applied to select differentially expressed genes from the training set. We then train a support vector machine classification model on the training set restricted to those genes, and use the model to predict the classes of the testing set restricted to the same gene list. The predictions are evaluated by balanced accuracy, which is an unbiased measurement, and the accuracy is compared before and after rank-based normalization. The results show that rank-based normalization improves the testing-set accuracy of cross-laboratory feature selection, and that Random Forest combined with rank-based normalization also outperforms a well-known tool, Cuffdiff.
In addition to normalization and the feature selection algorithm, we discuss the influence of the sequencing platform. The sequencing machine is an important factor affecting the performance of feature selection on cross-lab RNA-seq datasets: newer machines give more stable and accurate data and thus higher classification accuracy.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6671
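Full-text License:	Authorized (open access worldwide)
Appears in Collections:	Department of Computer Science and Information Engineering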
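
The normalization and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes that rank-based normalization means replacing each sample's expression values with their within-sample ranks (ties averaged, scaled by the number of genes), and that balanced accuracy is the mean of per-class recall. The function names `rank_normalize` and `balanced_accuracy` and all sample values are hypothetical, and the Random Forest and support vector machine steps are omitted.

```python
from typing import Dict, List, Sequence


def rank_normalize(values: Sequence[float]) -> List[float]:
    """Replace one sample's expression values by within-sample ranks in (0, 1].

    Ties receive the average of the ranks they span (an assumption; the
    thesis may handle ties differently).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank / n
        start = end + 1
    return ranks


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recall, so each class counts equally."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for truth, guess in zip(y_true, y_pred):
        total[truth] = total.get(truth, 0) + 1
        if truth == guess:
            correct[truth] = correct.get(truth, 0) + 1
    recalls = [correct.get(label, 0) / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical expression values for four genes, measured on different scales.
lab_a_sample = [5.0, 120.0, 40.0, 40.0]
lab_b_sample = [0.2, 9.5, 3.1, 3.1]
print(rank_normalize(lab_a_sample))
print(rank_normalize(lab_b_sample))

# A classifier that always predicts "tumor" on an imbalanced test set.
print(balanced_accuracy(["tumor", "tumor", "tumor", "normal"],
                        ["tumor", "tumor", "tumor", "tumor"]))
```

In this toy example the two samples have the same gene ordering, so they map to identical rank vectors even though their raw scales differ, which is the kind of laboratory-specific bias the normalization is meant to reduce. The always-"tumor" classifier is right on three of four samples but scores only 0.5 in balanced accuracy, because it recovers none of the "normal" class.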

Files in this item:

File	Size	Format
ntu-101-1.pdf	7.62 MB	Adobe PDF

Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.