NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84158
Full metadata record (DC field: value [language]):
dc.contributor.advisor: 蔡政安 (Chen-An Tsai)
dc.contributor.author: Yi-Jhen Wu [en]
dc.contributor.author: 吳宜珍 [zh_TW]
dc.date.accessioned: 2023-03-19T22:05:32Z
dc.date.copyright: 2022-07-12
dc.date.issued: 2022
dc.date.submitted: 2022-07-06
dc.identifier.citation:
  Aessopos, A., Farmakis, D., Deftereos, S., Tsironi, M., Tassiopoulos, S., Moyssakis, I., and Karagiorga, M. (2005). Thalassemia heart disease: a comparative evaluation of thalassemia major and thalassemia intermedia. Chest, 127(5):1523–1530.
  Altmann, A., Toloşi, L., Sander, O., and Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10):1340–1347.
  Arai, H., Maung, C., Xu, K., and Schweitzer, H. (2016). Unsupervised feature selection by heuristic search with provable bounds on suboptimality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
  Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.
  Bordag, S. (2008). A comparison of co-occurrence and similarity measures as simulations of context. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 52–63. Springer.
  Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
  Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (2017). Classification and Regression Trees. Routledge.
  Cantú-Paz, E., Newsam, S., and Kamath, C. (2004). Feature selection in scientific applications. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 788–793.
  Chandrashekar, G. and Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28.
  Currie, P. J., Kelly, M. J., and Pitt, A. (1983). Comparison of supine and erect bicycle exercise electrocardiography in coronary heart disease: accentuation of exercise-induced ischemic ST depression by supine posture. The American Journal of Cardiology, 52(10):1167–1173.
  Du, L., Shen, Z., Li, X., Zhou, P., and Shen, Y.-D. (2013). Local and global discriminative learning for unsupervised feature selection. In 2013 IEEE 13th International Conference on Data Mining, pages 131–140. IEEE.
  Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
  Farahat, A. K., Ghodsi, A., and Kamel, M. S. (2011). An efficient greedy method for unsupervised feature selection. In 2011 IEEE 11th International Conference on Data Mining, pages 161–170. IEEE.
  Gao, S., Ver Steeg, G., and Galstyan, A. (2016). Variational information maximization for feature selection. Advances in Neural Information Processing Systems, 29.
  Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2015). VSURF: an R package for variable selection using random forests. The R Journal, 7(2):19–33.
  Gu, Q., Li, Z., and Han, J. (2012). Generalized Fisher score for feature selection. arXiv preprint arXiv:1202.3725.
  Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182.
  Hapfelmeier, A. and Ulm, K. (2013). A new variable selection approach using random forests. Computational Statistics & Data Analysis, 60:50–69.
  Harrison Jr., D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102.
  Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer.
  Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1):66–73.
  Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674.
  Jiang, Y. and Ren, J. (2011). Eigenvalue sensitive feature selection. In ICML.
  Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324.
  Ladha, L. and Deepa, T. (2011). Feature selection methods and algorithms. International Journal on Computer Science and Engineering, 3(5):1787–1797.
  Levin, L. A. (1973). Universal sequential search problems. Problemy Peredachi Informatsii, 9(3):115–116.
  Li, J., Tang, J., and Liu, H. (2017). Reconstruction-based unsupervised feature selection: an embedded approach. In IJCAI, pages 2159–2165.
  Liaw, A., Wiener, M., et al. (2002). Classification and regression by randomForest. R News, 2(3):18–22.
  Louppe, G. (2014). Understanding random forests: from theory to practice. arXiv preprint arXiv:1407.7502.
  Ma, S. and Huang, J. (2008). Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics, 9(5):392–403.
  Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, 26(9):917–922.
  Park, H., Kwon, S., and Kwon, H.-C. (2010). Complete Gini-index text (GIT) feature-selection algorithm for text classification. In The 2nd International Conference on Software Engineering and Data Mining, pages 366–371. IEEE.
  Selman, B. and Gomes, C. P. (2006). Hill-climbing search. Encyclopedia of Cognitive Science, 81:82.
  Shishkin, A., Bezzubtseva, A., Drutsa, A., Shishkov, I., Gladkikh, E., Gusev, G., and Serdyukov, P. (2016). Efficient high-order interaction-aware feature selection based on conditional mutual information. Advances in Neural Information Processing Systems, 29.
  Sivanandam, S. and Deepa, S. (2008). Genetic algorithms. In Introduction to Genetic Algorithms, pages 15–37. Springer.
  Tang, J., Hu, X., Gao, H., and Liu, H. (2014). Discriminant analysis for unsupervised feature selection. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 938–946. SIAM.
  Yang, F. and Mao, K. (2010). Robust feature selection for microarray data based on multicriterion fusion. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(4):1080–1092.
  Zhao, Z. and Liu, H. (2007). Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, pages 1151–1157.
  Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84158
dc.description.abstract [zh_TW]: Over the past few decades, feature selection has been widely used for dimensionality reduction, choosing a suitable feature subset from the original feature set according to a given threshold. Selecting significant variables in high-dimensional data is essential for improving model identification and classification, so data mining techniques in many research and application areas rely heavily on feature selection, especially in machine learning algorithms. In this thesis we propose a new feature selection method, PBFS, which performs variable selection with a random forest model while controlling the FDR. We evaluate the effectiveness of the proposed method against other existing feature selection methods on two real datasets and four simulated datasets, and find that multicollinearity can strongly affect which variables are selected. In general, the PBFS method has advantages over the other four feature selection methods; in addition, we visualize the relationships among variables through co-occurrence network analysis of the bagged decision trees in PBFS.
dc.description.abstract [en]: During the past decades, feature selection has been used in dimensionality reduction to select suitable feature subsets from the original set of features according to certain criteria. It is especially important to choose significant variables in high-dimensional data to improve model identification and classification accuracy. In many research and application areas, data mining techniques rely heavily on feature selection methods, especially in machine learning algorithms. In this thesis, a new feature selection approach called Permutation-Based Feature Selection (PBFS) is proposed by using a random forest model while controlling the false discovery rate (FDR) to perform the feature selection. Two real datasets and four simulation studies are used to evaluate the effectiveness of our proposed approach compared to the other well-known existing feature selection methods. It was found that multicollinearity could have a great impact on the selected variables. In general, the PBFS method showed advantages over the other four feature selection methods. In addition, we visualized the relationship among variables through bagged decision trees results from PBFS based on the co-occurrence network analysis.
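The abstract names PBFS's two ingredients, permutation-based importance from a random forest and FDR control, without giving the algorithm. As a rough illustration only, not the thesis's exact procedure, the following Python sketch pairs null-permutation importance p-values in the spirit of Altmann et al. (2010) with the Benjamini-Hochberg (1995) step-up rule, both cited in the record above. The function names, the number of permutations, and the use of scikit-learn's impurity-based importances are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def permutation_pvalues(X, y, n_permutations=100, random_state=0):
    """Per-feature p-values from a label-permutation null distribution
    (in the spirit of Altmann et al., 2010). Illustrative sketch only."""
    rng = np.random.default_rng(random_state)
    rf = RandomForestClassifier(n_estimators=500, random_state=random_state)
    observed = rf.fit(X, y).feature_importances_

    # Refit on permuted labels to build a null distribution of importances.
    null = np.empty((n_permutations, X.shape[1]))
    for b in range(n_permutations):
        null[b] = rf.fit(X, rng.permutation(y)).feature_importances_

    # p-value: fraction of null importances >= observed (+1 smoothing).
    return (1 + (null >= observed).sum(axis=0)) / (n_permutations + 1)

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg (1995) step-up rule; boolean mask of features
    selected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    selected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest i with p_(i) <= q*(i+1)/m
        selected[order[:k + 1]] = True
    return selected
```

Under these assumptions, one would keep the columns of X for which `benjamini_hochberg(permutation_pvalues(X, y), q=0.05)` is True.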
dc.description.provenance [en]: Made available in DSpace on 2023-03-19T22:05:32Z (GMT). No. of bitstreams: 1. U0001-0711202113092500.pdf: 857447 bytes, checksum: e296f9123a4ec918e0d6914a1e338708 (MD5). Previous issue date: 2022.
dc.description.tableofcontents:
  口試委員審定書 (Oral Examination Committee Certification) #
  Acknowledgements i
  中文摘要 (Chinese Abstract) ii
  Abstract iii
  Content v
  List of Figures vi
  List of Tables vii
  Chapter 1 Introduction 1
  Chapter 2 Methodology 4
    2.1 Random Forests 4
      2.1.1 Bagging 4
      2.1.2 CART 6
      2.1.3 Variable Importance 7
    2.2 Permutation-Based Feature Selection 8
    2.3 Other Approaches 11
      2.3.1 NAP/NAP.B 11
      2.3.2 ALT 12
      2.3.3 VSURF 12
    2.4 Model Performance 13
      2.4.1 Out-of-bag Error 13
      2.4.2 10-Fold Cross Validation 13
        2.4.2.1 Confusion Matrix 15
        2.4.2.2 Prediction Error Rate 16
    2.5 Co-occurrence Analysis 16
      2.5.1 Co-occurrence Matrices 17
      2.5.2 Statistical Significance 17
      2.5.3 Co-occurrence Network 18
  Chapter 3 Materials 20
    3.1 Simulation Studies 20
      3.1.1 Dataset and Preprocess 20
    3.2 Applications 24
      3.2.1 Case I: Heart Disease 24
      3.2.2 Case II: Boston House Prices 24
  Chapter 4 Results 25
    4.1 Simulation Studies 25
      4.1.1 Study I 27
      4.1.2 Study II 28
      4.1.3 Study III 29
      4.1.4 Study IV 31
      4.1.5 Co-occurrence Network 33
      4.1.6 Model Performance 40
    4.2 Applications 41
      4.2.1 Case I: Heart Disease 42
      4.2.2 Case II: Boston House Prices 48
  Chapter 5 Conclusions 53
  References 55
  Appendix A — Variable Description 60
    A.1 Heart Disease 60
    A.2 Boston House Prices 62
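Section 2.5 of the contents outlines the co-occurrence pipeline (matrices, statistical significance, network) that the abstract says is used to visualize variable relationships from the bagged trees. As a hedged sketch under stated assumptions, the Python below shows one plausible construction: count, for each pair of features, how many bagged trees split on both, and keep pairs that pass Dunning's (1993) log-likelihood ratio (G) test, which the record cites. The per-tree presence encoding, the significance cutoff, and the function names are assumptions, not the thesis's exact construction.

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2

def g_test(table):
    """Dunning's (1993) log-likelihood ratio (G-test) for a 2x2 table, 1 df."""
    obs = np.asarray(table, dtype=float)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(obs > 0, obs * np.log(obs / exp), 0.0)
    return 2.0 * terms.sum()

def cooccurrence_edges(rf, n_features, alpha=0.05):
    """Edges between features that co-occur in the same bagged tree more
    often than chance expects. rf is a fitted sklearn random forest."""
    # presence[t, j]: tree t splits on feature j at least once
    presence = np.zeros((len(rf.estimators_), n_features), dtype=bool)
    for t, tree in enumerate(rf.estimators_):
        used = tree.tree_.feature  # negative entries mark leaf nodes
        presence[t, used[used >= 0]] = True

    edges = []
    for i, j in combinations(range(n_features), 2):
        a, b = presence[:, i], presence[:, j]
        table = [[np.sum(a & b), np.sum(a & ~b)],
                 [np.sum(~a & b), np.sum(~a & ~b)]]
        p = chi2.sf(g_test(table), df=1)
        if p < alpha:
            edges.append((i, j, p))
    return edges  # significant pairs; drawable with any graph library
```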
dc.language.iso: en
dc.subject: 共現網絡分析 (co-occurrence network analysis) [zh_TW]
dc.subject: 隨機森林 (random forest) [zh_TW]
dc.subject: 變數共線性 (variable multicollinearity) [zh_TW]
dc.subject: 變數關係視覺化 (variable relationship visualization) [zh_TW]
dc.subject: 基於變數置換的特徵挑選 (permutation-based feature selection) [zh_TW]
dc.subject: variable multicollinearity [en]
dc.subject: Random Forest [en]
dc.subject: Permutation-Based Feature Selection [en]
dc.subject: co-occurrence analysis [en]
dc.subject: variable relationship visualization [en]
dc.title: 基於變數置換以及隨機森林之變數挑選方法 (A feature selection method based on variable permutation and random forests) [zh_TW]
dc.title: A Permutation-Based Feature Selection (PBFS) Approach Using Random Forests [en]
dc.type: Thesis
dc.date.schoolyear: 110-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 陳春樹 (Chun-Shu Chen), 陳錦華 (Jin-Hua Chen)
dc.subject.keyword [zh_TW]: 隨機森林, 基於變數置換的特徵挑選, 共現網絡分析, 變數關係視覺化, 變數共線性
dc.subject.keyword [en]: Random Forest, Permutation-Based Feature Selection, co-occurrence analysis, variable relationship visualization, variable multicollinearity
dc.relation.page: 62
dc.identifier.doi: 10.6342/NTU202104467
dc.rights.note: 同意授權(限校園內公開) (Authorized for release; available on campus only)
dc.date.accepted: 2022-07-07
dc.contributor.author-college: 共同教育中心 (Center for General Education)
dc.contributor.author-dept: 統計碩士學位學程 (Master's Program in Statistics)
dc.date.embargo-lift: 2022-07-12
Appears in Collections: 統計碩士學位學程 (Master's Program in Statistics)

Files in this item:
  File: U0001-0711202113092500.pdf (837.35 kB, Adobe PDF)
  Access: NTU campus IPs only (off-campus users, please connect through the library's VPN service)
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
