NTU Theses and Dissertations Repository > Center for General Education > Master's Program in Statistics
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92890

Full metadata record (DC field: value [language]):
dc.contributor.advisor: 蔡政安 [zh_TW]
dc.contributor.advisor: Chen-An Tsai [en]
dc.contributor.author: 蔣依儒 [zh_TW]
dc.contributor.author: Yi-Ju Chiang [en]
dc.date.accessioned: 2024-07-03T16:08:24Z
dc.date.available: 2024-07-04
dc.date.copyright: 2024-07-03
dc.date.issued: 2024
dc.date.submitted: 2024-06-27
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92890
dc.description.abstract: (translated from the Chinese) This thesis uses advanced machine learning methods to study multi-output regression problems. Decision trees, random forests, CatBoost, Tweedie regression, and chain regression are applied to two insurance multiple-claims datasets, the LGPIF dataset and a Spanish dataset, and a comprehensive analysis is carried out. Mean squared error (MSE) is used as the metric for evaluating the predictive performance of the models on univariate and multivariate outputs. In addition, Gini importance, permutation importance, and SHAP values are used to examine how strongly each variable contributes to the models' predictions. This study provides valuable insights into model and variable selection for complex data, deepens the understanding of machine learning for multi-output regression, and offers guidance for future research. [zh_TW]
dc.description.abstract: In this work, we investigate recent advancements in machine learning techniques for insurance claims data, using both univariate and multivariate approaches. This research applies decision trees, random forests, CatBoost, and Tweedie regression, in addition to ensemble methods such as chain regression, to two insurance claims datasets: the LGPIF dataset and a Spanish dataset. Comprehensive data analysis is conducted, and the models' predictive performance is evaluated using mean squared error (MSE). The study also explores variable importance through Gini importance, permutation importance, and SHAP values. Our experiments provide valuable insights into the effectiveness of various models and feature-selection strategies for regression tasks involving complex data. This work enhances the understanding of machine learning applications in regression analysis and provides practical guidance for future implementations. [en]
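As a minimal sketch of the multi-output chain-regression-plus-MSE workflow the abstract describes (this is an illustration using scikit-learn's `RegressorChain` on synthetic data, not the thesis's actual code or datasets):

```python
# Illustrative sketch only: chain regression for multi-output data, evaluated
# with per-target MSE. The data here is a synthetic stand-in for a claims
# dataset with two targets; it is NOT the LGPIF or Spanish dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import RegressorChain

X, y = make_regression(n_samples=500, n_features=8, n_targets=2,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# In a regressor chain, each target's prediction is appended to the feature
# set used for the next target, so dependence between outputs is exploited.
chain = RegressorChain(RandomForestRegressor(n_estimators=100, random_state=0))
chain.fit(X_tr, y_tr)

# One MSE value per output target, the evaluation metric used in the thesis.
mse = mean_squared_error(y_te, chain.predict(X_te), multioutput="raw_values")
print(mse.shape)
```

Swapping the base estimator (e.g. for CatBoost) or averaging chains fitted over several target orderings would move this sketch toward the ensemble chain regression the thesis studies.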
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-03T16:08:24Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2024-07-03T16:08:24Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee  i
摘要 (Chinese Abstract)  ii
Abstract  iii
Contents  iv
List of Figures  viii
List of Tables  x

1 Introduction  1
2 Literature Review  3
3 Methodology  7
  3.1 Univariate  7
    3.1.1 Decision Tree  7
    3.1.2 Random Forest  8
    3.1.3 CatBoost  10
    3.1.4 Tweedie Regression  12
  3.2 Multivariate  16
    3.2.1 MRT: Multivariate Regression Tree  16
    3.2.2 Multivariate CatBoost  17
    3.2.3 Chain Regression  18
    3.2.4 Ensemble Chain Regression  22
  3.3 Variable Importance  23
    3.3.1 Gini Importance  24
    3.3.2 Permutation Importance  25
    3.3.3 SHAP Values  26

4 Case Study I: LGPIF  29
  4.1 Data Introduction and Description  29
  4.2 Data Analysis  34
  4.3 Univariate Approach  35
    4.3.1 Univariate Decision Tree  35
    4.3.2 Univariate Random Forest  36
    4.3.3 Univariate CatBoost  37
    4.3.4 Univariate Tweedie Regression  37
    4.3.5 Results of univariate models (MSE, variable importance, and feature selection)  38
  4.4 Multivariate Approach  55
    4.4.1 Multivariate Decision Tree  56
    4.4.2 Multivariate Random Forest  56
    4.4.3 Multivariate CatBoost  57
    4.4.4 Results of multivariate models (MSE, variable importance, and feature selection)  57
  4.5 Ensemble Chain Regression  66
    4.5.1 Ensemble Chain Regression-Uni  66
    4.5.2 Ensemble Chain Regression-Multi  67
  4.6 Conclusion and Discussion  69

5 Case Study II: The Spanish Dataset  73
  5.1 Data Introduction and Description  73
  5.2 Data Analysis  77
  5.3 Univariate Approach  78
    5.3.1 Univariate Decision Tree  78
    5.3.2 Univariate Random Forest  79
    5.3.3 Univariate CatBoost  79
    5.3.4 Univariate Tweedie Regression  80
    5.3.5 Results of univariate models (MSE, variable importance, and feature selection)  80
  5.4 Multivariate Approach  95
    5.4.1 Multivariate Decision Tree  95
    5.4.2 Multivariate Random Forest  96
    5.4.3 Multivariate CatBoost  96
    5.4.4 Results of multivariate models (MSE, variable importance, and feature selection)  97
  5.5 Ensemble Chain Regression  103
    5.5.1 Ensemble Chain Regression-Uni  104
    5.5.2 Ensemble Chain Regression-Multi  105
  5.6 Conclusion and Discussion  106

6 Conclusion and Discussion  111
References  115
Appendix A — LGPIF  121
Appendix B — The Spanish Dataset  127
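The methodology outline above lists three variable-importance measures (Gini importance, permutation importance, and SHAP values). As a hedged illustration of just the permutation-importance idea, using scikit-learn's `permutation_importance` on synthetic data (not the thesis's datasets or code):

```python
# Illustrative sketch only: permutation importance, one of the three
# variable-importance measures listed in Section 3.3. The data and model
# are hypothetical stand-ins for the thesis's fitted models.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in the score
# (R^2 by default); larger drops indicate more important features.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:3])  # indices of the features ranked most important
```

Gini importance is read directly from the fitted trees (`model.feature_importances_`), while SHAP values would require the separate `shap` package; permutation importance has the advantage of being computed on held-out data, which avoids the bias of impurity-based measures toward high-cardinality features.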
dc.language.iso: en
dc.subject: 多元輸出回歸 (multi-output regression) [zh_TW]
dc.subject: 多元回歸樹 (multivariate regression tree) [zh_TW]
dc.subject: CatBoost [zh_TW]
dc.subject: Tweedie [zh_TW]
dc.subject: 鏈迴歸 (chain regression) [zh_TW]
dc.subject: 變數重要性 (variable importance) [zh_TW]
dc.subject: SHAP值 (SHAP values) [zh_TW]
dc.subject: multivariate regression tree [en]
dc.subject: SHAP values [en]
dc.subject: variable importance [en]
dc.subject: chain regression [en]
dc.subject: Tweedie [en]
dc.subject: CatBoost [en]
dc.subject: multi-output [en]
dc.title: 機器學習於預測保險複數理賠案件之比較分析 (Comparative Analysis of Machine Learning Techniques for Predicting Multiple Insurance Claims) [zh_TW]
dc.title: Comparative Analysis of Machine Learning Techniques for Predicting Multiple Insurance Claims [en]
dc.type: Thesis
dc.date.schoolyear: 112-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 薛慧敏;陳錦華 [zh_TW]
dc.contributor.oralexamcommittee: Huei-Min Hsueh;Jin-Hua Chen [en]
dc.subject.keyword: 多元輸出回歸,多元回歸樹,CatBoost,Tweedie,鏈迴歸,變數重要性,SHAP值 [zh_TW]
dc.subject.keyword: multi-output,multivariate regression tree,CatBoost,Tweedie,chain regression,variable importance,SHAP values [en]
dc.relation.page: 130
dc.identifier.doi: 10.6342/NTU202401271
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2024-06-28
dc.contributor.author-college: 共同教育中心 (Center for General Education)
dc.contributor.author-dept: 統計碩士學位學程 (Master's Program in Statistics)
Appears in Collections: Master's Program in Statistics (統計碩士學位學程)

Files in This Item:
ntu-112-2.pdf | 10.2 MB | Adobe PDF | Restricted (not authorized for public access)

Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
