NTU Theses and Dissertations Repository > Center for General Education > Master's Program in Statistics
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92890

Full metadata record (DC field: value [language]):
dc.contributor.advisor: 蔡政安 [zh_TW]
dc.contributor.advisor: Chen-An Tsai [en]
dc.contributor.author: 蔣依儒 [zh_TW]
dc.contributor.author: Yi-Ju Chiang [en]
dc.date.accessioned: 2024-07-03T16:08:24Z
dc.date.available: 2024-07-04
dc.date.copyright: 2024-07-03
dc.date.issued: 2024
dc.date.submitted: 2024-06-27
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92890
dc.description.abstract: (translated from the Chinese) This thesis uses advanced machine learning methods to study multi-output regression problems. Decision trees, random forests, CatBoost, Tweedie regression, and chain regression are applied to two insurance multiple-claims datasets, the LGPIF dataset and a Spanish dataset, and a comprehensive analysis is carried out. Mean squared error (MSE) is used as the metric for evaluating the predictive performance of the models on univariate and multivariate outputs. In addition, Gini importance, permutation importance, and SHAP values are used to examine how strongly each variable contributes to the models' predictions. This study provides valuable insights into model and variable selection for complex data, deepens the understanding of machine learning for multi-output regression, and offers guidance for future research. [zh_TW]
dc.description.abstract: In this work, we investigate recent advancements in machine learning techniques for insurance claims data, using both univariate and multivariate approaches. This research applies decision trees, random forests, CatBoost, and Tweedie regression, in addition to ensemble methods such as chain regression, to two insurance claims datasets: the LGPIF dataset and a Spanish dataset. Comprehensive data analysis is conducted, and the models' predictive performance is evaluated using mean squared error (MSE). The study also explores variable importance through Gini importance, permutation importance, and SHAP values. Our experiments provide valuable insights into the effectiveness of various models and feature-selection strategies for regression tasks involving complex data. This work enhances the understanding of machine learning applications in regression analysis and provides practical guidance for future implementations. [en]
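As a minimal sketch of the multi-output chain-regression-plus-MSE workflow the abstract describes (this is an illustration using scikit-learn's `RegressorChain` on synthetic data, not the thesis's actual code or datasets):

```python
# Illustrative sketch only: chain regression for multi-output data, evaluated
# with per-target MSE. The data here is a synthetic stand-in for a claims
# dataset with two targets; it is NOT the LGPIF or Spanish dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import RegressorChain

X, y = make_regression(n_samples=500, n_features=8, n_targets=2,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# In a regressor chain, each target's prediction is appended to the feature
# set used for the next target, so dependence between outputs is exploited.
chain = RegressorChain(RandomForestRegressor(n_estimators=100, random_state=0))
chain.fit(X_tr, y_tr)

# One MSE value per output target, the evaluation metric used in the thesis.
mse = mean_squared_error(y_te, chain.predict(X_te), multioutput="raw_values")
print(mse.shape)
```

Swapping the base estimator (e.g. for CatBoost) or averaging chains fitted over several target orderings would move this sketch toward the ensemble chain regression the thesis studies.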
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-03T16:08:24Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2024-07-03T16:08:24Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee  i
摘要 (Chinese Abstract)  ii
Abstract  iii
Contents  iv
List of Figures  viii
List of Tables  x

1 Introduction  1
2 Literature Review  3
3 Methodology  7
  3.1 Univariate  7
    3.1.1 Decision Tree  7
    3.1.2 Random Forest  8
    3.1.3 CatBoost  10
    3.1.4 Tweedie Regression  12
  3.2 Multivariate  16
    3.2.1 MRT: Multivariate Regression Tree  16
    3.2.2 Multivariate CatBoost  17
    3.2.3 Chain Regression  18
    3.2.4 Ensemble Chain Regression  22
  3.3 Variable Importance  23
    3.3.1 Gini Importance  24
    3.3.2 Permutation Importance  25
    3.3.3 SHAP Values  26

4 Case Study I: LGPIF  29
  4.1 Data Introduction and Description  29
  4.2 Data Analysis  34
  4.3 Univariate Approach  35
    4.3.1 Univariate Decision Tree  35
    4.3.2 Univariate Random Forest  36
    4.3.3 Univariate CatBoost  37
    4.3.4 Univariate Tweedie Regression  37
    4.3.5 Results of univariate models (MSE, variable importance, and feature selection)  38
  4.4 Multivariate Approach  55
    4.4.1 Multivariate Decision Tree  56
    4.4.2 Multivariate Random Forest  56
    4.4.3 Multivariate CatBoost  57
    4.4.4 Results of multivariate models (MSE, variable importance, and feature selection)  57
  4.5 Ensemble Chain Regression  66
    4.5.1 Ensemble Chain Regression-Uni  66
    4.5.2 Ensemble Chain Regression-Multi  67
  4.6 Conclusion and Discussion  69

5 Case Study II: The Spanish Dataset  73
  5.1 Data Introduction and Description  73
  5.2 Data Analysis  77
  5.3 Univariate Approach  78
    5.3.1 Univariate Decision Tree  78
    5.3.2 Univariate Random Forest  79
    5.3.3 Univariate CatBoost  79
    5.3.4 Univariate Tweedie Regression  80
    5.3.5 Results of univariate models (MSE, variable importance, and feature selection)  80
  5.4 Multivariate Approach  95
    5.4.1 Multivariate Decision Tree  95
    5.4.2 Multivariate Random Forest  96
    5.4.3 Multivariate CatBoost  96
    5.4.4 Results of multivariate models (MSE, variable importance, and feature selection)  97
  5.5 Ensemble Chain Regression  103
    5.5.1 Ensemble Chain Regression-Uni  104
    5.5.2 Ensemble Chain Regression-Multi  105
  5.6 Conclusion and Discussion  106

6 Conclusion and Discussion  111
References  115
Appendix A — LGPIF  121
Appendix B — The Spanish Dataset  127
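The methodology outline above lists three variable-importance measures (Gini importance, permutation importance, and SHAP values). As a hedged illustration of just the permutation-importance idea, using scikit-learn's `permutation_importance` on synthetic data (not the thesis's datasets or code):

```python
# Illustrative sketch only: permutation importance, one of the three
# variable-importance measures listed in Section 3.3. The data and model
# are hypothetical stand-ins for the thesis's fitted models.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in the score
# (R^2 by default); larger drops indicate more important features.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:3])  # indices of the features ranked most important
```

Gini importance is read directly from the fitted trees (`model.feature_importances_`), while SHAP values would require the separate `shap` package; permutation importance has the advantage of being computed on held-out data, which avoids the bias of impurity-based measures toward high-cardinality features.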
dc.language.iso: en
dc.subject: 多元輸出回歸 (multi-output regression) [zh_TW]
dc.subject: 多元回歸樹 (multivariate regression tree) [zh_TW]
dc.subject: CatBoost [zh_TW]
dc.subject: Tweedie [zh_TW]
dc.subject: 鏈迴歸 (chain regression) [zh_TW]
dc.subject: 變數重要性 (variable importance) [zh_TW]
dc.subject: SHAP值 (SHAP values) [zh_TW]
dc.subject: multivariate regression tree [en]
dc.subject: SHAP values [en]
dc.subject: variable importance [en]
dc.subject: chain regression [en]
dc.subject: Tweedie [en]
dc.subject: CatBoost [en]
dc.subject: multi-output [en]
dc.title: 機器學習於預測保險複數理賠案件之比較分析 (Comparative Analysis of Machine Learning Techniques for Predicting Multiple Insurance Claims) [zh_TW]
dc.title: Comparative Analysis of Machine Learning Techniques for Predicting Multiple Insurance Claims [en]
dc.type: Thesis
dc.date.schoolyear: 112-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 薛慧敏;陳錦華 [zh_TW]
dc.contributor.oralexamcommittee: Huei-Min Hsueh;Jin-Hua Chen [en]
dc.subject.keyword: 多元輸出回歸,多元回歸樹,CatBoost,Tweedie,鏈迴歸,變數重要性,SHAP值 [zh_TW]
dc.subject.keyword: multi-output,multivariate regression tree,CatBoost,Tweedie,chain regression,variable importance,SHAP values [en]
dc.relation.page: 130
dc.identifier.doi: 10.6342/NTU202401271
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2024-06-28
dc.contributor.author-college: 共同教育中心 (Center for General Education)
dc.contributor.author-dept: 統計碩士學位學程 (Master's Program in Statistics)
Appears in Collections: Master's Program in Statistics (統計碩士學位學程)

Files in This Item:
ntu-112-2.pdf | 10.2 MB | Adobe PDF | Restricted (not authorized for public access)

Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
