NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60872
Full metadata record (DC field: value [language])
dc.contributor.advisor: 蔣明晃 (Ming-Huang Chiang)
dc.contributor.author: Li-Yu Shao [en]
dc.contributor.author: 邵立瑜 [zh_TW]
dc.date.accessioned: 2021-06-16T10:34:07Z
dc.date.available: 2020-07-20
dc.date.copyright: 2020-07-20
dc.date.issued: 2020
dc.date.submitted: 2020-07-02
dc.identifier.citation:
[1] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A Highly Efficient Gradient Boosting Decision Tree," Advances in Neural Information Processing Systems, vol. 30, pp. 3149-3157, 2017.
[2] A. V. Dorogush, V. Ershov, and A. Gulin, "CatBoost: Gradient Boosting with Categorical Features Support," NIPS, pp. 1-7, 2017.
[3] J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics, 29(5), pp. 1189-1232, 2001.
[4] J. Friedman, "Stochastic Gradient Boosting," Computational Statistics & Data Analysis, 38(4), pp. 367-378, 2002.
[5] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, ACM, 2016.
[6] T. Duan, A. Avati, D. Y. Ding, S. Basu, A. Y. Ng, and A. Schuler, "NGBoost: Natural Gradient Boosting for Probabilistic Prediction," arXiv preprint arXiv:1910.03225, 2019.
[7] S. Tyree, K. Q. Weinberger, K. Agrawal, and J. Paykin, "Parallel Boosted Regression Trees for Web Search Ranking," in Proceedings of the 20th International Conference on World Wide Web, pp. 387-396, ACM, 2011.
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, 12, pp. 2825-2830, 2011.
[9] G. Ridgeway, "Generalized Boosted Models: A Guide to the gbm Package," 2007. Retrieved from https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf
[10] E. A. Daoud, "Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset," International Journal of Computer and Information Engineering, 13(1), pp. 6-10, 2019.
[11] K. Potdar, T. Pardawala, and C. Pai, "A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers," International Journal of Computer Applications, 2017.
[12] R. E. Bellman, Dynamic Programming, Princeton University Press, 1957.
[13] R. Longadge and S. Dongre, "Class Imbalance Problem in Data Mining: Review," arXiv preprint arXiv:1305.1707, 2013.
[14] C. X. Ling, J. Huang, and H. Zhang, "AUC: A Statistically Consistent and More Discriminating Measure than Accuracy," in Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.
[15] Microsoft, "Advanced Topics," LightGBM documentation, April 2020. https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
[16] F. E. Harrell Jr. and T. Cason (1994). Titanic: Machine Learning from Disaster. Retrieved March 20, 2020 from https://www.kaggle.com/c/titanic/data
[17] Kaggle (April 2015). Titanic: Machine Learning from Disaster. Retrieved March 2020 from https://www.kaggle.com/c/titanic/data
[18] Kaggle (August 2019). Categorical Feature Encoding Challenge. Retrieved March 2020 from https://www.kaggle.com/c/cat-in-the-dat/data
[19] S. Moro, P. Cortez, and P. Rita, "A Data-Driven Approach to Predict the Success of Bank Telemarketing," Decision Support Systems, Elsevier, 62, pp. 22-31, June 2014. Retrieved March 2020 from https://www.kaggle.com/c/bank-marketing-uci/data
[20] E-Sun Bank (玉山銀行). Credit Card Fraud Detection Challenge, September 2019. Retrieved September 2019 from https://tbrain.trendmicro.com.tw/Competitions/Details/10
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60872
dc.description.abstract: On today's small- and medium-sized datasets, Gradient Boosting Decision Tree (GBDT) algorithms are widely used in industry, academia, and competitions. This thesis compares the two most commonly used GBDT packages, LightGBM and CatBoost, and seeks the cause of the performance difference between the two algorithms. To keep the comparison fair and consistent, we designed an experiment around the characteristics of typical real-world datasets and selected datasets that meet its constraints. The results show that CatBoost indeed predicts better on datasets with more categorical columns, whereas LightGBM tends to rely on numerical columns for its predictions. In training time, LightGBM is consistently faster than CatBoost. [zh_TW]
dc.description.abstract: On medium-sized datasets, Gradient Boosting Decision Tree (GBDT) methods have proven effective in both academia and competitions. This thesis investigates and compares the performance of the two most widely used GBDT packages, LightGBM and CatBoost, and examines the reason behind their performance difference. To make the comparison fairer, we designed an experiment based on dataset characteristics and selected several suitable real-world datasets accordingly. The results indicate that CatBoost tends to perform better when the dataset indeed has more categorical columns, while LightGBM inclines toward using numerical columns for prediction. In terms of training speed, LightGBM is faster than CatBoost under all circumstances. [en]
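The comparison described in the abstract can be made concrete with a short sketch. The following is a minimal, hypothetical harness, not the thesis's actual code: it assumes a CSV file with a binary target and a few categorical columns (the file name, target, and column names below are placeholders), applies the 0.75/0.25 train/test split used in Chapter 4, feeds the categorical columns natively to each library, and reports test AUC and wall-clock training time.

    # Minimal LightGBM-vs-CatBoost comparison sketch (hypothetical data).
    import time

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from lightgbm import LGBMClassifier
    from catboost import CatBoostClassifier

    df = pd.read_csv("train.csv")      # placeholder; e.g. a Chapter 3 dataset
    cat_cols = ["sex", "embarked"]     # assumed categorical columns
    # Cast before splitting so train and test share one set of category codes;
    # the remaining columns are assumed numeric.
    for c in cat_cols:
        df[c] = df[c].astype("category")
    y = df.pop("target")               # assumed binary label column

    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=0.25, random_state=42)

    # LightGBM picks up pandas 'category' columns automatically.
    start = time.time()
    lgbm = LGBMClassifier(n_estimators=500).fit(X_train, y_train)
    lgbm_sec = time.time() - start
    lgbm_auc = roc_auc_score(y_test, lgbm.predict_proba(X_test)[:, 1])

    # CatBoost takes categorical column names via cat_features and encodes
    # them internally (ordered target statistics); string values are safest.
    X_train_cb = X_train.astype({c: str for c in cat_cols})
    X_test_cb = X_test.astype({c: str for c in cat_cols})
    start = time.time()
    cb = CatBoostClassifier(iterations=500, verbose=False)
    cb.fit(X_train_cb, y_train, cat_features=cat_cols)
    cb_sec = time.time() - start
    cb_auc = roc_auc_score(y_test, cb.predict_proba(X_test_cb)[:, 1])

    print(f"LightGBM  AUC={lgbm_auc:.4f}  train={lgbm_sec:.1f}s")
    print(f"CatBoost  AUC={cb_auc:.4f}  train={cb_sec:.1f}s")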
dc.description.provenance: Made available in DSpace on 2021-06-16T10:34:07Z (GMT). No. of bitstreams: 1. U0001-0207202009514500.pdf: 820188 bytes, checksum: f22c946fee1ee78a438fab38308b122b (MD5). Previous issue date: 2020. [en]
dc.description.tableofcontents:
Oral Examination Committee Certification #
Acknowledgements i
Chinese Abstract ii
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES vii
LIST OF TABLES viii
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Objective 2
1.3 Organization of the Thesis 2
1.4 Limitations 3
Chapter 2 Related Work 4
2.1 Boosting Methods 4
2.2 Categorical Encoding 6
Chapter 3 Research Methodology 8
3.1 Research Flow 8
3.2 Experimental Design and Performance Metrics 9
3.2.1 Control Variables 9
3.2.2 Evaluation Metrics 10
3.3 Datasets 11
3.3.1 Titanic: Machine Learning from Disaster [17] 11
3.3.2 Cat in the Dat: Categorical Feature Encoding Challenge [18] 12
3.3.3 Bank Marketing UCI [19] 13
3.3.4 E-Sun Bank Fraud Detection [20] 14
3.3.5 Data Preprocessing 16
3.3.6 Hyperparameters 17
Chapter 4 Results of Our Experimental Design 19
4.1 Titanic: Machine Learning from Disaster 19
4.1.1 LightGBM (self-made train(0.75)/test(0.25) split on the initial training set) 19
4.1.2 CatBoost (self-made train(0.75)/test(0.25) split on the initial training set) 20
4.2 Cat in the Dat: Categorical Feature Encoding Challenge 21
4.2.1 LightGBM (Kaggle) 21
4.2.2 CatBoost (Kaggle) 22
4.3 Bank Marketing UCI 22
4.3.1 LightGBM (Kaggle) 23
4.3.2 CatBoost (Kaggle) 23
4.4 E-Sun Bank Fraud Detection 24
4.4.1 LightGBM (self-made train(0.75)/test(0.25) split on the initial training set) 25
4.4.2 CatBoost (self-made train(0.75)/test(0.25) split on the initial training set) 25
4.5 Summary (AUC Private Score) 26
4.5.1 Training Speed 26
4.5.2 Performance 27
Chapter 5 Conclusion 30
5.1.1 Summary 30
5.1.2 Contribution 31
5.1.3 Limits 31
5.1.4 Future Studies 32
REFERENCES 33
dc.language.iso: en
dc.title: LightGBM與CatBoost在類別資料集下之效能探討 [zh_TW]
dc.title: A Study on Performance of LightGBM and CatBoost under categorical datasets [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 林我聰 (Woo-Tsong Lin), 郭人介 (Ren-Jieh Kuo)
dc.subject.keyword: 梯度提升決策樹演算法, LightGBM, CatBoost, 大數據, 資料探勘 [zh_TW]
dc.subject.keyword: Gradient Boosting, LightGBM, CatBoost, Big Data, Data Mining [en]
dc.relation.page: 35
dc.identifier.doi: 10.6342/NTU202001258
dc.rights.note: 有償授權 (licensed for a fee)
dc.date.accepted: 2020-07-02
dc.contributor.author-college: 管理學院 (College of Management) [zh_TW]
dc.contributor.author-dept: 商學研究所 (Graduate Institute of Business Administration) [zh_TW]
Appears in collections: 商學研究所 (Graduate Institute of Business Administration)

Files in this item:
File: U0001-0207202009514500.pdf (currently not authorized for public access)
Size: 800.96 kB
Format: Adobe PDF


Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
