Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79582
Full metadata record
DC Field  Value  Language
dc.contributor.advisor  吳文方(Wen-Fang Wu)
dc.contributor.author  Chun-Tai Yen  en
dc.contributor.author  顏均泰  zh_TW
dc.date.accessioned  2022-11-23T09:04:20Z
dc.date.available  2021-11-08
dc.date.available  2022-11-23T09:04:20Z
dc.date.copyright  2021-11-08
dc.date.issued  2021
dc.date.submitted  2021-09-29
dc.identifier.uri  http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79582
dc.description.abstract  Since 2020, the global spread of COVID-19 has forced factories to split operations into separate shifts, creating labor shortages on production lines, while manufacturing supply chains have shown unpredictable shifts, pressing enterprises to accelerate their digital transformation. Digital transformation brings large-scale datafication of processes, and machine learning has become one of the key methods for handling industrial data; within it, feature engineering is the procedure that most strongly determines machine-learning outcomes. Traditional feature engineering constructs data features from the domain knowledge of experienced personnel, a procedure that is tedious and time-consuming for enterprises, so automating feature engineering has become a major development trend. This study proposes One-Stop Auto-Feature Engineering (O-SAFE), an automated feature-engineering framework composed of feature generation, feature selection, and feature evaluation. In feature generation it covers both numerical and categorical features, as well as temporal, domain, and relational data patterns. In feature selection, statistical analysis methods handle the large number of new features derived from automated feature generation, speeding up the screening of effective features. Most importantly, O-SAFE uses feature evaluation to address the problem that, after feature generation, an excess of highly correlated features causes the model to overfit, so that accuracy is high on the training set yet drops sharply when validated on the test set. The proposed O-SAFE is verified with one set of real data from manufacturing equipment and two open-source datasets. The results show that it generates nearly twice as many effective features as traditional methods and improves accuracy by 8.8% over expert manual processing; through its feature selection and feature evaluation procedures, O-SAFE avoids the overfitting produced during feature training and improves accuracy by 10.7% over other automated feature-engineering results. Overall, O-SAFE demonstrates its superiority in the number of categorical features generated, the execution speed of feature selection, and the resolution of feature overfitting through feature evaluation. (An illustrative sketch of this three-stage workflow appears after the metadata record below.)  zh_TW
dc.description.provenance  Made available in DSpace on 2022-11-23T09:04:20Z (GMT). No. of bitstreams: 1
U0001-1509202117313400.pdf: 4904371 bytes, checksum: d9b5ed32be5b961418cae13789fe45b7 (MD5)
Previous issue date: 2021
en
dc.description.tableofcontents
Acknowledgements I
Chinese Abstract II
Abstract III
Table of Contents V
List of Figures VII
List of Tables IX
1 Introduction 1
1.1 Research Background 1
1.2 Existing Problems 6
1.3 Proposed Solution 12
2 Literature Review 13
2.1 Automated Machine Learning (AutoML) 13
2.2 Feature Generation 16
2.3 Feature Selection 22
2.3.1 Similarity-based Methods 22
2.3.2 Information-theoretic Methods 24
2.3.3 Sparse-learning-based Methods 25
3 Research Methodology 26
3.1 Conceptual Framework 26
3.2 Method Design 30
3.2.1 O-SAFE Feature Generation 30
3.2.2 O-SAFE Feature Selection 37
3.2.3 O-SAFE Feature Evaluation 42
3.3 Research Steps 51
4 Experimental Design, Results, and Discussion 53
4.1 Experimental Design 53
4.1.1 Experiment 1: Validation with Real Factory Equipment Data 54
4.1.2 Experiment 2: Validation with General-purpose Open-source Data 63
4.1.3 Experiment 3: Validation with Categorical Open-source Data 66
4.2 Experimental Procedure 70
4.2.1 Experiment 1-3: Dataset D + O-SAFE 74
4.2.2 Experiment 1-4: Dataset D + Autofeat 79
4.2.3 Experiment 2-1: Dataset P + H2O 81
4.2.4 Experiment 2-2: Dataset P + O-SAFE 83
4.2.5 Experiment 2-3: Dataset P + Autofeat 87
4.2.6 Experiment 3-1: Dataset W + H2O 89
4.2.7 Experiment 3-2: Dataset W + O-SAFE 90
4.2.8 Experiment 3-3: Dataset W + Autofeat 95
4.3 Discussion of Experimental Results 97
4.3.1 Experiment 1: Discussion of Results 97
4.3.2 Experiment 2: Discussion of Results 100
4.3.3 Experiment 3: Discussion of Results 101
5 Conclusions and Recommendations 104
5.1 Conclusions 104
5.2 Future Recommendations and Improvements 106
References 107
dc.language.iso  zh-TW
dc.title  O-SAFE:應用於機器學習之自動化特徵工程建構  zh_TW
dc.title  Construction of O-SAFE (One-Stop Auto-Feature Engineering) for Machine Learning  en
dc.date.schoolyear  109-2
dc.description.degree  博士 (Doctoral)
dc.contributor.oralexamcommittee  蔡孟勳(Hsin-Tsai Liu),蔡曜陽(Chih-Yang Tseng),吳政鴻,藍俊宏,王世明,欉振坤
dc.subject.keyword  機器學習,自動化機器學習,特徵工程,工業4.0,數位轉型  zh_TW
dc.subject.keyword  Machine learning, AutoML, Feature engineering, Industry 4.0, Digital transformation  en
dc.relation.page  111
dc.identifier.doi  10.6342/NTU202103196
dc.rights.note  Authorization granted (open access worldwide)
dc.date.accepted  2021-09-30
dc.contributor.author-college  工學院 (College of Engineering)  zh_TW
dc.contributor.author-dept  工業工程學研究所 (Institute of Industrial Engineering)  zh_TW
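
The abstract above describes O-SAFE's three stages: feature generation (numerical and categorical), statistical feature selection, and feature evaluation that prunes highly correlated features to curb overfitting. The sketch below is only a rough illustration of that kind of workflow under stated assumptions (pandas and scikit-learn available; all function names are hypothetical); it is not the author's implementation.

# Minimal illustrative sketch of an O-SAFE-style pipeline (hypothetical, not the thesis code):
# 1) generate candidate features, 2) screen them with a statistical test,
# 3) drop highly correlated survivors to limit overfitting.
from itertools import combinations

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def generate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature generation: add pairwise products of the numeric columns."""
    out = df.copy()
    numeric = df.select_dtypes("number").columns
    for a, b in combinations(numeric, 2):
        out[f"{a}_x_{b}"] = df[a] * df[b]
    return out

def select_features(X: pd.DataFrame, y, k: int = 20) -> pd.DataFrame:
    """Feature selection: keep the k features with the highest ANOVA F-scores."""
    selector = SelectKBest(f_classif, k=min(k, X.shape[1])).fit(X, y)
    return X.loc[:, selector.get_support()]

def evaluate_features(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Feature evaluation: drop one feature from every highly correlated pair."""
    corr = X.corr().abs()
    dropped = set()
    for a, b in combinations(corr.columns, 2):
        if a not in dropped and b not in dropped and corr.loc[a, b] > threshold:
            dropped.add(b)
    return X.drop(columns=list(dropped))

# Example usage (X_raw: numeric feature DataFrame, y: class labels):
# X_final = evaluate_features(select_features(generate_features(X_raw), y))

Here the univariate F-test and the fixed correlation threshold merely stand in for the statistical selection and evaluation criteria that the thesis develops in Sections 3.2.2 and 3.2.3.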
Appears in Collections: 工業工程學研究所 (Institute of Industrial Engineering)

Files in this item:
File  Size  Format
U0001-1509202117313400.pdf  4.79 MB  Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless otherwise indicated.
