Partial-Combination based Supervised Prediction and Unsupervised Anomaly Detection in Mixed Datasets

Yi-Hsin Wu; 吳怡欣

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/16844

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	王勝德(Sheng-De Wang)
dc.contributor.author	Yi-Hsin Wu	en
dc.contributor.author	吳怡欣	zh_TW
dc.date.accessioned	2021-06-07T23:47:46Z	-
dc.date.copyright	2020-08-13
dc.date.issued	2020
dc.date.submitted	2020-08-10
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/16844	-
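dc.description.abstract	In recent years, with the rapid development of machine learning algorithms and related platforms and tools, putting machine learning into practice for domain-specific application problems has become an increasingly valued research direction. In practical applications, the transparency of the decision process and the traceability of its outputs are harder to quantify, yet they are as important as the metrics generally regarded as easier to quantify, such as accuracy, computation speed, and resource consumption. Accordingly, using mixed categorical-numerical datasets from several application domains as examples, this study focuses on two major branches of machine learning, supervised prediction and unsupervised anomaly detection. It proposes machine learning methods that provide or improve decision transparency and discusses their advantages over existing methods, with the aim of meeting the needs of industry practitioners who also value decision transparency when adopting machine learning solutions. First, this study proposes a supervised prediction method that partitions the original dataset into many subsets according to different combinations of categorical attributes and builds a prediction model for each combination, including fundamental-, partial-, and full-combination prediction models. A fundamental-combination model is built from the subset in which all categorical variables are fixed; it has little data but high specificity. A partial-combination model is built from the subset in which some specific categorical variables are fixed; it has a moderate amount of data and moderate specificity. The full-combination model is built from the entire dataset. Finally, for each fundamental combination, the better prediction model can be selected using its training and validation datasets. This approach benefits both model transparency and accuracy. Next, for unsupervised machine learning, this study proposes a real-time streaming analytics framework. Using the existing Unsupervised Incremental Binning (UIB) method, it stores only a fixed number of bin statistics, which reduces the storage required for raw data and speeds up analysis. Once the raw data are binned and the data volume is greatly reduced, when a user issues a partial-combination analysis query, the proposed framework efficiently aggregates the relevant fundamental-combination bin subsets and quickly performs cross-combination anomaly scoring through robust statistics, hierarchical clustering, and anomaly scoring steps. In tests on data from a semiconductor test factory, anomaly detection was performed group by group over multiple categorical factors such as test equipment and device items. Compared with analysis of the raw data, the binned streaming analysis of this study ensures both the timeliness and the accuracy of anomaly detection, and the resulting anomaly scores offer a transparent computation process and interpretable results. Finally, this study extends UIB to support multi-dimensional numerical array data. By caching array indices, each incoming value is received in O(1) time, and once reception is complete the data array can be converted in O(n) time into order statistics and a cluster table for subsequent analysis. Using the analysis of high-energy particle collision data as a case study, when the number of bins is set to 65,536, the method converts the original 32-bit floating-point numbers into 16-bit integers, saving 50% of the storage space while maintaining high accuracy in the analysis results.	zh_TW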
dc.description.abstract	In recent years, machine learning (ML) methods and tools have become increasingly mature and widely available, giving rise to a growing body of research on successful ML applications to domain-specific problems. Yet many studies have identified gaps between academic deliverables and business expectations. For example, decision support systems are often built as black boxes, whereas practitioners may be concerned about the transparency and interpretability of the underlying decision processes. Moreover, in addition to traditional key performance indicators (KPIs) for evaluating ML methods, such as accuracy and computational complexity, a method's transparency and interpretability are also important if ML is to meet business expectations in domain applications. Devoted to more transparent and interpretable ML approaches, this thesis proposes a supervised prediction method and an unsupervised anomaly detection framework, using mixed-type datasets from domain applications as showcases. First, a supervised prediction method is proposed that uses partial-combination models to improve prediction quality and transparency in mixed datasets with complex interactions. Multiple regression models are generated by partitioning a dataset into subsets with different combinations of categorical attributes. More specifically, given a mixed-type dataset, a fundamental-combination model is built from the subset in which every categorical variable takes a specific value; a partial-combination model is built from the subset in which at least one categorical variable takes a specific value; and the full-combination model is built from the full dataset. 
Then, for a specific fundamental combination, the better model is selected from the related fundamental-, partial-, and full-combination models by comparing the true and predicted values on the training and validation splits of the corresponding fundamental-combination subset. Compared with existing methods, the proposed partial-combination based supervised prediction improves both model transparency and accuracy. Second, an unsupervised anomaly detection approach is proposed and demonstrated through an in-stream analytical framework. This approach detects and scores equipment faults during semiconductor test processes; the score conveys the meaning of an anomaly in terms a human can understand, making the decision process for identifying equipment anomalies transparent and interpretable. Moreover, the approach eliminates the need to store the large amounts of raw measurement data generated during semiconductor test processes, while maintaining accuracy for later analysis. The underlying technique, Unsupervised Incremental Binning (UIB), is based on the on-line agglomerative clustering method in the literature and automatically, incrementally, and dynamically groups incoming measured values into a small but sufficient number of bins, up to a fixed maximum. The bins preserve the robust statistical properties of medians and interquartile ranges. Partial combinations can then be applied through different configurations of equipment grouping, such as by site (a tester may have multiple sites to test semiconductor devices simultaneously) and by tester (a factory may deploy parallel machines). Experimental results show that the unsupervised ML framework is fast and cost-effective and reaches a scoring accuracy of 99.6%, which satisfies industrial use scenarios. The framework is currently operating on-line in a large semiconductor test factory in Taiwan. 
UIB is also extended to support the processing of multi-dimensional streaming vectors. By recording and manipulating the index information upon each newly received value, the original raw numerical vectors can be transformed into orders and dictionaries, storing only 16-bit integers instead of 32-bit floating-point numbers when the compression parameter maxNumBins is set to 65,536. This approach is demonstrated through a particle mass reconstruction application, where it achieves high accuracy in the analytical results.	en
dc.description.provenance	Made available in DSpace on 2021-06-07T23:47:46Z (GMT). No. of bitstreams: 1 U0001-1008202001364900.pdf: 3904259 bytes, checksum: 4eb26418e112f649c800e4385290f543 (MD5) Previous issue date: 2020	en
dc.description.tableofcontents	Chapter 1 Preface 1
1.1 Motivation of Work 1
1.2 Organization of the Dissertation 5
Chapter 2 Using Partial Combination Models to Improve Prediction Quality and Transparency in Mixed Datasets 7
2.1 Introduction 7
2.2 Literature Review 11
2.2.1 Hierarchical Regression/Clustering Methods 13
2.2.2 Feature Selection 16
2.2.3 Ensemble Methods 17
2.2.4 Model Transparency 18
2.3 Problem Description and Data Preprocessing 19
2.3.1 Fundamental Combination (x_1, x_2, …, x_k) of Categorical Attributes and Datasets 21
2.3.2 Partial Combination of Categorical Attributes and Partial Combination Datasets 22
2.3.3 Full Combination of Categorical Attributes and Their Corresponding Datasets 24
2.4 Fundamental, Partial, and Full Combination Prediction Models using the Corresponding Datasets 25
2.4.1 Fundamental Combination Prediction Model (M_(x_1, x_2, …, x_k)) 25
2.4.2 Partial Combination Prediction Model (M_(x_j, j∉I)) 26
2.4.3 Full Combination Prediction Model (M_(Ω_1, Ω_2, …, Ω_k)) 26
2.4.4 Model Training Process 26
2.5 Selection of the Prediction Models 32
2.5.1 Model Selection Indexes 32
2.5.2 One-stage and Two-stage Model Selection Methods 34
2.5.3 Prediction Quality Indexes for the One-stage and Two-stage Model Selection Methods 36
Chapter 3 Transparent Box Design: Detecting and Scoring Equipment Faults during Semiconductor Test Processes 38
3.1 Introduction and Problem Statement 38
3.2 Unsupervised Incremental Binning 42
3.3 Computing Medians and IQRs from Bins 44
3.4 Managing Single Outliers 45
3.5 Hierarchical Clustering and Scoring 47
3.6 Put-it-all-together: Industrial Implementation 49
3.7 Extension of UIB for N-dimensional Streaming Vectors 52
Chapter 4 Performance Evaluation 55
4.1 Throughput Rate Prediction in Semiconductor Backend Dataset 55
4.2 Price Prediction in Diamonds Dataset 62
4.3 Anomaly Detection during Semiconductor Test Processes 65
4.4 Compressing 32-bit Floating-Point Vectors into 16-bit Orders in Particle Mass Reconstructions 71
Chapter 5 Conclusion and Future Work 77
References 81
dc.language.iso	en
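dc.subject	非監督式異常偵測 (unsupervised anomaly detection)	zh_TW
dc.subject	部份組合 (partial combination)	zh_TW
dc.subject	機器學習 (machine learning)	zh_TW
dc.subject	監督式預測 (supervised prediction)	zh_TW
dc.subject	模型選擇 (model selection)	zh_TW
dc.subject	串流分析 (in-stream analytics)	zh_TW
dc.subject	數據分箱 (data binning)	zh_TW
dc.subject	穩健統計 (robust statistics)	zh_TW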
dc.subject	robust statistics	en
dc.subject	partial combination	en
dc.subject	machine learning	en
dc.subject	supervised prediction	en
dc.subject	model selection	en
dc.subject	unsupervised anomaly detection	en
dc.subject	in-stream analytics	en
dc.subject	data binning	en
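dc.title	基於部分組合並應用於類別數值混合型數據集之監督式預測與非監督式異常偵測 (Partial-Combination based Supervised Prediction and Unsupervised Anomaly Detection in Mixed Datasets)	zh_TW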
dc.title	Partial-Combination based Supervised Prediction and Unsupervised Anomaly Detection in Mixed Datasets	en
dc.type	Thesis
dc.date.schoolyear	108-2
dc.description.degree	Doctoral
dc.contributor.oralexamcommittee	雷欽隆(Chin-Laung Lei),于天立(Tian-Li Yu),鄧維中(Wei-Chung Teng),呂欣澤(Hsin-Tse Lu)
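dc.subject.keyword	部份組合 (partial combination), 機器學習 (machine learning), 監督式預測 (supervised prediction), 模型選擇 (model selection), 非監督式異常偵測 (unsupervised anomaly detection), 串流分析 (in-stream analytics), 數據分箱 (data binning), 穩健統計 (robust statistics)	zh_TW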
dc.subject.keyword	partial combination, machine learning, supervised prediction, model selection, unsupervised anomaly detection, in-stream analytics, data binning, robust statistics	en
dc.relation.page	89
dc.identifier.doi	10.6342/NTU202002756
dc.rights.note	Not authorized
dc.date.accepted	2020-08-11
dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Electrical Engineering	zh_TW
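
The model selection described in the abstract above can be sketched as follows. This is a minimal illustration and not the thesis implementation: the names `fit_line` and `select_model`, the use of a one-variable least-squares line as the prediction model, and the choice of validation mean absolute error as the only selection index are all assumptions made for the example. For one fundamental combination, the sketch enumerates every way of fixing a subset of the categorical variables (all fixed = fundamental, some fixed = partial, none fixed = full), fits a model on each resulting training subset, and keeps the one that predicts the fundamental combination's validation rows best.

```python
from itertools import combinations

def fit_line(points):
    """Least-squares fit y = a + b*x; falls back to the mean of y if x is constant."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b

def select_model(train, valid, combo):
    """Return (fixed_positions, (intercept, slope), mae) with the lowest validation MAE.

    train, valid: lists of (categories_tuple, x, y) rows (hypothetical layout).
    combo: the fundamental combination, one value per categorical variable.
    """
    k = len(combo)
    target = [(x, y) for cats, x, y in valid if cats == combo]
    if not target:
        raise ValueError("no validation rows for this fundamental combination")
    best = None
    # r = k fixes every variable (fundamental); r = 0 fixes none (full dataset).
    for r in range(k, -1, -1):
        for fixed in combinations(range(k), r):
            subset = [(x, y) for cats, x, y in train
                      if all(cats[i] == combo[i] for i in fixed)]
            if len(subset) < 2:
                continue  # too little data to fit a line
            a, b = fit_line(subset)
            mae = sum(abs(y - (a + b * x)) for x, y in target) / len(target)
            if best is None or mae < best[2]:
                best = (fixed, (a, b), mae)
    return best
```

Because the returned `fixed_positions` names exactly which categorical variables were held constant for the chosen model, the selection stays traceable, which is the kind of transparency the abstract emphasizes.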
Appears in Collections:	Department of Electrical Engineering
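
The abstract states that a bounded set of bins can preserve medians and interquartile ranges without storing raw measurements. The sketch below illustrates that idea under stated assumptions; it is not the UIB algorithm of the thesis. The function names `add_value` and `quantile`, the rule of merging the two closest neighbouring bins when capacity is exceeded, and the step-wise quantile estimate are all hypothetical choices for the example.

```python
import bisect

def add_value(bins, value, max_bins):
    """Insert value as a unit bin; if over capacity, merge the two closest neighbours.

    bins: sorted list of [center, count], modified in place.
    """
    bisect.insort(bins, [value, 1])
    if len(bins) > max_bins:
        # Index of the adjacent pair with the smallest gap between centers.
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (c1, n1), (c2, n2) = bins[i], bins[i + 1]
        # Replace the pair with one bin at the count-weighted mean center.
        bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

def quantile(bins, q):
    """Approximate q-quantile: center of the first bin whose cumulative count reaches q*N."""
    total = sum(n for _, n in bins)
    cumulative = 0
    for center, n in bins:
        cumulative += n
        if cumulative >= q * total:
            return center
    return bins[-1][0]
```

A median is then `quantile(bins, 0.5)` and an IQR is `quantile(bins, 0.75) - quantile(bins, 0.25)`; memory use is bounded by `max_bins` regardless of how many values have been streamed.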

Files in This Item:

File	Size	Format
U0001-1008202001364900.pdf (not authorized for public access)	3.81 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.