Partial-Combination based Supervised Prediction and Unsupervised Anomaly Detection in Mixed Datasets

Yi-Hsin Wu; 吳怡欣

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/16844

Title:	Partial-Combination based Supervised Prediction and Unsupervised Anomaly Detection in Mixed Datasets (基於部分組合並應用於類別數值混合型數據集之監督式預測與非監督式異常偵測)
Author:	Yi-Hsin Wu (吳怡欣)
Advisor:	Sheng-De Wang (王勝德)
Keywords:	partial combination, machine learning, supervised prediction, model selection, unsupervised anomaly detection, in-stream analytics, data binning, robust statistics
Year of publication:	2020
Degree:	Doctoral
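Abstract (translated from the Chinese):	In recent years, with the rapid development of machine learning algorithms and related platforms and tools, putting machine learning into practice on domain application problems has become an increasingly valued research direction. In practical applications, the transparency of the decision process and the traceability of the results are harder to quantify, yet they are as important as metrics generally regarded as easier to quantify, such as accuracy, computation speed, and resource consumption. In view of this, this study uses mixed categorical-numerical datasets from several application domains as examples and focuses on two major branches of machine learning: supervised prediction and unsupervised anomaly detection. It proposes machine learning methods that provide or improve decision transparency and discusses their advantages over existing machine learning methods, with the aim of meeting the practical needs of industry, which also values decision transparency when adopting machine learning solutions. First, this study proposes a supervised prediction method that partitions the original dataset into many subsets according to different combinations of categorical attributes and thereby produces prediction models for different categorical attribute combinations, including fundamental-, partial-, and full-combination models. A fundamental-combination model is built from the data subset in which all categorical variables are fixed; it is characterized by little data but high specificity. A partial-combination model is built from the data subset in which some specific categorical variables are fixed; its data volume and specificity are both moderate. The full-combination model is the model built from the entire dataset. Finally, for each fundamental combination, its training and validation datasets can be used to select the better prediction model; this benefits both the transparency and the accuracy of the model. Next, for unsupervised machine learning, this study proposes a real-time in-stream analytics framework that uses the existing Unsupervised Incremental Binning (UIB) method to store only a fixed number of bin statistics, which reduces the storage space needed for raw data and speeds up analysis. After the raw data are binned and the data volume is greatly reduced, when a user issues an analysis query based on partial combinations, the proposed framework can efficiently aggregate the relevant fundamental-combination bin subsets and quickly perform cross-combination anomaly scoring through the steps of robust statistics, hierarchical clustering, and anomaly scoring. In tests on data from a semiconductor test factory, anomaly detection was performed group by group over multiple categorical factors such as test equipment and component items. Compared with analysis of the raw data, the proposed binned in-stream analysis ensures both the timeliness and the accuracy of anomaly detection, and the resulting anomaly scores have a transparent computation process and interpretable results. Finally, this study also extends the aforementioned unsupervised incremental binning method to support multi-dimensional arrays of numerical data. By caching array indices, each datum is received with O(1) complexity, and once reception is complete the data array can be converted with O(n) complexity into order statistics and a grouping table for subsequent analysis. The experiment uses the analysis of high-energy particle collision data as a case study: when the number of bins is set to 65,536, the method converts the original 32-bit floating-point numbers into 16-bit integers, saving 50% of storage space while maintaining high accuracy in the analysis results.
Abstract (English version):	In recent years, machine learning (ML) methods and tools have become increasingly mature and available, which has led to a growing research field that highlights successful ML applications for specific domain problems. Yet many studies have identified several gaps between academic deliverables and business expectations. For example, decision support systems may have been constructed as black boxes, while practitioners may be concerned about the transparency and interpretability of the underlying decision processes.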
Moreover, in addition to traditional key performance indicators (KPIs) such as accuracy and computational complexity for evaluating different ML methods, a method's transparency and interpretability are also important when ML is to meet business expectations in domain applications. Devoted to the study of more transparent and interpretable ML approaches, this thesis proposes both a supervised prediction method and an unsupervised anomaly detection framework, using mixed-type datasets from domain applications as showcases. First, a supervised prediction method is proposed that uses partial-combination models to improve prediction quality and transparency in mixed datasets with complex interactions. Multiple regression models can be generated by partitioning a dataset into subsets with different categorical attribute combinations. More specifically, given a mixed-type dataset, a fundamental-combination model can be built from the subset in which every categorical variable is fixed to a specific value; a partial-combination model can be built from a subset in which at least one categorical variable is fixed to a specific value; and the full-combination model is built from the full dataset. Then, for a specific fundamental combination, a better model can be selected from the related fundamental-, partial-, and full-combination models by comparing the true and predicted results on the training and validation splits of the corresponding fundamental-combination subset. Compared with existing methods, the proposed partial-combination based supervised prediction improves both model transparency and accuracy. Second, an unsupervised anomaly detection approach is proposed and demonstrated through an in-stream analytical framework.
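As a minimal sketch of the fundamental-, partial-, and full-combination model selection described above, the Python below fits one candidate model for every subset of fixed categorical variables and keeps the candidate with the lowest validation error on the fundamental-combination subset. It is not the thesis implementation: the one-feature least-squares model, the helper names (`fit_line`, `mse`, `select_model`), the `tester`/`site` toy data, and the use of validation error alone as the selection criterion are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # all x identical: fall back to a constant model
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    return a, my - a * mx


def mse(model, points):
    """Mean squared error of a (slope, intercept) model on (x, y) pairs."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in points)


def select_model(train, valid, fundamental):
    """Pick the best candidate for one fundamental combination.

    train, valid: lists of (cats, x, y), where cats maps variable name to value.
    fundamental: dict that fixes every categorical variable.
    Returns (fixed_variables, model, validation_mse).
    """
    names = sorted(fundamental)
    valid_pts = [(x, y) for cats, x, y in valid
                 if all(cats[k] == fundamental[k] for k in names)]
    if not valid_pts:
        raise ValueError("no validation rows for this fundamental combination")
    best = None
    # r = 0 is the full combination, r = len(names) the fundamental one,
    # and everything in between is a partial combination.
    for r in range(len(names) + 1):
        for fixed in combinations(names, r):
            pts = [(x, y) for cats, x, y in train
                   if all(cats[k] == fundamental[k] for k in fixed)]
            if len(pts) < 2:  # too little data to fit a line
                continue
            model = fit_line(pts)
            err = mse(model, valid_pts)
            if best is None or err < best[2]:
                best = (fixed, model, err)
    return best


# Toy data (assumed): y = 2x on tester A, y = -x on tester B.
train = [
    ({"tester": "A", "site": 1}, 1.0, 2.0),
    ({"tester": "A", "site": 2}, 2.0, 4.0),
    ({"tester": "A", "site": 2}, 3.0, 6.0),
    ({"tester": "B", "site": 1}, 1.0, -1.0),
    ({"tester": "B", "site": 1}, 2.0, -2.0),
    ({"tester": "B", "site": 2}, 3.0, -3.0),
]
valid = [
    ({"tester": "A", "site": 1}, 4.0, 8.0),
    ({"tester": "A", "site": 1}, 5.0, 10.0),
]
best_fixed, best_model, best_err = select_model(
    train, valid, {"tester": "A", "site": 1})
```

In this toy data the fundamental subset (tester A, site 1) has only one training row, so it is skipped, and the partial combination that fixes only `tester` is selected. The selected tuple of fixed variables also documents which slice of the data the prediction comes from, which is the transparency benefit the abstract refers to.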
The anomaly detection approach detects and scores equipment faults during semiconductor test processes. The score conveys the meaning of an anomaly in terms understandable to a human and thus makes the decision process for identifying equipment anomalies transparent and interpretable. Moreover, the approach eliminates the need to store the large amounts of raw measurement data generated during semiconductor test processes while maintaining accuracy for later analysis. It relies on Unsupervised Incremental Binning (UIB), which is based on an on-line agglomerative clustering method from the literature and automatically, incrementally, and dynamically groups incoming measured values into a small but sufficient maximum number of bins. The bins preserve the robust statistical properties of medians and interquartile ranges. Partial combinations can then be applied through different configurations of equipment grouping, such as by site (a tester may have multiple sites to test semiconductor devices simultaneously) and by tester (a factory may deploy parallel machines). Experimental results show that the unsupervised ML framework is fast and cost-effective and reaches a scoring accuracy of 99.6%, which satisfies industrial-level use scenarios. The framework is currently operating on-line in a large semiconductor test factory in Taiwan. UIB is also extended to support the processing of multi-dimensional streaming vectors. By recording and manipulating the index information for each newly received value, the original raw numerical vectors can be transformed into orders and dictionaries, storing only 16-bit integers instead of 32-bit floating-point numbers when the compression parameter maxNumBins is set to 65,536. This approach is demonstrated through a particle mass reconstruction application, in which the analytical results retain high accuracy.
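The binning idea can be illustrated with a generic streaming histogram that merges the two closest bins whenever a size limit is exceeded, then reads the median and interquartile range from the bin counts. This is a sketch under stated assumptions, not UIB itself: the class `IncrementalBins`, the merge rule (count-weighted mean of the two nearest adjacent bins), the step-wise quantile lookup, and the score `(x - median) / IQR` are hypothetical choices made for illustration.

```python
import bisect


class IncrementalBins:
    """Hypothetical streaming histogram: at most max_bins [center, count] pairs."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # kept sorted by center

    def add(self, value):
        centers = [c for c, _ in self.bins]
        i = bisect.bisect_left(centers, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1  # exact match: only bump the count
            return
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Agglomerative step: merge the adjacent pair with the smallest gap.
        j = min(range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        self.bins[j:j + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    def quantile(self, q):
        # Step-wise estimate: first center whose cumulative count reaches q * total.
        total = sum(n for _, n in self.bins)
        running = 0
        for center, n in self.bins:
            running += n
            if running >= q * total:
                return center
        return self.bins[-1][0]

    def score(self, x):
        # Robust deviation: distance from the median in units of the IQR.
        median = self.quantile(0.5)
        iqr = self.quantile(0.75) - self.quantile(0.25)
        if iqr == 0:
            return 0.0 if x == median else float("inf")
        return (x - median) / iqr


# Feed the same stream into a lossless and a heavily compressed histogram.
exact = IncrementalBins(max_bins=100)
coarse = IncrementalBins(max_bins=10)
for v in range(1, 101):
    exact.add(float(v))
    coarse.add(float(v))
```

Only the bin centers and counts are retained, so memory stays bounded by `max_bins` regardless of stream length. Histograms for different fundamental combinations (for example, one per tester and site) could be combined by re-inserting their bins before scoring, although that aggregation step is not shown here.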
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/16844
DOI:	10.6342/NTU202002756
Full-text authorization:	Not authorized
Appears in collections:	Department of Electrical Engineering

Files in this item:

File	Size	Format
U0001-1008202001364900.pdf (not authorized for public access)	3.81 MB	Adobe PDF

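Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.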
