Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91898
Title: | 機器學習演算法於電漿光譜分析之可信度和適用性研究 Applicability and Reliability of Machine Learning Algorithms in Plasma Spectroscopy |
Author: | 許哲瑋 Jhe-Wei Syu |
Advisor: | 徐振哲 Cheng-Che Hsu |
Keywords: | microplasma, plasma emission spectroscopy, machine learning, principal component analysis, support vector machine, linear discriminant analysis |
Publication year: | 2022 |
Degree: | Master's |
Abstract: | In this study, a homemade microplasma generation device was used as the excitation source to collect a total of 100,000 plasma emission spectra of four volatile organic compounds (VOCs), establishing a database for machine learning training. The main purpose of this study is to analyze the applicability and reliability of various machine learning algorithms for plasma spectral analysis. This thesis discusses the concepts behind the algorithms used and how well each suits the characteristics of spectral analysis. In addition, the data distribution and the features used by the algorithms are visualized to provide a deeper understanding of how machine learning applies to plasma spectral analysis.
The first part covers data preprocessing and data distribution analysis. To mitigate the effects of the light-collection position and environment on each acquisition, the raw spectra are scaled before analysis. Because each scaled spectrum still has 3648 feature dimensions, dimensionality reduction techniques from machine learning are then used to project the original high-dimensional data onto a lower-dimensional feature space with as little distortion as possible, so that the data distribution can be analyzed visually.
In the part on the applicability of the algorithms to plasma spectral analysis, the concepts and implementations of support vector machine (SVM), linear discriminant analysis (LDA), random forest (RF), k-nearest neighbors (k-NN), and decision tree (DT) are first discussed, and the logic of manually identifying spectra is then used to judge the applicability of each algorithm. The classification accuracies are 99.99 % (SVM), 98.79 % (LDA), 90.26 % (RF), 86.46 % (k-NN), and 84.84 % (DT). These results show that most algorithms perform well in multivariate spectral analysis; the exceptions are algorithms whose underlying concept is inconsistent with the logic of manual spectrum identification. In addition, feature extraction is used to reduce the complexity of the data, lowering the computational cost of the model and making it more efficient. However, we found that inappropriate feature extraction, although it reduces data complexity and computational cost, also degrades classification performance. For example, after the data dimension is reduced with an inappropriate feature extraction method, the computing time of SVM falls by a factor of nearly 50, but its classification accuracy drops from 99.99 % to 71 %, which is unsatisfactory. In contrast, appropriate feature extraction maintains the original high accuracy while reducing the computing time by a factor of 10.
Finally, in the part on the reliability of the algorithms in plasma spectral analysis, feature selection is used to pick out the features that the algorithms treat as most important during computation. These features are visualized and compared with the characteristic emission regions of the spectra to verify the reliability of the algorithms. The comparison shows that the wavelengths identified as important are concentrated mostly in the 306-310, 425-435, 500-530, 650-660, and 700-800 nm regions, which are very close to the characteristic emissions of the organic compounds used in this study, such as OH (308.54 nm), C2 (516.11 nm), CH (430.91 nm), Hα (656.2 nm), and Ar (700-800 nm). Moreover, the feature-importance ranking is consistent with prior knowledge of the spectra. The algorithms therefore do not behave like black boxes that pick features arbitrarily or perform unknown operations. These feature identification results, together with the explicit linear-algebra calculations involved, make the results obtained by machine learning more interpretable and reliable. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91898 |
DOI: | 10.6342/NTU202201760 |
Full-text authorization: | Not authorized |
Appears in collections: | Department of Chemical Engineering |
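The abstract's reliability argument rests on a simple comparison: the wavelength regions the algorithms rank as important should contain the known characteristic emissions. The sketch below illustrates that check using only the numbers quoted in the abstract. It is not the thesis's code; the names `IMPORTANT_BANDS_NM`, `CHARACTERISTIC_EMISSIONS_NM`, and `match_bands` are hypothetical, and storing single lines as zero-width ranges is an assumption made so that the Ar range can be handled the same way.

```python
# Wavelength bands (nm) that the abstract reports as important to the algorithms.
IMPORTANT_BANDS_NM = [(306, 310), (425, 435), (500, 530), (650, 660), (700, 800)]

# Characteristic emissions quoted in the abstract. Ar is given there as a range,
# so every entry is stored as a (low, high) pair; single lines use low == high.
CHARACTERISTIC_EMISSIONS_NM = {
    "OH": (308.54, 308.54),
    "C2": (516.11, 516.11),
    "CH": (430.91, 430.91),
    "H-alpha": (656.2, 656.2),
    "Ar": (700.0, 800.0),
}


def match_bands(emissions, bands):
    """Map each species to the band that fully contains its emission, or None."""
    matches = {}
    for species, (low, high) in emissions.items():
        matches[species] = next(
            (band for band in bands if band[0] <= low and high <= band[1]),
            None,
        )
    return matches


if __name__ == "__main__":
    result = match_bands(CHARACTERISTIC_EMISSIONS_NM, IMPORTANT_BANDS_NM)
    for species, band in result.items():
        print(species, band)
```

A species mapped to `None` would mark an emission that falls outside every band the algorithms flagged. With the values quoted in the abstract, every listed emission falls inside one of the reported bands.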
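Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-110-2.pdf Not currently authorized for public access | 9.02 MB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.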