Assessment of Lao-Jie River Water Quality Using Multivariate Statistics and Machine Learning Techniques

Yi-Ching Cheng; 鄭奕晴

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7489

Title:	Assessment of Lao-Jie River Water Quality Using Multivariate Statistics and Machine Learning Techniques (應用多變量分析及機器學習技術於老街溪水質評估)
Author:	Yi-Ching Cheng (鄭奕晴)
Advisor:	于昌平
Keywords:	Water Monitoring, Water Quality Assessment, Multivariate Statistical Techniques, Principal Components Analysis, Factor Analysis, Cluster Analysis, Machine Learning, Decision Forest, Neural Network
Publication year:	2018
Degree:	Master's
Abstract:	This study explores a long-term water quality monitoring dataset using data mining methods, namely multivariate statistical and machine learning techniques, to examine how pollutants in river water vary over time and how they relate to one another. Monitoring data were collected for the Lao-Jie River basin in Taoyuan City over 15 years (2002–2016). The mainstream is about 37 km long, and the analysis covers seven monitoring sites on the mainstream Lao-Jie River and its tributary, Da-Keng-Que Creek, with 10 to 32 water quality parameters collected at each site every month.
The resulting large dataset (about 21,194 observations in total) was evaluated with Principal Components Analysis (PCA), Factor Analysis (FA), and Cluster Analysis (CA). Beyond identifying water quality characteristics, the analysis included a time axis to examine how water quality changes over time. PCA/FA extracted six factors that explain 70 % of the variance in the dataset. These factors point to pollution sources including complex pollutants, rainfall wash-off, industrial wastewater effluent (such as from the semiconductor and printed circuit board industries), and metals commonly used in industry. CA classified the seven sites into three groups: tributary, upstream, and downstream. After the highly polluted tributary joins the mainstream, it affects water quality in the middle and lower reaches, so the membership of the upstream and downstream groups changes from year to year.
Machine learning can also be used for data mining of water quality monitoring datasets. This study addressed two tasks, determining whether the copper concentration in the water exceeds the standard and estimating the River Pollution Index (RPI), and built both a decision forest model and a neural network model for each task. For copper exceedance, the decision forest model had the higher accuracy (0.83) and indicated that suspended solids, electrical conductivity, and sampling site are important indicators for the decision. For RPI estimation, the decision forest model likewise had the smaller error, with a mean absolute error of 0.352 and a mean absolute percentage error of 0.087, and indicated that biochemical oxygen demand (BOD) and ammonia nitrogen are important inputs to the decision.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7489
DOI:	10.6342/NTU201803135
Full-text authorization:	Authorized (open access worldwide)
Appears in department:	Graduate Institute of Environmental Engineering

Files in this item:

File	Size	Format
ntu-107-1.pdf	3.16 MB	Adobe PDF
