請用此 Handle URI 來引用此文件:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57088
標題: | 度量空間分群問題的數學理論 A Mathematical Theory for Clustering in Metric Spaces |
作者: | Yu-Sheng Chen 陳宇陞 |
指導教授: | 廖婉君(Wan-Jiun Liao) |
關鍵字: | 分群,距離,凝聚力,階層式演算法,分區式演算法, clustering,distance measure,cohesion measure,hierarchical algorithm,partitional algorithm, |
出版年 : | 2014 |
學位: | 碩士 |
摘要: | 分群問題在許多領域都受到廣泛的應用,包括機器學習、資料探勘、圖像分析及生物信息…等。分群是把資料中的個體透過特定的方法分成不同的組別,使在同一個群集中的個體都具有相似的特性,而在度量空間中我們探討的是更短的空間距離。我們審視了前人提出的眾多演算法,諸如:K-means、K-medoids或其他階層式的分群演算法,我們可以發現這些演算法對群集的定義都是概念式的敘述或操作型的定義,而無法以數學方式精確的定義並描繪出一個群集應該具有的特性。
在這篇論文中,我們從「相對」的角度出發,定義了點和點以及集合和集合之間的相對距離以及凝聚力。利用集合間的凝聚力我們定義了群集是一群和自己凝聚力強大的點所形成的集合。我們提出一個基於凝聚力的階層式演算法。與其他階層式演算法不同的是我們提出的演算法有一個明確的中止標準,且其輸出的集合皆是我們定義的群集。此外,我們定義了點和集合之間的三角距離,並提出另一個演算法,可以將資料分成固定K個集合的演算法─K-sets 演算法,透過後面的實驗可以推論輸出的K個集合有極大的可能會是我們所定義的群集。我們更進一步證明若我們放寬對距離的限制(如:允許非對稱與非正值的距離),我們所提出的定義和演算法仍然適用。最後透過實驗結果,我們可以發現現實中的群集確實符合我們對其的定義,且我們提出的兩個演算法可以正確地將資料中的群集分類。 Clustering analysis is widely used in so many fields, such as machine learning, data mining, image analysis and bioinformatics. Clustering is the task of grouping a set of objects in a specific way that objects in the same cluster have similar characteristic. Here, we consider the shorter distance measure in the metric spaces as the similar characteristic. We review some known clustering algorithms, such as K-means, K-medoids, and hierarchical clustering algorithms. We find out that all of these algorithms only give cluster a conceptual description or an operational definition. However, they cannot use a mathematical point of view to describe what a cluster looks like. In this work, we start from the point of view of “relative” and define the relative distance and cohesion measure between any two points or two sets. Making use of the cohesion measure, we define a cluster is a set of data points which is cohesive to itself. Then, we propose a hierarchical algorithm based on the cohesion measure. Unlike standard hierarchical agglomerative algorithms, our hierarchical agglomerative algorithm has a stop criterion and it stops with a partition of clusters. In addition, we define the triangular distance between a point and a set, and propose another clustering algorithm that can separate the dataset into fixed-K sets. We will show that all the K sets may probably be clusters in the experiment. We further proof that our definitions and algorithms will still hold if we release some constraints of the distance measure. Finally, we can show that clusters in reality really meet the definition of ours, and the two proposed algorithms can correctly separate the dataset into several clusters. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57088 |
全文授權: | 有償授權 |
顯示於系所單位: | 電機工程學系 |
文件中的檔案:
檔案 | 大小 | 格式 | |
---|---|---|---|
ntu-103-1.pdf 目前未授權公開取用 | 1.89 MB | Adobe PDF |
系統中的文件,除了特別指名其著作權條款之外,均受到著作權保護,並且保留所有的權利。