度量空間分群問題的數學理論

Yu-Sheng Chen; 陳宇陞

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57088

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	廖婉君(Wan-Jiun Liao)
dc.contributor.author	Yu-Sheng Chen	en
dc.contributor.author	陳宇陞	zh_TW
dc.date.accessioned	2021-06-16T06:34:39Z	-
dc.date.available	2019-09-04
dc.date.copyright	2014-09-04
dc.date.issued	2014
dc.date.submitted	2014-08-04
dc.identifier.citation	1. Lloyd, S., Least squares quantization in PCM. Information Theory, IEEE Transactions on, 1982. 28(2): p. 129-137. 2. Arthur, D. and S. Vassilvitskii. k-means++: The advantages of careful seeding. in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. 2007. Society for Industrial and Applied Mathematics. 3. Bahmani, B., et al., Scalable k-means++. Proceedings of the VLDB Endowment, 2012. 5(7): p. 622-633. 4. Kanungo, T., et al., An efficient k-means clustering algorithm: Analysis and implementation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2002. 24(7): p. 881-892. 5. Elkan, C. Using the triangle inequality to accelerate k-means. in ICML. 2003. 6. Dhillon, I.S. and D.S. Modha, Concept decompositions for large sparse text data using clustering. Machine learning, 2001. 42(1-2): p. 143-175. 7. Cordeiro de Amorim, R. and B. Mirkin, Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering. Pattern Recognition, 2012. 45(3): p. 1061-1075. 8. Kaufman, L. and P. Rousseeuw, Clustering by means of medoids. 1987, Statistical Data Analysis Based on the L1 Norm and Related Methods North-Holland. 9. Sibson, R., SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 1973. 16(1): p. 30-34. 10. Defays, D., An efficient algorithm for a complete link method. The Computer Journal, 1977. 20(4): p. 364-366. 11. Girvan, M. and M.E. Newman, Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 2002. 99(12): p. 7821-7826. 12. Newman, M.E. and M. Girvan, Finding and evaluating community structure in networks. Physical review E, 2004. 69(2): p. 026113. 13. Blondel, V.D., et al., Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008. 2008(10): p. P10008. 14. Zachary, W.W., An information flow model for conflict and fission in small groups. Journal of anthropological research, 1977: p. 452-473.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/57088	-
dc.description.abstract	分群問題在許多領域都受到廣泛的應用，包括機器學習、資料探勘、圖像分析及生物信息…等。分群是把資料中的個體透過特定的方法分成不同的組別，使在同一個群集中的個體都具有相似的特性，而在度量空間中我們探討的是更短的空間距離。我們審視了前人提出的眾多演算法，諸如：K-means、K-medoids或其他階層式的分群演算法，我們可以發現這些演算法對群集的定義都是概念式的敘述或操作型的定義，而無法以數學方式精確的定義並描繪出一個群集應該具有的特性。　　在這篇論文中，我們從「相對」的角度出發，定義了點和點以及集合和集合之間的相對距離以及凝聚力。利用集合間的凝聚力我們定義了群集是一群和自己凝聚力強大的點所形成的集合。我們提出一個基於凝聚力的階層式演算法。與其他階層式演算法不同的是我們提出的演算法有一個明確的中止標準，且其輸出的集合皆是我們定義的群集。此外，我們定義了點和集合之間的三角距離，並提出另一個演算法，可以將資料分成固定K個集合的演算法─K-sets 演算法，透過後面的實驗可以推論輸出的K個集合有極大的可能會是我們所定義的群集。我們更進一步證明若我們放寬對距離的限制(如：允許非對稱與非正值的距離)，我們所提出的定義和演算法仍然適用。最後透過實驗結果，我們可以發現現實中的群集確實符合我們對其的定義，且我們提出的兩個演算法可以正確地將資料中的群集分類。	zh_TW
dc.description.abstract	Clustering analysis is widely used in so many fields, such as machine learning, data mining, image analysis and bioinformatics. Clustering is the task of grouping a set of objects in a specific way that objects in the same cluster have similar characteristic. Here, we consider the shorter distance measure in the metric spaces as the similar characteristic. We review some known clustering algorithms, such as K-means, K-medoids, and hierarchical clustering algorithms. We find out that all of these algorithms only give cluster a conceptual description or an operational definition. However, they cannot use a mathematical point of view to describe what a cluster looks like. 　　In this work, we start from the point of view of “relative” and define the relative distance and cohesion measure between any two points or two sets. Making use of the cohesion measure, we define a cluster is a set of data points which is cohesive to itself. Then, we propose a hierarchical algorithm based on the cohesion measure. Unlike standard hierarchical agglomerative algorithms, our hierarchical agglomerative algorithm has a stop criterion and it stops with a partition of clusters. In addition, we define the triangular distance between a point and a set, and propose another clustering algorithm that can separate the dataset into fixed-K sets. We will show that all the K sets may probably be clusters in the experiment. We further proof that our definitions and algorithms will still hold if we release some constraints of the distance measure. Finally, we can show that clusters in reality really meet the definition of ours, and the two proposed algorithms can correctly separate the dataset into several clusters.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T06:34:39Z (GMT). No. of bitstreams: 1 ntu-103-R01921042-1.pdf: 1936667 bytes, checksum: 6fb0a6057b4231e3119d4383fc7b069e (MD5) Previous issue date: 2014	en
dc.description.tableofcontents	口試委員會審定書 # 誌謝 i 中文摘要 iii ABSTRACT v CONTENTS vii LIST OF FIGURES ix LIST OF TABLES xi Chapter 1 Introduction 1 1.1 Clustering in Metric Spaces 1 1.2 Review of Known Clustering Algorithm 2 1.2.1 K-means Algorithm 2 1.2.2 K-medoids Algorithm 3 1.2.3 Hierarchical Clustering Algorithm 4 1.3 Motivation 4 1.4 Thesis Organization 5 Chapter 2 Clusters in Metric Spaces 7 2.1 Relative Distance and Cohesion Measure 7 2.2 Clusters 17 2.3 Hierarchical Agglomerative Clustering Algorithm 21 Chapter 3 Triangular Distance and the K-sets Algorithm 25 3.1 Triangular Distance 25 3.2 The K-sets Algorithm 29 3.3 The Sparse K-sets Algorithm 32 Chapter 4 Extension 37 4.1 The Constraint of Distance Measure 37 Chapter 5 Experimental Result 55 5.1 Evaluation of Cluster Definition 55 5.1.1 Zachary’s Karate Club Network 55 5.1.2 American College Football Network 56 5.2 Evaluation of Hierarchical Agglomerative Clustering Algorithm 59 5.2.1 Computer Generated Data 59 5.2.2 Zachary’s Karate Club Network 60 5.3 Evaluation of the K-sets Algorithm 62 5.3.1 Computer Generated Data 62 5.3.2 Two Rings 63 5.3.3 Clusters Verification 65 Chapter 6 Conclusion and Future Work 67 6.1 Conclusion 67 6.2 Future Work 67 REFERENCE 69
dc.language.iso	en
dc.subject	階層式演算法	zh_TW
dc.subject	凝聚力	zh_TW
dc.subject	距離	zh_TW
dc.subject	分群	zh_TW
dc.subject	分區式演算法	zh_TW
dc.subject	clustering	en
dc.subject	distance measure	en
dc.subject	cohesion measure	en
dc.subject	hierarchical algorithm	en
dc.subject	partitional algorithm	en
dc.title	度量空間分群問題的數學理論	zh_TW
dc.title	A Mathematical Theory for Clustering in Metric Spaces	en
dc.type	Thesis
dc.date.schoolyear	102-2
dc.description.degree	碩士
dc.contributor.oralexamcommittee	張正尚(Cheng-Shang Chang),陳銘憲(Ming-Syan Chen),林軒田(Hsuan-Tien Lin),陳和麟(Ho-Lin Chen)
dc.subject.keyword	分群,距離,凝聚力,階層式演算法,分區式演算法,	zh_TW
dc.subject.keyword	clustering,distance measure,cohesion measure,hierarchical algorithm,partitional algorithm,	en
dc.relation.page	70
dc.rights.note	有償授權
dc.date.accepted	2014-08-04
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電機工程學研究所	zh_TW
顯示於系所單位：	電機工程學系

文件中的檔案：

檔案	大小	格式
ntu-103-1.pdf 未授權公開取用	1.89 MB	Adobe PDF

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。