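Clustering Mixed-type Data Using A Weighted Dissimilarity Measure and An Evaluation of Some Dissimilarity Measures for Categorical Data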

魏博鈞; Bo-Jun Wei

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	蔡政安	zh_TW
dc.contributor.advisor	Chen-An Tsai	en
dc.contributor.author	魏博鈞	zh_TW
dc.contributor.author	Bo-Jun Wei	en
dc.date.accessioned	2023-09-22T17:14:06Z	-
dc.date.available	2023-11-09	-
dc.date.copyright	2023-09-22	-
dc.date.issued	2023	-
dc.date.submitted	2023-08-07	-
dc.identifier.citation	Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999. Boris Mirkin. Clustering: a data recovery approach. CRC press, 2012. Yizhang Jiang, Kaifa Zhao, Kaijian Xia, Jing Xue, Leyuan Zhou, Yang Ding, and Pengjiang Qian. A novel distributed multitask fuzzy clustering algorithm for automatic mr brain image segmentation. Journal of medical systems, 43:1–9, 2019. Raphael Petegrosso, Zhuliu Li, and Rui Kuang. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in bioinformatics, 21(4):1209–1223, 2020. James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967. Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341, 2009. Zhexue Huang and M.K. Ng. A fuzzy k-modes algorithm for clustering categorical data. IEEE Transactions on Fuzzy Systems, 7(4):446–452, 1999. Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3):283–304, 1998. Sotirios P Chatzis. A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert systems with applications, 38(7):8684–8689, 2011. Zhi Zheng, Maoguo Gong, Jingjing Ma, Licheng Jiao, and Qiaodi Wu. Unsupervised evolutionary clustering algorithm for mixed type data. In IEEE congress on evolutionary computation, pages 1–8. IEEE, 2010. Jinchao Ji, Tian Bai, Chunguang Zhou, Chao Ma, and Zhe Wang. An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing, 120:590–596, 2013. Alex Foss, Marianthi Markatou, Bonnie Ray, and Aliza Heching. A semiparametric method for clustering mixed data. Machine Learning, 105:419–458, 2016. Jinchao Ji, Wei Pang, Chunguang Zhou, Xiao Han, and Zhe Wang. A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowledge-Based Systems, 30:129–135, 2012. Jiye Liang, Xingwang Zhao, Deyu Li, Fuyuan Cao, and Chuangyin Dang. Determining the number of clusters using information entropy for mixed data. Pattern Recognition, 45(6):2251–2265, 2012. Wei-Dong Zhao, Wei-Hui Dai, and Chun-Bin Tang. K-centers algorithm for clustering mixed type data. In Advances in Knowledge Discovery and Data Mining: 11th Pacific-Asia Conference, PAKDD 2007, Nanjing, China, May 22-25, 2007. Proceedings 11, pages 1140–1147. Springer, 2007. Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. Applications of data mining in computer security, pages 77–101, 2002. Dekang Lin et al. An information-theoretic definition of similarity. In Icml, volume 98, pages 296–304, 1998. Shyam Boriah, Varun Chandola, and Vipin Kumar. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the 2008 SIAM international conference on data mining, pages 243–254. SIAM, 2008. David W Goodall. A new similarity index based on probability. Biometrics, pages 882–907, 1966. Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1):1–27, 1974. Joseph B Kruskal and Myron Wish. Multidimensional scaling. Number 11. Sage, 1978. Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395–416, 2007. Yiming Yang. An evaluation of statistical approaches to text categorization. Information retrieval, 1(1-2):69–90, 1999. Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193–218, 1985. Damien McParland and Isobel Claire Gormley. Model based clustering for mixed data: clustmd. Advances in Data Analysis and Classification, 10(2):155–169, 2016. Cristina Tortora and Francesco Palumbo. Clustering mixed-type data using a probabilistic distance algorithm. Applied Soft Computing, 130:109704, 2022.	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057	-
dc.description.abstract	K-prototypes法可以用來將有數值型特徵和類別型特徵的混合型資料分群，在此論文中，我們針對其運算時結合數值型特徵間距離和類別型特徵間相異度的權重提出不同制定策略，我們於TDM法中利用資料間之總距離比例制定策略，而於CHM法中，我們利用CH係數來制定權重，另外我們也提出了KMMDS法和SCMDS法，兩種方法利用多維尺度法將混合型資料轉換為數值型資料，並分別使用K-means法和譜分群法分群。　　除了提出權重制定策略，我們也比較了多種類別型特徵相異度計算方法於分群時的表現，這些方法包含了K-prototypes法原本使用之Overlap法、Eskin法、IOF法、OF法、Lin法和Goodall1法。　　最後我們以分群準確率和蘭德係數評估分群方法的效果和相異度計算方法的表現，發現提出的CHM法和KMMDS法表現得較原本的K-prototypes法好，而Lin法則能於分群的類別特徵相異度計算中有穩定的表現。　　期望透過我們的研究，能對權重γ的制定和相異度的選擇提供不一樣的策略，並提供能夠將數值型特徵和類別型特徵有效合併的方法，以提高混合型資料分群的準確度。	zh_TW
dc.description.abstract	K-prototypes is an algorithm for clustering mixed-type data, which consist of numerical and categorical attributes. In this thesis, we propose several strategies for setting the weight that combines the distance between numerical attributes with the dissimilarity between categorical attributes. The first, the Total Distance Method (TDM), sets the weight from the ratio of the total distances of the numerical and categorical attributes. The second, the Calinski-Harabasz Method (CHM), sets the weight using the Calinski-Harabasz index. We also propose two novel methods, KMMDS and SCMDS, both of which use multidimensional scaling to transform categorical attributes into numerical attributes. Once all attributes are quantitative, the data are clustered with K-means (KMMDS) or spectral clustering (SCMDS). Many measures have been proposed for the dissimilarity between two entities, such as Overlap, Eskin, IOF, OF, Lin, and Goodall1. In addition to proposing new weighting strategies, we compare the performance of these dissimilarity measures in clustering tasks. The proposed algorithms, and the dissimilarity measures within each clustering algorithm, are evaluated by two indices: accuracy and the Rand index. Results from a variety of experiments show that the proposed CHM and KMMDS algorithms outperform the original K-prototypes algorithm. Among the dissimilarity measures, Lin performs most consistently across the algorithms used in this thesis. With this study, we aim to improve the accuracy of clustering mixed-type data by offering strategies for setting the weight γ, guidance on selecting a dissimilarity measure, and effective methods for combining numerical and categorical attributes.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T17:14:06Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2023-09-22T17:14:06Z (GMT). No. of bitstreams: 0	en
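dc.description.tableofcontents	Oral Examination Committee Certification i; Acknowledgements ii; Abstract (Chinese) iii; Abstract (English) iv; Table of Contents vi; List of Figures viii; List of Tables ix; List of Symbols xii; Chapter 1. Introduction 1; Chapter 2. Literature Review 3; 2.1 K-prototypes Method 3; 2.2 Similarity Measures for Categorical Data 5; Chapter 3. Methods 9; 3.1 Total Distance Method (TDM) 10; 3.2 CH-index Method (CHM) 10; 3.3 K-Means with MDS (KMMDS) 11; 3.4 Spectral Clustering with MDS (SCMDS) 14; 3.5 Evaluation Metrics 14; Chapter 4. Simulation Study 17; 4.1 Simulation Design 17; 4.2 Simulated Data 19; 4.3 Simulation Results 20; Chapter 5. Real Data Analysis 33; 5.1 Datasets 33; 5.2 Clustering Results 34; Chapter 6. Conclusion and Discussion 43; 6.1 Clustering Methods 43; 6.2 Dissimilarity Measures 44; References 45	-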
dc.language.iso	zh_TW	-
dc.subject	距離方法	zh_TW
dc.subject	權重策略	zh_TW
dc.subject	混合型資料	zh_TW
dc.subject	分群	zh_TW
dc.subject	Clustering	en
dc.subject	Mixed Data	en
dc.subject	Weighted Strategy	en
dc.subject	Dissimilarity Measures	en
dc.title	以權重距離方法分群混合型資料並評估一些類別資料距離計算方法之表現	zh_TW
dc.title	Clustering Mixed-type Data Using A Weighted Dissimilarity Measure and An Evaluation of Some Dissimilarity Measures for Categorical Data	en
dc.type	Thesis	-
dc.date.schoolyear	111-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	薛慧敏;邱春火	zh_TW
dc.contributor.oralexamcommittee	Huei-Min Hsueh;Chun-Huo Chiu	en
dc.subject.keyword	混合型資料,分群,權重策略,距離方法	zh_TW
dc.subject.keyword	Mixed Data,Clustering,Weighted Strategy,Dissimilarity Measures	en
dc.relation.page	48	-
dc.identifier.doi	10.6342/NTU202301617	-
dc.rights.note	Not authorized	-
dc.date.accepted	2023-08-09	-
dc.contributor.author-college	Center for General Education	-
dc.contributor.author-dept	Master Program in Statistics	-
Appears in Collections:	Master Program in Statistics
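
For readers unfamiliar with the weighted dissimilarity that the abstract refers to, the following is a minimal Python sketch of the standard K-prototypes form: a squared Euclidean distance over the numerical attributes plus γ times the Overlap (simple mismatch) count over the categorical attributes. It is an illustration only, not the thesis's implementation. The function names, the sample values, and `gamma=0.5` are hypothetical, and the TDM and CHM strategies for choosing γ are not reproduced here.

```python
def overlap_dissimilarity(a_cat, b_cat):
    """Overlap measure: number of categorical attributes on which the two records differ."""
    return sum(1 for a, b in zip(a_cat, b_cat) if a != b)


def kprototypes_dissimilarity(x_num, x_cat, z_num, z_cat, gamma):
    """Squared Euclidean distance on numerical attributes plus
    gamma times the Overlap dissimilarity on categorical attributes."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, z_num))
    categorical_part = overlap_dissimilarity(x_cat, z_cat)
    return numeric_part + gamma * categorical_part


# Hypothetical record and prototype: two numerical and two categorical attributes.
d = kprototypes_dissimilarity(
    [1.0, 2.0], ["red", "small"],
    [0.0, 2.0], ["red", "large"],
    gamma=0.5,  # illustrative value only
)
print(d)
```

A larger γ makes categorical mismatches dominate the numerical distance, which is why the choice of γ affects the resulting clusters.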

Files in This Item:

File	Size	Format
ntu-111-2.pdf (not authorized for public access)	1.1 MB	Adobe PDF

Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.