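Clustering Mixed-type Data Using A Weighted Dissimilarity Measure and An Evaluation of Some Dissimilarity Measures for Categorical Data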

魏博鈞; Bo-Jun Wei

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057

Title:	以權重距離方法分群混合型資料並評估一些類別資料距離計算方法之表現 Clustering Mixed-type Data Using A Weighted Dissimilarity Measure and An Evaluation of Some Dissimilarity Measures for Categorical Data
Authors:	魏博鈞 Bo-Jun Wei
Advisor:	蔡政安 Chen-An Tsai
Keyword:	Mixed Data, Clustering, Weighted Strategy, Dissimilarity Measures
Publication Year:	2023
Degree:	Master's
Abstract:	K-prototypes is a clustering algorithm for mixed-type data, which consist of both numerical and categorical attributes. In this thesis, we propose several strategies for setting the weight γ that combines the distance between numerical attributes with the dissimilarity between categorical attributes. The Total Distance Method (TDM) sets the weight based on the ratio of the total distances for numerical and categorical attributes, and the Calinski-Harabasz Method (CHM) sets it using the Calinski-Harabasz index. We also propose two further methods, KMMDS and SCMDS, which use multidimensional scaling to transform the categorical attributes into numerical attributes; once all attributes are quantitative, the data are clustered with K-means and spectral clustering, respectively. In addition to these weighting strategies, we compare how several dissimilarity measures for categorical attributes perform in clustering: Overlap (the measure originally used by K-prototypes), Eskin, IOF, OF, Lin and Goodall1. The proposed algorithms and the dissimilarity measures are evaluated under different clustering algorithms with two indices, accuracy and the Rand index.
Results from a variety of experiments show that the proposed CHM and KMMDS algorithms outperform the original K-prototypes algorithm. Among the dissimilarity measures, Lin performs consistently across the algorithms used in this thesis. This study aims to improve the accuracy of clustering mixed-type data by offering alternative strategies for setting the weight γ and choosing a dissimilarity measure, together with an effective way to combine numerical and categorical attributes.
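To illustrate the weighted combination the abstract describes, here is a minimal sketch. It assumes the standard K-prototypes form: squared Euclidean distance on the numerical attributes plus γ times the Overlap mismatch count on the categorical attributes. The function names and example values are hypothetical and are not taken from the thesis, and the TDM and CHM strategies for choosing γ are not reproduced here.

```python
from typing import Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance over the numerical attributes."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def overlap_dissimilarity(a: Sequence[str], b: Sequence[str]) -> int:
    """Overlap measure: number of categorical attributes that differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def mixed_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    y_num: Sequence[float],
    y_cat: Sequence[str],
    gamma: float,
) -> float:
    """Numerical distance plus gamma times categorical dissimilarity.

    Assumption: this is the standard K-prototypes combination; gamma is
    supplied by the caller rather than derived from the data.
    """
    return squared_euclidean(x_num, y_num) + gamma * overlap_dissimilarity(x_cat, y_cat)


# Hypothetical record and cluster prototype with two numerical and
# two categorical attributes each.
record_num, record_cat = [1.0, 2.0], ["red", "small"]
proto_num, proto_cat = [0.0, 0.0], ["red", "large"]

d_low = mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=0.5)
d_high = mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=2.0)
print(d_low, d_high)
```

A larger γ makes the single categorical mismatch count for more relative to the numerical distance, which is why the choice of γ can change which prototype a record is assigned to.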
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057
DOI:	10.6342/NTU202301617
Fulltext Rights:	Not authorized
Appears in Collections:	Master's Program in Statistics

Files in This Item:

File	Size	Format
ntu-111-2.pdf Restricted Access	1.1 MB	Adobe PDF
