Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057
Title: Clustering Mixed-type Data Using a Weighted Dissimilarity Measure and an Evaluation of Some Dissimilarity Measures for Categorical Data
Author: Bo-Jun Wei (魏博鈞)
Advisor: Chen-An Tsai (蔡政安)
Keywords: Mixed Data, Clustering, Weighted Strategy, Dissimilarity Measures
Publication year: 2023
Degree: Master's
Abstract:
K-prototypes was developed for clustering mixed-type data composed of numerical and categorical attributes. In this thesis, we propose several strategies for setting the weight that combines the dissimilarities of numerical attributes with those of categorical attributes. The first, the Total Distance Method (TDM), assigns the weight based on the ratio of the total distances of the numerical and categorical attributes. The second, the Calinski-Harabasz Method (CHM), uses the Calinski-Harabasz index to assign the weight. We also propose two novel methods, KMMDS and SCMDS, both of which use multidimensional scaling to transform categorical attributes into numerical attributes. Once all attributes of the derived data are quantitative, the data are clustered with K-means and spectral clustering, respectively.
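As a minimal sketch of what the weight does, the standard K-prototypes dissimilarity adds the squared Euclidean distance on the numerical attributes to γ times the number of categorical mismatches. The function names below are hypothetical. The `ratio_gamma` heuristic is an illustrative assumption only: it shows one way a weight could be derived from a ratio of total distances and is not the thesis's TDM definition, which the abstract does not spell out.

```python
from itertools import combinations


def mixed_dissimilarity(x_num, x_cat, y_num, y_cat, gamma):
    """Squared Euclidean distance on numerical attributes plus
    gamma times the count of categorical mismatches (simple matching)."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    categorical = sum(a != b for a, b in zip(x_cat, y_cat))
    return numeric + gamma * categorical


def ratio_gamma(nums, cats):
    """Hypothetical weight: total pairwise numerical distance divided by
    the total pairwise categorical mismatch count.
    An assumption for illustration, not the thesis's TDM formula."""
    total_num = 0.0
    total_cat = 0
    for i, j in combinations(range(len(nums)), 2):
        total_num += sum((a - b) ** 2 for a, b in zip(nums[i], nums[j]))
        total_cat += sum(a != b for a, b in zip(cats[i], cats[j]))
    # Fall back to 1.0 when no categorical value differs anywhere.
    return total_num / total_cat if total_cat else 1.0


# Toy mixed-type data: two numerical and two categorical attributes.
nums = [(1.0, 2.0), (1.5, 2.5), (8.0, 9.0)]
cats = [("red", "s"), ("red", "m"), ("blue", "m")]

gamma = ratio_gamma(nums, cats)
d01 = mixed_dissimilarity(nums[0], cats[0], nums[1], cats[1], gamma)
print(gamma, d01)
```

A larger γ lets categorical mismatches dominate the dissimilarity, while a γ near zero reduces the measure to the numerical distance alone, which is why the choice of weight matters for the resulting clusters.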
Many measures have been proposed for quantifying the dissimilarity between two entities, including Overlap (the measure originally used by K-prototypes), Eskin, IOF, OF, Lin, and Goodall1. In addition to proposing new strategies for assigning the weight that combines the distances of numerical and categorical attributes, we study how these dissimilarity measures perform in clustering.
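To make two of these measures concrete, the sketch below implements Overlap and Eskin as they are commonly defined in the literature on categorical similarity: Overlap scores 1 for a match and 0 for a mismatch, while Eskin scores a mismatch on attribute k as n_k² / (n_k² + 2), where n_k is the number of distinct values that attribute takes. Averaging over attributes and converting to a dissimilarity as 1 minus the similarity are assumptions made here; the thesis may use a different conversion. Function names are hypothetical.

```python
def overlap_similarity(x, y):
    """Fraction of categorical attributes on which x and y agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def eskin_similarity(x, y, data):
    """Eskin similarity: a match scores 1; a mismatch on attribute k
    scores n_k^2 / (n_k^2 + 2), with n_k the number of distinct values
    attribute k takes in `data`. Scores are averaged over attributes."""
    total = 0.0
    for k in range(len(x)):
        if x[k] == y[k]:
            total += 1.0
        else:
            n_k = len({row[k] for row in data})
            total += n_k ** 2 / (n_k ** 2 + 2)
    return total / len(x)


def to_dissimilarity(similarity):
    """Assumed conversion from a similarity in [0, 1] to a dissimilarity."""
    return 1.0 - similarity


# Toy categorical data: each attribute takes three distinct values.
data = [("red", "s"), ("red", "m"), ("blue", "m"), ("green", "l")]
x, y = data[0], data[1]

print(to_dissimilarity(overlap_similarity(x, y)))
print(to_dissimilarity(eskin_similarity(x, y, data)))
```

Unlike Overlap, Eskin penalizes a mismatch less when the attribute has many possible values, so the two measures can rank the same pairs of records differently.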
The proposed algorithms, and the dissimilarity measures under different clustering algorithms, are evaluated with two indices: accuracy and the Rand index. Results from a variety of experiments show that the proposed CHM and KMMDS algorithms outperform the K-prototypes algorithm. Among the dissimilarity measures, Lin performs more consistently across the algorithms used in this thesis.
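The two indices can be sketched as follows. The Rand index is the fraction of record pairs on which the true labels and the predicted clusters agree (both place the pair together, or both place it apart). Clustering accuracy is taken here as the best agreement over all one-to-one mappings from cluster labels to class labels; that definition is a common convention and an assumption about what the thesis uses. The brute-force search assumes a small number of clusters and no more clusters than classes. Function names are hypothetical.

```python
from itertools import combinations, permutations


def rand_index(truth, pred):
    """Fraction of pairs treated the same way by both labelings."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs
    )
    return agree / len(pairs)


def clustering_accuracy(truth, pred):
    """Best accuracy over one-to-one mappings of clusters to classes.
    Brute force; assumes len(set(pred)) <= len(set(truth))."""
    classes = sorted(set(truth))
    clusters = sorted(set(pred))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(truth, pred))
        best = max(best, hits)
    return best / len(truth)


# Toy example: six records, two true classes, two predicted clusters.
truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]

print(rand_index(truth, pred))
print(clustering_accuracy(truth, pred))
```

Both indices ignore the arbitrary numbering of clusters, which is why a prediction with swapped labels can still score well.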
With this study, we aim to improve the accuracy of clustering mixed-type data by offering alternative strategies for setting the weight γ and for selecting a dissimilarity measure, together with an effective method for combining numerical and categorical attributes.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057
DOI: 10.6342/NTU202301617
Full-text authorization: Not authorized
Appears in collections: Master's Program in Statistics

Files in this document:
File: ntu-111-2.pdf (not authorized for public access)
Size: 1.1 MB
Format: Adobe PDF
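Documents in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.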
