Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057
Full metadata record
DC field [language]: value
dc.contributor.advisor [zh_TW]: 蔡政安
dc.contributor.advisor [en]: Chen-An Tsai
dc.contributor.author [zh_TW]: 魏博鈞
dc.contributor.author [en]: Bo-Jun Wei
dc.date.accessioned: 2023-09-22T17:14:06Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-09-22
dc.date.issued: 2023
dc.date.submitted: 2023-08-07
dc.identifier.citation:
Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.
Boris Mirkin. Clustering: a data recovery approach. CRC press, 2012.
Yizhang Jiang, Kaifa Zhao, Kaijian Xia, Jing Xue, Leyuan Zhou, Yang Ding, and Pengjiang Qian. A novel distributed multitask fuzzy clustering algorithm for automatic MR brain image segmentation. Journal of medical systems, 43:1–9, 2019.
Raphael Petegrosso, Zhuliu Li, and Rui Kuang. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Briefings in bioinformatics, 21(4):1209–1223, 2020.
James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341, 2009.
Zhexue Huang and M.K. Ng. A fuzzy k-modes algorithm for clustering categorical data. IEEE Transactions on Fuzzy Systems, 7(4):446–452, 1999.
Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3):283–304, 1998.
Sotirios P Chatzis. A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert systems with applications, 38(7):8684–8689, 2011.
Zhi Zheng, Maoguo Gong, Jingjing Ma, Licheng Jiao, and Qiaodi Wu. Unsupervised evolutionary clustering algorithm for mixed type data. In IEEE congress on evolutionary computation, pages 1–8. IEEE, 2010.
Jinchao Ji, Tian Bai, Chunguang Zhou, Chao Ma, and Zhe Wang. An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing, 120:590–596, 2013.
Alex Foss, Marianthi Markatou, Bonnie Ray, and Aliza Heching. A semiparametric method for clustering mixed data. Machine Learning, 105:419–458, 2016.
Jinchao Ji, Wei Pang, Chunguang Zhou, Xiao Han, and Zhe Wang. A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowledge-Based Systems, 30:129–135, 2012.
Jiye Liang, Xingwang Zhao, Deyu Li, Fuyuan Cao, and Chuangyin Dang. Determining the number of clusters using information entropy for mixed data. Pattern Recognition, 45(6):2251–2265, 2012.
Wei-Dong Zhao, Wei-Hui Dai, and Chun-Bin Tang. K-centers algorithm for clustering mixed type data. In Advances in Knowledge Discovery and Data Mining: 11th Pacific-Asia Conference, PAKDD 2007, Nanjing, China, May 22-25, 2007. Proceedings 11, pages 1140–1147. Springer, 2007.
Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. Applications of data mining in computer security, pages 77–101, 2002.
Dekang Lin et al. An information-theoretic definition of similarity. In ICML, volume 98, pages 296–304, 1998.
Shyam Boriah, Varun Chandola, and Vipin Kumar. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the 2008 SIAM international conference on data mining, pages 243–254. SIAM, 2008.
David W Goodall. A new similarity index based on probability. Biometrics, pages 882–907, 1966.
Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1):1–27, 1974.
Joseph B Kruskal and Myron Wish. Multidimensional scaling. Number 11. Sage, 1978.
Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395–416, 2007.
Yiming Yang. An evaluation of statistical approaches to text categorization. Information retrieval, 1(1-2):69–90, 1999.
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193–218, 1985.
Damien McParland and Isobel Claire Gormley. Model based clustering for mixed data: clustMD. Advances in Data Analysis and Classification, 10(2):155–169, 2016.
Cristina Tortora and Francesco Palumbo. Clustering mixed-type data using a probabilistic distance algorithm. Applied Soft Computing, 130:109704, 2022.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90057
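dc.description.abstract [zh_TW, translated]:
K-prototypes can cluster mixed-type data that contain both numerical and categorical features. In this thesis, we propose different strategies for setting the weight that combines the distance between numerical features with the dissimilarity between categorical features. In the TDM method, the weight is set from the ratio of total distances among the data; in the CHM method, the weight is set using the CH index. We also propose KMMDS and SCMDS, two methods that use multidimensional scaling to convert mixed-type data into numerical data and then cluster it with K-means and spectral clustering, respectively.
Besides proposing weighting strategies, we compare how several categorical dissimilarity measures perform in clustering: Overlap (the measure originally used by K-prototypes), Eskin, IOF, OF, Lin, and Goodall1.
Finally, we evaluate the clustering methods and the dissimilarity measures using clustering accuracy and the Rand index. The proposed CHM and KMMDS methods perform better than the original K-prototypes, and Lin performs stably as a categorical dissimilarity measure for clustering.
We hope this study offers alternative strategies for setting the weight γ and choosing a dissimilarity measure, and provides methods that effectively combine numerical and categorical features, so as to improve the accuracy of clustering mixed-type data.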
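As a rough illustration of the weight γ discussed in the abstract, the sketch below combines a numerical distance and a categorical dissimilarity in the usual K-prototypes form: squared Euclidean distance plus γ times the number of mismatched categories (Overlap). The function names and the toy record are hypothetical, and the thesis's own TDM and CHM rules for choosing γ are not reproduced here.

```python
from typing import Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance over the numerical attributes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def overlap_dissimilarity(a: Sequence[str], b: Sequence[str]) -> int:
    """Overlap measure: count the categorical attributes that differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma: float) -> float:
    """Numerical part plus gamma times the categorical part."""
    return squared_euclidean(num_a, num_b) + gamma * overlap_dissimilarity(cat_a, cat_b)


# Hypothetical record and prototype, each with two numerical and two categorical attributes.
record_num, record_cat = [1.0, 2.0], ["red", "small"]
proto_num, proto_cat = [0.0, 0.0], ["red", "large"]

# A larger gamma makes the single categorical mismatch count for more.
d_low = mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=0.5)
d_high = mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=2.0)
print(d_low, d_high)
```

Because γ scales the categorical term directly, its value decides whether numerical or categorical attributes dominate the cluster assignment, which is why the choice of γ matters.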
dc.description.abstract [en]:
K-prototypes was developed for clustering mixed-type data composed of numerical and categorical attributes. In this thesis, we propose several strategies for assigning the weight that combines dissimilarities between numerical attributes and categorical attributes. The first, the Total Distance Method (TDM), assigns the weight based on the ratio of total distance between numerical and categorical attributes. The second, the Calinski-Harabasz Method (CHM), uses the Calinski-Harabasz index to assign the weight. We also propose two novel methods, KMMDS and SCMDS, both of which use multidimensional scaling to transform categorical attributes into numerical attributes. Once all attributes are quantitative, we apply K-means (KMMDS) and spectral clustering (SCMDS) to group the data.
Many measures have been proposed for the dissimilarity between two entities, such as Overlap, Eskin, IOF, OF, Lin, and Goodall1. In addition to proposing new strategies for assigning the weight that combines distances over numerical and categorical attributes, we study how these dissimilarity measures perform in clustering tasks.
We evaluate the proposed algorithms, and the dissimilarity measures under different clustering algorithms, using two indices: accuracy and the Rand index. Results from a variety of experiments show that the proposed CHM and KMMDS algorithms perform better than the K-prototypes algorithm. Among the dissimilarity measures, Lin performs most consistently across the algorithms used in this thesis.
In this study, we aim to increase the accuracy of clustering mixed-type data by proposing effective strategies for assigning weights and selecting dissimilarity measures, and effective methods for combining numerical and categorical attributes.
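The two evaluation indices named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the Rand index is computed in its plain, unadjusted pair-counting form, and accuracy is taken as the best one-to-one matching between predicted clusters and true classes, found by brute force (practical only for a few clusters). The function names and toy labels are hypothetical.

```python
from itertools import combinations, permutations


def rand_index(truth, pred):
    """Fraction of point pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        1 for i, j in pairs
        if (truth[i] == truth[j]) == (pred[i] == pred[j])
    )
    return agree / len(pairs)


def clustering_accuracy(truth, pred):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    true_labels = sorted(set(truth))
    pred_labels = sorted(set(pred))
    # Pad with placeholders so every predicted cluster gets a distinct target.
    targets = true_labels + [None] * max(0, len(pred_labels) - len(true_labels))
    best = 0
    for assignment in permutations(targets, len(pred_labels)):
        mapping = dict(zip(pred_labels, assignment))
        hits = sum(1 for t, p in zip(truth, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)


# Toy example: six points, two true classes, one point placed in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
ri = rand_index(truth, pred)
acc = clustering_accuracy(truth, pred)
print(round(ri, 4), round(acc, 4))
```

Both indices ignore the arbitrary numbering of clusters, so a clustering that only swaps label names still scores 1.0.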
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T17:14:06Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2023-09-22T17:14:06Z (GMT). No. of bitstreams: 0
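dc.description.tableofcontents:
Oral Examination Committee Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
Contents vi
List of Figures viii
List of Tables ix
List of Symbols xii
Chapter 1. Introduction 1
Chapter 2. Literature Review 3
2.1 K-prototypes 3
2.2 Similarity Measures for Categorical Data 5
Chapter 3. Methods 9
3.1 Total Distance Method (TDM) 10
3.2 CH-index Method (CHM) 10
3.3 K-Means with MDS (KMMDS) 11
3.4 Spectral Clustering with MDS (SCMDS) 14
3.5 Evaluation Metrics 14
Chapter 4. Simulation Study 17
4.1 Simulation Design 17
4.2 Simulated Data 19
4.3 Simulation Results 20
Chapter 5. Real Data Analysis 33
5.1 Datasets 33
5.2 Clustering Results 34
Chapter 6. Conclusion and Discussion 43
6.1 Clustering Methods 43
6.2 Dissimilarity Measures 44
References 45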
dc.language.iso: zh_TW
dc.subject [zh_TW, translated]: Distance methods
dc.subject [zh_TW, translated]: Weighting strategy
dc.subject [zh_TW, translated]: Mixed-type data
dc.subject [zh_TW, translated]: Clustering
dc.subject [en]: Clustering
dc.subject [en]: Mixed Data
dc.subject [en]: Weighted Strategy
dc.subject [en]: Dissimilarity Measures
dc.title [zh_TW, translated]: Clustering Mixed-type Data with a Weighted Distance Method and Evaluating the Performance of Some Distance Measures for Categorical Data
dc.title [en]: Clustering Mixed-type Data Using A Weighted Dissimilarity Measure and An Evaluation of Some Dissimilarity Measures for Categorical Data
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee [zh_TW]: 薛慧敏; 邱春火
dc.contributor.oralexamcommittee [en]: Huei-Min Hsueh; Chun-Huo Chiu
dc.subject.keyword [zh_TW, translated]: Mixed-type data, clustering, weighting strategy, distance methods
dc.subject.keyword [en]: Mixed Data, Clustering, Weighted Strategy, Dissimilarity Measures
dc.relation.page: 48
dc.identifier.doi: 10.6342/NTU202301617
dc.rights.note: Not authorized
dc.date.accepted: 2023-08-09
dc.contributor.author-college: Center for General Education
dc.contributor.author-dept: Master's Program in Statistics
Appears in collections: Master's Program in Statistics

Files in this item:
File: ntu-111-2.pdf (not authorized for public access)
Size: 1.1 MB
Format: Adobe PDF