Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55668
Title: | 利用非參數型接收者操作特徵曲線建構統計分類樹之研究與應用 Study of A Statistical Classifier Using Nonparametric Receiver Operating Characteristic Curve |
Author: | Fu-Hao Chang 張富皓 |
Advisor: | 陳正剛 |
Keywords: | classification tree, receiver operating characteristic curve (ROC), nonparametric AUC, Fisher linear discriminant (FLD), relative weight (RW) index |
Year of publication: | 2014 |
Degree: | Master's |
Abstract: | The Classification and Regression Tree (CART) is a widely used classification tree built as a hierarchy of decision nodes. It typically uses the Gini index as its splitting criterion, and each decision node is split into exactly two child nodes. Over-fitting is a long-standing weakness of CART. The Multi-Layer classifier (multi-layer discriminant analysis) is another type of classification tree, in which each layer may contain two or three nodes: one undetermined node holding the data not yet classified, and up to two clearly classified nodes. The tree grows by splitting the undetermined node on other attributes to form a new layer, until a stopping criterion is reached. Conventional classification trees look only for attributes that separate both classes at once, so they often overlook attributes that can distinguish a single class from the rest. The Multi-Layer classifier is more powerful because it can find and use such attributes.
This research uses the receiver operating characteristic (ROC) curve to evaluate an attribute's classification capability, by testing the area under the curve (AUC) over a specified range. The usual parametric test assumes that the data are normally distributed, fits a parametric ROC curve, and then computes its AUC and test statistic. Because data do not necessarily follow a normal distribution, this research instead uses the nonparametric ROC curve (NP-ROC) for the hypothesis test. The attribute whose classification capability is most significant is selected for the tree node, and the performance of the candidate cut points then determines whether the node is split into two or three child nodes.
The proposed algorithm also considers linear combinations of attributes. Attributes are selected with a general relative weight (RW) index instead of the usual greedy selection criterion, which makes it possible to pick the attributes that matter most within a linear combination. Fisher linear discriminant (FLD) analysis is then applied to the selected attributes to create a new attribute. We refer to this way of selecting a linear combination of attributes as the FLD/RW method.
To verify the model, we use both simulated cases and a real tumor classification case, comparing the original method with the Multi-Layer classifier built with NP-ROC and FLD/RW. The results show that the NP-ROC-based classifier classifies the data more efficiently, that over-fitting is substantially reduced, and that its discriminating capability is superior to that of the conventional methods. |
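As a rough illustration of two ingredients named in the abstract, the sketch below computes a nonparametric (empirical) AUC as the Mann-Whitney pair-counting statistic and a two-feature Fisher linear discriminant direction. It is not the thesis's algorithm: it uses the full AUC rather than a test over a restricted range, performs no significance test, and omits the RW selection step. The function names (`empirical_auc`, `fisher_direction_2d`, `project`) and the toy data are assumptions made for illustration only.

```python
from itertools import product


def empirical_auc(pos, neg):
    """Nonparametric AUC: share of (pos, neg) pairs with pos > neg, ties counted as 0.5."""
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


def fisher_direction_2d(class_a, class_b):
    """Fisher direction w = Sw^{-1} (mean_a - mean_b) for two features."""
    def mean(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    def scatter(rows, m):
        # Within-class scatter matrix: sum of outer products of deviations.
        s = [[0.0, 0.0], [0.0, 0.0]]
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # Closed-form 2x2 inverse applied to the mean difference.
    return [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]


def project(rows, w):
    """Create the new one-dimensional attribute w . x for each row."""
    return [r[0] * w[0] + r[1] * w[1] for r in rows]


# Hypothetical toy data: two classes, two attributes each.
class_a = [(2.0, 1.0), (3.0, 2.5), (4.0, 2.0), (5.0, 4.0)]
class_b = [(1.0, 2.0), (2.0, 3.5), (3.0, 3.0), (4.0, 5.0)]

# AUC of the first attribute alone.
auc_first = empirical_auc([r[0] for r in class_a], [r[0] for r in class_b])

# AUC of the FLD-combined attribute.
w = fisher_direction_2d(class_a, class_b)
auc_combined = empirical_auc(project(class_a, w), project(class_b, w))

print(auc_first, auc_combined)
```

On this toy data the first attribute alone separates the classes only partially (pairwise AUC of 0.71875), while the FLD projection separates them completely (AUC of 1.0), which is the kind of gain that motivates considering linear combinations of attributes.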
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/55668 |
Full-text license: | Paid license |
Appears in collections: | Institute of Industrial Engineering |
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-103-1.pdf (not currently authorized for public access) | 4.29 MB | Adobe PDF |
All items in this system are protected by copyright, with all rights reserved, unless their copyright terms specifically state otherwise.