Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63986
Title: | 以分層拔靴抽樣法改善分類樹之切割能力 (Improving Selected split by Stratified Bootstrap Methods) |
Author: | Po-Hsun Wang (王柏勛) |
Advisor: | Argon Chen (陳正剛) |
Keywords: | Classification tree, Classification and Regression Trees, Density of sample distribution, Stratified sampling, Bootstrap method |
Publication year: | 2012 |
Degree: | Master's |
Abstract: | Traditional classification trees such as Classification and Regression Trees (CART) are generally believed to classify certain types of data efficiently. In practice, the child nodes produced depend on how the split criterion is computed, and a poorly chosen split does not always classify efficiently. An unsuitable split causes problems such as sample size depletion and overfitting, and its effect accumulates: without enough samples, the splits and attribute selections at lower levels of the tree become extremely unreliable.
To improve CART performance, this research uses the common Variation Reduction criterion to select the split that divides a node into two child nodes, and proposes a new method to improve split selection. Before searching for a split, the original sample is stratified into multiple sub-samples, and the bootstrap method is used to resample the instances within each stratum, producing multiple new samples. A split is selected in each bootstrap sample by the Variation Reduction criterion, and the mean of these splits is taken as the "Stratified Bootstrap split". Resampling compensates for the lack of information near the split, and averaging over many samples makes the split more stable for certain types of sample distribution, which reduces incorrect splits and incorrect attribute selection.
The simulation results show, however, that the "Original split" and the "Stratified Bootstrap split" perform differently under different data distributions and complement each other. The most important factor is the density of the sample distribution, specifically the density of data near the split relative to the density elsewhere. We therefore propose a rule that decides when the original split or the resampled split is more suitable, and use it to construct weights that combine the two into a "Weighted split". The weighted split can be applied under various data distributions and, in comparison, has the lowest variability and is the most stable choice; statistical tests confirm that it is significantly better than either of the other two splits. Examples illustrate the proposed method throughout the thesis, and a hypothetical tree demonstrates how the weighted split improves CART performance. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/63986 |
Full-text license: | Paid license |
Appears in collections: | Graduate Institute of Industrial Engineering |
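The procedure described in the abstract (stratify, bootstrap within each stratum, select a split by variation reduction, average the splits) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation. Several details are assumptions, because the record does not specify them: strata are taken as contiguous, roughly equal-count blocks of the sorted attribute; variation reduction is computed as the reduction in the sum of squared deviations of the response; and the weight `w` in `weighted_split` is passed in by hand, since the thesis's density-based weighting rule is not given here. All function and parameter names are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Threshold on x with the largest variation reduction.

    Maximizing variation reduction is equivalent to minimizing
    SSE(left) + SSE(right). Candidates are midpoints between
    consecutive distinct x values. Returns None if x is constant.
    """
    pairs = sorted(zip(xs, ys))
    best_t, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_cost = cost
    return best_t


def stratified_bootstrap_split(xs, ys, n_strata=4, n_boot=200, seed=0):
    """Mean of the best splits found on stratified bootstrap samples."""
    rng = random.Random(seed)
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    # Assumption: strata are contiguous, roughly equal-count blocks of sorted x.
    bounds = [round(k * n / n_strata) for k in range(n_strata + 1)]
    strata = [pairs[bounds[k]:bounds[k + 1]] for k in range(n_strata)]
    strata = [s for s in strata if s]
    splits = []
    for _ in range(n_boot):
        sample = []
        for s in strata:
            # Resample with replacement inside each stratum, keeping its size.
            sample.extend(rng.choices(s, k=len(s)))
        t = best_split([x for x, _ in sample], [y for _, y in sample])
        if t is not None:
            splits.append(t)
    return sum(splits) / len(splits) if splits else None


def weighted_split(original, bootstrap, w):
    """Convex combination of the two splits; w is the weight on the original.

    Assumption: w is supplied by the caller. The thesis derives the weights
    from a density-based rule that is not reproduced in this sketch.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * original + (1.0 - w) * bootstrap


# Toy data: one numeric attribute, a binary response that switches at x = 20.
xs = list(range(40))
ys = [0 if x < 20 else 1 for x in xs]
original = best_split(xs, ys)
bootstrap = stratified_bootstrap_split(xs, ys)
print(original, bootstrap, weighted_split(original, bootstrap, w=0.5))
```

Fixing the seed makes the bootstrap split reproducible; in practice `n_strata`, `n_boot`, and `w` would need to be chosen to suit the data.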
Files in this item:
File | Size | Format | |
---|---|---|---|
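ntu-101-1.pdf (not currently authorized for public access) | 2.25 MB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.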