Solving large-scale classification problem with approximate high dimensional indexing framework

Chun-Fu Chang; 張淳富

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65286

Title:	Solving large-scale classification problem with approximate high dimensional indexing framework (利用高維索引技術解決大規模分類問題)
Author:	Chun-Fu Chang (張淳富)
Advisor:	孫雅麗
Keywords:	high dimensional indexing, ramp loss, support vector machine, machine learning
Publication year:	2012
Degree:	Master's
Abstract:	In recent years, linear classifiers have been shown to handle large-scale classification problems well. Two important issues nevertheless remain unresolved. First, a large share of real-world data cannot be explained by a linear classifier, and letting such noisy instances influence training degrades the classifier's performance. Second, for most users main memory is far smaller than disk, so the data often does not fit in memory; the classifier must then repeatedly read from and write to disk, which is very time-consuming. This thesis proposes an indexing framework, applied within the optimization procedure of a linear classifier, that addresses both issues simultaneously. Each instance is represented as a high-dimensional feature vector, and we index the instances with an approximate high-dimensional indexing technique. The index lets us efficiently retrieve the informative instances while ignoring those that harm the classifier, so only the informative instances need to be loaded into memory. We conduct several experiments comparing our framework with state-of-the-art methods, and the results show that our framework performs better.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65286
Full-text license:	Paid license
Appears in department:	Department of Information Management
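
The keywords name ramp loss, and the abstract describes noisy instances that hurt a linear classifier. As a rough illustration of why a ramp loss limits that harm, the sketch below compares it with the standard hinge loss. This is the general textbook form, not necessarily the formulation used in the thesis: the clipping parameter `s = -1.0`, the function names, and the `is_informative` rule are assumptions made for this example, and the indexing step itself is not shown.

```python
from typing import Sequence


def margin(w: Sequence[float], x: Sequence[float], y: int) -> float:
    """Signed margin y * <w, x> of one instance under a linear classifier."""
    return y * sum(wi * xi for wi, xi in zip(w, x))


def hinge_loss(m: float) -> float:
    """Hinge loss: grows without bound as the margin becomes more negative."""
    return max(0.0, 1.0 - m)


def ramp_loss(m: float, s: float = -1.0) -> float:
    """Ramp loss: hinge loss clipped at 1 - s (with s < 1).

    The value s = -1.0 is an assumed default for this sketch.
    """
    return min(1.0 - s, max(0.0, 1.0 - m))


def is_informative(m: float, s: float = -1.0) -> bool:
    """True when the ramp loss has a non-zero slope at this margin.

    Instances with margin >= 1 are already classified with room to spare,
    and instances with margin <= s sit on the flat, clipped part of the loss.
    """
    return s < m < 1.0


if __name__ == "__main__":
    w = [1.0, -1.0]
    # (features, label) pairs; the last one is a badly misclassified outlier.
    data = [([3.0, 1.0], 1), ([1.0, 0.5], 1), ([0.5, 1.0], 1), ([-2.0, 3.0], 1)]
    for x, y in data:
        m = margin(w, x, y)
        print(m, hinge_loss(m), ramp_loss(m), is_informative(m))
```

Under these assumptions, an instance far on the wrong side of the hyperplane contributes a bounded loss and no gradient, so only instances with margins between `s` and 1 need to be fetched during optimization. That is one plausible reading of "informative instances" in the abstract, not a statement of the thesis's actual selection rule.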

Files in this item:

File	Size	Format
ntu-101-1.pdf (not currently authorized for public access)	2.85 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
