To cite this item, please use this Handle URI:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65286

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 孫雅麗 | |
| dc.contributor.author | Chun-Fu Chang | en |
| dc.contributor.author | 張淳富 | zh_TW |
| dc.date.accessioned | 2021-06-16T23:34:37Z | - |
| dc.date.available | 2015-07-27 | |
| dc.date.copyright | 2012-07-27 | |
| dc.date.issued | 2012 | |
| dc.date.submitted | 2012-07-27 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65286 | - |
| dc.description.abstract | In recent years, linear classifiers have developed rapidly and performed well on large-scale classification problems. In practice, however, two important issues remain unresolved. First, a large portion of real-world data cannot be explained by a linear classifier, and if the classifier takes such noise into account, its performance suffers. Second, for ordinary users memory is far smaller than disk, so data often cannot fit in memory; the linear classifier must then repeatedly read from and write to disk, which is very time-consuming. This thesis therefore proposes an indexing framework, applied within the linear classifier's optimization process, that addresses both issues at once. Each instance is represented as a high-dimensional feature vector; we index these instances with an approximate high-dimensional indexing technique so that informative instances can be retrieved efficiently while instances harmful to the classifier are ignored. As a result, only the informative instances need to be loaded into memory, solving both issues simultaneously. We conduct several experiments comparing our framework with other state-of-the-art methods, and the results show that ours performs better. | zh_TW |
| dc.description.abstract | Recently, linear classifiers have been shown to handle large-scale classification problems well. However, two main issues accompany large-scale classification. First, datasets may contain many unexplainable or noisy instances, which hurt the linear classifier's performance. Second, when the data are too large to fit in memory, the linear classifier spends much time reading and writing between memory and disk. In this thesis, we propose an indexing optimization framework that solves these two issues simultaneously. We apply an approximate indexing technique to the high-dimensional feature space so that informative instances, rather than outliers, can be retrieved efficiently, and only those instances need to be loaded into memory. We conduct several experiments comparing our framework with state-of-the-art methods, and the results show that it performs better. (Two illustrative sketches of these ideas follow the metadata table below.) | en |
| dc.description.provenance | Made available in DSpace on 2021-06-16T23:34:37Z (GMT). No. of bitstreams: 1 ntu-101-R99725033-1.pdf: 2917753 bytes, checksum: b04c516c3086afb0d56a20c324728575 (MD5) Previous issue date: 2012 | en |
| dc.description.tableofcontents | Table of Contents
Oral Defense Committee Certification
Acknowledgements
Chinese Abstract
Thesis Abstract
Table of Contents
Figure List
Table List
Chapter 1. Introduction
Chapter 2. Preliminary Review
2.1 Linear Classifier with Outlier Detection
2.2 Approximate High Dimensional Indexing
Chapter 3. Methodology
3.1 Solving Ramp Loss
3.2 Tree-based Indexing
3.3 Framework to Solve Primal and Dual Problems
3.3.1 Primal Problem
3.3.2 Dual Problem
Chapter 4. Related Methods
4.1 Online Learning
4.2 Active Learning
4.3 Block Minimization
Chapter 5. Experiments
5.1 Datasets and Environment
5.2 Performance on Primal Problem
5.3 Performance on Dual Problem
5.4 Performance on Limited Memory
5.5 Indexing Property
Chapter 6. Discussion and Limitation
Chapter 7. Conclusion
References | |
| dc.language.iso | zh-TW | |
| dc.subject | 支持向量機 | zh_TW |
| dc.subject | 機器學習 | zh_TW |
| dc.subject | Ramp loss | zh_TW |
| dc.subject | 高維索引 | zh_TW |
| dc.subject | high dimensional indexing | en |
| dc.subject | ramp loss | en |
| dc.subject | support vector machine | en |
| dc.subject | machine learning | en |
| dc.title | 利用高維索引技術解決大規模分類問題 | zh_TW |
| dc.title | Solving large-scale classification problem with approximate high dimensional indexing framework | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 100-2 | |
| dc.description.degree | 碩士 (Master) | |
| dc.contributor.oralexamcommittee | 陳孟彰,陳建錦,彭文志 | |
| dc.subject.keyword | 高維索引, Ramp loss, 支持向量機, 機器學習 | zh_TW |
| dc.subject.keyword | high dimensional indexing, ramp loss, support vector machine, machine learning | en |
| dc.relation.page | 36 | |
| dc.rights.note | 有償授權 (paid authorization) | |
| dc.date.accepted | 2012-07-27 | |
| dc.contributor.author-college | 管理學院 (College of Management) | zh_TW |
| dc.contributor.author-dept | 資訊管理學研究所 (Graduate Institute of Information Management) | zh_TW |
| Appears in Collections: | 資訊管理學系 (Department of Information Management) | |
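The abstract above describes two mechanisms: ignoring instances a linear model cannot explain (the "Solving Ramp Loss" chapter in the table of contents) and training on only the informative instances retrieved through an index. The sketch below is a minimal, hypothetical numpy illustration written for this record, not code from the thesis: it trains a linear classifier by subgradient descent on the ramp loss and finds the informative band by a full scan, whereas the thesis would retrieve that band through its approximate high-dimensional index. The function name `train_ramp_svm` and the parameters `s`, `lam`, `lr` are illustrative assumptions.

```python
# Hypothetical sketch (not the thesis's code): subgradient descent on the
# regularized ramp loss  R_s(z) = min(1 - s, max(0, 1 - z)),  z = y * (w . x).
# The ramp is flat for z <= s, so badly misclassified points (likely noise)
# contribute no gradient and cannot drag the hyperplane around.
import numpy as np

def train_ramp_svm(X, y, s=-1.0, lam=1e-3, epochs=50, lr=0.1):
    """X: (n, d) features; y: (n,) labels in {-1, +1}; s: ramp clipping point."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)                  # z_i = y_i * (w . x_i)
        # Informative band: the only region where the ramp loss has slope.
        #   z >= 1 -> confidently correct, gradient 0
        #   z <= s -> treated as an outlier, gradient 0
        active = (margins < 1.0) & (margins > s)
        # In the framework the abstract describes, `active` would be
        # fetched via an approximate high-dimensional index, so the rest
        # of the dataset never has to be loaded into memory.
        grad = lam * w - X[active].T @ y[active] / n
        w -= lr * grad
    return w

# Toy usage: two Gaussian blobs plus a few flipped labels as outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (100, 2)), rng.normal(-2, 1, (100, 2))])
y = np.array([+1] * 100 + [-1] * 100)
y[:5] = -1                                     # injected label noise
w = train_ramp_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```

Because the flipped points quickly fall below the clipping margin `s`, they drop out of `active` and stop influencing `w`; with a plain hinge loss they would keep pulling on the hyperplane at every epoch.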
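The memory-side claim, that an approximate high-dimensional index lets the learner touch only a small relevant subset of a disk-resident dataset, can be illustrated the same way. The table of contents mentions tree-based indexing; the sketch below instead uses signed random projections (a classic locality-sensitive-hashing family for cosine similarity) purely as a stand-in, so the class name `RandomProjectionIndex` and its parameters are assumptions, not the thesis's data structure.

```python
# Hypothetical sketch: bucket instances by the sign pattern of a few
# random projections; angularly similar vectors tend to share a bucket,
# so a query touches one bucket instead of scanning all n instances.
import numpy as np
from collections import defaultdict

class RandomProjectionIndex:
    def __init__(self, X, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, X.shape[1]))  # random hyperplanes
        self.buckets = defaultdict(list)
        for i, key in enumerate(self._keys(X)):
            self.buckets[key].append(i)

    def _keys(self, X):
        bits = (X @ self.planes.T) > 0          # (n, n_bits) sign pattern
        return [tuple(row) for row in bits]

    def query(self, q):
        """Indices of instances whose sign pattern matches the query's."""
        return self.buckets.get(self._keys(q[None, :])[0], [])

# Usage: candidates similar to instance 42 come from a single bucket.
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 50))
index = RandomProjectionIndex(X)
candidates = index.query(X[42])
print(len(candidates), 42 in candidates)        # 42 always finds itself
```

With `n_bits` projections the data splits into at most `2**n_bits` buckets, so a lookup inspects roughly `n / 2**n_bits` instances on average; in the framework the abstract describes, only the retrieved candidates would need to be resident in memory while the optimizer runs.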
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-101-1.pdf (restricted; not publicly available) | 2.85 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
