NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51029
Full metadata record
DC field: value (language)
dc.contributor.advisor: 劉長遠 (Cheng-Yuan Liou)
dc.contributor.author: Jian Pan (en)
dc.contributor.author: 潘健 (zh_TW)
dc.date.accessioned: 2021-06-15T13:24:05Z
dc.date.available: 2021-07-06
dc.date.copyright: 2016-07-06
dc.date.issued: 2016
dc.date.submitted: 2016-06-21
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51029
dc.description.abstract: Because of its high-frequency access pattern, news has become a field pursued by the major Internet companies in mainland China, and news categorization has long been a central topic in automated news processing. Many supervised learning methods can be applied in this area; among them, support vector machines perform especially well on discrete feature spaces. This thesis proposes the bi-perceptron idea to solve the basic binary classification problem, hoping to match or even exceed the performance of support vector machines in some respects.
Bi-perceptron learning embodies a divide-and-conquer idea; this thesis proposes the idea and implements a basic solution based on it. We recast classification as three steps: data partition, base classification, and classification aggregation, and compare different partition and aggregation methods. In addition, starting from the basic processing pipeline for Chinese web news, we analyze how the word segmentation method, the number of extracted keywords, the regularization method of the base classifiers, and the number of data partitions affect classification results. Finally, we present a bi-perceptron learning method that performs well in both time and space. (zh_TW)
dc.description.abstract: Mobile news, owing to its naturally high access frequency, has become a popular area pursued by many commercial companies in China. News categorization is an important technology in automated news processing. Many supervised learning methods can be applied in this area, among which the Support Vector Machine (SVM) achieves state-of-the-art performance with discrete features. This thesis proposes the idea of bi-perceptron learning to solve the binary classification problem, in the hope of achieving results comparable to or even better than SVM.
Bi-perceptron learning is a divide-and-conquer idea. We propose this idea and implement a basic approach to it. We divide the classification problem into three steps: data partition, base classification, and aggregation, and compare different partition and aggregation methods. Moreover, we analyze the effect of the word segmentation method, the number of keywords, the regularization of the base classifiers, and the number of partitions on categorization performance. Finally, we present a bi-perceptron approach that performs well in both time and memory consumption. (en)
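The abstract compresses the whole method into one sentence, so a concrete sketch may help. The following is a minimal, hypothetical rendering of the three steps it names (data partition, base classification, aggregation): a k-means partitioner, one logistic-regression base learner per partition, and aggregation by routing each sample to the classifier of its nearest partition. All of these choices, and the names BiPerceptron and n_partitions, are illustrative assumptions rather than the thesis's actual implementation.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    class BiPerceptron:
        """Hypothetical divide-and-conquer binary classifier sketching the
        abstract's pipeline: partition, base classification, aggregation."""

        def __init__(self, n_partitions=4):
            self.partitioner = KMeans(n_clusters=n_partitions, n_init=10)
            self.bases = {}

        def fit(self, X, y):
            parts = self.partitioner.fit_predict(X)   # step 1: data partition
            for p in np.unique(parts):                # step 2: one base classifier per partition
                mask = parts == p
                if len(np.unique(y[mask])) > 1:
                    self.bases[p] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
                else:
                    self.bases[p] = int(y[mask][0])   # degenerate single-class partition
            return self

        def predict(self, X):
            parts = self.partitioner.predict(X)       # step 3: aggregate by routing each
            preds = np.empty(len(X), dtype=int)       # sample to its partition's classifier
            for i, p in enumerate(parts):
                base = self.bases[p]
                preds[i] = base if isinstance(base, int) else base.predict(X[i:i + 1])[0]
            return preds

Under these assumptions, sweeping n_partitions in BiPerceptron(n_partitions=k).fit(X_train, y_train) against a single SVM baseline mirrors the abstract's analysis of how the number of data partitions affects categorization performance.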
dc.description.provenance: Made available in DSpace on 2021-06-15T13:24:05Z (GMT). No. of bitstreams: 1; ntu-105-R03922141-1.pdf: 2756611 bytes, checksum: c162f72718680c858f482112c451bce8 (MD5). Previous issue date: 2016 (en)
dc.description.tableofcontents:
口試委員會審定書 (oral examination committee certification)
摘要 (Chinese abstract)
Abstract
Table of Contents
Table of Tables
Table of Figures
1 Introduction
1.1 Related work
2 Web news process methods
2.1 Data set
2.2 Preprocess
2.3 News representation
2.4 Feature selection
3 Base classifiers
3.1 Logistic regression
3.2 Functional margin and geometrical margin
3.3 Maximum margin classifier
4 Bi-perceptron learning algorithm
4.1 Fundamentals
4.2 Some approaches
4.3 Data partition
4.4 Base classifier
4.5 Aggregating methods
5 Experiments and results
5.1 Experimental setup
5.2 Evaluations
5.3 Result analysis
6 Conclusion and future work
Reference
Appendix
dc.language.iso: en
dc.title: Bi-perceptron 分類中文網頁新聞 (Bi-perceptron Classification of Chinese Web News) (zh_TW)
dc.title: Bi-perceptron for Chinese Web News Categorization (en)
dc.type: Thesis
dc.date.schoolyear: 104-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 呂育道 (Yuh-Dauh Lyuu), 劉俊緯 (Jiun-Wei Liou)
dc.subject.keyword: bi-perceptron 学习法, 中文网页新闻分类, 文本分类, 监督学习 (zh_TW)
dc.subject.keyword: bi-perceptron learning, Chinese web news categorization, text classification, supervised learning (en)
dc.relation.page: 72
dc.identifier.doi: 10.6342/NTU201600400
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2016-06-21
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) (zh_TW)
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: ntu-105-1.pdf (currently not authorized for public access)
Size: 2.69 MB
Format: Adobe PDF