Annotating Multiple Types of Biomedical Entities Using Support Vector Machines

Chih Lee; 李遲

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38699

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳信希(Hsin-Hsi Chen)
dc.contributor.author	Chih Lee	en
dc.contributor.author	李遲	zh_TW
dc.date.accessioned	2021-06-13T16:42:34Z	-
dc.date.available	2005-07-19
dc.date.copyright	2005-07-19
dc.date.issued	2005
dc.date.submitted	2005-07-01
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38699	-
dc.description.abstract	Named entity recognition is a fundamental task in biomedical text mining, and annotating entities of multiple classes is more complicated and challenging than annotating a single class. In this thesis, we present a single-word classification approach to the multiple-class entity annotation problem using Support Vector Machines (SVMs): each token in a sentence is represented by a feature vector and classified as one of the given classes. Orthographical patterns, morphological patterns, results from existing gene/protein name taggers, context, part-of-speech (POS) tags, tags (class labels) of surrounding tokens, and other information are important features for named entity recognition. In addition, we employ a distinctive way of extracting and utilizing context information. Because of the huge number of non-entity instances (class ‘O’), we cluster the instances of this class into 5 subclasses to accelerate SVM training. We also apply two post-processing techniques: a simple one assisted by a dictionary, and one based on abbreviation extraction. We report the performance of our system under 13 different notions of correctness; its overall F-score ranges from 68.16% to 79.91%, which is comparable to the performance of the top 3 systems in the JNLPBA shared task. Besides the various notions of correctness used in evaluation, we define 5 types of errors and show how frequently our system makes each of them. The error analysis also reveals annotation discrepancies between the training and test corpora. Therefore, researchers approaching biomedical named entity recognition with machine learning algorithms should seek to improve their systems while also being aware of the correctness of the underlying corpus.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T16:42:34Z (GMT). No. of bitstreams: 1 ntu-94-R92922005-1.pdf: 446228 bytes, checksum: d2e907e39c523ec47d1ccf099a2cab94 (MD5) Previous issue date: 2005	en
dc.description.tableofcontents	List of Figures III List of Tables IV 1. Introduction 1 1.1. Named Entity Recognition in Biomedical Literature 1 1.2. Organization of this Thesis 3 2. Training and Test Corpora 5 3. Methods 9 3.1. SVM on Named Entity Recognition 9 3.2. Features 11 3.2.1. Orthographical Patterns 12 3.2.2. Morphological Patterns 12 3.2.3. Information from Gazetteers 15 3.2.4. Context Information 16 3.2.5. Tags of Surrounding Tokens 18 3.2.6. Other Features 21 3.3. Post-Processing Operations 23 3.3.1. Dictionary-Assisted Post-Processing 23 3.3.2. Post-Processing via Abbreviation Extraction 24 4. Results and Discussion 26 4.1. Evaluation Criteria and Metrics 26 4.2. Overview of the Results 28 4.3. Results of Each Subset 33 4.4. Comparison with the Participating Systems in the JNLPBA Shared Task 34 4.5. Error Analysis 35 4.5.1. Correct Boundaries but Class Label 36 4.5.2. Excessive/Missing Leading/Ending Tokens in Proposed Named Entities 37 4.5.3. Concatenation of Named Entities 39 4.5.4. Tagging Parentheses 41 4.5.5. Tagging Coordinating Conjunctions 42 4.6. Discussion 44 5. Conclusion and Future Work 46 References 47
dc.language.iso	en
dc.subject	生物資訊	zh_TW
dc.subject	支援向量機	zh_TW
dc.subject	生醫具名實體	zh_TW
dc.subject	自然語言處理	zh_TW
dc.subject	bioinformatics	en
dc.subject	NLP	en
dc.subject	support vector machines	en
dc.subject	biomedical named entity	en
dc.title	使用SVM標記多種生醫具名實體	zh_TW
dc.title	Annotating Multiple Types of Biomedical Entities Using Support Vector Machines	en
dc.type	Thesis
dc.date.schoolyear	93-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	簡立峰(Lee-Feng Chien),曾元顯(Yuen-Hsien Tseng),李御璽(Yue-Shi Lee)
dc.subject.keyword	生物資訊, 自然語言處理, 生醫具名實體, 支援向量機	zh_TW
dc.subject.keyword	bioinformatics, NLP, biomedical named entity, support vector machines	en
dc.relation.page	50
dc.rights.note	Paid authorization
dc.date.accepted	2005-07-01
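dc.contributor.author-college	電機資訊學院 (College of Electrical Engineering and Computer Science)	zh_TW
dc.contributor.author-dept	資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering)	zh_TW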
Appears in Collections:	Department of Computer Science and Information Engineering
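
The abstract describes representing each token in a sentence as a feature vector built from orthographical patterns and surrounding context before classifying it. The following is a minimal sketch of that idea only, not the thesis's actual feature set: the function names, the specific orthographic tests, the one-token context window, and the example sentence are all assumptions chosen for illustration, and the classifier itself (an SVM in the thesis) is left out.

```python
# Illustrative sketch: per-token feature extraction for single-word classification.
# All names and feature choices here are hypothetical, not taken from the thesis.

def orthographic_features(token):
    """Return a few simple orthographic indicators for one token."""
    return {
        "init_cap": token[:1].isupper(),                  # starts with a capital letter
        "all_caps": token.isalpha() and token.isupper(),  # letters only, all upper case
        "has_digit": any(ch.isdigit() for ch in token),   # contains at least one digit
        "has_hyphen": "-" in token,                       # contains a hyphen
    }


def token_features(tokens, i, window=1):
    """Combine orthographic features of tokens[i] with its context words."""
    feats = {f"w0:{name}": value
             for name, value in orthographic_features(tokens[i]).items()}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence are padded with a placeholder word.
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
        feats[f"ctx{offset:+d}={word}"] = True
    return feats


sentence = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Each dictionary in `features` could then be converted to a sparse numeric vector and passed to any token-level classifier, which would assign one class label per token.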

Files in This Item:

File	Size	Format
ntu-94-1.pdf (not authorized for public access)	435.77 kB	Adobe PDF


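Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.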