Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38699
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 陳信希 (Hsin-Hsi Chen) | -
dc.contributor.author | Chih Lee | en
dc.contributor.author | 李遲 | zh_TW
dc.date.accessioned | 2021-06-13T16:42:34Z | -
dc.date.available | 2005-07-19 | -
dc.date.copyright | 2005-07-19 | -
dc.date.issued | 2005 | -
dc.date.submitted | 2005-07-01 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38699 | -
dc.description.abstract | Named entity recognition is a fundamental task in biomedical text mining. Multiple-class entity annotation is more complicated and challenging than single-class entity annotation. In this thesis, we present a single-word classification approach to the multiple-class entity annotation problem using Support Vector Machines (SVMs): each token in a sentence is represented by a feature vector and classified as one of the given classes. Orthographical patterns, morphological patterns, results from existing gene/protein name taggers, context, part-of-speech (POS) tags, tags (class labels) of surrounding tokens, and other information are important features for named entity recognition. In addition, we employ a unique way of extracting and utilizing context information. Because of the huge number of non-entity instances (class 'O'), we cluster the instances of this class into 5 subclasses to accelerate SVM training. We also apply a simple dictionary-assisted post-processing technique and a post-processing technique based on abbreviation extraction.
We report the performance of our system under 13 different notions of correctness. The overall F-score lies between 68.16% and 79.91%, which is comparable to the performance of the top 3 systems in the JNLPBA shared task. In addition to the notions of correctness used in evaluation, we define 5 types of errors and show how frequently our system makes each type of mistake. The error analysis also reveals annotation discrepancies between the training and test corpora. Researchers approaching biomedical named entity recognition with machine learning algorithms should therefore seek to improve their systems while staying aware of the correctness of the underlying corpus. | en
dc.description.provenance | Made available in DSpace on 2021-06-13T16:42:34Z (GMT). No. of bitstreams: 1
ntu-94-R92922005-1.pdf: 446228 bytes, checksum: d2e907e39c523ec47d1ccf099a2cab94 (MD5)
Previous issue date: 2005 | en
dc.description.tableofcontents | List of Figures
List of Tables
1. Introduction
1.1. Named Entity Recognition in Biomedical Literature
1.2. Organization of this Thesis
2. Training and Test Corpora
3. Methods
3.1. SVM on Named Entity Recognition
3.2. Features
3.2.1. Orthographical Patterns
3.2.2. Morphological Patterns
3.2.3. Information from Gazetteers
3.2.4. Context Information
3.2.5. Tags of Surrounding Tokens
3.2.6. Other Features
3.3. Post-Processing Operations
3.3.1. Dictionary-Assisted Post-Processing
3.3.2. Post-Processing via Abbreviation Extraction
4. Results and Discussion
4.1. Evaluation Criteria and Metrics
4.2. Overview of the Results
4.3. Results of Each Subset
4.4. Comparison with the Participating Systems in the JNLPBA Shared Task
4.5. Error Analysis
4.5.1. Correct Boundaries but Class Label
4.5.2. Excessive/Missing Leading/Ending Tokens in Proposed Named Entities
4.5.3. Concatenation of Named Entities
4.5.4. Tagging Parentheses
4.5.5. Tagging Coordinating Conjunctions
4.6. Discussion
5. Conclusion and Future Work
References | -
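dc.language.iso | en | -
dc.subject | 生物資訊 (bioinformatics) | zh_TW
dc.subject | 支援向量機 (support vector machines) | zh_TW
dc.subject | 生醫具名實體 (biomedical named entity) | zh_TW
dc.subject | 自然語言處理 (natural language processing) | zh_TW
dc.subject | bioinformatics | en
dc.subject | NLP | en
dc.subject | support vector machines | en
dc.subject | biomedical named entity | en
dc.title | 使用SVM標記多種生醫具名實體 | zh_TW
dc.title | Annotating Multiple Types of Biomedical Entities Using Support Vector Machines | en
dc.type | Thesis | -
dc.date.schoolyear | 93-2 | -
dc.description.degree | Master's (碩士) | -
dc.contributor.oralexamcommittee | 簡立峰 (Lee-Feng Chien), 曾元顯 (Yuen-Hsien Tseng), 李御璽 (Yue-Shi Lee) | -
dc.subject.keyword | 生物資訊, 自然語言處理, 生醫具名實體, 支援向量機 | zh_TW
dc.subject.keyword | bioinformatics, NLP, biomedical named entity, support vector machines | en
dc.relation.page | 50 | -
dc.rights.note | Paid license (有償授權) | -
dc.date.accepted | 2005-07-01 | -
dc.contributor.author-college | College of Electrical Engineering and Computer Science (電機資訊學院) | zh_TW
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所) | zh_TW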
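
The abstract describes representing each token by a feature vector built from orthographical patterns and context before classifying it. A minimal sketch of what such a token-level representation might look like follows. The feature names, the tiny Greek-letter list, and the one-token context window are illustrative assumptions, not the feature set used in the thesis, and the classifier itself is omitted.

```python
def orthographic_features(token):
    """Return simple orthographic indicators for one token (hypothetical set)."""
    return {
        "init_cap": token[:1].isupper(),
        "all_caps": token.isalpha() and token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        # Assumed mini-list; a real system would use a fuller lexicon.
        "is_greek": token.lower() in {"alpha", "beta", "gamma", "kappa"},
    }


def token_features(tokens, i, window=1):
    """Build a sparse feature dict for tokens[i] from its form and neighbours."""
    feats = {f"w0:{name}": value
             for name, value in orthographic_features(tokens[i]).items()}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence are marked with a padding symbol.
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
        feats[f"w{offset:+d}={word}"] = True
    return feats


sentence = ["IL-2", "gene", "expression"]
features = token_features(sentence, 0)
```

Each dictionary produced this way could be turned into a numeric vector and passed to any multi-class classifier, with one prediction per token.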
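Appears in collections: Department of Computer Science and Information Engineering

Files in this item:
File | Size | Format
ntu-94-1.pdf (not authorized for public access) | 435.77 kB | Adobe PDF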
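
The abstract reports F-scores under several notions of correctness, which is why the overall figure is a range rather than a single number. The sketch below shows how the same predictions can score differently under a strict and a relaxed matching criterion. The two criteria shown (exact match and left-boundary match) and the span encoding are assumptions for illustration; this record does not list the 13 notions used in the thesis.

```python
def f_score(gold, predicted, match):
    """F-score for entity spans given as (start, end, label), end exclusive."""
    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact(p, g):
    """Strict criterion: both boundaries and the class label must agree."""
    return p == g


def left_boundary(p, g):
    """Relaxed criterion: only the start position and the label must agree."""
    return p[0] == g[0] and p[2] == g[2]


gold = [(0, 2, "protein"), (5, 6, "DNA")]
pred = [(0, 2, "protein"), (5, 7, "DNA")]
strict = f_score(gold, pred, exact)           # second span misses on the right
relaxed = f_score(gold, pred, left_boundary)  # both spans count as correct
```

In this toy case the second prediction has one extra trailing token, so it is wrong under the strict criterion and right under the relaxed one.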