Biomedical Named Entity Recognition, Semantic Role Labeling and Their Application to Question Answering

Richard Tzong-Han Tsai; 蔡宗翰

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/31522

Title:	Biomedical Named Entity Recognition, Semantic Role Labeling and Their Application to Question Answering
Author:	Richard Tzong-Han Tsai 蔡宗翰
Advisors:	許聞廉 (Wen-Lian Hsu), 項潔 (Jieh Hsiang)
Keywords:	Biomedical literature mining, natural language processing, named entity recognition, semantic role labeling, question answering, relation extraction, information extraction
Publication year:	2006
Degree:	Doctoral
Abstract:	Processing biomedical literature automatically would be invaluable for both the design and the interpretation of large-scale experiments. To this end, many information extraction (IE) systems using natural language processing (NLP) techniques have been developed for the biomedical field. This dissertation studies two of the most fundamental of these techniques, named entity recognition (NER) and semantic role labeling (SRL), and their application to biomedical question answering (QA).
For NER, adding conjunction features to the model is necessary to improve recognition further, but including every conjunction feature group is infeasible: memory is limited and many of the groups do not help accuracy. We therefore employ sequential forward search to select the most effective combination of feature groups. In addition, the variability of biomedical terms causes data sparseness, and the numerical parts of terms vary especially often, generating many redundant features; we apply numerical normalization to deal with this problem. Furthermore, the tag assigned to a word depends not only on its closest neighbors but possibly also on information beyond the context window. We use automatically generated global patterns to capture such structures and to correct the output of the conditional random field (CRF) tagger. Applying these three techniques in sequence brings the F-score to 72.98%, which is 3.28% higher than the baseline system and also outperforms the state-of-the-art systems.
For SRL, we construct a biomedical semantic role labeling system that can be used to facilitate the extraction of relations specific to the biomedical domain. The construction has three steps. First, following the PropBank annotation scheme developed at the University of Pennsylvania, we build a proposition bank on top of the GENIA treebank provided by the Tsujii laboratory at the University of Tokyo. We annotate the predicate-argument structures (PASs), that is, the semantic frames and semantic roles, of only the thirty most frequently used and most important biomedical verbs. Second, we use this proposition bank to train a biomedical SRL system based on a maximum entropy (ME) model. Third, we automatically generate argument-type templates to improve the classification of biomedical argument roles. Our experiments show that a newswire SRL system achieving an F-score of 86.29% in the newswire domain drops to 64.64% when ported to the biomedical domain. Training on our annotated corpus, BioProp, raises the F-score by 22.46 percentage points, to 87.10%. After template features are added, performance on adjunct arguments such as temporal and locational arguments improves significantly, by a further 1.57%.
Finally, we apply the NER and SRL systems to build a biomedical QA system. Biologists have a pressing need to retrieve information related to their research efficiently. A QA system lets them ask questions conveniently in natural language and retrieves specific answers from a large collection of documents. We introduce our Biomedical Question Answering system (BeQA), which is designed to answer questions related to molecular events. Using the SRL system to label the semantic arguments of questions and candidate answer sentences, and to help map questions to answers, improves both Top-1 accuracy and Top-5 mean reciprocal rank (MRR). In addition, we employ Google as the retrieval module that supplies passages likely to contain answers. The best configuration of BeQA achieves a Top-1 accuracy of 51.9% and a Top-5 MRR of 57.7%, making it the first system in biomedical literature processing to undergo a complete QA performance evaluation.
In future work, we will continue to improve the NER, SRL, and QA systems and apply them to build a relation extraction system for biomedical relations such as protein-protein interactions and gene-disease relations.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/31522
Full-text license:	Paid license
Appears in collections:	Department of Computer Science and Information Engineering
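
The abstract reports BeQA's results as a Top-1 accuracy and a Top-5 MRR. The following is a minimal sketch of how these two metrics are conventionally computed, not the dissertation's evaluation code. The function names, the exact-string matching of answers, and the sample data are all assumptions made for illustration; the record does not say how BeQA judges an answer correct.

```python
from typing import Sequence, Set


def top1_accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]]) -> float:
    """Fraction of questions whose first-ranked answer is a gold answer."""
    if not ranked:
        raise ValueError("at least one question is required")
    hits = sum(
        1
        for answers, correct in zip(ranked, gold)
        if answers and answers[0] in correct
    )
    return hits / len(ranked)


def top_k_mrr(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int = 5) -> float:
    """Mean reciprocal rank, looking only at the first k answers per question.

    A question with no correct answer among its first k contributes 0.
    """
    if not ranked:
        raise ValueError("at least one question is required")
    total = 0.0
    for answers, correct in zip(ranked, gold):
        for rank, answer in enumerate(answers[:k], start=1):
            if answer in correct:
                total += 1.0 / rank
                break  # only the best-ranked correct answer counts
    return total / len(ranked)


# Hypothetical system output for three questions (not data from the dissertation).
ranked_answers = [
    ["IL-2", "NF-kappa B"],        # correct answer at rank 1
    ["p53", "TNF-alpha", "IL-4"],  # correct answer at rank 3
    ["c-Fos", "c-Jun"],            # no correct answer returned
]
gold_answers = [{"IL-2"}, {"IL-4"}, {"STAT3"}]

print(f"Top-1 accuracy: {top1_accuracy(ranked_answers, gold_answers):.3f}")
print(f"Top-5 MRR:      {top_k_mrr(ranked_answers, gold_answers, k=5):.3f}")
```

For this toy data, only the first question is answered at rank 1, so Top-1 accuracy is 1/3, while Top-5 MRR is (1 + 1/3 + 0) / 3 = 4/9. MRR is never lower than Top-1 accuracy because it also gives partial credit to correct answers ranked below first place, which is consistent with the reported 57.7% against 51.9%.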

Files in this document:

File	Size	Format
ntu-95-1.pdf (not currently authorized for public access)	783.47 kB	Adobe PDF

Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.
