Hantology-The Knowledge Structure of Chinese Writing System and Its Application (漢字知識本體-以字為本的知識架構與其應用示例)

Ya-Min Chou; 周亞民

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38797

Title:	漢字知識本體-以字為本的知識架構與其應用示例 (Hantology-The Knowledge Structure of Chinese Writing System and Its Application)
Author:	Ya-Min Chou (周亞民)
Advisor:	Ling-Ling Wu (吳玲玲)
Co-advisor:	Chu-Ren Huang (黃居仁)
Keywords:	Linguistic Ontology, Knowledge Engineering, Writing Systems
Year of publication:	2005
Degree:	Doctoral
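Abstract (translated from Chinese):	Chinese characters are the basic units of the Chinese language and an important resource for natural language processing, so how to represent the knowledge structure of the Chinese writing system is a very important research issue. Until now, however, only a few studies have addressed it. The purpose of this study is to design a framework that can describe the forms, the concepts, and the relations among them in the Chinese writing system. To increase the knowledge-sharing capability of this framework, this study proposes Hantology (漢字知識本體), which abstracts the concepts and relations expressed by Chinese characters and describes them in a formal language. The knowledge represented by Hantology covers orthographic form, pronunciation, character sense, variant relations, variation, and lexical derivation. The orthographic form of the Chinese writing system is ideographic, or word-syllabic. A Chinese character is not only a unit of writing but is often also a word or morpheme. An important property of the Chinese writing system is that orthographic forms and senses are derived from semantic symbols, so the concepts expressed by semantic symbols are the core of the Chinese writing system. This study takes the 540 radicals of ShuoWen (說文) as the basic semantic symbols, analyzes the concepts each radical expresses when it serves as a semantic symbol, and maps them to the Suggested Upper Merged Ontology (SUMO). This lets computers not only process the concepts and relations expressed by semantic symbols but also establish relations among semantic symbols and present the knowledge structure of Chinese semantic symbols, and it allows integration with WordNet and the Academia Sinica Bilingual Ontological WordNet. Character senses are likewise represented with SUMO, which captures their concepts and relations. Senses are divided into original, extended, and phonetic-loan senses, with original senses based on the glosses in ShuoWen. Hantology also describes the relations between pronunciation and sense and how they change, together with the words derived from each sense, thereby representing lexical derivation and word formation. For relations among characters, the research focus is on variants: in the Chinese writing system the same word or morpheme can be written with different orthographic forms, which gives rise to variants. We establish a context for variants, consisting of pronunciation, sense, phonology, word formation, and time, to describe the contexts in which different characters can be used interchangeably. To make the knowledge in Hantology easier to share and use, this study builds a model of Hantology described in the Web Ontology Language (OWL-DL) and integrates it with the General Ontology for Linguistic Description, providing the writing, morphological, and syntactic knowledge that computers need for natural language processing. Finally, we apply Hantology to the problems of missing-character interchange and variant retrieval, and we design and implement a missing-character interchange system and a missing-character interchange language (MCDL). Interchange and retrieval were validated on documents containing missing characters from the Taisho Tripitaka (大正藏) digitized by the Chung-Hwa Institute of Buddhist Studies (中華佛學研究所) and from the Academia Sinica Chinese electronic texts (漢籍電子文獻). The results show that applying Hantology together with the interchange method greatly alleviates the missing-character interchange and variant retrieval problems that arise from using user-defined characters, and also supports synchronic and diachronic document retrieval. The main contributions of this study are as follows: 1. A linguistic ontology that represents the knowledge structure of Chinese characters. Hantology is the first linguistic ontology of an ideographic writing system. We propose a completely different approach to representing the Chinese writing system; it greatly increases the knowledge of the Chinese writing system available to computers, assists natural language processing, and can serve as an interchange standard for Chinese character knowledge, letting different application systems share that knowledge through this framework. 2. A framework for representing variant relations. Variants are an important property of Chinese writing, yet variant relations have long not been adequately represented in computers. This study proposes a framework and a systematic representation that can describe the complex relations among variants. We compare Hantology with other methods of representing variant relations and show its advantages. 3. A method for representing language variation. Language changes continuously, and no linguistic ontology should ignore language variation, yet lexical ontologies such as WordNet and EuroWordNet do not consider how language changes over time. Hantology is the first lexical ontology to represent variation over time. It describes variation in orthographic form, sense, phonology, lexical derivation, and variants, can systematically present the variation of the Chinese writing system, and can serve as a reference for research on other lexical ontologies. 4. A solution to the missing-character and variant retrieval problems. Representing characters and symbols properly is a basic requirement of information processing, yet for decades Chinese computing environments have been unable to meet it and have constantly faced missing-character and variant problems, affecting everyone and every information system that uses Chinese characters. We change how Chinese characters are represented, increase the knowledge of Chinese characters that computers hold, and integrate missing characters with Hantology so that computers have knowledge of missing characters and variants, which solves the missing-character problem and the variant retrieval problem.
Abstract (English):	Chinese characters are fundamental linguistic units and important resources for natural language processing, so how to represent the knowledge structure of the Chinese writing system is a critical research issue. However, only a few studies have addressed this issue so far. 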
The purpose of this study is to design a framework that can describe the knowledge structure of the Chinese writing system. To increase the sharing capability of the framework, this thesis proposes Hantology, a formal, explicit representation of the conceptualization of the Chinese writing system. Hantology describes the orthographic forms, phonological forms, senses, variants, variation, and lexicalization of the Chinese writing system. The orthographic form of the Chinese writing system is ideographic, or word-syllabic. Usually, each Chinese character is not only a writing unit but also a word or morpheme. The most important feature of the Chinese writing system is that orthographic forms and senses are extensions of semantic symbols, so the concepts indicated by semantic symbols form the core of the Chinese writing system. In this study, we use the 540 radicals of ShuoWen as the basic semantic symbols. So that computer systems can process the concepts and relations of semantic symbols, the concepts indicated by each radical are analyzed and mapped to the IEEE Suggested Upper Merged Ontology (SUMO). In addition, adopting SUMO allows Hantology to be integrated with WordNet and the Academia Sinica Bilingual Ontological WordNet (Sinica BOW). The senses of each Chinese character are also represented with SUMO, which captures the concepts and relations among the various senses. The words derived from different senses are recorded to express the morphological context. Because senses depend on pronunciations, Hantology describes the relations between pronunciations and senses. Chinese writing has many variants, which are different orthographic forms of the same word or morpheme. A linguistic context is proposed to describe the relations among variants. 
To make this knowledge easy to share, we build a model expressed in Web Ontology Language-Description Logic (OWL-DL) and integrate it with the General Ontology for Linguistic Description (GOLD) to provide the writing, morphological, and syntactic knowledge needed for natural language processing. Finally, applications of Hantology to the problems of Chinese missing characters and variant retrieval are presented. We propose an interchange framework and the Missing Characters Description Language (MCDL) for describing missing characters. The digitized sutras constructed by CBETA and the Chinese Ancient Corpus of Academia Sinica are used as experimental data. The results show that the missing-character and variant retrieval problems can be solved successfully by Hantology and the interchange framework proposed in this thesis. This thesis makes the following contributions: (1) We propose a linguistic ontology to describe the knowledge structure of the Chinese writing system. Hantology is the first linguistic ontology of an ideographic writing system. We propose a completely different approach to representing the knowledge structure of the Chinese writing system. This approach significantly increases the knowledge of the Chinese writing system available to computer systems and can assist natural language processing. (2) We propose a linguistic context for describing the relations among variants. Variants are an important characteristic of Chinese writing. Unfortunately, the relations among variants have long not been properly represented in computer systems. The results show that this linguistic context is a significant improvement over previous approaches to describing the relations among variants. (3) We propose a framework to describe language variation. Language change is always with us, and no linguistic ontology should ignore it. Hantology is the first linguistic ontology to describe language variation. 
The aspects of variation described by Hantology include orthographic form, pronunciation, sense, lexicalization, and variant relations. This approach can systematically illustrate the development of the Chinese writing system. (4) The missing-character and variant retrieval problems are solved. Properly representing characters and symbols is a basic requirement of any information processing. However, Chinese computer systems have failed to meet this requirement for decades, so users have continually faced missing-character and variant retrieval problems. We change the representation of Hanzi to increase the knowledge available to computers. Integrating missing characters with Hantology solves the missing-character and variant problems.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38797
Full-text license:	Paid license
Appears in Collections:	Department of Information Management

Files in This Item:

File	Size	Format
ntu-94-1.pdf (not authorized for public access)	14.92 MB	Adobe PDF

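Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.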