Using Chinese-English Parallel Corpora for Compiling Bilingual Legal Glossaries
noun phrase extraction, parallel corpora, legal expressions, contrastive analysis in Cross-Strait terminology
|Publication Year :||2016|
|Abstract:||建立雙語詞彙表能有助譯者掌握專門領域翻譯及維持一致性。本文希望提供有效的半自動方法擷取雙語詞組供譯者使用，並有系統的建立編纂雙語專業詞彙機制。首先，本文透過Anymalign及Pialign兩套自然語言處理軟體，從兩岸刑法平行語料庫中取得詞組對應機率較高的漢英雙語候選詞及詞組。鑑軟體的品質不盡完善，故本文著手改善擷取結果，首先處理英文部分，即利用詞性標記軟體自動標記英文語料找出名詞或名詞組。後透過語言規律，將較不符合中英文名詞組組成規律的候選詞組過濾。所得的有效臺灣法律漢英語詞組為1,852個，中國大陸3,782個，其中包含更長單位的漢英名詞組。基於前述結果篩選出漢英詞組單位對照正確的詞組：臺灣有694組，中國大陸852組。另外，本文採用美國的參考語料庫，設下關鍵性字比值LLR ≥ 3.84的門檻，從一般名詞組中篩選出術語。臺灣487個英文名詞／名詞組中，有394個適合為術語；中國大陸的517個名詞／名詞組中則只有418個合適。擷取所得的有效名詞／名詞組經進一步的處理後製作成可比語料，顯示兩岸在法律中文用語或英譯上的異同。
Bilingual glossaries help translators produce accurate and consistent domain-specific translations. This study aims to extract bilingual pairs semi-automatically and to provide a systematic procedure for compiling specialized terms that translators can follow. First, two natural language processing tools, Anymalign and Pialign, were used to extract Chinese-English candidate pairs with higher translation probabilities from parallel corpora of Taiwan’s and Mainland China’s criminal law.
Because the extraction tools are imperfect, this study focuses on improving the quality of the preliminary results, starting with the English side. English words and phrases were automatically assigned part-of-speech labels by the Stanford POS Tagger so that nouns and noun phrases (NPs) could be identified. Linguistic rules were then used to identify and remove candidates whose patterns did not fit a noun or noun phrase. In total, 1,852 English NPs with their Chinese counterparts were judged valid for Taiwan and 3,782 for Mainland China; these results include longer noun phrases. From these candidates, 694 bilingual pairs were identified as correctly aligned for Taiwan and 852 for Mainland China.
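The part-of-speech filtering step can be illustrated with a minimal sketch. It assumes the candidates have already been tagged with Penn Treebank labels (the tag set used by the Stanford POS Tagger) and keeps only those whose tag sequence looks like a noun phrase. The pattern below, along with the names `NP_PATTERN`, `is_noun_phrase`, and `filter_candidates`, is a hypothetical illustration; the abstract does not state the actual rules used in the study.

```python
import re

# Penn Treebank tag fragments: nouns (NN, NNS, NNP, NNPS) and adjectives (JJ, JJR, JJS).
NOUN = r"NN(?:S|P|PS)?"
ADJ = r"JJ(?:R|S)?"

# Assumed pattern (not taken from the study): any number of adjective or noun
# modifiers followed by a head noun, optionally extended by a preposition,
# an optional determiner, and a second phrase of the same shape.
SIMPLE_NP = rf"(?:(?:{ADJ}|{NOUN}) )*{NOUN}"
NP_PATTERN = re.compile(rf"^{SIMPLE_NP}(?: IN (?:DT )?{SIMPLE_NP})?$")


def is_noun_phrase(tagged):
    """Return True if a list of (word, tag) pairs matches the assumed NP pattern."""
    tags = " ".join(tag for _, tag in tagged)
    return bool(NP_PATTERN.match(tags))


def filter_candidates(candidates):
    """Keep only the tagged candidates that look like nouns or noun phrases."""
    return [c for c in candidates if is_noun_phrase(c)]


if __name__ == "__main__":
    candidates = [
        [("criminal", "JJ"), ("responsibility", "NN")],
        [("shall", "MD"), ("be", "VB"), ("punished", "VBN")],
        [("deprivation", "NN"), ("of", "IN"), ("political", "JJ"), ("rights", "NNS")],
    ]
    for kept in filter_candidates(candidates):
        print(" ".join(word for word, _ in kept))
```

Matching on the tag sequence rather than on the words keeps the rule independent of vocabulary, which is why a verb group such as "shall be punished" is rejected while a longer phrase joined by a preposition is retained.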
An American reference corpus was used to separate terms from general noun phrases by their keyness scores. The threshold was set at a log-likelihood ratio (LLR) of at least 3.84. Of the 487 qualified English nouns and noun phrases from Taiwan, 394 were identified as terms; of the 517 from Mainland China, 418 were. The qualified nouns, noun phrases, and terms were then used to produce a comparable list showing the similarities and differences in Chinese legal expressions and their English translations across the Strait.
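As an illustration of the keyness threshold, the sketch below computes a log-likelihood score from the frequency of an item in a study corpus and in a reference corpus, using the two-corpus formulation that is common in corpus linguistics. The value 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom. The function names, and the extra check that the item is relatively more frequent in the study corpus, are assumptions made for this example; the abstract does not describe how the scores were actually computed.

```python
import math

LLR_THRESHOLD = 3.84  # chi-square critical value, p < 0.05, 1 degree of freedom


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of an item, given its frequency and the corpus size
    for a study corpus and a reference corpus."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    # A zero observed frequency contributes nothing (0 * ln 0 is treated as 0).
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def is_term_candidate(freq_study, size_study, freq_ref, size_ref,
                      threshold=LLR_THRESHOLD):
    """Assumed decision rule: the score reaches the threshold AND the item is
    relatively more frequent in the study corpus than in the reference corpus."""
    overused = freq_study / size_study > freq_ref / size_ref
    score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    return overused and score >= threshold


if __name__ == "__main__":
    # Hypothetical counts: 50 hits in a 10,000-word legal corpus,
    # 5 hits in a 100,000-word reference corpus.
    print(round(log_likelihood(50, 10_000, 5, 100_000), 2))
    print(is_term_candidate(50, 10_000, 5, 100_000))
```

The log-likelihood score itself does not indicate direction, so an item that is unusually rare in the legal corpus would also score highly; the relative-frequency check is one simple way to keep only items that are characteristic of the legal text.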
Improving the extraction of English NPs proved much more effective than improving that of Chinese NPs, because Chinese parts of speech become clear only in the context of a sentence. English suffixes, by contrast, are more distinctive, so English nouns and noun phrases could be extracted more effectively by identifying suffixes according to part of speech. The results also show that extracting bilingual pairs is difficult. First, no regular patterns could be identified in the two main groups of Chinese NPs: those that include or exclude 之 (zhi) in Taiwan’s NPs, and those that include or exclude 的 (de) or 之 (zhi) in Mainland China’s NPs. Second, the English translation patterns corresponding to Chinese NPs were not always predictable either.
Filtering terms proved even more difficult: as the length of an NP increases, terms and common words are more likely to be mixed within it.
Legal connotations are not entirely the same in Taiwan’s and Mainland China’s legal systems. If the word units of certain phrases are not extracted completely, the resulting bilingual pairs may appear falsely similar or falsely different.
The methodology and findings presented in this study may be applicable to other language pairs and other domain-specific genres, and may facilitate improvements in the compilation of bilingual term glossaries and in localization in translation.
|Appears in Collections:||翻譯碩士學位學程|