A Semi-Automatic Method for Correcting Errors in Chinese-English Machine Translations of Patent Abstracts

Jiuan-an Hsu; 許隽安

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51836

Title:	A Semi-Automatic Method for Correcting Errors in Chinese-English Machine Translations of Patent Abstracts (以半自動化譯後編輯修正專利摘要中譯英機器翻譯之錯誤)
Author:	Jiuan-an Hsu (許隽安)
Advisor:	Yvonne Tsai (蔡毓芬)
Keywords:	Chinese-English machine translation, patent translation, language divergence, error analysis, MT evaluation, post-editing
Year of publication:	2015
Degree:	Master's
Abstract:	The grammatical structures of Chinese and English differ considerably, and these divergences are reflected in the translation errors that machine translation (MT) systems produce between the two languages. It is therefore necessary to understand the effect that such language divergences have on MT output. Although research on language divergence is plentiful, little of it addresses the Chinese-English language pair. This study therefore analyzes the errors found in machine-translated patent abstracts and, based on the results, proposes a semi-automatic post-editing method that targets a specific group of errors.
In the first part of the study, 115 English machine translations of Chinese patent abstracts were selected, all produced by Google Translate, a statistical machine translation (SMT) system developed for general use. The errors identified in the translations were categorized using a hierarchical classification scheme with five first-level categories: orthographic, morphological, lexical, semantic, and syntactic errors. The distribution of these errors yields insights into the difficulties the SMT system encountered during translation and into shortcomings of its design.
Firstly, tokenization of the source-language (SL) texts proved problematic, because Chinese has no delimiters (such as spaces) between words. Secondly, assigning parts of speech to words in the source sentences was also challenging, given that Chinese has no inflections by which an MT system could identify parts of speech, and that a Chinese word may belong to more than one part-of-speech (POS) category depending on context. Thirdly, because sentence components are ordered differently in Chinese and English, erroneous syntactic orders appeared at high frequency and were related to the length of the source-language sentences.
In the second part of the study, a semi-automatic post-editing method was proposed, targeting semantically incorrect terms in MT output. It consists of three steps: (1) lexical alignment, (2) noun phrase extraction, and (3) substitution of incorrect terms with correct ones. Tested on a set of corresponding English machine translations and human translations, the method was found to increase the Bilingual Evaluation Understudy (BLEU) score of the machine-translated texts. This study provides a descriptive basis for further research and has implications for MT developers and post-editors seeking to improve the quality of Chinese-English machine translation output.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/51836
Full-text license:	Paid license
Appears in collections:	Graduate Program in Translation and Interpretation
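
The abstract names the three post-editing steps but the record gives no implementation details. The following is a minimal illustrative sketch of the third step (term substitution) and of how a BLEU-style score could register the improvement. It is not the thesis's implementation: the glossary, the example sentences, and the function names `substitute_terms` and `simple_bleu` are all hypothetical, the glossary is written by hand rather than produced by alignment and extraction, and `simple_bleu` is a simplified single-reference, bigram-level variant rather than the standard four-gram corpus-level metric.

```python
import math
import re
from collections import Counter


def substitute_terms(mt_text, glossary):
    """Step 3 (sketch): replace incorrect MT terms with preferred terms.

    `glossary` maps a wrong MT rendering to the correct term. Here it is
    hand-written (an assumption); steps 1 and 2 are not reproduced.
    """
    if not glossary:
        return mt_text
    lookup = {wrong.lower(): right for wrong, right in glossary.items()}
    # Longest terms first, so a short term never matches inside a longer one.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in alternatives) + r")\b",
        re.IGNORECASE,
    )
    # A single pass avoids re-editing text that was already substituted.
    return pattern.sub(lambda m: lookup[m.group(0).lower()], mt_text)


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: clipped n-gram precision up to `max_n`, one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


# Invented example data, for illustration only.
mt_output = "the heat sink sheet is fixed on the light emitting two polar body"
reference = "the heat sink is fixed on the light emitting diode"
glossary = {
    "heat sink sheet": "heat sink",
    "light emitting two polar body": "light emitting diode",
}

post_edited = substitute_terms(mt_output, glossary)
before = simple_bleu(mt_output, reference)
after = simple_bleu(post_edited, reference)
print(post_edited)
print(f"score before: {before:.3f}, after: {after:.3f}")
```

In this toy case the substituted sentence matches the reference exactly, so the score rises to its maximum; on real patent abstracts the gain would depend on how many term errors the glossary covers.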

Files in this item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	1.02 MB	Adobe PDF

