Web-based Named Entity Translation Method for Korean-Chinese and Japanese-Chinese Cross-language Information Retrieval

Yu-Chun Wang; 王昱鈞

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/40955

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	顏嗣鈞(Hsu-Chun Yen)
dc.contributor.author	Yu-Chun Wang	en
dc.contributor.author	王昱鈞	zh_TW
dc.date.accessioned	2021-06-14T17:08:40Z	-
dc.date.available	2009-07-30
dc.date.copyright	2008-07-30
dc.date.issued	2008
dc.date.submitted	2008-07-27
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/40955	-
dc.description.abstract	專有名詞翻譯在許多自然語言處理的研究上，例如資訊檢索與機器翻譯等，扮演了重要的角色。於本篇論文中，我們主要著重在將韓文及日文的專有名詞翻譯成中文，用以增進韓–中及日–中跨語言資訊檢索的效能。中文所使用的漢字為一種形意文字，一個音節可以對應到數個不同的漢字，這造成了專有名詞翻譯上的困難。我們提出一種混合的專有名詞翻譯方法，首先整合數個線上的語料庫來擴增雙語辭典的涵蓋率。我們以維基百科的中英日韓版本的跨語言連結為基礎作為一個翻譯的工具。此外，亦使用了 Naver.com 所提供的人物檢索引擎用以查詢人名的中文或英文翻譯。第二種方法為翻譯模板方法，我們的系統能夠自動從網路的語料庫中學習出韓–中、韓–英、日–中、日–英、及英–中的翻譯模板。而後這些模板便可以用以自 Google 搜尋引擎所回傳的網頁文字片段中抓取出相應的中文翻譯。根據實驗結果，在跨語言資訊檢索系統中加入我們的專有名詞翻譯方法後，在平均準確率 (Mean Average Precision, MAP) 上較單用雙語辭典的方法高出了五倍。平均準確率達到 0.3385，而召回率 (Recall) 亦達到 0.7578。我們的方法可以處理中日韓及非中日韓的專有名詞的翻譯，並可有效提升跨語言資訊檢索系統的效能。	zh_TW
dc.description.abstract	Named entity (NE) translation plays an important role in many applications, such as information retrieval and machine translation. In this thesis, we focus on translating NEs from Korean and Japanese into Chinese in order to improve Korean-Chinese and Japanese-Chinese cross-language information retrieval (CLIR). The ideographic nature of Chinese makes NE translation difficult because one syllable may map to several Chinese characters. We propose a hybrid NE translation system. First, we integrate two online databases to extend the coverage of our bilingual dictionaries. We use Wikipedia as a translation tool based on the inter-language links between the Korean/Japanese editions and the Chinese or English editions. We also use Naver.com’s people search engine to find a query name’s Chinese or English translation. The second component of our system learns Korean-Chinese (K-C), Korean-English (K-E), and English-Chinese (E-C) translation patterns from the web. These patterns are then used to extract K-C, K-E, and E-C pairs from Google snippets. We also learn Japanese-Chinese (J-C) and Japanese-English (J-E) translation patterns for translating Japanese NEs. In our experiments, CLIR with this hybrid configuration performed over five times better, in terms of mean average precision (MAP), than a configuration using only the bilingual dictionary. MAP was as high as 0.3385 and recall reached 0.7578. Our method can handle Chinese, Japanese, Korean, and non-CJK NE translation and substantially improves CLIR performance.	en
dc.description.provenance	Made available in DSpace on 2021-06-14T17:08:40Z (GMT). No. of bitstreams: 1 ntu-97-R95921024-1.pdf: 1065933 bytes, checksum: 00fb1898ed0112fa029b825eabeb0567 (MD5) Previous issue date: 2008	en
dc.description.tableofcontents	Acknowledgements; Chinese Abstract; English Abstract; 1 Introduction; 1.1 Motivation; 1.2 Problem Statement; 2 Related Work; 2.1 Information Retrieval; 2.2 Cross-language Information Retrieval; 2.3 Named Entity Translation; 3 Japanese and Korean; 3.1 Common Characteristics of Japanese and Korean; 3.1.1 Agglutination; 3.1.2 Subject-Object-Verb (SOV) language; 3.2 Japanese Language; 3.2.1 Origin of Japanese; 3.2.2 Japanese Vocabulary; 3.2.3 Japanese Writing Systems; 3.2.4 Difficulties in Japanese-Chinese Named Entity Translation for IR; 3.3 Korean Language; 3.3.1 Origin of Korean; 3.3.2 Korean Vocabulary; 3.3.3 Korean Writing Systems; 3.3.4 Difficulties in Korean-Chinese Named Entity Translation for IR; 4 Named Entity Translation Methods; 4.1 Named Entity Candidate Identification; 4.2 Extended Bilingual Dictionaries; 4.2.1 Wikipedia; 4.2.2 Naver People Search Engine; 4.2.3 CNA Name Database; 4.3 Pattern-Based Translation; 4.3.1 Translation Pattern Learning; 4.3.2 Translation Pattern Filtering; 4.3.3 Pattern-Based NE Translation; 5 Cross-language Information Retrieval System; 5.1 Query Processing; 5.2 Query Translation; 5.3 Term Disambiguation; 5.4 Document Indexing and Retrieval Model; 6 Evaluation and Analysis; 6.1 Dataset; 6.2 Evaluation Metrics; 6.3 Configurations; 6.4 Evaluation Results; 6.5 Effectiveness of Extended Dictionaries; 6.6 Effectiveness of Translation Patterns; 6.7 Effectiveness Analysis of NET; 6.8 Error Analysis; 7 Conclusion; References
dc.language.iso	en
dc.subject	跨語言資訊檢索	zh_TW
dc.subject	專有名詞翻譯	zh_TW
dc.subject	模板	zh_TW
dc.subject	中日韓	zh_TW
dc.subject	Named Entity Translation	en
dc.subject	Korean-Chinese	en
dc.subject	Japanese-Chinese	en
dc.subject	pattern	en
dc.subject	Cross-language Information Retrieval	en
dc.title	基於網路語料之專有名詞翻譯方法於中日韓跨語言資訊檢索之應用	zh_TW
dc.title	Web-based Named Entity Translation Method for Korean-Chinese and Japanese-Chinese Cross-language Information Retrieval	en
dc.type	Thesis
dc.date.schoolyear	96-2
dc.description.degree	Master's
dc.contributor.coadvisor	許聞廉(Wen-Lian Hsu)
dc.contributor.oralexamcommittee	劉昭麟(Chao-Lin Liu),蔡宗翰(Richard Tzong-Han Tsai)
dc.subject.keyword	專有名詞翻譯,模板,中日韓,跨語言資訊檢索,	zh_TW
dc.subject.keyword	Named Entity Translation,pattern,Korean-Chinese,Japanese-Chinese,Cross-language Information Retrieval,	en
dc.relation.page	65
dc.rights.note	Paid license
dc.date.accepted	2008-07-29
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電機工程學研究所	zh_TW
Appears in Collections:	Department of Electrical Engineering
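
The abstract in this record reports results as mean average precision (MAP) and recall. As a minimal sketch of how these two metrics are commonly defined, assuming binary relevance judgments and a ranked result list per query, the functions below compute them. The function names, the document IDs, and the per-query form of recall are illustrative assumptions; the record does not state which evaluation tooling or recall aggregation the thesis used.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one query: the mean of precision@k taken at
    each rank k where a relevant document appears, divided by the total
    number of relevant documents (retrieved or not)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """MAP: average precision averaged over every query in qrels.
    A query with no ranked list contributes an average precision of 0."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel)
                for q, rel in qrels.items())
    return total / len(qrels)


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Per-query recall: fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


# Hypothetical toy data: two queries with made-up document IDs.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6", "d7"}}
print(mean_average_precision(runs, qrels))
print(recall(runs["q2"], qrels["q2"]))
```

Dividing by the total number of relevant documents, rather than by the number retrieved, means that a relevant document the system never returns still lowers the score, which is why a missing named entity translation can hurt MAP sharply.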

Files in This Item:

File	Size	Format
ntu-97-1.pdf (not authorized for public access)	1.04 MB	Adobe PDF


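Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.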