Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79254
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 高照明 (Zhao-Ming Gao) |
dc.contributor.author | Ruben Ga-Yuk Tsui | en
dc.contributor.author | 徐嘉煜 | zh_TW
dc.date.accessioned | 2022-11-23T08:56:46Z | -
dc.date.available | 2022-02-16 |
dc.date.available | 2022-11-23T08:56:46Z | -
dc.date.copyright | 2022-02-16 |
dc.date.issued | 2022 |
dc.date.submitted | 2022-02-14 |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79254 | -
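dc.description.abstract | Translation corpora (or parallel corpora) are a special type of text corpus that plays a key role in translation practice, translation studies, and translator education (Bernardini, Stewart, Zanettin, 2003). The advent of statistical machine translation (SMT) systems and, more recently, of increasingly widespread neural machine translation (neural MT) systems has made parallel corpora even more important, because the large amounts of labeled data needed to train mainstream MT systems, that is, the data required for supervised learning, are precisely parallel corpora. Although machine translation has made great strides over the past few years, many translators and translation educators still rely on bilingual concordancers and the parallel corpora behind them. This study offers state-of-the-art methods for three major problems encountered in building a high-performance Chinese-English concordancer: (1) improving the accuracy of word alignment in parallel sentences, (2) improving the accuracy of sentence alignment in document-aligned texts, and (3) finding hidden parallel sentences in comparable corpora. The results show that language models built on the transformer architecture, one of the latest artificial neural network techniques in natural language processing (NLP), can precisely align words and phrases in parallel sentences (that is, minimize the alignment error), helping translators quickly locate the corresponding translation in the target language. In addition, sentence-level transformers can upgrade document-aligned or paragraph-aligned parallel corpora to sentence-aligned corpora and greatly reduce the manual correction needed after automatic sentence alignment. Finally, we demonstrate how to first mine parallel news articles from multilingual news websites and then extract parallel sentences from them. Where explicit links or associations exist between parallel news articles, they are exploited; where they do not, the algorithm developed in this study can compare articles and judge whether they match on the basis of their semantics. | zh_TW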
dc.description.provenance | Made available in DSpace on 2022-11-23T08:56:46Z (GMT). No. of bitstreams: 1. U0001-1102202212174700.pdf: 2473054 bytes, checksum: d9b0ba0a3bac47d6720e51ebe6b6eecf (MD5). Previous issue date: 2022 | en
dc.description.tableofcontents |
Table of Contents
Acknowledgements
Abstract
Abstract (Chinese)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
    1.1 Bilingual Concordancer with Improved Word alignment
    1.2 Defining Concepts
    1.3 Research Questions
    1.4 Significance of this Study
    1.5 Outline of the Thesis
Chapter 2 Literature Review
    2.1 Bilingual Concordancer
    2.2 Parallel corpus construction
    2.3 Sentence Alignment
    2.4 Document alignment
    2.5 Word alignment
Chapter 3 Methods
    3.1 Data Collection
    3.2 Vector Space Methods – Word Embeddings
    3.3 Vector Space Methods – Sentence Embeddings
    3.4 Aligning Sentences from Parallel Texts
    3.5 Mining Parallel Sentences from Comparable News Sources
    3.6 Word alignment
    3.7 Software tools employed in this study
Chapter 4 Results and Applications
    4.1 Summary of Parallel Corpora
    4.2 Findings
    4.3 Applications
Chapter 5 Conclusions
    5.1 Summary of Findings and Implications
    5.2 Limitations and Suggestions for Future Research
References
Appendix 1 Sources of bilingual Chinese-English articles (news and non-news sources)
Appendix 2 Glossary
Appendix 3 List of acronyms, abbreviations and initialisms
Appendix 4 Data collection procedures and scripts
Appendix 5 Document similarity simulation code
dc.language.iso | en |
dc.title | 基於Transformers深度學習模型建造之高效率漢英新聞雙語檢索系統 | zh_TW
dc.title | Applications of Transformers: Constructing a High-Productivity News Translation-focused Bilingual Concordancer | en
dc.date.schoolyear | 110-1 |
dc.description.degree | Master's (碩士) |
dc.contributor.oralexamcommittee | 陳榮彬 (Eric Chen-hua Yu), 白明弘 (Nathan F. Batto) |
dc.subject.keyword | 平行語料庫, 雙語檢索系統, 雙語語料對齊, 句子嵌入, Transformer, BERT, sentence transformer | zh_TW
dc.subject.keyword | parallel corpus, bilingual concordancer, bitext alignment, sentence embeddings, transformer, BERT, sentence transformer | en
dc.relation.page | 90 |
dc.identifier.doi | 10.6342/NTU202200561 |
dc.rights.note | Authorization granted (open access worldwide) |
dc.date.accepted | 2022-02-14 |
dc.contributor.author-college | 文學院 (College of Liberal Arts) | zh_TW
dc.contributor.author-dept | 翻譯碩士學位學程 (Graduate Program in Translation and Interpretation) | zh_TW
Appears in collections: 翻譯碩士學位學程 (Graduate Program in Translation and Interpretation)

Files in this item:
File | Size | Format
U0001-1102202212174700.pdf | 2.42 MB | Adobe PDF


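Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.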
