Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73048
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 謝舒凱 (Shu-Kai Hsieh)
dc.contributor.author: Da-Chen Lian [en]
dc.contributor.author: 連大成 [zh_TW]
dc.date.accessioned: 2021-06-17T07:15:18Z
dc.date.available: 2019-07-17
dc.date.copyright: 2019-07-17
dc.date.issued: 2019
dc.date.submitted: 2019-07-15
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/73048
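dc.description.abstract: As the number of Chinese speakers grows, linguistic variation arises as well, whether from external or internal factors. Although corpora for studying variation in Chinese already exist, these resources are not suited to studying more colloquial registers. Subtitles in videos are one source that reflects informal language. The aim of this thesis is to build a parallel corpus based on movie subtitles and TED Talks subtitles so that scholars can study variation between Taiwan Mandarin and Mainland Mandarin. [translated from zh_TW]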
dc.description.abstract: As the number of Mandarin Chinese speakers continues to increase, variation inevitably emerges because not all speakers reside in one place. This variation can stem from internal factors or from external ones, such as culture or location. While corpora exist that can be used to study Mandarin Chinese variation, these resources offer little insight into more colloquial registers. A source that more reliably reflects everyday speech is subtitles for TV shows, movies, and videos in general: because subtitles are meant to reflect the dialogue heard on screen, they better capture colloquial speech. The goal of this thesis is to create a parallel corpus based on movie subtitles and TED Talks that allows researchers to study language variation between Taiwan Mandarin and Mainland Mandarin. [en]
dc.description.provenance: Made available in DSpace on 2021-06-17T07:15:18Z (GMT). No. of bitstreams: 1. ntu-108-R05142009-1.pdf: 1902023 bytes, checksum: eae52f187ed6035c5b5d0adfd158f1bf (MD5). Previous issue date: 2019 [en]
dc.description.tableofcontents:
Acknowledgements
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Parallel and comparable corpora
1.2 Alignment
1.3 Parallel corpora and subtitles
1.4 Significance
1.5 Organization of this thesis
2 Literature Review
2.1 Chinese corpora
2.2 Language variation
2.2.1 Chinese variation
2.3 OpenSubtitles Corpus
2.4 TED Talks
2.5 Alignment
3 Methodology
3.1 Introduction
3.2 Data collection
3.3 Pre-processing
3.4 Document alignment
3.4.1 Measuring document similarity
3.4.2 Results
3.4.3 Cleaning
3.5 Sentence alignment
3.6 Initial results
3.6.1 Removing poor alignments
3.7 Evaluation
4 Applications
4.1 Corpus-based analysis of language varieties
4.2 Online corpus
4.3 Idioms
4.4 Neural machine translation
5 Conclusion
References
dc.language.iso: en
dc.title: 現代漢語平行語料庫建構及其應用 [zh_TW]
dc.title: Construction and Applications of a Modern Chinese Parallel Corpus [en]
dc.type: Thesis
dc.date.schoolyear: 107-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 高照明 (Zhao-Ming Gao), 呂佳蓉 (Chia-Rung Lu)
dc.subject.keyword: 台灣國語, 大陸國語, 語言變異, 字幕 [zh_TW]
dc.subject.keyword: Taiwan Mandarin, Mainland Mandarin, language variation, subtitles [en]
dc.relation.page: 75
dc.identifier.doi: 10.6342/NTU201901469
dc.rights.note: Paid license
dc.date.accepted: 2019-07-16
dc.contributor.author-college: 文學院 (College of Liberal Arts) [zh_TW]
dc.contributor.author-dept: 語言學研究所 (Graduate Institute of Linguistics) [zh_TW]
Appears in collections: 語言學研究所 (Graduate Institute of Linguistics)

Files in this item:
File: ntu-108-1.pdf
Access: not currently authorized for public access
Size: 1.86 MB
Format: Adobe PDF