Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/5388

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 謝舒凱(Shu-Kai Hsieh) | |
| dc.contributor.author | Tsun-Jui Liu | en |
| dc.contributor.author | 劉純睿 | zh_TW |
| dc.date.accessioned | 2021-05-15T17:57:33Z | - |
| dc.date.available | 2015-02-03 | |
| dc.date.available | 2021-05-15T17:57:33Z | - |
| dc.date.copyright | 2015-02-03 | |
| dc.date.issued | 2014 | |
| dc.date.submitted | 2014-10-21 | |
| dc.identifier.citation | 1. Pierre Magistry. Ptt 批踢踢 as a corpus. Annual meeting of the European Association of Taiwan Studies, EATS 2012 Sonderborg, 2012. 2. Ffaarr. PTT鄉民大百科. 時報出版, 1 edition, 9 2013. 3. Tony McEnery and Andrew Wilson. Corpus linguistics: An introduction. Edinburgh University Press, second edition, 2001. 4. Christopher D Manning and Hinrich Schutze. Foundations of statistical natural language processing. MIT press, 1999. 5. Adam Kilgarriff and Gregory Grefenstette. Introduction to the special issue on the web as corpus. Computational linguistics, 29(3):333-347, 2003. 6. Abbas Jalilvand and Naomie Salim. Sentiment classification using graph based word sense disambiguation. In Advanced Machine Learning Technologies and Applications, pages 351-358. Springer, 2012. 7. Tsun-Jui Liu, Shu-Kai Hsieh, and Laurent Prevot. Observing features of ptt neologisms: A corpus-driven study with n-gram model. In ROCLING. Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taiwan, 2013. 8. Lun-Wei Ku and Hsin-Hsi Chen. Mining opinions from the web: Beyond relevance retrieval. Journal of the American Society for Information Science and Technology, 58(12):1838-1850, 2007. 9. Pei-Yu Lu, Yu-Yun Chang, and Shu-Kai Hsieh. Causing emotion in collocation: An exploratory data analysis. In ROCLING, 2013. 10. Hsi-Yao Su. The multilingual and multi-orthographic taiwan-based internet: Creative uses of writing systems on college-affiliated bbss. Journal of Computer-Mediated Communication, 9(1):0-0, 2003. 11. Antoinette Renouf. Webcorp: providing a renewable data source for corpus linguists. Language and Computers, 48(1):39-58, 2003. 12. Mei-Yu Chen, Hsin-Ni Lin, Chang-An Shih, Yen-Ching Hsu, Pei-Yu Hsu, and Shu-Kai Hsieh. Classifying mood in plurks. In ROCLING, 2010. 13. Yu-Ming Liu. A study for the culture of university student's buzzwords phenomenon on bbs in taiwan. Master's thesis, National Chi Nan University, 2012. 14. Shiv Naresh Shivhare and Saritha Khethawat. Emotion detection from text. arXiv preprint arXiv:1205.4944, 2012. 15. Cedrick Fairon, Kevin Mace, and Hubert Naets. Glossanet 2: a linguistic search engine for rss-based corpora. In Proceedings of the 4th web as corpus workshop (WAC-4), pages 34-39. Citeseer, 2008. 16. Keh-Jiann Chen, Chu-Ren Huang, Li-Ping Chang, and Hui-Li Hsu. Sinica corpus: Design methodology for balanced corpora. Language, 167:176, 1996. 17. Chu-Ren Huang, Adam Kilgarriff, Yiching Wu, Chih-Ming Chiu, Simon Smith, Pavel Rychly, Ming-Hong Bai, and Keh-Jiann Chen. Chinese sketch engine and the extraction of grammatical collocations. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 48-55, 2005. 18. Chu-Ren Huang, Feng-Yi Chen, Keh-Jiann Chen, Zhao-ming Gao, and Kuang-Yu Chen. Sinica treebank: design criteria, annotation guidelines, and on-line interface. In Proceedings of the second workshop on Chinese language processing: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 12, pages 29-37. Association for Computational Linguistics, 2000. 19. Feng-Yi Chen, Pi-Fang Tsai, Keh-Jiann Chen, and Chu-Ren Huang. The construction of sinica treebank. Computational Linguistics and Chinese Language Processing, 4:87-104, 1999. 20. Eric Brill. A simple rule-based part of speech tagger. In Proceedings of the workshop on Speech and Natural Language, pages 112-116. Association for Computational Linguistics, 1992. 21. Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O'Reilly Media, Inc., 2009. 22. Jacob Perkins. Nltk-trainer [software]. Available from: http://nltk-trainer.readthedocs.org/, 2011. 23. Junyi Sun. Jieba [software]. Available from: https://github.com/fxsjy/jieba, 2012. 24. Qian-Xiang Lin and Chia-Hui Chang. 基於特製隱藏式馬可夫模型之中文斷詞研究 (chinese word segmentation using specialized hmm) [in chinese]. 中央大學資訊工程學系學位論文, pages 1-41, 2006. 25. Oliver Christ. The ims corpus workbench technical manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, 1994. 26. Oliver Christ, Bruno M Schulze, Anja Hofmann, and Esther Koenig. The ims corpus workbench: Corpus query processor (cqp): User's manual. University of Stuttgart, 8, 1999. 27. Stefan Evert. The cqp query language tutorial. 2005. 28. Pavel Rychly. A lexicographer-friendly association score. Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, pages 6-9, 2008. 29. David Crystal. Dictionary of linguistics and phonetics, volume 30. John Wiley & Sons, 2011. 30. Adam Kilgarriff and Iztok Kosem. Corpus tools for lexicographers. Granger, S. and M. Paquot (Eds.), 2012:31-55, 2012. 31. Tsun-Jui Liu. Gei ta in taiwanese mandarin: A corpus-based study. Paper presented at the 2014 National Conference on Linguistics, 23-24th May, Tunghai University, Taiwan, May 2014. 32. Maristella Gatto. Web As Corpus: Theory and Practice. A&C Black, 2014. 33. Steven Bedrick, Russell Beckley, Brian Roark, and Richard Sproat. Robust kaomoji detection in twitter. In Proceedings of the Second Workshop on Language in Social Media, pages 56-64. Association for Computational Linguistics, 2012. 34. Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal Rzepka, and Kenji Araki. Cao: A fully automatic emoticon analysis system based on theory of kinesics. Affective Computing, IEEE Transactions on, 1(1):46-59, 2010. 35. Stefan Evert and Andrew Hardie. Twenty-first century corpus workbench: Updating a query architecture for the new millennium. 2011. 36. Sasa Petrovic, Miles Osborne, and Victor Lavrenko. The edinburgh twitter corpus. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 25-26, 2010. 37. Adam Kilgarriff and David Tugwell. Sketching words. Lexicography and Natural Language Processing: A Festschrift in Honour of B. TS Atkins, pages 125-137, 2002. 38. Gilles-Maurice De Schryver. Web for/as corpus: A perspective for the african languages. Nordic Journal of African Studies, 11(2):266-282, 2002. 39. Adam Kilgarriff. Web as corpus. In Proceedings of Corpus Linguistics 2001, pages 342-344. Corpus Linguistics. Readings in a Widening Discipline, 2001. 40. Martin Volk. Using the web as corpus for linguistic research. Catcher of the Meaning. Pajusalu, R., Hennoste, T. (Eds.). Dept. of General Linguistics, 3, 2002. 41. Shu-Kai Hsieh. Evaluating chinese web-as-corpus. Linguistic Corpus and Corpus Linguistics in the Chinese Context, 2014. 42. Yoshihiko Hayashi, Gen'ichiro Kikui, and Seiji Susaki. Titan: A cross-linguistic search engine for the www. In Working Notes of AAAI-97 Spring Symposiums on Cross-Language Text and Speech Retrieval, pages 58-65, 1997. 43. Anke Ludeling, Stefan Evert, and Marco Baroni. Using web data for linguistic purposes. Language and Computers, 59(1):7-24, 2006. 44. William H Fletcher. Concordancing the web: promise and problems, tools and techniques. Language and Computers, 59(1):25-45, 2006. 45. Geoffrey Leech. New resources, or just better old ones? the holy grail of representativeness. Language and Computers, 59(1):133-149, 2006. 46. William H Fletcher. Making the web more useful as a source for linguistic corpora. Language and Computers, 52(1):191-205, 2004. 47. Richard McCreadie, Ian Soboroff, Jimmy Lin, Craig Macdonald, Iadh Ounis, and Dean McCullough. On building a reusable twitter corpus. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 1113-1114. ACM, 2012. 48. Vinci Liu and James R Curran. Web text corpus for natural language processing. In EACL, 2006. 49. Andreas M. Kaplan and Michael Haenlein. Users of the world, unite! the challenges and opportunities of social media. Business Horizons, 53(1):59-68, 2010. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/5388 | - |
| dc.description.abstract | 近年來,語料庫為本與語料庫驅動之研究愈來愈受到關注與重視。在台灣華語中,中央研究院平衡語料庫 (Chen et al., 1996) 以及中文十億詞語料庫 (Huang et al., 2005) 為當今兩個最被廣泛使用的語料庫。然而,這些語料庫並不是完全沒有限制。在語料的部份,這些語料庫大多已經停止更新或尚未更新,也就是說,這些資料庫已經無法完全即時反應當代台灣華語的使用狀況。對於眾多研究者來說,在蒐集新興語料上更產生了一定的程度的難度與不便性。正因如此,本篇論文以 PTT(批踢踢)作為資料來源,試圖建立「批踢踢語料庫」:一個具有自動蒐集、更新、分析及後處理的動態語料庫。除此之外,該語料庫亦會提供一個友善且便利的網路平台,提供研究者作使用。在批踢踢語料庫中,語料的斷詞是透過 Jseg(一個利用中央研究院平衡語料為訓練基礎之中文斷詞器)所達成。而在詞性標註方面,則是採用 Brill Tagger (Brill, 1992) 所使用的演算法,且以中文句結構樹資料庫 (Chen et al., 1999) 中約莫一萬中文句作為訓練的語料。批踢踢語料庫提供了網路介面以供研究者使用,並包含許多根據批踢踢語料所發展出來的應用,其中包括基本的詞語索引器 (Concordancer) 以及搭配詞抽取器 (Collocation extractor),以及其他諸如表情符號偵測器 (Emoticon detector) 與情緒極性分類器 (Sentiment polarity classifier) 等等之應用。最後,本研究之希望在批踢踢語料庫的建置後,在現代台灣華語中能夠針對新興語料的部分作補充與更新,並且提供實質的語料庫工具,以簡化資料蒐集上的繁瑣及能有系統地分析語料,使得研究者能更加專注在語料本身的分析與發展。 | zh_TW |
| dc.description.abstract | In recent years, corpus-based and corpus-driven studies have received considerable attention. For Taiwan Mandarin, two of the most widely used corpora are the Academia Sinica Balanced Corpus (Chen et al., 1996) and Chinese Gigaword (Huang et al., 2005). However, both corpora are limited in their data sources, and neither has been updated for some time, which makes it difficult to collect recent examples of language use. This thesis therefore aims to establish a dynamic corpus, PTT Corpus, which automatically collects, updates, and processes data from PTT (批踢踢) and offers its applications to researchers through a user-friendly interface. The corpus is segmented with Jseg, a Chinese word segmenter trained on data from the Sinica Corpus, and part-of-speech (POS) tagged with a Brill Tagger (Brill, 1992) trained on 9,999 sentences from the Sinica Treebank (Chen et al., 1999). PTT Corpus provides a web interface with several applications, including a concordancer, a collocation extractor, and an emoticon detector, among others. In conclusion, PTT Corpus may help enrich the sources of modern corpora, provide useful corpus tools, and simplify the analysis of recent language use and change for linguists working on Taiwan Mandarin. | en |
| dc.description.provenance | Made available in DSpace on 2021-05-15T17:57:33Z (GMT). No. of bitstreams: 1 ntu-103-R99142008-1.pdf: 2839196 bytes, checksum: 60c01c9cb9260b4c702bc67818c90427 (MD5) Previous issue date: 2014 | en |
| dc.description.tableofcontents | Acknowledgements; Chinese Abstract; Abstract; Contents; List of Figures; List of Tables; 1 Introduction; 1.1 Research motivation; 1.2 Significance; 1.3 Organization of the thesis; 2 Literature Review; 2.1 Corpus linguistics and the web; 2.2 Linguistic studies and social media; 3 Compiling the PTT Corpus; 3.1 Introduction of PTT; 3.2 Metainformation; 3.3 Multidimensionality of PTT data; 3.3.1 Language data structure; 3.3.2 Data collection; 3.4 Word segmentation; 3.4.1 Method; 3.4.2 Evaluation; 3.5 POS tagging; 3.5.1 Method; 3.5.2 Evaluation; 4 Applications; 4.1 Jseg and Segcom; 4.2 Concordancer; 4.2.1 Searching in raw data; 4.2.2 Searching with CWB/CQP; 4.3 Grammatical collocation extractor; 4.4 Emoticon detector; 4.5 Sentiment polarity classifier; 4.5.1 Data training; 4.5.2 Evaluation; 5 Conclusions; Appendix A The weight of selected features; Appendix B The Brill Tagger; Appendix C Details of PTT Corpus; Appendix D Procedure of Jseg segmentation; Appendix E CKIP POS | |
| dc.language.iso | en | |
| dc.title | 批踢踢語料庫之建置與應用 | zh_TW |
| dc.title | PTT Corpus: Construction and Applications | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 103-1 | |
| dc.description.degree | Master's | |
| dc.contributor.oralexamcommittee | 高照明(Zhao-Ming Gao),呂佳蓉(Chia-Rung Lu) | |
| dc.subject.keyword | 批踢踢,動態語料庫,臺灣華語 | zh_TW |
| dc.subject.keyword | PTT, dynamic corpus, Taiwan Mandarin | en |
| dc.relation.page | 61 | |
| dc.rights.note | Authorization granted (open access worldwide) | |
| dc.date.accepted | 2014-10-21 | |
| dc.contributor.author-college | 文學院 | zh_TW |
| dc.contributor.author-dept | 語言學研究所 | zh_TW |
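| Appears in Collections: | Graduate Institute of Linguistics | |

Files in This Item:

| File | Size | Format | |
|---|---|---|---|
| ntu-103-1.pdf | 2.77 MB | Adobe PDF | View/Open |

Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.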
