Automatic Processing of Micro-Corpora: SaiSiyat Part-of-Speech Tagging, Partial Parsing, and Their Applications

Zhe-Min Lin; 林哲民

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38890

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	宋麗梅(Li-May Sung)
dc.contributor.author	Zhe-Min Lin	en
dc.contributor.author	林哲民	zh_TW
dc.date.accessioned	2021-06-13T16:51:00Z	-
dc.date.available	2005-07-20
dc.date.copyright	2005-07-20
dc.date.issued	2005
dc.date.submitted	2005-06-22
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38890	-
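dc.description.abstract	This thesis studies part-of-speech tagging and partial parsing techniques for micro-corpora of fewer than twenty thousand words, and proposes three applications. The NTU Corpus of Austronesian Languages is an intonation-unit-based corpus, of which the SaiSiyat portion contains about twelve thousand words. Chapter 1 introduces the current difficulties in processing Austronesian corpora; in particular, because the corpora are too small for statistical natural language processing, alternative approaches must be sought. Chapter 2 introduces a newly designed tag set that faithfully reflects the linguistic characteristics of SaiSiyat and applies it to part-of-speech tagging. The lexical method extracts grammatical information from fieldwork records and reaches about 75% accuracy; the transformation-based error-driven learning (TBL) algorithm then raises accuracy to 85%. This chapter gives particular attention to the difficulty of distinguishing the SaiSiyat nominative and accusative case marker (ka). Chapter 3 introduces binary partial parsing of SaiSiyat; partial parsing lays the groundwork for noun-phrase extraction and other applications. We tried a shortest-path method based on Kullback-Leibler divergence and the TBL method. The accuracy of the former drops quickly as clause length grows, and it requires a great deal of computation time, while the latter reaches about 70% accuracy, which meets the requirement we set. Chapter 4 connects the tagged corpus with linguistic research, native speakers, and the general public. Machine-aided annotation lets linguists process collected data faster and more accurately; considering the different needs of the public and of linguists, we designed an integrated platform for online multimedia corpora and adjusted its detailed design around three properties: standardization, accessibility, and interoperability. Finally, this thesis attempts to discuss the philosophical significance of natural language processing from the perspective of the early and late Wittgenstein. We emphasize the connection between the use of a word in language and its meaning, and hold that the computer cannot break through the boundary of the micro-cosmos formed by the texts in a corpus.	zh_TW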
dc.description.abstract	This thesis demonstrates an effective method for tagging and parsing a corpus of no more than twenty thousand words, along with three applications that take advantage of the processed corpus. The NTU corpus of Austronesian languages, an intonation-unit (IU) based corpus, is the corpus processed here. Chapter 1 introduces current problems in the automatic processing of Austronesian languages. Because small corpora limit the use of statistical natural language processing, an alternative method is needed to deal with Austronesian corpora. Chapter 2 defines a new tag set that reflects the linguistic particularities of SaiSiyat, the object language of this thesis. Two methods for assigning part-of-speech tags, a gloss-based approach (75% accuracy) and transformation-based error-driven learning (TBL, 85% accuracy), are evaluated and found to be robust. The difficulty of distinguishing between the SaiSiyat nominative and accusative case markers receives particular attention. A partial parser is useful in preparing a corpus for noun-phrase extraction and further analyses. In Chapter 3, the tagged corpus is parsed into binary trees by two methods: a statistical approach based on Kullback-Leibler divergence, and TBL. The accuracy of the former declines quickly as IU length increases, and it requires a great deal of computation time, while the accuracy of the latter is a little under 70%. Chapter 4 shows how an annotated corpus relates to linguistic research, to native speakers of the object language, and to the public. Machine-aided annotation helps linguists process collected data quickly. An integrated platform for online multimedia corpora is also designed in this chapter to serve both linguists and the public. The last chapter discusses natural language processing from the viewpoints of the early and late Wittgenstein. We agree with the idea that a word has as many meanings as it has actual uses. Thus, the computer cannot go beyond the boundary of the micro-cosmos composed of the texts given in a corpus.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T16:51:00Z (GMT). No. of bitstreams: 1 ntu-94-R90142001-1.pdf: 1116000 bytes, checksum: 59a6e227d01c9ae66483dfdc8653d75b (MD5) Previous issue date: 2005	en
dc.description.tableofcontents	1 Introduction
  1.1 SaiSiyat Language
  1.2 NTU SaiSiyat corpus
  1.3 Corpus size and the scopes of tagging and parsing
    1.3.1 The scope of POS-tagging
    1.3.2 The scope of partial parsing
  1.4 Integrated applications
2 Syntactic Word-class Tagging of the SaiSiyat Corpus
  2.1 Tag set
  2.2 Methodology
    2.2.1 Mutual divergence and word-aligned corpus
    2.2.2 Method 1: a gloss-based strategy
    2.2.3 Method 2: transformation-based error-driven learning
  2.3 Evaluation of Gloss-based Method and TBL algorithm
    2.3.1 Scoring
    2.3.2 Results and discussion
  2.4 Slight modifications to improve accuracy
    2.4.1 Manual correction of lexicon
    2.4.2 Preservation of intonation unit
  2.5 Summary
3 Partial Parsing SaiSiyat
  3.1 Automatic grammar induction
    3.1.1 Methods
    3.1.2 Results and discussion
  3.2 Transformation-based error-driven parsing
    3.2.1 Methods
    3.2.2 Results and discussion
  3.3 Summary
4 Applications
  4.1 Bigram/trigram retrieval
  4.2 Machine-aided glossary annotation
    4.2.1 Brute-force method
    4.2.2 Problems in the lexicon
    4.2.3 Interim summary
  4.3 Design of a publicly accessible corpus
    4.3.1 Standardisation of text commitment and standards of committed texts
    4.3.2 Database design
    4.3.3 Back-end programmes and the POS-tagger
    4.3.4 Unified output interface
    4.3.5 Interoperability
    4.3.6 Summary
  4.4 Summary
5 Conclusion
  5.1 Boundary of Natural Language Processing
  5.2 Pure substitution of symbols
  5.3 Rules and the understanding of meanings
  5.4 A recognition of the world
  5.5 Conclusion
A Coding List
B Database Schema
dc.language.iso	en
dc.subject	fieldwork text processing	zh_TW
dc.subject	tag set	zh_TW
dc.subject	Wittgenstein	zh_TW
dc.subject	transformation-based error-driven learning	zh_TW
dc.subject	Formosan Austronesian languages	zh_TW
dc.subject	online corpus	zh_TW
dc.subject	Formosan Austronesian languages	en
dc.subject	Wittgenstein	en
dc.subject	fieldwork process	en
dc.subject	online corpus design	en
dc.subject	transformation-based error-driven learning	en
dc.subject	tag set	en
dc.title	Automatic Processing of Micro-Corpora: SaiSiyat Part-of-Speech Tagging, Partial Parsing, and Their Applications	zh_TW
dc.title	Automatic Processing of Languages with Small-Scaled Corpus: Part-of-Speech Tagging and Partial Parsing SaiSiyat and Applications	en
dc.type	Thesis
dc.date.schoolyear	93-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	蘇以文(Lily I-wen Su),陳信希(Hsin-Hsi Chen)
dc.subject.keyword	Formosan Austronesian languages,transformation-based error-driven learning,tag set,online corpus,fieldwork text processing,Wittgenstein	zh_TW
dc.subject.keyword	Formosan Austronesian languages,tag set,transformation-based error-driven learning,online corpus design,fieldwork process,Wittgenstein	en
dc.relation.page	104
dc.rights.note	Paid license
dc.date.accepted	2005-06-23
dc.contributor.author-college	College of Liberal Arts	zh_TW
dc.contributor.author-dept	Graduate Institute of Linguistics	zh_TW
Appears in Collections:	Graduate Institute of Linguistics

Files in This Item:

File	Size	Format
ntu-94-1.pdf (not authorized for public access)	1.09 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.