Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38890
Full metadata record
DC field: value [language]
dc.contributor.advisor: 宋麗梅 (Li-May Sung)
dc.contributor.author: Zhe-Min Lin [en]
dc.contributor.author: 林哲民 [zh_TW]
dc.date.accessioned: 2021-06-13T16:51:00Z
dc.date.available: 2005-07-20
dc.date.copyright: 2005-07-20
dc.date.issued: 2005
dc.date.submitted: 2005-06-22
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/38890
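dc.description.abstract: This thesis studies part-of-speech tagging and partial parsing techniques for micro-corpora of fewer than twenty thousand words, and presents three applications.
  The NTU Austronesian corpus is built on intonation units; its SaiSiyat portion contains about twelve thousand words. Chapter 1 introduces the current difficulties in processing Austronesian corpora, in particular that their small size rules out statistical natural language processing, so other approaches must be sought. Chapter 2 introduces a newly designed tag set that reflects the linguistic characteristics of SaiSiyat and applies it to part-of-speech tagging. The lexical method extracts grammatical information from fieldwork records and reaches about 75% accuracy; the transformation-based error-driven learning (TBL) algorithm then raises accuracy to 85%. The chapter gives special attention to the difficulty of distinguishing the nominative and accusative uses of the SaiSiyat case marker ka.
  Chapter 3 introduces binary partial parsing of SaiSiyat. Partial parsing prepares the ground for noun-phrase extraction and several other applications. We tried a shortest-path method based on the Kullback-Leibler divergence and a TBL method. The accuracy of the former drops quickly as clause length grows, and it requires a large amount of computation time, while the latter reaches about 70% accuracy, which meets the requirement we set.
  Chapter 4 connects the tagged corpus with linguistic research, native speakers, and the general public. Machine-aided annotation lets linguists process collected data faster and more accurately. Taking into account the different needs of the public and of linguists, we designed an integrated platform for online multimedia corpora and tuned its detailed design around three properties: standardisation, accessibility, and interoperability.
  Finally, the thesis discusses the philosophical significance of natural language processing from the perspective of the early and the late Wittgenstein. We emphasise the connection between the use of a word in language and its meaning, and hold that the computer cannot break through the boundary of the micro-universe constituted by the texts in a corpus.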
zh_TW
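The tagging approach summarised in the abstract (an initial tagger whose output is then corrected by transformation-based error-driven learning) can be illustrated with a toy sketch. The code below is not the thesis's implementation. The tag names (V, N, NOM, ACC), the three short tag sequences, the single rule template ("change tag A to B when the preceding tag is Z"), and all function names are assumptions made purely for illustration.

```python
def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) to a tag sequence.

    Conditions are checked against the input sequence, so all matches
    are rewritten simultaneously.
    """
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_errors(tagged, gold):
    """Number of positions where the current tags differ from the gold tags."""
    return sum(t != g for ts, gs in zip(tagged, gold) for t, g in zip(ts, gs))


def learn_rules(initial, gold, max_rules=10):
    """Greedy TBL loop: repeatedly pick the rule that removes the most errors."""
    current = [list(s) for s in initial]
    rules = []
    for _ in range(max_rules):
        # Propose one candidate rule per remaining error (hypothetical template).
        candidates = {
            (cur[i], ref[i], cur[i - 1])
            for cur, ref in zip(current, gold)
            for i in range(1, len(cur))
            if cur[i] != ref[i]
        }
        base = count_errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            after = count_errors([apply_rule(s, rule) for s in current], gold)
            if base - after > best_gain:
                best_rule, best_gain = rule, base - after
        if best_rule is None:  # no rule gives a net improvement
            break
        rules.append(best_rule)
        current = [apply_rule(s, best_rule) for s in current]
    return rules, current


# Hypothetical data: the initial tagger labels every case marker as NOM.
initial = [["V", "NOM", "N", "NOM", "N"], ["V", "NOM", "N"], ["N", "NOM", "N"]]
gold = [["V", "NOM", "N", "ACC", "N"], ["V", "NOM", "N"], ["N", "ACC", "N"]]

rules, tagged = learn_rules(initial, gold)
print(rules)
print(count_errors(initial, gold), "->", count_errors(tagged, gold))
```

Each learned rule is scored by its net effect on the whole training set, so a rule that fixes some tags but breaks more elsewhere is never selected.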
dc.description.abstract: This thesis demonstrates an effective method for tagging and parsing a corpus of no more than twenty thousand words, along with three applications that take advantage of the processed corpus. The NTU corpus of Austronesian languages, an intonation-unit (IU) based corpus, is chosen for processing. Chapter 1 introduces current problems in the automatic processing of Austronesian languages. Because small-scale corpora limit the use of statistical natural language processing, an alternative method is needed for Austronesian corpora. Chapter 2 defines a new tag set that reflects the linguistic particularities of SaiSiyat, the object language of this thesis. Two methods for assigning part-of-speech tags, a gloss-based approach (accuracy 75%) and transformation-based error-driven learning (TBL, accuracy 85%), are evaluated and both are reported to be robust. The difficulty of distinguishing the SaiSiyat nominative and accusative case markers receives particular attention. A partial parser is useful in preparing a corpus for noun-phrase extraction and further analyses. In Chapter 3, the tagged corpus is parsed into binary trees by two methods: a statistical approach based on Kullback-Leibler divergence, and TBL. The accuracy of the former declines quickly as IU length increases, and it needs a large amount of computation time, while the accuracy of the latter is a little less than 70%. Chapter 4 shows how an annotated corpus relates to linguistic research, to native speakers of the object language, and to the public. Machine-aided annotation helps linguists rearrange collected data quickly. An integrated platform for multimedia online corpora is also designed in this chapter to serve both linguists and the public. The last chapter discusses natural language processing from the points of view of the early and the late Wittgenstein. We agree with the idea that a word has as many meanings as it has actual uses. Thus, the computer cannot go beyond the boundary of the microcosm composed of the texts given in a corpus.
en
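The abstract names Kullback-Leibler divergence as the basis of the statistical parsing method. As a reminder of what that quantity measures, the sketch below computes D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)) for two smoothed distributions. It does not reproduce the thesis's parsing procedure: the tag inventory, the counts, the additive smoothing, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def distribution(counts, vocab, alpha=1.0):
    """Turn raw counts into probabilities with additive smoothing.

    Smoothing keeps every probability above zero, so the logarithm in the
    divergence is always defined. This matters for very small corpora,
    where many events are never observed.
    """
    total = sum(counts[v] for v in vocab) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}


def kl_divergence(p, q):
    """D(P||Q) in nats; p and q must share the same keys."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def symmetric_divergence(p, q):
    """Sum of the two directed divergences, which is symmetric in p and q."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical counts of the tag that follows three different items.
vocab = ["N", "V", "P"]
ctx_a = Counter({"N": 8, "V": 1})
ctx_b = Counter({"N": 7, "V": 2})
ctx_c = Counter({"V": 9})

p_a = distribution(ctx_a, vocab)
p_b = distribution(ctx_b, vocab)
p_c = distribution(ctx_c, vocab)

# Items a and b occur in similar contexts, so their divergence is smaller.
print(symmetric_divergence(p_a, p_b) < symmetric_divergence(p_a, p_c))
```

The directed divergence is not symmetric, which is why the sketch also shows a symmetrised variant for comparing two items on an equal footing.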
dc.description.provenance: Made available in DSpace on 2021-06-13T16:51:00Z (GMT). No. of bitstreams: 1
ntu-94-R90142001-1.pdf: 1116000 bytes, checksum: 59a6e227d01c9ae66483dfdc8653d75b (MD5)
Previous issue date: 2005
en
dc.description.tableofcontents:
1 Introduction
  1.1 SaiSiyat Language
  1.2 NTU SaiSiyat corpus
  1.3 Corpus size and the scopes of tagging and parsing
    1.3.1 The scope of POS-tagging
    1.3.2 The scope of partial parsing
  1.4 Integrated applications
2 Syntactic Word-class Tagging of the SaiSiyat Corpus
  2.1 Tag set
  2.2 Methodology
    2.2.1 Mutual divergence and word-aligned corpus
    2.2.2 Method 1: a gloss-based strategy
    2.2.3 Method 2: transformation-based error-driven learning
  2.3 Evaluation of Gloss-based Method and TBL algorithm
    2.3.1 Scoring
    2.3.2 Results and discussion
  2.4 Slight modifications to improve accuracy
    2.4.1 Manual correction of lexicon
    2.4.2 Preservation of intonation unit
  2.5 Summary
3 Partial Parsing SaiSiyat
  3.1 Automatic grammar induction
    3.1.1 Methods
    3.1.2 Results and discussion
  3.2 Transformation-based error-driven parsing
    3.2.1 Methods
    3.2.2 Results and discussion
  3.3 Summary
4 Applications
  4.1 Bigram/trigram retrieval
  4.2 Machine-aided glossary annotation
    4.2.1 Brute-force method
    4.2.2 Problems in the lexicon
    4.2.3 Interim summary
  4.3 Design of a publicly accessible corpus
    4.3.1 Standardisation of text commitment and standards of committed texts
    4.3.2 Database design
    4.3.3 Back-end programmes and the POS-tagger
    4.3.4 Unified output interface
    4.3.5 Interoperability
    4.3.6 Summary
  4.4 Summary
5 Conclusion
  5.1 Boundary of Natural Language Processing
  5.2 Pure substitution of symbols
  5.3 Rules and the understanding of meanings
  5.4 A recognition of the world
  5.5 Conclusion
A Coding List
B Database Schema
dc.language.iso: en
dc.subject: fieldwork text processing (田調文本處理) [zh_TW]
dc.subject: tag set (標記集) [zh_TW]
dc.subject: Wittgenstein (維特根斯坦) [zh_TW]
dc.subject: transformation-based error-driven learning (基於轉換的錯誤驅動學習) [zh_TW]
dc.subject: Formosan Austronesian languages (台灣南島語) [zh_TW]
dc.subject: online corpus (線上語料庫) [zh_TW]
dc.subject: Formosan Austronesian languages [en]
dc.subject: Wittgenstein [en]
dc.subject: fieldwork process [en]
dc.subject: online corpus design [en]
dc.subject: transformation-based error-driven learning [en]
dc.subject: tag set [en]
dc.title: 微型語料庫的自動處理:賽夏語詞性標記、部份剖析及其應用 [zh_TW]
dc.title: Automatic Processing of Languages with Small-Scaled Corpus: Part-of-Speech Tagging and Partial Parsing SaiSiyat and Applications [en]
dc.type: Thesis
dc.date.schoolyear: 93-2
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: 蘇以文 (Lily I-wen Su), 陳信希 (Hsin-Hsi Chen)
dc.subject.keyword: 台灣南島語, 基於轉換的錯誤驅動學習, 標記集, 線上語料庫, 田調文本處理, 維特根斯坦 [zh_TW]
dc.subject.keyword: Formosan Austronesian languages, tag set, transformation-based error-driven learning, online corpus design, fieldwork process, Wittgenstein [en]
dc.relation.page: 104
dc.rights.note: Paid license (有償授權)
dc.date.accepted: 2005-06-23
dc.contributor.author-college: College of Liberal Arts (文學院) [zh_TW]
dc.contributor.author-dept: Graduate Institute of Linguistics (語言學研究所) [zh_TW]
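Appears in collections: Graduate Institute of Linguistics (語言學研究所)

Files in this item:
File | Size | Format
ntu-94-1.pdf (not authorized for public access) | 1.09 MB | Adobe PDF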