NTU Theses and Dissertations Repository
Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79507
Full metadata record
DC field: value [language]
dc.contributor.advisor: 謝舒凱 (Shu-Kai Hsieh)
dc.contributor.author: Yongfu Liao [en]
dc.contributor.author: 廖永賦 [zh_TW]
dc.date.accessioned: 2022-11-23T09:02:11Z
dc.date.available: 2022-02-16
dc.date.available: 2022-11-23T09:02:11Z
dc.date.copyright: 2022-02-16
dc.date.issued: 2022
dc.date.submitted: 2022-02-08
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/79507
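dc.description.abstract: The Chinese writing system holds a unique position among the world's writing systems because the vast majority of Chinese characters are logograms. Characters therefore carry semantic information themselves, unlike many other writing systems, which convey meaning only by spelling out the sounds of words. In addition, a character can usually be decomposed into smaller elements, which often carry meaning and pronunciation related to the character. Because of how characters are encoded, however, computer users cannot easily access this rich information: each character corresponds to a single code point, and the code point itself records nothing about the character's internal structure. For example, Chinese readers know that 淋 and 霖 are pronounced the same because they share the component 林. But this shared component cannot be recovered from the encodings of the two characters: in Unicode, 淋 and 霖 correspond to U+6DCB and U+9716 respectively, and these code points do not represent the fact that the two characters are related. To address this limitation, we developed a Chinese corpus toolkit that supports analysis at the sub-character level. The toolkit gives users access to rich information about character components (both radicals and non-radicals). For example, users can search by a shared component (the component 林 retrieves 淋, 霖, 琳, 箖, and 惏) and use such information for quantitative corpus analysis. Beyond the toolkit, we conducted a case study to verify with empirical data whether sub-character information is useful and to explore its semantic association with higher levels. The results show that the semantic information of certain radicals is significantly associated with the semantic information of words, although most radicals have no clear mapping to word types. Finally, we point out some implications of the highly recursive internal structure of Chinese characters for the present study and discuss potential ways to resolve the related difficulties.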
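The code-point limitation described in the abstract can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the decomposition table is hand-written and hypothetical, and the names `DECOMPOSITION`, `code_point`, and `chars_with_component` are invented here; none of this is taken from the thesis's toolkit or its API.

```python
# Hypothetical, hand-written decomposition table for illustration only.
# It is NOT taken from the thesis's toolkit or from any real dataset.
DECOMPOSITION = {
    "淋": ("氵", "林"),
    "霖": ("雨", "林"),
    "惏": ("忄", "林"),
    "河": ("氵", "可"),
}


def code_point(char: str) -> str:
    """Format a single character's Unicode code point as U+XXXX."""
    return f"U+{ord(char):04X}"


def chars_with_component(component: str, table=DECOMPOSITION) -> list[str]:
    """Return the characters in the table that contain the given component."""
    return sorted(ch for ch, parts in table.items() if component in parts)


if __name__ == "__main__":
    # The code points alone give no hint that the two characters share 林.
    print(code_point("淋"), code_point("霖"))
    # With an explicit decomposition table, the shared component is searchable.
    print(chars_with_component("林"))
```

The point of the sketch is that `ord` exposes only an opaque number per character, so any component-based search needs an external mapping from characters to their parts, which is the kind of information the abstract says the toolkit makes available.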
dc.description.provenance: Made available in DSpace on 2022-11-23T09:02:11Z (GMT). No. of bitstreams: 1. U0001-0802202212352300.pdf: 1717279 bytes, checksum: 272f558132c4fd79dd90b10d0c38460b (MD5). Previous issue date: 2022 [en]
dc.description.tableofcontents:
摘要 (Chinese abstract)
Abstract
List of Figures
List of Tables
1 Introduction
1.1 Background and Motivation
1.2 Contributions
1.3 Organization of the Thesis
2 Literature Review
2.1 Typology of Writing Systems
2.2 Chinese Character Structure
2.3 Resources for Chinese Character Decomposition
2.4 Previous Work on Chinese Character Decomposition
3 Design of System
3.1 Corpus Structure and Input Data
3.2 Corpus Query Language
3.3 Searching the Corpus
3.4 Corpus Analysis
4 Case Study: Radical Semantic Information in Words
4.1 Data
4.2 Association of Radical and Word Semantic Types
4.3 Radical Semantic Type Importance
4.4 Interim Summary
5 Conclusion
5.1 Summary
5.2 Limitations and Future Work
References
Appendix A: Search API in hgct
Appendix B: Corpus Analysis API in hgct
dc.language.iso: en
dc.title: 文字部件為本的語料分析:一個子字詞層次的中文語料庫工具 [zh_TW]
dc.title: Glyph-based Corpus Analysis: A Toolkit for Sub-character Analysis of Chinese Corpora [en]
dc.date.schoolyear: 110-1
dc.description.degree: Master's (碩士)
dc.contributor.oralexamcommittee: 陳正賢 (Ouh-Young Ming), 張瑜芸 (Lung-Pan Chen), (Yi-Ping Hung)
dc.subject.keyword: 語料庫工具, 書寫系統, 漢字, 部件, 語料庫語言學 [zh_TW]
dc.subject.keyword: corpus toolkit, writing system, Chinese character, character components, corpus linguistics [en]
dc.relation.page: 97
dc.identifier.doi: 10.6342/NTU202200369
dc.rights.note: Authorization granted (open access worldwide)
dc.date.accepted: 2022-02-10
dc.contributor.author-college: College of Liberal Arts (文學院) [zh_TW]
dc.contributor.author-dept: Graduate Institute of Linguistics (語言學研究所) [zh_TW]
Appears in collections: Graduate Institute of Linguistics (語言學研究所)

Files in this item:
File: U0001-0802202212352300.pdf
Size: 1.68 MB
Format: Adobe PDF