Augmentation of Character Pronunciations for Resource-poor Chinese Dialects (增補資源匱乏漢語方言之漢字發音)

Chu-Cheng Lin; 林居正

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47124

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	許永真(Jane Yung-jen Hsu)
dc.contributor.author	Chu-Cheng Lin	en
dc.contributor.author	林居正	zh_TW
dc.date.accessioned	2021-06-15T05:48:16Z	-
dc.date.available	2011-08-20
dc.date.copyright	2010-08-20
dc.date.issued	2010
dc.date.submitted	2010-08-18
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47124	-
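dc.description.abstract	Most Chinese dialects lack complete digital pronunciation databases, which are indispensable for speech processing. Given complete pronunciation databases for related dialects, one could use supervised learning to predict a character's pronunciation in a target dialect from its rime-book features and its pronunciations in related dialects. Unfortunately, pronunciation database resources for Chinese dialects remain incomplete. We propose a novel generative model that uses both dialect pronunciation data and medieval rime books to discover phonological regularities shared across multiple dialects. The proposed model can use existing incomplete dialect pronunciation databases together with the data recorded in rime books to fill in the gaps and produce a complete dialect pronunciation database, which can then be used with conventional supervised learning methods to predict character pronunciations in a given dialect. We evaluate using overall pronunciation feature accuracy (OPFA). The first experiment shows that adding dialect pronunciation features, compared with using rime-book features alone, substantially improves the performance of a support vector machine (SVM) classifier. In the second experiment, we compare the effect on SVM performance of phonological features from closely related dialects with that of features from distantly related dialects; the results show that closely related dialects yield higher accuracy. The third experiment shows that our proposed augmentation model can raise the SVM model's OPFA by up to 4.9%.	zh_TW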
dc.description.abstract	Most spoken Chinese dialects lack comprehensive digital pronunciation databases, which are crucial for speech processing tasks. Given complete pronunciation databases for related dialects, one can use supervised learning techniques to predict a Chinese character's pronunciation in a target dialect based on the character's features and its pronunciation in other related dialects. Unfortunately, Chinese dialect pronunciation databases are far from complete. We propose a novel generative model that uses both existing dialect pronunciation data and medieval rime books to discover patterns shared across multiple dialects. The proposed model can fill in missing dialectal pronunciations based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. The augmented pronunciation database can then be used in supervised learning settings. We evaluate prediction accuracy in terms of phonological features such as tone, initial phoneme, and final phoneme. For each character, the features are evaluated as a whole, giving the overall pronunciation feature accuracy (OPFA). The first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves the OPFA of a support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from distantly related dialects. The results show that features from closely related dialects yield higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 4.9%.	en
dc.description.provenance	Made available in DSpace on 2021-06-15T05:48:16Z (GMT). No. of bitstreams: 1 ntu-99-R97922060-1.pdf: 1295631 bytes, checksum: 98b88a4a2cf74f929906c4f0b88cdb51 (MD5) Previous issue date: 2010	en
dc.description.tableofcontents	Abstract; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Motivation; 1.2 Problem; 1.3 Thesis Structure; Chapter 2 Background; 2.1 Background of Chinese Dialects; 2.1.1 Rimebook; 2.2 Related Work; 2.2.1 Machine translation and text-to-speech system for Chinese dialects; 2.2.2 Applications with resource-poor languages; 2.2.3 Computational Dialectometry and Phonology; Chapter 3 Methodology; 3.1 Problem definition; 3.2 Model considerations; 3.2.1 Model description; 3.2.2 Inference; 3.2.3 Inference procedure; Chapter 4 Data and Evaluation; 4.1 Data; 4.1.1 Preprocessing; 4.2 Evaluation; 4.2.1 Experiment Design; 4.2.2 Effect of dialectal data on standard classifiers; 4.2.3 Impacts of proximate dialects; 4.2.4 Effect of data augmentation; Chapter 5 Conclusion; Bibliography
dc.language.iso	en
dc.subject	pronunciation database (發音資料庫)	zh_TW
dc.subject	data augmentation (資料增補)	zh_TW
dc.subject	generative model (生成模型)	zh_TW
dc.subject	Chinese dialects (漢語方言)	zh_TW
dc.subject	data augmentation	en
dc.subject	pronunciation database	en
dc.subject	Chinese dialects	en
dc.subject	generative model	en
dc.title	Augmentation of Character Pronunciations for Resource-poor Chinese Dialects (增補資源匱乏漢語方言之漢字發音)	zh_TW
dc.title	Augmentation of Character Pronunciations for Resource-poor Chinese Dialects	en
dc.type	Thesis
dc.date.schoolyear	98-2
dc.description.degree	Master's
dc.contributor.coadvisor	蔡宗翰(Richard Tzong-han Tsai)
dc.contributor.oralexamcommittee	高成炎(Cheng-yan Kao),陳柏琳(Berlin Chen),陳信希(Hsin-Hsi Chen)
dc.subject.keyword	data augmentation (資料增補), generative model (生成模型), Chinese dialects (漢語方言), pronunciation database (發音資料庫)	zh_TW
dc.subject.keyword	data augmentation, generative model, Chinese dialects, pronunciation database	en
dc.relation.page	32
dc.rights.note	Licensed for a fee
dc.date.accepted	2010-08-19
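dc.contributor.author-college	College of Electrical Engineering and Computer Science (電機資訊學院)	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所)	zh_TW
Appears in collections:	Department of Computer Science and Information Engineering (資訊工程學系)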

Files in this item:

File	Size	Format
ntu-99-1.pdf (not authorized for public access)	1.27 MB	Adobe PDF

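All items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.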