Interpretation of Chinese Discourse Markers in Discourse Relation Recognition (中文話語標記解譯及句子話語關係辨識之研究)

Tai-Wei Chang; 張岱偉

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61312

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳信希(Hsin-Hsi Chen)
dc.contributor.author	Tai-Wei Chang	en
dc.contributor.author	張岱偉	zh_TW
dc.date.accessioned	2021-06-16T13:00:56Z	-
dc.date.available	2013-08-14
dc.date.copyright	2013-08-14
dc.date.issued	2013
dc.date.submitted	2013-08-08
dc.identifier.citation	[1] R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. K. Joshi, and B. L. Webber, “The Penn Discourse TreeBank 2.0,” in LREC, 2008. [2] L. Carlson, D. Marcu, and M. E. Okurowski, “RST Discourse Treebank,” 2002. [3] H.-H. Huang and H.-H. Chen, “Chinese Discourse Relation Recognition,” in IJCNLP, 2011, pp. 1442–1446. [4] H.-H. Huang and H.-H. Chen, “Contingency and comparison relation labeling and structure prediction in Chinese sentences,” in Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Stroudsburg, PA, USA, 2012, pp. 261–269. [5] Y. Zhou and N. Xue, “PDTB-style discourse annotation of Chinese text,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Stroudsburg, PA, USA, 2012, pp. 69–77. [6] H.-H. Huang and H.-H. Chen, “An Annotation System for Development of Chinese Discourse Corpus,” in COLING (Demos), 2012, pp. 223–230. [7] H. Hernault, D. Bollegala, and M. Ishizuka, “A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010, pp. 399–409. [8] D. Marcu and A. Echihabi, “An unsupervised approach to recognizing discourse relations,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 368–375. [9] E. Pitler and A. Nenkova, “Using syntax to disambiguate explicit discourse connectives in text,” in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Stroudsburg, PA, USA, 2009, pp. 13–16. [10] CMU, “The ClueWeb09 Dataset,” 2009. [11] E. Pitler, M. Raghupathy, H. Mehta, A. Nenkova, A. Lee, and A. K. Joshi, “Easily identifiable discourse relations,” Tech. Reports Cis, p. 884, 2008. [12] Z. Lin, M.-Y. Kan, and H. T. Ng, “Recognizing implicit discourse relations in the Penn Discourse Treebank,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, Stroudsburg, PA, USA, 2009, pp. 343–351. [13] E. Pitler, A. Louis, and A. Nenkova, “Automatic sense prediction for implicit discourse relations in text,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, Stroudsburg, PA, USA, 2009, pp. 683–691. [14] N. Xue, “Annotating discourse connectives in the Chinese Treebank,” in Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, Stroudsburg, PA, USA, 2005, pp. 84–91. [15] X. Cheng and X. Tian, “現代漢語,” Goodreads, 1989. [16] S.-Y. Cheng, “Corpus-Based Coherence Relation Tagging in Chinese Discourse,” 2006. [17] S. Lu, “現代漢語八百詞,” 2007. [18] C.-H. Yu, Y. Tang, and H.-H. Chen, “Development of a Web-Scale Chinese Word N-gram Corpus with Parts of Speech Information,” in LREC, 2012, pp. 320–324. [19] 梅家驹, 竺一鸣, 高蕴琦, and 殷鸿翔, 同义词词林. 上海辞书出版社, 1996. [20] L.-W. Ku and H.-H. Chen, “Mining opinions from the Web: Beyond relevance retrieval,” J. Am. Soc. Inf. Sci. Technol., vol. 58, no. 12, pp. 1838–1850, 2007. [21] F. Wolf and E. Gibson, “Representing discourse coherence: a corpus-based analysis,” in Proceedings of the 20th International Conference on Computational Linguistics, 2004, p. 134. [22] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, May 2011.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61312	-
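dc.description.abstract	The goal of discourse relation recognition is to predict the most appropriate sentence-level discourse relation between any two discourse units. This strongly affects semantic judgments over a whole text and is an important topic in natural language processing research. Compared with English, Chinese discourse markers are more ambiguous because of the characteristics of the language itself, which leads to differences in discourse relation recognition performance. To improve recognition performance effectively, it is very important to define the meaning of discourse markers appropriately. Because Chinese does not yet have a large-scale, thoroughly annotated discourse corpus comparable to the English PDTB and RST-DT, this study extracts 7,601 sentences from the ClueWeb09 corpus, asks annotators to label each with the most suitable discourse relation, and then uses this small dataset to build a semi-supervised learning model. With the aid of parameter estimation, the approach not only improves discourse relation recognition performance but also yields the probability distribution of each discourse marker over the four top-level discourse relations defined in PDTB. Experimental results show that the best-performing configuration reaches an average F-score of 73.22%, a statistically significant improvement over the 69.76% of the baseline model used in the experiments. The semi-supervised classifier is then extended to a larger unannotated dataset of 302,293 sentences in order to collect probability distribution information for discourse markers with higher coverage. The resulting statistics perform well when validated with two similarity measures. Finally, the statistics and a simple classification method are used to define the forward/backward combination relations of discourse markers, with the aim of further reducing ambiguity.	zh_TW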
dc.description.abstract	Not all Chinese discourse markers have a unique interpretation. This becomes a challenging issue when they are used for discourse relation recognition. In this thesis, we propose a semi-supervised method to learn the interpretations of Chinese discourse markers and apply the results to discourse relation labeling. A total of 7,601 sentences, each composed of two clauses connected by a single discourse marker, are sampled from ClueWeb09 and manually annotated with discourse relations. We train an SVM discourse relation classifier on this dataset and boost the classifier with parameter estimation. Our experimental results show that the proposed approach achieves an F-score of 73.22%. The discourse relation recognition system is then employed to annotate 302,293 unlabeled sentences. The degrees of ambiguity of discourse markers and their backward/forward combination problems are analyzed.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T13:00:56Z (GMT). No. of bitstreams: 1 ntu-102-R00922097-1.pdf: 2424032 bytes, checksum: 64cce59cfe55632edf17f35d6cfc7ada (MD5) Previous issue date: 2013	en
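dc.description.tableofcontents	Abstract (Chinese); Abstract; Acknowledgements; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Motivation; 1.2 Research Goals; 1.3 Thesis Organization; Chapter 2 Related Work; 2.1 Discourse Relation Corpora; 2.2 English Discourse Relation Analysis; 2.3 Chinese Discourse Relation Analysis; Chapter 3 Corpus Resources; 3.1 Chinese Discourse Marker Dictionary; 3.2 ClueWeb09: Chinese Corpus; 3.3 Data Selection Criteria; Chapter 4 Discourse Marker Ambiguity; 4.1 Comparison of Discourse Marker Ambiguity in Chinese and English; 4.2 Predicting Discourse Relations with the Dictionary; 4.3 Analysis of Chinese Discourse Marker Ambiguity; Chapter 5 Semi-Supervised Learning Method; 5.1 Experimental Method and Purpose; 5.1.1 Baseline Model; 5.1.2 Experimental Purpose; 5.2 Feature Extraction; 5.2.1 Linguistic Features; 5.2.2 Discourse Marker Features; 5.3 Semi-Supervised Learning Algorithm; 5.3.1 Data Initialization; 5.3.2 Parameter Estimation; Chapter 6 Experiments and Discussion; 6.1 Experimental Setup; 6.1.1 Experimental Data; 6.1.2 Classifier Settings; 6.2 Comparison of Experimental Models; 6.3 Large-Scale Testing; 6.3.1 Experimental Dataset; 6.3.2 Probability Distribution Prediction Results; 6.3.2.1 Ambiguous Discourse Markers; 6.3.2.2 Unambiguous Discourse Markers; 6.3.3 Comparison of Probability Distribution Similarity; 6.3.3.1 Cosine Similarity; 6.3.3.2 Kendall Rank Correlation Coefficient; 6.3.4 Combination Analysis of Single-Word Discourse Markers; Chapter 7 Conclusion and Future Work; 7.1 Conclusion; 7.2 Future Work; References; Appendix A Per-Relation Prediction Performance Curves of the Semi-Supervised Learning Model (initial value = lookup in the discourse marker dictionary); Appendix B Probability Distribution Prediction Results on the Large-Scale Test Data (initial value = 0.25); Appendix C Cosine Similarity Comparison of the Semi-Supervised Learning Model (initial value = lookup in the discourse marker dictionary)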
dc.language.iso	zh-TW
dc.subject	discourse markers	zh_TW
dc.subject	discourse relations	zh_TW
dc.subject	marker ambiguity	zh_TW
dc.subject	semi-supervised learning model	zh_TW
dc.subject	marker combination	zh_TW
dc.subject	discourse markers	en
dc.subject	discourse relation labeling	en
dc.subject	interpretation of ambiguous markers	en
dc.subject	semi-supervised learning	en
dc.subject	marker combination	en
dc.title	中文話語標記解譯及句子話語關係辨識之研究	zh_TW
dc.title	Interpretation of Chinese Discourse Markers in Discourse Relation Recognition	en
dc.type	Thesis
dc.date.schoolyear	101-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	張俊盛,林川傑,古倫維
dc.subject.keyword	discourse markers,discourse relations,marker ambiguity,semi-supervised learning model,marker combination	zh_TW
dc.subject.keyword	discourse markers,discourse relation labeling,semi-supervised learning,interpretation of ambiguous markers,marker combination	en
dc.relation.page	68
dc.rights.note	Paid license
dc.date.accepted	2013-08-08
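dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering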

Files in this item:

File	Size	Format
ntu-102-1.pdf (not authorized for public access)	2.37 MB	Adobe PDF

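Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.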