Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7249
Title: The Construction of a Chinese Collocation Resource with Sense Distinction and its Applications (具詞義區分之中文搭配詞資源建構及其應用)
Author: Meng-Hsien Shih (施孟賢)
Advisor: Shu-Kai Hsieh (謝舒凱)
Keywords: collocation, dependency parser, semantic space, sense annotation, word sense disambiguation
Year of publication: 2019
Degree: Doctoral
Abstract:
As corpora grow ever larger, it has become necessary to process them automatically in order to provide information beyond concordances, such as collocation and sense information. This dissertation constructs a collocation resource with sense distinction for Simplified Chinese and Traditional Chinese, and evaluates the resource by its performance on a natural language processing (NLP) task.
To extract sense-annotated collocations automatically, the Stanford Parser and the SyntaxNet Parser are each used to extract collocation candidates from sense-annotated sentences, and the candidates are then ranked by their logDice scores. For Traditional Chinese, a semi-automatic approach is explored to speed up sense annotation by bootstrapping candidate sense instances from the sentences of the Academia Sinica Balanced Corpus 4.0. The corpus sentences are first parsed with the Stanford Parser (or, alternatively, the SyntaxNet Parser), and each sentence is mapped into a vector space according to its dependency parse. The example sentences of each sense in the dictionary (Chinese Wordnet, CWN) are parsed and mapped into the same vector space. The corpus sentences are then ranked by their distance in that space to the example sentences of the CWN sense being annotated, so that the annotator can begin with the most likely sense instances instead of examining the corpus sentence by sentence.
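The two ranking steps described above can be sketched in a few lines of Python. The logDice formula used here, 14 + log2(2·f_xy / (f_x + f_y)), is the standard definition of the measure. Everything else is an assumption made for illustration: the toy (headword, collocate) pairs, the representation of a sentence as a bag of dependency features such as `dobj:電話`, the use of cosine similarity as the distance, and all function names. The dissertation's actual feature set and vector-space construction may well differ.

```python
import math
from collections import Counter


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocations(pairs):
    """Rank (headword, collocate) pairs by logDice, highest first.

    Frequencies are counted over the extracted pairs themselves
    (an assumption; a real resource may count over the whole corpus).
    """
    pairs = list(pairs)
    pair_freq = Counter(pairs)
    word_freq = Counter(word for word, _ in pairs)
    coll_freq = Counter(coll for _, coll in pairs)
    scored = {
        pair: log_dice(n, word_freq[pair[0]], coll_freq[pair[1]])
        for pair, n in pair_freq.items()
    }
    return sorted(scored.items(), key=lambda item: -item[1])


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(sense_example, corpus_sentences):
    """Return corpus sentence indices, most similar to the sense example first."""
    target = Counter(sense_example)
    scored = [(cosine(target, Counter(s)), i) for i, s in enumerate(corpus_sentences)]
    return [i for _, i in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Hypothetical parser output: (headword, collocate) pairs.
toy_pairs = [("打", "電話"), ("打", "電話"), ("打", "球"), ("看", "球")]
ranked = rank_collocations(toy_pairs)

# Hypothetical dependency features for one dictionary example and three corpus sentences.
example = ["dobj:電話", "nsubj:他"]
corpus = [
    ["dobj:球", "nsubj:我"],
    ["dobj:電話", "nsubj:我"],
    ["dobj:電話", "nsubj:他"],
]
order = rank_candidates(example, corpus)
```

In this toy run, `ranked` lists the collocation pairs from strongest to weakest association, and `order` gives the order in which an annotator would be shown the corpus sentences for the sense in question.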
For Simplified Chinese, the data come from the SemEval-2007 dataset, in which 40 word types are sense-annotated across 2,686 sentences. To allow comparison with the Simplified Chinese data, 17 word types in the Traditional Chinese sense inventory (Chinese Wordnet) that also occur among the SemEval-2007 word types were selected as annotation targets, and 1,646 sentences containing these words were annotated in the Sinica Corpus. The resulting collocation resource, with sense annotation for both Simplified Chinese and Traditional Chinese, has been released through a web interface (http://lopen.linguistics.ntu.edu.tw/collocation.htm) for users to query.
The extrinsic evaluation on the task of word sense disambiguation (WSD) shows that the collocation data extracted by the SyntaxNet Parser can train a support vector machine (SVM) classifier that achieves what the dissertation reports as state-of-the-art WSD precision: P = 75.98% for Simplified Chinese, and P = 58.35% for the more fine-grained Traditional Chinese sense inventory (Chinese Wordnet). That a transparent model using only basic linguistic features reaches this level of performance, in contrast to deep learning models, suggests that the collocational behavior of a word largely determines its sense in a sentence.
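As a rough illustration of how collocation features can drive a transparent WSD classifier, the sketch below trains a small linear model with a multiclass hinge loss and L2 shrinkage, which is the kind of objective a linear SVM optimizes. It is written in plain Python and is not the SVM implementation or feature set used in the dissertation. The target word, the sense labels `make_call` and `play_ball`, the dependency features, and the hyperparameters are all hypothetical.

```python
def score(weights, feats):
    """Sum the weights of the features present in a sentence."""
    return sum(weights.get(f, 0.0) for f in feats)


def train(examples, epochs=20, lr=0.1, reg=0.01):
    """Train per-sense weight vectors by stochastic subgradient descent.

    examples: list of (feature_list, sense_label); assumes at least two senses.
    """
    senses = sorted({gold for _, gold in examples})
    w = {s: {} for s in senses}
    for _ in range(epochs):
        for feats, gold in examples:
            # L2 shrinkage of all existing weights (the regularization term).
            for s in senses:
                for f in w[s]:
                    w[s][f] *= 1.0 - lr * reg
            # Strongest competing sense for this example.
            rival = max(
                (s for s in senses if s != gold),
                key=lambda s: score(w[s], feats),
            )
            # Hinge loss: update only when the margin is smaller than 1.
            if score(w[gold], feats) - score(w[rival], feats) < 1.0:
                for f in feats:
                    w[gold][f] = w[gold].get(f, 0.0) + lr
                    w[rival][f] = w[rival].get(f, 0.0) - lr
    return w


def predict(w, feats):
    """Return the sense whose weight vector scores the features highest."""
    return max(sorted(w), key=lambda s: score(w[s], feats))


# Hypothetical training sentences for the target word 打,
# each reduced to its dependency collocation features.
training = [
    (["dobj:電話", "nsubj:他"], "make_call"),
    (["dobj:電話", "nsubj:我"], "make_call"),
    (["dobj:球", "nsubj:他"], "play_ball"),
    (["dobj:籃球", "nsubj:我"], "play_ball"),
]
model = train(training)
```

Because the model is just one weight per (sense, feature) pair, inspecting `model["make_call"]` shows directly which collocates push a sentence toward that sense, which is the sense in which such an approach is transparent.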
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7249
DOI: 10.6342/NTU201901938
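Full-text authorization: Granted (open access worldwide)
Electronic full-text release date: 2024-08-16
Appears in collections: Graduate Institute of Linguistics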

Files in this item:
File: ntu-108-1.pdf
Size: 5.75 MB
Format: Adobe PDF
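Unless their copyright terms are specifically indicated otherwise, items in this system are protected by copyright, with all rights reserved.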
