Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76346
Title: | 機器輔助控制詞彙索引之研究 A Study on Machine-Aided Controlled Vocabulary Indexing |
Author: | 伍健廷 |
Year of publication: | 1998 |
Degree: | Master's |
Abstract: | Building on term-frequency statistics, this thesis proposes a model for automatic controlled vocabulary indexing that uses a large set of manually indexed documents together with the semantic information carried by the controlled vocabulary. After simple training on these documents, the model constructs associations between each controlled vocabulary term and a set of natural-language words, and stores the associations as indexing features. When a new document arrives, its words are matched against the indexing features to assign controlled vocabulary terms automatically. Within the model, a new term-significance formula, TF × OSDF × CSIDF, corrects a flaw of the traditional TF × IDF: it cannot separate subject-specific words, which contribute strongly to subject identification, from other words in a collection of documents on the same or closely related subjects. Without requiring additional training documents, the formula merges and splits the same training documents to compute document frequencies for different purposes, and so separates subject-specific words from common words and domain-specific words. The experiments cover 100 MeSH subject headings and use the abstracts and titles of 60,400 documents for training and testing. The model performs well: for abstracts, indexing precision and indexing recall both exceed 90%; for titles, indexing recall holds at 70% when indexing precision is required to reach 90%. Further statistical analysis of the experimental data indicates that the model is also applicable to controlled vocabularies with a thesaurus structure. The large lists of candidate index terms generated by the model can alleviate the problem of indexer consistency and save time and effort. With the model's assistance, more controlled vocabulary terms can be assigned per document, which addresses a weakness of traditional controlled vocabulary indexing: because few terms are assigned, retrieval precision is high but recall falls short of natural-language indexing. |
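The abstract's point about TF × IDF can be illustrated with a small sketch. It uses only the textbook definition IDF = log(N / df) and does not attempt to reproduce OSDF or CSIDF, which this record does not define. The toy documents and the function name `tf_idf` are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: TF x IDF} dict per document, with IDF = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Hypothetical toy collection: every document is about the same subject.
docs = [
    "asthma inhaler asthma children".split(),
    "asthma steroid therapy".split(),
    "asthma allergy asthma trial".split(),
]
scores = tf_idf(docs)
```

In this toy collection the subject-specific word "asthma" occurs in every document, so its IDF is log(3/3) = 0 and its TF × IDF score is 0 however often it appears, while incidental words such as "inhaler" score higher. This is the kind of behaviour the abstract says the proposed formula is meant to correct.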
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76346 |
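Full-text authorization: | Not authorized |
Appears in collections: | Department of Library and Information Science |
Files in this item:
There are no files associated with this item.
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.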