Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54353
Title: | Entity Resolution Based on Data Source Reliability and Temporal Features (運用紀錄時間點與資料來源信賴度解析實體) |
Author: | Yi-Jyun You (尤怡鈞) |
Advisor: | Hsin-Min Lu (盧信銘) |
Keywords: | record linkage, entity resolution, data fusion, truth discovery, temporal information, source reliability |
Year of publication: | 2015 |
Degree: | Master's |
Abstract: | The Web 2.0 concept has matured over many years, allowing any Internet user to contribute data to others. With the arrival of the Big Data era, rapidly integrating data with heterogeneous schemas has become considerably more important. A sound integration method makes it possible to extract information or knowledge from such data afterwards, which can help businesses become more competitive.
This study focuses on record linkage and truth discovery. Record linkage treats each piece of data on the Web as a record and determines whether records are associated, that is, whether they represent the same entity. Truth discovery follows record linkage and identifies the correct attribute values of each entity. Together, the two tasks can be viewed as a record clustering problem in which each cluster represents one entity. Earlier work has clustered records not only by the string similarity between them but also by data source reliability or by the temporal information of the records. This study combines these perspectives into a new model and applies it to records retrieved from the Web: blog posts, BBS articles, and several restaurant-information websites. Finally, we compare our model with baseline models on record linkage and truth discovery, and attribute the observed performance differences to the different emphases of the models. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54353 |
Full-text license: | Paid license |
Appears in collections: | Department of Information Management |
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-104-1.pdf (not currently authorized for public access) | 7.51 MB | Adobe PDF |
Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.