Entity resolution based on data source reliability and temporal features (運用紀錄時間點與資料來源信賴度解析實體)

Yi-Jyun You; 尤怡鈞

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54353

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	盧信銘(Hsin-Min Lu)
dc.contributor.author	Yi-Jyun You	en
dc.contributor.author	尤怡鈞	zh_TW
dc.date.accessioned	2021-06-16T02:52:06Z	-
dc.date.available	2015-09-30
dc.date.copyright	2015-09-30
dc.date.issued	2015
dc.date.submitted	2015-07-13
dc.identifier.citation	1. Bellare, K., Iyengar, S., Parameswaran, A. G., & Rastogi, V. (2012, August). Active sampling for entity matching. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1131-1139). ACM.
2. Chiang, Y. H., Doan, A., & Naughton, J. F. (2014, June). Modeling entity evolution for temporal record matching. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 1175-1186). ACM.
3. Cohen, W., Ravikumar, P., & Fienberg, S. (2003, August). A comparison of string metrics for matching names and records. In Kdd workshop on data cleaning and object consolidation (Vol. 3, pp. 73-78).
4. Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (2), 224-227.
5. Dong, X. L., Berti-Equille, L., & Srivastava, D. (2009). Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment, 2(1), 550-561.
6. Dong, X. L., Berti-Equille, L., & Srivastava, D. (2013). Data fusion: resolving conflicts from multiple sources. In Handbook of Data Quality (pp. 293-318). Springer Berlin Heidelberg.
7. Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183-1210.
8. Gantz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future, 2007, 1-16.
9. Getoor, L., & Machanavajjhala, A. (2013, August). Entity resolution for big data. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1527-1527). ACM.
10. Gómez-Bao, J., Larriba-Pey, J. L., & Ribes Puig, J. (2009, November). Record linkage performance for large data sets. In Proceedings of the ACM first international workshop on Privacy and anonymity for very large databases (pp. 9-16). ACM.
11. Guo, L., Sun, H., & Liu, X. (2014, November). Using clustering and transitivity to reduce the costs of crowdsourced entity resolution. In Proceedings of the 1st International Workshop on Crowd-based Software Development Methods and Technologies (pp. 13-18). ACM.
12. Guo, S., Dong, X. L., Srivastava, D., & Zajac, R. (2010). Record linkage with uniqueness constraints and erroneous values. Proceedings of the VLDB Endowment, 3(1-2), 417-428.
13. Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414-420.
14. Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in medicine, 14(5‐7), 491-498.
15. Köpcke, H., Thor, A., & Rahm, E. (2010). Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1-2), 484-493.
16. Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(1‐2), 83-97.
17. Li, F., Lee, M. L., & Hsu, W. (2014, August). Entity profiling with varying source reliabilities. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1146-1155). ACM.
18. Negahban, S. N., Rubinstein, B. I., & Gemmell, J. G. (2012, October). Scaling multiple-source entity resolution using statistically efficient transfer learning. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 2224-2228). ACM.
19. Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.
20. Winkler, W. E. (1999). The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau.
21. Yin, X., Han, J., & Yu, P. S. (2008). Truth discovery with multiple conflicting information providers on the web. Knowledge and Data Engineering, IEEE Transactions on, 20(6), 796-808.
22. Standard address abbreviations. Retrieved from http://www.semaphorecorp.com/cgi/abbrev.html
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54353	-
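dc.description.abstract	The Web 2.0 concept has been developing for many years, allowing any user on the Internet to contribute data to others. With the arrival of the Big Data era, however, rapidly integrating irregular data has become very important; with a sound integration method, subsequently extracting information or knowledge from these data can benefit businesses.　　This study focuses on the issues of "record linkage" and "truth discovery". "Record linkage" means treating each piece of data on the Web as a record and determining whether these records are related, that is, whether they represent the same entity. "Truth discovery" means that, after record linkage, the correct values of those entities' attributes must be found. Carrying out these two tasks can therefore be viewed as clustering the records, with each cluster representing one entity.　　Similar studies exist: records can be clustered not only by the string similarity between records but also by taking into account data source reliability or the temporal information of records. This study combines these perspectives to develop a new model and applies it to records collected from the Web: blog posts, BBS posts, and several restaurant information websites.　　Finally, this study compares its own model with baseline models on "record linkage" and "truth discovery" performance, and discusses how the differing emphases of the models lead to the differing results.	zh_TW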
dc.description.abstract	With the development of Web 2.0, any Internet user can contribute data to others. With the arrival of the Big Data era, integrating data that follow different schemas has become more important than before. An appropriate integration method enables subsequent processing, such as extracting information or knowledge, which can help businesses become more competitive. 　　This study focuses on record linkage and truth discovery. Record linkage treats data items as records and determines whether those records are associated, that is, whether they represent the same entity. After record linkage, the correct attribute values of each entity must be found; this task is called truth discovery. Both tasks can be viewed as a problem of clustering records, where each cluster corresponds to one entity. 　　Earlier clustering methods considered not only the similarity between records but also source reliability or the temporal information of records. This study integrates these aspects into a new model and applies it to datasets retrieved from blogs, BBS articles, and various websites. 　　Finally, we compare our model with baseline models on record linkage and truth discovery performance. From the results, we argue that the models perform differently because they emphasize different aspects.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T02:52:06Z (GMT). No. of bitstreams: 1 ntu-104-R02725008-1.pdf: 7685690 bytes, checksum: 3d93729624975e3ae9d0085c128fb64e (MD5) Previous issue date: 2015	en
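dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements; Chinese Abstract; Abstract; Table of Contents; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Research Background and Motivation; 1.2 Research Objectives; Chapter 2 Literature Review; 2.1 Record Linkage; 2.1.1 Jaro-Winkler similarity; 2.1.2 TF-IDF Jaro-Winkler similarity; 2.1.3 Jaccard similarity; 2.1.4 Levenshtein distance; 2.2 Truth Discovery; 2.2.1 MATCH; 2.2.1.1 Clustering; 2.2.1.2 Matching; 2.3 Literature Using Source or Temporal Features; 2.3.1 COMET; 2.3.1.1 Confidence Based Matching; 2.3.1.2 Adaptive Matching; 2.3.2 TEMPORAL MODEL; 2.3.2.1 Training Phase; 2.3.2.2 Matching Phase; 2.4 Evaluation of Different Classification Methods; 2.5 Methods for Reducing Computational Cost; Chapter 3 Model Description; 3.1 Notation; 3.2 CO-MUTA Model Architecture; 3.2.1 Confidence Based Matching; 3.2.2 Temporal Based Matching; Chapter 4 Data Overview and Processing; 4.1 Datasets; 4.1.1 Chinese Dataset: Restaurants in Taiwan; 4.1.2 English Dataset: Restaurants in Los Angeles; 4.2 Preprocessing; 4.2.1 Data Cleaning; Chapter 5 Experimental Setup and Results; 5.1 Baseline Models and Parameter Settings; 5.1.1 Baseline Models; 5.1.2 Parameter Settings; 5.2 Analysis Results; 5.2.1 Definition of Data Linkage Measures; 5.2.2 Data Linkage Results; 5.2.3 Truth Discovery Measures; 5.2.4 Truth Discovery Results; 5.2.5 Example Results; 5.2.6 Effect on Classification of Capturing Attribute Value Changes; Chapter 6 Conclusions and Future Work; 6.1 Experimental Conclusions; 6.2 Future Research Directions; References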
dc.language.iso	zh-TW
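dc.subject	紀錄鏈結 (record linkage)	zh_TW
dc.subject	來源信賴 (source reliability)	zh_TW
dc.subject	時點資訊 (temporal information)	zh_TW
dc.subject	真實挖掘 (truth discovery)	zh_TW
dc.subject	資料整合 (data fusion)	zh_TW
dc.subject	實體解析 (entity resolution)	zh_TW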
dc.subject	data fusion	en
dc.subject	record linkage	en
dc.subject	entity resolution	en
dc.subject	truth discovery	en
dc.subject	temporal information	en
dc.subject	source reliability	en
dc.title	運用紀錄時間點與資料來源信賴度解析實體	zh_TW
dc.title	Entity resolution based on data source reliability and temporal features	en
dc.type	Thesis
dc.date.schoolyear	103-2
dc.description.degree	Master's (碩士)
dc.contributor.oralexamcommittee	陳建錦(Chien-Chin Chen),施人英(Jen-Ying Shih)
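dc.subject.keyword	紀錄鏈結 (record linkage), 實體解析 (entity resolution), 資料整合 (data fusion), 真實挖掘 (truth discovery), 時點資訊 (temporal information), 來源信賴 (source reliability)	zh_TW
dc.subject.keyword	record linkage, entity resolution, data fusion, truth discovery, temporal information, source reliability	en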
dc.relation.page	54
dc.rights.note	Paid licensing (有償授權)
dc.date.accepted	2015-07-13
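dc.contributor.author-college	College of Management (管理學院)	zh_TW
dc.contributor.author-dept	Graduate Institute of Information Management (資訊管理學研究所)	zh_TW
Appears in Collections:	Department of Information Management (資訊管理學系)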

Files in this item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	7.51 MB	Adobe PDF

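Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.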