Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21409
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 鄭卜壬 (Pu-Jen Cheng)
dc.contributor.author | Ying-Hui Wu | en
dc.contributor.author | 吳盈慧 | zh_TW
dc.date.accessioned | 2021-06-08T03:33:17Z
dc.date.copyright | 2019-08-19
dc.date.issued | 2019
dc.date.submitted | 2019-08-07
dc.identifier.citation | [1] Claire Wardle. Fake news. It’s complicated. First Draft News, 16, 2017.
[2] Soroush Vosoughi, Deb Roy, and Sinan Aral. The spread of true and false news online. Science, 359(6380):1146–1151, 2018.
[3] Hunt Allcott and Matthew Gentzkow. Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31(2):211–36, 2017.
[4] Marco T Bastos and Dan Mercea. The Brexit botnet and user-generated hyperpartisan news. Social Science Computer Review, 37(1):38–54, 2019.
[5] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22–36, 2017.
[6] Claire Wardle and Hossein Derakhshan. Information disorder: Toward an interdisciplinary framework for research and policy making. Council of Europe Report, 27, 2017.
[7] Sebastião Miranda, David Nogueira, Afonso Mendes, Andreas Vlachos, Andrew Secker, Rebecca Garrett, Jeff Mitchell, and Zita Marinho. Automated fact checking in the news room. In The World Wide Web Conference, pages 3579–3583. ACM, 2019.
[8] Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. UCL Machine Reading Group: Four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 97–102, 2018.
[9] Amit Singhal et al. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001.
[10] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
[11] Yuanhua Lv and ChengXiang Zhai. Lower-bounding term frequency normalization. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 7–16. ACM, 2011.
[12] Aldo Lipani, Mihai Lupu, Allan Hanbury, and Akiko Aizawa. Verboseness fission for BM25 document length normalization. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, pages 385–388. ACM, 2015.
[13] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.
[14] Stephen E Robertson, Steve Walker, Micheline Beaulieu, and Peter Willett. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. NIST Special Publication SP, (500):253–264, 1999.
[15] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.
[16] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[18] Guohong Fu and Kang-Kwong Luke. Chinese named entity recognition using lexicalized HMMs. ACM SIGKDD Explorations Newsletter, 7(1):19–25, 2005.
[19] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 188–191. Association for Computational Linguistics, 2003.
[20] György Szarvas, Richárd Farkas, and András Kocsor. A multilingual named entity recognition system using boosting and C4.5 decision tree learning algorithms. In International Conference on Discovery Science, pages 267–278. Springer, 2006.
[21] Oliver Bender, Franz Josef Och, and Hermann Ney. Maximum entropy models for named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 148–151. Association for Computational Linguistics, 2003.
[22] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21409
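dc.description.abstract | With the rapid growth of the Internet and social media, the way people receive information has changed fundamentally. People can spread and read messages anytime and anywhere through all kinds of online platforms, and as a result fake news runs rampant and spreads extremely fast. The harm fake news does to society goes beyond incorrect health information and rumors that affect personal lives: it also obstructs normal dialogue on public issues and can develop into a major national security risk. This study explores how, given the characteristics of fact-checking, to design a high-precision unsupervised information retrieval system, with the aim of extending it to practical use and helping to reduce the flood of fake news spread through the Internet and messaging platforms.
The dataset merges the rumor databases and fact-check reports of two mainstream Taiwanese fact-checking organizations, "Cofacts 真的假的" and the "Taiwan Fact Check Center" (台灣事實查核中心). A word embedding model is trained on a large corpus of Taiwanese news and serves as the basis for query expansion. Named entity recognition, a natural language processing technique, is then used to automatically label the entity types of on the order of a million Chinese Wikipedia titles; the result both improves the accuracy of Chinese word segmentation and supports the weighting of query keywords. Finally, with the classic information retrieval model Okapi BM25 as the base, a hybrid fact-checking retrieval model is built on word-embedding query expansion and named-entity weighting. Multiple experiments and parameter tuning show that the hybrid model's overall performance exceeds the baseline, which indicates that the design is feasible to a degree. Findings and directions for improvement are presented based on the experimental results.
zh_TW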
dc.description.abstract | With the rise of the Internet and social media, the way people access information has changed completely. People now send and receive messages through online platforms anytime and anywhere. This convenience, however, also causes severe problems with fake news and the rapid spread of misinformation, both of which harm society. While inaccurate health care tips and rumors disrupt personal lives, misinterpreted assertions and fabricated claims obstruct communication about public issues and lead to national security risks. To reduce the spread of fake news on the Internet and telecommunication platforms, this study examines the characteristics of fact-checking and designs a highly accurate unsupervised information retrieval system that can be applied in practice.
The dataset draws on two major fact-checking organizations in Taiwan, "Cofacts" and the "Taiwan Fact Check Center". The study uses a word embedding model to perform query expansion. Chinese text segmentation is improved and keyword weights are tuned by applying named entity recognition to Wikipedia titles. The final fact-checking retrieval model is built on Okapi BM25, word embedding query expansion, and named entity keyword weighting. After experiments and parameter optimization, the results show that the hybrid model performs better than the baseline and that the design is practical for real cases.
en
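As a rough illustration of the idea in the abstract, the sketch below scores tokenized documents with Okapi BM25 while letting each query term carry a multiplier. It is not the thesis's actual formula: the function name `bm25_scores`, the weight values (a boost for a named entity, a discount for an expanded term), the toy English tokens, and the `log(1 + ...)` IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query_weights, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a weighted query with Okapi BM25.

    query_weights maps a term to a multiplier (hypothetical scheme):
    1.0 for an original term, >1 for a named entity, <1 for an expanded term.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term, weight in query_weights.items():
            if term not in tf:
                continue
            # Non-negative IDF variant (assumed; other variants exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += weight * idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus; real input would be segmented Chinese text.
docs = [
    ["vaccine", "causes", "autism", "rumor"],
    ["vaccine", "safety", "report"],
    ["election", "fraud", "claim"],
]
# "autism" is boosted as if it were a named entity; "immunization" stands in
# for a term added by query expansion and is discounted.
query = {"vaccine": 1.0, "autism": 2.0, "immunization": 0.5}
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `ranking` lists document indices from best to worst match. Keeping the multipliers in the query rather than in the index means the same BM25 statistics can be reused while the expansion and entity weights are tuned separately.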
dc.description.provenance | Made available in DSpace on 2021-06-08T03:33:17Z (GMT). No. of bitstreams: 1
ntu-108-P04922006-1.pdf: 10853745 bytes, checksum: 12a735e20dd500f39587a7b5f1f7d2f4 (MD5)
Previous issue date: 2019
en
dc.description.tableofcontents | Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
Chapter 2 Literature Review
2.1 Definition of Fake News
2.2 Related Work on Building Fact-Checking System Models
2.3 The Classic Information Retrieval Ranking Algorithm Okapi BM25 and Its Variants
2.3.1 Okapi BM25 (BM25)
2.3.2 BM25+ (BM25 Plus)
2.3.3 BM25 Verboseness Fission (BM25 VF)
2.4 Word Embedding / Word Vector
2.5 Named Entity Recognition
Chapter 3 Methodology
3.0 Overview and System Workflow
3.1 Dataset Collection and Analysis
3.1.1 Cofacts (真的假的)
3.1.2 Taiwan Fact Check Center (台灣事實查核中心)
3.2 Data Cleaning
3.2.1 Removing Duplicate Rumor Messages
3.2.2 Removing Overly Short or Meaningless Rumor Replies
3.2.3 Preliminary Dataset Analysis
3.3 Named Entity Recognition Implementation
3.3.1 Chinese Wikipedia Titles
3.3.2 Stanford CoreNLP (Stanford University's Natural Language Processing Toolkit)
3.3.3 Implementation Workflow
3.4 Data Pre-processing
3.4.1 Text Segmentation
3.4.2 Stopword Removal
3.5 Splitting Train/Test Data
3.6 Training the Word2Vec Model
3.7 Word Embedding Query Expansion
3.8 Rule-based Named Entity Weighting
3.9 Hybrid Information Retrieval Model Combining Query Expansion and Named Entity Weighting
3.10 User Review
3.11 Evaluation Metrics
3.11.1 P@K
3.11.2 MAP@K
Chapter 4 Experimental Design and Results
4.1 BM25 Model Selection
4.2 Query Expansion Model (QEM) Parameter Tuning
4.3 Named Entity Weighting Model (NEM) Parameter Tuning
4.4 Performance Evaluation of the Linear Mixture Retrieval Model (MM)
4.4.1 Retrieval Performance of the MM Model Compared with Other Models
4.4.2 MM Model Performance at Different Values of K
4.4.3 MM Model Performance on Long-Text Queries
4.5 Online System Implementation
4.5.1 Main System Screen
4.5.2 Viewing TOP20 MM Retrieval Results on the Query Dataset
4.5.3 Viewing Online Retrieval Results of Each Model
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
dc.language.iso | zh-TW
dc.title | 非監督式事實查核檢索模型 | zh_TW
dc.title | Unsupervised Fact-checking Retrieval Model: A Real Case Study | en
dc.type | Thesis
dc.date.schoolyear | 107-2
dc.description.degree | Master's
dc.contributor.oralexamcommittee | 魏志達 (Jyh-Da Wei), 邱志義 (Chih-Yi Chiu)
dc.subject.keyword | 假新聞, 事實查核, 非監督式, 資訊檢索系統, 詞向量, 擴展查詢, 命名實體識別, 維基百科條目名, Okapi BM25 | zh_TW
dc.subject.keyword | Fake News, Fact-Checking, Unsupervised, Information Retrieval System, Word Embedding, Query Expansion, Named Entity Recognition, Wikipedia Titles, Okapi BM25 | en
dc.relation.page | 51
dc.identifier.doi | 10.6342/NTU201902687
dc.rights.note | Not authorized
dc.date.accepted | 2019-08-07
dc.contributor.author-college | College of Electrical Engineering and Computer Science (電機資訊學院) | zh_TW
dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所) | zh_TW
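Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
File | Size | Format
ntu-108-1.pdf (not authorized for public access) | 10.6 MB | Adobe PDF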