Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81348

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 莊裕澤(Yuh-Jzer Joung) | |
| dc.contributor.author | Yu-Hsiu Tai | en |
| dc.contributor.author | 戴余修 | zh_TW |
| dc.date.accessioned | 2022-11-24T03:44:40Z | - |
| dc.date.available | 2021-08-04 | |
| dc.date.available | 2022-11-24T03:44:40Z | - |
| dc.date.copyright | 2021-08-04 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-07-19 | |
| dc.identifier.citation | 1. WIPO:2018年全球專利申請330萬件,年增長5.2%,連續九年成長 (民108年10月17日)。國家實驗研究院:科技政策研究與資訊中心:科技產業資訊室。民109年12月22日,取自:https://iknow.stpi.narl.org.tw/Post/Read.aspx?PostID=16077 2. 林韋伶(民108年10月17日)。蘋果、三星、輝達都想搶親?安謀出售案四大看點一次掌握。今周刊。民109年12月22日,取自:https://www.businesstoday.com.tw/article/category/183015/post/202007230036/ 3. 曾元顯 (2004),"專利文字之知識探勘:技術與挑戰",現代資訊組織與檢索研討會,頁 111-123。 4. 宋皇志 (2017),"人工智能在專利檢索之應用初探",全國律師,第21卷,第10期,頁27-23。 5. Shalaby, W., Zadrozny, W. (2019). Patent retrieval: a literature review. Knowledge and Information Systems, 1-30. 6. Devlin, J., Chang, M. W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 7. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762. 8. Lee, J. S., Hsiang, J. (2020). Patent classification by fine-tuning BERT language model. World Patent Information, 61, 101965. 9. Verberne, S., D'hondt, E. (2009, September). Prior art retrieval using the claims section as a bag of words. In Workshop of the Cross-Language Evaluation Forum for European Languages (pp. 497-501). Springer, Berlin, Heidelberg. 10. Xue, X., Croft, W. B. (2009, November). Automatic query generation for patent search. In Proceedings of the 18th ACM conference on Information and knowledge management (pp. 2037-2040). 11. Fujii, A. (2007, July). Enhancing patent retrieval by citation analysis. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 793-794). 12. Brin, S., Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7), 107-117. 13. Magdy, W., Lopez, P., Jones, G. J. (2011, April). Simple vs. sophisticated approaches for patent prior-art search. In European Conference on Information Retrieval (pp. 725-728). Springer, Berlin, Heidelberg. 14. Tannebaum, W., Rauber, A. (2012, July). Analyzing query logs of USPTO examiners to identify useful query terms in patent documents for query expansion in patent searching: a preliminary study. In Information Retrieval Facility Conference (pp. 127-136). Springer, Berlin, Heidelberg. 15. Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 16. Ganguly, D., Roy, D., Mitra, M., Jones, G. J. (2015, August). Word embedding based generalized language model for information retrieval. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 795-798). 17. Wei, X., Croft, W. B. (2006, August). LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 178-185). 18. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365. 19. Helmers, L., Horn, F., Biegler, F., Oppermann, T., Müller, K. R. (2019). Automating the search for a patent's prior art with a full text similarity search. PloS one, 14(3), e0212103. 20. Nogueira, R., Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085. 21. Yang, W., Zhang, H., Lin, J. (2019). Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972. 22. Sun, C., Qiu, X., Xu, Y., Huang, X. (2019, October). How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics (pp. 194-206). Springer, Cham. 23. Li, S., Hu, J., Cui, Y., Hu, J. (2018). DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics, 117(2), 721-744. 24. Padigela, H., Zamani, H., Croft, W. B. (2019). Investigating the successes and failures of BERT for passage re-ranking. arXiv preprint arXiv:1905.01758. 25. Deshmukh, A. A., Sethi, U. (2020). IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles. arXiv preprint arXiv:2007.12603. 26. Reimers, N., Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084. 27. Pennington, J., Socher, R., Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). 28. Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., Gatford, M. (1995). Okapi at TREC-3. Nist Special Publication Sp, 109, 109. 29. Roda, G., Tait, J., Piroi, F., Zenz, V. (2009, September). CLEF-IP 2009: retrieval experiments in the Intellectual Property domain. In Workshop of the Cross-Language Evaluation Forum for European Languages (pp. 385-409). Springer, Berlin, Heidelberg. 30. Risch, J., Alder, N., Hewel, C., Krestel, R. (2020). PatentMatch: A Dataset for Matching Patent Claims & Prior Art. arXiv preprint arXiv:2012.13919. 31. Henderson, M., Al-Rfou, R., Strope, B., Sung, Y. H., Lukács, L., Guo, R., ... Kurzweil, R. (2017). Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652. 32. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. 33. Sennrich, R., Haddow, B., Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. 34. Bowman, S. R., Angeli, G., Potts, C., Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326. 35. Williams, A., Nangia, N., Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. 36. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L. (2017). Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055. 37. Magdy, W., Jones, G. J. (2010, July). PRES: a score metric for evaluating recall-oriented information retrieval applications. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (pp. 611-618). 38. Le, Q., Mikolov, T. (2014, June). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188-1196). PMLR. 39. Magdy, W., Jones, G. J. (2010, September). Examining the robustness of evaluation metrics for patent retrieval with incomplete relevance assessments. In International conference of the cross-language evaluation forum for European languages (pp. 82-93). Springer, Berlin, Heidelberg. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81348 | - |
| dc.description.abstract | Patent retrieval is the most important means of obtaining patent information. To avoid legal disputes, companies routinely search patents to check whether the technology disclosed in a document resembles previously granted patents, and patent applicants and examiners likewise search to assess whether a pending application is novel. Patent retrieval differs from general-purpose retrieval: patent documents are written in dense technical terminology that lay readers find hard to understand, so patent searchers typically need substantial domain expertise and years of experience. Moreover, the number of patent documents grows sharply every year, and finding target patents in such a massive, ever-expanding database is a challenging task. This study proposes an effective patent retrieval method to address these pain points. Prior research on patent retrieval has focused on keyword-based methods, which are prone to vocabulary mismatch: because patent documents use obscure terminology, it is difficult to craft precise keywords that cover all relevant documents. Retrieval based on semantic understanding can mitigate this vocabulary-mismatch effect. BERT, a model that has attracted wide attention in natural language processing in recent years, achieved state-of-the-art (SOTA) results on many NLP tasks upon its release, and this study argues that its strong semantic-understanding capability has considerable potential in patent retrieval. To work around BERT's input-length limit, this study splits each patent document into paragraph-level units, computes a vector representation for each paragraph with a pre-trained Sentence-BERT model, mean-pools the paragraph vectors into a single vector representing the document, and performs semantic retrieval by comparing similarity between these document vectors. In addition, this study combines BM25 with Sentence-BERT in a two-stage retrieval pipeline, aiming to exploit the complementary strengths of lexical matching and semantic matching for better retrieval performance. Experimental results show that Sentence-BERT produces high-quality patent document representations, far outperforming other retrieval methods based on semantic similarity, and that fine-tuning on patent-domain text improves its retrieval performance further. The two-stage BM25 + Sentence-BERT retrieval outperforms the other methods on both recall and PRES, matching the recall-oriented needs of patent retrieval: it helps searchers obtain more query-relevant patents within a limited result set and reduces the risk of missing important patents. | zh_TW |
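The pipeline the abstract describes (split each document into paragraphs, embed each paragraph, mean-pool into one document vector, then run BM25 as a first stage and semantic similarity as a re-ranking stage) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the `embed` function below is a bag-of-words stand-in for the actual pre-trained Sentence-BERT encoder, and the corpus, query, and BM25 constants are invented for demonstration.

```python
import math
from collections import Counter

def paragraphs(doc):
    # Split a document into paragraph-level units (the thesis does this
    # to fit within BERT's input-length limit).
    return [p.strip() for p in doc.split("\n") if p.strip()]

def embed(paragraph, vocab):
    # Stand-in for a Sentence-BERT paragraph embedding: a plain
    # bag-of-words count vector over a fixed vocabulary.
    counts = Counter(paragraph.lower().split())
    return [counts[w] for w in vocab]

def mean_pool(vectors):
    # Aggregate paragraph vectors into one document vector by mean-pooling.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def bm25_scores(query, docs, k1=1.2, b=0.75):
    # Minimal Okapi BM25 (stage 1): lexical matching over whole documents.
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    n = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            df = sum(1 for u in toks if w in u)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def two_stage_search(query, corpus, top_k=2):
    # Stage 1: BM25 selects top_k lexical candidates.
    # Stage 2: candidates are re-ranked by cosine similarity between
    # mean-pooled paragraph embeddings (semantic matching).
    vocab = sorted({w for d in corpus for w in d.lower().split()}
                   | set(query.lower().split()))
    s1 = bm25_scores(query, corpus)
    candidates = sorted(range(len(corpus)), key=lambda i: s1[i], reverse=True)[:top_k]
    qvec = mean_pool([embed(p, vocab) for p in paragraphs(query)])
    return sorted(
        candidates,
        key=lambda i: cosine(qvec, mean_pool([embed(p, vocab)
                                              for p in paragraphs(corpus[i])])),
        reverse=True,
    )
```

In the thesis, stage 2 would use Sentence-BERT vectors (optionally fine-tuned on patent text) in place of `embed`, with the same mean-pooling and similarity comparison over the BM25 candidate set.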
| dc.description.provenance | Made available in DSpace on 2022-11-24T03:44:40Z (GMT). No. of bitstreams: 1 U0001-1507202122424200.pdf: 1674479 bytes, checksum: a2258091a1fc2741b32ab9ffd9776cc0 (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | 摘要 (Chinese Abstract) i Abstract iii Table of Contents v List of Figures vi List of Tables vi Chapter 1: Introduction 1 1.1 Background and Motivation 1 1.2 Research Questions 3 Chapter 2: Literature Review 5 2.1 Introduction 5 2.2 Patent Retrieval Techniques 6 2.3 The BERT Model 10 2.4 Summary of the Literature Review 12 Chapter 3: Methodology 14 3.1 Method Overview 14 3.2 Vector Representations 15 3.3 Sentence-BERT 16 3.4 BM25 18 Chapter 4: Experimental Details and Results 20 4.1 Datasets 20 4.1.1 Test Dataset 20 4.1.2 Training Dataset 21 4.2 Data Preprocessing 23 4.3 Experimental Setup 25 4.4 Evaluation Metrics 28 4.5 Experimental Results 30 Chapter 5: Conclusion 36 5.1 Research Results 36 5.2 Contributions 37 5.3 Limitations 37 5.4 Future Work 38 References 39 | |
| dc.language.iso | zh-TW | |
| dc.subject | 專利檢索 | zh_TW |
| dc.subject | 自然語言處理 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | BERT預訓練模型 | zh_TW |
| dc.subject | 文字探勘 | zh_TW |
| dc.subject | Patent Retrieval | en |
| dc.subject | Text Mining | en |
| dc.subject | BERT Pre-trained Model | en |
| dc.subject | Deep Learning | en |
| dc.subject | Natural Language Processing | en |
| dc.title | 基於BERT預訓練模型的專利檢索方法 | zh_TW |
| dc.title | The Novel Patent Retrieval Method Based On BERT Pre-trained Model | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | Master | |
| dc.contributor.oralexamcommittee | 陳建錦,盧信銘,王新民 | |
| dc.subject.keyword | 專利檢索,自然語言處理,深度學習,BERT預訓練模型,文字探勘 | zh_TW |
| dc.subject.keyword | Patent Retrieval,Natural Language Processing,Deep Learning,BERT Pre-trained Model,Text Mining | en |
| dc.relation.page | 42 | |
| dc.identifier.doi | 10.6342/NTU202101499 | |
| dc.rights.note | Authorized (restricted to on-campus access) | |
| dc.date.accepted | 2021-07-19 | |
| dc.contributor.author-college | College of Management | zh_TW |
| dc.contributor.author-dept | Graduate Institute of Information Management | zh_TW |
Appears in Collections: Department of Information Management
Files in This Item:
| File | Size | Format |
|---|---|---|
| U0001-1507202122424200.pdf (restricted to NTU campus IPs; use the VPN service for off-campus access) | 1.64 MB | Adobe PDF |
Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.
