Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/59483
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 蔡政安(Chen-An Tsai) | |
dc.contributor.author | Yu-Ting Chen | en |
dc.contributor.author | 陳育婷 | zh_TW |
dc.date.accessioned | 2021-06-16T09:25:12Z | - |
dc.date.available | 2021-08-16 | |
dc.date.copyright | 2020-08-24 | |
dc.date.issued | 2020 | |
dc.date.submitted | 2020-08-18 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/59483 | - |
dc.description.abstract | 模式探勘與編輯距離很少被用在現有的自動摘要技術。傳統的詞頻式模型難以考慮更深入的語意資訊;當紅的深度學習雖然可以解決前者問題,但難以解釋及修改。此外,詞頻模型和深度學習模型的共通點是都會將句子轉換成向量後再做運算;但是,我們並不會在腦中自動將文字轉成一系列的數字,而是以文字本身出發去做思考。 基於上述問題,本論文提出一個自動摘要模型–closed-Pattern-Infused Edit Similarity Model (PIESim)。它是一個基於模式探勘與編輯距離比對、以字串而非向量為基礎,從而補足詞頻式及深度學習模型缺點的模型。相對於前者,它可以考慮上下文與順序資訊;相對於後者,它具備直觀的解釋及修改能力。除此之外,我們是第一個提出結合模式頻率之改良編輯距離 (pattern-infused edit distance)的摘要模型。PIESim在資料集上達到比多數摘要方法及用單純編輯距離、單純模式頻率總和、考慮詞彙頻率的改良編輯距離、詞彙向量、及嵌入式向量等更好的效果。此外,在PIESim的架構下,我們可以在不改變方法的前提從任何來源加入重要訊息;實驗中,我們選擇加入訓練集文章資訊和使用者輸入以豐富文章的領域知識,並藉此提出一個全新標準–記憶相似度。 PIESim的非向量表示及可考量語義資訊的特性,均符合人類處理及理解文件的過程;也因此,本模型在中文及英文新聞資料集、長摘要及短摘要上皆取得極為優越的成果。我們也以數個案例及互動式軟體說明PIESim在解釋、修改及與使用者互動上的直觀優勢。未來的自動摘要研究可在此基礎上做更多延伸及應用。 | zh_TW |
dc.description.abstract | Pattern mining and edit distance have rarely been used in existing text summarization techniques. Conventional term-based approaches struggle to capture deeper semantic information. Deep learning methods have advanced semantic understanding, but they are difficult to explain and revise in the context of articles. In addition, both types of approaches transform sentences into vectors; humans, however, do not convert text into a series of numbers, but reason over text in its original form. In this thesis, we propose a novel model, the closed-Pattern-Infused Edit Similarity Model (PIESim). It applies pattern mining and edit distance and is entirely string-based rather than vector-based, compensating for the limitations of both term-based and deep learning-based methods: unlike the former, it captures contextual and order information; unlike the latter, it offers intuitive explanation and revision. Moreover, it is the first summarization model to incorporate a pattern-infused edit distance mechanism. On the experimental datasets, PIESim outperforms most existing systems as well as variants that use pure edit distance, the sum of patterns' supports, a term-infused edit distance, or term-based and embedding-based representations. Furthermore, PIESim's structure allows new content to be incorporated from any source without changing the method; we add training-set articles and user queries to enrich domain knowledge, and on this basis propose a novel criterion, memory similarity. PIESim is a non-vector-based system that still accommodates semantic information, conforming to how humans process and understand text. Experiments show that it achieves superior performance on both Chinese and English news datasets, for both long and short summaries, and that it is intuitive in explanation, revision, and interaction. Future research can build on this foundation. | en |
dc.description.provenance | Made available in DSpace on 2021-06-16T09:25:12Z (GMT). No. of bitstreams: 1 U0001-1408202008142600.pdf: 2814777 bytes, checksum: 87b4697d2c43ea3ed71cb696b94dc144 (MD5) Previous issue date: 2020 | en |
dc.description.tableofcontents | 誌謝 i Table of Contents ii List of Figures v List of Tables vi 摘要 vii Abstract viii Chapter 1 Introduction 1 1.1 Motivation and Aims 1 1.2 Main Contribution 3 1.3 Thesis Structure 4 Chapter 2 Literature Reviews 5 2.1 General Summarization Methods 5 2.2 Pattern based Summarization Methods 7 2.3 Edit Distance based Summarization Methods 8 Chapter 3 Background Study 10 3.1 Categorization of ATS Systems 10 3.2 Sequential Pattern Mining 10 3.2.1 Closed Sequential Pattern 12 3.3 Edit Distance 12 3.3.1 Jaro Similarity (Jaro Edit Distance) 13 3.4 Paragraph Embedding 14 3.4.1 Distributed Memory Model (DM) 14 3.4.2 Distributed Bag of Words Model (DBOW) 15 3.5 Sentence Compression Tools 16 3.5.1 Language Technology Platform (LTP) 16 3.5.2 Dependency Parsing 16 3.6 Rouge-n 16 Chapter 4 The Proposed PIESim Model 17 4.1 Domain Memory Retrieval 17 4.2 Sentence Representation using Closed Sequential Pattern (SRCSP) 18 4.3 Pattern-Infused Edit Similarity (PIESim) 22 4.4 Sentence Scoring and Selection 24 4.5 Sentence Compression 26 Chapter 5 Experiments and Results 29 5.1 UDN 2017-2018 Chinese News 29 5.1.1 Dataset and Setup 29 5.1.2 Compared Methods 30 5.1.3 Result 31 5.2 DUC 2004 English News 33 5.2.1 Dataset and Setup 33 5.2.2 Compared Methods 34 5.2.3 Result 35 Chapter 6 Different Settings Analysis 36 6.1 Pattern Usage in Sentence Representation 36 6.2 Pattern-based Edit Distance Weights 37 6.3 Variants of Sentence Representation and Similarity Computation 39 Chapter 7 Error Analysis and User Interaction 41 7.1 Effectiveness of Different Criterion 41 7.1.1 Coverage Effectiveness 42 7.1.2 Memory Similarity Effectiveness 43 7.2 Revision 44 7.2.1 Coverage Revision using Patterns 45 7.2.2 Memory Similarity Revision using Memories 47 7.3 User Queries and Interaction System 49 Chapter 8 Conclusion 54 References 55 | |
dc.language.iso | en | |
dc.title | 閉合模式融合於編輯相似度之自動文本摘要 | zh_TW |
dc.title | Automatic Text Summarization using closed-Pattern-Infused Edit Similarity (PIESim) | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-2 | |
dc.description.degree | 碩士 | |
dc.contributor.author-orcid | 0000-0001-5968-1419 | |
dc.contributor.advisor-orcid | 蔡政安(0000-0002-7490-4331) | |
dc.contributor.coadvisor | 許聞廉(Wen-Lian Hsu) | |
dc.contributor.coadvisor-orcid | 許聞廉(0000-0001-7061-3513) | |
dc.contributor.oralexamcommittee | 張詠淳(Yung-Chun Chang) | |
dc.contributor.oralexamcommittee-orcid | 張詠淳(0000-0002-9634-8380) | |
dc.subject.keyword | 自動摘要,模式探勘,序列模式探勘,編輯距離,知識發現,互動式模型,可解釋模型, | zh_TW |
dc.subject.keyword | automatic text summarization,pattern mining,sequential pattern mining,edit distance,knowledge discovery,interactive model,explainable model, | en |
dc.relation.page | 60 | |
dc.identifier.doi | 10.6342/NTU202003370 | |
dc.rights.note | 有償授權 | |
dc.date.accepted | 2020-08-18 | |
dc.contributor.author-college | 共同教育中心 | zh_TW |
dc.contributor.author-dept | 統計碩士學位學程 | zh_TW |
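The abstract's "closed sequential pattern" representation (Section 4.2 in the table of contents) can be illustrated with a minimal sketch. This is not the thesis's implementation: it is a brute-force miner for short, tokenized sentences, assuming a simple support threshold; a real system would use an efficient algorithm such as PrefixSpan.

```python
from itertools import combinations

def subsequences(seq, max_len=3):
    """Yield every (ordered, possibly non-contiguous) subsequence up to max_len tokens."""
    for n in range(1, max_len + 1):
        for idxs in combinations(range(len(seq)), n):
            yield tuple(seq[i] for i in idxs)

def is_subseq(a, b):
    """True if sequence a appears in b in order (not necessarily contiguously)."""
    it = iter(b)
    return all(tok in it for tok in a)

def closed_sequential_patterns(sentences, minsup=2, max_len=3):
    """Frequent sequential patterns with no proper super-sequence of equal support."""
    support = {}
    for sent in sentences:
        # count each pattern at most once per sentence
        for pat in set(subsequences(sent, max_len)):
            support[pat] = support.get(pat, 0) + 1
    frequent = {p: s for p, s in support.items() if s >= minsup}
    return {p: s for p, s in frequent.items()
            if not any(q != p and s == frequent[q] and is_subseq(p, q)
                       for q in frequent)}
```

For example, over the sentences "the cat sat", "the dog sat", "the cat ran", the pattern ("the", "cat") is closed with support 2, while ("cat",) alone is not closed, because its super-sequence ("the", "cat") has the same support.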
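The "pattern-infused edit distance" named in the abstract (Section 4.3) can likewise be sketched. The thesis defines its own formulation; the version below is an illustrative assumption only: a weighted Levenshtein distance in which editing a character covered by a frequent pattern costs extra in proportion to the pattern's support, normalized into a similarity in [0, 1].

```python
def char_weights(s, patterns):
    """Per-position edit costs for s: base cost 1, plus the support of each
    (assumed) pattern whose characters cover that position."""
    w = [1.0] * len(s)
    for pat, sup in patterns.items():
        for i, ch in enumerate(s):
            if ch in pat:
                w[i] += sup
    return w

def pattern_infused_edit_similarity(s, t, patterns):
    """Weighted edit distance between s and t, converted to a similarity.

    Deleting/inserting a character costs its weight; substituting costs the
    larger of the two weights, so pattern-covered content is costly to alter.
    """
    ws, wt = char_weights(s, patterns), char_weights(t, patterns)
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + ws[i - 1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + wt[j - 1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else max(ws[i - 1], wt[j - 1])
            d[i][j] = min(d[i - 1][j] + ws[i - 1],      # delete from s
                          d[i][j - 1] + wt[j - 1],      # insert from t
                          d[i - 1][j - 1] + sub)        # substitute / match
    total = sum(ws) + sum(wt)  # worst case: delete all of s, insert all of t
    return 1.0 - d[m][n] / total if total else 1.0
```

Note how pattern weights change the ranking: two strings that differ only in a pattern-covered character are judged less similar than two that differ in an unweighted one, which is the intuition behind infusing pattern support into the edit distance.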
Appears in Collections: | 統計碩士學位學程
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-1408202008142600.pdf (Restricted Access) | 2.75 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.