Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86202
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 陳信希 | zh_TW |
dc.contributor.advisor | Hsin-Hsi Chen | en |
dc.contributor.author | 劉孟寰 | zh_TW |
dc.contributor.author | Meng-Huan Liu | en |
dc.date.accessioned | 2023-03-19T23:42:01Z | - |
dc.date.available | 2023-12-29 | - |
dc.date.copyright | 2022-09-05 | - |
dc.date.issued | 2022 | - |
dc.date.submitted | 2002-01-01 | - |
dc.identifier.citation | Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621. Association for Computational Linguistics, 2018
Arman Cohan and Nazli Goharian. Scientific article summarization using citation-context and article's discourse structure. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 390–400, 2015
James Hartley and Matthew R. Sydes. Are structured abstracts easier to read than traditional ones? Journal of Research in Reading, 20:122–136, 1997
Jennifer D'Souza, Sören Auer, and Ted Pedersen. SemEval-2021 task 11: NLPContributionGraph - structuring scholarly NLP contributions for a research knowledge graph. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 2021
Hiroaki Hayashi, Wojciech Kryscinski, Bryan McCann, Nazneen Rajani, and Caiming Xiong. What's new? Summarizing contributions in scientific literature. arXiv preprint arXiv:2011.03161, 2020
H. Chen, H. Nguyen, and A. Alghamdi. Constructing a high-quality dataset for automated creation of summaries of fundamental contributions of research articles. Scientometrics, 2022
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR, 2020
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020
Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. The ACL Anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2ORC: The Semantic Scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, 2020
Rui Meng, Khushboo Thaker, Lei Zhang, Yue Dong, Xingdi Yuan, Tong Wang, and Daqing He. Bringing structure into summaries: A faceted summarization dataset for long scientific documents. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 1080–1089, 2021
Xinyu Hua, Mitko Nikolov, Nikhil Badugu, and Lu Wang. Argument mining for understanding peer reviews. arXiv preprint arXiv:1903.10104, 2019
Liying Cheng, Lidong Bing, Qian Yu, Wei Lu, and Luo Si. Argument pair extraction from peer review and rebuttal via multi-task learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7000–7011, 2020
Liying Cheng, Tianyu Wu, Lidong Bing, and Luo Si. Argument pair extraction via attention-guided multi-layer multi-cross encoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6341–6353, 2021
Neha Nayak Kennard, Tim O'Gorman, Akshay Sharma, Chhandak Bagchi, Matthew Clinton, Pranay Kumar Yelugam, Rajarshi Das, Hamed Zamani, and Andrew McCallum. A dataset for discourse structure in peer review discussions. arXiv preprint arXiv:2110.08520, 2021
Weizhe Yuan, Pengfei Liu, and Graham Neubig. Can we automate scientific reviewing? arXiv preprint arXiv:2102.00176, 2021
Chenhui Shen, Liying Cheng, Ran Zhou, Lidong Bing, Yang You, and Luo Si. MReD: A meta-review dataset for structure-controllable text generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2521–2535. Association for Computational Linguistics, May 2022
Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020
Zhihong Shen, Hao Ma, and Kuansan Wang. A web-scale system for scientific knowledge exploration. In Proceedings of ACL 2018, System Demonstrations, pages 87–92, 2018
Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 84–91, 2018
Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, and Caiming Xiong. CTRLsum: Towards generic controllable text summarization. arXiv preprint arXiv:2012.04281, 2020
Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S. Weld. TLDR: Extreme summarization of scientific documents. arXiv preprint arXiv:2004.15011, 2020
Ed Collins, Isabelle Augenstein, and Sebastian Riedel. A supervised approach to extractive summarisation of scientific papers. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 195–205. Association for Computational Linguistics, August 2017
Alexios Gidiotis and Grigorios Tsoumakas. Structured summarization of academic publications. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 636–645. Springer, 2019
Shuaiqi Liu, Jiannong Cao, Ruosong Yang, and Zhiyuan Wen. Generating a structured summary of numerous academic papers: Dataset and method. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 4259–4265. International Joint Conferences on Artificial Intelligence Organization, July 2022
Yuning Mao, Liyuan Liu, Qi Zhu, Xiang Ren, and Jiawei Han. Facet-aware evaluation for extractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4941–4957. Association for Computational Linguistics, July 2020
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. In NeurIPS, 2020
Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. PRIMER: Pyramid-based masked sentence pre-training for multi-document summarization. arXiv preprint arXiv:2110.08499, 2021
Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. GSum: A general framework for guided neural abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4830–4842. Association for Computational Linguistics, June 2021
Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simões, Vitaly Nikolaev, and Ryan McDonald. Planning with learned entity prompts for abstractive summarization. Transactions of the Association for Computational Linguistics, 9:1475–1492, 2021
Yuning Mao, Wenchang Ma, Deren Lei, Jiawei Han, and Xiang Ren. Extract, denoise and enforce: Evaluating and improving concept preservation for text-to-text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5063–5074. Association for Computational Linguistics, 2021
Potsawee Manakul and Mark Gales. Long-span summarization via local attention and content selection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6026–6041. Association for Computational Linguistics, 2021
Reinald Kim Amplayo, Stefanos Angelidis, and Mirella Lapata. Aspect-controllable opinion summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6578–6593. Association for Computational Linguistics, 2021
T. Saier and M. Färber. unarXive: A large scholarly dataset with publications' full-text, annotated in-text citations, and links to metadata. Scientometrics, 2020
Eva Sharma, Chen Li, and Lu Wang. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741, 2019
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, 2004
Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620. Association for Computational Linguistics, November 2019
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew Peters, Arie Cattan, and Ido Dagan. CDLM: Cross-document language modeling. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2648–2662. Association for Computational Linguistics, November 2021
Alexios Gidiotis and Grigorios Tsoumakas. A divide-and-conquer approach to the summarization of academic articles. arXiv preprint arXiv:2004.06190, 2020
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence, 2017
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360. Association for Computational Linguistics, July 2020
Alexander Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Radev, and Yashar Mehdad. Improving zero and few-shot abstractive summarization with intermediate fine-tuning and data augmentation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 704–717. Association for Computational Linguistics, June 2021
Amir Soleimani, Vassilina Nikoulina, Benoit Favre, and Salah Ait Mokhtar. Zero-shot aspect-based scientific document summarization using self-supervised pre-training. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 49–62. Association for Computational Linguistics, May 2022 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86202 | - |
dc.description.abstract | 科學論文的貢獻側重於描述其原創之處和重要價值,對於每個科學研究來說這都可以被認為是其最核心的部分。一個能精確辨認論文貢獻並將其組織為結構化摘要的系統對於輔助自動化處理科學文本和幫助讀者理解等應用具有潛在價值。雖然近期的工作開始致力於與論文貢獻相關的任務的研究中,目前仍缺少高品質的大規模資料集來輔助深度學習模型的訓練。有鑑於此,我們收集並整理了一個資料集,其中包含大約兩萬四千篇計算機科學領域的論文及其作者條列之貢獻,根據我們提出的標記框架,這些科學貢獻又被進一步分為對應的不同類別。接著我們正式定義了生成科學論文之條列式貢獻這個任務。利用大量的無監督資料和原始論文中重要語句以及生成目標所包含的貢獻類別,我們提出了一個細粒度的訓練策略。實驗結果表明我們提出的方法優於具競爭力的基線模型和其他訓練策略,證明了其有效性。 我們也進行了詳細分析以研究我們所提出的資料集和任務的特性及其挑戰之處。 | zh_TW |
dc.description.abstract | Contributions of scientific papers highlight their novelty and key values, and are essentially the core parts of every research work. Systems capable of precisely identifying the contributions of papers and organizing them into well-structured summaries are valuable for aiding both automatic text processing and human comprehension. Although recent work has increasingly focused on tasks dealing with the contributions of scientific documents, there is currently no large-scale, high-quality dataset that can facilitate the training of modern deep-learning-based models. To this end, we curate a dataset consisting of 24K computer science papers with contributions explicitly listed by the authors, which are further classified into different contribution types based on our newly introduced annotation scheme. We then formally formulate the task of generating disentangled contributions for scientific documents. We present a fine-grained post-training strategy that leverages abundant unsupervised data and the contribution types of both the highlight sentences in the source documents and the generation targets. Experimental results show that the proposed method outperforms competitive baselines and other post-training strategies, demonstrating the effectiveness of our approach. We also conduct detailed analysis to study the characteristics and challenges of our dataset and the newly proposed task. | en |
dc.description.provenance | Made available in DSpace on 2023-03-19T23:42:01Z (GMT). No. of bitstreams: 1 U0001-3108202217461300.pdf: 2727026 bytes, checksum: c277f0ea386d6e3151d63e3aa39d4971 (MD5) Previous issue date: 2022 | en |
dc.description.tableofcontents | Acknowledgements i
摘要 ii
Abstract iii
Contents v
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation and Contribution 3
1.3 Thesis Organization 7
Chapter 2 Related Work 8
2.1 Scholarly Document Processing 8
2.2 Abstractive Summarization 11
Chapter 3 Datasets 14
3.1 Dataset Collection 14
3.2 Dataset Analysis 18
3.2.1 Dataset Statistics 18
3.2.2 Comparisons with Existing Datasets 19
3.2.3 Structural Alignments 20
3.3 Contribution Type Annotation 22
3.3.1 Annotation Scheme 23
3.3.2 Annotation Procedure and Results 24
3.3.3 Analysis of Contribution Types 25
Chapter 4 Methodology 28
4.1 Contribution Type Classification 28
4.2 Disentangled Contribution Generation 29
4.2.1 Task Formulation and Model Architecture 29
4.2.2 Finetune for Disentangled Contribution Generation 31
4.2.3 Fine-grained Post-training 36
Chapter 5 Experiments 39
5.1 Contribution Type Classification 39
5.2 Disentangled Contribution Generation 41
5.2.1 Experimental Setup 41
5.2.2 Evaluation Metrics 42
5.2.3 Main Results 45
5.2.4 Comparisons with Other Post-training Strategies 48
Chapter 6 Discussion 50
6.1 Ablation Study 50
6.2 Experiment Results in Low Resource Setting 51
6.3 Results Based on Different Contribution Types 53
6.4 Analysis of Contribution-Level Evaluations 56
Chapter 7 Conclusion 60
References 62 | - |
dc.language.iso | en | - |
dc.title | 生成科學論文之條列式貢獻 | zh_TW |
dc.title | Generating Disentangled Contributions for Scientific Documents | en |
dc.type | Thesis | - |
dc.date.schoolyear | 110-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 鄭卜壬;蔡銘峰;陳冠宇 | zh_TW |
dc.contributor.oralexamcommittee | Pu-Jen Cheng;Ming-Feng Tsai;Kuan-Yu Chen | en |
dc.subject.keyword | 科學文本處理,抽象式摘要,科學貢獻生成 | zh_TW |
dc.subject.keyword | Scholarly Document Processing, Abstractive Summarization, Research Contribution Generation | en |
dc.relation.page | 70 | - |
dc.identifier.doi | 10.6342/NTU202203034 | - |
dc.rights.note | Authorization granted (open access worldwide) | - |
dc.date.accepted | 2022-09-02 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
dc.date.embargo-lift | 2023-08-31 | - |
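The abstract above describes a fine-tuning and fine-grained post-training setup in which a sequence-to-sequence model generates contribution lists conditioned on contribution types. As a rough illustration of how such type-conditioned generation is commonly wired up, the sketch below prepends a control token to the source document before feeding it to a BART model via Hugging Face Transformers. This is a minimal sketch under assumptions: the type tokens (`<dataset>`, `<method>`, `<analysis>`) and the helper function are hypothetical placeholders, not the thesis's actual annotation scheme or code.

```python
# Minimal sketch of type-controlled contribution generation, assuming a
# BART-style seq2seq backbone. Requires: pip install transformers torch
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

# Hypothetical contribution-type control tokens; the thesis defines its own
# annotation scheme (Chapter 3), which is not reproduced here.
TYPE_TOKENS = ["<dataset>", "<method>", "<analysis>"]
tokenizer.add_special_tokens({"additional_special_tokens": TYPE_TOKENS})
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens

def generate_contribution(paper_text: str, type_token: str) -> str:
    """Generate a contribution statement of one type for a paper."""
    # Prepending the control token conditions the decoder on the
    # requested contribution type.
    inputs = tokenizer(
        f"{type_token} {paper_text}",
        max_length=1024,  # BART's encoder input limit
        truncation=True,
        return_tensors="pt",
    )
    output_ids = model.generate(
        **inputs, max_length=128, num_beams=4, early_stopping=True
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example (weights are not fine-tuned here, so output is illustrative only):
# print(generate_contribution(open("paper.txt").read(), "<method>"))
```

In a setup like this, each training pair would couple a (document, type token) input with the author-listed contributions of that type, so a single model can serve all contribution types; the unsupervised post-training stage described in the abstract would precede such fine-tuning.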
Appears in Collections: | 資訊網路與多媒體研究所
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-110-2.pdf | 2.66 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.