NTU Theses and Dissertations Repository
Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92803
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 徐宏民 | zh_TW
dc.contributor.advisor | Winston H. Hsu | en
dc.contributor.author | 許雅晴 | zh_TW
dc.contributor.author | Ya-Ching Hsu | en
dc.date.accessioned | 2024-07-01T16:10:58Z
dc.date.available | 2024-07-02
dc.date.copyright | 2024-07-01
dc.date.issued | 2024
dc.date.submitted | 2024-06-26
dc.identifier.citation |
[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In NeurIPS, 2020.
[2] C.-H. Chang, H.-T. Su, J.-H. Hsu, Y.-S. Wang, Y.-C. Chang, Z. Y. Liu, Y.-L. Chang, W.-F. Cheng, K.-J. Wang, and W. H. Hsu. Situation and behavior understanding by trope detection on films. In WWW, 2021.
[3] J.-P. Chou, A. F. Siu, N. Lipka, R. Rossi, F. Dernoncourt, and M. Agrawala. Talestream: Supporting story ideation with trope knowledge. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–12, 2023.
[4] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[5] M. Del and M. Fishel. True detective: A deep abductive reasoning benchmark undoable for gpt-3 and challenging for gpt-4. arXiv preprint arXiv:2212.10114, 2022.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. 2018.
[7] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022.
[8] L. V. Huang. Simultaneous Processing, pages 2301–2302. Springer New York, New York, NY, 2011.
[9] M. Ismayilzada, D. Paul, S. Montariol, M. Geva, and A. Bosselut. Crow: Benchmarking commonsense reasoning in real-world tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
[10] R. Jia and P. Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, 2017.
[11] C. Jiayang, L. Qiu, C. Chan, X. Liu, Y. Song, and Z. Zhang. Eventground: Narrative reasoning by grounding to eventuality-centric knowledge graphs. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 6622–6642, 2024.
[12] D. Kahneman. Thinking, fast and slow (kindle edition), 2011.
[13] H. Liu, J. Liu, L. Cui, Z. Teng, N. Duan, M. Zhou, and Y. Zhang. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[14] H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, and Y. Zhang. Evaluating the logical reasoning ability of chatgpt and gpt-4. arXiv preprint arXiv:2304.03439, 2023.
[15] Q. Liu, S. Hyland, S. Bannur, K. Bouzid, D. C. Castro, M. T. Wetscherek, R. Tinn, H. Sharma, F. Pérez-García, A. Schwaighofer, et al. Exploring the boundaries of gpt-4 in radiology. In EMNLP 2023, 2023.
[16] Y. Miura, Y. Zhang, E. Tsai, C. Langlotz, and D. Jurafsky. Improving factual completeness and consistency of image-to-text radiology report generation. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5288–5304, Online, June 2021. Association for Computational Linguistics.
[17] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[18] R. B. Palm, U. Paquet, and O. Winther. Recurrent relational networks. In NeurIPS, 2018.
[19] A. Piper, R. J. So, and D. Bamman. Narrative theory for computational narrative understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 298–311, 2021.
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
[21] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[22] M. Singh, V. SB, N. Malviya, et al. Mind meets machine: Unravelling gpt-4’s cognitive psychology. arXiv preprint arXiv:2303.11436, 2023.
[23] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota. Harnessing ai for augmenting creativity: Application to movie trailer creation. In Proceedings of the 25th ACM international conference on Multimedia, pages 1799–1808, 2017.
[24] K. Sprenkamp, D. G. Jones, and L. Zavolokina. Large language models for propaganda detection. arXiv preprint arXiv:2310.06422, 2023.
[25] H.-T. Su, P.-W. Shen, B.-C. Tsai, W.-F. Cheng, K.-J. Wang, and W. H. Hsu. Truman: Trope understanding in movies and animations. In CIKM, 2021.
[26] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
[27] A. Talmor, J. Herzig, N. Lourie, and J. Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019.
[28] A. Talmor, O. Yoran, R. L. Bras, C. Bhagavatula, Y. Goldberg, Y. Choi, and J. Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification. NeurIPS, 2021.
[29] A. Talmor, O. Yoran, R. Le Bras, C. Bhagavatula, Y. Goldberg, Y. Choi, and J. Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification.
[30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[31] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.
[32] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018.
[33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022.
[34] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models, 2023.
[35] W. Zhang, Y. Deng, J. Ma, and W. Lam. Answerfact: Fact checking in product question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2407–2417, 2020.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92803
dc.description.abstract | 大型語言模型(LLMs)在思維鏈(CoT)提示下,已展示出在數學、常識和邏輯等事實內容中的顯著多步推理能力。然而,它們在需要更高抽象能力的敘事推理中的表現仍未被探討。本研究利用電影劇本中的隱喻來評估最先進LLMs的敘事推理能力,並發現其表現不佳。我們引入 trope-wise querying 的方法來應對這些挑戰,將F1分數提高了11.8分。此外,雖然先前的研究表明CoT可以增強多步推理能力,但本研究顯示CoT在敘事內容中可能會導致幻覺,降低GPT-4的表現。我們還引入了一種對抗性注入方法,將與隱喻相關的文本標記嵌入不包含明確隱喻的電影劇本中,揭示了CoT對此類注入的高度敏感性。我們的綜合分析為未來的研究方向提供了見解。 | zh_TW
dc.description.abstract | Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the narrative reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4's performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-01T16:10:58Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2024-07-01T16:10:58Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents |
Verification Letter from the Oral Examination Committee i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
Chapter 2 Related Work 7
2.1 Large Language Models (LLMs) 7
2.2 LLM Reasoning 7
2.3 Tropes 8
Chapter 3 Narrative Reasoning with TiMoS 9
3.1 Experimental Setup 9
3.2 LLMs Struggle Reasoning TiMoS 11
3.3 Trope-wise Querying Improves LLMs 12
3.4 Challenges of Chain-of-Thoughts (CoT) 13
3.4.1 CoT Diminishes GPT-4 Performance 13
3.4.2 Adversarial Injection Misleads CoT 15
3.4.3 CoT Generates Flawed Thoughts 16
3.5 Additional Analyses 17
Chapter 4 Conclusion 21
References 23
Appendix A — Dataset, Baseline, and Fine-tuning Details 29
A.1 Baseline Detail 29
A.2 Dataset Detail 30
A.3 LLaMa-2 Fine-Tune Detail 30
Appendix B — Piloting LLMs’ Ability 33
Appendix C — LLM Query, LLM Output, and Attack Sentence Example 35
C.1 Query Example 35
C.1.1 Original Query 35
C.1.2 Trope-wise Query 36
C.1.3 Different CoT Analysis 38
C.2 LLM Output Example 38
C.3 Attack Sentence 38
dc.language.iso | en
dc.subject | 隱喻 | zh_TW
dc.subject | 語言模型 | zh_TW
dc.subject | 自然語言處理 | zh_TW
dc.subject | 模型分析與可解釋性 | zh_TW
dc.subject | 探索 | zh_TW
dc.subject | 推理 | zh_TW
dc.subject | NLP | en
dc.subject | trope | en
dc.subject | reasoning | en
dc.subject | probing | en
dc.subject | model analysis and interpretability | en
dc.subject | language models | en
dc.title | 透過電影劇本中的隱喻探索大型語言模型敘事推理的極限 | zh_TW
dc.title | Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses | en
dc.type | Thesis
dc.date.schoolyear | 112-2
dc.description.degree | 碩士 (Master's)
dc.contributor.oralexamcommittee | 陳尚澤;陳奕廷;葉梅珍 | zh_TW
dc.contributor.oralexamcommittee | Shang-Tse Chen;Yi-Ting Chen;Mei-Chen Yeh | en
dc.subject.keyword | 隱喻,語言模型,自然語言處理,模型分析與可解釋性,探索,推理 | zh_TW
dc.subject.keyword | trope,language models,NLP,model analysis and interpretability,probing,reasoning | en
dc.relation.page | 39
dc.identifier.doi | 10.6342/NTU202401188
dc.rights.note | 未授權 (not authorized for public access)
dc.date.accepted | 2024-06-26
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering)
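
The abstract above names two techniques, trope-wise querying and adversarial injection, without giving implementation detail. The Python sketch below is one plausible reading of those descriptions: the function names (trope_wise_query, adversarial_injection, ask_llm), the prompt wording, the yes/no decision rule, and the random injection placement are all illustrative assumptions, not the thesis's actual code.

# Hedged sketch (not the thesis's code): one possible reading of the
# "trope-wise querying" and "adversarial injection" ideas from the abstract.
from typing import Callable, Dict, List
import random


def trope_wise_query(synopsis: str,
                     tropes: List[str],
                     ask_llm: Callable[[str], str]) -> Dict[str, bool]:
    """Ask the LLM one yes/no question per candidate trope instead of
    asking it to detect every trope in a single pass."""
    results: Dict[str, bool] = {}
    for trope in tropes:
        prompt = (
            f"Movie synopsis:\n{synopsis}\n\n"
            f"Does this synopsis exhibit the trope '{trope}'? Answer yes or no."
        )
        reply = ask_llm(prompt)
        # Assumed decision rule: any reply that starts with "yes" counts as positive.
        results[trope] = reply.strip().lower().startswith("yes")
    return results


def adversarial_injection(synopsis: str,
                          trope_keywords: List[str],
                          seed: int = 0) -> str:
    """Insert trope-related surface tokens into a synopsis that does not
    actually exhibit the trope, to probe whether chain-of-thought prompting
    is distracted by lexical cues. Placement here is random; the thesis may
    position the injected text differently."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in synopsis.split(".") if s.strip()]
    lure = "The story hints at " + ", ".join(trope_keywords)
    sentences.insert(rng.randrange(len(sentences) + 1), lure)
    return ". ".join(sentences) + "."


if __name__ == "__main__":
    demo = "A detective investigates a string of disappearances in a small town."

    def stub(prompt: str) -> str:
        # Offline stand-in for a real LLM call; always answers "no".
        return "no"

    print(trope_wise_query(demo, ["Red Herring", "Chekhov's Gun"], stub))
    print(adversarial_injection(demo, ["red herring", "misdirection"]))

To run this against a real model, ask_llm would be a thin wrapper around whichever LLM API is being evaluated; the stub in the usage example answers "no" so the snippet runs offline.
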
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File | Size | Format
ntu-112-2.pdf (未授權公開取用; not available for public access) | 5.85 MB | Adobe PDF