Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93792

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 謝舒凱 | zh_TW |
| dc.contributor.advisor | Shu-Kai Hsieh | en |
| dc.contributor.author | 周昕妤 | zh_TW |
| dc.contributor.author | Hsin-Yu Chou | en |
| dc.date.accessioned | 2024-08-08T16:14:16Z | - |
| dc.date.available | 2024-08-09 | - |
| dc.date.copyright | 2024-08-08 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-03-06 | - |
| dc.identifier.citation | Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736.
Bavelas, J., Gerwing, J., & Healing, S. (2014). Effect of dialogue on demonstrations: direct quotations, facial portrayals, hand gestures, and figurative references. Discourse Processes, 51(8), 619–655.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2023). Vicuna: an open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
Clark, H. H. (2016). Depicting as a method of communication. Psychological review, 123(3), 324.
Clark, H. H. (2019). 17 depicting in communication. Human language: From genes and brains to behavior, 235.
Croft, W., & Cruse, D. A. (2004). Cognitive linguistics. Cambridge University Press.
Fricke, E. (2012). Grammatik multimodal: wie wörter und gesten zusammenwirken (Vol. 40). Walter de Gruyter.
Fricke, E. (2013). Towards a unified grammar of gesture and speech: a multimodal approach. Body-Language-Communication (HSK 38.1), An International Handbook on Multimodality in Human Interaction, 733–754.
Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., & Misra, I. (2023). Imagebind: one embedding space to bind them all. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15180–15190.
Goldberg, A. E. (1995). Constructions: a construction grammar approach to argument structure. University of Chicago Press.
Grice, H. P. (1957). Meaning. The philosophical review, 66(3), 377–388.
Hart, C., & Marmol Queralto, J. (2021). What can cognitive linguistics tell us about language-image relations? a multidimensional approach to intersemiotic convergence in multimodal texts. Cognitive Linguistics, 32(4), 529–562.
Hsu, H.-C., Brône, G., & Feyaerts, K. (2021). When gesture "takes over": speech-embedded nonverbal depictions in multimodal interaction. Frontiers in Psychology, 11, 552533.
Huang, R., Li, M., Yang, D., Shi, J., Chang, X., Ye, Z., Wu, Y., Hong, Z., Huang, J., Liu, J., et al. (2023). Audiogpt: understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995.
Kendon, A. (2004). Gesture: visible action as utterance. Cambridge University Press.
Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., & Sabharwal, A. (2022). Decomposed prompting: a modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199–22213.
Kok, K. I., & Cienki, A. (2016). Cognitive grammar and gesture: points of convergence, advances and challenges. Cognitive Linguistics, 27(1), 67–100.
Kress, G., & Van Leeuwen, T. (2006). Reading images: the grammar of visual design. Routledge.
Ladewig, S. (2020). Integrating gestures: the dimension of multimodality in cognitive grammar (Vol. 44). Walter de Gruyter GmbH & Co KG.
Langacker, R. W. (1987). Foundations of cognitive grammar: volume ii: descriptive application.
Langacker, R. W. (2008). Cognitive grammar. Cognition and Pragmatics, 77.
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., & Gao, J. (2023). Llava-med: training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890.
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., & Kalyan, A. (2022). Learn to explain: multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35, 2507–2521.
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.-L., Yong, M. G., Lee, J., et al. (2019). Mediapipe: a framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
Lyu, C., Wu, M., Wang, L., Huang, X., Liu, B., Du, Z., Shi, S., & Tu, Z. (2023). Macaw-llm: multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093.
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia medica, 22(3), 276–282.
McNeill, D. (2016). Why we gesture: the surprising role of hand movements in communication. Cambridge University Press.
Mittelberg, I., Evola, V., et al. (2014). Iconic and representational gestures. Body-Language-Communication: An international handbook on multimodality in human Interaction: Handbooks of linguistics and communication science, 2, 1732–1746.
Mondada, L. (2019). Contemporary issues in conversation analysis: embodiment and materiality, multimodality and multisensoriality in social interaction. Journal of Pragmatics, 145, 47–62.
Norris, S. (2004). Analyzing multimodal interaction: a methodological framework. Routledge.
OpenAI. (2023). Gpt-4 technical report.
Parmentier, R. J. (1994). Signs in society: studies in semiotic anthropology. Indiana University Press.
Peirce, C. (1932). The icon, index, and symbol. Cambridge, MA: Harvard University Press.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. International Conference on Machine Learning, 28492–28518.
Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W., Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., ... Bengio, Y. (2021). SpeechBrain: a general-purpose speech toolkit [arXiv:2106.04624].
Sambre, P., & Brône, G. (2013). Cut and break. the multimodal expression of instrumentality and causality. MaMuD–Mapping Multimodal Dialogue, Date: 2013/11/22-2013/11/23, Location: RWTH Aachen, Germany.
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). Bloom: a 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Schoonjans, S. (2017). Multimodal construction grammar issues are construction grammar issues. Linguistics Vanguard, 3(s1), 20160050.
Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). Hugginggpt: solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580.
Shepard, R. N. (1984). Ecological constraints on internal representation: resonant kinematics of perceiving, imagining, thinking, and dreaming. Psychological review, 91(4), 417.
Stec, K., Huiskes, M., & Redeker, G. (2015). Multimodal analysis of quotation in oral narratives. Open Linguistics, 1(1).
Steen, F., & Turner, M. B. (2013). Multimodal construction grammar. Language and the Creative Mind. Borkent, Michael, Barbara Dancygier, and Jennifer Hinnell, editors. Stanford, CA: CSLI Publications.
Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., & Cai, D. (2023). Pandagpt: one model to instruction-follow them all. arXiv preprint arXiv:2305.16355.
Tannen, D. (1986). Introducing constructed dialogue in greek and american conversational and literary narrative. Direct and indirect speech, 31, 311–332.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wade, E., & Clark, H. H. (1993). Reproduction and demonstration in quotations. Journal of Memory and Language, 32(6), 805–819.
Walton, K. L. (1990). Mimesis as make-believe: on the foundations of the representational arts. Harvard University Press.
Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y. M. (2023). Yolov7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7464–7475.
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). Elan: a professional framework for multimodality research. 5th international conference on language resources and evaluation (LREC 2006), 1556–1559.
Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., & Duan, N. (2023). Visual chatgpt: talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671.
Zhang, H., Li, X., & Bing, L. (2023). Video-llama: an instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858. https://arxiv.org/abs/2306.02858
Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al. (2022). Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). Minigpt-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
Zima, E. (2014). English multimodal motion constructions. a construction grammar perspective. Linguistic Society of Belgium, 8, 14–29.
Zima, E., & Bergs, A. (2017). Multimodality and construction grammar. Linguistics Vanguard, 3(s1), 20161006. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93792 | - |
| dc.description.abstract | 大型語言模型(LLM)的發展為自然語言處理這個領域帶來新一波的任務以及研究方向,由於大型語言模型生成文字的能力可以透過自然語言直接提示(prompt)和教導(instruction)以解決許多任務,也在業界產生全新的應用和發展。多模態大型語言模型(Multimodal LLM, MLLM)也在幾個月內迅速發展,目前已經有可以解讀影音內容的多模態大型語言模型。本研究探討此刻最新發展的多模態大型語言模型對於「描繪」(Depiction)這項溝通策略的跨模態理解能力,「描繪」是日常生活中人們頻繁使用的溝通方式,所指為創造和呈現能讓聽者想像被描述場景的具象場景,經常會透過手勢、聲音、臉部表情等非語言的方式出現,因此,能夠整合視覺、聽覺、和語言文字等模態的能力對大型語言模型的未來發展極為重要。
本研究論文蒐集100個美國訪談節目的影片,先在視覺和聲音兩個模態進行臉部辨識、姿勢抓取、語音轉寫、語者識別等前處理,完成後進行標記以取出含有「描繪」的影音片段,最後使用Video-LLaMA這個多模態大型語言模型進行四個實驗。實驗資料集分為四種不同類型的描繪:附加描繪(adjunct depiction)、指引描繪(indexed depiction)、嵌入描繪(embedded depiction)和獨立描繪(independent depiction)。四個實驗中分別使用了不同的提示設計,包括零樣本(zero-shot)或少量樣本(few-shot)提示、關聯思考(Chain-of-Thought)提示等變因的操作。根據實驗結果,目前最新的大型語言模型在手勢的判讀上仍難以做出準確有效的整合理解與判斷解釋。研究結果指出了大型語言模型在手勢理解能力上的現有限制,以及未來朝這個方向繼續發展的重要性。 | zh_TW |
| dc.description.abstract | Large Language Models (LLMs) have revolutionized Natural Language Processing, showcasing remarkable achievements and rapid advancements. Despite significant progress in meaning construal and multimodal capabilities, LLMs at the time of writing still struggle to accurately interpret the iconic gestures that occur in "depiction". Depiction, a prevalent communicative method in daily life, involves creating and presenting physical, iconic scenes that enable recipients to imagine the depicted meaning. It is therefore crucial for multimodal LLMs to comprehend and potentially acquire this communicative strategy.
This research paper presents an investigation into the capabilities of LLMs with a dataset comprising 100 video clips from four American talk shows. A pipeline is developed to automatically process the multimodal data, and the identified depiction segments are used to assess the performance of Video-LLaMA, a multimodal large language model capable of interpreting video. Four experiments are designed to evaluate whether LLMs can identify and accurately interpret four distinct types of depiction: adjunct depiction, indexed depiction, embedded depiction, and independent depiction. The four experiments use different prompt designs, including zero-shot, few-shot, zero-shot-CoT (i.e., zero-shot Chain-of-Thought), and few-shot-CoT. Experimental results reveal that current state-of-the-art LLMs are unable to complete these tasks successfully. The findings underscore the existing limitations of LLMs in capturing the nuanced meaning conveyed through depiction. Addressing these challenges will be crucial for advancing the capabilities of LLMs and enabling more sophisticated multimodal interactions in the field of Natural Language Processing. (Illustrative sketches of the preprocessing and prompting steps described here appear after the metadata record below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-08T16:14:16Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-08-08T16:14:16Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 論文口試委員審定書 i
致謝 iii
摘要 v
Abstract vii
List of Figures ix
List of Tables xiii
1 Introduction 1
2 Literature Review 3
2.1 A multimodal turn in cognitive linguistics 3
2.2 Depiction in multimodal interactions 5
2.3 LLM and Multimodal LLM 7
2.3.1 BLIP-2 and Video-LLaMA 8
3 Research Methods 13
3.1 Materials 13
3.2 Process pipeline 13
3.2.1 Preprocessing with multimodal models 13
3.2.2 Alignment 15
3.2.3 Annotation 16
3.2.4 Annotation results and inter-rater reliability 18
3.2.5 Prompting methods 19
4 Experiments and Results 23
4.1 Experiments with Video-LLaMA 23
4.1.1 Experiment I: Zero-Shot learning 23
4.1.2 Experiment II: Few-Shot learning 25
4.1.3 Experiment III: Zero-Shot-CoT 27
4.1.4 Experiment IV: Problem Decomposition 28
5 Discussion 33
5.1 Depiction type analysis 33
5.1.1 Adjunct depiction 34
5.1.2 Indexed depiction 36
5.1.3 Embedded depiction 38
5.1.4 Independent depiction 38
5.2 Error analysis 40
5.2.1 Hallucination 40
5.2.2 Bias 43
5.2.3 Limited gesture description 45
5.3 Limitations 46
5.3.1 Unit of depiction 47
5.3.2 Limited resources 47
5.3.3 Lack of groundedness 48
6 Conclusion 51
Appendix A Appendix: Examples of Experiments 53
References 71 | - |
| dc.language.iso | en | - |
| dc.subject | 提示工程 | zh_TW |
| dc.subject | 多模態大型語言模型 | zh_TW |
| dc.subject | 描繪 | zh_TW |
| dc.subject | 多模態互動 | zh_TW |
| dc.subject | Multimodal Interaction | en |
| dc.subject | Prompt engineering | en |
| dc.subject | Depiction | en |
| dc.subject | Multimodal Large Language Model | en |
| dc.title | 大型語言模型的跨模態理解:多模互動中的非語言描繪 | zh_TW |
| dc.title | Cross-Modality Understanding in Large Language Model: Non-verbal Depiction in Multimodal Interaction | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 張瑜芸;陳正賢 | zh_TW |
| dc.contributor.oralexamcommittee | Yu-Yun Chang;Cheng-Hsien Chen | en |
| dc.subject.keyword | 多模態大型語言模型,描繪,多模態互動,提示工程, | zh_TW |
| dc.subject.keyword | Multimodal Large Language Model,Depiction,Multimodal Interaction,Prompt engineering, | en |
| dc.relation.page | 76 | - |
| dc.identifier.doi | 10.6342/NTU202400746 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2024-03-07 | - |
| dc.contributor.author-college | 文學院 | - |
| dc.contributor.author-dept | 語言學研究所 | - |
| dc.date.embargo-lift | 2029-02-19 | - |
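The abstract above describes a preprocessing pipeline that combines face recognition, pose capture, speech transcription, and speaker identification before the depiction segments are annotated. The thesis PDF is under restricted access, so its actual implementation is not reproduced here; the sketch below only illustrates how two of those steps could be wired together using tools that appear in the reference list (MediaPipe for hand/pose landmarks, Whisper for timestamped transcription). All function names and the file name `clip.mp4` are hypothetical, not taken from the thesis.

```python
# Minimal illustrative sketch (not the thesis implementation): pair Whisper
# speech transcription with MediaPipe Holistic hand detection so that speech
# segments co-occurring with hand activity can be flagged as candidate
# "depiction" segments for manual annotation.
import cv2
import mediapipe as mp
import whisper


def transcribe(video_path: str):
    """Transcribe speech and return Whisper's timestamped segments."""
    model = whisper.load_model("base")  # model size is an arbitrary choice here
    result = model.transcribe(video_path)
    return result["segments"]           # each segment has start, end, text


def hand_activity_timestamps(video_path: str, stride: int = 5):
    """Return timestamps (seconds) of frames where at least one hand is detected."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    times = []
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % stride == 0:  # subsample frames to keep processing cheap
                res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if res.left_hand_landmarks or res.right_hand_landmarks:
                    times.append(idx / fps)
            idx += 1
    cap.release()
    return times


def candidate_depiction_segments(video_path: str):
    """Flag transcript segments that overlap with detected hand activity."""
    hand_times = hand_activity_timestamps(video_path)
    return [
        seg
        for seg in transcribe(video_path)
        if any(seg["start"] <= t <= seg["end"] for t in hand_times)
    ]


if __name__ == "__main__":
    for seg in candidate_depiction_segments("clip.mp4"):  # hypothetical clip
        print(f"{seg['start']:.1f}-{seg['end']:.1f}s: {seg['text']}")
```

The face-recognition, speaker-identification, and alignment steps mentioned in the abstract and table of contents are omitted here because this record does not say which tools were used for them.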
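The four experiments differ mainly in prompt design (zero-shot, few-shot, zero-shot-CoT, few-shot-CoT). The thesis's actual prompts are in its appendix, which is not part of this record; the sketch below only illustrates how the four conditions differ structurally. `TASK`, `FEW_SHOT_EXAMPLES`, and `query_video_llm` are placeholders, not the thesis's wording or the Video-LLaMA interface.

```python
# Illustrative-only sketch of the four prompting conditions named in the abstract.
# The task text, the worked example, and the model call are placeholders.

TASK = (
    "Watch the video segment and decide whether the speaker uses a depiction "
    "(adjunct, indexed, embedded, or independent), then describe what the "
    "gesture depicts."
)

# Hypothetical worked example for the few-shot conditions (not from the thesis).
FEW_SHOT_EXAMPLES = (
    "Example: The speaker says 'the fish was this big' while holding both hands "
    "far apart. Answer: indexed depiction -- the gesture supplies the size that "
    "the deictic phrase 'this big' points to.\n\n"
)

# Standard zero-shot-CoT trigger phrase (Kojima et al., 2022).
COT_TRIGGER = "\nLet's think step by step."


def build_prompt(condition: str) -> str:
    """Assemble the prompt text for one of the four experimental conditions."""
    prompts = {
        "zero-shot": TASK,
        "few-shot": FEW_SHOT_EXAMPLES + TASK,
        "zero-shot-cot": TASK + COT_TRIGGER,
        "few-shot-cot": FEW_SHOT_EXAMPLES + TASK + COT_TRIGGER,
    }
    return prompts[condition]


def query_video_llm(video_path: str, prompt: str) -> str:
    """Placeholder for the multimodal model call (e.g., a locally hosted
    Video-LLaMA demo); replace with actual inference code for your setup."""
    raise NotImplementedError("plug in your video-LLM inference here")


def run_condition(video_path: str, condition: str) -> str:
    """Run one prompting condition on one depiction clip."""
    return query_video_llm(video_path, build_prompt(condition))
```

Under this framing the only variable across the four experiments is the prompt string; the video input and the model stay fixed, which mirrors the comparison described in the abstract.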
| Appears in Collections: | 語言學研究所 | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (restricted access) | 14.78 MB | Adobe PDF |