Please use this Handle URI to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93448

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 林守德 | zh_TW |
| dc.contributor.advisor | Shou-De Lin | en |
| dc.contributor.author | 顏廷聿 | zh_TW |
| dc.contributor.author | Ting-Yu Yen | en |
| dc.date.accessioned | 2024-08-01T16:10:30Z | - |
| dc.date.available | 2024-08-02 | - |
| dc.date.copyright | 2024-08-01 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-07-30 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93448 | - |
| dc.description.abstract | 本文探討了多模態監督學習中模態不匹配的挑戰,其指的是推理過程中出現的模態和訓練時出現的模態不同的情況,我們提出了一種創新方法TAMML(文本中心對齊的多模態監督學習),該方法利用具有上下文學習能力的大語言模型和基礎模型提高多模態系統在這種情況下的泛化能力。通過利用文本作為統一語義空間的獨特特性,本文展示了在處理未見過、多樣、不可預測的模態組合時的顯著改進。所提出的解決方法不僅能夠適應不同的模態,還能保持穩健的性能,展示了基礎模型在克服傳統模型框架在嵌入表示的局限性的潛力。本研究通過提供一種靈活且有效的方法,為動態且不確定模態可用性的真實應用作出貢獻。 | zh_TW |
| dc.description.abstract | This paper addresses the challenge of modality mismatch in supervised learning, where the modalities available during inference differ from those available during training. We propose an innovative method, TAMML (Text-centric Alignment for Multi-Modality Supervised Learning), that utilizes Large Language Models with in-context learning and foundation models to enhance the generalizability of multimodal systems under these conditions. By leveraging the unique properties of text as a unified semantic space, this paper demonstrates significant improvements in handling unseen, diverse, and unpredictable modality combinations. The proposed solution not only adapts to varying modalities but also maintains robust performance, showcasing the potential of foundation models in overcoming the limitations of traditional fixed-modality frameworks in embedding representations. This study contributes to the field by offering a flexible, effective solution for real-world applications where modality availability is dynamic and uncertain. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-01T16:10:30Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-08-01T16:10:30Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員會審定書 i
誌謝 ii
摘要 iv
Abstract v
Contents vi
List of Figures x
List of Tables xii
Chapter 1 Introduction 1
Chapter 2 Related Work 7
2.1 Multimodal Foundation Models 7
2.2 Modality Generation 9
2.2.1 Modality Summarization 9
2.2.2 Modality Creation 9
2.3 Zero-shot Learning Cross Modality Translation 10
Chapter 3 Methodology 11
3.1 Problem Formalization 11
3.2 Text Transformation 12
3.3 Text-style Translation across Modality 14
3.4 Modality Summarization 15
3.5 LLM Reasoning Augmentation 16
3.6 Downstream Training 18
Chapter 4 Experiments 20
4.1 Q1: Under Modality Mismatch Scenarios, How Does TAMML Compare to the SOTA? 21
4.1.1 Dataset 21
4.1.2 Large Language Models 22
4.1.3 Competitors 22
4.1.4 Results 23
4.2 Q2: Is the Proposed Solution Still Effective When Modality Combinations Dynamically Change? 24
4.2.1 Results 27
4.3 Q3: Is Text Representation Generally More Robust Than Embedding Representation for Cross Modality Translation? 28
4.3.1 MLLMs Baseline 28
4.3.2 Results 29
4.4 Ablation Studies 29
4.4.1 Text Transformation 30
4.4.2 Modality Summarization 30
4.4.3 Reasoning Augmentation 30
4.4.4 Text-Style Translation across Modality 31
Chapter 5 Discussion 32
5.1 Experiment Settings 32
5.1.1 Baselines 33
5.1.2 Evaluation Protocol 33
5.1.3 Evaluation Metric 33
5.2 Results 34
5.3 LLM Ablation 35
5.4 Qualitative Analysis and Findings 35
Chapter 6 Analysis 36
6.1 Visualization for Distribution Alignment 36
6.2 Effects of the Image Caption Models 37
Chapter 7 Conclusion 38
7.1 Limitation 38
7.2 Conclusion and Future Directions 38
References 40
Appendix A — Experiment Detail Setup 50
A.1 Model Checkpoints 50
A.2 Hyperparameters 51
A.3 Dataset 51
A.4 Foundation Models 52
Appendix B — Analysis and Discussion 53
B.1 In-context Modality Transfer Outperforms Zero-shot Learning Based Methods 53
Appendix C — Detailed Prompt 55 | - |
| dc.language.iso | en | - |
| dc.subject | 大語言模型 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 跨模態信息提取 | zh_TW |
| dc.subject | 跨模態內容生成 | zh_TW |
| dc.subject | Deep Learning | en |
| dc.subject | Cross-Modal Information Extraction | en |
| dc.subject | Large Language Model | en |
| dc.subject | Cross-Modal Content Generation | en |
| dc.title | 文本中心對齊的多模態監督學習 | zh_TW |
| dc.title | Text-centric Alignment for Multi-Modality Supervised Learning | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 李宏毅;陳銘憲;陳縕儂;廖耿德 | zh_TW |
| dc.contributor.oralexamcommittee | Hung-Yi Lee;Ming-Syan Chen;Yun-Nung Chen;Keng-Te Liao | en |
| dc.subject.keyword | 大語言模型,深度學習,跨模態信息提取,跨模態內容生成 | zh_TW |
| dc.subject.keyword | Large Language Model, Deep Learning, Cross-Modal Information Extraction, Cross-Modal Content Generation | en |
| dc.relation.page | 57 | - |
| dc.identifier.doi | 10.6342/NTU202402322 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2024-08-01 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
Appears in Collections: 資訊工程學系
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (access restricted to NTU campus IP addresses; off-campus users should connect via the NTU VPN service) | 6.37 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
