Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91119
Full metadata record
dc.contributor.advisor (zh_TW): 古倫維
dc.contributor.advisor (en): Lun-Wei Ku
dc.contributor.author (zh_TW): 林良璞
dc.contributor.author (en): Nicholas Collin Suwono
dc.date.accessioned: 2023-10-24T17:11:37Z
dc.date.available: 2025-08-09
dc.date.copyright: 2023-10-24
dc.date.issued: 2023
dc.date.submitted: 2023-08-12
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91119
dc.description.abstract (zh_TW): 本工作引入了一項新的任務,即位置感知視覺問題生成(LocaVQG),旨在從與特定地理位置相關的數據中生成引人入勝的問題。具體而言,我們使用周圍的圖像和GPS坐標來表示這種位置感知信息。為了應對這個任務,我們提出了一個數據集生成流程,利用GPT-4來生成多樣且複雜的問題。然後,我們旨在學習一個輕量級模型,可以應用於邊緣設備,如手機。為此,我們提出了一種可靠地從位置感知信息生成引人入勝問題的方法。我們提出的方法在人工評估(例如參與度,連接性,連貫性)和自動評估指標(例如BERTScore,ROUGE-2)方面優於基線方法。此外,我們進行了大量消融研究,以證明我們提出的數據集生成和解決該任務的技術的有效性。
dc.description.abstract (en): This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions. Then, we aim to learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone. To this end, we propose a method that can reliably generate engaging questions from location-aware information. Our proposed method outperforms baselines on human evaluation (e.g., engagement, grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, ROUGE-2). Moreover, we conduct extensive ablation studies to justify our proposed techniques for both generating the dataset and solving the task.
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-10-24T17:11:37Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2023-10-24T17:11:37Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
  Verification Letter from the Oral Examination Committee  i
  Acknowledgements  ii
  摘要  iv
  Abstract  v
  Contents  vi
  List of Figures  x
  List of Tables  xi
  Denotation  xii
  Chapter 1  Introduction  1
    1.1  Motivation  1
    1.2  Contributions  4
  Chapter 2  Literature Review  5
    2.1  In-car intelligent assistant system  5
    2.2  Visual and Language  7
    2.3  Training with LLM Generated Data  9
    2.4  Language Models  11
    2.5  Lightweight models  15
  Chapter 3  Methodology  17
    3.1  Initial Method: Multitask  17
      3.1.1  Multitask Story and Question  17
      3.1.2  Generate Description with Question  19
    3.2  Proposed Method: LocaVQG  20
      3.2.1  Location-Aware Information  21
      3.2.2  LocaVQG Task Tuples  22
      3.2.3  FDT5  22
        3.2.3.1  Knowledge Distillation  23
        3.2.3.2  Post-Inference Filtering  23
  Chapter 4  LocaVQG Dataset  25
    4.1  Choosing the Image Sequences  25
    4.2  GPT-4 Prompt Construction  26
      4.2.1  Captioning Streetview Images  26
      4.2.2  Reverse Geocoding GPS coordinate  27
      4.2.3  Constructing prompts  27
    4.3  Engaging Question Classifier  28
    4.4  Dataset Statistics  30
      4.4.1  Question Length  31
      4.4.2  Frequent Trigrams and Words  31
      4.4.3  Question Quality  32
    4.5  Human Data Annotation  34
  Chapter 5  Experiment 1: Multitasking Model  35
    5.1  Experimental Setup  35
      5.1.1  Hardware & Hyper parameter setup  35
      5.1.2  Baseline Model: VLT5  36
      5.1.3  Evaluation Criteria  36
    5.2  Results & Discussion  37
      5.2.1  Human Evaluation  37
      5.2.2  Qualitative Results  38
  Chapter 6  Experiment 2: LocaVQG  41
    6.1  Experimental Setup  41
      6.1.1  Hardware & Hyper parameter setup  41
      6.1.2  Baseline Models  42
        6.1.2.1  T5  42
        6.1.2.2  MVQG-VL-T5  43
      6.1.3  Evaluation Metrics  44
    6.2  Results & Discussion  45
      6.2.1  Human Evaluation  45
      6.2.2  Automatic Evaluation  46
      6.2.3  Ablation Study  47
        6.2.3.1  Employing Engaging Question Classifier  47
        6.2.3.2  Incorporating GPS Coordinates  48
        6.2.3.3  Varying Dataset Sizes  49
        6.2.3.4  Incorporating Directions  50
        6.2.3.5  Roles of GPT Models  51
  Chapter 7  Error Analyses  53
    7.1  Case Studies  53
    7.2  Evaluation Metrics  54
    7.3  Discussion  54
  Chapter 8  Conclusion  58
    8.1  Limitation  59
      8.1.1  Biases in AMT workers  59
      8.1.2  Location-aware information  59
      8.1.3  Address-aware LLMs  59
      8.1.4  Human evaluation setup  60
    8.2  Future Works  60
  References  62
  Appendix A — Survey Interface  67
  Appendix B — Human Annotation Examples  69
dc.language.iso: en
dc.subject (zh_TW): 電動汽車
dc.subject (zh_TW): LLM
dc.subject (zh_TW): 引人入勝的問題
dc.subject (zh_TW): 司機
dc.subject (zh_TW): 駕駛助手
dc.subject (zh_TW): 位置
dc.subject (en): LLM
dc.subject (en): Driver
dc.subject (en): Engaging Question
dc.subject (en): Driving Assistants
dc.subject (en): Electric Vehicles
dc.subject (en): Location
dc.title (zh_TW): 利用輕量模型進行位置感知視覺問題生成
dc.title (en): Location-Aware Visual Question Generation with Lightweight Models
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor (zh_TW): 孫紹華
dc.contributor.coadvisor (en): Shao-Hua Sun
dc.contributor.oralexamcommittee (zh_TW): 黃挺豪;李政德
dc.contributor.oralexamcommittee (en): Ting-Hao Huang;Cheng-Te Li
dc.subject.keyword (zh_TW): 位置,電動汽車,駕駛助手,司機,引人入勝的問題,LLM
dc.subject.keyword (en): Location, Electric Vehicles, Driving Assistants, Driver, Engaging Question, LLM
dc.relation.page: 69
dc.identifier.doi: 10.6342/NTU202303856
dc.rights.note: 同意授權(限校園內公開) (authorized for on-campus access only)
dc.date.accepted: 2023-08-12
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資料科學學位學程 (Data Science Degree Program)
dc.date.embargo-lift: 2028-08-09
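The automatic metrics named in the abstract, ROUGE-2 and BERTScore, can be computed with standard open-source packages. The following is a minimal illustrative sketch, assuming the rouge-score and bert-score Python packages and two invented example questions; it is not the evaluation code used in the thesis.

# Illustrative only: score one generated question against one reference question
# with ROUGE-2 and BERTScore. The example strings are invented for demonstration.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "What historical events do you think shaped this old train station?"
candidate = "What stories could this old train station tell about the city's past?"

# ROUGE-2: bigram overlap between the generated and reference questions.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
rouge2_f1 = scorer.score(reference, candidate)["rouge2"].fmeasure

# BERTScore: similarity in contextual embedding space; returns precision, recall,
# and F1 as tensors (one element per candidate/reference pair).
p, r, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-2 F1: {rouge2_f1:.3f}  BERTScore F1: {f1.item():.3f}")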
Appears in Collections: Data Science Degree Program

Files in This Item:
File: ntu-111-2.pdf
Size: 10.86 MB
Format: Adobe PDF
Access: Restricted (not authorized for public access)