20.4.3.3.8 Large Language Models for VQA, Visual Question Answering

Large Language Models. LLM. VQA. Visual Reasoning. Question Answer. Visual Q-A.
See also Large Language Models for Vision, LLM, LVLM.
See also General Spatial Reasoning and Geometric Reasoning Issues, Visual Relations.
See also Foundation Models, Graph Foundation Models.

Wang, J.[Jialou], Zhu, M.[Manli], Li, Y.[Yulei], Li, H.L.[Hong-Lei], Yang, L.Z.[Long-Zhi], Woo, W.L.[Wai Lok],
Detect2Interact: Localizing Object Key Field in Visual Question Answering with LLMs,
IEEE_Int_Sys(39), No. 3, May 2024, pp. 35-44.
IEEE DOI 2407
Visualization, Semantics, Object detection, Image segmentation, Task analysis, Computational modeling, Chatbots, Spatial resolution BibRef

Hu, Z.J.[Zhong-Jian], Yang, P.[Peng], Jiang, Y.S.[Yuan-Shuang], Bai, Z.J.[Zi-Jian],
Prompting large language model with context and pre-answer for knowledge-based VQA,
PR(151), 2024, pp. 110399.
Elsevier DOI 2404
Visual question answering, Large language model, Knowledge-based VQA, Fine-tuning, In-context learning BibRef

Kuang, J.Y.[Jia-Yi], Shen, Y.[Ying], Xie, J.[Jingyou], Luo, H.[Haohao], Xu, Z.[Zhe], Li, R.H.[Rong-Hao], Li, Y.H.[Ying-Hui], Cheng, X.F.[Xian-Feng], Lin, X.[Xika], Han, Y.[Yu],
Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey,
Surveys(57), No. 8, March 2025, pp. xx-yy.
DOI Link 2504
Survey, Large Language Models. Visual question answering, multimodal representation and reasoning, multimodal large language models BibRef

Xiong, H.M.[Hao-Miao], Zhuge, Y.Z.[Yun-Zhi], Zhu, J.[Jiawen], Zhang, L.[Lu], Lu, H.C.[Hu-Chuan],
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding,
MultMed(27), 2025, pp. 2899-2911.
IEEE DOI 2506
Large language models, Solid modeling, Visualization, Training, Point cloud compression, visual question answering BibRef

Yu, Z.[Zhou], Ouyang, X.C.[Xue-Cheng], Shao, Z.W.[Zhen-Wei], Wang, M.[Meng], Yu, J.[Jun],
Prophet: Prompting Large Language Models With Complementary Answer Heuristics for Knowledge-Based Visual Question Answering,
PAMI(47), No. 8, August 2025, pp. 6797-6808.
IEEE DOI 2507
BibRef
Earlier: A3, A1, A4, A5, Only:
Prompting Large Language Models with Answer Heuristics for Knowledge-Based Visual Question Answering,
CVPR23(14974-14983)
IEEE DOI 2309
Knowledge based systems, Visualization, Helium, Cognition, Question answering (information retrieval), Training, large multimodal models BibRef


Cai, M.[Mu], Huang, Z.Y.[Ze-Yi], Li, Y.H.[Yu-Heng], Ojha, U.[Utkarsh], Wang, H.H.[Hao-Han], Lee, Y.J.[Yong Jae],
An Investigation on LLMs' Visual Understanding Ability Using SVG for Image-Text Bridging,
WACV25(5377-5386)
IEEE DOI Code:
WWW Link. 2505
Visualization, Large language models, Semantics, Vectors, Question answering (information retrieval), Cognition, SVG BibRef

Amoroso, R.[Roberto], Zhang, G.[Gengyuan], Koner, R.[Rajat], Baraldi, L.[Lorenzo], Cucchiara, R.[Rita], Tresp, V.[Volker],
Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries,
WACV25(8853-8862)
IEEE DOI 2505
Visualization, Large language models, Computational modeling, Transformers, Question answering (information retrieval), multimodal large language models BibRef

Weng, W.X.[Wei-Xi], Zhang, R.[Rui], Meng, X.J.[Xiao-Jun], Zhu, J.[Jieming], Liu, Q.[Qun], Yuan, C.[Chun],
Unsupervised Domain Adaptive Visual Question Answering in the Era of Multi-Modal Large Language Models,
WACV25(6248-6258)
IEEE DOI 2505
Visualization, Systematics, Adaptive systems, Large language models, Semantics, Aerospace electronics, visual question answering BibRef

Sun, G.H.[Guo-Hao], Qin, C.[Can], Wang, J.M.[Jia-Mian], Chen, Z.Y.[Ze-Yuan], Xu, R.[Ran], Tao, Z.Q.[Zhi-Qiang],
SQ-LLaVA: Self-questioning for Large Vision-language Assistant,
ECCV24(IX: 156-172).
Springer DOI 2412
BibRef

Ye, Q.[Qilang], Yu, Z.T.[Zi-Tong], Shao, R.[Rui], Xie, X.Y.[Xin-Yu], Torr, P.H.S.[Philip H.S.], Cao, X.C.[Xiao-Chun],
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-visual Scenarios,
ECCV24(X: 146-164).
Springer DOI 2412
BibRef

Hu, Y.[Yutao], Li, T.[Tianbin], Lu, Q.[Quanfeng], Shao, W.Q.[Wen-Qi], He, J.J.[Jun-Jun], Qiao, Y.[Yu], Luo, P.[Ping],
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM,
CVPR24(22170-22183)
IEEE DOI Code:
WWW Link. 2410
Reflectivity, Visualization, Biological system modeling, Computational modeling, Medical services, Benchmark testing BibRef

Li, Z.[Zhuowan], Jasani, B.[Bhavan], Tang, P.[Peng], Ghadar, S.[Shabnam],
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA,
CVPR24(13613-13623)
IEEE DOI 2410
Training, Visualization, Technological innovation, Accuracy, Computational modeling, Training data, Data augmentation BibRef

Özdemir, Ö.[Övgü], Akagündüz, E.[Erdem],
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts,
Prompting24(1562-1571)
IEEE DOI Code:
WWW Link. 2410
Visualization, Computational modeling, Large language models, Pipelines, Semantics, Question answering (information retrieval), image captioning BibRef

Ranasinghe, K.[Kanchana], Shukla, S.N.[Satya Narayan], Poursaeed, O.[Omid], Ryoo, M.S.[Michael S.], Lin, T.Y.[Tsung-Yu],
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs,
CVPR24(12977-12987)
IEEE DOI 2410
Training, Location awareness, Visualization, Image coding, Large language models, Pipelines, Cognition, LLM, VQA, Localization, Video BibRef

Blau, T.[Tsachi], Fogel, S.[Sharon], Ronen, R.[Roi], Golts, A.[Alona], Tsiper, S.[Shahar], Avraham, E.B.[Elad Ben], Aberdam, A.[Aviad], Ganz, R.[Roy], Litman, R.[Ron],
GRAM: Global Reasoning for Multi-Page VQA,
CVPR24(15598-15607)
IEEE DOI 2410
Adaptation models, Visualization, Computational modeling, Large language models, Benchmark testing, Transformers, Cognition, Vision Language Models BibRef

Li, L.[Li], Peng, J.W.[Jia-Wei], Chen, H.[Huiyi], Gao, C.Y.[Chong-Yang], Yang, X.[Xu],
How to Configure Good In-Context Sequence for Visual Question Answering,
CVPR24(26700-26710)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Design methodology, Large language models, Question answering (information retrieval) BibRef

Agrawal, A.[Aviral], Lezcano, C.M.S.[Carlos Mateo Samudio], Heredia-Marin, I.B.[Iqui Balam], Sethi, P.S.[Prabhdeep Singh],
Listen Then See: Video Alignment with Speaker Attention,
MULA24(2018-2027)
IEEE DOI 2410
Bridges, Visualization, Codes, Accuracy, Question answering (information retrieval), LLM BibRef

Tan, R.[Reuben], Sun, X.[Ximeng], Hu, P.[Ping], Wang, J.H.[Jui-Hsien], Deilamsalehy, H.[Hanieh], Plummer, B.A.[Bryan A.], Russell, B.[Bryan], Saenko, K.[Kate],
Koala: Key Frame-Conditioned Long Video-LLM,
CVPR24(13581-13591)
IEEE DOI 2410
Visualization, Accuracy, Large language models, Computational modeling, Benchmark testing, Question answering (information retrieval) BibRef

Ganz, R.[Roy], Kittenplon, Y.[Yair], Aberdam, A.[Aviad], Avraham, E.B.[Elad Ben], Nuriel, O.[Oren], Mazor, S.[Shai], Litman, R.[Ron],
Question Aware Vision Transformer for Multimodal Reasoning,
CVPR24(13861-13871)
IEEE DOI 2410
Visualization, Image coding, Large language models, Focusing, Transformers BibRef

Bansal, H.[Hritik], Bitton, Y.[Yonatan], Szpektor, I.[Idan], Chang, K.W.[Kai-Wei], Grover, A.[Aditya],
VideoCon: Robust Video-Language Alignment via Contrast Captions,
CVPR24(13927-13937)
IEEE DOI 2410
Large language models, Semantics, Question answering (information retrieval), Data models, large multimodal models BibRef

Wang, S.W.[Shao-Wei], Zhang, L.L.[Ling-Ling], Zhu, L.J.[Long-Ji], Qin, T.[Tao], Yap, K.H.[Kim-Hui], Zhang, X.Y.[Xin-Yu], Liu, J.[Jun],
CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering,
CVPR24(13969-13979)
IEEE DOI 2410
Bridges, Visualization, Large language models, Computational modeling, Natural languages, Large Language Model BibRef

Khan, Z.[Zaid], BG, V.K.[Vijay Kumar], Schulter, S.[Samuel], Fu, Y.[Yun], Chandraker, M.[Manmohan],
Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement,
CVPR24(14344-14353)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Annotations, Large language models, Object detection, Question answering (information retrieval), visual question answering BibRef

Liao, Z.[Zhaohe], Li, J.T.[Jiang-Tong], Niu, L.[Li], Zhang, L.Q.[Li-Qing],
Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering,
CVPR24(13395-13404)
IEEE DOI 2410
Measurement, Accuracy, Computational modeling, Aggregates, Large language models, Pipelines BibRef

Pan, J.T.[Jun-Ting], Lin, Z.[Ziyi], Ge, Y.Y.[Yu-Ying], Zhu, X.T.[Xia-Tian], Zhang, R.R.[Ren-Rui], Wang, Y.[Yi], Qiao, Y.[Yu], Li, H.S.[Hong-Sheng],
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models,
MMFM23(272-283)
IEEE DOI 2401
BibRef

Guo, J.X.[Jia-Xian], Li, J.[Junnan], Li, D.X.[Dong-Xu], Tiong, A.M.H.[Anthony Meng Huat], Li, B.Y.[Bo-Yang], Tao, D.C.[Da-Cheng], Hoi, S.[Steven],
From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models,
CVPR23(10867-10877)
IEEE DOI 2309
BibRef



Last update: Jul 7, 2025 at 14:35:55