20.4.3.3.11 Multi-Modal, Multimodal Large Language Models for Vision, LLM

Chapter Contents
Large Language Models. LLM. Visual Reasoning. Multi-Modal. 2510

See also General Spatial Reasoning and Geometric Reasoning Issues, Visual Relations.
See also Foundation Models, Graph Foundation Models.

Wang, Z.[Zihao], Cai, S.F.[Shao-Fei], Liu, A.[Anji], Jin, Y.G.[Yong-Gang], Hou, J.[Jinbing], Zhang, B.[Bowei], Lin, H.[Haowei], He, Z.F.[Zhao-Feng], Zheng, Z.L.[Zi-Long], Yang, Y.D.[Yao-Dong], Ma, X.J.[Xiao-Jian], Liang, Y.[Yitao],
JARVIS-1: Open-World Multi-Task Agents With Memory-Augmented Multimodal Language Models,
PAMI(47), No. 3, March 2025, pp. 1894-1907.
IEEE DOI 2502
Planning, Diamond, Games, Complexity theory, Cognition, Accuracy, Visualization, Reliability, Multitasking, Iron, Minecraft, open-world agents BibRef

Li, Y.X.[Yun-Xin], Jiang, S.Y.[Shen-Yuan], Hu, B.T.[Bao-Tian], Wang, L.Y.[Long-Yue], Zhong, W.Q.[Wan-Qi], Luo, W.H.[Wen-Han], Ma, L.[Lin], Zhang, M.[Min],
Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts,
PAMI(47), No. 5, May 2025, pp. 3424-3439.
IEEE DOI 2504
Training, Data models, Computational modeling, Connectors, Benchmark testing, Visualization, Tuning BibRef

Huang, Z.Z.[Zhong-Zhan], Zhong, S.S.[Shan-Shan], Zhou, P.[Pan], Gao, S.[Shanghua], Zitnik, M.[Marinka], Lin, L.[Liang],
A Causality-Aware Paradigm for Evaluating Creativity of Multimodal Large Language Models,
PAMI(47), No. 5, May 2025, pp. 3830-3846.
IEEE DOI 2504
Creativity, Games, Cognition, Standards, Benchmark testing, Training, Pipelines, Manuals, Large language models, Information leakage, causal intervention BibRef

Villani, F.[Francesco], Maljkovic, I.[Igor], Lazzaro, D.[Dario], Sotgiu, A.[Angelo], Cinà, A.E.[Antonio Emanuele], Roli, F.[Fabio],
Robust image classification with multi-modal large language models,
PRL(194), 2025, pp. 1-7.
Elsevier DOI 2506
Adversarial machine learning, Robust classification, Multimodal large language model, Multimodal information, TrustworthyAI BibRef

Shao, Z.W.[Zhen-Wei], Yu, Z.[Zhou], Yu, J.[Jun], Ouyang, X.C.[Xue-Cheng], Zheng, L.[Lihao], Gai, Z.B.[Zhen-Biao], Wang, M.Y.[Ming-Yang], Kuang, Z.Z.[Zhen-Zhong], Ding, J.J.[Jia-Jun],
Imp: Highly Capable Large Multimodal Models for Mobile Devices,
MultMed(27), 2025, pp. 2961-2974.
IEEE DOI 2506
Visualization, Training, Data models, Computational modeling, Training data, Connectors, Large language models, Mobile handsets, vision-language models BibRef

Ge, J.[Junyao], Zhang, X.[Xu], Zheng, Y.[Yang], Guo, K.[Kaitai], Liang, J.[Jimin],
RSTeller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models,
PandRS(226), 2025, pp. 146-163.
Elsevier DOI Code:
WWW Link. 2506
Vision language model, Multimodal dataset, OpenStreetMap, Google earth engine, Large language models BibRef

Li, Z.S.[Zhen-Shi], Muhtar, D.[Dilxat], Gu, F.[Feng], He, Y.L.X.[Yang-Lang-Xing], Zhang, X.L.[Xue-Liang], Xiao, P.F.[Peng-Feng], He, G.[Guangjun], Zhu, X.X.[Xiao-Xiang],
LHRS-Bot-Nova: Improved multimodal large language model for remote sensing vision-language interpretation,
PandRS(227), 2025, pp. 539-550.
Elsevier DOI Code:
WWW Link. 2508
BibRef
Earlier: A2, A1, A3, A5, A6, Only:
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model,
ECCV24(LXXIV: 440-457).
Springer DOI 2412
Remote sensing, Earth observation, Multimodal large language model, Vision-language dataset BibRef

Li, X.[Xu], Zheng, Y.[Yi], Chen, H.T.[Hao-Tian], Chen, X.L.[Xiao-Lei], Liang, Y.X.[Yu-Xuan], Lai, C.H.[Cheng-Hang], Li, B.[Bin], Xue, X.Y.[Xiang-Yang],
Instruction-guided fusion of multi-layer visual features in Large Vision-Language Models,
PR(170), 2026, pp. 111932.
Elsevier DOI Code:
WWW Link. 2509
Large Vision-Language Models, Multimodal large language models, Hierarchical feature utilization BibRef

Zhang, W.Y.[Wen-Yao], Wu, L.[Letian], Zhang, Z.Q.[Ze-Qun], Yu, T.[Tao], Ma, C.[Chao], Jin, X.[Xin], Yang, X.K.[Xiao-Kang], Zeng, W.J.[Wen-Jun],
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction,
MultMed(27), 2025, pp. 2399-2411.
IEEE DOI 2505
Visualization, Adaptation models, Tuning, Training, Computational modeling, Tail, Pipelines, Overfitting, Nose, Attention, Vision-language models BibRef

Weng, Y.[Yu], He, W.B.[Wen-Bin], Dong, J.[Jun], Chaomurilige, Liu, X.[Xuan], Liu, Z.[Zheng],
Cross-Lingual Adaptation for Vision-Language Model via Multimodal Semantic Distillation,
MultMed(27), 2025, pp. 3184-3196.
IEEE DOI 2506
Adaptation models, Multilingual, Visualization, Training, Semantics, Data models, Natural language processing, Translation, zero-shot learning BibRef

Liang, J.W.[Jia-Wei], Liang, S.Y.[Si-Yuan], Liu, A.S.[Ai-Shan], Cao, X.C.[Xiao-Chun],
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models,
IJCV(133), No. 7, July 2025, pp. 3994-4013.
Springer DOI 2506
BibRef
Zhang, D.[Di], Lei, J.[Jingdi], Li, J.X.[Jun-Xian], Wang, X.Z.[Xun-Zhi], Liu, Y.J.[Yu-Jie], Yang, Z.L.[Zong-Lin], Li, J.T.[Jia-Tong], Wang, W.[Weida], Yang, S.[Suorong], Wu, J.B.[Jian-Bo], Ye, P.[Peng], Ouyang, W.L.[Wan-Li], Zhou, D.Z.[Dong-Zhan],
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning,
CVPR25(9050-9061)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Computational modeling, Natural languages, Benchmark testing, Cognition, Mathematical models, Reliability, multimodal reasoning BibRef

Li, L.[Lei], Wei, Y.C.[Yuan-Cheng], Xie, Z.H.[Zhi-Hui], Yang, X.[Xuqing], Song, Y.F.[Yi-Fan], Wang, P.[Peiyi], An, C.X.[Chen-Xin], Liu, T.Y.[Tian-Yu], Li, S.[Sujian], Lin, B.Y.C.[Bill Yu-Chen], Kong, L.P.[Ling-Peng], Liu, Q.[Qi],
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models,
CVPR25(24657-24668)
IEEE DOI Code:
WWW Link. 2508
Training, Analytical models, Visualization, Accuracy, Pipelines, Benchmark testing, Cognition, Reliability, Probes, Visual perception, multimodal large language models BibRef

Yang, C.[Cheng], Sui, Y.[Yang], Xiao, J.Q.[Jin-Qi], Huang, L.[Lingyi], Gong, Y.[Yu], Li, C.[Chendi], Yan, J.H.[Jing-Hua], Bai, Y.[Yu], Sadayappan, P.[Ponnuswamy], Hu, X.[Xia], Yuan, B.[Bo],
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model,
CVPR25(19803-19813)
IEEE DOI 2508
Training, Visualization, Computational modeling, Memory management, Cost function, Cache storage BibRef

Hong, W.[Wenyi], Cheng, Y.[Yean], Yang, Z.[Zhuoyi], Luo, Z.Y.[Zi-Yang], Wu, H.N.[Hao-Ning], Li, D.X.[Dong-Xu], Ma, J.[Jing], Kankanhalli, M.[Mohan], Li, J.[Junnan],
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation,
CVPR25(8461-8474)
IEEE DOI 2508
Analytical models, Costs, Annotations, Computational modeling, Scalability, Benchmark testing, large multimodal models BibRef

Tian, J.[Jirui], Zhang, J.R.[Jin-Rong], Liu, S.[Shenglan], Xu, L.[Luhao], Huang, Z.X.[Zhi-Xiong], Huang, G.[Gao],
DTOS: Dynamic Time Object Sensing with Large Multimodal Model,
CVPR25(13810-13820)
IEEE DOI Code:
WWW Link. 2508
Location awareness, Visualization, Large language models, Robustness, Spatiotemporal phenomena, Sensors, Spatial resolution, Videos BibRef

Li, M.[Ming], Zhong, J.[Jike], Chen, T.[Tianle], Lai, Y.X.[Yu-Xiang], Psounis, K.[Konstantinos],
EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark,
CVPR25(13337-13349)
IEEE DOI 2508
Visualization, Foundation models, Large language models, Benchmark testing, Control systems, Mathematical models BibRef

Liu, Z.H.[Zhi-Hang], Xie, C.W.[Chen-Wei], Li, P.[Pandeng], Zhao, L.M.[Li-Ming], Tang, L.X.[Long-Xiang], Zheng, Y.[Yun], Liu, C.B.[Chuan-Bin], Xie, H.T.[Hong-Tao],
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models,
CVPR25(8568-8578)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Image coding, Codes, Large language models, Benchmark testing, Computational efficiency, Videos, efficiency BibRef

Ma, Y.Y.[Yi-Yang], Liu, X.C.[Xing-Chao], Chen, X.K.[Xiao-Kang], Liu, W.[Wen], Wu, C.Y.[Cheng-Yue], Wu, Z.Y.[Zhi-Yu], Pan, Z.Z.[Zi-Zheng], Xie, Z.[Zhenda], Zhang, H.[Haowei], Yu, X.K.[Xing-Kai], Zhao, L.[Liang], Wang, Y.S.[Yi-Song], Liu, J.Y.[Jia-Ying], Ruan, C.[Chong],
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation,
CVPR25(7739-7751)
IEEE DOI 2508
Training, Computational modeling, Large language models BibRef

Farina, M.[Matteo], Mancini, M.[Massimiliano], Iacca, G.[Giovanni], Ricci, E.[Elisa],
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages,
CVPR25(29989-29998)
IEEE DOI 2508
Training, Adaptation models, Benchmark testing, Feature extraction, Rendering (computer graphics), Robustness, few-shot learning, multimodal learning BibRef

Zhang, Z.[Zhi], Yadav, S.[Srishti], Han, F.Z.[Feng-Ze], Shutova, E.[Ekaterina],
Cross-modal Information Flow in Multimodal Large Language Models,
CVPR25(19781-19791)
IEEE DOI Code:
WWW Link. 2508
Location awareness, Visualization, Large language models, Computational modeling, Focusing, Linguistics, Predictive models, inner working mechanism BibRef

Fang, Y.[Yi], Jin, B.[Bowen], Shen, J.C.[Jia-Cheng], Ding, S.[Sirui], Tan, Q.[Qiaoyu], Han, J.W.[Jia-Wei],
GraphGPT-o: Synergistic Multimodal Comprehension and Generation on Graphs,
CVPR25(19467-19476)
IEEE DOI Code:
WWW Link. 2508
Codes, Image synthesis, Large language models, Semantics, Transforms, Encoding, Explosions, Electronic commerce, Data mining, multimodal, multimodal large language model BibRef

Hao, H.R.[Hao-Ran], Han, J.M.[Jia-Ming], Li, C.S.[Chang-Sheng], Li, Y.F.[Yu-Feng], Yue, X.Y.[Xiang-Yu],
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models,
CVPR25(14538-14548)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Image recognition, Databases, Large language models, Pipelines, Oral communication, retrieval-augmented generation BibRef

Tong, B.[Bo], Lai, B.[Bokai], Zhou, Y.[Yiyi], Luo, G.[Gen], Shen, Y.H.[Yun-Hang], Li, K.[Ke], Sun, X.S.[Xiao-Shuai], Ji, R.R.[Rong-Rong],
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression,
CVPR25(14570-14581)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Image coding, Codes, Large language models, Semantics, Lightning, Computational complexity, visual compression BibRef

Szot, A.[Andrew], Mazoure, B.[Bogdan], Attia, O.[Omar], Timofeev, A.[Aleksei], Agrawal, H.[Harsh], Hjelm, D.[Devon], Gan, Z.[Zhe], Kira, Z.[Zsolt], Toshev, A.[Alexander],
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons,
CVPR25(10644-10655)
IEEE DOI 2508
Training, Adaptation models, Video games, Navigation, Large language models, Supervised learning, Benchmark testing BibRef

Gholami, M.[Mohsen], Akbari, M.[Mohammad], Cannons, K.[Kevin], Zhang, Y.[Yong],
CASP: Compression of Large Multimodal Models Based on Attention Sparsity,
CVPR25(9372-9381)
IEEE DOI 2508
Quantization (signal), Image coding, Large language models, Bit rate, Benchmark testing, Sparse matrices, Matrix decomposition, 2-bit quantization BibRef

Jia, H.R.[Hong-Rui], Jiang, C.[Chaoya], Xu, H.Y.[Hai-Yang], Ye, W.[Wei], Dong, M.F.[Meng-Fan], Yan, M.[Ming], Zhang, J.[Ji], Huang, F.[Fei], Zhang, S.K.[Shi-Kun],
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization,
CVPR25(9361-9371)
IEEE DOI Code:
WWW Link. 2508
Visualization, Large language models, Face recognition, Symbols, Optimization methods, Benchmark testing, Performance gain, Context modeling BibRef

Alvar, S.R.[Saeed Ranjbar], Singh, G.[Gursimran], Akbari, M.[Mohammad], Zhang, Y.[Yong],
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models,
CVPR25(9392-9401)
IEEE DOI 2508
Measurement, Visualization, Accuracy, Large language models, Redundancy, Memory management, Minimax techniques, Data models, inference optimization BibRef

Zhang, Z.F.[Ze-Feng], Tang, H.Z.[Heng-Zhu], Sheng, J.W.[Jia-Wei], Zhang, Z.Y.[Zhen-Yu], Ren, Y.M.[Yi-Ming], Li, Z.Y.[Zhen-Yang], Yin, D.W.[Da-Wei], Ma, D.[Duohe], Liu, T.W.[Ting-Wen],
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization,
CVPR25(9423-9433)
IEEE DOI Code:
WWW Link. 2508
Visualization, Large language models, Perturbation methods, Noise, Robustness, Noise robustness, Optimization, Resilience, Noise level BibRef

Jiao, Q.[Qirui], Chen, D.[Daoyuan], Huang, Y.L.[Yi-Lun], Ding, B.L.[Bo-Lin], Li, Y.[Yaliang], Shen, Y.[Ying],
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models,
CVPR25(9296-9307)
IEEE DOI 2508
Visualization, Fine-grained image recognition, Large language models, Contrastive learning, Benchmark testing, visual instruction tuning dataset BibRef

Ye, X.[Xubing], Gan, Y.[Yukang], Ge, Y.X.[Yi-Xiao], Zhang, X.P.[Xiao-Ping], Tang, Y.S.[Yan-Song],
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models,
CVPR25(24972-24982)
IEEE DOI 2508
Degradation, Adaptation models, Visualization, Computational modeling, Large language models, Redundancy, multimodal learning BibRef

Luo, G.[Gen], Yang, X.[Xue], Dou, W.H.[Wen-Han], Wang, Z.K.[Zhao-Kai], Liu, J.W.[Jia-Wen], Dai, J.F.[Ji-Feng], Qiao, Y.[Yu], Zhu, X.Z.[Xi-Zhou],
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training,
CVPR25(24960-24971)
IEEE DOI Code:
WWW Link. 2508
Visualization, Large language models, Benchmark testing, Encoding, Decoding, Noise measurement, Optimization, multimodal models, vision language models BibRef

Qi, D.[Daiqing], Zhao, H.[Handong], Shi, J.[Jing], Jenni, S.[Simon], Fan, Y.F.[Yi-Fei], Dernoncourt, F.[Franck], Cohen, S.[Scott], Li, S.[Sheng],
The Photographer's Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers,
CVPR25(24807-24816)
IEEE DOI 2508
Location awareness, Visualization, Image color analysis, Large language models, Education, Lighting, Benchmark testing, image quality assessment BibRef

Liu, S.[Shaoyu], Li, J.N.[Jia-Ning], Zhao, G.H.[Guang-Hui], Zhang, Y.J.[Yun-Jian], Meng, X.[Xin], Yu, F.R.[Fei Richard], Ji, X.Y.[Xiang-Yang], Li, M.[Ming],
EventGPT: Event Stream Understanding with Multimodal Large Language Models,
CVPR25(29139-29149)
IEEE DOI 2508
Bridges, Training, Adaptation models, Visualization, Large language models, Pipelines, Lighting, Optimization, Synthetic data BibRef

Zhao, S.Y.[Shi-Yu], Wang, Z.[Zhenting], Juefei-Xu, F.[Felix], Xia, X.[Xide], Liu, M.[Miao], Wang, X.F.[Xiao-Fang], Liang, M.[Mingfu], Zhang, N.[Ning], Metaxas, D.N.[Dimitris N.], Yu, L.C.[Li-Cheng],
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction,
CVPR25(29869-29879)
IEEE DOI 2508
Image resolution, Large language models, Benchmark testing, Computational efficiency, Bayes methods, Feeds, Optimization, efficiency BibRef

Yan, Z.[Ziang], Li, Z.L.[Zhi-Lin], He, Y.[Yinan], Wang, C.T.[Chen-Ting], Li, K.[Kunchang], Li, X.H.[Xin-Hao], Zeng, X.Y.[Xiang-Yu], Wang, Z.[Zilei], Wang, Y.[Yali], Qiao, Y.[Yu], Wang, L.M.[Li-Min], Wang, Y.[Yi],
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment,
CVPR25(29880-29892)
IEEE DOI 2508
Training, Visualization, Large language models, Scalability, Contrastive learning, Multitasking, Data models, Optimization BibRef

Chen, C.[Cheng], Zhai, Y.P.[Yun-Peng], Zhao, Y.F.[Yi-Fan], Gao, J.Y.[Jin-Yang], Ding, B.L.[Bo-Lin], Li, J.[Jia],
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning,
CVPR25(3826-3835)
IEEE DOI 2508
Visualization, Fuses, Large language models, Face recognition, Refining, Redundancy, Stochastic processes, Reinforcement learning, large vision-language model BibRef

Zhang, Y.T.[Yu-Ting], Lu, H.[Hao], Hu, Q.Y.[Qing-Yong], Wang, Y.[Yin], Yuan, K.[Kaishen], Liu, X.[Xin], Wu, K.[Kaishun],
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model,
CVPR25(29237-29247)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Analytical models, Large language models, Semantics, Refining, Cognition, Physiology, Optimization, Periodic structures BibRef

Lin, J.[Junyan], Chen, H.R.[Hao-Ran], Fan, Y.[Yue], Fan, Y.Q.[Ying-Qi], Jin, X.[Xin], Su, H.[Hui], Fu, J.[Jinlan], Shen, X.Y.[Xiao-Yu],
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices,
CVPR25(4156-4166)
IEEE DOI Code:
WWW Link. 2508
Training, Degradation, Visualization, Large language models, Optical character recognition, Focusing, Nonhomogeneous media, multi-layer visual feature BibRef

Zhao, Q.Q.[Qing-Qing], Lu, Y.[Yao], Kim, M.J.[Moo Jin], Fu, Z.[Zipeng], Zhang, Z.Y.[Zhuo-Yang], Wu, Y.[Yecheng], Li, Z.S.[Zhao-Shuo], Ma, Q.L.[Qian-Li], Han, S.[Song], Finn, C.[Chelsea], Handa, A.[Ankur], Lin, T.Y.[Tsung-Yi], Wetzstein, G.[Gordon], Liu, M.Y.[Ming-Yu], Xiang, D.L.[Dong-Lai],
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models,
CVPR25(1702-1713)
IEEE DOI Code:
WWW Link. 2508
Visualization, Computational modeling, Predictive models, Benchmark testing, Robot sensing systems, Cognition, Planning, multimodal large language models BibRef

Lu, X.D.[Xu-Dong], Chen, Y.H.[Ying-Hao], Chen, C.[Cheng], Tan, H.[Hui], Chen, B.[Boheng], Xie, Y.[Yina], Hu, R.[Rui], Tan, G.X.[Guan-Xin], Wu, R.S.[Ren-Shou], Hu, Y.[Yan], Zeng, Y.[Yi], Wu, L.[Lei], Bian, L.Y.[Liu-Yang], Wang, Z.X.[Zhao-Xiong], Liu, L.[Long], Yang, Y.Z.[Yan-Zhou], Xiao, H.[Han], Zhou, A.[Aojun], Wen, Y.F.[Ya-Fei], Chen, X.X.[Xiao-Xin], Ren, S.[Shuai], Li, H.S.[Hong-Sheng],
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices,
CVPR25(4145-4155)
IEEE DOI 2508
Performance evaluation, Quantization (signal), Large language models, Computational modeling, Mobile handsets, model deployment BibRef

Chen, S.[Shuo], Han, Z.[Zhen], He, B.[Bailan], Liu, J.Z.[Jian-Zhe], Buckley, M.[Mark], Qin, Y.[Yao], Torr, P.[Philip], Tresp, V.[Volker], Gu, J.D.[Jin-Dong],
Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?,
WACV25(6000-6010)
IEEE DOI 2505
Visualization, Large language models, Computational modeling, multimodal large language models, in-context learning BibRef

Wang, C.Y.[Chen-Yu], Luo, W.X.[Wei-Xin], Dong, S.[Sixun], Xuan, X.H.[Xiao-Hua], Li, Z.X.[Zheng-Xin], Ma, L.[Lin], Gao, S.H.[Sheng-Hua],
MLLM-Tool: A Multimodal Large Language Model for Tool Agent Learning,
WACV25(6678-6687)
IEEE DOI 2505
Codes, Large language models, Natural languages, Oral communication, Benchmark testing, Encoding BibRef

Liu, S.L.[Shi-Long], Cheng, H.[Hao], Liu, H.T.[Hao-Tian], Zhang, H.[Hao], Li, F.[Feng], Ren, T.[Tianhe], Zou, X.[Xueyan], Yang, J.W.[Jian-Wei], Su, H.[Hang], Zhu, J.[Jun], Zhang, L.[Lei], Gao, J.F.[Jian-Feng], Li, C.Y.[Chun-Yuan],
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents,
ECCV24(XLVII: 126-142).
Springer DOI 2412
BibRef

Cai, R.[Rizhao], Song, Z.[Zirui], Guan, D.[Dayan], Chen, Z.H.[Zhen-Hao], Li, Y.H.[Yao-Hang], Luo, X.[Xing], Yi, C.Y.[Chen-Yu], Kot, A.C.[Alex C.],
BenchLMM: Benchmarking Cross-Style Visual Capability of Large Multimodal Models,
ECCV24(L: 340-358).
Springer DOI 2412
BibRef

Yu, E.[En], Zhao, L.[Liang], Wei, Y.[Yana], Yang, J.R.[Jin-Rong], Wu, D.M.[Dong-Ming], Kong, L.Y.[Ling-Yu], Wang, T.[Tiancai], Ge, Z.[Zheng], Zhang, X.Y.[Xiang-Yu], Tao, W.B.[Wen-Bing],
Merlin: Empowering Multimodal LLMs with Foresight Minds,
ECCV24(IV: 425-443).
Springer DOI 2412
BibRef

Song, K.[Kunpeng], Zhu, Y.Z.[Yi-Zhe], Liu, B.C.[Bing-Chen], Yan, Q.[Qing], Elgammal, A.[Ahmed], Yang, X.[Xiao],
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation,
ECCV24(XL: 117-132).
Springer DOI 2412
BibRef

Gou, Y.H.[Yun-Hao], Chen, K.[Kai], Liu, Z.[Zhili], Hong, L.Q.[Lan-Qing], Xu, H.[Hang], Li, Z.G.[Zhen-Guo], Yeung, D.Y.[Dit-Yan], Kwok, J.T.[James T.], Zhang, Y.[Yu],
Eyes Closed, Safety on: Protecting Multimodal LLMs via Image-to-text Transformation,
ECCV24(XVII: 388-404).
Springer DOI 2412
BibRef

Wang, D.S.[Dong-Sheng], Cui, J.[Jiequan], Li, M.[Miaoge], Lin, W.[Wang], Chen, B.[Bo], Zhang, H.W.[Han-Wang],
Instruction Tuning-free Visual Token Complement for Multimodal LLMs,
ECCV24(LXXXI: 446-462).
Springer DOI 2412
BibRef

McKinzie, B.[Brandon], Gan, Z.[Zhe], Fauconnier, J.P.[Jean-Philippe], Dodge, S.[Sam], Zhang, B.[Bowen], Dufter, P.[Philipp], Shah, D.[Dhruti], Du, X.Z.[Xian-Zhi], Peng, F.[Futang], Belyi, A.[Anton], Zhang, H.T.[Hao-Tian], Singh, K.[Karanjeet], Kang, D.[Doug], Hè, H.Y.[Hong-Yu], Schwarzer, M.[Max], Gunter, T.[Tom], Kong, X.[Xiang], Zhang, A.[Aonan], Wang, J.Y.[Jian-Yu], Wang, C.[Chong], Du, N.[Nan], Lei, T.[Tao], Wiseman, S.[Sam], Lee, M.[Mark], Wang, Z.[Zirui], Pang, R.[Ruoming], Grasch, P.[Peter], Toshev, A.[Alexander], Yang, Y.F.[Yin-Fei],
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training,
ECCV24(XXIX: 304-323).
Springer DOI 2412
BibRef

Wang, Y.[Yu], Liu, X.G.[Xiao-Geng], Li, Y.[Yu], Chen, M.[Muhao], Xiao, C.W.[Chao-Wei],
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting,
ECCV24(XX: 77-94).
Springer DOI 2412
BibRef

Zhao, H.H.[Henry Hengyuan], Zhou, P.[Pan], Shou, M.Z.[Mike Zheng],
Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator,
ECCV24(XXIII: 129-147).
Springer DOI 2412
BibRef

Fu, X.Y.[Xing-Yu], Hu, Y.S.[Yu-Shi], Li, B.Z.[Bang-Zheng], Feng, Y.[Yu], Wang, H.Y.[Hao-Yu], Lin, X.D.[Xu-Dong], Roth, D.[Dan], Smith, N.A.[Noah A.], Ma, W.C.[Wei-Chiu], Krishna, R.[Ranjay],
Blink: Multimodal Large Language Models Can See but Not Perceive,
ECCV24(XXIII: 148-166).
Springer DOI 2412
BibRef

Zhang, Z.K.[Zhi-Kai], Li, Y.T.[Yi-Tang], Huang, H.F.[Hao-Feng], Lin, M.X.[Ming-Xian], Yi, L.[Li],
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models,
ECCV24(XXIII: 403-421).
Springer DOI 2412
BibRef

Pi, R.J.[Ren-Jie], Han, T.Y.[Tian-Yang], Xiong, W.[Wei], Zhang, J.P.[Ji-Peng], Liu, R.T.[Run-Tao], Pan, R.[Rui], Zhang, T.[Tong],
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization,
ECCV24(XXXIII: 382-398).
Springer DOI 2412
BibRef

Xia, B.[Bin], Wang, S.Y.[Shi-Yin], Tao, Y.[Yingfan], Wang, Y.T.[Yi-Tong], Jia, J.Y.[Jia-Ya],
LLMGA: Multimodal Large Language Model Based Generation Assistant,
ECCV24(XXXVIII: 389-406).
Springer DOI 2412
BibRef

Wu, T.[Tianhe], Ma, K.[Kede], Liang, J.[Jie], Yang, Y.[Yujiu], Zhang, L.[Lei],
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment,
ECCV24(LXXIV: 143-160).
Springer DOI 2412
BibRef

Xu, J.[Jiacong], Lo, S.Y.[Shao-Yuan], Safaei, B.[Bardia], Patel, V.M.[Vishal M.], Dwivedi, I.[Isht],
Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models,
CVPR25(20370-20382)
IEEE DOI Code:
WWW Link. 2508
Visualization, Large language models, Benchmark testing, Inspection, Cognition, Anomaly detection, Tuning, Biomedical imaging, multimodal large language model BibRef

Yang, Y.C.[Yu-Chen], Lee, K.[Kwonjoon], Dariush, B.[Behzad], Cao, Y.[Yinzhi], Lo, S.Y.[Shao-Yuan],
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models,
ECCV24(LXXXI: 304-322).
Springer DOI 2412
BibRef

Zheng, S.[Sipeng], Zhou, B.[Bohan], Feng, Y.C.[Yi-Cheng], Wang, Y.[Ye], Lu, Z.Q.[Zong-Qing],
UniCode: Learning a Unified Codebook for Multimodal Large Language Models,
ECCV24(VIII: 426-443).
Springer DOI 2412
BibRef

Ren, Z.W.[Zhong-Wei], Huang, Z.C.[Zhi-Cheng], Wei, Y.C.[Yun-Chao], Zhao, Y.[Yao], Fu, D.M.[Dong-Mei], Feng, J.S.[Jia-Shi], Jin, X.J.[Xiao-Jie],
PixelLM: Pixel Reasoning with Large Multimodal Model,
CVPR24(26364-26373)
IEEE DOI 2410
Bridges, Image segmentation, Codes, Benchmark testing, Cognition, Decoding BibRef

Yue, X.[Xiang], Ni, Y.S.[Yuan-Sheng], Zheng, T.Y.[Tian-Yu], Zhang, K.[Kai], Liu, R.[Ruoqi], Zhang, G.[Ge], Stevens, S.[Samuel], Jiang, D.[Dongfu], Ren, W.M.[Wei-Ming], Sun, Y.X.[Yu-Xuan], Wei, C.[Cong], Yu, B.T.[Bo-Tao], Yuan, R.B.[Rui-Bin], Sun, R.L.[Ren-Liang], Yin, M.[Ming], Zheng, B.[Boyuan], Yang, Z.Z.[Zhen-Zhu], Liu, Y.[Yibo], Huang, W.H.[Wen-Hao], Sun, H.[Huan], Su, Y.[Yu], Chen, W.[Wenhu],
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,
CVPR24(9556-9567)
IEEE DOI 2410
Computational modeling, Artificial general intelligence, Social sciences, Manuals, Benchmark testing, Cognition, LLMs BibRef

Xia, Z.F.[Zhuo-Fan], Han, D.C.[Dong-Chen], Han, Y.Z.[Yi-Zeng], Pan, X.[Xuran], Song, S.[Shiji], Huang, G.[Gao],
GSVA: Generalized Segmentation via Multimodal Large Language Models,
CVPR24(3858-3869)
IEEE DOI Code:
WWW Link. 2410
Image segmentation, Visualization, Codes, Large language models, Benchmark testing BibRef

Du, Y.Y.[Yi-Yang], Wang, X.C.[Xiao-Chen], Chen, C.[Chi], Ye, J.[Jiabo], Wang, Y.[Yiru], Li, P.[Peng], Yan, M.[Ming], Zhang, J.[Ji], Huang, F.[Fei], Sui, Z.F.[Zhi-Fang], Sun, M.[Maosong], Liu, Y.[Yang],
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization,
CVPR25(9413-9422)
IEEE DOI 2508
Adaptation models, Interpolation, Large language models, Computational modeling, Merging, Estimation, Data models, model merging BibRef

Ye, Q.H.[Qing-Hao], Xu, H.Y.[Hai-Yang], Ye, J.[Jiabo], Yan, M.[Ming], Hu, A.[Anwen], Liu, H.[Haowei], Qian, Q.[Qi], Zhang, J.[Ji], Huang, F.[Fei],
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,
CVPR24(13040-13051)
IEEE DOI 2410
Large language models, Computational modeling, Collaboration, Cognition, Decoding, Vision Language BibRef

Qi, P.[Peng], Yan, Z.[Zehong], Hsu, W.[Wynne], Lee, M.L.[Mong Li],
Sniffer: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection,
CVPR24(13052-13062)
IEEE DOI 2410
Visualization, Adaptation models, Accuracy, Large language models, Computational modeling, Data models, multimodal misinformation, explainability BibRef

Li, B.[Bohao], Ge, Y.Y.[Yu-Ying], Ge, Y.X.[Yi-Xiao], Wang, G.Z.[Guang-Zhi], Wang, R.[Rui], Zhang, R.M.[Rui-Mao], Shan, Y.[Ying],
SEED-Bench: Benchmarking Multimodal Large Language Models,
CVPR24(13299-13308)
IEEE DOI Code:
WWW Link. 2410
Accuracy, Codes, Annotations, Image synthesis, Large language models, Computational modeling, Benchmark, Multimodal, Hierarchical BibRef

Mitra, C.[Chancharik], Huang, B.[Brandon], Darrell, T.J.[Trevor J.], Herzig, R.[Roei],
Compositional Chain-of-Thought Prompting for Large Multimodal Models,
CVPR24(14420-14431)
IEEE DOI Code:
WWW Link. 2410
Bridges, Visualization, Codes, Annotations, Large language models, Benchmark testing, Large Multimodal Models, Multimodality, Prompting BibRef

Li, X.Q.[Xiao-Qi], Xu, J.Y.[Jing-Yun], Zhang, M.X.[Ming-Xu], Liu, J.M.[Jia-Ming], Shen, Y.[Yan], Ponomarenko, I.[Iaroslav], Xu, J.H.[Jia-Hui], Heng, L.[Liang], Huang, S.Y.[Si-Yuan], Zhang, S.H.[Shang-Hang], Dong, H.[Hao],
Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation,
CVPR25(27638-27648)
IEEE DOI 2508
Training, Visualization, Natural languages, Manuals, Predictive models, Robustness, Libraries, Planning, Videos BibRef

Li, X.Q.[Xiao-Qi], Zhang, M.X.[Ming-Xu], Geng, Y.R.[Yi-Ran], Geng, H.R.[Hao-Ran], Long, Y.X.[Yu-Xing], Shen, Y.[Yan], Zhang, R.R.[Ren-Rui], Liu, J.M.[Jia-Ming], Dong, H.[Hao],
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,
CVPR24(18061-18070)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Large language models, Transforms, Predictive models, Robot sensing systems, Cognition, Embodied AI, Multi-modal Large Language Model BibRef

Taesiri, M.R.[Mohammad Reza], Feng, T.J.[Tian-Jun], Bezemer, C.P.[Cor-Paul], Nguyen, A.[Anh],
GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?,
CVPR24(22444-22455)
IEEE DOI Code:
WWW Link. 2410
Video games, Visualization, Quality assurance, Large language models, Benchmark testing, Linguistics, Cognition, game testing BibRef

Zhang, R.[Ruiyi], Zhang, Y.Z.[Yan-Zhe], Chen, J.[Jian], Zhou, Y.F.[Yu-Fan], Gu, J.X.[Jiu-Xiang], Chen, C.[Changyou], Sun, T.[Tong],
TRINS: Towards Multimodal Language Models that Can Read,
CVPR24(22584-22594)
IEEE DOI 2410
Visualization, Annotations, Large language models, Computational modeling, Optical character recognition, Training data BibRef

Zhang, Y.[Yichi], Dong, Y.P.[Yin-Peng], Zhang, S.Y.[Si-Yuan], Min, T.Z.[Tian-Zan], Su, H.[Hang], Zhu, J.[Jun],
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models,
CVPR24(26552-26562)
IEEE DOI 2410
Training, Visualization, Adaptation models, Computational modeling, Large language models, Semantics, Feature extraction, Transferability BibRef

Liang, T.[Tian], Huang, J.[Jing], Kong, M.[Ming], Chen, L.[Luyuan], Zhu, Q.[Qiang],
Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model,
CVPR24(26845-26855)
IEEE DOI Code:
WWW Link. 2410
Training, Bridges, Adaptation models, Technological innovation, Codes, Computational modeling, multimodal, large language model BibRef

Pi, R.J.[Ren-Jie], Yao, L.W.[Le-Wei], Gao, J.H.[Jia-Hui], Zhang, J.P.[Ji-Peng], Zhang, T.[Tong],
PerceptionGPT: Effectively Fusing Visual Perception Into LLM,
CVPR24(27114-27123)
IEEE DOI 2410
Training, Visualization, Accuracy, Large language models, Decoding, Multimodal Learning BibRef

Tai, Y.[Yan], Fan, W.C.[Wei-Chen], Zhang, Z.[Zhao], Liu, Z.W.[Zi-Wei],
Link-Context Learning for Multimodal LLMs,
CVPR24(27166-27175)
IEEE DOI 2410
Training, Image recognition, Large language models, Oral communication, Propulsion, Cognition BibRef

Jain, J.[Jitesh], Yang, J.W.[Jian-Wei], Shi, H.[Humphrey],
VCoder: Versatile Vision Encoders for Multimodal Large Language Models,
CVPR24(27992-28002)
IEEE DOI 2410
Training, Visualization, Image segmentation, Costs, Image synthesis, Large language models, Machine vision BibRef

Barbany, O.[Oriol], Huang, M.[Michael], Zhu, X.L.[Xin-Liang], Dhua, A.[Arnab],
Leveraging Large Language Models for Multimodal Search,
FGVC24(1201-1210)
IEEE DOI 2410
Large language models, Natural languages, Pipelines, Image retrieval, LLM, retrieval, fashion, multimodal BibRef

Baldassini, F.B.[Folco Bertini], Shukor, M.[Mustafa], Cord, M.[Matthieu], Soulier, L.[Laure], Piwowarski, B.[Benjamin],
What Makes Multimodal In-Context Learning Work?,
Prompting24(1539-1550)
IEEE DOI 2410
Training, Analytical models, Codes, Large language models, Impedance matching, Large Language Models, Shortcuts learning BibRef

Ma, F.P.[Fei-Peng], Zhou, Y.Z.[Yi-Zhou], Zhang, Y.Y.[Yue-Yi], Wu, S.Y.[Si-Ying], Zhang, Z.[Zheyu], He, Z.L.[Zi-Long], Rao, F.Y.[Feng-Yun], Sun, X.Y.[Xiao-Yan],
Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models,
Reasoning24(2248-2257)
IEEE DOI 2410
Training, Systematics, Navigation, Large language models, Training data, Language and Vision, Multi-modal Vision BibRef

Cha, J.[Junbum], Kang, W.[Wooyoung], Mun, J.[Jonghwan], Roh, B.[Byungseok],
Honeybee: Locality-Enhanced Projector for Multimodal LLM,
CVPR24(13817-13827)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Large language models, Benchmark testing, Tuning, Multimodal LLM, Vision-Language BibRef

Lai, C.G.[Chen-Gen], Song, S.L.[Sheng-Li], Yan, S.[Sitong], Hu, G.[Guangneng],
Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples,
ECCV24(LXIX: 174-191).
Springer DOI 2412
BibRef

Cao, J.J.[Jian-Jian], Ye, P.[Peng], Li, S.Z.[Sheng-Ze], Yu, C.[Chong], Tang, Y.S.[Yan-Song], Lu, J.W.[Ji-Wen], Chen, T.[Tao],
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer,
CVPR24(15710-15719)
IEEE DOI Code:
WWW Link. 2410
Degradation, Adaptation models, Visualization, Costs, Computational modeling, Semantics, Token Pruning, Model Compress BibRef

Sahin, U.[Ugur], Li, H.[Hang], Khan, Q.[Qadeer], Cremers, D.[Daniel], Tresp, V.[Volker],
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining,
WACV24(5551-5561)
IEEE DOI Code:
HTML Version. 2404
Training, Visualization, Codes, Pipelines, Self-supervised learning, Cognition, Algorithms, Vision + language and/or other modalities BibRef

Hu, Z.Z.[Zhi-Zhang], Zhu, X.L.[Xin-Liang], Tran, S.[Son], Vidal, R.[René], Dhua, A.[Arnab],
ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion,
CLVL23(2764-2769)
IEEE DOI 2401
BibRef

Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Large Language Models for Autonomous Driving, LLM, LVLM.


Last update: Nov 2, 2025 at 14:03:07