Wang, Z.[Zihao],
Cai, S.F.[Shao-Fei],
Liu, A.[Anji],
Jin, Y.G.[Yong-Gang],
Hou, J.[Jinbing],
Zhang, B.[Bowei],
Lin, H.[Haowei],
He, Z.F.[Zhao-Feng],
Zheng, Z.L.[Zi-Long],
Yang, Y.D.[Yao-Dong],
Ma, X.J.[Xiao-Jian],
Liang, Y.[Yitao],
JARVIS-1: Open-World Multi-Task Agents With Memory-Augmented
Multimodal Language Models,
PAMI(47), No. 3, March 2025, pp. 1894-1907.
IEEE DOI
2502
Planning, Diamond, Games, Complexity theory, Cognition, Accuracy,
Visualization, Reliability, Multitasking, Iron, Minecraft,
open-world agents
BibRef
Li, Y.X.[Yun-Xin],
Jiang, S.Y.[Shen-Yuan],
Hu, B.T.[Bao-Tian],
Wang, L.Y.[Long-Yue],
Zhong, W.Q.[Wan-Qi],
Luo, W.H.[Wen-Han],
Ma, L.[Lin],
Zhang, M.[Min],
Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts,
PAMI(47), No. 5, May 2025, pp. 3424-3439.
IEEE DOI
2504
Training, Data models, Computational modeling, Connectors,
Benchmark testing, Visualization, Tuning
BibRef
Huang, Z.Z.[Zhong-Zhan],
Zhong, S.S.[Shan-Shan],
Zhou, P.[Pan],
Gao, S.[Shanghua],
Zitnik, M.[Marinka],
Lin, L.[Liang],
A Causality-Aware Paradigm for Evaluating Creativity of Multimodal
Large Language Models,
PAMI(47), No. 5, May 2025, pp. 3830-3846.
IEEE DOI
2504
Creativity, Games, Cognition, Standards, Benchmark testing, Training,
Pipelines, Manuals, Large language models, Information leakage,
causal intervention
BibRef
Villani, F.[Francesco],
Maljkovic, I.[Igor],
Lazzaro, D.[Dario],
Sotgiu, A.[Angelo],
Cinà, A.E.[Antonio Emanuele],
Roli, F.[Fabio],
Robust image classification with multi-modal large language models,
PRL(194), 2025, pp. 1-7.
Elsevier DOI
2506
Adversarial machine learning, Robust classification,
Multimodal large language model, Multimodal information, TrustworthyAI
BibRef
Shao, Z.W.[Zhen-Wei],
Yu, Z.[Zhou],
Yu, J.[Jun],
Ouyang, X.C.[Xue-Cheng],
Zheng, L.[Lihao],
Gai, Z.B.[Zhen-Biao],
Wang, M.Y.[Ming-Yang],
Kuang, Z.Z.[Zhen-Zhong],
Ding, J.J.[Jia-Jun],
Imp: Highly Capable Large Multimodal Models for Mobile Devices,
MultMed(27), 2025, pp. 2961-2974.
IEEE DOI
2506
Visualization, Training, Data models, Computational modeling,
Training data, Connectors, Large language models, Mobile handsets,
vision-language models
BibRef
Ge, J.[Junyao],
Zhang, X.[Xu],
Zheng, Y.[Yang],
Guo, K.[Kaitai],
Liang, J.[Jimin],
RSTeller: Scaling up visual language modeling in remote sensing with
rich linguistic semantics from openly available data and large
language models,
PandRS(226), 2025, pp. 146-163.
Elsevier DOI Code:
WWW Link.
2506
Vision language model, Multimodal dataset, OpenStreetMap,
Google earth engine, Large language models
BibRef
Li, Z.S.[Zhen-Shi],
Muhtar, D.[Dilxat],
Gu, F.[Feng],
He, Y.L.X.[Yang-Lang-Xing],
Zhang, X.L.[Xue-Liang],
Xiao, P.F.[Peng-Feng],
He, G.[Guangjun],
Zhu, X.X.[Xiao-Xiang],
LHRS-Bot-Nova: Improved multimodal large language model for remote
sensing vision-language interpretation,
PandRS(227), 2025, pp. 539-550.
Elsevier DOI Code:
WWW Link.
2508
BibRef
Earlier: A2, A1, A3, A5, A6, Only:
Lhrs-bot: Empowering Remote Sensing with Vgi-enhanced Large Multimodal
Language Model,
ECCV24(LXXIV: 440-457).
Springer DOI
2412
Remote sensing, Earth observation,
Multimodal large language model, Vision-language dataset
BibRef
Li, X.[Xu],
Zheng, Y.[Yi],
Chen, H.T.[Hao-Tian],
Chen, X.L.[Xiao-Lei],
Liang, Y.X.[Yu-Xuan],
Lai, C.H.[Cheng-Hang],
Li, B.[Bin],
Xue, X.Y.[Xiang-Yang],
Instruction-guided fusion of multi-layer visual features in Large
Vision-Language Models,
PR(170), 2026, pp. 111932.
Elsevier DOI Code:
WWW Link.
2509
Large Vision-Language Models, Multimodal large language models,
Hierarchical feature utilization
BibRef
Zhang, W.Y.[Wen-Yao],
Wu, L.[Letian],
Zhang, Z.Q.[Ze-Qun],
Yu, T.[Tao],
Ma, C.[Chao],
Jin, X.[Xin],
Yang, X.K.[Xiao-Kang],
Zeng, W.J.[Wen-Jun],
Unleash the Power of Vision-Language Models by Visual Attention
Prompt and Multimodal Interaction,
MultMed(27), 2025, pp. 2399-2411.
IEEE DOI
2505
Visualization, Adaptation models, Tuning, Training, Computational modeling,
Tail, Pipelines, Overfitting, Nose, Attention, Vision-language models
BibRef
Weng, Y.[Yu],
He, W.B.[Wen-Bin],
Dong, J.[Jun],
Chaomurilige,
Liu, X.[Xuan],
Liu, Z.[Zheng],
Cross-Lingual Adaptation for Vision-Language Model via Multimodal
Semantic Distillation,
MultMed(27), 2025, pp. 3184-3196.
IEEE DOI
2506
Adaptation models, Multilingual, Visualization, Training, Semantics,
Data models, Natural language processing, Translation, zero-shot learning
BibRef
Liang, J.W.[Jia-Wei],
Liang, S.Y.[Si-Yuan],
Liu, A.S.[Ai-Shan],
Cao, X.C.[Xiao-Chun],
VL-Trojan: Multimodal Instruction Backdoor Attacks against
Autoregressive Visual Language Models,
IJCV(133), No. 7, July 2025, pp. 3994-4013.
Springer DOI
2506
BibRef
Li, L.[Lei],
Wei, Y.C.[Yuan-Cheng],
Xie, Z.H.[Zhi-Hui],
Yang, X.[Xuqing],
Song, Y.F.[Yi-Fan],
Wang, P.[Peiyi],
An, C.X.[Chen-Xin],
Liu, T.Y.[Tian-Yu],
Li, S.[Sujian],
Lin, B.Y.C.[Bill Yu-Chen],
Kong, L.P.[Ling-Peng],
Liu, Q.[Qi],
VL-RewardBench: A Challenging Benchmark for Vision-Language
Generative Reward Models,
CVPR25(24657-24668)
IEEE DOI Code:
WWW Link.
2508
Training, Analytical models, Visualization, Accuracy, Pipelines,
Benchmark testing, Cognition, Reliability, Probes, Visual perception,
multimodal large language models
BibRef
Yang, C.[Cheng],
Sui, Y.[Yang],
Xiao, J.Q.[Jin-Qi],
Huang, L.[Lingyi],
Gong, Y.[Yu],
Li, C.[Chendi],
Yan, J.H.[Jing-Hua],
Bai, Y.[Yu],
Sadayappan, P.[Ponnuswamy],
Hu, X.[Xia],
Yuan, B.[Bo],
TopV: Compatible Token Pruning with Inference Time Optimization for
Fast and Low-Memory Multimodal Vision Language Model,
CVPR25(19803-19813)
IEEE DOI
2508
Training, Visualization, Computational modeling, Memory management,
Cost function, Cache storage
BibRef
Hong, W.[Wenyi],
Cheng, Y.[Yean],
Yang, Z.[Zhuoyi],
Luo, Z.Y.[Zi-Yang],
Wu, H.N.[Hao-Ning],
Li, D.X.[Dong-Xu],
Ma, J.[Jing],
Kankanhalli, M.[Mohan],
Li, J.[Junnan],
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal
Models in Video Analysis through User Simulation,
CVPR25(8461-8474)
IEEE DOI
2508
Analytical models, Costs, Annotations, Computational modeling,
Scalability, Benchmark testing, large multimodal models
BibRef
Tian, J.[Jirui],
Zhang, J.R.[Jin-Rong],
Liu, S.[Shenglan],
Xu, L.[Luhao],
Huang, Z.X.[Zhi-Xiong],
Huang, G.[Gao],
DTOS: Dynamic Time Object Sensing with Large Multimodal Model,
CVPR25(13810-13820)
IEEE DOI Code:
WWW Link.
2508
Location awareness, Visualization, Large language models, Robustness,
Spatiotemporal phenomena, Sensors, Spatial resolution, Videos
BibRef
Li, M.[Ming],
Zhong, J.[Jike],
Chen, T.[Tianle],
Lai, Y.X.[Yu-Xiang],
Psounis, K.[Konstantinos],
EEE-Bench: A Comprehensive Multimodal Electrical And Electronics
Engineering Benchmark,
CVPR25(13337-13349)
IEEE DOI
2508
Visualization, Foundation models, Large language models,
Benchmark testing, Control systems, Mathematical models
BibRef
Liu, Z.H.[Zhi-Hang],
Xie, C.W.[Chen-Wei],
Li, P.[Pandeng],
Zhao, L.M.[Li-Ming],
Tang, L.X.[Long-Xiang],
Zheng, Y.[Yun],
Liu, C.B.[Chuan-Bin],
Xie, H.T.[Hong-Tao],
Hybrid-Level Instruction Injection for Video Token Compression in
Multi-modal Large Language Models,
CVPR25(8568-8578)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Image coding, Codes, Large language models,
Benchmark testing, Computational efficiency, Videos, efficiency
BibRef
Ma, Y.Y.[Yi-Yang],
Liu, X.C.[Xing-Chao],
Chen, X.K.[Xiao-Kang],
Liu, W.[Wen],
Wu, C.Y.[Cheng-Yue],
Wu, Z.Y.[Zhi-Yu],
Pan, Z.Z.[Zi-Zheng],
Xie, Z.[Zhenda],
Zhang, H.[Haowei],
Yu, X.K.[Xing-Kai],
Zhao, L.[Liang],
Wang, Y.S.[Yi-Song],
Liu, J.Y.[Jia-Ying],
Ruan, C.[Chong],
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation,
CVPR25(7739-7751)
IEEE DOI
2508
Training, Computational modeling, Large language models
BibRef
Farina, M.[Matteo],
Mancini, M.[Massimiliano],
Iacca, G.[Giovanni],
Ricci, E.[Elisa],
Rethinking Few-Shot Adaptation of Vision-Language Models in Two
Stages,
CVPR25(29989-29998)
IEEE DOI
2508
Training, Adaptation models, Benchmark testing, Feature extraction,
Rendering (computer graphics), Robustness, few-shot learning,
multimodal learning
BibRef
Zhang, Z.[Zhi],
Yadav, S.[Srishti],
Han, F.Z.[Feng-Ze],
Shutova, E.[Ekaterina],
Cross-modal Information Flow in Multimodal Large Language Models,
CVPR25(19781-19791)
IEEE DOI Code:
WWW Link.
2508
Location awareness, Visualization, Large language models,
Computational modeling, Focusing, Linguistics, Predictive models,
inner working mechanism
BibRef
Fang, Y.[Yi],
Jin, B.[Bowen],
Shen, J.C.[Jia-Cheng],
Ding, S.[Sirui],
Tan, Q.[Qiaoyu],
Han, J.W.[Jia-Wei],
GraphGPT-o: Synergistic Multimodal Comprehension and Generation on
Graphs,
CVPR25(19467-19476)
IEEE DOI Code:
WWW Link.
2508
Codes, Image synthesis, Large language models, Semantics, Transforms,
Encoding, Explosions, Electronic commerce, Data mining, multimodal,
multimodal large language model
BibRef
Hao, H.R.[Hao-Ran],
Han, J.M.[Jia-Ming],
Li, C.S.[Chang-Sheng],
Li, Y.F.[Yu-Feng],
Yue, X.Y.[Xiang-Yu],
RAP: Retrieval-Augmented Personalization for Multimodal Large
Language Models,
CVPR25(14538-14548)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Image recognition, Databases,
Large language models, Pipelines, Oral communication,
retrieval-augmented generation
BibRef
Tong, B.[Bo],
Lai, B.[Bokai],
Zhou, Y.[Yiyi],
Luo, G.[Gen],
Shen, Y.H.[Yun-Hang],
Li, K.[Ke],
Sun, X.S.[Xiao-Shuai],
Ji, R.R.[Rong-Rong],
FlashSloth: Lightning Multimodal Large Language Models via Embedded
Visual Compression,
CVPR25(14570-14581)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Image coding, Codes, Large language models,
Semantics, Lightning, Computational complexity, visual compression
BibRef
Szot, A.[Andrew],
Mazoure, B.[Bogdan],
Attia, O.[Omar],
Timofeev, A.[Aleksei],
Agrawal, H.[Harsh],
Hjelm, D.[Devon],
Gan, Z.[Zhe],
Kira, Z.[Zsolt],
Toshev, A.[Alexander],
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons,
CVPR25(10644-10655)
IEEE DOI
2508
Training, Adaptation models, Video games, Navigation,
Large language models, Supervised learning, Benchmark testing
BibRef
Gholami, M.[Mohsen],
Akbari, M.[Mohammad],
Cannons, K.[Kevin],
Zhang, Y.[Yong],
CASP: Compression of Large Multimodal Models Based on Attention
Sparsity,
CVPR25(9372-9381)
IEEE DOI
2508
Quantization (signal), Image coding, Large language models,
Bit rate, Benchmark testing, Sparse matrices, Matrix decomposition,
2-bit quantization
BibRef
Jia, H.R.[Hong-Rui],
Jiang, C.[Chaoya],
Xu, H.Y.[Hai-Yang],
Ye, W.[Wei],
Dong, M.F.[Meng-Fan],
Yan, M.[Ming],
Zhang, J.[Ji],
Huang, F.[Fei],
Zhang, S.K.[Shi-Kun],
SymDPO: Boosting In-Context Learning of Large Multimodal Models with
Symbol Demonstration Direct Preference Optimization,
CVPR25(9361-9371)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Face recognition, Symbols,
Optimization methods, Benchmark testing, Performance gain,
Context modeling
BibRef
Alvar, S.R.[Saeed Ranjbar],
Singh, G.[Gursimran],
Akbari, M.[Mohammad],
Zhang, Y.[Yong],
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal
Models,
CVPR25(9392-9401)
IEEE DOI
2508
Measurement, Visualization, Accuracy, Large language models,
Redundancy, Memory management, Minimax techniques, Data models,
inference optimization
BibRef
Zhang, Z.F.[Ze-Feng],
Tang, H.Z.[Heng-Zhu],
Sheng, J.W.[Jia-Wei],
Zhang, Z.Y.[Zhen-Yu],
Ren, Y.M.[Yi-Ming],
Li, Z.Y.[Zhen-Yang],
Yin, D.W.[Da-Wei],
Ma, D.[Duohe],
Liu, T.W.[Ting-Wen],
Debiasing Multimodal Large Language Models via Noise-Aware Preference
Optimization,
CVPR25(9423-9433)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Perturbation methods, Noise,
Robustness, Noise robustness, Optimization, Resilience, Noise level
BibRef
Jiao, Q.[Qirui],
Chen, D.[Daoyuan],
Huang, Y.L.[Yi-Lun],
Ding, B.L.[Bo-Lin],
Li, Y.[Yaliang],
Shen, Y.[Ying],
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
Models,
CVPR25(9296-9307)
IEEE DOI
2508
Visualization, Fine-grained image recognition,
Large language models, Contrastive learning, Benchmark testing,
visual instruction tuning dataset
BibRef
Ye, X.[Xubing],
Gan, Y.[Yukang],
Ge, Y.X.[Yi-Xiao],
Zhang, X.P.[Xiao-Ping],
Tang, Y.S.[Yan-Song],
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models,
CVPR25(24972-24982)
IEEE DOI
2508
Degradation, Adaptation models, Visualization,
Computational modeling, Large language models, Redundancy,
multimodal learning
BibRef
Luo, G.[Gen],
Yang, X.[Xue],
Dou, W.H.[Wen-Han],
Wang, Z.K.[Zhao-Kai],
Liu, J.W.[Jia-Wen],
Dai, J.F.[Ji-Feng],
Qiao, Y.[Yu],
Zhu, X.Z.[Xi-Zhou],
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large
Language Models with Endogenous Visual Pre-training,
CVPR25(24960-24971)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Benchmark testing, Encoding,
Decoding, Noise measurement, Optimization, multimodal models,
vision language models
BibRef
Qi, D.[Daiqing],
Zhao, H.[Handong],
Shi, J.[Jing],
Jenni, S.[Simon],
Fan, Y.F.[Yi-Fei],
Dernoncourt, F.[Franck],
Cohen, S.[Scott],
Li, S.[Sheng],
The Photographer's Eye: Teaching Multimodal Large Language Models to
See and Critique like Photographers,
CVPR25(24807-24816)
IEEE DOI
2508
Location awareness, Visualization, Image color analysis,
Large language models, Education, Lighting, Benchmark testing,
image quality assessment
BibRef
Liu, S.[Shaoyu],
Li, J.N.[Jia-Ning],
Zhao, G.H.[Guang-Hui],
Zhang, Y.J.[Yun-Jian],
Meng, X.[Xin],
Yu, F.R.[Fei Richard],
Ji, X.Y.[Xiang-Yang],
Li, M.[Ming],
EventGPT: Event Stream Understanding with Multimodal Large Language
Models,
CVPR25(29139-29149)
IEEE DOI
2508
Bridges, Training, Adaptation models, Visualization,
Large language models, Pipelines, Lighting, Optimization, Synthetic data
BibRef
Zhao, S.Y.[Shi-Yu],
Wang, Z.[Zhenting],
Juefei-Xu, F.[Felix],
Xia, X.[Xide],
Liu, M.[Miao],
Wang, X.F.[Xiao-Fang],
Liang, M.[Mingfu],
Zhang, N.[Ning],
Metaxas, D.N.[Dimitris N.],
Yu, L.C.[Li-Cheng],
Accelerating Multimodal Large Language Models by Searching Optimal
Vision Token Reduction,
CVPR25(29869-29879)
IEEE DOI
2508
Image resolution, Large language models, Benchmark testing,
Computational efficiency, Bayes methods, Feeds, Optimization,
efficiency
BibRef
Yan, Z.[Ziang],
Li, Z.L.[Zhi-Lin],
He, Y.[Yinan],
Wang, C.T.[Chen-Ting],
Li, K.[Kunchang],
Li, X.H.[Xin-Hao],
Zeng, X.Y.[Xiang-Yu],
Wang, Z.[Zilei],
Wang, Y.[Yali],
Qiao, Y.[Yu],
Wang, L.M.[Li-Min],
Wang, Y.[Yi],
Task Preference Optimization: Improving Multimodal Large Language
Models with Vision Task Alignment,
CVPR25(29880-29892)
IEEE DOI
2508
Training, Visualization, Large language models, Scalability,
Contrastive learning, Multitasking, Data models, Optimization
BibRef
Chen, C.[Cheng],
Zhai, Y.P.[Yun-Peng],
Zhao, Y.F.[Yi-Fan],
Gao, J.Y.[Jin-Yang],
Ding, B.L.[Bo-Lin],
Li, J.[Jia],
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation
In-Context Learning,
CVPR25(3826-3835)
IEEE DOI
2508
Visualization, Fuses, Large language models, Face recognition,
Refining, Redundancy, Stochastic processes, Reinforcement learning,
large vision-language model
BibRef
Zhang, Y.T.[Yu-Ting],
Lu, H.[Hao],
Hu, Q.Y.[Qing-Yong],
Wang, Y.[Yin],
Yuan, K.[Kaishen],
Liu, X.[Xin],
Wu, K.[Kaishun],
Period-LLM: Extending the Periodic Capability of Multimodal Large
Language Model,
CVPR25(29237-29247)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Analytical models, Large language models,
Semantics, Refining, Cognition, Physiology, Optimization, Periodic structures
BibRef
Lin, J.[Junyan],
Chen, H.R.[Hao-Ran],
Fan, Y.[Yue],
Fan, Y.Q.[Ying-Qi],
Jin, X.[Xin],
Su, H.[Hui],
Fu, J.[Jinlan],
Shen, X.Y.[Xiao-Yu],
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods,
Analysis, and Best Practices,
CVPR25(4156-4166)
IEEE DOI Code:
WWW Link.
2508
Training, Degradation, Visualization, Large language models,
Optical character recognition, Focusing, Nonhomogeneous media,
multi-layer visual feature
BibRef
Zhao, Q.Q.[Qing-Qing],
Lu, Y.[Yao],
Kim, M.J.[Moo Jin],
Fu, Z.[Zipeng],
Zhang, Z.Y.[Zhuo-Yang],
Wu, Y.[Yecheng],
Li, Z.S.[Zhao-Shuo],
Ma, Q.L.[Qian-Li],
Han, S.[Song],
Finn, C.[Chelsea],
Handa, A.[Ankur],
Lin, T.Y.[Tsung-Yi],
Wetzstein, G.[Gordon],
Liu, M.Y.[Ming-Yu],
Xiang, D.L.[Dong-Lai],
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action
Models,
CVPR25(1702-1713)
IEEE DOI Code:
WWW Link.
2508
Visualization, Computational modeling, Predictive models,
Benchmark testing, Robot sensing systems, Cognition, Planning,
multimodal large language models
BibRef
Lu, X.D.[Xu-Dong],
Chen, Y.H.[Ying-Hao],
Chen, C.[Cheng],
Tan, H.[Hui],
Chen, B.[Boheng],
Xie, Y.[Yina],
Hu, R.[Rui],
Tan, G.X.[Guan-Xin],
Wu, R.S.[Ren-Shou],
Hu, Y.[Yan],
Zeng, Y.[Yi],
Wu, L.[Lei],
Bian, L.Y.[Liu-Yang],
Wang, Z.X.[Zhao-Xiong],
Liu, L.[Long],
Yang, Y.Z.[Yan-Zhou],
Xiao, H.[Han],
Zhou, A.[Aojun],
Wen, Y.F.[Ya-Fei],
Chen, X.X.[Xiao-Xin],
Ren, S.[Shuai],
Li, H.S.[Hong-Sheng],
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices,
CVPR25(4145-4155)
IEEE DOI
2508
Performance evaluation, Quantization (signal), Large language models,
Computational modeling, Mobile handsets, model deployment
BibRef
Chen, S.[Shuo],
Han, Z.[Zhen],
He, B.[Bailan],
Liu, J.Z.[Jian-Zhe],
Buckley, M.[Mark],
Qin, Y.[Yao],
Torr, P.[Philip],
Tresp, V.[Volker],
Gu, J.D.[Jin-Dong],
Can Multimodal Large Language Models Truly Perform Multimodal
In-Context Learning?,
WACV25(6000-6010)
IEEE DOI
2505
Visualization, Large language models, Computational modeling,
multimodal large language models, in-context learning
BibRef
Wang, C.Y.[Chen-Yu],
Luo, W.X.[Wei-Xin],
Dong, S.[Sixun],
Xuan, X.H.[Xiao-Hua],
Li, Z.X.[Zheng-Xin],
Ma, L.[Lin],
Gao, S.H.[Sheng-Hua],
MLLM-Tool: A Multimodal Large Language Model for Tool Agent Learning,
WACV25(6678-6687)
IEEE DOI
2505
Codes, Large language models, Natural languages,
Oral communication, Benchmark testing, Encoding
BibRef
Liu, S.L.[Shi-Long],
Cheng, H.[Hao],
Liu, H.T.[Hao-Tian],
Zhang, H.[Hao],
Li, F.[Feng],
Ren, T.[Tianhe],
Zou, X.[Xueyan],
Yang, J.W.[Jian-Wei],
Su, H.[Hang],
Zhu, J.[Jun],
Zhang, L.[Lei],
Gao, J.F.[Jian-Feng],
Li, C.Y.[Chun-Yuan],
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents,
ECCV24(XLVII: 126-142).
Springer DOI
2412
BibRef
Cai, R.[Rizhao],
Song, Z.[Zirui],
Guan, D.[Dayan],
Chen, Z.H.[Zhen-Hao],
Li, Y.H.[Yao-Hang],
Luo, X.[Xing],
Yi, C.Y.[Chen-Yu],
Kot, A.C.[Alex C.],
BenchLMM: Benchmarking Cross-Style Visual Capability of Large
Multimodal Models,
ECCV24(L: 340-358).
Springer DOI
2412
BibRef
Yu, E.[En],
Zhao, L.[Liang],
Wei, Y.[Yana],
Yang, J.R.[Jin-Rong],
Wu, D.M.[Dong-Ming],
Kong, L.Y.[Ling-Yu],
Wang, T.[Tiancai],
Ge, Z.[Zheng],
Zhang, X.Y.[Xiang-Yu],
Tao, W.B.[Wen-Bing],
Merlin: Empowering Multimodal LLMs with Foresight Minds,
ECCV24(IV: 425-443).
Springer DOI
2412
BibRef
Song, K.[Kunpeng],
Zhu, Y.Z.[Yi-Zhe],
Liu, B.C.[Bing-Chen],
Yan, Q.[Qing],
Elgammal, A.[Ahmed],
Yang, X.[Xiao],
MOMA: Multimodal LLM Adapter for Fast Personalized Image Generation,
ECCV24(XL: 117-132).
Springer DOI
2412
BibRef
Gou, Y.H.[Yun-Hao],
Chen, K.[Kai],
Liu, Z.[Zhili],
Hong, L.Q.[Lan-Qing],
Xu, H.[Hang],
Li, Z.G.[Zhen-Guo],
Yeung, D.Y.[Dit-Yan],
Kwok, J.T.[James T.],
Zhang, Y.[Yu],
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text
Transformation,
ECCV24(XVII: 388-404).
Springer DOI
2412
BibRef
Wang, D.S.[Dong-Sheng],
Cui, J.[Jiequan],
Li, M.[Miaoge],
Lin, W.[Wang],
Chen, B.[Bo],
Zhang, H.W.[Han-Wang],
Instruction Tuning-free Visual Token Complement for Multimodal LLMs,
ECCV24(LXXXI: 446-462).
Springer DOI
2412
BibRef
McKinzie, B.[Brandon],
Gan, Z.[Zhe],
Fauconnier, J.P.[Jean-Philippe],
Dodge, S.[Sam],
Zhang, B.[Bowen],
Dufter, P.[Philipp],
Shah, D.[Dhruti],
Du, X.Z.[Xian-Zhi],
Peng, F.[Futang],
Belyi, A.[Anton],
Zhang, H.T.[Hao-Tian],
Singh, K.[Karanjeet],
Kang, D.[Doug],
Hè, H.Y.[Hong-Yu],
Schwarzer, M.[Max],
Gunter, T.[Tom],
Kong, X.[Xiang],
Zhang, A.[Aonan],
Wang, J.Y.[Jian-Yu],
Wang, C.[Chong],
Du, N.[Nan],
Lei, T.[Tao],
Wiseman, S.[Sam],
Lee, M.[Mark],
Wang, Z.[Zirui],
Pang, R.[Ruoming],
Grasch, P.[Peter],
Toshev, A.[Alexander],
Yang, Y.F.[Yin-Fei],
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training,
ECCV24(XXIX: 304-323).
Springer DOI
2412
BibRef
Wang, Y.[Yu],
Liu, X.G.[Xiao-Geng],
Li, Y.[Yu],
Chen, M.[Muhao],
Xiao, C.W.[Chao-Wei],
AdaShield: Safeguarding Multimodal Large Language Models from
Structure-based Attack via Adaptive Shield Prompting,
ECCV24(XX: 77-94).
Springer DOI
2412
BibRef
Zhao, H.H.[Henry Hengyuan],
Zhou, P.[Pan],
Shou, M.Z.[Mike Zheng],
Genixer: Empowering Multimodal Large Language Model as a Powerful Data
Generator,
ECCV24(XXIII: 129-147).
Springer DOI
2412
BibRef
Fu, X.Y.[Xing-Yu],
Hu, Y.S.[Yu-Shi],
Li, B.Z.[Bang-Zheng],
Feng, Y.[Yu],
Wang, H.Y.[Hao-Yu],
Lin, X.D.[Xu-Dong],
Roth, D.[Dan],
Smith, N.A.[Noah A.],
Ma, W.C.[Wei-Chiu],
Krishna, R.[Ranjay],
Blink: Multimodal Large Language Models Can See but Not Perceive,
ECCV24(XXIII: 148-166).
Springer DOI
2412
BibRef
Zhang, Z.K.[Zhi-Kai],
Li, Y.T.[Yi-Tang],
Huang, H.F.[Hao-Feng],
Lin, M.X.[Ming-Xian],
Yi, L.[Li],
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large
Language Models,
ECCV24(XXIII: 403-421).
Springer DOI
2412
BibRef
Pi, R.J.[Ren-Jie],
Han, T.Y.[Tian-Yang],
Xiong, W.[Wei],
Zhang, J.P.[Ji-Peng],
Liu, R.T.[Run-Tao],
Pan, R.[Rui],
Zhang, T.[Tong],
Strengthening Multimodal Large Language Model with Bootstrapped
Preference Optimization,
ECCV24(XXXIII: 382-398).
Springer DOI
2412
BibRef
Xia, B.[Bin],
Wang, S.Y.[Shi-Yin],
Tao, Y.[Yingfan],
Wang, Y.T.[Yi-Tong],
Jia, J.Y.[Jia-Ya],
LLMGA: Multimodal Large Language Model Based Generation Assistant,
ECCV24(XXXVIII: 389-406).
Springer DOI
2412
BibRef
Wu, T.[Tianhe],
Ma, K.[Kede],
Liang, J.[Jie],
Yang, Y.[Yujiu],
Zhang, L.[Lei],
A Comprehensive Study of Multimodal Large Language Models for Image
Quality Assessment,
ECCV24(LXXIV: 143-160).
Springer DOI
2412
BibRef
Xu, J.[Jiacong],
Lo, S.Y.[Shao-Yuan],
Safaei, B.[Bardia],
Patel, V.M.[Vishal M.],
Dwivedi, I.[Isht],
Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal
Large Language Models,
CVPR25(20370-20382)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Benchmark testing,
Inspection, Cognition, Anomaly detection, Tuning, Biomedical imaging,
multimodal large language model
BibRef
Yang, Y.C.[Yu-Chen],
Lee, K.[Kwonjoon],
Dariush, B.[Behzad],
Cao, Y.[Yinzhi],
Lo, S.Y.[Shao-Yuan],
Follow the Rules: Reasoning for Video Anomaly Detection with Large
Language Models,
ECCV24(LXXXI: 304-322).
Springer DOI
2412
BibRef
Zheng, S.[Sipeng],
Zhou, B.[Bohan],
Feng, Y.C.[Yi-Cheng],
Wang, Y.[Ye],
Lu, Z.Q.[Zong-Qing],
UniCode: Learning a Unified Codebook for Multimodal Large Language
Models,
ECCV24(VIII: 426-443).
Springer DOI
2412
BibRef
Ren, Z.W.[Zhong-Wei],
Huang, Z.C.[Zhi-Cheng],
Wei, Y.C.[Yun-Chao],
Zhao, Y.[Yao],
Fu, D.M.[Dong-Mei],
Feng, J.S.[Jia-Shi],
Jin, X.J.[Xiao-Jie],
PixelLM: Pixel Reasoning with Large Multimodal Model,
CVPR24(26364-26373)
IEEE DOI
2410
Bridges, Image segmentation, Codes, Benchmark testing, Cognition, Decoding
BibRef
Yue, X.[Xiang],
Ni, Y.S.[Yuan-Sheng],
Zheng, T.Y.[Tian-Yu],
Zhang, K.[Kai],
Liu, R.[Ruoqi],
Zhang, G.[Ge],
Stevens, S.[Samuel],
Jiang, D.[Dongfu],
Ren, W.M.[Wei-Ming],
Sun, Y.X.[Yu-Xuan],
Wei, C.[Cong],
Yu, B.T.[Bo-Tao],
Yuan, R.B.[Rui-Bin],
Sun, R.L.[Ren-Liang],
Yin, M.[Ming],
Zheng, B.[Boyuan],
Yang, Z.Z.[Zhen-Zhu],
Liu, Y.[Yibo],
Huang, W.H.[Wen-Hao],
Sun, H.[Huan],
Su, Y.[Yu],
Chen, W.[Wenhu],
MMMU: A Massive Multi-Discipline Multimodal Understanding and
Reasoning Benchmark for Expert AGI,
CVPR24(9556-9567)
IEEE DOI
2410
Computational modeling, Artificial general intelligence,
Social sciences, Manuals, Benchmark testing, Cognition, LLMs
BibRef
Xia, Z.F.[Zhuo-Fan],
Han, D.C.[Dong-Chen],
Han, Y.Z.[Yi-Zeng],
Pan, X.[Xuran],
Song, S.[Shiji],
Huang, G.[Gao],
GSVA: Generalized Segmentation via Multimodal Large Language Models,
CVPR24(3858-3869)
IEEE DOI Code:
WWW Link.
2410
Image segmentation, Visualization, Codes, Large language models,
Benchmark testing
BibRef
Du, Y.Y.[Yi-Yang],
Wang, X.C.[Xiao-Chen],
Chen, C.[Chi],
Ye, J.[Jiabo],
Wang, Y.[Yiru],
Li, P.[Peng],
Yan, M.[Ming],
Zhang, J.[Ji],
Huang, F.[Fei],
Sui, Z.F.[Zhi-Fang],
Sun, M.[Maosong],
Liu, Y.[Yang],
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language
Models with Unsupervised Coefficient Optimization,
CVPR25(9413-9422)
IEEE DOI
2508
Adaptation models, Interpolation, Large language models,
Computational modeling, Merging, Estimation, Data models,
model merging
BibRef
Ye, Q.H.[Qing-Hao],
Xu, H.Y.[Hai-Yang],
Ye, J.[Jiabo],
Yan, M.[Ming],
Hu, A.[Anwen],
Liu, H.[Haowei],
Qian, Q.[Qi],
Zhang, J.[Ji],
Huang, F.[Fei],
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
Modality Collaboration,
CVPR24(13040-13051)
IEEE DOI
2410
Large language models, Computational modeling, Collaboration,
Cognition, Decoding, Vision Language
BibRef
Qi, P.[Peng],
Yan, Z.[Zehong],
Hsu, W.[Wynne],
Lee, M.L.[Mong Li],
Sniffer: Multimodal Large Language Model for Explainable
Out-of-Context Misinformation Detection,
CVPR24(13052-13062)
IEEE DOI
2410
Visualization, Adaptation models, Accuracy, Large language models,
Computational modeling, Data models, multimodal misinformation,
explainability
BibRef
Li, B.[Bohao],
Ge, Y.Y.[Yu-Ying],
Ge, Y.X.[Yi-Xiao],
Wang, G.Z.[Guang-Zhi],
Wang, R.[Rui],
Zhang, R.M.[Rui-Mao],
Shan, Y.[Ying],
SEED-Bench: Benchmarking Multimodal Large Language Models,
CVPR24(13299-13308)
IEEE DOI Code:
WWW Link.
2410
Accuracy, Codes, Annotations, Image synthesis, Large language models,
Computational modeling, Benchmark, Multimodal, Hierarchical
BibRef
Mitra, C.[Chancharik],
Huang, B.[Brandon],
Darrell, T.J.[Trevor J.],
Herzig, R.[Roei],
Compositional Chain-of-Thought Prompting for Large Multimodal Models,
CVPR24(14420-14431)
IEEE DOI Code:
WWW Link.
2410
Bridges, Visualization, Codes, Annotations, Large language models,
Benchmark testing, Large Multimodal Models, Multimodality, Prompting
BibRef
Li, X.Q.[Xiao-Qi],
Xu, J.Y.[Jing-Yun],
Zhang, M.X.[Ming-Xu],
Liu, J.M.[Jia-Ming],
Shen, Y.[Yan],
Ponomarenko, I.[Iaroslav],
Xu, J.H.[Jia-Hui],
Heng, L.[Liang],
Huang, S.Y.[Si-Yuan],
Zhang, S.H.[Shang-Hang],
Dong, H.[Hao],
Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic
Manipulation,
CVPR25(27638-27648)
IEEE DOI
2508
Training, Visualization, Natural languages, Manuals,
Predictive models, Robustness, Libraries, Planning, Videos
BibRef
Li, X.Q.[Xiao-Qi],
Zhang, M.X.[Ming-Xu],
Geng, Y.R.[Yi-Ran],
Geng, H.R.[Hao-Ran],
Long, Y.X.[Yu-Xing],
Shen, Y.[Yan],
Zhang, R.R.[Ren-Rui],
Liu, J.M.[Jia-Ming],
Dong, H.[Hao],
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric
Robotic Manipulation,
CVPR24(18061-18070)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Large language models, Transforms,
Predictive models, Robot sensing systems, Cognition, Embodied AI,
Multi-modal Large Language Model
BibRef
Taesiri, M.R.[Mohammad Reza],
Feng, T.J.[Tian-Jun],
Bezemer, C.P.[Cor-Paul],
Nguyen, A.[Anh],
GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?,
CVPR24(22444-22455)
IEEE DOI Code:
WWW Link.
2410
Video games, Visualization, Quality assurance, Large language models,
Benchmark testing, Linguistics, Cognition, game testing
BibRef
Zhang, R.[Ruiyi],
Zhang, Y.Z.[Yan-Zhe],
Chen, J.[Jian],
Zhou, Y.F.[Yu-Fan],
Gu, J.X.[Jiu-Xiang],
Chen, C.[Changyou],
Sun, T.[Tong],
TRINS: Towards Multimodal Language Models that Can Read,
CVPR24(22584-22594)
IEEE DOI
2410
Visualization, Annotations, Large language models,
Computational modeling, Optical character recognition, Training data
BibRef
Zhang, Y.[Yichi],
Dong, Y.P.[Yin-Peng],
Zhang, S.Y.[Si-Yuan],
Min, T.Z.[Tian-Zan],
Su, H.[Hang],
Zhu, J.[Jun],
Exploring the Transferability of Visual Prompting for Multimodal
Large Language Models,
CVPR24(26552-26562)
IEEE DOI
2410
Training, Visualization, Adaptation models, Computational modeling,
Large language models, Semantics, Feature extraction, Transferability
BibRef
Liang, T.[Tian],
Huang, J.[Jing],
Kong, M.[Ming],
Chen, L.[Luyuan],
Zhu, Q.[Qiang],
Querying as Prompt: Parameter-Efficient Learning for Multimodal
Language Model,
CVPR24(26845-26855)
IEEE DOI Code:
WWW Link.
2410
Training, Bridges, Adaptation models, Technological innovation,
Codes, Computational modeling, multimodal, large language model
BibRef
Pi, R.J.[Ren-Jie],
Yao, L.W.[Le-Wei],
Gao, J.H.[Jia-Hui],
Zhang, J.P.[Ji-Peng],
Zhang, T.[Tong],
PerceptionGPT: Effectively Fusing Visual Perception Into LLM,
CVPR24(27114-27123)
IEEE DOI
2410
Training, Visualization, Accuracy, Large language models,
Decoding, Multimodal Learning
BibRef
Tai, Y.[Yan],
Fan, W.C.[Wei-Chen],
Zhang, Z.[Zhao],
Liu, Z.W.[Zi-Wei],
Link-Context Learning for Multimodal LLMs,
CVPR24(27166-27175)
IEEE DOI
2410
Training, Image recognition, Large language models,
Oral communication, Propulsion, Cognition
BibRef
Jain, J.[Jitesh],
Yang, J.W.[Jian-Wei],
Shi, H.[Humphrey],
VCoder: Versatile Vision Encoders for Multimodal Large Language
Models,
CVPR24(27992-28002)
IEEE DOI
2410
Training, Visualization, Image segmentation, Costs, Image synthesis,
Large language models, Machine vision
BibRef
Barbany, O.[Oriol],
Huang, M.[Michael],
Zhu, X.L.[Xin-Liang],
Dhua, A.[Arnab],
Leveraging Large Language Models for Multimodal Search,
FGVC24(1201-1210)
IEEE DOI
2410
Large language models, Natural languages, Pipelines,
Image retrieval, LLM, retrieval, fashion,
multimodal
BibRef
Baldassini, F.B.[Folco Bertini],
Shukor, M.[Mustafa],
Cord, M.[Matthieu],
Soulier, L.[Laure],
Piwowarski, B.[Benjamin],
What Makes Multimodal In-Context Learning Work?,
Prompting24(1539-1550)
IEEE DOI
2410
Training, Analytical models, Codes, Large language models,
Impedance matching, Large Language Models, Shortcuts learning
BibRef
Ma, F.P.[Fei-Peng],
Zhou, Y.Z.[Yi-Zhou],
Zhang, Y.Y.[Yue-Yi],
Wu, S.Y.[Si-Ying],
Zhang, Z.[Zheyu],
He, Z.L.[Zi-Long],
Rao, F.Y.[Feng-Yun],
Sun, X.Y.[Xiao-Yan],
Task Navigator: Decomposing Complex Tasks for Multimodal Large
Language Models,
Reasoning24(2248-2257)
IEEE DOI
2410
Training, Systematics, Navigation, Large language models,
Training data, Language and Vision, Multi-modal Vision
BibRef
Cha, J.[Junbum],
Kang, W.[Wooyoung],
Mun, J.[Jonghwan],
Roh, B.[Byungseok],
Honeybee: Locality-Enhanced Projector for Multimodal LLM,
CVPR24(13817-13827)
IEEE DOI Code:
WWW Link.
2410
Visualization, Codes, Large language models, Benchmark testing,
Tuning, Multimodal LLM, Vision-Language
BibRef
Lai, C.G.[Chen-Gen],
Song, S.L.[Sheng-Li],
Yan, S.[Sitong],
Hu, G.[Guangneng],
Improving Vision and Language Concepts Understanding with Multimodal
Counterfactual Samples,
ECCV24(LXIX: 174-191).
Springer DOI
2412
BibRef
Cao, J.J.[Jian-Jian],
Ye, P.[Peng],
Li, S.Z.[Sheng-Ze],
Yu, C.[Chong],
Tang, Y.S.[Yan-Song],
Lu, J.W.[Ji-Wen],
Chen, T.[Tao],
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for
Accelerating Vision-Language Transformer,
CVPR24(15710-15719)
IEEE DOI Code:
WWW Link.
2410
Degradation, Adaptation models, Visualization, Costs,
Computational modeling, Semantics, Token Pruning, Model Compress
BibRef
Sahin, U.[Ugur],
Li, H.[Hang],
Khan, Q.[Qadeer],
Cremers, D.[Daniel],
Tresp, V.[Volker],
Enhancing Multimodal Compositional Reasoning of Visual Language
Models with Generative Negative Mining,
WACV24(5551-5561)
IEEE DOI Code:
HTML Version.
2404
Training, Visualization, Codes, Pipelines, Self-supervised learning,
Cognition, Algorithms, Vision + language and/or other modalities
BibRef
Hu, Z.Z.[Zhi-Zhang],
Zhu, X.L.[Xin-Liang],
Tran, S.[Son],
Vidal, R.[René],
Dhua, A.[Arnab],
ProVLA: Compositional Image Search with Progressive Vision-Language
Alignment and Multimodal Fusion,
CLVL23(2764-2769)
IEEE DOI
2401
BibRef
Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Large Language Models for Autonomous Driving, LLM, LVLM.