Zhao, Z.[Zihao],
Wang, S.[Sheng],
Gu, J.[Jinchen],
Zhu, Y.[Yitao],
Mei, L.[Lanzhuju],
Zhuang, Z.X.[Zi-Xu],
Cui, Z.M.[Zhi-Ming],
Wang, Q.[Qian],
Shen, D.G.[Ding-Gang],
ChatCAD+: Toward a Universal and Reliable Interactive CAD Using LLMs,
MedImg(43), No. 11, November 2024, pp. 3755-3766.
IEEE DOI
2411
Solid modeling, Reliability, Medical diagnostic imaging, Chatbots,
Visualization, Brain modeling, Databases, Large language models,
computer-assisted diagnosis
BibRef
Luo, H.[Haonan],
Zeng, Y.J.[Yi-Jie],
Yang, L.[Li],
Chen, K.[Kexun],
Shen, Z.X.[Zhi-Xuan],
Lv, F.[Fengmao],
VLAI: Exploration and Exploitation based on Visual-Language Aligned
Information for Robotic Object Goal Navigation,
IVC(151), 2024, pp. 105259.
Elsevier DOI Code:
WWW Link.
2411
Object goal navigation, Visual-to-language,
Embodied artificial intelligence, Large language model
BibRef
Mansourian, A.[Ali],
Oucheikh, R.[Rachid],
ChatGeoAI: Enabling Geospatial Analysis for Public through Natural
Language with Large Language Models,
IJGI(13), No. 10, 2024, pp. 348.
DOI Link
2411
BibRef
Li, D.[Diya],
Zhao, Y.[Yue],
Wang, Z.F.[Zhi-Fang],
Jung, C.[Calvin],
Zhang, Z.[Zhe],
Large Language Model-Driven Structured Output: A Comprehensive
Benchmark and Spatial Data Generation Framework,
IJGI(13), No. 11, 2024, pp. 405.
DOI Link
2412
BibRef
Li, Y.X.[Yun-Xin],
Hu, B.T.[Bao-Tian],
Chen, X.Y.[Xin-Yu],
Ma, L.[Lin],
Xu, Y.[Yong],
Zhang, M.[Min],
LMEye: An Interactive Perception Network for Large Language Models,
MultMed(26), 2024, pp. 10952-10964.
IEEE DOI
2412
Visualization, Task analysis, Data models, Tuning, Large language models,
Training, Cognition, interactive perception network
BibRef
Shao, R.[Run],
Zhang, Z.Y.[Zhao-Yang],
Tao, C.[Chao],
Zhang, Y.S.[Yun-Sheng],
Peng, C.L.[Chang-Le],
Li, H.F.[Hai-Feng],
Homogeneous tokenizer matters: Homogeneous visual tokenizer for
remote sensing image understanding,
PandRS(218), 2024, pp. 294-310.
Elsevier DOI Code:
WWW Link.
2412
Remote sensing image understanding, Visual tokenizer,
Homogeneous, Semantically independent region, Visual transformer model
BibRef
Wang, Z.H.[Zhe-Hui],
Luo, T.[Tao],
Liu, C.[Cheng],
Liu, W.C.[Wei-Chen],
Goh, R.S.M.[Rick Siow Mong],
Wong, W.F.[Weng-Fai],
Enabling Energy-Efficient Deployment of Large Language Models on
Memristor Crossbar: A Synergy of Large and Small,
PAMI(47), No. 2, February 2025, pp. 916-933.
IEEE DOI
2501
Memristors, Random access memory,
Nonvolatile memory, Computational modeling, Neural networks
BibRef
Wang, Z.[Zihao],
Cai, S.F.[Shao-Fei],
Liu, A.[Anji],
Jin, Y.G.[Yong-Gang],
Hou, J.[Jinbing],
Zhang, B.[Bowei],
Lin, H.[Haowei],
He, Z.F.[Zhao-Feng],
Zheng, Z.L.[Zi-Long],
Yang, Y.D.[Yao-Dong],
Ma, X.J.[Xiao-Jian],
Liang, Y.[Yitao],
JARVIS-1: Open-World Multi-Task Agents With Memory-Augmented
Multimodal Language Models,
PAMI(47), No. 3, March 2025, pp. 1894-1907.
IEEE DOI
2502
Planning, Diamond, Games, Complexity theory, Cognition, Accuracy,
Visualization, Reliability, Multitasking, Iron, Minecraft,
open-world agents
BibRef
Zhan, Y.[Yang],
Xiong, Z.[Zhitong],
Yuan, Y.[Yuan],
SkyEyeGPT: Unifying remote sensing vision-language tasks via
instruction tuning with large language model,
PandRS(221), 2025, pp. 64-77.
Elsevier DOI
2503
Remote sensing vision-language, Large language model,
Multi-modal, Instruction tuning
BibRef
Zhu, Y.[Yong],
Wen, Z.Y.[Zhen-Yu],
Li, X.[Xiong],
Shi, X.F.[Xiu-Fang],
Wu, X.[Xiang],
Dong, H.[Hui],
Chen, J.M.[Ji-Ming],
ChatNav: Leveraging LLM to Zero-Shot Semantic Reasoning in Object
Navigation,
CirSysVideo(35), No. 3, March 2025, pp. 2369-2381.
IEEE DOI
2503
Semantics, Navigation, Robots, Cognition, TV, Accuracy, Chatbots,
Large language models, Decision making, Pipelines,
gravity-repulsion model
BibRef
Li, Y.X.[Yun-Xin],
Jiang, S.Y.[Shen-Yuan],
Hu, B.T.[Bao-Tian],
Wang, L.Y.[Long-Yue],
Zhong, W.Q.[Wan-Qi],
Luo, W.H.[Wen-Han],
Ma, L.[Lin],
Zhang, M.[Min],
Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts,
PAMI(47), No. 5, May 2025, pp. 3424-3439.
IEEE DOI
2504
Training, Data models, Computational modeling, Connectors,
Benchmark testing, Visualization, Tuning
BibRef
Huang, Z.Z.[Zhong-Zhan],
Zhong, S.S.[Shan-Shan],
Zhou, P.[Pan],
Gao, S.[Shanghua],
Zitnik, M.[Marinka],
Lin, L.[Liang],
A Causality-Aware Paradigm for Evaluating Creativity of Multimodal
Large Language Models,
PAMI(47), No. 5, May 2025, pp. 3830-3846.
IEEE DOI
2504
Creativity, Games, Cognition, Standards, Benchmark testing, Training,
Pipelines, Manuals, Large language models, Information leakage,
causal intervention
BibRef
Marasco, E.[Emanuela],
Bourlai, T.[Thirimachos],
Enhancing trust in Large Language Models for streamlined
decision-making in military operations,
IVC(158), 2025, pp. 105489.
Elsevier DOI
2505
Machine unlearning, Military, Trustworthy AI, Large Language Models
BibRef
Qiao, D.[Dewen],
Ao, X.[Xiang],
Liu, Y.[Yu],
Chen, X.T.[Xue-Tao],
Song, F.Y.[Fu-Yuan],
Qin, Z.[Zheng],
Jin, W.Q.[Wen-Qiang],
Tri-AFLLM: Resource-Efficient Adaptive Asynchronous Accelerated
Federated LLMs,
CirSysVideo(35), No. 5, May 2025, pp. 4198-4211.
IEEE DOI
2505
Training, Computational modeling, Adaptation models, Data models,
Accuracy, Optimization, Servers, Data privacy, Prompt engineering,
momentum gradient descent
BibRef
Villani, F.[Francesco],
Maljkovic, I.[Igor],
Lazzaro, D.[Dario],
Sotgiu, A.[Angelo],
Cinà, A.E.[Antonio Emanuele],
Roli, F.[Fabio],
Robust image classification with multi-modal large language models,
PRL(194), 2025, pp. 1-7.
Elsevier DOI
2506
Adversarial machine learning, Robust classification,
Multimodal large language model, Multimodal information, TrustworthyAI
BibRef
Wang, Q.W.[Qing-Wang],
Li, C.H.[Chao-Hui],
Liu, Y.[Yi],
Zhu, Q.B.[Qiu-Bai],
Song, J.[Jian],
Shen, T.[Tao],
An Adaptive Framework Embedded With LLM for Knowledge Graph
Construction,
MultMed(27), 2025, pp. 2912-2923.
IEEE DOI
2506
Knowledge graphs, Semantics, Accuracy, Encyclopedias, Data mining,
Online services, Costs, Training, Prompt engineering, Multilingual
BibRef
Shao, Z.W.[Zhen-Wei],
Yu, Z.[Zhou],
Yu, J.[Jun],
Ouyang, X.C.[Xue-Cheng],
Zheng, L.[Lihao],
Gai, Z.B.[Zhen-Biao],
Wang, M.Y.[Ming-Yang],
Kuang, Z.Z.[Zhen-Zhong],
Ding, J.J.[Jia-Jun],
Imp: Highly Capable Large Multimodal Models for Mobile Devices,
MultMed(27), 2025, pp. 2961-2974.
IEEE DOI
2506
Visualization, Training, Data models, Computational modeling,
Training data, Connectors, Large language models, Mobile handsets,
vision-language models
BibRef
Zhang, Y.X.[Yi-Xuan],
Liu, C.B.[Chuan-Bin],
Liu, Y.Z.[Yi-Zhi],
Gao, Y.F.[Yi-Fan],
Lu, Z.Y.[Zhi-Ying],
Xie, H.T.[Hong-Tao],
Zhang, Y.D.[Yong-Dong],
Leveraging Concise Concepts With Probabilistic Modeling for
Interpretable Visual Recognition,
MultMed(27), 2025, pp. 3117-3131.
IEEE DOI
2506
Visualization, Probabilistic logic, Semantics, Training, Redundancy,
Predictive models, Large language models, Adaptation models,
probabilistic modeling
BibRef
Ge, J.[Junyao],
Zhang, X.[Xu],
Zheng, Y.[Yang],
Guo, K.[Kaitai],
Liang, J.[Jimin],
RSTeller: Scaling up visual language modeling in remote sensing with
rich linguistic semantics from openly available data and large
language models,
PandRS(226), 2025, pp. 146-163.
Elsevier DOI Code:
WWW Link.
2506
Vision language model, Multimodal dataset, OpenStreetMap,
Google earth engine, Large language models
BibRef
Li, Z.S.[Zhen-Shi],
Muhtar, D.[Dilxat],
Gu, F.[Feng],
He, Y.L.X.[Yang-Lang-Xing],
Zhang, X.L.[Xue-Liang],
Xiao, P.F.[Peng-Feng],
He, G.[Guangjun],
Zhu, X.X.[Xiao-Xiang],
LHRS-Bot-Nova: Improved multimodal large language model for remote
sensing vision-language interpretation,
PandRS(227), 2025, pp. 539-550.
Elsevier DOI Code:
WWW Link.
2508
BibRef
Earlier: A2, A1, A3, A5, A6, Only:
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal
Language Model,
ECCV24(LXXIV: 440-457).
Springer DOI
2412
Remote sensing, Earth observation,
Multimodal large language model, Vision-language dataset
BibRef
Chen, L.F.[Ling-Feng],
Hu, P.[Panhe],
Pan, Z.L.[Zhi-Liang],
Liu, Q.[Qi],
Zhang, S.H.[Shuang-Hui],
Liu, Z.[Zhen],
Large Language Models Can Achieve Explainable and Training-Free
One-Shot HRRP ATR,
SPLetters(32), 2025, pp. 3395-3399.
IEEE DOI
2509
Indexes, Target recognition, Scattering, Radar, Training,
Large language models, Frequency-domain analysis, Data mining,
in-context learning
BibRef
Li, X.[Xu],
Zheng, Y.[Yi],
Chen, H.T.[Hao-Tian],
Chen, X.L.[Xiao-Lei],
Liang, Y.X.[Yu-Xuan],
Lai, C.H.[Cheng-Hang],
Li, B.[Bin],
Xue, X.Y.[Xiang-Yang],
Instruction-guided fusion of multi-layer visual features in Large
Vision-Language Models,
PR(170), 2026, pp. 111932.
Elsevier DOI Code:
WWW Link.
2509
Large Vision-Language Models, Multimodal large language models,
Hierarchical feature utilization
BibRef
Yang, S.Y.[Song-Yuan],
Yu, W.J.[Wei-Jiang],
Yang, W.J.[Wen-Jing],
Liu, X.W.[Xin-Wang],
Tan, H.B.[Hui-Bin],
Lan, L.[Long],
Xiao, N.[Nong],
WildVideo: Benchmarking LMMs for Understanding Video-Language
Interaction,
PAMI(47), No. 10, October 2025, pp. 9330-9344.
IEEE DOI
2510
Videos, Benchmark testing, Visualization, Cognition, Training,
Oral communication, Data mining,
video question answering
BibRef
Hong, W.[Wenyi],
Cheng, Y.[Yean],
Yang, Z.[Zhuoyi],
Luo, Z.Y.[Zi-Yang],
Wu, H.N.[Hao-Ning],
Li, D.X.[Dong-Xu],
Ma, J.[Jing],
Kankanhalli, M.[Mohan],
Li, J.[Junnan],
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal
Models in Video Analysis through User Simulation,
CVPR25(8461-8474)
IEEE DOI
2508
Analytical models, Costs, Annotations, Computational modeling,
Scalability, Benchmark testing, large multimodal models
BibRef
Han, Y.D.[Yu-Dong],
Guo, Q.[Qingpei],
Pan, L.Y.[Li-Yuan],
Liu, L.[Liu],
Guan, Y.[Yu],
Yang, M.[Ming],
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video
Understanding,
CVPR25(8512-8522)
IEEE DOI Code:
WWW Link.
2508
Visualization, Computational modeling, Redundancy,
Statistical learning, Semantics, Cooperative systems, dynamic network
BibRef
Liu, Y.[Yexin],
Liang, Z.Y.[Zheng-Yang],
Wang, Y.Z.[Yue-Ze],
Wu, X.F.[Xian-Feng],
Tang, F.L.[Fei-Long],
He, M.[Muyang],
Li, J.[Jian],
Liu, Z.[Zheng],
Yang, H.[Harry],
Lim, S.[Sernam],
Zhao, B.[Bo],
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering
Incorrectly,
CVPR25(9087-9097)
IEEE DOI
2508
Training, Measurement, Visualization, Pipelines, Refining,
Benchmark testing, Robustness, Decoding, Tuning, MLLM, benchmark,
visual understanding
BibRef
Wang, Z.T.[Zhen-Ting],
Hu, S.M.[Shu-Ming],
Zhao, S.Y.[Shi-Yu],
Lin, X.W.[Xiao-Wen],
Juefei-Xu, F.[Felix],
Li, Z.[Zhuowei],
Han, L.[Ligong],
Subramanyam, H.[Harihar],
Chen, L.[Li],
Chen, J.[Jianfa],
Jiang, N.[Nan],
Lyu, L.[Lingjuan],
Ma, S.Q.[Shi-Qing],
Metaxas, D.N.[Dimitris N.],
Jain, A.[Ankit],
MLLM-as-a-Judge for Image Safety without Human Labeling,
CVPR25(14657-14666)
IEEE DOI
2508
Visualization, Image synthesis, Large language models, Media,
Cognition, Safety, Labeling
BibRef
Tian, J.[Jirui],
Zhang, J.R.[Jin-Rong],
Liu, S.[Shenglan],
Xu, L.[Luhao],
Huang, Z.X.[Zhi-Xiong],
Huang, G.[Gao],
DTOS: Dynamic Time Object Sensing with Large Multimodal Model,
CVPR25(13810-13820)
IEEE DOI Code:
WWW Link.
2508
Location awareness, Visualization, Large language models, Robustness,
Spatiotemporal phenomena, Sensors, Spatial resolution, Videos
BibRef
Li, M.[Ming],
Zhong, J.[Jike],
Chen, T.[Tianle],
Lai, Y.X.[Yu-Xiang],
Psounis, K.[Konstantinos],
EEE-Bench: A Comprehensive Multimodal Electrical And Electronics
Engineering Benchmark,
CVPR25(13337-13349)
IEEE DOI
2508
Visualization, Foundation models, Large language models,
Benchmark testing, Control systems, Mathematical models
BibRef
Liu, Z.H.[Zhi-Hang],
Xie, C.W.[Chen-Wei],
Li, P.[Pandeng],
Zhao, L.M.[Li-Ming],
Tang, L.X.[Long-Xiang],
Zheng, Y.[Yun],
Liu, C.B.[Chuan-Bin],
Xie, H.T.[Hong-Tao],
Hybrid-Level Instruction Injection for Video Token Compression in
Multi-modal Large Language Models,
CVPR25(8568-8578)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Image coding, Codes, Large language models,
Benchmark testing, Computational efficiency, Videos, efficiency
BibRef
Ma, Y.Y.[Yi-Yang],
Liu, X.C.[Xing-Chao],
Chen, X.K.[Xiao-Kang],
Liu, W.[Wen],
Wu, C.Y.[Cheng-Yue],
Wu, Z.Y.[Zhi-Yu],
Pan, Z.Z.[Zi-Zheng],
Xie, Z.[Zhenda],
Zhang, H.[Haowei],
Yu, X.K.[Xing-Kai],
Zhao, L.[Liang],
Wang, Y.S.[Yi-Song],
Liu, J.Y.[Jia-Ying],
Ruan, C.[Chong],
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation,
CVPR25(7739-7751)
IEEE DOI
2508
Training, Computational modeling, Large language models
BibRef
Zhu, M.[Muzhi],
Tian, Y.Z.[Yu-Zhuo],
Chen, H.[Hao],
Zhou, C.[Chunluan],
Guo, Q.[Qingpei],
Liu, Y.[Yang],
Yang, M.[Ming],
Shen, C.H.[Chun-Hua],
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories,
CVPR25(3686-3696)
IEEE DOI Code:
WWW Link.
2508
Visualization, Protocols, Annotations, Filtering, Decision making,
Stars, Robustness, Trajectory, Visual perception, mllm, VLM, agent
BibRef
Zhu, L.[Lanyun],
Chen, T.R.[Tian-Run],
Xu, Q.X.[Qian-Xiong],
Liu, X.[Xuanyi],
Ji, D.[Deyi],
Wu, H.Y.[Hai-Yang],
Soh, D.W.[De Wen],
Liu, J.[Jun],
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based
Reasoning Segmentation,
CVPR25(30231-30240)
IEEE DOI
2508
Learning systems, Attention mechanisms, Accuracy, Design methodology,
Computational modeling, Optimization methods, Ensemble learning
BibRef
Niu, J.[Junbo],
Li, Y.F.[Yi-Fei],
Miao, Z.Y.[Zi-Yang],
Ge, C.J.[Chun-Jiang],
Zhou, Y.H.[Yuan-Hang],
He, Q.H.[Qi-Hao],
Dong, X.Y.[Xiao-Yi],
Duan, H.D.[Hao-Dong],
Ding, S.[Shuangrui],
Qian, R.[Rui],
Zhang, P.[Pan],
Zang, Y.H.[Yu-Hang],
Cao, Y.H.[Yu-Hang],
He, C.H.[Cong-Hui],
Wang, J.Q.[Jia-Qi],
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?,
CVPR25(18902-18913)
IEEE DOI Code:
WWW Link.
2508
Analytical models, Adaptation models, Pipelines, Benchmark testing,
Real-time systems, Cognition, Delays, Videos
BibRef
Farina, M.[Matteo],
Mancini, M.[Massimiliano],
Iacca, G.[Giovanni],
Ricci, E.[Elisa],
Rethinking Few-Shot Adaptation of Vision-Language Models in Two
Stages,
CVPR25(29989-29998)
IEEE DOI
2508
Training, Adaptation models, Benchmark testing, Feature extraction,
Rendering (computer graphics), Robustness, few-shot learning,
multimodal learning
BibRef
Xue, X.Y.[Xiang-Yuan],
Lu, Z.[Zeyu],
Huang, D.[Di],
Wang, Z.D.[Zi-Dong],
Ouyang, W.L.[Wan-Li],
Bai, L.[Lei],
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously
Designing Collaborative AI Systems,
CVPR25(24614-24624)
IEEE DOI Code:
WWW Link.
2508
Codes, Semantics, Collaboration, Benchmark testing, Closed loop systems,
Artificial intelligence, Complex systems, Multi-agent systems
BibRef
Zhao, Z.[Zijia],
Huo, Y.Q.[Yu-Qi],
Yue, T.T.[Tong-Tian],
Guo, L.T.[Long-Teng],
Lu, H.Y.[Hao-Yu],
Wang, B.N.[Bing-Ning],
Chen, W.P.[Wei-Peng],
Liu, J.[Jing],
Efficient Motion-Aware Video MLLM,
CVPR25(24159-24168)
IEEE DOI
2508
Analytical models, Visualization, Costs, Fuses, Scalability,
Redundancy, Semantics, Benchmark testing, Vectors, Videos
BibRef
Wu, R.H.[Rong-Huan],
Su, W.[Wanchao],
Liao, J.[Jing],
Chat2SVG: Vector Graphics Generation with Large Language Models and
Image Diffusion Models,
CVPR25(23690-23700)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Shape, Large language models, Layout,
Pipelines, Diffusion models, Vectors, Complexity theory,
image diffusion model
BibRef
Zhang, Z.[Zhi],
Yadav, S.[Srishti],
Han, F.Z.[Feng-Ze],
Shutova, E.[Ekaterina],
Cross-modal Information Flow in Multimodal Large Language Models,
CVPR25(19781-19791)
IEEE DOI Code:
WWW Link.
2508
Location awareness, Visualization, Large language models,
Computational modeling, Focusing, Linguistics, Predictive models,
inner working mechanism
BibRef
Yang, S.[Senqiao],
Chen, Y.[Yukang],
Tian, Z.[Zhuotao],
Wang, C.Y.[Cheng-Yao],
Li, J.Y.[Jing-Yao],
Yu, B.[Bei],
Jia, J.Y.[Jia-Ya],
VisionZip: Longer is Better but Not Necessary in Vision Language
Models,
CVPR25(19792-19802)
IEEE DOI Code:
WWW Link.
2508
Visualization, Analytical models, Computational modeling,
Redundancy, Video sequences, Performance gain, Feature extraction,
vision language model
BibRef
Xie, J.Y.[Jing-Yi],
Yang, J.T.[Jin-Tao],
Luo, Z.[Zhunchen],
Cao, Y.[Yunbo],
Gao, Q.[Qiang],
Zhang, M.Y.[Meng-Yuan],
Hu, W.P.[Wen-Peng],
AdaDARE-y: Balancing Stability and Plasticity in Multi-modal LLMs
through Efficient Adaptation,
CVPR25(19758-19768)
IEEE DOI
2508
Adaptation models, Visualization, Technological innovation,
Large language models, Computational modeling
BibRef
Fang, Y.[Yi],
Jin, B.[Bowen],
Shen, J.C.[Jia-Cheng],
Ding, S.[Sirui],
Tan, Q.[Qiaoyu],
Han, J.W.[Jia-Wei],
GraphGPT-o: Synergistic Multimodal Comprehension and Generation on
Graphs,
CVPR25(19467-19476)
IEEE DOI Code:
WWW Link.
2508
Codes, Image synthesis, Large language models, Semantics, Transforms,
Encoding, Explosions, Electronic commerce, Data mining, multimodal,
multimodal large language model
BibRef
Tao, K.[Keda],
Qin, C.[Can],
You, H.X.[Hao-Xuan],
Sui, Y.[Yang],
Wang, H.[Huan],
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language
Models,
CVPR25(18992-19001)
IEEE DOI
2508
Training, Visualization, Image coding, Large language models,
Redundancy, Merging, Decoding, Iterative decoding, Videos,
token compression
BibRef
Hao, H.R.[Hao-Ran],
Han, J.M.[Jia-Ming],
Li, C.S.[Chang-Sheng],
Li, Y.F.[Yu-Feng],
Yue, X.Y.[Xiang-Yu],
RAP: Retrieval-Augmented Personalization for Multimodal Large
Language Models,
CVPR25(14538-14548)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Image recognition, Databases,
Large language models, Pipelines, Oral communication,
retrieval-augmented generation
BibRef
Tao, C.X.[Chen-Xin],
Su, S.Q.[Shi-Qian],
Zhu, X.Z.[Xi-Zhou],
Zhang, C.Y.[Chen-Yu],
Chen, Z.[Zhe],
Liu, J.[Jiawen],
Wang, W.H.[Wen-Hai],
Lu, L.W.[Le-Wei],
Huang, G.[Gao],
Qiao, Y.[Yu],
Dai, J.F.[Ji-Feng],
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with
Holistic Vision-Language Embedding,
CVPR25(14559-14569)
IEEE DOI
2508
Training, Visualization, Large language models, Predictive models,
Encoding, Data models, Tuning, Faces
BibRef
Tong, B.[Bo],
Lai, B.[Bokai],
Zhou, Y.[Yiyi],
Luo, G.[Gen],
Shen, Y.H.[Yun-Hang],
Li, K.[Ke],
Sun, X.S.[Xiao-Shuai],
Ji, R.R.[Rong-Rong],
FlashSloth: Lightning Multimodal Large Language Models via Embedded
Visual Compression,
CVPR25(14570-14581)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Image coding, Codes, Large language models,
Semantics, Lightning, Computational complexity, visual compression
BibRef
Lin, Y.Z.[Yuan-Ze],
Li, Y.S.[Yun-Sheng],
Chen, D.D.[Dong-Dong],
Xu, W.J.[Wei-Jian],
Clark, R.[Ronald],
Torr, P.[Philip],
Olympus: A Universal Task Router for Computer Vision Tasks,
CVPR25(14235-14246)
IEEE DOI
2508
Training, Accuracy, Computational modeling, Large language models,
Transforms, Routing, Videos, multimodal large language models
BibRef
Szot, A.[Andrew],
Mazoure, B.[Bogdan],
Attia, O.[Omar],
Timofeev, A.[Aleksei],
Agrawal, H.[Harsh],
Hjelm, D.[Devon],
Gan, Z.[Zhe],
Kira, Z.[Zsolt],
Toshev, A.[Alexander],
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons,
CVPR25(10644-10655)
IEEE DOI
2508
Training, Adaptation models, Video games, Navigation,
Large language models, Supervised learning, Benchmark testing
BibRef
Yin, H.[Hao],
Si, G.Z.[Guang-Zong],
Wang, Z.[Zilei],
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking
Pathways to Faster Inference,
CVPR25(9382-9391)
IEEE DOI Code:
WWW Link.
2508
Visualization, Codes, Large language models, Perturbation methods,
Computational modeling, Semantics, Information processing,
attention mechanism
BibRef
Gholami, M.[Mohsen],
Akbari, M.[Mohammad],
Cannons, K.[Kevin],
Zhang, Y.[Yong],
CASP: Compression of Large Multimodal Models Based on Attention
Sparsity,
CVPR25(9372-9381)
IEEE DOI
2508
Quantization (signal), Image coding, Large language models,
Bit rate, Benchmark testing, Sparse matrices, Matrix decomposition,
2-bit quantization
BibRef
Jia, H.R.[Hong-Rui],
Jiang, C.[Chaoya],
Xu, H.Y.[Hai-Yang],
Ye, W.[Wei],
Dong, M.F.[Meng-Fan],
Yan, M.[Ming],
Zhang, J.[Ji],
Huang, F.[Fei],
Zhang, S.K.[Shi-Kun],
SymDPO: Boosting In-Context Learning of Large Multimodal Models with
Symbol Demonstration Direct Preference Optimization,
CVPR25(9361-9371)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Face recognition, Symbols,
Optimization methods, Benchmark testing, Performance gain,
Context modeling
BibRef
Alvar, S.R.[Saeed Ranjbar],
Singh, G.[Gursimran],
Akbari, M.[Mohammad],
Zhang, Y.[Yong],
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal
Models,
CVPR25(9392-9401)
IEEE DOI
2508
Measurement, Visualization, Accuracy, Large language models,
Redundancy, Memory management, Minimax techniques, Data models,
inference optimization
BibRef
Yang, L.R.[Long-Rong],
Shen, D.[Dong],
Cai, C.X.[Chao-Xiang],
Chen, K.B.[Kai-Bing],
Yang, F.[Fan],
Gao, T.T.[Ting-Ting],
Zhang, D.[Di],
Li, X.[Xi],
Libra-Merging: Importance-Redundancy and Pruning-Merging Trade-Off
for Acceleration Plug-In in Large Vision-Language Model,
CVPR25(9402-9412)
IEEE DOI Code:
WWW Link.
2508
Visualization, Costs, Codes, Merging, Faces
BibRef
Zhang, Z.F.[Ze-Feng],
Tang, H.Z.[Heng-Zhu],
Sheng, J.W.[Jia-Wei],
Zhang, Z.Y.[Zhen-Yu],
Ren, Y.M.[Yi-Ming],
Li, Z.Y.[Zhen-Yang],
Yin, D.W.[Da-Wei],
Ma, D.[Duohe],
Liu, T.W.[Ting-Wen],
Debiasing Multimodal Large Language Models via Noise-Aware Preference
Optimization,
CVPR25(9423-9433)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Perturbation methods, Noise,
Robustness, Noise robustness, Optimization, Resilience, Noise level
BibRef
Liang, Y.[Yinan],
Wang, Z.W.[Zi-Wei],
Xu, X.W.[Xiu-Wei],
Zhou, J.[Jie],
Lu, J.W.[Ji-Wen],
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language
Models,
CVPR25(9445-9454)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Accuracy, Upper bound, Risk minimization,
Costs, Refining, Training data, Data models, Cognition
BibRef
Jiao, Q.[Qirui],
Chen, D.[Daoyuan],
Huang, Y.L.[Yi-Lun],
Ding, B.L.[Bo-Lin],
Li, Y.[Yaliang],
Shen, Y.[Ying],
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
Models,
CVPR25(9296-9307)
IEEE DOI
2508
Visualization, Fine-grained image recognition,
Large language models, Contrastive learning, Benchmark testing,
visual instruction tuning dataset
BibRef
Heo, M.[Miran],
Chen, M.H.[Min-Hung],
Huang, D.A.[De-An],
Liu, S.[Sifei],
Radhakrishnan, S.[Subhashree],
Kim, S.J.[Seon Joo],
Wang, Y.C.F.[Yu-Chiang Frank],
Hachiuma, R.[Ryo],
Omni-RGPT: Unifying Image and Video Region-level Understanding via
Token Marks,
CVPR25(3919-3930)
IEEE DOI
2508
Bridges, Visualization, Target tracking, Large language models,
Benchmark testing, Commonsense reasoning, Videos
BibRef
Ouali, Y.[Yassine],
Bulat, A.[Adrian],
Xenos, A.[Alexandros],
Zaganidis, A.[Anestis],
Metaxas, I.M.[Ioannis Maniadis],
Martinez, B.[Brais],
Tzimiropoulos, G.[Georgios],
VladVA: Discriminative Fine-tuning of LVLMs,
CVPR25(4101-4111)
IEEE DOI
2508
Training, Representation learning, Adaptation models,
Computational modeling, Benchmark testing, Predictive models, Standards
BibRef
Ye, X.[Xubing],
Gan, Y.[Yukang],
Ge, Y.X.[Yi-Xiao],
Zhang, X.P.[Xiao-Ping],
Tang, Y.S.[Yan-Song],
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models,
CVPR25(24972-24982)
IEEE DOI
2508
Degradation, Adaptation models, Visualization,
Computational modeling, Large language models, Redundancy,
multimodal learning
BibRef
Schnaus, D.[Dominik],
Araslanov, N.[Nikita],
Cremers, D.[Daniel],
It's a (Blind) Match! Towards Vision-Language Correspondence without
Parallel Data,
CVPR25(24983-24992)
IEEE DOI
2508
Accuracy, Foundation models, Annotations, Computational modeling,
Semantics, Optimal matching, vision-language models,
representation learning
BibRef
Luo, G.[Gen],
Yang, X.[Xue],
Dou, W.H.[Wen-Han],
Wang, Z.K.[Zhao-Kai],
Liu, J.W.[Jia-Wen],
Dai, J.F.[Ji-Feng],
Qiao, Y.[Yu],
Zhu, X.Z.[Xi-Zhou],
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large
Language Models with Endogenous Visual Pre-training,
CVPR25(24960-24971)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Benchmark testing, Encoding,
Decoding, Noise measurement, Optimization, multimodal models,
vision language models
BibRef
Zhao, Y.Q.[Ya-Qi],
Yin, Y.Y.[Yuan-Yang],
Li, L.[Lin],
Lin, M.[Mingan],
Huang, V.S.J.[Victor Shea-Jay],
Chen, S.W.[Si-Wei],
Chen, W.P.[Wei-Peng],
Yin, B.[Baoqun],
Zhou, Z.[Zenan],
Zhang, W.T.[Wen-Tao],
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual
Knowledge,
CVPR25(24950-24959)
IEEE DOI
2508
Visualization, Accuracy, Large language models, Buildings, Faces
BibRef
Qi, D.[Daiqing],
Zhao, H.[Handong],
Shi, J.[Jing],
Jenni, S.[Simon],
Fan, Y.F.[Yi-Fei],
Dernoncourt, F.[Franck],
Cohen, S.[Scott],
Li, S.[Sheng],
The Photographer's Eye: Teaching Multimodal Large Language Models to
See and Critique like Photographers,
CVPR25(24807-24816)
IEEE DOI
2508
Location awareness, Visualization, Image color analysis,
Large language models, Education, Lighting, Benchmark testing,
image quality assessment
BibRef
Liu, S.[Shaoyu],
Li, J.N.[Jia-Ning],
Zhao, G.H.[Guang-Hui],
Zhang, Y.J.[Yun-Jian],
Meng, X.[Xin],
Yu, F.R.[Fei Richard],
Ji, X.Y.[Xiang-Yang],
Li, M.[Ming],
EventGPT: Event Stream Understanding with Multimodal Large Language
Models,
CVPR25(29139-29149)
IEEE DOI
2508
Bridges, Training, Adaptation models, Visualization,
Large language models, Pipelines, Lighting, Optimization, Synthetic data
BibRef
Zhao, S.Y.[Shi-Yu],
Wang, Z.[Zhenting],
Juefei-Xu, F.[Felix],
Xia, X.[Xide],
Liu, M.[Miao],
Wang, X.F.[Xiao-Fang],
Liang, M.[Mingfu],
Zhang, N.[Ning],
Metaxas, D.N.[Dimitris N.],
Yu, L.C.[Li-Cheng],
Accelerating Multimodal Large Language Models by Searching Optimal
Vision Token Reduction,
CVPR25(29869-29879)
IEEE DOI
2508
Image resolution, Large language models, Benchmark testing,
Computational efficiency, Bayes methods, Feeds, Optimization,
efficiency
BibRef
Ye, X.[Xubing],
Gan, Y.[Yukang],
Huang, X.[Xiaoke],
Ge, Y.X.[Yi-Xiao],
Tang, Y.S.[Yan-Song],
VoCo-LLaMA: Towards Vision Compression with Large Language Models,
CVPR25(29836-29846)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Image coding, Correlation,
Large language models, Force, Computational efficiency, Tuning,
multimodal learning
BibRef
Yan, Z.[Ziang],
Li, Z.L.[Zhi-Lin],
He, Y.[Yinan],
Wang, C.T.[Chen-Ting],
Li, K.[Kunchang],
Li, X.H.[Xin-Hao],
Zeng, X.Y.[Xiang-Yu],
Wang, Z.[Zilei],
Wang, Y.[Yali],
Qiao, Y.[Yu],
Wang, L.M.[Li-Min],
Wang, Y.[Yi],
Task Preference Optimization: Improving Multimodal Large Language
Models with Vision Task Alignment,
CVPR25(29880-29892)
IEEE DOI
2508
Training, Visualization, Large language models, Scalability,
Contrastive learning, Multitasking, Data models, Optimization
BibRef
Chen, C.[Cheng],
Zhai, Y.P.[Yun-Peng],
Zhao, Y.F.[Yi-Fan],
Gao, J.Y.[Jin-Yang],
Ding, B.L.[Bo-Lin],
Li, J.[Jia],
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation
In-Context Learning,
CVPR25(3826-3835)
IEEE DOI
2508
Visualization, Fuses, Large language models, Face recognition,
Refining, Redundancy, Stochastic processes, Reinforcement learning,
large vision-language model
BibRef
Zhang, Y.T.[Yu-Ting],
Lu, H.[Hao],
Hu, Q.Y.[Qing-Yong],
Wang, Y.[Yin],
Yuan, K.[Kaishen],
Liu, X.[Xin],
Wu, K.[Kaishun],
Period-LLM: Extending the Periodic Capability of Multimodal Large
Language Model,
CVPR25(29237-29247)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Analytical models, Large language models,
Semantics, Refining, Cognition, Physiology, Optimization, Periodic structures
BibRef
Hu, Y.[Yangliu],
Song, Z.K.[Zi-Kai],
Feng, N.[Na],
Luo, Y.[Yawei],
Yu, J.Q.[Jun-Qing],
Chen, Y.P.P.[Yi-Ping Phoebe],
Yang, W.[Wei],
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for
Fine-Grained Understanding,
CVPR25(29108-29117)
IEEE DOI
2508
Training, Visualization, Annotations, Large language models,
Natural languages, Benchmark testing, Propulsion, Videos
BibRef
Chen, J.[Joya],
Zeng, Z.Y.[Zi-Yun],
Lin, Y.Q.[Yi-Qi],
Li, W.[Wei],
Ma, Z.[Zejun],
Shou, M.Z.[Mike Zheng],
Live: Learning Video LLM with Streaming Speech Transcription at Scale,
CVPR25(29083-29095)
IEEE DOI
2508
Training, Video on demand, Computational modeling, Training data,
Benchmark testing, Real-time systems,
Videos
BibRef
Wang, Z.W.[Zi-Wei],
Chen, W.Z.[Wei-Zhi],
Yang, L.[Leyang],
Zhou, S.[Sheng],
Zhao, S.[Shengchu],
Zhan, H.[Hanbei],
Jin, J.C.[Jiong-Chao],
Li, L.C.[Liang-Cheng],
Shao, Z.[Zirui],
Bu, J.J.[Jia-Jun],
MP-GUI: Modality Perception with MLLMs for GUI Understanding,
CVPR25(29711-29721)
IEEE DOI Code:
WWW Link.
2508
Training, Visualization, Semantics, Pipelines, Training data,
Feature extraction, Spatial databases, Graphical user interfaces,
Synthetic data
BibRef
Vayani, A.[Ashmal],
Dissanayake, D.[Dinura],
Watawana, H.[Hasindri],
Ahsan, N.[Noor],
Sasikumar, N.[Nevasini],
Thawakar, O.[Omkar],
Ademtew, H.B.[Henok Biadglign],
Hmaiti, Y.[Yahya],
Kumar, A.[Amandeep],
Kuckreja, K.[Kartik],
Maslych, M.[Mykola],
Ghallabi, W.A.[Wafa Al],
Mihaylov, M.[Mihail],
Qin, C.[Chao],
Shaker, A.M.[Abdelrahman M.],
Zhang, M.[Mike],
Ihsani, M.K.[Mahardika Krisna],
Esplana, A.[Amiel],
Gokani, M.[Monil],
Mirkin, S.[Shachar],
Singh, H.[Harsh],
Srivastava, A.[Ashay],
Hamerlik, E.[Endre],
Izzati, F.A.[Fathinah Asma],
Maani, F.A.[Fadillah Adamsyah],
Cavada, S.[Sebastian],
Chim, J.[Jenny],
Gupta, R.[Rohit],
Manjunath, S.[Sanjay],
Zhumakhanova, K.[Kamila],
Rabevohitra, F.H.[Feno Heriniaina],
Amirudin, A.[Azril],
Ridzuan, M.[Muhammad],
Kareem, D.[Daniya],
More, K.[Ketan],
Li, K.[Kunyang],
Shakya, P.[Pramesh],
Saad, M.[Muhammad],
Ghasemaghaei, A.[Amirpouya],
Djanibekov, A.[Amirbek],
Azizov, D.[Dilshod],
Jankovic, B.[Branislava],
Bhatia, N.[Naman],
Cabrera, A.[Alvaro],
Obando-Ceron, J.[Johan],
Otieno, O.[Olympiah],
Farestam, F.[Fabian],
Rabbani, M.[Muztoba],
Baliah, S.[Sanoojan],
Sanjeev, S.[Santosh],
Shtanchaev, A.[Abduragim],
Fatima, M.[Maheen],
Nguyen, T.[Thao],
Kareem, A.[Amrin],
Aremu, T.[Toluwani],
Xavier, N.[Nathan],
Bhatkal, A.[Amit],
Toyin, H.[Hawau],
Chadha, A.[Aman],
Cholakkal, H.[Hisham],
Anwer, R.M.[Rao Muhammad],
Felsberg, M.[Michael],
Laaksonen, J.[Jorma],
Solorio, T.[Thamar],
Choudhury, M.[Monojit],
Laptev, I.[Ivan],
Shah, M.[Mubarak],
Khan, S.[Salman],
Khan, F.S.[Fahad Shahbaz],
All Languages Matter: Evaluating LMMs on Culturally Diverse 100
Languages,
CVPR25(19565-19575)
IEEE DOI Code:
WWW Link.
2508
Visualization, Sensitivity, Benchmark testing, Germanium,
Distance measurement, Cognition, Multilingual,
multilingual multimodal benchmark
BibRef
Cao, A.[Anjia],
Wei, X.[Xing],
Ma, Z.H.[Zhi-Heng],
FLAME: Frozen Large Language Models Enable Data-Efficient
Language-Image Pre-training,
CVPR25(4080-4090)
IEEE DOI Code:
WWW Link.
2508
Large language models, Semantics, Fires, Text to image,
Data augmentation, Multilingual, Faces
BibRef
Bi, J.[Jing],
Guo, J.J.[Jun-Jia],
Tang, Y.L.[Yun-Long],
Wen, L.G.B.[Liang-Gong Bruce],
Liu, Z.[Zhang],
Wang, B.J.[Bing-Jie],
Xu, C.L.[Chen-Liang],
Unveiling Visual Perception in Language Models: An Attention Head
Analysis Approach,
CVPR25(4135-4144)
IEEE DOI
2508
Visualization, Adaptation models, Systematics, Correlation,
Large language models, Linguistics, Data models, Visual perception,
llm
BibRef
Li, S.[Shiyao],
Hu, Y.C.[Ying-Chun],
Ning, X.F.[Xue-Fei],
Liu, X.H.[Xi-Hui],
Hong, K.[Ke],
Jia, X.T.[Xiao-Tao],
Li, X.[Xiuhong],
Yan, Y.Q.[Ya-Qi],
Ran, P.[Pei],
Dai, G.H.[Guo-Hao],
Yan, S.[Shengen],
Yang, H.Z.[Hua-Zhong],
Wang, Y.[Yu],
MBQ: Modality-Balanced Quantization for Large Vision-Language Models,
CVPR25(4167-4177)
IEEE DOI Code:
WWW Link.
2508
Quantization (signal), Sensitivity, Accuracy, Fuses,
Large language models, Graphics processing units, Calibration,
Kernel
BibRef
Lin, J.[Junyan],
Chen, H.R.[Hao-Ran],
Fan, Y.[Yue],
Fan, Y.Q.[Ying-Qi],
Jin, X.[Xin],
Su, H.[Hui],
Fu, J.[Jinlan],
Shen, X.Y.[Xiao-Yu],
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods,
Analysis, and Best Practices,
CVPR25(4156-4166)
IEEE DOI Code:
WWW Link.
2508
Training, Degradation, Visualization, Large language models,
Optical character recognition, Focusing, Nonhomogeneous media,
multi-layer visual feature
BibRef
Zhao, Q.Q.[Qing-Qing],
Lu, Y.[Yao],
Kim, M.J.[Moo Jin],
Fu, Z.[Zipeng],
Zhang, Z.Y.[Zhuo-Yang],
Wu, Y.[Yecheng],
Li, Z.[Zhaoshuo],
Ma, Q.L.[Qian-Li],
Han, S.[Song],
Finn, C.[Chelsea],
Handa, A.[Ankur],
Lin, T.Y.[Tsung-Yi],
Wetzstein, G.[Gordon],
Liu, M.Y.[Ming-Yu],
Xiang, D.L.[Dong-Lai],
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action
Models,
CVPR25(1702-1713)
IEEE DOI Code:
WWW Link.
2508
Visualization, Computational modeling, Predictive models,
Benchmark testing, Robot sensing systems, Cognition, Planning,
multimodal large language models
BibRef
Liu, Z.[Zhuoming],
Li, Y.Q.[Yi-Quan],
Nguyen, K.D.[Khoi Duc],
Zhong, Y.[Yiwu],
Li, Y.[Yin],
PAVE: Patching and Adapting Video Large Language Models,
CVPR25(3306-3317)
IEEE DOI Code:
WWW Link.
2508
Adaptation models, Solid modeling, Large language models,
Computational modeling, Cognition, multi-modality
BibRef
Lu, X.D.[Xu-Dong],
Chen, Y.H.[Ying-Hao],
Chen, C.[Cheng],
Tan, H.[Hui],
Chen, B.[Boheng],
Xie, Y.[Yina],
Hu, R.[Rui],
Tan, G.X.[Guan-Xin],
Wu, R.S.[Ren-Shou],
Hu, Y.[Yan],
Zeng, Y.[Yi],
Wu, L.[Lei],
Bian, L.Y.[Liu-Yang],
Wang, Z.X.[Zhao-Xiong],
Liu, L.[Long],
Yang, Y.Z.[Yan-Zhou],
Xiao, H.[Han],
Zhou, A.[Aojun],
Wen, Y.F.[Ya-Fei],
Chen, X.X.[Xiao-Xin],
Ren, S.[Shuai],
Li, H.S.[Hong-Sheng],
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices,
CVPR25(4145-4155)
IEEE DOI
2508
Performance evaluation, Quantization (signal), Large language models,
Computational modeling, Mobile handsets, model deployment
BibRef
Malakouti, S.[Sina],
Aghazadeh, A.[Aysan],
Khandelwal, A.[Ashmit],
Kovashka, A.[Adriana],
Benchmarking VLMs' Reasoning About Persuasive Atypical Images,
WACV25(4788-4798)
IEEE DOI
2505
Visualization, Codes, Large language models, Focusing, Media,
Benchmark testing, Cognition, Data mining, Object recognition
BibRef
Lee, H.[Hankyeol],
Seo, G.[Gawon],
Choi, W.[Wonseok],
Jung, G.[Geunyoung],
Song, K.[Kyungwoo],
Jung, J.Y.[Ji-Young],
Enhancing Visual Classification Using Comparative Descriptors,
WACV25(5274-5283)
IEEE DOI Code:
WWW Link.
2505
Measurement, Visualization, Accuracy, Filtering,
Computational modeling, Large language models, Semantics,
Image classification
BibRef
Ee, Y.K.[Yeo Keat],
Zhang, H.[Hao],
Matyasko, A.[Alexander],
Fernando, B.[Basura],
Deduce and Select Evidences with Language Models for Training-Free
Video Goal Inference,
WACV25(5937-5947)
IEEE DOI
2505
Visualization, Accuracy, Filtering, Large language models,
Computational modeling, Robustness, Cognition, training-free
BibRef
Fu, R.[Rao],
Liu, J.Y.[Jing-Yu],
Chen, X.[Xilun],
Nie, Y.X.[Yi-Xin],
Xiong, W.H.[Wen-Han],
Scene-LLM: Extending Language Model for 3D Visual Reasoning,
WACV25(2195-2206)
IEEE DOI
2505
Location awareness, Solid modeling, Visualization,
Large language models, Cognition, 3d understanding
BibRef
Awais, M.[Muhammad],
Alharthi, A.H.S.A.[Ali Husain Salem Abdulla],
Kumar, A.[Amandeep],
Cholakkal, H.[Hisham],
Anwer, R.M.[Rao Muhammad],
AgroGPT: Efficient Agricultural Vision-Language Model with Expert
Tuning,
WACV25(5687-5696)
IEEE DOI Code:
WWW Link.
2505
Codes, Computational modeling, Large language models, Pipelines,
Oral communication, Agriculture, Data models, Tuning
BibRef
Chen, S.[Shuo],
Han, Z.[Zhen],
He, B.[Bailan],
Liu, J.Z.[Jian-Zhe],
Buckley, M.[Mark],
Qin, Y.[Yao],
Torr, P.[Philip],
Tresp, V.[Volker],
Gu, J.D.[Jin-Dong],
Can Multimodal Large Language Models Truly Perform Multimodal
In-Context Learning?,
WACV25(6000-6010)
IEEE DOI
2505
Visualization, Large language models, Computational modeling,
multimodal large language models, in-context learning
BibRef
Kruzhkov, E.[Evgenii],
Behnke, S.[Sven],
LiLMaps: Learnable Implicit Language Maps,
WACV25(7711-7720)
IEEE DOI
2505
Visualization, Large language models, Human-robot interaction,
Object detection, Solids, Market research, Decoding, Optimization,
incremental implicit mapping
BibRef
Singh, C.K.[Chandan Kumar],
Kumar, D.[Devesh],
Sanap, V.[Vipul],
Sinha, R.[Rajesh],
LLM-RSPF: Large Language Model-Based Robotic System Planning
Framework for Domain Specific Use-cases,
WACV25(7277-7286)
IEEE DOI
2505
Solid modeling, Accuracy, Systematics, Service robots, Ontologies,
Throughput, Robustness, Planning, Robots, coht, task planning
BibRef
Wang, C.Y.[Chen-Yu],
Luo, W.X.[Wei-Xin],
Dong, S.[Sixun],
Xuan, X.H.[Xiao-Hua],
Li, Z.X.[Zheng-Xin],
Ma, L.[Lin],
Gao, S.H.[Sheng-Hua],
MLLM-Tool: A Multimodal Large Language Model for Tool Agent Learning,
WACV25(6678-6687)
IEEE DOI
2505
Codes, Large language models, Natural languages,
Oral communication, Benchmark testing, Encoding
BibRef
Sun, L.[Li],
Ahuja, C.[Chaitanya],
Chen, P.[Peng],
D'Zmura, M.[Matt],
Batmanghelich, K.[Kayhan],
Bontrager, P.[Philip],
Multi-Modal Large Language Models are Effective Vision Learners,
WACV25(8617-8626)
IEEE DOI
2505
Representation learning, Resistance, Visualization, Large language models,
Feature extraction, Robustness, Data models, multi-modal
BibRef
Tateno, M.[Masatoshi],
Yagi, T.[Takuma],
Furuta, R.[Ryosuke],
Sato, Y.[Yoichi],
Learning Multiple Object States from Actions via Large Language
Models,
WACV25(9555-9565)
IEEE DOI
2505
Analytical models, Accuracy, Annotations, Computational modeling,
Large language models, Catalysts, Multi label classification
BibRef
Bahadir, C.D.[Cagla Deniz],
Akar, G.B.[Gozde B.],
Sabuncu, M.R.[Mert R.],
LLM-Generated Rewrite and Context Modulation for Enhanced Vision
Language Models in Digital Pathology,
WACV25(327-336)
IEEE DOI
2505
Training, Pathology, Sensitivity, Computational modeling, Modulation,
Text to image, Standards, Context modeling, Biomedical imaging,
large language models
BibRef
Chu, X.X.[Xiang-Xiang],
Su, J.L.[Jian-Lin],
Zhang, B.[Bo],
Shen, C.H.[Chun-Hua],
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks,
ECCV24(LXVI: 1-18).
Springer DOI
2412
Code:
WWW Link.
BibRef
Long, F.C.[Fu-Chen],
Qiu, Z.F.[Zhao-Fan],
Yao, T.[Ting],
Mei, T.[Tao],
VideoStudio: Generating Consistent-content and Multi-scene Videos,
ECCV24(LX: 468-485).
Springer DOI
2412
Code:
WWW Link.
BibRef
Liu, S.L.[Shi-Long],
Cheng, H.[Hao],
Liu, H.T.[Hao-Tian],
Zhang, H.[Hao],
Li, F.[Feng],
Ren, T.[Tianhe],
Zou, X.[Xueyan],
Yang, J.W.[Jian-Wei],
Su, H.[Hang],
Zhu, J.[Jun],
Zhang, L.[Lei],
Gao, J.F.[Jian-Feng],
Li, C.Y.[Chun-Yuan],
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents,
ECCV24(XLVII: 126-142).
Springer DOI
2412
BibRef
Kong, X.H.[Xiang-Hao],
Chen, J.[Jinyu],
Wang, W.G.[Wen-Guan],
Su, H.[Hang],
Hu, X.L.[Xiao-Lin],
Yang, Y.[Yi],
Liu, S.[Si],
Controllable Navigation Instruction Generation with Chain of Thought
Prompting,
ECCV24(XXIX: 37-54).
Springer DOI
2412
Instruction generation.
BibRef
Zhu, W.Y.C.[William Yi-Cheng],
Ye, K.[Keren],
Ke, J.J.[Jun-Jie],
Yu, J.H.[Jia-Hui],
Guibas, L.J.[Leonidas J.],
Milanfar, P.[Peyman],
Yang, F.[Feng],
ARTVLM: Attribute Recognition Through Vision-based Prefix Language
Modeling,
ECCV24(XXVII: 127-145).
Springer DOI
2412
Code:
WWW Link.
BibRef
Kim, D.[Donggyun],
Cho, S.[Seongwoong],
Kim, S.[Semin],
Luo, C.[Chong],
Hong, S.[Seunghoon],
Chameleon: A Data-efficient Generalist for Dense Visual Prediction in
the Wild,
ECCV24(XXIII: 422-441).
Springer DOI
2412
Code:
WWW Link.
BibRef
Ke, F.[Fucai],
Cai, Z.X.[Zhi-Xi],
Jahangard, S.[Simindokht],
Wang, W.Q.[Wei-Qing],
Haghighi, P.D.[Pari Delir],
Rezatofighi, H.[Hamid],
Hydra: A Hyper Agent for Dynamic Compositional Visual Reasoning,
ECCV24(XX: 132-149).
Springer DOI
2412
BibRef
Bao, X.Y.[Xiao-Yi],
Sun, S.Y.[Si-Yang],
Ma, S.L.[Shuai-Lei],
Zheng, K.C.[Ke-Cheng],
Guo, Y.X.[Yu-Xin],
Zhao, G.S.[Guo-Sheng],
Zheng, Y.[Yun],
Wang, X.G.[Xin-Gang],
Cores: Orchestrating the Dance of Reasoning and Segmentation,
ECCV24(XVIII: 187-204).
Springer DOI
2412
BibRef
Liu, Z.[Zuyan],
Liu, B.[Benlin],
Wang, J.H.[Jia-Hui],
Dong, Y.H.[Yu-Hao],
Chen, G.Y.[Guang-Yi],
Rao, Y.M.[Yong-Ming],
Krishna, R.[Ranjay],
Lu, J.W.[Ji-Wen],
Efficient Inference of Vision Instruction-following Models with Elastic
Cache,
ECCV24(XVII: 54-69).
Springer DOI
2412
Code:
WWW Link.
BibRef
Alaluf, Y.[Yuval],
Richardson, E.[Elad],
Tulyakov, S.[Sergey],
Aberman, K.[Kfir],
Cohen-Or, D.[Daniel],
MYVLM: Personalizing VLMS for User-specific Queries,
ECCV24(XIII: 73-91).
Springer DOI
2412
BibRef
Cai, R.[Rizhao],
Song, Z.[Zirui],
Guan, D.[Dayan],
Chen, Z.H.[Zhen-Hao],
Li, Y.H.[Yao-Hang],
Luo, X.[Xing],
Yi, C.Y.[Chen-Yu],
Kot, A.C.[Alex C.],
BenchLMM: Benchmarking Cross-Style Visual Capability of Large
Multimodal Models,
ECCV24(L: 340-358).
Springer DOI
2412
BibRef
Ma, Z.X.[Zi-Xian],
Huang, W.[Weikai],
Zhang, J.[Jieyu],
Gupta, T.[Tanmay],
Krishna, R.[Ranjay],
m&m's: A Benchmark to Evaluate Tool-use for multi-step multi-modal
Tasks,
ECCV24(X: 18-34).
Springer DOI
2412
WWW Link. and
WWW Link.
BibRef
Zhao, Z.H.[Zhong-Han],
Chai, W.H.[Wen-Hao],
Wang, X.[Xuan],
Li, B.[Boyi],
Hao, S.Y.[Sheng-Yu],
Cao, S.D.[Shi-Dong],
Ye, T.[Tian],
Wang, G.A.[Gao-Ang],
See and Think: Embodied Agent in Virtual Environment,
ECCV24(VIII: 187-204).
Springer DOI
2412
BibRef
Liu, Y.[Yuan],
Duan, H.D.[Hao-Dong],
Zhang, Y.H.[Yuan-Han],
Li, B.[Bo],
Zhang, S.Y.[Song-Yang],
Zhao, W.[Wangbo],
Yuan, Y.[Yike],
Wang, J.Q.[Jia-Qi],
He, C.H.[Cong-Hui],
Liu, Z.W.[Zi-Wei],
Chen, K.[Kai],
Lin, D.[Dahua],
MMBENCH: Is Your Multi-Modal Model an All-Around Player?,
ECCV24(VI: 216-233).
Springer DOI
2412
BibRef
Liu, Y.[Yang],
Ding, P.X.[Peng-Xiang],
Huang, S.[Siteng],
Zhang, M.[Min],
Zhao, H.[Han],
Wang, D.L.[Dong-Lin],
PITE: Pixel-Temporal Alignment for Large Video-Language Model,
ECCV24(V: 160-176).
Springer DOI
2412
BibRef
Panagopoulou, A.[Artemis],
Xue, L.[Le],
Yu, N.[Ning],
Li, J.[Junnan],
Li, D.X.[Dong-Xu],
Joty, S.[Shafiq],
Xu, R.[Ran],
Savarese, S.[Silvio],
Xiong, C.M.[Cai-Ming],
Niebles, J.C.[Juan Carlos],
X-instructblip: A Framework for Aligning Image, 3d, Audio, Video to
LLMs and its Emergent Cross-modal Reasoning,
ECCV24(XLV: 177-197).
Springer DOI
2412
BibRef
Mirza, M.J.[M. Jehanzeb],
Karlinsky, L.[Leonid],
Lin, W.[Wei],
Doveh, S.[Sivan],
Micorek, J.[Jakub],
Kozinski, M.[Mateusz],
Kuehne, H.[Hilde],
Possegger, H.[Horst],
Meta-prompting for Automating Zero-shot Visual Recognition with LLMs,
ECCV24(II: 370-387).
Springer DOI
2412
BibRef
Yu, E.[En],
Zhao, L.[Liang],
Wei, Y.[Yana],
Yang, J.R.[Jin-Rong],
Wu, D.M.[Dong-Ming],
Kong, L.Y.[Ling-Yu],
Wang, T.[Tiancai],
Ge, Z.[Zheng],
Zhang, X.Y.[Xiang-Yu],
Tao, W.B.[Wen-Bing],
Merlin: Empowering Multimodal LLMs with Foresight Minds,
ECCV24(IV: 425-443).
Springer DOI
2412
BibRef
Liu, Z.Y.[Zhao-Yang],
Lai, Z.Q.[Ze-Qiang],
Gao, Z.W.[Zhang-Wei],
Cui, E.[Erfei],
Li, Z.H.[Zi-Heng],
Zhu, X.Z.[Xi-Zhou],
Lu, L.W.[Le-Wei],
Chen, Q.F.[Qi-Feng],
Qiao, Y.[Yu],
Dai, J.F.[Ji-Feng],
Wang, W.H.[Wen-Hai],
ControlLLM: Augment Language Models with Tools by Searching on Graphs,
ECCV24(XII: 89-105).
Springer DOI
2412
BibRef
Yao, Y.[Yi],
Hsu, C.F.[Chan-Feng],
Lin, J.H.[Jhe-Hao],
Xie, H.X.[Hong-Xia],
Lin, T.[Terence],
Huang, Y.N.[Yi-Ning],
Shuai, H.H.[Hong-Han],
Cheng, W.H.[Wen-Huang],
The Fabrication of Reality and Fantasy: Scene Generation with
LLM-assisted Prompt Interpretation,
ECCV24(XXII: 422-438).
Springer DOI
2412
BibRef
Wu, Y.X.[Yi-Xuan],
Wang, Y.Z.[Yi-Zhou],
Tang, S.X.[Shi-Xiang],
Wu, W.H.[Wen-Hao],
He, T.[Tong],
Ouyang, W.L.[Wan-Li],
Torr, P.H.S.[Philip H.S.],
Wu, J.[Jian],
Dettoolchain: A New Prompting Paradigm to Unleash Detection Ability of
MLLM,
ECCV24(XXXII: 164-182).
Springer DOI
2412
BibRef
Song, K.[Kunpeng],
Zhu, Y.Z.[Yi-Zhe],
Liu, B.C.[Bing-Chen],
Yan, Q.[Qing],
Elgammal, A.[Ahmed],
Yang, X.[Xiao],
MOMA: Multimodal LLM Adapter for Fast Personalized Image Generation,
ECCV24(XL: 117-132).
Springer DOI
2412
BibRef
Wang, H.[Han],
Ye, Y.J.[Yong-Jie],
Wang, Y.J.[Yan-Jie],
Nie, Y.X.[Yu-Xiang],
Huang, C.[Can],
Elysium: Exploring Object-level Perception in Videos via MLLM,
ECCV24(XXII: 166-185).
Springer DOI
2412
BibRef
Gou, Y.H.[Yun-Hao],
Chen, K.[Kai],
Liu, Z.[Zhili],
Hong, L.Q.[Lan-Qing],
Xu, H.[Hang],
Li, Z.G.[Zhen-Guo],
Yeung, D.Y.[Dit-Yan],
Kwok, J.T.[James T.],
Zhang, Y.[Yu],
Eyes Closed, Safety on: Protecting Multimodal LLMs via Image-to-text
Transformation,
ECCV24(XVII: 388-404).
Springer DOI
2412
BibRef
Guo, Z.H.[Zong-Hao],
Xu, R.[Ruyi],
Yao, Y.[Yuan],
Cui, J.[Junbo],
Ni, Z.[Zanlin],
Ge, C.J.[Chun-Jiang],
Chua, T.S.[Tat-Seng],
Liu, Z.Y.[Zhi-Yuan],
Huang, G.[Gao],
LLAVA-UHD: An LMM Perceiving Any Aspect Ratio and High-resolution
Images,
ECCV24(LXXXIII: 390-406).
Springer DOI
2412
BibRef
Wang, D.S.[Dong-Sheng],
Cui, J.[Jiequan],
Li, M.[Miaoge],
Lin, W.[Wang],
Chen, B.[Bo],
Zhang, H.W.[Han-Wang],
Instruction Tuning-free Visual Token Complement for Multimodal LLMs,
ECCV24(LXXXI: 446-462).
Springer DOI
2412
BibRef
McKinzie, B.[Brandon],
Gan, Z.[Zhe],
Fauconnier, J.P.[Jean-Philippe],
Dodge, S.[Sam],
Zhang, B.[Bowen],
Dufter, P.[Philipp],
Shah, D.[Dhruti],
Du, X.Z.[Xian-Zhi],
Peng, F.[Futang],
Belyi, A.[Anton],
Zhang, H.T.[Hao-Tian],
Singh, K.[Karanjeet],
Kang, D.[Doug],
Hè, H.Y.[Hong-Yu],
Schwarzer, M.[Max],
Gunter, T.[Tom],
Kong, X.[Xiang],
Zhang, A.[Aonan],
Wang, J.Y.[Jian-Yu],
Wang, C.[Chong],
Du, N.[Nan],
Lei, T.[Tao],
Wiseman, S.[Sam],
Lee, M.[Mark],
Wang, Z.[Zirui],
Pang, R.[Ruoming],
Grasch, P.[Peter],
Toshev, A.[Alexander],
Yang, Y.F.[Yin-Fei],
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training,
ECCV24(XXIX: 304-323).
Springer DOI
2412
BibRef
Zhou, G.Z.[Geng-Ze],
Hong, Y.C.[Yi-Cong],
Wang, Z.[Zun],
Wang, X.E.[Xin Eric],
Wu, Q.[Qi],
NAVGPT-2: Unleashing Navigational Reasoning Capability for Large
Vision-language Models,
ECCV24(VII: 260-278).
Springer DOI
2412
BibRef
Wei, H.R.[Hao-Ran],
Kong, L.Y.[Ling-Yu],
Chen, J.Y.[Jin-Yue],
Zhao, L.[Liang],
Ge, Z.[Zheng],
Yang, J.R.[Jin-Rong],
Wang, T.[Tiancai],
Zhang, X.Y.[Xiang-Yu],
Tao, W.B.[Wen-Bing],
Vary: Scaling up the Vision Vocabulary for Large Vision-language Model,
ECCV24(IV: 408-424).
Springer DOI
2412
BibRef
Wang, Y.[Yu],
Liu, X.G.[Xiao-Geng],
Li, Y.[Yu],
Chen, M.[Muhao],
Xiao, C.W.[Chao-Wei],
Adashield: Safeguarding Multimodal Large Language Models from
Structure-based Attack via Adaptive Shield Prompting,
ECCV24(XX: 77-94).
Springer DOI
2412
BibRef
He, S.T.[Shu-Ting],
Ding, H.H.[Heng-Hui],
Jiang, X.D.[Xu-Dong],
Wen, B.[Bihan],
Segpoint: Segment Any Point Cloud via Large Language Model,
ECCV24(XXII: 349-367).
Springer DOI
2412
BibRef
Zhao, H.H.[Henry Hengyuan],
Zhou, P.[Pan],
Shou, M.Z.[Mike Zheng],
Genixer: Empowering Multimodal Large Language Model as a Powerful Data
Generator,
ECCV24(XXIII: 129-147).
Springer DOI
2412
BibRef
Fu, X.Y.[Xing-Yu],
Hu, Y.S.[Yu-Shi],
Li, B.Z.[Bang-Zheng],
Feng, Y.[Yu],
Wang, H.Y.[Hao-Yu],
Lin, X.D.[Xu-Dong],
Roth, D.[Dan],
Smith, N.A.[Noah A.],
Ma, W.C.[Wei-Chiu],
Krishna, R.[Ranjay],
Blink: Multimodal Large Language Models Can See but Not Perceive,
ECCV24(XXIII: 148-166).
Springer DOI
2412
BibRef
Zhang, Z.K.[Zhi-Kai],
Li, Y.T.[Yi-Tang],
Huang, H.F.[Hao-Feng],
Lin, M.X.[Ming-Xian],
Yi, L.[Li],
Freemotion: Mocap-free Human Motion Synthesis with Multimodal Large
Language Models,
ECCV24(XXIII: 403-421).
Springer DOI
2412
BibRef
Murugesan, B.[Balamurali],
Silva-Rodríguez, J.[Julio],
Ben Ayed, I.[Ismail],
Dolz, J.[Jose],
Robust Calibration of Large Vision-language Adapters,
ECCV24(XXIV: 147-165).
Springer DOI
2412
BibRef
Xu, R.[Runsen],
Wang, X.L.[Xiao-Long],
Wang, T.[Tai],
Chen, Y.L.[Yi-Lun],
Pang, J.M.[Jiang-Miao],
Lin, D.[Dahua],
Pointllm: Empowering Large Language Models to Understand Point Clouds,
ECCV24(XXV: 131-147).
Springer DOI
2412
BibRef
Cai, K.W.[Kai-Wen],
Duan, Z.K.[Zhe-Kai],
Liu, G.[Gaowen],
Fleming, C.[Charles],
Lu, C.X.X.[Chris Xiao-Xuan],
Self-adapting Large Visual-language Models to Edge Devices Across
Visual Modalities,
ECCV24(XXVIII: 301-318).
Springer DOI
2412
BibRef
Yu, R.[Runpeng],
Yu, W.H.[Wei-Hao],
Wang, X.C.[Xin-Chao],
Attention Prompting on Image for Large Vision-language Models,
ECCV24(XXX: 251-268).
Springer DOI
2412
BibRef
Luo, Y.L.[Yu-Lin],
An, R.[Ruichuan],
Zou, B.[Bocheng],
Tang, Y.M.[Yi-Ming],
Liu, J.M.[Jia-Ming],
Zhang, S.H.[Shang-Hang],
Llm as Dataset Analyst: Subpopulation Structure Discovery with Large
Language Model,
ECCV24(XXXIII: 235-252).
Springer DOI
2412
BibRef
Pi, R.J.[Ren-Jie],
Han, T.Y.[Tian-Yang],
Xiong, W.[Wei],
Zhang, J.P.[Ji-Peng],
Liu, R.T.[Run-Tao],
Pan, R.[Rui],
Zhang, T.[Tong],
Strengthening Multimodal Large Language Model with Bootstrapped
Preference Optimization,
ECCV24(XXXIII: 382-398).
Springer DOI
2412
BibRef
Huang, Z.J.[Zhi-Jian],
Tang, T.[Tao],
Chen, S.X.[Shao-Xiang],
Lin, S.[Sihao],
Jie, Z.Q.[Ze-Qun],
Ma, L.[Lin],
Wang, G.[Guangrun],
Liang, X.D.[Xiao-Dan],
Making Large Language Models Better Planners with Reasoning-decision
Alignment,
ECCV24(XXXVI: 73-90).
Springer DOI
2412
BibRef
Xia, B.[Bin],
Wang, S.Y.[Shi-Yin],
Tao, Y.[Yingfan],
Wang, Y.T.[Yi-Tong],
Jia, J.Y.[Jia-Ya],
Llmga: Multimodal Large Language Model Based Generation Assistant,
ECCV24(XXXVIII: 389-406).
Springer DOI
2412
BibRef
Zhan, Y.F.[Yu-Fei],
Zhu, Y.[Yousong],
Chen, Z.Y.[Zhi-Yang],
Yang, F.[Fan],
Tang, M.[Ming],
Wang, J.Q.[Jin-Qiao],
Griffon: Spelling Out All Object Locations at Any Granularity with
Large Language Models,
ECCV24(XLII: 405-422).
Springer DOI
2412
BibRef
Li, Y.W.[Yan-Wei],
Wang, C.Y.[Cheng-Yao],
Jia, J.Y.[Jia-Ya],
Llama-vid: An Image is Worth 2 Tokens in Large Language Models,
ECCV24(XLVI: 323-340).
Springer DOI
2412
BibRef
Ju, C.[Chen],
Wang, H.[Haicheng],
Cheng, H.Z.[Hao-Zhe],
Chen, X.[Xu],
Zhai, Z.H.[Zhong-Hua],
Huang, W.L.[Wei-Lin],
Lan, J.S.[Jin-Song],
Xiao, S.[Shuai],
Zheng, B.[Bo],
Turbo: Informativity-driven Acceleration Plug-in for Vision-language
Large Models,
ECCV24(XLVI: 436-455).
Springer DOI
2412
BibRef
Zhao, Q.[Qinyu],
Xu, M.[Ming],
Gupta, K.[Kartik],
Asthana, A.[Akshay],
Zheng, L.[Liang],
Gould, S.[Stephen],
The First to Know: How Token Distributions Reveal Hidden Knowledge in
Large Vision-language Models?,
ECCV24(XLVIII: 127-142).
Springer DOI
2412
BibRef
Lee, B.K.[Byung-Kwan],
Park, B.[Beomchan],
Kim, C.W.[Chae Won],
Ro, Y.M.[Yong Man],
Moai: Mixture of All Intelligence for Large Language and Vision Models,
ECCV24(XLIX: 273-302).
Springer DOI
2412
BibRef
Liu, R.[Ruyang],
Li, C.[Chen],
Tang, H.R.[Hao-Ran],
Ge, Y.X.[Yi-Xiao],
Shan, Y.[Ying],
Li, G.[Ge],
ST-LLM: Large Language Models Are Effective Temporal Learners,
ECCV24(LVII: 1-18).
Springer DOI
2412
BibRef
Cheng, H.[Hao],
Xiao, E.[Erjia],
Gu, J.D.[Jin-Dong],
Yang, L.[Le],
Duan, J.[Jinhao],
Zhang, J.[Jize],
Cao, J.H.[Jia-Hang],
Xu, K.D.[Kai-Di],
Xu, R.[Renjing],
Unveiling Typographic Deceptions: Insights of the Typographic
Vulnerability in Large Vision-language Models,
ECCV24(LIX: 179-196).
Springer DOI
2412
BibRef
Lin, Z.[Ziyi],
Liu, D.Y.[Dong-Yang],
Zhang, R.R.[Ren-Rui],
Gao, P.[Peng],
Qiu, L.T.[Long-Tian],
Xiao, H.[Han],
Qiu, H.[Han],
Shao, W.Q.[Wen-Qi],
Chen, K.Q.[Ke-Qin],
Han, J.M.[Jia-Ming],
Huang, S.Y.[Si-Yuan],
Zhang, Y.[Yichi],
He, X.M.[Xu-Ming],
Qiao, Y.[Yu],
Li, H.S.[Hong-Sheng],
Sphinx: A Mixer of Weights, Visual Embeddings and Image Scales for
Multi-modal Large Language Models,
ECCV24(LXII: 36-55).
Springer DOI
2412
BibRef
Chiquier, M.[Mia],
Mall, U.[Utkarsh],
Vondrick, C.[Carl],
Evolving Interpretable Visual Classifiers with Large Language Models,
ECCV24(LXIV: 183-201).
Springer DOI
2412
BibRef
Wu, T.[Tianhe],
Ma, K.[Kede],
Liang, J.[Jie],
Yang, Y.[Yujiu],
Zhang, L.[Lei],
A Comprehensive Study of Multimodal Large Language Models for Image
Quality Assessment,
ECCV24(LXXIV: 143-160).
Springer DOI
2412
BibRef
Chen, L.[Liang],
Zhao, H.Z.[Hao-Zhe],
Liu, T.Y.[Tian-Yu],
Bai, S.[Shuai],
Lin, J.Y.[Jun-Yang],
Zhou, C.[Chang],
Chang, B.[Baobao],
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-play Inference
Acceleration for Large Vision-language Models,
ECCV24(LXXXI: 19-35).
Springer DOI
2412
BibRef
Xu, J.[Jiacong],
Lo, S.Y.[Shao-Yuan],
Safaei, B.[Bardia],
Patel, V.M.[Vishal M.],
Dwivedi, I.[Isht],
Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal
Large Language Models,
CVPR25(20370-20382)
IEEE DOI Code:
WWW Link.
2508
Visualization, Large language models, Benchmark testing,
Inspection, Cognition, Anomaly detection, Tuning, Biomedical imaging,
multimodal large language model
BibRef
Yang, Y.C.[Yu-Chen],
Lee, K.[Kwonjoon],
Dariush, B.[Behzad],
Cao, Y.[Yinzhi],
Lo, S.Y.[Shao-Yuan],
Follow the Rules: Reasoning for Video Anomaly Detection with Large
Language Models,
ECCV24(LXXXI: 304-322).
Springer DOI
2412
BibRef
Chen, Y.C.[Yi-Chia],
Li, W.H.[Wei-Hua],
Sun, C.[Cheng],
Wang, Y.C.F.[Yu-Chiang Frank],
Chen, C.S.[Chu-Song],
Sam4mllm: Enhance Multi-modal Large Language Model for Referring
Expression Segmentation,
ECCV24(LXXXI: 323-340).
Springer DOI
2412
BibRef
Zheng, S.[Sipeng],
Zhou, B.[Bohan],
Feng, Y.C.[Yi-Cheng],
Wang, Y.[Ye],
Lu, Z.Q.[Zong-Qing],
Unicode: Learning a Unified Codebook for Multimodal Large Language
Models,
ECCV24(VIII: 426-443).
Springer DOI
2412
BibRef
Shi, B.F.[Bai-Feng],
Wu, Z.Y.[Zi-Yang],
Mao, M.L.[Mao-Lin],
Wang, X.[Xin],
Darrell, T.J.[Trevor J.],
When Do We Not Need Larger Vision Models?,
ECCV24(VIII: 444-462).
Springer DOI
2412
BibRef
Yu, Q.H.[Qi-Hang],
Shen, X.H.[Xiao-Hui],
Chen, L.C.[Liang-Chieh],
Towards Open-ended Visual Recognition with Large Language Models,
ECCV24(XIV: 359-376).
Springer DOI
2412
BibRef
Yan, C.[Cilin],
Wang, H.C.[Hao-Chen],
Yan, S.L.[Shi-Lin],
Jiang, X.L.[Xiao-Long],
Hu, Y.[Yao],
Kang, G.L.[Guo-Liang],
Xie, W.[Weidi],
Gavves, E.[Efstratios],
VISA: Reasoning Video Object Segmentation via Large Language Models,
ECCV24(XV: 98-115).
Springer DOI
2412
BibRef
Huang, K.[Kai],
Zou, H.[Hao],
Xi, Y.[Ye],
Wang, B.C.[Bo-Chen],
Xie, Z.[Zhen],
Yu, L.[Liang],
IVTP: Instruction-guided Visual Token Pruning for Large Vision-language
Models,
ECCV24(XVII: 214-230).
Springer DOI
2412
BibRef
Liu, H.T.[Hao-Tian],
Li, C.Y.[Chun-Yuan],
Li, Y.H.[Yu-Heng],
Lee, Y.J.[Yong Jae],
Improved Baselines with Visual Instruction Tuning,
CVPR24(26286-26296)
IEEE DOI
2410
Training, Connectors, Visualization, Systematics, Codes, Computational modeling
BibRef
Ren, Z.W.[Zhong-Wei],
Huang, Z.C.[Zhi-Cheng],
Wei, Y.C.[Yun-Chao],
Zhao, Y.[Yao],
Fu, D.M.[Dong-Mei],
Feng, J.S.[Jia-Shi],
Jin, X.J.[Xiao-Jie],
PixelLM: Pixel Reasoning with Large Multimodal Model,
CVPR24(26364-26373)
IEEE DOI
2410
Bridges, Image segmentation, Codes, Benchmark testing, Cognition, Decoding
BibRef
Schiappa, M.[Madeline],
Abdullah, R.[Raiyaan],
Azad, S.[Shehreen],
Claypoole, J.[Jared],
Cogswell, M.[Michael],
Divakaran, A.[Ajay],
Rawat, Y.[Yogesh],
Probing Conceptual Understanding of Large Visual-Language Models,
WhatNext24(1797-1807)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Shape, Snow, Color, Benchmark testing,
Transformers, Robustness, Conceptual understanding
BibRef
Yue, T.T.[Tong-Tian],
Cheng, J.[Jie],
Guo, L.T.[Long-Teng],
Dai, X.Y.[Xing-Yuan],
Zhao, Z.[Zijia],
He, X.J.[Xing-Jian],
Xiong, G.[Gang],
Lv, Y.S.[Yi-Sheng],
Liu, J.[Jing],
SC-Tune: Unleashing Self-Consistent Referential Comprehension in
Large Vision Language Models,
CVPR24(13073-13083)
IEEE DOI Code:
WWW Link.
2410
Training, Codes, Computational modeling, Focusing, Benchmark testing
BibRef
Wu, T.H.[Tsung-Han],
Lian, L.[Long],
Gonzalez, J.E.[Joseph E.],
Li, B.[Boyi],
Darrell, T.J.[Trevor J.],
Self-Correcting LLM-Controlled Diffusion Models,
CVPR24(6327-6336)
IEEE DOI Code:
WWW Link.
2410
Image synthesis, Pipelines, Text to image, Process control,
Detectors, Superluminescent diodes, Diffusion models
BibRef
Yue, X.[Xiang],
Ni, Y.S.[Yuan-Sheng],
Zheng, T.Y.[Tian-Yu],
Zhang, K.[Kai],
Liu, R.[Ruoqi],
Zhang, G.[Ge],
Stevens, S.[Samuel],
Jiang, D.[Dongfu],
Ren, W.M.[Wei-Ming],
Sun, Y.X.[Yu-Xuan],
Wei, C.[Cong],
Yu, B.T.[Bo-Tao],
Yuan, R.B.[Rui-Bin],
Sun, R.L.[Ren-Liang],
Yin, M.[Ming],
Zheng, B.[Boyuan],
Yang, Z.Z.[Zhen-Zhu],
Liu, Y.[Yibo],
Huang, W.H.[Wen-Hao],
Sun, H.[Huan],
Su, Y.[Yu],
Chen, W.[Wenhu],
MMMU: A Massive Multi-Discipline Multimodal Understanding and
Reasoning Benchmark for Expert AGI,
CVPR24(9556-9567)
IEEE DOI
2410
Computational modeling, Artificial general intelligence,
Social sciences, Manuals, Benchmark testing, Cognition, LLMs
BibRef
Zheng, D.[Duo],
Huang, S.[Shijia],
Zhao, L.[Lin],
Zhong, Y.[Yiwu],
Wang, L.W.[Li-Wei],
Towards Learning a Generalist Model for Embodied Navigation,
CVPR24(13624-13634)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Solid modeling, Navigation,
Soft sensors, Computational modeling, Visual-Language Navigation,
LLM
BibRef
Singh, S.[Simranjit],
Fore, M.[Michael],
Stamoulis, D.[Dimitrios],
GeoLLM-Engine: A Realistic Environment for Building Geospatial
Copilots,
EarthVision24(585-594)
IEEE DOI
2410
Earth, Geology, Natural languages, Benchmark testing,
Parallel processing, Geospatial analysis, Satellite images,
Benchmark
BibRef
Zhang, Y.C.[Yue-Chen],
Qian, S.J.[Sheng-Ju],
Peng, B.[Bohao],
Liu, S.[Shu],
Jia, J.Y.[Jia-Ya],
Prompt Highlighter: Interactive Control for Multi-Modal LLMs,
CVPR24(13215-13224)
IEEE DOI
2410
Training, Semantics, Process control, Focusing,
Reliability, Usability, VLM, LLM, Interactive Control, Image Caption,
Training-Free
BibRef
Wang, D.K.[Dong-Kai],
Xuan, S.Y.[Shi-Yu],
Zhang, S.L.[Shi-Liang],
LocLLM: Exploiting Generalizable Human Keypoint Localization via
Large Language Model,
CVPR24(614-623)
IEEE DOI Code:
WWW Link.
2410
Location awareness, Training, Large language models, Pipelines,
Training data, Cognition, Keypoint Localization,
Large Language Model
BibRef
Liu, H.C.[Han-Chao],
Zhan, X.H.[Xiao-Hang],
Huang, S.L.[Shao-Li],
Mu, T.J.[Tai-Jiang],
Shan, Y.[Ying],
Programmable Motion Generation for Open-Set Motion Control Tasks,
CVPR24(1399-1408)
IEEE DOI
2410
Motion planning, Large language models, Computational modeling,
Semantics, Dynamics, Training data
BibRef
Li, W.[Wanhua],
Zhou, R.P.[Ren-Ping],
Zhou, J.W.[Jia-Wei],
Song, Y.W.[Ying-Wei],
Herter, J.[Johannes],
Qin, M.H.[Ming-Han],
Huang, G.[Gao],
Pfister, H.[Hanspeter],
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large
Language Models,
CVPR25(22001-22011)
IEEE DOI
2508
Training, Deformable models, Visualization, Large language models,
Semantics, Benchmark testing, Videos
BibRef
Xia, Z.F.[Zhuo-Fan],
Han, D.C.[Dong-Chen],
Han, Y.Z.[Yi-Zeng],
Pan, X.[Xuran],
Song, S.[Shiji],
Huang, G.[Gao],
GSVA: Generalized Segmentation via Multimodal Large Language Models,
CVPR24(3858-3869)
IEEE DOI Code:
WWW Link.
2410
Image segmentation, Visualization, Codes, Large language models,
Benchmark testing
BibRef
Zhao, L.[Lirui],
Yang, Y.[Yue],
Zhang, K.[Kaipeng],
Shao, W.Q.[Wen-Qi],
Zhang, Y.X.[Yu-Xin],
Qiao, Y.[Yu],
Luo, P.[Ping],
Ji, R.R.[Rong-Rong],
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large
Language Model,
CVPR24(6390-6399)
IEEE DOI Code:
WWW Link.
2410
Training, Technological innovation, Accuracy, Codes,
Large language models, Computational modeling, LLM Agent, LLM Tool Usage
BibRef
Yao, J.[Junyi],
Liu, Y.J.[Yi-Jiang],
Dong, Z.[Zhen],
Guo, M.F.[Ming-Fei],
Hu, H.[Helan],
Keutzer, K.[Kurt],
Du, L.[Li],
Zhou, D.[Daquan],
Zhang, S.H.[Shang-Hang],
PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought,
CVPR24(7027-7037)
IEEE DOI
2410
Training, Adaptation models, Visualization, Computational modeling,
Large language models, Semantics, Text to image
BibRef
Cai, Z.P.[Zhi-Peng],
Mueller, M.[Matthias],
Birkl, R.[Reiner],
Wofk, D.[Diana],
Tseng, S.Y.[Shao-Yen],
Cheng, J.[Junda],
Stan, G.B.M.[Gabriela Ben-Melech],
Lai, V.[Vasudev],
Paulitsch, M.[Michael],
L-MAGIC: Language Model Assisted Generation of Images with Coherence,
CVPR24(7049-7058)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Solid modeling, Layout, Superresolution,
Estimation, Diffusion models, Image generation, large language models
BibRef
Li, Y.[Yanyu],
Liu, X.[Xian],
Kag, A.[Anil],
Hu, J.[Ju],
Idelbayev, Y.[Yerlan],
Sagar, D.[Dhritiman],
Wang, Y.Z.[Yan-Zhi],
Tulyakov, S.[Sergey],
Ren, J.[Jian],
TextCraftor: Your Text Encoder can be Image Quality Controller,
CVPR24(7985-7995)
IEEE DOI
2410
Training, Measurement, Interpolation, Image synthesis,
Large language models, Pipelines, Text to image, Stable Diffusion,
Image and video synthesis and generation
BibRef
Argaw, D.M.[Dawit Mureja],
Yoon, S.H.[Seung-Hyun],
Heilbron, F.C.[Fabian Caba],
Deilamsalehy, H.[Hanieh],
Bui, T.[Trung],
Wang, Z.W.[Zhao-Wen],
Dernoncourt, F.[Franck],
Chung, J.S.[Joon Son],
Scaling Up Video Summarization Pretraining with Large Language Models,
CVPR24(8332-8341)
IEEE DOI
2410
Analytical models, Large language models, Computational modeling,
Pipelines, Benchmark testing
BibRef
Lai, X.[Xin],
Tian, Z.[Zhuotao],
Chen, Y.K.[Yu-Kang],
Li, Y.W.[Yan-Wei],
Yuan, Y.H.[Yu-Hui],
Liu, S.[Shu],
Jia, J.Y.[Jia-Ya],
LISA: Reasoning Segmentation via Large Language Model,
CVPR24(9579-9589)
IEEE DOI
2410
Image segmentation, Vocabulary, Visualization, Target recognition,
Large language models, Benchmark testing
BibRef
Shang, C.M.[Chen-Ming],
Zhou, S.[Shiji],
Zhang, H.Y.[Heng-Yuan],
Ni, X.Z.[Xin-Zhe],
Yang, Y.[Yujiu],
Wang, Y.W.[Yu-Wang],
Incremental Residual Concept Bottleneck Models,
CVPR24(11030-11040)
IEEE DOI
2410
Measurement, Visualization, Accuracy, Large language models,
Current measurement, Decision making, Closed box
BibRef
Xie, Y.T.[Yu-Tong],
Chen, Q.[Qi],
Wang, S.[Sinuo],
To, M.S.[Minh-Son],
Lee, I.[Iris],
Khoo, E.W.[Ee Win],
Hendy, K.[Kerolos],
Koh, D.[Daniel],
Xia, Y.[Yong],
Wu, Q.[Qi],
PairAug: What Can Augmented Image-Text Pairs Do for Radiology?,
CVPR24(11652-11661)
IEEE DOI Code:
WWW Link.
2410
Data privacy, Medical conditions, Large language models, Radiology,
Data augmentation
BibRef
Dong, Z.K.[Zhi-Kang],
Liu, X.L.[Xiu-Long],
Chen, B.[Bin],
Polak, P.[Pawel],
Zhang, P.[Peng],
MuseChat: A Conversational Music Recommendation System for Videos,
CVPR24(12775-12785)
IEEE DOI Code:
WWW Link.
2410
Accuracy, Large language models, Natural languages, Cognition,
Recommender systems, Multimodal Learning,
Music Information Retrieval
BibRef
Li, F.[Feng],
Jiang, Q.[Qing],
Zhang, H.[Hao],
Ren, T.[Tianhe],
Liu, S.L.[Shi-Long],
Zou, X.[Xueyan],
Xu, H.Z.[Huai-Zhe],
Li, H.Y.[Hong-Yang],
Yang, J.W.[Jian-Wei],
Li, C.Y.[Chun-Yuan],
Zhang, L.[Lei],
Gao, J.F.[Jian-Feng],
Visual in-Context Prompting,
CVPR24(12861-12871)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Image segmentation, Codes,
Large language models, Computer architecture
BibRef
Sachdeva, R.[Ragav],
Zisserman, A.[Andrew],
The Manga Whisperer: Automatically Generating Transcriptions for
Comics,
CVPR24(12967-12976)
IEEE DOI Code:
WWW Link.
2410
Visualization, Codes, Large language models, Visual impairment,
Oral communication, Linguistics
BibRef
Du, Y.Y.[Yi-Yang],
Wang, X.C.[Xiao-Chen],
Chen, C.[Chi],
Ye, J.[Jiabo],
Wang, Y.[Yiru],
Li, P.[Peng],
Yan, M.[Ming],
Zhang, J.[Ji],
Huang, F.[Fei],
Sui, Z.F.[Zhi-Fang],
Sun, M.[Maosong],
Liu, Y.[Yang],
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language
Models with Unsupervised Coefficient Optimization,
CVPR25(9413-9422)
IEEE DOI
2508
Adaptation models, Interpolation, Large language models,
Computational modeling, Merging, Estimation, Data models,
model merging
BibRef
Ye, Q.H.[Qing-Hao],
Xu, H.Y.[Hai-Yang],
Ye, J.[Jiabo],
Yan, M.[Ming],
Hu, A.[Anwen],
Liu, H.[Haowei],
Qian, Q.[Qi],
Zhang, J.[Ji],
Huang, F.[Fei],
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
Modality Collaboration,
CVPR24(13040-13051)
IEEE DOI
2410
Large language models, Computational modeling, Collaboration,
Cognition, Decoding, Vision Language
BibRef
Qi, P.[Peng],
Yan, Z.[Zehong],
Hsu, W.[Wynne],
Lee, M.L.[Mong Li],
Sniffer: Multimodal Large Language Model for Explainable
Out-of-Context Misinformation Detection,
CVPR24(13052-13062)
IEEE DOI
2410
Visualization, Adaptation models, Accuracy, Large language models,
Computational modeling, Data models, multimodal misinformation,
explainability
BibRef
Zhong, S.S.[Shan-Shan],
Huang, Z.Z.[Zhong-Zhan],
Gao, S.[Shanghua],
Wen, W.[Wushao],
Lin, L.[Liang],
Zitnik, M.[Marinka],
Zhou, P.[Pan],
Let's Think Outside the Box: Exploring Leap-of-Thought in Large
Language Models with Creative Humor Generation,
CVPR24(13246-13257)
IEEE DOI Code:
WWW Link.
2410
Technological innovation, Codes, Large language models, Games,
Cognition
BibRef
Gao, Z.[Zhi],
Du, Y.T.[Yun-Tao],
Zhang, X.T.[Xin-Tong],
Ma, X.J.[Xiao-Jian],
Han, W.J.[Wen-Juan],
Zhu, S.C.[Song-Chun],
Li, Q.[Qing],
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update,
CVPR24(13258-13268)
IEEE DOI
2410
Visualization, Limiting,
Large language models, Training data, Tagging, Reflection,
Compositional Reasoning
BibRef
Li, B.[Bohao],
Ge, Y.Y.[Yu-Ying],
Ge, Y.X.[Yi-Xiao],
Wang, G.Z.[Guang-Zhi],
Wang, R.[Rui],
Zhang, R.M.[Rui-Mao],
Shan, Y.[Ying],
SEED-Bench: Benchmarking Multimodal Large Language Models,
CVPR24(13299-13308)
IEEE DOI Code:
WWW Link.
2410
Accuracy, Codes, Annotations, Image synthesis, Large language models,
Computational modeling, Benchmark, Multimodal, Hierarchical
BibRef
Buettner, K.[Kyle],
Malakouti, S.[Sina],
Li, X.L.[Xiang Lorraine],
Kovashka, A.[Adriana],
Incorporating Geo-Diverse Knowledge into Prompting for Increased
Geographical Robustness in Object Recognition,
CVPR24(13515-13524)
IEEE DOI
2410
Geography, Training, Large language models, Training data, Europe, Robustness
BibRef
Liu, R.[Ruyang],
Li, C.[Chen],
Ge, Y.X.[Yi-Xiao],
Li, T.H.[Thomas H.],
Shan, Y.[Ying],
Li, G.[Ge],
BT-Adapter: Video Conversation is Feasible Without Video Instruction
Tuning,
CVPR24(13658-13667)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Visualization, Costs,
Computational modeling, Graphics processing units,
Video Large Language Models
BibRef
Li, J.X.[Jia-Xuan],
Vo, D.M.[Duc Minh],
Sugimoto, A.[Akihiro],
Nakayama, H.[Hideki],
Evcap: Retrieval-Augmented Image Captioning with External Visual-Name
Memory for Open-World Comprehension,
CVPR24(13733-13742)
IEEE DOI
2410
Training, Visualization, Adaptation models, Costs,
Large language models, Memory management, Image Captioning,
External Memory
BibRef
Song, L.[Lin],
Chen, Y.K.[Yu-Kang],
Yang, S.[Shuai],
Ding, X.H.[Xiao-Han],
Ge, Y.X.[Yi-Xiao],
Chen, Y.C.[Ying-Cong],
Shan, Y.[Ying],
Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs,
CVPR24(13763-13773)
IEEE DOI
2410
Training, Attention mechanisms, Computational modeling,
Large language models, Benchmark testing, Natural language processing
BibRef
Guo, Q.[Qiushan],
de Mello, S.[Shalini],
Yin, H.X.[Hong-Xu],
Byeon, W.[Wonmin],
Cheung, K.C.[Ka Chun],
Yu, Y.Z.[Yi-Zhou],
Luo, P.[Ping],
Liu, S.[Sifei],
RegionGPT: Towards Region Understanding Vision Language Model,
CVPR24(13796-13806)
IEEE DOI
2410
Training, Visualization, Large language models, Pipelines,
Training data, Object detection, Cognition
BibRef
Yu, T.Y.[Tian-Yu],
Yao, Y.[Yuan],
Zhang, H.Y.[Hao-Ye],
He, T.[Taiwen],
Han, Y.F.[Yi-Feng],
Cui, G.[Ganqu],
Hu, J.Y.[Jin-Yi],
Liu, Z.Y.[Zhi-Yuan],
Zheng, H.T.[Hai-Tao],
Sun, M.[Maosong],
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from
Fine-Grained Correctional Human Feedback,
CVPR24(13807-13816)
IEEE DOI
2410
Image segmentation, Accuracy, Large language models,
Computational modeling, Benchmark testing, Cognition, vision,
hallucination
BibRef
Xuan, S.Y.[Shi-Yu],
Guo, Q.[Qingpei],
Yang, M.[Ming],
Zhang, S.L.[Shi-Liang],
Pink: Unveiling the Power of Referential Comprehension for
Multi-modal LLMs,
CVPR24(13838-13848)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Costs, Accuracy, Annotations, Large language models
BibRef
Yu, Q.[Qiying],
Sun, Q.[Quan],
Zhang, X.S.[Xiao-Song],
Cui, Y.F.[Yu-Feng],
Zhang, F.[Fan],
Cao, Y.[Yue],
Wang, X.L.[Xin-Long],
Liu, J.J.[Jing-Jing],
CapsFusion: Rethinking Image-Text Data at Scale,
CVPR24(14022-14032)
IEEE DOI
2410
Training, Knowledge engineering, Scalability,
Large language models, Computational modeling, Noise
BibRef
Yao, J.W.[Jia-Wei],
Qian, Q.[Qi],
Hu, J.[Juhua],
Multi-Modal Proxy Learning Towards Personalized Visual Multiple
Clustering,
CVPR24(14066-14075)
IEEE DOI Code:
WWW Link.
2410
Deep learning, Bridges, Visualization, Codes, Large language models,
Face recognition
BibRef
Zou, B.[Bo],
Yang, C.[Chao],
Qiao, Y.[Yu],
Quan, C.B.[Cheng-Bin],
Zhao, Y.J.[You-Jian],
LLaMA-Excitor: General Instruction Tuning via Indirect Feature
Interaction,
CVPR24(14089-14099)
IEEE DOI Code:
WWW Link.
2410
Visualization, Adaptation models, Codes, Computational modeling,
Benchmark testing, Instruction Tuning, PEFT,
Large Language Model
BibRef
Hong, W.[Wenyi],
Wang, W.H.[Wei-Han],
Lv, Q.S.[Qing-Song],
Xu, J.Z.[Jia-Zheng],
Yu, W.[Wenmeng],
Ji, J.H.[Jun-Hui],
Wang, Y.[Yan],
Wang, Z.[Zihan],
Dong, Y.X.[Yu-Xiao],
Ding, M.[Ming],
Tang, J.[Jie],
CogAgent: A Visual Language Model for GUI Agents,
CVPR24(14281-14290)
IEEE DOI Code:
WWW Link.
2410
Visualization, Limiting, Image resolution, Image recognition,
Navigation, Large language models, Benchmark testing
BibRef
Mitra, C.[Chancharik],
Huang, B.[Brandon],
Darrell, T.J.[Trevor J.],
Herzig, R.[Roei],
Compositional Chain-of-Thought Prompting for Large Multimodal Models,
CVPR24(14420-14431)
IEEE DOI Code:
WWW Link.
2410
Bridges, Visualization, Codes, Annotations, Large language models,
Benchmark testing, Large Multimodal Models, Multimodality,
Prompting
BibRef
Liu, C.[Chaohu],
Yin, K.[Kun],
Cao, H.Y.[Hao-Yu],
Jiang, X.H.[Xing-Hua],
Li, X.[Xin],
Liu, Y.[Yinsong],
Jiang, D.Q.[De-Qiang],
Sun, X.[Xing],
Xu, L.[Linli],
HRVDA: High-Resolution Visual Document Assistant,
CVPR24(15534-15545)
IEEE DOI
2410
Training, Visualization, Large language models,
Computational modeling, Training data, Transformers,
Multimodal
BibRef
Luo, C.[Chuwei],
Shen, Y.F.[Yu-Fan],
Zhu, Z.Q.[Zhao-Qing],
Zheng, Q.[Qi],
Yu, Z.[Zhi],
Yao, C.[Cong],
LayoutLLM: Layout Instruction Tuning with Large Language Models for
Document Understanding,
CVPR24(15630-15640)
IEEE DOI
2410
Large language models, Layout, Manuals, Inspection, Benchmark testing,
Boosting, Document Understanding, Layout, Large Language Models
BibRef
Yang, Y.[Yue],
Sun, F.Y.[Fan-Yun],
Weihs, L.[Luca],
Vanderbilt, E.[Eli],
Herrasti, A.[Alvaro],
Han, W.[Winson],
Wu, J.J.[Jia-Jun],
Haber, N.[Nick],
Krishna, R.[Ranjay],
Liu, L.J.[Ling-Jie],
Callison-Burch, C.[Chris],
Yatskar, M.[Mark],
Kembhavi, A.[Aniruddha],
Clark, C.[Christopher],
Holodeck: Language Guided Generation of 3D Embodied AI Environments,
CVPR24(16277-16287)
IEEE DOI
2410
Training, Navigation, Large language models, Semantics, Layout, Stars,
Embodied AI, 3D Scene Generation, Language-guided Generation
BibRef
Qin, Y.R.[Yi-Ran],
Zhou, E.[Enshen],
Liu, Q.[Qichang],
Yin, Z.F.[Zhen-Fei],
Sheng, L.[Lu],
Zhang, R.M.[Rui-Mao],
Qiao, Y.[Yu],
Shao, J.[Jing],
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active
Perception,
CVPR24(16307-16316)
IEEE DOI Code:
WWW Link.
2410
Visualization, Large language models, Active perception, Planning,
Compounds
BibRef
Zhang, S.[Sixian],
Yu, X.Y.[Xin-Yao],
Song, X.H.[Xin-Hang],
Wang, X.H.[Xiao-Han],
Jiang, S.Q.[Shu-Qiang],
Imagine Before Go: Self-Supervised Generative Map for Object Goal
Navigation,
CVPR24(16414-16425)
IEEE DOI Code:
WWW Link.
2410
Training, Geometry, Navigation, Large language models, Semantics,
Layout, Self-supervised learning, Embodied AI, Object Goal Navigation
BibRef
Li, H.[Hao],
Yang, X.[Xue],
Wang, Z.K.[Zhao-Kai],
Zhu, X.Z.[Xi-Zhou],
Zhou, J.[Jie],
Qiao, Y.[Yu],
Wang, X.G.[Xiao-Gang],
Li, H.S.[Hong-Sheng],
Lu, L.W.[Le-Wei],
Dai, J.F.[Ji-Feng],
Auto MC-Reward: Automated Dense Reward Design with Large Language
Models for Minecraft,
CVPR24(16426-16435)
IEEE DOI
2410
Learning systems, Codes, Large language models, Lava, Semantics,
Reinforcement learning, Syntactics, Large Language Model, Reward Shaping
BibRef
Liu, M.X.[Ming-Xuan],
Hayes, T.L.[Tyler L.],
Ricci, E.[Elisa],
Csurka, G.[Gabriela],
Volpi, R.[Riccardo],
SHiNe: Semantic Hierarchy Nexus for Open-Vocabulary Object Detection,
CVPR24(16634-16644)
IEEE DOI
2410
Vocabulary, Fuses, Large language models, Semantics, Detectors,
Object detection, Open-vocabulary, Object Detection, Vision-Language
BibRef
Kim, J.[Jooyeon],
Cho, E.[Eulrang],
Kim, S.[Sehyung],
Kim, H.W.J.[Hyun-Woo J.],
Retrieval-Augmented Open-Vocabulary Object Detection,
CVPR24(17427-17436)
IEEE DOI Code:
WWW Link.
2410
Portable media players, Visualization, Vocabulary,
Large language models, Semantics, Detectors, Object detection,
Retrieval-Augmentation
BibRef
Saha, O.[Oindrila],
van Horn, G.[Grant],
Maji, S.[Subhransu],
Improved Zero-Shot Classification by Adapting VLMs with Text
Descriptions,
CVPR24(17542-17552)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Large language models, Habitats,
Benchmark testing, Birds, Zero Shot Learning,
Fine-grained Classification
BibRef
Toubal, I.E.[Imad Eddine],
Avinash, A.[Aditya],
Alldrin, N.G.[Neil Gordon],
Dlabal, J.[Jan],
Zhou, W.[Wenlei],
Luo, E.[Enming],
Stretcu, O.[Otilia],
Xiong, H.[Hao],
Lu, C.T.[Chun-Ta],
Zhou, H.[Howard],
Krishna, R.[Ranjay],
Fuxman, A.[Ariel],
Duerig, T.[Tom],
Modeling Collaborator: Enabling Subjective Vision Classification with
Minimal Human Effort via LLM Tool-Use,
CVPR24(17553-17563)
IEEE DOI
2410
Visualization, Computational modeling, Large language models,
Natural languages, Wildlife, Training data, Manuals, tool-use
BibRef
Li, X.Q.[Xiao-Qi],
Xu, J.Y.[Jing-Yun],
Zhang, M.X.[Ming-Xu],
Liu, J.M.[Jia-Ming],
Shen, Y.[Yan],
Ponomarenko, I.[Iaroslav],
Xu, J.H.[Jia-Hui],
Heng, L.[Liang],
Huang, S.Y.[Si-Yuan],
Zhang, S.H.[Shang-Hang],
Dong, H.[Hao],
Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic
Manipulation,
CVPR25(27638-27648)
IEEE DOI
2508
Training, Visualization, Natural languages, Manuals,
Predictive models, Robustness, Libraries, Planning, Videos
BibRef
Li, X.Q.[Xiao-Qi],
Zhang, M.X.[Ming-Xu],
Geng, Y.R.[Yi-Ran],
Geng, H.R.[Hao-Ran],
Long, Y.X.[Yu-Xing],
Shen, Y.[Yan],
Zhang, R.R.[Ren-Rui],
Liu, J.M.[Jia-Ming],
Dong, H.[Hao],
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric
Robotic Manipulation,
CVPR24(18061-18070)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Large language models, Transforms,
Predictive models, Robot sensing systems, Cognition, Embodied AI,
Multi-modal Large Language Model
BibRef
Han, T.[Tengda],
Bain, M.[Max],
Nagrani, A.[Arsha],
Varol, G.[Gül],
Xie, W.[Weidi],
Zisserman, A.[Andrew],
AutoAD III: The Prequel: Back to the Pixels,
CVPR24(18164-18174)
IEEE DOI
2410
Training, Measurement, Visualization, Large language models,
Current measurement, Training data, Computer architecture
BibRef
Qu, H.X.[Hao-Xuan],
Cai, Y.J.[Yu-Jun],
Liu, J.[Jun],
LLMs are Good Action Recognizers,
CVPR24(18395-18406)
IEEE DOI
2410
Accuracy, Large language models,
Linguistics, Benchmark testing, Skeleton
BibRef
Chen, J.[Joya],
Lv, Z.Y.[Zhao-Yang],
Wu, S.W.[Shi-Wei],
Lin, K.Q.[Kevin Qinghong],
Song, C.[Chenan],
Gao, D.F.[Di-Fei],
Liu, J.W.[Jia-Wei],
Gao, Z.T.[Zi-Teng],
Mao, D.X.[Dong-Xing],
Shou, M.Z.[Mike Zheng],
VideoLLM-online: Online Video Large Language Model for Streaming
Video,
CVPR24(18407-18418)
IEEE DOI
2410
Training, Large language models, Soft sensors, Pipelines,
Streaming media, Rendering (computer graphics), Data models
BibRef
Zhu, A.[Anqi],
Ke, Q.H.[Qiu-Hong],
Gong, M.M.[Ming-Ming],
Bailey, J.[James],
Part-Aware Unified Representation of Language and Skeleton for
Zero-Shot Action Recognition,
CVPR24(18761-18770)
IEEE DOI Code:
WWW Link.
2410
Visualization, Source coding, Large language models,
Natural languages, Skeleton, representation learning
BibRef
Chen, T.J.[Tong-Jia],
Yu, H.S.[Hong-Shan],
Yang, Z.G.[Zhen-Geng],
Li, Z.C.[Ze-Chuan],
Sun, W.[Wei],
Chen, C.[Chen],
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor
for General Video Recognition,
CVPR24(18888-18898)
IEEE DOI
2410
Training, Adaptation models, Visualization, Large language models,
Semantics, Pipelines, Refining, Video Recognition,
Multi-modality Video Understanding
BibRef
Zhao, Q.H.[Qi-Hao],
Dai, Y.[Yalun],
Li, H.[Hao],
Hu, W.[Wei],
Zhang, F.[Fan],
Liu, J.[Jun],
LTGC: Long-Tail Recognition via Leveraging LLMs-Driven Generated
Content,
CVPR24(19510-19520)
IEEE DOI
2410
Semantic segmentation, Large language models,
Computational modeling, Data visualization, Tail, Benchmark testing
BibRef
Siddiqui, Y.[Yawar],
Alliegro, A.[Antonio],
Artemov, A.[Alexey],
Tommasi, T.[Tatiana],
Sirigatti, D.[Daniele],
Rosov, V.[Vladislav],
Dai, A.[Angela],
Nießner, M.[Matthias],
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers,
CVPR24(19615-19625)
IEEE DOI
2410
Geometry, Vocabulary, Solid modeling, Shape, Large language models,
Transformers, Mesh Generation, Generative Models for 3D,
Transformers
BibRef
Li, Z.[Zhe],
Gao, Z.Y.[Zhang-Yang],
Tan, C.[Cheng],
Ren, B.[Bocheng],
Yang, L.T.[Laurence T.],
Li, S.Z.[Stan Z.],
General Point Model Pretraining with Autoencoding and Autoregressive,
CVPR24(20954-20964)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Representation learning, Codes,
Large language models, Vector quantization, Computational modeling
BibRef
Li, K.C.[Kun-Chang],
Wang, Y.[Yali],
He, Y.[Yinan],
Li, Y.Z.[Yi-Zhuo],
Wang, Y.[Yi],
Liu, Y.[Yi],
Wang, Z.[Zun],
Xu, J.[Jilan],
Chen, G.[Guo],
Luo, P.[Ping],
Wang, L.M.[Li-Min],
Qiao, Y.[Yu],
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark,
CVPR24(22195-22206)
IEEE DOI Code:
WWW Link.
2410
Training, Systematics, Large language models, Image annotation,
Manuals, Benchmark testing
BibRef
Taesiri, M.R.[Mohammad Reza],
Feng, T.J.[Tian-Jun],
Bezemer, C.P.[Cor-Paul],
Nguyen, A.[Anh],
GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?,
CVPR24(22444-22455)
IEEE DOI Code:
WWW Link.
2410
Video games, Visualization, Quality assurance, Large language models,
Benchmark testing, Linguistics, Cognition, game testing
BibRef
Zhang, R.[Ruiyi],
Zhang, Y.Z.[Yan-Zhe],
Chen, J.[Jian],
Zhou, Y.F.[Yu-Fan],
Gu, J.X.[Jiu-Xiang],
Chen, C.[Changyou],
Sun, T.[Tong],
TRINS: Towards Multimodal Language Models that Can Read,
CVPR24(22584-22594)
IEEE DOI
2410
Visualization, Annotations, Large language models,
Computational modeling, Optical character recognition, Training data
BibRef
Dunlap, L.[Lisa],
Zhang, Y.H.[Yu-Hui],
Wang, X.H.[Xiao-Han],
Zhong, R.Q.[Rui-Qi],
Darrell, T.J.[Trevor J.],
Steinhardt, J.[Jacob],
Gonzalez, J.E.[Joseph E.],
Yeung-Levy, S.[Serena],
Describing Differences in Image Sets with Natural Language,
CVPR24(24199-24208)
IEEE DOI Code:
WWW Link.
2410
Analytical models, Large language models, Computational modeling,
Natural languages, Human in the loop
BibRef
Ishmam, A.M.[Alvi Md],
Thomas, C.[Christopher],
Semantic Shield: Defending Vision-Language Models Against Backdooring
and Poisoning via Fine-Grained Knowledge Alignment,
CVPR24(24820-24830)
IEEE DOI
2410
Training, Visualization, Correlation, Computational modeling,
Large language models, Semantics, Adversarial attack and defense,
Vision languge model
BibRef
Yang, Y.J.[Yi-Jun],
Zhou, T.Y.[Tian-Yi],
Li, K.[Kanxue],
Tao, D.P.[Da-Peng],
Li, L.[Lusong],
Shen, L.[Li],
He, X.D.[Xiao-Dong],
Jiang, J.[Jing],
Shi, Y.H.[Yu-Hui],
Embodied Multi-Modal Agent trained by an LLM from a Parallel
TextWorld,
CVPR24(26265-26275)
IEEE DOI
2410
Training, Visualization, Imitation learning, Large language models,
Robustness, Reflection, Embodied AI, Large Language Models, Imitation Learning
BibRef
Hong, Y.[Yining],
Zheng, Z.[Zishuo],
Chen, P.H.[Pei-Hao],
Wang, Y.F.[Yi-Fan],
Li, J.[Junyan],
Gan, C.[Chuang],
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model
in 3D World,
CVPR24(26396-26406)
IEEE DOI
2410
Visualization, Correlation, Navigation, Large language models,
Computational modeling
BibRef
Zhang, Y.[Yichi],
Dong, Y.P.[Yin-Peng],
Zhang, S.Y.[Si-Yuan],
Min, T.Z.[Tian-Zan],
Su, H.[Hang],
Zhu, J.[Jun],
Exploring the Transferability of Visual Prompting for Multimodal
Large Language Models,
CVPR24(26552-26562)
IEEE DOI
2410
Training, Visualization, Adaptation models, Computational modeling,
Large language models, Semantics, Feature extraction,
Transferability
BibRef
Han, J.M.[Jia-Ming],
Gong, K.X.[Kai-Xiong],
Zhang, Y.Y.[Yi-Yuan],
Wang, J.Q.[Jia-Qi],
Zhang, K.[Kaipeng],
Lin, D.[Dahua],
Qiao, Y.[Yu],
Gao, P.[Peng],
Yue, X.Y.[Xiang-Yu],
OneLLM: One Framework to Align All Modalities with Language,
CVPR24(26574-26585)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Large language models, Pipelines,
Benchmark testing, Functional magnetic resonance imaging, Routing
BibRef
Xie, H.X.[Hong-Xia],
Peng, C.J.[Chu-Jun],
Tseng, Y.W.[Yu-Wen],
Chen, H.J.[Hung-Jen],
Hsu, C.F.[Chan-Feng],
Shuai, H.H.[Hong-Han],
Cheng, W.H.[Wen-Huang],
EmoVIT: Revolutionizing Emotion Insights with Visual Instruction
Tuning,
CVPR24(26586-26595)
IEEE DOI Code:
WWW Link.
2410
Visualization, Emotion recognition, Large language models,
Pipelines, Benchmark testing, Cognition
BibRef
Wang, X.Y.[Xin-Yu],
Zhuang, B.[Bohan],
Wu, Q.[Qi],
ModaVerse: Efficiently Transforming Modalities with LLMs,
CVPR24(26596-26606)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Large language models,
Natural languages, Layout, Data models
BibRef
Lin, J.[Ji],
Yin, H.X.[Hong-Xu],
Ping, W.[Wei],
Molchanov, P.[Pavlo],
Shoeybi, M.[Mohammad],
Han, S.[Song],
VILA: On Pre-training for Visual Language Models,
CVPR24(26679-26689)
IEEE DOI
2410
Degradation, Visualization, Accuracy, Large language models,
Benchmark testing, Cognition
BibRef
Lyu, Y.H.[Yuan-Huiyi],
Zheng, X.[Xu],
Zhou, J.Z.[Jia-Zhou],
Wang, L.[Lin],
UniBind: LLM-Augmented Unified and Balanced Representation Space to
Bind Them All,
CVPR24(26742-26752)
IEEE DOI
2410
Point cloud compression, Visualization, Large language models,
Knowledge based systems, Infrared imaging, Contrastive learning,
Data mining
BibRef
Liang, T.[Tian],
Huang, J.[Jing],
Kong, M.[Ming],
Chen, L.[Luyuan],
Zhu, Q.[Qiang],
Querying as Prompt: Parameter-Efficient Learning for Multimodal
Language Model,
CVPR24(26845-26855)
IEEE DOI Code:
WWW Link.
2410
Training, Bridges, Adaptation models, Technological innovation,
Codes, Computational modeling, multimodal,
large language model
BibRef
Zhu, L.[Lei],
Wei, F.[Fangyun],
Lu, Y.[Yanye],
Beyond Text: Frozen Large Language Models in Visual Signal
Comprehension,
CVPR24(27037-27047)
IEEE DOI Code:
WWW Link.
2410
Visualization, Vocabulary, Image recognition, Large language models,
Semantics, Transforms, Feature extraction, Multi-modal learning
BibRef
Pi, R.J.[Ren-Jie],
Yao, L.W.[Le-Wei],
Gao, J.H.[Jia-Hui],
Zhang, J.P.[Ji-Peng],
Zhang, T.[Tong],
PerceptionGPT: Effectively Fusing Visual Perception Into LLM,
CVPR24(27114-27123)
IEEE DOI
2410
Training, Visualization, Accuracy, Large language models,
Decoding, Multimodal Learning
BibRef
Tai, Y.[Yan],
Fan, W.C.[Wei-Chen],
Zhang, Z.[Zhao],
Liu, Z.W.[Zi-Wei],
Link-Context Learning for Multimodal LLMs,
CVPR24(27166-27175)
IEEE DOI
2410
Training, Image recognition, Large language models,
Oral communication, Propulsion, Cognition
BibRef
Tang, Z.[Zineng],
Yang, Z.[Ziyi],
Khademi, M.[Mahmoud],
Liu, Y.[Yang],
Zhu, C.G.[Chen-Guang],
Bansal, M.[Mohit],
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any
Generation,
CVPR24(27415-27424)
IEEE DOI
2410
Image synthesis, Large language models, Oral communication,
Encoding, Cognition
BibRef
Jain, J.[Jitesh],
Yang, J.W.[Jian-Wei],
Shi, H.[Humphrey],
VCoder: Versatile Vision Encoders for Multimodal Large Language
Models,
CVPR24(27992-28002)
IEEE DOI
2410
Training, Visualization, Image segmentation, Costs, Image synthesis,
Large language models, Machine vision
BibRef
Yuan, Y.Q.[Yu-Qian],
Li, W.[Wentong],
Liu, J.[Jian],
Tang, D.Q.[Dong-Qi],
Luo, X.J.[Xin-Jie],
Qin, C.[Chi],
Zhang, L.[Lei],
Zhu, J.[Jianke],
Osprey: Pixel Understanding with Visual Instruction Tuning,
CVPR24(28202-28211)
IEEE DOI Code:
WWW Link.
2410
Convolutional codes, Visualization, Computational modeling,
Source coding, Large language models, Semantics
BibRef
Zheng, Z.H.[Zhao-Heng],
Wei, J.[Jingmin],
Hu, X.F.[Xue-Feng],
Zhu, H.D.[Hai-Dong],
Nevatia, R.[Ram],
Large Language Models are Good Prompt Learners for Low-Shot Image
Classification,
CVPR24(28453-28462)
IEEE DOI Code:
WWW Link.
2410
Learning systems, Training, Adaptation models, Codes,
Large language models, Computational modeling
BibRef
He, H.Y.[Hao-Yu],
Pan, Z.Z.[Zi-Zheng],
Liu, J.[Jing],
Cai, J.F.[Jian-Fei],
Zhuang, B.[Bohan],
Efficient Stitchable Task Adaptation,
CVPR24(28555-28565)
IEEE DOI Code:
WWW Link.
2410
Training, Deep learning, Adaptation models, Visualization,
Scalability, Pipelines, Memory management, model stitching,
large language model
BibRef
Tian, X.Y.[Xin-Yu],
Zou, S.[Shu],
Yang, Z.Y.[Zhao-Yuan],
Zhang, J.[Jing],
ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models,
CVPR24(28578-28587)
IEEE DOI Code:
WWW Link.
2410
Adaptation models, Visualization, Correlation, Computational modeling,
Large language models, Semantics, few-shot adaptation
BibRef
Barbany, O.[Oriol],
Huang, M.[Michael],
Zhu, X.L.[Xin-Liang],
Dhua, A.[Arnab],
Leveraging Large Language Models for Multimodal Search,
FGVC24(1201-1210)
IEEE DOI
2410
Large language models, Natural languages, Pipelines,
Image retrieval, LLM, retrieval, fashion,
multimodal
BibRef
Lv, J.X.[Jia-Xi],
Huang, Y.[Yi],
Yan, M.[Mingfu],
Huang, J.C.[Jian-Cheng],
Liu, J.Z.[Jian-Zhuang],
Liu, Y.F.[Yi-Fan],
Wen, Y.F.[Ya-Fei],
Chen, X.X.[Xiao-Xin],
Chen, S.F.[Shi-Feng],
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation
via Blender-Oriented GPT Planning,
PBDL24(1430-1440)
IEEE DOI Code:
WWW Link.
2410
Image synthesis, Large language models, Text to image, Fluid flow,
Manuals, Diffusion models
BibRef
Baldassini, F.B.[Folco Bertini],
Shukor, M.[Mustafa],
Cord, M.[Matthieu],
Soulier, L.[Laure],
Piwowarski, B.[Benjamin],
What Makes Multimodal In-Context Learning Work?,
Prompting24(1539-1550)
IEEE DOI
2410
Training, Analytical models, Codes, Large language models,
Impedance matching, Large Language Models, Shortcuts learning
BibRef
Wang, J.C.[Jun-Chi],
Ke, L.[Lei],
LLM-Seg: Bridging Image Segmentation and Large Language Model
Reasoning,
WhatNext24(1765-1774)
IEEE DOI Code:
WWW Link.
2410
Training, Image segmentation, Large language models,
Design methodology, Pipelines, Cognition
BibRef
Hakim, Z.I.A.[Zaber Ibn Abdul],
Sarker, N.H.[Najibul Haque],
Singh, R.P.[Rahul Pratap],
Paul, B.[Bishmoy],
Dabouei, A.[Ali],
Xu, M.[Min],
Leveraging Generative Language Models for Weakly Supervised Sentence
Component Analysis in Video-Language Joint Learning,
MULA24(1975-1985)
IEEE DOI
2410
Training, Adaptation models, Statistical analysis,
Large language models, Estimation, Contrastive learning, Distance measurement
BibRef
Deria, A.[Ankan],
Kumar, K.[Komal],
Chakraborty, S.[Snehashis],
Mahapatra, D.[Dwarikanath],
Roy, S.[Sudipta],
InVERGe: Intelligent Visual Encoder for Bridging Modalities in Report
Generation,
MULA24(2028-2038)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Computational modeling, Radiology,
Transformers, Feature extraction, Decoding, Deep Learning,
Large Language Model
BibRef
Ma, F.P.[Fei-Peng],
Zhou, Y.Z.[Yi-Zhou],
Zhang, Y.Y.[Yue-Yi],
Wu, S.Y.[Si-Ying],
Zhang, Z.[Zheyu],
He, Z.L.[Zi-Long],
Rao, F.Y.[Feng-Yun],
Sun, X.Y.[Xiao-Yan],
Task Navigator: Decomposing Complex Tasks for Multimodal Large
Language Models,
Reasoning24(2248-2257)
IEEE DOI
2410
Training, Systematics, Navigation, Large language models,
Training data, Language and Vision, Multi-modal Vision
BibRef
Arefeen, M.A.[Md Adnan],
Debnath, B.[Biplob],
Uddin, M.Y.S.[Md Yusuf Sarwar],
Chakradhar, S.[Srimat],
ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based
Video Analysis System,
Reasoning24(2266-2274)
IEEE DOI
2410
Accuracy, Large language models, Natural language processing,
Data models, Video Analytics,
Large Language Models (LLMs)
BibRef
Chen, Y.W.[Yu-Wei],
Chu, S.Y.[Shi-Yong],
Large Language Models in Wargaming: Methodology, Application, and
Robustness,
AML24(2894-2903)
IEEE DOI
2410
Navigation, Large language models, Decision making,
Strategic planning, Solids, Robustness, Natural language processing
BibRef
Lai, Z.X.[Zhi-Xin],
Wu, J.[Jing],
Chen, S.[Suiyao],
Zhou, Y.C.[Yu-Cheng],
Hovakimyan, N.[Naira],
Residual-based Language Models are Free Boosters for Biomedical
Imaging Tasks,
DEF-AI-MIA24(5086-5096)
IEEE DOI Code:
WWW Link.
2410
Visualization, Large language models, Fasteners, Transformers,
LLM, Biomedical Imaging
BibRef
Fang, X.[Xi],
Wang, W.G.[Wei-Gang],
Lv, X.X.[Xiao-Xin],
Yan, J.[Jun],
PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt
Condition,
NTIRE24(6167-6176)
IEEE DOI
2410
Image quality, Databases, Large language models, Semantics,
Quality assessment, Ensemble learning, AIGC,
multimodal learning
BibRef
Ye, Z.[Zilyu],
Liu, J.X.[Jin-Xiu],
Cao, J.J.[Jin-Jin],
Chen, Z.Y.[Zhi-Yang],
Xuan, Z.W.[Zi-Wei],
Zhou, M.Y.[Ming-Yuan],
Liu, Q.[Qi],
Qi, G.J.[Guo-Jun],
OpenStory: A Large-Scale Open-Domain Dataset for Subject-Driven
Visual Storytelling,
VDU24(7953-7962)
IEEE DOI
2410
Training, Visualization, Annotations, Large language models,
Pipelines, Manuals
BibRef
Chen, X.Y.[Xiang-Yu],
Liu, J.[Jing],
Wang, Y.[Ye],
Wang, P.P.[Pu Perry],
Brand, M.[Matthew],
Wang, G.H.[Guang-Hui],
Koike-Akino, T.[Toshiaki],
SuperLoRA: Parameter-Efficient Unified Adaptation for Large Vision
Models,
ECV24(8050-8055)
IEEE DOI
2410
Adaptation models, Tensors, Computational modeling,
Large language models, Transfer learning, parameter efficiency,
low-rank adaptation
BibRef
Wei, C.[Chen],
Liu, C.X.[Chen-Xi],
Qiao, S.Y.[Si-Yuan],
Zhang, Z.S.[Zhi-Shuai],
Yuille, A.L.[Alan L.],
Yu, J.H.[Jia-Hui],
De-Diffusion Makes Text a Strong Cross-Modal Interface,
CVPR24(13492-13503)
IEEE DOI
2410
Large language models, Natural languages, Text to image,
Transforms, Diffusion models, Decoding, Diffusion, Generative Model,
Vision and Language
BibRef
Chen, B.[Boyuan],
Xu, Z.[Zhuo],
Kirmani, S.[Sean],
Ichter, B.[Brian],
Sadigh, D.[Dorsa],
Guibas, L.J.[Leonidas J.],
Xia, F.[Fei],
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning
Capabilities,
CVPR24(14455-14465)
IEEE DOI Code:
WWW Link.
2410
Training, Solid modeling, Visualization, Pipelines, Training data, Cognition,
spatial reasoning, large language model, multimodal, vision language model
BibRef
Dorkenwald, M.[Michael],
Barazani, N.[Nimrod],
Snoek, C.G.M.[Cees G. M.],
Asano, Y.M.[Yuki M.],
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs,
CVPR24(13548-13558)
IEEE DOI
2410
Training, Computational modeling, Machine vision,
Large language models, Pipelines, Pins, Vision-Language Models,
Efficient Adaption of VLMs
BibRef
Cha, J.[Junbum],
Kang, W.[Wooyoung],
Mun, J.[Jonghwan],
Roh, B.[Byungseok],
Honeybee: Locality-Enhanced Projector for Multimodal LLM,
CVPR24(13817-13827)
IEEE DOI Code:
WWW Link.
2410
Visualization, Codes, Large language models, Benchmark testing,
Tuning, Multimodal LLM, Vision-Language
BibRef
Sun, Z.Y.[Ze-Yi],
Fang, Y.[Ye],
Wu, T.[Tong],
Zhang, P.[Pan],
Zang, Y.H.[Yu-Hang],
Kong, S.[Shu],
Xiong, Y.J.[Yuan-Jun],
Lin, D.[Dahua],
Wang, J.Q.[Jia-Qi],
Alpha-CLIP: A CLIP Model Focusing on Wherever you Want,
CVPR24(13019-13029)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Visualization, Image recognition, Codes,
Large language models, CLIP, Vision-language pretraining, MLLMs
BibRef
Parashar, S.[Shubham],
Lin, Z.Q.[Zhi-Qiu],
Liu, T.[Tian],
Dong, X.J.[Xiang-Jue],
Li, Y.[Yanan],
Ramanan, D.[Deva],
Caverlee, J.[James],
Kong, S.[Shu],
The Neglected Tails in Vision-Language Models,
CVPR24(12988-12997)
IEEE DOI
2410
Training, Visualization, Accuracy, Large language models,
Text to image, Tail, Flowering plants, Vision-Language Models,
Long tailed recognition
BibRef
Luo, Y.[Yan],
Shi, M.[Min],
Khan, M.O.[Muhammad Osama],
Afzal, M.M.[Muhammad Muneeb],
Huang, H.[Hao],
Yuan, S.[Shuaihang],
Tian, Y.[Yu],
Song, L.[Luo],
Kouhana, A.[Ava],
Elze, T.[Tobias],
Fang, Y.[Yi],
Wang, M.Y.[Meng-Yu],
FairCLIP: Harnessing Fairness in Vision-Language Learning,
CVPR24(12289-12301)
IEEE DOI Code:
WWW Link.
2410
Deep learning, Bridges, Analytical models, Ethics, Codes,
Computational modeling, Fairness Learning, Large Language Models
BibRef
Zara, G.[Giacomo],
Conti, A.[Alessandro],
Roy, S.[Subhankar],
Lathuilière, S.[Stéphane],
Rota, P.[Paolo],
Ricci, E.[Elisa],
The Unreasonable Effectiveness of Large Language-Vision Models for
Source-free Video Domain Adaptation,
ICCV23(10273-10283)
IEEE DOI
2401
BibRef
Zhao, H.B.[Hong-Bo],
Ni, B.L.[Bo-Lin],
Fan, J.S.[Jun-Song],
Wang, Y.X.[Yu-Xi],
Chen, Y.T.[Yun-Tao],
Meng, G.F.[Gao-Feng],
Zhang, Z.X.[Zhao-Xiang],
Continual Forgetting for Pre-Trained Vision Models,
CVPR24(28631-28642)
IEEE DOI Code:
WWW Link.
2410
Privacy, Codes, Large language models,
Face recognition, Object detection, Continual Forgetting, Machine Unlearning
BibRef
Zhan, X.Y.[Xin-Yu],
Yang, L.X.[Li-Xin],
Zhao, Y.F.[Yi-Fei],
Mao, K.[Kangrui],
Xu, H.L.[Han-Lin],
Lin, Z.[Zenan],
Li, K.L.[Kai-Lin],
Lu, C.[Cewu],
OakInk2: A Dataset of Bimanual Hands-Object Manipulation in Complex
Task Completion,
CVPR24(445-456)
IEEE DOI Code:
WWW Link.
2410
Annotations, Affordances, Computational modeling,
Large language models, Decoding
BibRef
Li, Y.C.[Yi-Cong],
Zhao, N.[Na],
Xiao, J.B.[Jun-Bin],
Feng, C.[Chun],
Wang, X.[Xiang],
Chua, T.S.[Tat-Seng],
LASO: Language-Guided Affordance Segmentation on 3D Object,
CVPR24(14251-14260)
IEEE DOI Code:
WWW Link.
2410
Visualization, Solid modeling, Shape, Affordances,
Large language models, Semantics, Multimodal, 3D-Language, Vision-Language
BibRef
Rotstein, N.[Noam],
Bensaïd, D.[David],
Brody, S.[Shaked],
Ganz, R.[Roy],
Kimmel, R.[Ron],
FuseCap: Leveraging Large Language Models for Enriched Fused Image
Captions,
WACV24(5677-5688)
IEEE DOI
2404
Training, Surveys, Visualization, Fuses,
Optical character recognition, Training data, Algorithms,
Image recognition and understanding
BibRef
Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Large Language Models for Autonomous Driving, LLM, LVLM .