20.4.3.3.9 Vision-Language Models, Language-Vision Models

Chapter Contents
Vision Language Model. Vision-Language Model. Visual-Language Model. Language Vision Model.
See also Visual Grounding, Grounding Expressions.
See also CLIP, Contrastive Language-Image Pre-Training.
See also Large Language Models for Vision, LLM, LVLM.
See also Composed Image Retrieval.
See also Foundation Models, Graph Foundation Models.

Tamaazousti, Y.[Youssef], Le Borgne, H.[Hervé], Popescu, A.[Adrian], Gadeski, E.[Etienne], Ginsca, A.[Alexandru], Hudelot, C.[Céline],
Vision-language integration using constrained local semantic features,
CVIU(163), No. 1, 2017, pp. 41-57.
Elsevier DOI 1712
Image classification BibRef

Zhu, Y.Q.[Yong-Qing], Li, X.Y.[Xiang-Yang], Zheng, M.[Mao], Yang, J.H.[Jia-Hao], Wang, Z.H.[Zi-Han], Guo, X.Q.[Xiao-Qian], Chai, Z.F.[Zi-Feng], Yuan, Y.C.[Yu-Chen], Jiang, S.Q.[Shu-Qiang],
Focus and Align: Learning Tube Tokens for Video-Language Pre-Training,
MultMed(25), 2023, pp. 8036-8050.
IEEE DOI 2312
BibRef

Wu, W.H.[Wen-Hao], Sun, Z.[Zhun], Song, Y.X.[Yu-Xin], Wang, J.D.[Jing-Dong], Ouyang, W.L.[Wan-Li],
Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective,
IJCV(132), No. 2, February 2024, pp. 392-409.
Springer DOI 2402
BibRef

Ming, Y.F.[Yi-Fei], Li, Y.X.[Yi-Xuan],
How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?,
IJCV(132), No. 2, February 2024, pp. 596-609.
Springer DOI 2402
BibRef

Zhao, C.R.[Cai-Rong], Wang, Y.[Yubin], Jiang, X.Y.[Xin-Yang], Shen, Y.F.[Yi-Fei], Song, K.[Kaitao], Li, D.S.[Dong-Sheng], Miao, D.Q.[Duo-Qian],
Learning Domain Invariant Prompt for Vision-Language Models,
IP(33), 2024, pp. 1348-1360.
IEEE DOI 2402
Task analysis, Tuning, Training, Adaptation models, Visualization, Image color analysis, Self-supervised learning, Prompt learning, domain generalization BibRef

Yang, X.F.[Xiao-Feng], Liu, F.[Fayao], Lin, G.S.[Guo-Sheng],
Neural Logic Vision Language Explainer,
MultMed(26), 2024, pp. 3331-3340.
IEEE DOI 2402
Cognition, Logic programming, Deep learning, Visualization, Data models, Training, Markov processes, vision language pretraining BibRef

Wang, Y.D.[Yi-Dong], Yu, Z.O.[Zhu-Ohao], Wang, J.D.[Jin-Dong], Heng, Q.[Qiang], Chen, H.[Hao], Ye, W.[Wei], Xie, R.[Rui], Xie, X.[Xing], Zhang, S.K.[Shi-Kun],
Exploring Vision-Language Models for Imbalanced Learning,
IJCV(132), No. 1, January 2024, pp. 224-237.
Springer DOI 2402
BibRef

Zeng, Y.[Yan], Zhang, X.[Xinsong], Li, H.[Hang], Wang, J.W.[Jia-Wei], Zhang, J.P.[Ji-Peng], Zhou, W.[Wangchunshu],
X2-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks,
PAMI(46), No. 5, May 2024, pp. 3156-3168.
IEEE DOI 2404
Task analysis, Visualization, Transformers, Detectors, Training, Feature extraction, Image coding, vision language pre-training BibRef

Kong, D.[Daehyeon], Kong, K.[Kyeongbo], Kang, S.J.[Suk-Ju],
Image clustering using generated text centroids,
SP:IC(125), 2024, pp. 117128.
Elsevier DOI 2405
Deep neural network, Image clustering, Multimodal task, Vision-language model BibRef

Chen, X.Y.[Xian-Yu], Yang, J.H.[Jin-Hui], Chen, S.[Shi], Wang, L.[Louis], Jiang, M.[Ming], Zhao, Q.[Qi],
Every Problem, Every Step, All in Focus: Learning to Solve Vision-Language Problems With Integrated Attention,
PAMI(46), No. 7, July 2024, pp. 4720-4735.
IEEE DOI 2406
Problem-solving, Task analysis, Visualization, Measurement, Graph neural networks, Cognition, Videos, Graph attention, vision-language problem solving BibRef

Menon, S.[Sachit], Chandratreya, I.P.[Ishaan Preetam], Vondrick, C.[Carl],
Task Bias in Contrastive Vision-Language Models,
IJCV(132), No. 6, June 2024, pp. 2026-2040.
Springer DOI 2406
BibRef

Zhang, J.Y.[Jing-Yi], Huang, J.X.[Jia-Xing], Jin, S.[Sheng], Lu, S.J.[Shi-Jian],
Vision-Language Models for Vision Tasks: A Survey,
PAMI(46), No. 8, August 2024, pp. 5625-5644.
IEEE DOI 2407
Task analysis, Visualization, Training, Deep learning, Surveys, Data models, Predictive models, Big Data, big model, deep learning, image classification BibRef

Dong, M.P.[Meng-Ping], Li, F.[Fei], Li, Z.B.[Zhen-Bo], Liu, X.[Xue],
Cluster prototype earth mover's distance adapters and alignment-guided prompt learning for vision-language models,
PR(156), 2024, pp. 110861.
Elsevier DOI 2408
Cluster prototype, Earth mover's distance, Adapter, Prompt learning, Vision-language models BibRef

Liu, Y.[Ye], Pan, Y.[Yan], Yin, J.[Jian],
Enhancing Multi-Label Deep Hashing for Image and Audio With Joint Internal Global Loss Constraints and Large Vision-Language Model,
SPLetters(31), 2024, pp. 2550-2554.
IEEE DOI 2410
Codes, Transformers, Adaptation models, Training, Convolutional neural networks, Feature extraction, vision transformer BibRef

Zhan, C.L.[Chen-Lu], Zhang, Y.F.[Yu-Fei], Lin, Y.[Yu], Wang, G.A.[Gao-Ang], Wang, H.W.[Hong-Wei],
UniDCP: Unifying Multiple Medical Vision-Language Tasks via Dynamic Cross-Modal Learnable Prompts,
MultMed(26), 2024, pp. 9736-9748.
IEEE DOI 2410
Task analysis, Adaptation models, Visualization, Medical diagnostic imaging, Tuning, Multitasking, Plastics, cross-modal shareable space BibRef

Su, K.[Ke], Zhang, X.X.[Xing-Xing], Zhang, S.Y.[Si-Yang], Zhu, J.[Jun], Zhang, B.[Bo],
To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training,
IP(33), 2024, pp. 5370-5381.
IEEE DOI 2410
Cognition, Visualization, Artificial intelligence, Training, Image reconstruction, Navigation, vision-language pre-training BibRef

Xuan, S.Y.[Shi-Yu], Yang, M.[Ming], Zhang, S.L.[Shi-Liang],
Adapting Vision-Language Models via Learning to Inject Knowledge,
IP(33), 2024, pp. 5798-5809.
IEEE DOI 2410
Feature extraction, Visualization, Adaptation models, Tuning, Training, Transformers, Dogs, Accuracy, Robustness, Few shot learning, knowledge injection BibRef

Zhou, W.[Wenlve], Zhou, Z.H.[Zhi-Heng],
Unsupervised Domain Adaption Harnessing Vision-Language Pre-Training,
CirSysVideo(34), No. 9, September 2024, pp. 8201-8214.
IEEE DOI Code:
WWW Link. 2410
Adaptation models, Task analysis, Training, Computational modeling, Tuning, Data models, Visualization, Unsupervised domain adaptation, model deployment BibRef

Guo, M.H.[Meng-Hao], Zhang, Y.[Yi], Mu, T.J.[Tai-Jiang], Huang, S.X.[Sharon X.], Hu, S.M.[Shi-Min],
Tuning Vision-Language Models With Multiple Prototypes Clustering,
PAMI(46), No. 12, December 2024, pp. 11186-11199.
IEEE DOI 2411
Prototypes, Adaptation models, Tuning, Visualization, Benchmark testing, Computational modeling, Data models, clustering BibRef

Sun, B.[Bo], Wu, Z.C.[Zhi-Chao], Zhang, H.[Hao], He, J.[Jun],
VTPL: Visual and text prompt learning for visual-language models,
JVCIR(104), 2024, pp. 104280.
Elsevier DOI 2411
V-L models, Prompt learning, Visual and text prompts, Poly-1 information NCE loss, Center loss BibRef

Liu, L.C.[Liang-Chen], Wang, N.N.[Nan-Nan], Liu, D.[Decheng], Yang, X.[Xi], Gao, X.B.[Xin-Bo], Liu, T.L.[Tong-Liang],
Towards Specific Domain Prompt Learning via Improved Text Label Optimization,
MultMed(26), 2024, pp. 10805-10815.
IEEE DOI 2411
Visualization, Optimization, Semantics, Task analysis, Terminology, Learning systems, Adaptation models, vision-language model BibRef

Liu, X.[Xin], Wu, J.[Jiamin], Yang, W.F.[Wen-Fei], Zhou, X.[Xu], Zhang, T.Z.[Tian-Zhu],
Multi-Modal Attribute Prompting for Vision-Language Models,
CirSysVideo(34), No. 11, November 2024, pp. 11579-11591.
IEEE DOI 2412
Visualization, Task analysis, Semantics, Adaptation models, Integrated circuit modeling, Vectors, attribute BibRef

Jiang, H.J.[Hao-Jun], Zhang, J.K.[Jian-Ke], Huang, R.[Rui], Ge, C.J.[Chun-Jiang], Ni, Z.[Zanlin], Song, S.[Shiji], Huang, G.[Gao],
Cross-modal adapter for vision-language retrieval,
PR(159), 2025, pp. 111144.
Elsevier DOI 2412
Adapter, Cross-modal interaction, Cross-modal retrieval, Parameter-efficient training, Multi-modal learning BibRef

Yellinek, N.[Nir], Karlinsky, L.[Leonid], Giryes, R.[Raja],
3VL: Using Trees to Improve Vision-Language Models' Interpretability,
IP(34), 2025, pp. 495-509.
IEEE DOI 2501
aligning image and text representations. Random forests, Visualization, Training, Cognition, Feature extraction, Transformers, Forestry, Animals, compositional reasoning BibRef

Yang, L.F.[Ling-Feng], Li, X.[Xiang], Wang, Y.Z.[Yue-Ze], Wang, X.L.[Xin-Long], Yang, J.[Jian],
Fine-Grained Visual Text Prompting,
PAMI(47), No. 3, March 2025, pp. 1594-1609.
IEEE DOI 2502
What kind of visual prompts to add. Visualization, Semantics, Image segmentation, Crops, Tuning, Detectors, Proposals, Location awareness, Grounding, Gray-scale, zero-shot BibRef

Wang, F.[Fan], Han, Z.Y.[Zhong-Yi], Liu, X.[Xingbo], Yin, Y.L.[Yi-Long], Gao, X.[Xin],
CTPT: Continual Test-time Prompt Tuning for vision-language models,
PR(161), 2025, pp. 111300.
Elsevier DOI 2502
Test-time adaptation, Contrastive Language-Image Pretraining (CLIP), Stable self-learning BibRef

Liang, N.[Nanhao], Liu, Y.[Yong],
DPO: Discrete Prompt Optimization for Vision-Language Models,
SPLetters(32), 2025, pp. 671-675.
IEEE DOI 2502
Training, Optimization, Adaptation models, Visualization, Overfitting, Vectors, Vocabulary, Signal processing algorithms, vision-language model BibRef

Ondeng, O.[Oscar], Ouma, H.[Heywood], Akuon, P.[Peter],
Enriching visual feature representations for vision-language tasks using spectral transforms,
IVC(154), 2025, pp. 105390.
Elsevier DOI 2502
Visual feature enrichment, Transformers, Image captioning, Discrete Fourier Transform, MS COCO, Kylberg dataset, Diversity BibRef

Xu, C.[Chen], Zhu, Y.H.[Yu-Han], Shen, H.C.[Hao-Cheng], Chen, B.H.[Bo-Heng], Liao, Y.X.[Yi-Xuan], Chen, X.X.[Xiao-Xin], Wang, L.M.[Li-Min],
Progressive Visual Prompt Learning with Contrastive Feature Re-formation,
IJCV(133), No. 2, February 2025, pp. 511-526.
Springer DOI 2502
Adapting the pre-trained Vision-Language Models. BibRef

Long, S.[Sifan], Zhao, Z.[Zhen], Yuan, J.K.[Jun-Kun], Tan, Z.C.[Zi-Chang], Liu, J.J.[Jiang-Jiang], Feng, J.Y.[Jing-Yuan], Wang, S.S.[Sheng-Sheng], Wang, J.D.[Jing-Dong],
Mutual Prompt Leaning for Vision Language Models,
IJCV(133), No. 3, March 2025, pp. 1258-1276.
Springer DOI 2502
BibRef

Yin, J.H.[Jun-Hui], Zhang, X.Y.[Xin-Yu], Wu, L.[Lin], Wang, X.J.[Xiao-Jie],
Context-aware prompt learning for test-time vision recognition with frozen vision-language model,
PR(162), 2025, pp. 111359.
Elsevier DOI Code:
WWW Link. 2503
In-context learning, Prompt learning, Vision-language model, Vision recognition, Test-time adaptation BibRef

Chen, Y.[Yeming], Zhang, S.[Siyu], Sun, Y.[Yaoru], Yang, J.[Jun], Liang, W.J.[Wei-Jian], Wang, H.R.[Hao-Ran],
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning,
CirSysVideo(35), No. 3, March 2025, pp. 2768-2781.
IEEE DOI Code:
WWW Link. 2503
Visualization, Semantics, Computational modeling, Transformers, Feature extraction, Object detection, multimodal alignment BibRef

Li, B.Z.[Bin-Zhe], Wang, S.R.[Shu-Run], Wang, S.Q.[Shi-Qi], Ye, Y.[Yan],
High Efficiency Image Compression for Large Visual-Language Models,
CirSysVideo(35), No. 3, March 2025, pp. 2870-2880.
IEEE DOI 2503
Image coding, Visualization, Machine vision, Codecs, Semantics, Standards, Image reconstruction, Bit rate, pre-editing process BibRef

Liu, L.C.[Liang-Chen], Wang, N.N.[Nan-Nan], Zhou, D.W.[Da-Wei], Liu, D.C.[De-Cheng], Yang, X.[Xi], Gao, X.B.[Xin-Bo], Liu, T.L.[Tong-Liang],
Generalizable Prompt Learning via Gradient Constrained Sharpness-Aware Minimization,
MultMed(27), 2025, pp. 1100-1113.
IEEE DOI 2503
Improving the performance on unseen classes while maintaining the performance on seen classes. Optimization, Minimization, Visualization, Training, Degradation, Vectors, Telecommunications, Intserv networks, Geometry, sharpness-aware minimization BibRef

Lu, Z.[Zhihe], Bai, J.[Jiawang], Li, X.[Xin], Xiao, Z.[Zeyu], Wang, X.C.[Xin-Chao],
Task-to-Instance Prompt Learning for Vision-Language Models at Test Time,
IP(34), 2025, pp. 1908-1920.
IEEE DOI Code:
WWW Link. 2504
Training, Training data, Visualization, Adaptation models, Learning systems, Image recognition, Dogs, Vectors, Entropy, task-to-instance BibRef

Fang, Z.Q.[Zheng-Qing], Yuan, Z.H.[Zhou-Hang], Li, Z.Y.[Zi-Yu], Chen, J.Y.[Jing-Yuan], Kuang, K.[Kun], Yao, Y.F.[Yu-Feng], Wu, F.[Fei],
Cross-Modality Image Interpretation via Concept Decomposition Vector of Visual-Language Models,
CirSysVideo(35), No. 4, April 2025, pp. 3024-3038.
IEEE DOI 2504
Visualization, Vectors, Semantics, Training, Image representation, Task analysis, visual-language models BibRef

Ramzi, E.[Elias], Audebert, N.[Nicolas], Rambour, C.[Clément], Araujo, A.[André], Bitot, X.[Xavier], Thome, N.[Nicolas],
Optimization of Rank Losses for Image Retrieval,
PAMI(47), No. 6, June 2025, pp. 4317-4329.
IEEE DOI 2505
Training, Image retrieval, Measurement, Standards, Data mining, Artificial intelligence, Loss measurement, non-decomposable BibRef

Lafon, M.[Marc], Ramzi, E.[Elias], Rambour, C.[Clément], Audebert, N.[Nicolas], Thome, N.[Nicolas],
Gallop: Learning Global and Local Prompts for Vision-language Models,
ECCV24(LXI: 264-282).
Springer DOI 2412
BibRef

Liu, K.C.[Kang-Cheng], Wang, C.Q.[Chao-Qun], Han, X.D.[Xiao-Dong], Liu, Y.J.[Yong-Jin], Chen, B.Q.[Bao-Quan],
Generalized Robot Vision-Language Model via Linguistic Foreground-Aware Contrast,
IJCV(133), No. 6, June 2025, pp. 3481-3518.
Springer DOI 2505
BibRef
And: Correction: IJCV(133), No. 7, July 2025, pp. 4971-4971.
Springer DOI 2506
BibRef

Yang, L.X.[Ling-Xiao], Zhang, R.Y.[Ru-Yuan], Chen, Q.[Qi], Xie, X.H.[Xiao-Hua],
Learning with Enriched Inductive Biases for Vision-Language Models,
IJCV(133), No. 6, June 2025, pp. 3746-3761.
Springer DOI 2505
BibRef

Yao, H.T.[Han-Tao], Zhang, R.[Rui], Lyu, H.H.[Huai-Hai], Zhang, Y.D.[Yong-Dong], Xu, C.S.[Chang-Sheng],
Bi-Modality Individual-Aware Prompt Tuning for Visual-Language Model,
PAMI(47), No. 8, August 2025, pp. 6352-6368.
IEEE DOI 2507
BibRef
Earlier: A1, A2, A5, Only:
TCP: Textual-Based Class-Aware Prompt Tuning for Visual-Language Model,
CVPR24(23438-23448)
IEEE DOI Code:
WWW Link. 2410
Tuning, Visualization, Training, Adaptation models, Hands, Feature extraction, Data models, Artificial intelligence, visual-language model. Benchmark testing. BibRef

Hao, Z.W.[Zhi-Wei], Guo, J.Y.[Jian-Yuan], Shen, L.[Li], Luo, Y.[Yong], Hu, H.[Han], Wen, Y.G.[Yong-Gang],
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning,
IJCV(133), No. 8, August 2025, pp. 5527-5543.
Springer DOI 2508
BibRef

Zeng, R.F.[Rong-Fei], Yang, Z.P.[Zhi-Peng], Yu, R.Y.[Rui-Yun], Zhang, Y.G.[Yong-Gang],
Supplementary Prompt Learning for Vision-Language Models,
IJCV(133), No. 8, August 2025, pp. 5822-5839.
Springer DOI 2508
BibRef

Liu, K.C.[Kang-Cheng], Liu, Y.J.[Yong-Jin], Chen, B.Q.[Bao-Quan],
General 3D Vision-Language Model With Fast Rendering and Pre-Training Vision-Language Alignment,
PAMI(47), No. 9, September 2025, pp. 7352-7368.
IEEE DOI 2508
Point cloud compression, Semantics, Training, Solid modeling, Contrastive learning, Data mining, Visualization, 3D vision-language model BibRef

Gao, Y.S.[Yan-Sheng], Zhu, Z.X.[Zi-Xi], Wang, S.S.[Sheng-Sheng],
Mixture of coarse and fine-grained prompt tuning for vision-language model,
PR(170), 2026, pp. 112074.
Elsevier DOI 2509
Prompt learning, Vision-language models, Coarse domain-shared information, BibRef

Hao, F.S.[Fu-Sheng], Liu, L.[Liu], Wu, F.X.[Fu-Xiang], Zhang, Q.S.[Qie-Shi], Cheng, J.[Jun],
Textual Embeddings are Good Class-Aware Visual Prompts for Adapting Vision-Language Models,
SPLetters(32), 2025, pp. 2992-2996.
IEEE DOI 2509
Visualization, Tuning, Semantics, Harmonic analysis, Accuracy, Optimization, Artificial intelligence, Vectors, Training, TV, class-aware visual prompts BibRef

Liu, J.[Jun], Lu, Z.Q.[Zi-Qian], Luo, H.[Hao], Lu, Z.M.[Zhe-Ming], Zheng, Y.M.[Yang-Ming],
Progressive Multi-Prompt Learning for Vision-Language Models,
CirSysVideo(35), No. 10, October 2025, pp. 9562-9574.
IEEE DOI Code:
WWW Link. 2510
Visualization, Overfitting, Optimization, Training, Semantics, Feature extraction, Dogs, Accuracy, Tuning, Transfer learning, few-shot BibRef

Wang, W.X.[Wen-Xuan], He, X.J.[Xing-Jian], Zhang, Y.[Yisi], Guo, L.T.[Long-Teng], Shen, J.C.[Jia-Chen], Li, J.Y.[Jiang-Yun], Liu, J.[Jing],
CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation,
MultMed(26), 2024, pp. 6906-6916.
IEEE DOI 2405
Image segmentation, Visualization, Task analysis, Correlation, Feature extraction, Transformers, Semantics, vision and language BibRef

Zhang, E.[Enming], Zhu, B.[Bingke], Chen, Y.Y.[Ying-Ying], Miao, Q.H.[Qing-Hai], Tang, M.[Ming], Wang, J.Q.[Jin-Qiao],
Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models,
MultMed(27), 2025, pp. 7557-7569.
IEEE DOI 2510
Visualization, Tuning, Training, Birds, Semantics, Image recognition, Artificial intelligence, Airplanes, Marine vehicles, multi-knowledge BibRef

Park, K.Y.[Kwan-Yong], An, S.[Sojung], Lee, Y.J.[Yong Jae], Kim, D.H.[Dong-Hyun],
Learning Compositionality from Multifaceted Synthetic Data for Language-based Object Detection,
IJCV(133), No. 11, November 2025, pp. 7873-7896.
Springer DOI 2511
BibRef

Park, K.Y.[Kwan-Yong], Saito, K.[Kuniaki], Kim, D.H.[Dong-Hyun],
Weak-to-strong Compositional Learning from Generative Models for Language-based Object Detection,
ECCV24(XXIII: 1-19).
Springer DOI 2412
BibRef

Sarto, S.[Sara], Moratelli, N.[Nicholas], Cornia, M.[Marcella], Baraldi, L.[Lorenzo], Cucchiara, R.[Rita],
Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training,
IJCV(133), No. 11, November 2025, pp. 7647-7671.
Springer DOI 2511
BibRef

Stefanini, M.[Matteo], Cornia, M.[Marcella], Baraldi, L.[Lorenzo], Cucchiara, R.[Rita],
A Novel Attention-based Aggregation Function to Combine Vision and Language,
ICPR21(1212-1219)
IEEE DOI 2105
Deep learning, Visualization, Image retrieval, Transforms, Knowledge discovery BibRef

Liu, L.C.[Liang-Chen], Wang, N.N.[Nan-Nan], Chen, C.[Chen], Liu, D.[Decheng], Yang, X.[Xi], Gao, X.B.[Xin-Bo], Liu, T.L.[Tong-Liang],
Frequency-Based Comprehensive Prompt Learning for Vision-Language Models,
PAMI(47), No. 12, December 2025, pp. 11974-11989.
IEEE DOI 2511
Visualization, Feature extraction, Frequency-domain analysis, Transformers, Discrete cosine transforms, Frequency diversity, transfer learning BibRef

Li, J.C.[Jun-Cheng], Gao, M.[Minghe], Tang, S.L.[Si-Liang], Wei, L.H.[Long-Hui], Xiao, J.[Jun], Wu, F.[Fei], Hong, R.C.[Ri-Chang], Wang, M.[Meng], Tian, Q.[Qi],
Structure-Induced Gradient Regulation for Generalizable Vision-Language Models,
PAMI(48), No. 1, January 2026, pp. 219-235.
IEEE DOI 2512
Tuning, Metalearning, Adaptation models, Training, Semantics, Testing, Visualization, Prototypes, Vectors, Overfitting, vision-language pre-training models BibRef

Li, J.C.[Jun-Cheng], Gao, M.[Minghe], Wei, L.H.[Long-Hui], Tang, S.L.[Si-Liang], Zhang, W.Q.[Wen-Qiao], Li, M.Z.[Meng-Ze], Ji, W.[Wei], Tian, Q.[Qi], Chua, T.S.[Tat-Seng], Zhuang, Y.T.[Yue-Ting],
Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models,
ICCV23(2551-2562)
IEEE DOI 2401
BibRef

Xiao, Y.S.[Yi-Song], Liu, X.L.[Xiang-Long], Cheng, Q.J.[Qian-Jia], Yin, Z.F.[Zhen-Fei], Liang, S.Y.[Si-Yuan], Li, J.P.[Jia-Peng], Shao, J.[Jing], Liu, A.S.[Ai-Shan], Tao, D.C.[Da-Cheng],
GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing,
IJCV(133), No. 12, December 2025, pp. 8332-8355.
Springer DOI 2512
BibRef

Chen, T.Y.[Tian-Yang], Ai, J.L.[Jian-Liang],
Hierarchical Prompt Engineering for Remote Sensing Scene Understanding with Large Vision-Language Models,
RS(17), No. 22, 2025, pp. 3727.
DOI Link 2512
BibRef

Xu, X.[Xiao], Qin, L.[Libo], Che, W.[Wanxiang], Kan, M.Y.[Min-Yen],
Manager: Aggregating Insights From Unimodal Experts in Two-Tower VLMs and MLLMs,
CirSysVideo(35), No. 12, December 2025, pp. 12278-12291.
IEEE DOI Code:
WWW Link. 2512
Visualization, Aggregates, Semantics, Meters, Feeds, Indexes, Large language models, Image resolution, Vision-Language model, representation learning BibRef

Kim, G.[Gahyeon], Kim, S.[Sohee], Lee, S.[Seokju],
Decoupling augmentation bias in prompt learning for vision-language models,
PR(172), 2026, pp. 112630.
Elsevier DOI Code:
WWW Link. 2601
BibRef
Earlier:
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models,
Prompting24(1572-1582)
IEEE DOI 2410
Prompt learning, Vision-language models, Image augmentation, Adversarial learning loss, Few-shot classification, Domain generalization. Visualization, Zero-shot learning, Semantics, Focusing, Feature extraction, Data augmentation, Vectors, VLMs BibRef

Guo, Y.C.[Yun-Cheng], Gu, X.D.[Xiao-Dong],
MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models,
IJCV(134), No. 1, January 2026, pp. 11.
Springer DOI 2601
BibRef
Earlier:
MMRL: Multi-Modal Representation Learning for Vision-Language Models,
CVPR25(25015-25025)
IEEE DOI Code:
WWW Link. 2508
Representation learning, Training, Adaptation models, Codes, Transfer learning, Image representation, Data models, Overfitting BibRef

Ye, W.X.[Wei-Xin], Wang, W.[Wei], Liu, Y.H.[Ya-Hui], Song, Y.[Yue], Ren, B.[Bin], Bi, W.[Wei], Cucchiara, R.[Rita], Sebe, N.[Nicu],
A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models,
PAMI(48), No. 2, February 2026, pp. 1873-1887.
IEEE DOI 2601
Transformers, Privacy, Natural language processing, Principal component analysis, Computational modeling, Training, position embedding BibRef

Wang, Z.Y.[Zi-Yan], Liu, L.[Lei], Wan, G.[Gang], Lu, Y.C.[Yu-Chen], Zheng, F.J.[Feng-Jie], Sun, G.[Guangde], Huang, Y.X.[Yi-Xiang], Guo, S.H.[Shi-Hao], Li, X.[Xinyi], Yuan, L.[Liang],
SAREval: A Multi-Dimensional and Multi-Task Benchmark for Evaluating Visual Language Models on SAR Image Understanding,
RS(18), No. 1, 2026, pp. 82.
DOI Link 2601
BibRef

Wu, J.F.[Jun-Feng], Jiang, Y.[Yi], Ma, C.F.[Chuo-Fan], Liu, Y.L.[Yu-Liang], Zhao, H.S.[Heng-Shuang], Yuan, Z.H.[Ze-Huan], Bai, S.[Song], Bai, X.[Xiang],
Liquid: Language Models are Scalable and Unified Multi-Modal Generators,
IJCV(134), No. 1, January 2026, pp. 39.
Springer DOI 2601
Code:
WWW Link. BibRef

Su, Y.L.[Yu-Ling], Liu, X.L.[Xue-Liang], Huang, Z.[Zhen], Zhao, Y.W.[Yun-Wei], Hong, R.C.[Ri-Chang], Wang, M.[Meng],
AttriPrompt: Class Attribute-Aware Prompt Tuning for Vision-Language Model,
IP(35), 2026, pp. 1395-1407.
IEEE DOI 2602
Tuning, Semantics, Visualization, Adaptation models, Head, Legged locomotion, Data models, Training, Standards, Prompt tuning, vision-language models BibRef

Li, Y.W.[Yan-Wei], Zhang, Y.C.[Yue-Chen], Wang, C.Y.[Cheng-Yao], Zhong, Z.S.[Zhi-Sheng], Chen, Y.X.[Yi-Xin], Chu, R.[Ruihang], Liu, S.[Shaoteng], Jia, J.Y.[Jia-Ya],
Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models,
PAMI(48), No. 3, March 2026, pp. 3530-3543.
IEEE DOI 2602
Visualization, Cognition, Benchmark testing, Data models, Image resolution, Training, TV, Generative Pre-trained transformer, Generative Model BibRef

Xu, N.[Nuo], Yao, K.[Kelu], Yang, R.[Rong], Li, C.[Chao],
Visual-language active search for wide-area remote sensing imagery,
PR(175), 2026, pp. 113106.
Elsevier DOI Code:
WWW Link. 2603
Active search, Multimodality, Reinforcement learning, Graph neural network BibRef

Chen, Y.[Yang], Fu, S.[Shuai], Zhang, Y.[Yu],
MoPD: Mixture-of-Prompts Distillation for Vision-Language Models,
MultMed(28), 2026, pp. 1943-1954.
IEEE DOI 2603
Visualization, Vectors, Learning systems, Training data, Adaptation models, Noise measurement, Knowledge engineering, vision-language models BibRef

Qi, Y.[Yayun], Li, H.X.[Hong-Xi], Song, Y.Q.[Yi-Qi], Wu, X.X.[Xin-Xiao], Luo, J.B.[Jie-Bo],
How Vision-Language Tasks Benefit From Large Pre-Trained Models: A Survey,
MultMed(28), 2026, pp. 1188-1210.
IEEE DOI 2603
Surveys, Visualization, Cognition, Data models, Videos, Training data, Question answering (information retrieval), large language model BibRef

Lee, J.J.[Jae Joong],
Language-guided invariance probing of vision-language models,
PRL(202), 2026, pp. 108-113.
Elsevier DOI 2603
Vision-language models, Prompt robustness, Paraphrase invariance, Semantic sensitivity, Hard negatives, Image-text similarity BibRef
Lin, X.[Xiang], Li, W.X.[Wei-Xin], Guo, S.[Shu], Wang, L.H.[Li-Hong], Huang, D.[Di],
GIP: Gated Interaction Prompt for Parameter Efficient Vision-Language Fine-Tuning,
ICIP25(617-622)
IEEE DOI 2601
Bridges, Visualization, Adaptation models, Computational modeling, Logic gates, Performance gain, Gating Mechanism BibRef

Valois, P.H.V.[Pedro H. V.], Satav, D.[Dipesh], de Campos, R.A.P.[Rodrigo A. P.], Pratamasunu, G.Q.O.[Gulpi Q. O.], Fukui, K.[Kazuhiro],
Vision Language Model Interpretability with Concept Guided Decoding,
ICIP25(397-402)
IEEE DOI 2601
Deep learning, Training, Visualization, Adaptation models, Analytical models, Toxicology, Systematics, Decoding, Security, Jailbreak BibRef

Saravanan, D.[Darshana], Tapaswi, M.[Makarand], Gandhi, V.[Vineet],
Investigating Mechanisms for In-Context Vision Language Binding,
InterpVis25(4852-4856)
IEEE DOI 2512
Solid modeling, Shape, Computational modeling, Toy manufacturing industry, Vectors, Synthetic data, VLM, In-context Binding BibRef

Selvam, S.[Surya], Rajendran, R.K.[Ravi K.], Sankaradas, M.[Murugan], Raghunathan, A.[Anand], Chakradhar, S.T.[Srimat T.],
SimCache: Similarity Caching for Efficient VLM-based Scene Understanding,
LargeVM25(3318-3327)
IEEE DOI 2512
Training, Visualization, Accuracy, Redundancy, Semantics, Memory management, Throughput, Real-time systems, Videos BibRef

Tushar, P.[Pranav], Pandey, E.[Eshan], Austria, L.D.B.[Lyka Diane Bala], Loo, Y.Y.[Yin Yin], Lim, J.H.[Jing Hao], Atmosukarto, I.[Indriyati], Lock, D.S.C.[Donny Soh Cheng],
MerCulture: A Comprehensive Benchmark to Evaluate Vision-Language Models on Cultural Understanding in Singapore,
AIBench25(565-574)
IEEE DOI 2512
Measurement, Visualization, Grounding, Education, Training data, Benchmark testing, Multilingual, Cultural differences, application BibRef

Ma, Z.Y.[Zi-Yu], Gou, C.[Chenhui], Shi, H.[Hengcan], Sun, B.[Bin], Li, S.T.[Shu-Tao], Rezatofighi, H.[Hamid], Cai, J.F.[Jian-Fei],
DrVideo: Document Retrieval Based Long Video Understanding,
CVPR25(18936-18946)
IEEE DOI Code:
WWW Link. 2508
Codes, Large language models, Transforms, Benchmark testing, Cognition, Iterative methods, Videos, long video understanding, vision and language BibRef

Dhouib, M.[Mohamed], Buscaldi, D.[Davide], Vanier, S.[Sonia], Shabou, A.[Aymen],
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models,
CVPR25(14582-14592)
IEEE DOI 2508
Connectors, Training, Measurement, Visualization, Computational modeling, Redundancy, Merging, Oral communication BibRef

Yu, C.[Chong], Chen, T.[Tao], Gan, Z.X.[Zhong-Xue],
Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants,
CVPR25(14712-14722)
IEEE DOI 2508
Training, Adaptation models, Accuracy, Tensors, Memory management, Hardware, Model compression, Tuning, Optimization, dynamic expansion capability BibRef

Hao, F.S.[Fu-Sheng], He, F.X.[Feng-Xiang], Wu, F.[Fuxiang], Wang, T.[Tichao], Song, C.Q.[Cheng-Qun], Cheng, J.[Jun],
Task-Aware Clustering for Prompting Vision-Language Models,
CVPR25(14745-14755)
IEEE DOI Code:
WWW Link. 2508
Adaptation models, Visualization, Attention mechanisms, Codes, Interference, Benchmark testing, Optimization, Overfitting BibRef

Koleilat, T.[Taha], Asgariandehkordi, H.[Hojat], Rivaz, H.[Hassan], Xiao, Y.M.[Yi-Ming],
BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models,
CVPR25(14766-14776)
IEEE DOI Code:
WWW Link. 2508
Representation learning, Adaptation models, Visualization, Accuracy, Biological system modeling, Semantics, vision-language models BibRef

Nath, V.[Vishwesh], Li, W.Q.[Wen-Qi], Yang, D.[Dong], Myronenko, A.[Andriy], Zheng, M.X.[Ming-Xin], Lu, Y.[Yao], Liu, Z.J.[Zhi-Jian], Yin, H.X.[Hong-Xu], Law, Y.M.[Yee Man], Tang, Y.C.[Yu-Cheng], Guo, P.F.[Peng-Fei], Zhao, C.[Can], Xu, Z.Y.[Zi-Yue], He, Y.F.[Yu-Fan], Harmon, S.[Stephanie], Simon, B.[Benjamin], Heinrich, G.[Greg], Aylward, S.[Stephen], Edgar, M.[Marc], Zephyr, M.[Michael], Molchanov, P.[Pavlo], Turkbey, B.[Baris], Roth, H.[Holger], Xu, D.[Daguang],
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge,
CVPR25(14788-14798)
IEEE DOI 2508
Deep learning, Computational modeling, Medical services, Feature extraction, Data models, Reliability, Tumors, radiology BibRef

Du, H.[Hao], Wu, B.[Bo], Lu, Y.[Yan], Mao, Z.D.[Zhen-Dong],
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation,
CVPR25(13798-13809)
IEEE DOI 2508
Measurement, Visualization, Filtering, Statistical analysis, Pipelines, Benchmark testing, Videos BibRef

Kaduri, O.[Omri], Bagon, S.[Shai], Dekel, T.[Tali],
What's in the Image? A Deep-Dive into the Vision of Vision Language Models,
CVPR25(14549-14558)
IEEE DOI 2508
Visualization, Analytical models, Image coding, Focusing, Data models, Data mining, Videos BibRef

Xing, L.[Long], Huang, Q.D.[Qi-Dong], Dong, X.Y.[Xiao-Yi], Lu, J.J.[Jia-Jie], Zhang, P.[Pan], Zang, Y.H.[Yu-Hang], Cao, Y.H.[Yu-Hang], He, C.H.[Cong-Hui], Wang, J.Q.[Jia-Qi], Wu, F.[Feng], Lin, D.[Dahua],
Conical Visual Concentration for Efficient Large Vision-Language Models,
CVPR25(14593-14603)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Costs, Codes, Redundancy, Boosting, large vision language model, efficient training, efficient inference BibRef

Zhang, L.[Le], Yang, Q.[Qian], Agrawal, A.[Aishwarya],
Assessing and Learning Alignment of Unimodal Vision and Language Models,
CVPR25(14604-14614)
IEEE DOI 2508
Training, Translation, Computational modeling, Semantic segmentation, Transfer learning, Object recognition BibRef

Sehgal, A.[Atharva], Yuan, P.[Patrick], Hu, Z.[Ziniu], Yue, Y.S.[Yi-Song], Sun, J.J.[Jennifer J.], Chaudhuri, S.[Swarat],
Self-Evolving Visual Concept Library using Vision-Language Critics,
CVPR25(13124-13134)
IEEE DOI 2508
Visualization, Annotations, Buildings, Manuals, Libraries, Cognition, History, Few shot learning, program synthesis, visual programming, library learning BibRef

Wang, W.H.[Wei-Han], Wang, L.[Lefan], Gu, X.T.[Xiao-Tao], Huang, S.Y.[Shi-Yu], Dong, Y.X.[Yu-Xiao], Tang, J.[Jie],
MotionBench: Benchmarking and Improving Fine-Grained Video Motion Understanding for Vision Language Models,
CVPR25(8450-8460)
IEEE DOI Code:
WWW Link. 2508
Visualization, Benchmark testing, Data models, Videos, vision language model, fine-grained video motion understanding, benchmark BibRef

Nacson, M.S.[Mor Shpigel], Aberdam, A.[Aviad], Ganz, R.[Roy], Avraham, E.B.[Elad Ben], Golts, A.[Alona], Kittenplon, Y.[Yair], Mazor, S.[Shai], Litman, R.[Ron],
DocVLM: Make Your VLM an Efficient Reader,
CVPR25(29005-29015)
IEEE DOI 2508
Visualization, Image coding, Computational modeling, Optical character recognition, Layout, Computational efficiency, Text processing BibRef

Alhamoud, K.[Kumail], Alshammari, S.[Shaden], Tian, Y.L.[Yong-Long], Li, G.H.[Guo-Hao], Torr, P.H.S.[Philip H.S.], Kim, Y.[Yoon], Ghassemi, M.[Marzyeh],
Vision-Language Models Do Not Understand Negation,
CVPR25(29612-29622)
IEEE DOI 2508
Training, Accuracy, Computational modeling, Natural languages, Benchmark testing, Videos, Synthetic data, Biomedical imaging, benchmarks BibRef

Schmalfuss, J.[Jenny], Chang, N.[Nadine], VS, V.[Vibashan], Shen, M.[Maying], Bruhn, A.[Andrés], Alvarez, J.M.[Jose M.],
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models,
CVPR25(25081-25091)
IEEE DOI Code:
WWW Link. 2508
Visualization, Analytical models, Sensitivity, Sensitivity analysis, Computational modeling, Semantics, prompt sensitivity BibRef

Xiao, J.Q.[Jin-Qi], Sang, S.[Shen], Zhi, T.C.[Tian-Cheng], Liu, J.[Jing], Yan, Q.[Qing], Luo, L.J.[Lin-Jie], Yuan, B.[Bo],
COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection,
CVPR25(30116-30126)
IEEE DOI Code:
WWW Link. 2508
Training, Degradation, Quantization (signal), Computational modeling, Neural networks, Flora, vision language model BibRef

Zhu, Y.Q.[Yi-Qi], Wang, Z.Y.[Zi-Yue], Zhang, C.[Can], Li, P.[Peng], Liu, Y.[Yang],
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models,
CVPR25(29569-29579)
IEEE DOI 2508
Visualization, Analytical models, Accuracy, Computational modeling, Benchmark testing, Cognition, Image reconstruction, continuous space perception BibRef

Kang, H.Q.[Hao-Qiang], Sachdeva, E.[Enna], Gupta, P.[Piyush], Bae, S.J.[Sang-Jae], Lee, K.[Kwonjoon],
GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks,
CVPR25(3815-3825)
IEEE DOI Code:
WWW Link. 2508
Training, Decision making, Distributed databases, Reinforcement learning, Games, Cognition, Planning, Optimization, gflownets BibRef

Chen, J.H.[Jiu-Hai], Yang, J.W.[Jian-Wei], Wu, H.P.[Hai-Ping], Li, D.[Dianqi], Gao, J.F.[Jian-Feng], Zhou, T.Y.[Tian-Yi], Xiao, B.[Bin],
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion,
CVPR25(24928-24938)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Statistical analysis, Computational modeling, Optical character recognition, Tuning BibRef

Yang, C.Y.[Chen-Yu], Dong, X.[Xuan], Zhu, X.Z.[Xi-Zhou], Su, W.J.[Wei-Jie], Wang, J.H.[Jia-Hao], Tian, H.[Hao], Chen, Z.[Zhe], Wang, W.H.[Wen-Hai], Lu, L.W.[Le-Wei], Dai, J.F.[Ji-Feng],
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models,
CVPR25(24939-24949)
IEEE DOI Code:
WWW Link. 2508
Visualization, Adaptation models, Image coding, Limiting, Redundancy, Benchmark testing, Encoding, Data mining, Videos BibRef

Zhang, K.[Kun], Li, J.Y.[Jing-Yu], Li, Z.[Zhe], Zhou, S.K.[S. Kevin],
DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning,
CVPR25(24993-25003)
IEEE DOI 2508
Accuracy, Computational modeling, Semantics, Benchmark testing, Computational efficiency, Complexity theory, set-embeddings learning BibRef

Zhu, B.[Beier], Cui, J.[Jiequan], Zhang, H.W.[Han-Wang], Zhang, C.[Chi],
Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness,
CVPR25(25487-25496)
IEEE DOI 2508
Training, Correlation, Foundation models, Null space, Robustness, Probes, Faces, group robustness, vision-language models BibRef

Li, H.Y.[Hao-Yang], Wang, L.[Liang], Wang, C.[Chao], Jiang, J.[Jing], Peng, Y.[Yan], Long, G.D.[Guo-Dong],
DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models,
CVPR25(25623-25632)
IEEE DOI Code:
WWW Link. 2508
Codes, Semantic segmentation, Collaboration, Cloning, Object detection, Vectors, Optimization, Tuning, prompt tuning, multi-modal learning BibRef

Saravanan, D.[Darshana], Gupta, V.[Varun], Singh, D.[Darshan], Khan, Z.[Zeeshan], Gandhi, V.[Vineet], Tapaswi, M.[Makarand],
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment,
CVPR25(18914-18924)
IEEE DOI 2508
Visualization, Accuracy, Benchmark testing, Cognition, Videos, video language benchmark BibRef

Pan, B.[Bikang], Li, Q.[Qun], Tang, X.Y.[Xiao-Ying], Huang, W.[Wei], Fang, Z.[Zhen], Liu, F.[Feng], Wang, J.Y.[Jing-Ya], Yu, J.Y.[Jing-Yi], Shi, Y.[Ye],
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models,
CVPR25(19963-19973)
IEEE DOI 2508
Representation learning, Accuracy, Purification, Foundation models, Transportation, Prototypes, Robustness, Noise measurement, Signal to noise ratio BibRef

Zhang, Y.T.[Yong-Ting], Chen, L.[Lu], Zheng, G.D.[Guo-Dong], Gao, Y.F.[Yi-Feng], Zheng, R.[Rui], Fu, J.[Jinlan], Yin, Z.F.[Zhen-Fei], Jin, S.[Senjie], Qiao, Y.[Yu], Huang, X.J.[Xuan-Jing], Zhao, F.[Feng], Gui, T.[Tao], Shao, J.[Jing],
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models,
CVPR25(19867-19878)
IEEE DOI 2508
Visualization, Computational modeling, Semantics, Data models, Safety BibRef

Bhattacharjee, S.S.[Subhransu S.], Campbell, D.[Dylan], Shome, R.[Rahul],
Believing is Seeing: Unobserved Object Detection using Generative Models,
CVPR25(19366-19377)
IEEE DOI 2508
Measurement, Training, Solid modeling, Adaptation models, Visualization, Pipelines, Object detection, Diffusion models, vision-language models BibRef

Zhou, E.[Enshen], Su, Q.[Qi], Chi, C.[Cheng], Zhang, Z.Z.[Zhi-Zheng], Wang, Z.Y.[Zhong-Yuan], Huang, T.J.[Tie-Jun], Sheng, L.[Lu], Wang, H.[He],
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection,
CVPR25(6919-6929)
IEEE DOI Code:
WWW Link. 2508
Visualization, Codes, Accuracy, Prevention and mitigation, Programming, Real-time systems, Closed loop systems, Monitoring, vision-language model BibRef

Zhou, W.J.[Wei-Jie], Tao, M.[Manli], Zhao, C.Y.[Chao-Yang], Guo, H.Y.[Hai-Yun], Dong, H.H.[Hong-Hui], Tang, M.[Ming], Wang, J.Q.[Jin-Qiao],
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability,
CVPR25(6940-6949)
IEEE DOI 2508
Visualization, Adaptation models, Service robots, Decision making, Benchmark testing, Cognition, Reliability, Robots, embodied ai, , embodied visual reasoning BibRef

Song, C.H.[Chan Hee], Blukis, V.[Valts], Tremblay, J.[Jonathan], Tyree, S.[Stephen], Su, Y.[Yu], Birchfield, S.[Stan],
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics,
CVPR25(15768-15780)
IEEE DOI 2508
Training, Solid modeling, Soft sensors, Pipelines, Training data, Predictive models, Spatial databases, Cognition, Robots, robot perception BibRef

Lozano, A.[Alejandro], Sun, M.W.[Min Woo], Burgess, J.[James], Chen, L.[Liangyu], Nirschl, J.J.[Jeffrey J.], Gu, J.[Jeffrey], Lopez, I.[Ivan], Aklilu, J.[Josiah], Rau, A.[Anita], Katzer, A.W.[Austin Wolfgang], Zhang, Y.H.[Yu-Hui], Chiu, C.[Collin], Wang, X.H.[Xiao-Han], Song, A.S.[Alfred Seunghoon], Tibshirani, R.[Robert], Yeung-Levy, S.[Serena],
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature,
CVPR25(19724-19735)
IEEE DOI 2508
Annotations, Biological system modeling, Computational modeling, Dermatology, Surgery, Streaming media, Radiology, biomedical foundation models BibRef

Xiao, R.[Rui], Kim, S.[Sanghwan], Georgescu, M.I.[Mariana-Iuliana], Akata, Z.[Zeynep], Alaniz, S.[Stephan],
FLAIR: VLM with Fine-grained Language-informed Image Representations,
CVPR25(24884-24894)
IEEE DOI Code:
WWW Link. 2508
Visualization, Codes, Semantic segmentation, Computational modeling, Image representation, Benchmark testing, multimodal learning BibRef

Wang, X.[Xin], Chen, K.[Kai], Zhang, J.M.[Jia-Ming], Chen, J.J.[Jing-Jing], Ma, X.[Xingjun],
TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models,
CVPR25(19910-19920)
IEEE DOI Code:
WWW Link. 2508
Visualization, Accuracy, Scalability, Perturbation methods, Benchmark testing, Robustness, Entropy, Safety, Tuning, test-time adversarial prompt tuning BibRef

Vasu, P.K.A.[Pavan Kumar Anasosalu], Faghri, F.[Fartash], Li, C.L.[Chun-Liang], Koc, C.[Cem], True, N.[Nate], Antony, A.[Albert], Santhanam, G.[Gokul], Gabriel, J.[James], Grasch, P.[Peter], Tuzel, O.[Oncel], Pouransari, H.[Hadi],
FastVLM: Efficient Vision Encoding for Vision Language Models,
CVPR25(19769-19780)
IEEE DOI Code:
WWW Link. 2508
Visualization, Image resolution, Accuracy, Image coding, Codes, Benchmark testing, Encoding, vision-language models, efficiency BibRef

Chen, Q.Z.[Qi-Zhou], Wang, C.[Chengyu], Wang, D.[Dakan], Zhang, T.[Taolin], Li, W.[Wangyue], He, X.F.[Xiao-Feng],
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts,
CVPR25(9455-9466)
IEEE DOI 2508
Training, Visualization, Filtering, Large language models, Semantics, Benchmark testing, Routing, Generators, Robustness, model editing, mixture of expert BibRef

Chen, T.Y.[Tian-Yu], Fu, X.C.[Xing-Cheng], Gao, Y.[Yisen], Qian, H.D.[Hao-Dong], Wei, Y.[Yuecen], Yan, K.[Kun], Zhou, H.Y.[Hao-Yi], Li, J.X.[Jian-Xin],
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding,
CVPR25(4112-4121)
IEEE DOI 2508
Space vehicles, Geometry, Training, Adaptation models, Extraterrestrial phenomena, Estimation, Stars, Vectors, multi-modal learning BibRef

Liu, Z.J.[Zhi-Jian], Zhu, L.[Ligeng], Shi, B.[Baifeng], Zhang, Z.Y.[Zhuo-Yang], Lou, Y.M.[Yu-Ming], Yang, S.[Shang], Xi, H.C.[Hao-Cheng], Cao, S.Y.[Shi-Yi], Gu, Y.X.[Yu-Xian], Li, D.C.[Da-Cheng], Li, X.[Xiuyu], Tang, H.T.[Hao-Tian], Fang, Y.H.[Yun-Hao], Chen, Y.[Yukang], Hsieh, C.Y.[Cheng-Yu], Huang, D.A.[De-An], Cheng, A.C.[An-Chieh], Hu, J.Y.[Jin-Yi], Liu, S.[Sifei], Krishna, R.[Ranjay], Molchanov, P.[Pavlo], Kautz, J.[Jan], Yin, H.X.[Hong-Xu], Han, S.[Song], Lu, Y.[Yao],
NVILA: Efficient Frontier Visual Language Models,
CVPR25(4122-4134)
IEEE DOI 2508
Training, Visualization, Accuracy, Systematics, Image coding, Costs, Decoding, Spatial resolution, Videos BibRef

Poppi, T.[Tobia], Kasarla, T.[Tejaswi], Mettes, P.[Pascal], Baraldi, L.[Lorenzo], Cucchiara, R.[Rita],
Hyperbolic Safety-Aware Vision-Language Models,
CVPR25(4222-4232)
IEEE DOI Code:
WWW Link. 2508
Adaptation models, Ethics, Law, Source coding, Robustness, Data models, Safety, Standards, trustworthy, safety, nsfw, hyperbolic, vision-and-language BibRef

Zhang, H.Y.[Hao-Yu], Guo, Y.Y.[Yang-Yang], Kankanhalli, M.[Mohan],
Joint Vision-Language Social Bias Removal for CLIP,
CVPR25(4246-4255)
IEEE DOI Code:
WWW Link. 2508
Measurement, Degradation, Protocols, Codes, Prevention and mitigation, Computational modeling, vision-language alignment BibRef

Zhang, Y.[Yi], Deng, Y.X.[Yi-Xuan], Guo, M.H.[Meng-Hao], Hu, S.M.[Shi-Min],
Adaptive Parameter Selection for Tuning Vision-Language Models,
CVPR25(4280-4290)
IEEE DOI 2508
Adaptation models, Adaptive learning, Manuals, Benchmark testing, Performance gain, Flowering plants, Tuning, Overfitting BibRef

Deng, A.[Ailin], Cao, T.[Tri], Chen, Z.[Zhirui], Hooi, B.[Bryan],
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?,
CVPR25(3867-3876)
IEEE DOI 2508
Training, Visualization, Analytical models, Computational modeling, Reliability theory, Robustness, Data models, Safety, bias BibRef

Huang, R.[Runhui], Ding, X.P.[Xin-Peng], Wang, C.W.[Chun-Wei], Han, J.H.[Jian-Hua], Liu, Y.L.[Yu-Long], Zhao, H.S.[Heng-Shuang], Xu, H.[Hang], Hou, L.[Lu], Zhang, W.[Wei], Liang, X.D.[Xiao-Dan],
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models,
CVPR25(29814-29824)
IEEE DOI 2508
Training, Visualization, Costs, Computational modeling, Benchmark testing, Feature extraction, Image restoration, visual token compression BibRef

Wang, S.[Sudong], Zhang, Y.J.[Yun-Jian], Zhu, Y.[Yao], Li, J.N.[Jia-Ning], Wang, Z.Z.[Zi-Zhe], Liu, Y.W.[Yan-Wei], Ji, X.Y.[Xiang-Yang],
Towards Understanding How Knowledge Evolves in Large Vision-Language Models,
CVPR25(29858-29868)
IEEE DOI Code:
WWW Link. 2508
Dimensionality reduction, Codes, Natural languages, Probability distribution, Encoding, Trajectory, Model compression, interpretation BibRef

Deitke, M.[Matt], Clark, C.[Christopher], Lee, S.H.[Sang-Ho], Tripathi, R.[Rohun], Yang, Y.[Yue], Park, J.S.[Jae Sung], Salehi, M.[Mohammadreza], Muennighoff, N.[Niklas], Lo, K.[Kyle], Soldaini, L.[Luca], Lu, J.[Jiasen], Anderson, T.[Taira], Bransom, E.[Erin], Ehsani, K.[Kiana], Ngo, H.[Huong], Chen, Y.[YenSung], Patel, A.[Ajay], Yatskar, M.[Mark], Callison-Burch, C.[Chris], Head, A.[Andrew], Hendrix, R.[Rose], Bastani, F.[Favyen], VanderBilt, E.[Eli], Lambert, N.[Nathan], Chou, Y.[Yvonne], Chheda, A.[Arnavi], Sparks, J.[Jenna], Skjonsberg, S.[Sam], Schmitz, M.[Michael], Sarnat, A.[Aaron], Bischoff, B.[Byron], Walsh, P.[Pete], Newell, C.[Chris], Wolters, P.[Piper], Gupta, T.[Tanmay], Zeng, K.H.[Kuo-Hao], Borchardt, J.[Jon], Groeneveld, D.[Dirk], Nam, C.[Crystal], Lebrecht, S.[Sophie], Wittlif, C.[Caitlin], Schoenick, C.[Carissa], Michel, O.[Oscar], Krishna, R.[Ranjay], Weihs, L.[Luca], Smith, N.A.[Noah A.], Hajishirzi, H.[Hannaneh], Girshick, R.[Ross], Farhadi, A.[Ali], Kembhavi, A.[Aniruddha],
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models,
CVPR25(91-104)
IEEE DOI Code:
WWW Link. 2508
Award, CVPR, Paper HM. Training, Source coding, Computational modeling, Pipelines, Training data, Data models, Open data, Synthetic data, visual instruction tuning BibRef

Zhao, W.[Wangbo], Han, Y.Z.[Yi-Zeng], Tang, J.S.[Jia-Sheng], Li, Z.[Zhikai], Song, Y.B.[Yi-Bing], Wang, K.[Kai], Wang, Z.Y.[Zhang-Yang], You, Y.[Yang],
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs,
CVPR25(19814-19824)
IEEE DOI Code:
WWW Link. 2508
Visualization, Codes, Accuracy, Benchmark testing, Computational efficiency BibRef

Lee, B.K.[Byung-Kwan], Hachiuma, R.[Ryo], Wang, Y.C.F.[Yu-Chiang Frank], Ro, Y.M.[Yong Man], Wu, Y.H.[Yueh-Hua],
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models,
CVPR25(29545-29557)
IEEE DOI 2508
Training, Performance evaluation, Visualization, Computational modeling, Natural languages, Merging, Tuning BibRef

Sun, J.C.[Jing-Chen], Sharma, R.[Rohan], Lokhande, V.S.[Vishnu Suresh], Chen, C.Y.[Chang-You],
Cross-Modal Feature Alignment and MMD Improve Robustness of Prompt Tuning,
WACV25(4714-4724)
IEEE DOI 2505
Training, Adaptation models, Visualization, Codes, Computational modeling, Stochastic processes, Robustness, Tuning, vision-language model BibRef

Safaei, B.[Bardia], Patel, V.M.[Vishal M.],
Active Learning for Vision-Language Models,
WACV25(4902-4912)
IEEE DOI 2505
Training, Bridges, Uncertainty, Computational modeling, Active learning, Measurement uncertainty, Entropy, Reliability, Image classification BibRef

Wang, Y.C.[Yi-Cheng], Zhang, Z.K.[Zhi-Kang], Wang, J.[Jue], Fan, D.[David], Xu, Z.L.[Zhen-Lin], Liu, L.[Linda], Hao, X.[Xiang], Bhat, V.[Vimal], Li, X.Y.[Xin-Yu],
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-Grained Video-Language Learning,
WACV25(4725-4735)
IEEE DOI 2505
Computational modeling, Semantics, Benchmark testing, Data models, Iterative methods, Videos BibRef

Colman, R.[Roman], Vu, M.[Minh], Bhattarai, M.[Manish], Ma, M.[Martin], Viswanathan, H.[Hari], O'Malley, D.[Daniel], Santos, J.E.[Javier E.],
PatchFinder: Leveraging Visual Language Models for Accurate Information Retrieval Using Model Uncertainty,
WACV25(9146-9155)
IEEE DOI 2505
Visualization, Uncertainty, Accuracy, Computational modeling, Software algorithms, Predictive models, Information retrieval, log likelihood BibRef

Jawade, B.[Bhavin], Soares, J.V.B.[João V. B.], Thadani, K.[Kapil], Mohan, D.D.[Deen Dayal], Eshratifar, A.E.[Amir Erfan], Culpepper, B.[Benjamin], de Juan, P.[Paloma], Setlur, S.[Srirangaraj], Govindaraju, V.[Venu],
SCOT: Self-Supervised Contrastive Pretraining for Zero-Shot Compositional Retrieval,
WACV25(5509-5519)
IEEE DOI Code:
WWW Link. 2505
Training, Codes, Large language models, Image retrieval, Benchmark testing, Web search, Standards, zero-shot BibRef

Talemi, N.A.[Niloufar Alipour], Kashiani, H.[Hossein], Afghah, F.[Fatemeh],
Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models,
WACV25(6207-6216)
IEEE DOI 2505
Adaptation models, Image recognition, Computational modeling, Benchmark testing, Data models, Robustness, Overfitting, style shift learning BibRef

Chang, H.S.[Hung-Shuo], Wang, C.Y.[Chien-Yao], Wang, R.R.[Richard Robert], Chou, G.[Gene], Liao, H.Y.M.[Hong-Yuan Mark],
Generalist YOLO: Towards Real-Time End-to-End Multi-Task Visual Language Models,
WACV25(6217-6227)
IEEE DOI Code:
WWW Link. 2505
YOLO, Training, Visualization, Accuracy, Source coding, Semantics, Predictive models, Real-time systems, Decoding, multi-task BibRef

Westfechtel, T.[Thomas], Zhang, D.[Dexuan], Harada, T.[Tatsuya],
Combining Inherent Knowledge of Vision-Language Models with Unsupervised Domain Adaptation Through Strong-Weak Guidance,
WACV25(6528-6537)
IEEE DOI 2505
Adaptation models, Accuracy, Predictive models, Benchmark testing, Prediction algorithms, Labeling BibRef

Chen, H.N.[Han-Ning], Ni, Y.[Yang], Huang, W.J.[Wen-Jun], Liu, Y.[Yezi], Jeong, S.[Sung-Heon], Wen, F.[Fei], Bastian, N.D.[Nathaniel D.], Latapie, H.[Hugo], Imani, M.[Mohsen],
VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation,
WACV25(9353-9363)
IEEE DOI 2505
Uniform resource locators, Image segmentation, Image recognition, Computational modeling, Large language models, Transformers, Load modeling BibRef

Ali, E.[Eman], Silva, S.[Sathira], Khan, M.H.[Muhammad Haris],
DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models,
WACV25(6083-6093)
IEEE DOI 2505
Training, Adaptation models, Visualization, Accuracy, Prototypes, Data models, Noise measurement, Image classification BibRef

Zhang, C.[Ce], Stepputtis, S.[Simon], Sycara, K.[Katia], Xie, Y.Q.[Ya-Qi],
Enhancing Vision-Language Few-Shot Adaptation with Negative Learning,
WACV25(5905-5915)
IEEE DOI Code:
WWW Link. 2505
Adaptation models, Codes, Accuracy, Computational modeling, Noise, Transforms, Computational efficiency, Noise measurement, Few shot learning BibRef

Yamada, M.[Moyuru], Dharamshi, N.[Nimish], Kohli, A.[Ayushi], Kasu, P.[Prasad], Khan, A.[Ainulla], Ghulyani, M.[Manu],
Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection,
WACV25(5751-5760)
IEEE DOI 2505
Head, Computational modeling, Redundancy, Object detection, Network architecture, Predictive models, Decoding, vision-and-language BibRef

Imam, R.[Raza], Gani, H.[Hanan], Huzaifa, M.[Muhammad], Nandakumar, K.[Karthik],
Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models,
WACV25(5449-5459)
IEEE DOI Code:
WWW Link. 2505
Adaptation models, Visualization, Codes, Large language models, Transformers, Entropy, Tuning, Optimization BibRef

Ghoddoosian, R.[Reza], Agarwal, N.[Nakul], Dwivedi, I.[Isht], Dariush, B.[Behzad],
ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos,
WACV25(9521-9531)
IEEE DOI 2505
Training, Visualization, Robustness, Assembly, Videos, Overfitting, zero-shot, action recognition, vlm, vision language model, synonym, text augmentation BibRef

Onoe, Y.[Yasumasa], Rane, S.[Sunayana], Berger, Z.[Zachary], Bitton, Y.[Yonatan], Cho, J.[Jaemin], Garg, R.[Roopal], Ku, A.[Alexander], Parekh, Z.[Zarana], Pont-Tuset, J.[Jordi], Tanzer, G.[Garrett], Wang, S.[Su], Baldridge, J.[Jason],
DOCCI: Descriptions of Connected and Contrasting Images,
ECCV24(LX: 291-309).
Springer DOI 2412
BibRef

Li, T.[Tang], Ma, M.M.[Meng-Meng], Peng, X.[Xi],
DEAL: Disentangle and Localize Concept-level Explanations for VLMs,
ECCV24(XXXIX: 383-401).
Springer DOI 2412
BibRef

Li, S.C.[Shi-Cheng], Li, L.[Lei], Liu, Y.[Yi], Ren, S.H.[Shu-Huai], Liu, Y.X.[Yuan-Xin], Gao, R.D.[Run-Dong], Sun, X.[Xu], Hou, L.[Lu],
Vitatecs: A Diagnostic Dataset for Temporal Concept Understanding of Video-language Models,
ECCV24(LXX: 331-348).
Springer DOI 2412
BibRef

Yang, Y.T.[Yan-Ting], Chen, M.H.[Ming-Hao], Qiu, Q.[Qibo], Wu, J.H.[Jia-Hao], Wang, W.X.[Wen-Xiao], Lin, B.B.[Bin-Bin], Guan, Z.Y.[Zi-Yu], He, X.F.[Xiao-Fei],
Adapt2reward: Adapting Video-language Models to Generalizable Robotic Rewards via Failure Prompts,
ECCV24(LVII: 163-180).
Springer DOI 2412
BibRef

Rahmanzadehgervi, P.[Pooyan], Bolton, L.[Logan], Taesiri, M.R.[Mohammad Reza], Nguyen, A.T.[Anh Totti],
Vision Language Models are blind,
ACCV24(V: 293-309).
Springer DOI 2412
BibRef

Chytas, S.P.[Sotirios Panagiotis], Kim, H.W.J.[Hyun-Woo J.], Singh, V.[Vikas],
Understanding Multi-compositional Learning in Vision and Language Models via Category Theory,
ECCV24(XLVIII: 324-341).
Springer DOI 2412
BibRef

Song, Y.Z.[Yun-Zhu], Chen, Y.S.[Yi-Syuan], Lin, T.L.[Tzu-Ling], Liu, B.[Bei], Fu, J.L.[Jian-Long], Shuai, H.H.[Hong-Han],
Capture Concept Through Comparison: Vision-and-language Representation Learning with Intrinsic Information Mining,
ACCV24(III: 220-238).
Springer DOI 2412
BibRef

Adhikari, R.[Rabin], Thapaliya, S.[Safal], Dhakal, M.[Manish], Khanal, B.[Bishesh],
Tunevlseg: Prompt Tuning Benchmark for Vision-language Segmentation Models,
ACCV24(III: 44-62).
Springer DOI 2412
BibRef

He, H.C.[Hai-Chen], Liu, W.B.[Wei-Bin], Xing, W.W.[Wei-Wei],
Biefficient: Bidirectionally Prompting Vision-language Models for Parameter-efficient Video Recognition,
ACCV24(III: 257-274).
Springer DOI 2412
BibRef

Yang, J.K.[Jing-Kang], Dong, Y.H.[Yu-Hao], Liu, S.[Shuai], Li, B.[Bo], Wang, Z.Y.[Zi-Yue], Tan, H.R.[Hao-Ran], Jiang, C.C.[Chen-Cheng], Kang, J.[Jiamu], Zhang, Y.H.[Yuan-Han], Zhou, K.Y.[Kai-Yang], Liu, Z.W.[Zi-Wei],
Octopus: Embodied Vision-language Programmer from Environmental Feedback,
ECCV24(I: 20-38).
Springer DOI 2412
BibRef

Kar, O.F.[Oguzhan Fatih], Tonioni, A.[Alessio], Poklukar, P.[Petra], Kulshrestha, A.[Achin], Zamir, A.[Amir], Tombari, F.[Federico],
Brave: Broadening the Visual Encoding of Vision-language Models,
ECCV24(XVI: 113-132).
Springer DOI 2412
BibRef

Kamath, A.[Amita], Hsieh, C.Y.[Cheng-Yu], Chang, K.W.[Kai-Wei], Krishna, R.[Ranjay],
The Hard Positive Truth About Vision-language Compositionality,
ECCV24(XIV: 37-54).
Springer DOI 2412
BibRef

Jia, B.X.[Bao-Xiong], Chen, Y.X.[Yi-Xin], Yu, H.Y.[Huang-Yue], Wang, Y.[Yan], Niu, X.S.[Xue-Song], Liu, T.Y.[Teng-Yu], Li, Q.[Qing], Huang, S.Y.[Si-Yuan],
Sceneverse: Scaling 3d Vision-language Learning for Grounded Scene Understanding,
ECCV24(IX: 289-310).
Springer DOI 2412
BibRef

Zhang, Y.F.[Yi-Feng], Jiang, M.[Ming], Zhao, Q.[Qi],
Learning Chain of Counterfactual Thought for Bias-robust Vision-language Reasoning,
ECCV24(VIII: 334-351).
Springer DOI 2412
BibRef

Li, J.[Junyan], Chen, D.[Delin], Cai, T.[Tianle], Chen, P.H.[Pei-Hao], Hong, Y.[Yining], Chen, Z.F.[Zhen-Fang], Shen, Y.K.[Yi-Kang], Gan, C.[Chuang],
Flexattention for Efficient High-resolution Vision-language Models,
ECCV24(XXV: 286-302).
Springer DOI 2412
BibRef

Li, X.[Xiang], Ding, J.[Jian], Chen, Z.Y.[Zhao-Yang], Elhoseiny, M.[Mohamed],
UNI3DL: A Unified Model for 3d Vision-language Understanding,
ECCV24(XXIII: 74-92).
Springer DOI 2412
BibRef

Hao, T.X.[Tian-Xiang], Ding, X.H.[Xiao-Han], Feng, J.X.[Jue-Xiao], Yang, Y.H.[Yu-Hong], Chen, H.[Hui], Ding, G.[Guiguang],
Quantized Prompt for Efficient Generalization of Vision-language Models,
ECCV24(XIX: 54-73).
Springer DOI 2412
BibRef

Xu, H.B.[Huang-Biao], Ke, X.[Xiao], Li, Y.Z.[Yue-Zhou], Xu, R.[Rui], Wu, H.Q.[Huan-Qi], Lin, X.F.[Xiao-Feng], Guo, W.Z.[Wen-Zhong],
Vision-language Action Knowledge Learning for Semantic-aware Action Quality Assessment,
ECCV24(XLII: 423-440).
Springer DOI 2412
BibRef

Zhu, Z.Y.[Zi-Yu], Zhang, Z.[Zhuofan], Ma, X.J.[Xiao-Jian], Niu, X.S.[Xue-Song], Chen, Y.X.[Yi-Xin], Jia, B.X.[Bao-Xiong], Deng, Z.D.[Zhi-Dong], Huang, S.Y.[Si-Yuan], Li, Q.[Qing],
Unifying 3d Vision-language Understanding via Promptable Queries,
ECCV24(XLIV: 188-206).
Springer DOI 2412
BibRef

Zhang, J.M.[Jia-Ming], Ma, X.J.[Xing-Jun], Wang, X.[Xin], Qiu, L.Y.[Ling-Yu], Wang, J.Q.[Jia-Qi], Jiang, Y.G.[Yu-Gang], Sang, J.[Jitao],
Adversarial Prompt Tuning for Vision-language Models,
ECCV24(XLV: 56-72).
Springer DOI 2412
BibRef

Wu, G.[Ge], Zhang, X.[Xin], Li, Z.[Zheng], Chen, Z.W.[Zhao-Wei], Liang, J.J.[Jia-Jun], Yang, J.[Jian], Li, X.[Xiang],
Cascade Prompt Learning for Vision-language Model Adaptation,
ECCV24(L: 304-321).
Springer DOI 2412
BibRef

Jiang, H.B.[Hao-Bin], Yue, J.P.[Jun-Peng], Luo, H.[Hao], Ding, Z.[Ziluo], Lu, Z.Q.[Zong-Qing],
Reinforcement Learning Friendly Vision-language Model for Minecraft,
ECCV24(LXVIII: 1-17).
Springer DOI 2412
BibRef

Nguyen, A.T.[A. Tuan], Tai, K.S.[Kai Sheng], Chen, B.C.[Bor-Chun], Shukla, S.N.[Satya Narayan], Yu, H.C.[Han-Chao], Torr, P.H.S.[Philip H.S.], Tian, T.P.[Tai-Peng], Lim, S.N.[Ser-Nam],
ucap: An Unsupervised Prompting Method for Vision-language Models,
ECCV24(LXXIV: 425-439).
Springer DOI 2412
BibRef

Zhang, Y.[Yi], Yu, K.[Ke], Wu, S.Q.[Si-Qi], He, Z.H.[Zhi-Hai],
Conceptual Codebook Learning for Vision-language Models,
ECCV24(LXXVII: 235-251).
Springer DOI 2412
BibRef

Chatterjee, A.[Agneet], Luo, Y.R.[Yi-Ran], Gokhale, T.[Tejas], Yang, Y.Z.[Ye-Zhou], Baral, C.[Chitta],
Revision: Rendering Tools Enable Spatial Fidelity in Vision-language Models,
ECCV24(XXX: 339-357).
Springer DOI 2412
BibRef

Sharma, P.[Pratyusha], Shaham, T.R.[Tamar Rott], Baradad, M.[Manel], Rodríguez-Muñoz, A.[Adrián], Duggal, S.[Shivam], Isola, P.[Phillip], Torralba, A.[Antonio], Fu, S.[Stephanie],
A Vision Check-up for Language Models,
CVPR24(14410-14419)
IEEE DOI 2410
Representation learning, Visualization, Analytical models, Codes, Image synthesis, Computational modeling BibRef

Parodi, F.[Felipe], Matelsky, J.K.[Jordan K.], Regla-Vargas, A.[Alejandra], Foglia, E.E.[Elizabeth E.], Lim, C.[Charis], Weinberg, D.[Danielle], Kording, K.P.[Konrad P.], Herrick, H.M.[Heidi M.], Platt, M.L.[Michael L.],
Vision-language models for decoding provider attention during neonatal resuscitation,
CVPM24(343-353)
IEEE DOI 2410
Training, Pediatrics, Accuracy, Semantics, Decision making, Transformers BibRef

Zhang, Y.B.[Ya-Bin], Zhu, W.J.[Wen-Jie], Tang, H.[Hui], Ma, Z.Y.[Zhi-Yuan], Zhou, K.Y.[Kai-Yang], Zhang, L.[Lei],
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models,
CVPR24(28718-28728)
IEEE DOI Code:
WWW Link. 2410
Training, Knowledge engineering, Adaptation models, Codes, Training data, Data models, Vision-language models, versatile adaptation BibRef

Guo, Y.C.[Yun-Cheng], Gu, X.D.[Xiao-Dong],
JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models,
CVPR24(28695-28705)
IEEE DOI 2410
Adaptation models, Adaptive systems, Noise, Manuals, Robustness, Noise measurement, prompt learning BibRef

Han, J.[Jinwei], Lin, Z.W.[Zhi-Wen], Sun, Z.Y.[Zhong-Yisun], Gao, Y.G.[Ying-Guo], Yan, K.[Ke], Ding, S.H.[Shou-Hong], Gao, Y.[Yuan], Xia, G.S.[Gui-Song],
Anchor-based Robust Finetuning of Vision-Language Models,
CVPR24(26909-26918)
IEEE DOI 2410
Image recognition, Zero-shot learning, Semantics, Benchmark testing, Anchor, Robust Finetuning BibRef

Cao, Q.L.[Qing-Long], Xu, Z.Q.[Zheng-Qin], Chen, Y.T.[Yun-Tian], Ma, C.[Chao], Yang, X.K.[Xiao-Kang],
Domain Prompt Learning with Quaternion Networks,
CVPR24(26627-26636)
IEEE DOI Code:
WWW Link. 2410
Knowledge engineering, Adaptation models, Codes, Quaternions, Face recognition, Contrastive learning, vision-language models, quaternion networks BibRef

Li, L.[Lin], Guan, H.Y.[Hao-Yan], Qiu, J.N.[Jia-Ning], Spratling, M.[Michael],
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-Trained Vision-Language Models,
CVPR24(24408-24419)
IEEE DOI Code:
WWW Link. 2410
Accuracy, Codes, Training data, Robustness, Computational efficiency, vision-language models, VLMs BibRef

Zanella, M.[Maxime], Fuchs, C.[Clément], de Vleeschouwer, C.[Christophe], Ayed, I.B.[Ismail Ben],
Realistic Test-Time Adaptation of Vision-Language Models,
CVPR25(25103-25112)
IEEE DOI Code:
WWW Link. 2508
BibRef
And: A2, A1, A3, Only:
Online Gaussian Test-Time Adaptation of Vision-Language Models,
MULA25(128-137)
IEEE DOI Code:
WWW Link. 2512
Adaptation models, Codes, Predictive models, Performance gain, Robustness, vision-language, test-time adaptation, regularized maximum likelihood estimation. Measurement, Visualization, Accuracy, Protocols, Limiting, Predictive models, Data models, Mathematical models, CLIP BibRef

Zanella, M.[Maxime], Ayed, I.B.[Ismail Ben],
On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do we Really need Prompt Learning?,
CVPR24(23783-23793)
IEEE DOI 2410
Training, Systematics, Computational modeling, Quality assessment, Computational efficiency, vision-language, training-free BibRef

Yang, S.[Senqiao], Tian, Z.[Zhuotao], Jiang, L.[Li], Jia, J.Y.[Jia-Ya],
Unified Language-Driven Zero-Shot Domain Adaptation,
CVPR24(23407-23415)
IEEE DOI 2410
Representation learning, Adaptation models, Visualization, Correlation, Scalability, Computational modeling, Vision-Language Model BibRef

Cui, J.Q.[Jie-Quan], Zhu, B.[Beier], Wen, X.[Xin], Qi, X.J.[Xiao-Juan], Yu, B.[Bei], Zhang, H.W.[Han-Wang],
Classes Are Not Equal: An Empirical Study on Image Recognition Fairness,
CVPR24(23283-23292)
IEEE DOI 2410
Training, Representation learning, Image recognition, Accuracy, Predictive models, Network architecture, Prediction algorithms, Vision-Language Models BibRef

Stojnic, V.[Vladan], Kalantidis, Y.[Yannis], Tolias, G.[Giorgos],
Label Propagation for Zero-shot Classification with Vision-Language Models,
CVPR24(23209-23218)
IEEE DOI Code:
WWW Link. 2410
Codes, Computational modeling, Closed box, Encoding, Data models, vision-language models, label propagation, zero-shot classification BibRef

Yuan, T.[Tongtong], Zhang, X.[Xuange], Liu, K.[Kun], Liu, B.[Bo], Chen, C.[Chen], Jin, J.[Jian], Jiao, Z.Z.[Zhen-Zhen],
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges,
CVPR24(22052-22061)
IEEE DOI Code:
WWW Link. 2410
Annotations, Surveillance, Semantics, Benchmark testing, Public security, Timing, Security, Dataset Annotation BibRef

Chen, Y.F.[Yi-Fei], Chen, D.P.[Da-Peng], Liu, R.J.[Rui-Jin], Zhou, S.[Sai], Xue, W.Y.[Wen-Yuan], Peng, W.[Wei],
Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition,
CVPR24(18688-18698)
IEEE DOI 2410
Representation learning, Adaptation models, Visualization, Semantics, Transformers, Vectors, Video action recognition, visual-language model BibRef

Mittal, H.[Himangi], Agarwal, N.[Nakul], Lo, S.Y.[Shao-Yuan], Lee, K.[Kwonjoon],
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models,
CVPR24(18580-18590)
IEEE DOI 2410
Accuracy, Computational modeling, Linear programming, Action Anticipation, Video, Large Multimodal Models BibRef

Kahatapitiya, K.[Kumara], Arnab, A.[Anurag], Nagrani, A.[Arsha], Ryoo, M.S.[Michael S.],
VicTR: Video-conditioned Text Representations for Activity Recognition,
CVPR24(18547-18558)
IEEE DOI 2410
Training, Visualization, Adaptation models, Semantics, Focusing, Benchmark testing, Vision-language models, Activity Recognition, Video-conditioned Text BibRef

Wu, T.Y.[Tz-Ying], Ho, C.H.[Chih-Hui], Vasconcelos, N.M.[Nuno M.],
ProTeCt: Prompt Tuning for Taxonomic Open Set Classification,
CVPR24(16531-16540)
IEEE DOI Code:
WWW Link. 2410
Measurement, Training, Frequency modulation, Accuracy, Taxonomy, Semantics, Hierarchical Classification, Visual-language foundation model BibRef

Zhao, G.[Ganlong], Li, G.B.[Guan-Bin], Chen, W.[Weikai], Yu, Y.Z.[Yi-Zhou],
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation,
CVPR24(16296-16306)
IEEE DOI 2410
Art, Accuracy, Navigation, Annotations, Detectors, Vision-and-Language Navigation, Open-vocabulary, Multi-Modal Learning BibRef

Li, X.[Xin], Wu, Y.F.[Yun-Fei], Jiang, X.H.[Xing-Hua], Guo, Z.H.[Zhi-Hao], Gong, M.M.[Ming-Ming], Cao, H.Y.[Hao-Yu], Liu, Y.S.[Yin-Song], Jiang, D.Q.[De-Qiang], Sun, X.[Xing],
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models,
CVPR24(15546-15555)
IEEE DOI 2410
Visualization, Computational modeling, Contrastive learning, Benchmark testing, Feature extraction, Filling, Contrastive Learning BibRef

Pham, K.[Khoi], Huynh, C.[Chuong], Lim, S.N.[Ser-Nam], Shrivastava, A.[Abhinav],
Composing Object Relations and Attributes for Image-Text Matching,
CVPR24(14354-14363)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Computational modeling, Image edge detection, Semantics, Benchmark testing, vision-language, image retrieval, image-text matching BibRef

Xu, Z.L.[Zhen-Lin], Zhu, Y.[Yi], Deng, S.Q.[Si-Qi], Mittal, A.[Abhay], Chen, Y.B.[Yan-Bei], Wang, M.[Manchen], Favaro, P.[Paolo], Tighe, J.[Joseph], Modolo, D.[Davide],
Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity,
WhatNext24(1827-1836)
IEEE DOI 2410
Computational modeling, Face recognition, Semantics, Training data, Focusing, Vision and language models, Zero-shot recognition, Benchmarking BibRef

Luo, Z.W.[Zi-Wei], Gustafsson, F.K.[Fredrik K.], Zhao, Z.[Zheng], Sjölund, J.[Jens], Schön, T.B.[Thomas B.],
Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models,
NTIRE24(6641-6651)
IEEE DOI 2410
Degradation, Training, Image synthesis, Pipelines, Transform coding, Diffusion models, Feature extraction, Image restoration, real-world BibRef

Huang, C.Q.[Chao-Qin], Jiang, A.[Aofan], Feng, J.H.[Jing-Hao], Zhang, Y.[Ya], Wang, X.C.[Xin-Chao], Wang, Y.F.[Yan-Feng],
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images,
CVPR24(11375-11385)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Image segmentation, Visualization, Source coding, Semantics, Anomaly Detection, Medical Images BibRef

Bang, J.[Jihwan], Ahn, S.[Sumyeong], Lee, J.G.[Jae-Gil],
Active Prompt Learning in Vision Language Models,
CVPR24(26994-27004)
IEEE DOI Code:
WWW Link. 2410
Learning systems, Adaptation models, Codes, Sampling methods, Labeling BibRef

Pan, C.[Chenbin], Yaman, B.[Burhaneddin], Nesti, T.[Tommaso], Mallik, A.[Abhirup], Allievi, A.G.[Alessandro G.], Velipasalar, S.[Senem], Ren, L.[Liu],
VLP: Vision Language Planning for Autonomous Driving,
CVPR24(14760-14769)
IEEE DOI 2410
Training, Urban areas, Linguistics, Cognition, Robustness, Planning BibRef

Liang, M.[Mingfu], Su, J.C.[Jong-Chyi], Schulter, S.[Samuel], Garg, S.[Sparsh], Zhao, S.Y.[Shi-Yu], Wu, Y.[Ying], Chandraker, M.[Manmohan],
AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving,
CVPR24(14695-14706)
IEEE DOI 2410
Training, Costs, Roads, Pipelines, Object detection, Benchmark testing, Data models, Autonomous Driving, Vision Language Model, Automatic Data Engine BibRef

Li, Z.[Zheng], Li, X.[Xiang], Fu, X.[Xinyi], Zhang, X.[Xin], Wang, W.Q.[Wei-Qiang], Chen, S.[Shuo], Yang, J.[Jian],
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models,
CVPR24(26607-26616)
IEEE DOI Code:
WWW Link. 2410
Codes, Computational modeling, Prediction algorithms, Data models, Vectors, Probability distribution, knowledge distillation, zero-shot learning BibRef

Khandelwal, A.[Anant],
PromptSync: Bridging Domain Gaps in Vision-Language Models through Class-Aware Prototype Alignment and Discrimination,
ZeroShot24(7819-7828)
IEEE DOI 2410
Adaptation models, Computational modeling, Prototypes, Contrastive learning, Benchmark testing, Robustness BibRef

Hirohashi, Y.[Yuki], Hirakawa, T.[Tsubasa], Yamashita, T.[Takayoshi], Fujiyoshi, H.[Hironobu],
Prompt Learning with One-Shot Setting based Feature Space Analysis in Vision-and-Language Models,
ZeroShot24(7761-7770)
IEEE DOI 2410
Learning systems, Analytical models, Adaptation models, Image resolution, Accuracy, Vision-and-Language Model, Prompt Learning BibRef

Zhang, L.[Le], Awal, R.[Rabiul], Agrawal, A.[Aishwarya],
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding,
CVPR24(13774-13784)
IEEE DOI Code:
WWW Link. 2410
Annotations, Semantics, Refining, Text to image, Contrastive learning, Benchmark testing, Cognition, contrastive learning BibRef

Rosasco, A.[Andrea], Berti, S.[Stefano], Pasquale, G.[Giulia], Malafronte, D.[Damiano], Sato, S.[Shogo], Segawa, H.[Hiroyuki], Inada, T.[Tetsugo], Natale, L.[Lorenzo],
ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks,
CVPR24(22239-22248)
IEEE DOI Code:
WWW Link. 2410
Measurement, Codes, Image synthesis, Text to image, Benchmark testing, benchmark, dataset, compositionality BibRef

Cheng, S.[Sijie], Guo, Z.C.[Zhi-Cheng], Wu, J.W.[Jing-Wen], Fang, K.[Kechen], Li, P.[Peng], Liu, H.P.[Hua-Ping], Liu, Y.[Yang],
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models,
CVPR24(14291-14302)
IEEE DOI 2410
Bridges, Visualization, Computational modeling, Focusing, Benchmark testing, Planning, Egocentric, Vision-Language Models, Benchmark BibRef

Kil, J.[Jihyung], Song, C.H.[Chan Hee], Zheng, B.[Boyuan], Deng, X.[Xiang], Su, Y.[Yu], Chao, W.L.[Wei-Lun],
Dual-View Visual Contextualization for Web Navigation,
CVPR24(14445-14454)
IEEE DOI 2410
Visualization, Navigation, Benchmark testing, AI Agents, Web Agents, Web Navigation, Vision-Language, Multimodal Agents BibRef

Guo, Y.Y.[Yang-Yang], Wang, G.Z.[Guang-Zhi], Kankanhalli, M.[Mohan],
PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,
CVPR24(15699-15709)
IEEE DOI 2410
Codes, Computational modeling, Perturbation methods, Loading, Transformers, Vision-Language, Low-rank Approximation BibRef

Farina, M.[Matteo], Mancini, M.[Massimiliano], Cunegatti, E.[Elia], Iacca, G.[Giovanni], Ricci, E.[Elisa],
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning,
CVPR24(16185-16195)
IEEE DOI Code:
WWW Link. 2410
Codes, Computational modeling, Transfer learning, Neurons, Benchmark testing, multimodal learning, sparse neural networks BibRef

Mu, F.Z.[Fang-Zhou], Mo, S.C.[Si-Cheng], Li, Y.[Yin],
SnAG: Scalable and Accurate Video Grounding,
CVPR24(18930-18940)
IEEE DOI Code:
WWW Link. 2410
Training, Analytical models, Accuracy, Grounding, Scalability, Computational modeling, Video understanding, Vision-Language Learning BibRef

Cao, Y.H.[Yun-Hao], Ji, K.X.[Kai-Xiang], Huang, Z.Y.[Zi-Yuan], Zheng, C.Y.[Chuan-Yang], Liu, J.J.[Jia-Jia], Wang, J.[Jian], Chen, J.D.[Jing-Dong], Yang, M.[Ming],
Towards Better Vision-Inspired Vision-Language Models,
CVPR24(13537-13547)
IEEE DOI 2410
Training, Bridges, Visualization, Computational modeling, Poles and towers, Benchmark testing, deep learning, deep prompt BibRef

Shi, K.Y.[Kun-Yu], Dong, Q.[Qi], Goncalves, L.[Luis], Tu, Z.W.[Zhuo-Wen], Soatto, S.[Stefano],
Non-autoregressive Sequence-to-Sequence Vision-Language Models,
CVPR24(13603-13612)
IEEE DOI 2410
Visualization, Technological innovation, Computational modeling, Predictive models, Drives, Encoding, Non-autoregressive, CTC, vision language models BibRef

Man, Y.Z.[Yun-Ze], Gui, L.Y.[Liang-Yan], Wang, Y.X.[Yu-Xiong],
Situational Awareness Matters in 3D Vision Language Reasoning,
CVPR24(13678-13688)
IEEE DOI 2410
Visualization, Solid modeling, Estimation, Performance gain, Cognition, Vision-Language, Multi-modal, 3D Reasoning BibRef

Zheng, C.H.[Chen-Hao], Zhang, J.[Jieyu], Kembhavi, A.[Aniruddha], Krishna, R.[Ranjay],
Iterated Learning Improves Compositionality in Large Vision-Language Models,
CVPR24(13785-13795)
IEEE DOI 2410
Training, Training data, Games, Contrastive learning, Benchmark testing, Performance gain, Cognitive science BibRef

Song, C.H.[Chull Hwan], Hwang, T.[Taebaek], Yoon, J.Y.[Joo-Young], Choi, S.[Shunghyun], Gu, Y.H.[Yeong Hyeon],
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining,
CVPR24(13948-13957)
IEEE DOI 2410
Training, Visualization, Image segmentation, Image resolution, Refining, Contrastive learning BibRef

Pramanick, S.[Shraman], Han, G.X.[Guang-Xing], Hou, R.[Rui], Nag, S.[Sayan], Lim, S.N.[Ser-Nam], Ballas, N.[Nicolas], Wang, Q.F.[Qi-Fan], Chellappa, R.[Rama], Almahairi, A.[Amjad],
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model,
CVPR24(14076-14088)
IEEE DOI Code:
WWW Link. 2410
Image segmentation, Visualization, Image coding, Filters, Grounding, Machine vision, Visual systems BibRef

Zeng, Y.[Yunan], Huang, Y.[Yan], Zhang, J.J.[Jin-Jin], Jie, Z.Q.[Ze-Qun], Chai, Z.H.[Zhen-Hua], Wang, L.[Liang],
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding,
CVPR24(14141-14151)
IEEE DOI 2410
Visualization, Codes, Grounding, Annotations, Pipelines, Benchmark testing BibRef

Karmanov, A.[Adilbek], Guan, D.[Dayan], Lu, S.J.[Shi-Jian], El Saddik, A.[Abdulmotaleb], Xing, E.[Eric],
Efficient Test-Time Adaptation of Vision-Language Models,
CVPR24(14162-14171)
IEEE DOI Code:
WWW Link. 2410
Adaptation models, Codes, Computational modeling, Noise, Predictive models, Benchmark testing BibRef

Sameni, S.[Sepehr], Kafle, K.[Kushal], Tan, H.[Hao], Jenni, S.[Simon],
Building Vision-Language Models on Solid Foundations with Masked Distillation,
CVPR24(14216-14226)
IEEE DOI 2410
Training, Solid modeling, Visualization, Computational modeling, Semantic segmentation, Buildings, LLM BibRef

Peng, W.[Wujian], Xie, S.C.[Si-Cheng], You, Z.[Zuyao], Lan, S.Y.[Shi-Yi], Wu, Z.X.[Zu-Xuan],
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding,
CVPR24(13279-13288)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Computational modeling, Pipelines, Benchmark testing, Linguistics, Vision language model, Fine-grained understanding BibRef

Zhao, Y.[Yue], Zhao, L.[Long], Zhou, X.Y.[Xing-Yi], Wu, J.L.[Jia-Lin], Chu, C.T.[Chun-Te], Miao, H.[Hui], Schroff, F.[Florian], Adam, H.[Hartwig], Liu, T.[Ting], Gong, B.Q.[Bo-Qing], Krähenbühl, P.[Philipp], Yuan, L.Z.[Liang-Zhe],
Distilling Vision-Language Models on Millions of Videos,
CVPR24(13106-13116)
IEEE DOI 2410
Adaptation models, Computational modeling, Benchmark testing, Data models, Text to video BibRef

Chen, J.N.[Jie-Neng], Yu, Q.H.[Qi-Hang], Shen, X.H.[Xiao-Hui], Yuille, A.L.[Alan L.], Chen, L.C.[Liang-Chieh],
ViTamin: Designing Scalable Vision Models in the Vision-Language Era,
CVPR24(12954-12966)
IEEE DOI 2410
Training, Image segmentation, Accuracy, Protocols, Image coding, Scalability, Computational modeling, Vision-Language Models, Architectural Design BibRef

Liu, S.H.[Shi-Hong], Yu, S.[Samuel], Lin, Z.Q.[Zhi-Qiu], Pathak, D.[Deepak], Ramanan, D.[Deva],
Language Models as Black-Box Optimizers for Vision-Language Models,
CVPR24(12687-12697)
IEEE DOI 2410
Computational modeling, Natural languages, Closed box, Text to image, Human in the loop, Data models, generative models BibRef

Howard, P.[Phillip], Madasu, A.[Avinash], Le, T.[Tiep], Moreno, G.L.[Gustavo Lujan], Bhiwandiwalla, A.[Anahita], Lal, V.[Vasudev],
SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples,
CVPR24(11975-11985)
IEEE DOI 2410
Training, Prevention and mitigation, Text to image, Diffusion models, Fairness, social bias, counterfactuals BibRef

Jiang, Y.K.[Yan-Kai], Huang, Z.Z.[Zhong-Zhen], Zhang, R.Z.[Rong-Zhao], Zhang, X.F.[Xiao-Fan], Zhang, S.T.[Shao-Ting],
ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting,
CVPR24(11386-11397)
IEEE DOI 2410
Training, Visualization, Pathology, Image segmentation, Image analysis, Computational modeling, Vision-Language Model BibRef

Kim, Y.[Younghyun], Mo, S.[Sangwoo], Kim, M.[Minkyu], Lee, K.[Kyungmin], Lee, J.[Jaeho], Shin, J.[Jinwoo],
Discovering and Mitigating Visual Biases Through Keyword Explanation,
CVPR24(11082-11092)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Image recognition, Computational modeling, Training data, Flowering plants, bias and fairness, explainable AI, vision-language model BibRef

Li, R.[Rui], Fischer, T.[Tobias], Segu, M.[Mattia], Pollefeys, M.[Marc], Van Gool, L.J.[Luc J.], Tombari, F.[Federico],
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning,
CVPR24(9848-9858)
IEEE DOI Code:
WWW Link. 2410
Geometry, Visualization, Attention mechanisms, Shape, Semantics, radiance field, vision-language model, spatial context, spatial attention BibRef

Zeng, Z.[Ziyao], Wang, D.[Daniel], Yang, F.Y.[Feng-Yu], Park, H.[Hyoungseob], Soatto, S.[Stefano], Lao, D.[Dong], Wong, A.[Alex],
WorDepth: Variational Language Prior for Monocular Depth Estimation,
CVPR24(9708-9719)
IEEE DOI Code:
WWW Link. 2410
Measurement, Codes, Estimation, Encoding, Monocular Depth Estimation, Vision-Language Model, Variational Model BibRef

Hu, Y.S.[Yu-Shi], Stretcu, O.[Otilia], Lu, C.T.[Chun-Ta], Viswanathan, K.[Krishnamurthy], Hata, K.[Kenji], Luo, E.[Enming], Krishna, R.[Ranjay], Fuxman, A.[Ariel],
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models,
CVPR24(9590-9601)
IEEE DOI 2410
Visualization, Adaptation models, Computational modeling, Instruments, Loading, Music, Cognition, vision-language model, tools BibRef

Zanella, M.[Maxime], Fuchs, C.[Clément], Ben Ayed, I.[Ismail], de Vleeschouwer, C.[Christophe],
Vocabulary-Free Few-Shot Learning for Vision-Language Models,
MULA25(149-158)
IEEE DOI Code:
WWW Link. 2512
Adaptation models, Visualization, Computational modeling, Semantics, Computational efficiency, Few shot learning, prompts BibRef

Silva-Rodríguez, J.[Julio], Hajimiri, S.[Sina], Ben Ayed, I.[Ismail], Dolz, J.[Jose],
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models,
CVPR24(23681-23690)
IEEE DOI Code:
WWW Link. 2410
Adaptation models, Codes, Computational modeling, Transfer learning, Probes BibRef

Zanella, M.[Maxime], Ben Ayed, I.[Ismail],
Low-Rank Few-Shot Adaptation of Vision-Language Models,
Prompting24(1593-1603)
IEEE DOI 2410
Training, Adaptation models, Design methodology, Few shot learning, Vision-Language, few-shot, adapter BibRef

Yang, C.[Cheng], Xu, R.[Rui], Guo, Y.[Ye], Huang, P.X.[Pei-Xiang], Chen, Y.[Yiru], Ding, W.[Wenkui], Wang, Z.Y.[Zhong-Yuan], Zhou, H.[Hong],
Improving Vision-and-Language Reasoning via Spatial Relations Modeling,
WACV24(758-767)
IEEE DOI 2404
Visualization, Analytical models, Graphical models, Statistical analysis, Computational modeling, Excavation, Vision + language and/or other modalities BibRef

Shen, S.[Sheng], Yang, S.[Shijia], Zhang, T.J.[Tian-Jun], Zhai, B.[Bohan], Gonzalez, J.E.[Joseph E.], Keutzer, K.[Kurt], Darrell, T.J.[Trevor J.],
Multitask Vision-Language Prompt Tuning,
WACV24(5644-5655)
IEEE DOI 2404
Learning systems, Visualization, Adaptation models, Benchmark testing, Vectors, Task analysis, Algorithms, Vision + language and/or other modalities BibRef

Zhang, G.[Gengyuan], Zhang, Y.R.[Yu-Rui], Zhang, K.[Kerui], Tresp, V.[Volker],
Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning,
WACV24(625-634)
IEEE DOI Code:
WWW Link. 2404
Visualization, Computational modeling, Feature extraction, Cognition, Task analysis, Commonsense reasoning, Algorithms, Vision + language and/or other modalities BibRef

Ganz, R.[Roy], Nuriel, O.[Oren], Aberdam, A.[Aviad], Kittenplon, Y.[Yair], Mazor, S.[Shai], Litman, R.[Ron],
Towards Models that Can See and Read,
ICCV23(21661-21671)
IEEE DOI 2401
BibRef

Zhang, H.[Heng], Liu, D.[Daqing], Lv, Z.[Zezhong], Su, B.[Bing], Tao, D.C.[Da-Cheng],
Exploring Temporal Concurrency for Video-Language Representation Learning,
ICCV23(15522-15532)
IEEE DOI Code:
WWW Link. 2401
BibRef

Shukor, M.[Mustafa], Dancette, C.[Corentin], Cord, M.[Matthieu],
eP-ALM: Efficient Perceptual Augmentation of Language Models,
ICCV23(21999-22012)
IEEE DOI Code:
WWW Link. 2401
BibRef

Schulter, S.[Samuel], Kumar, B.G.V.[B.G. Vijay], Suh, Y.M.[Yu-Min], Dafnis, K.M.[Konstantinos M.], Zhang, Z.X.[Zhi-Xing], Zhao, S.Y.[Shi-Yu], Metaxas, D.N.[Dimitris N.],
OmniLabel: A Challenging Benchmark for Language-Based Object Detection,
ICCV23(11919-11928)
IEEE DOI Code:
WWW Link. 2401
BibRef

Chen, Z.L.[Zi-Liang], Huang, X.[Xin], Guan, Q.L.[Quan-Long], Lin, L.[Liang], Luo, W.Q.[Wei-Qi],
A Retrospect to Multi-prompt Learning across Vision and Language,
ICCV23(22133-22144)
IEEE DOI 2401
BibRef

Derakhshani, M.M.[Mohammad Mahdi], Sanchez, E.[Enrique], Bulat, A.[Adrian], da Costa, V.G.T.[Victor Guilherme Turrisi], Snoek, C.G.M.[Cees G. M.], Tzimiropoulos, G.[Georgios], Martinez, B.[Brais],
Bayesian Prompt Learning for Image-Language Model Generalization,
ICCV23(15191-15200)
IEEE DOI Code:
WWW Link. 2401
BibRef

Lin, W.[Wei], Mirza, M.J.[Muhammad Jehanzeb], Doveh, S.[Sivan], Feris, R.[Rogerio], Giryes, R.[Raja], Hochreiter, S.[Sepp], Karlinsky, L.[Leonid],
Comparison Visual Instruction Tuning,
Reasoning25(2964-2974)
IEEE DOI 2512
Visualization, Solid modeling, Large language models, Benchmark testing, Cognition, Tuning, Anomaly detection, visual instruction tuning BibRef

Cascante-Bonilla, P.[Paola], Shehada, K.[Khaled], Smith, J.S.[James Seale], Doveh, S.[Sivan], Kim, D.H.[Dong-Hyun], Panda, R.[Rameswar], Varol, G.[Gül], Oliva, A.[Aude], Ordonez, V.[Vicente], Feris, R.S.[Rogerio S.], Karlinsky, L.[Leonid],
Going Beyond Nouns With Vision & Language Models Using Synthetic Data,
ICCV23(20098-20108)
IEEE DOI 2401
BibRef

Upadhyay, U.[Uddeshya], Karthik, S.[Shyamgopal], Mancini, M.[Massimiliano], Akata, Z.[Zeynep],
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models,
ICCV23(1899-1910)
IEEE DOI Code:
WWW Link. 2401
BibRef

Bitton-Guetta, N.[Nitzan], Bitton, Y.[Yonatan], Hessel, J.[Jack], Schmidt, L.[Ludwig], Elovici, Y.[Yuval], Stanovsky, G.[Gabriel], Schwartz, R.[Roy],
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images,
ICCV23(2616-2627)
IEEE DOI 2401
BibRef

Hu, Z.Y.[Zi-Yuan], Li, Y.Y.[Yan-Yang], Lyu, M.R.[Michael R.], Wang, L.W.[Li-Wei],
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control,
ICCV23(2998-3008)
IEEE DOI Code:
WWW Link. 2401
BibRef

Slyman, E.[Eric], Kahng, M.[Minsuk], Lee, S.[Stefan],
VLSlice: Interactive Vision-and-Language Slice Discovery,
ICCV23(15245-15255)
IEEE DOI 2401
BibRef

Najibi, M.[Mahyar], Ji, J.W.[Jing-Wei], Zhou, Y.[Yin], Qi, C.R.[Charles R.], Yan, X.C.[Xin-Chen], Ettinger, S.[Scott], Anguelov, D.[Dragomir],
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving,
ICCV23(8568-8578)
IEEE DOI 2401
BibRef

Xu, H.[Hu], Xie, S.[Saining], Huang, P.Y.[Po-Yao], Yu, L.C.[Li-Cheng], Howes, R.[Russell], Ghosh, G.[Gargi], Zettlemoyer, L.[Luke], Feichtenhofer, C.[Christoph],
CiT: Curation in Training for Effective Vision-Language Data,
ICCV23(15134-15143)
IEEE DOI 2401
BibRef

Trager, M.[Matthew], Perera, P.[Pramuditha], Zancato, L.[Luca], Achille, A.[Alessandro], Bhatia, P.[Parminder], Soatto, S.[Stefano],
Linear Spaces of Meanings: Compositional Structures in Vision-Language Models,
ICCV23(15349-15358)
IEEE DOI 2401
BibRef

Chen, Y.S.[Yi-Syuan], Song, Y.Z.[Yun-Zhu], Yeo, C.Y.[Cheng Yu], Liu, B.[Bei], Fu, J.L.[Jian-Long], Shuai, H.H.[Hong-Han],
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks,
ICCV23(15384-15396)
IEEE DOI 2401
BibRef

Wu, C.E.[Cheng-En], Tian, Y.[Yu], Yu, H.C.[Hai-Chao], Wang, H.[Heng], Morgado, P.[Pedro], Hu, Y.H.[Yu Hen], Yang, L.J.[Lin-Jie],
Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?,
ICCV23(15442-15451)
IEEE DOI Code:
WWW Link. 2401
BibRef

Ouali, Y.[Yassine], Bulat, A.[Adrian], Martinez, B.[Brais], Tzimiropoulos, G.[Georgios],
Black Box Few-Shot Adaptation for Vision-Language Models,
ICCV23(15488-15500)
IEEE DOI Code:
WWW Link. 2401
BibRef

Kan, B.[Baoshuo], Wang, T.[Teng], Lu, W.P.[Wen-Peng], Zhen, X.T.[Xian-Tong], Guan, W.[Weili], Zheng, F.[Feng],
Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models,
ICCV23(15624-15634)
IEEE DOI 2401
BibRef

Zhai, J.T.[Jiang-Tian], Zhang, Q.[Qi], Wu, T.[Tong], Chen, X.Y.[Xing-Yu], Liu, J.J.[Jiang-Jiang], Cheng, M.M.[Ming-Ming],
SLAN: Self-Locator Aided Network for Vision-Language Understanding,
ICCV23(21892-21901)
IEEE DOI Code:
WWW Link. 2401
BibRef

Long, S.[Sifan], Zhao, Z.[Zhen], Yuan, J.[Junkun], Tan, Z.C.[Zi-Chang], Liu, J.J.[Jiang-Jiang], Zhou, L.P.[Lu-Ping], Wang, S.S.[Sheng-Sheng], Wang, J.D.[Jing-Dong],
Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models,
ICCV23(21902-21912)
IEEE DOI 2401
BibRef

Cho, E.[Eulrang], Kim, J.[Jooyeon], Kim, H.W.J.[Hyun-Woo J.],
Distribution-Aware Prompt Tuning for Vision-Language Models,
ICCV23(21947-21956)
IEEE DOI Code:
WWW Link. 2401
BibRef

Varma, M.[Maya], Delbrouck, J.B.[Jean-Benoit], Hooper, S.[Sarah], Chaudhari, A.[Akshay], Langlotz, C.[Curtis],
ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data,
ICCV23(22168-22178)
IEEE DOI 2401
BibRef

Zhu, H.G.[Hong-Guang], Wei, Y.C.[Yun-Chao], Liang, X.D.[Xiao-Dan], Zhang, C.J.[Chun-Jie], Zhao, Y.[Yao],
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation,
ICCV23(22200-22210)
IEEE DOI Code:
WWW Link. 2401
BibRef

Hall, M.[Melissa], Gustafson, L.[Laura], Adcock, A.[Aaron], Misra, I.[Ishan], Ross, C.[Candace],
Vision-Language Models Performing Zero-Shot Tasks Exhibit Disparities Between Gender Groups,
CLVL23(2770-2777)
IEEE DOI 2401
BibRef

Agnolucci, L.[Lorenzo], Baldrati, A.[Alberto], Todino, F.[Francesco], Becattini, F.[Federico], Bertini, M.[Marco], del Bimbo, A.[Alberto],
ECO: Ensembling Context Optimization for Vision-Language Models,
CLVL23(2803-2807)
IEEE DOI 2401
BibRef

Palit, V.[Vedant], Pandey, R.[Rohan], Arora, A.[Aryaman], Liang, P.P.[Paul Pu],
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP,
CLVL23(2848-2853)
IEEE DOI 2401
BibRef

Sammani, F.[Fawaz], Deligiannis, N.[Nikos],
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks,
VLAR23(4636-4641)
IEEE DOI 2401
BibRef

Lee, D.J.[Dong-Jun], Song, S.[Seokwon], Suh, J.[Jihee], Choi, J.[Joonmyeong], Lee, S.[Sanghyeok], Kim, H.W.J.[Hyun-Woo J.],
Read-only Prompt Optimization for Vision-Language Few-shot Learning,
ICCV23(1401-1411)
IEEE DOI Code:
WWW Link. 2401
BibRef

Li, X.[Xuanlin], Fang, Y.H.[Yun-Hao], Liu, M.H.[Ming-Hua], Ling, Z.[Zhan], Tu, Z.W.[Zhuo-Wen], Su, H.[Hao],
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability,
ICCV23(2492-2503)
IEEE DOI 2401
BibRef

Bi, J.Y.[Jun-Yu], Cheng, D.[Daixuan], Yao, P.[Ping], Pang, B.[Bochen], Zhan, Y.F.[Yue-Feng], Yang, C.G.[Chuan-Guang], Wang, Y.J.[Yu-Jing], Sun, H.[Hao], Deng, W.W.[Wei-Wei], Zhang, Q.[Qi],
VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching,
ICCV23(2584-2593)
IEEE DOI 2401
BibRef

Udandarao, V.[Vishaal], Gupta, A.[Ankush], Albanie, S.[Samuel],
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models,
ICCV23(2725-2736)
IEEE DOI Code:
WWW Link. 2401
BibRef

Jiang, C.Y.[Chao-Ya], Xu, H.Y.[Hai-Yang], Ye, W.[Wei], Ye, Q.H.[Qing-Hao], Li, C.L.[Chen-Liang], Yan, M.[Ming], Bi, B.[Bin], Zhang, S.K.[Shi-Kun], Huang, F.[Fei], Huang, S.F.[Song-Fang],
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization,
ICCV23(2888-2898)
IEEE DOI 2401
BibRef

Shi, C.[Cheng], Yang, S.[Sibei],
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models,
ICCV23(2920-2929)
IEEE DOI 2401
BibRef

Wang, A.J.P.[Alex Jin-Peng], Lin, K.Q.H.[Kevin Qing-Hong], Zhang, D.J.H.[David Jun-Hao], Lei, S.W.X.[Stan Wei-Xian], Shou, M.Z.[Mike Zheng],
Too Large; Data Reduction for Vision-Language Pre-Training,
ICCV23(3124-3134)
IEEE DOI 2401
BibRef

Wang, W.H.[Wei-Han], Yang, Z.[Zhen], Xu, B.[Bin], Li, J.Z.[Juan-Zi], Sun, Y.K.[Yan-Kui],
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation,
ICCV23(3135-3146)
IEEE DOI 2401
BibRef

Boecking, B.[Benedikt], Usuyama, N.[Naoto], Bannur, S.[Shruthi], Castro, D.C.[Daniel C.], Schwaighofer, A.[Anton], Hyland, S.[Stephanie], Wetscherek, M.[Maria], Naumann, T.[Tristan], Nori, A.[Aditya], Alvarez-Valle, J.[Javier], Poon, H.[Hoifung], Oktay, O.[Ozan],
Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing,
ECCV22(XXXVI:1-21).
Springer DOI 2211
BibRef

Cui, Q.[Quan], Zhou, B.[Boyan], Guo, Y.[Yu], Yin, W.D.[Wei-Dong], Wu, H.[Hao], Yoshie, O.[Osamu], Chen, Y.[Yubo],
Contrastive Vision-Language Pre-training with Limited Resources,
ECCV22(XXXVI:236-253).
Springer DOI 2211
BibRef

Hu, X.W.[Xiao-Wei], Gan, Z.[Zhe], Wang, J.F.[Jian-Feng], Yang, Z.Y.[Zheng-Yuan], Liu, Z.C.[Zi-Cheng], Lu, Y.[Yumao], Wang, L.J.[Li-Juan],
Scaling Up Vision-Language Pretraining for Image Captioning,
CVPR22(17959-17968)
IEEE DOI 2210
Training, Visualization, Computational modeling, Training data, Benchmark testing, Transformers, Feature extraction, Vision + language BibRef

Zhang, P.C.[Peng-Chuan], Li, X.J.[Xiu-Jun], Hu, X.W.[Xiao-Wei], Yang, J.W.[Jian-Wei], Zhang, L.[Lei], Wang, L.J.[Li-Juan], Choi, Y.J.[Ye-Jin], Gao, J.F.[Jian-Feng],
VinVL: Revisiting Visual Representations in Vision-Language Models,
CVPR21(5575-5584)
IEEE DOI 2111
Training, Visualization, Computational modeling, Object detection, Benchmark testing, Feature extraction, Transformers BibRef

Li, Z.W.[Zhuo-Wan], Stengel-Eskin, E.[Elias], Zhang, Y.X.[Yi-Xiao], Xie, C.[Cihang], Tran, Q.[Quan], van Durme, B.[Benjamin], Yuille, A.L.[Alan L.],
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images,
ICCV21(14890-14899)
IEEE DOI 2203
Visualization, Analytical models, Codes, Computational modeling, Cognition, Data models, Vision + language BibRef

Yang, X.[Xu], Zhang, H.W.[Han-Wang], Qi, G.J.[Guo-Jun], Cai, J.F.[Jian-Fei],
Causal Attention for Vision-Language Tasks,
CVPR21(9842-9852)
IEEE DOI 2111
Correlation, Codes, Computational modeling, Training data, Transformers, Data models BibRef

Zheng, W.B.[Wen-Bo], Yan, L.[Lan], Gou, C.[Chao], Wang, F.Y.[Fei-Yue],
Webly Supervised Knowledge Embedding Model for Visual Reasoning,
CVPR20(12442-12451)
IEEE DOI 2008
Visual reasoning between a visual image and a natural language description. Visualization, Cognition, Knowledge based systems, Task analysis, Knowledge engineering, Modulation, Robustness BibRef

Nguyen, D.K.[Duy-Kien], Okatani, T.[Takayuki],
Multi-Task Learning of Hierarchical Vision-Language Representation,
CVPR19(10484-10493).
IEEE DOI 2002
BibRef

Gupta, T.[Tanmay], Shih, K.J.[Kevin J.], Singh, S.[Saurabh], Hoiem, D.[Derek],
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks,
ICCV17(4223-4232)
IEEE DOI 1802
data visualisation, image recognition, learning (artificial intelligence), Visualization BibRef

Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Attacks on Vision-Language Models.


Last update: Apr 6, 2026 at 11:28:57