20.4.3.3.2 Vision-Language Models, Language-Vision Models, VQA

Vision Language Model. Vision-Language Model. Visual-Language Model. Language Vision Model. VQA.
See also Visual Question Answering, Query, VQA.
See also Visual Grounding, Grounding Expressions.
See also CLIP, Contrastive Language-Image Pre-Training.
See also Large Language Models for Vision, LLM, LVLM.
See also Composed Image Retrieval.

Tamaazousti, Y.[Youssef], Le Borgne, H.[Hervé], Popescu, A.[Adrian], Gadeski, E.[Etienne], Ginsca, A.[Alexandru], Hudelot, C.[Céline],
Vision-language integration using constrained local semantic features,
CVIU(163), No. 1, 2017, pp. 41-57.
Elsevier DOI 1712
Image classification BibRef

Gouthaman, K.V., Nambiar, A.[Athira], Srinivas, K.S.[Kancheti Sai], Mittal, A.[Anurag],
Linguistically-aware attention for reducing the semantic gap in vision-language tasks,
PR(112), 2021, pp. 107812.
Elsevier DOI 2102
Attention models, Visual question answering, Counting in visual question answering, Image captioning BibRef

Zhou, K.Y.[Kai-Yang], Yang, J.K.[Jing-Kang], Loy, C.C.[Chen Change], Liu, Z.W.[Zi-Wei],
Learning to Prompt for Vision-Language Models,
IJCV(130), No. 9, September 2022, pp. 2337-2348.
Springer DOI 2208
BibRef

Zhou, K.Y.[Kai-Yang], Yang, J.K.[Jing-Kang], Loy, C.C.[Chen Change], Liu, Z.W.[Zi-Wei],
Conditional Prompt Learning for Vision-Language Models,
CVPR22(16795-16804)
IEEE DOI 2210
Training, Representation learning, Adaptation models, Neural networks, Manuals BibRef

Ma, C.C.[Cheng-Cheng], Liu, Y.[Yang], Deng, J.K.[Jian-Kang], Xie, L.X.[Ling-Xi], Dong, W.M.[Wei-Ming], Xu, C.S.[Chang-Sheng],
Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models,
CirSysVideo(33), No. 9, September 2023, pp. 4616-4629.
IEEE DOI Code:
WWW Link. 2310
BibRef

Zhu, Y.Q.[Yong-Qing], Li, X.Y.[Xiang-Yang], Zheng, M.[Mao], Yang, J.H.[Jia-Hao], Wang, Z.H.[Zi-Han], Guo, X.Q.[Xiao-Qian], Chai, Z.F.[Zi-Feng], Yuan, Y.C.[Yu-Chen], Jiang, S.Q.[Shu-Qiang],
Focus and Align: Learning Tube Tokens for Video-Language Pre-Training,
MultMed(25), 2023, pp. 8036-8050.
IEEE DOI 2312
BibRef

Chen, C.Q.[Chong-Qing], Han, D.[Dezhi], Chang, C.C.[Chin-Chen],
MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer,
PR(147), 2024, pp. 110084.
Elsevier DOI Code:
WWW Link. 2312
Multimodal vision-language paradigms, High-dependency modeling, Visual question answering (VQA), Logical relationship reasoning BibRef

Wu, W.H.[Wen-Hao], Sun, Z.[Zhun], Song, Y.X.[Yu-Xin], Wang, J.D.[Jing-Dong], Ouyang, W.L.[Wan-Li],
Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective,
IJCV(132), No. 2, February 2024, pp. 392-409.
Springer DOI 2402
BibRef

Ming, Y.F.[Yi-Fei], Li, Y.X.[Yi-Xuan],
How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?,
IJCV(132), No. 2, February 2024, pp. 596-609.
Springer DOI 2402
BibRef

Zhao, C.R.[Cai-Rong], Wang, Y.[Yubin], Jiang, X.Y.[Xin-Yang], Shen, Y.F.[Yi-Fei], Song, K.[Kaitao], Li, D.S.[Dong-Sheng], Miao, D.Q.[Duo-Qian],
Learning Domain Invariant Prompt for Vision-Language Models,
IP(33), 2024, pp. 1348-1360.
IEEE DOI 2402
Task analysis, Tuning, Training, Adaptation models, Visualization, Image color analysis, Self-supervised learning, Prompt learning, domain generalization BibRef

Yang, X.F.[Xiao-Feng], Liu, F.[Fayao], Lin, G.S.[Guo-Sheng],
Neural Logic Vision Language Explainer,
MultMed(26), 2024, pp. 3331-3340.
IEEE DOI 2402
Cognition, Logic programming, Deep learning, Visualization, Data models, Training, Markov processes, vision language pretraining BibRef

Wang, Y.D.[Yi-Dong], Yu, Z.O.[Zhu-Ohao], Wang, J.D.[Jin-Dong], Heng, Q.[Qiang], Chen, H.[Hao], Ye, W.[Wei], Xie, R.[Rui], Xie, X.[Xing], Zhang, S.K.[Shi-Kun],
Exploring Vision-Language Models for Imbalanced Learning,
IJCV(132), No. 1, January 2024, pp. 224-237.
Springer DOI 2402
BibRef

Yu, Z.T.[Zheng-Tao], Zhao, J.[Jia], Guo, C.L.[Chen-Liang], Yang, Y.[Ying],
StableNet: Distinguishing the hard samples to overcome language priors in visual question answering,
IET-CV(18), No. 2, 2024, pp. 315-327.
DOI Link 2403
multimedia systems BibRef

Zeng, Y.[Yan], Zhang, X.[Xinsong], Li, H.[Hang], Wang, J.W.[Jia-Wei], Zhang, J.P.[Ji-Peng], Zhou, W.[Wangchunshu],
X2-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks,
PAMI(46), No. 5, May 2024, pp. 3156-3168.
IEEE DOI 2404
Task analysis, Visualization, Transformers, Detectors, Training, Feature extraction, Image coding, vision language pre-training BibRef

Zheng, Y.Z.[Yao-Zong], Zhong, B.[Bineng], Liang, Q.H.[Qi-Hua], Li, G.R.[Guo-Rong], Ji, R.R.[Rong-Rong], Li, X.X.[Xian-Xian],
Toward Unified Token Learning for Vision-Language Tracking,
CirSysVideo(34), No. 4, April 2024, pp. 2125-2135.
IEEE DOI 2404
Task analysis, Target tracking, Visualization, Feature extraction, Pipelines, Linguistics, Training, Vision-language tracking, multi-modal modeling BibRef

Ye, P.[Ping], Xiao, G.[Gang], Liu, J.[Jun],
Multimodal Features Alignment for Vision-Language Object Tracking,
RS(16), No. 7, 2024, pp. 1168.
DOI Link 2404
BibRef

Bazi, Y.[Yakoub], Bashmal, L.[Laila], Rahhal, M.M.A.[Mohamad Mahmoud Al], Ricci, R.[Riccardo], Melgani, F.[Farid],
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery,
RS(16), No. 9, 2024, pp. 1477.
DOI Link 2405
BibRef

Kong, D.[Daehyeon], Kong, K.[Kyeongbo], Kang, S.J.[Suk-Ju],
Image clustering using generated text centroids,
SP:IC(125), 2024, pp. 117128.
Elsevier DOI 2405
Deep neural network, Image clustering, Multimodal task, Vision-language model BibRef

Chen, X.Y.[Xian-Yu], Yang, J.H.[Jin-Hui], Chen, S.[Shi], Wang, L.[Louis], Jiang, M.[Ming], Zhao, Q.[Qi],
Every Problem, Every Step, All in Focus: Learning to Solve Vision-Language Problems With Integrated Attention,
PAMI(46), No. 7, July 2024, pp. 4720-4735.
IEEE DOI 2406
Problem-solving, Task analysis, Visualization, Measurement, Graph neural networks, Cognition, Videos, Graph attention, vision-language problem solving BibRef

Menon, S.[Sachit], Chandratreya, I.P.[Ishaan Preetam], Vondrick, C.[Carl],
Task Bias in Contrastive Vision-Language Models,
IJCV(132), No. 6, June 2024, pp. 2026-2040.
Springer DOI 2406
BibRef

Zhang, J.Y.[Jing-Yi], Huang, J.X.[Jia-Xing], Jin, S.[Sheng], Lu, S.J.[Shi-Jian],
Vision-Language Models for Vision Tasks: A Survey,
PAMI(46), No. 8, August 2024, pp. 5625-5644.
IEEE DOI 2407
Task analysis, Visualization, Training, Deep learning, Surveys, Data models, Predictive models, Big Data, big model, deep learning, image classification BibRef

Dong, M.P.[Meng-Ping], Li, F.[Fei], Li, Z.B.[Zhen-Bo], Liu, X.[Xue],
Cluster prototype earth mover's distance adapters and alignment-guided prompt learning for vision-language models,
PR(156), 2024, pp. 110861.
Elsevier DOI 2408
Cluster prototype, Earth mover's distance, Adapter, Prompt learning, Vision-language models BibRef

Liu, Y.[Ye], Pan, Y.[Yan], Yin, J.[Jian],
Enhancing Multi-Label Deep Hashing for Image and Audio With Joint Internal Global Loss Constraints and Large Vision-Language Model,
SPLetters(31), 2024, pp. 2550-2554.
IEEE DOI 2410
Codes, Transformers, Adaptation models, Training, Convolutional neural networks, Feature extraction, vision transformer BibRef

Zhan, C.[Chenlu], Zhang, Y.F.[Yu-Fei], Lin, Y.[Yu], Wang, G.[Gaoang], Wang, H.W.[Hong-Wei],
UniDCP: Unifying Multiple Medical Vision-Language Tasks via Dynamic Cross-Modal Learnable Prompts,
MultMed(26), 2024, pp. 9736-9748.
IEEE DOI 2410
Task analysis, Adaptation models, Visualization, Medical diagnostic imaging, Tuning, Multitasking, Plastics, cross-modal shareable space BibRef

Su, K.[Ke], Zhang, X.X.[Xing-Xing], Zhang, S.Y.[Si-Yang], Zhu, J.[Jun], Zhang, B.[Bo],
To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training,
IP(33), 2024, pp. 5370-5381.
IEEE DOI 2410
Cognition, Visualization, Artificial intelligence, Training, Image reconstruction, Navigation, vision-language pre-training BibRef

Xuan, S.Y.[Shi-Yu], Yang, M.[Ming], Zhang, S.L.[Shi-Liang],
Adapting Vision-Language Models via Learning to Inject Knowledge,
IP(33), 2024, pp. 5798-5809.
IEEE DOI 2410
Feature extraction, Visualization, Adaptation models, Tuning, Training, Transformers, Dogs, Accuracy, Robustness, Few shot learning, knowledge injection BibRef

Zhou, W.[Wenlve], Zhou, Z.H.[Zhi-Heng],
Unsupervised Domain Adaption Harnessing Vision-Language Pre-Training,
CirSysVideo(34), No. 9, September 2024, pp. 8201-8214.
IEEE DOI Code:
WWW Link. 2410
Adaptation models, Task analysis, Training, Computational modeling, Tuning, Data models, Visualization, Unsupervised domain adaptation, model deployment BibRef

Guo, M.H.[Meng-Hao], Zhang, Y.[Yi], Mu, T.J.[Tai-Jiang], Huang, S.X.[Sharon X.], Hu, S.M.[Shi-Min],
Tuning Vision-Language Models With Multiple Prototypes Clustering,
PAMI(46), No. 12, December 2024, pp. 11186-11199.
IEEE DOI 2411
Prototypes, Adaptation models, Tuning, Visualization, Benchmark testing, Computational modeling, Data models, clustering BibRef

Sun, B.[Bo], Wu, Z.C.[Zhi-Chao], Zhang, H.[Hao], He, J.[Jun],
VTPL: Visual and text prompt learning for visual-language models,
JVCIR(104), 2024, pp. 104280.
Elsevier DOI 2411
V-L models, Prompt learning, Visual and text prompts, Poly-1 information NCE loss, Center loss BibRef

Liu, L.C.[Liang-Chen], Wang, N.N.[Nan-Nan], Liu, D.[Decheng], Yang, X.[Xi], Gao, X.B.[Xin-Bo], Liu, T.L.[Tong-Liang],
Towards Specific Domain Prompt Learning via Improved Text Label Optimization,
MultMed(26), 2024, pp. 10805-10815.
IEEE DOI 2411
Visualization, Optimization, Semantics, Task analysis, Terminology, Learning systems, Adaptation models, vision-language model BibRef

Liu, X.[Xin], Wu, J.[Jiamin], Yang, W.F.[Wen-Fei], Zhou, X.[Xu], Zhang, T.Z.[Tian-Zhu],
Multi-Modal Attribute Prompting for Vision-Language Models,
CirSysVideo(34), No. 11, November 2024, pp. 11579-11591.
IEEE DOI 2412
Visualization, Task analysis, Semantics, Adaptation models, Integrated circuit modeling, Vectors, attribute BibRef

Jiang, H.J.[Hao-Jun], Zhang, J.K.[Jian-Ke], Huang, R.[Rui], Ge, C.J.[Chun-Jiang], Ni, Z.[Zanlin], Song, S.[Shiji], Huang, G.[Gao],
Cross-modal adapter for vision-language retrieval,
PR(159), 2025, pp. 111144.
Elsevier DOI 2412
Adapter, Cross-modal interaction, Cross-modal retrieval, Parameter-efficient training, Multi-modal learning BibRef

Tan, Y.T.[Ying-Tao], Chen, Y.Y.[Ying-Ying], Wang, J.Q.[Jin-Qiao],
DSTA: Reinforcing Vision-Language Understanding for Scene-Text VQA With Dual-Stream Training Approach,
SPLetters(32), 2025, pp. 6-10.
IEEE DOI 2501
Optical character recognition, Training, Visualization, Feature extraction, Transformers, Text recognition, scene-text understanding BibRef

Onoe, Y.[Yasumasa], Rane, S.[Sunayana], Berger, Z.[Zachary], Bitton, Y.[Yonatan], Cho, J.[Jaemin], Garg, R.[Roopal], Ku, A.[Alexander], Parekh, Z.[Zarana], Pont-Tuset, J.[Jordi], Tanzer, G.[Garrett], Wang, S.[Su], Baldridge, J.[Jason],
DOCCI: Descriptions of Connected and Contrasting Images,
ECCV24(LX: 291-309).
Springer DOI 2412
BibRef

Li, T.[Tang], Ma, M.M.[Meng-Meng], Peng, X.[Xi],
DEAL: Disentangle and Localize Concept-level Explanations for VLMs,
ECCV24(XXXIX: 383-401).
Springer DOI 2412
BibRef

Park, K.Y.[Kwan-Yong], Saito, K.[Kuniaki], Kim, D.H.[Dong-Hyun],
Weak-to-strong Compositional Learning from Generative Models for Language-based Object Detection,
ECCV24(XXIII: 1-19).
Springer DOI 2412
BibRef

Li, S.C.[Shi-Cheng], Li, L.[Lei], Liu, Y.[Yi], Ren, S.[Shuhuai], Liu, Y.X.[Yuan-Xin], Gao, R.D.[Run-Dong], Sun, X.[Xu], Hou, L.[Lu],
Vitatecs: A Diagnostic Dataset for Temporal Concept Understanding of Video-language Models,
ECCV24(LXX: 331-348).
Springer DOI 2412
BibRef

Yang, Y.T.[Yan-Ting], Chen, M.H.[Ming-Hao], Qiu, Q.[Qibo], Wu, J.H.[Jia-Hao], Wang, W.X.[Wen-Xiao], Lin, B.B.[Bin-Bin], Guan, Z.Y.[Zi-Yu], He, X.F.[Xiao-Fei],
Adapt2reward: Adapting Video-language Models to Generalizable Robotic Rewards via Failure Prompts,
ECCV24(LVII: 163-180).
Springer DOI 2412
BibRef

Rahmanzadehgervi, P.[Pooyan], Bolton, L.[Logan], Taesiri, M.R.[Mohammad Reza], Nguyen, A.T.[Anh Totti],
Vision Language Models are blind,
ACCV24(V: 293-309).
Springer DOI 2412
BibRef

Lai, C.G.[Chen-Gen], Song, S.L.[Sheng-Li], Yan, S.[Sitong], Hu, G.[Guangneng],
Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples,
ECCV24(LXIX: 174-191).
Springer DOI 2412
BibRef

Chytas, S.P.[Sotirios Panagiotis], Kim, H.W.J.[Hyun-Woo J.], Singh, V.[Vikas],
Understanding Multi-compositional Learning in Vision and Language Models via Category Theory,
ECCV24(XLVIII: 324-341).
Springer DOI 2412
BibRef

Song, Y.Z.[Yun-Zhu], Chen, Y.S.[Yi-Syuan], Lin, T.L.[Tzu-Ling], Liu, B.[Bei], Fu, J.L.[Jian-Long], Shuai, H.H.[Hong-Han],
Capture Concept Through Comparison: Vision-and-language Representation Learning with Intrinsic Information Mining,
ACCV24(III: 220-238).
Springer DOI 2412
BibRef

Adhikari, R.[Rabin], Thapaliya, S.[Safal], Dhakal, M.[Manish], Khanal, B.[Bishesh],
Tunevlseg: Prompt Tuning Benchmark for Vision-language Segmentation Models,
ACCV24(III: 44-62).
Springer DOI 2412
BibRef

He, H.C.[Hai-Chen], Liu, W.B.[Wei-Bin], Xing, W.W.[Wei-Wei],
Biefficient: Bidirectionally Prompting Vision-language Models for Parameter-efficient Video Recognition,
ACCV24(III: 257-274).
Springer DOI 2412
BibRef

Yang, J.K.[Jing-Kang], Dong, Y.H.[Yu-Hao], Liu, S.[Shuai], Li, B.[Bo], Wang, Z.Y.[Zi-Yue], Tan, H.R.[Hao-Ran], Jiang, C.C.[Chen-Cheng], Kang, J.[Jiamu], Zhang, Y.[Yuanhan], Zhou, K.Y.[Kai-Yang], Liu, Z.W.[Zi-Wei],
Octopus: Embodied Vision-language Programmer from Environmental Feedback,
ECCV24(I: 20-38).
Springer DOI 2412
BibRef

Kar, O.F.[Oguzhan Fatih], Tonioni, A.[Alessio], Poklukar, P.[Petra], Kulshrestha, A.[Achin], Zamir, A.[Amir], Tombari, F.[Federico],
Brave: Broadening the Visual Encoding of Vision-language Models,
ECCV24(XVI: 113-132).
Springer DOI 2412
BibRef

Kamath, A.[Amita], Hsieh, C.Y.[Cheng-Yu], Chang, K.W.[Kai-Wei], Krishna, R.[Ranjay],
The Hard Positive Truth About Vision-language Compositionality,
ECCV24(XIV: 37-54).
Springer DOI 2412
BibRef

Ye-Bin, M.[Moon], Hyeon-Woo, N.[Nam], Choi, W.[Wonseok], Oh, T.H.[Tae-Hyun],
Beaf: Observing Before-after Changes to Evaluate Hallucination in Vision-language Models,
ECCV24(XI: 232-248).
Springer DOI 2412
BibRef

Jia, B.X.[Bao-Xiong], Chen, Y.X.[Yi-Xin], Yu, H.[Huangyue], Wang, Y.[Yan], Niu, X.S.[Xue-Song], Liu, T.[Tengyu], Li, Q.[Qing], Huang, S.Y.[Si-Yuan],
Sceneverse: Scaling 3d Vision-language Learning for Grounded Scene Understanding,
ECCV24(IX: 289-310).
Springer DOI 2412
BibRef

Zhang, Y.F.[Yi-Feng], Jiang, M.[Ming], Zhao, Q.[Qi],
Learning Chain of Counterfactual Thought for Bias-robust Vision-language Reasoning,
ECCV24(VIII: 334-351).
Springer DOI 2412
BibRef

Ruan, S.[Shouwei], Dong, Y.P.[Yin-Peng], Liu, H.Q.[Han-Qing], Huang, Y.[Yao], Su, H.[Hang], Wei, X.X.[Xing-Xing],
Omniview-tuning: Boosting Viewpoint Invariance of Vision-language Pre-training Models,
ECCV24(XXVI: 309-327).
Springer DOI 2412
BibRef

Li, J.[Junyan], Chen, D.[Delin], Cai, T.[Tianle], Chen, P.H.[Pei-Hao], Hong, Y.[Yining], Chen, Z.F.[Zhen-Fang], Shen, Y.[Yikang], Gan, C.[Chuang],
Flexattention for Efficient High-resolution Vision-language Models,
ECCV24(XXV: 286-302).
Springer DOI 2412
BibRef

Li, X.[Xiang], Ding, J.[Jian], Chen, Z.Y.[Zhao-Yang], Elhoseiny, M.[Mohamed],
UNI3DL: A Unified Model for 3d Vision-language Understanding,
ECCV24(XXIII: 74-92).
Springer DOI 2412
BibRef

Hao, T.X.[Tian-Xiang], Ding, X.H.[Xiao-Han], Feng, J.X.[Jue-Xiao], Yang, Y.H.[Yu-Hong], Chen, H.[Hui], Ding, G.[Guiguang],
Quantized Prompt for Efficient Generalization of Vision-language Models,
ECCV24(XIX: 54-73).
Springer DOI 2412
BibRef

Xu, H.B.[Huang-Biao], Ke, X.[Xiao], Li, Y.Z.[Yue-Zhou], Xu, R.[Rui], Wu, H.Q.[Huan-Qi], Lin, X.F.[Xiao-Feng], Guo, W.Z.[Wen-Zhong],
Vision-language Action Knowledge Learning for Semantic-aware Action Quality Assessment,
ECCV24(XLII: 423-440).
Springer DOI 2412
BibRef

Zhu, Z.Y.[Zi-Yu], Zhang, Z.[Zhuofan], Ma, X.J.[Xiao-Jian], Niu, X.S.[Xue-Song], Chen, Y.X.[Yi-Xin], Jia, B.X.[Bao-Xiong], Deng, Z.D.[Zhi-Dong], Huang, S.Y.[Si-Yuan], Li, Q.[Qing],
Unifying 3d Vision-language Understanding via Promptable Queries,
ECCV24(XLIV: 188-206).
Springer DOI 2412
BibRef

Zhang, J.M.[Jia-Ming], Ma, X.J.[Xing-Jun], Wang, X.[Xin], Qiu, L.Y.[Ling-Yu], Wang, J.Q.[Jia-Qi], Jiang, Y.G.[Yu-Gang], Sang, J.[Jitao],
Adversarial Prompt Tuning for Vision-language Models,
ECCV24(XLV: 56-72).
Springer DOI 2412
BibRef

Wu, G.[Ge], Zhang, X.[Xin], Li, Z.[Zheng], Chen, Z.W.[Zhao-Wei], Liang, J.J.[Jia-Jun], Yang, J.[Jian], Li, X.[Xiang],
Cascade Prompt Learning for Vision-language Model Adaptation,
ECCV24(L: 304-321).
Springer DOI 2412
BibRef

Gao, S.[Sensen], Jia, X.J.[Xiao-Jun], Ren, X.H.[Xu-Hong], Tsang, I.[Ivor], Guo, Q.[Qing],
Boosting Transferability in Vision-language Attacks via Diversification Along the Intersection Region of Adversarial Trajectory,
ECCV24(LVII: 442-460).
Springer DOI 2412
BibRef

Lafon, M.[Marc], Ramzi, E.[Elias], Rambour, C.[Clément], Audebert, N.[Nicolas], Thome, N.[Nicolas],
Gallop: Learning Global and Local Prompts for Vision-language Models,
ECCV24(LXI: 264-282).
Springer DOI 2412
BibRef

Jiang, H.B.[Hao-Bin], Yue, J.P.[Jun-Peng], Luo, H.[Hao], Ding, Z.[Ziluo], Lu, Z.Q.[Zong-Qing],
Reinforcement Learning Friendly Vision-language Model for Minecraft,
ECCV24(LXVIII: 1-17).
Springer DOI 2412
BibRef

Nguyen, A.T.[A. Tuan], Tai, K.S.[Kai Sheng], Chen, B.C.[Bor-Chun], Shukla, S.N.[Satya Narayan], Yu, H.[Hanchao], Torr, P.H.S.[Philip H.S.], Tian, T.P.[Tai-Peng], Lim, S.N.[Ser-Nam],
ucap: An Unsupervised Prompting Method for Vision-language Models,
ECCV24(LXXIV: 425-439).
Springer DOI 2412
BibRef

Zhang, Y.[Yi], Yu, K.[Ke], Wu, S.Q.[Si-Qi], He, Z.H.[Zhi-Hai],
Conceptual Codebook Learning for Vision-language Models,
ECCV24(LXXVII: 235-251).
Springer DOI 2412
BibRef

Kim, M.[Minchan], Kim, M.[Minyeong], Bae, J.[Junik], Choi, S.[Suhwan], Kim, S.[Sungkyung], Chang, B.[Buru],
Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-language Models,
ECCV24(LXXXVI: 236-252).
Springer DOI 2412
BibRef

Chatterjee, A.[Agneet], Luo, Y.[Yiran], Gokhale, T.[Tejas], Yang, Y.Z.[Ye-Zhou], Baral, C.[Chitta],
Revision: Rendering Tools Enable Spatial Fidelity in Vision-language Models,
ECCV24(XXX: 339-357).
Springer DOI 2412
BibRef

Ataallah, K.[Kirolos], Shen, X.Q.[Xiao-Qian], Abdelrahman, E.[Eslam], Sleiman, E.[Essam], Zhuge, M.C.[Ming-Chen], Ding, J.[Jian], Zhu, D.[Deyao], Schmidhuber, J.[Jürgen], Elhoseiny, M.[Mohamed],
Goldfish: Vision-language Understanding of Arbitrarily Long Videos,
ECCV24(XXIX: 251-267).
Springer DOI 2412
BibRef

Shen, R.[Ruoyue], Inoue, N.[Nakamasa], Shinoda, K.[Koichi],
Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering,
ICIP24(430-436)
IEEE DOI 2411
Training, Visualization, Codes, Accuracy, Large language models, Natural languages, Visual question answering, Prompting methods BibRef

Sharma, P.[Pratyusha], Shaham, T.R.[Tamar Rott], Baradad, M.[Manel], Rodríguez-Muñoz, A.[Adrián], Duggal, S.[Shivam], Isola, P.[Phillip], Torralba, A.[Antonio], Fu, S.[Stephanie],
A Vision Check-up for Language Models,
CVPR24(14410-14419)
IEEE DOI 2410
Representation learning, Visualization, Analytical models, Codes, Image synthesis, Computational modeling BibRef

Chen, X.[Xi], Djolonga, J.[Josip], Padlewski, P.[Piotr], Mustafa, B.[Basil], Changpinyo, S.[Soravit], Wu, J.L.[Jia-Lin], Ruiz, C.R.[Carlos Riquelme], Goodman, S.[Sebastian], Wang, X.[Xiao], Tay, Y.[Yi], Shakeri, S.[Siamak], Dehghani, M.[Mostafa], Salz, D.[Daniel], Lucic, M.[Mario], Tschannen, M.[Michael], Nagrani, A.[Arsha], Hu, H.[Hexiang], Joshi, M.[Mandar], Pang, B.[Bo], Montgomery, C.[Ceslee], Pietrzyk, P.[Paulina], Ritter, M.[Marvin], Piergiovanni, A.[AJ], Minderer, M.[Matthias], Pavetic, F.[Filip], Waters, A.[Austin], Li, G.[Gang], Alabdulmohsin, I.[Ibrahim], Beyer, L.[Lucas], Amelot, J.[Julien], Lee, K.[Kenton], Steiner, A.P.[Andreas Peter], Li, Y.[Yang], Keysers, D.[Daniel], Arnab, A.[Anurag], Xu, Y.Z.[Yuan-Zhong], Rong, K.[Keran], Kolesnikov, A.[Alexander], Seyedhosseini, M.[Mojtaba], Angelova, A.[Anelia], Zhai, X.H.[Xiao-Hua], Houlsby, N.[Neil], Soricut, R.[Radu],
On Scaling Up a Multilingual Vision and Language Model,
CVPR24(14432-14444)
IEEE DOI 2410
Training, Visualization, Computational modeling, Object detection, Benchmark testing, Question answering (information retrieval), pretraining BibRef

Parodi, F.[Felipe], Matelsky, J.K.[Jordan K.], Regla-Vargas, A.[Alejandra], Foglia, E.E.[Elizabeth E.], Lim, C.[Charis], Weinberg, D.[Danielle], Kording, K.P.[Konrad P.], Herrick, H.M.[Heidi M.], Platt, M.L.[Michael L.],
Vision-language models for decoding provider attention during neonatal resuscitation,
CVPM24(343-353)
IEEE DOI 2410
Training, Pediatrics, Accuracy, Semantics, Decision making, Transformers BibRef

Zhang, Y.[Yabin], Zhu, W.J.[Wen-Jie], Tang, H.[Hui], Ma, Z.Y.[Zhi-Yuan], Zhou, K.Y.[Kai-Yang], Zhang, L.[Lei],
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models,
CVPR24(28718-28728)
IEEE DOI Code:
WWW Link. 2410
Training, Knowledge engineering, Adaptation models, Codes, Training data, Data models, Vision-language models, versatile adaptation BibRef

Guo, Y.[Yuncheng], Gu, X.D.[Xiao-Dong],
JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models,
CVPR24(28695-28705)
IEEE DOI 2410
Adaptation models, Adaptive systems, Noise, Manuals, Robustness, Noise measurement, prompt learning BibRef

Byun, J.[Jaeseok], Kim, D.[Dohoon], Moon, T.[Taesup],
MAFA: Managing False Negatives for Vision-Language Pre-Training,
CVPR24(27304-27314)
IEEE DOI Code:
WWW Link. 2410
Smoothing methods, Codes, Computational modeling, Buildings BibRef

Han, J.[Jinwei], Lin, Z.W.[Zhi-Wen], Sun, Z.Y.[Zhong-Yisun], Gao, Y.G.[Ying-Guo], Yan, K.[Ke], Ding, S.H.[Shou-Hong], Gao, Y.[Yuan], Xia, G.S.[Gui-Song],
Anchor-based Robust Finetuning of Vision-Language Models,
CVPR24(26909-26918)
IEEE DOI 2410
Image recognition, Zero-shot learning, Semantics, Benchmark testing, Anchor, Robust Finetuning BibRef

Wei, Z.[Zihao], Pan, Z.X.[Zi-Xuan], Owens, A.[Andrew],
Efficient Vision-Language Pre-Training by Cluster Masking,
CVPR24(26805-26815)
IEEE DOI 2410
Training, Visualization, Semantics, Contrastive learning, Writing, Predictive models BibRef

Cao, Q.L.[Qing-Long], Xu, Z.Q.[Zheng-Qin], Chen, Y.[Yuntian], Ma, C.[Chao], Yang, X.K.[Xiao-Kang],
Domain Prompt Learning with Quaternion Networks,
CVPR24(26627-26636)
IEEE DOI Code:
WWW Link. 2410
Knowledge engineering, Adaptation models, Codes, Quaternions, Face recognition, Contrastive learning, vision-language models, quaternion networks BibRef

Wang, S.[Sibo], Zhang, J.[Jie], Yuan, Z.[Zheng], Shan, S.G.[Shi-Guang],
Pre-Trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness,
CVPR24(24502-24511)
IEEE DOI 2410
Training, Accuracy, Codes, Minimization, Robustness, Zero-Shot, Adversarial Robustness, Large-scale vision-language models BibRef

Li, L.[Lin], Guan, H.Y.[Hao-Yan], Qiu, J.N.[Jia-Ning], Spratling, M.[Michael],
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-Trained Vision-Language Models,
CVPR24(24408-24419)
IEEE DOI Code:
WWW Link. 2410
Accuracy, Codes, Training data, Robustness, Computational efficiency, vision-language models, VLMs BibRef

Zanella, M.[Maxime], Ayed, I.B.[Ismail Ben],
On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do we Really need Prompt Learning?,
CVPR24(23783-23793)
IEEE DOI 2410
Training, Systematics, Computational modeling, Quality assessment, Computational efficiency, vision-language, training-free BibRef

Yao, H.T.[Han-Tao], Zhang, R.[Rui], Xu, C.S.[Chang-Sheng],
TCP: Textual-Based Class-Aware Prompt Tuning for Visual-Language Model,
CVPR24(23438-23448)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Benchmark testing, Tuning BibRef

Yang, S.[Senqiao], Tian, Z.[Zhuotao], Jiang, L.[Li], Jia, J.Y.[Jia-Ya],
Unified Language-Driven Zero-Shot Domain Adaptation,
CVPR24(23407-23415)
IEEE DOI 2410
Representation learning, Adaptation models, Visualization, Correlation, Scalability, Computational modeling, Vision-Language Model BibRef

Cui, J.Q.[Jie-Quan], Zhu, B.[Beier], Wen, X.[Xin], Qi, X.J.[Xiao-Juan], Yu, B.[Bei], Zhang, H.W.[Han-Wang],
Classes Are Not Equal: An Empirical Study on Image Recognition Fairness,
CVPR24(23283-23292)
IEEE DOI 2410
Training, Representation learning, Image recognition, Accuracy, Predictive models, Network architecture, Prediction algorithms, Vision-Language Models BibRef

Stojnic, V.[Vladan], Kalantidis, Y.[Yannis], Tolias, G.[Giorgos],
Label Propagation for Zero-shot Classification with Vision-Language Models,
CVPR24(23209-23218)
IEEE DOI Code:
WWW Link. 2410
Codes, Computational modeling, Closed box, Encoding, Data models, vision-language models, label propagation, zero-shot classification BibRef

Yuan, T.[Tongtong], Zhang, X.[Xuange], Liu, K.[Kun], Liu, B.[Bo], Chen, C.[Chen], Jin, J.[Jian], Jiao, Z.Z.[Zhen-Zhen],
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges,
CVPR24(22052-22061)
IEEE DOI Code:
WWW Link. 2410
Annotations, Surveillance, Semantics, Benchmark testing, Public security, Timing, Security, Dataset Annotation BibRef

Chen, Y.F.[Yi-Fei], Chen, D.P.[Da-Peng], Liu, R.J.[Rui-Jin], Zhou, S.[Sai], Xue, W.Y.[Wen-Yuan], Peng, W.[Wei],
Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition,
CVPR24(18688-18698)
IEEE DOI 2410
Representation learning, Adaptation models, Visualization, Semantics, Transformers, Vectors, Video action recognition, visual-language model BibRef

Mittal, H.[Himangi], Agarwal, N.[Nakul], Lo, S.Y.[Shao-Yuan], Lee, K.[Kwonjoon],
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models,
CVPR24(18580-18590)
IEEE DOI 2410
Accuracy, Computational modeling, Linear programming, Action Anticipation, Video, Large Multimodal Models BibRef

Kahatapitiya, K.[Kumara], Arnab, A.[Anurag], Nagrani, A.[Arsha], Ryoo, M.S.[Michael S.],
VicTR: Video-conditioned Text Representations for Activity Recognition,
CVPR24(18547-18558)
IEEE DOI 2410
Training, Visualization, Adaptation models, Semantics, Focusing, Benchmark testing, Vision-language models, Activity Recognition, Video-conditioned Text BibRef

Wu, T.Y.[Tz-Ying], Ho, C.H.[Chih-Hui], Vasconcelos, N.M.[Nuno M.],
ProTeCt: Prompt Tuning for Taxonomic Open Set Classification,
CVPR24(16531-16540)
IEEE DOI Code:
WWW Link. 2410
Measurement, Training, Frequency modulation, Accuracy, Taxonomy, Semantics, Hierarchical Classification, Visual-language foundation model BibRef

Zhao, G.[Ganlong], Li, G.B.[Guan-Bin], Chen, W.[Weikai], Yu, Y.Z.[Yi-Zhou],
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation,
CVPR24(16296-16306)
IEEE DOI 2410
Art, Accuracy, Navigation, Annotations, Detectors, Vision-and-Language Navigation, Open-vocabulary, Multi-Modal Learning BibRef

Li, X.[Xin], Wu, Y.F.[Yun-Fei], Jiang, X.H.[Xing-Hua], Guo, Z.H.[Zhi-Hao], Gong, M.M.[Ming-Ming], Cao, H.Y.[Hao-Yu], Liu, Y.S.[Yin-Song], Jiang, D.Q.[De-Qiang], Sun, X.[Xing],
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models,
CVPR24(15546-15555)
IEEE DOI 2410
Visualization, Computational modeling, Contrastive learning, Benchmark testing, Feature extraction, Filling, Contrastive Learning BibRef

Pham, K.[Khoi], Huynh, C.[Chuong], Lim, S.N.[Ser-Nam], Shrivastava, A.[Abhinav],
Composing Object Relations and Attributes for Image-Text Matching,
CVPR24(14354-14363)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Computational modeling, Image edge detection, Semantics, Benchmark testing, vision-language, image retrieval, image-text matching BibRef

Lee, J.H.[Ju-Hee], Kang, J.W.[Je-Won],
SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling,
CVPR24(13689-13699)
IEEE DOI 2410
Attention mechanisms, Computational modeling, Semantics, Electron tubes, Trajectory, video-language pre-training BibRef

Kim, G.[Gahyeon], Kim, S.[Sohee], Lee, S.[Seokju],
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models,
Prompting24(1572-1582)
IEEE DOI 2410
Visualization, Zero-shot learning, Semantics, Focusing, Feature extraction, Data augmentation, Vectors, prompt learning, VLMs BibRef

Xu, Z.[Zhenlin], Zhu, Y.[Yi], Deng, S.Q.[Si-Qi], Mittal, A.[Abhay], Chen, Y.B.[Yan-Bei], Wang, M.[Manchen], Favaro, P.[Paolo], Tighe, J.[Joseph], Modolo, D.[Davide],
Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity,
WhatNext24(1827-1836)
IEEE DOI 2410
Computational modeling, Face recognition, Semantics, Training data, Focusing, Vision and language models, Zero-shot recognition, Benchmarking BibRef

Luo, Z.W.[Zi-Wei], Gustafsson, F.K.[Fredrik K.], Zhao, Z.[Zheng], Sjölund, J.[Jens], Schön, T.B.[Thomas B.],
Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models,
NTIRE24(6641-6651)
IEEE DOI 2410
Degradation, Training, Image synthesis, Pipelines, Transform coding, Diffusion models, Feature extraction, Image restoration, real-world BibRef

Huang, C.Q.[Chao-Qin], Jiang, A.[Aofan], Feng, J.H.[Jing-Hao], Zhang, Y.[Ya], Wang, X.C.[Xin-Chao], Wang, Y.F.[Yan-Feng],
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images,
CVPR24(11375-11385)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Image segmentation, Visualization, Source coding, Semantics, Anomaly Detection, Medical Images BibRef

Bang, J.[Jihwan], Ahn, S.[Sumyeong], Lee, J.G.[Jae-Gil],
Active Prompt Learning in Vision Language Models,
CVPR24(26994-27004)
IEEE DOI Code:
WWW Link. 2410
Learning systems, Adaptation models, Codes, Sampling methods, Labeling BibRef

Pan, C.[Chenbin], Yaman, B.[Burhaneddin], Nesti, T.[Tommaso], Mallik, A.[Abhirup], Allievi, A.G.[Alessandro G.], Velipasalar, S.[Senem], Ren, L.[Liu],
VLP: Vision Language Planning for Autonomous Driving,
CVPR24(14760-14769)
IEEE DOI 2410
Training, Urban areas, Linguistics, Cognition, Robustness, Planning BibRef

Liang, M.[Mingfu], Su, J.C.[Jong-Chyi], Schulter, S.[Samuel], Garg, S.[Sparsh], Zhao, S.Y.[Shi-Yu], Wu, Y.[Ying], Chandraker, M.[Manmohan],
AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving,
CVPR24(14695-14706)
IEEE DOI 2410
Training, Costs, Roads, Pipelines, Object detection, Benchmark testing, Data models, Autonomous Driving, Vision Language Model, Automatic Data Engine BibRef

Li, Z.[Zheng], Li, X.[Xiang], Fu, X.[Xinyi], Zhang, X.[Xin], Wang, W.Q.[Wei-Qiang], Chen, S.[Shuo], Yang, J.[Jian],
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models,
CVPR24(26607-26616)
IEEE DOI Code:
WWW Link. 2410
Codes, Computational modeling, Prediction algorithms, Data models, Vectors, Probability distribution, knowledge distillation, zero-shot learning BibRef

Khandelwal, A.[Anant],
PromptSync: Bridging Domain Gaps in Vision-Language Models through Class-Aware Prototype Alignment and Discrimination,
ZeroShot24(7819-7828)
IEEE DOI 2410
Adaptation models, Computational modeling, Prototypes, Contrastive learning, Benchmark testing, Robustness BibRef

Hirohashi, Y.[Yuki], Hirakawa, T.[Tsubasa], Yamashita, T.[Takayoshi], Fujiyoshi, H.[Hironobu],
Prompt Learning with One-Shot Setting based Feature Space Analysis in Vision-and-Language Models,
ZeroShot24(7761-7770)
IEEE DOI 2410
Learning systems, Analytical models, Adaptation models, Image resolution, Accuracy, Vision-and-Language Model, Prompt Learning BibRef

Zhang, L.[Le], Awal, R.[Rabiul], Agrawal, A.[Aishwarya],
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding,
CVPR24(13774-13784)
IEEE DOI Code:
WWW Link. 2410
Annotations, Semantics, Refining, Text to image, Contrastive learning, Benchmark testing, Cognition, contrastive learning BibRef

Rosasco, A.[Andrea], Berti, S.[Stefano], Pasquale, G.[Giulia], Malafronte, D.[Damiano], Sato, S.[Shogo], Segawa, H.[Hiroyuki], Inada, T.[Tetsugo], Natale, L.[Lorenzo],
ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks,
CVPR24(22239-22248)
IEEE DOI Code:
WWW Link. 2410
Measurement, Codes, Image synthesis, Text to image, Benchmark testing, benchmark, dataset, compositionality BibRef

Cheng, S.[Sijie], Guo, Z.C.[Zhi-Cheng], Wu, J.W.[Jing-Wen], Fang, K.[Kechen], Li, P.[Peng], Liu, H.P.[Hua-Ping], Liu, Y.[Yang],
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models,
CVPR24(14291-14302)
IEEE DOI 2410
Bridges, Visualization, Computational modeling, Focusing, Benchmark testing, Planning, Egocentric, Vision-Language Models, Benchmark BibRef

Guan, T.R.[Tian-Rui], Liu, F.[Fuxiao], Wu, X.[Xiyang], Xian, R.Q.[Rui-Qi], Li, Z.X.[Zong-Xia], Liu, X.Y.[Xiao-Yu], Wang, X.[Xijun], Chen, L.[Lichang], Huang, F.[Furong], Yacoob, Y.[Yaser], Manocha, D.[Dinesh], Zhou, T.Y.[Tian-Yi],
Hallusionbench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models,
CVPR24(14375-14385)
IEEE DOI Code:
WWW Link. 2410
Visualization, Analytical models, Accuracy, Statistical analysis, Computational modeling, Benchmark testing, Vision language model, VLM Evaluation BibRef

Kil, J.[Jihyung], Song, C.H.[Chan Hee], Zheng, B.[Boyuan], Deng, X.[Xiang], Su, Y.[Yu], Chao, W.L.[Wei-Lun],
Dual-View Visual Contextualization for Web Navigation,
CVPR24(14445-14454)
IEEE DOI 2410
Visualization, Navigation, Benchmark testing, AI Agents, Web Agents, Web Navigation, Vision-Language, Multimodal Agents BibRef

Guo, Y.Y.[Yang-Yang], Wang, G.Z.[Guang-Zhi], Kankanhalli, M.[Mohan],
PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,
CVPR24(15699-15709)
IEEE DOI 2410
Codes, Computational modeling, Perturbation methods, Loading, Computer architecture, Transformers, Vision-Language, Low-rank Approximation BibRef

Cao, J.J.[Jian-Jian], Ye, P.[Peng], Li, S.Z.[Sheng-Ze], Yu, C.[Chong], Tang, Y.S.[Yan-Song], Lu, J.W.[Ji-Wen], Chen, T.[Tao],
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer,
CVPR24(15710-15719)
IEEE DOI Code:
WWW Link. 2410
Degradation, Adaptation models, Visualization, Costs, Computational modeling, Semantics, Token Pruning, Model Compress BibRef

Farina, M.[Matteo], Mancini, M.[Massimiliano], Cunegatti, E.[Elia], Iacca, G.[Giovanni], Ricci, E.[Elisa],
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning,
CVPR24(16185-16195)
IEEE DOI Code:
WWW Link. 2410
Codes, Computational modeling, Transfer learning, Neurons, Benchmark testing, multimodal learning, sparse neural networks BibRef

Majumdar, A.[Arjun], Ajay, A.[Anurag], Zhang, X.H.[Xiao-Han], Putta, P.[Pranav], Yenamandra, S.[Sriram], Henaff, M.[Mikael], Silwal, S.[Sneha], Mcvay, P.[Paul], Maksymets, O.[Oleksandr], Arnaud, S.[Sergio], Yadav, K.[Karmesh], Li, Q.[Qiyang], Newman, B.[Ben], Sharma, M.[Mohit], Berges, V.[Vincent], Zhang, S.Q.[Shi-Qi], Agrawal, P.[Pulkit], Bisk, Y.[Yonatan], Batra, D.[Dhruv], Kalakrishnan, M.[Mrinal], Meier, F.[Franziska], Paxton, C.[Chris], Sax, A.[Alexander], Rajeswaran, A.[Aravind],
OpenEQA: Embodied Question Answering in the Era of Foundation Models,
CVPR24(16488-16498)
IEEE DOI 2410
Protocols, Natural languages, Semantics, Benchmark testing, Question answering (information retrieval), Vision-Language Models BibRef

Mu, F.Z.[Fang-Zhou], Mo, S.C.[Si-Cheng], Li, Y.[Yin],
SnAG: Scalable and Accurate Video Grounding,
CVPR24(18930-18940)
IEEE DOI Code:
WWW Link. 2410
Training, Analytical models, Accuracy, Grounding, Scalability, Computational modeling, Video understanding, Vision-Language Learning BibRef

Gao, Y.[Yuan], Shi, K.Y.[Kun-Yu], Zhu, P.[Pengkai], Belval, E.[Edouard], Nuriel, O.[Oren], Appalaraju, S.[Srikar], Ghadar, S.[Shabnam], Tu, Z.W.[Zhuo-Wen], Mahadevan, V.[Vijay], Soatto, S.[Stefano],
Enhancing Vision-Language Pre-Training with Rich Supervisions,
CVPR24(13480-13491)
IEEE DOI 2410
Location awareness, Visualization, Technological innovation, Annotations, Pipelines, Web pages, Streaming media, UI understanding BibRef

Cao, Y.H.[Yun-Hao], Ji, K.X.[Kai-Xiang], Huang, Z.Y.[Zi-Yuan], Zheng, C.Y.[Chuan-Yang], Liu, J.J.[Jia-Jia], Wang, J.[Jian], Chen, J.D.[Jing-Dong], Yang, M.[Ming],
Towards Better Vision-Inspired Vision-Language Models,
CVPR24(13537-13547)
IEEE DOI 2410
Training, Bridges, Visualization, Computational modeling, Poles and towers, Benchmark testing, deep learning, deep prompt BibRef

Shi, K.Y.[Kun-Yu], Dong, Q.[Qi], Goncalves, L.[Luis], Tu, Z.W.[Zhuo-Wen], Soatto, S.[Stefano],
Non-autoregressive Sequence-to-Sequence Vision-Language Models,
CVPR24(13603-13612)
IEEE DOI 2410
Visualization, Technological innovation, Computational modeling, Predictive models, Drives, Encoding, Non-autoregressive, CTC, vision language models BibRef

Man, Y.Z.[Yun-Ze], Gui, L.Y.[Liang-Yan], Wang, Y.X.[Yu-Xiong],
Situational Awareness Matters in 3D Vision Language Reasoning,
CVPR24(13678-13688)
IEEE DOI 2410
Visualization, Solid modeling, Estimation, Performance gain, Cognition, Vision-Language, Multi-modal, 3D Reasoning BibRef

Zheng, C.H.[Chen-Hao], Zhang, J.[Jieyu], Kembhavi, A.[Aniruddha], Krishna, R.[Ranjay],
Iterated Learning Improves Compositionality in Large Vision-Language Models,
CVPR24(13785-13795)
IEEE DOI 2410
Training, Training data, Games, Contrastive learning, Benchmark testing, Performance gain, Cognitive science BibRef

Leng, S.[Sicong], Zhang, H.[Hang], Chen, G.Z.[Guan-Zheng], Li, X.[Xin], Lu, S.J.[Shi-Jian], Miao, C.Y.[Chun-Yan], Bing, L.[Lidong],
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding,
CVPR24(13872-13882)
IEEE DOI 2410
Training, Visualization, Accuracy, Computational modeling, Benchmark testing, Decoding, Multimodality, Vision and Language BibRef

Slyman, E.[Eric], Lee, S.[Stefan], Cohen, S.[Scott], Kafle, K.[Kushal],
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication,
CVPR24(13905-13916)
IEEE DOI 2410
Training, Measurement, Costs, Semantics, Skin, Data models, multimodal, fairness, vision-language, foundation models, human-centered ai, deduplication BibRef

Song, C.H.[Chull Hwan], Hwang, T.[Taebaek], Yoon, J.Y.[Joo-Young], Choi, S.[Shunghyun], Gu, Y.H.[Yeong Hyeon],
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining,
CVPR24(13948-13957)
IEEE DOI 2410
Training, Visualization, Image segmentation, Image resolution, Refining, Contrastive learning BibRef

Pramanick, S.[Shraman], Han, G.X.[Guang-Xing], Hou, R.[Rui], Nag, S.[Sayan], Lim, S.N.[Ser-Nam], Ballas, N.[Nicolas], Wang, Q.F.[Qi-Fan], Chellappa, R.[Rama], Almahairi, A.[Amjad],
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model,
CVPR24(14076-14088)
IEEE DOI Code:
WWW Link. 2410
Image segmentation, Visualization, Image coding, Filters, Grounding, Machine vision, Visual systems BibRef

Zeng, Y.[Yunan], Huang, Y.[Yan], Zhang, J.J.[Jin-Jin], Jie, Z.Q.[Ze-Qun], Chai, Z.H.[Zhen-Hua], Wang, L.[Liang],
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding,
CVPR24(14141-14151)
IEEE DOI 2410
Visualization, Codes, Grounding, Annotations, Pipelines, Benchmark testing BibRef

Karmanov, A.[Adilbek], Guan, D.[Dayan], Lu, S.J.[Shi-Jian], El Saddik, A.[Abdulmotaleb], Xing, E.[Eric],
Efficient Test-Time Adaptation of Vision-Language Models,
CVPR24(14162-14171)
IEEE DOI Code:
WWW Link. 2410
Adaptation models, Codes, Computational modeling, Noise, Predictive models, Benchmark testing BibRef

Bulat, A.[Adrian], Ouali, Y.[Yassine], Tzimiropoulos, G.[Georgios],
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models,
CVPR24(14172-14182)
IEEE DOI 2410
Training, Image recognition, Noise, Image retrieval, Field-flow fractionation BibRef

Sameni, S.[Sepehr], Kafle, K.[Kushal], Tan, H.[Hao], Jenni, S.[Simon],
Building Vision-Language Models on Solid Foundations with Masked Distillation,
CVPR24(14216-14226)
IEEE DOI 2410
Training, Solid modeling, Visualization, Computational modeling, Semantic segmentation, Buildings, LLM BibRef

Li, R.J.[Rong-Jie], Wu, Y.[Yu], He, X.M.[Xu-Ming],
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning,
CVPR24(13428-13437)
IEEE DOI 2410
Training, Visualization, Costs, Computational modeling, Cognition, Question answering (information retrieval), Vision-Language BibRef

Peng, W.[Wujian], Xie, S.C.[Si-Cheng], You, Z.[Zuyao], Lan, S.Y.[Shi-Yi], Wu, Z.[Zuxuan],
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding,
CVPR24(13279-13288)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Computational modeling, Pipelines, Benchmark testing, Linguistics, Vision language model, Fine-grained understanding BibRef

Zhao, Y.[Yue], Zhao, L.[Long], Zhou, X.Y.[Xing-Yi], Wu, J.L.[Jia-Lin], Chu, C.T.[Chun-Te], Miao, H.[Hui], Schroff, F.[Florian], Adam, H.[Hartwig], Liu, T.[Ting], Gong, B.Q.[Bo-Qing], Krähenbühl, P.[Philipp], Yuan, L.Z.[Liang-Zhe],
Distilling Vision-Language Models on Millions of Videos,
CVPR24(13106-13116)
IEEE DOI 2410
Adaptation models, Computational modeling, Benchmark testing, Data models, Text to video BibRef

Chen, J.[Jieneng], Yu, Q.H.[Qi-Hang], Shen, X.H.[Xiao-Hui], Yuille, A.[Alan], Chen, L.C.[Liang-Chieh],
ViTamin: Designing Scalable Vision Models in the Vision-Language Era,
CVPR24(12954-12966)
IEEE DOI 2410
Training, Image segmentation, Accuracy, Protocols, Image coding, Scalability, Computational modeling, Vision-Language Models, Architectural Design BibRef

Liu, S.H.[Shi-Hong], Yu, S.[Samuel], Lin, Z.Q.[Zhi-Qiu], Pathak, D.[Deepak], Ramanan, D.[Deva],
Language Models as Black-Box Optimizers for Vision-Language Models,
CVPR24(12687-12697)
IEEE DOI 2410
Computational modeling, Natural languages, Closed box, Text to image, Human in the loop, Data models, generative models BibRef

Howard, P.[Phillip], Madasu, A.[Avinash], Le, T.[Tiep], Moreno, G.L.[Gustavo Lujan], Bhiwandiwalla, A.[Anahita], Lal, V.[Vasudev],
SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples,
CVPR24(11975-11985)
IEEE DOI 2410
Training, Prevention and mitigation, Text to image, Diffusion models, Fairness, social bias, counterfactuals BibRef

Jiang, Y.[Yankai], Huang, Z.Z.[Zhong-Zhen], Zhang, R.Z.[Rong-Zhao], Zhang, X.F.[Xiao-Fan], Zhang, S.T.[Shao-Ting],
ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting,
CVPR24(11386-11397)
IEEE DOI 2410
Training, Visualization, Pathology, Image segmentation, Image analysis, Computational modeling, Vision-Language Model BibRef

Kim, Y.[Younghyun], Mo, S.[Sangwoo], Kim, M.[Minkyu], Lee, K.[Kyungmin], Lee, J.[Jaeho], Shin, J.[Jinwoo],
Discovering and Mitigating Visual Biases Through Keyword Explanation,
CVPR24(11082-11092)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Image recognition, Computational modeling, Training data, Flowering plants, bias and fairness, explainable AI, vision-language model BibRef

Li, R.[Rui], Fischer, T.[Tobias], Segu, M.[Mattia], Pollefeys, M.[Marc], Van Gool, L.J.[Luc J.], Tombari, F.[Federico],
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning,
CVPR24(9848-9858)
IEEE DOI Code:
WWW Link. 2410
Geometry, Visualization, Attention mechanisms, Shape, Semantics, radiance field, vision-language model, spatial context, spatial attention BibRef

Zeng, Z.[Ziyao], Wang, D.[Daniel], Yang, F.Y.[Feng-Yu], Park, H.[Hyoungseob], Soatto, S.[Stefano], Lao, D.[Dong], Wong, A.[Alex],
WorDepth: Variational Language Prior for Monocular Depth Estimation,
CVPR24(9708-9719)
IEEE DOI Code:
WWW Link. 2410
Measurement, Codes, Estimation, Encoding, Monocular Depth Estimation, Vision-Language Model, Variational Model BibRef

Hu, Y.S.[Yu-Shi], Stretcu, O.[Otilia], Lu, C.T.[Chun-Ta], Viswanathan, K.[Krishnamurthy], Hata, K.[Kenji], Luo, E.[Enming], Krishna, R.[Ranjay], Fuxman, A.[Ariel],
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models,
CVPR24(9590-9601)
IEEE DOI 2410
Visualization, Adaptation models, Computational modeling, Instruments, Loading, Music, Cognition, vision-language model, tools BibRef

Khan, Z.[Zaid], Fu, Y.[Yun],
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering,
CVPR24(10854-10863)
IEEE DOI 2410
Visualization, Uncertainty, Computational modeling, Closed box, Predictive models, Question answering (information retrieval), trustworthy ml BibRef

Gu, T.C.[Tian-Cheng], Yang, K.C.[Kai-Cheng], Liu, D.[Dongnan], Cai, W.D.[Wei-Dong],
LaPA: Latent Prompt Assist Model for Medical Visual Question Answering,
DEF-AI-MIA24(4971-4980)
IEEE DOI Code:
WWW Link. 2410
Visualization, Accuracy, Medical services, Predictive models, Feature extraction, Question answering (information retrieval), Data mining BibRef

Silva-Rodríguez, J.[Julio], Hajimiri, S.[Sina], Ben Ayed, I.[Ismail], Dolz, J.[Jose],
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models,
CVPR24(23681-23690)
IEEE DOI Code:
WWW Link. 2410
Adaptation models, Codes, Computational modeling, Transfer learning, Probes BibRef

Zanella, M.[Maxime], Ben Ayed, I.[Ismail],
Low-Rank Few-Shot Adaptation of Vision-Language Models,
Prompting24(1593-1603)
IEEE DOI 2410
Training, Adaptation models, Design methodology, Few shot learning, Vision-Language, few-shot, adapter BibRef

Wang, W.X.[Wen-Xuan], He, X.J.[Xing-Jian], Zhang, Y.[Yisi], Guo, L.T.[Long-Teng], Shen, J.C.[Jia-Chen], Li, J.Y.[Jiang-Yun], Liu, J.[Jing],
CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation,
MultMed(26), 2024, pp. 6906-6916.
IEEE DOI 2405
Image segmentation, Visualization, Task analysis, Correlation, Feature extraction, Transformers, Semantics, vision and language BibRef

Sahin, U.[Ugur], Li, H.[Hang], Khan, Q.[Qadeer], Cremers, D.[Daniel], Tresp, V.[Volker],
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining,
WACV24(5551-5561)
IEEE DOI Code:
HTML Version. 2404
Training, Visualization, Codes, Pipelines, Self-supervised learning, Cognition, Algorithms, Vision + language and/or other modalities BibRef

Yang, C.[Cheng], Xu, R.[Rui], Guo, Y.[Ye], Huang, P.X.[Pei-Xiang], Chen, Y.[Yiru], Ding, W.[Wenkui], Wang, Z.Y.[Zhong-Yuan], Zhou, H.[Hong],
Improving Vision-and-Language Reasoning via Spatial Relations Modeling,
WACV24(758-767)
IEEE DOI 2404
Visualization, Analytical models, Graphical models, Statistical analysis, Computational modeling, Excavation, Vision + language and/or other modalities BibRef

Shen, S.[Sheng], Yang, S.[Shijia], Zhang, T.J.[Tian-Jun], Zhai, B.[Bohan], Gonzalez, J.E.[Joseph E.], Keutzer, K.[Kurt], Darrell, T.J.[Trevor J.],
Multitask Vision-Language Prompt Tuning,
WACV24(5644-5655)
IEEE DOI 2404
Learning systems, Visualization, Adaptation models, Benchmark testing, Vectors, Task analysis, Algorithms, Vision + language and/or other modalities BibRef

Zhang, G.[Gengyuan], Zhang, Y.R.[Yu-Rui], Zhang, K.[Kerui], Tresp, V.[Volker],
Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning,
WACV24(625-634)
IEEE DOI Code:
WWW Link. 2404
Visualization, Computational modeling, Feature extraction, Cognition, Task analysis, Commonsense reasoning, Algorithms, Vision + language and/or other modalities BibRef

Feinglass, J.[Joshua], Yang, Y.Z.[Ye-Zhou],
Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding,
WACV24(4385-4395)
IEEE DOI 2404
Measurement, Visualization, Protocols, Annotations, Grounding, Semantics, Question answering (information retrieval), Image recognition and understanding BibRef

Nadeem, A.[Asmar], Hilton, A.[Adrian], Dawes, R.[Robert], Thomas, G.[Graham], Mustafa, A.[Armin],
CAD: Contextual Multi-modal Alignment for Dynamic AVQA,
WACV24(7236-7248)
IEEE DOI 2404
Visualization, Semantics, Decision making, Robustness, Question answering (information retrieval), Complexity theory, Smartphones / end user devices BibRef

Wu, W.[Wenyi], Li, Q.[Qi], Zhong, W.L.[Wen-Liang], Huang, J.Z.[Jun-Zhou],
MIVC: Multiple Instance Visual Component for Visual-Language Models,
WACV24(8102-8111)
IEEE DOI 2404
Visualization, Computational modeling, Neural networks, Question answering (information retrieval), Image recognition and understanding BibRef

Ganz, R.[Roy], Nuriel, O.[Oren], Aberdam, A.[Aviad], Kittenplon, Y.[Yair], Mazor, S.[Shai], Litman, R.[Ron],
Towards Models that Can See and Read,
ICCV23(21661-21671)
IEEE DOI 2401
BibRef

Zhang, H.[Heng], Liu, D.[Daqing], Lv, Z.[Zezhong], Su, B.[Bing], Tao, D.C.[Da-Cheng],
Exploring Temporal Concurrency for Video-Language Representation Learning,
ICCV23(15522-15532)
IEEE DOI Code:
WWW Link. 2401
BibRef

Shukor, M.[Mustafa], Dancette, C.[Corentin], Cord, M.[Matthieu],
eP-ALM: Efficient Perceptual Augmentation of Language Models,
ICCV23(21999-22012)
IEEE DOI Code:
WWW Link. 2401
BibRef

Schulter, S.[Samuel], Kumar, B.G.V.[B.G. Vijay], Suh, Y.M.[Yu-Min], Dafnis, K.M.[Konstantinos M.], Zhang, Z.X.[Zhi-Xing], Zhao, S.Y.[Shi-Yu], Metaxas, D.N.[Dimitris N.],
OmniLabel: A Challenging Benchmark for Language-Based Object Detection,
ICCV23(11919-11928)
IEEE DOI Code:
WWW Link. 2401
BibRef

Chen, Z.L.[Zi-Liang], Huang, X.[Xin], Guan, Q.L.[Quan-Long], Lin, L.[Liang], Luo, W.Q.[Wei-Qi],
A Retrospect to Multi-prompt Learning across Vision and Language,
ICCV23(22133-22144)
IEEE DOI 2401
BibRef

Derakhshani, M.M.[Mohammad Mahdi], Sanchez, E.[Enrique], Bulat, A.[Adrian], da Costa, V.G.T.[Victor Guilherme Turrisi], Snoek, C.G.M.[Cees G. M.], Tzimiropoulos, G.[Georgios], Martinez, B.[Brais],
Bayesian Prompt Learning for Image-Language Model Generalization,
ICCV23(15191-15200)
IEEE DOI Code:
WWW Link. 2401
BibRef

Cascante-Bonilla, P.[Paola], Shehada, K.[Khaled], Smith, J.S.[James Seale], Doveh, S.[Sivan], Kim, D.H.[Dong-Hyun], Panda, R.[Rameswar], Varol, G.[Gül], Oliva, A.[Aude], Ordonez, V.[Vicente], Feris, R.S.[Rogerio S.], Karlinsky, L.[Leonid],
Going Beyond Nouns With Vision & Language Models Using Synthetic Data,
ICCV23(20098-20108)
IEEE DOI 2401
BibRef

Upadhyay, U.[Uddeshya], Karthik, S.[Shyamgopal], Mancini, M.[Massimiliano], Akata, Z.[Zeynep],
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models,
ICCV23(1899-1910)
IEEE DOI Code:
WWW Link. 2401
BibRef

Chen, Z.H.[Zhi-Hong], Diao, S.Z.[Shi-Zhe], Wang, B.[Benyou], Li, G.B.[Guan-Bin], Wan, X.[Xiang],
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts,
ICCV23(23346-23356)
IEEE DOI 2401
BibRef

Bitton-Guetta, N.[Nitzan], Bitton, Y.[Yonatan], Hessel, J.[Jack], Schmidt, L.[Ludwig], Elovici, Y.[Yuval], Stanovsky, G.[Gabriel], Schwartz, R.[Roy],
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images,
ICCV23(2616-2627)
IEEE DOI 2401
BibRef

Hu, Z.Y.[Zi-Yuan], Li, Y.[Yanyang], Lyu, M.R.[Michael R.], Wang, L.W.[Li-Wei],
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control,
ICCV23(2998-3008)
IEEE DOI Code:
WWW Link. 2401
BibRef

Slyman, E.[Eric], Kahng, M.[Minsuk], Lee, S.[Stefan],
VLSlice: Interactive Vision-and-Language Slice Discovery,
ICCV23(15245-15255)
IEEE DOI 2401
BibRef

Najibi, M.[Mahyar], Ji, J.W.[Jing-Wei], Zhou, Y.[Yin], Qi, C.R.[Charles R.], Yan, X.C.[Xin-Chen], Ettinger, S.[Scott], Anguelov, D.[Dragomir],
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving,
ICCV23(8568-8578)
IEEE DOI 2401
BibRef

Zheng, K.[Kecheng], Wu, W.[Wei], Feng, R.[Ruili], Zhu, K.[Kai], Liu, J.W.[Jia-Wei], Zhao, D.L.[De-Li], Zha, Z.J.[Zheng-Jun], Chen, W.[Wei], Shen, Y.J.[Yu-Jun],
Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models,
ICCV23(11629-11639)
IEEE DOI 2401
BibRef

Wang, T.[Tan], Lin, K.[Kevin], Li, L.J.[Lin-Jie], Lin, C.C.[Chung-Ching], Yang, Z.Y.[Zheng-Yuan], Zhang, H.W.[Han-Wang], Liu, Z.C.[Zi-Cheng], Wang, L.J.[Li-Juan],
Equivariant Similarity for Vision-Language Foundation Models,
ICCV23(11964-11974)
IEEE DOI 2401
BibRef

Xu, H.[Hu], Xie, S.[Saining], Huang, P.Y.[Po-Yao], Yu, L.C.[Li-Cheng], Howes, R.[Russell], Ghosh, G.[Gargi], Zettlemoyer, L.[Luke], Feichtenhofer, C.[Christoph],
CiT: Curation in Training for Effective Vision-Language Data,
ICCV23(15134-15143)
IEEE DOI 2401
BibRef

Trager, M.[Matthew], Perera, P.[Pramuditha], Zancato, L.[Luca], Achille, A.[Alessandro], Bhatia, P.[Parminder], Soatto, S.[Stefano],
Linear Spaces of Meanings: Compositional Structures in Vision-Language Models,
ICCV23(15349-15358)
IEEE DOI 2401
BibRef

Chen, Y.S.[Yi-Syuan], Song, Y.Z.[Yun-Zhu], Yeo, C.Y.[Cheng Yu], Liu, B.[Bei], Fu, J.L.[Jian-Long], Shuai, H.H.[Hong-Han],
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks,
ICCV23(15384-15396)
IEEE DOI 2401
BibRef

Wu, C.E.[Cheng-En], Tian, Y.[Yu], Yu, H.C.[Hai-Chao], Wang, H.[Heng], Morgado, P.[Pedro], Hu, Y.H.[Yu Hen], Yang, L.J.[Lin-Jie],
Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?,
ICCV23(15442-15451)
IEEE DOI Code:
WWW Link. 2401
BibRef

Ouali, Y.[Yassine], Bulat, A.[Adrian], Martinez, B.[Brais], Tzimiropoulos, G.[Georgios],
Black Box Few-Shot Adaptation for Vision-Language models,
ICCV23(15488-15500)
IEEE DOI Code:
WWW Link. 2401
BibRef

Kan, B.[Baoshuo], Wang, T.[Teng], Lu, W.P.[Wen-Peng], Zhen, X.T.[Xian-Tong], Guan, W.[Weili], Zheng, F.[Feng],
Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models,
ICCV23(15624-15634)
IEEE DOI 2401
BibRef

Zhai, J.T.[Jiang-Tian], Zhang, Q.[Qi], Wu, T.[Tong], Chen, X.Y.[Xing-Yu], Liu, J.J.[Jiang-Jiang], Cheng, M.M.[Ming-Ming],
SLAN: Self-Locator Aided Network for Vision-Language Understanding,
ICCV23(21892-21901)
IEEE DOI Code:
WWW Link. 2401
BibRef

Long, S.[Sifan], Zhao, Z.[Zhen], Yuan, J.[Junkun], Tan, Z.C.[Zi-Chang], Liu, J.J.[Jiang-Jiang], Zhou, L.P.[Lu-Ping], Wang, S.S.[Sheng-Sheng], Wang, J.D.[Jing-Dong],
Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models,
ICCV23(21902-21912)
IEEE DOI 2401
BibRef

Cho, E.[Eulrang], Kim, J.[Jooyeon], Kim, H.W.J.[Hyun-Woo J.],
Distribution-Aware Prompt Tuning for Vision-Language Models,
ICCV23(21947-21956)
IEEE DOI Code:
WWW Link. 2401
BibRef

Varma, M.[Maya], Delbrouck, J.B.[Jean-Benoit], Hooper, S.[Sarah], Chaudhari, A.[Akshay], Langlotz, C.[Curtis],
ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data,
ICCV23(22168-22178)
IEEE DOI 2401
BibRef

Zhu, H.G.[Hong-Guang], Wei, Y.C.[Yun-Chao], Liang, X.D.[Xiao-Dan], Zhang, C.J.[Chun-Jie], Zhao, Y.[Yao],
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation,
ICCV23(22200-22210)
IEEE DOI Code:
WWW Link. 2401
BibRef

Salin, E.[Emmanuelle], Ayache, S.[Stéphane], Favre, B.[Benoit],
Towards an Exhaustive Evaluation of Vision-Language Foundation Models,
MMFM23(339-352)
IEEE DOI 2401
BibRef

Hu, Z.[Zhizhang], Zhu, X.L.[Xin-Liang], Tran, S.[Son], Vidal, R.[René], Dhua, A.[Arnab],
ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion,
CLVL23(2764-2769)
IEEE DOI 2401
BibRef

Hall, M.[Melissa], Gustafson, L.[Laura], Adcock, A.[Aaron], Misra, I.[Ishan], Ross, C.[Candace],
Vision-Language Models Performing Zero-Shot Tasks Exhibit Disparities Between Gender Groups,
CLVL23(2770-2777)
IEEE DOI 2401
BibRef

Agnolucci, L.[Lorenzo], Baldrati, A.[Alberto], Todino, F.[Francesco], Becattini, F.[Federico], Bertini, M.[Marco], del Bimbo, A.[Alberto],
ECO: Ensembling Context Optimization for Vision-Language Models,
CLVL23(2803-2807)
IEEE DOI 2401
BibRef

Palit, V.[Vedant], Pandey, R.[Rohan], Arora, A.[Aryaman], Liang, P.P.[Paul Pu],
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP,
CLVL23(2848-2853)
IEEE DOI 2401
BibRef

Sammani, F.[Fawaz], Deligiannis, N.[Nikos],
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks,
VLAR23(4636-4641)
IEEE DOI 2401
BibRef

Lu, D.[Dong], Wang, Z.Q.[Zhi-Qiang], Wang, T.[Teng], Guan, W.[Weili], Gao, H.[Hongchang], Zheng, F.[Feng],
Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models,
ICCV23(102-111)
IEEE DOI Code:
WWW Link. 2401
BibRef

Lee, D.J.[Dong-Jun], Song, S.[Seokwon], Suh, J.[Jihee], Choi, J.[Joonmyeong], Lee, S.[Sanghyeok], Kim, H.W.J.[Hyun-Woo J.],
Read-only Prompt Optimization for Vision-Language Few-shot Learning,
ICCV23(1401-1411)
IEEE DOI Code:
WWW Link. 2401
BibRef

Li, X.[Xuanlin], Fang, Y.H.[Yun-Hao], Liu, M.H.[Ming-Hua], Ling, Z.[Zhan], Tu, Z.W.[Zhuo-Wen], Su, H.[Hao],
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability,
ICCV23(2492-2503)
IEEE DOI 2401
BibRef

Li, J.C.[Jun-Cheng], Gao, M.[Minghe], Wei, L.[Longhui], Tang, S.L.[Si-Liang], Zhang, W.Q.[Wen-Qiao], Li, M.[Mengze], Ji, W.[Wei], Tian, Q.[Qi], Chua, T.S.[Tat-Seng], Zhuang, Y.T.[Yue-Ting],
Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models,
ICCV23(2551-2562)
IEEE DOI 2401
BibRef

Bi, J.Y.[Jun-Yu], Cheng, D.[Daixuan], Yao, P.[Ping], Pang, B.[Bochen], Zhan, Y.F.[Yue-Feng], Yang, C.G.[Chuan-Guang], Wang, Y.J.[Yu-Jing], Sun, H.[Hao], Deng, W.W.[Wei-Wei], Zhang, Q.[Qi],
VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching,
ICCV23(2584-2593)
IEEE DOI 2401
BibRef

Udandarao, V.[Vishaal], Gupta, A.[Ankush], Albanie, S.[Samuel],
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models,
ICCV23(2725-2736)
IEEE DOI Code:
WWW Link. 2401
BibRef

Jiang, C.[Chaoya], Xu, H.Y.[Hai-Yang], Ye, W.[Wei], Ye, Q.H.[Qing-Hao], Li, C.L.[Chen-Liang], Yan, M.[Ming], Bi, B.[Bin], Zhang, S.K.[Shi-Kun], Huang, F.[Fei], Huang, S.F.[Song-Fang],
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization,
ICCV23(2888-2898)
IEEE DOI 2401
BibRef

Shi, C.[Cheng], Yang, S.[Sibei],
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models,
ICCV23(2920-2929)
IEEE DOI 2401
BibRef

Wang, A.J.P.[Alex Jin-Peng], Lin, K.Q.[Kevin Qinghong], Zhang, D.J.H.[David Jun-Hao], Lei, S.W.X.[Stan Wei-Xian], Shou, M.Z.[Mike Zheng],
Too Large; Data Reduction for Vision-Language Pre-Training,
ICCV23(3124-3134)
IEEE DOI 2401
BibRef

Wang, W.H.[Wei-Han], Yang, Z.[Zhen], Xu, B.[Bin], Li, J.[Juanzi], Sun, Y.[Yankui],
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation,
ICCV23(3135-3146)
IEEE DOI 2401
BibRef

Wang, T.J.J.[Tzu-Jui Julius], Laaksonen, J.[Jorma], Langer, T.[Tomas], Arponen, H.[Heikki], Bishop, T.E.[Tom E.],
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision,
WACV23(1073-1083)
IEEE DOI 2302
Visualization, Vocabulary, Computational modeling, Detectors, Benchmark testing, Transformers, unsupervised learning BibRef

Boecking, B.[Benedikt], Usuyama, N.[Naoto], Bannur, S.[Shruthi], Castro, D.C.[Daniel C.], Schwaighofer, A.[Anton], Hyland, S.[Stephanie], Wetscherek, M.[Maria], Naumann, T.[Tristan], Nori, A.[Aditya], Alvarez-Valle, J.[Javier], Poon, H.[Hoifung], Oktay, O.[Ozan],
Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing,
ECCV22(XXXVI:1-21).
Springer DOI 2211
BibRef

Cui, Q.[Quan], Zhou, B.[Boyan], Guo, Y.[Yu], Yin, W.D.[Wei-Dong], Wu, H.[Hao], Yoshie, O.[Osamu], Chen, Y.[Yubo],
Contrastive Vision-Language Pre-training with Limited Resources,
ECCV22(XXXVI:236-253).
Springer DOI 2211
BibRef

Walmer, M.[Matthew], Sikka, K.[Karan], Sur, I.[Indranil], Shrivastava, A.[Abhinav], Jha, S.[Susmit],
Dual-Key Multimodal Backdoors for Visual Question Answering,
CVPR22(15354-15364)
IEEE DOI 2210
Visualization, Training data, Detectors, Feature extraction, Question answering (information retrieval), Vision + language BibRef

Ding, Y.[Yang], Yu, J.[Jing], Liu, B.[Bang], Hu, Y.[Yue], Cui, M.X.[Ming-Xin], Wu, Q.[Qi],
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering,
CVPR22(5079-5088)
IEEE DOI 2210
Bridges, Visualization, Codes, Computational modeling, Knowledge based systems, Semantics, Vision + language BibRef

Gao, F.[Feng], Ping, Q.[Qing], Thattai, G.[Govind], Reganti, A.[Aishwarya], Wu, Y.N.[Ying Nian], Natarajan, P.[Prem],
Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering,
CVPR22(5057-5067)
IEEE DOI 2210
Knowledge engineering, Visualization, Solid modeling, Knowledge based systems, Natural languages, Transforms, Visual reasoning BibRef

Aflalo, E.[Estelle], Du, M.[Meng], Tseng, S.Y.[Shao-Yen], Liu, Y.F.[Yong-Fei], Wu, C.[Chenfei], Duan, N.[Nan], Lal, V.[Vasudev],
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers,
CVPR22(21374-21383)
IEEE DOI 2210
Heating systems, Visualization, Machine vision, Computational modeling, Transformers, Question answering (information retrieval) BibRef

Hu, X.W.[Xiao-Wei], Gan, Z.[Zhe], Wang, J.F.[Jian-Feng], Yang, Z.Y.[Zheng-Yuan], Liu, Z.C.[Zi-Cheng], Lu, Y.[Yumao], Wang, L.J.[Li-Juan],
Scaling Up Vision-Language Pretraining for Image Captioning,
CVPR22(17959-17968)
IEEE DOI 2210
Training, Visualization, Computational modeling, Training data, Benchmark testing, Transformers, Feature extraction, Vision + language BibRef

Zhang, P.C.[Peng-Chuan], Li, X.J.[Xiu-Jun], Hu, X.W.[Xiao-Wei], Yang, J.W.[Jian-Wei], Zhang, L.[Lei], Wang, L.J.[Li-Juan], Choi, Y.J.[Ye-Jin], Gao, J.F.[Jian-Feng],
VinVL: Revisiting Visual Representations in Vision-Language Models,
CVPR21(5575-5584)
IEEE DOI 2111
Training, Visualization, Computational modeling, Object detection, Benchmark testing, Feature extraction, Transformers BibRef

Li, Z.W.[Zhuo-Wan], Stengel-Eskin, E.[Elias], Zhang, Y.X.[Yi-Xiao], Xie, C.[Cihang], Tran, Q.[Quan], van Durme, B.[Benjamin], Yuille, A.L.[Alan L.],
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images,
ICCV21(14890-14899)
IEEE DOI 2203
Visualization, Analytical models, Codes, Computational modeling, Cognition, Data models, Vision + language BibRef

Yang, X.[Xu], Zhang, H.W.[Han-Wang], Qi, G.J.[Guo-Jun], Cai, J.F.[Jian-Fei],
Causal Attention for Vision-Language Tasks,
CVPR21(9842-9852)
IEEE DOI 2111
Correlation, Codes, Computational modeling, Training data, Transformers, Data models BibRef

Stefanini, M.[Matteo], Cornia, M.[Marcella], Baraldi, L.[Lorenzo], Cucchiara, R.[Rita],
A Novel Attention-based Aggregation Function to Combine Vision and Language,
ICPR21(1212-1219)
IEEE DOI 2105
Deep learning, Visualization, Image retrieval, Transforms, Knowledge discovery BibRef

Jain, V., Lodhavia, J.,
Automatic Question Tagging using k-Nearest Neighbors and Random Forest,
ISCV20(1-4)
IEEE DOI 2011
learning (artificial intelligence), question answering (information retrieval), Natural Language Processing BibRef

Zheng, W.B.[Wen-Bo], Yan, L.[Lan], Gou, C.[Chao], Wang, F.Y.[Fei-Yue],
Webly Supervised Knowledge Embedding Model for Visual Reasoning,
CVPR20(12442-12451)
IEEE DOI 2008
Visual reasoning between visual image and natural language description. Visualization, Cognition, Knowledge based systems, Task analysis, Knowledge engineering, Modulation, Robustness BibRef

Nguyen, D.K.[Duy-Kien], Okatani, T.[Takayuki],
Multi-Task Learning of Hierarchical Vision-Language Representation,
CVPR19(10484-10493)
IEEE DOI 2002
BibRef

Gupta, T.[Tanmay], Shih, K.J.[Kevin J.], Singh, S.[Saurabh], Hoiem, D.[Derek],
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks,
ICCV17(4223-4232)
IEEE DOI 1802
data visualisation, image recognition, learning (artificial intelligence), Visualization BibRef

Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Video Question Answering, Movies, Spatio-Temporal, Query, VQA.


Last update: Jan 15, 2025 at 14:36:47