WhatNext24
* *What is Next in Multimodal Foundation Models?
* Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity
* Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters
* Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
* ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models
* LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning
* Matting Anything
* Probing Conceptual Understanding of Large Visual-Language Models
* Recognize Anything: A Strong Image Tagging Model
* Robustness Analysis on Foundational Segmentation Models
* Show, Think, and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning
* Strategies to Leverage Foundational Model Knowledge in Object Affordance Grounding
* Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation
* Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
14 papers for WhatNext24
WhatNext25
* *What is Next in Multimodal Foundation Models?
* How Good is my Video-LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
* Interactive Agent Foundation Model, An
* PLVM: A Tuning-Free Approach for Personalized Large Vision-Language Model
* Repurposing SAM for User-Defined Semantics Aware Segmentation
* Understanding Depth and Height Perception in Large Visual-Language Models
* Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
* UniToken: Harmonizing Multimodal Understanding and Generation Through Unified Visual Encoding
8 papers for WhatNext25