_ | question | _ |
3DVQA: Visual | question | Answering for 3D Environments |
5 | question | s for Andrew Leland: The History of Assistive Technology for the Blind |
5 | question | s for Missy Cummings: The Former Fighter Pilot on why Autonomous Vehicles are so Risky |
A-OKVQA: A Benchmark for Visual | question | Answering Using World Knowledge |
A2A: Attention to Attention Reasoning for Movie | question | Answering |
Accuracy vs. complexity: A trade-off in visual | question | answering models |
Action-Centric Relation Transformer Network for Video | question | Answering |
Adapting Grounded Visual | question | Answering Models to Low Resource Languages |
ADCCF: Adaptive deep concatenation coder framework for visual | question | answering |
Adversarial Multimodal Network for Movie Story | question | Answering |
ALSA: Adversarial Learning of Supervised Attentions for Visual | question | Answering |
Analysis of Visual | question | Answering Algorithms, An |
Analyzing Geographic | question | s Using Embedding-based Topic Modeling |
Anomaly Matters: An Anomaly-Oriented Model for Medical Visual | question | Answering |
Answer Distillation for Visual | question | Answering |
Answer Selection in Community | question | Answering via Attentive Neural Networks |
Answer Them All! Toward Universal Visual | question | Answering Models |
Answer-checking in Context: A Multi-modal Fully Attention Network for Visual | question | Answering |
Answer-Type Prediction for Visual | question | Answering |
Answering knowledge-based visual | question | s via the exploration of Question Purpose |
Answering | question | s about Data Visualizations using Efficient Bimodal Fusion |
Answering Visual What-If | question | s: From Actions to Predicted Scene Descriptions |
Are we Asking the Right | question | s in MovieQA? |
Are You Smarter Than a Sixth Grader? Textbook | question | Answering for Multimodal Machine Comprehension |
Ask Me Anything: Free-Form Visual | question | Answering Based on Knowledge from External Sources |
Ask Your Neurons: A Deep Learning Approach to Visual | question | Answering |
Ask Your Neurons: A Neural-Based Approach to Answering | question | s about Images |
Ask, Attend and Answer: Exploring | question | -Guided Spatial Attention for Visual Question Answering |
AssistQ: Affordance-Centric | question | -Driven Task Completion for Egocentric Assistant |
Auto QA: The | question | Is Not Only What, but Also Where |
Auto-Parsing Network for Image Captioning and Visual | question | Answering |
Automatic | question | Tagging using k-Nearest Neighbors and Random Forest |
Barlow constrained optimization for Visual | question | Answering |
BERT Representations for Video | question | Answering |
Better Way to Attend: Attention With Trees for Video | question | Answering, A |
Beyond | question | -Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering |
Beyond VQA: Generating Multi-word Answers and Rationales to Visual | question | s |
Bilaterally Slimmable Transformer for Elastic and Efficient Visual | question | Answering |
Biometric surveillance using visual | question | answering |
Bottom-Up and Top-Down Attention for Image Captioning and Visual | question | Answering |
Bridge to Answer: Structure-aware Graph Interaction Network for Video | question | Answering |
Bridging Video-Text Retrieval with Multiple Choice | question | s |
CAAN: Context-Aware attention network for visual | question | answering |
Cascade transformers with dynamic attention for video | question | answering |
cascaded long short-term memory (LSTM) driven generic visual | question | answering (VQA), A |
CAT: Re-Conv Attention in Transformer for Visual | question | Answering |
CLIP-Guided Vision-Language Pre-training for | question | Answering in 3D Scenes |
Coarse to Fine Frame Selection for Online Open-ended Video | question | Answering |
Coarse-to-Fine Reasoning for Visual | question | Answering |
Coarse-to-Fine Visual | question | Answering by Iterative, Conditional Refinement |
Colour and Space in Cultural Heritage: Key | question | s in 3D Optical Documentation of Material Culture for Conservation, Study and Preservation |
Combining Multiple Cues for Visual Madlibs | question | Answering |
Compact Trilinear Interaction for Visual | question | Answering |
Competence-aware Curriculum for Visual Concepts Learning via | question | Answering, A |
Compositional Attention Networks With Two-Stream Fusion for Video | question | Answering |
Comprehensive-perception dynamic reasoning for visual | question | answering |
Context Relation Fusion Model for Visual | question | Answering |
Context-VQA: Towards Context-Aware and Purposeful Visual | question | Answering |
Contrastive Video | question | Answering via Video Graph Transformer |
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual | question | Answering |
Counterfactual Samples Synthesizing and Training for Robust Visual | question | Answering |
Counterfactual Samples Synthesizing for Robust Visual | question | Answering |
Counting-based visual | question | answering with serial cascaded attention deep learning |
Creativity: Generating Diverse | question | s Using Variational Autoencoders |
Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video | question | Answering |
Cross-Dataset Adaptation for Visual | question | Answering |
Cross-Modal Causal Relational Reasoning for Event-Level Visual | question | Answering |
Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual | question | Answering |
Cross-modal knowledge reasoning for knowledge-based visual | question | answering |
Cross-modal Relational Reasoning Network for Visual | question | Answering |
Cross-Modal Visual | question | Answering for Remote Sensing Data: the International Conference on Digital Image Computing: Techniques and Applications (DICTA 2021) |
CS-VQA: Visual | question | Answering with Compressively Sensed Images |
Customized Image Narrative Generation via Interactive Visual | question | Generation and Answering |
Cycle-Consistency for Robust Visual | question | Answering |
DAPC: Answering Why-Not | question | s on Top-k Direction-Aware ASK Queries in Polar Coordinates |
Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank | question | -Answering, A |
Dealing with Missing Modalities in the Visual | question | Answer-Difference Prediction Task through Knowledge Distillation |
Debiased Visual | question | Answering via the perspective of question types |
DecomVQANet: Decomposing visual | question | answering deep network via tensor decomposition and regression |
Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual | question | Answering |
Deep Attention Neural Tensor Network for Visual | question | Answering |
Deep Bayesian Network for Visual | question | Generation |
Deep Modular Co-Attention Networks for Visual | question | Answering |
Deep Residual Weight-Sharing Attention Network With Low-Rank Attention for Visual | question | Answering |
Depth and Video Segmentation Based Visual Attention for Embodied | question | Answering |
Depth-Aware and Semantic Guided Relational Attention Network for Visual | question | Answering |
Diagnostic Study of Visual | question | Answering With Analogical Reasoning, A |
Differential Attention for Visual | question | Answering |
DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram | question | Answering |
Discovering Spatio-Temporal Rationales for Video | question | Answering |
Discovering the Real Association: Multimodal Causal Reasoning in Video | question | Answering |
Divide and Conquer: Answering | question | s with Object Factorization and Compositional Reasoning |
Document Image Retrieval in a | question | Answering System for Document Images |
Does ChatGPT Spell the End of Automatic | question | Generation Research? |
Does the Research | question | Structure Impact the Attention Model? User Study Experiment |
Don't Just Assume; Look and Answer: Overcoming Priors for Visual | question | Answering |
DRAU: Dual Recurrent Attention Units for Visual | question | Answering |
DSGEM: Dual scene graph enhancement module-based visual | question | answering |
Dual Path Multi-Modal High-Order Features for Textual Content based Visual | question | Answering |
Dual self-attention with co-attention networks for visual | question | answering |
Dual-Attention Learning Network With Word and Sentence Embedding for Medical Visual | question | Answering, A |
Dual-decoder transformer network for answer grounding in visual | question | answering |
Dual-Key Multimodal Backdoors for Visual | question | Answering |
DualVGR: A Dual-Visual Graph Reasoning Unit for Video | question | Answering |
DVQA: Understanding Data Visualizations via | question | Answering |
Dynamic dual graph networks for textbook | question | answering |
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual | question | Answering |
Dynamic Self-Attention with Vision Synchronization Networks for Video | question | Answering |
DynGraph: Visual | question | Answering via Dynamic Scene Graphs |
Editorial paper for Pattern Recognition Letters VSI on Cross Model Understanding for Visual | question | Answering |
Editorial to special issue on cross-media learning for visual | question | answering |
Efficient Counterfactual Debiasing for Visual | question | Answering |
EgoVQA: An Egocentric Video | question | Answering Benchmark Dataset |
Embedding Spatial Relations in Visual | question | Answering for Remote Sensing |
Embodied | question | Answering |
Embodied | question | Answering in Photorealistic Environments With Point Cloud Perception |
Empirical Evaluation of Visual | question | Answering for Novel Objects, An |
Empirical study on using adapters for debiased Visual | question | Answering |
Encoder-decoder cycle for visual | question | answering based on perception-action cycle |
Encyclopedic VQA: Visual | question | s about detailed properties of fine-grained categories |
End-to-End Concept Word Detection for Video Captioning, Retrieval, and | question | Answering |
End-to-End Video | question | -Answer Generation With Generator-Pretester Network |
Env-QA: A Video | question | Answering Benchmark for Comprehensive Understanding of Dynamic Environments |
Episodic Memory | question | Answering |
ERM: Energy-Based Refined-Attention Mechanism for Video | question | Answering |
Estimation Of Visual Contents Based On | question | Answering From Human Brain Activity |
Evaluation of a Visual | question | Answering Architecture for Pedestrian Attribute Recognition |
Event Graph Guided Compositional Spatial-Temporal Reasoning for Video | question | Answering |
Existence | question | for Maximum-Likelihood Estimators in Time-of-Arrival-Based Localization, The |
Explicit Bias Discovery in Visual | question | Answering Models |
Explicit ensemble attention learning for improving visual | question | answering |
External Commonsense Knowledge as a Modality for Social Intelligence | question | -Answering |
Face-to-Face Contrastive Learning for Social Intelligence | question | -Answering |
Facial expression of emotion: New findings, new | question | s |
Fantastic Answers and Where to Find Them: Immersive | question | -Directed Visual Attention |
FashionVQA: A Domain-Specific Visual | question | Answering System |
Focal Visual-Text Attention for Memex | question | Answering |
Focal Visual-Text Attention for Visual | question | Answering |
Found a Reason for me? Weakly-supervised Grounded Visual | question | Answering using Capsules |
Four National Maps of Broad Forest Type Provide Inconsistent Answers to the | question | of What Burns in Canada |
Frame Augmented Alternating Attention Network for Video | question | Answering |
Frequently Asked/Answered | question | |
From Images to Textual Prompts: Zero-shot Visual | question | Answering with Frozen Large Language Models |
From known to the unknown: Transferring knowledge to answer | question | s about novel visual and semantic concepts |
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video | question | -Answering |
GDUMB: A Simple Approach that | question | s Our Progress in Continual Learning |
Generalized Hadamard-Product Fusion Operators for Visual | question | Answering |
Generalized pyramid co-attention with learnable aggregation net for video | question | answering |
Generative Bias for Robust Visual | question | Answering |
Geographic Knowledge Base | question | Answering over OpenStreetMap |
Goal-Oriented Visual | question | Generation via Intermediate Rewards |
Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual | question | Answering via Reinforcement Learning |
GQA: A New Dataset for Real-World Visual Reasoning and Compositional | question | Answering |
Grad-Cam Aware Supervised Attention for Visual | question | Answering for Post-Disaster Damage Assessment |
Graph-Based Multi-Interaction Network for Video | question | Answering |
Graph-Structured Representations for Visual | question | Answering |
Greedy Gradient Ensemble for Robust Visual | question | Answering |
Grounding Answers for Visual | question | s Asked by Visually Impaired People |
Guiding Visual | question | Answering with Attention Priors |
HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video | question | Answering |
Health-Oriented Multimodal Food | question | Answering |
Heterogeneous Community | question | Answering via Social-Aware Multi-Modal Co-Attention Convolutional Matching |
Heterogeneous Memory Enhanced Multimodal Attention Model for Video | question | Answering |
Hierarchical Conditional Relation Networks for Multimodal Video | question | Answering |
Hierarchical Conditional Relation Networks for Video | question | Answering |
Hierarchical Relational Attention for Video | question | Answering |
Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video | question | Answering |
Holistic Multi-Modal Memory Network for Movie | question | Answering |
Human Attention in Visual | question | Answering: Do Humans and Deep Networks Look at the Same Regions? |
Image Captioning and Visual | question | Answering Based on Attributes and External Knowledge |
Image Quality Evaluation in Professional HDR/WCG Production | question | s the Need for HDR Metrics |
Image | question | Answering Using Convolutional Neural Network with Dynamic Parameter Prediction |
Image segmentation in Twenty | question | s |
Image- | question | -Answer Synergistic Network for Visual Dialog |
Improved Attention for Visual | question | Answering, An |
Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual | question | Answering |
Improving Selective Visual | question | Answering by Learning from Your Peers |
Improving Visual | question | Answering using Active Perception on Static Images |
Improving visual | question | answering using dropout and enhanced question encoder |
In Defense of Grid Features for Visual | question | Answering |
Incorporating 3D Information Into Visual | question | Answering |
Information Maximizing Visual | question | Generation |
Interpretable Visual | question | Answering by Reasoning on Dependency Trees |
Interpretable Visual | question | Answering by Visual Grounding From Attention Supervision Mining |
Interpretable Visual | question | Answering Referring to Outside Knowledge |
Interpretable Visual | question | Answering Via Reasoning Supervision |
Invariant Grounding for Video | question | Answering |
Inverse Visual | question | Answering: A New Benchmark and VQA Diagnosis Tool |
IQ-VQA: Intelligent Visual | question | Answering |
IQA: Visual | question | Answering in Interactive Environments |
Is GPT-3 All You Need for Visual | question | Answering in Cultural Heritage? |
ISD-QA: Iterative Distillation of Commonsense Knowledge from General Language Models for Unsupervised | question | Answering |
It's Not About the Journey; It's About the Destination: Following Soft Paths Under | question | -Guidance for Visual Reasoning |
iVQA: Inverse Visual | question | Answering |
Joint Sequence Fusion Model for Video | question | Answering and Retrieval, A |
Just Ask: Learning to Answer | question | s from Millions of Narrated Videos |
K-VQG: Knowledge-aware Visual | question | Generation for Common-sense Acquisition |
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video | question | Answering |
Knowing Where to Look? Analysis on Attention of Visual | question | Answering System |
Knowledge Acquisition for Visual | question | Answering via Iterative Querying |
Knowledge base graph embedding module design for Visual | question | answering model |
Knowledge Proxy Intervention for Deconfounded Video | question | Answering |
Knowledge-Augmented Visual | question | Answering With Natural Language Explanation |
Knowledge-Based Embodied | question | Answering |
Knowledge-based Video | question | Answering with Unsupervised Scene Descriptions |
Knowledge-Based Visual | question | Generation |
Language and Visual Relations Encoding for Visual | question | Answering |
Language Models are Causal Knowledge Extractors for Zero-shot Video | question | Answering |
Latent Variable Models for Visual | question | Answering |
LEAF-QA: Locate, Encode Attend for Figure | question | Answering |
Learning Answer Embeddings for Visual | question | Answering |
Learning by Asking | question | s |
Learning Models for Actions and Person-Object Interactions with Transfer to | question | Answering |
Learning Situation Hyper-Graphs for Video | question | Answering |
Learning to Answer | question | s in Dynamic Audio-Visual Scenarios |
Learning to Ask Informative Sub- | question | s for Visual Question Answering |
Learning to Caption Images Through a Lifetime by Asking | question | s |
Learning to Disambiguate by Asking Discriminative | question | s |
Learning to Reason: End-to-End Module Networks for Visual | question | Answering |
Learning to Supervise Knowledge Retrieval Over a Tree Structure for Visual | question | Answering |
Learning Visual Knowledge Memory Networks for Visual | question | Answering |
Learning Visual | question | Answering by Bootstrapping Hard Attention |
Learning visual | question | answering on controlled semantic noisy labels |
Leveraging | question | Answering for Domain-Agnostic Information Extraction |
Leveraging Visual | question | Answering for Image-Caption Ranking |
Linguistically Routing Capsule Network for Out-of-distribution Visual | question | Answering |
LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video | question | Answering |
LLQA: Lifelog | question | Answering Dataset |
Local relation network with multilevel attention for visual | question | answering |
Locate Before Answering: Answer Guided | question | Localization for Video Question Answering |
Locating Visual Explanations for Video | question | Answering |
Logical Implications for Visual | question | Answering Consistency |
LOIS: Looking Out of Instance Semantics for Visual | question | Answering |
Long video | question | answering: A Matching-guided Attention Model |
Long-Form Video | question | Answering via Dynamic Hierarchical Reinforced Networks |
Long-Term Video | question | Answering via Multimodal Hierarchical Memory Attentive Networks |
Machine Learning Based Methods for Arabic Duplicate | question | Detection |
Maintaining Reasoning Consistency in Compositional Visual | question | Answering |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual | question | Answering |
MarioQA: Answering | question | s by Watching Gameplay Videos |
Markov Network Based Passage Retrieval Method for Multimodal | question | Answering in the Cultural Heritage Domain, A |
Measuring Compositional Consistency for Video | question | Answering |
Medical Visual | question | Answering via Conditional Reasoning and Contrastive Learning |
Mining deep And-Or object structures via cost-sensitive | question | -answer-based active annotations |
Mining Interpretable AOG Representations From Convolutional Networks via Active | question | Answering |
Mining Object Parts from CNNs via Active | question | -Answering |
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video | question | Answering |
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual | question | Answering |
MMTF: Multi-Modal Temporal Fusion for Commonsense Video | question | Answering |
MoCA: Incorporating domain pretraining and cross attention for textbook | question | answering |
Modality Shifting Attention Network for Multi-Modal Video | question | Answering |
MoQA: A Multi-modal | question | Answering Architecture |
Motion-Appearance Co-memory Networks for Video | question | Answering |
Movie | question | Answering via Textual Memory and Plot Graph |
MovieQA: Understanding Stories in Movies through | question | -Answering |
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual | question | Answering |
Multi-agent Embodied | question | Answering in Interactive Environments |
Multi-Granularity Interaction and Integration Network for Video | question | Answering |
Multi-level Attention Networks for Visual | question | Answering |
Multi-modal Contextual Graph Neural Network for Text Visual | question | Answering |
Multi-Modal Correlated Network with Emotional Reasoning Knowledge for Social Intelligence | question | -Answering |
Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual | question | Answering |
Multi-modal spatial relational attention networks for visual | question | answering |
Multi-Modality Latent Interaction Network for Visual | question | Answering |
Multi-scale relation reasoning for multi-modal Visual | question | Answering |
Multi-scale Relational Reasoning with Regional Attention for Visual | question | Answering |
Multi-Semantic Alignment Co-Reasoning Network for Video | question | Answering |
Multi-stage Attention based Visual | question | Answering |
Multi-Target Embodied | question | Answering |
Multi-Tier Attention Network using Term-weighted | question | Features for Visual Question Answering |
Multi-Turn Video | question | Answering via Hierarchical Attention Context Reinforced Networks |
Multi-Turn Video | question | Generation via Reinforced Multi-Choice Attention Network |
Multimodal Dual Attention Memory for Video Story | question | Answering |
Multimodal grid features and cell pointers for scene text visual | question | answering |
Multimodal Integration of Human-Like Attention in Visual | question | Answering |
Multitask learning for neural generative | question | answering |
Multiview Language Bias Reduction for Visual | question | Answering |
MUREL: Multimodal Relational Reasoning for Visual | question | Answering |
MUTAN: Multimodal Tucker Fusion for Visual | question | Answering |
NAAQA: A Neural Architecture for Acoustic | question | Answering |
Natural Language Video Localization: A Revisit in Span-Based | question | Answering Framework |
New Passage Ranking Algorithm for Video | question | Answering, A |
NExT-QA: Next Phase of | question | -Answering to Explaining Temporal Actions |
novel feature extractor for human action recognition in visual | question | answering, A |
NRTK, PPP or Static, That Is the | question | . Testing Different Positioning Solutions for GNSS Survey |
Object detection in 20 | question | s |
Object sequences: encoding categorical and spatial information for a yes/no visual | question | answering task |
OK-VQA: A Visual | question | Answering Benchmark Requiring External Knowledge |
On the General Value of Evidence, and Bilingual Scene-Text Visual | question | Answering |
On the hidden treasure of dialog in video | question | answering |
On the role of | question | encoder sequence model in robust visual question answering |
Ontology-Driven Cyberinfrastructure for Intelligent Spatiotemporal | question | Answering and Open Knowledge Discovery, An |
Open-Ended Video | question | Answering via Multi-Modal Conditional Adversarial Networks |
Open-Vocabulary Video | question | Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
P ≈ NP, at least in Visual | question | Answering |
Pano-AVQA: Grounded Audio-Visual | question | Answering on 360° Videos |
Pattern-information fMRI: New | question | s which it opens up and challenges which face it |
People Tracking in Camera Networks: Three Open | question | s |
Plenty is Plague: Fine-Grained Learning for Visual | question | Answering |
Positional Attention Guided Transformer-Like Architecture for Visual | question | Answering |
PQA: Perceptual | question | Answering |
Predicting Human Scanpaths in Visual | question | Answering |
Predicting the quality of user-generated answers using co-training in community-based | question | answering portals |
Prior Visual Relationship Reasoning For Visual | question | Answering |
Progressive Attention Memory Network for Movie Story | question | Answering |
Prompt-RSVQA: Prompting visual context to a language model for Remote Sensing Visual | question | Answering |
Prompting Large Language Models with Answer Heuristics for Knowledge-Based Visual | question | Answering |
QUALIFIER: | question | -Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog |
Ques-to-Visual Guided Visual | question | Answering |
| question | Answering over Community-Contributed Web Videos |
| question | classification based on co-training style semi-supervised learning |
| question | Classification Based on Weak Supervision and Interrogative Pronouns Attention Mechanism |
| question | Classification for Intelligent Question Answering: A Comprehensive Survey |
| question | interface for 3D picture creation on an autostereoscopic digital picture frame |
| question | Type Guided Attention in Visual Question Answering |
| question | -Agnostic Attention for Visual Question Answering |
| question | -aware dynamic scene graph of local semantic representation learning for visual question answering |
| question | -Centric Model for Visual Question Answering in Medical Imaging, A |
| question | -Guided Hybrid Convolution for Visual Question Answering |
| question | s of Uniqueness and Resolution in Reconstruction from Projections |
Ranking answers of comparative | question | s using heterogeneous information organization from social media |
Re-Attention for Visual | question | Answering |
Reasoning on the Relation: Enhancing Visual Representation for Visual | question | Answering and Cross-Modal Retrieval |
Recent Advances in Video | question | Answering: A Review of Datasets and Methods |
Recovering Generalization via Pre-Training-Like Knowledge Distillation for Out-of-Distribution Visual | question | Answering |
Reducing Language Biases in Visual | question | Answering with Visually-grounded Question Encoder |
Relation-Aware Graph Attention Network for Visual | question | Answering |
Relevance-aware | question | Generation in Non-task-oriented Dialogue Systems |
Reliable Visual | question | Answering: Abstain Rather Than Answer Incorrectly |
Rephrasing Visual | question | s by Specifying the Entropy of the Answer Distribution |
Resolving Zero-Shot and Fact-Based Visual | question | Answering via Enhanced Fact Retrieval |
Rethinking Data Augmentation for Robust Visual | question | Answering |
Retrieving-to-Answer: Zero-Shot Video | question | Answering with Frozen Large Language Models |
Revisiting Visual | question | Answering Baselines |
RMLVQA: A Margin Loss Approach For Visual | question | Answering with Language Biases |
Robust Explanations for Visual | question | Answering |
robust multivariate reranking algorithm for | question | Answering enrichment, A |
Robust Passage Retrieval Algorithm for Video | question | Answering, A |
Robust visual | question | answering via semantic cross modal augmentation |
RSVQA: Visual | question | Answering for Remote Sensing Data |
RUArt: A Novel Text-Centered Solution for Text-Based Visual | question | Answering |
ScanQA: 3D | question | Answering for Spatial Scene Understanding |
Scene Graph Refinement Network for Visual | question | Answering |
Scene Text Visual | question | Answering |
Second Order Enhanced Multi-Glimpse Attention in Visual | question | Answering |
See and Learn More: Dense Caption-Aware Representation for Visual | question | Answering |
SegEQA: Video Segmentation Based Visual Attention for Embodied | question | Answering |
Self-Adaptive Neural Module Transformer for Visual | question | Answering |
SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based | question | Answering |
semantic approach for | question | classification using WordNet and Wikipedia, A |
Semantic Equivalent Adversarial Data Augmentation for Visual | question | Answering |
Semantic-Aware Modular Capsule Routing for Visual | question | Answering |
Semantically Guided Visual | question | Answering |
Separating Skills and Concepts for Novel Visual | question | Answering |
Sim VQA: Exploring Simulated Environments for Visual | question | Answering |
Simple and effective visual | question | answering in a single modality |
Simple Baselines for Interactive Video Retrieval with | question | s and Answers |
Simple contrastive learning in a self-supervised manner for robust visual | question | answering |
So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual | question | Answering |
Social-IQ: A | question | Answering Benchmark for Artificial Social Intelligence |
Spatial-Semantic Collaborative Graph Network for Textbook | question | Answering |
Spatio-Temporal Two-stage Fusion for video | question | answering |
SQuINTing at VQA Models: Introspecting VQA Models With Sub- | question | s |
StableNet: Distinguishing the hard samples to overcome language priors in visual | question | answering |
Stacked Attention Networks for Image | question | Answering |
Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual | question | Answering |
Structured Attentions for Visual | question | Answering |
Structured Semantic Representation for Visual | question | Answering |
Structured Triplet Learning with POS-Tag Guided Attention for Visual | question | Answering |
survey of methods, datasets and evaluation metrics for visual | question | answering, A |
SUTD-TrafficQA: A | question | Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events |
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual | question | Answering |
Task-Oriented Multi-Modal | question | Answering For Collaborative Applications |
Tem-adapter: Adapting Image-Text Pretraining for Video | question | Answer |
Temporal Moment Localization via Natural Language by Utilizing Video | question | Answers as a Special Variant and Bypassing NLP for Corpora |
Ten Research | question | s for Scalable Multimedia Analytics |
Term Weighting Schemes for | question | Categorization |
Test-Time Model Adaptation for Visual | question | Answering With Debiased Self-Supervisions |
Text-Guided Object Detector for Multi-modal Video | question | Answering |
Text-instance graph: Exploring the relational semantics for text-based visual | question | answering |
Textbook | question | Answering Under Instructor Guidance with Memory Networks |
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual | question | Answering |
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with | question | Answering |
Time-aware Circulant Matrices for | question | -based Temporal Localization |
Tips and Tricks for Visual | question | Answering: Learnings from the 2017 Challenge |
To Complete or to Estimate, That is the | question | : A Multi-Task Approach to Depth Completion and Monocular Depth Estimation |
To Filter Prune, or to Layer Prune, That Is the | question | |
Toward Explainable 3D Grounded Visual | question | Answering: A New Benchmark and Strong Baseline |
Toward Unsupervised Realistic Visual | question | Answering |
Towards Unconstrained Pointing Problem of Visual | question | Answering: A Retrieval-based Method |
Tracking Roads in Satellite Images by Playing Twenty | question | s |
Transfer Learning via Unsupervised Task Discovery for Visual | question | Answering |
Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual | question | Answering |
Transformer-based Medical Visual | question | Answering Model, A |
TRAR: Routing the Attention Spans in Transformer for Visual | question | Answering |
Triple attention network for sentimental visual | question | answering |
TRRNET: Tiered Relation Reasoning for Compositional Visual | question | Answering |
Two Can Play This Game: Visual Dialog with Discriminative | question | Generation and Answering |
Two-stage Multimodality Fusion for High-performance Text-based Visual | question | Answering |
Two-Step Neural Network Approach to Passage Retrieval for Open Domain | question | Answering, A |
Unbiased Visual | question | Answering by Leveraging Instrumental Variable |
Uncovering the Temporal Context for Video | question | Answering |
Understanding Knowledge Gaps in Visual | question | Answering: Implications for Gap Identification and Testing |
Understanding Video Scenes through Text: Insights from Text-based Video | question | Answering |
Unified | question | er Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue |
Unifying the Video and | question | Attentions for Open-Ended Video Question Answering |
Unshuffling Data for Improved Generalization in Visual | question | Answering |
V-Doc: Visual | question | s answers with Documents |
Variational Causal Inference Network for Explanatory Visual | question | Answering |
VC-VQA: Visual Calibration Mechanism For Visual | question | Answering |
VIBIKNet: Visual Bidirectional Kernelized Network for Visual | question | Answering |
Video Graph Transformer for Video | question | Answering |
Video | question | Answering Using Clip-Guided Visual-Text Attention |
Video | question | Answering Using Language-Guided Deep Compressed-Domain Video Feature |
Video | question | Answering with Iterative Video-Text Co-tokenization |
Video | question | Answering With Prior Knowledge and Object-Sensitive Learning |
Video | question | Answering with Spatio-Temporal Reasoning |
Video | question | Answering, Movies, Spatio-Temporal, Query, VQA |
Vision and Text Transformer for Predicting Answerability on Visual | question | Answering |
VisKE: Visual knowledge extraction and | question | answering by visual verification of relation phrases |
Visual Madlibs: Fill in the Blank Description Generation and | question | Answering |
Visual | question | Answering as a Meta Learning Task |
Visual | question | Answering as Reading Comprehension |
Visual | question | answering from another perspective: CLEVR mental rotation tests |
Visual | question | answering in the medical domain based on deep learning approaches: A comprehensive study |
Visual | question | answering model based on graph neural network and contextual attention |
Visual | question | Answering Model Based on Visual Relationship Detection |
Visual | question | Answering Network Merging High- and Low-Level Semantic Information, A |
Visual | question | Answering on 360° Images |
Visual | question | Answering on Image Sets |
Visual | question | answering with attention transfer and a cross-modal gating mechanism |
Visual | question | Answering With Dense Inter- and Intra-Modality Interactions |
Visual | question | answering with gated relation-aware auxiliary |
Visual | question | Answering with Memory-Augmented Networks |
Visual | question | Answering with Textual Representations for Images |
Visual | question | Answering, Datasets, Benchmarks, Surveys |
Visual | question | Answering, Query, VQA |
Visual | question | answering: A survey of methods and datasets |
Visual | question | Answering: A Tutorial |
Visual | question | answering: Datasets, algorithms, and future challenges |
Visual | question | answering: Which investigated applications? |
Visual | question | Generation as Dual Task of Visual Question Answering |
Visual | question | Generation for Class Acquisition of Unknown Objects |
Visual | question | Generation Under Multi-granularity Cross-Modal Interaction |
Visual | question | Generation: The State of the Art |
Visual | question | Reasoning on General Dependency Tree |
Visual-Textual Image Understanding and Retrieval - Joint Workshop on Content-Based Image Retrieval, Video and Image | question | Answering, Texture Analysis, Classification and Retrieval |
Visual7W visual | question | answering |
Visual7W: Grounded | question | Answering in Images |
VizWiz Grand Challenge: Answering Visual | question | s from Blind People |
VLC-BERT: Visual | question | Answering with Contextualized Commonsense Knowledge |
VQA as a factoid | question | answering problem: A novel approach for knowledge-aware and explainable visual question answering |
VQA With No | question | s-Answers Training |
VQA, Visual | question | Answering, Neural Networks |
VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual | question | s |
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual | question | Answering |
VQA-LOL: Visual | question | Answering Under the Lens of Logic |
VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New | question | s, The |
VQA: Visual | question | Answering |
VQACL: A Novel Visual | question | Answering Continual Learning Setting |
VQAMix: Conditional Triplet Mixup for Medical Visual | question | Answering |
VQAPT: A New visual | question | answering model for personality traits in social media images |
VQS: Linking Segmentations to | question | s and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation |
VQuAD: Video | question | Answering Diagnostic Dataset |
Weakly Supervised Learning for Textbook | question | Answering |
Weakly Supervised Relative Spatial Reasoning for Visual | question | Answering |
Weakly Supervised Visual | question | Answer Generation |
Weakly-Supervised 3D Spatial Reasoning for Text-Based Visual | question | Answering |
What's in a | question | : Using Visual Questions as a Form of Supervision |
What's to Know? Uncertainty as a Guide to Asking Goal-Oriented | question | s |
Where did I leave my keys?: Episodic-Memory-Based | question | Answering on Egocentric Videos |
Where to Look: Focus Regions for Visual | question | Answering |
Why Does a Visual | question | Have Different Answers? |
Yin and Yang: Balancing and Answering Binary Visual | question | s |
Zero-Shot and Few-Shot Video | question | Answering with Multi-Modal Prompts |