_ | answering | _ |
3DVQA: Visual Question | answering | for 3D Environments |
A-OKVQA: A Benchmark for Visual Question | answering | Using World Knowledge |
A2A: Attention to Attention Reasoning for Movie Question | answering | |
Accuracy vs. complexity: A trade-off in visual question | answering | models |
Action-Centric Relation Transformer Network for Video Question | answering | |
Adapting Grounded Visual Question | answering | Models to Low Resource Languages |
ADCCF: Adaptive deep concatenation coder framework for visual question | answering | |
Adversarial Multimodal Network for Movie Story Question | answering | |
ALSA: Adversarial Learning of Supervised Attentions for Visual Question | answering | |
Analysis of Visual Question | answering | Algorithms, An |
Anomaly Matters: An Anomaly-Oriented Model for Medical Visual Question | answering | |
Answer Distillation for Visual Question | answering | |
Answer Selection in Community Question | answering | via Attentive Neural Networks |
Answer Them All! Toward Universal Visual Question | answering | Models |
Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question | answering | |
Answer-Type Prediction for Visual Question | answering | |
| answering | knowledge-based visual questions via the exploration of Question Purpose |
| answering | Questions about Data Visualizations using Efficient Bimodal Fusion |
| answering | Visual What-If Questions: From Actions to Predicted Scene Descriptions |
Are You Smarter Than a Sixth Grader? Textbook Question | answering | for Multimodal Machine Comprehension |
Ask Me Anything: Free-Form Visual Question | answering | Based on Knowledge from External Sources |
Ask Your Neurons: A Deep Learning Approach to Visual Question | answering | |
Ask Your Neurons: A Neural-Based Approach to | answering | Questions about Images |
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question | answering | |
Auto-Parsing Network for Image Captioning and Visual Question | answering | |
Barlow constrained optimization for Visual Question | answering | |
Benchmarking Out-of-Distribution Detection in Visual Question | answering | |
BERT Representations for Video Question | answering | |
Better Way to Attend: Attention With Trees for Video Question | answering | , A |
Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question | answering | |
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question | answering | |
Biometric surveillance using visual question | answering | |
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question | answering | |
Bridge to Answer: Structure-aware Graph Interaction Network for Video Question | answering | |
CAAN: Context-Aware attention network for visual question | answering | |
Cascade transformers with dynamic attention for video question | answering | |
cascaded long short-term memory (LSTM) driven generic visual question | answering | (VQA), A |
CAT: Re-Conv Attention in Transformer for Visual Question | answering | |
CLIP-Guided Vision-Language Pre-training for Question | answering | in 3D Scenes |
Coarse to Fine Frame Selection for Online Open-ended Video Question | answering | |
Coarse-to-Fine Reasoning for Visual Question | answering | |
Coarse-to-Fine Visual Question | answering | by Iterative, Conditional Refinement |
Combining Multiple Cues for Visual Madlibs Question | answering | |
Compact Trilinear Interaction for Visual Question | answering | |
Competence-aware Curriculum for Visual Concepts Learning via Question | answering | , A |
Compositional Attention Networks With Two-Stream Fusion for Video Question | answering | |
Comprehensive-perception dynamic reasoning for visual question | answering | |
Context Relation Fusion Model for Visual Question | answering | |
Context-VQA: Towards Context-Aware and Purposeful Visual Question | answering | |
Contrastive Video Question | answering | via Video Graph Transformer |
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question | answering | |
Counterfactual Samples Synthesizing and Training for Robust Visual Question | answering | |
Counterfactual Samples Synthesizing for Robust Visual Question | answering | |
Counting-based visual question | answering | with serial cascaded attention deep learning |
Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question | answering | |
Cross-Dataset Adaptation for Visual Question | answering | |
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question | answering | |
Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question | answering | |
Cross-modal knowledge reasoning for knowledge-based visual question | answering | |
Cross-modal Relational Reasoning Network for Visual Question | answering | |
Cross-Modal Visual Question | answering | for Remote Sensing Data: the International Conference on Digital Image Computing: Techniques and Applications (DICTA 2021) |
CS-VQA: Visual Question | answering | with Compressively Sensed Images |
Customized Image Narrative Generation via Interactive Visual Question Generation and | answering | |
Cycle-Consistency for Robust Visual Question | answering | |
DAPC: | answering | Why-Not Questions on Top-k Direction-Aware ASK Queries in Polar Coordinates |
Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question- | answering | , A |
Debiased Visual Question | answering | via the perspective of question types |
DecomVQANet: Decomposing visual question | answering | deep network via tensor decomposition and regression |
Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question | answering | |
Deep Attention Neural Tensor Network for Visual Question | answering | |
Deep Modular Co-Attention Networks for Visual Question | answering | |
Deep Residual Weight-Sharing Attention Network With Low-Rank Attention for Visual Question | answering | |
Depth and Video Segmentation Based Visual Attention for Embodied Question | answering | |
Depth-Aware and Semantic Guided Relational Attention Network for Visual Question | answering | |
Diagnostic Study of Visual Question | answering | With Analogical Reasoning, A |
Differential Attention for Visual Question | answering | |
DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question | answering | |
Discovering Spatio-Temporal Rationales for Video Question | answering | |
Discovering the Real Association: Multimodal Causal Reasoning in Video Question | answering | |
Divide and Conquer: | answering | Questions with Object Factorization and Compositional Reasoning |
Document Image Retrieval in a Question | answering | System for Document Images |
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question | answering | |
DRAU: Dual Recurrent Attention Units for Visual Question | answering | |
DSGEM: Dual scene graph enhancement module-based visual question | answering | |
Dual Path Multi-Modal High-Order Features for Textual Content based Visual Question | answering | |
Dual self-attention with co-attention networks for visual question | answering | |
Dual-Attention Learning Network With Word and Sentence Embedding for Medical Visual Question | answering | , A |
Dual-decoder transformer network for answer grounding in visual question | answering | |
Dual-Key Multimodal Backdoors for Visual Question | answering | |
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question | answering | |
DVQA: Understanding Data Visualizations via Question | answering | |
Dynamic dual graph networks for textbook question | answering | |
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question | answering | |
Dynamic Self-Attention with Vision Synchronization Networks for Video Question | answering | |
DynGraph: Visual Question | answering | via Dynamic Scene Graphs |
Editorial paper for Pattern Recognition Letters VSI on Cross Model Understanding for Visual Question | answering | |
Editorial to special issue on cross-media learning for visual question | answering | |
Efficient Counterfactual Debiasing for Visual Question | answering | |
EgoVQA: An Egocentric Video Question | answering | Benchmark Dataset |
Embedding Spatial Relations in Visual Question | answering | for Remote Sensing |
Embodied Question | answering | |
Embodied Question | answering | |
Embodied Question | answering | in Photorealistic Environments With Point Cloud Perception |
Empirical Evaluation of Visual Question | answering | for Novel Objects, An |
Empirical study on using adapters for debiased Visual Question | answering | |
Encoder-decoder cycle for visual question | answering | based on perception-action cycle |
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question | answering | |
Env-QA: A Video Question | answering | Benchmark for Comprehensive Understanding of Dynamic Environments |
Episodic Memory Question | answering | |
ERM: Energy-Based Refined-Attention Mechanism for Video Question | answering | |
Estimation Of Visual Contents Based On Question | answering | From Human Brain Activity |
Evaluation of a Visual Question | answering | Architecture for Pedestrian Attribute Recognition |
Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question | answering | |
Explicit Bias Discovery in Visual Question | answering | Models |
Explicit ensemble attention learning for improving visual question | answering | |
External Commonsense Knowledge as a Modality for Social Intelligence Question- | answering | |
Face-to-Face Contrastive Learning for Social Intelligence Question- | answering | |
FashionVQA: A Domain-Specific Visual Question | answering | System |
Focal Visual-Text Attention for Memex Question | answering | |
Focal Visual-Text Attention for Visual Question | answering | |
Found a Reason for me? Weakly-supervised Grounded Visual Question | answering | using Capsules |
Frame Augmented Alternating Attention Network for Video Question | answering | |
From Images to Textual Prompts: Zero-shot Visual Question | answering | with Frozen Large Language Models |
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question- | answering | |
Generalized Hadamard-Product Fusion Operators for Visual Question | answering | |
Generalized pyramid co-attention with learnable aggregation net for video question | answering | |
Generative Bias for Robust Visual Question | answering | |
Geographic Knowledge Base Question | answering | over OpenStreetMap |
Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question | answering | via Reinforcement Learning |
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question | answering | |
Grad-Cam Aware Supervised Attention for Visual Question | answering | for Post-Disaster Damage Assessment |
Graph-Based Multi-Interaction Network for Video Question | answering | |
Graph-Structured Representations for Visual Question | answering | |
Greedy Gradient Ensemble for Robust Visual Question | answering | |
Guiding Visual Question | answering | with Attention Priors |
HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question | answering | |
Health-Oriented Multimodal Food Question | answering | |
Heterogeneous Community Question | answering | via Social-Aware Multi-Modal Co-Attention Convolutional Matching |
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question | answering | |
Hierarchical Conditional Relation Networks for Multimodal Video Question | answering | |
Hierarchical Conditional Relation Networks for Video Question | answering | |
Hierarchical Relational Attention for Video Question | answering | |
Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question | answering | |
Holistic Multi-Modal Memory Network for Movie Question | answering | |
Human Attention in Visual Question | answering | : Do Humans and Deep Networks Look at the Same Regions? |
Image Captioning and Visual Question | answering | Based on Attributes and External Knowledge |
Image Question | answering | Using Convolutional Neural Network with Dynamic Parameter Prediction |
Improved Attention for Visual Question | answering | , An |
Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question | answering | |
Improving Selective Visual Question | answering | by Learning from Your Peers |
Improving Visual Question | answering | using Active Perception on Static Images |
Improving visual question | answering | using dropout and enhanced question encoder |
In Defense of Grid Features for Visual Question | answering | |
Incorporating 3D Information Into Visual Question | answering | |
Interpretable Visual Question | answering | by Reasoning on Dependency Trees |
Interpretable Visual Question | answering | by Visual Grounding From Attention Supervision Mining |
Interpretable Visual Question | answering | Referring to Outside Knowledge |
Interpretable Visual Question | answering | Via Reasoning Supervision |
Invariant Grounding for Video Question | answering | |
Inverse Visual Question | answering | : A New Benchmark and VQA Diagnosis Tool |
IQ-VQA: Intelligent Visual Question | answering | |
IQA: Visual Question | answering | in Interactive Environments |
Is GPT-3 All You Need for Visual Question | answering | in Cultural Heritage? |
ISD-QA: Iterative Distillation of Commonsense Knowledge from General Language Models for Unsupervised Question | answering | |
iVQA: Inverse Visual Question | answering | |
Joint | answering | and Explanation for Visual Commonsense Reasoning |
Joint Sequence Fusion Model for Video Question | answering | and Retrieval, A |
Joint Video and Text Parsing for Understanding Events and | answering | Queries |
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question | answering | |
Knowing Where to Look? Analysis on Attention of Visual Question | answering | System |
Knowledge Acquisition for Visual Question | answering | via Iterative Querying |
Knowledge base graph embedding module design for Visual question | answering | model |
knowledge infrastructure for intelligent query | answering | in location-based services, A |
Knowledge Proxy Intervention for Deconfounded Video Question | answering | |
Knowledge-Augmented Visual Question | answering | With Natural Language Explanation |
Knowledge-Based Embodied Question | answering | |
Knowledge-based Video Question | answering | with Unsupervised Scene Descriptions |
Language and Visual Relations Encoding for Visual Question | answering | |
Language Models are Causal Knowledge Extractors for Zero-shot Video Question | answering | |
Latent Variable Models for Visual Question | answering | |
LEAF-QA: Locate, Encode & Attend for Figure Question | answering | |
Learning Answer Embeddings for Visual Question | answering | |
Learning Models for Actions and Person-Object Interactions with Transfer to Question | answering | |
Learning Situation Hyper-Graphs for Video Question | answering | |
Learning to Ask Informative Sub-Questions for Visual Question | answering | |
Learning to Reason: End-to-End Module Networks for Visual Question | answering | |
Learning to Supervise Knowledge Retrieval Over a Tree Structure for Visual Question | answering | |
Learning Visual Knowledge Memory Networks for Visual Question | answering | |
Learning Visual Question | answering | by Bootstrapping Hard Attention |
Learning visual question | answering | on controlled semantic noisy labels |
Leveraging Question | answering | for Domain-Agnostic Information Extraction |
Leveraging Visual Question | answering | for Image-Caption Ranking |
Linguistically Routing Capsule Network for Out-of-distribution Visual Question | answering | |
LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question | answering | |
LLQA: Lifelog Question | answering | Dataset |
Local relation network with multilevel attention for visual question | answering | |
Locate Before | answering | : Answer Guided Question Localization for Video Question Answering |
Locate Before | answering | : Answer Guided Question Localization for Video Question Answering |
Locating Visual Explanations for Video Question | answering | |
Logical Implications for Visual Question | answering | Consistency |
LOIS: Looking Out of Instance Semantics for Visual Question | answering | |
Long video question | answering | : A Matching-guided Attention Model |
Long-Form Video Question | answering | via Dynamic Hierarchical Reinforced Networks |
Long-Term Video Question | answering | via Multimodal Hierarchical Memory Attentive Networks |
Maintaining Reasoning Consistency in Compositional Visual Question | answering | |
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question | answering | |
MarioQA: | answering | Questions by Watching Gameplay Videos |
Markov Network Based Passage Retrieval Method for Multimodal Question | answering | in the Cultural Heritage Domain, A |
Measuring Compositional Consistency for Video Question | answering | |
Medical Visual Question | answering | via Conditional Reasoning and Contrastive Learning |
Mining Interpretable AOG Representations From Convolutional Networks via Active Question | answering | |
Mining Object Parts from CNNs via Active Question- | answering | |
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question | answering | |
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question | answering | |
MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question | answering | |
MoCA: Incorporating domain pretraining and cross attention for textbook question | answering | |
Modality Shifting Attention Network for Multi-Modal Video Question | answering | |
MoQA: A Multi-modal Question | answering | Architecture |
Motion-Appearance Co-memory Networks for Video Question | answering | |
Movie Question | answering | via Textual Memory and Plot Graph |
MovieQA: Understanding Stories in Movies through Question- | answering | |
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question | answering | |
Multi-agent Embodied Question | answering | in Interactive Environments |
Multi-Granularity Interaction and Integration Network for Video Question | answering | |
Multi-level Attention Networks for Visual Question | answering | |
Multi-modal Contextual Graph Neural Network for Text Visual Question | answering | |
Multi-Modal Correlated Network with Emotional Reasoning Knowledge for Social Intelligence Question- | answering | |
Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question | answering | |
Multi-modal spatial relational attention networks for visual question | answering | |
Multi-Modality Latent Interaction Network for Visual Question | answering | |
Multi-scale relation reasoning for multi-modal Visual Question | answering | |
Multi-scale Relational Reasoning with Regional Attention for Visual Question | answering | |
Multi-Semantic Alignment Co-Reasoning Network for Video Question | answering | |
Multi-stage Attention based Visual Question | answering | |
Multi-Target Embodied Question | answering | |
Multi-Tier Attention Network using Term-weighted Question Features for Visual Question | answering | |
Multi-Turn Video Question | answering | via Hierarchical Attention Context Reinforced Networks |
Multimodal Dual Attention Memory for Video Story Question | answering | |
Multimodal grid features and cell pointers for scene text visual question | answering | |
Multimodal Integration of Human-Like Attention in Visual Question | answering | |
Multitask learning for neural generative question | answering | |
Multiview Language Bias Reduction for Visual Question | answering | |
MUREL: Multimodal Relational Reasoning for Visual Question | answering | |
MUTAN: Multimodal Tucker Fusion for Visual Question | answering | |
NAAQA: A Neural Architecture for Acoustic Question | answering | |
Natural Language Video Localization: A Revisit in Span-Based Question | answering | Framework |
New Passage Ranking Algorithm for Video Question | answering | , A |
NExT-QA: Next Phase of Question- | answering | to Explaining Temporal Actions |
novel feature extractor for human action recognition in visual question | answering | , A |
Object sequences: encoding categorical and spatial information for a yes/no visual question | answering | task |
OK-VQA: A Visual Question | answering | Benchmark Requiring External Knowledge |
On the General Value of Evidence, and Bilingual Scene-Text Visual Question | answering | |
On the hidden treasure of dialog in video question | answering | |
On the role of question encoder sequence model in robust visual question | answering | |
Ontology-Driven Cyberinfrastructure for Intelligent Spatiotemporal Question | answering | and Open Knowledge Discovery, An |
Open-Ended Video Question | answering | via Multi-Modal Conditional Adversarial Networks |
Open-Vocabulary Video Question | answering | : A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
Open-Vocabulary Video Question | answering | : A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
P ≈ NP, at least in Visual Question | answering | |
Pano-AVQA: Grounded Audio-Visual Question | answering | on 360° Videos |
Plenty is Plague: Fine-Grained Learning for Visual Question | answering | |
POP-VQA: Privacy preserving, On-device, Personalized Visual Question | answering | |
Positional Attention Guided Transformer-Like Architecture for Visual Question | answering | |
PQA: Perceptual Question | answering | |
Predicting Human Scanpaths in Visual Question | answering | |
Predicting the quality of user-generated answers using co-training in community-based question | answering | portals |
Prior Visual Relationship Reasoning For Visual Question | answering | |
Progressive Attention Memory Network for Movie Story Question | answering | |
Prompt-RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question | answering | |
Prompting Large Language Models with Answer Heuristics for Knowledge-Based Visual Question | answering | |
Ques-to-Visual Guided Visual Question | answering | |
Question | answering | over Community-Contributed Web Videos |
Question Classification for Intelligent Question | answering | : A Comprehensive Survey |
Question Type Guided Attention in Visual Question | answering | |
Question-Agnostic Attention for Visual Question | answering | |
Question-aware dynamic scene graph of local semantic representation learning for visual question | answering | |
Question-Centric Model for Visual Question | answering | in Medical Imaging, A |
Question-Guided Hybrid Convolution for Visual Question | answering | |
Re-Attention for Visual Question | answering | |
Reasoning on the Relation: Enhancing Visual Representation for Visual Question | answering | and Cross-Modal Retrieval |
Recent Advances in Video Question | answering | : A Review of Datasets and Methods |
Recovering Generalization via Pre-Training-Like Knowledge Distillation for Out-of-Distribution Visual Question | answering | |
Reducing Language Biases in Visual Question | answering | with Visually-grounded Question Encoder |
Relation-Aware Graph Attention Network for Visual Question | answering | |
Reliable Visual Question | answering | : Abstain Rather Than Answer Incorrectly |
Resolving Zero-Shot and Fact-Based Visual Question | answering | via Enhanced Fact Retrieval |
Rethinking Data Augmentation for Robust Visual Question | answering | |
Retrieving-to-Answer: Zero-Shot Video Question | answering | with Frozen Large Language Models |
Revisiting Visual Question | answering | Baselines |
RMLVQA: A Margin Loss Approach For Visual Question | answering | with Language Biases |
Robust Explanations for Visual Question | answering | |
robust multivariate reranking algorithm for Question | answering | enrichment, A |
Robust Passage Retrieval Algorithm for Video Question | answering | , A |
Robust visual question | answering | via semantic cross modal augmentation |
RSVQA: Visual Question | answering | for Remote Sensing Data |
RUArt: A Novel Text-Centered Solution for Text-Based Visual Question | answering | |
ScanQA: 3D Question | answering | for Spatial Scene Understanding |
Scene Graph Refinement Network for Visual Question | answering | |
Scene Text Visual Question | answering | |
Second Order Enhanced Multi-Glimpse Attention in Visual Question | answering | |
See and Learn More: Dense Caption-Aware Representation for Visual Question | answering | |
SegEQA: Video Segmentation Based Visual Attention for Embodied Question | answering | |
Self-Adaptive Neural Module Transformer for Visual Question | answering | |
SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question | answering | |
Semantic Equivalent Adversarial Data Augmentation for Visual Question | answering | |
Semantic-Aware Modular Capsule Routing for Visual Question | answering | |
Semantically Guided Visual Question | answering | |
Separating Skills and Concepts for Novel Visual Question | answering | |
SimVQA: Exploring Simulated Environments for Visual Question | answering | |
Simple and effective visual question | answering | in a single modality |
Simple contrastive learning in a self-supervised manner for robust visual question | answering | |
So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question | answering | |
Social-IQ: A Question | answering | Benchmark for Artificial Social Intelligence |
Spatial-Semantic Collaborative Graph Network for Textbook Question | answering | |
Spatio-Temporal Two-stage Fusion for video question | answering | |
StableNet: Distinguishing the hard samples to overcome language priors in visual question | answering | |
Stacked Attention Networks for Image Question | answering | |
Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question | answering | |
Structured Attentions for Visual Question | answering | |
Structured Semantic Representation for Visual Question | answering | |
Structured Triplet Learning with POS-Tag Guided Attention for Visual Question | answering | |
survey of methods, datasets and evaluation metrics for visual question | answering | , A |
SUTD-TrafficQA: A Question | answering | Benchmark and an Efficient Network for Video Reasoning over Traffic Events |
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question | answering | |
Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question- | answering | |
Task-Oriented Multi-Modal Question | answering | For Collaborative Applications |
Test-Time Model Adaptation for Visual Question | answering | With Debiased Self-Supervisions |
Text-Guided Object Detector for Multi-modal Video Question | answering | |
Text-instance graph: Exploring the relational semantics for text-based visual question | answering | |
Textbook Question | answering | Under Instructor Guidance with Memory Networks |
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question | answering | |
Think before You Simulate: Symbolic Reasoning to Orchestrate Neural Computation for Counterfactual Question | answering | |
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question | answering | |
Tips and Tricks for Visual Question | answering | : Learnings from the 2017 Challenge |
Toward Explainable 3D Grounded Visual Question | answering | : A New Benchmark and Strong Baseline |
Toward Unsupervised Realistic Visual Question | answering | |
Towards Unconstrained Pointing Problem of Visual Question | answering | : A Retrieval-based Method |
Transfer Learning via Unsupervised Task Discovery for Visual Question | answering | |
Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question | answering | |
Transformer-based Medical Visual Question | answering | Model, A |
TRAR: Routing the Attention Spans in Transformer for Visual Question | answering | |
Triple attention network for sentimental visual question | answering | |
TRRNET: Tiered Relation Reasoning for Compositional Visual Question | answering | |
Two Can Play This Game: Visual Dialog with Discriminative Question Generation and | answering | |
Two-stage Multimodality Fusion for High-performance Text-based Visual Question | answering | |
Two-Step Neural Network Approach to Passage Retrieval for Open Domain Question | answering | , A |
Unbiased Visual Question | answering | by Leveraging Instrumental Variable |
Uncovering the Temporal Context for Video Question | answering | |
Understanding Knowledge Gaps in Visual Question | answering | : Implications for Gap Identification and Testing |
Understanding Video Scenes through Text: Insights from Text-based Video Question | answering | |
Unifying the Video and Question Attentions for Open-Ended Video Question | answering | |
Unshuffling Data for Improved Generalization in Visual Question | answering | |
Variational Causal Inference Network for Explanatory Visual Question | answering | |
VC-VQA: Visual Calibration Mechanism For Visual Question | answering | |
VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question | answering | |
Video Graph Transformer for Video Question | answering | |
Video Question | answering | Using Clip-Guided Visual-Text Attention |
Video Question | answering | Using Language-Guided Deep Compressed-Domain Video Feature |
Video Question | answering | with Iterative Video-Text Co-tokenization |
Video Question | answering | With Prior Knowledge and Object-Sensitive Learning |
Video Question | answering | with Spatio-Temporal Reasoning |
Video Question | answering | , Movies, Spatio-Temporal, Query, VQA |
Vision and Text Transformer for Predicting Answerability on Visual Question | answering | |
VisKE: Visual knowledge extraction and question | answering | by visual verification of relation phrases |
Visual Madlibs: Fill in the Blank Description Generation and Question | answering | |
Visual Query | answering | by Entity-Attribute Graph Matching and Reasoning |
Visual Question | answering | as a Meta Learning Task |
Visual Question | answering | as Reading Comprehension |
Visual question | answering | from another perspective: CLEVR mental rotation tests |
Visual question | answering | in the medical domain based on deep learning approaches: A comprehensive study |
Visual question | answering | model based on graph neural network and contextual attention |
Visual Question | answering | Model Based on Visual Relationship Detection |
Visual Question | answering | Network Merging High- and Low-Level Semantic Information, A |
Visual Question | answering | on 360° Images |
Visual Question | answering | on Image Sets |
Visual question | answering | with attention transfer and a cross-modal gating mechanism |
Visual Question | answering | With Dense Inter- and Intra-Modality Interactions |
Visual question | answering | with gated relation-aware auxiliary |
Visual Question | answering | with Memory-Augmented Networks |
Visual Question | answering | with Textual Representations for Images |
Visual Question | answering | , Datasets, Benchmarks, Surveys |
Visual Question | answering | , Query, VQA |
Visual question | answering | : A survey of methods and datasets |
Visual Question | answering | : A Tutorial |
Visual question | answering | : Datasets, algorithms, and future challenges |
Visual question | answering | : Which investigated applications? |
Visual Question Generation as Dual Task of Visual Question | answering | |
Visual-Textual Image Understanding and Retrieval - Joint Workshop on Content-Based Image Retrieval, Video and Image Question | answering | , Texture Analysis, Classification and Retrieval |
Visual7W visual question | answering | |
Visual7W: Grounded Question | answering | in Images |
VizWiz Grand Challenge: | answering | Visual Questions from Blind People |
VLC-BERT: Visual Question | answering | with Contextualized Commonsense Knowledge |
VQA as a factoid question | answering | problem: A novel approach for knowledge-aware and explainable visual question answering |
VQA as a factoid question | answering | problem: A novel approach for knowledge-aware and explainable visual question answering |
VQA, Visual Question | answering | , Neural Networks |
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question | answering | |
VQA-LOL: Visual Question | answering | Under the Lens of Logic |
VQA: Visual Question | answering | |
VQA: Visual Question | answering | |
VQA: Visual Question | answering | |
VQACL: A Novel Visual Question | answering | Continual Learning Setting |
VQAMix: Conditional Triplet Mixup for Medical Visual Question | answering | |
VQAPT: A New visual question | answering | model for personality traits in social media images |
VQuAD: Video Question | answering | Diagnostic Dataset |
Weakly Supervised Learning for Textbook Question | answering | |
Weakly Supervised Relative Spatial Reasoning for Visual Question | answering | |
Weakly-Supervised 3D Spatial Reasoning for Text-Based Visual Question | answering | |
What are you doing while | answering | your smartphone? |
Where did I leave my keys?: Episodic-Memory-Based Question | answering | on Egocentric Videos |
Where to Look: Focus Regions for Visual Question | answering | |
Yin and Yang: Balancing and | answering | Binary Visual Questions |
Zero-Shot and Few-Shot Video Question | answering | with Multi-Modal Prompts |
412 for answering