CLVL19 * Closing the Loop Between Vision and Language
* Are we Asking the Right Questions in MovieQA?
* Evaluating Text-to-Image Matching using Binary Image Selection (BISON)
* SUN-Spot: An RGB-D Dataset With Spatial Referring Expressions

CLVL21 * Closing the Loop Between Vision and Language
* CIGLI: Conditional Image Generation from Language & Image
* Egocentric Biochemical Video-and-Language Dataset
* Language-guided Multi-Modal Fusion for Video Action Recognition
* Latent Variable Models for Visual Question Answering
* Semi-Autoregressive Transformer for Image Captioning
* Visual Question Answering with Textual Representations for Images
* What You Say Is Not What You Do: Studying Visio-Linguistic Models for TV Series Summarization
