ACMM: Aligned Cross-Modal Memory for Few-Shot Image and | sentence | Matching |
Actor and Action Video Segmentation from a | sentence | |
Adaptive proposal network based on generative adversarial learning for weakly supervised temporal | sentence | grounding |
Adversarial Inference for Multi- | sentence | Video Description |
Affective Words and the Company They Keep: Studying the Accuracy of Affective Word Lists in Determining | sentence | and Word Valence in a Domain-Specific Corpus |
approach for real-time recognition of online Chinese handwritten | sentence | s, An |
Artpedia: A New Visual-Semantic Dataset with Visual and Contextual | sentence | s in the Artistic Domain |
Augmented Visual-Semantic Embeddings for Image and | sentence | Matching |
Automated extraction of signs from continuous sign language | sentence | s using Iterated Conditional Modes |
Automatic Lipreading of | sentence | s Combining Hidden Markov Models and Grammars |
Automatic | sentence | Modality Recognition in Children's Speech, and Its Usage Potential in the Speech Therapy |
Beyond caption to narrative: Video captioning with multiple | sentence | s |
Bidirectional Single-Stream Temporal | sentence | Query Localization in Untrimmed Videos |
Block2vec: An Approach for Identifying Urban Functional Regions by Integrating | sentence | Embedding Model and Points of Interest |
Boundary-aware Temporal | sentence | Grounding with Adaptive Proposal Refinement |
Building a Breast- | sentence | Dataset: Its Usefulness for Computer-Aided Diagnosis |
Can DNNs Learn to Lipread Full | sentence | s? |
Classification of Multi-class Daily Human Motion using Discriminative Body Parts and | sentence | Descriptions |
Coherent Multi- | sentence | Video Description with Variable Level of Detail |
Combining Acoustic and Visual Classifiers for the Recognition of Spoken | sentence | s |
Comprehensive Framework of Early and Late Fusion for Image- | sentence | Retrieval |
Conditional | sentence | Generation and Cross-Modal Reranking for Sign Language Translation |
Conditional Video Diffusion Network for Fine-Grained Temporal | sentence | Grounding |
Constructing and Reconstructing Characters, Words, and | sentence | s by Synthesizing Writing Motions |
Context-aware Biaffine Localizing Network for Temporal | sentence | Grounding |
Cross-modal Semantic Enhanced Interaction for Image- | sentence | Retrieval |
Cross- | sentence | Temporal and Semantic Relations in Video Activity Localisation |
D3G: Exploring Gaussian Prior for Temporal | sentence | Grounding with Glance Annotation |
Decoupled Cross-Modal Phrase-Attention Network for Image- | sentence | Matching |
Deep Convolutional Neural Network for Bidirectional Image- | sentence | Mapping |
Deep Convolutional Neural Network for Correlating Images and | sentence | s |
Deep hierarchical encoding model for | sentence | semantic matching |
Deep Top-k Ranking for Image- | sentence | Matching |
Dense Video Captioning Using Graph-Based | sentence | Summarization |
Detailed | sentence | Generation Architecture for Image Semantics Description |
Dual-Attention Learning Network With Word and | sentence | Embedding for Medical Visual Question Answering, A |
Dynamic Pruning of Regions for Image- | sentence | Matching |
Dynamic Text Line Segmentation for Real-Time Recognition of Chinese Handwritten | sentence | s |
EEG-Based Classification of Implicit Intention During Self-Relevant | sentence | Reading |
Effective semi-supervised learning strategies for automatic | sentence | segmentation |
effective | sentence | -extraction technique using contextual information and statistical approaches for text summarization, An |
Efficient Image and | sentence | Matching |
Efficient Relational | sentence | Ordering Network |
Efficient | sentence | Embedding via Semantic Subspace Analysis |
Elimination of Spatial Incoherency in Bag-of-Visual Words Image Representation Using Visual | sentence | Modelling |
Evaluation of BERT and ALBERT | sentence | Embedding Performance on Downstream NLP Tasks |
Every Picture Tells a Story: Generating | sentence | s from Images |
Exploration of term relationship for Bayesian network based | sentence | retrieval |
Exploring global | sentence | representation for graph-based dependency parsing using BLSTM-SCNN |
Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal | sentence | Grounding |
Extraction of Indicative Summary | sentence | s from Imaged Documents |
Face Tells Detailed Expression: Generating Comprehensive Facial Expression | sentence | Through Facial Action Units |
Few-Shot Image and | sentence | Matching via Aligned Cross-Modal Memory |
Few-Shot Temporal | sentence | Grounding via Memory-Guided Semantic Learning |
Finding relevant features for Korean comparative | sentence | extraction |
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to- | sentence | Models |
Framework for Sign Language | sentence | Recognition by Commonsense Context, A |
Fs-DSM: Few-Shot Diagram- | sentence | Matching via Cross-Modal Attention Graph Model |
GCP: Graph Encoder With Content-Planning for | sentence | Generation From Knowledge Bases |
Generating Multi- | sentence | Natural Language Descriptions of Indoor Scenes |
Handling Out-of-Vocabulary Words and Recognition Errors Based on Word Linguistic Context for Handwritten | sentence | Recognition |
Handwritten | sentence | Recognition |
Handwritten | sentence | recognition: from signal to syntax |
Hierarchical Local-Global Transformer for Temporal | sentence | Grounding |
How much do cross-modal related semantics benefit image captioning by weighting attributes and re-ranking | sentence | s? |
Hybrid Language Model for Handwritten Chinese | sentence | Recognition, A |
IAM-database: an English | sentence | database for offline handwriting recognition, The |
IAM-OnDB: An on-line English | sentence | database acquired from handwritten text on a whiteboard |
Identity-Aware Multi- | sentence | Video Description |
Image and | sentence | Matching via Semantic Concepts and Order Learning |
Impossible Objects as Non-Sense | sentence | s |
Improving edit-based unsupervised | sentence | simplification using fine-tuned BERT |
Improving Image- | sentence | Embeddings Using Large Weakly Annotated Photo Collections |
Index Construction and Similarity Retrieval Method Based on | sentence | -Bert, An |
Instance-Aware Image and | sentence | Matching with Selective Multimodal LSTM |
Is An Image Worth Five | sentence | s? A New Look into Semantics for Image-Text Matching |
Japanese | sentence | Dataset for Lip-reading
KSL-Guide: A Large-scale Korean Sign Language Dataset Including Interrogative | sentence | s for Guiding the Deaf and Hard-of-Hearing |
Label-Based Automatic Alignment of Video with Narrative | sentence | s |
Language-Guided Multi-Granularity Context Aggregation for Temporal | sentence | Grounding |
Learning Joint Representations of Videos and | sentence | s with Web Image Search |
Learning Like a Child: Fast Novel Visual Concept Learning from | sentence | Descriptions of Images |
Learning Modality Interaction for Temporal | sentence | Localization and Event Captioning in Videos |
Learning Semantic Concepts and Order for Image and | sentence | Matching |
Learning the Visual Interpretation of | sentence | s |
Learning Word and | sentence | Embeddings Using a Generative Convolutional Network |
Lip Reading | sentence | s in the Wild |
Local Correspondence Network for Weakly Supervised Temporal | sentence | Grounding |
Matching Image and | sentence | With Multi-Faceted Representations |
Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language | sentence | , A |
Multi-label Text Classification Approach for | sentence | Level News Emotion Analysis |
Multi-Modality Cross Attention Network for Image and | sentence | Matching |
Multi-Scale Fine-Grained Alignments for Image and | sentence | Matching |
Multi- | sentence | Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis |
Multimodal Convolutional Neural Networks for Matching Image and | sentence | |
Neural | sentence | embedding using only in-domain sentences for out-of-domain sentence detection in dialog systems |
Offline Grammar-Based Recognition of Handwritten | sentence | s |
On-Line Handwritten Character Pattern Database Sampled in a Sequence of | sentence | s without Any Writing Instructions |
Ordered rules for full | sentence | translation: A neural network realization and a case study for Hindi and English |
Parsing N-best lists of handwritten | sentence | s |
Pose | sentence | s: A new representation for action recognition using sequence of pose words |
Recognition of Handwritten Arabic Words and | sentence | s |
Recognition of handwritten | sentence | s using a restricted lexicon |
Rejection measures for handwriting | sentence | recognition |
Rejection strategies for offline handwritten | sentence | recognition |
Retrieval of | sentence | Sequences for an Image Stream via Coherence Recurrent Convolutional Networks |
SaGCN: Semantic-Aware Graph Calibration Network for Temporal | sentence | Grounding |
Saliency-Guided Attention Network for Image- | sentence | Matching |
SEA: | sentence | Encoder Assembly for Video Retrieval by Textual Queries |
Seeing What You're Told: | sentence | -Guided Activity Recognition in Video |
Semantic Conditioned Dynamic Modulation for Temporal | sentence | Grounding in Videos |
Semantic Similarity Between | sentence | s Through Approximate Tree Matching |
Semantic-Based | sentence | Recognition in Images Using Bimodal Deep Learning |
| sentence | Attention Blocks for Answer Grounding |
| sentence | boundary detection in conversational speech transcripts using noisily labeled examples |
| sentence | Clustering Using Continuous Vector Space Representation |
| sentence | Directed Video Object Codiscovery |
| sentence | Is Worth a Thousand Pixels, A |
| sentence | level matrix representation for document spectral clustering |
| sentence | level text classification in the Kannada language: A classifier's perspective |
| sentence | recognition through hybrid neuro-Markovian modeling |
| sentence | Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance, A |
| sentence | -Based Image Description with Scalable, Explicit Models |
| sentence | -level sentiment analysis in Persian |
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided | sentence | Summarization |
Statistical language models for on-line handwritten | sentence | recognition |
synergy of double attention: Combine | sentence | -level and word-level attention for image captioning, The |
Synesketch: An Open Source Library for | sentence | -Based Emotion Recognition |
Syntax Customized Video Captioning by Imitating Exemplar | sentence | s |
Synwmd: Syntax-aware word Mover's distance for | sentence | similarity evaluation |
Tell as You Imagine: | sentence | Imageability-aware Image Captioning |
Temporal | sentence | Grounding in Videos: A Survey and Future Directions |
Topic-word-constrained | sentence | generation with variational autoencoder |
Towards Bridging Event Captioner and | sentence | Localizer for Weakly Supervised Dense Event Captioning |
Treebanks gone bad: Parser evaluation and retraining using a treebank of ungrammatical | sentence | s |
Unconstrained Benchmark Urdu Handwritten | sentence | Database with Automatic Line Segmentation, An |
Unifying Relational | sentence | Generation and Retrieval for Medical Image Report Composition |
Unsupervised Modeling of Signs Embedded in Continuous | sentence | s |
Use of a Confusion Network to Detect and Correct Errors in an On-Line Handwritten | sentence | Recognition System |
Using Word Graphs as Intermediate Representation of Uttered | sentence | s |
Video Captioning via | sentence | Augmentation and Spatio-Temporal Attention |
Visual Code- | sentence | s: A New Video Representation Based on Image Descriptor Sequences |
Visual | sentence | s for Pose Retrieval Over Low-Resolution Cross-Media Dance Collections |
Weakly Supervised Temporal | sentence | Grounding with Gaussian-based Contrastive Proposal Learning |
Weakly Supervised Temporal | sentence | Grounding with Uncertainty-Guided Self-training |
What Is Happening in the Video? Annotate Video by | sentence | |
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form | sentence | s |
Word and | sentence | Extraction Using Irregular Pyramid |
Word to | sentence | Visual Semantic Similarity for Caption Generation: Lessons Learned |
Word- | sentence | Framework for Remote Sensing Image Captioning |
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal | sentence | Grounding in Compressed Videos |
153 for sentence