_ | vit | _ |
6D- | vit | : Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning |
A- | vit | : Adaptive Tokens for Efficient Vision Transformer |
Action- | vit | : Pedestrian Intent Prediction in Traffic Scenes |
ADA- | vit | : Attention-Guided Data Augmentation for Vision Transformers |
All are Worth Words: A | vit | Backbone for Diffusion Models |
Bootstrapping | vit | s: Towards Liberating Vision Transformers from Pre-training |
Castling- | vit | : Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference |
DeiT III: Revenge of the | vit | |
Doubly-Fused | vit | : Fuse Information from Vision Transformer Doubly with Local Representation |
DP- | vit | : A Dual-Path Vision Transformer for Real-Time Sonar Target Detection |
EERCA- | vit | : Enhanced Effective Region and Context-Aware Vision Transformers for image sentiment analysis |
Enhancing lung abnormalities diagnosis using hybrid DCNN- | vit | -GRU model with explainable AI: A deep learning approach |
eX- | vit | : A Novel explainable vision transformer for weakly supervised semantic segmentation |
Exploring the differences in adversarial robustness between | vit | - and CNN-based models using novel metrics |
GTP- | vit | : Efficient Vision Transformers via Graph-based Token Propagation |
HM- | vit | : Hetero-modal Vehicle-to-Vehicle Cooperative Perception with Vision Transformer |
Hybrid | vit | -CNN Network for Fine-Grained Image Classification |
I- | vit | : Integer-only Quantization for Efficient Vision Transformer Inference |
Improved GAN for image resolution enhancement using | vit | for breast cancer detection |
Jigsaw- | vit | : Learning jigsaw puzzles in vision transformer |
Joint Convolutional Cross | vit | Network for Hyperspectral and Light Detection and Ranging Fusion Classification, A |
Learning Traces by Yourself: Blind Image Forgery Localization via Anomaly Detection With | vit | -VAE |
Limited Data, Unlimited Potential: A Study on | vit | s Augmented by Masked Autoencoders |
Low-Altitude Remote Sensing Inspection Method on Rural Living Environments Based on a Modified YOLOv5s- | vit | , A |
LRTransDet: A Real-Time SAR Ship-Detection Network with Lightweight | vit | and Multi-Scale Feature Fusion |
LT- | vit | : A Vision Transformer for Multi-Label Chest X-Ray Classification |
Mask- | vit | : an Object Mask Embedding in Vision Transformer for Fine-Grained Visual Classification |
Meta-attention for | vit | -backed Continual Learning |
MIL- | vit | : A multiple instance vision transformer for fundus image classification |
Mini but Mighty: Finetuning | vit | s with Mini Adapters |
Mix- | vit | : Mixing attentive vision transformer for ultra-fine-grained visual categorization |
MM- | vit | : Multi-Modal Video Transformer for Compressed Video Action Recognition |
MMST- | vit | : Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer |
On the Effectiveness of | vit | Features as Local Semantic Descriptors |
Open Set Classification of GAN-based Image Manipulations via a | vit | -based Hybrid Architecture |
Order- | vit | : Order Learning Vision Transformer for Cancer Classification in Pathology Images |
PaCa- | vit | : Learning Patch-to-Cluster Attention in Vision Transformers |
PerfHD: Efficient | vit | Architecture Performance Ranking using Hyperdimensional Computing |
Pyramid Adversarial Training Improves | vit | Performance |
RepQ- | vit | : Scale Reparameterization for Post-Training Quantization of Vision Transformers |
ResFormer: Scaling | vit | s with Multi-Resolution Training |
Rethinking Video | vit | s: Sparse Video Tubes for Joint Image and Video Learning |
SAL- | vit | : Towards Latency Efficient Private Inference on ViT using Selective Attention Search with a Learnable Softmax Approximation |
SAL- | vit | : Towards Latency Efficient Private Inference on ViT using Selective Attention Search with a Learnable Softmax Approximation |
Splicing | vit | Features for Semantic Appearance Transfer |
TFS- | vit | : Token-level feature stylization for domain generalization |
Tokens-to-Token | vit | : Training Vision Transformers from Scratch on ImageNet |
Transformers Pay Attention to Convolutions Leveraging Emerging Properties of | vit | s by Dual Attention-Image Network |
UIA- | vit | : Unsupervised Inconsistency-Aware Method Based on Vision Transformer for Face Forgery Detection |
UniFormerV2: Unlocking the Potential of Image | vit | s for Video Understanding |
V2X- | vit | : Vehicle-to-Everything Cooperative Perception with Vision Transformer |
Video OWL- | vit | : Temporally-consistent open-world localization in video |
Vision Transformers, | vit | |
| vit | -AMC Network With Adaptive Model Fusion and Multiobjective Optimization for Interpretable Laryngeal Tumor Grading From Histopathological Images, A |
| vit | -YOLO: Transformer-Based YOLO for Object Detection |
| vit | s for SITS: Vision Transformers for Satellite Image Time Series |
| vit | S--A Vision System for Autonomous Land Vehicle Navigation |
| vit | S: Video Tagging System from Massive Web Multimedia Collections |
Wave- | vit | : Unifying Wavelet and Transformers for Visual Representation Learning |
When CNN Meet with | vit | : Towards Semi-supervised Learning for Multi-class Medical Image Semantic Segmentation |
YOLO- | vit | -Based Method for Unmanned Aerial Vehicle Infrared Vehicle Target Detection |
61 for vit