| Left context | Keyword | Right context |
|---|---|---|
| 3d-array | token | Petri Nets Generating Tetrahedral Picture Languages |
| A-ViT: Adaptive | token | s for Efficient Vision Transformer |
| Accelerating Multimodal Large Language Models by Searching Optimal Vision | token | Reduction |
| Accurate 3D Face Reconstruction with Facial Component | token | s |
| Activating Associative Disease-Aware Vision | token | Memory for LLM-Based X-Ray Report Generation |
| Adanat: Exploring Adaptive Policy for | token | -based Image Generation |
| Adaptive Frequency Filters As Efficient Global | token | Mixers |
| Adaptive | token | Sampling for Efficient Vision Transformers |
| Adaptor: Adaptive | token | Reduction for Video Diffusion Transformers |
| Adjunct Partial Array | token | Petri Net Structure |
| Agglomerative | token | Clustering |
| AITTI: Learning Adaptive Inclusive | token | for Text-to-Image Generation |
| ALGM: Adaptive Local-then-Global | token | Merging for Efficient Semantic Segmentation with Plain Vision Transformers |
| All in | token | s: Unifying Output Space of Visual Tasks via Soft Token |
| All in | token | s: Unifying Output Space of Visual Tasks via Soft Token |
| Animal Pose Tracking: 3D Multimodal Dataset and | token | -based Pose Optimization |
| Architectures for Biometric Match-on- | token | Solutions |
| ATMformer: An Adaptive | token | Merging Vision Transformer for Remote Sensing Image Scene Classification |
| ATP-LLaVA: Adaptive | token | Pruning for Large Vision Language Models |
| Attend to Not Attended: Structure-then-Detail | token | Merging for Post-training DiT Acceleration |
| Attention-Based Layer Fusion and | token | Masking for Weakly Supervised Semantic Segmentation |
| Attribute Surrogates Learning and Spectral | token | s Pooling in Transformers for Few-shot Learning |
| Augmenting Multimodal LLMs with Self-Reflective | token | s for Knowledge-based Visual Question Answering |
| Bad | token | : Token-level Backdoor Attacks to Multi-modal Large Language Models |
| Beyond Attentive | token | s: Incorporating Token Importance and Diversity for Efficient Vision Transformers |
| Beyond Attentive | token | s: Incorporating Token Importance and Diversity for Efficient Vision Transformers |
| Beyond masking: Demystifying | token | -based pre-training for vision transformers |
| Biophasor: | token | Supplemented Cancellable Biometrics |
| Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality | token | |
| Boosting Point-BERT by Multi-Choice | token | s |
| Building Extraction from Remote Sensing Images with Sparse | token | Transformers |
| CAgMLP: An MLP-like architecture with a Cross-Axis gated | token | mixer for image classification |
| CAST: Clustering self-Attention using Surrogate | token | s for efficient transformers |
| CATANet: Efficient Content-Aware | token | Aggregation for Lightweight Image Super-Resolution |
| Class | token | s Infusion for Weakly Supervised Semantic Segmentation |
| CMI-Net: Cross-View Message | token | Interaction Network for 3D Shape Recognition |
| CMTM: Cross-Modal | token | Modulation for Unsupervised Video Object Segmentation |
| Collaborative Intelligence for Vision Transformers: A | token | Sparsity-Driven Edge-Cloud Framework |
| Computing Curvilinear Structure by | token | -Based Grouping |
| Confidence-Based Sampling Strategy for Dense Temporal | token | Learning in Thermal Infrared Object Tracking, A |
| Content-aware | token | Sharing for Efficient Semantic Segmentation with Vision Transformers |
| Continuous Intermediate | token | Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation |
| Cooperative Game Modeling With Weighted | token | -Level Alignment for Audio-Text Retrieval |
| Cross-Block Sparse Class | token | Contrast for Weakly Supervised Semantic Segmentation |
| Cross-Domain Detection Transformer Based on Spatial-Aware and Semantic-Aware | token | Alignment |
| CT2: Colorization Transformer via Color | token | s |
| CVT-Track: Concentrating on Valid | token | s for One-Stream Tracking |
| Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline | token | Optimization |
| Detecting Structure by Symbolic Constructions on | token | s |
| Devil is in Temporal | token | : High Quality Video Reasoning Segmentation, The |
| Difference Inversion: Interpolate and Isolate the Difference with | token | Consistency for Image Analogy Generation |
| Discriminative Class | token | s for Text-to-Image Diffusion Models |
| Discriminatively Matched Part | token | s for Pointly Supervised Instance Segmentation |
| DiViCo: Disentangled Visual | token | Compression for Efficient Large Vision-Language Model |
| DivPrune: Diversity-based Visual | token | Pruning for Large Multimodal Models |
| Dtpose: Learning Disentangled | token | Representation for Effective Human Pose Estimation |
| Dual Class | token | Vision Transformer for Direction of Arrival Estimation in Low SNR |
| Dual-Factor Authentication System Featuring Speaker Verification and | token | Technology, A |
| DyCoke: Dynamic Compression of | token | s for Fast Video Large Language Models |
| Dynamic | token | Pruning in Plain Vision Transformers for Semantic Segmentation |
| Dynamic | token | -Pass Transformers for Semantic Segmentation |
| DyTox: Transformers for Continual Learning with DYnamic | token | eXpansion |
| ECT: Fine-grained edge detection with learned cause | token | s |
| EDTST: Efficient Dynamic | token | Selection Transformer for Hyperspectral Image Classification |
| Effective Style | token | Weight Control Technique for End-to-End Emotional Speech Synthesis, An |
| Efficient | token | -Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training |
| Efficient Transformer Adaptation with Soft | token | Merging |
| Efficient Transformer-Based 3D Object Detection with Dynamic | token | Halting |
| Efficient Video Action Detection with | token | Dropout and Context Refinement |
| Efficient Video Transformers with Spatial-Temporal | token | Selection |
| Efficient Vision Transformer via | token | Merger |
| Efficient Vision Transformer with | token | Sparsification for Event-Based Object Tracking |
| Efficient Visual Transformer by Learnable | token | Merging |
| Emerging Property of Masked | token | for Effective Pre-training |
| Enriching Local Patterns with Multi- | token | Attention for Broad-Sight Neural Networks |
| Entity Extraction and Correction Based on | token | Structure Model Generation |
| Exploring | token | -Level Augmentation in Vision Transformer for Semi-Supervised Semantic Segmentation |
| Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network With | token | Migration |
| Faster Parameter-Efficient Tuning with | token | Redundancy Reduction |
| FASTer: Focal | token | Acquiring-and-Scaling Transformer for Long-term 3D Object Detection |
| First to Know: How | token | Distributions Reveal Hidden Knowledge in Large Vision-language Models?, The |
| Focus and Align: Learning Tube | token | s for Video-Language Pre-Training |
| FourierSR: A Fourier | token | -Based Plugin for Efficient Image Super-Resolution |
| Fully Attentional Networks with Self-emerging | token | Labeling |
| General and Efficient Training for Transformer via | token | Expansion, A |
| General Approach for | token | Correspondence, A |
| general vision problem solving architecture: Hierarchical | token | grouping, A |
| Generalized Concordant Vision Transformer With Masked Image | token | s for Object Detection |
| Generative Multimodal Pretraining with Discrete Diffusion Timestep | token | s |
| GroupedMixer: An Entropy Model With Group-Wise | token | -Mixers for Learned Image Compression |
| GroupRF: Panoptic Scene Graph Generation with group relation | token | s |
| GTP-ViT: Efficient Vision Transformers via Graph-based | token | Propagation |
| GTPT: Group-based | token | Pruning Transformer for Efficient Human Pose Estimation |
| HalLoc: | token | -level Localization of Hallucinations for Vision Language Models |
| Heterogeneous Generative | token | s and Distance-Aware Recovery Network for Occluded Person Re-Identification |
| Hierarchical Graph Interaction Transformer With Dynamic | token | Clustering for Camouflaged Object Detection |
| Hierarchical | token | -Aware Cross-Modality Reconstruction for Visible-Infrared Person Re-Identification |
| Human Pose as Compositional | token | s |
| Hybrid Multi-Class | token | Vision Transformer Convolutional Network for DOA Estimation |
| Hybrid | token | transformer for deep face recognition |
| Hybrid-Level Instruction Injection for Video | token | Compression in Multi-modal Large Language Models |
| Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via | token | Aggregation |
| Hyperspectral image classification with | token | fusion on GPU |
| HyperTransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic | token | Mixer for Hyperspectral Image Classification |
| Image is Worth 1/2 | token | s After Layer 2: Plug-and-play Inference Acceleration for Large Vision-language Models, An |
| Immunized | token | -Based Approach for Autonomous Deployment of Multiple Mobile Robots in Burnt Area |
| Implementation of the USB | token | System for Fingerprint Verification |
| Improved Masked Image Generation with | token | -Critic |
| Improving Autoregressive Visual Generation with Cluster-Oriented | token | Prediction |
| Improving defocus blur detection via adaptive supervision prior- | token | s |
| Improving vision transformer for medical image classification via | token | -wise perturbation |
| Instruction Tuning-free Visual | token | Complement for Multimodal LLMs |
| Inter-image | token | Relation Learning for weakly supervised semantic segmentation |
| ISR3: A | token | Database for Integration of Visual Modules |
| IVTP: Instruction-guided Visual | token | Pruning for Large Vision-language Models |
| Joint | token | and Feature Alignment Framework for Text-Based Person Search |
| Joint | token | Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers |
| Labeling of Curvilinear Structure Across Scales by | token | Grouping |
| language model using variable length | token | s for open-vocabulary Hangul text recognition, A |
| Large Model Empowered Multi-Modal Semantic Communication With Selective | token | s for Training |
| Layer-and Timestep-Adaptive Differentiable | token | Compression Ratios for Efficient Diffusion Transformers |
| Learnable | token | for visual tracking |
| Learning Multi-Modal Class-Specific | token | s for Weakly Supervised Dense Object Localization |
| Learning to mask and permute visual | token | s for Vision Transformer pre-training |
| Learning with Unmasked | token | s Drives Stronger Vision Learners |
| Leveraging multi-class background description and | token | dictionary representation for hyperspectral anomaly detection |
| Leveraging per Image- | token | Consistency for Vision-Language Pre-Training |
| Lightweight Image Super-Resolution with Superpixel | token | Interaction |
| Llama-vid: An Image is Worth 2 | token | s in Large Language Models |
| LookupVIT: Compressing Visual Information to a Limited Number of | token | s |
| MADTP: Multimodal Alignment-Guided Dynamic | token | Pruning for Accelerating Vision-Language Transformer |
| Magic | token | s: Select Diverse Tokens for Multi-modal Object Re-Identification |
| Magic | token | s: Select Diverse Tokens for Multi-modal Object Re-Identification |
| Make Your Vit-based Multi-view 3d Detectors Faster via | token | Compression |
| Making Vision Transformers Efficient from A | token | Sparsification View |
| ManiTrans: Entity-Level Text-Guided Image Manipulation via | token | -wise Semantic Alignment and Generation |
| MAPM: PolSAR Image Classification with Masked Autoencoder Based on Position Prediction and Memory | token | s |
| Mask-Guided Transformer Network with Topic | token | for Remote Sensing Image Captioning, A |
| Masked Reference | token | Supervision-Based Iterative Visual-Language Framework for Robust Visual Grounding, A |
| MatteFormer: Transformer-Based Image Matting via Prior- | token | s |
| MCTformer+: Multi-Class | token | Transformer for Weakly Supervised Semantic Segmentation |
| MedoidsFormer: A Strong 3D Object Detection Backbone by Exploiting Interaction With Adjacent Medoid | token | s |
| Memory- | token | Transformer for Unsupervised Video Anomaly Detection |
| MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled | token | Merging and Quantization |
| Method for identification of | token | s in video sequences |
| METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert | token | s |
| Mining representative | token | s via transformer-based multi-modal interaction for RGB-T tracking |
| MMoT: Mixture-of-Modality- | token | s Transformer for Composed Multimodal Conditional Image Synthesis |
| MonoATT: Online Monocular 3D Object Detection with Adaptive | token | Transformer |
| Morphological image processing on a | token | passing pyramid computer |
| MovieChat: From Dense | token | to Sparse Memory for Long Video Understanding |
| MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger | token | s |
| MST: Adaptive Multi-Scale | token | s Guided Interactive Segmentation |
| Multi-class | token | Transformer for Weakly Supervised Semantic Segmentation |
| Multi-Criteria | token | Fusion with One-Step-Ahead Attention for Efficient Vision Transformers |
| Multi-Faceted Adaptive | token | Pruning for Efficient Remote Sensing Image Segmentation |
| Multi-modal interaction with | token | division strategy for RGB-T tracking |
| Multi-Scale | token | s-Aware Transformer Network for Multi-Region and Multi-Sequence MR-to-CT Synthesis in a Single Model |
| Multi-schema prompting powered | token | -feature woven attention network for short text classification |
| Multi-user VR Experience for Creating and Trading Non-fungible | token | s |
| Multimodal | token | Fusion for Vision Transformers |
| MVFormer: Diversifying feature normalization and | token | mixing for efficient vision transformers |
| New Coeff- | token | Decoding Method With Efficient Memory Access in H.264/AVC Video Coding Standard, A |
| No | token | Left Behind: Explainability-Aided Image Classification and Generation |
| Not All | token | s Are Equal: Human-centric Visual Analysis via Token Clustering Transformer |
| Not All | token | s Are Equal: Human-centric Visual Analysis via Token Clustering Transformer |
| Object Discovery from Motion-Guided | token | s |
| Object Recognition as Next | token | Prediction |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via | token | Marks |
| On Correspondence, Line | token | s And Missing Tokens |
| On Correspondence, Line | token | s And Missing Tokens |
| Open-Vocabulary Attention Maps with | token | Optimization for Semantic Segmentation in Diffusion Models |
| Optimisation of biometric ID | token | s by using hardware/software co-design |
| OTE: Exploring Accurate Scene Text Recognition Using One | token | |
| Other | token | s matter: Exploring global and local features of Vision Transformers for Object Re-Identification |
| PACT: Pruning and Clustering-Based | token | Reduction for Faster Visual Language Models |
| Partitioned | token | fusion and pruning strategy for transformer tracking |
| Patch Ranking: | token | Pruning as Ranking Prediction for Efficient CLIP |
| Pedestrian Crossing Intention Prediction via Progressive Multimodal | token | Fusion for Autonomous Driving |
| Perception | token | s Enhance Visual Reasoning in Multimodal Language Models |
| Picture is Worth More Than 77 Text | token | s: Evaluating CLIP-Style Models on Dense Captions, A |
| PointLoRA: Low-Rank Adaptation with | token | Selection for Point Cloud Learning |
| Pose-guided | token | selection for the recognition of activities of daily living |
| PostoMETRO: Pose | token | Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery |
| PPT: | token | -Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation |
| PRANCE: Joint | token | -Optimization and Structural Channel-Pruning for Adaptive ViT Inference |
| Pred | token | : Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding |
| Prune and Merge: Efficient | token | Compression for Vision Transformer With Spatial Information Preserved |
| Prune Spatio-temporal | token | s by Semantic-aware Temporal Accumulation |
| Pruning One More | token | is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge |
| PTET: A progressive | token | exchanging transformer for infrared and visible image fusion |
| PVC: Progressive Visual | token | Compression for Unified Image and Video Processing in Large Vision-Language Models |
| Pyramid | token | s-to-Token Vision Transformer for Thyroid Pathology Image Classification |
| Pyramid | token | s-to-Token Vision Transformer for Thyroid Pathology Image Classification |
| Quantification and Abstraction: Low Level | token | s for Object Extraction |
| Random Entangled | token | s for Adversarially Robust Vision Transformer |
| Reasoning to Attend: Try to Understand How | token | Works |
| Recovering 3D Motion and Structure from Stereo and 2D | token | Tracking Cooperation |
| Removing Rows and Columns of | token | s in Vision Transformer Enables Faster Dense Prediction Without Retraining |
| Representation Selective Coupling via | token | Sparsification for Multi-Spectral Object Re-Identification |
| Request for Clarity over the End of Sequence | token | in the Self-critical Sequence Training, A |
| ResiComp: Loss-Resilient Image Compression via Dual-Functional Masked Visual | token | Modeling |
| Rethinking | token | Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks |
| Rethinking visual prompt learning as masked visual | token | modeling |
| Revisiting Multimodal Representation in Contrastive Learning: From Patch and | token | Embeddings to Finite Discrete Tokens |
| Revisiting Multimodal Representation in Contrastive Learning: From Patch and | token | Embeddings to Finite Discrete Tokens |
| Revisiting | token | Pruning for Object Detection and Instance Segmentation |
| RIFormer: Keep Your Vision Backbone Effective But Removing | token | Mixer |
| Robust Distance Measures for Face-Recognition Supporting Revocable Biometric | token | s. |
| Robust scene text understanding with OCR | token | and word alignment for Text-VQA and text-caption |
| Robustifying | token | Attention for Vision Transformers |
| Robustness | token | s: Towards Adversarial Robustness of Transformers |
| Rollout-Guided | token | Pruning for Efficient Video Understanding |
| Salience-based Adaptive Masking: Revisiting | token | Dynamics for Enhanced Pre-Training |
| SarAdapter: Prioritizing Attention on Semantic-Aware Representative | token | s for Enhanced Medical Image Segmentation |
| SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive | token | s |
| SATA: Spatial Autocorrelation | token | Analysis for Enhancing the Robustness of Vision Transformers |
| Scale-aware | token | -matching for transformer-based object detector |
| Segment Any Event Streams via Weighted Adaptation of Pivotal | token | s |
| Seit++: Masked | token | Modeling Improves Storage-efficient Training |
| SeiT: Storage-Efficient Vision Training with | token | s Using 1% of Pixel Storage |
| Self-Supervised Anomaly Detection from Anomalous Training Data via Iterative Latent | token | Masking |
| Self-supervised Video Copy Localization with Regional | token | Representation |
| Semantic Prompting with Image | token | for Continual Learning |
| SETA: Semantic-Aware Edge-Guided | token | Augmentation for Domain Generalization |
| SG-Former: Self-guided Transformer with Evolving | token | Reallocation |
| Shunted Self-Attention via Multi-Scale | token | Aggregation |
| Simple | token | -Level Confidence Improves Caption Correctness |
| Simple yet Effective Layout | token | in Large Language Models for Document Understanding, A |
| Sketch | token | s: A Learned Mid-level Representation for Contour and Object Detection |
| Smoothest Velocity Field and | token | Matching Schemes, The |
| Soft Measure of Visual | token | Occurrences for Object Categorization |
| Spatial Positioning | token | (SPToken) for Smart Mobility |
| Spatial-Aware | token | for Weakly Supervised Object Localization |
| SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft | token | Pruning |
| STFormer: An efficient visual Transformer model with sparse attention and adaptive | token | aggregation |
| STPM: Spatial-Temporal | token | Pruning and Merging for Complex Activity Recognition |
| Strategies for Tracking | token | s in a Cluttered Scene |
| Strip-MLP: Efficient | token | Interaction for Vision MLP |
| SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and | token | Folding |
| T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-Specific | token | Memory |
| TACo: | token | -aware Cascade Contrastive Learning for Video-Text Alignment |
| Taming the curse of dimensionality for perturbed | token | identification |
| TCFormer: Visual Recognition via | token | Clustering Transformer |
| TCSAFormer: Efficient Vision Transformer With | token | Compression and Sparse Attention for Medical Image Segmentation |
| TFRNet: Semantic Segmentation Network with | token | Filtration and Refinement Method |
| TFS-ViT: | token | -level feature stylization for domain generalization |
| token | Aggregation and Selection Hashing for Efficient Underwater Image Retrieval |
| token | Boosting for Robust Self-Supervised Visual Transformer Pre-training |
| token | Calibration for Transformer-Based Domain Adaptation |
| token | Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning |
| token | Contrast for Weakly-Supervised Semantic Segmentation |
| token | Cropr: Faster ViTs for Quite a Few Tasks |
| token | Fusion: Bridging the Gap between Token Pruning and Token Merging |
| token | Fusion: Bridging the Gap between Token Pruning and Token Merging |
| token | Fusion: Bridging the Gap between Token Pruning and Token Merging |
| token | Grouping Based on 3d Motion and Feature Selection in Object Tracking |
| token | labeling-guided multi-scale medical image classification |
| token | Masking Transformer for Weakly Supervised Object Localization |
| token | Merging for Fast Stable Diffusion |
| token | Pooling in Vision Transformers for Image Classification |
| token | pyramid pooling-driven style adapter learning with dual-view balanced loss for imbalanced diabetic retinopathy grading |
| token | Selection is a Simple Booster for Vision Transformers |
| token | Tracking in a Cluttered Scene |
| token | Transformation Matters: Towards Faithful Post-Hoc Explanation for Vision Transformer |
| token | Turing Machines |
| token | Turing Machines are Efficient Vision Models |
| token | -aware and step-aware acceleration for Stable Diffusion |
| token | -based dynamic bit-width assignment for ViT quantization |
| token | -Based Extraction of Straight Lines |
| token | -Based Fingerprint Authentication |
| token | -Based, Patch Based Vision Transformers |
| token | -Consistent Dropout For Calibrated Vision Transformers |
| token | -Label Alignment for Vision Transformers |
| token | -Level Prompt Mixture With Parameter-Free Routing for Federated Domain Generalization |
| token | -Mixer: Bind Image and Text in One Embedding Space for Medical Image Reporting |
| token | -Prediction-Based Post-Processing for Low-Bitrate Speech Coding |
| token | -Textured Object Detection by Pyramids |
| token | -word mixer meets object-aware transformer for referring image segmentation |
| token | Compose: Text-to-Image Diffusion with Token-Level Supervision |
| token | HPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers |
| token | Motion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation |
| token | Pose: Learning Keypoint Tokens for Human Pose Estimation |
| token | s-to-Token ViT: Training Vision Transformers from Scratch on ImageNet |
| token | s-to-Token ViT: Training Vision Transformers from Scratch on ImageNet |
| TopFormer: | token | Pyramid Transformer for Mobile Semantic Segmentation |
| TopV: Compatible | token | Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model |
| TORE: | token | Recycling in Vision Transformers for Efficient Active Visual Exploration |
| TORE: | token | Reduction for Efficient Human Mesh Recovery with Transformer |
| Toward Unified | token | Learning for Vision-Language Tracking |
| Towards Universal Modal Tracking With Online Dense Temporal | token | Learning |
| Trajectory-aligned Space-time | token | s for Few-shot Action Recognition |
| Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive | token | Dictionary |
| Transferable Adversarial Attacks on Vision Transformers with | token | Gradient Regularization |
| Transformer Compressed Sensing Via Global Image | token | s |
| Transformer RGBT Tracking With Spatio-Temporal Multimodal | token | s |
| Transformer vision-language tracking via proxy | token | guided cross-modal fusion |
| Transformer with | token | attention and attribute prediction for image captioning |
| Translating Optical Flow into | token | Matches |
| Translating Optical Flow into | token | Matches and Depth from Looming |
| TS-CAM: | token | Semantic Coupled Attention Map for Weakly Supervised Object Localization |
| TS2-Net: | token | Shift and Selection Transformer for Text-Video Retrieval |
| TSVT: | token | Sparsification Vision Transformer for robust RGB-D salient object detection |
| TTST: A Top-k | token | Selective Transformer for Remote Sensing Image Super-Resolution |
| UMIFormer: Mining the Correlations between Similar | token | s for Multi-View 3D Reconstruction |
| Understanding the Effect of using Semantically Meaningful | token | s for Visual Representation Learning |
| Unleashing Transformers: Parallel | token | Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes |
| Using orientation | token | s for object recognition |
| VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware | token | Sparsification |
| vid-TLDR: Training Free | token | merging for Light-Weight Video Transformer |
| Video, How do Your | token | s Merge? |
| VidToMe: Video | token | Merging for Zero-Shot Video Editing |
| Vista-llama: Reducing Hallucination in Video Language Models via Equal Distance to Visual | token | s |
| VL-Match: Enhancing Vision-Language Pretraining with | token | -Level and Instance-Level Matching |
| VLTP: Vision-Language Guided | token | Pruning for Task-Oriented Segmentation |
| Which | token | s to Use? Investigating Token Reduction in Vision Transformers |
| Which | token | s to Use? Investigating Token Reduction in Vision Transformers |
| Window | token | Concatenation for Efficient Visual Large Language Models |
| Zero-shot 3D Question Answering via Voxel-based Dynamic | token | Compression |
| Zero-TPrune: Zero-Shot | token | Pruning Through Leveraging of the Attention Graph in Pre-Trained Transformers |
320 entries for "token"