Keith Price Bibliography KWIC Details for token

Index for token

_ token _

3d-array token Petri Nets Generating Tetrahedral Picture Languages

A-ViT: Adaptive token s for Efficient Vision Transformer

Accelerating Multimodal Large Language Models by Searching Optimal Vision token Reduction

Accurate 3D Face Reconstruction with Facial Component token s

Activating Associative Disease-Aware Vision token Memory for LLM-Based X-Ray Report Generation

Adanat: Exploring Adaptive Policy for token -based Image Generation

Adaptive Frequency Filters As Efficient Global token Mixers

Adaptive token Sampling for Efficient Vision Transformers

Adaptor: Adaptive token Reduction for Video Diffusion Transformers

Adjunct Partial Array token Petri Net Structure

Agglomerative token Clustering

AITTI: Learning Adaptive Inclusive token for Text-to-Image Generation

ALGM: Adaptive Local-then-Global token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

All in token s: Unifying Output Space of Visual Tasks via Soft Token

Animal Pose Tracking: 3D Multimodal Dataset and token -based Pose Optimization

Architectures for Biometric Match-on- token Solutions

ATMformer: An Adaptive token Merging Vision Transformer for Remote Sensing Image Scene Classification

ATP-LLaVA: Adaptive token Pruning for Large Vision Language Models

Attend to Not Attended: Structure-then-Detail token Merging for Post-training DiT Acceleration

Attention-Based Layer Fusion and token Masking for Weakly Supervised Semantic Segmentation

Attribute Surrogates Learning and Spectral token s Pooling in Transformers for Few-shot Learning

Augmenting Multimodal LLMs with Self-Reflective token s for Knowledge-based Visual Question Answering

Bad token : Token-level Backdoor Attacks to Multi-modal Large Language Models

Beyond Attentive token s: Incorporating Token Importance and Diversity for Efficient Vision Transformers

Beyond masking: Demystifying token -based pre-training for vision transformers

Biophasor: token Supplemented Cancellable Biometrics

Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality token

Boosting Point-BERT by Multi-Choice token s

Building Extraction from Remote Sensing Images with Sparse token Transformers

CAgMLP: An MLP-like architecture with a Cross-Axis gated token mixer for image classification

CAST: Clustering self-Attention using Surrogate token s for efficient transformers

CATANet: Efficient Content-Aware token Aggregation for Lightweight Image Super-Resolution

Class token s Infusion for Weakly Supervised Semantic Segmentation

CMI-Net: Cross-View Message token Interaction Network for 3D Shape Recognition

CMTM: Cross-Modal token Modulation for Unsupervised Video Object Segmentation

Collaborative Intelligence for Vision Transformers: A token Sparsity-Driven Edge-Cloud Framework

Computing Curvilinear Structure by token -Based Grouping

Confidence-Based Sampling Strategy for Dense Temporal token Learning in Thermal Infrared Object Tracking, A

Content-aware token Sharing for Efficient Semantic Segmentation with Vision Transformers

Continuous Intermediate token Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation

Cooperative Game Modeling With Weighted token -Level Alignment for Audio-Text Retrieval

Cross-Block Sparse Class token Contrast for Weakly Supervised Semantic Segmentation

Cross-Domain Detection Transformer Based on Spatial-Aware and Semantic-Aware token Alignment

CT2: Colorization Transformer via Color token s

CVT-Track: Concentrating on Valid token s for One-Stream Tracking

Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline token Optimization

Detecting Structure by Symbolic Constructions on token s

Devil is in Temporal token : High Quality Video Reasoning Segmentation, The

Difference Inversion: Interpolate and Isolate the Difference with token Consistency for Image Analogy Generation

Discriminative Class token s for Text-to-Image Diffusion Models

Discriminatively Matched Part token s for Pointly Supervised Instance Segmentation

DiViCo: Disentangled Visual token Compression for Efficient Large Vision-Language Model

DivPrune: Diversity-based Visual token Pruning for Large Multimodal Models

Dtpose: Learning Disentangled token Representation for Effective Human Pose Estimation

Dual Class token Vision Transformer for Direction of Arrival Estimation in Low SNR

Dual-Factor Authentication System Featuring Speaker Verification and token Technology, A

DyCoke: Dynamic Compression of token s for Fast Video Large Language Models

Dynamic token Pruning in Plain Vision Transformers for Semantic Segmentation

Dynamic token -Pass Transformers for Semantic Segmentation

DyTox: Transformers for Continual Learning with DYnamic token eXpansion

ECT: Fine-grained edge detection with learned cause token s

EDTST: Efficient Dynamic token Selection Transformer for Hyperspectral Image Classification

Effective Style token Weight Control Technique for End-to-End Emotional Speech Synthesis, An

Efficient token -Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Efficient Transformer Adaptation with Soft token Merging

Efficient Transformer-Based 3D Object Detection with Dynamic token Halting

Efficient Video Action Detection with token Dropout and Context Refinement

Efficient Video Transformers with Spatial-Temporal token Selection

Efficient Vision Transformer via token Merger

Efficient Vision Transformer with token Sparsification for Event-Based Object Tracking

Efficient Visual Transformer by Learnable token Merging

Emerging Property of Masked token for Effective Pre-training

Enriching Local Patterns with Multi- token Attention for Broad-Sight Neural Networks

Entity Extraction and Correction Based on token Structure Model Generation

Exploring token -Level Augmentation in Vision Transformer for Semi-Supervised Semantic Segmentation

Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network With token Migration

Faster Parameter-Efficient Tuning with token Redundancy Reduction

FASTer: Focal token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection

First to Know: How token Distributions Reveal Hidden Knowledge in Large Vision-language Models?, The

Focus and Align: Learning Tube token s for Video-Language Pre-Training

FourierSR: A Fourier token -Based Plugin for Efficient Image Super-Resolution

Fully Attentional Networks with Self-emerging token Labeling

General and Efficient Training for Transformer via token Expansion, A

General Approach for token Correspondence, A

general vision problem solving architecture: Hierarchical token grouping, A

Generalized Concordant Vision Transformer With Masked Image token s for Object Detection

Generative Multimodal Pretraining with Discrete Diffusion Timestep token s

GroupedMixer: An Entropy Model With Group-Wise token -Mixers for Learned Image Compression

GroupRF: Panoptic Scene Graph Generation with group relation token s

GTP-ViT: Efficient Vision Transformers via Graph-based token Propagation

GTPT: Group-based token Pruning Transformer for Efficient Human Pose Estimation

HalLoc: token -level Localization of Hallucinations for Vision Language Models

Heterogeneous Generative token s and Distance-Aware Recovery Network for Occluded Person Re-Identification

Hierarchical Graph Interaction Transformer With Dynamic token Clustering for Camouflaged Object Detection

Hierarchical token -Aware Cross-Modality Reconstruction for Visible-Infrared Person Re-Identification

Human Pose as Compositional token s

Hybrid Multi-Class token Vision Transformer Convolutional Network for DOA Estimation

Hybrid token transformer for deep face recognition

Hybrid-Level Instruction Injection for Video token Compression in Multi-modal Large Language Models

Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via token Aggregation

Hyperspectral image classification with token fusion on GPU

HyperTransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic token Mixer for Hyperspectral Image Classification

Image is Worth 1/2 token s After Layer 2: Plug-and-play Inference Acceleration for Large Vision-language Models, An

Immunized token -Based Approach for Autonomous Deployment of Multiple Mobile Robots in Burnt Area

Implementation of the USB token System for Fingerprint Verification

Improved Masked Image Generation with token -Critic

Improving Autoregressive Visual Generation with Cluster-Oriented token Prediction

Improving defocus blur detection via adaptive supervision prior- token s

Improving vision transformer for medical image classification via token -wise perturbation

Instruction Tuning-free Visual token Complement for Multimodal LLMs

Inter-image token Relation Learning for weakly supervised semantic segmentation

ISR3: A token Database for Integration of Visual Modules

IVTP: Instruction-guided Visual token Pruning for Large Vision-language Models

Joint token and Feature Alignment Framework for Text-Based Person Search

Joint token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

Labeling of Curvilinear Structure Across Scales by token Grouping

language model using variable length token s for open-vocabulary Hangul text recognition, A

Large Model Empowered Multi-Modal Semantic Communication With Selective token s for Training

Layer-and Timestep-Adaptive Differentiable token Compression Ratios for Efficient Diffusion Transformers

Learnable token for visual tracking

Learning Multi-Modal Class-Specific token s for Weakly Supervised Dense Object Localization

Learning to mask and permute visual token s for Vision Transformer pre-training

Learning with Unmasked token s Drives Stronger Vision Learners

Leveraging multi-class background description and token dictionary representation for hyperspectral anomaly detection

Leveraging per Image- token Consistency for Vision-Language Pre-Training

Lightweight Image Super-Resolution with Superpixel token Interaction

Llama-vid: An Image is Worth 2 token s in Large Language Models

LookupVIT: Compressing Visual Information to a Limited Number of token s

MADTP: Multimodal Alignment-Guided Dynamic token Pruning for Accelerating Vision-Language Transformer

Magic token s: Select Diverse Tokens for Multi-modal Object Re-Identification

Make Your Vit-based Multi-view 3d Detectors Faster via token Compression

Making Vision Transformers Efficient from A token Sparsification View

ManiTrans: Entity-Level Text-Guided Image Manipulation via token -wise Semantic Alignment and Generation

MAPM: PolSAR Image Classification with Masked Autoencoder Based on Position Prediction and Memory token s

Mask-Guided Transformer Network with Topic token for Remote Sensing Image Captioning, A

Masked Reference token Supervision-Based Iterative Visual-Language Framework for Robust Visual Grounding, A

MatteFormer: Transformer-Based Image Matting via Prior- token s

MCTformer+: Multi-Class token Transformer for Weakly Supervised Semantic Segmentation

MedoidsFormer: A Strong 3D Object Detection Backbone by Exploiting Interaction With Adjacent Medoid token s

Memory- token Transformer for Unsupervised Video Anomaly Detection

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled token Merging and Quantization

Method for identification of token s in video sequences

METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert token s

Mining representative token s via transformer-based multi-modal interaction for RGB-T tracking

MMoT: Mixture-of-Modality- token s Transformer for Composed Multimodal Conditional Image Synthesis

MonoATT: Online Monocular 3D Object Detection with Adaptive token Transformer

Morphological image processing on a token passing pyramid computer

MovieChat: From Dense token to Sparse Memory for Long Video Understanding

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger token s

MST: Adaptive Multi-Scale token s Guided Interactive Segmentation

Multi-class token Transformer for Weakly Supervised Semantic Segmentation

Multi-Criteria token Fusion with One-Step-Ahead Attention for Efficient Vision Transformers

Multi-Faceted Adaptive token Pruning for Efficient Remote Sensing Image Segmentation

Multi-modal interaction with token division strategy for RGB-T tracking

Multi-Scale token s-Aware Transformer Network for Multi-Region and Multi-Sequence MR-to-CT Synthesis in a Single Model

Multi-schema prompting powered token -feature woven attention network for short text classification

Multi-user VR Experience for Creating and Trading Non-fungible token s

Multimodal token Fusion for Vision Transformers

MVFormer: Diversifying feature normalization and token mixing for efficient vision transformers

New Coeff- token Decoding Method With Efficient Memory Access in H.264/AVC Video Coding Standard, A

No token Left Behind: Explainability-Aided Image Classification and Generation

Not All token s Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Object Discovery from Motion-Guided token s

Object Recognition as Next token Prediction

Omni-RGPT: Unifying Image and Video Region-level Understanding via token Marks

On Correspondence, Line token s And Missing Tokens

Open-Vocabulary Attention Maps with token Optimization for Semantic Segmentation in Diffusion Models

Optimisation of biometric ID token s by using hardware/software co-design

OTE: Exploring Accurate Scene Text Recognition Using One token

Other token s matter: Exploring global and local features of Vision Transformers for Object Re-Identification

PACT: Pruning and Clustering-Based token Reduction for Faster Visual Language Models

Partitioned token fusion and pruning strategy for transformer tracking

Patch Ranking: token Pruning as Ranking Prediction for Efficient CLIP

Pedestrian Crossing Intention Prediction via Progressive Multimodal token Fusion for Autonomous Driving

Perception token s Enhance Visual Reasoning in Multimodal Language Models

Picture is Worth More Than 77 Text token s: Evaluating CLIP-Style Models on Dense Captions, A

PointLoRA: Low-Rank Adaptation with token Selection for Point Cloud Learning

Pose-guided token selection for the recognition of activities of daily living

PostoMETRO: Pose token Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery

PPT: token -Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation

PRANCE: Joint token -Optimization and Structural Channel-Pruning for Adaptive ViT Inference

Pred token : Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Prune and Merge: Efficient token Compression for Vision Transformer With Spatial Information Preserved

Prune Spatio-temporal token s by Semantic-aware Temporal Accumulation

Pruning One More token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge

PTET: A progressive token exchanging transformer for infrared and visible image fusion

PVC: Progressive Visual token Compression for Unified Image and Video Processing in Large Vision-Language Models

Pyramid token s-to-Token Vision Transformer for Thyroid Pathology Image Classification

Quantification and Abstraction: Low Level token s for Object Extraction

Random Entangled token s for Adversarially Robust Vision Transformer

Reasoning to Attend: Try to Understand How token Works

Recovering 3D Motion and Structure from Stereo and 2D token Tracking Cooperation

Removing Rows and Columns of token s in Vision Transformer Enables Faster Dense Prediction Without Retraining

Representation Selective Coupling via token Sparsification for Multi-Spectral Object Re-Identification

Request for Clarity over the End of Sequence token in the Self-critical Sequence Training, A

ResiComp: Loss-Resilient Image Compression via Dual-Functional Masked Visual token Modeling

Rethinking token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks

Rethinking visual prompt learning as masked visual token modeling

Revisiting Multimodal Representation in Contrastive Learning: From Patch and token Embeddings to Finite Discrete Tokens

Revisiting token Pruning for Object Detection and Instance Segmentation

RIFormer: Keep Your Vision Backbone Effective But Removing token Mixer

Robust Distance Measures for Face-Recognition Supporting Revocable Biometric token s.

Robust scene text understanding with OCR token and word alignment for Text-VQA and text-caption

Robustifying token Attention for Vision Transformers

Robustness token s: Towards Adversarial Robustness of Transformers

Rollout-Guided token Pruning for Efficient Video Understanding

Salience-based Adaptive Masking: Revisiting token Dynamics for Enhanced Pre-Training

SarAdapter: Prioritizing Attention on Semantic-Aware Representative token s for Enhanced Medical Image Segmentation

SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive token s

SATA: Spatial Autocorrelation token Analysis for Enhancing the Robustness of Vision Transformers

Scale-aware token -matching for transformer-based object detector

Segment Any Event Streams via Weighted Adaptation of Pivotal token s

Seit++: Masked token Modeling Improves Storage-efficient Training

SeiT: Storage-Efficient Vision Training with token s Using 1% of Pixel Storage

Self-Supervised Anomaly Detection from Anomalous Training Data via Iterative Latent token Masking

Self-supervised Video Copy Localization with Regional token Representation

Semantic Prompting with Image token for Continual Learning

SETA: Semantic-Aware Edge-Guided token Augmentation for Domain Generalization

SG-Former: Self-guided Transformer with Evolving token Reallocation

Shunted Self-Attention via Multi-Scale token Aggregation

Simple token -Level Confidence Improves Caption Correctness

Simple yet Effective Layout token in Large Language Models for Document Understanding, A

Sketch token s: A Learned Mid-level Representation for Contour and Object Detection

Smoothest Velocity Field and token Matching Schemes, The

Soft Measure of Visual token Occurrences for Object Categorization

Spatial Positioning token (SPToken) for Smart Mobility

Spatial-Aware token for Weakly Supervised Object Localization

SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft token Pruning

STFormer: An efficient visual Transformer model with sparse attention and adaptive token aggregation

STPM: Spatial-Temporal token Pruning and Merging for Complex Activity Recognition

Strategies for Tracking token s in a Cluttered Scene

Strip-MLP: Efficient token Interaction for Vision MLP

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and token Folding

T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-Specific token Memory

TACo: token -aware Cascade Contrastive Learning for Video-Text Alignment

Taming the curse of dimensionality for perturbed token identification

TCFormer: Visual Recognition via token Clustering Transformer

TCSAFormer: Efficient Vision Transformer With token Compression and Sparse Attention for Medical Image Segmentation

TFRNet: Semantic Segmentation Network with token Filtration and Refinement Method

TFS-ViT: token -level feature stylization for domain generalization

token Aggregation and Selection Hashing for Efficient Underwater Image Retrieval

token Boosting for Robust Self-Supervised Visual Transformer Pre-training

token Calibration for Transformer-Based Domain Adaptation

token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning

token Contrast for Weakly-Supervised Semantic Segmentation

token Cropr: Faster ViTs for Quite a Few Tasks

token Fusion: Bridging the Gap between Token Pruning and Token Merging

token Grouping Based on 3d Motion and Feature Selection in Object Tracking

token labeling-guided multi-scale medical image classification

token Masking Transformer for Weakly Supervised Object Localization

token Merging for Fast Stable Diffusion

token Pooling in Vision Transformers for Image Classification

token pyramid pooling-driven style adapter learning with dual-view balanced loss for imbalanced diabetic retinopathy grading

token Selection is a Simple Booster for Vision Transformers

token Tracking in a Cluttered Scene

token Transformation Matters: Towards Faithful Post-Hoc Explanation for Vision Transformer

token Turing Machines

token Turing Machines are Efficient Vision Models

token -aware and step-aware acceleration for Stable Diffusion

token -based dynamic bit-width assignment for ViT quantization

token -Based Extraction of Straight Lines

token -Based Fingerprint Authentication

token -Based, Patch Based Vision Transformers

token -Consistent Dropout For Calibrated Vision Transformers

token -Label Alignment for Vision Transformers

token -Level Prompt Mixture With Parameter-Free Routing for Federated Domain Generalization

token -Mixer: Bind Image and Text in One Embedding Space for Medical Image Reporting

token -Prediction-Based Post-Processing for Low-Bitrate Speech Coding

token -Textured Object Detection by Pyramids

token -word mixer meets object-aware transformer for referring image segmentation

token Compose: Text-to-Image Diffusion with Token-Level Supervision

token HPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers

token Motion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

token Pose: Learning Keypoint Tokens for Human Pose Estimation

token s-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

TopFormer: token Pyramid Transformer for Mobile Semantic Segmentation

TopV: Compatible token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

TORE: token Recycling in Vision Transformers for Efficient Active Visual Exploration

TORE: token Reduction for Efficient Human Mesh Recovery with Transformer

Toward Unified token Learning for Vision-Language Tracking

Towards Universal Modal Tracking With Online Dense Temporal token Learning

Trajectory-aligned Space-time token s for Few-shot Action Recognition

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive token Dictionary

Transferable Adversarial Attacks on Vision Transformers with token Gradient Regularization

Transformer Compressed Sensing Via Global Image token s

Transformer RGBT Tracking With Spatio-Temporal Multimodal token s

Transformer vision-language tracking via proxy token guided cross-modal fusion

Transformer with token attention and attribute prediction for image captioning

Translating Optical Flow into token Matches

Translating Optical Flow into token Matches and Depth from Looming

TS-CAM: token Semantic Coupled Attention Map for Weakly Supervised Object Localization

TS2-Net: token Shift and Selection Transformer for Text-Video Retrieval

TSVT: token Sparsification Vision Transformer for robust RGB-D salient object detection

TTST: A Top-k token Selective Transformer for Remote Sensing Image Super-Resolution

UMIFormer: Mining the Correlations between Similar token s for Multi-View 3D Reconstruction

Understanding the Effect of using Semantically Meaningful token s for Visual Representation Learning

Unleashing Transformers: Parallel token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Using orientation token s for object recognition

VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware token Sparsification

vid-TLDR: Training Free token merging for Light-Weight Video Transformer

Video, How do Your token s Merge?

VidToMe: Video token Merging for Zero-Shot Video Editing

Vista-llama: Reducing Hallucination in Video Language Models via Equal Distance to Visual token s

VL-Match: Enhancing Vision-Language Pretraining with token -Level and Instance-Level Matching

VLTP: Vision-Language Guided token Pruning for Task-Oriented Segmentation

Which token s to Use? Investigating Token Reduction in Vision Transformers

Window token Concatenation for Efficient Visual Large Language Models

Zero-shot 3D Question Answering via Voxel-based Dynamic token Compression

Zero-TPrune: Zero-Shot token Pruning Through Leveraging of the Attention Graph in Pre-Trained Transformers

320 for token

Improving Autoregressive Visual Generation with Cluster-Oriented	token	Prediction
Improving defocus blur detection via adaptive supervision prior-	token	s
Improving vision transformer for medical image classification via	token	-wise perturbation
Instruction Tuning-free Visual	token	Complement for Multimodal LLMs
Inter-image	token	Relation Learning for weakly supervised semantic segmentation
ISR3: A	token	Database for Integration of Visual Modules
IVTP: Instruction-guided Visual	token	Pruning for Large Vision-language Models
Joint	token	and Feature Alignment Framework for Text-Based Person Search
Joint	token	Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
Labeling of Curvilinear Structure Across Scales by	token	Grouping
language model using variable length	token	s for open-vocabulary Hangul text recognition, A
Large Model Empowered Multi-Modal Semantic Communication With Selective	token	s for Training
Layer-and Timestep-Adaptive Differentiable	token	Compression Ratios for Efficient Diffusion Transformers
Learnable	token	for visual tracking
Learning Multi-Modal Class-Specific	token	s for Weakly Supervised Dense Object Localization
Learning to mask and permute visual	token	s for Vision Transformer pre-training
Learning with Unmasked	token	s Drives Stronger Vision Learners
Leveraging multi-class background description and	token	dictionary representation for hyperspectral anomaly detection
Leveraging per Image-	token	Consistency for Vision-Language Pre-Training
Lightweight Image Super-Resolution with Superpixel	token	Interaction
Llama-vid: An Image is Worth 2	token	s in Large Language Models
LookupVIT: Compressing Visual Information to a Limited Number of	token	s
MADTP: Multimodal Alignment-Guided Dynamic	token	Pruning for Accelerating Vision-Language Transformer
Magic	token	s: Select Diverse Tokens for Multi-modal Object Re-Identification
Magic	token	s: Select Diverse Tokens for Multi-modal Object Re-Identification
Make Your Vit-based Multi-view 3d Detectors Faster via	token	Compression
Making Vision Transformers Efficient from A	token	Sparsification View
ManiTrans: Entity-Level Text-Guided Image Manipulation via	token	-wise Semantic Alignment and Generation
MAPM: PolSAR Image Classification with Masked Autoencoder Based on Position Prediction and Memory	token	s
Mask-Guided Transformer Network with Topic	token	for Remote Sensing Image Captioning, A
Masked Reference	token	Supervision-Based Iterative Visual-Language Framework for Robust Visual Grounding, A
MatteFormer: Transformer-Based Image Matting via Prior-	token	s
MCTformer+: Multi-Class	token	Transformer for Weakly Supervised Semantic Segmentation
MedoidsFormer: A Strong 3D Object Detection Backbone by Exploiting Interaction With Adjacent Medoid	token	s
Memory-	token	Transformer for Unsupervised Video Anomaly Detection
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled	token	Merging and Quantization
Method for identification of	token	s in video sequences
METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert	token	s
Mining representative	token	s via transformer-based multi-modal interaction for RGB-T tracking
MMoT: Mixture-of-Modality-	token	s Transformer for Composed Multimodal Conditional Image Synthesis
MonoATT: Online Monocular 3D Object Detection with Adaptive	token	Transformer
Morphological image processing on a	token	passing pyramid computer
MovieChat: From Dense	token	to Sparse Memory for Long Video Understanding
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger	token	s
MST: Adaptive Multi-Scale	token	s Guided Interactive Segmentation
Multi-class	token	Transformer for Weakly Supervised Semantic Segmentation
Multi-Criteria	token	Fusion with One-Step-Ahead Attention for Efficient Vision Transformers
Multi-Faceted Adaptive	token	Pruning for Efficient Remote Sensing Image Segmentation
Multi-modal interaction with	token	division strategy for RGB-T tracking
Multi-Scale	token	s-Aware Transformer Network for Multi-Region and Multi-Sequence MR-to-CT Synthesis in a Single Model
Multi-schema prompting powered	token	-feature woven attention network for short text classification
Multi-user VR Experience for Creating and Trading Non-fungible	token	s
Multimodal	token	Fusion for Vision Transformers
MVFormer: Diversifying feature normalization and	token	mixing for efficient vision transformers
New Coeff-	token	Decoding Method With Efficient Memory Access in H.264/AVC Video Coding Standard, A
No	token	Left Behind: Explainability-Aided Image Classification and Generation
Not All	token	s Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Not All	token	s Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Object Discovery from Motion-Guided	token	s
Object Recognition as Next	token	Prediction
Omni-RGPT: Unifying Image and Video Region-level Understanding via	token	Marks
On Correspondence, Line	token	s And Missing Tokens
On Correspondence, Line	token	s And Missing Tokens
Open-Vocabulary Attention Maps with	token	Optimization for Semantic Segmentation in Diffusion Models
Optimisation of biometric ID	token	s by using hardware/software co-design
OTE: Exploring Accurate Scene Text Recognition Using One	token
Other	token	s matter: Exploring global and local features of Vision Transformers for Object Re-Identification
PACT: Pruning and Clustering-Based	token	Reduction for Faster Visual Language Models
Partitioned	token	fusion and pruning strategy for transformer tracking
Patch Ranking:	token	Pruning as Ranking Prediction for Efficient CLIP
Pedestrian Crossing Intention Prediction via Progressive Multimodal	token	Fusion for Autonomous Driving
Perception	token	s Enhance Visual Reasoning in Multimodal Language Models
Picture is Worth More Than 77 Text	token	s: Evaluating CLIP-Style Models on Dense Captions, A
PointLoRA: Low-Rank Adaptation with	token	Selection for Point Cloud Learning
Pose-guided	token	selection for the recognition of activities of daily living
PostoMETRO: Pose	token	Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery
PPT:	token	-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation
PRANCE: Joint	token	-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
Pred	token	: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding
Prune and Merge: Efficient	token	Compression for Vision Transformer With Spatial Information Preserved
Prune Spatio-temporal	token	s by Semantic-aware Temporal Accumulation
Pruning One More	token	is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
PTET: A progressive	token	exchanging transformer for infrared and visible image fusion
PVC: Progressive Visual	token	Compression for Unified Image and Video Processing in Large Vision-Language Models
Pyramid	token	s-to-Token Vision Transformer for Thyroid Pathology Image Classification
Pyramid	token	s-to-Token Vision Transformer for Thyroid Pathology Image Classification
Quantification and Abstraction: Low Level	token	s for Object Extraction
Random Entangled	token	s for Adversarially Robust Vision Transformer
Reasoning to Attend: Try to Understand How	token	Works
Recovering 3D Motion and Structure from Stereo and 2D	token	Tracking Cooperation
Removing Rows and Columns of	token	s in Vision Transformer Enables Faster Dense Prediction Without Retraining
Representation Selective Coupling via	token	Sparsification for Multi-Spectral Object Re-Identification
Request for Clarity over the End of Sequence	token	in the Self-critical Sequence Training, A
ResiComp: Loss-Resilient Image Compression via Dual-Functional Masked Visual	token	Modeling
Rethinking	token	Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks
Rethinking visual prompt learning as masked visual	token	modeling
Revisiting Multimodal Representation in Contrastive Learning: From Patch and	token	Embeddings to Finite Discrete Tokens
Revisiting Multimodal Representation in Contrastive Learning: From Patch and	token	Embeddings to Finite Discrete Tokens
Revisiting	token	Pruning for Object Detection and Instance Segmentation
RIFormer: Keep Your Vision Backbone Effective But Removing	token	Mixer
Robust Distance Measures for Face-Recognition Supporting Revocable Biometric	token	s.
Robust scene text understanding with OCR	token	and word alignment for Text-VQA and text-caption
Robustifying	token	Attention for Vision Transformers
Robustness	token	s: Towards Adversarial Robustness of Transformers
Rollout-Guided	token	Pruning for Efficient Video Understanding
Salience-based Adaptive Masking: Revisiting	token	Dynamics for Enhanced Pre-Training
SarAdapter: Prioritizing Attention on Semantic-Aware Representative	token	s for Enhanced Medical Image Segmentation
SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive	token	s
SATA: Spatial Autocorrelation	token	Analysis for Enhancing the Robustness of Vision Transformers
Scale-aware	token	-matching for transformer-based object detector
Segment Any Event Streams via Weighted Adaptation of Pivotal	token	s
Seit++: Masked	token	Modeling Improves Storage-efficient Training
SeiT: Storage-Efficient Vision Training with	token	s Using 1% of Pixel Storage
Self-Supervised Anomaly Detection from Anomalous Training Data via Iterative Latent	token	Masking
Self-supervised Video Copy Localization with Regional	token	Representation
Semantic Prompting with Image	token	for Continual Learning
SETA: Semantic-Aware Edge-Guided	token	Augmentation for Domain Generalization
SG-Former: Self-guided Transformer with Evolving	token	Reallocation
Shunted Self-Attention via Multi-Scale	token	Aggregation
Simple	token	-Level Confidence Improves Caption Correctness
Simple yet Effective Layout	token	in Large Language Models for Document Understanding, A
Sketch	token	s: A Learned Mid-level Representation for Contour and Object Detection
Smoothest Velocity Field and	token	Matching Schemes, The
Soft Measure of Visual	token	Occurrences for Object Categorization
Spatial Positioning	token	(SPToken) for Smart Mobility
Spatial-Aware	token	for Weakly Supervised Object Localization
SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft	token	Pruning
STFormer: An efficient visual Transformer model with sparse attention and adaptive	token	aggregation
STPM: Spatial-Temporal	token	Pruning and Merging for Complex Activity Recognition
Strategies for Tracking	token	s in a Cluttered Scene
Strip-MLP: Efficient	token	Interaction for Vision MLP
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and	token	Folding
T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-Specific	token	Memory
TACo:	token	-aware Cascade Contrastive Learning for Video-Text Alignment
Taming the curse of dimensionality for perturbed	token	identification
TCFormer: Visual Recognition via	token	Clustering Transformer
TCSAFormer: Efficient Vision Transformer With	token	Compression and Sparse Attention for Medical Image Segmentation
TFRNet: Semantic Segmentation Network with	token	Filtration and Refinement Method
TFS-ViT:	token	-level feature stylization for domain generalization
	token	Aggregation and Selection Hashing for Efficient Underwater Image Retrieval
	token	Boosting for Robust Self-Supervised Visual Transformer Pre-training
	token	Calibration for Transformer-Based Domain Adaptation
	token	Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning
	token	Contrast for Weakly-Supervised Semantic Segmentation
	token	Cropr: Faster ViTs for Quite a Few Tasks
	token	Fusion: Bridging the Gap between Token Pruning and Token Merging
	token	Fusion: Bridging the Gap between Token Pruning and Token Merging
	token	Fusion: Bridging the Gap between Token Pruning and Token Merging
	token	Grouping Based on 3d Motion and Feature Selection in Object Tracking
	token	labeling-guided multi-scale medical image classification
	token	Masking Transformer for Weakly Supervised Object Localization
	token	Merging for Fast Stable Diffusion
	token	Pooling in Vision Transformers for Image Classification
	token	pyramid pooling-driven style adapter learning with dual-view balanced loss for imbalanced diabetic retinopathy grading
	token	Selection is a Simple Booster for Vision Transformers
	token	Tracking in a Cluttered Scene
	token	Transformation Matters: Towards Faithful Post-Hoc Explanation for Vision Transformer
	token	Turing Machines
	token	Turing Machines are Efficient Vision Models
	token	-aware and step-aware acceleration for Stable Diffusion
	token	-based dynamic bit-width assignment for ViT quantization
	token	-Based Extraction of Straight Lines
	token	-Based Fingerprint Authentication
	token	-Based, Patch Based Vision Transformers
	token	-Consistent Dropout For Calibrated Vision Transformers
	token	-Label Alignment for Vision Transformers
	token	-Level Prompt Mixture With Parameter-Free Routing for Federated Domain Generalization
	token	-Mixer: Bind Image and Text in One Embedding Space for Medical Image Reporting
	token	-Prediction-Based Post-Processing for Low-Bitrate Speech Coding
	token	-Textured Object Detection by Pyramids
	token	-word mixer meets object-aware transformer for referring image segmentation
	token	Compose: Text-to-Image Diffusion with Token-Level Supervision
	token	HPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers
	token	Motion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation
	token	Pose: Learning Keypoint Tokens for Human Pose Estimation
	token	s-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
	token	s-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
TopFormer:	token	Pyramid Transformer for Mobile Semantic Segmentation
TopV: Compatible	token	Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
TORE:	token	Recycling in Vision Transformers for Efficient Active Visual Exploration
TORE:	token	Reduction for Efficient Human Mesh Recovery with Transformer
Toward Unified	token	Learning for Vision-Language Tracking
Towards Universal Modal Tracking With Online Dense Temporal	token	Learning
Trajectory-aligned Space-time	token	s for Few-shot Action Recognition
Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive	token	Dictionary
Transferable Adversarial Attacks on Vision Transformers with	token	Gradient Regularization
Transformer Compressed Sensing Via Global Image	token	s
Transformer RGBT Tracking With Spatio-Temporal Multimodal	token	s
Transformer vision-language tracking via proxy	token	guided cross-modal fusion
Transformer with	token	attention and attribute prediction for image captioning
Translating Optical Flow into	token	Matches
Translating Optical Flow into	token	Matches and Depth from Looming
TS-CAM:	token	Semantic Coupled Attention Map for Weakly Supervised Object Localization
TS2-Net:	token	Shift and Selection Transformer for Text-Video Retrieval
TSVT:	token	Sparsification Vision Transformer for robust RGB-D salient object detection
TTST: A Top-k	token	Selective Transformer for Remote Sensing Image Super-Resolution
UMIFormer: Mining the Correlations between Similar	token	s for Multi-View 3D Reconstruction
Understanding the Effect of using Semantically Meaningful	token	s for Visual Representation Learning
Unleashing Transformers: Parallel	token	Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes
Using orientation	token	s for object recognition
VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware	token	Sparsification
vid-TLDR: Training Free	token	merging for Light-Weight Video Transformer
Video, How do Your	token	s Merge?
VidToMe: Video	token	Merging for Zero-Shot Video Editing
Vista-llama: Reducing Hallucination in Video Language Models via Equal Distance to Visual	token	s
VL-Match: Enhancing Vision-Language Pretraining with	token	-Level and Instance-Level Matching
VLTP: Vision-Language Guided	token	Pruning for Task-Oriented Segmentation
Which	token	s to Use? Investigating Token Reduction in Vision Transformers
Which	token	s to Use? Investigating Token Reduction in Vision Transformers
Window	token	Concatenation for Efficient Visual Large Language Models
Zero-shot 3D Question Answering via Voxel-based Dynamic	token	Compression
Zero-TPrune: Zero-Shot	token	Pruning Through Leveraging of the Attention Graph in Pre-Trained Transformers

_ tokenbinder _

tokenbinder : Text-Video Retrieval with One-to-Many Alignment Paradigm

_ tokencompose _

tokencompose : Text-to-Image Diffusion with Token-Level Supervision

_ tokencut _

tokencut : Segmenting Objects in Images and Videos With Self-Supervised Transformer and Normalized Cut

_ tokenflow _

tokenflow : Unified Image Tokenizer for Multimodal Understanding and Generation

_ tokenfocus _

tokenfocus -VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs

_ tokenhmr _

tokenhmr : Advancing Human Mesh Recovery with a Tokenized Pose Representation

_ tokenhpe _

tokenhpe : Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers

_ tokenhsi _

tokenhsi : Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

_ tokenised _

Biohashing: Two factor authentication featuring fingerprint data and tokenised random number

Cancellable biometrics featuring with tokenised random number

Integrated Dual Factor Authenticator Based on the Face Data and tokenised Random Number, An

multi-modal method based on the competitors of FVC2004 and on palm data combined with tokenised random numbers, A

_ tokenization _

Beyond local patches: Preserving global-local interactions by enhancing self-attention via 3D point cloud tokenization

Channel-Reduced Transformer With Cross-Region tokenization for Hyperspectral Image Classification

Efficient Long Video tokenization via Coordinate-based Patch Reconstruction

Frame-level Feature tokenization Learning for Human Body Pose and Shape Estimation

Gaussian Segmentation and tokenization for Low Cost Language Identification

GROMA: Localized Visual tokenization for Grounding Multimodal Large Language Models

Language-Guided Image tokenization for Generation

MoST: Multi-modality Scene tokenization for Motion Prediction

MSViT: Dynamic Mixed-scale tokenization for Vision Transformers

Neural Sign Language Translation by Learning tokenization

Prior tokenization -based interactive segmentation with Vision Transformers

Region-native Visual tokenization

Scale-free and unbiased transformer with tokenization for cell type annotation from single-cell RNA-seq data

Scaling Mesh Generation via Compressive tokenization

Segment This Thing: Foveated tokenization for Efficient Point-Prompted Segmentation

TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task tokenization

Tuning Large Language Model for Speech Recognition With Mixed-Scale Re- tokenization

Video Question Answering with Iterative Video-Text Co- tokenization

Video Segmentation and tokenization for Model-Based Video Scene Classification

Vision Transformers with Mixed-Resolution tokenization

Voxel-MPI: Scene-adaptive multiplane images based local voxel tokenization with attention coordination for 3D scene representation

Zero-Shot Sketch-Based Remote-Sensing Image Retrieval Based on Multi-Level and Attention-Guided tokenization

22 for tokenization

_ tokenize _

tokenize Anything via Prompting

tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images

_ tokenized _

Closed-Loop Supervised Fine-Tuning of tokenized Traffic Models

DistilPose: tokenized Pose Regression with Heatmap Distillation

Regularized Vector Quantization for tokenized Image Synthesis

SDPose: tokenized Pose Estimation via Circulation-Guide Self-Distillation

SSTNet: Saliency sparse transformers network with tokenized dilation for salient object detection

TM2T: Stochastic and tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

TokenHMR: Advancing Human Mesh Recovery with a tokenized Pose Representation

tokenized Generative Speech Enhancement With Language Model and Flow Matching

8 for tokenized

_ tokenizer _

CSLT-AK: Convolutional-embedded transformer with an action tokenizer and keypoint emphasizer for sign language translation

Divot: Diffusion Powers Video tokenizer for Comprehension and Generation

epsilon-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized tokenizer

Finite Scalar Quantization as Facial tokenizer for Dyadic Reaction Generation

H2OT: Hierarchical Hourglass tokenizer for Efficient Video Pose Transformers

Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding

Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding

Hourglass tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

Rethinking the Objectives of Vector-Quantized tokenizer s for Image Synthesis

Revisiting Kernel Temporal Segmentation as an Adaptive tokenizer for Long-form Video Understanding

SoftVQ-VAE: Efficient 1-Dimensional Continuous tokenizer

StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized tokenizer of a Large-Scale Generative Model

TokenFlow: Unified Image tokenizer for Multimodal Understanding and Generation

What Makes for Good tokenizer s in Vision Transformer?

14 for tokenizer

_ tokenless _

tokenless Cancelable Biometrics Scheme for Protecting Iris Codes

_ tokenmix _

tokenmix : Rethinking Image Mixing for Data Augmentation in Vision Transformers

_ tokenmotion _

tokenmotion : Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

_ tokenpacker _

tokenpacker : Efficient Visual Projector for Multimodal LLM

_ tokenpose _

tokenpose : Learning Keypoint Tokens for Human Pose Estimation

Index for "t"

Last update: 26-Feb-26 11:52:11
Use price@usc.edu for comments.