14.5.10.6.1 Token-Based, Patch Based Vision Transformers

Chapter Contents
Vision Transformers. Transformers. Patch Based. Tokens.
See also Object Detection Using Transformers.

Jiang, B.[Bo], Zhao, K.K.[Kang-Kang], Tang, J.[Jin],
RGTransformer: Region-Graph Transformer for Image Representation and Few-Shot Classification,
SPLetters(29), 2022, pp. 792-796.
IEEE DOI 2204
Measurement, Transformers, Image representation, Feature extraction, Visualization, transformer BibRef

Kim, B.[Boah], Kim, J.[Jeongsol], Ye, J.C.[Jong Chul],
Task-Agnostic Vision Transformer for Distributed Learning of Image Processing,
IP(32), 2023, pp. 203-218.
IEEE DOI 2301
Task analysis, Transformers, Servers, Distance learning, Computer aided instruction, Tail, Head, Distributed learning, task-agnostic learning BibRef

Park, S.[Sangjoon], Ye, J.C.[Jong Chul],
Multi-Task Distributed Learning Using Vision Transformer With Random Patch Permutation,
MedImg(42), No. 7, July 2023, pp. 2091-2105.
IEEE DOI 2307
Task analysis, Transformers, Head, Tail, Servers, Multitasking, Distance learning, Federated learning, split learning, privacy preservation BibRef

Kim, B.J.[Bum Jun], Choi, H.[Hyeyeon], Jang, H.[Hyeonah], Lee, D.G.[Dong Gu], Jeong, W.[Wonseok], Kim, S.W.[Sang Woo],
Improved robustness of vision transformers via prelayernorm in patch embedding,
PR(141), 2023, pp. 109659.
Elsevier DOI 2306
Vision transformer, Patch embedding, Contrast enhancement, Robustness, Layer normalization, Convolutional neural network, Deep learning BibRef

Zhou, D.[Daquan], Hou, Q.[Qibin], Yang, L.J.[Lin-Jie], Jin, X.J.[Xiao-Jie], Feng, J.S.[Jia-Shi],
Token Selection is a Simple Booster for Vision Transformers,
PAMI(45), No. 11, November 2023, pp. 12738-12746.
IEEE DOI 2310
BibRef

Feng, Z.Z.[Zhan-Zhou], Zhang, S.L.[Shi-Liang],
Efficient Vision Transformer via Token Merger,
IP(32), 2023, pp. 4156-4169.
IEEE DOI 2307
Corporate acquisitions, Transformers, Semantics, Task analysis, Visualization, Merging, Computational efficiency, sparse representation BibRef

Qian, S.J.[Sheng-Ju], Zhu, Y.[Yi], Li, W.B.[Wen-Bo], Li, M.[Mu], Jia, J.Y.[Jia-Ya],
What Makes for Good Tokenizers in Vision Transformer?,
PAMI(45), No. 11, November 2023, pp. 13011-13023.
IEEE DOI 2310
BibRef

Fu, K.[Kexue], Yuan, M.Z.[Ming-Zhi], Liu, S.L.[Shao-Lei], Wang, M.[Manning],
Boosting Point-BERT by Multi-Choice Tokens,
CirSysVideo(34), No. 1, January 2024, pp. 438-447.
IEEE DOI 2401
Self-supervised pre-training task.
See also Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling. BibRef

Yan, F.Y.[Fang-Yuan], Yan, B.[Bin], Liang, W.[Wei], Pei, M.T.[Ming-Tao],
Token labeling-guided multi-scale medical image classification,
PRL(178), 2024, pp. 28-34.
Elsevier DOI 2402
Medical image classification, Vision transformer, Token labeling BibRef

Li, Y.X.[Yue-Xiang], Huang, Y.W.[Ya-Wen], He, N.[Nanjun], Ma, K.[Kai], Zheng, Y.F.[Ye-Feng],
Improving vision transformer for medical image classification via token-wise perturbation,
JVCIR(98), 2024, pp. 104022.
Elsevier DOI 2402
Self-supervised learning, Vision transformer, Image classification BibRef

Kang, J.Y.[Jun-Yong], Heo, B.[Byeongho], Choe, J.[Junsuk],
Improving ViT interpretability with patch-level mask prediction,
PRL(187), 2025, pp. 73-79.
Elsevier DOI 2501
Vision Transformer, Interpretability, Weak supervision, Object localization BibRef

Arya, R.K.[Rajat Kumar], Peddi, R.[Rohith], Srivastava, R.[Rajeev],
Hyperspectral image classification using hybrid convolutional-based cross-patch retentive network,
CVIU(257), 2025, pp. 104382.
Elsevier DOI 2505
Hyperspectral images, Classification, Feature extraction, Convolutional neural networks, Retention mechanism BibRef

Niu, Y.[Yi], Song, Z.C.[Zhuo-Chen], Luo, Q.Y.[Qing-Yu], Chen, G.C.[Guo-Chao], Ma, M.M.[Ming-Ming], Li, F.[Fu],
ATMformer: An Adaptive Token Merging Vision Transformer for Remote Sensing Image Scene Classification,
RS(17), No. 4, 2025, pp. 660.
DOI Link 2502
Downsample to improve computation. BibRef

Wang, Y.C.[Yan-Cheng], Yang, Y.Z.[Ying-Zhen],
Efficient Visual Transformer by Learnable Token Merging,
PAMI(47), No. 11, November 2025, pp. 9597-9608.
IEEE DOI 2510
Transformers, Merging, Visualization, Upper bound, Accuracy, Training, Mutual information, Deep learning, compact transformer networks BibRef

Bergner, B.[Benjamin], Lippert, C.[Christoph], Mahendran, A.[Aravindh],
Token Cropr: Faster ViTs for Quite a Few Tasks,
CVPR25(9740-9750)
IEEE DOI 2508
Instance segmentation, Training, Head, Semantic segmentation, Semantics, Object detection, Throughput, Transformers, token pruning, neural networks BibRef

Dang, C.X.[Chen-Xu], Duan, Z.[Zaipeng], An, P.[Pei], Zhang, X.M.[Xin-Min], Hu, X.[Xuzhong], Ma, J.[Jie],
FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection,
CVPR25(17029-17038)
IEEE DOI Code:
WWW Link. 2508
Laser radar, Detectors, Object detection, Transformers, Robustness, Complexity theory, Spatiotemporal phenomena, Proposals, adaptive scaling BibRef

Olszewski, J.[Jan], Rymarczyk, D.[Dawid], Wójcik, P.[Piotr], Pach, M.[Mateusz], Zielinski, B.[Bartosz],
TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration,
WACV25(8606-8616)
IEEE DOI 2505
Visualization, Autoencoders, Transformers, Recycling, Decoding BibRef

Eliopoulos, N.J.[Nicholas John], Jajal, P.[Purvish], Davis, J.C.[James C.], Liu, G.[Gaowen], Thiravathukal, G.K.[George K.], Lu, Y.H.[Yung-Hsiang],
Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge,
WACV25(7153-7162)
IEEE DOI 2505
Degradation, Schedules, Accuracy, Image edge detection, Merging, Neural networks, Transformers, Market research, vision transformer, token sparsification BibRef

Koner, R.[Rajat], Jain, G.[Gagan], Jain, P.[Prateek], Tresp, V.[Volker], Paul, S.[Sujoy],
LookupVIT: Compressing Visual Information to a Limited Number of Tokens,
ECCV24(LXXXVI: 322-337).
Springer DOI 2412
BibRef

Jie, S.[Shibo], Tang, Y.H.[Ye-Hui], Guo, J.Y.[Jian-Yuan], Deng, Z.H.[Zhi-Hong], Han, K.[Kai], Wang, Y.H.[Yun-He],
Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning,
ECCV24(XVI: 76-94).
Springer DOI 2412
BibRef

Huang, W.X.[Wen-Xuan], Shen, Y.H.[Yun-Hang], Xie, J.[Jiao], Zhang, B.C.[Bao-Chang], He, G.Q.[Gao-Qi], Li, K.[Ke], Sun, X.[Xing], Lin, S.H.[Shao-Hui],
A General and Efficient Training for Transformer via Token Expansion,
CVPR24(15783-15792)
IEEE DOI Code:
WWW Link. 2410
Training, Accuracy, Costs, Codes, Pipelines, Computer architecture BibRef

Wu, J.[Junyi], Duan, B.[Bin], Kang, W.T.[Wei-Tai], Tang, H.[Hao], Yan, Y.[Yan],
Token Transformation Matters: Towards Faithful Post-Hoc Explanation for Vision Transformer,
CVPR24(10926-10935)
IEEE DOI 2410
Visualization, Correlation, Computational modeling, Perturbation methods, Predictive models, Length measurement, Explainability BibRef

Yu, Q.[Qing], Tanaka, M.[Mikihiro], Fujiwara, K.[Kent],
Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches,
CVPR24(937-946)
IEEE DOI 2410
Training, Solid modeling, Computational modeling, Transfer learning, Transformers, Motion-Language Models, Text-Motion Retrieval BibRef

Yuan, X.[Xin], Fei, H.L.[Hong-Liang], Baek, J.[Jinoo],
Efficient Transformer Adaptation with Soft Token Merging,
LargeVM24(3658-3668)
IEEE DOI 2410
Training, Accuracy, Costs, Merging, Video sequences, Optimization methods, Transformers BibRef

Xu, X.[Xuwei], Wang, S.[Sen], Chen, Y.D.[Yu-Dong], Zheng, Y.P.[Yan-Ping], Wei, Z.W.[Zhe-Wei], Liu, J.J.[Jia-Jun],
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation,
WACV24(86-95)
IEEE DOI Code:
WWW Link. 2404
Source coding, Computational modeling, Merging, Broadcasting, Transformers, Computational complexity, Algorithms BibRef

Ding, S.R.[Shuang-Rui], Zhao, P.S.[Pei-Sen], Zhang, X.P.[Xiao-Peng], Qian, R.[Rui], Xiong, H.K.[Hong-Kai], Tian, Q.[Qi],
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation,
ICCV23(16899-16910)
IEEE DOI Code:
WWW Link. 2401
BibRef

Guo, Y.[Yong], Stutz, D.[David], Schiele, B.[Bernt],
Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions,
CVPR23(4108-4118)
IEEE DOI 2309
BibRef

Xie, W.[Wei], Zhao, Z.[Zimeng], Li, S.Y.[Shi-Ying], Zuo, B.H.[Bing-Hui], Wang, Y.G.[Yan-Gang],
Nonrigid Object Contact Estimation With Regional Unwrapping Transformer,
ICCV23(9308-9317)
IEEE DOI 2401
BibRef

Nalmpantis, A.[Angelos], Panagiotopoulos, A.[Apostolos], Gkountouras, J.[John], Papakostas, K.[Konstantinos], Aziz, W.[Wilker],
Vision DiffMask: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking,
XAI4CV23(3756-3763)
IEEE DOI 2309
BibRef

Beyer, L.[Lucas], Izmailov, P.[Pavel], Kolesnikov, A.[Alexander], Caron, M.[Mathilde], Kornblith, S.[Simon], Zhai, X.H.[Xiao-Hua], Minderer, M.[Matthias], Tschannen, M.[Michael], Alabdulmohsin, I.[Ibrahim], Pavetic, F.[Filip],
FlexiViT: One Model for All Patch Sizes,
CVPR23(14496-14506)
IEEE DOI 2309
BibRef

Chang, S.N.[Shu-Ning], Wang, P.[Pichao], Lin, M.[Ming], Wang, F.[Fan], Zhang, D.J.H.[David Jun-Hao], Jin, R.[Rong], Shou, M.Z.[Mike Zheng],
Making Vision Transformers Efficient from A Token Sparsification View,
CVPR23(6195-6205)
IEEE DOI 2309
BibRef

Phan, L.[Lam], Nguyen, H.T.H.[Hiep Thi Hong], Warrier, H.[Harikrishna], Gupta, Y.[Yogesh],
Patch Embedding as Local Features: Unifying Deep Local and Global Features via Vision Transformer for Image Retrieval,
ACCV22(II:204-221).
Springer DOI 2307
BibRef

Liu, Y.[Yue], Matsoukas, C.[Christos], Strand, F.[Fredrik], Azizpour, H.[Hossein], Smith, K.[Kevin],
PatchDropout: Economizing Vision Transformers Using Patch Dropout,
WACV23(3942-3951)
IEEE DOI 2302
Training, Image resolution, Computational modeling, Biological system modeling, Memory management, Transformers, Biomedical/healthcare/medicine BibRef

Havtorn, J.D.[Jakob Drachmann], Royer, A.[Amélie], Blankevoort, T.[Tijmen], Bejnordi, B.E.[Babak Ehteshami],
MSViT: Dynamic Mixed-scale Tokenization for Vision Transformers,
NIVT23(838-848)
IEEE DOI 2401
BibRef

Haurum, J.B.[Joakim Bruslund], Escalera, S.[Sergio], Taylor, G.W.[Graham W.], Moeslund, T.B.[Thomas B.],
Which Tokens to Use? Investigating Token Reduction in Vision Transformers,
NIVT23(773-783)
IEEE DOI Code:
WWW Link. 2401
BibRef

Ren, S.[Sucheng], Yang, X.Y.[Xing-Yi], Liu, S.[Songhua], Wang, X.C.[Xin-Chao],
SG-Former: Self-guided Transformer with Evolving Token Reallocation,
ICCV23(5980-5991)
IEEE DOI Code:
WWW Link. 2401
BibRef

Xiao, H.[Han], Zheng, W.Z.[Wen-Zhao], Zhu, Z.[Zheng], Zhou, J.[Jie], Lu, J.W.[Ji-Wen],
Token-Label Alignment for Vision Transformers,
ICCV23(5472-5481)
IEEE DOI Code:
WWW Link. 2401
BibRef

Popovic, N.[Nikola], Paudel, D.P.[Danda Pani], Probst, T.[Thomas], Van Gool, L.J.[Luc J.],
Token-Consistent Dropout For Calibrated Vision Transformers,
ICIP23(1030-1034)
IEEE DOI 2312
BibRef

Wei, S.Y.[Si-Yuan], Ye, T.Z.[Tian-Zhu], Zhang, S.[Shen], Tang, Y.[Yao], Liang, J.J.[Jia-Jun],
Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers,
CVPR23(2092-2101)
IEEE DOI 2309
BibRef

Zhang, J.P.[Jian-Ping], Huang, Y.Z.[Yi-Zhan], Wu, W.B.[Wei-Bin], Lyu, M.R.[Michael R.],
Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization,
CVPR23(16415-16424)
IEEE DOI 2309
BibRef

Ronen, T.[Tomer], Levy, O.[Omer], Golbert, A.[Avram],
Vision Transformers with Mixed-Resolution Tokenization,
ECV23(4613-4622)
IEEE DOI 2309
BibRef

Lorenzana, M.B.[Marlon Bran], Engstrom, C.[Craig], Chandra, S.S.[Shekhar S.],
Transformer Compressed Sensing Via Global Image Tokens,
ICIP22(3011-3015)
IEEE DOI 2211
Training, Limiting, Image resolution, Neural networks, Image representation, Transformers, MRI BibRef

Fayyaz, M.[Mohsen], Koohpayegani, S.A.[Soroush Abbasi], Jafari, F.R.[Farnoush Rezaei], Sengupta, S.[Sunando], Joze, H.R.V.[Hamid Reza Vaezi], Sommerlade, E.[Eric], Pirsiavash, H.[Hamed], Gall, J.[Jürgen],
Adaptive Token Sampling for Efficient Vision Transformers,
ECCV22(XI:396-414).
Springer DOI 2211
BibRef

Kong, Z.L.[Zheng-Lun], Dong, P.Y.[Pei-Yan], Ma, X.L.[Xiao-Long], Meng, X.[Xin], Niu, W.[Wei], Sun, M.S.[Meng-Shu], Shen, X.[Xuan], Yuan, G.[Geng], Ren, B.[Bin], Tang, H.[Hao], Qin, M.H.[Ming-Hai], Wang, Y.Z.[Yan-Zhi],
SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning,
ECCV22(XI:620-640).
Springer DOI 2211
BibRef

Fang, J.[Jiemin], Xie, L.X.[Ling-Xi], Wang, X.G.[Xing-Gang], Zhang, X.P.[Xiao-Peng], Liu, W.Y.[Wen-Yu], Tian, Q.[Qi],
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens,
CVPR22(12053-12062)
IEEE DOI 2210
Deep learning, Visualization, Neural networks, Graphics processing units, retrieval BibRef

Yin, H.X.[Hong-Xu], Vahdat, A.[Arash], Alvarez, J.M.[Jose M.], Mallya, A.[Arun], Kautz, J.[Jan], Molchanov, P.[Pavlo],
A-ViT: Adaptive Tokens for Efficient Vision Transformer,
CVPR22(10799-10808)
IEEE DOI 2210
Training, Adaptive systems, Network architecture, Transformers, Throughput, Hardware, Complexity theory, Efficient learning and inferences BibRef

Gu, J.D.[Jin-Dong], Tresp, V.[Volker], Qin, Y.[Yao],
Are Vision Transformers Robust to Patch Perturbations?,
ECCV22(XII:404-421).
Springer DOI 2211
BibRef

Li, Z.K.[Zhi-Kai], Ma, L.P.[Li-Ping], Chen, M.J.[Meng-Juan], Xiao, J.R.[Jun-Rui], Gu, Q.Y.[Qing-Yi],
Patch Similarity Aware Data-Free Quantization for Vision Transformers,
ECCV22(XI:154-170).
Springer DOI 2211
BibRef

Yun, S.[Sukmin], Lee, H.[Hankook], Kim, J.[Jaehyung], Shin, J.[Jinwoo],
Patch-level Representation Learning for Self-supervised Vision Transformers,
CVPR22(8344-8353)
IEEE DOI 2210
Training, Representation learning, Visualization, Neural networks, Object detection, Self-supervised learning, Transformers, Self- semi- meta- unsupervised learning BibRef

Salman, H.[Hadi], Jain, S.[Saachi], Wong, E.[Eric], Madry, A.[Aleksander],
Certified Patch Robustness via Smoothed Vision Transformers,
CVPR22(15116-15126)
IEEE DOI 2210
Visualization, Smoothing methods, Costs, Computational modeling, Transformers, Adversarial attack and defense BibRef

Tang, Y.H.[Ye-Hui], Han, K.[Kai], Wang, Y.H.[Yun-He], Xu, C.[Chang], Guo, J.Y.[Jian-Yuan], Xu, C.[Chao], Tao, D.C.[Da-Cheng],
Patch Slimming for Efficient Vision Transformers,
CVPR22(12155-12164)
IEEE DOI 2210
Visualization, Quantization (signal), Computational modeling, Aggregates, Benchmark testing, Representation learning BibRef

Chen, Z.Y.[Zhao-Yu], Li, B.[Bo], Wu, S.[Shuang], Xu, J.H.[Jiang-He], Ding, S.H.[Shou-Hong], Zhang, W.Q.[Wen-Qiang],
Shape Matters: Deformable Patch Attack,
ECCV22(IV:529-548).
Springer DOI 2211
BibRef

Chen, Z.Y.[Zhao-Yu], Li, B.[Bo], Xu, J.H.[Jiang-He], Wu, S.[Shuang], Ding, S.H.[Shou-Hong], Zhang, W.Q.[Wen-Qiang],
Towards Practical Certifiable Patch Defense with Vision Transformer,
CVPR22(15127-15137)
IEEE DOI 2210
Smoothing methods, Toy manufacturing industry, Semantics, Network architecture, Transformers, Robustness, Adversarial attack and defense BibRef

Yuan, L.[Li], Chen, Y.P.[Yun-Peng], Wang, T.[Tao], Yu, W.H.[Wei-Hao], Shi, Y.J.[Yu-Jun], Jiang, Z.H.[Zi-Hang], Tay, F.E.H.[Francis E. H.], Feng, J.S.[Jia-Shi], Yan, S.C.[Shui-Cheng],
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet,
ICCV21(538-547)
IEEE DOI 2203
Training, Image resolution, Computational modeling, Image edge detection, Transformers BibRef

Chapter on Pattern Recognition, Clustering, Statistics, Grammars, Learning, Neural Nets, Genetic Algorithms continues in
Attention in Vision Transformers.


Last update: Nov 26, 2025 at 20:24:09