7.1.7.2 Object Detection Using Transformers

Object Detection. Transformer.
See also Vision Transformers, ViT.
See also Vision Transformers for Semantic Segmentation.
See also SWIN Transformer.
See also Detection Transformer, DETR Applications.
See also Patch Based Vision Transformers.

Li, Y.[Yehao], Yao, T.[Ting], Pan, Y.W.[Ying-Wei], Mei, T.[Tao],
Contextual Transformer Networks for Visual Recognition,
PAMI(45), No. 2, February 2023, pp. 1489-1500.
IEEE DOI 2301
Transformers, Convolution, Visualization, Task analysis, Image recognition, Object detection, Transformer, image recognition BibRef

Zhang, H.F.[Hao-Fei], Mao, F.[Feng], Xue, M.Q.[Meng-Qi], Fang, G.F.[Gong-Fan], Feng, Z.L.[Zun-Lei], Song, J.[Jie], Song, M.L.[Ming-Li],
Knowledge Amalgamation for Object Detection With Transformers,
IP(32), 2023, pp. 2093-2106.
IEEE DOI 2304
Transformers, Task analysis, Object detection, Detectors, Training, Feature extraction, Model reusing, vision transformers BibRef

Wang, Z.W.[Zi-Wei], Wang, C.Y.[Chang-Yuan], Xu, X.W.[Xiu-Wei], Zhou, J.[Jie], Lu, J.W.[Ji-Wen],
Quantformer: Learning Extremely Low-Precision Vision Transformers,
PAMI(45), No. 7, July 2023, pp. 8813-8826.
IEEE DOI 2306
Quantization (signal), Transformers, Computational modeling, Search problems, Object detection, Image color analysis, vision transformers BibRef

Peng, Z.L.[Zhi-Liang], Guo, Z.H.[Zong-Hao], Huang, W.[Wei], Wang, Y.W.[Yao-Wei], Xie, L.X.[Ling-Xi], Jiao, J.B.[Jian-Bin], Tian, Q.[Qi], Ye, Q.X.[Qi-Xiang],
Conformer: Local Features Coupling Global Representations for Recognition and Detection,
PAMI(45), No. 8, August 2023, pp. 9454-9468.
IEEE DOI 2307
Transformers, Feature extraction, Couplings, Visualization, Detectors, Convolution, Object detection, Feature fusion, vision transformer BibRef

Peng, Z.L.[Zhi-Liang], Huang, W.[Wei], Gu, S.Z.[Shan-Zhi], Xie, L.X.[Ling-Xi], Wang, Y.W.[Yao-Wei], Jiao, J.B.[Jian-Bin], Ye, Q.X.[Qi-Xiang],
Conformer: Local Features Coupling Global Representations for Visual Recognition,
ICCV21(357-366)
IEEE DOI 2203
Couplings, Representation learning, Visualization, Fuses, Convolution, Object detection, Transformers BibRef

Hu, X.W.[Xiao-Wei], Shi, M.[Min], Wang, W.Y.[Wei-Yun], Wu, S.T.[Si-Tong], Xing, L.J.[Lin-Jie], Wang, W.H.[Wen-Hai], Zhou, X.Z.[Xi-Zhou], Lu, L.W.[Le-Wei], Zhou, J.[Jie], Wang, X.G.[Xiao-Gang], Qiao, Y.[Yu], Dai, J.F.[Ji-Feng],
Demystify Transformers and Convolutions in Modern Image Deep Networks,
PAMI(47), No. 4, April 2025, pp. 2416-2428.
IEEE DOI 2503
Mixers, Transformers, Convolutional codes, Performance gain, Training, Robustness, Object detection, image deep network BibRef

Sheng, H.L.[Hua-Lian], Cai, S.J.[Si-Jia], Zhao, N.[Na], Deng, B.[Bing], Liang, Q.[Qiao], Zhao, M.J.[Min-Jian], Ye, J.P.[Jie-Ping],
CT3D++: Improving 3D Object Detection with Keypoint-Induced Channel-wise Transformer,
IJCV(133), No. 7, July 2025, pp. 4817-4836.
Springer DOI 2506
BibRef

Sheng, H.L.[Hua-Lian], Cai, S.J.[Si-Jia], Liu, Y.[Yuan], Deng, B.[Bing], Huang, J.Q.[Jian-Qiang], Hua, X.S.[Xian-Sheng], Zhao, M.J.[Min-Jian],
Improving 3D Object Detection with Channel-wise Transformer,
ICCV21(2723-2732)
IEEE DOI 2203
Point cloud compression, Object detection, Detectors, Transforms, Transformers, Encoding, Detection and localization in 2D and 3D BibRef

Xu, J.H.[Jian-Hao], Fan, X.T.[Xiang-Tao], Jian, H.[Hongdeng], Xu, C.[Chen], Bei, W.J.[Wei-Jia], Ge, Q.F.[Qi-Feng], Zhao, T.[Teng], Han, R.J.[Rui-Jie],
TAM-TR: Text-guided attention multi-modal transformer for object detection in UAV images,
PandRS(227), 2025, pp. 170-184.
Elsevier DOI 2508
Code: WWW Link.
Object detection, UAV image, Multi-modal, Loss function, Transformer BibRef


Kondo, R.[Ryota], Minoura, H.[Hiroaki], Hirakawa, T.[Tsubasa], Yamashita, T.[Takayoshi], Fujiyoshi, H.[Hironobu],
Binary-Decomposed Vision Transformer: Compressing and Accelerating Vision Transformer by Binary Decomposition,
ICIP24(3600-3605)
IEEE DOI 2411
Visualization, Image coding, Quantization (signal), Accuracy, Computational modeling, Object detection, Binary Decomposition, Vision Transformer BibRef

Yun, S.[Seokju], Ro, Y.[Youngmin],
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design,
CVPR24(5756-5767)
IEEE DOI 2410
Performance evaluation, Head, Accuracy, Redundancy, Graphics processing units, Object detection, CNNs BibRef

Kim, D.[Dahun], Angelova, A.[Anelia], Kuo, W.C.[Wei-Cheng],
Region-centric Image-Language Pretraining for Open-Vocabulary Detection,
ECCV24(LXIII:162-179).
Springer DOI 2412
BibRef
Earlier:
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers,
CVPR23(11144-11154)
IEEE DOI 2309
BibRef

Singh, A.[Apoorv],
Training Strategies for Vision Transformers for Object Detection,
WAD23(110-118)
IEEE DOI 2309
BibRef

Li, Y.H.[Yang-Hao], Mao, H.Z.[Han-Zi], Girshick, R.[Ross], He, K.M.[Kai-Ming],
Exploring Plain Vision Transformer Backbones for Object Detection,
ECCV22(IX:280-296).
Springer DOI 2211
BibRef

Yu, W.X.[Wen-Xin], Zhang, H.[Hongru], Lan, T.X.[Tian-Xiang], Hu, Y.C.[Yu-Cheng], Yin, D.[Dong],
CBPT: A New Backbone for Enhancing Information Transmission of Vision Transformers,
ICIP22(156-160)
IEEE DOI 2211
Merging, Information processing, Object detection, Transformers, Computational complexity, Vision Transformer, Backbone BibRef

Wang, Y.K.[Yi-Kai], Chen, X.H.[Xing-Hao], Cao, L.[Lele], Huang, W.B.[Wen-Bing], Sun, F.C.[Fu-Chun], Wang, Y.H.[Yun-He],
Multimodal Token Fusion for Vision Transformers,
CVPR22(12176-12185)
IEEE DOI 2210
Point cloud compression, Image segmentation, Shape, Semantics, Object detection, Vision+X BibRef

Guo, J.Y.[Jian-Yuan], Han, K.[Kai], Wu, H.[Han], Tang, Y.H.[Ye-Hui], Chen, X.H.[Xing-Hao], Wang, Y.H.[Yun-He], Xu, C.[Chang],
CMT: Convolutional Neural Networks Meet Vision Transformers,
CVPR22(12165-12175)
IEEE DOI 2210
Visualization, Image recognition, Force, Object detection, Transformers, Representation learning BibRef

Feng, W.X.[Wei-Xin], Wang, Y.J.[Yuan-Jiang], Ma, L.H.[Li-Hua], Yuan, Y.[Ye], Zhang, C.[Chi],
Temporal Knowledge Consistency for Unsupervised Visual Representation Learning,
ICCV21(10150-10160)
IEEE DOI 2203
Training, Representation learning, Visualization, Protocols, Object detection, Semisupervised learning, Transformers, Transfer/Low-shot/Semi/Unsupervised Learning BibRef

Hu, R.H.[Rong-Hang], Singh, A.[Amanpreet],
UniT: Multimodal Multitask Learning with a Unified Transformer,
ICCV21(1419-1429)
IEEE DOI 2203
Training, Natural languages, Object detection, Predictive models, Transformers, Multitasking, Representation learning BibRef

Zhang, P.C.[Peng-Chuan], Dai, X.Y.[Xi-Yang], Yang, J.W.[Jian-Wei], Xiao, B.[Bin], Yuan, L.[Lu], Zhang, L.[Lei], Gao, J.F.[Jian-Feng],
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding,
ICCV21(2978-2988)
IEEE DOI 2203
Image segmentation, Image coding, Computational modeling, Memory management, Object detection, Transformers, Representation learning BibRef

Heo, B.[Byeongho], Yun, S.[Sangdoo], Han, D.Y.[Dong-Yoon], Chun, S.[Sanghyuk], Choe, J.[Junsuk], Oh, S.J.[Seong Joon],
Rethinking Spatial Dimensions of Vision Transformers,
ICCV21(11916-11925)
IEEE DOI 2203
Dimensionality reduction, Computational modeling, Object detection, Transformers, Robustness, Recognition and classification BibRef

Chapter on 2-D Feature Analysis, Extraction and Representations, Shape, Skeletons, Texture continues in
Blob Detection.


Last update: Oct 6, 2025 at 14:07:43