DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 25, pp. 1-14
Main Authors: Jiao, Jiayu; Tang, Yu-Ming; Lin, Kun-Yu; Gao, Yipeng; Ma, Jinhua; Wang, Yaowei; Zheng, Wei-Shi
Author Affiliation: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China (all authors)
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2023
Subjects: Classification; Computational efficiency; Computational modeling; Computing costs; Convolution; Object recognition; Redundancy; Semantic segmentation; Task analysis; Transformers; Vision; Vision Transformer
ISSN: 1520-9210
EISSN: 1941-0077
DOI: 10.1109/TMM.2023.3243616
Online Access: https://doi.org/10.1109/TMM.2023.3243616 ; https://ieeexplore.ieee.org/document/10041780
Abstract: As a de facto solution, vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches, while the globally attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches within small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between computational complexity and the size of the attended receptive field. By analyzing the patch interaction of global attention in ViTs, we observe two key properties in the shallow layers, namely locality and sparsity, indicating the redundancy of global dependency modeling in the shallow layers of ViTs. Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and sparse patch interaction within a sliding window. With a pyramid architecture, we construct the Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experimental results show that DilateFormer achieves state-of-the-art performance on various vision tasks. On the ImageNet-1K classification task, DilateFormer achieves performance comparable to existing state-of-the-art models with 70% fewer FLOPs. DilateFormer-Base achieves 85.6% top-1 accuracy on ImageNet-1K classification, 53.5% box mAP / 46.1% mask mAP on COCO object detection/instance segmentation, and 51.1% MS mIoU on ADE20K semantic segmentation. The code is available at https://isee-ai.cn/~jiaojiayu/DilteFormer.html.
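The core operation described in the abstract, Multi-Scale Dilated Attention, can be pictured as sliding-window attention whose keys and values are sampled with a different dilation rate per head group. Below is a minimal PyTorch sketch written from that description rather than from the authors' released code: the class names (SlidingWindowDilatedAttention, MultiScaleDilatedAttention), the tensor layout, and the default kernel size and dilation rates (3x3 windows, dilations 1-4) are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlidingWindowDilatedAttention(nn.Module):
    """Each query attends only to a kernel_size x kernel_size neighborhood
    whose keys/values are sampled with a given dilation rate."""

    def __init__(self, head_dim, kernel_size=3, dilation=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.scale = head_dim ** -0.5

    def forward(self, q, k, v):
        # q, k, v: (B, C, H, W) feature maps for one head group.
        B, C, H, W = q.shape
        pad = self.dilation * (self.kernel_size - 1) // 2
        # Gather the dilated k x k neighborhood of every spatial position.
        k_win = F.unfold(k, self.kernel_size, dilation=self.dilation, padding=pad)
        v_win = F.unfold(v, self.kernel_size, dilation=self.dilation, padding=pad)
        k_win = k_win.view(B, C, self.kernel_size ** 2, H * W)
        v_win = v_win.view(B, C, self.kernel_size ** 2, H * W)
        q = q.reshape(B, C, 1, H * W)
        # Attention weights over the k x k neighbors of each query.
        attn = (q * k_win).sum(dim=1, keepdim=True) * self.scale  # (B, 1, k*k, H*W)
        attn = attn.softmax(dim=2)
        out = (attn * v_win).sum(dim=2)                           # (B, C, H*W)
        return out.view(B, C, H, W)


class MultiScaleDilatedAttention(nn.Module):
    """Splits channels into head groups and runs sliding-window attention
    with a different dilation rate in each group (the multi-scale part)."""

    def __init__(self, dim, kernel_size=3, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert dim % len(dilations) == 0
        self.head_dim = dim // len(dilations)
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.attns = nn.ModuleList(
            SlidingWindowDilatedAttention(self.head_dim, kernel_size, d)
            for d in dilations
        )
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        # x: (B, C, H, W); each head group sees a different dilation rate.
        q, k, v = self.qkv(x).chunk(3, dim=1)
        outs = []
        for i, attn in enumerate(self.attns):
            sl = slice(i * self.head_dim, (i + 1) * self.head_dim)
            outs.append(attn(q[:, sl], k[:, sl], v[:, sl]))
        return self.proj(torch.cat(outs, dim=1))


if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)       # a shallow-stage-sized feature map
    msda = MultiScaleDilatedAttention(dim=64)
    print(msda(x).shape)                 # torch.Size([2, 64, 56, 56])
```

With 3x3 windows each group attends to only nine positions per query, so the cost stays linear in the number of patches while larger dilation rates enlarge the effective receptive field; per the abstract, the full DilateFormer stacks such blocks in the shallow pyramid stages and switches to ordinary global multi-head self-attention in the deeper, lower-resolution stages.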