DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition
As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches while the global attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches in small neighborhoods.
Saved in:
| Published in: | IEEE Transactions on Multimedia, Volume 25, pp. 1-14 |
|---|---|
| Main authors: | Jiao, Jiayu; Tang, Yu-Ming; Lin, Kun-Yu; Gao, Yipeng; Ma, Jinhua; Wang, Yaowei; Zheng, Wei-Shi |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2023 |
| Subject: | |
| ISSN: | 1520-9210, 1941-0077 |
| Online access: | Get full text |
| Abstract | As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches while the global attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches in small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit the performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field. By analyzing the patch interaction of global attention in ViTs, we observe two key properties in the shallow layers, namely locality and sparsity, indicating the redundancy of global dependency modeling in shallow layers of ViTs. Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and sparse patch interaction within the sliding window. With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experimental results show that our DilateFormer achieves state-of-the-art performance on various vision tasks. On the ImageNet-1K classification task, DilateFormer achieves comparable performance with 70% fewer FLOPs compared with existing state-of-the-art models. Our DilateFormer-Base achieves 85.6% top-1 accuracy on the ImageNet-1K classification task, 53.5% box mAP/46.1% mask mAP on the COCO object detection/instance segmentation task, and 51.1% MS mIoU on the ADE20K semantic segmentation task. |
|---|---|
| AbstractList | Same as the Abstract above. The code is available at https://isee-ai.cn/~jiaojiayu/DilteFormer.html |
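The mechanism summarized in the abstract can be sketched in a few lines. Below is a minimal, illustrative PyTorch implementation of sliding-window dilated attention with one dilation rate per channel group, written only from the description above; the function names, tensor layout, and the omission of the learned query/key/value and output projections are assumptions made for brevity, not the authors' released code (see the URL above for the official implementation).

```python
import torch
import torch.nn.functional as F


def sliding_window_dilated_attention(q, k, v, kernel_size=3, dilation=1):
    # q, k, v: (B, C, H, W) feature maps for one channel group ("head").
    # Every query position attends to a kernel_size x kernel_size set of
    # keys/values sampled around it with the given dilation rate.
    B, C, H, W = q.shape
    pad = dilation * (kernel_size - 1) // 2

    # Gather the dilated k x k neighbourhood of every position:
    # F.unfold gives (B, C * k*k, H*W); separate channels from window slots.
    k_win = F.unfold(k, kernel_size, dilation=dilation, padding=pad).view(
        B, C, kernel_size * kernel_size, H * W)
    v_win = F.unfold(v, kernel_size, dilation=dilation, padding=pad).view(
        B, C, kernel_size * kernel_size, H * W)

    q_flat = q.view(B, C, 1, H * W)  # one query vector per position

    # Scaled dot-product attention restricted to the window slots.
    attn = (q_flat * k_win).sum(dim=1, keepdim=True) / (C ** 0.5)  # (B, 1, k*k, H*W)
    attn = attn.softmax(dim=2)

    out = (attn * v_win).sum(dim=2)  # (B, C, H*W)
    return out.view(B, C, H, W)


def multi_scale_dilated_attention(q, k, v, kernel_size=3, dilations=(1, 2, 3)):
    # Split channels into groups, run the sliding-window attention with a
    # different dilation per group, and concatenate the results: the
    # multi-scale aggregation described in the abstract.
    outs = [
        sliding_window_dilated_attention(gq, gk, gv, kernel_size, d)
        for gq, gk, gv, d in zip(q.chunk(len(dilations), dim=1),
                                 k.chunk(len(dilations), dim=1),
                                 v.chunk(len(dilations), dim=1),
                                 dilations)
    ]
    return torch.cat(outs, dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 96, 56, 56)  # toy feature map
    print(multi_scale_dilated_attention(x, x, x).shape)  # torch.Size([2, 96, 56, 56])
```

In the DilateFormer design described above, blocks built on this operation occupy the shallow (low-level) stages of a pyramid backbone, where patch interaction is observed to be local and sparse, while ordinary global multi-head self-attention blocks are used in the deeper stages.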
| Author | Jiao, Jiayu; Gao, Yipeng; Tang, Yu-Ming; Zheng, Wei-Shi; Wang, Yaowei; Lin, Kun-Yu; Ma, Jinhua |
| Author_xml | – sequence: 1 givenname: Jiayu surname: Jiao fullname: Jiao, Jiayu organization: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China – sequence: 2 givenname: Yu-Ming surname: Tang fullname: Tang, Yu-Ming organization: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China – sequence: 3 givenname: Kun-Yu surname: Lin fullname: Lin, Kun-Yu organization: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China – sequence: 4 givenname: Yipeng surname: Gao fullname: Gao, Yipeng organization: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China – sequence: 5 givenname: Jinhua orcidid: 0000-0002-0165-8416 surname: Ma fullname: Ma, Jinhua organization: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China – sequence: 6 givenname: Yaowei orcidid: 0000-0003-2197-9038 surname: Wang fullname: Wang, Yaowei organization: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China – sequence: 7 givenname: Wei-Shi orcidid: 0000-0001-8327-0003 surname: Zheng fullname: Zheng, Wei-Shi organization: School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China |
| CODEN | ITMUF8 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
| Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
| DBID | 97E RIA RIE AAYXX CITATION 7SC 7SP 8FD JQ2 L7M L~C L~D |
| DOI | 10.1109/TMM.2023.3243616 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present; IEEE All-Society Periodicals Package (ASPP) 1998–Present; IEEE/IET Electronic Library; CrossRef; Computer and Information Systems Abstracts; Electronics & Communications Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional |
| DatabaseTitle | CrossRef; Technology Research Database; Computer and Information Systems Abstracts – Academic; Electronics & Communications Abstracts; ProQuest Computer Science Collection; Computer and Information Systems Abstracts; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Professional |
| DatabaseTitleList | Technology Research Database |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering; Computer Science |
| EISSN | 1941-0077 |
| EndPage | 14 |
| ExternalDocumentID | 10_1109_TMM_2023_3243616 10041780 |
| Genre | orig-research |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 200 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001125902000049&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 1520-9210 |
| IngestDate | Sun Nov 09 08:50:43 EST 2025; Sat Nov 29 03:10:11 EST 2025; Tue Nov 18 22:11:23 EST 2025; Mon Aug 11 03:35:28 EDT 2025 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| LinkModel | DirectLink |
| ORCID | 0000-0003-2197-9038 0000-0001-8327-0003 0000-0002-0165-8416 0000-0003-0507-2620 0000-0001-5472-0079 0000-0002-0013-3730 |
| PQID | 2901357028 |
| PQPubID | 75737 |
| PageCount | 14 |
| ParticipantIDs | crossref_primary_10_1109_TMM_2023_3243616 crossref_citationtrail_10_1109_TMM_2023_3243616 ieee_primary_10041780 proquest_journals_2901357028 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-01-01 |
| PublicationDateYYYYMMDD | 2023-01-01 |
| PublicationDate_xml | – month: 01 year: 2023 text: 2023-01-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationPlace | Piscataway |
| PublicationPlace_xml | – name: Piscataway |
| PublicationTitle | IEEE transactions on multimedia |
| PublicationTitleAbbrev | TMM |
| PublicationYear | 2023 |
| Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| SSID | ssj0014507 |
| Score | 2.7078981 |
| Snippet | As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches while the global attended receptive field leads to quadratic computational cost. |
| SourceID | proquest crossref ieee |
| SourceType | Aggregation Database Enrichment Source Index Database Publisher |
| StartPage | 1 |
| SubjectTerms | Classification; Computational efficiency; Computational modeling; Computing costs; Convolution; Object recognition; Redundancy; Semantic segmentation; Task analysis; Transformers; Vision; Vision Transformer |
| Title | DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition |
| URI | https://ieeexplore.ieee.org/document/10041780 https://www.proquest.com/docview/2901357028 |
| Volume | 25 |
| WOSCitedRecordID | wos001125902000049&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVIEE databaseName: IEEE Electronic Library (IEL) customDbUrl: eissn: 1941-0077 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0014507 issn: 1520-9210 databaseCode: RIE dateStart: 19990101 isFulltext: true titleUrlDefault: https://ieeexplore.ieee.org/ providerName: IEEE |
| linkProvider | IEEE |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=DilateFormer%3A+Multi-Scale+Dilated+Transformer+for+Visual+Recognition&rft.jtitle=IEEE+transactions+on+multimedia&rft.au=Jiao%2C+Jiayu&rft.au=Yu-Ming%2C+Tang&rft.au=Kun-Yu%2C+Lin&rft.au=Gao%2C+Yipeng&rft.date=2023-01-01&rft.pub=The+Institute+of+Electrical+and+Electronics+Engineers%2C+Inc.+%28IEEE%29&rft.issn=1520-9210&rft.eissn=1941-0077&rft.volume=25&rft.spage=8906&rft_id=info:doi/10.1109%2FTMM.2023.3243616&rft.externalDBID=NO_FULL_TEXT |