I Know How You Move: Explicit Motion Estimation for Human Action Recognition
Enabled by hierarchical convolutions and nonlinear mappings, recent action recognition studies have continuously boosted performance with spatiotemporal modelling. In general, motion clues are essential in video-oriented tasks, while existing approaches aggregate the spatial and temporal signatures...
Saved in:
| Published in: | IEEE Transactions on Multimedia, Vol. 27, pp. 1665–1676 |
|---|---|
| Main authors: | Shen, Zhongwei; Wu, Xiao-Jun; Li, Hui; Xu, Tianyang; Wu, Cong |
| Format: | Journal Article |
| Language: | English |
| Publication details: | IEEE, 2025 |
| Subject: | action recognition; motion estimation; encoder-decoder structure; short-term motion; spatiotemporal phenomena; convolutional neural networks |
| ISSN: | 1520-9210 (print), 1941-0077 (electronic) |
| Online access: | Get full text |
| Abstract | Enabled by hierarchical convolutions and nonlinear mappings, recent action recognition studies have continuously boosted performance with spatiotemporal modelling. In general, motion clues are essential in video-oriented tasks, while existing approaches aggregate the spatial and temporal signatures via specially designed modules in the middle or output stages. To highlight the privilege provided by temporal motions, in this paper, we propose a simple but effective MOTion Estimator (MOTE) to generate the motion patterns from every single frame, avoiding complex dense-frame input. In particular, MOTE follows an encoder-decoder structure, which takes the short-term motion features generated by the pretrained dense-frame network as the learning target. The spatial information of a single frame is utilized to estimate the instantaneous motion appearance. It can support the expression of vulnerable regions, such as the 'hand' in 'waving hands,' which would otherwise be suppressed in the feature maps as the 'hand' suffers from motion blur. The training process of MOTE is independent of the action recognition system. Therefore, the trained MOTE can be transplanted to the input-end of existing action recognition methods to provide instantaneous motion estimation as feature enhancement according to practical requirements. Our experiments performed on Something-Something V1, V2, Kinetics-400, and Diving48 verify the effectiveness of the proposed method. |
|---|---|
| Author | Shen, Zhongwei; Wu, Xiao-Jun; Li, Hui; Xu, Tianyang; Wu, Cong |
| Author_xml | 1. Shen, Zhongwei (ORCID: 0000-0002-6701-1965; shenzw_cv@163.com; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China) · 2. Wu, Xiao-Jun (ORCID: 0000-0002-0310-5778; wu_xiaojun@jiangnan.edu.cn; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China) · 3. Li, Hui (ORCID: 0000-0003-4550-7879; lihui.cv@jiangnan.edu.cn; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China) · 4. Xu, Tianyang (ORCID: 0000-0002-9015-3128; tianyang_xu@163.com; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.) · 5. Wu, Cong (wucong@stu.jiangnan.edu.cn; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China) |
| CODEN | ITMUF8 |
| ContentType | Journal Article |
| DOI | 10.1109/TMM.2022.3211423 |
| Discipline | Engineering; Computer Science |
| EISSN | 1941-0077 |
| EndPage | 1676 |
| ExternalDocumentID | 10_1109_TMM_2022_3211423 9907887 |
| Genre | orig-research |
| GrantInformation_xml | National Key Research and Development Program of China (2017YFC1601800); National Natural Science Foundation of China (U1836218, 62020106012, 62106089; funder ID: 10.13039/501100001809); 111 Project of the Ministry of Education of China (B12018) |
| ISSN | 1520-9210 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html; https://doi.org/10.15223/policy-029; https://doi.org/10.15223/policy-037 |
| ORCID | 0000-0002-6701-1965 0000-0002-9015-3128 0000-0002-0310-5778 0000-0003-4550-7879 |
| PageCount | 12 |
| PublicationTitle | IEEE transactions on multimedia |
| PublicationTitleAbbrev | TMM |
| PublicationYear | 2025 |
| Publisher | IEEE |
| StartPage | 1665 |
| SubjectTerms | action recognition; Computational modeling; Convolutional neural networks; Costs; encoder-decoder structure; Feature extraction; motion estimation; Short-term motion; Spatiotemporal phenomena; Three-dimensional displays; Training |
| Title | I Know How You Move: Explicit Motion Estimation for Human Action Recognition |
| URI | https://ieeexplore.ieee.org/document/9907887 |
| Volume | 27 |
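The abstract describes MOTE as an encoder-decoder that regresses short-term motion features from a single frame, trained by distillation from a frozen, pretrained dense-frame network and later transplanted to the input end of an action recognizer. Below is a minimal sketch of that training setup, assuming PyTorch; the `MotionEstimator` layer sizes, the generic `teacher` callable, and the MSE regression loss are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of the MOTE training idea from the abstract (assumptions:
# PyTorch, hypothetical layer sizes, MSE distillation loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEstimator(nn.Module):
    """Encoder-decoder that estimates short-term motion features from a
    single RGB frame (hypothetical architecture)."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) -> motion feature map (B, out_channels, H/2, W/2)
        return self.decoder(self.encoder(frame))

def train_step(mote, teacher, optimizer, clip):
    """One distillation step. `teacher` stands in for the frozen pretrained
    dense-frame network; its short-term motion features for the clip are the
    regression target MOTE must reproduce from the centre frame alone.
    Output shapes of `mote` and `teacher` are assumed to match."""
    centre_frame = clip[:, clip.shape[1] // 2]   # clip: (B, T, 3, H, W)
    with torch.no_grad():
        target = teacher(clip)                   # (B, C, h, w), no gradients
    pred = mote(centre_frame)
    loss = F.mse_loss(pred, target)              # regress the motion features
    optimizer.zero_grad()
    loss.backward()                              # only MOTE's weights update,
    optimizer.step()                             # keeping training independent
    return loss.item()                           # of the recognition system
```

Consistent with the abstract's claim that MOTE's training is independent of the recognition system, only the estimator's parameters are optimized here. At deployment the trained estimator would be frozen and attached to the input end of an existing recognizer; how its per-frame output is fused with the recognizer's input (e.g., channel-wise concatenation) is an implementation choice the record does not specify.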