I Know How You Move: Explicit Motion Estimation for Human Action Recognition

Detailed bibliography
Published in: IEEE Transactions on Multimedia, Vol. 27, pp. 1665–1676
Main authors: Shen, Zhongwei; Wu, Xiao-Jun; Li, Hui; Xu, Tianyang; Wu, Cong
Format: Journal Article
Language: English
Publication details: IEEE, 2025
ISSN:1520-9210, 1941-0077
Abstract Enabled by hierarchical convolutions and nonlinear mappings, recent action recognition studies have continuously boosted performance with spatiotemporal modelling. In general, motion cues are essential in video-oriented tasks, yet existing approaches aggregate the spatial and temporal signatures via specially designed modules in the middle or output stages. To highlight the advantage provided by temporal motion, in this paper we propose a simple but effective MOTion Estimator (MOTE) that generates motion patterns from every single frame, avoiding complex dense-frame input. In particular, MOTE follows an encoder-decoder structure that takes the short-term motion features generated by a pretrained dense-frame network as its learning target. The spatial information of a single frame is used to estimate the instantaneous motion appearance. This enhances the representation of vulnerable regions, such as the 'hand' in 'waving hands,' which would otherwise be suppressed in the feature maps because the 'hand' suffers from motion blur. The training process of MOTE is independent of the action recognition system; therefore, the trained MOTE can be transplanted to the input end of existing action recognition methods to provide instantaneous motion estimation as feature enhancement, according to practical requirements. Our experiments on Something-Something V1, V2, Kinetics-400, and Diving48 verify the effectiveness of the proposed method.
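The abstract only sketches the mechanism, so below is a minimal, hypothetical PyTorch-style sketch of the idea, not the authors' released implementation: an encoder-decoder that regresses short-term motion features (supervised by a frozen, pretrained dense-frame network) from a single frame, trained independently, and then attached to the input end of an arbitrary recognizer. Module names, channel widths, the MSE regression loss, and the concatenation-based fusion are illustrative assumptions.

# A minimal, hypothetical sketch of the MOTE idea described in the abstract
# (assumed PyTorch; not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEstimator(nn.Module):
    """Single RGB frame -> estimated instantaneous motion feature map."""
    def __init__(self, in_channels: int = 3, motion_channels: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, motion_channels, 4, stride=2, padding=1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) with H and W divisible by 4
        return self.decoder(self.encoder(frame))

def mote_train_step(mote, optimizer, frame, motion_target):
    """One regression step: the target is the short-term motion feature produced by a
    frozen, pretrained dense-frame network; training is independent of the recognizer."""
    optimizer.zero_grad()
    loss = F.mse_loss(mote(frame), motion_target)  # MSE loss is an assumption here
    loss.backward()
    optimizer.step()
    return loss.item()

class MotionEnhancedInput(nn.Module):
    """Hypothetical input-end wrapper: fuse the estimated motion map with the RGB frame
    (channel concatenation here) before an existing action recognition backbone."""
    def __init__(self, mote: MotionEstimator, recognizer: nn.Module):
        super().__init__()
        self.mote = mote.eval()            # trained separately, kept frozen here
        for p in self.mote.parameters():
            p.requires_grad_(False)
        self.recognizer = recognizer       # expects 3 + 3 input channels in this sketch

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        motion = self.mote(frame)
        return self.recognizer(torch.cat([frame, motion], dim=1))

Because the estimator is supervised only by the frozen dense-frame features, it can be trained once and then reused unchanged in front of different recognizers; the exact fusion scheme and target feature space used in the paper are not specified by the abstract.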
Author Wu, Cong
Wu, Xiao-Jun
Li, Hui
Xu, Tianyang
Shen, Zhongwei
Author_xml – sequence: 1
  givenname: Zhongwei
  orcidid: 0000-0002-6701-1965
  surname: Shen
  fullname: Shen, Zhongwei
  email: shenzw_cv@163.com
  organization: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
– sequence: 2
  givenname: Xiao-Jun
  orcidid: 0000-0002-0310-5778
  surname: Wu
  fullname: Wu, Xiao-Jun
  email: wu_xiaojun@jiangnan.edu.cn
  organization: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
– sequence: 3
  givenname: Hui
  orcidid: 0000-0003-4550-7879
  surname: Li
  fullname: Li, Hui
  email: lihui.cv@jiangnan.edu.cn
  organization: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
– sequence: 4
  givenname: Tianyang
  orcidid: 0000-0002-9015-3128
  surname: Xu
  fullname: Xu, Tianyang
  email: tianyang_xu@163.com
  organization: Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.
– sequence: 5
  givenname: Cong
  surname: Wu
  fullname: Wu, Cong
  email: wucong@stu.jiangnan.edu.cn
  organization: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
CODEN ITMUF8
CitedBy_id crossref_primary_10_1016_j_image_2025_117381
crossref_primary_10_3390_electronics12040857
Cites_doi 10.1109/iccv.2017.622
10.1109/CVPR42600.2020.00043
10.1145/3343031.3350876
10.1016/j.engappai.2018.08.014
10.1109/iccv48922.2021.01345
10.1109/CVPR.2009.5206848
10.1109/CVPR.2019.00044
10.5555/3045118.3045167
10.1109/TIP.2018.2887342
10.1007/978-3-030-01267-0_19
10.1007/978-3-030-01231-1_32
10.1109/CVPR52688.2022.00333
10.1109/CVPR.2019.00137
10.1109/CVPR42600.2020.00067
10.1109/ICCV.2019.00718
10.1109/CVPR.2019.00136
10.1109/CVPR42600.2020.00118
10.1109/ICCV48922.2021.00154
10.1109/TPAMI.2020.3029425
10.1109/CVPR.2018.00151
10.1007/978-3-319-46484-8_2
10.1109/ICCV.2019.00630
10.1109/ICCV.2019.00209
10.1109/CVPR52688.2022.00476
10.1007/978-3-030-01246-5_49
10.1109/CVPR.2019.01233
10.1109/ICCV.2015.510
10.1109/CVPR.2018.00558
10.1109/TPAMI.2016.2599174
10.2307/j.ctvcm4g18.8
10.1007/s11263-021-01435-1
10.1109/ICCV.2017.590
10.1109/CVPR52688.2022.00320
10.1109/CVPR52688.2022.00319
10.48550/arXiv.2102.05095
10.1109/ICCV48922.2021.00986
10.1109/iccv.2019.00561
10.1109/CVPR.2018.00675
10.1007/978-3-030-58539-6_17
10.1109/CVPR46437.2021.00193
10.1109/ICCV.2013.441
10.1109/CVPR52688.2022.01426
10.1109/CVPR42600.2020.00117
10.1145/3448981
10.1109/CVPR.2018.00813
10.1109/TMM.2021.3050073
10.1609/aaai.v34i07.6836
10.1007/978-3-030-01228-1_25
10.1109/TMM.2018.2866370
10.1007/978-3-030-01216-8_43
10.1109/TMM.2019.2943204
10.1109/CVPR42600.2020.00099
10.1613/jair.301
10.1109/CVPR.2018.00054
10.1109/ICCV48922.2021.00675
10.1109/CVPR42600.2020.00028
ContentType Journal Article
DBID 97E
RIA
RIE
AAYXX
CITATION
DOI 10.1109/TMM.2022.3211423
DatabaseName IEEE Xplore (IEEE)
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Xplore
CrossRef
DatabaseTitle CrossRef
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 1941-0077
EndPage 1676
ExternalDocumentID 10_1109_TMM_2022_3211423
9907887
Genre orig-research
GrantInformation_xml – fundername: National Key Research and Development Program of China
  grantid: 2017YFC1601800
– fundername: National Natural Science Foundation of China
  grantid: U1836218; 62020106012; 62106089
  funderid: 10.13039/501100001809
– fundername: 111 Project of Ministry of Education of China
  grantid: B12018
GroupedDBID -~X
0R~
29I
4.4
5GY
5VS
6IK
97E
AAJGR
AARMG
AASAJ
AAWTH
ABAZT
ABQJQ
ABVLG
ACGFO
ACGFS
ACIWK
AENEX
AETIX
AGQYO
AGSQL
AHBIQ
AI.
AIBXA
AKJIK
AKQYR
ALLEH
ALMA_UNASSIGNED_HOLDINGS
ATWAV
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CS3
DU5
EBS
EJD
HZ~
H~9
IFIPE
IFJZH
IPLJI
JAVBF
LAI
M43
O9-
OCL
P2P
PQQKQ
RIA
RIE
RNS
TN5
VH1
ZY4
AAYXX
CITATION
IEDL.DBID RIE
ISICitedReferencesCount 2
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001459668500022&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1520-9210
IngestDate Sat Nov 29 08:04:05 EST 2025
Tue Nov 18 21:57:15 EST 2025
Wed Aug 27 02:04:15 EDT 2025
IsPeerReviewed true
IsScholarly true
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
ORCID 0000-0002-6701-1965
0000-0002-9015-3128
0000-0002-0310-5778
0000-0003-4550-7879
PageCount 12
ParticipantIDs crossref_primary_10_1109_TMM_2022_3211423
ieee_primary_9907887
crossref_citationtrail_10_1109_TMM_2022_3211423
PublicationCentury 2000
PublicationDate 20250000
2025-00-00
PublicationDateYYYYMMDD 2025-01-01
PublicationDate_xml – year: 2025
  text: 20250000
PublicationDecade 2020
PublicationTitle IEEE transactions on multimedia
PublicationTitleAbbrev TMM
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
References ref13
ref57
ref12
ref56
ref15
ref59
ref14
ref58
ref53
Carreira (ref26) 2017
ref52
ref11
ref55
ref10
ref54
Dosovitskiy (ref36) 2020
ref17
ref16
Qiu (ref44) 2021
Wang (ref19) 2013; 103
ref51
ref50
ref47
ref41
ref43
ref49
Yu (ref45) 2022
ref7
ref9
ref4
ref3
ref6
ref5
ref40
ref35
Kay (ref8) 2017
ref34
ref37
Tian (ref18) 2021
ref31
ref30
ref33
ref32
ref2
ref1
ref39
ref38
Tran (ref25) 2017
Tong (ref46) 2022
ref24
ref23
ref67
ref20
ref64
ref63
ref22
ref66
ref21
ref65
ref28
ref27
ref29
Simonyan (ref48) 2014
Li (ref42) 2022
ref60
ref62
ref61
References_xml – ident: ref7
  doi: 10.1109/iccv.2017.622
– year: 2022
  ident: ref46
  article-title: Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training
– volume: 103
  start-page: 60
  issue: 1
  volume-title: Int. J. Comput. Vis.
  year: 2013
  ident: ref19
  article-title: Dense trajectories and motion boundary descriptors for action recognition
– ident: ref64
  doi: 10.1109/CVPR42600.2020.00043
– ident: ref51
  doi: 10.1145/3343031.3350876
– ident: ref3
  doi: 10.1016/j.engappai.2018.08.014
– ident: ref63
  doi: 10.1109/iccv48922.2021.01345
– ident: ref59
  doi: 10.1109/CVPR.2009.5206848
– ident: ref16
  doi: 10.1109/CVPR.2019.00044
– year: 2017
  ident: ref25
  article-title: Convnet architecture search for spatiotemporal feature learning
– ident: ref55
  doi: 10.5555/3045118.3045167
– ident: ref56
  doi: 10.1109/TIP.2018.2887342
– ident: ref28
  doi: 10.1007/978-3-030-01267-0_19
– ident: ref57
  doi: 10.1007/978-3-030-01231-1_32
– ident: ref41
  doi: 10.1109/CVPR52688.2022.00333
– ident: ref12
  doi: 10.1109/CVPR.2019.00137
– ident: ref65
  doi: 10.1109/CVPR42600.2020.00067
– ident: ref9
  doi: 10.1109/ICCV.2019.00718
– ident: ref50
  doi: 10.1109/CVPR.2019.00136
– ident: ref61
  doi: 10.1109/CVPR42600.2020.00118
– ident: ref11
  doi: 10.1109/ICCV48922.2021.00154
– ident: ref13
  doi: 10.1109/TPAMI.2020.3029425
– year: 2017
  ident: ref8
  article-title: The kinetics human action video dataset
– ident: ref49
  doi: 10.1109/CVPR.2018.00151
– ident: ref22
  doi: 10.1007/978-3-319-46484-8_2
– ident: ref30
  doi: 10.1109/ICCV.2019.00630
– ident: ref52
  doi: 10.1109/ICCV.2019.00209
– start-page: 6299
  volume-title: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
  year: 2017
  ident: ref26
– ident: ref40
  doi: 10.1109/CVPR52688.2022.00476
– ident: ref23
  doi: 10.1007/978-3-030-01246-5_49
– year: 2021
  ident: ref18
  article-title: Ean: Event adaptive network for enhanced action recognition
– ident: ref31
  doi: 10.1109/CVPR.2019.01233
– ident: ref24
  doi: 10.1109/ICCV.2015.510
– ident: ref14
  doi: 10.1109/CVPR.2018.00558
– year: 2022
  ident: ref42
  article-title: Uniformer: Unified transformer for efficient spatiotemporal representation learning
– ident: ref21
  doi: 10.1109/TPAMI.2016.2599174
– ident: ref58
  doi: 10.2307/j.ctvcm4g18.8
– ident: ref5
  doi: 10.1007/s11263-021-01435-1
– ident: ref29
  doi: 10.1109/ICCV.2017.590
– ident: ref38
  doi: 10.1109/CVPR52688.2022.00320
– ident: ref54
  doi: 10.1109/CVPR52688.2022.00319
– ident: ref43
  doi: 10.48550/arXiv.2102.05095
– ident: ref37
  doi: 10.1109/ICCV48922.2021.00986
– ident: ref67
  doi: 10.1109/iccv.2019.00561
– ident: ref27
  doi: 10.1109/CVPR.2018.00675
– start-page: 568
  volume-title: Proc. Conf. Neural Inf. Process. Syst.
  year: 2014
  ident: ref48
  article-title: Two-stream convolutional networks for action recognition in videos
– ident: ref66
  doi: 10.1007/978-3-030-58539-6_17
– ident: ref17
  doi: 10.1109/CVPR46437.2021.00193
– ident: ref20
  doi: 10.1109/ICCV.2013.441
– year: 2020
  ident: ref36
  article-title: An image is worth 16x16 words: Transformers for image recognition at scale
– year: 2022
  ident: ref45
  article-title: Coca: Contrastive captioners are image-text foundation models
– ident: ref47
  doi: 10.1109/CVPR52688.2022.01426
– ident: ref32
  doi: 10.1109/CVPR42600.2020.00117
– ident: ref4
  doi: 10.1145/3448981
– ident: ref60
  doi: 10.1109/CVPR.2018.00813
– ident: ref6
  doi: 10.1109/TMM.2021.3050073
– ident: ref53
  doi: 10.1609/aaai.v34i07.6836
– ident: ref62
  doi: 10.1007/978-3-030-01228-1_25
– ident: ref2
  doi: 10.1109/TMM.2018.2866370
– start-page: 18
  volume-title: Proc. 38th Int. Conf. Mach. Learn., Virtual Conf.
  year: 2021
  ident: ref44
  article-title: Optimization planning for 3d convnets
– ident: ref34
  doi: 10.1007/978-3-030-01216-8_43
– ident: ref1
  doi: 10.1109/TMM.2019.2943204
– ident: ref10
  doi: 10.1109/CVPR42600.2020.00099
– ident: ref15
  doi: 10.1613/jair.301
– ident: ref35
  doi: 10.1109/CVPR.2018.00054
– ident: ref39
  doi: 10.1109/ICCV48922.2021.00675
– ident: ref33
  doi: 10.1109/CVPR42600.2020.00028
SSID ssj0014507
Score 2.4446633
SourceID crossref
ieee
SourceType Enrichment Source
Index Database
Publisher
StartPage 1665
SubjectTerms action recognition
Computational modeling
Convolutional neural networks
Costs
encoder-decoder structure
Feature extraction
motion estimation
Short-term motion
Spatiotemporal phenomena
Three-dimensional displays
Training
Title I Know How You Move: Explicit Motion Estimation for Human Action Recognition
URI https://ieeexplore.ieee.org/document/9907887
Volume 27
WOSCitedRecordID wos001459668500022&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVIEE
  databaseName: IEEE Electronic Library (IEL)
  customDbUrl:
  eissn: 1941-0077
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0014507
  issn: 1520-9210
  databaseCode: RIE
  dateStart: 19990101
  isFulltext: true
  titleUrlDefault: https://ieeexplore.ieee.org/
  providerName: IEEE
linkProvider IEEE