Video description: A comprehensive survey of deep learning approaches


Detailed bibliography
Published in: The Artificial Intelligence Review, Volume 56, Issue 11, pp. 13293–13372
Main authors: Rafiq, Ghazala; Rafiq, Muhammad; Choi, Gyu Sang
Format: Journal Article
Language: English
Published: Dordrecht: Springer Netherlands, 01.11.2023
ISSN: 0269-2821; EISSN: 1573-7462
Online access: Get full text
Abstract: Video description refers to understanding visual content and transforming that acquired understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing and has real-time, practical applications. Deep learning-based approaches to video description have demonstrated better results than conventional approaches. The current literature, however, lacks a thorough interpretation of the recently developed sequence-to-sequence techniques for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture that employs some composition of CNNs and RNNs, or their LSTM and GRU variants, as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on the most distinctive features, achieving high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. The transformer is a modern and efficient transductive architecture that yields robust output. Free from recurrence and based solely on self-attention, it allows parallelization and training on massive amounts of data, and it can fully utilize the available GPUs for most NLP tasks. With the emergence of several transformer variants, handling long-term dependencies is no longer an issue for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes; they can take promising directions from this survey.
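
To make the Encoder–Decoder plus attention layout described in the abstract concrete, the sketch below shows a minimal sequence-to-sequence video captioner: a GRU encoder over pre-extracted per-frame CNN features, an additive attention step, and a GRU decoder that emits word logits. This is an illustrative sketch only, not code from the surveyed papers; the use of PyTorch, the module names, and all dimensions (2048-d frame features, 512-d hidden state, a 10,000-word vocabulary) are assumptions made here for demonstration.

# Minimal sketch (assumed, not the authors' implementation): a standard
# sequence-to-sequence video captioner with additive attention.
import torch
import torch.nn as nn


class AttentiveCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, embed=300):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # encodes the frame sequence
        self.embed = nn.Embedding(vocab_size, embed)                # word embeddings for the decoder
        self.decoder = nn.GRUCell(embed + hidden, hidden)           # one step of the language decoder
        self.attn = nn.Linear(2 * hidden, 1)                        # additive attention score
        self.out = nn.Linear(hidden, vocab_size)                    # projects to vocabulary logits

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim) pre-extracted CNN features
        # captions:    (B, T_words) ground-truth token ids (teacher forcing)
        enc_out, h = self.encoder(frame_feats)        # enc_out: (B, T_frames, hidden)
        h = h.squeeze(0)                              # decoder state, (B, hidden)
        logits = []
        for t in range(captions.size(1)):
            # attention: score every encoder step against the current decoder state
            query = h.unsqueeze(1).expand_as(enc_out)                 # (B, T_frames, hidden)
            scores = self.attn(torch.cat([enc_out, query], dim=-1))   # (B, T_frames, 1)
            weights = torch.softmax(scores, dim=1)
            context = (weights * enc_out).sum(dim=1)                  # (B, hidden)
            # feed the previous ground-truth word plus the attended context
            word = self.embed(captions[:, t])
            h = self.decoder(torch.cat([word, context], dim=-1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)             # (B, T_words, vocab_size)


if __name__ == "__main__":
    model = AttentiveCaptioner()
    feats = torch.randn(2, 20, 2048)                  # 2 clips, 20 frames of CNN features each
    caps = torch.randint(0, 10000, (2, 12))           # 2 reference captions, 12 tokens each
    print(model(feats, caps).shape)                   # torch.Size([2, 12, 10000])

Replacing the GRU blocks with LSTMs, enriching the attention, training the decoder with a reinforcement-learning reward instead of teacher forcing, or moving to a recurrence-free transformer are the variations the abstract above enumerates.
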
Audience Academic
Author Rafiq, Ghazala (Department of Information & Communication Engineering, Yeungnam University)
Rafiq, Muhammad (ORCID 0000-0001-6713-8766; rafiq@kmu.ac.kr; Department of Game & Mobile Engineering, Keimyung University)
Choi, Gyu Sang (castchoi@ynu.ac.kr; Department of Information & Communication Engineering, Yeungnam University)
ContentType Journal Article
Copyright The Author(s) 2023
DOI 10.1007/s10462-023-10414-6
Discipline Computer Science
EISSN 1573-7462
EndPage 13372
GrantInformation National Research Foundation of Korea, grant NRF-2019R1A2C1006159 (funder ID: http://dx.doi.org/10.13039/501100003725)
National Research Foundation of Korea, grant NRF-2021R1A6A1A03039493 (funder ID: http://dx.doi.org/10.13039/501100003725)
2022 Yeungnam University Research Grant
ISSN 0269-2821
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 11
Keywords Deep learning
Text description
Encoder–Decoder architecture
Video description approaches
Video captioning
Video captioning techniques
Vision to text
Language English
ORCID 0000-0001-6713-8766
OpenAccessLink https://link.springer.com/10.1007/s10462-023-10414-6
PageCount 80
PublicationDate 2023-11-01
PublicationPlace Dordrecht
PublicationSubtitle An International Science and Engineering Journal
PublicationTitle The Artificial Intelligence Review
PublicationTitleAbbrev Artif Intell Rev
PublicationYear 2023
Publisher Springer Netherlands
Springer
Springer Nature B.V
– reference: Aafaq N, Akhtar N, Liu W, Mian A (2019a) Empirical autopsy of deep video captioning frameworks. arXiv:1911.09345
– reference: Pasunuru R, Bansal M (2017) Reinforced video captioning with entailment rewards. Emnlp 2017—conference on empirical methods in natural language processing, proceedings (pp. 979–985). https://doi.org/10.18653/v1/d17-1103
– reference: Das P, Xu C, Doell RF, Corso JJ (2013) A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching. Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 2634–2641). https://doi.org/10.1109/CVPR.2013.340
– reference: Estevam V, Laroca R, Pedrini H, Menotti D (2021) Dense video captioning using unsupervised semantic information. arXiv:2112.08455v1
– reference: Lowell U, Donahue J, Berkeley UC, Rohrbach M, Berkeley UC, Mooney R (2014) Translating videos to natural language using deep recurrent neural networks. arXiv:1412.4729v3
– reference: Sharif N, White L, Bennamoun M, Shah SAA (2018) Learning-based composite metrics for improved caption evaluation. ACL 2018 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, 14–20. https://doi.org/10.18653/v1/p18-3003
– reference: LeiJYuLBergTLBansalMTVR: a large-scale dataset for video-subtitle moment retrievalLecture Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)20201236644746310.1007/978-3-030-58589-1_27
– reference: Lin K, Gan Z, Wang L (2020) Multi-modal feature fusion with feature attention for vatex captioning challenge 2020:2–5. arXiv:2006.03315
– reference: Amaresh M, Chitrakala S (2019) Video captioning using deep learning: an overview of methods, datasets and metrics. Proceedings of the 2019 IEEE international conference on communication and signal processing, ICCSP 2019 (pp. 656–661). https://doi.org/10.1109/ICCSP.2019.8698097
– reference: AgyemanRRafiqMShinHKRinnerBChoiGSOptimizing spatiotemporal feature learning in 3D convolutional neural networks with pooling blocksIEEE Access20219707977080510.1109/access.2021.3078295
– reference: Blohm M, Jagfeld G, Sood E, Yu X, Vu NT (2018) Comparing attention-based convolutional and recurrent neural networks: success and limitations in machine reading comprehension. CoNLL 2018–22nd Conference on Computational Natural Language Learning, Proceedings, 108–118. https://doi.org/10.18653/v1/k18-1011arXiv:1808.08744
– reference: Gomez AN, Ren M, Urtasun R, Grosse RB (2017) The reversible resid-ual network: backpropagation without storing activations. Adv Neural Inform Process Syst 2017:2215–2225. arXiv:1707.04585
– reference: Lei J, Wang L, Shen Y, Yu D, Berg T, Bansal M (2020) MART: memory-augmented recurrent transformer for coherent video paragraph captioning:2603–2614. https://doi.org/10.18653/v1/2020.acl-main.233arXiv:2005.05402
– reference: Wang T, Zhang R, Lu Z, Zheng F, Cheng R, Luo P (2021) Endto-End Dense Video Captioning with Parallel Decoding. Proceedings of the IEEE International Conference on Computer Vision, 6827–6837. https://doi.org/10.1109/ICCV48922.2021.00677arXiv:2108.07781
– reference: Ramanishka V, Das A, Park DH, Venugopalan S, Hendricks LA, Rohrbach M, Saenko K (2016) Multimodal video description. MM 2016 -Proceedings of the 2016 ACM Multimedia Conference, 1092–1096. https://doi.org/10.1145/2964284.2984066
– reference: Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. (2014) Caffe: convolutional architecture for fast feature embedding. Mm 2014–proceedings of the 2014 ACM conference on multimedia (pp. 675-678). https://doi.org/10.1145/2647868.2654889
– reference: Pan Y, Yao T, Li H, Mei T (2017) Video captioning with transferred semantic attributes. Proc 30th IEEE Conf Comput Vis Pattern Recogn CVPR 2017:984–992. https://doi.org/10.1109/CVPR.2017.111arXiv:1611.07675
– reference: Babariya RJ, Tamaki T (2020) Meaning guided video captioning. In: Pattern Recognition: 5th Asian Conference, ACPR 2019, Auckland, New Zealand, November 26–29, 2019, Revised Selected Papers, Part II 5, pp 478–488. Springer International Publishing
– reference: ImHChoiY-SUAT: universal attention transformer for video captioningSensors20222213481710.3390/s22134817
– reference: Yan Y, Zhuang N, Bingbing Ni, Zhang J, Xu M, Zhang Q, et al (2019) Fine-grained video captioning via graph-based multi-granularity interaction learning. IEEE Trans Pattern Analys Mach Intel. https://doi.org/10.1109/TPAMI.2019.2946823
– reference: KojimaATamuraTFukunagaKNatural language description of human activities from video images based on concept hierarchy of actionsInt J Comput Vis200250217118410.1023/A:10203460326081012.68781
– reference: Raffel C, Ellis DPW (2015) Feed-forward networks with attention can solve some long-term memory problems, 1–6. arXiv:1512.08756
– reference: ZhangXSunXLuoYJiJZhouYWuYJiRRSTnet: captioning with adaptive attention on visual and non-visual wordsProc IEEE Comput Soc Conf Comput Vis Pattern Recogn20211154601546910.1109/CVPR46437.2021.01521
– reference: Liu S, Ren Z, Yuan J (2020) SibNet: sibling convolutional encoder for video captioning. IEEE Trans Pattern Analys Mach Intel, 1–1. https://doi.org/10.1109/tpami.2019.2940007
– reference: Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, Zha Z (2020) Object relational graph with teacher-recommended learning for video captioning. arXiv:2002.11566
– reference: BroxTPapenbergNWeickertJHigh accuracy optical flow estimation based on warping-presentationLecture Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)20143024May25361098.68736
– reference: RennieSJMarcheretEMrouehYRossJGoelVSelf-critical sequence training for image captioningProc 30th IEEE Conf Comput Vis Pattern Recogn CVPR201720171179119510.1109/CVPR.2017.131
– reference: Chen DZ, Gholami A, Niesner M, Chang AX (2021) Scan2Cap: context-aware dense captioning in RGB-D scans. 3192–3202. https://doi.org/10.1109/cvpr46437.2021.00321arXiv:2012.02206
– reference: Fang Z, Gokhale T, Banerjee P, Baral C, Yang Y (2020) Video2Commonsense: generating commonsense descriptions to enrich video captioning. arXiv:2003.05162
– reference: Wu D, Zhao H, Bao X, Wildes RP (2022) Sports video analysis on large-scale data (1). arXiv:2208.04897
– reference: Zhou L, Zhou Y, Corso JJ, Socher R, Xiong C (2018) End-to-End Dense Video Captioning with Masked Transformer. Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 8739–8748). https://doi.org/10.1109/CVPR.2018.00911
– reference: Rivera-soto RA, Ordóñez J (2013) Sequence to sequence models for generating video captions. http://cs231n.stanford.edu/reports/2017/pdfs/31.pdf
– reference: Zhang Z, Qi Z, Yuan C, Shan Y, Li B, Deng Y, Hu W (2021) Open-book video captioning with retrieve-copy-generate network. Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn, 9832–9841. https://doi.org/10.1109/CVPR46437.2021.00971arXiv:2103.05284
– reference: Zhao B, Li X, Lu X (2018) Video captioning with tube features. IICAI Int Joint Conf Artif Intel 2018:1177–1183. https://doi.org/10.24963/ijcai.2018/164
– reference: Hakeem A, Sheikh Y, Shah M (2004) CASE E: a hierarchical event representation for the analysis of videos. Proc Natl Conf Artif Intell:263–268
– reference: ZolfaghariMSinghKBroxTECO: efficient convolutional network for online video understandingLecture Notes Comput Sci (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics)20181120671373010.1007/978-3-030-01216-8-43
– reference: Ranzato M, Chopra S, Auli M, Zaremba W (2016) Sequence level training with recurrent neural networks. 4th international conference on learning representations, ICLR 2016—conference track proceedings (pp. 1–16)
– reference: Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press. (http://www.deeplearningbook.org)
– reference: Pan Y, Yao T, Li Y, Mei T (2020) X-linear attention networks for image captioning. Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn, 10968–10977. https://doi.org/10.1109/CVPR42600.2020.01098arXiv:2003.14080
– reference: Song Y, Chen S, Jin Q (2021) Towards diverse paragraph captioning for untrimmed videos. Proceedings of the IEEE Comput Soc Conf Comput Vis Pattern Recogn, 11240–11249. https://doi.org/10.1109/CVPR46437.2021.01109arXiv:2105.14477
– reference: Wang B, Ma L, Zhang W, Liu W (2018a) Reconstruction network for video captioning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7622–7631. https://doi.org/10.1109/CVPR.2018.00795
– reference: HeKZhangXRenSSunJDeep residual learning for image recognitionProc IEEE Comput Soc Conf Comput Vis Pattern Recogn2016201677077810.1109/CVPR.2016.90
– reference: ZhangYVogelSSignificance tests of automatic machine translation evaluation metricsMachine Transl2010241516510.1007/s10590-010-9073-6
– reference: Wallach B (2017) Developing: a world made for money (pp. 241–294). https://doi.org/10.2307/j.ctt1d98bxx.10
– reference: ZhangQZhangMChenTSunZMaYYuBRecent advances in convolutional neural network accelerationNeurocomputing2019323375110.1016/j.neucom.2018.09.038arXiv:1807.08596
– reference: ChenYZhangWWangSLiLHuangQSaliency-based spatiotemporal attention for video captioning2018 IEEE 4th Int Conf Multimedia Big Data BigMM2018201818
– reference: VenugopalanSRohrbachMDonahueJMooneyRDarrellTSaenkoKSequence to sequence -video to textProceedings IEEE Int Conf Comput Vis201520154534454210.1109/ICCV.2015.515
– reference: Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C (2021) ViViT: a video vision transformer. Proceedings of the IEEE international conference on computer vision, 6816–6826. https://doi.org/10.1109/ICCV48922.2021.00676arXiv:2103.15691
– reference: Park J, Song C, Han JH (2018) A study of evaluation metrics and datasets for video captioning. ICIIBMS 2017 -2nd Int Conf Intel Inform Biomed Sci 2018:172–175. https://doi.org/10.1109/ICIIBMS.2017.8279760
– reference: Vaswani A, Brain G, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. (2017) Attention is all you need. Adv Neural Inform Process Syst (Nips), 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
– reference: Hammad M, Hammad M, Elshenawy M (2019) Characterizing the impact of using features extracted from pretrained models on the quality of video captioning sequence-to-sequence models. arXiv:1911.09989
– reference: Hammoudeh A, Vanderplaetse B, Dupont S (2022) Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation:1–15. arXiv:2202.05728
– reference: Wang D, Song D (2017) Video Captioning with Semantic Information from the Knowledge Base. Proceedings -2017 IEEE International Conference on Big Knowledge, ICBK 2017 , 224–229. https://doi.org/10.1109/ICBK.2017.26
– reference: Li J, Qiu H (2020) Comparing attention-based neural architectures for video captioning, vol 1194. Available on: https://web.stanford.edu/class/archive/cs/cs224n/cs224n
SubjectTerms Artificial Intelligence
Attention
Coders
Computational linguistics
Computer Science
Computer vision
Deep learning
Dependency
Distinctiveness
Exploitation
Image processing
Language processing
Learning
Machine learning
Machine vision
Methods
Narration
Natural language interfaces
Natural language processing
Recurrence
Reinforcement
Subtitles & subtitling
Summarization
Surveillance
Surveys
Transformers
Variants
Video data
URI https://link.springer.com/article/10.1007/s10462-023-10414-6
https://www.proquest.com/docview/2867416496