Video description: A comprehensive survey of deep learning approaches
Video description refers to understanding visual content and transforming that acquired understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing in conjunction with real-time and practical applications. Deep learning-based approaches employed for video description have demonstrated enhanced results compared to conventional approaches.
Saved in:
| Published in: | Artificial Intelligence Review, Volume 56, Issue 11, pp. 13293–13372 |
|---|---|
| Main Authors: | Rafiq, Ghazala; Rafiq, Muhammad; Choi, Gyu Sang |
| Format: | Journal Article |
| Language: | English |
| Published: | Dordrecht: Springer Netherlands, 01.11.2023 (Springer; Springer Nature B.V.) |
| Subjects: | |
| ISSN: | 0269-2821, 1573-7462 |
| Online Access: | Get full text |
| Abstract | Video description refers to understanding visual content and transforming that acquired understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing, and supports real-time, practical applications. Deep learning-based approaches to video description have demonstrated enhanced results compared to conventional approaches. The current literature lacks a thorough interpretation of the recently developed and employed sequence-to-sequence techniques for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture employing a specific composition of CNN, RNN, or the variants LSTM or GRU as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on specific salient content, achieving high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. The transformer is a modern and efficient transductive architecture for robust output. Free from recurrence and based solely on self-attention, it allows parallelization along with training on massive amounts of data, and it can fully utilize the available GPUs for most NLP tasks. Recently, with the emergence of several transformer variants, long-term dependency handling is no longer an issue for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes; they can find promising directions in this research. |
|---|---|
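The abstract describes the standard Encoder–Decoder recipe: a CNN extracts frame-level features, a recurrent encoder summarizes the clip, and an LSTM/GRU decoder emits the caption word by word. As a rough illustration only, the sketch below shows that pipeline in PyTorch; it is not code from the surveyed paper, and the class name `Seq2SeqCaptioner`, the layer sizes, and the toy vocabulary are hypothetical. It assumes `torch` and `torchvision` are installed.

```python
# Minimal sketch (not the surveyed authors' code) of a CNN + LSTM
# Encoder-Decoder video captioner, as outlined in the abstract.
# All names, dimensions, and the vocabulary size are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class Seq2SeqCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=256, embed_dim=128):
        super().__init__()
        cnn = resnet18()                                        # frame-level visual encoder (CNN)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])    # drop the classifier head
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (batch, time, 3, H, W); captions: (batch, length) of word ids
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)       # (b*t, feat_dim)
        feats = feats.view(b, t, -1)                            # per-frame features
        _, state = self.encoder(feats)                          # summarize the clip
        dec_in = self.embed(captions)                           # teacher-forced word embeddings
        dec_out, _ = self.decoder(dec_in, state)                # condition decoding on video state
        return self.out(dec_out)                                # per-step word logits


if __name__ == "__main__":
    model = Seq2SeqCaptioner(vocab_size=1000)
    video = torch.randn(2, 8, 3, 112, 112)       # 2 clips, 8 frames each
    caption = torch.randint(0, 1000, (2, 12))    # toy word-id sequences
    logits = model(video, caption)
    print(logits.shape)                          # torch.Size([2, 12, 1000])
```

The attention-based and transformer variants discussed in the abstract would replace or augment the plain LSTM decoder, for example by attending over the per-frame encoder outputs at every decoding step instead of using only the final hidden state.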
| Audience | Academic |
| Author | Choi, Gyu Sang; Rafiq, Ghazala; Rafiq, Muhammad |
| Author Details | Rafiq, Ghazala (Department of Information & Communication Engineering, Yeungnam University); Rafiq, Muhammad (rafiq@kmu.ac.kr, ORCID 0000-0001-6713-8766, Department of Game & Mobile Engineering, Keimyung University); Choi, Gyu Sang (castchoi@ynu.ac.kr, Department of Information & Communication Engineering, Yeungnam University) |
| ContentType | Journal Article |
| Copyright | The Author(s) 2023; COPYRIGHT 2023 Springer; Copyright Springer Nature B.V. Nov 2023 |
| DOI | 10.1007/s10462-023-10414-6 |
| Discipline | Computer Science |
| EISSN | 1573-7462 |
| EndPage | 13372 |
| GrantInformation | National Research Foundation of Korea, grants NRF-2019R1A2C1006159 and NRF-2021R1A6A1A03039493 (funder ID: http://dx.doi.org/10.13039/501100003725); 2022 Yeungnam University Research Grant |
| ISICitedReferencesCount | 19 |
| ISSN | 0269-2821 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 11 |
| Keywords | Deep learning; Text description; Encoder–Decoder architecture; Video description approaches; Video captioning; Video captioning techniques; Vision to text |
| Language | English |
| ORCID | 0000-0001-6713-8766 |
| OpenAccessLink | https://link.springer.com/10.1007/s10462-023-10414-6 |
| PageCount | 80 |
| PublicationDate | November 2023 |
| PublicationDateYYYYMMDD | 2023-11-01 |
| PublicationPlace | Dordrecht |
| PublicationSubtitle | An International Science and Engineering Journal |
| PublicationTitle | Artificial Intelligence Review |
| PublicationTitleAbbrev | Artif Intell Rev |
| PublicationYear | 2023 |
| Publisher | Springer Netherlands; Springer; Springer Nature B.V. |
Proc Natl Conf Artif Intell:263–268 – reference: ZolfaghariMSinghKBroxTECO: efficient convolutional network for online video understandingLecture Notes Comput Sci (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics)20181120671373010.1007/978-3-030-01216-8-43 – reference: Ranzato M, Chopra S, Auli M, Zaremba W (2016) Sequence level training with recurrent neural networks. 4th international conference on learning representations, ICLR 2016—conference track proceedings (pp. 1–16) – reference: Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press. (http://www.deeplearningbook.org) – reference: Pan Y, Yao T, Li Y, Mei T (2020) X-linear attention networks for image captioning. Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn, 10968–10977. https://doi.org/10.1109/CVPR42600.2020.01098arXiv:2003.14080 – reference: Song Y, Chen S, Jin Q (2021) Towards diverse paragraph captioning for untrimmed videos. Proceedings of the IEEE Comput Soc Conf Comput Vis Pattern Recogn, 11240–11249. https://doi.org/10.1109/CVPR46437.2021.01109arXiv:2105.14477 – reference: Wang B, Ma L, Zhang W, Liu W (2018a) Reconstruction network for video captioning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7622–7631. https://doi.org/10.1109/CVPR.2018.00795 – reference: HeKZhangXRenSSunJDeep residual learning for image recognitionProc IEEE Comput Soc Conf Comput Vis Pattern Recogn2016201677077810.1109/CVPR.2016.90 – reference: ZhangYVogelSSignificance tests of automatic machine translation evaluation metricsMachine Transl2010241516510.1007/s10590-010-9073-6 – reference: Wallach B (2017) Developing: a world made for money (pp. 241–294). https://doi.org/10.2307/j.ctt1d98bxx.10 – reference: ZhangQZhangMChenTSunZMaYYuBRecent advances in convolutional neural network accelerationNeurocomputing2019323375110.1016/j.neucom.2018.09.038arXiv:1807.08596 – reference: ChenYZhangWWangSLiLHuangQSaliency-based spatiotemporal attention for video captioning2018 IEEE 4th Int Conf Multimedia Big Data BigMM2018201818 – reference: VenugopalanSRohrbachMDonahueJMooneyRDarrellTSaenkoKSequence to sequence -video to textProceedings IEEE Int Conf Comput Vis201520154534454210.1109/ICCV.2015.515 – reference: Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C (2021) ViViT: a video vision transformer. Proceedings of the IEEE international conference on computer vision, 6816–6826. https://doi.org/10.1109/ICCV48922.2021.00676arXiv:2103.15691 – reference: Park J, Song C, Han JH (2018) A study of evaluation metrics and datasets for video captioning. ICIIBMS 2017 -2nd Int Conf Intel Inform Biomed Sci 2018:172–175. https://doi.org/10.1109/ICIIBMS.2017.8279760 – reference: Vaswani A, Brain G, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. (2017) Attention is all you need. Adv Neural Inform Process Syst (Nips), 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf – reference: Hammad M, Hammad M, Elshenawy M (2019) Characterizing the impact of using features extracted from pretrained models on the quality of video captioning sequence-to-sequence models. arXiv:1911.09989 – reference: Hammoudeh A, Vanderplaetse B, Dupont S (2022) Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation:1–15. arXiv:2202.05728 – reference: Wang D, Song D (2017) Video Captioning with Semantic Information from the Knowledge Base. Proceedings -2017 IEEE International Conference on Big Knowledge, ICBK 2017 , 224–229. 
https://doi.org/10.1109/ICBK.2017.26 – reference: Li J, Qiu H (2020) Comparing attention-based neural architectures for video captioning, vol 1194. Available on: https://web.stanford.edu/class/archive/cs/cs224n/cs224n – ident: 10414_CR106 – ident: 10414_CR129 – volume: 1 start-page: 8421 year: 2021 ident: 10414_CR28 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR46437.2021.00832 – volume: 50 start-page: 171 issue: 2 year: 2002 ident: 10414_CR77 publication-title: Int J Comput Vis doi: 10.1023/A:1020346032608 – ident: 10414_CR16 – volume: 33 start-page: 8393 year: 2019 ident: 10414_CR58 publication-title: Proceed AAAI Conf Artif Intel doi: 10.1609/aaai.v33i01.33018393 – ident: 10414_CR85 doi: 10.18653/v1/2020.acl-main.233 – ident: 10414_CR193 – ident: 10414_CR43 doi: 10.18653/v1/2020.emnlp-main.61 – ident: 10414_CR7 doi: 10.1186/s40537-021-00444-8 – volume: 2019 start-page: 7463 year: 2019 ident: 10414_CR139 publication-title: Proc IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2019.00756 – volume: 2019 start-page: 396 year: 2019 ident: 10414_CR165 publication-title: Proc 2019 IEEE Winter Conf App Comput Vis, WACV doi: 10.1109/WACV.2019.00048 – volume: 2019 start-page: 8917 year: 2019 ident: 10414_CR62 publication-title: IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2019.00901 – ident: 10414_CR101 – ident: 10414_CR51 – ident: 10414_CR97 doi: 10.18653/v1/2021.findings-acl.24 – ident: 10414_CR18 doi: 10.18653/v1/k18-1011 – ident: 10414_CR147 – ident: 10414_CR182 doi: 10.1109/CVPR.2019.00852 – year: 2015 ident: 10414_CR9 publication-title: Proc IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2015.279 – volume: 12366 start-page: 447 year: 2020 ident: 10414_CR86 publication-title: Lecture Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) doi: 10.1007/978-3-030-58589-1_27 – ident: 10414_CR3 doi: 10.1145/3355390 – volume: 2016 start-page: 5288 year: 2016 ident: 10414_CR166 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR.2016.571 – volume: 2012 start-page: 102 year: 2012 ident: 10414_CR13 publication-title: Uncertainty Artif Intell–Proc 28th Conf–UAI – ident: 10414_CR70 doi: 10.1109/CVPR.2014.223 – ident: 10414_CR52 – ident: 10414_CR72 doi: 10.3115/v1/D14-1086 – ident: 10414_CR168 doi: 10.1145/3123266.3123448 – volume: 2017 start-page: 4278 year: 2017 ident: 10414_CR141 publication-title: 31st AAAI Conf Artif Intel AAAI – ident: 10414_CR23 – ident: 10414_CR79 – ident: 10414_CR155 doi: 10.1109/ICCV48922.2021.00677 – volume: 2018 start-page: 1 year: 2018 ident: 10414_CR31 publication-title: 2018 IEEE 4th Int Conf Multimedia Big Data BigMM – volume: 19 start-page: 2045 issue: 9 year: 2017 ident: 10414_CR44 publication-title: IEEE Trans Multimedia doi: 10.1109/TMM.2017.2729019 – volume: 11219 start-page: 318 year: 2018 ident: 10414_CR164 publication-title: Lecture Notes Comput Sci (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) doi: 10.1007/978-3-030-01267-0_19 – ident: 10414_CR156 doi: 10.1109/CVPR.2018.00443 – ident: 10414_CR92 doi: 10.24963/ijcai.2017/307 – ident: 10414_CR149 doi: 10.1109/CVPR52688.2022.01747 – ident: 10414_CR176 – ident: 10414_CR74 – volume: 2019 start-page: 6571 year: 2019 ident: 10414_CR194 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR.2019.00674 – volume: 11206 start-page: 713 year: 2018 ident: 10414_CR197 publication-title: Lecture Notes Comput Sci 
(including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) doi: 10.1007/978-3-030-01216-8-43 – ident: 10414_CR8 doi: 10.1109/ICCSP.2019.8698097 – volume: 2017 start-page: 5159 year: 2017 ident: 10414_CR135 publication-title: Proc 30th IEEE Conf Comput Vis Pattern Recogn, CVPR 2017 doi: 10.1109/CVPR.2017.548c – ident: 10414_CR115 doi: 10.18653/v1/d17-1103 – volume: 9 start-page: 121665 year: 2021 ident: 10414_CR123 publication-title: IEEE Access doi: 10.1109/ACCESS.2021.3108565 – volume: 22 start-page: 229 issue: 1 year: 2020 ident: 10414_CR171 publication-title: IEEE Trans Multimedia doi: 10.1109/TMM.2019.2924576 – ident: 10414_CR10 doi: 10.1109/ICCV48922.2021.00676 – ident: 10414_CR12 – ident: 10414_CR22 doi: 10.1109/cvpr46437.2021.00321 – volume: 2015 start-page: 4489 year: 2015 ident: 10414_CR145 publication-title: Proc IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2015.510 – volume: 518 start-page: 529 issue: 7540 year: 2015 ident: 10414_CR107 publication-title: Nature doi: 10.1038/nature14236 – volume: 117 year: 2022 ident: 10414_CR67 publication-title: Appl Soft Comput doi: 10.1016/j.asoc.2021.108332 – volume: 9 start-page: 1735 issue: 8 year: 1997 ident: 10414_CR20 publication-title: Long Short–Term Memory – ident: 10414_CR50 – ident: 10414_CR153 doi: 10.1109/ICBK.2017.26 – ident: 10414_CR21 – volume: 95 start-page: 847 issue: 1997 year: 2018 ident: 10414_CR26 publication-title: Proc Mach Learn Res – ident: 10414_CR73 – ident: 10414_CR96 – year: 2018 ident: 10414_CR84 publication-title: Math Prob Eng doi: 10.1155/2018/3125879 – volume: 2019 start-page: 6283 year: 2019 ident: 10414_CR29 publication-title: IJCAI Int Joint Conf Artif Intell doi: 10.24963/ijcai.2019/877 – ident: 10414_CR192 doi: 10.1109/CVPR42600.2020.01311 – volume: 39 start-page: 677 issue: 4 year: 2017 ident: 10414_CR40 publication-title: IEEE Trans Pattern Analys Mach Intell doi: 10.1109/TPAMI.2016.2599174 – volume: 2017 start-page: 6250 year: 2017 ident: 10414_CR185 publication-title: Proc 30th IEEE Conf Comput Vis Pattern Recogn CVPR doi: 10.1109/CVPR.2017.662 – volume: 49 start-page: 2631 issue: 7 year: 2019 ident: 10414_CR17 publication-title: IEEE Trans Cybern doi: 10.1109/TCYB.2018.2831447 – ident: 10414_CR2 – ident: 10414_CR116 doi: 10.1007/s00371-021-02294-0 – ident: 10414_CR117 doi: 10.1109/icpr48806.2021.9412898 – ident: 10414_CR34 doi: 10.18653/v1/p19-1285 – ident: 10414_CR35 doi: 10.1109/CVPR.2013.340 – ident: 10414_CR110 – ident: 10414_CR104 – ident: 10414_CR195 doi: 10.1109/CVPR.2018.00911 – ident: 10414_CR173 doi: 10.1109/TPAMI.2019.2946823 – ident: 10414_CR189 doi: 10.1109/CVPR42600.2020.01329 – ident: 10414_CR112 doi: 10.1109/CVPR.2017.111 – ident: 10414_CR69 doi: 10.24963/ijcai.2020.88 – ident: 10414_CR99 doi: 10.1109/tpami.2019.2940007 – ident: 10414_CR98 doi: 10.1145/3240508.3240667 – volume: 2019 start-page: 2641 year: 2019 ident: 10414_CR151 publication-title: Proc IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2019.00273 – ident: 10414_CR137 doi: 10.1109/CVPR46437.2021.01109 – ident: 10414_CR154 doi: 10.1155/2020/3062706 – ident: 10414_CR56 – ident: 10414_CR33 doi: 10.3115/v1/d14-1179 – ident: 10414_CR49 doi: 10.18653/v1/D18-1117 – volume: 3265 start-page: 134 year: 2004 ident: 10414_CR82 publication-title: Lecture Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) doi: 10.1007/978-3-540-30194-3-16 – ident: 10414_CR157 doi: 10.18653/v1/N18-2125 – ident: 10414_CR184 doi: 10.1109/tpami.2019.2920899 
– volume: 2016 start-page: 770 year: 2016 ident: 10414_CR59 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR.2016.90 – volume: 11217 start-page: 367 year: 2018 ident: 10414_CR30 publication-title: Lecture Notes Comput Sci (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) doi: 10.1007/978-3-030-01261-8_22 – ident: 10414_CR138 – ident: 10414_CR172 – ident: 10414_CR105 doi: 10.1109/INCET54531.2022.9824569 – year: 2021 ident: 10414_CR37 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR46437.2021.00030 – volume: 23 start-page: 1772 year: 2021 ident: 10414_CR170 publication-title: IEEE Trans Multimedia doi: 10.1109/TMM.2020.3002669 – volume: 2019 start-page: 6713 year: 2019 ident: 10414_CR181 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR.2019.00688 – ident: 10414_CR95 – year: 2020 ident: 10414_CR167 publication-title: Appl Sci (Switzerland) doi: 10.3390/app10124312 – ident: 10414_CR113 doi: 10.1109/CVPR42600.2020.01098 – ident: 10414_CR81 doi: 10.3115/1626355.1626389 – volume: 22 start-page: 4817 issue: 13 year: 2022 ident: 10414_CR66 publication-title: Sensors doi: 10.3390/s22134817 – ident: 10414_CR196 – ident: 10414_CR54 – ident: 10414_CR121 – volume: 8 start-page: 1 issue: 2002 year: 2022 ident: 10414_CR191 publication-title: Peer J Comput Sci doi: 10.7717/PEERJ-CS.916 – volume: 2015 start-page: 44 year: 2016 ident: 10414_CR80 publication-title: Coling – volume: 2017 start-page: 328 year: 2017 ident: 10414_CR15 publication-title: Proc IEEE 3rd Int Conf Collaboration Internet Comput CIC 2017 doi: 10.1109/CIC.2017.00050 – ident: 10414_CR118 doi: 10.1109/WACV48630.2021.00308 – volume: 1 start-page: 270 issue: 2 year: 1989 ident: 10414_CR159 publication-title: Neural Comput doi: 10.1162/neco.1989.1.2.270 – volume: 14 start-page: 1 issue: 8 year: 2019 ident: 10414_CR46 publication-title: IEEE Trans Pattern Analys Mach Intell doi: 10.1109/tpami.2019.2894139 – ident: 10414_CR190 doi: 10.24963/ijcai.2018/164 – ident: 10414_CR94 doi: 10.1109/CVPR.2018.00782 – ident: 10414_CR109 – ident: 10414_CR126 – ident: 10414_CR48 – ident: 10414_CR111 doi: 10.1109/CVPR.2016.497 – ident: 10414_CR143 doi: 10.1109/CVPR.2015.7299087 – ident: 10414_CR6 doi: 10.1609/aaai.v33i01.33013159 – ident: 10414_CR83 doi: 10.1109/ICIP.2019.8803143 – volume: 22 start-page: 621 issue: 2 year: 2019 ident: 10414_CR93 publication-title: World Wide Web doi: 10.1007/s11280-018-0531-z – volume: 2 start-page: 452 year: 2014 ident: 10414_CR41 publication-title: 52nd Annu Meet Assoc Comput Linguistics ACL 2014–Proc Conf doi: 10.3115/v1/p14-2074 – ident: 10414_CR174 doi: 10.1609/aaai.v35i4.16421 – volume: 2019 start-page: 1300 year: 2019 ident: 10414_CR140 publication-title: Proc IEEE Int Conf Multimedia Expo doi: 10.1109/ICME.2019.00226 – ident: 10414_CR131 doi: 10.1609/aaai.v35i3.16353 – ident: 10414_CR114 doi: 10.1109/ICIIBMS.2017.8279760 – ident: 10414_CR150 doi: 10.2307/j.ctt1d98bxx.10 – volume: 2015 start-page: 4534 year: 2015 ident: 10414_CR148 publication-title: Proceedings IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2015.515 – volume: 8828 start-page: 1 year: 2022 ident: 10414_CR57 publication-title: IEEE Trans Pattern Analys Mach Intel doi: 10.1109/TPAMI.2022.3152247 – ident: 10414_CR178 doi: 10.1162/tacl_a_00166 – ident: 10414_CR158 doi: 10.1109/ICCV.2019.00468 – ident: 10414_CR71 – ident: 10414_CR1 – volume: 3 start-page: 297 issue: 4 year: 2019 ident: 10414_CR91 
publication-title: IEEE Trans Emerg Top Comput Intel doi: 10.1109/tetci.2019.2892755 – ident: 10414_CR161 doi: 10.1145/3122865.3122867 – ident: 10414_CR25 doi: 10.1609/aaai.v33i01.33018167 – ident: 10414_CR87 doi: 10.1002/0470018860.s00225 – volume: 1 start-page: 15460 year: 2021 ident: 10414_CR186 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR46437.2021.01521 – ident: 10414_CR88 – ident: 10414_CR120 – ident: 10414_CR133 doi: 10.1109/CVPR52688.2022.01743 – ident: 10414_CR32 – ident: 10414_CR162 – volume: 24 start-page: 51 issue: 1 year: 2010 ident: 10414_CR187 publication-title: Machine Transl doi: 10.1007/s10590-010-9073-6 – ident: 10414_CR134 doi: 10.18653/v1/p18-3003 – ident: 10414_CR188 doi: 10.1109/CVPR46437.2021.00971 – year: 2022 ident: 10414_CR4 publication-title: IEEE Trans Multimedia doi: 10.1109/TMM.2022.3146005 – ident: 10414_CR64 – volume: 2017 start-page: 4203 year: 2017 ident: 10414_CR60 publication-title: Proc IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2017.450 – volume: 2017 start-page: 6119 year: 2017 ident: 10414_CR179 publication-title: Proc 30th IEEE Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR.2017.648 – volume: 2017 start-page: 1151 year: 2017 ident: 10414_CR127 publication-title: Proc 30th IEEE Conf Comput Vis Pattern Recogn CVPR doi: 10.1109/CVPR.2017.128 – ident: 10414_CR125 – ident: 10414_CR36 doi: 10.18653/v1/d16-1146 – ident: 10414_CR102 – ident: 10414_CR11 doi: 10.1007/978-3-030-41299-9_37 – ident: 10414_CR124 doi: 10.1145/2964284.2984066 – ident: 10414_CR119 – ident: 10414_CR61 doi: 10.1109/WACV48630.2021.00102 – volume: 2017 start-page: 1179 year: 2017 ident: 10414_CR128 publication-title: Proc 30th IEEE Conf Comput Vis Pattern Recogn CVPR doi: 10.1109/CVPR.2017.131 – ident: 10414_CR122 doi: 10.3390/s20061702 – ident: 10414_CR163 – ident: 10414_CR160 doi: 10.1007/978-3-031-19836-6_2 – volume: 2016 start-page: 4651 year: 2016 ident: 10414_CR177 publication-title: Proc IEEE Comput Soc Conf Comput Vis Pattern Recogn doi: 10.1109/CVPR.2016.503 – ident: 10414_CR76 – ident: 10414_CR24 – volume: 31 start-page: 202 year: 2022 ident: 10414_CR45 publication-title: IEEE Trans Image Process doi: 10.1109/TIP.2021.3120867 – volume: 3024 start-page: 25 issue: May year: 2014 ident: 10414_CR19 publication-title: Lecture Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) – volume: 9 start-page: 70797 year: 2021 ident: 10414_CR5 publication-title: IEEE Access doi: 10.1109/access.2021.3078295 – year: 2022 ident: 10414_CR63 publication-title: Comput Intel Neurosci doi: 10.1155/2022/3454167 – volume: 45 start-page: 2673 issue: 11 year: 1997 ident: 10414_CR132 publication-title: Neural Netw – ident: 10414_CR68 doi: 10.1145/2647868.2654889 – ident: 10414_CR175 – ident: 10414_CR180 doi: 10.1109/CVPR52688.2022.00837 – ident: 10414_CR39 doi: 10.3115/1289189.1289273 – volume: 2017 start-page: 706 year: 2017 ident: 10414_CR78 publication-title: Proc Int Conf Comput Vis doi: 10.1109/ICCV.2017.83 – ident: 10414_CR136 doi: 10.24963/ijcai.2017/381 – ident: 10414_CR108 doi: 10.1016/s1364-6613(99)01331-5 – volume: 323 start-page: 37 year: 2019 ident: 10414_CR183 publication-title: Neurocomputing doi: 10.1016/j.neucom.2018.09.038 – ident: 10414_CR142 doi: 10.1109/CVPR.2015.7298594 – volume: 395 start-page: 222 year: 2020 ident: 10414_CR47 publication-title: Neurocomputing doi: 10.1016/j.neucom.2018.06.096 – ident: 10414_CR146 – ident: 10414_CR75 doi: 10.18653/v1/e17-1019 – ident: 
10414_CR144 doi: 10.5555/946247.946665 – ident: 10414_CR169 – ident: 10414_CR38 doi: 10.1109/CVPR.2009.5206848 – ident: 10414_CR100 doi: 10.1109/ICCV.1999.790410 – ident: 10414_CR42 – ident: 10414_CR90 doi: 10.18653/v1/2020.emnlp-main.161 – year: 2013 ident: 10414_CR130 publication-title: Proc IEEE Int Conf Comput Vis doi: 10.1109/ICCV.2013.61 – ident: 10414_CR152 doi: 10.1109/CVPR.2018.00795 – volume: 33 start-page: 8191 year: 2019 ident: 10414_CR27 publication-title: Proc AAAI Conf Artif Intel doi: 10.1609/aaai.v33i01.33018191 – ident: 10414_CR65 doi: 10.1109/CVPRW50498.2020.00487 – ident: 10414_CR53 – ident: 10414_CR103 doi: 10.1109/CVPR.2017.345 – year: 2009 ident: 10414_CR14 publication-title: ACM Int Conf Proc Ser doi: 10.1145/1553374.1553380 – ident: 10414_CR55 doi: 10.1007/978-3-030-59830-3_21 |
| SubjectTerms | Artificial Intelligence; Attention; Coders; Computational linguistics; Computer Science; Computer vision; Deep learning; Dependency; Distinctiveness; Exploitation; Image processing; Language processing; Learning; Machine learning; Machine vision; Methods; Narration; Natural language interfaces; Natural language processing; Recurrence; Reinforcement; Subtitles & subtitling; Summarization; Surveillance; Surveys; Transformers; Variants; Video data |
| URI | https://link.springer.com/article/10.1007/s10462-023-10414-6 https://www.proquest.com/docview/2867416496 |