Deep visual-semantic alignments for generating image descriptions

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination...

Bibliographic Details
Published in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128-3137
Main Authors: Karpathy, Andrej; Fei-Fei, Li
Format: Conference Proceeding, Journal Article
Language: English
Published: IEEE, 01.06.2015
Subjects:
ISSN: 1063-6919
Abstract We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K, and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and a new dataset of region-level annotations.
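Illustrative note: the abstract mentions a structured objective that aligns image regions and sentence words through a shared embedding, but it does not give the formula. The sketch below shows one plausible form, assumed here purely for illustration: an image-sentence score that sums, over words, the best inner product with any region, plus a max-margin ranking loss over a batch of matching pairs. The function names, the margin default, and the toy vectors are hypothetical, and the embeddings that would come from the CNN and the bidirectional RNN are replaced by hand-written lists.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def image_sentence_score(regions, words):
    """Assumed score: each word is matched to its best region, then summed."""
    return sum(max(dot(r, w) for r in regions) for w in words)


def ranking_loss(images, sentences, margin=1.0):
    """Assumed max-margin loss; images[k] and sentences[k] form a true pair.

    images:    list of images, each a list of region vectors
    sentences: list of sentences, each a list of word vectors
    """
    n = len(images)
    scores = [[image_sentence_score(images[k], sentences[l]) for l in range(n)]
              for k in range(n)]
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            # Wrong sentence l should score below the true sentence for image k.
            loss += max(0.0, scores[k][l] - scores[k][k] + margin)
            # Wrong image l should score below the true image for sentence k.
            loss += max(0.0, scores[l][k] - scores[k][k] + margin)
    return loss


if __name__ == "__main__":
    # Toy 2-D embeddings: two images (two regions each), two one-word sentences.
    images = [[[1, 0], [0, 1]], [[-1, 0], [0, -1]]]
    sentences = [[[1, 0]], [[-1, 0]]]
    print(ranking_loss(images, sentences))
```

In this toy batch the matching pairs already beat the mismatched ones by the default margin, so the loss is zero; a larger margin makes the same pairs contribute a penalty.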
Author Karpathy, Andrej
Fei-Fei, Li
Author_xml – sequence: 1
  givenname: Andrej
  surname: Karpathy
  fullname: Karpathy, Andrej
  email: karpathy@cs.stanford.edu
  organization: Department of Computer Science, Stanford University, USA
– sequence: 2
  givenname: Li
  surname: Fei-Fei
  fullname: Fei-Fei, Li
  email: feifeili@cs.stanford.edu
  organization: Department of Computer Science, Stanford University, USA
ContentType Conference Proceeding
Journal Article
DBID 6IE
6IH
CBEJK
RIE
RIO
7SC
8FD
JQ2
L7M
L~C
L~D
DOI 10.1109/CVPR.2015.7298932
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library Online
IEEE Proceedings Order Plans (POP) 1998-present
Computer and Information Systems Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle Computer and Information Systems Abstracts
Technology Research Database
Computer and Information Systems Abstracts – Academic
Advanced Technologies Database with Aerospace
ProQuest Computer Science Collection
Computer and Information Systems Abstracts Professional
DatabaseTitleList
Computer and Information Systems Abstracts
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
Computer Science
EISBN 1467369640
9781467369640
EISSN 1063-6919
EndPage 3137
ExternalDocumentID 7298932
Genre orig-research
GroupedDBID 23M
29F
29O
6IE
6IH
6IK
ABDPE
ACGFS
ALMA_UNASSIGNED_HOLDINGS
CBEJK
IPLJI
M43
RIE
RIO
RNS
7SC
8FD
JQ2
L7M
L~C
L~D
ID FETCH-LOGICAL-i208t-c9ffcc34d7764806f89320cdd7319cd86d363ed35f1ee95d0daae07d7dd904bd3
IEDL.DBID RIE
ISICitedReferencesCount 3129
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000387959203017&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1063-6919
IngestDate Fri Sep 05 10:02:20 EDT 2025
Wed Aug 20 06:20:47 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i208t-c9ffcc34d7764806f89320cdd7319cd86d363ed35f1ee95d0daae07d7dd904bd3
Notes ObjectType-Article-2
SourceType-Scholarly Journals-1
ObjectType-Conference-1
ObjectType-Feature-3
content type line 23
SourceType-Conference Papers & Proceedings-2
PQID 1770344662
PQPubID 23500
PageCount 10
ParticipantIDs ieee_primary_7298932
proquest_miscellaneous_1770344662
PublicationCentury 2000
PublicationDate 20150601
PublicationDateYYYYMMDD 2015-06-01
PublicationDate_xml – month: 06
  year: 2015
  text: 20150601
  day: 01
PublicationDecade 2010
PublicationTitle 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
PublicationTitleAbbrev CVPR
PublicationYear 2015
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssib030089920
ssj0023720
ssj0003211698
Score 2.5742872
SourceID proquest
ieee
SourceType Aggregation Database
Publisher
StartPage 3128
SubjectTerms Alignment
Annotations
Context modeling
Convolutional neural networks
Descriptions
Grounding
Image segmentation
Mathematical models
Natural languages
Pattern recognition
Recurrent neural networks
Retrieval
Sentences
Training
Vectors
Visualization
Title Deep visual-semantic alignments for generating image descriptions
URI https://ieeexplore.ieee.org/document/7298932
https://www.proquest.com/docview/1770344662
WOSCitedRecordID wos000387959203017&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint