Deep visual-semantic alignments for generating image descriptions
| Published in: | 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 3128 - 3137 |
|---|---|
| Main Authors: | Karpathy, Andrej; Fei-Fei, Li |
| Format: | Conference Proceeding Journal Article |
| Language: | English |
| Published: | IEEE, 01.06.2015 |
| Subjects: | |
| ISSN: | 1063-6919 |
| Abstract | We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations. |
|---|---|
| Author | Karpathy, Andrej; Fei-Fei, Li |
| Author_xml | 1. Karpathy, Andrej (karpathy@cs.stanford.edu), Department of Computer Science, Stanford University, USA; 2. Fei-Fei, Li (feifeili@cs.stanford.edu), Department of Computer Science, Stanford University, USA |
| ContentType | Conference Proceeding Journal Article |
| DOI | 10.1109/CVPR.2015.7298932 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library Online IEEE Proceedings Order Plans (POP) 1998-present Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
| DatabaseTitle | Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
| DatabaseTitleList | Computer and Information Systems Abstracts |
| Discipline | Applied Sciences Computer Science |
| EISBN | 1467369640 9781467369640 |
| EISSN | 1063-6919 |
| EndPage | 3137 |
| ExternalDocumentID | 7298932 |
| Genre | orig-research |
| ISICitedReferencesCount | 3129 |
| ISSN | 1063-6919 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 10 |
| PublicationCentury | 2000 |
| PublicationDate | 20150601 |
| PublicationDateYYYYMMDD | 2015-06-01 |
| PublicationDate_xml | – month: 06 year: 2015 text: 20150601 day: 01 |
| PublicationDecade | 2010 |
| PublicationTitle | 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2015 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 3128 |
| SubjectTerms | Alignment; Annotations; Context modeling; Convolutional neural networks; Descriptions; Grounding; Image segmentation; Mathematical models; Natural languages; Pattern recognition; Recurrent neural networks; Retrieval; Sentences; Training; Vectors; Visualization |
| Title | Deep visual-semantic alignments for generating image descriptions |
| URI | https://ieeexplore.ieee.org/document/7298932 https://www.proquest.com/docview/1770344662 |
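
The abstract describes aligning image regions and sentence words through a shared multimodal embedding, trained with a structured objective. This record does not give the formulas, so the following is only a minimal sketch under stated assumptions: regions and words are already embedded as plain vectors, the image-sentence score sums each word's best-matching region, and the objective is a max-margin ranking loss. The function names and the margin value are hypothetical.

```python
# Illustrative sketch only. The scoring rule, the loss, and all names here are
# assumptions made for this example; they are not taken from the record above.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def image_sentence_score(region_vecs, word_vecs):
    """Score an image against a sentence in the shared embedding space.

    Each word is aligned to the region it matches best, and those
    best-match similarities are summed over the words.
    """
    return sum(max(dot(r, w) for r in region_vecs) for w in word_vecs)


def ranking_loss(scores, margin=1.0):
    """Max-margin ranking loss over a square score matrix.

    scores[k][l] is the score of image k against sentence l; matching
    pairs lie on the diagonal. A true pair should outscore every
    mismatched pair in its row and column by at least `margin`.
    """
    n = len(scores)
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            # Wrong sentence l ranked against image k.
            loss += max(0.0, scores[k][l] - scores[k][k] + margin)
            # Wrong image l ranked against sentence k.
            loss += max(0.0, scores[l][k] - scores[k][k] + margin)
    return loss
```

For instance, with two regions `[[1, 0], [0, 1]]` and two words `[[2, 0], [0, 3]]`, each word picks a different region and the image-sentence score is the sum of the two best matches. In a trained system the vectors would come from the convolutional and recurrent networks named in the abstract rather than being hand-written.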