DINOv2: Learning Robust Visual Features without Supervision

Bibliographic Details
Published in: Transactions on Machine Learning Research Journal
Main Authors: Oquab, Maxime, Darcet, Timothée, Moutakanni, Théo, Vo, Huy, Szafraniec, Marc, Khalidov, Vasil, Fernandez, Pierre, Haziza, Daniel, Massa, Francisco, El-Nouby, Alaaeldin, Assran, Mahmoud, Ballas, Nicolas, Galuba, Wojciech, Howes, Russell, Huang, Po-Yao, Li, Shang-Wen, Misra, Ishan, Rabbat, Michael, Sharma, Vasu, Synnaeve, Gabriel, Xu, Hu, Jegou, Hervé, Mairal, Julien, Labatut, Patrick, Joulin, Armand, Bojanowski, Piotr
Format: Journal Article
Language: English
Published: [Amherst, Massachusetts]: OpenReview.net, 2024
ISSN: 2835-8856
Abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
Author_xml (sequence, fullname, organization):
1. Oquab, Maxime (Meta AI)
2. Darcet, Timothée (Meta AI)
3. Moutakanni, Théo (CentraleSupélec)
4. Vo, Huy (Meta AI)
5. Szafraniec, Marc (Meta AI)
6. Khalidov, Vasil (Meta AI)
7. Fernandez, Pierre (Creating and exploiting explicit links between multimedia fragments)
8. Haziza, Daniel (Meta AI)
9. Massa, Francisco (Meta AI)
10. El-Nouby, Alaaeldin (Meta AI)
11. Assran, Mahmoud (Meta AI)
12. Ballas, Nicolas (Meta AI)
13. Galuba, Wojciech (Meta AI)
14. Howes, Russell (Meta AI)
15. Huang, Po-Yao (Meta AI)
16. Li, Shang-Wen (Meta AI)
17. Misra, Ishan (Meta AI)
18. Rabbat, Michael (Meta AI)
19. Sharma, Vasu (Meta AI)
20. Synnaeve, Gabriel (Meta AI)
21. Xu, Hu (Meta AI)
22. Jegou, Hervé (Meta AI)
23. Mairal, Julien (Learning models from massive data; ORCID 0000-0001-6991-2110)
24. Labatut, Patrick (Meta AI)
25. Joulin, Armand (Meta AI)
26. Bojanowski, Piotr (Meta AI)
BackLink https://hal.science/hal-04376640$$DView record in HAL
ContentType Journal Article
Copyright Attribution
DOI 10.48550/arxiv.2304.07193
DatabaseName Hyper Article en Ligne (HAL)
Hyper Article en Ligne (HAL) (Open Access)
Discipline Computer Science
ExternalDocumentID oai:HAL:hal-04376640v2
ISSN 2835-8856
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords Computer Vision and Pattern Recognition (cs.CV)
FOS: Computer and information sciences
License Attribution: http://creativecommons.org/licenses/by
ORCID 0000-0001-6991-2110
OpenAccessLink http://dx.doi.org/10.48550/arxiv.2304.07193
PublicationDate 2024
PublicationTitle Transactions on Machine Learning Research Journal
PublicationYear 2024
Publisher [Amherst, Massachusetts]: OpenReview.net
SecondaryResourceType preprint
SourceID hal
SourceType Open Access Repository
SubjectTerms Artificial Intelligence
Computer Science
Computer Vision and Pattern Recognition
Title DINOv2: Learning Robust Visual Features without Supervision
URI https://hal.science/hal-04376640
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ Directory of Open Access Journals
  issn: 2835-8856
  databaseCode: DOA
  dateStart: 20220101
  isFulltext: true
  providerName: Directory of Open Access Journals
– providerCode: PRVHPJ
  databaseName: ROAD: Directory of Open Access Scholarly Resources (ISSN International Center)
  issn: 2835-8856
  databaseCode: M~E
  dateStart: 20220101
  isFulltext: true
  providerName: ISSN International Centre