DINOv2: Learning Robust Visual Features without Supervision
| Published in: | Transactions on Machine Learning Research Journal |
|---|---|
| Main authors: | Oquab, Maxime; Darcet, Timothée; Moutakanni, Théo; Vo, Huy; Szafraniec, Marc; Khalidov, Vasil; Fernandez, Pierre; Haziza, Daniel; Massa, Francisco; El-Nouby, Alaaeldin; Assran, Mahmoud; Ballas, Nicolas; Galuba, Wojciech; Howes, Russell; Huang, Po-Yao; Li, Shang-Wen; Misra, Ishan; Rabbat, Michael; Sharma, Vasu; Synnaeve, Gabriel; Xu, Hu; Jegou, Hervé; Mairal, Julien; Labatut, Patrick; Joulin, Armand; Bojanowski, Piotr |
| Format: | Journal Article |
| Language: | English |
| Publication details: | [Amherst Massachusetts]: OpenReview.net, 2022 |
| Published: | 2024 |
| Subjects: | Artificial Intelligence; Computer Science; Computer Vision and Pattern Recognition |
| ISSN: | 2835-8856 |
| Online access: | Get full text |
| Abstract | The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels. |
|---|---|
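The distillation step mentioned in the abstract (a 1B-parameter ViT teacher compressed into a series of smaller students) can be illustrated with a toy sketch. This is not the paper's training code: the synthetic features, the dimensions, and the closed-form least-squares projection head below are all illustrative assumptions; the actual method trains the student network end-to-end with a distillation loss.

```python
import numpy as np

# Toy sketch of the feature-distillation idea: a smaller "student" is fitted
# so that a learned projection of its features matches the frozen "teacher"
# features on the same batch. Everything here is synthetic/illustrative.
rng = np.random.default_rng(0)
n, d_student, d_teacher = 512, 384, 1024

# Stand-ins for backbone outputs on a batch of n images.
student_feats = rng.standard_normal((n, d_student))
true_map = rng.standard_normal((d_student, d_teacher))
teacher_feats = student_feats @ true_map + 0.1 * rng.standard_normal((n, d_teacher))

# Fit the projection head in closed form (least squares); real training
# would instead backpropagate the loss through the whole student network.
head, *_ = np.linalg.lstsq(student_feats, teacher_feats, rcond=None)

mse = np.mean((student_feats @ head - teacher_feats) ** 2)
print(f"distillation MSE on the fit batch: {mse:.4f}")
```

With the noise level set to 0.1, the fitted head should drive the mean squared residual down to roughly the noise floor, which is the sense in which the student "inherits" the teacher's feature space.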
| Author | Massa, Francisco; Vo, Huy; Synnaeve, Gabriel; Khalidov, Vasil; Ballas, Nicolas; Bojanowski, Piotr; Labatut, Patrick; Joulin, Armand; Oquab, Maxime; Xu, Hu; Szafraniec, Marc; Haziza, Daniel; El-Nouby, Alaaeldin; Li, Shang-Wen; Misra, Ishan; Fernandez, Pierre; Huang, Po-Yao; Darcet, Timothée; Howes, Russell; Galuba, Wojciech; Mairal, Julien; Jegou, Hervé; Sharma, Vasu; Moutakanni, Théo; Rabbat, Michael; Assran, Mahmoud |
| BackLink | https://hal.science/hal-04376640 (View record in HAL) |
| ContentType | Journal Article |
| Copyright | Attribution |
| DOI | 10.48550/arxiv.2304.07193 |
| Discipline | Computer Science |
| ExternalDocumentID | oai:HAL:hal-04376640v2 |
| ISSN | 2835-8856 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Computer Vision and Pattern Recognition (cs.CV); FOS: Computer and information sciences |
| Language | English |
| License | Attribution: http://creativecommons.org/licenses/by |
| ORCID | 0000-0001-6991-2110 |
| OpenAccessLink | http://dx.doi.org/10.48550/arxiv.2304.07193 |
| PublicationDate | 2024 |
| PublicationTitle | Transactions on Machine Learning Research Journal |
| PublicationYear | 2024 |
| Publisher | [Amherst Massachusetts]: OpenReview.net, 2022 |
| SecondaryResourceType | preprint |
| SourceID | hal |
| SourceType | Open Access Repository |
| SubjectTerms | Artificial Intelligence; Computer Science; Computer Vision and Pattern Recognition |
| Title | DINOv2: Learning Robust Visual Features without Supervision |
| URI | https://hal.science/hal-04376640 |