Double Trouble? Impact and Detection of Duplicates in Face Image Datasets
Various face image datasets intended for facial biometrics research were created via web-scraping, i.e. the collection of images publicly available on the internet. This work presents an approach to detect both exactly and nearly identical face image duplicates, using file and image hashes. The appr...
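The abstract names two kinds of hashes: file hashes for exact duplicates and image hashes for near duplicates. The record does not say which hash functions the paper uses, so the following is only a minimal sketch under stated assumptions: SHA-256 stands in for the file hash, a toy "average hash" over a small grayscale grid stands in for the image hash, and the function names and the `max_distance` threshold are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_hash(data: bytes) -> str:
    """SHA-256 of the raw file bytes; equal digests indicate byte-identical files."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels):
    """Toy average hash of a small grayscale grid (values 0-255).

    Each bit is 1 when the pixel is brighter than the grid mean.
    This is an illustrative stand-in, not the paper's image hash.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def group_exact_duplicates(files):
    """Group file names whose contents share the same file hash."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[file_hash(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def near_duplicate_pairs(images, max_distance=2):
    """Return name pairs whose image hashes differ by at most max_distance bits.

    max_distance is an assumed threshold chosen for illustration only.
    """
    hashes = {name: average_hash(px) for name, px in images.items()}
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming(hashes[a], hashes[b]) <= max_distance
    ]
```

In this sketch a uniformly brightened copy of an image keeps the same average hash, because the mean shifts with the pixels, while it would no longer match on the file hash. According to the abstract, the paper adds face image preprocessing plus face recognition and quality assessment steps on top of the hashes to reduce false positives; none of that is modelled here.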
| Published in: | arXiv.org |
|---|---|
| Main authors: | Schlett, Torsten; Rathgeb, Christian; Tapia, Juan; Busch, Christoph |
| Format: | Paper |
| Language: | English |
| Published: | Ithaca: Cornell University Library, arXiv.org, 25 January 2024 |
| Subjects: | Datasets; Face recognition; Image quality; Quality assessment |
| ISSN: | 2331-8422 |
| Online access: | Full text |
| Abstract | Various face image datasets intended for facial biometrics research were created via web-scraping, i.e. the collection of images publicly available on the internet. This work presents an approach to detect both exactly and nearly identical face image duplicates, using file and image hashes. The approach is extended through the use of face image preprocessing. Additional steps based on face recognition and face image quality assessment models reduce false positives, and facilitate the deduplication of the face images both for intra- and inter-subject duplicate sets. The presented approach is applied to five datasets, namely LFW, TinyFace, Adience, CASIA-WebFace, and C-MS-Celeb (a cleaned MS-Celeb-1M variant). Duplicates are detected within every dataset, with hundreds to hundreds of thousands of duplicates for all except LFW. Face recognition and quality assessment experiments indicate a minor impact on the results through the duplicate removal. The final deduplication data is publicly available. |
|---|---|
| Author | Busch, Christoph; Schlett, Torsten; Rathgeb, Christian; Tapia, Juan |
| Author_xml | – sequence: 1 givenname: Torsten surname: Schlett fullname: Schlett, Torsten – sequence: 2 givenname: Christian surname: Rathgeb fullname: Rathgeb, Christian – sequence: 3 givenname: Juan surname: Tapia fullname: Tapia, Juan – sequence: 4 givenname: Christoph surname: Busch fullname: Busch, Christoph |
| ContentType | Paper |
| Copyright | 2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
| Copyright_xml | – notice: 2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
| DBID | 8FE 8FG ABJCF ABUWG AFKRA AZQEC BENPR BGLVJ CCPQU DWQXO HCIFZ L6V M7S PHGZM PHGZT PIMPY PKEHL PQEST PQGLB PQQKQ PQUKI PRINS PTHSS |
| DOI | 10.48550/arxiv.2401.14088 |
| DatabaseName | ProQuest SciTech Collection ProQuest Technology Collection Materials Science & Engineering Collection ProQuest Central (Alumni) ProQuest Central UK/Ireland ProQuest Central Essentials - QC ProQuest Central Technology collection ProQuest One Community College ProQuest Central SciTech Premium Collection ProQuest Engineering Collection Engineering Database Proquest Central Premium ProQuest One Academic ProQuest Publicly Available Content Database ProQuest One Academic Middle East (New) ProQuest One Academic Eastern Edition (DO NOT USE) ProQuest One Applied & Life Sciences ProQuest One Academic (retired) ProQuest One Academic UKI Edition ProQuest Central China Engineering Collection |
| DatabaseTitle | Publicly Available Content Database Engineering Database Technology Collection ProQuest One Academic Middle East (New) ProQuest Central Essentials ProQuest One Academic Eastern Edition ProQuest Central (Alumni Edition) SciTech Premium Collection ProQuest One Community College ProQuest Technology Collection ProQuest SciTech Collection ProQuest Central China ProQuest Central ProQuest One Applied & Life Sciences ProQuest Engineering Collection ProQuest One Academic UKI Edition ProQuest Central Korea Materials Science & Engineering Collection ProQuest Central (New) ProQuest One Academic ProQuest One Academic (New) Engineering Collection |
| DatabaseTitleList | Publicly Available Content Database |
| Database_xml | – sequence: 1 dbid: PIMPY name: Publicly Available Content Database url: http://search.proquest.com/publiccontent sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Physics |
| EISSN | 2331-8422 |
| Genre | Working Paper/Pre-Print |
| GroupedDBID | 8FE 8FG ABJCF ABUWG AFKRA ALMA_UNASSIGNED_HOLDINGS AZQEC BENPR BGLVJ CCPQU DWQXO FRJ HCIFZ L6V M7S M~E PHGZM PHGZT PIMPY PKEHL PQEST PQGLB PQQKQ PQUKI PRINS PTHSS |
| ID | FETCH-LOGICAL-a522-6f3579b3cd6ea85cb17819c043233919a8b9e7aad083b1cc2f0b28d1e721c6b73 |
| IEDL.DBID | BENPR |
| IngestDate | Mon Jun 30 09:16:27 EDT 2025 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-a522-6f3579b3cd6ea85cb17819c043233919a8b9e7aad083b1cc2f0b28d1e721c6b73 |
| Notes | SourceType-Working Papers-1 ObjectType-Working Paper/Pre-Print-1 content type line 50 |
| OpenAccessLink | https://www.proquest.com/docview/2918665562?pq-origsite=%requestingapplication% |
| PQID | 2918665562 |
| PQPubID | 2050157 |
| ParticipantIDs | proquest_journals_2918665562 |
| PublicationCentury | 2000 |
| PublicationDate | 20240125 |
| PublicationDateYYYYMMDD | 2024-01-25 |
| PublicationDate_xml | – month: 01 year: 2024 text: 20240125 day: 25 |
| PublicationDecade | 2020 |
| PublicationPlace | Ithaca |
| PublicationPlace_xml | – name: Ithaca |
| PublicationTitle | arXiv.org |
| PublicationYear | 2024 |
| Publisher | Cornell University Library, arXiv.org |
| Publisher_xml | – name: Cornell University Library, arXiv.org |
| SSID | ssj0002672553 |
| Score | 1.8585418 |
| SecondaryResourceType | preprint |
| Snippet | Various face image datasets intended for facial biometrics research were created via web-scraping, i.e. the collection of images publicly available on the... |
| SourceID | proquest |
| SourceType | Aggregation Database |
| SubjectTerms | Datasets; Face recognition; Image quality; Quality assessment |
| Title | Double Trouble? Impact and Detection of Duplicates in Face Image Datasets |
| URI | https://www.proquest.com/docview/2918665562 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Double+Trouble%3F+Impact+and+Detection+of+Duplicates+in+Face+Image+Datasets&rft.jtitle=arXiv.org&rft.au=Schlett%2C+Torsten&rft.au=Rathgeb%2C+Christian&rft.au=Tapia%2C+Juan&rft.au=Busch%2C+Christoph&rft.date=2024-01-25&rft.pub=Cornell+University+Library%2C+arXiv.org&rft.eissn=2331-8422&rft_id=info:doi/10.48550%2Farxiv.2401.14088 |