Harmonised Data is Actionable Data: DiSSCo’s solution to data mapping
| Published in: | Biodiversity Information Science and Standards, Volume 7; p. 10 |
|---|---|
| Main authors: | Leeflang, Sam; Addink, Wouter |
| Format: | Journal Article |
| Language: | English |
| Published: | Sofia: Pensoft Publishers, 6 September 2023 |
| Subjects: | Biodiversity; computer software; information science; Interoperability; lightning; Vocabularies & taxonomies |
| ISSN: | 2535-0897 |
| Abstract | Predictability is one of the core requirements for creating machine-actionable data. The more predictable the data, the more generic the service acting on the data can be. The more generic the service, the more easily we can exchange ideas, collaborate on initiatives and leverage machines to do the work. Predictability is essential for implementing the FAIR Principles (Findable, Accessible, Interoperable, Reusable), as it provides the “I” for Interoperability (Jacobsen et al. 2020). The FAIR Principles emphasise machine actionability because the amount of data generated is far too large for humans to handle.
While the standards maintained by Biodiversity Information Standards (TDWG) have greatly improved the standardisation of biodiversity data, there is still room for improvement. Within the Distributed System of Scientific Collections (DiSSCo), we aim to harmonise all scientific data derived from European specimen collections, including geological specimens, into a single data specification. We call this data specification the open Digital Specimen (openDS). It is being built on top of existing and developing biodiversity information standards such as Darwin Core (DwC), Minimum Information about a Digital Specimen (MIDS), Latimer Core, the Access to Biological Collection Data (ABCD) Schema with its Extension for Geosciences (EFG), and the new Global Biodiversity Information Facility (GBIF) Unified Model. In openDS we leverage the existing standards within the TDWG community but combine them with stricter constraints and controlled vocabularies, with the aim of improving the FAIRness of the data. This will not only make the data easier to use, but will also increase its quality and machine actionability.
The first step is the harmonisation of terms: we make sure that values with the same meaning are stored under the same term, used as the key, regardless of the source standard. This enables the next step, in which we harmonise the values themselves by transforming free-text values into standardised or controlled vocabularies. For example, instead of using the names J. Doe, John Doe and J. Doe sr. for a collector, we aim to standardise these to J. Doe, with a person identifier that connects this name with more information about the collector.
Biodiversity information standards such as DwC were developed to lower the bar for data sharing. The downside of their minimal constraints and flexibility is that they leave room for ambiguity, so the same data can be interpreted in multiple ways. This limits interoperability and hampers machine actionability. In DiSSCo, data will come from different sources that use different biodiversity information standards, so we need to harmonise terms between these standards. To complicate things further, different serialisation methods are used for data exchange: Darwin Core Archives (DwC-A; GBIF 2021) use comma-separated values (CSV) files, ABCD(EFG) exposed through the Biological Collection Access Service (BioCASe) uses XML, and most custom formats use JavaScript Object Notation (JSON).
In this lightning talk, we will dive into DiSSCo’s technical implementation of the harmonisation process. DiSSCo currently supports two biodiversity information standards, DwC and ABCD(EFG), and maps the data to our openDS specification on a record-by-record basis. We will highlight some of the more problematic mappings, but also show how a harmonised model greatly simplifies generic actions, such as the calculation of MIDS levels, which indicate how completely a specimen has been digitised. We will conclude with a quick look at the next steps and hope to start a discussion about controlled vocabularies.
The development of high-quality, standardised data based on a strict specification with controlled vocabularies, rooted in community-accepted standards, can have a huge impact on biodiversity research and is an essential step towards scaling up research with computational support. |
| Author (ORCID) | Leeflang, Sam (0000-0002-5669-2769); Addink, Wouter (0000-0002-3090-1761) |
| Copyright | 2023. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). |
| DOI | 10.3897/biss.7.112137 |
| References | GBIF (2021) Darwin Core Archives – How-to Guide, version 2.2. Jacobsen et al. (2020) doi: 10.1162/dint_r_00024 |
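
The two harmonisation steps described in the abstract (terms first, then values) can be illustrated with a minimal Python sketch. It is not DiSSCo's implementation and does not reproduce the openDS specification: the harmonised keys (`collector`, `collectorId`, `scientificName`), the simplified ABCD-style source keys, the function names and the `example.org` person identifier are all assumptions made for illustration. Only `dwc:recordedBy` and `dwc:scientificName` are taken from Darwin Core.

```python
from __future__ import annotations

# Step 1 input: hypothetical term mappings, one per source standard.
# The harmonised keys on the right are placeholders, not openDS terms.
DWC_TO_HARMONISED = {
    "dwc:recordedBy": "collector",
    "dwc:scientificName": "scientificName",
}
# Simplified, illustrative ABCD-style keys (real ABCD uses nested XML paths).
ABCD_TO_HARMONISED = {
    "abcd:gatheringAgent": "collector",
    "abcd:fullScientificName": "scientificName",
}

# Step 2 input: a tiny controlled vocabulary for collector names.
# The identifier is a made-up example URL, not a real person record.
PERSON_ID = "https://example.org/person/j-doe"
COLLECTOR_VOCABULARY = {
    "j. doe": ("J. Doe", PERSON_ID),
    "john doe": ("J. Doe", PERSON_ID),
    "j. doe sr.": ("J. Doe", PERSON_ID),
}


def harmonise_terms(record: dict[str, str], mapping: dict[str, str]) -> dict[str, str]:
    """Rename source terms to the shared keys; unmapped terms are dropped."""
    return {mapping[term]: value for term, value in record.items() if term in mapping}


def harmonise_collector(record: dict[str, str]) -> dict[str, str]:
    """Replace a free-text collector name with its controlled form plus identifier."""
    result = dict(record)
    raw_name = record.get("collector")
    if raw_name is None:
        return result
    match = COLLECTOR_VOCABULARY.get(raw_name.strip().lower())
    if match is not None:
        result["collector"], result["collectorId"] = match
    return result


# Two records for the same specimen, arriving in different standards.
dwc_record = {"dwc:recordedBy": "John Doe", "dwc:scientificName": "Bellis perennis"}
abcd_record = {"abcd:gatheringAgent": "J. Doe sr.", "abcd:fullScientificName": "Bellis perennis"}

from_dwc = harmonise_collector(harmonise_terms(dwc_record, DWC_TO_HARMONISED))
from_abcd = harmonise_collector(harmonise_terms(abcd_record, ABCD_TO_HARMONISED))
print(from_dwc == from_abcd)
```

Once both records share the same keys and the same controlled values, a downstream service can treat them identically without knowing which standard they came from, which is the point the abstract makes about generic services. Names that are not in the vocabulary are left unchanged in this sketch.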