Harmonised Data is Actionable Data: DiSSCo’s solution to data mapping

Published in: Biodiversity Information Science and Standards, Volume 7, p. 10
Main authors: Leeflang, Sam; Addink, Wouter
Format: Journal Article
Language: English
Publication details: Sofia: Pensoft Publishers, 06.09.2023
ISSN: 2535-0897
Abstract: Predictability is one of the core requirements for creating machine-actionable data. The more predictable the data, the more generic the service acting on the data can be. The more generic the service, the more easily we can exchange ideas, collaborate on initiatives and leverage machines to do the work. Predictability is essential for implementing the FAIR Principles (Findable, Accessible, Interoperable, Reusable), as it provides the “I” for Interoperability (Jacobsen et al. 2020). The FAIR principles emphasise machine actionability because the amount of data generated is far too large for humans to handle.

While the standards developed by Biodiversity Information Standards (TDWG) have massively improved the standardisation of biodiversity data, there is still room for improvement. Within the Distributed System of Scientific Collections (DiSSCo), we aim to harmonise all scientific data derived from European specimen collections, including geological specimens, into a single data specification. We call this data specification the open Digital Specimen (openDS). It is being built on top of existing and developing biodiversity information standards such as Darwin Core (DwC), Minimal Information Digital Specimen (MIDS), Latimer Core, Access to Biological Collection Data (ABCD) Schema, Extension for Geosciences (EFG) and also on the new Global Biodiversity Information Facility (GBIF) Unified Model. In openDS we leverage the existing standards within the TDWG community but combine these with stricter constraints and controlled vocabularies, with the aim of improving the FAIRness of the data. This will not only make the data easier to use, but will also increase its quality and machine actionability.

The first step towards this is the harmonisation of terms: we make sure that similar values use the same term in a standard as their key. This enables the next step, in which we harmonise the values. We can then transform free-text values into standardised values or controlled vocabulary terms. For example: instead of using the names J. Doe, John Doe and J. Doe sr. for a collector, we aim to standardise these to J. Doe, with a person identifier that connects this name with more information about the collector.

Biodiversity information standards such as DwC were developed to lower the bar for data sharing. The downside of their minimal constraints and flexibility is that they leave room for ambiguity, leading to multiple interpretations. This limits interoperability and hampers machine actionability. In DiSSCo, data will come from different sources that use different biodiversity information standards. To cover this, we need to harmonise terms between these standards. To complicate things further, different serialisation methods are used for data exchange. Darwin Core Archives (DwC-A; GBIF 2021) use comma-separated values (CSV) files. ABCD(EFG) exposed through the Biological Collection Access Service (BioCASe) uses XML. Most custom formats use JavaScript Object Notation (JSON).

In this lightning talk, we will dive into DiSSCo’s technical implementation of the harmonisation process. DiSSCo currently supports two biodiversity information standards, DwC and ABCD(EFG), and maps the data to our openDS specification on a record-by-record basis. We will highlight some of the more problematic mappings, but also show how a harmonised model massively simplifies generic actions, such as the calculation of MIDS levels, which provide information about the digitisation completeness of a specimen. We will conclude by having a quick look at the next steps and hope to start a discussion about controlled vocabularies.

The development of high-quality, standardised data based on a strict specification with controlled vocabularies, rooted in community-accepted standards, can have a huge impact on biodiversity research and is an essential step towards scaling up research with computational support.
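
The two harmonisation steps described in the abstract (terms first, then values) can be illustrated with a minimal Python sketch. It is not DiSSCo's implementation: the target keys `collector`, `collectorId` and `physicalSpecimenId`, the simplified ABCD key `GatheringAgent`, and the `example.org` person identifier are all assumptions made for illustration. Only `recordedBy`, `catalogNumber` (DwC) and `UnitID` (ABCD) are borrowed from the source standards.

```python
# Illustrative sketch only. Target keys, the simplified ABCD key and the
# person identifier are hypothetical, not the actual openDS mapping.

# Step 1: term harmonisation. Terms from each source standard that carry
# similar values are mapped to one shared target key.
TERM_MAP = {
    "dwc": {"recordedBy": "collector", "catalogNumber": "physicalSpecimenId"},
    "abcd": {"GatheringAgent": "collector", "UnitID": "physicalSpecimenId"},
}

# Step 2: value harmonisation. Known free-text variants of a collector name
# map to one standard name plus a (hypothetical) person identifier.
JDOE = ("J. Doe", "https://example.org/person/jdoe")
COLLECTORS = {"j. doe": JDOE, "john doe": JDOE, "j. doe sr.": JDOE}


def harmonise_terms(record, standard):
    """Rename the source terms of one record to the shared target keys."""
    mapping = TERM_MAP[standard]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def harmonise_collector(record):
    """Replace a known collector name variant by its standard form."""
    result = dict(record)
    # Normalise case and whitespace before the lookup.
    lookup = " ".join(record.get("collector", "").lower().split())
    if lookup in COLLECTORS:
        result["collector"], result["collectorId"] = COLLECTORS[lookup]
    return result  # unknown names are left untouched


def harmonise(record, standard):
    """Apply both steps to a single record (record-by-record mapping)."""
    return harmonise_collector(harmonise_terms(record, standard))


dwc_record = {"recordedBy": "John Doe", "catalogNumber": "A-1"}
abcd_record = {"GatheringAgent": "J. Doe sr.", "UnitID": "B-2"}
print(harmonise(dwc_record, "dwc"))
print(harmonise(abcd_record, "abcd"))
```

Once both records share the same keys and the same collector value, a generic service needs no knowledge of which standard a record originally came from.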
Author ORCIDs:
Leeflang, Sam: 0000-0002-5669-2769
Addink, Wouter: 0000-0002-3090-1761
Copyright: 2023. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and conditions, you may use this content in accordance with the terms of the License.
DOI: 10.3897/biss.7.112137
References:
GBIF (2021). Darwin Core Archives – How-to Guide, version 2.2.
Jacobsen et al. (2020). doi:10.1162/dint_r_00024
Subject terms: Biodiversity; computer software; information science; Interoperability; lightning; Vocabularies & taxonomies