Datatractor: Metadata, automation, and registries for extractor interoperability in the chemical and materials sciences
| Published in: | MRS Bulletin, vol. 50, no. 7, pp. 838-845 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Warrendale: Springer Nature B.V., 1 July 2025 |
| Subjects: | |
| ISSN: | 0883-7694, 1938-1425 |
| Summary: | Two key issues hindering the transition toward findable, accessible, interoperable, and reusable (FAIR) data science are the poor discoverability of data extractor tools and the inconsistent instructions for their use (i.e., how we go from raw data files created by instruments to accessible metadata and scientific insight). If existing format conversion tools are hard to find, install, and use, they will inevitably be reimplemented, duplicating effort and increasing the associated maintenance burden. The Datatractor framework presented in this article addresses these issues. First, a curated registry of such extractor tools increases their discoverability. Second, describing the tools with a standardized but lightweight schema makes their installation and use machine-actionable. Finally, we provide a reference implementation for such data extraction. The Datatractor framework can be used to provide a public-facing data extraction service, or be incorporated into other research data management tools to provide added value. Impact statement: The ongoing digitalization of materials research necessitates investment in bespoke research data management platforms. A common pain point across such infrastructures is the ingestion and extraction of data and metadata from raw data files. Often these files require significant effort to ingest in a way consistent with the FAIR data principles, such that definitions are clear and the resulting data are extracted reliably and repeatedly. As research laboratories increasingly move toward coupled automated systems, handling files exported by different instruments can introduce significant friction to the data-to-insight pipeline. This paper presents a conceptual framework and prototype implementation of Datatractor, which aims to bring together community efforts on open-source software tools that perform such extractions and to disseminate those tools so that new platforms can use them in an automated fashion. This can reduce duplication of effort by broadly advertising existing and tested tools, communalize the maintenance and development burden of such tools, and lower the barrier to the creation of materials data infrastructure. By building atop community consensus formed in a Materials Research Data Alliance (MaRDA) Working Group, it is hoped that this framework can grow to propagate best practices in research data management and help to foster a rich open data ecosystem in materials science and beyond. |
| DOI: | 10.1557/s43577-025-00925-8 |