Sequence database versioning for command line and Galaxy bioinformatics servers

Bibliographic details
Published in:	Bioinformatics Vol. 32; No. 8; pp. 1275-1277
Main authors:	Dooley, Damion M.; Petkau, Aaron J.; Van Domselaar, Gary; Hsiao, William W.L.
Format:	Journal Article
Language:	English
Published:	England: Oxford University Press, 15 April 2016
Subjects:	Applications Notes; Bioinformatics; Computational Biology; Computer programs; Consumption; Databases, Nucleic Acid; Disks; Galaxies; Software; Source code; User-Computer Interface
ISSN:	1367-4803, 1367-4811, 1460-2059

Description
Summary:	Motivation: There are various reasons for rerunning bioinformatics tools and pipelines on sequencing data, including reproducing a past result, validating a new tool or workflow using a known dataset, or tracking the impact of database changes. For identical results to be achieved, regularly updated reference sequence databases must be versioned and archived. Database administrators have tried to meet these requirements by supplying users with one-off versions of databases, but these are time consuming to set up and are inconsistent across resources. Disk storage and data backup performance have also discouraged maintaining multiple versions of databases, since databases such as NCBI nr can consume 50 GB or more of disk space per version, with growth rates that parallel Moore's law. Results: Our end-to-end solution combines our own Kipper software package (a simple key-value large file versioning system) with BioMAJ (software for downloading sequence databases) and Galaxy (a web-based bioinformatics data processing platform). Available versions of databases can be recalled and used by command-line and Galaxy users. The Kipper data store format makes publishing curated FASTA databases convenient, since in most cases it can store a range of versions in a file only marginally larger than the latest version. Availability and implementation: Kipper v1.0.0 and the Galaxy Versioned Data tool are written in Python and released as free and open source software available at https://github.com/Public-Health-Bioinformatics/kipper and https://github.com/Public-Health-Bioinformatics/versioned_data, respectively; detailed setup instructions can be found at https://github.com/Public-Health-Bioinformatics/versioned_data/blob/master/doc/setup.md Contact: Damion.Dooley@Bccdc.Ca or William.Hsiao@Bccdc.Ca Supplementary information: Supplementary data are available at Bioinformatics online.
Notes:	Associate Editor: Jonathan Wren
DOI:	10.1093/bioinformatics/btv724
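
The summary describes Kipper as a key-value versioning system that can hold a range of database versions in a file only marginally larger than the latest one. The Python sketch below illustrates one way such a store could work, and is not Kipper's actual file format or API: each key-value record is tagged with the version in which it appeared and, if applicable, the version in which it was removed. The names `Record`, `VersionedStore`, `commit` and `checkout` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """One key-value entry plus the version range in which it is valid."""
    key: str                       # e.g. a FASTA sequence identifier (assumption)
    value: str                     # e.g. the sequence itself (assumption)
    created: int                   # first version that contains this record
    deleted: Optional[int] = None  # first version that no longer contains it


class VersionedStore:
    """Hypothetical in-memory store; unchanged records are kept only once."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self.latest = 0

    def commit(self, snapshot: Dict[str, str]) -> int:
        """Record a full snapshot as the next version, storing only changes."""
        self.latest += 1
        version = self.latest
        live = {r.key: r for r in self.records if r.deleted is None}
        # Close out records that disappeared or whose value changed.
        for key, rec in live.items():
            if snapshot.get(key) != rec.value:
                rec.deleted = version
        # Append records that are new or carry a changed value.
        for key, value in snapshot.items():
            old = live.get(key)
            if old is None or old.value != value:
                self.records.append(Record(key, value, version))
        return version

    def checkout(self, version: int) -> Dict[str, str]:
        """Rebuild the snapshot that was committed as the given version."""
        return {
            r.key: r.value
            for r in self.records
            if r.created <= version and (r.deleted is None or r.deleted > version)
        }


store = VersionedStore()
v1 = store.commit({"seq1": "ACGT", "seq2": "GGCC"})
v2 = store.commit({"seq1": "ACGT", "seq2": "GGCA", "seq3": "TTAA"})
old_snapshot = store.checkout(v1)
new_snapshot = store.checkout(v2)
```

In this toy run the two snapshots hold five entries in total, but the store keeps four records because the unchanged `seq1` is stored once. When most entries persist between releases, the same effect would keep a multi-version archive close to the size of the newest version.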