CARE 2.0: reducing false-positive sequencing error corrections using machine learning

Background Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making...

Celý popis

Uloženo v:

Podrobná bibliografie
Vydáno v:	BMC bioinformatics Ročník 23; číslo 1; s. 227 - 17
Hlavní autoři:	Kallenborn, Felix, Cascitti, Julian, Schmidt, Bertil
Médium:	Journal Article
Jazyk:	angličtina
Vydáno:	London BioMed Central 13.06.2022 BioMed Central Ltd Springer Nature B.V BMC
Témata:	Algorithms Analysis Assembly Bioinformatics Biomedical and Life Sciences Candidates Computational Biology/Bioinformatics Computer Appl. in Life Sciences Construction Context Datasets Decision trees DNA sequencing Error correction Error correction & detection False positive reactions High-Throughput Nucleotide Sequencing - methods Humans Impact analysis Learning algorithms Life Sciences Machine Learning Methods Microarrays Next-generation sequencing Nucleotide sequence Nucleotide sequencing Pipelining (computers) Prevention Sequence Alignment Sequence Analysis, DNA - methods Software Statistical analysis Germany Next-generation sequencing Machine learning Error correction
ISSN:	1471-2105, 1471-2105
On-line přístup:	Získat plný text
Tagy:	Přidat tag Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!

Popis
Shrnutí:	Background Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k -mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. Results We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0’s hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k -mer analysis show the applicability of CARE 2.0 to real-world data. Conclusion False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve k -mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .
Bibliografie:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 content type line 23
ISSN:	1471-2105 1471-2105
DOI:	10.1186/s12859-022-04754-3