Addressing the dynamic nature of reference data: a new nucleotide database for robust metagenomic classification.
| Title: | Addressing the dynamic nature of reference data: a new nucleotide database for robust metagenomic classification. |
|---|---|
| Authors: | Martí JM; Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA., Kok CR; Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA., Thissen JB; Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA., Mulakken NJ; Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA., Avila-Herrera A; Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA., Jaing CJ; Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA., Allen JE; Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA., Be NA; Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA. |
| Source: | MSystems [mSystems] 2025 Apr 22; Vol. 10 (4), pp. e0123924. Date of Electronic Publication: 2025 Mar 20. |
| Publication Type: | Journal Article |
| Language: | English |
| Journal Info: | Publisher: American Society for Microbiology Country of Publication: United States NLM ID: 101680636 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 2379-5077 (Electronic) Linking ISSN: 23795077 NLM ISO Abbreviation: mSystems Subsets: MEDLINE |
| Imprint Name(s): | Original Publication: Washington, DC : American Society for Microbiology, [2016]- |
| MeSH Terms: | Metagenomics*/methods; Databases, Nucleic Acid*/standards; Metagenome; Humans |
| Abstract: | Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size (currently exceeding 10^12 nucleotides) and exponential growth pose significant challenges for researchers seeking to maintain current nt-based indices for metagenomic classification. Recognizing that no current nt-based indices exist for the widely used Centrifuge classifier, and the last public version currently available was released in 2018, we addressed this critical gap by leveraging advanced high-performance computing resources. We present new Centrifuge-compatible nt databases, meticulously constructed using a novel pipeline incorporating different quality control measures, including reference decontamination and filtering. These measures demonstrably reduce spurious classifications, as shown through our reanalysis of published metagenomic data where Plasmodium annotations were dramatically reduced using our decontaminated database, highlighting how database quality can significantly impact research conclusions. Through temporal comparisons, we also reveal how our approach minimizes inconsistencies in taxonomic assignments stemming from asynchronous updates between public sequence and taxonomy databases. These discrepancies are particularly evident in taxa such as Listeria monocytogenes and Naegleria fowleri, where classification accuracy varied significantly across database versions. These new databases, made available as pre-built Centrifuge indexes, respond to the need for an open, robust, nt-based pipeline for taxonomic classification in metagenomics. Applications such as environmental metagenomics, forensics, and clinical metagenomics, which require comprehensive taxonomic coverage, will benefit from this resource. Our work highlights the importance of treating reference databases as dynamic entities, subject to ongoing quality control and validation akin to software development best practices. This approach is crucial for ensuring accuracy and reliability of metagenomic analysis, especially as databases continue to expand in size and complexity. Importance: Accurately identifying the diverse microbes present in a sample, whether from the human gut, a soil sample, or a crime scene, is crucial for fields ranging from medicine to environmental science. Researchers rely on comprehensive DNA databases to match sequenced DNA fragments to known microbial species. However, the widely used NCBI nt database, while vast, poses significant challenges. Its massive size makes it difficult for many researchers to use effectively with taxonomic classifiers, and inconsistencies and contamination within the database can impact the accuracy of microbial identification. This work addresses these challenges by providing cleaned, updated, and validated nt-based databases specifically optimized for the widely used Centrifuge classification tool. This new resource demonstrably reduces errors and improves the reliability of microbial identification across diverse taxonomic groups. Moreover, by providing readily usable indexes, we overcome the size barrier, enabling researchers to leverage the full potential of the nt database for metagenomic analysis. Our findings underscore the need to treat reference databases as dynamic entities, emphasizing continuous quality control and versioning as essential practices for robust and reproducible metagenomics research. |
| Grant Information: | Lawrence Livermore National Laboratory, Laboratory Directed Research and Development Program |
| Contributed Indexing: | Keywords: Centrifuge; NCBI BLAST nt; Recentrifuge; high-performance computing; metagenomics; quality control; reference contamination; reference database; taxonomic classification |
| Entry Date(s): | Date Created: 20250320 Date Completed: 20250422 Latest Revision: 20250424 |
| Update Code: | 20260130 |
| PubMed Central ID: | PMC12013259 |
| DOI: | 10.1128/msystems.01239-24 |
| PMID: | 40111052 |
| Database: | MEDLINE |