KungFQ: A Simple and Powerful Approach to Compress fastq Files
Storing data derived from deep sequencing experiments has become pivotal, yet standard compression algorithms do not satisfactorily exploit its structure. A number of reference-based compression algorithms have been developed, but they are less adequate when approaching new species...
| Published in: | IEEE/ACM transactions on computational biology and bioinformatics Vol. 9; no. 6; pp. 1837 - 1842 |
|---|---|
| Main Authors: | Grassi, E.; Gregorio, F. D.; Molineris, I. |
| Format: | Journal Article |
| Language: | English |
| Published: | United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2012 |
| ISSN: | 1545-5963, 1557-9964 |
| Abstract | Storing data derived from deep sequencing experiments has become pivotal, yet standard compression algorithms do not satisfactorily exploit its structure. A number of reference-based compression algorithms have been developed, but they are less adequate for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized for further compression with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility of performing lossy compression, which loses some of the original information (IDs and/or qualities) but results in smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information, and of 5.37 to 8.77 when losing IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with that of known tools, usually obtaining higher compression levels. |
|---|---|
| Author | Gregorio, F. D.; Molineris, I.; Grassi, E. |
| Author_xml | 1. Grassi, E. (grassi.e@gmail.com), Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy; 2. Gregorio, F. D. (fog@dndg.it), DNDG srl, Turin, Italy; 3. Molineris, I. (ivan.molineris@unito.it), Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/23221092$$D View this record in MEDLINE/PubMed |
| CODEN | ITCBCY |
| CitedBy_id | crossref_primary_10_1109_TPDS_2017_2692782 crossref_primary_10_1007_s11227_016_1753_4 crossref_primary_10_1093_bioadv_vbaf128 crossref_primary_10_1186_1748_7188_8_25 crossref_primary_10_1371_journal_pone_0059190 crossref_primary_10_1016_j_procs_2014_05_369 |
| Cites_doi | 10.1093/bioinformatics/btp352 10.1093/bioinformatics/18.12.1696 10.1093/bioinformatics/btq346 10.1186/gb-2009-10-3-r25 10.1093/bioinformatics/btr014 10.1038/459927a 10.1093/nar/gkq1019 10.1186/1471-2105-11-514 10.4137/EBO.S6618 10.1007/978-3-642-12683-3_20 10.1101/gr.114819.110 10.1093/bioinformatics/btn582 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Nov/Dec 2012 |
| Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Nov/Dec 2012 |
| DOI | 10.1109/TCBB.2012.123 |
| Discipline | Biology |
| EISSN | 1557-9964 |
| EndPage | 1842 |
| ExternalDocumentID | 2838383331 23221092 10_1109_TCBB_2012_123 6313591 |
| Genre | orig-research Research Support, Non-U.S. Gov't Journal Article |
| ISICitedReferencesCount | 11 |
| ISSN | 1545-5963 1557-9964 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 6 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html |
| PMID | 23221092 |
| PageCount | 6 |
| PublicationCentury | 2000 |
| PublicationDate | 2012-11-01 |
| PublicationDateYYYYMMDD | 2012-11-01 |
| PublicationDate_xml | – month: 11 year: 2012 text: 2012-11-01 day: 01 |
| PublicationDecade | 2010 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States – name: New York |
| PublicationTitle | IEEE/ACM transactions on computational biology and bioinformatics |
| PublicationTitleAbbrev | TCBB |
| PublicationTitleAlternate | IEEE/ACM Trans Comput Biol Bioinform |
| PublicationYear | 2012 |
| Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 1837 |
| SubjectTerms | Algorithms; algorithms for data and knowledge management; Bioinformatics; Biology and genetics; Compression algorithms; Data compression; Data Compression - methods; Databases, Genetic; Decoding; Encoding; Genomics; Genomics - methods; High-Throughput Nucleotide Sequencing; Internet; Software |
| Title | KungFQ: A Simple and Powerful Approach to Compress fastq Files |
| URI | https://ieeexplore.ieee.org/document/6313591 https://www.ncbi.nlm.nih.gov/pubmed/23221092 https://www.proquest.com/docview/1237004637 https://www.proquest.com/docview/1237514951 https://www.proquest.com/docview/1257791090 |
| Volume | 9 |
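
The abstract mentions a single pass over the fastq file, a constant memory requirement, and an optional quality cutoff below which base calls are converted to N. The following is a minimal Python sketch of that masking idea only. It is not KungFQ's implementation, and it does not reproduce the tool's binary format. It assumes four-line fastq records and Phred+33 quality encoding, and the names `read_fastq`, `mask_low_quality`, and `masked_records` are hypothetical, chosen for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, sequence, qualities) one record at a time.

    Assumes strict four-line records; only one record is held in memory.
    """
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def mask_low_quality(seq: str, qual: str, cutoff: int) -> str:
    """Replace each base whose Phred score is below `cutoff` with 'N'."""
    return "".join(
        base if ord(q) - PHRED_OFFSET >= cutoff else "N"
        for base, q in zip(seq, qual)
    )


def masked_records(handle: TextIO, cutoff: int) -> Iterator[Tuple[str, str, str]]:
    """Stream records through the mask in a single pass over the input."""
    for read_id, seq, qual in read_fastq(handle):
        yield read_id, mask_low_quality(seq, qual, cutoff), qual


if __name__ == "__main__":
    sample = io.StringIO("@r1\nACGT\n+\nII#I\n")
    for record in masked_records(sample, cutoff=20):
        print(record)
```

In the sample record, the third base has quality character `#` (Phred score 2 under the assumed encoding), so it falls below the cutoff of 20 and is masked. Masking makes the sequence stream more repetitive, which is one plausible reason such a step can help a general-purpose compressor such as gzip or lzma.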