KungFQ: A Simple and Powerful Approach to Compress fastq Files

Bibliographic Details
Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, no. 6, pp. 1837-1842
Main Authors: Grassi, E., Gregorio, F. D., Molineris, I.
Format: Journal Article
Language: English
Published: United States IEEE 01.11.2012
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:
ISSN: 1545-5963, 1557-9964
Abstract Storing data from deep sequencing experiments has become pivotal, and standard compression algorithms do not satisfactorily exploit the structure of these data. A number of reference-based compression algorithms have been developed, but they are less suitable for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes the records in a binary format optimized for further compression with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq file only once and has a constant memory requirement. Moreover, we added the possibility of lossy compression, which discards some of the original information (IDs and/or qualities) but yields smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information, and 5.37 to 8.77 when discarding IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with that of known tools, usually obtaining higher compression levels.
Author Gregorio, F. D.
Molineris, I.
Grassi, E.
Author_xml – sequence: 1
  givenname: E.
  surname: Grassi
  fullname: Grassi, E.
  email: grassi.e@gmail.com
  organization: Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
– sequence: 2
  givenname: F. D.
  surname: Gregorio
  fullname: Gregorio, F. D.
  email: fog@dndg.it
  organization: DNDG srl, Turin, Italy
– sequence: 3
  givenname: I.
  surname: Molineris
  fullname: Molineris, I.
  email: ivan.molineris@unito.it
  organization: Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
BackLink https://www.ncbi.nlm.nih.gov/pubmed/23221092$$D View this record in MEDLINE/PubMed
CODEN ITCBCY
CitedBy_id crossref_primary_10_1109_TPDS_2017_2692782
crossref_primary_10_1007_s11227_016_1753_4
crossref_primary_10_1093_bioadv_vbaf128
crossref_primary_10_1186_1748_7188_8_25
crossref_primary_10_1371_journal_pone_0059190
crossref_primary_10_1016_j_procs_2014_05_369
Cites_doi 10.1093/bioinformatics/btp352
10.1093/bioinformatics/18.12.1696
10.1093/bioinformatics/btq346
10.1186/gb-2009-10-3-r25
10.1093/bioinformatics/btr014
10.1038/459927a
10.1093/nar/gkq1019
10.1186/1471-2105-11-514
10.4137/EBO.S6618
10.1007/978-3-642-12683-3_20
10.1101/gr.114819.110
10.1093/bioinformatics/btn582
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Nov/Dec 2012
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Nov/Dec 2012
DOI 10.1109/TCBB.2012.123
Discipline Biology
EISSN 1557-9964
EndPage 1842
ExternalDocumentID 2838383331
23221092
10_1109_TCBB_2012_123
6313591
Genre orig-research
Research Support, Non-U.S. Gov't
Journal Article
ISICitedReferencesCount 11
ISSN 1545-5963
1557-9964
IsPeerReviewed true
IsScholarly true
Issue 6
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
PMID 23221092
PageCount 6
PublicationCentury 2000
PublicationDate 2012-11-01
PublicationDateYYYYMMDD 2012-11-01
PublicationDate_xml – month: 11
  year: 2012
  text: 2012-11-01
  day: 01
PublicationDecade 2010
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: New York
PublicationTitle IEEE/ACM transactions on computational biology and bioinformatics
PublicationTitleAbbrev TCBB
PublicationTitleAlternate IEEE/ACM Trans Comput Biol Bioinform
PublicationYear 2012
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
StartPage 1837
SubjectTerms Algorithms
algorithms for data and knowledge management
Bioinformatics
Biology and genetics
Compression algorithms
Data compression
Data Compression - methods
Databases, Genetic
Decoding
Encoding
Genomics
Genomics - methods
High-Throughput Nucleotide Sequencing
Internet
Software
Title KungFQ: A Simple and Powerful Approach to Compress fastq Files
URI https://ieeexplore.ieee.org/document/6313591
https://www.ncbi.nlm.nih.gov/pubmed/23221092
https://www.proquest.com/docview/1237004637
https://www.proquest.com/docview/1237514951
https://www.proquest.com/docview/1257791090
Volume 9