AFRESh: an adaptive framework for compression of reads and assembled sequences with random access functionality
| Published in: | Bioinformatics (Oxford, England), Volume 33, Issue 10, pp. 1464-1472 |
|---|---|
| Main authors: | Paridaens, Tom; Van Wallendael, Glenn; De Neve, Wesley; Lambert, Peter |
| Format: | Journal Article |
| Language: | English |
| Published: | England, 15 May 2017 |
| Subjects: | |
| ISSN: | 1367-4803, 1367-4811 |
| Abstract | Motivation: The past decade has seen the introduction of new technologies that have steadily lowered the cost of genomic sequencing. The cost of sequencing is even dropping significantly faster than the cost of storage and transmission. This motivates a need for continuous improvement in genomic data compression, not only in effectiveness (compression rate), but also in functionality (e.g. random access), configurability (effectiveness versus complexity, coding tool set, …) and versatility (support for both sequenced reads and assembled sequences). In that regard, current approaches mostly do not support random access, so full files must be transmitted, and they are restricted to either read or sequence compression.
Results: We propose AFRESh, an adaptive framework for no-reference compression of genomic data with random access functionality, targeting the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a Context-Adaptive Binary Arithmetic Coding (CABAC) scheme, to compress raw genetic codes. To the best of our knowledge, our paper is the first to describe an effective implementation of CABAC outside of its original application. Applying CABAC improves the compression effectiveness by up to 19% for assembled sequences and up to 62% for reads. Applied to the genomic symbols of the MPEG genomic compression test set for reads, AFRESh achieves a compression gain of up to 51% compared to SCALCE, 42% compared to LFQC and 44% compared to ORCOM. Compared to generic compression approaches, it achieves a compression gain of up to 41% compared to GNU Gzip and 22% compared to 7-Zip at the Ultra setting. Additionally, when compressing assembled sequences of the Human Genome, it achieves a compression gain of up to 34% compared to GNU Gzip and 16% compared to 7-Zip at the Ultra setting.
Availability and implementation: A Windows executable version can be downloaded at https://github.com/tparidae/AFresh.
Contact: tom.paridaens@ugent.be |
|---|---|
| Author | Paridaens, Tom; De Neve, Wesley; Van Wallendael, Glenn; Lambert, Peter |
| Author_xml | 1. Paridaens, Tom; 2. Van Wallendael, Glenn; 3. De Neve, Wesley; 4. Lambert, Peter |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/28057687 (View this record in MEDLINE/PubMed) |
| ContentType | Journal Article |
| Copyright | The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com |
| DOI | 10.1093/bioinformatics/btx001 |
| Discipline | Biology |
| EISSN | 1367-4811 |
| EndPage | 1472 |
| ExternalDocumentID | 28057687 10_1093_bioinformatics_btx001 |
| Genre | Journal Article |
| ISICitedReferencesCount | 6 |
| ISSN | 1367-4803 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 10 |
| Language | English |
| License | https://academic.oup.com/journals/pages/about_us/legal/notices The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com |
| OpenAccessLink | https://academic.oup.com/bioinformatics/article-pdf/33/10/1464/25153204/btx001.pdf |
| PMID | 28057687 |
| PageCount | 9 |
| PublicationCentury | 2000 |
| PublicationDate | 2017-05-15 |
| PublicationPlace | England |
| PublicationTitle | Bioinformatics (Oxford, England) |
| PublicationTitleAlternate | Bioinformatics |
| PublicationYear | 2017 |
| StartPage | 1464 |
| SubjectTerms | Algorithms; Bacteria - genetics; Data Compression - methods; Genome; Genomics - methods; High-Throughput Nucleotide Sequencing - methods; Humans; Plants - genetics; Sequence Analysis, DNA - methods; Software |
| Title | AFRESh: an adaptive framework for compression of reads and assembled sequences with random access functionality |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/28057687 https://www.proquest.com/docview/1856589738 |
| Volume | 33 |
| linkProvider | Oxford University Press |
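
The abstract contrasts random access with having to transmit full files. As a rough illustration of that idea only, the sketch below compresses a sequence in independent fixed-size blocks so that a sub-range can be read by decoding just the overlapping blocks. This is not AFRESh's format or coder: `zlib` stands in for the prediction and CABAC stages, and the names `compress_blocks`, `read_range` and the `BLOCK_SIZE` value are hypothetical choices made for this example.

```python
import zlib

BLOCK_SIZE = 1024  # symbols per block; an arbitrary illustrative value (assumption)


def compress_blocks(sequence: str, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress each fixed-size block independently (zlib is a stand-in coder)."""
    return [
        zlib.compress(sequence[start:start + block_size].encode("ascii"))
        for start in range(0, len(sequence), block_size)
    ]


def read_range(blocks: list[bytes], start: int, end: int,
               block_size: int = BLOCK_SIZE) -> str:
    """Return sequence[start:end], decoding only the blocks that overlap the range."""
    if start >= end:
        return ""
    first = start // block_size       # index of the first block touched
    last = (end - 1) // block_size    # index of the last block touched
    decoded = "".join(
        zlib.decompress(block).decode("ascii")
        for block in blocks[first:last + 1]
    )
    offset = start - first * block_size  # position of `start` inside the decoded run
    return decoded[offset:offset + (end - start)]
```

Independent blocks usually cost some compression effectiveness, since each block starts without the context of its predecessors; the block size trades that loss against how little data must be decoded per query.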