AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data
| Published in: | Gigascience Vol. 12 |
|---|---|
| Main Authors: | Silva, Jorge M; Qi, Weihong; Pinho, Armando J; Pratas, Diogo |
| Format: | Journal Article |
| Language: | English |
| Published: | United States: Oxford University Press, 2023 |
| ISSN: | 2047-217X |
| Abstract |
Background
Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantially higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model’s ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances, namely local, medium, or distant associations.
Findings
This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference- and alignment-free and provides additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps in an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to provide, for the first time, a whole-chromosome low-complexity map of a recent complete human genome and of the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar.
Conclusions
The AlcoR method enables fast sequence characterization through data complexity analysis and is ideal for scenarios involving new or unknown sequences. AlcoR is implemented in C, uses multithreading to increase computational speed, is flexible for multiple applications, and has no external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor. |
|---|---|
| Author | Pinho, Armando J; Pratas, Diogo; Qi, Weihong; Silva, Jorge M |
| Author_xml | – sequence: 1 givenname: Jorge M orcidid: 0000-0002-6331-6091 surname: Silva fullname: Silva, Jorge M – sequence: 2 givenname: Weihong orcidid: 0000-0001-8581-908X surname: Qi fullname: Qi, Weihong – sequence: 3 givenname: Armando J orcidid: 0000-0002-9164-0016 surname: Pinho fullname: Pinho, Armando J – sequence: 4 givenname: Diogo orcidid: 0000-0003-1176-552X surname: Pratas fullname: Pratas, Diogo email: pratas@ua.pt |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/38091509$$D View this record in MEDLINE/PubMed |
| CitedBy_id | crossref_primary_10_1093_gigascience_giae086 crossref_primary_10_3390_ijms26020477 crossref_primary_10_1093_nar_gkae871 |
| ContentType | Journal Article |
| Copyright | The Author(s) 2023. Published by Oxford University Press GigaScience. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
| DOI | 10.1093/gigascience/giad101 |
| Discipline | Library & Information Science |
| EISSN | 2047-217X |
| ExternalDocumentID | PMC10716826 38091509 10_1093_gigascience_giad101 10.1093/gigascience/giad101 |
| Genre | Research Support, Non-U.S. Gov't Journal Article |
| GrantInformation_xml | – fundername: Finnish Computing Competence Infrastructure |
| ISICitedReferencesCount | 5 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | data compression; genomes; proteomes; simulation; low-complexity analysis; FASTA; alignment-free method |
| License | This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0 The Author(s) 2023. Published by Oxford University Press GigaScience. |
| ORCID | 0000-0002-9164-0016 0000-0003-1176-552X 0000-0002-6331-6091 0000-0001-8581-908X |
| OpenAccessLink | https://dx.doi.org/10.1093/gigascience/giad101 |
| PMID | 38091509 |
| PublicationDate | 2023 |
| PublicationPlace | United States |
| PublicationTitle | Gigascience |
| PublicationYear | 2023 |
| Publisher | Oxford University Press |
| SubjectTerms | Alignment; Amino acid sequence; Cassava; Chromosomes; Complexity; Computer Simulation; Cultivars; Data analysis; Diploids; Gene sequencing; Genome, Human; Genomic analysis; Haplotypes; Humans; Mapping; Nucleotide sequence; Proteomics; Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA - methods; Software; Source code; Technical Note; Visualization |
| Title | AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/38091509 https://www.proquest.com/docview/3130974246 https://www.proquest.com/docview/3168765477 https://www.proquest.com/docview/2902949776 https://pubmed.ncbi.nlm.nih.gov/PMC10716826 |
| Volume | 12 |
| Authors | Silva, Jorge M; Qi, Weihong; Pinho, Armando J; Pratas, Diogo |
| DOI | 10.1093/gigascience/giad101 |