AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data

Bibliographic Details
Published in: GigaScience, Vol. 12
Main Authors: Silva, Jorge M; Qi, Weihong; Pinho, Armando J; Pratas, Diogo
Format: Journal Article
Language: English
Published: United States: Oxford University Press, 2023
ISSN: 2047-217X
Abstract
Background: Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantial higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model’s ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances, namely, local, medium, or distant associations.
Findings: This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to unprecedentedly provide a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar.
Conclusions: The AlcoR method provides the ability of fast sequence characterization through data complexity analysis, ideally for scenarios entangling the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.
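Illustrative sketch (not part of the record, and not AlcoR's implementation): the abstract describes alignment-free segmentation driven by models with different memories. One generic way to picture that idea is an adaptive order-k context model that assigns each symbol an information cost in bits, followed by smoothing and thresholding. All function names, the Laplace smoothing, the window size, and the threshold below are assumptions chosen for illustration; a larger `order` only loosely corresponds to a model with longer memory.

```python
import math
from collections import defaultdict


def complexity_profile(seq, order=2, alphabet="ACGT", alpha=1.0):
    """Per-symbol cost in bits under an adaptive order-k context model (hypothetical helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    size = len(alphabet)
    profile = []
    for i, symbol in enumerate(seq):
        # The context is the (up to) `order` symbols preceding position i.
        context = seq[max(0, i - order):i]
        # Laplace-smoothed estimate of P(symbol | context).
        p = (counts[context][symbol] + alpha) / (totals[context] + alpha * size)
        profile.append(-math.log2(p))
        # Update counts only after scoring, so the model stays causal.
        counts[context][symbol] += 1
        totals[context] += 1
    return profile


def smooth(values, window):
    """Centered moving average; edges average over the neighbors that exist."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def low_complexity_regions(seq, order=2, window=11, threshold=1.0):
    """Half-open (start, end) intervals where the smoothed cost is below `threshold` bits."""
    smoothed = smooth(complexity_profile(seq, order), window)
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value < threshold and start is None:
            start = i
        elif value >= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(seq)))
    return regions
```

Under these assumptions, a long homopolymer run embedded in random DNA should come back as a single interval, because the model quickly learns to predict the repeated symbol and its cost drops well below the 2 bits per base expected for uniform random DNA. The interval boundaries lag the true run slightly, since the model needs a few symbols to adapt.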
Author Pinho, Armando J
Pratas, Diogo
Qi, Weihong
Silva, Jorge M
Author_xml – sequence: 1
  givenname: Jorge M
  orcidid: 0000-0002-6331-6091
  surname: Silva
  fullname: Silva, Jorge M
– sequence: 2
  givenname: Weihong
  orcidid: 0000-0001-8581-908X
  surname: Qi
  fullname: Qi, Weihong
– sequence: 3
  givenname: Armando J
  orcidid: 0000-0002-9164-0016
  surname: Pinho
  fullname: Pinho, Armando J
– sequence: 4
  givenname: Diogo
  orcidid: 0000-0003-1176-552X
  surname: Pratas
  fullname: Pratas, Diogo
  email: pratas@ua.pt
BackLink https://www.ncbi.nlm.nih.gov/pubmed/38091509 (View this record in MEDLINE/PubMed)
CitedBy_id crossref_primary_10_1093_gigascience_giae086
crossref_primary_10_3390_ijms26020477
crossref_primary_10_1093_nar_gkae871
ContentType Journal Article
Copyright The Author(s) 2023. Published by Oxford University Press GigaScience. 2023
DOI 10.1093/gigascience/giad101
DatabaseName Oxford Journals Open Access Collection
CrossRef
Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
ProQuest Computer Science Collection
ProQuest Health & Medical Complete (Alumni)
MEDLINE - Academic
PubMed Central (Full Participant titles)
DatabaseTitle CrossRef
MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
ProQuest Health & Medical Complete (Alumni)
ProQuest Computer Science Collection
MEDLINE - Academic
DatabaseTitleList ProQuest Health & Medical Complete (Alumni)
ProQuest Health & Medical Complete (Alumni)
MEDLINE - Academic

MEDLINE
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: TOX
  name: Oxford Journals Open Access Collection
  url: https://academic.oup.com/journals/
  sourceTypes: Publisher
– sequence: 3
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod fulltext_linktorsrc
Discipline Library & Information Science
EISSN 2047-217X
ExternalDocumentID PMC10716826
38091509
10_1093_gigascience_giad101
10.1093/gigascience/giad101
Genre Research Support, Non-U.S. Gov't
Journal Article
GrantInformation_xml – fundername: Finnish Computing Competence Infrastructure
ID FETCH-LOGICAL-c501t-e829beb66b5acacad63c2fbb1bfee5514a347f1b4fb469def1205018ccf994413
IEDL.DBID TOX
ISICitedReferencesCount 5
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001187364200001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 2047-217X
IngestDate Tue Sep 30 17:10:46 EDT 2025
Sun Sep 28 07:14:12 EDT 2025
Tue Oct 07 06:43:20 EDT 2025
Tue Oct 07 06:38:46 EDT 2025
Mon Jul 21 06:02:05 EDT 2025
Tue Nov 18 21:37:30 EST 2025
Sat Nov 29 03:15:51 EST 2025
Mon Dec 16 07:45:55 EST 2024
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords data compression
genomes
proteomes
simulation
low-complexity analysis
FASTA
alignment-free method
Language English
License This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
https://creativecommons.org/licenses/by/4.0
The Author(s) 2023. Published by Oxford University Press GigaScience.
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c501t-e829beb66b5acacad63c2fbb1bfee5514a347f1b4fb469def1205018ccf994413
ORCID 0000-0002-9164-0016
0000-0003-1176-552X
0000-0002-6331-6091
0000-0001-8581-908X
OpenAccessLink https://dx.doi.org/10.1093/gigascience/giad101
PMID 38091509
PQID 3130974246
PQPubID 2040230
ParticipantIDs pubmedcentral_primary_oai_pubmedcentral_nih_gov_10716826
proquest_miscellaneous_2902949776
proquest_journals_3168765477
proquest_journals_3130974246
pubmed_primary_38091509
crossref_citationtrail_10_1093_gigascience_giad101
crossref_primary_10_1093_gigascience_giad101
oup_primary_10_1093_gigascience_giad101
PublicationCentury 2000
PublicationDate 2023-00-00
PublicationDateYYYYMMDD 2023-01-01
PublicationDate_xml – year: 2023
  text: 2023-00-00
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: Oxford
PublicationTitle Gigascience
PublicationTitleAlternate Gigascience
PublicationYear 2023
Publisher Oxford University Press
Publisher_xml – sequence: 0
  name: Oxford University Press
– name: Oxford University Press
  publication-title: Bell Syst Tech J
  doi: 10.1002/j.1538-7305.1948.tb01338.x
– start-page: 214
  volume-title: Pac Symp Biocomput
  year: 1999
  ident: 2024111706383848100_bib85
  article-title: Effective query filtering for fast homology searching
– volume: 215
  start-page: 403
  issue: 3
  year: 1990
  ident: 2024111706383848100_bib96
  article-title: Basic local alignment search tool
  publication-title: J Mol Biol
  doi: 10.1016/S0022-2836(05)80360-2
– volume: 19
  start-page: 513
  issue: 4
  year: 2003
  ident: 2024111706383848100_bib105
  article-title: Alignment-free sequence comparison—a review
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btg005
– start-page: 97
  volume-title: Probabilistic Modeling and Machine Learning in Structural and Systems Biology, Workshop Proceedings
  year: 2006
  ident: 2024111706383848100_bib28
  article-title: Exploring long DNA sequences by information content
– volume: 33
  start-page: 245
  issue: 3
  year: 2011
  ident: 2024111706383848100_bib68
  article-title: A novel approach for compressing DNA sequences using semi-statistical compressor
  publication-title: Int J Comput Appl
– volume: 11
  year: 2022
  ident: 2024111706383848100_bib74
  article-title: MBGC: Multiple Bacteria Genome Compressor
  publication-title: Gigascience
  doi: 10.1093/gigascience/giab099
– volume: 17
  start-page: 149
  issue: 2
  year: 1993
  ident: 2024111706383848100_bib87
  article-title: Statistics of local complexity in amino acid sequences and sequence databases
  publication-title: Comput Chem
  doi: 10.1016/0097-8485(93)85006-X
– volume: 22
  start-page: 1
  issue: 1
  year: 2021
  ident: 2024111706383848100_bib118
  article-title: ViR: a tool to solve intrasample variability in the prediction of viral integration sites using whole genome sequencing data
  publication-title: BMC Bioinformatics
  doi: 10.1186/s12859-021-03980-5
– start-page: 265
  volume-title: 11th International Conference on Practical Applications of Computational Biology & Bioinformatics. Advances in Intelligent Systems and Computing
  year: 2017
  ident: 2024111706383848100_bib82
  article-title: Substitutional tolerant Markov models for relative compression of DNA sequences
– volume: 21
  start-page: 160
  issue: 2
  year: 2005
  ident: 2024111706383848100_bib88
  article-title: A new algorithm for detecting low-complexity regions in protein sequences
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/bth497
– volume: 51
  start-page: 3223
  issue: 7
  year: 2023
  ident: 2024111706383848100_bib121
  article-title: Unmasking the tissue-resident eukaryotic DNA virome in humans
  publication-title: Nucleic Acids Res
  doi: 10.1093/nar/gkad199
– volume: 585
  start-page: 79
  issue: 7823
  year: 2020
  ident: 2024111706383848100_bib1
  article-title: Telomere-to-telomere assembly of a complete human X chromosome
  publication-title: Nature
  doi: 10.1038/s41586-020-2547-7
– volume: 139
  start-page: 2001
  issue: 9
  year: 2016
  ident: 2024111706383848100_bib116
  article-title: Genomic characterization of viral integration sites in HPV-related cancers
  publication-title: Int J Cancer
  doi: 10.1002/ijc.30243
– year: 2023
  ident: 2024111706383848100_bib84
  article-title: AlcoR website
– volume: 177
  start-page: 270
  year: 2015
  ident: 2024111706383848100_bib23
  article-title: Fast and accurate measurement of entropy profiles of commercial lithium-ion cells
  publication-title: Electrochim Acta
  doi: 10.1016/j.electacta.2015.01.191
– volume: 40
  start-page: msad084
  issue: 4
  year: 2023
  ident: 2024111706383848100_bib41
  article-title: Low complexity regions in proteins and DNA are poorly correlated
  publication-title: Mol Biol Evol
  doi: 10.1093/molbev/msad084
– volume: 8
  start-page: 1
  issue: 1
  year: 2007
  ident: 2024111706383848100_bib50
  article-title: rMotifGen: random motif generator for DNA and protein sequences
  publication-title: BMC Bioinformatics
  doi: 10.1186/1471-2105-8-292
– volume: 9
  start-page: 1115
  issue: 4
  year: 2019
  ident: 2024111706383848100_bib10
  article-title: Genome-wide profiling of Epstein-Barr virus integration by targeted sequencing in Epstein-Barr virus associated malignancies
  publication-title: Theranostics
  doi: 10.7150/thno.29622
– volume: 6
  start-page: 913
  year: 2015
  ident: 2024111706383848100_bib104
  article-title: Atypical centromeres in plants—what they can tell us
  publication-title: Front Plant Sci
  doi: 10.3389/fpls.2015.00913
– volume: 13
  start-page: 36
  issue: 1
  year: 2012
  ident: 2024111706383848100_bib115
  article-title: Repetitive DNA and next-generation sequencing: computational challenges and solutions
  publication-title: Nat Rev Genet
  doi: 10.1038/nrg3117
– volume: 28
  start-page: 593
  issue: 4
  year: 2012
  ident: 2024111706383848100_bib56
  article-title: ART: a next-generation sequencing read simulator
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btr708
– volume: 21
  start-page: 1196
  issue: 12
  year: 2019
  ident: 2024111706383848100_bib19
  article-title: A survey on using Kolmogorov complexity in cybersecurity
  publication-title: Entropy
  doi: 10.3390/e21121196
– volume: 30
  start-page: 38
  issue: 1
  year: 2002
  ident: 2024111706383848100_bib4
  article-title: The Ensembl genome database project
  publication-title: Nucleic Acids Res
  doi: 10.1093/nar/30.1.38
– volume: 21
  start-page: 2004
  year: 2004
  ident: 2024111706383848100_bib63
  article-title: Grammar-based compression of DNA sequences
  publication-title: DIMACS Working Group Burrows-Wheeler Transform
– volume: 17
  start-page: 1
  issue: 1
  year: 2022
  ident: 2024111706383848100_bib93
  article-title: Fast characterization of segmental duplication structure in multiple genome assemblies
  publication-title: Algorithms Mol Biol
  doi: 10.1186/s13015-022-00210-2
– volume: 370
  start-page: 890
  issue: 9590
  year: 2007
  ident: 2024111706383848100_bib8
  article-title: Human papillomavirus and cervical cancer
  publication-title: Lancet
  doi: 10.1016/S0140-6736(07)61416-0
– volume: 9
  start-page: 1
  issue: 1
  year: 2018
  ident: 2024111706383848100_bib34
  article-title: mRNAs and lncRNAs intrinsically form secondary structures with short end-to-end distances
  publication-title: Nat Commun
  doi: 10.1038/s41467-018-06792-z
– volume: 8
  start-page: e79922
  issue: 11
  year: 2013
  ident: 2024111706383848100_bib14
  article-title: DNA sequences at a glance
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0079922
– volume: 537
  start-page: 320
  issue: 7620
  year: 2016
  ident: 2024111706383848100_bib45
  article-title: The coming of age of de novo protein design
  publication-title: Nature
  doi: 10.1038/nature19946
– start-page: 228
  volume-title: 11th International Conference on Practical Applications of Computational Biology and Bioinformatics
  year: 2017
  ident: 2024111706383848100_bib31
  article-title: On the role of inverted repeats in DNA sequence similarity
– start-page: 231
  volume-title: 2016 Data Compression Conference (DCC)
  year: 2016
  ident: 2024111706383848100_bib71
  article-title: Efficient compression of genomic sequences
  doi: 10.1109/DCC.2016.60
– ident: 2024111706383848100_bib122
  doi: 10.1101/2023.04.17.537157
– start-page: 2024
  volume-title: 2011 19th European Signal Processing Conference
  year: 2011
  ident: 2024111706383848100_bib30
  article-title: Symbolic to numerical conversion of DNA sequences using finite-context models
– volume: 6
  start-page: e21588
  issue: 6
  year: 2011
  ident: 2024111706383848100_bib70
  article-title: On the representability of complete genomes by multiple competing finite-context (Markov) models
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0021588
– volume: 9
  start-page: giaa048
  issue: 5
  year: 2020
  ident: 2024111706383848100_bib36
  article-title: Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements
  publication-title: Gigascience
  doi: 10.1093/gigascience/giaa048
– volume: 30
  start-page: 875
  issue: 6
  year: 1994
  ident: 2024111706383848100_bib61
  article-title: A new challenge for compression algorithms: genetic sequences
  publication-title: Inf Process Manag
  doi: 10.1016/0306-4573(94)90014-0
– volume: 376
  start-page: eabj6965
  issue: 6588
  year: 2022
  ident: 2024111706383848100_bib99
  article-title: Segmental duplications and their variation in a complete human genome
  publication-title: Science
  doi: 10.1126/science.abj6965
– volume: 1
  start-page: 1
  issue: 1
  year: 1965
  ident: 2024111706383848100_bib13
  article-title: Three approaches to the quantitative definition of information
  publication-title: Probl Inf Transm
– volume: 394
  start-page: 112127
  issue: 2
  year: 2020
  ident: 2024111706383848100_bib100
  article-title: Centromere studies in the era of ‘telomere-to-telomere’ genomics
  publication-title: Exp Cell Res
  doi: 10.1016/j.yexcr.2020.112127
– volume: 20
  start-page: 2088
  issue: 6
  year: 2019
  ident: 2024111706383848100_bib120
  article-title: Comprehensive comparative analysis of methods and software for identifying viral integrations
  publication-title: Brief Bioinform
  doi: 10.1093/bib/bby070
– start-page: 827
  volume-title: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2018)
  year: 2018
  ident: 2024111706383848100_bib18
  article-title: At the roots of dictionary compression: string attractors
  doi: 10.1145/3188745.3188814
– volume: 6
  start-page: eabd9230
  issue: 50
  year: 2020
  ident: 2024111706383848100_bib102
  article-title: Rapid and ongoing evolution of repetitive sequence structures in human centromeres
  publication-title: Sci Adv
  doi: 10.1126/sciadv.abd9230
– ident: 2024111706383848100_bib59
  article-title: RepeatMasker Open-4.0
– volume: 376
  start-page: eabl4178
  issue: 6588
  year: 2022
  ident: 2024111706383848100_bib101
  article-title: Complete genomic and epigenetic maps of human centromeres
  publication-title: Science
  doi: 10.1126/science.abl4178
– volume: 3
  start-page: 39
  issue: 1
  year: 2010
  ident: 2024111706383848100_bib66
  article-title: An efficient horizontal and vertical method for online DNA sequence compression
  publication-title: Int J Comput Appl
– volume: 16
  start-page: 1615
  issue: 12
  year: 2009
  ident: 2024111706383848100_bib106
  article-title: Alignment-free sequence comparison (I): statistics and power
  publication-title: J Comput Biol
  doi: 10.1089/cmb.2009.0198
– start-page: 23
  volume-title: International Workshop on DNA-Based Computers
  year: 2001
  ident: 2024111706383848100_bib48
  article-title: DNASequenceGenerator: a program for the construction of DNA sequences
– volume: 1
  start-page: 100069
  issue: 6
  year: 2021
  ident: 2024111706383848100_bib42
  article-title: Probe design for simultaneous, targeted capture of diverse metagenomic targets
  publication-title: Cell Rep Methods
  doi: 10.1016/j.crmeth.2021.100069
– volume: 8
  start-page: 1
  issue: 1
  year: 2007
  ident: 2024111706383848100_bib15
  article-title: Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment
  publication-title: BMC Bioinformatics
  doi: 10.1186/1471-2105-8-252
– volume: 8
  start-page: 1
  issue: 1
  year: 2007
  ident: 2024111706383848100_bib17
  article-title: Local Renyi entropic profiles of DNA sequences
  publication-title: BMC Bioinformatics
  doi: 10.1186/1471-2105-8-393
– volume: 49
  start-page: e33
  issue: 6
  year: 2021
  ident: 2024111706383848100_bib6
  article-title: SurVirus: a repeat-aware virus integration caller
  publication-title: Nucleic Acids Res
  doi: 10.1093/nar/gkaa1237
– volume: 13
  start-page: 131
  issue: 2
  year: 1997
  ident: 2024111706383848100_bib25
  article-title: Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/13.2.131
– start-page: 340
  volume-title: [Proceedings] DCC93: Data Compression Conference
  year: 1993
  ident: 2024111706383848100_bib60
  article-title: Compression of DNA sequences
  doi: 10.1109/DCC.1993.253115
– start-page: 117
  year: 2010
  ident: 2024111706383848100_bib80
  article-title: Finite-context models for DNA coding
  publication-title: Signal Process
– volume: 34
  start-page: 1397
  issue: 14
  year: 2004
  ident: 2024111706383848100_bib62
  article-title: A simple and fast DNA compressor
  publication-title: Software Pract Experience
  doi: 10.1002/spe.619
– volume: 35
  start-page: 3826
  issue: 19
  year: 2019
  ident: 2024111706383848100_bib72
  article-title: Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btz144
– volume: 6
  start-page: 873
  issue: 8
  year: 2015
  ident: 2024111706383848100_bib33
  article-title: Estimating diversity and entropy profiles via discovery rates of new species
  publication-title: Methods Ecol Evol
  doi: 10.1111/2041-210X.12349
– volume: 20
  start-page: 1
  issue: 1
  year: 2019
  ident: 2024111706383848100_bib112
  article-title: Improved metagenomic analysis with Kraken 2
  publication-title: Genome Biol
  doi: 10.1186/s13059-019-1891-0
– volume: 513
  start-page: 208
  year: 2018
  ident: 2024111706383848100_bib117
  article-title: Viral sequences in human cancer
  publication-title: Virology
  doi: 10.1016/j.virol.2017.10.017
– volume: 21
  start-page: 513
  issue: 5
  year: 2019
  ident: 2024111706383848100_bib20
  article-title: Mimicking anti-viruses with machine learning and entropy profiles
  publication-title: Entropy
  doi: 10.3390/e21050513
– start-page: 2209
  volume-title: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
  year: 2020
  ident: 2024111706383848100_bib37
  article-title: J2*: A new method for alignment-free sequence similarity measurement
  doi: 10.1109/BIBM49941.2020.9313338
– start-page: 137
  volume-title: Practical Applications of Computational Biology and Bioinformatics, 13th International Conference. PACBB 2019
  year: 2020
  ident: 2024111706383848100_bib79
  article-title: GeCo2: an optimized tool for lossless compression and analysis of DNA sequences
– volume: 23
  start-page: 1
  issue: 1
  year: 2022
  ident: 2024111706383848100_bib5
  article-title: Using synthetic chromosome controls to evaluate the sequencing of difficult regions within the human genome
  publication-title: Genome Biol
  doi: 10.1186/s13059-021-02579-6
– year: 2011
  ident: 2024111706383848100_bib91
  article-title: Handbook of Monte Carlo Methods, Wiley Series in Probability and Statistics
– volume: 39
  start-page: btad097
  issue: 3
  year: 2023
  ident: 2024111706383848100_bib75
  article-title: AGC: compact representation of assembled genomes with fast queries and updates
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btad097
– volume: 9
  start-page: 445
  issue: 9
  year: 2018
  ident: 2024111706383848100_bib113
  article-title: Metagenomic composition analysis of an ancient sequenced polar bear jawbone from Svalbard
  publication-title: Genes
  doi: 10.3390/genes9090445
– volume-title: International Conference on Learning Representations
  year: 2020
  ident: 2024111706383848100_bib53
  article-title: Model-based reinforcement learning for biological sequence design
– volume: 643
  start-page: 730
  issue: 2
  year: 2006
  ident: 2024111706383848100_bib22
  article-title: Entropy profiles in the cores of cooling flow clusters of galaxies
  publication-title: Astrophys J
  doi: 10.1086/503270
– volume: 376
  start-page: 44
  issue: 6588
  year: 2022
  ident: 2024111706383848100_bib98
  article-title: The complete sequence of a human genome
  publication-title: Science
  doi: 10.1126/science.abj6987
– volume: 18
  start-page: 679
  issue: 5
  year: 2002
  ident: 2024111706383848100_bib39
  article-title: Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/18.5.679
– volume: 23
  start-page: 275
  issue: 3–4
  year: 1999
  ident: 2024111706383848100_bib26
  article-title: Zones of low entropy in genomic sequences
  publication-title: Comput Chem
  doi: 10.1016/S0097-8485(99)00009-1
– volume: 376
  start-page: eabl3533
  issue: 6588
  year: 2022
  ident: 2024111706383848100_bib103
  article-title: A complete reference genome improves analysis of human genetic variation
  publication-title: Science
  doi: 10.1126/science.abl3533
– volume: 19
  start-page: 136
  issue: 1
  year: 2009
  ident: 2024111706383848100_bib51
  article-title: Fast and flexible simulation of DNA sequence data
  publication-title: Genome Res
  doi: 10.1101/gr.083634.108
– volume: 10
  start-page: giab005
  issue: 2
  year: 2021
  ident: 2024111706383848100_bib55
  article-title: MB-GAN: Microbiome Simulation via Generative Adversarial Network
  publication-title: Gigascience
  doi: 10.1093/gigascience/giab005
– year: 2008
  ident: 2024111706383848100_bib12
  article-title: An Introduction to Kolmogorov Complexity and Its Applications. Vol. 3
  doi: 10.1007/978-0-387-49820-1
– volume: 11
  start-page: giac028
  year: 2022
  ident: 2024111706383848100_bib3
  article-title: The haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar reveal novel pan-genome and allele-specific transcriptome features
  publication-title: Gigascience
  doi: 10.1093/gigascience/giac028
– volume: 19
  start-page: 687
  year: 2022
  ident: 2024111706383848100_bib114
  article-title: Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies
  publication-title: Nat Methods
  doi: 10.1038/s41592-022-01440-3
– volume: 12
  start-page: e1611
  issue: 2
  year: 2021
  ident: 2024111706383848100_bib35
  article-title: Making ends meet: new functions of mRNA secondary structure
  publication-title: Wiley Interdiscip Rev RNA
  doi: 10.1002/wrna.1611
– volume: 22
  start-page: 575
  issue: 5
  year: 2020
  ident: 2024111706383848100_bib21
  article-title: Detecting malware with information complexity
  publication-title: Entropy
  doi: 10.3390/e22050575
– year: 2023
  ident: 2024111706383848100_bib83
  article-title: AlcoR Code Repository
– start-page: 125
  volume-title: 2011 IEEE Statistical Signal Processing Workshop (SSP)
  year: 2011
  ident: 2024111706383848100_bib69
  article-title: Bacteria DNA sequence compression using a mixture of finite-context models
  doi: 10.1109/SSP.2011.5967637
– volume: 36
  start-page: 4675
  issue: 18
  year: 2020
  ident: 2024111706383848100_bib73
  article-title: Allowing mutations in maximal matches boosts genome compression performance
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btaa572
– volume: 16
  start-page: 687
  issue: 8
  year: 2019
  ident: 2024111706383848100_bib46
  article-title: Machine-learning-guided directed evolution for protein engineering
  publication-title: Nat Methods
  doi: 10.1038/s41592-019-0496-6
– volume: 53
  start-page: 2148
  issue: 11
  year: 2006
  ident: 2024111706383848100_bib32
  article-title: A three-state model for DNA protein-coding regions
  publication-title: IEEE Trans Biomed Eng
  doi: 10.1109/TBME.2006.879477
– volume: 23
  start-page: 3
  issue: 1
  year: 2005
  ident: 2024111706383848100_bib64
  article-title: An efficient normalized maximum likelihood algorithm for DNA sequence compression
  publication-title: ACM Trans Inform Syst
  doi: 10.1145/1055709.1055711
– volume: 10
  start-page: 1
  issue: 1
  year: 2009
  ident: 2024111706383848100_bib95
  article-title: BLAST+: architecture and applications
  publication-title: BMC Bioinformatics
  doi: 10.1186/1471-2105-10-421
– volume: 24
  start-page: 43
  issue: 1
  year: 2000
  ident: 2024111706383848100_bib27
  article-title: Sequence complexity for biological sequence analysis
  publication-title: Comput Chem
  doi: 10.1016/S0097-8485(00)80006-6
– volume: 37
  start-page: 1535
  issue: 5
  year: 2020
  ident: 2024111706383848100_bib44
  article-title: CellCoal: coalescent simulation of single-cell sequencing samples
  publication-title: Mol Biol Evol
  doi: 10.1093/molbev/msaa025
– volume: 118
  start-page: e2102842118
  issue: 47
  year: 2021
  ident: 2024111706383848100_bib94
  article-title: Quantitative assessment reveals the dominance of duplicated sequences in germline-derived extrachromosomal circular DNA
  publication-title: Proc Natl Acad Sci USA
  doi: 10.1073/pnas.2102842118
– volume: 8
  start-page: giy148
  issue: 3
  year: 2019
  ident: 2024111706383848100_bib111
  article-title: Prot-SpaM: Fast alignment-free phylogeny reconstruction based on whole-proteome sequences
  publication-title: Gigascience
  doi: 10.1093/gigascience/giy148
– volume: 2
  start-page: 395
  issue: 7
  year: 2002
  ident: 2024111706383848100_bib7
  article-title: Global control of hepatitis B virus infection
  publication-title: Lancet Infect Dis
  doi: 10.1016/S1473-3099(02)00315-8
– volume: 31
  start-page: 159
  issue: 1
  year: 2021
  ident: 2024111706383848100_bib89
  article-title: Accessing NCBI data using the NCBI sequence viewer and genome data viewer (GDV)
  publication-title: Genome Res
  doi: 10.1101/gr.266932.120
– volume: 604
  start-page: 437
  issue: 7906
  year: 2022
  ident: 2024111706383848100_bib2
  article-title: The Human Pangenome Project: a global resource to map genomic diversity
  publication-title: Nature
  doi: 10.1038/s41586-022-04601-8
– volume: 39
  start-page: msac087
  issue: 5
  year: 2022
  ident: 2024111706383848100_bib40
  article-title: Low complexity regions in mammalian proteins are associated with low protein abundance and high transcript abundance
  publication-title: Mol Biol Evol
  doi: 10.1093/molbev/msac087
– volume: 117
  start-page: 9451
  issue: 17
  year: 2020
  ident: 2024111706383848100_bib58
  article-title: RepeatModeler2 for automated genomic discovery of transposable element families
  publication-title: Proc Natl Acad Sci USA
  doi: 10.1073/pnas.1921046117
– year: 2018
  ident: 2024111706383848100_bib11
  article-title: Foundations of Info-metrics: Modeling, Inference, and Imperfect Information
– start-page: 133
  volume-title: 2011 IEEE Statistical Signal Processing Workshop (SSP)
  year: 2011
  ident: 2024111706383848100_bib52
  article-title: DNA synthetic sequences generation using multiple competing Markov models
  doi: 10.1109/SSP.2011.5967639
– volume: 23
  start-page: 530
  issue: 5
  year: 2021
  ident: 2024111706383848100_bib77
  article-title: AC2: an efficient protein sequence compression tool using artificial neural networks and cache-hash models
  publication-title: Entropy
  doi: 10.3390/e23050530
– volume: 17
  start-page: 1467
  issue: 11
  year: 2010
  ident: 2024111706383848100_bib107
  article-title: Alignment-free sequence comparison (II): theoretical power of comparison statistics
  publication-title: J Comput Biol
  doi: 10.1089/cmb.2010.0056
– volume: 12
  start-page: 100535
  year: 2020
  ident: 2024111706383848100_bib54
  article-title: GTO: a toolkit to unify pipelines in genomic and proteomic research
  publication-title: SoftwareX
  doi: 10.1016/j.softx.2020.100535
– volume: 34
  start-page: 2506
  issue: 14
  year: 2018
  ident: 2024111706383848100_bib57
  article-title: NGSphy: phylogenomic simulation of next-generation sequencing data
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/bty146
– volume: 22
  start-page: 1534
  issue: 12
  year: 2006
  ident: 2024111706383848100_bib49
  article-title: GenRGenS: software for generating random genomic sequences and structures
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btl113
– volume: 37
  start-page: 3115
  issue: 19
  year: 2021
  ident: 2024111706383848100_bib119
  article-title: VIRUSBreakend: viral integration recognition using single breakends
  publication-title: Bioinformatics
  doi: 10.1093/bioinformatics/btab343
– volume: 9
  start-page: giaa072
  issue: 7
  year: 2020
  ident: 2024111706383848100_bib78
  article-title: Sequence Compression Benchmark (SCB) database—a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences
  publication-title: Gigascience
  doi: 10.1093/gigascience/giaa072
– volume: 2
  start-page: 25
  issue: 3
  year: 2010
  ident: 2024111706383848100_bib67
  article-title: GENBIT Compress-Algorithm for repetitive and non repetitive DNA sequences
  publication-title: Int J Comput Sci Inf Technol
– start-page: 259
  volume-title: Iberian Conference on Pattern Recognition and Image Analysis
  year: 2017
  ident: 2024111706383848100_bib97
  article-title: On the approximation of the Kolmogorov complexity for DNA sequences
– start-page: 8
  volume-title: ISMB
  year: 1998
  ident: 2024111706383848100_bib24
  article-title: Compression of strings with approximate repeats
– volume: 13
  start-page: 1028
  issue: 5
  year: 2006
  ident: 2024111706383848100_bib86
  article-title: A fast and symmetric DUST implementation to mask low-complexity DNA sequences
  publication-title: J Comput Biol
  doi: 10.1089/cmb.2006.13.1028
– volume: 17
  start-page: 459
  issue: 8
  year: 2016
  ident: 2024111706383848100_bib43
  article-title: A comparison of tools for the simulation of genomic next-generation sequencing data
  publication-title: Nat Rev Genet
  doi: 10.1038/nrg.2016.57
– volume: 18
  start-page: 1
  issue: 1
  year: 2017
  ident: 2024111706383848100_bib108
  article-title: Alignment-free sequence comparison: benefits, applications, and tools
  publication-title: Genome Biol
  doi: 10.1186/s13059-017-1319-7
– volume: 65
  start-page: 18
  year: 2021
  ident: 2024111706383848100_bib47
  article-title: Protein sequence design with deep generative models
  publication-title: Curr Opin Chem Biol
  doi: 10.1016/j.cbpa.2021.04.004
– volume: 20
  start-page: 1
  issue: 1
  year: 2019
  ident: 2024111706383848100_bib109
  article-title: Benchmarking of alignment-free sequence comparison methods
  publication-title: Genome Biol
  doi: 10.1186/s13059-019-1755-7
Snippet Abstract Background Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain...
SourceID pubmedcentral
proquest
pubmed
crossref
oup
SourceType Open Access Repository
Aggregation Database
Index Database
Enrichment Source
Publisher
SubjectTerms Alignment
Amino acid sequence
Cassava
Chromosomes
Complexity
Computer Simulation
Cultivars
Data analysis
Diploids
Gene sequencing
Genome, Human
Genomic analysis
Haplotypes
Humans
Mapping
Nucleotide sequence
Proteomics
Repetitive Sequences, Nucleic Acid
Sequence Analysis, DNA - methods
Software
Source code
Technical Note
Visualization
Title AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data
URI https://www.ncbi.nlm.nih.gov/pubmed/38091509
https://www.proquest.com/docview/3130974246
https://www.proquest.com/docview/3168765477
https://www.proquest.com/docview/2902949776
https://pubmed.ncbi.nlm.nih.gov/PMC10716826
Volume 12
WOSCitedRecordID wos001187364200001
linkProvider Oxford University Press
DOI 10.1093/gigascience/giad101