Character sets of strings

Bibliographic details
Published in: Journal of discrete algorithms (Amsterdam, Netherlands), Volume 5, Issue 2, pp. 330-340
Main authors: Didier, Gilles; Schmidt, Thomas; Stoye, Jens; Tsur, Dekel
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 June 2007
Subjects: Character sets; Combinatorial algorithms on words; Comparative genomics; Fingerprints; Natural language processing
ISSN: 1570-8667, 1570-8675
Abstract Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains such as rule induction for natural language processing or comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, especially: (1) Output all the maximal locations of substrings having a given character set. (2) Output for each character set C occurring in a given string (or a given collection of strings) all the maximal locations of C. Denoting by n the total length of the considered string or collection of strings, we solve the first problem in Θ(n) time using Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n²) time using Θ(n) space. The second algorithm has Θ(n |Σ| log |Σ|) time and Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1–13].
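The two problems named in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm; it is a naive illustration under an assumed definition that the abstract does not spell out: a maximal location of C is taken to be an interval [i, j] whose substring has character set exactly C and which cannot be extended to the left or right without adding a symbol outside C. The function names `maximal_locations` and `all_maximal_locations` are hypothetical.

```python
from collections import defaultdict


def maximal_locations(s, charset):
    """Problem (1), assumed definition: 0-based inclusive (start, end)
    intervals whose symbols are exactly `charset` and whose neighbours
    (if any) lie outside `charset`. Single left-to-right scan."""
    target = frozenset(charset)
    result = []
    if not target:
        return result
    i, n = 0, len(s)
    while i < n:
        if s[i] not in target:
            i += 1
            continue
        # Extend a maximal run of symbols that all belong to the target set.
        j, seen = i, set()
        while j < n and s[j] in target:
            seen.add(s[j])
            j += 1
        # The run is a maximal location only if every target symbol occurs.
        if len(seen) == len(target):
            result.append((i, j - 1))
        i = j
    return result


def all_maximal_locations(s):
    """Problem (2), naive version: map each occurring character set to its
    maximal locations by trying every start position. Quadratic number of
    steps; it does not attempt the linear space bound from the abstract."""
    n = len(s)
    table = defaultdict(list)
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])
            # Left neighbour inside the set: [i, j] is extendable to the
            # left, and it stays extendable for every larger j.
            if i > 0 and s[i - 1] in seen:
                break
            # Right neighbour outside the set (or end of string): maximal.
            if j + 1 == n or s[j + 1] not in seen:
                table[frozenset(seen)].append((i, j))
    return dict(table)
```

For example, under this assumed definition `maximal_locations("abcabd", "ab")` yields the two intervals covering each occurrence of "ab", and `all_maximal_locations` groups the same intervals under the key `frozenset("ab")`.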
Author details
1. Didier, Gilles. Centro de Modelamiento Matematico, CNRS UMR 2071, Santiago de Chile, Chile
2. Schmidt, Thomas. International NRW Graduate School in Bioinformatics and Genome Research, Center of Biotechnology, Universität Bielefeld, 33594 Bielefeld, Germany
3. Stoye, Jens (stoye@techfak.uni-bielefeld.de). Technische Fakultät, Universität Bielefeld, 33594 Bielefeld, Germany
4. Tsur, Dekel. Computer Science Department, Ben-Gurion University, Israel
BackLink https://hal.science/hal-01614122 (record in HAL)
Cites_doi 10.1016/S0968-0004(98)01274-2
10.1007/s004539910014
10.1093/nar/30.10.2212
ContentType Journal Article
Copyright 2006 Elsevier B.V.
Distributed under a Creative Commons Attribution 4.0 International License
DOI 10.1016/j.jda.2006.03.021
DatabaseName ScienceDirect Open Access Titles
Elsevier:ScienceDirect:Open Access
CrossRef
Hyper Article en Ligne (HAL)
Discipline Computer Science
EISSN 1570-8675
EndPage 340
ExternalDocumentID oai:HAL:hal-01614122v1
10_1016_j_jda_2006_03_021
S1570866706000505
ISICitedReferencesCount 22
ISSN 1570-8667
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Issue 2
Keywords Comparative genomics
Natural language processing
Fingerprints
Combinatorial algorithms on words
Character sets
Language English
License http://www.elsevier.com/open-access/userlicense/1.0
https://www.elsevier.com/tdm/userlicense/1.0
https://www.elsevier.com/open-access/userlicense/1.0
Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0
ORCID 0000-0001-8587-4177
0000-0003-0596-9112
OpenAccessLink https://dx.doi.org/10.1016/j.jda.2006.03.021
PageCount 11
PublicationCentury 2000
PublicationDate 2007-06-01
PublicationDateYYYYMMDD 2007-06-01
PublicationDecade 2000
PublicationTitle Journal of discrete algorithms (Amsterdam, Netherlands)
PublicationYear 2007
Publisher Elsevier B.V
Elsevier
References
[bib001] Dandekar, Snel, Huynen, Bork (1998). Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324-328. doi:10.1016/S0968-0004(98)01274-2
[bib002] Rogozin, Makarova, Murvai, Czabarka, Wolf, Tatusov, Szekely, Koonin (2002). Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res. 30, 2212-2223. doi:10.1093/nar/30.10.2212
[bib003] Uno, Yagiura (2000). Fast algorithms to enumerate all common intervals of two permutations. Algorithmica 26, 290-309. doi:10.1007/s004539910014
[bib004] Heber, Stoye (2001). Finding all common intervals of k permutations. Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, vol. 2089, 207-218.
[bib005] Schmidt, Stoye (2004). Quadratic time algorithms for finding common intervals in two and more sequences. Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, vol. 3109, 347-358.
[bib006] Karlsson, Voutilainen, Heikkilä, Antilla (1995). Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text.
[bib007] Amir, Apostolico, Landau, Satta (2003). Efficient text fingerprinting via Parikh mapping. J. Discrete Algorithms 26, 1-13.
[bib008] Didier (2003). Common intervals of two sequences. Proceedings of the Third International Workshop on Algorithms in Bioinformatics, vol. 2812, 17-24.
[bib009] Bender, Farach-Colton (2000). The LCA problem revisited. Proceedings of the 4th Latin American Symposium on Theoretical Informatics, vol. 1776, 88-94.
StartPage 330
SubjectTerms Character sets
Combinatorial algorithms on words
Comparative genomics
Computer Science
Data Structures and Algorithms
Discrete Mathematics
Fingerprints
Natural language processing
Title Character sets of strings
URI https://dx.doi.org/10.1016/j.jda.2006.03.021
https://hal.science/hal-01614122
Volume 5