Character sets of strings
Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domai...
| Published in: | Journal of discrete algorithms (Amsterdam, Netherlands), Volume 5, Issue 2, pp. 330-340 |
|---|---|
| Main authors: | Didier, Gilles; Schmidt, Thomas; Stoye, Jens; Tsur, Dekel |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 2007-06-01 |
| ISSN: | 1570-8667, 1570-8675 |
| Abstract | Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains such as rule induction for natural language processing or comparative genomics. Several computational problems concerning the character sets of a string arise from these applications, especially: (1) Output all the maximal locations of substrings having a given character set. (2) Output for each character set C occurring in a given string (or a given collection of strings) all the maximal locations of C. Denoting by n the total length of the considered string or collection of strings, we solve the first problem in Θ(n) time using Θ(n) space. We present two algorithms solving the second problem. The first one runs in Θ(n²) time using Θ(n) space. The second algorithm has Θ(n \|Σ\| log \|Σ\|) time and Θ(n) space complexity and is an adaptation of an algorithm by Amir et al. [A. Amir, A. Apostolico, G.M. Landau, G. Satta, Efficient text fingerprinting via Parikh mapping, J. Discrete Algorithms 26 (2003) 1–13]. |
|---|---|
| Author | Didier, Gilles; Schmidt, Thomas; Stoye, Jens; Tsur, Dekel |
| Author_xml | 1. Didier, Gilles (Centro de Modelamiento Matematico CNRS UMR 2071 Santiago de Chile, Chile); 2. Schmidt, Thomas (International NRW Graduate School in Bioinformatics and Genome Research, Center of Biotechnology, Universität Bielefeld, 33594 Bielefeld, Germany); 3. Stoye, Jens (Technische Fakultät, Universität Bielefeld, 33594 Bielefeld, Germany; email: stoye@techfak.uni-bielefeld.de); 4. Tsur, Dekel (Computer Science Department, Ben-Gurion University, Israel) |
| ContentType | Journal Article |
| Copyright | 2006 Elsevier B.V. Distributed under a Creative Commons Attribution 4.0 International License |
| DOI | 10.1016/j.jda.2006.03.021 |
| Discipline | Computer Science |
| EISSN | 1570-8675 |
| EndPage | 340 |
| ISICitedReferencesCount | 22 |
| ISSN | 1570-8667 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | true |
| Issue | 2 |
| Keywords | Comparative genomics; Natural language processing; Fingerprints; Combinatorial algorithms on words; Character sets |
| Language | English |
| License | http://www.elsevier.com/open-access/userlicense/1.0 https://www.elsevier.com/tdm/userlicense/1.0 https://www.elsevier.com/open-access/userlicense/1.0 Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0 |
| ORCID | 0000-0001-8587-4177 0000-0003-0596-9112 |
| OpenAccessLink | https://dx.doi.org/10.1016/j.jda.2006.03.021 |
| PageCount | 11 |
| PublicationDate | 2007-06-01 |
| PublicationTitle | Journal of discrete algorithms (Amsterdam, Netherlands) |
| PublicationYear | 2007 |
| Publisher | Elsevier B.V. |
| References | [1] Dandekar, Snel, Huynen, Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23 (1998) 324–328. doi:10.1016/S0968-0004(98)01274-2; [2] Rogozin, Makarova, Murvai, Czabarka, Wolf, Tatusov, Szekely, Koonin. Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res. 30 (2002) 2212–2223. doi:10.1093/nar/30.10.2212; [3] Uno, Yagiura. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica 26 (2000) 290–309. doi:10.1007/s004539910014; [4] Heber, Stoye. Finding all common intervals of k permutations. Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, vol. 2089 (2001) 207–218; [5] Schmidt, Stoye. Quadratic time algorithms for finding common intervals in two and more sequences. Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, vol. 3109 (2004) 347–358; [6] Karlsson, Voutilainen, Heikkilä, Antilla. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text (1995); [7] Amir, Apostolico, Landau, Satta. Efficient text fingerprinting via Parikh mapping. J. Discrete Algorithms 26 (2003) 1–13; [8] Didier. Common intervals of two sequences. Proceedings of the Third International Workshop on Algorithms in Bioinformatics, vol. 2812 (2003) 17–24; [9] Bender, Farach-Colton. The LCA problem revisited. Proceedings of the 4th Latin American Symposium on Theoretical Informatics, vol. 1776 (2000) 88–94 |
| StartPage | 330 |
| SubjectTerms | Character sets; Combinatorial algorithms on words; Comparative genomics; Computer Science; Data Structures and Algorithms; Discrete Mathematics; Fingerprints; Natural language processing |
| Title | Character sets of strings |
| URI | https://dx.doi.org/10.1016/j.jda.2006.03.021 https://hal.science/hal-01614122 |
| Volume | 5 |
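
Problem (1) of the abstract, reporting all maximal locations of a given character set C in a string S, can be illustrated with a short sketch. It is not taken from the paper and is not claimed to be the authors' algorithm. It assumes that a location of C is an interval of S whose symbols are exactly those of C, and that a location is maximal when it cannot be extended to the left or to the right without leaving C. The function name `maximal_locations` and the 0-based inclusive index pairs are illustrative choices. Under these assumptions a single left-to-right scan suffices, which is in line with the linear time bound stated in the abstract (expected time, since Python sets are hash-based).

```python
def maximal_locations(s, charset):
    """Return 0-based inclusive (start, end) pairs of the maximal locations
    of `charset` in `s`, under the definition assumed in the text above."""
    target = set(charset)
    result = []
    start = None   # start index of the current run of symbols drawn from target
    seen = set()   # symbols of target encountered in the current run

    for i, ch in enumerate(s):
        if ch in target:
            if start is None:
                # A new run begins here.
                start = i
                seen = set()
            seen.add(ch)
        else:
            # A symbol outside target ends the current run, if there is one.
            if start is not None and len(seen) == len(target):
                result.append((start, i - 1))
            start = None

    # Close a run that extends to the end of the string.
    if start is not None and len(seen) == len(target):
        result.append((start, len(s) - 1))
    return result


if __name__ == "__main__":
    print(maximal_locations("abcabdab", "ab"))
```

Each run of consecutive symbols from C is bounded on both sides by a symbol outside C or by an end of the string, so a run that contains every symbol of C is exactly one maximal location. Runs that miss some symbol of C are discarded.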