Monotony of surprise and large-scale quest for unusual words
| Published in: | Journal of computational biology, Vol. 10, No. 3-4, p. 283 |
|---|---|
| Main authors: | Apostolico, Alberto; Bock, Mary Ellen; Lonardi, Stefano |
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 2003-01-01 |
| ISSN: | 1066-5277 |
| Abstract | The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous studies, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models. |
|---|---|
| Author | Apostolico, Alberto; Lonardi, Stefano; Bock, Mary Ellen |
| Author_xml | 1. Apostolico, Alberto (axa@cs.purdue.edu), Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA; 2. Bock, Mary Ellen; 3. Lonardi, Stefano |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/12935329 (View this record in MEDLINE/PubMed) |
| ContentType | Journal Article |
| DOI | 10.1089/10665270360688020 |
| Discipline | Biology; Mathematics |
| ExternalDocumentID | 12935329 |
| Genre | Journal Article |
| ISICitedReferencesCount | 37 |
| ISSN | 1066-5277 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 3-4 |
| Language | English |
| PMID | 12935329 |
| PublicationDate | 2003-01-01 |
| PublicationPlace | United States |
| PublicationTitle | Journal of computational biology |
| PublicationTitleAlternate | J Comput Biol |
| PublicationYear | 2003 |
| StartPage | 283 |
| SubjectTerms | Algorithms; Binomial Distribution; Computational Biology - methods; Data Interpretation, Statistical; Markov Chains; Sequence Analysis, DNA - methods |
| Title | Monotony of surprise and large-scale quest for unusual words |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/12935329 https://www.proquest.com/docview/73571184 |
| Volume | 10 |
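
The abstract refers to scores that measure how far a substring's observed count departs from its expected count, and to local monotonicities of those scores. As a rough illustration only, the sketch below counts substrings by brute force and scores them under an i.i.d. symbol model with the score (observed - expected) / sqrt(expected). The function names, the choice of score, and the i.i.d. model are assumptions made for this example; the quadratic counting here is not the linear-time data structure the paper describes.

```python
from collections import Counter
from math import prod, sqrt


def symbol_probs(text):
    """Estimate i.i.d. symbol probabilities from the text itself (an assumption)."""
    n = len(text)
    return {c: k / n for c, k in Counter(text).items()}


def count_words(text, max_len):
    """Count every substring of length 1..max_len by brute force."""
    counts = Counter()
    n = len(text)
    for m in range(1, max_len + 1):
        for i in range(n - m + 1):
            counts[text[i:i + m]] += 1
    return counts


def expected_count(word, n, probs):
    """Expected occurrences of `word` in a random text of length n (i.i.d. model)."""
    return (n - len(word) + 1) * prod(probs[c] for c in word)


def score(word, observed, n, probs):
    """Hypothetical surprise score: (observed - expected) / sqrt(expected)."""
    e = expected_count(word, n, probs)
    return (observed - e) / sqrt(e)


def surprising_words(text, max_len, threshold):
    """Return (word, score) pairs whose absolute score reaches the threshold."""
    n = len(text)
    probs = symbol_probs(text)
    counts = count_words(text, max_len)
    scored = ((w, score(w, f, n, probs)) for w, f in counts.items())
    return sorted((ws for ws in scored if abs(ws[1]) >= threshold),
                  key=lambda ws: -abs(ws[1]))
```

In the text `"abcabcabc"`, the words `"ab"` and `"abc"` occur equally often, but the longer word has the smaller expected count, so under this score it comes out as more surprising. That is the kind of local monotonicity that lets one representative word stand in for a group of words with the same count; the precise conditions for each score are the subject of the paper and are not reproduced here.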