Monotony of surprise and large-scale quest for unusual words

Bibliographic details
Published in: Journal of Computational Biology, Vol. 10, No. 3-4, p. 283
Main authors: Apostolico, Alberto; Bock, Mary Ellen; Lonardi, Stefano
Format: Journal Article
Language: English
Published: United States, 2003-01-01
ISSN:1066-5277
Abstract: The problem of characterizing and detecting recurrent sequence patterns, such as substrings or motifs, and related associations or rules is pursued for various purposes: to compress data, unveil structure, infer succinct descriptions, extract and classify features, and so on. In molecular biology, exceptionally frequent or rare words in biosequences have been implicated in various facets of biological function and structure. The discovery of such patterns, particularly on a massive scale, poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous work, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This analysis supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
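As a rough illustration of what a score measuring "departure from expectation" can look like, the sketch below counts every substring of a fixed length k and compares the observed count with the count expected under an i.i.d. symbol model estimated from the sequence itself. The function name `word_scores`, the choice of model, and the particular score (observed minus expected, divided by the square root of expected) are assumptions made for this example; they are not taken from the paper, which studies a broader family of scores and models.

```python
from collections import Counter
from math import sqrt


def word_scores(text, k):
    """Score each length-k substring by (observed - expected) / sqrt(expected).

    Hypothetical helper: assumes symbols are independent and identically
    distributed, with probabilities estimated from `text` itself.
    """
    n = len(text)
    positions = n - k + 1  # number of starting positions for a length-k word
    if k <= 0 or positions <= 0:
        return {}

    # Empirical symbol probabilities.
    symbol_counts = Counter(text)
    prob = {c: symbol_counts[c] / n for c in symbol_counts}

    # Observed number of (possibly overlapping) occurrences of each word.
    observed = Counter(text[i:i + k] for i in range(positions))

    scores = {}
    for word, count in observed.items():
        p = 1.0
        for c in word:
            p *= prob[c]
        expected = positions * p
        scores[word] = (count - expected) / sqrt(expected)
    return scores
```

Calling `word_scores(sequence, k)` returns a dictionary mapping each observed word to its score; large positive values suggest over-representation under the assumed model. This naive enumeration handles one word length at a time and does not attempt the linear time and space bounds mentioned in the abstract.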
Author details:
1. Apostolico, Alberto (axa@cs.purdue.edu), Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA
2. Bock, Mary Ellen
3. Lonardi, Stefano
DOI 10.1089/10665270360688020
PMID 12935329
Subject terms: Algorithms; Binomial Distribution; Computational Biology - methods; Data Interpretation, Statistical; Markov Chains; Sequence Analysis, DNA - methods
URI https://www.ncbi.nlm.nih.gov/pubmed/12935329
https://www.proquest.com/docview/73571184