Hashing algorithms and data structures for rapid searches of fingerprint vectors

Bibliographic details
Published in:	Journal of Chemical Information and Modeling, Volume 50, Issue 8, p. 1358
Main authors:	Nasr, Ramzi; Hirschberg, Daniel S; Baldi, Pierre
Format:	Journal Article
Language:	English
Publication details:	United States, 23 August 2010
Subjects:	Algorithms; Chemistry - methods; Computer Simulation; Databases, Factual; Informatics - economics; Informatics - methods; Models, Chemical; Molecular Structure
ISSN:	1549-960X

Description
Summary:	In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence of particular functional groups or combinatorial features. To speed up database searches, we propose to add to each fingerprint a short signature integer vector of length M. For a given fingerprint, the i-th component of the signature vector counts the number of 1-bits in the fingerprint that fall on components congruent to i modulo M. Given two signatures, we show how one can rapidly compute a bound on the Jaccard-Tanimoto similarity measure of the two corresponding fingerprints, using the intersection bound. Thus, these signatures allow one to significantly prune the search space by discarding molecules associated with unfavorable bounds. Analytical methods are developed to predict the resulting amount of pruning as a function of M. Data structures combining different values of M are also developed, together with methods for predicting the optimal values of M for a given implementation. Simulations using a particular implementation show that the proposed approach leads to an order-of-magnitude speedup over a linear search and a threefold speedup over a previous implementation. All theoretical results and predictions are corroborated by large-scale simulations using molecules from the ChemDB. Several possible algorithmic extensions are discussed.
DOI:	10.1021/ci100132g
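
The signature scheme described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`signature`, `tanimoto_upper_bound`, `search`), the representation of a fingerprint as a set of 1-bit positions, and the similarity threshold are all assumptions made for illustration. The bound used here relies only on the facts that the intersection of two fingerprints cannot exceed the sum over i of the smaller of the two signature counts, and that the Jaccard-Tanimoto similarity grows with the intersection size when the two bit counts are fixed.

```python
from typing import List, Set, Tuple


def signature(on_bits: Set[int], m: int) -> List[int]:
    """Count the 1-bits whose position is congruent to i modulo m, for each i."""
    sig = [0] * m
    for position in on_bits:
        sig[position % m] += 1
    return sig


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard-Tanimoto similarity of two fingerprints given as sets of 1-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tanimoto_upper_bound(sig_a: List[int], sig_b: List[int]) -> float:
    """Upper bound on the similarity computed from the two signatures alone."""
    # Bits shared by both fingerprints in residue class i cannot exceed
    # the smaller of the two counts for that class.
    inter_max = sum(min(x, y) for x, y in zip(sig_a, sig_b))
    # |A| + |B| - |A and B| is smallest when the intersection is largest.
    union_min = sum(sig_a) + sum(sig_b) - inter_max
    return inter_max / union_min if union_min else 1.0


def search(query: Set[int], database: List[Set[int]], m: int,
           threshold: float) -> List[Tuple[int, float]]:
    """Return (index, similarity) for entries at or above the threshold,
    skipping the exact comparison whenever the signature bound rules it out."""
    # In a real system the signatures would be stored alongside the
    # fingerprints; they are computed here only to keep the sketch short.
    signatures = [signature(fp, m) for fp in database]
    query_sig = signature(query, m)
    hits = []
    for index, fp in enumerate(database):
        if tanimoto_upper_bound(query_sig, signatures[index]) < threshold:
            continue  # pruned: the bound shows this entry cannot reach the threshold
        similarity = tanimoto(query, fp)
        if similarity >= threshold:
            hits.append((index, similarity))
    return hits
```

With M = 4, the fingerprints {0, 1, 5, 8} and {1, 5, 9, 12} have signatures [2, 2, 0, 0] and [1, 3, 0, 0]; the bound is 3 / (4 + 4 - 3) = 0.6, while the exact similarity is 2 / 6. A larger M gives tighter bounds at the cost of longer signatures, which is the trade-off the summary says the paper analyzes.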