CINF 88
One of the most fundamental tasks in bioinformatics and chemoinformatics is to search large databases for molecules that are “similar” to a given query or set of queries. In bioinformatics, BLAST has become one of the workhorses of modern biology, allowing biologists to search sequence databases and retrieve ranked lists of hits associated with significance scores (“e-values”). In chemoinformatics, similarity and search algorithms for small molecules have also been developed but, surprisingly, a theory of when a small-molecule hit is significant has not yet been developed. Here we develop and apply a theory of statistical significance for small-molecule similarity scores. As in the case of BLAST, significance is assessed against a random background model. Several tractable background models of randomness are introduced, and a statistical theory of Z-scores and Extreme Value Distributions is derived for similarity scores, such as Tanimoto scores, with specific implications for practical searches.
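As a rough illustration of the idea of scoring a hit against a random background, the sketch below computes a Tanimoto score between two fingerprints and an empirical Z-score for it. It is not the method of the abstract, which derives the statistics analytically: here the background is simply sampled by Monte Carlo. The background model (each bit switched on independently with probability `p_on`), the fingerprint length `n_bits`, and all function names are assumptions made for this example.

```python
import random
import statistics


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def random_fingerprint(n_bits, p_on, rng):
    """Assumed background model: each bit is on independently with prob. p_on."""
    return frozenset(i for i in range(n_bits) if rng.random() < p_on)


def background_z_score(query, hit, n_bits, p_on, n_samples=2000, seed=0):
    """Return (observed Tanimoto score, empirical Z-score against the background)."""
    rng = random.Random(seed)
    # Score the query against many random fingerprints to sample the background.
    scores = [
        tanimoto(query, random_fingerprint(n_bits, p_on, rng))
        for _ in range(n_samples)
    ]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    observed = tanimoto(query, hit)
    return observed, (observed - mu) / sigma


if __name__ == "__main__":
    query = frozenset(range(0, 50))
    hit = frozenset(range(5, 55))
    score, z = background_z_score(query, hit, n_bits=512, p_on=0.1)
    print(f"Tanimoto = {score:.3f}, Z = {z:.1f}")
```

A large positive Z-score indicates a hit that scores far above what the assumed random background produces. Ranking the best hit in a whole database would call for an extreme value treatment rather than a single Z-score, which this sketch does not attempt.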
Challenges in Structure Searching
9:00 AM-11:30 AM, Wednesday, April 9, 2008 Marriott Convention Center -- Blaine Kern C, Oral
Division of Chemical Information