Theory and practice of statistical significance for molecular similarity scores: When is a similarity score “significant”?

CINF 88

Pierre Baldi, pfbaldi@uci.edu, Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, Irvine, CA 92697 and Ryan W. Benz, rbenz@uci.edu, School of Information and Computer Sciences, University of California, Irvine, ORU Genomics and Bioinformatics, Irvine, CA 92697-3445.
One of the most fundamental tasks in bioinformatics or chemoinformatics is to search large databases for molecules that are “similar” to a given query, or set of queries. In bioinformatics, BLAST has become one of the workhorses of modern biology, allowing biologists to search sequence databases and retrieve ranked lists of hits associated with significance scores (“e-values”). In chemoinformatics, similarity measures and search algorithms for small molecules have also been developed but, surprisingly, a theory of when a small-molecule hit is significant has not yet been developed. Here we develop and apply a theory of statistical significance for small-molecule similarity scores. As in the case of BLAST, significance is assessed against a random background model. Several tractable background models of randomness are introduced, and a statistical theory of Z-scores and Extreme Value Distributions is derived for similarity scores, such as Tanimoto scores, with specific implications for practical searches.
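
As a rough illustration of the Z-score idea only, the sketch below scores a Tanimoto similarity against a simulated random background. It is not the authors' method: the background model here (each fingerprint bit switched on independently with probability p) is an assumption chosen for simplicity, and the function names, fingerprint length, and sample size are all hypothetical.

```python
import random
import statistics


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def random_fingerprint(n_bits, p, rng):
    """Assumed background model: each bit is on independently with probability p."""
    return frozenset(i for i in range(n_bits) if rng.random() < p)


def tanimoto_z_score(query, hit, n_bits=256, p=0.1, n_samples=2000, seed=0):
    """Z-score of tanimoto(query, hit) relative to scores of the query against random fingerprints."""
    rng = random.Random(seed)
    background = [
        tanimoto(query, random_fingerprint(n_bits, p, rng))
        for _ in range(n_samples)
    ]
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0.0:
        raise ValueError("background scores have zero variance")
    return (tanimoto(query, hit) - mu) / sigma


# Hypothetical fingerprints: 30 bits on each, 25 of them shared.
query = frozenset(range(0, 30))
hit = frozenset(range(5, 35))
z = tanimoto_z_score(query, hit)
```

A large positive Z-score means the observed similarity lies far above what the assumed random background produces. Ranking the best hit from a whole database would instead call for the distribution of the maximum score, which is where an extreme value distribution comes in.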
 

Challenges in Structure Searching
9:00 AM-11:30 AM, Wednesday, April 9, 2008 Marriott Convention Center -- Blaine Kern C, Oral

Division of Chemical Information

The 235th ACS National Meeting, New Orleans, LA, April 6-10, 2008