Discovery and applications of power-laws in organic chemistry

CINF 111

Ryan W. Benz (1), rbenz@uci.edu, S. Joshua Swamidass (2), and Pierre Baldi (2), pfbaldi@uci.edu. (1) School of Information and Computer Sciences, University of California, Irvine, ORU Genomics and Bioinformatics, Irvine, CA 92697-3445, (2) Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, Irvine, CA 92612
Chemical fingerprints are the basis for molecular similarity search methods used in most chemical database systems, including Chemical Abstracts Service, PubChem, and ChemDB. Understanding the statistical properties of these fingerprints is crucial for developing new and improved chemical search techniques. Through the study of fingerprints from large chemical databases, we have discovered that the frequencies of several combinatorially extracted fingerprint features, such as labeled paths and trees, follow power-law distributions. These power-laws can be used to generate more realistic probabilistic models for fingerprints. We have also found that the power-laws can be leveraged to produce highly efficient compression schemes for chemical fingerprints. These compression schemes losslessly encode fingerprints in approximately 300 bits, about one-third the size of typical lossy compressed fingerprints. Because the representation is lossless, the exact similarity scores between pairs of molecules can be computed from it, leading to improved recall of drug-like molecules in similarity searches.
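As an illustration of why a lossless representation matters for similarity scores, here is a minimal sketch. It is not the authors' method. It assumes the Tanimoto (Jaccard) coefficient as the similarity measure and modulo folding as the lossy compression, neither of which is named in the abstract. The fingerprints, feature indices, and function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of 'on' feature indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Two empty fingerprints are treated as identical by convention here.
    return len(a & b) / union if union else 1.0


def fold(indices, n_bits):
    """Lossy compression by modulo folding (an assumed scheme).

    Distinct feature indices can collide on the same bit, so the
    original fingerprint cannot be recovered from the folded one.
    """
    return {i % n_bits for i in indices}


# Hypothetical sparse fingerprints: indices of the features present.
mol_a = {3, 17, 1027, 5000}
mol_b = {3, 1041, 5000, 9999}

# Exact score, available when the full index sets are stored losslessly.
exact = tanimoto(mol_a, mol_b)

# Score computed after folding both fingerprints to 1024 bits.
folded = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))

print(f"exact:  {exact:.3f}")
print(f"folded: {folded:.3f}")
```

In this toy case the collisions introduced by folding inflate the score (0.750 folded versus 0.333 exact). Distortions of this kind affect how molecules are ranked in a similarity search.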
 

General Papers
9:00 AM-10:30 AM, Thursday, April 10, 2008 Marriott Convention Center -- Blaine Kern C, Oral

Division of Chemical Information

The 235th ACS National Meeting, New Orleans, LA, April 6-10, 2008