CINF 111
Chemical fingerprints are the basis for the molecular similarity search methods used in most chemical database systems, including Chemical Abstracts Service, PubMed, and ChemDB. Understanding the statistical properties of these fingerprints is crucial for developing new and improved chemical search techniques. By studying fingerprints from large chemical databases, we have discovered that several combinatorially extracted fingerprint features, such as labeled paths and trees, follow power-law distributions. These power laws can be used to build more realistic probabilistic models of fingerprints. We have also found that they can be leveraged to produce highly efficient compression schemes for chemical fingerprints. These schemes losslessly encode fingerprints in approximately 300 bits, about one-third the size of typical lossy compressed fingerprints. Because the representations are lossless, the exact similarity scores between pairs of molecules can be computed, which improves the recall of drug-like molecules in similarity searches.
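The contrast between lossy and lossless fingerprint compression can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method of the abstract: fingerprints are modeled as sets of on-bit indices, lossy compression is modeled as modulo folding, and the lossless code is an Elias gamma code over the gaps between sorted indices. The function names (`tanimoto`, `fold`, `encode_gaps`, `decode_gaps`) are hypothetical.

```python
from typing import Iterable, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def fold(bits: Set[int], width: int) -> Set[int]:
    """Lossy compression (assumed here): modulo folding, where indices may collide."""
    return {i % width for i in bits}


def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_gaps(bits: Iterable[int]) -> str:
    """Losslessly encode on-bit indices as gamma-coded gaps between sorted indices."""
    out: List[str] = []
    previous = -1
    for index in sorted(set(bits)):
        out.append(elias_gamma(index - previous))  # each gap is at least 1
        previous = index
    return "".join(out)


def decode_gaps(code: str) -> Set[int]:
    """Invert encode_gaps and recover the exact set of on-bit indices."""
    bits: Set[int] = set()
    position, previous = 0, -1
    while position < len(code):
        zeros = 0
        while code[position] == "0":  # count the unary length prefix
            zeros += 1
            position += 1
        gap = int(code[position:position + zeros + 1], 2)
        position += zeros + 1
        previous += gap
        bits.add(previous)
    return bits


# Two toy fingerprints (made-up indices, not real molecules).
mol_a = {3, 1027, 5000}
mol_b = {3, 2051, 5000}

exact = tanimoto(mol_a, mol_b)
folded = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))
restored = tanimoto(decode_gaps(encode_gaps(mol_a)), decode_gaps(encode_gaps(mol_b)))
```

In this toy case the folded fingerprints collide and overstate the similarity, while the decoded lossless encoding reproduces the exact score. A gap code of this kind is short when the gaps are mostly small, which is the sort of skew a power-law feature distribution could provide.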
General Papers
9:00 AM-10:30 AM, Thursday, April 10, 2008, Marriott Convention Center -- Blaine Kern C, Oral
Division of Chemical Information