An address book for chemical space: The Chemical Structure Lookup Service (CSLS)

CINF 84

Markus Sitzmann (1), sitzmann@helix.nih.gov, Igor V. Filippov (2), Wolf-Dietrich Ihlenfeldt (3), and Marc C. Nicklaus (1), mn1@helix.nih.gov. (1) Laboratory of Medicinal Chemistry, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, Frederick, MD 21702, (2) Laboratory of Medicinal Chemistry, SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD 21702, (3) Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
We give an overview of our recent work on the Chemical Structure Lookup Service (CSLS). At the time of this writing, the service comprises a collection of approximately 80 chemical structure databases from commercial and public sources, indexes approximately 40 million molecules representing approximately 27 million unique chemical structures, and continues to grow. We focus on our procedure for normalizing chemical structures, a crucial step in processing chemical databases that come from different sources. Normalization is needed to find a canonical representation of a chemical; without it, matching records might be missed because the same compound is encoded differently, owing either to certain chemical features (e.g. different tautomers or different resonance structures) or to ill-defined parts of the structure (e.g. misdrawn functional groups, missing hydrogen atoms, missing charges, or incorrect valences). This structure normalization is performed on every incoming structure set, whether it is to be registered in CSLS or used to search it. We also discuss our structure-based hashcode identifiers, which can be calculated for any small molecule. They are specifically designed to enable fine-tunable yet rapid compound identification even in very large datasets. They can be set to be sensitive to a variety of chemical features, such as tautomerism, different resonance structures drawn for a charged species, and the presence or absence of certain fragments like counterions. One such identifier, called FICuS, is one of the crucial mechanisms for identification and lookup of chemicals in CSLS, enabling CSLS to function essentially as an “address book” for any small molecule. FICuS and the other identifiers do not, however, depend on the infrastructure of this service. CSLS is freely available at http://cactus.nci.nih.gov/lookup. The service recognizes over 20 chemical structure representation formats as input, including SD files, SMILES strings, InChI identifiers, and FICuS hashcodes.
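The idea of a tunable, hash-based "address book" can be illustrated with a small Python sketch. This is not the FICuS algorithm or the CSLS implementation: `toy_identifier`, `build_index`, the database names, and the record IDs are all hypothetical. The sketch assumes the input SMILES strings are already canonical, and it crudely treats the longest fragment as the parent compound. Real normalization of tautomers, resonance forms, and valences requires a cheminformatics toolkit.

```python
import hashlib
from collections import defaultdict


def toy_identifier(smiles: str, ignore_counterions: bool = True) -> str:
    """Toy stand-in for a structure-based hashcode (NOT the FICuS algorithm).

    Assumes `smiles` is already a canonical SMILES string.
    """
    # '.' separates disconnected fragments in a SMILES string.
    fragments = smiles.split(".")
    if ignore_counterions:
        # Crude assumption: the longest fragment string is the parent compound.
        fragments = [max(fragments, key=len)]
    # Sort so that fragment order in the input does not change the identifier.
    key = ".".join(sorted(fragments))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16].upper()


def build_index(records, ignore_counterions: bool = True):
    """Map each identifier to the (database, record_id) pairs that share it."""
    index = defaultdict(list)
    for database, record_id, smiles in records:
        identifier = toy_identifier(smiles, ignore_counterions)
        index[identifier].append((database, record_id))
    return index


# Hypothetical records: sodium acetate written two ways, plus the bare anion.
records = [
    ("db_a", "A-1", "CC(=O)[O-].[Na+]"),
    ("db_b", "B-7", "[Na+].CC(=O)[O-]"),
    ("db_c", "C-3", "CC(=O)[O-]"),
]

loose = build_index(records, ignore_counterions=True)    # counterions ignored
strict = build_index(records, ignore_counterions=False)  # counterions significant
```

With counterions ignored, all three records fall under one identifier. With counterions treated as significant, the two salt records still match each other but no longer match the bare anion. This mirrors, in a very simplified way, the adjustable sensitivity described above.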
 

Sci-Mix
8:00 PM-10:00 PM, Monday, August 20, 2007 BCEC -- Exhibit Hall - B2, Sci-Mix

General Papers
1:30 PM-8:25 PM, Thursday, August 23, 2007 BCEC -- 252 A, Oral

Division of Chemical Information

The 234th ACS National Meeting, Boston, MA, August 19-23, 2007