Confusing descriptors: Where chemical information gets dizzy

COMP 45

Cristian G. Bologa, Marius Olah, molah@salud.unm.edu, and Tudor I. Oprea, toprea@salud.unm.edu. Division of Biocomputing, University of New Mexico School of Medicine, MSC 084560, 1 University of New Mexico, Albuquerque, NM 87131-0001
Structures from WOMBAT (WOrld of Molecular BioAcTivity) [1] were investigated with several descriptor systems. For 79,483 unique non-stereoisomeric compounds, we found multiple “confused” instances, i.e., distinct structures that receive identical descriptions: for 2D descriptors, 314 duplicates (0.4%) across 80 descriptors (487 pairs); for MESA-implemented [2] MDL keys, 4391 duplicates (5.5%) across 320 keys; for Daylight fingerprints [3], 7166 duplicates (9.0%) at 512 keys, 5010 duplicates (6.3%) at 1024 keys, and 4092 duplicates (5.1%) at 2048 keys. The WOMBAT-derived set of 512 keys had 6202 duplicates (7.8%). Our results indicate that, for several chemical descriptor systems, it is not always possible to provide a 1:1 map between chemical structure and chemical description. This implies that we can devise an information-rich, yet “confused” descriptor system, i.e., a chemical information exchange tool allowing for chemical structure ambiguity.
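
The duplicate counts above can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not state how duplicates were tallied, so the sketch assumes that a "duplicate" is any compound sharing its full descriptor vector with at least one other compound, and that "pairs" are all pairwise combinations within such a group. The function name `find_confused`, the compound IDs, and the 4-bit toy fingerprints are hypothetical.

```python
from collections import defaultdict
from math import comb


def find_confused(descriptors):
    """Group compounds by identical descriptor vectors.

    descriptors: dict mapping a compound ID to a sequence of descriptor
    values (e.g. fingerprint bits). Returns the collision groups, the
    number of compounds involved in a collision, and the number of
    colliding pairs. The counting convention is an assumption.
    """
    groups = defaultdict(list)
    for compound_id, vector in descriptors.items():
        # Tuples are hashable, so identical vectors land in the same bucket.
        groups[tuple(vector)].append(compound_id)

    # Keep only buckets where two or more distinct structures collide.
    collisions = [ids for ids in groups.values() if len(ids) > 1]
    n_compounds = sum(len(ids) for ids in collisions)
    n_pairs = sum(comb(len(ids), 2) for ids in collisions)
    return collisions, n_compounds, n_pairs


# Hypothetical toy data: five structures, three of which share a fingerprint.
toy = {
    "cpd_A": (1, 0, 1, 1),
    "cpd_B": (1, 0, 1, 1),
    "cpd_C": (0, 1, 0, 0),
    "cpd_D": (1, 0, 1, 1),
    "cpd_E": (0, 0, 0, 1),
}

collisions, n_compounds, n_pairs = find_confused(toy)
percent_confused = 100.0 * n_compounds / len(toy)
print(collisions, n_compounds, n_pairs, percent_confused)
```

Under this convention a single group of three identical vectors contributes three compounds and three pairs, which is one way the number of pairs can exceed half the number of duplicates, as in the 2D-descriptor figures (314 duplicates, 487 pairs).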

[1] WOMBAT is available from http://www.sunsetmolecular.com
[2] The MDL 320 keys fingerprinter is available from http://www.mesaac.com
[3] The Daylight fingerprinter is available from http://www.daylight.com