Trust…but verify! On the importance of experimental data curation prior to building (Q)SAR models

CINF 53

Alexander Tropsha (1), alex_tropsha@unc.edu, Eugene Muratov (2), 00dqsar@ukr.net, and Denis Fourches (1), fourches@email.unc.edu. (1) Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina, CB # 7360, Beard Hall, Chapel Hill, NC 27599-7360, (2) Laboratory of Molecular Modeling, School of Pharmacy, The University of North Carolina at Chapel Hill, CAMPUS BOX 7360, Chapel Hill, NC 27599
Molecular modelers are always at the mercy of the primary data providers. We argue, and illustrate with examples, that the quality of the data determines the accuracy and predictive power of (Q)SAR models irrespective of the rigor and thoroughness used in building them. The primary data may contain errors in chemical structures, in the values of the biological data, and in the associations between structures and bioassay results; duplicates are also frequent. We show that many publicly available datasets, including those recently used for QSAR competitions, contain erroneous information that is sometimes sufficient to undermine the value of the competition. We further show that data errors significantly, if not dramatically, affect the accuracy of the resulting models. Conversely, we demonstrate that rigorously built (Q)SAR models can help identify and correct gaps and possible errors in primary datasets. Finally, we propose simple protocols for primary data analysis and curation.
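
As an illustration of one curation step mentioned above (detecting duplicates), the following is a minimal sketch and not the authors' protocol. It assumes each record already carries a standardized structure key (for example a canonical SMILES string or similar identifier) and a numeric activity value; the function name `find_duplicates`, the `tolerance` threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def find_duplicates(records, tolerance=0.5):
    """Group (structure_key, activity) records by key.

    Returns two dicts covering only keys that occur more than once:
    - concordant: key -> mean activity, when all values agree within `tolerance`
    - discordant: key -> list of conflicting values, flagged for manual review
    The tolerance is an arbitrary illustrative choice, not a recommended value.
    """
    groups = defaultdict(list)
    for key, activity in records:
        groups[key].append(activity)

    concordant, discordant = {}, {}
    for key, values in groups.items():
        if len(values) < 2:
            continue  # unique structure, nothing to reconcile
        if max(values) - min(values) <= tolerance:
            concordant[key] = mean(values)
        else:
            discordant[key] = values
    return concordant, discordant


# Toy data invented for illustration only (key, activity on a log scale).
records = [
    ("CCO", 5.1),
    ("CCO", 5.3),
    ("c1ccccc1", 4.0),
    ("c1ccccc1", 6.5),
    ("CC(=O)O", 3.2),
]
concordant, discordant = find_duplicates(records)
print("merge:", concordant)
print("review:", discordant)
```

In this sketch, concordant duplicates are merged into a single averaged record, while discordant ones are set aside rather than silently averaged, since a large disagreement may point to an error in the structure or in the reported value.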
 

Herman Skolnik Award Symposium
8:30 AM-11:45 AM, Tuesday, August 18, 2009, Walter E. Washington Convention Center -- 204C, Oral

Division of Chemical Information

The 238th ACS National Meeting, Washington, DC, August 16-20, 2009