CINF 53 |
| Molecular modelers are always at the mercy of the primary data providers. We argue, and illustrate with examples, that the quality of the data determines the accuracy and predictive power of (Q)SAR models, irrespective of the rigor and thoroughness with which they are built. The primary data may contain errors in chemical structures, in the values of the biological data, and in the associations between structures and bioassay results; frequently, there are also duplicates. We show that many publicly available datasets, including those recently used for QSAR competitions, contain erroneous information that is sometimes sufficient to undermine the merit of the competition. We further show that these data errors significantly, if not dramatically, affect the accuracy of the resulting models. Conversely, we demonstrate that rigorously built (Q)SAR models can help identify and correct gaps and possible errors in primary datasets. Finally, we propose simple protocols for primary data analysis and curation. |
|
Herman Skolnik Award Symposium
8:30 AM-11:45 AM, Tuesday, August 18, 2009, Walter E. Washington Convention Center -- 204C, Oral
Division of Chemical Information |