Proper use of cross-validation while descriptor-thinning: Naïve vs. true q2

CINF 38

Ramanathan Natarajan (1), rnataraj@nrri.umn.edu; Subhash C. Basak (1), sbasak@nrri.umn.edu; Douglas M. Hawkins (2), dhawkins@umn.edu; and Jessica Kraker (3), krakerjj@uwec.edu. (1) Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, 5013 Miller Trunk Highway, Duluth, MN 55811, (2) School of Statistics, University of Minnesota, 313 Ford Hall, 224 Church Street SE, Minneapolis, MN 55455, (3) Department of Mathematics, University of Wisconsin-Eau Claire, 508 Hibbard Humanities Hall, Eau Claire, WI 54702-4004
In QSAR modeling of the properties and bioactivities of chemicals from calculated molecular descriptors, we face the usual problem of “few compounds and many descriptors.” Hence, variable-selection (descriptor-thinning) methods are used to choose a subset of descriptors from which to develop QSAR models. It is vital that descriptor selection, as well as any parameter selection, be treated as part of the modeling procedure that is cross-validated when the model is assessed. When the cross-validation step does not include all such elements of the modeling procedure, the resulting “naïve q2” is biased upward. Proper cross-validation that includes descriptor thinning is necessary for developing QSAR models with good predictive ability. The importance of embedding descriptor selection and parameter selection inside the cross-validation step, which yields the “true q2”, is highlighted by comparing true q2 with naïve q2 for a few sets of compounds.
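
The distinction between the two statistics can be illustrated with a minimal sketch. It is not the authors' procedure or data: it assumes purely random descriptors and a random response, uses "pick the single descriptor most correlated with the response" as a stand-in for descriptor thinning, and computes leave-one-out q2 as 1 - PRESS/SS_total with the overall response mean. All function names (`select_descriptor`, `loo_q2`, `demo`) are hypothetical.

```python
import random


def mean(v):
    return sum(v) / len(v)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single descriptor."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def abs_corr(x, y):
    """Absolute Pearson correlation between one descriptor and the response."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy) / (sxx * syy) ** 0.5


def select_descriptor(X, y, rows):
    """Stand-in for descriptor thinning: best-correlated column, using `rows` only."""
    ysub = [y[i] for i in rows]
    return max(range(len(X[0])),
               key=lambda j: abs_corr([X[i][j] for i in rows], ysub))


def loo_q2(X, y, select_inside):
    """Leave-one-out q2; selection is redone in every fold if select_inside is True."""
    all_rows = list(range(len(y)))
    # Naive variant: the descriptor is chosen once, using every compound.
    fixed = None if select_inside else select_descriptor(X, y, all_rows)
    press = 0.0
    for held_out in all_rows:
        train = [i for i in all_rows if i != held_out]
        # True variant: the held-out compound plays no part in the selection.
        j = select_descriptor(X, y, train) if select_inside else fixed
        slope, intercept = fit_line([X[i][j] for i in train],
                                    [y[i] for i in train])
        press += (y[held_out] - (intercept + slope * X[held_out][j])) ** 2
    ybar = mean(y)
    ss_total = sum((v - ybar) ** 2 for v in y)
    return 1.0 - press / ss_total


def demo(seed=0, n=25, p=500):
    """Assumed toy setting: n compounds, p noise descriptors, noise response."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return loo_q2(X, y, select_inside=False), loo_q2(X, y, select_inside=True)


naive_q2, true_q2 = demo()
print("naive q2:", round(naive_q2, 3), " true q2:", round(true_q2, 3))
```

Because the response here is unrelated to every descriptor, any apparent predictive ability in the naïve figure comes only from having selected the descriptor with the held-out compounds in view. Repeating the selection inside each fold removes that leak.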
 

Sci-Mix
8:00 PM-10:00 PM, Monday, August 20, 2007, BCEC, Exhibit Hall B2

Division of Chemical Information

The 234th ACS National Meeting, Boston, MA, August 19-23, 2007