Building classification models for DMSO solubility: Comparison of five methods


Jing Lu and Gregory A. Bakken. Scientific Computing Group, Groton Computational Chemistry, Pfizer Global R&D - Groton Labs, Eastern Point Road, Groton, CT 06340
DMSO solubility is increasingly recognized as a problem at least as serious as compound stability in combinatorial libraries, because poorly soluble compounds can cause artifacts in library screening and thereby reduce screening efficiency. An effective in silico model for estimating DMSO solubility is therefore desirable: it can flag poorly soluble compounds, which are incompatible with assay protocols, before screening runs. In this study, DMSO solubility data at 30 mM were gathered for 33,329 Pfizer compounds. Five linear and nonlinear classification methods were evaluated and compared on this data set using a set of 200 2D descriptors, and five predictive binary classification models for estimating the DMSO solubility class of organic compounds were derived and validated. The results show that ensembles of decision trees (specifically, boosting and random forests) achieve high accuracy. Additionally, methods such as LDA and BinaryQSAR provide accurate models when used in conjunction with feature selection. Although not quantitative, models such as these are effective for screening compounds to be stored in DMSO for potential solubility problems.
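As a rough illustration of how binary solubility-class models might be compared, the sketch below scores predicted classes against measured classes using accuracy, sensitivity, and specificity. It is a minimal sketch under stated assumptions, not the authors' validation procedure: the function names, the label convention (1 = insoluble, 0 = soluble), the choice of metrics, and the toy labels are all hypothetical, and none of the numbers are results from the study.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn), treating 1 (assumed: insoluble) as the positive class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn


def summarize(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, and specificity for one model."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of insoluble compounds that the model flags.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of soluble compounds that the model passes.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def compare_methods(
    actual: Sequence[int], predictions: Dict[str, Sequence[int]]
) -> Dict[str, Dict[str, float]]:
    """Score every model's predictions against the same measured classes."""
    return {name: summarize(actual, preds) for name, preds in predictions.items()}


# Toy, made-up labels for illustration only (not data from the study).
measured = [1, 1, 1, 0, 0, 0, 0, 0]
toy_predictions = {
    "model_a": [1, 1, 0, 0, 0, 0, 0, 0],
    "model_b": [1, 0, 0, 0, 0, 0, 1, 0],
}
for name, metrics in compare_methods(measured, toy_predictions).items():
    print(name, metrics)
```

Reporting sensitivity and specificity alongside accuracy is a design choice made here because a solubility data set may be dominated by one class, in which case accuracy alone can hide poor detection of the minority class.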

8:00 PM-10:00 PM, Monday, August 23, 2004 Pennsylvania Convention Center -- Hall D, Sci-Mix

Division of Chemical Information

The 228th ACS National Meeting, in Philadelphia, PA, August 22-26, 2004