CINF 18 |
| Selecting a small subset of descriptors from a large pool to build a good QSAR model is a hard problem. Even heuristic methods typically aim to find a subset that yields a good model for a single model type. Ensemble QSAR models combine the predictions of multiple instances of different model types. Traditionally, descriptor selection for ensemble models has consisted of performing feature selection for each individual model, leading to a set of features that is specific to each model type. However, for more interpretable QSAR models, it is advantageous to have a single consistent set of features that can be used across different model types. In this work, we select a single optimal subset of descriptors for multiple model types by jointly optimizing their prediction accuracy using a genetic algorithm and linear combination functions. We apply this approach to both regression and classification problems. In particular, for two datasets, using an ensemble of a linear model and a neural network, we show that the predictive ability decreases by only 1.14% and 2.3%, respectively. This work is the very first step in consensus descriptor modeling, and we are not aware of any other work in this area. Several directions are currently being pursued to improve the above approach, including designing better scoring functions, exploring alternative optimization techniques, and devising novel ways to combine predictions. |
|
Advances in Virtual High-Throughput Screening
1:30 PM-4:30 PM, Sunday, 10 September 2006 Moscone Center -- Room 125, Oral
Division of Chemical Information |
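
The joint selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function, the genetic operators, or the datasets, so everything below is an assumption made for illustration. In particular, a nearest-centroid classifier stands in for the linear model, a 1-nearest-neighbour classifier stands in for the neural network, the data are synthetic, and the names `make_data`, `joint_score`, and `select_descriptors` are hypothetical. The only element taken from the abstract is the structure: one descriptor mask is scored by a linear combination of the accuracies of two model types, and a genetic algorithm searches over masks.

```python
import random

def make_data(n=120, d=8, seed=0):
    """Synthetic two-class data; only descriptors 0 and 1 carry signal (assumption)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [1 if row[0] + row[1] > 0 else 0 for row in X]
    return X, y

def project(X, mask):
    """Keep only the descriptor columns switched on in the bit mask."""
    cols = [j for j, bit in enumerate(mask) if bit]
    return [[row[j] for j in cols] for row in X]

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid_accuracy(Xtr, ytr, Xte, yte):
    """Nearest-centroid classifier: a simple stand-in for a linear model."""
    cents = {}
    for label in set(ytr):
        rows = [x for x, t in zip(Xtr, ytr) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    hits = sum(min(cents, key=lambda k: sqdist(x, cents[k])) == t
               for x, t in zip(Xte, yte))
    return hits / len(yte)

def one_nn_accuracy(Xtr, ytr, Xte, yte):
    """1-nearest-neighbour classifier: a simple stand-in for a nonlinear model."""
    hits = 0
    for x, t in zip(Xte, yte):
        nearest = min(range(len(Xtr)), key=lambda i: sqdist(x, Xtr[i]))
        hits += ytr[nearest] == t
    return hits / len(yte)

def joint_score(mask, data, weights=(0.5, 0.5)):
    """Linear combination of both models' hold-out accuracies for one mask."""
    if not any(mask):
        return 0.0  # an empty descriptor subset cannot be scored
    Xtr, ytr, Xte, yte = data
    Ptr, Pte = project(Xtr, mask), project(Xte, mask)
    scores = (centroid_accuracy(Ptr, ytr, Pte, yte),
              one_nn_accuracy(Ptr, ytr, Pte, yte))
    return sum(w * s for w, s in zip(weights, scores))

def select_descriptors(data, d, pop_size=20, generations=15, p_mut=0.1, seed=1):
    """Genetic search over bit masks: tournament selection, uniform crossover,
    bit-flip mutation, and one elite individual carried over per generation."""
    rng = random.Random(seed)
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = joint_score(mask, data)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]
    history = []  # best joint score seen in each generation
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        children = [elite[:]]
        while len(children) < pop_size:
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

X, y = make_data()
data = (X[:80], y[:80], X[80:], y[80:])  # simple train / hold-out split
best_mask, history = select_descriptors(data, d=8)
print("selected descriptors:", [j for j, bit in enumerate(best_mask) if bit])
print("joint score:", history[-1])
```

The equal weights in `joint_score` are an arbitrary choice; shifting them changes how strongly each model type influences the shared descriptor subset. A regression variant would replace the two accuracies with an error measure and minimize the combined value instead.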