Determining a minimum yet sufficient training set size for QSAR modeling

COMP 404

Shaillay Kumar Dogra, shaillay@strandls.com, Cheminformatics, Strand Life Sciences Pvt. Ltd, No. 237, Sir C. V. Raman Avenue, Raj Mahal Vilas, Bangalore, India
In Quantitative Structure-Activity Relationship (QSAR) modeling, data are often scarce. The modeler must train, validate, and test models on the same limited dataset. Ideally, the final models should be tested on as large an external test set as possible, but in practice this aspect of QSAR modeling is often compromised. Here, we present a methodology in which we first use permutation tests to determine the ‘signal’ and ‘noise’ levels for the given data, by learning QSAR models on both ‘true’ and ‘randomized’ data. Having determined the ‘separation’ between ‘signal’ and ‘noise’ for the given data, we then try to achieve a similarly significant separation with as little training data as possible. We illustrate this empirical approach on cheminformatics datasets, demonstrating how equally accurate models can be learned from minimal training data, with the advantage that model testing is then performed on a larger set.
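The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the model type, the accuracy measure, or the criterion for a "similarly significant" separation, so everything below is an assumption made for illustration. The sketch uses a synthetic one-descriptor dataset, a least-squares line as the model, R² on the held-out remainder as the score, and the gap between the true-label score and the best shuffled-label score as the separation. The function names (`signal_noise_gap`, `smallest_sufficient_size`, `make_toy_data`) and the tolerance `tol` are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one descriptor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def r_squared(model, xs, ys):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    slope, intercept = model
    my = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def signal_noise_gap(train, test, n_perm, rng):
    """Compare the true-label model with models fit on shuffled labels."""
    train_x, train_y = zip(*train)
    test_x, test_y = zip(*test)
    # 'Signal': model learned on the true activities.
    signal = r_squared(fit_line(train_x, train_y), test_x, test_y)
    # 'Noise': models learned after shuffling the training activities.
    noise = []
    for _ in range(n_perm):
        shuffled = list(train_y)
        rng.shuffle(shuffled)  # breaks the structure-activity link
        noise.append(r_squared(fit_line(train_x, shuffled), test_x, test_y))
    gap = signal - max(noise)  # assumed definition of 'separation'
    # Empirical permutation p-value.
    p_value = (1 + sum(s >= signal for s in noise)) / (1 + n_perm)
    return {"signal": signal, "gap": gap, "p_value": p_value}


def smallest_sufficient_size(data, sizes, n_perm=50, tol=0.05, seed=0):
    """Smallest training size whose gap is within tol of the largest size's gap."""
    rng = random.Random(seed)
    sizes = sorted(sizes)
    results = {}
    for n in sizes:
        # First n compounds train the model; all remaining compounds test it.
        results[n] = signal_noise_gap(data[:n], data[n:], n_perm, rng)
    reference_gap = results[sizes[-1]]["gap"]
    chosen = min(n for n in sizes if results[n]["gap"] >= reference_gap - tol)
    return chosen, results


def make_toy_data(n=200, seed=1):
    """Synthetic stand-in for a QSAR dataset: activity = 2 * descriptor + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))
    return data


if __name__ == "__main__":
    chosen, results = smallest_sufficient_size(make_toy_data(), [10, 20, 40, 80, 160])
    for n, r in results.items():
        print(n, round(r["signal"], 3), round(r["gap"], 3), round(r["p_value"], 3))
    print("smallest sufficient training size:", chosen)
```

In this sketch, the compounds left out of training at each size form the test set, so a smaller chosen training size directly yields a larger test set. A real study would substitute actual descriptors, a suitable learner, and a separation criterion appropriate to the data.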