Learning optimum Decision Trees: Influence of parameter choice and feature selection

CINF 81

Shaillay Kumar Dogra, shaillay@strandls.com, Cheminformatics, Strand Life Sciences Pvt. Ltd, No. 237, Sir C. V. Raman Avenue, Raj Mahal Vilas, Bangalore, India
The Decision Tree (DT), as a classification algorithm, has certain advantages over other methods such as Neural Networks or Support Vector Machines. Apart from producing interpretable models, DTs inherently select the descriptors that are relevant to modeling the given property during tree building itself. However, DT performance tends to suffer on cheminformatics data, which is characterized by a high-dimensional feature space and few samples available for training. Here, ‘parameter tuning’ and ‘feature selection’ become important. In this study, we present our findings on the influence of parameters such as the ‘attribute selection measure’, ‘tree stopping criterion’ and ‘tree pruning method’ on the size and performance of the learned Decision Trees. Further, to handle high-dimensional data, we introduce an initial wrapper-based feature selection step before invoking DT learning. Finally, we compare our results with those obtained from a ‘Decision Forest’, which is an ensemble of DTs.
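
As an illustration of what an ‘attribute selection measure’ computes, the following minimal Python sketch scores one candidate split with two commonly used measures, information gain (entropy-based) and Gini gain. The abstract does not state which measures were compared, so these two are assumptions, and the function names, descriptor values, class labels and threshold are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gain(values, labels, threshold, impurity):
    """Impurity reduction obtained by splitting on `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted


# Hypothetical toy data: one numeric descriptor and a binary activity class.
descriptor = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
activity = ["inactive", "inactive", "inactive", "active", "active", "active"]

# Score the same candidate threshold under both measures.
info_gain = split_gain(descriptor, activity, 0.7, entropy)
gini_gain = split_gain(descriptor, activity, 0.7, gini)
print(f"information gain: {info_gain:.3f}, Gini gain: {gini_gain:.3f}")
```

A tree learner evaluates such scores for every descriptor and candidate threshold at each node and splits on the best one, so the choice of measure can change which descriptors enter the tree and how large it grows.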