Classifying the mutagenicity of two diverse sets of organic compounds using Ames test data for Salmonella typhimurium TA100 and TA98

COMP 191

Brian E. Mattioni (1), Peter C. Jurs (1), and David T. Stanton (2). (1) Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, Box 43, University Park, PA 16802, (2) Central Research Chemical Technology Division, Procter & Gamble, Miami Valley Laboratories, Cincinnati, OH 45252
Classification models are constructed for mutagenicity assessment using Ames test data for two diverse sets of organic compounds. The models are developed for two strains of Salmonella typhimurium (TA100 and TA98). More than 300 structural descriptors are calculated to encode the topological, geometrical, electronic, and polar surface area features of the compounds. In addition, we report the development and use of a new class of descriptors that we call hydrophobic surface area (HSA) descriptors. The new descriptors should help generate more accurate toxicity models, given the role that hydrophobicity has been shown to play in toxicity prediction. To establish diversity, a subsetting approach is employed to generate multiple training and prediction sets, ensuring that consistent results are obtained regardless of training set membership. The resultant predictions are subjected to a 'majority rules' voting scheme to form a model ensemble. Linear discriminant analysis produces the best results for the TA100 strain, where the model ensemble correctly classifies ~77% of the prediction set compounds. For the TA98 strain, by contrast, the probabilistic neural network generates the best results, with the ensemble achieving a prediction set accuracy of ~80%.