Abstract
High-throughput screening (HTS) is used in drug discovery to screen large numbers of compounds against a biological target. Activity data against the target are collected for a representative sample (an experimental design) of compounds selected from a collection. The explanatory variables are chemical descriptors of compound structure. Previous work shows that local methods, namely k-nearest neighbors (KNN) and classification and regression trees (CART), perform very well. Adaptations of KNN and CART, including averaging over subsets of explanatory variables, bagging, and boosting, have also been considered. After briefly reviewing and comparing these techniques, I will focus on estimating activity and error rates for assessing model performance. This will shed some light on how various models handle large random or systematic errors in drug screening data.