High-throughput screening (HTS) is used in drug discovery to screen
large numbers of compounds against a biological target. Data on activity
against the target are collected for a representative sample
(experimental design) of compounds selected from a collection. The
explanatory variables are chemical descriptors of compound structure.
Previous work shows that local methods, namely k-nearest
neighbors (KNN) and classification and regression trees (CART), perform
very well. Adaptations of KNN and CART, including averaging over
subsets of explanatory variables, bagging, and boosting, have also been
considered. After briefly reviewing and comparing these techniques, I will
focus on estimating activity and error rates for assessing model
performance. These estimates will shed some light on how the various
models handle large random or systematic errors in drug screening data.
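
As a rough illustration of two ideas mentioned above, KNN averaged over
subsets of explanatory variables and an estimated error rate, here is a
minimal Python sketch. Everything in it is an assumption made for
illustration: the function names (knn_score, subset_averaged_knn,
loo_error_rate), the squared Euclidean distance, the 0.5 activity
threshold, the use of leave-one-out as the error estimate, and the tiny
made-up data set. It is not the procedure used in the work described here.

```python
def knn_score(train_X, train_y, x, cols, k):
    """Estimated activity of compound x: the fraction of actives among its
    k nearest training compounds, using only the descriptors in `cols`."""
    dists = sorted(
        (sum((row[c] - x[c]) ** 2 for c in cols), label)
        for row, label in zip(train_X, train_y)
    )
    return sum(label for _, label in dists[:k]) / k


def subset_averaged_knn(train_X, train_y, x, subsets, k=3):
    """Average the KNN activity estimate over several descriptor subsets."""
    scores = [knn_score(train_X, train_y, x, cols, k) for cols in subsets]
    return sum(scores) / len(scores)


def loo_error_rate(X, y, subsets, k=3, threshold=0.5):
    """Leave-one-out misclassification rate: each compound is predicted
    from a model fitted to all the other compounds."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        score = subset_averaged_knn(rest_X, rest_y, X[i], subsets, k)
        predicted = int(score >= threshold)
        errors += int(predicted != y[i])
    return errors / len(X)


# Toy data invented for illustration only: six compounds, two descriptors,
# label 1 = active, 0 = inactive.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Hypothetical descriptor subsets: each descriptor alone, then both together.
subsets = [[0], [1], [0, 1]]

new_compound = [5.5, 5.5]
print("estimated activity:", subset_averaged_knn(X, y, new_compound, subsets))
print("leave-one-out error rate:", loo_error_rate(X, y, subsets))
```

In HTS data the active compounds are typically a small minority, so an
overall misclassification rate can look good while most actives are missed.
A per-class error rate, or a ranking of compounds by estimated activity,
may be a more informative summary than the single number computed here.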