Department of Statistics Seminar
North Carolina State University

presents

Dr. Andy Liaw*

Merck Research Laboratories

"Random Forest and Its Application to QSAR Modeling of HTS Data"

ABSTRACT

Recently there has been an explosion of research activity on a new class of statistical learning techniques that combine the results of an ensemble of classifiers (or regressions). These techniques, notably bagging and boosting, have been shown to give better test set prediction accuracy than most existing methods. We will present our experience in applying one particular ensemble method, called random forest (Breiman, 2001), to the problem of building quantitative structure-activity relationship (QSAR) models for high-throughput screening (HTS) data. A random forest is formed by combining a large number of unpruned trees, each grown from a bootstrap sample of the data. At each node of every tree, the best split is chosen from only a small random sample of all variables. We found this technique to be particularly attractive for QSAR models because, in addition to its good prediction performance, it provides a few very useful "by-products" such as relative variable importance and proximity measures among data points. We will compare the performance of random forest with other techniques, and show how the extra information produced by random forest helps in building QSAR models.
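
As a rough illustration of the construction described in the abstract (unpruned trees, each grown from a bootstrap sample, with the split at each node chosen from a small random subset of the variables), the following is a minimal Python sketch. It is not the speakers' implementation: the function names, the `mtry` and `n_trees` parameters, the Gini split criterion, the midpoint thresholds, and the toy data are all assumptions made for this example, and the variable importance and proximity by-products are not shown.

```python
import random
from collections import Counter

def gini(labels):
    # Gini impurity of a non-empty list of class labels (assumed split criterion).
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    # Search only the given candidate features; thresholds are midpoints
    # between consecutive distinct values, so both children are non-empty.
    best, best_score = None, float("inf")
    n = len(rows)
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best  # None if every candidate feature is constant at this node

def grow_tree(rows, labels, mtry, rng):
    # Unpruned: keep splitting until the node is pure or cannot be split.
    if len(set(labels)) == 1:
        return labels[0]
    features = rng.sample(range(len(rows[0])), mtry)  # random subset per node
    split = best_split(rows, labels, features)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], mtry, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], mtry, rng))

def predict_tree(node, row):
    # Internal nodes are (feature, threshold, left, right) tuples; leaves are labels.
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def grow_forest(rows, labels, n_trees=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    # Combine the ensemble by majority vote.
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes, two descriptors each.
X = [(0, 1), (1, 0), (1, 1), (0, 0), (9, 10), (10, 9), (10, 10), (9, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = grow_forest(X, y)
print(predict_forest(forest, (0.5, 0.5)), predict_forest(forest, (9.5, 9.5)))
```

For regression, the majority vote would be replaced by an average of the tree predictions and the Gini criterion by a variance-based one.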

*Joint work with Dr. Vladimir Svetnik, Biometrics Research, Merck Research Laboratories

Friday, September 12, 2003

3:35 - 4:35 pm

206 Cox Hall

Refreshments will be served on the second floor of Dabney Hall (left of Room 222) at 3:00 pm.