ChemModLab, written by the ECCR@NCSU consortium under NIH support, is a toolbox for fitting and assessing quantitative structure-activity relationships (QSARs). It has three components: a cheminformatic component that computes five types of molecular descriptors for use in modeling; a set of sixteen statistical methods for fitting models; and methods for validating the resulting models. The sixteen statistical methods cover a broad range of modeling approaches. By pairing each of the five descriptor types with each of the sixteen methods, ChemModLab can produce eighty QSAR models, which can be used individually or as the basis for ensembles. The first part of this presentation will introduce this web-accessible software.
The remainder of this presentation will focus on ensemble models, in which the outputs of many individual models are combined to yield a single overall model. Such methods have gained popularity in multiple areas of chemistry. The ensemble method Random Forests (RF) has been shown to be highly effective for predicting biological activity in many applications. RF is a family ensemble model because all of its base learners are created by the same underlying mechanism, a recursive partitioning decision tree. While generally effective, RF can perform poorly when the training set is highly unbalanced. This is often the case in QSAR applications, where the percentage of active compounds can be very small. For such applications, we study the properties of family ensemble models and make recommendations for obtaining improved performance.
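To make the idea of a family ensemble concrete, the following is a minimal sketch in plain Python and is not ChemModLab code. Every base learner is a one-split recursive partitioning tree (a "stump") fitted to a resampled training set, and the ensemble predicts by majority vote. All function names are hypothetical. The `balanced` option, which draws equal numbers of actives and inactives for each base learner, is one commonly used remedy for unbalanced data and is an assumption here; it is not necessarily among the recommendations made in the presentation.

```python
import random


def fit_stump(X, y):
    """Fit a one-split tree: the (feature, threshold, left label) with fewest training errors."""
    if len(set(y)) == 1:
        # Only one class in the sample: fall back to a constant predictor.
        return 0, float("inf"), y[0]
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if row[j] <= t else 1 - left_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, left_label)
    return best[1], best[2], best[3]


def predict_stump(stump, row):
    j, t, left_label = stump
    return left_label if row[j] <= t else 1 - left_label


def draw_sample(y, rng, balanced):
    """Return training indices for one base learner (drawn with replacement)."""
    if not balanced:
        # Ordinary bootstrap: rare actives may be missing from the sample.
        return [rng.randrange(len(y)) for _ in y]
    actives = [i for i, label in enumerate(y) if label == 1]
    inactives = [i for i, label in enumerate(y) if label == 0]
    if not actives or not inactives:
        raise ValueError("balanced sampling needs both classes")
    k = min(len(actives), len(inactives))
    # Assumed remedy: equal numbers of actives and inactives per learner.
    return [rng.choice(actives) for _ in range(k)] + [
        rng.choice(inactives) for _ in range(k)
    ]


def fit_family_ensemble(X, y, n_learners=25, balanced=False, seed=0):
    """Fit n_learners stumps, each on its own resampled training set."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_learners):
        idx = draw_sample(y, rng, balanced)
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble


def predict_ensemble(ensemble, row):
    """Majority vote: predict active (1) only if more than half the learners do."""
    votes_for_active = sum(predict_stump(s, row) for s in ensemble)
    return 1 if 2 * votes_for_active > len(ensemble) else 0


if __name__ == "__main__":
    # Toy data: one descriptor, 18 inactive and 2 active compounds.
    X = [[v / 20] for v in range(20)]
    y = [0] * 18 + [1] * 2
    model = fit_family_ensemble(X, y, n_learners=25, balanced=True)
    print(predict_ensemble(model, [0.99]), predict_ensemble(model, [0.0]))
```

With an ordinary bootstrap, a base learner whose sample happens to contain no actives reduces to the constant "inactive" predictor, which is one way to see why a small active fraction can hurt an ensemble of this kind.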