This past week, Nate Silver held an “Ask Me Anything” chat on Reddit. There were several very good questions, one of which I found particularly important as we begin the International Year of Statistics: “What is the biggest abuse of statistics?” Nate's reply: “Overfitting.”
This response is very relevant to our discussion of predictive modeling, particularly when applied to problems in the life sciences. In a clinical trial, we may wish to predict a patient’s response in order to apply a study adaptation. Or we may want to understand which subjects are likely to experience a severe adverse experience or death. For problems in genomics, we hope to identify alleles contributing to various disease states, or predict the toxic effects of various chemicals based on in vivo gene expression data and chemical structure.
Overfitting occurs when the model fits the noise as well as the signal in the data. In instances when an overfitted predictive model is applied to new data sets, the performance of the model is often quite poor. For example, let’s take the most extreme case where the number of independent predictors equals the number of observations under study and there is no regularization. In this scenario, the model will achieve a perfect prediction every time on the training data, since each covariate can be used to predict the response of a single observation. However, the model likely includes numerous variables that are not useful in predicting the outcome. The result? When the model is used for a new set of data, the fit is terrible! This is because the model is too specific to the first set of data (i.e., the model is overfit).
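To make the extreme case concrete, here is a minimal sketch in Python with NumPy. Everything in it is an illustrative assumption rather than anything from a real study: the data are simulated, only the first predictor truly affects the response, and the model is ordinary unregularized least squares with no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20   # number of observations
p = n    # as many predictors as observations (the extreme case)

def simulate(n_rows):
    """Simulated data (an assumption): only the first predictor carries signal."""
    X = rng.normal(size=(n_rows, p))
    y = 2.0 * X[:, 0] + rng.normal(size=n_rows)
    return X, y

X_train, y_train = simulate(n)
X_new, y_new = simulate(n)   # a new data set from the same process

# Unregularized least squares: with p == n, the fit passes through every
# training point, so it absorbs the noise along with the signal.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ beta - y_train) ** 2)
new_mse = np.mean((X_new @ beta - y_new) ** 2)

print(f"training MSE: {train_mse:.3g}")
print(f"new-data MSE: {new_mse:.3g}")
```

The training error is zero up to rounding, while the error on the new data is much larger, which is exactly the gap that signals an overfit model.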
So why is overfitting such a concern for the life sciences? Well, unlike commercial data mining applications in finance, retail, and telecommunications, data sets from life science domains typically have orders of magnitude more predictors than observations (p >> n). In these “wide data” instances, it is very easy to overfit the data with predictive models (as in our worst-case scenario above). Imagine early phase clinical trials where every test known to humankind is performed over the course of several days or weeks to assess the safety of a new medication. How about analyses of SNPs where the number of predictors can easily stretch into the millions? As you can see, the potential for overfitting in the life sciences is great.
How is it possible to generate a useful model, one that can be applied to new sets of data with the goal of achieving accurate, generalizable predictions? As part of the model building process, the model should be tested or validated against one or more independent sets of data to assess whether some of the covariates are unnecessary. But how is this practical for our purposes? In commercial applications, there is typically sufficient data to define mutually exclusive subsets to train a model (the learning or training set), as well as validate the model (the validation set). However, observations from the life sciences tend to be more limited due to the expense, time, and difficulty of obtaining subjects or experimental samples.
Validation methods in the life sciences often require predicting outcomes for a small subset of the available data using the remaining observations, and repeating this process for all subsets. Figure 1 illustrates an example of 5-fold cross-validation. The data are partitioned into five mutually exclusive sets, each representing 20% of the available observations. A predictive model is trained with 80% of the data (the blue squares) and then validated with the remaining 20%, the test set (the red square). This exercise is repeated four additional times so that each partition of the data (i.e., each square) serves as the test set exactly once.
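The 5-fold procedure described above can be sketched in a few lines of Python with NumPy. This is a hedged illustration, not the JMP implementation: the function names `k_fold_indices` and `cross_validate` are hypothetical, the data are simulated, and least squares stands in for whatever predictive model is actually being assessed.

```python
import numpy as np

def k_fold_indices(n_obs, k, rng):
    """Shuffle the row indices and split them into k mutually exclusive folds."""
    return np.array_split(rng.permutation(n_obs), k)

def cross_validate(X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and repeat for every fold."""
    rng = np.random.default_rng(seed)
    folds = k_fold_indices(len(y), k, rng)
    fold_mse = []
    for i, test_idx in enumerate(folds):
        # All folds except fold i form the training set (80% when k = 5).
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        # Score the model only on observations it never saw during training.
        fold_mse.append(np.mean((X[test_idx] @ beta - y[test_idx]) ** 2))
    return folds, np.array(fold_mse)

# Simulated example data (an assumption): 100 observations, 3 predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=100)

folds, fold_mse = cross_validate(X, y, k=5)
print("per-fold test MSE:", np.round(fold_mse, 3))
print("mean test MSE:", round(float(fold_mse.mean()), 3))
```

Because every observation lands in the test set exactly once, the average of the five test errors uses all of the data for validation without ever scoring a model on the rows it was trained on.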
Further, simulation can be used to account for differences in how subsets are generated. For example, the above cross-validation exercise can be repeated 10 times, each time generating a new random partition of the data and performing cross-validation. After the simulation exercise, there would be 10 × 5 = 50 models. To develop my final model, I would choose those predictors that are selected in a majority of the 50 available models. This set of predictors can then be used on the full data set to develop the final model.
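A rough sketch of this repeated cross-validation with majority voting follows, again in Python with NumPy. The selection rule is purely an assumption made for illustration: `select_predictors` is a hypothetical screen that keeps predictors whose absolute correlation with the response exceeds a threshold, whereas a real analysis would use whatever predictor reduction method its model calls for. The data are simulated, with only predictors 0 and 3 carrying signal.

```python
import numpy as np

def select_predictors(X, y, threshold=0.3):
    """Hypothetical screening rule (an assumption): keep predictors whose
    absolute correlation with the response exceeds the threshold."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.abs(corr) > threshold

def repeated_cv_selection(X, y, repeats=10, k=5, seed=0):
    """Count how often each predictor is selected across repeats * k models."""
    rng = np.random.default_rng(seed)
    n_obs, n_pred = X.shape
    counts = np.zeros(n_pred, dtype=int)
    n_models = 0
    for _ in range(repeats):
        # A fresh random partition of the data for every repeat.
        folds = np.array_split(rng.permutation(n_obs), k)
        for i in range(k):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            counts += select_predictors(X[train_idx], y[train_idx])
            n_models += 1
    # Keep the predictors chosen in a majority of the models.
    majority = counts > n_models / 2
    return counts, n_models, majority

# Simulated example data (an assumption): 200 observations, 8 predictors.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)

counts, n_models, majority = repeated_cv_selection(X, y)
print("models fitted:", n_models)
print("selection counts per predictor:", counts)
print("predictors kept by majority vote:", np.flatnonzero(majority))

# Refit on the full data set using only the majority-vote predictors.
final_beta, *_ = np.linalg.lstsq(X[:, majority], y, rcond=None)
```

The selection counts show how stable each predictor is across partitions: a predictor that appears only in a handful of the 50 models is more likely to reflect noise in particular subsets than real signal.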
So the next time a journal article promotes a model as the best way to predict a particular outcome, approach this model with caution! Examine whether the authors applied their model to new data or applied honest cross-validation methods to assess its predictive performance. Otherwise, you may be doing yourself a disservice by using it.
JMP Clinical and JMP Genomics have an extensive set of predictive models available. Further, robust tools for cross-validation, learning curves (to answer the question “Are my training sets large enough?”) and predictor reduction are available to develop a useful predictive model for life science applications.
For more information, download this add-in (works with JMP 10) or view this on-demand webcast (which makes use of the add-in). The add-in and webcast illustrate the above methods using data from a clinical trial and from a genomics study that used next-generation sequencing.