May 8, 2012 9:15 AM | Last Modified: Apr 18, 2017 7:20 AM
When building a prediction model, there are many ways to model the response as a function of our predictors. The Fit Model platform in JMP allows us to model the response as a linear function of our predictors. The Nonlinear platform allows us to model the response as a nonlinear function of the predictors, for example as an exponential or sigmoidal curve. Another option is to model the response as a step function of our continuous predictors. We can build this kind of model by discretizing, or binning, our continuous columns. This blog post gives a brief description of the “Supervised Binning” add-in for building this type of model. Here “supervised” means that we use the response variable to help us choose the binning scheme. The add-in is available for download from the JMP File Exchange as part of the “Predictor Binning” add-in (download requires a free SAS profile).
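To make the idea of “supervised” concrete, here is a minimal Python sketch of one way to choose cut points using the response. The post does not say how the add-in picks its bins, so this greedy, regression-tree-style search is an assumption for illustration, not the add-in's algorithm. The function names (`sse`, `best_split`, `supervised_cuts`), the `max_bins` and `min_size` parameters, and the data are all hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys, min_size=2):
    """Return (cut, gain) for the single cut on x that most reduces the SSE of y.

    Returns (None, 0.0) when no cut improves on leaving the data unsplit.
    """
    pairs = sorted(zip(xs, ys))
    total = sse([y for _, y in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(min_size, len(pairs) - min_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = total - sse(left) - sse(right)
        if gain > best_gain:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cut, best_gain


def supervised_cuts(xs, ys, max_bins=4, min_size=2):
    """Greedily add the most helpful cut until max_bins is reached or nothing helps."""
    cuts = []
    while len(cuts) + 1 < max_bins:
        edges = [float("-inf")] + sorted(cuts) + [float("inf")]
        candidates = []
        for lo, hi in zip(edges, edges[1:]):
            # Observations in the current bin: lo < x <= hi
            seg = [(x, y) for x, y in zip(xs, ys) if lo < x <= hi]
            if len(seg) >= 2 * min_size:
                cut, gain = best_split(
                    [x for x, _ in seg], [y for _, y in seg], min_size
                )
                if cut is not None:
                    candidates.append((gain, cut))
        if not candidates:
            break
        cuts.append(max(candidates)[1])
    return sorted(cuts)


# Made-up data: two plateaus with a jump between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
print(supervised_cuts(x, y, max_bins=2))  # [3.5]
```

An unsupervised scheme (equal-width or equal-count bins) would ignore `y` entirely. Here every candidate cut is scored by how much it reduces the squared error of the response.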
As an example, the Corn data table in JMP’s sample data folder provides corn yield measurements and the concentration of nitrate added to the soil. Plotting these data, we see that the relationship between the two variables is highly nonlinear, and it might be difficult to find a parametric model that fits the data well. So it would be interesting to bin the nitrate values, creating a step function to predict corn yield. To do this, we simply launch the Supervised Binning add-in and specify the “yield” column as the Response and the “nitrate” column as the Explanatory Variable.
The add-in appends a binned version of the nitrate column to the original data table. This new column, “nitrate binned,” is a four-level categorical column where each level represents a different range of nitrate values. For example, the first bin represents the observations where the nitrate level is less than or equal to 10.58. Now we can use the binned column (in either the Fit Model or Fit Y by X platform) to predict yield. The figure below compares the binned prediction function to a disjoint quadratic model for predicting yield. In this example, the binned model is easier to interpret than the quadratic model, and it has a lower mean squared error.
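The binned prediction function is simply the mean response within each bin. The Python sketch below illustrates that idea with made-up numbers. It does not use the Corn data, and the cut points, function names, and values are hypothetical. Bins are closed on the right, matching the “less than or equal to” convention described above.

```python
from bisect import bisect_left
from statistics import fmean


def assign_bins(xs, cuts):
    """Map each x to a bin index; bin k holds cuts[k-1] < x <= cuts[k]."""
    return [bisect_left(cuts, x) for x in xs]


def fit_step_function(xs, ys, cuts):
    """Return the mean response in each bin, keyed by bin index."""
    groups = {}
    for b, y in zip(assign_bins(xs, cuts), ys):
        groups.setdefault(b, []).append(y)
    return {b: fmean(vals) for b, vals in groups.items()}


def mse(ys, preds):
    """Mean squared error between observed and predicted responses."""
    return fmean([(y - p) ** 2 for y, p in zip(ys, preds)])


# Hypothetical cut points and made-up observations
cuts = [2.5, 5.0]
x = [1, 2, 3, 4, 5, 6, 7]
y = [4.0, 6.0, 9.0, 11.0, 10.0, 3.0, 5.0]

bin_means = fit_step_function(x, y, cuts)
preds = [bin_means[b] for b in assign_bins(x, cuts)]

print(assign_bins(x, cuts))  # [0, 0, 1, 1, 1, 2, 2]
print(bin_means)             # {0: 5.0, 1: 10.0, 2: 4.0}
print(mse(y, preds))         # 6/7, about 0.857
```

Computing the same `mse` for a competing model's predictions gives the kind of comparison described above.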
The corn example shows how predictor binning works in a simple setting, but binning is also very useful when building more complicated models. For example, we could look at the Boston Housing data from the JMP sample data folder. Here we are trying to predict median home values using a variety of features of each town. Suppose we want to use all of the data available to build a linear model, but we have reason to believe that several of the predictors have a nonlinear relationship with home value. For example, maybe we believe that home values are relatively constant for low crime rates, but drop dramatically above a certain crime rate. We could make similar arguments to justify binning the “rooms” column. We can use the Supervised Binning add-in to bin the “crim” and “rooms” columns, as well as any other columns that might seem appropriate.
The add-in breaks the “crim” column down into 10 discrete categories. Once the per capita crime rate goes above 2.0, home values start to drop quickly. Now we can use a combination of our binned columns and the remaining continuous columns to build a model with potentially (hopefully!) much better predictive ability than a model built with the original columns.
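JMP handles the coding of categorical columns for us, but it may help to see what mixing binned and continuous predictors means for the underlying linear model. The Python sketch below builds design-matrix rows with indicator (dummy) columns for one binned predictor next to one untouched continuous predictor. The cut points, values, and function names are hypothetical, and the three bins are chosen for brevity rather than to match the add-in's 10 categories.

```python
from bisect import bisect_left


def dummy_code(bin_index, n_bins):
    """Indicator columns for bins 1..n_bins-1; bin 0 is the reference level."""
    return [1.0 if bin_index == k else 0.0 for k in range(1, n_bins)]


def design_row(binned_value, cuts, continuous_values):
    """Intercept, indicators for the binned predictor, then continuous predictors."""
    b = bisect_left(cuts, binned_value)  # bin k holds cuts[k-1] < x <= cuts[k]
    return [1.0] + dummy_code(b, len(cuts) + 1) + list(continuous_values)


crim_cuts = [0.5, 2.0]  # hypothetical cut points, giving three bins

# Each row: a made-up crime rate (to be binned) and one made-up continuous value
rows = [
    design_row(0.1, crim_cuts, [6.5]),
    design_row(1.2, crim_cuts, [5.9]),
    design_row(8.0, crim_cuts, [5.1]),
]
for row in rows:
    print(row)
# [1.0, 0.0, 0.0, 6.5]
# [1.0, 1.0, 0.0, 5.9]
# [1.0, 0.0, 1.0, 5.1]
```

Each indicator column gets its own coefficient in the fitted model, so the binned predictor contributes a separate offset per bin while the continuous predictor keeps a single slope.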
So if you have ever found yourself wanting to bin or discretize your continuous predictors, you might want to try out the Supervised Binning add-in. You can find this add-in as part of the Predictor Binning add-in on the JMP File Exchange.