Sampling data to build, validate and test predictive models when target event occurs infrequently
Jul 13, 2011 8:18 AM
When building models to predict a binary outcome variable, such as respond or not respond, the proportion in the desired category (respond) may be low. For example, in direct mail campaigns the response rate is often 1% or lower. In such cases, a predictive model is likely to learn how to predict the frequent outcome very well and the non-frequent outcome poorly.
There are various ways of ensuring better prediction of the infrequent class, such as assigning higher weights to the infrequent events; stratified sampling to include a higher ratio of target events in the model building, selection and testing samples than would occur at random; or assigning different profits and costs to the different correct and incorrect predictions.
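The stratified-sampling option can be sketched in a few lines of Python. This is only an illustration using the standard library, not the procedure JMP Pro follows; the function name `stratified_sample`, its parameters and the 100,000-customer population are assumptions made up for the example.

```python
import random

def stratified_sample(records, is_event, n_total, event_fraction=0.5, seed=0):
    """Draw n_total records so that event_fraction of them are target events."""
    rng = random.Random(seed)
    events = [r for r in records if is_event(r)]
    non_events = [r for r in records if not is_event(r)]
    n_events = round(n_total * event_fraction)
    n_non_events = n_total - n_events
    if n_events > len(events) or n_non_events > len(non_events):
        raise ValueError("not enough records in one stratum for the requested sample")
    # Sample each stratum separately, then shuffle so the order carries no information.
    sample = rng.sample(events, n_events) + rng.sample(non_events, n_non_events)
    rng.shuffle(sample)
    return sample

# Hypothetical population: 100,000 customers, 5.6% of them responders (flag 1).
population = [(i, 1 if i < 5600 else 0) for i in range(100_000)]
balanced = stratified_sample(population, lambda r: r[1] == 1, n_total=1000)
balanced_rate = sum(flag for _, flag in balanced) / len(balanced)
print(len(balanced), balanced_rate)
```

Setting `event_fraction` to something other than 0.5 gives a sample that is enriched in target events without being fully balanced.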
One of the simpler approaches is stratified sampling to create a sample consisting of roughly equal numbers of responders and non-responders. This gives your models a much better chance of learning how to predict those customers who respond positively to your campaigns. However, because the proportion of responders to non-responders in the model-building data no longer reflects the general population, the predicted probability of responding for any customer from the resulting model will be too high. There are various ways of adjusting these predictions, but if all you are concerned about is using the predictive model to score and rank customers, then no adjustment may be necessary.
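One common adjustment rescales each predicted probability by the ratio of the population response rate to the sample response rate. The sketch below assumes the sample was drawn on the outcome alone (as in the balanced sample described above) and that the model is reasonably calibrated on that sample; the function name `correct_probability` and the example scores are invented for illustration.

```python
def correct_probability(p_sample, sample_rate, population_rate):
    """Rescale a probability from a model fit on a resampled data set
    back to the population base rate."""
    event_term = p_sample * population_rate / sample_rate
    non_event_term = (1 - p_sample) * (1 - population_rate) / (1 - sample_rate)
    return event_term / (event_term + non_event_term)

# Hypothetical scores from a model built on a 50/50 sample,
# adjusted to a population response rate of 5.6%.
scores = [0.9, 0.5, 0.2]
adjusted = [correct_probability(p, sample_rate=0.5, population_rate=0.056)
            for p in scores]
print(adjusted)
```

A score of 0.5 on the balanced sample maps back to the base rate of 0.056. The adjustment is monotonic, so the ranking of customers is unchanged, which is why it can be skipped when only the ranking matters.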
Consider a direct mail campaign to solicit charitable donations where the historic proportion of customers responding to campaigns is 5.6%. Figure 1 below shows a predictive model based on a random sample of 10,000 customers. Figure 2 shows a predictive model from a stratified sample of 1,000 customers. I used JMP Pro to build both models.
Figure 1: Bootstrap (random) forest model from random sample of 10,000 customers
Figure 2: Bootstrap (random) forest model from stratified sample of 1,000 customers
As expected, the random sample maintains a response rate of roughly 5.6% in the model building, selection and test phases. The model achieves a very low overall misclassification rate of 5.85% in the test data subset, but only at the expense of being unable to predict the customers who will respond positively. An error rate that close to the response rate is about what a model would score by predicting that nobody responds.
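The arithmetic behind this can be checked with a small sketch. The counts below are hypothetical (1,000 test customers at the 5.6% response rate), not the actual test subset behind Figure 1, and `evaluate` is a helper written for this example.

```python
def evaluate(actual, predicted):
    """Return (misclassification rate, share of actual responders predicted correctly)."""
    n = len(actual)
    errors = sum(a != p for a, p in zip(actual, predicted))
    true_pos = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    positives = sum(actual)
    recall = true_pos / positives if positives else float("nan")
    return errors / n, recall

# Hypothetical test subset: 56 responders and 944 non-responders.
actual = [1] * 56 + [0] * 944
# A model that predicts "no response" for every customer.
predicted = [0] * 1000

misclassification, recall = evaluate(actual, predicted)
print(misclassification, recall)
```

The only errors such a model makes are the responders themselves, so its misclassification rate equals the response rate while it identifies none of the responders.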
On the other hand, the stratified sample of 1,000 customers used in Figure 2 consists of an equal number of givers and non-givers. The model built from it has a higher overall misclassification rate of 34% in the test subset, but it predicts 69% of the givers correctly. Since our goal is to predict givers, the stratified sample approach will enable future marketing campaigns to be more productive, unlike the model built from the random sample in Figure 1.