

Jul 3, 2014

Sampling data to build, validate and test predictive models when target event occurs infrequently

When building models to predict a binary outcome variable, such as respond or not respond, the proportion in the desired category (respond) may be low. For example, in direct mail campaigns the response rate is often 1% or lower. In such cases, a predictive model is likely to learn how to predict the frequent outcome very well and the non-frequent outcome poorly.

There are various ways of ensuring better prediction of the non-frequent class: assigning higher weights to the infrequent events; using stratified sampling to include a higher ratio of target events in the model building, selection and testing samples than would occur at random; or assigning different profits and costs to the different correct and incorrect predictions.

Stratified Sampling

One of the simpler approaches is stratified sampling to create a sample consisting of roughly equal numbers of responders and non-responders. This gives your models a much better chance of learning how to predict those customers who respond positively to your campaigns. However, because the ratio of responders to non-responders in the model-building data does not reflect the general population, the response probabilities predicted by the resulting model will be too high. There are various ways of adjusting these predictions, but if all you need is to score and rank customers, then no adjustment may be necessary.
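As a rough illustration of the sampling step outside JMP, the sketch below draws an equal number of records from each outcome class using only the Python standard library. The function name `balanced_sample`, the `responded` field, and the synthetic customer list are assumptions made for this example; only the 5.6% response rate and the 1,000-customer sample size come from the example in this post.

```python
import random
from collections import defaultdict


def balanced_sample(records, label_key, n_per_class, seed=0):
    """Draw n_per_class records at random (without replacement) from each class."""
    rng = random.Random(seed)

    # Group the records by their outcome label.
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)

    sample = []
    for label, group in by_class.items():
        if len(group) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(group)} records")
        sample.extend(rng.sample(group, n_per_class))

    # Shuffle so the classes are not grouped together in the output.
    rng.shuffle(sample)
    return sample


# Hypothetical data: 10,000 customers, of whom 5.6% responded.
customers = [{"id": i, "responded": i < 560} for i in range(10_000)]

# 500 responders + 500 non-responders = a balanced sample of 1,000.
sample = balanced_sample(customers, "responded", n_per_class=500)
print(len(sample), sum(rec["responded"] for rec in sample))  # 1000 500
```

Sampling without replacement means the rare class limits the sample size: with 560 responders available, at most 560 can be drawn per class.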

Consider a direct mail campaign to solicit charitable donations where the historic proportion of customers responding to campaigns is 5.6%. Figure 1 below shows a predictive model based on a random sample of 10,000 customers. Figure 2 shows a predictive model from a stratified sample of 1,000 customers. I used JMP Pro to build both models.

Figure 1: Bootstrap (random) forest model from random sample of 10,000 customers

Figure 2: Bootstrap (random) forest model from stratified sample of 1,000 customers

As expected, the random sample maintains a response rate of roughly 5.6% in the model building, selection and test phases, and it achieves a very low overall misclassification rate of 5.85% in the test data subset at the expense of being unable to predict the customers who will respond positively.

On the other hand, the stratified sample of 1,000 customers used in Figure 2 consists of an equal number of givers and non-givers. This sampling strategy has a higher overall misclassification rate of 34% in the test subset, but in return it predicts 69% of the givers correctly. Since our goal is to predict givers, the stratified sample approach will make future marketing campaigns more productive than the random sample approach shown in Figure 1.
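To see why a low overall misclassification rate can coexist with no ability to find responders, consider a hypothetical model that predicts "no response" for everyone. The sketch below is not the bootstrap forest from the figures; the test set of 1,000 customers with 56 responders is made up to match the 5.6% response rate, and the helper names are assumptions.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases where the prediction differs from the actual outcome."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def recall(actual, predicted, positive=True):
    """Fraction of actual positives that the model predicted as positive."""
    preds_for_positives = [p for a, p in zip(actual, predicted) if a == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical test set: 56 responders out of 1,000 customers (5.6%).
actual = [True] * 56 + [False] * 944

# A model that never predicts a response.
always_no = [False] * 1000

print(misclassification_rate(actual, always_no))  # 0.056
print(recall(actual, always_no))                  # 0.0
```

The always-"no" model is wrong only on the 56 responders, so it scores a 5.6% misclassification rate while identifying none of the customers the campaign cares about. That is why the share of givers predicted correctly is the more useful measure here.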

Community Member

GA wrote:

Dear Malcolm: What is the best way to build a stratified sample using JMP Pro?

Many thanks,


Community Member

Malcolm wrote:


There are many ways of selecting a stratified sample. One of the simpler options is to use the Subset option from the Tables menu. In the resulting dialog select Random - sample size and check the stratify option. Then select the variable you want to define the stratified sample in the list that appears and define the sample size per stratum. Click OK.


Community Member

Danilo wrote:


If I understood it right, the article states that, for the accuracy of the results, it is better not to use all the information given by a real sampling process, even if the sampled data is real.

I think that conflicts somewhat with some fundamental principles of statistics.

What is the cut-off point used to build the confusion matrix? If it is possible, I would like to see the ROC curve of both models.

Community Member

Phillip Julian wrote:

Can you please send or post technical or academic resources for this blog article on stratified sampling with equal numbers in binary strata? I will be using that method in a campaign response analysis, and I will need to justify my methods. Thanks for the article and suggestion. I work in Retail Campaign Analytics, and this will be a big help to the team.

Community Member

Malcolm wrote:

If it is a case of reassuring yourself or colleagues that stratified sampling is a valid way of selecting data to use when building predictive models when the desired response rate is low (predicting rare events or low-frequency events), then try modeling your data with and without stratified sampling. I find stratified sampling works better for identifying the important predictors of low-frequency (or rare) events, or in the case of marketing campaigns, sorting which customers or target customers are more likely to respond. However, please don't use the predicted probabilities from models based on stratified sampling as we have deliberately biased the proportion of responders to non-responders in order to make it easier to spot the predictors of and/or predict the responders -- although you could make a simple correction to the predicted probabilities to make them reflective of your general population.
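One common form of such a correction, which may or may not be the one intended above, rescales the odds by the ratio of the population response rate to the response rate in the stratified sample. The sketch below assumes that records were selected based only on their outcome class; the function name `correct_probability` and the example values (a 50% sample rate and the 5.6% population rate from the post) are illustrative.

```python
def correct_probability(p_sample, sample_rate, population_rate):
    """Map a probability from a resampled training set back to the population base rate.

    p_sample:        probability predicted by the model built on the stratified sample
    sample_rate:     proportion of responders in the stratified sample (e.g. 0.5)
    population_rate: proportion of responders in the general population (e.g. 0.056)
    """
    for name, value in (("sample_rate", sample_rate),
                        ("population_rate", population_rate)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")

    # Reweight each class by how much it was over- or under-sampled.
    pos = p_sample * population_rate / sample_rate
    neg = (1.0 - p_sample) * (1.0 - population_rate) / (1.0 - sample_rate)
    return pos / (pos + neg)


# A score of 0.5 from a balanced sample corresponds to the population base rate.
print(round(correct_probability(0.5, 0.5, 0.056), 3))  # 0.056
```

The correction is monotonic, so it changes the predicted probabilities but not the ranking of customers, which is consistent with the point that no adjustment is needed when the model is used only to score and rank.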