Re: Stratified Data Split into test and control

75Innovation · Jul 12, 2019 02:03 PM

Hello JMP Community,

I am trying to split a dataset into Test and Control for a marketing experiment. The data set is unique at the customer id and has a million rows. Both the new datasets must be balanced using two parameters which are annual customer spend (YTD) and last quarter spend (QTD). These parameters are represented in two columns. I am unable to figure out how to create a stratified/balanced split. I tried using the "Stratified Data Partitioning (with balancing options) add-in." but in this add-in, the balance is performed at a categorical level (Focal Group). In my case, the balance needs to be done on a numerical data type (YTD and QTD spend).

I am NOT a JMP Pro user.

Any help in this use case will be much appreciated.

Thanks

Thierry_S · Jul 14, 2019 03:06 PM

Hi,
Since nobody seems to want to take a stab at it, I will share with you my own approach which may be a bit clunky.
First, I not aware of a technique that would let you stratify on a continuous variable (may be other might be more knowledgeable ?)
Second, considering the size of your primary data set, you may not need to stratify at all because random sampling should give you a pretty good match between the means and stdevs for your two variables in the test and control groups.
If this approach is not acceptable, you may want to create categorical bins (tertile or quartile of YTD and QTD) and stratify on these.

Thierry R. Sornasse

Dan_Obermiller · Jul 15, 2019 10:17 AM

JMP Pro is going to be the best way to do what you want. However, since you do NOT have JMP Pro, here is an approach that will get you pretty close. I will assume that you want 75% training, 25% validation, but you can easily revise this for other percentages.

Sort your data by your two stratification variables. Then, since you are looking at training and validation sets, take groups of 4 rows of data and randomly assign three of those rows to training, the other row to validation. This should give you fairly representative groupings according to your two stratification variables.

Dan Obermiller