bbenny7
Level II

Subset creation keeping original distribution

If I have a data table with a continuous variable and I want to create several subsets that keep the same distribution as the original table, how can I do it?

I have JMP Pro and I have tried the Stratify option in the Subset menu, but I could not figure out how to do it.

1 ACCEPTED SOLUTION

peng_liu
Staff

Re: Subset creation keeping original distribution

Because you have JMP Pro and you mentioned Stratify, here is another option: "Make K-Fold Columns" from the XGBoost Add-In for JMP Pro (https://community.jmp.com/t5/JMP-Add-Ins/XGBoost-Add-In-for-JMP-Pro/ta-p/319383). See pages 3 to 5 of the "XGBoost Add-in.pdf" file.

peng_liu_0-1708265326457.png

The function creates subsets (the add-in calls them folds) that keep the original distribution (the add-in calls this balanced), while also respecting stratification by keeping the stratification variables balanced. See the histograms on page 5 of that PDF document.


7 REPLIES
dlehman1
Level IV

Re: Subset creation keeping original distribution

You could create a validation column (under Predictive Modeling), stratified as you want, and then use those subsets. I think you are limited, at least initially, to 3 subsets (training, validation, and test), but you could do that multiple times if you need more. I've never used the Stratify option in the Subset menu; it appears to work, but it only creates one subset at a time.

bbenny7
Level II

Re: Subset creation keeping original distribution

Thanks for your answer.

I have realized that if more than 3 subsets are needed, you can use "Make K-Fold Validation Column" in Predictive Modeling --> Make Validation Column.

txnelson
Super User

Re: Subset creation keeping original distribution

I think all you need to do is use the random sampling capability in the Tables => Subset platform. Random sampling by either rate or size will give you data tables with approximately the same distribution as the original table.
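
For reference, here is a minimal JSL sketch of that random sampling, assuming the table of interest is the current data table. The 0.25 rate and the size of 300 are placeholder values, and the option names follow what the Subset dialog normally writes to its saved script, so check them against your JMP version.

```jsl
Names Default To Here( 1 );
dt = Current Data Table(); // assumption: the original table is the active one

// Random subset by rate: about 25% of the rows, sampled without replacement
dtRate = dt << Subset( Sampling Rate( 0.25 ), Selected columns only( 0 ) );

// Random subset by size: 300 rows, sampled without replacement
dtSize = dt << Subset( Sample Size( 300 ), Selected columns only( 0 ) );
```

Running either line several times gives several independent subsets, though the same row can appear in more than one of them.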

Jim

Re: Subset creation keeping original distribution

Strictly speaking, sampling with replacement is required if you want the same distribution as the original. That is the basis for resampling and bootstrap methods. The Subset command performs random sampling without replacement. You need a script, I think, to get what you need. The sampling would be based on computing random subscripts like this:

subscript = J( samplesize, 1, Random Integer( 1, N Rows( dt ) ) );

This vector could be used in the creation of a data column in a new data table.

New Table( "Sample", New Column( "Data", Numeric, Continuous, Set Values( dt:originalColumn[subscript] ) ) );

SDF1
Super User

Re: Subset creation keeping original distribution

Hi @bbenny7 ,

 

I found this topic interesting because I also wanted to subset a data table correctly while stratifying on a column. In the past I have done it as @dlehman1 suggested, using a validation column to stratify on the column(s) of interest, but I was curious how to do it a different way in case multiple data tables were needed. I tried the approach that @Mark_Bailey suggested, but found that I needed to split the JSL for New Table() into two statements: one defining the new table, and the next assigning the values from the column of interest in the other data table. I couldn't get it to work the way his original code was laid out. Here is the modified code that worked for me:

Names Default To Here( 1 );

dt = Data Table( "originaldatatable" ); // assigns the original data table to the variable dt
samplesize = 300; // number of rows to draw (300 is the arbitrary value used in the test below)

subscript = J( samplesize, 1, Random Integer( 1, N Rows( dt ) ) ); // creates a random row-number vector of length samplesize

dt2 = New Table( "Sample", New Column( "Data" ) ); // creates a new data table with the column Data

dt2:Data << Set Values( dt:originalColumn[subscript] ); // fills Data with the originalColumn values at the sampled rows

As a fun little test, I generated 4 subsets by putting the subscript line inside a For() loop (to generate a new set of row numbers each time) and compared the distributions of the 4 sets. Their summary statistics are all very similar.

SDF1_0-1708025111456.png
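
The loop described above might look roughly like the following. This is a sketch rather than the exact script used for the test: the table names, the nSubsets variable, and the use of the current data table are assumptions, and originalColumn stands in for the real column name.

```jsl
Names Default To Here( 1 );
dt = Current Data Table(); // assumption: the original table is the active one
samplesize = 300;          // rows per subset
nSubsets = 4;              // number of subsets to draw

For( i = 1, i <= nSubsets, i++,
	// New random row numbers on every pass, drawn with replacement
	subscript = J( samplesize, 1, Random Integer( 1, N Rows( dt ) ) );
	dtSub = New Table( "Sample " || Char( i ), New Column( "Data", Numeric, Continuous ) );
	dtSub:Data << Set Values( dt:originalColumn[subscript] );
);
```

Each pass leaves a new table named "Sample 1", "Sample 2", and so on, which can then be compared in the Distribution platform.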

  I did a similar test, but created 4 stratified validation columns and then looked at their statistics. The N is different because the Make Validation Column platform wouldn't generate the same ratios as above, where I chose 300 arbitrarily. Anyway, the results are all very similar.

SDF1_1-1708026727602.png

  Either way should work and get you where you want to go.

 

Good luck!

DS


bbenny7
Level II

Re: Subset creation keeping original distribution

Thanks for your answer.

I have realized that I don't need the add-in; I can use "Make K-Fold Validation Column" in Predictive Modeling --> Make Validation Column.
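
For anyone who wants to see what a fold column amounts to, here is a rough hand-rolled JSL sketch. It is not the platform's algorithm: it assigns rows to folds purely at random, with no stratification, and the column name "Fold" and the value k = 5 are just illustrative choices.

```jsl
Names Default To Here( 1 );
dt = Current Data Table(); // assumption: the original table is the active one
k = 5;                     // hypothetical number of folds
n = N Rows( dt );

order = Random Shuffle( (1 :: n)` ); // row numbers in random order
fold = J( n, 1, 0 );
For( i = 1, i <= n, i++,
	fold[order[i]] = Mod( i - 1, k ) + 1 // deal the shuffled rows into folds 1..k in turn
);
dt << New Column( "Fold", Numeric, Nominal, Set Values( fold ) );
```

Because every row lands in exactly one fold, the folds are disjoint and differ in size by at most one row. With enough rows, each fold's distribution should resemble the original, but only the stratified options discussed above try to enforce that.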