Solved: Subset creation keeping original distribution

bbenny7 · Feb 15, 2024 10:27 AM

If I have a data table with a continuous variable and I want to create several subsets that keep the same distribution as the original table, how can I do it?

I have JMP Pro and I have tried the Stratify option in the Subset menu, but I could not figure out how to do it.

dlehman1 · Feb 15, 2024 11:29 AM

You could create a validation column (under Predictive Modeling) stratified as you want and then use those subsets. I think you are limited, at least initially, to 3 subsets (training, validation, and test), but you could do that multiple times if you need more. I've never used the Stratify option in the Subset menu; it appears to work, but it creates only one subset at a time.

bbenny7 · Feb 19, 2024 03:39 AM

Thanks for your answer.

I have realized that if more than 3 subsets are needed, you can use "Make K-fold Validation Column" in Predictive Modeling --> Make Validation Column.

txnelson · Feb 15, 2024 12:11 PM

I think all you need to do is use the random sampling capability in the Tables > Subset platform. Random sampling by either rate or size will give you data tables with the same distribution as the original table.

Jim
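
For anyone who prefers to script the random-sampling approach above, here is a minimal sketch. It assumes the original table is the active data table, and the subset size (300), the number of subsets (4), and the output table names are illustrative values only. The Subset arguments are written as I recall them from the Scripting Index, so please verify them in your JMP version.

```jsl
Names Default To Here( 1 );

dt = Current Data Table(); // assumption: the original table is the active table

// Draw 4 random subsets of 300 rows each.
// Rows are sampled without replacement within each subset.
For( i = 1, i <= 4, i++,
	dt << Subset(
		Sample Size( 300 ),          // use Sampling Rate( 0.25 ) to sample by rate instead
		Selected Columns Only( 0 ),  // keep all columns
		Output Table( "Subset " || Char( i ) )
	)
);
```

Each pass through the loop draws a fresh random sample, so the same row can appear in more than one subset.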

Mark_Bailey · Feb 15, 2024 01:18 PM

Strictly speaking, sampling with replacement is required if you want the same distribution as the original. That is the basis for resampling and bootstrap methods. The Subset command performs random sampling without replacement. You need a script, I think, to get what you need. The sampling would be based on computing random subscripts like this:

subscript = J( sample size, 1, Random Integer( 1, N Row( dt ) ) );

This vector could be used in the creation of a data column in a new data table.

New Table( "Sample", New Column( "Data", Value( dt:originalColumn[subscript] ) ) );

SDF1 · Feb 15, 2024 02:52 PM

Hi @bbenny7 ,

I found this topic interesting for a similar reason of wanting to correctly subset a data table, but to stratify it on a column. I've done it in the past as @dlehman1 has suggested using a validation column to stratify on column(s) of interest, but also was curious how to do it a different way in case multiple data tables were needed. I tried the way that @Mark_Bailey suggested, but found that I needed to split the JSL code for the New Table() into two lines, one defining the new table, and the next assigning the values based on the other data table column of interest. I couldn't get it to work the way his original code was laid out. Here's how a modified code worked for me:

Names Default To Here( 1 );

dt = Data Table( "originaldatatable" ); //assigns the original data table to the variable dt

samplesize = 300; //number of rows to draw (300 is the value I used in my test)

subscript = J( samplesize, 1, Random Integer( 1, N Rows( dt ) ) ); //creates the random integer vector of length 'samplesize'

dt2 = New Table( "Sample", New Column( "Data" ) ); //creates new data table with column Data

dt2:Data << Set Values( dt:originalColumn[subscript] ); // assigns values to Data based on the row entries for the originalColumn

As a fun little test, I generated 4 subsets by making a For() loop and putting the subscript line in it (to generate a new set of row numbers) and compared the distributions for the 4 sets, and their summary statistics are all very similar.
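
A sketch of what such a loop could look like is below. It is not the exact script used for the test above. It assumes the original table is the active table, that originalColumn is a numeric column, and that 4 subsets of 300 rows are wanted; the table and column names "Samples" and "Sample 1" through "Sample 4" are made up for illustration.

```jsl
Names Default To Here( 1 );

dt = Current Data Table(); // assumption: the original table is the active table
samplesize = 300;          // rows per subset (illustrative)
nSubsets = 4;              // number of subsets (illustrative)

dt2 = New Table( "Samples" ); // one column per subset will be added below

For( i = 1, i <= nSubsets, i++,
	// a new set of random row numbers on each pass (sampling with replacement)
	subscript = J( samplesize, 1, Random Integer( 1, N Rows( dt ) ) );
	vals = dt:originalColumn[subscript];
	dt2 << New Column( "Sample " || Char( i ), Numeric, Continuous, Set Values( vals ) );
	// print quick summary statistics to the log for comparison
	Show( i, Mean( vals ), Std Dev( vals ) );
);
```

Putting the subsets side by side as columns in one table makes it easy to launch Distribution on all of them at once and compare the histograms.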

I did a similar test but created 4 stratified validation columns and then looked at their statistics. The N is different because the Make Validation Column platform wouldn't generate the same ratios that I did above, where I chose 300 just randomly. Anyway, the results are all very similar.

Either way should work and get you where you want to go.

Good luck!

DS

peng_liu · Feb 18, 2024 09:15 AM

Because you have JMP Pro and you mentioned Stratify, here is another option: "Make K-Fold Columns" from the XGBoost Add-In for JMP Pro (https://community.jmp.com/t5/JMP-Add-Ins/XGBoost-Add-In-for-JMP-Pro/ta-p/319383). See pages 3, 4, and 5 of the "XGBoost Add-in.pdf" file.

The function creates subsets (the add-in calls them folds) that keep the original distribution (the add-in calls this balanced), while respecting stratification by keeping the stratification variables balanced as well. See the histograms on page 5 of that PDF document.

bbenny7 · Feb 19, 2024 03:37 AM

Thanks for your answer.

I have realized that I don't need the add-in; I can use "Make K-fold Validation Column" in Predictive Modeling --> Make Validation Column.
