Solved: Sub-Sample randomly from a column

Report Inappropriate Content · Jun 10, 2023 4:56 PM

Hello all, - I have a population of 200 stores across the country. I'm interested to know if I can sub-sample a fraction of these stores and approximate the distribution of events. Might there be some Monte Carlo method for filling e.g., 30 rows in a new column with a random sub-sample from the original 200.

Then I could compare the distribution of the 30 counts in the sub-sample to the distribution in the population of 200.

Repeat this say 100 times to determine the risk of sampling 30 stores to represent the population of 200.

Thoughts? thanks, Paul

First 10 rows of the 200

Store# - column 1

Even count - column 2

19
20
21
22
23
25
26
27
29
34

1
4
7
3
4
3
12
5
8
8

SamGardner · Oct 28, 2022 03:27 PM

I quick way to do a subsample using a column formula is the use the Col Shuffle() function. The function returns a random reordering of row numbers. If you choose rows where the values of Col Shuffle() are less than than or equal to the subsample size (n_sub), this equivalent to doing a random ordering of the column and selecting the first n_sub rows.

For example, adding new column with this formula will do this for a subsample size of 30.

example table attached.

If you want to do this repeatedly, then learning some JMP Scripting would make this easier. A basic script to add columns like this repeatedly is

Names Default to Here( 1 );

Number_of_Cols_to_Add = 100;

For( i = 1, i <= Number_of_Cols_to_Add, i++, 
	
	Current Data Table() << 
	New Column( "Y Subsample (Size 30)",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( If( Col Shuffle() <= 30, :Y, . ) )
	)
);

View solution in original post

Jeff_Perkinson · Oct 27, 2022 06:44 PM

Take a look at the Resample Freq() function.

Resample Freq

Generates a random selection with replacement frequency counts, suitable for use in bootstrapping. For example, it supports a second Freq Column argument, enabling it to do bootstrap samples relating to a pre-existing frequency column specified in the second argument. Resample Freq() generates a 100% resample. ResampleFreq(rate) generates a rate frequency sample. Resample(rate, column) generates a sample that is calculated by the rate multiplied by the sum of the specified column. For more information about the syntax for the Resample Freq function, see Discrete Probability Functions in the JSL Syntax Reference.

-Jeff

paulp · Oct 28, 2022 02:58 PM

Hello Jeff, thank you for the reply. I am not understanding how to use the function for my purpose, unfortunately.

Resample Freq creates counts of re-samples. This seems it would be an intermediate step to then creating a new table with the sub-sampled data. I am hoping to directly create a column of sub-sampled data from the original population of 200 rows.
I would prefer to sample without replacement, not bootstrap.

If I only needed the original data sub-sampled once, the easiest path would be Table > Subset > Random Sample Size = 30 (e.g.)

I don't script, otherwise I would create a loop to produce 100 tables of 30 random samples, then concatenate the 100 tables into a master table.

Is there a way to directly grab 30 random samples from a column, and put those 30 values into a new column?

Ultimately, I hope to try the Reliability > Life distribution > Compare Groups to assess the distribution of the subsamples vs. original population distribution.

Then repeat for Sample Size = 20, then 40, etc. so I can assess the minimum sample size that gives a similar distribution to the population of 200 stores.

Thanks, Paul

txnelson · Oct 28, 2022 03:59 PM

Here is an example script that creates subsets of a given size without replacement

Names Default To Here( 1 );
dt = 
// Open Data Table: semiconductor capability.jmp
// → Data Table( "semiconductor capability" )
Open( "$SAMPLE_DATA/semiconductor capability.jmp" );

// New formula column: Row
Data Table( "semiconductor capability" ) << New Column( "Row",
	Numeric,
	"Continuous",
	Format( "Best", 12 ),
	Formula( Row() )
) << Move Selected Columns( {:Row}, after( :SITE ) );

// Loop through data creating sample data tables
sampleSize = 20;

For( i = 1, i <= 20, i++,
	dtSample = Data Table( "semiconductor capability" ) << Subset(
		Sample Size( sampleSize ),
		Selected columns only( 0 ),
		output table( "Sample " || Char( i ) )
	);
	
	matSelected = dtSample:Row << get values;
	
	// Delete row column from sample data
	dtSample << delete columns( Row );

// Delete selected rows from main table
	dt << select rows( matSelected );
	dt << delete rows;
);

// Delete row column from original data table
dt << delete columns( Row );

Jim

SamGardner · Oct 28, 2022 03:27 PM

I quick way to do a subsample using a column formula is the use the Col Shuffle() function. The function returns a random reordering of row numbers. If you choose rows where the values of Col Shuffle() are less than than or equal to the subsample size (n_sub), this equivalent to doing a random ordering of the column and selecting the first n_sub rows.

For example, adding new column with this formula will do this for a subsample size of 30.

example table attached.

If you want to do this repeatedly, then learning some JMP Scripting would make this easier. A basic script to add columns like this repeatedly is

Names Default to Here( 1 );

Number_of_Cols_to_Add = 100;

For( i = 1, i <= Number_of_Cols_to_Add, i++, 
	
	Current Data Table() << 
	New Column( "Y Subsample (Size 30)",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( If( Col Shuffle() <= 30, :Y, . ) )
	)
);

paulp · Nov 1, 2022 06:06 PM

Thank you, Sam, and to everyone who posted solutions.

Mark_Bailey · Oct 29, 2022 11:05 AM

I understand that you are not scripting your solution. You can also interactively use Tables > Subset to create a new smaller table with randomly selected rows from the original data table. Would that way work for you?

ron_horne · Nov 2, 2022 05:19 AM

Hi @paulp ,
the other non-scripting solution that can be useful in such applications is using JMP query builder under the Tables menu.

there you can find a random sampling tab as in the picture

As usual one you run it interactively you can make use of the script created for additional usage.

paulp · Nov 4, 2022 12:14 PM

thanks, Ron

Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Resample Freq

Re: Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Re: Sub-Sample randomly from a column

Recommended Articles

Get Going with JMP: Essentials for Using JMP

Multiple-Group Analysis in Structural Equation Modeling

Getting Started with JMP: On Demand Course