Solved: Re: Creating a Sampling Distribution

tmfortney · Apr 23, 2019 12:38 PM

I would like to create a true sampling distribution from my dataset and I am not sure what formula(s) to use in JMP to create it. I have a population dataset of 10,000 and would like to create a sampling distribution with n=100, which would then result in 100 sample means. What formula do I use to create this in JMP?

Mark_Bailey · Apr 24, 2019 10:18 AM

You are correct, there is no analog to the choices for multiple comparisons as found in the Oneway platform.

The Wilcoxon test is an omnibus indicator of any difference. It is not specific to one parameter like the mean or variance. The plot at the top can help there, though. Parallel lines have the same variance or scale. Displaced lines have different mean or location. So if one agent is consistently completing their calls more quickly, their curve would shift to the left.

You also have the parameter point estimates and confidence intervals for each group (agent) for comparison, although that information is not the same as a multiple comparison test.

You can also use the profilers to extract information about each group. These answers are provided both as a point estimate and interval estimate.

I am not apologizing but simply recognizing that the methodology here comes from the reliability engineering field. The same methods were independently discovered in the study of medical mortality and morbidity. The terminology, therefore, pertains to those fields, but the methods are nonetheless relevant. It just requires a bit of translation. Sometimes it also requires reversing the goals. In reliability, an increasing hazard function is bad. In your case, though, it is good. It means that an event is more likely to happen. But in your case an event is not a failure, it is a completed call.

There are analogous methods for regression models with time to event data. So if you had covariates, you could include them in the model for lifetime and test them. There is a lot of flexibility here.


Mark_Bailey · Apr 23, 2019 01:03 PM

First of all, a sampling distribution of what? We usually think of a sampling distribution with respect to an estimate, like the sample average or a slope. Is that the kind of thing you are after? If so, what estimate?

Second, the sampling distribution assumes sampling with replacement. That operation would be difficult using column formulae.

Third, it usually takes many draws to get a good sense of the sampling distribution. Again, this iteration would be difficult using column formulae.

Fourth, do you have JMP Pro? There is a built-in bootstrapping feature that will re-sample your data to obtain new estimates of the statistic. I can't remember if you can specify the size of the draw or if it always uses the original sample size.

I think that a script might be necessary.
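As a rough illustration of what such a script might look like, here is a minimal sketch that draws with replacement and collects the sample means. The population here is a hypothetical stand-in (10,000 skewed values generated with Random Exp()), not the call center data, and the names pop, nDraws, n, and means are chosen only for this example. With real data, pop could instead be read from a column with Get Values.

```jsl
Names Default To Here( 1 );

// Hypothetical stand-in population: 10,000 right-skewed values.
// With a real table, something like pop = dt:Y << Get Values; would replace this.
pop = J( 10000, 1, Random Exp() * 300 );
nPop = N Rows( pop );

nDraws = 100; // number of sample means to collect
n = 100;      // size of each sample

means = J( nDraws, 1, . );
For( i = 1, i <= nDraws, i++,
	// Random Integer( nPop ) returns an integer from 1 to nPop, so a row
	// can be picked more than once (sampling with replacement).
	idx = J( n, 1, Random Integer( nPop ) );
	means[i] = Mean( pop[idx] );
);

// Put the means in a table and look at their distribution.
dtMeans = New Table( "Sample Means",
	New Column( "Sample Mean", Numeric, Continuous, Set Values( means ) )
);
dtMeans << Distribution( Column( :Sample Mean ) );
```

Working on a matrix rather than on column formulae keeps both the replacement step and the iteration inside one short loop.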

tmfortney · Apr 23, 2019 01:11 PM

Thanks for the response. So, I have population data from my call center regarding the length of phone calls over the course of a month and the data is not normally distributed. My thought was to perform random sampling of n=100 per sample, which would help create a normal distribution. I know there are other ways to do this but I wanted to understand how to do this in JMP. I do not have JMP Pro so this may not be possible based on your text below. Does this make more sense what I am asking?

Mark_Bailey · Apr 23, 2019 01:13 PM

I just read your original post again. You want to compute a sample mean of Y, whatever that data happens to be? Just calculate it. The sample average is still the unbiased estimator for the population mean.

How will the sampling distribution help you? What would you do with all the estimates from many sub-samples of N = 100?

Also, what are you trying to do such that a normal distribution is assumed?

tmfortney · Apr 23, 2019 01:29 PM

Well, I guess I need to give some additional background on this. What I would like to do is look at the different call center agents and perform an ANOVA test to determine whether there are statistically significant differences in the average call length between agents. However, when you look at the call-length data for each agent, the datasets are not normally distributed and I am violating homogeneity of variance when I compare the sets. Based on the central limit theorem, if I created a sampling distribution with n=100 from the approximately 10,000 calls each agent handled that month, it should then be more normally distributed. I also considered using Welch's t-test for unequal variances, but I was not sure if this would fully address the issue. I know there are other ways to normalize the data, but I also thought it would be useful to know how to create sampling distributions in general using JMP. I may be going about this the wrong way, so any additional advice would be great!

txnelson · Apr 23, 2019 02:34 PM

Here is one way to create the sample means. If you saved and concatenated the tables, rather than deleting them, you could actually use the samples to do analyses on.

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/semiconductor capability.jmp" );

dtMeans = New Table( "Final", New Column( "Sample" ), New Column( "NPN1 Mean" ) );
dtMeans << add rows( 100 );

For( i = 1, i <= 100, i++,
	dt2 = dt << Subset( invisible, Sample Size( 100 ), columns( :NPN1 ) );

	dtMeans:Sample[i] = i;
	dtMeans:NPN1 Mean[i] = Col Mean( dt2:npn1 );

	Close( dt2, nosave );

);

Jim
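The idea of saving and concatenating the samples, rather than deleting them, could be sketched roughly as follows. This is only one possible arrangement: the table name "All Samples" and the variable dtAll are hypothetical, and the sketch assumes the same sample data file and NPN1 column used in the script above.

```jsl
Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/semiconductor capability.jmp" );

// Hypothetical accumulator table that will hold every sampled row.
dtAll = New Table( "All Samples", New Column( "Sample" ), New Column( "NPN1" ) );

For( i = 1, i <= 100, i++,
	dt2 = dt << Subset( invisible, Sample Size( 100 ), columns( :NPN1 ) );
	// Tag each row with the number of the sample it came from.
	dt2 << New Column( "Sample", Numeric, Set Each Value( i ) );
	// Append this sample to the accumulator, then discard the subset.
	dtAll << Concatenate( dt2, Append to first table );
	Close( dt2, nosave );
);

// The sample means can then be recovered from the stacked samples.
dtAll << Summary( Group( :Sample ), Mean( :NPN1 ) );
```

Keeping the raw samples this way means other statistics, such as a standard deviation per sample, can be computed later without re-drawing.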

nil · Jan 13, 2020 08:40 AM

Hi txnelson, thanks for the script on sample means. I am new to JSL and need your help correcting the script I tried on my data. The data table and script are included for reference. I followed the script you posted (based on my limited understanding), but the final table came out blank. Please suggest a fix.
I need a final table with 100 sample means, each from a sample of 3, drawn from the original data table.

Data Table (Table Name: BPN) Contains BPN content value (in %) from 60 individual units/ vials.
BPN
103.2
100.3
99.2
103.6
102.1
100.6
100.1
99.6
99
102.6
101.2
97.6
102.1
100
102.4
101.7
105.3
103.7
99.1
103.1
99.3
99.2
98.5
101.7
106.3
97.8
104.6
102.9
103.1
101.8
101.8
99.9
102
100.6
106.6
100.6
104.7
98.4
95.6
103.2
103.5
101
102.8
100.5
97
104
99.8
103
103.4
101.6
100.6
101.9
101.2
99.7
101.1
101.1
99.2
104.9
103.4
106.4

Script: (with data table open, I tried with script below)

dtMeans = New Table( "BPN Mean", New Column( "Sample" ), New Column( "BPN 3 Sample Mean" ) );
dtMeans << add rows( 100 );

For( i = 1, i <= 60, i++,
dt2 = dt << Subset( invisible, Sample Size( 3), columns( : BPN ) );

dtMeans:Sample[i] = i;
dtMeans:BPN 3 Sample Mean[i] = Col Mean( dt2:BPN );

Close( dt2, nosave );

);

Thanks!!

txnelson · Jan 14, 2020 07:30 AM

The Subset platform has a built-in random sampling capability. Below is a simple script that uses Tables > Subset to generate a random sample data table with 3 rows. It is in a For() loop, so you can generate as many random samples as you want.

dt = New Table( "test",
	New Column( "BPN",
		values(
			{103.2, 100.3, 99.2, 103.6, 102.1, 100.6, 100.1, 99.6, 99, 102.6,
			101.2, 97.6, 102.1, 100, 102.4, 101.7, 105.3, 103.7, 99.1, 103.1,
			99.3, 99.2, 98.5, 101.7, 106.3, 97.8, 104.6, 102.9, 103.1, 101.8,
			101.8, 99.9, 102, 100.6, 106.6, 100.6, 104.7, 98.4, 95.6, 103.2,
			103.5, 101, 102.8, 100.5, 97, 104, 99.8, 103, 103.4, 101.6, 100.6,
			101.9, 101.2, 99.7, 101.1, 101.1, 99.2, 104.9, 103.4, 106.4}
		)
	)
);

For( i = 1, i <= 10, i++, // change this line to produce as many subsets as you want
	dt << Subset( Sample Size( 3 ), Selected columns only( 0 ) )
);

Jim

nil · Jan 15, 2020 11:58 PM

Thanks txnelson. The script output provides 10 random samples, each of sample size 3. As suggested, one can change the number of samples and/or the sample size. However, the script only does the re-sampling (a random subset of size 3, 10 times).
What I actually need is a script that produces a data table with the mean of each sample of size 3, arranged in a single column for, let's say, 10 random samples, which I would then use to understand the probability of observing different mean values.

txnelson · Jan 16, 2020 12:12 AM

Below is a simple modification of the script I previously responded with. The one below is just one of the ways to solve this. If you are going to play in the world of JMP Scripting, you need to read the JMP Scripting Guide, so you can start down the path of learning JSL.

Names Default To Here( 1 );

dt = New Table( "test",
	New Column( "BPN",
		values(
			{103.2, 100.3, 99.2, 103.6, 102.1, 100.6, 100.1, 99.6, 99, 102.6, 101.2, 97.6, 102.1,
			100, 102.4, 101.7, 105.3, 103.7, 99.1, 103.1, 99.3, 99.2, 98.5, 101.7, 106.3, 97.8,
			104.6, 102.9, 103.1, 101.8, 101.8, 99.9, 102, 100.6, 106.6, 100.6, 104.7, 98.4, 95.6,
			103.2, 103.5, 101, 102.8, 100.5, 97, 104, 99.8, 103, 103.4, 101.6, 100.6, 101.9,
			101.2, 99.7, 101.1, 101.1, 99.2, 104.9, 103.4, 106.4}
		)
	)
);

dtSamples = New Table( "Samples", New Column( "Sample Means" ) );

For( i = 1, i <= 10, i++, // change this line to produce as many subsets as you want
	dt2 = dt << Subset( invisible, Sample Size( 3 ), Selected columns only( 0 ) );
	dtSamples << add Rows( 1 );
	dtSamples:Sample Means[N Rows( dtSamples )] = Col Mean( dt2:BPN );
	Close( dt2, nosave );
);

Jim
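For comparison, the same sample means could also be computed on a matrix, without opening and closing a subset table on each pass. This is a sketch under assumptions, not a replacement for the script above: it assumes the "test" table with the BPN column created by that script is still open, and the names bpn, nSamples, n, and means are chosen only for this example.

```jsl
Names Default To Here( 1 );

// Assumes the "test" table from the script above is open.
dt = Data Table( "test" );
bpn = dt:BPN << Get Values; // BPN column as a 60 x 1 matrix

nSamples = 10; // number of sample means to produce
n = 3;         // size of each sample

means = J( nSamples, 1, . );
For( i = 1, i <= nSamples, i++,
	// Random Index returns n distinct row numbers between 1 and N Rows( bpn ),
	// so each sample is drawn without replacement.
	idx = Random Index( N Rows( bpn ), n );
	means[i] = Mean( bpn[idx] );
);

New Table( "Samples (matrix version)",
	New Column( "Sample Means", Numeric, Continuous, Set Values( means ) )
);
```

Changing nSamples or n adjusts how many means are produced and how many values go into each one.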