Looking for a sample size calculator for defect proportions that can help me optimize the sample size with a minimum proportion to detect and the CIs for computed proportions.

students_t · Jun 10, 2023 1:43 PM

Use case:

35,000 defects are detected by a machine on one wafer. An operator samples 1500 images and sorts them into 10 categories. The percentages of the defect categories are meant to reflect the defects in the population of 35,000. In this case 1500 images were used, but what happens when 5000 images are used?

What does that do to the confidence intervals on the proportions? As the sample size gets smaller, some of the low proportions will not be detectable. How does one determine the lowest detectable proportion for a given sample size?

Any ideas whether there are any calculators or tools that can help with this?

ih · Feb 26, 2021 1:31 PM

You might be able to replicate this experiment within JMP. If so, and if you have JMP Pro, you could do a Monte Carlo simulation to evaluate different sample sizes and the rate of detection for each. This would be an alternative to purely statistical methods, which I'm sure exist.

Consider a 35,000 row table (one row per defect) with:

- A column of random data representing the actual category for each defect, weighted so that some categories are more likely than others, to represent your actual data. This part requires a lot of attention and thought: it needs to represent the distribution(s) your categories might take, and it might need to randomly choose from a variety of distributions.
- A sample size column, where the first row indicates how many rows to sample
- A column which 'selects' the prescribed number of rows
- A column with the 'detected' categories for each of the sampled rows
- A column to check whether the most common actual category matches the most common detected category (only use the first row of the column)

Use the Distribution platform and include both the first column (actual defect category) and the last column (whether the correct defect category was identified). Under the frequencies for the correct category column, you should see a single value. Now for the JMP Pro magic: right-click on a value in the table and select Simulate. Swap out the Defect Actual Category column for itself. When it runs, a new table will show up in which each row represents one recalculation of your original table, each time with a fresh set of defects to be analyzed. Record the fraction of 'Yes' values, and then repeat with different sample sizes. Then you can chart the fraction of correct diagnoses against the sample size.

Remember, all of this depends on that first bullet: how well the simulated categories represent your actual results.
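For readers without JMP Pro, the experiment described above can also be sketched outside JMP. What follows is a minimal Python sketch, not JSL and not the script from this thread. It assumes nine categories with the first one twice as likely as each of the others, a fresh 35,000-defect population on every run, and sampling without replacement. The names `simulate_once` and `detection_rate` are made up for illustration.

```python
import random
from collections import Counter


def simulate_once(rng, population_size, weights, sample_size):
    """One run: build a random population, sample it, compare the modes."""
    categories = list(range(1, len(weights) + 1))
    # Assumed population model: independent draws with fixed category weights.
    population = rng.choices(categories, weights=weights, k=population_size)
    # The operator's sample: rows picked at random without replacement.
    sample = rng.sample(population, sample_size)
    actual_mode = Counter(population).most_common(1)[0][0]
    sample_mode = Counter(sample).most_common(1)[0][0]
    return actual_mode == sample_mode


def detection_rate(sample_size, runs=1000, population_size=35_000,
                   weights=(2, 1, 1, 1, 1, 1, 1, 1, 1), seed=0):
    """Fraction of runs in which the sample's most common category
    matches the population's most common category."""
    rng = random.Random(seed)
    hits = sum(
        simulate_once(rng, population_size, weights, sample_size)
        for _ in range(runs)
    )
    return hits / runs


if __name__ == "__main__":
    # Sweep a few sample sizes; use more runs for smoother estimates.
    for n in (20, 100, 1500):
        print(n, detection_rate(n, runs=100))
```

As with the JMP version, the result is only as good as the assumed `weights`; swap in proportions that resemble your real wafers before trusting the numbers.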

Here is an example of what your simulation table might look like with a sample size of 20, assuming that one defect category is twice the size of each of the others:

Notice how clean the actual defect categories are here! I suspect this will not be so clear in your actual data, so you would need to use a different function in that column.

Here is the simulation results table, here using only 100 rows (I recommend using thousands):

You could record this in a table, and then repeat with other values to find your acceptable level of risk:

Here is a script to create the top table above:

New Table( "Monte Carlo Sample Size Example",
	Add Rows( 35000 ),
	New Script(
		"Distribution to Simulate",
		Distribution(
			Nominal Distribution( Column( :Defect Actual Category ) ),
			Nominal Distribution( Column( :Correct Category Identified ) )
		)
	),
	New Column( "Defect Actual Category",
		Numeric,
		"Ordinal",
		Format( "Best", 12 ),
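		// Floor( Random Uniform( 0, 10 ) ) gives 0..9; Max( ..., 1 ) folds 0 into 1,
		// so category 1 is twice as likely as each of categories 2..9
		Formula( Max( Floor( Random Uniform( 0, 10 ) ), 1 ) )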
	),
	New Column( "Sample Size",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( 20 )
	),
	New Column( "Sampled Rows Sorted",
		Numeric,
		"Nominal",
		Format( "Best", 12 ),
		Formula( If( Row() <= :Sample Size[1], 1, 0 ) )
	),
	New Column( "Sampled Rows Shuffled",
		Numeric,
		"Nominal",
		Format( "Best", 12 ),
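		// Col Shuffle() gives each row a random, non-repeating row number,
		// which scatters the 1s from the sorted column across random rows
		Formula( :Sampled Rows Sorted[Col Shuffle()] )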
	),
	New Column( "Sampled Defect Category",
		Numeric,
		"Ordinal",
		Format( "Best", 12 ),
		Formula( If( :Sampled Rows Shuffled == 1, :Defect Actual Category ) ),
		Set Display Width( 135 )
	),
	New Column( "Actual Defect Category",
		Numeric,
		"Nominal",
		Format( "Best", 12 ),
		Formula( If( Row() == 1, Mode( :Defect Actual Category << Get Values ) ) ),
		Set Display Width( 130 )
	),
	New Column( "Predicted Default Category",
		Numeric,
		"Nominal",
		Format( "Best", 12 ),
		Formula( If( Row() == 1, Mode( :Sampled Defect Category << Get Values ) ) )
	),
	New Column( "Correct Category Identified",
		Numeric,
		"Nominal",
		Format( "Best", 12 ),
		Formula(
			If( Row() == 1,
				:Actual Defect Category == :Predicted Default Category
			)
		),
		Value Labels( {0 = "No", 1 = "Yes"} ),
		Use Value Labels( 1 )
	)
);


If you have the computing power, you could skip a step and make the first 35k rows of your table use one sample size, the next 35k rows use another, and so on. Then you could simulate all of the sample sizes in one run and fill in the table in the last screenshot in a single pass.

Georg · Feb 27, 2021 01:16 PM

To boil it down a bit, you can focus on the defect category with the smallest proportion; every other category is easier to detect. If all ten categories are equal, this would be 0.1 (one tenth). If we assume that the proportion of the smallest category is, e.g., 0.01, you could use the Sample Size and Power calculator (under DOE --> Design Diagnostics).
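One way to make the "smallest proportion" idea concrete is to ask how likely the rarest category is to show up at all in the sample. The Python sketch below is a rough illustration, not anything JMP computes. It treats "detectable" as "appears at least once" (my assumption) and uses a binomial approximation. Real sampling from 35,000 images is without replacement, which makes detection slightly more likely, so these figures lean a little conservative. The function names are hypothetical.

```python
def prob_category_seen(p, n):
    """Chance that a category with true proportion p appears at least
    once among n sampled images (binomial approximation)."""
    return 1.0 - (1.0 - p) ** n


def smallest_detectable_proportion(n, confidence=0.95):
    """Smallest proportion p for which a sample of n images contains
    at least one such defect with the given probability."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    for n in (1500, 5000):
        p_min = smallest_detectable_proportion(n)
        print(f"n={n}: smallest proportion seen with 95% chance = {p_min:.5f}")
```

Under these assumptions a category needs to make up about 0.2% of the defects to be seen at least once in 1500 images with 95% probability, and about 0.06% for 5000 images. Seeing a category once is a much weaker requirement than estimating its proportion well.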

With 1500 samples and a power of 0.9 (a reasonable value), you could detect a proportion larger than about 2 % when compared against 1 %. With 5000 samples, this drops to about 1.5 %. Smaller proportions need larger sample sizes.
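The numbers above can be sanity-checked by hand. Here is a small Python sketch of the power of a one-sample proportion test using the normal approximation. It assumes a two-sided test at alpha = 0.05, which may not match the settings or the method the JMP calculator actually uses, so treat it as a rough cross-check only. The function name `power_one_sample_proportion` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sample_proportion(p0, p1, n, alpha=0.05):
    """Approximate power to reject H0: p = p0 when the true proportion
    is p1, using a two-sided normal-approximation test with n samples."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    se_null = sqrt(p0 * (1 - p0) / n)   # spread of p-hat under H0
    se_alt = sqrt(p1 * (1 - p1) / n)    # spread of p-hat under the truth
    upper = p0 + z * se_null            # rejection thresholds for p-hat
    lower = p0 - z * se_null
    reject_high = 1 - std_normal.cdf((upper - p1) / se_alt)
    reject_low = std_normal.cdf((lower - p1) / se_alt)
    return reject_high + reject_low


if __name__ == "__main__":
    print(power_one_sample_proportion(0.01, 0.02, 1500))
    print(power_one_sample_proportion(0.01, 0.015, 5000))
```

Under these assumptions both cases come out at roughly 0.9, in line with the figures quoted above.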

Please see the manual:

One Sample Proportion Calculator (jmp.com)

Georg

Kevin_Anderson · Feb 27, 2021 03:29 PM

Hi, students_t!

Great name, by the way!

As you may be aware, there is more than one way to calculate confidence intervals for proportions.

I believe the JMP Sample Size And Power calculators use a Normal approximation to the binomial. There exist exact methods and several other approximate methods as well. I think JMP uses Wilson score methods in their other platforms.

Attached, please find a reference from Agresti and Coull entitled "Approximate is Better than 'Exact' for Interval Estimation of Binomial Proportions" from The American Statistician in 1998. It basically makes the case that the guaranteed coverage of the intervals from the Exact method is overly conservative especially as you approach 0 or 1, and that some approximate methods are better than the Exact method and even other approximate methods.

Personally, I use the Wilson score method. But the Sample Size And Power calculator uses a different method, so it will be close but no cigar.
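For anyone who wants to see what the Wilson score interval does to the widths at 1500 versus 5000 images, here is a minimal Python sketch of the standard Wilson formula. It is not a statement of what any JMP platform computes internally, it ignores the finite population of 35,000, and the function name `wilson_interval` is just an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # A category observed in 1% of the sampled images, at both sample sizes.
    for count, n in ((15, 1500), (50, 5000)):
        low, high = wilson_interval(count, n)
        print(f"{count}/{n}: {low:.4f} to {high:.4f}")
```

For a category seen in 1% of the images, this gives roughly 0.6% to 1.6% with 1500 images and roughly 0.8% to 1.3% with 5000.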

So, to answer your question, it's necessary for you to define which form of confidence intervals you desire. The simulation idea from @ih is a great one.

Kevin_Anderson · Feb 27, 2021 03:38 PM

...and I wrote about the Sample Size and Power calculators based on knowledge from previous versions of JMP. I now note that JMP 15.2.1 allows one to choose from two Exact methods.
