Sample Size: More Than a Number - (2023-US-30MP-1483)

Karen Copeland, PhD

Boulder Statistics LLC

 

While the question "How many (parts/subjects/runs) do I need?" is one nearly every statistician dreads, it is an important question and should be asked prior to running any study or experiment. The answer seems simple enough. Just plug some numbers into a calculator and off you go! In my experience, though, sample size calculations are rarely that easy.

 

JMP 16 introduced an entire suite of Sample Size Explorers, with more added in JMP 17. But why call them "explorers" and not "calculators"? Because sample size is more than a calculation. It is an integral part of a study design, and to determine a sample size, more than math is needed. This presentation explores sample size from the concept to the execution. While the examples include sample size explorations for medical device or diagnostics studies, the lessons learned are applicable across industries.

 

 

What  we're  going  to  talk  about  today  is

a  simple  introduction to  sample  size  thinking.

Then we'll look at two examples: one comparing the means of two populations,

and the second looking at a study with a proportion endpoint,

and  we'll  wrap  up with  some  additional  thoughts.

A  question  I'm  often  asked  is, what  sample  size  do  I  need?

One  might  think,  "Oh,  that's  easy. Just  use  a  sample  size  calculator."

But  wait  a  second.

Why  does  JMP  call sample  size  calculators  explorers?

Why  are  they  in  the  DOE  menu? Which  one  do  I  use?

Well,  let's  talk  about some  sample  size  basics.

A  sample  size  is  calculated prior  to  running  a  study.

A  study  is  an  experiment designed  ahead  of  time.

That's  why  they're  in  the  DOE  menu.

Sample  size  depends on  the  goal  of  a  study.

I  often  call  this,  are  you  making  a  $5 decision  or  a  $50  million  decision?

Are  you  looking  at  a  regulatory  clearance,

a  publication,  an  R&D  question, or  a  simple  exploration?

What's  the  primary  endpoint  of  your  study? What  are  you  trying  to  show?

What is your study design? What are your outcome assumptions?

These  might  be  based  on  prior  knowledge,

pilot data, or, often, just simply guessing.

Sample  size  is  a  risk-benefit  exploration.

That's  why  they're  called sample  size  explorers.

You  want  to  explore

how  different  assumptions are  going  to  impact  your  sample  size.

Now,  more  is  generally  better,

but  as  we  all  know,  more  costs  more, and  more  might  not  be  possible.

Let's  start  with  a  simple  example

of  sizing  the  study for  comparing  two  means.

We'll use the Fit Y by X platform,

and  we'll  look  at  the  Power  Explorer for  two  independent  sample  means.

This  sample  size  example  is  based  on

a  real  situation  where a  company  is  in  the  R&D  phase.

They're  doing  a  sample  collection  study. That  could  be  blood,  nasal  swabs,  saliva.

There's  no  primary  endpoint because  it's  an  R&D  study.

They're still in the R&D phase, but they need a power analysis.

They were asked for a power analysis by the entity that is considering

funding  the  project.

How can we provide a power analysis without a primary endpoint?

Well, the best thought here is: one, we could say, "Hey, we can't do a power analysis,"

or, knowing that the funding entity wants a power analysis

to show that we've thought about the study and about how many people

we're asking them to enroll, we could generate a research endpoint.

In  that  case,  we're  going  to  ask, "Can  I  distinguish  the  difference  in  means

between  my  sick  and  healthy  subjects for  some  primary  biological  markers?"

We'll  use  the  sample  size from  the  power  analysis

and  the  expected  prevalence of  illness  to  justify  the  number

of subjects we're requesting to enroll in the study.

I need to understand the test for comparing two independent means,

and I need a calculator for the power of a test comparing two independent means.

What  I  like  to  ask  myself  is if  I  had  data,  what  would  I  do?

If  I  understand  what  analysis I'm  going  to  do,

that's  going  to  help  me  determine what  sample  size  I  need.

Sometimes  you'll  have  pilot  data,

and  sometimes  you  can  just  make  up  data to  help  you  figure  out

what analysis you're going to do and what sample size calculation you should do.

Let's  take  a  look  at  this.

I'm  going  to  open  a  data  table, and  this  is  just  generated  data.

I have 15 sick patients and 15 healthy patients.

I'm  going  to  do  a  Fit  Y  by  X.

I'll do a couple of things here: I'm going to jitter my points,

I'm  going  to  run  a  T-test, and  I  like  to  look  at  the  densities.

Here's  two  examples of  what  some  data  might  look  like.

On the left are two fairly separated populations

of outcomes for biomarker number one.

The difference is about 2 to 2.5.

These  were  generated from  a  normal  distribution

as  were  the  ones  on  the  right-hand  side.

Here,  the  difference  is  a  little  less.

You  can  see  in  both  places,

we  would  conclude  that  there's  a difference  between  these  two  populations.

The  one  on  the  right  being  closer  together

is  harder  to  differentiate than  the  one  on  the  left.

We  used  a  T-test  for  that.
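If you don't have a data table handy, you can sketch one in a few lines of code, in the spirit of "if I had data, what would I do?" This is a minimal stand-in for the generated table in the demo, assuming normal distributions with made-up means and standard deviations:

```python
# Made-up data: 15 sick and 15 healthy patients drawn from normal
# distributions (parameters are assumptions chosen to resemble the
# left-hand plot), followed by the kind of t test Fit Y by X reports.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sick = rng.normal(loc=2.5, scale=1.0, size=15)     # biomarker 1, sick group
healthy = rng.normal(loc=0.0, scale=1.0, size=15)  # biomarker 1, healthy group
print(stats.ttest_ind(sick, healthy, equal_var=False))  # Welch t test
```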

Now  the  question  becomes,

how  many  samples  would  I  need

if  I'm  going  to  run  this  experiment?

Again,  let's  look  at  that.

Let  me  just  step through  my  Workflow  Builder

so  it  closes  down  our  data  tables.

DOE, Sample Size Explorers, Power.

I  want  power  for  two independent  sample  means.

When I pull that up, you'll see that there are quite a few things to look at.

First,  we  have  the  test  type. It's  going  to  be  two-sided.

Our  Alpha  is  0.05,

and  the  group  population  standard deviations  are  not  assumed  to  be  known.

We're  guessing  at  those.

To  calculate  my  sample  size, I  need  to  fill  in  this  information.

This  is  my  calculator  part. I  have  two  groups.

I'm  going  to  start over  here  on  the  right-hand  side.

I  have  two  group  standard  deviations to  put  in  estimates  for.

I'm  going  to  assume  that  one  group is  less  variable  than  the  other  group.

Next,  I  need  to  fill in  the  difference  to  detect.

Here,  I'm  using  standard  deviation  units,

and  I'm  going  to  say  I  want  to  detect a  one  standard  deviation  unit  difference.

Next, right now I've got a sample size of 30 in each group,

which gives me a very high power.

I'm  going  to  lower  this  power  to  90,

and  I  see  that  for  a  power  of  90 to  detect  a  difference  of  one

between  these  two  groups, I  need  a  sample  size

of  15  subjects  in  each  group.

That  seems  reasonable.
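The same exploration can be scripted outside JMP. Below is a minimal sketch of the power calculation for a two-sided test with unequal group standard deviations, using the noncentral t distribution; the SDs (0.5 and 1.0) and the difference to detect (1.0) are illustrative assumptions, not necessarily the exact values typed into the demo:

```python
# Power for comparing two independent means (Welch-style), a rough
# analogue of JMP's Power Explorer for two independent sample means.
import numpy as np
from scipy import stats

def two_sample_power(diff, sd1, sd2, n1, n2, alpha=0.05):
    se = np.sqrt(sd1**2 / n1 + sd2**2 / n2)  # SE of the difference in means
    # Welch-Satterthwaite degrees of freedom
    df = se**4 / ((sd1**2 / n1)**2 / (n1 - 1) + (sd2**2 / n2)**2 / (n2 - 1))
    ncp = diff / se                           # noncentrality under the alternative
    tcrit = stats.t.ppf(1 - alpha / 2, df)    # two-sided critical value
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

print(two_sample_power(diff=1.0, sd1=0.5, sd2=1.0, n1=15, n2=15))  # ~0.9
```

Rerunning it with larger SDs shows the same behavior as the explorer's graphic: as the data spread out, power drops.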

Now, you can look at these graphics to see

how your guesses, your assumptions, might impact the power of your study.

We  can  see  that  the  standard  deviations have  quite  a  bit  of  impact.

As  my  standard  deviation  increases,

so  my  data  becomes  more  spread  out, my  power  decreases.

It's  going  to  be  harder to  detect  this  difference.

You  can  see  we're at  a  sweet  spot  in  the  sample  size.

As  I  increase  the  sample  size,

my  power  is  going  to  increase, but  not  terribly  greatly.

As I decrease, if I went down to about 10,

my power is going to go down to about 80.

But  let's  go  back  to  the  point  now.

I  want  about  15  samples  per  group.

In  this  instance, to  get  15  positive  samples

from  a  study  where  I'm  enrolling  people,

if I have a 10% prevalence rate

of  sickness  over  the  study  period, I  would  need  about  150  subjects.

If the prevalence were lower, say only 5%, then I would need 300 subjects.
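That prevalence adjustment is simple division: the positives you need over the rate at which you expect to see them. A quick sketch:

```python
# Subjects to enroll so the study yields ~15 positive samples,
# at two assumed prevalence rates.
import math

needed_positives = 15
for prevalence in (0.10, 0.05):
    print(prevalence, math.ceil(needed_positives / prevalence))
# 10% prevalence -> 150 subjects; 5% -> 300 subjects
```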

Again, sample size is a risk-benefit calculation,

so  we  want  to  consider various  sample  sizes.

All  right,  now  to  our  second  example.

This  is  sizing  a  study with  a  proportion  endpoint.

We'll  use  the  distribution  platform

and  we'll  use  the  Interval  Explorer for  one  sample  proportion.

This  is  based  on  the  question of  how  many  samples  do  I  need

to demonstrate sensitivity and specificity for a regulatory filing?

I  do  a  lot  of  work  in  diagnostics.

In  diagnostics,  sensitivity  is  simply the  proportion  of  positive  cases

that  your  test  calls  positive, and  the  specificity  is  the  proportion

of  negative  cases that  your  test  calls  negative.

We  generally  calculate  sample  size for  each  of  these  metrics  individually,

and then we add them for the total sample size for a retrospective study

where  I've  already  got  samples, perhaps  in  a  freezer  or  from  a  partner,

and  I'm  just  going  to  pull out  the  ones  that  I  need.

For  a  prospective  study, again,  we  would  use  the  prevalence

to  calculate  the  total  number of  subjects  to  enroll,

similar to what we did in the last example.
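As a hypothetical tally (the per-arm sample sizes here are placeholders, not this study's numbers): a retrospective design just adds the two arms, while a prospective design has to enroll enough subjects that both arms fill up.

```python
# Total subjects for retrospective vs. prospective designs,
# given per-arm sizes from the interval explorer (values assumed).
import math

n_sens, n_spec, prevalence = 65, 90, 0.10
retrospective_total = n_sens + n_spec  # pull exactly what you need from the freezer
prospective_total = max(
    math.ceil(n_sens / prevalence),        # enroll enough to see n_sens positives...
    math.ceil(n_spec / (1 - prevalence)),  # ...and n_spec negatives
)
print(retrospective_total, prospective_total)  # 155, 650
```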

Again, we  need  some  preliminary  information.

The  goal  of  this  study is  a  regulatory  filing,

so  a  high  level of  evidence  is  needed.

In this particular industry sector, I need to demonstrate

that  the  lower  confidence  limit for  sensitivity  and  specificity

is  greater  than  80 %.

The  study  design is  a  retrospective  study.

It's a review of CT scans.

The  assumptions  are  that the  sensitivity  of  identifying

the  outcome  is  0.9 and  specificity  is  0.85.

I  need  to  understand the  confidence  interval  as  an  outcome,

and I need a calculator for a confidence interval for a proportion.

Again,  the  question, if  I  had  data,  what  would  I  do?

Let's  look  at  that.

Again,  I  generated  some  data.

I  have  a  reference  standard where  I  had  about  145  negative  samples

and  144  positive  cases  or  samples.

Then  I  have  the  test  results, positive  and  negative.

You  can  see  they're  not  perfect.

Some  of  the  cases  that  the  test  calls negative  are  actually  positive,

and  some  of  the  cases  that  the  test calls  positive  are  actually  negative.

How  would  I  look  at  this?

Well,  I  could  tabulate  it

and come up with the percentage of positive cases that the test calls positive

and the percentage of negative cases that the test calls negative.
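Tabulated by hand, those two percentages are just ratios out of a 2x2 table. A sketch with counts assumed to mirror the demo (135 of the 144 positives called positive; the split of the 145 negatives is illustrative):

```python
# Sensitivity and specificity as simple proportions from a 2x2 tally.
tp, fn = 135, 9    # reference positive: 144 cases, 135 called positive
tn, fp = 123, 22   # reference negative: 145 cases (split assumed)
print(tp / (tp + fn))  # sensitivity ~0.938
print(tn / (tn + fp))  # specificity ~0.848
```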

But  I  want  confidence  intervals.

I'm  going  to  use the  distribution  platform,

and  I'm  going  to  look  at  the  proportion in  the  test  cases  by  the  reference  case.

Again,  let's  add…

We  want  to  add,  sorry, wrong  red  triangle  menu.

We  want  to  add  confidence  intervals,

and  I  held  down  my  Control key  to  broadcast  those.

Now  I  can  look  and  see what's  going  on  here.

For  the  cases that  by  the  reference  are  positive,

the new method calls 135 of those positive, so 93.75%,

and I have my confidence interval that goes from 88% to 96.6%.

You'll  see  this  note  here  that  says, computed  using  score  confidence  intervals.

Then  the  thing  to  note  here

is  that  a  score  confidence  interval is  not  symmetric.

We  can  look  at  that.

Here  I  generated  a  graphic,

and you can see that when we're at the low end, at a probability of 0.1,

the upper confidence limit is farther from the point estimate

than the lower confidence limit is.

The  point  estimates  are  not  centered  in the  middle  of  these  confidence  intervals.

That's just the nature of this score confidence interval.
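The score interval JMP notes here is the Wilson interval, and the asymmetry falls out of its formula, which recenters the estimate before adding the half-width. A minimal sketch of that formula (statsmodels' proportion_confint with method="wilson" returns the same limits):

```python
# Wilson (score) confidence interval for one proportion.
import numpy as np
from scipy import stats

def wilson_interval(x, n, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    phat = x / n
    center = (phat + z**2 / (2 * n)) / (1 + z**2 / n)  # pulled toward 0.5
    half = (z / (1 + z**2 / n)) * np.sqrt(
        phat * (1 - phat) / n + z**2 / (4 * n**2)
    )
    return center - half, center + half

print(wilson_interval(135, 144))  # ~ (0.885, 0.967), the 88% to 96.6% above
```

Note how the point estimate, 0.9375, sits closer to the upper limit than to the lower one, which is exactly the asymmetry in the graphic.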

The  question  now  is,

how  many  samples  do  I  need to  show  that

my lower confidence limit is at least 0.8, given the assumptions we had here:

for sensitivity, which is the positive side,

that  we  were  going  to  be  greater  than  0.9, and  on  the  negative  side

that  we  were  going to  be  greater  than  0.85.

Here  we  can  see  that  at  0.85, my  lower  confidence  limit  is  only  0.78.

I  would  need a  few  more  samples  in  order  to

show  that  my  lower confidence  limit  is  greater  than  0.8.

Again,  the  question  is  now that  I  understand  what  I'm  looking  for

is  how  much  data  should  I  collect?

Let's  go  to  DOE  Sample  Size  Explorer,

confidence  intervals for  one  sample  proportion.

Let's  put  in  this  example  here.

Let's  put  in  our  proportion  of  0.9375,

and  the  sample  size  that  we  had used  here,  which  was  144.

I left the interval type as two-sided,

and the confidence level is 95%.

With  the  sample  size  of  144, if  my  proportion  comes  out  to  be  93.75,

my  margin  of  error  is  0.04.

Okay, well,  what's  margin  of  error?

Margin  of  error  is  the  half  width of  the  confidence  interval.

If it were a symmetric confidence interval,

it would be your plus-or-minus value around your point estimate.

But  in  the  case of  a  score  confidence  interval,

and  that's  what  this  calculator is  based  on,

this  is  the  half  width of  your  confidence  interval.
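In other words, for the numbers above, the margin of error is half the width of the Wilson interval. A quick check, assuming statsmodels is available:

```python
# Margin of error = half-width of the score (Wilson) interval.
from statsmodels.stats.proportion import proportion_confint

lo, hi = proportion_confint(135, 144, alpha=0.05, method="wilson")
print((hi - lo) / 2)  # ~0.04, matching the explorer's margin of error
```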

But we can see that with the 93.75% and a margin of error of 0.04,

the lower limit is not simply 0.04 below 93.75%,

because we noticed when we did this calculation

that our lower confidence limit was 0.88.

This  sample  size  is  more  than  sufficient for  what  we  needed.

We only needed a lower confidence limit of 0.8. Let's do that calculation.

Let's  put  in  our  assumed  value  of  0.9,

and  let's  put  in  a  margin of  error  of,  say,  0.08.

We  know  that  0.1 is  going  to  underestimate  our  sample  size.

If  we  do  this  and  we  say,  all  right, for  a  proportion  of  0.9,

margin  of  error  is  0.08, our  sample  size,  it  says,  is  56.

Okay,  well,  let's  double-check  that.

To  do  that,

I  constructed  a  calculator

where  I  can  put  in  my  assumed  proportion and  I  can  put  in  this  value  of  56.

If  I  run  this  distribution,

and  what  I  did  here  is  I  have  an  outcome of  one  and  zero,  and  I  have  a  frequency.

If  I  relaunch  this,

I  use  the  outcome and  I  use  the  frequency  column

to give me the distribution as if I had 50 ones and six zeros in my data file.

Well,  what  does  that  look  like?

With  a  sample  size  of  56,

a  proportion  of  about  0.9, my  lower  confidence  limit,

using  a  score  confidence  interval  is  0.78.

This  sample  size  of  56  gives  me the  precision  that  I  asked  for,

the  margin  of  error  of  0.08, but  it  doesn't  quite  give  me

the  lower  limit  on  this  confidence interval  that  I  need  for  this  situation.

Let's  put  in  a  slightly larger  sample  size.

Let's  make  this  65.

That  gives  me  a  margin  of  error  of  0.074, which  is  slightly  tighter  than  the  0.08,

and  let's  see  what  that  looks  like in  my  score  confidence  interval.

If  I  do  that,  now  I  see  that  my  lower confidence  limit  is  above  the  0.8.
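That double-check can be scripted the same way. The counts below assume six negatives in each data file (50 of 56 and 59 of 65 positive), matching the 50-ones-and-six-zeros file above and keeping the observed proportion near the assumed 0.9:

```python
# Check the Wilson lower limit at the two candidate sample sizes.
from statsmodels.stats.proportion import proportion_confint

print(proportion_confint(50, 56, method="wilson"))  # ~ (0.785, 0.949): misses 0.8
print(proportion_confint(59, 65, method="wilson"))  # ~ (0.813, 0.957): clears 0.8
```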

The point of this was really to show you that it's important to understand

what  it  is  you're  trying  to  show, and  it's  important  to  understand

what  is  it  that  your  sample  size calculator  is  providing  to  you.

There  are  sample  size  calculators all  over  the  internet.

And in JMP, we have a whole slew

of sample size explorers to look at.

It's  important  to  understand what  is  your  endpoint,

what  are  you  trying  to  solve, and  what  is  it  that  your  calculator

is  calculating  for  you.

Once  you  do  that, then  you're  better  informed

for  making  decisions  as  to  how many  samples  do  you  really  need.

Let's  finish  up  with  just  a  few brief  comments  on  additional  topics.

There are other ways you can get at sample size.

One  is  simulation.

You  can  use  pilot  data to  define  distributions,

then use random number generators to generate a study based on those distributions.

Then  you  can  analyze  that  data to  see  if  your  endpoint  is  met.

Is  it  met? Yes  or  no?

Then  you  can  repeat that  some  large  number  of  times

and calculate the proportion of times your endpoint is met.

In  a  sense,  your  power.

How  likely  are  you  to  meet  your  endpoint given  your  assumptions?
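In code, that simulation loop is only a few lines; the distributions, sample size, and endpoint below are assumptions you would swap for your own pilot estimates:

```python
# Simulation-based power: simulate the study many times and count
# how often the endpoint (a significant Welch t test) is met.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)
n, n_sims, alpha = 15, 10_000, 0.05
hits = 0
for _ in range(n_sims):
    sick = rng.normal(loc=1.0, scale=1.0, size=n)     # assumed sick distribution
    healthy = rng.normal(loc=0.0, scale=0.5, size=n)  # assumed healthy distribution
    hits += stats.ttest_ind(sick, healthy, equal_var=False).pvalue < alpha
print(hits / n_sims)  # estimated power, ~0.9 under these assumptions
```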

I  like  to  do  that. Simulation  is  useful.

Again,  however,  it's  all  based on  your  assumptions.

If  your  assumptions  are  wrong, your  sample  size  may  not  be  large  enough.

Another  thing  that  often  happens is  that  we  have  to  make

the  best  allocations  of  what  we  have.

We  may  have  1,000  samples  in  the  freezer

and  we  know  what  their  outcomes  are and  we  want  to  test  them  on  a  new  test

or  we  want  to  develop  a  new  test.

How  many  can  we  use  to  train  an  algorithm?

How  many  do  we  need  to  use to  validate  that  algorithm?

Sometimes  we  have  to  take the  sample  numbers  that  we  have.

Use  the  sample  size  explorers  to  evaluate what  you  might  be  able  to  conclude,

and  then  use  those  findings  to  decide  if what  you  have  is  sufficient  to  proceed

with  the  experiments

and  the  development of  your  test  or  product.

That's  what  I  have  on  sample  size. It's  more  than  a  number.

It's  based  on what  it  is  you're  trying  to  decide

and  how  you're  going  to  analyze the  data  once  you  get  that  data.

It's an exploration; you want to take into account

how the assumptions you make impact those sample sizes,

and hedge your bets for a great outcome.

Thank  you,  and  that's  it.