
Sample Size: More Than a Number - (2023-US-30MP-1483)

While the question "How many (parts/subjects/runs) do I need?" is one nearly every statistician dreads, it is an important question and should be asked prior to running any study or experiment. The answer seems simple enough. Just plug some numbers into a calculator and off you go! In my experience, though, sample size calculations are rarely that easy.

 

JMP 16 introduced an entire suite of Sample Size Explorers, with more added in JMP 17. But why call them "explorers" and not "calculators"? Because sample size is more than a calculation. It is an integral part of a study design, and to determine a sample size, more than math is needed. This presentation explores sample size from the concept to the execution. While the examples include sample size explorations for medical device or diagnostics studies, the lessons learned are applicable across industries.

 

 

What  we're  going  to  talk  about  today  is

a  simple  introduction to  sample  size  thinking.

Then we'll look at two examples: one comparing the means of two populations,

and  the  second,  looking  at  a  study with  a  proportion  endpoint,

and  we'll  wrap  up with  some  additional  thoughts.

A  question  I'm  often  asked  is, what  sample  size  do  I  need?

One  might  think,  "Oh,  that's  easy. Just  use  a  sample  size  calculator."

But  wait  a  second.

Why  does  JMP  call sample  size  calculators  explorers?

Why  are  they  in  the  DOE  menu? Which  one  do  I  use?

Well,  let's  talk  about some  sample  size  basics.

A  sample  size  is  calculated prior  to  running  a  study.

A  study  is  an  experiment designed  ahead  of  time.

That's  why  they're  in  the  DOE  menu.

Sample  size  depends on  the  goal  of  a  study.

I  often  call  this,  are  you  making  a  $5 decision  or  a  $50  million  decision?

Are  you  looking  at  a  regulatory  clearance,

a  publication,  an  R&D  question, or  a  simple  exploration?

What's  the  primary  endpoint  of  your  study? What  are  you  trying  to  show?

How is your study designed? What are your outcome assumptions?

These might be based on prior knowledge,

pilot data, or, often, simply guessing.

Sample  size  is  a  risk-benefit  exploration.

That's  why  they're  called sample  size  explorers.

You  want  to  explore

how  different  assumptions are  going  to  impact  your  sample  size.

Now,  more  is  generally  better,

but  as  we  all  know,  more  costs  more, and  more  might  not  be  possible.

Let's  start  with  a  simple  example

of  sizing  the  study for  comparing  two  means.

We'll use the Fit Y by X platform,

and  we'll  look  at  the  Power  Explorer for  two  independent  sample  means.

This  sample  size  example  is  based  on

a  real  situation  where a  company  is  in  the  R&D  phase.

They're  doing  a  sample  collection  study. That  could  be  blood,  nasal  swabs,  saliva.

There's  no  primary  endpoint because  it's  an  R&D  study.

They're still in the R&D phase, but they need a power analysis.

They were asked for a power analysis by the entity that is considering

funding  the  project.

How can we provide a power analysis without a primary endpoint?

Well, the best thought here is this: one,

we could say, "Hey, we can't do a power analysis,"

or two, knowing that the funding entity wants a power analysis

to show that we've thought about the study and about how many people

we're asking them to enroll, we could generate a research endpoint.

In  that  case,  we're  going  to  ask, "Can  I  distinguish  the  difference  in  means

between  my  sick  and  healthy  subjects for  some  primary  biological  markers?"

We'll  use  the  sample  size from  the  power  analysis

and the expected prevalence of illness to justify the number

of subjects we're requesting to enroll in the study.

I need to understand the test for comparing two independent means,

and I need a calculator for the power of a test comparing two independent means.

What  I  like  to  ask  myself  is if  I  had  data,  what  would  I  do?

If  I  understand  what  analysis I'm  going  to  do,

that's  going  to  help  me  determine what  sample  size  I  need.

Sometimes  you'll  have  pilot  data,

and sometimes you can just make up data to help you figure out

what analysis you're going to do and what sample size you need.

Let's  take  a  look  at  this.

I'm  going  to  open  a  data  table, and  this  is  just  generated  data.

I have 15 sick patients and 15 healthy patients.

I'm  going  to  do  a  Fit  Y  by  X.

I'll do a couple of things here: adjust our range, jitter my points,

run a T-test, and look at the densities.

Here are two examples of what some data might look like.

On the left are two fairly well-separated populations

of outcomes for biomarker number one.

The difference is about 2 to 2.5.

These  were  generated from  a  normal  distribution

as  were  the  ones  on  the  right-hand  side.

Here,  the  difference  is  a  little  less.

You  can  see  in  both  places,

we  would  conclude  that  there's  a difference  between  these  two  populations.

The  one  on  the  right  being  closer  together

is  harder  to  differentiate than  the  one  on  the  left.

We  used  a  T-test  for  that.
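As an aside: outside JMP, the same "if I had data" dry run can be sketched in a few lines of Python. The group means and standard deviations below are made-up values standing in for the generated biomarker data, not the numbers from the demo.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Made-up pilot-style data: 15 sick and 15 healthy patients,
# with means and SDs chosen only for illustration.
sick = rng.normal(loc=5.0, scale=1.0, size=15)
healthy = rng.normal(loc=3.0, scale=1.0, size=15)

# Welch's t-test (does not assume equal group variances)
t_stat, p_value = ttest_ind(sick, healthy, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```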

Now  the  question  becomes,

how  many  samples  would  I  need

if  I'm  going  to  run  this  experiment?

Again,  let's  look  at  that.

Let  me  just  step through  my  Workflow  Builder

so  it  closes  down  our  data  tables.

DOE, Sample Size Explorers, Power.

I  want  power  for  two independent  sample  means.

When I pull that up, you'll see that there are quite a few things to look at.

First,  we  have  the  test  type. It's  going  to  be  two-sided.

Our  Alpha  is  0.05,

and  the  group  population  standard deviations  are  not  assumed  to  be  known.

We're  guessing  at  those.

To  calculate  my  sample  size, I  need  to  fill  in  this  information.

This  is  my  calculator  part. I  have  two  groups.

I'm  going  to  start over  here  on  the  right-hand  side.

I  have  two  group  standard  deviations to  put  in  estimates  for.

I'm  going  to  assume  that  one  group is  less  variable  than  the  other  group.

Next,  I  need  to  fill in  the  difference  to  detect.

Here,  I'm  using  standard  deviation  units,

and  I'm  going  to  say  I  want  to  detect a  one  standard  deviation  unit  difference.

Next, right now I've got a sample size of 30 in each group,

which gives me a very high power.

I'm  going  to  lower  this  power  to  90,

and  I  see  that  for  a  power  of  90 to  detect  a  difference  of  one

between  these  two  groups, I  need  a  sample  size

of  15  subjects  in  each  group.

That  seems  reasonable.
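As a rough cross-check outside JMP, the usual normal-approximation formula for comparing two independent means is n per group = (z_alpha/2 + z_power)^2 * (sd1^2 + sd2^2) / delta^2. A minimal sketch, where the standard deviations of 0.5 and 1.0 are hypothetical stand-ins for the unequal-variability assumption entered in the explorer, not the values actually used:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sd1, sd2, alpha=0.05, power=0.90):
    """Normal-approximation sample size per group for a two-sided
    test comparing two independent means."""
    z_a = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_b = norm.ppf(power)          # 1.28 for power = 0.90
    return ceil((z_a + z_b) ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2)

# Hypothetical SDs: one group assumed less variable than the other.
print(n_per_group(delta=1.0, sd1=0.5, sd2=1.0))  # 14, close to the demo's 15
```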

Now, you can look at these graphics to see

how your guesses, your assumptions, might impact the power of your study.

We  can  see  that  the  standard  deviations have  quite  a  bit  of  impact.

As  my  standard  deviation  increases,

so  my  data  becomes  more  spread  out, my  power  decreases.

It's  going  to  be  harder to  detect  this  difference.

You  can  see  we're at  a  sweet  spot  in  the  sample  size.

As  I  increase  the  sample  size,

my  power  is  going  to  increase, but  not  terribly  greatly.

As I decrease it, if I went down to about 10,

my power is going to go down to about 80%.

But  let's  go  back  to  the  point  now.

I  want  about  15  samples  per  group.

In  this  instance, to  get  15  positive  samples

from  a  study  where  I'm  enrolling  people,

and  if  I  have  a  10 %  prevalence  rate

of  sickness  over  the  study  period, I  would  need  about  150  subjects.

If the prevalence were lower, say, only 5%, then I would need 300 subjects.
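The enrollment arithmetic behind those numbers is just the positives you need divided by the expected prevalence, rounded up:

```python
from math import ceil

def subjects_to_enroll(n_positives_needed, prevalence):
    """Total enrollment needed to expect a given number of positives."""
    return ceil(n_positives_needed / prevalence)

print(subjects_to_enroll(15, 0.10))  # 150
print(subjects_to_enroll(15, 0.05))  # 300
```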

Again, sample size is a risk-benefit calculation,

so  we  want  to  consider various  sample  sizes.

All  right,  now  to  our  second  example.

This  is  sizing  a  study with  a  proportion  endpoint.

We'll  use  the  distribution  platform

and  we'll  use  the  Interval  Explorer for  one  sample  proportion.

This  is  based  on  the  question of  how  many  samples  do  I  need

to  demonstrate  sensitivity and  specificity  for  regulatory  filing?

I  do  a  lot  of  work  in  diagnostics.

In  diagnostics,  sensitivity  is  simply the  proportion  of  positive  cases

that  your  test  calls  positive, and  the  specificity  is  the  proportion

of  negative  cases that  your  test  calls  negative.
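In code, those two proportions fall straight out of a 2x2 table of test result versus reference standard. A minimal sketch; the positive-side counts match the generated data shown later (135 of 144), while the negative-side split is an assumed placeholder:

```python
# Hypothetical 2x2 table: test result vs. reference standard
true_pos, false_neg = 135, 9    # the 144 reference-positive cases
true_neg, false_pos = 123, 22   # the 145 reference-negative cases (assumed split)

sensitivity = true_pos / (true_pos + false_neg)  # positives called positive
specificity = true_neg / (true_neg + false_pos)  # negatives called negative
print(f"sensitivity = {sensitivity:.4f}, specificity = {specificity:.4f}")
```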

We  generally  calculate  sample  size for  each  of  these  metrics  individually,

and then we add them to get the total sample size for a retrospective study,

where I've already got samples, perhaps in a freezer or from a partner,

and I'm just going to pull out the ones that I need.

For  a  prospective  study, again,  we  would  use  the  prevalence

to  calculate  the  total  number of  subjects  to  enroll,

just as we did in the last example.

Again, we  need  some  preliminary  information.

The  goal  of  this  study is  a  regulatory  filing,

so  a  high  level of  evidence  is  needed.

In this particular industry sector, I need to demonstrate

that  the  lower  confidence  limit for  sensitivity  and  specificity

is greater than 80%.

The  study  design is  a  retrospective  study.

It's a review of CT scans.

The  assumptions  are  that the  sensitivity  of  identifying

the  outcome  is  0.9 and  specificity  is  0.85.

I  need  to  understand the  confidence  interval  as  an  outcome,

and I need a calculator for a confidence interval for a proportion.

Again,  the  question, if  I  had  data,  what  would  I  do?

Let's  look  at  that.

Again,  I  generated  some  data.

I  have  a  reference  standard where  I  had  about  145  negative  samples

and  144  positive  cases  or  samples.

Then  I  have  the  test  results, positive  and  negative.

You  can  see  they're  not  perfect.

Some  of  the  cases  that  the  test  calls negative  are  actually  positive,

and  some  of  the  cases  that  the  test calls  positive  are  actually  negative.

How  would  I  look  at  this?

Well,  I  could  tabulate  it

and come up with the percentage of positive cases that the test calls positive

and the percentage of negative cases that the test calls negative.

But  I  want  confidence  intervals.

I'm  going  to  use the  distribution  platform,

and  I'm  going  to  look  at  the  proportion in  the  test  cases  by  the  reference  case.

Again, let's add…

Sorry, wrong red triangle menu.

We want to add confidence intervals,

and  I  held  down  my  Control key  to  broadcast  those.

Now  I  can  look  and  see what's  going  on  here.

For  the  cases that  by  the  reference  are  positive,

the new method calls 135 of those positive, so 93.75%,

and I have my confidence interval that goes from 88% to 96.6%.

You'll  see  this  note  here  that  says, computed  using  score  confidence  intervals.

Then  the  thing  to  note  here

is  that  a  score  confidence  interval is  not  symmetric.

We  can  look  at  that.

Here I generated a graphic,

and you can see that at the low end,

say a proportion of 0.1,

the upper confidence limit is farther from the point estimate

than the lower confidence limit is.

The point estimates are not centered in the middle of these confidence intervals.

That's just the nature of the score confidence interval.
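The score interval reported here is the Wilson interval, so the 135-of-144 result is easy to reproduce outside JMP (statsmodels calls the method "wilson"):

```python
from statsmodels.stats.proportion import proportion_confint

# 135 of 144 reference-positive cases called positive: p-hat = 0.9375
lower, upper = proportion_confint(count=135, nobs=144, alpha=0.05, method="wilson")
print(f"95% score CI: ({lower:.3f}, {upper:.3f})")  # about (0.885, 0.967)

# The asymmetry: at this high proportion, the lower half of the
# interval is wider than the upper half.
p_hat = 135 / 144
print(p_hat - lower, upper - p_hat)
```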

The  question  now  is,

how  many  samples  do  I  need to  show  that

my lower confidence limit is at least 0.8, given the assumptions we had here

for sensitivity, which is the positive side,

that  we  were  going  to  be  greater  than  0.9, and  on  the  negative  side

that  we  were  going to  be  greater  than  0.85.

Here  we  can  see  that  at  0.85, my  lower  confidence  limit  is  only  0.78.

I  would  need a  few  more  samples  in  order  to

show  that  my  lower confidence  limit  is  greater  than  0.8.

Again, the question, now that I understand what I'm looking for,

is how much data should I collect?

Let's  go  to  DOE  Sample  Size  Explorer,

confidence  intervals for  one  sample  proportion.

Let's  put  in  this  example  here.

Let's  put  in  our  proportion  of  0.9375,

and  the  sample  size  that  we  had used  here,  which  was  144.

I left the interval type as two-sided,

and the confidence level is 95%.

With the sample size of 144, if my proportion comes out to be 93.75%,

my  margin  of  error  is  0.04.

Okay, well,  what's  margin  of  error?

Margin  of  error  is  the  half  width of  the  confidence  interval.

If it were a symmetric confidence interval,

it would be the plus-or-minus value around your point estimate.

But  in  the  case of  a  score  confidence  interval,

and  that's  what  this  calculator is  based  on,

this  is  the  half  width of  your  confidence  interval.
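That definition is easy to check: take half the width of the Wilson interval for the 135-of-144 example, and it comes out at the 0.04 the explorer reports (a sketch, assuming the explorer's margin of error is exactly the score-interval half width):

```python
from statsmodels.stats.proportion import proportion_confint

lower, upper = proportion_confint(135, 144, alpha=0.05, method="wilson")
margin_of_error = (upper - lower) / 2  # half width of the score interval
print(f"{margin_of_error:.3f}")        # about 0.041
```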

But we can see that with the 93.75% and the margin of error of 0.04,

it's not simply 93.75 minus 0.04,

because we noticed when we did this calculation

that our lower confidence limit was 0.88.

This  sample  size  is  more  than  sufficient for  what  we  needed.

We  only  needed  a  confidence  limit  of  0.8. Let's  do  that  calculation.

Let's  put  in  our  assumed  value  of  0.9,

and  let's  put  in  a  margin of  error  of,  say,  0.08.

We know that a margin of error of 0.1 would underestimate our sample size,

because at high proportions the score interval extends farther below the point estimate than above it.

If  we  do  this  and  we  say,  all  right, for  a  proportion  of  0.9,

margin  of  error  is  0.08, our  sample  size,  it  says,  is  56.

Okay,  well,  let's  double-check  that.

To  do  that,

I  constructed  a  calculator

where  I  can  put  in  my  assumed  proportion and  I  can  put  in  this  value  of  56.

If  I  run  this  distribution,

and what I did here is I have outcomes of one and zero, and I have a frequency column.

If  I  relaunch  this,

I  use  the  outcome and  I  use  the  frequency  column

to give me the distribution as if I had fifty 1s and six 0s in my data file.

Well,  what  does  that  look  like?

With  a  sample  size  of  56,

a  proportion  of  about  0.9, my  lower  confidence  limit,

using  a  score  confidence  interval  is  0.78.

This  sample  size  of  56  gives  me the  precision  that  I  asked  for,

the  margin  of  error  of  0.08, but  it  doesn't  quite  give  me

the  lower  limit  on  this  confidence interval  that  I  need  for  this  situation.

Let's  put  in  a  slightly larger  sample  size.

Let's  make  this  65.

That  gives  me  a  margin  of  error  of  0.074, which  is  slightly  tighter  than  the  0.08,

and  let's  see  what  that  looks  like in  my  score  confidence  interval.

If  I  do  that,  now  I  see  that  my  lower confidence  limit  is  above  the  0.8.
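That trial-and-error loop can be automated: compute the Wilson lower limit at the assumed proportion and scan upward until it clears the target. A minimal sketch, treating the assumed proportion as if it were the observed one (so no rounding to whole counts):

```python
from math import sqrt
from scipy.stats import norm

def wilson_lower(p, n, alpha=0.05):
    """Lower limit of the two-sided Wilson (score) interval,
    treating p as the observed sample proportion."""
    z = norm.ppf(1 - alpha / 2)
    center = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - spread) / (1 + z * z / n)

def smallest_n(p_assumed, lower_target, alpha=0.05, n_max=10_000):
    """Smallest n whose Wilson lower limit clears lower_target."""
    for n in range(2, n_max + 1):
        if wilson_lower(p_assumed, n, alpha) >= lower_target:
            return n
    return None

# 62: consistent with the demo, where 56 fell short and 65 cleared 0.8
print(smallest_n(0.90, 0.80))
```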

The point of this was really to show you that it's important to understand

what it is you're trying to show, and it's important to understand

what your sample size calculator is providing to you.

There  are  sample  size  calculators all  over  the  internet.

Then in JMP, we have a whole slew of sample size calculators,

or explorers, to look at.

It's important to understand what your endpoint is,

what you're trying to solve, and what your calculator

is calculating for you.

Once you do that, you're better informed

to make decisions about how many samples you really need.

Let's  finish  up  with  just  a  few brief  comments  on  additional  topics.

There are other ways you can get at sample size.

One  is  simulation.

You  can  use  pilot  data to  define  distributions,

then use random number generators to generate a study based on those distributions.

Then  you  can  analyze  that  data to  see  if  your  endpoint  is  met.

Is  it  met? Yes  or  no?

Then  you  can  repeat that  some  large  number  of  times

and calculate the proportion of times your endpoint is met.

In  a  sense,  your  power.

How  likely  are  you  to  meet  your  endpoint given  your  assumptions?

I  like  to  do  that. Simulation  is  useful.
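A minimal sketch of that simulation loop, reusing the two-means setup from the first example; the distributions are assumptions, exactly as stressed above:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2023)

def simulated_power(n_per_group, reps=10_000, alpha=0.05):
    """Fraction of simulated studies whose endpoint is met (p < alpha)."""
    hits = 0
    for _ in range(reps):
        # Assumed outcome distributions, e.g. fitted to pilot data
        sick = rng.normal(loc=1.0, scale=1.0, size=n_per_group)
        healthy = rng.normal(loc=0.0, scale=0.5, size=n_per_group)
        _, p_value = ttest_ind(sick, healthy, equal_var=False)
        hits += p_value < alpha
    return hits / reps

print(simulated_power(15))  # roughly 0.9 under these assumptions
```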

Again,  however,  it's  all  based on  your  assumptions.

If  your  assumptions  are  wrong, your  sample  size  may  not  be  large  enough.

Another  thing  that  often  happens is  that  we  have  to  make

the best allocation of what we have.

We  may  have  1,000  samples  in  the  freezer

and  we  know  what  their  outcomes  are and  we  want  to  test  them  on  a  new  test

or  we  want  to  develop  a  new  test.

How  many  can  we  use  to  train  an  algorithm?

How  many  do  we  need  to  use to  validate  that  algorithm?

Sometimes  we  have  to  take the  sample  numbers  that  we  have.

Use  the  sample  size  explorers  to  evaluate what  you  might  be  able  to  conclude,

and  then  use  those  findings  to  decide  if what  you  have  is  sufficient  to  proceed

with  the  experiments

and  the  development of  your  test  or  product.

That's  what  I  have  on  sample  size. It's  more  than  a  number.

It's  based  on what  it  is  you're  trying  to  decide

and  how  you're  going  to  analyze the  data  once  you  get  that  data.

It's an exploration: you want to take into account

how the assumptions you make impact those sample sizes

and hedge your bets for a great outcome.

Thank  you,  and  that's  it.