
Choosing Models in JMP with Model Selection Criteria - (2023-US-30MP-1456)

More than any other statistical software, JMP and JMP Pro make tremendous use of model selection criteria, such as the AICc and BIC. These tools can be used by practitioners in all industries and at all skill levels, from users choosing a distribution for a capability analysis, to advanced users choosing input factors in a linear mixed model or a functional DOE analysis. Model selection criteria are incredibly flexible and powerful, yet make it easy to decide between very different sets of predictor variables, response distributions, and even correlation structures, all at the same time.

 

Unfortunately, the full story of how and when to use these criteria is not part of most standard data science courses in universities and professional training. One reason for this omission is that, unlike JMP, many software packages implement model selection criteria in an incomplete or arguably incorrect way, making it impossible to compare models with different input variables.

 

In this presentation, we give clear guidance on how and when to use model selection criteria. We describe their motivation and the assumptions they require. We compare model selection criteria to other better-known approaches to selecting models, such as hypothesis tests and holdout-based cross-validation procedures. We also give a brief story of how JMP Statistical R&D developers came to appreciate how useful these tools are, as we sought a general solution to the distribution dredging problem.

 

 

The  most  famous  quote  in  all of  statistics  is  George  Box's

"All  models  are  wrong,  but  some  are  useful."

I've  heard  this  quote   at  almost  every  conference

I've  ever  been  to, and   because  of  this,

to  my  recollection,  I've  actually  avoided using  this  quote  in  any  talk  before.

But  when  I  looked  up

the first time it was ever set in print, it was in a 1976 Journal

of the American Statistical Association article.

It's  found  in  a  section  called  Parsimony.

Immediately  after  that  first  instance

of  the  quote, he  talks  about  the  importance  of  finding

the  simplest  model   that  describes  the  observed  phenomena.

This  amounts  to  finding  models  that  offer  a  reasonable  balance

of   goodness-of-fit versus  model  complexity  and  is  exactly

what  I'm  going  to  be  talking about  today  in  this  presentation.

JMP  and  JMP  Pro  offer  a  lot  of  different  modeling  capabilities,

each  with  a  lot of  output  related  to  choosing  a  model.

Today I'm going to go into some detail on some of the most important of these,

highlighting  their  motivation and  the  assumptions  behind  them.

A  lot  of  the  discussion  will  be  about the  AICc  and   BIC model  selection  criteria,

which are direct and very data-efficient tools for addressing the problem

Box had in mind with his quote,

which is how to find a useful model from a set of flawed or wrong ones.

As  I  was  putting   this  presentation  together,

I went through the derivations of the AIC and the BIC.

I  wanted  to  get  a  clear  understanding

of  what  these  similar- looking  methods really  are  and  what  assumptions  they  made.

Afterwards,  out  of  curiosity,

I  did  an  Internet  search  of  AIC versus   BIC versus  cross-validation.

It was interesting to see in all these Internet forums that there is still

so  much  debate,  even  though  these  methods have  been  around  for  50  years.

Having  recently  reviewed  the  derivations  of  the  methods,

it  looks  like  there  are still  a  lot  of  misconceptions  out  there.

I  think  the  reason  for  this  is  that  both  model  selection  criteria

have  very  deep  and  technical  derivations

despite the simplicity of their formulas. Both of them are equal to minus two times

the   log likelihood  of  the  fitted  model,

plus  a  simple  penalty  based  on  the  number  of  model  parameters.
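Written out, with log L(θ̂) the maximized log likelihood, k the number of estimated parameters, and n the sample size:

```latex
\mathrm{AIC} = -2\log L(\hat{\theta}) + 2k,
\qquad
\mathrm{BIC} = -2\log L(\hat{\theta}) + k\log n
```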

You  can't  guess  the  reasons

for  the  penalty  terms   from  the  formula  alone,

which  makes  them   seem  mystical  and  arbitrary.

One  of  my  goals  today  is  to  try

to  demystify  these  without going  overboard  on  the  math.

To  put  this  all  in  the  context of  an  analysis  workflow,

we  can  think  of  an  analysis  project as  having  four  major  steps.

We  first  have  to  acquire  the  data, get  it  organized  and  cleaned  up.

Then  we  fit  several  models  to  it  in  a  way

that  is  either  manual  or  automated by  software  like  JMP  or  JMP  Pro.

Once  we've  done  that, then  we  need  to  choose  one  of  them

as  the  model  that  we're  going to  work  with  moving  forward.

This  is  a  critical  step  in  the  process that  we'll  be  focusing  on  today.

It's  important  that  we  get  the  model  selection  right

because  the  quality of  the  results  and  the  conclusions  we  make

at  the  end  requires   that  we  have  a  reasonably  good  model.

Here  are  the  main  ways  that  I've  seen people  make  decisions  about  models.

Hypothesis  testing  is  probably the  first  one  people  learn  about.

These  are  most  commonly  used  to  determine  if  a  regression  coefficient

is  statistically  significantly different  from  zero,

which  sounds  like  a  model  selection  problem.

While  they  are  often  used  in  that  way,

hypothesis  tests  are  derived  under a  specific  set  of  assumptions

that  explicitly  does  not  account for  having  changed  the  model

or  having  used  a  model  that  was  chosen  as  the  best

amongst  several  alternatives.

Then  we  have  the  general  empirical procedures  that  assess  models  based

on  data  held  out   from  the  model  fitting  process.

These  techniques  can  be  applied  to  both  classical  statistical  models

as  well  as  machine  learning  models.

In  my  opinion,  holdout  validation, in  particular,  is  the  way  to  go

if  you  have  a  whole  lot  of  data.

Then we have what I'll call the small data analytical procedures.

These  were  derived  for  situations   when  you  have  to  make  a  decision

about  which  model  to  use,   but  you  don't  have  enough  data

to  hold  out  any  observations.

The  most  commonly  used  of  these are  the  AIC  and  the  BIC.

But  there  are  other  well- known  techniques

like Generalized Cross-Validation and Mallows' Cp.

It  turns  out  that  these  two   are  actually  asymptotically  equivalent

to  the  AIC,  so  in  large  samples  you  should get  the  same  conclusions  from  GCV,

Mallows' Cp, and the AIC, in particular for least squares-based models.
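For reference, one common form of each, for a least squares model with p parameters, error sum of squares SSE_p, and a full-model variance estimate σ̂², is:

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p,
\qquad
\mathrm{GCV}(p) = \frac{\mathrm{SSE}_p / n}{\left(1 - p/n\right)^2}
```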

Then  we  also  have  what  I'll  call  model- specific  approaches  like  VIP

in partial least squares models and the cubic clustering criterion

in clustering  models.

These  are  pretty  niche  and  I  won't  be talking  about  them  any  more  here  today.

Then  we  also  have  visual  tools  like  actual by  predicted  plots

and  ROC  curves.

Regardless  of  how  you  choose  your  model,

these  plots  are  good  to  take  a  look  at  before  moving  forward  with  a  model

because  they  provide  more  interesting  information

than  any  individual  statistic  will  and  can  tell  us  if  the  best  model

that  we've  considered  so  far  is  still a  good  enough  model  for  us  to  use.

My  own  first  encounter  with  model selection  criteria  in  my  professional  life

was back in the mid-2000s, around when JMP 5 and JMP 6 were out.

JMP  had  added  the  ability  to  provide  capability  analyses

for  non- normal  distributions.

Capability  analysis  is  a  very  important  tool

for  assessing  whether   a  manufacturing  process  is  " capable"

of  delivering  products that  are  within  specification.

JMP  users  wanted  to  determine  the " best  distribution"  for  the  data

so  their  process  capability  metrics  would  best  reflect  the  reality

of  their  situation.

JMP  customers  understood  that  you  could fit  different  distributions  with  JMP

and knew that many of the distributions came with a goodness-of-fit test.

In a case of having a hammer causing you to find nails everywhere,

they were trying all the distributions they could find and were choosing the one

with  the  largest   p-value   as  the  distribution

for  their  capability  analysis.

They  wanted  us  to  codify  this  into  a  new  fit  all  distributions  feature

that  would  automate  this  process  for  them.

But  we  were  rather  uncomfortable with  this  request  for  a  number  of  reasons.

For  one  thing,   the  different  distributions  fit in  JMP

came  with  different kinds  of   goodness-of-fit  tests.

The  normal  had  a  Shapiro- Wilk  test.

The  Weibull  had  a  Cramér–von Mises  test, and  the  LogNormal  had  a  Kolmogorov  test.

It's  very  strange  to  compare  tests

that  are  rather  different  from  one  another.

Another  problem  with  this  approach is  that  distributions  with  more  parameters

are  going  to  tend  to  have an  edge  on  those  with  fewer.

If  we  choose  the  distribution based  on  the  largest   p-value,

it  will  always  favor  distributions

with  more  parameters  as  we  see  here with  the  two- parameter  normal  compared

with  the  four- parameter Johnson  Su  distribution.

Then  for  some  of  the  distributions

like  the  Weibull's  Cramer  von  Mises  W  test,

we only had table values of p-values going up to something like p = 0.25.

But  even  if  we  consolidated  all  the   goodness-of-fit  tests  down

to  just  one  and  got  accurate   p-values  for  all  of  them,

there's still a larger philosophical issue at stake, and that's that hypothesis tests

like  these  can  only  quantify  evidence against  the  null  hypothesis.

If  the  null  hypothesis  is  true,

then  the   p-value  is  a  uniformly   distributed  random  variable.

In  other  words,   if  the  null  hypothesis  is  true,

then  the  probability  that  the   p-value  is  between  0.1  and  0.2  is  exactly

the  same  as  the  probability that  it  is  between  0.8  and  0.9.

Seeing a p-value of 0.9 isn't more evidence that the null hypothesis

is  true  than  a   p-value  of  0.3.

Returning to our example, all four of these distributions

have goodness-of-fit p-values larger than 0.05.

Through this lens, all four distributions

fit the data reasonably well. But even though the goodness-of-fit tests say

all the distributions are good, the conclusions about the process

generating the data are different depending on the distribution.

If you use a Ppk reference value of 1.33

to determine if the process is capable, then choosing the Weibull indicates

that  the  process  is  not  sufficiently capable  to  meet  the  specifications,

whereas  the  other  distributions indicate  that  the  process  is  capable.
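As a reminder of the metric at stake, one common definition of the Ppk index for a normal distribution is shown below; non-normal versions typically replace the μ̂ ± 3σ̂ points with the corresponding 0.135% and 99.865% percentiles of the fitted distribution.

```latex
P_{pk} = \min\!\left( \frac{\mathrm{USL} - \hat{\mu}}{3\hat{\sigma}},\;
                      \frac{\hat{\mu} - \mathrm{LSL}}{3\hat{\sigma}} \right)
```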

We  recognize  that  there  had  to  be  a  better  way  to  determine

the  distribution  automatically  and  came  to  the  conclusion

that  this  should  be  seen  as  a  very basic  kind  of  model  selection  problem.

In  our  search  for  a  sound  method for  choosing  a  distribution,

we  stumbled  upon  this  very  good  book  on model  selection  by  Burnham  and  Anderson.

They give careful derivations of the AIC from the perspectives

of information theory and cross-validation.

They also give a derivation of the BIC and show how the AIC can be derived

in the same way with a different assumption about the prior distribution.

Burnham  and  Anderson  also  carefully  show

that hypothesis testing is rather incoherent as a model selection strategy.

The  book  had  a  pretty  big  impact  on  my  own  views  of  modeling

and  also  on  JMP  statistical  modeling  platforms.

Returning  to  the  distribution   selection  problem  for  the  moment,

when  we  went  ahead   and  added  a  distribution  selector,

we  ended  up  calling  it  fit  all and  we  base  it  on  the  AICc.

Here  on  the  left,

we  have  two  distributions of  the  capability  analysis  data

we  were  looking  at  before,   the  normal  and  the  Johnson  Su.

The  Johnson  Su's   goodness-of-fit   p-value is  larger  than  the  normal's  because  it  has

two  more  parameters  than  the  normal  distribution.

Now  on  the  right,

we  see  the  results of  a  fit  all  using  the  AICc.

The  normal  comes  out   as  the  best- fitting  distribution,

but  the  Johnson Su is  near  the  bottom.

This  is  because  the  AICc  is  penalizing  it for  having  these  two  extra  parameters.

This  feature  has  now   been  used  many,  many  times

and  I  believe  people   are  generally  pretty  happy  with  it.
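To make the mechanics concrete, here is a minimal sketch of the same kind of comparison outside JMP, using scipy's maximum likelihood fits on made-up data. The data, the candidate list, and the parameter counts are illustrative only, and scipy's parameterizations (which include a location parameter for some distributions) don't match JMP's exactly.

```python
# Illustrative sketch: rank candidate distributions by AICc, in the spirit of
# a fit-all comparison. Data and distribution list are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=10, scale=2, size=80)   # hypothetical process measurements
n = y.size

def aicc(loglik, k, n):
    return -2 * loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)

candidates = {
    "Normal": stats.norm,
    "LogNormal": stats.lognorm,          # scipy also fits a location parameter
    "Weibull": stats.weibull_min,
    "Johnson Su": stats.johnsonsu,
}

for name, dist in candidates.items():
    params = dist.fit(y)                       # maximum likelihood fit
    loglik = np.sum(dist.logpdf(y, *params))   # log likelihood at the MLE
    k = len(params)                            # shape(s) + location + scale
    print(f"{name:10s}  k={k}  AICc = {aicc(loglik, k, n):8.2f}")
```

Distributions with extra parameters only win here if their improvement in log likelihood outweighs the larger penalty.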

Now  I'm  going  to  go  through

a  somewhat  mathy but  hopefully  accessible  explanation

of  what  the  AICc  really  is.

All  right.

Now  I'm  going  to  go  into  some  basic  theory  behind  the  AIC.

I'll  be  as  brief  as  possible   and  use  the  best  analogies  as  I  can,

but  I  think  it  is  important  to  be  exposed to  the  underlying  concepts  so  you  can  see

that  the  AIC  has  a  rigorous  foundation  that  has  some  sense  to  it.

The  AIC- type  selection  criteria  are  based  on  a  distance- type  metric

between  probability  distributions  called the  Kullback- Leibler  or  KL  divergence.

It  quantifies  the  amount

of information lost by using probability distribution two

when probability distribution one is the correct one.
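In symbols, for a true density f and an approximating model density g:

```latex
D_{KL}(f \,\|\, g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx \;\ge\; 0
```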

KL  divergence  has  the  property  of  always being  greater  than  or  equal  to  zero

and  is  only  equal  to  zero   when  the  two  probability  distributions

are  the  same.

This  is  to  say  that  using   the  wrong  distribution  always  leads

to  a  theoretically  quantifiable, strictly  positive  information  loss.

This  is  pretty  heady  abstract  stuff,

so  I'm  going  to  translate  it  into  the  language  of  statistical  modeling.

When  we  are  using  data  in  statistics to  learn  about  how  something  works,

we  are  explicitly  or  implicitly  fitting probability  models  to  the  data

to  approximate   the  true  model  that  generated  it.

If  we  knew  the  true  probability  generating mechanism,  we  could  use  the  KL  divergence

to  quantify  how  far   or  how  wrong  the  model  is  from  the  truth.

We  could  then  try  several  models  and  find  the  one  that  is  the  closest

to  the  truth.

Akaike  recognized  this   and  plugged  the  true

and  the  model  probability  formulas  into  the  KL  divergence  formula

and  used  a  little  algebra  to  see that  the  KL  divergence  had  two  terms.

The  first  term  only  contains   the  true  probability- generating  mechanism

for  the  data,  which  we  can  never  know since  we  can  only  work  with  models.

However,  this  is  a  constant  that  is  the  same  for  all  models

that  you  fit  to  the  data

as  long  as  we  play by  a  couple  simple  rules.

The  second  term  is  what  Akaike  discovered is  empirically  estimable  and  with  a  lot

of  math,  he  found  a  simple  formula to  estimate  this  second  term.

In  particular,

he  discovered  that  two  times the  KL  divergence  is  equal  to  a  constant

that  is  the  same  for  all  models, plus  two  times  the  negative   log likelihood

of  the  data  used  to  fit  the  model, plus  two  times  the  number  of  parameters.
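Putting that statement in symbols, with C denoting the constant that involves only the true generating mechanism:

```latex
2\,\widehat{D}_{KL} \;\approx\; C \;+\; \underbrace{\bigl(-2\log L(\hat{\theta}) + 2k\bigr)}_{\mathrm{AIC}}
```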

Everything  had  been  multiplied  by  a  factor of  two  just  to  follow  the  same  convention

as a likelihood ratio test. Since the constant term is the same

for  all  models as  long  as  we  don't  change

the  response  data,   we  can  fit  several  models,

and  the  one  whose  AIC  is  the  smallest   is  the  one  that  is  estimated

to have the smallest KL divergence from the truth and, in a sense, is the one

that  is  the  least  wrong.

Using  the  AIC  for  model  selection  is  entirely  analogous  to  there  being

a  collection  of  islands  and  you  want to  know  which  of  the  islands  you  know  of

is  closest  to  another  island  that  you  know  you'll  never  be  able

to  get  to.

The  direct  solution  to  this  problem  would  be  to  calculate  the  distances

from  each  of  the  islands  to  the  one  that  we  want  to  get  close  to.

Now,  what  if  the  island  we  wanted  to  get  close  to  was  surrounded

by  a  circular  high  fence  that  we  could  approach?

The  island  is  perfectly  in  the  middle  of  the  fence,

so  the  distance  from  the  center  of  the  island  to  the  fence

is  always  the  same.

But the fence was far enough away from the island it enclosed

that we couldn't see that island or measure the distance

from the fence to the interior island.

We  can  still  estimate  the  distance from  each  island  to  the  fence.

Because  the  main  island  is  in  the  center  of  the  fence,

we  know  that  the  island  closest to  the  fence  is  the  closest  island.

This  is  exactly the  situation  with  the  AIC.

With  the  AIC,  we  can  estimate  the  distance from  the  truth  to  each  of  the  models.

Each  AIC  estimate   is  off  by  the  same  amount.

While  we  can't  estimate  the  absolute  distance  of  the  models

from  the  truth, we  can  know  which  model  is  the  closest

in  a  relative  sense.

The  original  AIC  is  based

on  the  likelihood  of  the  training  data  plus  a  parameter  penalty.

The  training  likelihood assesses  the  goodness-of- fit  of  the  model.

We  can't  use  this  term  by  itself  though, because  it  is  biased  downward

as  the  model  parameters   were  chosen  to  minimize

the  negative   log likelihood.

With  a  lot  of  math,  Akaike  derived  a  very  simple  expression

that  corrects  for  this  bias.

The original penalty is just 2K

where  K is  the  total  number of  estimated  parameters.

For  linear  regression  with  a  slope

and  an  intercept,   we  also  have  to  count  the  variance.

For  that  case  you  would  have K  equals  three  and  not  two.
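As a small worked example of the parameter counting, here is an AIC and AICc computed by hand for a simple linear regression on made-up data, with the error variance counted as the third parameter:

```python
# Illustrative sketch: by-hand AIC/AICc for a straight-line regression,
# counting intercept, slope, and the error variance (k = 3).
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 25)
y = 2.0 + 0.8 * x + rng.normal(scale=1.5, size=x.size)   # made-up data

n = y.size
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares = ML for the mean
resid = y - X @ beta
sigma2 = np.mean(resid**2)                    # ML estimate of the variance

loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
k = 3                                         # intercept, slope, variance
aic = -2 * loglik + 2 * k
aicc = aic + 2 * k * (k + 1) / (n - k - 1)
print(f"logL = {loglik:.2f}, AIC = {aic:.2f}, AICc = {aicc:.2f}")
```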

There are important assumptions that led to the 2K penalty.

We can characterize them loosely as saying that the model has to be reasonably good.

The AIC is still going to be robust, however, because if a model is bad,

then the likelihood component will be large and will dominate

the penalty.

Amongst the good models, the 2K term will favor the smaller ones as long

as the sample size is large.

However,  it  didn't  take  long   for  people  to  find  that  this  original  AIC

often chooses models that overfit in small samples,

so  a  more  accurate,  higher- order approximation  to  the  bias  was  derived.

When  this  extra  term  is  added,

the criterion becomes known as the AICc or the corrected AIC.
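The corrected criterion adds a higher-order bias term that depends on both the parameter count and the sample size:

```latex
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}
```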

Unfortunately,   the  reputation  that  the  AIC  overfits

had  become  commonplace before  the  correction  was  discovered

and  widely  known  about.

The  correction  becomes  infinite  as  K  approaches N

pushing  the  model  selection  criteria  away from  models  that  are  nearly  saturated.

Notice  also  that  the  correction  term goes  to  zero  as  N  goes  to  infinity.

In  large  samples   the  AIC  and  AICc  are  equivalent.

The AICc is what we report in JMP because it works well for small samples

and  although  it  was  derived   for  Gaussian  distributions,

experience  suggests  that  it's  good  enough with  other  commonly  used  distributions.

Now  I'm  going  to  illustrate  the  AICc

in  a  real  example  that  was  a  five- factor central  composite  design  with  31  runs,

and the response was the amount of pDNA produced by a bioreactor.

I'll  illustrate  the  AICc using  the  generalized  regression  platform,

giving  it  a  full  response  surface  model

with all main effects, interactions, and second-order terms.

I  fit  four  models  to  the  data.

One  is  a  full  response  surface  model  using least  squares  that  was  fit  automatically.

Then  I  use  forward  selection

under  the  normal,  logNormal, and  exponential  distributions.

I  chose  the  exponential  distribution to  illustrate  poor  model  fit.

The models had 22, 9, 9, and 1 parameters, respectively,

and the model with the lowest AICc was the logNormal with an AICc

of about 334.8.

We  can  break  the  AIC  and  AICc  calculations

down  to  see  how  different  parts of  the  penalty  are  contributing.

The  full  least  squares  model

has the lowest negative log-likelihood (the best raw fit), but the highest AICc overall.

When  we  look  at  the  second- order   corrections  and  the  original  AIC  values,

we  see  that  it's  the  second  order correction  term  that  is  pushing

the  model  selection  criteria  to  be  very  large  for  this  model.

The logNormal forward selection negative log-likelihood is a little lower

than the normal forward selection one.

They  both  have  nine  parameters, so  their  penalties  are  the  same

and the logNormal forward selection model has the lower AICc.

The  exponential  forward  selection  model has  the  poorest  model  fit  as  measured

by  the   log likelihood,   but  also  only  has  one  parameter

in  the  model.

Overall  it  has  the  smallest penalty  contribution  to  the  AICc.

But  the  poor  fit  of  the  model  is  such  that  the  likelihood  dominates

and  the  exponential  model  is  the  second from  the  worst  as  measured  by  the  AICc.

If  you  review  the  general  derivation

of  the  AIC  in  the  Burnham  and  Anderson  book,

you'll see that what it's actually estimating is the expected value

of a hypothetical test set likelihood for a data set that has the same size

and response structure, but not the same values, as the training set.

The  expected  values  also  take into  consideration  the  variability

in  the  estimate  of  the  MLE.

I  find  this  cross-validation   interpretation  of  the  AIC

to  be  pretty  compelling.

I  think  it's  also  important  to  point  out that  this  cross-validation  derivation

of  the  AIC  does  not  assume  at  all  that  we  have  the  correct  model.

To  show  that  this  cross-validation interpretation  really  works,

I  created  a  simulation  formula  using  an  average  of  the  models  I've  shown

in  the  previous  slides   as  well  as  some  other  ones.

This  way  we  knew  that  none  of  the  models  were  actually  the  correct  one.

I  fit  each  of  the  four  models  to  new  training  data  a  thousand  times

and set it up so that JMP would report an independent holdout likelihood

using  another  new  data  set.

I  kept  each  of  the  four  models  structure

and  distributions  intact   and  did  not  apply  variable  selection.

This  was  to  perfectly  mimic  the  exact  cross-validation  interpretation

of  the  AIC.

From  there,   I  created  a  table  of  simulated

holdout  likelihoods  and  computed   their  average  for  each  of  the  four  models.
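Here is a stripped-down sketch of that kind of check, with a single made-up generating mechanism and one fixed working model rather than the four models from the talk; the point is only that the average training AIC and the average holdout deviance should land close together for a reasonably good model.

```python
# Illustrative sketch: for a fixed model form, the training-data AIC should be
# close, on average, to -2 times the log likelihood of an independent holdout
# set of the same size. Mechanism, model, and sample size are all made up.
import numpy as np

rng = np.random.default_rng(42)
n, n_sim = 31, 1000
x = np.linspace(-1, 1, n)

def simulate():
    # "true" mechanism, deliberately not identical to the working model below
    return 1.0 + 2.0 * x + 0.3 * x**2 + rng.normal(scale=0.7, size=n)

def gauss_loglik(y, mu, sigma2):
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

X = np.column_stack([np.ones(n), x])          # straight-line working model
k = X.shape[1] + 1                            # intercept, slope, variance
aic_vals, holdout_vals = [], []
for _ in range(n_sim):
    y_train, y_test = simulate(), simulate()
    beta = np.linalg.lstsq(X, y_train, rcond=None)[0]
    sigma2 = np.mean((y_train - X @ beta) ** 2)
    aic_vals.append(-2 * gauss_loglik(y_train, X @ beta, sigma2) + 2 * k)
    holdout_vals.append(-2 * gauss_loglik(y_test, X @ beta, sigma2))

print("average training AIC        :", round(np.mean(aic_vals), 1))
print("average -2 * holdout loglik :", round(np.mean(holdout_vals), 1))
```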

This  is  the  AIC   and  AICc  summary  table  from  before,

with  the  simulation- based average  holdout   log likelihoods

added over here to the right.

You can see that the full normal model

holdout  likelihood   is  very  close  to  its  AICc  value

and  that  the  second- order  correction  term  was  essential  for  this  match  to  happen.

On  the  other  hand,

you  see  that  the  simulated   average  exponential  holdout   log likelihood

is  also  very  close  to  the  AICc.

Both the normal and logNormal holdout likelihoods are close

to the original logNormal model's AICc.

The  normal  holdout  likelihood is  a  little  smaller.

I  attribute  this  to  averaging  a  bunch  of  simulation  models,

making  the  simulated  data  a  little  bit  more  normally  distributed

than  the  original  data  was.

There  are  a  couple  simple  rules  that  are  needed  to  make  AICc  comparisons

really  valid  between  different  models.

The  most  important  is  that   the  stochastic  part  of  the  data

has to stay the same: the same rows have to be used,

and it is the Y's in particular that must be the same.

The X's can be different, of course, even if they were originally random.

Not only must the Y's be the same, but they can't be changed or transformed.

The  transform  would  have  to  be  built  into  the  model  appropriately.

The  AIC  is  also  only  defined

for  well-behaved   maximum  likelihood  estimators

and  other  closely  related  methods.

This  explains  why  you  don't  see  AICc

for  neural  networks   and  other  machine  learning  models.

Also,  you  have  to  keep  in  mind  that  just because  you  found  a  model  that  the  AICc

says  is  the  best,  it  doesn't  mean  that  it  is  a  good  model.

Use  your  past  experience   and  model  diagnostic  plots  to  ensure

that  the  model is  right  enough  to  be  useful.

Returning  to  the  pDNA  data, we  see  two  equivalent  models.

On the top, we have a logNormal model

and  on  the  bottom  we  have  a  normal  fit  to  the  log- transformed  response.

You can see that the generalized R-squares

are  the  same  for  these  two  models, but  the  AICcs  are  very  different.

This is because the logNormal fit

implicitly  builds  the  transform into  the  likelihood.

But  the  log  scale  normal  fit  does  not.

In  this  case,  the  right  thing to  use  is  the  logN ormal.

Here's  a  quick  demonstration   that  you  have  to  decide  the  distribution

and  the  input  variables  at  the  same  time.

Here is simulated data from a t-test type model:

two groups of normally distributed data with the same variance,

but different means.

If  you  fit  all   in  the  distribution  platform,

it chooses the normal 2 mixture with an AICc of 1036.

This  is  the  correct  distribution

if  you  don't  know   the  group  identity  of  the  rows.

Once  you  include   the  grouping  variable  though,

you see that the normal comes out on top with an AICc of 717 or so.

We also tried the Weibull, logNormal,

and gamma, and the normal still came out on top, even though those distributions

did better in the Distribution platform without including the grouping variable.

You'd  have  to  try  different  model  structures

and  distributions  together to  find  the  right  combination.
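Here is a rough sketch of the same comparison on simulated data (group means, sample sizes, and the resulting AICc values are made up and won't reproduce the 1036 and 717 above); it uses scikit-learn's GaussianMixture for the no-grouping mixture fit.

```python
# Illustrative sketch: without the grouping variable, a 2-component normal
# mixture should beat a single normal; once the group means are in the model,
# a plain normal with far fewer parameters should win overall.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def aicc(loglik, k, n):
    return -2 * loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)

rng = np.random.default_rng(3)
g = np.repeat([0, 1], 100)                                  # group labels
y = np.where(g == 0, rng.normal(0, 1, 200), rng.normal(4, 1, 200))
n = y.size

# (a) ignoring the group: single normal vs. two-component normal mixture
ll_norm = np.sum(stats.norm.logpdf(y, y.mean(), y.std()))
gm = GaussianMixture(n_components=2, random_state=0).fit(y.reshape(-1, 1))
ll_mix = gm.score_samples(y.reshape(-1, 1)).sum()

# (b) including the group: normal errors around a separate mean per group
resid = y - np.where(g == 0, y[g == 0].mean(), y[g == 1].mean())
ll_grouped = np.sum(stats.norm.logpdf(resid, 0, resid.std()))

print("single normal        AICc:", round(aicc(ll_norm, 2, n), 1))
print("normal 2 mixture     AICc:", round(aicc(ll_mix, 5, n), 1))   # 2 means, 2 variances, 1 weight
print("normal + group means AICc:", round(aicc(ll_grouped, 3, n), 1))
```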

Now  I'm  going  to  change  gears  and  talk  a  little  bit  about  the  BIC,

which is the other main analytical model selection criterion in JMP.

The   BIC is  motivated   in  a  completely  different  way

than  the  AIC.

Schwarz used a large-sample argument in a Bayesian context to approximate

the log probability of the data after having integrated the parameters out.

Assuming a flat prior on the parameters, an expression similar to the AIC pops out

with a K log N type penalty term rather than two times K.

There  were  also  other  terms   in  the  integral  that  are  always  ignored.

One  is  K  log  2  pi, which  was  considered  too  small

to  deal  with and  the  other  one

is  a  normalized  variance  of  the  MLE, which  would  also  be  of  order K.

I  didn't  study  the  AIC  or  BIC in  any  depth  in  school.

I just remember hearing the refrain "AIC overfits,

BIC underfits" several times in different classes,

which  I  interpreted   as  a  strong  skepticism  about  both  of  them.

Comparing  the  AICc  and   BIC penalties, we  see  that  the  AICc  will  prevent

big  models  from  being  chosen when  the  sample  size  is  small,

whereas  the  BIC will  still  allow  large  models.

I see the K log N normalization-constant penalty in the BIC as somewhat

less  compelling  than  the  cross-validation interpretation  of  the  AIC- type  penalties.

Something  that  leads

to  a  marginal  probability  of  the  data is  more  abstract  to  me  than  something

that  is  directly  interpretable as  a  cross-validation  metric

taking  into  account  parameter  uncertainty.

I'm  fully  aware

that  I'm  editorializing  here, but  this  is  what's  worked  well

for  me  so  far.

Returning to the pDNA DOE one more time.

Here  are  the  same  models  fit  in  the  pDNA  example

using BIC forward selection on top and the AICc on the bottom.

Notice  that  the   BIC  of  the  full  normal  model

is  not  as  far  away from  the  other  models  as  with  the  AICc.

The  best  model  overall  as  rated  by  the  BIC  is  a  logNormal,

but  with  13  parameters this  time  around  rather  than  nine.

The  forward  selected   BIC normal  model also  has  a  couple  more  parameters.

Contrary to the refrain "AIC overfits, BIC underfits,"

the AICc can choose smaller models than the BIC in small samples.

Here  we  see  the  effects  chosen by  the   BIC and  the  AICc.

The  set  of   BIC-selected  effects  is  a  superset  of  the  ones  chosen

by  the  AICc.

Also notice, interestingly, that all four effects not chosen

by  the  AICc  are  statistically  significant under the BIC.

Under  the  BIC, the  pHsquared  term  is  highly  significant,

but  it  isn't  present  in  the  AICc  model,  for  example.

I would say that all these significant effects

should have asterisks (caveats) by them,

but all significant p-values already have asterisks by them in JMP reports.

Instead,  I'll  just  say  that  I  take   p-values  of  effects

after  selection  with  a  grain  of  salt.

The two models choose different effects,

some of them highly statistically significant in the model that includes them.

But if we look at the profiler

or  variable  importance   from  these  two  models,

they  tell  a  very  similar  story.

Feed  rate  is  by  far  the  most  important

and  after  that  the  ordering   is  the  same  between  the  two  models.

pH only  impacts  3%  of  the  variation in  the  response  surface  under  the   BIC

best  model  and  isn't  included at  all  in  the  AICc  best  model.

This  is  a  very  clear  example

of  statistical  significance   and  practical  relevance

being  two  different  things.

There  are  a  lot  of  opinions  out there  about  the  AICc  and  the  BIC.

For  example,  Burnham  and  Anderson

say  that  both  methods  are  consistent   for  the  quasi- true  model  as  N  goes

to  infinity.

But  then  there's  others  that  say

that  the   BIC is  the  only  one consistent  for  the  truth.

Burnham  and  Anderson  say  that  you  can  set up  simulations  to  make  one  look  good,

change  the  way  it's  set  up   a  little  bit  and  it'll  flip  the  results.

Burnham  and  Anderson,

who are about the most diehard AICc fans out there, found in their simulations

that the AICc chooses fewer really bad models than the BIC.

I  think  it's  not  a  bad  idea   to  look  at  both   BIC and  AICc

after  applying  variable  selection.

If  the  best  models  under  both  are  pretty much  the  same,  which  is  often  the  case,

you can feel pretty good about either of them. If they're different,

it's  good  to  think  about  the  reasons  why  and  use  your  subject  matter  expertise

to  help  make  a  decision.

My  last  topic  is  model  selection criteria  and  linear  mixed  models.

This  is  a  pretty  complicated  situation, especially  because  there  isn't  consensus

between  software  vendors  in  how to  compute  the  model  selection  criteria.

To  illustrate  this,  I  created   a  split  plot  design  with  four  factors.

There  are  two  whole  plot  effects and  two  split  plot  effects.

If  you  take  the  same  data  and  fit  the  same  model  in  JMP  Pro

and  SAS  using  fit  mixed  and  proc  mixed,

you  will  see  that  the  likelihoods and  model  selection  criteria  don't  match,

but the variance estimates do. You get different fixed effects

parameter  estimates,   but  the  fixed  effects  tests  agree.

One  of  the  reasons  for  this  is  that  JMP and  SAS  fixed  effects  design  matrices

use  a  different  coding  strategy for  categorical  effects.

On  the  left  I  have  the  JMP  design  matrix for  the  split  plot  example,

and  on  the  right  you  see  the  SAS  one.

JMP  creates  a  row  of  minus  ones

for  the  last  level  of  categorical  effects which  is  seen  in  blue  here

whereas  SAS  creates  a  row  of  zeros.

Neither  one  of  these  is  right  or  wrong.

It's  like  changing  units   or  changing  coordinate  systems.

JMP  categorical  effects  sum  to  zero, whereas  SAS  categorical  effects

can  be  interpreted   as  differences  from  the  last  level.

Although  the  raw  parameter  estimates  differ,

predictions  will  be  the  same between  the  two  codings  because  the  models

are  fundamentally  equivalent.

Most  things  that  matter

won't  be  different   between  the  two  software  products.

However, REML, the method used to estimate mixed effects models,

has  an  ambiguity  in  it.

The  base  Gaussian  likelihood  at  the  top

will  be  the  same  in  either  software because  it's  a  real  likelihood.

But  the  REML  or  residual  likelihood reported  by  proc  mixed

and  JMP  pro's  fit  mixed  isn't  a  real  likelihood.

If  it  was  a  real  likelihood, then  we  would  get  the  same  values

regardless  of  which  coding or  software  we  used.

This  is  because  there's  an  extra  penalty added  to  the  Gaussian  likelihood

for  REML  that  reduces  the  bias of  the  variance  estimates.

But  this  depends  on  the  design  matrix

in  a  way  that  is  sensitive  to  the  coding  used.
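A rough sketch of where that sensitivity enters, with notation and constants varying by reference: the REML criterion can be written as the Gaussian profile log likelihood plus an extra determinant term involving the fixed-effects design matrix X, and that term changes when the coding of X changes.

```latex
\ell_{\mathrm{REML}}(\theta)
\;=\; \ell_{\mathrm{Gaussian}}\!\bigl(\hat{\beta}(\theta), \theta\bigr)
\;-\; \tfrac{1}{2}\log\bigl|X^{\top} V(\theta)^{-1} X\bigr|
\;+\; \text{constants}
```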

JMP reports the raw Gaussian likelihood, and the AICc and BIC that it reports

are  based  on  that rather  than  the  residual  likelihood.

The  number  of  parameters  fit  mixed  counts

is  the  total  including  both  fixed  effects  and  variance  parameters.

We  did  it  this  way  to  make  it   so  that  you  can  use  JMP  to  compare  models

with  different  fixed- effect  structures as  well  as  variance  models.

In  SAS,  they  only  report   the  residual  or REML  log likelihood

and  it  reports  model  selection  criteria  based  on  it.

You  can  see  here  that  it  also  only  counts variance  parameters  as  well

because  the  difference   between  the  SAS  likelihood

and its AIC is four, implying two parameters:

a  variance  component  and  a  residual.

All  this  means  is  that

you can only use proc mixed for comparing variance models

with  the  AIC   because  the  model  selection  criteria

includes  the  REML  penalty  and  it's  only   counting  variance  parameters.

With  all  due  respect,

I  can  think  of  some  good  reasons for  the  SAS  approach,

and  there  are  probably  some  other good  reasons  I  don't  even  know  of.

But  I  personally  prefer  the  flexibility afforded  by  the  JMP  approach.

To  summarize,   if  you  compare  results  across  software

for  non- mixed  models, the  mean  parameter  estimates  may  differ,

but  otherwise  everything else  should  be  the  same.

As long as the software computes

the constants in the likelihood correctly, as JMP does.

When  we  get  to  Gaussian  mixed  models,

there  are  very  important  software  differences

and  the  scope  of  the  decisions you  can  make  about  the  models

using  the  software  may  be  very  different

depending  on  the  details  of  how  its  likelihood  is  calculated.

JMP  model  selection  criteria

are  comparable  both   within  the  same  platform

and  across  other  modeling  platforms.

I'll  close  with  this  slide, which  gives  my  basic  recommendations

for  applying  the  tools  discussed  today.

Hypothesis  testing  is  a  tool   for  when  you  need  to  prove  something

and  is  best  used  in  situations   when  you  have  a  good  idea

of  the  model  structure  in  advance.

When  you're  working  on  a  problem in  industry  and  the  sample  size  is  small,

I  would  stick   to  classical  statistical  models

and  use  the  AICc  as  the  primary  tool for  choosing  between  them.

With  larger  data  sets, when  I  have  enough  data  to  hold  out

at  least  a  third  of  the  observations, I  use  holdout  cross-validation  to  compare

classical  statistical  models  as  well  as  machine  learning  models.

In  my  own  work,  I  tend  to  avoid  K- fold  cross-validation

and its variants.

The  model  selection  criteria  are  equivalent  to  it  in  larger  samples,

and  I  tend  to  stick  with  simpler  models  with  smaller  data  sets.

I  know  that  not  everyone  is  going  to  agree  with  me  on  this,

but  this  is  what  works  for  me  and  is  a  pretty  safe  way

to  approach  model  selection.

Choosing  the  most  useful  model  from  a  set of  alternatives  that  must  all  be  wrong

on  some  level  is  an  important  decision, and  these  are  the  main  considerations

I have when deciding upon a model selection strategy.

Thank  you  for  your  attention  and  I  look  forward  to  talking  with  you

in  the  Meet  the  Expert  Sessions.