Choosing Models in JMP with Model Selection Criteria - (2023-US-30MP-1456)

The most famous quote in all of statistics is George Box's

"All models are wrong, but some are useful."

I've heard this quote at almost every conference

I've ever been to, and because of this,

to my recollection, I've actually avoided using this quote in any talk before.

But when I looked up

the first time it was ever set in print, it was in a 1976 journal

of the American Statistical Association article.

It's found in a section called Parsimony.

Immediately after that first instance

of the quote, he talks about the importance of finding

the simplest model that describes the observed phenomena.

This amounts to finding models that offer a reasonable balance

of goodness-of-fit versus model complexity and is exactly

what I'm going to be talking about today in this presentation.

JMP and JMP Pro offer a lot of different modeling capabilities,

each with a lot of output related to choosing a model.

Today I'm going to go into some detail into some of the most important of these,

highlighting their motivation and the assumptions behind them.

A lot of the discussion will be about the AICc and BIC model selection criteria,

which are direct and very data- efficient tools for addressing the problem.

Box had in mind with his quote,

which is how to find a useful model from a set of flawed or wrong ones.

As I was putting this presentation together,

I went through the derivations of the AIC and the Bic.

I wanted to get a clear understanding

of what these similar- looking methods really are and what assumptions they made.

Afterwards, out of curiosity,

I did an Internet search of AIC versus BIC versus cross-validation.

It was interesting to see in all these Internet forms that there is still

so much debate, even though these methods have been around for 50 years.

Having recently reviewed the derivations of the methods,

it looks like there are still a lot of misconceptions out there.

I think the reason for this is that both model selection criteria

have very deep and technical derivations

despite the simplicity of their formulas, both of them are equal to minus two times

the log likelihood of the fitted model,

plus a simple penalty based on the number of model parameters.

You can't guess the reasons

for the penalty terms from the formula alone,

which makes them seem mystical and arbitrary.

One of my goals today is to try

to demystify these without going overboard on the math.

To put this all in the context of an analysis workflow,

we can think of an analysis project as having four major steps.

We first have to acquire the data, get it organized and cleaned up.

Then we fit several models to it in a way

that is either manual or automated by software like JMP or JMP Pro.

Once we've done that, then we need to choose one of them

as the model that we're going to work with moving forward.

This is a critical step in the process that we'll be focusing on today.

It's important that we get the model selection right

because the quality of the results and the conclusions we make

at the end requires that we have a reasonably good model.

Here are the main ways that I've seen people make decisions about models.

Hypothesis testing is probably the first one people learn about.

These are most commonly used to determine if a regression coefficient

is statistically significantly different from zero,

which sounds like a model selection problem.

While they are often used in that way,

hypothesis tests are derived under a specific set of assumptions

that explicitly does not account for having changed the model

or having used a model that was chosen as the best

amongst several alternatives.

Then we have the general empirical procedures that assess models based

on data held out from the model fitting process.

These techniques can be applied to both classical statistical models

as well as machine learning models.

In my opinion, holdout validation, in particular, is the way to go

if you have a whole lot of data.

Then we have what all called the small data analytical procedures.

These were derived for situations when you have to make a decision

about which model to use, but you don't have enough data

to hold out any observations.

The most commonly used of these are the AIC and the BIC.

But there are other well- known techniques

like Generalized C ross-Validation and Mallow's C P.

It turns out that these two are actually asymptotically equivalent

to the AIC, so in large samples you should get the same conclusions from GCV,

Mallow's CP, and the AIC, in particular, for at least squares- based models.

Then we also have what I'll call model- specific approaches like VIP

and partially squares models and the cubic clustering criterion

in clustering models.

These are pretty niche and I won't be talking about them any more here today.

Then we also have visual tools like actual by predicted plots

and ROC curves.

Regardless of how you choose your model,

these plots are good to take a look at before moving forward with a model

because they provide more interesting information

than any individual statistic will and can tell us if the best model

that we've considered so far is still a good enough model for us to use.

My own first encounter with model selection criteria in my professional life

was back in the mid- 2000, around when JMP 5 and JMP 6 were out.

JMP had added the ability to provide capability analyses

for non- normal distributions.

Capability analysis is a very important tool

for assessing whether a manufacturing process is " capable"

of delivering products that are within specification.

JMP users wanted to determine the " best distribution" for the data

so their process capability metrics would best reflect the reality

of their situation.

JMP customers understood that you could fit different distributions with JMP

and knew that many of the distributions came with a goodness-of-fit test in a case

of having a hammer causing you to find nails everywhere.

They were trying all the distributions they could find and were choosing the one

with the largest p-value as the distribution

for their capability analysis.

They wanted us to codify this into a new fit all distributions feature

that would automate this process for them.

But we were rather uncomfortable with this request for a number of reasons.

For one thing, the different distributions fit in JMP

came with different kinds of goodness-of-fit tests.

The normal had a Shapiro- Wilk test.

The Weibull had a Cramér–von Mises test, and the LogNormal had a Kolmogorov test.

It's very strange to compare tests

that are rather different from one another.

Another problem with this approach is that distributions with more parameters

are going to tend to have an edge on those with fewer.

If we choose the distribution based on the largest p-value,

it will always favor distributions

with more parameters as we see here with the two- parameter normal compared

with the four- parameter Johnson Su distribution.

Then for some of the distributions

like the Weibull's Cramer von Mises W test,

we only had table values of p-values going up to something like P equals 25.

But even if we consolidated all the goodness-of-fit tests down

to just one and got accurate p-values for all of them,

there's still a larger philosophical issue at stake and that's that hypothesis test

like these can only quantify evidence against the null hypothesis.

If the null hypothesis is true,

then the p-value is a uniformly distributed random variable.

In other words, if the null hypothesis is true,

then the probability that the p-value is between 0.1 and 0.2 is exactly

the same as the probability that it is between 0.8 and 0.9.

S eeing a p-value of 0.9 isn't more evidence that the hypothesis

is true than a p-value of 0.3.

Returning to our example, because all four of these distributions

have goodness- of- fit p-values larger than 0.05.

Through this lens, all four distributions

fit the data reasonably well, even though the goodness-of-fit tests say

all the distributions are good, the conclusions about the process

generating the data are different depending on the distribution.

If you use a peak reference value of 1.33

to determine if the process is capable, then choosing the viable indicates

that the process is not sufficiently capable to meet the specifications,

whereas the other distributions indicate that the process is capable.

We recognize that there had to be a better way to determine

the distribution automatically and came to the conclusion

that this should be seen as a very basic kind of model selection problem.

In our search for a sound method for choosing a distribution,

we stumbled upon this very good book on model selection by Burnham and Anderson.

They give careful derivations of the AIC from the perspectives

of information, theory, and cross-validation.

They also give a derivation of the BIC into how the AIC can be derived

in the same way with a different assumption about the prior distribution.

Burnham and Anderson also carefully show

hypothesis testing is rather incoherent as a model selection strategy.

The book had a pretty big impact on my own views of modeling

and also on JMP statistical modeling platforms.

Returning to the distribution selection problem for the moment,

when we went ahead and added a distribution selector,

we ended up calling it fit all and we base it on the AICc.

Here on the left,

we have two distributions of the capability analysis data

we were looking at before, the normal and the Johnson Su.

The Johnson Su's goodness-of-fit p-value is larger than the normal's because it has

two more parameters than the normal distribution.

Now on the right,

we see the results of a fit all using the AICc.

The normal comes out as the best- fitting distribution,

but the Johnson Su is near the bottom.

This is because the AICc is penalizing it for having these two extra parameters.

This feature has now been used many, many times

and I believe people are generally pretty happy with it.

Now I'm going to go through

a somewhat mathy but hopefully accessible explanation

of what the AICc really is.

All right.

Now I'm going to go into some basic theory behind the AIC.

I'll be as brief as possible and use the best analogies as I can,

but I think it is important to be exposed to the underlying concepts so you can see

that the AIC has a rigorous foundation that has some sense to it.

The AIC- type selection criteria are based on a distance- type metric

between probability distributions called the Kullback- Leibler or KL divergence.

It quantifies the amount

of information lost by using probability distribution two

one probability distribution one is the correct one.

KL divergence has the property of always being greater than or equal to zero

and is only equal to zero when the two probability distributions

are the same.

This is to say that using the wrong distribution always leads

to a theoretically quantifiable, strictly positive information loss.

This is pretty heady abstract stuff,

so I'm going to translate it into the language of statistical modeling.

When we are using data in statistics to learn about how something works,

we are explicitly or implicitly fitting probability models to the data

to approximate the true model that generated it.

If we knew the true probability generating mechanism, we could use the KL divergence

to quantify how far or how wrong the model is from the truth.

We could then try several models and find the one that is the closest

to the truth.

Akaike recognized this and plugged the true

and the model probability formulas into the KL divergence formula

and used a little algebra to see that the KL divergence had two terms.

The first term only contains the true probability- generating mechanism

for the data, which we can never know since we can only work with models.

However, this is a constant that is the same for all models

that you fit to the data

as long as we play by a couple simple rules.

The second term is what Akaike discovered is empirically estimable and with a lot

of math, he found a simple formula to estimate this second term.

In particular,

he discovered that two times the KL divergence is equal to a constant

that is the same for all models, plus two times the negative log likelihood

of the data used to fit the model, plus two times the number of parameters.

Everything had been multiplied by a factor of two just to follow the same convention

as a likelihood ratio test, since the constant term is the same

for all models as long as we don't change

the response data, we can fit several models,

and the one whose AIC is the smallest is the one that is estimated

to have the smallest K L divergence from the truth and in a sense is the one

that is the least wrong.

Using the AIC for model selection is entirely analogous to there being

a collection of islands and you want to know which of the islands you know of

is closest to another island that you know you'll never be able

to get to.

The direct solution to this problem would be to calculate the distances

from each of the islands to the one that we want to get close to.

Now, what if the island we wanted to get close to was surrounded

by a circular high fence that we could approach?

The island is perfectly in the middle of the fence,

so the distance from the center of the island to the fence

is always the same.

But the fence was far enough away from the island that it enclosed

that we couldn't see it or measure the distance

from the fence to the interior island.

We can still estimate the distance from each island to the fence.

Because the main island is in the center of the fence,

we know that the island closest to the fence is the closest island.

This is exactly the situation with the AIC.

With the AIC, we can estimate the distance from the truth to each of the models.

Each AIC estimate is off by the same amount.

While we can't estimate the absolute distance of the models

from the truth, we can know which model is the closest

in a relative sense.

The original AIC is based

on the likelihood of the training data plus a parameter penalty.

The training likelihood assesses the goodness-of- fit of the model.

We can't use this term by itself though, because it is biased downward

as the model parameters were chosen to minimize

the negative log likelihood.

With a lot of math, Akaike derived a very simple expression

that corrects for this bias.

The original penalty is just 2 K

where K is the total number of estimated parameters.

For linear regression with a slope

and an intercept, we also have to count the variance.

For that case you would have K equals three and not two.

There are important assumptions that led to the 2 K penalty.

We can characterize them loosely that the model has to be reasonably good.

The AIC is still going to be robust however, because if a model is bad,

then the likelihood component will be large and will dominate

the penalty amongst the good models.

The 2K term will favor the smaller models as long

as the sample size is large.

However, it didn't take long for people to find that this original AIC

often shows models that overfit in small samples,

so a more accurate, higher- order approximation to the bias was derived.

When this extra term is added,

the criteria becomes known as the AICc or the corrected AIC.

Unfortunately, the reputation that the AIC overfits

had become commonplace before the correction was discovered

and widely known about.

The correction becomes infinite as K approaches N

pushing the model selection criteria away from models that are nearly saturated.

Notice also that the correction term goes to zero as N goes to infinity.

In large samples the AIC and AICc are equivalent.

The AICc is what we reported in Trump because it works well for small samples

and although it was derived for Gaussian distributions,

experience suggests that it's good enough with other commonly used distributions.

Now I'm going to illustrate the AICc

in a real example that was a five- factor central composite design with 31 runs,

and the response was the amount of p DNA produced by a bioreactor.

I'll illustrate the AICc using the generalized regression platform,

giving it a full response surface model

with all main effects interactions and second- order terms.

I fit four models to the data.

One is a full response surface model using least squares that was fit automatically.

Then I use forward selection

under the normal, logNormal, and exponential distributions.

I chose the exponential distribution to illustrate poor model fit.

The models had 2 2, 9, 9, and 1 parameters respectively,

and the model with the lowest AICc was the logN ormal with an AICc

of about 334.8.

We can break the AIC and AICc calculations

down to see how different parts of the penalty are contributing.

The full least squares model

has the lowest likelihood, but the highest AICc overall.

When we look at the second- order corrections and the original AIC values,

we see that it's the second order correction term that is pushing

the model selection criteria to be very large for this model.

The logN ormal forward selection log likelihood is a little lower

than the normal forward selection one.

They both have nine parameters, so their penalties are the same

and the logN ormal forward selection model has the lower AICc.

The exponential forward selection model has the poorest model fit as measured

by the log likelihood, but also only has one parameter

in the model.

Overall it has the smallest penalty contribution to the AICc.

But the poor fit of the model is such that the likelihood dominates

and the exponential model is the second from the worst as measured by the AICc.

If you review the general derivation

of the AIC in the Burnham and Anderson book,

you'll see that what it's actually estimated is the expected value

of a hypothetical test set likelihood for a data set that has the same size

and response structure, but not values as the training set.

The expected values also take into consideration the variability

in the estimate of the MLE.

I find this cross-validation interpretation of the AIC

to be pretty compelling.

I think it's also important to point out that this cross-validation derivation

of the AIC does not assume at all that we have the correct model.

To show that this cross-validation interpretation really works,

I created a simulation formula using an average of the models I've shown

in the previous slides as well as some other ones.

This way we knew that none of the models were actually the correct one.

I fit each of the four models to new training data a thousand times

and set it up so that job would report an independent holdout likelihood

using another new data set.

I kept each of the four models structure

and distributions intact and did not apply variable selection.

This was to perfectly mimic the exact cross-validation interpretation

of the AIC.

From there, I created a table of simulated

holdout likelihoods and computed their average for each of the four models.

This is the AIC and AICc summary table from before,

with the simulation- based average holdout log likelihoods

added over here to the right, you can see that the full normal model

holdout likelihood is very close to its AICc value

and that the second- order correction term was essential for this match to happen.

On the other hand,

you see that the simulated average exponential holdout log likelihood

is also very close to the AICc.

Both the normal and logN ormal holdout likelihoods are close

to the original log Normal models A ICc.

The normal holdout likelihood is a little smaller.

I attribute this to averaging a bunch of simulation models,

making the simulated data a little bit more normally distributed

than the original data was.

There are a couple simple rules that are needed to make AICc comparisons

really valid between different models.

The most important is that the stochastic part of the data

has to stay the same, the same rows have to be used

and it is the Y's in particular that must be the same.

The X's can be different, of course, even if they were originally random,

not only must the Y's be the same, but they can't be changed or transformed.

The transform would have to be built into the model appropriately.

The AIC is also only defined

for well-behaved maximum likelihood estimators

and other closely related methods.

This explains why you don't see AICc

for neural networks and other machine learning models.

Also, you have to keep in mind that just because you found a model that the AICc

says is the best, it doesn't mean that it is a good model.

Use your past experience and model diagnostic plots to ensure

that the model is right enough to be useful.

Returning to the pDNA data, we see two equivalent models.

On the top, we have a logN ormal model

and on the bottom we have a normal fit to the log- transformed response.

You can see that the generalized RS quares

are the same for these two models, but the AICcs are very different.

This is because the logN ormal fit

implicitly builds the transform into the likelihood.

But the log scale normal fit does not.

In this case, the right thing to use is the logN ormal.

Here's a quick demonstration that you have to decide the distribution

and the input variables at the same time.

Here is simulated data from a T-t est type model

two groups of normally distributed data with the same variance,

but different means.

If you fit all in the distribution platform,

it chooses the normal two mixture with an AICc of 1036.

This is the correct distribution

if you don't know the group identity of the rows.

Once you include the grouping variable though,

you see that the normal comes out on top of an AICc of 717 or so.

We also tried the Weibull logNormal

and gamma and the normal still came out on top, even though those distributions

did better in distribution without including the grouping variable.

You'd have to try different model structures

and distributions together to find the right combination.

Now I'm going to change gears and talk a little bit about the BIC,

which is the other main analytical model selection criteria and JMP.

The BIC is motivated in a completely different way

than the AIC.

Schwartz used a large sample argument in a Bayesian context to approximate

the log probability of the data after having integrated the model out,

assuming a flat prior on the parameters, an expression similar to the AIC pops out

with a K log in type penalty term rather than two times k.

There were also other terms in the integral that are always ignored.

One is K log 2 pi, which was considered too small

to deal with and the other one

is a normalized variance of the MLE, which would also be of order K.

I didn't study the AIC or BIC in any depth in school.

I just remember hearing the refrain AIC overfits,

BIC under fits several times in different classes,

which I interpreted as a strong skepticism about both of them.

Comparing the AICc and BIC penalties, we see that the AICc will prevent

big models from being chosen when the sample size is small,

whereas the BIC will still allow large models.

I see the K log and normalization constant penalty in the BIC as somewhat

less compelling than the cross-validation interpretation of the AIC- type penalties.

Something that leads

to a marginal probability of the data is more abstract to me than something

that is directly interpretable as a cross-validation metric

taking into account parameter uncertainty.

I'm fully aware

that I'm editorializing here, but this is what's worked well

for me so far.

Returning to the p DNA DoE one more time.

Here are the same models fit in the pDNA example

using the BIC first selection on top and the AICc on the bottom.

Notice that the BIC of the full normal model

is not as far away from the other models as with the AICc.

The best model overall as rated by the BIC is a logNormal,

but with 13 parameters this time around rather than nine.

The forward selected BIC normal model also has a couple more parameters.

In small samples, contrary to the AIC overfits,

BIC under fits, the AICc can choose smaller models than the BIC in small samples.

Here we see the effects chosen by the BIC and the AICc.

The set of BIC-selected effects is a superset of the ones chosen

by the AICc.

A lso notice, interestingly, that all four effects not chosen

by the AICc are statistically significant under the BIC.

Under the BIC, the pHsquared term is highly significant,

but it isn't present in the AICc model, for example.

I would say that all the significant effects

should have asterisks by them,

but all significant p-values have asterisks by them in J MP reports.

Instead, I'll just say that I take p-values of effects

after selection with a grain of salt.

Although the two models choose different effects,

some of them highly statistically significant in the model.

If we look at the profile

or variable importance from these two models,

they tell a very similar story.

Feed rate is by far the most important

and after that the ordering is the same between the two models.

pH only impacts 3% of the variation in the response surface under the BIC

best model and isn't included at all in the AICc best model.

This is a very clear example

of statistical significance and practical relevance

being two different things.

There are a lot of opinions out there about the AICc and the BIC.

For example, Burnham and Anderson

say that both methods are consistent for the quasi- true model as N goes

to infinity.

But then there's others that say

that the BIC is the only one consistent for the truth.

Burnham and Anderson say that you can set up simulations to make one look good,

change the way it's set up a little bit and it'll flip the results.

Burnham and Anderson,

who are about the most diehard AICc fans out there in their simulations,

found that the AICc chooses fewer really bad models than the BIC.

I think it's not a bad idea to look at both BIC and AICc

after applying variable selection.

If the best models under both are pretty much the same, which is often the case,

you can feel pretty good about either of them, if they're different

it's good to think about the reasons why and use your subject matter expertise

to help make a decision.

My last topic is model selection criteria and linear mixed models.

This is a pretty complicated situation, especially because there isn't consensus

between software vendors in how to compute the model selection criteria.

To illustrate this, I created a split plot design with four factors.

There are two whole plot effects and two split plot effects.

If you take the same data and fit the same model in JMP Pro

and SAS using fit mixed and proc mixed,

you will see that the likelihoods and model selection criteria don't match,

but the variance estimates do, you get different fixed effects

parameter estimates, but the fixed effects tests agree.

One of the reasons for this is that JMP and SAS fixed effects design matrices

use a different coding strategy for categorical effects.

On the left I have the JMP design matrix for the split plot example,

and on the right you see the SAS one.

JMP creates a row of minus ones

for the last level of categorical effects which is seen in blue here

whereas SAS creates a row of zeros.

Neither one of these is right or wrong.

It's like changing units or changing coordinate systems.

JMP categorical effects sum to zero, whereas SAS categorical effects

can be interpreted as differences from the last level.

Although the raw parameter estimates differ,

predictions will be the same between the two codings because the models

are fundamentally equivalent.

Most things that matter

won't be different between the two software products.

However, REML, the method used to estimate mixed effects models

has an ambiguity in it.

The base Gaussian likelihood at the top

will be the same in either software because it's a real likelihood.

But the REML or residual likelihood reported by proc mixed

and JMP pro's fit mixed isn't a real likelihood.

If it was a real likelihood, then we would get the same values

regardless of which coding or software we used.

This is because there's an extra penalty added to the Gaussian likelihood

for REML that reduces the bias of the variance estimates.

But this depends on the design matrix

in a way that is sensitive to the coding used.

JMP reports, the raw Gaussian likelihood, and the AICc and BIC that it reports

are based on that rather than the residual likelihood.

The number of parameters fit mixed counts

is the total including both fixed effects and variance parameters.

We did it this way to make it so that you can use JMP to compare models

with different fixed- effect structures as well as variance models.

In SAS, they only report the residual or REML log likelihood

and it reports model selection criteria based on it.

You can see here that it also only counts variance parameters as well

because the difference between the SAS likelihood

and its AIC is for implying two parameters,

a variance component and a residual.

All this means is that

you can only use p roc mixed for comparing variance models

with the AIC because the model selection criteria

includes the REML penalty and it's only counting variance parameters.

With all due respect,

I can think of some good reasons for the SAS approach,

and there are probably some other good reasons I don't even know of.

But I personally prefer the flexibility afforded by the JMP approach.

To summarize, if you compare results across software

for non- mixed models, the mean parameter estimates may differ,

but otherwise everything else should be the same.

As long as the software computes

the constants and the likelihood correctly as JMP does.

When we get to Gaussian mixed models,

there are very important software differences

and the scope of the decisions you can make about the models

using the software may be very different

depending on the details of how its likelihood is calculated.

JMP model selection criteria

are comparable both within the same platform

and across other modeling platforms.

I'll close with this slide, which gives my basic recommendations

for applying the tools discussed today.

Hypothesis testing is a tool for when you need to prove something

and is best used in situations when you have a good idea

of the model structure in advance.

When you're working on a problem in industry and the sample size is small,

I would stick to classical statistical models

and use the AICc as the primary tool for choosing between them.

With larger data sets, when I have enough data to hold out

at least a third of the observations, I use holdout cross-validation to compare

classical statistical models as well as machine learning models.

In my own work, I tend to avoid K- fold cross-validation

and its variance.

The model selection criteria are equivalent to it in larger samples,

and I tend to stick with simpler models with smaller data sets.

I know that not everyone is going to agree with me on this,

but this is what works for me and is a pretty safe way

to approach model selection.

Choosing the most useful model from a set of alternatives that must all be wrong

on some level is an important decision, and these are the main considerations

I have when deciding upon a model selection strategy.

Thank you for your attention and I look forward to talking with you

in the Meet the Expert Sessions.