"All models are wrong, but some are useful."
I've heard this quote at almost every conference
I've ever been to, and because of this,
to my recollection, I've actually avoided using this quote in any talk before.
But when I looked up
the first time it was ever set in print, it was in a 1976 Journal
of the American Statistical Association article by George Box.
It's found in a section called Parsimony.
Immediately after that first instance
of the quote, he talks about the importance of finding
the simplest model that describes the observed phenomena.
This amounts to finding models that offer a reasonable balance
of goodness-of-fit versus model complexity and is exactly
what I'm going to be talking about today in this presentation.
JMP and JMP Pro offer a lot of different modeling capabilities,
each with a lot of output related to choosing a model.
Today I'm going to go into some detail into some of the most important of these,
highlighting their motivation and the assumptions behind them.
A lot of the discussion will be about the AICc and BIC model selection criteria,
which are direct and very data-efficient tools for addressing the problem
Box had in mind with his quote: how to find a useful model
from a set of flawed or wrong ones.
As I was putting this presentation together,
I went through the derivations of the AIC and the BIC.
I wanted to get a clear understanding
of what these similar-looking methods really are and what assumptions they make.
Afterwards, out of curiosity,
I did an Internet search of AIC versus BIC versus cross-validation.
It was interesting to see in all these Internet forums that there is still
so much debate, even though these methods have been around for 50 years.
Having recently reviewed the derivations of the methods,
it looks like there are still a lot of misconceptions out there.
I think the reason for this is that both model selection criteria
have very deep and technical derivations
despite the simplicity of their formulas: both of them are equal to minus two times
the log-likelihood of the fitted model,
plus a simple penalty based on the number of model parameters.
You can't guess the reasons
for the penalty terms from the formula alone,
which makes them seem mystical and arbitrary.
One of my goals today is to try
to demystify these without going overboard on the math.
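For reference, writing L-hat for the maximized likelihood of a model, k for its number of estimated parameters, and n for the sample size, the two formulas just described are the standard ones:

```latex
\mathrm{AIC} = -2\log\hat{L} + 2k,
\qquad
\mathrm{BIC} = -2\log\hat{L} + k\log n .
```

The penalties 2k and k log n are where all of the subtlety hides.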
To put this all in the context of an analysis workflow,
we can think of an analysis project as having four major steps.
We first have to acquire the data, get it organized and cleaned up.
Then we fit several models to it in a way
that is either manual or automated by software like JMP or JMP Pro.
Once we've done that, then we need to choose one of them
as the model that we're going to work with moving forward.
This is a critical step in the process that we'll be focusing on today.
It's important that we get the model selection right
because the quality of the results and the conclusions we make
at the end requires that we have a reasonably good model.
Here are the main ways that I've seen people make decisions about models.
Hypothesis testing is probably the first one people learn about.
These are most commonly used to determine if a regression coefficient
is statistically significantly different from zero,
which sounds like a model selection problem.
While they are often used in that way,
hypothesis tests are derived under a specific set of assumptions
that explicitly does not account for having changed the model
or having used a model that was chosen as the best
amongst several alternatives.
Then we have the general empirical procedures that assess models based
on data held out from the model fitting process.
These techniques can be applied to both classical statistical models
as well as machine learning models.
In my opinion, holdout validation, in particular, is the way to go
if you have a whole lot of data.
Then we have what I'll call the small-data analytical procedures.
These were derived for situations when you have to make a decision
about which model to use, but you don't have enough data
to hold out any observations.
The most commonly used of these are the AIC and the BIC.
But there are other well-known techniques
like Generalized Cross-Validation and Mallows' Cp.
It turns out that these two are actually asymptotically equivalent
to the AIC, so in large samples you should get the same conclusions from GCV,
Mallows' Cp, and the AIC, in particular for least squares-based models.
Then we also have what I'll call model-specific approaches, like VIP
in partial least squares models and the cubic clustering criterion
in clustering models.
These are pretty niche and I won't be talking about them any more here today.
Then we also have visual tools like actual by predicted plots
and ROC curves.
Regardless of how you choose your model,
these plots are good to take a look at before moving forward with a model
because they provide more interesting information
than any individual statistic will and can tell us if the best model
that we've considered so far is still a good enough model for us to use.
My own first encounter with model selection criteria in my professional life
was back in the mid-2000s, around when JMP 5 and JMP 6 were out.
JMP had added the ability to provide capability analyses
for non- normal distributions.
Capability analysis is a very important tool
for assessing whether a manufacturing process is "capable"
of delivering products that are within specification.
JMP users wanted to determine the "best distribution" for the data
so their process capability metrics would best reflect the reality
of their situation.
JMP customers understood that you could fit different distributions with JMP
and knew that many of the distributions came with a goodness-of-fit test.
In a case of having a hammer causing you to find nails everywhere,
they were trying all the distributions they could find and were choosing the one
with the largest p-value as the distribution
for their capability analysis.
They wanted us to codify this into a new fit all distributions feature
that would automate this process for them.
But we were rather uncomfortable with this request for a number of reasons.
For one thing, the different distributions fit in JMP
came with different kinds of goodness-of-fit tests.
The normal had a Shapiro-Wilk test.
The Weibull had a Cramér–von Mises test, and the LogNormal had a Kolmogorov test.
It's very strange to compare tests
that are rather different from one another.
Another problem with this approach is that distributions with more parameters
are going to tend to have an edge on those with fewer.
If we choose the distribution based on the largest p-value,
it will always favor distributions
with more parameters, as we see here with the two-parameter normal compared
with the four-parameter Johnson Su distribution.
Then for some of the distributions,
like the Weibull's Cramér-von Mises W test,
we only had table values of p-values going up to something like p = 0.25.
But even if we consolidated all the goodness-of-fit tests down
to just one and got accurate p-values for all of them,
there's still a larger philosophical issue at stake, and that's that hypothesis tests
like these can only quantify evidence against the null hypothesis.
If the null hypothesis is true,
then the p-value is a uniformly distributed random variable.
In other words, if the null hypothesis is true,
then the probability that the p-value is between 0.1 and 0.2 is exactly
the same as the probability that it is between 0.8 and 0.9.
Seeing a p-value of 0.9 isn't more evidence that the null hypothesis
is true than a p-value of 0.3.
Returning to our example, all four of these distributions
have goodness-of-fit p-values larger than 0.05,
so through this lens, all four of them fit the data reasonably well.
But even though the goodness-of-fit tests say
all the distributions are good, the conclusions about the process
generating the data are different depending on the distribution.
If you use a capability index reference value of 1.33
to determine if the process is capable, then choosing the Weibull indicates
that the process is not sufficiently capable to meet the specifications,
whereas the other distributions indicate that the process is capable.
We recognized that there had to be a better way to determine
the distribution automatically and came to the conclusion
that this should be seen as a very basic kind of model selection problem.
In our search for a sound method for choosing a distribution,
we stumbled upon this very good book on model selection by Burnham and Anderson.
They give careful derivations of the AIC from the perspectives
of information theory and cross-validation.
They also give a derivation of the BIC and show how the AIC can be derived
in the same way with a different assumption about the prior distribution.
Burnham and Anderson also carefully show
that hypothesis testing is rather incoherent as a model selection strategy.
The book had a pretty big impact on my own views of modeling
and also on JMP statistical modeling platforms.
Returning to the distribution selection problem for the moment,
when we went ahead and added a distribution selector,
we ended up calling it fit all, and we based it on the AICc.
Here on the left,
we have two distributions of the capability analysis data
we were looking at before, the normal and the Johnson Su.
The Johnson Su's goodness-of-fit p-value is larger than the normal's because it has
two more parameters than the normal distribution.
Now on the right,
we see the results of a fit all using the AICc.
The normal comes out as the best-fitting distribution,
but the Johnson Su is near the bottom.
This is because the AICc is penalizing it for having these two extra parameters.
This feature has now been used many, many times
and I believe people are generally pretty happy with it.
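To make the idea concrete outside of JMP, here is a minimal sketch in Python of ranking candidate distributions by AICc. It is not JMP's fit all implementation, the data are simulated, and the parameter counts follow SciPy's parameterizations rather than JMP's, but the ranking logic is the same.

```python
# A minimal "fit several distributions, rank by AICc" sketch (not JMP's fit all).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=50, scale=2, size=80)      # hypothetical process data
n = len(y)

candidates = {
    "normal": stats.norm,                     # 2 parameters in SciPy
    "lognormal": stats.lognorm,               # 3 (shape, loc, scale)
    "weibull": stats.weibull_min,             # 3
    "johnson su": stats.johnsonsu,            # 4
}

def aicc(logL, k, n):
    # AICc = -2 log L + 2k + small-sample correction
    return -2.0 * logL + 2 * k + 2 * k * (k + 1) / (n - k - 1)

results = []
for name, dist in candidates.items():
    params = dist.fit(y)                      # maximum likelihood estimates
    logL = np.sum(dist.logpdf(y, *params))    # maximized log-likelihood
    results.append((aicc(logL, len(params), n), name))

for score, name in sorted(results):           # smallest AICc first
    print(f"{name:10s}  AICc = {score:8.2f}")
```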
Now I'm going to go through
a somewhat mathy but hopefully accessible explanation
of what the AICc really is.
All right.
Now I'm going to go into some basic theory behind the AIC.
I'll be as brief as possible and use the best analogies as I can,
but I think it is important to be exposed to the underlying concepts so you can see
that the AIC has a rigorous foundation that has some sense to it.
The AIC-type selection criteria are based on a distance-type metric
between probability distributions called the Kullback-Leibler or KL divergence.
It quantifies the amount
of information lost by using probability distribution two
when probability distribution one is the correct one.
KL divergence has the property of always being greater than or equal to zero
and is only equal to zero when the two probability distributions
are the same.
This is to say that using the wrong distribution always leads
to a theoretically quantifiable, strictly positive information loss.
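In symbols, with p for the true distribution and q for the approximating model:

```latex
D_{\mathrm{KL}}(p \,\|\, q)
  = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx
  = \mathbb{E}_p[\log p(X)] - \mathbb{E}_p[\log q(X)] \;\ge\; 0,
```

with equality only when q is the same distribution as p.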
This is pretty heady abstract stuff,
so I'm going to translate it into the language of statistical modeling.
When we are using data in statistics to learn about how something works,
we are explicitly or implicitly fitting probability models to the data
to approximate the true model that generated it.
If we knew the true probability generating mechanism, we could use the KL divergence
to quantify how far or how wrong the model is from the truth.
We could then try several models and find the one that is the closest
to the truth.
Akaike recognized this and plugged the true
and the model probability formulas into the KL divergence formula
and used a little algebra to see that the KL divergence had two terms.
The first term only contains the true probability- generating mechanism
for the data, which we can never know since we can only work with models.
However, this is a constant that is the same for all models
that you fit to the data
as long as we play by a couple simple rules.
The second term is what Akaike discovered to be empirically estimable,
and with a lot of math he found a simple formula to estimate it.
In particular,
he discovered that two times the KL divergence is equal to a constant
that is the same for all models, plus two times the negative log likelihood
of the data used to fit the model, plus two times the number of parameters.
Everything had been multiplied by a factor of two just to follow the same convention
as a likelihood ratio test.
Since the constant term is the same for all models as long as we don't change
the response data, we can fit several models,
and the one whose AIC is the smallest is the one that is estimated
to have the smallest KL divergence from the truth, and in a sense is the one
that is the least wrong.
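Written loosely (and suppressing the expectations over new data and over the estimated parameters that make the derivation technical), the decomposition just described is:

```latex
2\,D_{\mathrm{KL}}\bigl(p \,\|\, q_{\hat{\theta}}\bigr)
  \;\approx\;
  \underbrace{2\,\mathbb{E}_p[\log p(X)]}_{\text{same constant for every model}}
  \;+\;
  \underbrace{\bigl(-2\log\hat{L} + 2k\bigr)}_{\mathrm{AIC}} .
```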
Using the AIC for model selection is entirely analogous to there being
a collection of islands and you want to know which of the islands you know of
is closest to another island that you know you'll never be able
to get to.
The direct solution to this problem would be to calculate the distances
from each of the islands to the one that we want to get close to.
Now, what if the island we wanted to get close to was surrounded
by a circular high fence that we could approach?
The island is perfectly in the middle of the fence,
so the distance from the center of the island to the fence
is always the same.
But the fence was far enough away from the island it enclosed
that we couldn't see the island or measure the distance
from the fence to it.
We can still estimate the distance from each island to the fence.
Because the main island is in the center of the fence,
we know that the island closest to the fence is the closest island.
This is exactly the situation with the AIC.
With the AIC, we can estimate the distance from the truth to each of the models.
Each AIC estimate is off by the same amount.
While we can't estimate the absolute distance of the models
from the truth, we can know which model is the closest
in a relative sense.
The original AIC is based
on the likelihood of the training data plus a parameter penalty.
The training likelihood assesses the goodness-of- fit of the model.
We can't use this term by itself though, because it is biased downward
as the model parameters were chosen to minimize
the negative log likelihood.
With a lot of math, Akaike derived a very simple expression
that corrects for this bias.
The original penalty is just 2K,
where K is the total number of estimated parameters.
For linear regression with a slope
and an intercept, we also have to count the variance.
For that case you would have K equals three and not two.
There are important assumptions that led to the 2K penalty.
We can characterize them loosely as saying that the model has to be reasonably good.
The AIC is still going to be robust, however, because if a model is bad,
then the likelihood component will be large and will dominate
the penalty.
Amongst the good models, the 2K term will favor the smaller ones as long
as the sample size is large.
However, it didn't take long for people to find that this original AIC
often chooses models that overfit in small samples,
so a more accurate, higher-order approximation to the bias was derived.
When this extra term is added,
the criterion becomes known as the AICc, or the corrected AIC.
Unfortunately, the reputation that the AIC overfits
had become commonplace before the correction was discovered
and widely known about.
The correction becomes infinite as K approaches N,
pushing the model selection criterion away from models that are nearly saturated.
Notice also that the correction term goes to zero as N goes to infinity.
In large samples the AIC and AICc are equivalent.
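The corrected criterion has the form:

```latex
\mathrm{AICc}
  = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}
  = -2\log\hat{L} + 2k + \frac{2k(k+1)}{n-k-1},
```

so the correction blows up as k approaches n - 1 and vanishes as n grows.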
The AICc is what we report in JMP because it works well for small samples,
and although it was derived for Gaussian distributions,
experience suggests that it's good enough with other commonly used distributions.
Now I'm going to illustrate the AICc
in a real example that was a five-factor central composite design with 31 runs,
and the response was the amount of pDNA produced by a bioreactor.
I'll illustrate the AICc using the generalized regression platform,
giving it a full response surface model
with all main effects, interactions, and second-order terms.
I fit four models to the data.
One is a full response surface model using least squares that was fit automatically.
Then I use forward selection
under the normal, logNormal, and exponential distributions.
I chose the exponential distribution to illustrate poor model fit.
The models had 22, 9, 9, and 1 parameters, respectively,
and the model with the lowest AICc was the LogNormal, with an AICc
of about 334.8.
We can break the AIC and AICc calculations
down to see how different parts of the penalty are contributing.
The full least squares model
has the lowest negative log-likelihood, and therefore the best fit
to the training data, but the highest AICc overall.
When we look at the second-order corrections and the original AIC values,
we see that it's the second-order correction term that is pushing
the model selection criterion to be very large for this model.
The LogNormal forward selection negative log-likelihood is a little lower
than the normal forward selection one.
They both have nine parameters, so their penalties are the same,
and the LogNormal forward selection model has the lower AICc.
The exponential forward selection model has the poorest model fit as measured
by the log likelihood, but also only has one parameter
in the model.
Overall it has the smallest penalty contribution to the AICc.
But the poor fit of the model is such that the likelihood dominates
and the exponential model is the second from the worst as measured by the AICc.
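Since the penalty pieces depend only on the parameter counts and the 31 runs, they are easy to reproduce; here is a small Python sketch of just the penalty side of the table (the log-likelihood values themselves aren't reproduced here).

```python
# Penalty components of the AIC and AICc for the four pDNA models (n = 31 runs).
# Only the parameter counts come from the example; likelihoods are not shown.
n = 31
models = {
    "full RSM, least squares": 22,
    "normal, forward selection": 9,
    "lognormal, forward selection": 9,
    "exponential, forward selection": 1,
}

for name, k in models.items():
    aic_penalty = 2 * k
    correction = 2 * k * (k + 1) / (n - k - 1)   # second-order AICc term
    print(f"{name:32s} 2k = {aic_penalty:4d}   "
          f"correction = {correction:7.1f}   total = {aic_penalty + correction:7.1f}")
```

For the 22-parameter model the correction alone is 126.5, which is what pushes its AICc to the top of the table.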
If you review the general derivation
of the AIC in the Burnham and Anderson book,
you'll see that what it actually estimates is the expected value
of a hypothetical test set likelihood for a data set that has the same size
and response structure, but not the same values, as the training set.
The expected values also take into consideration the variability
in the estimate of the MLE.
I find this cross-validation interpretation of the AIC
to be pretty compelling.
I think it's also important to point out that this cross-validation derivation
of the AIC does not assume at all that we have the correct model.
To show that this cross-validation interpretation really works,
I created a simulation formula using an average of the models I've shown
in the previous slides as well as some other ones.
This way we knew that none of the models were actually the correct one.
I fit each of the four models to new training data a thousand times
and set it up so that JMP would report an independent holdout likelihood
using another new data set.
I kept each of the four models' structures
and distributions intact and did not apply variable selection.
This was to perfectly mimic the exact cross-validation interpretation
of the AIC.
From there, I created a table of simulated
holdout likelihoods and computed their average for each of the four models.
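The actual simulation was done in JMP with the pDNA-based formula, but the logic is easy to sketch in Python with a generic normal linear model standing in for the real one; everything here (the design, the coefficients, the sample size of 31) is a placeholder.

```python
# Sketch of the cross-validation interpretation of the AICc: average
# -2 * (holdout log-likelihood) over many train/test pairs of the same size
# and compare it with the average AICc computed from the training fits alone.
import numpy as np

rng = np.random.default_rng(7)
n, p = 31, 6                      # 31 runs, intercept + 5 slopes (placeholders)

def neg2_loglik(y, mu, sigma):
    return np.sum(np.log(2 * np.pi * sigma**2) + (y - mu)**2 / sigma**2)

aicc_vals, holdout_vals = [], []
for _ in range(1000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    beta = rng.normal(size=p)
    y_train = X @ beta + rng.normal(size=n)
    y_test = X @ beta + rng.normal(size=n)       # same size and structure, new values

    b_hat = np.linalg.lstsq(X, y_train, rcond=None)[0]    # MLE of the coefficients
    s_hat = np.sqrt(np.mean((y_train - X @ b_hat) ** 2))  # MLE of sigma
    k = p + 1                                             # count sigma as well

    aicc_vals.append(neg2_loglik(y_train, X @ b_hat, s_hat)
                     + 2 * k + 2 * k * (k + 1) / (n - k - 1))
    holdout_vals.append(neg2_loglik(y_test, X @ b_hat, s_hat))

print("mean AICc                :", round(np.mean(aicc_vals), 1))
print("mean -2 * holdout loglik :", round(np.mean(holdout_vals), 1))
```

The two averages should land close to each other, which is the sense in which the AICc estimates a holdout likelihood.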
This is the AIC and AICc summary table from before,
with the simulation-based average holdout log-likelihoods
added over here to the right.
You can see that the full normal model's holdout likelihood
is very close to its AICc value
and that the second-order correction term was essential for this match to happen.
On the other hand,
you see that the simulated average exponential holdout log likelihood
is also very close to the AICc.
Both the normal and LogNormal holdout likelihoods are close
to the original LogNormal model's AICc.
The normal holdout likelihood is a little smaller.
I attribute this to averaging a bunch of simulation models,
making the simulated data a little bit more normally distributed
than the original data was.
There are a couple simple rules that are needed to make AICc comparisons
really valid between different models.
The most important is that the stochastic part of the data
has to stay the same: the same rows have to be used,
and it is the Y's in particular that must be the same.
The X's can be different, of course, even if they were originally random.
Not only must the Y's be the same, but they can't be changed or transformed;
any transform would have to be built into the model appropriately.
The AIC is also only defined
for well-behaved maximum likelihood estimators
and other closely related methods.
This explains why you don't see AICc
for neural networks and other machine learning models.
Also, you have to keep in mind that just because you found a model that the AICc
says is the best, it doesn't mean that it is a good model.
Use your past experience and model diagnostic plots to ensure
that the model is right enough to be useful.
Returning to the pDNA data, we see two equivalent models.
On the top, we have a LogNormal model,
and on the bottom we have a normal fit to the log-transformed response.
You can see that the generalized R-squares
are the same for these two models, but the AICcs are very different.
This is because the LogNormal fit
implicitly builds the transform into the likelihood,
but the log-scale normal fit does not.
In this case, the right thing to use is the LogNormal.
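The reason the two AICc values differ by so much is just the Jacobian of the log transform. If Z = log Y and the normal model is fit to Z, the LogNormal likelihood of the original Y is related to the normal likelihood of Z by:

```latex
f_Y(y) = \frac{1}{y}\, f_Z(\log y)
\quad\Longrightarrow\quad
\log L_{\mathrm{LogNormal}}(y) = \log L_{\mathrm{Normal}}(\log y) - \sum_i \log y_i ,
```

so the two fits are equivalent but their likelihoods, and therefore their AICcs, differ by 2 times the sum of the log responses. Only the LogNormal version keeps the response on its original scale and stays comparable with the other models fit to Y.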
Here's a quick demonstration that you have to decide the distribution
and the input variables at the same time.
Here is simulated data from a t-test type model:
two groups of normally distributed data with the same variance
but different means.
If you run fit all in the distribution platform,
it chooses the normal 2 mixture with an AICc of 1036.
This is the correct distribution
if you don't know the group identity of the rows.
Once you include the grouping variable, though,
you see that the normal comes out on top with an AICc of 717 or so.
We also tried the Weibull, LogNormal,
and gamma, and the normal still came out on top, even though those distributions
did better in the distribution platform without including the grouping variable.
You'd have to try different model structures
and distributions together to find the right combination.
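Here is a small Python sketch of the same experiment with freshly simulated two-group data, so the AICc values will not match the 1036 and 717 quoted above; it just shows that the structure (the grouping variable) and the distribution have to be judged together. The mixture fit uses scikit-learn rather than JMP.

```python
# Sketch: choose the distribution and the model structure together.
# Simulated two-group normal data; AICc values won't match the talk's numbers.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def aicc(logL, k, n):
    return -2.0 * logL + 2 * k + 2 * k * (k + 1) / (n - k - 1)

rng = np.random.default_rng(3)
g = np.repeat([0, 1], 100)                               # group labels
y = np.where(g == 0, 10.0, 14.0) + rng.normal(scale=2.0, size=g.size)
n = y.size

# 1) Ignore the grouping variable: single normal vs. two-component normal mixture.
mu, sd = stats.norm.fit(y)
logL_norm = np.sum(stats.norm.logpdf(y, mu, sd))
gm = GaussianMixture(n_components=2, random_state=0).fit(y.reshape(-1, 1))
logL_mix = gm.score(y.reshape(-1, 1)) * n                # total log-likelihood
print("ignoring group: normal AICc    =", round(aicc(logL_norm, 2, n), 1))
print("ignoring group: 2-mixture AICc =", round(aicc(logL_mix, 5, n), 1))  # 2 means + 2 variances + 1 weight

# 2) Include the grouping variable: normal with a separate mean per group.
resid = y - np.where(g == 0, y[g == 0].mean(), y[g == 1].mean())
sd_pooled = np.sqrt(np.mean(resid**2))                   # pooled MLE of sigma
logL_group = np.sum(stats.norm.logpdf(resid, 0.0, sd_pooled))
print("with grouping:  normal AICc    =", round(aicc(logL_group, 3, n), 1))  # 2 means + 1 sigma
```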
Now I'm going to change gears and talk a little bit about the BIC,
which is the other main analytical model selection criterion in JMP.
The BIC is motivated in a completely different way
than the AIC.
Schwarz used a large-sample argument in a Bayesian context to approximate
the log probability of the data after having integrated the model parameters out.
Assuming a flat prior on the parameters, an expression similar to the AIC pops out,
with a K log N type penalty term rather than 2K.
There were also other terms in the integral that are always ignored.
One is K log 2π, which was considered too small
to deal with, and the other
is a normalized variance of the MLE, which would also be of order K.
I didn't study the AIC or BIC in any depth in school.
I just remember hearing the refrain "AIC overfits,
BIC underfits" several times in different classes,
which I interpreted as a strong skepticism about both of them.
Comparing the AICc and BIC penalties, we see that the AICc will prevent
big models from being chosen when the sample size is small,
whereas the BIC will still allow large models.
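Plugging in the counts from the pDNA example makes the contrast concrete: for the 22-parameter full model with n = 31,

```latex
\text{AICc penalty} = 2(22) + \frac{2(22)(23)}{31-22-1} = 44 + 126.5 = 170.5,
\qquad
\text{BIC penalty} = 22\,\log 31 \approx 75.6 ,
```

so the BIC penalizes the nearly saturated model far less harshly than the AICc does in this small sample.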
I see the K log N and normalization constant penalty in the BIC as somewhat
less compelling than the cross-validation interpretation of the AIC-type penalties.
Something that leads
to a marginal probability of the data is more abstract to me than something
that is directly interpretable as a cross-validation metric
taking into account parameter uncertainty.
I'm fully aware
that I'm editorializing here, but this is what's worked well
for me so far.
Returning to the pDNA DoE one more time.
Here are the same models fit in the pDNA example
using BIC forward selection on top and the AICc on the bottom.
Notice that the BIC of the full normal model
is not as far away from the other models as with the AICc.
The best model overall as rated by the BIC is a logNormal,
but with 13 parameters this time around rather than nine.
The forward selected BIC normal model also has a couple more parameters.
In small samples, contrary to the "AIC overfits, BIC underfits" refrain,
the AICc can choose smaller models than the BIC.
Here we see the effects chosen by the BIC and the AICc.
The set of BIC-selected effects is a superset of the ones chosen
by the AICc.
Also notice, interestingly, that all four effects not chosen
by the AICc are statistically significant under the BIC.
Under the BIC, the pH squared term is highly significant,
but it isn't present in the AICc model, for example.
I'm not sure I would say that all of the significant effects
should have asterisks by them,
but all significant p-values have asterisks by them in JMP reports.
Instead, I'll just say that I take p-values of effects
after selection with a grain of salt.
Although the two models choose different effects,
some of them highly statistically significant,
if we look at the profiler variable importance from these two models,
they tell a very similar story.
Feed rate is by far the most important
and after that the ordering is the same between the two models.
pH only impacts 3% of the variation in the response surface under the BIC
best model and isn't included at all in the AICc best model.
This is a very clear example
of statistical significance and practical relevance
being two different things.
There are a lot of opinions out there about the AICc and the BIC.
For example, Burnham and Anderson
say that both methods are consistent for the quasi-true model as N goes
to infinity.
But then there's others that say
that the BIC is the only one consistent for the truth.
Burnham and Anderson say that you can set up simulations to make one look good,
change the way it's set up a little bit and it'll flip the results.
Burnham and Anderson,
who are about the most diehard AICc fans out there,
found in their simulations that the AICc chooses fewer really bad models than the BIC.
I think it's not a bad idea to look at both BIC and AICc
after applying variable selection.
If the best models under both are pretty much the same, which is often the case,
you can feel pretty good about either of them.
If they're different, it's good to think about the reasons why
and use your subject matter expertise to help make a decision.
My last topic is model selection criteria in linear mixed models.
This is a pretty complicated situation, especially because there isn't consensus
between software vendors in how to compute the model selection criteria.
To illustrate this, I created a split plot design with four factors.
There are two whole plot effects and two split plot effects.
If you take the same data and fit the same model in JMP Pro
and SAS, using fit mixed and proc mixed,
you will see that the likelihoods and model selection criteria don't match,
but the variance estimates do.
You get different fixed-effects parameter estimates, but the fixed-effects tests agree.
One of the reasons for this is that JMP and SAS fixed effects design matrices
use a different coding strategy for categorical effects.
On the left I have the JMP design matrix for the split plot example,
and on the right you see the SAS one.
JMP creates a row of minus ones
for the last level of categorical effects, which is seen in blue here,
whereas SAS creates a row of zeros.
Neither one of these is right or wrong.
It's like changing units or changing coordinate systems.
JMP categorical effects sum to zero, whereas SAS categorical effects
can be interpreted as differences from the last level.
Although the raw parameter estimates differ,
predictions will be the same between the two codings because the models
are fundamentally equivalent.
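A tiny numerical sketch of this, with a single three-level factor rather than the actual split-plot design, shows the point: the two codings give different coefficients but identical fitted values.

```python
# Effect coding (last level = row of minus ones, JMP-style) vs. reference coding
# (last level = row of zeros): different coefficients, identical fitted values.
import numpy as np

rng = np.random.default_rng(5)
levels = np.tile([0, 1, 2], 10)                        # 30 rows of a 3-level factor
y = np.array([2.0, 3.5, 1.0])[levels] + rng.normal(scale=0.5, size=levels.size)

def effect_coding(lv, n_levels):
    # columns for levels 0..L-2; the last level gets -1 in every column
    X = np.zeros((lv.size, n_levels - 1))
    for j in range(n_levels - 1):
        X[:, j] = (lv == j).astype(float) - (lv == n_levels - 1)
    return X

def reference_coding(lv, n_levels):
    # columns for levels 0..L-2; the last level gets a row of zeros
    X = np.zeros((lv.size, n_levels - 1))
    for j in range(n_levels - 1):
        X[:, j] = (lv == j).astype(float)
    return X

for name, coder in [("effect", effect_coding), ("reference", reference_coding)]:
    X = np.column_stack([np.ones(y.size), coder(levels, 3)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"{name:9s} coefficients: {np.round(beta, 3)}"
          f"   first fitted values: {np.round((X @ beta)[:3], 3)}")
```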
Most things that matter
won't be different between the two software products.
However, REML, the method used to estimate mixed effects models,
has an ambiguity in it.
The base Gaussian likelihood at the top
will be the same in either software because it's a real likelihood.
But the REML or residual likelihood reported by proc mixed
and JMP Pro's fit mixed isn't a real likelihood.
If it was a real likelihood, then we would get the same values
regardless of which coding or software we used.
This is because there's an extra penalty added to the Gaussian likelihood
for REML that reduces the bias of the variance estimates.
But this depends on the design matrix
in a way that is sensitive to the coding used.
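For the curious, the source of the coding sensitivity can be seen in the standard expression for the REML criterion (this is a generic statement of REML, not a quote of either vendor's documentation). With V(θ) the marginal covariance implied by the variance parameters, X the fixed-effects design matrix with p columns, and r the residual from the generalized least squares fit,

```latex
-2\,\ell_{\mathrm{REML}}(\theta)
  = \log\lvert V(\theta)\rvert
  + \log\lvert X^{\top}V(\theta)^{-1}X\rvert
  + r^{\top}V(\theta)^{-1}r
  + (n-p)\log 2\pi ,
\qquad r = y - X\hat{\beta}(\theta).
```

Recoding the categorical effects replaces X with XA for some nonsingular matrix A, which leaves r unchanged but shifts the log-determinant term by the constant 2 log |det A|. The variance estimates that maximize this quantity are therefore the same under either coding, but the reported "likelihood" value is not.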
JMP reports the raw Gaussian likelihood, and the AICc and BIC that it reports
are based on that rather than the residual likelihood.
The number of parameters fit mixed counts
is the total including both fixed effects and variance parameters.
We did it this way to make it so that you can use JMP to compare models
with different fixed-effect structures as well as variance models.
In SAS, they only report the residual or REML log-likelihood,
and the model selection criteria are based on it.
You can see here that it also only counts variance parameters,
because the difference between the SAS -2 residual log-likelihood
and its AIC is four, implying two parameters:
a variance component and a residual variance.
All this means is that
you can only use proc mixed for comparing variance models
with the AIC, because its model selection criteria
include the REML penalty and only count variance parameters.
With all due respect,
I can think of some good reasons for the SAS approach,
and there are probably some other good reasons I don't even know of.
But I personally prefer the flexibility afforded by the JMP approach.
To summarize, if you compare results across software
for non-mixed models, the mean parameter estimates may differ,
but otherwise everything else should be the same,
as long as the software computes
the constants in the likelihood correctly, as JMP does.
When we get to Gaussian mixed models,
there are very important software differences
and the scope of the decisions you can make about the models
using the software may be very different
depending on the details of how its likelihood is calculated.
JMP model selection criteria
are comparable both within the same platform
and across other modeling platforms.
I'll close with this slide, which gives my basic recommendations
for applying the tools discussed today.
Hypothesis testing is a tool for when you need to prove something
and is best used in situations when you have a good idea
of the model structure in advance.
When you're working on a problem in industry and the sample size is small,
I would stick to classical statistical models
and use the AICc as the primary tool for choosing between them.
With larger data sets, when I have enough data to hold out
at least a third of the observations, I use holdout cross-validation to compare
classical statistical models as well as machine learning models.
In my own work, I tend to avoid K-fold cross-validation
and its variants.
The model selection criteria are equivalent to it in larger samples,
and I tend to stick with simpler models with smaller data sets.
I know that not everyone is going to agree with me on this,
but this is what works for me and is a pretty safe way
to approach model selection.
Choosing the most useful model from a set of alternatives that must all be wrong
on some level is an important decision, and these are the main considerations
I have when deciding upon a model selection strategy.
Thank you for your attention and I look forward to talking with you
in the Meet the Expert Sessions.