Hello, I'm Ron Kennett.
This is a joint talk with Chris Gotwalt.
We're going to talk to you about models.
Models are used extensively.
We hope to bring some additional perspective
on how to use models in general.
We call the talk
"Different goals, different models:
How to use models to sharpen your questions."
My part will be an intro, and I'll give an example.
You'll have access to the data I'm using.
Then Chris will continue with a more complex example,
introduce the SHAP values available in JMP 17,
and provide some conclusions.
We all know that all models are wrong,
but some are useful.
Sam Karlin said something different.
He said that the purpose of models is not to fit the data,
but to sharpen the question.
Then there's Pablo Picasso, who said something.
He died in 1973, so you can imagine when this was said.
I think it was in the early '70s.
"Computers are useless. They can only give you answers."
He is more in line with Karlin.
My take on this is that
this presents the key difference between a model and a computer program.
I'm looking at the model from a statistician's perspective,
dealing with Box's famous statement.
Yes, some are useful, but which ones?
Please help us.
What do you mean by "some are useful"?
It's not very useful just to say that.
Going to Karlin, "Sharpen the question."
Okay, that's a good idea.
How do we do that?
The point is that Box seems focused on the data analysis phase
in the life cycle view of statistics, which starts with problem elicitation,
moves to goal formulation, data collection, data analysis,
findings, operationalization of findings,
communication of findings, and impact assessment.
Karlin is more focused on the problem elicitation phase.
These two quotes of Box and Karlin
refer to different stages in this life cycle.
The data I'm going to use is an industrial data set,
174 observations.
We have sensor data.
We have 63 sensors.
They are labeled 1, 2, 3, up to 63.
We have two response variables.
These are coming from testing some systems.
The status report is fail/pass.
52.8% of the systems that were tested failed.
We have another report,
which is a more detailed report on test results.
When the systems fail,
we have some classification of the failure.
Test result is more detailed.
Status is go/no go.
The goal is to determine the system status from the sensor data
so that we can maybe avoid the costs and delays of testing,
and we can have some early predictions on the fate of the system.
One approach we can take is to use a boosted tree.
We put the status as the response and the 63 sensors as the X factors.
The boosted tree is trained sequentially, one tree at a time.
The other model we're going to use is random forest,
and that's done with independent trees.
There is a sequential aspect in boosted trees
that is different from random forests.
The setup of boosted trees involves three parameters:
the number of trees, depth of trees, and the learning rate.
This is what JMP gives as a default.
Boosted trees can be used with most objective functions.
We could use them for Poisson regression,
which deals with counts, something that is a bit harder to achieve
with random forests.
We're going to focus on these two types of models.
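For anyone following along outside JMP, here is a minimal scikit-learn sketch of these two model setups; the parameter values are illustrative placeholders, not JMP's defaults.

```python
# Minimal sketch of the two tree ensembles discussed above (scikit-learn,
# not JMP); parameter values are placeholders, not JMP's defaults.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Boosted tree: trees are grown sequentially, each one trained on the
# errors of the ensemble so far. The three setup parameters are the
# number of trees, the depth of each tree, and the learning rate.
boosted_tree = GradientBoostingClassifier(
    n_estimators=50,    # number of trees
    max_depth=3,        # depth of each tree
    learning_rate=0.1,  # shrinkage applied to each new tree
)

# Bootstrap (random) forest: independent trees, each grown on a bootstrap
# sample, with a random subset of features considered at every split.
bootstrap_forest = RandomForestClassifier(
    n_estimators=100,     # number of independent trees
    max_features="sqrt",  # features considered at each split
)
```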
When we apply the boosted tree,
we have a validation setup
with 43 systems drawn randomly as the validation set.
The remaining 131 systems are used for the training set.
We are getting a 9.3% misclassification rate.
Three failed systems,
which we know failed because we have it in the data,
were actually classified as pass.
Of the 20 that passed, 19 were classified as pass.
The rate of falsely predicted passes is 13%.
We can look at the column contributions.
We see that Sensors 56, 18, 11, and 61 are the top four
in terms of contributing to this classification.
We see that in the training set, we had zero misclassification.
We might have some overfitting in this boosted tree application.
If we look at the lift curve,
for 40% of the systems we can get a lift of over two,
which is the performance that this classifier gives us.
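As a hedged sketch of this evaluation step, here is the same kind of train/validation split, misclassification rate, and lift calculation in Python; `X`, `y`, and the class label "Fail" are assumed names for the sensor table, the status column, and the failure level.

```python
# Sketch of the validation split, misclassification rate, and lift.
# X (174 x 63 sensors) and y (pass/fail status) are assumed to be a
# pandas DataFrame and Series loaded beforehand; "Fail" is a placeholder
# for the actual failure label.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=43, random_state=0)  # 131 training rows, 43 validation rows

boosted_tree = GradientBoostingClassifier().fit(X_train, y_train)
misclassification = 1.0 - boosted_tree.score(X_valid, y_valid)

def lift_at(model, X, y, target, frac):
    """Lift: rate of `target` among the top-scored `frac` of rows,
    divided by its overall rate."""
    idx = list(model.classes_).index(target)
    prob = model.predict_proba(X)[:, idx]
    top = np.argsort(prob)[::-1][: max(1, int(frac * len(prob)))]
    return np.mean(np.asarray(y)[top] == target) / np.mean(np.asarray(y) == target)

print(misclassification, lift_at(boosted_tree, X_valid, y_valid, "Fail", 0.40))
```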
If we try the bootstrap forest,
another option, again, we do the same thing.
We use the same validation set.
The defaults of JMP give you some parameters
for the number of trees
and the number of features to be selected at each node.
This is how the random forest works.
You should be aware that this is not very good
if you have categorical variables and missing data,
which is not our case here.
Now, the misclassification rate is 6.9%, lower than before.
On the training set, we had some misclassification.
The random forest applied to the detailed test result,
which means when we have the details on the failures, gives 23.4% misclassification,
so poorer performance.
Also, on the training set, we have 5% misclassification.
But we now have a wider range of response categories,
and that explains some of the lower performance.
In the lift curve on the test results,
we can actually, with quite good performance,
pick up the top 10% of the systems with a lift of above 10.
So we have over a tenfold increase for 10% of the systems
relative to the grand average.
Now this poses a question (remember the topic of the talk):
what are we looking at?
Do we want to identify top score good systems?
The random forest would do that with the test result.
Or do we want to predict a high proportion of pass?
The boosted tree would offer that.
A secondary question is to look at what is affecting this classification.
We can look at the column contributions on the boosted tree.
Three of the four top variables show up also on the random forest.
If we use the status pass/fail,
or the detailed results,
there is a lot of similarity on the importance of the sensors.
This is just giving you some background.
Chris is going to follow up with an evaluation of the sensitivity
of this variable importance, the use of SHAP values,
and more interesting stuff.
This goes back to questioning what is your goal,
and how is the model helping you figure out the goal
and maybe sharpening the question that comes from the statement of the goal.
Chris, it's all yours.
Thanks, Ron.
I'm going to pick up from where Ron left off
and seek a model that will predict whether or not a unit is good,
and if it isn't, what the likely failure mode is.
This would be useful in that if a model is good at predicting good units,
we may not have to subject them to much further testing.
If the model gives a predicted failure mode,
we're able to get a head start on diagnosing and fixing the problem,
and possibly, we may be able to get some hints
on how to improve the production process in the future.
I'm going to go through the sequence
of how I approached answering this question from the data.
I want to say at the outset that this is simply the path that I took
as I asked questions of the data and acted on various patterns that I saw.
There are many other ways that one could proceed with this.
There's often not really a truly correct answer,
just a criterion for whether or not the model is good enough,
and the amount that you're able to get done
in the time that you have to get a result back.
I have no doubt that there are better models out there
than what I came up with here.
Our goal is to show an actual process of tackling a prediction problem,
illustrating how one can move forward
by iterating through cycles of modeling and visualization,
followed by observing the results and using them to ask another question
until we find something of an answer.
I will be using JMP as a statistical Swiss army knife,
using many tools in JMP
and following the intuition I have about modeling data
that has built up over many years.
First, let's just take a look
at the frequencies of the various test result categories.
We see that the largest and most frequent category is Good.
We'll probably have the most luck being able to predict that category.
On the other hand, the SOS category has only two events
so it's going to be very difficult for us to be able to do much with that category.
We may have to set that one aside.
We'll see about that.
Velocity II, IMP, and Brake
are all fairly rare with five or six events each.
There may be some limitations in what we're able to do with them as well.
I say this because we have 174 observations
and we have 63 predictors.
So we have a lot of predictors for a very small number of observations,
which is actually even smaller when you consider the frequencies
of some of the categories that we're trying to predict.
We're going to have to work iteratively by doing visualizations and modeling,
recognizing patterns, asking questions,
and then acting on those with another modeling step iteratively
in order to find a model that's going to do a good job
of predicting these response categories.
I have the data sorted by test results,
so that the good results are at the beginning,
followed by the data for each of the different failure modes after that.
I went ahead and colored each of the rows by test results so that we can see
which observation belongs to a particular response category.
So then I went into the model-driven multivariate control chart
and I brought in all of the sensors as process variables.
Since I had the good test results at the beginning,
I labeled those as historical observations.
This gives us a T² chart.
It's chosen 13 principal components as its basis.
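For those curious about what the chart computes, here is a rough numpy/scikit-learn analogue of a PCA-based T² statistic; the platform does considerably more than this, and `X_good` and `X_all` are assumed variable names.

```python
# Rough analogue of the PCA-based T-squared statistic behind the chart.
# X_good holds the sensor rows labeled Good (the "historical" data) and
# X_all holds all sensor rows; both are assumed to exist already.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

k = 13  # number of principal components, matching the chart's choice
scaler = StandardScaler().fit(X_good)
pca = PCA(n_components=k).fit(scaler.transform(X_good))

scores = pca.transform(scaler.transform(X_all))
# Hotelling's T^2: squared component scores weighted by their variances.
t_squared = np.sum(scores**2 / pca.explained_variance_, axis=1)
```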
What we see here
is that the chart is dominated by these two points right here
and all of the other points are very small in value
relative to those two.
Those two points happen to be the SOS points.
They are very serious outliers in the sensor readings.
Since we also only have two observations of those,
I'm going to go ahead and take those out of the data set
and say, well, SOS is obviously so bad that the sensors
should be just flying off the charts.
If we encounter it, we're just going to go ahead
and concern ourselves with the other values
that don't have this off-the-charts behavior.
Switching to a log scale, we see that the good test results
are fairly well-behaved.
Then there are definite signals
in the data for the different failure modes.
Now we can drill down a little bit deeper,
taking a look at the contribution plots for the historical data,
the good test result data, and the failure modes
to see if any patterns emerge in the sensors that we can act upon.
I'm going to remove those two SOS observations
and select the good units.
If I right-click in the plot,
I can bring up a contribution plot
for the good units, and then I can go over to the units
where there was a failure, and I can do the same thing,
and we'll be able to compare the contribution plots side by side.
So what we see here are the contribution plots
for the pass units and the fail units.
The contribution plots
are the amount that each column is contributing to the T²
for a particular row.
Each of the bars there corresponds to an individual sensor for that row.
The contribution bars are colored green when that column is within three sigma,
using an individuals and moving range chart,
and red if it's out of control.
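As a sketch of that three-sigma rule, here is the textbook individuals-and-moving-range calculation in Python, offered as an approximation of the coloring logic rather than the platform's exact implementation.

```python
# Standard individuals-and-moving-range (I-MR) three-sigma check for one
# sensor column; an approximation of the rule that colors the bars.
import numpy as np

def out_of_control(x):
    x = np.asarray(x, dtype=float)
    mr_bar = np.mean(np.abs(np.diff(x)))  # average moving range
    sigma_hat = mr_bar / 1.128            # sigma estimate (d2 for n=2)
    center = np.mean(x)
    lcl, ucl = center - 3 * sigma_hat, center + 3 * sigma_hat
    return (x < lcl) | (x > ucl)          # True where the column is "red"
```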
Here we see most of the sensors are in control for the good units,
and most of the sensors are out of control for the failed units.
What I was hoping for here
would have been if there was only a subset of the columns
or sensors that were out of control over on the failed units.
Or if I was able to see patterns
that changed across the different failure modes,
which would help me isolate what variables are important
for predicting the test result outcome.
Unfortunately, pretty much all of the sensors
are in control when things are good,
and most of them are out of control when things are bad.
So we're going to have to use some more sophisticated modeling
to be able to tackle this prediction problem.
Having not found anything in the column contributions plots,
I'm going to back up and return to the two models that Ron found.
Here are the column contributions for those two models,
and we see that there's some agreement in terms of
what are the most important sensors.
But the boosted tree found a somewhat larger set of sensors as being important
than the bootstrap forest did.
Which of these two models should we trust more?
If we look at the overall model fit report,
we see that the boosted tree model has a very high training RSquare of 0.998
and a somewhat smaller validation RSquare of 0.58.
This looks like an overfit situation.
When we look at the bootstrap forest, it has a somewhat smaller training RSquare,
perhaps a more realistic one, than the boosted tree,
and it has a somewhat larger validation RSquare.
The generalization performance of the random forest
is hopefully a little bit better.
I'm inclined to trust the random forest model a little bit more.
Part of this is going to be based upon just the folklore of these two models.
Boosted trees are renowned for being fast, highly accurate models
that work well on very large datasets.
Whereas the hearsay is that random forests are more accurate on smaller datasets.
They are fairly robust to messy and noisy data.
There's a long history of using these kinds of models
for variable selection that goes back to a paper in 2010
that has been cited almost 1200 times.
So this is a popular approach for variable selection.
I did a similar search for boosting,
and I didn't quite see as much history around variable selection
for boosted trees as I did for random forests.
For this given data set right here,
we can do a sensitivity analysis to see how reliable
the column contributions are for these two different approaches,
using the simulation capabilities in JMP Pro.
What we can do is create a random validation column
that is a formula column
that you can reinitialize and that will partition the data
into random training and holdout sets with the same proportions
as the original validation column.
We can have it rerun these two analyses
and keep track of the column contribution portions
for each of these repartitionings.
We can see how consistent the story is
between the boosted tree models and the random forests.
This is pretty easy to do.
We just go to the Make Validation Column utility
and when we make a new column, we ask it to make a formula column
so that it could be reinitialized.
Then we can return to the bootstrap forest platform,
right-click on the column contribution portion,
select Simulate.
It'll bring up a dialog
asking us which of the input columns we want to switch out.
I'm going to choose the validation column,
and I'm going to switch in, in its place,
this random validation formula column.
We're going to do this a hundred times.
Bootstrap forest is going to be rerun
using new random partitions of the data into training and validation.
We're going to look at the distribution of the portions
across all the simulation runs.
This will generate a dataset
of column contribution portions for each sensor.
We can take this and go into the graph builder
and take a look and see how consistent those column contributions are
across all these random partitions of the data.
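Outside of JMP Pro's Simulate feature, the same idea can be sketched in a few lines of Python: refit the forest over many random re-partitions and look at the spread of the importances. Note that scikit-learn's impurity-based importances are only an analogue of JMP's column contribution portions, not the same quantity.

```python
# Re-run the forest over 100 random training/validation partitions and
# collect the relative importances (scikit-learn's impurity-based
# importances, an analogue of the column contribution portions).
# X and y are the pandas DataFrame and Series assumed in earlier sketches.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

records = []
for rep in range(100):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=43, random_state=rep)
    rf = RandomForestClassifier(n_estimators=100, random_state=rep).fit(X_tr, y_tr)
    records.append(rf.feature_importances_)  # portions that sum to 1

portions = pd.DataFrame(records, columns=X.columns)
print(portions.mean().sort_values(ascending=False).head(10))  # most consistent contributors
```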
Here is a plot of the column contribution portions
from each of the 100 random reshufflings of the validation column.
Those are the points we see in gray here.
Sensor 18 seems to be consistently a big contributor, as does Sensor 61.
We also see with these red crosses,
those are the column contributions from the original analysis that Ron did.
The overall story that this tells is that the tendency
is that whenever the original column contribution was small,
those resimulated column contributions also tended to be small.
When the column contributions were large in the original analysis,
they tended to stay large.
We're getting a relatively consistent story from the bootstrap forest
in terms of what columns are important.
Now we can do the same thing with the boosted tree,
and the results aren't quite as consistent as they were with the bootstrap forest.
So here are a bunch of columns
where the initial column contributions came out very small
but they had a more substantial contribution
in some of the random reshuffles of the validation column.
That also happened quite a bit with Columns 52 through 55 over here.
Then there were also some situations
where the original column contribution was quite large,
and most, if not all,
of the other column contributions found in the simulations were smaller.
That happens here with Column 48,
and to some extent also with Column 11 over here.
The overall conclusion being that I think this validation shuffling
is indicating that we can trust the column contributions
from the bootstrap forest to be more stable than those of the boosted tree.
Based on this comparison, I think I trust the column contributions
from the bootstrap forest more,
and I'm going to use the columns that it recommended
as the basis for some other models.
What I'd like to do
is find a model that is both simpler than the bootstrap forest model
and performs better in terms of validation set performance
for predicting pass or fail.
Before proceeding with the next modeling step,
I'm going to do something that I should have probably done at the very beginning,
which is to take a look at the sensors in a scatterplot matrix
to see how correlated the sensor readings are,
and also look at histograms of them as well to see if they're outlier-prone
or heavily skewed or otherwise highly non-Gaussian.
What we see here is there is pretty strong multicollinearity
amongst the input variables generally.
We're only looking at a subset of them here,
but this high multicollinearity persists across all of the sensor readings.
This suggests that for our model,
we should try things like the logistic lasso,
the logistic elastic net, or a logistic ridge regression
as candidates for our model to predict pass or fail.
Before we do that, we should go ahead
and try to transform our sensor readings here
so that they're a little bit better-behaved and more Gaussian-looking.
This is actually really easy in JMP
if you have all of the columns up in the distribution platform,
because all you have to do is hold down Alt, choose Fit Johnson,
and this will fit Johnson distributions to all the input variables.
This is a family of distributions
that is based on a four-parameter transformation to normality.
As a result, we have a nice feature in there
that we can also broadcast using Alt Click,
where we can save a transformation from the original scale
to a scale that makes the columns more normally distributed.
If we go back to the data table,
we'll see that for each sensor column, a transform column has been added.
If we bring these transformed columns up with a scatterplot matrix
and some histograms,
we clearly see that the data are less skewed
and more normally distributed than the original sensor columns were.
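For reference, the same idea can be sketched with scipy; JMP's Fit Johnson chooses among the Su, Sb, and Sl families, while this sketch uses only the Su family for brevity, and `X` is an assumed DataFrame of the sensor columns.

```python
# Sketch of a Johnson-style transformation to normality using scipy.
# JMP chooses among the Johnson Su/Sb/Sl families; only Su is used here.
import numpy as np
from scipy import stats

def johnson_su_normalize(x):
    """Fit a Johnson Su distribution and return the implied z-scores."""
    x = np.asarray(x, dtype=float)
    a, b, loc, scale = stats.johnsonsu.fit(x)
    return a + b * np.arcsinh((x - loc) / scale)  # ~N(0,1) if the fit is good

X_normalized = X.apply(johnson_su_normalize)  # one transformed column per sensor
```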
Now the bootstrap forest model that Ron found
only really recommended a small number of columns
for use in the model.
Because of the high collinearity amongst the columns,
the subset that we got could easily be part
of a larger group of columns that are correlated with one another.
It could be beneficial to find that larger group of columns
and work with that at the next modeling stage.
An exploratory way to do this
is to go through the cluster variables platform in JMP.
We're going to work with the normalized version of the sensors
because this platform is PCA and factor analysis based,
and will provide more reliable results if we're working with data
that are approximately normally distributed.
Once we're in the variable clustering platform,
we see that there are very clear,
strong associations amongst the input columns.
It has identified that there are seven clusters,
and the largest cluster, the one that explains the most variation,
has 25 members.
The set of cluster members is listed here on the right.
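JMP's variable clustering is based on oblique principal components; as a rough stand-in outside JMP, one can hierarchically cluster the sensors on a correlation-based distance and cut the tree into seven clusters, as sketched below.

```python
# Rough stand-in for variable clustering: hierarchical clustering of the
# sensors on 1 - |correlation|, cut into seven clusters. This only
# approximates the grouping; JMP uses oblique principal components.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

corr = np.corrcoef(X_normalized.to_numpy().T)  # sensor-by-sensor correlations
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=7, criterion="maxclust")

# Whichever group gets label 1; not necessarily the largest cluster.
cluster_1 = [col for col, lab in zip(X_normalized.columns, labels) if lab == 1]
```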
Let's compare this with the bootstrap forest.
Here on the left, we have the column contributions
from the bootstrap forest model that you should be familiar with by now.
On the right, we have the list of members
of that largest cluster of variables.
If we look closely, we'll see that the top seven contributing terms
all happen to belong to this cluster.
I'm going to hypothesize that this set of 25 columns
are all related to some underlying mechanism
that causes the units to pass or fail.
What I want to do next is I want to fit models
using the generalized regression platform with the variables in Cluster 1 here.
It would be tedious if I had to go through
and individually pick these columns out and put them into the launch dialog.
Fortunately, there's a much easier way
where you can just select the rows in that table
and the columns will be selected in the original data table
so that when we go into the fit model launch dialog,
we can just click Add
and those columns will be automatically added for us as model effects.
Once I got into the Generalized Regression platform,
I went ahead and fit a lasso model, an elastic net model,
and a ridge model to compare them to each other,
and also to the logistic regression model that came up by default.
We're seeing that the lasso model is doing a little bit better than the rest
in terms of its validation Generalized RSquare.
The difference between these methods
is that there's different amounts of variable selection
and multicollinearity handling in each of them.
Logistic regression has no multicollinearity handling
and no variable selection.
The lasso is more of a variable selection algorithm,
although it has a little bit of multicollinearity handling in it
because it's a penalized method.
Ridge regression has no variable selection
and is heavily oriented around multicollinearity handling.
The elastic net is a hybrid between the lasso and ridge regression.
In this case, what we really care about
is just the model that's going to perform the best.
We allow the validation to guide us.
We're going to be working with the lasso model from here on.
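For a rough parallel outside the Generalized Regression platform, here is a scikit-learn sketch comparing lasso, ridge, and elastic net logistic regressions on the Cluster 1 sensors; the regularization strengths are placeholders rather than tuned values, and the score reported is validation accuracy rather than a generalized RSquare.

```python
# Compare lasso (L1), ridge (L2), and elastic net logistic regressions on
# the Cluster 1 sensors. C and l1_ratio are illustrative, not tuned the
# way the Generalized Regression platform tunes its penalties.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

candidates = {
    "lasso": LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000),
    "ridge": LogisticRegression(penalty="l2", solver="saga", C=1.0, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=1.0, max_iter=5000),
}
for name, clf in candidates.items():
    fit = make_pipeline(StandardScaler(), clf).fit(X_train[cluster_1], y_train)
    print(name, fit.score(X_valid[cluster_1], y_valid))  # validation accuracy
```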
Here's the prediction profiler for the lasso model that was selected.
We see that the lasso algorithm has selected eight sensors
as being predictive of pass or fail.
The profiler has some great built-in tools
for understanding what the important variables are,
both in the model overall and, new to JMP Pro 17,
we have the ability to understand
what variables are most important for an individual prediction.
We can use the variable importance tools to answer the question,
"What are the important variables in the model?"
We have a variety of different options for how this could be done.
Because we have multicollinearity in the data
and this is not a very large model,
I'm going to go ahead
and use the dependent resampled inputs technique,
and this gives us a ranking of the most important terms.
We see that Column 18 is the most important,
followed by Column 27 and then 52, all the way down.
We can compare this to the bootstrap forest model,
and we see that there's agreement that Variable 18 is important,
along with 52, 61, and 53.
But one of the terms that we have pulled in
because of the variable clustering step that we had done,
Sensor 27 turns out to be the second most important predictor
in this lasso model.
We've hopefully gained something by casting a wider net through that step.
We've found a term that didn't turn up
in either of the bootstrap forest or the boosted tree methods.
We also see that the lasso model has an RSquare of 0.9,
whereas the bootstrap forest model had an RSquare of 0.8.
We have a simpler model that has an easier form to understand
and is easier to work with,
and also has a higher predictive capacity than the bootstrap forest model.
Now, the variable importance metrics in the profiler
have been there for quite some time.
The question that they answer is, "Which predictors have the biggest impact
on the shape of the response surface over the data or over a region?"
In JMP 17 Pro, we have a new technique called SHAP Values
that is an additive decomposition of an individual prediction.
It tells you by how much each individual variable contributes
to a single prediction,
rather than talking about variability explained over the whole space.
The resolution of the question that's answered by Shapley values
is far more local than either the variable importance tools
or the column contributions in the bootstrap forest.
We can obtain the Shapley Values by going to the red triangle menu for the profiler,
and we'll find the option for them over here, fourth from the bottom.
When we choose the option, the profiler saves back SHAP columns
for all of the input variables to the model.
This is, of course, happening for every row in the table.
What you can see is that the SHAP values are giving you the effect
of each of the columns on the predictive model.
This is useful in a whole lot of different ways,
and for that reason, it's gotten a lot of attention in intelligible AI,
because it allows us to see
what the contributions are of each column to a black box model.
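Outside JMP, the open-source shap package computes the same kind of per-row decomposition; a minimal, hedged sketch for the fitted lasso model might look like this, where `fit` and `cluster_1` carry over from the earlier sketches.

```python
# Minimal sketch using the open-source shap package (not JMP's own
# implementation). `fit` is the fitted lasso pipeline from the earlier
# sketch and X_valid[cluster_1] its validation inputs.
import shap

explainer = shap.Explainer(fit.predict_proba, X_valid[cluster_1])  # model-agnostic explainer
shap_values = explainer(X_valid[cluster_1])  # one contribution per row and column
shap.plots.beeswarm(shap_values[:, :, 1])    # contributions toward the second class
```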
Here, I've plotted the SHAP values for the columns that are predictive
in the lasso model that we just built.
If we toggle back and forth between the good units and the units that failed,
we see the same story that we've been seeing
with the variable importance metrics for this,
that Column 18 and Column 27 are important in predicting pass or fail.
We're seeing this at a higher level of resolution
than we do with the variable importance metrics,
because each of these points corresponds to an individual row
in the original dataset.
But in this case, I don't see the SHAP Values
really giving us any new information.
I had hoped that by toggling through
the other failure modes, maybe I could find a pattern
to help tease out different sensors
that are more important for particular failure modes.
But the only thing I was able to find was that Column 18
had a somewhat stronger impact on the Velocity Type 1 failure mode
than the other failure modes.
At this point, we've had some success
using those Cluster 1 columns in a binary pass/fail model.
But when I broke out the SHAP Values
for that model, by the different failure modes
I wasn't able to discern a pattern or much of a pattern.
What I did next was I went ahead
and fit the failure mode response column, test result,
using the Cluster 1 columns,
but I went ahead and excluded all the pass rows
so that the modeling procedure would focus exclusively
on discerning which failure mode it is given that we have a failure.
I tried the multinomial lasso, elastic net, and ridge,
and I was particularly happy with the lasso model
because it gave me a validation RSquare of about 0.94.
Having been pretty happy with that,
I went ahead and saved the probability formulas
for each of the failure modes.
Now the task is to come up with a simple rule
that post processes that prediction formula
to make a decision about which failure mode.
I call this the partition trick.
The partition trick is where I put in the probability formulas
for a categorical response, or even a multinomial response.
I put those probability formulas in as Xs.
I use my categorical response as my Y.
This is the same response that was used
for all of these except for pass, actually.
I retain the same validation column that I've been working with the whole time.
Now that I'm in the partition platform,
I'm going to hit Split a couple of times, and I'm going to hope
that I end up with an easily understandable decision rule
that's easy to communicate.
That may or may not happen.
Sometimes it works, sometimes it doesn't.
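A hedged Python sketch of the same trick: take the saved probability columns as inputs and fit a very shallow decision tree against the test result, hoping for a short, readable rule. The table and column names below are placeholders.

```python
# The "partition trick" sketched with scikit-learn: fit a shallow decision
# tree on the saved probability columns. `data`, `prob_cols`, and the
# "Test Result" column name are placeholders for the actual table.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(data[prob_cols], data["Test Result"])
print(export_text(tree, feature_names=list(prob_cols)))  # the readable decision rule
```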
So I split once, and we end up seeing that
whenever the probability of pass is higher than 0.935,
we almost certainly have a pass.
Not many passes are left over on the other side.
I take another split.
We find a decision rule on ITM
that is highly predictive of ITM as a failure mode.
Split again.
We find that whenever Motor is less than 0.945,
we're either predicting Motor or Brake.
We take another split.
We find that whenever the probability of Velocity Type 1 is bigger than 0.08,
we're likely in a Velocity Type 1 situation or a Velocity Type 2 situation.
Whenever Velocity Type 1 is less than 0.79,
we're likely in a gripper failure mode or an IMP failure mode.
What do we have here? We have a simple decision rule.
We're not going to be able to break these failure modes down much further
because of the very small number of actual events that we have.
But we can turn this into a simple rule
for identifying units that are probably good,
and if they're not, we have an idea of where to look to fix the problem.
We can save this decision rule out as a leaf label formula.
We see that on the validation set,
when we predict it's good, it's good most of the time.
We did have one misclassification of a Velocity Type 2 failure
that was actually predicted to be good.
When we predict grippers or IMP, it's all over the place.
That leaf was not super useful.
Predicting ITM is 100%.
Whenever we predict a motor or brake,
on the validation set, we have a motor or a brake failure.
When we predict a Velocity Type 1 or 2,
it did a pretty good job of picking that up
with that one exception of the single Velocity Type 2 unit
that was in the validation set,
and that one happened to have been misclassified.
We have an easily operationalized rule here that could be used to sort products
and give us a head start on where we need to look to fix things.
I think this was a pretty challenging problem,
because we didn't have a whole lot of data.
We didn't have a lot of rows,
but we had a lot of different categories to predict
and a whole lot of possible predictors to use.
We've gotten there by taking a series of steps,
asking questions,
sometimes taking a step back and asking a bigger question.
Other times, narrowing in on particular sub- issues.
Sometimes our excursions were fruitful, and sometimes they weren't.
Our purpose here is to illustrate
how you can step through a modeling process,
through this sequence of asking questions
using modeling and visualization tools to guide your next step,
and moving on until you're able to find
a useful, actionable, predictive model.
Thank you very much for your attention.
We look forward to talking to you in our Q&A session coming up next.