Different goals, different models: How to use models to sharpen up your questio...

Hello, I'm Ron Kennett.

This is a joint talk with Chris Gotwalt.

We're going to talk to you about models.

Models are used extensively.

We hope to bring some additional perspective

on how to use models in general.

We call the talk

"Different goals, different models:

How to use models to sharpen your questions."

My part will be an intro, and I'll give an example.

You'll have access to the data I'm using.

Then Chris will continue with a more complex example,

introduce the SHAP values available in JMP 17,

and provide some conclusions.

We all know that all models are wrong,

but some are useful.

Sam Karlin said something different.

He said that the purpose of models is not to fit the data,

but to sharpen the question.

Then this guy, Pablo Picasso, he said something in...

He died in 1973, so you can imagine when this was said.

I think in the early '70 s.

"Computers are useless. They can only give you answers."

He is more in line with Karlin.

My take on this is that

this presents the key difference between a model and a computer program.

I'm looking at the model from a statistician's perspective.

Dealing with Box's famous statement.

"Yes, some are useful. Which ones?"

Please help us.

What do you mean by some are useful?

It's not very useful to say that .

Going to Karlin, "Sharpen the question."

Okay, that's a good idea.

How do we do that?

The point is that Box seems focused on the data analysis phase

in the life cycle view of statistics, which starts with problem elicitation,

moves to goal formulation, data collection, data analysis,

findings, operationalization of finding,

communicational findings, and impact assessment.

Karlin is more focused on the problem elicitation phase.

These two quotes of Box and Karlin

refer to different stages in this life cycle.

The data I'm going to use is an industrial data set,

174 observations.

We have sensor data.

We have 63 sensors.

They are labeled 1, 2, 3, 4 to 63.

We have two response variables.

These are coming from testing some systems.

The status report is fail/pass.

52.8% of the systems that were tested failed.

We have another report,

which is a more detailed report on test results.

When the systems fail,

we have some classification of the failure.

Test result is more detailed.

Status is go/no go.

The goal is to determine the system status from the sensor data

so that we can maybe avoid the costs and delays of testing,

and we can have some early predictions on the faith of the system.

One approach we can take is to use a boosted tree.

We put the status as the response, the 63 sensors X, factors.

The boosted tree is trained sequentially, one tree at a time.

T he other model we're going to use is random forest,

and that's done with independent trees.

There is a sequential aspect in boosted trees

that is different from random forests.

The setup of boosted trees involves three parameters:

the number of trees, depth of trees, and the learning rate.

This is what JMP gives as a default.

Boosted trees can be used to solve most objective functions.

We could use it for poisson regression,

which is dealing with counts that is a bit harder to achieve

with random forests.

We're going to focus on these two types of models.

When we apply the boosted tree,

and we have a validation set up

with 43 systems drawn randomly as the validation set.

A hundred and thirty-one systems is used for the training set.

We are getting a 9.3% misclassification rate.

Three failed systems.

We know that they failed because we have it in the data,

were actually classified as pass.

The 20 that passed, 19 were classified as pass.

The false predicted pass is 13 %.

We can look at the variable column contributions.

We see that Sensor 56, 18, 11, and 61 are the top four

in terms of contributing to this classification.

We see that in the training set, we had zero misclassification.

We might have some over fitting i n this BT application.

If we look at the lift curve,

40 % of the systems, we can get over two lift

which is the performance that this classifier gives us.

If we try the boos trap forest,

another option, again, we do the same thing.

We use the same validation set.

The defaults of JMP are giving you some parameters

for the number of trees

and the number of features to be selected at each mode.

This is how the random forest works.

You should be aware that this is not very good

if you have categorical variables and missing data,

which is not our case here.

Now, the misclassification rate is 6.9, lower than before.

On the training set, we had some misclassification.

The random forest applied to the test status,

which means when we have the details on the failures is 23.4,

so bad performance.

Also, on the training set, we have 5% misclassification.

But we have now a wider range of options

and that is explaining some of the lower performance.

In the lift curve on the test results,

we actually, with quite good performance,

can pick up the top 10 % of the systems with a leverage of above 10.

So we have over a ten fold increase for 10 % of the systems

relative to the grand average.

Now this is posing a question— remember the topic of the talk—

what are we looking at?

Do we want to identify top score good systems?

The random forest would do that with the test result.

Or do we want to predict a high proportion of pass?

The bootstrap tree would offer that.

A secondary question is to look at what is affecting this classification.

We can look at the column contributions on the boosted tree.

Three of the four top variables show up also on the random forest.

If we use the status pass/fail,

or the detailed results,

there is a lot of similarity on the importance of the sensors.

This is just giving you some background.

Chris is going to follow up with an evaluation of the sensitivity

of this variable importance, the use of SHAP v alues

and more interesting stuff.

This goes back to questioning what is your goal,

and how is the model helping you figure out the goal

and maybe sharpening the question that comes from the statement of the goal.

Chris, it's all yours.

Thanks, Ron.

I'm going to pick up from where Ron left off

and seek a model that will predict whether or not a unit is good or not,

and if it isn't, what's the likely failure mode that has resulted?

This would be useful in that if a model is good at predicting good units,

we may not have to subject them to much further testing.

If the model gives a predicted failure mode,

we're able to get a head start on diagnosing and fixing the problem,

and possibly, we may be able to get some hints

on how to improve the production process in the future.

I'm going to go through the sequence

of how I approached answering this question from the data.

I want to say at the outset that this is simply the path that I took

as I asked questions of the data and acted on various patterns that I saw.

There are literally many other ways that one could proceed with this.

There's often not really a truly correct answer,

just a criterion for whether or not the model is good enough,

and the amount that you're able to get done

in the time that you have to get a result back.

I have no doubt that there are better models out there

than what I came up with here.

Our goal is to show an actual process of tackling a prediction problem,

illustrating how one can move forward

by iterating through cycles of modeling and visualization,

followed by observing the results and using them to ask another question

until we find something of an answer.

I will be using JMP as a statistical Swiss army knife,

using many tools in JMP

and following the intuition I have about modeling data

that has built up over many years.

First, let's just take a look

at the frequencies of the various test result categories.

We see that the largest and most frequent category is Good.

We'll probably have the most luck being able to predict that category.

On the other hand, the SOS category has only two events

so it's going to be very difficult for us to be able to do much with that category.

We may have to set that one aside.

We'll see about that.

Velocity II, IMP, and Brake

are all fairly rare with five or six events each.

There may be some limitations in what we're able to do with them as well.

I say this because we have 174 observations

and we have 63 predictors.

So we have a lot of predictors for a very small number of observations,

which is actually even smaller when you consider the frequencies

of some of the categories that we're trying to predict.

We're going to have to work iteratively by doing visualizations in modeling,

recognizing patterns, asking questions,

and then acting on those with another modeling step iteratively

in order to find a model that's going to do a good job

of predicting these response categories.

I have the data sorted by test results,

so that the good results are at the beginning,

followed by each of the different failure modes d ata a fter that.

I went ahead and colored each of the rows by test results so that we can see

which observation belongs to a particular response category.

So then I went into the model- driven multivariate control chart

and I brought in all of the sensors as process variables.

Since I had the good test results at the beginning,

I labeled those as historical observations.

This gives us a T² chart.

It's chosen 13 principal components as its basis.

What we see here

is that the chart is dominated by these two points right here

and all of the other points are very small in value

relative to those two.

Those two points happen to be the SOS points.

They are very serious outliers in the sensor readings.

Since we also only have two observations of those,

I'm going to go ahead and take those out of the data set

and say, well, SOS is obviously so bad that the sensors

should be just flying off the charts.

If we encounter it, we're just going to go ahead

and try to concern ourselves with the other values

that don't have this off- the- charts behavior.

Switching to a log scale, we see that the good test results

are fairly well -behaved.

Then there's definite signals

in the data for the different failure modes.

Now we can drill down a little bit deeper,

taking a look at the contribution plots for the historical data,

the good test result data, and the failure modes

to see if any patterns emerge in the sensors that we can act upon.

I'm going to remove those two SOS observations

and select the good units.

If I right-click in the plot,

I can bring up a contribution plot

for the good units, and then I can go over to the units

where there was a failure, and I can do the same thing,

and we'll be able to compare the contribution plots side by side.

So what we see here are the contribution plots

for the pass units and the fail units.

The contribution plots

are the amount that each column is contributing to the T ²

for a particular row.

Each of the bars there correspond to an individual sensor for that row.

Contribution plots are colored green when that column is within three sigma,

using an individuals and moving range chart,

and it's red if it's out of control.

Here we see most of the sensors are in control for the good units,

and most of the sensors are out of control for the failed units.

What I was hoping for here

would have been if there was only a subset of the columns

or sensors that were out of control over on the failed units.

Or if I was able to see patterns

that changed across the different failure modes,

which would help me isolate what variables are important

for predicting the test result outcome.

Unfortunately, pretty much all of the sensors

are in control when things are good,

and most of them are out of control when things are bad.

So we're going to have to use some more sophisticated modeling

to be able to tackle this prediction problem.

Having not found anything in the column contributions plots,

I'm going to back up and return to the two models that Ron found.

Here are the column contributions for those two models,

and we see that there's some agreement in terms of

what are the most important sensors.

But boosted tree found a somewhat larger set of sensors as being important

over the bootstrap forest.

Which of these two models should we trust more?

If we look at the overall model fit report,

we see that the boosted tree model has a very high training RS quare of 0.998

and a somewhat smaller v alidation RS quare of 0.58.

This looks like an overfit situation.

When we look at the random forest, it has a somewhat smaller training RS quare,

perhaps a more realistic one, than the bootstrap forest,

and it has a somewhat larger validation RS quare.

The generalization performance of the random forest

is hopefully a little bit better.

I'm inclined to trust the random forest model a little bit more.

Part of this is going to be based upon just the folklore of these two models.

Boosted trees are renowned for being fast, highly accurate models

that work well on very large datasets.

Whereas the hearsay is that random forests are more accurate on smaller datasets.

They are fairly robust, messy, and noisy data.

There's a long history of using these kinds of models

for variable selection that goes back to a paper in 2010

that has been cited almost 1200 times.

So this is a popular approach for variable selection.

I did a similar search for boosting,

and I didn't quite see as much history around variable selection

for boosted trees as I did for random forests.

For this given data set r ight here,

we can do a sensitivity analysis to see how reliable

the column contributions are for these two different approaches,

using the simulation capabilities in JMP Pro.

What we can do is create a random validation column

that is a formula column

that you can reinitialize and will partition the data

into random training and holdout sets of the same portions

as the original validation column.

We can have it rerun these two analyses

and keep track of the column contribution portions

for each of these repartitionings.

We can see how consistent the story is

between the boosted tree models and the random forests.

This is pretty easy to do.

We just go to the Make Validation Column utility

and when we make a new column, we ask it to make a formula column

so that it could be reinitialized.

Then we can return to the bootstrap forest platform,

right- click on the column contribution portion,

select Simulate.

It'll bring up a dialog

asking us which of the input columns we want to switch out.

I'm going to choose the validation column,

and I'm going to switch in in replacement of it,

this random validation formula column.

We're going to do this a hundred times.

Bootstrap forest is going to be rerun

using new random partitions of the data into training and validation.

We're going to look at the distribution of the portions

across all the simulation runs.

This will generate a dataset

of column contribution portions for each sensor.

We can take this and go into the graph builder

and take a look and see how consistent those column contributions are

across all these random partitions of the data.

Here is a plot of the column contribution portions

from each of the 100 random reshufflings of the validation column.

Those points we see in gray here,

Sensor 18 seems to be consistently a big contributor, as does Sensor 61.

We also see with these red crosses,

those are the column contributions from the original analysis that Ron did.

The overall story that this tells is that the tendency

i s that whenever the original column contribution was small,

those re simulated column contributions also tended to be small.

When the column contributions were large in the original analysis,

they tended to stay large.

We're getting a relatively consistent story from the bootstrap forest

in terms of what columns are important.

Now we can do the same thing with the boosted tree,

and the results aren't quite as consistent as they were with the bootstrap forest.

So here is a bunch of columns

where the initial column contributions came out very small

but they had a more substantial contribution

in some of the random reshuffles of the validation column.

That also happened quite a bit over with these Columns 52 through 55 over here.

Then there were also some situations

where the original column contribution was quite large,

and most, if not all,

of the other column contributions found in the simulations were smaller.

That happens here with Column 48,

and to some extent also with Column 11 over here.

The overall conclusion being that I think this validation shuffling

is indicating that we can trust the column contributions

from the bootstrap forest to be more stable than those of the boosted tree.

Based on this comparison, I think I trust the column contributions

from the bootstrap forest more,

and I'm going to use the columns that it recommended

as the basis for some other models.

What I'd like to do

is find a model that is both simpler than the bootstrap forest model

and performs better in terms of validation set performance

for predicting pass or fail.

Before proceeding with the next modeling stuff,

I'm going to do something that I should have probably done at the very beginning,

which is to take a look at the sensors in a scatterplot matrix

to see how correlated the sensor readings are,

and also look at histograms of them as well to see if they're outlier- prone

or heavily skewed or otherwise highly non- gaussian.

What we see here is there is pretty strong multicollinearity

amongst the input variables generally.

We're only looking at a subset of them here,

but this high multicollinearity persists across all of the sensor readings.

This suggests that for our model,

we should try things like the logistic lasso,

the logistic elastic net, or a logistic ridge regression

as candidates for our model to predict pass or fail.

Before we do that, we should go ahead

and try to transform our sensor readings here

so that they're a little bit better- behaved and more gaussian- looking.

This is actually really easy in JMP

if you have all of the columns up in the distribution platform,

because all you have to do is hold down Alt , choose Fit Johnson,

and this will fit Johnson distributions to all the input variables.

This is a family of distributions

that is based on a four parameter transformation to normality.

As a result, we have a nice feature in there

that we can also broadcast using Alt Click,

where we can save a transformation from the original scale

to a scale that makes the columns more normally distributed.

If we go back to the data table,

we'll see that for each sensor column, a transform column has been added.

If we bring these transformed columns up with a scatterplot matrix

and some histograms,

we clearly see that the data are less skewed

and more normally distributed than the original sensor columns were.

Now the bootstrap forest model that Ron found

only really recommended a small number of columns

for use in the model.

Because of the high collinearity amongst the columns,

the subset that we got could easily be part

of a larger group of columns that are correlated with one another.

It could be beneficial to find that larger group of columns

and work with that at the next modeling stage.

An exploratory way to do this

is to go through the cluster variables platform in JMP .

We're going to work with the normalized version of the sensors

because this platform is PCA and factor analysis based,

and will provide more reliable results if we're working with data

that are approximately normally distributed.

Once we're in the variable clustering platform,

we see that there is very clear,

strong associations amongst the input columns.

It has identified that there are seven clusters,

and the largest cluster, the one that explains the most variation,

has 25 members.

The set of cluster members is listed here on the right.

Let's compare this with the bootstrap forest.

Here on the left, we have the column contributions

from the bootstrap forest model that you should be familiar with by now.

On the right, we have the list of members

of that largest cluster of variables.

If we look closely, we'll see that the top seven contributing terms

all happen to belong to this cluster.

I'm going to hypothesize that this set of 25 columns

are all related to some underlying mechanism

that causes the units to pass or fail.

What I want to do next is I want to fit models

using the generalized regression platform with the variables in Cluster 1 here.

It would be tedious if I had to go through

and individually pick these columns out and put them into the launch dialog.

Fortunately, there's a much easier way

where you can just select the rows in that table

and the columns will be selected in the original data table

so that when we go into the fit model launch dialog,

we can just click Add

and those columns will be automatically added for us as model effects.

Once I got into the Generalized Regression platform,

I went ahead and fit a lasso model and elastic net model

and a ridge model to have them compared here to each other,

and also to the logistic regression model that came up by default.

We're seeing that the lasso model is doing a little bit better than the rest

in terms of its validation generalized RS quare.

The difference between these methods

is that there's different amounts of variable selection

and multicollinearity handling in each of them.

Logistic regression has no multicollinearity handling

and no variable selection.

The lasso is more of a variable selection algorithm,

although it has a little bit of multicollinearity handling in it

because it's a penalized method.

Ridge regression has no variable selection

and is heavily oriented around multicollinearity handling.

The elastic net is a hybrid between the lasso and ridge regression.

In this case, what we really care about

is just the model that's going to perform the best.

We allow the validation to guide us.

We're going to be working with the lasso model from here on.

Here's the prediction profiler for the lasso model that was selected.

We see that the lasso algorithm has selected eight sensors

as being predictive of pass or fail.

It has some great built-in tools

for understanding what the important variables are,

both in the model overall and, new to JMP Pro 17,

we have the ability to understand

what variables are most important for an individual prediction.

We can use the variable importance tools to answer the question,

"What are the important variables in the model?"

We have a variety of different options for this.

We have a variety of different options for how this could be done.

But because of the multicollinearity and because this is not a very large model,

I'm going to go ahead

and use the dependent resampled inputs technique,

since we have multicollinearity in the data,

and this has given us a ranking of the most important terms.

We see that Column 18 is the most important,

followed by Column 27 and then 52, all the way down.

We can compare this to the bootstrap forest model,

and we see that there's agreement that Variable 18 is important,

along with 52, 61, and 53.

But one of the terms that we have pulled in

because of the variable clustering step that we had done,

Sensor 27 turns out to be the second most important predictive

in this lasso model.

We've hopefully gained something by casting a wider net through that step.

We've found a term that didn't turn up

in either of the bootstrap forest or the boosted tree methods.

We also see that the lasso model has an RS quare of 0.9,

whereas the bootstrap forest model had an RS quare of 0.8.

We have a simpler model that has an easier form to understand

and is easier to work with,

and also has a higher predictive capacity than the bootstrap forest model.

Now, the variable importance metrics in the profiler

have been there for quite some time.

The question that they answer is, "Which predictors have the biggest impact

on the shape of the response surface over the data or over a region?"

In JMP 17 Pro, we have a new technique called SHAP Values

that is an additive decomposition of an individual prediction.

It tells you by how much each individual variable contributes

to a single prediction,

rather than talking about variability explained over the whole space.

The resolution of the question that's answered by Shapley values

is far more local than either the variable importance tools

or the column contributions i n the bootstrap forest.

We can obtain the Shapley Values by going to the red triangle menu for the profiler,

and we'll find the option for them over here, fourth from the bottom.

When we choose the option, the profiler saves back SHAP columns

for all of the input variables to the model.

This is, of course, happening for every row in the table.

What you can see is that the SHAP V alues are giving you the effect

of each of the columns on the predictive model.

This is useful in a whole lot of different ways,

and for that reason, it's gotten a lot of attention in intelligible AI,

because it allows us to see

what the contributions are of each column to a black box model.

Here, I've plotted the SHAP V alues for the columns that are predictive

in the last fit model that we just built.

If we toggle back and forth between the good units and the units that failed,

we see the same story that we've been seeing

with the variable importance metrics for this,

that Column 18 and Column 27 are important in predicting pass or fail.

We're seeing this at a higher level of resolution

than we do with the variable importance metrics,

because each of these points corresponds to an individual row

in the original dataset.

But in this case, I don't see the SHAP Values

really giving us any new information.

I had hoped that by toggling through

the other failure modes, maybe I could find a pattern

to help tease out different sensors

that are more important for particular failure modes.

But the only thing I was able to find was that Column 18

had a somewhat stronger impact on the Velocity Type 1 failure mode

than the other failure modes.

At this point, we've had some success

using those Cluster 1 columns in a binary pass/ fail model.

But when I broke out the SHAP Values

for that model, by the different failure modes

I wasn't able to discern a pattern or much of a pattern.

What I did next was I went ahead

and fit the failure mode response column test results

using the Cluster 1 columns,

but I went ahead and excluded all the pass rows

so that the modeling procedure would focus exclusively

on discerning which failure mode it is given that we have a failure.

I tried the multinomial lasso, elastic net, and ridge,

and I was particularly happy with the lasso model

because it gave me a validation RS quare of about 0.94.

Having been pretty happy with that,

I went ahead and saved the probability formulas

for each of the failure modes.

Now the task is to come up with a simple rule

that post processes that prediction formula

to make a decision about which failure mode.

I call this the partition trick.

The partition trick is where I put in the probability formulas

for a categorical response, or even a multinomial response.

I put those probability formulas in as Xs.

I use my categorical response as my Y.

This is the same response that was used

for all of these except for pass, actually.

I retain the same validation column that I've been working with the whole time.

Now that I'm in the partition platform,

I'm going to hit Split a couple of times, and I'm going to hope

that I end up with an easily understandable decision rule

that's easy to communicate.

That may or may not happen.

Sometimes it works, sometimes it doesn't.

So I split once, and we end up seeing that

whenever the probability of pass is higher than 0.935,

we almost certainly have a pass.

Not many passes are left over on the other side.

I take another split.

We find a decision rule on ITM

that is highly predictive of ITM as a failure mode.

Split again.

We find that whenever Motor is less than 0.945,

we're either predicting Motor or Brake.

We take another split.

We find that whenever Velocity Type 1, its probability is bigger than 0.08

or likely in a Velocity Type 1 situation or in a Velocity T ype 2 situation.

Whenever Velocity Type 1 is less than 0.79,

we're likely in a gripper failure mode or an IMP failure mode.

What do we have here? We have a simple decision rule.

We're going to not be able to break these failure modes down much further

because of the very small number of actual events that we have.

But we can turn this into a simple rule

for identifying units that are probably good,

and if they're not, we have an idea of where to look to fix the problem.

We can save this decision rule out as a leaf label formula.

We see that on the validation set,

when we predict it's good, it's good most of the time.

We did have one misclassification of a Velocity Type 2 failure

that was actually predicted to be good.

Predict grippers or IMP, it's all over the place.

That leaf was not super useful.

Predicting ITM is 100 %.

Whenever we predict a motor or brake,

on the validation set, we have a motor or a brake failure.

When we predict a Velocity Type 1 or 2,

it did a pretty good job of picking that up

with that one exception of the single Velocity Type 2 unit

that was in the validation set,

and that one happened to have been misclassified.

We have an easily operational rule here that could be used to sort products

and give us a head start on where we need to look to fix things.

I think this was a pretty challenging problem,

because we didn't have a whole lot of data.

But we didn't have a lot of rows,

but we had a lot of different categories to predict

and a whole lot of possible predictors to use.

We've gotten there by taking a series of steps,

asking questions,

sometimes taking a step back and asking a bigger question.

Other times, narrowing in on particular sub- issues.

Sometimes our excursions were fruitful, and sometimes they weren't.

Our purpose here is to illustrate

how you can step through a modeling process,

through this sequence of asking questions

using modeling and visualization tools to guide your next step,

and moving on until you're able to find

a useful, actionable, predictive model.

Thank you very much for your attention.

We look forward to talking to you in our Q&A session coming up next.

Different goals, different models: How to use models to sharpen up your questions (2022-US-45MP-1159)

Presenter

Files