Different goals, different models: How to use models to sharpen up your questions (2022-US-45MP-1159)

Ron Kenett, Professor, KPA, Samuel Neaman Institute, Technion, Israel and University of Turin, Italy
Christopher Gotwalt, Chief Data Scientist, JMP

 

The famous Stanford mathematician Sam Karlin is quoted as stating that "The purpose of models is not to fit the data but to sharpen the question" (Feldman and Kenett, 2022). A related manifesto on the general role of models was published in Saltelli et al. (2020). In this talk, we explore how different models are used to meet different goals. We consider several options available in the JMP Generalized Regression platform, including ridge regression, the lasso, and elastic nets. To make our point, we use two examples. The first consists of data from 63 sensors collected in the testing of an industrial system (from Chapter 8 of Kenett and Zacks, 2021). The second is from Amazon reviews of Crocs sandals, where text analytics is used to model review ratings (Amazon, 2022).

 

References

 

Amazon, 2022, https://www.amazon.com/s?k=Crocs&crid=2YYP09W4Z3EQ3&sprefix=crocs%2Caps%2C247&ref=nb_sb_noss_1

 

Feldman, M. and Kenett, R.S. (2022), Samuel Karlin, Wiley StatsRef, https://onlinelibrary.wiley.com/doi/10.1002/9781118445112.stat08377

 

Kenett, R.S. and Zacks, S. (2021) Modern Industrial Statistics: With Applications in R, MINITAB, and JMP, 3rd Edition, Wiley, https://www.wiley.com/en-gb/Modern+Industrial+Statistics%3A+With+Applications+in+R%2C+MINITAB%2C+and...

 

Saltelli, A. et al (2020) Five ways to ensure that models serve society: a manifesto, Nature, 582, 482-484, https://www.nature.com/articles/d41586-020-01812-9

 

 

Hello, I'm Ron Kenett.

This  is  a  joint  talk  with  Chris  Gotwalt.

We're  going  to  talk   to  you  about  models.

Models are  used  extensively.

We  hope  to  bring   some  additional  perspective

on  how  to  use  models  in  general.

We  call  the  talk

"Different  goals,  different  models:

How  to  use  models  to  sharpen  your  questions."

My  part  will  be  an  intro, and  I'll  give  an  example.

You'll  have  access  to  the  data  I'm  using.

Then  Chris  will  continue with a  more  complex  example,

introduce  the  SHAP  values available  in  JMP  17,

and  provide  some  conclusions.

We  all  know  that  all  models  are  wrong,

but  some  are  useful.

Sam  Karlin  said  something  different.

He  said  that  the  purpose  of  models is  not  to  fit  the  data,

but  to  sharpen  the  question.

Then  this  guy,  Pablo  Picasso, he  said  something in...

He  died  in  1973, so  you  can  imagine  when  this  was  said.

I think in the early '70s.

"Computers  are  useless. They  can  only  give  you  answers."

He  is  more  in  line  with  Karlin.

My  take  on  this is that

this presents  the key difference between  a  model  and  a  computer  program.

I'm looking at models from a statistician's perspective,

dealing with Box's famous statement.

"Yes,  some  are  useful.  Which  ones?"

Please  help  us.

What do  you  mean  by  some  are  useful?

It's not very useful to say that.

Going to Karlin, "Sharpen  the  question."

Okay,  that's  a  good  idea.

How  do  we  do  that?

The point is that Box seems focused on the data analysis phase

in the life cycle view of statistics, which starts with problem elicitation,

moves to goal formulation, data collection, data analysis,

findings, operationalization of findings,

communication of findings, and impact assessment.

Karlin is  more  focused on  the  problem  elicitation  phase.

These  two  quotes  of  Box  and  Karlin

refer  to  different  stages in  this  life  cycle.

The  data  I'm  going  to  use  is  an  industrial  data set,

174 observations.

We  have  sensor  data.

We  have  63  sensors.

They are labeled 1 through 63.

We  have  two  response  variables.

These  are  coming   from  testing  some  systems.

The  status  report  is  fail/pass.

52.8%  of  the  systems that  were  tested  failed.

We  have  another  report,

which  is  a  more  detailed  report on  test  results.

When  the systems  fail,

we  have  some  classification of  the  failure.

Test  result  is  more  detailed.

Status  is  go/no go.

The  goal  is  to  determine  the  system  status from  the  sensor  data

so  that  we  can  maybe  avoid  the  costs  and  delays  of  testing,

and we can have some early predictions on the fate of the system.

One  approach  we  can  take is  to  use  a  boosted  tree.

We put status as the response and the 63 sensors as the X factors.

The  boosted  tree  is  trained  sequentially, one  tree  at  a  time.

The other model we're going to use is the random forest,

and  that's  done  with  independent  trees.

There  is  a  sequential  aspect   in  boosted  trees

that  is  different  from  random  forests.

The  setup  of  boosted  trees   involves  three  parameters:

the  number  of  trees, depth  of  trees,  and  the  learning  rate.

This is what JMP gives as a default.

Boosted trees can be used with most objective functions.

We could use them for Poisson regression,

which deals with counts and is a bit harder to achieve

with random forests.

We're  going  to  focus on  these  two  types  of  models.
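
As a rough sketch of these two model types, here is how the setup could look using scikit-learn's GradientBoostingClassifier and RandomForestClassifier as stand-ins for JMP's Boosted Tree and Bootstrap Forest platforms. The data, settings, and 43/131 split below are placeholders, not the talk's actual analysis.

```python
# Sketch (not the JMP implementation): scikit-learn analogues of the two
# tree ensembles discussed above. Synthetic data stands in for the
# 174 x 63 sensor table; all settings are illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 63))              # stand-in for the 63 sensors
y = rng.choice(["Pass", "Fail"], size=174)  # stand-in for the status response

# 43 systems held out for validation, 131 for training, as in the talk
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=43, random_state=1)

# Boosted tree: trees grown sequentially; the key tuning parameters are
# the number of trees, tree depth, and learning rate.
boosted = GradientBoostingClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
boosted.fit(X_train, y_train)

# Random (bootstrap) forest: independent trees on bootstrap samples, with a
# random subset of features considered at each node.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
forest.fit(X_train, y_train)

for name, model in [("Boosted tree", boosted), ("Bootstrap forest", forest)]:
    miss = 1 - accuracy_score(y_valid, model.predict(X_valid))
    print(f"{name}: validation misclassification rate = {miss:.3f}")
```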

When we apply the boosted tree,

we have a validation setup

with 43 systems drawn randomly as the validation set.

One hundred and thirty-one systems are used for the training set.

We get a 9.3% misclassification rate.

Three failed systems,

which we know failed because we have it in the data,

were actually classified as pass.

Of the 20 that passed, 19 were classified as pass.

The false predicted pass rate is 13%.

We can look at the column contributions of the variables.

We see that Sensors 56, 18, 11, and 61 are the top four

in terms of contributing to this classification.

We  see  that  in  the  training  set, we  had  zero  misclassification.

We might have some overfitting in this boosted tree application.

If we look at the lift curve,

for 40% of the systems we can get a lift of over two,

which is the performance that this classifier gives us.
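
For readers less familiar with lift, here is a minimal sketch of the calculation behind a lift curve, with synthetic placeholder scores rather than the actual model output: rank the cases by predicted probability and compare the response rate in the top fraction to the grand average.

```python
# Sketch of the lift calculation behind a lift curve. Data are placeholders.
import numpy as np

rng = np.random.default_rng(0)
prob_fail = rng.uniform(size=43)           # predicted P(Fail) on a validation set
actual = rng.uniform(size=43) < prob_fail  # True where the system really failed

def lift_at(prob, outcome, fraction):
    """Lift for the top `fraction` of cases ranked by predicted probability."""
    n_top = max(1, int(round(fraction * len(prob))))
    top = np.argsort(prob)[::-1][:n_top]          # indices of the highest scores
    return outcome[top].mean() / outcome.mean()   # rate in top slice vs. grand average

print("Lift at 40%:", round(lift_at(prob_fail, actual, 0.40), 2))
print("Lift at 10%:", round(lift_at(prob_fail, actual, 0.10), 2))
```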

If we try the bootstrap forest,

another  option, again,  we  do  the  same  thing.

We  use  the  same  validation  set.

The defaults of JMP are giving you some parameters

for the number of trees

and the number of features to be selected at each node.

This  is  how  the  random  forest  works.

You  should  be  aware that  this  is  not  very  good

if  you  have  categorical  variables and  missing  data,

which  is  not  our  case  here.

Now, the misclassification rate is 6.9%, lower than before.

On  the  training  set, we  had  some  misclassification.

The random forest applied to the test result,

which means when we have the details on the failures, is at 23.4%,

so poor performance.

Also, on the training set, we have 5% misclassification.

But we now have a wider range of outcome categories,

and that explains some of the lower performance.

In the lift curve on the test results,

we actually, with quite good performance,

can pick up the top 10% of the systems with a lift of above 10.

So we have over a tenfold increase for 10% of the systems

relative to the grand average.

Now this poses a question (remember the topic of the talk):

what are we looking at?

Do  we  want  to  identify top  score  good  systems?

The  random  forest   would  do  that  with  the  test  result.

Or  do  we  want  to  predict a  high  proportion  of  pass?

The boosted tree would offer that.

A  secondary  question  is  to  look  at what  is  affecting  this  classification.

We  can  look  at  the  column  contributions on  the  boosted  tree.

Three  of  the  four top  variables  show  up  also  on  the  random  forest.

If  we  use  the  status  pass/fail,

or  the  detailed  results,

there  is  a  lot  of  similarity on  the  importance  of  the  sensors.

This  is  just  giving  you  some  background.

Chris is going to follow up with an evaluation of the sensitivity

of this variable importance, the use of SHAP values,

and more interesting stuff.

This  goes  back  to  questioning what  is  your  goal,

and  how  is  the  model  helping  you  figure  out  the  goal

and  maybe  sharpening  the  question that  comes  from  the  statement  of  the  goal.

Chris,  it's  all yours.

Thanks,  Ron.

I'm  going  to  pick  up  from  where  Ron  left off

and seek a model that will predict whether or not a unit is good,

and if it isn't, what the likely failure mode is.

This  would  be  useful  in that  if  a  model is  good  at  predicting  good  units,

we  may  not  have  to  subject  them to  much  further  testing.

If  the  model  gives   a  predicted  failure  mode,

we're  able  to  get  a  head  start  on  diagnosing  and  fixing  the  problem,

and  possibly,  we  may  be  able  to  get  some  hints

on  how  to  improve  the  production  process  in  the  future.

I'm  going  to  go  through  the  sequence

of  how  I  approached  answering  this  question  from  the  data.

I  want  to  say  at  the  outset that  this  is  simply  the  path  that  I  took

as  I  asked  questions  of  the  data and  acted  on  various  patterns  that  I  saw.

There are many other ways that one could proceed with this.

There's  often  not  really   a  truly  correct  answer,

just  a  criterion  for  whether  or  not  the  model  is  good  enough,

and  the  amount  that  you're  able  to  get  done

in  the  time  that  you have  to  get  a  result  back.

I  have  no  doubt   that  there  are  better  models  out  there

than  what  I  came  up  with  here.

Our  goal  is  to  show  an  actual  process of  tackling  a  prediction  problem,

illustrating  how  one  can  move  forward

by  iterating  through  cycles of  modeling  and  visualization,

followed  by  observing  the  results   and  using  them  to  ask  another  question

until  we  find  something  of  an  answer.

I  will  be  using  JMP   as  a  statistical  Swiss  army  knife,

using  many  tools  in  JMP

and  following  the  intuition   I  have  about  modeling  data

that  has  built  up  over  many  years.

First,  let's  just  take  a  look

at  the  frequencies   of  the  various  test  result  categories.

We  see  that  the  largest and  most  frequent  category  is  Good.

We'll  probably  have  the  most  luck being  able  to  predict  that  category.

On  the  other  hand,   the  SOS  category  has  only  two  events

so  it's  going  to  be  very  difficult  for  us to  be  able  to  do  much  with  that  category.

We  may  have  to  set  that  one  aside.

We'll  see  about  that.

Velocity II,  IMP,  and  Brake

are  all  fairly  rare  with  five  or  six  events  each.

There  may  be  some  limitations  in  what we're  able  to  do  with  them  as  well.

I  say  this  because   we  have  174  observations

and  we  have  63  predictors.

So  we  have  a  lot  of  predictors  for  a  very  small  number  of  observations,

which  is  actually  even  smaller   when  you  consider  the  frequencies

of  some  of  the  categories that  we're  trying  to  predict.

We're going to have to work iteratively by doing visualization and modeling,

recognizing patterns, asking questions,

and then acting on those with another modeling step

in order to find a model that's going to do a good job

of predicting these response categories.

I  have  the  data  sorted by  test  results,

so  that  the  good  results are  at  the  beginning,

followed by the data for each of the different failure modes after that.

I  went  ahead  and  colored  each  of  the  rows by  test  results  so  that  we  can  see

which  observation  belongs to  a  particular  response  category.

So   then  I  went  into  the  model- driven multivariate  control  chart

and  I  brought  in  all  of  the  sensors as  process  variables.

Since  I  had  the  good  test  results at  the  beginning,

I  labeled  those as  historical  observations.

This   gives us  a  T²  chart.

It's  chosen  13  principal  components as  its  basis.

What  we  see  here

is  that  the  chart  is  dominated  by  these  two  points  right  here

and  all  of  the  other  points are  very  small  in  value

relative  to  those  two.

Those two points happen  to  be  the SOS  points.

They  are  very  serious  outliers in  the  sensor  readings.

Since  we  also  only  have   two  observations  of  those,

I'm  going  to  go  ahead and  take  those  out  of  the  data set

and  say,  well, SOS is  obviously  so  bad  that  the  sensors

should  be  just  flying  off  the  charts.

If  we  encounter  it,  we're  just  going  to  go  ahead

and  try  to  concern  ourselves with  the  other  values

that  don't  have  this  off- the- charts  behavior.
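
As a minimal sketch of the statistic behind the model-driven multivariate control chart, here is a PCA-based Hotelling T-squared computed from the "historical" (good) rows. The data below are synthetic placeholders, and the details of JMP's implementation may differ.

```python
# Sketch of a PCA-based Hotelling T-squared like the one behind the
# model-driven multivariate control chart: fit PCA on the historical (good)
# rows, then score every row. Placeholder data stands in for the sensors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 63))       # placeholder sensor table
is_good = np.arange(174) < 82        # pretend the first 82 rows are the good units

scaler = StandardScaler().fit(X[is_good])    # center/scale on historical data only
Z = scaler.transform(X)

pca = PCA(n_components=13).fit(Z[is_good])   # 13 principal components, as in the talk
scores = pca.transform(Z)

# T-squared = sum over components of (score^2 / eigenvalue)
t2 = (scores ** 2 / pca.explained_variance_).sum(axis=1)
print("Median T2, good units:  ", np.median(t2[is_good]).round(2))
print("Median T2, other units: ", np.median(t2[~is_good]).round(2))
```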

Switching  to  a  log  scale, we  see  that  the  good  test  results

are  fairly  well -behaved.

Then there are definite signals

in the data for the different failure modes.

Now  we  can  drill  down  a  little  bit  deeper,

taking  a  look  at  the  contribution  plots for  the  historical  data,

the  good  test  result  data, and  the  failure  modes

to  see  if  any  patterns  emerge  in  the  sensors  that  we  can  act  upon.

I'm  going  to  remove   those  two SOS  observations

and  select  the  good  units.

If  I   right-click  in  the  plot,

I  can  bring  up  a  contribution  plot

for  the  good  units, and  then  I  can  go  over  to  the  units

where  there  was  a  failure, and  I  can  do  the  same  thing,

and  we'll  be  able  to  compare the  contribution  plots  side  by  side.

So  what  we  see  here   are  the  contribution  plots

for  the  pass  units  and  the  fail  units.

The  contribution  plots

are  the  amount  that  each  column is  contributing  to  the  T ²

for  a  particular  row.

Each of the bars there corresponds to an individual sensor for that row.

Contribution plots are colored green when that column is within three sigma,

using an individuals and moving range chart,

and red if it's out of control.
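
As a sketch of the three-sigma check behind that coloring, here are the usual individuals-and-moving-range limits (center plus or minus 2.66 times the average moving range) applied to one placeholder sensor column; JMP's contribution coloring may differ in detail.

```python
# Sketch of the three-sigma check behind the green/red coloring: an
# individuals-and-moving-range chart per sensor, with limits estimated from
# the historical (good) rows.
import numpy as np

rng = np.random.default_rng(0)
sensor = rng.normal(size=174)         # one sensor column (placeholder data)
is_good = np.arange(174) < 82         # pretend the first 82 rows are good

hist = sensor[is_good]
mr_bar = np.abs(np.diff(hist)).mean()             # average moving range, historical rows
center = hist.mean()
ucl, lcl = center + 2.66 * mr_bar, center - 2.66 * mr_bar

out_of_control = (sensor > ucl) | (sensor < lcl)  # rows where this sensor would show red
print(f"Limits: [{lcl:.2f}, {ucl:.2f}]  flagged rows: {out_of_control.sum()}")
```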

Here  we  see  most  of  the  sensors  are  in  control  for  the  good  units,

and  most  of  the  sensors  are  out  of  control  for  the  failed  units.

What  I  was  hoping  for  here

would  have  been if  there  was  only  a  subset  of  the  columns

or  sensors  that  were  out  of  control over on  the  failed  units.

Or  if  I  was  able  to  see  patterns

that  changed  across  the  different  failure  modes,

which  would  help  me  isolate what  variables  are  important

for  predicting  the  test  result  outcome.

Unfortunately,  pretty  much all  of  the  sensors

are  in  control  when  things  are  good,

and  most  of  them  are  out  of  control when  things  are  bad.

So we're going to  have  to  use   some  more  sophisticated  modeling

to  be  able  to  tackle   this  prediction  problem.

Having  not  found  anything   in  the  column  contributions  plots,

I'm  going  to  back  up  and  return to  the  two  models  that  Ron  found.

Here  are  the  column  contributions for  those  two  models,

and  we  see  that there's  some  agreement  in  terms  of

what  are  the  most  important  sensors.

But the boosted tree found a somewhat larger set of sensors to be important

than the bootstrap forest did.

Which  of  these  two  models should  we  trust  more?

If we look at the overall model fit report,

we see that the boosted tree model has a very high training RSquare of 0.998

and a somewhat smaller validation RSquare of 0.58.

This looks like an overfit situation.

When we look at the random forest, it has a somewhat smaller training RSquare,

perhaps a more realistic one, than the boosted tree,

and it has a somewhat larger validation RSquare.

The  generalization  performance of  the  random  forest

is  hopefully  a  little  bit  better.

I'm  inclined  to  trust   the  random  forest  model  a  little  bit  more.

Part  of  this  is  going  to  be  based  upon just  the  folklore  of  these  two  models.

Boosted  trees  are  renowned for  being  fast,  highly  accurate  models

that  work  well  on  very  large  datasets.

Whereas  the  hearsay  is  that  random  forests are  more  accurate  on  smaller  datasets.

They are fairly robust to messy and noisy data.

There's  a  long  history   of  using  these  kinds  of  models

for  variable  selection that  goes  back  to  a  paper  in  2010

that  has  been  cited  almost  1200  times.

So  this  is  a  popular  approach for  variable  selection.

I  did  a  similar  search  for  boosting,

and  I  didn't  quite  see  as  much  history  around  variable  selection

for  boosted  trees   as  I  did  for  random  forests.

For this given dataset right here,

we  can  do  a  sensitivity  analysis to  see  how  reliable

the  column  contributions   are  for  these  two  different  approaches,

using  the  simulation  capabilities in  JMP  Pro.

What we can do is create a random validation column

that is a formula column

that you can reinitialize and that will partition the data

into random training and holdout sets of the same proportions

as the original validation column.

We  can  have  it  rerun  these  two  analyses

and  keep  track   of  the  column  contribution  portions

for  each  of  these  repartitionings.

We  can  see  how  consistent  the  story is

between  the  boosted  tree  models and  the  random  forests.

This  is  pretty  easy  to  do.

We just go  to the  Make  Validation  Column  utility

and  when  we  make  a  new  column,  we  ask  it  to  make  a  formula  column

so  that it  could  be  reinitialized.

Then  we  can  return to  the  bootstrap  forest  platform,

right- click   on  the  column  contribution  portion,

select  Simulate.

It'll  bring  up  a  dialog

asking  us  which  of  the  input  columns we  want  to  switch  out.

I'm  going  to  choose   the  validation  column,

and  I'm  going  to  switch  in in replacement  of  it,

this  random  validation  formula  column.

We're  going  to  do  this a hundred  times.

Bootstrap  forest  is  going  to  be  rerun

using  new  random  partitions   of  the  data  into  training  and  validation.

We're  going  to  look  at  the  distribution  of  the  portions

across  all  the  simulation  runs.

This  will  generate  a  dataset

of  column  contribution  portions for  each  sensor.

We  can  take  this  and  go   into  the  graph  builder

and  take  a  look  and  see  how  consistent  those  column  contributions  are

across  all  these  random  partitions  of  the  data.
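
Here is a sketch of the same idea outside JMP, assuming scikit-learn's impurity-based feature importances as a rough stand-in for column contribution portions: redraw the training/validation split many times, refit, and collect the importances. The data are placeholders.

```python
# Sketch of the re-randomized validation idea: redraw the split many times,
# refit on each new training subset, and collect a per-column importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 63))
y = rng.choice(["Pass", "Fail"], size=174)

importances = []
for rep in range(100):                                  # 100 reshuffles, as in the talk
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=43, random_state=rep)
    rf = RandomForestClassifier(n_estimators=100, random_state=rep).fit(X_tr, y_tr)
    importances.append(rf.feature_importances_)

importances = np.array(importances)                     # shape: (100, 63)
spread = importances.std(axis=0)                        # how stable each column's portion is
print("Most variable columns across reshuffles:", np.argsort(spread)[::-1][:5] + 1)
```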

Here  is  a  plot   of  the  column  contribution  portions

from  each  of  the  100  random  reshufflings of  the  validation  column.

Looking at those points we see in gray here,

Sensor 18 seems to be consistently a big contributor, as does Sensor 61.

We  also  see  with  these  red  crosses,

those  are  the  column  contributions from  the  original  analysis  that  Ron  did.

The  overall  story  that  this  tells is  that  the  tendency

i s  that  whenever  the  original column  contribution  was small,

those  re simulated  column  contributions also  tended  to  be  small.

When  the  column  contributions  were  large  in  the  original  analysis,

they  tended  to  stay  large.

We're  getting  a  relatively   consistent  story  from  the  bootstrap  forest

in  terms  of  what  columns  are  important.

Now  we  can  do  the  same  thing with  the  boosted  tree,

and  the  results  aren't  quite  as  consistent as  they  were  with  the  bootstrap  forest.

So  here  is  a  bunch  of  columns

where  the  initial  column  contributions  came  out very small

but  they  had  a  more  substantial  contribution

in  some  of  the  random  reshuffles of  the  validation  column.

That also happened quite a bit with Columns 52 through 55 over here.

Then  there  were  also  some  situations

where  the  original  column  contribution   was  quite  large,

and  most,  if  not  all,

of  the  other  column  contributions found  in  the  simulations  were  smaller.

That  happens  here  with  Column  48,

and  to  some  extent  also  with  Column  11  over  here.

The  overall  conclusion  being   that  I  think  this  validation  shuffling

is  indicating  that  we  can  trust   the  column  contributions

from  the  bootstrap  forest  to  be  more stable  than  those  of  the  boosted  tree.

Based on  this  comparison, I  think  I  trust  the  column  contributions

from  the  bootstrap  forest  more,

and  I'm  going  to  use  the  columns  that  it  recommended

as  the  basis  for  some  other  models.

What  I'd  like  to  do

is  find  a  model  that  is  both  simpler than  the  bootstrap  forest  model

and  performs  better  in  terms  of  validation  set  performance

for  predicting  pass  or  fail.

Before proceeding with the next modeling step,

I'm  going  to  do  something  that  I  should have  probably  done  at  the  very  beginning,

which  is  to  take  a  look  at  the  sensors in  a  scatterplot  matrix

to  see  how  correlated  the  sensor  readings  are,

and  also  look  at  histograms  of  them as  well  to  see  if  they're  outlier- prone

or heavily skewed or otherwise highly non-Gaussian.

What  we  see  here  is   there  is  pretty  strong  multicollinearity

amongst  the  input  variables  generally.

We're  only  looking   at  a  subset  of  them  here,

but  this  high  multicollinearity  persists across  all  of  the  sensor  readings.

This  suggests  that  for  our  model,

we  should  try  things  like  the  logistic  lasso,

the  logistic  elastic  net, or  a logistic  ridge  regression

as  candidates  for  our  model to  predict  pass  or  fail.

Before  we  do  that, we  should  go  ahead

and  try  to  transform our  sensor  readings  here

so  that  they're  a  little  bit better- behaved  and  more  gaussian- looking.

This  is  actually  really  easy  in  JMP

if  you  have  all  of  the  columns  up in  the  distribution  platform,

because  all  you  have  to  do  is  hold  down Alt ,  choose  Fit  Johnson,

and  this  will  fit  Johnson  distributions to  all  the  input  variables.

This  is  a  family  of  distributions

that is based on a four-parameter transformation to normality.

As  a  result,   we  have  a  nice  feature  in  there

that  we  can  also  broadcast using Alt  Click,

where  we  can  save a  transformation  from  the  original  scale

to  a  scale  that  makes  the  columns more  normally  distributed.
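
A sketch of the underlying idea, using scipy's Johnson SU family as a stand-in for Fit Johnson (JMP also considers the SB and SL forms, so this is only illustrative): fit the four parameters, then save the transform to normality. The sensor column below is a placeholder.

```python
# Sketch of a Johnson-type transformation to normality with scipy's
# Johnson SU family standing in for JMP's Fit Johnson / saved transform columns.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sensor = np.exp(rng.normal(size=174))           # a skewed, placeholder sensor column

a, b, loc, scale = stats.johnsonsu.fit(sensor)  # fit the 4-parameter Johnson SU
z = a + b * np.arcsinh((sensor - loc) / scale)  # the "transform to normal" column

print("skewness before:", round(stats.skew(sensor), 2))
print("skewness after: ", round(stats.skew(z), 2))
```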

If  we  go  back  to  the  data  table,

we'll  see  that  for  each  sensor  column, a  transform  column  has  been  added.

If  we  bring  these  transformed  columns  up with  a  scatterplot  matrix

and  some histograms,

we  clearly  see  that  the  data  are  less  skewed

and  more  normally  distributed  than  the  original  sensor  columns  were.

Now  the  bootstrap  forest  model that  Ron  found

only  really  recommended   a  small  number  of  columns

for  use  in  the  model.

Because  of  the  high  collinearity   amongst  the  columns,

the  subset  that  we  got  could  easily  be  part

of  a  larger  group  of  columns that  are  correlated  with  one  another.

It  could  be  beneficial   to  find  that  larger  group  of  columns

and  work  with  that   at  the  next  modeling  stage.

An  exploratory  way  to  do  this

is  to  go  through  the  cluster variables  platform  in JMP .

We're  going  to  work with  the  normalized  version  of  the  sensors

because  this  platform   is  PCA  and  factor  analysis  based,

and  will  provide  more  reliable  results  if  we're  working  with  data

that  are  approximately   normally  distributed.

Once  we're  in  the  variable clustering  platform,

we see that there are very clear,

strong associations amongst the input columns.

It  has  identified   that  there  are  seven  clusters,

and  the  largest  cluster, the  one  that  explains  the  most  variation,

has  25  members.

The  set  of  cluster  members is  listed  here  on  the  right.
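
As a rough, exploratory stand-in for the Cluster Variables platform (which is PCA and factor-analysis based), one could group correlated sensors by hierarchical clustering on a correlation-based distance. The data below are placeholders; only the idea of finding a large cluster of related sensors carries over.

```python
# Sketch of grouping correlated sensors: hierarchical clustering on a
# correlation-based distance, a rough stand-in for JMP's Cluster Variables.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 63))                    # placeholder transformed sensors

corr = np.corrcoef(X, rowvar=False)               # 63 x 63 correlation matrix
dist = 1 - np.abs(corr)                           # distance: 0 for perfectly correlated
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

labels = fcluster(Z, t=7, criterion="maxclust")   # ask for 7 clusters, as in the talk
biggest = np.bincount(labels).argmax()
members = np.where(labels == biggest)[0] + 1      # 1-based sensor numbers
print(f"Largest cluster has {len(members)} sensors:", members)
```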

Let's  compare  this  with  the  bootstrap  forest.

Here  on  the  left, we  have  the  column  contributions

from  the  bootstrap  forest  model  that  you  should  be  familiar  with  by  now.

On  the  right,  we  have  the  list  of  members

of  that  largest  cluster  of  variables.

If  we  look  closely,  we'll  see  that  the  top seven  contributing  terms

all  happen  to  belong  to  this  cluster.

I'm  going  to  hypothesize   that  this  set  of  25  columns

are  all  related  to  some   underlying  mechanism

that  causes  the  units  to  pass  or  fail.

What  I  want  to  do  next   is  I  want  to  fit  models

using  the  generalized  regression  platform with  the  variables  in  Cluster  1  here.

It  would  be  tedious   if  I  had  to  go  through

and  individually  pick  these  columns  out and  put  them  into  the  launch  dialog.

Fortunately,  there's  a  much  easier  way

where  you  can  just  select  the  rows in  that  table

and  the  columns  will  be selected  in  the  original  data  table

so  that  when  we  go   into  the  fit  model  launch  dialog,

we  can  just  click  Add

and  those  columns  will  be  automatically   added  for  us  as  model  effects.

Once I got into the Generalized Regression platform,

I went ahead and fit a lasso model, an elastic net model,

and a ridge model to compare them to each other,

and also to the logistic regression model that came up by default.

We're  seeing  that  the  lasso  model   is  doing  a  little  bit  better  than  the  rest

in terms of its validation generalized RSquare.

The  difference  between  these  methods

is  that  there's  different  amounts of  variable  selection

and  multicollinearity  handling in  each  of  them.

Logistic regression   has  no  multicollinearity  handling

and  no  variable  selection.

The  lasso  is  more   of  a  variable  selection  algorithm,

although  it  has  a  little  bit of  multicollinearity  handling  in  it

because  it's  a  penalized  method.

Ridge  regression  has  no  variable  selection

and  is  heavily  oriented  around multicollinearity  handling.

The  elastic  net  is  a  hybrid  between the  lasso  and  ridge  regression.
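
Here is a sketch of the four candidate fits using scikit-learn's penalized logistic regression as an analogue of the Generalized Regression estimation methods; the data, penalty strengths, and split below are placeholders.

```python
# Sketch of the four candidates: plain logistic, lasso (L1), ridge (L2), and
# elastic net. In the talk the inputs are the transformed Cluster 1 sensors
# and the response is pass/fail; here everything is a placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(172, 25)))  # 25 "Cluster 1" columns
y = rng.choice([0, 1], size=172)                                # 1 = pass, 0 = fail

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "logistic (~unpenalized)": LogisticRegression(C=1e6, max_iter=5000),
    "lasso":       LogisticRegression(penalty="l1", C=0.5, solver="saga", max_iter=5000),
    "ridge":       LogisticRegression(penalty="l2", C=0.5, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", C=0.5, l1_ratio=0.5,
                                      solver="saga", max_iter=5000),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    n_active = int(np.sum(m.coef_ != 0))          # lasso/elastic net zero out coefficients
    print(f"{name:24s} validation accuracy = {m.score(X_va, y_va):.2f}, "
          f"nonzero coefficients = {n_active}")
```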

In  this  case, what  we  really  care  about

is  just  the  model  that's going  to  perform  the  best.

We  allow  the  validation  to  guide  us.

We're  going  to  be  working with  the  lasso  model  from  here  on.

Here's  the  prediction  profiler for  the  lasso  model  that  was  selected.

We  see  that  the  lasso  algorithm has  selected  eight  sensors

as  being  predictive  of  pass  or  fail.

The profiler has some great built-in tools

for understanding what the important variables are,

both in the model overall and, new to JMP Pro 17,

the ability to understand

what variables are most important for an individual prediction.

We  can  use  the  variable  importance  tools to  answer  the  question,

"What  are  the  important  variables in  the  model?"

We have a variety of different options for how this could be done.

But because of the multicollinearity and because this is not a very large model,

I'm going to go ahead

and use the dependent resampled inputs technique,

and this has given us a ranking of the most important terms.

We  see  that  Column  18  is  the  most  important,

followed  by  Column  27 and  then  52,  all  the  way  down.

We  can  compare  this   to  the  bootstrap  forest  model,

and  we  see  that  there's  agreement that  Variable  18  is  important,

along  with  52,  61,  and  53.

But  one  of  the  terms  that  we  have  pulled  in

because  of  the  variable  clustering  step that  we  had  done,

Sensor 27 turns out to be the second most important predictor

in this lasso model.

We've  hopefully  gained  something by  casting  a  wider  net  through  that  step.

We've  found  a  term  that  didn't  turn  up

in  either  of  the  bootstrap  forest or  the  boosted  tree  methods.

We also see that the lasso model has an RSquare of 0.9,

whereas the bootstrap forest model had an RSquare of 0.8.

We have  a  simpler  model   that  has  an  easier  form  to  understand

and  is  easier  to  work  with,

and  also  has  a  higher  predictive  capacity than  the  bootstrap  forest  model.

Now,  the  variable   importance  metrics  in  the  profiler

have  been  there  for  quite  some  time.

The  question  that  they  answer  is, "Which  predictors  have  the biggest  impact

on  the  shape  of  the  response  surface over  the  data  or  over  a  region?"

In  JMP  17  Pro,   we  have  a  new  technique  called  SHAP  Values

that  is  an  additive  decomposition   of  an  individual  prediction.

It  tells  you  by  how  much   each  individual  variable  contributes

to  a  single  prediction,

rather  than  talking  about  variability explained  over  the  whole  space.

The  resolution  of  the  question that's  answered  by  Shapley  values

is  far  more  local  than  either  the  variable  importance  tools

or  the  column  contributions i n  the  bootstrap  forest.

We  can  obtain  the  Shapley  Values  by  going to  the  red  triangle  menu  for  the  profiler,

and  we'll  find  the  option  for  them over  here,  fourth from  the  bottom.

When  we  choose  the  option, the  profiler  saves  back  SHAP  columns

for  all  of  the  input  variables  to  the  model.

This  is,  of  course,  happening for  every  row  in  the  table.

What you can see is that the SHAP Values are giving you the effect

of  each  of  the  columns   on  the  predictive  model.

This  is  useful   in  a  whole  lot  of  different  ways,

and  for  that  reason,  it's  gotten a  lot  of  attention  in intelligible  AI,

because  it  allows  us  to  see

what the  contributions  are   of  each  column  to  a  black  box  model.
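
Here is a sketch of per-row SHAP values using the open-source shap package as an analogue of the new profiler option; the model and data below are placeholders, and shap's call signatures can vary a little across versions.

```python
# Sketch of per-row SHAP values: each row of the output additively decomposes
# that row's prediction into per-sensor contributions. Placeholder data and a
# simple penalized logistic model stand in for the talk's lasso fit.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(172, 8))     # the eight sensors the lasso kept (placeholder)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=172) > 0).astype(int)

model = LogisticRegression(penalty="l1", C=0.5, solver="saga", max_iter=5000).fit(X, y)

explainer = shap.Explainer(model, X)   # background data = the training rows
shap_values = explainer(X)             # one additive decomposition per row

row = 0
print("Largest contributions for row", row)
for j in np.argsort(np.abs(shap_values.values[row]))[::-1][:3]:
    print(f"  sensor {j + 1}: SHAP = {shap_values.values[row, j]:+.3f}")
```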

Here, I've plotted the SHAP Values for the columns that are predictive

in the lasso model that we just built.

If  we  toggle  back  and  forth  between the  good  units  and  the  units  that  failed,

we  see  the  same  story  that  we've  been  seeing

with  the  variable  importance  metrics  for  this,

that  Column  18  and  Column  27  are  important  in  predicting  pass  or  fail.

We're  seeing  this at  a  higher  level  of  resolution

than  we  do   with  the  variable  importance  metrics,

because  each  of  these  points   corresponds  to  an  individual  row

in  the  original  dataset.

But  in  this  case, I  don't  see  the  SHAP  Values

really  giving  us  any  new  information.

I  had  hoped  that  by  toggling  through

the  other  failure  modes, maybe  I  could  find  a  pattern

to  help  tease  out  different  sensors

that  are  more  important  for  particular  failure  modes.

But  the  only  thing  I  was  able  to  find  was  that  Column  18

had  a  somewhat  stronger  impact  on  the  Velocity  Type  1  failure  mode

than  the  other  failure  modes.

At  this  point,  we've  had  some  success

using  those  Cluster  1  columns in  a  binary  pass/ fail  model.

But when I broke out the SHAP Values

for that model by the different failure modes,

I wasn't able to discern much of a pattern.

What I did next was I went ahead

and fit the failure mode response column, test result,

using the Cluster 1 columns,

but  I  went  ahead   and  excluded  all  the  pass  rows

so  that  the  modeling procedure  would  focus  exclusively

on  discerning  which  failure  mode  it is given  that  we  have  a  failure.

I  tried  the  multinomial  lasso,   elastic  net,  and  ridge,

and  I  was  particularly  happy  with  the  lasso  model

because it gave me a validation RSquare of about 0.94.

Having  been  pretty  happy  with  that,

I  went  ahead  and  saved  the  probability  formulas

for  each  of  the  failure  modes.

Now the task is to come up with a simple rule

that post-processes those prediction formulas

to make a decision about which failure mode we have.

I call this the partition trick.

The partition trick  is  where  I  put  in  the  probability  formulas

for  a  categorical  response, or  even  a  multinomial  response.

I  put  those  probability  formulas  in  as Xs.

I  use  my  categorical  response  as  my  Y.

This  is  the  same  response  that  was  used

for  all  of  these except  for pass,  actually.

I  retain  the  same  validation  column  that I've  been  working  with  the  whole  time.
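
A sketch of the partition trick with scikit-learn's decision tree standing in for the Partition platform: the saved class-probability columns are the Xs, the categorical response is the Y, and a very shallow tree yields a readable rule. Everything below is a placeholder.

```python
# Sketch of the "partition trick": use saved class-probability columns as Xs
# and the categorical test result as Y, then grow a very shallow tree to get
# a small, readable decision rule. Probabilities here are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
modes = ["Good", "ITM", "Motor", "Brake", "Velocity1", "Velocity2", "Gripper", "IMP"]
y = rng.choice(modes, size=172, p=[0.5, 0.12, 0.1, 0.04, 0.1, 0.04, 0.06, 0.04])

# Stand-ins for the saved probability formula columns, one per category
prob_cols = [f"Prob({m})" for m in modes]
probs = rng.dirichlet(np.ones(len(modes)), size=172)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(probs, y)
print(export_text(tree, feature_names=prob_cols))   # the readable decision rule
```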

Now  that  I'm  in  the  partition  platform,

I'm  going  to  hit  Split  a  couple  of  times, and  I'm  going  to  hope

that  I  end  up  with an  easily  understandable  decision  rule

that's  easy  to  communicate.

That  may  or  may  not  happen.

Sometimes  it  works,  sometimes  it  doesn't.

So  I  split  once, and  we  end  up  seeing  that

whenever  the  probability  of  pass  is  higher  than  0.935,

we  almost  certainly  have  a  pass.

Not  many  passes  are  left over  on  the  other  side.

I  take  another  split.

We  find  a  decision  rule  on  ITM

that  is  highly  predictive  of  ITM  as  a  failure  mode.

Split  again.

We  find  that  whenever  Motor is  less  than  0.945,

we're  either  predicting  Motor  or  Brake.

We  take  another  split.

We find that whenever the probability of Velocity Type 1 is bigger than 0.08,

we're likely in a Velocity Type 1 situation or in a Velocity Type 2 situation.

Whenever Velocity Type 1 is less than 0.79,

we're likely in a gripper failure mode or an IMP failure mode.

What  do  we  have  here? We  have  a  simple  decision  rule.

We're  going  to  not  be  able  to  break   these  failure  modes  down  much  further

because of  the  very  small  number of  actual  events  that  we  have.

But  we  can  turn  this  into  a  simple  rule

for  identifying  units   that  are  probably  good,

and  if  they're  not,  we  have  an  idea of  where  to  look  to  fix  the  problem.

We  can  save  this  decision  rule  out   as  a  leaf  label  formula.

We  see  that  on  the  validation  set,

when  we  predict  it's  good, it's good most  of  the  time.

We  did  have  one  misclassification of  a  Velocity  Type 2  failure

that  was  actually  predicted  to  be  good.

Predict  grippers  or  IMP, it's  all  over  the  place.

That  leaf  was  not  super  useful.

Predicting ITM is 100%.

Whenever  we  predict  a  motor  or  brake,

on  the  validation  set, we  have  a  motor  or  a  brake  failure.

When  we  predict  a  Velocity  Type  1 or 2,

it  did  a  pretty  good  job of  picking  that  up

with  that  one  exception of  the  single  Velocity  Type  2  unit

that  was  in  the  validation  set,

and  that one  happened  to  have  been  misclassified.

We  have  an  easily  operational  rule  here that  could  be  used  to  sort  products

and  give  us  a  head  start  on  where we  need  to  look  to  fix  things.

I think this was a pretty challenging problem,

because we didn't have a whole lot of data.

We didn't have a lot of rows,

but we had a lot of different categories to predict

and a whole lot of possible predictors to use.

We've  gotten  there  by  taking  a  series  of  steps,

asking  questions,

sometimes  taking  a  step  back and  asking  a  bigger  question.

Other  times,  narrowing  in   on  particular  sub- issues.

Sometimes  our  excursions  were  fruitful, and  sometimes  they  weren't.

Our  purpose  here  is  to  illustrate

how  you  can  step  through   a  modeling  process,

through  this  sequence  of  asking  questions

using  modeling  and  visualization  tools to  guide  your  next  step,

and  moving  on   until  you're  able  to  find

a  useful,  actionable,  predictive  model.

Thank  you  very  much  for  your  attention.

We  look  forward  to  talking  to  you in  our  Q&A  session  coming  up  next.

Comments

Really great example of using the simulate feature in this presentation. I did not know you could use a formula validation column and then use the simulate feature to continuously update which parts of the data table are being used as validation. Very cool and it is awesome that feature will be available to all users in JMP 17, not just Pro users.

 

I also really enjoy presentations that spend just as much time on what didn't work as they spend talking about what did work. This was a really enjoyable and informative presentation so well done!