What Model When? (and Which Modeling Type?) - (2023-US-PO-1509)

You have a question to answer, so you collect the relevant data and are ready to start creating a predictive model. But which type of model do you choose, and which modeling type? Is the goal to segment, predict, explain, or identify? Are the variables continuous, nominal, or ordinal?

Before we even get to choosing a type of model, we must define how the variables are used in the analysis by assigning each one a modeling type. What happens if we treat number of bedrooms as continuous versus nominal or ordinal? When we pull up a distribution, we see how the modeling type determines whether the summary statistics we get are means or frequencies. This poster demonstrates how the modeling type affects the results of your predictive model, depending on which type of model is chosen. Get ready to play “Name that Analysis” as we go head-to-head on classifying different case study modeling examples with their respective modeling type.

 

 

Hello,  my  name  is  Andrea  Coombs,

 

and I'm joined by my colleague, Olivia Lippincott.

Olivia  and  I  have  given  a  presentation before  called  What  Model  When.

If  you  want  to  take  a  closer  look at  that  presentation,

you  can  take  a  look at  the  link  in  the  community  post.

But  today,  we  want  to  talk about  something  a  little  bit  more.

Yeah,  today  we  want to  think  about  modeling  type

and  how  modeling  type  impacts  the analysis

for  each  of  the  four  model  goals that  we  talked  about  previously.

Right,  and  we're  actually  going to  use  the  same  data.

This  is  data  that  we  pulled from  Redfin  that  represents

the  housing  market  in  the  Cincinnati  area.

Here  we're  trying  to  look

at  the  price  of  homes  relative to  their  square  footage,

the  number  of  beds, the  number  of  baths,

and  so  on  and  so  forth.

Previously,  we've  answered  the  question, what  model  when?

It  really  depends  on  what  model you're  going  to  choose

based  on  your  goal  for  the  analysis.

For  segment,  we're  trying to  examine  relationships

where  there's  no  intended  response;

explain,  we're  trying to  explain  a  relationship

and  look  at  the  underlying  factors and  how  those  affect  the  response;

predict,  we're  trying to  predict  future  outcomes

or  the  response  in  new  situations;

and identify,  we're  trying to  find  important  variables.

Right.

Now  let's  bring the  modeling  type  into  the  picture.

Both  your  responses  and  your  factors can  have  different  modeling  types.

In  JMP,  there  are three  main  modeling  types:

continuous,  nominal,  and  ordinal.

Continuous  modeling  type  is  represented by  this  blue  triangle  icon  here,

and  this  refers  to  numeric  data  only.

The  nominal  modeling  type is  represented  by  this  red  icon,

and  this  is  numeric  or  character  data where  values  belong  to  categories,

but  the  order  is  not  important.

For  the  ordinal  modeling  type, it  is  represented  by  this  green  icon,

and  this  can  be  either  numeric or  character  data  as  well.

But  in  this  case,  values  belong to  ordered  categories.

When  you're  doing  an  analysis  in  JMP,

you  want  to  make  sure  you  set  up the  correct  modeling  type,

because  JMP  will  do the  correct  model  for  you,

will  do  the  correct  analysis depending  on  modeling  type.
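As a rough illustration of why the modeling type matters before any model is fit, the sketch below summarizes the same column in two ways. It is plain Python, not JMP, and the bedroom counts and the `summarize` helper are hypothetical, made up for this example rather than taken from the Redfin data.

```python
from collections import Counter
from statistics import mean

# Hypothetical bedroom counts for a handful of listings (not the Redfin data).
beds = [2, 3, 3, 4, 4, 4, 5]

def summarize(values, modeling_type):
    """Return a summary whose form depends on the declared modeling type."""
    if modeling_type == "continuous":
        # Numeric scale: moments such as the mean are meaningful.
        return {"mean": mean(values), "min": min(values), "max": max(values)}
    if modeling_type in ("nominal", "ordinal"):
        # Categories: report how often each level occurs.
        counts = Counter(values)
        # Ordinal levels are listed in order; nominal levels in order of appearance.
        levels = sorted(counts) if modeling_type == "ordinal" else list(counts)
        return {level: counts[level] for level in levels}
    raise ValueError(f"unknown modeling type: {modeling_type}")

continuous_summary = summarize(beds, "continuous")
nominal_summary = summarize(beds, "nominal")
print(continuous_summary)
print(nominal_summary)
```

Treated as continuous, the column yields a mean number of bedrooms; treated as nominal, it yields a frequency for each bedroom count.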

Andrea,  I  have  a  game  for  us  to  play.

It's  called  Name  That  Analysis.

Do  you  want  to  play?

Absolutely. I  love  games.

Awesome.

Here's  your  first  question.

We  want  to  identify

which  features  of  a  home are  most  important

to  determining  the  price.

For  example,  square  footage and  number  of  bathrooms

can  explain  a  large  amount of  the  variation  in  price,

but  other  features  are  less  important.

All  right,  Olivia.

I  think  you're  making this  first  question  easy  for  me.

Is  the  answer  identify?

Let's  see.

Yeah,  you're  right.

I  did  make  that  one a  little  bit  easy  to  get  us  going,

but  that  is  identify to  find  important  variables  within  there.

There's  a  couple of  different  places  in  JMP

where  we  can  use  tools  to  identify if  that's  our  modeling  goal.

Under  the  Analyze  menu  under  Screening, Predictive  Modeling  and  Fit  Model,

using  tools  like  Predictor  Screening, Bootstrap  Forest,

Generalized  Regression and  Stepwise  Selection.

For  modeling  type,

when  we're  looking at  the  goal  of  identify,

it's  not  going  to  affect  things  much.

JMP  is  going  to  do  the  correct  analysis

as  long  as  your  modeling  types are  set  appropriately.

We  took  a  look  at  this  and  we  took both  the  response  and  the  factors

and  changed  them from  continuous  to  nominal

and looked at which factors came up as most important.

While  the  order  of  the  factors  varied, the  dominant  factors  stayed  the  same.

All  right.

It  looks  like  if  our  goal is  to  identify  important  factors,

really,  the  exact  modeling  type we're  using

isn't  impacting  things  that  much, it  looks  like,  Olivia.

Right.

Our  conclusions on  which  variables  are  important

aren't  going  to  change  much based  on  the  modeling  type.

All right.

Well,  that  is  good  to  know.

I  have  a  question  for  you.

Are  you  ready?

I'm  ready.

All  right,  here  is  your  question.

Let's  say  we  want  to  build  a  model to  predict  house  prices.

This  model  will  be  based  on  many  important predictor  variables  we  have  in  our  data.

For  example,  we  want  to  predict

the  price  of  a  house that  we  want  to  put  on  the  market.

Which  goal  do  you  think we're  working  with  here?

Okay,  so  it's  not  like  question  one where  we're  trying  to  see

which  factors  are  most  important to  predict  housing  prices.

We're  just  really  trying  to  get that  final  housing  price  prediction.

I'm  going  to  go  with  predict.

All  right,  let's  see  if  you're  right.

Yes,  you  are  right.

The  goal  of  this  analysis  is  predict.

There's  lots  of  different  platforms  in  JMP where  you  can  build  models  for  prediction.

Within  each  of  those  platforms  in  JMP where  you  can  build  the  prediction  models,

JMP  will  do  the  correct  analysis  for  you,

depending  on  the  modeling  type of  your  response.

Here  we  have  a  table

of  different  modeling  types for  our  responses:

continuous,  nominal,  and  ordinal.

For  a  continuous  response,

this  is  the  typical  one that  we  were  talking  about,  right?

We  want  to  predict  the  price  of  a  home that  we're  going  to  put  on  the  market.

Now,  when  we're  building  this  type of  model  with  a  continuous  response,

well,  we  want  to  know how  powerful  that  model  is.

What's  the  predictive  power  of  that  model?

We  can  use   RSquared and  the  Root  Average  Squared  Error

to  diagnose  that  model.

Now,  for  a  nominal  and  ordinal  model, it's  a  little  bit  different.

For  a  model  with  a  nominal  response, we  have  categories  as  the  response.

In  this  example,  we're  looking

at whether the price will be above or below $1 million.

That's  what  we  want  to  predict.

For  the  ordinal  response, here  we  have  an  ordered  category.

We  want  to  predict whether  the  price  of  the  house

is going to be low, medium, or high.

For  the  nominal  and  ordinal  examples,

again,  we  can  look  at   RSquared and  Root  Average  Squared  Error

to  evaluate  those  models.

But  there's  other  things  that  we  can  use to  evaluate  those  models,

like  the  misclassification  rate and  the  area  under  the  ROC  curve.
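The fit measures mentioned here can be sketched in a few lines. This is a minimal illustration in plain Python using the textbook continuous-response formulas for RSquared and Root Average Squared Error, plus a simple misclassification rate for a categorical response. All prices and class labels are hypothetical, and the versions of RSquared and RASE that JMP reports for nominal and ordinal responses are not reproduced here.

```python
from math import sqrt

def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rase(actual, predicted):
    """Root Average Squared Error: square root of the mean squared residual."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def misclassification_rate(actual, predicted):
    """Share of rows whose predicted category differs from the actual one."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical prices in $ thousands (continuous response).
actual_price = [200, 300, 400, 500]
predicted_price = [220, 280, 410, 490]
r2 = r_squared(actual_price, predicted_price)
rase_value = rase(actual_price, predicted_price)

# Hypothetical over/under $1 million labels (nominal response).
actual_class = ["under", "under", "over", "over"]
predicted_class = ["under", "over", "over", "over"]
miss = misclassification_rate(actual_class, predicted_class)

print(r2, rase_value, miss)
```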

Of  course,  our  favorite  tool  in  JMP

to  take  a  look  at  our  prediction  model is  the  Prediction  Profiler.

Let's  take  a  look  at  the  difference between  the  Prediction  Profiler

for  the  modeling  types  of  our  responses.

For  the  continuous  response, we  can  see  that  on  the  Y-axis,

we  have  the  mean  prediction plus  or  minus  the  confidence  interval

given the value of the model factors here on each of the X-axes.

For  the  nominal and  ordinal  logistic  models,

what we see on the Y-axis

is  the  probability  of  the  response  being in  a  certain  category.

For  the  nominal  logistic  model, we  have  the  probability

that  the  house  is  either  going  to  be above  or  below  a  million  dollars.

For  this  ordinal  logistic  model,

we  can  see  the  probability  of  having a  low,  medium,  or  high  price.

Okay,  so  it  sounds  like  the  goal of  what  we  want  to  predict

is  also  important  when  we're  talking about  that  prediction  goal,

whether  we  want to  treat  price  as  continuous

and  get  the  predictions of  the  exact  prices  out  of  there,

or  if  we  want  to  treat  it  as  a  category.

Right.

You  just  need  to  get

that  response  variable  set  up and  your  data  set  the  correct  way,

and  then,  of  course, assign  the  correct  modeling  type,

and  JMP  is  going  to  build the  correct  model  for  you.

All  right,  Andrea.

Are you  ready  for  your  next  question?

I'm  ready. Let's  go.

Okay.

We  want  to  quantify

the  effect  on  home  prices from  additional  bedrooms.

For  example,  on  average,

every additional bedroom adds about $97,000 to the total home cost.

Adding  a  bedroom  adds  $97,000?

Man,  Cincinnati  is  a  tough  housing  market.

That's  crazy.

All  right,  well, so  let's  see.

What  are  we  trying  to  do  here?

We're  trying  to  quantify  the  effect  here.

I  think  what  we're  trying  to  do  is  explain

that effect that bedrooms have on the price of a house.

I'm  going  to  say  explain.

You're  correct.

Yeah,  we're  trying  to  describe the  relationships.

In  explain,  we  use  the  parameter  estimates taken  from  the  model  equation

to  quantify  those  relationships between  the  factors  and  the  responses.

Typically,  we  use  in  JMP under  the  Fit  Model  menu  location

tools  like  Standard Least  Squares, Logistic  and  Ordinal  Regression,

and  Generalized  Regression.

Modeling  type  can  really  impact

how a factor's relationship with the response variable is interpreted.

We  took  a  look, and  we  were  looking

at  how  does  the  number  of  beds  affect the  housing  price?

We  changed  beds  from  continuous, to  nominal,  to  ordinal,

and  see  what  that  relationship  was.

We  can  see  under  the  continuous, that's  where  we've  got

that every additional bedroom adds about $97,000 to the total home price.

That  prediction  profiler  shows a  linear  relationship

when  we  treat  beds  as  continuous.

But  when  we  treat  beds as  nominal  or  ordinal,

there's  not  that  straight linear  relationship  going  on.

We  see  a  spike  in  price  for  4-5  bedrooms compared  to  going  from  2-3  bedrooms.
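The contrast between the two treatments of beds can be sketched as follows. This is plain Python on hypothetical prices (in $ thousands), not the Redfin data, so the slope here does not match the $97,000 figure. Treating beds as continuous forces a single least-squares slope, while treating beds as categories lets each bedroom count have its own mean price.

```python
from collections import defaultdict

# Hypothetical listings: bedroom count and price in $ thousands.
beds = [2, 2, 3, 3, 4, 4, 5, 5]
price = [150, 170, 200, 220, 260, 280, 480, 520]

# Beds as continuous: one straight line, price = intercept + slope * beds.
n = len(beds)
mean_beds = sum(beds) / n
mean_price = sum(price) / n
sxy = sum((b - mean_beds) * (p - mean_price) for b, p in zip(beds, price))
sxx = sum((b - mean_beds) ** 2 for b in beds)
slope = sxy / sxx                      # same price change for every added bedroom
intercept = mean_price - slope * mean_beds

# Beds as categories: a separate mean price for each bedroom count.
grouped = defaultdict(list)
for b, p in zip(beds, price):
    grouped[b].append(p)
level_means = {b: sum(ps) / len(ps) for b, ps in sorted(grouped.items())}

# Price change between neighboring bedroom counts is free to vary.
ordered = list(level_means)
steps = [level_means[hi] - level_means[lo] for lo, hi in zip(ordered, ordered[1:])]

print(slope, intercept)
print(level_means, steps)
```

In this made-up data the single slope averages over steps of very different sizes, which is the kind of detail the categorical treatment exposes.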

Right.

I  see  with  nominal  and  ordinal,

the  prediction  profiler  looks almost  exactly  the  same,

so  it  must  be  the  same  model.

However,  I'm  seeing with  the  parameter  estimates,

they  look  a  little  bit  different between  nominal  and  ordinal.

What's  going  on  there?

Yeah, so the nominal and ordinal modeling types,

when we use them within a regression...

They're  coded  differently within  the  regression,

so  the  parameter  estimates  are  different.

For  nominal,  that  intercept,

we  think  of  that  as  the  mean  house  price across  all  the  different  bedrooms,

and  each  of  those  parameter  estimates

are  how  much  that  number  of  beds  increases or  decreases  that  mean  house  price.

But for ordinal, because we're looking at order matters,

we  think  of  the  intercept as  if  there  are  zero  bedrooms

and  each  of  those  parameter  estimates

is  the  effect  of  adding an  additional  bedroom  onto  the  price.
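The two codings described above can be sketched numerically. This is a minimal Python illustration on hypothetical mean prices (in $ thousands), assuming effect coding for the nominal case and successive-difference coding for the ordinal case; it is a sketch of the general idea, not a reproduction of JMP's internal computations. Under these assumptions, the ordinal intercept corresponds to the lowest bedroom count present in the data.

```python
# Hypothetical mean price (in $ thousands) for each bedroom count.
# With a single categorical factor, least-squares fitted values equal these means.
level_means = {2: 160.0, 3: 210.0, 4: 270.0, 5: 500.0}
levels = sorted(level_means)

# Nominal-style (effect) coding: the intercept is the average of the level
# means, and each estimate is that level's deviation from the average.
nominal_intercept = sum(level_means.values()) / len(levels)
nominal_effects = {lvl: level_means[lvl] - nominal_intercept for lvl in levels}

# Ordinal-style coding: the intercept is the lowest level's mean, and each
# estimate is the change from one level to the next.
ordinal_intercept = level_means[levels[0]]
ordinal_steps = {
    f"{lo}->{hi}": level_means[hi] - level_means[lo]
    for lo, hi in zip(levels, levels[1:])
}

def predict_nominal(level):
    return nominal_intercept + nominal_effects[level]

def predict_ordinal(level):
    total = ordinal_intercept
    for lo, hi in zip(levels, levels[1:]):
        if hi <= level:                # add every step up to the requested level
            total += level_means[hi] - level_means[lo]
    return total

print(nominal_intercept, nominal_effects)
print(ordinal_intercept, ordinal_steps)
```

Both codings return the same predicted price for every bedroom count, which is why the profilers look alike while the parameter estimates differ.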

All  right.

Modeling  type  is  really  going to  affect  my  parameter  estimates.

I  really  need  to  think  about exactly  what  do  I  want  to  explain

as  a  part  of  this  model when  I'm  doing  this  analysis.

Yes.

All  right.

Are  you  ready for  the  final  question,  Olivia?

Yeah,  bring  it  on.

All  right,  here's  the  question.

Let's  say  we  want to  identify  groups  of  homes

that  are  similar  based on  a  list  of  possible  characteristics.

In  other  words,

we  want  to  identify  market  segments based  on  things  like  square  footage,

location,  number  of  bedrooms,  et  cetera.

Which  goal  do  you  think  this  is?

I  think  you're  trying to  trick  me  with  that  identify,

and  I'm  not  going  to  fall  for  it.

Okay.

But  there  are  no  responses within  this  question.

I  think  we're  looking  at  clustering.

I'm  going  to  say  segment.

Okay.

Well,  you're  right,  Olivia.

I  did  try  and  trick  you  a  little  bit because  I  really  wanted  to  win.

But  you're  right, that's  the  key  thing  here,

is  that  there  are  no  responses  here in  this  analysis.

We  are  definitely  looking  at  segment.

When  our  goal  is  segment,

we  can  use a  couple  of  different  clustering  tools.

We  can  do  Hierarchical  Clustering,

K-Means Clustering, or Latent Class Analysis.

It's  important  to  keep  in  mind that  with  Hierarchical  Clustering,

you  can  only  include...

Sorry,  you  can  include all  of  the  modeling  types:

continuous,  nominal,  and  ordinal.

But  for  K-Means  Clustering,

you  can  only  include variables  that  are  continuous.

For  Latent  Class  Analysis,

you  can  only  include nominal  or  ordinal  variables.

In  our  case  here, when  we're  looking

at  the  number  of  bedrooms, lot  size,  year  built,  and  square  feet,

we  have  a  combination of  continuous  and  nominal  variables.

Hierarchical  Clustering  may  be  the  best clustering  tool  to  use  in  this  scenario.
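The rule of thumb just described can be written as a small lookup. This Python sketch encodes only the three constraints stated in the talk. The `clustering_options` helper and the modeling type assigned to each variable below are assumptions for illustration, since the talk does not say which of the four variables were set to nominal.

```python
def clustering_options(modeling_types):
    """List the clustering tools whose modeling-type requirements are met.

    modeling_types maps each variable name to "continuous", "nominal",
    or "ordinal".
    """
    types = set(modeling_types.values())
    options = ["Hierarchical Clustering"]          # accepts all modeling types
    if types <= {"continuous"}:
        options.append("K-Means Clustering")       # continuous variables only
    if types <= {"nominal", "ordinal"}:
        options.append("Latent Class Analysis")    # categorical variables only
    return options

# Assumed assignment: beds as nominal, the rest as continuous.
home_vars = {
    "beds": "nominal",
    "lot_size": "continuous",
    "year_built": "continuous",
    "square_feet": "continuous",
}
print(clustering_options(home_vars))
```

With a mix of continuous and nominal variables, only Hierarchical Clustering remains on the list.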

It  looks  like  with  that  parallel  plot with  Hierarchical  Clustering,

maybe  we  could  call  Cluster  6 Amazing  Location.

Yes.

If  you  think  a  large  lot  size is an  amazing  location,

yeah,  we  can  definitely  call that  segment  Amazing  Location  Homes.

Well,  all  right,  Olivia,

despite  me  giving  you  a  trick in  that  last  question,

it  looks  like  we  ended  up with  a  tie  here  again.

We'll  have  to  rematch  again  soon.

Absolutely.

We  talked  about  what  model  when, and that  really,

what  model  you  choose  depends on  your  goal  for  the  analysis,

whether  it's  segment,  explain, predict,  or  identify.

Yeah,  in  terms  of  modeling  type,  again,

JMP  is  going  to  do the  correct  analysis  for  you,

especially  with  your  responses.

If  you're  setting  them  up with  the  correct  modeling  type,

JMP  is  going  to  do the  correct  analysis  for  you.

If  your  goal  is  explain,

you  might  need  to  think  a  little  bit about  which  modeling  type  to  use,

depending  on  how  you  want  to  explain

the  effect  of  something like  the  number  of  bedrooms.

Thank  you,  Olivia.

This  is  so  much  fun.

Let's  do  it  again  next  year.