An Introduction To Spectral Data Analysis With Functional Data Explorer in JMP® Pro 17 (2023-EU-30MP-1264)

Ryan Parker, Senior Research Statistician Developer, JMP
Clay Barker, Principal Research Statistician Developer, JMP

 

Since the Functional Data Explorer was introduced in JMP Pro 14, it has become a must-have tool to summarize and gain insights from shape features in sensor data. With the release of JMP Pro 17, we have added new tools that make working with spectral data easier. In particular, the new wavelets model is a fast alternative to existing models in FDE for spectral data. This presentation introduces these new tools and how to use them to analyze your data.

 

 

Hi, everyone. Thanks for coming to our video. My name is Ryan Parker, and today I'm presenting with Clay Barker some new tools we've added for analyzing spectral data with the Functional Data Explorer in JMP Pro 17. First, I want to start with some of the motivating data sets that led us to add these new tools. They're really motivated by chemometric applications, though they can definitely be applied to other areas. For example, we have this spectroscopy data, and the first thing you might notice is that we've got a lot of data points sampled, but also some very sharp peaks. That's going to be a recurring theme: we need to identify these sharp features, and the existing tools in JMP have a difficult time capturing them. For example, we're thinking about the composition of materials, or how we can detect biomarkers in data. These are three spectroscopic examples that we'll look at.

Another example of data that is of interest is mass spectrometry data. Here we're thinking about a mass-to-charge ratio that we can use to construct a spectrum, where the peaks represent proteins of interest in the area of application. One example is comparing these spectra between different patients, say a patient with cancer and a patient without, and the location of these proteins is very important for identifying differences between the two groups.

Another example is chromatography data. Here we're passing a chemical mixture over a material that helps us quantify the relative amounts of the various components in the mixture. By using the retention time in this process, we can try to identify the different components. For example, if you didn't know this (I was not aware until I started working with this data), trying to impersonate olive oil is a big deal. We can use these data sets to figure out what's a true olive oil, and what's just some other vegetable oil that someone might be trying to pass off as an olive oil.

The first thing I want to do is go through some of the new preprocessing options that we've added to help work with spectral data before we get to the modeling stage. We have a new tool called the standard normal variate; multiplicative scatter correction, for when you have light scatter in your data; the Savitzky–Golay filter, which is a smoothing step for spectral data that we'll get into; and finally, a new tool to perform a baseline correction, which removes trends in the data that you're not really interested in and want to get out first.

Okay, so what's the standard normal variate? Currently in JMP, we have the ability to standardize your data in FDE, but when you use that tool, it's just taking the mean of all of the functions and a global variance and scaling that way. With the standard normal variate, we use the individual mean and variance of each function to standardize and remove those effects before we go to analysis. On the right here, after performing the standard normal variate, we can see that there were some overall mean differences, and now the functions are all together, and any excess variance is taken out before we go to analysis.
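As a rough illustration of the idea outside JMP, here is a minimal Python sketch of the standard normal variate, assuming the spectra are stored as the rows of a matrix:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    means = spectra.mean(axis=1, keepdims=True)
    stds = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - means) / stds
```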

Multiplicative scatter correction is the next one, and it's an alternative to the standard normal variate. In many cases, when you use it, you may end up with similar results. The difference here is the motivation for using multiplicative scatter correction: you use it when you have light scatter, or you think you might have light scatter because of the way that you collected the data.

What happens is, for every function, we fit a simple linear model with a slope and an intercept, and we use those estimated coefficients to standardize the data we're going to work with. We subtract off the intercept and divide by the slope, and now we have the standardized version. Again, you can end up with results similar to the standard normal variate.
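Here is a minimal sketch of that correction in Python, assuming each spectrum is regressed on the mean spectrum as the reference (a common choice; how FDE picks the reference isn't shown in the talk):

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a
    reference (the mean spectrum by default), then subtract the fitted
    intercept and divide by the fitted slope."""
    X = np.asarray(spectra, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        slope, intercept = np.polyfit(ref, x, deg=1)  # x ≈ intercept + slope * ref
        corrected[i] = (x - intercept) / slope
    return corrected
```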

Now, the next preprocessing step I'm going to cover is the Savitzky–Golay filter. The new modeling tools we have for spectral data are developed in such a way that they try to pick up all the important pieces of the data, so if you have noise, we need a step that smooths it out first. That's where the Savitzky–Golay filter comes in. We fit an nth-degree polynomial over a specified bandwidth to help remove noise from the data. In FDE, we'll go ahead and select the best parameters for you, the degree and the width, to try to minimize the model error.

One thing I do want to point out is that we require a regular grid to perform this operation, which will come up again later, but FDE will create one for you. We also have the reduce grid option available if you want finer control before you rely on us making that choice for you. The nice thing about the Savitzky–Golay filter is that, because of the way the model is fit, we now have access to derivatives. Derivatives are something that had come up even before we worked on spectral data, and now we've got a nice way for you to access and model these derivative functions.
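For intuition, here is a small SciPy sketch of Savitzky–Golay smoothing on a regular grid; the window length and polynomial degree are hand-picked here, whereas FDE searches for them, and the same local polynomial fit also yields derivative estimates:

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy function sampled on a regular (equally spaced) grid.
x = np.linspace(0, 10, 501)
y = np.sin(x) + 0.05 * np.random.randn(x.size)

# Smooth with a cubic polynomial fit over a 21-point window.
smoothed = savgol_filter(y, window_length=21, polyorder=3)

# The same fit can return the first derivative of the smoothed function.
dy_dx = savgol_filter(y, window_length=21, polyorder=3, deriv=1, delta=x[1] - x[0])
```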

The last one I want to cover is the baseline correction. The idea behind baseline correction is that there might be overall trends in our data that we want to get rid of. The data set on the right has just very small, roughly linear differences between the functions. We don't really care about that; we want to get rid of it. This tool lets you select the baseline model that you want. In this case, it's just a really simple linear model, but you may have data with exponential or logarithmic trends that you want to remove, and we have those available. Then you can select your correction region.

For the most part, you're going to want to correct the entire function, but it may be that only the beginning or the end of the function is where you want to correct. We end up with these baseline regions, shown as pairs of blue lines: if we click the add button, it gives us a pair of blue lines, and we drag them around to the parts of the function that we believe are real signal. The peaks in these data are something we don't really want to touch; that's the part of the functions we want to keep and analyze, and it's what gives us the information we're interested in. Alternatively, if you select the within-region option, anything that's within these regions is what gets corrected. You're going to do one or the other: either leave those regions alone, or change only what's within them.

Finally, you don't see it here, but you can also add anchor points. Depending on your data, it may be easy to just specify a few points that you know describe the overall trend. When you click add, you'll get a red line, and wherever you drag that line, that point is definitely going to be included in the baseline model before the correction. When you click OK, you end up with a new data set that has the trend removed.
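As a rough sketch of the general technique (one common approach, not necessarily exactly what FDE does internally), you can fit the chosen baseline model using only the x ranges the analyst marks as trend rather than signal, then subtract that fit from the whole function:

```python
import numpy as np

def linear_baseline_correct(x, y, baseline_regions):
    """Fit a straight-line baseline using only points inside the chosen
    regions (assumed to contain trend, not peaks), then subtract that
    baseline from the entire function."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = np.zeros(x.shape, dtype=bool)
    for lo, hi in baseline_regions:
        mask |= (x >= lo) & (x <= hi)
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    return y - (slope * x + intercept)

# Example: treat the flat ends of the function as baseline regions.
# corrected = linear_baseline_correct(x, y, [(0.0, 2.0), (8.0, 10.0)])
```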

Okay, so that brings us to the modeling stage. What we've added for JMP Pro 17 are wavelet models. So what are wavelet models? They are basis function models, but unlike anything we currently have in JMP, the basis functions can have very dramatic features. Those features help us pick up the sharp peaks or the large changes in the function. We also have the simple Haar wavelet, which is just a step function, so if it turns out that something really simple like a step function fits best, we will give you that as well. You can see we have a few different options. If you think about bending these wavelets and stretching them out, that's how we model the data to pick up all these features of interest.
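If you want to see what these basis functions look like for yourself, PyWavelets can approximate and plot them; this is just for intuition and is separate from how FDE fits the model:

```python
import matplotlib.pyplot as plt
import pywt

# Approximate the scaling function (phi) and wavelet function (psi) for a
# Daubechies-4 wavelet; other families (Symlets, Coiflets, Haar) work the same way.
phi, psi, x = pywt.Wavelet("db4").wavefun(level=8)

plt.plot(x, psi, label="wavelet (psi)")
plt.plot(x, phi, label="scaling (phi)")
plt.legend()
plt.show()
```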

To motivate that, I want to show you the current go-to in JMP, which is the B-spline model; it has a very difficult time picking up on these features without any hand-tuning. The P-spline model does a little bit better: it still has some issues picking up the peaks, but among the existing models it might in some ways be the best. Direct functional PCA does almost as well as P-splines, but not quite. Then we have wavelets, which really pick up the peaks the best. In this particular data set, they don't fit the peaks perfectly, but looking at the diagnostics, the wavelet model is definitely the one we would want to go with.

Again, we have these five different wavelet model types, and what we're going to do is fit all of these for you so that you don't have to worry about picking and choosing. Outside of the Haar wavelets, all of the other wavelet types have a parameter. We have a grid that we are going to search over for you in addition to the type.

Now, there may be cases where users say, hey, this particular wavelet type is exactly how my data should be represented, so you can change the default model. But by default, we pick the model that optimizes a model selection criterion, the AICc. Really, what you can think about here is that there could be a lot of parameters in each of these wavelet models, and we're effectively using a Lasso model to remove any parameters that aren't really helping fit the data, so we get a sparse representation no matter the wavelet type. We saw earlier that we have to have the data on a regular grid; it's the same thing with wavelets. If you start fitting the wavelet models and your data are not on a grid, we'll create one for you. But again, you can use that reduce grid option for finer control.
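To give a feel for the sparse-wavelet idea outside JMP, here is a rough PyWavelets sketch, assuming data on a regular grid; it soft-thresholds the detail coefficients (the same kind of shrinkage a lasso penalty produces), with a hand-picked threshold rather than the criterion-based selection FDE performs:

```python
import numpy as np
import pywt

def sparse_wavelet_fit(y, wavelet="sym20", threshold=0.1):
    """Decompose y into wavelet coefficients, soft-threshold the detail
    coefficients so that small ones become exactly zero, and reconstruct
    a smoother, sparsely parameterized version of the function."""
    coeffs = pywt.wavedec(y, wavelet)
    shrunk = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(shrunk, wavelet)[: len(y)]
```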

Okay, so something else we show that can give you some insight into how these models work is the coefficient plot. The X axis is the usual input space of your function, and the Y axis is the resolution. At the top resolutions, you're thinking about overall means; as we get into the high resolutions, these are things happening really close together. A red line means a negative coefficient, blue means positive, and they're scaled so that they're all interpretable against each other. The largest lines show you where the largest coefficients are. Here we can see that the higher-frequency items are really at the end of the function, and we have some overall trends. The point is that these wavelet models are looking at different resolutions of your data.

Something else we've added, before we get to our demo with Clay, is wavelet DOE. In FDE, we have functional DOE, which works with functional principal components. If you don't know what those are, that's okay; all you need to know is that with wavelets, we have coefficients for all of these wavelet functions, and in this DOE analysis we model the coefficients directly. The resolution gives you an idea of whether it's a high-frequency or low-frequency item, and the number in brackets tells you the location. You can see that these items here are in the threes, and that's where some of the largest features were in the coefficient plot we saw. Those have what we're calling the highest energy.

Energy in this case is just this: if we square all the coefficients and add them up, you can think of that as the total energy. So the energy number here is a relative energy, giving you an idea of how much of the energy in the data each coefficient explains. The nice thing about using the coefficient approach is that the coefficients have a direct interpretation in terms of location and resolution. It's an alternative you can try and compare against functional PCA and functional DOE when you want this interpretability of the coefficients. Now I'll hand it over to Clay. He's got a demo for you to see how to use these models in JMP Pro.
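(A quick aside before the demo: a one-line sketch of that relative-energy idea, assuming energy is defined as each squared coefficient divided by the total sum of squared coefficients, which matches the description above.)

```python
import numpy as np

def relative_energy(coefficients):
    """Each coefficient's squared value as a share of the total
    sum of squared coefficients."""
    c = np.asarray(coefficients, dtype=float)
    return c**2 / np.sum(c**2)
```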

Thanks, Ryan. Let's take a look at an example that we found. Ryan mentioned briefly the olive oil data set. It's a sample of 120 different oils; most of them are olive oils, and some of them are blends or vegetable oils. What we wanted to see is, can we use this high-performance liquid chromatography data to classify the oil? Can we look at the spectrum and say this is an olive oil or this is not an olive oil?

These data came out of a study from a university in Spain, and Ryan and I learned a lot about olive oil in the process. For example, olive oil is actually a fruit juice, which I did not know. Let's take a look at our data. Each row in our data set is a different olive oil or other oil, and each row contains that oil's spectrum. We'll use the Functional Data Explorer, and it'll take just a second to fit the wavelet models. You'll see here that we fit our different wavelets. As Ryan mentioned earlier, we try a handful of different wavelets and give you the best one.

In this case, the Symlet 20 was the best wavelet in terms of how well it fits our data. Where we've overlaid the fitted wavelets with the data, we can see that this wavelet model fits really well. If you had a preferred wavelet function that you wanted to use instead, you can always click around in this report and it'll update which wavelet we're using. If we wanted the Symlet 10 instead, all you have to do is click on that row in the table, and we'll switch to the Symlet 10. Let's go back to the 20, and we'll take a look at our coefficients.

In the wavelet report, we have this table of wavelet coefficients. As Ryan was saying earlier, these give us information about where the peaks are in the data. The father wavelet we think of like an intercept, so that's like an overall mean. Then every one of these wavelet coefficients with a resolution lines up with a different part of the function. Resolution one is the lowest-frequency resolution, and it goes all the way up to resolution 12; those are much higher-frequency resolutions.

As you can see, we've zeroed a lot of these out. In fact, this whole block of wavelet coefficients is zeroed out. That just goes to show that we're smoothing: if we used all of these resolutions, we would recreate the function perfectly, but zeroing them out gives us a much smoother function. We fit the wavelet model to our spectra and we think we have a good model. Now let's take these coefficients and use them to predict whether or not an oil is olive oil. I've got that in a different data set.

Now I've imported all of those wavelet coefficients into a new data set and combined them with what type of oil it is, either olive oil or other, and we've got all of these wavelet coefficients that we're going to use to predict that. The way we do that is with the generalized regression platform. We model the type using all of our different wavelet coefficients. Since it's a binary response, we choose the binomial distribution, and we're interested in modeling the probability that an oil is olive oil. Because we don't want to use all of those wavelet coefficients, we use the Lasso to do variable selection.
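As a loose analogue outside JMP (this is not the Generalized Regression platform, just a sketch of the same idea with made-up stand-in data), an L1-penalized logistic regression selects a small subset of wavelet coefficients for a binary response:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Stand-ins for the real demo data: X holds wavelet coefficients exported
# from FDE (one row per oil); y is 1 for olive oil and 0 for other.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 200))
y = rng.integers(0, 2, size=120)

# The L1 (lasso) penalty zeroes out coefficients the model doesn't need.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
print(confusion_matrix(y, model.predict(X)))
```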

Now we've used the Lasso and we've got a model with just 14 parameters. Of all the wavelet coefficients we considered for the model, we only really needed 14 of them; we can see that we've zeroed out a lot of those wavelet coefficients. Let's take a look at the confusion matrix. Using our model, we actually perfectly predicted whether each of these oils is an olive oil or something else. That's pretty good. We took our wavelet coefficients and selected the 13 most important, because one of those 14 parameters is the intercept: we only needed 13 wavelet coefficients to predict which oil we had.

In fact, we can take a look at where those wavelet coefficients fall on our function. What we have here is the average olive oil spectrum in blue and the other oils in red, and each of those dashed lines lines up with a coefficient that we used. Some of these really make a lot of sense. For example, here's one of the wavelet coefficients that is important, and you can see that there's a big difference between the olive oil trace and the other oils there.

Likewise, over here, we can see that there's a big difference between the two. You can look through and see that a lot of these locations really do make sense, so it makes sense that we can use those parts of the curve to discriminate between the different types of oil. We just thought that was a really cool example of using wavelets to predict something else. Not that olive oil isn't fun, but Ryan and I both have young kids, and we're both big fans of Disney World.

We also found a Disney World data set where someone had recorded wait times for one of the popular rides at Disney World: the Seven Dwarfs Mine Train, a roller coaster. Someone had recorded wait times throughout the day for several years' worth of data. I should also mention that we used a subset of the data. One of the problems is that the parks are open for different amounts of time each day, and some of the observations are missing, so we subset it down to a more manageable data set. I would say this example is inspired by real data, but it's not exactly real data once we massaged it a little bit.

If we graph our data, the horizontal axis is the time of day and the vertical axis is the wait time. In the middle of the day, the wait time for this ride tends to be the highest. We can look at different days of the week: Sunday and Monday are a little more busy, Tuesday is a little less busy, and Saturday is the most busy. We can do the same thing looking at the years, 2015, 2016, 2017. It looks like every year the wait times get longer and longer until something happens in 2021; I think we all know why wait times at an amusement park would be lower in 2021. So we've got the idea that you can use this information, like day of the week, year, and month, to predict what that wait time curve will look like. Let's see how we do that in FDE.

I'll just run my script here. We come to the menu and ask to fit our wavelet model. It takes just a second, but really not that long to fit several years' worth of data. This time we're not using the Symlet anymore; we're using a Daubechies wavelet function. Ryan mentioned the wavelet DOE feature earlier. What I didn't show is that we've also loaded the day of the week, the year, and the month into FDE, and we're going to use those variables to predict the wavelet coefficients. Let's go to the red triangle menu and ask for wavelet DOE.

Now, what is happening behind the scenes is that we use day of the week, month, and year to predict those wavelet coefficients, and then we put it all back together so that we can see how the predicted wait time changes as a function of those supplementary variables. Of course, we summarize it in a nice profiler. We can really quickly see the effect of month; if we're just going by the average wait time for this particular ride, September tends to have the lowest wait time. We can also really quickly see the COVID effect: the wait times were here in 2019, and then when we go forward to 2020, they really drop. You can look around to see which days of the week tend to be less busy and which months are less busy. It's really a cool way to look at how these wait times change as a function of different factors. Thank you for watching. That's all we have for today, and we hope you'll give the wavelet features in FDE a try. Thanks.