Data Driven Selection as a First Step for a Fast and Future Proof Process Development (2023-EU-30MP-1260)

Egon Gross, Research & Technology Manager, Symrise AG

Bill Worley, JMP Senior Global Enablement Engineer, JMP
Peter Hersh, Senior Systems Engineer, JMP

 

This talk will focus on how JMP® helped drastically reduce the cultivation experimentation workload and improved response from four up to 30-fold, depending on the target. This was accomplished by screening potential media components, generally the first, and sometimes tedious, step in fermentation optimizations. Taking characteristic properties such as the chemical composition of complex components like yeast extracts enables flexibility in the case of future changes. We describe the procedure for reducing the workload using FT-MIR spectral data based on a DSD setup of 27 media components. After several standard chemometric manipulations, enabled by different Add-ins in JMP® 16, the workload for cultivation experiments could be drastically reduced. In the end, important targets could be improved up to approximately 30-fold as a starting point for subsequent process optimizations. As JMP® 17 was released in the fall of 2022, the elaborate procedure in version 16 will be compared to the integrated features. It might give space for more inspiration – for developers and users alike.

 

 

Hello everyone, nice to meet you. I'm Egon Gross from Symrise, Germany. By profession, I'm a biotechnologist, and I'm looking forward to giving this presentation.

Hello  everyone.  I'm  Bill  Worley.  I  am  a  JMP  systems  engineer  working  out  of  the  chemical  territory  in  the  central  region  of  the  US.

Hi.  My  name  is  Peter  Hersh.  I'm  part  of  the  Global  Technical  Enablement  Team,  working  for  JMP  out  of  Denver,  Colorado.

Welcome to our presentation, Data-Driven Selection as a First Step for a Fast and Future-Proof Process Development. First I want to introduce the company I'm working for. We are located in Holzminden, more or less in the center of Germany, and there are two sites there that come from our history.

Seen globally, our headquarters are in Holzminden. We have big subsidiaries in Teterboro, in São Paulo, and in Singapore, and there are quite a lot of facilities in France, also due to our history.

Coming to the history, Symrise was created in 2003 out of a merger of Haarmann & Reimer, which was founded in 1874, and Dragoco, which is the other site of our facility and was established in 1919. Over the years there have been quite a few acquisitions, including, in 2014, the acquisition of Diana, which is mainly located in France. That's the reason why there are so many different research and development hubs.

Our main products come from agricultural raw materials or from chemical processes, and there are quite diverse production capacities and production possibilities for our main customers, human or pet. Because it is so diverse, we deal with food for human consumption, food for pets, and also health benefits.

On the other side, the segment Scent and Care deals with fragrances, from fine fragrances to household care to laundry, whatever you can imagine that smells nice.

As I said in the beginning, I'm a biotechnologist by training, and I deal a lot with fermentation processes, optimizing them and scaling them up or down. One major issue when it comes to fermentation is the broth, the liquid composition of the medium, which feeds the organisms. No matter which organisms they are, they need carbon sources, nitrogen sources, minor salts, major salts, a suitable pH value, and other things.

It  is  often  important  which  kind  of  media  one  has.  When  it  comes  to  media  composition,  there  are  two  big  branches  which  can  be  seen.  One  is  the  branch  of  synthetic  media,  so  all  components  are  known  in  the  exact  amount  and  composition.  The  other  way  are  complex  media,  for  example,  having  a  yeast  extract  or  a  protein  extract  or  whatever,  where  it's  a  complex  mixture  of  different  substrates,  different  chemical  substances.  The  third  approach  would  be  a  mixture  of  both.

One of the side effects of these complex media is that they are quite easy to work with. On the other hand, their composition can change over time, as some vendors tend to optimize their processes and products for whatever region or target. Some customers learn of those changes, some don't.

Another issue might be availability if it's a natural product like [inaudible 00:04:38] or whatever. You might know some ingredients, but you will surely not know all of them, and there might be promoting or inhibiting substances within those mixtures.

At the beginning of a process development, the medium is of central importance. Therefore I looked at carbon sources, nitrogen sources, salts, trace elements, and so on as my different raw materials. While growing the organisms, one has to take care of different temperatures, stirring velocities to get oxygen into the liquid, and cultivation times, and there are a lot of unknown variables when trying to get an idea of what the effect might be on the cell dry weight, for example, or on the different target compounds one has in mind.

For this setup, I used a definitive screening design. As most of you know, these designs are quite balanced and have a particular shape, which is reflected in this three-dimensional plot. You can see that the definitive screening design is somehow walking around certain edges and has a center point. Due to the construction of the definitive screening design, one can estimate interactions and quadratic effects. These interactions are not confounded with the main effects, and the main effects themselves are not confounded with each other. This is a very nice feature of definitive screening designs, and therefore they are very efficient in terms of workload compared to earlier screening designs.
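The structure described above can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a small 6-factor design built from a Paley conference matrix; the helper name `paley_conference_matrix` and the choice of q = 5 are illustrative assumptions and do not reproduce how JMP constructs its designs or the design used in the talk.

```python
import numpy as np
from itertools import combinations

def paley_conference_matrix(q=5):
    """Symmetric conference matrix of order q + 1 (q prime with q % 4 == 1)."""
    residues = {(x * x) % q for x in range(1, q)}

    def chi(a):
        # Quadratic character mod q: 0, +1 (residue) or -1 (non-residue).
        a %= q
        return 0 if a == 0 else (1 if a in residues else -1)

    C = np.zeros((q + 1, q + 1), dtype=int)
    C[0, 1:] = 1
    C[1:, 0] = 1
    for i in range(q):
        for j in range(q):
            C[i + 1, j + 1] = chi(i - j)
    return C

C = paley_conference_matrix(5)

# Definitive screening design: conference matrix, its fold-over, one center run.
design = np.vstack([C, -C, np.zeros((1, C.shape[1]), dtype=int)])

# Gram matrix of the main-effect columns (diagonal means mutually orthogonal).
main_gram = design.T @ design

# Second-order columns: pure quadratics and all two-factor interactions.
m = design.shape[1]
second_order = [design[:, j] ** 2 for j in range(m)]
second_order += [design[:, j] * design[:, k] for j, k in combinations(range(m), 2)]
second_order = np.column_stack(second_order)

# Cross-products between main effects and second-order terms.
cross = design.T @ second_order

print("runs x factors:", design.shape)
print("nonzero main/second-order cross-products:", np.count_nonzero(cross))
```

The fold-over (stacking `C` and `-C`) is what makes every odd-order cross-product cancel, so main effects stay clear of interactions and quadratic terms, while the second-order terms can still be partially correlated with each other.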

There are also some disadvantages. One big disadvantage is that if about 50% of the factors or even more have a strong influence, you get significant confounding, which you have to take care of. In this particular case, although it was the leanest possible design I found, the practical workload would have required five to six months just for screening. This is far too long for a master's thesis.

The alternative was to build another design or another process. I was inspired in Copenhagen in 2019 by a contribution from Bill, where he talked about infrared spectroscopy, and I thought that might be a good idea: using the chemical information hidden in a near-infrared spectrum to describe the chemical composition of the mixtures.

Therefore I established this workflow. First, the media preparation was done for all 65 mixtures. Then the near-infrared spectrum was measured and some chemometric treatments were performed. Afterwards, the space of variation could be kept at its maximum while the number of experiments was reduced quite significantly.

To show you how the workflow goes, I started, as I said, with spectral information. One of the first things one has to do is standardize the spectra to avoid baseline shifts and things like that. One way to do it is to introduce a new formula to standardize. What I did instead was use an add-in to preprocess and calculate the standard normal variate, which gives, digit for digit, the same values as the standardization, as we see here.
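As an illustration of the standardization step, here is a minimal sketch of the standard normal variate in Python with NumPy. The toy spectra, the variable names, and the use of the sample standard deviation (`ddof=1`) are assumptions for illustration, not the add-in's implementation.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Hypothetical toy data: one band shape measured with different gains and offsets.
axis = np.linspace(0.0, 1.0, 50)
base = np.exp(-((axis - 0.4) ** 2) / 0.01)
raw = np.vstack([1.0 * base + 0.2,
                 1.5 * base + 0.5,
                 0.7 * base - 0.1])

corrected = snv(raw)
print(np.ptp(corrected, axis=0).max())  # spread between the corrected spectra
```

Because each row is centered and scaled by its own statistics, spectra that differ only by an offset and a positive gain collapse onto the same curve, which is why the step removes baseline shifts.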

With these standardized spectra, standardized within each measurement, I continued and first compiled all the spectra. What you see here at the top is the absorption of every sample. We have an aqueous system, so we took water as a reference. After taking the difference between the absorption and the water, we could look deeper and see differences within the spectra.

One of the big questions was: do I first calculate the difference between the absorption of the sample and the water and then calculate the standard normal variate? Or do I first calculate the standardization and then subtract the standardized water background from the standardized values?

One could think the procedures are equivalent, but the outcome is different. As you see here on the right-hand side of the dashboard, I zoomed into this area, and in the lower part the curves have a different shape and a different distance from each other than in the upper part. This might then influence the subsequent regression analysis. Therefore, I chose to standardize first and then calculate the difference.
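That the two orders of operation give different results can be seen in a small numerical sketch. The water reference, the band shape, and the gain and offset below are hypothetical values chosen only to show the effect, not the measured spectra.

```python
import numpy as np

def snv(x):
    """Standard normal variate of a single spectrum."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

grid = np.linspace(0.0, 1.0, 100)
water = 1.0 + 0.5 * grid                       # hypothetical water reference
band = np.exp(-((grid - 0.6) ** 2) / 0.005)    # hypothetical analyte band
sample = 1.3 * water + 0.2 * band + 0.1        # sample with gain and offset

snv_then_subtract = snv(sample) - snv(water)   # standardize first, then difference
subtract_then_snv = snv(sample - water)        # difference first, then standardize

max_gap = np.max(np.abs(snv_then_subtract - subtract_then_snv))
print(max_gap)
```

Standardizing is not a linear operation, because the divisor depends on the spectrum itself, so it does not commute with subtracting the reference. The second variant always has unit standard deviation, while the first one generally does not.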

After these first steps came the chemometric part, that is, smoothing, filtering, and calculating the derivatives. This is a standard procedure using an add-in which is available. You can imagine that the signal might have some noise. This is seen here in the lower part: the red area is the real data, and the blue curves are the smoothed data. On the upper left, you see the first derivative; on the upper right, the second derivative of these functions. When it comes to polynomial fits, the result depends on the procedure: what you are fitting, what the polynomial order is, and how broad the window is in which you make the calculations.
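A polynomial smoothing filter of this kind (commonly known as Savitzky-Golay) can be sketched with NumPy alone. This is a minimal illustration under the assumption of evenly spaced points with unit spacing; the function names and the test signal are made up here, and edge points are simply dropped rather than padded.

```python
import numpy as np
from math import factorial

def savgol_kernel(half_width, poly_order, deriv=0):
    """Filter weights from a local least-squares polynomial fit (unit spacing)."""
    offsets = np.arange(-half_width, half_width + 1)
    A = np.vander(offsets, poly_order + 1, increasing=True).astype(float)
    # Row `deriv` of the pseudo-inverse estimates the polynomial coefficient a_deriv;
    # the derivative at the window center is deriv! * a_deriv.
    return factorial(deriv) * np.linalg.pinv(A)[deriv]

def savgol(y, half_width, poly_order, deriv=0):
    """Smoothed signal or derivative; output is shorter by 2 * half_width points."""
    kernel = savgol_kernel(half_width, poly_order, deriv)
    return np.correlate(np.asarray(y, dtype=float), kernel, mode="valid")

x = np.arange(60.0)
clean = 2.0 + 3.0 * x + 0.5 * x ** 2                      # exact quadratic
noisy = clean + np.random.default_rng(0).normal(scale=5.0, size=x.size)

smoothed = savgol(noisy, half_width=10, poly_order=2)
first_deriv = savgol(noisy, half_width=10, poly_order=2, deriv=1)
second_deriv = savgol(noisy, half_width=10, poly_order=2, deriv=2)
```

Here `half_width` plays the role of the "points to the left and to the right" and `poly_order` the polynomial order mentioned in the talk. A wider window or a lower order smooths more strongly, which is why the settings have to be compared.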

If we take only a second-order polynomial here, you see that it might change. Now, this should not be a 2, it ought to be a 20. Then the curve smooths out. Although it's smooth, you can see differences in height and in shape. To get hold of those data, one has to save the smoothed data to separate data tables. I then tried different settings for the smoothing process, because I did not know from the beginning which one would best fit my desired outcome at the end of the experiment.

After quite a lot of smoothing tables, which were made manually, I concatenated the tables. These are all the tables we just made. I go to the first one and say, please concatenate all of the others. The nice thing is that at the end you then have these different distances coming from the smoothing effect. I had a second-order polynomial, a third-order polynomial, with 20 points to the left and to the right for the smoothing process, with 30, and so on.

This is just a small example to show you the procedure. I did quite a bit more. This is the number of treatments I did. I had [inaudible 00:15:01] for a second-, third-, or fifth-order polynomial with 10, 20, or 30 points. Now came the big part: deciding which particular procedure represents my space best. For this, I made a principal component analysis of all the treatments I did.

This is a general overview. The numbers represent the individual experiments, so that you can follow them in the different score and loading plots. The loading plot is… That's a regular picture of a loading plot when it comes to spectral data. If you take into account that we are coming from a design, this value of 24% explained variation for the first component is very high.

Why? Because the factors of a definitive screening design are orthogonal to each other and independent of each other, one would expect lower values for the principal components. After this treatment, the first derivative with a second-order polynomial and 10 points to the left and to the right for the smoothing, it looks very evenly distributed. You might think of a cluster here on top or below.

I went through all of these treatments and then selected a favored one, where I saw that the first principal component has very low explanatory power for the variation. That's then the way I proceeded.
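The comparison of treatments by the share of variation in the first principal component can be sketched as follows. The two matrices are random stand-ins for two pretreatments of the same 25 spectra, and the names `explained_variance_ratio`, `treatments`, and `preferred` are illustrative assumptions; the talk's own selection was done visually in JMP.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component (via SVD)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()

rng = np.random.default_rng(0)
n_samples, n_points = 25, 200

# Hypothetical pretreatments: "A" keeps a strong shared baseline, "B" does not.
shared = rng.normal(size=(n_samples, 1)) @ rng.normal(size=(1, n_points))
treatments = {
    "A": 5.0 * shared + rng.normal(size=(n_samples, n_points)),
    "B": rng.normal(size=(n_samples, n_points)),
}

pc1_share = {name: explained_variance_ratio(X)[0] for name, X in treatments.items()}
preferred = min(pc1_share, key=pc1_share.get)
print(pc1_share, preferred)
```

Ranking by the first component alone is a simplification. In practice one would also look at the following components and at the score plots, as described above.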

After selecting the particular preprocessed data, I still have my 65 samples. But as we heard at the beginning, 65 is far too many for the workload. If you ask yourself why there are 132 samples: that is because I copy-pasted the original design below, for the spectral treatment I then used.

If you then want to select the runs you are able to do, for reasons of time or cost or whatever, this is one process you can use: the covariates and the principal components. This is then the design space which is available, covering all the variation inside. But as you see, we would need to make 132 experiments. If we then just select all the principal components and say, please make only the ones that are possible, you can type in whatever number of runs you want.

At this stage, I tried several smaller or bigger designs to see how far I could go down and still reach good descriptive power. I made these 25 experiments and let JMP select them. The nice thing with this procedure is that when you come back to your data table, they are selected. But I didn't do it this way right at the beginning. At the beginning, I made a manual selection.

How did I do that? I took the score plot of the particular treatment and manually selected the outer points as well as possible, not only in the plot of the first and second principal components, but I went deeper. This, for example, is the comparison of the selection method I just showed you, the DOE with the covariate factors, with the manual selection, just to show you some possible differences.

If you make this DOE selection several times, don't be confused if you do not always get the same numbers, the same experiments, which might be important. With this approach, I then reduced the workload from 64 experiments to 25 experiments. All the raw materials I had from the beginning were represented in these experiments. I didn't leave any raw material out, and it was very nice to see that I could retain the space of variation.
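The idea of picking a subset of runs that still spans the score space can be illustrated with a simple distance-based rule. Note the swap: the sketch below uses the Kennard-Stone max-min selection, which is a different technique from the optimal-design search JMP runs, and the score matrix is random placeholder data. Unlike the JMP selection, this rule is deterministic and returns the same runs every time.

```python
import numpy as np

def kennard_stone(scores, n_select):
    """Pick n_select rows that spread out over the space (max-min distance rule)."""
    scores = np.asarray(scores, dtype=float)
    dist = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=2)
    # Start with the two points that are farthest apart.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    while len(selected) < n_select:
        remaining = [i for i in range(len(scores)) if i not in selected]
        # For each candidate, distance to its nearest already-selected point.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(nearest))])
    return selected

rng = np.random.default_rng(1)
scores = rng.normal(size=(65, 5))      # hypothetical PCA scores: 65 mixtures, 5 components
chosen = kennard_stone(scores, 25)
print(sorted(chosen))
```

Like the manual selection of outer points in the score plots, this rule starts at the extremes and then fills in the interior, and it uses all supplied components rather than only the first two.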

After the cultivation in two blocks, which took a time frame of three weeks for each block, we analyzed our metabolome and the supernatant and determined our cell dry mass. For time's sake, I show you only the results and the procedure for the cell dry mass. For other molecules, the same procedure could then be applied.

The next issue I had was confounding. I had to expect confounding because I had only 25 experiments for 27 media components, coming out of a design where I expected to have interactions and quadratic effects. Such interactions are nothing new when it comes to media composition. Quadratic effects were nice to see.

Then came the next nice thing, which was introduced by Pete Hersh and Phil K. It's the SVEM process, the Self-Validated Ensemble Model. In this sketch, you see the workflow, and we will go through it in JMP. The first thing was to look at the distribution of my target value. After making a log transformation, I saw that it's normally distributed. So we have a log-normal distribution. That's nice to know.

The next thing was to download this add-in, Autovalidation Setup, and hit the run button. We then get a new table. The new table has 50 rows instead of the 25 rows of our original table. Why is that so? The reason is that when you hit the button, the data table gets copy-pasted below itself and the rows are split into a validation set and a training set, as you see here. The nice feature of this autovalidation table is that you can find out, through simulation, which factors have an influence.

This happens through the paired fractionally weighted bootstrap weights. If you look, for example, the second experiment has a value of 1.8 in the training set and the same sample has a value of 0.17 in the validation set. This gives some samples a bigger weight in the training set and, vice versa, a smaller one in the validation set. When they have a bigger weight in the training set, they have a lower weight in the validation set.
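One common way to generate such anti-correlated weight pairs is to draw a uniform random number u per row and use -ln(u) for training and -ln(1 - u) for validation. The sketch below assumes that construction; whether the add-in uses exactly this formula is an assumption, and the function name is hypothetical.

```python
import numpy as np

def fractional_weight_pairs(n_rows, rng):
    """Paired training/validation weights from one uniform draw per row (assumed scheme)."""
    u = rng.uniform(low=1e-12, high=1.0, size=n_rows)
    train = -np.log(u)        # large when u is close to 0
    valid = -np.log1p(-u)     # -ln(1 - u): small when u is close to 0
    return train, valid

rng = np.random.default_rng(0)
train, valid = fractional_weight_pairs(25, rng)

# Every row stays in both sets, just with different emphasis.
for t, v in list(zip(train, valid))[:5]:
    print(f"training weight {t:5.2f}   validation weight {v:5.2f}")
```

Under this scheme a training weight of about 1.8 goes with a validation weight of roughly 0.18, which is in line with the pair quoted above. Redrawing u for every simulation run is what produces a different fitted model each time.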

To analyze this, it's necessary to have the Pro version to run a generalized regression. As we took the log of our cell dry weight, I can choose a normal distribution, and then it's recommended to use a lasso regression. From the first lasso regression, we get a table of the estimates, and now comes the nice part. We run simulations, changing the paired fractionally weighted bootstrap weights each time.

For time's sake, I'm just making 50 simulations. From these 50 simulations, we then get, for each factor in the design, how often it entered the regression equation or didn't. This pattern comes from the randomization of the bootstrap weights. From this distribution we go to the summary statistics and customize them, since we are only interested in the proportion nonzero. This proportion nonzero is the share of the 50 simulations in which this particular variable went into the regression equation.

From this, we make a combined data table and look at the percentage of simulations in which each variable was in or not in the model. This looks a little bit confusing. If we order it by column 2, descending, we then see a nice pattern.

Now you can imagine why I introduced at the beginning this null factor and these random uniform factors. The uniform factors were introduced manually. The null factor was introduced by running the Autovalidation Setup. What do these points mean? They mean that the variables down to the null factor have high potential because they were quite often within the model-building processes. These at the bottom were quite seldom within the model-building processes, so you can reduce your complexity by just discarding them. Here in the middle, one has to decide what to do.
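The whole loop (redraw weights, fit a lasso on the training weights, pick the penalty by the validation weights, record which coefficients are nonzero) can be sketched in NumPy. Everything here is an illustrative assumption: the coordinate-descent lasso stands in for JMP's Generalized Regression, the weight formula is the -ln(u) / -ln(1 - u) scheme, and the data are synthetic with one deliberately meaningless column acting as the null factor.

```python
import numpy as np

def soft_threshold(value, lam):
    return np.sign(value) * max(abs(value) - lam, 0.0)

def weighted_lasso(X, y, w, lam, beta=None, sweeps=50):
    """Coordinate-descent lasso with observation weights; returns (intercept, beta)."""
    w = w / w.sum()
    x_mean, y_mean = w @ X, w @ y
    Xc, yc = X - x_mean, y - y_mean
    beta = np.zeros(X.shape[1]) if beta is None else beta.copy()
    col_ss = w @ Xc ** 2                       # weighted sum of squares per column
    resid = yc - Xc @ beta
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            resid += Xc[:, j] * beta[j]        # remove column j from the fit
            rho = w @ (Xc[:, j] * resid)
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
            resid -= Xc[:, j] * beta[j]        # put the updated column back
    return y_mean - x_mean @ beta, beta

def svem_proportion_nonzero(X, y, n_boot=30, n_lambda=12, seed=0):
    """Share of bootstrap fits in which each coefficient is nonzero."""
    rng = np.random.default_rng(seed)
    hits = np.zeros(X.shape[1])
    for _ in range(n_boot):
        u = rng.uniform(1e-12, 1.0, size=len(y))
        w_train, w_valid = -np.log(u), -np.log1p(-u)
        wn = w_train / w_train.sum()
        lam_max = np.max(np.abs(wn @ ((X - wn @ X) * (y - wn @ y)[:, None])))
        best_err, best_beta, beta = np.inf, None, None
        for lam in lam_max * np.logspace(0, -3, n_lambda):
            intercept, beta = weighted_lasso(X, y, w_train, lam, beta)  # warm start
            err = w_valid @ (y - intercept - X @ beta) ** 2
            if err < best_err:
                best_err, best_beta = err, beta.copy()
        hits += best_beta != 0
    return hits / n_boot

rng = np.random.default_rng(42)
X = rng.normal(size=(25, 6))     # columns 0-4: hypothetical factors, column 5: null factor
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=25)

proportion = svem_proportion_nonzero(X, y)
for name, p in zip(["x0", "x1", "x2", "x3", "x4", "null"], proportion):
    print(f"{name:>4}: {p:.2f}")
```

Sorting the factors by this proportion and comparing them with the null column gives the same kind of ranking as the combined table described above: factors that enter clearly more often than the null factor are candidates to keep.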

After this extraction, without losing information or variation, one can then think of different regression approaches: a response surface model, stepwise regression, or whatever regression you have in mind. It's wise to compare different regression models, looking at what's feasible and what's meaningful. That was the procedure I used in JMP 16. Now I hand over to Pete and Bill, who will describe something else.

Thank you, Egon. That was a great overview of your workflow. What's new in JMP 17 that might have helped Egon a little bit with the tools he was working with? I'm going to start off with a little bit of a slide show here. I'm going to be talking about Functional Data Explorer, which is in JMP Pro, and about the preprocessing and Wavelet modeling that are now built into Functional Data Explorer.

All  right,  so  let  me  slide  this  up  a  little  bit  so  you  can  see.   What's  new  in   JMP 17?  We've  added  some  tools  that  allow  for  a  better  chemometric  analysis  of  spectral  data.  Really  any  multivariate  data  that  you  might  have  that  you  can  think  of,  these  tools  are  there  to  help.  First  is  adding  the  preprocessing  methods  that  are  built  into  FDE  now.

We've got standard normal variate, which Egon showed you. We've got multiplicative scatter correction, which is a little bit more powerful than the standard normal variate. Both of these will not disrupt the character of your spectra. That's not the story with Savitzky-Golay. It does alter the spectra, which makes it a little bit harder to interpret the data. The key thing is it still helps. Then we have something called polynomial baseline correction, which is another added tool if you need that.
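For comparison with the standard normal variate, multiplicative scatter correction can be sketched as a straight-line regression of each spectrum on a reference, by default the mean spectrum. This is a minimal NumPy illustration with made-up data and a hypothetical function name, not the Functional Data Explorer implementation.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row ~ intercept + slope * ref, then undo offset and gain.
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected

# Hypothetical toy data: one band shape with different gains and offsets.
axis = np.linspace(0.0, 1.0, 80)
base = np.exp(-((axis - 0.5) ** 2) / 0.02)
raw = np.vstack([1.0 * base + 0.3,
                 1.4 * base + 0.1,
                 0.8 * base - 0.2])

corrected = msc(raw)
print(np.ptp(corrected, axis=0).max())  # spread between the corrected spectra
```

The difference from the standard normal variate is that the correction uses information shared across all spectra (the reference), whereas the standard normal variate treats each spectrum in isolation.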

The next step would then be to save that preprocessed data for further analysis, like principal component analysis, partial least squares, and so on.

The  Wavelet  modeling  is  a  way  to  look  at  the  chemometric  data  similar  to  principal  component  analysis.  We're  fitting  a  model  to  the  data  to  determine  which  is  the  best  overall  fit  for,  in  this  case,  25  spectra.  That's  the  goal  here.  It's  an  alternative  to  spline  models.  It's  typically  better  than  spline  models,  but  not  always.  You  get  to  model  the  whole  spectra,  not  the  point- by- point,  which  you  would  do  with  other  analysis  types.

Then  you  get  to  discern  these  things  called  shape  functions  that  make  up  the  curve.  These  shape  functions  are,  again,  similar  to  principal  component  analysis  in  that  they  are  helping  with  dimension  reduction.   Then,  as  I  said  before,  these  are  excellent  tools  for  spectral  and  chromatographic  data,  but  virtually  any  multivariate  data  is  fair  game.

These  are  just  an  example  of  the  Wavelet  functions  that  are  built  in.  I  could  try  and  pronounce  some  of  these  names,  but  I'll  mess  them  up,  but  know  that  these  are  in  there.  There  is  a  site  here  that  you  can  look  up  what  these  Wavelets  are  all  about.  I  got  the  slide  from  Ryan  Parker  so  thank  you,  Ryan.

Almost  last  but  not  least,  what  we're  doing  with  this  functional  principal  component  analysis  is  we're  trying  to  determine,  again,  what's  the  best  overall  fit  for  these  data  and  then  compare  the  curves  as  needed.  What  comes  out  of  the  Wavelet  modeling  is  a  Wavelet  DOE,  and  we  determine  which  wavelengths  have  the  highest  energy  for  any  given  spectra  or  whatever  we're  looking  at.

These  Wavelet  coefficients  can  then  be  used  to  build  a  classification  or  quantification  model.  That's  up  to  you.  It  depends  on  the  data  and  what  supplemental  variables  you  have  built  in.  In  this  case,  this  is  a  different  example  where  I  was  looking  at  percent  active  based  on  some  near  IR  spectra.

Let's get into a quick example. All right. This is Egon's data. I've taken the data that was in the original table, this absorption minus the water spectra, and I've transposed that into a new data table where I've run Functional Data Explorer. I'm just going to open up the analysis here. It does take a little bit to run, but this is the example that I wanted to show.

We've done the preprocessing beforehand. We've applied the multiplicative scatter correction in this case and then the standard normal variate, and then built the model off of that. After these preprocessing tools, which are found over here, I'm going to save that data out, and then that data is going to be used for further analysis as needed.

To build on the story here, we've got the analysis done. We built the Wavelet model. After we've gotten the mean function and the standard deviation for all of our models, we build that Wavelet model and we get the fit that you see here. What this is telling us is that the Haar Wavelet is the best overall based on the lowest Bayesian Information Criterion score. Now we can come down here and look at the overall Wavelet functions, the shape functions, and get an idea of which Wavelets have the highest energy and which shape functions are explaining the most variation that you're seeing between curves. You can also reduce or enlarge the model with the number of components that you select here.
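To give a feel for what a Haar wavelet decomposition and "energy" per coefficient mean, here is a minimal NumPy sketch. It assumes a signal whose length is a power of two and uses an orthonormal Haar transform; it illustrates the concept only and is not how Functional Data Explorer fits or selects its wavelet models.

```python
import numpy as np

def haar_dwt(signal):
    """Full orthonormal Haar decomposition: [approximation, coarse ... fine details]."""
    x = np.asarray(signal, dtype=float)
    if len(x) == 0 or len(x) & (len(x) - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2.0))   # local differences
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)               # local averages
    return np.concatenate([x] + details[::-1])

# Hypothetical step-shaped "spectrum": low on the left, high on the right.
step = np.array([1.0] * 4 + [5.0] * 4)
coeffs = haar_dwt(step)

# Energy share of each coefficient; the transform preserves total energy.
energy_share = coeffs ** 2 / np.sum(coeffs ** 2)
print(np.round(coeffs, 3))
print(np.round(energy_share, 3))
```

For this step signal all of the energy sits in the overall level and the single coarsest difference, and every finer coefficient is zero. Concentrating a curve into a few high-energy coefficients is what makes wavelet coefficients usable as compact model inputs.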

One thing that comes out of this is a Score Plot, which allows you to see groupings of different, in this case, spectra. One that you're going to see down here is this. This could be a potential outlier. It's different from the rest. If you hover over the data point, you can actually see that spectrum. You can pin it to the graph, pull that out, and then let's just pick another blue one here and see if we can see where the differences are.

It looks like it might be at the beginning. If we look at this right here, that's a big difference, so maybe that just didn't get subtracted out or preprocessed the same way in the two spectra. I don't have an example of the Wavelet DOE for this setup, but just know that it's there. This has been a fairly quick overview, but if you're interested in this, please contact us, and we will find a way to help you better understand what's going on with Wavelet DOE and preprocessing built into JMP Pro. Pete, I will turn it over to you.

All right. Well, thanks, Bill and Egon. Just like Bill, I am going to go over how Self-Validating Ensemble Models changed in JMP 17. Bill showed how you could do in 17 what Egon did in 16, much more easily, using Functional Data Explorer. For me, I'm going to take that last bit that Egon showed with the add-in: creating that SVEM setup, using those fractionally weighted bootstrap columns, and then also making that validation column and the null factor. I'm going to show how that's done now in JMP 17. Just like the spectral data preprocessing with FDE, this is much easier to do in JMP 17.

If you remember, Egon had gone through and looked at all those spectra. He extracted the meaningful area, looking at smoothers and the standard normal variate, and did a bunch of different preprocessing steps. Then he took those preprocessing steps and selected a subset of those runs to actually run, and he came up with 25. Here are those 25 runs. From this step, what he did is that Self-Validating Ensemble Model, or SVEM.

In  16,  this  took  a  little  bit  of  doing.  You  had  to  make  that  model,  then  you  had  to  simulate,  then  you  had  to  take  those  simulations,  and  run  a  distribution  on  each  one  of  them,  and  then  get  the  summary  statistics,  and  then  extract  that  out  to  a  combined  data  table,  and  then  graph  that  or  tabulate  that  and  see  which  ones  happen  the  most  often.

That was a lot of steps and a lot of clicks to do, and Egon has clearly done this a bunch of times because he did it pretty quickly and smoothly, but it took a little bit of doing to learn. Clay Barker made this much easier in JMP 17. Same 25 samples here, and instead of running that Autovalidation Setup add-in that Egon showed, we're going to just go ahead and go to Analyze and Fit Model.

We'll set up our model. If you remember, we're taking this log of the dry weight here. We're going to add a couple of random variables along with all of our factors into the model, and then we're going to make sure that we've selected generalized regression. This is the setup for our model. We're going to go ahead and run it, and in JMP 17, we have two new estimation methods.

These are both Self-Validating Ensemble Model methods. The first one is a forward selection. I'm going to go ahead and use SVEM Lasso because that's what Egon used in his portion, and here you just put in how many samples you want. He had selected 50. I'm going to just go with the default of 200. Hit go, and you can see now it's running all of that behind the scenes, where you would have simulated and recalculated those fractional weights, and then at the end here, we just have this nice table that shows us what is entering our model most often up here.

Then we hit something like a random variable. Just out of randomness, something like that is entering the model maybe about half the time. For things that are entering more regularly than a random variable, we have pretty high confidence that those are probably variables we want to look at. Then we would go from here and launch the Profiler. I've already done that over here, so we don't have to wait for it to launch or to assess variable importance.

But here, this shows us which of these factors are active. We can see the most active factors, and it's not making a super accurate model, because, again, if you remember, we are taking 25 runs to try to estimate 27 different factors. But if you take a look here at the most prevalent ones, this can at least give you an idea of the impact of each one of these factors. All right, so that basically sums up what Egon had done. It just makes this much easier in JMP 17, and we are continuing to improve these things and hope that this workflow gets easier with each release of JMP. Thank you for your attention, and hopefully you found this talk informative.
