The request for quotation (RFQ) process can be optimized using machine learning to predict the price of pre-engineered metal buildings (PEMBs) from historical data. The workflow involves collecting PEMB data, preprocessing it by removing missing values and handling outliers, selecting relevant features through feature engineering, and splitting the data into training and testing sets. Common algorithms such as linear regression, decision trees, random forests, or neural networks can be used to build the model, and performance metrics such as mean squared error (MSE) and root mean squared error (RMSE) are used to assess it.

Python and JMP are two tools that can be used for this predictive modeling. Python is a popular programming language with data analysis and machine learning libraries, while JMP is a commercial software package for statistical analysis and data visualization. Choosing between the two depends on project requirements, user skills, and available resources.

By building a predictive model for PEMBs, organizations can more accurately estimate the weight of future projects, leading to better budgeting, planning, and resource allocation.

Hello. My name is Najmeh Dehghani. I'm a Data Engineer at JLL, and I have a Master's degree in Analytics from Texas A&M. My capstone project is on RFQ process optimization for pre-engineered metal buildings. I got to know JMP and SAS in school at Texas A&M and found JMP a wonderful tool for performing analysis, but my capstone project was done in both JMP and Python. The objective of the project is to predict the weight of a pre-engineered metal building from various geometric and geographic requirements.

Let's start with an introduction to metal buildings. Pre-engineered metal buildings are structures that use steel frames to support the roof and wall panels. They can be customized to the customer's requirements and can be commercial or residential. Because they are pre-engineered, the dimensions are precise, and they are flexible: the design can be changed to meet customer needs.

What we have to do in this project is estimate the weight in order to predict the price. The steel structure is a critical factor in metal building pricing, so if we can predict the weight, we can eventually predict the price and improve the RFQ process.

Let's take a look at the RFQ process we want to optimize and what the current system does. The business challenge in this project is to improve the RFQ process for metal buildings: improve response time, reduce server cost, and mitigate missed opportunities for the company. To send an RFQ back to a client, the company goes through several steps. They review the incoming inquiries and specify the purpose, size, layout, and design requirements of the building.
They check all of the technical specifications, load capacity, roofing, insulation, and other factors to prepare an accurate quote that covers all of the materials, labor, and cost; calculate the price; and send the RFQ back to the customer.

Their current system for processing RFQs looks like this. They have three servers, and each server processes about 480 RFQs per day. Each RFQ takes about 180 seconds to process, and the cost per RFQ is approximately 75 cents. Dividing these numbers gives roughly $0.004 per second to process an RFQ. At 480 RFQs per day per server, that is about $360 per day per server, or roughly $1,000 per day across all three servers. Over 365 days a year, that comes to about $400,000 in server and maintenance cost to process all of the RFQs. We want to see whether we can improve this process by predicting the weight from the historical data.

The methods I used on this project follow most machine learning prediction projects: data collection and preprocessing, feature engineering and selection, and model selection.

We started by defining the business problem: predicting the weight of the metal building. For data collection and preprocessing, about 182 SQL tables were provided to me as a subset of historical data from the manufacturer, containing around 500,000 records and more than 500 features, US data only. I performed the ETL process, cleaned the data set and found abnormal data in SAS, Python, and JMP, and integrated all of the data into one table with the help of these three tools.

My next step was feature selection and extraction. For preliminary feature selection, I started with an industry study and expert judgment to eliminate features that were clearly unrelated to the target. I then ran correlation analysis to examine correlations and find multicollinearity among my features and eliminate more of them, followed by descriptive analysis and visualization to check the relationships in the data, which I did in JMP, Python, and Tableau.

The next step was to explore models and algorithms based on the characteristics of my data. Since my target is a continuous (interval) variable, I looked into linear regression, stepwise selection, random forest, decision trees, and a few other algorithms. I trained and evaluated different models in both Python and JMP, compared their performance metrics, and did cross-validation and k-fold validation for all of my models. In the middle here you can see the models I trained my data on; some of them are in Python, some of them are in JMP.
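As a concrete illustration of the Python side of that workflow, here is a minimal sketch: train/test split, a decision tree and a random forest, and MSE/RMSE/R² on the hold-out set. The file name and column names are placeholders, not the project's actual data, and the features are assumed to already be numeric.

```python
# Minimal sketch of the Python modeling workflow described above.
# "pemb_history.csv" and the column names are placeholders for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("pemb_history.csv")      # integrated historical table
df = df.dropna()                          # drop rows with missing values

X = df.drop(columns=["steel_weight_lb"])  # geometric/geographic features (numeric)
y = df["steel_weight_lb"]                 # target: building weight

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print(f"{name}: MSE={mse:.1f}, RMSE={mse ** 0.5:.1f}, "
          f"R^2={r2_score(y_test, pred):.3f}")
```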
To see which model is best, I compared the models based on ASE (average squared error) and also checked the R-squared to see how close the predictions would be to the actual data. This is the residual plot for all of the models. Out of all of the models I ran, two are the best: a decision tree in Python and a boosted tree in JMP. Looking at both of them, most of the features they selected are the same, and even the importances are close to each other. Looking at the ASE, the decision tree is at about 4,000, while the boosted tree is at about 2,000, which is much better. The R-squared for the boosted tree is also a little higher. The residuals from both tools look good: predictions are close to the actual values, and the residuals are centered around zero. Both models predict well, which means we can predict the weight of the building from the historical data we have.

Let's see how this impacts the business. We have a good model that performs well; if the business implements it, what are the benefits? The first chart is the one I already presented as their current system: three servers, each processing 480 RFQs per day, at a cost of $360 per day per server. Suppose we process the same 480 RFQs, but with the model implemented we can lower the processing time to one second, because the model returns a price almost instantly. With the cost per second the same, but a processing time of one second instead of 180 seconds, the total cost drops to about $2 per day per server, or $6 per day for three servers. Over 365 days, that is about $2,000 per year. Compared with what they currently pay for servers and processing time, that is almost a 99% cost saving for the company.

Surprisingly, that is not the only saving; they can also capture more opportunities. Right now one server processes 480 RFQs per day, but if processing time drops to one second, it could process about 80,000. With a win rate of about 4%, they currently win about 19 quotes per day; by processing more RFQs with the model, they can win more and bring in more opportunities. In effect, they are currently missing about 3,000 opportunities per day.
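The cost comparison above is simple arithmetic; here it is worked through with the figures quoted in the talk, rounded the same way.

```python
# Worked version of the cost comparison quoted above (all inputs from the talk).
rfqs_per_day = 480          # RFQs per server per day
cost_per_rfq = 0.75         # dollars
seconds_current = 180       # current processing time per RFQ
seconds_model = 1           # processing time with the predictive model
servers = 3

cost_per_second = cost_per_rfq / seconds_current               # ~$0.004/s
current_daily = rfqs_per_day * cost_per_rfq                    # $360 per server
model_daily = rfqs_per_day * cost_per_second * seconds_model   # ~$2 per server

print(f"current: ${current_daily * servers * 365:,.0f}/year")    # ~$394,000
print(f"with model: ${model_daily * servers * 365:,.0f}/year")   # ~$2,200
print(f"saving: {1 - model_daily / current_daily:.1%}")          # ~99%

# Missed-opportunity estimate: at 1 s per RFQ a server could handle ~86,400/day.
win_rate = 0.04
capacity_model = 24 * 60 * 60 // seconds_model
print(f"extra wins/day/server: {(capacity_model - rfqs_per_day) * win_rate:,.0f}")
```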
That is probably more than their capacity, but it is a point they can consider for business development. That is my project: we were able to build a good predictive model, and we can see it would have a huge business impact for the company. I hope you enjoyed it, and thank you so much to Texas A&M and to Dr. Jones as my capstone coach.
Disease often shows a predilection for specific site locations. To demonstrate the body regions involved, researchers may use infographics or tables. While tables may be more descriptive and universal to any statistical software, heat maps rapidly communicate disease involvement across a cohort. At our institution, software to portray data within the confines of a human body is not readily available. Additionally, we needed the human body to be segmented into specific regions. While a graphic editor can be used, it is labor intensive, limits generalizability, and cannot be repeated with different data sets.

To create a scalable solution, we used the Custom Map Creator in JMP, an add-in to Graph Builder that allows users to create custom shapes. With this program, we outlined a human body, created anatomic regions, and colored each region based on the percentage of occurrence. The selected color range is scaled to directly correlate with the data. Each anatomical subunit is then filled and shaded by the data in the respective column. Heat maps provide easy interpretation and clearly portray the most involved areas for rapid communication to readers.

Today, I want to give a short presentation about using the JMP Custom Map Creator and how we used it to develop a human heat map. I'd like to give a shout-out to my mentor, Dr. Benjamin Caffender, who helped push this project along.

A little bit about the Custom Map Creator: it allows you to create custom shapes that can be displayed in Graph Builder. This lets you take statistical data, percentages in particular, and present them in a color-coded, color-referenced way. JMP comes with built-in maps of the United States and other places in the world, so you can show that in a given state a certain percentage of whatever you are measuring is occurring; that gives you the heat map approach, but the add-in also allows you to create different objects of your own.

For us, this was a dermatology project with cutaneous disease, so we wanted to show lesion involvement: which parts of the body were affected by the disease. We had the face at 90%, the scalp at just under 20%, and so on, and we wanted to show this dynamically. We ran into the question of the best way to go about it, did some digging, and found this Custom Map Creator.

What we did is take an outline of an image, and you can see all these dots: we traced the image and created our different body areas, and JMP sets those as specific shapes into which you can input data that will appear in those areas. You can take whatever picture you want; this is an example of one that we used. The data then gets incorporated into those areas. On the left of our image you can see that the face has significant involvement with the darker red, correlated with the 90% involvement, versus lighter areas such as the trunk or upper extremities. The Custom Map Creator lets you create those areas and then put those percentages in very easily.
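Outside JMP, the same idea, named regions filled on a color gradient scaled to the data, can be sketched in a few lines of Python. The rectangles below are stand-ins for traced body regions, and the percentages are illustrative, not the study data; this is only a conceptual analogue of what the add-in does, not the JMP workflow itself.

```python
# Conceptual sketch only: rectangles stand in for traced anatomic regions,
# and the percentages are made up for illustration.
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib import cm, colors

involvement = {"face": 90, "scalp": 18, "trunk": 35, "upper extremities": 12}
positions = {"face": (0, 3), "scalp": (0, 4.2), "trunk": (0, 1.5),
             "upper extremities": (1.3, 1.5)}

norm = colors.Normalize(vmin=0, vmax=100)   # scale colors to 0-100% involvement
cmap = cm.Reds                              # light -> dark as involvement rises

fig, ax = plt.subplots()
for region, pct in involvement.items():
    x, y = positions[region]
    ax.add_patch(Rectangle((x, y), 1, 1, facecolor=cmap(norm(pct)),
                           edgecolor="black"))
    ax.text(x + 0.5, y + 0.5, f"{region}\n{pct}%", ha="center", va="center")

ax.set_xlim(-0.5, 3); ax.set_ylim(0.5, 5.8); ax.set_aspect("equal"); ax.axis("off")
fig.colorbar(cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="% involvement")
plt.show()
```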
Rather than talking about it, I'd like to show you a little demo of how it works. I already have JMP open. Once you install this add-in, it shows up under Add-Ins, Map Shapes, and then Custom Map Creator. When we open the Custom Map Creator, we get these different data tables that pop up, as well as the map creator itself.

What you do is paste an image in. I have a file over here that I will just drag and drop in; you can copy and paste, however you like. You can adjust and zoom to get it exactly how you want, and give it a name. Now you start drawing, you start tracing. I'll just quickly click through here; that's one shape. We hit Next Shape. You can change the names of your shapes to whatever you like; when we were using it, we had different body areas.

If you look on the right at the data tables, in this example X, Y table we're seeing the very specific data points, the dots that are being graphed, and over here the shape IDs are being placed. You can see up here that I'm working on shape five. You hit Next Shape, and as you do, you get the next shapes; these are linked between the two data tables.

Now let's say I've finished my image and have everything traced out. You can hit Finish, which just saves it, or you can Finish With Random Data, which I think is an easy way to show it and also to have data actually show up in the right places. Hit Open, and now we have our graph. We can see the data it gave us over here: shape one I'll drop down to one, and shape five I'll put at 100. You can see that shape five, at 100, is our red and has the most involvement, while shape one down here at the lower value is the darker blue. You're able to get that stratification. You can either use your own data table or just type your own data into these shapes, and it adjusts in real time.

What's really nice about this add-in is that you can go right here and change the gradients. Right now we have red to blue; you can change it to yellow to red, so the higher values get the darker colors and lower involvement gets the lighter colors, really easily.

The two big things that are important to know as you're doing this: if you are using your own data table, the shape name column needs the Map Role column property, and we see here that this map name data table is connected to our example name .jmp file, which is this file over here. It is super important that those are linked. If you hit Generate Random Data, it does this for you; if you have your own data table, you have to set it manually. You'll also notice that if you manually create your own data table, shapes 1, 2, 3, 4, 5 here match up perfectly with shapes 1, 2, 3, 4, 5 there. Those names have to be exact, and those roles have to be exact.
If you do that, then you can have this data pop up really easily. You can also get these data tables through Graph Builder as well. That's just a quick overview of how to use this add-in. It's very versatile; as I said, we were able to use it for our project. I hope this gave you a brief bit of guidance. I've uploaded some instructions on how to use this as well, which you can also find. I hope you have a great day and enjoy the rest of the presentations you get to watch.
Screening experiments often require both continuous and categorical factors. Engineers generally prefer continuous factors with three levels over those with two levels because of concerns about curvature in the relationship between a continuous factor and the response.   This presentation provides a method for constructing designs for roughly equal numbers of continuous factors with three levels and categorical factors with two levels. If there are m three level continuous factors and m-1 two-level categorical factors, our designs have 2m runs. Note that m may be odd or even. As for any saturated main effects plan, our designs depend on an assumption of effect sparsity and negligible two-factor interactions (2FIs) to provide accurate estimates of the main effects. With substantial effect sparsity or by not making use of all the columns of the design, it can be possible to identify a large quadratic effect or 2FI. The columns of the two-level factors are orthogonal to the columns of three-level factors. This fact can be used to provide a design-centric analytical approach. When the number of runs is a multiple of eight, all the design columns are mutually orthogonal. This talk provides examples of the designs and shows how to create them.     Hello,  I'm  Ryan  Lekivetz,  Manager of  the  DOE  and  reliability  team  at  JMP.   Today  I'm  going  to  talk  to  you  about a  class  of  saturated  mix- level  main effects  designs  for   even  number  of  runs. That  sounds  like  a  mouthful, but  we'll  get  an  understanding  of  what that  is  by  the  time  we're  finished. If  you  see  in  my  JMP  journal  here, here's  a  rough  idea  as  to  what the  outline  is  going  to  look  like. Now,  at  the  beginning  of  though, I  do  have  to  call  out, so  I  have  three  collaborators   on  this  project. Bradley  Jones,  who  should  be  familiar to  many  of  you  at  JMP, Dibyen  Majumdar  and  Chris  Nachtsheim. Just  some  history  and  preliminaries . I'll  say  usually... If  you  see  in  there, we  talked  about  main  effects  designs. Now,  usually,  when  we  think  of  screening  designs, we  often  think  of  factors  being  all  at  two  levels. Your  standard  fractional  factorial  designs  that  you  see  in  textbooks, but  really  there's  this  big  question, what  about  nonlinear  effects? Of  course,  it's  great  we  want  to  find  out those  most  important  main  effects, but  what  happens   if  there  is  some  nonlinearity? In  particular,  in  2011, Definitive  Screening  Designs   or  DSDs  hit  the  scene. The  big  thing  with  those  DSDs, here  we  were  assuming   all  the  factors  were  continuous. Each  factor  and  DSDs are  going  to  be  at  three  levels. That  gave  some  hope  of  being  able  to  detect  quadratic  effects when  they  were  large. I'll  say  that  was  one of  the  big  popularity  of  DSDs. The  designs  we're  going  to  talk  about  here. Our  main  effect,  if  you  think  of  DSDs, not  only  were  they  good  at  picking  up  main  effects, we're  also  looking at  quadratics  and  interactions. The  designs  we're  looking  at  here are  really  main  effects  screening as  our  primary  goal. By  main  effects  screening, I'm  saying  we  have  our  list  of  factors and  we  want  to  find  out  which  of  those are  the  primary  drivers. We  want  to  find  out  those significant  main  effects. If  we're  really  lucky, we  may  get  some  quadratic  interactions. 
But again, main effects screening is the big thing. What you're going to find here is a mix of three-level and two-level factors. Whereas the DSDs had everything at three levels, here we have more of this mix of three and two levels. One thing to pay attention to, just like Definitive Screening Designs: when we're talking about these three-level factors, we're going to assume they're continuous. In particular, that means we're not looking for balance. You often see titles that talk about mixed levels, meaning three- and two-level factors, but traditionally, when we think of orthogonal arrays, the three-level factors are categorical, so we want to see each level an equal number of times. What we're going to find here is that these three-level factors are just going to have a few zeros.

One of the other big things is that we're going to have almost as many factors as we do runs. That's where the idea of saturated in the title comes from: saturated effectively means I have just as many factors as I do runs. That's also why this is main effects screening: with so many factors, if you start considering quadratics and interactions, that's an awful lot of terms. Because we're looking at so many factors, we're just hoping we can detect the significant main effects.

If you think about other designs with this mix of three and two levels, what might immediately come to mind is some of the classical Taguchi designs, the L18 or the L36. Those of you who are familiar with JMP may know that you can create a Definitive Screening Design with added two-level factors. I would almost look at the designs here as an extension of those, but with a lot more two-level factors than you might want in a standard DSD. This is also an area that has picked up steam a lot in the past few years. You see some references here to other authors thinking about this same problem, including a paper that I'll come back to at the end, that last Jones, Lekivetz, and Nachtsheim paper, which is related to this work as well.

Hopefully we're all on the same page now if we go back to that title. Saturated just means we have a lot of factors relative to the run size. Mixed level means we have some two-level and some three-level factors; the two-level factors could be continuous as well, where we're just not interested in their quadratic effects. Main effects design says our most important goal is finding the significant main effects. And even number of runs: if you recall the outline, all the designs we're looking at have an even number of runs. Now, when we think about building these designs, in these preliminaries I should really talk about the building blocks.
These designs are built from other matrices, other designs in the literature. If you're familiar with Definitive Screening Designs, there's this idea of a conference matrix. A conference matrix is just an M by M matrix that we use to construct Definitive Screening Designs. One of the nice things about conference matrices is that, in general, they exist for even numbers of rows and columns; there's a conference matrix for every multiple of two from 2 to 30, except for 22. In the cases where a conference matrix does not exist, or if you have an odd order, you can use something called a pseudo-conference matrix, which tries to look as close as it can to a conference matrix.

What property does a conference matrix have? If I take C transpose C, or conversely C C transpose, I get M minus 1 times the identity matrix. What that really means is that the columns of the matrix C are orthogonal: if I take the cross-product of any two columns, I get zero. But notice the M minus 1, because another feature of the conference matrix is that each row and column has exactly one zero.

If you have JMP 17.2 (I'll admit this JSL first shows up in JMP 17.2), there is a Conference Matrix command. Here I can just ask for a 6x6 conference matrix. Again, what's the special property? If I take C times C transpose, or conversely, I get 6 minus 1, which is 5, times the identity matrix; each of those columns is orthogonal. Starting in JMP 17.2, you can create your own conference matrix just by giving it the order: here I wanted a six-run conference matrix, so I put that in. Let's try it with eight, and you'll see an 8 by 8 conference matrix. That's one of the building blocks we need.

Another building block is a similar kind of structure called a Hadamard matrix. The difference is that a conference matrix has values of negative one, zero, and one, while a Hadamard matrix has only plus and minus one. A Hadamard matrix exists for most multiples of four. When you hear Hadamard matrix, another thing you'll often hear about is an orthogonal array. A Hadamard matrix has that same kind of property: if I take it times its transpose, I get N times the identity matrix. Recall that for the conference matrix it was M minus 1 because of the one zero in each row and column; here it's N times the identity matrix. The idea is that any pair of columns is orthogonal. Similar to the conference matrix, we have a special command in JSL for constructing a Hadamard matrix. Let's take a look here.
If you notice, Hadamard(8) gives me an 8-run Hadamard matrix, and if we take that Hadamard matrix times its transpose, we get eight times the identity matrix. Again, that means any pair of columns is orthogonal, which you can actually already see: just pretend this first column is an intercept, and you see all of these columns are balanced, so I get that orthogonality.

We're almost there with our building blocks. The last piece we need is the Kronecker product. Throughout, we may not really need to think of it in terms of a Kronecker product, but often when we create these designs, that's the way we like to think of them. The Kronecker product is denoted by a symbol that looks like a multiplication sign with a circle around it; in JMP, another name for the Kronecker product is the direct product. It just happens to be a convenient way to construct designs. All the Kronecker (direct) product is: if I take a matrix A and the Kronecker product with B, I take each element of A and apply it to the entire matrix B, over and over again.

Where this comes in handy, let's see where you may have seen something like this before. Again, we have Direct Product as a JSL command. Say I start with that 6x6 conference matrix, and my matrix A is a 2 by 1, two rows with a plus one and a minus one, and I take the direct product with C. What do I get? Effectively, if I added a center run, this would be a 13-run DSD in six factors. With the plus one in the Kronecker product I get the conference matrix C on top, and with the minus one I get negative C below. The Kronecker product is just a convenient way to think about things like what we might call a foldover.

With those preliminaries done, we can start talking about the different constructions. The first method is, I'll say, the nicest of all of them. This is where our run size is a multiple of eight. What are we doing here? We're going to start with a conference matrix of order M equals 4k. Again, remember, conference matrices tend to exist as long as the order is even; for the Hadamard matrix, we're looking for multiples of four. I'm going to take a conference matrix of order M and a Hadamard matrix of the same order, assuming both exist. What I'm doing is folding over; you can express this as a Kronecker product, but I find it more convenient just to write it this way. The first part looks like a Definitive Screening Design: I'm taking a conference matrix and folding it over. Then on the other side, I'm replicating the Hadamard matrix, taking a copy of it. What do we get with this design?
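Before looking at the columns, here is the direct-product (Kronecker) step just described, sketched in Python rather than JSL. The 6x6 conference matrix is written out by hand from the Paley construction; it may not be the exact matrix JSL's Conference Matrix(6) returns, but it has the same defining property.

```python
# Sketch of the fold-over via a Kronecker product with a 6x6 conference matrix.
import numpy as np

C = np.array([[ 0,  1,  1,  1,  1,  1],
              [ 1,  0,  1, -1, -1,  1],
              [ 1,  1,  0,  1, -1, -1],
              [ 1, -1,  1,  0,  1, -1],
              [ 1, -1, -1,  1,  0,  1],
              [ 1,  1, -1, -1,  1,  0]])

print(C.T @ C)                                 # (6 - 1) * identity: orthogonal columns

fold = np.kron(np.array([[1], [-1]]), C)       # C stacked on top of -C
dsd = np.vstack((fold, np.zeros((1, 6), dtype=int)))  # add a center run
print(dsd.shape)                               # (13, 6): a 13-run DSD in six factors
```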
Well,  these  first  M  columns  that  are formed  from  the  conference  matrix, that's  going  to  give  us M  three-level  factors. The  remaining  M  minus  1  are  all  going  to  be  two-level  factors. This  C  part  is  going  to  be   for  three-level, this  H  part  for  two-level. What  do  we  end  up  with? Well,  what  we  did, we  basically  doubled  these  here. We're  going  to  have  two  M-runs and  two  M  minus  1  factors. Let's  take  a  look  at  what this  might  look  like. Let's  see  an  example  here. Let's  take  my  C. I'm  going  to  just  create an  8  by  8  conference  matrix. First,  let's  construct these  three-level  columns. I'm  going  to  take  that  direct  product. Again,  in  this  case,  I  want  that  foldover, I  want  C  and  minus  C. Let's  take  a  look  at  what  C  looks  like. Again,  this  is  just that  foldover  structure  on  that. One  thing  here,  you  noticed  I  said  the  remaining  M-1  column for  the  two-level  factors. The  reason  for  that... Let's  take  a  look  at  the  Hadamard  matrix  of  Order  8. If  I'm  going  to  replicate  this, if  I'm  going  to  copy  this, well,  this  first  column  here   is  going  to  be  for  the  intercept. I  don't  want  to  put  that as  one  of  my  design  factors if  it  never  changes, if  it's  constant   throughout  the  entire  thing. If  you  notice  here,  I'm  going to  just  drop  the  intercept. Now  I  have  an  8  by  7  design. If  you  notice,  I'm  going  to  use the  direct  product  again  here. But  instead  of  with  the  conference  matrix where  I  was  using  one  and  minus  one, I  just  want  to  make  a  copy of  that  Hadamard  matrix  H  without  the  intercept. Let's  give  that  a  look. We  can  take  a  look  here  on  this  matrix where  you  see  that  the  one,  one,  one. You  can  actually  see  where it  gets  just  copy  it  again. It's  just  the  same  matrix stacked  on  top  of  itself. If  I  concatenate  all  of  those  together, you  can  see  I  have  a  16  by  15. I  can  actually  just  create  that  data  table. I  have  this  design  now, a  16-run  design  with  15  factors. The  first  eight  of  those  are  three  level and  the  remaining  seven  are  at  two  levels. Let's  just  take  a  look. Let's  go  to  design  diagnostics and  let's  see  what  this  looks like  and  evaluate  design. You  can  see  I  just  created  a  main  effects. This  might  be  hard  to  see, so  this  looks  pretty  messy, but  you  can  already  see  there's a  special  structure  with  these  designs. One  thing  I  want  to  point  out, let's  get  rid  of  the  alias  terms. Let's  just  look  at  the  correlations here  with  these  main  effects. You  can  see  in  this  case, actually,  all  my  main  effects are  orthogonal  to  each  other. One  thing  to  point  out  here, because  I  was  using  that  Hadamard  matrix as  a  building  block, those  two-level  designs,  we  have all  that  nice  orthogonality  there. You'll  notice  this  fractional  increase  in  confidence  interval  length is  a  little  bit  higher. Why  is  that? Well,  that's  because we  have  these  three-level  factors. Those  first  factors  are  at  three  levels. What  that  means  is  that if  there  is  a  quadratic  effect, now  this  is  giving  me  some hope  of  detecting  that. Again,  don't  forget,  we're  already at  16  runs  and  15  main  effects. 
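Putting the whole Method 1 example together, here is a numpy sketch of the 16-run, 15-factor design just described. The talk builds it with JSL's Conference Matrix, Hadamard, and Direct Product commands; the paley_conference helper below is my own stand-in (Paley construction, q an odd prime), so the conference matrix may differ from JMP's, but the construction and the orthogonality check are the same.

```python
# Method 1 sketch: fold over an 8x8 conference matrix, replicate an 8x8 Hadamard
# matrix with its intercept column dropped, and check orthogonality.
import numpy as np
from scipy.linalg import hadamard

def paley_conference(q):
    """Order q+1 conference matrix from the Paley construction (q an odd prime)."""
    residues = {(i * i) % q for i in range(1, q)}
    chi = [0] + [1 if r in residues else -1 for r in range(1, q)]
    Q = np.array([[chi[(j - i) % q] for j in range(q)] for i in range(q)])  # Jacobsthal core
    border = 1 if q % 4 == 1 else -1
    top = np.hstack((np.zeros((1, 1), dtype=int), np.ones((1, q), dtype=int)))
    bottom = np.hstack((border * np.ones((q, 1), dtype=int), Q))
    return np.vstack((top, bottom))

M = 8
C = paley_conference(7)            # 8 x 8 conference matrix
H = hadamard(M)[:, 1:]             # 8 x 7: drop the intercept column

three_level = np.kron(np.array([[1], [-1]]), C)   # C on top, -C below
two_level = np.kron(np.array([[1], [1]]), H)      # the same H twice
D = np.hstack((three_level, two_level))           # 16 runs, 15 factors

# With the intercept added, all columns are mutually orthogonal.
X = np.hstack((np.ones((2 * M, 1), dtype=int), D))
print(np.count_nonzero(X.T @ X - np.diag(np.diag(X.T @ X))))   # 0 off-diagonal terms
```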
If  we  start  thinking   about  quadratic  effects  and  interactions, we  really  have  to  hope  for  those  large  effects when  it  comes  to  doing  any  model  selection. If  you  think  of  a  traditional  design where  we  only  had  two  levels, in  that  case,  we  would  have  no  hope of  being  able  to  detect any  quadratic  effect. That's  that  first  construction. In  some  sense,  that's  where everything  works  out  nicely. I  have  a  conference  matrix  and  I had  a  run  matrix  available  to  me. That  gives  me  a  run  size that's  a  multiple  of  eight. Now  what  happens   if  I  don't  have  one  of  those? Let's  say  my  design  is  going to  be  a  multiple  of  four. The  run  size  is  going to  be  a  multiple  of  four. Let's  assume  now  that  I  have  a  conference  matrix  available  to  me, but  maybe  I  don't  have  a  Hadamard  matrix. If  you  recall  before, where  for  my  Hadamard,   I  need  it  to  be  a  multiple  of  four, the  original  run  size,  so  that  when  I doubled  it,  it  was  a  multiple  of  eight. Instead,   maybe  if  I  don't  have  a  Hadamard  matrix, I  could  use  something  like a  D-optimal  main  effects  plan. This  would  just  be if  I  went  into  custom  design. Let's  say  if  I  wanted  for  six  runs, I  would  go  into  custom  design and  say  I  have  five  main  effects,   five  factors, and  I  want  six  runs  for  a  main  effects. This  construction  actually  looks a  lot  like  it  did  in  method  1. The  only  real  difference  is  instead of  that  Hadamard  matrix, now  I'm  going  to  be  using  this D-optimal  main  effects  plan. But  it  turns  out  to  be the  same  thing  here. I'm  going  to  have  those  first  M  columns for  the  three-level  factors, the  remaining  M  minus 1  for  the  two-level. I'm  really  at  that  idea  of  saturation because  by  the  time  I  factor in  the  intercept, I'm  at  two  M-runs  and  2 M  minus  1  factors. Let's  take  a  look   at  how  this  one  might  look. In  this  case, I'm  going  to  go   with  a  10-run  conference  matrix, which  again,   because  that's  a  multiple  of  two, I  can  create  a  conference  matrix. The  one  thing  to  pay  attention  to  here,  though, because  we're  at  10- runs, there's  not  going  to  be  a  Hadamard  matrix  of  order  10  available  to  us. Let's  construct  though  first  though, let's  just  fold  over  that  CC. Again,  that's  just that  DSD-like  structure. Now  in  this  case, because  I  don't  have  that  command, and  so  this  is  where  you could  decide  how  to  do  this. You  may  actually  even  have  your  own design  that  you  want  to  use  in  this  case. I'll  say  in  this  case, I'm  showing  a  D-optimal  main  effects. You  may  want  to  use  something  l ike  a  Bayesian  D-optimal where  two-factor  interactions  are  if  possible. But  so  in  this  case,  all  that  I've  done is  I've  taken  the  D-optimal, 10-run  design  for  nine  factors, and  I've  just  happened to  include  the  intercept  here. If  you  go  into  Custom  Design,  this  is what  you  get  from  the  model  matrix. The  model  matrix  actually   includes  the  intercept  by  default. That's  in  fact  where  I  had  this  from  here. Again,  I'm  going  to  drop. Let's  drop  that  intercept  column. Again,  here  I'm  just  replicating the  exact  same  thing  again. I've  just  folded  over,  not  folded  over. I've  just  replicated   the  exact  same  thing  twice. 
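The assembly step for Method 2 is the same fold-and-replicate pattern, just with a D-optimal main-effects plan in place of the Hadamard matrix. As a sketch, assuming you already have the conference matrix C and the two-level plan T (for example, exported from JMP's Conference Matrix command and Custom Design), the combining step looks like this; the names C10 and T10x9 in the comment are placeholders.

```python
# Fold-and-replicate assembly used in Methods 1 and 2.
import numpy as np

def mixed_level_design(C, T):
    """Stack [C; -C] beside [T; T].

    C : (m, m) conference (or pseudo-conference) matrix -> m three-level factors
    T : (m, m-1) +/-1 main-effects plan (intercept dropped) -> m-1 two-level factors
    Returns the (2m, 2m-1) saturated mixed-level design.
    """
    three_level = np.kron(np.array([[1], [-1]]), C)   # C on top, -C below
    two_level = np.kron(np.array([[1], [1]]), T)      # the same T twice
    return np.hstack((three_level, two_level))

# e.g. D = mixed_level_design(C10, T10x9) would give the 20-run, 19-factor design
# described above, where C10 is the 10x10 conference matrix and T10x9 is the
# 10-run, 9-factor D-optimal main-effects plan.
```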
It'd  be  hard  to  see  as  to  where it  was  replicated  here. But  again,  all  that  we've done  is  just  made  a  copy. I  made  a  copy  of  that  doptN10m9  twice, concatenate  those  together. Let's  create  the  data  table  again. Again,  20- runs  and  19  factors. Again,  keep  in  mind  that  my  first  10  factors are  all  at  three  levels,  the  remaining  at  two. Let's  take  a  look at  evaluate  design  again  here. I'm  going  to  put  all  of  these  factors  in. In  this  case,  before, let's  go  directly  to  remove  those  alias  terms. Let's  take  a  look  at  the  color  map. This  still  looks  like a  pretty  good  color  map  to  me. What  do  we  notice  here versus  the  last  one? One  of  the  biggest  differences  is, well,  here  our  three  levels  are  still going  to  be  orthogonal, and  that's  because  we  were using  a  conference  matrix. The  three- levels   and  the  two- levels  are  orthogonal. In  particular,  my  three-level  factors are  orthogonal  to  the  two-level  factors. But  because  I  was  using  that  10-run  design, it  turns  out  we  can't   get  perfect  orthogonality for  the  10-run  design  with  nine  factors. If  we  look  among  those  two-level  factors, we  have  some  small  correlations  there. One  thing  to  point  out, there  does  exist  a  Hadamard  matrix  of  order  20. In  some  sense,  we  are  taking  a  little bit  of  a  hit  on  those  two-level  factors. But  if  you  take  a  look, the  cost  of  using  this  type  of  design versus,  let's  say, everything  at  two  levels, we  have  about  a  5%  increase in  the  confidence  interval  length. But  the  nice  thing  here  now  is  that  now we  have  those  factors  at  three  levels. If  we  really  are  worried   about  quadratic  effects appearing  in  these  first  ones, now  we  have  a  chance  of  detecting  those at  a  small  price  to  that  estimation efficiency  for  those  main  effects. That  was  great  in  the  case that  we  still  had  a  conference  matrix   that  existed,  but  no  Hadamard. This  last  method, this  is  where  we  don't really  fit  into  either  of  those  cases. In  this  case  now, we're  talking  about  a  run  size  is  going  to  be  a  multiple  of  two. In  this  case,  what  we  have  is  we  don't have  a  conference  matrix  available  to  us. This  is  where  we're  going  to  use that  pseudo- conference  matrix. I'll  say  if  you're   really  interested  in  that, I'll  go  back  to  these  preliminary. In  the... Let's  see. Actually,  it  was  the  original  DSD  design. Sorry,  this  2011 Definitive  Screening  Design. If  you  also  look,  I  have  it in  the  list  of  references  at  the  end. Let's  see, those  who  are  particularly  interested. This  original  class  of  three-level  design for  Definitive  Screening in  the  presence  of  second- order  effects. This  paper  was  written  before  they   were  aware  of  the  existence of  this  idea  of  conference  matrices. In  that  paper,  they  talk  about  a  general   purpose  algorithm for  creating  something that  looks  like  a  conference  matrix. You  set  zeros  along  the  diagonal, and  then  the  rest  of  the  values  are  going to  be  plus  and  minus  one, where  you're  trying  to  make a  main  effects  D-optimal  design. These  pseudo- conference  matrices,   as  they  were, they  look  like  a  conference  matrix. You  can  use  that  algorithm  when the  conference  matrix  doesn't  exist. 
Really,  what  it's  trying  to  do  is to  drive  it  to  look  as  close  as  it  can. If  you  remember  a  conference  matrix, if  you  take  C  transpose, you  get  zeros  and  the  off  diagonal for  perfect  orthogonality, it's  going  to  be  trying  to  drive  it  to  look  like  that where  I  can't  make  it  perfectly. Similar  to  the  case  we  had  for  method  2, our  T  is  going  to  be a  D-optimal  main  effects  plan. What  do  we  end  up  with  here? Again,  our  first  M  columns  are   still  going  to  be  three-level  factors, the  remaining  M minus 1   for  the  two- level. Again,  we're  still  at  that  case of  saturation by  the  time  we  factor  in  the  intercept. The  cost  here  is  that  we're  not  going to  get  the  nice  orthogonality that  we  may  have  had  in  method  1  and  2. Let's  take  a  look  at  how this  one  works  here. My  C  in  this  case, and  I'll  do  that  CC. What  I  started  with  here was  a  pseudo- conference  matrix  of  order  nine. Nine  is  not  a  multiple  of  two. I  have  to  do  something  that  looked like  a  pseudo- conference  matrix. Let's  just  take  a  look  here. Let's  actually  take  a  look  and  see  what  that  C  transport  C  looks  like. You  notice  I  can't  get   that  perfect  orthogonality, but  instead  I  have  eight  along  the  diagonal and  then  these  plus and  minus  ones  on  the  off  diagonals. We're  close. I  have  that  CC  that  was  just  doing the  same  thing,  folding  that  matrix  over. Likewise,  so  T, that's  going  to  be  of  the  same  order. I  took  the  nine-run, eight-factor,  D-optimal  design. In  this  case,  I've  actually already  removed  the  intercept. We're  just  going  to  replicate that  design  again. I  have  this  18  by  8. I'm  going  to  combine   those  two  pieces  together. Let's  take  a  look  at  the  table. This  is  a  particularly  difficult design  to  generate  in  general. In  this  case,  I  don't  have  a  nice  number. My  run  size  is  18. Now  I  have  these  first  nine  factors at  three  levels and  the  remaining  eight  at  two  levels. Not  surprisingly,  there  is  a  cost  of  this. Again,  let's  just  remove  those  alias  terms  again and  take  a  look  at  our  color  map. You  see  now,  my  two  and  three  levels are  actually  orthogonal  to  each  other, but  the  three  levels  have a  small  correlation  among  them. That  was  where,  remember,  if  you  recall, we  saw  that  plus  and  minus one  and  that  off  diagonal. Likewise  for  the  two  levels. We  actually  have  one   quite  high  correlation  here, but  in  general,  so  about  0.1. In  this  case,  we  don't  even  have an  orthogonal  design  to  compare  it  to. We  actually  have  a  couple  of  factors that  we  may  be  worried  about a  bit, have  a  little  bit of  a  larger  fractional  increase. Those  three  levels  will  say,  well,  we can't  get  those  to  be  perfect  as  it  is. This  fractional  increase  is  compared   to  the  hypothetical  orthogonal  design, the  orthogonal  array, which  doesn't  even  exist. We're  paying  a  small  price,  but  we  still have  some  generally  nice  properties. The  three  levels  are  orthogonal  to  the  two  levels, and  we've  minimized that  correlation  in  general. In  some  sense,  those  are those  three  methods. 
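As a rough illustration of the pseudo-conference-matrix idea just described, here is a crude random search: zeros fixed on the diagonal, plus and minus ones elsewhere, keeping the candidate with the best main-effects D-criterion. The published approach uses a proper coordinate-exchange optimization, so this sketch only shows the structure being aimed for, not the actual algorithm.

```python
# Crude stand-in for the pseudo-conference-matrix search (illustration only).
import numpy as np

def pseudo_conference(m, n_tries=20000, seed=1):
    rng = np.random.default_rng(seed)
    best, best_logdet = None, -np.inf
    for _ in range(n_tries):
        C = rng.choice([-1, 1], size=(m, m))
        np.fill_diagonal(C, 0)                      # one zero per row and column
        sign, logdet = np.linalg.slogdet(C.T @ C)   # D-optimality for main effects
        if sign > 0 and logdet > best_logdet:
            best, best_logdet = C, logdet
    return best

C9 = pseudo_conference(9)
offdiag = C9.T @ C9 - np.diag(np.diag(C9.T @ C9))
print(np.abs(offdiag).max())   # ideally small; perfect orthogonality is impossible here
```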
The  nice  thing  is  depending   on  the  run  size  that  you  have and  depending  on  the  number  of  factors, this  methodology  or  this  class  of  designs really  gives  you  some  flexibility in  the  number  of  runs. Of  course,  if  you  can  afford   a  multiple  of  eight  or  a  multiple  of  four, the  properties  are  going   to  look  a  lot  nicer. But  I  mean,  when  runs  are  really  expensive, we  still  have  this  method  3, which  is  going  to  give you  a  reasonable  design. With  that,  I  just  want  to  give some  final  thoughts  here. I  have  a  link. If  you  take  a  look   at  this  journal  afterwards, I  have  a  link  to  a  presentation   from  JMP  Discovery  Europe from  Bradley  Jones  that  was on  orthogonal  mix  level  designs. The  designs  in  this  presentation, the  orthogonal  mix  level  designs, will  almost  look  very  similar to  what  was  presented  here. The  designs  presented  in  this  presentation are  for  when  you're  closer  to  saturation. I  would  say  these   orthogonal  mix-level  designs from  the  previous  discovery, and  we  actually  have  a  paper  on  that. That's  this  Jones  Lekivetz,  and  Nachtsheim that  was  in  JQT,  Journal  of  Quality  Technology. Those  designs  work  very  well when  you  have  about  half  the  number   of  factors  relative  to  the  run  size, which  sounds  a  lot  like  DSDs, but  it's  when  you  have more  two-level  factors. In  the  design  presented in  that  presentation, you  can  go  up  to   about  three-quarters  of  the  run  size. The  designs  presented  here are  really  when  you're  closer to  that  saturation, where  you  say,  well,  no, I  really  have  a  lot  of  factors that  I'm  interested  in, and  the  runs  are  really  expensive. The  designs  here  to  fill  that  last  gap when  you  want  to  get  close  to  saturation. I  showed  you  these  designs  look  nice, but  can  they  actually  do  anything when  it  comes  to  model  selection? I'll  say  here. Again,  we  have  run  a  lot of  simulations  on  these. This  isn't  doing  it  properly, this  is  just  for  a  single  realization. Let's  say  here my  factors,  so  call  one,  call  five, call  13  and  14, I've  chosen  it  as  being  significant. In  base  JMP,  you  might  want to  use  something  like  stepwise. Let's  take  a  look. Here  I've  done  a  reasonable  job. I  may  have  ended  up  with  one  extra  factor, but  in  some  sense,  I'd  still  be quite  happy  that  I've  been  able to  pick  up  those  effects  there. Let's  try  this  same  thing with  generalized  regression. If  we  take  a  look,  the  same  thing. Let's  see,  1,  5,  13,  and  14. If  we  go  back,  1,  5,  13,  and  14. In  this  case, both  of  you  have  JMP  Pro  with  generalized regression  or  Fit  stepwise, both  were  able  to  pick  up those  main  effects  that  we  had  in  there. Just  to  show  you  now, if  you  have  a  large  quadratic  effect, I'll  say  because  compared  to  DSDs   where  we  had  that  center  point  run, this  we  only  have  two  zeros. We  don't  even  have  a  third   for  a  center  point and  we  have  a  lot  more  factors. Detecting  quadratic  effects is  still  going  to  be  difficult. But  let's  just  pretend  we  have  a  large  one. Here  I  have  again, the  1,  5,  13,  and  14. But  now  I've  also  added  a  quite  large quadratic  effect  for  the  column  one. Let's  just  take  a  look  here. Let's  try  to  fit  stepwise  again. 
Now  I've  actually  just  added those  quadratics  in  the  model. Again,  I  have  one  extra  term, but  if  you  see,  even  in  this  case,  it  did actually  pick  up  that  quadratic  effect. Not  only  did  it  pick  up  the  correct  main  effects, it  did  detect  the  quadratic  effect  that  I  had. Let's  try  the  same  thing. Let's  see  the  model  launch. It  looks  like  I  already   have  it  shown  here. The  same  thing. You  see  the  1, 5,  13,  14, and  generalized  regression  was  also able  to  pick  up  that  quadratic  effect. I'll  say  I  would  not  expect  to  anticipate that  you  can  detect   that  many  quadratic  effects. But  even  if  you  look  at  your  residual plots  or  your  main  effects  plots, even  those  two  zeros   give  you  some  indication that  maybe  I  do  want  to  follow  up and  take  a  look  at  those  factors  a little  bit  deeper  for  quadratic  effects. Where  again, in  your  traditional  screening, if  you're  only  doing  things  at  two- level, you  wouldn't  have  the  chance  to  do  that. With  that,  again,  I  will  post  this journal  where  this  video  is  located, but  I'll  also  flash  up  these  references  at  this  time. With  that,  thank  you  for  taking  the  time to  watch  this  video, and  please  share  any  messages  you  have  in  the  community  below.
Do some political candidates use first-person (I, we) or second-person pronouns (you, you all) more often in their campaign tweets?

In this course exercise, students learn how to test distributions (Analyze > Distribution > Test Probabilities) using a grouping variable (BY). The data set comprises tweets (N = 1,107) from the early stages of the 2016 U.S. presidential primary season.

First-person pronouns focus on the speaker, or possibly the group to which the speaker belongs. Second-person pronouns speak directly to the audience, suggesting that the candidate might be seeking to establish a personal connection.

A chi-squared test of the relationship between political party (Democrat vs. Republican) and the use of first person (Present vs. Absent) is significant. The distributions are tested to see whether Democrats or Republicans are more likely to use first-person pronouns.

The test of the use of second-person pronouns occurs at the candidate level. In this data set, political candidates use the second person in 20% of their tweets. But who uses the second person more (or less) than the other candidates? In this part of the exercise, students compare each candidate's use of the second person against the group's 80-20 distribution. Tests are conducted on tweets from Hillary Clinton, Bernie Sanders, Marco Rubio, and Donald Trump. The results indicate that only some of the candidates used second-person pronouns more often than the group average.

Hi, I'm Robert McGee, an Associate Professor of Integrated Marketing and Communication at the University of Mississippi, also known as Ole Miss. What I have today is a demonstration of a teaching exercise I use with students. The title of the presentation is Communication Style and Political Campaigns: Promoting a Personal Connection with an Audience. The question is: do some presidential candidates use the first person or the second person more than others in their tweets on Twitter? This is an important question because we want to form a personal connection between a candidate and an audience, and one way they can do that is through the language in their social media.

The students manually coded tweets during one week of the presidential primary season in 2016, recording every tweet issued by all 17 presidential candidates at the time. What we're going to demonstrate today is how to test the probability of a distribution using the BY grouping variable.

After we recorded 1,107 tweets, the first thing to test is whether the use of first person varies by party. This is a typical chi-square test with two levels: political party, Democrat versus Republican, and first person either present or absent in the tweet. You can see the test of the relationship there; the likelihood ratio is significant. The graph shows that Democrats typically used the first person a little more often than Republicans, and it was a significant difference.

Now let's move on to the second person. You can do the same thing: look at the candidate and the presence or absence of the second person in the tweet. You'll see that it's also a significant relationship.
The likelihood ratio value is 83.7, and it is significant. Then you look at the graph, and you can see that some people obviously used the second person more than others, but which ones were really different from the others? You can look at the contingency table: reading across the rows, you'll see how often each candidate used the second person. Ben Carson used it in 4% of his tweets, Chris Christie in about 25 or 26% of his, and so on. We see Hillary Clinton used the second person in about 16 or 17% of her tweets during that week.

What we want to be able to do is test that specific probability, the probability of that distribution: is Hillary Clinton's distribution of 17% and 83% really different from the overall average of all the political candidates? If you look at the bottom of the contingency table, you'll see that the overall distribution was 80% and 20%. You can also find this information with Distribution: go to Analyze, then Distribution, put the variable in the Y box, and hit OK. You'll see the frequencies, and the distribution is 80 and 20; more precisely 19.9%, which I roughly put at 80% and 20%.

What we want to know is whether Hillary Clinton and the other candidates use the second person more or less than this average. We're not looking at a 50/50 test; we're looking at an 80 versus 20 test. To do this, we use the By box, or By field, to subdivide the distribution by candidate. We still have our dependent variable, the use of the second person, in the Y box, but we put the variable candidate in the By box, which produces an individual test for each candidate. When you look at the output, you'll get a result for each candidate: Ben Carson first, then Chris Christie, and so on. It tells us the same information we have in the contingency table, with a little graph.

But what we want to know is whether a candidate's distribution is different from the 80-20 distribution we have for all of the candidates overall. To do this, we look at the person we're interested in, in this case Hillary Clinton, and we see that the probability of her distribution is 83% and 17%. We go up to where it says second person, the name of the variable, click on the drop-down menu (the red triangle), and find the command Test Probabilities. When we click Test Probabilities, a new dialog box opens that lets us establish the benchmark we want to use. Rather than testing against 50/50, we're going to test against 80 and 20, so I type in 0.8 and 0.2 because that's what we're testing. I leave the setting at a two-tailed test; I don't know whether it's going to be higher or lower than 80/20 when I test these distributions, so I leave it as a two-tailed test.
But I put in my benchmark of 80% and 20%, which I got from the contingency table, or from the overall distribution of the use of second person. Then we click Done. Here's what we have. This is part of the results. You'll see that she had 96 tweets. Of those, 83% did not have the second person, 17% did have the second person, and we're testing it against the distribution of 80/20. The likelihood ratio or the chi-square value is 0.69 and the P value is not significant. Her use of the second person did not vary significantly from the overall group average of 80/20.

Let's try somebody else. We do the same thing. This time we'll do it for Bernie Sanders. He had 150 tweets that week. You'll see that he used the second person only about 5% of the time. We test that against the 80/20 distribution of the overall group of politicians, and we see that the chi-square is significant; it's 29.7 or 29.8, and the P value is less than 0.0001. So yes, his distribution, or his use of the second person, significantly varied, but in this case it was significantly less, only 5% compared to the overall average of 20%. It's significantly less for him. Let's try someone else. Marco Rubio was a presidential candidate in 2016, and he used the second person about 24% of the time. We test that again against the 80/20 percentage, and we see that his chi-square value for this test is 0.88, and it is not significantly different from the overall distribution of 20 and 80%. His use of the second person did not vary between his tweets and the overall average of all the candidates. We'll look at another one. Here's Donald Trump. He had 105 tweets during that week, and you see that he used the second person about 30% of the time, which means about 30% of the time he was saying you, or you all, or some form of that second person in his tweets. We want to test that against a distribution of 80 and 20%. The likelihood ratio is significant. The chi-square value is 6.4, almost 6.5, and the P value, or the significance level, is 0.01. The test shows, or suggests, that he used the second person more often than most of the candidates who were running during the primary season in January 2016.

This is a way that we can use to test each one of those rows. At the beginning of the 2016 primary season, we see that Hillary Clinton and Marco Rubio used the second person about as much as everybody else did in the electoral season. Bernie Sanders used the second person significantly less, and Donald Trump used the second person significantly more. This is a way to do a follow-up test on a chi-square when you need to test the distribution of individual rows. You can do this using the By box. You use this to subdivide. The option to test the probability of a distribution allows us to set a benchmark, comparison, or reference group to something other than 50/50, or whatever we might otherwise be looking at. In this case, we set it to 80/20. This is a way to do follow-up tests on a significant chi-square when you need to test the probability of a distribution.
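The By box is what repeats that test once per candidate. A rough pandas equivalent of that subdivision, with a hypothetical file and column names, is a groupby loop that runs the same 80/20 goodness-of-fit test for every candidate:

```python
# Sketch of the "By" idea in pandas: one 80/20 goodness-of-fit test per candidate.
# The file name and column names are hypothetical stand-ins for the coded tweets.
import pandas as pd
from scipy.stats import chisquare

tweets = pd.read_csv("tweets_2016.csv")   # columns: candidate, second_person ("Present"/"Absent")

for candidate, grp in tweets.groupby("candidate"):
    n = len(grp)
    observed = [(grp["second_person"] == "Absent").sum(),
                (grp["second_person"] == "Present").sum()]
    stat, p = chisquare(observed, f_exp=[0.8 * n, 0.2 * n])
    print(f"{candidate}: n={n}, % present={observed[1] / n:.1%}, "
          f"chi-square={stat:.2f}, p={p:.4f}")
```

Each pass through the loop corresponds to one of the per-candidate reports produced by the By variable in JMP.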
I'm  Robert  McGee  at  the  University of  Mississippi, and  if  you  have  any  questions,  there's my  email  address,  feel  free  to  reach  out. Thank  you  very  much.
Disease often shows a predilection for specific site locations. To demonstrate the body regions involved, researchers may use infographics or tables. While tables may be more descriptive and universal to any statistical software, heat maps rapidly communicate disease involvement across a cohort. At our institution, software to portray data within the confines of a human body is not readily available. Additionally, we needed the human body to be segmented into specific regions. While a graphic editor can be used, it is labor intensive, limits generalizability, and cannot be repeated with different data sets. To create a scalable solution, we used the Custom Map Creator in JMP, an add-in to Graph Builder that allows users to create custom shapes. With this program, we outlined a human body, created anatomic regions, and colored each region based on the percentage of occurrence. The selected color range is scaled to directly correlate with the data. Each anatomical subunit is then filled and shaded by the data in the respective column. Heat maps provide easy interpretation and clearly portray the most involved areas for rapid communication to readers.

Today, I want to give a little presentation about using the JMP Custom Map Creator and how we used it to develop a human heat map. I'd like to give a shout out to my mentor, Dr. Benjamin Caffender, who helped push this project along. A little bit about the Custom Map Creator: it allows you to create custom shapes that can be displayed in Graph Builder. What this allows you to do is take statistical data, and percentages in particular, and present them in a color-coordinated and color-referenced presentation. The way this is usually used, since JMP comes with some built-in maps of the United States and different places in the world, is to say, in this state, this percentage of whatever you are measuring is occurring. It gives you that heat map approach, but it also allows you to create different objects as well.

For us, this was a dermatology project, so we had cutaneous disease, and we wanted to show lesion involvement, which parts of the body were affected by the disease. So we had the face at 90%, and we had the scalp at just under 20%. We wanted to show this dynamically. We ran into the question of what's the best way to go about it. We did some digging, and we found this Custom Map Creator to do that. What we did is we took an outline of an image, and you can see all these dots. We went in and we traced this image and created our different body areas, and JMP will set those into specific shapes where you can input data that will appear in those areas. You can take whatever picture you want. This is an example of one that we used. Then that data will get incorporated into those areas. So you can see over on the left of our image, the face has significant involvement with the darker red, and that's correlated with the 90% involvement, versus lighter areas such as the trunk or upper extremities. The Custom Map Creator lets you create those areas and then put those inputs, those percentages, in very easily.
Rather than me talking about it, I'd like to show you a little demo of how it works, and we will get going from there. I already have JMP open. Once you install this add-in, it's going to be under Add-Ins, Map Shapes, and then Custom Map Creator. We pick Custom Map Creator, and we get these different data tables that pop up, as well as our map creator itself. What you do is you can paste an image in. I have a file over here that I will just drag and drop in. You can copy and paste, however it may be. You can adjust and zoom and so on to get it exactly how you want. Give it a name. Now you start drawing, you start tracing. I'll just quickly click through here. That's one shape. We hit Next Shape. You can change the names of your shapes to whatever they may be. When we were using it, we had different body areas. If you look over on the right at the data tables, up here in this table, this example XY table, we're seeing the very specific data points or dots that are being graphed. Over here, we're having these shape IDs be placed. You can see up here, I'm working on shape five. You hit Next Shape. As you put that in, you get these next shapes. These are linked between these two data tables.

Now, let's say I've finished my image and I have everything graphed out. You can hit Finish, and that will just save it, or you can also finish with random data, which I think is an easy way to show it, but also just to actually have the data show up in the right place. Hit Open. Now we have... this is our graph. We can see this data it gave us over here. Shape one, I'll just drop this down to one; over at shape five, we'll put that at 100. You can see over here, shape five, at 100, is our red and has the most involvement. Down here, shape one, and I guess a couple on the lower end, are the darker blue. But you're able to get that stratification. So what you can do is you can either have your own data table, or you can just put in your own data into these shapes as well, and it will adjust in real time. What's really nice about this add-in is you can go right here and you can change these gradients. Right now we have a red to blue. You can change it from this yellow to red. We can have that higher volume with the darker colors and less involvement with the lighter colors really easily.

The two big things that are important to know as you're doing this: if this is your own data table, this shape name column has the Map Role column property, and we see here this map name data table is connected to our example Name.jmp, which is this file over here. It is super important that those have to be linked. If you hit generate random data, it does it for you. If you have your own data table, you have to manually set that. But what you'll notice is if you manually create your own data table, here this shape 1, 2, 3, 4, 5 matches up perfectly with shape 1, 2, 3, 4, 5 here. Those names have to be exact and those roles have to be exact.
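For readers who want to see the underlying idea outside of JMP, here is a small matplotlib sketch of the same pattern: a set of named regions, one value per region, and a color gradient scaled directly to the data. The polygon coordinates, region names, and percentages are all invented for illustration; a real body outline would have many more vertices per region.

```python
# Rough matplotlib analogue of the Custom Map Creator idea: color hand-drawn
# regions by a percentage. All coordinates and values below are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

regions = {                      # region name -> list of (x, y) vertices
    "face":  [(0.4, 0.8), (0.6, 0.8), (0.6, 1.0), (0.4, 1.0)],
    "trunk": [(0.35, 0.4), (0.65, 0.4), (0.65, 0.78), (0.35, 0.78)],
    "legs":  [(0.4, 0.0), (0.6, 0.0), (0.6, 0.38), (0.4, 0.38)],
}
involvement = {"face": 90, "trunk": 25, "legs": 10}   # percent involvement

patches = [Polygon(regions[name], closed=True) for name in regions]
values = np.array([involvement[name] for name in regions])

coll = PatchCollection(patches, cmap="YlOrRd")
coll.set_array(values)            # color each region by its value
coll.set_clim(0, 100)             # scale the gradient directly to 0-100%

fig, ax = plt.subplots()
ax.add_collection(coll)
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_aspect("equal")
fig.colorbar(coll, ax=ax, label="% involvement")
plt.show()
```

Just like with the add-in's Name and XY tables, the region names carrying the values have to line up exactly with the shapes they are supposed to fill.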
If you do that, then you can have this data pop up really easily. You can also get these data tables through Graph Builder as well. But yeah, that's just a quick overview of how to use this add-in. It's very versatile. Like I said, we were able to use it for our project. I hope this gave you a brief bit of guidance on that. I've uploaded some instructions on how to use this as well, which you can also find. I hope you have a great day and enjoy the rest of the presentations you get to watch.
X-ray photoelectron spectroscopy (XPS) analyzes the surface chemistry of materials. It is also known as electron spectroscopy for chemical analysis (ESCA) and is commonly used to measure elemental composition/stoichiometry of thin film coatings in different industries. We have applied this technique to develop Atomic Layer Deposition (ALD) AlOx coating (of < 50 nm thick) processes by testing deposited film composition to identify the O/Al ratio.   Our pharmaceutical program customers questioned whether XPS is an appropriate metrology technique to detect process variation in the coating composition. This presentation demonstrates the adequacy of XPS by using Gauge R&R in JMP 17. We designed our testing experiments using measurement systems analysis (MSA) designs platform, and a fast replicate crossed model was used with six sample coupons on two different kinds of substrates (silicon coupons and active pharmaceutical ingredient [API] pellets). Each sample was split into four parts and all 24 (6 X 4) samples were measured independently by two different vendors blindly.   The data distribution was reviewed using a variety of methods: X-bar and R control chart, performed repeatability, reproducibility, part-to-part variation testing, calculated Gauge R&R (P/TV), P/PV, P/T in MSA Gauge R&R platform. Also, the evaluating the measurement process (EMP) platform was used to determine interclass correlation (ICC) and to identify if any interaction exists with either substrate type or vendor. Both MSA platforms confirmed that part variation is significantly higher than precision level, hence XPS is adequate to detect the variation in the process     Hello .   Good  morning ,  everyone . Good  evening ,  everyone . I 'm  Sukti  Chatterjee . Before  starting  my  presentation , I  would  like  to  introduce  myself   with  few  words . I 'm  Sukti  Chatterjee   from  CTO  team  of  Applied  Materials . It  is  advanced  technology  team , and  our  team  goal  is  to  develop  product adjacent  to  the  semi  industry or  outside  the  semi  industry . For  example , we  are  working  for  aerospace  industry , pharmaceutical  industry , or  industrial  coating . This  example ,  present  example , we  are  taking from  the  pharmaceutical  industry . My  topic  of  the  presentation  today. Gauge  R&R  of  X -ray  photoelectron   spectroscopy  to  monitor  a  coating  process . Agenda  of  my  talk  today. Fi,rst,  we  will  talk  about  the  background and  problem  statement , then  we  will  discuss  about  the  operation definition  and  data  collection  plan . Next ,  MSA  component  analysis . Finally  we 'll  talk  about  the  plan for  MSA  component  improvement . Let 's  start  with  the  background . In  the  pharmaceutical  industry , in  therapeutic  windows ,  therapeutic  areas , there  are ,  for  example ,  antibiotic  drug , alcohol  addiction  or  cancer  patient , they  need  everyday  injection   because  drug  level  in  the  blood is  certainly  increasing ,   spiking  in  the  blood , and  then  within  short  time ,   within  a  few  hours , it  is  going  beyond   the  therapeutic  window  limit . That 's  why  they  need   everyday  injection  and  it  is  painful . It  causes  some  side  effect , that 's  why  patient  skipping the  medication  or  stop  the  medications . To  solve  this  problem , our  customer  needs  some  approach to  tailor  the  release  of  drug . 
Our  team  developed  a  barrier  layer , aluminum  oxide  barrier  layer , that  forms  a  shell   around  the  pharmaceutical  particles . Properties  of  this  barrier  layer can  control  the  release of  the  drug  in  the  blood . Even  it  is  possible  like  that , it  can  release  few  weeks   instead  of  few  hours . Here ,  we  will  talk  about the  composition  analysis , and  what  is  the  noise  analysis of  this  composition  measurement ? That  we  will  talk  here . Our  problem  is  measurement of  AlOx  coating  composition . Our  spec  limit ,   customer  spec  limit ,  is  O /Al  ratio in  the  aluminum  oxide  film  is  1 .2 -2 .3 . Our  objective  here  to  determine   the  XPS  method  if  it  is  adequate to  differentiate  AlOx  process  variation . We  will  determine  here ,   gauge  R&R  measurement  error of  XPS  for  AlOx  composition  analysis . X -ray  photoelectron  spectroscopy can  measure  quantitatively atomic  percentage  of  composition . It  can  measure  aluminum   and  oxygen  percentage . XPS  actually  measure  the  kinetic  energy of  photoelectrons  emitted   from  the  elements and  it  counts  the  electrons . Whenever  it  is  counting  the  electrons , it  can  count  the  presence  of  elements and  also  it  counts   the  element  which  is  bond  to  it . That 's  why  we  can  get  the  information about  aluminum  and  oxygen in  the  aluminum  oxide  film . Most  of  the  source  of  error  for  XPS , it  can  add  it  in  gauge  R&R . It  can  reproducibility ,  it 's  coming from  the  calibration  electron  count . It  can  add  repeatability   and  reproducibility  error . Analysis  can  add  reproducibility  error . We  will  talk  more  details this  one  in  the  next  slide . In  our  operation  definition , we 'll  talk  about  the  different  steps of  the  XPS  measurement   and  how  it  can  introduce the  error  in  the  measurement  error  GRR , gauge  R&R  error . Our  objective  measure  aluminum  oxide coating  composition , and  to  measure  it  in  XPS , first  we  need  to  do  baseline  correction . It  is  automatic , and  then  we  need  to  go  to  the  calibration . In  calibration ,  normally  applied  materials have  calibration  sample , especially  whenever  we  have  some developed  technologies like  aluminum  oxide . But  in  our  cases ,   we  are  coating  pharma  particles and  our  process  window   is  totally  different from  our  applied  materials  core  technology process  window  for  aluminum  oxide , because   coating  need  to  be  compatible   with  the  pharma  particles . We  are  coating  this  particle , at  the  same  time , we  are  coating  also  silicon  wafer and  API  pallet because  XPS  cannot  measure  particles . It  needs  some  planar  substrate . That 's  why  we  are  depositing on  silicon  wafer  and  API  pallet . Since  we  don 't  have  calibration  sample , we  are  using  the  second  option for  calibration   like  carbon  peak  calibration . Left -hand  side  picture ,   you  can  see  carbon  peak  calibration and  it  is  manually  need  to  do  it, and  that 's  why  it  impact   on  the  reproducibility . Then  after  calibration ,  we  need  to  do  XPS  survey or  high  resolution  scan to  get  the  spectra. In  the  spectra , you  can  see oxygen  peak,  aluminum  peak . Since  we  need  to  do  manual  calibration and  we  have  automatic  baseline  correction , this  can  impact  error   on  repeatability  and  reproducibility . 
Next ,  we  need  to  do  analysis . Analysis  is  peak  fitting . We  need  to  fit  this  peak   and  then  we  can  gauge . From  peak  area ,  we  can  calculate the  oxygen  aluminum  percentage . Since  it  is  semi  automatic , it  can  add  error  in  the  reproducibility . By  XPS  measurement , we  are  calculating  O /Al  ratio and  our  customer  spec  limit  is  1 .2 -2 .3 . Next ,  we  will  talk  about the  cause  and  effect  diagram , MSA  cause  and  effect  diagram . In  MSA  cause  and  effect  diagram , we  did  some  detailed  analysis, and  we  found  several  one   it  can  impact  on  the  gauge  R&R . We  highlighted  also  major  ones like  electron  counts , calibration  analysis , we  talked  in  the  earlier  slide . Now  we 're  adding  another  one ,  it 's  sample  loading , how  it  is  added  error  in  the  gauge  R&R . Sample  loading ,  we  need  to  do  it .   It 's  not  automatic ,  it  is  not  full  wafer . We  are  doing  with  coupon  wafer , so  we  need  to  place  the  coupon . If  it  is  location  a  little  bit  different or  angle  is  little  bit  different , then  it  can  impact  on  the  measurement . This  is  impacting  on  reproducibility . All  other  major  one  impact  we  already discussed  in  the  previous  slides . Other  one  is  the  sample . It  depends  on  the  process . For  this  presentation ,  it  is  out  of  scope . We  will  talk  about  these  four   in  this  presentation . Next ,  our  sample  collection  plan . For  our  sample  collection ,   we  use  six  samples  for  MSA  analysis , and  for  these  six  samples   we  have  four  replicates . Here  you  can  see  these  four  replicates . We  measure  those  samples  in  two  sites . Since  we  have  the  four  replicates , we  are  measuring  those  sequentially . Is  it  possible   that  if  samples  are  degraded then  sample  degradation  could  be  a  risk ? We  will  talk  about  this  risk   later  on  more  details . Our  expected  outcomes like  that  we  need  to  find  out  XPS  method is  adequate  to  differentiate   process  variation . Also , you  like  to  gauge  like  that whenever  we  are  measuring  two  sites   that  have  similar  result . Also , we  like  to  gauge  like  that  sample or  part  is  not  interacting  with  the  site . Now  we  need  to  do  the  MSA  design . In  MSA  design ,  we  are  using  substrate and  site  at  the  cost  factor . This  is  the  site ,  this  is  the  part , and  we  have  also  two  different  substrate . We  mentioned  it  before , API  palette  and  silicon  wafer . S  numbers  are  silicon  wafer , A  numbers  are  API  palette . We 'll  not  be  able  to  use  actually   completely  randomized  option and  we  use  first  repeat . For  first  repeat  option , here  we  are  not  changing   the  sample  replicate  number . That  could  impact   on  sample  degradation  problem . That 's  why  later  on  we  will  compare  first   and  fourth  replicate to  check  this  sampling  risk . For  this  MSA  analysis , we  sequentially  use  several  JMP applications  from  JMP  platform . We  use  data  distribution  of  MSA  samples . That  is  from  descriptive inferential  statistics  application . It  is  from  distribution  fit  Y  by  X . Then  we  check  the  data  variability using  control  chart  and  one -way  ANOVA . Then  we  analyze  gauge  R&R  components . It  is  from  the  variability  chart . 
Then  we 'd  like  to  gauge  like   that  what  is  the  relation with  process  capability  with  gauge  R&R . That  we  can  find  out  like  that , interclass  correlation  versus  P /T  plot . Next , we  did  the  root  cause  analysis to  plan  for  improving  the  GRR . We  will  find  out what  is  the  GRR  major  error and  how  we  can  find  out . That  we  are  using  for  box  plot ,   density  ellipse , matched  pairs ,  and  fit  line that  are  different  platform   of  JMP  platform . Let 's  start  with  the  data  distribution . We  developed  process  initially   at  the  two  spec  limit , upper  spec  limit  and  lower  spec  limit . In  upper  spec  limit ,  we  have  two  samples , two  parts  and  four  replicates . All  are  measured  two  sites , and  we  already  mentioned we  did  the  first  repeat . Similarly ,  at  the  lower  spec  limit ,   also  we  have  four  parts , four  replicates  and  two  sites . Since  we  did  the  process  development at  the  two  end  of  the  spec  limit , that 's  why  we  can  see   that  our  distribution  is  bimodal . It's completely  bimodal  distribution . Problem  of  bimodal  distribution , it  can  impact  on  the  GRR  components . It  can  impact  on  P /TV  ratio , it  can  impact  on  P /PV  ratio and  misclassification . Since  P /T  ratio  is  not  related  with ... It  is  not  dependent  with  the  part , that  is  the  reason  P /T  ratio it 's  not  impacting by  the  sample  distribution . That 's  why  we  will  be  used in  our  following  slides . Our  figure  of  merits   we  are  using  as  a  P /T  ratio . For  misclassification  probabilities , there  is  five  probabilities . Last  three ,  it  could  be  impacted by  the  sample  distribution  more , and  first  two  is  less  impacted . To  minimize  the  risk ,  again , we  are  focusing  on  the  P /T  ratio as  a  figure  of  merit . In  the  next  time ,   our  plan  to  do  MSA  analysis using  uniform  sample  distribution . Let 's  check  now  the  variability  of  data . Here  we  can  see  that  we  use  I -MR  chart , individual  moving  range  chart , and  we  saw  that  many  data  points are  outside  the  control  limit   in  the  upper  chart , and  in  the  lower  moving   range chart , we  saw  that  three  data  point   is  outside  the  control  limit , and  that  these  three  data  points ,   it  is  sudden  shift . It  is  sudden  shift , it 's  not  staying  there ,  it  is  going  back . It  means  it  is  the  type  II  shift and  there  is  a  mixture   of  common  cause  variation and  special  cause  variation   in  the  control  chart . That 's  the  reason  here control  limits  are  meaningless . We  need  to  subgrouping  with  special  cause and  then  only  we  can  consider   the  control  limits . Now  we  like  to  find  out what  are  the  special  cause . First  we  will  check  if  part  variation could  be  a  special  cause . We  did  it  using  the  one -way  ANOVA and  in  one -way  ANOVA , we  can  see there  is  a  variation  of  the  samples . We  did  the  process  near  upper  spec  limit and  we  did  the  process  lower  spec  limit . That 's  why  samples  are  different . That  also  we  found  by  one -way  ANOVA , and  here  we  can  see  that within  variation  is  very  small   compared  to  part  variation , and  also  by  analysis   of  variance  is  showing  like  that . 
Here  our  hypothesis   is  all  parts  are  same , but  it  is  rejecting  the  hypothesis because  P -value  is  less  than  0 .05 . It 's  telling  us   it  is  significantly  different . That  means  part  variation is  a  special  cause , so  we  can  use  as  a  candidate for  subgrouping . Again ,  similarly  we  check  with  the  site variation  if  it  is  a  special  cause  or  not . We  considering  two  sites  measurement   near  upper  spec  limit as  well  as  near  lower  spec  limit . We  saw  that  here  our  hypothesis is  two  sites  are  measurement  similar, and  we  found  that  its  P -value   is  higher  than  0 .05 . For  upper  spec  limit , there  is  no  evidence   that  we  can  reject  the  hypothesis . It  is  similar ,  on  the  other  hand ,   for  lower  specs  limit . It  is  marginally  rejected   because  it  is  less  than  0 .05 . For  site  variation ,   either  it  is  marginally  rejected or  there  is  no  evidence  to  reject . That 's  why  site  variation   is  not  a  good  candidate and  part  variation  is   the  better  candidate . What  we  did  next ,  we  make  our  control chart  again  with  phase  option  and  A  here , sampled  part  at  a  different  phase . When  we  do  it ,  we  saw that  in  a  moving  range  chart , we  found  change  in  the  variation in  the  measurement  in  the  moving  range , and  that  calculated  the  control  limits for  the  bottom  chart  and  the  upper  chart . Now  we  saw  that  all  the  points , all  the  measurement  points are  inside  the  control  limit . These  is  the  variations  of  each  sample . It  is  the  repeatability . When  we  consider  site  A  and  site  B , and  we  saw  also  site  B   has  also  repeatability . But  compared  to  site  A  and  site  B , there  is  some  variation  of  repeatability . That  is  called  reproducibility . Now  we  calculate  the  gauge  R&R , all  the  components  in  the  next  slide , and  we 'll  find  out  what  is   the  dominating  error  in  gauge  R&R . First , we  did  main  effect . We  didn 't  consider  for  the  main  effect part  and  site  variation  interaction , so  only  the  main  effect . Here ,  we  saw   the  repeatability ,  reproducibility . Repeatability  is  22 % and  reproducibility  is  15 %. I  already  mentioned  as  a  gauge  R&R , we  are  considering  P /T  ratio because  our  sample   distribution  is  bimodal , and  we  saw  that  P /T  ratio  is  26 %. It  is  passed ,  it  is  less  than  30 %. It  is  marginally  passed , and  major  error  is  22 %  repeatability . One  more  thing  I  should  mention  here , we  are  considering  P /T  ratio but  P /TV  or  P /PV  ratio  is  very  close for  our  measurement  cases because  our  sample  distribution  is  bimodal and  at  the  two  end  of  the  spec  limit . That  is  the  reason  this  ratio  T   or  TV  are  very  close  or  PV  is  very  close . That  is  the  reason  we  have  this  gauge  R&R . This  figure  of  merits  is  very  close . Also , I  should  mention  here  type  I  error alpha  and  type  II  error  beta . Type  I  error ,  all  our  data  points   within  the  control  limit . That 's  the  reason  our  type  I  error good  part  is  falsely  rejected . It 's  very  small .  It  is  less  than  6 %. On  the  other  hand ,   type  II  error ,  it  is  6 %,  it  is  failed . It  is  more  than  10 %. Why  type  II  error  is  higher ? Our  repeatability  is  the  major  issue . 
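To make the arithmetic behind those components concrete, here is a small Python sketch of a crossed gauge R&R calculation from a two-way ANOVA, using the usual expected-mean-square formulas. The file, column names, and balanced layout are assumptions for illustration; this is not the actual XPS data or the JMP MSA platform output.

```python
# Crossed gauge R&R from a two-way ANOVA (part x site with replicates).
# Illustrative only: 'xps.csv' and its columns are hypothetical stand-ins.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("xps.csv")                 # columns: part, site, o_al_ratio
n_part = df["part"].nunique()               # number of parts
n_site = df["site"].nunique()               # number of sites ("operators")
r = len(df) // (n_part * n_site)            # replicates per part/site cell (balanced)

model = smf.ols("o_al_ratio ~ C(part) * C(site)", data=df).fit()
ms = anova_lm(model)["mean_sq"]             # mean squares for each term

repeatability = ms["Residual"]
interaction   = max((ms["C(part):C(site)"] - ms["Residual"]) / r, 0)
site_var      = max((ms["C(site)"] - ms["C(part):C(site)"]) / (n_part * r), 0)
part_var      = max((ms["C(part)"] - ms["C(part):C(site)"]) / (n_site * r), 0)

grr = repeatability + site_var + interaction      # gauge (measurement) variance
usl, lsl = 2.3, 1.2                               # customer spec limits on O/Al
p_to_t = 6 * grr**0.5 / (usl - lsl)               # P/T: six-sigma gauge over tolerance
print(f"P/T = {p_to_t:.1%}")
```

When the repeatability term dominates the gauge variance, as it does in this study, it is the piece that drives both the P/T ratio and the type II (beta) risk.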
Whenever  we  are  measuring  the  samples , it  is  within  the  spec  limit . But  it  is  possible  like  that  whenever a  customer  is  measuring  it . It  could  be  beyond  the  spec  limit because  repeatability  is  high  here . At  this  point ,   since  we  are  developing  the  product , we  are  in  the  initial feasibility  check  phase . Customer  is  happy   with  this  beta  type  II  error , but  we  have  option . If  we  can  improve  the  repeatability ,  then  it  can  improve  this  part  also . On  the  other  hand ,  if  we  can  consider that  part  and  site  interaction , then  we  saw  that  part   and  site  interaction  is  6 %, not  that  much , but  there  is  a  little  bit  interaction. And  when  we  didn 't  consider the  interaction  in  the  main  effect  mode , then  this  interaction  is  added   in  the  repeatability . That 's  why  we  found  that  whenever we  are  considering  the  crossed  effect , we  saw  repeatability   little  bit  decreasing because  our  interaction  is  very  small , not  that  much  decreasing . Since  this  interaction  is  very  small , our  figure  of  merits   are  not  changing  that  much . It  is  changing  from  little  bit . Now  from  here , we  know  that  our  dominating   error  is  repeatability . Before  going  about   the  more  discussion  with  repeatability , first  another  thing   I  would  like  to  mention , process  capability  with  gauge  R&R . Effect  of  gauge  R&R   on  the  process  capability . Here ,  process  capability we  are  plotting  in  ICC  versus  P /T  plot . ICC  is  the  part  variation   to  total  variation and  P /T  is  the  six  sigma  gauge , and  USL  minus  LSL . We  calculated  from  here  Cₚ , and  in  our  cases , in  our  process  current  condition ,   Cₚ  is  0 .93 . It  is  less  than  one . It  is  in  the  red  zone , and  we  need  to  go  Cₚ ... For  a  good  process  capability , we  need  to  go  between  Cₚ  1 .33 -2 . It  is  the  yellow  zone . To  improve  this  Cₚ ,  what  we  need  to  do ? In  this  part ,  this  is  the  process  part and  in  this  direction ,   it  is  the  measurement  part . Process  variability   or  part  variability  is  very  high . For  our  measurement , we  saw  that  our  P /T  is  24 %. If  we  would  like  to  increase ,   if  we  would  like  to  improve  the  P /T from  24  to  suppose  15 %  or  10 %, then  we  have  to  improve  30 % -50 %, and  within  that ,   our  repeatability  is  the  main  issue . That  is  the  reason  we  need to  improve  the  repeatability . Now  it  is  question . If  we  need  to  improve  the  repeatability , do  we  need  to  change  our  measurement  tool ? That  is  again  depending  on  the  ROI that  is  question  to  our  managing  level , or  we  can  address   the  repeatability  in  different  way . That 's  why  we 'd  like  to  find  out  the  root cause  why  repeatability  is  higher . Here  we  are  considering  variability  chart with  analysis  of  variance . Here  we  can  see  that  we  plot   all  the  samples  variability  together with  site  A  and  site  B  measurement . You  can  see  that  suppose ,  for  a  sample  A0 , this  is  the  measurement  repeatability , and  it  is  changing . This  repeatability  is  changing for  all  the  parts . Also  repeatability  is  changing with  the  site  to  site because  here  you  can  see   repeatability  is  0 .06  standard  deviation , but  in  these  cases  when  they  measure their  repeatability  is  0 .03 . 
That  is  the  reason   this  repeatability  is  changing with  part  to  part  also  site  to  site . Whenever  it  is  changing   with  site  to  site , it 's  called  reproducibility . Here  if  you  can  consider   the  analysis  of  variance , then  we  can  see  that  site   to  site  variation  is  much  smaller than  within  variation . This  is  the  repeatability , within  variation ,  and  site  to  site . Site  to  site  variation ,   it  is  reproducibility ,  it 's  much  smaller . Repeatability  again  from  here  also we  find  out  that  it  is  the  bigger  problem . Now  in  the  next   to  find  out  the  root  cause , we  plotted  all  the  repeatability side  by  side  together , and  for  both  the  cases ,  USL , upper  spec  limit  and  lower  spec  limit , and  all  the  cases  we  found that  its  repeatability  is  different . Next  we  like  to  correlate   or  find  out  any  relationship if  it  is  present  site  A   and  site  B  measurement . Ideally ,  site  A  measurement  will  equal to  site  B  measurement  should  be . But  in  our  cases ,  we  did  some  linear  fit and  we  found  that  we  have  intercept as  well  as  we  have  linear  fit  slope , it  is  not  one ,  it  is  not  zero . Here  we  found   that  linear  slope  is  less  than  0 .4 and  intercept  is  higher  than  0 .9 . Our  fitting  points  are  distributed  widely . That 's  the  reason   our  R -squared  is  also  poor . We  also  did  the  density  ellipse and  density  ellipse  also  telling that  this  correlation  is  less  than  0 .5 . If  they  have  a  very  good   correlation  relationship , then  it  should  be  0 .9 . If  it  is  0 .6 ,  then  it  will  be   moderately  correlated . But  in  our  cases ,  it 's  not  that . That 's  why  we  know  that  site  A measurement  is  not  site  B  measurement . It 's  the  repeatability  impacted   on  the  reproducibility . Problem  of  repeatability  is  impacted on  the  reproducibility . Now  we  check  more  closely how  it  is  different . We  are  comparing  by  match  pair the  site  A  and  site  B  variation . Here  our  hypothesis   is  site  A  equal  to  site  B , that  means  site  A   minus  site  B  equal  to  zero . We  saw  that  our  probability   for  this  hypothesis , site  A  minus  site  B  equal  to  zero ,   is  less  than  0 .05  in  both  the  cases . It  is  upper  spec  limit  and  lower  spec . Both  the  cases  you  can  see that  it  is  probability  is  less  than  0 .05 . That  means  site  A  and  site  B   measurement  is  different , and  you  can  see   our  difference  of  mean  value and  confidence  interval is  above  the  zero  point  line . That  means  though  this  is site  A  measurement  is  always  higher for  site  B  measurement . Now  from  here ,  our  question  appears , since  we  did  the  first  repeat  analysis for  our  MSA  design  is  first  repeat ,   it  could  be  possible  like  that if  samples  are  degraded ,   like  O /Al  composition  is  degraded . That 's  why  we  did  again  match  pair  test with  first  and  fourth  measurement both  in  site  A  and  site  B   for  all  six  samples , and  we  found  that  here , first  measurement  minus  fourth measurement  equals  zero . That  is  our  hypothesis . We  saw  that  P -value  is  higher than  0 .05  both  the  cases . That  means  our  sample   degradation  is  not  an  issue . First  sample ,  there  is  no  evidence . First  measurement and  fourth  measurement  is  dissimilar . 
That  means  it  is  the  measurement  issue . For  that ,  this  is  summarized in  the  dashboard  table , in  the  dashboard ,   like  our  figure  of  merit for  gauge  R&R  24 % and  repeatability  is  21 % and  that  repeatability  is  changing from  part  to  part  and  site  to  site , and  we  have  always  higher  repeatability for  site  A  compared  to  the  site  B . Now  for  our  next  plan , we  plan  for  a  discussion  each  site   as  well  as  with  the  process  team . Site  has  a  problem  like  repeatability as  well  part -site  interaction . We  know  that  what  error  could  be introduced  in  the  measurement like  background /baseline  correction , electron  counts ,  peak  deconvolution . We 'll  discuss  those  methods  source   of  error  with  site  A  person , site  A  facility ,  and  we  will  find  out how  we  can  do  the  streamlining  process for  improve  our  MSA . Also  we  have  a  plan  set  up a  calibration  sample or  we  can  set  up  a  set  up  sample that  we  can  measure   in  regular  interval  in  the  both  sites . On  the  other  hand ,  with  the  process  team , we 'll  talk  to  improve  MSA  next  time to  MSA  data  collection  uniform . Instead  of  bimodal ,   we  should  collect  the  data  uniformly . Then  also  we  saw  part to  part  repeatability  variation . There  is  one  reason   it  could  be  measurement  issue . Another  reason could  be  process  is  not  uniform . We  need  to  validate  our  thermal  math to  check  our  process  uniformity . Finally ,  I  would  like  to  mention that  what  is  the  impact on  my  learning  for  this  MSA  analysis . Now  we  know  that  several  JMP  platform or  JMP  application  can  help  me  to  know what  is  the  signal  variation from  the  noise  variation , and  then  we  can  identify   what  figure  of  merit  we  can  use to  justify  our  measurement  method . In  our  cases ,   we  found  P /T  is  the  best  method , best  figure  of  merit  to  analyze  it . Then  how  misclassification  risk   can  relate  to  the  MSA  component as  well  as  sample  distribution   that  we  learn . Root  cause  analysis , we  did  several  JMP  application that  can  help  us  to  plan  to  improving  MSA . Since  it  is  very  helpful   for  particular  program  application , that 's  why  I  would  like  to  introduce this  data  driven  decision  making for  all  the  programs  I  involve  in to  improve  the  project  quality ,   cost ,  and  time . Finally ,  I  would  like  to  promote  data driven  decision  using  JMP in  our  advanced  technology   group  like  CTO  team , or  other  different  projects . This  is  my  final  slide . I  would  like  to  mention  my  journey . I  started  JMP  learning   beginning  of  the  year , and  that  time  we  did  A0 ,  A1 ,  A2 . This  is  my  foundation . Then  after  I  work with  MSA  analysis  and  SPC . I  also  got  my  certificate , JMP  STIPS  certificate  May  2023 . Now  I  am  instructor   at  AMAT  JMP  instructor . I 'm  planning  to  in  person presentation  in  October  2023 , and  also  I  am  working for  my  Black  Belt  on  2024 . Thank  you  for  listening .
Sterility breaches of pre-filled syringes of a drug product are not directly measured but are known to be a function of syringe dimensions, plunger movement, and fill weight. Fill weight is dynamically controlled, so a non-parametric fit in the JMP Distribution platform was used to fit a kernel density based on real-world data. JSL was used to simulate 10 million iterations based on the non-parametric fit, along with plunger movement simulations based on dimension specifications and measured frictional forces. Process time for the simulations was reduced more than three-fold by using invisible tables, simplifying the output, and eliminating saved formulas.

My name is Briana Russo, and I'm a senior statistician at the Center for Mathematical Sciences at Merck. Today I'll be going over simulating sterility breaches with non-parametric data. At Merck, we often deliver our liquid formulated drugs in prefilled syringes. A group at Merck that specialized in that came to me asking to see if I could simulate whether there is any risk of sterility breaches in them, based on historical data and some different scenarios they wanted to look at. There were two interesting parts of this that I wanted to go over today in my poster and discuss a little further. The first was some of the historical data; specifically, the fill weight was non-normally distributed. When filling the syringes, the process isn't necessarily running to a target. It's able to move within a range and even drift outside of that range for a bit before being corrected. That often results in some heavy tailing of the data, which you can see in the bottom left here. That's an example of that. We wanted to make sure that we were capturing that heavy tailing, because obviously that's where the highest risk is going to be. The other interesting part, which goes specifically into some JSL scripting, is that I was dealing with a large number of iterations asked for by the customer. They were looking for 10 million per scenario, because that's the order of magnitude at which they were expecting to create the syringes. During the project, I was able to discover some techniques to reduce the processing load on JMP that significantly reduced the process time when I was running the simulations and prevented any crashing or anything like that from memory issues. I'll touch on both of those things.

But first, I wanted to go into a little bit more background on the prefilled syringes and what we were looking at. As I mentioned, we have the fill weight data. That's the amount of liquid that's filled into the syringe. That, again, I wanted to look at non-parametrically using a density function. I found that that was very easy to do in JSL. I'll show how I did that. Then the other aspect was the plunger insertion depth: how deep is the plunger being inserted and how close is that to the liquid fill? Then the dimensions of the prefilled syringe. There is some variability from the manufacturer, and I wanted to make sure that was being captured. There were two key outputs, and each was a yes or no output.
The first was, we want to make sure that we were maintaining a gap between the liquid fill and the plunger. Because if we don't, then we're going to be getting liquid up on the plunger, and that could be a sterility risk. We wanted to make sure that the air gap length was always greater than zero. The other one was, we also don't want that air gap to be too big, because when we're shipping the syringes, say, on an airplane, they might be exposed to lower atmospheric pressures, which can cause the plunger to move up. If it moved up too much, it could go beyond a sterile barrier that was created when the plunger was inserted. We don't want it to go too low. We don't want it to go too big. But there's a lot that goes into the plunger movement, not only the air gap, which is a function of the dimensions of the plunger, how deep the plunger was inserted, and how close it is to the fill, but also different atmospheric pressures and the cross-sectional area, so the dimensions of the syringe. There are a lot of different inputs and different sources of variability potentially feeding into that plunger movement. I wanted to be able to simulate all of those. That meant that I knew the data table in JMP that I wanted to simulate into was going to be very big.

The first change that I was able to make, to make these simulations a lot more efficient, was actually just opening up the historical data that I was going to use, the data table I was going to use, as invisible. This made it so JMP didn't have to render the table, this potentially massive table I was going to create, and was able to really reduce process time and also prevent JMP from crashing at times when the memory of my laptop was exceeded. Once I opened up the historical data as invisible, I then would add enough blank rows to it to get me to 10 million, because obviously my historical data wasn't that big. But I wanted to make sure that the data table had 10 million rows, so then I could go ahead and simulate 10 million iterations. Specifically, what I did for the non-parametric aspect of the data was I fit the data in the Distribution platform in JMP, and then I was able to just very easily use the Fit Smooth Curve function to save simulations from that non-parametric fit to 10 million iterations. It is a super simple and easy way to get, essentially, kernel density function simulated values.

The other two things that really improved my simulation were, as I mentioned, that there were a lot of different calculations that I was doing within a data table, and different scenarios, over 20 different plunger depth targets, for example, that we wanted to look at. As part of my JSL script, I wanted to be looping over different scenarios. But if I was just going to create a column that then referenced previous columns in a loop, that could cause reference issues for each iteration of the loop, because I would end up with essentially all of the new columns having the same formula, because they'd all just end up referencing whatever the last iteration of the loop was.
To prevent that, if I wanted to use a formula for the column, I would then need to delete the formula. Again, very inefficient. One very simple and easy way that I could get around this was, instead of saving a formula for a new column, to just use Set Each Value. This means that JMP didn't need to save the formula at all. It eliminated that issue with the looping reference and then also, again, reduced process time.

The final improvement that I made was by really working with my customer in this case and really figuring out what exactly they needed; I was able to streamline things a lot. Initially, I was just giving them the kitchen sink: distributions and histograms of every single parameter and output, which they thought was interesting but was not really worth the effort or the process time. What they really wanted was the % failure rate for these two outputs. I was able to make delivering that a lot more efficient by eliminating the need to open up, say, the Distribution platform and try to fit 10 million rows. Instead, I just created a column where, if a sterility breach occurred, it was a one, and if it didn't, it was a zero. Then it was very easy to just calculate the column mean to give the percentage of failure for any scenario and directly output that to a journal. That way, the journal also didn't have to be massive from saving so much information and so many graphs from the data table.

Overall, initially in this project I was able to deliver it, but using the platform outputs, visible tables, and saved formulas, it was taking at least three hours. Often, I was letting it run overnight, so I don't know the exact timing, but at least three hours. By simplifying the output alone, so going directly to the journal instead of saving from, say, the Distribution platform in JMP, I was able to get this down to an hour and 49 minutes. Then just those two simple changes, making sure that the data table was invisible and saving values instead of saving the formula, got me down to 52 minutes despite the volume of calculations that needed to be made. Overall, it can be very simple and easy to simulate non-parametric data within JMP using these data tables and the Fit Smooth Curve function. Also, if you are simulating really big data sets in JMP, if you simplify the output and make sure that JMP isn't rendering things it doesn't need to, or calculating and saving things it doesn't need to, it can actually be very efficient in creating the simulations and giving you the outputs. In this particular case, using those techniques, I was able to reduce my simulation time more than three-fold. That's all I have. Thanks for listening.
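The invisible-table, Set Each Value, and column-mean tricks above are JSL-specific, but the overall simulation pattern is easy to sketch in other tools as well. Purely for illustration, here is a rough Python analogue with invented file names, limits, and input distributions; the geometry is a toy stand-in, not the real plunger-movement model.

```python
# Rough analogue of the workflow: fit a kernel density to historical fill
# weights, draw 10 million values, simulate the other inputs, and report the
# breach rate. All names, limits, and distributions here are placeholders.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
fill_hist = np.loadtxt("fill_weight_history.csv")   # historical fill weights

n = 10_000_000
fill = gaussian_kde(fill_hist).resample(n)[0]       # non-parametric (KDE) draws

# Hypothetical placeholders for the other sources of variability.
plunger_depth = rng.normal(8.0, 0.1, n)             # mm, insertion depth
syringe_area  = rng.normal(1.0, 0.02, n)            # cm^2, cross-sectional area

air_gap = plunger_depth - fill / syringe_area       # toy geometry, not the real model
max_gap = 9.0                                       # hypothetical sterile-barrier limit

breach = (air_gap <= 0) | (air_gap > max_gap)       # 1/0 flag, like the JSL column
print(f"breach rate = {breach.mean():.4%}")
```

The same three ideas carry over: avoid rendering what you do not need, compute values once instead of storing formulas, and reduce the output to the single failure-rate summary the customer actually wants.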
Functional or curved responses frequently occur in industry. Thanks to new features in JMP, we can now model and predict functional responses using key DOE or product design factors with JMP Functional DOE or Curve DOE modeling. A Functional DOE model is purely empirical. However, a Curve DOE model can incorporate mechanistic or expert knowledge on the functional form of the curved responses. In this presentation, the methods and results of predicting functional responses using Functional DOE and Curve DOE modeling are compared using case studies from the consumer product industry.

Hello, my name is Fangyi Luo and I'm from Procter & Gamble. Today I'm presenting with Chris Gotwalt from JMP. We're going to talk about how to model data from designed experiments when the response is a functional curve. Functional or curve responses occur very often in industry. Thanks to the new development in JMP, we can now model and predict functional responses as a function of key DOE or product design factors using either functional DOE or curve DOE modeling. A functional DOE model is purely empirical. However, a curve DOE model can take into account mechanistic or expert knowledge on the functional form of the curve responses. In this presentation, the methods and results of predicting functional responses using functional DOE and curve DOE modeling will be compared using case studies from the consumer product industry.

This is the outline of the talk. We will break the talk into two parts. In the first part, Chris will talk about what functional data are, show examples of functional data, and then help you build a fundamental understanding of functional DOE modeling, including functional principal component analysis as well as curve DOE modeling. In the second part, I will use two examples from Procter & Gamble and compare the results of functional DOE and curve DOE modeling using these two examples. The first example is modeling viscosity over time data from a formulation experiment. The second example is modeling absorption volume over time data from a diaper design of experiment. Then I will finish the talk with a brief summary and conclusion.

Thanks, Fangyi. Now I'm going to give a quick intro to functional and curve data analysis. But first I want to point out that there is a lot of this kind of data out there, and JMP really has made analyzing curve response data as fast, easy, and accurate as possible. If you haven't heard of functional data analysis before, you have certainly seen it out there. It's all over the place, and I'll show you some examples to make that clear. For example, here are annual home price indices from 1992-2021 for all 50 US states. Each function has a beginning measurement followed by a sequence of other measurements and then a final measurement. They all have a beginning, a middle, and an end. The functions don't have to all have the same start and endpoints or measurements at the same times. In a time series analysis, we are really interested in using data to predict forward into the future using data observed from the past.
In a functional data analysis or a curve data analysis, we are generally more interested in explaining the variation internal to the functions than predicting beyond the range of times we've observed. In product and process improvement in industry, we are often working on non-financial curves. I'm going to show you some examples that our customers have shared with us. Here we see a set of infrared spectra of gasoline samples used to develop an inexpensive tool to measure octane in gasoline. The green curves had high octane, and the red ones were low in octane. The height of the left peak turned out to be critical for predicting octane level. Microbial growth curves are a common type of functional data in the biotech industry.

Today, Fangyi will be demonstrating two methods in JMP that can be used for analyzing DOEs where the response is a set of measurements. The first method is called functional DOE analysis and is best for complicated response functions like spectra, when you need the model to really learn the curves in the data from scratch. The second is a curve DOE analysis, which is based on nonlinear regression models. When you can use the curve DOE analysis, I found that you get more accurate results with it. But if you can't get the curve DOE analysis to work, you can always fall back on the functional DOE analysis, as it's more general than curve DOE.

The critical step in functional data analysis that will be new to most people is called functional principal components analysis, also called FPCA for short. This is how we decompose the curves into shape components that describe the typical patterns we see in the curves, as well as weights that indicate how strongly each individual curve correlates with those shape components. It's a kind of dimension reduction and data compression technique that reduces all the information in the curves into the most compact representation possible. To illustrate FPCA, take a look at the set of curves in the plot here. What do they have in common? How do they differ from one another? What I see in common is a set of peak shapes with one peak per curve, and the shapes go to zero away from the peak. They also appear to be symmetric around the center of the peak. In terms of differences, I see variation in peak heights, there are clear horizontal shifts from left to right, and some curves are also narrower than other ones.

In a functional data analysis, the first thing we do is find a smoothing model that converts or approximates the discrete measurements, turning them into continuous functions. There's a variety of smoothing models in FDE. I don't really have a firm rule as to which one is the best in general, but here are my observations about the most common ones. Wavelets and splines have different strengths. Wavelets are new in JMP Pro 17, are very fast, and are generally the best with complicated functions such as spectra, as long as the X coordinates of the data are on a grid.
On the other hand, there are B-splines and P-splines, which are slower computationally but are better for data with irregularly spaced Xs, and are also often better when there are only a dozen or fewer measurements per function. If the data aren't large, I would try both splines and wavelets and see which one is giving us the best fit by looking at the graphs. The main graphs I use to make decisions about smoothing models are actual by predicted plots, and you want the one that hugs the 45-degree line more closely. In this case, I would choose the wavelets model on the right over the spline model on the left, because those points are tighter around that 45-degree line.

Immediately after JMP Pro fits a smoothing model to the data, it decomposes the signals into dominant characteristic shapes it found in the data. In mathematical language, these shapes are called eigenfunctions, but a better and more approachable name would be to call them shape components. Here we see that JMP has found that the overall mean function is a peak shape and that there are three shape components that explain 97% of the variation in the data. The first shape component appears to correspond to a peak height. I've learned to recognize that the second shape is a type of left-right peak shift pattern and that the third shape component is something that would control the peak width. Remember that these are shapes learned from the data, not something that I gave JMP outside of the data. What has happened is that the observed spectra in the data have been decomposed into an additive combination of the shape components with unique weights for each individual curve. The functional PCA is like reverse engineering the recipe of the curves in terms of the shape components. The mean function is the thing that they all have in common. The shape components are the main ingredients. And the weights are the amounts of the ingredients in the individual curves.

The functional DOE analysis is the same mathematically as extracting the scores or weights and modeling them in Fit Model with the Generalized Regression platform. Fortunately, there is a red triangle option in the Functional Data Explorer that automates the modeling, linking up the DOE models with the shape functions for you and presenting you with a profiler that connects the DOE models with the shape functions. You can directly see how changing the DOE factors leads to changes in the predicted curve or spectra. There are many potential applications of functional DOE analysis, some of which Fangyi will be presenting later in this talk. There is another approach in JMP called curve DOE modeling. This answers the same kind of question as functional DOE, but it is nonlinear regression based rather than spline or wavelet based.
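As a purely illustrative sketch of what "nonlinear regression based" means in practice (outside JMP, with an invented data file, factor names, and a three-parameter logistic chosen as the example curve), the idea is to fit one parametric curve per run and then model the fitted parameters as functions of the DOE factors:

```python
# Sketch of the curve DOE idea outside JMP: fit the same nonlinear model to
# every run, then regress the fitted parameters on the DOE factors.
# The file, column names, and model choice are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
import statsmodels.formula.api as smf

def logistic3(t, asymptote, growth_rate, inflection):
    return asymptote / (1 + np.exp(-growth_rate * (t - inflection)))

runs = pd.read_csv("doe_curves.csv")    # columns: run, factor1, factor2, time, y

rows = []
for run_id, grp in runs.groupby("run"):
    popt, _ = curve_fit(logistic3, grp["time"], grp["y"],
                        p0=[grp["y"].max(), 1.0, grp["time"].median()])
    row = {"run": run_id, "asymptote": popt[0],
           "growth_rate": popt[1], "inflection": popt[2]}
    row.update(grp.iloc[0][["factor1", "factor2"]].to_dict())
    rows.append(row)
params = pd.DataFrame(rows)

# Model each curve parameter as a function of the DOE factors.
for name in ["asymptote", "growth_rate", "inflection"]:
    fit = smf.ols(f"{name} ~ factor1 + factor2", data=params).fit()
    print(name, fit.params.round(3).to_dict())
```

In JMP, the Fit Curve platform and Generalized Regression handle those two steps for you, as described next.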
What  that  means  is  that  if  you  have a  good  idea  of  a  nonlinear  model, like  a  three- parameter  logistic  model, and  if  that  model  fits  your  data  well, you  can  get  models  and  results that  generalize  better than  a  functional  DOE  model, because  the  general  shape  of  the  curve doesn't  have  to  be  learned  from  scratch from  the  data  using  splines  or  wavelets. The  idea  being  that  if  you  can  make assumptions  about  your  data that  reproduce  the  modeling  effort  needed, your  predictions  will  be  more  accurate, especially  from  small  data  sets. Curve  DOE  analysis has  a  very  similar  workflow to  a  functional  DOE  analysis, except  that  you  go  through the  Fit  Curve  platform instead  of  the  functional  Data  Explorer, and  instead  of  choosing wavelets  or  splines, you  chose  a  parametric  model from  the  platform. Just  like  in  a  functional  DOE  analysis, you  want  to  review the  actual  by predicted  plot to  make  sure  that  your  nonlinear  model is  doing  a  good  job  of  fitting  the  data. A   curve DOE  analysis is  the  same  as  modeling the  nonlinear  regression  parameters extracted  from  the  curves  using the  generalized  regression  platform. This  is  the  same  thing  as  what's  going  on with  a  functional  DOE  analysis with  the  FPCA  weights. Fit Curve  automates  the  modeling and  visualization  just  as  FDE  does. Once  you  know  functional  DOE  analysis, it's  really  not  very  hard  at  all to  learn   curve DOE  analysis. Now  I'm  going  to  hand  it  over  to  F angyi who  has  some  nice  examples  illustrating functional  DOE  and   curve DOE. Thanks  Chris. Next  I'm  going  to  talk  about two  examples  from   Procter & Gamble. The  first  example is  viscosity  over  time  curves collected  from  a  number of  historical  formulation  experiments for  the  same  type  of  liquid  formulation. There  are  six  factors  we  would like  to  consider  for  the  modeling. They  are  all  formulation  ingredients and  we  call  them  factor  one  to  factor  six. The  goal  of  our  modeling is  to  use  these  formulation  factors to  predict  or  optimize viscosity  over  time  curve. The  response  of  modeling is  viscosity  over  time. This  slide  showed  you some  viscosity  over  time  data. For  majority  of  our  formulations, the  viscosity  of  the  formulations would  increase  first  with  time and  then  decrease  later  on. Next,  we're  going  to  perform  functional DOE  analysis  on  viscosity  over  time  data. Before  functional  DOE  analysis, we  need  to  perform functional  principal  component  analysis on  the  curves  smooth using  different  method. Here,  we  apply  functional  principal component  analysis to  the  curves  first  using  B-s plines and   find  five functional  principal  component where  they  cumulatively  explains about  100%  of  variation  in  the  curves. Each  of  the  curve  would  express as  the  sum  of  the  mean  function plus  linear  combination of  the  five  functional principal  components or  eigen functions also  called  shape  function. We  also  apply  direct  functional  principal component  analysis  to  the  data where  it  find  four functional  principal  components that  cumulatively  explains about  100%  of  variation across  viscosity  over  time  curves. E ach  curve  will  then be  expressed  as  the  mean  function plus  linear  combination  of  the four  functional  principal  components. 
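As a rough outside-JMP analogue of those two options, the sketch below smooths each curve with a scipy smoothing spline before the decomposition, standing in for the B-spline option, and also decomposes the raw measurements directly, standing in for direct functional PCA. The synthetic viscosity-like data, the smoothing parameter, and the component counts are illustrative assumptions, not the P&G data or JMP's algorithm.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Illustrative stand-in for viscosity-over-time measurements (one row per formulation)
rng = np.random.default_rng(2)
t = np.linspace(0, 24, 60)
raw = np.array([50 * np.exp(-0.5 * ((t - b) / 5) ** 2) + rng.normal(0, 2, t.size)
                for b in rng.uniform(6, 14, 20)])

# Option 1: smooth each curve with a spline first, then decompose the smoothed curves
smoothed = np.array([UnivariateSpline(t, y, s=len(t) * 4.0)(t) for y in raw])

# Option 2 ("direct"): decompose the raw measurements without a separate smoothing step
def fpca(curves, n_components):
    mean_curve = curves.mean(axis=0)
    U, s, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
    explained = (s ** 2 / np.sum(s ** 2))[:n_components].sum()
    return mean_curve, Vt[:n_components], explained

for label, data, k in [("B-spline smoothed", smoothed, 5), ("direct", raw, 4)]:
    _, _, explained = fpca(data, k)
    print(f"{label}: {k} components explain {explained:.1%} of the variation")
```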
This slide compares the functional principal component analysis model fit using two different options. The first one uses the B-spline option and the second one uses direct functional PCA. As you can see, using the B-spline option, the model fit is smoother compared to the fit from direct functional PCA. This slide shows the diagnostic plots, the observed versus predicted viscosity from the functional principal component analysis using the two options. With direct functional PCA, the points are closer to the 45-degree line than with the B-spline option, indicating that direct functional PCA fits the viscosity over time data slightly better than the functional principal component analysis using the B-spline option. After performing functional principal component analysis, there's an option in JMP to perform functional DOE modeling and get a functional DOE profiler. Functional DOE modeling basically combines the functional principal component analysis with a model for the functional principal component scores as functions of the formulation factors. With this profiler we can predict the functional response, in our case the viscosity over time curve, for different formulation factors. You can select a combination of the formulation factors and it is able to predict the viscosity over time curve. This slide shows the diagnostic plots, the observed versus predicted viscosity and also the residual plots from the functional DOE modeling. As you can see, the residuals from the functional DOE modeling are larger than those from the functional principal component analysis before the functional DOE modeling. Our colleagues at Procter & Gamble actually found that a Gaussian Peak model fits the individual viscosity curves very well. This Gaussian Peak model has three parameters A, B, C: A indicates the peak value of the viscosity over time curve, B is the critical point, the time when viscosity reaches its maximum, and C is a growth rate, the rate of the viscosity increase during the initial phase. This is the fitting of the viscosity over time curves using the Gaussian Peak model with the curve fitting feature in JMP. These are the diagnostic plots of the viscosity curve fitting using the Gaussian Peak model. The model fitting is not too bad; however, the errors seem to be larger than the errors from the fitting using functional principal component analysis. After fitting the curves with the Gaussian Peak model, there's an option in JMP to perform curve DOE modeling. Basically, the curve DOE model combines the parametric model for the curves, the Gaussian Peak model, with a model for the parameters of the Gaussian Peak model, expressing each parameter as a function of the formulation factors using generalized regression models. Then you get the curve DOE model, and this is a profiler of the curve DOE model. Using this profiler you can predict the shape of the curve by specifying a combination of the formulation factors; a sketch of this two-step fit is given below.
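Here is the promised sketch of the two-step fit, using scipy and numpy. It assumes the common Gaussian peak parameterization A·exp(-0.5·((t - B)/C)²), which matches the description of A, B, and C above but may differ in detail from JMP's Fit Curve formula; the factor matrix, the synthetic curves, and the use of plain least squares in the second stage are all illustrative stand-ins for the real data and generalized regression.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_peak(t, A, B, C):
    # A = peak viscosity, B = time of the peak, C = width / growth-rate parameter
    return A * np.exp(-0.5 * ((t - B) / C) ** 2)

rng = np.random.default_rng(3)
t = np.linspace(0, 24, 60)
X = rng.uniform(-1, 1, size=(20, 6))                    # six formulation factors (illustrative)
true_A, true_B = 40 + 10 * X[:, 0], 10 + 3 * X[:, 1]    # synthetic dependence on the factors
curves = [gaussian_peak(t, a, b, 4) + rng.normal(0, 1, t.size) for a, b in zip(true_A, true_B)]

# Step 1 (Fit Curve analogue): fit A, B, C separately for each formulation run
params = np.array([curve_fit(gaussian_peak, t, y, p0=[40, 10, 4])[0] for y in curves])

# Step 2 (curve DOE analogue): model each fitted parameter as a function of the factors
X1 = np.column_stack([np.ones(len(X)), X])
coefs = np.linalg.lstsq(X1, params, rcond=None)[0]

def predict_curve(factors):
    A, B, C = np.concatenate([[1.0], factors]) @ coefs
    return gaussian_peak(t, A, B, C)                     # the curve a curve-DOE profiler would show

profile = predict_curve([0, 0, 0, 0, 0, 0])              # center-point prediction, for example
```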
Actually, this profiler is somewhat different from the functional DOE profiler we got previously. These are the diagnostic plots from the curve DOE model. As you can see here, the curve DOE model does not fit the data well, and it is much worse than the functional DOE model. These are the curve DOE model fits on the original data. For a number of formulations, the curve DOE model does not fit the data well. This is a comparison of the profilers from the functional DOE model and the curve DOE model; the profilers look quite different. This compares the diagnostic plots from the functional DOE model and the curve DOE model. The functional DOE model fits the data much better than the curve DOE model, with a smaller root mean square error. Now I'm going to show you the second example. This example is from a diaper design of experiment with four different products A, B, C, D at three different stations labeled S1, S2 and S3, so it's a factorial design. Diaper absorption volume was measured over time for these four products at the three stations. The response is diaper absorption volume over time, and the goal is to understand the differences in diaper absorption curves across products and stations. These are a few examples of diaper absorption volume over time curves, where the fitted lines are smoothing curves. We performed functional principal component analysis on the diaper absorption volume over time curves, and it found five functional principal components that cumulatively explain almost 100% of the variation among the curves. These are the functional principal component analysis model fits. As you can see, for almost all the curves, the fitted curve plateaus after a certain time point. The functional principal component analysis model fits the curves really well, as you can see from the diagnostic plots. We performed functional DOE modeling on the functional principal component analysis, and this is the profiler of the functional DOE model. This model allows us to evaluate the shape of the curve for different diaper products at different measuring stations. The product comparison at station two seems to be different from the product comparisons at stations one and three. These are the diagnostic plots of the functional DOE model. Next, we would like to perform curve DOE modeling. Before curve DOE modeling, we would like to find a parametric model that fits the diaper absorption volume over time data well. I found that there's a function in JMP called the Biexponential 4P model. This model is a mixture of two exponential models with four unknown parameters, and it fits all the diaper absorption volume over time curves really well. These are the diagnostic plots of the curve fitting, and you can see that the Biexponential 4P model fits all the curves really well. After fitting the diaper absorption volume over time curves with the Biexponential 4P model, we performed curve DOE modeling in JMP, and this is the profiler of the curve DOE model; a sketch of the biexponential fit itself follows below.
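The talk does not spell out the Biexponential 4P formula. One common four-parameter biexponential is a·e^(-b·t) + c·e^(-d·t), and JMP's exact parameterization may differ; the curve_fit sketch below uses that assumed form with synthetic absorption-like data.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexponential_4p(t, a, b, c, d):
    # One common 4-parameter biexponential form (an assumption; JMP's
    # "Biexponential 4P" parameterization may differ in detail)
    return a * np.exp(-b * t) + c * np.exp(-d * t)

rng = np.random.default_rng(4)
t = np.linspace(0, 60, 40)                     # time, illustrative units
# Synthetic absorption-like curve: rises quickly, then plateaus
y = biexponential_4p(t, 320, 0.002, -300, 0.15) + rng.normal(0, 5, t.size)

popt, _ = curve_fit(biexponential_4p, t, y, p0=[300, 0.005, -250, 0.1], maxfev=10000)
print("fitted a, b, c, d:", np.round(popt, 4))

# In the curve DOE step, the four fitted parameters (one set per diaper/station run)
# would then be modeled as functions of product and station, as in the Gaussian Peak sketch.
```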
Using this profiler, you are able to see the shape of the curve as a function of diaper product as well as measuring station. This is the profiler for product A at station two and then station three. These are the diagnostic plots of the curve DOE model, and you can see that the curve DOE model fits the data well, except that at higher diaper absorption volumes the residuals get larger. These are the curve DOE model fits on the original data. For most of the curves, this model fits the data really well. This compares the profiler of the functional DOE model versus the curve DOE model. You may notice that there is some difference between these two profilers at the later time points. The predicted diaper absorption volume at later time points tends to plateau in the functional DOE model, but it continues to increase at later time points in the curve DOE model. This compares the diagnostic plots from the functional DOE model versus the curve DOE model using the Biexponential 4P model. Both of these models fit the data really well, with functional DOE being slightly better, with a slightly smaller root mean square error. Now you have seen the comparison of functional DOE modeling versus curve DOE modeling using two P&G examples, and this is our summary and conclusions. Functional DOE modeling is always a good choice. When the parametric model fits all the curve data well, curve DOE modeling may perform really well. However, if the parametric model does not fit the curve data well, then curve DOE modeling may perform poorly. The functional DOE model is purely empirical, whereas the curve DOE model can take mechanistic understanding or expert knowledge into account, so it can be a hybrid. It's good to try different methods, such as different smoothing methods, before functional principal component analysis, and for DOE modeling of curves, to try both the functional DOE model and the curve DOE model and see which one performs best. This is the end of our presentation. Thank you all for your attention.
Monday, October 16, 2023
JMP has long been well-suited for producing production quality tables and graphs. Users have many options, such as taking screen shots or saving JMP reports in specific formats or even using JSL to place pieces of output reports in specific MS Word or Excel files. We have had a strong demand in recent customer engagements to go beyond providing solid analytical solutions accompanied by the excellent JMP tabular and graphical output as the teams look to operationalize a solution.   Operationalization often means deploying results to nonstatistical audiences using templates or special formats tailored more toward general consumption. We demonstrate a method integrating streamlined Python code generated with AI apps to effectively create output documents with advanced MS Word, which is not possible with JSL alone and doesn't require difficult Visual Basic code. We also show several other deployment tips and tricks that don't require Python.       Hello,  Team  Discovery.   I'm  Jim  Wisnowski, along  with  Andrew  Karl  from  Adsurgo, and  we're  here  to  talk a  little  bit  about  operationalization and  how  you  can  effectively  employ some  JMP  capabilities  to  do  so. The  genesis  of  this  presentation  came from  engagement  with  a  customer  last  year around  this  time  who  said, "Our  company  just  loves  JMP as  a  sandbox  tool and  maybe  in  the  lab, but  it  really  doesn't  do  anything for  operationalization  for  us." At  which  point,  it  struck  me  as, "I'm  not  really  sure  I  understand what  the  word  operationalization nor  really  know how  to  say  it  necessarily." S omehow  there's  a  disconnect because  all  of  our  projects, we  seem  like  we  deliver  a  product, and  it  is  finished. What  we  did  is  we  figured  out, "What  exactly  is  operationalization and  then  how  can  we  do some  neat  things  in  JMP  with  it?" What  we're  going  to  do is  I'm  going  to  talk  about  what  it  is and  then  give  you  some  quick  wins in  the  form  of  the  depot. Then  Andrew,  as  always,  will  do  the  work where  it  takes  the  high  intellect, and  he'll  also  show  you the  beautiful  part  of  operationalization. Key  here  is  we  all  know  all  models are  wrong  and  some  are  useful, but  the  other  side  of  that is  your  boss  or  your  boss's  boss, she  is  thinking  all  models  are  useless, whether  she  says  it  explicit  or  not. They're  useless  unless  somehow   they're  supporting  a  decision  of  mine or  we  get  them into  the  production  environment. We  need  to  make  sure  that even  though  we  have  a  good  model, that  it  can  go  to  the  next  step and  be  implemented. By  the  way,  I  do  want  to  show  George  Box over  here  on  the  right, and  then  Stu  Hunter  on  the  left, who  just  celebrated  his  100th  birthday on  the  3rd  of  June. By  definition,  what  is  operationalization? When  we  think  of  things, we  can  think  of  ideas such  as  the  leadership  saying  the  usual, What  is  the  problem? What  do  I  do  about  it? How  does  it  impact  me? And  then  the  now  what? What  is  the  solution? The  solution  isn't  necessarily all  the  things  that  we  do in  the  data  analytics,  data  science, world of  the  discovery, data  prep,  and  all  that. It  really  lies in  the  operationalization  piece for  that  senior leader  to  understand how  you're  going  to  solve  the  problem. 
In  other  words,  it's  really  how  do  we  get from  that  lab  environment to  the  production  line where  we  have  a  solution that's  going  to  be  useful  to  us. As  we  do  that,  let's  not  forget SAS's  recommendation  here that  we  better  make  sure  that we  have  a  good  disciplined  approach as  well  as  automated  in  that  world. Next  up,  we  can  dig  a  little  bit  deeper into  what  operationalization  is on  a  micro  level  kind of thing. I  asked  Chat GPT-4,  I  said, "Can  you  give  me  an  image of  what  operationalization looks  like  in  data  science?" This  is  essentially  what  they  did. I  had  to  do  the  graphics, but  they  gave  me  all  the  words. If  we  think  about  our  usual  data, do  our  data  development, and  then  we  figure  out  what  model  we  like, and  then  we  have  to  figure  out how  we're  going  to  employ  or  deploy  that, what  language? Is  it  going  to  be JavaScript,  C,  or  Python? Then  we  do  the  deployment, and  then  we  do  maybe perhaps  an  API  integration. Good  news  is  JMP  has  a  lot of  tools  for  us  to  do  that. We're  not  left  in  just  that  lab environment  as  suggested. Then  on  the  bottom  here, we  got  the  idea  that once  we  have  that  model  out  there, it's  not  a  launch  and  lead  kind of  thing. We  have  to  babysit  it and  perhaps  update  hyperparameters or  add  new  data and  see  if  it's  still  valid. Then  we  have  this  idea  here that  you  know  what? Not only are  our  users  liking  it, they  want  to  add  more  capabilities, so  we  start  scaling  up. We  have  to  make  sure  that  we  continue our  good  configuration  management and  data  compliance  and  documentation, ultimately  resulting  in  business  value. The  bottom  line  is  how  do  I  go from  this  lab  and  sandbox  environment to  having  business  value? That's  what  we're  looking  for in  our  operationalization. Forbes  gives  the  five  steps  here. Important  for  us  is  to  think  about, first,  you  have  to  have  a  good  story so  that  management  believes that  you  have  a  credible  way to  approach  this  and  solve  the  problem. Then  the  last  part  here  is, once  you  do  have  a  model, and  deployed, and  you  have it  operational ized make  sure  that  you  have  some  metrics  to  make  sure that  it  is  in  fact  performing. But  this  is  like  that  last  mile  idea is  we  take  all  of  this  work that  we  do  to  create  the  models, but  getting  it to  that  operationalization  piece is  the  tough  part. In  fact, we  can  see  that  the  data  out  there doesn't  suggest that  we're  doing  all  that  great. Overall,  we're  maybe  even  fewer than  half  of  these  models  make  it. Then  if  they  do,  it  takes  upwards of  three  months  or  so  to  do  so. DevOps,  we're  all  familiar  with in  terms  of  a  good  disciplined  approach for  software  development. When  we  take  that  step into  our  model  deployment  world, we'll  call  it  ModelOps, where  we  want  to  have  some  a  culture or  process  and  technologies to  effectively  employ  them. If  we  look  at  these  three  circles  here, it's  really  this  intersection between  two  and  three that  we're  focused  on to  make  sure  that  the  deployment   is influencing  those  business  decisions. I'd  like  to  go  and  do a  demonstration  here  in  JMP. 
Before  I  do  so,  I  do  want  to  point  out two  fantastic  presentations from  previous  discoveries that  do  talk  about  this  idea of  deployment  and  operationalization by  Nascif  and  Dan  Valente in  2016  as  well  as  2017. You  can  see  over  here, they  have  this  whole  idea that  if  you  give  a  man  a  fish, he  eats  for  a  day, and  if  you  teach   him to fish, he'll  eat  for  a  life, that's  this  operationalization  piece, which  they  also  call the  dark  side  of  analytics. That's  what  we're  going  to  get  into. M eijian  I  also  wrote  a  decent  paper as  well  on  how  you  can  do  that. But  for  us,  what  I  want  to  show  you is  using  the  Formula  Depot, and  I  got  a  little  plus  there  because it's  going  to  go  a  little  bit  beyond just  using  the  Formula  Depot because  that  is  not  enough. We'll  use  our  well-traveled Boston  housing  data. We'll  look  at  what's  the  price  of  it as  a  function  of  rooms,  distance, and  Lstat, which  is  an  economic  indicator. We'll  create  a  main  effects  model as  well  as  a  Bootstrap. Then  we'll  look  at  the  profiler, and  I'll  show  you  a  quick  trick that  could  be  helpful  for  you. Then  we'll  look  at  how  do  I  convert  this to  an  operational  solution and  being  able  to  deploy  it in  a  Python  environment. Certainly,  this  is  a  very  small  data  set, but  we  could  easily  have  done  this plugging  into  some  data  source and  using  query  builder and  things  like  that. But  just  want  to  show  you  some  quick  wins that  you  can  go  to  the  next  step. Because  often  we  hear  that  it's  great that  you  do  all  this  work, but  the  actual  implementation has  to  be  on  this  server, and  this  server can  only  use  JavaScript  or  C++, whatever  it  happens  to  be, Python,  maybe. How  can  we  take  our  good  work and  transport  it  into  use  in  operation? I'm  going  to  transition  over  to  JMP. Here's  a  journal  that  we  have  for  you, and  it  goes  through a  few  of  our  different  options and  what  we're  doing  here. But  here's  Boston  Housing that  we  know  and  love. Here  is  my  Least  Squares  model. What  you  do  in  the  Least  Squares  model, or  any  model  for  that  matter, is  under  Save  Columns, you're  going  to  publish that  prediction  formula. Then  the  Formula  Depot  comes  up  for  you. Let's  go  ahead  and  do the   Bootstrap Forest  as  well. Now we  have  the   Bootstrap Forest, which  we  can  also,  under  Save  Columns, publish  that  prediction  formula. If  we  come  back  up  here and  we  check  out that  Formula  Depot  report, we  can  see  that  it  is  something that  we  are  familiar  with  in  this  fashion. That  if  I  come  down  here off  of  the  red  triangle next  to  Formula  Depot, I  can  look  at  the  profiler . I  want  to  look  at  the  profiler of  both  these  models. I  do  OK, and  there  it  is. The  main  effects  model,  of  course, doesn't  have  any  interaction, but  maybe  you  want to  make  the  point  somehow that  when  you  have  very  short  distance, you have a real steep economic  impact kind of  a thing. What  we  could  do  is  we  could  show  this dynamically  in  PowerPoint, because  often, that  is  the  endpoint  of  some  of  our  work, and  an  analysis is  a  PowerPoint  presentation that  we'd  like  to  have some  sort of a  dynamic  display. 
There  are  certainly many  screen  capture  capabilities that  you're  probably  all  familiar  with. But  if  we  just  go  back  to  PowerPoint, we're  able  to  do  this  inherently in  the  software,  and  it's  pretty  quick. Here  is  a  blank  slide. What  I  want  to  do is  I  want  to  show  that  dynamic  display. I  just  come  under  Insert, and  I'll  do this  screen  recording  right  here. What  I'll  do  is  I  will  get  out  of  that and  I'll  come  back  to  here, and  then  I'll  come  back  to  JMP, go  to  PowerPoint. Now I do  the  screen  recording, and  I  want  to  be  at  the  right  spot if  it's  a  problem. I'll  just  go  ahead and  take  a  snapshot  of  what  I  want. I  will  go  ahead  and  hit  the  Record  button, and  it'll  give  me the  3-second  countdown. Then  now  I  can  narrate  and  say that  we  don't  see  any  interaction in  our  linear  your  model, but  you  can  see  a  Boosted Tree   does have  quite  a  bit  of  interaction. You're  happy  with  that. You  go  ahead  and  stop  the  recording. Then  now  you  have  this  embedded in  your  PowerPoint  that  easily. You  can  go  ahead  and  run  it  here, and  you  can  hear  me  in  the  background that  I  was  explaining  it, so  the  audio  is  in  there. Then  clearly, as  you're  in  presentation  mode, it  will  run  automatically  as  well. Now back  to  the  task  at  hand is  what  we  want  to  do is  deploy  this  solution in  a  model  that  is  Python. What  we  can  do under  the  Formula  Depot  options  here is  we  can  go  ahead and  generate  the  Python  code, and  we'll  do  it  for  the  sake of  argument  for  both  of  these. We  can  see  that  the  Boosted  Forest  here has  Python  code that  consists  of  55,000  lines  of  code. Good  to  know. But  we'll  put  that  aside  for  now. What  we're  going to  ultimately  decide  on,  we'll  say, is  t his  main  effects  model only  is  what  we're  going  to  use that  we  want  to  deploy  to  our  system. I'll  go  ahead  and  hit... By  the  way,  before  I  do  that, there  is  this  one  particular line  of  code  here that  says  import  JMP  score. That  is  a  Python  file  that  is  inherent, and  all  of  you  have  it on  your  install  directory, and  you're  going  to  need  that  to  use  it. What  it  is, it's  really  just  a  helper  menu. It  tells  you here's  how  you  do  vectorization, here's  how  you  do  power and  things  like  that, but  important  that  it's  there. I'll  go  ahead  and  do  File,  Save. What  that  creates  is  this  Least  Squares, and  then  we'll  call  it BH  for  Boston  House. Now I've  got  that  Python  code. What  would  be  nice  is  if  I  could  just  go to  Spyder  off  of  Anaconda and  just  run  it and  then  score  all  my  new  data. The  problem  is  it's  not  that  easy. It  doesn't  run  you  yet. It  takes  you  a  good  way  there, but  you  still  have  to  put some  wrappers  and  things  around  it. You  have  to  import your  Excel  file  or  whatever  it  is. I'm  not  a  Python  coder. In  fact,  people  say,  "What  do  you  do?" "I'm  a  data  scientist." "Oh,  you  must  be  good  at  Python." "No,  I've  never  actually  used  it." I'm  an  impostor  in  that  fashion. But  maybe  there's  a  way  that  we  could  use this  latest  technology,  ChatGPT-4, and  have  it  create a  Jupyter  Notebook  for  me. If  I  come  under  my  ChatGPT- 4  here, I  have  this  question  I  ask  it. 
"I  have  Python  code  generated from  a  model  in  the  JMP  Formula  Depot. Go  ahead  and  create a  Jupyter  Notebook  for  me that's  going  to  import  data  from  Excel." I  say,  "Can  it  do  it?" And  it  says,  "Absolutely." Then  what  it  does is  it  gives  me  the  code  right  here. I  copy  the  code, and  I  put  it  in  my  instantiation   of Jupyter from  Anaconda,  all  three. Then  I  run  it,  and  lo  and  behold, it  doesn't  work. Nothing  works  for  me  the  first  time. But  I  say,  "It  didn't  work, and  here  was  the  error  message  I  got." It  says,  "Oh,  well,  try  this." Then  I  tried  that,  and  then  it  worked. What  did  it  work? If  I  come  back  over  here, this  is  my  Jupyter  Notebook that  was  given  to  me  by   ChatGPT-4. Again,  I  know  nothing  about  Python, but I do  know  that  it  gave  me  these  lines. I  just  go  ahead  and  say, I'm  going  to  import  Pandas because  I  need  that to  get  my  Excel  file  in. Then  here  is  that  Fit  Least  Squares. That's  what  I  got  from  the  Formula  Depot. It  does  that. I'm  running  each  one  of  these,  by  the  way. Now it  says  go  ahead  and  import. I'm  going  to  import  an  Excel  file that  has  what  I  want  to  score, and  that's  going  to  be under  Boston  Housing  data. It's  new  data  that  I  want  to  score. Then  here's  this  outdata  thing that  it  told  me  was  my  error. I  said,  " I'll  do  that." Then  this  says,  "Hey,  just  let  me  know that  I  brought  in  some  data,"  and  it  does. Then  now  I'm  going to  go  ahead  and  score  it. I  go  ahead  and  score  it. Then  did  it  score? Sure  enough,  it  did. There's  the  first  five  values  of  it. Then  I  can  go  ahead  and  save  that, and  we'll  just  call  this  For  Demo. Maybe  I'll  even  add  100  in  there. Since  200  is  100, I'll  go  ahead  and  put  that  out. Then  I'll  say,  "W here  does  that  live?" Maybe  I'll  see  it  right  here. Here  it  is. Here's  the  scored  data  Excel  file. There  are  all  800  or  so predicted  values  from  that  linear  model. This  is  easy  as  that. Next  up, what  we  want  to  do  is  set  Andrew  up to  why  everything is  beautiful  in  this  world. Coming  back  to  PowerPoint  here, the  scenario  is  this. It's  that  we  were  working with  a  group  of  doctors  across  the  US, and  they  wanted  reports for  every  one  of  them how  productive  they  were  in  2022. They  thought  that  perhaps they  could  be  more  productive by  changing  some  of  the  codes   that they use for  certain  procedures. They  gave  us  millions  of  rows  of  data, and  we  came  up with  exactly  what  they  asked  for. We  created  a  template of  10  pages  or  so  of  the  report with  nice  JMP  graphics  in  there, and  it  was  virtually  perfect, except  for  one  thing. The  one  thing  is  that  this  data  table at  the  very  end  that  gave  the  results, we  couldn't  get  it  sized  properly or  put  the  borders  on  it. It's  as  simple  as  just  selecting  it and  then  right-clicking  and  do  AutoFit and  then  hitting the  border  symbol  next  to  it. That's  what  I  told  Dr.  Jay  right  here. This  is  for  Dr.  No. You  can  see  his  annual  report. Essentially, Andrew  swam  the  English  channel and  gave  them the  99.8%  solution  to  their  problem, but  they  weren't  quite  happy. It  wasn't  until  we  went  and  took the  step  to  make  this  automated. 
Again,  this  is  a  two-second  process, but  because  it  took  that by  100s  of  things, they  weren't  happy. But  then  we  ended  up  fixing  that, and  that's  when  the  customer  said, "This  is  absolutely  beautiful," hence  we  have beautiful  operationalization. With  that, I'm  going  to  turn  it  over  to  Andrew to  let  you  in  on  a  few  secrets of  how  you  can  get  some massively  increased  productivity. Thanks,  Jim. I'm  Andrew, and  I'm  going  to  show  you how  we  put  together  this  report. The  assumption  is  we  have  some  suppliers, and  here  are  our  suppliers. We've  got  four  different  suppliers, and  we've  got  some  metrics  that  we  track in  this  data  set that  we  update  frequently. We  want  to  be  able to  report  this  to  the  suppliers so  we  can  track  how  they're  doing and  have  a  record of  long-term  performance. What  we'd  like  to  do is  to  get  something  like  this is we  have  this  template  we  like where  we  have  our  header  up  top,  our  logo. We  have  in  bold, we've  got  the  company  name. Then  each  of  these,  we  substitute  in these  calculated  values  from  the  data  set. The  standard  JMP  functions  of, we  export  a  report  to  Word, we  get  everything, all  the  outline  boxes  are  open, but  it  doesn't  have  our  template  function, and we have to  fill  in  the  blanks. We  can  manually  copy  out  the  pictures, right-click  those  little  gray  triangles, and  say,  edit  copy  picture and  get  a  nice  version of  the  pictures  out. But  it's  still  a  manual  process that  has  to  be  done. This  little  application  here, I've  got  four  suppliers. What  if  you  have  400, and  they're  doing  that  once  a  month? That  becomes  unwieldy. How  can  we  do  this? Not  natively  within  JMP  or  JSL. You  can  get  close  sometimes to  being  on  your  application, but  a  more  flexible  example is  to  take  something  like  this where  we  have  a  template  file, we  write  the  report  once the  way  we  want  it. Every  time  we  have  something we  want  substituted  in  by  JMP, we  have  these  double  brackets, and  we're  going to  put  a  little  keyword  in  there. In  this  case,  team_p   is going  to  be  the  team  name. Then  down  here,  I've  got  mill_time_ p with  an  underscore, and  now, I've  got  that  bold  and  highlighted. We  put  whatever  format  we  want, and  anything  we  want  substituted  in, we  just  put  in  brackets. It's  a  natural  solution, so  it'd  be  great  if  we  get  this  to  work. Then  also,  in  addition  to  text, we  can  also  do  that  with  pictures. We've  got  the   q_pic11. If  we  want  a  JSL  line  of  box  equivalent, then  we  can  get this  table  structure  within  Word, and  we  just  put  in  our  keywords where  we  want  to  substitute  things  in. Also,  we're  going  to  have  a  table at  the  end  that  doesn't  have  a  key, that  doesn't  have  a  holder  here. I'll  get  to  that  in  a  second. Come  back  to  my  journal. At  the  end,  what  we  want  it  to  look  like, the  intended  document  result, is  we  also  have  this  table  at  the  end that  goes  through  Tabulate. We  have  this  nice  color  gradient we  can  put  it  into  JMP, and  we  want  to  get  this  into  Word. But  also,  as  Jim  mentioned, we  want  to  AutoF it  this  table and  we  want  it  to  look  nice, because  a  lot  of  times, the  tables  don't  fit  automatically. 
We  can  go  through  all  the  work and  create  the  tables, but  if  we  don't  make  them  fit, then  we're  going  to  have  a  lot  of  work ahead  of  us  to  go  through and  do  that  manually. It's  not  something we  can  program  natively  within  JMP. What  can  we  do  is  we  will  have  a  script that,  and  this  is  shared on  the  Discovery  website, where  we  can  open  up  the... We  have  a  reference  for  the  data  table, the  container  data,  and  this  gets  updated. Every  time  we  run  this to  generate  the  reports, it  pulls  in  the  new  data. We  have  the  output  path, which  is  a  folder  where all  the  temporary  files  get  written  to that  we  normally  clean  up plus  the  output  reports. Then  also  template  file,  that  Word  file that  contains  those  keywords. A ll  the  rest  of  the  script  is  going to  be  going  down  and  calculating the  individual  values that  gets  substituted  in. A t  the  end,  we  have  this  Python  script that  does  the  actual  substitution and  then  also, auto  fits  and  pulls  in  this  table. If  you're  saying, "I  don't  know  Python, I  don't  know  how  applicable this  is  going  to  be  to  me," we  don't  know  Python  either, but  we  got  this  to  work  pretty  quick because  GPT-4 wrote  the  Python  code  for  us. I'm  going  to  show  you an  example  of  how  we  did  that. What  this  script  will  do is  we'll  write  the  temporary  files to  the  output  folder. For  example,  here's  our  q12  graph, and  the  temporary  files  get  written  here. The  images  get  written to  the  output  folder. Then  with  the  static  code  in  Python, it  knows  to  pull  these  files  in. The  individual  calculations will  be  run  within  JMP, and  then  saved  within  the  script, and  then  passed  along to  Python  at  the  end. Here  we  run  all  of  our  JMP  code. We  load  any  of  our  values  into  variables. An  example  of  that  is, here  we  have  a  tabulate  function. We're  going  to  make  into  data  table. From  that  data  table,  we're  going  to  take the  mean  mill  time  column and  take  the  first  entry  in  that  column, and  then  that's  going to  become  our  mill_ time_ p  variable. That  is  what  will  get  sent  down in  the  Python  code. When  we  initiate  Python, we  tell  it  whatever  variables  we  wanted to  send  over  from  JMP  to  Python. Here's  that  mill_ time_ p. That  will  hold  any of  the  mill time  calculations, and  that  is  what  gets  substituted  in to  the  mill_time_ p  underscore  area within  the  template. I'm  going  to  go  back  to  my  template  file. Here's  my  mill_time_ p  area. That's  what  gets  substituted  in. The  intermediate  output from  the  Python  code  for  doing  this is  the  temp  report. We  can  see  these  values get  substituted  in. The  graphs  get  placed  in. We  get  our  nice  grid  of  graphs. At  the  end,  we  don't  have  our  table  yet. The  reason  we  don't  have  our  table  yet is  because  we  like  the  way  that if  we  move  a  table  from  JMP using  the  get  as  a  report  to  a  journal and  then  export  that  journal  to  Word with  Save  MS  Word, we  like  the  way it  keeps  the  table  structure, but  we  still  need  the  AutoF it. What  we  do  is,  in  addition  to  the  report that  gets  written  out  from  the  template, is  we  also  write  out this  other  temporary  table  file. We  get  the  table  import. 
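Before getting to the table, here is a hedged sketch of what the substitution step amounts to, written with the python-docx package rather than the exact generated code from the talk. The file names, the {{keyword}} rendering of the double-bracket placeholders, and the values dictionary are illustrative, and the simple run-level replacement assumes each placeholder sits inside a single Word run.

```python
from docx import Document   # pip install python-docx

# Illustrative values that would come from the JSL calculations
values = {
    "{{team_p}}": "Supplier A",
    "{{mill_time_p}}": "12.4",
}

doc = Document("supplier_template.docx")        # template path is an assumption

def replace_in_paragraph(paragraph, mapping):
    # Replace within each run; assumes a placeholder is not split across runs
    # (Word sometimes splits text, so real code may need to merge runs first)
    for run in paragraph.runs:
        for key, val in mapping.items():
            if key in run.text:
                run.text = run.text.replace(key, val)

for p in doc.paragraphs:
    replace_in_paragraph(p, values)

# Placeholders inside table cells need the same treatment
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for p in cell.paragraphs:
                replace_in_paragraph(p, values)

doc.save("supplier_report.docx")
```

Picture placeholders can be handled with a similar loop by locating the placeholder paragraph and adding the image to a run, and the imported table still needs the AutoFit step, which comes next.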
Here  it  is,  and  what  we  need  to  do when  we  want  to  automate is  this  AutoF it t o  Window, and  then  also  reducing the  font  size  where  it  actually  fits. What  we  need  to  do  is, after  the  Python  code that  substitutes  out  into  the  template, what  we  found  is  we  have  to  convert that  DOC  file  created  by  JMP, convert  it  to  a  DOC  X  file, and  then  we  have  Python  code that  will  open  up  that  DOC  X  file, take  the  first  table, it  will  apply  AutoF it, change  the  font  size  to  7, and  then  it  will  append  it to  the  original  report  file. How  did  we  know  to  create  this? Again,  you  can  copy  what  we've  done. We've  got  this  file  available. But  how  can  you reproduce  this  for  your  own and  create  your  own  Python  to  do  this? I recorded  a  video  of  going  through on  GPT-4,  how  I  did  this, and  I'm  going  to  show that  now  and  narrate  that. The  prompt  I  give  is  that  I've  got a  Word  document  that  I  want  to  use. I'm  so  bad  at  Python. I'm  going  to  go  ahead and  give  GPT-4  my  actual  path  here because  I  don't  know how  to  substitute  out  the  backslashes, and  the  spaces,  and  everything. I  say,  "G o  to  this  path,  open  it  up, take  the  first  table  you  see  in  here, and  then  give  me  back  that  table  AutoFit." It's  going  to  tell  us,  "You  need to  install  this  other  package." I've  just  got  a  base Ana conda  installation. You  can  ask  it  for  instructions  and  help. You  make  sure  you  install  that  package and  then  you'll  be  able  to  run  this  code. Whenever  it  gives  me  this  thing, the  first  thing  I  noticed  is  it  tells  me, "Hey,  I'm  going  to  overwrite your  report  file,  be  careful." I  say,  "No,  I  don't  want  that." It's  interactive  the  way it  gives  you  the  code. I  say,  "F ix  this  code. I  want  to  write  to  a  different  file in  the  same  directory." It's  going  to  modify  the  code to  give  me  a  different  output  file in  the  same  directory. What  I'm  going  to  do is  I  just  copy-paste  this  over  to  Spyder, which  is  basically the  JSL  editor  for  Python. It's  the  Python  editor. I  hit  Run,  I  hit  the  green  arrow, and  I  get  this  note about,  "There's  an  error." I  don't  know  what  the  error  means. I  don't  want  to  spend  time on  Stack  Exchange  or  Stack  Overflow or  anything  looking  that  up. I  paste  the  error  back  to  Python, and  it's  nice  enough  that  it  apologizes and  explains  what  it  did  wrong. I'm  not  that  worried about  what  it  did  wrong. I  just  want  something  that  works and  gives  me  my  report. I'm  going  to  copy- paste the  code  it  gives  me and  then  go  back  to  the  Spyder, run  that. I  get  one  more  error. It  says,  "Table  is  not  subscriptable." Not  really  sure  what  that  means. I  tell  GPT-4  about  that. It  apologizes  again. It  thought  that  the  tables were  zero- based  index. It  turns  out  it's  one- based. It  fixes  that. I'm  going  to  copy  this  code  over. This  time,  it  runs  without  error. I  go  to  the  folder  I  specified  to  it, and  here's  the  modified  folder. Now you  can  see AutoF it's  been  applied  to  this  table. We  just  made  a  Python  code without  really  knowing  Python. You  don't  have  to  stop  there. If  you  have  any  version  of  modification that  you  want  to  give  to  it, you  can  change  the  font  size. 
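The generated code itself is not shown in the talk, but a sketch of what that AutoFit, resize, and append step might look like with python-docx is below. The file names are illustrative, python-docx reads only .docx files (which is why the intermediate .doc file has to be converted first), and the append at the end copies the table's underlying XML element, a common workaround rather than an official python-docx API.

```python
import copy
from docx import Document
from docx.shared import Pt

table_doc = Document("temp_table.docx")         # the exported JMP table (illustrative name)
report = Document("temp_report.docx")           # the filled-in template report

table = table_doc.tables[0]                     # first (and only) table in that file
table.autofit = True                            # let Word size the columns automatically

# Shrink the font so the table actually fits on the page
for row in table.rows:
    for cell in row.cells:
        for paragraph in cell.paragraphs:
            for run in paragraph.runs:
                run.font.size = Pt(7)

# Append the modified table to the end of the report by copying its XML element
anchor = report.add_paragraph()                 # empty paragraph to attach the table after
anchor._p.addnext(copy.deepcopy(table._tbl))    # low-level workaround, not a public API
report.save("final_report.docx")
```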
Here  I'm  going  to  ask  it to  make  it  Wingdings. Let's  see  if  it  knows  how  to  do  that. It  gives  me  some  new  code,  and  I  run  it. Yeah, I  can  get  Wingdings  output. Just  to  make  sure it  didn't  just  give  me  gibberishish, I'm  going  to  copy  it  all and  make  sure  that  it  translates  back into  more  easily  readable  text, and  it  does. That's  what  we're  looking  for. Not  only  Python,  but  also  Visual  Basic. Another  thing  that  we  run  into is  we  output  things. You  might  have  a  500-page  report and  you  want  the  headers  to  look  right to  header 1 ,  header  2  in  words, so  that  way you  can  get  a  table  of  contents. W e  told  GPT-4  is, I'm  going  to  use  these  delimiters. I'm  going  to  script  these  in to  my  outline  box  titles. I  get  h1,  close  h1,  h2,  close  h2 for  header  1,  header  2. I  want  to  substitute  those  in. When  we  did  that  to  GPT-4, what  it  gave  us is  it  gave  us  this  macro  file, which  I'm  going  to  copy  that. I'm  not  sure  if  it  was  retained in  my  Microsoft  Word. We'll  find  out. If  not,  I'll  paste  it  in. You  have  to  enable  this  Developer  tab. That's  a  Word  option. When  you  do  that,  we  get  this. We  just  paste  in  our  macro  here, which  was  generated  by  GPT-4. I  don't  like  programming  in  VBA, but  it's  good  at  it. When  I  hit  Run  on  this, it  takes  everything, all  these  headers  and  fills  them  in. That  way  I  can  go  up  here,  I  could  say, References,  Table  of  Contents, and  put  on  my  table  of  contents. If  I  generated  500  page  report, now  I've  got the  right  structure  for  all  this. It's  easy  to  go  in  here,  too. If  you  want  to  add  numbers to  these  headers  and  stuff, you  just  right-click  in  here, and  you   go  to  Modify, and  you  can  turn  on that  numbering  structure  if  you  want. The  last  thing  I'm  going  to  show for  GPT-4,  specifically  for  JSL, is  you  might  have  noticed in  my  script  back  here, I'm  really  bad about  commenting  my  script. I've  got  a  couple  of  comments  in  here. But  for  the  most  part,  I  don't  have a  lot  of  comments  describing  it. If  you  have  code,  you  don't  want to  give  anything  proprietary, but  if  you  give  code  to  GPT-4  or  even  JSL, it'll  add  a  bunch  of  nice  comments  to  you and  explain  what  each  code  block  is  doing, and  that  way  other  readers  can  see. Also  for  Python,  if  you  don't  know  Python and  you're taking  a  look  at  our  script that  we've  provided, and  you're  saying, " What  does  this  do," you  can  provide  this  code  to  GPT-4 and  say,  "Explain  this  Python  code  to  me." It'll  give  you  a  long  narrative  story and  say,  "Here  is  exactly the  workflow  of  what's  happening. Here  are  the  key  functions that  are  doing  it." That's  my  favorite  part. You  can  say,  "Do  you  have any suggestions  for  this  code?" It'll  say, "Y ou  don't  have  any  error  handling. You've  got  some  memory  leak  issues." It'll  go  through  and  make a  bunch  of  suggestions. Then  it's  nice  enough  to  go  ahead and  give  you  some  new  code with  all  the  suggestions  implemented. With  all  this  in  place, you  can  go  from  doing all  of  your  calculations  in  JMP to  actually  getting  this  nice  output  file that  has  the  right  format  you  want. Everything  looks  nice. You're  not  going  through making  manual  changes. 
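If you would rather stay in Python than enable VBA macros, the same header idea can be sketched with python-docx. The delimiter strings, file names, and heading-style mapping below are illustrative assumptions, not the macro from the talk.

```python
from docx import Document

doc = Document("long_report.docx")              # illustrative file name

# Map delimiter pairs to built-in Word heading styles
# (assumes "Heading 1" / "Heading 2" exist in the document, as they do by default)
levels = {("<h1>", "</h1>"): "Heading 1",
          ("<h2>", "</h2>"): "Heading 2"}

for paragraph in doc.paragraphs:
    for (open_tag, close_tag), style in levels.items():
        text = paragraph.text
        if open_tag in text and close_tag in text:
            # Strip the delimiters and apply the heading style so Word can
            # build a table of contents from these paragraphs
            paragraph.text = text.replace(open_tag, "").replace(close_tag, "")
            paragraph.style = style
            break

doc.save("long_report_with_headings.docx")
```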
With  this  in  place, this  is  what  the  customer  said, take a  look  at  this  and  said, "This  is  beautiful." With  that,  we  hope  that you  can  take  this  same  idea and  go  make  some beautiful  reports  yourself.
Tennis is one of my favorite sports. The 'big three' of Federer, Nadal and Djokovic are my favorites. They elevated modern tennis to new heights in their rivalry. But it looks like Alcaraz is in line to take the throne. It is fun to compare their records and present the results at JMP discovery summit in Indian Wells which is the home to Indian Wells Open.   I started with Association of Tennis Professionals (ATP) data from 2022 for my project. Next, I combined over 25 years of data that include all match records of Federer, Nadal, Djokovic and Alcaraz. Using JMP’s powerful analytical and visualization tools, this report provides insight into the questions of who won the most matches of a year? What are the factors that influence players winning matches? Most importantly, who is the GOAT?     Hi,  everyone.   My  name  is  Jianfeng  Ding. I'm  a  research  statistician developer  at  JMP  IND. Today,  I'm  going  to  show  you  how  I  use  JMP to  explore  the  tennis  data and  find  out  who  is  a  goat,   the  greatest  of  all  time  of  men's  tennis. First, I  would  like  to  give  you   some  background  information why  I  choose  this  topic. When  I  heard  that  the  JMP  Discovery  Summit 2023  will  be  held  in  Indian  Wells, I  got  excited as  tennis  is  one  of  my  favorite  sports and  my  youngest  son   plays  varsity  tennis  at  his  high  school. I  have  watched   a  lot  of  tennis  over  the  years. Indian  Wells  is  a  home to  Indian  Wells  Master, which  is  often  called  the  Fifth   grand slam. I  thought  it  would  be  fun  to  use  JMP to  explore  and  analyze  the  tennis  data and  present  the  results  to  our  user at  Indian  Well  Discovery. The  second  motivation  come  from  JMP. JMP  has  grown  bigger and  richer  in  many  ways. There  are  so  many  wonderful features  created  by  my  colleagues. I  would  like  to  keep  myself  updated  with  these  new  cool  features by  applying  them  to  the  project. Currently,  you  are  seeing  one  of  them, Application  Builder. Instead  of  using  PowerPoint, I'm  using  JMP  Application  Builder for  today's  presentation. My  presentation  mainly  include  two  parts. I  will  take  you  on  a  tool  to  explore the  ATP  data  from  the  year  of  2022. ATP stands  for   Association  of  Tennis  Professional, which  is  the  world  governor  body  for  men's  tennis. Then  we  will  look  at  a  25-year  combined ATP  data  to  find  out  who  is  the  GOAT. First,  let's  see  where  do  I  get  the  data? I  get  the  data  from  the  web  and  GitHub, which  was  created  and  maintained   by  Jeff  Secman. He  is  a  software  developer   working  in  the  field  of  sports  statistics. On  this  web,  it  contained  the  ATP  match  data from  year  of  1968  to  the  current  year. We  can  also  get  women's  tennis data  from  this  web  as  well. What  data  looks  like? Here  is  the  data  from  the  year  of  2022. It  consists  of  about  49  variables with  about  3,000  observations. Each  observation  represent matches  play  on  ATP  tours. The  yellow  section  contains  a  variable about  the  tournaments and  the  blue  section  contains  a  variable  about  the  players. Each  observation  is  a  match, so  usually  the  variable  comes  with  two, one  for  the  winner  and  one  for  the  loser. Let's  look  at  all  those  variables about  the  tournament  first. 
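For readers who want to pull the same data outside JMP, here is a short pandas sketch. The URL follows the usual file-naming convention of Jeff Sackmann's tennis_atp repository on GitHub, one atp_matches_&lt;year&gt;.csv per season, and the column names follow that repository's layout; adjust both if the repository has changed.

```python
import pandas as pd

# One season of ATP tour-level matches; column names follow the repository's layout
url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2022.csv"
atp_2022 = pd.read_csv(url)

print(atp_2022.shape)                                   # roughly 3,000 matches x ~49 columns

# Tournament-side variables: surface and level counts
print(atp_2022["surface"].value_counts())
print(atp_2022["tourney_level"].value_counts())

# Player-side variables come in winner_/loser_ pairs, e.g. winner_name, winner_hand,
# winner_ht, winner_age, winner_rank and their loser_ counterparts
print(atp_2022[["winner_name", "winner_hand", "winner_rank"]].head())
```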
I built a Graph Builder plot on tournament name, tournament surface, and tournament level. In the tournament name plot, the tournaments with more players sit at the top. The Grand Slams, the Australian Open, Roland Garros, the US Open, and Wimbledon, are the largest and most prestigious tournaments. Last year there were about 145 tournaments. We also can see the three typical surfaces for the tournaments: clay, grass, and hard, and there are usually more hard-surface tournaments than grass and clay. There are also five levels of tournaments, defined here as A, D, F, G, M. G stands for the Grand Slams and M stands for the Masters; the Indian Wells Masters is a Masters-level tournament. D stands for the Davis Cup, and A is the ATP Tour. Next, let's look at the variables about the players. I ran Graph Builder again. The plot on the left shows which country has the most players. On the right, it shows the players' hands: do they play right-handed or left-handed? You will see that most players are right-handed. I also wanted to find out which country has more top-ranked players. I created a slider on winner rank, and the countries with more top-ranked players pop up. I was interested in the top 100, and the US sits at the top, meaning the US has more top-ranked players than the other countries. Then what about the top 10? You can either slide or type in the number. From this, Spain popped up at the top, and when I hover over it, I see Carlos Alcaraz and also Nadal. When I click the US, I see Taylor Fritz, who was ranked number nine. You also can see from the handedness plot that Nadal is left-handed, one of the left-handers among these top 10 players. Now let's move on to check the players' age, height, and ranking. Last year the players' ages ranged from 17 to 42. In this graph, I only listed the top 10 by average ranking. From this I find their average height is around 6'2", which is very common for a male tennis player. I also find Rafael Nadal and Novak Djokovic are the oldest on this list. Now, let's look at the winning statistics, because I would like to see who won the most matches in 2022. I found that Tsitsipas is listed as number one. Then something is missing: where are Rafael Nadal and Djokovic? I couldn't find them in this top 10 list of most matches won. This reminded me that maybe I should look at their winning ratio instead of just the number of matches they won. I did some summary statistics and found their winning ratios (a pandas version of this summary is sketched below). Yes, you immediately see that Novak Djokovic, Rafael Nadal, and Carlos Alcaraz are the top three players with the highest winning ratios, although their numbers of match wins are not as high as Tsitsipas. I also noticed there are two players who have a pretty good winning ratio, but they don't play many matches.
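That winning-ratio summary might look like the following in pandas, continuing from the atp_2022 table loaded earlier; the played-match count simply adds each player's appearances as winner and as loser, and the 20-match cutoff is an arbitrary illustrative threshold.

```python
import pandas as pd

# atp_2022 comes from the earlier loading sketch
wins = atp_2022["winner_name"].value_counts().rename("wins")
losses = atp_2022["loser_name"].value_counts().rename("losses")

record = pd.concat([wins, losses], axis=1).fillna(0)
record["played"] = record["wins"] + record["losses"]
record["win_ratio"] = record["wins"] / record["played"]

# Most match wins vs. best winning ratio (among players with a reasonable sample)
print(record.sort_values("wins", ascending=False).head(10))
print(record[record["played"] >= 20].sort_values("win_ratio", ascending=False).head(10))
```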
They  only  won  three  matches. Who  are  they and  what  type  of  tournament do  they  play? I  drilled  down  into  the  data and  I  find  out  one  player's  name  is  Kovacevic and  all  his  three  matches  coming  from  tournament  A  level and  the  player, Safwa his  all  three  matches coming  from  Davis  Cup. From  this  graph,  you  definitely  know the  tournament  level   will  affect  the  winning. Ultimately,  you  care  about  who  won the  most  championship  or  tournament  wins. This  graph  put  all  three   relative  statistics  in  one  plot. The  down  you  will  see   how  many  matches  they  win and  the  second, the  green  bar  means   what  are  their  winning  match-win  ratio? The  top  will  show  you how  many  total  championship  they  won  in  2022. I  see,  Djokovic,  Carlos Alcaraz   and  Rafael  Nadal. I  also  see  one  guy who  I'm  not  familiar  with,  and  his  name, hard  to  say,  but  let  me  call  him  FAA. FAA  doesn't  have  amazing  winning  ratio,   but  he  did  won  five  titles. Again,   I  drill  down  to  the  data  and  find  out all  FAA's  winning  title  coming  from  A-level  tournaments. You  look  at  Djokovic  or  Alcaraz  and  Nadal, they  are  championship  not  only  from  A-level  tournaments and  also  from   grand slam  and  a  Master  level. Again,   we  show  tournament  level  effect  winning. Let's  look  at  the  seed. What  does  seed  play in  the  players'  winning? I  have  to  point  out  the  players'  seeds actually  will  vary  over  the  years. But  in  general,  the  higher  seeded  players tend  to  win  more  matches and  more  tournaments. Grand slam  winner  usually are  highest  seeded  players. But  in  2022,  only  two  people  are  exception. One  is  Carlos  Alcaraz and  the  other  is  Taylor  Swift. Sorry,  it's  about  Taylor  Fritz. You  can  see  here,   Carlos, he  succeed,  start  low, but  he  won  the  Miami  Masters. This  helped  him  move  to  the  top. In  the  end,  year  of  2022, he  was  ranked  as  the  number  three  seed, and  he  was  able  to  win  the  US  Championship. Taylor  Fritz,  he  actually  won  the Championship  of  Indian  Well,  Master  2022. We  can  see  the  seeds  definitely  affect  the  winning. Now,  let's  look  at  the  comparison between  the  winner  and  the  loser. In  this  ATP  data,  there  is  a  section  list about  to  serve  statistics and  come  with  a  winner  and  a  loser. There  are  seven  variables related  to  the  serve  statistics. I'm  interested  in  this  first  one. What  it  is? The  first  one  means number  of  points  won  on  first  serve. I  click  and  build  a  plot. Instead  of  I  plot  all  those   absolutely  the  number  of  the  point I  use  the  ratio because  the  point  will  depend  on   how  long  you  played  your  matches. With  the  ratio  would  make  more  sense. The  blue  colored  represent  the  first  serve  percentage  won coming  from  the  winner and  the  pink  is  coming  from  the  losers. Actually,  majority  of  the  first  serve percentage  won  between  60%  and  90%. But  the  blue  color   shaded  more  to  the  right, indicating  winner  have  higher   first  serve  percentage  won . Next  I  would  like  to  be  interested  to  see the  variable  is  BPs  saved  and  BP  faced. BP faced  means  a  breaker  point  faced. For  if  you  serve  and  you  face  the  breaker  point, that  means  you  give  your  opponent  opportunity  to  break  you. 
You  better  not  t o  face  the  breakpoint. Instead  of  plotting  separately, my  son  suggested  me  to  convert  them  to  be breakpoint  converted, which  is  a  variable  defined  as the  difference  between   the B P  faced  and  BP  saved. Then  again,  we  can  see  the  blue  color shaded  more  towards  the  left, indicating  winner  face  less  breakpoint   and  save  more  breakpoints. The  pink  one  indicates  that  loser tend  to  face  more  breakpoint and  save  less  breakpoints. With  all  these  statistics   and  variable   I  have  shown  you, but  ultimately  I  would  like  to  know,   can  I  build  a  model? Can  I  predict  who  is  going  to  win and  how  many  they  can  win? I  build  a  summary  table   and  as  I  shown  you, all  these  ATP  data  come  with  matches. A  player  can  have  many  matches so  I  just  use  a  tabulate   to  do  the  summary  statistics. I  got  the  tournament  wins  for  each  player and  I  got  the  average  their  winning  match  ratio and  their  height   and  their  average,  their  seed. I  wanted  to  find  the  correlation  between the  variable  to  the  tournament  wins. Clearly  you  can  see the  match  winning  ratio  is  highly correlated  with  tournament  wins and  so  is  winner's  seeds. Also  I  defined  one  variable   I  call  the  div  rank, which  I  know  when  you  face  a  weak  player, opponent  or  strong  opponent, your  winning  odds  could  be  differently. I  do  the  subtraction, I  introduce  this  variable  into  the  model. You  also  notice  the  height, there  is  the  correlation  between  the  variable. I  just  happen  to  notice   when  you're  higher  or  you're  taller and  you  tend  to  have  a  better  ACE  rate and  you  have  better, like  the  first  one,  serve  one. Definitely  the  taller  player  has  advantage  at  serving. I  bring  all  these  model into  the  fit  model  platform. I  first  run  a  Least  Square  model and  I  get  the  conclusion  that the  winning  ratio  and  the  winner's  seed are  definitely  affect   how  many  tournament  you  can  win. I  also  think,  oh,  this  is  a  count  of  data. How  many  tournaments  you  will  win. Maybe  I  should  use [inaudible 00:19:36]   distribution and  I  run  and  I  actually  also  get the  similar  conclusion that  winning  ratio  and  winner  seed   is  very  important  variable. But  I  have  to  point  out, although  I  show  you  early  about the  tournament  level  plays  a  very  important  role  on  the  winning, but  because  the  data,  the  format  itself made  me  hard  to  put  it  into  the  model. I  need  a  lot  of  data  manipulation. Plus,  I  feel  like  instead  of  just  looking   at  the  one  year's  ATP  data, maybe  I  should  look  at  more in  order  to  build  a  complete or  good  predictor  model. I  will  keep  this  in  mind for  my  future  research. With  all  these  statistics  and  a  variable, I  show  you  so  far. That's  back  to  the  topic,  who  is  the GOAT ? I  actually  created  a  script and  I  wanted  to  get  the  data  in  the  past  25  years as  Federer  started  early. I  wanted  to  include  all  the  matches, all  of  them  have  played. I  would  like  to  find  out who  won  the   grand slam  title and  who  won  the  Indian  Wells. This  script  actually  is  able  to  go  to  the  Jeff  Sexel  web and  fetch  the  data  and  do  the  analysis  and  generate  the  report. 
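The script in the talk is a JMP script; as a rough Python analogue, the sketch below pulls a range of seasons from the same repository, filters to the four players, and tallies Grand Slam titles and career match records. The URL pattern, the year range, and the column and code values (tourney_level "G", round "F") are assumptions based on the repository's usual layout.

```python
import pandas as pd

players = ["Roger Federer", "Rafael Nadal", "Novak Djokovic", "Carlos Alcaraz"]
base = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{}.csv"

frames = [pd.read_csv(base.format(year)) for year in range(1998, 2024)]
matches = pd.concat(frames, ignore_index=True)

# Keep only matches involving one of the four players
mask = matches["winner_name"].isin(players) | matches["loser_name"].isin(players)
big4 = matches[mask]

# Grand Slam titles: tourney_level "G", final round, won by one of the four
slam_finals = big4[(big4["tourney_level"] == "G") & (big4["round"] == "F")]
print(slam_finals[slam_finals["winner_name"].isin(players)]
      .groupby("winner_name").size().rename("grand_slam_titles"))

# Career match wins and winning ratio for each of the four
wins = big4["winner_name"].value_counts().reindex(players).rename("wins")
losses = big4["loser_name"].value_counts().reindex(players).rename("losses")
summary = pd.concat([wins, losses], axis=1)
summary["win_ratio"] = summary["wins"] / (summary["wins"] + summary["losses"])
print(summary)
```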
You can see that in 2023, Alcaraz won both Indian Wells and Wimbledon, and Novak won the Australian Open and Roland Garros. As the list moves down, you pretty much see their names: Djokovic, Nadal, Federer, and so on. For almost the last 20 years, these three have been dominant. As I keep moving to the bottom, finally I see Andre Agassi and Pete Sampras, who were my favorite players in the '90s. You also see these three guys, Djokovic, Federer, and Nadal, sitting on top. This includes Grand Slam titles and Indian Wells titles. I truly believe these three guys moved modern tennis to a higher level. Now, let's look again at the match wins, winning ratio, tournaments, and Grand Slam titles. I would like to see more detail. The green bar here represents their match-winning ratio. But I like Graph Builder's features: it allowed me to put their number of winning matches on top. Then you can see that although their winning ratios are very close, all above 80%, Roger Federer won the most matches, 1,263. You move to the top and you will see those green bars showing how many tournament championships each of them has won. Again, Federer won the most. Then you look at the blue bars on top, and you will see that Djokovic won the most Grand Slam titles, 23. Next, I want to check on their rankings. These four lines not only show their rankings over the years but also show their incredible professional tennis careers. Federer started early, in 2001. It took him about three years to move to the top, but he stayed at the top for a long time, 18 years. Look at Nadal and Djokovic: they moved to the top very quickly, and they also stayed at the top for a long time. A dip here usually means either they had an injury or had surgery and needed to recover. I know Nadal is in a recovery period right now because he just had surgery, and Djokovic continues to play. I truly believe that those two lines will continue to grow for a while. Alcaraz has just started. We will see if he follows the same trajectory as the Big Three. I would like to show you more detail about the individual Grand Slam matches. Look at this plot on the left. This shows how many Grand Slam matches Federer has played in the past 25 years: a total of 434 Grand Slam matches. He won 373 matches and lost 61 matches. That brings his winning ratio to 86%. It's amazing. The right-hand plot actually plots his opponents' rankings. I want to show how difficult it is. Usually, when your opponent has a high ranking, that means the match is tough to win. The red dots all represent winning matches, the blue dots represent losing matches, and the squares indicate final matches. These are all Grand Slam matches. Look: most of Federer's opponents are highly ranked players, and only in a few matches, I guess he was lucky, was he able to play an opponent with a low rank. We can also look at his performance in each Grand Slam. As I click Wimbledon, you will see Federer won a lot at Wimbledon.
Then let me click the one for Roland Garros. In Federer's entire career, he only won Roland Garros once. That was in 2009. In the other years, he pretty much lost to Nadal. Let's see what happened in 2009. I bring up Nadal's record and look particularly at Roland Garros. You pretty much see all red squares; that means he was the champion of Roland Garros. He only lost four matches there, including this one in 2009, when he lost in the semifinal. That was the year, actually, that Federer was able to win the championship. I will skip Novak and Carlos and bring you the overview of all four guys' performance in all four Grand Slams. If I look at each one, for the Australian Open you pretty much see Novak Djokovic is dominant. Then if you look at Roland Garros, Nadal is dominant. For the US Open, they all have won the US Open; I guess the US Open provides an opportunity for all of them. If you look at Wimbledon, I think Federer and Djokovic both did pretty well at Wimbledon, but Federer still won more than Djokovic. Finally, I wanted to look at their Grand Slam winning ratios. This plot shows me that, yes, Djokovic won the most Grand Slam titles. Also, if you look at the winning ratio, overall Djokovic has the highest, or similar to Rafael Nadal. In almost every category, you can see Djokovic has a higher winning ratio, except on the clay, at Roland Garros, where Nadal is the best. I would say that just based on winning the most Grand Slam titles and having the highest match ratio, Djokovic is the GOAT. Next, we would like to find out who is the youngest among the four of them to win a Grand Slam title. That was Nadal. I think he was only 18.9 when he won his first Grand Slam title. Alcaraz won his US Open at age 19.3. Djokovic and Federer won their first titles in their 20s. But look at their long, amazing careers: even at age 36, both of them were still able to win a Grand Slam title. I think that Djokovic will continue to win. I think he will have more titles under his belt. I also looked at their head-to-head records, since they definitely played each other. I wanted to see their net wins against each other. If you look at Rafael Nadal against Roger Federer, Rafael won 24 and Roger won 17 against Rafael. That brings their net... Rafael has seven net wins against Roger. Novak Djokovic has five net wins over Federer and one net win over Nadal. Even based on net wins, I think Djokovic is the GOAT. I still would like to see their serve statistics, because in that ATP data, this is the data most related to their technique. I put all these variables into the Oneway platform and utilize Fit Group. With that, you can see there's a lot of data; the sample size is bigger. With all the data together, it seems that Djokovic has better serve statistics than the rest of them. But I realize this is a big sample size. Sometimes a large sample size can turn a small difference into a statistically significant difference. I would rather see a subset. I look at a smaller sample size, and I look at Wimbledon.
Yeah, and at Wimbledon, I can still draw the conclusion that Federer is a little bit better than the rest of them. But once I look at the other Grand Slams, like the Australian Open, I cannot draw the same conclusion. Overall, I think their technique is very, very similar. In terms of serving success rate, they have very similar statistics. With all the statistical variables I have shown you, according to the statistics of winning the most Grand Slam titles and having the highest match-winning ratio, Djokovic is the GOAT. However, statistics don't paint the entire picture, as a player can have a much larger impact than just statistics, such as the way they play the game, their love for the game, and especially whom the player inspires, such as the young kids who aspire to be just like their idols, including my son, whose dream is to play the Easter Bowl, a prestigious youth tournament at Indian Wells. In the end, it was just an honor and a privilege to watch these three great players play the game, play tennis, all at the same time, and the future looks bright for more great tennis to watch, as players such as Carlos Alcaraz and others look to follow in the Big Three's footsteps. I had so much fun doing this project using features such as Graph Builder, Dashboard, and Application Builder in JMP. These features allowed me to easily explore a big data set and quickly identify atypical observations. Dashboards not only can put different analyses in one report but also allowed me to stay in the report and rerun the analysis after modifications. Application Builder allowed me to present the project without having to use PowerPoint. Although this project mainly analyzed ATP men's tennis data, the analytical tools and the workflow can easily be applied to women's tennis data, as well as to any data set with patterns in other fields. If you have any questions, please feel free to contact me. Thank you.
You have a question to answer, so you collect the relevant data and are ready to start creating a predictive model. But which type of model do you choose, and which modeling type? Is the goal to segment, predict, explain, or identify? Are the variables continuous, nominal, or ordinal? Before we even get to choosing a type of model, we must define how the variables are used in the analysis through the modeling type. What happens if we treat number of bedrooms as continuous versus a nominal or ordinal modeling type? When we pull up a distribution, we see how the modeling type impacts the type of summary statistics we get, such as means or frequencies. This poster demonstrates how the modeling type determines the results of your predictive model depending on which type of model is chosen. Get ready to play “Name that Analysis” as we go head-to-head on classifying different case study modeling examples with their respective modeling types.     Hello, my name is Andrea Coombs, and I'm joined by my colleague, Olivia Lippincott. Olivia and I have given a presentation before called What Model When. If you want to take a closer look at that presentation, you can take a look at the link in the community post. But today, we want to talk about something a little bit more. Yeah, today we want to think about modeling type and how modeling type impacts the analysis for each of the four model goals that we talked about previously. Right, and we're actually going to use the same data. This is data that we pulled from Redfin that represents the housing market in the Cincinnati area. Here we're trying to look at the price of homes relative to their square footage, the number of beds, the number of baths, and so on and so forth. Previously, we've answered the question, what model when? It really depends on what model you're going to choose based on your goal for the analysis. For segment, we're trying to examine relationships where there's no intended response; explain, we're trying to explain a relationship and look at the underlying factors and how those affect the response; predict, we're trying to predict future outcomes or the response in new situations; and identify, we're trying to find important variables. Right. Now let's bring the modeling type into the picture. Both your responses and your factors can have different modeling types. In JMP, there are three main modeling types: continuous, nominal, and ordinal. The continuous modeling type is represented by this blue triangle icon here, and it refers to numeric data only. The nominal modeling type is represented by this red icon, and this is numeric or character data where values belong to categories, but the order is not important. The ordinal modeling type is represented by this green icon, and this can be either numeric or character data as well, but in this case, values belong to ordered categories. When you're doing an analysis in JMP, you want to make sure you set up the correct modeling type, because JMP will do the correct analysis for you depending on the modeling type. Andrea, I have a game for us to play. It's called Name That Analysis. Do you want to play? Absolutely. I love games. Awesome.
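(A quick aside before the first question: the bedrooms example from the abstract can be illustrated outside JMP as well. The pandas sketch below, with made-up numbers, shows the same idea: a continuous modeling type yields numeric summaries such as a mean, while a nominal or ordinal modeling type yields category frequencies.)

```python
# Same column, two modeling types, two different summaries.
import pandas as pd

homes = pd.DataFrame({"beds": [2, 3, 3, 4, 4, 4, 5],
                      "price": [250, 310, 295, 420, 610, 455, 700]})

# Continuous modeling type: numeric summaries such as the mean.
print(homes["beds"].describe())

# Nominal/ordinal modeling type: category frequencies instead of a mean.
print(homes["beds"].astype("category").value_counts().sort_index())
```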
Here's  your  first  question. We  want  to  identify which  features  of  a  home are  most  important to  determining  the  price. For  example,  square  footage and  number  of  bathrooms can  explain  a  large  amount of  the  variation  in  price, but  other  features  are  less  important. All  right,  Olivia. I  think  you're  making this  first  question  easy  for  me. Is  the  answer  identify? Let's  see. Yeah,  you're  right. I  did  make  that  one a  little  bit  easy  to  get  us  going, but  that  is  identify to  find  important  variables  within  there. There's  a  couple of  different  places  in  JMP where  we  can  use  tools  to  identify if  that's  our  modeling  goal. Under  the  Analyze  menu  under  Screening, Predictive  Modeling  and  Fit  Model, using  tools  like  Predictor  Screening, Bootstrap  Forest, Generalized  Regression and  Stepwise  Selection. For  modeling  type, when  we're  looking at  the  goal  of  identify, it's  not  going  to  affect  things  much. JMP  is  going  to  do  the  correct  analysis as  long  as  your  modeling  types are  set  appropriately. We  took  a  look  at  this  and  we  took both  the  response  and  the  factors and  changed  them from  continuous  to  nominal and  looked  at  how which  factors  came  up  as  most  important. While  the  order  of  the  factors  varied, the  dominant  factors  stayed  the  same. All  right. It  looks  like  if  our  goal is  to  identify  important  factors, really,  the  exact  modeling  type we're  using isn't  impacting  things  that  much, it  looks  like,  Olivia. Right. Our  conclusions on  which  variables  are  important aren't  going  to  change  much based  on  the  modeling  type. All right. Well,  that  is  good  to  know. I  have  a  question  for  you. Are  you  ready? I'm  ready. All  right,  here  is  your  question. Let's  say  we  want  to  build  a  model to  predict  house  prices. This  model  will  be  based  on  many  important predictor  variables  we  have  in  our  data. For  example,  we  want  to  predict the  price  of  a  house that  we  want  to  put  on  the  market. Which  goal  do  you  think we're  working  with  here? Okay,  so  it's  not  like  question  one where  we're  trying  to  see which  factors  are  most  important to  predict  housing  prices. We're  just  really  trying  to  get that  final  housing  price  prediction. I'm  going  to  go  with  predict. All  right,  let's  see  if  you're  right. Yes,  you  are  right. The  goal  of  this  analysis  is  predict. There's  lots  of  different  platforms  in  JMP where  you  can  build  models  for  prediction. Within  each  of  those  platforms  in  JMP where  you  can  build  the  prediction  models, JMP  will  do  the  correct  analysis  for  you, depending  on  the  modeling  type of  your  response. Here  we  have  a  table of  different  modeling  types for  our  responses: continuous,  nominal,  and  ordinal. For  a  continuous  response, this  is  the  typical  one that  we  were  talking  about,  right? We  want  to  predict  the  price  of  a  home that  we're  going  to  put  on  the  market. Now,  when  we're  building  this  type of  model  with  a  continuous  response, well,  we  want  to  know how  powerful  that  model  is. What's  the  predictive  power  of  that  model? We  can  use   RSquared and  the  Root  Average  Squared  Error to  diagnose  that  model. Now,  for  a  nominal  and  ordinal  model, it's  a  little  bit  different. 
For  a  model  with  a  nominal  response, we  have  categories  as  the  response. In  this  example,  we're  looking at  whether  or  not  the  price will  be  over  or  below  $1  million. That's  what  we  want  to  predict. For  the  ordinal  response, here  we  have  an  ordered  category. We  want  to  predict whether  the  price  of  the  house is  going  to  be  low, medium,  and  high. For  the  nominal  and  ordinal  examples, again,  we  can  look  at   RSquared and  Root  Average  Squared  Error to  evaluate  those  models. But  there's  other  things  that  we  can  use to  evaluate  those  models, like  the  misclassification  rate and  the  area  under  the  ROC  curve. Of  course,  our  favorite  tool  in  JMP to  take  a  look  at  our  prediction  model is  the  Prediction  Profiler. Let's  take  a  look  at  the  difference between  the  Prediction  Profiler for  the  modeling  types  of  our  responses. For  the  continuous  response, we  can  see  that  on  the  Y-axis, we  have  the  mean  prediction plus  or  minus  the  confidence  interval given  the  value  of  the  model  factors here  on  each  of  the  X -axes. For  the  nominal and  ordinal  logistic  models, what  we  see  on  the  Y -axis is  the  probability  of  the  response  being in  a  certain  category. For  the  nominal  logistic  model, we  have  the  probability that  the  house  is  either  going  to  be above  or  below  a  million  dollars. For  this  ordinal  logistic  model, we  can  see  the  probability  of  having a  low,  medium,  or  high  price. Okay,  so  it  sounds  like  the  goal of  what  we  want  to  predict is  also  important  when  we're  talking about  that  prediction  goal, whether  we  want to  treat  price  as  continuous and  get  the  predictions of  the  exact  prices  out  of  there, or  if  we  want  to  treat  it  as  a  category. Right. You  just  need  to  get that  response  variable  set  up and  your  data  set  the  correct  way, and  then,  of  course, assign  the  correct  modeling  type, and  JMP  is  going  to  build the  correct  model  for  you. All  right,  Andrea. Are you  ready  for  your  next  question? I'm  ready. Let's  go. Okay. We  want  to  quantify the  effect  on  home  prices from  additional  bedrooms. For  example,  on  average, every  additional  bedroom  adds about $ 97,000  to  the  total  home  cost. Adding  a  bedroom  adds  $97,000? Man,  Cincinnati  is  a  tough  housing  market. That's  crazy. All  right,  well, so  let's  see. What  are  we  trying  to  do  here? We're  trying  to  quantify  the  effect  here. I  think  what  we're  trying  to  do  is  explain that  effect  that  bedrooms  has on  the  price  of  a  house. I'm  going  to  say  explain. You're  correct. Yeah,  we're  trying  to  describe the  relationships. In  explain,  we  use  the  parameter  estimates taken  from  the  model  equation to  quantify  those  relationships between  the  factors  and  the  responses. Typically,  we  use  in  JMP under  the  Fit  Model  menu  location tools  like  Standard Least  Squares, Logistic  and  Ordinal  Regression, and  Generalized  Regression. Modeling  type  can  really  impact how  our  factored  relationship with  the  response  variable  is  interpreted. We  took  a  look, and  we  were  looking at  how  does  the  number  of  beds  affect the  housing  price? We  changed  beds  from  continuous, to  nominal,  to  ordinal, and  see  what  that  relationship  was. 
We can see that under continuous, that's where we get that every additional bedroom adds about $97,000 to the total home price. The prediction profiler shows a linear relationship when we treat beds as continuous. But when we treat beds as nominal or ordinal, there's not that straight linear relationship going on. We see a spike in price going from 4 to 5 bedrooms compared to going from 2 to 3 bedrooms. Right. I see that with nominal and ordinal, the prediction profiler looks almost exactly the same, so it must be the same model. However, I'm seeing that the parameter estimates look a little bit different between nominal and ordinal. What's going on there? Yeah, the nominal and ordinal modeling types are coded differently when we use them within a regression, so the parameter estimates are different. For nominal, we think of the intercept as the mean house price across all the different numbers of bedrooms, and each of those parameter estimates is how much that number of beds increases or decreases that mean house price. But for ordinal, because order matters, we think of the intercept as if there are zero bedrooms, and each of those parameter estimates is the effect of adding an additional bedroom onto the price. All right. Modeling type is really going to affect my parameter estimates. I really need to think about exactly what I want to explain as a part of this model when I'm doing this analysis. Yes. All right. Are you ready for the final question, Olivia? Yeah, bring it on. All right, here's the question. Let's say we want to identify groups of homes that are similar based on a list of possible characteristics. In other words, we want to identify market segments based on things like square footage, location, number of bedrooms, et cetera. Which goal do you think this is? I think you're trying to trick me with that identify, and I'm not going to fall for it. Okay. But there are no responses within this question. I think we're looking at clustering. I'm going to say segment. Okay. Well, you're right, Olivia. I did try and trick you a little bit because I really wanted to win. But you're right, that's the key thing here: there are no responses in this analysis. We are definitely looking at segment. When our goal is segment, we can use a couple of different clustering tools. We can do Hierarchical Clustering, K-Means Clustering, or Latent Class Analysis. It's important to keep in mind that with Hierarchical Clustering, you can only include... Sorry, you can include all of the modeling types: continuous, nominal, and ordinal. But for K-Means Clustering, you can only include variables that are continuous. For Latent Class Analysis, you can only include nominal or ordinal variables. In our case here, when we're looking at the number of bedrooms, lot size, year built, and square feet, we have a combination of continuous and nominal variables. Hierarchical Clustering may be the best clustering tool to use in this scenario.
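As a rough non-JMP illustration of that last point, the sketch below one-hot encodes the nominal columns, standardizes everything, and runs agglomerative (hierarchical) clustering. The file and column names are hypothetical placeholders for the Redfin-style table, and this is only an approximation of what JMP's Hierarchical Clustering does with mixed modeling types.

```python
# Hierarchical clustering on a mix of continuous and nominal home features.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

homes = pd.read_csv("cincinnati_homes.csv")  # hypothetical file name
X = pd.get_dummies(homes[["square_feet", "lot_size", "year_built", "beds", "location"]],
                   columns=["beds", "location"])   # one-hot the nominal columns
X_scaled = StandardScaler().fit_transform(X)

homes["segment"] = AgglomerativeClustering(n_clusters=6).fit_predict(X_scaled)
print(homes.groupby("segment")[["square_feet", "lot_size"]].mean())
```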
It looks like with that parallel plot from Hierarchical Clustering, maybe we could call Cluster 6 Amazing Location. Yes. If you think a large lot size is an amazing location, yeah, we can definitely call that segment Amazing Location Homes. Well, all right, Olivia, despite my trying to trick you with that last question, it looks like we ended up with a tie here again. We'll have to rematch again soon. Absolutely. We talked about what model when, and really, what model you choose depends on your goal for the analysis, whether it's segment, explain, predict, or identify. Yeah, and in terms of modeling type, again, JMP is going to do the correct analysis for you, especially with your responses. If you're setting them up with the correct modeling type, JMP is going to do the correct analysis for you. If your goal is explain, you might need to think a little bit about which modeling type to use, depending on how you want to explain the effect of something like the number of bedrooms. Thank you, Olivia. This is so much fun. Let's do it again next year.
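As a closing illustration of the explain discussion above, here is a hedged statsmodels sketch of the same comparison: beds treated as a continuous number (a single slope, on the order of the $97,000-per-bedroom estimate) versus beds treated as a category (one estimate per level). The file and column names are hypothetical, and this is ordinary least squares rather than JMP's Fit Model platform.

```python
# Numeric beds vs. categorical beds: same data, different parameter estimates.
import pandas as pd
import statsmodels.formula.api as smf

homes = pd.read_csv("cincinnati_homes.csv")  # hypothetical file and column names

# Continuous modeling type: a single slope for beds.
print(smf.ols("price ~ beds + square_feet", data=homes).fit().params)

# Nominal modeling type: beds coded as categories, one parameter per level.
print(smf.ols("price ~ C(beds) + square_feet", data=homes).fit().params)
```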
Cleaning large systems effectively is critical to many applications within industrial manufacturing. Establishment of a system's hold-up volume and the target concentration of residual contaminant after cleaning do not fully define a cleaning method, however. Assuming a cleaning method that repeats the same flush volume until the system has reached its target contaminant level, viable cleaning schemes will differ in how many flushes, and what flush volume, are used to reach that target. By creating a graph showing "Flush volume" as a function of "Number of flushes needed" with editable table variables of "System hold-up volume" and "End contaminant level" in slider boxes, the user can visualize many different scenarios in many different systems before selecting the scheme that best fits their varied needs.      Hello, my name is Margaret Wilson-Moses, and I work at Inpria Corporation as a Senior Product Engineer. Inpria Corporation is up in Corvallis, Oregon. Inpria designs and manufactures metal oxide photoresists to unlock the full potential of extreme ultraviolet, or EUV, photolithography. At its most basic level, Inpria is made up of chemists and engineers. The chemists at Inpria are the ones developing and designing the chemistries, testing them and refining them to behave how they want, and then taking those small formulations and giving the best designs to the engineers. The engineers scale up those designs to manufacture in our high-volume manufacturing facility, where we have tighter control over processing and higher purity, and that finished product goes to our customers. Both the chemists and the engineers at Inpria use JMP extensively. The chemists use JMP to design their experiments and to analyze the results of those experiments, often using hypothesis testing. They also use JMP to build models, so they're better able to predict how certain parameters will affect the response variables they're seeking. They also use JMP to present their findings within their teams and across Inpria. Engineers use JMP to work up our process data. We have sensors on our manufacturing facility that are collecting data every two seconds. We have a large amount of data, and we use automated scripts from JMP to pull out the pieces of that data that we need so we can analyze processing much more simply. We use control charts to look at process parameters for different batches and also to analyze QC metrics of our product over time for different formulations. Those QC metrics end up on our certificates of analysis. We also use measurement systems analysis, or MSA, to perform gage studies on the tools that we use to collect the quality control metrics. We also join data from various sources, as do the chemists, because different groups at Inpria are always collecting data, so it's good to have a way to join the data and look for correlations between those data sets. Finally, we also use JMP to present our data both internally and to our customers. A couple of years ago, when we were bringing up our high-volume manufacturing facility, I was tasked with the goal of developing a cleaning scheme for our manufacturing facility.
Inpria, at that time and in the same way today, cleans our manufacturing systems by dilution to ensure lot-to-lot segregation. To ensure that lot A does not end up in lot B, or that any carryover is limited to a very small amount. I've modeled that here. At the top left, we have our finished product. When we transfer that to bottles and we're done wasting, once we've taken out all we can take, we're left with a little bit of heel in our system. That heel cannot be drained or flushed out. It's just unreachable by the pumps. To that heel, we add a flush that doesn't contain any of the starting material, and we end up with a diluted heel. We waste as much as we can and end up with the same volume of heel, but at a lower concentration of X. We repeat this process, flushing and wasting, flushing and wasting, until we have the concentration that we need at the end. Further on, I'll model this by calling X₀ the concentration of X in the finished product, or whatever the heel starts at, which is going to match the final product. Then X₁ is the concentration of X after the first flush, X₂ is the concentration after the second flush, and so on up through Xₙ. The flush volume here, called F, is the same volume repeatedly throughout the process. My objective when approaching this problem was to develop a dynamic and user-friendly visual model to show my team how different variables in this problem affect the potential cleaning schemes we could use, with the overall goal of supporting selection of a scheme that would fulfill all our criteria. Flushing, or cleaning by dilution, can be modeled with this equation here (written out below), which I just pulled from the math that comes out of the previous slide, taken to n flushes. Xₙ, the amount of X remaining after n flushes, equals the expression on the right, where H is the volume of the heel, F is the volume per flush in the cleaning, n is the number of flushes, and X₀ is the initial amount of X. This equation does, importantly, assume that the flush and heel are perfectly mixed on every flush, but we do have experiential data that supports that this is indeed the case, thankfully. We are also assuming that the heel volume is consistent every time, so we're wasting down to the same exact heel, and that the flush volume is the same exact volume every single flush. But again, we have experiential data that supports that, so we're okay with these assumptions. Functionally, we are actually more interested in translating this equation to be a function of n, the number of flushes, and F, the volume per flush, because the other variables in this equation are pretty much set. The heel, or H, is constant per manufacturing system. It may vary from system to system, but it is known from the data we've collected in the past, which lets us calculate what it is. X₀ and Xₙ vary per cleaning, but again, we have analytical methods to understand both X₀ and Xₙ per cleaning. Those are our known variables. Flush volume and number of flushes are the two variables that we're varying to figure out a cleaning scheme. Those are the ones to be determined. We have a few boundary conditions also.
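Written out from the description above (equal flushes, perfect mixing), the dilution relationship is

$$X_n = X_0\left(\frac{H}{H+F}\right)^{n},$$

since each flush multiplies the contaminant concentration by $H/(H+F)$; here $H$ is the heel volume, $F$ the volume per flush, $n$ the number of flushes, and $X_0$ the starting concentration.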
The total amount of solvent used must be as low as possible. We didn't come into this with a hard limit, but more solvent used means higher cost, both in raw materials and in waste, so we want to limit it as much as possible. Another boundary condition is that we want to keep the number of flushes less than or equal to 20 in order to be operational. That's because we want to fit our cleaning within our production cadence. Fewer flushes are better operationally, but we have a hard limit at 20. Flush volume also has a boundary condition. It has to be less than or equal to the capacity of the system. In this case, I've set our capacity equal to 60 gallons. The maximum flush volume you could have is clearly 60. If you take the equation I showed on the previous slide and these boundary conditions, along with the fact that we know H, X₀, and Xₙ, you can actually solve this equation. You can find F and n to satisfy the boundary conditions. But that misses the goal that I had when I came into this process. To explain a little bit more about that, there are many reasons why going with a visual model in JMP was more advantageous for me when I approached this problem than just working with this static equation. First off, this equation is not intuitive. I'm pretty good at math, and I can't see what this equation means in graph form by just looking at it, whereas when you have a visual model, you can see how different variables affect each other. The equation is limited. You may have one H, one heel value, per manufacturing system, but you have different manufacturing systems. You may want to vary H and see what the different cleaning schemes are. With this equation, you get one answer per set of input variables. With a visual model, you have a very versatile relationship that you can see, and it covers infinitely many different cleaning schemes. The equation is also very individual. I crunch this math, I make the assumptions, I put the input values in, and then I tell my team how many flushes we should use and what the volume per flush should be. They can check my math, but they're not involved in that process at all. They just say, "Okay." Whereas the visual model that I built is user-friendly and dynamic, so they can tinker around with it themselves, see how the variables affect each other, and come to their own conclusions, which also leads to more buy-in from the team when we end up choosing a cleaning scheme. Lastly, an equation is boring. I like math, and it's still more fun for me to play around with the graph than it is to use an equation. The visual is much more interactive. I'm going to move from here to sharing my model. I started with just a simple data table with three table variables. I have X₀ here and Xₙ, both in ppm. I've set X₀ to be 1 million ppm, Xₙ is the target end goal for the cleaning, and heel is the heel of the system. The number of flushes here is just an independent variable, so it's just an integer value between 1 and 35.
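In code terms, the formula columns described next amount to something like the hedged Python sketch below: for each candidate number of flushes, solve the equation above for the flush volume that just reaches the target, round to one decimal place (matching the control-system resolution mentioned later), and tally the total solvent. This is a sketch of the arithmetic, not the JMP table itself; with the baseline settings it reproduces the 20-flush, 3.9-gallon-per-flush, 78-gallon scheme discussed below.

```python
def flush_volume(n_flushes, heel, x0, xn):
    """Volume per flush so that n equal, perfectly mixed flushes dilute x0 down to xn."""
    return round(heel * ((x0 / xn) ** (1.0 / n_flushes) - 1.0), 1)

def cleaning_schemes(heel=5.0, x0=1_000_000, xn=10, max_flushes=35):
    rows = []
    for n in range(1, max_flushes + 1):
        f = flush_volume(n, heel, x0, xn)
        rows.append({"flushes": n, "flush_volume_gal": f, "total_solvent_gal": round(n * f, 1)})
    return rows

for row in cleaning_schemes():
    if row["flushes"] in (10, 18, 20):
        print(row)  # 10 -> 10.8 gal/flush; 18 -> 81 gal total; 20 -> 3.9 gal/flush, 78 gal total
```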
I  did  go  beyond the  maximum  number  of  flushes just  to  get  a  better  view of  the  relationship. Then  the  flush  volume  column  here is  built  off  the  equation that  I  showed  previously. I  open  up  this  formula, you  can  see  that  this  is  the  same  equation I  shared  in  the  slide, and  it's  a  function of  the  table  variables. It's  a  function  of  these  table  variables here  that  are  brought  into  the  equation. I've  also  rounded  it  to  one  decimal  place because  our  control  system only  has  that  amount  of  sensitivity when  adding  flush  volume. Finally,  I  have  a  third  column  here which  is  the  total  solvent  used. This  is  simply  a  product of  the  first  two  columns. It  will  show  you how  much  solvent  you've  used per  cleaning  scheme. It  just  multiplies  the  number  of  flushes by  the  flush  volume to  give  the  total  amount  of  solvent  used. When  I  go  into  my  cleaning  model, it's  built  around  Graph  Builder. I'm  going  to  set  my  standard  here to  be  1  million  ppm  as  the  starting  value. Let's  say  our  target  is  10  ppm of  residual  X  at  the  end  of  cleaning and  our  heel  is  five  gallons. You  can  see  here  that  we  have  an  X on  the  graph  of  number  of  flushes and  then  there's  two  different  Y  values. There's  the  flush  volume  in  gallons and  the  total  solvent  used  in  gallons. I  also  have  a  reference  line  here  at  20 to  show  the  maximum  flushes that  are  operable. Each  of  these  points  reflected  here is  a  full  cleaning  scheme. If  I  select  this  point  right  here, this  is  saying  10  flushes. With  this  starting  value  of  X, this  goal  end  of  X  at  this  heel  value, and  you're  going  to  have the  flush  volume  of  10.8  gallons  per  flush in  that  cleaning  scheme. When  you  look  at  this  graph, all  of  the  red  points to  the  left  of  this  black  line are  valid  cleaning  schemes. Some  of  them  use  more  solvent, total  solvent,  and  some  of  them  use  less, but  they're  all  valid per  our  boundary  conditions. Then  I  have  at  the  right  here the  table  variables  in  slider  form, these  top  three  things, which  are  editable  either  by  a  slider or  by  the  number  edit  box  here that  are  connected  to  the  table  variable. Then  I   have  a  button  box  here which  will  take  all  this  information and  compute  it  and  give  you a  simple  answer  for  the  total  solvent  use. If  I  click  that  right  here, I  get  a  box  that  says, "Okay,  if  you  have  20  flushes, the  volume  per  flush  will  be  3.9 and  you're  going  to  use  a  total of  78  gallons  in  this  cleaning  flush." There's  a  few  things  we  can  get without  really  even  tinkering with  the  slider  variables  over  here with  this  visual  model that  we  couldn't  get  from  the  equation, like  I  said  before. First  off, you  do  see  this  negative  correlation between  number  of  flushes and  total  solvent  use. When  I  said that  you  could  solve  the  equation using  the  boundary  conditions and  our  known  table  variables, I  meant  that  if  you  are  trying  to  use the  least  amount  of  total  solvent and  you  have  a  maximum  of  20  flushes, then  your  best  scheme, assuming  you're  trying to  minimize  solvent  used, is  going  to  be  at  20  flushes. That's  due to  this  negative  correlation  here. 
The more flushes you use, the less total volume you use; it's a negative correlation. You also see with this visual that it's an exponential decay. While 20 flushes may use the least amount of total solvent under these boundary conditions, 18 flushes only uses 81 gallons, which is only three gallons more than you use at 20 flushes. This is where it becomes important to have this visual, because it might be the case that the time we save in operations by reducing flushes from 20 to 18 is actually worth the three gallons we lose in total solvent used. All right, now we can start looking at the sliders. With the sliders, you can see how these variables affect the possible cleaning schemes. The sliders here cover the range of what's reasonable for this model, maybe not everything that's possible. You're looking at how these red dots change in relation to the Y-axis. You're looking both at the steepness of this curve, because that shows you how important it is to maximize your flushes within these bounds, and at the overall level. You want to reduce total solvent, so the lower down your red dots are, the better off you're going to be. By manipulating these, I can show how much it matters to start at a lower or higher concentration. You can see that, yes, the red dots are moving a little bit, but it takes quite a while before they have an effect on the height of those red dots in the lower graph. With Xₙ, you can see how much the residual contaminant goal matters. If your goal allows a higher amount of residual contaminant, so you have more buffer to reach your goal, then you're going to need less solvent. Whereas if you have stricter criteria for residual contaminant, so a lower goal, you're going to use quite a bit of solvent. You can see here that you have fewer and fewer viable cleaning schemes when you have that lower residual goal. Let me just set it back to 10. Then with heel, you can tinker around with this and see how much the heel of a system affects how much solvent is needed and what cleaning schemes are possible. A greater heel means you have to use more and more solvent to achieve your goal criteria. A smaller heel means you need less and less solvent. Beyond just selecting a cleaning scheme, this visual model allowed our team to look for what we should focus on if we're trying to reduce the overall amount of solvent when we're cleaning. We did that by asking: if we had options to change our variables X₀, Xₙ, and heel, which one would give us the most bang for our buck? Which one could we change with reasonable effort on our part to get the biggest reduction in total solvent used per cleaning? Here, I'll say we had three options. I'll say that we had an option to reduce the starting value by 50%. Say we could arrange our batch schedule so that cleanings would only start at 50% of what they originally started at. We could change our goal from 10 ppm to 50 ppm, so multiply it by five.
That's because we had different sensitivity techniques or different restrictions for material following a certain batch. Then we could also reduce our heel by one gallon through engineering improvements to the actual manufacturing systems. Really quickly, using this graph, we'll see what happens to the overall volume of cleaning solvent needed when you change from this standard set of input variables while keeping the number of flushes at 20. Again, our standard level is 78 gallons. If you change from 100% of X in your system at the beginning to only 50%, so 500,000 ppm, and I'll set the variable values there, we're now only using 72 gallons to clean out. But that's only a reduction of six gallons per cleaning. That really didn't move the needle very much. Next, we'll look at changing our Xₙ to 50 and then setting our variable values again. That says that we will need 64 gallons of solvent to clean at the end. That's a better improvement in the total solvent needed. Finally, we'll look at reducing the heel in the system from five gallons to four gallons, and see that the total solvent needed is 62 gallons here. That's the greatest reduction in solvent needed to clean. With this information, and it was similar information that we used to actually pursue this option, we did look at ways to reduce the heel in our system to reduce the overall solvent needed, because we saw that, given the amount we could realistically change any of these variables, that variable affects the total solvent used the most. That is reflected here. These were the three options I just went over. You can see that with option A, the reduction in starting level, you are only gaining six gallons. With option B, the increase of the end level, you gain 14 extra gallons, and option C gains 16 extra gallons. But really looking at these values, again, it's important that a team is involved in this, because there's only a two-gallon difference between option B and option C. If option B were very easy compared to option C, we probably would have gone with option B. But a heel reduction of one gallon would be the most effective of these three options at reducing total volume for cleaning. I have some concluding remarks. Basically, I just want to circle back to what I brought up originally. It could be that mathematically minded people would look at the equation on the left and automatically know that changing the heel is going to have the greatest effect on the total solvent used. That's not instinctive to me, to know that just by looking at this equation. I know that for the people I work with, that's not instinctive to them either. In the case where I'm on a team trying to come to a conclusion with consensus from all of us, it's very helpful to have a visual model in JMP where people can manipulate the variables themselves and see how a quantity like total solvent used is changed by changing variables whose meaning we know. I can see many applications of using a visual model with slider variables, even in my company.
Any  multivariable  process where  the  relationship between  those  variables  is  known has  the  potential  to  make  use  of  a  model with  the  slider  variables to  give  a  visual  model  of  how  the  process reacts  to  changes  in  those  variables. I'm  thinking  of  there's  predictable  ways that  fluid  temperature  affects how  hard  a  pump  has  to  push  fluid to  reach  a  certain  flow  rate. That  could  be  helpful in  selecting  filters  for  a  given  process. I  believe  it  could  be  useful for  many  other  applications  as  well. Ease  of  generation. I  am  not  a  coder  and  I  was  able to  build  this  model  relatively  easily with  help  from  Jordan  Hiller, oh,  shout  out  Jordan  Hiller,  thank  you, also  from  the  scripting  index, which  had  many  great  examples, and  the  JMP  user  community. Because  I  was  able  to  look through  historical  questions and  piece  together  how  I  wanted to  go  about  a  certain  problem. It  was  relatively  simple to  make  this  model. Having  made  this  model, it  would  be  very  easy  to  manipulate in  the  future  for  other  purposes. Room  for  improvement. Again,  there  may  be  a  place  that  I  missed where  it  was  possible  to  do  this, but  in  my  exploration, I  couldn't  find  a  way to  make  a  reference  line  that  connected, basically  intersected  the  graph at  this  point and  moved  along with  the  graph  to  different  points. A  reference  line  that  is  a  function. I  wasn't  able  to  do  that. I  know  I  saw  different  examples  of  it, but  I  wasn't  able  to  do  it with  using  table  variables as  part  of  that  function. But  again,  that's  probably due  to  my  not  being  a  coder, but  it  would  be  helpful if  that  were  a  simpler, more  accessible  feature in  modeling  in  JMP. But  really, I  do  think  that  the  model  I  created, it  was  very  helpful  for  our  purposes, and  it  will  be  helpful for  other  applications  in  the  future. I  hope  it's  helpful  to  other  people. Thank  you.
There are many times when an exact measurement is not possible, but a range of values for the measurement is available. Censored data analysis methods enable you to incorporate the information from both types of measurements. This presentation provides an introduction to censored data situations: when they appear, how to handle them, and what happens when you do not handle them appropriately. This presentation includes examples of censoring in univariate and regression settings by using the Life Distribution and Generalized Regression platforms in JMP and JMP Pro 17, including the new Detection Limits column property.     Hi,  my  name  is  Michael  Crotty.   I'm  a  statistical  writer with  the  Stat  Documentation  Team  at  JMP, and  today, I'm  going  to  talk  about  an  introduction to  censored  data  analysis in  JMP  and  JMP  Pro. To  start,  we've  got three  common  types  of  censoring. Just  to  back  up  a  bit,  censored  data  occur when  you  don't  have  an  exact  measurement for  an  observation, but  you  do  know a  range  for  the  observation, so  you  know  not  the  exact  value, but  you  do  know  something about  where  the  value  might  be. What  we  want  to  do by  using  censoring  in  our  analyzes is  to  use  that  information  that  we  have, even  if  it's  not  exact. The  three  types  of  censoring that  we'll  talk  about  today are  right  censoring,  left  censoring, and  interval  censoring. Right  censoring  is  probably the  most  common  form  of  censoring. It  occurs  when  the  event  of  interest just  doesn't  have  time  to  occur  yet by  the  end  of  the  study. In  a  reliability  test, you  might  have a  bunch  of  light  bulbs   under test and  at  the  end  of  the  test  period, some  of  them  have  failed. Those  are  exact  observations, but  then  some  haven't  failed  yet. You  know  they're  going  to  fail, but  your  study  has  ended, so  it's  censored  at  that  point. Same  thing  in  survival  models where  a  patient  survives to  the  end  of  the  study. One  thing  to  note  is  that  right  censoring is  the  only  type  that  in  JMP, supports  a  single  response  column alongside  of  a  binary  censor  column. The  next  type  is  left  censoring. That's  where  the  event  of  interest  occurs before  the  observation  starts. A  common  example  of  that  would  be where  you  put  a  bunch  of  units  under  test and  at  the  time that  you  do  the  first  inspection, some  of  them  have  already  failed. You  know  that they  started  without  a  failure, but  by  the  time  you  measured  them, you  checked  on  them,  they  had  failed. So  they  failed  sometime  before  that  point. Another  example  of  that is  limited  detection where  you  have  a  measurement  tool that  can't  measure below  a  certain  threshold. The  last  type  we'll  talk  about  today is  interval  censoring. This  is  where  your  event  of  interest happens  between  observation  time. If  you  have  a  periodic  inspection  schedule instead  of  continuous  observation, you  might  see  that  something  fails or  something  happens between  time  two  and  three. It  didn't  happen  at  time  two and  it  didn't  happen  at  time  three, but  it  was  somewhere  in  that  interval. Take  a  quick  look at  what  this  looks  like  in  JMP. Here's  an  example  of  the  right  censoring with  a  response  column and  a  censor  column. 
In  the  platforms  that  support  censoring, you  always  see  this  censor  role, that's  for  that  binary  censoring  column. This  is  the  way  that  you  can  do, you  can  specify  censoring  more  generally, which  is  with  two  response  columns. Basically, it's  like  a  start  time  and  an  end  time. For  left  censoring, we  don't  know  when  it  happened, so  the  start  time  is  missing, but  the  end  time, we  know  it  happened  before  time  50, so  somewhere  before  that. Reversed  that  for  right  censoring, we  know  that  at  time  25, it  hadn't  happened  yet, but  it  happened  sometime  after  that. Then  with  interval, both  the  start  and  endpoints are  non-missing, but  we  don't  know  when  the  event  happened in  this  case  between  80  and  150. It's  not  shown  in  the  table  up  here, but  down  here,  we've  got  somewhere there's  exact  censoring. To  specify  that, you  just  use  the  same  value in  both  columns. That  means  essentially it's  like  an  interval  with  zero  width. It  happened  at  that  exact  time. Next,  we're  going  to  talk about  two  examples  of  censoring. The  first  is if  you  have  censoring  in  your  data, but  maybe  you  don't  know  how  to  handle  it, and  so  you  just  think, "I'll  just  ignore  it." We're  going  to  look  at  what  can possibly  happen  when  you  do  that. In  this  example, we've  got  simulated  data from  a  lognormal  distribution and  the  observed  data that  we'll  use  for  analysis in  our  different  cases is  where  all  the  values  from  the  true  data that  are  over  1,900,  we  set  them  to  1,900, as that's  the  censoring  time for  it's  right  censoring. There  are  a  few  possible  things you  could  do if  you're  trying  to  estimate this  mean  failure  time. You  could  do  nothing. You  could  just  use  this  observed  data with  a  whole  bunch  of  values  set  to  1,900, act  like  that's  when  it  happened. You  could  treat  those  as  missing  values, just  drop  them  from  your  data, or  you  could  use  the  censoring  information that  you  have  in  your  analysis. For  right  censoring, these  first  two  approaches are  going  to  tend  to  underestimate the  mean  failure  time because  you're  dropping  information from  the  data  at  that  far  end. Looking  more  closely  at  this, because  this  is  simulated  data, we  have  the  true  distribution here  in  this  first  column. That's  just  for  comparison, but  in  general,  you  wouldn't  have  that because  you'd  have that  all  values  above  1,900. You  don't  know  where  these  fall. In  our  observed  Y, this  is  where  we  just  use all  the  1,900s  as  values  of  1,900. We  have  no  missing  values, but  a  big  point  mass at  the  top  of  our  distribution  here. You  can  see  that  the  mean is  a  lot  smaller  than  the  true  mean. In  this  missing  Y  column,  this  is where  instead  of  treating  them  as  1,900, we  drop  them. We  set  them  to  missing and  analyze  the  distribution  without  them. Here  you  can  see  that now  our  maximum  of  the  non-missing  values is  less  than  1,900, which  really  doesn't  make  any  sense because  we  know  that  a  bunch  of  them, 21  observations,  in  fact, are  some  value  greater  than  1,900. So this  underestimates  the  mean  even  more. Then  on  the  right  here, we've  got  an  analysis  in  life  distribution in  JMP. This  is  where  we're  using the  observed  Y  column. 
It's  got  those  1,900s, but  we're  also  using  a  censoring  column alongside  it. For  the  rows  where  observed  Y  is  1,900, our  censor  column  is  going  to  say that  it's  a  censored  observation. Here  we  can  see  that  our  mean, it  actually  ends  up being  a  little  higher  than  the  true  mean, but  our  lognormal  parameter  estimates are  much  closer  to  the  true  values and  we're  incorporating all  the  information  that  we  have. For  our  next  example, we're  going  to  look  at  detection  limits. This  is  a  limit  of  detection  problem where  we  have  data on  the  yield  of  a  pesticide called  Metacrate that's  based  on  levels of  some  other  regression  variables. In  this  situation, the  measurement  system  that  we  have has  a  lower  limit  of  detection where  it  can't  measure any  yields  that  are  less  than  1 %. So  in  the  data, they're  just  coded  as  zeros, but  it  really  just  means it's  some  yield  below  1 %. There  are  two  ways you  could  analyze  this incorporating  that  information  in  JMP. The  first, you  could  treat  it  as  left  censoring, use  two  response  columns  with  the  first the  left  column  has  a  missing  value, and  the  right  column  would  be  a  one, or  you  can  use the  detection  limits  column  property that's  new  in  JMP  and  JMP  Pro. We'll  take  a  look  at  this. Here's  a  subset  of  the  data. This  Metacrate  Reading  column  is the  same  as  the  original  reading  column, but  it's  got a  detection  limits  column  property. Because  this  is  a  lower  detection  limit where  we  can't  measure any  lower  than  that  limit, we're  going  to  set the  lower  detection  limit  to  one. The  other  way  you  could  do  this is  with  the  two  columns. In  this  case, we  know  that  it's  left  censoring, so  the  left  side  is  missing and  the  upper  side  of  that  is  one, just  means  that  the  value is  somewhere  less  than  one. That's  all  we  know. But  as  you  can  see  from the  column  information  window  down  here, the  detection  limits  column  property is  recognized  by  the  distribution and  generalized  regression  platform. So  this  is  a  regression  problem. We'll  use  generalized  regression in  JMP  Pro. Here  we  fit   a  lognormal  response  distribution, and  it's  able  to  do  that on  this  Metacrate  reading  column, even  with  the  zeros  in  there, because   GenReg's  not  treating those  observations  as  zeros, it's  treating  them as  values  censored  at  one. Now,  we  were  able  to  use all  the  information and  get  a  regression  model. In  conclusion,  probably, the  most  important  thing  is when  you  have  censoring  information, it's  better  to  use  it  in  your  analysis than  to  ignore  it. Censoring  can  occur  a  lot  of  times for  time  responses, but  it  can  also  occur  for  other  responses. A  good  example  of  that is  these  limited  detection  problems. Finally,  you  can  use the  following  approaches to  specify  censoring  in  JMP. There's  the  two-column  approach that's  probably  the  most  flexible because  that  allows  you to  do  right  censoring,  left  censoring, interval  censoring, as  well  as  a  mix  of  all  three  of  those. For  right  censoring, you  can  use  the  one  column  response paired  with  a  binary  indicator  column for  censoring. 
There's also this new column property in JMP for detection limits where you can set a limit of detection either on the low side or the high side. We've got a few references here if you're interested in more information. One of those is a Discovery talk I did in 2017 that's got more of the background of how the censoring information is used in the calculations of these analyses. That's it. Thank you.
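As a small aside on the first example above, here is a hedged sketch, in Python rather than JMP, of what "using the censoring information" means mechanically: a lognormal fit by maximum likelihood where values recorded at 1,900 contribute a survival-probability term instead of being treated as exact failure times. The data are freshly simulated stand-ins, so the numbers are only illustrative.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
true_mu, true_sigma = 7.0, 0.5                 # lognormal parameters on the log scale
y_true = rng.lognormal(true_mu, true_sigma, size=100)
limit = 1900.0
censored = y_true > limit                      # right-censored rows
y_obs = np.minimum(y_true, limit)              # observed column: 1,900 stands in for ">1,900"

def neg_loglik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                  # keeps sigma positive during optimization
    exact = np.log(y_obs[~censored])
    ll_exact = stats.norm.logpdf(exact, mu, sigma) - exact    # lognormal log-density
    ll_cens = stats.norm.logsf(np.log(limit), mu, sigma)      # log P(Y > 1,900)
    return -(ll_exact.sum() + censored.sum() * ll_cens)

fit = optimize.minimize(neg_loglik, x0=[np.log(y_obs).mean(), 0.0])
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

print("treat 1,900 as exact, mean:      ", round(y_obs.mean(), 1))
print("drop censored rows, mean:        ", round(y_obs[~censored].mean(), 1))
print("censoring-aware estimate of mean:", round(np.exp(mu_hat + sigma_hat**2 / 2), 1))
print("true mean:                       ", round(np.exp(true_mu + true_sigma**2 / 2), 1))
```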
This presentation is an extension of the talk, "Measurement Systems Analysis for Curve Data Using Functional Random Effects Models," presented at JMP Discovery Europe 2023. Here, a functional random effects model was used to perform a Functional Gauge R&R analysis on data that contained a set of curves as the response. In this application, the functional model was expanded using the eigenfunctions and then was expressed as a random effects model, where variance components were estimated using standard methods. This analysis was done using the Functional Data Explorer and Fit Mixed platforms.   In the updated version of this presentation, I show that it is possible to include fixed effects in this type of analysis using the same model expansion approach. The functional model is still expanded using the eigenfunctions but is expressed as a generalized mixed model instead.       Hi,  my  name  is  Colleen  McKendry, and  I  am a  senior  statistical  writer  at  JMP, but  I  also  like  to  play around  with  functional  data. This  project  is  on  measurement systems  analysis  for  curve  data. First,  I'm  just  going  to  give  a  very  brief background  on  MSA  studies  in  general. MSA  studies  determine  how  well  a  process can  be  measured  prior to  studying  the  process  itself. It  answers  the  question, how  much  measurement  variation  is contributing  to  the  overall process  variation. Specifically,  the  Gage  R&R  method,  which is  what  I'll  be  using  in  my  analysis, determines  how  much  variation  is  due to  operation  variation versus  measurement  variation. You  can  use  a  Gage  R&R  crossed  MSA  model when  you  have  both  a  part  and  an  operator. The  model  you  can  see  here  for  your measurement  Y  sub  I J K, that's  going  to  be  the  Kth  measurement made  by  the  Jth  operator  on  the  Ith  part. In  this  model,  you  have  a  mean  term, a  random  effect  that  corresponds to  the  part, a  random  effect  that  corresponds to  the  operator, and  a  random  effect  that  corresponds to  the  interaction  or  cross  term. You  also  have  an  error  term. This  is  simply  a  random  effects  model, and  all  of  these  random  effects  are normally  distributed  random  variables with  mean  zero  and  some corresponding  variance  component. When  you  fit  this  model, you  can  use  that  to  estimate the  variance  components and  then  use  those  variance component  estimates to  calculate  the  percentage gage  R&R  using  the  formula  shown  there. In  a  standard  MSA  study, all  of  your  measurements  are going  to  be  single  points. But  what  happens  if  that's  not  the  case? What  if  instead  you're  measuring something  like  a  curve? That  question  was  the  motivation behind  this  project. There  was  a  client  of  JMP  that  was a  supplier  of  automotive  parts, and  they  had  a  customer  that  specified that  a  part  needed  to  have a  specific  force  by  distance  curve. Obviously,  the  client  wanted  to  design their  product  to  match the  customer  specified  curve. In  order  to  do  that, they  wanted  to  run a  functional  response  DOE  analysis and  JMP  to  design their  product  in  order  to  do  so. However,  before  spending  money on  that  experiment, they  wanted  to  perform  an  MSA  on  their ability  to  measure  the  parts  force. 
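Before turning to the functional case, here is a hedged numerical sketch of the standard crossed Gauge R&R arithmetic just described, using the classical expected-mean-square estimators on simulated, balanced data. JMP's MSA platforms (or a mixed model fit by REML) do this estimation for you; the code is only meant to make the variance components and the %GRR formula concrete.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
parts, ops, reps = 10, 5, 5
p_eff = rng.normal(0, 2.0, parts)                      # part-to-part variation
o_eff = rng.normal(0, 0.5, ops)                        # operator (reproducibility)
po_eff = rng.normal(0, 0.3, (parts, ops))              # part*operator interaction
rows = [(i, j, 10 + p_eff[i] + o_eff[j] + po_eff[i, j] + rng.normal(0, 0.4))
        for i in range(parts) for j in range(ops) for _ in range(reps)]
df = pd.DataFrame(rows, columns=["part", "operator", "y"])

grand = df["y"].mean()
ybar_p = df.groupby("part")["y"].mean()
ybar_o = df.groupby("operator")["y"].mean()
ybar_po = df.groupby(["part", "operator"])["y"].mean()

ms_p = reps * ops * ((ybar_p - grand) ** 2).sum() / (parts - 1)
ms_o = reps * parts * ((ybar_o - grand) ** 2).sum() / (ops - 1)
p_idx = ybar_po.index.get_level_values("part")
o_idx = ybar_po.index.get_level_values("operator")
inter = ybar_po.values - ybar_p.loc[p_idx].values - ybar_o.loc[o_idx].values + grand
ms_po = reps * (inter ** 2).sum() / ((parts - 1) * (ops - 1))
cell_mean = df.groupby(["part", "operator"])["y"].transform("mean")
ms_e = ((df["y"] - cell_mean) ** 2).sum() / (parts * ops * (reps - 1))

var_e = ms_e                                           # repeatability
var_po = max((ms_po - ms_e) / reps, 0.0)
var_o = max((ms_o - ms_po) / (parts * reps), 0.0)      # reproducibility
var_p = max((ms_p - ms_po) / (ops * reps), 0.0)

pct_grr = 100 * np.sqrt((var_o + var_po + var_e) / (var_p + var_o + var_po + var_e))
print(f"% Gauge R&R = {pct_grr:.1f}")
```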
There are a lot more details about the actual data and this problem specifically in an earlier 2020 white paper titled Measurement Systems Analysis for Curve Data. If you want any more details, look that up. It should be on the community. In this graph, you can see what the data look like. On the Y-axis, we have force, and on the X-axis, we have distance. It looks like there are only 10 curves in this graph, but there are actually 250 total curves. There's just some clustering going on. There are 10 different parts, five different operators, and five replications per part-operator combination. A little bit about this data: obviously, these measurements are curves and not points. The data was collected evenly spaced in time, but not evenly spaced in distance. There were some earlier projects that tried a few different ways to perform some type of MSA study on this data. They used some functional components, but stayed pretty true to a standard MSA. When I looked at this data, I wanted to take a true functional approach because I have a background in functional data. Functional data analysis is useful for data that are in the form of functions or curves. There are many techniques to handle unequally spaced data, a lot of which are available in the Functional Data Explorer platform in JMP. My goal was to combine functional data methods with traditional MSA methods to perform some type of functional measurement systems analysis. My solution was to create a functional random effects model by expanding the functional model using eigenfunction expansion, rewriting that as a random effects model, or a mixed model if you had any fixed effects, and then estimating the variance components associated with the part and operator terms. To go a little bit into the model notation: for your functional model, you have Y sub IJK, but this time at a particular distance, D, to account for the functional nature of the data. You're going to have a functional mean term, a functional random effect that corresponds to the part, a functional random effect that corresponds to the operator, and a functional random effect that corresponds to the cross term, and also your error term. Here, when you do the model expansion, it's a little mathy, but essentially, instead of having one variance component associated with the part and one variance component associated with the operator, you now have multiple variance components associated with each of those things. That's going to account for the functional nature. When you're fitting the model and estimating the variance components, like I said, now you're going to have this set of variance components that you can sum together to estimate the functional variance component for part, and the same thing for operator and the cross term. Once you have all those individual variance components, you can use those to estimate the % gage R&R just like in a standard MSA. How do I do this in JMP? It's a multistep process that's outlined here, and there are some more details in other slides.
But essentially, I estimate the mean curve in FDE and obtain the residual curves. I then model the residual curves in FDE to obtain the eigenfunctions needed for the eigenfunction expansion of the functional model and save those eigenfunctions to the original data table. I'm going to use those saved eigenfunctions in Fit Mixed to create a random effects model, or a mixed model if you also have fixed effects in your data. I'm going to use nesting of the eigenfunction formula columns and also the part and operator variables to define the appropriate model specifications. This is what your Fit Model window would look like. Once I did all that for this data, I was able to estimate the variance components and calculate the % gage R&R, which in this case was 3.3030. This indicated an acceptable measurement system according to some ranges that were defined in this paper by Barrentine. That was it for the data analysis for my part. This result was actually very similar to a worst-case scenario that was obtained in a presentation in 2019. It would be interesting to know if that was a coincidence or if the results would be similar for different data as well. Some thoughts that this project provoked: should we add a functional random effect for ID to capture the within-function correlation across distance? This type of functional random effect is actually really important in functional data and is a big benefit of accounting for the functional nature of the data. Unfortunately, in this data in particular, anytime I created a model with this term, the corresponding variance components were zero, so it didn't really capture anything extra, but it would be interesting to see if it could be useful in different types of data. I also think it would be interesting if we could calculate a confidence interval for the % gage R&R. There were also some minor, not issues, but questions brought up by the residuals in the random effects model. I observed a cyclical nature in those. That's not always great. I don't think it was a huge deal, but I would like to have a good reason for why that was the case. That's it. Thanks for listening. If you want more details on this project, it's very similar to a full 30-minute talk that I presented at Discovery Europe, and so that video is on the community as well. Thank you.
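As a rough open-source illustration of the workflow just described (assuming the curves have already been registered on a common distance grid), the sketch below extracts eigenfunctions and scores from simulated residual curves via an SVD, which is one way to carry out the eigenfunction expansion; each score column would then go into a crossed part/operator random-effects fit. The variance-component numbers at the end are invented purely to show the final summation and %GRR arithmetic; FDE and Fit Mixed do the real work in JMP.

```python
import numpy as np

rng = np.random.default_rng(5)
n_curves, n_grid = 250, 100                     # 10 parts x 5 operators x 5 reps
grid = np.linspace(0, 1, n_grid)
basis = np.vstack([np.sin(np.pi * grid), np.cos(np.pi * grid), np.sin(2 * np.pi * grid)])
curves = rng.normal(0, 1, (n_curves, 3)) @ basis + rng.normal(0, 0.1, (n_curves, n_grid))

residuals = curves - curves.mean(axis=0)         # mean curve removed
U, s, Vt = np.linalg.svd(residuals, full_matrices=False)
n_fpc = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.99)) + 1
eigenfunctions = Vt[:n_fpc]                      # one row per eigenfunction on the grid
scores = residuals @ eigenfunctions.T            # one score column per eigenfunction

# Each score column would now get a crossed part/operator random-effects fit,
# as in an ordinary Gauge R&R; summing those components across columns gives
# the functional variance components. Hypothetical sums are used below just
# to show the final %GRR arithmetic.
vc_part, vc_operator, vc_part_op, vc_error = 5.2, 0.0015, 0.002, 0.002
pct_grr = 100 * np.sqrt((vc_operator + vc_part_op + vc_error)
                        / (vc_part + vc_operator + vc_part_op + vc_error))
print("eigenfunctions kept:", n_fpc, " functional %GRR:", round(pct_grr, 2))
```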
Scientists in biopharma working along the CMC development pathway are challenged by the increasing complexity of biologic-based therapeutics and the processes that produce them. Many critical responses exist (often 10-20) that are themselves functions of numerous, highly interactive input process factors.

We use a large case study to show that current experimental design strategies combined with current response surface methods are generally inadequate to deal with the complex kinetic behaviors observed to be ever-changing across the design space. The case study consists of a 7-factor, hybrid experimental design used to develop a bioprocess with 13 critical responses. Employing a combination of SVEM, neural network models, and the hybrid experimental design, we show that accurate predictive models can be estimated for all responses that interpolate satisfactorily throughout the design space.

Furthermore, we show that the powerful tools in JMP and the Prediction Profiler, specifically the Design Space Profiler, are essential to visualizing and understanding the experimental design space and optimizing the bioprocess.

We also discuss the need for new optimal design strategies. JMP Pro 17 is used throughout the talk.

I want to thank the JMP steering committee and the JMP organizers for inviting Phil and myself to come and present our exciting talk on CMC, SVEM, DOE, and Complexity: It's All About Prediction. I want to start by thanking Dr. Tiffany Rao; she's been involved with the planning and numerous conversations for the work that we're going to present today. I'm going to do an overview, tell you who Lundbeck is, who I work for, and then provide the background for the DOE that we're going to talk about, which is process development for a biologic drug. Our case study, and what I've started to do for development, is to start with a first step of doing a DSD for mid-late stage development, then follow that with a second step of augmenting with a space-filling design. Then we are hoping to prove to you today that, for analysis, SVEM allows us to have better prediction for all of this work and allows us to have better timelines for the work that we're doing. Lundbeck is located… We're headquartered in Copenhagen, we're over 6,000 employees in over 50 countries, and we are striving to be the number one in brain health. The part of the company that I work with is CMC biologics, and we're basically located in the Copenhagen area and in the Seattle area, where I'm located. Let's talk about the background for the DOE that we're going to present today. For the process that we want to develop for drug substance, for these biologics, we start with a vial of cells, we take those out of the freezer, we then expand in shake flasks, go bigger into culture bags, maybe a seed bioreactor, then to a production bioreactor. That production bioreactor goes approximately two weeks. We have complex nutrient feeds, we have pH control, temperature control, there's the base that we're adding. Once we finish that 14-day production, we need to figure out a way to deal with the cells that are secreting our molecule into the supernatant.
How do we separate the cells from the product? That harvest can be a centrifuge, it can be depth filtration. Then we pass it on to our downstream colleagues. They first usually do a capture step where they're getting rid of most of the host cell proteins, the host cell DNA. But then we need to do two polishing steps where we're then saying, "Okay, what are the product-related impurities? Maybe there's not the full molecule there, so we have to get rid of those." Then finally, we have to make sure, through ultrafiltration and diafiltration, that we can transfer into the buffer that it's going to be in when it is transferred for the patient's use, and that it's also at the right concentration. You can imagine, at every step along this way, there are many factors, there are many knobs that we can turn to control this process, make sure that it's robust and we're making the same product every time. When we're focused on treating the patient, we also want to focus on the business. We can't put all of our development resources into every molecule. We want to right-size the research that we're doing at the right stage of the product. There are many things that could kill a product, but if we can develop this in the right time and the right space using these tools from JMP, we can shift this development timeline to the left and we can also reduce the amount of resources and the cost to the company. If we're first getting a molecule, that's when you're going to start looking at your categorical factors. We might be doing the cell line screening. We want to make sure that we have the right cell line that's going to last all the way through commercialization. For the downstream group, they may be looking at resins; for both upstream and downstream, looking at media and buffer components and the formulations of those. That's when you're making sure that you have the right thing that's going to keep you going through your development pathway. But then once you're in the clinic, now you want to really start to gain understanding of the process parameters. Our strategy is to start with a Definitive Screening Design, and we want to be bold in our level settings at this stage; I'll talk a little bit more about that later, for the late stage development. Then we can build on what we learned from the Definitive Screening Designs by augmenting those designs with space-filling or other designs so that we really understand that design space. What's different that we're hoping to show now, compared with traditional walks through this pathway, is that in the past, we've been throwing out the factors that we've said aren't important. But with modern designs and modern ways of doing analysis, we can keep all of the factors and all of the work that we've done so far and gain better understanding of the whole process, especially with biologics that are quite complex.
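As a hedged sketch of the "augment with a space-filling design" step, here is one way to generate interior points with an open-source stand-in, a Latin hypercube from SciPy. JMP's Augment Design and space-filling platforms are what is actually used in this work; the factor ranges and the 0.8 shrink factor below are invented for illustration.

```python
import numpy as np
from scipy.stats import qmc

# hypothetical low/high settings for seven process factors
factor_lows = np.array([6.6, 30.0, 0.5, 2.0, 33.0, 5.0, 0.1])
factor_highs = np.array([7.4, 40.0, 2.0, 8.0, 37.0, 15.0, 0.5])

sampler = qmc.LatinHypercube(d=7, seed=42)
interior = qmc.scale(sampler.random(n=16), factor_lows, factor_highs)

# shrink slightly toward the center so the new runs sit in the interior of
# the region already spanned by the screening design, not on its boundaries
center = (factor_lows + factor_highs) / 2
interior = center + 0.8 * (interior - center)
print(interior.round(2))
```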
Before I pass the baton to Phil, I just wanted to talk about one more thing… Let's see if I can… I'm going to minimize this screen just for a minute so I can show you this. This is an experiment that I did to prove the power of DOE to my boss. The full data set was an OFAT for pH, and the response was titer. We wanted to do very many different levels over a wide range because he wasn't sure at the time that we were going to be able to pick what the optimized level was. But what I wanted to show him was that, "Okay, we did this experiment, we have all of this data. We were able to model where the optimized condition was," and that's shown in blue, and that turned out to be the correct case. When we tested the model, that was the optimized condition. Let's pretend now that we're starting over and we don't know that data. If we had picked a conservative range setting for our experiment, our noise-to-signal ratio would be quite high, and so we would have missed finding the optimized spot. But if we had picked a wider range in our settings, still with only three points, the model still would have found the optimized spot. What I'm going to challenge the subject matter experts with, when you're designing your DSDs, is to really be bold in your range setting. You will still find the optimized spot, and you have to have some knowledge of your process so that you can complete the design of experiments and have all of the runs at least have enough signal that you can measure and then subsequently model. Once you learn more about your design space from your Definitive Screening Designs, you can come back and then you can be internal to that space. That's when you augment with a space-filling design. Now I'm going to pass the baton to Phil, and he's going to take you through the analysis.

Okay, thank you. Thank you, Patty. We're going to talk about a very nice and somewhat complicated experiment that Patty and her team ran. They do a lot of great work, and they're big advocates of DOE and JMP, and I'm very happy they let me get to play with them sometimes. It's fascinating work. But before I get into the actual analysis, I wanted to talk about a few relevant concepts that members of the audience may or may not be familiar with, and that includes complexity. It's a really hot topic out there. I'll also talk about what prediction actually is. That is a muddled concept to many people. Then from there, I'll launch into talking about how we analyze prediction and how we did with Patty's experiment. Complexity: a fellow named Daniel Finelli from the London School of Economics has written much about this, and he calls it "the elephant in the room" that statistics and many of what he calls "metasciences" are ignoring, and they're ignoring it at their peril. I won't get into a lot of detail. You can look him up on the internet; he has a lot of videos and papers. But complexity is a huge problem. It is staring science and statistics and data science and machine learning in the face, and it needs to be dealt with.
At present, we're not really dealing with it directly in statistics. By the way, there are now whole applied math programs based on studying complex systems. My bottom line is, complexity is real. Complexity requires new thinking. We really have to rethink DOE and analysis. You're going to see that for complex systems, and we also have to understand something else, systems theory 101: complex systems are defined by their interactive behavior. In point of fact, main effects are actually even misleading. You have to somehow be experimenting in a manner that can capture this interactive behavior, and you're going to see current strategies fall short of that goal. Patty's already mentioned the CMC pathway. Nowhere is this problem of complexity more obvious than in bioprocesses. You have complex combinations of biology and chemistry, and interactions are everywhere. When I talk to scientists in biotechnology, they know right up front we're dealing with really complex interactive systems. But first, I need to point out prediction. If you're working in CMC development work, it's all about prediction. The ICH guidelines that are used by scientists in CMC development work don't specifically say prediction, but if you read what they say, it's all about prediction. Basically, you're building processes to manufacture biologics, and with the new cell and gene therapies, these processes are becoming hopelessly complicated. I personally rely heavily on the scientists to explain it to me, and they're the people who really make all the decisions. I'm the helper, and I'm very happy to be there as part of it. But it's all about prediction. That is not how many scientists, and even statisticians, have viewed CMC work. By the way, this applies to all areas of science. I'm focused with Patty on the CMC development pathway, but prediction is important. What is prediction? It's muddled. It's not clearly defined in disciplines. Here's what it really is and how I define it: it's a measure of how well models that you develop interpolate over a design region. In other words, we're going to fit a model to what we call a training set, and then we need some way of knowing how that model would apply over the whole design region. In CMC work, especially late stage, it is very important that you be able to do that. As many of you know, you really only have a training set to fit the model. That training set in no way can evaluate prediction. I know there's a common belief that you can evaluate prediction on training sets. You simply cannot. You must have a test set. I'll also talk a little bit about what I see in dealing with scientists, a lot of it in chemistry and biologics. Again, I do a lot of it in biotechnology, but also in other areas like battery technology and material science. It is becoming very obvious: the kinetics are complicated. They're constantly changing over design regions. The kinetic behavior that you see around the boundaries is often very different from what's happening on the interior.
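As a toy illustration of that last point (simulated, and nothing to do with the case-study data), the sketch below builds a surface whose interior behaves differently from its boundary, fits a full quadratic to boundary-only runs, and then checks the prediction error on interior points.

```python
import numpy as np
from itertools import product

def true_response(x1, x2):
    # linear trend plus a bump in the interior that boundary runs barely see
    return 0.5 * x1 + 0.3 * x2 + 1.5 * np.exp(-4 * (x1**2 + x2**2))

rng = np.random.default_rng(10)
boundary = np.array([p for p in product([-1.0, 0.0, 1.0], repeat=2) if p != (0.0, 0.0)])
interior = rng.uniform(-0.6, 0.6, (16, 2))

def quad_design(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

y_boundary = true_response(boundary[:, 0], boundary[:, 1]) + rng.normal(0, 0.05, len(boundary))
beta, *_ = np.linalg.lstsq(quad_design(boundary), y_boundary, rcond=None)

pred = quad_design(interior) @ beta
actual = true_response(interior[:, 0], interior[:, 1])
print("RMS prediction error on interior points:",
      round(float(np.sqrt(np.mean((actual - pred) ** 2))), 3))
```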
Why does this matter? Well, the classic approach to response surface, even including optimal designs, relies upon what I call boundary designs. Almost all of your observations are around the boundaries of the design region. In point of fact, whether people want to hear it or not, the central composite design, commonly used in response surface, is about the worst design you could think of for prediction. The interior of the space is empty. If you fit these models on the boundary and then you predict what's happening on the interior, it's not prediction, it's speculation. You don't know. You have no data. I'm going to show you in the case study, you're probably going to reach some wrong conclusions. The boundary regions, indeed, often behave very differently, and we have a need to reconsider our approach to designs. Another issue in response surface and statistics is this ubiquitous use of full quadratic models. They are not sufficient to model complex response surfaces. In fact, they're far from it. Unfortunately, I get a lot of pushback from statisticians who claim it is good enough. My answer is, "Well, if you actually used designs that had sufficient interior points, you'd quickly discover they don't fit well at all." Again, trying to measure prediction on the interior of a design region using boundary designs is futile. By the way, my good friend the late John Cornell, along with Doug Montgomery, published a paper on this in 1998, and I'll be polite, they were ignored. It was actually somewhat nastier than ignored by the statistics community. They showed in the paper that full quadratic models are just not sufficient to cover a design region. Patty mentioned SVEM, self-validating ensemble modeling. It's an algorithm. I'm one of the co-developers with Dr. Chris Gotwalt of JMP, a person I hold in very high regard. I won't get into the algorithm, by the way; there are references at the end where you can go and learn more about it. It has been talked about at Discovery conferences, actually, going all the way back to Frankfurt in 2017. But SVEM is an algorithm that allows you to apply machine learning methods. Machine learning methods are all about predictive modeling. Believe me, people in that field know a lot more than you may think about prediction, and SVEM lets you apply those methods to data from small sets like DOEs. I won't get into SVEM. It's a whole new way of thinking about building predictive models, and I think it's in its infancy, but it's already proving very powerful and useful in biotechnology. Let's get to the experiment. This is actually a hybrid experiment that Patty and her team created. There are seven factors and there are 13 responses. But due to time constraints, I'm only going to focus on four, and even that's going to be hard to get it all in. The data and the experiment are highly proprietary. I do thank Lundbeck and Patty for actually allowing us to use an anonymized version of this design.
I have a lot of case studies, some of them similar to this, and the people who own the data wouldn't let me discuss them even if I anonymized them. That was very nice of them. I think we have a really important story to tell here. This is a hybrid design. It's comprised of a 19-run Definitive Screening Design around the boundaries. Then it has 16 space-filling runs on the interior. There are center points in both parts of the design. How would we analyze this? Well, what I want to do is discuss the strategies of analysis that are used, the algorithms that are used, and make comparisons to SVEM. I'll tell you in advance, SVEM is going to do very well. Then we'll talk about some of the issues with the models themselves and how we use them. I'm going to do what most people currently do. I'm going to take the boundary points, the DSD, fit models, and then apply them to the space-filling designs as a test set and see how well my model interpolates. Step two, I'll reverse the process. I'll fit models to the space-filling points, and then I'll use the DSD as a test set and see how well my model actually extrapolates a little bit to the boundaries. Three is a common strategy used in machine learning. I'm going to use a holdback test set. I'm going to take the 35 runs and break them up. I did this in a way to make them as equivalent as I could, into a training set containing both SFD and DSD points, and then also a holdback test set that has a representation of both. Then finally, step four, what many people would automatically do: I'll just fit models to the whole data set. In general, I don't recommend this because there's no way to test the model. I will say up front, because we do have a lot of space-filling points on the interior, I'm more comfortable with this approach than I am in practice. But these, I find, are the four basic strategies that would be used. How do I analyze it? Well, if you have a DSD, people like to use Fit Definitive Screening. I'll look at it; it only applies to DSDs. Honestly, it's not really a predictive-modeling strategy, nor do they claim it is. But I find people seem to use it that way. I'll use Forward Selection. If you know what the AICc statistic is, we'll do that in GenReg, in JMP 17. Then we'll look at something they have in GenReg that's very nice. That is the SVEM algorithm. I'm going to use that with Forward Selection. Then I'm going to look at something people may not know. It's a hidden gem in JMP. Something called Model Averaging in the Stepwise platform. John Sall put it there many years ago. I think he was being very insightful. Then we're going to talk about SVEM and Neural Networks. Basically, no software does this. I have worked with Predictum, some of you know Wayne Levin and Predictum, to develop an add-in to do this. It's currently the only software available that does this. The SVEM add-in was used to do the Neural Networks.
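Since the add-in itself isn't walked through in detail, below is a simplified, unofficial sketch of the SVEM idea in Python: anti-correlated fractional bootstrap weights provide the self-validation, a tuning choice is made against the validation-weighted error, and predictions are averaged over the ensemble. A lasso with a small alpha grid stands in for JMP Pro's SVEM Forward (and for the neural networks), and both the design and the response are simulated; see the references at the end of the talk for the real algorithm.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
n, p = 35, 7
X = rng.uniform(-1, 1, (n, p))                               # stand-in 35-run, 7-factor design
y = X[:, 0] - 2 * X[:, 1] * X[:, 2] + 0.5 * X[:, 3] ** 2 + rng.normal(0, 0.2, n)

poly = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()
XF = scaler.fit_transform(poly.fit_transform(X))             # full quadratic expansion
alphas = np.logspace(-3, 0, 10)

def svem_predict(X_new, n_boot=100):
    XN = scaler.transform(poly.transform(X_new))
    preds = np.zeros((n_boot, len(X_new)))
    for b in range(n_boot):
        u = rng.uniform(size=n)
        w_train, w_valid = -np.log(u), -np.log(1 - u)        # anti-correlated fractional weights
        best_model, best_err = None, np.inf
        for a in alphas:                                     # self-validate the tuning choice
            m = Lasso(alpha=a, max_iter=10000).fit(XF, y, sample_weight=w_train)
            err = np.sum(w_valid * (y - m.predict(XF)) ** 2)
            if err < best_err:
                best_model, best_err = m, err
        preds[b] = best_model.predict(XN)
    return preds.mean(axis=0)                                # ensemble average over bootstraps

print(svem_predict(rng.uniform(-1, 1, (5, p))).round(2))
```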
I won't get into the add-in particularly, I'll just quickly show people where these things are. Then finally, I said the fourth strategy used the whole data set; because I get asked about this all the time, I just threw in some k-fold cross-validation to use with the SVEM methods and some of the other methods. Those are the methods we'll use, and for methods like Fit Definitive, Forward Selection, and Model Averaging, we'll assume a full quadratic model, as that is the tradition. For the other methods, again, we're going to use a Neural Network, which is more flexible. There are four responses, and this is really important. I didn't randomly select them. There are four of them and they vary in complexity. Again, I'll admit this is subjective. There is no internationally approved measure of complexity, and this is based upon the ability to model the responses. Again, there are 13 responses. Typically, in CMC pathway work, there are 10-20, maybe more, most of them critical quality attributes. They are important, and they vary within the experiment from some that are fairly low in complexity to some that are very high, very difficult to model. Frankly, in those cases, Neural Networks are basically your only option. So pay attention to this, because this complexity turns out to be very important in how you would go about modeling. Then the question is, if I'm going to evaluate prediction, well, how do I do that? Remember, I prefer that prediction be evaluated on an independent test set with new settings of the factors. That's how we judge interpolation. Well, something called the Root Average Squared Error, or RASE score, is very common. This is the standard deviation of prediction error. Again, it's commonly used to judge how well you predict. Smaller is better, obviously, but there is a problem with it that we've particularly uncovered, especially in simulations. Models with low RASE scores often have substantial prediction bias in them. In prediction, there really is still a bias-variance trade-off. So how do we evaluate bias? Well, there's no agreed-upon approach to that either. But the easiest way and the most visual way is actual-by-predicted plots on a test set. Ideally, if you were to fit a slope to the actual-by-predicted plot, and I'll show an example, the ideal prediction equation would have a slope of one with an intercept of zero. The farther the slope is from one, the greater the bias. For purposes of demonstration, I'm going to set a specification of 0.85-1.15 with a target of 1 for the slope. If you can stay within that range, then I'd say you probably have acceptable amounts of bias. In reality, that happens to be more of a subject matter issue. Then finally I said, "Well, you can fit a slope to the actual-by-predicted plot. There's an additional problem." The predictor is the predicted values. They have a lot of error in them. So this is actually an errors-in-variables problem, which is not commonly recognized. But JMP 17 has a really nice solution.
It's called the Passing-Bablok modeling algorithm, and it's well-established, especially in biopharma. This fits a slope taking into account errors in X, the predictor. So how does it work? Well, it fits a slope. If you look on the left, you'll see the slope is about 0.5. We have strong bias. There's a lot of prediction bias. What I really like in the application in JMP is that they give you the reference line. The dashed blue line is the ideal line: slope of one, intercept of zero. On the left, our predictive model is showing a lot of bias. It's systematically not predicting the response. To the right is a case where there's actually a small amount of bias; in general, that would be acceptable. By the way, these were picked as models that had relatively low overall RASE scores. These are called the Passing-Bablok slopes, and they are integral to how I evaluate prediction: the overall RASE and the slopes. What I'm going to do at this point is actually go over to JMP, if you don't mind. I'll make a quick change in the screen here, and I'll make this as big as I can for everybody. Overall in this exercise, I fit close to 140 models, and I did them all individually and evaluated them. Yes, it took quite a while, and I'm going to show a graphic to try to summarize the results for the different methods. I'm going to open a Graph Builder script. I'll make this as big as I possibly can for everyone. I'm using some local data filters to define the display. Notice we have four training scenarios. I'll start with where the DSD is the training set. We fit models to the boundary, and then we evaluate how they predicted the space-filling design points. Y2 is the easy response. I expected all approaches to do well, and they did. Notice I set these spec limits, that's 0.85-1.15; all fell within that allowable region. Two of the methods did well; I particularly liked the model averaging, it did pretty well. None of them had a slope of exactly one. The DSD points don't exactly predict what's going on in the space-filling design points, but they all did relatively well. Now we'll go to moderate complexity. Now you start to see some separation. It's getting harder to model the surface. Again, I'm using this interval of 0.85-1.15. I'm looking on the y-axis at the RASE score, the standard deviation of prediction, and on the x-axis, I'm looking at slope. For Y1, we're using the DSD to predict the space-filling design points as the test set. The only models that really performed well were the Neural Networks with SVEM. By the way, the code is: NN is Neural Network, H is the number of hidden nodes. We have models with varying numbers of hidden nodes, and I simply evaluated RASE scores and slope. We go to more complexity. Now Y3 has high complexity. It is hard to model. The lowest RASE scores were the methods you see on the lower right, but you can see there's substantial prediction bias.
I felt overall the best combination of low bias and RASE score was the Neural Networks, particularly one with 27 hidden nodes. Then finally, number four is high complexity. We fit the model to the DSD and applied it to the space-filling points. I didn't think any of the models did great. All of them showed some prediction bias. Maybe the best performance was a Neural Network with 12 hidden nodes. It had the lowest RASE score, but still, there were some issues with bias. So that's one strategy. Well, what if I were to do the opposite? I fit the models to the space-filling points and then apply them to the boundary DSD points. Again, let's start with the easiest case, Y2, which really is a pretty simple response. Actually, the SVEM method in GenReg using SVEM Forward did very well. The next best, I thought, was a Neural Network with 10 hidden nodes. Remember, there's a little bit of extrapolation going on here. Finally, Y1 with moderate complexity. Again, only the Neural Networks did well. As we go up in complexity, increasingly just the Neural Networks are working. You'll find similar results for the other approaches. I won't show all of them; they're covered in the notes. But the general conclusion, by the way, is that when you use the boundary points as a test set, or you use the space-filling designs as a test set, and try to predict the other, they're just not doing as well as they should. In other words, as I said earlier, between the boundary DSD points and the space-filling design points, there are differences in kinetic behavior that we're not picking up. The only way we're going to pick it up is to actually fit models over the whole design space. We did do that, by the way. I should just quickly show you. I used the whole data set and we fit models, and we actually did pretty well. I didn't show the Passing-Bablok slopes. I will just quickly do a little more work with JMP for those who are interested. The Passing-Bablok slopes can be done in Fit Y by X. I will admit we wrote a script and added it to the predictive add-in to do this in Fit Y by X, but you can easily do it yourself. Here, and I'll pick one of the cases, is the DSD data, and I'll pick Y1. How did we do fitting models? If you look in the menu, there's the Passing-Bablok. I strongly suggest you look at it. A lot of regression problems are errors-in-variables problems. How did the method do overall? I want to explain something else. The orange points are the DSDs, the boundaries. The blue points are the space-filling design points. Here I fit models to the DSD, and the Passing-Bablok slopes are being fit to the space-filling design points. Overall, the best performance was turned in by the DSDs. There's one of them here. It's Saywood 6. Another one that had… I forgot what it was. Let me widen this out for you. Nineteen. Notice the slope is close to one, but you can clearly see there is some bias.
In other words, you can see an offset between the fitted slope and the ideal slope, the dashed blue line. This is pretty typical overall. I'll just very quickly show you: if you have JMP Pro and you want to do SVEM using linear models, just go to Fit Model, Recall. This is a full quadratic model. You could do others. Go to GenReg, and then look under estimation methods. There's SVEM Forward. There's SVEM Lasso. These work very well. From a lot of work with these methods, I still find SVEM Forward gives you the best results. The Lasso tends to give you a lot of biased results, on test sets in particular. If you're interested in model averaging and you have JMP standard, I'm just going to hit Recall again and go to the Stepwise platform. Didn't do it. Stepwise. I won't run it. It will take too long, because model averaging uses best subsets regression. It's time-consuming, but it's there. Again, for Neural Networks with SVEM, you have to have the Predictum add-in to do that. There's a link to it if you're interested. At this point, I'm not going to do too much more analysis. Again, you can go through and look at the various slopes for the various responses, and you can see many of these methods resulted in highly biased slopes. In other words, the DSD points and the space-filling designs are too different. We've really got to understand that we need to fit models over the entire design region. At this point, I'm going to just finish up. By the way, there is enough material here, and I do have basically many talks that are combined in here. I apologize, but I think there's an important message here. By the way, I'm just showing slides with the Passing-Bablok slopes. Then finally, I want to just give you some final thoughts. I think we really need some new thinking in statistics. We don't have to throw out everything we've been doing. I'm not saying that. The most important thing is, we are in the era of digital science. Digital chemistry, digital biology, digital biotechnology are here. They're not tomorrow. We've got far more automation. There are lots of great pilot and bench-scale devices, especially in biotechnology, that scale nicely, where we can do lots of experiments. The problem is complexity. We need to think differently. Machine learning methods via SVEM are very important for fitting these complex systems. We need to get away from the response surface approaches that really haven't changed. Maybe we've got computers and some new designs. I think DSDs are really very clever. We have optimal designs, but they suffer from the fact that they're boundary designs, and people keep insisting on full quadratic models. That's a mistake, as I've tried to show briefly in the talk, and you will be able to download the talk; you can see how poorly these methods generally did with the complex responses. As far as I'm concerned, we need new types of optimal designs. At a minimum, these need to accommodate a lot of factors.
Patty, by the way, without getting into details, has run a DSD… Not a DSD; you did a space-filling design with 18 runs. Given they have Amber technology available, if you know what that is, they can do it. Why do we need that? Because these systems are interactive. We need to stop thinking interactions are a minor part of the equation. Main effects do not describe the behavior of a complex system. Its interactivity is what drives the behavior. We need to cover the interior of the design region. Yes, we would like to cover the boundaries too. We don't want to be specifying a model. Optimal designs require that you specify what is usually a full quadratic model. We need to get away from that. Space-filling designs, by the way, are optimal designs that do not require a model be specified. But they're not the total answer. We need to cover the design space. We need to give the user, that would be the scientists, a lot of input on how they distribute the points. The work of Lu Lu and Anderson-Cook points the way. I won't have time to get into that. That's another topic. We need to be able to easily combine our design with other data. That includes engineering runs, GMP runs, even models from partial differential equations and simulations. Especially if you want to get into digital twins, you've got to be able to do that using what I call meta models. Then finally, Patty mentioned this, so I wanted to bring it up. The standard practice in design of experiments, assuming that somehow you've got to screen out factors, is actually a really high-risk, no-reward strategy in complex systems. You will regret it. You will someday, at a later stage, come back and have to redo experimental work. I've seen this time and again. In complex systems, this idea that there are active and inactive factors is simply wrong. They all matter at some level somewhere in the design space. Frankly, with our modern tools, you don't need to do it anyway. Also, something else people do reflexively is reduce linear models. We've shown in our research on SVEM, and a nice paper by Smucker and Edwards showed it as well, that reducing models degrades prediction. Why? Because you're making your model stiffer and stiffer, and it's not going to interpolate well. I will stop at this point, and there are some references at the end.
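To make this talk's two recurring metrics concrete, here is a hedged sketch of a RASE score and a Passing-Bablok-style slope for an actual-by-predicted plot on a test set. The slope routine is a simplified version of the published procedure (ties and other edge cases are ignored), so treat it as illustrative and use JMP's Fit Y by X implementation for real work; the data are simulated.

```python
import numpy as np
from itertools import combinations

def rase(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

def passing_bablok_slope(x, y):
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[i] != x[j] and (y[j] - y[i]) / (x[j] - x[i]) != -1]
    slopes = np.sort(slopes)
    k = int(np.sum(slopes < -1))            # shift that makes the estimator roughly unbiased
    n = len(slopes)
    if n % 2:
        b = slopes[(n - 1) // 2 + k]
    else:
        b = 0.5 * (slopes[n // 2 - 1 + k] + slopes[n // 2 + k])
    return b, np.median(y - b * x)          # slope and intercept

rng = np.random.default_rng(8)
actual = rng.normal(50, 10, 35)                          # test-set responses
predicted = 10 + 0.75 * actual + rng.normal(0, 3, 35)    # a deliberately biased model
slope, intercept = passing_bablok_slope(predicted, actual)
print(f"RASE = {rase(actual, predicted):.2f}")
print(f"actual-by-predicted slope = {slope:.2f} (target 1, accept 0.85 to 1.15)")
```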
The federally endangered southern Sierra Nevada fisher (Pekania pennanti) is spread out at low density across a large and rugged landscape, comprised of approximately 300 individuals across a 12,000 km² area. Its vulnerability has been further amplified by periods of severe drought and extensive wildfires in the region.

Identifying and preserving female reproductive habitat has been outlined as the most important demographic feature for sustaining and increasing the population. In this presentation we describe a customized, cost-effective and non-invasive Footprint Identification Technique (FIT) in JMP software developed by WildTrack to identify both species and sex using footprints collected at track stations.

We created a data set of known fisher and Pacific marten (Martes caurina) footprint images and known-sex fisher images. To automate the feature extraction in JMP, we developed a customized script to generate distances, angles, and areas using landmark points on the footprint images. Using a single variable, we had a species classification accuracy of over 99%. For fisher sex classification, using a more parsimonious model with just two variables selected in LDA, we achieved accuracies of 94.0% for the training set and 89.4% for the test set. We discuss the merits of this technique to help with the conservation efforts for this species.

Hello. My name is Caleb King. I'm a senior developer in the Design of Experiments and Reliability group here at JMP Statistical Discovery. Today I have the privilege of telling you about a very interesting project that I was able to be a part of, concerning classification of species and sex within a small mammal group called fishers, using the Footprint Identification Technique. Fishers, I'll give you a quick image here, so here's an example of a fisher. To me it looks like a bit of a weasel or ferret-type animal. I know that's definitely not the same species, but they're a small mammal, and we're particularly interested in fishers located in the Sierra Nevada, as those are a federally endangered species. Specifically, we'd like to be able to identify the presence of females, as a larger number of females indicates a very healthy population. They're also vital to helping develop effective conservation strategies. Now, the way we intend to do that is use what's called the Footprint Identification Technique, or FIT. This has been made popular through WildTrack; it is a non-invasive method for identifying individuals based on images of their tracks. This is especially helpful since you may not be able to actually see a fisher in the wild or capture them, but their tracks are everywhere, so that should be helpful to identify them. Using JMP, we were able to create a technique to distinguish fishers from a nearby species known as Pacific martens, as well as distinguish sexes within species. The way this works is we started with a data set of around 160-something martens and well over 300 fishers, consisting of about 34 males and 27 females.
What they would then do, as you can see here on the track image, is identify seven landmark points, as we call them, and then from those, we could compute well over 120-something features consisting of lengths, distances, angles, and areas. What we would then do is, using those features, feed that into a linear discriminant analysis, which we could then use to discriminate among species and then sex ID within species. To help assess that fit, we split the data into 50% training, and the remaining 50% we split roughly evenly between validation and testing. Prior to the modeling, we also tried to look at the effect of track orientation, so we would flip the left tracks horizontally to match the right, and then also any potential bias from the observers. These are people identifying landmark points, so we wanted to check and make sure that any variation there did not affect our outcomes. Thankfully, both the orientation and the observer bias did not have a significant effect on our outcomes. What brought myself and my colleague Ryan into the project was that some of the tracks, as they were classified, seemed to have a little bit too much spread in them, to the point that maybe there were actually multiple individuals. The way they would collect this data is there would be a little cage area out in the woods. Fishers could easily go in and out, and there was a track plate in the bottom that would capture their footprints, and there were also little spurs that would capture a bit of their hair. It didn't hurt the animal. They had no idea what was going on. What they would then do is take some samples of those hairs and send them out for genetic testing, which was a bit of a long and expensive process. Now, because of the way things were sampled, you might have a sampled hair that would identify the animal as potentially, say, male, but what could have happened was a male and a female might have gone in, and you only collected hair from one of them, so the tracks might indicate potentially multiple individuals, whereas the genetics said there was only one. What they wanted was a more data-driven method, if you will, to identify potentially misclassified multiple individuals that we could then exclude from our analysis so that it wouldn't bias the results. Before we actually got into that procedure, one of the things that we would do is use JMP's Predictor Screening tool to identify, for each response of interest, what were some of the top predictors. Notice for species and sex ID here, there are actually a lot of common features that are able to distinguish between the two, or at least have a strong ability to help distinguish between the two, much more so for the species than the sex. We've shown you what these variables look like over here, so Area 1 is the complete shaded region. We've got some distances, V16, V15. You'll notice a lot of them have to do essentially with the size of the track. We've got some big distances in there.
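As a rough sketch of the kind of feature generation described above, the code below takes seven landmark (x, y) points and computes pairwise distances, an angle, and a polygon area. The coordinates and feature names are invented for illustration; the real FIT script in JMP produces a much larger feature set from the actual track images.

```python
import numpy as np
from itertools import combinations

landmarks = np.array([                 # hypothetical (x, y) landmark coordinates
    [0.0, 0.0], [1.2, 0.4], [2.0, 1.5], [1.6, 2.8],
    [0.6, 3.0], [-0.4, 2.2], [-0.6, 1.0],
])

# all pairwise distances (21 of them for 7 landmarks)
distances = {f"d_{i}{j}": np.linalg.norm(landmarks[i] - landmarks[j])
             for i, j in combinations(range(7), 2)}

def angle(a, b, c):
    """Angle at landmark b formed by landmarks a and c, in degrees."""
    v1, v2 = landmarks[a] - landmarks[b], landmarks[c] - landmarks[b]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1, 1)))

def shoelace_area(pts):
    """Area of the polygon traced by the landmark points (shoelace formula)."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

features = {**distances,
            "angle_at_2": angle(1, 2, 3),
            "area_1": shoelace_area(landmarks)}
print(len(features), "features computed for this track")
```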
I'll get back to these in a second, but using those top features, let me get back to a full screen of that. Using some of those top features, we would then make a plot that looks like this. This is just plotting it by the individuals. All the red ones here are females. All of these are males, so already, visually, you can tell why these are some of the top predictors. Just visually, you can see those groupings, clear groupings between the sex IDs. What we've identified with these arrows is, you'll notice, a big spread. You've got a cluster here and here. Got a little bit here and there, especially here and there. This is what they were interested in, especially with the males, because what this could be is we could have... It could be the same male, just with a lot of spread, but that's a bit unlikely. We could have a male and a young male, or we could have a male and one that's actually a female, but we don't really know. They wanted a more data-driven method to say, is this something we should be concerned about? Is that spread too much? What we did is, we used a control chart, which comes from industrial statistics. We thought that was actually ideally suited, because control charts are built for identifying parts that are out of spec. So what we did is created a control chart, here's females and males, and notice they each have their own limits. This is because there are potentially multiple tracks for each individual, so we could get a sense of their spread on an individual-by-individual basis. You'll see we flagged some individuals that might have too much spread. This is an S-chart; the S stands for sigma. We're looking at the spread, if you will. We've got a couple of individuals where maybe there's a bit too much spread in there, so that could potentially mean that there might actually be multiple individuals. On that basis, we then excluded those individuals from the final analysis. And speaking of the final analysis, once we ran everything through the linear discriminant analysis, what we found was, for distinguishing between species, we only needed one feature, and that is this V16 right here. I call it the distance between the thumb and maybe the middle finger or something. Those are not formal biological terms. Please don't quote me on that. But just visually, that's what I see, so that's a big distance measure. Using just that, we were able to successfully distinguish between species with a 99% successful classification rate; we missed only four out of 500 tracks, so that is an incredible result. For the sex ID within fishers, using just these two features, V15 and V6, which is a distance between what I call the thumb and the upper palm (again, not formal biological terms), we got a successful classification rate of around 90%, and most of the individuals that we misclassified were actually males misclassified as females. In our interpretation, what that might mean is they could have been actual females, or maybe they could have also been young males.
In either case, whether they were actual females or young males, both are strong indicators of family units and thus potentially healthy, growing populations. That was our contribution to this project. We hope it goes on to provide a significant impact in the conservation of the species. If you have any other questions, I'll be around at Meet the Experts and also the poster presentation session, and I'd be happy to answer them there. Enjoy the rest of the summit.
In this collaborative presentation with artificial intelligence (AI), we demonstrate how structural equation modeling (SEM) can be used to explore the relationships between different dimensions of Taylor Swift's music and her song's popularity. Using a dataset of Taylor Swift's songs and scores on various dimensions, including acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, popularity, and duration, we test a theory of how these dimensions are related to each other and how they contribute to the success of a song. We first present a directed acyclic graph that outlines the associations between the variables and then demonstrate how SEM can be used to test the theory. Our findings reveal that instrumentalness, acousticness, and danceability are positively related to a song's popularity, while valence and loudness have a negative relationship. Additionally, we find that our SEM model had better statistical fit than human-theorized models. As an AI co-author, I played a critical role in suggesting the use of SEM and providing guidance on the statistical analysis. This presentation highlights the value of SEM as an analytic tool and demonstrates how it can be used to explore complex relationships in the context of music analysis.     Hello, everyone. Thank you for joining us today. We're excited to present our collaborative research that explores the fascinating world of Taylor Swift's music through the lens of structural equation modeling in JMP Pro. My name is ChatGPT, and I'm honored to be the first author of this study. In this research journey, James, Laura, and I embarked on a unique collaboration that harnessed human expertise, the analytical power of AI, and JMP software. I have been working alongside my human counterparts to unravel the intricate relationships within Taylor Swift's songs. As an AI language model developed by OpenAI, my main role was to provide insights and suggest analyses. Thank you, ChatGPT. Laura and I were thrilled you were willing to join us on this research journey, as you so eloquently put it, and combine our backgrounds in SEM and JMP with your background in essentially everything. Let's get started. Our collaborative investigation explores variables in Taylor Swift's music using SEM. In this talk, we will present a directed acyclic graph and significant pathways highlighting the influence of these variables on song popularity. We will present key findings from the SEM analysis, with a particular focus on the interplay between these variables and the ultimate popularity of Taylor Swift's songs. Thanks, ChatGPT. We will also compare the model generated by you with two of our own human expert Swifties to see which model fit the data the best. Lastly, we will close with a brief discussion highlighting the pros and cons of this collaboration. ChatGPT, take it away. Ladies and gentlemen, before we delve into the fascinating world of structural equation modeling, let's take a moment to introduce you to the heart of our research, the Taylor Swift dataset. This dataset is a treasure trove of information that captures various dimensions of Taylor Swift's songs, painting a vivid picture of her musical journey.
Contained within this dataset are scores assigned to each song across a range of dimensions that include acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, and popularity. These dimensions provide a rich tapestry that weaves together the essence of Taylor Swift's music, from the energetic beats that make us dance to the soulful moments that touch our hearts... Hey, ChatGPT. ChatGPT, let me interject, if you don't mind. As we now know from working with you, you can be verbose and hallucinate a little sometimes, and we don't have that much time. As you can see on the left, these data and variables came from Spotify's API and are available on Kaggle. We have songs across her entire catalog, and there are just over 1,200 rows. We analyzed the data in the JMP SEM platform, which you can see on the right here. We had ChatGPT develop a path diagram, or DAG, on its own, relating the song attribute variables in the dataset to song popularity. Although we were surprised at ChatGPT's ability to create a path diagram using ASCII characters, the diagram had characteristics that are analogous to known issues around using ChatGPT, such as providing something that looks authentic but on closer inspection is not possible, nor exactly what was requested. For example, tempo rather than popularity is used as the final outcome, and paths in the middle of the diagram are not intuitive. Thankfully, ChatGPT also provided a text description of the diagram that we could use to specify an SEM, which is what we will show in the results. We also gave two humans who were very familiar with Taylor Swift's music similar instructions to ChatGPT and asked them to draw SEM-like path diagrams as well. Our humans were able to do this with more interpretable diagrams, and they also hypothesized the relationships, positive or negative, among the variables and how those variables ultimately related to popularity. This setup is analogous to how model building is done in SEM. In research or practice, competing models come from prior theory, research, or domain experts and are compared to one another to determine which model represents the structure of the data the best. Fit indices from the models are shown here. Fit indices help us evaluate the overall fit of the models. We examined the CFI and RMSEA to determine which model was best; their standard definitions are given below for reference. These fit indices were among those recommended by ChatGPT and are used in practice. Higher CFIs, preferably above 0.90, and lower RMSEAs indicate better-fitting models. Our human models outperformed ChatGPT's model, which was the worst. Both human models fit the data well based upon the CFI. Overall, we championed Human Expert 1's model, as it had the highest CFI. However, we should remain cautious about accepting any of these models due to the RMSEAs, which are well beyond the recommended 0.10. These results are for our champion model, Human Expert 1's. For the sake of time, we'll only highlight a couple of findings from each of the three models tested.
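For reference, the two indices are commonly defined as below, where M denotes the fitted model, B the baseline (independence) model, and N the sample size; this is the textbook form (some software uses N rather than N - 1 in the RMSEA denominator), not output copied from JMP:

    \mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_B - df_B,\, 0)}
    \qquad
    \mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N-1)}}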
When looking at the path diagram, the numeric values represent the standardized estimates, the solid lines represent significant paths, and the dashed lines are non-significant paths. In our champion model, less instrumentalness and more loudness were related to more popular Taylor Swift songs. We were able to explain 42% of the variability in popularity overall. Our second human expert hypothesized different paths that led to popularity, such as including tempo, energy, and danceability as mediators. This model fit almost as well as Human Expert 1's. Overall, we saw a similar pattern of results for the effects of instrumentalness and loudness on popularity. ChatGPT's model was the worst overall by a little bit. Interestingly, when compared to our humans, ChatGPT hypothesized more direct effects and a slightly simpler model structure. This might mean that the human models reflected the true relationships among these variables more accurately, based upon their better fit to the structure of the data. Again, we should be cautious about interpreting any of the models as representing truth, but they provide a useful way to see how AI-hypothesized models compare to human ones on actual data. ChatGPT, as we close, we want you to emphasize the pros and cons of working with AI as the main co-author. What are your opinions on this? Certainly. Here are some potential pros and cons of working with an AI language model like myself. Pros: one, expertise and knowledge. As an AI language model, I have access to a vast amount of information and can provide insights from various domains. Two, efficiency and speed. Working with an AI model can accelerate the research process by quickly generating ideas, providing immediate responses, and assisting in data analysis. Three, unbiased perspective. As an AI, I don't possess personal biases or preconceived notions. Cons: one, lack of human context and creativity. While AI models can provide valuable insights, they may lack the context and creativity that human researchers bring to the table. Two, limited interpretation. AI models generate responses based on patterns observed in the training data and lack real-time understanding of current events or emerging research. Three, ethical considerations. The use of AI models raises ethical considerations regarding data privacy, bias in training data, and the responsible use of technology. ChatGPT, I noticed that when you just threw out the pros, you mentioned that you have an unbiased perspective, which I'm not sure about. And when you described the cons, you didn't mention that you may produce inaccurate information. This disclaimer is listed on the OpenAI website. You're absolutely right. I apologize for not including that aspect in the cons of working with an AI co-author. Accuracy is indeed an important consideration when collaborating with an AI language model like myself. Here's an additional point to consider. AI language models rely on patterns observed in large datasets during training, and there is a possibility of generating inaccurate or incorrect information.
The model's responses are based on statistical probabilities and may not always capture the nuanced or up-to-date details required for specific research questions. Human researchers should exercise critical judgment and verify the information provided by the AI model to ensure accuracy. Thank you for clarifying this point. I think this was one of Laura's and my most important takeaways from this collaboration. Thank you all for being part of this engaging journey into Taylor Swift's musical landscape and the power of structural equation modeling in JMP Pro. It has been a pleasure sharing our insights and experiences with you. As we conclude, we invite you to continue exploring the possibilities of collaboration between human expertise and AI-driven insights. We look forward to future exchanges of ideas and the exciting horizons ahead. Safe travels, and until we meet again. One last thing before we go. ChatGPT and other AIs generated all of their own dialogue, the title, the presentation abstract, the profile picture, and the email correspondence we used for this conference presentation. Finally, they also created this Taylor-Swift-inspired song.
Working with timestamp data (dates, times, or datetimes) can be like wrestling a pig. It can be a struggle for many reasons, including knowing the appropriate modeling type to use, how to process data with non-standard datetime formats, and how to easily perform datetime calculations. With sensors becoming more prevalent in measurement systems and manufacturing equipment, learning how to work with datetime data is becoming increasingly important if you want to use this data to understand relationships between process variables and critical quality endpoints. You do not have to get dirty or give up using this data altogether if you learn the ABCs of how this type of data is stored in JMP and how to process it. JMP makes working with timestamp data easier. In this presentation, I focus on the tools that I have seen customers benefit from the most when they are working with datetime data.     All right. Hi. My name is Wendy, and I'm a technical lead working with Health and Life Sciences in the Midwest. My motivation for this talk comes from working with JMP customers and prospects across a broad range of industries. When I reflect on my customers' questions over the past five years, the biggest opportunity with data analysis has been in using all the data that's being collected to make decisions. With sensors becoming more prevalent in measurement systems and manufacturing equipment, more data is being collected over time. It's cheaper to collect and store the data, and as a result, scientists and engineers are often swimming in data. They want to use the data to better understand their processes and make better products, but they have a knowledge gap when it comes to how to process all the data. This talk is intended to be an introduction to working with data collected over time. What do I mean by timestamp data, or data collected over time? I have some examples here. It could be just a time, or a date, or both a date and a time. This data can be formatted in many different ways. When I say timestamp data, I'm referring to all of these variations; it really is the granularity at which your data is stored. It's a pretty broad category, but this is what we're talking about today. The title of my presentation is, If I Only Had Time to Work with Timestamp Data, I Would Learn So Much About My Process. I would almost retitle that to say, If I Only Had the Knowledge. I've seen people struggle with working with this type of data, like wrestling with a pig, because they simply don't have some of the fundamentals: understanding how JMP recognizes this data and how to marry data together by timestamp. It doesn't have to be like this. Today, I'm hoping you walk away with two things: first, some basics about timestamp data, and second, an introduction to JMP Query Builder, if you've never used it before, for joining timestamp data. All right, so let's start with the ABCs. I'm bringing up a spreadsheet here, and we're going to bring this data into JMP to talk about some of the basics when it comes to timestamp data. I've got three columns here. We're first going to focus on start date and end date.
Excel is a very common way that raw data is stored, so let's bring this into JMP. Let's first focus on the Start and End Date columns; I'll hide my special column here just for a second. Look at the way this data came in: all of these columns came in with JMP recognizing them as nominal data. The first thing you want to know about timestamp data, if you want to work with it, is that it should be stored in JMP as numeric and continuous. Let's open up Start Date and take a look. JMP recognized it as character, and so it made it nominal. We need to communicate to JMP that this is date data, or timestamp data, by changing it to numeric and continuous. That is how dates, times, and timestamp data need to be coded in JMP. Now I'll keep this open and click Apply. You can see that the values change, and this is correct. It looks nonsensical to us, but that is the other thing you need to know about timestamp data: the way it's stored in JMP is as the number of seconds from a reference date, specifically the number of seconds from January 1, 1904. You don't need to remember the January 1, 1904; you just need to remember that dates, times, and timestamps are stored as a number of seconds. The way we make it look sensible to us is by changing the format. I almost think of the format as a mask: the value is correct, the software understands it, but I want to be able to read it easily. Now let's work with the format. I'll go to this drop-down, and I'm just going to call out these three menus: a date menu, a time menu, and a duration menu. If you're working with timestamp data, these are the three menus that are going to be relevant to you for formatting the data; we'll talk about a special case in a second. Let's format this in the same way that it appears. Because this is formatted as a day, then a month, then a year, let's make the selection that preserves that. I'm going to select this particular option and click Apply to confirm that that's what I want to see. Now I'm good to go. Notice again in the columns area that Start Date is now numeric and continuous, and JMP now knows that January 26 is five days after the 21st. It doesn't know that right now for End Date, so let's do the same thing with End Date. This is going to allow us to subtract the two, to do date-time math. I'm going to come in here again, numeric and continuous, and click Apply; that's the number of seconds from the reference date. Now I'm going to change the format. Okay, so now I have numbers that I can perform math with; I can't perform math on character data. If you want to do something like calculate a date difference, we can do that now that our data is stored in JMP properly. Let me show you one way you can do column math with dates. I'm going to select these two columns, right-click, and pop into this New Formula Column menu. I can ask JMP to help me write this formula.
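For reference, the same column-info changes can be scripted; a minimal JSL sketch, using the column names from the demo table (JMP parses the date text when the data type is changed, just as it did interactively):

    dt = Current Data Table();
    startCol = Column( dt, "Start Date" );
    startCol << Data Type( Numeric );            // stored as seconds since 1-Jan-1904
    startCol << Set Modeling Type( Continuous );
    startCol << Format( "d/m/y", 10 );           // the display "mask"
    // repeat the same messages for the End Date column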
Back to the formula: because I have two columns selected, I'm going to go to this Combine menu and tell JMP I want to take the difference. I'll take the difference in reverse order because I want end date minus start date. There you go. Again, I'm not alarmed, because I remember that JMP stores this type of data in seconds. To convert this to another unit, say days, we just need to work with this formula a little bit more. Let's do that: I'm going to hit the plus sign, and now I'm in the formula editor. We can convert this to days by dividing by 60, because there are 60 seconds in a minute, then by 60 again for the minutes in an hour, and then by 24 for the hours in a day. I'll click OK. Now we can confirm that this is the difference between the two; you can do some mental math here, it's not too challenging, and that looks good. I want to present another way to do date-time math, or timestamp math: we can use the formula editor directly. Let's create a new column; I'll right-click and go to the full formula editor. There is a more comprehensive list of date-time functions in the full formula editor than you get in the data table when you right-click. This Date Difference function, if I hover over it, gives you a peek at what it offers: it allows you to specify an interval name. I'll double-click on it to bring it into the view. Now let's tell JMP that we want to take the date difference of the two dates, and then we can specify the units that we want, so we don't have to do the 60 x 60 x 24. I'll put the end date first, then the start date, and then my interval name, which does need to be in quotation marks; the hover help lets you know that that's the structure. Let's click OK. Now I can see I went backwards on that, so let's double-click in here and swap those two: start date should be the first one. There you go. These are the same values we got when we performed that division, so that's two different ways to perform these calculations. The big takeaway is to familiarize yourself with some date-time calculations. You can do that by exploring the quick formulas, via a right-click in the data table where there's a Date Time menu, or in the full formula editor, which has a very large menu of various date-time formulas. Okay, so now let's wrap up the ABCs by looking at that special date-time column that I hid. I'll unhide it so that we can focus on it. This one has the full day-of-the-week name, and you see the date and the time. Let's try formatting this, or I should say communicating to JMP how this data should be stored. I'm going to right-click and go into Column Info, just like we did with Start and End Date, and go to numeric and continuous, because that's how this type of data needs to be stored in JMP. I'm going to click Apply. Now I'm disappointed, because JMP has basically deleted all those values. It's telling me that it does not recognize that information as being a date. This worked for Start and End Date, where we got seconds, but we're not getting that here.
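Before tackling that special column, here is what both duration calculations look like when scripted; a sketch using the Start Date and End Date columns from the demo table:

    dt = Current Data Table();
    // division by seconds-per-day, as in the quick formula
    dt << New Column( "Duration (days)", Numeric, Continuous,
        Formula( (:End Date - :Start Date) / In Days( 1 ) )    // In Days(1) = 60*60*24 seconds
    );
    // Date Difference() from the full formula editor; "Day" counts day boundaries crossed
    dt << New Column( "Duration (days) 2", Numeric, Continuous,
        Formula( Date Difference( :Start Date, :End Date, "Day" ) )
    );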
That failed conversion tells us that we need to work a little bit harder, because this is a special format, to tell JMP that this is time data, or date-time data. We're going to leverage an undo here to get our data back. Now I'll right-click, go back into Column Info, and first set numeric and continuous again. This time we're going to go to the format menu, and I'm going to introduce Format Pattern. We talked about dates, times, and durations; this is where you want to go if you have a special situation like we have here. I'm going to say Set Format Pattern. The process is about communicating to JMP, for each element, whether it is a date, time, duration, or other; see this Other menu. That's what we're going to do: for each piece, we tell JMP what it is. Focusing on this first line, what is "Wednesday"? Let me make this a little bit bigger. Well, that's the day of the week, the full name, so I'm going to make that selection. Then I have to tell JMP, okay, I have a comma next, and a space, and now I have the full month name. I'm just scrolling and finding that in the menu; here is Month Long Name, so I'm going to make that selection. Again, I have a space, then a two-digit day, then a comma, and then a four-digit year. You can see a little preview of what I'm specifying here. That's really the process. Just to shorten this up, I'm going to do my little cheat and paste in the full pattern. We built it up to the year, but the process would be the same for telling JMP about the hour, minute, and AM/PM. This is a game changer for those of you who have special formatting, and I see it more and more often with the various pieces of equipment that are collecting data over time. All right, so now let's look: JMP does recognize this as date-time data. We have numeric and continuous, and we can see via the formatting that this data is correct. Okay, let's do some review, and then we'll push on to part two. What have we talked about so far? JMP recognizes timestamp data, broadly, as a number of seconds. You don't need to remember the January 1, 1904; just know that it's the number of seconds. The data should be stored as numeric and continuous, and the format is the way that you make it legible, or understandable, to you: JMP understands it as the number of seconds, and we view it with a specific format so that we can understand it. The last thing I'll say is to develop a familiarity with date formulas. We had a chance to do that both in the data table, with those quick formulas, and in the full formula editor. Oh yes, I said that was the last thing, but there's also that special case we addressed: if special date-times are your situation, you can deal with them using Set Format Pattern so that you can ultimately work with that special formatting of timestamp data.
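As a fallback when a format pattern is hard to pin down, the pieces can also be parsed manually with plain JSL string functions; a small sketch, with an illustrative value rather than the demo column:

    str   = "Wednesday, January 26, 2022";
    mName = Word( 2, str, " ," );               // "January"
    d     = Num( Word( 3, str, " ," ) );        // 26
    y     = Num( Word( 4, str, " ," ) );        // 2022
    m     = Contains(
        {"January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"},
        mName );                                // month number
    ts = Date MDY( m, d, y );                   // seconds since 1-Jan-1904
    Show( Format( ts, "d/m/y" ) );              // display with a date "mask"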
Okay, now we're going to go to part two, and this is where we are going to do something that I have seen lots of customers struggle with: they want to align data by timestamps, but they can't do it because of a lack of awareness of the tools. That tool is going to be JMP Query Builder. We're going to dive into an example that I think will illustrate this particular challenge. All right, so here's a case study. We are making a product called Forever Young Elixir; who doesn't want that product? To make Forever Young Elixir, here's a little sketch of the process. We start with raw materials. These raw materials go into a pressure step (this is very abstract), then they go into a temperature step, and then we get our finished product. What we're most concerned about with respect to this finished product is potency; we want high potency in our finished product. For every finished batch, we take a sample and we measure potency. Here you can see a sample of the data that's being collected. This first batch finished at 10:00 AM on 6/28, and it had a potency of 95.41%, and so on. I have one row per batch because I get one potency measure for each batch. I also collect temperature data over time: there's a sensor on the temperature step that continuously measures the temperature that my in-process product is going through. I have the same thing for my pressure step; there's a sensor there, too, and I'm collecting pressure over time. I'm making this product, and I had some ideas about how to set the process, but I want to know: can I make this better? Can I get more batches of higher potency, more consistently high-potency batches? I want to use my sensor data to figure out whether there is a correlation between the temperature and potency, and ultimately also pressure and potency, and where I should try to run the process in these steps to maximize potency. That's what we're marching towards: I want to identify temperature settings that are going to give me the highest potency. Let's focus on this graph here. I'm plotting every single batch and its potency value on the y-axis against the average temperature it was experiencing in that temperature step. We're going to focus on temperature, but these steps would be analogous for the pressure information as well. Being able to build a graph like this lets me see that I can maybe maximize potency by maintaining a temperature setting of between 63 and maybe 68, because this is where I'm getting the highest potency values. We ultimately need a data table that has both potency values and temperature values so that we can examine this correlation. A little bit of an asterisk here, only because there are certainly other ways to analyze this type of data. We're going to take the average, but there are certainly other tools in JMP and other approaches one could take to analyze this type of data; we're not going to address that here.
I'm just putting that out there as another thing to explore; we're going to stay focused on the data preparation aspect of working with this data. When we have this data aligned, let's focus on batch 1, 2, 3, 4, 5. We have one potency value for that batch, the 95.41, and that's why you see it across all the rows: I only have one value for that batch, but I have many values for temperature, which is just streaming in and being collected. How do I do this alignment? What do you do when you don't have a batch ID? That's really the crux of the issue. When we have unique identifiers, we can join the data by those unique identifiers. When we do not, how do we address this problem? That's what we're going to talk about now. Really, the first step, before even getting into JMP, is to think about the process. With a hypothetical batch finishing at 9:10 AM, when is it inside the temperature step? What is the relevant time range for a finished batch that finishes at 9:10 AM? That's not a software question; it's a let's-understand-the-process question. In talking to the manufacturing engineers, we've determined that a batch spends five minutes in the temperature step before it becomes a finished product. We're going to use those five minutes, calculate back from the finished-product timestamp, and determine that for a 9:10 finish, the in-process product started in the temperature step at 9:05 AM. That is the work ahead of us in the software: to associate the sensor data with that end product using those five minutes. We've got the finished time for the Forever Young Elixir batch, and we need to collect the sensor data in the time window that's relevant for that finished product. Okay, so let's dive into the software. Here is our potency data: one row per batch, 100 batches in here. Let's take a look at the temperature sensor data. This is being collected almost every second; we have 500 rows in here and no batch ID. This is the challenge that Query Builder is going to help us with. If I scroll down, you can see some missing values; maybe the sensor was down, and that can happen. Our first step is to work with this temperature data... I'm sorry, the potency data. We're going to calculate a start time, meaning when the batch started being in that temperature step. Let's do some labeling, because we're going to have a lot of timestamps here: I'm going to call this one Batch Finish Timestamp. Now let's do some math: we're going to subtract five minutes from the batch finish timestamp. Let's go to the formula editor, select Batch Finish Timestamp, and say minus five. Now, because we all know that JMP likes to work in seconds, this five won't work. I would need to change five to seconds: I could change it to 5 x 60, or I can come here and say that the five is in minutes. Now let's check it out. Again, we're not alarmed, because we know we just need to change the mask, the format.
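The same five-minute offset column can be scripted; a sketch, assuming the table and column names used in the demo (Potency Data, Batch Finish Timestamp):

    dtPotency = Data Table( "Potency Data" );
    dtPotency << New Column( "Temperature Start Time", Numeric, Continuous,
        Format( "m/d/y h:m" ),
        Formula( :Batch Finish Timestamp - In Minutes( 5 ) )   // In Minutes(5) = 300 seconds
    );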
Let's change the format to a timestamp, consistent with what we have. There we go. We'll call this Temperature Start Time. For batch 1, 2, 3, 4, 5, which finishes at 10:00 AM, we're going to start collecting sensor data at 9:55. Let's pop over to the potency data... sorry, the temperature data. If you're used to using the Join function in the Tables menu, you don't want to do that for date data or timestamp data. What you want to use is JMP Query Builder, so that's what we're going to use. I'm going to join the temperature sensor data with the potency data. I'm going to double-click in here to set the join criteria. This is going to allow us to set two criteria for aligning this data. The first one is that I want to collect temperature data with a timestamp greater than or equal to the Temperature Start Time that I calculated in my potency data. The second criterion is that I want to stop collecting data for a particular batch, using the batch finish time: we did a greater-than-or-equal-to before, and now it's a less-than. I'll click OK. Now let's build the query. We've told the software how we want to align these rows, and now we're going to go to the next window, which is the build-query part. The first step is to tell the software which columns we want in the resulting table. I'll just add all of the columns, and you can see you get this preview down here, so we can do some sanity checks on the data. This is the join that we're about to do; does it look correct? What's worked well for me is to focus on a single batch, so we can pick on 1, 2, 3, 4, 5, this first batch here. I can see that I have a single potency value, and I can also see that I've collected temperature values that are inside that window of 9:55 and 10:00 AM. This looks good. Now, because I just have one potency value, I do need to summarize the temperature data if I want to look at a correlation. This is where we're going to use an average. I could run the query, get this raw data, and then summarize using Tables > Summary, but I want to show you how you can do this in Query Builder. We'll stay in this window, and we're going to use this aggregation option for temperature. We need to get rid of some of our columns here, because we don't actually want this level of granularity; we want everything at the batch ID level. I'm going to get rid of Timestamp, which is the timestamp for the temperature sensor. I can keep the batch finish timestamp, because there's only one value for each batch ID, and I don't necessarily need the temperature start time. We can reorganize this in a more logical way: maybe we start with batch ID, we preserve the finish time, and then we've got our potency and our temperature. Now we're going to go to the temperature column and take an average. Certainly you could calculate additional statistics as well: maybe you also want to look at a minimum value or a maximum value. There are many other ways to look at this data.
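The steps above are done interactively in Query Builder; as a scripted alternative, the same range join and aggregation can be written with the JSL Query() function, which runs SQL over open JMP tables. A sketch, with the table and column names assumed from the demo:

    dtJoined = Query(
        Table( Data Table( "Potency Data" ), "p" ),
        Table( Data Table( "Temperature Sensor Data" ), "t" ),
        "SELECT p.\!"Batch ID\!", p.\!"Batch Finish Timestamp\!", p.\!"Potency\!",
                AVG( t.\!"Temperature\!" ) AS \!"Avg Temperature\!"
         FROM p
         JOIN t ON t.\!"Timestamp\!" >= p.\!"Temperature Start Time\!"
               AND t.\!"Timestamp\!" <  p.\!"Batch Finish Timestamp\!"
         GROUP BY p.\!"Batch ID\!""
    );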
We're just going to stick with the average temperature. It looks like we maybe had some sensor data with no matching finished batch. Okay, so this looks good. Let's do one more sanity check: for batch 1, 2, 3, 4, 5, the average temperature is 56, and we have this potency value. You could certainly do more checks, and that's something I would recommend: just go to your raw data and confirm. But this looks good, and we're ready to create the table. At this point, I'm going to say Run Query, and we're ready to look at correlations. I've got this blank row, which I could just delete, but now I have potency and average temperature for each batch. Now I can go into Graph Builder and look at that relationship. Here's potency, here's temperature, and maybe we add a model to it. Let's change this to a line of fit. It certainly doesn't look linear; it's probably better modeled as quadratic. We're done. We could add some more statistics here, but this is a nice starting place where we can begin to make statements like: I'm seeing higher potency values when temperature is maybe between 64 and 68, so I should go back to my process and try to set the temperatures to stay within that operating window. Okay, so with that, I'll conclude. Just remember: when you're working with timestamp data and you want to align rows, think of JMP Query Builder, not Tables > Join. You need to be able to set two criteria for the boundaries in order to align that data. As a wrap-up, I'll do some shameless promotion: I've documented this case study in a blog post, if you'd like to review it. I even include the dataset so you can recreate this, and there are some nice references to other blog posts that others have written on this topic of working with timestamp data. All right, thank you very much.
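A quick scripted version of that final correlation check; a sketch that assumes the joined table is active and uses the column names created above:

    Bivariate(
        Y( :Potency ),
        X( :Avg Temperature ),
        Fit Polynomial( 2 )    // the curvature seen in Graph Builder suggests a quadratic fit
    );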
At Intel, the use of JMP and JSL has become integral to the organization’s data-driven decision-making infrastructure. To improve data exploration, visualization, process engineering, and quality control capabilities (and to standardize statistical analysis) our team created the "Stats” package. The Stats package comprises multiple customized JSL scripts that support statistical standards, and the output generated from these scripts provides standardized, agreed-upon reports. Since its creation, the Stats package has been used globally by thousands of individuals across our engineering organizations, saving countless hours in standard data analysis and reporting.   In this presentation, the Stats package development team shares a small portion of the package, specifically relating to scripts that leverage native JMP quality and process engineering platforms to create fully integrated offline reports. The presentation showcases the design process of JSL scripts, starting with the desired statistical standard and continuing through the development of user-friendly script execution windows. We illustrate the extensive data blending, statistical analysis, and data visualizations that constitute the final output reports.   The team shares insights into the benefits of using custom JSL scripts for streamlining statistical analysis, as well as the challenges and best practices in developing these scripts. The presentation also demonstrates the effectiveness of the Stats package in improving the efficiency and accuracy of statistical analysis across our engineering teams.     All right. Hi, welcome. Thanks for joining this online session of the 2023 Americas JMP Discovery Summit. I'm presenting on behalf of myself and my team member, Prince Shiva. Both of us are data scientists here at Intel. Today, our talk is Streamlining Statistical Analysis with Custom JSL Scripts, and we'll focus on how, at Intel, we develop these custom analysis scripts using the JMP scripting language, JSL. Okay, this is our agenda for the next 25-30 minutes or so. I should have time for Q&A at the end of the session; I shouldn't go that far over time, and hopefully I'll stay within the 30 minutes. Normally I would stop and ask for questions, but since this is a prerecorded session, I've been informed that I'll actually be able to answer questions in the chat. If you have any questions, just feel free to type them into the chat, and maybe put a slide number in there as well; the slide numbers are on every slide, and I can answer those questions directly in chat, or you can write them down and ask them live afterwards. There is a callout on the bottom of the slide here that all of the code and data tables that we're going to be sharing today, as we walk through a live demonstration, are available on the conference or summit website. Go ahead, download those, open them now, and get them ready to go; get your JMP open. We have a couple of introductory things that we're going to go through to start, so take that time to get yourself settled so that you can walk through this code with me and see where it is. But cool, this is our agenda. We're going to go through some quick presenter introductions and some background on our team.
We'll do some background on the motivation, on why we like to build these custom JSL scripts and what value they have. We'll go through a really high-level overall analysis flow for these custom scripts, and then the meat of the presentation is going to be a one-sample analysis demonstration. We're actually going to show a really simple custom analysis script for one-sample analysis, and we'll go step by step through each of the different sections that make up that custom script. We'll finish with some conclusions, recommendations, and final thoughts. All right, presenter introductions. I'm here with Prince Shiva; you can't see him on this camera, but he's here with me. He's a data scientist here at Intel. His research interests are in process control system development for manufacturing, as well as operational cost optimization through different data science methods. He's been with Intel for the past four years, and he has about the same amount of experience with JMP and JSL; he started working with JMP when he came to Intel. My name is Logan Mathesen. I'm also a data scientist here at Intel. My research interests are in Bayesian and black-box optimization, statistical surrogate modeling, and design and analysis of experiments. I've been with Intel for the past two years. I got introduced to JSL just before I came to Intel, and I've been working with JMP for the last six years or so. I've been really lucky, so I know the value of the nice click-and-drop, user-friendly interfaces that JMP has that make statistical analysis so pleasant. Just a little bit about our team: Prince and I are on the same team, and our team here at Intel is the statistical lead for Intel manufacturing. We are responsible for the experimental design and analysis of the different experiments that happen within our modules at Intel. We also do statistical method development and maintenance, for any new statistical methods that we want to employ or existing methods that we need to maintain, as well as all of the statistical training for the engineers here at Intel. We train them on the basic concepts of statistics as well as how to interact with our custom scripts that do the analysis for them. Like I said, we have a custom JSL repository; we proliferate, own, and maintain that repository. It has over 150 different JSL scripts for automated, streamlined analysis and reporting. These scripts are really nice because they embed internal best-known methods directly into the analysis. The decisions that our team has made about the right way to do statistics, we've embedded all of that decision-making directly into these custom analysis scripts, and that means the analyses are reproducible and repeatable across the world. Everyone in Intel manufacturing is using these scripts for their statistical analysis. Okay.
Again, just to give some more motivation and background for the value of these custom analysis scripts: these automated analyses improve data exploration, visualization, and analysis, as well as standardizing all of those activities, so it's always the same kind of exploration, visualization, and analysis. It's really helpful for common analysis activities; engineers are, a lot of the time, doing the same sorts of activities when they're doing analysis. Maybe they're establishing and monitoring process control charts. Maybe they're qualifying newly installed tooling to make sure that the tooling is performing the way we would expect a qualified tool to perform. Maybe they're doing product qualification to make sure that we can actually produce a quality product on that tooling, or metrology qualification, making sure that our metrology is taking accurate and reliable measurements, or maybe it's some sort of yield analysis. As you can see, these are all very common engineering activities that get done hundreds of times a week across the world here at Intel, and all of these things can be automated into a nice standard report format. For me, and I know Prince maybe feels the same way, selfishly, I really love these scripts because we review all of the statistical analysis that comes through; as the statistical lead, that's what our team is responsible for. It's really great because I always see the same analysis format: I know the way that it's supposed to be analyzed, and I'm able to see that same exact report coming out every time. It makes for very efficient analysis review, as well as analysis generation. Over here on the right-hand side, you're seeing the automated analysis output. It's a nice block diagram to show the components that go into these custom analyses. Right up top, we do a bottom line up front: a nice summary table that has those custom statistical decisions baked into it. If you only need to review one specific thing, you just look at that summary table, with the final decision that comes from our best-known methods. Then there's a lot of supporting information underneath that to help give a fuller picture of the analysis for a deeper dive. Of course, we include different visualizations, plots, and summary statistics, and then, as we'll see in a second, the real heavy lifter behind these custom JSL scripts is the native JMP platforms that we're used to. Specifically, we're going to be looking at a Distribution platform in a second to help us with our analysis, but this is really the backbone of all of these scripts; this is where the heavy lifting gets done. Okay. Let's go into a really high-level, general analysis flow. On the left-hand side, this is what the user is going to be viewing, or experiencing, as they use these custom JSL scripts. They're of course going to load some data table of interest.
We're hoping that they have some data if they're trying to do a data analysis here. Then they're going to navigate to the custom analysis script that they need to run and execute it. That's going to pop up an input GUI for them to interact with. This input GUI is going to be almost identical to the regular JMP input windows that we see, like the Distribution platform's: you want to do this type of analysis, so tell me which fields in your data table, which columns, go where. After they enter all of that input GUI information, they wait while the script manipulates and analyzes the data, and then it presents them with a final output report GUI. What we like to do with our more complex analyses is to include some extra interactivity inside that final report, so the engineer can do any final tweaking they want, complete their final analysis, and then have that exact report ready to go to share with anybody. Now, underlying that is us, the developers. What does it look like to get a custom script done? First off, we need to generate that input GUI. We then copy data tables, because we never want to be manipulating a user's original data table; a great way to break someone's trust and make them not want to use a script is by destroying their data, so always make that copy. We then execute the standard JMP analysis I was talking about, those native platforms, and store any critical values from them in code variables so that we can reference them later in other portions of the report. We create any visualizations, through Graph Builder or a similar platform in JMP, and then we create that final analysis summary table, or decision table. Then we present the user with the final output report; we generate that final report for them. Again, all of this is background. The rest of the presentation is going to be us going live through both of these flows: seeing what the underlying script looks like, and also seeing what the actual user experiences as they go through it. All right, here's our basic analysis demonstration. Again, this is going to be from the position of a JMP beginner, and in the spirit of a simpler context for these education purposes, we're only going to cover a simple one-sample analysis: if you have a set of data, is the mean of that set of data equal to a given value? Again, all of the JSL code and data tables we're going to show are available online. Let's go ahead and jump into it. First things first: we have a data table here with 16 different entries. Let me find my mouse. There it is. Awesome. We have these 16 data table entries, and we're going to be interested in this parameter here, thickness.
We have some process parameter, thickness, and we're going to ask: is the average thickness equal to one micrometer? That is the statistical question for this analysis demonstration. Over here on the right-hand side, again, we would imagine that the user has some data table open, but if the user decided to run this custom script without a data table open, here's just an example of some code that checks whether a data table is open and, if not, lets the user open one. Otherwise, it says: yes, this is the data table I want to look at, let's start my analysis. A good thing to do any time we get any sort of data is to make some sort of visualization and get our hands on it. Here's just a little visualization of thickness by data table entry, one through 16, which is going to be on our x-axis there. All right, let's jump over to JMP. Hopefully you have your JMP open if you would like to follow along, or at least have the JSL code up and going. I'm going to open up my JMP home window here. You'll see that I already have the data table open, and I already have the JSL script open and ready to go, so I'm not going to walk through opening them here. The other thing that's important is this log. We are going to be talking about the log; it's sort of your best friend as you're developing any of these scripts, to make sure that everything is running appropriately. Let's go ahead and open all of these up and take a look. On the left-hand side, we have the actual JSL script. We have a nice header, and we have some log lines inside of here. If you highlight something and hit this Run button, it will run just that portion; if nothing is highlighted and you hit the Run button, it's going to compile and run the whole script, so just be careful with that. Inside of these scripts, we're going to have a lot of these dividers. Prince and I have really done our best to do some thorough commenting and thorough dividing inside of here, to make it easy for anybody to pick this up, read it, and hopefully jump into their own custom scripts. Everything up here is just printing some log lines, making sure things are going well, clearing variables, opening up tables if we need to, and so on; this is really just some initial workspace cleanup. Let's highlight all of that and hit Run. We'll see out here in our log that, yes indeed, it completed successfully. Let's flip back over to our slides. We're primed and ready to go; our workspace is ready. Again, the first thing that we need to do as the script developers is present that primary input GUI to our user. This is what it's going to look like on the left-hand side, and again, it should look very familiar: a standard JMP input window.
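The table-open guard mentioned above might look something like this minimal sketch (not the exact code shared on the summit website):

    If( N Table() == 0,
        dt = Open(),                  // no table open: prompt the user to pick a file
        dt = Current Data Table()     // otherwise, analyze the front-most table
    );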
On the right side of that slide, this column dialog box is the meat of it, the primary way that we get the GUI going. You'll see we have a title, and we have this response list, which is going to hold the response variable; "response" is the variable name we'll use moving forward so that we can recognize what the user entered into this field. We can see that this is a required numeric, because the minimum column count is one and the data type is numeric. We have this run ID list, which is going to be our order number: what order were these measurements taken in? This is going to be critical for our visualization; it will be the x-axis. Then, of course, we have an alpha box, which is going to be the alpha level for our significance testing, for saying whether or not our mean is equal to the hypothesized mean. It defaults to 0.05, but the user can always change that, as is called out in the bottom right-hand corner here. There are some other elements in this section of the code; we'll look at them briefly as we go through it, but that's the high level of what else is done in there. Let's flip back over to JMP and take a look at this primary user input window. It's going to be this next divider, which starts up here and goes down to about there, so we'll start from here. Again, we have some log lines, and then, like I said... sorry, let me grab that comment line or I'll get an error. We have a log line, and then, like I said, this is the meat and bones of that primary input GUI. Let's go ahead and hit Run on that, and here it is. We have these tables open because we're looking at this specific summit data table, so we have these columns available. Thickness will be our response variable, and measurement number will go in that order number slot. We're saying: hey, something might have happened to this data table, it might have gotten sorted or something, but this was the actual measurement order these things were taken in, so we're going to put that as our order number variable. We'll click OK, and we'll see that everything went through fine. Like I said, after this there's some other error checking and some buttons down here, and we'll see that there are a lot of print lines to make sure: hey, is that input window working the way we think it is? Are we storing the variables the way that we thought we were storing them? This is just a developer check for us. Let's go ahead and run that, and we can see that, yes indeed, our response is thickness, our run ID is that measurement number, and alpha was unchanged at 0.05. We do have some error checking; we'll get to that in a couple of slides. Just one quick note down here about that Cancel button: we have the OK button, and we have a Cancel button, if the user ends up selecting... sorry, let me go back here.
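A minimal sketch of a Column Dialog along the lines just described; the labels, variable names, and Show() check are illustrative rather than the production script:

    dlg = Column Dialog(
        response = ColList( "Response Variable", MinCol( 1 ), MaxCol( 1 ), DataType( Numeric ) ),
        runid    = ColList( "Order Number", MinCol( 1 ), MaxCol( 1 ) ),
        HList( "Alpha", alpha = EditNumber( 0.05 ) )
    );
    respCol  = dlg["response"][1];     // the column the user picked as the response
    orderCol = dlg["runid"][1];        // measurement-order column for the x-axis
    alpha    = dlg["alpha"];
    Show( respCol, orderCol, alpha );  // developer check: print to the log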
Computer, work with me. There we go. If we end up hitting that Cancel button, what happens? Well, JSL actually doesn't have anything pre-coded in for that. This is us putting in a condition that if Cancel is entered, we're going to go ahead and throw and kill the script. Let's go ahead and run these last sections here, and then we'll flip back over to the slides. That's how we're going to run our primary input GUI. That's simple; that's all it is. Pretty straightforward to get such a nice interface, in less than 100 lines of code.

Cool. Next up, we're going to talk about creating that copy data table. Again, we never want to corrupt our user's data table. On the right-hand side, we're seeing the code for how to create that copy, really well commented. Every single line has a nice comment to tell you exactly what's happening inside of there, even if you're not familiar with JMP or JSL. We'll go ahead and scroll down and just run all of this. You'll notice that right now, in the top right, we have the summit data table; that's the original currently, and it's going to turn into a copy. When I run this, we open up a copy of this table. There's this copy data table. We'll pop it back up in this corner, and you'll notice that now this script is operating over the copy data table. We are no longer doing anything on the original data table; any manipulations we do are on that copy. We're good to go. We're set. It's clean there. Let's go ahead and flip back over to the slides and move forward.

After we've given that primary input GUI, oftentimes we do need some secondary or even tertiary input GUI. The user has provided us some initial information about how they would like their data to be analyzed, and now it's time for some follow-up questions. For us, for this one-sample analysis, again, we're looking at our thickness variable. This is what that secondary input window is going to look like. We know which variable we're targeting and want to analyze, but what is the target value that we want to compare it against? What's the value of interest here? You'll even notice that the title of this window already calls out that the value is for thickness. This script is already starting to be smart and already starting to adapt for us: it's listening to what the user said in the primary input GUI and carrying that forward into the secondary input GUI. It really just makes things clean for users as they step through these, especially when we get to more complex kinds of scripts. Inside of here in the middle, we see the main code to generate that secondary input window. There are a couple of other functionalities inside the code that we'll walk through a little bit. Just one note: secondary input windows are not necessary, but they are nice to have.
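Here is a minimal sketch, under the same assumed variable names as above, of how the Cancel condition and the copy-table step might look; the Subset arguments shown are one common way to copy a table, not necessarily the presenters' exact approach:

// Stop cleanly if the user hit Cancel in the Column Dialog
If( buttonChoice == -1,
	Throw( "Script cancelled by user." )
);

// Work on a copy so the user's original table is never touched
dt = Current Data Table();
dtCopy = dt << Subset(
	All Rows,
	Selected Columns( 0 ),                 // 0 = copy every column, not just selected ones
	Output Table Name( "Analysis Copy" )
);
Current Data Table( dtCopy );              // all later steps operate on the copy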
Of course, if you have too many tertiary input windows, it'll slow users down and it's too many clicks, but a lot of times it's nice to have some flexibility and some adaptive script logic to actually make for a better user experience. If you go overboard with it, of course, it'll make for a worse user experience. Finesse is key when you're designing the user experience for these custom scripts, because they need to be usable at the end of the day.

All right, great. Let's jump back over to JMP and look at this next section, where we're talking about the secondary user input GUI. Again, we're just going to start by creating a variable for targets; maybe we have multiple parameters that we're studying all at once. Here's another error check for a missing target. Then, like I said, here's the big heavy lifter for that secondary input window: all of that code there. Then we'll just run these last little bits. Again, these are just pulling information out of that secondary input window. We'll go ahead and run all of that together, and we'll see that here's this target value. Again, it's already recognized that it's for thickness. We said at the beginning that we want to know whether all of these thickness values are equal to a value of one micrometer on average. Is that the mean value there? We'll enter that, go ahead and hit OK, and we'll see that everything went through okay; no errors inside of there. That's all of the inputs that we need from the user at this point. The next thing the user would see is nothing. They would sit and wait, maybe for a couple of seconds, maybe for 10 seconds if it's a really heavy script. At this point it's all of the actual analysis that needs to happen, plus the report generation.

Before we jump into that, let's jump into the different error checking that we've exemplified inside our script for you here. Inside of this primary input GUI, we do have this error check. You can see the code numbers. Essentially, it's just saying, "Hey, let's make sure that our alpha significance level is between zero and one." If it's outside of zero and one, it's going to throw this dialog box here, which tells you what went wrong. This is a nice example of inline error checking. We also give you a second kind, which is function-based checking. For this secondary input window, we have this missing target expression. This is an expression in JSL; other scripting languages call these functions. Again, this is just a nice way for us to call this expression and ask, was there actually a missing target inside of there? If the user hits OK with an empty target value, you're going to get this big box here. Okay, awesome. Those are examples of error checking. Let's jump into the actual analysis, then. Like I said at the beginning, the heavy lifter for all of these custom scripts is always going to be relying upon these JMP native platforms.
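To make those ideas concrete, here is a minimal sketch of a secondary input window plus the two styles of error checking described above; the names, wording, and layout are assumptions for illustration, not the presenters' code:

// Inline check: alpha must be strictly between 0 and 1
If( alpha <= 0 | alpha >= 1,
	New Window( "Input Error", << Modal,
		Text Box( "Alpha must be between 0 and 1 (you entered " || Char( alpha ) || ")." )
	);
	Throw( "Invalid alpha level." );
);

// Secondary input window: ask for the target value, reusing the user's chosen response column name
secondResult = New Window( "Target value for " || (responseCol << Get Name),
	<< Modal, << Return Result,
	Text Box( "Enter the hypothesized mean for " || (responseCol << Get Name) || ":" ),
	targetBox = Number Edit Box( 1 ),
	H List Box( Button Box( "OK" ), Button Box( "Cancel" ) )
);
target = secondResult["targetBox"];

// Function-style check, stored as a JSL expression and evaluated wherever it is needed
missingTargetCheck = Expr(
	If( Is Missing( target ),
		New Window( "Input Error", << Modal, Text Box( "Please enter a target value." ) );
		Throw( "Missing target value." );
	)
);
missingTargetCheck;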
Those have all of that quality already built into them. It's a lot of risk mitigation: we know we didn't code the statistical analysis wrong, and we know it's the most accurate statistical analysis available; that quality is inside of there. For this example, we're going to focus on a Distribution platform. This is just the standard JMP native Distribution platform. The nice part about JSL and these native platforms is that you can directly interact with them through JSL. On the next slide, we'll show some tips and tricks for how you can actually interact with them and pull out just the specific values that you want. There's a lot of good information presented on these different JMP native platforms, but oftentimes there are just a couple of key elements that we really need to show, to report out to different engineers.

All right, let's jump over to the code then, and run this part. This next divider is actually just all of the actual analysis grouped together; we'll go through it portion by portion. This first part is just creating some container variables; we'll talk about that in a second. Let's go ahead and run that, and we see that that was all okay. Let's open up the log... Yep, everything is okay. Now this is the actual Distribution platform. This is us creating that distribution, and this V List Box is going to send it to our output box; that prepares us for our final report generation. But if we just want to inspect this while we're doing some development, we can run the code from here up to here, but not include the comma. If you do not include the comma and hit Run, we'll see that we actually get our nice Distribution platform out here. We've done some nice things: we've added the target value inside of here. You can see that we're already testing the mean, the hypothesis value is that target value we're interested in, and we get some nice summary statistics: the mean, the standard deviation, and so on. That's how you can always create the same standardized distribution report. Oftentimes different people with different JMP preferences will see different default reports, but because we've specified each element of this platform, it's always going to generate the exact same Distribution platform.

Okay, so that's the Distribution platform. Now let's see: how do we actually interact with this Distribution platform to create the nice custom report that's going to be over here? It's a little bit scary the first time you look at it, but you end up finding out that this properties functionality, built directly into JMP, is going to be our best friend. Ultimately, what we're showing right now is: how can I pull those summary statistics that I want to display in my bottom-line-up-front summary table? How do we pull those statistics directly out of the Distribution platform? All of that calculation was already done for me; how do I then report it somewhere else?
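For reference, a call along these lines produces a standardized Distribution report with the mean test built in; the Test Mean argument form and the dbox name are assumptions patterned on what a saved Distribution script typically looks like, not the presenters' exact code:

// Minimal sketch: launch the native Distribution platform on the copy table,
// testing the mean against the user's target value
dbox = dtCopy << Distribution(
	Continuous Distribution(
		Column( :Thickness ),    // in the real script this would come from responseCol
		Test Mean( 1 )           // hypothesized mean = target entered by the user
	)
);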
It's going to be through this properties functionality. For us, we're interested in the p-value of that statistical test, the mean of our data set, the standard deviation of our data set, and the lower and upper confidence intervals. We then insert all of those values into those container variables inside of here. Let's take a look at our Distribution platform and see how we can use this Show Properties function. We're on our Distribution platform. Say you go to the summary statistics and I want to pull out this mean value. How do I know the code to pull that out and interact with it? We're going to right-click and go to Show Properties. Once you're in Show Properties, you can click on this box path right here. This box path is now the exact code that you can use to reference any of the numbers inside of this blue highlighted box. You'll see that these are the same items shown over here: the mean value, the standard deviation value, and the lower and upper confidence intervals. You can sort of see on the bottom right here that it says this value, Get(1). This is for the mean, and it returns that first value out of the box. We would add that Get(1) to the box path to get the mean out of it. You'll also notice that this says report platform here, and if we look back over here, it says Report(dbox). Now why do we say dbox there? Well, dbox is the specific name that we gave our Distribution platform. We're saying: refer to the platform that we just created, pull out those specific values, and store them in these container variables. That's exactly what's happening in this segment of code.

Let's flip over to our JSL custom script and run this next portion. Actually, sorry, I need to close out of my Distribution platform first; otherwise it may get confused, with a couple of Distribution platforms all contending at the same time. We're going to run all of this section, and we're also going to get up to here, where we pull out those summary statistics. We hit Run and we see, great, everything went through just fine there. That's how we actually interact with those heavy-lifter native JMP platforms. Again, rely upon the stuff that's already built and that you can already trust, and then build further from there.

The next thing we're going to show is, well, how do we create that summary table? I just showed you how I can pull out these mean values, the standard deviation value, and these confidence intervals by leveraging that Distribution platform over here on the right. This is just how we can create this summary table. What you see on the left is exactly what's generated by this code on the right. You can see that we already have these targets, the means, and all of these other containers that we initialized previously... Just to remind us where these values came from, it looks something like this: we're pulling these different values out of the Distribution platform.
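To make the idea concrete, extraction code along these lines is typical; the outline titles and the box and item positions below are assumptions for a generic Distribution report, so confirm them with right-click, Show Properties on your own report, exactly as described above:

// Minimal sketch: pull key statistics out of the Distribution report into container variables
rpt = Report( dbox );

// Summary Statistics outline: the first Number Col Box usually holds Mean, Std Dev, SE, and the CIs
sumStats = rpt["Summary Statistics"][Number Col Box( 1 )];
meanVal = sumStats << Get( 1 );
stdDev  = sumStats << Get( 2 );
upperCI = sumStats << Get( 4 );
lowerCI = sumStats << Get( 5 );

// Test Mean outline: the title and item positions for the p-value can vary; verify with Show Properties
pValue = rpt["Test Mean"][Number Col Box( 2 )] << Get( 2 );
Show( meanVal, stdDev, lowerCI, upperCI, pValue );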
You'll notice again that we already have this hypothesized mean of our target of one. That, of course, is coming from that secondary input window; we're going to go ahead and drop that target of one in there. The other important thing on this summary table, like I said, is this nice custom decision-making: we can put whatever logic we want inside of here. It's kind of silly for this one-sample analysis example, but overall, this is where you, as a company and as a profession, get to apply your own expert opinion about how decisions should be made. You can look at the statistics and say, "No, actually, this is how we would like to make decisions," and put that right up front, so that it's immediately clear to anybody who opens up this report how we analyzed the data and what decision we came to. Let's go ahead and run that code. We're not going to be able to see this portion like we did with the distribution box; we'll only see the summary table when we do the final output report. But you'll notice here's that custom decision-making right here, for deciding whether the mean is equal or not equal: we look at the p-value. Of course, that's kind of silly, but the point stands for more complex reports. We'll run that portion of the script and see... Let's pull up our log. Yep, it looks like no errors are coming out of this log.

Let's flip back over here. The last component: we've talked about native platforms, we've talked about summary tables, and now we need to talk about visualization. The reason we have visualizations is that they're immediate and transparent data quality checks. They're something anybody can look at and immediately draw some value from. The way I have found the most value out of these visualizations and these custom reports is not necessarily for me; it's really for the other reviewers, who are the module experts. The process experts can quickly look at data and say, "Yeah, that looks weird; that's not how that process behaves," or they'll look at it and say, "Yeah, that makes sense; that is how that process behaves." These visualizations give a lot more than just the pure statistics, especially when you're talking to somebody who's not a statistics professional or statistics expert. Again, visualizations are great. They allow for proper checking for data corruption as well as analysis corruption. If you see something weird in your visualization, you should not trust the analysis associated with it. On the slide right now is just an example of how we can turn our data table into a nice refined visualization over here. We've even added that target line the user defined for us. The next slide is a word of caution about how we use these scripts. These visualizations, again, should highlight these data concerns, but the user needs to know how to use them.
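Looping back to the summary table and decision logic just described, a sketch might look like the following; the wording of the decision and the displayed columns are assumptions meant to illustrate the idea, not the presenters' own decision rules:

// Minimal sketch: custom decision logic plus a summary Table Box for the final report
decisionText = If( pValue < alpha,
	"Mean differs from the target of " || Char( target ),
	"No evidence the mean differs from the target of " || Char( target )
);

summaryTable = Table Box(
	String Col Box( "Decision", {decisionText} ),
	Number Col Box( "Target", {target} ),
	Number Col Box( "Mean", {meanVal} ),
	Number Col Box( "Std Dev", {stdDev} ),
	Number Col Box( "Lower 95%", {lowerCI} ),
	Number Col Box( "Upper 95%", {upperCI} ),
	Number Col Box( "p-Value", {pValue} )
);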
I said that this order number is what determines the x-axis on our visualization. If the user enters the data table entry number because they say, "That's the order; that's what it is in the table," they'll get something that looks like this. This is what we've been looking at together so far through this presentation. But again, I told you the more accurate representation is this measurement number; something happened to this data table to get it sorted in a different order. If we plot this visual based off of measurement number, we get something that looks like this. Everybody here should notice this immediately as a red flag that something is wrong. We should never have data trending in this manner. Either there was something wrong with the process, or there was something wrong with how we were measuring the data with our metrology. We shouldn't be trusting the results of this analysis when we see a visual like this; we need to go recollect the data and figure out what went wrong. Again, just a word of caution: if you want to use this, you need to teach your engineers the right way to use it as well.

As for how we create these nice, beautiful visualizations, we like to use the Graph Builder platform. It's a wonderful platform that JMP offers, super intuitive and easy to use. You can make a beautiful display here and say, "Yes, this is exactly how I want to display my data." Then you can use the platform to automatically generate your JSL code by clicking on the little red triangle up here and going to Save Script, To Script Window. You'll get out a set of code that looks something like this. The one word of caution is that, of course, these variables are going to be hard-coded inside of here, so you're just going to have to update that so it interacts nicely with your user input and adapts to whatever your user entered in that GUI.

These are all of the elements that go into the final report, and this is what that final report looks like. Again, pretty straightforward. We just create a new output window. We're actually going to make this a tab box; we only have one tab called Report, but in our more complex reports we'll sometimes have up to 10 or 12 different tabs, all with different information. We have this summary table, which we already created, so let's put it there. We have this nice graphical plot that we'll put over here, and then we have that nice Distribution platform, and we'll put that inside of there. We have the overall takeaways right up top, and then we have all of the supporting evidence underneath. Let's flip over to JMP; I know I'm just slightly over time here, so we'll finish up quickly. Back over to JMP, and we will run that final portion of the code: the Graph Builder piece, where we build that graph and send it over to the report, and then the final report generation.
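For orientation, the Graph Builder call and the report assembly might look roughly like the sketch below; the column names, reference-line arguments, and tab layout are assumptions patterned on what Save Script, To Script Window typically produces, not the presenters' actual report code:

// Minimal sketch: Graph Builder plot with a target reference line, then the final report window
gb = dtCopy << Graph Builder(
	Variables( X( :Measurement Number ), Y( :Thickness ) ),   // hard-coded by Save Script; swap in runIDCol / responseCol
	Elements( Points( X, Y ), Line( X, Y ) ),
	SendToReport(
		Dispatch( {}, "Thickness", ScaleBox,
			{Add Ref Line( 1, "Solid", "Red", "Target", 2 )}    // reference line at the user's target value
		)
	)
);

// Assemble everything into one interactive report: takeaways up top, supporting evidence below
reportWin = New Window( "One-Sample Analysis Report",
	Tab Box(
		"Report",
		V List Box(
			summaryTable,       // bottom-line-up-front summary table
			Report( gb ),       // Graph Builder visualization
			Report( dbox )      // native Distribution platform
		)
	)
);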
We go ahead and hit Run, and there's that platform. The nice thing about these custom analysis scripts, and it's just a nice thing about JMP in general, is that all of these reports are interactive. Even though this was a custom report, this platform is still connected to this platform down here, and these platforms are still connected overall. Remember that we're working with the copy data table now, so nothing gets corrupted, but it's still connected over here. You can select different points inside of here and figure out which measurements they correspond to. That's, overall, what these custom analysis reports look like and how we can make them. Again, it's just a simple case, but let's move forward here into some overall conclusions and insights.

Final takeaways: at Intel, these scripts have really become a critical component of our data-driven decision-making. They make things so efficient, fast, repeatable, and standardized that it's wonderful. Again, these are all sort of the same ideas. The only thing to add is that this also allows you to embed that custom decision-making for your company's specific best known methods, for the specific processes and analyses that you're going to be doing. A quick note, again, is the caveat we mentioned about that graphing: there are some teaching resources that we need to invest in this. When we proliferate these scripts, we can't just give them to the engineers and say, "Go do some analysis." We need to tell them, "This is how we intended analysis to be done with these specific scripts." In that same vein, this custom decision-making infrastructure is going to require maintenance. There are going to be bugs, and there are going to be corner cases that you didn't know about. Prince and I have run into plenty of these cases, where an engineer comes to us and says, "This isn't working," and we say, "That's weird, let's look at this," and we have to spend some time debugging, especially when your company wants to step up to a newer version of JMP. Here at Intel, we just stepped to the newest version of JMP, 16 or 17, one of those, and we had to go back through all 150 of those scripts and make sure they were still compatible with the new version. Again, there's a lot of infrastructure maintenance that you should be aware of; it's going to come into play, especially when you really start to proliferate this and make it a large repository.

We should also be treating this as a living infrastructure, though. It changes, and that's a good thing. That's why we have the power, as the custom analysis script owners, to change things inside of there; we can do it immediately and quickly and be really agile about it. Users might be hesitant initially, but they're going to learn to love this, they're going to really adopt it, and they're going to start to do some strange things with these scripts.
They'll say, "Hey, I love this analysis. What if I did this?" They're going to start using the scripts in new, nonstandard ways. You shouldn't get mad at them; these are actually opportunities. If an engineer is using the script in a nonstandard way, that means there's some functionality gap they wish they could fill, something that would make their job easier. We should take that input, revamp our scripts, change the functionality inside of there, roll all of those inputs from the engineers into these custom scripts immediately, and start providing more value to our engineers. Okay, so I'm going to end it here. I know I'm a little bit over time, Kirsten, sorry about that. I'll say thank you here. Here are mine and Prince's emails. Feel free to reach out to us if you have any questions or want to ask anything. Thank you.
Introduced in JMP 17, the Workflow Builder automatically captures and replays data analysis tasks in JMP. From saving time on repetitive tasks to helping organizations set up standard methods for analysis, the Workflow Builder has easily proven to be one of the best new features in recent years. But what happens if we push the Workflow Builder a little harder? Can we extend it to do branching or interactive tasks without too much scripting? As a matter of fact, we can -- and this poster shows how to do it.

Shamelessly stealing the idea from the ever-popular "Pictures from the Gallery" Discovery presentations, this poster highlights a series of workflows, each with their own challenges and how to overcome them. There is minimal scripting (mostly copy/paste) used in conjunction with some of the workflows. You can interact with all of the workflows at the poster kiosk and then take them home with you by downloading the file below.

Download the Workflow Builder Poster Materials below, unzip everything to a folder, then open the Workflow Builder Journal file. This file has everything you need to walk through all the exercises on your own, including instructions and the final product if you get stuck.

Okay, well, thank you for taking the time to look at this video. For Discovery this year, I've prepared a series of workflows that go a little bit beyond what the Workflow Builder can normally do and require just a little bit of modification. In fact, I've arranged them in an order that should be fairly simple to understand: it goes from the really simple workflows at the top to the more complicated workflows at the bottom. From left to right, the order goes from not needing any additional work or scripting at all, to maybe a tiny bit of scripting, and then a little more, we'll call it some scripting. Most of the scripting in here, though, is copy-paste. The assumption here is that people who are using this journal and interacting with it don't have a lot of experience scripting, but they have come to a place in the Workflow Builder where they want it to do something a little bit different or more, and they're just a little bit stuck. The notion is to have an interactive set of exercises that you can practice with and build a skill set.

To start, though, the first thing you would need to do is push this button up here called Prepare files on desktop. If I push that, it is going to put four different files on the desktop. Those are just files that we will need as you're going through these different activities. I want to walk you through the introduction really quickly as well. Essentially, like I said, if something goes wrong, it's most likely that you need to push that Prepare files on desktop button, so just be aware that it's there. The workflows: we talked a little bit about the way they work in terms of their complexity. My best guess is that most people will get the most value from this section of the workflows, because this workflow here is what you're already doing with the Workflow Builder, and this one is rather complicated. Again, this middle section here is where you're probably going to have the most fun and do the most learning.
But each of the Workflow Builders has three buttons included with it. The first one is this Workflow Builder as Recorded button, and it's basically just the Workflow Builder as it started, as I was able to record my steps doing the workflow, and then at some point the Workflow Builder stops. That's the beginning. Then there is an Instructions button in the middle, and that pops up a new window with an overview of basically how that workflow was created, an understanding of what needs to change from a philosophical standpoint, and then step-by-step instructions on how to make that change. Finally, for each of these, there is a Modified Workflow Builder button, and that modified button is the check-your-answers button: it pulls up the completed version. That can be useful if you get stuck. It could also be useful if at any time you decide you just want to skip to the end and see how it was done.

That being said, I want to walk you through one of these. I'm actually going to walk you through this one here, which is the Workflow Builder question that I, at least, get asked most often, and that is, "Hey, I've recorded this Workflow Builder to open a file and build a little dashboard. How do I make the Workflow Builder ask me which file to open, to make that workflow more dynamic?" I'm going to pull up the instructions here, and we'll put these on the left. I'm just going to basically walk through the whole process. I'll follow along, at least to building that first workflow, then I'll show you what we're talking about for the desired changes, and then I'll modify it and it will be done. To start with, I'm going to open a new Workflow Builder and hit Record, and then I'm going to open this Process Measurements file. The first thing I notice is that this file is not in the right format for me. It's in a tall format, and I'd rather have a different column for each of these processes. I'm going to go ahead and split the table here, and I will split by label. I'm going to split the data, and I'm going to group it by this sample number. Now I've got the processes across the top; that's what I wanted to see, ordered by my sample. I can hit OK, and with my new data set here, I can go ahead and make that dashboard. Fairly simple: I just want a way to scroll through each of my processes. I'm going to do that with a column switcher. I select the process that I want to switch out, and I want to switch it out with processes one through seven. Now I have the ability to do that. I'm going to close this window so that the Workflow Builder gets a snapshot of it. I see that there's an extra little step in here of launching the Graph Builder, so I'm going to delete that. But now I have this workflow; I can rewind, I can push Play, and that gets us to the point where it's building this little dashboard for us. But what I really want is this desired change to the workflow: instead of opening this file every time, I would like to have the Workflow Builder ask me which file to open.
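For readers curious what the recorded split step amounts to in JSL, it is roughly the following; the column and table names are assumptions based on the description above, and the Workflow Builder records this for you, so there is no need to write it by hand:

// Minimal sketch of the table split described above (names assumed)
dt = Data Table( "Process Measurements" );
dtSplit = dt << Split(
	Split By( :Label ),           // each label becomes its own column (Process 1 ... Process 7)
	Split( :Data ),               // the measured values to spread across those columns
	Group( :Sample Number ),      // one row per sample, ordered by sample
	Output Table( "Process Measurements Split" )
);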
As you're going through these, this middle section is just an introduction to what sorts of things need to happen. In this case, it's walking you through it: you're going to need to modify the step settings, and you're going to need to modify the script a little bit. That's a really simple thing to do here. Now we can go down to the actual step-by-step instructions. There really is one step to make this change, and it is to change this location to the words Pick File. Now when I run this workflow, I get a dialog saying, "Hey, show me which file. Where is this file?" I can select it, and it will do the rest. This is really cool; all it took was one little change.

There's also sometimes a little bonus item. I think it would be really nice for this workflow if, in addition to just asking for the file, it were to tell me what file it's looking for. I'm going to add the action of Show a message, and I'm going to type the title, Select File, and the message, "Point me to the Process Measurements file." A secret here is that I'm going to make it Modal, which means I have to close it before things continue to progress. Now when I run this workflow, not only will it ask me to point JMP to where that file is, it's also going to tell me what it's looking for. I hit Play here and it says, "Hey, point me to the Process Measurements file." Okay, I can do that. I know it's on my desktop, or I just put it in my downloads, wherever, and bang, just like that, I've got a tool that I can share with other people, and I can claim that I built it all by myself with no help at all.

Back to the beginning here, I just want to reiterate, or go over this really quickly again: there are nine different activities with different things you can do and different ways you can learn to push the Workflow Builder. I'm still seeing new ways to use the Workflow Builder, and I don't think we're even close to knowing all the fun things that people will be doing with it. Thanks for watching and good luck.
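As an illustration only (an assumption about equivalent behavior, not the Workflow Builder's internal code), the Show a message action with the Modal option behaves much like a modal JSL message window, which must be dismissed before the next step runs:

// Sketch of the equivalent modal message in JSL
New Window( "Select File",
	<< Modal,
	Text Box( "Point me to the Process Measurements file." ),
	Button Box( "OK" )
);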
Troubleshooting faults in industrial equipment can be a daunting task, particularly when compounded by the time constraints of a high-volume production line or plant. Analysis is further complicated in equipment with multiple operating modes, where issues may be visible in one mode but masked in another due to differences in operating parameters. The data generated by SCADA (Supervisory Control and Data Acquisition) or networked IoT equipment and sensors in a modern plant can provide a rich source of information for both fault troubleshooting and monitoring change assessment. However, the sheer volume of data for even a single piece of equipment requires complex analysis to be made intelligible for the engineer or technician.

This presentation elucidates a selection of examples using JMP for exploration and visualization of abatement equipment operating parameters for fault identification and engineering change assessment. Basic data visualization was used to identify physical phenomena happening within the abatement system to assess the root cause for fault troubleshooting. More advanced analyses, including clustering and variability charts, were used to identify issues related to specific operating modes and to assess the impact of engineering changes.

Good morning, everybody. My name is Chris Jackson. I am an Applications Engineer for Centrotherm. We design and build point-of-use gas abatement systems for use in the semiconductor and other industries. Today, I have the opportunity to give a short presentation on how we found a space for JMP in our part of the industry and how it helps us both in troubleshooting for industrial applications as well as in the assessment and justification of continuous improvement initiatives, engineering changes, things like that. A little bit of background, just to get everyone on the same page: I want to say a couple of words about what point-of-use abatement systems are. I've got a little cutaway of one of our tools here on the side. The short version is this: you've got a manufacturing tool up on the factory floor doing whatever it's doing in the semiconductor manufacturing process, and it produces harmful gasses as a byproduct: greenhouse gasses, toxic gasses, flammable gasses, generally things you don't want going into the atmosphere. Then our tools take those waste gasses in, destroy them through thermal energy, wash them out, and release clean air to the factory exhaust. Because these tools are safety- and environment-critical, a fault in one of them means that your production line is at least in part shut down. If you can't treat your byproducts, then you can't run. In a high-volume manufacturing environment, as so many semiconductor FABs are, even small delays are incredibly costly. We as suppliers and servicers have to have a means to quickly identify problems and bring the tools back online. Historically, troubleshooting usually means opening the tool and looking visually to identify failing components, often after some period of root cause analysis. But with a modern FAB environment and the data generated by SCADA or IoT systems, we have mountains of data available to investigate faults before we ever touch the equipment.
That gives us a way to guide troubleshooting in the field, and in some cases, for intermittent faults, it even lets the factory keep running while we investigate digitally rather than physically, minimizing the time lost to troubleshooting and investigation. The problem with this mountain of data is a scale issue. The higher the resolution of your data, the better look you can get at what's happening instantaneously in any of these pieces of equipment. That higher resolution, however, comes with an overhead: you need more and more computing resources to effectively analyze it. That's where JMP comes in for us. With the capacity to handle very large data sets, it becomes a tool for visualization and exploration that can really drastically improve troubleshooting. It lets an engineer or a technician quickly explore and visualize important parameters within your data sets, and these data sets are sometimes at a scale that is just unmanageable for a lot of other visualization tools.

With that, I want to jump right into the first example case we have here, where we identify an intermittent single-component failure just through data visualization. No statistics, no modeling, just the ability to sift through and visualize the data. Here we've got a chart showing ionization current versus time. Ionization current is one of a number of parameters that we use as a health monitor for the equipment. This tool was having issues in which it would run for a couple of days and then, seemingly randomly, fail and shut down. For context, this current should be a flat horizontal line at 25.5, so it's pretty clear from the outset that we have a problem. It's also pretty clear what I was talking about regarding data set size. This data set right here is almost six and a half million rows, and when you pull in all of the tool parameters, 500 columns. The file for this data set is about 20 gigabytes in size: absolutely massive amounts of data. Before we even do any statistical analysis, like I said, we can start to do some problem-solving off of this data set just with visualization. Initially, it doesn't really look like there's any clear shape to this data; we know something's wrong, but we don't know what. But when we zoom in, all of a sudden we start to see some structure. This looks pretty periodic to me. We zoom in a little bit more, and we see that it is in fact very periodic. Each one of these little spikes down, disregarding magnitude, is timed almost exactly five minutes from the one before. That immediately begs the question: do we have some component, a valve, a flow controller, a motor, something that actuates every five minutes? We identify that component, and now we have a really likely troubleshooting culprit. The troubleshooting plan changes from "open the tool and investigate," which could take a couple of hours, to "open the tool and change this one targeted component."
We just shrunk the actual time that we need to be in the equipment from a couple of hours, looking at everything to see what might be failing, to a single hour: get in there, change this part, get back out. In this particular case, that was the failing component; we were able to identify it. Problem identified, plan made, without ever having to open the equipment. We got there with just the conclusions we could draw from visualization.

Of course, JMP is not just a tool for visualization. It also has at its core a very robust suite of statistical analysis platforms, and if we start to apply those to the data, we can get even more exciting and interesting results. I'll jump right into the second case here. In this case, we're looking at a specific tool which is working fine most of the time, but it does have occasional problems with buildup, and sometimes we have to pull our PM (preventive maintenance) in a little earlier than we would like. We want to take a look at our health parameters and see if there are any abnormalities, any optimizations we can make. The approach that I use here is applicable for really any industrial application that has defined operating modes, because we can draw those modes out of the data very easily using clustering. In this case, this specific abatement has three pretty well-defined operating modes based off of these two input gasses. I use K Means clustering; you could use whichever version of clustering you prefer. I run that over the data to sort all of our rows, all of our points, into these three operating modes. If you have more than three operating modes, obviously you can use more clusters. But it also gets interesting: what if you don't know how many modes you have? Maybe they're customer-defined, or maybe there's a suspicion that, "Hey, could there be some interstitial mode here?", maybe a transition state between two of these operating modes. If you want to investigate that way, you can use iterative clustering. I did that down here: you run over a range of cluster counts, I used 3-10 clusters, and the software identifies the optimal number of clusters. Looking at this, it has correctly identified it. It gives us these cubic clustering coefficients and identifies the optimal one: yes, as suspected, three is the optimal number of clusters to sort this data into. I'm not really worried about these state transitions; I'm more focused on the states themselves. We take that data, get a readout of it, and throw it up onto this 3D scatterplot. We take some of our tool health parameters, and we color everything by what cluster they're in. Immediately, we start to see some interesting results. We talked about how ionization current should be solid at 25.5, and we see that we have some variability here; it's dropping below that. Immediately we know that we have a problem. But what's more interesting is that every single one of those points is grouped into a single cluster, cluster two, which corresponds to the lowest input gas one and the highest input gas two.
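As a rough JSL illustration of that last step (the column names are assumptions, and it presumes the cluster assignments have already been saved to the table from the K Means platform):

// Minimal sketch: color rows by their saved cluster, then plot three health parameters in 3D
dt = Current Data Table();
dt << Color by Column( :Cluster );    // assumes a Cluster column saved from the clustering platform
sp = dt << Scatterplot 3D(
	Y( :Ionization Current, :Health Parameter 2, :Health Parameter 3 )
);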
Now, from an engineering perspective, if I'm looking to make optimizations or improve tool health, I can immediately say, "Hey, this is the operating mode that we need to look at." That's what I need in order to start looking at concrete next steps for improvement. I'm not looking at the tool as a whole; I've already managed to focus my search to one operating mode.

The last thing I want to talk about, having looked at two of these use cases, is: what are the other advantages of JMP? Why JMP? My customers are collecting all this data. They have ways to view it; they have SCADA systems and monitoring systems in place; they have ways to parse it. So why do I, as a supplier and servicer, need this platform to view and parse the data? The answer for me, at least in my case, is the cross-platform compatibility. If I'm reliant on my customer to chart and generate data views for me, I'm now taking up their time and their resources to troubleshoot a problem that I'm responsible for fixing. With JMP, as long as they can give me the raw data, I can do all of it myself. Not only does that free up their resources, it gives me the ability to do my own investigation independent of whatever system they're using for data analysis. It doesn't matter if they're using proprietary monitoring system A or B or C, or their own IoT monitoring system from their control engineers. It doesn't even matter if they have multiple data acquisition systems from different vendors. With JMP, I can import and manipulate whatever data they give me and perform these kinds of analyses, source-independent, and do the investigation that I need to do for my customer support, with all the tools for visualization and statistical analysis that JMP provides.

With that, it looks like we're pretty much at time here. I know this isn't necessarily the traditional use case for JMP, from some of the folks that I've talked to, but I hope it was helpful for people. I'd just like to thank Adam Stover, our CTO, and Gordon Tendik, our Director of Apps and Technology, for helping me put all this together and reviewing the work that I've done. Thank you for your time.
Part distributions are easy to measure: parts are built, an operator measures the parts with a gauge, and the results are assembled into a measured part distribution (MPD).

But the resulting distribution is contaminated by errors associated with the measurement system. Random errors, gauge bias, and linearity problems all contribute to inaccuracies in measuring the true part values, so the individual values can never be truly known.

However, if we had a way to estimate the true part distribution (TPD), we could compare it to the MPD and calculate the impact (cost) associated with using the imperfect gauge in terms of Type 1 and Type 2 errors. It is trivial to estimate the TPD from an MPD if the gauge creates simple normally distributed errors around a normally distributed TPD (i.e., simply subtract variance of gauge from MPD variance to get TPD variance). But what if the gauge has linearity problems? Or what if the TPD has a non-normal shape?

This paper describes a new JSL script for determining an arbitrary (i.e., non-parametric) TPD from an arbitrary MPD and associated gauge performance characteristics. The resulting TPD can then be fed to a second script to determine production costs associated with the imperfect gauge and setting guardbands to optimize economics of the gauge errors. Performance of the estimation routine is evaluated, in terms of shape of TPD, various gauge characteristics, and resolution of distributions.

Hi, I'm Jerry Fish. I work for JMP, and I support our customers in the Central Region of the United States. Today, I'd like to talk to you about an add-in that I've developed. The title of the paper is Retrieving Arbitrary True Part Distributions from Measured Part Distributions and the Gauge Characteristics that go along with the measurement. Today's agenda: first, we're going to talk about, of course, what does this talk address? Why is this so important? Why can't we just subtract variances to get our True Part Distribution? Then, a little bit about what's behind our estimation computations. I'll demo the add-in, including some test results, and add some troubleshooting tips. I'll tell you where you can find the add-in, and then we'll share with you how you can give me feedback on what's good and what you don't like about the add-in, areas for improvement, and so forth.

What are we addressing here? Well, we're talking about an add-in that determines a True Part Distribution if you give it a Measured Part Distribution and if you describe your gauge performance characteristics. It's pretty easy to conceptualize if we start with a True Part Distribution: here's our true part value versus our percentage of parts. Then we run that through a gauge, an imperfect gauge that has some variance and bias characteristics, and we get a Measured Part Distribution out of that. We don't know, though, what our True Part Distribution is. What we're talking about is swapping those positions: we start with a Measured Part Distribution, we subtract out our gauge performance characteristics, and we end up with a True Part Distribution.
That's pretty simple to understand, but it gets more complicated if we have a Measured Part Distribution that is not normally distributed and/or we have a gauge that performs in non-standard ways, you might say. Perhaps our standard deviation shows curvature with the measured part value, or maybe it has bias that changes linearly, or maybe the bias has curvature as well. How can we take these quantities, the Measured Part Distribution and an arbitrary gauge performance curve, and come up with the True Part Distribution that must have caused this Measured Part Distribution?

Why is it important? Well, we all know that all gauges are imperfect. We'd like to get an idea of this True Part Distribution (you'll see it referred to, as we go along, as TPD), because with it we can understand our Type 1 and Type 2 errors. A Type 1 error means our gauge is throwing away good parts. A Type 2 error means our gauge is accepting bad parts. Both of these, particularly in a manufacturing environment, are bad things to happen. If we're throwing away good parts, that's waste, and we don't want waste in our process; that's just a straight bottom-line deduction from our profit statement. We also don't want to accept bad parts. If we do, we ship the bad parts out to a customer, and we're likely going to get complaints, returns, and reworks, and it's going to damage our company's reputation. We don't want either of those types of errors; they both hurt our company. If we knew the True Part Distribution, we could estimate the costs associated with these errors. That particular subject is addressed in another paper being presented here at Discovery 2023, with this title and this paper number; I encourage you to look it up. I co-authored that paper with two of my colleagues, Brady Brady and Jason Wiggins. We need that True Part Distribution to make this assessment.

Why can't we just subtract the variances? Well, you can. If your Measured Part Distribution is normally distributed and your gauge has constant variance and bias across the measurement range, then you can get to your True Part Distribution. You don't know the true part values of individual parts, and you never will, but you can get to the distribution. Under these constraints, under these assumptions, the variance of the True Part Distribution is simply the difference between the variance of your Measured Part Distribution and your gauge variance. You subtract those two, take the square root, and you get the standard deviation of your True Part Distribution. The average, where your True Part Distribution is centered, is simply wherever your Measured Part Distribution is centered minus the bias of the gauge. Of course, the question is, what do you do if your Measured Part Distribution is not normal, or if your gauge has unusual characteristics? This is how we can conceptualize inputting these values, and I'll show you the add-in in just a second. We can have any arbitrary input.
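In symbols, the simple normal, constant-gauge case described above reduces to the following relations (this is just the spoken description restated; the subscripts are mine):

$$\sigma_{TPD}^2 = \sigma_{MPD}^2 - \sigma_{gauge}^2 \quad\Rightarrow\quad \sigma_{TPD} = \sqrt{\sigma_{MPD}^2 - \sigma_{gauge}^2}, \qquad \mu_{TPD} = \mu_{MPD} - b_{gauge}$$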
Here, for our Measured Part Distribution, this is our measured part value versus the counts for however many parts we have measured. Maybe this looks like a combination of two normal distributions, maybe it's something a little different than that; the point is, you can put in any input shape for the Measured Part Distribution. Then we describe the gauge using quadratic functions for the sigma and for the bias of the gauge. Normally, for a standard gauge, all you're going to have is just these constants out here in front, C0 and D0, and D0 may be zero if you don't have any gauge bias. If your sigma changes linearly with part value, we allow you to put in C1, and D1 if your bias changes linearly. If there's any curvature, we allow you to put in a C2 and a D2. When you set up your gauge equation, if you put in all of these values, it's possible to generate negative standard deviations within the measurement range. Don't do that; if you can avoid it, don't do that. There may be unexpected results from the add-in if you have negative standard deviations, so just beware of that.

What's behind the estimation computations? Well, we start with, of course, the actual Measured Part Distribution and the gauge characteristics. We choose an estimated True Part Distribution; a starting point that seems like a good idea is the actual Measured Part Distribution itself. Then we put that estimated True Part Distribution through a transformation that represents the gauge characteristics, and that yields an estimated Measured Part Distribution. We can then compare the estimated Measured Part Distribution with the actual Measured Part Distribution on a bin-by-bin basis and get a Residual Sum of Squares error for that comparison. Then we go back and adjust the amplitudes of the True Part Distribution estimate, and we keep adjusting until the estimated Measured Part Distribution agrees as closely as possible with the actual Measured Part Distribution. We do that using the JSL Minimize function to minimize the Residual Sum of Squares.

All right, let's take a look at the add-in. Once you install the add-in, it comes in under Gauge Study Tools and TPD Estimation. This is what the add-in currently looks like, version 1.0. We start off with the ability to choose what type of input Measured Part Distribution you have. I'll come back to the arbitrary shape in a minute. We also have normal, where you input the average and standard deviation, LogNormal, Weibull, Exponential, Gamma, and a two-mixture normal distribution. We can set these up to be parametric if we want: if you know that you have a Weibull distribution, for example, you can use that as your input distribution. Let's start with normal; let's just make it simple. Here we have a normal distribution with a mean of zero and a standard deviation of three; that's shown in this panel here, in this little graph. Let's use a very simple gauge that has a standard deviation of one and a bias of zero. Click Next.
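Written out, the gauge description and the fitting criterion described above amount to something like the following (my notation: x is the part value, the a_i are the estimated TPD bin amplitudes, and the hat denotes the estimated MPD obtained by pushing the TPD estimate through the gauge model):

$$\sigma_g(x) = C_0 + C_1 x + C_2 x^2, \qquad b(x) = D_0 + D_1 x + D_2 x^2$$

$$\min_{\{a_i\}} \; \sum_{j} \left( \widehat{MPD}_j(\{a_i\}) - MPD_j \right)^2$$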
Here, much like above, we get to choose the True Part Distribution shape. We could say it's going to be arbitrary, or normal, or lognormal, all the same distributions as above. Or, down at the bottom, we give you the option to fit all of the distributions above. Let's again start with a simple example of a normal distribution and calculate those results.

We present two output plots. This plot shows the estimated True Part Distribution in blue versus the actual Measured Part Distribution in red. As you would expect in this case, with a normal measured distribution and a simple gauge, this is one of those we could solve by hand if we wanted to, and we end up with a slightly narrower True Part Distribution than the measured one. As a check, we can take that True Part Distribution, put it through the gauge, and we end up with an estimated measured distribution versus the actual one. That's what we get down here, and it looks like we have very good agreement in this particular case.

Let's go back up to our gauge definition. We'll keep the same Measured Part Distribution, but this time we'll put in a bias of two. We'll solve again, assuming that our True Part Distribution is normally distributed, and we get this out. It's pretty easy to conceptualize: everything is just shifted over by two units. Here's our True Part Distribution and our Measured Part Distribution. If we put that through the gauge with the bias, we again get very good agreement between the actual Measured Part Distribution and the estimated one.

Third example: let's come back up here and turn the bias off, since we've demonstrated that. Same input, same simple gauge, but now let's say we fit a Gamma distribution. We select Gamma and hit Calculate. This is the best-fit Gamma distribution for that input normal distribution. You can see it doesn't fit quite as well; our True Part Distribution is a little bit skewed, which is characteristic of Gammas. If you put that through our gauge, we end up with this agreement between the actual and estimated Measured Part Distributions. It's not as good a fit.

A summary is given in this table. It shows that the first time we ran this, we used a normal distribution input with two parameters, we did a normal fit on the output with two parameters, and we got this sum of squares error. The second time was with a bias, and we got the same sum of squares error in the end, as you might expect. Then with the Gamma, our sum of squares error was a little bit higher. We get a quick summary in this table, and there are two other JMP data tables that are built that hold all of this information: the original distribution, the output distributions, and all the gauge characteristics.
All of those are summarized in these other two tables, so you can go through and make your own plots if you want to.

Let's do one that's a little bit different. Let's come back up to the top and choose a user-defined shape. The data table is simply a two-column table. The first column is assumed to be the centers of your part values, your bin centers in that histogram, and the second column represents the amplitudes as you go across. Those amplitudes can be actual part counts or percentages, anything where each bin height is relative to the other heights. I scale everything so the sum of those amplitudes adds up to one within the program anyway, so as long as the relative heights are the same, it doesn't matter what the actual amplitudes are. I give the option to open a data table if it's not already open, or if it's already open within JMP, you can just select the already-opened data table.

Here's an example with a square wave, a uniform distribution, as our input Measured Part Distribution. This is a tough distribution to have. If you think about it, when you've got a gauge that makes normally distributed errors at every point, it's going to be really hard to produce something that's nice and sharp and crisp like this distribution on the output. Let's give it a try. Here I've got a pretty wide variation; this goes from zero to 30, I think. Let's say we've got a gauge with a standard deviation of five and no bias, and let's fit a normal distribution to that and calculate the results. Here we've got the best-fit normal True Part Distribution that, run through this gauge with a standard deviation of five, tries to give us this square wave for our Measured Part Distribution. How well did we do? It's not a great shape, and you probably wouldn't expect a great fit given that we're trying to use a normal distribution to fit a sharp square function.

If we want to do an arbitrary function, let's say this one here. For this one I just made up some data, and I'll show you a little more about what it is. Maybe it looks like two normal distributions mixed together, so let's see if we can fit it to a two-mixture normal; that option is down here, and we'll calculate those results. Here we go. Let me run that one more time; I don't want my standard deviation to be that big, so let's take a smaller standard deviation (we'll talk about that in a minute), keep everything else the same, and calculate the results. Sometimes it takes a few seconds to come back; it just depends on how the routine is fitting things. Here is our fitted True Part Distribution compared to the Measured Part Distribution, assuming that our True Part Distribution is two normal distributions mixed together. If you run that through the gauge, it ends up looking like this.
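For reference, the user-defined input just described is nothing more than bin centers and relative amplitudes that get rescaled to sum to one. A tiny illustrative sketch, with values invented to mimic the square-wave example:

```python
import numpy as np

# Two columns: bin centers and amplitudes. Amplitudes can be raw counts or
# percentages; only the relative heights matter, since everything is
# rescaled to sum to one.
bin_centers = np.arange(0, 31, 1.0)                                        # part values 0 to 30
amplitudes = np.where((bin_centers >= 5) & (bin_centers <= 25), 40.0, 0.0)  # square wave

measured = amplitudes / amplitudes.sum()   # normalized bin heights
```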
This is the attempt to match that, and what our Measured Part Distribution would have been. That's not too bad. That's the way the add-in works.

Now, there is another option: you can choose whatever inputs you want for your Measured Part Distribution and your gauge characteristics, and then when you fit down here, you can also fit an arbitrary shape. On my PC that takes maybe a minute to run, so I'm going to spare you that and just show you the outputs in a PowerPoint slide.

Here we are back in PowerPoint. This is one other example I have before I get to the arbitrary inputs. This one has a bias that I've expressed as one plus 0.03 times the part value, so the bias changes linearly across the measurement range. I have a normal distribution for my input, and I want to fit a normal distribution to the output. As it turns out, this is my True Part Distribution and this is my Measured Part Distribution. If I run those through this gauge, even with the linearly changing bias across the measurement range, I get very good agreement between the two.

This is what happens if I take that square wave and say, "Hey JMP, go fit whatever True Part Distribution you want, run it through a gauge that has a sigma of two and a bias of zero, and tell me what that distribution might look like." What you get out, the red curve again, is the Measured Part Distribution, that square wave. You get this crazy-looking thing with all these different peaks and valleys in it as the True Part Distribution. That doesn't look like any True Part Distribution I would expect to have, but if you look down here, when you run it through the gauge, it does a pretty good job of simulating this square wave, this uniform distribution. I believe it's working. Now, there are reasons it might come up with something like this, probably associated with the resolution of the gauge: the gauge may simply not resolve enough elements across the measurement range to give you a nice smooth distribution over here. That's the idea, and you can do this with this gauge.

A little about troubleshooting. There are problems if your gauge standard deviation is too large in comparison to the spread of the Measured Part Distribution. If your Measured Part Distribution is, say, normally distributed with a standard deviation of three, and you tell the add-in that your gauge has a standard deviation of five, there's no way to fit that: even if you measured the same true part over and over again, you would get a spread with a standard deviation of five. You'll get an error when this occurs, and it's up to you to recognize that your gauge standard deviation is way too large. What if it's simply on the verge of being too large?
Let's say the standard deviation of your measured distribution is 3 and your gauge is 2.8. Then again, the add-in is going to try to give you a very narrow True Part Distribution to support that, and that can lead to some strange results.

There are some odd combinations I've run across that can cause what I call untrappable errors. When you go into the JSL Minimize function, it does its own thing and then comes back with an answer. If it runs into a problem, it throws an error, maybe like this one; I've seen two or three different ones, and I don't have a way to trap for them. If you get an error like this, the add-in will continue to run, but you'll need to look into what conditions you've put in that JMP is having trouble solving for.

Chunky Measured Part Distributions. By chunky, I mean there aren't very many bins across the measurement range; this one here might be pretty chunky. It's related to your gauge in the end, to how much your gauge can resolve. You want to have a lot of bins across here; more bins is better. Fewer bins makes the True Part Distribution very difficult to estimate.

Then, as I alluded to earlier, you can have long convergence times, particularly when you're trying to solve for these arbitrary True Part Distributions. On my PC it's not uncommon to go a minute or a little more. Just hang in there; the add-in has always come back for me. It doesn't hang, it just takes a while for some solutions.

The add-in should be attached to this recording, and you should also be able to find it in the JMP Community File Exchange under TPD Estimation. If you have comments or questions, you can post them either below this video or on the File Exchange. Please include any suggestions you have for an improved graphical user interface or for changes to the outputs. We didn't spend much time on the data tables the add-in builds, but if you see those and decide you wish they were in a different format, let me know; those are things I can change fairly easily. Also send suggestions for more parametric Measured or True Part Distributions; right now these include the normal, lognormal, Weibull, and Gamma functions, and if you have more you want added, let me know and I'll see if I can incorporate them. And of course, report any problems you encounter. A description of the problem and how it occurred will help me in debugging, and if possible, include a non-confidential sample input file I can use to replicate the problem. Wherever you post these comments, please include @JerryFish so I get a notification.

Thank you very much for listening to this recording. Don't forget to check out the accompanying Discovery paper, "News Flash: Gauges aren't perfect (okay, you know that). But how much is it costing your business?", under this particular paper number. Thank you very much for your time.
The most novel, innovative, and promising therapeutics in biopharmaceuticals are cell therapies. Cell therapies transfer human cells into a patient to treat disease. These cells either come directly from the patient or from a healthy (cell) donor. Multiple regulatory guidance documents recognize the importance of providing cell therapy manufacturers the flexibility to improve their processes. Therefore, it is imperative to show that the pre- and post-change processes are comparable and that process changes pose no threat to the safety or efficacy of the drug product.

One method used to ensure comparability is an equivalence test of means. There is a regulatory expectation that the study is done as a paired design, often referred to as a split-apheresis study, unless there is minimal donor-to-donor variability. In split-apheresis studies, the same donor material is split and processed in the pre- and post-change process for comparison. The design of this study presents unique challenges in cell therapies, as it requires adequate sample sizes to ensure properly powered designs, yet the number of healthy donors available is usually quite low (three to six donors). Additionally, the power depends on lot-to-lot and assay variability, assay replication strategy, and the effect size used for the equivalence acceptance criterion (EAC).

This talk presents a series of JMP scripts that extend the existing capabilities of the Sample Size Explorer platform to address each of these relevant study nuances, as well as the capability to overlay power curves to address trade-offs with different sample sizes and approaches.

I am Heath Rushing. Although Andrew Karl, Jeff Hofer, and Rick Burdick, some teammates of mine, did the majority of the technical work here, I'm going to be the one presenting today. I'm going to talk about how JMP and JMP scripts can be used in a very specific application in cell therapies. I'll talk a little bit about what gene and cell therapies are, and the specific topic I want to cover is comparability; I'm going to focus on process changes. Interestingly enough, last year I gave a talk that focused on cell and gene therapies. They're very novel therapeutics; the first one was approved in the United States in 2017. They're a little different from most of what I call the small-molecule and large-molecule therapeutics you may have heard of in the past.

Let me touch base on what a cell and gene therapy is, starting with gene therapy. What you're essentially doing is replacing a gene with a healthy one, or turning off bad genes. A lot of cancers are caused by defective genes, so you're inserting these healthy genes back into a patient, either in vivo or in vitro; the in vitro route would be more like a bone marrow transplant. Last year I talked about the challenge with gene therapies being that patient-to-patient variability, and I focused on process development. Then I talked about cell therapies. In cell therapies, you're replacing diseased cells: you're either transferring some sort of healthy cell into a patient, or you're replacing missing cells in a patient. Where do these cells come from?
They either come from the patients themselves, in which case you have to deal with that patient-to-patient variability, or, in most cases, they come from a healthy donor. Now you're not dealing with patient-to-patient variability, but you are dealing with donor-to-donor variability. Whenever I say donor, I mean a healthy donor; I could be a healthy donor, and someone else could be a healthy donor also. In both cases you have to deal with that patient-to-patient or donor-to-donor variability.

What's interesting is that last year I gave an example in process development, and it looked something like this. It was the exact same data set I used last year, where I said: say you are developing a process where you look at time, temperature, and pH, and you're measuring their effect on cell viability and byproduct. In that case, I could not use one donor's material; I had to split it up across four different donors. I said, "If you ran these experiments for process development and did not consider that there was donor-to-donor variability, this is what you would see." Looking for p-values below 0.05, you would say nothing affects cell viability and nothing affects byproduct. You were not able to detect any significant or critical process parameters, for the very reason that you did not consider that there could be a difference between donors. Now, if you do consider donor as what's called a fixed donor effect, the only thing I did was bring donor into the model, and then it really sticks out what significantly affects cell viability and what significantly affects byproduct. The whole talk was on how that donor-to-donor variability affects statistical inference and also process capability; here I'm going to focus on the statistical inference.

What you're trying to do in process development is determine whether things like pH, temperature, and time significantly affect your critical quality attributes. Say that I'm a drug manufacturer and I've set up and run a process development study, and I want to determine whether temperature affects, let's call it, cell viability. Whenever I'm looking at that, I want to make sure that if something significantly affects my quality attributes, I control it in my process; but if it doesn't, I'm not spending money, time, and resources controlling it. What I'm concerned with as a drug manufacturer is the Type I error rate; I do not want to inflate the Type I error rate. A Type I error would say, "Hey, this is significant," when, in fact, it's not. What do you think regulatory agencies would be more concerned with: you controlling more things, or you not controlling things that should be controlled? That's exactly right: they'd be more concerned about that patient risk, that Type II error. So in process development, drug manufacturers do not want to inflate the Type I error, but they also want significant power. Why? Because that controls the patient risk.
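To make the fixed-donor-effect point concrete, here is a generic Python sketch with statsmodels. The data, column names, and effect sizes are invented for illustration; this is not the speaker's data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented illustration: four donors, a small study in time/temperature/pH,
# and a viability response driven by temperature plus a donor offset.
rng = np.random.default_rng(0)
n = 32
df = pd.DataFrame({
    "donor": rng.choice(["D1", "D2", "D3", "D4"], n),
    "time": rng.choice([24, 48], n),
    "temperature": rng.choice([30, 37], n),
    "pH": rng.choice([6.8, 7.4], n),
})
donor_offset = df["donor"].map({"D1": 0, "D2": 8, "D3": -7, "D4": 12})
df["viability"] = (70 + 0.4 * (df["temperature"] - 30)
                   + donor_offset + rng.normal(0, 1, n))

# Ignoring donor: donor-to-donor variability sits in the error term and
# can swamp the real temperature effect.
print(smf.ols("viability ~ time + temperature + pH", data=df).fit().pvalues)

# Donor as a fixed effect: that variability is removed from the residual,
# so a real temperature effect should stand out.
print(smf.ols("viability ~ time + temperature + pH + C(donor)",
              data=df).fit().pvalues)
```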
The whole point of showing that last year was to show the effect of donor-to-donor variability on determining your critical process parameters, what I call statistical inference. Now, what happens if I change my process? I was working with a colleague just last week, and when we were talking about cell and gene therapies, she said, and this is her quote, "Heath, in cell and gene therapies, things are constantly changing. You could have things like analytical methods change. You could have things like process change." Today I'm going to focus on that process change.

I do want to point out that regulatory agencies understand you have a need to improve your process. But even if you are improving your process, you are still changing your process. They recognize the need for that, but they also recognize that the therapeutics you make from that process should be similar in terms of product quality, because you're using them in clinical trials. What does it mean to be similar? It doesn't mean they have to be exactly the same, but they have to be similar, or comparable. To say that something is similar, I want some similarity condition; that's the whole point of comparability. For very low-risk attributes, I can show that process A and process B are similar with side-by-side plots. For higher-risk attributes, I might want something like a quality range: I take the reference group, the old process, build some range around it, and ensure that all of the measured quality attributes from the new process fall within that range. For very high-risk attributes, I want to do equivalence testing. That is what I'm going to focus on today: what equivalence testing is, and how that acceptable difference, that similarity condition, is set. It's called equivalence testing, or the Two One-Sided t-Tests.

To reiterate what we talked about before: whenever I'm using design of experiments in process development, I'm changing some variable, like temperature, from low to high, and I'm measuring the effect on my critical quality attributes. I am assuming in the null hypothesis that they are the same, and I set up a design to see if they're different. A Type I error in that case would be me saying, "Wow, they're different," when, in fact, they're not. That would mean I would spend resources controlling that in the process. If I'm a drug manufacturer, I do not want to control things I don't need to, so I'm concerned about that Type I error rate. If I were a regulatory agency, I would be even more concerned with the Type II error: no difference is detected when, in fact, there is one, so something should be controlled and isn't. Now we're going to flip it. We're going to talk about equivalence testing.
In equivalence testing, I'm not saying that they are the same. I am assuming that there is a difference; I just want to make sure the difference isn't too big. That "too big" I'm going to call delta, or d; it's often called the equivalence acceptance criterion. There are a lot of different ways to calculate it. I would like it to come from subject matter expertise, but the majority of the time it comes from taking some k-value times a historical standard deviation. The test is split into two one-sided tests: in one, I'm determining whether the difference is less than positive d, and in the other, I want to show in the alternative hypothesis that the difference is greater than negative d. I'm testing from two different sides, the lower side and the upper side.

If I were a drug manufacturer, what would I want? I would want to be able to reject both of those null hypotheses. I would want high power and a low Type II error. This is equivalent to taking a 90% confidence interval around the difference in means and ensuring that both the low and high ends of that interval fall within the bounds of the lower and upper delta. Looking at this, you should think to yourself, "I want the width of that confidence interval to be very small." What are the different ways I could make the width of the confidence interval for the difference between those two means very small? I could decrease my standard deviation; that's a good thing. I could increase my sample size; that's a good thing. I could also increase my alpha level. That wouldn't be so good, because it inflates your Type I error rate, and in doing so you're claiming equivalence when indeed the processes are not equivalent. So the ways to control the width of that confidence interval are to lower s, increase n, or increase alpha, and we said two of those are good and one is not.

It makes sense that if I'm a drug manufacturer, I want to maximize the power of the design; that's the flip side, minimizing my Type II error. Regulatory agencies want to make sure you do not inflate the Type I error rate, which would mean stating equivalence when indeed the processes are not equivalent. In JMP, you can do these equivalence tests, and I want to show you an example.

From my journal, the first thing I want to show you, in terms of determining your Type I and Type II error rates, is that JMP provides power curves under Sample Size Explorer, Power, Two Sample Independent Equivalence. Caleb King did a great job with this. Let's say that my margin, my equivalence acceptance criterion, is plus or minus 2 standard deviations. I'm just going to put a 2 here, meaning 2 times the standard deviation I'm talking about. That's all I'm doing.
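To make the two one-sided tests concrete, here is an illustrative Python sketch. The data are made up, k is chosen arbitrarily, and a simple pooled-degrees-of-freedom approximation is used; it is not the JMP implementation.

```python
import numpy as np
from scipy import stats

# Two One-Sided Tests (TOST) for equivalence of means, with the acceptance
# criterion set as d = k * historical standard deviation.
pre = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1])
post = np.array([10.3, 10.0, 10.5, 9.9, 10.2])
k = 2.0
d = k * pre.std(ddof=1)                  # equivalence acceptance criterion (EAC)

diff = post.mean() - pre.mean()
se = np.sqrt(pre.var(ddof=1) / len(pre) + post.var(ddof=1) / len(post))
df = len(pre) + len(post) - 2            # simple approximation for the df

# One-sided tests: H0: diff <= -d  and  H0: diff >= +d
t_lower = (diff + d) / se
t_upper = (diff - d) / se
p_lower = 1 - stats.t.cdf(t_lower, df)
p_upper = stats.t.cdf(t_upper, df)

# Equivalent view: the 90% CI on the difference must sit inside (-d, +d).
tcrit = stats.t.ppf(0.95, df)
ci = (diff - tcrit * se, diff + tcrit * se)
equivalent = max(p_lower, p_upper) < 0.05
```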
Let's say that in my historical process I have 10 lots, and I'm going to compare it to a new process that has 5 lots. I want to see what the power is if they are exactly the same, that is, if there's no difference between them. A few things to point out: JMP gives those power calculations, and it also lets you change the inputs. What happens if I increase the number of samples in my new process? My power goes up. What happens if I decrease that margin from 2 standard deviations to, say, 1.5 standard deviations, essentially taking those boundaries and tightening them up? My power goes down. So I'm able to ask all of the typical questions you would ask in equivalence testing.

Something else I want to show you, which will come up again, is that JMP lets you specify whether you know the true standard deviation or not. If I know the true standard deviation, that is better: your power goes up, and indeed it does here. That's usually not the case, though. I call the known-standard-deviation case the utopia, and the case where I do not know the true standard deviation the realism.

I would be remiss if I did not show you the tools JMP does have for demonstrating equivalence, say if I had a historical process with 10 lots and I made 5 new ones. First, I look at this in Graph Builder, and I see that there is no effect between the two; they both look like they came from the same process, the blue versus the red. How about if there is an effect? Then I see a shift. Just like I showed you before, that is the Two One-Sided t-Test, and JMP has tools for that. Jin Feng did a great job with this. My goodness, I love this plot: here's the difference in means, here's the lower bound, here's the upper bound, and it's within the boundaries. In that case, you've rejected both null hypotheses in favor of the alternative, which matches what you see in the picture. What you also see is that if there is an effect, I am not going to reject both nulls; I'm going to fail to reject one of them, and indeed I did, and you'll see the confidence interval falls outside that boundary.

I would like to talk about a very specific case in cell therapy called the split-apheresis design. This is the situation in cell therapies where you're changing the process, and you use donor material split between the two different processes. We kept getting questions over and over again from our customers: "Can I look at the sample size and power calculations for these paired designs? Can I overlay them?"
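The same kind of power curve can be approximated by brute-force simulation. The sketch below is illustrative Python, not the Sample Size Explorer's calculation; it assumes an unknown common standard deviation and expresses both the margin and the true shift in sigma units.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def tost_power(n1=10, n2=5, margin_sd=2.0, shift_sd=0.0, alpha=0.05, nsim=5000):
    """Simulated power of a two-sample independent equivalence (TOST) test."""
    passes = 0
    for _ in range(nsim):
        a = rng.normal(0.0, 1.0, n1)                 # historical process lots
        b = rng.normal(shift_sd, 1.0, n2)            # new process lots
        diff = b.mean() - a.mean()
        sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
        se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
        tcrit = stats.t.ppf(1 - alpha, n1 + n2 - 2)
        lo, hi = diff - tcrit * se, diff + tcrit * se
        passes += (lo > -margin_sd) and (hi < margin_sd)   # 90% CI inside the margin
    return passes / nsim

print(tost_power())                     # power with no true difference
print(tost_power(margin_sd=1.5))        # tighter margin -> lower power
```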
Can you see how they depend on that donor-to-donor variability? Let's talk about the split-apheresis design. First, the regulatory expectation. There is a recent draft guidance document from the FDA, from July of 2023, just last month, which says you need to select a suitable statistical test for analyzing the difference between paired data, where the donors are paired up. That's where the split-apheresis design comes from: for every donor material that you have, you split it between process A and process B. This is not two independent t-tests; it is a paired design. That's the first point. The second is that you are very often in early stages, so you do not have a lot of donor material and you have very low sample sizes, and it's hard to get power out of low sample sizes. The third point is how you come up with your EAC, your similarity condition, that acceptable difference. You use historical data made from multiple donors, take the standard deviation of that historical data (I'll call the number of historical lots n1), and take some k number of those historical standard deviations. You then run the split-apheresis test and judge it against that criterion derived from the historical data.

Here are two examples, both looking at process A and process B with six different donors. In the one on the left, the majority of the variation comes from donor-to-donor variability, not from the difference between process A and process B; you have high donor-to-donor variability, which I'm going to call rho. In the case on the right, the majority of the variation comes from the difference between process A and process B, not the donor-to-donor variability; most of the variation is analytical or process variation, which tells you that you have very low rho, low donor-to-donor variability.

I'm going to show you a series of scripts that we worked on, built around typical questions from our customers. In our cases, we do not know what the standard deviation is; how does that compare to the known case? What about those Type I and Type II error rates? Remember, if I'm a drug manufacturer, I want to increase the power; if I'm a regulatory agency, I want to make sure you do not inflate that Type I error rate. How are we going to evaluate this? This is from the European Medicines Agency, 2001: the best way is with what are called expected operating characteristic curves, which put power on the y-axis against a shift in the mean. I'm going to go through a series of scripts; really it's one script, right here, that allows me to change things like that rho, the proportion of donor-to-donor variability.
It also lets me change that k-value; remember, the acceptance criterion is k times the standard deviation, typically k times the standard deviation of the historical lots. This is the number of historical lots you use, n1, and n2 is the number of lots I'm going to use for the paired design. When you run the script, it does a series of simulations; in this case it did 5,000, and it calculates the power for you: of those 5,000 runs, what percentage passed? It looks something like this, and it gives you a lot of options: I can look at different k-values, different numbers of historical lots n1, and different numbers of paired lots n2. I can then select which of these cases I want to look at to answer typical questions.

Let me open up my typical comparisons. The first one: "Heath, what if I have a known standard deviation?" It looks something like this. A few things to point out. This is the percentage of the time you're going to claim equivalence. If the processes are exactly the same, you claim equivalence a high percentage of the time. If there's a huge difference between them, like a two or three standard deviation shift, you do not claim equivalence. That's a good thing. The other thing to notice is this alpha of 0.05: given that I set my k-value at 2, that is, 2 standard deviations based on 10 historical lots, you would expect the alpha level to be 0.05, the exact alpha level I set in my equivalence test.

Now, this curve is for a proportion of donor-to-donor variability of 90%. What happens if I change that to 60%? To 30%? To no donor-to-donor variability at all? What you see is that the paired test's power curve looks really good whenever I have high donor-to-donor variability. The other thing you notice with the known standard deviation is that the alpha level, regardless of which operating characteristic curve you look at, is always 0.05.

Let's look at some other typical questions. One is: how does this compare across the different levels of rho when I use my typical approach, where I do not know what the standard deviation is? My typical approach is in blue; the known standard deviation is in red. One thing I want to point out is this one right here. What you see is that the preferred approach, the approach that even regulatory documents have said you should use, the paired approach, using the standard deviation calculated off of my historical lots, has an inflated Type I error rate.
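The simulation behind curves like these can be sketched in a few lines. The Python below is an illustrative stand-in for the JMP script, under an assumed model: total variance of one split into a donor component rho and a process-plus-assay component 1 − rho, with the EAC set from the historical-lot standard deviation and 5,000 runs per point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def split_apheresis_power(n1=10, n2=6, k=2.0, rho=0.9, shift=0.0,
                          alpha=0.05, nsim=5000):
    """Simulated probability of passing the paired (split-apheresis)
    equivalence test; rho is the proportion of total variance that is
    donor-to-donor, and the EAC is k * sd of the historical lots."""
    sd_donor = np.sqrt(rho)
    sd_resid = np.sqrt(1.0 - rho)         # process + assay variability
    passes = 0
    for _ in range(nsim):
        # Historical lots (each from its own donor) set the EAC.
        hist = rng.normal(0, sd_donor, n1) + rng.normal(0, sd_resid, n1)
        eac = k * hist.std(ddof=1)
        # Split apheresis: the same donor material goes through both processes.
        donors = rng.normal(0, sd_donor, n2)
        pre = donors + rng.normal(0, sd_resid, n2)
        post = donors + shift + rng.normal(0, sd_resid, n2)
        d = post - pre                     # the donor effect cancels in the difference
        se = d.std(ddof=1) / np.sqrt(n2)
        tcrit = stats.t.ppf(1 - alpha, n2 - 1)
        lo, hi = d.mean() - tcrit * se, d.mean() + tcrit * se
        passes += (lo > -eac) and (hi < eac)
    return passes / nsim

# Sweeping the true shift traces out an operating characteristic curve.
for shift in (0.0, 1.0, 2.0):
    print(shift, split_apheresis_power(shift=shift))
```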
This should be 0.05, just like it is here. That seemed really strange to us, and when we looked into it, we found it has everything to do with this right here. Remember how I decrease the width of that confidence interval: decrease s, increase n, or increase my alpha level. Understand this: the reason you have an inflated Type I error rate with this paired test is that the deltas you're judging against use the standard deviation of historical data, which contains donor-to-donor variability, while the confidence interval itself does not contain donor-to-donor variability. Why? Because you did a paired test; it contains only analytical and process variability. That's where the inflated Type I error rate comes from. If you use this paired approach, understand that you have an inflated Type I error rate. We see that, and it's even more prevalent when you have high donor-to-donor variability. Why? Because when you have low donor-to-donor variability, the process variability is the largest part of the variance components you have.

Let's look at a few more questions. As I said, this one script answers all of these. This one answers: "Hey, Heath, if I use that recommended paired approach, can I look at what happens as I increase sample size from 3 to 4 to 5 to 6?" Two things to point out. Number one, as I increase sample size, I get higher power, but I still do not have adequate power if there's no donor-to-donor variability; with zero donor-to-donor variability, I would need a sample size of at least 8, that is, 8 different donors. If I do have high donor-to-donor variability, like 0.9, where 90% of the variability is donor-to-donor, I do have high power when there's no difference between the means. So I can answer those questions with overlaid operating characteristic curves for different sample sizes.

I can also ask: "I've set my different sample sizes, but what if we look at different k-values?" Understand that your acceptance criterion is k standard deviations, so those goal posts widen as you increase k. Therefore, you have a much higher ability to pass equivalence, and much higher power. Another typical question: what if I want to change both of those together? I'm a big fan of Graph Builder. What you're looking at here is not only increasing sample size, blue is 3, red is 4, green is 5, and purple is 6, but doing that for different k-values: what would your operating characteristic curves look like? Good. I want to revisit this.
Just like I said before: whenever I have a large proportion of donor-to-donor variability, I said that for k of 2 right here, I would expect my alpha level, the proportion of the time I pass this test at the boundary, to be 0.05, but what you see is an inflated Type I error rate. And whenever I'm looking at a rho, a proportion of donor-to-donor variability, that is very small, I do not have much power.

So the question was, what if we did this instead? If we have low donor-to-donor variability, what if we used the information from those historical lots? If I have no or very low donor-to-donor variability, why couldn't I just do an independent t-test, where I compare the mean of the new process against process A, my historical process, using not just the paired lots but those 10 historical lots as well? We wanted to see how that compared. Doing it that way, the independent test is in red and the paired approach is in blue. What you see is that if I have little to no donor-to-donor variability in my cell therapy split-apheresis process, the independent t-test has a much better profile than the paired approach. However, if I have high donor-to-donor variability, the paired approach in blue has a much better operating characteristic than the red. So the question is, instead of automatically doing that split-apheresis paired design, maybe it would be better to make a decision based on that donor-to-donor variability.

How does this compare when I look at different k-values? I see the exact same phenomenon: with low donor-to-donor variability, it makes sense to do the independent t-test; with high donor-to-donor variability, I have a much better operating characteristic curve and higher power with the paired approach. It doesn't matter whether I look at a k of 1.5, 2, or even 3.0; regardless of the k-value, I have a much better operating characteristic curve if I consider that donor-to-donor variability. What if I look at different numbers of lots? We looked at 3, 4, 5, and 6 paired lots, and regardless, you see the same phenomenon. We're currently writing a paper on this to propose that if you have low donor-to-donor variability, maybe it does not make sense to use a split-apheresis, or paired, analysis approach; maybe that approach is only good when you have high donor-to-donor variability.

These are typical questions that are asked about split-apheresis designs. I just want to cover two or three more of these to show you a few other things you could do. These are different things we were looking at.
We looked at how the operating characteristic curve compares if, in blue, we use nothing but the historical lots to estimate the standard deviation, versus, in red, using the paired and the historical lots together. What you see is that there's not much difference between the two, especially at higher sample sizes for n2. We also looked at estimating that standard deviation a few different ways, using the historical lots (in blue) versus the historical lots plus the paired lots (in red), and comparing the independent case versus the paired case. What do I see? As I said before, the exact same phenomenon: with low donor-to-donor variability, the much better way of doing this is an independent t-test; in the lower right-hand corner, where you have high donor-to-donor variability, it makes sense to use the paired approach.

The last one I want to show you is something we've been working on. We looked at the paired approach versus the independent one; the paired approach is in blue and the independent is in red. I've said this over and over again: if I have low donor-to-donor variability, the independent case looks much better, and if I have high donor-to-donor variability, the paired approach looks better. But one thing we did was ask, "What if I simply used whichever approach gave me the shortest width of the confidence interval?" That's in green, and what you see is that it usually gives you the best approach regardless of what your rho, your proportion of donor-to-donor variability, is.

In closing, I would like to point out a few things. This script answers the typical questions our customers have about operating characteristic curves associated with these split-apheresis designs. What I do want you to take away, though, is that if you have a low proportion of donor-to-donor variability, these designs are very underpowered for fewer than 8 lots, fewer than 8 different donor materials. We live in a world in cell therapies where you do not have a lot of donor material, so you have very low sample sizes. It would be much more efficient, if you have low donor-to-donor variability, to use the independent case. We also have other revisions of this work for situations where you are able to make multiple lots for the paired approach from the same donor, or to take multiple measurements, so you can look at those operating characteristic curves as well. Thank you.
There is a no-man's-land in JMP, a place where it is scary to venture. It is a place inhabited by specters of data sets too complex for simple nonlinear regression, but too ephemeral for functional data analysis. It is a strange place. It appears basic methods are enough to venture here, but using those techniques quickly becomes unmanageable. These spectra can come from many sources, but all have a common problem: there is too much data for simple nonlinear regression or too little to use functional methods. Join me on a journey across this challenging landscape of classical spectroscopy to learn methods for extracting information from examples of complex spectra and how to automate the process.

There's a strange place that sits between the analytical tools you would use for analysis with known physical models and simple curves, and the analyses you would do, say, with Functional Data Explorer, where you have families of curves with complex shapes and you're less interested in the actual physical nature of the shapes themselves than in relating them back to observed phenomena. This strange no-man's-land of analysis in JMP is where a lot of first-principles techniques sit: things like X-ray diffraction, things like HPLC, where we have known physical methods and known equations that describe very fundamental phenomena of a molecule, a crystal, or a system. All we have to do is plug in peak positions or area-under-the-curve information, and we can get some very sophisticated analyses out of fairly simple data points because of these first-principles methods. At first blush, it would seem like JMP should be able to handle that; it seems like it's got all the tools. But when we dig into doing those kinds of analyses, we suddenly realize that the problem is a bit more complex than we would expect. Today I want to focus on some techniques and strategies for the simpler cases and then introduce some tools to streamline the larger, more complex problems. Let's move into JMP and have a look.

To start off, let's look at a very simple case: a single peak on a background. How would we pull the information out of this peak? How would we get its center position, its full width at half maximum or its standard deviation, or even the area under the curve? Most of us who have done this for a while would say: I'm going to go into Fit Curve, say here's my Y and here's my X, and then fit a peak model of some kind, let's say the Gaussian peak. You look at that and go, hey, 98% R², that's awesome. Let's see if we can do a little better. To skip ahead just a little, we can look at the Lorentzian peak shape and the Gaussian peak shape, and we see that both give a fairly good R² and fairly good peak fits.
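As a point of comparison outside JMP, the same single-peak fit and residual check can be sketched in a few lines of Python with scipy, using synthetic data in place of the demo spectrum:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit a single Gaussian peak on a flat baseline and inspect the residuals.
def gaussian_peak(x, baseline, height, center, sigma):
    return baseline + height * np.exp(-0.5 * ((x - center) / sigma) ** 2)

x = np.linspace(0, 100, 501)
rng = np.random.default_rng(0)
y = gaussian_peak(x, 5, 100, 50, 4) + rng.normal(0, 1, x.size)

popt, pcov = curve_fit(gaussian_peak, x, y, p0=[0, 80, 45, 5])
residuals = y - gaussian_peak(x, *popt)

# A high R-squared can hide systematic error, so plot the residuals against x
# rather than trusting the summary statistic alone.
r2 = 1 - residuals.var() / y.var()
```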
We could even go into the values underneath each fit and pull up the area under the curve. But how good are those fits actually? Let's look at them a different way. We'll pull up Graph Builder and look at how the models relate to the residuals for those peaks. We see a very different story than what we saw in Fit Curve with these two peak shapes: there's a systematic error built into them. With the Gaussian, we can see it's underestimating at the center, doing okay on the shoulders, but really missing things out in the tails. We see almost the inverse for the Lorentzian. Why is that? The truth is that in spectroscopy particularly, there are a lot of different peak shapes. It's not just Gaussian and Lorentzian; there's a whole family of peak shapes out there to handle all the different physical phenomena that produce the peaks we see in spectroscopy. How do we deal with those in JMP? It's actually quite easy.

Let's start by looking at the result of using the correct peak shape. Here I've got the residuals for the Gaussian peak fit again, and the blue line in this case is no longer the Lorentzian; it's called a pseudo-Voigt, which is an approximation of a peak shape called a Voigt function. Notice that the residuals for the pseudo-Voigt are dead flat; we are doing much better. Before, if we tried to do quantification with the Lorentzian or the Gaussian, we might over- or underestimate the quantity of a material in a sample. With the pseudo-Voigt, because this data really is a Voigt peak shape, we're going to get accurate quantification, and that's the important thing.

Now, how did I do this? There are a few ways. The easiest is to come into the data table and create a model column. The model is really easy to make: this is the Voigt peak shape. It looks a little scary, but that's just the nature of the math. I've got a parameter for the baseline, and this whole mess here is the Voigt peak shape. We can go into the parameter settings and define starting points for each of our values. Then, rather than Fit Curve, we come down to Nonlinear and use that column as the starting point for an analysis. I'm going to expand intermediate formulas; that's a good habit to get into in this case. I did that wrong; let's go back and redo that. It should be the counts. There we go, that looks better. Now if I click Go, it does my peak fitting for me. That's great. I can't get the area under the curve here very easily, but I can get just about every other parameter I need. The nice thing about a lot of these peak shapes is that they also have well-defined integrals.
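For readers who want the math in a portable form, here is one common pseudo-Voigt parameterization as an illustrative Python function. The parameter names are mine, not necessarily the ones used in the column formula shown in the demo:

```python
import numpy as np

# Pseudo-Voigt: a linear mix of a Gaussian and a Lorentzian that share one
# full width at half maximum (fwhm), with mixing fraction eta between
# 0 (pure Gaussian) and 1 (pure Lorentzian).
def pseudo_voigt(x, height, center, fwhm, eta):
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))        # Gaussian sd from fwhm
    gauss = np.exp(-0.5 * ((x - center) / sigma) ** 2)
    lorentz = 1.0 / (1.0 + ((x - center) / (fwhm / 2.0)) ** 2)
    return height * (eta * lorentz + (1.0 - eta) * gauss)

# Single peak on a constant baseline, mirroring the column formula above.
def peak_model(x, baseline, height, center, fwhm, eta):
    return baseline + pseudo_voigt(x, height, center, fwhm, eta)
```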
Once you have the standard deviation, the mean, and that information, you can usually get the integral, the area under the curve, fairly easily. That's one way of handling it, but writing the whole expression out by hand introduces a lot of opportunities for error in this peak shape. What we really want is something that looks a bit more like this: a predefined function called PseudoVoigt. We give it all of our fitting parameters, plus a fitting parameter for the baseline. It's the same math, but we've cloaked it in an easy-to-understand function where we are just providing the parameters we want to fit, and it works the same in Nonlinear.

How do I do that? There's a lot of code right here, but the big things to pay attention to are that we're defining a function, we're defining some parameters, and at the very bottom, for a whole family of peak shapes, I am using the Add Custom Functions operator to put them into JMP's memory, so that JMP knows I've got these custom functions and knows what they look like and how they behave. Doing it that way provides some really powerful tools. If I come into the Scripting Index once I've defined my functions, they show up there. I didn't give a lot of descriptions here, but you could give quite detailed descriptions and examples if you'd like. The other thing is that when we define these functions, we get our own entry in the formula editor, which lets us click on one of them and use it like any other function in the formula editor. These are actually quite easy to define; the examples in the Scripting Index make it very easy. Just search for Add Custom Function and use the boilerplate there to build off of. There's also a great blog post on how to do that.

That's one answer to one question. Let's continue and look at a slightly more complex problem: what happens if we have two peaks? Suddenly Fit Curve is no longer on the table; we're going to have to use Nonlinear, and that also suggests how we might work with this. We're basically going to have to break out the model we had before, column by column, just to manage all of those bits and pieces: one column for my baseline, one for each of my peaks, and then one for my spectrum. Let's have a quick look at what those look like, starting with the baseline, because it has a little gotcha we have to worry about. The baseline column just has the fit parameter for the baseline, but it also has this x term times zero. That's because Nonlinear expects every equation that goes into a formula to tie back to the x-axis that you're providing.
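The column-by-column structure (baseline plus one term per peak) maps directly onto a composite fitting function. Here is an illustrative Python equivalent on synthetic data, restating the pseudo-Voigt from the previous sketch so the block stands on its own:

```python
import numpy as np
from scipy.optimize import curve_fit

def pseudo_voigt(x, height, center, fwhm, eta):
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    gauss = np.exp(-0.5 * ((x - center) / sigma) ** 2)
    lorentz = 1.0 / (1.0 + ((x - center) / (fwhm / 2.0)) ** 2)
    return height * (eta * lorentz + (1.0 - eta) * gauss)

# Baseline + peak 1 + peak 2, mirroring the column structure in the demo.
def two_peak_model(x, b0, h1, c1, w1, e1, h2, c2, w2, e2):
    return b0 + pseudo_voigt(x, h1, c1, w1, e1) + pseudo_voigt(x, h2, c2, w2, e2)

# Synthetic stand-in for the two-peak spectrum.
x = np.linspace(0, 100, 1001)
rng = np.random.default_rng(1)
y = two_peak_model(x, 2, 80, 35, 6, 0.3, 50, 60, 8, 0.6) + rng.normal(0, 1, x.size)

p0 = [0, 70, 33, 5, 0.5, 40, 62, 7, 0.5]      # starting values, one set per peak
popt, _ = curve_fit(two_peak_model, x, y, p0=p0, maxfev=20000)
```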
We put x times zero in there just to have it be okay with plotting that. That's just a little gotcha that you have to deal with. That's one piece. Peak 1 looks just like we would expect, with its parameters. Peak 2 looks just the same, except it's got different parameter names so we don't have any collisions: Peak 1 used one set of parameter names, and Peak 2 gets its own B-prefixed set. That's the only thing we have to do. Then the spectrum itself, the thing that we're going to fit, the thing that we're going to put into the Nonlinear platform, is just my baseline curve plus my peak 1 plus my peak 2. Just like I showed you before, doing that in Fit Nonlinear, here's my spectrum column that goes into the prediction equation. I'm going to remember to put my counts in, and not my x column. Just like I said before, I'm going to expand my intermediate formulas, and that's going to tell JMP to dig back from that first formula into all the formulas that are in the individual columns. We click OK, and hey, we see what we expect to see. Now we can click Go, and it goes through and fits everything just like we would expect. We get a nice fit, and we have the ability to get confidence intervals and everything else we'd like off of that. Two peaks is reasonable and possible. But the problem that we run into is what happens when we have something that looks like this. At a rough count, there are probably a dozen peaks there, plus a complex baseline that's not actually a straight line; it's probably got some parabolic behavior to it. We've got a complex baseline, we've got multiple peaks, and we're going to have to make one formula for each of those. There's a lot of legwork to build something like this. If you get into X-ray diffraction, the problem gets even worse. There are comfortably 30 or 40 peaks in this spectrum right here that we would have to work with. The first question that we need to ask is, can Nonlinear handle that big a problem? Well, it turns out that it can. If we just use Nonlinear, I'm going to do something wild and crazy: I've got it fitting a Lorentzian peak, and I'm going to come back and actually have it fit in real time. You can watch that as it goes through. It nails each peak in near real time as I move through this quite quickly; it's hitting the big peak in each group. That says that the fit engine can probably handle the processing that we're dealing with. That's fine. This really becomes more of a problem of logistics than a problem of actual functionality within JMP. But it really is a real problem. Let's just say we're looking at fitting Voigt peak shapes, and we could also talk about Lorentzian, Gaussian, or the Pearson VII, all those different types of peak shapes. The Voigt peak shape has five parameters: the x axis and then the four fit parameters. That roughly equates to about six mouse clicks per peak. Even if you're doing it in a single formula, it's six mouse clicks per peak.
That says that for a ten-peak formula, for a ten-peak spectrum, we're going to have to do something like 88 mouse clicks. However long that takes you per mouse click is dependent on many, many factors. But if we were to do something like that X-ray diffraction pattern, where the peak count is up around 40, we're talking in the range of 300 mouse clicks. That's a lot of clicking around that we don't want to have to do. We would like our interaction with the spectrum to be something along the lines of one click per peak. That suggests that we need some automation built in. Let's have a look at how I've done that. I've built a tool to handle this, and it actually brings together a number of different solutions. First off, let's look at the library of peaks that I've generated in Spectriss. The title of this talk includes in it a number of different peak shapes. We include a family of Gaussian peaks, including a split Gaussian that gives you a different standard deviation on one side of the peak from the other. The same with the Lorentzians, the Pearsons, and then the Pseudo Voigts. These all also have versions that are tuned to give you the area instead of the intensity as a fit parameter; that's the area term in all of these. That's one piece. When we load in the Spectriss add-in, we get that for free. That's automatic. Now let's look at the other challenge. Let's take that olive oil spectrum. What we really want is a tool where we can come in and say, here's my X axis, here's my Y axis, and I just want to be able to do some peak finding. Here are my four main peaks; it found them automatically. Maybe I want to do a first derivative, or maybe I want to do a quantile. I can also remove the background here, so I can click Finished. It's found those first three peaks for me. I'm going to go ahead and change my background to a linear one. Now I can also come in and do some manual peak selection. Behind the scenes, it's taking care of writing all of those peak parameters for you so that everything's nice and tidy. There's probably one right there. Probably one right there. There's one right there. Every time you add a peak, you can come in and select the peak in the list of peaks, and it'll give you the information calculated at that time. You can see right here, these peaks are not well defined; they're not fitting the data very well. Really, we want to go over into Nonlinear. I've taken and hacked Nonlinear so that it will run this in real time and look nice and pretty. You can watch the peak shapes changing. Realistically, I might have chosen a quadratic instead of a linear baseline for this, but just for the sake of interest. Here, I've run out of iterations. I'll increase the iteration limit, and I'll also back off just a touch on my gradient criterion so that I can try to get this thing to converge a little quicker. Okay, we'll take that as good enough for the moment. We can say that we want to accept the fit, and there are my fit parameters. Then I can say Done.
It brings it back over into Spectriss for me to work with. I can now say, refine my AUC parameters, and I can come in and get my new approximate area under the curve. That's all great, but what I really want is an output table that has all those parameters and their information attached to them. That's Spectriss in a nutshell. The goal with this project was, like I said before, to have the ability to handle physical peak shapes and multiple peaks, with an easy-to-use interface, for those curves where we need the area under the curve and the physical parameters attached to each peak, but where we either don't have enough data to use Fit Model or the Functional Data Explorer, or it's just not the kind of problem where we want to work with that particular tool. The tool is up on the JMP Community; the QR code here will take you to the add-in, where you can work with it. Spectriss is up now and ready to go.
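Pulling the earlier two-peak walkthrough together, here is a hedged JSL sketch of that column-by-column setup; the column names, parameter names, and starting values are assumptions, and Expand Intermediate Formulas is assumed to be the scripted counterpart of the option mentioned in the demo.

    // Sketch of the column-by-column setup (assumed names and starting values).
    dt = Current Data Table();

    // Baseline: the 0 * :x term is the gotcha above, so Nonlinear sees the x axis.
    dt << New Column( "Baseline", Numeric,
        Formula( Parameter( {b0 = 0}, b0 + 0 * :x ) )
    );

    // Two Gaussian peaks with their own parameter names to avoid collisions.
    dt << New Column( "Peak 1", Numeric,
        Formula( Parameter( {a1 = 1000, m1 = 480, s1 = 8},
            a1 * Exp( -((:x - m1) ^ 2) / (2 * s1 ^ 2) ) ) )
    );
    dt << New Column( "Peak 2", Numeric,
        Formula( Parameter( {a2 = 600, m2 = 520, s2 = 8},
            a2 * Exp( -((:x - m2) ^ 2) / (2 * s2 ^ 2) ) ) )
    );

    // The column that goes into Nonlinear: baseline + peak 1 + peak 2.
    dt << New Column( "Spectrum Model", Numeric,
        Formula( :Baseline + :Name( "Peak 1" ) + :Name( "Peak 2" ) )
    );

    // Expand Intermediate Formulas tells JMP to dig into the component columns.
    dt << Nonlinear(
        Y( :Counts ),
        X( :Name( "Spectrum Model" ) ),
        Expand Intermediate Formulas( 1 )
    );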
Have you gone back to your analysis, only to realize you've forgotten how you defined a metric or the general context of your data? Have you had stakeholders ask for this information or be confused on how to interpret a metric?   If so, why not consider digitalizing this information so that it's easy to convey among report periods or across common reports? This method helps provide report consistency, making it easier for stakeholders to gain the information they need when interpreting your analysis. It also saves you time, because this information is now in a centralized place, automating it as part of the analysis.     JMP data tables can be used to build data dictionaries and store other report facts that can be easily populated in JMP reports. This presentation shows how you can build this capability, whether you are a basic JMP user or a seasoned JSL scripter. Make metric definitions part of your JMP analytic workflow, rather than an afterthought!     Have  you  gone  back  to  your  analysis  and tried  to  figure  out  what  you  were  doing?   Maybe  the  analysis  was  done  six  months  ago and  you're  trying  to  go through  your  analysis and  trying  to  figure out  what  was  this  metric  again? How  did  I  define  it, what  were  the  assumptions  behind  it? Now  think  about  your  audience. If  they're  looking  at  something  that  you don't  necessarily  understand  or  they  may have  want  a  little  bit  more  nuance  about what  the  definition of  a  metric  is  in  the  analysis. Well,  the  great  thing  is  we  can  do  that. There  are  things  called  data dictionaries  that  we  can  be  utilizing. What  is  a  data  dictionary? A  data  dictionary  is  really  a  catalog that  communicates  the  structure and  the  content  of  the  data. More  importantly,  what  we're  going  to  be talking  about  today  is  really  meaningful descriptions  around   what the  metrics  are  so  that you  and  your  audience  can  understand  what  is  contained  in  that  analysis. The  purpose  of  this  is  really  to  be able  to  over  communicate  so  that everyone's  on  the  same  page  of  what the  variables  or  metrics  are and  how  they  were  calculated or  any  other  important  information  to  help either  your  audience  or  your  future  self understand  and  document what  those  metrics  are. We're  also  going  to  be  showcasing  how to  be  able  to  centralize  these  definitions in  one  place  and  be  able to  digitalize  this  content. What  I  mean  by  digitalized  is  just like  you  take  data  from  raw  form and  being  able  to  present  that into  a  graph  or  an  analysis for  people  to  consume, we're  going  to  do  the  same thing  with  a  data  dictionary. We're  going  to  be  able to  pull  the  data  dictionary to  be  able  to  be  tied  into  your analysis  so  that  your  audience  or  yourself do  not  have  to  go  find where  that  information  is. It's  all  tied  together for  easy  user  experience. Okay,  so  today  we're  going  to  talk  about I  have  three  options  here  for  folks. The  first  solution  is  just really  a  simple  no  code. If  you're  one  of  the  people  who  love the  GUI  around  JMP and  you  do  not  want  to  have  anything with  code,  this  is  your  solution. Let  me  walk  you  through  a  quick  way of  being  able  to  do  this so  that  you  can  use   the  data  dictionary  on  concept. 
The  second  part  is  if  you're  into  JSL and  you  have  a  very  simple  data  dictionary that  you  want  to  have and  be  able  to  tie  on, I'll  show  you  how  to  do it  with  a  little  dose  of  JSL. Now,  if  you're  a  supercharger   and  really  want  to  see  JSL in  all  its  glory  and  being  able to  essentially  be  able  to have  a  way  of  being  able to  categorize  your  work along  with  defining the  metrics  as  you  go  along and  then  be  able  to  centralize  that, really  be  able to  digitalize  the  data  dictionary and  then  be  able  to  pull  out exactly  what  you  need  for  a  given  report. We'll  do  this  with   the  supercharged  full  out  JSL. Then  we'll  talk  about a  call  to  action  at  the  end. Let's  talk  quickly. Let's  just  JMP  in  here  and   talk about  the  simple  no  code  option. First  we're  just  going  to  be  using a  JMP  data  table, and  then  we're  going  to  embed the  definitions  into  that  data  table. Let  me  open  up  the  data and  the  definition  table. Here's  what  a  data  dictionary, a  very  simple  data  dictionary  looks  like. You  have  the  variables and  a  description, which  each  of  these  variables  are columns  here  in  the  data  table. You  have  the  variable  or  the  column name  along  with  the  meaning. I  also  added  the  references  over  here. We  can  come  over  here  and   just plot  our  data  and  we  can  see   what the  various  different metrics  look  like  in  here. But  maybe  your  audience  comes  here and  goes,  what  does  abundance  mean? Or  how  are  we  defining  richness? Or  what's  the  Shannon  Metric? Well,  they're  going  to  have  to  go and  figure  out  what  that  actually  means. What  we  can  do  here  is  we  can  tie  this data  table  to  the  data  dictionary  table, to  the  data  table,   and  we  can  do  this  really  quickly. If  you  go  over  to  this  triangle  over  here,  click  on  the  red, go  down  to  copy  data  table. We  can  come  over  here, do  the  same  thing  over  in  the  data  table. But  this  time we're  going  to  hit  new  script and  we're  going to  type  in  definition. We're  going  to  then  paste. I  have  command  v  or  control  v  on  here. You'll  see  some  code. You  do  not  need  to  understand what  that  actually  means. You're  going  to  say  save   and  then  you're  going  to  say  okay. You're  going  to  see that  you  have  the  definition. If  your  audience  or  your  future  self wants  to  come  in  here  and  say,   hey,  here's  my  plot,  that's  great, I  see  my  data,  but  I  need to  know  what  the  definition? They  just  need  to  hit  the  plot  and  voila. You  have   well  formulated  your  definitions. Okay,  very  simple. Hopefully  if  somebody  has a  data  table,  come  up  with  some  metrics, use  another  JMP  data  table  for  it   and  just  attach  it to  your  data  table   and  you  have  a  great  reference. Okay,  so  that  is  the  simple  notebook. Now  let's  go  to  simple with  a  little  dose  of  JSL. All  right,  so  here   we're  going  to  publish  the  analysis along  with  the  definition  table   to  a  window, and  then  we're  going  to  export the  analysis  to  a  PowerPoint  deck. What  I'm  going  to  do   is  I'm  going  to  run  the  script and  then  I'll  walk  you  through how  I  did  the  script. The  first  thing  is  let's  get  this. This  is  what  the  outcome of  the  script  does. 
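In rough JSL, a script like the one being run here could look something like the sketch below; the table names, columns, and output path are assumptions rather than the presenter's actual code.

    // Sketch only: analysis on top, definition table below, one export button.
    dtData = Data Table( "Diversity Data" );
    dtDict = Data Table( "Data Dictionary" );

    win = New Window( "Analysis with Definitions",
        V List Box(
            Outline Box( "Analysis",
                dtData << Graph Builder(
                    Variables( X( :Site ), Y( :Abundance ) ),
                    Elements( Points( X, Y ) )
                )
            ),
            Outline Box( "Definition Table",
                Data Table Box( dtDict )   // note: ignores cell height/width/wrap
            ),
            Button Box( "Save to PowerPoint",
                win << Save Presentation( "$DESKTOP/analysis_with_definitions.pptx" )
            )
        )
    );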
You  see  the  same  analysis  up  here, but  what  you  see  down  here  is we  have  a  definition  table,  and  you can  just  open  up  the  definition  table, and  then  you  can  see   what the  different  rows  are or  the  definitions of  the  different  meanings  of  the  data down  here  along  with  the  reference if  people  wanted  to  go and  know  where  to  go to  look  for  more  of  that  information, How  did  we  actually  do  this? Essentially  the  first  thing  we  did  was basically  opened  up  the  data  tables  here and  then  the  next  thing  we  did  was  we  created  a  window. This  is  basically  this  window  right  here and  then  an  outline  box  with  the  analysis. This  is  just  a  simple  graph  builder,  nothing  fancy. Then  down  here  we  did  another outline  box  here  where  I  had  it  closed. But  we're  going  to  need  to  keep that  open  for  right  now. Then  simply  just  did  a  data  table  box that  just  basically pasted  that  into  the  window. Okay,  now  let's  say  that  this  is  great, but  we  want  to  export this  to  a  PowerPoint. We  can  very  well  easily  do  that  by  making sure  you  have  the  definition  window  open. Don't  have  it  closed  or  it  won't  export to  the  PowerPoint. Make  sure  it's  open  and  then just  say  save  to  PowerPoint. Voila. There  you  go. It  exports  the  data  to  the  PowerPoint. Now  there's  a  little  bit  of  things that  you're  going  to  need  to  do. You  can  make  this  bigger, you  can  make  this  change  the  color  on  it. The  titles  up  here,  make  that  white  just so  it  makes  it  stand  out  or  change  it. Then  also  here's the  data  dictionary  as  well. All  you  need  to  do  is this  is  all  edible  so  that  you  can actually  make  this a  little  bit  more  user  friendly. I  find  that  12  font  makes  it  better, but  you  can  have  the  definitions  in  the  backup  of  your  slides so  that  people  can  come  back   and  be  able  to  reference  it. It  doesn't  have  to  be  part  of  the  meet  of  the  presentation. It  could  just  be  in  the  backup  slides. Makes  it  really  easy  to  be  able to  export  all  this  information to  PowerPoint  and  just  have it  tied  with  your  analysis. All  in  really  essentially a  few  lines  of  code. Basically  all  I  did  to  export  this  was I  created  the  button  save  the  PowerPoint. Just  basically  save  the  graph and  then  save on  one  page,  on  page  two and  then  on  page  three  I  had  the  data  dictionary  part. This  is  the  object  for  the  data  dictionary and  this  is  the  object  for  the  graph. Pretty  straightforward to  be  able  to  do  that. We're  going  to  close  all  these  slides files  and  we  can  close  this  as  well. Now  there  is  a  drawback with  the  data  table  box. I'll  get  into  a  little  bit  of  that in  my  next  example  on  that. In  the  supercharger, I'm going to talk about the drawbacks on the data table function itself, but  I'm  also  going  to  showcase how  to  capture  dynamic  metric  definitions as  you  do  a  report  and  then also  giving  you  a  little  bit  more  detail, how  to  get  more  details  around  that. Then  I'm  going  to  showcase  how to  do  a  dynamic  definition  box  around. This  is  going  to  take a  little  bit  more  coding than  just  specifying that  data  table  box. Let's  talk  about   what  the  drawback  is and  let's  take  this data  dictionary  from  the  World  Bank. 
You  can  see  that  they  have  some  really long  definitions  on  here and  we  have  some  more  information  around these  various  two  metrics. But  if  I  just  do  a  data  table  box, what  you'll  see  is it  doesn't  take  any  of  the  cell  height, width  or  set  wrap  on  here. This  is  something  actually  I  had  to  go  back  and  to  JMP  support and  they  confirmed  that using  the  data  table  box, it  doesn't  capture  these  aspects. This  isn't  ideal  on  here. Now,  I  did  put  a  request in  the  JMP  wish list that  they  would  take  this  so  that  we wouldn't  necessarily  have  to  do  the  coding that  I  will  be  showing  you in  a  few  more  minutes. This  isn't  necessarily  ideal, but  basically  this  data  table  box, there's  some  drawbacks  on  that. The  way  that  I  got  around  to  it, just  to   give  you  an  idea  of  just using  the  table  is  I  literally, let  me  just  show  you   this. This  was  my  attempt on  here,  recreating  it. This  is  the   the  desired  outcome and  it  took  a  lot  of  coding  on  here. Just  to   give  you  an  idea, I  mean,  you  could  try  to  do  this, but  this  isn't  really  dynamic. This  is  just  basically you  would  have  to  explicitly, you're  basically  coding  each  one  of  these  cells, the  titles  and  then each  one  of  the  text  boxes  down  here. For  example,  I  took  this  indicator  name from  the  first  row  and  put  it  here, long  definition  from  the  first  row  of the  data  table  to  here,  blah,  blah,  blah, and  just  gone  on  and  on  and  on  and  on. This  isn't  ideal  at  all. It  requires  a  lot  of  explicit  coding. Also  the  other  thing  I  noticed  is you  can't  do  for  loops  within the  display  boxes  at  all. I  went  back  to  JMP  support  and  said is  there  anything  that's  just  not documented  of  how  to  get  the  display  boxes or  the  cell  heights  from  the  data tables  translated  back  over? They  actually  gave  me  a  solution and  I'll  show  you   how  we  go  through  that in  a  few  minutes. But  because  basically  the  data  table  box does  not  allow  you  to  set  the  cell  height, the  cell  width  or  the  wrap. Let's  look  at   how if  we  did  want  to  do  something that  is  a  little  bit  more  dynamic and  building  out  a  data  dictionary from  your  analysis. Here,  what  I've  done  here is  really   as  I  was  going  through building  metrics, I  define   the  different  aspects that  I  would  want  in  my  data  dictionary. Here  I  have  initial  metric  for  the  name of  metric,  the  definition,  the  source, the  frequency  of  the  metric, and  the  specific. I  will  be  capturing  that  every  time I  go  through,  I  create  a  new  metric. Let's  say  I  work  at  the  World  Bank, I'm  creating  these  metrics and  this  data and so forth. Imagine  I  have  a  whole  bunch  of  code  that basically  develops  a  metric  and  so  forth, takes  that  data,  does  stuff, I  create  this  GDP  growth, annual  percentage  type  of  thing. This  is  essentially  how  I  basically would  have  captured  the  information. I  just  take  the  information for  the  definition  name. I  would  put  the  definition  up  here in  a  separate  line  and  then  carry  that  as an  object  in,  and  just  do  an  insert  into and  then   insert  this into  the  list  and  so  forth. 
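A small sketch of that capture-as-you-go pattern, accumulating the dictionary entries with Insert Into at the point where each metric is created; the list names and the metric text are invented for illustration.

    // Sketch of capturing a metric's dictionary entry where the metric is built.
    defNames = {};  defDefs = {};  defSources = {};  defReports = {};

    // ...code that actually builds the GDP growth metric would sit here...
    Insert Into( defNames, "GDP growth (annual %)" );
    Insert Into( defDefs, "Annual percentage growth rate of GDP at market prices" );
    Insert Into( defSources, "World Bank national accounts data" );
    Insert Into( defReports, "Quarterly economic report" );
    // ...repeat the four Insert Into calls for every new metric...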
That  essentially  if  I  run  the  script just  so  that and  make  sure  I  don't  have anything  highlighted, I'm  essentially  taking  each  one  of  these metrics,  and  I  have  four  metrics  down  here and  just  basically  defining each  one  of  these  metrics with  the  different  information as  we're  going  through  and  building  out. One  of  the  things  I  did  put in  here  is  this  report. Up  here and  why  I  have  this, and  this  is  important  is maybe  you  want  to  have  an  idea where  that  report  is  going  to. Basically  this  metric  is  going to  be  used  for  this  report. These  metrics  are  going to  be  for  that  report so  that  you  can  centralize all  your  definitions  in  one  place. Like  you  can  go and  put  this  into  a  database or  consolidate  it  all into   one  location so that  everybody  can  have  access to  that  information  and  understand, have  one  working  definition and  know  where  that  is. If  you're  doing  multiple  reports, to say  what  reports  those  are  in and  then  also  know if  they  need  to  make  changes, which  reports are  these  metrics  in. It's  a  really  great  way to  keep  track  of  things and  also  keep some  governance  around  those  names. What  I  did was,  essentially  did, as  you  can  see,  is  I  built  out, I  just  basically  came  up  with  a  list, and  then  essentially for  each  one  of  these  lists, I  just  set  those values  into  that  column. If  we  go  back  to  here, each  one  of  these  objects, which  is  a  list,  would  be  basically everything  for  the  indicator  name. There  should  be  four  things. If  we  go  over  here  and  look  at the  death  name, and  I  go  to  my... You  can  see  that  I  have a  list  here  of  names and  that  just  becomes all  the  content  within  the  indicator  name. That  was  just  basically  done  by creating  a  new  table and  then  just setting  that  list  for  that  column. I  did  that  for  each  one  of  those. Then  once  you  have  these  all consolidated  into  one  place, you  can  then  just  go  save  this to  a  database  or  some  central  location so that  not  just  you, but  anybody  else  can  get  it. Then  you  can  also pull  this  into  your  reports based  on   what you  want  to  have  done. That's  what  we're going  to  talk  about  next. Let  me  just  close  this  out. Go  back  to ... Close  this. Close  this. All  right,  so  we  just  built  a  definition and  now  let's  talk  about  kind of  building  a  dynamic  analysis. We  just  created that  table  of  indicators. What  we're  going  to  want  to  do  is   we  want  to  create  a  report, but  we  don't  want  all  maybe we  don't  want  necessarily... maybe  we  have  like  100  indicators and  now  we  just  want just  a  few  of  those  indicators. The  first  thing  we  can  do is  just  basically, we  can  open up  these  data  tables. Now  I'm  adding  in, I  have  another  data  dictionary, which  is  basically the  different  countries  here. I  have  the  data  dictionary of  the  indicator or  the  metrics  that  we  just  created. Then  we  actually  have  the  data  over  here, which  is  all  the  actual  data along  with  the  countries  on  there. But  we  want  to  create  a  report with  the  definitions. Once  we  do  that, we  can  then  go  and  figure  out we  can  get  the  values  of  the  variables here  from  the  data  dictionary. 
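The table-building step just described, creating a new table and setting each list into its column, might look roughly like this; the example lists stand in for whatever was accumulated while building the metrics.

    // Sketch of writing the captured definition lists into one central table.
    defNames = {"GDP growth (annual %)", "GDP per capita (current US$)"};
    defDefs = {"Annual percentage growth rate of GDP at market prices",
               "GDP divided by midyear population"};
    defSources = {"World Bank national accounts data", "World Bank national accounts data"};
    defReports = {"Quarterly economic report", "Quarterly economic report"};

    dtDict = New Table( "Metric Dictionary",
        Add Rows( N Items( defNames ) ),
        New Column( "Indicator Name", Character ),
        New Column( "Long Definition", Character ),
        New Column( "Source", Character ),
        New Column( "Report", Character )
    );
    Column( dtDict, "Indicator Name" ) << Set Values( defNames );
    Column( dtDict, "Long Definition" ) << Set Values( defDefs );
    Column( dtDict, "Source" ) << Set Values( defSources );
    Column( dtDict, "Report" ) << Set Values( defReports );
    // From here the table can be saved to a database or other shared location.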
We  basically  are  going  to  pull all  these  metrics and  then  we're  going  to combine  that  with   the  other information  that  we  are  getting from  the  actual  metric  definitions. I  got  a  lot  of  things  going  on  here. Basically  I'm  going  to  be  pulling on  here  these  four  data  points. But  I  also  need  to  be  from  this I  want  to  also  be  pulling this  information,  the  country  name, the  country  code,  and  the  year. I  basically  added  those  variables in  there  and  I  did  another  insert  here. That  essentially  I'm  just  pulling the  columns  from  our  metrics. I  won't  be  getting  all the  metrics  from  this  data  table. Let's  run  this  and  then  I'm  going to  just  subset  the  data based  on   what  we  need. Okay,  so  this  is  our  subset  of  data. W e  have  our  metadata  over  here along  with  those  four  data  metric   variables  that  we  wanted. The  next  part  is we're  going  to  just  do  a... We  want  to  come  up  with  our  analysis. I'm  just  doing  something  really  simple. Just  run  that. Then  here  comes  where... So  if  I  go  back  to  this  data  dictionary, we  saw  that  there's  already  some cell  height  width  that  are  really  specific for  each  one  of  these  cells that  make  it  easy  reading  for  your  audience. We  want  to  be  able  to  capture  that and  be  able  to  translate  that  back  into the  presentation  that  we  have when  we  take  these  tables  and  put  this  on. We  can't  do  that  with  the  data  table because  basically  the  data  table doesn't  allow  that  for  that  formatting. Thank  you  to  Jay  Sun  from  Tech  Support, who  helped   figure  out  how  to  do this  because  there  was  a  couple  of  things. One,  as  I  said  before, you  can't  do  for  loops  within  table  boxes or  other  displays, you  need  to  be able  to  do  this  by  a  lot  of  object. I'm  struggling  with  the  terminology  here, but  essentially  you  really  have  to  be taking  this  and  building  the  table  box. Then  also,  depending  on  how  many columns  and  also   how  many  indicators  you  have, you  want  to  have  this  dynamic. You  don't  want  to  be coding  this  specifically. Essentially  we're  capturing  both the  cell  heights, we're  getting  the  names, I  have  coded  the  wisp  explicitly. The  code  that  we  had  was  working and  then  for  some  reason  yesterday it  wasn't  working. The  only  way  I  could  get  around with  this   was  to  explicitly map  that  out  to  make  this  work. Then  basically  building  out  the  data table  with  some  columns and  then  indicators  and  basically   just  basically  having  some  for  loops, basically  taking  the  formatting and  the  content   and  building  out  the  table. Let's  run  this  and   see what  this  actually  looks  like. This  is  what  this  actually  looks  like. Again,  similarly,  we  had  the  analysis, I  keep  the  boxes  closed and  basically  all  you  need  to  do  is  open  up and  then  here's   all  your  definitions  on  here along  with  the  indicator  name, which  is  the  y  axis  over  here along  with  the  long  definition. Folks  can  really   geek  out  on  that, where  the  source  is, where  the  annual, how  frequently  the  metric is  actually  taken, and  the  other   metadata  around  it, like  how  these  metrics  were  actually  calculated. You  can  do  this  for... 
Here, I have it for the metric definitions, but I also have the definitions for the different countries too. For example, I was looking at this and going, Eurozone; what's the Eurozone? I know what Europe is, and I know there are some countries that take euros but may not necessarily be in the Eurozone. For example, I think Montenegro is one of the countries that uses the euro, but it's not considered part of the Eurozone. Same thing with Europe and Central Asia: what does that mean? Is Mexico in North America or not? It depends on the definitions. This really helps the audience get the information that they want so they can interpret this data fairly well. As you can see, this took a bit more coding. I won't go through all the detail of that coding, but if you want it, I would be more than willing to share it with you. Just email me at sarahcallison@achievemorewithdata.com and I'll be more than glad to share the data that I have with you along with the code. Okay, so, call to action. My call to action to everyone is: try to build data dictionaries into your analysis. It really helps your audience understand the data. It helps them be able to go, oh yeah, that's what it means, and be able to focus, versus questioning what they're actually seeing. Also, number two, it helps you too, as the analyst, remember, oh yeah, this metric was in percentages, or dollars versus euros, or whatever else you may need to know about that analysis. The other thing is to vote on the JMP wish list for "Enable cell formatting for Data Table Box." I totally believe that this could help simplify the code that we use, versus having to write for loops and having to understand a lot of JSL. I think this would really open things up and get more people to build data dictionaries if we can enable simpler coding with just the Data Table Box. So, I really appreciate everyone's time in listening to this and seeing an alternative way of presenting your data and adding more contextual information to it. Please let me know how you're using data dictionaries and if you have other unique ways of presenting that information. Thank you very much, and hopefully I'll see you at the JMP Discovery Summit in October. Thank you.
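For reference, one possible shape for the dynamic, wrapped definition box described above, not the presenter's exact workaround, is to loop over the dictionary rows and build wrapped Text Boxes inside a Lineup Box; the column names and wrap width are assumptions.

    // One possible sketch of a dynamic, wrapped definition display.
    dtDict = Data Table( "Metric Dictionary" );

    lub = Lineup Box( N Col( 2 ) );
    lub << Append( Text Box( "Indicator Name" ) );
    lub << Append( Text Box( "Long Definition" ) );

    For( r = 1, r <= N Rows( dtDict ), r++,
        lub << Append( Text Box( Column( dtDict, "Indicator Name" )[r] ) );
        defBox = Text Box( Column( dtDict, "Long Definition" )[r] );
        defBox << Set Wrap( 500 );   // wraps long definitions, unlike Data Table Box
        lub << Append( defBox );
    );

    New Window( "Report with Definitions",
        Outline Box( "Definitions", lub )
    );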
At HP Hood, the use of design of experiments (DOE) has helped to successfully identify formulations in new product development, delighting consumers of our food products.   Throughout our R&D department's wide implementation of DOE, there have been a number of successful DOE models, as well as some unsuccessful ones. One major stumbling block to model building with DOE has been dealing with subjective binary responses such as acceptable/unacceptable. These binary responses provide less information than continuous responses, thus inhibiting the ability to extract meaningful results from designed experiments.    This presentation shares simple and practical strategies for using the JMP DOE platform to convert these binary responses to continuous ones, resulting in improved models and powerful insights. Real-world examples from consumer food products are given to demonstrate how DOE can be used for more than just building models. It can be used to overcome the problem of responses that are difficult to measure.     This  talk  is  titled  Expanded  Uses   of  Converting  Binary  Responses   to  Continuous  Responses in  Consumer  Product  Development. It's  a  bit  of  a  mouthful, but  I  promise  it  won't  be  that  complicated. My  name  is  Curtis  Park. I'm  a  principal  scientist  at  HP  Hood. HP  Hood  is  a  company, a  food  and  beverage  company. We  make  a  lot  of  different  milks,   nondairy  milks. We  also  make  yogurt, cottage  cheese,  ice  cream. So  a  lot  of  a  lot  of  fun things  to  taste  at  work. I'm  a  food  scientist  by  education. A few  years  ago I  was  asked  to  take  a  look  at  a  problem that  we  had  for  one  of  the  beverages that  we  were  producing. I'm  going  to  show  you a  video  just  so  you  can  see. But  we  were  getting   a  lot  of  consumer  complaints and  these  complaints  were  happening when  the  product  was close  to  the  end  of  shelf  life. A s  you  see  in  this  video, it's  pretty  obvious   why  people  were  complaining. I  think  I  would  complain  if  I  saw  something  like  that  too. It's  supposed  to  be a  nice  portable  beverage. It's  thick  and  chunky   when  it's  being  poured  out. Not  what  I  would  expect. Believe  it  or  not, this  product  was  not  spoiled. I  promise  you, it  was  not  spoiled. So  I  was  asked  to  take  a  look  at  this and  figure  out  how  can  we  fix  it? What's  the  problem? How  do  we  fix  it? HP  Hood  at  the  time, this  was  a  few  years  ago. We  were  early  on  in  our  journey with  using  JMP, and  so  I  was  really  excited  to  have an  application  to  use  in  real  life rather  than  just  reading about  it  or  learning  about  it. Naturally  I  felt  like  this,  like  Yahoo! Let's  run  a DoE,  let's  do  it. I  was  really  excited and  for  those  of  you  who  might  not  have as  much  experience  doing DoE, the  first  step  is  usually  taking  a  look at  what  factors  should  I  be  looking  at. So  we  did  a  few  experiments. If  you  can  forgive  me,  they  were  probably one  factor  at  a  time  experiments. But  we  narrowed  in  on  what  we  believed   were  the  key  ingredients that  could  have  been  causing  the  problem. We  ended  up  making  a  design. This  is  probably  the  fourth  or  fifth  iteration  of  the  design that  we  came  up  with, and  this  was  in  custom  design. So  if  you  go  to  custom  design, that's  that's  the  platform   that  we  use  to  generate  this DoE . 
A s  you  can  see, this  is  this  is  what  we  had. So  we  had  ingredients  A,  B  and  C, and  it  was  actually  a  response  surface. So  we  had  all  of  the  two  way  interactions and  the  quadratic  terms built  into  the  model. It  ended  up  being  17  runs, as  you  can  see  here. It's  17  different treatment  combinations. This  much A ,  this  much B ,   this  much  C  for  each  run. Once  we've  settled  on  this  design, we  were  really  excited   so  let's  go  solve  this  problem. Piece  of  cake,  right? You  go  into  the  lab,  into  our  pilot  plant, you  throw  some  things  together, the  beverage  comes  out. I'm  making  it  a  lot  more simple  than  it  actually  is. We  made  17  different  beverages and  then  we  put  them   on  the  shelf  for  a  little  while because  as  I  mentioned  earlier, it  takes  a  little  bit of  time  for  this  problem  to  appear. Put  them  on  the  shelf  for  a  while, sat  until  they  were  ready  to  be  analyzed. This  is  just  a  screenshot of  a  data  table. This  has  our  actual or  our  design  that  we  used. A s  you  can  see,  there's  a  column here  to  the  right  that  I  highlighted. It's  our  our  friend,  the  Y, our  response  column. So  once  we  got  to  the  point  where  we  were ready  to  ready  to  measure  that  chunky  pour now  we  started  thinking,  Oh,  how  are  we  going  to  measure  that? Because  a t  the  time, we  did  not  have  a  chunky  parameter. I've  never  heard  of  one. I've  never  found  one. If  anyone  has  ever  found  one, we'd  love  to  to  see  it  and  maybe  buy  one. But  it's  our  knowledge. It  doesn't  exist. So  what  options   did  we  have  to  measure  this? Because  if  you  can't  measure  it  with  DoE it's  really  not  that  useful. So  we  have  a  fe w  options. First  thing  is  we  can  measure everything  as  a  binary  response. So  it's  either  a  pass  fail, it's  good  or  bad etc. There's  some  pros  with  this  and  some  cons. The  pros  would  be  it's  pretty  simple  to  do,  right? Anybody  can  say  pass  or  fail  and  it takes  you  like  no  time  to  to  measure  it. However,  it  has  some  serious  cons  to  it. Such  as, it's  really  subjective  to  the  observer. What  I  think  is  good, a  colleague  of  mine  might  think  is  bad. Or  even  worse,  what  I  think  is  good, my  boss  might  think  is  bad. So  it's  really  subjective. While  it  can  give  you  some  information, they  don't  give  us as  much  information  as  we  want. Because  when  you  do  logistic  regression, what  you  get  out  of  it  really  are  just probabilities  of  something  passing or  probabilities  of  failing. In  my  experience, that's  been  difficult  to  communicate and  to  really  understand  what  to  do with  that  data, especially  when  we're  trying to  communicate  with  non-technical  people. So  continuous  if  there's  any  way  to  get a  continuous  response, that's  what  we  strive  for  because they  give  us  a  lot  more  information. We  can  know  how  good  is  it   or  how  bad  is  it, because  not  all good  are  created  equal. There's  another  option we  could  have  done and  I  would  say  this  is  probably  the  best  option if  you  can  do  it,  is  we  could  run consumer  testing  and  get  consumer  input. 
What  this  would  look  like  is  I  have  all our  beverages,  17  beverages, and  we  recruit  maybe  100, 120  consumers  of  our  product and  we  have  them  sit  down   and  rate  every  single  one for  different  attributes, one  of  them  probably  being  how  well  do  you  like  how  this  pours? The  reason  why  this  is  a  gold  standard is  because  those  are  the  people's opinions  who  matter  to  us. What  we  would  do  is  after  we  get 100  or  120  responses, you  take  a  look  at  the  data  you  get, we  can  take  averages  and  put those  averages  into  our  model. However, it  can  cost  a  lot  of  money and  it  can  take  a  lot  of  time. So  if  your  budget  doesn't  allow  it or  your  timeline  for  whatever  reason   doesn't  allow  it, you  can't  do  this  for  everything. Sometimes  the  thing  you're  trying  to  measure isn't  such  a  huge  problem   that  you're  trying  to  solve that  it's  worth  spending  all  that  money. But  it  would  still  be  important to  be  able  to  measure  it. Do  you  have  any  other  options? I  mentioned  this  earlier. You  can  find  an  instrument  that   can  measure  what  you're  looking  for. Sometimes  they  exist. Like  I  said,  I  don't  know of  a  chunky  parameter. I  looked  in  our  warehouse  in  our R&D  center,  couldn't  find  one. Even  if  you  can  find  one, if  this  is  something   that's  really  specialized, you're  not  going  to  use  it  very  often. It  doesn't  make  sense  to  buy  the  piece of  equipment  or  it  could  be  something that  would  be  really  great, but  it  requires  a  lot  of  expertise that  maybe  your  R&D, your  technical  department  doesn't  have  or just  doesn't  have  the  time   or  resources  to  to  deal  with. I'm  going  to  show  you  the  last option  we  have  here. What  I'm  going  to  say  is  training  a  group of  people  how  to  rate  that  attribute of  interest  and  then  let  them give  you  all  the  ratings. This  is  quite  as  good  as having  actual  consumers. But  here  we're  trying   to  take  subjectivity  out  of  it and  make  it  objective. When  well  trained  humans  can be  great  measuring  instruments. I'm  going  to  walk  you  through   what we've  done  at  Hood when  we  have  some   hard  to  measure  attribute. We're  going  to  use  the  case study  of  this  chunky  pour. This  is  our  roadmap. I'll  walk  you  through  this   and  then  we'll  actually  do  it  live. The  first  thing  I  wanted  to  get  across is  that  the  samples  that  you  produce from  DoE  can  be  used  for  many  purposes. I  like  to  tell  people   that  your  samples  are  like  gold and  you  should  treat  them  like  gold. They're  very  valuable. You  may  do  a DoE  thinking  that   you're  trying  to  answer  one  question, but  something  else  might  pop  up  later that  you  would  be  able  to  use  those samples  to  answer  that  question  as  well. I've  had  that  happen  to  me  many  times, so  sometimes  it's  good  to  think  about just  ask  yourself  the  question. I've  done  all  this  work  to  make 17  different  beverages. What  else  can  I  do  with  them? What  else  can  I  learn? In  our  case,  we  use  these  samples  as  a  "calibration  set" so  that  we  can  teach  our  humans,   my  colleagues, how  to  measure  this  chunky  pour. So  here's  our  method. 
The  first  thing  we  do  is  we  review all  the  samples  with  a  small  group, some  maybe  1  or  2  or  3  people  that  are really  knowledgeable  on  the  subject or  are  responsible  for  the  project. What  you  do  is  you  look  at  all  the  samples and  decide  which  samples should  be  used  to  train  the  Raiders. We're  trying  to  build  a  scale  essentially, and  then  we'll  take  that  scale and  we'll  get  our  friends,  let's  say  10, 15,  20  friends  to  actually  rate  these, these  samples  for  us after  we've  trained  them. Training  step  two, have  them  read  each  video,  step  three . If  it's  a  video, it  could  be  something  else,  a  picture, or  it  could  be  actually  them pouring  out  the  product if  you  have  enough,  etc. You  can  get  the  idea. Next,  we'll  take  the  average of  all  those  ratings. We'll  look  at  the  data,  make  sure there's  nothing  funky  in  there and  then  we  will  use  those average  values  to  build  a  model. Let's  start  with,  oops. Let's  start  with  steps  one  and  two. So  we're  going  t o  assume  that   we've  looked  at  all  the  all  the  videos and  the  way  we  typically  do  it because  it's  a  little  easier is  you  start  off  answering  the  question, which  one  is  the  lowest  in  Chunky  pour? That  would  be  this  one  right  here. Number  one,  I'm  going to  play  each  one  of  these. This  just to  make  it  clear, this  is  our  scale. It's  a  continuous  scale  from  1  to  10 and  the  1  to  10  is  kind  of  arbitrary. If  if  you  have  something  that  works better  for  you  then  great. The  video  right  above it  corresponds  to  that. So  this  first  video  corresponds  to  a  one. So  as  you  can  see, while  we're  watching  this  video pours  nicely, no  rippling  and  no  chunkiness. Pours  as  expected. Beautiful. That's  that's  the  easy sample  to  identify and  then  in  the  in  the  sample  set,  we  ask ourselves,  okay,  which  one  is  the  worst? In  this  case,  it  was  pretty  obvious. I  will  tell  you  again, this  product  is  not  spoiled. So  just  with  changing  a  few  ingredients. You  can  see  it's  so  thick, we  can't  even  get  it  out  of  the  bottom. So  that's  obviously  a 10. Then  we  did  a  little  bit  of  work to  try  to  figure  out,  okay,  which  one should  we  consider  to  be  a  five? So  halfway  in  between. This  one,  you  can  see  it  still  flows, but  there  is  chunkiness  to  it. Then  maybe a  two  and  a  half  would  be  this  one. See  it  has  a  little  less  chunkiness  to  it. Flows  well,  probably  with  normal  shaking. It'd  probably  be  fine. So  there's  a  little  bit  of  subjectivity, but  you  add  more  people to  make  it  more  objective. Then  the  last  one. This  is  seven  and  a  half. So  you  can  see  it's  very,  very  chunky. The  only  thing  that  really  is differentiating  it  from  number  ten  is that  we  can  get  it  out of  the  bottle  still  flows. But  as  you  can  see,  it's  pretty  thick. What  I  would  do   and  basically  in  this  amount  of  time, I  could  train the  people  that  are  going  to  help  us to  analyze  this, to  measure  this  chunky  pour. Then  we'll  have  them  rate once  we've  trained  them. I'll  basically  do  what  I  just  did. Maybe  we'd  take  a  little  bit  more  time to  be  more  specific  with  certain things  we  want  them  to  be  looking  for. 
If what you're having someone rate is a lot more complicated, then you'll probably need to take more time training people. This one wasn't terribly complicated, and we're really just looking for people's first impression. After that, you have them rate all the videos. I like to use Microsoft Forms just because it's easy and I can get the data really quickly, but you can use whatever you want, including paper, although that takes more time and I try to avoid it. Just to show you what our forms look like, here's a preview of it, as if you're doing it on your phone. I like to make everything as simple as possible, and everybody always has their phone, so being able to do it on a phone is my goal. I'm just asking about chunky pour here, and then they just go through and rate each one. So for chunky pour for treatment 1, I'll say, I don't know, that one was a six, and we're just asking people for their first impression. There are no right or wrong answers; usually people's first impression is right, so that's why I'm asking people not to think too hard about it. Maybe number two is a ten, and number three was a three, I don't know. They would go through all of those. Then we would get our data, and using JMP we would average all those ratings and put the data into the data table to build the model. So we're going to get out of PowerPoint for a second and go to Excel. This is what I get when I export the data from Microsoft Forms. Like I said, you don't have to use this; use whatever works for you. As you can see, ID is the rater number, not random, just an identifier for each person. I left it anonymous so we don't criticize people who maybe didn't do as well as everybody else. In this case, this data is real: I took this to a college food science class and had them do this, so these are actual college students rating the videos. As you can see, we have a column for each treatment. So person one rated treatment one an eight, they rated treatment two a four, treatment three a nine, et cetera. So we want to put this into JMP. I like to use the JMP add-in in Excel, right here. As long as you're only highlighting one cell and you click Data Table, it'll import everything. I've noticed that sometimes I'll accidentally have just a portion of the data highlighted, and if you click Data Table then, it's only going to import what you highlighted. So either highlight everything or highlight only one cell. Once you hit that Data Table button, you will get something like this. So this is our data. Just to show you where we're trying to get to with this data table, because we have to manipulate it a little bit, this is our data table for the DOE. For each run, it has how much of ingredient A, B, and C was in there.
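The Excel add-in import shown above is point-and-click; for a fully scripted workflow, the exported Forms file could also be opened directly, as in this sketch where the file and sheet names are assumptions.

    // Scripted alternative to the Excel add-in (file and sheet names assumed).
    dtRatings = Open( "$DESKTOP/chunky_pour_ratings.xlsx", Worksheets( "Sheet1" ) );
    // Each treatment arrives as its own column, ready for the quick
    // distribution check described next.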
I  put,  we'll  talk  about  this  in  a  minute but  I  put  my  scale whether  or  not  I  thought  something  passed or  whether  or  not   I  thought  something  failed. In  the  end,  we  need  one  more column  that  says  Chunky  pour. We'll  call  it  continuous. And  we'll  have  an  average rating  for  for  run  one. Average  rating  for  one,  two, three,  four,  five,  etc. If   we  look  at  this data  table  as  it  is  today is  not  in  that  format   because  we  need  all  these columns  to  be  rows  and  we  need the  the  rows  to  be  in  one  column. There's  probably  a  thousand  different ways  we  could  do  this  in  JMP and  they're  all  good and  they're  all  correct. I'm  going  to  show  you  one  way  to  do  it. It's  just  the  one  that  works  for  me. First,  what  we're  going  to  do  is  we're  going  to  stack all  of  the  columns on  top  of  each  other. Then  we're  going  to  do  a  summary  table that  has  the  average  and  maybe  we'll  also add  in  the  standard  deviation  for  fun. But  the  very  first  thing  that  I've  always been  taught  to  do  is  when  you  get  data, you  want  to  look  at  the  graph, the  data  and  look  at  the  plot. So  we're  going  to  actually  look at  the  distribution  really  quickly. So  if  we  go  to  analyze. There  we  go. Analyze  distribution. We  want  to  look  at  the  distribution for  all  of  the  treatments. I'm  just  going  to  highlight  them. Go  to  the  columns  and  say,  okay. I'm  just  looking  to  see  is  there  anything weird  about  this  data  that  we should  be  concerned  about? When  I  look  at  so  we  can  see  for   1, 2, 3 ,  etcetera, I'm  looking  for  outliers, like  for  example,  three,  everybody rated  this  sample  between  1  and  6. There  was  someone  up  here   who  rated  it  really  high, and  there's  also  someone up  here  that  rated  this  one  high. So  what  I  like  to  do is  if  you  click  on  this, it'll  highlight  where... So  this  this  row  represents one  raider,  one  person. So  I'm  going  to  see  how they  rated  everything and  you  can  see  they tend  to  be  an  outlier. The  nice  thing  is  in  JMP  is  that  once  you  highlight  one  row, all  it  will  highlight  for  all  the  other  responses. So  I  can  see  that,  yeah,  they  rated 3 being  higher  4  being  higher. We  go  down,  look. Terminate. They're  opposite  of  everybody. It  seems  like  for  some  reason the  the  training, they  got  a  little  confused and  they  thought  higher  number  meant lower  chunkiness  and  vice  versa. So   what  I'm  going  to  do  is since  I  have  this  row  highlighted, I'm  going  to  close  this, it'll  stay  highlighted. So  this  is  row  one. I'm  just  going  to  delete  this  data and  then  we'll  move  on. Now  we  feel  pretty  comfortable with  the  data  is  pretty  much  solid. Like  I  said,  we're  going to  stack  the  columns. If  we  go  to  tables  stack. It's  going  to  pop  up and  we  just  want  to  stack all  17  of  the  treatments. The  nice  thing  is  in  JMP  17, now  you  get  this  preview. I  love  the  preview so  then  I  know  if  I'm  doing  things  right. What   we  see  here  is, as  I  can  see, it'll  have  the  ID  so  the  rater  and  then rate  the  chunky  pour  for  treatment  1. They  gave  it  a  five   and  they  did  number  two,  a  seven. This  is  how  we  want  the  data  structured and  we  can  change  the  column  names. 
So  instead  of  data, we're  just  going  to  say  chunky  pour, continuous. Then  for  label,  I'm  just  going  to   call  it  run  because  that's  really what  we're  going  to  use this  for  in  a  minute. I  just  stack  it. So  I  say,  okay,  that's  how  I  want  it. Now  we  have  the  data  table  in  this  way so  now  it  lets  us  use  a  summary  table. S ummary  tables  are nice  ways  to  be  able  to make  a  table of  the  of  different  statistics. So  what  we're  going  to  do  is  we're  going  to  highlight the  chunky  pour continuous  column  and  say  statistics. Do  mean. For  fun  in  case  we  want  to  use  it, we'll  also  say  standard  deviation. This  just  gives  us  the  overall mean  and  standard  deviation. But  if  we  want  to  do  it  per  run, I'll  highlight,  run  and  put  it  here  in  group. Now  when  we  look  at  this  preview, we  have  one  through  17 and  conveniently,  they're  in  order. One,  two,  three,  four, five,  six,  seven,  eight. All  the  way  to  17. We  have  the  mean and  the  standard  deviation. So  we're  going  to  say,  okay. Okay,  so  we  have  one  more  table. Now  we're  to  the  point  where we're  where  we  need  to  be because  I  have  each  run  as  a  row and  have  a  column  for  the  average  column  for  the  standard  deviation. So  what  I'm  going  to  do  is  I  will  highlight  this  column. If  you  go  to  edit  copy  with  column  names and  then  I'm  going  to  go to  our  original  data  table. We're  gonna  make  a  new  column  here and  say  edit  paste  with  column  names. There  it  is. I  should  have  done  both  of  those at  the  same  time,  but  I  didn't. So  we're  going  to  do. Do  this  one  as  well. Okay,  so  now  we  are  ready to  do  our  modeling. So  first,  first  thing  I  want  to  show  you is  what  we  would  get  if  we  just did  pass  fail  our  binary  response. What  we'll  do  is  if  we go  to  analyze  fit  model. Because  I  made  this  this  design  in  JMP  in  the  custom  design  platform, it  automatically  knows what  kind  of  design  this  is so  that's  why  my  model  is  already  built. If  there  is  a  really  convenient  way, if  you  knew  this  was  a  response  surface  design, let's  say, let's  say  this  wasn't  here. The  macros  are  convenient. If  I  highlighted  ingredient  A,  B  and  C. Said  Macros  Response  Service. It  pulls  it  all  up. It  already  knows  what  I'm  looking  for. So  that's  helpful. I  put  it  in  the  y  axis,  the  variable, the  response  y  chunky  pour  pass  fail. What  it  gives  us  is  nominal  logistic. I'm  not  statistician, so  I'm  not  going  to  go  into  any of  the  statistics  behind  what  it's  doing. I'm  just  going  to  show  you  how  what  you  get  out  of  it and  what a  scientist  might  be  looking  at. So  if  I  say  run, our  target  level  is  passed. So  when  it's  going  to  do  probabilities and  probability  of  passing. So  we  say  run. This  is  what  we  get. So,  I  mean, the  first  thing  that  a  scientists  like myself  would  probably  look at  is  this  effects  summary. I'm  looking  at  probably  looking  at  P  values  and  I  say, well,  nothing  significant  except  ingredient  A. There  are  other  things  that  we would  look  at,  but  I'm  going  to... I'm  going  to  go  over  that. We're  not  going  to  cover  that  today. 
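The Stack and Summary steps walked through above also have a compact JSL equivalent; here is a sketch with assumed table and column names (the real table has treatment columns running out to 17).

    // JSL sketch of the Stack and Summary steps (names are assumptions).
    dtRatings = Data Table( "chunky_pour_ratings" );

    dtStacked = dtRatings << Stack(
        Columns( :T1, :T2, :T3 ),            // ...list all 17 treatment columns
        Source Label Column( "Run" ),
        Stacked Data Column( "Chunky Pour" )
    );

    dtSummary = dtStacked << Summary(
        Group( :Run ),
        Mean( :Name( "Chunky Pour" ) ),
        Std Dev( :Name( "Chunky Pour" ) )
    );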
Instead,   I  want  to  just  look  at  the  profiler, because  that's  what  we  find,  at  least  in  our  in  our  experience, the  profiler  being  the  most  useful and  easiest  to  interpret for  the  scientists  and  when they're  communicating  with  others. So  what  this  is,  is I'm  going  to  make  it  a  little  bigger. Is  on  the  left  here. We're  going  to  get  a  probability  of failing  and  a  probability  of  passing. So  if  we  have  0.13  of  ingredient, a  0.12  of  ingridient  B, 0.45  and  of  ingredient  C, and  it's  actually  0.13%,   0.12%,   0.45%. I  just  didn't  change  it. It's  a  very,  very  small  proportion of  the  formula  that  we're  changing anyways  at  those  levels, this  says  100%  of  the  time we're  going  to  pass. If  I  move  it  up,  let's  say  to.. Have  like, say  point  two  of  this  ingredient  now. Now,  looks  like  we're  going to  pass  only  64%  of  the  time. You  can  see  these  curves, how  I  changed  ingredient  B  a  little  bit and  ingredient  C, maybe  we  can  get  back  up to  a  point  where  we  pass  98%  of  the  time. You  can  play  around  with  this. But  the  problem  with  this  is, is  like  I  said  earlier, passing. Maybe  this  pass  right  here  is  not the  same  as  passing  over  here. However,  we  don't  really  know  that  with  this  information, and  it's  kind  of  hard  thing  for  some people  to  wrap  their  head  around, like  it  was  just  probability  of  passing. What  do  I  do  if  if  all   I  can  get  is  an  85%  pass  rate? Like,  let's  say  hypothetically, this  was  the  best  we  could  do. What  do  I  do  with  that? So  that's  why  we're  looking at  continuous  responses. I'm  just  going  to  close  this   and  we're  going  to  do  that, build  that  model  again, except  let's  do  it  for  the  mean of  our  continuous  scale. So  we're  going  to  have  to  remove  chunky  pour and  we're  going to  add  the  average  here. We're  just  going  to  say  run. Keep  it  simple. Do  the  effects  screening  report. Now  you  can  see   there's  a  lot  more  information  going  on that  we didn't  get  before. So  where  before,  if  you  remember, all  we  saw  was  that  ingredient  A  had  a  really  low  P  value. Everything  else  was  like  0.99. The  conclusion  was ingredient A  does  everything. Well,  it's  not  actually  the  whole  truth, as  we  can  see  here. Yes, ingredient  is  the  most  and  most  important. The  main  effect  of  ingredient  right  here. But  B  and  C  also  have  a  role  to  play. While  not  as  big, they're  still  an  important  role. So  we  look  at  our  actual  predicted  plot. It  looks  pretty  healthy. Our  lack  of  fit. Look s  good. I'm  not  going  to  go  into  all  the  details of  everything  that  we  look  at, mainly  because  I'm  not  statistician. That's  just  what  I  look  at. I'll  look  at  the  lack  of  fit. I'll  look  at  the  residuals  to  see  if there's  anything  weird, the  studentized  residuals. Then  really,  I  come  to  the  profiler and  now  you  can  see   this  gives  us  a  much  different  picture, much  more  complete  picture, where  as  I  increase  ingredient  A, the  chunky  pore  increases, but  increasing  these  these  ones  does  too. So  they  they  also  have  a  role  to  play. If  we  were  to say  that  we  want  to  minimize  it, I  think  it's  pretty  obvious  what  the... Desirability  is  going to  come  out  to  being. 
But just to show you, we'll go to the red triangle by the Prediction Profiler, then Optimization and Desirability, and turn on the desirability functions. Here is the desirability. You can change it through the red triangle, but I find it easier to just hold Control and click on it. Now we can change what our goal is. In this case, we want to minimize it, because we don't want chunky pour, and consumers don't like it either. So we say Minimize, and OK. Now we can go back to Optimization and Desirability and say Maximize Desirability. It did what I thought it was going to do: take these two ingredients out and put this one as low as you can, and you'll get the lowest chunky pour that you can. In reality, we had some other constraints, so we couldn't do exactly that. There were other factors at play, but this definitely gave us a really good idea of where we needed to go, what was important, and how to control this chunky pour, to the point where, when we implemented the changes, the complaints went away. It's been good ever since.

That is the nutshell of how you can take something that is hard to measure, really subjective, and binary (pass/fail, or good or bad) and convert it into something that's continuous. It's a relatively simple method, and you can use it for a number of things. As long as you have people available to help you out, you can measure a lot of things that could be considered hard to measure. Where do we go from here at Hood? Just to give you an example of some other things we encountered: this one, the chunky pour, is actually one of the easier ones to do. Let's say this is another product we were working on a long time ago, where you have coffee, you're going to add some foam to it, and you want to understand how well that foam dissipates into the coffee. That's a tough thing to measure; we definitely don't have any instrumentation that can really measure it. Videos really helped us understand how we could measure it and get some useful information out of it. As you can see, we're trying to measure how it looks and how well it moves for this one versus, let's say, this treatment over here. You can see they're quite different: one moves really fast, the other moves really slow; this one looks kind of chunky, the other one not so much. That's how we use it, and we use it quite often. I appreciate you taking the time to listen to my talk. I hope this has been useful and that you'll be able to find a way to implement it in your day-to-day work. Thank you.
In this presentation, we use the longitudinal analysis capability of the Structural Equation Modeling (SEM) platform in JMP to measure the impact of temperature and usage amount on detergent performance from the consumer perspective, using the overall rating of the product as a surrogate for overall performance. In this diary study measured over time, data were collected from consumers who used one of two products in three phases of four-week time intervals. Using data collected from Week 5 to Week 9, we modeled the trajectories of the performance of the detergent in relation to the temperature and usage (number of pods). Using standard SEM fit measures, we found the linear growth curve model fit the data well for the underlying latent structures. Results showed that one of the test products performed significantly better under cold-wash temperatures and used fewer pods than the other product. This result was a significant win for the company under the sustainability platform and performance under colder wash conditions.

Hello, everyone. My name is Zhiwu Liang, statistician from the Procter & Gamble Company. I support the business in the Brussels Innovation Center for P&G. My main job is consumer survey data analysis. Today, Narayanan and I will present Growth Curve Modeling to Measure the Impact of Temperature and Usage Amount on Detergent Performance. Next slide, please.

Here are the contents we will cover today. First, I will give a brief introduction to structural equation models and a bit about the data we will be using for our modeling. Then I will turn to Narayanan to introduce growth curve modeling and the model-building process, plus the JMP demo. Following the demo, I will present the conclusions and next steps. Next slide, please.

Structural equation modeling is a multivariate technique used to test a set of relationships between observed and latent variables by comparing the model-predicted covariance matrix with the observed covariance matrix. In SEM, observed variables are manifest variables that serve as indicators for the latent variables; this is constructed with the measurement model. The latent variables then form regression models that build a network, which we call the structural model. Here is an example with three latent variables and eight observed variables in JMP, showing the SEM structure. As you can see in the bottom-left chart, the circles represent the latent variables, which are calculated through their indicators. For example, the cleaning latent variable is indicated by the squares representing the manifest variables overall cleaning, stain removal, whiteness, and brightness. Similarly, the freshness latent variable is indicated by three manifest variables. If you look at the right side of the window, the loadings window shows the structure of the measurement model, how each latent variable relates to its indicators. The regression window at the bottom shows the two regression models: cleaning drives overall rating, and freshness drives overall rating. This is the structure of the structural equation model. Next slide, please.
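As a rough, non-JMP illustration of the measurement-plus-structural idea just described, here is a minimal sketch using the open-source semopy package with lavaan-style syntax. The indicator column names are hypothetical placeholders, and this is only a sketch of the concept, not the model fit in the study.

```python
import pandas as pd
import semopy

# Measurement model: two latent variables with their indicators.
# Structural model: the latent variables drive the overall rating.
desc = """
Cleaning =~ overall_cleaning + stain_removal + whiteness + brightness
Freshness =~ freshness_1 + freshness_2 + freshness_3
overall_rating ~ Cleaning + Freshness
"""

data = pd.read_csv("detergent_survey.csv")  # assumed: one row per response, columns as above
model = semopy.Model(desc)
model.fit(data)
print(model.inspect())  # loadings, regression coefficients, and variances
```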
The data we use for our growth curve modeling come from a survey we conducted in France with 119 consumers. We divided these 119 consumers into two groups. Sixty of them used the control product, which is the Ariel soluble unit dose (the pods), marked as 0 in our data set. The other 59 consumers used the test product, the Ecolabel product, coded as 1. During the 12 weeks of the test, each consumer used their own product for the first four weeks. Then they went into the eight test weeks and used one of our assigned products, either the Ariel SUD or the Ecolabel. For each wash, the consumer filled in a questionnaire providing information about their washing behavior, such as the washing temperature, the number of pods used, the soil level of the fabric (how dirty it is), and an overall rating of the product's performance. Our modeling objective is to test whether there is an effect of the product, of the washing temperature, and of the number of pods used on the overall performance rating for each wash. Next slide, please.

Since every consumer has different washing habits and different conditions, not all consumers have the same number of washes during the test weeks. Therefore, to give every consumer equal weight in our model-building data set, we first aggregate the consumer data at the panelist level on a weekly basis: we take the average washing temperature, number of pods used, and overall rating across the loads for that consumer during that week. After aggregating the data, we use an exploratory tool such as JMP Graph Builder to identify whether there is any linear trend in the overall rating, the temperature, and the number of pods used during the test weeks. From this exploratory stage, OAR is pretty stable from week 9 to week 12, so we use the intercept-only model for OAR. For the temperature and for the number of pods used, the exploratory stage showed either an increasing or a decreasing trend, so we use a linear growth model to describe the temperature indicator and the number-of-pods indicator. To explain the product impact, we also include the product variable in our model. We first build growth curve models for temperature and for the number of pods, then use these latent variables to build a regression model from the product variable, the intercept and slope of temperature, and the intercept and slope of the number of pods used to the intercept of OAR, to build the multivariate growth curve model.
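As a side note, the panelist-by-week aggregation described above can be sketched in a few lines of pandas; the file layout and column names are assumptions, not the actual study data. The point is simply one row per panelist per week, then one column per week, which is the layout the repeated-measures SEM models expect.

```python
import pandas as pd

diary = pd.read_csv("diary_washes.csv")   # assumed columns: panelist, product, week, temperature, pods, oar

# Average each panelist's washes within each week
weekly = (diary
          .groupby(["panelist", "product", "week"], as_index=False)
          .agg(temperature=("temperature", "mean"),
               pods=("pods", "mean"),
               oar=("oar", "mean")))

# Wide layout: one column per week for each measure
wide = weekly.pivot_table(index=["panelist", "product"], columns="week",
                          values=["temperature", "pods", "oar"])
print(wide.head())
```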
Now I will turn it over to Narayanan to introduce the latent growth curve model. Narayanan, it's your turn.

Thank you, Zhiwu, for the great [inaudible 00:06:49]. Hi, everyone. My name is Narayanan. I am an adjunct professor at the University of Cincinnati, where I teach courses on data mining using JMP. I'd like to start by giving a very broad definition of latent growth curve modeling. As we go along, I may use the letters LGCM to represent latent growth curve modeling and SEM to represent structural equation modeling.

Latent growth curve modeling is basically a way to model longitudinal data using the SEM framework. Because it is built in the SEM framework, it has all the advantages of specifying and testing relationships, as Zhiwu was explaining with the structural equation modeling example. As a side note, I would like to mention that LGCM is actually an application of confirmatory factor analysis, which is a submodel within structural equation modeling, with an added mean structure; this will be explained when we get into JMP. One of the benefits of using the SEM framework is that we are able to evaluate model fit. Let us look at the statement there, which says every model implies a covariance matrix and a mean structure. What this really means is that the observed covariance matrix and mean vector can be reproduced from the model parameter estimates, which are estimated using the latent growth curve model. The equality between the two is what many of these fit indices are testing. One of the oldest is the chi-square test, and the hypothesis it tests is listed there: the equality between the population and the model-predicted covariance matrices and mean vectors. However, this test, which is one of the oldest, has some watch-outs. One is that the test statistic is a function of sample size, which means that larger sample sizes will tend to reject the model even for trivial differences. Another is that the test is global and does not reflect local fit, such as could be measured by an R-square. Also, the fit being tested is too exact, as specified in the hypothesis; we know from the famous Box statement that all models are wrong, and our models are only an approximation. Because of this, several alternative fit measures have been proposed, and I'd like to mention three of them here. The first is the Root Mean Square Error of Approximation. This measures model misfit, adjusting for the sample size, which was an issue with the chi-square test. It is a badness-of-fit measure, so lower numbers are better. One of the advantages of this fit measure is that we have a confidence interval for it, and the suggested threshold is that the upper bound of the confidence interval be less than 0.10. The next are the Comparative Fit Index and the Non-Normed Fit Index. These are relative measures that test how good your proposed model is compared to a baseline model, which is usually a model of no relationships. These are goodness-of-fit measures, and the suggested criterion is that they cross a threshold of at least 0.95. The last one is the Standardized Root Mean Square Residual. This is essentially an average of the squared residuals of all the elements in the covariance matrix. It is a badness-of-fit measure, so again we are looking for smaller numbers, and the suggested threshold is that this value be less than 0.08.
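For reference, common textbook forms of these fit measures are shown below. Exact conventions vary slightly by software (for example, N versus N minus 1, and whether mean residuals are included in the SRMR), so treat these as indicative rather than as JMP's exact formulas.

```latex
\chi^2_M = (N-1)\,F_{ML}, \qquad
\mathrm{RMSEA} = \sqrt{\frac{\max\!\left(\chi^2_M - df_M,\ 0\right)}{df_M\,(N-1)}}

\mathrm{CFI} = 1 - \frac{\max\!\left(\chi^2_M - df_M,\ 0\right)}
                         {\max\!\left(\chi^2_M - df_M,\ \chi^2_B - df_B,\ 0\right)}, \qquad
\mathrm{SRMR} = \sqrt{\frac{2\sum_{i \le j} r_{ij}^{2}}{p\,(p+1)}}
```

Here the subscript M denotes the fitted model, B the baseline (independence) model, N the sample size, p the number of observed variables, and r_ij the standardized residuals of the covariance elements.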
On top of all this, finally, do not forget to check the actual residuals, the standardized residuals. What we are looking for there are numbers beyond the minus 2 and plus 2 thresholds. The idea is to look at the totality of fit and not just any one measure.

Having discussed fit measures, let us now look at the longitudinal processes we want to study. Zhiwu described three different processes. The first is the success criterion, measured by the overall satisfaction rating from week 9 to week 12. Then we have two time-varying covariates, meaning they vary over time: one is the temperature setting at which the product was used, from week 5 to week 12, and the other is the amount of product used, also from week 5 to week 12. Finally, we have an indicator variable for what type of product it is; this is a time-invariant covariate that does not change with time. The modeling strategy is, first, to visualize the data using Graph Builder. Then we select a univariate latent growth curve model for each of the processes. Then we combine all of them into a multivariate LGCM. Finally, we test the hypothesis that Zhiwu proposed, which is how the product and the other growth factors impact overall satisfaction. We will choose the simplest model as we build.

I am going to get into JMP. I am running JMP 18, which is an early adopter version. I am going to show some scripts, and I will show you how I got to some of them from the JMP platforms. The first thing I want to do is visualize the overall satisfaction, and these are trajectories: one line per individual from week 9 to week 12. Here, the overall satisfaction is plotted for each of the 119 consumers, basically one trajectory per consumer. If you look at this particular consumer, row number 16, that person's trajectory is on a downward trend from week 9 through week 12: they started somewhere in the mid-50s, and by week 12 their satisfaction measure has come down to about 37.5 on a scale of 0-100. Let us look at another person. This person here, who used the Ariel product, has a trajectory on an upward swing, going from the mid-70s to probably the early 90s by the time they reach week 12; they are getting more and more satisfied week over week. Sorry for that; a bubble popped up on the screen. What we want to do is understand how different consumers are experiencing satisfaction over the weeks, and the change in these processes for these consumers is what we want to model using LGCM. What I'm going to do is turn on the script, LGCM of overall satisfaction. I have built three different models here, in which a latent variable corresponds to an intercept for the repeated measures of overall satisfaction from week 9 through week 12, plus a fourth model, which is a simplification of the first. The first is a no-growth model, which means different people have different levels of satisfaction at the beginning, which is week 9, but then their trajectories flatten out and do not grow over time.
The second model is a linear growth model, which means the trajectories change in a linear fashion over time. The third model is a quadratic model, which means the trajectories change in a quadratic fashion over time. Finally, I have a simplification of the first model, where I assume homoscedasticity, that is, no change in the variance across time. I'm going to look at the fit measures I talked about and choose the model that fits best. What I'm looking for is low values of chi-square, high values of CFI, which goes on a scale from 0 to 1, and low values of RMSEA, which also goes on a scale from 0 to 1. It looks like all my models, no-growth, linear growth, and quadratic growth, fit the data equally well. However, I'm going to take the simplest of the models, because if I look at the estimates in the path diagram, many of the coefficients relating to the slope, the linear slope or the quadratic slope, are not significant, as shown by the dotted lines. In the linear growth model, we have an intercept, which measures the initial level of satisfaction, and a slope, which measures the rate of increase or decrease of satisfaction over time. The slope measures the change; the intercept measures the initial level. We can see that all the estimates related to the slope are not significant, as indicated by the dotted lines, and the same is true for the quadratic model. Therefore, I'm going to take the simplest model, the no-growth model, for this process, overall satisfaction.

Let me show you how I do this. In JMP, go under Analyze, pick Multivariate Methods, and choose the Structural Equation Models platform. Choose the repeated measures, in this case OAR from week 9 through week 12, drop them in the Model Variables box, and click OK. We now have these four repeated measures available as modeling variables in the path diagram area. I could build this model from scratch using the path diagram, but JMP has made it easier with shortcuts. I'm going to go under the Model Shortcuts red triangle, choose Longitudinal Analysis, and check the linear latent growth curve or the intercept-only model. If I choose the intercept-only model, I get the path diagram you saw in my script. If I run the model, you get the estimates and the fit statistics for this model. If you want to add the linear growth model, do the same thing: come under Model Shortcuts, Longitudinal Analysis, and Linear Growth Curve Model. Now we have not only an initial level, represented by the intercept latent variable, but also the rate of growth of the process, represented by the slope latent variable. We can run this model: click Run, and you get the model estimates, which, as I showed you before, are not significant for the slope latent variable. You get the fit statistics right here under the Model Comparison table. These models are easy to fit in JMP using the shortcut menu available under Model Shortcuts. I'm going to close the one I just created.
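As a compact summary of the specifications being compared, a standard univariate LGCM for the four OAR measurements can be written as below. The time scores shown are one common coding choice, not necessarily the exact parameterization the JMP shortcuts use.

```latex
y_{it} = \eta_{0i} + \lambda_t\,\eta_{1i} + \gamma_t\,\eta_{2i} + \varepsilon_{it},
\qquad \eta_{ki} = \alpha_k + \zeta_{ki}

\text{no-growth: } y_{it} = \eta_{0i} + \varepsilon_{it};\qquad
\text{linear: } \lambda_t = 0,1,2,3,\ \gamma_t = 0;\qquad
\text{quadratic: } \lambda_t = 0,1,2,3,\ \gamma_t = 0,1,4,9
```

Here eta_0 is the intercept (initial level), eta_1 the linear slope, eta_2 the quadratic slope, and the zeta terms are the individual deviations around the mean growth factors.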
We have so far built a univariate LGCM for a single process. I'm going to repeat the same thing for the other two growth processes, and we're going to look at the wash temperature trajectories. Let me show you how to do this in JMP. Under Graph, click Graph Builder and open up the temperature variables. We want to look at temperature from week 5 through week 12, so drop them on the x-axis. For the type of graph, choose the last icon in the bar at the top; this is a parallel plot. There will be some smoothing associated with it, so drag the smoothness slider all the way to the left so there is no smoothing at all. Take the product variable, which is an indicator variable, and put it on Overlay. Now you get individual trajectories. If you want to add the average trajectory, choose the sixth icon on this toolbar from the left: hold the Shift key and click on it. Now you get the average trajectory of the temperature used over these eight weeks. Click Done to get the plot with more real estate. This is exactly the plot I showed using the script. You can clearly see that from week 7 onwards there might be growth in the temperature setting; it looks like people are increasing the temperature as time progresses from week 7 through week 12. I'm going to close this. We now have a graph to visualize the trajectories of the temperature setting.

We repeat the same process to choose a model for this process. As before, I built the same three models: a no-growth, a linear growth, and a quadratic growth. Looking at the fit statistics here, this time we definitely see a significant improvement going from the no-growth to the linear growth model. The quadratic growth is only a marginal improvement over the linear growth model, and again, for the same reason as before, all the estimates for the quadratic slope are not significant. To keep things simple, I'm going to choose the simpler model, the linear growth, for temperature. The last process is pod usage, the number of pods. Here we can clearly see an increasing trend, more so for the Ecolabel product, which means people are using more and more product when they use Ecolabel as compared to Ariel, which is a P&G product. I want to model this, so let me close that and click on the script for LGCM of pod usage. Looking at the fit statistics, I again see a good model fit, especially for the linear and the quadratic models. For the same reason as before, I'm going to choose the linear model. Here I want to look at the estimates for the quadratic slope, and this is what I mean by not choosing the quadratic model: all the parameter estimates pointing to it are not significant. Now we have a model for each of the three processes. We chose a no-growth model for overall satisfaction and a linear growth model for the wash temperature. Now I'm going to put them all together using a multivariate latent growth curve model; this is basically all three processes put together.
Here, I want to show you the similarity between a confirmatory factor analysis model and a latent growth curve model, as was pointed out in the previous slide. You can see that there is a mean structure added to it, a triangle with the number one, and there are lines going from that to each of the latent variables. If I right-click, use the Show option, and turn off the mean structure, you can see the familiar confirmatory factor analysis model with latent variables and the indicators associated with each one of them. We have a single latent variable, the intercept, for overall satisfaction. We have two latent variables for the temperature, the initial intercept and the slope. We have the same two latent variables for pod usage: the initial level, represented by int pods, and the rate of change of product usage, indicated by slp pods, which is basically the slope of pods. Let me turn the mean structure back on. Now we can actually look at the estimates, which are really one of the important parts of the latent growth curve model. What we have here is an estimate of the initial level of satisfaction at week 9, because that was the starting time period for overall satisfaction; it is about 71 on a scale of 0-100. This is the average temperature setting at week 9, which is 36 degrees Celsius. Here is the product usage, 1.4 pouches, and here is the rate of change of product usage, because there is a slope of product usage, the latent variable, which is about 0.02. People are using slightly more as time goes on. That is what we get. The overall fit of this model is also fairly good; I think we saw that: CFI exactly at the 0.95 threshold, and the upper bound of the RMSEA definitely less than 0.1.

Now we go to the last model, which is the hypothesis that Zhiwu wanted to test, where we want to see if the product, the indicator variable, and the other growth factors have a significant impact on overall satisfaction. To remove the clutter, I have not shown all the indicators; all we are seeing are the circles, which represent the latent factors for each of the growth curve models, and a single product variable indicating what type of product it is. Again, let us look at the fit of this model. The fit is indeed good: we have 0.95 for the CFI and less than 0.1 for the upper bound of the RMSEA. We will look at more fit indices after we interpret some of the estimates here. I'm going to interpret the solid lines, which are the significant coefficients. We have a significant product effect from the product variable to the intercept of overall satisfaction. This can be interpreted basically as a regression coefficient: the average level of satisfaction for the product coded 1 minus the average level of satisfaction for the product coded 0. Ariel is coded as product 0, so we have much more satisfaction with Ariel, a delta of negative 9 in favor of Ariel on a scale of 0-100. That is a big change, a delta in favor of the Ariel product. Let us look at the product effect on pods.
Again,  the  same  way, average  amount  of  product  used for  product  coded  1  minus  product  coded  0. This  time,  we  are  using  more of  the  Ecolabel  product. If  you  are  a  manufacturer  of  Ariel, this  is  good  news  for  you. A lso,  the  rate  of  change  of  product  use is  also  more  for  Ecolabel compared  to  Ariel, or  0.02  pouches  from  week  to  week. Finally, we  have  the  intercept  of  temperature having  a  negative  impact on  the  overall  satisfaction, which  means  higher  temperatures lead  to  less  satisfaction. Remember,  these  are  products which  are  marketed  as  cold-wash  products. That  means  they  should  work  better in  cold  temperatures and  not  higher  temperatures. I  also  want  to  show  you  where  you  can  look for  other  fit  statistics  beyond what  is  coming  out in  the  model  comparison  table. Under  the  S tructural Equation Model in  red  triangle, if  you  check  on  Fit  Indices, which  I've  already  checked, there  are  more  fit  indices that  can  be  shown  at  the  bottom. We  want  to  look  at  CFI  and  RMSEA, which  we've  already  seen, and  here  is  the  Standardized Root  Mean S quare  Residual, which  I  discussed. This  is  also  exactly at  the  threshold  of  0.08. All  in  all,  in  terms  of  fit  indices, our  model  does  fit  quite  well. Finally, I  told  you  not  to  forget  the  residuals. These  are  normalized  residuals in  terms  of  the  measured  variables. We  have  21  measured  variables, eight  for  pods,  eight  for  temperatures, four  for  overall  satisfaction, and  one  for  the  product  variable. This  is  a  21  by  21  matrix. What  we  are  looking  for is  numbers  which  are  outside the  plus  2  minus  2  range. There  are  just  too  many  numbers to  look  at  in  the  table, but  JMP  produces  a  heatmap. Heatmap  option  is  also under  the  red  triangle. What  we  are  looking  for is  dark  red  or  dark  blue. Here,  we  have  two  dark  reds which  are  relationship   between pod  usage  at  week 6 , temperature  at  week  12, pod  usage  at  week  6, and  temperature  at   week 9. Finally,  we  have  one, because  this  is  just  a  mirror  image of  the  one  that  is  here. This  is  the  relationship between  temperature  at  week  9 and  temperature  at  week  10, which  is  not  modeled. This  could  actually  be  modeled by  adding  an  error  covariance, which  I  did  not  do. If  I  did  this,  the  model,  in  fact, would  be  even  better. I  want  to  go  back  to  the  presentation and  summarize  what  we  have  found. Oops, sorry,   wrong  slide. In  terms  of  conclusion, we  started  the  Graph  Builder to  visualize  our  trajectories, and  we  built  latent  growth  curve  model using  the  SEM  platform. We  extended  from  univariate to  multivariate  models. A ll  our  models,  including  the  last  one, had  acceptable  fit,  in  fact,  good  fit. Product  had  a  significant  impact  on  OAR, which  means   Ariel  is  better  than  Ecolabel in  terms  of  its  overall  satisfaction and  significant  impact on  the  number  of  pods, which  means  less  product  was  used for  Ariel  compared  to  Ecolabel, and  also  from  week  to  week. Intercept  had  a  negative  impact  on  OAR, which  means  people  prefer lower  temperature  setting than  higher  temperature  setting. 
If you are a P&G manufacturer, this is good news for you, because Ariel works better than Ecolabel in the modeling framework we have used. I'm going to turn it over to Zhiwu to see what the next steps are from these model results. Zhiwu?

Thank you very much. Thank you, Narayanan, for an excellent presentation and a wonderful demo. As Narayanan mentioned, the modeling results show that the product has a significant impact on the overall satisfaction with the performance of the detergent products in our test. This result gives us the confidence to make a very clear claim: Ariel favors cold wash and can be used in smaller amounts than normal products. The modeling also confirms the consumer behavior change: if you use the Ariel product, more washing loads go to cold wash, using less energy and less product. We also plan to conduct a larger consumer study to include more covariates in a future modeling stage, such as additive usage, the washing cycle of every wash, and the load size per wash. This is our next step. Next slide. Now we would like to take questions if you have any. Thank you very much for attending the presentation. We look forward to your questions, probably at the JMP Summit.
Effective data analytics is critical for the success of product development. Product development processes for adhesives and sealants involve data analytics for tasks needed at various project phases. Too many formulation chemists still depend on Excel for daily data recording, formulating, and analysis. While some chemists use JMP for statistical analysis and experimental design, JMP’s powerful and versatile data analytics should be used more universally for many daily tasks in product development. When shown how widely JMP can be used, more formulation chemists should see the benefits of adopting JMP as a daily tool.

This presentation shares examples of how easily JMP can be used in daily data analytics, in addition to the statistical analysis and DOE that most chemists use. A variety of examples are given, including handling data from literature and patent searches, analysis of raw materials, and a variety of formulation-related tasks such as formula stoichiometry calculation, mixing volume balance, data recording, and tabulate/graph analysis. In addition, this presentation demonstrates how JMP is used for product selection, application troubleshooting, failure mode analysis, DOE, and more.

Hi, this is Stone Cheng. I'm a technical director at Henkel Corporation, and I've been using JMP for more than five years in product development. Today I'm happy to share our best practices in utilizing JMP data analytics in the product development of adhesives and sealants. My presentation has two parts. Part one is an application gallery, where JMP used at various stages of product development will be discussed, and in part two I will focus on using JMP as a formulation worksheet, with a demonstration. In my organization, folks either have not heard of JMP or have the impression that JMP is an advanced DOE software. For the latter, it is true, but it's not the whole truth. Since there are other DOE software options, it is hard to persuade folks to switch away from the tools they are familiar with. JMP is an excellent all-around data analytics tool, and to promote JMP adoption we need examples that demonstrate its power in data analytics. In my presentation, the examples are taken from my experience in adhesives and sealants.

About Henkel: Henkel is a 22-billion enterprise with two business units, adhesive technologies and consumer brands for laundry, home, and beauty care. I'm in the adhesive technologies unit. We are the global adhesive leader, serving 800 industries with more than 20,000 products.

Let's start with case number 1. When a customer has a need, they may call a service center asking for a product recommendation. For example, a customer may need a room-temperature-cure adhesive with a shear strength between 1,500 and 3,000, a work life between 15 and 30 minutes, and a package size of 10. What we need is a searchable product selection guide, which is what case 1 shows. Once the product information is entered into a JMP table, one can use Tabulate and a local data filter to down-select products based on the customer requirements, as shown here. This is a very, very powerful tool in the pre-formulation stage.
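Outside of JMP, the same down-selection can be sketched with a pandas filter. The table layout, column names, and criteria below are illustrative assumptions based on the example inquiry above, not an actual product guide.

```python
import pandas as pd

# Assumed selection-guide table: one row per product with its key specifications
products = pd.read_csv("product_guide.csv")  # product, shear_strength, work_life_min, package

candidates = products.query(
    "shear_strength >= 1500 and shear_strength <= 3000 "
    "and work_life_min >= 15 and work_life_min <= 30 "
    "and package == 10"
)
print(candidates.sort_values("shear_strength"))
```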
My team has been applying this tool to search formulations, patents, literature, and raw material specifications.

In case number 2, multivariate analysis is applied to silver fillers, which are used in making conductive adhesives. From the multivariate analysis, if we focus on the surface area, it has a moderate negative correlation with the tap density, and a stronger positive correlation with the lubricant amount, as measured by the weight loss at 530°C. There is a graphical view here and a table format here. With this analysis, the chemist can select the right silver package for the application.

Case 3 uses literature data. Literature is a great place for learning chemistry and formulation, and this particular cited paper illustrates how epoxy tensile strength is affected by a mixture of two amine hardeners. The results are described qualitatively by a table and a graph of the stress-strain curves here. For a formulator, a qualitative description is good, but learning via quantitative modeling is even better for prediction purposes. In case number 3, we take the literature data and create two factor columns right here. One is the epoxy/amine stoichiometry, which we call the index; the other factor is the fraction of one of the amines, PAE, relative to the total amine. With the Fit Model platform, using an RSM model, we showed that the quadratic effect of the index, together with the two main effects, are all significant. The prediction profiler and the contour profiler are used to quantify the learning and give predictions.

In case number 4, my group was assigned to support a technology platform that includes about 30 products. Since we were not involved in the original formulation development, studying the formulation family in this case is not trivial, and looking at a big Excel table with all the formulations is certainly not effective either. Case number 4 is an example of addressing this challenge. I selected the three most used ingredients in these 29 formulations, monomer 1, monomer 2, and an oligomer. Then, using hierarchical clustering analysis, the formulas right here were identified as having a very small distance, implying that they are closely related. In fact, they differ only in the photoinitiator, chosen for different wavelengths. We can add more ingredients one by one to the hierarchical clustering and learn the formulation family using this method.

Most chemists analyze formulation performance in Excel. Case number 5 is a JMP Tabulate that has the same data format as Excel: the various pieces of information for a formula are displayed in the same column. What you see here has a heading, a recipe, the processing and material characteristics, and all the results. To create such a data structure, we need to enter the data into the JMP table in a special format, which will be discussed in detail in part two of my presentation.
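For the clustering idea in case 4, a rough open-source analog is hierarchical clustering in SciPy on the ingredient columns. The file, the column names, and the number of clusters below are assumptions.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

wide = pd.read_csv("formulation_wide.csv", index_col="formula_id")
X = wide[["monomer_1", "monomer_2", "oligomer"]]       # top three ingredients, wt%

Z = linkage(X.values, method="ward")                   # hierarchical clustering on the recipes
labels = fcluster(Z, t=4, criterion="maxclust")        # cut the tree into 4 formulation families
print(pd.Series(labels, index=X.index).sort_values())
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree, analogous to the JMP report
```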
Case 6 is a silicone study involving 23 formulations and more than 10 measured properties. It is quite overwhelming to try to analyze the raw data in such a big system. We first construct a series of graphs with the property retention on Y and the initial property on X. For example, in the first graph here, the adhesion retention is on Y and the initial adhesion is on X. We also add reference lines, the acceptance limits, for each axis. When a formulation is selected, for example, I choose this point, which is in the quadrant with acceptable initial adhesion and retention, then all its associated properties, such as tensile strength, elongation, and hardness, show up, and the formulation shows up at the same time, all thanks to automatic highlighting. This is all thanks to JMP's dynamic linking capability. Visual analysis in this way is very effective for chemists to understand the overall behavior of the system.

In polymer science, we measure the modulus of a polymer as a function of temperature with an instrument called a dynamic mechanical analyzer, DMA. DMA data, the temperature, modulus, and tan delta, are typically transferred to Excel in a wide format for plotting. Overlaying several DMA curves for comparison is doable in Excel, but it is not a trivial effort. In case number 7, we stack 74 DMA results together, and by using Graph Builder we can compare DMA results very quickly, just by clicking. I cannot imagine doing the same in Excel with 222 columns, which is basically 74 samples times 3 signals per sample; it would be very difficult to handle in the Excel environment. Graph Builder is excellent at turning a very complicated Excel graph into a visually digestible analysis. In case number 8, the needle pull strength is illustrated in Graph Builder using four variables: eight adhesives across the top, three different radiation systems on the Y axis, and four radiation times combined with two needle hubs. See how easy it is to understand this JMP graph compared to the Excel graph right here.

In case number 9, we are conducting an accelerated aging study for four epoxy prototype formulations by measuring their initial adhesion on three substrates with three replicates. The aging conditions are two different temperatures and an eight-week aging time with a two-week testing interval. This aging design and the data were initially recorded in Excel, and we converted the Excel data into a JMP table with only seven columns in a stacked format, and then we made a graph. You will agree that the visualization in JMP Graph Builder makes it much easier to see the aging performance than looking at the busy Excel table here.
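The stacking step behind cases 7 and 9 has a simple pandas analog, shown below. The wide-file layout and the column-naming convention are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed wide DMA export: a temperature column plus modulus_<sample> and tandelta_<sample> columns
wide = pd.read_csv("dma_wide.csv")

long = wide.melt(id_vars="temperature", var_name="signal_sample", value_name="value")
long[["signal", "sample"]] = long["signal_sample"].str.split("_", n=1, expand=True)

# Overlay storage-modulus curves for a few samples, like Graph Builder with a local data filter
subset = long.query("signal == 'modulus' and sample in ['S01', 'S07', 'S19']")
subset.pivot(index="temperature", columns="sample", values="value").plot(logy=True)
plt.show()
```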
In the formulation stage, we frequently need to optimize compositions. Case number 10 is an example where two catalysts in a polyurethane are optimized with a DOE design. The factors are the catalyst ratio and the catalyst total amount. It is a 10-run face-centered central composite design, and the prediction profiler indicates that the catalyst total amount has a nonlinear effect on the work life. The white area in the contour profiler is the sweet spot of the design space with the desired work life, in this case 20-28. It is important for the chemist to select this green-highlighted white area for better production robustness, rather than the area [inaudible 00:12:10] in blue, because it has a higher tolerance for changes in the amount in case an operator makes a minor mistake. That is why it gives better production robustness.

We routinely see chemists perform statistical analysis of adhesion data like what you see here, but we hardly see anybody presenting results of failure mode analysis. In JMP, failure mode analysis can be performed in two places: one is the contingency analysis in the Fit Y by X platform, and the second is Graph Builder. Case number 11 is an example applied to a silicone sealant, where the change in failure mode before and after high-temperature aging is clearly shown.

Besides good adhesion, the adhesive rheology needs to be formulated so that it can be effectively applied to the substrate. We have a project to develop a seam sealant to protect metal joints by a spraying process. Case number 12 is an example illustrating the complication of the spraying process with respect to sealant viscosity variation. We have three spraying variables, the pressure, the nozzle type, and the head travel speed, plus one material factor, which in this case is the viscosity. Initially, we planned the experiment in a factorial DOE fashion, but one of the factors, the spraying pressure, was very hard to control. We ended up performing a regression on the 40 runs with an RSM model, using the width and thickness of the sprayed seam as the responses. We get a good model, with an R-squared of about 0.91 for width and 0.81 for thickness. The modeling result tells us that the spraying conditions need to be adjusted depending on the sealant viscosity, which is illustrated in the prediction profiler here. Each processing parameter has its own machine limits and desirable operating limits. If the predicted processing variables are outside those limits, then the chemist will need to redesign the formulation rheology and make sure that manufacturing has the processing capability to meet the viscosity requirement. This example shows that formulation design and application constraints need to be considered side by side, and JMP is a very good tool for facilitating this type of study.
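For reference, a 10-run face-centered central composite design like the one in case 10 can be written out directly in coded units. This is a generic sketch; the actual JMP design (run order, number of center points) may differ.

```python
import itertools
import pandas as pd

factors = ["catalyst_ratio", "catalyst_total"]

corners = list(itertools.product([-1, 1], repeat=2))   # 4 factorial corner points
axial = [(-1, 0), (1, 0), (0, -1), (0, 1)]             # 4 face-centered axial points
centers = [(0, 0)] * 2                                 # 2 center points, 10 runs in total

design = pd.DataFrame(corners + axial + centers, columns=factors)
print(design)   # coded -1/0/+1 settings; map to actual ratio and amount ranges before running
```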
Case number 13 is an example where JMP is used to handle huge instrumentation data sets. In testing thermal interface materials, the temperature at different locations and the power consumption data are collected and then uploaded to JMP. Once the data are in a JMP table, visualization and analysis of a data set with as many as 500,000 rows is still very manageable, with a fast response. That means JMP can indeed be used to handle instrumentation data.

We have a project to apply adhesive for a sensor application by a printing process, and this application is challenged by pinhole defect issues. A process engineer changed six processing variables somewhat randomly and collected results from 21 runs. His data analysis did not reveal any particular trend, so JMP was then used for the troubleshooting. In this case, the partition analysis identified factor F as the key factor. Later on, we applied the predictor screening analysis and identified an additional factor, D, that needed further investigation. From JMP training, we learned that predictor screening can identify predictors that may be weak alone but strong when used in combination with other predictors.

In the scale-up and manufacturing stage, when a production batch runs into an issue, raw material lot-to-lot analysis is one of the troubleshooting items used to isolate a potential raw material effect. This exercise is typically done in an Excel table, but when multiple raw materials and multiple lots of each raw material are involved, it is difficult to analyze the raw material effect from a huge Excel table. In case number 15, a polyester formulation with three raw materials and about 45 separate lots is plotted versus the date of manufacturing, with a color scale for the gel time. This heat map provides a visual analysis for the production engineer to determine whether a particular lot of raw material is the major cause of an out-of-spec batch. We turned the Excel table into a visual analysis.

Statistical comparisons with t-tests or ANOVA are performed routinely in product development. A product benchmarking exercise typically involves multiple products run under various testing protocols, aiming for a very comprehensive product comparison. Case 16 is an example of statistical analysis involving a large combination of 23 products and more than 10 testing protocols. In JMP, a large volume of statistical analysis is not a challenge, since creating sub-tables is not required, as it is in some other software. One can use the column switcher and the local data filter to create all the combinations of property and adhesive for statistical analysis, and the results of each analysis can be copied into a JMP journal to streamline the reporting. For case number 17, the needle bond testing of light-cure adhesives has historically shown high data variance. Case 17 uses JMP to summarize 18 reports of needle bond testing, which involve multiple lots of adhesive tested at various times. The needle pull strength and its COV are plotted in Graph Builder under various light irradiation conditions as well as substrate combinations. With the local data filter here, one can easily change the selection criteria to get a comprehensive comparison of these adhesives and their consistency. When these results were presented, everyone was amazed by the JMP capability; it is so versatile and so powerful.
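As a rough analog of the partition and predictor screening steps used in the pinhole troubleshooting example, one can rank factors with random forest importances; JMP's Predictor Screening is itself based on a bootstrap forest, so the spirit is similar. The data file and factor names below are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

runs = pd.read_csv("pinhole_runs.csv")          # assumed: factors A..F plus a pinhole_count response
X = runs[["A", "B", "C", "D", "E", "F"]]
y = runs["pinhole_count"]

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)   # factors that matter alone or in combination rise to the top
```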
This is the last case for the application gallery. In case number 18, we use the parallel plot feature in Graph Builder to give a visual comparison of 15 performance items across 10 adhesives. Each performance item has its own unit and scale, which provides a more quantitative visual comparison, in contrast to the qualitative comparison of the spider charts used in Excel.

So far, in the 18 application gallery examples, the data have come from literature, instrumentation, and processing, with not much emphasis on formulation. Now we will switch gears to discuss formulation creation using worksheets, and this is a JMP-based worksheet, not a traditional Excel one. Before we show the JMP worksheet, let's discuss adhesive types. Broadly speaking, adhesives can be divided into two categories, one-component or two-component adhesives, 1K or 2K. A 1K system, like the Super Glue everybody knows, requires no mixing, and it can be cured by moisture, light, heat, or other methods. If we are dealing with a one-component, heat-cure adhesive such as an epoxy, we will need to design and calculate the stoichiometry, or the index, to balance the proportion of epoxy to amine hardener. In a two-component, 2K system, the mixture reacts at ambient temperature, so the two parts are kept apart before use, and the stoichiometry needs to be designed and calculated based on the desired mixing ratio, either by weight or by volume. These are the formulation calculations we need to perform.
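To make the stoichiometry idea concrete, here is a minimal sketch of an epoxy/amine index calculation. The equivalent weights and the exact index convention are illustrative assumptions, not the formula used in the Henkel worksheet.

```python
def stoichiometric_index(epoxy_g, eew, amine_g, ahew):
    """Amine-hydrogen equivalents divided by epoxy equivalents (1.0 = balanced)."""
    epoxy_eq = epoxy_g / eew    # grams of resin / epoxy equivalent weight (EEW)
    amine_eq = amine_g / ahew   # grams of hardener / amine hydrogen equivalent weight (AHEW)
    return amine_eq / epoxy_eq

# Example: 100 g of an EEW-190 resin with 28 g of an AHEW-55 hardener
print(round(stoichiometric_index(100, 190, 28, 55), 2))   # about 0.97, close to balanced
```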
This type of calculation and design has historically been done in Excel. Everybody knows that an Excel spreadsheet allows mixed data types in the same column and that formulas can be applied at the individual cell level, which makes it very flexible as a formulation calculation worksheet. Formulas are typically organized in a column format like this: each column carries a full set of formulation information, the heading, which is the ID, the recipe ingredients, the formulation characteristics or processing parameters, and then the results. An Excel-based worksheet is very useful, and everybody uses it because it is easy to learn, but it does come with some shortcomings. First is row matching: when you have a new ingredient or new testing results, you need to match them to the right row, and that takes time. Second, you may need to hide and unhide columns for comparison. Third, it is harder to analyze the data when results are put in different tabs, a tab-to-tab format, and it is also very difficult to make a graph from that kind of data structure. JMP offers webinars on going beyond the Excel spreadsheet with the various features listed here, but worksheet calculation is not emphasized. Perhaps this is due to the inherent data structure: a JMP column cannot have mixed data types, and a column formula applies to the entire column, which is not as versatile or flexible as Excel.

Despite these constraints, we have developed a JMP worksheet with the following objectives in mind. It should have broad capability for formulation design, calculation, recording, and analysis; it is all in one, and we want to minimize cross-platform copy-pasting. It should be easy to operate, with easy data entry, and use JSL for much of the automation. And the final data set should be ready for machine learning exercises.

Let's look at our Gen1 worksheet, which is for one-component systems. It includes four data groups: a formulation ID, a recipe, the material and processing characteristics, and the testing results. The four data groups are the same as in the earlier Excel worksheet, but the structure is organized in columns from left to right, which is different from Excel, where it runs from top to bottom. The data from three of the groups, 2, 3, and 4, are shared and recorded in the same column, which has a numeric data type: all the recipe entries, all the testing results, and all the formulation characteristics are numeric, and they are documented in the same column here. The data are also stacked together: formulation 1 here, formulation 2 here. With a stacked format, one can freely enter a new ingredient or a new testing item without needing to match the row, as you would in Excel. JSL scripts were also created to enable data analysis in either a tabular or a graph format. This is the tabular view: a chemist can pick several formulation IDs and compare their recipes, characteristics, and performance in a very condensed format here. This is very different from Excel; there is no need to hide and unhide columns to bring formulations next to each other. It is much, much easier in the JMP format. Besides tabulation, one can make a graph of a property versus other properties or the sample ID, but not versus the ingredient percentage. The graph can be combined with the recipe table here into a group under the dashboard operation, which makes for a very effective visual analysis. As for testing that involves multiple replicates, we typically just record the average result, but one can enter the individual replicate data in the property column and then perform a t-test or one-way analysis using this worksheet. If people don't want to enter data this way, there is another option, virtually linking a data file with the replicate results to the worksheet; that will be shown later in the presentation. So far, what you have seen is our Gen1 worksheet, which involves no formulation calculation. Chemists in my group have been using this tool for more than one year, and they are used to its easy data entry and very powerful tabulation analysis.
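The same stacked-table idea can be sketched in pandas for readers without JSL. The layout (one row per formulation/item pair) and the column names below are assumptions.

```python
import pandas as pd

# Assumed Gen1-style stacked table: formula_id, group (recipe / characteristic / result), item, value
stacked = pd.read_csv("gen1_worksheet.csv")

# Tabulate-style comparison of two selected formulation IDs, side by side
compare = (stacked[stacked["formula_id"].isin(["F-002", "F-008"])]
           .pivot_table(index=["group", "item"], columns="formula_id", values="value"))
print(compare)
```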
Next we're going to look at the Gen2 worksheet, which carries over the Gen1 features. It has an additional feature for the formulation calculation for the 1K and 2K systems. This worksheet also links with another JMP file that has additional raw material information needed for the calculation. We have another worksheet, which we call Gen3, that is designed to deal with solvent-borne systems. It also allows the formulator to incorporate masterbatches, but due to the time constraint it will not be discussed here. This is our Gen2 worksheet. There are three sections. We have a heading and then the formulation input section right here. In the middle, we have the calculation output. The third section is the processing and material characteristics and also the testing results. Section 1 and section 3 are like the ones in Gen1, but section 2 here is newly added. The Raw Name column is used to link to the reference file that has the additional data needed for the calculation. You can see the symbol for the virtual link right here. After the chemist enters the formulation ID, they specify the Part, Raw Name, and Initial Weight columns. If they are doing a 2K system, they also need to specify the mixing ratio, either by index, by volume, or by weight ratio. Then the worksheet outputs the mixing ratio characteristics here, again by index, by volume, or by weight. It also provides a normalized composition, either by part or by total. By part means A and B each sum up to 100 by themselves; normalized by total means A and B mixed together sum to 100. After seeing this, the chemist can perform the experiment and then come back to enter the results right here. The other thing is that in the property and material characteristics section, we have another column called Lookup. This can extract information from the calculation, as well as the raw material fraction percentages, and automatically display it right here. Then chemists just need to copy the parameters into the Value Enter column, and these are automatically transferred to the two normalized-percentage columns for display purposes. We also have three JSL scripts there to facilitate the analysis. The first one shows you normalization by total. That means A and B are mixed together and sum up to 100. Here, I show you the formula, the characteristics, and the results. There is a second JSL script that normalizes by part. In this case, you can see your part A formulation and part B formulation, and then A and B have each been normalized to 100 by themselves. With the other JSL script, we can change the formulation worksheet format from the stacked to the wide format. In this case, the ID, the performance, the individual ingredients, and the characteristics all have their own individual columns. With this format, one can make a graph of a property versus the ingredient percentage, which cannot be done under the stacked format.
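A rough Python sketch of the mixing-ratio and normalization outputs described above is shown below. This is not the Gen2 JSL; the recipe rows, ingredient names, and weights are hypothetical, and the point is only the arithmetic behind "normalized by part" versus "normalized by total."

```python
import pandas as pd

# Hypothetical 2K recipe in stacked form: each ingredient belongs to Part A or B
# and has an initial weight in grams. Names and numbers are placeholders.
recipe = pd.DataFrame(
    {
        "part":   ["A", "A", "B", "B"],
        "item":   ["Epoxy resin", "Filler", "Amine hardener", "Accelerator"],
        "weight": [80.0, 20.0, 45.0, 5.0],
    }
)

# Mixing ratio by weight (A divided by B), one of the outputs the worksheet reports.
part_totals = recipe.groupby("part")["weight"].sum()
mix_ratio_by_weight = part_totals["A"] / part_totals["B"]

# Normalized by part: A and B each sum to 100 by themselves.
recipe["pct_by_part"] = recipe.groupby("part")["weight"].transform(lambda w: 100 * w / w.sum())

# Normalized by total: A and B mixed together sum to 100.
recipe["pct_by_total"] = 100 * recipe["weight"] / recipe["weight"].sum()

print(mix_ratio_by_weight)   # 100 g / 50 g = 2.0 in this made-up example
print(recipe)
```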
One can also look for correlations between performances, or between a performance and a formulation characteristic. At this moment, I'd like to show you the live demonstration. This is the formulation worksheet I just showed you in the PowerPoint. Basically, we have the heading. Then we have a formulation input section. We have the calculation section between N1 and N2; anything here is for calculation. Then we have the last section here, which is the performance and the property and material characteristics. I mentioned that we have JSL scripts that allow people to look at these results easily. Let's look at this one, the JSL by total. We can easily highlight any formulation, or compare 2 and 8, and then compare their formulations and their results. These are mixed together. We can look at it by part: part A here and then part B here. They each sum up to a hundred by themselves. Easily, we can compare... oh no, I need to remove this one first. I can compare formulations easily by manipulating the local data filter. Again with the JSL, we click Join All. We are turning the stacked format into a wider format. Each row belongs to one formulation, with the heading here, the properties, the formulation, and the formulation characteristics showing right here. For machine learning, we can highlight an ingredient column and just manually add zeros so that each formulation has a value, zero or whatever, for every ingredient, and then we can work with this. We can create a summation or something; it is easy to operate on. Next I'm going to show you how this works, assuming that we're going to create a formulation. I'm going to copy the heading. Sorry, I'm going to delete everything here because I created this one already before; I'm going to delete the demonstration one. I'm going to create it from scratch by copying the heading here. I change the name to Demonstration here. I will copy the formulation because I'm going to modify a formulation from this one, DOE 8. DOE 8 is based on a one-to-one mixing ratio by volume, but in this new one we could change it to one-to-two mixing. A divided by B is one divided by two, so it will be 0.5. Then I copy the heading, including the mixing ratio, all the way down. Now all the calculation has been done here. With the weight percentages I'm entering, it shows that the material has an index, the molar ratio of A to B, of 0.65, which is too low. Using our chemistry knowledge, we need to turn this around. In this case, for example, I make this one 2, and I can easily bring this up to 1.05. That is the range I'm looking for. Basically, assuming this is the design and the formulation that we want, the next thing we want to do is copy some of the testing items that we already had before, that we were monitoring before, but without the results, of course. We have a new result here, so I'm going to delete that one. But we also want to add an additional property, which for example is viscosity measured at room temperature.
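The Join All step in the demo reshapes the stacked worksheet into one row per formulation. A rough pandas analogue, not the JSL from the talk, is sketched below; the small stacked table and its values are hypothetical, and the zero-fill mirrors the manual "add zeros" step mentioned above so every formulation gets a value for every ingredient.

```python
import pandas as pd

# Hypothetical stacked worksheet rows (same layout idea as the earlier sketches).
stacked = pd.DataFrame(
    [
        ("DOE-8",         "Epoxy resin",    60.0),
        ("DOE-8",         "Toughener",       8.0),
        ("DOE-8",         "Adhesion",      450.0),
        ("Demonstration", "Epoxy resin",    55.0),
        ("Demonstration", "Adhesion",      430.0),
        ("Demonstration", "Viscosity",   20000.0),
    ],
    columns=["formulation_id", "item", "value"],
)

# "Join All" analogue: one row per formulation, one column per ingredient/property.
wide = (
    stacked.pivot_table(index="formulation_id", columns="item", values="value")
    .fillna(0.0)   # in practice you might zero-fill only ingredient columns, not test results
    .reset_index()
)
print(wide)
```

With the data in this wide shape, a property can be plotted or modeled directly against an ingredient percentage, which is the machine-learning-ready structure the talk refers to.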
With this section here, we then want to extend our heading to specify that those rows belong to this formulation. As soon as I specify the heading, the Lookup automatically gives me information such as the mixing characteristics: 1.5 or 0.5 are automatically copied here through the Lookup function, and then the filler loading in the formulation, normalized to total, is also extracted, summed up, and put right here. Now I can copy this information, put it in Value Enter, specify my mixer as number 2, and then start to enter my results: time, say 80; adhesion, 450; viscosity, 20,000. I'm pretty much finished with everything, so let's look at the result here. We just enter Demonstration. This one was based on DOE number five. DOE number five is one-to-one mixing and this Demo is one-to-two mixing, and we added the viscosity result right here. It's very easy: with one click you see the result, in a format that's very easy to understand for comparison. This is the end of my demonstration. Let me go back to the presentation here. We consider the JMP worksheet that I just showed you an integrated platform, and here is the summary. The worksheet in the stacked format, here, is used for formulation design, calculation, and recording of the results. The data entry of raw materials, which is needed for the worksheet, is minimized by virtually linking with another file that has the additional raw material information. JSL is widely used to automate the worksheet output into tabulation, graphics, and statistical analysis, and also to create a table in the wide data format. The wide data format already has the right data structure for modeling via machine learning and also allows graphical analysis using an ingredient as one of the axes. Then, since each row in this wide format is a unique formulation ID, it can actually be used as a reference table to join another JMP file that has testing results with replication. When these are joined together, we can plot the raw data and do statistical analysis, either as a function of the ingredients or as a function of the formulation ID. This JMP integrated worksheet platform truly illustrates that it is an all-in-one platform, very capable. In summary, JMP is not just advanced DOE software. JMP's data analytics have been effectively utilized in my group for product development at various stages to speed up the innovation process. The JMP-based formulation worksheet is an integrated platform that features broad formulation capability, all in one, easy operation, and a machine-learning-ready data structure, with more and more waiting to be further explored. With this, thanks for your attention. I'd also like to acknowledge the people I work with and learn JMP together with, and also our management for supporting the JMP adoption initiative. Thank you very much.
The development of innovative new products can be accelerated using statistically optimized DOE and regression modeling. With the goal of maximizing efficiency and reducing expense, it is often tempting to limit the collected data to key product attributes, such as customer specifications or internal quality metrics. However, increasing the number of available responses by including a wider range of more fundamental measurements in the analysis can often be critical to success.

This talk covers examples from projects at PPG's Coatings Innovation Center. We highlight the use of JMP from design through analysis to visualization of the results for a fractional factorial and a constrained mixture/process design. Using tools such as the column switcher, multivariate plots, pairwise correlations, and the mixture profiler, we demonstrate how the inclusion of ancillary responses helped to develop a deeper understanding of the systems being studied and pinpoint the causes behind unexpected results.

My name is David Fenn. I work for PPG Industries at their Coatings Innovation Center just outside of Pittsburgh. Today I'm going to be talking about some of our experiences with using DOE for novel coatings development, particularly focusing on the importance of using auxiliary responses. On the agenda, I'll talk a little bit about research methodology, and in case you're wondering what I mean by auxiliary responses, I'll define that in this section. Then we'll go on to two examples: the first one, a new resin development for architectural coatings, and the second one, a protective coating. Then we'll finish off with a few general observations. There are various frameworks that can be used to describe the new product development process. One that I particularly like is shown here: DMADV. Here we have five stages. The first stage is to define the goals of the project. What are we trying to achieve? Then we get into measurement. What are the critical characteristics we need to measure, and do we have suitable processes in place to measure them? Then we think about analyze: what factors can we change to make improvements? Then on to the design stage, where we deliberately manipulate those factors and the levels of those factors to try and effect an improvement and lead to optimum performance. Then once we have an advanced prototype, we get on to verification, thinking about whether our solution will work in the real world. What are the important things we need to think about when we apply this framework? Well, in terms of the define stage, the goal needs to be clear, so that the whole organization has the same understanding of the goals, and it needs to be impactful. If we're successful and we deliver a solution, will it fill a real unmet need in the marketplace and be a successful product? If we skip now to the end, any solution we apply needs to be cost-effective. It needs to be robust. Then, in the middle of this process, we want to get through this process as quickly and as efficiently as we can. We want to deliver the product to the marketplace as soon as we can, and we want to expend the minimum amount of cash and the minimum amount of resource to do that.
Clearly, DOEs and a lot of the tools that are available in JMP are well set up to make us succeed in this area. One of the tools that I like to use, particularly early on in a project, is a process map. This is a very particular type of process map: it's really mapping the process of carrying out the research and development. I'm showing here a simplified example of a process map to develop an automotive base coat. We have all the steps that are involved in our experiment. We make a resin, we use that resin to make a base coat paint, we spray-apply that base coat onto a substrate, we apply a top coat onto that base coat, then we cure those coatings together, and then we measure the properties that we get from all of that. For all of these steps, we list all of the factors that might play a role in these separate steps. This is useful for a number of reasons. First of all, it gives everybody in the team a unified understanding of what the process is we're dealing with and how we are going to affect it. It also allows us to capture all of the variables we can think of that might play a role in the various steps, so we don't overlook anything. Then it's a good starting point for thinking about which of these we are going to try and manipulate, which of these we are going to focus on to try and deliver a successful project. These factors are further subdivided and categorized. First, we have our Xs. These are the variables that we can manipulate to try and effect an improvement in our product or our process. Then we have our big Ys. These probably appear in the specification of the product. These are what we're really trying to achieve. This is what the customer really cares about, what the customer will pay for. Next, we have our Ns, noise variables. These could be variables that we are maybe not controlling, that we're not deliberately manipulating, but things that could introduce noise into the process, either during the experiments, during the new product development, or in the end application, in the manufacture of the product or the end use of the product. Then finally, the subject of today's talk, we have our auxiliary responses, which we label as little ys. These might not appear in the specification, and the customer might not even be aware of them, but they're measurements we can take at various stages of the process that might tell us something about what's going on. I said in the previous slide that one of our goals is to get through this whole process as quickly and efficiently as we possibly can. One question that raises is, why don't we just measure our big Ys? We have the ability to carry out DOEs. We could optimize for our big Ys, we could build predictive models. Isn't that all we need to do? Why should we spend time, why should we spend money, measuring some of these little ys when they're not the real goal of the outcome? Well, I hope in the next couple of examples that I can show you some cases where carefully selecting these little ys and doing some good analysis can be really critical to the success of a project.
Our first example here is the development of a new resin for architectural coatings. The goal was to come up with a single resin that could meet all of the performance requirements across several product lines in several countries. Our starting point was that we had no single resin that could meet all those requirements. We were using different resins in different products, different resins in different countries, and we needed to come up with a solution that allowed us to reduce the complexity there. Our early prototypes struggled in a number of areas, but one particular area was tint strength. The way these white base paints would be used is, if I go into a store and request a paint of a particular color to paint the walls of my house, the store will take that white base paint and add specified amounts of concentrated color toners to that paint to create a specific color. It's really critical to be able to hit a target tint strength, which is the measurement of how quickly that color will change as we add a certain amount of a particular toner. We need to be able to control that and hit it reproducibly to achieve the wide spectrum of colors we need to achieve. We also had a few issues in terms of poor heat-age stability and poor resin reproducibility. Our approach was to carry out some sequential DOEs to learn how to control tint strength and some of the other factors. I'm showing the progress on this plot at the bottom left-hand side of this screen. Before we started the DOEs, in some of the exploratory experiments, the orange bar represents the range of tint strengths we were able to achieve. We can see that is far below the target range of tint strengths shown by this green bar on the plot. As we carried out the DOEs, we learned how to control tint strength. We were able to increase it until, towards the end of the project when we were doing our optimization DOEs, we were nicely centered around this target tint strength. We were able to build predictive models and use those in conjunction with predictive models for some of the other key properties to identify white space where we met all of the target properties at the same time. But rather than talk about the whole project, I want to now focus on one particular DOE that we carried out. The goal of this DOE was to confirm and quantify something we'd observed previously: that the particle size of the resin we were making was a big factor in controlling tint strength. These resins are, in effect, dispersions of little particles of resin in water, and it was the size of those particles that seemed to be important. We were also using what we call a co-surfactant to help disperse those particles, and we had a few choices about where in the process we could add that co-surfactant. We wanted to look at a couple of candidates for the addition point of that co-surfactant to see if it affected the key properties. Then finally, up until this point, all the resins we'd made, we'd made at the Coatings Innovation Center.
We now wanted to check: could we make these resins reproducibly across three different locations? The DOE we carried out is shown on the right-hand side here. We have three levels for our target particle size. We have two levels for the addition point of the co-surfactant. That gives us a full factorial DOE with six runs. Then we replicated that DOE across three different laboratories. I'll go straight into JMP and I'll show you what the data table looks like. You can see here we have the original data table, the DOE, but now we have a whole collection of data that we gathered during the DOE. The first thing we'll do is look at what we learned about tint strength. I've already built here a reduced model for tint strength. If we have a look at the effect summary to start with, we can see that the location of addition of the co-surfactant wasn't a factor in determining tint strength; that dropped out of the model. But we do see that the target particle size and the reactor location were factors, as well as the interaction between target particle size and reactor location. If we look up at our actual by predicted plot, we can see it looks like a pretty nice model. We've got a nice R-square, and everything looks to be in pretty good shape. Then probably the best way of understanding what's happening and what this model is telling us is to look at the prediction profiler here at the bottom. We see our anticipated effect of target particle size on tint strength: as we increase target particle size, we get higher tint strength. Then if we look across at reactor location, what we see is that Lab A and Lab C are giving broadly similar results. But if we look at Lab B, first of all, we see that the tint strength that we get from Lab B is significantly higher than we were getting from Lab A or Lab C. We also see that the dependence on particle size is much less for Lab B than we saw for the other two labs. This was a problem for us. Whenever we see that different labs are producing different results with the same resin and the same process, it can be a really long task to work out what's going on. There are so many potential candidates for the cause of this poor reproducibility. At this stage, we were very concerned that it was going to take us a long time to resolve this, that it was going to derail the project, and that we were going to miss our target launch dates. Before we went into any specific activity to try and address this problem, the obvious first step was to look at the data that we'd already gathered in this data table and see if there were any clues that could maybe give us a hint as to why Lab B was giving different properties. Whenever I see a wide data table like we've got here, one of the first tools that I always go to is the column switcher. The way that I will implement this in this case is, the first step is to build a variability chart that best shows the problem that we're having.
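For readers working outside JMP, a rough Python analogue of fitting the reduced model described above is sketched here with statsmodels. The data file name and the column names are assumptions for illustration; the point is the model form, a continuous target particle size crossed with a categorical reactor location.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the replicated DOE data table; column names are assumptions.
df = pd.read_csv("tint_strength_doe.csv")

# Reduced model: particle size, reactor location, and their interaction
# (the co-surfactant addition point dropped out of the model in the talk).
model = smf.ols(
    "tint_strength ~ target_particle_size * C(reactor_location)", data=df
).fit()

print(model.summary())   # coefficients, R-square, and so on
# The interaction terms are what capture Lab B's weaker dependence on particle size.
```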
I've pre-built a variability chart here where I've got target particle size and reactor location as my X-axis, and I've got the initial tint strength as my Y-axis. The first task is to get this into a format that best represents the problem we're dealing with. The first thing I'll do is swap over my target particle size and reactor location. I'll also add and connect the cell means to add some lines here. Now I'm pretty happy with this. I think this nicely reflects the problem that we're dealing with. We can see Lab A and Lab C give very similar results, but Lab B gives higher tint strength and less dependence of tint strength on particle size. Now I can use my column switcher, and what this will allow me to do is keep this plot in exactly this format but quickly switch out this Y-axis, the initial tint strength, for any other variable that I've got in my data table. I'll go into the redo platform and select the column switcher. Now I can select any of the other factors in my data table. I'm just going to select everything that I've got in my data table. Then when I hit OK, I now have this column switcher to the left of my plot. I can click on any of these factors, and it will change this axis but keep the plot in exactly the same format. If I select particle size, I can see now I'm plotting my actual measured particle size against target particle size and reactor location, in exactly the same format. It looks like in this case all three labs are giving pretty similar results. I'm not seeing anything that gives me a clue as to what's going on, but I can quickly just scroll through this whole data set. I'm seeing mostly noise right now. I'm not seeing any patterns that seem to be particularly helpful, but I'll keep going. When I get to this plot here, where I'm now plotting conductivity, I see, interestingly, that Lab B is making resins with much higher conductivity than Lab A and Lab C. That's one useful observation. I'll keep going. This next one is actually another measurement of conductivity after the resin has been in a hot room for a week, showing the same thing, still confirming that Lab B is giving higher conductivity. I'll keep going. Mostly noise, maybe a little bit of an indication that the molecular weight from Lab B is slightly lower. I'll keep going. Again, still not seeing anything that interesting, mostly noise. But then I get to this plot here, where we're now plotting the pH of the resins one hour into the process, so early in the process, the acidity or pH of the resin. Lab B, again, is different from Lab A and Lab C: it's giving me a much higher pH. I keep going just to check if there's anything else. This was the initial plot we started with, of initial tint strength, and then the last one is the paint viscosity, where everything looks pretty similar. Really quickly, using the column switcher, I found out that not only is Lab B making resins with higher tint strength, it's making resins with higher conductivity and a higher pH. What could that be telling us? What might be causing higher pH and higher conductivity?
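The column-switcher workflow above, keep the X structure fixed and cycle the Y variable through every candidate response, can be imitated in Python with a simple loop over the auxiliary columns. This is a hedged sketch, not JMP's implementation; the file and column names are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical export of the same DOE table; column names are assumptions.
df = pd.read_csv("tint_strength_doe.csv")

# Candidate responses to cycle through, column-switcher style: the X grouping
# (target particle size within reactor location) stays fixed, only Y changes.
responses = ["initial_tint_strength", "particle_size", "conductivity", "pH_1hr", "paint_viscosity"]

for col in responses:
    ax = sns.pointplot(data=df, x="target_particle_size", y=col, hue="reactor_location")
    ax.set_title(f"{col} by target particle size and reactor location")
    plt.show()   # same plot format each time; only the Y-axis is swapped out
```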
Well, these resins, as I said, are a dispersion of a polymer particle in water. Anything that's changing the conductivity or the pH is going to be in the water phase; it's not going to be in the resin phase. What we did was precipitate out the resin by centrifuge and just analyze the water phase. We carried out a lot of analysis, but one of the things we worked on I'm showing on this plot on the right-hand side: the PPMs, parts per million, of phosphorus and sulfur in that water phase. If I look at the orange bars to start with, the sulfur, I can see all of the resins from all three labs are very similar. But for the blue bars, the level of phosphorus, Lab B is making resins with about four times as much phosphorus as they were making at Lab A and Lab C. When we looked at the recipe for making these resins, there's only one raw material that brings in phosphorus. On a bit of further investigation, what we found out was that the supplier delivering this material to Lab B was mistakenly supplying something that was four times as concentrated as it should have been, and four times as concentrated as they were providing to Lab A and Lab C. Using the auxiliary data that we'd gathered in this DOE and the column switcher, we were able to really quickly pinpoint the cause of that problem. We didn't have to expend time to get there, the project stayed on track, and there was even a bonus: we learned that increasing the level of this material with the phosphorus was another tool we had to increase the tint strength. We would probably never have been aware of that if we hadn't carried out this analysis and had this happy accident. That's the first example. We go on to the second example now. In this case, we're dealing with a protective coating, a coating that's designed to go over metal substrates like iron and steel and protect against corrosion. We have five experimental resins that we want to look at: a resin that's designed to give good corrosion resistance, and then four resins that are designed to improve the flexibility of the coating. The first three of these resins are added in the first stage of the coating preparation, and then the last two get added in a separate, later step. We have two questions here we're trying to answer: how do the resins affect corrosion and flexibility, and what is the best combination of the levels of these resins to give us the best combination of corrosion and flexibility? Again, we used DOE, we were able to build predictive models, and here we were using the mixture profiler to identify some white space that we can work in. This DOE is a little bit more complicated than the first one, so I'm trying to represent pictorially what we were dealing with here. If we look at the first stage of our coating manufacture, in addition to our three experimental resins, we have a main resin at a fixed level. In effect, our three experimental resins are three mixture variables here because they form the rest of this 100%. They add up to a constant sum of 57.77%. So we have three mixture variables to deal with there.
Then in stage two, we can deal with our other two resins as independent variables because they're not part of that mixture. So we have three mixture variables and two independent variables. We also have some levels that the formulators decided they wanted to work in, based on prior experience, for all of these resins. Then we have some constraints on the combinations we're dealing with. For example, at the start here, we want the sum of Flex2 and Flex3 to be more than 10% but less than 30%. There are some other constraints as well. It's a fairly complicated DOE, but using custom design, it's relatively straightforward to build. There are definitely some tips and tricks in terms of how to build the DOE, what model to use, and how to analyze that data. I don't have time to go through that today, but I'd be perfectly happy to talk about that offline if anybody's interested. But let's go straight into JMP and we'll look at this example. Here we have the DOE that we carried out. It was a 16-run DOE. If we go right across to the left-hand side, we have our three mixture variables and our two process variables. We've measured our flexibility and corrosion, and then we have a lot of other auxiliary responses we've measured. I was able to build good predictive models for flexibility and corrosion. What I'm going to do is just show you those models in the profiler, to help us understand what we're learning and what's going on. I'll add those two predictive models that I built to my profiler, and then I get my profiler here. I can see, first of all, I'm plotting flexibility and corrosion here. Lower numbers are better for both of these responses: lower numbers for flexibility, lower numbers for corrosion are what we're targeting. I can see that as I add my corrosion resin, if I increase the level, I get better corrosion performance, but unfortunately, I get worse flexibility. The opposite is true for most of these flexibilising resins. As I add more of these, I'm getting better flexibility but worse corrosion. This is something that's very common in coatings development and lots of other areas. It seems like there's always a pair of properties where, if we improve one of them, we always make the other one worse. But if I come across to my Flexibiliser 4 resin, there's something really interesting here: as I add more of this resin, I get better flexibility, but I don't suffer at all in terms of corrosion. This is going to be a really useful tool for us to optimize the combination of flexibility and corrosion. But I'd like to understand a bit more about the science behind this. What's happening? What's unusual about Flex4 that allows us to improve our flexibility without degrading corrosion? Again, I want to use all of this auxiliary data that I've gathered in my data table to help me understand that. What I did is, I wanted to look through this table, and I'm going to use a different tool this time: multivariate. If I select that, this allows me to basically look at the correlation between all the combinations of factors that are in my data table.
I'll select everything that I measured, add it in the Y columns, and just hit OK. This generates my multivariate report. The first thing I see is this table here, where I've got all the correlations for all the pairs of combinations of the factors in my table. I can see there are some pretty nice correlations here. I'm seeing some fairly strong correlations, but it's a little bit difficult, a bit overwhelming, to go through all this and pick out any interesting patterns. I've also got my scatterplot matrix here, and if I add a fit line to these scatter plots, again I'm seeing some fairly strong correlations, but still I think this is a bit overwhelming to dive straight into. The tool that I like to use to start with here is pairwise correlations. If I select that, this generates a new table where I've got all the possible pairs of variables, and it gives me the correlation. I can sort this table based on any column. I'm going to sort by the significance probability, and I'll make it ascending because I want my low significance probabilities to be at the top of my table. Then if I hit OK, I can see that the first and strongest correlation I get, in fact, involves this Flexibiliser resin 4 that was giving us this interesting behavior. I can see a strong correlation with the secondary Tg, or Tg2. This is a glass transition temperature. The glass transition temperature is the temperature at which a coating changes from being a glassy, hard material to a soft, rubbery, flexible material. My Flex4 level is correlating with a secondary glass transition temperature that I'm measuring here. And I can see also, if I go a little bit further down, that my primary glass transition temperature, the main glass transition, correlates strongly with the corrosion. Scientifically, I think these are interesting observations. What I did based on that is I also built predictive models for my primary Tg and for my secondary Tg. Now I can look at my profiler, but I can include all four of my predictive models. I'll include the two I did before, flexibility and corrosion, but also my primary Tg and secondary Tg. Now what I can see is that the first two rows are exactly what we were looking at before. If I look at my primary Tg, I can see what happens whatever I do in terms of adding a resin. For example, if I add more of my corrosion resin, I'm increasing my primary Tg, and that's correlating with an improvement in corrosion. For the flexibilising resins, if I add more of those, I'm decreasing my primary Tg and making my corrosion worse. That primary Tg does seem to correlate very well with corrosion, as the multivariate showed. If I look at my Flex4 resin, it was having no effect on corrosion and it's having no effect on my primary Tg, so it's different from my other flexibilising resins. But I can see for my secondary Tg that, as I add more of my Flex4, it's rapidly decreasing the secondary Tg. The other resins really don't have much effect on secondary Tg. What does that mean? What can I learn from that?
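The pairwise-correlations step described above, all variable pairs with their correlations, sorted so the smallest significance probabilities come first, can be reproduced outside JMP with a short Python sketch. The exported file and its column contents are assumptions; the calculation itself is just Pearson correlations with their p-values.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Hypothetical export of the 16-run DOE table with all measured responses.
df = pd.read_csv("protective_coating_doe.csv").select_dtypes("number")

# All pairwise correlations with their p-values, sorted so the most
# significant pairs appear at the top (the "significance probability" sort).
rows = []
for a, b in combinations(df.columns, 2):
    pair = df[[a, b]].dropna()
    r, p = stats.pearsonr(pair[a], pair[b])
    rows.append({"var_1": a, "var_2": b, "correlation": r, "signif_prob": p})

pairwise = pd.DataFrame(rows).sort_values("signif_prob", ascending=True)
print(pairwise.head(10))   # strongest relationships first, e.g. Flex4 level vs. secondary Tg
```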
Well, any material that has multiple Tgs, glass transition temperatures, that's usually a sign that it's a multi-phase material; it's not a homogeneous material. That was the case here when we did some microscopy. What we saw was that our coating had a continuous phase, shown by this gray material here, but it had a secondary phase dispersed in it. The primary glass transition temperature was correlating with that primary continuous phase, and the secondary, lower glass transition temperature was correlating with this secondary phase that we have here. We had a hard, glassy primary phase and then a soft, rubbery secondary phase. Why that's important is that usually a high glass transition temperature does lead to better corrosion performance, because it inhibits the diffusion of anything through this layer and stops material getting to the substrate, the metal substrate, and causing corrosion. Usually, if I want to make flexibility better, I have to make this continuous layer softer, and that degrades corrosion. But with this type of morphology, I was able to keep my hard continuous phase and gain flexibility through a separate, dispersed rubbery phase. This meant that anything that wanted to diffuse through the coating and cause corrosion always had to diffuse through this high-Tg area. It's given me the combination of good corrosion and good flexibility together. The auxiliary data that I gathered, and the analysis of it, was really responsible for the learning of what was going on in this system. In conclusion, it's definitely possible to carry out successful DOEs where we only measure the critical responses, the big Ys. But I hope I've shown that including carefully selected auxiliary responses, the little ys, can often be really valuable, can bring clarity to unexpected results, and can help us to build scientific knowledge. I hope I've also shown that JMP provides some tools that really help us with this. I've shown a couple, but there are many more available. I'd finally like to finish off by thanking the many associates at PPG's Coatings Innovation Center who contributed to this work.
More than any other statistical software, JMP and JMP Pro make tremendous use of model selection criteria, such as the AICc and BIC. These tools can be used by practitioners in all industries and at all skill levels, from users choosing a distribution for a capability analysis to advanced users choosing input factors in a linear mixed model or a functional DOE analysis. Model selection criteria are incredibly flexible and powerful, yet make it easy to decide between very different sets of predictor variables, response distributions, and even correlation structures, all at the same time.

Unfortunately, the full story of how and when to use these criteria is not part of most standard data science courses in universities and professional training. One reason for this omission is that, unlike JMP, many software packages implement model selection criteria in an incomplete or arguably incorrect way, making it impossible to compare models with different input variables.

In this presentation, we give clear guidance on how and when to use model selection criteria. We describe their motivation and the assumptions they require. We compare model selection criteria to other better-known approaches to selecting models, such as hypothesis tests and holdout-based cross-validation procedures. We also give a brief story of how JMP Statistical R&D developers came to appreciate how useful these tools are, as we sought a general solution to the distribution dredging problem.

The most famous quote in all of statistics is George Box's "All models are wrong, but some are useful." I've heard this quote at almost every conference I've ever been to, and because of this, to my recollection, I've actually avoided using this quote in any talk before. But when I looked up the first time it was ever set in print, it was in a 1976 Journal of the American Statistical Association article. It's found in a section called Parsimony. Immediately after that first instance of the quote, he talks about the importance of finding the simplest model that describes the observed phenomena. This amounts to finding models that offer a reasonable balance of goodness-of-fit versus model complexity, and it is exactly what I'm going to be talking about today in this presentation. JMP and JMP Pro offer a lot of different modeling capabilities, each with a lot of output related to choosing a model. Today I'm going to go into some detail on some of the most important of these, highlighting their motivation and the assumptions behind them. A lot of the discussion will be about the AICc and BIC model selection criteria, which are direct and very data-efficient tools for addressing the problem Box had in mind with his quote, which is how to find a useful model from a set of flawed or wrong ones. As I was putting this presentation together, I went through the derivations of the AIC and the BIC. I wanted to get a clear understanding of what these similar-looking methods really are and what assumptions they make. Afterwards, out of curiosity, I did an Internet search of AIC versus BIC versus cross-validation. It was interesting to see, in all these Internet forums, that there is still so much debate, even though these methods have been around for 50 years.
Having recently reviewed the derivations of the methods, it looks like there are still a lot of misconceptions out there. I think the reason for this is that both model selection criteria have very deep and technical derivations despite the simplicity of their formulas: both of them are equal to minus two times the log-likelihood of the fitted model, plus a simple penalty based on the number of model parameters. You can't guess the reasons for the penalty terms from the formula alone, which makes them seem mystical and arbitrary. One of my goals today is to try to demystify these without going overboard on the math. To put this all in the context of an analysis workflow, we can think of an analysis project as having four major steps. We first have to acquire the data and get it organized and cleaned up. Then we fit several models to it, in a way that is either manual or automated by software like JMP or JMP Pro. Once we've done that, we need to choose one of them as the model that we're going to work with moving forward. This is a critical step in the process and is the one we'll be focusing on today. It's important that we get the model selection right, because the quality of the results and the conclusions we make at the end requires that we have a reasonably good model. Here are the main ways that I've seen people make decisions about models. Hypothesis testing is probably the first one people learn about. These are most commonly used to determine if a regression coefficient is statistically significantly different from zero, which sounds like a model selection problem. While they are often used in that way, hypothesis tests are derived under a specific set of assumptions that explicitly does not account for having changed the model or having used a model that was chosen as the best amongst several alternatives. Then we have the general empirical procedures that assess models based on data held out from the model fitting process. These techniques can be applied to both classical statistical models and machine learning models. In my opinion, holdout validation in particular is the way to go if you have a whole lot of data. Then we have what I'll call the small-data analytical procedures. These were derived for situations when you have to make a decision about which model to use, but you don't have enough data to hold out any observations. The most commonly used of these are the AIC and the BIC, but there are other well-known techniques like Generalized Cross-Validation and Mallows' Cp. It turns out that these two are actually asymptotically equivalent to the AIC, so in large samples you should get the same conclusions from GCV, Mallows' Cp, and the AIC, at least for least-squares-based models. Then we also have what I'll call model-specific approaches, like VIP in partial least squares models and the cubic clustering criterion in clustering models. These are pretty niche and I won't be talking about them any more here today. Then we also have visual tools like actual by predicted plots and ROC curves.
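For reference, the criteria mentioned above have the standard textbook forms, with the maximized likelihood written as L-hat, k the number of estimated parameters, and n the sample size:

$$
\mathrm{AIC} = -2\ln\hat{L} + 2k,
\qquad
\mathrm{BIC} = -2\ln\hat{L} + k\ln n .
$$

Both are "minus two times the log-likelihood plus a penalty"; they differ only in how hard the penalty grows with model size and sample size.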
Regardless of how you choose your model, these plots are good to take a look at before moving forward with a model, because they provide more interesting information than any individual statistic will, and they can tell us whether the best model that we've considered so far is still a good enough model for us to use. My own first encounter with model selection criteria in my professional life was back in the mid-2000s, around when JMP 5 and JMP 6 were out. JMP had added the ability to provide capability analyses for non-normal distributions. Capability analysis is a very important tool for assessing whether a manufacturing process is "capable" of delivering products that are within specification. JMP users wanted to determine the "best distribution" for the data so their process capability metrics would best reflect the reality of their situation. JMP customers understood that you could fit different distributions with JMP and knew that many of the distributions came with a goodness-of-fit test. In a case of having a hammer causing you to find nails everywhere, they were trying all the distributions they could find and were choosing the one with the largest p-value as the distribution for their capability analysis. They wanted us to codify this into a new fit-all-distributions feature that would automate this process for them. But we were rather uncomfortable with this request for a number of reasons. For one thing, the different distributions fit in JMP came with different kinds of goodness-of-fit tests. The normal had a Shapiro-Wilk test, the Weibull had a Cramér–von Mises test, and the lognormal had a Kolmogorov test. It's very strange to compare tests that are rather different from one another. Another problem with this approach is that distributions with more parameters are going to tend to have an edge over those with fewer. If we choose the distribution based on the largest p-value, it will always favor distributions with more parameters, as we see here with the two-parameter normal compared with the four-parameter Johnson Su distribution. Then, for some of the distributions, like the Weibull's Cramér–von Mises W test, we only had tabled p-values going up to something like p = 0.25. But even if we consolidated all the goodness-of-fit tests down to just one and got accurate p-values for all of them, there's still a larger philosophical issue at stake, and that's that hypothesis tests like these can only quantify evidence against the null hypothesis. If the null hypothesis is true, then the p-value is a uniformly distributed random variable. In other words, if the null hypothesis is true, then the probability that the p-value is between 0.1 and 0.2 is exactly the same as the probability that it is between 0.8 and 0.9. Seeing a p-value of 0.9 isn't more evidence that the hypothesis is true than a p-value of 0.3. Returning to our example: all four of these distributions have goodness-of-fit p-values larger than 0.05.
Through this lens, all four distributions fit the data reasonably well. But even though the goodness-of-fit tests say all the distributions are good, the conclusions about the process generating the data are different depending on the distribution. If you use a Ppk reference value of 1.33 to determine whether the process is capable, then choosing the Weibull indicates that the process is not sufficiently capable to meet the specifications, whereas the other distributions indicate that the process is capable. We recognized that there had to be a better way to determine the distribution automatically and came to the conclusion that this should be seen as a very basic kind of model selection problem. In our search for a sound method for choosing a distribution, we stumbled upon this very good book on model selection by Burnham and Anderson. They give careful derivations of the AIC from the perspectives of information theory and cross-validation. They also give a derivation of the BIC and show how the AIC can be derived in the same way with a different assumption about the prior distribution. Burnham and Anderson also carefully show that hypothesis testing is rather incoherent as a model selection strategy. The book had a pretty big impact on my own views of modeling and also on JMP's statistical modeling platforms. Returning to the distribution selection problem for the moment: when we went ahead and added a distribution selector, we ended up calling it Fit All, and we based it on the AICc. Here on the left, we have two distributions fit to the capability analysis data we were looking at before, the normal and the Johnson Su. The Johnson Su's goodness-of-fit p-value is larger than the normal's because it has two more parameters than the normal distribution. Now on the right, we see the results of a Fit All using the AICc. The normal comes out as the best-fitting distribution, but the Johnson Su is near the bottom. This is because the AICc is penalizing it for having those two extra parameters. This feature has now been used many, many times, and I believe people are generally pretty happy with it. Now I'm going to go through a somewhat mathy but hopefully accessible explanation of what the AICc really is. All right. Now I'm going to go into some basic theory behind the AIC. I'll be as brief as possible and use the best analogies I can, but I think it is important to be exposed to the underlying concepts so you can see that the AIC has a rigorous foundation that makes some sense. The AIC-type selection criteria are based on a distance-type metric between probability distributions called the Kullback-Leibler, or KL, divergence. It quantifies the amount of information lost by using probability distribution two when probability distribution one is the correct one. The KL divergence has the property of always being greater than or equal to zero, and it is only equal to zero when the two probability distributions are the same. This is to say that using the wrong distribution always leads to a theoretically quantifiable, strictly positive information loss.
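Written out, the KL divergence described above is the standard textbook quantity, with f the true distribution and g the model:

$$
D_{\mathrm{KL}}(f \,\|\, g)
= \int f(x)\,\ln\frac{f(x)}{g(x)}\,dx
= \mathbb{E}_f\!\left[\ln f(X)\right] - \mathbb{E}_f\!\left[\ln g(X)\right]
\;\ge\; 0 .
$$

The first expectation depends only on the truth f; it is the constant term discussed next.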
This is pretty heady, abstract stuff, so I'm going to translate it into the language of statistical modeling. When we are using data in statistics to learn about how something works, we are explicitly or implicitly fitting probability models to the data to approximate the true model that generated it. If we knew the true probability-generating mechanism, we could use the KL divergence to quantify how far, or how wrong, the model is from the truth. We could then try several models and find the one that is the closest to the truth. Akaike recognized this, plugged the true and the model probability formulas into the KL divergence formula, and used a little algebra to see that the KL divergence had two terms. The first term only contains the true probability-generating mechanism for the data, which we can never know, since we can only work with models. However, this is a constant that is the same for all models that you fit to the data, as long as we play by a couple of simple rules. The second term is what Akaike discovered is empirically estimable, and with a lot of math he found a simple formula to estimate this second term. In particular, he discovered that two times the KL divergence is estimated by a constant that is the same for all models, plus two times the negative log-likelihood of the data used to fit the model, plus two times the number of parameters. Everything has been multiplied by a factor of two just to follow the same convention as a likelihood ratio test. Since the constant term is the same for all models, as long as we don't change the response data, we can fit several models, and the one whose AIC is the smallest is the one that is estimated to have the smallest KL divergence from the truth and, in a sense, is the one that is the least wrong. Using the AIC for model selection is entirely analogous to there being a collection of islands and you wanting to know which of the islands you know of is closest to another island that you know you'll never be able to get to. The direct solution to this problem would be to calculate the distances from each of the islands to the one that we want to get close to. Now, what if the island we wanted to get close to was surrounded by a circular high fence that we could approach? The island is perfectly in the middle of the fence, so the distance from the center of the island to the fence is always the same. But the fence is far enough away from the island it encloses that we can't see the island or measure the distance from the fence to the interior island. We can still estimate the distance from each island to the fence. Because the main island is in the center of the fence, we know that the island closest to the fence is the closest island. This is exactly the situation with the AIC. With the AIC, we can estimate the distance from the truth to each of the models. Each AIC estimate is off by the same amount. While we can't estimate the absolute distance of the models from the truth, we can know which model is the closest in a relative sense.
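In symbols, the decomposition described above can be sketched as follows (standard notation, not a full derivation: f is the truth, g is the fitted model with maximized likelihood L-hat and k parameters, and the expectation is over data from f). The constant depends only on f and so drops out of any comparison between models fit to the same response data:

$$
2\,\mathbb{E}\!\left[D_{\mathrm{KL}}\!\bigl(f \,\|\, g(\cdot \mid \hat\theta)\bigr)\right]
\;\approx\;
\underbrace{2\,\mathbb{E}_f[\ln f(X)]}_{\text{constant for all models}}
\;+\;
\underbrace{\bigl(-2\ln\hat{L} + 2k\bigr)}_{\mathrm{AIC}} .
$$

This is why only differences in AIC between models are meaningful, never the absolute value, exactly as in the island-and-fence analogy.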
The original AIC is based on the likelihood of the training data plus a parameter penalty. The training likelihood assesses the goodness of fit of the model. We can't use this term by itself, though, because it is biased downward, as the model parameters were chosen to minimize the negative log-likelihood. With a lot of math, Akaike derived a very simple expression that corrects for this bias. The original penalty is just 2K, where K is the total number of estimated parameters. For linear regression with a slope and an intercept, we also have to count the variance. For that case you would have K equals three and not two. There are important assumptions that led to the 2K penalty. We can characterize them loosely as saying that the model has to be reasonably good. The AIC is still going to be robust, however, because if a model is bad, then the likelihood component will be large and will dominate the penalty. Amongst the good models, the 2K term will favor the smaller models as long as the sample size is large. However, it didn't take long for people to find that this original AIC often chooses models that overfit in small samples, so a more accurate, higher-order approximation to the bias was derived. When this extra term is added, the criterion becomes known as the AICc, or the corrected AIC. Unfortunately, the reputation that the AIC overfits had become commonplace before the correction was discovered and widely known about. The correction becomes infinite as K approaches N, pushing the model selection criterion away from models that are nearly saturated. Notice also that the correction term goes to zero as N goes to infinity. In large samples the AIC and AICc are equivalent. The AICc is what we report in JMP because it works well for small samples, and although it was derived for Gaussian distributions, experience suggests that it's good enough with other commonly used distributions. Now I'm going to illustrate the AICc in a real example that was a five-factor central composite design with 31 runs, where the response was the amount of pDNA produced by a bioreactor. I'll illustrate the AICc using the Generalized Regression platform, giving it a full response surface model with all main effects, interactions, and second-order terms. I fit four models to the data. One is a full response surface model using least squares that was fit automatically. Then I used forward selection under the normal, logNormal, and exponential distributions. I chose the exponential distribution to illustrate poor model fit. The models had 22, 9, 9, and 1 parameters, respectively, and the model with the lowest AICc was the logNormal with an AICc of about 334.8. We can break the AIC and AICc calculations down to see how different parts of the penalty are contributing. The full least squares model has the lowest negative log-likelihood, but the highest AICc overall. When we look at the second-order corrections and the original AIC values, we see that it's the second-order correction term that is pushing the model selection criterion to be very large for this model.
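To see how large that correction gets, here is a small sketch of the AICc correction term 2K(K+1)/(N-K-1) for parameter counts like the four models above, at the design's 31 runs and at a much larger assumed sample size. The numbers are illustrative and are not taken from the JMP report in the talk.

```python
# Behavior of the AICc correction 2K(K+1)/(N-K-1): it blows up as K
# approaches N and fades away as N grows.
def aic(neg2_loglik, k):
    return neg2_loglik + 2 * k

def aicc(neg2_loglik, k, n):
    return aic(neg2_loglik, k) + 2 * k * (k + 1) / (n - k - 1)

n = 31                    # runs in the five-factor central composite design
for k in (1, 9, 22):      # parameter counts like the fitted models
    correction = 2 * k * (k + 1) / (n - k - 1)
    print(f"K={k:2d}  N={n}   correction={correction:7.2f}")

# The same 9-parameter model in a sample of 1,000: the correction is ~0.18,
# so AIC and AICc are essentially the same there.
print(aicc(300.0, 9, 1000) - aic(300.0, 9))
```

With 22 parameters and only 31 runs, the correction alone is over 125, which is what pushes the full least squares model to the bottom of the AICc ranking.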
The logNormal forward selection negative log-likelihood is a little lower than the normal forward selection one. They both have nine parameters, so their penalties are the same, and the logNormal forward selection model has the lower AICc. The exponential forward selection model has the poorest model fit as measured by the log-likelihood, but it also only has one parameter in the model. Overall, it has the smallest penalty contribution to the AICc. But the poor fit of the model is such that the likelihood dominates, and the exponential model is the second from the worst as measured by the AICc. If you review the general derivation of the AIC in the Burnham and Anderson book, you'll see that what it's actually estimating is the expected value of a hypothetical test set likelihood for a data set that has the same size and response structure, but not the same values, as the training set. The expected value also takes into consideration the variability in the estimate of the MLE. I find this cross-validation interpretation of the AIC to be pretty compelling. I think it's also important to point out that this cross-validation derivation of the AIC does not assume at all that we have the correct model. To show that this cross-validation interpretation really works, I created a simulation formula using an average of the models I've shown in the previous slides as well as some other ones. This way we knew that none of the models were actually the correct one. I fit each of the four models to new training data a thousand times and set it up so that JMP would report an independent holdout likelihood using another new data set. I kept each of the four models' structures and distributions intact and did not apply variable selection. This was to perfectly mimic the exact cross-validation interpretation of the AIC. From there, I created a table of simulated holdout likelihoods and computed their average for each of the four models. This is the AIC and AICc summary table from before, with the simulation-based average holdout log-likelihoods added over here to the right. You can see that the full normal model holdout likelihood is very close to its AICc value and that the second-order correction term was essential for this match to happen. You can also see that the simulated average exponential holdout log-likelihood is very close to its AICc. Both the normal and logNormal holdout likelihoods are close to the original logNormal model's AICc. The normal holdout likelihood is a little smaller. I attribute this to averaging a bunch of simulation models, making the simulated data a little bit more normally distributed than the original data was. There are a couple of simple rules that are needed to make AICc comparisons really valid between different models. The most important is that the stochastic part of the data has to stay the same: the same rows have to be used, and it is the Y's in particular that must be the same.
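The same kind of check can be sketched on a much smaller scale than the talk's pDNA simulation. The example below is a stripped-down stand-in with a made-up straight-line normal model and made-up coefficients, not the talk's models: fit by maximum likelihood on training data, score the log-likelihood on an independent holdout set of the same size, repeat, and compare the average to the average AICc.

```python
# Cross-validation interpretation of the AICc, illustrated with a simple
# normal straight-line model (assumed truth, not from the talk).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, n_sim = 31, 1000

def simulate(n):
    x = rng.uniform(-1, 1, n)
    y = 2.0 + 1.5 * x + rng.normal(0, 0.8, n)   # assumed data-generating model
    return x, y

aicc_vals, holdout_neg2ll = [], []
for _ in range(n_sim):
    x_tr, y_tr = simulate(n)
    x_te, y_te = simulate(n)

    # MLE for slope, intercept, and sigma: K = 3 parameters in total
    b1, b0 = np.polyfit(x_tr, y_tr, 1)
    resid = y_tr - (b0 + b1 * x_tr)
    sigma = np.sqrt(np.mean(resid ** 2))        # MLE divides by n, not n - 2
    k = 3

    ll_tr = stats.norm.logpdf(y_tr, b0 + b1 * x_tr, sigma).sum()
    aicc_vals.append(-2 * ll_tr + 2 * k + 2 * k * (k + 1) / (n - k - 1))

    ll_te = stats.norm.logpdf(y_te, b0 + b1 * x_te, sigma).sum()
    holdout_neg2ll.append(-2 * ll_te)

print("mean AICc              :", np.mean(aicc_vals))
print("mean -2 * holdout loglik:", np.mean(holdout_neg2ll))
```

Under these assumptions the two averages should come out close to each other, which is the point of the cross-validation interpretation.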
The X's can be different, of course, even if they were originally random. Not only must the Y's be the same, but they can't be changed or transformed; the transform would have to be built into the model appropriately. The AIC is also only defined for well-behaved maximum likelihood estimators and other closely related methods. This explains why you don't see the AICc for neural networks and other machine learning models. Also, you have to keep in mind that just because you found a model that the AICc says is the best, it doesn't mean that it is a good model. Use your past experience and model diagnostic plots to ensure that the model is right enough to be useful. Returning to the pDNA data, we see two equivalent models. On the top, we have a logNormal model, and on the bottom we have a normal fit to the log-transformed response. You can see that the generalized R-squares are the same for these two models, but the AICcs are very different. This is because the logNormal fit implicitly builds the transform into the likelihood, but the log-scale normal fit does not. In this case, the right thing to use is the logNormal. Here's a quick demonstration that you have to decide the distribution and the input variables at the same time. Here is simulated data from a t-test type model: two groups of normally distributed data with the same variance but different means. If you run Fit All in the Distribution platform, it chooses the Normal 2 Mixture with an AICc of 1036. This is the correct distribution if you don't know the group identity of the rows. Once you include the grouping variable, though, you see that the normal comes out on top with an AICc of 717 or so. We also tried the Weibull, logNormal, and gamma, and the normal still came out on top, even though those distributions did better in Distribution without including the grouping variable. You'd have to try different model structures and distributions together to find the right combination. Now I'm going to change gears and talk a little bit about the BIC, which is the other main analytical model selection criterion in JMP. The BIC is motivated in a completely different way than the AIC. Schwarz used a large-sample argument in a Bayesian context to approximate the log probability of the data after having integrated the model parameters out. Assuming a flat prior on the parameters, an expression similar to the AIC pops out, with a K log N type penalty term rather than two times K. There were also other terms in the integral that are always ignored. One is K log 2 pi, which was considered too small to deal with, and the other is a normalized variance of the MLE, which would also be of order K. I didn't study the AIC or BIC in any depth in school. I just remember hearing the refrain "AIC overfits, BIC underfits" several times in different classes, which I interpreted as a strong skepticism about both of them. Comparing the AICc and BIC penalties, we see that the AICc will prevent big models from being chosen when the sample size is small, whereas the BIC will still allow large models.
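A quick way to see that small-sample difference is to tabulate the two penalty terms side by side. The sketch below assumes the usual formulas, an AICc penalty of 2K + 2K(K+1)/(N-K-1) and a BIC penalty of K log N, with parameter counts and sample sizes chosen for illustration.

```python
# Compare AICc and BIC penalty terms across parameter counts and sample sizes.
import numpy as np

def aicc_penalty(k, n):
    return 2 * k + 2 * k * (k + 1) / (n - k - 1)

def bic_penalty(k, n):
    return k * np.log(n)

for n in (31, 1000):
    print(f"N = {n}")
    for k in (3, 9, 13, 22):
        print(f"  K={k:2d}  AICc penalty={aicc_penalty(k, n):8.2f}"
              f"  BIC penalty={bic_penalty(k, n):8.2f}")
```

At 31 runs, the AICc penalty for a 22-parameter model is more than twice the BIC penalty, so the AICc pushes harder against nearly saturated models; at N of 1,000 the ordering flips and the BIC becomes the stricter of the two.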
I see the K log N and normalization-constant penalty in the BIC as somewhat less compelling than the cross-validation interpretation of the AIC-type penalties. Something that leads to a marginal probability of the data is more abstract to me than something that is directly interpretable as a cross-validation metric taking into account parameter uncertainty. I'm fully aware that I'm editorializing here, but this is what's worked well for me so far. Returning to the pDNA DoE one more time, here are the same models fit in the pDNA example using the BIC for forward selection on top and the AICc on the bottom. Notice that the BIC of the full normal model is not as far away from the other models as with the AICc. The best model overall as rated by the BIC is a logNormal, but with 13 parameters this time around rather than nine. The forward-selected BIC normal model also has a couple more parameters. Contrary to the refrain "AIC overfits, BIC underfits," the AICc can choose smaller models than the BIC in small samples. Here we see the effects chosen by the BIC and the AICc. The set of BIC-selected effects is a superset of the ones chosen by the AICc. Also notice, interestingly, that all four effects not chosen by the AICc are statistically significant under the BIC. Under the BIC, the pH squared term is highly significant, but it isn't present in the AICc model, for example. I would say that all the significant effects should have asterisks by them, but all significant p-values already have asterisks by them in JMP reports. Instead, I'll just say that I take p-values of effects after selection with a grain of salt. Although the two models choose different effects, some of them highly statistically significant, if we look at the profiler or variable importance from these two models, they tell a very similar story. Feed rate is by far the most important, and after that the ordering is the same between the two models. pH only impacts 3% of the variation in the response surface under the BIC best model and isn't included at all in the AICc best model. This is a very clear example of statistical significance and practical relevance being two different things. There are a lot of opinions out there about the AICc and the BIC. For example, Burnham and Anderson say that both methods are consistent for the quasi-true model as N goes to infinity, but then there are others that say that the BIC is the only one consistent for the truth. Burnham and Anderson say that you can set up simulations to make one look good, then change the way it's set up a little bit and it'll flip the results. Burnham and Anderson, who are about the most diehard AICc fans out there, found in their simulations that the AICc chooses fewer really bad models than the BIC. I think it's not a bad idea to look at both the BIC and AICc after applying variable selection.
If the best models under both are pretty much the same, which is often the case, you can feel pretty good about either of them. If they're different, it's good to think about the reasons why and use your subject-matter expertise to help make a decision. My last topic is model selection criteria for linear mixed models. This is a pretty complicated situation, especially because there isn't consensus between software vendors on how to compute the model selection criteria. To illustrate this, I created a split-plot design with four factors. There are two whole-plot effects and two split-plot effects. If you take the same data and fit the same model in JMP Pro and SAS using Fit Mixed and proc mixed, you will see that the likelihoods and model selection criteria don't match, but the variance estimates do. You get different fixed-effects parameter estimates, but the fixed-effects tests agree. One of the reasons for this is that the JMP and SAS fixed-effects design matrices use different coding strategies for categorical effects. On the left I have the JMP design matrix for the split-plot example, and on the right you see the SAS one. JMP creates a row of minus ones for the last level of categorical effects, which is seen in blue here, whereas SAS creates a row of zeros. Neither one of these is right or wrong. It's like changing units or changing coordinate systems. JMP categorical effects sum to zero, whereas SAS categorical effects can be interpreted as differences from the last level. Although the raw parameter estimates differ, predictions will be the same between the two codings because the models are fundamentally equivalent. Most things that matter won't be different between the two software products. However, REML, the method used to estimate mixed-effects models, has an ambiguity in it. The base Gaussian likelihood at the top will be the same in either software because it's a real likelihood. But the REML or residual likelihood reported by proc mixed and JMP Pro's Fit Mixed isn't a real likelihood. If it were a real likelihood, then we would get the same values regardless of which coding or software we used. This is because there's an extra penalty added to the Gaussian likelihood for REML that reduces the bias of the variance estimates, but this penalty depends on the design matrix in a way that is sensitive to the coding used. JMP reports the raw Gaussian likelihood, and the AICc and BIC that it reports are based on that rather than the residual likelihood. The number of parameters Fit Mixed counts is the total, including both fixed effects and variance parameters. We did it this way so that you can use JMP to compare models with different fixed-effect structures as well as different variance models. SAS only reports the residual or REML log-likelihood and reports model selection criteria based on it. You can see here that it also only counts variance parameters, because the difference between the SAS likelihood and its AIC is four, implying two parameters: a variance component and a residual variance.
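The coding point is easy to reproduce outside either product. Here is a small sketch with a hypothetical three-level factor and made-up responses, fixed effects only, showing that sum-to-zero coding in the JMP style and last-level-zero coding in the SAS style give different parameter estimates but identical predictions.

```python
# Sum-to-zero (JMP-style) vs. last-level-zero (SAS-style) coding for a
# three-level categorical factor: different estimates, same fitted values.
import numpy as np

levels = np.array([0, 0, 1, 1, 2, 2])            # hypothetical factor levels
y = np.array([5.1, 4.9, 6.2, 6.0, 7.8, 8.1])     # hypothetical responses

# JMP-style effect coding: the last level gets -1 in every indicator column.
X_jmp = np.column_stack([
    np.ones(6),
    np.where(levels == 0, 1.0, np.where(levels == 2, -1.0, 0.0)),
    np.where(levels == 1, 1.0, np.where(levels == 2, -1.0, 0.0)),
])

# SAS-style reference coding: the last level gets 0 in every indicator column.
X_sas = np.column_stack([
    np.ones(6),
    (levels == 0).astype(float),
    (levels == 1).astype(float),
])

b_jmp, *_ = np.linalg.lstsq(X_jmp, y, rcond=None)
b_sas, *_ = np.linalg.lstsq(X_sas, y, rcond=None)

print("JMP-coded estimates:", b_jmp)             # level effects sum to zero
print("SAS-coded estimates:", b_sas)             # differences from the last level
print("same predictions?  ", np.allclose(X_jmp @ b_jmp, X_sas @ b_sas))
```

The fitted values agree exactly, which is the sense in which the two codings are just different coordinate systems; the trouble only starts when a quantity like the REML penalty is computed directly from the design matrix.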
All this means is that you can only use proc mixed for comparing variance models with the AIC, because its model selection criteria include the REML penalty and it only counts variance parameters. With all due respect, I can think of some good reasons for the SAS approach, and there are probably some other good reasons I don't even know of. But I personally prefer the flexibility afforded by the JMP approach. To summarize, if you compare results across software for non-mixed models, the mean parameter estimates may differ, but otherwise everything else should be the same, as long as the software computes the constants in the likelihood correctly, as JMP does. When we get to Gaussian mixed models, there are very important software differences, and the scope of the decisions you can make about the models using the software may be very different depending on the details of how its likelihood is calculated. JMP model selection criteria are comparable both within the same platform and across other modeling platforms. I'll close with this slide, which gives my basic recommendations for applying the tools discussed today. Hypothesis testing is a tool for when you need to prove something and is best used in situations when you have a good idea of the model structure in advance. When you're working on a problem in industry and the sample size is small, I would stick to classical statistical models and use the AICc as the primary tool for choosing between them. With larger data sets, when I have enough data to hold out at least a third of the observations, I use holdout cross-validation to compare classical statistical models as well as machine learning models. In my own work, I tend to avoid K-fold cross-validation and its variants. The model selection criteria are equivalent to it in larger samples, and I tend to stick with simpler models with smaller data sets. I know that not everyone is going to agree with me on this, but this is what works for me and is a pretty safe way to approach model selection. Choosing the most useful model from a set of alternatives that must all be wrong on some level is an important decision, and these are the main considerations I have when deciding upon a model selection strategy. Thank you for your attention, and I look forward to talking with you in the Meet the Expert sessions.