The Institute for Health Policy and Practice (IHPP) launched a new Project ECHO® Hub at the University of New Hampshire in 2019. Project ECHO is "an evidence-based method using web-based teleconferencing to link specialist teams with community-based sites to help community providers improve their ability to manage complex conditions" (Ryer, West, Plante et al., 2020). The Partnership for Academic Clinical Telepractice Medications for Addiction Treatment (PACT-MAT) was IHPP's first ECHO. It was formed through collaboration between IHPP and UNH's Department of Nursing, and its aim was to increase knowledge and confidence among Medication Assisted Treatment (MAT) prescribing providers. To evaluate the effectiveness of the PACT-MAT ECHO, the PACT-MAT team sought to analyze MAT prescribing practices for participants before and after their participation in the PACT-MAT ECHOs. IHPP's Health Analytics and Informatics team was brought into the project to facilitate data use permissions and aggregate administrative claims data. UNH's Department of Mathematics and Statistics provided statistical analysis and modeling using JMP. This presentation provides an overview of our experience in using healthcare claims data to measure impact from an innovative model such as Project ECHO, and highlights our use of JMP for the final analysis.

Source: Ryer J, West K, Plante E-L, et al. Planning for Project ECHO® in NH: The New Hampshire Project ECHO Planning for Implementation and Business Sustainability Project Summary Report. NH Citizens Health Initiative, Institute for Health Policy and Practice; 2020.

Hello. Thank you for joining us today for our presentation, "Measuring Change in Medication Assisted Treatment for Participants in Project ECHO Using Administrative Healthcare Claims." We are excited to share how we used JMP as a key tool in our analytic work. Next slide, please.

My name is Erica Plante, and I am a senior scientific data analyst at the Institute for Health Policy and Practice at the University of New Hampshire. I am joined by Dr. Michelle Capozzoli, Senior Lecturer in the Department of Mathematics and Statistics, also at the University of New Hampshire. Neither Michelle nor I have conflicts of interest to disclose. Next slide, please.

Before we describe our work, I would like to provide a brief overview of Project ECHO. Project ECHO was founded in 2003 by Dr. Sanjeev Arora at the University of New Mexico. Dr. Arora is a physician specializing in gastroenterology and hepatology. He was seeing patients with hepatitis C die at alarming rates because they could not access care for this treatable disease in a timely manner. He sought to bring providers together to form a community of practice where doctors and other specialists can learn from each other. ECHO is an "all teach, all learn" model, and the sessions are often centered around a key issue or condition. The University of New Hampshire launched its Project ECHO hub in 2018 and has since produced a number of ECHO programs, including the Partnership for Academic Clinical Telepractice, Medications for Addiction Treatment, or PACT-MAT. Next slide, please.
The primary goal of the PACT-MAT ECHO is to increase the number of nurse practitioner students in graduate and postgraduate programs who receive waiver training, apply for the waiver, and subsequently prescribe MAT. Secondarily, the project seeks to increase provider self-efficacy in managing patients with Substance Use Disorder. The program developed a learning community that enabled a culture that understood addiction as a chronic disease and was prepared to address the range of issues that emerged during the process of treatment. Specifically, this program focused on all participants becoming proficient and culturally competent in prescribing and treating SUD, as well as enhancing the capacity and quality of services available to patients in their communities through their providers. Next slide, please.

After the completion of the first PACT-MAT session, the team was interested in answering some questions about the PACT-MAT ECHO through claims data analysis. Here is some core information about the analytic project. The analytic period of interest was from 2018 through June 2020, to capture data prior to and after the first PACT-MAT ECHO session. The project was funded by the Substance Abuse and Mental Health Services Administration (SAMHSA) as part of a $150K, three-year grant. The project's principal investigator was Dr. Marcy Doyle. We wanted to ask a few questions about the ECHO program itself and whether provider practices had changed after participating in the ECHO. The Center for Health Analytics and Informatics (CHA) at the Institute for Health Policy and Practice at UNH is fortunate enough to have access to healthcare administrative claims and enrollment data for commercial and New Hampshire Medicaid policies. Therefore, the CHA team was brought into the research project to collect and aggregate the data, and UNH's Department of Mathematics and Statistics was brought in to build models and perform the analysis. Next slide, please.

These are our core research questions. Did the PACT-MAT ECHO series have an impact on participants' MAT prescribing practices? And can we successfully perform a case/control study on providers using administrative claims data? Next slide, please.

When collecting healthcare claims, we included all members ages zero to 64 who had medical and pharmacy enrollment in the month of interest. PACT-MAT participants self-reported their name, title, NPI, organization's name and address, and their waiver status. We cross-referenced that data against CMS's National Plan and Provider Enumeration System, also known as the NPPES Registry. In cases of a mismatch between the self-reported data and the NPPES Registry, the self-reported data was considered the most up to date and was used for the analysis. Information on our control group was sourced only from the NPPES. Next slide, please.

Claims were flagged if one or more service lines included an MAT procedure code or drug code; they were also flagged if an Opioid-Related Disorder diagnosis code was found.
Medical providers were identified as having billed for MAT if at least one service line included their NPI and one of the MAT CPT codes. Prescribing providers were identified as having prescribed MAT if at least one pharmacy service line included their NPI as the prescriber and at least one of the MAT NDC codes. The case and control populations each had two pairs of datasets. Next slide, please.

There was one pair for each insurance type, commercial or New Hampshire Medicaid. The first dataset included the provider's NPI and information, as well as total aggregates by month of providers, patients, and claims: patients with Opioid Use Disorder (OUD), patients with any medication assisted treatment, and patients with both OUD and MAT. The second dataset provided the same data as the first, with the aggregation at the member demographic level, such as age category, county, and sex. No identifiable member data was supplied to the statisticians. Now I'm going to pass the presentation to my colleague, Dr. Capozzoli.

Thank you, Erica. Once we obtained the data, it was analyzed by Rebecca L. Park, a UNH master's student under the supervision of myself and Dr. Philip Ramsey. We received three datasets. One was the practitioners' demographics, such as name, National Provider Identifier, title, and practice address. The other two were the claims datasets, one for Medicaid and one for commercial.

The original data was extremely large: for both commercial and Medicaid, it tracked each practitioner's MAT patients' history over the study period. The original thought was to use the pattern of behavior of their patients over time. Quite quickly it became apparent that this approach was problematic, and we also ran into privacy constraints in making sure that all identifying markers were unavailable. So we homed in on several of the demographic variables: the National Provider Identifier, the title, and the city of the practice. From the claims data, we focused on the month and year, which tracks which month and year we were looking at, and on the phase of the program: Pre is before ECHO, Ongoing is during ECHO, and Post is after ECHO.

We then aggregated the data, so instead of looking at every single visit, we looked at months and at patient totals: the total number of patients for that practitioner during that month, the number of patients with Opioid Use Disorder, the number of patients who had any MAT, and the number of patients with OUD who had any MAT. Further, when comparing Medicaid to commercial care during the exploratory analysis, it became apparent that we were going to need to focus on the Medicaid data due to the low patient numbers in commercial care.
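To make the claims aggregation just described concrete, here is a minimal pandas sketch. The file name and columns (provider_npi, member_id, month, mat_flag, oud_flag) are hypothetical stand-ins rather than the actual CHA claims layout; the sketch simply rolls flagged service lines up to provider-month patient counts and the proportion of patients with any MAT used later in the analysis.

```python
import pandas as pd

# Hypothetical flagged claim extract: one row per service line.
claims = pd.read_csv("medicaid_claims_flagged.csv")
# assumed columns: provider_npi, member_id, month, mat_flag (0/1), oud_flag (0/1)

# Roll claim lines up to one row per provider, month, and member.
member_month = (
    claims.groupby(["provider_npi", "month", "member_id"], as_index=False)
    .agg(any_mat=("mat_flag", "max"), any_oud=("oud_flag", "max"))
)
member_month["oud_and_mat"] = (member_month["any_mat"] & member_month["any_oud"]).astype(int)

# Aggregate to the provider-month level used in the analysis datasets.
provider_month = (
    member_month.groupby(["provider_npi", "month"], as_index=False)
    .agg(
        total_patients=("member_id", "nunique"),
        mat_patients=("any_mat", "sum"),
        oud_patients=("any_oud", "sum"),
        oud_mat_patients=("oud_and_mat", "sum"),
    )
)

# Normalized outcome discussed later: proportion of a provider's patients with any MAT.
provider_month["prop_mat"] = provider_month["mat_patients"] / provider_month["total_patients"]
print(provider_month.head())
```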
So here, for example, when we were looking at the commercial data, it became apparent that on average a provider had about one patient per month with any MAT. There is a lot of sparse data here, and it was just not conducive to fitting models, so we focused on the Medicaid data.

Further, we initially had 20 providers, which we had to reduce down to nine. The reason is that some of the providers had too many months of missing data, and some had to be eliminated because in the majority of months they didn't have any patients with MAT. Then, as we started to fit models, it became apparent that we also needed a minimum of 10 total patients per month. So we did end up with a much smaller sample size than we had originally expected.

The other piece, as we were exploring the data using some of the tools that JMP provides, is that we noticed the nine practitioners differed in their average number of total patients; the totals ranged, I think, from one to 161 patients. Looking at the trend over time, the average number of total patients (the blue line) shows a general increase over time. We also looked at the average number of patients with Opioid Use Disorder, those who had any MAT, and those who were diagnosed with OUD and had any MAT, and all four show a similar increasing trend.

To account for that, because what we want to know is whether the practitioners are increasing their prescribing, not simply seeing more patients over time, we normalized the data by creating a new variable: the proportion of patients who had any MAT relative to the total number of patients. The reason we used patients who had any MAT is that a diagnosis of OUD carries a certain stigma, so we felt it would be better to look at any MAT.

Now, analysis considerations. From the beginning we knew we had a small sample size of practitioners, only nine. The other thing that became apparent through many graphical representations of the data was that we had a lot of noise. Taking this into consideration, we decided to attempt several different approaches. The first approach was basic means comparison: ANOVA and matched pairs. Then we thought about bringing in the time variable, so we looked at the data in several different ways, starting from simple linear regression. We did consider Structural Equation Models; Dr. Laura Castro-Schilo from JMP had come to one of Dr. Ramsey's classes and given a talk on these models, so originally, when we were looking at the data, we thought they might be appropriate. But it quickly became apparent, because of the difficulties we had, that they were just not working for us. So next, we worked with segmented regression.
Segmented regression was chosen because in some previous work with claims data, it had been suggested that segmented regression would work well for data with pre and post periods. So we thought that with our pre, ongoing, and post phases, segmented regression might be appropriate. We also looked at exponential regression and logistic regression, and at generalized regression models, including regular and zero-inflated Poisson, binomial, and negative binomial models. The reason we considered the Poisson, binomial, and negative binomial is that the data inherently consists of counts, so we thought these might be appropriate. What I'm going to focus on today are the means comparison, the segmented regression, and the zero-inflated Poisson, both for the ECHO group alone and for the matched pairs comparing the control group to the ECHO group.

The first analysis ignores the time variable; we're just looking at averages: what is the average proportion of patients in the pre, ongoing, and post phases? Just from the means, and from the graph, it is quickly apparent that our pre phase is definitely lower than the ongoing and post phases. Because we have a small sample of practitioners, we also looked at the variance, and we noted an issue with non-equal variances, so we used Welch's test instead of the traditional ANOVA to test for differences between the phases, and obviously we do have differences. Then, to determine statistically which phases were different, we used the All Pairs Tukey-Kramer test. It confirmed that our pre phase was different from the ongoing and post phases, but the ongoing and post phases were very similar. This is indicating that we do have some differences; the ECHO program is making a difference.

As I noted before, segmented regression was suggested because we have the three phases: pre, ongoing, and post. Dr. Ramsey suggested that we use a script written by David Burnham of Pega Analytics. If you are interested in this script, the link is here; the website gives you the code along with a very detailed, line-by-line description of what the code is doing. When we ran the script, we were able to fit separate regressions to the three phases, as well as get the fit, an R², and a test of the significance of the slope. Again, we're seeing the pattern we saw with the ANOVA: the pre phase is definitely lower than the ongoing and post phases. Unfortunately, the slopes for all three phases were not significant, and you can also see that we have a significant amount of variability.

Next, we considered the fact that we had two categories of practitioners: the nurse practitioners and physician assistants, and then the physicians, even though the sample sizes were small.
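Before looking at those subgroups, here is a rough, non-JMP sketch of the Welch's test and all-pairs Tukey-Kramer comparison described above. The beta-distributed proportions are simulated stand-ins, not the nine providers' actual monthly MAT proportions, and statsmodels and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.oneway import anova_oneway

rng = np.random.default_rng(1)
# Stand-in data: monthly proportions of patients with any MAT, by program phase.
pre = rng.beta(2, 10, size=60)       # lower proportions before ECHO
ongoing = rng.beta(4, 10, size=60)   # higher during ECHO
post = rng.beta(4, 10, size=60)      # similar to ongoing after ECHO

# Welch's ANOVA: does not assume equal variances across phases.
welch = anova_oneway([pre, ongoing, post], use_var="unequal", welch_correction=True)
print("Welch F =", round(welch.statistic, 3), " p =", round(welch.pvalue, 4))

# All-pairs comparison of the phase means (Tukey HSD / Tukey-Kramer style).
tukey = stats.tukey_hsd(pre, ongoing, post)
print(tukey)
```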
We decided to see whether we could detect some kind of signal and what was going on. When we looked at the nurse practitioners, again you see the behavior of the pre phase being lower than the ongoing and post phases. What we noted is that for the pre phase we are seeing a bit of a signal; even though the p-value is 0.07, it is saying there is something going on. Unfortunately, for the ongoing and post phases we're not seeing that significance, but again we are seeing the trend. What's interesting is when we looked at the physicians: the slopes are definitely not significant, and there doesn't seem to be any difference whether they're in the pre, ongoing, or post phase. So while it may not be statistically significant, of practical interest is the fact that the ECHO program does seem to be benefiting the nurse practitioner and physician assistant group.

The next thing we tried was the zero-inflated Poisson model. We chose the Poisson model because we had counts, and we tried both the regular and the zero-inflated versions, choosing the zero-inflated model because we had a lot of zeros in our data. When we fit it, you'll notice a slight increasing trend in the proportion of MAT patients over time, and looking at the parameter estimates, month is significant. Note that the zero-inflation parameter is zero, so the zero inflation wasn't really doing much, but it was informative for us. Unfortunately, when we went to evaluate the fit of the model, it quickly became apparent that it was a poor fit. For example, the generalized R² is very low, and in the actual-versus-predicted plot the predicted values range between roughly 0.1 and 0.4, whereas the actual data range between zero and one. The data in between is really stretching this model; what we would have liked to see is more of the predictions following the diagonal line. Having said that, we did see some trends, and even though we didn't have the statistical piece, we did have some practical interpretation.

Moving on, we did have a control group. The nine control providers were directly compared to the nine ECHO providers, matched on title and city. We were trying to match whether they were a nurse practitioner, physician assistant, or physician, and then where their practice mainly was, their primary city. The reason is that the demographics are very different as you go from the south of New Hampshire up to the north, so we wanted to make sure we captured that.
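Before turning to the matched-pairs comparison, here is a minimal sketch of a zero-inflated Poisson fit of the kind just described, using statsmodels rather than JMP's Generalized Regression platform. The provider-month counts are simulated stand-ins; the total-patient counts enter as an exposure so the count model effectively describes the proportion of MAT patients.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(7)
n = 270  # e.g., 9 providers x 30 months (stand-in)
month = np.tile(np.arange(30), 9)
total_patients = rng.integers(10, 160, size=n)

# Simulated counts: the MAT rate drifts up slowly with month, with extra zeros mixed in.
rate = np.exp(-2.0 + 0.02 * month)
counts = rng.poisson(rate * total_patients)
counts[rng.random(n) < 0.15] = 0  # excess zeros

X = sm.add_constant(pd.DataFrame({"month": month}))
zip_model = ZeroInflatedPoisson(counts, X, exog_infl=np.ones((n, 1)),
                                exposure=total_patients, inflation="logit")
zip_fit = zip_model.fit(maxiter=200, disp=False)
print(zip_fit.summary())

# Rough fit check, analogous to the actual-versus-predicted look in the talk.
pred = zip_fit.predict(exog=X, exog_infl=np.ones((n, 1)), exposure=total_patients)
print("corr(actual, predicted) =", round(np.corrcoef(counts, pred)[0, 1], 3))
```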
When we did the matched pairs test, we created a confidence interval for the difference, between our control providers and our ECHO providers, in the proportion of patients who had any MAT relative to the total number of patients. First off, when you look at the confidence interval, zero is not in it, so we do have a difference. Looking at the actual means, the treatment (ECHO) group has about a 0.2 proportion of patients with MAT, while the control group has only 0.13, so we have a difference of about 0.07. We are seeing that our ECHO group is prescribing MAT more frequently than our control group.

Next, we again tried to bring in the time variable, so we looked at the zero-inflated Poisson model. Here the ECHO group is in red and the control group is in blue. Looking at it graphically, we are seeing an increase over time, and it is evident that the ECHO group is slightly higher than the control group in its proportion of prescribing. Again, the parameters in our model are significant. Unfortunately, when we moved to assess the model, we had a very similar result as when we looked at the ECHO group alone, which is somewhat unsurprising since we are using similar data. Again, the generalized R² is very low, and in the actual-versus-predicted plot the model is predicting between about 0.05 and 0.4, whereas we were hoping it would predict along the full range from zero to one.

So, findings. Overall, we were able to detect a difference in provider prescribing patterns before and after they participated in Project ECHO. We did see a small difference in those patterns between the providers who participated in Project ECHO and those who did not. One of the things we did not control for was the total number of patients, so that may be something to consider later. We also noted that there may be a difference in the impact of Project ECHO by provider title, and we may need to delve further into that.

Next steps: this is an ongoing project, and these are the next steps we are considering. First, we would like to include additional providers to increase the size of the database, and one way we are looking at doing that is to include more of the ECHO periods; we have one that was finishing up this year, so we're hoping to add practitioners from that period. We also may want to reconsider the methodology for detecting MAT in the medical and pharmacy claims data. We would also like to analyze at the practice level with case-control studies to help combat the small sample size. And again, the overall goal is to fit an appropriate model.

We want to thank you for taking the time to listen to our presentation on the PACT-MAT ECHO. If you have any questions, please contact us at the following emails. Enjoy the rest of your conference.
Thank you. Thank you.
In an AMAT Six Sigma Black Belt project on a PVD sputtering process, the project goals were to optimize several film properties. A Pugh Matrix was used to select the most feasible hardware design, and the PVD sputtering process parameters (Xs) were chosen based on the physics of the PVD sputtering process.

To improve the design structure, definitive screening design and augmented design were used to avoid confounding at Resolution II and III. An RSM model and least squares fit were used to construct predictive models of these film properties. In addition to main effects, several interaction effects were found to be significant in the reduced RSM model; each interaction effect uncovered potential insights from two competing PVD sputtering physics models.

To further optimize the multiple-stream sputtering process, a group orthogonal supersaturated design (GOSS) DOE was utilized to optimize each sputtering process block. The film properties (Ys) compete with each other when searching for the optimal design. Three JMP techniques were used to find the optimal robust design: (1) set up a simulation-based DSD to optimize the desirability functions; (2) conduct Profiler robust optimization to find the optimal design; (3) run a Monte Carlo simulation to estimate the non-conforming defect percentage. By using these JMP 16 modeling platforms and techniques, this Black Belt project was very successful, not only in improving the film properties but also in furthering the understanding of how the physics interact in the process. This multifaceted JMP DOE, modeling, and simulation-based approach became the benchmark for several new JMP projects undertaking a similar approach.

Hi. Hello, everyone. This is Cheng Ting here, and the other presenter today will be Cui Yue. Today, we are going to share with you our experience of exploring JMP DOE design and modeling functions to improve the sputtering process.

First of all, let's have a brief background and introduction to the project. This is a Black Belt Six Sigma DMAIC project on the sputtering process, and the project goals were to optimize several film properties. In the define phase, we identified three CTQs and their corresponding success criteria. CTQ1 is the most important and the most challenging one; its success criterion is a measured result larger than 0.5. CTQ2 and CTQ3 are equally important; the success criterion for both is a measured result less than 0.05. Different JMP tools were applied extensively throughout the measure, analyze, and improve phases, and we will share our experience of using those tools here.

In the measure phase, we did an MSA and finalized the data collection for the baseline hardware. Three tuning knobs, namely X₁, X₂, and X₃, are involved. After data collection, we used a Monte Carlo simulation in JMP to analyze the baseline capability; this is the first tool we introduce today. To establish the baseline model, we used augmented DOE, RSM, the prediction profiler, and the interaction profiler, as well as model check functions in JMP. After entering the analyze phase, in order to do root-cause analysis and capability analysis, we used the goal plot, the desirability function, multivariate methods, and graphical analysis tools.
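Before moving on to the improve phase, here is a minimal Python sketch of the capability math behind the Monte Carlo baseline analysis mentioned above: Ppk and the percent non-conforming for a one-sided spec such as CTQ1 > 0.5. The distribution parameters are invented placeholders, chosen only to mimic a process whose mean sits below the lower spec limit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lsl = 0.5                                             # lower spec limit for CTQ1
ctq1 = rng.normal(loc=0.42, scale=0.03, size=10_000)  # simulated baseline results

mean, sigma = ctq1.mean(), ctq1.std(ddof=1)
ppk = (mean - lsl) / (3 * sigma)                      # one-sided Ppk (lower spec only)
pct_out = stats.norm.cdf(lsl, loc=mean, scale=sigma) * 100  # % below LSL (normal model)

print(f"Ppk = {ppk:.2f}, estimated % below LSL = {pct_out:.1f}%")
print(f"empirical % below LSL = {(ctq1 < lsl).mean() * 100:.1f}%")
```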
In the improve phase, we did a hardware modification where another tuning knob, X₄, was introduced. The interactive graph, augmented DOE, RSM, the desirability function, and the interaction profiler were used in this section to further improve the process. The GOSS design, stepwise fit, and desirability function were then used in the robust DOE modeling to improve the process further. Some of these tools will be demonstrated by Cui Yue later. The control phase will not be covered in today's presentation.

As we mentioned, in the measure phase, after the baseline collection we used the Monte Carlo simulation to understand the baseline process capability. These are the baseline capabilities for the three CTQs. We can see that all of them are normally distributed, indicating a strong linear correlation with the input parameters. For CTQ1, the sample mean is outside the spec limit, so we have a very negative Ppk value; 99% of the baseline process results cannot meet the CTQ1 spec at all. CTQ2 is the closest to spec among the three CTQs: the sample mean is lower than the upper spec limit, so we have a positive Ppk, but 48% of the baseline process results still do not meet the CTQ2 spec. The sample mean for CTQ3 is outside the spec limit as well, so it has a negative Ppk, and 64% of the baseline process results do not meet the spec. This baseline capability confirmed that the biggest challenge for this project is CTQ1.

Apparently, the baseline process condition, which we will call process condition 1 here, cannot meet our CTQ success criteria, especially for CTQ1, so we will need to tune the process condition. Before that, however, we need to know whether the current hardware can meet the requirement or not. The subject matter experts (SMEs) proposed two hypotheses and advised us to shift the process to condition 2 based on the second hypothesis. Before doing so, we needed to check whether the prediction model is good for both conditions. Hence, we used the scatter plot to check the collected data structure. As we see here, the data collected are not in an orthogonal structure. This is because we used a two-step evaluation design and widened the process range to meet the success criterion of CTQ1. We do have weak prediction capability in the widened area; however, we still have good prediction for condition 1 and condition 2. We also did a confounding analysis; in fact, there is a certain confounding risk at Resolution II between X₁ and X₃.

Nonetheless, we still built a prediction model. We used the response surface method for the fitting; in this case, the main effects, interactions, and quadratic terms are fitted together. Based on the R², the adjusted R², and the p-value, we can see it is a valid prediction model. From the effect summary, we can see that only the significant terms are included in the model. With the interaction profiler, we can see two interactions that correspond to the two hypotheses we mentioned before. With the prediction profiler, we picked process condition 2.
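As a rough, non-JMP illustration of the response-surface fit just described, here is a minimal statsmodels sketch that fits main effects, two-way interactions, and quadratic terms, then checks the term p-values, R², and adjusted R². The column names (X1, X2, X3, CTQ1) and the synthetic data are hypothetical stand-ins, not the project's factor ranges or measurements.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 40
df = pd.DataFrame({
    "X1": rng.uniform(-1, 1, n),
    "X2": rng.uniform(-1, 1, n),
    "X3": rng.uniform(-1, 1, n),
})
# Synthetic response with one interaction and one quadratic term plus noise.
df["CTQ1"] = (0.45 + 0.06 * df.X1 - 0.04 * df.X3 + 0.05 * df.X1 * df.X3
              - 0.03 * df.X2**2 + rng.normal(0, 0.01, n))

# Full RSM model: main effects, two-way interactions, quadratics.
rsm = smf.ols(
    "CTQ1 ~ (X1 + X2 + X3)**2 + I(X1**2) + I(X2**2) + I(X3**2)", data=df
).fit()
print(rsm.summary())  # per-term p-values, analogous to the effect summary
print("R2 =", round(rsm.rsquared, 3), " adj R2 =", round(rsm.rsquared_adj, 3))
```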
At this process condition, the 95% confidence interval for CTQ1 is between 0.5 and 0.6, so this CTQ has been tuned into spec. In the meantime, however, CTQ2 and CTQ3 are out of spec. Hence, we used the goal plot to compare the two process conditions and realized that we improved CTQ1 from process condition 1 to process condition 2 by getting closer to the target and narrowing the standard deviation; in the meantime, however, CTQ2 and CTQ3 were compromised, with larger standard deviations and a greater distance to the target. Hence, there is a tradeoff between the three CTQs.

In this case, we tried to find an optimized solution with the desirability function in JMP. For CTQ1, the success criterion is more than 0.5, so we used the maximize-with-plateau setting for the desirability function, meaning any value greater than or equal to the target is equally preferred. We also highlighted the importance of CTQ1 by setting its importance factor to 1.5. For CTQ2 and CTQ3, the success criterion is less than 0.05, so we used the minimize-with-plateau setting, where any value less than or equal to the target is equally preferred. However, after maximizing the desirability, the calculated optimized solution is only around 0.02, and none of the three CTQs can meet the success criteria. Hence, we can conclude that there is a hardware limitation in this case.

After discussion with the SMEs, we decided to introduce Y₄ into our data analysis. Y₄ is not a CTQ here, but it is a measurement that reflects an intrinsic property of the process, and this intrinsic property affects CTQ2 and CTQ3 directly. If Y₄ is more than zero, it is a positive process; if Y₄ is less than zero, it is a negative process. If Y₄ is close to zero, we call it a neutral process, which leads to smaller values of both CTQ2 and CTQ3. Here is the distribution of Y₄ for the baseline hardware: as we can see, with the baseline hardware Y₄ is always more than zero, so we will always have a positive process. The multivariate graph here shows the relationship among Y₄, CTQ2, and CTQ3: they are strongly correlated, so if we have a smaller Y₄, we will also have smaller CTQ2 and CTQ3.

In order to have a wider range of Y₄, we decided to add another factor, X₄, in the improved hardware. Along with it, another two scientific hypotheses were proposed by the SMEs. We collected data on the new hardware and compared the Y₄ distributions of the two hardware configurations. On the baseline hardware, without X₄, we collected data orthogonally over a certain range for each factor; this is the distribution of Y₄ under those conditions. On the improved hardware, with X₄ introduced, we collected data over the same ranges of X₁, X₂, and X₃, and this time Y₄ at different X₄ values was also collected. Comparing the two distributions, we can see that without X₄, we only see one cluster for Y₄, with the peak value more than zero.
However, with X₄ introduced, we can observe a bimodal distribution for Y₄, with one peak whose mean is more than zero and another peak whose mean is less than zero. The process conditions that make Y₄ less than zero drew our attention: under those conditions we will have a negative process, and this may help us improve CTQ2 and CTQ3. We would choose such a process if we cannot meet all CTQs in a single process, because a neutral process benefits CTQ2 and CTQ3. We did a simple screen of the process conditions that give a negative process, and this led us to a certain range of X₄. That's why we collected more data in this range; it is our condition of interest. Now we conclude that X₄ does impact Y₄ and thus can impact CTQ2 and CTQ3.

Next, we can further study the impact of X₄ on CTQ1 and build another model for the improved hardware. Prior to the data collection, we prescreened the conditions of interest using the interactive graph in JMP. We collected more data within a certain range of X₄, because this range of X₄ gives us negative Y₄ values and also covers most of the range of Y₄. As we can see here, this is not the most orthogonal structure, since we collected more data at the conditions of interest; however, after doing a design evaluation, we found a low confounding risk, so the data structure is still good for modeling.

This is the model we constructed, and it is an adequate model: only factors with p-values less than 0.05 were included in the model, the R² is more than 0.8, the difference between the adjusted R² and the R² is less than 10%, and the p-value for the whole model is always less than 0.001. Also, through the interaction profiler, hypotheses 1 through 4 have been validated.

This time, can we find an optimized solution? Again, we ran the desirability function. The left side is the optimized solution provided by the baseline hardware before the X₄ installation, and the right side is the optimized solution with the improved hardware with X₄ installed. Compared with the baseline hardware, the improved hardware did provide an optimized solution with higher desirability and improved results for each CTQ. However, the desirability is still low, only 0.27, and not all CTQs meet the success criteria in one step, so we still did not find an adequate one-step solution for the project.

However, as we mentioned previously, since we have a cluster of process conditions that allows a negative process with Y₄ less than zero, we can propose a potential solution with a two-step process. The solution with a two-step process may not be that straightforward. As we know, if we can find the optimized solution in one step, all we need to do is run the process at the conditions that give the maximized desirability, and the result will be predictable since we have a known model for it. Now we want to have a two-step process.
For each step, we have a known model once the process condition is determined. However, due to the different process durations for each step, we will have a new model for the final results. In this new model, we will have nine variables in total: X₁ to X₄ for each step, plus the duration for each step. Now, the question is, how are we going to find a proper solution for the two-step process?

We have two strategies. The first is to do a DSD for the nine variables. In this case, we would need at least 25 runs for the first trial. Of course, we would have an orthogonal data structure and could construct the RSM model, but the cost would be very high. The other strategy is to screen first with a group orthogonal supersaturated design, the GOSS design. In this case, we can screen the impact of seven variables with six runs; this is why it is supersaturated — we have more variables than data points. Of course, we will need to screen out two variables before the GOSS, and we used the interactive graph again for this; the details will be reviewed in the next slides. The GOSS design provides two independent blocks, one for each process step. There are no interactions between factors across blocks, and the data structure is orthogonal within each block, making it possible to screen effects with a supersaturated data collection. However, the GOSS shows the impact of main effects only; no interactions are considered. This is a low-cost approach, and we can manage a further DOE design afterward; the follow-up DOE can be a DSD, an augmented design, or OFAT. Each of these has its own pros and cons, which will not be covered in this presentation. Anyway, to save cost, we decided to proceed with strategy 2 and start with a GOSS design.

As we discussed, we have nine different variables; however, in the GOSS we can only include seven. In order to narrow down the parameters for the GOSS design, we did a simple screen with the interactive graph. For step 1, we chose process conditions that allow us to have a good CTQ2, and after screening, we decided to fix X₂ in this case, based on the previous learning. As seen here, when CTQ1 is more than 0.5, all we have is a positive process with Y₄ of more than 20%. Hence, for step 2, we chose process conditions that give Y₄ less than -0.5, so we can have a negative process; adding the two steps together, the final Y₄ will be closer to zero, which improves CTQ2 and CTQ3. After screening, we decided to fix X₁ for this step, based on the previous learning.

After data collection, we did a stepwise fit with main effects only, since in the GOSS, as we mentioned previously, only main effects are considered. For all three CTQs, the models were validated with p-values less than 0.05, adjusted R² around 0.8, and VIF less than 5. After maximizing the desirability, the model provided an optimized solution with desirability greater than 0.96, which is far higher than 0.27, not to mention 0.02.
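Since desirability scores anchor each stage of this story (0.02, then 0.27, now more than 0.96), here is a minimal sketch of the Derringer-Suich-style calculation underneath desirability functions like JMP's: maximize-with-plateau for CTQ1 with importance 1.5, minimize-with-plateau for CTQ2 and CTQ3, combined as a weighted geometric mean. The thresholds and candidate predictions below are stand-ins, not the Profiler's actual settings.

```python
import numpy as np

def d_larger(y, low, target):
    """Larger-is-better: 0 below `low`, 1 at or above `target`, linear ramp in between."""
    return np.clip((y - low) / (target - low), 0.0, 1.0)

def d_smaller(y, target, high):
    """Smaller-is-better: 1 at or below `target`, 0 at or above `high`, linear ramp in between."""
    return np.clip((high - y) / (high - target), 0.0, 1.0)

def overall_desirability(ctq1, ctq2, ctq3):
    d1 = d_larger(ctq1, low=0.3, target=0.5)      # CTQ1 success: > 0.5 (plateau above target)
    d2 = d_smaller(ctq2, target=0.05, high=0.15)  # CTQ2 success: < 0.05
    d3 = d_smaller(ctq3, target=0.05, high=0.15)  # CTQ3 success: < 0.05
    w = np.array([1.5, 1.0, 1.0])                 # CTQ1 weighted more heavily (importance 1.5)
    d = np.array([d1, d2, d3])
    return np.prod(d ** w) ** (1.0 / w.sum())     # weighted geometric mean

# Example: compare the predicted CTQs of two candidate settings.
print(overall_desirability(0.55, 0.04, 0.06))
print(overall_desirability(0.48, 0.03, 0.03))
```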
Hence, we can lock in the process parameters and further refine the optimized solution in the next step, for which we chose to use OFAT; but that will not be covered here.

Here, I would like to summarize what we have discussed in this presentation. We shared our experience of using different JMP tools for data analysis throughout the different stages of the DMAIC project. For the baseline capability analysis, we used the Monte Carlo simulation and the goal plot. For root-cause analysis, we used multivariate methods and graphical analysis. To help with the DOE, we used augmented DOE, GOSS, and design diagnostics. To obtain a good model and prediction, we used different fitting functions, along with the prediction profiler and the interaction profiler; as was mentioned, these profilers are not only used for modeling and prediction, they also help us gain a deeper understanding of the process itself. To screen the conditions of interest, we used the interactive graph, which is simple but very useful and powerful. And for decision making, we used the desirability functions to help us make decisions. So far we have shared our experience of how JMP helps us do the analysis. Last but not least, we would like to thank Charles for his mentorship along the way. Thank you. My partner Cui Yue will now share some demonstrations on the JMP side; she will demonstrate how we can use the interactive graphical analysis as well as the stepwise fitting in the GOSS model. Okay, thank you.

Thank you, Cheng Ting. I think you can see my screen now, right? First, I would like to introduce the interactive plot Cheng Ting just mentioned. This is actually one of my personal favorite functions in JMP: it's simple, but very powerful. Here, our purpose is to screen which factor is most related to Y₄ — which one, from X₁ to X₄, contributes the most to Y₄. We can simply select all the factors of interest here and click OK. Now we have the distributions of all the factors. As Cheng Ting mentioned, we want to know what contributes the most to Y₄ on the negative side, for example, here. Now we can see that only X₄ from 13-14, X₃ from 0-1, X₂ in the 2.5-5 range, and X₁ at 19-20 can make this happen. This makes it easy to see the ranges of the different factors, how they contribute to Y₄, and how we should choose the factors if we want Y₄ to reach a certain level. Similarly, if we want it to be slightly higher than 0%, we can simply click this area. Actually, to be quite straightforward and solve the problem in one shot, we just select these two together. Then we can see that X₄ should be in this range, maybe 10-14; X₃ should definitely concentrate at 0-1; X₂ has a slightly wider distribution; and for X₁, there are only two candidates in this direction.
From this, we can easily and intuitively find the contributing factors we want. This is the first function I would like to demonstrate to you, the Distribution function. There are many other things the Distribution function can do, including viewing the data distribution and running a lot of tests, [inaudible 00:24:49] tests, et cetera, so I won't cover those here.

The second thing I want to share with you is the GOSS, specifically the GOSS fit with stepwise regression. We have three CTQs, and the analysis dialogs are open here; each of the three has a separate analysis dialog. To do this in a very straightforward way, we just hold Control and click Go; it will select the factors for all three CTQs at once. All three use the same stopping rule, which is the minimum BIC. Then let's click Run Model. This fit model gives us separate fits for CTQ1, CTQ2, and CTQ3.

Next, we need to reduce the models one by one. Here, our criterion is a p-value lower than 0.05: when the p-value is lower than 5%, the factor is significant. Here, we can remove X₄; with that, we are done for CTQ1. For CTQ2, we can remove X₁ and X₂; okay, now both remaining factors are lower than 0.05. For CTQ3, we can do the same thing accordingly; let's remove X₄ first. Now all three factors' p-values are lower than 0.05, and we have obtained the reduced model.

Here at the bottom we have a prediction profiler; if you don't have it, you can add it from the Profiler function. Then we would like to find the optimum condition. How are we going to do that? We are going to use the desirability function. The first step is always to set the desirability, which is already set here: we have one response to maximize and two to minimize. Then let's use the Maximize Desirability function; here, we can find our optimum condition. If we use Maximize and Remember, here is our optimal condition. Then we can use this condition to run the process and validate it again. These are the two functions I'd like to introduce to you. Okay, thank you.

Thank you.
The goal of this Six Sigma BB project was to improve the pin gauge measurement capability for the diffuser hole size, which is critical to production failure analysis. Several JMP analysis objectives are in the Measure Phase. First, compare the Excel Xbar-R method vs. the JMP ANOVA crossed method. Second, choose the GRR criterion between the Precision to Tolerance (P/T) ratio and the Precision to Total Variance (P/TV) ratio. Third, use the SPC control chart to monitor GRR stability, determine the process long-term sigma for calculating the GRR P/TV ratio, and identify the process rational subgrouping. Fourth, evaluate the pin gauge wear risk to determine whether the pin gauge measurement is a destructive test and whether to use the GRR nested model. Finally, use GRR misclassification to assess both the alpha risk and the beta risk to production yield/cost. JMP platforms have helped us execute this Six Sigma BB project effectively.

Hello, everyone. My name is Kemp; I come from Taiwan, and I work at Prime Material in global continuous improvement. The other author is Wayne, also from Taiwan; he works at Prime Material as a customer quality engineer. Today we will present how to utilize JMP platforms to combat measurement system errors throughout the project. The project concerns measurement system errors between supplier A and supplier B. I will go through the first two topics, the introduction and supplier A's measurement system, and then Wayne will go through the next two topics.

In the introduction, I will go through the background, the SIPOC and in/out-of-scope analysis, and the correlation of the MSA CTQs. For supplier A's measurement system, we will cover the Excel Xbar-R vs. JMP ANOVA crossed method, the Gauge R&R criteria of the P/T and P/TV ratios, the variability chart and Gauge R&R mean plots, and Gauge R&R misclassification. For supplier B's measurement process, we will cover the C&E diagram, the Fit Y by X contingency platform, the one-sample t and sample power t-test, and Gauge R&R. The final conclusion will show the SPC control chart for Gauge R&R and a summary of the analysis, and the rest is the takeaway learnings from JMP.

The first topic is the introduction. The project was based on the problem statement, and we understand the VOC. The background of this project is that both supplier A and supplier B measured the same parts, and supplier B unexpectedly got much worse results than supplier A. The customer has requested that we find out the gap and validate the measurement systems of suppliers A and B.

Then we go to the SIPOC to understand our CTQs; we found three. CTQ1: standardize supplier B's measurements, which need to meet a certain criterion — the resolution needs to be less than 10%. CTQ2: verify the gap between supplier A and supplier B — the bias needs to be zero. CTQ3: supplier B's Gauge R&R, which needs to meet the criterion of a ratio less than 30%.

This is the SIPOC structure. We start with the customer's voice and work back to the output CTQs we identified, and then back to the process items, all bound to the measurement system, with four sequential steps: calibration to operation, pin gauge calibration, sampling for the diffuser hole measurement, and finally validating the hypotheses in the same way as the PQP Excel file.
As you can see, this is all about the MSA. On the following slide, we will go into the picture of the relationship between the SIPOC, the VOP, and the customer's VOC. There are two measurement outputs, one for supplier A and one for supplier B, and our goal is that the two measurement systems need to be the same. The measurement process variation has two parts: one is accuracy, the other is precision. Accuracy covers bias, linearity, and resolution; stability is not in our scope. These three map to our CTQ1 and CTQ2. CTQ3 addresses the precision side, the repeatability and reproducibility.

We will now go through the second topic, supplier A's measurement system. For our Gauge R&R analysis, we use continuous data, the crossed model, and a non-destructive test. What is the Xbar-R method, and what is the ANOVA method? The Xbar-R method is what the PQP Excel file uses. It has two disadvantages: first, it cannot detect the parts-to-operator interaction; second, it estimates the Gauge R&R assuming a normal distribution, so it is impacted by any outlier. It is also not good when there are more than five data points per sample. The ANOVA method not only can detect the parts-to-operator interaction, it also uses the standard deviation directly, so the normal distribution does not need to be assumed. In our project, the interaction is high, so we must use the ANOVA method for our Gauge R&R analysis.

First, we took the raw data from the PQP file. It uses 10 data points for the study; actually, an Xbar analysis is not good for more than five data points, as I mentioned on this slide. Anyway, we still put the same data into the JMP ANOVA as well. The first summary, from the PQP Excel file, uses only the Xbar-R errors. It tells us there are two ratios for the Gauge R&R: P/TV is 24%, which is [inaudible 00:06:52], and P/T is 9%, which is pretty good.

What is the difference between P/TV and P/T? We should know there are three variations: equipment variation (EV), which is called repeatability; operator variation (AV), which is called reproducibility; and part variation (PV), which depends on the sample selection. In the P/TV definition, P is the precision, equal to EV plus AV, and TV is the total variance, equal to EV plus AV plus PV. There is a risk that the part sample range of the gauge study will highly impact the P/TV ratio result. In the P/T definition, T is the tolerance, the part's spec range. When the spec is well defined, the P/T ratio is our preferred MSA success criterion for the Gauge R&R.

For the second [inaudible 00:08:17], JMP has no Xbar-R method, only ANOVA. There are two models: one is the main effect model, and the other is the crossed model. The main effect model has no parts-to-operator component; that variation will mostly be assigned to repeatability, as if it were caused by machine issues. On the other hand, the crossed model can correctly derive the parts-to-operator interaction component, which tells us whether the issue in the Gauge R&R comes from the process.
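To make the crossed-model calculation above concrete, here is a minimal non-JMP sketch: a part-by-operator ANOVA with interaction, the expected-mean-square variance components, and the P/TV and P/T ratios. The simulated measurements, spec limits, and the 6-sigma convention in P/T are stand-in assumptions, not the PQP study data, and the interaction component is folded into the precision term here.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(5)
parts, operators, reps = 10, 3, 3
rows = []
for p in range(parts):
    part_effect = rng.normal(0, 0.20)          # part-to-part variation (PV)
    for o in range(operators):
        op_effect = rng.normal(0, 0.02)        # operator variation (AV)
        for _ in range(reps):
            rows.append({"part": p, "operator": o,
                         "y": 90.3 + part_effect + op_effect + rng.normal(0, 0.02)})
df = pd.DataFrame(rows)

# Two-way crossed ANOVA with interaction.
aov = anova_lm(ols("y ~ C(part) * C(operator)", data=df).fit())
ms = aov["mean_sq"]
ms_part, ms_op = ms["C(part)"], ms["C(operator)"]
ms_int, ms_err = ms["C(part):C(operator)"], ms["Residual"]

# Expected-mean-square solutions for the crossed model (negative estimates set to 0).
var_repeat = ms_err                                         # EV
var_inter = max((ms_int - ms_err) / reps, 0)                # part*operator interaction
var_oper = max((ms_op - ms_int) / (parts * reps), 0)        # AV
var_part = max((ms_part - ms_int) / (operators * reps), 0)  # PV
grr = var_repeat + var_oper + var_inter                     # precision (EV + AV, interaction included)
tv = grr + var_part                                         # total variance

usl, lsl = 90.6, 90.0                                       # hypothetical spec limits
print("P/TV = %.1f%%" % (100 * np.sqrt(grr / tv)))
print("P/T  = %.1f%%" % (100 * 6 * np.sqrt(grr) / (usl - lsl)))
```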
We prefer the crossed model rather than the main effect model.

On the Gauge R&R variability chart, the left side of the graph shows the measurement mean. Here we need to see whether at least 50% of the points fall outside the control limits, which indicates the gauge is capable of detecting the [inaudible 00:09:32]. In our case, 100% are outside the control limits, which is good. In the standard deviation chart, operator B on part A has a repeatability issue. Also, 25% of the data points are at zero, which indicates the measurement resolution is questionable.

On the Gauge R&R mean plots, for measurement by operator, we can see that operator C reads higher than operators A and B. For measurement by parts, you can see the data go up and down, indicating large differences between parts. Also, the sample range is 1.2, equal to 20% of the tolerance range, which is too small, resulting in a higher P/TV ratio. The last plot is the parts-to-operator interaction: on parts 7 to A, the line for operator B crosses those of operators A and C, indicating an interaction between the appraiser and the part.

Finally, we come to the last part of the Gauge R&R: the type I and type II errors. The type I error is the false-reject alpha risk: on the manufacturer's side, a part [inaudible 00:11:04] is good but we reject it. Conversely, the type II error is the false-accept beta risk: on the customer's side, a part [inaudible 00:11:16] is bad but we accept it. Here we can see that alpha is zero and beta is 39%. If that is still not clear, here is an example. The manufacturer produces 300 parts: 200 are good parts and 100 are bad parts. If the beta risk is 39%, we will deliver 39 bad parts to the customer; that is, 39% of the bad parts will be delivered to the customer, which is not acceptable. We need to consider the spec limits to adjust our alpha and beta risks. In our data, the Gauge R&R samples are all good (100%), so only beta risk has been observed.

Here is the Gauge R&R summary: we recommend the P/T ratio instead of the P/TV ratio, and we recommend using the JMP crossed method to allocate the Gauge R&R errors. Even when P/T passes, we still have to watch the operator-to-parts interaction and the misclassification for the sake of our parts' quality. Okay, I have finished my part, so the next two parts will be shown by Wayne. Wayne, it's your turn.

Hello, this is Wayne speaking. Let me cover the last two sections, on the supplier B measurement process. This is the C&E (fishbone) diagram, used to identify potential causes across the standard procedure, supplier B, and supplier A. After discussion with engineering, we concluded there were five potential causes to standardize and validate. The first item is the pin measurement sequence: supplier B went from larger pin to smaller pin per its procedure, [inaudible 00:13:59] supplier A adopted going from smaller to larger. The second item is the pre-check by calibration: supplier B did not do this action per the standard procedure, while supplier A did. The third item is the pin gauge resolution.
Supplier B used a larger pin gauge increment, 50% of the tolerance in resolution, while the standard and supplier A use 10% resolution. The fourth item is whether to shake the pin before entering the diffuser hole or not. The last item is the pin holder weight: supplier B adopted a heavier pin holder than the standard procedure and supplier A. Now we will go one by one through the hypothesis tests and validate the data with chi-squared contingency analysis; a small sketch of this kind of test is shown at the end of this section. The first hypothesis we have to test is whether sequence one is different from sequence two or not. In sequence one, the pin goes from the smaller size to the larger size; sequence two is just the opposite, the reverse pin order. Here are the measurement results for sequence one and sequence two. Let's do the chi-squared test to see whether there is a difference between [inaudible 00:15:25] or not. In the contingency table, the result shows a p-value of less than 5%, so we reject the null hypothesis, which means sequence one really is different from sequence two. Therefore, we must watch out for this key factor in the pin gauge measurement. The second hypothesis is about prechecking the pin size with a calibration tool, checking your caliber before the measurement. Here is the diffuser hole size data. The assumption is that, without the precheck, we mistake a 90 pin for 90.6; in that case, 11 holes become no-go based on the contingency table. In the chi-squared test, we reject the null hypothesis, which means calibration before measurement is significantly important. The third hypothesis test compares the old measurement tool, with 20% resolution, to the new tool with 10% resolution, in terms of measurement variation. Using the same approach, we can see why we should use the higher resolution for the measurement. The same applies to item four, shaking the pin before entering the hole. We used a specific set of 48 holes, with and without shaking, to see the effect on go/no-go. The result shows the null hypothesis is rejected, which means shaking the pin matters for the measurement. However, for item five, the pin vise weight, the chi-squared p-value is more than 5%, which means we cannot reject the null hypothesis. That tells us there is no difference in the measurement between the unit weight and the five-times-unit weight. Because we have validated the data for all five key measurement items above, the FMEA can further estimate the RPN before and after the improvement from our recommended actions, based on severity, probability, and detectability. For the [inaudible 00:17:55] pin gauge measurement sequence, severity is high because of the wrong go result, and the probability is also high because of dislocation from the hole center; with sequence one, the detectability score is high. After we change to sequence two, the probability and detectability scores drop by half, so the RPN is reduced to 54. As you can see, all the scores are under 100, meeting our forecast. Okay, let's move on to CTQ2, the bias between supplier B and supplier A. From the FMEA, we are confident the diffuser pin hole size is around 90 to 90.6. However, supplier A's FA report shows all the pin holes at 96. We ran a distribution mean test, and the p-value suggests the gap between supplier B and supplier A, after standardization, really is significant. Therefore, we had a further discussion with supplier A.
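As referenced above, here is a minimal sketch of the kind of chi-squared contingency test used for the go/no-go comparisons. The counts are made-up placeholders, not the real 48-hole data; in JMP, the same test comes from Fit Y by X with two categorical variables.

```python
# Minimal sketch: chi-squared test of independence on a go/no-go contingency table.
# The counts below are illustrative placeholders, not the actual study data.
from scipy.stats import chi2_contingency

# Rows: measurement condition (e.g., with vs. without pin shaking);
# columns: go / no-go counts for the same set of holes.
table = [[44, 4],    # hypothetical: with shaking    -> 44 go, 4 no-go
         [33, 15]]   # hypothetical: without shaking -> 33 go, 15 no-go

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")

# Decision at the 5% level used in the talk.
if p_value < 0.05:
    print("Reject the null hypothesis: the condition changes the go/no-go outcome.")
else:
    print("Cannot reject the null hypothesis: no detectable difference.")
```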
We concluded there were two potential causes. One potential cause might be that some holes are at 93 and some are at 90.6. In the current sampling method, the parts are stratified into 24 rows, and one hole is sampled randomly from each of the 24 rows. Is that enough to find a larger hole size? We confirmed this with a sample size and power test; a small sketch of this type of power calculation appears at the end of this talk. The other possible cause could be that the hole is enlarged during the measurement itself, which we validated by repeating the pin gauge measurement. All right, let's see the results. Here are the sample size and power test results. In the exact binomial test, the power is 99.7%, which is more than 90%. With the normal approximation, the power is 97.6%, also more than 90%. In other words, for power above 90%, we would only need 18 points to measure. Therefore, the current stratified random sampling has good detectability. There is very little chance that we would fail to catch a 96 diameter with the current 24 stratified rows, so the parts really do have a larger hole size. Now, after the gap was found, we further checked whether our CTQ3 Gauge R&R is qualified or not. From the left table, you can see the hole diameter was increased by only 4% of the tolerance, and the resolution is less than 10%, so we can treat it as a nondestructive crossed-method study to qualify the measurement capability. Although the operator-to-part interaction accounts for 13.6%, the P/T ratio is still 18%, which meets the Gauge R&R criterion of less than 30%. We will address the interaction more later. Regarding the P/TV ratio, the sigma and variance are high and cannot really be quantified, because the selected hole sizes do not span a wide enough range, so that ratio is listed for reference only. Okay, for our conclusion, we use an SPC control chart to monitor the Gauge R&R stability. We use the Levey-Jennings chart to get the process long-term sigma for the Gauge R&R P/TV ratio calculation. The "before" phase at supplier B, with 50% measurement resolution, shows a larger sigma and accordingly wider control limits. The "after" phase, with 10% measurement resolution, shows a smaller sigma and narrower control limits. The measurement precision improved with respect to common-cause variation. Here is the summary. For CTQ1, we standardized the supplier B measurement with a resolution of less than 10%; only the pin holder weight was not significant, and the other four measurement items were all significant. For CTQ2, the sample size is good, and the hole enlargement issue during measurement was found. For CTQ3, supplier B proved its qualified measurement capability, with Gauge R&R less than 30%, repeatability less than 10%, and reproducibility less than 20%. As takeaway learnings, we used a lot of JMP tools, such as the C&E diagram, Fit Y by X contingency tables, the sample size and power test, the distribution mean test, and the ANOVA crossed Gauge R&R, to standardize our pin gauge measurement. That's all about our JMP practice. Thank you for listening.
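As referenced above, here is a minimal sketch of an exact binomial power calculation of the kind used for the sampling question. The null and alternative proportions, the sample size, and the alpha level are illustrative assumptions, not the study settings; the actual 99.7% and 97.6% figures came from JMP's Sample Size and Power platform.

```python
# Minimal sketch: power of a one-sided exact binomial test for detecting oversized holes.
# p0, p1, n, and alpha are illustrative assumptions, not the values used in the study.
from scipy.stats import binom

def exact_binomial_power(n, p0, p1, alpha=0.05):
    """Power of the one-sided test H0: p = p0 vs. H1: p = p1 (> p0) with n sampled holes."""
    # Smallest count k whose upper-tail probability under H0 is at or below alpha.
    k_crit = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)
    # Power: probability of reaching that count when the true proportion is p1.
    return binom.sf(k_crit - 1, n, p1)

# Hypothetical scenario: almost no oversized holes expected under H0, versus a sizable
# fraction of oversized holes under H1, with 24 stratified rows sampled.
print(exact_binomial_power(n=24, p0=0.01, p1=0.30))
```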
JMP platforms have significantly helped find the right parameters to determine an optimal process. This presentation demonstrates production cycle time analysis using JMP 16 data mining and text mining: using the Distribution platform to set up histogram conditions for systematic root cause analysis, building three Partition platform models to improve the R square, optimizing the partition model for both success and failure analysis, creating a Neural model, and using the Text Explorer platform to search for keywords that feed the modeling parameters used in the predictive model.     Hello, good morning and good evening, everyone. My name is Raisa. I'm a manufacturing quality engineer at Applied Materials, Taiwan. I started to learn JMP at the beginning of this year, and I recently passed the certification exam with a score of 925 this July. Today, I'd like to give a short presentation about QN Immediate Fix Time Analysis with JMP. As we know, once a quality notification (QN) is created, it takes additional time, more or less, to fix the issue, and that may impact production planning and scheduling. Therefore, we'd like to find the worst cases by analysis. Okay. For the analysis, here are five subtopics on the agenda: first, the Root Cause Analysis of QN Fix Cycle Time; then the Graphical Root Cause Analysis Summary; Comparing the Fit Model, Partition, and Neural Models; Hybrid Text Mining and Data Mining Analysis; and finally, Takeaway Learnings. Okay, let's get started. The histogram: the first layer of the root cause analysis of QN fix cycle time. Before the investigation, think about what scenarios impact the QN fix cycle time and how long is endurable. First, we define five days as the criterion, a key condition agreed on site-wide: within five days is in spec, the success analysis (SA) criterion; on the other hand, over five days is out of spec, the failure analysis (FA) group. Later, I directly break the data down into SA and FA. Notice the shape of the distribution between SA and FA. Meanwhile, look at the mosaic plot for the proportion of each category to infer potential root causes. For the success analysis, we can see that workmanship and MFG rework seem to have a quick response and better fix cycle time. For FA, dimension issues take more fix cycle time. There is an obvious variation in the fix time distribution between SA and FA, so we suppose defect type is one of the key factors, X1, impacting the fix cycle time. The box plot. The box plot is a graph of the distribution of a continuous variable, so we plot the continuous fix cycle time versus a nested structure, categorical country under containment, to search for other factors that may impact the fix cycle time. It displays the five-number summary of the data set. It is a non-parametric tool that uses the median as the central tendency. Besides, there are some observations on the box plot graph: first, at least seven points are needed to detect the first outlier; otherwise, it becomes a whisker (skew) problem when the sample size is less than seven. Second, observe skewed distributions by the box width or the whisker length.
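To make the five-day split and the box-plot summary concrete, here is a minimal pandas sketch of the same steps. The data frame and the column names (fix_days, defect_type) are hypothetical stand-ins for the real QN table; JMP's Distribution platform and box plots do this interactively.

```python
# Minimal sketch: tag QN records as SA/FA with the five-day criterion and
# compute the box-plot five-number summary by defect type.
# The data and column names are hypothetical stand-ins for the real QN table.
import pandas as pd

qn = pd.DataFrame({
    "defect_type": ["Workmanship", "Dimension", "Damage", "Dimension", "Workmanship", "Damage"],
    "fix_days":    [2, 9, 34, 12, 1, 4],
})

# Five days or less is in spec (success analysis, SA); over five days is failure analysis (FA).
qn["group"] = qn["fix_days"].apply(lambda d: "SA" if d <= 5 else "FA")

# Five-number summary (min, Q1, median, Q3, max) of cycle time per defect type.
summary = (qn.groupby("defect_type")["fix_days"]
             .quantile([0.0, 0.25, 0.5, 0.75, 1.0])
             .unstack())
print(qn["group"].value_counts())
print(summary)
```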
How do we handle the marginal outliers, which we think are within two-sigma Gauge R&R noise of the whisker? Coming back to the root cause analysis, it is not difficult to find that the fix cycle time for the containment type "replacement" is much longer than for the other containments. With that, we add X2, containment, and X3, country. Heatmap. The heatmap is another graphical tool that depicts data values by color. Again, up to now we have gathered three input factors: defect type from the histogram, and containment and country from the box plot. In order to study which inputs impact the fix cycle time, we put the categorical defect type on the Y axis, color by cycle time, and keep the nested structure, categorical country under containment, in the X grouping. Then an 8-by-9 layout makes it easy to quickly pick out the maximum and minimum cycle time scenarios. For FA, it is easy to find the red area, right? It highlights the longest fix cycle time. With that, replacement of damaged parts from Taiwan is the worst case for cycle time, and replacement of dimension-issue parts from the United States is the second-worst scenario. For SA, it is mostly dimension and damage defects; the others are quick to fix. For the Pareto chart, to further analyze the FA and SA findings from the heatmap, we use a two-dimensional Pareto chart with two variables, defect type and country, under a specific containment. Here are X1 defect type, X2 containment, and X3 country, as mentioned before; then we add an additional factor, workstation, as X4 in the Pareto chart to visualize event frequency. Now, for FA, the failure analysis, we see that replacement of Taiwan supplier damage issues frequently happens at the CVD service workstation, and replacement of United States supplier dimension issues often happens at the CVD workstation. In the same way for the SA analysis, instead of dimension or damage defects, the United States supplier's functional and workmanship issues can be fixed quickly at the CVD module test. Currently, we have four input factors plus the SA and FA frequencies. What we are more interested in is the pass/fail frequency versus the pure cycle time. Then Tabulate. Here we put our previously mentioned factors, X1 to X4, into Tabulate. Meanwhile, we tabulate the pure cycle time and the frequency count to do further comparisons. For FA, the CVD service workstation, damage issue, Taiwan supplier requiring replacement did take a longer cycle time; although its frequency is not the highest, only seven occurrences here, the mean cycle time of 34 days is much longer than the others. For SA, at the CVD module test, we see a measurement issue from the United States supplier fixed by MFG rework. Even though it shows only one day in the table, the frequency is far too low to draw a conclusion. Here I summarize the main points. Following the root cause analysis, we used different graphical JMP platforms in an engineering and logical sequence to conduct a preliminary root cause analysis; in the previous slides, I showed the Histogram, Box Plot, Heatmap, Pareto Chart, and Tabulate. Second, we identified potential input Xs to predict the QN fix cycle time. According to the Tabulate, the FA results come from damage issues, replacement containment, Taiwan suppliers, and the CVD service workstation.
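Here is a minimal pandas sketch of the Tabulate-style summary just described, computing the mean cycle time and the count for each factor combination. The data frame and column names are hypothetical stand-ins for the QN table; JMP's Tabulate platform builds the same kind of table interactively.

```python
# Minimal sketch: Tabulate-style summary of mean cycle time and frequency by factor combination.
# The data frame and column names are hypothetical stand-ins for the real QN table.
import pandas as pd

qn = pd.DataFrame({
    "defect_type": ["Damage", "Damage", "Dimension", "Workmanship", "Damage"],
    "containment": ["Replacement", "Replacement", "Replacement", "MFG rework", "Replacement"],
    "country":     ["Taiwan", "Taiwan", "United States", "United States", "Taiwan"],
    "workstation": ["CVD service", "CVD service", "CVD", "CVD module test", "CVD service"],
    "fix_days":    [34, 28, 12, 1, 40],
})

tab = qn.pivot_table(index=["workstation", "defect_type"],
                     columns=["containment", "country"],
                     values="fix_days",
                     aggfunc=["mean", "count"])
print(tab)
```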
Next, we build a model to predict the QN fix cycle time and validate the root cause. Before going into each model's details, I'd like to introduce model selection and comparison first. The fit model has to consider the data structure and distribution, and there were some challenges with it. For the skewed distribution, we tried a log transformation, but it did not help. All the input variables, X1 to X4, are categorical. After I removed about 60% of the workstation categories, the R square increased by only 6%. We also checked the dependency among the categorical variables with a correspondence analysis plot. The risk is low, because the closer things are to the origin, the less distinct they probably are; in other words, the farther away, the more distinct. Also, proximity between labels probably indicates similarity. For the partition tree model, the plus points are that it is a distribution-free model, it splits based on the data available, and there is little overfitting concern; the minus point is the recursive splitting. Therefore, we use JMP Predictor Screening, which uses a random forest, to average over the recursive splits and find the top five input factors with their ranking. It is a convenient and quick way to find the important inputs to optimize or improve the model. Regarding the neural network, the plus points are that it is a strong transformation model with a two-step training and validation approach; however, the minus is a significant overfitting concern. Which model is more proper to believe? Let's go through each model's results. Coming back to the fit model with main effects only: the R square isn't high, only around 30%, because the data is severely right-skewed. We observed a significant lack of fit, so even the max R square of around 47% is not worthwhile. We also used a log transformation of the cycle time variable to avoid negative numbers in the 95% confidence interval, but it did not help much, so the log option is out. The next is the partition tree model. There are three partition models: the baseline model, model augmentation, and model simplification. Through a series of improvements driven by engineering and logical thinking, the R square improved to 62% from the 38% baseline model. All the details are shown step by step in the following slides. Model augmentation. During this step, we improved the model by 20%. Where did that come from? First, the response: changing QN age to immediate fix cycle time based on prior experience, but that did not help. Second, 6% came from adding one X factor, workstation; remember, it is the factor we discovered from the Pareto chart. The UD code becomes less critical, dropping from 26% to only 8%. Third, another 4% came from changing from UD code to containment. Now, checking the contribution ranking, number two becomes workstation instead of country. Model simplification. Here we improved the R square by an additional 6% through model simplification. Before simplification, the plus is that all scenarios are under consideration, but the minus is that too many categories might dilute the predictive power. In the simplified model, we filter out minor categories with fewer counts; for example, removing 60% of the workstation categories decreased the total row count to 270 from 426. Checking the contribution ranking again, defect type and workstation are still number one and number two.
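As a rough analogue of the Predictor Screening and Partition steps above, here is a minimal scikit-learn sketch that ranks categorical inputs by random-forest importance and fits a small regression tree. The toy data and column names are hypothetical; JMP's Predictor Screening and Partition platforms are what we actually used.

```python
# Minimal sketch: rank candidate X factors with a random forest (Predictor Screening analogue)
# and fit a small regression tree (Partition analogue). Data and names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

qn = pd.DataFrame({
    "defect_type": ["Damage", "Dimension", "Workmanship", "Damage", "Labeling", "Dimension"],
    "containment": ["Replacement", "Replacement", "MFG rework", "Replacement", "MFG rework", "Replacement"],
    "country":     ["Taiwan", "United States", "United States", "Taiwan", "Taiwan", "United States"],
    "workstation": ["CVD service", "CVD", "CVD module test", "CVD service", "CCT staging", "CVD"],
    "fix_days":    [34, 12, 1, 40, 2, 9],
})

X = pd.get_dummies(qn.drop(columns="fix_days"))   # one-hot encode the categorical factors
y = qn["fix_days"]

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranking = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking.head(5))                            # top contributors, like a screening report

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(f"Partition-style tree R square: {tree.score(X, y):.2f}")
```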
Now we have more confidence to use the model to predict FA and SA. For the partition tree model optimization, as I mentioned before, the major contributors are defect type and workstation, at around 80% per the Pareto concept. Comparing defect type in the SA and FA predictions: for SA, it's labeling issues, which makes sense, because we don't need to spend much time fixing those. For FA, it's damage; yes, that takes much more cycle time. About the workstation comparison between SA and FA, PVD mechanical versus the CVD module tester, it currently still needs further analysis and understanding. About the FA country here, the prediction profiler is flat. Doesn't country impact the QN fix cycle time? Is that right? To answer that question, I'll introduce the model limitation of recursive partitioning: the sequential dependency risk. The factor country is split six times, and only one of those splits happens in the higher cycle time cluster. Such recursive dependency limitations may impact the predictive model. The third model is the neural network (artificial intelligence). Here we observe a severe overfitting concern between the training and validation R square: if the R square difference between the training set and the validation set is over 20%, there is an overfitting concern. Besides, we find that in the neural model the number-one ranked factor is workstation, followed by a different number two than in the previous partition model. For SA, the workstation is CCT staging, where materials are brought together before entering the MFG floor, and it doesn't have a complicated operation process; once an issue happens there, it can be fixed quickly, which makes sense. For FA, at the CVD service workstation, there is a complicated operation process; yes, it does take a longer cycle time to treat difficult issues. Up to now, we already have three models, the fit model, the partition tree model, and the neural model, and the question is which model is more appropriate and matches reality. Therefore, model comparison and selection: from the graphical root cause analysis, the damage issue with replacement from Taiwan at the CVD service workstation is the worst scenario, with the longest fix cycle time. Currently, the neural model identifies the identical scenario as the graphical root cause analysis, and the only concern is the overfitting risk. Besides, the three models have very close predictions of the worst cycle time, within 1.2 days of each other. Finally, I will introduce the text mining and data mining hybrid. Currently, the QN database still has free-text message fields, so we can search for more information about long cycle times in the QN text variables. We use JMP Text Explorer to discover some frequent keywords, such as the ones I circled here, replace, rework, dimension, and the F10246 project, for further analysis. Then we convert them to binary indicators and conduct further data mining and root cause analysis on the F10246 case via a heatmap graph. Here we put the dimension indicator under F10246, and the containments, replace and rework, in Y. According to the heatmap results, F10246 did suffer a lot more fix cycle time than the other projects, judging by the color results; and checking the dimension indicator, we observe that it is not only dimension issues but also various other defects that cause the long cycle time, even when they are just fixed by rework. In the end, here are my takeaway learnings.
JMP graphical platforms are very powerful for conducting deeper root cause analysis through an engineering-minded, logical, data-driven process. Compare and select the more appropriate JMP model from the classic Fit Model, Partition, and Neural Network platforms by knowing each model's limitations and risks, and by connecting back to the previous graphical root cause analysis. Conduct a hybrid text mining and data mining root cause analysis on a complicated QN database. Finally, I'd like to thank GCI MBB Charles Chen as my project mentor, and that's all of my presentation. Thank you for your time and attention. Thank you.
Sidewall damage of subtractively etched lines is a common problem in semiconductor manufacturing. This presentation aims to show how JMP's various analysis platforms can be used with a simple image analysis pipeline built into JMP using a Python/JSL integration approach to build a highly predictive model, and a method to integrate and simplify the model's output so it can be used in a standard trend chart for monitoring in manufacturing.   Our methodology uses a metric extraction protocol applied to images inspected and dispositioned in manufacturing. These extracted metrics and dispositions are then passed through a series of model building and tuning platforms in JMP. Our approach involves using dimensional reduction platforms (PCA, PLS) to identify relevant metrics and building a neural network model using K-folds validation to compensate for the class imbalance problems often encountered when working with manufacturing data. Further refinement of the model is done by identifying outliers using the multivariate platform and reprocessing the data. This approach resulted in a highly predictive model that is easy to integrate.       Hi, my name's Steve Maxwell. I am an engineer, and I work in the semiconductor industry. Today, I'm going to give a presentation on how I used JMP in the building of an image analysis pipeline. A little bit of background on the problem: two of our module groups at our facility ran into an issue with a feature where there was erosion on the sidewall of the feature. The SEM system that we used to measure the size of this feature wasn't detecting this, and it didn't have the capability to characterize it. It's an older piece of equipment, but it's also paid for, so we like it. The approach that I employed to help them out was to build a separate analysis pipeline to analyze the images, classify images as having damage or not, and produce a metric that could be sent to the SPC system for tracking purposes. I used JMP primarily to analyze the data and to build a model system for deployment in manufacturing, as well as, more recently, integrating some of the actual image analysis into JSL. I'll show some examples of that. This is a sample of what it is that we're looking at. This is a good image on the left-hand side, and this is a bad image on the right-hand side. We can see here the variation and the erosion; that's what we were looking for in these images. We can also see that the measurement, which is done on the inside of this feature, doesn't really register these erosion features, because it lands at the same spot every time, so we're not able to tell just by using our normal methods. This is a little bit more detail about what we were seeing. This is an example of what some of that data looked like; you can see from here that though the process was having issues, we were not detecting it with the measurements that we were using. We launched into an approach to address this issue. I use the standard ETL format for most data science problems: extract, transform, load. I break it into four categories: configure, collect, analyze, and report. Configure and collection use standard methods. The analysis is the key to this.
That's where I'm reprocessing the images and then extracting metrics from those images that we can use to build a model and determine whether or not we can detect the issue at hand. The approach for that analysis is to convert the images from RGB, which is how they come off the measurement system, to grayscale. Then the images are cropped to remove any measurement details and annotations. The next step is to apply a median filter to these images. The reason we do that is to blur the images and remove some of the noise; this is a key step in image analysis and image processing. The approach used was experimental, trying different kernel sizes and different kernel shapes, but in the end I settled on using a disk-shaped kernel with pixel radii of 5, 7, 10, and 15 pixels. For those who are unfamiliar with that terminology, a kernel is basically just an array that's used to filter the image when you're reprocessing it. After that was done, extraction of metrics from the reprocessed images was developed, using several different standard metrics: the structural similarity index (SSIM), mean squared error, Shannon entropy, and then a few others, like a Gaussian version of SSIM, as well as the means and standard deviations of the SSIM image gradients from the Gaussian-weighted images. From there, I used dimensional reduction techniques to identify which of those metrics was most relevant, to build a predictive model, and then to convert that into some sort of metric that could be easily interpreted in manufacturing while running production. One follow-up on this: when doing the metric extractions, the reference image is the image processed with a kernel radius of 5, and the 7, 10, and 15 kernel radius images are compared to that image. This shows a little bit more detail about what the analysis pipeline approach looks like. For the offline approach, we start with image acquisition; obviously, this is done during the manufacturing process. After that, initially, this work was done in Python: converting to grayscale, cropping, denoising the images, and then feature generation and metric extraction, the SSIM, mean squared error, and entropy calculations, etc. JMP's JSL has a Python wrapper that you can use for integration, and what's great is that you can actually move a lot of this code into the JMP environment. I'll go into a little bit more detail about what that is and how I used it to help speed this process along. The final component of this was to develop a quality metric; this is a very JMP-intensive step as well. Then, once a metric is determined, you move that over into an online process in which the manufacturing data is analyzed and that information is pushed to the SPC system.
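Here is a minimal Python sketch of the preprocessing and metric-extraction steps just described, using common scikit-image calls. The file name and crop box are hypothetical placeholders, and this is only a sketch of the idea, not the production pipeline code.

```python
# Minimal sketch: grayscale conversion, crop, disk-shaped median filters,
# and metric extraction (SSIM, MSE, Shannon entropy) against the radius-5 reference.
# The file name and crop box are hypothetical placeholders.
from skimage import io
from skimage.color import rgb2gray
from skimage.filters import median
from skimage.morphology import disk
from skimage.metrics import structural_similarity, mean_squared_error
from skimage.measure import shannon_entropy

img = io.imread("sem_image.png")          # hypothetical SEM image file
gray = rgb2gray(img)                      # RGB -> grayscale
gray = gray[100:900, 200:1000]            # crop out measurement annotations (placeholder box)

# Blur/denoise with disk-shaped median kernels of increasing radius.
blurred = {r: median(gray, disk(r)) for r in (5, 7, 10, 15)}

# Metrics are computed against the radius-5 image as the reference.
ref = blurred[5]
metrics = {}
for r in (7, 10, 15):
    metrics[f"ssim_{r}_vs_5"] = structural_similarity(
        ref, blurred[r], data_range=ref.max() - ref.min())
    metrics[f"mse_{r}_vs_5"] = mean_squared_error(ref, blurred[r])
    metrics[f"entropy_{r}"] = shannon_entropy(blurred[r])
print(metrics)
```

In the actual pipeline, values like these are returned to JMP as data table columns through the Python/JSL wrapper described next.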
What's great about this is that it enables using common image analysis libraries, such as skimage, SciPy, PIL, OpenCV, etc., to perform the work while keeping the data within the JMP framework. This, in my opinion, was key, because I'm not moving data from one system or platform to another and back and forth; I was able to keep it all in one spot. This is a little bit more detail about what the analysis pipeline is doing when it's reprocessing the images. This is what I was able to move from running slowly in standalone Python to integrating into JSL while still using the Python libraries. You can see, here's a bad image and here's a good one. We crop the image to basically only take this component of it, convert it to gray, and then apply median filters to blur the images with different kernel sizes; as you can see, as we increase the kernel size, they become a little more blurred. The takeaway I want to show here is that for a good image, the kernels with disk size 5, 7, 10, and 15 look pretty similar to one another. Whereas for disk 5, disk 7, disk 10, and disk 15 on a bad image, you can see that by the time you get to disk 15, kernel size 15, the image is starting to look similar to the good one. By removing the noise from the image and denoising the images incrementally using different kernel sizes, that is what enables us to successfully extract metrics that we can use to differentiate between the good and the bad. Sorry, hold on here, let me go back up. At this point, I'm going to launch a demo that shows how I'm moving this data into JMP and what it looks like as it reprocesses everything. The first thing I need to do is build a data table. There's a great feature in JMP: you go to File, Import Multiple Files. What I've got is a sample set of just five images that I'm going to combine into one data table. It actually imports the images as well as the image file names. You go here, click the file column name, click Import, and what we get is this: we've got an image of the picture that we want to analyze, we've got a file name tag next to it, and we've got the imported image data. Now, the next step is to run our analysis code against each of these images and start extracting metrics. I've got a modified version of that code here. The way this code works is that there's a Python initialization that occurs in JSL. Then, once that's done, the data table that you want to analyze is passed to Python as a data frame. At that point, under the Python Submit, you basically convert over to Python code, and then you're using standard approaches, standard coding procedures for Python, to analyze the image; as you can see here, we have our library imports, etc. The way this code works is that it basically defines a bunch of functions, and then those functions return an output that gets added back into the data frame, which is pushed back out to JMP as the data table.
That's done by using this lambda function; the lambda function applies the function to all rows of the specified column in the data frame, or the JMP table. When we run this, we take the table I just built, reprocess the images, and extract the metrics. What we see here are the images I was showing earlier: we've taken our picture, we've converted it to gray, and then we've started blurring it, with a 5-pixel-radius kernel, then 7, 10, and 15. Then from there, we're extracting all of our different metrics: SSIM for 7 compared to 5 (SSIM is the structural similarity index), then 10 compared to 5, and 15 compared to 5. That's basically what we're doing: calculating the structural similarity index between this image and this image, this image and this image, and this image and this image. If we go back to our presentation, from here I used several different approaches for interrogating the data. First was the multivariate platform; there are lots of correlations to work with here. The red indicates bad images, blue indicates good images. Then an outlier analysis, just looking at the Mahalanobis distances and jackknife distances. That's just to see whether or not we're heavily skewed; you can see that we've got more outliers in the bad population compared to the good, but I think if you separate these two out, that comes down a little bit. I'm showing this just to show that you can go in here and start trimming some of these out if you want to, if you're not seeing good fits. But for the purpose of this exercise, I didn't have to remove or hide any data in the analysis; I was able to use it all as is. Here is how I did that. This is the main data table; this is all the data that I looked at, all the images. I just go to Analyze, Multivariate Methods, and run the Multivariate platform. What we want to do is just go with that... actually, sorry, one second here. For classifying, I did use a conversion of the classification from a character base to a number. There's a great feature in here: you can go to Cols, Utilities, and then Make Indicator Columns, and you can click Append Column Name. What it does is say, okay, that is one or zero: for this row, this is bad, not good, and down here it goes vice versa. You can use that information to feed into some of the different modeling platforms that we're going to look at here in a few minutes. Back to the multivariate. Going in here, we're going to run all of these; let's run this platform. That's interesting. Let's go here, my bad. We can see our correlations. Then, to get the outlier analysis, you go into the Multivariate menu and go down to Outlier Analysis; you run the Mahalanobis distances and the jackknife distances. There are different approaches; these are the ones I've had success with. This is a first pass, and it looks like we've got something to work with here. The next step is to break these down a little bit more, to look at how each of the different metrics is responding to the different kernel sizes.
What we're really looking for here is where we start to see larger changes with the kernel sizes: are the lines still overlapping with one another, or are we starting to see separation? Looking at 15 on bad versus 15 on good for the structural similarity and the MSE, we can start to see that we're getting separation here. I think that is a good indicator that we should be able to extract something from this. That was done using Graph Builder; here is the demo. I've got a data table built specifically for this. Graph Builder is a great feature. I'm dropping in kernel size, I'm looking at the metric value, I'm breaking these into groups based on the good-versus-bad classification codes, and then my image metric is used as the page. Here I just create a linear fit; under the advanced options for prediction you can turn on the R², and it will pop the equation in there so you can see the goodness of the fit. I do like adjusting the grid points; for this, I used a centered grid and moved it along. It's just easier on the eyes for presentation purposes. You can scroll through here, and for each of these it's got really nice-looking graphs to preview. Based on the multivariate analysis and this linear fit, I'm feeling pretty good about the data that we extracted from these images. The next step is this: we've got a lot of metrics that we've extracted from these images. Do all of them matter? Do some of them matter? Over the next couple of slides, we go over dimensional reduction. Specifically, I used PCA and partial least squares (PLS). For the PCA, the response is the classification, bad versus good. This goes back to what I was demoing earlier in the data table: when you're building a PCA, you want to have your inputs and your outputs recast as vectors in that analysis. The reason you want that is because you're looking for non-orthogonality between your inputs and your outputs: the more orthogonal they are to one another, the less likely they're playing a role in what's going on. Recasting a character-based classification into a number makes that easy to work with. We can see here that we've got some metrics that look like they could be interesting; some of these are interesting. I highlighted these, and to run this analysis I simply took the data table I built and went to Analyze, Predictive Modeling... sorry, Multivariate Methods, Principal Components. What we're going to do is cast the good/bad classification and all of our different metrics into the analysis. You can see here, these look interesting; these were particularly interesting. What's nice is that when you highlight these and you go back and look at your data table, it highlights which columns you selected. Oftentimes PCAs can get really busy, so it's handy to be able to go back and reference, okay, these are the ones that I'm actually interested in. Let's get back to here. After this comes the PLS. This is more for targeting how many factors we think are playing a role in the data that we've got, as well as helping us identify additional metrics that might be interesting.
For this particular data set, it looks like it settles in at probably a minimum of five factors that we would expect. That comes from the Prob > van der Voet T² statistic; the JMP manual references academic research indicating that anything above 0.1 is typically where you want to start. From there, we run the PLS factor identification. This is your VIP versus coefficients plot, and from there you can highlight the different metrics that seem to be driving this, the ones that are most relevant. The way that works, looking at these charts, is that you want to go out to the extremes; those are interesting. As a point gets closer and closer to a zero coefficient, it's less interesting. I'm not going to run the PLS platform, because it will actually take a while to run, and I don't think anyone wants to just watch my computer run while doing this. Based on our dimension reduction, I ended up with these factors being the most relevant with regard to what we want to feed into the neural network. Through some experimentation, I found that it was mainly the PLS results, and then the standard 10 metric from the PCA as well; when that was added in, it really helped pull the model together. Training versus validation. Obviously, I'm using a neural network to build the model for the classifier. The reason I'm using that platform specifically for this is that it has K-fold validation built in as standard in the JMP platform, which is great. I've found that K-folds helps to compensate when you're teetering on class imbalance. That's a really common problem when you're studying manufacturing processes: you're not going to have a lot of the defective data to work with, so you've really got to make the best of what you've got. I find that sometimes K-folds helps with that. The output from the neural network was pretty good, actually. The way I do this is that I'll run the model about five times, and then I'll compare all five of them to one another. I'll take basically the median of the classification rate, and then go down here and look at our false positive and false negative rates. With regard to the false positive rates, I look at the ROC curve; these are the receiver operating characteristic curves. You want these to be, as I describe it, high and tight: the closer these lines get to this middle diagonal line, the closer you're getting to a coin toss in your prediction. Obviously, that is not what we're seeing here. Then the prediction profiler has some great features, because you can basically reorder these based on which factors are playing the biggest role in the prediction, as well as visualize what each individual metric is doing with regard to its role in the prediction itself. In this case, we're looking at transitions of the dominant dimensions here. This one here is a little bit weird; I'm not sure what's going on with that. As you get back here to the less dominant dimensions, you can start to see they become more compressed. I'll run the demo for the neural network. We go here. Analyze.
Actually, I've got this one built already. Let me... we launch this analysis. We've included our factors, and our response is the classification. The neural network doesn't care if you're using a numerical or character-based response; it will work with either one. Here, for the validation method, you've got K-fold, using five folds. The gold standard for most AI-type analyses is ten, but I found that five was fine, so there was no reason to push it. I used three hidden nodes. Hit Go, and here's the output. Like I said, I'll run this probably five times and take the median value. You can see here that our classification rate is pretty good. We'll see a little bit of variation on the validation set, but it's all within reason for what I'm looking for with regard to the profiler and the ROC curve. You get down here to the ROC curves, and you can see, as I described earlier, the high, tight characteristics of the curves. Validation is maybe a little bit less, but we can work with it. Then, with regard to the profiler: the profiler output looks like this. It doesn't arrange the factors by which one is most relevant. You go here, and you go to Assess Variable Importance, Independent Uniform Inputs. Go back up here, and we can see a summary report. From here, under the Variable Importance menu, you can reorder the factors by main effect importance. For this model, it reordered this way. Colorize: you can see here that the darker color is most relevant, and then it fades as it moves to the right. Let's see. Converting the model to an SPC metric: to make the data easier to interpret, we need to take the result from the model and convert it into a metric that can be posted to the SPC system and plotted on an SPC chart. The activation component of the neural network model is a softmax function, which is standard in JMP. This calculates the probability of an image being good or bad, with the higher of the two determining how the image is classified, and what we're going to do here is introduce a third category, unknown. I'll show a little bit about how that is done. One of the nice things about the neural network platform is that the SAS DATA Step output file is actually close to a Pythonic model script. Typically, for the metric system, you want this in Python; it would be really cool to do it all in JMP. This code from the neural network here, the actual model itself, can be output in this format: you just go to the model menu and choose Make SAS DATA Step. Here we go. I think in JMP Pro you can output it to Python directly, but not everyone has access to JMP Pro. This gets you about 80% of the way there, not 100%; it just requires a few modifications to the actual code. On the left-hand side, we've got the SAS DATA Step format; on the right-hand side is the conversion of it to Python. It's just a matter of importing the math library and then applying that so we can take the hyperbolic tangent of things. Then you get to this part: what we want to do is threshold our softmax probability calculation, rather than just saying that whichever of these is highest is your actual result.
If it's 0.51 bad and 0.49 good, then it's going to be classified as bad. Instead, we go and say: okay, if the softmax output is greater than 0.75, then it gets that classification, because the other class will be correspondingly low. If it's 0.75 bad, then it's bad; but if the number is less than that, it won't be classified as good or bad, it will just be classified as unknown, and we'll pass that into our system. What that tells us to do is have someone look at those images, because the classifier can't decide whether they're good or bad. (A small sketch of this thresholding logic follows at the end of this talk.) The output, and this is the finished product, is to convert, like I said, the final metric into something that can be reported in our SPC system. Basically, what we're doing here is reporting what our softmax probability is, taking the inverse of that, and then applying the threshold to determine whether or not things are known or unknown. Based on that, this is the initial data that I showed at the beginning of the presentation, which is what the chart looked like for the measurement of the defect. Then this is what we get when we go in, reanalyze that data, and apply this new metric, which tells us whether or not it thinks there is a sidewall feature. You can see, looking at the same data set, that yes, it very much does identify each point as either good or bad. The green indicates good; the red indicates those classified as bad, and you can see here on the SPC data that they would instantly get flagged as bad as well. These blue ones hanging out here are the ones that are unknown, where it's telling us, "Go look at this." That is my presentation. I've posted the information to the User Community site if anyone wants access to the code that's used for the image analysis. Thank you very much.
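As referenced above, here is a minimal Python sketch of the softmax-plus-threshold idea: score an image with a small tanh-hidden-layer network, then map the probability to good, bad, or unknown for SPC reporting. The weights, the 0.75 threshold, and the metric values are illustrative placeholders, not the exported model; the real coefficients come from the SAS DATA Step output of the JMP neural network.

```python
# Minimal sketch: score extracted image metrics with a tanh hidden layer plus softmax,
# then threshold the probability into good / bad / unknown for SPC reporting.
# Weights, biases, and the 0.75 threshold are illustrative placeholders only.
import math

HIDDEN = [  # (bias, weights per input metric) for three hidden nodes -- placeholder values
    (0.1, [1.2, -0.8, 0.5]),
    (-0.3, [0.4, 0.9, -1.1]),
    (0.2, [-0.6, 0.3, 0.7]),
]
OUTPUT = {  # (bias, weights per hidden node) for the two classes -- placeholder values
    "bad":  (0.0, [1.5, -0.7, 0.9]),
    "good": (0.1, [-1.5, 0.7, -0.9]),
}

def classify(metrics, threshold=0.75):
    """Return (label, p_bad) for one image's metric vector."""
    hidden = [math.tanh(b + sum(w * x for w, x in zip(ws, metrics))) for b, ws in HIDDEN]
    scores = {k: b + sum(w * h for w, h in zip(ws, hidden)) for k, (b, ws) in OUTPUT.items()}
    z = sum(math.exp(s) for s in scores.values())                 # softmax normalizer
    probs = {k: math.exp(s) / z for k, s in scores.items()}
    best = max(probs, key=probs.get)
    label = best if probs[best] >= threshold else "unknown"       # below threshold -> needs review
    return label, probs["bad"]

print(classify([0.82, 0.11, 6.3]))   # hypothetical SSIM / MSE / entropy values
```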
Process characterization (PC) is the evaluation of process parameter effects on quality attributes for a pharmaceutical manufacturing process that follows a Quality by Design (QbD) framework. It is a critical activity in the product development life cycle for the development of the control strategy. We have developed a fully integrated JMP add-in (BPC-Stat) designed to automate the long and tedious JMP implementation of tasks regularly seen in PC modeling, including automated set up of mixed models, model selection, automated workflows, and automated report generation. The development of BPC-Stat followed an agile approach by first introducing an impressive prototype PC classification dashboard that convinced sponsors of further investment. BPC-Stat modules were introduced and applied to actual project work to both improve and modify as needed for real world use. BPC-Stat puts statistically defensible, flexible, and standardized practices in the hands of engineers with statistical oversight. It dramatically reduces the time to design, analyze, and report PC results, and the subsequent development of in-process limits, impact ratios, and acceptable ranges, thus delivering accelerated PC results.     Welcome to our presentation on BPC-Stat: Accelerating process characterization with agile development of an automation tool set. We'll show you our collaborative journey of the Merck and Adsurgo development teams. This was made possible by expert insights, management sponsorship, and the many contributions from our process colleagues. We can't overstate that enough. I'm Seth Clark. And I'm Melissa Matzke. We're in the Research CMC Statistics group at Merck Research Laboratories. And today, we'll take you through the problem we were challenged with, the choice we had to make, and the consequences of our choices. And hopefully, that won't take too long and we'll quickly get to the best part, a demonstration of the solution, BPC-Stat. So let's go ahead and get started. The monoclonal antibody, or mAb, fed-batch process consists of approximately 20 steps, or 20 unit operations, across the upstream cell growth process and the downstream purification process. Each step can have up to 50 assays for the evaluation of process and product quality. Process characterization is the evaluation of process parameter effects on quality. Our focus is the statistics workflow associated with the mAb process characterization. The statistics workflow includes study design using sound DOE principles, robust data management, and statistical modeling and simulation. This is all to support parameter classification and the development of the control strategy. The goal for the control strategy is to make statements about how the quality is to be controlled, to maintain safety and efficacy through consistent performance and capability. To do this, we use the statistical models developed from the design studies. Parameter classification is not a statistical analysis, but it can be thought of as an exercise of translating statistically meaningful to practically meaningful. The practically meaningful effects will be used to guide and inform SME (subject matter expert) decisions to be made during the development of the control strategy.
And the translation from statistically to practically meaningful is done through a simple calculation: it's the change of the attribute mean, that is, the parameter effect, relative to the difference between the process mean and the attribute acceptance criteria. Depending on how much of the acceptance criteria range gets used up by the parameter effect, that determines whether the process parameter has a practically meaningful effect on quality. So we have a defined problem: the monoclonal antibody PC statistics workflow, from study design through the control strategy. How are we going to implement the statistics workflow in a way that the process scientists and engineers will actively participate in and own these components of PC, allowing us, the statisticians, to provide oversight and guidance and allowing us to extend our resources? We had to choose a statistical software that included data management, DOE, plotting, and linear mixed model capabilities. Of course, it had to be extendable through automation, and intuitive, interactive, and fit for purpose without a lot of customization. And JMP was an obvious, clear choice for that. Why? Because it has extensive customization and automation through JSL; for many of our engineers, it's already their current go-to software for statistical analysis; we have existing experience and training in our organization; and it's an industry-leading DOE tool with data interactivity. The profiler and simulator in particular are very well suited for PC studies. Also, the scripts that are produced are standard, reproducible, portable analyses. We're using JMP to execute study design, data management, and statistical modeling and simulation for one unit operation, or one step, that could have up to 50 responses. The results of the analysis are moved to a study report and used for parameter classification. And we have to do all of this 20 times, and we have to do it for multiple projects. So you can imagine it's a huge amount of work. When we started doing this initially, we were finding that we were doing a lot of editing of scripts before the automation. We had to edit scripts to extend or copy existing analyses to other responses, we had to edit scripts to add where conditions, and we had to edit scripts to extend the models to large scale. We want to stop editing scripts. Many of you may be familiar with how tedious the simulator setup can be. You have to set the distribution for each individual process parameter, you have to set the add random noise, and you have to set the number of simulation runs, and so on and so on. There are many different steps, including the possibility of editing scripts if we want to change from an internal transform to the explicit transform; the simulator is doing add random noise on a log-normal type basis, for example, for the log transform. We want to stop that manual setup. Our process colleagues and we were spending enormous amounts of time compiling information for the impact ratio calculations that we use to make parameter classification decisions.
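To make the impact ratio idea concrete, here is a minimal sketch of one plausible form of the calculation described above: the change in the attribute mean across the parameter range, relative to the distance from the process mean to the acceptance criterion. The numbers and the exact form are illustrative assumptions, not the implementation in BPC-Stat.

```python
# Minimal sketch: one plausible form of the practical impact ratio described above.
# All numbers are illustrative placeholders, not values from an actual PC study.

def impact_ratio(effect_low, effect_high, process_mean, acceptance_limit):
    """Change in the attribute mean across the parameter range, relative to the
    gap between the process mean and the attribute acceptance criterion."""
    parameter_effect = abs(effect_high - effect_low)          # predicted attribute change
    available_room = abs(acceptance_limit - process_mean)     # room before the limit is reached
    return parameter_effect / available_room

# Hypothetical example: attribute predicted at 1.8 and 2.6 at the low/high parameter settings,
# process mean 2.0, upper acceptance criterion 3.0.
ir = impact_ratio(effect_low=1.8, effect_high=2.6, process_mean=2.0, acceptance_limit=3.0)
print(f"impact ratio = {ir:.2f}")   # values near 1 use up most of the available range
```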
We were using profilers to verify those, and assembling all of this information into a heat map, and it was then a very tedious exercise to decompose the information back to the original files. We want to stop this manual parameter classification exercise. Of course, last but not least, we have to report our results. And the reporting involved copying, or at least in the past involved copying and pasting from other projects. And then, of course, you have copy-and-paste errors; copying and pasting from JMP, you might put in the wrong profile or the wrong attribute, and so on. We want to stop all this copying and pasting. We clearly had to deal with the consequences of our choice to use JMP. The analysis process was labor-intensive, taking weeks, sometimes months, to finish an analysis for one unit operation. It was prone to human error and usually required extensive rework. It proved to be an exceptional challenge to train colleagues to own the analysis. We developed a vision with acceleration in mind to enable our colleagues with a standardized yet flexible platform approach to the process characterization and statistics workflow. So we had, along the way, some guiding principles. As we mentioned before, we wanted to stop editing JMP scripts, so for any routine analysis, no editing is required. And our JMP analyses need to stand on their own; they need to be portable without having to install BPC-Stat, so that they live on, requiring only the JMP version they were built with. We collected constant feedback, we constantly updated, we tracked issues relentlessly, updating sometimes daily and meeting weekly with Adsurgo, our development team. For our interfaces, we made sure that they were understandable; we use stoplight coloring, such as green for good, yellow for caution, and issues flagged with red. We had two external approaches, or inputs, into the system: an external standardization, which I'll show a bit later, where the process teams define the standard requirements for the analysis file, and our help files, which we decided to move entirely external so that they can continue to evolve as the users and the system evolve. We broke our development into two major development cycles. In the early development cycle, we developed proofs of concept: we would have a problem, we would develop a proof-of-concept working prototype to address that problem, and we would iterate on it until it was solving the problem and was reasonably stable. And then we moved it into the late development cycle and continued the agile development, where, in this case, we applied it to actual project work, did very careful second-person reviews to make sure all the results were correct, and continued to refine the modules based on feedback over and over again to get to a more stable and better state. One of these proof-of-concept wow factors that we built at the very beginning was the Heat Map tool, which brought together all kinds of information in the dashboard and saved the teams enormous amounts of time. I'll show you an example of this later, but you can see the quotes on the right-hand side; they were very excited about this.
They actually helped design it, and so we got early engagement, motivation for additional funding, and a lot of excitement generated by this initial proof of concept. In summary, we had a problem to solve, the PC statistics workflow. We had a choice to make, and we chose JMP. Our consequences were copying and pasting, manual mistakes, and extensive reworking. We had to develop a solution, and that was BPC-Stat. I'm extremely happy to end this portion of our talk and let Seth take you through a demonstration of BPC-Stat.

We have a core demo that we've worked out, and we're actually going to begin at the end. The end is that proof-of-concept Heat Map tool that we briefly showed a picture of. When the scientists have completed all of their analyses and the files are all complete, the information for the impact ratio is all available and collected into a nice summary file, which I'm showing here. For each attribute and each process factor, we have information about the profiler and its predicted means and ranges as the process parameter changes. So we can compute changes in the attribute across that process factor, and we can compute the impact ratio that we mentioned earlier.

Now I'm going to run this tool and we'll see what it does. First, it pulls up one of our early innovations by the scientists: we organized all of this information that we had across multiple studies. This is showing three different process steps, and you can see on the x-axis we have the process steps laid out. In each process step we have different studies contributing, we have multiple process parameters, and they are all crossed with the attributes that we use to assess the quality of the product being produced. And we get this heat map here. The white spaces indicate places where the process parameter either dropped out of the model or was not assessed, and the colored ones are where we actually have models built. Of course, the intensity of the heat depends on the practical impact ratio.

This was a great solution for the scientists, but it still wasn't enough, because we had disruptions to the discussion. We could look at this and say, okay, there's something going on here, we have this high impact ratio, and then they would have to track down where that is coming from. Oh, it's in this study. It's this process parameter. I have to go to this file. I look up that file, I find the script among the many scripts. I run the script, I have to find the response, and then finally I get to the model. We enhanced this so that now it's just a simple click and the relevant model is pulled up below. You can see where that impact ratio of one is coming from: the gap between the predicted process mean and the attribute limit is the space right here, and the process parameter trend is taking up essentially the entire space. That's why we get that impact ratio of one. Then the scientists can also run their simulators, which have already been built and are here, ready to go.
They can run simulation scenarios to see what the defect rates are. They can play around with the process parameters to establish proven acceptable ranges that have lower defect rates. They can look at interactions and the contributions from those. They can also put notes in over here on the right and save them in a journal file to keep track of the decisions they are making. Notice that all of this is designed to support them in maintaining their scientific dialogue and to prevent interruptions to it. They can focus their efforts on particular steps: if I click on a step, it narrows down to that. Also, because our limits tend to be in flux, we have the opportunity to update them on the fly and see what the result is. You can see here how this impact changed; now we have this low impact ratio, and we can ask, how does that actually look on the model? The limit has been updated, you can see there's much more room, and that's why we get that lower impact ratio and will get lower failure rates. That was the Heat Map tool; it was a huge win and highly motivated additional investment in this automation.

I started at the end; now I'm going to move to the beginning of the process statistics workflow, which is design. When we work with the upstream team, they have a lot of bioreactors that they run in each of these runs. This is essentially a central composite design. Each of these runs is a bioreactor, and a bioreactor sometimes goes down because of contamination or other issues, so runs are essentially missing at random. So we built a simulation to evaluate these potential losses, called evaluate loss runs, and we can specify how many runs are lost. I'm just going to put something small here for demonstration and show this little graphic. What it's doing is going through, randomly selecting points to remove from the design, and calculating variance inflation factors, which can be used to assess multicollinearity and how well we can estimate model parameters. When it's done, it generates a summary report. This one is not very useful because I ran very few simulations, but I have another example here: 500 simulations on a bigger design. You can see we get this nice summary. If we lose one bioreactor, we have essentially a zero probability of getting extreme variance inflation factors or non-estimable parameter estimates, so that's not an issue. If we lose two bioreactors, that probability goes up to about 4%, which is starting to become an issue. So we might say that for this design, two bioreactors is our capacity for loss. If we really wanted to get into the details, we can see how each model parameter's variance inflation is impacted for a given number of bioreactors lost, or we can rank-order all the combinations of bioreactors that are lost and see which specific design points impact the design the most, and do further assessments like that. That's our simulation, and it's now a routine part of our process. I'm going to move on here to standardization.
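Before moving on, here is a rough Python analogue of the lost-run simulation just described, not the JSL tool itself; the toy design, model terms, thresholds, and simulation counts are all assumptions made for illustration.

```python
# Sketch: randomly drop k runs from a design, check each model column, and
# estimate the probability of extreme VIFs or non-estimable parameters.
import itertools
import numpy as np

rng = np.random.default_rng(2024)

# Toy 2-factor, 9-run design in coded units (a stand-in for the real design).
base = np.array(list(itertools.product([-1, 0, 1], repeat=2)), float)
X = np.column_stack([base, base[:, 0] * base[:, 1], base ** 2])  # A, B, AB, A^2, B^2

def max_vif(M):
    """Largest VIF among columns of M; np.inf if any column is non-estimable."""
    vifs = []
    for j in range(M.shape[1]):
        others = np.column_stack([np.ones(len(M)), np.delete(M, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, M[:, j], rcond=None)
        resid = M[:, j] - others @ coef
        sst = np.sum((M[:, j] - M[:, j].mean()) ** 2)
        r2 = 1 - np.sum(resid ** 2) / sst if sst > 0 else 1.0
        vifs.append(np.inf if r2 >= 1 - 1e-10 else 1 / (1 - r2))
    return max(vifs)

def prob_trouble(X, n_lost, n_sims=500, vif_limit=10.0):
    """Fraction of simulated loss patterns that break the design."""
    bad = 0
    for _ in range(n_sims):
        keep = rng.choice(len(X), size=len(X) - n_lost, replace=False)
        bad += max_vif(X[keep]) > vif_limit
    return bad / n_sims

for k in (1, 2, 3):
    print(f"lose {k} run(s): P(extreme VIF or non-estimable) ~ {prob_trouble(X, k):.2%}")
```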
We talked about the beginning of the process statistics workflow and the end of the process statistics workflow; now I'm going to go back to what is the beginning of BPC-Stat itself. When people install this, they have to do a setup procedure, and the setup is basically specifying the preferences. It specifies those input standardizations that we talked about earlier, as well as the help file, the process area they're working with, and the default directory they're working with. That information is saved into the system and they can move forward.

Let me show some examples of the standardization files and the Help file. The Help file can also be pulled up under the Help menu, and of course the Help file is a JMP file itself. Notice that in this location column, these are all videos that we've created and they're all timestamped, so users can just figure out what feature they're looking for, click on it, and immediately pull up the video. What's even more exciting is that in all the dialogues of BPC-Stat, when we pull up a dialogue and there's a Help button there, it knows which row in this table to go to for the help. If I click that Help button, it will automatically pull up the associated link and training to give immediate help. That's our Help file.

Now, the standardization. We have standardizations that we work with the teams to define, either across projects or within a specific project, depending on the needs, and for process areas. We had this problem early on that we weren't getting consistent naming and it was causing problems and rework; now we have this standardization in place. There are also the reporting decimals that we need to use, the minimum recorded decimals, the names we use when we write this in a report, our units, and then a default transform to apply. That's our attribute standardization. For our group standardization, it's very similar, identifying columns, except we have this additional feature where we can require that only specific levels be present; otherwise they will be flagged. We can also require a specific value ordering, so that, for example, the process steps are always listed in process step order, which is what we need in all our output.

Okay, so I'm going to show an example of this. Let me see if I can close some of the previous things we have open here. Here's the file. The data has been compiled; we think it's ready to go, but we're going to make sure it's ready to go. So we're going to do the standardization. The first thing is the attribute standardization, which recognizes the attributes that already carry the standard names, and what's left is up here. We can see immediately that these are process parameters, which we don't have the standardization set up for. But we also see immediately that something's wrong with this one, and we see in the list of possible attributes that the units are missing. We can correct that very easily.
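As an aside, the checks being demonstrated here boil down to comparing the data table against externally defined standards. The small pandas sketch below illustrates that idea with hypothetical standard names, required columns, and allowed levels; it is not BPC-Stat's actual logic.

```python
# Sketch of an attribute/group standardization check: flag non-standard
# column names, force required columns, and validate allowed levels.
import pandas as pd

attribute_standard = {"Yield (%)", "HCP (ng/mg)", "Titer (g/L)"}     # standard names incl. units
required_group_cols = ["Project", "Process Step", "Process Scale"]   # team-required columns
allowed_levels = {"Process Scale": ["Large", "Lab", "Micro"]}        # only these levels allowed

dt = pd.DataFrame({
    "Process Step": ["Harvest", "Harvest"],
    "Process Scale": ["Lab", "Labbb"],      # an extra-character typo
    "Yield": [71.2, 69.8],                  # missing its units
    "HCP (ng/mg)": [180.0, 210.0],
})

report = []
for col in dt.columns.difference(["Process Step", "Process Scale"]):
    if col not in attribute_standard:
        report.append(("caution", f"non-standard attribute name: {col!r}"))
for col in required_group_cols:
    if col not in dt.columns:
        report.append(("stop", f"required column missing, adding it: {col!r}"))
        dt[col] = pd.NA
for col, levels in allowed_levels.items():
    bad = sorted(set(dt[col].dropna()) - set(levels))
    if bad:
        report.append(("stop", f"{col!r} has unexpected level(s): {bad}"))

for color, msg in report:
    print(f"[{color}] {msg}")
```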
It tells us what it's doing, and we're going to make that change. Then it generates a report with that stoplight coloring that says: we found these; this is a change we made; pay attention to this caution; these are ones we didn't find. And this report is saved back to the data table so it can be reproduced on demand.

I'll go through the group standardization just to take a quick look at that. Here it's telling me, in red stoplight coloring, that we have a problem: you're missing these columns. The team has required that this information be included, so it's going to force those columns onto the file. We also have the option, with the yellow, to add additional columns. So we'll go ahead and run that, and it tells us what it's going to do, and then it does the same thing and creates a report. We look through the report and we notice something's going on here: process scale. Our process scale can only have the levels large, lab, and micro. Guess what, we have a "labbb"; we have an extra b in there. So that's an error. We find that value, correct it, rerun the standardization, and everything is good there.

I did want to point out one more thing here. You'll see that these are our attributes, and there are these little stars indicating properties. The property that was assigned when we did the standardization is this custom property, table deck, and that's going to pass information to the system about what the reporting precision is when it generates tables. Also, our default transformation for HCP was log, so it automatically created the log transform for us; we don't have to do that.

Okay, that's the standardization. Let's move on to much more interesting things now: the PC analysis. Before I get to that, I just want to mention that we have a module for scaled-down model qualification. Essentially, it uses JMP's built-in equivalence testing, but it enhances it by generating some custom plots and summarizing all that information in a table that's report-ready. It's beautiful. Unfortunately, we don't have time to cover that.

I'm going to go now into the PC analysis, which I'm really excited about. I have this file; the standardization has already been done. It contains lab results, an experimental design that's been executed at lab scale, and we have large-scale data in here. It's not feasible to execute DOEs at large scale, but we have data at the control point, and we want to be able to project our models to that large scale. Because we have different subsets and potentially different models (this one only has a single model, but we can have different models and different setups), we decided to create a BPC-Stat workflow, and we have a workflow setup tool that helps us build that based on the particular model we're working with. I can name each of these workflows, and I provide this information that it's going to track throughout the whole analysis: what our large-scale setting is, and what our responses are. Notice this is already populated for me.
It looked at the file and said, oh, I know these responses, they're in your standardized file, they exist in this file, I assume you want to analyze them, and they get pre-populated. It also recognized this as a transform, because it knows that for HCP we want a log transform. And it's going to do an internal transform, which means JMP will automatically back-transform it so that scientists can interpret it on the original scale. There are some additional options here. This PC block right now is fixed; in some cases the scientists want to look at the actual PC block means, but for the simulation we're interested in a population-type simulation. We don't want to look at specific blocks, we want to see what the variability is, so we're going to change that PC block factor into a random effect when we get to the simulation. And we're going to add a process scale factor to our models so we can extend our model to large scale.

The system will review the different process parameters and check the coding. If there are some issues or coding missing, it will automatically flag that with the stoplight coloring. We have here the set point. This is a very tedious, annoying exercise: we constantly want to show everything at the set point in the profilers, because that's our control, not the default, where JMP calculates the mathematical center. So we built this information in so that it could be added automatically.

Then we can define the subsets for an analysis, and for that we use a data filter. I'll show this data filter here, and there's an explanation of it in the dialogue. We want to do summary statistics on the small scale, so I go ahead and select that. It gives feedback on how many rows are in that subset and what the subset is, so we can double-check that it's correct. Then for the PC analysis, in this case I have the model set up so that, of course, it's going to analyze the DOE with the center point, but it's also going to do this single-factor study, or what they call OFAT, and the SDM block, separate center points that were done as another block; that's all built into the study as another block for that PC block. Lastly, I can specify subsets for confirmation points, which they like to call verification points, to check and see how well the model is predicting. We don't have those in this case. And our subset for large scale would include both the lab and the large-scale data; since that's all the data in this case, I don't have to specify any subset.

Now I have defined my workflow. I click OK, and it saves all that information right here as a script. If I right-click on that and edit it, you can see what it's doing. It's creating a new namespace; it's got the model in there, it's got all my responses, and everything I could need for this analysis. As soon as you see this, you start thinking, well, if I have to add another response, I can stick another response in here. But that violates the principle of no script editing. Well, sometimes we do it, but don't tell anybody.
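To make the idea of that saved workflow concrete, here is a hypothetical sketch of the kind of information such a namespace carries: responses with default transforms, model effects, random effects, set points, and row subsets. The structure and names are illustrative assumptions, not BPC-Stat's actual namespace.

```python
# Sketch of the information a PC-analysis "workflow" might carry around:
# responses with default transforms, model effects, set points, and subsets.
from dataclasses import dataclass, field

@dataclass
class PCWorkflow:
    name: str
    responses: dict[str, str | None]      # response name -> transform ("Log" or None)
    effects: list[str]                    # model terms for the DOE analysis
    random_effects: list[str]             # e.g., PC Block, used for the simulation model
    set_points: dict[str, float]          # profiler settings at the process set point
    subsets: dict[str, str] = field(default_factory=dict)  # subset name -> row filter

workflow = PCWorkflow(
    name="Lab DOE with large-scale extension",
    responses={"Yield": None, "HCP": "Log"},
    effects=["Temp", "pH", "Temp*pH", "Process Scale"],
    random_effects=["PC Block"],
    set_points={"Temp": 37.0, "pH": 7.1},
    subsets={"small scale": "Process Scale == 'Lab'",
             "large scale": "Process Scale in ('Lab', 'Large')"},
)
print(workflow.name, "->", list(workflow.responses))
```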
What we did is build a tool with a workflow editor that allows us to go back into that workflow through point and click and change some of its properties. I'm going to go ahead now and do the analysis, and this is where the magic really comes in. When I do the PC analysis setup, it takes that workflow information and applies it across the entire set of scripts that we need for our analysis. You see what it just did there: it dropped a whole bunch of scripts and grouped them for us. Everything is ready to go. It's a step-by-step process the scientists can follow through. If there are scripts that are not applicable, they can remove those scripts and they're just gone; we don't worry about them. For the scripts that are present, we have additional customization. These are essentially generator scripts, and you can see each one generates a dialogue that's already pre-populated with what we should need, but with additional flexibility if we need it. Then we can get our report and enhance it as we need to, in this case with subsets I may want to include, and then resave the script back to the table to replace the generator script. Now I have a rendered script here that I can use, and it's portable.

Then for the PC analysis we have data plots. Of course, we want to show our data; always look at your data. Generate the plots, and there's a default plot that's built. We only generate one plot because we wanted the user to have the option to change things. They might go in here and, say, get rid of that title; I just changed the size and added a legend. They can change the entire plot if they want to. Then one of their all-time favorite features of BPC-Stat seems to be this repeat analysis. Once we have an analysis we like, we can repeat it. What this is doing is hacking the column switcher and adding some extra features onto it. It takes the output, dumps it in a vertical list box or tab box, and allows us to apply our filtering either globally or to each analysis. Now I'm in the column switcher and I can tell it which columns I want it to use; this works for any analysis, not just plotting. Click OK, it runs through the switcher, generates the report, and there I have it: all my responses are plotted. That was easy. I go down and there's the script that recreates that; I can drop it here, get rid of the previous one, done.

Descriptive statistics: here we go, it's already done. I have the subsetting applied, so I have the tables I need. And look at this, it's already formatted to the number of decimals I needed, because it's taking advantage of those properties we assigned based on the table standardization. So that one is done. And then the full model. The full model is set up, ready to go for, what would you think? Residual assessment. We can go through each of the models one at a time and take a look to see how the points are behaving and the lack of fit. Does it look okay?
Here we have one point that may be a little bit off that we might want to explore. Auto recalc is already turned on, so I can do a row exclude and it will automatically update. Or we have a tool that will exclude the data point in a new column so that I can analyze it side by side. Since I've already specified my responses, in order to include that side-by-side comparison I would have to go back and modify my workflow, and we have that workflow editor to do that. I'm just going to skip ahead, to save some time, to where I've already done that. This is the same file, same analysis, but I've added an additional response, and it's right here: yield without run 21. Now the scientists can look at this side by side and say, you know what, that point is a little bit unusual statistically, but practically there's really no impact.

All right, let's take this further. This is our routine process: we take it all the way through the reduced model because we want to see if it impacts model selection. We have automated the model selection, and it takes advantage of the existing stepwise for forward AIC, or the existing effects table where you can click to remove terms by backward selection manually if you want; this automates the backward selection, which we typically use for split-plot designs. We also have a forward selection for mixed models, which is not currently a JMP feature and which we find highly useful. Since this is a fixed-effects model, I'm going to go ahead and do that, and it gets the workflow information. I know I need to do this on the full model. It goes ahead and does the selection. What it's doing in the background is running each model and constructing a report of the selection it has done, in case we want to report that, and it's going to save those scripts back to the data table. There's the report right there that contains the whole selection process, and those scripts were just generated and dumped back here. Now I can move those scripts back into my workflow section: I know the reduced model goes in there, and this is my model selection step history, so I can put that information in there.

Okay, so this is great. Now when I look at my reduced model, which has gone through the selection, I can see the impact of the selection of removing this extra point. And again, the scientists would likely conclude that there's just no practical difference here. They could even go, and should go, and look at the interaction profilers as well and compare them side by side. This is great. We want to keep this script because we want to keep track of the decisions we made, so there's a record of that. But we also want to report the final model, so we want a nice, clean report. We don't want that "without run 21" response in there, because we've decided it's not relevant; we need to keep all the data. Another favorite tool that we developed is the Split Fit Group, which allows us to break up these fit groups. We have the reduced model here; take the reduced model, and it allows us to break it up into as many groups as we want.
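Stepping back to the automated model selection mentioned a moment ago, here is a hedged sketch of forward selection by corrected AIC. It uses ordinary least squares as a stand-in for the mixed-model version the team built in BPC-Stat; the data, term names, and AICc bookkeeping are illustrative assumptions.

```python
# Sketch: forward selection by corrected AIC (AICc), using plain OLS as a
# stand-in for the mixed-model selection described in the talk.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 40
df = pd.DataFrame({"Temp": rng.normal(size=n), "pH": rng.normal(size=n),
                   "Feed": rng.normal(size=n)})
df["Yield"] = 100 + 3 * df["Temp"] - 2 * df["pH"] + rng.normal(scale=1.0, size=n)

def aicc(fit):
    """Corrected AIC: AIC + 2k(k+1)/(n-k-1), counting intercept, slopes, and sigma."""
    k = fit.df_model + 2
    return fit.aic + 2 * k * (k + 1) / (fit.nobs - k - 1)

candidates = ["Temp", "pH", "Feed", "Temp:pH"]
selected = []
best = smf.ols("Yield ~ 1", data=df).fit()
while candidates:
    trials = {t: smf.ols("Yield ~ " + " + ".join(selected + [t]), data=df).fit()
              for t in candidates}
    term, fit = min(trials.items(), key=lambda kv: aicc(kv[1]))
    if aicc(fit) >= aicc(best):   # stop when no candidate improves AICc
        break
    selected.append(term)
    candidates.remove(term)
    best = fit

print("selected terms:", selected)
```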
In this case, we're only going to group it into one group, because we're going to eliminate one response; we want one group. When we're done, we're just using this to eliminate the response we no longer want in there. Click OK, that's some feedback from the model fitting, and boom, we have it. The fit group is now there, and the "without run 21" analysis has been removed. Now we have this ready to report. Notice that the settings for the profiler are all the settings we specified; it's not at the average point or the center point, it's at the process set point, which is where we need it to be for comparison purposes. It's all ready to go. That generates the script there when I did that split; I can put it up here and just rename it. That's final models. Okay, very good.

Now for some real fun. Remember we talked about how tedious it is to set up this simulation script? Now watch this, and watch how easy this is for the scientists. Before I do this, I want to point out that this was, of course, created by a script, obviously JSL, but this is a script that creates a script that creates a script, so it was quite a challenge for Adsurgo to develop. When I run this, I can pick my model here, final models, and then in just a matter of seconds it generates the simulation script that I need. I run that, and boom, there it is, all done. It set up the ranges that I need for the process parameters, it set them to the correct intervals, and it set the Add Random Noise. But there's even more going on here than what appears. Notice that the process scale has been added; we didn't have that in the model before. That was added so that we could take these lab-scale DOE models and extend them to the large scale; now we're predicting large scale. That's important; that was a modification to the model, and previously very tedious editing of the script was required to do that. Notice that we also have this PC block random effect in here that we had specified, because we don't want to simulate specific blocks; now it's an additional random effect, and the total variance is being plugged into the standard deviation for the Add Random Noise, not the default residual noise. We also added this little set seed here so we can reproduce our analysis exactly. So this is really great. And again, notice that we're at the process set point, where it should be.

Okay, the last thing I want to show here is the reporting. We have essentially completed the entire analysis; you can see it's very fast. We want to report these final models out into a statistics report, and we have a tool to do that. The report starts with the descriptive statistics, so I'm going to run that first, and then we're going to go and build the report, export stats to Word. Then I have to tell it which models I want to export. It's asking about verification plots; we didn't have any confirmation points in this case, so we're going to skip that. And then it defaults to the default directory that we set for the output.
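As a hedged illustration of what that generated simulation does conceptually, the Python sketch below propagates parameter variation plus total (block plus residual) random noise through a fitted model and estimates a defect rate; every coefficient, variance, limit, and set point here is invented, and this is not the JSL the tool produces.

```python
# Sketch of the simulation step: propagate parameter variation and total
# random noise through a fitted model and estimate the defect rate.
import numpy as np

rng = np.random.default_rng(42)          # the "set seed" for exact reproducibility
n_sim = 100_000

# Process parameters drawn around their set points (illustrative values).
temp = rng.normal(loc=37.0, scale=0.3, size=n_sim)
ph = rng.normal(loc=7.10, scale=0.05, size=n_sim)

# Fitted model prediction at large scale (coefficients are invented).
pred_yield = 70.0 + 2.5 * (temp - 37.0) - 8.0 * (ph - 7.10)

# "Add Random Noise": total SD combines block and residual variance components.
var_block, var_residual = 0.8, 1.5
noise = rng.normal(scale=np.sqrt(var_block + var_residual), size=n_sim)
sim_yield = pred_yield + noise

spec_low = 65.0
defect_rate = np.mean(sim_yield < spec_low)
print(f"simulated defect rate: {defect_rate:.3%}")
```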
I'm going to open the report when I'm done. And this is important: we're leaving the journal open for saving and modification. Because as everybody knows, when you copy stuff, you generate your profilers, you dump them in Word, and there's some clipping going on. We may have to resize things, we may have to put something on a log scale. We can do all that in the journal and then just resave it back to Word; that saves a step. So we generate that, I click OK here, and it's reading the different tables and the different profilers and generating this journal up here. That's actually the report that it's going to render in Word, and it will be done in just a second here. Okay, and then it just opens up Word, and boom, there's our report.

Look at what it did. It put captions here, it put in our table, already formatted to the reporting precision that we need, with this footnote that it added, meeting our standards. Then for each response it has a section, and the section has the tables with their captions, the profilers, the interaction profiler, footnotes, et cetera, and it repeats for each attribute over and over. It also generates some initial text with some summary statistics that the scientists can update. So it's pretty much ready to go and highly standardized.

That completes the demo of the system. Now I just have one concluding slide that I want to go back to. In conclusion, BPC-Stat has added value to our business. It has enabled our process teams. It has parallelized the work. It has accelerated our timelines. We've implemented a standardized yet flexible, systematic approach, with higher, faster acceleration and much more engagement. Thank you very much.
This presentation highlights modeling approaches used to understand and predict product quality based on product measurements. The model development process involves four steps: data collection, model development, testing, and implementation. During data collection, parts are collected from the production line and product measurements are completed. These parts continue through the process and are subject to a quality test. These product measurements and quality test results are combined into the dataset and used for modeling. The quality test is the response and product measurements are predictors. In model development, the data is examined thoroughly to ensure it is as clean as possible. Variable clustering and stepwise regression are applied to remove highly correlated input variables and select the important variables. The final step is to apply generalized regression using a log-normal distribution and the adaptive lasso estimation method. The model must have an accuracy greater than a certain acceptable metric. If the model meets this criterion, it's moved into the testing phase. This phase involves using the model under engineering control to determine how well the model predicts the product quality. Once satisfied, the model is implemented into production. The operations team receives instant feedback on how the part will perform and can adjust and tune the process in real time. With this information, we can also deem the product acceptable or not. If rejected, the product is disposed of and doesn't continue through the process. These predictive models identify unacceptable parts and process upsets in the upstream processes.

Welcome, and thank you for joining my poster presentation at this year's JMP Conference. My name is Kaitlin Shorkey, and I'm a senior statistical engineer at Corning Incorporated. How do you get a glimpse of a product's quality before it completes the production process? We chose to build a model that will predict the product quality outcome before it has completed the entire process. There are two major benefits of this approach. One is that the operations team receives instant feedback on how the parts will perform and can adjust the process in real time. The second is that we can deem the product acceptable or not.

Like I just mentioned, the main objective of this work is to build a predictive model using a few modeling approaches to understand and predict product quality based on certain product measurements. Our major steps in building this model are data collection, model development, testing, and implementation. First off, for the data collection phase, parts are collected at the end of the production line and appropriate product measurements are completed. The parts are then subjected to the quality test. The product measurements and quality measurement results are combined into a dataset and used for building the model. In this case, the quality measurement is the response and all the product measurements are the predictors. The dataset consists of 767 predictors and 990 observations, or parts. This step can take a long time to execute.
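As a small, hypothetical sketch of how such a modeling dataset comes together (the column names and values below are invented), the product measurements are simply joined to the quality test result, which becomes the response while the measurements become the predictors:

```python
# Sketch: combine product measurements and quality test results into one
# modeling dataset; the quality result is the response, measurements are predictors.
import pandas as pd

measurements = pd.DataFrame({"part_id": [1, 2, 3],
                             "V1": [0.82, 0.79, 0.85],
                             "V2": [12.1, 11.8, 12.4]})
quality = pd.DataFrame({"part_id": [1, 2, 3],
                        "quality_result": [7.2, 9.8, 6.5]})

data = measurements.merge(quality, on="part_id")
y = data["quality_result"]                              # response
X = data.drop(columns=["part_id", "quality_result"])    # predictors
print(X.shape, y.shape)
```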
Since we're building a model, it's important to get as large a range of product measurements and quality measurement results as we can. If we achieve this, the accuracy and model predictions are more consistent across the range; essentially, this allows the model to accurately predict at all levels of the product quality results. Once the dataset is compiled, it is thoroughly examined to ensure it is as clean as possible.

After the data collection and cleaning, the second phase, model development, is started. For this, we begin with variable clustering and stepwise regression to remove highly correlated variables and select the most important ones. With so many predictors, we first apply variable clustering. This method allows for a reduction in the number of variables. Variable clustering groups the predictors into clusters that share common characteristics, and each cluster can be represented by a single component or variable. A snippet of the cluster summary from JMP is shown, which indicates that 85% of the variation is explained by the clustering. Cluster 12 has 49 members, and V232 is the most representative of that cluster. The variables identified as the most representative ones are then used in the next method, stepwise regression.

Stepwise regression is used on the identified cluster variables to select the most important ones to use in the model, further reducing the number of variables. For this, the forward direction and the minimum corrected AIC stopping rule are used. The direction controls how variables enter and leave the model; the forward direction means that the terms with the smallest p-values are entered into the model. The stopping rule is used to determine which model is selected. The corrected AIC is based on negative two log-likelihood, and the model with the smallest corrected AIC is the preferred model. From this, 51 of the 99 variables available from the variable clustering step are entered into the model. At this point, we have reduced the number of variables from 767 to 51 using variable clustering and stepwise regression.

The final method is to fit a generalized regression model. For this, the log-normal distribution is used with the adaptive lasso estimation method. The log-normal distribution is the best fit for the response, so it is chosen for the regression model. The adaptive lasso estimation method is a penalized regression technique which shrinks the size of the regression coefficients and reduces the variance in the estimates; this helps to improve the predictive ability of the model. The dataset was also split into a training and a validation set: the training set has 744 observations, and the validation set has 246. From this, the resulting model produces a 0.81 generalized R-square for the training set and 0.8 for the validation set. These R-squares are acceptable for our process, so we will now evaluate the accuracy and predictability of the resulting model.
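The following Python sketch mirrors that reduce-then-fit sequence under stated assumptions: predictors are clustered by correlation and one representative per cluster is kept, then a lasso is fit to the log of the response. The data are synthetic, and the clustering rule and plain lasso are simplified analogues of JMP's variable clustering and log-normal adaptive lasso, not the actual methods.

```python
# Sketch of the reduction-and-fit pipeline: cluster correlated predictors,
# keep one representative per cluster, then fit a penalized regression on
# log(y) as a rough analogue of a log-normal adaptive lasso model.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 990, 120                                  # toy stand-in for 990 parts x 767 predictors
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)    # make a pair of highly correlated predictors
y = np.exp(0.02 * X[:, 0] - 0.03 * X[:, 5] + rng.normal(scale=0.1, size=n))  # positive response

# 1) Cluster predictors by 1 - |correlation| and keep one member per cluster.
dist = np.clip(1 - np.abs(np.corrcoef(X, rowvar=False)), 0, None)
clusters = fcluster(linkage(dist[np.triu_indices(p, k=1)], method="average"),
                    t=0.3, criterion="distance")
representatives = [np.where(clusters == c)[0][0] for c in np.unique(clusters)]

# 2) Fit the lasso on log(y), holding out a validation set.
X_tr, X_va, y_tr, y_va = train_test_split(X[:, representatives], np.log(y),
                                          test_size=0.25, random_state=1)
model = LassoCV(cv=5).fit(X_tr, y_tr)
print(f"{len(representatives)} representative predictors kept;"
      f" train R^2 = {model.score(X_tr, y_tr):.2f},"
      f" validation R^2 = {model.score(X_va, y_va):.2f}")
```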
Now that we have a model, we need to review its accuracy and predictability to see if it would be suitable to use in production. In doing this, a graph is produced that compares the predicted quality measurement for a specific part to the actual quality measurement. In the graph, the x-axis shows the predicted value and the y-axis shows the actual value. Also, the quality measurement is bucketed into three groups based on its value, which is shown by the three colors on the graph. In general, the model predicts the quality measurement well. It does appear that the model may fit better in the lower product quality range than the upper, which may be due to more observations in the lower range.

As mentioned, the quality measurement was bucketed into three different categories based on its value. This was also done for the predicted quality measurement. For each observation, if the quality measurement category is the same as the predicted measurement category, it is assigned a one; if not, it is assigned a zero. For both the training and validation sets, the average of these ones and zeros is calculated and used as the accuracy measure. We see that the training set has an accuracy of 87.5% and the validation set has an accuracy of 84%. For the model to be moved to the testing phase, accuracy must be above a certain limit, and both of these accuracy values are, which allows us to move to the testing phase of the project.

In addition, we look at the confusion matrix to visualize the accuracy of the model by comparing the actual to the predicted categories. Ideally, the off-diagonals of each matrix should be all zeros, with the diagonal from top left to bottom right containing all the counts. The matrices shown on the poster indicate that the higher counts are along that diagonal, with lower numbers in the off-diagonals, but discrepancies still exist among the three categories. For example, in the training set there are 29 instances where an actual quality measurement of three was predicted as a two; in the same case for the validation set, there are 12. The confusion matrix helps us understand where these discrepancies are so further investigation can be done and improvements made. Overall, though, the model has an accuracy above the required limit, so our next steps are the testing and implementation phases.

Now that our model is through the development phase, it's time to test it in live situations. For this, the model is used under engineering control to determine how well it predicts the quality measurement in small, controlled experiments. This is done by the engineering team, with support from the project team when necessary. Once the engineering team is satisfied with this testing, the model is fully implemented into production and monitored over time. In conclusion, this model development process has allowed us to build predictive models for the production process. The methods of variable clustering, stepwise variable selection, and generalized regression were the most appropriate and best suited for this application.
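A compact sketch of that bucketed-accuracy idea follows, with invented cut points and synthetic values rather than the real quality data: actual and predicted values are binned into three categories, matches are scored 1 or 0 and averaged, and a crosstab gives the confusion matrix.

```python
# Sketch of the accuracy calculation: bucket actual and predicted quality
# values into three categories, score matches as 1/0, and cross-tabulate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
actual = rng.lognormal(mean=2.0, sigma=0.4, size=200)
predicted = actual * rng.lognormal(mean=0.0, sigma=0.15, size=200)   # imperfect predictions

cuts = [0, 6, 10, np.inf]                       # illustrative bucket boundaries
labels = ["1", "2", "3"]
actual_cat = pd.cut(actual, cuts, labels=labels)
pred_cat = pd.cut(predicted, cuts, labels=labels)

match = (actual_cat == pred_cat).astype(int)    # 1 if the categories agree, else 0
print(f"accuracy = {match.mean():.1%}")

confusion = pd.crosstab(actual_cat, pred_cat, rownames=["actual"], colnames=["predicted"])
print(confusion)                                # off-diagonal counts are the misclassifications
```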
With further research and investigation, other methods could potentially be applied to improve model performance even more. From a production standpoint, the benefit of this model is that the operations team will receive instant feedback on how a part or group of parts will perform and can adjust and tune the process in real time. We can also deem the product acceptable or not. If rejected, the product is disposed of and will not continue through the process, which over time reduces production costs. Lastly, I'd like to give a huge special thank you to Zvouno and Chova and the entire project team at Corning Incorporated. Thank you for joining and listening to my poster presentation.
The National Basketball Association (NBA) has long been a league where impressive events happen on a regular basis. Because of this, the standard for noteworthy achievements has been raised time and time again. You will frequently see social media posts from top sports accounts regarding players scoring 30+ points or incredible dunks. However, you rarely see a post about a comeback victory unless the team overcame the insurmountable odds of being down by, what appears to be, too many points to win. This presentation considers the individual plays, players, and salary figures that give a team the best chance at achieving a comeback victory. Descriptive analytics are used to gain valuable insights into how often a team can produce a comeback when trailing by ten or more points at half, which teams are achieving them most often, which players are most involved in producing comebacks, and the salaries of those players. The focus for the predictive portion of the analysis is on predicting the involvement of players based on their career stats and salary figures and the types of plays NBA teams can prioritize to achieve comeback victories more often.

Today we're going to talk about what provides the best opportunity for a comeback in the NBA. My name is Weston Salmon. I'm currently a student at Oklahoma State University studying Business Analytics and Data Science in our Master's program. And my name is Zach Miller. I'm also a student at Oklahoma State University, also studying for my Master's in Business Analytics and Data Science.

All right, so we're going to cover the table of contents real quick, which shows what we're doing throughout the presentation. First we're going to begin with an introduction that discusses why we're here and what exactly the study is for. Then we're going to jump into our data and methods, which look at what the data looks like, how we completed our analysis, and the different ways we manipulated our data. We'll then look at the descriptive and predictive portions, which show what we can derive from the data as it sits and what our predictions look like. Then at the end, we'll conclude the presentation and look at the implications of the analysis and how it can be used by NBA teams in the future.

Here's a quote by Gabe Frank. He's the Director of Basketball Analytics with the San Antonio Spurs. We thought this would be a good quote to throw in as it deals with [inaudible 00:01:16] and also the NBA in general. He said, "I think analytics have grown in popularity because it can give you a competitive advantage if you do it well. Every little bit helps." Through our presentation, we're going to discuss NBA analytics and how they can produce comebacks based on the data that we find. We thought this quote really spoke to the overall objective of our project. Now I'll pass it off to Zach to introduce the project as a whole.

Thanks, Weston. Going into the introduction here: going into an NBA season, every team has one common goal, and that's to win the championship. Like I said, going into the season, a lot of teams hope for 40-plus wins in their 82-game season, but a great season typically results in 50-plus wins.
Then also, our primary interest for this presentation and analysis is in the hard-fought victories, also known as comeback victories, by these different teams. We define a comeback victory as one where the winning team was losing by 10 or more points at halftime. Even when a team was down by more than 10 points at halftime, we were seeing that every so often there were teams that would take the lead and ultimately win the game. Then finally, we have our analysis, which utilizes play-by-play data and salary data sets, which we will go into in a little more detail in just a few minutes.

Now we're going to discuss the two business questions that we want to answer using the data. First, looking at the play-by-play data, we want to know exactly what plays or sequences of plays within the game give a team the greatest chance at a comeback victory. We want to know what exactly players and coaches can do and draw up in the lineup to produce a comeback. Then with the salary and career stats data, we really want to see how those variables, the salary and career stats, can be used to determine how involved a player should be in comeback victories. So not necessarily just how well they actually perform, but also how they should perform based on these variables: which players are underperforming and overperforming according to their contract and track record?

Next we'll discuss the data and methods that we used for both data sets. Like I said, we had two data sets. The first one focused on play-by-play data. This data contains the process and outcome of every play within every game from 2015 through 2018, so it included all 30 NBA teams and exactly what they did in every single play throughout the games in these years. Zach will also talk about the salary data that we have. Right, so going into the salary data, it contains the salary information for each of the players that were mentioned in the play-by-play data. Whether a player mentioned in the play-by-play data played one minute of NBA time or hundreds of minutes of NBA time, they appear in my salary data. I could then go and see what their career stats and salary were for the seasons that we were looking at within the play-by-play data.

Now we're going to look at the key variables that we found within the play-by-play data set. You can see things such as comeback, halftime deficit, shot distance, outcome type, and rebound type, but there are two that we want to focus on in particular. The first is the comeback. That's a variable [inaudible 00:04:51] it was our flag variable and used as our predictor. We flagged a one next to all games where a team trailed by 10 or more points at halftime and came back and won, and it was a zero if not, because, as we said, the overall goal of this presentation is to see what leads to comebacks in general. Then we also want to look at the halftime deficit, which was another variable that we created. This shows the number of points a certain team trailed by at halftime.
If the deficit was greater than or equal to 10 points, then those were the games that we specifically looked at, and then we wanted to see if the plays made throughout those games led to a comeback in the end.

Now looking at some of the key variables for the salary data. These as a whole are variables that we decided were going to be important for our analysis, but once again, I want to focus on a couple of these variables, as I feel they are more important to point out and explain. The first is the player involvement variables. The player involvement variables are counts of individual involvement in key plays during comebacks. These plays could include shots, rebounds, fouls, any of the actionable plays that we see throughout the play-by-play data. We wanted to take individual counts for players so we could see how many times a certain player was shooting the ball throughout the seasons and really be able to compare these players to other players within the league. Then, moving to the comeback score. This is a min-max scoring method that we use to score the overall player involvement; this is what we use to quantify how involved these players were, utilizing the player involvement variables that I just explained.

I wanted to go a little bit deeper into the comeback score calculation just to make sure that everyone understands how it was calculated. As I said, it was a min-max scoring method, and this was used to determine the involvement of the players during their team's comeback victories. This min-max method creates the scores by taking the player's involvement into account relative to the range of values that appear for each variable. It takes the maximum count of each of these different plays and uses that as the maximum, and then a minimum of, typically, zero, since certain players would only play a very low number of minutes, anywhere from zero all the way up to hundreds of minutes; with the zero-minute players, we found that they did not contribute much to these comeback wins. Below you see the formula that we used for each of the players to create this comeback score. This is a perfect example, as we see the assist count divided by the maximum assist count, times 0.1667, with 0.1667 being 1 divided by the total of the 6 included variables, which is what we would call the weight for the formula. Each of these variables was weighted equally, and we took the min-max score for each of the variables and multiplied by 100 to get the final score.

We'll now look at the play-by-play analysis method. When taking this data, we first began by merging. We had six CSV files, one for each individual year, and we combined all of those into one central file so we could look at the play-by-play data from the six years that we had. We then transformed the data using flag variables. As we said, we created a column that specified whether there was a comeback or not. We first looked at the halftime scores and saw if teams were trailing by 10 or more at halftime.
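Returning to the comeback score formula described a moment ago, here is a small hypothetical example of that min-max calculation: each of the six involvement counts is divided by the maximum for that play type, weighted by 1/6 (the 0.1667 in the formula), summed, and multiplied by 100. The players, play types, and counts below are made up for illustration.

```python
# Sketch of the min-max comeback score: each involvement count is scaled by
# the maximum for that play type, equally weighted, summed, and scaled to 100.
import pandas as pd

involvement = pd.DataFrame({
    "player":   ["Player A", "Player B", "Player C"],
    "assists":  [40, 10, 0],
    "shots":    [120, 60, 5],
    "rebounds": [55, 80, 10],
    "steals":   [12, 4, 1],
    "blocks":   [8, 20, 0],
    "fouls":    [30, 25, 6],
})

play_cols = ["assists", "shots", "rebounds", "steals", "blocks", "fouls"]
weight = 1 / len(play_cols)                    # 0.1667 for six equally weighted variables

scaled = involvement[play_cols] / involvement[play_cols].max()   # min of 0, max of the observed max
involvement["comeback_score"] = (scaled * weight).sum(axis=1) * 100
print(involvement[["player", "comeback_score"]])
```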
We would then take those games and see if a comeback actually occurred. If it did, we flagged a one and specifically pulled the plays that occurred. Then for the descriptive analysis, we looked at different graphs within Tableau. These included things such as how far away the players were shooting from the basket, whether they were missing or making their shots, the rebound types, and things like that, to get an idea of what players were doing during the games and whether they were actually producing good outcomes to secure a comeback. Then lastly, the predictive analysis we did in JMP Pro using a decision tree, to see which plays and sequences of plays produce comebacks and how we can better look at those in the future so teams can produce more comebacks throughout the season.

Now for the methods with the salary analysis. First off, we had to do some table joins. These joins were necessary to get all of the data tables together, as we needed them all together to really be able to dive into everything as a whole; separated, the data wasn't much help to us. Then we moved on to some data transformation. We wrote SQL queries to gather the counts of the key metrics. This is how we got the counts of shots for the various players, along with other things such as rebounds or fouls. Then we moved on to some descriptive analysis that was completed in Tableau. With this descriptive analysis, one of the key things we were looking at was the comeback scores, both the actual comeback scores and the predicted comeback scores, versus the salary of the players, so we could see just how well they're performing relative to their salary. Then finally, we had a predictive analysis, a linear regression that was completed in JMP Pro. I will go into a little bit more detail about that later on.

Now we're going to jump into the descriptive and predictive analyses that we conducted, beginning with the descriptive analysis. Here we want to look at salary versus comebacks for each NBA team. If you look at the data points, you can see that most teams follow the trend line, meaning that as they spend more money on their teams in salary, they also produce a greater number of comebacks. You can see that the Boston Celtics had the most comebacks, at 14, towards the top, and the Cleveland Cavaliers had the highest salary paid but also one of the fewest comebacks, with only five. What we thought was most interesting was the Indiana Pacers, because not only did they pay such a low salary, but they were also able to produce 12 comebacks, which is the third most in the NBA. I wanted to hone in on the Indiana Pacers and see what exactly they were doing that allowed them to produce such a high number of comebacks with such a low salary.

As Weston said, we wanted to focus on the Indiana Pacers. Here we see the salary of Pacers players versus their individual comeback scores. Several highly scored players are found within the Indiana Pacers roster, as you can see with Myles Turner, Carlson, Young, George, and Oladipo.
The top-scored players are spread across the salary spectrum. You see some cheaper players such as Myles Turner, or Carlson being more of a mid-range salary player. Then you also have more expensive players such as Paul George or Victor Oladipo further towards the top right of the graph. So you can really see how they've spread the wealth out and are getting maximum performance out of their highly paid players while also finding performance from lower-paid players. You can also see that they have several middle-tier players that come into play and provide big help to the Pacers, as they need players who can come off the bench, provide some key value plays, and produce comebacks. As I said, one of the key points I want to point out is that Victor Oladipo, the highest-paid player on the team, is also the highest performing in terms of comeback score, so they're definitely getting their worth out of him as a player.

All right, so now we're going to get into our predictive analysis. To begin with the play-by-play data, we decided to make a decision tree to predict the play types that lead to the Indiana Pacers producing a comeback, using the following variables that you can see below. We'll see that only a couple of these variables actually had a huge impact in predicting whether the Pacers would come back from 10 or more points down. In the decision tree, there are two nodes in particular: one where the shot distance was greater than or equal to 26 feet from the basket and they were making those shots, and another with a shot distance of greater than or equal to 3 feet from the basket, meaning they're looking at more of a layup option.

Now we're going to look at those two nodes in a little more detail. These branches, as I said, predict that the Pacers produce comeback victories. In the overall model, we had a validation misclassification rate of 45.97%. As I said, the model predicts that made shots of 26 feet and further, and made layups 3 feet or further from the basket, lead to comebacks. What we would say is that they should really focus on the three-point aspect and on higher-percentage shots such as layups, because as you can see in both of those nodes, the prediction was one, which in this case means that the Pacers were able to produce a comeback. You can see that with the 26-feet-and-further node, the probability that the prediction equaled 1 was 62.75%, and when they were making layups 3 feet or further from the basket, the probability of a comeback was 75.9%.

Then, as I said, there were two variables that seemed the most important of the 10 that we looked at in predicting why the Pacers were able to produce a comeback. The first was shot distance, which looked at how far players shot the ball, and the second was shot outcome, which is whether they made or missed the shot. With the distance, as I said, it's 26 feet or further, which is about the three-point range, or some of those higher-percentage shots going in for a layup.
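For illustration only, here is a sketch of a comparable decision tree in Python on synthetic play-by-play-style data. The data-generating rule below is made up to loosely echo the splits just described (made shots from 26+ feet and close-range makes), so the fitted tree and its error rate will not match the JMP Pro results from the talk.

```python
# Sketch of a decision tree like the one described: predict the comeback flag
# from shot distance and shot outcome (synthetic data, not the real play-by-play).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(21)
n = 2000
df = pd.DataFrame({
    "shot_distance": rng.integers(0, 30, size=n),
    "shot_made": rng.integers(0, 2, size=n),
})
# Synthetic rule: made threes and made close-range shots are associated with
# comebacks a bit more often than other plays.
p = 0.35 + 0.25 * ((df["shot_made"] == 1) &
                   ((df["shot_distance"] >= 26) | (df["shot_distance"] <= 3)))
df["comeback"] = rng.random(n) < p

X_tr, X_va, y_tr, y_va = train_test_split(df[["shot_distance", "shot_made"]],
                                          df["comeback"], random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(export_text(tree, feature_names=["shot_distance", "shot_made"]))
print(f"validation misclassification rate: {1 - tree.score(X_va, y_va):.2%}")
```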
All right, now moving into the linear regression portion of our predictive analysis. This regression, as I said, was completed in JMP Pro, and it was done to predict the comeback scores of individual players based on the variables that you see on screen. A couple of the key ones to point out would be their individual player salaries, their team name, and their career statistics, as you see with all those different variables there. It's also important to note that the variables were selected for this regression based on their level of significance. If a variable was not found to be significant, it was not included in the regression.

Going into the summary of fit for this linear regression, I do want to point out that it has a low RSquare, but this is not a primary concern for our analysis. We knew that the comeback score would be based on the comeback involvement statistic, but we now wanted to know what the score would be based on completely different variables. Instead of using the variables that we used to create the statistic initially, we're now using new variables, like salary and career stats, to try to predict what it should be. That means the predictions would vary from the original scores, and that was not only expected in our analysis but also desired: we wanted different scores to really see how players were supposed to perform.

Based on this analysis, we were able to come up with some of the most important variables. The first we saw was salary. We were seeing that higher-paid players were predicted to perform more, which is something you would definitely expect in the actual NBA, where players like Victor Oladipo or LeBron James, with higher salaries paid to them, would be performing better than those with lower salaries. Moving on from there, we also have the team. This one definitely makes sense, as some of the top teams it was looking at for comeback victories and predicted comeback scores were the Golden State Warriors and the Indiana Pacers, which are a couple of the teams that we saw had the highest number of comeback victories over the seasons we were looking at. Then a couple of career stats also popped up as some of the most important variables for this regression: first, the career total rebounds by the players, followed by the career points. When a player had higher career total rebounds and higher career points, we expected that player to produce more value when it came to creating a comeback victory. I'll also note that these variable importances were calculated through the LogWorth.
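A hedged sketch of this kind of regression, including the LogWorth ranking (LogWorth is -log10 of a term's p-value), is shown below using statsmodels. The actual analysis was done in JMP Pro; the input file and column names here are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical player-level summary table.
players = pd.read_csv("player_season_summary.csv")

# Comeback score regressed on salary, team, and selected career statistics.
model = smf.ols(
    "comeback_score ~ salary + C(team) + career_total_rebounds + career_points",
    data=players,
).fit()
print(model.summary())

# LogWorth = -log10(p-value); larger values indicate more important terms.
logworth = -np.log10(model.pvalues).sort_values(ascending=False)
print(logworth)
```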
Now we're going to look at the conclusions of the presentation. Going into some of the Indiana Pacers predictions, we specifically want to point out some of the Pacers' top performers and underperformers. The blue dots that you see there are the actual Pacers top performers that we saw in the earlier graph of actual comeback score versus salary; now we are looking at the predicted comeback score versus salary. The orange Xs mark the Pacers' underperformers. The underperformers in this graph, marked with the orange Xs, are predicted to be relatively much higher than their teammates, whereas with their actual scores they find themselves more middle-to-lower-end of the pack relative to their teammates. That really shows us that they're not performing up to what their salary and career statistics say they should be, particularly when it comes to creating a comeback victory. But it is important to point out that the team has done a great job of signing inexpensive players that produce comeback wins. We see players such as Myles Turner, Carlson, or Young that have somewhat lower salaries but also produce a lot of plays that can help with creating a comeback win.

Then we also wanted to point out some of the Cleveland Cavaliers predictions and the faults that go with them. The Cavs should have multiple high-tier comeback players. One to point out specifically would be Kevin Love. Kevin Love is there at the top of the graph, and he has both an orange X and a blue mark next to his name. That just marks that he was one of the actual top performers for the Cavaliers, but at the same time he's underperforming greatly. In our predictions, we can see that he's predicted to actually perform better than LeBron James, which is something very interesting to point out. Like I said, with our predictions based on salary and career stats, we would expect Kevin Love to outperform LeBron James when it came to producing comeback wins. But in reality, he's actually quite far down the list; he still remains one of the top performers, but he does not produce nearly as much as LeBron James does.

Now we also wanted to look at the best-valued players in the NBA. We show the top five here. Looking at their predicted comeback scores, you have people such as Karl-Anthony Towns, Joel Embiid, and Ben Simmons, who were predicted to be some of the higher-performing players in the entire league. But as you can also see, when this data was taken, they had relatively low salaries compared to other players. What we would recommend here is definitely giving these players the contracts they deserve, as they help teams produce comebacks and obviously provide statistics that allow teams to perform their best. They're definitely producing more than what they're actually being paid for.
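Both observations, the underperformers and the best-value players, come out of the same comparison: the gap between a player's actual comeback score and the score the regression predicts from salary and career statistics. A minimal, self-contained sketch of that comparison follows, using the same hypothetical table and model as in the regression sketch above; the `player` column is also assumed.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same hypothetical table and model as the regression sketch above.
players = pd.read_csv("player_season_summary.csv")
model = smf.ols(
    "comeback_score ~ salary + C(team) + career_total_rebounds + career_points",
    data=players,
).fit()

# Underperformers: actual comeback score falls well below what salary and career
# statistics predict. Best-value players: actual score sits well above the prediction.
players["gap"] = players["comeback_score"] - model.predict(players)
print(players.nsmallest(5, "gap")[["player", "team", "salary", "gap"]])  # underperformers
print(players.nlargest(5, "gap")[["player", "team", "salary", "gap"]])   # best value
```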
Then we also wanted to look at the best lineup predictions for the Pacers. As I mentioned earlier, there's that three-point and high-percentage emphasis, so build up the lineup with shooting threats from distance. You have people such as Robinson, Joseph, and George; their average shot distance is about 15 feet and beyond, when the three-point line is about 25 feet, so that shows that they are shooting a lot of threes and they're also making them. Not only are they shooting from that far, but they're also more likely to make their shots, so those are the people you would want in the lineup whenever you are trying to produce a comeback, as they're more efficient. Also, because they can shoot from deep, you'd expect that they also have solid play down low, to be able to get a layup quickly and get those higher-percentage shots to go in as well.

As I mentioned, an average distance of made shots near the three-point line is very important for the Pacers in particular to be able to produce a high number of comebacks. This analysis confirms what is already going on in the NBA. Typically, teams who find themselves down by a certain number at halftime will throw up a few more three-point shots, but they don't really focus on that high-percentage look from down low near the basket. We also think that they should focus on drawing up plays that allow them to get a quick layup and build momentum from that as they try to produce a comeback later on.

All right, that wraps up our presentation. We just want to say a quick thank you, and this is where we would open it up to questions.
When taking measurements, sometimes we are unable to reliably measure above or below a particular threshold. For example, we may be weighing items on a scale that is known to only be able to measure as low as 10 grams. This kind of threshold is known as a "Limit of Detection" and is important to incorporate into our data analysis. This poster will highlight new features in the Distribution platform for JMP 17 that make it easier to analyze data that feature detection limits. We will highlight the importance of recognizing detection limits when analyzing process capability and show how ignoring detection limits can cause a quality analyst to make incorrect conclusions about the quality of their processes.

Hi, my name is Clay Barker, and I'm a statistical developer in the JMP group. I'm going to be talking about some new features in the Distribution platform geared toward analyzing limit of detection data. It's something I worked on with my colleague, Laura Lancaster, who's also a stat developer at JMP.

To kick things off, what is a limit of detection problem? What are we trying to solve? At its most basic level, a limit of detection problem is when we have some measurement device and it's not able to provide good measurements outside of some threshold. For example, let's say we're taking weight measurements and we're using a scale that's not able to make measurements below 1 gram. In this instance, we'd say that we have a lower detection limit of 1, because we can't measure below that. But in our data set, we're still recording those values as 1. Because of those limitations, we might have a data set that looks like this: we have some values of 1 and some non-1 values that are much bigger. We don't really believe that those values are 1s; we just know that they are at most 1. This kind of data happens all the time in practice. Sometimes we're not able to measure below a lower threshold; sometimes we're not able to measure above an upper threshold. Those are both limit of detection problems. What should we do about that?

Let's look at a really simple example. I simulated some data that are normally distributed with mean 10 and variance 1, and we're imposing a lower detection limit of 9. If we look at our data set here, we have some values above 9, and we have some 9s. When we look at the histogram, this definitely doesn't look normally distributed, because we have a whole bunch of extra 9s and we don't have anything below 9. What happens if we just model that data as is? Well, it turns out the results aren't that great. We get a decent estimate of our location parameter, our mu; it's really close to 10, which is the truth. But we've really underestimated that scale, or dispersion, parameter. We've estimated it to be 0.8, when the truth is that we generated it with scale equal to 1. You'll notice that our confidence interval for that scale parameter doesn't even cover 1. It doesn't contain the truth, and that's generally not a great situation to be in.
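The simulation in this example is easy to reproduce outside JMP. The sketch below (Python/SciPy, purely for illustration) generates N(10, 1) data, imposes a lower detection limit of 9, and shows the two naive approaches discussed here: treating the 9s as exact values underestimates the scale, while dropping them biases the location upward.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.normal(loc=10, scale=1, size=200)   # true location 10, true scale 1
dl = 9.0
y_censored = np.maximum(y, dl)              # values below 9 are recorded as 9

# Naive fit that treats the 9s as exact values: the scale is badly underestimated.
print("naive fit (loc, scale):", stats.norm.fit(y_censored))

# Throwing out the 9s instead biases the location upward and still shrinks the scale.
print("dropped-9s fit (loc, scale):", stats.norm.fit(y_censored[y_censored > dl]))
```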
What's more, if we look at the fits: we fit a handful of distributions, the lognormal, the gamma, and the normal. Well, the normal distribution, which is what we used to generate our data, isn't even competitive based on the AIC. Based on those AIC values, we would definitely choose a lognormal distribution to model our response. So we haven't done a good job estimating our parameters, and we're not even choosing to use the distribution that we generated the data with. We just threw all those 9s into our data set, we ignored the fact that they were incomplete information, and that didn't work out well.

What if, instead of ignoring that limit of detection, we just throw out all those 9s? Well, now we've got a smaller data set, and it's biased. We've thrown out a large chunk of our data, so we have a biased sample now. If we fit our normal distribution, now we're overestimating the location parameter, and we're still underestimating the scale parameter. We're actually still in quite a bad position, because we haven't done a good job with either of those parameters. We're still unlikely to pick the normal distribution: based on the AIC, the lognormal and the gamma distributions both fit quite a bit better than the normal distribution. We're still in a bad place.

We tried throwing out the 9s, and that didn't turn out well. We tried just including them as 9s, and that didn't turn out well either. What should we do instead? The answer is that we should treat those observations at the limit of detection as censored observations. Censoring is a situation where we only have partial information about our response variable, and that's exactly the situation we're in here. If we have an observation at the lower detection limit, and here I've denoted it D sub L, we say that observation is left-censored. We don't say that Y is equal to that limit of detection; we say that Y is less than or equal to that D_L value. On the flip side, if we have an upper limit of detection, denoted D_U here, those observations are right-censored, because we're not saying that Y is equal to that value, we're just saying it's at least that value.

If you're looking for more information about how to handle censored data, one of the references that we suggest all the time is Meeker and Escobar's book, Statistical Methods for Reliability Data. That's a really good overview of how you should treat censored data. If you use some of the features in the Survival and Reliability menu in JMP, then you're familiar with things like Life Distribution and Fit Life by X. These are all platforms that accommodate censoring in JMP. What we're excited about in JMP 17 is that we have now added some features to Distribution so that we can handle this limit of detection problem in Distribution as well. All you have to do is add a detection limit column property to your response variable and specify what the upper and/or lower detection limit is, and you're good to go; there's nothing else you have to do.
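Under the hood, handling a detection limit means maximizing a censored likelihood: exact observations contribute the density, while observations recorded at the lower limit contribute only P(Y <= D_L). The Distribution platform does this for you once the column property is set; the sketch below is just a minimal illustration of that left-censored normal fit on the same kind of simulated data.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(42)
dl = 9.0
y = np.maximum(rng.normal(10, 1, size=200), dl)   # left-censored at the detection limit

def negloglik(params):
    """Exact points contribute the normal pdf; points at the limit contribute
    P(Y <= dl), which is all we know about them."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                     # keep sigma positive
    exact = y > dl
    ll = stats.norm.logpdf(y[exact], mu, sigma).sum()
    ll += (~exact).sum() * stats.norm.logcdf(dl, mu, sigma)
    return -ll

res = optimize.minimize(negloglik, x0=[y.mean(), 0.0])
print("mu_hat =", res.x[0], "sigma_hat =", np.exp(res.x[1]))   # close to 10 and 1
```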
In my simulated example, I had a lower detection limit of 9, so I would put 9 in the lower detection limit field here. That's really all you have to do. By specifying that detection limit, Distribution is going to say, okay, I know that values of 9 are actually left-censored, and I'm going to do estimation accounting for that. Now, with that same simulated example and this lower detection limit specified, you'll notice we get a much more reasonable fit for the normal distribution. Our confidence intervals for both the location and scale parameters cover the truth, because we know, again, the location was 10 and the scale was 1. Now our confidence intervals cover the truth, and that's a much better situation.

If you look at the CDF plot here, this is a really good way to compare our fitted distribution to our data. The red line is the empirical CDF, and the green line is the fitted normal CDF. As you can tell, they're really close up until 9, and that makes sense, because that's where we have censoring. We're doing a much better job fitting these data because we're properly handling that detection limit. I just wanted to point out that when you've specified the detection limit, the report makes it really clear that we've used it. As you can see here, it says the fitted normal distribution with detection limits, and it lets you know exactly which detection limits it used.

Because we're doing a better job estimating our parameters, inference about those parameters is more trustworthy. If we look at something like the distribution profiler, we can now trust inference based on our fitted distribution; we feel much better about trusting things like the distribution profiler. With the simulated example, using our fitted normal distribution, and because we properly handled censoring, we know that about 16% of the observations are falling below that lower detection limit.

I also wanted to point out that when you have detection limits in Distribution, we're only able to fit a subset of the distributions that you would normally see in the Distribution platform. We can fit the normal, exponential, gamma, lognormal, Weibull, and beta. All of those distributions support censoring, or limits of detection, in Distribution. But if you were using something like the mixture of normals, well, that doesn't extend well to censored data, so you're not going to be able to fit that when you have a limit of detection. I also wanted to point out that if you have JMP Pro and you're used to using the Generalized Regression platform, Generalized Regression recognizes that exact same column property. The detection limit column property is recognized by both Distribution and Generalized Regression.
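As a quick sanity check on the 16% figure mentioned above: for the true simulation settings, the expected fraction below the detection limit is simply the normal tail probability P(Y < 9) for Y ~ N(10, 1).

```python
from scipy.stats import norm

# Expected fraction of observations below the lower detection limit of 9.
print(norm.cdf(9, loc=10, scale=1))   # ~0.1587, i.e., about 16%
```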
One of the really nice things about this new feature is that it gets carried on to the capability platform. If you do your fit in Distribution and you launch capability, you're going to get more trustworthy capability results. Let's say that we're manufacturing a new drug, and we want to measure the amount of some impurity in the drug. Our data might look like what we have here. We have a bunch of small values, and we have a lower detection limit of 1 mg. These values of 1 that are highlighted here, we don't really think those are 1; we know that they're something less than 1. We also have an upper specification limit of 2.5 milligrams, so this is a situation where we have both spec limits and detection limits. It's really easy to specify those in the column properties. Here we've defined our upper spec limit as 2.5 and our lower detection limit as 1. Now all you have to do is give Distribution the column that you want to analyze, and it knows exactly how to handle that response variable.

Let's look at the capability results. Because we've properly handled that limit of detection, we trust that our lognormal fit is good. We see that our Ppk value here is 0.546. That's not very good; usually you would want a Ppk above 1. This is telling us that our system is not very capable, and we've got some problems that we might need to sort out.

Once again, what would have happened if we had ignored that limit of detection and had just treated all those 1s as if they truly were 1s? Well, let's take a look. We do our fit ignoring the limit of detection, and we get a Ppk of above 1. Based on this fit, we would say that we actually have a decently capable system, because a Ppk of 1 is not too bad; it might be an acceptable value. By ignoring that limit of detection, we've tricked ourselves into thinking our system is more capable than it really is.

I think this is a cool example, because we have a lower detection limit, which may lead you to believe that ignoring the limit of detection would be conservative, since we're overestimating the location parameter. That's true: when we ignore the limit of detection, we're overestimating that location parameter. But the problem is that we're grossly underestimating the scale parameter, and that's what makes us make bad decisions out in the tail of that distribution. By ignoring that limit of detection, we've really gotten ourselves into a bad situation.

Just to summarize, it's really important to recognize when our data feature a limit of detection. It's easy to think of data sets where maybe we've analyzed the response as is in the past, when really we should have adjusted for a limit of detection. Because, like we just saw, when we ignore those limits, we get misleading fits, and then we may make bad decisions based on those misleading fits. Like we saw in our example, we got bad estimates of the location and scale parameters, and our Ppk estimate was almost double what it should have been. But what we're excited about in JMP 17 is that the Distribution platform makes it really easy to avoid these pitfalls and to analyze this kind of data properly. All you have to do is specify that detection limit column property, and Distribution knows exactly what to do with that.
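Going back to the capability comparison for a moment: with only an upper spec limit, Ppk for a normal fit reduces to (USL - mu) / (3 * sigma); the lognormal fit in this example uses a percentile-based analog of the same idea. The numbers below are made up purely to illustrate the mechanism and are not the estimates from the talk: underestimating the scale roughly doubles the apparent capability.

```python
def ppk_upper(mu, sigma, usl=2.5):
    """Ppk with only an upper spec limit, normal case: (USL - mu) / (3 * sigma)."""
    return (usl - mu) / (3 * sigma)

# Hypothetical parameter estimates: halving sigma roughly doubles the apparent Ppk,
# which is how ignoring a detection limit can make a poor process look capable.
print(ppk_upper(mu=1.2, sigma=0.8))   # ~0.54
print(ppk_upper(mu=1.3, sigma=0.4))   # 1.00
```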
Today we only looked at lower detection limits, but you can just as well have upper detection limits, and in fact, you can have both. Like I said, there are only six distributions that currently support censoring inside the Distribution platform, but those are also the six most important distributions for these kinds of data, so it really is a good selection of distributions. That's it. I just wanted to thank you for your time and encourage you to check out these enhancements to the Distribution platform. Thanks.
JMP 17 introduces the Easy DOE platform, providing both flexible and guided modes to users, aiding them in their design choices. In addition, Easy DOE allows for the DOE workflow from design through data collection and modelling. This presentation offers a preview of the new Easy DOE platform, including insights from a seven-year-old using the new platform on a classic DOE problem.

Hello. Thank you for joining us today. I'm Ryan Lekivetz, Manager of the DOE and Reliability team at JMP. And I'm Rory Lekivetz, rising second-grade student. Today, we're going to talk to you about Easy DOE. I posed the question: is it easy enough for a seven-year-old?

Now, a lot of you watching may not have heard of Easy DOE before. It's one of our new platforms in JMP 17, and it's in base JMP, so you don't need JMP Pro to use it. The idea with Easy DOE is that we're going to have one file, one workflow, that contains both the design and the analysis. If you're familiar with doing designed experiments in JMP, you're used to going under the DOE menu, creating a designed experiment, and then making a data table from there, and then you would do all of your analysis on that data table. There was that separation between the design and the analysis. The idea with Easy DOE is that we're trying to aid novice users through the entire workflow, and so, unlike Custom Design, you're going to see a lot more hints and different defaults that are set to try to aid those users, on both the design and the analysis side. In addition, you'll also see there is a flexible mode for those who are more comfortable with DOE.

That said, my purpose in this talk was really to see: is Easy DOE going to be easy enough to use? Well, I said seven-year-old, but I guess you're more like seven and a half now, is that right? The idea here was to let Rory do the steering throughout. I wanted her to be the one using Easy DOE, putting everything in. I wanted to have as little input as possible, even when it came to decisions about what to do with the design and things like that.

Of course, I still did need to give her an introduction to DOE. If you've seen our DOE documentation, this figure might look familiar. We went through the different phases. We talked about the difference between a goal, a response, and the factors. Specify: that's where we talked about what's going on with the model, and in particular, what's the difference between a main effect and an interaction, and why would you care about one or the other? And we said, well, once we have the design, we actually have to go about collecting some data. Then we have to fit, then we have to do something with the model. We want to find out what's important and how we can use that model to do something further. You can imagine we spent maybe about half an hour just talking about some of those different things.

I said, well, why don't we use some classic experiments? My suggestions were: why don't we try the old paper helicopter experiment? Or, on our shelf, we have the statapult, the classic catapult experiment.
And so what did Rory say? No. Rory's idea was actually to do paper airplanes. She'd started to try out flying some paper airplanes, and she said, well, I want to do a DOE with paper airplanes. I said, well, that's great, let's see what we can do. Luckily, she knew the classic paper airplane, which, as you'll see, is called a dart. But we found a website that had instructions for different types of paper airplanes you could make, and on top of that, it had suggestions for what you could do to try to make your plane fly better. So thankfully, instead of her having to try to figure out what her factors and levels were going to be, this website had some really nice suggestions for that.

Before we get into this: Easy DOE, as I said, is that new platform, and you're just going to find it under the DOE menu. Underneath Custom and Augment Design, there's Easy DOE right there. Now, what you'll see with Easy DOE is that the idea of going through that workflow is done via tabs. As soon as you launch Easy DOE, you'll see there's that guided and flexible mode; we're just talking about guided mode here. But the idea is that we're going to go through these different steps by clicking on the tabs. There's the Define, and then we're going to go to the Model, Design, et cetera. One way to do that is to click on the tabs one at a time, and at the bottom of Easy DOE, you'll also find a set of navigation controls that will take you forward and backward between the different tabs.

And so, for our talk here, we're going to go through these different tabs, and both of us are going to give observations. Rory is going to give her thoughts on each tab first, followed by my own. Think of mine as more of a teacher's point of view, and Rory's as more of a novice user trying Easy DOE for the first time.

Let's start with the Define tab. On this one, we had to type in the different factors and levels that we were going to do for a paper airplane. None of my factors had numbers, so I chose categorical. The factors that I had were type, and the levels were dart and lock bottom. Then there was paper, and the levels were regular and construction. For throwing force, the levels were hard and light. And paper clip was paper clip and no paper clip. The response that we had was distance. For the distance, I wanted to see what could make the paper airplane go the farthest, and our goal in JMP was maximize.

Just to mention: when you go into Easy DOE right now, if you take a look at that screenshot from before, we currently have three different factor types: continuous, discrete numeric, and categorical. I'll say I think she was able to identify that she needed categorical for all of those. Of course, she did need confirmation; when she said categorical, she looked back to me to see, "Am I doing that correctly?"
She was able to identify the factor types and actually enter her factor names and her levels. Now, I will admit, she's used to using a touch screen, and so there did come a point where, instead of her trying to click into the little box for the levels or the factor name and do a double click, I told her she could just use the tab button to make her life easier. Again, I didn't want to have a whole lot of input. She picked the four factors and levels, and I think I had told her that three or four was a good number. But I will say, if you think back to that paper clip factor, she had mentioned it was paper clip or no paper clip, and that one took a little bit of time. I think maybe if we had gone with paper clip, yes or no, it might have been easier. For her, the question was: how do we define a paper clip factor that also has a paper clip level? But once she had the idea that it was just a yes or no, it's in there or not, we were able to get through it okay.

Moving on from the Define tab: the Model. For the Model tab, we picked the one which we thought would be the best. We decided that main effects and two-factor interactions were the ones that we wanted, so we picked the one that had main effects and two-factor interactions. The number of runs meant we had to make 16 airplanes, which didn't seem too bad.

Now, I'll say in hindsight, for my own view on the Model tab, this required the most hand-holding of all the tabs, and that was really more because of trying to explain the difference between going with just main effects versus interactions. But again, this is a seven-and-a-half-year-old who's never taken any statistics course and has never done any kind of modeling before. If you have somebody who's familiar with the idea of main effects and interactions, then I think that tab wouldn't be nearly as bad. I will say, too, it was nice that the paper airplane website we were using spelled out the idea of what interactions really mean with those factors. You would see things where it said: some of the airplanes will work better with a paper clip, or some airplanes are better when you throw them hard, while others need that lighter touch. In some sense, that actually gave us a natural point to start talking about interactions. I'll also say, from my own perspective, it's something that we can improve upon. For us, I want to think about, well, how can we help distinguish between these choices? We did have the hints in there as well. But when I think of it, if I wasn't there to do the hand-holding, how might she have come up with that decision?

When we were done with that Model tab, the next tab that we clicked on was... Design. There's not really a lot I can say about this one, but I thought it was interesting that it was showing what kind of paper airplanes you were going to make.
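For reference, four two-level factors fit exactly into a 16-run full factorial, which supports every main effect and every two-factor interaction. Easy DOE generates an appropriate design for you; the sketch below simply enumerates such a design by hand (in Python, using the factor names from this experiment) to show what those 16 runs look like.

```python
from itertools import product
import pandas as pd

factors = {
    "Type":           ["Dart", "Lock bottom"],
    "Paper":          ["Regular", "Construction"],
    "Throwing force": ["Hard", "Light"],
    "Paper clip":     ["Paper clip", "No paper clip"],
}

# A 2^4 full factorial: 16 runs, enough to estimate all 4 main effects
# and all 6 two-factor interactions.
design = pd.DataFrame(list(product(*factors.values())), columns=list(factors.keys()))
design = design.sample(frac=1, random_state=7).reset_index(drop=True)  # randomize run order
print(design)
```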
And then we had the hard work of making planes and flying them. Yeah, I think that probably took the most time. I don't have too much to add on this one. I think, though, it was nice to have that design displayed, just to give her that sense of, well, what did it really mean when we put in those 16 runs? When we have those factors, what does that mean at the end of the day? I think from this, she could really get that sense: okay, then my plane one, it's the lock bottom with regular paper, I'm going to throw it lightly and put a paper clip on it.

What was the next tab that we went to? Data Entry. So for the Data Entry, we didn't really have to do that much in this one. We just put in the distance that the different kinds of airplanes flew. That was what we measured with our measuring tape and stuff. Is that right?

Now, I'll say with this one, if you look, there were these factor plots. This is just for the main effects here, but this actually told her something as soon as she was done. She saw those factor plots at the bottom, and she said the lock bottom is not the best. I think she might have used something stronger than that, but yeah, the lock bottom was not good for her. Also, this was a good teaching moment about randomization. The run order does come out randomized. I did have to warn her, when we had all the paper airplanes with us outside: you don't want to throw all the dart type first, followed by the lock bottom, because you might get better as you keep throwing, or it might start to storm, it might get windier.

Now, I'll say, with some of these results, there were times where the hard throwing force needed a bit of practice. I'll admit there were a few that were probably based on more than one throw, because there were some almost immediate crash landings. But I'll say it did seem straightforward for her to be able to enter the data right from there. I didn't have to say anything; she knew. When you come into it, the response column just had missing values, and so she had the intuition: well, this is where I need to go to put in the data.

I'll also mention here, you'll see this Export Data and Load Response. Export Data is for when you actually just want to create a data table with all your stuff. Sometimes that will be useful if you want to go through what you're typically thinking of with your JMP workflow; if you just want a JMP data table, that's what that Export Data button is going to do. Likewise, Load Response: if you've actually recorded your responses in a different data table, you can load them from there.

Now that we had our data, what did we have to do? What was our next step? Analyze. For the Analyze, I already knew that the dotted ones weren't the most important ones. I figured out I just needed to click them to get rid of them. I found out dart type was one of the most important ones.
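The term removal Rory just described can also be expressed as an ordinary least squares fit of distance on the main effects and two-factor interactions, pruning terms whose intervals cover zero. The sketch below assumes the design has been exported to a CSV with the measured distances filled in; the file name and the (shortened) column names are hypothetical, and Easy DOE's own fit is what was actually used in the talk.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the 16-run design with the measured distances filled in.
planes = pd.read_csv("paper_airplane_runs.csv")

# Full model: all main effects plus all two-factor interactions,
# mirroring the choice made on the Model tab.
full = smf.ols("Distance ~ (Type + Paper + Force + Clip)**2", data=planes).fit()
print(full.summary())

# Dropping terms whose intervals cover zero (as Rory did by clicking them out)
# leaves a simpler reduced model, for example:
reduced = smf.ols("Distance ~ Type + Paper + Clip", data=planes).fit()
print(reduced.summary())
```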
Now, I'll say here, the Analyze tab was the one I had thought was going to be the hardest to explain, but it was surprisingly easy and effective. It was almost as if she had clicked on the tab and just started doing her own thing, and I didn't really need to say an awful lot. You'll notice here, when you come into that Analyze tab, when Rory saw it, it was the full model, but she saw a lot of the terms that were not significant. There were a lot that were dashed and close to that zero, and so what she did was just remove those ones one at a time.

There is also a Best Model button that's based on some type of forward selection. Now, I'll say the best model may actually have more terms, but at the end, you can see those extra terms really weren't even significant, so I couldn't really argue with the results. Perhaps one might even argue that the model that Rory picked was better, because it was simpler and there wasn't a huge difference between the two. One of the nice things with this Easy DOE platform, in the Analyze tab for guided users, is this idea of adding and removing terms easily. To add and remove terms, all you do is go to those confidence intervals, and a click will either add a term or remove it, depending on whether it's currently in the model. I highly recommend trying that out when you get your hands on 17. Perhaps there was something to be said for the best model: as we'll see in a minute with the profiler, it makes things a little more interesting if you have some additional terms in there. With Rory's model, with only the three terms, the profiler was less interesting for trying to explain things. But again, for a first-time DOE, a model like that didn't disappoint me at all.

Now that we had a model, what do we actually do with that? What was our next step? Predict. The Predict tab: this tab tells you what the best paper airplane was. If you click on the levels that you think are the worst, it shows what might happen with that airplane.

Likewise with the Model tab, if you have a user who's used the profiler before in something like Fit Model, of course, that'll be a lot easier. But I'll also say the profiler is intuitive in general. I think it was easy for her to pick up on once she started playing around with it, and then there was just a little bit of discussion as to what it actually meant. That's the nice thing with Easy DOE: the profiler is already intuitive. It was also a good teaching moment when it came to interactions. If you look at that model, we actually had an interaction between the type and the type of paper. Paper and type had an interaction, and so then we could talk about, well, what happens when we change type, and then what happens with that paper? Again, this is where those extra model terms come in. In Rory's model, hard and light had a zero effect; I was saying it didn't really matter what you did for hard or light. In the best model, it had a small effect, so you could say, well, it's not going to make a big difference. But for some of these terms, it was saying that it did.
But again, I think with the Predict tab, she seemed to do well. The optimize did need some explanation, I think just because that's a new word in her vocabulary. But I think she had the sense that, okay, these were the settings that were going to be the best.

I think that was what we had to talk to you about today. I did just have a few final questions for you, if that's okay. What was your favorite part of the experiment? Flying the airplanes. What was your least favorite part? Getting hot flying the airplanes. The air was about 100 degrees Fahrenheit when we were flying the airplanes, but we didn't have a lot of choice; we had a lot of storms coming up on us, didn't we? If you were to tell somebody, what was the most important factor? What's the most important thing if you wanted to make the paper airplane? Maybe plane type, like with a dart. Okay. Was there anything else that was important? Nothing you can really think of. Okay. I think we found the construction paper helped a little bit, and I think it did actually say no paper clip. I think you were saying that might be because of the weight. But I'll say, yeah. This time you had the paper clip at the back part of the wing, and so we thought about maybe putting it at the nose if we were to try that again.

But let's say, what do you want to do for your next experiment? Should we do another one? What would you want to do? Statapult. A statapult? Yeah, that does look fun. And now, I didn't ask you to answer this in any way, did I? But was Easy DOE easy to use? Yes, it was. Okay, I'm glad to hear it. I think that's everything that we have left today. So thank you for your time, and please post any questions in the community forums below if you'd like to ask either of us anything. Thank you.
Interactive HTML was introduced in JMP 11, and each succeeding version adds support for more interactive features and improves support for JMP Live. When used alone, Interactive HTML enables users to share JMP reports. JMP Live supports that sharing with collaboration, organization, security, automation, and significantly more interactivity. This paper explains the new features in JMP 17 Interactive HTML, both when it's used alone and as part of JMP Live.

The journal shown in the presentation contains Example buttons that link to Interactive HTML files hosted on an internal JMP Live server. I've attached a journal formatted similarly to the one shown in the video, but rather than launch files from a JMP Live server, the Example buttons produce JMP reports that can be exported to Interactive HTML or published to JMP Live or JMP Public (when it is upgraded to JMP 17) from JMP 17 to see the improvements in Interactive HTML yourself.

Hi, welcome to this talk. Interactive HTML is a format for sharing JMP reports and dashboards on the web with some of JMP's signature interactivity. It's also the format used in JMP Live and JMP Public. My name is John Powell, and I'm the Software Development Manager for the team that puts this feature together for JMP. With every release of JMP, we improve and add more of this capability. In this talk, I'm going to show you what's new and improved in Interactive HTML for JMP 17. Now, I'm not going to cover everything that we've done, just the highlights. I've organized this into three categories: new functionality, improved functionality, and improved appearance. What you're going to see are examples of this on JMP Live. Let's get started.

I'm going to move this out of the way, sorry about that, because I'm going to be bringing up the browser. Starting with new functionality, we've got packed bars, which is a feature that's been in JMP for a couple of releases now. I've got an example; I just need to bring my browser window over here. Here's my packed bars example in JMP Live. As you can see, it looks like it would in JMP. It's got labels, and it's got the bars where the most important ones are in blue and all the less important ones, or smaller data, are in gray. It looks like it would in JMP. It supports a little bit of interactivity, basically these tool tips that you'll see for each bar, and even for the gray items as well. Then you also have some interactivity with selections, and it works with the local data filter. Right now, I'm looking at the commodity of corn. If I want to look at soybean, I can click on that, and the graph updates. That's it for packed bars.

Parallel plots [support in Interactive HTML] is brand new in JMP 17. Here's an example using the Iris data set. With a parallel plot, you can drop in different types of variables. In this case, we're looking at the dimensions of the petals and the sepal length and width. They're all continuous data, and continuous data is drawn as curves. We've made this available in JMP Live now and in Interactive HTML, and it supports selection and tool tips.
There's a tool tip for one of the lines, and this is continuous. One thing about this particular example is that it has a common axis: all the variables are around the same range, so they share an axis.

The next example uses categorical variables. Here's the Titanic passenger data set, and I've got passenger class, sex, and survived. When you use categorical variables, they display very differently. You see these curved sections, and they're also selectable. You'll notice a little interactivity with the legend; I'll talk about that a little bit more later. Now, we also have tool tips in this one: passenger class, sex is male. On this side, you would get sex is female, and survived, yes. The next example is actually mixed, with categorical and continuous variables. It's the same data set, Titanic passengers, but I've used some continuous variables on the left here and some categorical variables on the right. They also support selection and tool tips on both sides.

Next up in new functionality is categorical response profilers. The difference between a categorical response profiler and a continuous response profiler is that you'll see multiple curves. Here's an example that's interactive now; this would have been static in JMP 16. You see these curves are interacting, and there are tool tips as well. These appear in many different platforms. That was the... sorry, this is, yes, the ordinal logistic. The next one is Generalized Regression. Each platform needed some work from my team in order to get them interactive. This one shows down at the bottom, and it's the same thing: it has categorical responses, and it shows the interactivity and tool tips as well. I'm not going to do all of them. Here's one more, Partition. [The profiler in] Partition also is down below. Here's a categorical profiler; you'll see it responds as you move one of the variables, and all the other variables respond too. Now, the next one actually isn't a categorical response profiler, but I threw it in here because it's a new platform in JMP 17, Naive Bayes, and it has a profiler in it. It has also been made interactive in Interactive HTML and on JMP Live. The other two that we support are Bootstrap Forest and Boosted Trees. I'll leave those for you to try when you get JMP 17.

This, we hope, is going to be an interesting feature for people: we can resize graphs now in JMP Live and JMP 17. One of the first things you might do in JMP, if somebody sends you a report and it's not big enough, is go to the [bottom right] corner and drag it to resize it. We now support that on the web as well. That's a distribution example. Some examples are a little tricky, like this one here, where I'm bringing up a profiler. What makes that tricky is that there are multiple graphs that all resize together. Of course, they stay interactive when they're larger as well. The scatter plot matrix is similar, in that there are multiple graphs. I don't know if I mentioned it, but you can also drag on the side to make them bigger. There you have it.
Of  course,  these  are  still  interactive  with  tool  tips  and  so  on. Now,  since  we  did  that, we  felt  it  was  necessary  to  also  make  it  possible  to  resize  panels  and  dashboards. Imagine if  I  took  this  graph  here  and  I  expanded  it  wider. Now,  it's  getting  pretty  close  to  the  edge. If  I  went  and  did  this  one, now  we're  actually  cutting  off  a  little  bit  of  the  graph. We  can  actually  resize  the  panel  to  give  more  room  to  that  one. We  can  even  shrink  these if  we  didn't  want  to  show  them  at  the  moment, or  to  just  balance  between  the  two  panels. That's  it  for  new  functionality. I'm going to  move  on to  improved  functionality. I'll just  take  a  drink  here. All  right. Now,  the  interactive  legend,  I  did  show  it  before, but  I  wanted  to  show  that  it's  happening  in more  than  just  one  graph. It's  actually  all  graphs. I  have  an  example  here. This  is  back  with  the  Titanic  passengers  database  again. To  show  you  the  interactivity, basically,  if  you  click  on  the  legend  in  JMP  17, you  will  see  the  selection  in  the  graph  as  well. That's  one  part  of  this  interactivity. The  other  part  is  that  if  you  click  in  the  graph to  make  selections, you'll  see  that  the  legend  highlights  as  well. That's  behavior that  you  would  see  in  JMP. We  tried  to  get  that  available  on  the  web  as  well. Local  data  filter  has  been  in  Interactive  HTML  for  a  while, but  there  are  a  few  additions  to  it,  modifications and  updates. Here's  an  example  with  the  diamonds  data set. As  you  can  see,  it's  got  a  pretty  busy- looking  local  data  filter  here. You can  do  lots  with  it. I'll  stick  with  this  example  for  a  while. Imagine if  you  wanted  to  just  limit  what  you  see  in  terms  of  the  price. I'm  looking  at  the  $4,900- $10,000  diamonds. Then  you  go,  "Well,  maybe  I  don't  really  want  to  do  that. I  want  to  do  the  inverse." Now,  we  have  this  inverse,  that wasn't  there in  JMP  16, that  when  you  click  on  it,  the  graph will  update. Nothing  updates  in  the  local  data  filter  itself. We  used  to  have  a  feature  in  this  menu  up  here  that  would  do  that. It  would  invert  all  the  settings. But  JMP  didn't  have  that. We  prefer  to  stick  to  a  model  that's  closer  to  what  JMP  has and  behaves  the  way  JMP  does  as  well. Another  thing  that  you  might  see with  a  big  local  data  filter  like  this  with  lots  of  options, every  time  I  click  on  the  setting  here,  it  updates,  right? But  what  if  you  want  to  make  a  lot  of  changes? You  probably  don't  want  it  updating  every  single  change that  you  make  in  the  local  data  filter. Now,  we've  added  this  auto- update  feature. Now,  if  we  disable  it,  as  you  add  more  settings,  nothing  changes. That  gives  you  a  chance  to  make  lots  of  changes. Maybe  I'll  even  change  something  down  here  as  well. Let's  just  choose  Excellent  and  these  V  settings, and  I'll  leave  these  sliders  where  they  were. Now,  I'm  ready  to  update. I  hit  the  update  button,  and  now  we've  got  an  updated  graph. You  may  have  noticed,  or  maybe  you  didn't, that  there's  a  bunch  of  information  being  added  to  this  URL every  time  I  change  the  setting  or  I  updated. 
Now, the purpose of that is so that people can share these settings. If you select all of this (and there are multiple ways to select; you might try a double click), then you can copy it or use Control-C, and you can put that in an email and send it to somebody. When they see this graph on JMP Live, they're going to get the settings that you had, not the original settings that you published it with. That's really going to be useful. Another thing about that is that the saving of settings is also done for column switchers now too. This example has a column switcher and a local data filter. Of course, if I change to lead, you would see those settings get updated at the top as well.

What's interesting about that is you can grab this URL, and I like to store it in the comments, and I've annotated what I stored here, saying what the settings are going to do. This top link is with the column switcher set to the pollutant carbon monoxide and the local data filter regions set to California; I believe west is what the W stands for. When I click on that, or if I send that to somebody and they click on it, it loads the post with this pollution map, and then it uses those settings and updates it right away. Another thing you could do with that URL is embed it in a journal, like I've done here. If you look at this side here, I have a link with stored settings, with lead for the column switcher; the regions are South Texas, which is different from what you see right now. When I click on that URL, or that link in my journal, it will do the same: it loads up this posting, and then it applies the settings. Isn't that convenient?

All right. Here's a feature that we added at publish time: you can actually choose whether you want to have interactivity or performance. The reason for that is that it takes a lot of data stored in the file to be able to provide the interactivity, and sometimes that makes loading slower. Or if you have a really big data set and you don't really want to load all the data, but you still want to have it interactive, well, now you can do that with the performance mode. All the examples I showed so far had interactivity mode on, so that I could show all the interactivity. This example I published is the same one that you just saw, but the graph is not interactive; it's a static image of the graph. But you still get to interact with that graph by using the column switcher and local data filter. This is because JMP Live rebuilds the graph. That's it for that section. I hope you're going to like those improved functionality features.

The final section is things that affect the appearance. We hadn't really paid attention to your font size settings in JMP when you exported in the past, because we really wanted to have uniform font sizes across our web offerings. But now we wanted to support this. Basically, this is everywhere in JMP that you can change font sizes.
It was a big effort, basically, to go through, find them all, and then make sure that they came out as you set them before you published. I've got a few examples, but definitely not all the places where you can set fonts. Here's an example. You may not want to do this, but I did it in a way that you could see what the font differences are. In this case, I increased the carat weight title, but I did not increase the labels for the carat weight. Down on the X axis, I increased the axis labels' font size and kept the title at the original size. Normally, you'd probably do a little bit more, but this is just for demonstration, so you could see the difference between the regular size and the increased size.

Here's another example where font size shows up: labels in graphs, like in heat maps. In heat maps, there's a label that you can apply, and this one has been increased as well. You can tell it's much larger than the other fonts in this graph. I've also increased the size of the title and labels in the legend in this example. Next up, I've got a tree map that also has labels. This is back with the diamonds data set again. With this tree map, we did a little bit more than just font size. You can see that the labels are made larger, not in the legend this time, but the group labels: in JMP 17, they've added the ability to set the background color and the color of the text as well. We felt that we would want to support that too, to keep up with that customization ability. Of course, there are many other places where fonts can be customized in JMP. You can discover that as you get JMP 17 and start customizing font sizes, and then see them respond in your published files or in files shared by exporting to Interactive HTML.

Here's a Tabulate example. There's a new feature in JMP that allows you to combine columns like cut and clarity. This is also the diamonds data set. When you use the stack group columns feature, the items in the cells are indented for the secondary variable. In this example, I've also increased the font size of the titles of these columns in Tabulate, so it's easier to see, and it shows yet another place where this can be customized.

We did some other work supporting the customization of tables, and this shows up in many different ways. The first one that I'm going to show you is color-coded cells. In JMP 16, even if color coding was used, we didn't carry that through to the web. You would have lost the different colors here in the column that have meaning in this process screening example. Another place where color coding shows up in cell background colors is when you do Text Explorer. As you see in this graph down here, purple and orange are being used for positive and negative sentiment. They are now supported in the table as well, not just in the graph. Of course, font size can also be updated; I mentioned that we supported customizing font size.
If that's been done in a table like this one, where the cell fonts have been increased in size, that is now respected. We used to emphasize small p-values, when small enough to flag, with just a bold font. But what JMP does is use this color, and the numbers now support that color, so we're looking more like JMP. This final table example shows not cell-color coding, but the actual text in the cells being color coded for the correlation. The high correlation, a positive-one correlation here, happens to be blue, and here's a negative correlation that's red. I'm going to use this example for the next feature, which is, we had themes before, but we've added the dark theme. The dark theme is good to show with this example, and I'll show you how you do that in JMP Live. Let me just switch to dark mode this way. Now, you can see those colors pop a little bit more with a black background. I think people will like this, maybe all the time. Actually, if you like it all the time, you can set that as a permanent preference in JMP Live, or set it as a preference when you export. Here, I'm going to open up just an image of a JMP dialog, which is the preferences dialog. If you look on the Styles page of the preferences dialog, down at the bottom, there's Interactive HTML Themes. It used to have light and gray. Now, we have this dark theme as well. Then, when you publish or export to Interactive HTML, that theme will be used. Last but not least, we think this will be really important. If you zoomed in on graphs exported to Interactive HTML from JMP 16, they would start to get blurry. I want to talk about two different types of zooming in here. In 16, we did have this magnifier. Let's say I'm zooming in this way. That hasn't changed. But what's changed is that this graph here is at a higher resolution than it used to be. If you use your browser zoom, which is usually Control Plus, and zoom in this way, this is when things would start to get blurry. But now that we've used double resolution, you can go pretty high without seeing any blurriness. We think that will be a good feature. You don't have to turn it on. It's just the default that it's going to be a higher resolution, making all your graphs look a little bit better. That's it for the things I have to show. Like I said, these are just the highlights. There are so many other things that my team did, and I'm really proud of the work that they did. We had a lot of contributors, some of them part time, some of them full time. Here are the names of the contributors. We had Josh Markwardt, Bryan Fricke, Mayowa Sogbein, Praveena Panineerkandy, Paul Spychala, and Tommy Odom. Of course, we'd like to thank all the JMP developers themselves, who work on the desktop product and help us figure out how to share these creations on the web. We get a lot of help from them. Of course, we also get a lot of help from our quality assurance team.
I'd like to thank them as well, and anybody else on the JMP team that helps us make our contribution look good. Thanks also to the JMP Live team, who host our content when we publish to JMP Live. That's it for my presentation. Thank you for your interest in this topic, and I'll see you at the conference.
Extracting and combining data from multiple, disconnected databases was one of the biggest challenges encountered at Samson Rope Technologies before our investment in JMP Statistical Software. With the incredibly useful JMP Query Builder, this challenge has been economically addressed.   This presentation describes the way Samson uses the JMP Query Builder to easily extract desired records and fields from a single database (leaving behind the fields that are not of interest) and, more importantly, extract desired records and fields from multiple databases and combine them into a single “master table” file with all required data in one place, ready for rigorous statistical analysis, typically by an R&D Engineer, Quality Engineer or Manufacturing Engineer working on a continuous improvement project.   One of the best JMP features is that JMP creates scripts behind the scenes and if the scripts are properly managed, there is no need to spend time repeating the previous steps to create or update the “master table”. And, the best part of the JMP Query Builder is that we can create complex JMP scripts just by simple “copy and paste” and combination of the scripts JMP creates for each previous action. No background in scripting or coding is needed to create a JMP script!   This poster will be highly graphical in nature, drawing the attention of the viewer to the practical, useful benefits of the JMP Query Builder. Any organization with multiple databases containing important information should benefit from this poster.       Hello. My  name  is  Canh  Khong. I  work  at  Samson  Rope  Technology in  Ferndale,  Washington  State. Today,  I  am  talking  about  how  to  build  a Query Builder  using  JMP  software. Let's  start  with  a  working  table   that  has a few columns such  as  date,  operator i nitial,  job  number,  actual  measurements. The  working  table  is  just for  manual  data  collection. To  analyze,  we  may  need  to  add more information  from  other  tables. How  can  I  do  that? The  answer  is  using  a Query  Builder. Query  Builder  can  extract desired columns from  other  tables and join these columns to  the  working table that  I  call  the  Master  Table. The  Master  Table  has  all   of  the  information  needed  for  analysis. Here  are  a  few  steps  to build a Query Builder to  join  the  working  table   to  another  table. These  steps  are  as  simple   as  point,  click,   drag and drop. Now,  the  Master  Table   is  ready  for  data  analysis. However,  if  needed,  a  new  column with  a  formula  can  be  manually  created adding  to  the  Master  Table, and  JMP  will  automatically  create the  corresponding  script  of  a  new  column. Copy  and  paste  the  script to  the  Post Query Script. If  this  is  done,  there  is  no  need to  spend  time  to  recreate  the  new  column. The  new  column  will  be  updated  and  shown every  time  the  Query  Builder  runs. If  desired,  the  report can  be  manually  created. And  JMP will  create   the  corresponding  script. Copy  and  paste the  script to  the  Post  Query  Script. If  this  is  done,  there's  no  need to  spend  time  to  recreate  the  report. The  report  will  be  automatically  updated and  shown  every  time  the  Query  Builder  runs. If  desired,  a  chart  can  be  manually  created and JMP will  create   the  corresponding  script. Copy  and  paste the  script to  the  Post  Query  Script. If  this  is  done,  there  is  no  need to  spend  time  to  recreate  the  chart. 
The  chart  will  be   automatically  updated and  shown  every  time the  Query  Builder  runs. One  click  to  run  the   Query Builder. This  will  produce the  updated  Master  Table, the  report,  and  the   chart as desired. Query Builder  Summary. Combine  tables. Select only desired columns  from tables. Create  columns  with  formulas   and  create  reports. Copy  and  paste  the  scripts  of  columns and  reports  into  Post  Query  Script. Scripting  experience  is  not  needed   to  build  a  Query  Builder. Creating  a  Query  Builder  is  as  simple   as " point,  click,  drag  and  drop". With  this  summary,   that  concludes  my  presentation. Thank  you.
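The workflow Canh describes is point and click in JMP's Query Builder, but the underlying idea, pulling only the desired columns from separate sources and joining them into a single master table with a recreated formula column, can be sketched roughly in Python with pandas. This is only an analogue for illustration, not JMP's implementation; the database file, table names, and column names below are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical connection; in practice this could be any database Query Builder points at.
con = sqlite3.connect("plant_data.db")

# Pull only the columns of interest from each table, leaving the rest behind.
working = pd.read_sql_query(
    "SELECT date, operator_initials, job_number, actual_measurement FROM working_table", con)
jobs = pd.read_sql_query(
    "SELECT job_number, product_type, spec_target FROM job_table", con)

# Join on the shared key to build the 'master table' used for analysis.
master = working.merge(jobs, on="job_number", how="left")

# Analogue of a post-query formula column: it is rebuilt every time the script runs,
# so there is no need to recreate it by hand.
master["deviation_from_spec"] = master["actual_measurement"] - master["spec_target"]

print(master.head())
```

Saving and rerunning a script like this plays the same role as JMP's Post Query Script: one run refreshes the master table and the derived column together.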
The famous Stanford mathematician, Sam Karlin, is quoted as stating that “The purpose of models is not to fit the data but to sharpen the question” (Kenett and Feldman, 2022). A related manifesto on the general role of models was published in Saltelli et al (2020). In this talk, we explore how different models are used to meet different goals. We consider several options available on the JMP Generalized Regression platform including ridge regression, lasso, and elastic nets. To make our point, we use two examples. A first example consists of data from 63 sensors collected in the testing of an industrial system (From Chapter 8 in Kenett and Zacks, 2021). The second example is from Amazon reviews of crocs sandals where text analytics is used to model review ratings (Amazon, 2022).   References   Amazon, 2022, https://www.amazon.com/s?k=Crocs&crid=2YYP09W4Z3EQ3&sprefix=crocs%2Caps%2C247&ref=nb_sb_noss_1   Feldman, M. and Kenett, R.S. (2022), Samuel Karlin, Wiley StatsRef, https://onlinelibrary.wiley.com/doi/10.1002/9781118445112.stat08377   Kenett, R.S. and Zacks, S. (2021) Modern Industrial Statistics: With Applications in R, MINITAB, and JMP, 3rd Edition, Wiley, https://www.wiley.com/en-gb/Modern+Industrial+Statistics%3A+With+Applications+in+R%2C+MINITAB%2C+and+JMP%2C+3rd+Edition-p-9781119714903   Saltelli, A. et al (2020) Five ways to ensure that models serve society: a manifesto, Nature, 582, 482-484, https://www.nature.com/articles/d41586-020-01812-9     Hello,  I'm  Ron  Kennett. This  is  a  joint  talk  with  Chris  Gotwalt. We're  going  to  talk   to  you  about  models. Models are  used  extensively. We  hope  to  bring   some  additional  perspective on  how  to  use  models  in  general. We  call  the  talk "Different  goals,  different  models: How  to  use  models  to  sharpen  your  questions." My  part  will  be  an  intro, and  I'll  give  an  example. You'll  have  access  to  the  data  I'm  using. Then  Chris  will  continue with a  more  complex  example, introduce  the  SHAP  values available  in  JMP  17, and  provide  some  conclusions. We  all  know  that  all  models  are  wrong, but  some  are  useful. Sam  Karlin  said  something  different. He  said  that  the  purpose  of  models is  not  to  fit  the  data, but  to  sharpen  the  question. Then  this  guy,  Pablo  Picasso, he  said  something in... He  died  in  1973, so  you  can  imagine  when  this  was  said. I  think  in  the  early  '70 s. "Computers  are  useless. They  can  only  give  you  answers." He  is  more  in  line  with  Karlin. My  take  on  this is that this presents  the key difference between  a  model  and  a  computer  program. I'm  looking  at  the  model from  a  statistician's  perspective. Dealing  with  Box's  famous  statement. "Yes,  some  are  useful.  Which  ones?" Please  help  us. What do  you  mean  by  some  are  useful? It's  not  very  useful  to  say  that . Going to Karlin, "Sharpen  the  question." Okay,  that's  a  good  idea. How  do  we  do  that? The  point  is  that  Box  seems  focused on the data analysis phase in the life cycle view of statistics, which starts with problem elicitation, moves  to  goal  formulation, data  collection,  data  analysis, findings,  operationalization  of  finding, communicational  findings, and  impact  assessment. Karlin is  more  focused on  the  problem  elicitation  phase. These  two  quotes  of  Box  and  Karlin refer  to  different  stages in  this  life  cycle. 
The data I'm going to use is an industrial data set with 174 observations. We have sensor data. We have 63 sensors. They are labeled 1, 2, 3, and so on up to 63. We have two response variables. These are coming from testing some systems. The status report is fail/pass. 52.8% of the systems that were tested failed. We have another report, which is a more detailed report on test results. When the systems fail, we have some classification of the failure. Test result is more detailed. Status is go/no go. The goal is to determine the system status from the sensor data so that we can maybe avoid the costs and delays of testing, and we can have some early predictions on the fate of the system. One approach we can take is to use a boosted tree. We put the status as the response and the 63 sensors as the X factors. The boosted tree is trained sequentially, one tree at a time. The other model we're going to use is a random forest, and that's done with independent trees. There is a sequential aspect in boosted trees that is different from random forests. The setup of boosted trees involves three parameters: the number of trees, the depth of trees, and the learning rate. This is what JMP gives as a default. Boosted trees can be used to solve most objective functions. We could use them for Poisson regression, which deals with counts and is a bit harder to achieve with random forests. We're going to focus on these two types of models. When we apply the boosted tree, we have a validation setup with 43 systems drawn randomly as the validation set. One hundred and thirty-one systems are used for the training set. We are getting a 9.3% misclassification rate. Three failed systems, which we know failed because we have it in the data, were actually classified as pass. Of the 20 that passed, 19 were classified as pass. The false predicted pass rate is 13%. We can look at the variable column contributions. We see that Sensors 56, 18, 11, and 61 are the top four in terms of contributing to this classification. We see that in the training set, we had zero misclassification. We might have some overfitting in this boosted tree application. If we look at the lift curve, for 40% of the systems we can get a lift over two, which is the performance this classifier gives us. If we try the bootstrap forest, another option, again, we do the same thing. We use the same validation set. The defaults of JMP give you some parameters for the number of trees and the number of features to be selected at each node. This is how the random forest works. You should be aware that this is not very good if you have categorical variables and missing data, which is not our case here. Now, the misclassification rate is 6.9%, lower than before. On the training set, we had some misclassification. The random forest applied to the test result, which is when we have the details on the failures, gives a 23.4% misclassification rate, so worse performance. Also, on the training set, we have 5% misclassification. But we now have a wider range of categories to predict, and that explains some of the lower performance.
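Ron's comparison uses JMP's Boosted Tree and Bootstrap Forest platforms. As a rough, hedged analogue, the same kind of train/validation comparison of a boosted tree and a random forest can be sketched with scikit-learn. The 43/131 split comes from the talk; the file name, column names, and tuning settings are assumptions for illustration, not the presenters' actual setup.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical file with 63 sensor columns and a pass/fail status column.
df = pd.read_csv("sensor_tests.csv")
X = df[[f"Sensor{i}" for i in range(1, 64)]]
y = df["Status"]  # "Pass" / "Fail"

# Hold out 43 of the 174 systems for validation, as in the talk.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=43, random_state=1, stratify=y)

models = {
    "boosted tree": GradientBoostingClassifier(n_estimators=50, max_depth=3, learning_rate=0.1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    miss = 1 - model.score(X_valid, y_valid)          # validation misclassification rate
    top = sorted(zip(X.columns, model.feature_importances_),
                 key=lambda t: -t[1])[:4]             # rough analogue of column contributions
    print(f"{name}: misclassification {miss:.3f}, top sensors {top}")
```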
In  the  lift  curve  on  the  test  results, we  actually,  with  quite  good  performance, can  pick  up  the  top  10 %  of  the  systems with  a  leverage  of  above  10. So  we  have  over a ten fold  increase for  10 %  of  the  systems relative  to  the  grand  average. Now  this  is  posing  a  question— remember  the  topic  of  the  talk— what  are  we  looking  at? Do  we  want  to  identify top  score  good  systems? The  random  forest   would  do  that  with  the  test  result. Or  do  we  want  to  predict a  high  proportion  of  pass? The  bootstrap tree  would  offer  that. A  secondary  question  is  to  look  at what  is  affecting  this  classification. We  can  look  at  the  column  contributions on  the  boosted  tree. Three  of  the  four top  variables  show  up  also  on  the  random  forest. If  we  use  the  status  pass/fail, or  the  detailed  results, there  is  a  lot  of  similarity on  the  importance  of  the  sensors. This  is  just  giving  you  some  background. Chris  is  going  to  follow  up  with  an  evaluation  of  the  sensitivity of  this  variable  importance, the  use  of  SHAP v alues and more interesting stuff. This  goes  back  to  questioning what  is  your  goal, and  how  is  the  model  helping  you  figure  out  the  goal and  maybe  sharpening  the  question that  comes  from  the  statement  of  the  goal. Chris,  it's  all yours. Thanks,  Ron. I'm  going  to  pick  up  from  where  Ron  left off and  seek  a  model  that  will  predict whether  or  not  a  unit  is  good  or  not, and  if  it  isn't,  what's  the  likely failure  mode  that  has  resulted? This  would  be  useful  in that  if  a  model is  good  at  predicting  good  units, we  may  not  have  to  subject  them to  much  further  testing. If  the  model  gives   a  predicted  failure  mode, we're  able  to  get  a  head  start  on  diagnosing  and  fixing  the  problem, and  possibly,  we  may  be  able  to  get  some  hints on  how  to  improve  the  production  process  in  the  future. I'm  going  to  go  through  the  sequence of  how  I  approached  answering  this  question  from  the  data. I  want  to  say  at  the  outset that  this  is  simply  the  path  that  I  took as  I  asked  questions  of  the  data and  acted  on  various  patterns  that  I  saw. There  are  literally  many  other  ways that  one  could  proceed  with  this. There's  often  not  really   a  truly  correct  answer, just  a  criterion  for  whether  or  not  the  model  is  good  enough, and  the  amount  that  you're  able  to  get  done in  the  time  that  you have  to  get  a  result  back. I  have  no  doubt   that  there  are  better  models  out  there than  what  I  came  up  with  here. Our  goal  is  to  show  an  actual  process of  tackling  a  prediction  problem, illustrating  how  one  can  move  forward by  iterating  through  cycles of  modeling  and  visualization, followed  by  observing  the  results   and  using  them  to  ask  another  question until  we  find  something  of  an  answer. I  will  be  using  JMP   as  a  statistical  Swiss  army  knife, using  many  tools  in  JMP and  following  the  intuition   I  have  about  modeling  data that  has  built  up  over  many  years. First,  let's  just  take  a  look at  the  frequencies   of  the  various  test  result  categories. We  see  that  the  largest and  most  frequent  category  is  Good. We'll  probably  have  the  most  luck being  able  to  predict  that  category. 
On  the  other  hand,   the  SOS  category  has  only  two  events so  it's  going  to  be  very  difficult  for  us to  be  able  to  do  much  with  that  category. We  may  have  to  set  that  one  aside. We'll  see  about  that. Velocity II,  IMP,  and  Brake are  all  fairly  rare  with  five  or  six  events  each. There  may  be  some  limitations  in  what we're  able  to  do  with  them  as  well. I  say  this  because   we  have  174  observations and  we  have  63  predictors. So  we  have  a  lot  of  predictors  for  a  very  small  number  of  observations, which  is  actually  even  smaller   when  you  consider  the  frequencies of  some  of  the  categories that  we're  trying  to  predict. We're  going  to  have  to  work  iteratively by  doing  visualizations  in  modeling, recognizing  patterns,  asking  questions, and  then  acting  on  those with  another  modeling  step  iteratively in  order  to  find  a  model that's  going  to  do  a  good  job of  predicting  these  response  categories. I  have  the  data  sorted by  test  results, so  that  the  good  results are  at  the  beginning, followed  by  each  of  the  different  failure  modes d ata a fter  that. I  went  ahead  and  colored  each  of  the  rows by  test  results  so  that  we  can  see which  observation  belongs to  a  particular  response  category. So   then  I  went  into  the  model- driven multivariate  control  chart and  I  brought  in  all  of  the  sensors as  process  variables. Since  I  had  the  good  test  results at  the  beginning, I  labeled  those as  historical  observations. This   gives us  a  T²  chart. It's  chosen  13  principal  components as  its  basis. What  we  see  here is  that  the  chart  is  dominated  by  these  two  points  right  here and  all  of  the  other  points are  very  small  in  value relative  to  those  two. Those two points happen  to  be  the SOS  points. They  are  very  serious  outliers in  the  sensor  readings. Since  we  also  only  have   two  observations  of  those, I'm  going  to  go  ahead and  take  those  out  of  the  data set and  say,  well, SOS is  obviously  so  bad  that  the  sensors should  be  just  flying  off  the  charts. If  we  encounter  it,  we're  just  going  to  go  ahead and  try  to  concern  ourselves with  the  other  values that  don't  have  this  off- the- charts  behavior. Switching  to  a  log  scale, we  see  that  the  good  test  results are  fairly  well -behaved. Then  there's  definite  signals in  the  data   for  the  different  failure  modes. Now  we  can  drill  down  a  little  bit  deeper, taking  a  look  at  the  contribution  plots for  the  historical  data, the  good  test  result  data, and  the  failure  modes to  see  if  any  patterns  emerge  in  the  sensors  that  we  can  act  upon. I'm  going  to  remove   those  two SOS  observations and  select  the  good  units. If  I   right-click  in  the  plot, I  can  bring  up  a  contribution  plot for  the  good  units, and  then  I  can  go  over  to  the  units where  there  was  a  failure, and  I  can  do  the  same  thing, and  we'll  be  able  to  compare the  contribution  plots  side  by  side. So  what  we  see  here   are  the  contribution  plots for  the  pass  units  and  the  fail  units. The  contribution  plots are  the  amount  that  each  column is  contributing  to  the  T ² for  a  particular  row. Each of  the  bars  there  correspond to  an  individual  sensor  for  that  row. 
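The model-driven multivariate control chart is a JMP Pro platform, but the core calculation behind the T² chart (which the contribution plots discussed next decompose column by column) is a Hotelling T² on principal component scores estimated from the historical, good rows. A simplified sketch of that calculation follows; it is not JMP's exact implementation, and the file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("sensor_tests.csv")            # hypothetical file for the talk's data
sensors = [f"Sensor{i}" for i in range(1, 64)]
historical = df[df["TestResult"] == "Good"]     # rows labeled as historical

# Center and scale using the historical data only, then fit the PCA basis
# (the chart in the talk used 13 principal components).
mu, sd = historical[sensors].mean(), historical[sensors].std()
pca = PCA(n_components=13).fit((historical[sensors] - mu) / sd)

# Score every row and compute T^2 as the sum of squared scores divided by the eigenvalues.
scores = pca.transform((df[sensors] - mu) / sd)
df["T2"] = np.sum(scores**2 / pca.explained_variance_, axis=1)

# Failures should stand out, with the two SOS rows most extreme of all.
print(df.groupby("TestResult")["T2"].median())
```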
Contribution plots are colored green when that column is within three sigma, using an individuals and moving range chart, and red if it's out of control. Here we see most of the sensors are in control for the good units, and most of the sensors are out of control for the failed units. What I was hoping for here would have been if only a subset of the columns or sensors were out of control on the failed units, or if I was able to see patterns that changed across the different failure modes, which would help me isolate what variables are important for predicting the test result outcome. Unfortunately, pretty much all of the sensors are in control when things are good, and most of them are out of control when things are bad. So we're going to have to use some more sophisticated modeling to be able to tackle this prediction problem. Having not found anything in the column contribution plots, I'm going to back up and return to the two models that Ron found. Here are the column contributions for those two models, and we see that there's some agreement in terms of what the most important sensors are. But the boosted tree found a somewhat larger set of sensors as being important than the bootstrap forest did. Which of these two models should we trust more? If we look at the overall model fit report, we see that the boosted tree model has a very high training RSquare of 0.998 and a somewhat smaller validation RSquare of 0.58. This looks like an overfit situation. When we look at the random forest, it has a somewhat smaller training RSquare than the boosted tree, perhaps a more realistic one, and it has a somewhat larger validation RSquare. The generalization performance of the random forest is hopefully a little bit better. I'm inclined to trust the random forest model a little bit more. Part of this is going to be based upon just the folklore of these two models. Boosted trees are renowned for being fast, highly accurate models that work well on very large datasets, whereas the hearsay is that random forests are more accurate on smaller datasets. They are fairly robust to messy and noisy data. There's a long history of using these kinds of models for variable selection that goes back to a paper in 2010 that has been cited almost 1,200 times. So this is a popular approach for variable selection. I did a similar search for boosting, and I didn't quite see as much history around variable selection for boosted trees as I did for random forests. For this given data set right here, we can do a sensitivity analysis to see how reliable the column contributions are for these two different approaches, using the simulation capabilities in JMP Pro. What we can do is create a random validation column that is a formula column that you can reinitialize, and that will partition the data into random training and holdout sets of the same proportions as the original validation column. We can have it rerun these two analyses and keep track of the column contribution portions for each of these repartitionings.
We  can  see  how  consistent  the  story is between  the  boosted  tree  models and  the  random  forests. This  is  pretty  easy  to  do. We just go  to the  Make  Validation  Column  utility and  when  we  make  a  new  column,  we  ask  it  to  make  a  formula  column so  that it  could  be  reinitialized. Then  we  can  return to  the  bootstrap  forest  platform, right- click   on  the  column  contribution  portion, select  Simulate. It'll  bring  up  a  dialog asking  us  which  of  the  input  columns we  want  to  switch  out. I'm  going  to  choose   the  validation  column, and  I'm  going  to  switch  in in replacement  of  it, this  random  validation  formula  column. We're  going  to  do  this a hundred  times. Bootstrap  forest  is  going  to  be  rerun using  new  random  partitions   of  the  data  into  training  and  validation. We're  going  to  look  at  the  distribution  of  the  portions across  all  the  simulation  runs. This  will  generate  a  dataset of  column  contribution  portions for  each  sensor. We  can  take  this  and  go   into  the  graph  builder and  take  a  look  and  see  how  consistent  those  column  contributions  are across  all  these  random  partitions  of  the  data. Here  is  a  plot   of  the  column  contribution  portions from  each  of  the  100  random  reshufflings of  the  validation  column. Those points  we  see  in  gray here, Sensor  18  seems  to  be  consistently a  big  contributor,  as  does  Sensor  61. We  also  see  with  these  red  crosses, those  are  the  column  contributions from  the  original  analysis  that  Ron  did. The  overall  story  that  this  tells is  that  the  tendency i s  that  whenever  the  original column  contribution  was small, those  re simulated  column  contributions also  tended  to  be  small. When  the  column  contributions  were  large  in  the  original  analysis, they  tended  to  stay  large. We're  getting  a  relatively   consistent  story  from  the  bootstrap  forest in  terms  of  what  columns  are  important. Now  we  can  do  the  same  thing with  the  boosted  tree, and  the  results  aren't  quite  as  consistent as  they  were  with  the  bootstrap  forest. So  here  is  a  bunch  of  columns where  the  initial  column  contributions  came  out very small but  they  had  a  more  substantial  contribution in  some  of  the  random  reshuffles of  the  validation  column. That  also  happened  quite  a  bit  over with these  Columns  52  through  55  over  here. Then  there  were  also  some  situations where  the  original  column  contribution   was  quite  large, and  most,  if  not  all, of  the  other  column  contributions found  in  the  simulations  were  smaller. That  happens  here  with  Column  48, and  to  some  extent  also  with  Column  11  over  here. The  overall  conclusion  being   that  I  think  this  validation  shuffling is  indicating  that  we  can  trust   the  column  contributions from  the  bootstrap  forest  to  be  more stable  than  those  of  the  boosted  tree. Based on  this  comparison, I  think  I  trust  the  column  contributions from  the  bootstrap  forest  more, and  I'm  going  to  use  the  columns  that  it  recommended as  the  basis  for  some  other  models. What  I'd  like  to  do is  find  a  model  that  is  both  simpler than  the  bootstrap  forest  model and  performs  better  in  terms  of  validation  set  performance for  predicting  pass  or  fail. 
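In JMP Pro this is done with a reinitializable validation formula column and the Simulate option on the column contributions. A rough Python analogue of the same sensitivity check is to refit the random forest over many random train/validation splits and look at the spread of the importances; the settings and file name below are illustrative assumptions, not the talk's exact setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("sensor_tests.csv")              # hypothetical file name
sensors = [f"Sensor{i}" for i in range(1, 64)]
X, y = df[sensors], df["Status"]

importances = []
for seed in range(100):                           # 100 random repartitions, as in the talk
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=43, random_state=seed, stratify=y)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    importances.append(rf.feature_importances_)

# One row per reshuffle, one column per sensor: the spread of each column
# tells you how stable that sensor's contribution is.
sims = pd.DataFrame(importances, columns=sensors)
print(sims.agg(["mean", "std"]).T.sort_values("mean", ascending=False).head(10))
```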
Before  proceeding   with  the  next  modeling  stuff, I'm  going  to  do  something  that  I  should have  probably  done  at  the  very  beginning, which  is  to  take  a  look  at  the  sensors in  a  scatterplot  matrix to  see  how  correlated  the  sensor  readings  are, and  also  look  at  histograms  of  them as  well  to  see  if  they're  outlier- prone or heavily  skewed or  otherwise  highly  non- gaussian. What  we  see  here  is   there  is  pretty  strong  multicollinearity amongst  the  input  variables  generally. We're  only  looking   at  a  subset  of  them  here, but  this  high  multicollinearity  persists across  all  of  the  sensor  readings. This  suggests  that  for  our  model, we  should  try  things  like  the  logistic  lasso, the  logistic  elastic  net, or  a logistic  ridge  regression as  candidates  for  our  model to  predict  pass  or  fail. Before  we  do  that, we  should  go  ahead and  try  to  transform our  sensor  readings  here so  that  they're  a  little  bit better- behaved  and  more  gaussian- looking. This  is  actually  really  easy  in  JMP if  you  have  all  of  the  columns  up in  the  distribution  platform, because  all  you  have  to  do  is  hold  down Alt ,  choose  Fit  Johnson, and  this  will  fit  Johnson  distributions to  all  the  input  variables. This  is  a  family  of  distributions that  is  based  on  a  four  parameter transformation  to  normality. As  a  result,   we  have  a  nice  feature  in  there that  we  can  also  broadcast using Alt  Click, where  we  can  save a  transformation  from  the  original  scale to  a  scale  that  makes  the  columns more  normally  distributed. If  we  go  back  to  the  data  table, we'll  see  that  for  each  sensor  column, a  transform  column  has  been  added. If  we  bring  these  transformed  columns  up with  a  scatterplot  matrix and  some histograms, we  clearly  see  that  the  data  are  less  skewed and  more  normally  distributed  than  the  original  sensor  columns  were. Now  the  bootstrap  forest  model that  Ron  found only  really  recommended   a  small  number  of  columns for  use  in  the  model. Because  of  the  high  collinearity   amongst  the  columns, the  subset  that  we  got  could  easily  be  part of  a  larger  group  of  columns that  are  correlated  with  one  another. It  could  be  beneficial   to  find  that  larger  group  of  columns and  work  with  that   at  the  next  modeling  stage. An  exploratory  way  to  do  this is  to  go  through  the  cluster variables  platform  in JMP . We're  going  to  work with  the  normalized  version  of  the  sensors because  this  platform   is  PCA  and  factor  analysis  based, and  will  provide  more  reliable  results  if  we're  working  with  data that  are  approximately   normally  distributed. Once  we're  in  the  variable clustering  platform, we  see  that  there  is  very  clear, strong  associations amongst  the  input  columns. It  has  identified   that  there  are  seven  clusters, and  the  largest  cluster, the  one  that  explains  the  most  variation, has  25  members. The  set  of  cluster  members is  listed  here  on  the  right. Let's  compare  this  with  the  bootstrap  forest. Here  on  the  left, we  have  the  column  contributions from  the  bootstrap  forest  model  that  you  should  be  familiar  with  by  now. On  the  right,  we  have  the  list  of  members of  that  largest  cluster  of  variables. 
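JMP's Fit Johnson and Cluster Variables steps don't have exact one-line equivalents outside JMP, but the general idea, transform skewed sensors toward normality and then group strongly correlated sensors, can be sketched with a Yeo-Johnson power transform and correlation-based hierarchical clustering. Those are substitutions, not the Johnson-distribution fit or the PCA/factor-based clustering the platform actually uses, and the file name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

df = pd.read_csv("sensor_tests.csv")                      # hypothetical file name
sensors = [f"Sensor{i}" for i in range(1, 64)]

# Transform each sensor toward normality (a stand-in for saving Johnson transforms).
pt = PowerTransformer(method="yeo-johnson", standardize=True)
norm = pd.DataFrame(pt.fit_transform(df[sensors]), columns=sensors)

# Cluster the variables using 1 - |correlation| as a distance and cut the tree into
# 7 groups, the number of clusters the talk's variable clustering found.
d = 1 - norm.corr().abs().to_numpy()
np.fill_diagonal(d, 0.0)
clusters = fcluster(linkage(squareform(d, checks=False), method="average"),
                    t=7, criterion="maxclust")

for k in range(1, 8):
    members = [s for s, c in zip(sensors, clusters) if c == k]
    print(f"cluster {k}: {len(members)} sensors")
```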
If we look closely at those two lists, we'll see that the top seven contributing terms all happen to belong to this cluster. I'm going to hypothesize that this set of 25 columns is all related to some underlying mechanism that causes the units to pass or fail. What I want to do next is fit models using the Generalized Regression platform with the variables in Cluster 1 here. It would be tedious if I had to go through and individually pick these columns out and put them into the launch dialog. Fortunately, there's a much easier way: you can just select the rows in that table and the columns will be selected in the original data table, so that when we go into the Fit Model launch dialog, we can just click Add and those columns will be automatically added for us as model effects. Once I got into the Generalized Regression platform, I went ahead and fit a lasso model, an elastic net model, and a ridge model to compare them to each other, and also to the logistic regression model that came up by default. We're seeing that the lasso model is doing a little bit better than the rest in terms of its validation generalized RSquare. The difference between these methods is that there are different amounts of variable selection and multicollinearity handling in each of them. Logistic regression has no multicollinearity handling and no variable selection. The lasso is more of a variable selection algorithm, although it has a little bit of multicollinearity handling in it because it's a penalized method. Ridge regression has no variable selection and is heavily oriented around multicollinearity handling. The elastic net is a hybrid between the lasso and ridge regression. In this case, what we really care about is just the model that's going to perform the best. We allow the validation to guide us. We're going to be working with the lasso model from here on. Here's the prediction profiler for the lasso model that was selected. We see that the lasso algorithm has selected eight sensors as being predictive of pass or fail. It has some great built-in tools for understanding what the important variables are, both in the model overall and, new to JMP Pro 17, we have the ability to understand what variables are most important for an individual prediction. We can use the variable importance tools to answer the question, "What are the important variables in the model?" We have a variety of different options for how this could be done. But because of the multicollinearity in the data and because this is not a very large model, I'm going to go ahead and use the dependent resampled inputs technique, and this has given us a ranking of the most important terms. We see that Column 18 is the most important, followed by Column 27 and then 52, all the way down. We can compare this to the bootstrap forest model, and we see that there's agreement that Variable 18 is important, along with 52, 61, and 53.
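The Generalized Regression platform fits the lasso, ridge, and elastic net with built-in validation. A rough scikit-learn sketch of the same comparison for the binary pass/fail response is below; the cluster-member column list, penalty strength, and file name are placeholders, not the values selected in the talk.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sensor_tests.csv")                                     # hypothetical file
cluster1 = ["Sensor18", "Sensor27", "Sensor52", "Sensor53", "Sensor61"]  # placeholder subset
X, y = df[cluster1], (df["Status"] == "Pass").astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=43, random_state=1, stratify=y)

candidates = {
    "lasso":       LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000),
    "ridge":       LogisticRegression(penalty="l2", solver="saga", C=0.5, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=0.5, max_iter=5000),
}

for name, clf in candidates.items():
    model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    print(f"{name}: validation accuracy {model.score(X_va, y_va):.3f}, "
          f"nonzero coefficients {int((clf.coef_ != 0).sum())}")
```

As in the talk, the point of the comparison is simply which penalized fit validates best; ridge keeps every coefficient, while the lasso and elastic net also perform selection.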
But one of the terms that we pulled in because of the variable clustering step we had done, Sensor 27, turns out to be the second most important predictor in this lasso model. We've hopefully gained something by casting a wider net through that step. We've found a term that didn't turn up in either the bootstrap forest or the boosted tree methods. We also see that the lasso model has an RSquare of 0.9, whereas the bootstrap forest model had an RSquare of 0.8. We have a simpler model that has an easier form to understand and is easier to work with, and it also has a higher predictive capacity than the bootstrap forest model. Now, the variable importance metrics in the profiler have been there for quite some time. The question that they answer is, "Which predictors have the biggest impact on the shape of the response surface over the data or over a region?" In JMP Pro 17, we have a new technique called SHAP Values that is an additive decomposition of an individual prediction. It tells you by how much each individual variable contributes to a single prediction, rather than talking about variability explained over the whole space. The resolution of the question that's answered by Shapley values is far more local than either the variable importance tools or the column contributions in the bootstrap forest. We can obtain the Shapley values by going to the red triangle menu for the profiler, and we'll find the option for them over here, fourth from the bottom. When we choose the option, the profiler saves back SHAP columns for all of the input variables to the model. This is, of course, happening for every row in the table. What you can see is that the SHAP Values are giving you the effect of each of the columns on the predictive model. This is useful in a whole lot of different ways, and for that reason, it's gotten a lot of attention in intelligible AI, because it allows us to see what the contributions of each column are to a black box model. Here, I've plotted the SHAP Values for the columns that are predictive in the lasso model that we just built. If we toggle back and forth between the good units and the units that failed, we see the same story that we've been seeing with the variable importance metrics: that Column 18 and Column 27 are important in predicting pass or fail. We're seeing this at a higher level of resolution than we do with the variable importance metrics, because each of these points corresponds to an individual row in the original dataset. But in this case, I don't see the SHAP Values really giving us any new information. I had hoped that by toggling through the other failure modes, maybe I could find a pattern to help tease out different sensors that are more important for particular failure modes. But the only thing I was able to find was that Column 18 had a somewhat stronger impact on the Velocity Type 1 failure mode than the other failure modes. At this point, we've had some success using those Cluster 1 columns in a binary pass/fail model.
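In JMP Pro 17 the SHAP columns are saved from the profiler's red triangle menu. Outside of JMP, the same kind of per-row additive decomposition can be sketched with the open-source shap package for a fitted linear (lasso-style) model; treat this as an analogue for illustration, with hypothetical column names, not what the presenters did.

```python
import pandas as pd
import shap
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("sensor_tests.csv")                                 # hypothetical file
cols = ["Sensor18", "Sensor27", "Sensor52", "Sensor53", "Sensor61"]  # placeholder predictors
X, y = df[cols], (df["Status"] == "Pass").astype(int)

model = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000).fit(X, y)

# For a linear model, each SHAP value is the centered contribution of one column to
# one row's prediction (on the log-odds scale), and the values add up row by row.
explainer = shap.LinearExplainer(model, X)
shap_values = explainer.shap_values(X)

# One SHAP column per predictor per row, analogous to the columns JMP saves back.
shap_df = pd.DataFrame(shap_values, columns=[f"SHAP {c}" for c in cols])
print(shap_df.abs().mean().sort_values(ascending=False))  # global view: mean |SHAP| per column
```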
But  when  I  broke  out  the  SHAP  Values for  that  model,  by  the  different  failure modes I  wasn't  able  to  discern  a  pattern or  much  of  a  pattern. What  I  did  next  was  I  went  ahead and  fit  the  failure  mode   response  column  test  results using  the  Cluster  1  columns, but  I  went  ahead   and  excluded  all  the  pass  rows so  that  the  modeling procedure  would  focus  exclusively on  discerning  which  failure  mode  it is given  that  we  have  a  failure. I  tried  the  multinomial  lasso,   elastic  net,  and  ridge, and  I  was  particularly  happy  with  the  lasso  model because  it  gave  me a  validation  RS quare  of  about  0.94. Having  been  pretty  happy  with  that, I  went  ahead  and  saved  the  probability  formulas for  each  of  the  failure  modes. Now  the  task  is  to  come  up  with  a  simple  rule that  post  processes  that  prediction  formula to  make  a  decision about  which  failure  mode. I  call  this  the  partition  trick. The partition trick  is  where  I  put  in  the  probability  formulas for  a  categorical  response, or  even  a  multinomial  response. I  put  those  probability  formulas  in  as Xs. I  use  my  categorical  response  as  my  Y. This  is  the  same  response  that  was  used for  all  of  these except  for pass,  actually. I  retain  the  same  validation  column  that I've  been  working  with  the  whole  time. Now  that  I'm  in  the  partition  platform, I'm  going  to  hit  Split  a  couple  of  times, and  I'm  going  to  hope that  I  end  up  with an  easily  understandable  decision  rule that's  easy  to  communicate. That  may  or  may  not  happen. Sometimes  it  works,  sometimes  it  doesn't. So  I  split  once, and  we  end  up  seeing  that whenever  the  probability  of  pass  is  higher  than  0.935, we  almost  certainly  have  a  pass. Not  many  passes  are  left over  on  the  other  side. I  take  another  split. We  find  a  decision  rule  on  ITM that  is  highly  predictive  of  ITM  as  a  failure  mode. Split  again. We  find  that  whenever  Motor is  less  than  0.945, we're  either  predicting  Motor  or  Brake. We  take  another  split. We  find  that  whenever  Velocity  Type  1, its probability  is  bigger  than  0.08 or likely  in  a  Velocity  Type  1  situation or  in  a  Velocity T ype  2  situation. Whenever  Velocity  Type  1 is  less  than  0.79, we're  likely  in  a  gripper  failure  mode or  an  IMP  failure  mode. What  do  we  have  here? We  have  a  simple  decision  rule. We're  going  to  not  be  able  to  break   these  failure  modes  down  much  further because of  the  very  small  number of  actual  events  that  we  have. But  we  can  turn  this  into  a  simple  rule for  identifying  units   that  are  probably  good, and  if  they're  not,  we  have  an  idea of  where  to  look  to  fix  the  problem. We  can  save  this  decision  rule  out   as  a  leaf  label  formula. We  see  that  on  the  validation  set, when  we  predict  it's  good, it's good most  of  the  time. We  did  have  one  misclassification of  a  Velocity  Type 2  failure that  was  actually  predicted  to  be  good. Predict  grippers  or  IMP, it's  all  over  the  place. That  leaf  was  not  super  useful. Predicting  ITM is  100 %. Whenever  we  predict  a  motor  or  brake, on  the  validation  set, we  have  a  motor  or  a  brake  failure. 
When  we  predict  a  Velocity  Type  1 or 2, it  did  a  pretty  good  job of  picking  that  up with  that  one  exception of  the  single  Velocity  Type  2  unit that  was  in  the  validation  set, and  that one  happened  to  have  been  misclassified. We  have  an  easily  operational  rule  here that  could  be  used  to  sort  products and  give  us  a  head  start  on  where we  need  to  look  to  fix  things. I  think  this  was  a  pretty  challenging  problem, because  we  didn't  have  a  whole  lot  of  data. But  we  didn't  have  a  lot  of  rows, but  we  had  a  lot  of   different  categories  to  predict and a  whole  lot  of   possible  predictors  to  use. We've  gotten  there  by  taking  a  series  of  steps, asking  questions, sometimes  taking  a  step  back and  asking  a  bigger  question. Other  times,  narrowing  in   on  particular  sub- issues. Sometimes  our  excursions  were  fruitful, and  sometimes  they  weren't. Our  purpose  here  is  to  illustrate how  you  can  step  through   a  modeling  process, through  this  sequence  of  asking  questions using  modeling  and  visualization  tools to  guide  your  next  step, and  moving  on   until  you're  able  to  find a  useful,  actionable,  predictive  model. Thank  you  very  much  for  your  attention. We  look  forward  to  talking  to  you in  our  Q&A  session  coming  up  next.
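The "partition trick" Chris describes, saving the multinomial probability formulas and then splitting on them in Partition, can be imitated roughly outside JMP by feeding predicted class probabilities into a shallow decision tree. A minimal sketch follows, with hypothetical file and column names and without the validation handling used in the talk.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("sensor_tests.csv")                                     # hypothetical file
cluster1 = ["Sensor18", "Sensor27", "Sensor52", "Sensor53", "Sensor61"]  # placeholder subset
X, y = df[cluster1], df["TestResult"]                                    # failure-mode response

# Stage 1: a multinomial, lasso-style model for the failure modes; its predicted
# probabilities play the role of the saved probability formulas in JMP.
stage1 = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000).fit(X, y)
prob_cols = pd.DataFrame(stage1.predict_proba(X),
                         columns=[f"Prob[{c}]" for c in stage1.classes_])

# Stage 2: a shallow tree on those probabilities gives a simple, readable decision rule,
# like the splits on Prob[Pass], Prob[ITM], Prob[Motor], and Prob[Velocity 1] in the talk.
rule = DecisionTreeClassifier(max_depth=4).fit(prob_cols, y)
print(export_text(rule, feature_names=list(prob_cols.columns)))
```

The printed tree is the analogue of the leaf label formula: a handful of probability thresholds that sort units into "probably good" and "which failure mode to check first."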
Since the Functional Data Explorer was introduced in JMP Pro 14, it has become a must-have tool to summarize and gain insights from shape features in sensor data. With the release of JMP Pro 17, we have added new tools that make working with spectral data easier. In particular, the new Wavelets model is a fast alternative to existing models in FDE for spectral data. Drop in to get an introduction to these new tools and see how you can use them to analyze your data.     Hi,  my  name  is  Ryan  Parker, and  I'm  excited  to  be  here  today with  Clay  Barker  to  share  with  you  some new  tools  that  we've  added  to  help  you analyze  spectral  data  with  the Functional  Data  Explorer  in  JMP Pro  17. So  what  do  we  mean  by  spectral  data? We  have  a  lot  of  applications from  chemometrics  that  have motivated  a  lot  of  this  work. But  I  would  just  start  off  by  saying we're  really  worried  about  data that  have  sharp  peaks. This  may  not  necessarily  be  spectral  data, but  these  are  the  type  of  data that we've had  a  hard  time  modeling in  FE  up  to  this  point. And  so  we  really  wanted  to  focus  on  trying to  open  up  these  applications  and  make  it a  lot  easier  to  handle  these  sharp  peaks. Maybe  potential  discontinuities. Just   these  large, wavy  features  of  data. Where  in  this  specific  example, with  spectroscopy  data, we're  thinking  about composition  of  materials, and  these  peaks  can  represent these  compositions, and  we  want  to  be  able to  try  to  quantify  those. So  another  application  is from mass spectrometry, and  here  you  can  see these  very  sharp  peaks. They're  all  over  the  place in  these  different  functions. But  these  peaks  are  representing  proteins from  these  spectrums, and  they  can  help  you,  for  example, compare  differences  between  things  from  a  patient  with  cancer and  patients  without  cancer to   understand  differences. I mean, again, it's  really  important  that  we  try to  model  these  peaks  well so  that  we can  quantify  these  differences. An  example  that  Clay  is  going to  show  comes  from  chromatography. This  is  where  we're  trying  to ... In  this  case, we  want  to  look  at  quantifying  the  difference  between  an  olive  oil versus  other  vegetable  oils. And  so  the  components  of  these  things  represented  by  all  of  these  peaks, we  need  to,  again,  try  to  model  these  well. The  first  thing  I  want  to  cover  are four  new  tools  to  preprocess  your  data, spectral  data  before  you get  to  the  modeling  stage. The  first  one  is the  Standard  Normal  Variate. So  with  the  Standard  Normal  Variate, we're  thinking  about standardizing  each  function   by  their  individual  mean  and  variance. So  we're  going  to  take  every  function   one  at  a  time, subtract  off  the  mean,  divide  by  the  variance, so  that  they  all  have   mean  zero  and  variance  one. This  is  an  alternative  to   the  existing  tool  we  have  standardized, which  is  just looking  at  a  global  mean  in  variance so  that  the  data themselves  have  been  scaled, but  certain  aspects, like  means,  are  still  there, whereas  with  the  Standard  Normal  Variate, we  want  to  remove  that  for  every  function. The  next  tool's   Multiplicative S catter  Correction. It  is  similar  to  Standard  Normal  Variate, the  results  end  up being  the  same,  similar. 
But  in  this  case,  we're  thinking  about data  where  we  have  light  scatter. So  some  of  these  spectral  data  come  where  we  have  to  worry  about our  light  scatter  from  all  the  individual  functions  being  different from  a  reference function  that  we're  interested  in. Usually  this  is  the  mean. So  what  we'll  do  is  we   will  set  a  simple  model between  the  individual functions  to  this  mean  function. You know, we will get coefficients that  we  can  subtract   off  this  mean  feature, divide  by  the  slope  feature, get  us to  that  similar  standardizing  the  data, and  in  this  case,  focused  on  this  light  scatter. Okay,  so  at  this  point, we're  thinking  about what  if  we  have  noise  in  our  data? What  if  we  need  to  smooth  it? So  the  models  that  we  want  to  fit, for  spectral  data, these  wavelets, they  don't  smooth  the  data  for  you. So  if  you  have  noisy  data,  you  really want  to  try  to  handle  that  first, and  that's  where  the   Savitzky-Golay  Filter  comes  in. What  this  is  going  to  do  is  fit  n- degree  polynomials over  a specified  bandwidth to  try  to  find  the  best  model  that  will  smooth  your  data. So  we  search  over  a  grid  for  you, pick  the  best  one, and  then  present  the  results  to  you. And  I  do  want  to  note  that  the  data  are required  to  be  on  a  regular  grid, but  if  you  don't  have  one, FDE  will  create  one  for  you. We  have  a  reduce  option that  you  can  use  if  you  want   some  finer  control  over  this, but  by  default, we  are  going  to  look at  the  longest  function, choose  that  as  our  number  of  grid  points, and  create  a  regular grid  from  that  for  you. But  the  nice  thing  about   the  S avitzky-Golay Filter is  because  of  the  construction   with  these  polynomials, we  have  easy  access to  the  first  or  second  derivative. Even  if  you  don't  have  spectral  data  and you  want  to  access derivative functions this  will  be your  pathway  to  do  that. And  if  you  do  request,  say, the  second  derivative, our  search  gets   constrained  to  only  polynomials that will allow us to  give  you a  second  derivative,  for  example. But  this  would  be  the  way  to  access that, even  if  you  weren't  even worried  about  smoothing. You  can  now  get  to  derivatives. The  last   preprocessing  tool  I'll  cover is  Baseline  Correction. So  in  Baseline  Correction, you  are  worried  about  having   some  feature  of  your  data that  you  would  just  consider  a  baseline  that  you  want  to  remove before  you  model  your  data. So  the  idea  here  is  we  want to  fit  a  baseline  model. We  have  linear,  quadratic,  exponential options for you, so  we  want  to  fit  this  to  our  data   and  then  subtract  it  off. But  we  know  there  are  important  features, typically  these  peaks, so  we  want  to  not  use  those  parts  of  the  data when  we  actually fit  these  baseline  models. So  we  have  the  option  here for  correction  region. I  think  for  the  most  part  you would  likely  use  entire  function. So  what  this  just  means  is,  what  part  are  we  going  to  subtract  off? So  if  you  select  within  regions only things  within  these  blue  lines are  going  to  be  subtracted. But  I've  already  added  four  here. 
Every time you click Add on baseline regions, you're going to get a pair of these blue lines, and you can drag them around to the important parts of your data. What this will do is, when you try to fit, say, a linear baseline model, it's going to ignore the data points that are within these two blue lines. So, for function one, we fit a linear model, but we exclude all these sharp peaks that we really want, that we're interested in. And then we take the result from this linear model and subtract it off from the whole function. The alternative to that is an Anchor Point, and that's if you say, I really would like for this specific point to be included in the baseline model. Usually this is if you have smaller data and you know, okay, I want these few points. These are key. These represent the baseline. If I were able to fit, say, a quadratic model to these points, that's what I want to subtract off. So it's an alternative. When you click those, they'll show up as red, as an alternative to the blue. But this will allow you to correct your data and remove the baseline before proceeding. So that gets us to how we model spectral data now in JMP Pro 17, and we're using wavelets. The nice thing about wavelets is we have a variety of options to choose from. These graphs represent what are called mother wavelets, and they are used to construct the basis that we use to model the data. The simplest is this Haar wavelet, which is really just step functions, maybe hard to see that here, but these are just step functions. But this biorthogonal one has a lot of little jumps, and you can start to imagine, okay, I can see why these make it a lot easier to capture peaks in my data than the Haar wavelet. All of these have parameters that change their shape and size, so I've just selected a couple here to show you the differences. But you can really see where, okay, if I put a lot of these together, I can understand why this is probably a lot better for modeling all these peaks in my data. And so here's an example to illustrate that with one of our new sample data sets, an NMR design of experiments. This is just from one function, where let's start with B-Splines. This is sort of the go-to place to start for most data in FDE. But we can see that it's really having a hard time picking up on these peaks. Now, we have provided you tools to change where knots are at in these B-spline models, so you could do some customization and probably fit this a lot better than the default. But the idea is that now you've had to go and move things around, and maybe it works for some functions but not others, and you need a more automated way. One alternative to that is P-Splines. That does that a little bit for you, but it's still not quite capturing the peaks as well as wavelets. It's probably doing the best for these data relative to wavelets. There's also an almost model-free approach where we model the data just directly on our shape components, this direct functional PCA. It's maybe a bridge between P-Splines and B-Splines, where it's not quite as good as P-Splines but it's better than B-Splines. But this is just a quick example to highlight how wavelets can really be a lot quicker and more powerful.
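Before looking at how FDE fits the wavelet models, here is a minimal sketch of the preprocessing steps Ryan described, standard normal variate, multiplicative scatter correction, and Savitzky-Golay smoothing with optional derivatives, assuming the spectra are stored as rows of a NumPy array on a regular grid. This is a generic illustration of the ideas, not FDE's implementation, and the file name is hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical spectra: one row per function, one column per grid point.
spectra = np.loadtxt("spectra.csv", delimiter=",")

# Standard normal variate: center and scale each function by its own mean and std.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

# Multiplicative scatter correction: regress each function on the mean (reference)
# spectrum, then remove the fitted intercept and divide by the fitted slope.
ref = spectra.mean(axis=0)
msc = np.empty_like(spectra)
for i, row in enumerate(spectra):
    slope, intercept = np.polyfit(ref, row, deg=1)
    msc[i] = (row - intercept) / slope

# Savitzky-Golay: local polynomial smoothing; setting deriv=1 or deriv=2 would
# return first or second derivatives instead of the smoothed curves.
smoothed = savgol_filter(msc, window_length=15, polyorder=3, axis=1)
```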
What are we doing in FDE? We construct a variety of wavelet types and their associated parameters and fit them for you. Similar to the Savitzky-Golay Filter, we do require that the data are on a regular grid, and, good news, we'll create one for you. But of course you can go to the Reduce option for finer control if you would like. The nice thing is that once the data are on the grid, we can use a transformation that's super fast. P-Splines would also, for these data, be what you would really want to use, but they can take a long time to fit, especially if you have a lot of functions and a lot of data points per function. Our wavelet models are going to essentially lasso all of the different basis functions that construct a particular wavelet type with its parameter, and what that allows us to do is force a lot of those coefficients that don't really mean anything to zero, to help create a sparse representation for the model. So those five different wavelet types that I showed before are available, and once we fit them all, we're going to compute model selection criteria and choose the best for you by default. You can click through these as options to see other fits. A lot of times these first few are going to look very similar, and it's really just a matter of, there are certain applications where they know, "Okay, I really want this wavelet type," so they'll pick the best one of that type in the end to use. The nice thing about these models is that they operate on resolutions. They're modeling different resolutions of the data. So we have these coefficient plots where, at the top, they're showing low-frequency, larger scale trends, like an overall mean, and as you go up in the resolutions, but down in the plot, you're going to look at high-frequency items, and these are going to be things that are happening on very short scales, so you can see where it's picking up a lot of different features. In this case, a lot of these are zero for the highest resolutions. So it's picking up some scales that are at the very end of this function, and it's picking up some of these differences here. But this gives you a sense of where things are happening, both in location and in whether it's the high-frequency or low-frequency parts of the data.
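The wavelet fits in FDE are automated, but the underlying mechanics, decomposing each function on a regular grid into coefficients by resolution, shrinking the small coefficients toward zero, and reconstructing, can be sketched with the PyWavelets package. The wavelet choice, threshold, and file name below are assumptions for illustration, not FDE's actual fitting algorithm.

```python
import numpy as np
import pywt

spectra = np.loadtxt("spectra.csv", delimiter=",")   # hypothetical regular-grid spectra
signal = spectra[0]                                  # one function

# Multi-level discrete wavelet transform; 'sym20' is one of the Symlet family
# mentioned in the demo (the platform searches several families for you).
coeffs = pywt.wavedec(signal, "sym20", level=5)

# Soft-threshold the detail coefficients: forcing small ones to zero gives the
# sparse representation (and the smoothing) that the wavelet model relies on.
threshold = 0.1 * np.max(np.abs(coeffs[-1]))
shrunk = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]

# Reconstruct the smoothed function and count how many coefficients survived.
smoothed = pywt.waverec(shrunk, "sym20")[: len(signal)]
kept = sum(int(np.count_nonzero(c)) for c in shrunk)
total = sum(c.size for c in shrunk)
print(f"kept {kept} of {total} coefficients")
```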
But wavelet models in particular have coefficients that, because they represent resolutions and locations, can be more interpretable and have a more direct connection to what is happening in the data, in a way that a functional principal component may not be as easy to connect with. We have this energy function, and it is standardized to show you, "Okay, this resolution at 3.5," representing more of this end point, is where the most differences are across all of our functions, and it represents about 12%. You can scroll down, and we go until we reach 90% of the energy, which is just the squared coefficient values standardized here; energy is simply how big these coefficients are relative to everything else. Similar to Functional DOE, you can change the factors and see how the shape changes, and we have cases where both Wavelets DOE and Functional DOE work well. Sometimes Wavelets DOE just captures the structure better; in this example, it does not allow some negative points that Functional DOE might allow. They are both there, they are both fast, and you can use both of them to analyze the results of wavelet models. That is my quick overview, so now I will turn it over to Clay to show you some examples of using wavelet models with some example data sets. Thanks, Ryan. As Ryan showed earlier, we found an example where folks were trying to use chemometrics methods to classify different vegetable oils. I have the data set opened up here. Each row of the data set is a function: each row represents a particular oil, and as we go across the table, that is the chromatogram. I have opened this up in FDE just to save a few minutes; the wavelet fitting is really fast, but I figured we would start with the fit open. Here is what our data set looks like. The red curves are olive oils and the green curves are not olive oils, so we can see there are definitely some differences between the two kinds of oils and their chromatograms. As Ryan said, we just go to the red triangle menu and ask for Wavelets, and it will pick the best wavelet for you. But like I said, I have already done that, so we can scroll down and look at the fit. Here we see that the best wavelet we found is called the Symlet 20, and we have graphs of each fit summarized for you. As you can see, the wavelets have fit these data really well. In this case, though, we are not terribly interested in the individual fits; we want to see whether we can use these individual chromatograms to predict whether or not an oil is an olive oil. So what we can do is save out these wavelet coefficients, which gives us a big table, and there are thousands of them; in fact, there is one for every point in our function, and with 4,000 points in each function, this table is pretty huge. There are 4,000 wavelet coefficients.
But  as  Ryan  was  saying,  you  can  see that  we've  zeroed  some  of  them  out. So  these  wavelet  coefficients drop  out  of  the  function. So  that's  how  we  get  smoothing. We  fit  our  data  really  well, but  zeroing  out  some  of  those  coefficients is  what  smooths  the  function out. So  how  can  we  use  these  values  to  predict whether  or  not  we  have  an  olive  oil? Well,  you  can  come  here  to  the  function summaries  and  ask  for  save  summaries. So  what  it's  done  is  it  saves  out the  functional  principal  components. But  here  at  the  end  of  the  table,  it  also saves  out  these  wavelet  coefficients. So  these  are  the  exact  same  values that we saw  in  that  wavelet  coefficient table  in  the  platform. So  let  me  close  this  one. I've  got  my  own  queued  up  just  so  that  I don't  have  anything  unexpected  happen. So  here's  my  version  of  that  table. And  what  we  want  to  do  is  we  want  to  use all  of  these  wavelet  coefficients to  predict  whether the  curve  is  from  an  olive  oil or  from  a  different  type  of  oil. So  what  I'm  going  to  do  is, I'm  going  to  launch the  generalized  regression  platform, and  if  you've  ever  used  that  before, it's  the  place  we  go  to  build  linear models  and  generalize  linear  models using a  variety  of  different variable  selection  techniques. So  here  my  response  is  type. I  want  to  predict   what  type  of  oil  we're looking at and  I  want  to  predict  it  using all  of  those  wavelet  coefficients. So  I  press  run. In  this  case,  I'm  going  to  use  the  Elastic Net because  that  happens  to  be my  favorite  variable  selection  method. And  I'm  going  to  press  go. So  really  quickly, we  took  all  those  wavelet  coefficients and  we  have  found  the  ones  that  really do  a  good  job  of  differentiating between  olive  oils and  non -olive  oils. So  in  fact,  if  we  look  at  the  confusion  matrix, which  is,  this  is  a  way  to  look at  how  often  we  predict  properly,  right? So  for  all  49  other  olive  oils, we  correctly  identified  those   as  not  olive  oils. And  for  all  71  olive  oils,  we  correctly identified  those  as  olive  oils. So  we  actually  predicted  these  perfectly and  we  only  needed  a  pretty  small  subset of  those  wavelet  coefficients. So  I  didn't  count, but  that  looks  like  about  a  dozen. So  we  started  with  thousands  of  wavelet coefficients  and  we  boiled  it  down to just the  12  or  so  that  were  useful for  predicting  our  response. So  what  I  think  is  really  cool  is, we  can  interpret  these  wavelet coefficients  to  an  extent,  right? So  this  coefficient  here  is resolution  two  at  location  3001. So  that  tells  us  there's  something  going on  in  that  part  of  the  curve that helps us differentiate between  olive  oils  and  not  olive   oils. So  what  I've  done  is, I've  also  created a  graph  of  our  data  using... Well,  you'll  see. So  what  I've  done  is  here the  blue  curve  is  the  olive  oils. The  red  curve  is  the  non -olive  oils, and  this  is  the  mean  chromatogram. So  averaging  over  all  of  our  samples. And  these  dashed  lines  are  the  locations where  the  wavelet coefficients  are  nonzero. So  these  are  the  ones  that  are  useful for  discriminating  between  oils. 
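As a rough outside-of-JMP analogue of the Elastic Net step just described, the sketch below fits a penalized logistic regression to a matrix of saved wavelet coefficients and prints the confusion matrix. The data here are simulated stand-ins; only the workflow (coefficients in, a small set of nonzero terms out) mirrors the demo, not the actual olive oil results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Assume X holds one row per oil sample and one column per saved wavelet
# coefficient, and y is 1 for olive oil, 0 otherwise (all names hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 400))                    # stand-in for the coefficient table
y = (X[:, 10] + 0.5 * X[:, 250] > 0).astype(int)   # toy signal in two coefficients

# An elastic net penalty keeps only the coefficients that carry signal, zeroing the rest.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.5, max_iter=5000)
model.fit(X, y)

kept = np.flatnonzero(model.coef_[0])              # indices of nonzero wavelet coefficients
print("nonzero coefficients:", kept)
print(confusion_matrix(y, model.predict(X)))
```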
And  as  you  can  see,  some  of  these   non-zero coefficients  line  up  with  peaks and  the  data  that  really tend  to  make  sense,  right? So, here is, here's  one  of  the  non -zero  coefficients, and  you  can  tell  it's  right  at  a  peak where  olive  oil  is  peaking, but   non-olive  oils  are  not,  right? So  that  may  be  meaningful  to  someone that  studies  chromatography and  olive  oils  in  particular. But  so  we  like  this  example  because it's  a  really  good  example  of  how  wavelets fit  these  chromatograms  really  well. And  then  we  can  use  the  wavelet coefficients  to  do  something  else,  right? So  not  only  have  we   fit  the  curves  really well, but  then  we've  taken   the  information  from  those  curves and  we've  done  a  really  good  job  of  discriminating  between  different  oils. And  so  I've  got  one  more  example. Ryan  and  I  are  both big  fans  of  Disney  World, so  this  is  not  a  chromatogram. This  is  not  spectroscopy. But  instead  we  found  a  data  set  that  looks at  wait  times  at  Disney  World, so  we  downloaded  a  subset  of  wait  times for  a  ride  called  Disney's the  Seven  Dwarfs,  Mind  Train. And  if  you've  ever  been  to  Disney  World, you  know  it's  a  really  cool  roller coaster right  there  in  fantasyland. But  it  also  tends  to  have really  long  wait  times, right? You  spend  a  lot  of  time  waiting  your  turn. So  we  wanted  to  see  if  we  could  use wavelets  to  analyze  these  data and then use  the  Wavelet  DOE  function   to  see  if  we  can  figure  out if  there's  days  of  the  weeks   or  months  or  years where  wait  times are  particularly  high  or  low. So  we  can  launch  FDE. Here you  can  see, we've got each  day  in  our  data  set, we  have  the  wait  times from the time that  the  park  opens,  here, to  the   time that  the  park  closes  over  here. And  to  make  this  demo  a  little  bit  easier, we've  finessed  the  data  set to  clean  it  up  some. So  this  is  not  exactly  the  data, but  I  think  some  of  the  trends that  we're  going  to  see  are  still  true. So  what  I'm  going  to  do  is  I'm  going to  ask  for  wavelets, and  it'll  take  a  second  to  run, but  not  too  long. So  now  we've  found  that  a  different basis  function  is  the  best. It's  the Daubechies 20 and  I  apologize  if  I  didn't pronounce  that  right. I've  been  avoiding  pronouncing  that  word  in  public, but  now  that's  not  the  case  anymore. So  we've  found  that's  our  favorite  wavelet and  what  we're  going  to  do is  we're  going  to  go   to  the  Wavelet  DOE  analysis, and  it's  going  to  use  these  supplementary  variables that  we've  specified  day of  the  week,  year  and  month to see if we can  find  trends  in  our  curves  using  those  variables. So  we'll  ask  for  Wavelets  DOE, and what's  happening  in  the  background  is we're  modeling  those  wavelet  coefficients using  the  generalized  regression  platform, so  that's  happening  behind  the  scenes, and  then  it  will  put  it  all together  in  a  Profiler  for  us. So  here  we've  got, you know,  this  is  our  time  of  day  variable. We  can  see  that  in  the  morning. The  wait  times  sort  of  start, you know, around  an  hour. It  gets  longer  throughout  the  day, you know, peaking  at  about  80  minutes, almost  an  hour  and  a  half  wait. 
And  then,  as  you  would  expect, as  the  day  goes  on   and  kids  get  tired  and  go  to  bed, wait  times  get  a  little bit  shorter  until  the  end  of  the  day. Now,  what  we  thought  was  interesting is looking  at  some  of  these  effects, like  year  and  month. So  we  can  see  in  2015,  the  wait  times  sort of  gradually  went  up,  right,  until  2020. And  then  what  happened  in  2020? They  increased  in  February, and then, shockingly,  they  dropped  quite a  bit  in  March  and  April. And  I  think  we  all  know  why that  might  have  happened  in  2020. Because  of  COVID,  fewer  people were  going  to  Disney  World. In  fact,  it  was  shut  down  for  a  while. So  you  can  very  clearly  see  a  COVID  effect  on  Disney  World  wait  times really  quickly  using  Wavelet  DOE. One  of  the  other  things  that's  interesting is  we  can  look  at  what time  of  year  is  best  to  go. It  looks  like  wait  times   tend  to  be  lower  in  September, and  since  Disney  World   is  in  Florida, you know, that's  peak  hurricane  season, and  kids  don't  tend  to  be  out  of  school. So  it's  really  cool  to  see  that  our model  picked  that  up  pretty  easily, right? So, but  don't  start  going to  Disney  World  in  September. That's  our  time. We  don't  want  it  getting  crowded. But  yeah,  so  with  just  a  few  clicks,   we  were  able to  learn  quite  a  few things  about  wait  times, Seven  Dwarfs  Mine  Train  at  Disney. But  we  really  wanted  to  highlight that these methods were  focused  on  chromatography  and  spectrometry, but  there's  a  lot  of  applications  where  you  can  use  Wavelets, and  I  think  that's  all  we  have. So  thank  you. And thank  you,  Ryan.
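For readers curious how the Wavelets DOE idea shown in these two demos could be approximated outside of JMP, here is a rough sketch: wavelet-transform each curve, model every coefficient as a function of the factors, and invert the transform to get a predicted curve for a chosen factor setting. The simulated wait-time curves, the single day-of-week factor, and the plain least-squares coefficient models are all assumptions; JMP Pro's Wavelets DOE uses its own coefficient models and profiler.

```python
import numpy as np
import pywt
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: wait_curves is (n_days, n_times) of wait times on a regular
# time grid, and factors holds a day-of-week code for each day.
rng = np.random.default_rng(3)
n_days, n_times = 200, 1024
factors = rng.integers(0, 7, size=(n_days, 1))
wait_curves = (60 + 10 * np.sin(np.linspace(0, np.pi, n_times))
               + 5 * factors + rng.normal(0, 2, (n_days, n_times)))

# Step 1: wavelet-transform every curve and flatten the coefficients.
per_day = [pywt.wavedec(c, "db20", level=4) for c in wait_curves]
flat, slices = zip(*[pywt.coeffs_to_array(c) for c in per_day])
C = np.vstack(flat)                       # (n_days, n_coefficients)

# Step 2: model every coefficient as a function of the factor (one-hot day of week).
X = np.eye(7)[factors.ravel()]
coef_model = LinearRegression().fit(X, C)

# Step 3: predict the coefficients for one factor setting and invert the transform
# to get a predicted wait-time curve (a rough analogue of the Wavelets DOE profiler).
x_new = np.eye(7)[[5]]                    # e.g. a Saturday
pred = pywt.array_to_coeffs(coef_model.predict(x_new)[0], slices[0], output_format="wavedec")
pred_curve = pywt.waverec(pred, "db20")[:n_times]
```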
The town of Sharon, Massachusetts, created a Governance Study Committee to recommend changes to municipal by-laws and governance within the town, particularly with an eye to elevating civic engagement among residents. I am a member of that committee, and in one phase of our work we sought to confer with officials in similar communities across the state to learn from best practices elsewhere.   There are 351 cities and towns in the Commonwealth of Massachusetts, and we had limited time and no budget for comprehensive research. We quickly confronted the issue of how to best identify a modest group of communities closely comparable to Sharon, which in turn raised questions about which characteristics are most relevant to citizen participation in local governance.    Using JMP with publicly available data, I conducted a two-stage project to select key variables and then used those variables to run cluster analysis to identify other communities for our research. Because we are an all-volunteer, appointed public body, the research had to be presentable in public forums, comprehensible by a lay audience. Several visualization features of JMP 16 were particularly valuable in that regard.   This talk walks through the analysis, as well as my strategy to make clustering understandable.     Resources Academic Case Study on this topic. Description to use with the data attached here.   Hello. My  name  is  Rob  Carver, and  today  I  want  to  share  a  story about  a  project  I've  been  working on  in  my  small  town  in  Massachusetts. At  the  outset, I'll  point  out  that  the  slides and  the  JMP  data  table are  up  on  the  discovery  website and  there's  a  new  academic  case  study on  this  very  topic  that  will  be  posted. It's  not  already  posted. It  will  be  posted  very  soon. What  I'm  hoping  to  do  in  30  minutes is  spend  most  of  our  time  with  a  JMP  demo, but  you're  going  to  need some  context  and  background. I  want  to  provide a  little  bit  of  a  scenario, give  you  a  sense  of  the  problem that  I'm  trying  to  solve, and  talk  about  the  research  strategy and  then  get  into  the  demo and  wrap  up  with  some  conclusions. I  live  in  a  town  called  Sharon, which  is  an  archetypal  New  England  town. Here  you  see  a  picture of  our  talent  center. It  was  incorporated  in  1765, so an  old  community. Like  many  New  England  communities, the  legislative  function  of  the  town, is managed  by  the  annual  town  meeting in  which  anyone  can  come  and  speak. M emorialized  in  the Norman  Rockwell  images. From  the  start, we  have  used  an   open-town  meeting and  the  executive  function  is  carried  out by  a   three-member  board known  as   The Select Board. But  since  Norman  Rockwell's  day, municipal  government  has  become technologically,  financially, legally  more  complex, even  for  the  most fundamental  services  that  a  town  provides. Attendance  at  the  town meeting  has  really  dwindled. About  a  year  and  a  half  ago, The Select Board  created  a  governance study  committee,  of  which  I'm  a  member. we  are  doctors  and  lawyers and  accountants,  and  teachers, and  marketing  people, local  business  people. I'm  the  resident  stats  guy. Overtime,  the  population of  the  town  has evolved, it's grown . It's  more  diverse  than it  was  100  years  ago. We've  gone  from an  agrarian  manufacturing  community to  a  bedroom  community for  the  city  of  Boston. 
Lots  of  professionals  working  in  hospitals and  universities,  law  firms  in  the  city. People  tend  not  to  live and  work  in  the  town, and  that  has  impactsq on  participation  in  town  governance. The  charge  to the  governance  study  committee is  find  ways to  boost  citizen  engagement. we've  been  doing  our  due  diligence. We've  been  researching, we've  surveyed  residents, we've  read  the  literature, we've  interviewed  town  officials. One  part  of  our  research, and  that's  what  this  talk  is  about, is  we  wanted  to  reach  out to  towns  like  Sharon to  find  out  what  are  they  doing, what's  their  experience. There's  some  comparative  research. There  are  350  towns  in  Massachusetts. We  have  time  constraints, and  so  we're  looking  for  a  way  to  identify a  smallish  number  of  communities that  are  similar  to  us. We  didn't  want  to  reinvent  the  wheel, but  we  thought  that  modernizing it  some  would  be  a  good  idea. The  driving  question covered  in  this  research is  which  towns  are similar  to  this  town. A  little  bit  about  Sharon. We  sit  in  South-eastern  Massachusetts. We  are  not  too  far  from  Plymouth, which  is  where  the  1620 May  Flower  landing  happened. This  community  was  originally  populated by  Wampanoag  peoples. Europeans  arrived  in  163 7. We're  about  halfway  between Boston  and  Providence. For  the  sports  fans  out  there, we  are  next  door  to  where the  New  England  Patriots  play  football. Population  about  18,500, which  is  quite  average  in  Massachusetts. We  have  great  percentage  of  the  voters of  the  population  are  registered  voters. Yet  out  of  all  those  people, we  get  2%  for  a  town  meeting. Most  recently  in  May  of  2022. This  was  the  scene and  a  lot  of  that  is  COVID  related. There  was  social  distancing rules  in  effect,  but  turnout  is  low, partly  because  of  COVID, partly  because  of  factors that  we  don't  fully  understand. One  task  for  the  governance  study community  is  to  consider other  alternatives  to  town  meeting, or  tweaks  and  enhancements to town meeting. Under  state  law  in  an   open-town  meeting to  participate, you  have  to  be  in  the  room. It's  broadcast  on  local  television, but  you  have  to  be  present to  speak  or  to  vote. State  law  also  says  there's  three ways to  run  local  government. 74%  of  the  communities, the  large  majority  do  what  Sharon  does. Open-town  meeting  once  or  twice  a  year. A  small  number  have  what's  called representative  town  meeting  in  which voters  elect  their  neighbors, maybe  a  few  hundred  of  them, to  participate  and  vote  in  town  meeting. Traditionally  cities have  had  small  councils with  a  mayor or  administrator  of  some  kind. Increasingly  that's  being  adopted by  towns,  and  so  we're  looking  into  that. For  this  talk, the  task  is  identify  peer  towns, that  we  can  then  we  could  then  interview and  consult  with  and  reach  out  to  them. I  mentioned  some  of  the  state legal  constraints. One  other  constraint  is  the  town  boards, like  a  government  study  committee, have  to  have  open  meetings. Anything  we  do  and  decide and  deliberate  about  has  to  be  in  public, which  is  a  good  thing. We  have  no  budget. We  have  some  wonderful  staff in  the  town  hall,  but  they  are busy  doing  other  things  as  well. 
Data availability was a mixed story. There is plenty of data available about the characteristics of communities, but we are really interested in how many people participate in local government, and there is no centralized data about that, so we needed to hunt for proxies. We also had no ability to compel folks in other towns to meet with us, advise us, or share data with us, and we are operating in a topic area that is heavily governed by tradition; people really cleave to that Norman Rockwell image. We came up with a three-stage plan. As a committee, we brainstormed variables: why do people participate, why don't people participate, why are different towns different? I then grabbed voter turnout data from a recent statewide election to use as a proxy for citizen engagement and ran some models in JMP to identify the variables that seemed to have predictive value. The committee then discussed and added some more variables that they thought were important on the town meeting dimension. That generated 20 predictor columns, which I knew was far more than I wanted to deal with. I consulted my brain trust of academic colleagues; special thanks go to Mia Stephens and Ruth Hummel at JMP, who advised me on principal components analysis, which I will note at the outset was not part of my comfort zone, so I will tell you a little bit about that. Then I ran cluster analysis, and that is the main event today. People on the committee understood that we probably want to be talking to towns of comparable size, but there is more to similarity than size, and more to similarity than being a geographic neighbor. Part of the work involved instructing the committee a little bit on cluster analysis, so just in case anybody watching doesn't have much background in this, here is how I did it. I said, well, we can look at population and something else at the same time, where maybe that something else has an impact on participation. In this case, the Y axis was the single-family property tax bill. You can see that there is a bunch of towns similar in size to Sharon, but which might have very different tax impacts. The idea in cluster analysis, if you are going to work in two dimensions, is to choose two attributes that you think are relevant to your question, spread the towns out on those two dimensions, and then identify a reasonable number of towns that are reasonably similar to Sharon. That is the big idea in cluster analysis. Fortunately, we are not limited to two attributes or two dimensions; we can have more than that. With that, I think you now know enough to follow the demo. Walking into this demo, I had gathered data from a variety of state and publicly available sources, used Query Builder to build a large data table, and inspected it for outliers and missing data. The one real outlier is the city of Boston, which is just unique, so that is excluded from all the analysis. There is a little bit of missingness, but nothing terrible. I am going to be showing you a JMP project. Let me switch gears, move into the demo, and I hope that I do this correctly. What we are looking at is my data table of 351 cities and towns.
The first several columns are identification: the size of the Select Board, the legislative option, and the name of the community. The next 20 columns are our predictors. Just to ground us a bit, if we look at some basic descriptives of the communities, towns in Massachusetts tend to be on the small side; the median is only about 10,000 people, and Sharon is quite near the mean community size. In terms of legislative function, 74% use open town meeting, so we are in good company, and in terms of the size of the Select Board, which is another thing the governance committee is looking at, it is just about 50/50: half of the towns with a Select Board have three members, half have five. So we have these 20 predictors. One issue that comes up pretty early in the analysis is collinearity. Here I have five variables that all speak to the size of the electorate and the size of the town, and you can see that there are some very strong correlations. Generally speaking, we don't want to deal with so much collinearity, and one way out is principal components analysis. At this point we are not quite ready to jump into clustering; we want to take those 20 columns and distill them down, conserving as much information as possible while reducing the redundancy and collinearity across columns. To do that, principal components analysis is an excellent option. I don't have the ability today to give a full crash course in principal components analysis, but we can see that we have variables that seem to be overlapping in terms of their message. We can also see that when you give a PCA 20 columns, it initially comes up with 20 components, the first few of which seem to capture most of the variability. We have to make a decision about how many principal components to use and what they represent. For this, the scree plot is helpful; we are looking for a kink or an elbow in the plot, which seems to happen somewhere around four, five, or six components. If we consult the loading matrix to see how the different variables load onto the different components, we can begin to subjectively assign meaning to the components. I will cut to the chase: we selected six principal components as being informative for the purposes of cluster analysis, and we came up with interpretations that made sense to us, things like how big the town is, how affluent it is, and how fast it is growing. Now we are ready for clustering. Back to JMP. There are two basic approaches to clustering; you might think of them as top-down and bottom-up. Both of them take the raw data, standardize it, and then compute Euclidean distances for each pair of rows in the data table, that is, for each pair of communities, taking into account the six factors, the six principal components: size, affluence, education, things like that. So which towns are similar to Sharon in hierarchical clustering? The report starts us off with a dazzling graph that, with 350 rows, is hard to interpret, so let's begin with something that is a little easier to interpret and come back to it: the cluster summary. In the hierarchical method, JMP has found 16 clusters for us.
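Here is a minimal scikit-learn sketch of the same distill-the-columns idea described above, for readers who want to experiment outside of JMP: standardize the predictors, inspect the explained variance for an elbow, and keep a handful of components. The simulated town table, the placeholder column names, and the choice of six components are assumptions; the real analysis used the committee's 20 predictors in JMP's Principal Components platform.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for a table with one row per municipality and 20 candidate predictors.
rng = np.random.default_rng(2)
towns = pd.DataFrame(rng.normal(size=(350, 20)),
                     columns=[f"predictor_{i}" for i in range(1, 21)])

# Standardize, then extract principal components.
Z = StandardScaler().fit_transform(towns)
pca = PCA().fit(Z)

# Scree information: look for the "elbow" in the explained variance.
print(np.round(pca.explained_variance_ratio_[:8], 3))

# Keep the first six components as the clustering inputs.
scores = PCA(n_components=6).fit_transform(Z)
```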
I can tell you, because I peeked, that Sharon turns out to be in cluster 15: Sharon and 23 other towns. For example, if you scan down the affluence column, we see that, again on standardized scores, these are the most affluent towns in the state. If we come over to the growth column, which is largely growth in housing units and population, these towns show the least growth, in fact some negative growth. Coming back up here, having looked at the cluster summary, all of the clusters have been identified and colored, and JMP gives us a cut point. If we zoom in on cluster 15 and make that a little bigger, Sharon is here in the center. Its nearest Euclidean neighbor is Winchester, which is about an hour's drive away. We now have a provisional list of towns to consult with. All right, so that is a crash course in hierarchical clustering; I will move on to K-Means. Hierarchical is bottom-up: we start with 350 individual towns as clusters, interrogate the distance matrix of all the pairwise distances, find the two towns that are nearest to each other, and they form a cluster. We take the mean distance for that cluster, then look either for the next two nearest towns or for the next town closest to that cluster, and iterate to build the tree until we have one gigantic cluster of 350 towns. With K-Means clustering, we flip the process: we start with 350 towns in one cluster and then begin slicing and dividing in multiple dimensions. In this approach, with the same Euclidean distances and the same distance matrix, we end up with Sharon being in cluster number 4, with a full complement of 33 towns. We automatically get a cluster means picture: again very affluent and low growth, not necessarily the lowest, but low growth again. So we get slightly different results. In the interest of time, I will show you one other graph. There are various things to look at, but let's look at the parallel coordinate plots. What is this tool? We have 16 clusters, and by the way, in K-Means it is up to the user to specify the number of clusters; I chose 16 as a starting point because that is what hierarchical gave us. Here we are at cluster 4; the dark brown line is Sharon, and here we see the six characteristics, the six principal components. For example, we can compare how cluster 4 differs from cluster 3, say, or cluster 5. Cluster 5 is maybe similar in size but less affluent, and property values are a little lower. Permanent population refers to communities with universities, hospitals, prisons, and so forth, as well as vacation homes and snowbirds who leave for the winter; towns differ in terms of their permanent populations, and in cluster 3 it is much lower. That is where we find our university towns. I just popped up the town of Shirley, Massachusetts, home of the state's largest maximum-security prison; I don't know whether we consider those folks permanent residents or not, but in any event, we have now done two different clustering methods. Let's take a look at how the results compare. Within each clustering method, I saved the cluster assignments for each town and created binary variables.
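The two clustering passes just described can be sketched outside of JMP roughly as follows, using SciPy for the bottom-up (hierarchical) method and scikit-learn for K-Means, both on the six component scores. Everything here, including the simulated scores, the Ward linkage choice, and the row index standing in for Sharon, is an assumption for illustration; JMP's Hierarchical Cluster and K Means Cluster platforms have their own defaults and options.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# scores: (n_towns, 6) principal component scores; town_names in the same row order
# (both simulated here), with a placeholder row index for Sharon.
rng = np.random.default_rng(4)
scores = rng.normal(size=(350, 6))
town_names = [f"Town {i}" for i in range(350)]
sharon_row = 17

# Bottom-up (hierarchical) clustering on Euclidean distances, cut into 16 clusters.
tree = linkage(scores, method="ward")
hier_labels = fcluster(tree, t=16, criterion="maxclust")

# K-Means alternative with k chosen by the user (16 here, to match hierarchical).
km_labels = KMeans(n_clusters=16, n_init=10, random_state=1).fit_predict(scores)

# Towns that share a cluster with Sharon under each method.
hier_peers = [n for n, lab in zip(town_names, hier_labels) if lab == hier_labels[sharon_row]]
km_peers = [n for n, lab in zip(town_names, km_labels) if lab == km_labels[sharon_row]]
```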
Are you in the same cluster as Sharon, or are you in a different cluster? As an aside, JMP has lots of wonderful built-in geographic maps, but it does not have a built-in map showing municipality borders within the state of Massachusetts. It turns out that with JMP it is fairly easy to create a new geographic map, and I was able to do this without very much work at all. Here are the results of hierarchical clustering, cluster 15. Sharon is here, and its similar towns are in blue. I was pleased to see that my little tiny hometown of Marblehead is similar to the place I moved to. Hierarchical clustering gives us these 24 towns in cluster 15; K-Means clustering gives us some more towns, but there is an awful lot of overlap. Just out of curiosity, I also looked to identify the 33 towns, about 10% of the state, that are most similar to Sharon. This is a larger group, and again there are an awful lot of repeats. That last approach also gave us some other advantages, and I want to shift back to PowerPoint now and talk about some of those. One last point before I finish the demo. Mostly our goal was: who shall we interview, who shall we call in to meet publicly with our committee? But while we are at it, let's see what our peers do in terms of governance. Statewide, open town meeting (OTM) dominates at 74%, and there is no dominant board size. If we look at who is in our cluster, let me use K-Means because it is a slightly larger group: that 74% jumps up to 85% with open town meeting, and by a two-to-one margin these towns have five-member Select Boards. Now, this isn't definitive. To channel my mother: if all the other towns jumped off the Empire State Building, we wouldn't necessarily want to jump off too. But it is interesting to note that the towns most similar to Sharon favor the five-member board and are even more inclined toward open town meeting. With that, let me get to some conclusions. So what did we learn? One thing we learned is that geographic proximity is uninformative; in those maps, none of the abutting towns came up blue, so our most similar communities are not our next-door neighbors. As I just noted, open town meeting and five-member boards really predominate. So what did we do? This work actually happened several months ago. We were able to prioritize our outreach and begin contacting the towns most similar to us, and many were extremely cooperative and shared a lot of information and data. We also didn't want to assume that open town meeting was the only option to consider, so we wanted to talk to people with representative town meeting or councils. Those Euclidean distances became instructive there: none of our immediate, closest neighbors use a town council or representative town meeting, but which RTM town, which council town, is most like us? We contacted those folks as well. We went from having to contemplate outreach to 350 towns, in a limited amount of time with no money and no staff, to a focused sampling method. Then, because town officials talk to one another and are professionally active, that led us to other interviewees.
With  that,  I  think  that's  about  my  time. I  hope  this  has  been interesting  and  constructive. Thank  you  for  coming and  I  hope  you  enjoy the  rest  of  the  program.
Recent events have brought about much discussion in both the popular press and the scientific literature about the safety and efficacy of some recent vaccination programs. One frequently referenced data source is the Vaccine Adverse Event Reporting System (VAERS), which now covers more than 20 years. In this session, we will demonstrate the Exploratory Data Analysis and Data Visualization capabilities of JMP Statistical Discovery software. We will begin by using the Subset, Join, and Concatenate platforms from the Tables menu, followed by Graph Builder from the Graph menu. Finally, we will make use of some Screening platforms from the Analyze menu. In general, we will show how to use JMP’s “front end” data selection and management tools, drag and drop interactive graphs, and linked analyses to speed the time to discovery for a large, complex dataset.     Good morning, good afternoon, and good evening, everyone, wherever you are. My name is Stan Siranovich, and I am the principal analyst at Crucial Connection LLC, and I am doing this presentation from Jeffersonville, Indiana, right across the river from Louisville, Kentucky. Today I am going to talk about how to do an exploratory data analysis and visualization of an online vaccine database using JMP Pro 16. That online database is the VAERS system. So what is VAERS? It is the Vaccine Adverse Event Reporting System, a national early-warning system to detect possible safety problems in US-licensed vaccines. It is the same database that has been in the news for the last year or two. Let's talk about why it was developed. First of all, it was developed to detect new, unusual, or rare vaccine adverse events; to monitor increases in known events; to identify potential risks; and to assess the safety of newly licensed vaccines. Nowhere in those goals do you see anything about making analysis of the vaccine data set easy, which is why we are doing this presentation. Now, it is maybe structured a bit differently than some of us are used to. I came out of the lab and production facilities, and I am used to, for lack of a better word, rather small chemical and scientific data sets. On the rare occasions when I did work on something larger, somebody in the corporate hierarchy cleaned all the data for me. That is not the case here. It is organized by year, and there are three tables per year: there's vax, there's data, and there's symptoms. What I did was, the first week in June, I downloaded all the data as of May 31st, and this is what it looks like. You can download to your heart's content; you do have to sign in. Over on the right side of the screen, let's see what I got. Notice that for the years 2018, 2019, and 2020, the zip files are roughly the same size. By the way, each zip file is simply the three files for that year zipped into one, and that is normally the best way to work with it: download the zip and unzip it. But notice what happens between 2020 and 2021: we go from 11.2 MB up to almost 169 MB, roughly a fifteen-fold increase. So something is going on, and we would like to take a look at it. This is what it looked like on my desktop. Now let's talk about the tables there. I mentioned there are three tables per year.
There's  vax; contains  all  the  vaccine  information. It's  got  information,  such  as, almost  100 %  unique  VAERS  ID  at  the  top, manufacture,  lot  type and  the  data  is  where  a  lot  of  the  data that  we're  going to  be  interested  in  resides. Notice  it's  got  the  VAERS ID  again, and  it's  got  some  different  columns that  we're  interested  in, such  a  state,  age in years,  sex,  symptoms, which  is  a  free  form  of  a  text  field that  sometimes  seems  to  go  on  forever, whether  or  not the  patient d ied,  et  cetera. Then   VAERSYMPTOMS, contains  just  the  VAERS  ID  and  symptoms, and  they  go  from   1-5, and  sometimes they  continue  on  from   6-10, and  we  will  address  that  issue towards  the  end  of  this  presentation. Let  me  get  out  of  that, drag  that  over. Right now,  you  should  see the  JMP  window  open. I  am  going  to  present  from  a  JMP  project. Let  me  go  over  that  very  quickly. By  the  way,  I  assume everybody  watching  this has  seen  a  JMP  window  before. This  is  my  workspace. I  drag  the  three  files  in and  opened  them  up in  JMP  down  here  as  contents. I  opened  up  a  new  instance  of  this. When  I'm  working  on  a  project, I  drag  my  links  and  maybe  some PDFs  or  whatnot  into  that  space. Then at the  bottom,  we  have  recent  files. But  mainly  what  we  want  to  do, Iis  look  at  this  window, and  the  main  window  here, notice  we  have  tabs  here, and you  click  on  the  tab  just  like  a  spreadsheet and  we  can  see  our  different  sheets. Let's  start  cleaning. First thing  I'm  going  to  do  is  make... Where is it? Where  there  is  VAX. Make  that  the  home  table, and  you'll  see  why  in  just  minute. So  I'm  going  to  start  there. Now   notice  in  VAX_TYPES , everything's  mixed  in  together. We  got  the   COVID-19, which  we  are  going  to  focus  on, HPV9. Scroll  down  a  little  bit, we  have  unknown. We  have  Flu  X,  et cetera. So  what  we  want  to  do is  separate  up  to COVID- 19. The  way  we  do  that, and  I  am  going  to  go  up  to Rows Row  Selection, Select  Where. By  the  way,  in  JMP, there's  almost  always several  ways  of  doing  things, but  to  keep  things  uniform, I  will  always  go  up  to  the  menu  bar to  do  this  type  of  thing. What  we  want  to  do is  separate  COVID- 19. So  I  select   VAX_TYPE. From  the  dropdown, I'll  just  leave  the  default  to  equals, and  I'll  put  in  COVID-19 and  I'll  come  down  here and  uncheck  the  Match  case. Let's  see, it's  checked  over  the  window. It  tells  us  what  we're  going  to  do; select  rows  in  the  data  table that  match  specified  criteria. It  looks  like  it. Let's  see. Click  OK. And  there  it  is. Notice  it  selected  all  the  rows. It  skipped  the  HPV9 and  the  unknown  here,  et cetera . What  we  want  to  do, now  that  we  selected  them, is  to  separate  them  out. So  I  will  go  to  Tables,  Subset, and  it  tells  us  what  we're  going  to  do. We're  going  to  create  a  new  data  table from  selected  rows  and  columns. I  went  to  Selected  Rows, which  check  for  us  already. Notice  it  says  here, we  can  save  the  script to the  table. Normally,  that's  a  good  practice. Makes  it  much  more convenient  to  repeat  things. But  I'll  leave  that  unchecked  for  now, since  this  is  a  demo. Is  there  anything  else  I  need  to  do? Yes,  it  says,  Output  table  name. 
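Outside of JMP, the same Select Where and Subset step might look like the pandas sketch below. The file name, the encoding, and the exact VAX_TYPE code are assumptions about the download rather than part of the demo (the raw VAERS files typically code this as COVID19 without a hyphen, so it is worth checking the column values first).

```python
import pandas as pd

# Hypothetical file name; the real VAERS download is one VAX, DATA, and SYMPTOMS file per year.
vax = pd.read_csv("2021VAERSVAX.csv", encoding="latin-1", low_memory=False)

# Select the COVID rows, ignoring case, and subset them into a new table --
# the analogue of Rows > Row Selection > Select Where followed by Tables > Subset.
mask = vax["VAX_TYPE"].str.upper().isin({"COVID19", "COVID-19"})
covid_only = vax.loc[mask].copy()
print(len(covid_only), "COVID-19 rows selected")
```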
For  simple  analysis, JMP  will  take  care  of  that  for  you. But  for  anything  it  starts to  get  a  little  bit  complicated. I  recommend  deciding on  some sort of  a  naming  scheme. So  rather  than  Subset, I  am  going  to  name  that  COVID- 19  only. Click  the  OK  button. Here  it  is,  COVID  only. We  can  scroll  down,  and verify  that. Notice  down  here  in  All  rows, we've  selected  129, 975. So  keep  that  one  in  mind, because  over  here,  we  started  off... Oh, no, it's  data. Where  was  it? Back. Right  here. We  started  off  with  146, 500. We  just  got  the  COVID  right  there. We'd  like  to  see  what's  going  on. Notice  here,  we  don't  have  any  symptoms, we  don't  have  any  adverse  events. What  we're  going  to  have  to  do is  get  them  out  of  the  data, and  that's  right  here. Now  let's  take  a  quick  look  at  that. We've  got  the  columns  displayed  over  here. We've  got  all  sorts  of  things. They  died,  length  of  stay, onset  date,  et  cetera and here's  the  VAERS  ID. Now  what  we're  going  to  do is  to  join  the  two  tables  on  the  VAERS  ID. Let's  go  back  over  here to  our COVID- 19  data  only. There's  VAX. We've  got  that. Now  what  I'm  going  to  do is  go  up  here  to  Tables. We  want  to  come  down  here  to  Join  tables. I  won't  get  into  that  database  stuff, but  let's  just  say  we  want  to  join  tables. Now  I  went  to  VAERS VAX , and  did  it  because  it  says,  up  here, "Join  this  with   VAERS VAX," and  down  here, what  I  have  to  do  is  a couple  things. First of  all, let  me  now  save  that  to  later. Go  down  here and  select  VAERSDATA . Notice  we  have  some  windows that  pop  up  here. That  shows  us  all  the  rows in  the  second  data  table. Now  we  have  to  match  the  rows. We're  going  to  come  here in  the COVID- 19  data  table, the  VAERS_ ID, and  then  jump  into  the  end. So  we're  going  to  click  here and  they are  two  separate  windows, so  you  don't  have  to  get  to  control. What  we  want  to  do  is  match  them so  that  things  don't  get  mixed  up. Let's  look  down  here, and  we  get  to  some  data  table  stuff. It   tells  us  that  it  is  an  inner  join. An inner  join, selects  rows  common  to  both. But  I'm  not  sure  about  this  data. So  what  I  want  to  do  is  come  over  here to  Main  Table  and  select  Left  Outer  Join. That's  going  to  keep  all the  ones  in  the  original  table, which  was  the  VAERSVAX  table, and  all  of  the  matching  entries in  the   VAERSDATA  table. Let's  look  a  little  bit. Again,  we  can  save  script. Let's  give  it  a  name. Let's  call  that  one, VAX join DATA and  see  if  there's  anything  else  we  need. Yeah,  you  know  what, we  could  do  this  in  a  two- step  process. But  why  don't  we  do  it  all  in  one? Let's  go  back  up  here  to  the   VAERS VAX, and  what  we'll  do is  we'll  keep  the  VAERS_ID, because we don't  want to know what  that  is. We  don't  need  TYPE, because  they're  all  the  COVID-19 . Well,  we  may  want  to  look at  different  manufacturers  in  the  lot, whether  it's Series  One, Two,  Three,  or  Four,  whatever, a   VAX_ROUTE  and   VAX_SITE. We  won't  worry  about  that  for  now. Come  down  here to  the  other  data  table  and  see... What  do  I  want  to  check? We  don't  need  the  VAERS_ ID  again. Let's  check  STATE,  age  in  years. 
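The Left Outer Join with a hand-picked set of columns has a direct pandas analogue, sketched below. The column names are taken from the VAERS layout described in the demo and should be checked against the actual download; covid_only is the subset table from the previous sketch.

```python
import pandas as pd

# VAERSDATA table for the same year (file name and encoding assumed).
data = pd.read_csv("2021VAERSDATA.csv", encoding="latin-1", low_memory=False)

keep_vax = ["VAERS_ID", "VAX_MANU", "VAX_LOT", "VAX_DOSE_SERIES"]
keep_data = ["VAERS_ID", "STATE", "AGE_YRS", "SEX", "SYMPTOM_TEXT",
             "DIED", "HOSPDAYS", "NUMDAYS"]

# Left outer join: keep every row of the vaccine table and attach any matching
# VAERSDATA fields by VAERS_ID.
vax_join_data = covid_only[keep_vax].merge(data[keep_data], on="VAERS_ID", how="left")
```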
Notice,  there's  a  couple other  age  columns, but  we'll  leave  those  go  for  now. We  don't  really  need  them. We  want  to  know the  response  by  sex,  probably. Don't  know,  SYMPTOM_TEXT? Yeah,  we'll  keep  that  in. We  certainly  want  to  know whether  or  not  the  patient  died. They  died. Let's  do  HOSPITALDAYS , which  is  how  many  days they  spent  in  the  hospital if  they  went  to  the  hospital. Let's  do   NUMDAYS. Now  we  know, from  checking  the  VAERS  website. NUMDAYS  is  simply the  difference  between  VAERS state. and  the  onset  of  symptoms  states. So  that  tells  us, how  far  after  the  vaccination. And  we've  got a  whole  bunch  of  other  fields  here, and  I  don't  think  we  need  any  of  those. Let me scroll up, make  sure  everything's  still  checked. I  hope  I  have  everything  that  I  need. And  I'm  going to  click  this  one  here . It  says,  "Select  columns  for  join  table," in  case  you  can't  see  it. I'm  going  to  hit  Select and  put  them  all  in, I  hope,  and  we'll scroll  down  to  check. A gain,  we  can  save  the  script  to  table, but  let's  hit  OK. We  named  it  VAX  join  DATA. There  it  is, VAX join DATA  over  here. Now  notice  we  have the  manufacturers,  the  dose,  the  state. Let's look  at  state. Have  a  number  of  missing  values and  there's  some  other  considerations, there  too  in  that  column. But  we'll  get  to  those  later. Let's  see,   SYMPTOM_TEXT,  HOSPDAYS . We're  all  set. Let's  look  at  this. Let's  expand  this. This is  just  a  free- form  column. Let's  make  use  of  one of  my  favorite  features; Show  Header  Graphs. Let's  hit  that. And  wow! Let's  see  what  we  have  here, let  me  pull  that  over. We  have,  let's  look  at  manufacturers. We  want  to  see  if  maybe  one  manufacturer or neither  has  more  adverse  events. It  looks  like  BIONTECH is  by  far  the  largest. Then  we  have  some  unknowns. Look  at  this. We  have  almost  8,000   VAX_LOTs. So  if  you  want  to  do an  examination  by  lot, that's  going  to  be  rather  difficult. We  see   VAX_DOSE_SERIES  here, and  it  looks  like  Series  1 has  more  than  series... Though,  that's  a  little  strange. Then  comes  2,  then  comes  3. How  did  that  happen? Now  we  just  note  that  and  move  on. Notice  for  STATE, we  have  1, 2, 3, 4, 5 plus  54. Well,  last  I  heard we  didn't  have  59 s tates so  were  going  to  have  to  check  that. AGE_YRS  looks  okay. It  looks  like  we  have a  whole  lot  more  females  than  males. SYMPTOM_TEXT , lot  of  stuff  there. We  noted  all  that. We  could  close  that and  let's  do  break  up the  monotony  here,  the  cleaning. Let's  do  an  analysis. Let's  go  to  everybody's favorite  platform;  Graph  Builder. We  go  up  to  the  menus,  select G raph and  select  Graph  Builder  from  that. Let's  see  what  do  we  have  here. How  about  I  did  bring in  hospital  days,  I  hope. Yes. Let's  do  hospital  days. We'll  select  it,  put  it  on  the  y, and  let's  see  if  there's  any  differences between  amongst, or I  should  say  the  states. Didn't  want  to  do  that. Just  wanted  to  select  STATE. Put  that  in  the x, and  I'll  select  Bar  Graph. There  are  the  bar  graph. Notice  we  do  have  some  anomalies  here. Let's  look  down  at  the  x-xaxis. We  have  State AS , State MH,  State  PW,   State QW. That  isn't  right. 
So  we  know  we  have  to  do a  little  bit  more  cleaning  on  that. But  this  is  a  demo, so  we'll  leave  those  in  there  for  now. That's  pretty  easy  thing  to  figure  out. What  we  want  to  do  is... Come  on,  keeps  popping  up. You  can  be  a  little more  careful  about  that. Come  up  here, and  I  right- click  on  the  x-axis and  come  up  here  to  order  by and  to  hospital days  descending. There  it  is. We  have  some  unusual  results  here. It  looks  like  Wyoming. Oh,  I  should  point  this  out  too. JMP  automatically selected  the mean  for  us. Which  for  our  purposes  is  probably  okay. It  looks  like  the  mean  hospital  stay for  the  people  who  suffered an  adverse  event  Wyoming is,  I  don't  know, 20  in  a  fraction  days. Which,  man,  that's  high. Come down  here  to  the  next  one,  Vermont, and  it  gives  us  the  number  of  rows. Put  that  in  automatically for  the  hover  label. A fter  that,  it's  Mississippi, and  see  Oklahoma  and  Utah. Let  me  find... Let's  see,  New  York,  Pennsylvania. They're  all  down  in  here. For  some  reason, I  don't  know if  it's  just  the  chance  or  what, but  we  make  note  of  the  fact that  a  couple  of  the  sparsely  populated states  seem  to  have  longer  hospital  stays. Let's  go  back  to  here. And  this  is  another  reason why  I  like  to  use  the  JMP  projects. You  don't  have  to  go hovering  over  closed  windows. Let's  go  to  Graph. We  want  to  do that  graph   [inaudible 00:20:22]. Yeah,  let's  do  one  more  example. Go  to  Graph  Builder, and  we'll  take  STATE, put  that  in  the  x-axis. And  we  look  at  what  the  JMP  did  for  us. And  it  looks  like  we  have some  high  numbers  here. And  we  hover  over  California,  for  example, and  it  says  it's  got   10,629  rows, and  it  lists  them, and  it  gives  us  the  state. What  JMP  did  here  was  automatically put  the  number  of  rows  in  there  for  us, which  is  a  reflection of  the  number  of  patients who  reported  an  adverse  event. Let's  see,  Ordered  by  state,  descending. So  let's  do  that  order  again and  take  a  look  at  it. This  makes  sense. California  is  far  and  away the  leader  with,  unfortunately for  them  with  adverse  events, but  it  is  also  a  highly  populated  state. The  next  one  is  Michigan, the  same  deal  Florida. So  it  seems  to  somewhat mirror the  population. There's  New  York. We'll  just  leave  that  open for  now  and  move  on. Now  let's  take  a  look at  the  VAER  symptoms. Click  that  tab. First thing  we  noticed, and  I  did  not  plan  this, but  notice  VAERS_I D, row  one  and  two. It's  the  same  number. How  can  that  be? That  is  supposed to  be  a  unique  identifier, but  that  is  somewhat  common in  the  VAERs  data  set. Let's  take  a  look  why. By  the  way, I  think  this  is  an  excellent  reason to  always  spend  some  time, even  if  you're  in  a  hurry, to  look  at  the  data  and  see if  anything  looks  a  little  bit  weird. And  sure  enough,  two  rows with  the  same  row  number. Now  notice  symptoms. This  is  the   SYMPTOMVERSION, or  rather,  this  is  the  MEDRA  database. It's  how  the  medical coders  code  the  symptoms. So  there's  some  degree of  uniformity  nationally and  also  internationally. This  is  the  MEDRA  version for  this  entry  right  here. 
And  there  are  only  two  versions that  I've  run  across  so  far in  my  work with VAERS , and  that's  24,  one  and  25. So  we  have  symptom  one, we  have  symptom  two,  chest  pain. We  have  symptom  three,  the  heart  rate, and  it  goes  on  and on. Then  we  come  back  and  we  have  some... Looks a little  weird  here, SARS-CoV-2 and  whatever  in  this  duplicate  row with  the  duplicate  number  ending  in  266, which  is  really  not  a  duplicate, because  there  is  only one  entry  out  of  the  five  in  that  row. So  that's  a  bit  disconcerting. But  we're  going  to  take  care  of  that. What  we're  going  to  do  here, is  a  feature  in  JMP, and  it's  stacking  the  variables. If  we  wanted  to  do  in  analysis on  symptoms  from  this  table, what  we  would  have  to  do  is  go  and  run  it on  each  one  of  the  columns, et cetera,  et  cetera. But  if  we  stack  the  columns, we  won't  have  to  do  that. So  let's  stack  the  columns. So we  come  up  here  to,  again,  Tables, and  we  come  up  to  Stack, and  we  select  Stack. So  let's  pick  SYMPTOM1. Hold on  to  control or  command  if  you've  got  a  Mac. Up  here  to  five, take  a  look  at  our  check  boxes. We  want  to  keep  everything, again,  s ave  script  to  source  table, if  we  want  to, and  in  case  something  goes  wrong, we  may  want  to  keep  dialogue  blocks  open, but  I  will  not  do  that. Now,  ask  us  where  to  move  the  columns? Well, t o  last. But  it  would  be  much  more  convenient if  we  moved them  after the VAERS_ ID. So  I'll  click  that  radio  button, click  the   VAERS_ID, I'll  put  table  name, and  we'll  just  label  that, how  about  SYMPTOMS STACKED? We'll  come  up  here,  put  them  in, five,  see  if  there's  anything  else. Oh,  yeah,  new  column  names. If  you're  stacking  several  columns, which  is  often  a  case, especially  if  you're  trying  to  pull in  some  data  from  a  PDF  file that  was  made  for  human  consumption, that  could  be  an  issue  here. But  for  right  now, we'll  just  leave the  stacked column  labelled, D ata and the  source  column  labeled,  Label, because  it  makes  sense. And  one  final  check, I  hit  the  OK  button,  and  there  it  is. Well,  let's  take  a  look  at  that  one. Here  we  go,  the  VAERS_ID, and  sure  enough, we  have  the  label, and  here's  the  data. Here  we  have  our  favorite  row, the  one  that  ends  in   266. Let's  take  a  look  at  that. What  happened? One goes  from   1-10. So  we  have  ten  instances  of  the  same  row, hence  table   1, 2, 3, 4, 5, which  are  filled  in. And  then  it  starts  again with  SYMPTOM1, and  the  rest  of  them  are  empty. So  that  adds  up  to  the  10. Here's  our  MEDRA  data. That  makes  sense. I  know,  because  I  did  it  before. This  is  where  we  want  to  be. So  let's  go  up  to  Rows  again. Row  Selection, select  WHERE. In  the case of  the  nomenclature here  sounds  an  awful  lot  like  SQL, that  is  what  JMP  is  doing  under  the  hood. We  want  to  select  some  rows. So  let's  pick   CHEST PAIN, We'll  leave  that all in  caps and  we  come  down  here  again. There's  this  little  check  box, hidden  way  down  here, this  Match  case. I  don't  want  to  match  case, because  I  don't  know how  people  enter  the  data and  how  the  good  people at  the  CDC, clean  the  data  before they  posted  it  on  the  website. 
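The Stack step described above, and the case-insensitive Select Where that follows it, have close pandas counterparts in melt and boolean filtering, sketched below with the same Data and Label names used in the demo. The file name and encoding are assumptions, the SYMPTOMVERSION columns are simply dropped here rather than carried along, and the exact symptom spelling should be checked against the MedDRA terms in the actual data.

```python
import pandas as pd

# Hypothetical file name; SYMPTOM1..SYMPTOM5 are the columns being stacked.
symptoms = pd.read_csv("2021VAERSSYMPTOMS.csv", encoding="latin-1", low_memory=False)

# Stack the five symptom columns into one long column, keeping VAERS_ID alongside
# (melt is pandas' analogue of Tables > Stack).
symptoms_stacked = symptoms.melt(
    id_vars=["VAERS_ID"],
    value_vars=[f"SYMPTOM{i}" for i in range(1, 6)],
    var_name="Label",      # which original column the value came from
    value_name="Data",     # the stacked symptom text
)

# Case-insensitive Select Where on the stacked symptom text, then Subset.
mask = symptoms_stacked["Data"].str.upper() == "CHEST PAIN"
chest_pain = symptoms_stacked.loc[mask].copy()
print(len(chest_pain), "stacked rows recorded as chest pain")
```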
But  by  the  way,  the   VAERS data  set is  not  instantaneous  loaded  up. It  doesn't  go  there. The  CDC  usually  updates it  about  once  a  week, and  they  clean  the  data,  then  update  it. So  Match  case,  CHEST PAIN. It  should  say  something about  the  dropdown. We  want  equals, but  we  could  use  does  not  equals, or  whatever  it  is that  suits  our  purpose. So  let's  click,  OK. Wait  a  minute. Data  equals. That  should  do  it. There  it  is. Here  I  hover  over it, it  says  SYMPTOM STACK . And  we  look  down  here  at  rows  again, I  find  myself  referring to  that  quite  frequently, just  to  get  an  idea  what's  going  on. We  started  off  wit h  890,000  plus  rows and  we  have  3,667  that  were  selected. Let's  subset. Go  up  here  to  Tables  again, Stack,  where  it  is  Subset, we  will  select  subset. Of  course,  we  get  our  pop  up  window that  tells  us  what  subset is  going  to  do  for  us. And  we  click  on  that. Let's  see,  we  went  to  Selected  Rows. We  could  link that  to  the  original  data  table, save  script  to  the  table, Subset  of  SYMPTOMS STACKED . Let's  go  with  our  convention, and  we'll  call  that  CHEST PAINS of SYMPTOMS STACKED. I want to make  sure I  did  everything  right. No. I  want  to  call  that   CHEST PAIN. I  hope  that's  right. Click  on  the  OK  button,  and  there  it  is. We'll  take  a  look  here. Notice  we  only  have  one  row  with   266 and  we  have  all our  other  rows  with  CHEST PAIN  in  it, and  we've  got  3,666  selected. Now  let's  go  back  to  the  data, do  a  bit  more  analysis after  we  did  all  that  cleaning. VAX join DATA. That  looks  right. Let  me  drag  that  over  there. Let's  look  at  a  couple of  different  variables, or  rather  the  graph. This  time,  let's  do  the  summary  tables. So  let's  go  up  to  Tables and  Summary's  at  the  top. And  this  is  what  it  looks  like. Well,  let's  look  at... What  do we  want? Age  and   HOSPDAYS and   NUMDAYS. I  am  going  to  put  that  in  here, but  before  I  can  do  that, JMP  wants  us  to  tell  it what  statistic  to  use. Let's  use  the  mean. I  selected  mean  from  dropdown, and  it's  going  to  give  us the  mean  of  those  three  columns. Now  there  are  some  other  columns here  that  we  have  an  interest  in. I'd  say, well, DIED . Yeah,  probably  interested in  whether  or  not  somebody  died. See  what's  going  on  there. So  we'll  select  that. Now  we  can't  take  a  mean. It's  a  binary  categorical variables  or variable,  rather. So  let's  select  N  and  see . No,  wait,  let's  look  at  STATE. Where  was  it? Where  was  STATE? There  it  is. Now  STATE,  we  have  50  plus and  we  want  a  summary  table . So  50  plus  summary  table. Let's  put  STATE  into  group. And  again,  we  can  save  the  script, but  we  won't  do  it  here. One  final  check, that  looks  right. We  hit  the  OK  button. In  here  is  our  summary  table. Notice  that STATE  again, we  have  some  serious  cleaning  to  do. There's  no  state here, State GU . There's  state MH ,  again. Some  may have  missing values,  et cetera,  et  cetera. And  it  gives  us  a  mean  age in  years  for  all  those  states. And  it  looks  like  the  mean  age  there, is  somewhere  in  the  40 s  thfough to  50 s. Come  over  here. Let's  take  a  look  at  the  mean hospital  days  by  state. And  that  makes  sense. 5, 6, 7  days. 
Looks  like  we're  all hovering  about a  week. Just  taking  a  quick  look  at  it. I  don't  see  any  outliers. And  come  over  here. Do  the  same  with  NUMDAYS . And  that's  the  number of  days  between  vaccination and  when  they  say  the  effect  appeared, and  let's  see,  number  died. One  thing  we  want  to  notice  is  right  here. Fortunately,  there  are  not a  whole  lot  of  people,  thousands  dying. There's  a  couple  with  a  hundred  here, but  notice  here, N(DIED)  with  the  blank  state. Apparently  there's  a  lot  of  state with  number  of  died  missing. So  if  we  want  to  analyze that  in  a  bit  more  detail, we're  going  to  have  to  be  careful  there, because  there's  a  lot  that  can't be  assigned  to  a  particular  state. And  what  else  can  we  do? Let's  just  take  a  look  at  all  that. And  we  see  our  states  again, these  five  plus   254. We  have  the  number  of  rows  up  here. Let's  see,  what  else  can  we  look  at? Mean(AGE ), haven't  mentioned  it  yet, but  JMP  is  interactive. So  I  click  on  that  bar,  and  let's  see, here's  some  people up  here  in  the  old  age  range, older  age,  I  should  say, excuse  me, or people I can  come  up  here. It's  a  little  hard  select  that  one,  maybe. I  don't  see  anything that  sticks  out  at  me. Fortunately,  number  died, it's  pretty  big  bar  at  zero. So  that's  good. And  the  mean  for   NUMDAYS between  the  vaccine and  yet  first event is  rather  low,  right  here. We'll  just  leave  it  at  that. I  see  that  I  am  just  out  of  time. So  what  did  we  do? We  looked  at  a  large  online  database. We  were  able  to  download the  ZIP  file  on to our  desktop, open  them  up in  JMP. We  were  able  to  do some  rudimentary  analysis after  spending  a  lot of  our  time  data  cleaning. Notice,  even  though  that  we  spent a  lot  of  our  time  cleaning  the  data, we  were  able  to  do  it  in  JMP, which  of  course,  is  very  convenient. Because,  number  one, we  didn't  have  to  switch, I'll  switch  here,  switch  there, run  some  SQL  code  and  bring  it  back  in. Number two,  we  could  do  our  analysis in  line  as  a  bar  while we're  cleaning  the  data. We  say,  "Oh,  that  looks  interesting." We  went  to  our  Graph  Builder, took  a  look  at  the  data, see  if  anything  peaked  our  curiosity or  if  anything  was  out  of  place. When  we  finished  our  examination, we  could  continue  with  our  data  cleaning. In  this  case,  we  went  on  to  the  SYMPTOMS. So  that  concludes  my  presentation. I  hope  everyone  enjoyed  it and  hopefully  learn  something. Maybe  I  should  put a  disclaimer  here  at  the  end. This  is  the  way  I  would  do  it. After  using  JMP  for  a  few  years, it's  not  necessarily the  way  you  should  do it , and  not  necessarily the  best  way  to  do  it. But  using  JMP,  we  were  able  to  bring down  some  data  and  analyze  it,  clean  it. It's  some  data that's  quite  a  bit  in  the  news  right  now. I  thank  you  all for watching.
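For reference, the subset and summary steps demonstrated near the end of the talk could be scripted roughly as follows. This is a sketch only; the table name VAX join DATA and column names such as AGE_YRS, HOSPDAYS, NUMDAYS, DIED, and STATE are taken from the demo and the public VAERS file layout, and may differ in your own download.

// Tables > Subset: keep only the selected chest-pain rows
stacked = Data Table( "SYMPTOMS STACKED" );
chest = stacked << Subset(
    Selected Rows( 1 ),
    Output Table( "CHEST PAIN" )
);

// Tables > Summary: means by state plus a count of the DIED flag
dt = Data Table( "VAX join DATA" );
dt << Summary(
    Group( :STATE ),
    Mean( :AGE_YRS ),
    Mean( :HOSPDAYS ),
    Mean( :NUMDAYS ),
    N( :DIED )
);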
The Functional Data Explorer (FDE) in JMP Pro allows for analysis of a DOE where the response is a curve. The entire functional DOE analysis workflow can be done within FDE - from smoothing response curves all the way to fitting the functional DOE model and optimization with the Profiler. But what about setting up the design and organizing your data for functional DOE analysis? This presentation will help you understand options around functional DOE design using the Custom Design platform and organizing your data using table manipulations such as Stack, Split, Join, and Update.     Hi,  everyone. My  name  is  Andrea  Coombs. I'm  a  Senior  Systems  Engineer, supporting  customers from  major  accounts in  the  eastern  part  of  the  US. Today  I'm  going  to  be  talking  about functional DOE and  specifically  around  design of  your  functional  DOEs and  how  to  prepare   your  data  for  analysis. I'm  going  to  turn  off my  video  for  presentation  here. Let's  go  in  and  look  at  the  goals. Really,  the  goals  are  very  simple here. I  want  to  cover  some  tips  and  tricks for  setting  up  your  functional  DOE using  the  Custom  Design  Platform and  also  give  you  some  tips  and  tricks for  adding  functional  data to  your  DOE  data  table. I  am  going  to  be  using  J MP Pro  16.2 during  this  presentation. Let's  start  off  by  defining what  is  functional  data, what  is  functional  data  analysis, and  specifically,  what  is  functional  DOE. Functional  data  is  really  curved  data. It's  any  data  that   unfolds  over  a  continuum, and  there's  a  lot  of  data that  is  inherently  functional  in form. You  can  think  about  time  series  data, sensor  streams from  a  manufacturing  process, spectra  data  that's  produced by  lots  of  different  types  of  equipment, measurements  taken   over  a  range  of  temperatures. And  the  Functional  Data  Explorer in  JMP  Pro  makes  it  really  easy to  solve  many  kinds  of  problems with  functional  data. Here  we  have  an  example of  some  functional  data. Here  we  have  a  plot  of  Home  Price  Index for  New  Jersey,  from  1990  to  2021. Llike  many  functional  data, we  don't  get  it  as  this  smooth  curve, like  we  see  here, but  rather  we  get a  series  of  discrete  Index  values. So  we  get  one  value  that  represents the  value  for  the  X,  the  year, and  the  value  for  Y,  the  Home  Price  Index. With  functional  data  analysis, it  isn't  typically  just  one  point of  the  curve  that  we  are  interested, or  even  the  collection of  points  from  a  single  curve. We  typically  have  a  collection  of  curves. Here  you  can  see  we  have the  Home  Price  Index  over  time, for  the  50  states, plus  the  District  of  Columbia. And  when  we're  doing functional  data  analysis, we  want  to  understand the  variability  around  these  curves. Often  we  want  to  understand what  are  the  variables that  drive  the  variability  in  our  curves. Or  maybe  you  want  to  use  the  variability in  our  curves  to  predict  another  outcome. Functional  data  analysis is  going  to  use  all of  the  information contained  in  the  curves. We're  not  going  to  leave any  information  behind. To  model  the  curves  directly, we  can  treat  the  curves   as  first  class  data  objects, in  the  same  way  that  JMP  will  treat traditional  types  of  scalar  data. 
When  I'm  thinking  about functional  data  analysis, I  like  to  break  this  down  into  four  steps. The  first  step  is  to  take the  collection  of  curves and  to  smooth  the  individual  curves. The  next  thing  is  we'll  determine  the  mean of  those  curves  and  the  shape  components. These  shape  components  represent the  variability  around  the  mean. Next,  we  extract  the  magnitude of  each  of  these  shape  components. Knowing  the  magnitude of  the  shape  components, the  function  that  describes the  shape  component, the  function  that  describes  the  mean, we  are  able  to  reproduce  all  of  our  curves by  just  knowing  these  two  shape  component  scores. Now  we  can  use  the  shape   component  scores  in  an  analysis. Here  what  I've  done is  I've  done  a  cluster  analysis, and  I've  defined four  groups  of  my  curve  shapes. In  the  Functional  Data  Explorer  itself, there  are  two  primary  questions that  can  be  answered. The  first  is  about  how  to  adjust process  settings  and  product  inputs to  achieve  a  desired   function  or  spectral  shape. We  call  this  Functional  DOE  analysis, or  FunDoE,  for  short. The  second  question  we  can  ask  is, how  can  I  use  entire  functions to  predict  something  like  yield or  quality  attributes? We  call  this  Functional m achine  learning or  FunML,  for  short. Today  we're  going  to  be focusing  on  Functional  DOE. Let's  take  a  closer  look at  the  Functional  DOE  workflow. The  first  thing  you  want  to  do is  to  set up  your  design using  the  Custom  Design  Platform. Then  you  can  go  out,  run  your  DOE, collect  the  results, and  you  want  to  organize  your  data to  get  it  ready  to  put  into  the Functional  Data  Explorer. The  remaining  steps in  our  Functional  DOE  workflow, will  be  done  all  within the  Functional  DOE  Explorer. In  the  Functional  Data  Explorer, we  can  process  our  data, we  can  smooth  our  individual  curves, we  can  extract  our  shape  components, and  then  we'll  use  our shape  component  scores  for  DOE  modeling, and  we  can  use  our  profiler to  address  the  goals  of  our  DOE. All  of  this  is  done  within the  Functional  Data  Explorer. Now,  there  are  many  presentations at  this  Discovery  Summit, at  previous  Discovery  Summit, even  in  our  Mastering  JMP  series on  jmp.com, that  will  go  over  lots  of  details  around the  Functional  Data  Explorer. I'm  not  going  to  be  talking  specifically about  the  Functional  Data  Explorer  today. What  I  want  to  talk  about  is how  do  you  set up  your  design for  a  functional  DOE and  how  can  you  organize  your  data. To  do  this,  I'm  going  to  use this  Bead  Mill  Case  Study. In  this  example,  what  we  have  is we're  essentially  milling pigment  particles  for  LED  screens. You  start  off  with  beads  and  pigment in  this  slurry,  in  this  holding  tank. It  goes  through, it  flows  through  this  milling  chamber, and  comes  back  to  the holding  tank  in  a  continuous  process. So  if  we  were  doing  a  DOE  on  this  process, some  factors  that  we  could  look  at is  the  percent  of  beads  we're  starting  off with  here  in  the  holding  tank, the  percent  of   pigment  particles we're  starting  off  with. We  can  look  at  the  flow  rate through  the  system and  also  the  temperature. 
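The mean-plus-shape-components decomposition described in the four steps above can be written compactly in the standard functional principal components form (the notation here is generic, not taken from the slides):

$$ y_i(t) \;\approx\; \mu(t) + \sum_{j=1}^{p} c_{ij}\,\phi_j(t) $$

where $\mu(t)$ is the mean function, the $\phi_j(t)$ are the shape components (eigenfunctions), and the scores $c_{ij}$ are the magnitudes extracted for curve $i$. A functional DOE model then relates these scores to the design factors, which is how the Profiler can later trace factor settings back to whole curves.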
When  we're  looking at  the  goal  of  this  DOE, we  essentially  want  to  achieve an  optimal  size  over  time  curve. So  let's  take  a  look at  that  optimal  curve. The  optimal  curve  is  represented by  this  green  curve  here. So  essentially,  we  want our  pigment  particle  sizes  to  decrease, so  they  fall  within  specification  quickly. And  our  specification  range  is represented  by  this  green  shaded  area. We  want  those  particles  to  remain within  specification throughout  the  duration  of  the  run. That  is  our  optimal  curve. Let's  go  ahead and  take  a  look  at  data  prep. I'm  going  to  talk  about  data  prep  first, and  then  we'll  move  backwards and  talk  about  the  DOE  design. For  data  prep,  there  are  three  main tips  and  tricks  I  want  to  share  with  you. First  of  all,  I  want  you  to  understand that  the  Functional  Data  Explorer accepts  data  in  different  formats. The  Stacked  Data  Format  is  the  default format,  and  it's  the  most  versatile. But  you  can  also  use  Rows  as  Functions. I'm  going  to  go  over some  table  manipulations, such  as  Stack,  Split,  Join,  and  Update, to  show  you  how  you  can get  your  data  ready  for  analysis. And  then  I'll  also  show  you how  you  can  quickly  import  multiple  files if  your  curved  data  is  stored in  separate  files. What  data  format  is  FDE  expecting? Well,  there's  actually three  different  formats. There's  Rows  as Functions, Stacked  Data  and  Columns as Functions. Let  me  open  up  a  data  table  here and  launch  the  FDE  platform to  show  you  that  there  are  different  tabs up  here  for  these  different  formats. The  Stacked  Data  format  is  the  default. We  have  Rows as  Functions and  Columns  as  Functions. This  example  here  happens to  be  Rows  as Function. Each  row  contains  a  full  function. Here  we  have  the  first  run  from  our DOE, and  the  function  is represented  here  in  these  columns. Each  column  represents an  X  variable  or  an  input, and  then  the  value  within  the  cell is  a  Y  variable. When  we  go  to  populate the  Functional  Data  Explorer, we  can  come  in  here, go  to  Rows  as Functions, our  Y  output  is  represented in  these  columns, we  can  put  in  our  DOE  factors, and our  ID  function,  and  then  you  can go  ahead  and  analyze  that  data. So  this  is  Rows as  Functions. One  thing  to  know  is  that  Rows  as  Function assumes  that  observations are  equally  spaced  in  the  input  domain, unless  you  have  an  FDE X  Column  Property. The  FDE X  Column  Property   is  something  that  comes  into  play when  we  design  our DOE, which  we're  going  to  talk about  here  in  a  second. But  I  just  want  to  show  you  here, next  to  each  of  these  columns, I  have  a  Column  Property   associated  with  it. And  you  can  see here's  the  FDE X  Column  Property, and  the  X  input  value  will  be  two  here. If  you  want  to  use the   FDE X Column Property , I'll  show  you  here  at  the  end how  you  can  use  the  JMP  scripting  language  to  assign  that. So  that's   Rows as Functions. Now  let's  look  at  Stacked  Data. Here's  an  example  of  Stacked  Data, where  I  have  one  run  or  one  curve over  multiple  rows, and  each  row  is an  observation  of  the  curve. So  in  row  one  here, I  have  a  value  for  X  and  a  value  for  Y, and  that  continues  over  multiple  rows. 
This  is  the  most  common and  the  most  versatile  way of  organizing  your   curve data. And  when  we  populate the  Functional  Data  Explorer, we're  here  in  the  Stacked  Data  format, we'll  put  in  our  X and  our  Y  of  our  function, put  in  our  ID  of  the  function, and  then  we  can  put  in  our  DOE  Factors here  as  supplementary  variables. The  last  type  of  format  that  the  functional  data  can  use is  Columns  as  Functions. I've  never  seen  data  organized  this  way, and  it's  a  little  perplexing. It's  hard  for  me  to  get  my  mind  around why  you  would  organize  your  data  this  way, but  I'll  show  it  to  you  even  though it's  not  very  common. In  this  example,  each  row  is  the  level of  your  X  variable  of  the  function. So  here  we  have  a  column  for  time, and  each  row  represents  the  X measurement, and  then  each  column  represents  a  run. Let's  go  ahead  and  launch the  Functional  Data  Explorer. We'll  come  over  here to  Columns  as  Functions. We  can  put  in  our  X  variable, which  is  time,  and  all  of  our  output variables,  which  are  each  of  our  batches. And  you'll  notice  in  here  we  cannot  input supplementary  variables because  we  don't  have  any  way  of defining  which  factor  or  which  treatment is  associated  with  each  of  these  runs. So  you  cannot  do  Functional  DOE with  Columns  as  Functions. Now  let's  talk  about  getting your  data  into  your  DOE  data  table. To  do  this,  we're  going to  use  the  Tables  menu. We  have  lots  of  different  platforms here where  we  can  manipulate  our  data. The  two  things  that  we  may   want  to  do  with  our  data is  reshape  it,  using  Sort  or  Stack, or  we  may  want  to  add  data, by  using  Join,  Update,  or  Concatenate. And  especially  if  you're  new  to  JMP, some  of  these  table  manipulation  platforms can  be  a  little  confusing when  you  start  using  them. The  little  icons next  to  each  of  the  platforms, can  be  very  helpful to  know  which  platform  does  what. So  what  I've  done  in  my  journal is  I've  taken  each  of  these  icons and  I've  blown  them  up  here, so  we  can  take  a  closer  look at  these  icons to  understand  what each  of  these  platform  does. Let's  first  talk  about reshaping  with  Stack  and  Split. Let's  first  talk  about  Stacking. Stacking  is  going  from  wide  to  tall  data. In  this  example, you  have  data  in  multiple  columns, and  you  want  to  combine that  data  into  one  column. Let's  look  at  an  example  here. Here  we  have  wide  data, we  have  Rows as  Functions. Let's  say  we  want  to  Stack  this  data, so we  can  use  it in  the  Stacked  data  format, in  the  Functional  Data  Explorer. We're  going  to  come  up  here to  the  Tables  menu,  go  to  Stack. I'm  going  to  pick  all  of  those columns  I  want  to  stack. And  here  I  have  50  measurements in  each  of  my  functions. I'm  going  to  select  all  50  of  those rows  and  say  I  want  to  stack  them. I  can  come  down  here  and  define  what my  new  column  names  are  going  to  be. The  data  that  I'm  stacking is  actually  my  size  data. My  label  column,  which  happens to  be  my  column  name  here, this  refers  to  my  time. Now,  two  things  when  I'm  doing table  manipulation, I  always  give  my  output table  an  explicit  name. 
Otherwise,  JMP  will  call  it  Untitled, and  it  will  iterate through  untitled  numbers. So  I  like  to  give  them, each  of  my  tables,   an  explicit  name, and  then  keeping  dialog  open. You  can  check  this  box to  keep  this  dialog  open, so  when  you  hit  Apply and  see  your  results, if  you  didn't  get  the  results you're  expecting, you  have  your  dialog  here  to  review  what  you  did and  maybe  fix  what  you  need  to  fix to  get  the  desired  output. Now,  this  data  is  stacked  and  ready  to  go. Let's  go  through  an  example  of  Split. Split  is  when  you're  starting with  tall  data  or  stacked  data, and  you  want t o  split  it  out into  different  columns. In  this  example  here,  I  have  stacked  data, and  let's  say  I  want  to  split  it  out so  I  can  use  Rows as  Functions  in  FDE. I'm  going  to  come  up here  to  the  Tables  menu. I'm  going  to  use  Split. And  this  Split  dialog  is  probably the  most  confusing  of  all  of  them. Even  after  using  JMP  for  many, many  years, I  always  have  to   step  back and  think  about  how  to  populate  this. But  the  Split  by  Columns is  essentially  what's  going  to  be  your new  column  headers. So  I  want  Time  as  my  new  column  headers, and  I  want  to  Split  out  my  size  data, and  I  want  to  be  sure to  group  this  by  Run O rder. A gain,  give  this  an  explicit  name, and  I  can  keep  the  dialog open  to  see  how  I  split  this  data. Now  here  I  have  my  data  is  wide, I  can  use  Rows as  Functions  in  the  FDE. That  is  reshaping  your  data. Going  from  wide  to  tall or  from  tall  to  wide. Now  let's  talk  about  adding  data. A  lot  of  times  you're  starting  out with  a  DOE  data  table  that  you  created, such  as  this. Let  me  just  delete  this  column  out. I want  to  add,  I  want  to  be  able to  join  my   curve data  to  this  table. Here's  my   curve data in  a  separate  table. Essentially,  what  I  want  to  do is, I  want  to  add  the  columns in  the  second  table  to  my  first  table. I'm  going  to  use  Join. Join  adds  columns. I'm  going  to  start here  with  my  DOE  table. I'm  always  going  to  start with  my  DOE  table because  my  DOE  table  has all  of  these  scripts  in  here that  I  can  use  to  analyze  my  data. These  are  very  important. So you  always  want  to  start with  your  DOE  table. And  we're  going  to  use  Join, going  to  join  it  with  our   curve data. You  always  want  to  make  sure  that  you're matching  up  based  on  your  row  numbers, so  the  right  curve  for  the  right  run  goes with  the  correct  factors  for  that  run. And  I'm  going  to  select  all  the  columns in  my  DOE  table  and  my  Functions from  my  Curve  Table. A gain,  I  can  use  an  explicit name  here  when  I  create  this  table. Now, I  have  my  table  ready  for  analysis. That's  an  example  of  Join. Let's  talk  about  Concatenate. I  don't  use  concatenate  too  much for  my  DOE data  prep. Concatenate… You  use  that  when  you  want  to add  rows  to  a  data  table. Then  DOE, we  typically  have  all  of  our  rows, all  of  our  runs  in  our  data  tables. We  don't  need  to  concatenate, but  I  just  want  to  run  through this  example  real  quick. Let's  say  I  have  my  data for  my  first  16  runs. I  have  10  observations  per  run, so  I  have  160  rows. Then  I  run  my  17th  run. It looks like this. That's 160. 
Here's  my  17th  run  with the  10  observations  from  that  run. Essentially,  what  I  want  to  do is  join  this  data  table, or sorry,  concatenate. I  want  to  add  these  rows at  the  bottom  of  this  data  table. I  can  start  here, come  to  Concatenate. We're  going  to  add  this. With concatenate,  you  have  this option  to  append  to  first  table. I'm  just  going  to  add  these  rows, append this  data  table. Now,  we  have  170  rows  of  data. That's  Concatenate. I  want  to  end  up  here  with  Update. Update  can  be  a  very  handy  tool when  you're  populating your  DOE  table  with   curve data. Here's  an  example of  the DOE  data table  I  created. I  have  columns  here to  populate  my   curve data. Here's  my   curve data. Here's  my   DOE data table. Essentially,  what  I  want  to  do  in  Update is  I  want  to  be  able  to  populate my  blank  cell  with  the  information I  have  in  this  data  table. I  can  do  that  by  matching  run  order, and  then  JMP  will  automatically  match  up the  columns  with  the  same  names and  update  this  data  table. Let's  come  here in T ables,  Update. Select  my  table  that  has  my   curve data. Match  on  Run  Order, say  OK. And  now,  this  data  table  is  updated. Those  are  some  table  manipulations you  can  do  to  get your  data  ready  for  analysis. The  last  thing  I  want  to  talk  about is  importing  multiple  files. Let's  say  that  your   curve data  gets  stored as  separate  files  for  each  batch. I  have  this  example  here  of… I  have  my   curve data   in  17  different  files, and  they  happen  to  be  CSV  files. I  want  to  be  able  to  import each  of  these  CSV  files and  concatenate  them  together so  I  have  one  data  table. You  can  easily  do  this  by  using the  Import  Multiple  Files  function under  the  File  menu. When you use Import  Multiple  File, you  can  click  on  this  button  here to  select  the  folder that  contains  all  of  those  files. Here's  a  list  of  all  those  files. Now,  the  file  name  itself  actually contains  my  batch  number, and  this  is  data  that  I  actually want  to  pull  out  of  the  file  name. I'm  going  to  add the  file  name  as  a  column. We'll  import. Here's  my   curve data with  time  and  size, and  here's  my  file  name. Now  I  can  come  up  to  the  Columns  menu and  use  this  column  utility  to  convert my  text  to  multiple  columns. I  just  have  to  put  in the  delimiter  I  want. I'm  going  to  use  the  underscore that's  before  the  batch  number and  the  dot  that's  after  the  batch  number, and  I  can  say  OK. That  gave  me  three  columns: the  curve,  the  batch  number, and  the  file  extension. This  is  the  data  that  I  want. I'm  just  going  to  delete these  other  columns  here. And  now  I  have  all  of  my   curve data for  all  my  17  runs  in  one  file with  the  batch  number. That  is  what  I  wanted to  show  you  for  data  prep. Now,  let's  talk  about setting  up  your  DOE  design. There's  a  couple  of  tricks that  I  want  to  show  you. In  your  DOE  Dialog, there's  two  things  to  think  about. First  of  all,  we  want  to  make  sure we're  removing  this  default  response, and  then  we're  going  to  talk  about how  to  define  the  functional  response based  on  the  format  of  your   curve data. Let's  go  ahead  and  launch  our  DOE  Dialog. We're  going  to  come  up  here  to  DOE, go  to  Custom  Design. 
Here's  our  DOE  dialog. Now,  the  DOE  Dialog,  like  I  said, will  have  this  default  Y  response. If  we  just  have a  functional  response  in  our  DOE, we  don't  need  this  default  response, so  we  need  to  get  rid  of  this. What  we  don't  want  to  do is  just  delete  the  name because  that  response  is  still  there. What  we  want  to  do  is  select that  default  response and  actually  use  Remove  to  get  rid  of  it. Then  we  want  to  add a  functional  response. I'm  going  to  come  here and  add  a  functional  response. When  we're  defining our  functional  response, we  can  give  it  a  name. We  can  say  the  number  of  measurements per  run  and  the  values. Let's  go  ahead  and  do  this  for  our  DOE. Our  responses   size… This  is  what's  on  the  y- axis of  our  function. Then  we  can  tell  the  DOE  platform what  our  X  values  look  like. We  can  define  the  number  of  measurements with  the  number  of  X  values and  what  those  X  values  are. Let's  say  I'm  going  to  measure the  size  every  2  hours. I'm  just  going  to  type  in  here every  2  hours  up  to  20  hours. That  looks  good. The  next  thing  I  need  to  do is  add  my  factors. I  have  saved  my  factors and  my  factor  ranges  to  this  factor  table. I'm  just  going  to  load  in  these  factors. I  have  my  factors  up  here. Next  thing  I  want  to  do is  specify  my  model. I'm  going  to  choose a  response  surface  model which  will  add  all  my  two- way interactions  and  all  my  quadratics. Finally,  I  can  enter  in the  number  of  runs. JMP  is  recommending a  default  number  of  21, but  let's  say  I  only  have  enough time  and  resources  to  do  17. I'll say 17. I  will  ask  them  to  make  the  design  using all  of  those  inputs  that  I  entered. It  just  takes  a  couple  of  seconds for  JMP  to  create  this  design  for  me. Here's  my  design. Here  are  my  17  runs  with  the  treatment I  want  to  apply  for  each  of  those  runs. When  I'm  creating  my   DOE table, I  always  want  to  use this  Make  Table  button. I  always  like  to  include the  run  order  column because  the  order  that  these  runs are  executed  is  very  important. I'm  going  to  include  that  run  order  column and  make  our   DOE data  table. Here's  our  DOE  data  table. We  have  our  treatment. We  have  a  place  for  R  to  enter in  results  for  our  function, and  I  have  my  run  order  column here  at  the  end. I  also  have  my  scripts that  reflect  the  functional  DOE and  also  the  model  I  specified. Whenever  I'm  adding  data, my   curve data,  like  I  said  before, I  always  want  to  add  it  here to  my  DOE  data  table because  it  contains  information about  the  functional  data  analysis and  the  DOE  model  that  was  specified. That's  a  quick  overview, but  I  want  to  give  you  some  tips  about defining  the  functional  response based  on  the  format  of  your   curve data. Let's  come  back  here. We'll  go  back. Let's  come  back  up  here  to  Responses. The  way  that  you  populate this  information  here will  define  how  your  DOE  data  table  looks. I  want  to  give  you  a  couple  tips for  what  to  enter  here, depending  on  what your   curve data  looks  like because  we  want   the  data  prep  part to  be  as  easy  as  possible. There's  a  couple  of  things  to  consider. First  of  all,  is  your  curve data wide  or  tall? 
We  talked  a  lot  about  this,  right? Do  you  have  wide  data? Or  do  you  have  tall  data? Is  it  stacked  or  are  you  going  to have  rows  as  functions? The  other  thing  to  consider  is  whether your  data  is  equally  spaced, if  you  have  the  same  X  measurements, or  whether  your  measurements are  asynchronous. What  do  I  mean  here? Well,  let  me  pull  up  a  couple  examples. In  this  example  here, I  just  have  a  few  measurements  per  run. I  have  10  measurements  per  run, and  they  are  all  equally  spaced, and  I  have  the  same measurements  for  each  run. When  I  go  to  enter  in  this  information, there's  just  a  few  to  populate  here. It's  not  that  difficult. But  you  might  have  a  scenario where  you  have  asynchronous  data. In  other  words, you  have  different  measurements for  each  of  your  runs, and  you  might  have  a  situation  where you're  collecting  a  lot of  data  points,  maybe… My  rule  of  thumb  is if  you're  around  10,  less  than  20,  yeah, go  ahead  and  populate  your  values  here. But  once  you  start  getting  up above  20,  certainly  hundreds, that's  a  lot  of  information to  add  here  to  your  response  here. The  other  thing  to  consider  is if  your  data  will  be  manually  entered, are  you  going  to  manually enter  the  responses or  are  you  going  to  use  Join  or  Update? Let's  run  through  some  scenarios. Let's  say  you  have  rows as  functions. You  have  a  few  measurements, and  you're  going  to manually  enter  your  data. Well,  if  you're  going  to  do  that, then  set  up  your  DOE  data table  like  this. Or  set  up  your… sorry,  your   functional  response  like  this. This  is  what  your  DOE  data  table is  going  to  look  like. You  can  manually  enter your  results  in  here, and  then  you  can  use  this  script  here to  run  your  functional  data  analysis. That's  the  first  scenario. Let's  say  you  have  rows as  functions, you  have  a  few  measurements, but  you  want  to  use  Update to  update  your  datas. Again,  l et's  come  back   and  take  a  look  at  this  example  here. In  this  example, since  I  defined  the  name  of  my  response, I  get  time  in  here  with  my  column  header. Let's  say  when  I  bring  in  my  data, I  just  have  the  number  here in  my  column  header. So  if  I  was  going  use  Update, these  column  names  do  not  match. To  make  these  column  names  match, what  I  want  to  do  is  come  back  here, remove  the  name, and  then  when  I  create  my   DOE data  table, I  just  have  the  number. And  then  I  can  use  Update to  update  this  data  table. Let's  go  ahead  and  do  that. Here's  my  data. Let's  go  ahead  and  update with  my  current  data. I'm  going  to  update  based  on matching  the  row  numbers, and  now  my  data  is  in  here. I  can  use  this  script  here  to  go  ahead and  go  into  the  Functional  Data  Explorer to  start  analyzing  this  data. So  that  scenario, let's  say  I  have  rows as  function. Again,  I  have  wide  data, but  I  have  many  measurements. I  have  many  more  than  10. Let's  say  I  have  50. Entering  the  50  values  in  here doesn't  make  a  whole  lot  of  sense. What  I'm  going  to  do  is  I'm  going  to set  the  number  of  measurements  to  one, and  I  can  just  set the  values  to  one  as  well. When  I  go  to  make  my  DOE  data  table, it  will  look  like  this. I  will  get  my  run,  order. 
I  will  get  my  factors, and  I'll  get  this  blank  column. All  I  need  to  do  is  delete  that  column. Here  are  my  50  measurements for  each  of  my  curves. Again,  I'm  going  to  use  Join, like  I  showed  you  up  above. I'm  going  to  match  based  on  run  order, bring in  everything  from  my   DOE table, my  functions  from  my  curve  table. I  can  give  it  an  explicit  name and  say  OK. In  this  example  here, since  I've  used  Join, I'm  essentially  ignoring  what  I  set  up  as the  details  around  my  functional  response. This  script  here  is  not  going  to  work. In  this  scenario, I  will  have  to  come  back  here, go open  up  the  Functional  Data  Explorer, enter  my  supplementary  variables, my… This  is  rows as  function. Enter  in  my  supplementary  variables, my  run  ID,  and  my  curves. When  I  go  to  do the  functional  DOE  analysis, it  will  come  back  and  look  at the  model  here  that's  specified  here. It's  generalized  regression  script, so  it  will  remember  the  model  that I  specified  when  I  set  up  my  DOE. That  is  that  example. Let's  talk  about  stacked  format. With  stacked  format,  we  typically  are going  to  be  adding  the  data  using  Join. Again,  what  I  populate  here doesn't  really  matter, just  as  long  as  I  have a  functional  response  entered  in  here. Again,  I  get  this  same  data  table. I  can  delete  out  the  response. Oops, I added to it. Delete column. I  can  remove  that  response. Now,  I  can  use  Join  to  bring  in my  stacked   curve data by  matching  on  run  order, bring in  everything  in  from  my  DOE  table and  my  function  data  from  my  curve  table. Again,  for  this  example  here, running  this  script  is  not  going  to  work, so I'm  going  to  have  to  manually  launch the  Functional  Data  Explorer, bring  in  my  X,  my  Y, my  supplementary  variable, and  my  run  order, and  then  I  can  go  ahead  and  execute the  Functional  Data  Explorer when  I  go  to  do  functional  DOE. Again,  as  before, it  will  look  at  the  model  that's  included in  generalized  regression  script that  is  based  on  the  model  that  you specified  when  you  designed  your  DOE. The  last  thing  I  want  to  mention real  quick  is  this FDE X Column Property that  I  talked  about  before. Let's  say  that… this  is  a  scenario  where  I  want  to bring  in  data,  my   curve data, where  I  have  rows as  functions. So  I  have  rows as  functions, I  have  many  measurements. I  want  to  add  the   curve data  using  Join, but  my  column  headings  contain  text. In  this  case, I  have  the  units  of  measurement for  each  of  my  X  values here  in  my  column  headers. I  can  join  this  data  together, bring in  my   curve data by  matching  on  run  order. We then bring  in  all  my  data from  my  DOE  table, bring in  all  of  my   curve data. Let's  say  I  want the  Functional  Data  Explorer  to  recognize the  number  in  my  column  header. Well,  to  do  that, I  need  that   FDE X Column  Property. But  when  I  go  in  here  to Column  Properties, you're  not  going  to  find FDE X Column  Property  here. What  I  can  do  is  I can use a script to define my FDE X Column Property. Actually, it's going to be based on this. What I can do is run  this  script, and  now  I  have  a  column  property  assigned where  I  have  the  number  that  the  FDE will  recognize  as  your  X  value. 
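Before the wrap-up, here is a rough JSL sketch of the Join and Update patterns used in the scenarios above, along with the scripted column property mentioned for the FDE X input. Table names, column names, and especially the "FDE X" property string are assumptions based on the demo rather than verified syntax for every JMP version.

// Hypothetical table references
doeDt   = Data Table( "DOE data table" );
curveDt = Data Table( "curve data" );

// Tables > Join: add the curve columns to the DOE table by row order
joined = doeDt << Join(
    With( curveDt ),
    By Row Number,   // or By Matching Columns( :Run Order = :Run Order )
    Output Table( "DOE with curves" )
);

// Tables > Update: fill blank response columns in place, matching Run Order
doeDt << Update(
    With( curveDt ),
    Match Columns( :Run Order = :Run Order )
);

// Scripted FDE X column property (not listed in the Column Properties menu);
// the property name string and the value 2 are illustrative only
Column( doeDt, "2" ) << Set Property( "FDE X", 2 );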
That  was  my  last  tip  or  trick. Let's  just  do  a  quick  wrap- up, a  review  of  the  tips  and  tricks starting  with  your  DOE  design. You  always  want  to  remove that  default  Y  response before  you  add  your  functional  response. You're  going  to  define your  functional  response based  on  the  format  of  your  curve data because  you  want  to  make your  data  prep  as  easy  as  possible. You  always  want  to  add your   curve data  to  the   DOE data  table to  take  advantage  of  all, not  only  the  FDE  script, but  the  model  script  that  is  created for  you  by  the  Custom  Design  Platform. When  you're  preparing your  data  for  analysis, when  you're  bringing  in  your  curve data, just  know  that the  Functional  Data  Explorer accepts  different  formats, stacked  data  and  rows  as  functions. You  can  use  Stack,  Split,  Join, and  Update to  get  your  data  ready  for  analysis. And  if  your   curve data is  stored  in  separate  files, use  import  multiple  files. I  just  want  to  acknowledge a  couple  of  people. Ryan  Parker,  he  is  the  developer of  the  Functional  Data  Explorer. I  want  to  acknowledge  him for  all  of  his  help  with  understanding all  of  the  wonderful  things that  FDE  can  do. I  also  want  to  thank  Chris  Gotwalt for  his  leadership, also  for  some of  the  slides  that  I  used at  the  beginning  of  my  presentation. With  that,  I  thank  you  very  much. If  you  have  any  questions, I'd  love  to  hear  about  them  in  the  chat. Thank  you.
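To complement the recap, here is a consolidated JSL sketch of the reshaping and file-import steps. Every table name, column name, and folder path is a placeholder, and the Concatenate option simply mirrors the dialog checkbox, so check the exact syntax in your JMP version before relying on it.

// Stack: wide (rows as functions) to tall (stacked) format
wide = Data Table( "wide curve data" );
tall = wide << Stack(
    Columns( :"2"n, :"4"n, :"6"n ),   // ...through the last time column
    Source Label Column( "Time" ),
    Stacked Data Column( "Size" ),
    Output Table( "stacked curve data" )
);

// Split: tall back to wide, one column per time point, grouped by run
wideAgain = tall << Split(
    Split By( :Time ),
    Split( :Size ),
    Group( :Run Order ),
    Output Table( "curve data, rows as functions" )
);

// Scripted alternative to File > Import Multiple Files:
// open each CSV in a folder, tag it with its file name, and append
folder = "$DESKTOP/bead mill runs/";
files = Files In Directory( folder );
combined = Empty();
For( i = 1, i <= N Items( files ), i++,
    If( Ends With( files[i], ".csv" ),
        dt = Open( folder || files[i] );
        dt << New Column( "File", Character, Set Each Value( files[i] ) );
        If( Is Empty( combined ),
            combined = dt,
            combined << Concatenate( dt, Append to First Table );
            Close( dt, NoSave );
        );
    );
);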
A series of С/М x O y /SiO 2  nanocomposites has been synthesized through pyrolysis of a resorcinol-formaldehyde polymers pre-modified with metal oxide/silica nanocomposites. In turn, the binary nanoxides (M x O y /SiO 2 , where M = Mg, Mn, Ni, Cu and Zn) were synthesized via thermal oxidative decomposition of metal acetates deposited onto fumed silica. These materials are promising for adsorption and concentration of trace amounts of organic substances or heavy metals. The nanocomposites exhibit mesoporosity and narrow size of pores, as seen from their distribution profiles. The porosity depends on composition of the materials. Hence, the textural parameters of carbon-containing and carbon-free types (or classes) served as input data to develop classification models of both materials classes using unsupervised method: hierarchical clustering.   Once the cluster formula was derived, it was established that surface and the volume of micropores (Smicro, Vmicro) together with the volume of mesopores (Vmeso) have the the highest R2 (0.83 - 90) to enable successful clustering. Macroposity demonstrates the lowest fit (R2 < 0.1), and its two respective parameters (Smacro, Vmacro) are considered as the weakest contributors to the two-cluster model.   In parallel, principal components analysis was helpful to distinguish the subject classes of the nanomaterials, at reduced number of the variables (three components at eigenvalue > 1). Three- and one-component 2-Means clustering was conducted to assign each composite to its proper class. Thus, the case for two multivariate classes of nanomaterials can be described by various independent methods of the data science.      Hello. I'm  Dr  Michael  Nazarkovasky, Ukrainian  researcher  in  chemistry and  materials  science  from  Brazil. Evolved  data  driven  solutions   in  my  area  of  knowledge  and  expertise. The  presentation  is  made   in  collaboration  with  Ukrainian  colleagues from  National  Academy  of  Sciences  of  Ukraine, supporting  and  promoting their  scientific  research  programs during  such  a  difficult  period. The  subject  says  on  statistical   and  data  analysis  approaches, deepen  conception  behind the  experimental  results  and  phenomena. In  particular,   unsupervised  methods  are  helpful for  multivariate  cases  like  this when  two  or  more  classes  of  materials   are  characterized  by  a  large  body of  the  parameters  measured  or  calculated in  the  course  of  the  lab  analysis. This  case  is  about  hybrid  materials which contain mixed  nanoxides   and  carbon  phases. The  nano hybrids  combine  properties  of  both  components; well  ordered  micro  and  meso porosity, a large  surface  area   and  high  porosity  in  general, high  corrosion  resistance, thermal  and  mechanical  stability, hydrophobicity  and  high  conductivity due  to  the  presence  of  carbon, developed for active sites  attributed  to  the  metal  oxide  nanoparticles. Hence  reasons  of  the  subject   Nanomaterials' I mportance exists in  the  variety   of  their  applications; Catalysis, adsorption,  sensors, energy  field,  and  hydrogen  adsorption. Typical  preparation  of  binary  oxides, non carbon  oxide  nanocomposites consists  in  three  stages. On  the  first  stage,  preparation of  the  homogeneous  dispersion  of  silica in  the  aqueous  solution of  the  corresponding  metal  acetate under  stirring  at  room  temperature  was  conducted. 
On  the  second  stage, dispersions  were  dried   at  130  degree  C  during  five  hours and  sift through the  0.5  millimeter  mesh. At  the  last  stage  all  of  the  10  powders were   [inaudible 00:02:26]   for  2  hours at  600  degree  C  in  air. Also  the  reference  sample  of   fumed silica  without  metals was  treated  under  the  same  conditions, by  bringing  all  these  three  stages; homogenization of  the  aqueous  dispersion, drying,  grinding,  sieving,  and  carbonization   at  the  same  temperature. The  process  of  impregnation   of  fumed  silica  with  an  aqueous  solution of  metal  acetate and  subsequent  removal  of  the  solvent, the  adsorbed acetates,  are  distributed uniformly  over  the  matrix  surface. Modification  of  reserves   of  resorcinal  formaldehyde polymer by  oxide  nano composite  was  carried  out by  easing the  process mixing  resorcinal formaldehyde, and  this  binary pre-synthesized nanocomposites reference  silica  the  weight  ratio of  an  aqueous solution  under  stirring  at  room  temperature. All  these  samples  were  sealed  and  placed in  a  thermostatic  oven  for  polymerization, and  all  synthesized  composites  were  seized with  further  drying   at  90  degrees  C  for  10  hours. Carbonization  of  the  samples   was  carried  out in  a  tubular  furnace   under  nitrogen  atmosphere upon  heating  from  room  temperature  up  to  800  degrees  C at  a  heating  rate   of  5  degrees  C  per  minute, and  kept  at  a  maximum temperature  for  two  hours. Hence  the  carbonized  samples  are  labeled  as  C and  the  initial  materials  which  do  not  contain  carbon are  [inaudible 00:04:29]. Actual properties  of the… Or  in  other  word, parameters  of  porosity  were  calculated using  modified Nguyen-Do  method from  the  low  temperature  nitrogen  adsorption- desorption. This  is  a  standard   [inaudible 00:04:49]   method  for  porosity and  it's  called  the   [inaudible 00:04:51]   . The  calculated  parameters   are  assigned  as  variables for  further  data  processing  using JMP. To  be  more  exact, the  specific surface  area  and  total pore  volume were  derived  from  the  BET  measurements and  then served  to  calculate respective  portions  of  micro, meso,  and  macro porosity. In  this  case  we  have  a  set of inward  variables. For  example, Nguyen-Do  method   was  developed  initially for  calculation  of  carbon  materials with  a  sleeve-like  porosity, afterward  by  one  of  the  co  authors  of  the presentation,  Professor  Vladimir Gun'ko . The  method  was  modified  and  extended through  a  large  variety  of  other  materials which may  contain  cylinders,  also  slits, and  voids  among  the  particles within  the  aggregates and  agglomerates  of  aggregates. Not  only  for  carbon  materials and  the  method  serves  also  for  hybrids, as  for  individual  materials as  for  hybrids  also. So  let's  start  from  the  basics. In  multivariate  analysis  indicates  that  not  all  the  parameters  around  the  health are  well  related  with  each  other   in  case  of  non  carbon  materials. Specifically  there  a meso porosity   dominates  overly  serious  as  shown on  the  heat  map and  from  the  table. The  parameters  corresponding to  microporosity are  demonstrating  correlation only  within  their  group and  group  and  with  macroporosity. Yes,  surprisingly. 
Contrastingly,  the  heat  map of  the  carbonized  nano  hybrid speaks  for  more  consistent   and  more  ordered  structure with  almost  complete  correlation between  all  the  types  of  the  porosity, whereas  the  role  of  microporosity is  significantly  increased. Comparison  between  parameters  or  variances   reveals  the  differences especially  for  micro porosity. For the specific  surface  area, as  for a total pore  volume, for example, total pore volume and  volume  of  the  mesop ores can  be  also  considered  as  factors to  claim  the  difference  between both  types  of  the nanomaterials. All  eight  parameters  were  taken  to  perform  hierarchical  clustering and  it  is  easy  to  see  that the  minimal  optimal  in  the  same  time  model can  be  offered  for  three  clusters on  the  cluster  path and  on  the  constellation  plant. Think oxide  sample, it cannot  separate within  the  non  carbon  group but  can  be  attributed in  other  carbonized  cluster. Well,  main  parameters  as  seen   from  the  summary  are  the  volume and  surface  of  the  micro pores. In  other  words,  micro porosity. Indeed,  some  parameters; surface  and  volume  of  macro  and  mesopores are  out  of  the  group  samples  of  the  non  carbon  nanomaterials. I'm  talking  about,   namely  for  a  sync oxide. Anyway,  the  mean  values  for  both  parameters do  not  match  over  the  whole parallel  plot  of  their  clusters. Principal  components  analysis   help  to  represent  all  variables in  three   linear  combinations. According  to  the   [inaudible 00:09:02]   ,  the eigen values  less  than 1 are  not  taken  into  consideration. This  is  why  we  have  only  three  components whose  values  are  higher  than  1. As  the  two  first  describe  almost  80%   of  the  samples  or  nine  samples  from  12, with  the  help  of  the  third  component, the  least  important, almost  all  the  samples  fit  such  a  model. The  first  component  comprises   both  micropores  parameters [inaudible 00:09:39]   and  volume  of  the  meso pores. The   other  variables  take  a  secondary  role in  the  second  and  third  components, describing  together   another  half  of  the  temples. Mapping  the  points   over  the  score  plots  in  3D, it  is  easy  to  conclude  that  both  groups, carbonized  and  non  carbonized, can  be  separated  into  two  clusters   defined  by  three  principal  components and two  almost  flat  clusters  comprising  the  points situated  on  the  plane are  set  by  two  main  clustering  algorithms and described,  yes,  by  these  three  components. Taking  a  closer  look  at  the  results of  predictor  screening  by  boosting, again  microporosity  is  referred to  be  the  main  property. The simplified  analysis,   two  variables  can  be  extracted  and  plotted with  the  PCA  cluster, I mean using  the  same  PCA, however,  with  completely  different  amounts, one  instead  of three  components,  and  yes, whereas  a  single  principal  component serves  to  describe  the  cluster  model. The  cluster  formula  and  equation for  the  principal  component  are  provided. It  can  be  recommended   for  future  classification or  for  a  qualitative  analysis   of  synthesized  samples. 
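For readers who want to reproduce this kind of analysis, the two main steps map onto the Hierarchical Cluster and Principal Components platforms in JMP. A minimal JSL sketch follows; the eight column names for the textural parameters (SBET, Vtotal, Smicro, Vmicro, Smeso, Vmeso, Smacro, Vmacro) are hypothetical stand-ins for the variables described in the talk, and option names should be checked against your JMP version.

// Table of the 12 composites with their textural parameters (assumed names)
dt = Current Data Table();

// Hierarchical clustering on the eight porosity/surface variables
dt << Hierarchical Cluster(
    Y( :SBET, :Vtotal, :Smicro, :Vmicro, :Smeso, :Vmeso, :Smacro, :Vmacro ),
    Method( "Ward" ),
    Number of Clusters( 2 )
);

// Principal components on correlations; retain components with eigenvalue > 1
dt << Principal Components(
    Y( :SBET, :Vtotal, :Smicro, :Vmicro, :Smeso, :Vmeso, :Smacro, :Vmacro )
);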
As  conclusions, I  can  say  that the  presented  synthesis  method  makes  it  possible to  obtain  mesoporous  carbon nanocomposites  uniformly  filled  with  metal and  metal  oxide  phases  which  were pre-synthesized  in  silicon  matrix. With  the carbonization,  the  portion  of  micro pores  is  growing,  yes, the  specific  surface  area  increased with  decreased  total  porosity, total  pore  volume. High  order  hybrid carbon  oxide  nanocomposites with large specific  surface  area. The  controlled  size  distribution of  the  modifier, which  is  important   from  the  clinical  point  of  view, and  significantly  expands the  scope  of  application of  the  synthesized  materials. Parameters  of  textural  properties   are  effective  variable helpful  to  identify classified  nanooxide  materials. Data visualization  has  given  insights   to  select  adequate  approaches to  analyze  the  experimental  data. K-Means  Clustering, Self  Organized  Map, and  Hierarchical C lustering have  proven  to  be  powerful  tools for  classification   of  the  subject  materials  by  actual  properties. Principal  Component  Analysis  in  turn  had  reduced  the  set  of  variables for  a  definite  and  simple  classification. The  study  claims  two  cluster  models  described  by  three  or  even  one principal  component  to  classify  carbonized and  carbon- free  hybrid  nano composites. I'm  thankful  also  for  the  financial  support for  Brazilian  agency. I'm  very  thankful  to  my  colleague David  Kirmayer   from  the  Hebrew  University  of  Jerusalem and  one  of  the  co  authors  Maria  Galaburda, who actually  synthesized  these  samples. Thanks to P olish  Foundations and  International  Visegrad  Fund. Thank  you  very  much  for  your  attention. It  was  a  pleasure  for  me to  make  such  a  presentation.
In this paper, we will review the most prominent methods for estimating the parameters of the Poisson regression model when the data suffer from a semi-perfect (near) multicollinearity problem, namely ridge regression and the Liu estimator method. The estimation methods were applied to real data obtained from the Central Child Hospital in Baghdad, representing the number of cases of congenital defects of the heart and circulatory system in children for the period 2012-2019. The results showed the superiority of the Liu estimator method over the ridge regression method, based on the Akaike information criterion (AIC) as the criterion for comparison.

Keywords: Poisson regression, Liu estimators, Multicollinearity problem.

Hello everyone, my name is Raaed Fadhil Mohammed. I am a statistician and a lecturer at the University of Mustansiriyah. My paper title is "Estimating the Parameters of the Poisson Regression Model Under the Multicollinearity Problem."

Outline: Poisson Regression Model, Multicollinearity Problem, Ridge Regression Estimator Method, Liu Estimators Method, Real Data Example, Conclusions, and References.

Poisson Regression Model. This is one of the types of regression model that fall under the linear-logarithmic (log-linear) family: by taking the natural logarithm of the distribution formula, it turns into a linear form. The random errors in the model follow a Poisson distribution with parameter mu. The model is based on two essential assumptions: the distribution of the random errors, which differs from the distribution of the random errors in the linear regression model, and the properties of the Poisson distribution parameter mu as a function of the predictor variables.

Multicollinearity Problem. A multicollinearity problem occurs when two or more predictor variables are correlated through a strong linear relation, so that in practice it is difficult to separate the effect of each predictor variable on the dependent variable; or when the value of one of the predictor variables depends on one or more of the other predictor variables in the model under study; as well as when the data take the form of a time series or cross-section data. The multicollinearity problem can be classified into two types. Number 1, perfect multicollinearity: the determinant of the information matrix is zero, that is, the determinant of X transpose X equals zero. It follows that it is impossible to estimate the parameters of the regression model, due to the inability to calculate the inverse of the matrix X transpose X. The best method in this case is to make use of principal component analysis. Number 2, semi-perfect multicollinearity: in this case the determinant of the information matrix is very small, close to zero, and the estimated parameters have considerably inflated variance. The best methods in this case are the ridge regression method or the Liu estimator method. The variance-covariance matrix of the estimated parameters can be expressed by the following formula. Perhaps the best statistical measure of the intensity of multicollinearity is the variance inflation factor, VIF, whose formula is VIF = 1 / (1 - R squared), where R squared is the coefficient of determination obtained by regressing each predictor on the remaining predictors.

Ridge Regression Estimator Method.
This is one of the important alternatives for estimating the parameters of the regression model when there is multicollinearity between the predictor variables. The method was established according to the principle of the researchers Hoerl and Kennard, which is to add a small positive quantity to the main diagonal elements of the information matrix. The ridge regression estimators are biased when k is greater than zero, and the bias can be expressed by the formula (Z minus the identity matrix) multiplied by beta.

Liu Estimators Method. The researcher Liu (1993) laid the foundations of this method to address the issue of the variance inflation of the estimated parameters in the presence of a multicollinearity problem. The Liu estimator for the parameters of the Poisson regression model can be expressed by the following formula. The Liu estimators are also biased when d is greater than zero, and the magnitude of the bias is (Z minus the identity matrix) multiplied by beta. The reason for the bias is the added value d, which ranges between zero and one. Also, the mean squared error calculated according to the Liu estimator method is less than the mean squared error for the same parameters estimated according to the maximum likelihood method.

Real Data Example. We obtained real data concerning congenital defects of the heart and circulatory system in newborns from the Central Child Teaching Hospital in Baghdad, Iraq, where the distribution of the dependent variable y, representing abnormalities of the heart and circulatory system in children, was studied, along with detecting the existence of a multicollinearity problem among the predictor variables under study. The cases of congenital defects arriving at the Central Child Teaching Hospital are recorded in a form prepared by the Statistics Division of the hospital, as count data totaled within semi-monthly periods. The sample was taken for the period from 2012 to the end of 2019, and a Poisson regression model was built as one of the appropriate models to describe this phenomenon, in the following form: yi = exp(beta0 + beta1 xi1 + beta2 xi2 + beta3 xi3 + beta4 xi4 + beta5 xi5 + beta6 xi6 + beta7 xi7 + ui).

Here y represents the total number of children with congenital heart and circulatory defects in each period; xi1 is the total weight of the infected children within each period; xi2 is the total of the fathers' ages of the infected children within each period; xi3 is the total of the mothers' ages of the infected children within each period; xi4 represents the number of infected male children within each period; xi5 represents the number of infected female children within each period; xi6 is the number of infected children born from consanguineous marriages within each period; and xi7 is the number of infected children whose mothers were exposed to radiation or other influences, such as taking certain medications and drugs during pregnancy. Beta one through beta seven are the slope parameters in the model, beta nought represents the constant term, and ui represents the random error in the model.
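For readers following the formulas referred to above, the usual textbook forms from the Poisson ridge and Liu literature are summarized below; the slides may use slightly different notation, so treat this as a reference sketch rather than the authors' exact expressions.

$$ \Pr(y_i = k) = \frac{e^{-\mu_i}\,\mu_i^{k}}{k!}, \qquad \mu_i = \exp\!\bigl(\mathbf{x}_i^{\top}\boldsymbol\beta\bigr), \qquad \mathrm{VIF}_j = \frac{1}{1 - R_j^{2}} $$

With $\hat{W} = \operatorname{diag}(\hat\mu_i)$ and the maximum likelihood estimator $\hat{\boldsymbol\beta}_{ML}$, the ridge and Liu estimators are

$$ \hat{\boldsymbol\beta}_{R} = \bigl(X^{\top}\hat{W}X + kI\bigr)^{-1} X^{\top}\hat{W}X\,\hat{\boldsymbol\beta}_{ML}, \quad k > 0, \qquad \hat{\boldsymbol\beta}_{Liu} = \bigl(X^{\top}\hat{W}X + I\bigr)^{-1}\bigl(X^{\top}\hat{W}X + dI\bigr)\hat{\boldsymbol\beta}_{ML}, \quad 0 < d < 1. $$

Both are biased, with bias of the form $(Z - I)\boldsymbol\beta$ for the corresponding shrinkage matrix $Z$, which matches the bias expressions quoted in the talk.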
This table covers testing the data and diagnosing multicollinearity. To find out the probability distribution according to which the response variable is distributed, we used JMP Pro 16.2, and it was found that the dependent variable y follows the Poisson distribution with distribution parameter mu equal to 6.5. To verify the suitability of the Poisson distribution for the response variable y, a goodness-of-fit test was conducted for the variable of the total number of children with congenital heart and circulatory defects in each period. Through it we show that the Poisson distribution is the most appropriate distribution for the dependent variable to follow: we note that the goodness-of-fit test value is 4958.4579, with a significance level close to zero, as shown in the table in front of you.

To detect whether there is multicollinearity among the predictor variables under study, we can calculate the correlation matrix between the predictor variables. From the figure, we observe that the values of the correlation coefficients are significant and large for all predictor variables, as each variable is associated with all the other predictor variables through a strong direct linear correlation. This table shows the values of the variance inflation factors. The largest of them are those of the predictor variables X2, X3 and X4; the variance inflation factors for these predictor variables exceed 18. From this we conclude that there is linear multicollinearity between the predictor variables and the [inaudible 00:16:24].

Application of the ridge regression method. Estimating the parameters of the Poisson regression model using the ridge regression method, we observe that the total number of children with congenital heart and circulatory defects in each period depends on the increase in all parameters of the model [inaudible 00:17:11]. However, most variables are insignificant (X1, X4, X5 and X6) because of the effects of semi-perfect multicollinearity. Also, the results indicate that the bias parameter is k = 0.12, and the Akaike information criterion is 86.4959. This table can be obtained by using JMP.

When applying the Liu estimator method to estimate the coefficients of the Poisson regression model in the presence of the multicollinearity problem, we use a JMP script to connect the R language and JMP. This script connects to R and runs, from JMP, the Liu regression package in R. Applying it to estimate the coefficients [inaudible 00:19:12], we obtained these results. We observe that the total number of children with congenital heart and circulatory defects in each period depends on the extent of the increase in all parameters of the model, despite the insignificance of the variable xi1, because all variables under study increase the number of children with congenital disabilities; the estimator also has very good properties. The results also indicate that the Akaike information criterion is 35.29 and the bias parameter is d = 4.1.

When comparing the two methods, ridge regression and the Liu estimator, we find that the Liu estimator approach gives a lower value of the information criterion and has more significant coefficients when compared to the ridge regression method.

Conclusions.
In this paper, we reviewed the most prominent methods of parameter estimation for the Poisson regression model when the data suffer from the problem of semi-perfect multicollinearity. We took the ridge regression method and the Liu estimators' method and compared the two methods based on Akaike's information criterion as a criterion for comparison. By applying the regression analysis methods in the presence of a semi-perfect multicollinearity problem to real data regarding congenital heart and circulatory defects in newborns, obtained from the Central Child Teaching Hospital for the period from 2012 to 2019, we find that the Liu estimators' method is more efficient than the ridge regression method because it has a lower Akaike information criterion. It also gives more reliable results and more accurate p-values. Third, through the Liu estimators' method, it is clear that all predictor variables under study are influential in the regression model, even if they are not significant, as all parameters are influential in increasing the number of children with congenital disabilities, but in varying proportions. Thank you.
Monday, September 12, 2022
Artificial intelligence algorithms are useful for gaining insight into complex problems. One of the drawbacks of these algorithms is they are often difficult to interpret. The lack of interpretability can make models generated using these AI algorithms less trustworthy and less useful. This talk will show how utilizing a few features in JMP can make AI more understandable. The presentation will feature performing “what if” hypothesis testing using the prediction profiler, testing for model robustness utilizing Monte Carlo Simulations, and analyzing Shapley values, a new feature in JMP Pro 17, to explore contrastive explanations.     Welcome to the talk Explainable AI: Unboxing the Black Box. Let's introduce ourselves and let's start with Laura. Hello, I'm Laura Lancaster, and I'm a statistical developer in JMP and I'm located in the Cary office. Thanks. What about you, Russ? Hey, everyone. Russ Wolfinger. I'm the Director of Life Sciences R&D in the JMP group and a research fellow as well. Looking forward to the talk today. And Pete? My name's Peter Hersch. I'm part of the Global Technical Enablement team and I'm located in Denver, Colorado. Great, and my name is Florian Vogt. I'm a systems engineer for the chemical team in Europe and I'm located in beautiful Heidelberg, Germany. Welcome to the talk. AI is a hot topic at the moment and a lot of people want to do it. But what does that mean for the industries? Does it mean that scientists and engineers need to become coders, or will processes in the future be run by data scientists? A recent publication called Industrial Data Science - a Review of Machine Learning Applications for Chemical and Process Industries explains industrial data science fundamentals, reviews industrial applications using state-of-the-art machine learning techniques, and points out some important aspects of industrial AI. These are the accessibility of AI, the understandability of AI, and the consumability of AI, and in particular of its output. We'll show you some of the features in JMP that we think contribute very well to this topic. Before we start into the program of today, let's briefly review what AI encompasses and where our focus today is located. I've picked a source that separates it into four groups. Those groups are: first, Supporting AI, also called Reactive Machines, and this aims at decision support. The second group is called Augmenting AI, or Limited Memory, and that focuses on process optimization. The third group is Automating AI, or Theory of Mind, which, as the name suggests, aims at automation, and the fourth is called Autonomous AI, or Self-Aware AI, which encompasses autonomous optimizations. Today's focus is really on the first and the second topic. We had a brief discussion before. Russ, what are your thoughts on these, also with respect to what JMP can cover? Well, certainly the term AI gets thrown around a lot. It's used in many kinds of different nuanced meanings. I tend to prefer meanings that are definitely more tangible and usable and more focused, like the ones we're going to zoom in on today with some specific examples. The terminology can get a little confusing though.
I  guess  I  just  tend  to  kind  of  keep  it fairly  broad,  open  mind whenever  anyone  uses  the  term  AI  and  try to  infer  its  meaning  from  the  context. Right.  That's in  terms  of  introduction, now  we'll  get  a  little  bit  more into  the  details  and  specifically  into why  it  is  important  to  actually understand  your  AI  models. Over  to  you,  Pete. Perfect. Thanks,  Florian. I  think  what  Russ  was  hitting  on  there and  Florian's  introduction  is, we  oftentimes  don't  know  what  an  AI  model is  telling  us  and  what's  under  the  hood. When  we're  thinking  about how  well  a  model  performs, we  think  about  how well  that  fits  the  data. If  we  look  here,  we're  looking at  a  neural  network  diagram and  as  you  can  see, these  can  get  pretty  complex. These AI  models  are  becoming more  and  more  prevalent and  relied  upon  for  decision  making. Really,  understanding  why  an  AI  model is  making  a  certain  decision, what  criteria it's  basing  that  decision on, is imperative  to  taking full  advantage  of  these  models. When  a  model  changes  or  updates, especially  with  that  autonomous  AI or  the  automating  AI, we  need  to  understand  why. We  need  to  confirm that  this  model  is  maybe  not extrapolating  or  basing  it  on  a  few  points outside  of  our  normal  operating  range. Hold  on. Let  me  steal  the  screen here  from  Florian, and  I'm  going  to  go  ahead and  walk  through  a  case  study  here. All  right,  so  this  case  study is  based  on  directional  drilling from  wells  near  Fort  Worth,  Texas. The  idea  with  this  type  of  drilling  is unlike  conventional  wells, where  you  would  just  go  vertically, you  go  down  a  certain  depth, and  then  you  start  going  horizontal. The  idea  is  these  are  much  more efficient  than  the  traditional  wells. You  have  these  areas  of  trapped  oil and  gas  that  you  can get  at  with  some special  completion  parameters. We're  looking  at  the  data  here from  these  wells, and  we're  trying  to  figure  out  what  are the  most  important  factors, including  the  geology,  the  location, and  the  completion  factors, and  can  we  optimize  these  factors  to increase  or  optimize  our  well  production? To  give  you  an  idea, here's  a  map  of  that  basin, so  like  I  mentioned, this  is  Fort  Worth,  Texas. You  can  see  we  have  wells  all  around  this. We  have  certain  areas  where  our  yearly production  is  higher, others  where  it's  lower. We  wanted  to  ask  a  few  questions looking  at  this  data. What  factors  have the  biggest  influence  on  production? If  we  know  certain  levels  for  a  new  well, can  we  predict  what  our production  will  be? Is  there  a  way  to  alter  our  factors, maybe  some  of  the  completion  parameters and  impact  our  production? We're  going  to  go  ahead and  answer  some  questions  with  a  model. But  before  we  get  into  that, I  wanted  to  ask  Russ, since  he's  got lots  of  experience  with  this. When  you're  starting to  dig  into  data,  Russ, what's  the  best  place  to  start  and  why? Well,  I  guess  maybe  I'm  biased, but  I  love  JMP for  this  type of  application,  Pete, just  because  it's  so  well  suited  for  quick exploratory  data  analysis. 
You  want  to  get  a  feel  for  what  target you're  trying  to  predict and  the  predictors  you're  about  to  use, looking  at  their  distributions, checking  for  outliers  or  any unusual  patterns  in  the  data. You  may  even  want  to  do  some  quick pattern  discovery  clustering or  PCA  type  analysis  just  to  get  a  feel for  any  structure  that's  in  the  data. Then  also  be  thinking  carefully  about what  performance  metric  would  make  the most  sense  for  the  application  at  hand. Typically  the  common  one,  obviously for  continuous  predictors  would be like  root  means  square  error, but  there  could  be  cases  where may be that's  not  quite  appropriate, especially  if  there's direct  cost  involved, sometimes  absolute  error is  more  relevant for  a  true  profit- loss  type  decision. These  are  all  things that  you  want  to  start  thinking  about as  well  as  how  you're  going  to  validate  your  model. I'm  a  big  fan  of   k-fold cross validation. Where  you  split  your  data  into  distinct subsets  and  hold  one  out and  being  very  careful  about  not  allowing leakage  or  also  careful  about  overfitting. These  are  all  concerns  that  tend to  come  top  of  mind  for  me  when I  start  out  with  a  new  problem. Perfect. Thanks, Russ . I'm  going  to  walk  through  this  in  JMP some  of  the  tools  we  can  use   to  start  looking  at  our  problem. Then  we're  going  to  cover  some of  the  things  that  help  us  determine which  of  these  factors  are  having the  biggest  impact  on  well  production. I'm  going  to  show  Variable  Importance and  then  Shapley  Values and  we'll  have Laura  talk  to  that  and  how  we  do  that. But first,  let's  go  ahead and  look  at  this  data  inside  a  JMP. Like  we  mentioned  here, I  have  my  production  from  these  wells. I  have  some  location  parameters so  where  it  is  latitude  and  longitude, I  have  some  geologic  parameters. This  is  about  the  rock  formation we're  drilling  through. Then  I  have  some   completion  parameters and  this  is  factors  that  we  can  change while  we're  drilling  as  well, some  of  these  we can  have  influence  on. If  we  wanted  to  go  through and  this   dataset  only  has  5,000  rows. When  talking  to  Russ  in  starting  to  prep this  talk,  he  said  just  go  ahead and  run  some  model  screening  and  see what  type  of  model  fits  this  data  best. To  do  that,  we're  going  to  go  ahead and  go  under  the  Analyze  menu, go  to  Predictive  Model and  hit  Model  Screening. I'm  going  to  put  my  response, which  is  that  production, take  all  of  our  factors,  location, geology  and  completion  parameters, put  those  into  the  X  and  grab my  Validation  and  put  it  into  Validation. Down  here  we  have  all  sorts  of  different options  on  types  of  models  we  can  run. We  can  pick  and  choose which  ones  maybe  make  sense or  don't  make  sense  for  this  type  of  data. We  can  pick  out  some  different  modeling options  for  our  linear  models. Even,  like  Russ  mentioned  here, if  we  don't  have  enough  data to  hold  back  and  do  our  validation. That  way  we  can  utilize a  k-fold cross validation in here. Now  to  save  some  time,  I've  gone  ahead and  run  this  already  so  you  don't  have to  watch  JMP  create  all  these  models. Here's  the  results. 
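For reference, the two error metrics Russ contrasts above, for a continuous response y with predictions y-hat over n held-out observations, are:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```

RMSE penalizes large errors more heavily, while the absolute error scales directly with cost-like quantities, which is why it can be more relevant for a true profit-and-loss decision.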
For  this  data,  you  can  see that  these  tree  based  methods: Boosted  Tree,  Bootstrap f orest  and  XGB oost all  did  very  well  at  fitting  the  data, compared  to  some  of  the  other  techniques. We  could  go  through and  run  several  of  these, but  for  this  one,  I'm  going  to  just  pick the  Boosted  Tree  since  it  had  the  best RS quare  and  root  average  square  error for  this   dataset. We'll  go  ahead  and  run  that. After  we've  run  the  screening, we're  going  to  go  ahead  and  pick  a  model or  a  couple  of  models  that  fit well  and  just  run  them. Al right,  so  here's   the  overall  fit  in  this  case. Depending  on  what  type  of  data you're  looking  at, maybe  an  RS quare  of  .5  is  great, maybe  an  RSquare  of  .5  is  not  so  great. Just  depending on  what  type  of  data  you  have, you  can  judge  if  this  is  a  good  enough  model  or  not. Now  that  I  have  this, I  want  to  answer  that  first  question. Knowing  a  few  parameters  going  in, what  can  I  expect my  production  level  to  be? An  easy  way  to  do  that  inside  a  JMP with  any  type  model  is  with  this  profiler. Okay,  so  we  have  the  profiler  here, we  have  all  of  the  factors  that  were included  in  the  model, and  we  have  what  our  expected  12  month  production  to  be. Here  I  can  adjust if  I  know  my  certain  location. I  know  the  latitude  and  longitude  going in maybe  I  know  some of  these  geologic  parameters. I  can  maybe  adjust  several  of  these  factors and  figure  out  of  the  completion  parameters and  figure  out  a  way  to  optimize  this. But  I  think  here  with  a  lot  of  factors, this  can  be  complex. Let's  talk  about  the  second  question, where  we  were  wondering  which  one  of  these factors  was  having  the  biggest  influence. You  can  see  based  on   which  of  these  lines  are  flatter or  have  more  shape  to  them, what  is  the  biggest  influence. But   let's  JMP  do  that  for  us. Under  Assess  Variable  Importance, I'm  going  to  just  let  JMP  go  through and  pick  the  factors that  are  most  important. Here  you  can  see it's  outlined  the  most  important down  to  ones  that  are  less  important. I  like  this  feature   colorize  the  profiler. Now  it's  highlighted the  most  important  factors and  gone  down to  the  least  important  factors. Again,  I  can  adjust  these   and  see,  oh,  it  looks  like maybe  adjusting  the  depth  of  the  well, adding  some  more   [inaudible 00:15:51], it  might  improve  my  production. That  is  the  way  we  could  do  this but  we  have  a  new  way  of  looking at  the  impact  of  each  one  of  these  factors on  a  certain  well. We  can  launch  that  under  the  red  triangle in  JMP  17  and  Shapley  values. I  can  set  my  options  or save  out  the  Shapley  values. Once  I  do  that,  it  will  create  new  columns  in  my  data  table that  save  out  the  contributions from  each  one  of  those  factors. This  is  where  I'm  going  to  let Laura  talk  to  some  Shapley  values. I  just  wanted  going  to  talk  briefly   about  what  Shapley  values  are and  how  we  use  them. Shapley  values  are  a  model  agnostic  method  for  explaining  model  predictions and  they  are  really  helpful for  Black box  models that  are  really  hard to  interpret  or  explain. 
The  method  comes  from  cooperative  game  theory and  I  don't  have  time  to  talk  about  the  background or  the  math  behind the  computations,  but  we  have  a  reference at  the  bottom  of  the  slide  and  if  you Google  it,  you  should  be  able  to  find a  lot  of  references  to  learn more  if  you're  interested. What  these  Shapley  values  do, they  tell  you  how  much  each  input  variable is  contributing  to  an  individual prediction  for  a  model. That's  away  from  the  average predicted  value across  the  input   dataset and  your  input   dataset  is  going to  come  from  your  training data. Shapley values  are  additive, which  makes  them  really  nice  and  easy to  interpret  and  understand. Every  prediction  can  be  written as  a  sum  of  your   Shapley values plus  that  average  predicted  value, which  we  refer  to  as the  Shapley  intercept  in  JMP. They  can  be  computationally  intensive to  compute  if  you  have  a  lot  of  input values,  input variables in  your  model or  if  you're  trying  to  create Shapley values  for  a  lot  of  predictions. We  try  to  give  some  options for  helping  to  reduce  time  in JMP . These  Shapley  values, as  Peter  mentioned, were  added  to  the  prediction  profiler for  quite  a  few  of  the  models  in  JMP  Pro17 and  they're  also  available in  the  graph  profiler. They're  available  for  Fit Least S quares Nominal Logistic, O rdinal  Logistics, Neural,  Gen Reg, P artition,  Bootstrap Forest  and  Boosted  Tree. They're  also  available  if  you  have  the  XB oost  Add -In. Except  in  that  Add-I n, they're  available  from  the  model  menu and  not  from  the  prediction  profiler. Okay,  next  slide. In  this  slide  I  want  to  just  look  at  some of  the  predictions  from  Peter's  model. This  is  from  a  model   using  five  input  variables. These  are  stack  bar  charts of  the  first  three  predictions coming  from  the  first  three  rows  of  his  data. On  the  left  you  see  a  stack  bar chart of  the  Shapley  values  for  the  first  row. That  first  prediction  is  11.184  production  barrels in hundreds of thousands. Each  color  in  that  bar  graph  is  divided out  by  the  different  input  variables. Inside  the  bars  are the   Shapley values. If  you  add  up  all  of  those  values, plus  the  Shapley  intercept that  I  have  in  the  middle  of  the  graph, you  get  that  prediction  value. This  is  showing  you  that  first  of  all, all  of  these  are  making positive  contributions  to  the  production and  they  show  you  how  much,  so the  size, I  can  see  that  longitude  and  proppant are  contributing  the  most for  this  particular  prediction. Then  if  I  look  to  the  right  side to  the  third  prediction, which  is  2.916  production  barrels  and  hundreds  of  thousands, I  can  see  that  two  of  my  input  variables are  contributing  positively to  my  production and  three  of  them are  having  negative  contributions, the  bottom  three  here. You  can  use  graphs  like  this to  help  visualize  your  Shapley Values . That  helps  you  really  understand these  individual  predictions. Next  slide. This  is  just  one  of  many  types  of  graphs you  can  create. The  Shapley  values get  saved  into  your  data  table. You  can  manipulate  them and  create  all  kinds  of  graphs   in  Graph  Builder  and  JMP. 
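The additivity property Laura describes can be written compactly; a minimal formalization (generic symbols, not JMP column names):

```latex
% Each prediction decomposes into the Shapley intercept (the average
% prediction over the training data) plus one contribution per input variable
\hat{f}(x) = \phi_0 + \sum_{j=1}^{p}\phi_j(x),
\qquad
\phi_0 = \frac{1}{n}\sum_{i=1}^{n}\hat{f}(x_i)
```

In the first prediction above, 11.184 is the Shapley intercept plus the five per-variable contributions shown in the stacked bar.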
This  graph  is  just  a  graph of  all  the  Shapley  values from  over  5,000  rows  of  the  data split  out  by  each  input  variable. It just  gives  you  an  idea of  the  contributions  of  those  variables, both  positive  and  negatively to  the  predictions. Now  I'm  going   to  hand  it  back  over  to  Peter. Great.  Thanks,  Laura. I  think  now  we'll  go  ahead and  transition  to  our  second  case  study that  Florian  is  going  to  do. Should  I  pass  the  screen  share  back  to  you? Yeah,  that  would  be  great. Make the transition. Thanks  for  this  first  case  study and  thanks  for  the  contributions. Really  interesting. I hope  we  can  bring  some  light  onto  a  different  kind  of  application with  our  second  case  study. I  have  given  it  the  subtitle, Was  it  Torque? Because  that's  a  question  we'll  have  hopefully  answered by  the  end of  the  second  case  study  presentation. This  second  case  study  is  about  predictive  maintenance and  the  particular  aspects   of  why  it  is  important  to  understand your  models  in  this  scenario. Most  likely  everybody can  think  that  it's  very  important to  have  a  sense  for  when  machines require  maintenance. Because  if  machines  fail, then that's  a  lot  of  trouble, a  lot  of  costs,  when  plants have  to  shut  down  and  so  on. It's  really  necessary to  do  preventative  maintenance to  keep  systems  running. A  major  task  in  this  is  to  determine when  the  maintenance  should  be  performed and  not  too  early,   not  too  late,  certainly. Therefore,  it's  a  task  to  find  a  balance which  limits  failures   and  also  saves  costs  on  maintenance. There's  a  case  study  that  we're  using to  highlight  some  functions  features and  it's  actually  a  synthetic  data set which  comes  from  a  published  study. The  source is  down  there  at  the  bottom. You  can  find  it. It  was  published   in  the  AI  for  Industry  event  in  2020. The  basic  content  of  this  dataset is  that  it  has  six  different features  of  process  settings, which  are  product  or product type which  denotes   for  different  quality  variants. Then  we  have  air  temperature, process  temperature, rotational  speed,  torque,  and  tool wear. We  have  one  main  response  and  that  is  whether  the  machine  fails  or  not. When  we  think  of  questions that  we  could  answer  with  a  model  or  models  or  generally  data. There's  several  that  come  to  mind. Now,  the  most  obvious  question is  probably  how  we  can  explain and  interpret  settings, which  likely  lead  to  machine  failure. This  is  something   that [inaudible 00:24:38] to  create  and  compare  multiple  models and  then  choose the  one  that's  most  suitable. Now,  in  this  particular  setting  where  we  want  to  predict whether  a machine  fails  or  not. We  also  have  to  account  for  misclassifications that  is  either  a  false  positive  or  a  false  negative  prediction. With  JMP's  decision  threshold  graphs and  the  profit  matrix, we  can  actually  specify  an  emphasis or  importance   to  which  outcome  is  less  desirable. For  example, it  is  typically  less  desirable  to  actually  have  a  failure when  the  model  didn't  predict  one compared  to  the  opposite,   misclassification. Then  besides   the  binary  classification,  of  course, you'd  be  also  interested  in  understanding what  drives  failure  typically. 
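The profit matrix idea mentioned above can be made concrete with a small expected-cost calculation. This is the standard decision-theoretic view rather than a description of JMP's internal computation; the 20-to-1 weight is the value used later in the demo.

```latex
% Let p be the predicted probability of failure, and let a missed failure
% (false negative) cost w times as much as a false alarm (false positive).
% Predicting "failure" is the lower-cost call whenever
w\,p > 1\cdot(1 - p) \;\;\Longleftrightarrow\;\; p > \frac{1}{1 + w}
% With w = 20, the decision threshold drops from 0.5 to 1/21 \approx 0.048
```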
There  are  certainly  several  ways to  deal  with  this  question. I  think  visualization   is  always  a  part  of  it. But  when  we're  using  models   we  can  consider, for  example,  this  self  explaining  models like  decision  trees   or  we  can  use  built- in  functionality like  the  prediction  profiler and  the  variable  importance  feature. The  last  point  here when  we  investigate and rate which  factors  are  most  important for  the  predictive  outcome, we  assume   that  there  is  an  underlying  behavior. The  most  important  factor  is  XYZ, but  we  do  not  know  which  factor  actually has  contributed  to  what  extent to  an  individual  prediction. A gain,  Shapley  values  are  a  very  helpful addition  that  can  allow  us  to  understand  the  contribution for  each  of  the  factors  in  individual  prediction. On a  general  level, now,  let's  take  a  look into  three  specific  questions and  how  we  can   answer  those  with  the  software. The  first  one  is  how  do  we   adjust  predictive  model with  respect  to  the  high  importance of  omitting  false  negative  predictions? This  assumes  a  little  bit   that  we've  already  done  a  first  step because  we've  already  seen  model  screening  and  how  we  can  get  there. I'm  starting  one  step  ahead. Let's  move  into  JMP  to  actually  take  a  look  at  this. We  see  the   dataset, we can see it's  fairly  small, not  too  many  columns. It  looks  very  simple. We  only  have  these  few  predictors and  there's  some  more  columns. There's  also  a  validation  column that  I've  added, but  it's  not  shown  here. As  for  the  first  question, let's  assume  we  have  already  done the  model  screening. Again,  this  is  accessible on the  analyzed  predictive  model  screening where  we  don't  specify   what  we  want  to  predict and  the  factors  that  we  want  to  investigate. Again,  I  have  already  prepared  this. We  have  an  outcome that  looks  like  this. It  looks  a  little  bit  different than  in  the  first  use  case because  now  we  have  this  binary  outcome and  so  we  have  some  different  measures that  we  can  use  to  compare. But  again,  what's  important  is  that  we  have  an  overview of  which  of  the  methods are  performing  better  than  other  ones. As  we  said,  in  order  to  now  improve the  model  and  emphasize  on  omitting  these false  negative  predictions. Let's  just  pick  the  one and  see  what  we  can  do  here. Let's  maybe  even  pick  the  first  three  here, so  we  can  just  do that  by  holding  the  control  key. Another  feature  that  will  help  us  here is  called  decision  threshold and  it's  located   in  the  red  triangle decision  threshold. The  decision  threshold  gives  us  several  contents. We  have  these  graphs  here, these  shows  the  actual  data  points. We  have  this  confusion  matrix and  we  have  some   additional  graphs  and  matrix, but  we  will  focus  on  the  upper  part  here. Let's actually  take  a  look   at  the  test  portion  of  the  set. When  we  take  a  look  at  this, we  can  see  that  we  have   different  types  of  outcomes. The  default  of  this  probabilities  threshold is  the  middle,  which  would  be  here  at  .5. We  have  now  several  options to  see  and  optimize  this  model and  how  effective  it  is  with  respect to  the  confusion  matrix. 
Confusion  matrix,  we  can  see  the  predicted  value and  whether  that  actually  was  true  or  not. If  we  look  at  when o  failure  is  predicted, we  can  see  that  here,  with  this  setting, we  actually  have  quite  a  high  number  of  failures, even  though  there  were  no  predicted. Now  we  can  interactively explore  how  adjusting  this  threshold actually  affects  the  accuracy  of  the  model or  the  misclassification  rates. Or  in  some  cases,  we  can  also  put an  emphasis  on  what's  really worse  than  an other  failure. We  can  do  this  with   the  so  called  profit  matrix. If  we  go  here,  we  can  set  a  value on  which  of  the  misclassifications   is  actually  worse  than  the  other  one. In  this  case,  we  really  do  not  want to  have  a  prediction  of  no  failure when  there  actually  is  a  failure  happening. We  would  put  something  like  20  times. More  importantly, we  do  not  get  this  misclassification and  we  set  it  and  hit  okay, and  then  it  will  automatically  update  the  graph and  then  we  can  see that  the  values  for  the  misclassification have  dropped  now  in  each  of  the  models and  we  can  use  this  as  an  additional  tool to  select  a  model   that's  maybe  most  appropriate. That's  for  the  first  question   of  how  we  can  adjust  a  predictive  model with  respect  to  the  higher  importance   of  omitting  false  negative  predictions. Now,  another  question  here  is  also  when  we  think  of  maintenance and  where  we  put  our  efforts  into  maintenance, is  how  can  we  identify  and  communicate the  overall  importance  of  predictors? What  factors  are  driving the  system,  the  failures? Let's  go  back  to  the  data  table   to  say  that  first, I  personally  like  visual   and  simplistic  ones. One  of  them  that  I  like  to  use  is  stuff  like  the  parallel  plot. Because  it's  really a  nice  overview  summarizing where  the  failures  group   and which  parameters  settings  and  so  on. On  the  modeling  and  machine  learning  side, there's  a  few  other  options that  we  can  actually  use. One  that  I  like  because  it's  very  crisp  and  clear, is  the  predictor  screening. Predictor  screening  gives  us  very  compact  output about  what  is  important  and  it's  very  easy  to  do and  it's  under  analyzed  screening, predictor  screening. A ll  we  need  to  do   is  say  what  we  want  to  understand and  then  specify  the  parameters that  we  want  to  use  for  this. Click  okay,  and  then  it  recalculates and  we  have  this  output. For  me,  it's  a  really  nice  thing because as I said, crisp  and  clear  and  consuming. But  we've  talked  about  this  before and  Russ,  when  we're  working with  models  particularly, do  you  have  any  other  suggestion or  do  you  have  anything  to  add to  my  approach  to  understanding  the  factors of  which  predictors  are  important. Yes,  it  is  a  good  thing  to  try. As  I  mentioned  earlier,   you  got  to  be  really  careful about  overfitting. I tend to work  with  a  lot   of  these  wide  problems, say  from  Genomics  and  other  applications, where  you  might  even  have many  more  predictors than  you  have  observations. 
In such a case, if you were to run predictor screening, say maybe pick the top 10 best, and then turn right around and fit a new model with those 10 only, you've actually just set yourself up for overfitting if you did the predictor screening on the entire dataset. That's the scenario I'm concerned about. It's an easy trap to fall into, because you think you're just filtering things down, but you've just reused the same data twice. The danger would be, if you were to then apply that model to some new data, it likely won't do nearly as well. If you're in the game where you want to reduce predictors, I tend to prefer to do it within each fold of a k-fold. The drawback of that is you'll get a different set every time, but you can aggregate those things. If you've got a certain predictor that's showing up consistently across folds, that's very good evidence that it's a very important one. I expect that's what would happen in this case with, say, torque. Even if you were to do this exercise, say, 10 times with 10 different folds, you'd likely get a pretty similar ranking. It's more of a subtlety, but again, a danger that you have to watch out for. JMP can make it a little bit easier, just because things are so quick and clean, like you mentioned, to fall into that trap if you're not careful. Yeah, that's a very valuable addition to this approach. Accompanying this additional information, there's also the other option that we have, particularly when we have already gone through the process of creating a model, where we can then again use the prediction profiler and the variable importance. It's another way we can assess which of the variables have the higher importance. Russ, do you want to say a word on that also, in contrast maybe to the predictor screening? Yeah. Honestly, Vogt, I like the variable importance route a little better. Just dive right into the modeling. Again, I would prefer with k-fold. Then you can just use the variable importance measures, which are often really informative directly. They're very similar. In fact, predictor screening, I believe, is just calling bootstrap forest in the background and collecting the most important variables. It's basically the same thing. Then following up with the profiler, which can be excellent for seeing exactly how certain variables are marginally affecting the response, and then drilling that even further with Shapley to be able to break down individual predictions into their components. To me, it's a very compelling and interesting way to dive into a predictive model and understand what's really going on with it, kind of unpacking the black box and letting you see what's really happening. Yeah, thanks. I think that's the whole point, making it understandable and making it consumable, besides, of course, actually getting to the results, which is understanding which factors are influencing the outcome. Thanks. Now, I have one more question, and you've already mentioned it.
When we score new data, in particular, what can we do to identify which predictors have actually influenced the model outcome? Now, with what we have done so far, we have gained a good understanding of the system and know which of the factors are the most dominant, and we can even derive operating ranges. But if the system changes, what if a different factor actually drives a failure? Then, as would be expected in this case, and as we heard from Laura beforehand, Shapley values are again a great addition that will help us to interpret it. We've seen how we can generate them, and you've learned on which platforms they'll be available. Now, the output that you get when you save out Shapley values is, for example, also a graph that shows us the values per actual row. In this case, we have 10,000 rows in the data table, so we have 10,000 stacked bar charts, and then we can already see that, besides the common pattern, there are also times when there are actually other influencing factors that drive the decision of the model. It's really a helpful tool to not only rate an individual prediction, but also, to add on to what Russ just said, build understanding of the system and which factors contribute. When we move a little bit further on this understandable or exploratory path, we can use these Shapley values in different ways. What I personally liked was the suggestion to actually plot the Shapley values by their actual parameter settings, because that allows us to identify areas of settings. For example, if we take rotational speed here, we can see that there are actually areas of this parameter that tend to contribute a lot in terms of the model outcome, but also in terms of the actual failure. That also helps us in getting more understanding with respect to the actual problem of machine failure and what's causing it, and also with respect to why the model predicts something. Now, finally, I'd like to answer the question. When we take these graphs of Shapley values, and we have seen it before on several occasions, torque is certainly a dominant factor. But from all of these, I've just picked a few predictions, and we can see that sometimes it's torque, and sometimes it's not. With the Shapley values, we really have a great way of interpreting a specific prediction by the model. All right, so those were the things we wanted to show. I hope this gives some great insight into how we can make AI models more explainable, more understandable, more easy to digest and to work with, because that's the intention here. Yeah, I'd like to summarize a little bit. Pete, maybe you want to come in and help me here. I think what we're hoping to show is that as these AI models become more and more prevalent and are relied upon for decision making, understanding, interpreting, and being able to communicate those models is very important. We hope that with these Shapley values, with the variable importance, and with the profiler, we've shown you a couple of ways that you can share those results and have them easily understandable.
That was the take-home there: between that and being able to utilize model screening and things like that, hopefully you found a few techniques that will make this more understandable and less of a black box. Yeah, I absolutely agree. Just to summarize, I'd really like to thank Russ and Laura for contributing here with your expertise. Thanks, Pete. It was a pleasure. Thanks, everybody, for listening. We're looking forward to having discussions and questions to answer.
Monday, September 12, 2022
While JMP is definitely a time-saver in the world of statistical problem solving, with a small amount of up front JSL scripting, JMP also can be a huge time saver for the more mundane aspects of daily work. This presentation demonstrates how to free up time for more meaningful work using JSL to automate daily workflows, send automatic reports as emails, perform automated file management and more. While there will be some scripting involved, this presentation is accessable to all, as JMP will do most of the heavy lifting (scripting). The presentation is live demonstration using JMP only.     Hi,  this  is  Jed  Campbell. I'm  here  to  present  today   on maybe a  little  bit  different than  what  we  normally  think  of   as things to go and  learn  about at  Discovery. It  turns  out  that when  you  really  look  at  it, there's  two  things  that  we  all  do. There  are  two  things  that  we  all  do, and  really,  it  comes  down  to  the  things that  we want to do in life and  the  things that  we  don't  want  to  do  in  life. I  think  Randall  Munroe summed  this  up. In  the  last  few  years, we've  all  spent  a  lot  of  time doing  things  that we  don't  really  want  to  do. But  that  comes  down to  the  same  at  work  as  well. I  think  a  lot  of  the  time  when  we  come to  JMP  Discovery, we  want  to  focus on  the  things  we  want  to  do. We  want  to  focus  on  learning  how  to  do better and  faster  analysis, more  statistical  tools. Today,  I'm  actually  going  to  focus on  the  things  that  we  don't  want  to  do, the  mundane  tasks that   make  life  less  worth  living,  I  guess. We're  going  to  focus  on  ways  to  make those  simpler,  better,  and  faster  more. But  before  we  begin,  I  want  to   maybe talk  a  little  bit  about  history. In  1981,  I  remember  my  dad   came home  with  our  first  computer. It  was  a  Commodore  VIC- 20, had   4k  of  Ram,  and  I  had  to  look  this  up, but  a  1- Megahertz  processor, and  it  was  an   8-bit  processor. A nything  we  wanted  to  do  in  that, we  had  to  program. I'm  presenting  today  on  a  basic  laptop that  you  can  buy  at  any store with  16  gigs  of  RAM, a  super fast  processor, and  64  bit  processor. Really  it  comes  down  to what  is  the  difference between  a  million  and  a  billion? Notice  that  we  have  a  log  scale  here. This  Commodore  V IC-20 was  at  the  beginning  of  things, and  now  we're  somewhere  in  this  range. Essentially,  the  difference   between a  million  and  a  billion is  a  billion. Or  the  difference between  an  old  school  processor   and current processors  is  almost  infinite. What  that  means  is  that while  elegant  code  can  feel  good  to  write and  can  execute  a  little  bit  faster, it's  really  hard  to  decipher. Brute  force  code,  on  the  other  hand, is  lazy  and  easy  to  read. The  good  news  is  there  are  plenty of  CPU  cycles  for  a  brute  force  approach. That's  what  we're  going to  be  talking  about. I  don't  think  we  need to  do  anything wild or crazy to  make  mundane  tasks more  easy  for  us  in  JMP. Before   I  begin, I  just  like  to  say  nobody  likes to  be  watched  using  a  computer. Obviously,  if  something  goes  wrong, this  journal is  already  available on  the community.jmp.com, and  we'll  make  it  work. But  anyway. Also,  shout  out  to  the   Scripting Index. 
I  want  to  make  sure  that if  you  find  yourself doing  scripting  in  JMP, that  you  should  probably  set a  keyboard  shortcut for  the   Scripting Index   because it  really  speeds  things  up. A  way  to  do  that  would  be to  come  up  to  the  View  menu. View,  Customize,  Menus  and  Toolbars, and  then  you  can  find   Scripting Index. I've  already  assigned the  keyboard  shortcut of  Control  Shift  X. I  can  just  reassign  that  right  now. What  that  means  is  anytime  I  want to  see  the   Scripting Index, I  can  hit  that  Control  Shift  X, and  it  pops  up. Super  useful,  and  I  use  it  a  ton. This  little  button  right  here is  going to make a folder because  we're  going  to  be  doing some  file  manipulation. I'm  just  going  to  hit  the  OK  button, and  now  I  have a  folder  with  an  Excel  document  in  it and  a  PowerPoint  document  in  it. For  those  of  you familiar  with  demos  in  JMP, you  may  see that  this  Excel  document  looks suspiciously  like  big  class because  that's  what  it  is,   as well as the  PowerPoint  document. But  we're  going  to  be  messing  around with  some  of  those  things in  a  little  while. That  all  being  said,  let's  get  started. As  we  talk  about  things  that  we  don't necessarily  want  to  do  but  must  do. I  work  in  a  quality  group, which  means there  are  a  lot  of  requirements that  customers  have  for  us, that  governing  bodies  have  for  us that  just  need  to  be  done. I  could  spend  a  lot  of  time doing  those  things. Or  I  could  make,  for  example, a  few  little  dashboards. These  are  dumbed  down  versions of  those  dashboards. They're  all  functional  in  here if  you're  going  through  this  journal on  your  own  later. For  example, somebody  emails  you  a  folder   and emails you  something  in  Excel, and  you  have  to  do something with it,  or  you  have  to  send  emails  to  people, you  have  to  do  some  work  in  some  weird random  network  folder that  you  can  never  remember. But  we'll  go  through  some  of these examples  together, but  just  know  that  those   are there  to  interact  with  later. First,  let's  talk  about the  web  command  in  JSL. It  turns  out  it's  not  just  for  websites, and  it's  actually,  sure, you  can  open  a  website  with  it, but  you  can  also  open  a  file from  a  SharePoint,  or  a  Google  Doc, or whatever  cloud  service your  organization  uses. You  can  open  a  folder  on  your  computer. You  can  also  open  a  non-junk  document in  its  native  thing  because  remember, the JMP, it's  more  like  saying, "Computer,  do  something  for  me   the way you're  used  to  doing  it." For  example,  we  can  open an  Excel  file  in  Excel, if  you  ever  need  to  do  that. You  can  also  use  it to  compose  an  editable  email. In  fact, looking  on  Wikipedia,  URI   or  Uniform  Resource  Identifier is  a  special  way  that  computers   can do  all  sorts  of  things. You  can  tell  your  computer, as  I'm  scrolling  through  all  of  these, apparently,  there's  an   ed2k  thing, if  you  know  that. If  you're  on  an  Apple  computer, and  you  want  to  FaceTime  someone, you  can  use  this  protocol  to script that. Microsoft  computers, you  can  do  the  same  thing. There's  a  whole  bunch  of  config  settings. You  could  use  to  take  cameras, take  pictures  with  the  camera, or  change  notifications. The  list  is  pretty  much  endless. 
Let's  actually  dive  into  it  now. This  button  here. Okay. Again,  we've  seen,  and  we   think that  the  web  command. Yeah,  of  course  it  works for  opening  web  addresses. If  I  use  the  Enter  key on  my  numeric keypad that  runs  that  line  of  the  script, and  that's  what  I  expect. We  can  also  copy  and  paste, like  I  mentioned  before, a  URL  from  a  SharePoint  site and  make  a  link  to  that  with  JMP. You  can  use  it  to  open  a  folder. Here  is  the  shortcut  to  open that folder that  we've  been  working  with, that  we're  going  to  be  working  with. You  can  open  a  non-JMP  document  with  it. Instead  of  using  the  open  command, you  would  use  web, and  here  is  a  PowerPoint  document. You  can  open  an  Excel  file   in JMP  or  in  Excel. Using  the  open  command  here is  something. If  I  run  this  line, it  will  open  that  Excel  file that's  in  our  folder  in  JMP. But  if  that's  not what  the  behavior  that  I want, I  can  use  the  web  command  instead and  it  will  open  that  file  in  Excel. If  I  need  for  some  reason  a  feature that  only  Excel  has, or  I'm  working  with  people that  don't  have JMP. This  is  where  it  gets  more  fun. I  just  chose  the  mail to  URI  scheme   as something  to  experiment  with. This  is  just  an  online URL  mail to  generator. We  can  open  this  and  we  can  say, to  test@example.com. What's  the  subject?  JMP. Is  great,  lots  of  exclamation  marks. Then  it  will  generate  that  URL  for  me. A ll  I  have  to  do  is  CTRL  C to  copy  to  my  clipboard. Then  I  can  come  in  here  and make  a  new  line. Web,  open  quotes,  paste  that  in, put  my  semicolon  at  the  end. Now  I  get  an  email  pop- up that  I  can  edit  before  I  send  it. Sometimes  that's  really  nice. JMP's  mail  command  doesn't  allow  you to  modify  an  email  before  you  send  it. For  example,  if  you  have  a  long  list of  people  that  need  to  be  emailed, but  you  want  to  personalize  each  message before  you  send  it, this  is  a  way  to  do  that. I'm  not  going  to  go  ahead  and  send that because  I  don't  want  those  people to  get  bombarded. A gain,  here's  another  list  of  URI  schemes. There  are  tons  of  ways  to  do  that. We  just  chose  the  mail to  one as  an  example. That  being  said, we've  learned  one  way  to  do  mail. If  you're  not  familiar with  JMP's  built- in  email  command, let's  go  over  that, and maybe  say,  for  example, you  have  a  report  or  a  dashboard that  you  output  and  you  need  to  email a  bunch of people, but  they  don't  all  necessarily  have  JMP. Let's  walk  our  way  through  opening a  data  table,  and  running  a  script on it, and  then  saving  that as  an  interactive  HTML, and  then  we'll  email  that. In  this  case  here, I'm  just  going  to  tell  JMP to  open  the   Big Class  data  table from  its  sample  data  folder, and  store  that  in  something called  data  table bc, dtbc for Big Class data table. Then  I'm  going  to  tell that  Big  Class  table to  run  the  script  that's  already saved  to  the  table  called  Bivariate, and  do  that  in  a  new  window  called  NW. Then  all  I  need  to  do  is  tell  NW to  Save   Interactive  HTML. I  should  if  everything  works  right, I  probably  could  have  scripted  this to  close  automatically  for  me. But  I  need  to  get  to  that  folder now  that  I've  closed. 
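As a reference for the Web() examples walked through above, here is a minimal sketch; the folder path, file names, and mailto string are placeholders for whatever the demo journal generates, not the actual paths used in the presentation.

```jsl
// Web() opens a URI with whatever the operating system associates with it
Web( "https://www.jmp.com" );              // a web page
Web( "C:/Demo Folder/" );                  // a local folder in the file explorer (placeholder path)
Web( "C:/Demo Folder/report.pptx" );       // a non-JMP document in its native application
Open( "C:/Demo Folder/data.xlsx" );        // opens the Excel file in JMP...
Web( "C:/Demo Folder/data.xlsx" );         // ...while Web() opens the same file in Excel
// A mailto URI pops up an editable draft instead of sending immediately
Web( "mailto:test@example.com?subject=JMP%20is%20great!!!" );
```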
I'm  going  to  cheat  a  little  bit and  open  this  folder. Now  it  has  that  report that  htm, and  back  to  our  regular  program  here. Now  I  can  run  this  line,  and  JMP  will  automatically  open  it in  my  default  web  browser. If  I've  got  that  file, and  now  I  want  to  email  it  to  people. It  should  be  relatively  simple  for  me, but  it  turns  out  it's  maybe  a  little bit  more  complicated  than  you'd  expect. If  you  have  a  static  list  of  people you  want  to  email, like  test and test2 @example.com, "Hey,  here's  your  monthly  reminder to  do  the  thing," and  I  want  to  attach  this  file that  I've  done, I  can  select  those  lines,  I  can  run  them. My  organization  is  going  to  pop  up a  little  Allow  or  Deny  for  each  person. Your  security  systems   may be a little bit  different. It's  not  perfect  yet. I  haven't  found  a  way  to  get  around  this, but  that  just  sent  two  emails. Now,  you  might  think   that since this is a  list  here, that  I  could  put that list into a variable  and  then  mail it, but  for  some  reason  it  doesn't  work. We'll  go  ahead  and  run  it and  it'll  give  us  an  error  message. I  tried  a  couple  of  different  ways of  evalling  this  list,  eval,  parse, all  those  things, and  then  I  just  thought, "Well,  no,  I'm  more  in  favor of  brute  force  here." One  way  to  do  it   is I  can  throw  this  into  a  list to all  my  email  addresses, say,  there  are  47  of  them, and  then  I  can  just  iterate through  it  with  a  for  loop. When  we  run  that, that  works  just  fine,  and  it  pops  up that  message that  my  organization  requires. That  is  another  way  to  do  it. You  can  either  use  the  URI  scheme to  pull  up  an  editable email, or  you  can  have  JMP do  the  whole  thing  for  you. It  works  pretty  well. I'm  sure  somebody  in  the  comments or  in  the  group will  know  what  I'm  doing  wrong  here, and  I'm  looking  forward to  learning  about  that. But  back  to  this. We  know  we  can  email  files. We  can  also  do  other  things  with  files. For example, we  can  automate creating  a  folder  structure. At  the  beginning  of  each  year, there  are  a  couple  of  different  places, where  I  need  to  make  a  folder with  a  different  subfolder in  it  for  each month so  that  we  can  figure  out where  things  are  being  filed. Now  I  could  walk  through  in  Explorer, and I could  create  a  new  folder, and  give  it  a  name, and  then  go  into  that  new  folder, and  create  another new folder, and  give  it  a  name, and  oh,  my  word, that  just  seems  too  mundane. I'd  almost  rather  die. Or  I  can  just  do  it  right  here. I  can  create  a  variable that  is  just  the  year  that  we  are, and  then  I  can  iterate through  months   1-12 and  create  a  directory with  the  year  and  then  that  month. Let's  go  ahead  and  run  that, and  then  I  should  be  able to  pop  into  this. I  see  a  directory that  wasn't  there  before,  2022, and  when  I  go  into  it, it's  already  done  for  me. I'm  so  happy  I  could  cry because  I  don't  have  to  waste  my  life manipulating  this. That's  just  one  example and  maybe  another  example, probably  our  biggest  example that  we're  going  to  do  today   would be reviewing  a  list  of  files. 
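The email loop and the folder-structure loop described above look roughly like this in JSL; the addresses, report path, and target directory are placeholders.

```jsl
// Email one report to a list of recipients, one message each
addresses = {"test@example.com", "test2@example.com"};    // placeholder list
For( i = 1, i <= N Items( addresses ), i++,
	Mail(
		addresses[i],
		"Monthly reminder",
		"Here's your monthly reminder to do the thing.",
		"C:/Demo Folder/report.htm"                        // attachment (placeholder path)
	)
);

// Build a year folder with one subfolder per month
year = "2022";
Create Directory( "C:/Demo Folder/" || year );             // year folder first
For( m = 1, m <= 12, m++,
	Create Directory( "C:/Demo Folder/" || year || "/" || Char( m ) )
);
```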
For  example, part  of  my  job,  each  month, different  people  do  different audits, and  those  audits  are  stored  in  folders. But  those  audits,  for  example, are not stored  in  just  one  set  of  folders. It's  lots  of  folders. I  could  comb  through  all  of  them to  learn  which  ones  were  done   so that  I  could  review  each  of  them. Or  I  can  just  make  a  script  to  do  it. Here's  the  script, which  is  a  little  bit  more  complicated, but  it's  really  still  just  brute  force, and  it's  just  piecing  things  together. First  that  we're  going  to  do is  we're  going  to execute  this  code  right  here, which  is  going  to  get  a  date, which  is  the  beginning  of  last  month, and  this  is  being  recorded  in  July, so  that's  that  date  there. Then  I  just  want  to  go  into  this  folder that  we've  been  working with, and  get  a  list  of  all  the  files in  this  directory. There's  a  recursive, and  that  just  means  also  look  through the   subfolders  because  I  don't  want  to. I  can  run  through  that,  and  I  can  get this  list  of  new  files  together. Sorry,  I  can  get  this  list  of  files, and  I  can  hover  over  that, and  I  can  see  that  it  has  the  documents we've  been  looking  at. From  that  list  of  all  the  files, I  need  to  know  just  which  ones were  the  new  files. To  do  that,  I'll  create  an  empty  list. Then  after  this  empty  list, I'm  going  to  look  at  the  creation  date of  each  of  the  files  in  this  list. If  that  creation  date  is  bigger  than this  date  that  I  set  earlier, then  I'll  keep  it, and I'll say it's in  the  new  list. That's  all  nice,  and  fine,  and  dandy, let's  run  that,  and  nothing  will  happen. I  want  to  actually  be  able  to  see  it. I  can  tell  JMP  to  show  me a  new  list  or a new window with  a  list  box  of  those  new  files. There  I  have  the  list. That's  nice,  but  I  can't  really  do anything  with  it  yet. If  I  just  copy  and  paste  this down  here  below and  reformat  it  a  little  bit. It's  the  same  script  as  before. But  now  I  just  append  a  little  button  box at  the  bottom  that  says  Open, and  I  tell  JMP  that  when  I  click  on  it, on  that  open  button, tell  me  what  I  have  selected, and  then  going  back to  the  beginning  here, use  that  web  function. Open  that  file. Well,  then  I  get  something that  presents  me  with  a  list and  allows  me  to  select  each  of  them or  one  of  them  and  hit  the  Open  button. Honestly,  this  one  little  thing  saves  me so  much  time  each  month and  so  much  hassle of  trying  to  comb  through  lots of different folders and  find  which  things  I  need  to  review. I  think  the  lesson  of  this  is, not  so  much  that  you  have  to  do   the exact same  things  that  I've  done, but  more  to  start  thinking  about what  can  you  do  with  this? Now,  I'm  not  saying that  you  could  hack  your  job and  get  paid  to  do  nothing for five years, that's  mostly  just  there  for  a  laugh. But  what  I  am  saying  is  if  we  go  back to  that  notion  of  we  all  do  two  things, the  things  that  we  want  to  do and  the  things  that  we  have  to  do, maybe  we  can  challenge  ourselves  to  find a  way  to  save,  I  don't  know, 30  minutes  a  week,  30  minutes  a  month, by  automating  the  tasks that  we  don't  really  want  to  do. 
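The file-review script described above can be sketched as follows; the audit folder path is a placeholder, and the exact option names may differ slightly from the script used in the demo.

```jsl
folder = "C:/Demo Folder/";                    // placeholder audit folder

// Beginning of last month
m = Month( Today() );
y = Year( Today() );
If( m == 1, m = 12; y--, m-- );
cutoff = Date MDY( m, 1, y );

// All files in the folder and its subfolders
allFiles = Files In Directory( folder, recursive );

// Keep only the files created since the cutoff
newFiles = {};
For( i = 1, i <= N Items( allFiles ), i++,
	If( Creation Date( folder || allFiles[i] ) > cutoff,
		Insert Into( newFiles, allFiles[i] )
	)
);

// Present the list with an Open button that launches the selected file
New Window( "New audit files",
	lb = List Box( newFiles ),
	Button Box( "Open",
		sel = lb << Get Selected;
		If( N Items( sel ) > 0, Web( folder || sel[1] ) );
	)
);
```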
That  way  then we  can  focus  a  little  bit  more on  the  tasks  that  we  do  want  to  do. I'd  love  to  hear  your  comments and  love  to  hear how  you've  succeeded  with  this. Thanks.
Repeated k-fold cross-validation is commonly used to evaluate the performance of predictive models. The problem is, how do you know when a difference in performance is sufficiently large to declare one model better than another? Typically, null hypothesis significance testing (NHST) is used to determine if the differences between predictive models are “significant”, although the usefulness of NHST has been debated extensively in the statistics literature in recent years. In this paper, we discuss problems associated with NHST and present an alternative known as confidence curves, which has been developed as a new JMP Add-In that operates directly on the results generated from JMP Pro's Model Screening platform.     Hello. My  name  is  Bryan  Fricke. I'm  a  product  manager  at  JMP  focused  on  the  JMP  user  experience. Previously,  I  was  a  software  developer  working  on  exporting  reports to  standalone  HTML  fire  files. JMP  Live  and  JMP  Public. In  this  presentation, I'm  going  to  talk  about  using   Confidence Curves as  an  alternative  to  null  hypothesis  significance  testing in  the  context  of  predictive  model  screening. Additional  material  on  this  subject  can  be  found  on  the  JMP  Community  website in  the  paper  associated  with  this  presentation. Dr.  Russ  Wolfinger  is a  Distinguished  Research  Fellow  at  JMP and  a  co- author,  and  I  would  like  to  thank  him  for  his  contributions. The  Model  Screening  Platform, introduced  in  JMP  Pro  16 allows  you  to  evaluate  the  performance of  multiple  predictive  models using  cross  validation. To  show  you  how the  Model  Screening  platform  works, I'm  going  to  use  the  Diabetes  Data  table, which  is  available  in  the  JMP sample  data  library. I'll  choose  model  screening  from  the analyzed  predictive  modeling  menu. JMP  responds  by  displaying the  Model  Screening  dialogue. The  first  three  columns  in  the  data  table represent  disease  progression in  continuous  binary  and  ordinal  forms. I'll  use  the  continuous  column named  Y  as  the  response  variable. I'll  use  the  columns  from  age  to  glucose in  the  X  factor  role. I'll  type  1234  in  the  set  random  seed input  box  for  reproducibility, I'll  select  the  check  box  next  to  K-Fold cross  validation  and  leave  K  set  to  five. I'll   type 3  into  the  input box  next  to  repeated  K-F old. In  the  method  list,  I'll  unselect  neural. Now  I'll  click  Okay. JMP  responds  by  training  and  validating models  for  each  of  the  selected  methods using  their  default  parameter settings  and  cross  validation. After  completing  the  training and  validating  process, JMP  displays  the  results  in  a  new  window. For  each  modeling  method. The  Model  Screening  platform provides  performance  measures in  the  form  of  point  estimates for  the  coefficient  of  determination, also  known  as  R  squared, the  root  average  squared  error, and  the  standard  deviation for  the  root  average  squared  error. Now  I'll  click  select  dominant. JMP  responds  by  highlighting  the  method that  performs  best  across the  performance  measures. What's  missing  here  is  a  graphic  to  show the  size  of  the  differences between  the  dominant  method and  the  other  methods, along  with  the  visualization of  the  uncertainty associated  with  the  differences. 
But why not just show P-values indicating whether the differences are significant? Shouldn't a decision about whether one model is superior to another be based on significance? First, since the P-value provides a probability based on a standardized difference, a P-value by itself loses information about the raw difference. A significant difference doesn't imply a meaningful difference. Is that really a problem? I mean, isn't it pointless to be concerned with the size of the difference between two models before using significance testing to determine whether the difference is real? The problem with that line of thinking is that it's power, or one minus beta, that determines our ability to correctly reject a null hypothesis. Authors such as Jacob Cohen and Frank Smith have suggested that typical studies have the power to detect differences in the range of 0.4 to 0.6. So let's suppose we have a difference where the power to detect the true difference is 0.5 at an alpha level of 0.05. That suggests we would detect the true difference, on average, 50% of the time. So in that case, significance testing would identify real differences no better than flipping an unbiased coin.

If all other things are equal, type 1 and type 2 errors are equivalent. But significance tests that use an alpha value of 0.05 often implicitly assume type 2 errors are preferable to type 1 errors, particularly if the power is as low as 0.5. A common suggestion to address these and other issues with significance testing is to show the point estimate along with confidence intervals. One objection to doing so is that a point estimate along with a 95% confidence interval is effectively the same thing as significance testing. Even if we assume that is true, a point estimate and confidence interval still put the magnitude of the difference and the range of the uncertainty front and center, whereas a lone P-value conceals them both. So various authors, including Cohen and Smith, have recommended replacing significance testing with point estimates and confidence intervals. Even so, the recommendation to use confidence intervals begs the question, which ones do we show? Showing only the 95% confidence interval would likely encourage you to interpret it as another form of significance testing. The solution provided by Confidence Curves is to literally show all confidence intervals up to an arbitrarily high confidence level.

How do I show Confidence Curves in JMP? To conveniently create Confidence Curves in JMP, install the Confidence Curves add-in by visiting the JMP Community homepage. Type Confidence Curves into the search input field. Click the Confidence Curves result. Now click the download icon next to the ConfidenceCurves JMP add-in file. Now click the downloaded file. JMP responds by asking if I want to install the add-in. You would click Install. However, I'll click Cancel, as I've already installed the add-in. So how do you use the add-in? First, to generate Confidence Curves for this report, select Save Results Table from the top red triangle menu located on the Model Screening report window.
JMP responds by creating a new table containing, among others, the following columns: Trial, which contains the identifiers for the three sets of cross-validation results; Fold, which contains the identifiers for the five distinct sets of subsamples used for validation in each trial; Method, which contains the name of the method used to create each model; and N, which contains the number of data points used in the validation folds. Note that the Trial column will be missing if the number of repeats is exactly one, in which case the Trial column is neither created nor needed. Save for that exception, these columns are essential for the Confidence Curves add-in to function properly. In addition to these columns, you need one column that provides the metric to compare between methods. I'll be using R squared as the metric of interest in this presentation.

Once you have the model screening results table, click Add-Ins from JMP's main menu bar and then select Confidence Curves. The logic that follows would be better placed in a wizard, and I hope to add that functionality in a future release of this add-in. As it is, the first dialog that appears requests you to select the name of the table that was generated when you chose Save Results Table from the Model Screening report's red triangle menu. The name of the table in this case is Model Screen Statistics Validation Set. Next, a dialog is displayed that requests the name of the method that will serve as the baseline from which all the other performance metrics are measured. I suggest starting with the method that was selected when you clicked the Select Dominant option in the Model Screening report window, which in this case is Fit Stepwise. Finally, a dialog is displayed that requests you to select the metric to be compared between the various methods. As mentioned earlier, I'll use R squared as the metric for comparison.

JMP responds by creating a Confidence Curve table that contains P-values and corresponding confidence levels for the mean difference between the chosen baseline method and each of the other methods. More specifically, the generated table has columns for the following: Model, in which each row contains the name of the modeling method whose performance is evaluated relative to the baseline method; P-value, in which each row contains the probability associated with a performance difference at least as extreme as the value shown in the Difference in R Square column; Confidence Interval, in which each row contains the confidence level we have that the true mean is contained in the associated interval; and finally, Difference in R Square, in which each row is the maximum or minimum of the expected difference in R squared associated with the confidence level shown in the Confidence Interval column. From this table, Confidence Curves are created and shown in a Graph Builder graph.
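As a side note, the first step the add-in performs, computing per-trial, per-fold differences in the metric relative to the baseline method, can be sketched roughly as follows. This is a hypothetical pandas sketch, not the add-in's code; the file name and the RSquare column name are assumptions about how the saved results table might be exported.

```python
# Hypothetical sketch: aggregate the saved results table into per-fold
# differences in R^2 versus a baseline method (assumed to be "Fit Stepwise").
import pandas as pd

results = pd.read_csv("model_screen_statistics.csv")  # assumed CSV export of the saved table
wide = results.pivot_table(index=["Trial", "Fold"], columns="Method", values="RSquare")
diffs = wide.drop(columns="Fit Stepwise").sub(wide["Fit Stepwise"], axis=0)
print(diffs.mean())  # mean difference versus the baseline for each method
```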
So what are Confidence Curves? To clarify the key attributes of a Confidence Curve, I'll hide all but the Support Vector Machines Confidence Curve using the local data filter by clicking on Support Vector Machines. By default, a Confidence Curve only shows the lines that connect the extremes of each confidence interval. To see the points, select Show Control Panel from the red triangle menu located next to the text that reads Graph Builder in the title bar. Now I'll shift-click the points icon. JMP responds by displaying the endpoints of the confidence intervals that make up the Confidence Curve. Now I will zoom in and examine a point. If you hover the mouse pointer over any of these points, a hover label shows the P-value, the confidence interval, the difference in the size of the metric, and the method used to generate the model being compared to the reference model. Now we'll turn off the points by shift-clicking the points icon and clicking the Done button. Even though the individual points are no longer shown, you can still view the associated hover label by placing the mouse pointer over the Confidence Curve.

The point estimate for the mean difference in performance between the Support Vector Machines and Fit Stepwise models is shown at the 0% confidence level, which is the mean value of the differences computed using cross validation. A Confidence Curve plots the extent of each confidence interval from the generated table, between zero and the 99.99% confidence level, with the confidence levels shown along the left Y axis. P-values associated with the confidence intervals are shown along the right Y axis. The Y axis uses a log scale so that more resolution is shown at higher confidence levels.

By default, two reference lines are plotted alongside a Confidence Curve. The vertical line represents the traditional null hypothesis of no difference in effect. Note that you can change the vertical line position, and thereby the implicit null hypothesis, in the X axis settings. The horizontal line passes through the conventional 95% confidence interval. As with the vertical reference line, you can change the horizontal line position, and thereby the implicit level of significance, by changing the Y axis settings. If a Confidence Curve crosses the vertical line above the horizontal line, you cannot reject the null hypothesis using significance testing. For example, we cannot reject the null hypothesis for Support Vector Machines. On the other hand, if a Confidence Curve crosses the vertical line below the horizontal line, you can reject the null hypothesis using significance testing. For example, we can reject the null hypothesis for Boosted Tree.

How are Confidence Curves computed? The current implementation of Confidence Curves assumes the differences are computed using R-times repeated K-fold cross validation. The extent of each confidence interval is computed using what is known as a variance-corrected resampled T-test. Authors Claude Nadeau and Yoshua Bengio note that a corrected resampled T-test is typically used in cases where training sets are five or ten times larger than validation sets. For more details, please see the paper associated with this presentation.
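For reference, one common way to write the corrected resampled t-interval, following Nadeau and Bengio, is as follows; the add-in may differ in details such as the degrees of freedom, so treat this as a sketch. Given $J = r \times k$ per-fold differences $d_1, \dots, d_J$ with mean $\bar d$ and sample variance $\hat\sigma^2$, and validation and training set sizes $n_2$ and $n_1$, the $100(1-\alpha)\%$ interval is

$$
\bar d \;\pm\; t_{1-\alpha/2,\;J-1}\,\sqrt{\left(\frac{1}{J} + \frac{n_2}{n_1}\right)\hat\sigma^2},
$$

where the factor $n_2/n_1$ inflates the usual $\hat\sigma^2/J$ variance to account for the overlap between training sets. A Confidence Curve simply plots these endpoints for every confidence level $1-\alpha$ from 0% up to 99.99%.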
So how are Confidence Curves interpreted? First, the Confidence Curve graphically depicts the mean difference in the metric of interest between a given method and a reference method at the 0% confidence level, so we can evaluate whether the mean difference between the methods is meaningful. If the mean difference isn't meaningful, there's little point in further analysis of a given method versus the reference method with respect to the chosen metric. What constitutes a meaningful difference depends on the metric of interest as well as the intended scientific or engineering application. For example, you can see the model developed with the Decision Tree method is on average about 14% worse than Fit Stepwise, which arguably is a meaningful difference.

If the difference is meaningful, we can evaluate how precisely the difference has been measured by evaluating how much the Confidence Curve width changes across the confidence levels. For any confidence interval not crossing the default vertical axis, we have at least that level of confidence that the mean difference is nonzero. For example, the Decision Tree Confidence Curve doesn't cross the vertical axis until about the 99.98% confidence level, so we are nearly 99.98% confident the mean difference isn't equal to zero. In fact, with this data, it turns out that we can be about 81% confident that Fit Stepwise is at least as good, if not better, than every method other than Generalized Regression Lasso.

Now let's consider the relationship between Confidence Curves. If two or more Confidence Curves substantially overlap and the mean difference of each is not meaningfully different from the other, the data suggest each method performs about the same as the other with respect to the reference model. So, for example, we see that on average the Support Vector Machines model performs within less than 0.5% of Bootstrap Forest, which is arguably not a meaningful difference. The confidence intervals begin to overlap at about the 4% confidence level, which suggests these values would be expected if both methods really do have about the same difference in performance with respect to the reference.

If the average difference in performance is about the same for two Confidence Curves, but the confidence intervals don't overlap much, the data suggest the models perform about the same as each other with respect to the reference model; however, we are confident of a non-meaningful difference. This particular case is rarer than the others, and I don't have an example to show with this data set.

On the other hand, if the average difference in performance between a pair of Confidence Curves is meaningfully different and the Confidence Curves have little overlap, the data suggest the models perform differently from one another with respect to the reference. For example, the Generalized Regression Lasso model predicts about 13.8% more of the variation in the response than does the Decision Tree model. Moreover, the Confidence Curves don't overlap until about the 99.9% confidence level, which suggests these results would be quite unusual if the methods actually performed about the same with respect to the reference.
Finally, if the average difference in performance between a pair of Confidence Curves is meaningfully different and the curves have considerable overlap, the data suggest that while the methods perform differently from one another with respect to the reference, it wouldn't be surprising if the difference is spurious. For example, we can see that on average Support Vector Machines predicted about 1.4% more of the variance in the response than did K Nearest Neighbors. However, the confidence intervals begin to overlap at about the 17% confidence level, which suggests it wouldn't be surprising if the difference in performance between each method and the reference is actually smaller than suggested by the point estimates. Simultaneously, it wouldn't be surprising if the actual difference is larger than measured, or if the direction of the difference is actually reversed. In other words, the difference in performance is uncertain.

Note that it isn't possible to assess the variability in performance between two models relative to one another when the differences are relative to a third model. To compare the variability in performance between two methods relative to one another, one of the two methods must be the reference method from which the differences are measured.

But what about multiple comparisons? Don't we need to adjust the P-values to control the familywise type 1 error rate? In his paper about Confidence Curves, Daniel Berrar suggests that adjustments are needed in confirmatory studies where a goal is prespecified, but not in exploratory studies. This idea suggests using unadjusted P-values for multiple Confidence Curves in an exploratory fashion, and using only a single Confidence Curve generated from different data to confirm a finding of a significant difference between two methods when using significance testing. That said, please keep in mind the dangers of cherry-picking and p-hacking when conducting exploratory studies.

In summary, the Model Screening platform introduced in JMP Pro 16 provides a means to simultaneously compare the performance of predictive models created using different methodologies. JMP has a long-standing goal to provide a graph with every statistic, and Confidence Curves help to fill that gap for the Model Screening platform. You might naturally expect to use significance testing to differentiate between the performance of the various methods being compared. However, P-values have come under increased scrutiny in recent years for obscuring the size of performance differences. In addition, P-values are often misinterpreted as the probability that the null hypothesis is true. Instead, a P-value is the probability of observing a difference as or more extreme than the one observed, assuming the null hypothesis is true. The probability of correctly rejecting the null hypothesis when it is false is determined by power, or one minus beta. I have argued that it is not uncommon to have only a 50% chance of correctly rejecting the null hypothesis with an alpha value of 0.05. As an alternative, a confidence interval could be shown instead of a lone P-value.
However, the question would be left open as to which confidence level to show. Confidence Curves address these concerns by showing all confidence intervals up to an arbitrarily high level of confidence. The mean difference in performance is clearly visible at the 0% confidence level, and that acts as a point estimate. All other things being equal, type 1 and type 2 errors are equivalent. Confidence Curves don't embed a bias towards trading type 1 errors for type 2 errors. Even so, by default, a vertical line is shown in the Confidence Curve graph for the standard null hypothesis of no difference. In addition, a horizontal line is shown that delineates the 95% confidence interval, which readily affords a typical significance-testing analysis if desired. The defaults for these lines are easily modified if a different null hypothesis or confidence level is desired. Even so, given the rather broad and sometimes emphatic suggestion to replace significance testing with point estimates and confidence intervals, it may be best to view a Confidence Curve as a point estimate along with a nearly comprehensive view of its associated uncertainty.

If you have feedback about the Confidence Curves add-in, please leave a comment on the JMP Community site. And don't forget to vote for this presentation if you found it interesting or useful. Thank you for watching this presentation, and I hope you have a great day.
In mixture experiments, the factors are constrained to sum to a constant. Whether measured as a proportion of total volume or as a molar ratio, increasing the amount of one factor necessarily leads to a decrease in the total amount of the other factors. Sometimes also considering unconstrained process factors, these experiments require modifications of the typical design and analysis methods. Power is no longer a useful metric to compare designs, and analyzing results is far more challenging. Framed in the setting of lipid nanoparticle formulation optimization for in vivo gene therapy, we use a modular JSL simulation to explore combinations of design and analysis options in JMP Pro, highlighting the ease and performance of the new SVEM options in Generalized Regression. Focusing on the quality of the candidate optima (measured as a percentage of the maximum of the generating function) in addition to the prediction variance at these locations, we quantify the marginal impact of common choices facing scientists, including run size, space-filling vs optimal designs, the inclusion of replicates, and analysis approaches (full model, backward AICc, SVEM-FS, SVEM-Lasso, and a neural network used in the SVEM framework).

Hello, everyone, and welcome to JMP Discovery. Andrew Karl, Heath Rushing, and I have a presentation today that's going to highlight some of the features in JMP 17 Pro that will help you out in the world of mixture models. Many of you are involved in formulations, and to be honest, that's what we've been doing a lot lately. We're a lot like ambulance chasers, where we'll just go after the latest thing that customers are interested in. But that's what we're seeing a lot lately: folks that are doing mixture models that are actually quite complex, more so than we'd ever know. We just decided we would do some deeper investigation with some of the new techniques that are out, and that JMP 17 performs for us. Andrew, would you like to get started and maybe give a little bit of background on the whole idea of what a mixture model is, and some of the other techniques?

Okay, so let's start out with a nice, easy graph. Let's take a look over here at the plot on the left. We're in an experimental setting, so suppose we've got two factors, Ingredient A and Ingredient B, and they range from 0-1. If there are no mixture constraints, then everything in the black square is a feasible point in this factor space, and so our design routine is going to give us points somewhere in this space. However, if there's a mixture constraint where these have to add up to one, then only the red line is feasible. We want to get a conditional optimum given that constraint, and we want to end up somewhere on that line for both our design and our analysis. If we move up to three mixture ingredients, A, B, and C, all able to vary from 0-1, then we get a cube for that 0-1 constraint for each of them. But with the mixture constraint, the feasible region takes the form of a plane intersecting the cube, and that gives us this triangle, so only this red triangle is relevant out of that entire space.
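Stated in standard notation for reference: with $q$ mixture ingredients, the feasible region is the $(q-1)$-dimensional simplex

$$
\left\{\, x \in \mathbb{R}^{q} \;:\; x_i \ge 0,\ \sum_{i=1}^{q} x_i = 1 \,\right\},
$$

which is the line segment when $q = 2$ and the triangle when $q = 3$.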
If we have four dimensions, if we have four mixture factors, then the allowable factor space is actually a three-dimensional subset, a pyramid within that space. Looking back at the three-mixture setting, see this triangle? That's the allowed region. Well, that's why JMP gives us these ternary plots. For these ternary plots, what JMP will do if you have more than three mixture factors is show two factors at a time, and the third axis will be the sum of all the other mixture factors. We can look at these ternary plots rather than having to look through a pyramid.

We have to decide, do we want a Space Filling Design or an optimal design? Now, normally in a non-mixture setting, we'd use an optimal design, and for the most part we wouldn't consider a Space Filling Design. There are a few reasons that we want to consider a Space Filling Design in mixture settings. Often in the formulations world, there's a little bit more sensitivity to going too far in your factor space, making it too wide; then your entire process fails. Suppose that happens over here where X2 is; suppose it fails for everything below 0.1. You're going to lose a good chunk of your runs, because the optimal design tends to put most of your runs on the boundary of the factor space, so you're going to lose these. You're not going to be able to run your full model with the remaining points, and you're not going to have any good information about where that failure boundary is. For the Space Filling Design, if you have some kind of failure below 0.1, you're losing a smaller number of points. Your remaining points still give you a Space Filling Design in the existing space that you can use to fit the model effects, and now we're going to be able to model that failure boundary.

Also, in the mixture world we often see higher-order effects active, interactions or curvature polynomials, more than we might see in the non-mixture setting. If we don't specify those, because any of these designs is optimal conditional on the target model we give, so if we don't specify an effect for the optimal design, we might not be able to fit it after the fact, because it might be aliased with other effects that we have. These space-filling runs act as something of a catch-all for possible runs. So there are a couple of reasons that we might want to consider space-filling runs, but we want to take a look analytically: what's the difference in performance between these after we run through model reduction, not just at the full model, because we're not just going to be using the full model. That's the design-phase question.

When we get to the analysis, whenever you're looking at these mixture experiments in JMP, JMP automatically turns off the intercept for you, because if you want to fit the full model with all of your mixture main effects, you can't include the intercept. You'll get this warning because the mixture main effects are constrained to sum to one, so they add up to what the intercept is, and they're aliased.
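To make that aliasing concrete: because $\sum_i x_i = 1$, the intercept column is exactly the sum of the mixture main-effect columns, so mixture models are fit without an intercept. In standard notation (stated here for reference; JMP's Scheffé Cubic option corresponds to a cubic model of this general type), the Scheffé-type linear, quadratic, and full cubic models are

$$
\begin{aligned}
E(y) &= \sum_{i} \beta_i x_i, \\
E(y) &= \sum_{i} \beta_i x_i + \sum_{i<j} \beta_{ij}\, x_i x_j, \\
E(y) &= \sum_{i} \beta_i x_i + \sum_{i<j} \beta_{ij}\, x_i x_j + \sum_{i<j} \delta_{ij}\, x_i x_j (x_i - x_j) + \sum_{i<j<k} \beta_{ijk}\, x_i x_j x_k .
\end{aligned}
$$

Note there are no pure quadratic $x_i^2$ terms: since $x_i^2 = x_i\bigl(1 - \sum_{j \ne i} x_j\bigr)$, they are already expressible in the terms above.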
Also, if we want to look at higher-order effects, we can't do it like a response surface where we have pure quadratic terms. We have to look at these Scheffé Cubic terms, because if we try to use the interactions plus the pure quadratic terms, then we get other singularities. Those are a couple of wrinkles in the analysis. However, going forward with the model selection methods, Forward Selection or Lasso, which are the base methods of the SVEM methods that we're going to be looking at, we want to consider sometimes turning off this default 'no intercept' option. What we find is that for the Lasso method we actually have to do that in order to get reasonable results.

After we fit our model, we want to do model reduction to kick out irrelevant factors. We've got a couple of different ways of doing that in base JMP. Probably what people do most frequently is use the effect summary to go backwards on the p-values and kick out the effects with large p-values. But this is pretty unstable because of the multicollinearity from the mixture constraint, where kicking out one effect can drastically change the p-values of the remaining effects in the design. What this plot shows is, if we go backwards on p-values, what is the largest p-value that's kicked out at each step, and we see some jumping around here, and that's from that multicollinearity. Given that kind of volatility in the fitting process, you can imagine that if you have small changes in your observed responses, maybe even from assay variability or any other variability, just small changes can lead to large changes in the reduced model that is presented as a result of this process. That high model variability is something it would be nice to average over in some way, in the same way that with Bootstrap Forest we average over the variability of the CART methods, or the partition methods, and that's what the SVEM methods are looking to do. In a loose sense, they're the analog of that for the linear methods we're looking at. Also, we can go to Stepwise and look at min AICc, which is maybe the preferred method. In our last slide of the show today we'll be taking a look, for the base JMP users, at AICc versus BIC versus p-value selection with our simulation diagnostics.

We want to give credit to a lot of the existing work leading up to the SVEM approach; these are some great references that we've used over the years. We also want to thank Chris Gotwalt for his patience in answering questions and sharing information as they've discovered things along the way. That has really helped set us up to be able to put this to good effect in practice.

Speaking of in practice, where we have been using this quite a bit over the years is the setting of lipid nanoparticle formulation optimization. What is lipid nanoparticle formulation? Well, a lipid nanoparticle, if you've gotten any mRNA COVID vaccines, Pfizer, Moderna, then you've gotten these. What these do is you take a mixture of four different lipids, summarized here, and they form these little bitty bubbles.
Then those are electrically charged, and they carry along with them a genetic payload, mRNA, DNA, or other payloads that either act as vaccines or can target cancer cells. The electric charge between the genetic payload and the opposite charge in the nanoparticle is what binds them together. Then we want it to get into the cell and release the payload inside. The formulation changes depending on what the payload is. Also, sometimes we might change the type of ionizable lipid, or the type of helper lipid, to see which one does better, so we have to redo this process over and over. For the most part, the scientists have found that ranges of maybe 10-60% for the lipid settings, and then a narrower range of 1-5%, are the feasible ranges for this process. That's been explored out, and that's the geometry we want to match here in our application. We want to say, given that structure that we're doing over and over, do we have an ideal analysis and design method for it? Also, we want to set up a simulation so that if we're looking at other structures, other geometries for the factor space, maybe we can generalize to that, but that's going to be our focus for right now. Given that background, I'm going to let Jim now summarize the SVEM approach and talk about that.

Yes, thank you. This particular Discovery presentation is going to be a little bit more centered on PowerPoint, unfortunately, because the results are really what this is all about for the simulations that we've done. But in this particular session, I will show you some of the new capability we have in JMP 17. If I go and I want to set up a Space Filling Design... Now, previously we weren't able to do mixture models with Space Filling Designs right out of the box, if you will. We certainly could put constraints in there, but now what we want to do is show you how you can do a Space Filling Design with these mixture factors. This is new, that I now have these come in as mixture factors, which is good because the design now carries all of those column properties with it as well. One thing worth mentioning right now is that the default is you're going to get 10 runs per factor, so 40 runs. In a DoE, that typically is good and we are happy, because our power is well above 90% or whatever our criterion is. But that's not the case with these mixture models, because there are so many constraints inherent in them. What that is telling me, unfortunately, is that even if I were to have 40 runs, I'd still only have 5% power for a Scheffé Cubic model, and even if it's main effects only, there's only 20% power. Power is not really a design criterion that we're going to look to when we do these mixture models. Now, typically in our applications, unfortunately, we don't have the luxury of having 40 runs. In this case we'll do 12 runs and see how that comes out. We'll go ahead and make that Space Filling Design, and you can see that it's maybe evenly spread throughout the surface.
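As an aside, one simple way to get the flavor of space-filling points under these constraints outside of JMP is sketched below. This is an illustration only, not JMP's space-filling routine; the 10-60% and 1-5% ranges are the ones just mentioned, and the candidate-filtering plus greedy maximin selection is an assumption made for the sketch.

```python
# Illustrative space-filling points on a constrained mixture region.
import numpy as np

rng = np.random.default_rng(1234)
lower = np.array([0.10, 0.10, 0.10, 0.01])   # three lipids and the PEG-lipid
upper = np.array([0.60, 0.60, 0.60, 0.05])

# Sample uniformly on the simplex, then keep points inside the box constraints.
cand = rng.dirichlet(np.ones(4), size=200_000)
cand = cand[np.all((cand >= lower) & (cand <= upper), axis=1)]

# Greedy maximin selection of a 12-run space-filling subset.
design = [cand[0]]
for _ in range(11):
    dist = np.min(
        np.linalg.norm(cand[:, None, :] - np.array(design)[None, :, :], axis=2), axis=1
    )
    design.append(cand[np.argmax(dist)])   # add the candidate farthest from the design
print(np.round(np.array(design), 3))
```

The point of the sketch is simply that the mixture and range constraints shrink the candidate set dramatically, which is why power-based summaries stop being informative for these designs.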
Of course, we do know that we're bounded with some of these guys here: we can only go from 1-5% on the polyethylene glycol. What I want to do now is just fast-forward. Now let's say I've run this design and I'm ready to do the analysis. This is where SVEM has really made huge headway, and if you listen to some of Chris and Phil Ramsey's work out there on the JMP Community, you'll see this is a step change. This is a game changer in terms of your analytical capability.

How would we do this in 16? In JMP 16, what we'd have to do is come through it manually, and actually it's worth going through the steps just because it gives you a little bit of insight; though the primary mission of this talk is not SVEM, it will give us an idea of what's going on. What we can do here is go ahead and make the Autovalidation Table. This is the JMP 16 methodology. What you'll note here is we've gone from 12 runs to 24; we just doubled them, and you see the Validation Set column. The training set may be the first 12, validation the next 12. That's what's added, and then we have this weight. This is the Bootstrap weight, and it's fractionally weighted. What happens is we will go ahead and run this model and come up with a predicted value, but then we need to change these weights and keep doing this over and over for our Bootstrap, much like the random forest idea, for the SVEM. Now, what is useful is to take a quick look: what is the geometry of these weights? We can see they're anti-correlated, meaning that if I'm low in the training set, I'm probably going to be high in the validation set. This is a quick little visual of that relationship.

Now I'm ready to go do my analysis in JMP 16. It would be Analyze, and we'd just do our Fit Model. Of course, we want a Generalized Regression, and we'll go through and do a Scheffé Cubic here, because it's a mixture. But here's where we have to add in the step: we put the validation set in as our validation column, and then this validation weight is going to be that frequency. Now I can run this. By the way, in many of our instances we're not normal, we're lognormal; we could put that in right there. Here we have our Generalized Regression ability to go ahead and run this model, and voila, there are the estimates. What we would do then is come here under Save the Prediction Formula. Then here is one run. Okay, so we got one run. You can see that the top is 15.17, and we actually saw 15.23, so not bad in this model, but we would do this over and over. We used to do it about 50 times or so. But with JMP 17, now this whole process is automated for us. We don't have to do this 50 times and then take the average of these prediction formulas. We're able to go directly at it.
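For readers who want a feel for the mechanics just described, here is a heavily simplified sketch of the fractionally weighted autovalidation idea. It is an illustration only, not JMP's SVEM implementation: the anti-correlated exponential weights and the plain weighted Lasso base learner (with the penalty chosen by validation-weighted error) are assumptions made for the sketch.

```python
# Simplified fractionally weighted autovalidation ("SVEM-style") sketch.
import numpy as np
from sklearn.linear_model import Lasso  # sample_weight support requires a recent scikit-learn

rng = np.random.default_rng(1234)

def svem_predict(X, y, X_new, n_boot=200, alphas=np.logspace(-3, 1, 25)):
    """Average predictions over fractionally weighted bootstrap fits."""
    preds = np.zeros((n_boot, X_new.shape[0]))
    for b in range(n_boot):
        u = rng.uniform(size=len(y))
        w_train, w_valid = -np.log(u), -np.log1p(-u)   # anti-correlated weights
        best_err, best_model = np.inf, None
        for a in alphas:
            m = Lasso(alpha=a, max_iter=10_000).fit(X, y, sample_weight=w_train)
            err = np.average((y - m.predict(X)) ** 2, weights=w_valid)
            if err < best_err:
                best_err, best_model = err, m
        preds[b] = best_model.predict(X_new)
    return preds.mean(axis=0)   # the averaged "prediction formula"
```

In JMP 17, this looping and prediction averaging is what the SVEM estimation options in Generalized Regression handle for you, which is what the demo turns to next.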
If I come back to my original design here with the response, I can get right at it. By the way, this is showing that I have another constraint put in here. A lot of times the chemists and biochemists like to see that, to make sure that the ratios based on molecular weights are within reason. Not only do we have the mixture constraints, we also have a lot of other constraints. I'm working with a group where we have maybe 15 different ingredients and probably 30 constraints in addition to the mixture constraints, so these methods work, and scale up, probably is the best way to say it, pretty well.

Now this is 17, so in 17 I can get right at it. I'm going to go ahead into Fit Model, and I'll go ahead and do a Scheffé Cubic. From here, what we're able to do is come into a Generalized Regression. In this case, we don't need to worry about these guys in here. We can change it to lognormal if we so desire. One of my choices in the estimation, instead of Forward, is in fact SVEM Forward, so I do SVEM Forward and I'm going to do 200. You'll see how quickly they have this tuned. Really, the only thing you can do in the advanced controls is check whether or not you want to force a term in. I hit Go, and almost instantaneously I've done 200 Bootstrap samples for this problem. Of course, I now can look at the profiler, and that is the average of the 200 runs. That's my end model, if you will. Of course, with the Prediction Profiler there are hundreds of things you can do from here, and Andrew will touch on a couple more of those. But two other things are worth noting here: I'll save the Prediction Formula as well and take a look at that guy. When I look at the Prediction Formula, I'll note that it is in fact already averaged out for me here. This is the average of the 200 different samples that are out there.

With that, that is the demo, and we'll go back to looking at the charts to say, "Well, what is it that we're seeing in terms of the results of SVEM?" Andrew, if you want to pull up that slide. This is maybe a quick visual. You can see that if I look at those first three, in this case, red is bad. What we're looking at here is the nominal coverage for a mixture model at the predicted optimum spot of performance. We can see that the standard Stepwise guys are not doing too well; that's the backward and forward AIC. These are the coverage rates. We'd like a nominal 5% error rate, that is, only 5% of the time should the true response fall outside the prediction, or actually the confidence, interval. In other words, the profiler that we just saw gives us a prediction or confidence interval, and we know the true value because we're playing the simulation game, right? We know what the true value is, so what percentage of the time was it in there? We can see that we don't do as well with our classical methods. The full model, putting all the terms in, and the Lasso do pretty well, at a 10% rate or so, but it's not until we get to the SVEM methods here that we start seeing that we're truly capturing the truth and getting good coverage. That's a good picture to keep in your mind: we are way outperforming some of the other methods out there when it comes to this capability.
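A minimal sketch of the bookkeeping behind such a coverage rate (an illustration only; the authors' JSL simulation script may compute it differently):

```python
# Empirical coverage: the fraction of simulated datasets whose interval at the
# predicted optimum contains the true response there; ideally about 95%.
import numpy as np

def coverage_rate(lowers, uppers, true_values):
    lowers, uppers, true_values = map(np.asarray, (lowers, uppers, true_values))
    return np.mean((lowers <= true_values) & (true_values <= uppers))

# e.g., coverage_rate(ci_low, ci_high, y_true_at_optimum) -> about 0.95 if well calibrated
```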
Now, in terms of this simulation, what we're focusing on here is a little bit different from what you may think of for a simulation where we're looking at a model and saying, "How well did we do with this particular method?" We could measure that by how the actual versus the predicted looks, and then we'd get some sort of mean squared error. We do track that value, but we find in our work that we're much more concerned about finding that optimal mixture, if you will, with the optimal settings that achieve a maximum potency, or minimize some side effects, or help us with this [inaudible 00:18:58]. That's going to be called the "percent of max" that we're looking for. We're going to use that as our primary response here in terms of being able to evaluate which methods outperform others. It's not really going to be, how far away am I from the location of the optimal value? It's, how far is the response that I predicted as optimum from the actual optimum? That's going to be our measure of success.

The way this will work is I'll be asking Andrew about a few ideas here in terms of what typically comes up in practice. I saw the geometry he showed me early on, and the optimal design always hits the boundaries. What if I like things that we call more mixed, right? You have more mixed stuff in the middle, space filling; which is better? If I do use an optimal design, it defaults to D maybe, but what about I and A? Then how about the age-old DOE idea of adding center points? Is that smart? Or is one center point enough? Or how about replicates? We've already discussed how power is not being helpful here, so what is a helpful measure of a good design? That's the design piece, but there's also the analysis piece: is there a particular method that outperforms everyone, or are there certain areas where we should focus on using Lasso and others where we should just use SVEM for selection? These are practical questions that come up from all of our customers, and we'd like to share with you some of the results that we get from the simulation. Andrew, do you want to give us a little bit more insight into our simulation process?

Yeah, thanks, Jim. Before I do that, I just want to point out one tool that we've made heavy use of in the analysis of our results. Unfortunately, we don't have time to delve into the demo, but it has been so useful: within the profiler, looking at the output random table for these mixture designs, and looking at the responses especially. We frequently have potencies along with side effects; we have multiple responses that we want to balance out with the desirability function, and then we're going to look at the individual responses themselves. When we output a random table, we get a space-filling set of points, basically not a design, but we fill up the entire factor space, and we're able to look at the marginal impact of each of the factors over the entire factor space. For example, for the ionizable lipid type, what we'll frequently see is that maybe one has a lower marginal behavior over the entire space.
But since we're wanting to optimize, we care about what the max of each of these is, and one of these will clearly be better or worse. We're looking at the reduced model; after he fits them, we'll go to the profiler and do this. We can still get the analytic optimum from the profiler, but in addition to that, this gives us more information outside of just that optimum. What we might do here for candidate runs, because we're always running confirmation runs after these formulation optimization problems, is run the global optimum here for H102, and we might also pick out the conditional optimum for H101 and see which one does better in practice. Also, looking at the ternary plots, if we color those by desirability or by the response, we can see the more or less desirable regions of that mixture space, so that can help us as we augment the design to either include additional areas in the factor space or exclude areas. I can't do much more with that right now, but I wanted to point that out because that's a very important part of our analysis process.

How do we evaluate some of these options within this type of geometry of a factor space? We built a simulation script that we have shared on the JMP website, and it allows us to plug and play different choices: the total sample size, how many runs are in the design; a true form choice that gives us the true generating function behind the process; and a design type, either space filling or optimal. The optimal design is going to be of a certain minimum size based on the number of effects that we're targeting. Do we have a second-order model, a third-order model, a Scheffé Cubic model? What do we have? Normally, whenever you build a model in Custom Design in JMP, it writes a model script out to your table, and then you use that to analyze it. Well, something we've explored is allowing a richer model than what we target: are we able to use these methods with SVEM and get additional, improved results, even though we didn't originally target those effects in the design? The short answer there is yes. That's something else we want to consider, so we allow ourselves, with the effects choice, to include additional effects. We can also look at the impact of adding replicates or center points, using the Custom DOE dialog to enforce those.

How does that affect our response? Any of the summaries that you get out of the design diagnostics are targeting the full model, whether with respect to prediction variance or, for D-optimal, the standard errors for the parameters. But what we really care about is how good the optimum is that we're getting out of this, so that's what we're going to take a look at with these simulations. For the most part in these LNP optimization scenarios, a lot of times we'll come across two situations. The scientists will say, "I've got about 12 runs available, and maybe it's not that important of a process, or the material is very expensive, and I just need to do the best I can with 12 runs. That's what I've got."
Or it might be something where they've got a budget for 40 runs, and they can fit a full second-order model plus third-order mixture effects, and we want to try to characterize the entire factor space and see what the response surface looks like over the whole thing. Those are the two scenarios we're going to be targeting in our simulation. Jim, I think you had some questions about performance under different scenarios. What was your first question there?

I did. I guess when I think about a 12-run scenario here, if I just go with the default, I'd get a D-optimal design and it would be main effects only. I recognize I could do the space filling like I just did, but my question is, if I do the default, which one of the analysis methods would be preferred? Or is there one?

Okay, so taking a look at that. For the D-optimal design, as a general rule, it's going to put almost all of its runs along the boundary of the factor space, and it's not going to have any interior runs unless you have quadratic terms or something that requires that. With a 12-run design, there are nine degrees of freedom required to fit all the main effects here. We've got a few degrees of freedom for error, but mostly we're only targeting the main effects here. How do the analysis methods do? Is there any difference between them? What we do, and this is how we're going to summarize all of these, is show the percent of max for all of our simulations, so we can see that distribution for each of the analysis methods, all for this 12-run, D-optimal design targeting the main effects. Then we also show any significant differences between these, and we're just using Student's t. We're not making a Tukey adjustment, so keep that in mind whenever you're looking at these significance values.

The winner here is our homemade SVEM neural approach, because it's not restricted to only looking at the main effects; it can allow some additional complexity in the model, and so it wins here. Now, don't get too excited about that, because this is about the best that we've seen SVEM neural do in these small settings. But if we are running more than one candidate optimum going forward, then maybe we can include a SVEM neural; in general, though, we wouldn't recommend sticking with only a SVEM neural, just because it tends to be more variable and have heavier low tails. What are the other results? We see the losers in this application are anything that's doing single-shot model reduction, because all these effects are significant in the model, and any time we pull one of them out, we are going to get a suboptimal representation of our process. That's why in this case the full model does better than those. But what's interesting is the SVEM linear approaches are able to at least match that full-model performance. We're not losing anything by using SVEM in this application, so that's a nice aspect; in the smaller setting we don't have to worry, are we hurting ourselves at all by using AICc? Now, something else we tried here is, given the same scenario, you've only got 12 runs.
You're still only targeting the main effects and the D-optimal criterion in the Custom DOE. What if we allow the fit model to consider second-order effects plus third-order mixture effects, more than what our design was targeted to do? What happens, and we see this jump here, is that the SVEM linear methods are able to utilize that information and give us a better percent of max for the candidate optima, and those are our winners here now, these SVEM linear methods. What we see, interestingly, is that the base methods for these SVEM approaches, the Forward method, or Forward Selection, or the Lasso, are not able to make use of that; only the SVEM is, so that's a nice property. They actually beat out Neural, which is nice because these are native to JMP 17 and they don't require as much computation time or manual setup as the neural approach. What we start to see here is the theme that we're going to see throughout the rest of the day: any of these Lasso approaches with no intercept are going to give us suboptimal results, because without the intercept the penalization doesn't work right in the Lasso, so you actually want to turn off the default option of no intercept if you're going to try to use SVEM Lasso or even just the Lasso.

Okay, so I guess it looks like SVEM Neural did well there. But again, that is not native; we can't do that with JMP 17 Pro, that's not in there. We can, but we have to have it manually scripted [inaudible 00:29:01]. Yeah, it's not a menu option. Okay, this is good, but I'm also a fan of the Space Filling Design, so how does that play out in terms of the analysis methods?

For the Space Filling Design, you can see that rather than having all the points along the exterior, along the boundary, now we fill up the interior space for both the mixture factors and the process factors. Sometimes in practice what we'll do is take the process factors and round these to the nearest 0.25 or 0.5, or whatever granularity works best for us, but this is what it looks like. In terms of the results, how do they perform? Now what we're going to do is compare the combination of the design approach and the analysis method and see which of these do best, still allowing the richer second- and third-order model for the selection methods. When we look at the comparison, the winners are the SVEM linear approaches, Lasso only with the intercept, not without the intercept, and the D-optimal design. Again, behind the scenes, you have to remember that you're now assuming for this D-optimal approach that your posited model is true over the entire factor space and you've got constant variance over that factor space. If you're worried about failures along the boundary, then that's something else to take into account, and it's not built into this. You have to consider that. But if you are confident, maybe you've run this before and you're only making minor changes, then the way to go is the D-optimal with the SVEM approaches. Down here, the losers are the Lasso with no intercept.
We're going to avoid those, and you can see those heavy tails down here. Not the SVEM Lasso, just the Lasso. Actually, here's the SVEM Lasso with no intercept down here. Yeah. They all get these Fs, so they all fail. -Conveniently, [crosstalk 00:30:48]. -Yeah. Okay.

What often will come up, whether it's designed up front or after we've done our 12 runs, is that the boss, she has some more questions and we have more runs. If we're going to do five more runs, how does that impact some of these results? When you say five more runs, that's not a follow-up study; you'd build the study with either 12 or 17 runs in a single shot right now, is what you're considering, right? Yeah, exactly. Okay, so we can look at the marginal impact, because there's a cost to you for those extra five runs. What's the benefit of those five extra runs? Using the design diagnostics, you could look at the FDS plot, and your FDS plot is lower, reflecting smaller prediction variance. Power is not that useful for these mixture-effects designs; we don't care about the parameters. We want to know, how well would we do with optimization? That's where the simulation's handy, we can take a look at that. How does the distribution of your percent of max change as you go from 12 to 17 runs? Interestingly, there's no benefit for the single-shot forward AICc in having 17 versus 12 runs. Now, again, right now we're looking at the percentage of max. If you look at your error variance, your prediction variance is going to be smaller, and there might be some other [inaudible 00:32:09], but mainly your prediction variance is going to be smaller if you look at that. But really, we don't care that much about prediction variance. We want to know, where is that optimum point? Because after this, we're going to be running confirmation runs, and maybe replicates at that point, to get an idea of the process and assay variance then. But right now, we are just trying to scout out the response surface to find our optimal formulation, so with that goal in mind, there's no benefit for forward AICc.

Now, for the SVEM methods, we do see there is a significant difference, and we do get a significant improvement in terms of the average percent of max we obtain, and maybe not as heavy tails down here. But now you need to decide, is that practically significant? Do you want to move from 90% to 92% mean percent of max in this first shot with five extra runs? You have to do your marginal cost, marginal benefit analysis there as a scientist and decide if that's worth it. Just looking at it here, what I think might be useful, because you have to run confirmation runs anyway, is if we run the 12-run design, you can then run a candidate optimum or two based on the results we get, plus maybe a couple of additional runs in a high-desirability region that looks good, or even augment out your factor space a little bit, and then you're still running a total of 17 runs, but now we're going to have an even better sense of the good region here, so that's something to consider.
Something else we can see from running the simulation with 17 runs is, let's look at the performance of each of the fitting methods within each iteration, and there's actually a surprisingly low correlation between the performance of these different methods within each iteration. We can use that to our benefit because we're going to be running confirmation runs after this, so rather than just having to take one method and one confirmation point, one candidate optimal point, if we were to, for example, look at these four methods and then take the candidate optimum from each of them, then we're going to be able to go forward with whichever one does best. We're looking at the maximum of these. Rather than looking at a mean of 92% to 94%, now we're looking at a mean of about 97% with a smaller tail if we consider multiple of these methods at once. Okay, very useful. Let's now put our eyes toward the 40-run designs. Very good information in terms of my smaller run designs. Now with 40, how does it play out in terms of these analysis methods? Are we going to see consistent behavior with what we saw in the 12-run design? Then how about the Space Filling versus the optimal design, the D-optimal? -I'd be interested in that. -Okay. We'll first take a look at the D-optimal design, 40 runs, and now we're targeting all of the second-order effects, the third-order effects, the mixture effects; we're targeting all the effects that are truly present in our generating function, and we still see that we're loaded up on the boundary of the factor space with the optimal design. Then with the space filling design, we see that we're filling up the interior of the factor space for the mixtures and for the other continuous process factors. Let's see what the performance difference is. First of all, focusing on the space filling design, which analysis methods do best? Same as we saw in the 12-run example: the SVEM linear, forward selection with the intercept, and Lasso with the intercept do the best. The worst you can do is keeping the full model, or trying SVEM or single-shot Lasso with no intercept. And in the D-optimal setting, same winners, which is reassuring because now we don't have to be worried about, "Well, we're changing our design type. Now we've got to change our analysis type." It's good to see this consistency across the winners of the analysis type. The full model doesn't do as poorly here with the optimal design, I think, because the optimal design is targeting that model, and the losers here are still the Lasso with no intercept. Then Neural is really falling behind here, behind the other methods. Now let's compare the space filling to the D-optimal designs, and we can really see the biggest difference here is within the full model: the space filling designs are much worse than the D-optimal design. Anytime you're doing design diagnostics, that's all within the context of the full model. For your D-optimality criterion, your average prediction variance, that's all there.
A  lot  of  times when  you  run  those  comparisons, you're  going  to  see a  stark  difference  between  those and  that's  what  you're  seeing  here. However,  in  real  life, we're  going  to  be  running a  model  reduction  technique. With  SVEM,  even  the  single shot  methods  improve  it. But  especially  with  SVEM  here, it  really  closes  the  difference  between the  space  filling  and  the  optimal  design, and  we  see  pretty  close  to  medium, and  slightly  heavier  tail  here. But  now  you  can  look  at  this  and  say. "Okay.  I  lose  a  little  bit with  space  f illing  design. But  if  I  have  any  concerns  at all about  the  boundary  of  the  factor  space, or  if  I'm  somewhat  limited in  how  many  confirmation  points  I  can  run and  I  want  to  have  something that's  going  to  be  not  too  far  away from  the  candidate  optimum that  I'm  going  to  carry  forward, then  those  are  the  benefits of  the  space  filling  design." Now  we  can  weigh  those  out. We're  not  stuck with  this  drastic  difference between  the  two. Again,  that's  based  only  versus the   D-optimal  design . I  guess  a  lot  of  times  in  our  DOE  work, we  like  to  maybe  look at  the  I-optimality  criteria and  even  the  A has  done  really  well  for  us. In  particular, it  spreads  it. It's c ertainly  not  space  filling, but  at  least  it  spreads  it  out a  bit  more  than  the  D -optimal. Do  we  have  any  ideas how  those  I  and  A  optimal  work? Yeah,  we  can  swap  those  out into  simulations. One  thing  we've  always  noticed, I  love  the  A -optimal  designs in  the  non- mixture  setting. It's  almost  my  default  now. I  really  like  them. But  in  the  mixture  setting, whenever  we  try  them, even  before  the  simulations, if  we  look  at  the  design  diagnostics, the  A -optimal  never  does  as  well as  the  D  or  the  I -optimal, and  that  bears  out  here in  the  simulations, that's  the  blue  here  for  the  optimal, gives  us  inferior  results. Rule  of  thumb  here  is, don't  bother  with  the  A-optimal  designs for  mixture  designs. Now  for  D  versus  I -optimal, we  don't  see  any... In  this  application for  this  generating  function, we  don't  see  any  difference  between  them. However,  a  reason  to  slightly  prefer the   D-optimal  is, there  tends  to  be  some  convergence  issues for  these  LNP  settings where  you've  got  to  peg  over  the  one  5% and  you're  trying  to  target a   Scheffé Cubic  model  in  JMP, so we've  noticed  sometimes some  convergence  problems for  the   I-optimal  designs and  it  takes  longer. The  D -optimal, if  there's  not  much  of  a  benefit, then  it  seems  to  be  the  safer  bet to  stick  with  the   D-optimal. Now  we  weren't  able to  test  that  with  the  simulations because  right  now  in  JMP, you  can't  script  in  Scheffé Cubic  terms  into  the  DOE to  build  an  optimal  design. You  have  to  do  that  through  the  GUI. We  weren't  able  to  test  that, see  how  often  that  happens, but  that's  why  we've  carried  forward D-optimal  in  these  simulations and  we  stick  with  those. If  you  want  to  in  your  applications, you  can  try  both  D  and  I and  see  what  they  look  like both  graphically and  with  the  diagnostics, but  the   D-optimal  seems to  be  performing  well. 
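Since space filling designs keep coming up as the alternative here, the sketch below shows one crude way to spread points over a mixture simplex: oversample uniformly with a Dirichlet draw, then keep a greedy maximin subset. This is only an illustration under those assumptions; it is not JMP's Fast Flexible Filling algorithm, and the function name is made up.

import numpy as np

def maximin_mixture(n_runs, n_components, n_candidates=5000, seed=0):
    rng = np.random.default_rng(seed)
    cand = rng.dirichlet(np.ones(n_components), size=n_candidates)  # uniform points on the simplex
    chosen = [0]                                                    # arbitrary starting point
    for _ in range(n_runs - 1):
        d = np.linalg.norm(cand[:, None, :] - cand[None, chosen, :], axis=2).min(axis=1)
        chosen.append(int(np.argmax(d)))                            # farthest point from the current set
    return cand[chosen]

interior_runs = maximin_mixture(25, 3)   # e.g. 25 interior runs over 3 mixture components

A set of interior runs like this could then be handed to an augmentation step that adds optimality-driven runs, which is the hybrid idea discussed next.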
Okay, I guess just to keep pulling the thread a little bit further, a lot of times we'll try some type of a hybrid design. Why don't we start out with, say, 25 space filling runs, and then augment that with some D-optimal criterion to make sure that we can target the specific parameters of interest? Does that work out pretty well? Yeah, we can simulate that and take a look. This is the same simulated generating function we've been looking at; either we run the D-optimal, or the space filling, or a hybrid, where we start out with 25 space filling runs and then we go to augment design and build 15 additional runs targeting the third-order model. What we see is that now we have no significant difference in terms of the optimization between the 40-run D-optimal and the hybrid design. But in the hybrid design, we get the benefit of those 25 space filling runs. We get some interior runs, protection to fit additional effects, and protection against failures along the boundary. It's a little bit more work to set this up. We'll only do this for high-priority projects, because of that extra cost and time, but it does appear to be a promising method. Right. Practically, if you think about where your optimum is going to be, there's a good chance it could be in that interior space that's not filled by the D-optimal, which loads up along the boundaries. I guess just maybe going back, revisiting the ideas of what if I had a center point, what if I had a point that I could replicate? Again, maybe on the 40-run design, if I had five more runs, are there just any other little nuggets that we learned along the way with these? Well, this comes up a lot, because the textbook will tell you to add five to seven replicate runs. The scientists are going to kick you out if you try to do that. A lot of times we have to make the argument to add even a single replicate run, because it has advantages outside of the fitting: now you get a model [inaudible 00:41:09] and, just graphically, we can use that as a diagnostic; we can look at that error variance relative to the entire variance from the whole experiment. It's very useful to have, and so it's going to be nice to have an argument to say, "Okay, we're not hurting your optimization routine by including even a single replicate run." That's what we see here for the 40-run example by forcing one of these to be a replicate within custom design. We are not getting a significant difference at all in terms of optimization. It's neither helping nor hurting. Let's go ahead and do that, so that we have that extra piece of information going forward. I don't have the graphs here because it's boring. It's the same thing in this particular application, forcing one of them to be a center point. There's no difference. Part of that might be that in this case the D-optimal design was giving us a center point, or something close to a center point, so that might not have been changing the design that much.
You  might  see  a  bigger  difference if  you  go  back  to  the  12-run  design enforce  the  centerpoint. But  that's  the  advantage   of  having  a  simulation  framework  built  up where  you  can  take  a  look  at  that and  see  what  is  the  practical  impact   going  to  be  for  including  that. Okay,  now  how  about... I  mentioned  I  have  this  big  project   with  lots  of  constraints. Would  a  constraint  maybe change  some  of  the  results? Well,  we  could  possibly  include   the  constraints and  it's  going  to  change  the  allowed  region  within  the... Graphically,  you're to  going  to  see a  change  in  your  allowed  region, and  we  can  simulate  that. Actually,  I've  done  that. I  don't  have  the  graph up  with  me  right  now, but  what  it  does  is  there's  not that  much  an  impact,  SVEM   still  does well. One  difference  we  did  note   is  that  running  this  simulation and  then  constraining  the  region  somewhat is  that  the  space  filling  improved because  it's  got  a  smaller  space   to  fill  and  not  as  much  noise  space, but  the  D-optimal   will  perform just  as  well  between  the  two   with  or  without  constraint. That  was  pretty  interesting  to  see. But  all  of  this  applies  just  as  well with  constraints and  nothing  of  note  in  terms  of  difference for  analysis  methods  with  the  constraint, at  least the  relatively  simple  ones   that  we  applied. Right,  okay,  we're  almost  running  short on  time  here,  Andrew, but  I  do  have  a  concern. We  have  a  misspecified  design and  we  would  like  to  wrap  up and  leave  the  folks   with  a  few  key  takeaways. Here's  an  example   where  now  this  functional  form does  not  match  any  of  the  effects  we're  considering and  we're  relatively  flat   on  this  perimeter where  a  lot  of  those  optimal designs are going to be so I'm going to see how  that  works  out. Also  note  the  [inaudible 00:43:52]   Cholesterol  set  to  a  coefficient  zero and a true  generating  function. Now  taking  that  true  function  going to  profil er  output  right in the  table and  you  can  see  how  nice  it  is   to  be  able  to  plot  these  things to  see  the  response  surface   using  that  output  right  in  the  table. Here's  really  your  true  response  surface, and  this  is  your  response  surface, but  what's  interesting is  it  looks  like  there's  an  illusion  here. It  looks  like  Cholesterol  is  impactful for  your  response. It  looks  like  it  affects  your  response, but  in  reality  the  coefficient  is  zero. But  the  reason  it  looks  like  that   is  because  of  the  mixture  constraint. That's  why  it's  hard  to  parse  out, which  the  individual  mixture  effects really  affect  your  response. We're  not  as  concerned  about  that as we are  of  saying,   what's  a  good  formulation  going  forward? In  this  setting,   we  add  a  little  bit  of  noise, 4% CV, which  is  used  frequently   in  the  pharma  world. In  this  case,  the  mean  we're  using   is  the  mean  at  the  maximum, which  in  this  case  is  one, and  then  also  a  much  more  extreme  40%  CV. This  looks  more  like  a  sonar and  they're  trying to  find  Titanic  or  something. Hopefully  none  of  your  pharma applications  look  like  this, but  we  just  want  to  see   in  this  extreme  case  how  things  work  out. 
What we see in the small 12-run example with relatively small process-plus-assay variation is that both baseline designs hold up and the SVEM methods are still the same winners. Then if we go up to 40 runs, the space filling isn't able to keep up as well, but the D-optimal really does better now; even though it's relatively flat out there on the sides where most of the runs are, it's able to pick up the signature of the surface here. Now, here's the difference between the full model for the space filling and for the D-optimal. There's not as big a difference for the SVEM methods, but you do still have a few tail points down here, and they're all not performing as well as the SVEM linear, even though the SVEM linear is only approximating the curvature of that response surface. If we go up to the super noisy view, no one does a really good job, but your only chance is still with the space filling approaches. But then when we go up to the larger sample size, even in the face of all the process variation, the process noise, now the optimal design is able to average out over that noise better and is able to make better use of those runs than the space filling. A couple of considerations there: What's your run size? How saturated is your model? How much process variation do you have relative to your process mean? All of that goes into the balance of the space filling versus the optimal. If we take a look at what candidate optimal points we're getting out of the space filling versus the optimal, sorry, for the space filling, what we see is that we're on target for the ionizable and the helper components. We're on target for all of our approaches except for these Lasso fits with no intercept. They're never on target; they're always pushing you off somewhere else. You can see graphically what that lack of intercept does. Now, if we allow the intercept, then we're on target. It really is important to uncheck that no-intercept option for Lasso. For all the people that are not using JMP Pro and don't have SVEM, you might say, well, okay, in your simulation, what is better: AICc, versus BIC, versus P-value? Unfortunately, just based on the number of simulations we've run, there's not as consistent an approach as there is with SVEM. If you've got a large number of runs, whether the model is correctly specified or misspecified, the forward or backward AICc do well and the full model does worse, whereas in the smaller setting the full model does better, because all those terms are relevant. Look at the P-values here, too. You see that 0.01 does the worst in one setting and the best in the large setting. There's no consistency in which P-value to use: 0.01, 0.05, or 0.1. The P-value, from this view, is an untuned optimization parameter, so maybe it's best to avoid that and stick with the AICc if you're in base JMP. However, we have seen now that the SVEM approaches for these optimization problems do give you almost universally better solutions than the single-shot methods. You can get better solutions with JMP Pro, with SVEM. Great. I guess we want to just wrap up.
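For the base-JMP route just mentioned, this is roughly what single-shot forward selection scored by AICc looks like; a hedged Python sketch using statsmodels, where X is assumed to be a pandas DataFrame of candidate model terms and the parameter count inside aicc() is a simplifying assumption, not JMP's exact formula.

import numpy as np
import statsmodels.api as sm

def aicc(fit):
    n, k = fit.nobs, fit.df_model + 2          # regressors + intercept + sigma^2 (assumed count)
    return fit.aic + 2 * k * (k + 1) / (n - k - 1)

def forward_aicc(X, y):
    remaining, selected = list(X.columns), []
    best = aicc(sm.OLS(y, np.ones(len(y))).fit())              # intercept-only start
    while remaining:
        trial = {c: aicc(sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()) for c in remaining}
        col, score = min(trial.items(), key=lambda kv: kv[1])
        if score >= best:                                      # stop when AICc no longer improves
            break
        best, selected = score, selected + [col]
        remaining.remove(col)
    return selected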
Some of the key findings here, Andrew. Yeah, and also, Jim, any other comments about these optimization problems, or interesting things we've seen recently? We have, though we're up against time for sure. We've done some pretty amazing things: we've come up with new engineered lumber that's better than it's ever been, and propellants that have physical properties and performance that we haven't seen before. We have taken a step, a leap, in terms of some of the capabilities that we've seen in our mixture modeling. Can we summarize with the highlighted bullet down there, that SVEM seems to be the way to go, and if you only had one, maybe SVEM forward selection, you'd be covered pretty well? Yes, that's right, because I'm always scared. Even though the SVEM Lasso with intercept sometimes looks slightly better, maybe in one or two cases significantly better, it's always neck and neck with forward selection, and I'm always scared that I'm going to forget to turn off no intercept and then give myself something that's worse than, or as bad as, doing the full model. I'm always scared of doing that. SVEM forward selection with the default settings seems like a good, safe way to go. Perfect. Well, with that, we stand ready to take your questions.
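Pulling the pieces of this presentation together, a bare-bones version of the simulation framework the speakers describe might look like the following hypothetical sketch. It is not their code: LinearRegression stands in for whatever fitting method is being scored (SVEM, forward selection, and so on), and the 4% CV noise is anchored at the mean at the maximum, as described in the talk.

import numpy as np
from sklearn.linear_model import LinearRegression

def simulate_scores(design, true_f, candidates, n_iter=1000, cv=0.04, seed=0):
    # design, candidates: arrays of factor settings; true_f: the generating function
    rng = np.random.default_rng(seed)
    true_on_grid = np.apply_along_axis(true_f, 1, candidates)
    mu = np.apply_along_axis(true_f, 1, design)
    scores = []
    for _ in range(n_iter):
        y = mu + rng.normal(0.0, cv * true_on_grid.max(), size=len(mu))  # assay noise, 4% CV at the max mean
        fit = LinearRegression().fit(design, y)          # stand-in for the analysis method under study
        pick = candidates[np.argmax(fit.predict(candidates))]
        scores.append(true_f(pick) / true_on_grid.max())
    return np.array(scores)                              # one percent-of-max value per iteration

Comparing the mean and tails of simulate_scores() for different designs (12 versus 17 runs, space filling versus D-optimal) is the kind of comparison the box plots in the talk summarize.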
Many industries (particularly pharmaceuticals) use milling processes to reduce the particle size of key raw materials. The aim of the milling process is to reduce the average particle size and achieve a targeted particle size distribution (PSD). While there are many types of milling techniques, the key performance indicators (KPIs) are typically mill time, PSD, average particle size, and other industry-specific targets.   Bringing a milling process from lab scale to manufacturing scale can present many challenges. Process factors, such as heat transfer, mass transfer, milling times, and additive amounts, can all substantially differ from the small-scale process. Thus, a thorough understanding of the milling process is required to maximize a successful scale-up.   This talk begins with a description of typical challenges seen in milling scale-up operations. We follow this with an analysis of a sample data set that demonstrates how JMP can be used to quickly and efficiently resolve these problems. We use data visualization, definitive screening designs, augmented DOEs and functional data exploration to help with the scale-up process.     -Hi,  I'm  Jerry  Fish I'm  a  technical  engineer  with  JMP. I  cover  several  Midwestern states  in  the  US. I'm  joined  today by  one  of  my  colleagues  Scott  Allen, who's  also  a  technical  engineer  for  JMP and  he  supports  several other  Midwestern  states. Hi,  Scott. -Hey,  Jerry,  good  morning. -Good  morning. Today  we  want  to  talk  about  a  variety, a  relatively  new  way  to  analyze  data, specifically  from  milling  operations, where  we  want  to  learn  about  milling to  help  with  a  scale- up  milling  process. -W e  both  have  a  strong  interest in  process  optimization  using  DOE and  in  modeling  response  curves using  the  Functional  Data  Explorer in  JUMP  Pro, and  we  wanted  to  bring  those  together for  today's  presentation. And one  of  the  first  examples I  saw  doing  this  sort  of  analysis was  the  milling  DOE that's  in  the  sample  data  library, where  the  goal  is  to  optimize an  average  particle  size. So  as  we  were  talking about  possible  topics  for  today, we  thought  it  would  be  interesting to  see  if  we  could  extend  that instead  of  optimizing  that just  the  particle  size, could  we  actually  optimize  the  particle size  distribution  response  curve? -So,  milling  has  many different  applications. You'll  find  it  in  anything  from  mining, food  processing, making  toner  in  the  printing  industry, making  pharmaceutical  powders. At  the  most  basic  level, a  milling  process  is  used when  we  want  to  reduce  the  particle  size of  a  certain  substance and  produce  uniform  particle  shapes and  size  distribution of  a  starting  material. Often  some  type  of  grinding  medium is  added  to  accelerate  the  process or  to  control  the  size and  shape  distribution of  the  resulting  particles. In  each  application  the  desire is  to  get  the  right  particle  size, say  the  median  or  the  mean  particle  size, with  a  controlled  predictable particle  size  distribution. In  the  scenario  we  discussed  today, we  have  an  existing manufacturing  milling  operation that  produces  a  pharmaceutical  powder. It  has  good  performance  today, creating  the  right  medium  particle  size and  a  narrow  particle  size  distribution. A  full  disclosure  this  scenario and  the  resulting  data  are  invented. 
Scott  and  I  didn't  have  access to  non- confidential  data to  present  in  this  paper. However,  even  though  the  data are  fabricated, the  techniques  that  we're  about  to  show are  applicable  for  real  world  problems. The  picture  on  the  left shows  a  typical  agitated  ball  mill. The  material  to  be  milled, enters  at  the  top and  continuously  flows  into  the  vessel where  an  agitator  rotates the  material  and  the  grinding  medium to  affect  particle  size. The  resulting  particles are  then  vacuumed  out  of  the  vessel as  they're  milled. Management  has  said  they  need to  increase  the  production  output, something  we're  all, I'm  sure,  familiar  with. They  are  considering  building a  new  milling  line, but  before  investing  all  that  capital, they'd  like  to  investigate, can  we  simply  increase  the  throughput of  our  existing  equipment? So  manufacturing  made  some  attempts at  doing  that, they  adjusted  their  process, and  while  they  can  affect the  median  particle  size, the  new  output has  odd  particle  size  distributions. So  manufacturing  came  back  to  R& D where  Scott  and  I  are, and  asked  us  to  go  to  the  pilot  lab and  see  if  there's  any  combination of  settings  that  might  improve  throughput. Scott,  what  parameters  did  we  look  at? -In  this  case, we're  going  to  use  six  different  factors for  a  DEO. There's  going  to  be  four continuous  factors so  agitation  speed, the  flow  rate  of  carrier  gas, the  media  loading as  a  percentage  of  the  total, and  then  the  temperature  of  the  system. And  then  there's  two  categorical  factors, the  media  type  and  maybe  the  mesh  size of  a  pre  screen  process. And  so  determining  those  factors is  fairly  straightforward  these  are  known to  affect  particle  size  distributions and  things  like  that, but  the  response  is  still  a  challenge. And  in  this  case,  I'm  not  sure how  you  would  actually  go  bout  doing  this if  you  couldn't  model  the  response  curve like  we're  going  to  do. So  Jerry,  in  your  experience, how  would  you  have  done  this  before? -S o  this  next  slide  shows    typical  particle  size  distribution or  particle  size  density  plot. And  we've  got  a  plotted as  percent  per  micron  versus  size. But  you  can  think  of  this as  just  a  particle  count on  the  y- axis  or  a  mass  distribution, it's  just  a  histogram  representing the  distribution  of  particles. What  you're  seeing  on  your  screen, this  might  be a  good  particle  size  distribution, has  a  nice  narrow  shape  with  a  peak at  the  desired  median  particle  size. But  how  do  we  characterize this  distribution if  we  want  to  do  a  test  to  adjust  it? Well,  we  might  characterize  the  location with  the  mean,  median,  mode of  the  distribution. And  we  might  characterize  the  width via  a  standard  deviation or  maybe  a  width  at  half  peak  height, those  are  typical  ways we  might  measure  that. But  when  manufacturing  tried to  turn  various  production  knobs in  their  process to  speed  up  the  throughput, they  saw  varying  degrees of  asymmetry  in  the  distribution. Maybe  this  was  due  to  incomplete  milling of  the  pharmaceutical  material, or  perhaps  there  were  temperature  effects that  caused  particles  to  agglomerate, we  don't  really  know. 
But  now  the  half  width isn't  really  representing  the  shape of  that  new  curve. So  we  might  turn to  calculating  maybe  percentiles along  the  width  of  the  curve, maybe  the  10th  percentile  of  particles that  fall  below  a  certain  point, 20%  below  this  point, 90%  below  this  point,  and  so  forth. But  it  gets  even  tougher  when  we're  trying to  describe  something  like  this  shape where  there  are  two  very  pronounced  peaks, or  this  shape, which  I  tend  to  call  a  haystack where  it's  very  broad,  doesn't  have  tails. What  do  we  do  with  that? So  this  parameterization  technique doesn't  seem  to  be  the  best  way to  approach  the  problem. Scott,  I  know  we  have  to  do some  experimentation, but  how  are  we  going  to  approach  this in  today's  analysis? -So  that's  what  we're  going  to  do. So we're  going  to  use  the  entire  shape, our  response  in  this  case is  going  to  be  that  curve. So  we're  not  going  to  try  to  co-optimize all  those  different  parameters that  you  talked  about. We  could  co  optimize  two,  three, four  different  parameters, but  instead  we're  going  to  use that  entire  curve  as  our  response. We're  going  to  use  all  the  data and  then  our  target is  going  to  be  some  hypothetical  curve that  we  want  to  achieve. So once  again, we're  not  going  to  try  to  target all  the  different  parameters in  that  curve, we're  going  to  try  to  match  the  shapes. So  we  want  to  have  our  experimental  shape match  the  shape  of  our  target. And  so  that's  how  we're  going  to  get started  with  the  analysis  today and  we'll  take  you  through the  workflow  of  how  we  would  do  that. So  let  me  go  and  get  into  JMP. Oops,  there  we  go. So  we  first  see  here is  the  DOE  that  we  ran. So  we  ran  a  definitive  screening  design with  those  six  factors, although  you  could  use  any  design that  you  wanted. And  we've  got  18  experiments  in  this  case, so here's  all  the  factors and  the  factor  settings  that  we  used. -That  looks  pretty  standard  to  me,  Scott, for  the  DOE  that  I've  run  in  the  past. But  you  don't  have  a  response  column. -W ell,  that's  one  of  the  unique  things about  the  response  curve  analysis  is in  some  cases  you  set  it  up  a  little  bit differently  and  how  you  do  the  analysis. So  we  don't  have  a  response  column and  we're  not  going  to  optimize just  a  single  value. Instead,  our  responses are  in  this  other  table. So  in  this  case, we've  got  a  very  tall  data  set with  the  x- axis  is  our  size and  the  Y  value  is  our  percent  per  micron and  this  is  what  we're  going  to  plot and  optimize. But  we  do  need to  get  our  DOE  factors  in  there. So  to  do  that, we  just  took  a  little  bit  of  a  shortcut and  we  did  a  virtual  join between  these  two  tables. So  in  our  design  here, our  run  number  is  the  link  ID. And  then  we've  got  the  run  number  here and  this  is  the  link  reference, and  this  lets  us  bring  in all  of  those  DOE  factors. So  these  are  all  here  in  the  table, but  they're  just  virtually  joined and  that  helps  us  keep our  response  table  nice  and  clean. So  if  there's  any  modifications, we  don't  have  to  worry  about  copying and  pasting or  adjusting  all  those  DOE  factors. 
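Outside JMP, the virtual join just described can be mimicked with an ordinary merge: the tall response table (one row per size point per run) is linked to the wide DOE table by run number. The file and column names below are assumptions for illustration, not the presenters' actual files.

import pandas as pd

doe = pd.read_csv("milling_doe.csv")           # one row per run: Run plus the six factors
psd = pd.read_csv("milling_psd.csv")           # tall table: Run, Size, PercentPerMicron
tall = psd.merge(doe, on="Run", how="left")    # brings every DOE factor onto each curve row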
So that's  how  we  set  up  our  table, we've  got  our  DOE  factors  in  this  table and  our  DOE  responses  occurs  in  this. And  as  you  can  see, we've  got  all  of  our  18  runs  here. And  before  we  start  the  analysis, let's  take  a  look  at  those  curves. So  I  just  plotted  all  those  curves in  Graph  Builder, and  so  we've  got  our  target  curve  here. So  this  is  a  hypothetical  target  curve that  we  want  to  achieve, it's  going  to  be  experiment  number  zero. And  then  we've  got  experiments one  through  18 and  the  response  curves  for  each  of  those. -So  you've  run  18  experiments and  not  a  single  one  of  those looks  exactly  like  that  target. What  do  we  do  with  that? -That's  a  good  observation. And  in  this  case,  what  we  can  see are  some  different  features  between  these, so  definitely  some  are  more  broad. I  like  how  you  called  it, that  haystack  here. Some  are  more  narrow, maybe  with  smaller  shoulders  here. We  do  see  that  the  peak  shifts a  little  bit  in  some  of  these, here's  the  peak, it's  shifting  left  and  right. And  so  hopefully  we  can  find some settings  and  those  factors that  will  use  the  best  of  all of  these  give  us  something  that's  narrow without  a  shoulder  or  bimodal  peak. But  to  do  that, we  need  to  go into  the  Functional  Data  Explorer. This  is  a  traditional  DOE, we  would  go  up  to  analyze and  we  would  go  to  fit  model,  potentially. In  this  case, we're  going  to  go  down  here to  specialized  modeling and  go  to  Functional  Data  Explorer. And  so  when  we  launch  this, we  need  to  add  our  Y  values, which  were  the  percent  per  micron. The  X  values  was  our  micron  size. We  need  to  identify each  of  those  functions with  the  run  number. And  then  we're  going  to  add all  those  DOE  factors as  supplementary  information. So  I'm  just  going  to  take all  of  my  DOE  factors and  add  them  as  supplementary  information. Now  we'll  click  okay. So  when  you  launch the  Functional  Data  Explorer, this  is  what  you  get  first. And  this  is  just  a  data  processing  window. And  what  we're  doing is  just  taking  a  quick  look  at all  of  our  data. And  so  in  this  initial  data  plot, we  just  have  all  of  our  curves  overlaid, and  you  can  see our  green  target  curve  hiding  in  there. So  this  just  shows  us how  all  of  our  data  are  lining  up. Over  here  on  the  right, we  have  a  different  set  of  functions to  help  clean  up  the  data,  process  it. And  one  of  the  really  nice  things about  this  platform is  you  don't  have  to  do that  data  processing  in  the  data  table. So  if  you  needed  to  remove  zeros or  do  some  sort  of  adjustment  here, standardized  the  range,  things  like  that, you  can  do  all  of  that  over  here in  the  Clean up. In  our  case,  our  data  is  pretty  clean, so  we  don't  need  to  do any  data  processing. But  what  we  do  need  to  do is  take  this  green  curve, our  target  curve,  out  of  the  analysis. So  this  is  the  target, this  is  what  we're  going  to  try  to  match. And  so  we  don't  want  it to  be  part  of  the  modeling  analysis. So  to  take  that  curve  out, we  go  over  here  to  the  target  function and  we  click  load. And  I'm  going  to  select  that  zero  curve, click  okay. And  now  it's  gone. 
So  now  we're  not  going  to  include  that in  our  models. And  so  we  can  scroll  down  here and  just  see  how  each of  our  individual  experiments  are  plotted. So  now  that  our  data  is  nice and  cleaned  up, we  can  go  on  to  the  analysis. So  to  do  the  modeling, we  go  up  to  the  red  triangle and  there  are  several  different  models that  we  can  choose. And  in  a  typical  workflow, at  the  beginning,  you  might  not  know which  is  the  best  model  to  use, whether  you're  going  to  use  a B-spline or  a  P-spline  or  something  else. In  this  case,  in  the  interest  of  time, we've  done  all  of  that  already. And  we  know  that  the  P-spline gives  us  a  pretty  good  model. So  we're  going  to  go  ahead and  fit  that  model. So  I  select  P-spline and  now  JMP  is  creating  the  models. And  what  we'll  see  is, the  first  thing  we'll  is  something  similar to  that  initial  data  window  over  here. So  all  of  our  curves are  still  plotted  and  overlaid. But  now  we've  got  this  red  line and  this  red  line  is  representing  the  mean so the mean curve of  all  of  those  different  curves. And  so  we  can  also  scroll  down  below and  we  see  each of  our  individual  experiments also  with  a  line  of  fit. And  this  is  the  first  indication that  you  can  get  about how  well  this  model  is  fitting. So  if  you're  getting  all  these  red  lines overlaying  your  experimental  data, then  you're  probably  on  the  right  track. If  there  was  a  lot  of  deviation then  you  might  consider  doing a  different  model. Other  thing  you'll  notice  over  here is  there's  different  fitting  functions that  the  spline  model  is  using. In  this  case,  there  were  two  that  JMPs that  are  pretty  good. So  this  linear  model and  the  step  function  model and  by  default  all  the  analysis  down  below is  going  to  use  the  model that  had  the  lowest  BIC  value. So  in  this  case,  all  the  analysis is  using  this  linear  model. But  if  you  wanted  to  use  a  different  one you  just  select  it and  I  don't  know  if  you  can  see  it  easily, but  this  one's  highlighted  now or  you  go  to  the  linear or  you  can  just  click  on  the  background and  it'll  go  to  the  default. And  so  that's  the  modeling  side  of  it. But  we  need  to  check how  well  this  model  is  fitting. And  so  to  do  that,  we  just  go  down  here to  the  window  that  has  a  functional  PCA. So  this  is  the  functional principle  component  analysis. This  looks  a  little  complicated, but  what  we  want  to  do is  really  start  to  take  a  look at  how  well  this  model  has  been  created. And  so  what  we  want  to  do is  look  at  this  mean  curve  here. So  this  is  the  same  mean  curve that  was  calculated  in  the  section  above. And  what  JMP  has  done  is  said, we're  going  to  start  with  this  mean  curve and  then  we're  going  to  add  a  shape. So  we're  going  to  add  this  function or  some  portion, either  positive or  negative  portion  of  this  curve to  this  mean  curve. And  you  can  see  over  here how  much  of  the  variance you  can  explain  with  that  one  curve. So  in  that  case, if  we  just  had  our  mean  curve in  our  first  shape, we  would  explain  about  50% of  the  variance. By  adding  a  second  function, now  we're  explaining  nearly  79%, 3rd  function  gets  us  up  to  88%. 
And  so  you  can  see  how  much of  that  variation  we  can  explain by  adding  more  and  more  shapes. And  depending  on  the  type of  curve  you  have, you  might  have  only  one  function or  you  might  have  dozens  of  functions depending  on  what  that  curve  looks  like. So  this  is  it  looks  like  we  can  explain a  lot  of  the  variance  here. It  takes  us  about  nine functions  to  get  up  there. But  now  we  want  to  see how  well  those  combinations of  all  those  shapes  with  the  mean  function or  with  the  mean  curve, how  those  are  represented. How  representative  they  are of  our  experimental  data. So  to  do  that, we're  going  to  go  down  to  the  score  plot, and I'm  going  to  make  this just  a  little  bit  smaller. And  we're  going  to  look  at  the  score  plot and  this  FPC  profiler. So  the  FPC  profiler is  a  way  to  show  the  combination of  all  those  different  shapes. So  we  really  just  want  to  pay  attention to  this  top  left  part  of  the  grid. So  this  is  our  experiment based  on  the  combination  of  the  mean  curve with  all  those  different  shapes. And  right  now,  all  of  the  FPCs, since  they're  set  to  zero, we  just  get  that  mean  curve. But  if  I  start  adding  that  first  FPC, if  I  make  it  more  positive, I'm  adding  that  shape, now  I  can  see how  my  modeled  shape  is  changing, and  if  I  go  lower, I  can  see  how  it's  changing. So  by  adding  and  subtracting each  of  these  different  shapes, I  can  recreate  all  of  the  different  curves or  get  close to  all  those  different  curves. So  this  might  take  a  little  while to  do  manually, but  there's  a  nice  little  shortcut. So  what  I  like  to  do is  go  into  this  score  plot, and  let's  say  I  want  to  see experiment  number  six so  I  can  hover  over  six, and  then  I'm  going  to  pin  that  here, pull  it  over. And  so  there  are  nine  different  functions, but  we're  only  going  to  see  two  at  a  time. And  I  can  see  that  component  one  is  0.08 and  component  two  is  minus  0.03 . So  I  can  take  this  to  0.08 , and  I  can  take  this  to  minus  0.03 . And  I'm  starting  to  reproduce  this  curve, but  I  would  need  to  adjust  all  of  them. And  so  there's  a  shortcut  to  do  that by  just  clicking  on  this. So  by  clicking  on  six, all  of  the  different  FPC  components are  set  to  the  best  representative  model. And  we  can  look  at  these  two  shapes and  see  how  close  they  are. In  this  case,  they  look  pretty  good. Maybe  there's  not  as  much  definition, and  this  is  not  very  straight, but  it  looks  pretty  good. And  so  we  can  go  over  to  another  curve like  number  seven, we'll  click  on  that  one and  see  how  it  changes. Now  it's  not  looking  quite  as  good, and  we  can  go  to  eight. And  this  is  what  I  really  like about  this  platform, is  it  lets  you  explore  the  data. So  it's  Functional D ata  Explorer, we're  just  seeing how  well  this  model  fits, and  we're  doing  it  fairly  visually. And right  here,  if  we're  really  interested in  that  understanding  the  bimodal  nature we're  not  getting that  resolution  with  here. So  this  is  telling  us maybe  this  isn't  the  best  model. Maybe  there's  a  better  one  out  there that  we  can  look  at. 
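The decomposition being described, each curve equal to the mean curve plus weighted shape functions, can be roughly approximated outside JMP with a plain SVD once the curves are sampled on a common size grid. This sketch skips the spline-basis smoothing that the Functional Data Explorer performs first, so treat it only as a rough analog, not the platform's actual computation.

import numpy as np

def functional_pca(curves):
    # curves: runs x grid-points matrix, each curve sampled on the same size grid
    mean_curve = curves.mean(axis=0)
    U, s, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
    scores = U * s                               # one row of FPC scores per run
    explained = s**2 / np.sum(s**2)              # fraction of variation per shape function
    return mean_curve, Vt, scores, explained

# Reconstruct run i from the first m shapes:
#   approx_i = mean_curve + scores[i, :m] @ Vt[:m, :]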
So  if  we  go  up  back  to  the  top, the  linear  model  was  selected  initially because  it  had  the  minimum  BIC, but  maybe  we  want  to  use  a  step  function so  I  can  click  on  the  step  function. And  now  all  those  FPC  curves have  been  recalculated. And  the  first  thing  we  notice  is, we're  getting  a  lot  more  explanation of  the  variance  here. So  we  don't  necessarily need  all  of  these  curves, I  can  just  take  this  slider. Maybe  we  just  want  to  look  at  six  curves and  explain  99.7%  the  variance. And  so  it's  simplifying  the  model  a  bit. So  now  we  can  go  down  here and  take  another  look and  spot  check  those  curves. So  I  can  hover  over  six  again, pin  it  here. This  is  the  curve that  we'll  be  looking  at. And  what  I  want  to  do, I'll  just  make  this  a  little  bit  bigger. And  so  when  I  click  on  six, okay  so  what  do  you  think,  Jerry? What do you  think  this  one is  looking  a  little  bit  better? -That's  much  better  reproduction of  your  experimental  day? Yeah,  I  like  that. -Good. Yeah,  I  think  this  is  looking  better. So  we  can  go  to  seven, and  that  one's  looking a  lot  better  as  well. And  you  don't  need  to  select  them  all, but  it's  good  to  check  a  few  of  them so  we  can  go  look  at  eight. Yeah,  so  now  we're  getting a  lot  better  resolution  here on  the  bimodal  nature  of  it. All  right,  I  think  this  is  telling  us  that this  model  is  pretty  good. -Yeah,  so  what  do  we  do  now? That's  great  that  you  can  reproduce the  experimental  results, but  how  do  you  get  to  the  optimal? -Yeah,  I  guess  at  this  stage, it's  still  a  little  bit  abstract. So  we've  got  all  these  different  shapes that  we're  combining  in  different  ways to  reproduce  all  of  our  curves, but  we  haven't  done  what  we  set  out  to  do which  was  relate  those  shapes to  our  DOE  factors. So  that's  what  we're  going  to  do  next. We're  going  to  go  back  up  to  the  model and  we're  going  to  select functional  DOE  analysis. And  when  we  do  that,  now  we're  getting  a  profiler that  might  look  a  little  bit  more  familiar if  you're  used  to  doing  traditional  DEO. So  once  again, the  response  curve  that  we  have  is  here. And  so  we  see our  percent  per  micron  on  the  Y and  the  micron  size or  the  particle  size  on  the  X. But  now  instead  of  having  those  FPCs in  those  shapes, now  we  have  our  DOE  factors, so  we've  got  our  agitation  speed, our  flow  rate,  media  loading,  et  cetera. Now  I  can  move  that  agitation  speed and  I  can  see  how  it's  relating  to or  how  it's  influencing  the  curve. And  I  can  see  by  the  slope  of  these  lines whether  or  not  something is  important  or  not. So  changing  that  one doesn't  really  change  the  shape. So  flow  rate  doesn't  matter  a  whole  lot, but  temperature  certainly makes  it  more  broad or  makes  it  more  narrow. And  so  what  we  can  do because  we  loaded  that  target  curve, just  like  in  a  standard  deal, we  can  go  to  our  red  triangle and  we  can  go  to  maximize  desirability. So  typically, this  would  look  at  a  parameter if  you  were  doing  a  traditional  DOE. But  in  this  case, it's  going  to  try  to  find  the  settings that  match  that  target  curve that  we  loaded  earlier. So  when  I  click  that,  and  there  we  go. 
It looks like there are some settings here that get us a curve that's fairly narrow, hitting the peak that we wanted, and it doesn't have any of those features that we're trying to avoid. -Very cool. So are we done? We've got the settings that we need, we just throw those over the fence to manufacturing and we're done. -Well, that's one way to do it, Jerry. The folks in manufacturing that I've worked with before might not like that. These settings were not points in our design; this one is in the center, this one's not at an edge or the center. So we probably want to run some confirmation runs here, maybe check some sensitivities, and make sure that we've got some robustness around these settings. -Very cool. Okay, all right. -Let's get back and I think we can sum up. -Yeah. So, Scott, that was great, thank you for that presentation. In summary, what we've tried to do is demonstrate how to perform a DOE using these particle size density curves, with the curves themselves as the response, rather than parameterizing the PSDs with summary statistics like median, standard deviation, et cetera. We were then able to optimize our factor input settings to the process, at least at the pilot scale, to find an optimal curve shape that was very close to our desired particle size distribution. Along the way we discovered how some of those parameters, agitation speed and so forth, affect the particle size distribution, leading to multiple peaks or to broad peaks or whatever that might be. So we have an understanding about that, and we have a model. Bringing this all back to our original scenario, R&D took the results back to manufacturing, where confirmation runs were attempted. Scale-up perhaps wasn't completely successful. That's typical of scale-ups; sometimes the pilot runs don't map directly to manufacturing. But we do have this model now that gives us an indication of which of these knobs to turn if we do have a shoulder on that peak or something like that. So we were able to go back to manufacturing and give them the assistance they needed so that they could increase their throughput, and everyone was happy. [crosstalk 00:26:18] That concludes our paper. Scott, thanks for all the hard work. -Yeah, well, it was great working with you on this, Jerry. -Yeah, likewise. Scott has been kind enough to save the modeling script in the data table, which we're going to attach to this presentation. If you've got any questions about the video or any of the techniques that we showed, please post your comments below the video; there will be a space for you to do that, and we'd be happy to get back to you. Thank you for joining us. -Yep, thanks.
Latin Squares are beautifully symmetric designs employing a given number of symbols in the same number of rows and columns. Each row and each column has all the symbols. A Youden Square, attributed to Jack Youden, is a Latin Square from which a number of rows have been removed. The Youden Square is also a special case of a balanced incomplete block design (BIBD), which is one of the design tools in JMP. All these designs have been around for decades. They support a single categorical factor and one or two blocking factors. They are commonly used in agriculture where the rows and columns are rows and columns of plants and it is desirable to remove any fertility gradients in a field in the analysis.   In industrial settings, it is unusual for experiments to be limited to only categorical factors and blocking factors. However, it could be very useful to use these designs as building blocks for creating experiments in an industry with more factors. This talk will show how this can be done in JMP using a combination of the BIBD designer and the Custom Designer.     Hi,  my  name  is  Bradley  Jones. I  am  the  leader  of  the  JMP  DOE  Group   at JMP  statistical  software. I'm  going  to  talk  to  you  today about   Latin Squares, Youden  Squares, Balanced In complete  Block  Designs, also  known  as  BIBDs, and  some E xtensions for  Industrial  Application. Let's  get  started. What  you  see  here  is  a  window in  Cambridge University that  shows  depicts  a   Latin Square. The  window  has  seven  rows of  colored panes, seven  columns  of  colored  panes,  and  seven  different  colors. You  can  see  the  yellow,  blue. You  can  see  that  yellow  appears in  every  row,  in  every  column, as  well  as  all  the  other  colors also appear  in  every  row and  every  column. There's  this  beautiful  symmetry to  the   Latin Square  that  we  see  here. As  a  designed  experiment, Latin Square  is  primarily an  agricultural  design. You  can  think  of  the  rows and  columns  as  blocking  factors. For  instance,  if  you  were  doing the   Latin Square  design in agriculture, the  rows  and  columns would  be  rows  and  columns  of  plants. What  you're  doing  when  you  make those  blocking  factors is  ruling  out  any  effect   of a gradient of  fertility  in  a  row across  rows  or  across  columns. But  then  the  entry  in  each  row  or  column is  one  level  of  a  categorical  factor, that  in  the  case of  the   Latin Square  design, the  categorical  factor  has  seven levels that  correspond to  the  seven  different  colors. You  can  see  that each  of  the  two  blocking  factors both  have  seven  levels, and  the  categorical  factor also  has  seven  levels. There  are  three  factors, and  each  of  them  have  seven  levels. One  might  say,  well,  first, it's  rare  that in  an  industrial  experiment, you  would  have  three   seven-level categorical  factors, or  even  one  categorical  factor at  seven levels, and  two  other  factors that  were  blocking  factors at  seven  levels. That  makes  a  lot  more sense  in  agriculture. H ere  is  an  example of  a   Latin Square  design that  I  created  using  the  Balanced Incomplete  Block  Design  tool in  the  DOE  menu. You  can  see  that  there  are  seven  blocks and  seven  levels, A  through  G, in  each  row  and  also  in  each  column. The   Youden Square  is  a  Latin S quare with  some  rows  removed. For  instance,  what  you're  seeing  here is  a  transpose  Youden  Square. 
If  you  turned  it  on  its  side, you  would  see  that  there  are  seven  blocks, each  of  which  has  four  levels. Basically,  this   Youden Square is created  by  just  removing   three rows  from  a   Latin Square. This  is  a  little  bit  more  like something  that  you  might  enjoy doing  in  an  industrial  setting. Imagine  that  you  were  doing an experiment that  you  are  going  to  run  on  seven  days. Each  day,  you  could  do  four  runs. You  would  have  4  times  7 or  28  runs  in  all. Each  of  the  days, you  would  be  doing  four of  the  seven  levels of  some  treatment  factor. But  the   Youden Square   is not  really  a  square. It's  more  like  a  rectangle. I  don't  know  exactly  how  it  came to  come  to  have  that  name, but  it's  also  a  special case of  a  Balanced  In complete Block  design  or  BIBD. The   Youden Square  is  actually, I  mentioned  it  only  because I've  been  asked  to  give the   Youden a  lecture at  the  Fall  Technical  Conference this  year,  in  October. I  wanted  to  show  something  about Youden since  I'm  doing  that  lecture. But  I  really  want  to  talk  more about  Balanced  Incomplete Block  Designs,  or  B IBDs, because  they  are  more general  type  of  design. In  this  case,  we're  thinking  about a   seven-level  categorical  factor. You  can  only  do  four  runs  a  day, but  you  worry  that  there  might  be a  day- to- day  effect. The  four  runs  a  day are  a  blocking  factor. You  have  a  seven- level categorical  factor that  you're  interested  in. Again, this  is  the  same  scenario   as you  would  have  with  a   Youden Square, except  that  there  are  a  lot  more possibilities  for  creating Balanced  Incomplete  Block D esigns than   Youden Squares. Here's  an  example  of  that. Here's  the  B IBD  with  each  block having  four  values. There  are  seven  blocks. You  can  see  that... Here's  the   Incidence Matrix. What  the   Incidence Matrix  is, it  shows  a  1  if  that  treatment appears  in  that  block. The  first block,  A  is  in  the  block, C  is  in  the  block,  F  is  in  the  block, and  G  is  in  the  block, and  you  can  see   A, C, F, and G. Now  in  this  design, each of  the  seven  levels   of the categorical  factor appears  four  times. You  can  see  that  in  this  Pairwise Treatment  Frequencies. Also, each  level  of  the  categorical  factor appears  with  another  level of  the  categorical  factor  two  times. Level A  appears  with  Level  B  twice. Here's  one  case  here, in   Block 2  and  also  in  Block  5. For  every  pair  of  factors, they  appeared  in  some  block with  any  other  level  twice. Now,  in  fact,  there's  one  more  cool  thing about  this  design, which  isn't  always  guaranteed  to  happen, but  in  this  case,  it  does. Each  treatment  appears  once in  each  possible  position. For  instance,   Level A  appears in   Block 2  in  the  first position, in  Block  5  in  the  second  position, and  Block  6  in  this  third position, and  Block  1  in  the  fourth position. A ll  the  other  levels of  the  various  treatment  effects appear  once  in  each  position. That  means  that  the  position, if  you  wanted  to  make  position  a  variable, you  could  have  its  orthogonal to  the  blocking  factor  and  also orthogonal  to  the  treatment  factor. You  can  actually,  in  this  case, have  a   seven-level  treatment  factor, a   seven-level  blocking  factor, and  a  four- level  position  factor. 
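The defining counts just listed, every treatment in r = 4 blocks and every pair of treatments together in lambda = 2 blocks, are easy to verify from an incidence matrix. The sketch below builds one standard (7, 4, 2) BIBD, the block complement of the Fano plane, which is not necessarily the exact layout JMP's BIBD tool returns, and checks those counts.

import numpy as np

fano = [(0,1,2), (0,3,4), (0,5,6), (1,3,5), (1,4,6), (2,3,6), (2,4,5)]
N = np.ones((7, 7), dtype=int)          # blocks x treatments incidence matrix
for b, line in enumerate(fano):
    N[b, list(line)] = 0                # complement: each block holds the 4 treatments not on the line

concurrence = N.T @ N                   # treatments x treatments
off_diag = concurrence[~np.eye(7, dtype=bool)]
print(np.diag(concurrence))             # every treatment appears r = 4 times
print(off_diag.min(), off_diag.max())   # every pair appears together lambda = 2 times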
Imagine that  you  are  going to,  again, do this experiment in  seven  days  with  blocks of   size 4  in  each  day, and  then  in  each  day, you  would  control  the  position that  each  treatment  appears in so  that  the  position  effect wouldn't bias  in  any  other   main effect  of  the  design. What  I  just  talked  to  you  about in the BIBD is  that  in  this   BIBD, there  are  the   seven-level categorical  factor, that's  a  treatment  factor. There are  seven  different  possible treatments  that  you  might  have. You  could  imagine  that  you  could  have seven  different  lots of  material,  for  instance. You  would  think  of  the  lot  of  material as  being  different  lot  of  material, might  be  a  different  treatment. The  blocking  factor  is  day. You're  going  to  run  the  experiment over seven days, and  you're going to  run for  four  runs  in  each  day. Then  within  each  day,  there's  a  position. The  time  order  of  the  position  of  the  run isn't  going  to  affect  any  other  estimate of  either  day  or  treatment. Now,  in  industrial  experiments, having  only  a  categorical  factor   and two blocking  factors  is  a  rare  thing. I'm  thinking, what  if  I  wanted  to  add  some  factors to  this  experiment, say,  four  continuous  factors? I  can  make  design with  four  continuous factors, a   seven-level  categorical  factor, a   seven-level  blocking  factor, and  four- level  position  factor using  the  custom  designer. But  I  wouldn't  necessarily  get that  beautiful  symmetric  structure of  the   BIBD  on  the  categorical  factors and  the  blocking  factors. S uppose  I  want  to  keep that  beautiful structure and  just  add  the  four  continuous  factors. That  is  an  extension  of  the  BBD that  might  be  more  appropriate to  an  industrial  experiment. Here's  an  example  of  that. I  have  four  continuous  factors. I  have  6  degrees  of  freedom  for  blocks, 6  degrees  of  freedom  for  treatment, and  3  degrees  of  freedom  for  order. Because  in  a  categorical  factor, you  have  one  fewer  degrees  of  freedom, then  you  have  levels. You  can  see  that  the  main  effects of the continuous  factor are  all  orthogonal  to  each  other. They're  orthogonal  blocks, they're  orthogonal  treatments, and they're  only  slightly correlated  with  the  order  variable. Let  me  point  out  to... I'm  going  to  leave  the  slideshow, and  move  to  JMP  here. Here  is  the  JMP   BIBD  capability. You  can  find  it  under  special  Purpose, Balanced Incomplete Block Design, so DOE  then  Special  Purpose and   Balanced Incomplete Block Design. I  chose  that. I  defined  a  treatment  variable   that has seven treatments,  A  through  G. I  made  the  block  size  here. Let's  suppose  I  want  blocks of  size  4  and  seven  blocks. That's  my  design  here. Now  I  have the picture that  I  showed  you  before in  the  slideshow. Here's  the  blocking  factor as  seven  blocks. Each  has  four  elements. This  is  the   Incidence Matrix, which  shows  which  treatment is  applied  in  which  block. If  it's  applied,  it's  1, and  if  it's  not  applied  at  0. You  can  see  that  each  treatment appears  four  times  in  the  design, in  each  treatment,  or  each  level  pair appears  twice  in  some  block of  the  design. Finally,  we  have the  Positional F requencies that  shows  each  treatment appears  in  each  position  in  the  design. 
Here's the table of the Balanced Incomplete Block Design. Now what I want to do is create a designed experiment that forces this set of factors into the design. I can do that in the custom designer, but I have a script that does it automatically. I'm going to run the script, and it's going to do 10,000 random starts of the custom designer behind the scenes; the resulting design will pop up as soon as it has finished all of those random starts. Here is the design, and you can see the factors: my four continuous factors, plus the block, the treatment, and the order effect. Those last three are the covariate factors that came from the BIBD table, which has 28 rows. I'm calling them covariate factors because I'm forcing them into the design as they are. The design has matched up the four continuous factors, row by row, with the Balanced Incomplete Block Design that has the treatment, block, and order variables. Now I can show you the table of the design. I've sorted this table by the order column, because in the first block I want the order to go 1, 2, 3, 4, in the second block again 1, 2, 3, 4, and so forth. I'm controlling the order of the runs in a non-random way, but I've now made order orthogonal to treatment and orthogonal to blocks. When I evaluate this design, one thing I want to show you is how well I can estimate the continuous factors compared to a completely orthogonal design. For a completely orthogonal design, the fractional increase in confidence interval length would be 0. What we see here are numbers around 0.01 or 0.011, which is to say that a confidence interval for the main effect of factor 1 is about 1% longer than it would be if you could make a completely orthogonal design for this case. I'm going to select all of these other terms and remove them so that I can show you the correlation cell plot without a bunch of noise. This is the correlation cell plot for this design, showing the orthogonality of the main effects of the four continuous factors. The block variable is orthogonal to them, the treatment variable is orthogonal to them, and the only thing that's not orthogonal to the four continuous factors is the order effect. But the order effect is orthogonal to the blocks and the treatments, and the correlation is very small. That correlation leads to almost no loss of information, or increase in variance, for the continuous factors in the design. The result of doing it this way is a much simpler design structure, so the analysis of this design will be easier even for a novice in design to do. That is all I have for you today. Thanks for your attention.
Cardiovascular disease is the number one cause of death globally, claiming an estimated 17.9 million lives in 2019, accounting for 32% of all deaths worldwide that year.

Heart failure is a common illness of cardiovascular disease, and this dataset contains 11 features that can be used to predict likely heart disease. The prediction results can help people with cardiovascular disease or high cardiovascular risk (due to the presence of one or more risk factors, such as hypertension, diabetes, hyperlipidemia, or established diseases) to predict early symptoms and detect disease risk in a timely manner.

The data set included 918 participants from different countries and 11 factors associated with heart failure, such as age, sex, blood pressure, blood glucose, etc. This study plans to use different analysis models in JMP software for statistical analysis of data sets, such as neural networks, logistic classifiers, Random Forest, etc. The optimal prediction model is selected by comparing model performance.

The model output will help people understand the importance of different factors leading to heart disease and the probability of developing heart disease under certain conditions, to help people pay more attention to the management of physical health in daily life and the prediction of disease risk.

Hello. Good morning, everyone. This is Saijac Lami, and I am joined by my teammate, Zhe Diao. We are business analytics graduate students from the University of Connecticut-Stamford campus. A little about our exposure to JMP: we used JMP extensively in our prediction modeling course in our first semester. We felt it is a very easy and very powerful tool, and there is a lot we can do with it. We are still exploring the many features of JMP. Today we are here to present the work we did during the summer, which is heart failure prediction using the Model Screening platform. We are calling it improved because we used several JMP platforms to leverage the predictions. Coming to the agenda, this first overview slide gives the gist of what we are doing, followed by three slides where we talk about pre-processing, the EDA we have done, and the modeling. Coming to the introduction, we know that cardiovascular disease is the number one cause of death globally, claiming an estimated 17.9 million lives in 2019 and accounting for around 32% of deaths worldwide. For our problem, we gathered a data set and developed a classification model for classifying heart disease. We also leveraged the predictions using the Model Screening, Model Comparison, and dashboard features in JMP 16. The model output helps in understanding the importance of the factors that lead to heart disease, and we also find the probability of developing heart disease under certain conditions. Summarizing, our objective is to build the best model and find the factors that lead to heart failure using the JMP 16 platform. Coming to the methodology and a little about our data set, the data set included around 918 participants from different countries.
There are 11 factors associated with heart failure, such as age, sex, blood pressure, and blood glucose. How we approached the predictions: we first performed pre-processing of the data by exploring whether there were any missing values or outliers. We then performed EDA to understand the importance of each feature and its relationship to heart failure. To build the model, we incorporated the following JMP 16 capabilities in our methodology. The first is Model Screening, which is an efficient platform for simultaneously fitting, comparing, exploring, selecting, and then deploying the best predictive model. Next comes Model Comparison, which is an easy platform to compare and select the best-performing predictive model. Then comes the dashboard, which is an efficient way to represent our EDA concisely and which we can rerun any time new data is available. Coming to our results, this is just an overview of the results we had. Using Model Screening, we identified the Boosted Tree as our best model. We did not choose it on accuracy alone; we focused on the model with the lowest false negative rate, because we do not want patients who actually have heart failure to go undetected. Based on this, we chose the Boosted Tree as our best model. Coming to the column contributions, when we tried to identify the important factors driving the heart failure prediction using the Boosted Tree, we found that ExerciseAngina (whether a person has chest pain induced by exercise), FastingBS (fasting blood sugar), RestingECG, ST_Slope, and ChestPainType are a few of the 11 parameters that together contribute around 75% toward the prediction of heart failure. For the more in-depth analysis, Zhe Diao will take over. Okay. After screening through the basic information of the data, such as the target feature, predictive variables, and data types, our analysis work starts with data cleaning. We need to deal with missing values and outliers first to get clean data, and JMP provides a variety of ways to explore and handle them. For missing values, JMP can display the details in a summary table or display them in a cell plot or tree map. Today we show the statistical table, which is also a way to get the information we want. We can see that there is no missing value in our data. But when you explore the data distribution further, you will find that some indicators use the number zero to stand in for a missing value. We treated these zeros with deletion and median replacement, because the value of those indicators cannot actually be zero. For outliers, the box plot and the Explore Outliers module are common methods. Today we use the outlier analysis function in the Multivariate module, which reflects distance in a multi-dimensional space in the Mahalanobis Distance graph. We retained these outliers in this analysis because we consider them a common phenomenon in medical test results.
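The zero-as-missing cleanup and the Mahalanobis-style outlier check described above can also be sketched outside JMP. The Python snippet below is only a rough illustration under assumed column names (for example, Cholesterol and RestingBP, fields in the public heart-failure data set where zeros stand in for missing values); it is not the JMP workflow itself.

```python
import numpy as np
import pandas as pd

# Illustrative sketch, not the JMP workflow. File and column names assume the
# public heart-failure data set; adjust them to match the actual table.
df = pd.read_csv("heart.csv")

# Treat zeros in physiologically impossible fields as missing, then impute
# with the column median (mirroring the deletion / median-replacement step).
for col in ["Cholesterol", "RestingBP"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())

# Mahalanobis distance on the numeric predictors as a multivariate outlier check.
num = df.select_dtypes(include=np.number).drop(columns=["HeartDisease"], errors="ignore")
center = num.mean().to_numpy()
cov_inv = np.linalg.pinv(np.cov(num.to_numpy(), rowvar=False))
diff = num.to_numpy() - center
mahal = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
print(df.assign(mahalanobis=mahal).nlargest(5, "mahalanobis"))
```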
After completing these steps, we get the clean data, and then we enter the data exploration stage. In this step, we built some commonly used charts to show the information contained in our data. JMP provides many choices in this part, such as the tree map, ring chart, bar chart, and so on. From these graphs, we learn that the proportion of males suffering from heart failure is twice that of females, that 80% of patients with heart failure have diabetes, and that 77% have no symptoms of chest pain, which reveals the imperceptibility of the disease. After drawing these useful conclusions, we come to the modeling stage to further explore the relationships in the data. When you are doing data analysis, you usually wonder which model you should build, or which model performs best. The Model Screening function in JMP helps solve this problem in a very intuitive way. You just need to drag the target variable and the predictor variables to the corresponding positions, and JMP will run all the appropriate models for you. In this analysis, JMP ran nine models automatically, including a regression model, Boosted Tree, Neural Network, and so on, and you get a detailed and clear output. If you only care about the results, the summary table can help you choose the best model, whether you consider residuals or goodness of fit. If you want to know the details, the parameters, and the results of each model, you just need to click the model you want to view in the details part, and you can understand the performance of the model from all aspects. Here we show the parameter estimates, the confusion matrix, and the profiler. In the profiler, you can enter new data to observe the trend for each variable and get the predicted result. We see that the influence of age is not significant, which may be contrary to intuition, while gender, diabetes, and ST_Slope are the main influencing factors. Moreover, in these results we pay attention to the misclassification rate, especially the false negative value, because a false negative means that the patient has heart failure and we predict that he does not, which may lead to very serious consequences. The best-performing model we selected in this analysis is the Boosted Tree, which has the lowest misclassification rate and the highest sensitivity. We can then save all the prediction formulas and results for use in Model Comparison. Model Comparison provides a more concise and intuitive format for showing the model performance indicators, which makes it convenient to make the final choice. Now I'll take you to the last part of our presentation, which is the dashboard. Using the dashboard feature, we created a utility where we added several important features, discussed before, that critically affect heart failure risk. Here we can interact by providing the inputs: I can choose male or female, the chest pain type, the ST_Slope pattern, and the exercise angina status. Based on these inputs, the utility displays the probability of heart failure, which is a very useful feature.
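Before wrapping up, the model-screening idea described above, fitting several classifiers on the same split and comparing them with special attention to false negatives, can be approximated in code. The scikit-learn sketch below stands in for, rather than reproduces, the JMP Model Screening and Model Comparison platforms, and the file and column names are assumptions based on the public heart-failure data set.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Assumes the public heart-failure table with a 0/1 HeartDisease target.
df = pd.get_dummies(pd.read_csv("heart.csv"), drop_first=True)
X, y = df.drop(columns="HeartDisease"), df["HeartDisease"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=1),
    "boosted tree": GradientBoostingClassifier(random_state=1),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    # Rank models on the false-negative rate (missed heart-failure cases),
    # not just on overall accuracy.
    print(f"{name}: accuracy={(tp + tn) / len(y_te):.3f}, "
          f"false-negative rate={fn / (fn + tp):.3f}")
```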
That brings us to the last part. In conclusion, I just want to summarize. We used the Model Screening platform to explore the best predictive model for heart failure prediction. We also leveraged that work using the JMP 16 dashboard, where we developed a utility, an interactive platform that outputs the probability of heart failure based on the input parameters. That's all we have for today. Thank you.
As the pet food industry continues to expand, one of the product categories that continues to gain momentum is around pet treatment. There are various products available, including ones that provide a cleaning benefit where the texture of the product promotes chewing behavior and “scrubs” the pet’s teeth of plaque as it is consumed. Various methods are used to measure the dental efficacy of these treating products, from which the data is prepped, modeled and reviewed for further insights. To assist in the consumption of the data, I utilized JMP’s mapping capability to create a custom heat map that provided an unrealized layer of insight into the performance of the products beyond what current models and analyses were showing. This additional layer of analysis drove various discussions and investigations to measure, test and enhance claimed benefits of various products, including new product development. By utilizing JMP’s custom mapping capabilities, analytical professionals can help provide additional layers of insights into data that can lead teams into further innovation channels.     Welcome,  everyone. Today,  I  will  be  going through  custom  heat  maps and  how  they  provided  some new  insights  into  pet  treating. First  of  all,  I'm  Jared  Shaw and  I  work  for  Mars. I've  been  with  Mars  for  eight  years. Prior  to  Mars, I  worked  in  semiconductor I ntel and  IM  flash  technologies for  about  13  years. My  background  is  statistics  and  education. I've  done  a  lot  of  consulting over  the  years  as  well  as teaching  others in  various  statistical  methods. I'm  married  and  I  have  three  kids. They're  all  adopted. I  play  games  on  the  sidelines, build  models and  then  periodically  camping. Then  I  also  tinker around  with  construction. On  the  bottom  right,  there  is a  room  over  my  garage  that  I  finished. Today,  I'll  be  going  through basically  the  abstract  I  submitted. Then  C&T  stands  for  care  and  treat. I'll  give  some background  on  that  for  Mars. I'll  go  through  measuring  efficacy, research  protocols  we  have for  these  types  of  studies. Then  I'll  get  into  the  JMP  portion. This  is  not  a  live demonstration  of  JMP. I'll  just  be  showing some  images  from  the  program and  talking  through  some  approaches that  we  used  in  looking  at  this  data and  then  ending  up  with  the  new approach  that  I  introduced. First  of  all, the  abstract  here  is I  have  these  custom  heat  maps and  they  provide  a  lot of  insights  into  pet  treating. Overall,  our  intention  is  to  improve  the cleaning  of  pet's  teeth  with  a  new  treat. We  do  this  by  changing  texture, changing  ingredients. We  want  to  do  something  that  will  help impact  the  teeth, but  also  be  safe  and  delicious and  fun  for  the  pet as  well  as  the  pet  owner. The  results  provide  a  lot  of  insights, including  patterns  that  we see  across  the  mouth. How  does  the  product  affect  the  teeth? The  current  graphs  and  methodologies. The  modeling  is  pretty  good, but  the  methodologies  and  how  we  showcase this  data  is  not  very  good. It  doesn't  offer  very  clear  insight unless  there's  a  remarkable  difference between  the  products that  are  being  tested. I  found  that  there  is  a  great opportunity  to  utilize  JMP  mapping to  create  some  custom  maps. 
These new images spawned a whole investigation, brought in some new associates, and produced some great insights and learning. Just a simple image brought some great rewards. To ground us with a baseline, let's first consider what Care & Treat means. Pet care consists of dry diets, wet diets, and then the Care & Treat components, which are split into two pieces: the treat and the care. Treating products have high palatability; they excite pets. They may add some extra nutrition and supplement the pet's diet. They're used for training, as a positive reward to get the pet to respond to your voice, and so on. In some cases, they're long-lasting, to help relieve boredom, reduce destructive behavior, and so forth. The care products, on the other hand, also have high palatability to encourage consumption, but they promote healthy teeth. We concentrate a lot on oral care and reducing bad breath, and they can also act as a way to give medication to your pet; you may have heard of a pill pocket, et cetera. Now, one of the main drivers of these treats is texture, and that really promotes the consumption benefits: does it become fun to chew the product? For those who have dogs, you may have noticed that in some cases when your dog starts eating a product, it seems to inhale it more than chew it. The texture piece is definitely something you want so that the pet enjoys the experience of biting into the product. This is just an image to help us understand what we're talking about for treating and trying to clean the teeth of our pets. These are a couple of different images that show the breakdown of the teeth in pets and how we want to understand how a product impacts those different teeth. How do we measure efficacy? Efficacy is basically how well the product is performing. Periodontal disease is the most widespread oral disease in pets. Companies all over, when they get into the care space, are looking at how they can take texture and shapes and build a chew that really affects oral care, and they want to measure the efficacy of the treats. Different approaches have been used over the years; I'm not going to go too deeply into these, this is more informational. Years ago there was the Logan & Boyce method, a visual measure on the teeth. It was invasive in how it was done with the pets, and you can read more about that on your own. There is also GCPI, which was less invasive to the pets but still a manual approach. Then recently, probably within the last 4 or 5 years, they came up with the QOLF technique, which essentially takes an image of the teeth before and after. We find that this is much more informative; it gives us much clearer results on how things are proceeding when the pets consume these products. Now for the efficacy formula itself. First of all, there's what's called an ITS, and you'll see this in the data later on. It stands for Individual Tooth Score.
Basically, the ITS looks at how much plaque exists on the tooth. For the data I'm using here, it's based on the GCPI approach, scored relative to the length of the tooth, which gives us an idea of how much plaque is on each of those teeth. The Chew X, which you'll see called different names, is essentially the treat that's being tested. Then the overall efficacy is the ability to produce the intended result from the product. The calculation takes the No Chew score, subtracts the Chew score, and then divides by the No Chew score to get the efficacy. Now for the research protocol. The background image here is actually one of our feeding centers here in Tennessee. The round sections are dog pods; we have several dogs within each of those pods, and the center building has the cat rooms. What we do is prepare the pets by cleaning their teeth. They get a professional cleaning, and we try to get all of the plaque off the teeth to give them a score of zero. We run a crossover design. Essentially, this means that every chew is going to be administered to every pet, though not at the same time: we break it up into different phases, and in each phase the chews are rotated across different dogs, or cats as well. Scoring is essentially done between each phase. After a phase of the study, the pets have their standard diets, they may get a treat product at the end of the day, and then at the end of the prescribed time frame we measure the amount of plaque on the teeth. Teeth with a score of zero across all the treatments are removed from the study. The pet is consuming the product, but for whatever reason that tooth didn't get impacted by the product. Typically we see this with the front teeth, the incisors, which are used more for cutting, while the products are generally more about chewing behavior. For the whole mouth, what this is talking about is cases where an individual tooth score has a zero for the No Chew, so it's basically missing, or the No Chew result comes out lower than the Chew result for that tooth. No Chew means that for that phase of the study, the dog or cat did not receive a treat product to consume; they just had a standard diet. Chew X means that they had some care treat at the end of the day. We summarize the data across the whole mouth, and sometimes we break it up into regions to give us an idea of how the product is performing. The analysis protocol itself: these are done with linear mixed models. We have fixed effects, such as the treatments or the regions depending on the study, and then we have random effects focused on the pet ID. The intent is that the treatment effects apply to all pets, not just the specific pets in the study. We also run specific contrasts, where we look at different sides of the mouth, different regions, and so on.
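Before looking at the contrast table, here is the efficacy calculation described above restated in code form. It is only a minimal sketch: the column names and the numbers are made-up placeholders, not the team's actual table, and the calculation is simply (No Chew - Chew) / No Chew.

```python
import pandas as pd

# Hypothetical per-tooth scores; "no_chew" and "chew_w" are assumed column
# names, with ITS = individual tooth score (amount of plaque).
scores = pd.DataFrame({
    "tooth":   ["upper molar", "lower molar", "canine"],
    "no_chew": [3.2, 2.8, 1.5],   # plaque when no treat was given
    "chew_w":  [1.1, 0.9, 1.2],   # plaque after Chew W
})

# Efficacy = (No Chew - Chew) / No Chew: the fraction of plaque the chew
# removed relative to giving no treat at all.
scores["efficacy_w"] = (scores["no_chew"] - scores["chew_w"]) / scores["no_chew"]
print(scores)
```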
Over  here  on  the  right- hand side  in  this  table is  an  example  of  some of  those  contrast and  depending  on the  number  of  contracts  we  run, of  course,  we're  going  to  use  the  FWER, to  control  for  inflated  error. Then  we  communicate  these  results. We  take  the  analysis  results, we  take  images  and  then we  sit  down  with  the  stakeholders and  we  show  them which  of  these  Chews was  better  essentially. Initially,  when  I  started  getting involved  in  these  studies, it  was  very,  this  Chew  did  better  versus this  other  Chew  for  the  whole  mouth. But  as  we  started  bringing in  different  regions  of  the  mouth, we  started  seeing some  different  results and  had  much  more fruitful  discussions. This  will  get  us  into  the  analysis. What  I'm  going  to  do  here  is I'm  just  going  to  concentrate  really on  the  data  visualization  component. I'm  not  going  to  go  too  much  into the  statistics  on  the  modeling  piece. This  is  just  about  visualization and  in  this  first  part is  specifically  about ways  that  we  are  trying to  visualize  the  teeth  initially. This  is  a  results,  this  is  from  JMP, from  running  the  mixed  effects  model and  then  at  the  end here  we  are  running  these  contrasts. In  these  type  of  results as  we  look  at  these, because  I  have  here  marked  in  the  center, the  Chews  would  show  no  difference but  areas  of  the  mouth  would, particularly  the  molars. You  can  see  here  on  the  left, I  have  just  Chew  by  itself  being  compared and  then  on  the  right- hand  side, you  see  I  have  different areas  of  the  mouth. Different  areas  of  mouth were  showing  interesting  differences but  the Chew  by  themselves compared  to  No  Chew, maybe  we're  not  seeing  too  much for  one  of  them  but  some  for  another. Then  we  would  group  them  into  different sections  used in  the  variability  chart. This  shows  my  different Chews  with  the  no  Chew and  then  again the  areas  of  the  mouth. In  this  case,  I  would  see  that the  mean  of  the  data  is  here  on  the  left and  then  the  standard  deviation of  the  data  is  on  the  right. Definitely,  one  area  of  the  mouth is  operating  differently than  another,  as  I  can  see  here  on  the  right- hand  side. Let  me  just  turn  on  my  pointer. Over  here,  we  can  see  that  this variability  for  the  lower  molars versus  the  upper  molars is  different  for  different  Chew. Other  ways  that  we  tried  to  portray this  is  using  graph  builder, we  used  the area  of  the  mouth  over  here. I  forgot  to  mention this  earlier  but  the  IUL, this  is  in sisors and  upper  lower  canine  teeth and  then these  are  the  molars. We  can  see  definitely some  pattern  going  on when  I  compare across  different  phases. I  have  phase  1, phase  2  and  phase  3. Phase  3,  it  looks  like I  have  this  linear  effect  of  sorts that's  occurring between  the  Chews. It's  just  how  it's  showing  up  visually. It  doesn't  mean in  the  order  in  which  they  are  given, it's  just  what  the  data  is  showing. Looking  further into  the  variability  chart, bringing  in  the  phases. We're like  "Hey,  do  we  see differences  per  phase?" Here  we  really see  for  this  Chew  W, the  variability  was  much lower  than  No Chew  and  Chew P. It  really  starts  questioning, well,  what's  going  on  here? 
Why  is  this  specifically happening  for  this  Chew? What  could  we  do to  understand  that  better? Another  visual  that we  generated  for  this  study is  we  again,  summarized  it by  the  area  of  the  mouth and  then  the  Chew  efficacy for  each  of  the  Chews  themselves. We  can  definitely  see  some differences  between  the  Chews, but  overall,  they  might seem  like  that  they're  similar even  though  we're  seeing differences  in  the  areas  of  the  mouth. One  of  the  things  that I  started  asking  is  like why  do  we  see  these  differences between  areas  of  the  mouth but  we  don't  see  across  the  Chews as  much  what's  going  on? Here  I  generated  a  plot where  down  here  on  the  x- axis, I  have  the  different  dogs and  then  the  Chew  efficacy for  the  W  and  P  Chews and  then  areas  of  the  mouth. Definitely, what's  interesting  here  is  that particular  animals are  showing  the  difference and  other  animals  are  not  now. We  would  expect  this, given  randomness  of  the  study  that the  Chews  are  going  to  behave  differently and  how  the  pet is  consuming  the  product. We  really  started  making  me  think it, much  more  deeply  about  the  data and  say  really  what's going  on  here  in  this  data? Do  the  pets  chew the  product  differently? That  led  me  into  this  data visualization  for  the  second  part because  it  made  me  really start  thinking  about  the  data, what  can  I  do  or  how  can I  look  at  this  differently to  bring  out  this  individual component  of  the  pets. I  was  working  with  a  research  scientist and  we  were  going through  one  of  the  studies and  they  happened  to  have  this  card. As  you  can  see  here  on  the  right  hand side,  this  is  just  a  picture  of  the  card. They  had  this  sitting  on  their  desk and  I  was  sitting  there  staring  at  it. I  had  the  idea, "What  if  I  created  a  tooth  map  in  JMP?" I  could  then  color  each  of  these individual  teeth and  maybe  get  some  clarity, further  clarity  in  these  studies than  what  we  were  looking  at. I  went  and  contacted our  Waltham  scientist. Waltham  is  a  site within  pet  care  in  the  UK and  they  concentrate  on  doing research  on  the  pet  nutrition. I  went  and  talked  to  them  and  one of  their  scientists  drew  me  up  some  teeth. Here  for  this  first  slide, we  have  the  dog  teeth. The  map  is  here  on the  left  that  they  drew  up. On  the  right  is  just a  visual  to  give  you  an  idea of  the  different  types  of  teeth that  show  up  in  the  dog's  mouth. Then  we  see  something similar  for  cat  teeth. Again,  on  the  left  is  the  one that  was  drawn  up  for  this  study. What  I  did  then  is  this is  the  map  creator. This  is  a  script  that's  available on  the  JMP  community. This  is  an  older  script. It's  been  a  while since  it's  been  updated, but  I  found  that  it  was very  helpful  for  this  scenario and  so  downloaded  the  add  in and  it  creates the  add- in  pull- down  menu. You  go  to  the  add- in  pull- down as  you  see  up  here  on  the  upper  left. You  can  click  on  Map  Shapes  and  then you  can  do  the  Custom  Map  Creator. When  this  pops  up,  you  get this  screen  here,  again  without  the  teeth. Then  you  get  a  couple  of  empty  tables. I  dragged  the  image  file  onto  the  map. I  gave  it  a  name. 
Then  I  go  over  here  to  this  next  section, and  basically,  I  start  tracing  the  teeth. For  every  single  tooth,  I  would  trace  it and  then  I  would  give  it  a  name. This  is  an  example  of  after doing  all  of  that  work. As  you  can  see,  these  are  all of  the  individual plot  points, is  me  just  clicking  around  that tooth  to  try  to  get  the  entire  shape accurately,  so  it  would  show up  as  accurate  as  possible on  the  screen when  we  look  at  the  plots. Then  we  have  our  different  files  here. You  have  this  XY. This  gives  you  the  coordinates. Down  here  on  the  graph  on  the  lower left,  you  see  this  is  a  zero,  zero. This  is  essentially  like an  X,  Y  coordinate  system. It's  just  telling  me  where  on the  graph  that  particular  point  is for  that  particular  shape  ID. Then  I  have  a  name  file that  gives  me  the  shape  ID and  then  the  name  of  the  tooth. In  this  case, I  created  a  separate  file  for  dog  teeth and  of  course,  for  cat teeth, since  they  are  different  shapes  of  teeth and  different  shapes  of  the  jaw. D  just  stands  for  dog  and  then the  ID  number  for  that  tooth. One  thing  that  I  found  interesting  is  when I  first  created  this  program, this  was  a  few  years  ago, I  was  able  to  just  create  the  maps, and  I  created  a  custom  script. People  would  run  the  script, it  would  save  the  maps  to  their  C drive  and  everything  would  work  fine. But  soon  after  a  couple  of  years, it  no  longer  worked. It  was  because  Mars  entered in  some  security  protocols  that basically  wouldn't  allow  us to  save  maps  to  the  C  drive. It  basically  locked  it  up. I  had  to  go  out  and  figure  out, well,  how  can  I  still  do  this? I  want  to  see  the  maps, we  want  to  create  these  maps. Then  I  found  another  community forum  that  talked  about  putting  it  out onto  the  app  data  for  your username  roaming,  etc . You  see  the  path  here and  so  you  put  your  maps  there and  it  works  just as  if  I  put  them  on  the  C  drive. Here,  once  you  have  those files  saved  on to  their  proper  location, then  you  go  into  JMP Graph  Builder. If  you've  never  used  it  down  here  on the  lower  left- hand  corner  of  the  screen, it  doesn't  show   it, it  just  says  Ma. But  that  is  the  map  feature. Since  I  gave  these  tooth  IDs  as  map component  or  the  name  component, then  I  take  that  tooth  ID and  drag  it  down  to  that  section. When  you  do,  you  can  see  here  in  the background,  I  see  those  teeth  showing  up. Not  all  of  the  teeth  show  up because  for  this  particular  study, I  did  not  look  at  every  single  tooth. You  can  see  the  incisors  are  missing. You  can  right- click  on the  image  and  go  to  Map  Shapes and  show  the  missing  shapes. When  you  do  that,  you  get all  of  the  teeth  showing  up. In  addition,  I  took  the  different Chews  that  were  investigated and  I  dragged  that  up  here to  the  Group  X  up  here  at  the  top. We  see  these  three  different  maps for  each  of  the  Chews  and  the  No Chew. I  then  take  the  ITS and  pulled  it  over  here  to  color. Again,  ITS  is  the  individual  tooth  score, about  how  much  plaque  is  on  the  tooth. 
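As an aside on the map files themselves, here is a rough sketch of what the two tables produced by the Custom Map Creator amount to: one table of traced X-Y boundary coordinates per shape, and one table of names keyed by the same Shape ID. The column names below follow JMP's custom-map convention as I understand it, and the coordinates and tooth labels are made-up placeholders, not the actual traced values.

```python
import pandas as pd

# Sketch of the two tables behind a custom map (values are placeholders).
# The -XY table holds the traced boundary points for each shape.
teeth_xy = pd.DataFrame({
    "Shape ID": [1, 1, 1, 1, 2, 2, 2],
    "Part ID":  [1, 1, 1, 1, 1, 1, 1],
    "X": [0.0, 1.0, 1.2, 0.1, 2.0, 3.1, 2.9],
    "Y": [0.0, 0.2, 1.5, 1.4, 0.1, 0.3, 1.6],
})

# The -Name table maps each Shape ID to the label used in Graph Builder;
# "D" plus a number mimics the dog-tooth naming mentioned above.
teeth_name = pd.DataFrame({
    "Shape ID": [1, 2],
    "Tooth": ["D409", "D410"],
})
print(teeth_xy, teeth_name, sep="\n\n")
```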
At  this  point,  this  has  given  me the  average  amount  of  plaque that  is  on  all  of  these  dog's  teeth that  were  in  this  particular  study. I  can  start  to  see  where  that  plaque is  showing  up  on  the  No Chew . Definitely,  it's  on  these  molars especially  down  here  on  the  bottom  molars, on  the  right- hand  side  especially. Then  I  can  also  see for  the  different Chews , I  could  see  how  maybe some  of  the  canine  teeth on  average  was  showing  that  some of  this  plaque  remained  on  the  teeth. But  definitely,  I'm  seeing some  cleaning  of  the  molars. Clicking  on  done and  giving  me  the  bigger  image so  I  can  see  it  in  more  detail. This  leads  me  into  data  discovery. I  created  these  maps and  they're  looking  great. People  like,   "Hey,  this  is  a  really  interesting way  of  looking  at  the  data." But  I  wasn't  done  there. We  started  discovering  something when  we  looked  at  the  maps  differently. In  this  case,  this  is  just  back to  where  we  were, that  same  map. What  I  did  is  I  started, I  put  a  local  data  filter on  and  by  the  dog  names. Here  I  have  dog  names  on  the  left, turn  on  a  local  data  filter and  I  can  now  filter  on each  one  of  these  dogs and  look  at  them  individually. Now,  we  don't  have  time to  go  through  all  of  them, but  I  wanted to  show  just  a  few  of  them. Here  for  Aura. What  was  interesting  for  Aura  is  that we  noticed  that  for  this  Chew P, that  these  lower  molars on  the  right- hand  side weren't  really  getting  cleaned  very  well. According  to  the  score, they  weren't  getting  cleaned  at  all. This  started  telling  me  as  I  looked at  this,  "M an,  this  particular  pet, Aura would  only  chew the  product  on  the  left  side." For  this  Chew  W, we  actually  saw  a  different  signal. She  actually  chewed the  product  that  seemed  like more  on  both  sides  of  the  mouth. Very   interesting  results. A gain,  these  are  just  two different  types  of  textures that  are  being  looked at  for  this  product. Going  down  to  another  dog  here,  Gretchen. She  showed  something  different. For  Chew  P, she  did  very  well  with  that  Chew, but  for  Chew  W, she  preferred  to  chew  it  more  on  the  right side  of  her  mouth  than  the  left  side. Again,  these  are  completely two  different  animals and  they  chew  the  product differently  depending  on  the  texture and  their  preference  to  the  texture. Not  all  dogs,  we  are  starting to  see  here  like  the  same  texture. They're  very  individual. When  I  was  doing  this, I  started  asking  friends  of  mine, "Do  you  chew  with  one  side  of  your mouth  for  particular  products?" Sure  enough,  as  we  started  collecting that  data,  we  found,  like  for  myself, I  like  to  chew  nuts,  but  only on  the  left  side  of  my  mouth. Others,  when  they  chewed  nuts, it  didn't  matter  which  side,  etc . As  we  started  talking  about  it and  taking  record  of  it, we  started  to  see, "Hey,  these  pets are  consuming  product  more  like an  individual  human  does  when  they  have preferences  in  how  they  look  at  product." And  just  a  couple more  dogs  to  look at here. Bagel  this  one,  the  top  teeth clean  better  than  the  bottom  teeth. Just  fascinating  results. How  is  that  possible? 
Because  the  product is  when  they're  chewing  it, they're  biting down  into  the  product. Your  teeth,  your  top  teeth  and  your  bottom teeth  are  sinking  into  that  material. Why  would  we  get  certain components  showing  up  here? Basically,  what  it  means for  this  particular  dog,  for  Bagel, it's  just  the  rear  molars  that  Bagel was  using  to  chew  into  the  product. The  front  molars,  who  wasn't  using  at  all. Then  Muck,   this  is  a  great  example  of  the  dog. Either  they're  not  consuming  the  treat  at all,  or  they're  just  inhaling  the  treat. Those  are  some  of  the  customer calls  that  we  sometimes  get  is, "Hey,  my  dog  is  not even  chewing  this  product. They're  just  like  taking  it and  swallowing  it  whole." Very  interesting  results. A s  we  started  looking into  this  data  more  and  more, it  really  led  us  to  believe  that,   "Hey,  we  need  a  new  product. We  need  to  create  something  that  will really  bring  in  a  whole  mouth  clean. A  chew  experience where  the  animal  likes  to  chew, likes  to  really  get into  the  product  and  consume  it and  to  have  that  efficacy  result where  the  product is  helping  to  clean  the  teeth." In summary, what  I've  learned  from  this  experience  is that  historical  studies  for  this particular  experience  were  basically giving  an  average across  the  whole  mouth and  it  wasn't  sufficient in  really  giving  us  a  good  idea of  what  was  happening  with  the  product. Viewing  these  tooth maps  by  individual  pets started  showing  some very  interesting  results that  really  we  couldn't  even  look  at  it by  reaching into  the  mouth across  all  of  the  pets. We  actually  need  to  start  looking  at  it  by  individual  and  start  classifying  it  by, "Hey,  this  particular  treat impacted  these  teeth  only, and  this  particular  treat impacted  these  teeth  only." Start  classifying  it  in  that  way so  that  we  can  start  learning  a  lot  more about  the  texture  of  the  product and  how  it  was  consumed. Now  these  findings, we  started  applying  this  across all  studies,  all  historical  studies. We  pulled  this  into  a  large  analysis that  started  really digging  into  it  to  learn  more from  the  history of  what  we've  done and  how  it  affects things  moving  forward. Of  course,  this  led  into  some new  product  development. Here's  an  example  of  what  that  is. Unfortunately,  I  can't  show  it  to  you. The  image  is  protected. It  is  not  yet  released,  but  it  is something  that  we're  investigating. Thank  you.
Repairable systems can be complex. The RSS platform offers several lesser-known or undocumented features that can help users build simulations more closely aligned to how they are running a system. Did you know block replacement can be contingent on its age or the number of times it has failed? Or that maintenance schedules can be based on system state, making it possible to skip maintenance if it would bring the system down? Using an example-based approach, this session will illustrate useful features and functionality not easily found or missing from the documentation and JMP User Community. Topics include getting, setting, and employing system information as well as applications for less commonly used Events and Actions. A JMP Journal with all examples will be available.

Thanks for taking some time out to watch this video. I'll assume that you're here because you've used or tried to use the Repairable System Simulation platform within JMP. You might want to learn a little bit more, or you might be frustrated because there are things you're trying to do that you didn't think you could do. So what I'm going to cover today are things that aren't explained deeply in the documentation, things that I've learned through trial and error that aren't in the documentation, and even things that I've learned from talking to the developer of the platform, things that you wouldn't be able to find any other way. I'm going to assume that you've had some exposure to this platform and that you know a little bit about reliability analysis, so I don't have to go into detail about what a parallel setup is, what K out of N means, and so on. I've got a number of examples I want to cover and a lot of material, so let's get started. Now, before we actually look at the Repairable System Simulation platform, let me give you an overview. Hopefully you've used it before and you're familiar with it; we're going to talk about three different components and aspects of those components that aren't in the documentation or aren't well explained in the documentation. The first thing we're going to talk about are blocks. Blocks are what I use to build my repairable system simulation, things like Parallel, K out of N, and Standby. For each block, I can associate an event with that block; those are the little orange blocks that we see. For each event, I can associate one or more outcomes, or actions; those are the blue dots. That's going to help me build up my simulation. For example, I might want to know when a block fails, and when that block fails, I might want to repair it, and that repair might take, let's say, 40 hours. That's the idea behind building these RSS diagrams. Let's start with some of the general properties of the blocks. There are six different block types and a knot, which allows me to tie things together within the diagram. Five of those six blocks are composites: they were built to be made up of two or more subunits. JMP allows you to use them with one subunit, but really the use case is that they are built up of two or more subunits. Events and actions are applied to the entire block. This is an important point: the components within the block can't be treated individually, except for the standby block, and they're assigned a single distribution. Taken together, these two points mean that things like repair or maintenance on subunits are not possible within these composite blocks.
I've got to wait for the entire block to fail before I can actually do anything with that block, before I can repair it. If I want different distributions within the block, I can't do that either, with the exception of the Standby block, and we'll talk about that in an example a little later on. Subunits for all the composite blocks, except for the Series block, are in parallel. What differentiates the composite blocks are the number of components that are running, when the block fails, and how the block ages. I put together a little table that goes over all of the composite blocks and how they differ on those characteristics. For all the composite blocks, with the exception of Standby, all the components are running simultaneously. With a Standby block, I can pick K of my N units to be running and the remaining units to be backups. So when do these blocks fail? If you have any experience in reliability, you know that a Series block fails when one unit fails, a Parallel block fails when all the units fail, a K out of N block fails when you have fewer than K units running, and so on, things you would expect from reliability analysis. Standby and Stress Sharing have another mode of failure. With the Standby block, there is a switching mechanism: there can be one switch for the entire block or one switch for each backup. The Standby block can fail, if I have a single switch, when that switch fails, or, if each backup has its own switch, when I attempt to use a backup unit and its switch fails. So switch failure can be another cause of the block failing. The same goes for Stress Sharing; however, with Stress Sharing I only have a single switch. The final characteristic that differs between these composite blocks is how they age. For the first three blocks, all the subunits within the block age equally. For Standby and Stress Sharing, the subunits can age differently: for Standby they can, and for Stress Sharing they definitely do. For Standby, I get to pick a single distribution for my operating units. For my backup units, I can choose that they don't age, which is cold standby, or I can give them a different distribution for how they age while they're in standby mode, acting as backups. That could be the same as my operating units or something different, so I get to pick two distributions. The thing is, as with all of these composite blocks, I get one distribution for all my operating units and one distribution for all my backups; I can't mix and match. Finally, Stress Sharing works a little differently. With Stress Sharing, the stress is distributed among the subunits equally; however, when one of the subunits fails, additional stress is placed on the remaining operating units, so the stress changes from failure to failure. We'll see an example of that in a little bit. Now, one thing you might have realized is that sometimes you want a little more flexibility, and the way I can build in flexibility is to arrange basic blocks as I would a Parallel or K out of N composite block. Basic blocks can easily be arranged like Series, Parallel, or K out of N; it's harder to do with the Standby and Stress Sharing blocks when you've got more than two units. I've got an example in a little bit that shows a Standby setup with two basic blocks, which is relatively straightforward to do.
But if I were to try to extend that to three blocks, it becomes much more difficult. Not impossible, just very difficult to do because I'd have to keep track of what's operating, what's not operating, what happens when the failure occurs and so on. Same thing goes with the Stress Sharing as well. Let's go to our first example. For our first example, we're going to look at a basic K out of N arrangement. We're going to have five subunits. Three of them need to be operational. They're all going to have the Weibull with an Alpha of 2,000, Beta of 1. We're going to assume that when a block fails that it takes 40 hours to replace. We're going to put that arrangement in series with a K out of N composite block. We'll also assume that the K out of N composite block takes 40 hours to replace, despite the fact that there are five subunits within that block that have failed. We'll take a look at what difference does it make whether I treat these as individual items or as my composite item? This is what my setup would look like in RSS. I've kept it very simple. Here are my subunits relative to my K out of N. What makes it a K out of N is in the knot. I can specify how many of the incoming items need to be operating. And here I've said I need three operating. So that's what makes it a K out of N, this arrangement, K out of N. And I'm just going to keep it very simple. The only event I'm going to look at is whether or not the block fails, the only action I'm going to take is to repair as new. Let's go ahead and run this. Relatively fast. I want to spend a little bit talking about the output because what gets sent to the output and the states will help us when we start building our diagrams in the future. Let's just take a look at the order of operations for this very first failure. So we see it's 766 hours unit Sub1 has failed. The block is removed. So that's important to note because that is one of my potential events. So block is removed, I start the process, I replace the block. And after the block is replaced, after the 40 hours I specified in my setup, the system is turned on automatically. So that's another important point. We're going to run into the situation where for some of my events, the system doesn't get turned on automatically, and I'm going to talk about which ones those are. What we'll have to do in those cases is actually turn the system on, or in some cases, when the block doesn't get turned on automatically, turn the block on manually. Here we've got Sub3 fail and you see a similar series of events, Sub2 fails, etc. Let's take a look at what happens when the system goes down. And we know that happens for sure when K out of N fails. Okay, so here's the first case where the system fails. Again, very similar in respect to most of the operations, but you'll notice that it turns the system down. Notice it turned the system down, so I get to turn system down. It turns off all of the blocks in the system. Again, something important to keep in mind, because there are going to be cases where we need to know whether or not a block is on and off. Whether or not we need to turn on a block manually. So here we go to the replacement. Once the replacement is finished, all the blocks are turned on automatically. And again with this replace, with new, the turning off and turning on all happens automatically. So I don't have to worry about that. Finally, I finished my K out N replacement and I'm done until the next failure. Relatively straightforward. 
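As a side note, the 3-out-of-5 block in this example can be sanity-checked with a quick Monte Carlo sketch outside JMP. This is not the RSS simulation itself; it ignores the 40-hour repairs and only looks at the time to the first block failure, using the Weibull parameters stated in the example (scale alpha = 2,000, shape beta = 1).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2000.0, 1.0        # Weibull scale and shape from the example
n_units, k_required = 5, 3       # 3 of the 5 subunits must be operating
n_sims = 100_000

# Draw a lifetime for each subunit; the block fails the moment the number of
# working units drops below K, i.e., at the (N - K + 1)-th subunit failure.
lifetimes = alpha * rng.weibull(beta, size=(n_sims, n_units))
block_failure = np.sort(lifetimes, axis=1)[:, n_units - k_required]
print("mean time to 3-out-of-5 block failure (h):", round(block_failure.mean(), 1))
```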
If I wanted to take a look at how my system has performed I'm just going to launch my Result Explorer. By default, I am not going to see the component level information. If I want the component level information, I need to go back up to the hotspot and pick if I want an overall, I'm going to pick my downtime by component. Now, the thing to keep in mind with this organization, this is just telling me the downtime of the system associated with my components. So this tells me how many times K out of N has taken the system down. For each one of these, these corresponded to when that subunit took the system down. So if you think about it, these correspond to the times when subunit was the last of my five units to go down, because that's the only time the system is going to go down is when all of those five subunits go down. Now, if I want to look at my component level distribution, this only tells me when it brings the system down. But say I'm interested in how often Sub1 was down, how often Sub2 was down and so on. The hotspot, I can look at this Point Estimation for Component Availability. And if I scroll down, for example, for Sub1, this gives me my distributional information for the component itself. Z includes the times not only when Subunit 1 took the system down, but when Subunit 1 went down itself. Easiest way to dive into the individual distributions of all my subcomponents. Let's move on to some of these actions and events and some of the particular properties associated with those. We'll start with the events. Not mentioned in the documentation but Inservice is only available for Basic blocks. So to remind you, Inservice is based on the age of the block and not the age of the time of the simulation. So, for example, if I want to service something after 100 hours of use, I would use an Inservice. Otherwise, I could use just a scheduled event. A scheduled event would be serviced once a week, but Inservice is only available for my Basic block. So tells me if I want to use Inservice, I'm going to have to build up these basic blocks. Initialization, Block Failure, and System is Down can only be used once for each block. So what you'll see is when you use them, they disappear from your palette of events, so it won't show up anymore. So you can only use these once. Let's talk about actions. There are 13 actions that I can use. Install Used, Change Distribution, and If can only be used with Basic blocks. Minimal Repair can only be used with Basic and Series. So that leaves us nine actions that I can use for any of the blocks. So they are the Turn On and Off a Block, Turn On and Off the System, Replace with New, Install New, Remove a Block, Inspect Failure, and Schedule. So those are the nine that any of the block, either the Basic or the Composite blocks can use. There is no limit to how many times an action can be used to a given block. So I have an example where I used Turn Off Block more than once. The reason is that what I want to do after I turn or when I turn on the block might differ depending on when the block gets turned on. And again, we'll see an example of that in a little bit. Actions can only be connected to other actions. As a matter of fact, events can only be connected to actions. So I cannot connect an event or an action to a block or to another event. However, I'm unlimited in terms of how many actions I can chain together. A few specific properties. Initialization only occurs once, and it's at the very start of the simulation run. 
We're going to have an example where that makes a difference, where I turn off a block at an initialization, but when the system comes back online, that block, it's turned on again. So I'm going to have to make sure I turn that off manually. So Initialization only happens at the beginning. System is Down occurs for every block, regardless of which block brought the system down. So I can use that as a listener to see... For example, if the system goes down, I might want to perform maintenance on the block even though the block was running, maybe just to bring it back up to near new or to bring it back up as new. So a System is Down will occur regardless of which block brought it down. Then about the actions, Turn on System turns on every available block. Now, by every block, I mean everything that's not already down because of failure or scheduled to be down. So I could schedule something to be down either by using that scheduled event or by using the scheduled action. Turn on the System is going to turn on all of my blocks. Blocks got to be turned on manually after Install New, Install Used, or Change Distribution. Those are a couple of those cases where the system does not get turned on, the block does not get turned on automatically. This is why I might want to use, for example, install new instead of Replace with New. Install New lets me do stuff before I turn the system back on. It gives me that chance to perform other operations or other actions before I turn on my system. Turn on Block does not turn on the system if the block was turned off manually. As we saw in the output, as you might have noticed in the output, that regardless of the system being on, you might get a Turn on System event show up. It does not hurt to turn on a system if the system is already on. What I say is, if in doubt, add a Turn on System to your chain of events to make sure the system is on. By the way, it doesn't matter whether the block is turned on first or the system is turned on first. Both work the same way. All right, let's move on to our second example. What we're going to do is we're going to build a basic block in a Standby arrangement. We're going to have one main unit, one standby unit. We're going to do a cold startup, meaning that the standby unit, the backup unit, does not age. However, it's going to take 30 minutes to start up the unit. Something I can't do with the composite standby. Eight hours of maintenance on the backup is performed after control is passed back to the main unit. We're going to look at two different switch situations. We're actually going to look at an infallible switch and we're going to look at a switch where we've got 90% reliability. For all of my blocks, I'm going to assume Exponential(500). I'm going to assume eight hours to either repair or maintain. Failing blocks will be replaced with new. Like the previous example, we're going to look at the basic blocks in this arrangement in series with a standby block with the same setup, obviously, with the exception of that 30-minute startup time, which I can't do. Let's take a look. Here is our basic arrangement. I've got my backup unit, my main unit, my standby unit, and for each one of these I've associated a failure event and a Replace with New. Sort of the bare bones of what I might start with. I don't know if I pointed out previously when we were looking at the output, but it is helpful when I add an event, add an action to give it a unique name because those are the names that appear in my output. 
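Before building the diagram, here is a rough back-of-the-envelope check on the standby example just described. It is only a sketch of the switch logic under the stated assumptions (90% reliable switch, a 30-minute backup startup when the switch works, and roughly a full main-unit repair of downtime when it does not); it is not a substitute for the RSS run, which also tracks the maintenance and repair bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(7)
n_failures = 100_000             # main-unit failures to simulate
p_switch = 0.90                  # switch succeeds 90% of the time
startup_hours = 0.5              # 30-minute cold start of the backup
repair_hours = 8.0               # downtime if the backup never comes online

# For each main failure: if the switch works, the system is only down for the
# backup startup; if it fails, the system waits out the main-unit repair.
switch_ok = rng.random(n_failures) < p_switch
downtime = np.where(switch_ok, startup_hours, repair_hours)
print("expected downtime per main failure (h):", round(downtime.mean(), 2))
# Analytically: 0.9 * 0.5 + 0.1 * 8 = 1.25 hours per failure.
```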
All right, so let's start with a backup unit. When my system comes online at time zero, everything is going to be turned on. I want to make sure that at time zero, that backup unit gets turned off. The way I'm going to do that is I'm going to use Initialization. You'll notice that when I used Initialization, it disappeared. Next thing I want to do is turn off the block. Now, if you have used RSS before, you'll remember that when I try to add an action, it's going to stack these actions on top of one another. Sometimes that's not the best location for them. What you could do anytime you have one action to work from, I can select the action, click and hold this plus sign and then just drag off the plus. Now I can add my action. In this case, I want to turn off the block. Now, I don't want the Turn off Block associated with this, Replace with New. What I'm going to do to avoid that happening is just delete the arrow. Now I can connect to the event I want to. The advantage of doing that is now I can move that block anywhere I want to. I have much more flexibility with the way that block works. The other thing I'm going to do is I definitely want to label these so I know when I look at my result table, I'll know what happens. This is Turn off Block... Let's just call this turn off backup at start. I'm going to give each one of my events, each one of my actions a unique name so I know when they occurr. That turns my backup off at the start. The next thing I'm going to want to do is when the main unit fails, I'm going to want to turn on the backup unit. Now, again, rather than selecting the backup unit and say Turn on Block, I'm going to choose my Turn on Block interactively. I'm going to delete my connection here so I can move this. I'm going to move things around a bit. I'll move this right over here. Again, I want to name this so that if I come back to this diagram in a couple of days, I know what this action is associated with. We'll call this turn on backup from main. Another thing that is mentioned in the documentation, but I don't know how strongly, but I can connect, in this case, I'm connecting an event from one block to an action in another block. I can definitely do that. Now, as I mentioned earlier, this will turn on the block because that block was turned off manually, I'm going to have to turn on the system as well. Turn on the system. Again, I probably want to label that, I'll say, Turn on System from backup and so on. Now the only other thing I need to add to my diagram is that when my when my main comes back online, I want to turn off my backup, do maintenance to my backup. Now I could probably... Well, let's start with, turn off the backup, Turn off Block. Delete that. We'll call this. Must have picked turn on, I got Turn off Block. Let's try it again. Turn off Block and label that, turn off backup from main. There's where my main comes online. There I've turned off my backup. I want to perform maintenance on it. The way I'm going to do that is I'm just going to Replace with New. Okay, we'll just call this preventive maintenance. I think I said that that preventive maintenance is going to be eight hours that it takes. Now again, I've got to remember that Replace with New automatically starts the block. What I'm going to have to do now is turn off the block. Now, thinking about that in a little bit more detail, I could have probably just gone directly from the new coming back online to doing the maintenance. 
Because once I start the maintenance on my backup, the backup stops operating. Then once I am done with my maintenance, there's this turning off of the block. This block doesn't hurt, but it's redundant, so let's just get rid of it. That would be my setup. I always like to run it to double-check that things operate in the proper order, and I'll generate my output. In this case, I've only got it set up to do one simulation. I might have to take a couple of simulations depending on the reliability associated with my units. Here the main fails, the system goes down, the standby block is turned off, a new main is put in, turn on backup from main. If we look at the predecessor column, there's my turn on backup from main: Turn on System from backup, Turn on Block by system. Again, I like to step through these once I build them to make sure things happen in the right order. The system gets turned on. There's the new main being put in. I turn on my system; again, it's already on, but it doesn't matter, it doesn't hurt. New main, Replace with New. That is the backup being turned off and getting a PM, and so on. There should be that backup being turned off. Again, I like to use these just to double-check that things are working properly. That's the setup without the switch. Adding the switch is relatively straightforward. Where we would put it is between the main failing and the backup starting, and that's what the If blocks are made for. Let me get rid of that connection. It doesn't matter which of my blocks that If comes from, and I didn't want to put it there. Let me go from... Here we go. I've got a backup just in case something like that happened. Here we go. There's my backup being turned on. There is my main failing. We are going to get rid of that. I probably want to add in my If block so I can float it anywhere. I'm just going to grab any one of these and we'll add an If. Delete the connection. Let's connect that. Now, the way If works is that it evaluates what's in the If condition box. If it's true, if it evaluates to one, then you continue going. If it evaluates to zero, you stop. So what I can say is: I don't need an "if" keyword there; I'm just going to say Random Uniform is less than 0.9. When that's true, the chain will continue; otherwise, it'll stop there. That's it. It's as simple as that to put in that If statement. Again, I probably want to label this too, "if switch is success" or something like that. At this point, the only drawback is: what happens if the switch fails? Well, if the switch failed, it's going to have to wait until the main comes back online. The main comes back online, and it tries to turn off my backup. It doesn't hurt to try to turn off something that's already off. But the thing I might not want to do is the maintenance, if the backup wasn't used. In a later example, we'll see how to get around that: how to look into the system and ask, is that backup running? If it is running, yes, do the maintenance. If it isn't running, you don't need the maintenance. Let's move on to the next example. In this case, I've got two pumps operating in parallel with a preventive maintenance performed every two weeks, taking eight hours. The PMs are staggered by a week to keep from having both pumps down at the same time. If a pump fails, a minimal repair taking four hours is performed. In this case, the simulation is run for a year. Let's start with the bare bones here. Here are my pumps with the pump failures and replacements.
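Before building up the pump example, for reference, the If condition used for the 90%-reliable switch above is just a one-line JSL expression typed into the If block's condition box (a minimal sketch; Random Uniform() is standard JSL):

```
// If condition for the switch: the chain continues only when this
// evaluates to 1 (true). Random Uniform() draws from U(0,1), so the
// switch succeeds about 90% of the time.
Random Uniform() < 0.9
```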
How do I set up the "wait a week before you start the PMs"? Like the last example, I'm going to start with Initialization. What I want to say is: wait a week. How do I say wait a week? Well, I will use the Schedule action. I'm going to do what I've been doing: set up the Schedule and delete my connection. I've got a number of different options with Schedule. I only want this to happen once, and I want it to happen at the very beginning, so it's connected to my Initialization event. We're going to call this "wait one week." I only want this to happen once, and I want it to take 168 hours; that's because I've got my simulation set up in hours. What that does is, on start, it waits a week and then performs whatever action I connect here. What's the action I'm going to connect? Well, I'm going to use Schedule again. This is going to be the PM on pump B. I'm going to leave my max occurrences blank, so it runs until the end of the simulation. The completion time is going to be 168 times 2, so 336. Now what happens is that after that first week, every 336 hours I am going to be able to perform another operation. There are a couple of different ways to do that operation. We're going to assume that it's a PM [inaudible 00:37:03]. So we're going to say Replace with New, and I can say here, replace pump B. Now, one of the things you might be thinking is: what happens if I go to do my preventive maintenance and everything is staggered, but for whatever reason pump A is down, maybe because it has had a failure? I wouldn't necessarily want to take down my pump B for preventive maintenance, do that, and take down the system. What we're going to use is a hidden variable called Simulation Context. Simulation Context is the way I can look into my system. We will put the Simulation Context check between my Schedule and my actual preventive maintenance. Let's do this. Let's get rid of that, and we will implement it with an If. Let me connect, and we'll call this "if pump A is working." Now the question is, how do I tell whether pump A is working? Well, in my If condition, I need to specify the hidden variable associated with pump A, whether it's active or not. The way I would do that is Simulation Context, and case does not matter. Simulation Context and the system variables are stored as an associative array. This one is called status of pump A. Now, what I have found is that this particular part of the dialog box is not quite as flexible. If I've got something more complex that I want to build, then I'll usually do it in a script window, copy it, and then paste it into this window. But this is relatively small, so we'll just go ahead and type away. We're going to say this is active. That's what I would use. Now, how did I know that that's what I have to use? Well, as it turns out, this If condition, and a little bit later on when we talk about stress sharing, acts like a script. What I can do, and what I've often done, is put in a print statement. How do I know what is available to me? Let's put in a print statement there and say print Simulation Context. The only thing we need to be sure of is that the last statement in that If condition evaluates to true. Just to show you what happens when I do this, let's go ahead and run it. I'm going to run that just to give you an idea of what gets printed to the log.
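For reference, the If condition built here looks roughly like this in JSL (a sketch, not the exact dialog contents; the key name "status of pump A" and the value "active" follow the demo's naming, and the right way to confirm the keys in your own diagram is exactly the Print statement shown):

```
// Dump the hidden associative array to the log to see which keys exist.
Print( Simulation Context );

// Only continue the chain (and do the PM on pump B) if pump A is up.
// The last statement in the If condition must evaluate to true to continue.
Simulation Context["status of pump A"] == "active"
```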
Let me go ahead and grab my log; it's over on one of my other… Here we go, there's my log. This is what got printed to the log. It gives me information on all of the variables that are available in Simulation Context for this particular diagram. If I had put more blocks, more actions, more events in there, then information on those would show up too. This is a good way to tell what's in my system and what variables I have access to. The important thing to remember is that I can read this, but I cannot write back to these variables. I don't have the option to change things in my simulation that way, but I can use the information. Going back to our last example, what I could have done is put an if statement in there and said: if my backup was active, then turn it down and do the maintenance. Okay. Now, for the sake of time, I'm going to move on to the last bit of information I want to share with you. By the way, I will have a journal and a write-up available with more examples that I didn't have time to cover. You'll see a couple more of the actions we didn't have a chance to look at, how they get implemented, and maybe some of the other things that might not be terribly straightforward. Custom stress sharing is a little bit unusual in the sense that, by default, stress is shared equally among the subunits, and it's implemented by altering how the block ages, not by explicitly altering the distribution. With no stress sharing, let's say I have n subunits in parallel, the first one fails at time one, the second at time two, and so on. With stress sharing, the way the block ages is that the first failure ages the block by t1 times n, the number of working units. The second one ages the block an additional n minus one times the difference between my first and second failure times. The third one adds n minus two times the difference between the third and second failure times, and so on. That's the way the block ages, and this n is the information that gets passed to the custom function. When I try to build this into my stress sharing... okay, let me share with you this last example on how this is set up. Let me find my example. Okay, here I've got three different stress-sharing blocks. There's my basic stress sharing, so I've got that set up as basic. Here's my custom. This is the n that I'm talking about. You'll notice that it's embedded in this log function. It's done that way so that the aging of the block matches up with the distributions that I can use for my custom stress-sharing block. What I do is think of things in terms of my explanation: the first unit ages the block at some multiplier times the time, the second one at some multiplier times the difference between my first and second failure times, and so on. That's the way I think about all of these. For example, let's say I wanted to set up... well, before I go there, let me just briefly say that, as with the If block, I can put print statements in here. If I were to put in a print statement here, you'll notice that it gets executed before the system starts, so it's going to print five, and it's going to print five because in this case the block has five units. When the first one fails, it's going to go back into the function and print four, and so on and so forth.
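Written out (my notation, as a sketch of the rule just described): with $n$ subunits in parallel and ordered failure times $t_1 < t_2 < \cdots$, the block's age is accumulated as

$$
n\,t_1 \;+\; (n-1)(t_2 - t_1) \;+\; (n-2)(t_3 - t_2) \;+\; \cdots
$$

so the multiplier at each step is simply the number of subunits still working, and that multiplier is what the custom stress-sharing function is allowed to replace.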
Let's say I wanted to set this up as a custom stress sharing where the block ages at 10 times the rate, not five times but 10, the second one ages at five, the third one at three, the next one at one, and the last one is overstressed and ages twice as fast. What I could do is, again, let's put this here. Let me add a little space and make my box a bit bigger. I might put something like... again, that log has got to be in there, so log, and then I would say something like Choose. What Choose does is, based on n, it'll return: let's see, when there's one unit I want it to return a half, one with two units, let's say 2.5 with three units, 5 with four units, and then finally 10. That's how I would create my custom stress sharing. I'm not limited to this Choose function; in this particular example, I'm using a function. Again, the thing that's important here is that this all gets embedded in the log function, so that the aging of the block and the distributions match up. Okay, unfortunately, that's all I have time to cover today. I wish I had more time; there are certainly a lot of other things we could talk about. Those will be in the materials that I provided for you, and I'd love to hear your feedback, questions, and comments. Again, thanks for taking the time to watch this. I hope it's really beneficial. Thanks.
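To round off the stress-sharing example above, here is a minimal sketch of the custom expression built in the demo. The variable n (the number of working subunits passed into the custom function) and the multiplier list (0.5, 1, 2.5, 5, 10) are taken from the demo; the Log() wrapper is what makes the aging line up with the available distributions, as described:

```
// Custom stress-sharing expression, as built in the demo.
// Choose( n, a1, a2, ... ) returns the n-th argument, so the aging
// multiplier depends on how many subunits are still working.
Log( Choose( n, 0.5, 1, 2.5, 5, 10 ) )
```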
Peter Hersh, JMP Senior Systems Engineer, SAS Hadley Myers, Sr. Systems Engineer, JMP   When collecting data for an analysis, we are all very cognizant of the need for a sample that is unbiased and truly representative of a greater population. Great efforts, often at great expense, are taken to ensure that is the case. However, this standard is not always applied to other forms of data collection. For many, research into topics of interest starts and ends with online searches. Using a designed experiment and the visualization/analytic capabilities of JMP 17, we sought to investigate how different search engines in different parts of the world are potentially biasing search results and, therefore, the conclusions we respectively reach on these topics. Join us for this amusing and thought-provoking presentation that you should totally rate five stars prior to viewing to save time.     Thank you all for clicking on this talk and coming to watch it. This is really about bias in data. Every analyst that works on a project understands the importance of ensuring that the data they collect is unbiased. Steps are taken to avoid bias at the start of the data collection, before the project has even really begun; there are numerous checks at points along the analysis; and then at the end, any conclusion reached is taken in the context of potential additional sources of bias. But this same level of care isn't applied to online searches on topics of interest. Search engines use algorithms that are designed to deliver personalized content that is relevant for us as individuals. Now, this has advantages. It means that returned search hits are more likely to be relevant and interesting, but it also has disadvantages. By definition, these are not unbiased. We have an example. Yeah, Hadley brought up a great point there: in science and engineering, we take great care to make sure that our samples are unbiased. But let's think of a library. We walk into the library and there are two people interested in informing themselves on vaccine safety. Let's say they walk into a library and ask a librarian for books on vaccines, and the first person receives three books. These are actual books: Smallpox: A Vaccine Success; Anti-vaxxers: How to Challenge the Misinformed Movement; and Stuck: How Vaccine Rumors Start and Why They Don't Go Away. Now, let's say a different person walks in and receives three completely different books: THE COVID VACCINE: And silencing of our doctors and scientists; Jabbed: How the Vaccine Industry, Medical Establishment, and Government Stick It to You and Your Family; and Anyone Who Tells You Vaccines Are Safe and Effective Is Lying. These are actual book titles. Let's say that looking at who you are, where you live, how old you are, your gender, maybe even your browser history, determines which of these sets of books you get. This is essentially the problem with bias as you go to search for things online. It may be that before we even start looking at our browser search results, we've already got bias in there, and we want to understand whether that's the case or not. That's what motivated this. You got any thoughts on that, Hadley?
Well,  I  think  that  the  thought that  I'd  like  to  express  right  now is  that  the  purpose  of  this  presentation isn't  to  judge  or  to  opine on  the  advantages  or  disadvantages   of  the  search  algorithms that  may  or  may  not  be  used. The  purpose  here  really   is  just  to  take  an  example of complex/ unstructured  data and complex  because  it  is  unstructured and  this  was  search  results. Then  to  use  some  of  the  exploratory  visual and  analytic  capabilities  found in  JMP  Pro  17  to  try  to  understand what  we  were  seeing  and  to  present  it in  such  a  way  to  help  you  to  understand. The  purpose  of  this  presentation  is to  inspire  you  to  try  these  techniques for  yourselves  and  others like  them  on  your  own  data. Let's  go  through  briefly  the   methodology. What  Pete  and  I  did  was  we  came  up with  a  few  search  terms  we  thought would  lead  to  interesting  results. You  can  see  those  terms  here. We  define  some  potential  input  variables which  may  or  may  not  be  affecting the  results  of  the  search. We  know  that  there  are  very  likely others  as  well  that  we  didn't  include. This  is  true  with  any  designed  experiment. We  can't  capture  every  variable, but  we  took  a  few. We'll  see  whether  these are  significant  or  not. We  developed  a  data  collection  procedure whereby  we  use  the   MSA  design  in  the  DOE  menu. This  is  a  convenient  way  to  create  these  tables  that  we  can  then  send to  JMP  SEs  and  friends  of  SEs. Now,  right  away  this  isn't an  unbiased  random  assortment of  people  we've  asked   to  fill  out  these. They're  all  people  that  work  for  the  same company  and  have  the  same  job  title. As  we  said,  the  purpose  is  really to  understand  the  techniques  and  methods that  we  use  to  try  to  understand  the  data, and  then  to  think  about  how you  can  apply  it  yourself. We  explored  the  results which  we'll  show  you. Then  finally  we  presented  the  findings at  the  JMP  Discovery  Summit  America  2022, which  is  what  you  are   watching  right  now. Without  further  ado,   let's  jump  into  the  data. I'll  start  out  by  talking  just  briefly about  the  MSA  design  that  you  see  here. What  we've  done  is  we've   added  the  factors  of  interest, we've  added  the  terms of  interest  that  we  were  looking  at. then  the  nice  thing  about  this   is  that  when  we  make  the  table, what  we  could  always  do  is  press  this  button, send  these  out  to  everybody  that  needed to  complete  the  results  for  us, send  them  back,  catenate them and  then  we're  ready   to  start  beginning  our  analysis. But  as  any  analyst   who's  ever  collected  data and  tried  to  analyze  knows the  data  very  often  isn't  in  a  format where  you  can  immediately   start  with  your  analysis, some  cleaning  needs  to  be  done. I'll  pass  things  over to  Pete  to  talk  about  that. Pete? Yes,  great  point. I  think  everybody has  gone  through  this. Even  with  a  well- designed DOE, you  oftentimes  have  to  make some  adjustments  to  do  the  analysis. Hadley  showed  those  operator  worksheets that  came  out,  and  here is  one  that  I  filled  out. I'm  not  going  to  keep  myself  anonymous, but  I  didn't  want  to  share someone  else's  results. 
But just to give you an idea, we had folks answer a few demographic questions that hopefully weren't too revealing: basically where you were located, how old you are, and then the search term you used. Like Hadley showed, there were three responses. We had people do the search and then record the top three responses that the search engine recommended. To do the analysis, the first thing we had to do was take these three and bring them together. A nice, easy way to do this is to go under Columns, Utilities, and just Combine Columns. I just called these Responses and set a little delimiter. I unchecked that multiple response option because we're going to just do text analytics on this. Then you get this, which is the table, excuse me, the column we're using for Text Explorer. Now, I did this and then brought in all of the results from all the different people who took the survey, and tweaked things a little bit more, like combining whether you were in the US with which state you were in, and then summarizing that into a region, Americas versus Europe, because we didn't have enough respondents to break it up by state. In the end, we end up with a table that looks like this. We had to do a little bit of recoding, we had to fill a few things in, and then anonymize the search engine. The folks that got the survey knew which search engine to use, but we're not sharing that here. Hadley is now going to talk about some of the results we saw out of this once we had it in this form. Let's open up this dashboard right here. What you're seeing are the most popular terms, in descending order of popularity from left to right, for the first response, second response, and third response, across every one of the search terms, every gender and age, and all of the other factors. We can use this, and the hierarchical filtering on the dashboard, to explore it a little closer and see if we can learn anything. One thing I happened to notice: if we look at "the world is" and we click on male, you'll see that for many people, the first search hit they found was "the world is not enough"; if you're female, you're equally likely to find "the world is your oyster." Interestingly, if you're less than 40 years old, that's when "the world is not enough" suddenly becomes "the world is yours." I think we could probably agree that's probably true for people under 40, isn't it? What else have we got here? If I look at climate change, another hot topic of interest these days, as well it should be: if I were to look at people over 50, apparently a huge concern for them is whether climate change is changing babies in the womb, which, interestingly, isn't a concern for people below 40. I wondered whether this is a valid concern for people over 50, whether they're more likely to have their babies changed in their wombs or not. But aside from that, let's take a step back and see how we can go about creating this dashboard. It's quite simple.
The first thing we need to do is create our filter variables. I've done that here. Here are our search terms and our distributions. What I'll do is go through how to create the Graph Builder report, because that's something you may not be familiar with and might be interested in doing. I'm going to take my first response, put it here, and simply choose the count of times each one occurs. Then I can right-click and order by count, descending. That's it. I've done the same thing for my second response and my third response as well. Now we can go ahead and put together the pieces of the dashboard. We'll click on New Dashboard and choose the hierarchical filter plus one. I'll take my distribution results, put them there, my input parameters, put them there, and then my graphs. Let's see, is this one first? Well, I can't tell. We'll just put them in order like this, and I can always change the order if I want. All right. I'll run the dashboard, and there we have it. It really is as simple as that. Then I can go ahead and save it to the table. That was one use of a dashboard. I'll show you another use of a dashboard, which was to use it with a Text Explorer Word Cloud. This shows the most common words, not entire phrases or entries, but individual words. You can see the word "design" seems to be used a lot. If I were to look at, for example, statistics, it looks like everybody can agree that statistics is a science. Interestingly, if you're in Europe, apparently you find it harder than you do if you're in America, where that doesn't come up, so something I happened to notice there. To create this dashboard, it's very much the same as the other one. We'll add our distribution items: here's the first one, here's the second one. We'll add our Text Explorer Word Cloud, and then we'll simply put it together just as we did the previous one. With that, I'd like to thank you for this part of the presentation about the exploratory visual analysis. I've shown you how you can go about doing this using the hierarchical dashboards. Now I'll turn things back over to Pete, who will take us through some more in-depth use of the Text Explorer. Perfect. Thanks, Hadley. Like Hadley mentioned, this is a different way to display this, but this is the end result of using the Text Explorer and looking just at the Word Cloud here. He had made this a dashboard and used filters that were graphical in nature, which is great. You could also do this with a local data filter. But this is basically the end result we're going for. Let's now back up and talk about how we got here. With our data set over here, we just launched the Text Explorer under the Analyze menu and put in the column that we're interested in, in this case, all three responses combined into one column. We have a bunch of options we can use to tweak this, including language and how we tokenize the words. But we're going to go ahead and just use the defaults.
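For anyone who prefers to script the launch just described, a minimal JSL sketch; the column name Responses stands in for the combined response column and is hypothetical, and only the default options mentioned above are set:

```
// Launch Text Explorer on the combined response column, keeping the
// default tokenization settings as in the demo.
dt = Current Data Table();
dt << Text Explorer(
	Text Columns( :Responses ),  // first/second/third responses combined into one column
	Language( "English" )
);
```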
Here  you  can  see  since   we  have  different  responses to  different  search  terms, the  overall  term  and  phrase  list by  itself  is  not  super  informative. What  we  would  want  to  do   is  apply  that  local  data  filter and  the  first  thing  we'll   look  at  is  that  search  term. Now  we  can  do  something  like  the  economy  or  coronavirus or  climate  change  and  go  from  there. Let's  focus  in  on  climate  change  here. One  thing  that  I  wanted  to  do was  add  some  sentiment  analysis. The  first  thing  I'm  going  to  go  ahead and  turn  on  this  Word  Cloud   so  it  looks  like  it  had  before. Now  we  can  display  it  this  way  where  you have  the  most  common  term  in  there and  you  can  see  it's  climate  and  change. We  know  that  we're  searching  that, so I  could  go  in  here   and  add  these  as  stop words and  now  see  which  ones  come  up the  most  frequently  when  we're mentioning  climate  change. This  is  one  way  to  display  the  Word  Cloud. I  can  also  go  through  here  and  maybe  change  this  to  something that  is  a  little  more   appealing  to  the  eye, but  maybe  less  useful   from  a  quantitative  standpoint. You  can  always  add   some  arbitrary  colors if  you  like  that  as  well. All  right,  so  I've  done  this  to  this  point, but  now  I  want  to  add  some  sentiment  analysis  to  this. Are  people  thinking  climate  change is  natural,  a  good  thing or  is  it  a  bad  thing? You  can  see  some  things  in  here   that  maybe  indicate  that, but  I  wasn't  quite  sure where  to  find  sentiment  analysis. With  JMP  17,  we  have  this new  feature  called  Search  JMP. If  you're  ever  looking  for  an  analysis  in  JMP, this  is  a  great  way  to  find  that. If  I  just  start  typing  in  sentiment, you  can  see  right  here that  it  tells  me  how  to  find  this, I  can  do  the  help,  but  I  can  also  just  hit show  me  and  it  launches  it  right  there. If  I'm  ever  wondering,  hey,  how  can I  do  this,  this  gives  me  the  option. Now,  a  couple  of  things  you  see  here it's  identified  some   of  these  default  terms that  are  providing  sentiment. Things  like  good. If  I  click  on  good, I  get  a  little  summary. It  looks  like  when  people  are  saying  good, that  is  actually  a  positive  sentiment. Now,  what  about  greatest? Oh,  boy,  almost  everything  that  says greatest  is  a  greatest  threat. Maybe  that's  not  actually a  positive  sentiment  there. We  might  need  to  do   a  little  bit  of  tweaking. First  let's  go  in  here  and  say,  okay,  well,  greatest  threat  is  a  phrase that  we're  seeing  commonly. I'm  going  to  just  add  that  phrase. Again,  you  would  do  this  in  your  curation  process, and  now  you  see  that  that  goes  away. But  I  think  greatest  threat is  actually  a  negative  thing. Let's  look  at  those  sentiment  terms. You  can  see  JMPs  identified  that  as something  that  maybe  has  sentiment. I'm  going  to  just  say,  you  know  what? That's  a  really  negative  sentiment. Now  when  we  go  down  here, you  can  see  that  it's  flagged  those  seven  occurrences where  they  mentioned  greatest  threat, and  it  said  that  those are  a  very  negative. That's  changed  our  overall  impression  of,  do  most  of  these  search  terms think  this  is  negative  or  positive? 
That's just an example of how you can walk through that flow and come up with the final sentiment analysis. I'm going to pass it back over to Hadley and let him wrap things up here. What I'd like to say is that we showed you, first of all, how we went about using the MSA Design to help with the data collection. We used Recode and other items in the Tables menu to help with the data cleanup. We then used Distribution, Graph Builder, Text Explorer, and combinations of all of them together to help with the data exploration, to see if we could uncover anything interesting. Then Pete used Sentiment Analysis together with Search JMP in JMP 17 to see what else we could learn about the data as a whole. With that, I hope you found this useful, and I hope it's given you some ideas about how you can do this on your own data for yourselves. I'd like to thank you all for listening, and I hope you enjoy the rest of the JMP Discovery Conference. Thank you.
In the pharmaceutical industry, the three PQ batch concept is being replaced by continued process verification. However, there is still a validation event (stage 2) prior to going into commercial manufacturing (stage 3). In stage 2 you are supposed to prove future batches will be good and in stage 3 to prove that the validated state is maintained.   JMP has the ideal toolbox for both stages. If the process can be described with a model, prediction intervals with batch as a random factor can be used to predict the performance of future batches. From version 16, prediction intervals also work with random factors. To live up to the assumptions behind the model, JMP has an excellent toolbox to do the sanity check. The prediction limits obtained in stage 2 can be used as the first set of control limits to be used in stage 3.   A JMP script that combines the necessary tools described above to calculate control limits, prediction limits and process capability in stages 2 and 3 will be demonstrated. The script has been heavily used by many customers. It only requires that data can be described with a model in JMP.     My name is Per Vase from NNE, and together with my colleague Louie Meyer I will present how you can use JMP in pharma process validation, leading into how you can do continuous process verification afterwards. We both come from a company called NNE; we do consultancy for the pharmaceutical industry and we work with data science, so we are trying to create value from data at our customers. Of course, we are extremely happy that we can use JMP for that, and this is what we are going to demonstrate today. The background for the presentation is process validation. This is a very important issue in the pharmaceutical industry: before you launch a new product, you have to prove that you are capable of producing it. Classically this has been done by just making three batches. If these three batches were okay, you had proved that you could manufacture consistently and you were ready to launch the product. But of course everyone, including the [inaudible 00:01:06], has found out that being able to make three batches is not the same as all future batches being good. What is expected today from a validation is that you not only show that you can make three batches; you are supposed to make a set of batches, and based on this set of batches, you should be able to prove with confidence that all future batches will be good. So instead of predicting the past, we now have to predict the future. This is really a challenging thing in the pharma industry, because they have been used to just making these three batches for many years. So how to do it now? We are really helping many customers with this, and what we strongly recommend is to simply predict the future. How can you predict the future? Well, you can do that in JMP with prediction intervals. That's what they are for. So you collect some data, and from the data you analyze, you can predict what the rest will look like.
If you can build a model that describes your validation data set, and you put your batches into this model as a random factor, the prediction limits actually cover not only the batches you have made, but also the batches you haven't made yet, and thereby you can predict the future. In the bottom of the graph you see stage 1. This is where you design the new process for making the new product. Stage 2 is then the classical, old-fashioned validation: where we previously made three batches, we now make a set of batches, and based on that we predict that future batches will be good. If you have a lot of batch-to-batch variation, it can be hard to predict with just a few batches, due to the width of the t distribution with few degrees of freedom, so it might take more than three. But we still strongly recommend making three batches in stage 2, because otherwise patients have to wait too long for the new product; and if that is not enough, you can make the rest in what we call stage 3a. There you keep producing batches until, hopefully, your prediction limits are inside the specification limits, and then you have proven that all future batches will be good and you can go into what we call stage 3b, meaning that you don't have to rely on in-product testing any longer. You could measure on the process instead of measuring on the product, because you have proved that all future medicine will be good. You only have to prove that you can maintain the validated state. That can give you a heavy reduction in your future costs. Some customers get up to 70% reductions in their future costs after they have proved that all future batches are going to be good. Today we will try to demonstrate how this can easily be done in JMP. So what is the validation all about? Well, it's the prediction of future batches. Previously we just made three batches, but now we have to predict the future batches. How can you predict that with confidence? Because when you do validation, you have to prove with confidence that things are fine. Well, you can just use prediction intervals; they're also called individual confidence intervals in JMP, for the same reason. Or you can go for tolerance intervals, if you want. How many batches in stage 2? We actually recommend going on with what they are used to, which is three. But you can only pass with three batches if your control limits are inside specification. Control limits are actually just prediction limits without confidence. I will show you how to calculate them, and if they are inside, the best guess is that things are fine. If your prediction limits are outside, you might need more batches, but you can make those after you have gone to the market with your product. How many batches should you make in 3a? Very simple: until your prediction limits are inside the specification limits, or, if you want, until the corresponding PPK is high enough. I will show how you can convert these limits to a PPK. When that's passed, you can actually go to stage 3b, because now we have proved that all future batches will be good.
You just have to prove that you can maintain the validated state. Typically, that can be done by measuring on the process, which is easier to do in real time compared to in-product testing. So that's actually a huge benefit that a lot of people harvest by doing it this way. Here is a small flowchart for how it works. You start out with your validation runs and calculate your prediction limits. Just be aware that up to version 15 they were not right: the degrees of freedom were too low, giving too wide prediction limits. It was fixed in version 16, so please use JMP version 16 for this. Then, if these prediction limits are inside the spec limits, you have passed both stage 2 and 3a, and you can go on to continuous process verification, measuring on the process instead of measuring on the product. If it turns out that the prediction limits are too wide, then look at your control limits. If they are within the spec limits and the prediction limits are outside, the most typical course of action is simply to collect more data. But you do that in stage 3a, after going to the market. So you just recalculate the prediction limits every time you have a new batch, and then hopefully, after maybe three or four extra batches, they are inside your spec limits and you're ready to go. If it happens that your control limits are also outside specification, then you're actually not ready for validation, because then the best guess is that the process is not capable. Of course, you have then failed your validation, and you need to improve the process and do everything again. Hopefully we don't end there, but of course it can happen. Just briefly about the methods we have used to calculate these control limits, prediction limits, and tolerance limits. For control limits: in JMP, of course, you can get control limits, but only for one known mean and one known standard deviation. Often you have more complex systems, where you might have many means, for example different cavities in a moulding process, or you might have several variance components between sampling points. Typically you have more than one mean and one standard deviation. But then you can just build a model that describes your data set. Then you take the prediction formula, save the standard error of the prediction formula, and take your standard error of the residuals, and then, instead of multiplying with the t quantile, you just use the normal quantile. That corresponds to leaving out the confidence. This is how we calculate control limits based on a model: simply from the prediction formula, the standard error of the prediction, and the standard error of the residuals, which we can all get from JMP, it's pretty easy to calculate the control limits. It's even easier for the prediction limits, because they are ready to go in JMP. They're already there, and of course we just use the limits calculated by JMP. As I said before, be careful: before version 16, the limits simply get too wide, so please use version 16 or newer for this.
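As a sketch of the calculation just described (my notation, not the presenters' exact formula): with $\hat{y}$ the prediction formula, $\widehat{SE}(\hat{y})$ its standard error, and $\hat{\sigma}_e$ the residual standard error, all saved from the Fit Model report,

$$
\text{control limits: } \hat{y} \pm z_{1-\alpha/2}\,\sqrt{\widehat{SE}(\hat{y})^{2} + \hat{\sigma}_e^{2}},
\qquad
\text{prediction limits: } \hat{y} \pm t_{1-\alpha/2,\,\nu}\,\sqrt{\widehat{SE}(\hat{y})^{2} + \hat{\sigma}_e^{2}},
$$

where the t quantile, with its effective degrees of freedom $\nu$, supplies the confidence that the control limits deliberately leave out.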
For tolerance limits, we have a little bit of the same issue as with the control limits. You have tolerance limits in JMP, but only for one mean and one standard deviation; we don't have them for a model. But it's actually pretty easy to convert the classical formula for a tolerance interval into one that takes a model. Just replace the mean with the prediction formula, and replace N minus one with the degrees of freedom for the total variation, which we calculate from the width of the prediction interval. Then you can enter this into the classical formula, and suddenly we also have a tolerance interval in JMP that can handle within- and between-batch variation. Then we are ready to go with the mathematics. Some customers prefer not just to look at whether the limits are inside specification; they need to be deep inside specification, corresponding to a PPK bigger than one. But it's very easy to take the limits and the prediction formula and convert them into a PPK with the classical formula for PPK. If you put in the prediction limits, you get it with confidence, because they carry confidence just like the tolerance limits; if you put in the control limits, you just get a PPK without confidence. So we do both: without confidence and with confidence. Without confidence is the one we recommend to use in stage 2, and the ones with confidence are clearly for stage 3a, where we have to prove it with confidence. So life is easy when you have JMP and a model that can describe your validation data set. But we have to be aware that behind all models there are some assumptions that need to be fulfilled, or at least justified, before you can rely on your model. You need to do a sanity check of your data before you can use the model to calculate limits. As you know, JMP has very good possibilities for this, and I will now go into JMP and show how this works. Here is a very well-known data set in the pharma industry, published by the International Society for Pharmaceutical Engineering (ISPE), where they put out a data set from the real world and said, "This is a typical validation set, please try to calculate on this and see how you would evaluate it as a process validation data set." If you start looking at the data, you can see that we have three batches here, the classical three batches, the old-fashioned way. We are making tablets, and we are blending the powder. When we are blending the powder, we take out samples at 15 different systematic positions in the blender, in all three batches. When we take samples, we take more than one; we actually take four. So we have three batches, 15 locations, and four samples every time. This is a data set from real life that was used for validation. If you look at the data, it's actually fairly easy to see that for batch B, the within-location variation is much bigger than for batches A and C, and if you put a control limit on your standard deviations, you can clearly see that B is higher. You can also do a heterogeneity of variance test.
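Before going further into the ISPE example, here is a sketch of the tolerance-limit and PPK conversions described a moment ago, again in my own notation. The total standard deviation and an effective degrees of freedom $\nu$ are backed out of the prediction interval, then reused with the prediction formula $\hat{y}$ in place of the sample mean:

$$
\hat{\sigma}_{\text{tot}}=\sqrt{\widehat{SE}(\hat{y})^{2}+\hat{\sigma}_e^{2}},
\qquad
\nu:\;\; t_{1-\alpha/2,\,\nu}\,\hat{\sigma}_{\text{tot}}=\tfrac{1}{2}\,(\text{prediction interval width}),
$$

$$
\text{tolerance interval: } \hat{y}\pm k(\nu,\ \text{confidence},\ \text{coverage})\,\hat{\sigma}_{\text{tot}},
\qquad
\text{PPK}=\min\!\left(\frac{USL-\hat{y}}{3\,\hat{\sigma}_{\text{tot}}},\ \frac{\hat{y}-LSL}{3\,\hat{\sigma}_{\text{tot}}}\right),
$$

where, as described, using the prediction limits as the input gives a PPK with confidence and using the control limits gives one without.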
It will also be fairly obvious that B has significantly bigger variance than A and C. So we cannot just pool, because you're only supposed to pool variances that can be assumed to be the same. So what to do here? Well, the great thing about JMP is that you can just do a variance model: you can go in and put in the batches as a variance effect, on top of having factors to predict the mean. Again you can clearly see that batch B has bigger variance than A and C, and we need to correct for that. So how do you correct for that in a model? I would now like to put in random factors, but the variance model in JMP does not support random factors. What I do instead is just go up here and save my variance formula, so I get a column where I have my estimated variances. Then I can just do weighted regression with the inverse variance as the weight, the classical way of weighting when you're doing linear regression. So it's very easy to correct for batch B having higher variance than A and C, just by doing weighted regression. Being aware of that, I can then start looking at whether I need to transform the data. You have Box-Cox transformations in JMP, so it's very easy: just make an ID that describes the combination of batch and location. Then you can look at the variation within each group, pool this across, and you will get a model like this. You can see I weighted it, because I had to weight it to correct for batch B having higher variance than A and C. You can then see that we have no outliers, and you can also look at the Box-Cox: there is no need for transformation. So that's also easily justified working this way. Then I'm ready to make my model. But when I make my model, I will put in my batch as a random factor. I might also put in location times batch as a random factor, because it should be random, since I would like to predict future batches. Then there's actually another assumption, because when you put in random factors, you are assuming the effects are on average zero and normally distributed. How do you verify, or at least justify, this? Well, just go and look at the BLUPs; those are the random-effect estimates. You can make them into a data table and check their distribution. That's what I have here. Here you can see I have saved my batch BLUPs and my interaction BLUPs, and I just checked their distributions. I can see that the batch effects can easily be assumed to be normally distributed, and actually the same for the interaction effects. Now, in a very short time, we have done all the sanity checks we need to be able to justify that we are ready for analysis. So now I have my model, which I'm ready to go with, where I have put in batch and location-by-batch as the random factors and location as the fixed factor. Now I will hand over to Louie, who will show how to go on with the script on this data file. As Per has already shown, JMP already offers a lot of opportunities to deal with many of the problems you're facing when validating, for example, the PPQ batches.
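As a side note before Louie continues: a minimal JSL sketch of the kind of weighted random-effects model Per just set up. The column names (Assay, Location, Batch, Inverse Variance) are hypothetical, and a script saved from the actual Fit Model dialog will include additional Run options, so treat this as an outline rather than the exact specification:

```
// Weighted REML fit: Location as fixed effect, Batch and Batch*Location
// as random effects (marked with & Random in saved scripts), with
// weights = 1 / saved variance formula to account for batch B's larger
// within-location variance.
dt = Current Data Table();
dt << Fit Model(
	Y( :Assay ),
	Effects( :Location, :Batch & Random, :Batch * :Location & Random ),
	Weight( :Inverse Variance ),
	Personality( "Standard Least Squares" ),
	Method( "REML" ),
	Run
);
```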
Some of the things he has shown is that, in fact, most processes cannot be described by just one mean and one standard deviation. For this we really like the Fit Model platform, which allows us to do exactly this by including systematic factors for many means and/or random factors for many variances. Then we often see that the data requires transformation. Here the Fit Model platform already has the Box-Cox transformation, allowing this. Furthermore, data sets almost always contain some kind of outliers. For this, the Fit Model platform also has the studentized residual plots, which are a great way of dealing with outliers. Then, in cases where we see a lack of variance homogeneity, JMP also has the log-variance modelling, which we use to find the variance and convert it into a weight in a regression model. Then, out of the box, JMP of course also has the prediction intervals, the individual confidence intervals, directly from the Fit Model platform. There are also some problems we faced where JMP didn't take us all the way but did the groundwork, and I think this is what we're trying to address with our script, making it much easier. One of the primary reasons we have developed this script is that, at the customers we see, these calculations and visualizations are often done by hand, and whenever humans are involved, there is a high risk of human error as well. This is what we are trying to battle by automating all of these steps, from the analysis to the visualizations. Then, of course, there are the control limits in JMP, where JMP can only handle one mean and one variance component. This is also solved in the script from the Fit Model platform, using the prediction formula instead of the mean. We then use the total variance from the model, calculated from the standard error of the prediction formula and the standard error of the residuals. We also often meet customers who prefer tolerance intervals, due to their separate settings of both confidence and coverage. However, as briefly described, JMP can only handle tolerance intervals for one mean and one variance component. So in the script we also calculate tolerance intervals from the Fit Model platform, using the same number of effective degrees of freedom as used in our prediction intervals. Lastly, the script also includes the calculation of capability values, more specifically the PPK values, as many customers require these. We calculate them both with confidence, using the prediction limits as input to the calculation, and without confidence, using the control limits as input. So, on to the script. We have packaged most of these subsequent steps, after doing the sanity check, into a single script. The script itself takes three inputs: the original data file, the model window, and a template document.
So what our script does is feed information from the data file and the model window into the template document. From the data file we take the metadata and the data itself, and from the model window we take the model parameters and the model results. This is also why it's important to stress that the model used here as an input should be a completely finished model. This means that you have already done all the sanity checks Per just introduced us to, and that you have done your model reduction; the sanity check includes things like outlier exclusion and data transformations as well. One could argue that we could do this in the script as well, obeying some hard-coded statistical rules. But we also think that working with the model is a very useful experience: you get a much better insight into the data, and you also have to apply some process knowledge to get a better model. For the template document itself, we were actually inspired by the app functionality in JMP. We really like scripts where the users can actually interact with the script itself. So the template document is essentially a data table in JMP with no data in it. However, we have defined a lot of columns with predefined formulas referencing each other. This allows the user to go in and backtrack all our calculations through the columns in this template file, leveraging, for example, the Formula Editor in JMP, and potentially add new columns if they want to see new calculations, or edit some of the calculations we are doing in there as well. Furthermore, all the visualizations we provide to the user are also stored as table scripts in the template document, so users can just go in and edit, for example, the layout of specific graphs if they want new colors or something like that. I think we should take a look at how it actually works in reality. We will jump into JMP here, and I will just find the same model as Per just finished for us. We are again looking at the ISPE case as Per showed, and I will find the same model he ended up with. Now that we have our model, we have done the sanity check and everything, and we have our original data file behind it, we simply run the script, and it presents us with three things. First of all, we have the template file, now filled with all the data we need; we see the two graphs showing right here. Then we have the PPK graph, giving us a PPK for each location in the blender. We have a total of 15 up here, and we have the normal PPK based on control limits in blue, based on prediction limits in red, and based on tolerance limits in green. Lastly, we have the graph here. This graph shows us our data along with all the limits we have calculated in the script. And what we see here is in fact that batch A and batch C are performing well for all locations.
What we see here is in fact that batch A and batch C are performing well for all locations, meaning their prediction limits, tolerance limits, and, for that matter, control limits are inside the specifications for all locations. However, if we look at batch B, where we also found the variance to be much larger than in the two other batches, we see that it is in fact going outside specifications for all locations when we look at the prediction limits; not necessarily both specifications, but either the lower or the upper. We see many customers just computing some average limits across A, B, and C, but we don't see that as an option here, because this tells us that if future batches behave like batch A or C, we are okay; we can say that observations of future batches will fall inside specifications. However, if future batches behave like batch B, we cannot guarantee that all observations of future batches will fall inside the specification limits.

I will also use this time to go through an example where we take the whole approach shown in the flowchart into consideration and see how we would actually do a PPQ validation using the script. We have brought a customer example here; let me just find it. This is in fact a customer who has been through PPQ validation and has now produced all the batches. So we will quickly use a local data filter to look at just the first three batches they produced in Stage 2. Here we have our model, and we run the script.

What we see here is that we find ourselves in a situation where both the prediction limits and the tolerance limits are outside the specifications; in fact, they are quite far outside. But the best guess, determined by the control limits, is in fact that we are inside specification. So what this tells us is that we feel safe enough concluding that we have passed Stage 2 and can now go into Stage 3a, which actually enables the customer to put their product on the market. It is important to stress here that even though we go from Stage 2 to Stage 3a, we do not reduce the [inaudible 00:23:22] in-process and product QC we have in production. We are still fairly certain, or really certain actually, that no bad products should enter the market.

So what the customer would do next is simply to produce the next batch, which would be batch four here, include it in the model, and then we run it again. Of course, you have to do the sanity checks of the model again, including the new data. What we see now, including the fourth batch, is that we are in fact in the same situation as before. However, both the tolerance limits and the prediction limits are moving much closer to the control limits, and even more important, they are moving in toward the specifications, which is what we really want to see in this situation. The reason for this is that we actually know the within-batch variation very well. However, with only three or four batches, we do not know the between-batch variation that well, and this is what gives us these broad limits.
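The stage logic walked through above, where control limits inside specification are enough to pass Stage 2 but prediction and tolerance limits must also be inside before Stage 3a is complete and QC can be reduced, can be sketched as a small decision helper. This is only an illustration of the reasoning, with hypothetical names and numbers; the actual flowchart and script may differ.

```python
def inside_spec(lower, upper, lsl, usl):
    """True if the whole interval [lower, upper] lies within [lsl, usl]."""
    return lower >= lsl and upper <= usl

def ppq_stage_decision(control, prediction, tolerance, lsl, usl):
    """Classify the validation state from three sets of limits.

    Each of control, prediction, tolerance is a (lower, upper) tuple.
    """
    if not inside_spec(*control, lsl, usl):
        return "Stage 2 not passed - investigate the process"
    if inside_spec(*prediction, lsl, usl) and inside_spec(*tolerance, lsl, usl):
        return "Stage 3a passed - QC effort can be reduced"
    return ("Stage 2 passed, continue Stage 3a with full in-process QC "
            "and add the next batch to the model")

# Hypothetical limits after three PPQ batches
print(ppq_stage_decision(control=(96, 104), prediction=(88, 112),
                         tolerance=(86, 114), lsl=90, usl=110))
```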
This would be an iterative approach: produce one batch, analyze the results, and continue until you actually find the limits inside specification. Now we just include the two last batches and run it again. What we see now is that we have all our limits inside specification. This tells us that we have now passed Stage 3a, and we can work toward reducing our heavy in-process and product QC by implementing sampling instead of 100% inspection, continuous process verification, and so forth. And this while still having had our products on the market since the first three batches we produced.

With an image like the graph shown here, you could be concerned that your prediction limits are still this broad and that the tolerance limits are very close to spec. So if we go back to the situation where we had three batches, one question could be: are we actually safe with such broad prediction limits? Could we end up sending bad products to the market? Here there are two very important points. First of all, at this stage we still have a very heavy in-process and product QC, which will ensure bad products do not go to the market. Furthermore, we have to remember that we are not trying to say something about the specific batches we produced; we are trying to say something about future batches.

If we want to assess the performance of these individual batches, batch by batch, what we have to do is go back to our model. We go to the model dialog here, and we do not include batch as a random effect; we use it as a fixed effect instead. I'll just apply the local data filter, and if we run the script like this, we will actually see how well we know the individual batches. Here we see, compared to the other graph, which I will just try to put up here, that we now have much, much narrower limits for each batch, telling us that we don't have to fear that any observations within either of these batches are outside specifications. So it is the combination of this and the heavy in-process and product QC which makes us confident that we can let the three batches go to the market, as long as our control limits are still within specifications.

Yes. So just to conclude, I will go here. What we have been looking at is: how can we, in validation, use JMP to predict that future batches will be okay? We have seen that if you can describe your validation data set with a model, you can actually predict the future with confidence. This can be done either by using the prediction intervals or the tolerance intervals. So what we have made is a script that automates the visualization of prediction intervals and also the Ppk values. It calculates and visualizes tolerance intervals using the same number of effective degrees of freedom as used when calculating the prediction intervals. Then it also calculates and visualizes the control limits.
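As an illustration of the tolerance-interval part of this summary, here is a Python sketch of a standard two-sided normal tolerance factor (Howe's approximation), written so that an effective degrees-of-freedom value from the model can be plugged in instead of the usual n − 1. This is a textbook approximation under normality with hypothetical inputs, not the script's exact calculation.

```python
import numpy as np
from scipy import stats

def tolerance_factor(n_effective, df_effective, coverage=0.99, confidence=0.95):
    """Howe's approximation for the two-sided normal tolerance factor k,
    so that mean +/- k * sd covers `coverage` of the population with the
    stated confidence. df_effective replaces the usual n - 1."""
    z = stats.norm.ppf((1 + coverage) / 2)
    chi2 = stats.chi2.ppf(1 - confidence, df_effective)   # lower chi-square quantile
    return np.sqrt(df_effective * (1 + 1 / n_effective) * z**2 / chi2)

# Hypothetical values: effective sample size and df taken from a fitted model
k = tolerance_factor(n_effective=15, df_effective=8.4)
mean, sd_total = 100.0, 2.8
print(f"k = {k:.2f}; tolerance interval: "
      f"[{mean - k * sd_total:.1f}, {mean + k * sd_total:.1f}]")
```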
But it is important to remember that before using the script, you have to go through the sanity checks Per presented. Here JMP already offers a lot of unique possibilities to justify the assumptions behind the model. This includes variance heterogeneity: if the variances are not equal between the batches, as we saw here, we can use the log-variance model to find a weight factor and, through this weight factor, make a weighted regression model instead (a rough illustration of this idea follows at the end). We can check that our residuals are normally distributed, and if they are not, we can use the Box-Cox transformation to transform our data. Through the Fit Model platform we can also evaluate potential outliers with the studentized residual plot; if there are any outliers, we exclude them from the model. Then, at last, we can justify whether or not our random factors are normally distributed through the BLUPs. Again, if we find a level here which does not match, we will, as for the outliers in the studentized residuals, exclude it from the model as well.

Thank you.
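As mentioned above, here is a rough Python sketch (using statsmodels rather than JMP) of the log-variance weighting idea: estimate a per-batch variance from an unweighted fit, turn it into weights, and refit as a weighted regression. The data are simulated and the column names are made up; JMP's log-variance modeling does this more formally inside the Fit Model platform.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated data: three batches with unequal within-batch variances
batches = ["A", "B", "C"]
sds = {"A": 1.0, "B": 3.0, "C": 1.2}
df = pd.DataFrame({
    "batch": np.repeat(batches, 30),
    "y": np.concatenate([100 + rng.normal(0, sds[b], 30) for b in batches]),
})

# Step 1: unweighted fit
ols = smf.ols("y ~ C(batch)", data=df).fit()

# Step 2: model the log of the squared residuals per batch -> estimated variance.
# Only relative weights matter, so the constant bias of log(resid^2) cancels out.
df["log_sq_resid"] = np.log(ols.resid**2)
var_fit = smf.ols("log_sq_resid ~ C(batch)", data=df).fit()
df["var_hat"] = np.exp(var_fit.fittedvalues)

# Step 3: refit as a weighted regression with weights = 1 / estimated variance
wls = smf.wls("y ~ C(batch)", data=df, weights=1.0 / df["var_hat"]).fit()
print(wls.summary().tables[1])
```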