
Automated Extraction of Data From PDF Documents Using the Customized JMP Add-ins (2023-EU-30MP-1334)

Peter Vogel, CSL Behring Innovation GmbH

 

In many instances, relevant data exists, yet often, it is not directly accessible, and either cannot be utilized for data-driven analyses or requires painstaking manual efforts to extract. One classical instance of this type is PDF documents. In this presentation, we will demonstrate an example of standardized PDF reports in the Laboratory Information Management Systems (LIMS) and how the JMP Scripting Language can automate data extraction from these PDF files. The presentation will also show how the resulting scripts can be packaged as an Add-in for distribution to many users.

 

____________________________________________________________________________

 

Explanation of the attached materials:

  • 2023-03-20 Automated extraction of data from PDF documents using the customized JMP Add-ins.pdf
    --> The slide deck in which the numbering of the examples refers to the correspondingly enumerated sections in 
  • 02 Step by step development.jsl
    --> The JSL file that guides you through the step by step process of the JSL code development in order to read the 
  • Freigabedaten_Beispiel.pdf
    --> Exemplary sample data stored in a PDF file
  • 03a Functional code.jsl
    --> Summarizes all code developed in 02
  • 03b Custom_Functions.jsl
    --> Example file to demonstrate how multiple JSL files can be packaged in a JMP add-in
  • 03c Values for add-in creation.txt
    --> The values utilized to define the JMP add-in
  • Example PDF Data Parse.jmp
    --> The JMP add-in created from 03a - 03c.
  • PDF Data Load Example.jmpaddin
    --> The same add-in but with extended functionalities (files selection, progress window, etc.) that were not discussed in the presentation

 

 

Good day, everyone. My name is Peter Vogel. I'm an employee of CSL Behring Innovation, and it's my pleasure today to talk to you about Automated Extraction of Data from PDF Documents using what I call Customized JMP Add-ins. Let me give you a little bit of a high-level overview of what we're going to do today.

First of all, in the introduction, I want to motivate why we should actually want to extract data from PDF documents. Second, in the approach, I want to show you how you can leverage JMP to do so, and what it means to use JMP and to create JMP scripts. Then we want to transfer those JMP scripts into what I would call an add-in, and I want to explain a little bit why add-ins are actually the better way to store, if you like, JMP scripts. Finally, I want to tell you what you can do once you are actually at the level of JMP.

Why should we actually use PDF documents and want to extract data from them? Well, on the right-hand side, you see one example of a PDF document, and you see that it actually contains quite a lot of data. Quite often, this data is unfortunately not really accessible in any other way, be it because of old software systems, proprietary software, or whatever else.

Sometimes, PDF documents (and here you can replace the word PDF with any other document format) are really the only choice. You want to have this data, because otherwise you would need a lot of manual operations on the data, which is annoying and potentially also really demotivating for your team members.

The last point is, if you don't have the data at hand, well, you can't make the decisions you want to. Quite often, data is key to making informed decisions. Without informed decisions, well, that's really a disadvantage in today's business world.

What I want to show you now is how we can use structured data in PDF files, how we can leverage them using JMP, and how we can, based on that, make decisions. Today, I'll only focus on the aspect of how to get the data out of the PDF and how to hand it over to the user. Everything else, how to analyze the data and so on, could then be a topic for another talk at another time.

Before  we  actually  start  really  with  JMP  itself,  let's  talk  a  little  bit  about  what  I  would  call  the  guiding  principle.  The  first  part,  I  believe,  is  really,  first  of  all,  understand  what  you  want  to  do.  If  you  don't  understand  the  topic  itself,  you  can't  really  work  with  it.  In  this  case,  we  know  we  have  any  PDF  document,  or  potentially  also  multiple  PDF  documents,  which  we  want  to  actually  parse.

Then we might need to do some organization of the data, and finally, potentially also some processing; that obviously depends on what is in there and what specifics we have. But in the end, that could be more or less a three-step approach. From there on, you would be ready to do any data analysis you want to do. So really understand your question at hand, and we'll do so in the next slides in a little bit more detail.

The  next  part  is  really  break  it  down  into  modules.  The  more  modules  you  have  and  the  better  they  are  defined,  the  easier  it  is.  Really  make  your  problem  into  smaller  pieces  and  then  you  can  really  tackle  each  piece  on  its  own,  and  it's  much  easier  than  if  you  actually  have  one  big  chunk  of  things  to  do  at  the  same  time.

The third part, I believe, is always use JMP to the best you can, because JMP really can do quite a lot of what I would call heavy lifting for you. We'll see one example, which in this case will be the PDF Wizard, but there are many, many more things that you could use, from analysis platforms like the Distribution platform to other platforms. They can really do a lot for you, and in the end, you just have to scrape the code and that's it. You really get it more or less for free.

The fourth point, I believe: if you define modules, also make sure that they are standardized. Standardized in this sense means they should have defined inputs and outputs, so that if you figure out you want to do one of the modules slightly differently, it still doesn't break the logic of the code, because it still has the same inputs and outputs.

The last part, I hope, should be clear: let's first focus on functionality and then later on make it user-friendly and suitable for any end user. That's also what we will do today. We'll focus more on functionality today and less on the appearance.

Let us now very briefly look into our PDF documents; I'll also share the actual document with you in a second. But first, let's look at this snapshot here on the right-hand side. What do we see? Well, this PDF actually consists of several pieces. The first one is this header, which just holds very general information, of which we might use just some, but potentially all.

Then we get an actual table, this data table here, which has both a table header and some sample information. If we look at that in an actual sample, we can look into this PDF, which we will also share with you. It really looks like that. You see this table continues and continues across multiple pages.

On  the  last  page,  we'll  actually  see  that  there's  again  data  and  then  at  some  stage,  we'll  have  some  legend  down  here.  Potentially,  obviously,  we'll  also  note  that  there  might  not  necessarily  be  data  on  this  page,  but  we  can  rather  just  have  the  legend  here.  Just  as  a  background  information,  we  know  now  how  the  structure  of  this  document  works.

More or less, we can also state that the first page is slightly different, then we have our interior pages, and the last page, as mentioned, can contain sample information, but it does not have to, and it always contains the legend.

If we now get into a little more detail, we'll see that each, let's call it line or entry in this data table, consists of a measurement date and typically also an actual measurement. Those again are separated into multiple pieces.

You  will  have,  for  example,  here  the  assay,  which  in  this  case  is  just  called  end  date,  or  here,  let's  say  the  user  side.  You  might  then  also  have  some  code  or  assay  code,  it  depends.  You  will  have  a  sample  name,  you  will  also  have  a  start  and  end  date  typically.  You  might  have  some  requirements  and  so  on  and  so  forth  until  finally,  you  get  what  we  call  the  reported  result  at  the  end.

Our  idea  would  be  really  how  to  get  that out.  We'll  see  actually,  yes,  this  first  line,  if  you  like,  of  each  entry,  that  actually  holds  different  information  than  the  second  one.  This  third  line  actually  just  holds  here  what  we  call  WG  in  terms  of  the  requirements.  It's  not  really  yet  that  perfectly  structured,  but  we  see  there  is  a  system  behind  this  data,  and  that  really  allows  us  to  really  then  scrape  the  data  to  parse  them,  to  really  utilize  them  to  their  full  extent.

Let's now again break it down into modules, as I said. We can again think about this three-step process, and I believe we could even try to break it down into more steps. The first step, and that is really user dependent, could be that you ask the user, "Please tell me which PDFs to parse." The user tells you, "It's PDF one, two, three," for example.

Then  you  would  actually  say  per  PDF,  I  always  do  exactly  the  same,  because  in  principle,  every  PDF  is  the  same.  One  has  more  pages  than  the  other,  doesn't  matter,  the  logic  always  stays  the  same.  You  would,  first  of  all,  try  to  determine  the  number  of  pages.  This  we  won't  cover  today,  but  in  general,  we  can  think  around  it.

Then  you  might  actually  want  to  read  general  header  information  as  we  know  it,  and  obviously  process  it.  We  might  certainly  want  to  read  the  sample  information  and  process  that.  We  might  want  to  combine  that.  Again,  this  one  we'll  skip  today,  and  we  obviously  want  to  combine  the  information  across  files.

Now  that  means  at  that  stage,  we  would  really  have  all  the  information  available  that  we  want  to.  Finally,  just  what  we  need  to  do  is  we  need  to  actually  tell  the  user,  now  tell  us  where  to  store  it.  Finally,  we  want  to  store  the  result.  Again,  those  two  last  steps  we  won't  cover  today,  but  I  guess  you  can  really  imagine  that  that  is  something  that  is  not  too  complicated  to  be  achieved.
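To make the modular breakdown above a bit more concrete, here is a minimal sketch of the per-file logic and the combination across files. It is written in Python purely for illustration, not in JSL as in the talk, and all names (`process_pdf`, `process_all`, `demo_header`, `demo_samples`) are hypothetical. The real header and sample readers are passed in as functions, so each module keeps a defined input and output.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]   # one record: column name -> text value
Table = List[Row]      # a table is simply a list of records


def process_pdf(path: str,
                read_header: Callable[[str], Row],
                read_samples: Callable[[str], Table]) -> Table:
    """Handle one PDF: read header and samples, then combine them."""
    header = read_header(path)
    samples = read_samples(path)
    # Attach the general header fields to every sample row.
    return [{**header, **sample} for sample in samples]


def process_all(paths: List[str],
                read_header: Callable[[str], Row],
                read_samples: Callable[[str], Table]) -> Table:
    """Apply the same per-PDF logic to every file and stack the results."""
    combined: Table = []
    for path in paths:
        combined.extend(process_pdf(path, read_header, read_samples))
    return combined


# Hypothetical stand-ins for the real readers, just to show the flow.
def demo_header(path: str) -> Row:
    return {"source_file": path}


def demo_samples(path: str) -> Table:
    return [{"assay": "A"}, {"assay": "B"}]


result = process_all(["one.pdf", "two.pdf"], demo_header, demo_samples)
```

Because the readers are plain arguments, one of them can be swapped out, for example for a different page layout, without touching the rest of the flow.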

Now, let's actually jump into JMP itself. What I want to show here is: let JMP do the heavy lifting. In this case in particular, let the PDF Wizard do all the parsing of the data for you, and all you then have to do is change the structure of the data. More or less, you can leverage the PDF Wizard in JMP to its full extent.

At that stage, let's switch very quickly over to JMP itself and see how that works. I've taken here this example, which is just called Freigabedaten_Beispiel.pdf, and we'll see what happens. If you open that, either by double-clicking on it or by going via File and Open or the File Open shortcut, and select the respective PDF file, you can use the PDF Wizard. Now let me make that a little bit larger for you to actually read the data.

We see that from the beginning, JMP already auto-detects some of the data tables in here, but we now want to be really specific and, in this case, only look at the header. Let's ignore that for now and just look at the general header table. We would say in that case, it starts here with the product and ends with the LIMS Product Specification. So we can simply draw a rectangle around it, let go, and you'll see in an instant what happens over here.

You'll see JMP recognizes that this one has two lines. That seems to be about right. It also recognizes that, in principle, I have only two fields. Now, one could argue that this one is a field, this one is a field, and this one is a field. So you might or might not, depending a little bit on how you want to process the data, tell JMP to split the data here. If we don't do so, we need to keep in mind that the second part of the field starts with something like a LIMS log number.

In any case, we now have the data at hand in the format we want and can just say OK; JMP will then open that data table for us. Now, very interestingly, we can directly look into the source script and see that there's actually code. And this code we can really leverage. I'll just copy this code for a second. We can now create a first script. For this, I'll just open a script of my own. I'll very quickly open that for you.

We  can  actually  add  here  the  code.  What  you  should  actually  see  is  that  this  code  that  I've  just  added  is  really  the  same  as  the  code  that  we  have  down  here.  It  has  no  difference  whatsoever.  So  let's  just  use  the  code  as  it  is.  Now,  if  we  look  a  little  bit  closer  at that  code,  we'll  actually  see  that  there  are  a  couple  of  things  we  can  see.

The first one would be that this is just the file name of the file that we used. Instead of having the long file name here, I said down here, okay, let's define that as a variable and just use the variable here. What we also see is that this table name here is the name of the table as it is returned by JMP.

In this case, we would potentially not call it something like that, but rather, in this case, header information. We also see that JMP tells us how it parsed that PDF table. In this case, it says it was page one, and it says it looked for data in this rectangle. Everything else was done automatically.

If  we  execute  this  statement  now,  we  actually  see  it  gets  us  exactly  the  same  data  as  previously,  and  that  is  it.  So  far,  so  good.  That  is  just,  if  you  like,  all  until  now  about  the  reading  of  a  PDF  file.  However,  as  I  said,  we  actually  also  wanted  to  look  at  the  actual  sample  data,  not  only  the  header  data,  but  also  the  sample  data.

Let's now do that once more. Let me enlarge that again a little bit so that we can look at it. Again, you could say, okay, in this case, let's ignore the data. Let's again focus only on one specific part, in this case, the sample data only here on page one. Where does the sample data start? Well, it starts here with the LIMS Proben number. It goes down exactly until here and also out until the scales column, if you look.

We can read that in now as is. What we see directly, both looking over here and looking over here, is that JMP uses two lines as a header, so two rows. That is not what we desire, because only the first line is really the header. Everything else is content. If you right-click on this red triangle, you can adjust that and say, "I don't want to use two rows as a header in this case, but only one."

Now, once you change that, you see, okay, we start with the end date as the first actual value here. That's perfectly fine. The other thing we might spot is that this first column, if you like, actually contains two columns: the one that holds the sample number, and the start date. The reason is that many of those values are too long for JMP to separate the two columns automatically.

We can now tell JMP to enforce the split into two columns by right-clicking at more or less the right vertical position and telling it to add a column divider here, and we directly see that JMP splits that. Unfortunately, we now get a little bit of a mess in this first column, where, for example, the word SOP is split into S and OP, but in return we have a start column.

Here, I would say, let's accept it as it is. Obviously, keep in mind that we always split this field, which is a little bit unfortunate, but it is good enough for now.

Again, if you capture that content, you get a JMP data table, and for that, you can again use the source script to look at the code. If you compare this code to the code I captured here previously, you see it is pretty much the same, potentially except for this field where we set the column divider. That might be shifted a little bit, but the remainder is exactly the same.

You can read here how that works. You see that you have one header row, you see that it's page one. You again have defined a rectangle for where you want to read, and here you have also defined column borders, more or less as we want them.

Again, as previously, you could say, let's factor out this file name, and let's also factor out this table name, or replace them. That is what we now call our content table. If I close that and we run this code once more, you see that it creates our JMP data table as we want it. Getting a first shot at your data seems perfectly fine and is not too complicated, I would argue.

Now, how do we go from here? We have the data in principle, but obviously, we need to organize it a little bit. For this, we can use a number of features, depending a little bit on what we want to do. For some things we can use the log, which records more or less all your actions in the JMP graphical user interface. From there, you can really scrape code. That is something we'll see in one instance here.

In  addition,  you  could  also  use  the  scripting  index,  which  I  highly  recommend,  which  really  holds  quite  a  number  of  functions  and  examples.  And  so  really  helps  you  to  actually  also  use  them.  We  can  use  the  formula  editor,  I  believe,  and  we  can  also  use  the  copy  table  script,  for  example,  to  really  get  things  going.

Now,  let's  demonstrate  that  again  at  our  JMP  data  table.  In  this  data  table,  we'll  actually  see  that  we  have  a  number  of  things  in  here.  For  example,  we  want  to  now  actually  get  that  organized  in  meaningful  form.  First  of  all,  let's  define  how  that  format  should  look  like.  Let's  open  a  new  JMP  data  table,  which  will  be,  if  you  like,  our  target.  Into  this  data  table,  we  want  to  write,  and  let's  define  what  should  be  it.

We  could,  for  example,  say  the  first  thing  we  want  to  do  is  that  we  have  here  the  assay,  for  example.  We  then  potentially  would  also  want  to  have  an  assay  or  just  assay  code,  it depends  on  what  you  want  to  call  it.  We  might  want  to  have  here  the  sample  name  because  obviously,  that  is  now  this  field  that  should  be  captured  as  well  because  that  is  highly  relevant.

You might also want to include a start date or an end date, and so on and so forth, until you have included all of those fields as you want. Now, I would at that stage also say they should all be character for now, because the data over here is also character. So we should set that attribute here as well and standardize it by selecting the data type and saying, yes, that should be character at that stage.

Now  we  have  that  data  table,  but  obviously,  this  doesn't  help  us  so  much  because  that  is  not  reproducible  by  now.  However,  there  is  the  option  to  really  record  that.  For  example,  you  could  say,  copy  the  table  script  without  data,  and  I'll  do  so  for  a  second,  and  I  would  now  insert  that  script  here  as  well.  If  we  look  at  that,  we'll  see  that  we  actually  created  a  new  data  table  which  has  the  name  Untitled  4,  and  obviously,  we  can  change  that.

It  has  so  far  zero  rows  and  it  has  all  the  different  columns  that  we  just  created  from  assay  to  start.  We  could  give  it  a  name  and  I've  actually  created  here  a  data  table  that  has  just  the  name  data  for  page  one  that  holds  those  first  four  attributes,  as  well  as  all  the  others  that  we  actually  want  to  have.  Let's  actually  leverage  that  and  continue  with  this  one  as  this  one  was  really  just  a  demonstration.

Let's create that one. Let's run it, and you'll see that's just a data table as it should be, with all the fields that we want to fill from now on. What we also want to do for now is rename this data table, which is just called something like that, and call it content, and we want to abbreviate this LIMS Proben minus number to LIMS Probe for simplicity.

Now, what do we actually want to do? We want to work with the data a little bit, and I want to illustrate two examples of how we could do so. Let's look first at this column Anforderung. Within this one, you see that there is the AG and also the WG, and we might want to split that into two separate columns, so that later on we capture the AG values in one column and the WG values in another, and so that the sample information is not split across three rows, as it is here, but rather follows what I would call a tidy or FAIR data format, in one row.

How could we do so? Let's, in this case, just insert a column and call it AG requirement, just to more or less translate the word Anforderung into English. Now, what would we want to see? If there is an AG in here, then let's capture the value after the AG in this column. If there's nothing there, then let's capture nothing. And if there's WG, then let's also not capture anything, because that does not relate to AG.

How  could  we  do  so?  Well,  I  would  say  let's  build  a  formula.  Formula  typically  is  really  the  best  place  to  start.  What  do  we  want  to  do?  As  I  said,  we  want  to  do  something  conditional,  which  means  if  there's  an  AG  in  there,  we  want  to  see  something  in  there.  If  there's  no  AG  in  there,  then  let's  not  do  so.  The  easiest  way  to  do  so,  I  would  say,  is  the  if  condition,  which  really  tells  you  if  there's  something,  then  do  something,  and  if  there's  nothing  in  it,  then  do  something  else.

We would say here If and Contains; Contains really looks for a substring, if you like. We would look in this column, which is called Anforderung, for the word AG, and we say that if it is there, something should happen, and if not, something else should happen. Now, we've just created a very simple If statement, and those two branches we would still have to specify.

However, even at that stage, we can already check whether what we described makes sense. Whenever there's an AG, like here or here, in our column Anforderung, we see the then statement, which is good. Otherwise, we see just the else statement, which is also good. So let's modify that a little bit.

What would we want to see in the then statement? Ideally, I would say we want to see what is in the Anforderung column, but getting rid of this AG part and just keeping the remainder. To do so, you have many options. One of them is the so-called Regex, or regular expression, which says: take what is in this column, look for, in this case, the AG part, replace it by nothing, and then give me back the remainder.

You would see, if we do so and look at the whole expression, that if there is AG with a minus, we get a minus in return. If there is a smaller than or equal to 50 minutes, we get the smaller than or equal to 50 minutes. That sounds good. For the else statement, as I said, let's just put an empty string, so nothing is returned. And that actually works.
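The conditional logic described here (if the cell contains AG, strip the AG part and return the rest, otherwise return nothing) can be sketched as follows. This is a Python illustration of the idea, not the JSL column formula built in the talk. The function name `ag_requirement` and the exact cell contents, such as "AG" followed by a space, are assumptions.

```python
import re


def ag_requirement(anforderung: str) -> str:
    """Return the text after 'AG' if present, otherwise an empty string."""
    if "AG" in anforderung:
        # Replace the first 'AG' and any whitespace after it by nothing.
        return re.sub(r"AG\s*", "", anforderung, count=1)
    # No AG in the cell (empty cell or a WG entry): capture nothing.
    return ""
```

Applied to a cell such as "AG -", this returns "-"; a WG cell or an empty cell returns an empty string, which matches the behavior checked in the formula preview.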

You see, if we go to this column, only wherever you have this AG does it return the value after the AG. That looks perfect. Now, I would use this logic and include it in my script. We could also again capture the code from the data table, and we would see it down to the formula. But in principle, we could also capture.

Before  we  do  so,  I  have  inserted  here  a  little  bit  of  additional  information,  which  means  in  case  we  would  actually  read  the  last  page,  we  saw  that  there  was  the  legend.  And  in  this  case  we  said,  let's  remove  the  legend  and  it  should  be  good.  In  addition,  I  also  said  if  there  should  be  any  completely  empty  rows,  I  would  want  to  remove  them.

Now to continue, I would say, let's look for where the samples are, and then let's capture the data of each sample. In this case, we look into where the samples are, and, let me very quickly execute this part, we see, okay, that is where each sample starts.

It  looks  actually,  in this  case,  only  for  where  more  or  less  this  value  of  end  is  missing.  Similarly,  where  the  Anforderung  is  missing  because  those  are  the  two  columns  that  define  where  actually  only  the  sample  resides  if  we  have  to  move  up  the  column.
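As a rough illustration of this step, the following Python sketch (not the JSL from the talk) finds the rows in which both marker columns are empty and treats them as the start of a sample. The column key `end`, the helper names, and the demo rows are all hypothetical; only the idea of checking two columns for missing values comes from the description above.

```python
from typing import Dict, List, Optional, Sequence


def is_missing(value: Optional[str]) -> bool:
    """Treat None, empty, and whitespace-only cells as missing."""
    return value is None or value.strip() == ""


def find_sample_starts(rows: List[Dict[str, str]],
                       marker_columns: Sequence[str] = ("end", "Anforderung")
                       ) -> List[int]:
    """Return the 0-based indices of rows where all marker columns are empty."""
    return [i for i, row in enumerate(rows)
            if all(is_missing(row.get(col)) for col in marker_columns)]


# Hypothetical content table: three rows per sample entry.
demo_rows = [
    {"name": "Sample 1", "end": "", "Anforderung": ""},
    {"name": "", "end": "2023-01-02", "Anforderung": "AG -"},
    {"name": "", "end": "", "Anforderung": "WG -"},
    {"name": "Sample 2", "end": "", "Anforderung": ""},
    {"name": "", "end": "2023-01-03", "Anforderung": "AG -"},
    {"name": "", "end": "", "Anforderung": "WG -"},
]
starts = find_sample_starts(demo_rows)  # [0, 3]
```

Note that Python counts rows from 0 while JMP counts from 1, so index 3 here corresponds to row 4 in a JMP data table. Checking both columns matters in this made-up data because the WG row also has an empty end value.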

Now, iterating across each sample on its own, we would look at where the data is. Taking, for example, this Losezeit sample here as the second sample, we first look at where it starts. It would start in this case at row 4. We would now combine the data of those two fields to get, again, a full name.

We  would  look  actually  at  where  does  the  assay  sit.  The  assay  is,  if  you  like,  just  in  verbal  names,  it  would  be  actually  the  first  part  of  this  whole  string,  if  you  like,  just  before  the  forward  slash.  You  could  really  just  capture  that,  potentially  also  removing  the  one  because  that  doesn't  make  sense.  Similarly,  you  could  look  into  the  code,  which  would  be  really  the  second  part  here,  which  you  could  get  from  there,  and  so  on,  and  so  forth.
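A small Python sketch of this extraction, again as an illustration rather than the JSL used in the talk: the two fields that the column divider separated are joined back together, and the result is split at the forward slash. The function name and the example strings in the usage note are hypothetical, and the sketch assumes the divider cut the text without dropping characters, so plain concatenation restores it.

```python
from typing import Tuple


def split_assay_and_code(left: str, right: str) -> Tuple[str, str]:
    """Rejoin the two split fields, then take the parts around the slashes."""
    full_name = left + right
    parts = [part.strip() for part in full_name.split("/")]
    assay = parts[0]
    # The code is the second slash-separated part, if there is one.
    code = parts[1] if len(parts) > 1 else ""
    return assay, code
```

For example, `split_assay_and_code("Lös", "ezeit / LZ-01")` gives `("Lösezeit", "LZ-01")`. Any further cleanup of the assay name would be an additional step after this.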

Now, obviously, I agree, this part of the code doesn't look too simple, but if you read it very carefully, it always has more or less the same structure. You look at the part of the data that is in the respective line and the respective field, and potentially do a little bit of tweaking, just as we did with the AG column. If you look at this AG column, you'll see there's again our regular expression, there is the AG part that we replace by nothing, and that's more or less it.

If you have done so, you would want to create one additional row here where you can enter all the data that we have captured. How would we do that? We would click on the Rows menu, say Add Rows, and enter the number there.

Now, interestingly enough, at that stage, you could also look into the log and see that there's one statement that says Add Rows, and you could just copy this part about Add Rows. This is more or less the same as I did here. You see there's also, in addition, this At End. Typically, that's the default value, so it doesn't matter if I have it or not.

From there on, once I have included that, I just copy all those values that I had previously here, everything that starts with a C, into the respective column. In principle, if I am correct, it should execute all at once, and it should now work as is. We see that the second row was the one that was correctly added. Or, if I delete the rows for a second and run it again, that should execute as is.
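As a rough Python analogue of "add a row at the end, then copy the captured values into their columns", consider the sketch below. The column names and the `c_`-prefixed values are assumptions chosen to mirror the talk. In JSL this is done on a JMP data table rather than a list of dictionaries.

```python
COLUMNS = ["Assay", "Code", "AG"]


def add_row(table, values):
    """Append one row at the end of the table and fill it from `values`."""
    row = {name: values.get(name, "") for name in COLUMNS}
    table.append(row)
    return row


table = []
# Hypothetical captured values, named with a leading "c" as in the talk.
c_assay, c_code, c_ag = "Losezeit", "LZ-01", "12.5"
add_row(table, {"Assay": c_assay, "Code": c_code, "AG": c_ag})
```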

We could do so line by line, and we'll see, if we do that across all the samples, that it should work very well. Now, let's return at that stage to the presentation and look at how we continue from there. At that stage we have captured all the sample information, but we want to make the code a little bit handier. So far it's a rather massive block of code, but we can certainly break it down a little bit better.

That is what we would do now: let's make functions out of it. Functions have the nice feature that they tell you what to provide as input and what to expect as output. That means you have a standardization of inputs and outputs anyway. In my eyes, it's also way easier to debug and to maintain. You have no need for any copy-pasting operation. In my eyes, it also really enforces good documentation of code.

Let's do so. What could we do now? As we have seen previously, when we read our data, we used this Open statement and that was it. However, here we could now also say, let's define a function which just takes a file name, reads the data, and returns the data. In principle, it's not too different from what we did. We just say it's a function which takes one argument, in this case the file name (it could also be several), and which returns something. If we execute that, we'll see that it created exactly the data table that we initially brought in.

Similarly, you could write a function that transforms the data by creating a new data structure and then organizing the data as we want it. If we also initialize that function and run it, we would see that it should work as is. This corresponds exactly to what we did previously. So it really means you have, if you like, only two functions to call, which I believe is a really good way of organizing your code.
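The two-function layout (one function reads, one transforms) can be sketched as below. This is a Python illustration under stated assumptions: a plain text file stands in for the PDF import, and the function names and line format are hypothetical rather than taken from the attached JSL.

```python
import tempfile
from pathlib import Path


def read_sample_data(file_name):
    """Read the raw report and return its lines (stand-in for the PDF import)."""
    return Path(file_name).read_text(encoding="utf-8").splitlines()


def transform_sample_data(raw_lines):
    """Turn raw 'assay / code' lines into one dictionary per sample."""
    rows = []
    for line in raw_lines:
        if "/" not in line:
            continue  # skip headers and other lines without sample data
        assay, _, code = line.partition("/")
        rows.append({"Assay": assay.strip(), "Code": code.strip()})
    return rows


# Demo with a hypothetical two-sample report written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"
    report.write_text("Report header\nAssay A / A-01\nAssay B / B-02\n", encoding="utf-8")
    rows = transform_sample_data(read_sample_data(report))
```

With this split, the calling code shrinks to a single line that chains the two functions, and each function can be debugged on its own.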

Now, let's also think about the last part. The last part in my eyes is about UX, or user experience: how should I present it to the user? I believe you can certainly play around with which data tables are visible at which stage. You see here a really short snippet showing that you could create a data table as invisible from the beginning, or you could hide it after it has been created.

Or, if I store data, I could provide users a link to the directory directly, which means they don't have to look for that file but can just click on the link and the directory opens. Or you could inform the user about the progress of your execution, so he or she knows, "Oh, I'm still at file 1, but already at page 8 out of 12."
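The progress reporting idea can be sketched independently of any window toolkit. In this Python sketch the `report` callback is an assumption standing in for whatever updates the progress window, and the page counts are made up.

```python
def process_files(page_counts, report=print):
    """Walk over files and pages, reporting progress after each page."""
    total = len(page_counts)
    for file_no, pages in enumerate(page_counts, start=1):
        for page_no in range(1, pages + 1):
            # The real page parsing would happen here.
            report(f"File {file_no} of {total}, page {page_no} of {pages}")


# Collect the messages in a list instead of printing them.
messages = []
process_files([12, 3], report=messages.append)
```

Passing the reporter in as a parameter keeps the parsing loop unchanged whether progress goes to a window, the log, or nowhere.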

There's a number of options here, but as I mentioned, I would tackle that only once I've implemented the whole code. What we can state at that stage is that, yes, we now have more or less all the code in place to run this collection of data from our PDF files into a data table.

However, there is one issue, and that is the issue of bringing it to the user. The point is that I have one big JMP script file, potentially with quite a lot of lines, and the user in principle has to interact with it, at least to some degree. That is something I typically want to avoid because it is not something users want to do, and I would also be a little bit scared that they might break the code.

Instead, I would turn to a JMP add-in, which has the nice feature of being only one file that just requires a one-click installation. The other part is that it's easily integrated into the JMP graphical user interface, so you don't have to interact with the script. You also have a lot of information at your fingertips on how you can create an add-in.

There is, for example, the Add-In Manager, for which I've added the link here, but there's also the option to do so in a manual or script-based way. I believe that while it takes a little more effort, it's much better in terms of understanding. I want to show you very quickly how that works.

For that, in the folder where I've stored all the JSL code so far, I've created the functional code, which holds all the code that we've created, just in a slightly more organized form. You might recognize, again, this read sample data page or this transform sample data. Plus, I've added an additional file which just holds an example of additional code.

You could imagine that you potentially want to outsource the functions from the functional code to the custom functions file, for example to make the code more readable, and so on and so forth. Now, from those two files, I want to create a JMP add-in, simply by going to File and New.

There, you have the option to create the add-in. You would now have to specify a name and an ID. I've thought about these beforehand, so I will not care too much about what they are called here, but please do look at the naming suggestions for JMP add-ins. Then you would look into which menu items you want to have. You would add a command, give it a name, let's say in this case Launch PDF Reader, and specify whether you want to add the JSL code here directly or whether you have it in a file.

In this case, I would say, let's use the file as we created it. It should be in here, and you would include that one. Similarly, you can see that there are a number of additional options like startup or exit scripts. At the end, you have to include any additional file you want to have as part of it. In this case, let's just assume it would be our custom functions code. In the end, you can save that as, say, our Example PDF Data Parse add-in.

Once that is stored, you can simply install it by double-clicking on that file, and you would see that under Add-Ins you now have a Launch PDF Reader, which in this case would just read this one specific PDF. So it's still quite fixed. There's quite a lot that we could make more dynamic, for example the file selection, as I mentioned at the beginning. But that's at least one way you could read the data.

Now, let's return very quickly to what we could do in addition. We could also have a short look into the JMP add-in itself. A JMP add-in, and that is very nice about it, really contains more or less everything it needs. Let's look at our Example PDF Data Parse and we'll see where it was installed.

In addition, if you look into that, you will see it holds all the JSL code that we have, plus two additional files: one defines what the add-in is named and what its ID is, and the other defines the integration into the graphical interface. If you read those two files a little bit carefully, you will see how you can easily adapt them to your purposes if needed.
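Because a `.jmpaddin` file is a zip archive, this layout can be inspected with ordinary tools. The Python sketch below builds a small in-memory archive and reads it back. The names `addin.def` and `addin.jmpcust` follow the usual add-in layout, while the ID, the add-in name, and the script file names are hypothetical placeholders, not the contents of the attached add-in.

```python
import io
import zipfile

# Hypothetical add-in contents; only the two definition file names are standard.
files = {
    "addin.def": "id=com.example.pdfdataparse\nname=Example PDF Data Parse\n",
    "addin.jmpcust": "<!-- menu integration, normally generated by JMP -->\n",
    "functional_code.jsl": "// main script\n",
    "custom_functions.jsl": "// helper functions\n",
}

# Write the archive into memory instead of onto disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name, text in files.items():
        archive.writestr(name, text)

# Read it back: list the files and parse the key=value definition file.
with zipfile.ZipFile(buffer) as archive:
    names = sorted(archive.namelist())
    definition = dict(
        line.split("=", 1)
        for line in archive.read("addin.def").decode("utf-8").splitlines()
    )
```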

The last part I want to show here is what you could also do if you had it fully functional. We'll install what I would call the final add-in: an add-in that in addition has a few user-friendly tools. You can see I have added that now under here, this GDC menu.

I would have a few more buttons to click than previously. You could say, "Oh, what do I want to read?" In this case, I would want to read those seven files. As mentioned, they are all copies of each other, just to have examples here. We see that there is also a progress window here which, for demo purposes, waits for two seconds after each file. It reads each file, and we see that the speed of the reading is quite impressive, I believe.

You see there's data being processed in the background. The user in principle sees that, but not in the foreground, so the user is not annoyed by it. Only once the data are processed do we get a final result here, and we see that this is the whole data table. It holds data from the first file until the last file, so up to file number six, and that would be more or less the way.

Now, as I mentioned, that is already, I believe, quite a lot to do. So we could still ask, what is next? Is there any next step? I would argue, yes, there is. The first one in my eyes is really: celebrate. Getting to this stage is not a trivial task, and it is a true achievement. Be happy about it and congratulate yourself, because it really is an achievement.

The second part is that, in principle, you might want to do a little bit more around it. You might want to think about code versioning. How do you go back a version or ahead a version, for example if you have developed something further or are looking into a feature which doesn't work anymore? For things like that, code versioning, I believe, is quite helpful.

Similarly,  if  you  think  about  collaborative  development,  Git  might  be  an  answer  there.  If  you  think  about  unit  testing,  so  how  to  really  ensure  that  even  though  you  have  once  tested  your  code  and  you  have  now  changed  it  a  little  bit,  it  still  works,  then  unit  testing  might  be  the  answer.  If  you  want  to  deploy  more  or  less  add-ins  to  a  larger  user  base,  you  still  have  to  think  a  little  bit  around  how  that  works.  There  is  so  far,  I  believe,  no  really  good  solution  on  the  market.
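To make the unit testing idea concrete, here is a minimal Python sketch using the standard `unittest` module. The function under test and its expected behavior are hypothetical. For JSL itself, a JSL-based testing approach would be needed instead.

```python
import unittest


def split_assay_and_code(full_name):
    """Hypothetical parsing helper: split 'assay / code' at the first slash."""
    assay, _, code = full_name.partition("/")
    return assay.strip(), code.strip()


class SplitAssayAndCodeTest(unittest.TestCase):
    def test_splits_at_forward_slash(self):
        self.assertEqual(split_assay_and_code("Assay A / A-01"), ("Assay A", "A-01"))

    def test_missing_slash_gives_empty_code(self):
        self.assertEqual(split_assay_and_code("Assay A"), ("Assay A", ""))


# Run the tests programmatically so the result can be inspected afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SplitAssayAndCodeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Rerunning such a suite after every change shows quickly whether a small edit broke parsing that used to work.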

The  other  part  is,  obviously,  I  would  love  to  hear  feedback  and  any  questions.  You  can  reach  me  under  this  email  address  and  I'm  happy  to  hear  more  or  less  any  suggestions,  criticism,  whatever  it  is,  please  feel  free  to  reach  out  and  I  hope  you  could  learn  a  bit  today.  I'm  really  happy  to  share  with  you  the  script,  the  code,  the  presentation,  everything  that  I  showed  you  in  the  last  30-ish  minutes.  Thank  you  very  much  and  have  a  wonderful  afternoon.
