
Regex: A Powerful Text Analysis Tool (2023-EU-30MP-1218)

Extracting pertinent information from unstructured text data can pose a daunting challenge. Someone may wish to mine blocks of text for websites, telephone numbers, emails, or physical addresses. It could be that units of measurement between documents need standardizing. The Regex function, quietly incorporated into JMP a few releases ago, is an extremely powerful tool to quickly and easily perform these and other tasks. It is also a tool that, for many, is shrouded in mystery. This presentation seeks to highlight this often overlooked and underrated function and decode its inner workings to allow anyone and everyone to tap into its full potential.

 

 

Hi,  welcome.  You've  found  our  talk  on  Regex.  It's  a  powerful  text  analytics  tool  that  Hadley  and  I  are  going  to  explore  the  basics  of  today  in  our  talk.

Yes, thank you very much for clicking on this link and watching this presentation. What is Regex? Well, Regex is a function that searches for a pattern within a source string and returns a string. That definition was taken from the Regex function entry in the JMP scripting guide. I'm not sure that definition quite does it justice.

Before  we  go  into  some  details  about  how  you  can  use  it,  what  the  power  and  value  of  it  is,  what  I'd  like  to  show  you  here  is  just  the  format  of  the  function.  It  takes  in  a  source,  a  pattern,  and  then  if  you  like  a  replacement  string,  it  has  other  functionality  as  well.  But  for  the  purpose  of  this  presentation,  we  are  going  to  be  talking  about  these  first  three  inputs  to  the  function.

Before we dive too deeply into it and show you some examples, I'd just like to talk a little bit about how to set up a pattern in Regex, and specifically about the concept of escape characters. These are characters that can mean many things.

For example, a w can mean a literal w. With a backslash in front, \w, it can also mean any lowercase or uppercase letter, A through Z, as well as the digits zero through nine and the underscore.

How you would refer to that is simply by typing \w. If you wanted to refer to a literal w, you would just write the letter w. Digits can be expressed in their actual form, or they can be expressed generally as \d, and then \s refers to a single white space character, including tab, return, new line, vertical tab, and something called form feed.

Probably some of you watching know what that means. I'd like to mention some special characters now: the period, the question mark, the asterisk, and the plus refer to matches of different characters. The period refers to any single character.

Question  mark  matches  zero  or  one  instance  of  whatever  is  put  in  front  of  it.  The  asterisk  matches  zero  or  more  and  then  the  plus  matches  one  or  more.  Now  there's  some  other  characters  as  well.  I  won't  go  through  all  of  these  and  there  are  many  more  that  I  haven't  captured,  but  I  thought  to  put  them  in  this  table  and  save  them  here  so  that  if  you  like,  you  can  pause  this  and  you  can  see  exactly  what  these  are.
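The escape characters and quantifiers described above can be tried out in a few lines. This is a minimal sketch in Python's `re` module rather than JMP's Regex function (an assumption made so the example is runnable outside JMP); the sample strings are invented for illustration.

```python
import re

sample = "Unit_7 ran 12 tests"  # hypothetical text, not from the talk

# \w matches one letter, digit, or underscore; + means "one or more".
words = re.findall(r"\w+", sample)

# \d matches one digit, so \d+ matches a run of digits.
numbers = re.findall(r"\d+", sample)

# \s matches one whitespace character (space, tab, newline, ...).
spaces = re.findall(r"\s", sample)

# ? means "zero or one" of the item in front of it.
optional_s = re.findall(r"tests?", "1 test, 2 tests")

# * means "zero or more" of the item in front of it.
zero_or_more = re.findall(r"ab*", "a ab abb")

# . matches any single character.
any_char = re.findall(r"t.st", "test tast t_st")

print(words, numbers, spaces, optional_s, zero_or_more, any_char)
```

The patterns are written as raw strings (`r"..."`) so that Python passes the backslashes through to the regex engine unchanged.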

Let's look at an example. Let's say that you wanted to extract all email addresses from blocks of text, free text with many email addresses in all different formats. How would you do that? Well, let's look at our source, which would be, for example, "For help contact support@jmp.com. It's free with your license of JMP."

If we wanted to look through this and extract the email address, we'd have to refer to it as a pattern. That pattern is one or more instances of any letter, number, or underscore, which we can refer to as \w+, followed by an at sign, followed by one or more instances of \w again, followed by a literal period, indicated by \., and then the letters c-o-m.

If  we  were  to  set  that  up  in  a  Regex  function,  the  return  result  would  be  the  email  address  support@jmp.com.  That  would  be  the  pattern  that  matches.  Now  some  of  you  watching  this,  I  know  what  you're  thinking.  Not  all  email  addresses  follow  this  format.  Some  of  them  have  other  characters  in  them,  some  of  them  have  multiple  periods,  some  of  them  perhaps  don't  end  in  com,  they  end  in  something  else.
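As a rough illustration of the pattern just described, here is the same match written with Python's `re` module (a stand-in for JMP's Regex function, which the talk actually uses). The source sentence is the one quoted in the talk.

```python
import re

source = "For help contact support@jmp.com. It's free with your license of JMP."

# One or more word characters, an at sign, one or more word characters,
# a literal period (escaped as \.), then the letters "com".
pattern = r"\w+@\w+\.com"

match = re.search(pattern, source)
email = match.group(0) if match else None

print(email)  # support@jmp.com
```

`re.search` returns the first place in the source where the whole pattern fits, or `None` when nothing matches.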

That's all very true, and this pattern isn't going to match those. What you could do is take this pattern and generalize it in different ways so that it matches more of the email addresses you're looking for. We'll show you an example of how you can do that and what that process looks like.
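One possible way to generalize the email pattern is sketched below in Python's `re` module. The broader pattern and the sample addresses are assumptions chosen for illustration; they are not from the talk, and real-world email matching has many more edge cases than this handles.

```python
import re

# Hypothetical text with addresses the simple \w+@\w+\.com pattern would miss.
text = "Write to jane.doe@mail.example.org or sales@example.co.uk."

# Local part: word characters plus '.', '+', and '-'.
# Domain: one label, then one or more ".label" pieces, so endings other
# than ".com" and domains with several periods are accepted.
general = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(general, text)
print(emails)
```

The `(?:...)` group is non-capturing, so `re.findall` returns whole addresses rather than just the last domain piece.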

The examples we're going to look at start with automated machine messages reporting errors in different parts of a system. What we want to do is extract the components that are broken from all of these messages. I'll show you how to do that. Then we're going to take phone numbers that have been entered manually in all different crazy formats and put them in a uniform format, and finally we're going to extract info from coded text.

In this case, these are file names that contain information about how different biological samples were run: the temperatures, the stress tests, the times, and so on. All of this is coded in the name of the file. We're going to pull out all those pieces and then organize them in a table so that we can work with them.

Now probably you've all clued into the fact that Peter and I are not Regex experts. I think that the word novice is probably a better description of our competency in Regex. The purpose of this talk really isn't to show off our Regex prowess and how great we are at using Regex so that everybody should be impressed.

No, the purpose of this talk is to demonstrate how powerful Regex can be, even for novices. Even with a very little bit of knowledge about how Regex works and how patterns work, you can get a lot of use and a lot of functionality. Now, Regex can be intimidating, but it needn't be, because at its core it really is very simple.

We're  going  to  take  you  through  some  examples  and  show  you  exactly  how  simple  it  is  and  how  you  can  start  using  it  right  away.  Without  further  ado,  I  will  turn  things  over  to  Pete.

All  right.  Thanks,  Hadley.  Go  ahead  and  get  started  here  with  the  first  example.  Like  Hadley  said,  this  is  an  example  where  we're  trying  to  extract  out  of  a  description  here  what  part  was  actually  broken.  There's  probably  many  different  ways  you  could  get  at  this,  but  we're  going  to  show  you  how  to  do  this  with  Regex.

I'm  going  to  create  a  new  column,  generate  a  formula  here,  and  I'm  going  to  look  for  Regex  in  the  filter,  find  it  there,  and  then  start  with  my  description.  That's  what  I  want  to  run  the  Regex  on.  Then  I'm  going  to  define  a  pattern.

If you remember what Hadley shared there, there are a couple of little tricks to remember with Regex that will make it a lot easier. The first thing I'm going to do is put in a \w, which is a single character, but I want this to be more than one character, so I'm going to do a \w and a plus. Then after that \w and plus, I'm looking for something that has a space and says the word broken. As long as I type that out right, you'll see here that my formula result is there.

If I hit apply here, you can see that it tells me what is broken, but it also contains that word broken. Maybe I don't want that. Maybe I just want what the part is, not the word broken in there. If I want to do that, I can go in here and containerize this, wrapping it in parentheses, to make it the first word of the list here.

Then  I'm  going  to  just  say,  hey,  I  only  want  that  first  word.  If  we  look  at  the  preview  here,  it's  just  giving  me  that.  Now  if  I  hit  apply  and  okay,  I've  extracted  out  what  I  was  looking  for.  Now,  this  is  a  simple  example  and  you  could  probably  think  of  other  ways  to  be  able  to  get  that  specific  part  of  this  description  out,  but  I  wanted  to  show  you  how  you  could  do  that  with  Regex  and  really  just  a  very  simple  start  to  this.
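The two steps of this first example (match the word before "broken", then keep only the containerized word) can be sketched in Python's `re` module as follows. The descriptions are invented for illustration, since the talk's data table is not reproduced in the transcript.

```python
import re

# Hypothetical machine messages standing in for the Description column.
descriptions = [
    "Alert: pump broken on line 3",
    "The valve broken during startup",
    "Operator reports sensor broken",
]

# Step 1: one or more word characters, a space, then the word "broken".
with_broken = [re.search(r"\w+ broken", d).group(0) for d in descriptions]

# Step 2: wrap \w+ in parentheses so it becomes capture group 1,
# then ask for group 1 only, which drops the word "broken".
parts = [re.search(r"(\w+) broken", d).group(1) for d in descriptions]

print(with_broken)
print(parts)
```

In JMP the second step appears to correspond to passing a replacement string that refers to the first container; in Python the same idea is expressed with `group(1)`.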

Let's look at a little bit more complex example. Here we have phone numbers that were entered inconsistently, with different spacing and different delimiters in there. Sometimes there's a one, sometimes there's not. Sometimes there are extensions, sometimes there are not. We want to format that in a different way and end up with a cleaner format. Here's the end result.

Unlike  the  last  example,  I  think  this  one  is  a  lot  more  difficult  to  do  without  Regex.  Let's  walk  through  how  we  can  do  this  with  Regex.  Very  similar.  We're  going  to  start,  I'm  going  to  type  in  Regex  here  and  I'm  going  to  move  this  down  so  you  guys  can  see  as  we're  building  this  Regex,  the  results  pop  up  there.

I'm going to put the phone numbers in as my source, my original data, and then I'm going to start with the pattern. If we remember again from what Hadley said, we're looking for digits this time. Our pattern is digit, digit, digit, then something. We don't know what, but we'll put in that question mark because it could be many different things, and then we have digit, digit, digit.

Let  me  pop  this  open  a  little  so  we  can  see  it.  Then  again,  we  have  a  question  mark  because  we  don't  know  what  that  delimiter  is  in  there.  Then  we  have  four  digits.  Okay,  all  right.  If  we  look  at  a  preview,  you  can  see  it  catches  some  of  these.  I'm  going  to  just  hit  apply  and  now  you  can  see  some  of  these  numbers  were  captured  here,  but  some  were  not.  Then  our  output  formula  isn't  what  we're  after.

Let's  go  back  and  open  this  up  and  we're  going  to  containerize  those  like  we  did  in  the  previous  example.  We're  going  to  look  at  three  individual  words  here,  or  three  individual  sets  of  digits,  I  should  say.  We've  containerized  them,  we'll  hit  okay.  Then  we  want  an  output  that  looks  a  certain  way.

We want to have the first word followed by a dash, then the second word or set of digits followed by a dash, and then the third. Okay. When we hit apply here, you can see this has cleaned it up a little, and at least the output format is what we're looking for, but we're missing a few. Let's look at this one specifically.

This  one  has  a  space  here.  How  do  we  tell  Regex  that  there  might  be  a  space,  but  there  might  not?  We'll  go  back  here  and  we're  going  to  edit  this  a  little  bit.  We're  going  to  put  in  a  potential  space.  I'm  going  to  put  a  space  with  a  question  mark  there  because  it  might  be  there,  might  not  and  I'm  going  to  hit  okay  and  apply.  There  you  can  see  it  captured  those  two  with  the  space.

But you can also see some of these have a one at the start, like line five here. How do we tell Regex that there might be a one there? Just like we did with the space, we're going to go in and say, "Hey, there could be a one here." If we do that and hit okay and apply, you can see that it cleaned those up.

Now we're pretty happy. We've got everything in the format that we want. But you can see there are other examples of different styles of phone numbers here. If people have put in letters instead of numbers, it's not capturing all of that. There's more we could do with this to clean these up further, but we've taken a lot of messy phone numbers here and cleaned them up into a nicer format.
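The phone number clean-up can be approximated with the sketch below, again in Python's `re` module rather than JMP. The exact pattern typed on screen is not recoverable from the transcript, so this pattern is a reconstruction from the spoken description (an optional one, an optional space, three containerized digit groups, optional single delimiters), and the phone numbers are made up.

```python
import re

# Optional leading 1 and space, three digits, an optional delimiter
# (any one character) and optional space, three digits, an optional
# delimiter, four digits. Parentheses mark the three "containers".
PHONE = re.compile(r"1? ?(\d\d\d).? ?(\d\d\d).?(\d\d\d\d)")

def normalize(raw: str) -> str:
    """Return the number as ###-###-####, or the input unchanged if no match."""
    m = PHONE.search(raw)
    return m.expand(r"\1-\2-\3") if m else raw

# Hypothetical manually entered numbers.
raw_numbers = [
    "212-555-0142",
    "212.555.0142",
    "(212) 555-0142",
    "1 212 555 0142",
    "2125550142",
    "212-555-CALL",  # letters instead of digits: not captured
]

cleaned = [normalize(n) for n in raw_numbers]
print(cleaned)
```

As in the talk, entries that use letters in place of digits fall through unchanged, which shows where the pattern would need further generalizing.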

This  is  a  good  way  to  use  Regex.  Now  I'm  going  to  pass  it  back  to  Hadley  for  the  last  example.

All  right,  thank  you  very  much,  Pete.  Very  well  done  as  well.  What  I'm  going  to  do  is  I'm  going  to  show  you  this  example  here,  which  is  an  example  of  descriptions  taken  from  file  names.  The  first  seven  digits,  I  think  the  first  seven  things  are  the  name  of  the  sample  and  then  how  it  was  run.  Temperatures  sometimes  included,  but  not  all  of  them.  Days  sometimes  or  weeks.  Time  sometimes  included,  but  not  always.

Let's  extract  all  of  this  information  and  what  we  ultimately  want  it  to  look  like  is  that.  We  are  going  to  use  Regex  to  extract  the  sample  project  code  from  the  front,  the  stress  condition  from  within,  the  temperatures  as  well  as  the  mean  of  those  temperatures,  temperature  range.  Then  if  there  is  a  time  we'd  like  that  as  well  expressed  in  days  and  not  in  weeks.

Let's  delete  all  this  and  see  how  we  can  do  it.  Now,  the  first  thing  we  can  do  is  to  add  our  project  code  and  we  could  do  this  in  Regex.  But  you  know  what,  this  is  actually  probably  pretty  simple  to  do  using  substring.  It's  this  guy,  the  first  seven.  There  we  go.  Let's  not  complicate  our  lives.

Now, the rest of it, I think, is a little bit more tricky. What I'm going to do is open up a new script. We're going to start out the way we start out all scripts, and we are going to go in and grab all of these descriptor names. We're just going to create a list called Description with all the values in this column.

What  I'm  going  to  do  is  just  show  the  log.  You  can  see  here  that  if  I  run  Description,  I've  now  got  all  my  descriptions.  What  do  we  feel  like  starting  with?  Let's  see,  I  think  temperature  is  probably  a  good  one  to  start  with.  What  I'm  going  to  do  is  just  to  show  you  that  if  we  take  the  temperature  code  here,  all  of  these  are  going  to  be  in  about  the  same  format.

We're going to create a list container to hold whatever it is. We're going to loop over all of the items in description. Temp code i is going to equal something applied to a description. Then once we get all these, we can just slap the whole thing into a new column.

What is this going to look like? Well, it's going to look like Regex, first of all, then our description, I think this is just description i, followed by what is it? We're talking about temperatures here. It's one digit, maybe a second digit, followed by a dash, followed by another digit and maybe a second digit. Then the letter C.

What we want is this first set of digits, followed by this second set of digits. If I run this, hopefully it works. There we go. As I'm doing this, I see that I probably could have gotten away with just doing this. That would have been fine too. I probably didn't need that second one. But if it works, it works. If it isn't broken, don't fix it. There we go.
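The temperature loop described above might look like the following in Python. The file-name descriptions are hypothetical (the real ones are not shown in the transcript), and `temp_code` is an illustrative name; the pattern follows the spoken description: one digit, maybe a second, a dash, one digit, maybe a second, then the letter C.

```python
import re

# Hypothetical descriptions: 7-character project code, stress condition,
# optional temperature range, optional time.
descriptions = [
    "PRJ0001_agitation_2-8C_day7",
    "PRJ0002_heat_25-40C_week2",
    "PRJ0003_freeze thaw",
]

TEMP = re.compile(r"(\d\d?)-(\d\d?)C")

temp_code = []  # the "list container"
for d in descriptions:
    m = TEMP.search(d)
    # Keep the first and second sets of digits; leave blank if no range.
    temp_code.append(f"{m.group(1)}-{m.group(2)}" if m else "")

print(temp_code)
```

The digits inside the project code are skipped because they are never followed by a dash, more digits, and a C.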

Let's  move  forward  and  what  should  we  do  next?  Let's  grab  our  time.  Time  is  going  to  work  exactly  the  same  way.  We're  going  to  create  a  container  for  time.  We're  going  to  loop  over  descriptions  for  time.  Now  what  do  we  want?  We  want  our  time  code  equals  Regex.  What  does  this  look  like?  It  looks  like  well,  first  of  all,  we've  got  our  description  followed  by,  what's  our  pattern?

It  is  the  word  day  or  the  word  week.  Then  one  digit.  Might  there  be  two  digits?  I  guess  there  might  be.  We're  just  going  to  wrap  some  containers  around  this  so  we  have  a  day  or  a  week.  We  don't  have  both.  Then  we  have  one  digit  and  maybe  a  second  digit.  We  want  our  second  container.  We  don't  want  the  word  day  or  week.  We  want  just  this.

If  I  run  this,  let's  see  what  time  code  looks  like.  There.  You  can  see  that  where  it  was  able  to  it  managed  to  grab  the  day  or  week  and  put  it  in.  Let's  take  all  of  this  and  drop  it  into  a  column.  But  before  we  do  that,  you  perhaps  want  this  expressed  as  numbers  rather  than  characters.

What  I  could  do  is  run  that  and  express  the  whole  thing  as  a  number  instead  of  a  character.  Now  we're  getting  closer  to  where  we  need  to  be.  Of  course  we  want  to  know  whether  these  are  days  or  weeks  and  we're  not  going  to  know  that.  That's  going  to  affect  how  we  put  this  in  what  we  need  to  do  here.

Because if it's days, then it's fine. If it's weeks, then we should take whatever number is in here and multiply it by seven so that we are consistent with the number of days. Then we'll put that in a new column. What is that going to look like? Well, it's going to be an if statement with another Regex: if the day-or-week part that we pull out of our description equals week, then take whatever time code we have and multiply it by seven.

What  did  I  do?  I  think  I  probably  need  to  close  that  guy.  Sorry  about  that,  everyone.  Okay,  now  if  we  run  our  time  code,  you  can  see  that  our  weeks  are  now  multiplied  by  seven.  We  can  take  all  that  and  drop  it  into  a  column.  All  right,  so  far  so  good.  What's  left?  Oh,  yeah.  We  want  the  mean  temperature  rather  than  the  ranges.
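Here is a compact sketch of the time extraction with the week-to-day conversion, written in Python's `re` module. It assumes the time is coded as the word "day" or "week" immediately followed by one or two digits, as the spoken pattern suggests; the descriptions and the name `days_from` are invented for illustration.

```python
import re

descriptions = [
    "PRJ0001_agitation_2-8C_day7",
    "PRJ0002_heat_25-40C_week2",
    "PRJ0003_freeze thaw",
]

# Container 1: the word "day" or "week". Container 2: one or two digits.
TIME = re.compile(r"(day|week)(\d\d?)")

def days_from(description):
    """Return the coded time in days, or None when no time is present."""
    m = TIME.search(description)
    if m is None:
        return None
    value = int(m.group(2))  # convert from character to number
    return value * 7 if m.group(1) == "week" else value

time_days = [days_from(d) for d in descriptions]
print(time_days)  # [7, 14, None]
```

Capturing the unit and the number in one pass avoids running a second regex just to test for "week", though the two-step version in the talk gives the same result.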

What  I'd  like  to  show  you  right  now  is  how  we  can  make  use  of  Regex  once  more,  and  that  is  to  take  whatever  was  in  our  temperature  code  and  again,  apply  Regex  to  it  to  say  that  if  it  was  the  lower  one,  the  minimum  one  is  going  to  be  the  one  on  the  left  side.  The  maximum  temperature  is  going  to  be  the  container  on  the  right  side.

To set these up, I'm going to take all of this and wrap it into a loop again, like that. Now we've got our minimum temperature, our max temperature, and our mean. This is how we're going to set this up in Regex. Anytime we've got temp code and this is the pattern, take the first one, take the second one, turn them into numbers, calculate the mean, and then slap that entire thing into a new column.
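The mean-temperature step can be sketched as below: apply a regex to each temperature code, take the left container as the minimum and the right container as the maximum, convert both to numbers, and average them. The temperature codes are hypothetical values in the "low-high" form described in the talk.

```python
import re

# Hypothetical temperature codes; an empty string means no range was coded.
temp_codes = ["2-8", "25-40", ""]

RANGE = re.compile(r"(\d+)-(\d+)")

mean_temps = []
for code in temp_codes:
    m = RANGE.search(code)
    if m is None:
        mean_temps.append(None)  # nothing to average
        continue
    t_min = float(m.group(1))  # left container: minimum temperature
    t_max = float(m.group(2))  # right container: maximum temperature
    mean_temps.append((t_min + t_max) / 2)

print(mean_temps)  # [5.0, 32.5, None]
```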

Oops. Okay, so the last thing we want to do is grab this middle sample here. Now, I'm not going to walk through this in its entirety. Let me say that back. I am going to walk through this in its entirety. Some of you watching this, if Regex is as new to you as it is to me, may not get this on the first try. That's the beauty of a recording: you can pause it, look at this, and try it out for yourself.

But  basically  what  we're  doing  is  we're  going  through  the  same  process.  We're  creating  a  container  for  stress.  We're  looping  through  all  of  our  descriptions  and  we're  using  those  each  individually  as  the  source.  What  are  we  saying?  Well,  there's  going  to  be  eight  characters.  Any  letter  or  number  or  underscore  potentially  a  space  as  well,  although  I  don't  think  there  are  any  spaces.  Oh,  yes,  there  are.  That's  why  I  included  that.

There may be a space, too. Eight of them. Then I like this here. This is going to be some stuff. Anything, one or more of them, I think was what that meant. What this does is it just tells you to start at the beginning and start looking. Okay, and now where are you going to stop? You're going to stop when you find day. You're going to stop when you find week, or a space, an open parenthesis, closed parentheses, or some digits followed by C, or you get to the end of the line.

When  you  go  through  all  this,  what  are  we  looking  for?  We're  looking  to  extract  the  second  parentheses  thing  here.  This  was  a  literal  open  bracket.  That's  what  we're  looking  for.  Just  drag  all  of  these  things  here  and  drop  those  into  your  column.

As you can see, this was a little bit more complicated. It used some more complex functionality, including lookahead. I'm not going to go into the details of that right now. But I'll just leave this up here so that you can see how that was done and how you would go about doing this for yourself. All this says is keep looking forward until you see day and then take everything before. That's what these mean.
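The lookahead idea can be illustrated with the sketch below. It is not the pattern shown on screen, which the transcript does not preserve; it is a simplified reconstruction in Python's `re` module, assuming hypothetical file names in which the first eight characters are the project code plus a separator.

```python
import re

descriptions = [
    "PRJ0001_agitation_2-8C_day7",
    "PRJ0002_heat_25-40C_week2",
    "PRJ0003_freeze thaw",
    "PRJ0004_light_day3",
]

# [\w ]{8}   eight letters, digits, underscores, or spaces (skipped)
# (.+?)      container: as few characters as possible, at least one
# (?=...)    lookahead: stop just before "day", "week", a temperature
#            range such as 2-8C, or the end of the line; the lookahead
#            itself is not part of the match.
STRESS = re.compile(r"[\w ]{8}(.+?)(?=day|week|\d+-\d+C|$)")

stress = []
for d in descriptions:
    m = STRESS.match(d)
    # Trim the separator left at the end of the captured text.
    stress.append(m.group(1).strip("_ ") if m else "")

print(stress)
```

The lazy quantifier `+?` matters here: a greedy `.+` would run to the end of the line, where the `$` alternative would accept it, and the temperature and time would be swallowed into the stress condition.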

With that, what I'm going to do is open this up again. Just to summarize: regular expressions are a specification of a pattern, frequently used to clean up or extract pieces of data. You can search for a pattern and replace it with a different string, or extract different parts of the string.

You can define the pattern using the Regex function or the Regex Match function, which we didn't talk about, and which we invite you to check out in the help files. They contain lots and lots of information all about Regex, as well as examples of how you can use it to solve the problems that you're looking to solve in whatever industry or whatever situation you're dealing with.

I would like to thank you very much for your attention, and I hope you enjoy the rest of the conference and check out the other talks. Thanks again. Bye-bye.