Wow, That’s a Mess! How Am I Supposed to Get Anything Useful out of These Data? (2022-US-30MP-1123)

Jason Wiggins, Sr. Systems Engineer, JMP

 

If you work with data, you have probably heard the adage that preparing your data for analysis makes up most of the time spent on analysis -- often as much as 80% or more! This talk focuses on using tools in JMP to gain knowledge from messy historical oil and gas drill rig count data. Though specific to a rig count use case, the example applies broadly to anyone who needs to gain insights from data sources, such as Excel files with questionable formatting, structure, and cleanliness.

 

The first part of the talk covers importing raw data obtained from a website, restructuring data tables, and identifying and fixing errors with Recode. This is the 80% that must be done to get to the more exciting 20% where we glean insights. Next, I demonstrate how to use Graph Builder to gain insights from the data. The talk wraps up with using dashboards to share the insights.

 

 

Thanks  for  joining,  everybody.

My  name  is  Jason  Wiggins.

I'm  a  senior  systems  engineer  for  JMP.

I come from a fairly long career in oil and gas

and manufacturing and R&D and quality.

What  we're  here  to  talk about  today  is  messy  data.

I  really  believe  anyone  who  analyzes data  has  encountered  a  data  set

somewhere  along  the  way  that  needed  a  lot of  work  before  it  could  be  analyzed.

Cleaning  or  shaping  data  can  be  difficult,

but hey, when the data are a mess, I find that it actually can be quite frustrating,

especially  when we  have  to  do  it  manually.

Some  of  my  messiest  data  problems have  come  from  Excel  spreadsheets.

I  believe  there's  a  couple of  reasons  for  that.

Excel  is  great  for  many  things,  but for  analysis,  it's  just  not  that  great.

Part of the reason is that it doesn't impose a format on the data.

To my mind, data formats are as varied as the imaginations of the people using them.

Excel files also tend to be hand curated.

Misspellings and inconsistent naming conventions

are really quite common.

The  example  I'm  presenting  today

comes  from  my  career  in  oil  and  gas.

I  believe  that  the  problem and  solution  that  I'm  going  to  show

can  be  found  in  many  of  your  data  sets.

My goal for everybody is to see a few possibilities

for  simplifying  that  front end  of  our  analytical  workflow.

Let's  take  the  exclamation  point  out  of, wow,  that's  a  mess,

and  just  say,  yeah,  but  no  problem I  understand  how  to  deal  with  this.

Now,  I'm  also  using  an  example

where  there  are  data  available  that  you can  download  off  the  Web.

I'm  going  to  be  uploading my  presentation  materials

in  case  anyone would  like  to  practice

some  of  the  concepts  that  I'm going  to  work  through  today.

All  right,  so  let's  get  to  this.

Our  problem.

Baker  Hughes  has  been  publishing rotary  rig  count  data.

These  are  the  rigs  that  drill  for either  oil  or  gas  all  around  the  world.

They've  been  posting  active  rotary rig  counts  for  generations.

A rig  count  is  a  very  important  business barometer  for  the  oil  and  gas  industry.

If  you're  in  the  industry, you  don't  need  the  explanation.

You're  consuming  these  data  on  a  daily, if  not  weekly  basis,  but  it's  used  broadly

all the way from downstream to upstream and exploration.

As  I  laid  the  groundwork, we  are  going  to  be  dealing  with  Excel  data

and  some  of  the  problems that  we  have  with  that.

One,  many  of  the  worksheets that  Baker  Hughes  makes  available

are  not  in  the  right  format for  analysis  in  JMP.

I  also  found  many  errors.

This certainly isn't the most error-prone data set

that  I've  worked  with  coming  from  Excel,

but there are a few in there that are doozies

that  we'll  focus  on  today.

And  really,  they're  going  to  be  around

spelling and inconsistent terms and abbreviations.

Again,  in  terms  of  the  overall analytical  workflow.

In order for us to even get to the analysis and knowledge generation,

knowledge  sharing,

we  have  to  get  our  data  set  into  a  format where  we  can  begin  to  do  that  work.

We're  going  to  be  focusing  on  getting data  into  JMP,  blending  and  cleaning,

and  then  at  the  end,  we'll  do  a  little  bit of  data  exploration  and  visualization.

Ultimately,  what  we're  shooting  for is  to  have  a  data  set

where  we  might  look  at rig  count  trends  by  state,  for  instance.

These  trends,

might  be  very  telling  about the  economics  in  the  industry  over  time.

We  may  also  want  to  look  at  more  of  a  time series  based  analysis,  like  a  bubble  plot,

where we can see the change in rigs over time for the different states.

Again,  in  order  to  get  to  that  point

let's  see,  JMP  is  pausing on  me  for  a  second.

There  we  go. Okay,  in  order  to  get  to  that  point,

we  really  need  to  get  the  data into  something  that  we  can  work  with.

This is the final data set.

This  is  what  we're  going to  be  pushing  toward.

I  have  a  couple  of  ways  of  accounting for  date  in  the  data  set.

This  is  what  comes  from Baker  Hughes  worksheet.

But  I'm  going  to  create a  couple  of  other  variables

that we may want when taking a closer look at time.

We're going to create some new variables:

a month-year variable and a year variable.

We're  going  to  fix  some  spelling  errors in  the  state,

and  ultimately,  we're  going  to  join  this with  another  data  set

that has the lat/long coordinates for the capital city of every state.

I thought about the best way to show

that time series growth and contraction on a map;

we have to choose a point for each state, and capital cities are available.

We  have  to  do  a  few  things  in  order to  make  those  data  sets  connect.

That's  where  we're  going.

All  right,  that's  essentially what  I  outlined  verbally.

But  I'll  pause  for  a  second  and  just  let everybody  take  a  look

at  our  analysis  workflow, which  is  that  component,

that  data  shaping,  data  blending, data  visualization.

All  right,  let's  talk  about importing  data.

For  those  of  you  who  want  to  try  this.

I  want  to  point  out  that  the  data

that Baker Hughes publishes is in a binary Excel format (XLSB).

At  this  point,  JMP  does  not  have a  way  to  directly  import  these  data.

If  this  were  XLSX,  which  it  used  to  be,

I'm  not  sure  when the  binary  file  format  was  adopted,

but it used to be that you could do a File > Internet Open

and  ping  the  URL and  automatically  download  the  data.

But we can't do that here, so we have to do an intermediate step.

It's  pretty  simple.

If  we  go  to  the  website  let  me  pull that  back  up  again  real  quick.

If  we  click  on  any  of  these  links, it'll  download  the  data.

We open that data up in Excel and then save it as an XLSX.

There  are  ways  to  do  this  automatically,

but  they're  going to  happen  outside  of  JMP.

For  those  that  really  want  to  explore this  and  make  it  automatic,

there's  a  bit  of  a  challenge  up  front.

I will point out some ways to automate in JMP after we get the data in.

All  right,  so  this  is  what we're  looking  at,

that's  in  fact  the  Excel sheet  that  we're  going  to  load.

All right, so the first thing that we need to do, in fact.

First things first, let's get our data in.

The column headers are on row five, and I have a couple of rows of column headers.

This  is  common  in  Excel.

People  use  merge  cells  to  combine

text  from  more  than  one cell  into  a  single  label.

We  want  to  make  sure that  we  capture  that.

We're  going  to  say  data  starts  on  row  five

and  we  have  two rows  of  column  headers.

I'm  sorry,  column  headers  start  on  row five  and  we  have  two  rows  of  them.

Then JMP knows how to adjust for where the data starts.

This  is  good.

I  always  like  to  take  a  look at  the  preview  and  just  make  sure  that

I'm getting what I'm asking for, and this looks right.

We're  importing  the  correct  worksheet.

Let's  just  import  that.
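
For anyone who wants to script this import instead of clicking through the wizard, the wizard will actually write the script for you. A rough JSL sketch, with a placeholder file name and the header settings we just chose, looks something like this:

// Rough sketch of a scripted Excel import (JSL). The file name is a placeholder
// for your local copy saved as XLSX; the Worksheet Settings mirror the wizard choices.
dt_raw = Open(
	"north_america_rotary_rig_counts.xlsx",
	Worksheet Settings(
		1,
		Has Column Headers( 1 ),
		Number of Rows in Headers( 2 ),   // two merged header rows
		Headers Start on Row( 5 ),        // headers begin on row 5
		Data Starts on Row( 7 )           // data follows the headers
	)
);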

All  right.

This  is  in  a  wide  data  format.

That's not typically the format that JMP likes for doing an analysis.

Almost  always  we  want  to  be in  a  tall  data  type  format.

What  I'd  like  to  have  is  a  column that  has  these  column  headers  as  rows.

Then  another  column  that  has  the  rig count  for  each  one  of  the  column  headers.

The  operation  that  we  need  to  do for  that  is  a  stack  operation.

Let's  just  talk  through  this.

I'm  actually  doing the  presentation  in  JMP  17,

and  the  reason  I'm  doing  this is  that  there's  a  cool  new  feature  in  17

that  I  find  to  be  so  handy for  this  type  of  work.

All  right,  I  forgot  to  do  one  thing.

Let  me  back  up  real  quick, just  to   keep  the  dialogue  simple.

What  I'd  like  to  do  is  get  rid of  these  summary  statistic  columns.

These  will  certainly  make sense  in  a  wide  data  context.

They aren't going to make sense if we stack them.

We're just going to delete those.

We could deal with that in a lot of different ways, but let's keep it simple.

We'll  just  delete  it  out  of  the  data  table and  then  go  back  to  our  stack,

turn  on  a  preview,  and  let's  just select  all  of  those  columns.

This  is  great. One  thing  I  love  about  the  preview  is,

first  off,  I  get  to  see  that,

yes,  this  is  in  the  shape that  I  need  for  my  analysis,

but  I  can  also  make  some  changes

and  see  how  the  data  table is  going  to  look

before  I  actually commit  to  making  a  data  table.

If  you  remember,  we  wanted the  count  to  be  rig  count.

That  data  column, we  want  to  be  rig count.

Source  column.

We're  going  to  name  this  state  and  type,

and  then  we're  going to  separate  those  in  the  end.

Another  example  of  creating  new  variables, we're  actually  going  to  split  those  apart

so  we  can  use  them as  two  separate  variables.

But  for  right  now,  I  think that's  pretty  descriptive.

If  I  hit  enter  or  click  anywhere  else and  I  get  to  see  the  update

and  yes,  indeed, this  is  something  that  I  want.

That's  data  shaping. Step  one.

We  may  have  to  do  several steps  of  data  shaping.

In  this  case,  it's  a  simple  example,

a stack is appropriate, and that's all we have to do.
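
For reference, the Stack dialog writes a script too. A minimal JSL sketch of this step, with my placeholder column and table names, looks roughly like this:

// Minimal sketch: stack the wide per-state columns into a tall table.
// The column list is abbreviated; use whatever columns the import produced.
dt_wide = Current Data Table();   // the imported table, after deleting the summary columns
dt_tall = dt_wide << Stack(
	columns( :"Alabama - Land"n, :"Alaska - Land"n ),  // ...and the rest of the state columns
	Source Label Column( "State and Type" ),           // column headers become rows
	Stacked Data Column( "Rig Count" ),                // values become the rig count
	Output Table( "Rig Counts Stacked" )
);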

All  right,  before  I  show  this  slide, let's  go  over  to  the  data  set.

One of the first things that I do when I'm manipulating data tables

and working with new data sets

is that I like to graph the data, plot the data,

or  show  the  data  in  some  way  graphically

to  help  me  see  whether  there  are issues  that  I  need  to  resolve.

Distribution  is  a  great way  of  doing  that.

I'll  just  pause,  let  everybody  look.

It's  probably  small  print. Let  me  make  that  much  bigger.

I  think  even  in  the  first  couple  of  bars, few  bars,  half  a  dozen  bars,

hopefully  everybody  is  recognizing some  potential  issues.

One,  what's  wash?

I'm  not  sure  why  Washington was  abbreviated.

There's  probably  some  reason  historically, these  data  sets  are  quite  old.

We  have  abbreviations  for  west.

We  have  abbreviations  for  north  and  south.

Turns  out  that  is  the  exact same  abbreviation  as  for  new.

We  have  several  issues.

Let me scroll down. There's another doozy here.

Fortunately  there's  only  one  of  them.

See  if  I  can  find  it. We're  looking  for  Tennessee.

There  we  go,  Tennessee.

Everybody  take  a  look  at  Tennessee there  and  see  if  you  notice  the  problem.

We're  missing  an  S,  right?

That's  something  that  I  would  do.

I'm  a  horrible  speller  when  I'm  typing, especially  when  I'm  typing  fast.

We  found  one  spelling  error  there.

Now  the  trick  is,  how  do  I  parse that  out  and  fix  all  of  these  errors?

More  importantly, how  do  I  do  that

in  a  way  that  doesn't involve  1000  steps

for  renaming  these different  abbreviations  and  misspellings?

That's  where  regular expressions  come  in.

In Recode, there are a variety of different tools to deal with data cleaning problems.

I  always  like  to  choose  the  simplest  one, but  often  they're  not  that  simple.

What  are  regular  expressions?

Well,  they're  sequences  of  characters that  specify  a  search  pattern.

Here's a simple example: I've got a cats and dogs data set.

Each  one  of  these  characters  represents a  pattern,  a  search  pattern  for  the  text

and  then  a  command  for  what  it is  that  we're  going  to  return.
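
As a made-up illustration of the idea (not the exact example from the slide), the pattern below captures the two words on either side of "and", and the replacement swaps them using the numbered captures:

// Hypothetical illustration of a search pattern plus a replacement command (JSL).
// (\w+) captures a run of word characters; \1 and \2 refer back to the captures.
Regex( "cats and dogs", "(\w+) and (\w+)", "\2 and \1" );  // returns "dogs and cats"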

Why  we  use  them?

Well,  they're  very  compact  and  flexible.

I  can  solve  really  complex character  data  problems

with  a  small  handful of  regular  expressions,

and  I  just  can't imagine  how  else  I  might  do  that.

It  definitely  takes  messy  problems and  makes  them  simpler,

but you've got to learn a new concept.

If  regular  expressions  are  brand  new to  you,  I  have  a  resource  that  I  like.

It's free, it's online: it's Debuggex.

One  of  my  favorite  parts  of  this  is  the cheat  sheet  or  the  quick  reference  guide.

I  want  to  understand  what those  search  characters  mean.

I  can  look  at  this  quick  reference

and  I  can  start  to  begin  to  piece together  regular  expressions.

I  don't  often  use  the  test  part  of  the  website,

but  if  you  had  some  example  text

and  wanted  to  test  your  regular  expression against  it,  you  could  do  it  here.

I prefer to do that in JMP.

It  just  saves  me  time.

JMP  actually  has  something  that  you'll see  that  we  can  use  in  a  similar  way.

All  right,  so  that's  what  they  are.

Good  place  to  go  learn  about  them and  let's  take  a  look  at  how  we  use  them.

All  right, I'm  only  going  to  fix  a  couple  of  these,

and  we'll  fix  a  couple  of  them  together.

I'll  speak  through  what  the  regular expression  means  as  I  type  it.

Before  we  even  get  there.

I'd  like  to  recommend  that  when  we're doing  this  type  of  work,

that  when  we  recode,  we  want  to  dump the  results  into  a  formula  column.

The reason being, if we decide to add on to this data table,

those  fixes  will  persist  in  the  future.

If it's the only time we're really going to be using it,

maybe we don't need the formula column, but I prefer it.

In  fact,  I  really   like  to  see that  right  up  top  for  my  personal  use.

How  do  we  get  to  the regular  expressions?

We  have  a  replace  string  utility.

Again,  there  are  many  other  utilities.

Always  choose  the  simplest  one.

If  we're  just  trying  to  pluck  first and  last  word,  for  instance,

I  don't  want  to  write a  regular  expression  for  that,

but  in  this  case, I  got  some  mess  I  need  to  clean  up.

We're going to use a Replace String to do that.

A couple of necessary ingredients.

You  have  to  make  sure  that  the  Use  regular Expressions  check  box  is  turned  on.

Remember that preview that Debuggex shows?

Well,  here's  our  preview.

We  only  have  to  type  once and  adjust  once,  hopefully.

Then we get to see the results.

Let's try that out on a couple of these examples.

Again,  I'm  going  to  speak what's  happening  as  I  type.

Let's  work  with  the  new  first.

Remember, N. can be either New or North.

If  I  look  at  that  dataset,

it seems logical to fix the New one first and then work on North.

I'm  going  to  look  for  the  character  N

in  the  string  in  the  row of  the  column  that  I'm  analyzing

Then  I'm going  to  look  for  a  period.

The  reason  I'm  putting  this  in  brackets  is

that a period is also a search character.

It  means  any  character.

you could leave the brackets out of this;

it's still going to work, but it's a little bit confusing.

If  we're  using  a  specific  character

that may also exist as a search character, sometimes it's nice to bracket it out.

Makes  it  a  little  more  interpretable.

All  right,  after  N  we  have  a  space.

This backslash s (\s) is the white space character,

and  I'm  actually going  to  type  out  Mexico.

Now  I  could  use  a  search  pattern,

a search character here like a backslash w star (\w*) or something like that.

I'll  explain  that  a  little bit  more  as  I  go.

But  sometimes  when you're  writing  regular  expressions,

it's  handy  to  have  something that's  a  little  bit  more  readable.

I'm  choosing  to  type  out  the  words  here

and  there  aren't any  problems  with  those  words.

I'm  just  going to  reference  them  directly.

Now  with  a  regular  expression I  can  use  logic.

I  can  deal  with  all the  new  issues  but  one  in  a  single  line.

I'll  tell  you  why  we're  going  to  deal with  the  New  Hampshire  one

a  little  bit  differently.

Let's  do  the  New  Mexico.

New  York.

That pipe again is our logical OR, and then we'll do Jersey.

The  reason  I'm  putting parentheses  around  this

is  that  it  allows  me  to  return  the  result of  everything  inside  the  parentheses

and each set of parentheses, left to right, is referenced by \1, \2, \3, in numerical order.

I  think  that  is  probably  good  except  oh, we  do  need  to  have  that  last  part,

that "- Land", or, if there were offshore rigs,

I  don't  think  there  are  in  New  York, but  we'd  want  to  capture  that.

There's my any character, and I'm looking for any character that happens

one or more times, and I put parentheses around that.

Here's  where  the  magic  is.

All right, so now I can type New.

Now  I  have  a  white  space  in  there

so  I  don't  need  to  actually  put  a  white space  in  the  replacement  text.

But this \1 and \2 reference what's inside the parentheses.

The \1 is going to be the result of that search

inside the first parentheses, and then the \2 is going

to capture all the characters that come after it.

Let's  scroll  down  and  see  how  we  did.

We  get  a  star  for  anything  that  was recoded  and  it  looks  good.

New  Jersey,  New  Mexico,  New  York.

I  think  we've  done  our  job  here.

Again,  another  just  classic  reason  why you  want  to  use  regular  expressions  is

that in a single statement, I was able to fix three problems.
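
As a JSL sketch, the whole pattern and replacement we just built look like this; the input string is just one example value from the column:

// One statement, three fixes: N. Mexico, N. York, N. Jersey.
// [.] is a literal period, \s is a white space, (...) are numbered captures.
Regex( "N. Mexico - Land", "N[.]\s(Mexico|York|Jersey)(.+)", "New \1\2" );
// returns "New Mexico - Land"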

Let's  do  one  more.

Let's just do a couple more with our N. and then we'll move forward with this.

All  right,  so  we'll  go back  to  replace  string.

You  can  have  as  many  regular  expression replace  string  commands  as  you  want

within  recode.

Sometimes  again,  that's  nice.

You  could  get  really  clever

and  fix  a  lot  of  issues with  a  single  regular  expression.

Sometimes  it's  a  little  more  readable  if

we  tackle  them  maybe  in  a  few  at  a  time  or even  in  the  unique  cases  one  at  a  time.

Using regular expressions, let's deal with the New Hampshire problem.

Same thing: we're going to do the N character followed by a period.

Then  we're  going  to  do,

actually, let's not capture with parentheses, because we don't want the "Hamp." in the result.

We want to look for

"Hamp" ending with a period,

and  then  we  want  to  capture  all the  text  that  is  behind  that.

Now we do New Hampshire with the "- Land" behind it,

scroll  down  and  it  fixed  it.

That's  great.
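
As another JSL sketch, with an example value from the column:

// Match "N. Hamp." literally, capture whatever follows, and spell the state out.
Regex( "N. Hamp. - Land", "N[.]\sHamp[.](.+)", "New Hampshire\1" );
// returns "New Hampshire - Land"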

I  could  do  something  similar.

We're   running  short  on  time.

I  want  to  make  sure  we  at  least  get to  some  of  the  graphing  part  of  this.

But  we  could  do  a  similar  set  of  steps to  deal  with  the  north  and  the  south

and the west and the many others. Let me show you what that looks like in the final data table.

If  I  look  at  the  recoded  formula  column.

Once  again,  very  nice  to  have this  because  it  is  portable.

Those  are  the  regular  expressions

that  I  use  to  fix  all  the  data problems  in  that  column.

Again,  the  benefit  of  doing  this,

is  if  you're working  with  a  really  huge  data  set,

could  you  imagine  going  through and  hand  typing  those

every  time  you  want  to  do  an  analysis on  new  data,  for  instance?

It's  pretty  arduous.

We've  saved  ourselves  a  lot  of  time with  just  a  little  bit  of  creativity.

That's  regular  expressions.

Again,  my  intent  wasn't  to  kind  of  teach regular  expressions,

but  really  show  that  as  an  opportunity for  folks  to  investigate

that  can  help  deal  with  many  of  your messy  data  problems.

Let's play around with these new variables.

I  think  let's  just  go  back  to...

Yes,  I'm  actually  going  to  show  it in  the  same  table

that  I  have  the  results and  I'll  just  show  you  how  I  got  there.

When  we  finally  got  all  that  data cleaned  up,  we  had  a  date  column.

If  we  look  back  here,  hey, there's  our  date  column.

I'll  use  this  table.

A  little  wishy  washy  here  today,

but  just  so  you  can  see  what  this  looks like  if  we're  starting  from  scratch,

let's  just  use  this  one.  I'll  reference back  to  the  complete  table.

This New Formula Column menu has a Date Time group in it, which is really handy.

Now, you notice that we are in a day-month-year format,

and  that's just  when  the  record  was  made.

Now,  in  an  analysis  context,  really, we  may  want  to  just  look  at  month  and  year

so  we're   combining  some  of  those,

making  the  time  component a  little  more  coarse.

That  can  be  helpful  in  seeing the  trends  that  we  want  to  see.

We're just going to choose Month Year, and

JMP automatically dumps that formula into a new column.

If  we  look  at  that,  that's  what  we  did.

We can do the same thing with year: New Formula Column,

Date Time, Year.

Now we have two other variables that we can use in an analysis context.
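
If you would rather build those columns by formula yourself, a JSL sketch like this does the same kind of thing. The column names, the m/y format, and the assumption that the date column is called Date are my choices here, not necessarily what the built-in transform writes:

dt = Current Data Table();
// Coarser time variable: collapse each date to the first of its month.
dt << New Column( "Month Year", Numeric, "Continuous",
	Format( "m/y", 12 ),
	Formula( Date MDY( Month( :Date ), 1, Year( :Date ) ) )
);
// And a plain year variable.
dt << New Column( "Year", Numeric, "Ordinal", Formula( Year( :Date ) ) );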

Remember  the  fact  that  we  have  land and  offshore  tags  on  every  state?

We  probably  want  to  split  those  out,  and to  do  that,  just  create  a  formula  column.

Let's  see,

I actually used a regular expression for it,

but  you  could  play  around  with  other ways  of  dealing  with  that.

I  have  a  column that  has  a  regular  expression

that's  plucking  off  the  Alabama.

Now, if it were just states here, we could use that first word,

but if I remember right, the dash doesn't work with it.

Same thing with type.

Look  at  that  formula.

I  just  have  a  regular  expression  that's looking  for  the  last  piece  of  that  string

and  then  returning only  the  last  piece.

I've  created  four  different  variables that  I  want  to  use  for  analysis.
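
Here is roughly what those two formula columns can look like in JSL. The " - " separator and the column names are assumptions on my part, so match them to how your labels actually read:

dt = Current Data Table();
// Everything before the " - " separator is the state...
dt << New Column( "State 1", Character, "Nominal",
	Formula( Regex( :"State and Type"n, "^(.+?)\s-\s.+$", "\1" ) )
);
// ...and everything after it is the type (Land, Offshore, ...).
dt << New Column( "Type", Character, "Nominal",
	Formula( Regex( :"State and Type"n, "^.+\s-\s(.+)$", "\1" ) )
);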

Let's  do  one  more...

Part  of  the  reason  that  we  cared so  much  about  the  state  names.

One  is  that  if  we  have  multiple representations  of  the  state,

well,  that's  going to  complicate  our  analysis.

But  the  other  rationale  for  doing  that  is

if  we  need  to  join  it  up  with  something else,  like  Lat  long  coordinates.

For  each  state,  we  need  to  have the  state  names  consistent.

With the data set that I have,

you could use either all caps or title case to do the join.

Joining, that's cool, we can do that.

In fact, I think I have an intermediate table

that we can open and look at here before we do the join.

Let's  do  a  little  housekeeping  here.

I'll  leave  that  one  up  in  case we  need  to  go  back  to  it.

Okay,  so  we  have  our  state, it's  been  fixed,  recoded.

We've  created  that  state  variable.

Now  I  want  to  join  it  with my  US  state  capitals.

Reopen  that  file.

I'm going to use State 1 and match it with State in the capitals table.

Joins in JMP.

I always like to think of it this way: I want to invoke the join

from the table I want on the left.

Then  I'm  going  to  choose the  one  that  I  want  on  the  right.

I'm  going  to  match  again. We're  going  to  match  on  state  and  state.

An inner join is going to be appropriate here.

I  mean,  if  we  had  any  missing  state  names,

I'm  not  sure  how  informative those  might  be  for  our  analysis.

I really  don't  want  them. In  fact,  they  don't  exist.

An inner join is appropriate.

It's just going to take anything that matches.

Any non-matches are left out.

Hey,  I've  got  this  great  preview  in  17.

We  can   look  at  that  a  little  bit.

Let  me  get  down  to  the  scroll  bar and  right  at  the  very  end  I  should  see

the  capital  and  the  latitude and  longitude.

Now I ended up with this extra State column, too.

If I don't want that, I can select the columns for the join table.

If  ultimately  this  is  going  to  a  script, maybe  you  want  to  do  that

just so you don't have to go in and tweak columns after the fact.

But  that's  it  for  a  join.

We  just  glued  those  two  data  sets  together by  matching  state  names

and  we  have  latitude  and  longitude.
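
A JSL sketch of that join; the table names here are placeholders for whatever your cleaned table and capitals table are called:

rigs = Data Table( "Rig Counts Stacked" );   // cleaned, recoded rig count table
caps = Data Table( "US State Capitals" );    // capital city, latitude, longitude per state
rigs << Join(
	With( caps ),
	By Matching Columns( :"State 1"n = :State ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 )   // inner join: non-matches on either side are left out
);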

Maybe  capital  is  not necessarily  important.

We  can  just  get  rid  of  those  columns.

Back  to  the  automation  component.

We  did  several  steps  along  the  way,

but  for  each  step  that we  do  in  the  tables  menu,

there's  a  source script  associated  with  it.

If  we  have  each  of  the  tables,

the  intermediate  tables, that  get  us  to  the  final  result,

we could steal that script, glue it together, and automate that workflow.

All right, that gets us through joins.

Let's  play  a  little  bit  in bubble  plot  and  Graph  Builder.

Let's  do  Graph  Builder  first

and  then  with  whatever  time  we  have  left, we'll  get  to  a  Bubble  Plot.

Again,  we're  going  to  enjoy the  fruits  of  our  labors  here.

We  exercised  a  lot  of  creativity.

We  got  a  data  set  that  is analyzable  and  graphical.

Let's see what we can learn from it.

All right, so we have built-in state shape files in JMP,

so  it'll  recognize  state  names  and  build

state  boundaries  in  a  map in  the  graph  frame.

We're  going  to  drop  that  state into  the  map  shape  role.

Now  what  I  may  be  interested  in  is the  rig  count  by  state.

I'm  going  to  color  by  this.

To me, I don't find that blue and red

are all that informative for what I'm trying to get across.

I  really   prefer  spectral  colors, but  no  problem.

We  can  change  the  gradient with  a  simple  right  mouse  click.

I  like  white  to  red.

It  seems  to  work  for  me  and  that's  great.

We see the rig count by state.
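
For anyone scripting it, the map itself is a short Graph Builder script. A sketch, using my column names, looks roughly like this:

// State as the map shape role, colored by rig count.
Graph Builder(
	Variables(
		Map Shape( :"State 1"n ),
		Color( :"Rig Count"n )
	),
	Elements( Map Shapes( Legend( 1 ) ) )
);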

But again, this is really coarse, right?

We  have  a  lot  of  years  of  data, so  22  years  of  data  here.

We  may  want  to  drill into  these  a  little  bit  deeper.

Build  another  graph, and we look at date  and  rig  count.

We're  going  to  turn  off  that  smoother.

There  we  go.

That  just  averaged  everything.

What  I'm  doing  this  for  is I  want  to  steal  the  script  behind  this

and  I'm  going  to  use  it to  get  those  cool  hover  graphs.

Save  script  to  the  clipboard,

go  back  to  this  map right  mouse  click,  hover  label.

I'm going to paste the graphlet. Let me see how I did.

JMP  knows  that...

Let's  just  click  down  here because  I  think  we're  getting  close.

JMP  knows  that  I'm  only  looking  at  Texas,

so  it's  only  going  to  give  me  the  rig count  trend  over  time  for  Texas.

I  believe  that  is  fairly  handy.

If  we  look  at  Louisiana,  for  instance.

Well,  let's  look  at  North  Dakota.

North  Dakota  is  interesting  because

of  the  shale  unconventional  reservoir boom  that  happened  up  there.

They  had  a  pretty  big  spike.

They  had  a  pretty  big  drop  off,

and then they're building again in terms of rigs.

We  are  drilling  into  this.

But  hey,  this  actually  has land  and  offshore  in  here.

Let's split those out. I'll turn our control panel back on.

We'll  drop  type  down  on  page.

We'll  get  a  different  graph for  each  of  those.

Maybe  we'll  make  this a  little  bit  smaller,

a  little  bit  smaller  still.

Maybe  I  want  to  show  missing  data.

Now  I  have  boundaries  for  all  the  states

but only  the  ones that  have  offshore  rigs  are  colored.

Now  if  I  hover  over  Louisiana, interestingly,  you  can  notice

a decline in rigs off the coast of Louisiana in the Gulf.

That  coincides  a  lot  with  what  we  see in  these  unconventional  wells

that  are  really  increasing  in  number  quite rapidly  in  states  in  the  US.

We're  able  to  drill  into  a  few  trends.

We were able to do that because we massaged the data.

We  got  it  into  a  format that  we  can  actually  use.

I  think  we're  up  on  time.

I  will  post  these  materials   out  onto  the  Web,

and  you'll  have  example  scripts  in  each one  of  the  data  tables  that  you  can  run

to  get  the  final visualization  that  I  was  working  on.

then  you  can  try and  recreate  it  yourself.

Also  I'll  have  two  example  data  files,

but  I  would  recommend if  you're  going  to  try  this,

to  actually  download  the  data, and  then  you  get  the  full  experience.

With  that,  I  really  hope  that  I've highlighted  a  few  different  things

that  could  potentially save  you  a  lot  of  frustration

in  your  data cleaning  efforts.

With  that,  thank  you  very  much.