Exploratory Data Analysis and Visualization of a Large, Online Vaccine Database using JMP Pro 16 (2022-US-45MP-1153)

Stanley Siranovich, Principal Analyst, Crucial Connection LLC

Recent events have brought about much discussion in both the popular press and the scientific literature about the safety and efficacy of some recent vaccination programs. One frequently referenced data source is the Vaccine Adverse Event Reporting System (VAERS), which now covers more than 20 years. In this session, we will demonstrate the Exploratory Data Analysis and Data Visualization capabilities of JMP Statistical Discovery software. We will begin by using the Subset, Join, and Concatenate platforms from the Tables menu, followed by Graph Builder from the Graph menu. Finally, we will make use of some Screening platforms from the Analyze menu. In general, we will show how to use JMP’s “front end” data selection and management tools, drag and drop interactive graphs, and linked analyses to speed the time to discovery for a large, complex dataset.

Good morning. Good afternoon

and good evening, everyone,

wherever you are.

My name is Stan Siranovich,

and I am the principal analyst at Crucial Connection L LC,

and I'm doing this presentation

from Jeffersonville, Indiana,

right across the river from Louisville, Kentucky.

Today I'm going to talk about

how to do an exploratory data analysis and visualization

of an online vaccine data base

using JMP PRO 16.

That online database

is the VAER system.

So what is VAERS?

That is the Vaccine Adverse Event Reporting System.

It's a national early warning system to detect possible safety problems

in US-L icensed vaccine.

That is the same database that has been in the news

for the last year or two.

Let's talk about why it was developed.

First of all, it was to detect new, unusual

or rare vaccine adverse events, monitoring increases in known events,

identify potential risks,

and assess the safety of newly licensed vaccine.

Nowhere in those goals do you see anything about

make data analysis vaccine data set easy,

which is why we're doing this presentation.

Now, it's maybe structured a bit differently

than some of us are used to.

I came out of the lab and production facilities,

and I'm used to, for lack of a better word,

chemical scientific, rather small data sets.

On rare occasions,

when I did work on something larger,

somebody in the corporate hierarchy

cleaned all the data for me.

That is not the case here.

It's organized by year, and there are three tables per year.

There's facts, there's data,

and various symptoms.

What I did was, the first week in June,

I downloaded all the data as of May 31st,

and this is what it looks like.

You can download to your heart's content.

You do have to sign in.

Over on the right side of the screen,

let's see what I got.

Now, notice in years here, 2018, 2019, 2020,

the zip files are roughly the same size.

By the way, this zip file is simply these three files here zipped into one.

That is normally the best way to work on it.

Download zip and unzip it.

But notice what happens between 2020 and 2021,

which is at about a 12, 14 percent increase.

We go from 11.2 Meg up to almost 169 Meg.

So something's going on, and we would like to take a look at it.

This is what it looked like on my desktop.

Now let's talk about tables there.

I mentioned there are three tables per year.

There's vax;

contains all the vaccine information.

It's got information, such as,

almost 100 % unique VAERS ID at the top,

manufacture, lot type

and the data is where a lot of the data

that we're going to be interested in resides.

Notice it's got the VAERS ID again, and it's got some different columns

that we're interested in,

such a state, age in years, sex, symptoms,

which is a free form of a text field that sometimes seems to go on forever,

whether or not the patient d ied, et cetera.

Then VAERSYMPTOMS,

contains just the VAERS ID and symptoms,

and they go from 1-5,

and sometimes they continue on from 6-10,

and we will address that issue towards the end of this presentation.

Let me get out of that,

drag that over.

Right now, you should see the JMP window open.

I am going to present from a JMP project.

Let me go over that very quickly.

By the way, I assume everybody watching this

has seen a JMP window before.

This is my workspace.

I drag the three files in

and opened them up in JMP down here as contents.

I opened up a new instance of this.

When I'm working on a project,

I drag my links and maybe some PDFs or whatnot into that space.

Then at the bottom, we have recent files.

But mainly what we want to do, Iis look at this window,

and the main window here,

notice we have tabs here,

and you click on the tab just like a spreadsheet

and we can see our different sheets.

Let's start cleaning.

First thing I'm going to do is make...

Where is it? Where there is VAX.

Make that the home table, and you'll see why in just minute.

So I'm going to start there.

Now notice in VAX_TYPES , everything's mixed in together.

We got the COVID-19, which we are going to focus on,

HPV9.

Scroll down a little bit, we have unknown.

We have Flu X, et cetera.

So what we want to do is separate up to COVID- 19.

The way we do that, and I am going to go up to Rows

Row Selection,

Select Where.

By the way, in JMP,

there's almost always several ways of doing things,

but to keep things uniform,

I will always go up to the menu bar

to do this type of thing.

What we want to do is separate COVID- 19.

So I select VAX_TYPE.

From the dropdown,

I'll just leave the default to equals,

and I'll put in COVID-19

and I'll come down here and uncheck the Match case.

Let's see, it's checked over the window.

It tells us what we're going to do;

select rows in the data table that match specified criteria.

It looks like it.

Let's see.

Click OK. And there it is.

Notice it selected all the rows.

It skipped the HPV9 and the unknown here, et cetera .

What we want to do, now that we selected them,

is to separate them out.

So I will go to Tables, Subset,

and it tells us what we're going to do.

We're going to create a new data table

from selected rows and columns.

I went to Selected Rows, which check for us already.

Notice it says here, we can save the script to the table.

Normally, that's a good practice.

Makes it much more convenient to repeat things.

But I'll leave that unchecked for now,

since this is a demo.

Is there anything else I need to do?

Yes, it says, Output table name.

For simple analysis, JMP will take care of that for you.

But for anything it starts to get a little bit complicated.

I recommend deciding on some sort of a naming scheme.

So rather than Subset,

I am going to name that COVID- 19 only.

Click the OK button.

Here it is, COVID only.

We can scroll down, and verify that.

Notice down here in All rows,

we've selected 129, 975.

So keep that one in mind,

because over here, we started off...

Oh, no, it's data. Where was it?

Back. Right here.

We started off with 146, 500.

We just got the COVID right there.

We'd like to see what's going on.

Notice here, we don't have any symptoms,

we don't have any adverse events.

What we're going to have to do is get them out of the data,

and that's right here.

Now let's take a quick look at that.

We've got the columns displayed over here.

We've got all sorts of things.

They died, length of stay, onset date, et cetera

and here's the VAERS ID.

Now what we're going to do is to join the two tables on the VAERS ID.

Let's go back over here to our COVID- 19 data only.

There's VAX.

We've got that.

Now what I'm going to do is go up here to Tables.

We want to come down here to Join tables.

I won't get into that database stuff, but let's just say we want to join tables.

Now I went to VAERS VAX ,

and did it because it says, up here,

"Join this with VAERS VAX,"

and down here,

what I have to do is a couple things.

First of all, let me now save that to later.

Go down here and select VAERSDATA .

Notice we have some windows

that pop up here.

That shows us all the rows in the second data table.

Now we have to match the rows.

We're going to come here in the COVID- 19 data table,

the VAERS_ ID, and then jump into the end.

So we're going to click here

and they are two separate windows, so you don't have to get to control.

What we want to do is match them so that things don't get mixed up.

Let's look down here, and we get to some data table stuff.

It tells us that it is an inner join.

An inner join, selects rows common to both.

But I'm not sure about this data.

So what I want to do is come over here to Main Table and select Left Outer Join.

That's going to keep all the ones in the original table,

which was the VAERSVAX table,

and all of the matching entries

in the VAERSDATA table.

Let's look a little bit.

Again, we can save script.

Let's give it a name.

Let's call that one,

VAX join DATA

and see if there's anything else we need.

Yeah, you know what,

we could do this in a two- step process.

But why don't we do it all in one?

Let's go back up here to the VAERS VAX,

and what we'll do is we'll keep the VAERS_ID,

because we don't want to know what that is.

We don't need TYPE, because they're all the COVID-19 .

Well, we may want to look at different manufacturers in the lot,

whether it's Series One, Two, Three, or Four, whatever,

a VAX_ROUTE and VAX_SITE.

We won't worry about that for now.

Come down here

to the other data table and see...

What do I want to check?

We don't need the VAERS_ ID again.

Let's check STATE, age in years.

Notice, there's a couple other age columns,

but we'll leave those go for now.

We don't really need them.

We want to know the response by sex, probably.

Don't know, SYMPTOM_TEXT?

Yeah, we'll keep that in.

We certainly want to know whether or not the patient died.

They died.

Let's do HOSPITALDAYS ,

which is how many days they spent in the hospital

if they went to the hospital.

Let's do NUMDAYS.

Now we know, from checking the VAERS website.

NUMDAYS is simply the difference between VAERS state.

and the onset of symptoms states.

So that tells us, how far after the vaccination.

And we've got a whole bunch of other fields here,

and I don't think we need any of those.

Let me scroll up, make sure everything's still checked.

I hope I have everything that I need.

And I'm going to click this one here .

It says, "Select columns for join table,"

in case you can't see it.

I'm going to hit Select and put them all in,

I hope, and we'll scroll down to check.

A gain, we can save the script to table,

but let's hit OK.

We named it VAX join DATA.

There it is, VAX join DATA over here.

Now notice we have the manufacturers, the dose, the state.

Let's look at state.

Have a number of missing values

and there's some other considerations,

there too in that column.

But we'll get to those later.

Let's see, SYMPTOM_TEXT, HOSPDAYS .

We're all set.

Let's look at this.

Let's expand this.

This is just a free- form column.

Let's make use of one of my favorite features;

Show Header Graphs.

Let's hit that.

And wow!

Let's see what we have here,

let me pull that over.

We have, let's look at manufacturers.

We want to see if maybe one manufacturer or neither has more adverse events.

It looks like BIONTECH is by far the largest.

Then we have some unknowns.

Look at this.

We have almost 8,000 VAX_LOTs.

So if you want to do an examination by lot,

that's going to be rather difficult.

We see VAX_DOSE_SERIES here,

and it looks like Series 1 has more than series...

Though, that's a little strange.

Then comes 2, then comes 3.

How did that happen?

Now we just note that and move on.

Notice for STATE,

we have 1, 2, 3, 4, 5 plus 54.

Well, last I heard we didn't have 59 s tates

so were going to have to check that.

AGE_YRS looks okay.

It looks like we have a whole lot more females than males.

SYMPTOM_TEXT , lot of stuff there.

We noted all that.

We could close that

and let's do break up the monotony here, the cleaning.

Let's do an analysis.

Let's go to everybody's favorite platform; Graph Builder.

We go up to the menus, select G raph

and select Graph Builder from that.

Let's see what do we have here.

How about I did bring in hospital days, I hope.

Yes.

Let's do hospital days.

We'll select it, put it on the y,

and let's see if there's any differences

between amongst, or I should say the states.

Didn't want to do that. Just wanted to select STATE.

Put that in the x, and I'll select Bar Graph.

There are the bar graph.

Notice we do have some anomalies here.

Let's look down at the x-xaxis.

We have State AS ,

State MH, State PW, State QW.

That isn't right.

So we know we have to do a little bit more cleaning on that.

But this is a demo,

so we'll leave those in there for now.

That's pretty easy thing to figure out.

What we want to do is...

Come on, keeps popping up.

You can be a little more careful about that.

Come up here,

and I right- click on the x-axis

and come up here to order by and to hospital days descending.

There it is.

We have some unusual results here.

It looks like Wyoming.

Oh, I should point this out too.

JMP automatically selected the mean for us.

Which for our purposes is probably okay.

It looks like the mean hospital stay

for the people who suffered an adverse event Wyoming

is, I don't know, 20 in a fraction days.

Which, man, that's high.

Come down here to the next one, Vermont,

and it gives us the number of rows.

Put that in automatically for the hover label.

A fter that, it's Mississippi,

and see Oklahoma and Utah.

Let me find...

Let's see, New York, Pennsylvania.

They're all down in here.

For some reason,

I don't know if it's just the chance or what,

but we make note of the fact

that a couple of the sparsely populated states seem to have longer hospital stays.

Let's go back to here.

And this is another reason why I like to use the JMP projects.

You don't have to go hovering over closed windows.

Let's go to Graph.

We want to do that graph [inaudible 00:20:22].

Yeah, let's do one more example.

Go to Graph Builder,

and we'll take STATE,

put that in the x-axis.

And we look at what the JMP did for us.

And it looks like we have some high numbers here.

And we hover over California, for example,

and it says it's got 10,629 rows,

and it lists them, and it gives us the state.

What JMP did here was automatically put the number of rows in there for us,

which is a reflection of the number of patients

who reported an adverse event.

Let's see, Ordered by state, descending.

So let's do that order again

and take a look at it.

This makes sense.

California is far and away

the leader with, unfortunately for them with adverse events,

but it is also a highly populated state.

The next one is Michigan, the same deal Florida.

So it seems to somewhat mirror the population.

There's New York.

We'll just leave that open for now and move on.

Now let's take a look at the VAER symptoms.

Click that tab.

First thing we noticed, and I did not plan this,

but notice VAERS_I D, row one and two.

It's the same number.

How can that be?

That is supposed to be a unique identifier,

but that is somewhat common in the VAERs data set.

Let's take a look why.

By the way,

I think this is an excellent reason to always spend some time,

even if you're in a hurry,

to look at the data and see if anything looks a little bit weird.

And sure enough, two rows with the same row number.

Now notice symptoms.

This is the SYMPTOMVERSION,

or rather, this is the MEDRA database.

It's how the medical coders code the symptoms.

So there's some degree of uniformity nationally

and also internationally.

This is the MEDRA version for this entry right here.

And there are only two versions that I've run across so far

in my work with VAERS , and that's 24, one and 25.

So we have symptom one, we have symptom two, chest pain.

We have symptom three, the heart rate,

and it goes on and on.

Then we come back and we have some...

Looks a little weird here, SARS-CoV-2 and whatever in this duplicate row

with the duplicate number ending in 266,

which is really not a duplicate,

because there is only one entry out of the five in that row.

So that's a bit disconcerting.

But we're going to take care of that.

What we're going to do here, is a feature in JMP,

and it's stacking the variables.

If we wanted to do in analysis on symptoms from this table,

what we would have to do is go and run it

on each one of the columns, et cetera, et cetera.

But if we stack the columns, we won't have to do that.

So let's stack the columns.

So we come up here to, again, Tables,

and we come up to Stack,

and we select Stack.

So let's pick SYMPTOM1.

Hold on to control or command if you've got a Mac.

Up here to five, take a look at our check boxes.

We want to keep everything, again, s ave script to source table,

if we want to, and in case something goes wrong,

we may want to keep dialogue blocks open,

but I will not do that.

Now, ask us where to move the columns?

Well, t o last.

But it would be much more convenient if we moved them after the VAERS_ ID.

So I'll click that radio button,

click the VAERS_ID,

I'll put table name, and we'll just label that,

how about SYMPTOMS STACKED?

We'll come up here, put them in,

five, see if there's anything else.

Oh, yeah, new column names.

If you're stacking several columns,

which is often a case,

especially if you're trying to pull in some data from a PDF file

that was made for human consumption,

that could be an issue here.

But for right now,

we'll just leave the stacked column labelled, D ata

and the source column labeled, Label,

because it makes sense.

And one final check,

I hit the OK button, and there it is.

Well, let's take a look at that one.

Here we go, the VAERS_ID,

and sure enough, we have the label,

and here's the data.

Here we have our favorite row, the one that ends in 266.

Let's take a look at that.

What happened?

One goes from 1-10.

So we have ten instances of the same row,

hence table 1, 2, 3, 4, 5,

which are filled in.

And then it starts again with SYMPTOM1,

and the rest of them are empty.

So that adds up to the 10.

Here's our MEDRA data.

That makes sense.

I know, because I did it before.

This is where we want to be.

So let's go up to Rows again.

Row Selection,

select WHERE.

In the case of the nomenclature here sounds an awful lot like SQL,

that is what JMP is doing under the hood.

We want to select some rows.

So let's pick CHEST PAIN,

We'll leave that all in caps

and we come down here again.

There's this little check box, hidden way down here,

this Match case.

I don't want to match case,

because I don't know how people enter the data

and how the good people at the CDC,

clean the data before they posted it on the website.

But by the way, the VAERS data set

is not instantaneous loaded up.

It doesn't go there.

The CDC usually updates it about once a week,

and they clean the data, then update it.

So Match case, CHEST PAIN.

It should say something about the dropdown.

We want equals,

but we could use does not equals, or whatever it is that suits our purpose.

So let's click, OK.

Wait a minute.

Data equals.

That should do it.

There it is.

Here I hover over it, it says SYMPTOM STACK .

And we look down here at rows again,

I find myself referring to that quite frequently,

just to get an idea what's going on.

We started off wit h 890,000 plus rows

and we have 3,667 that were selected.

Let's subset.

Go up here to Tables again,

Stack, where it is Subset,

we will select subset.

Of course, we get our pop up window

that tells us what subset is going to do for us.

And we click on that.

Let's see, we went to Selected Rows.

We could link that to the original data table,

save script to the table, Subset of SYMPTOMS STACKED .

Let's go with our convention,

and we'll call that CHEST PAINS

of SYMPTOMS STACKED.

I want to make sure I did everything right.

No.

I want to call that CHEST PAIN.

I hope that's right.

Click on the OK button, and there it is.

We'll take a look here.

Notice we only have one row with 266

and we have all our other rows with CHEST PAIN in it,

and we've got 3,666 selected.

Now let's go back to the data,

do a bit more analysis after we did all that cleaning.

VAX join DATA.

That looks right.

Let me drag that over there.

Let's look at a couple of different variables,

or rather the graph.

This time, let's do the summary tables.

So let's go up to Tables and Summary's at the top.

And this is what it looks like.

Well, let's look at... What do we want?

Age and HOSPDAYS and NUMDAYS.

I am going to put that in here,

but before I can do that,

JMP wants us to tell it what statistic to use.

Let's use the mean.

I selected mean from dropdown,

and it's going to give us the mean of those three columns.

Now there are some other columns here that we have an interest in.

I'd say, well, DIED .

Yeah, probably interested in whether or not somebody died.

See what's going on there.

So we'll select that.

Now we can't take a mean.

It's a binary categorical variables or variable, rather.

So let's select N and see .

No, wait, let's look at STATE.

Where was it?

Where was STATE?

There it is.

Now STATE, we have 50 plus

and we want a summary table .

So 50 plus summary table.

Let's put STATE into group.

And again, we can save the script,

but we won't do it here.

One final check, that looks right.

We hit the OK button.

In here is our summary table.

Notice that STATE again, we have some serious cleaning to do.

There's no state here,

State GU .

There's state MH , again.

Some may have missing values, et cetera, et cetera.

And it gives us a mean age in years for all those states.

And it looks like the mean age there, is somewhere in the 40 s thfough to 50 s.

Come over here.

Let's take a look at the mean hospital days by state.

And that makes sense.

5, 6, 7 days.

Looks like we're all hovering about a week.

Just taking a quick look at it.

I don't see any outliers.

And come over here.

Do the same with NUMDAYS .

And that's the number of days between vaccination

and when they say the effect appeared,

and let's see, number died.

One thing we want to notice is right here.

Fortunately, there are not a whole lot of people, thousands dying.

There's a couple with a hundred here,

but notice here, N(DIED) with the blank state.

Apparently there's a lot of state with number of died missing.

So if we want to analyze that in a bit more detail,

we're going to have to be careful there,

because there's a lot that can't be assigned to a particular state.

And what else can we do?

Let's just take a look at all that.

And we see our states again,

these five plus 254.

We have the number of rows up here.

Let's see, what else can we look at?

Mean(AGE ),

haven't mentioned it yet, but JMP is interactive.

So I click on that bar, and let's see,

here's some people up here in the old age range,

older age, I should say, excuse me, or people

I can come up here.

It's a little hard select that one, maybe.

I don't see anything that sticks out at me.

Fortunately, number died, it's pretty big bar at zero.

So that's good.

And the mean for NUMDAYS

between the vaccine

and yet first event is rather low, right here.

We'll just leave it at that.

I see that I am just out of time.

So what did we do?

We looked at a large online database.

We were able to download the ZIP file on to our desktop,

open them up in JMP.

We were able to do some rudimentary analysis

after spending a lot of our time data cleaning.

Notice, even though that we spent a lot of our time cleaning the data,

we were able to do it in JMP, which of course, is very convenient.

Because, number one, we didn't have to switch,

I'll switch here, switch there, run some SQL code and bring it back in.

Number two, we could do our analysis

in line as a bar while we're cleaning the data.

We say, "Oh, that looks interesting."

We went to our Graph Builder,

took a look at the data, see if anything peaked our curiosity

or if anything was out of place.

When we finished our examination, we could continue with our data cleaning.

In this case, we went on to the SYMPTOMS.

So that concludes my presentation.

I hope everyone enjoyed it and hopefully learn something.

Maybe I should put a disclaimer here at the end.

This is the way I would do it.

After using JMP for a few years,

it's not necessarily the way you should do it ,

and not necessarily the best way to do it.

But using JMP, we were able to bring down some data and analyze it, clean it.

It's some data that's quite a bit in the news right now.

I thank you all for watching.