Choose Language Hide Translation Bar
Level III

How to do Exploratory Data Analysis on the VAERS Vaccine Dataset using JMP (2020-US-EPO-611)

Level: Intermediate


Stanley Siranovich, Principal Analyst, Crucial Connection LLC


Much has been written in both the popular press and in the scientific journals about the safety of modern vaccination programs. To detect possible safety problems in U.S.-licensed vaccines, the CDC and the FDA have established the Vaccine Adverse Event Reporting System (VAERS). This database system now covers 20 years, with several data tables for each year. Moreover, these data tables must be joined to extract useful information from the data. Although a search and filter tool (WONDER) is provided for use with this data set, it is not well suited for modern data exploration and visualization.

In this poster session, we will demonstrate how to use JMP Statistical Discovery Software to do Exploratory Data Analysis for the MMR vaccine over a single year using platforms such as Distribution, Tabulate, and Show Header Graphs. We will then show how to use JMP Scripting Language (jsl) to repeat, simply and easily, the analysis for additional years in the VAERS system.



Auto-generated transcript...




Stan Siranovich Good morning everyone. Today we're going to do a exploratory data analysis
of the VAERS database. Now let's do a little background on what this database is. VAERS, spelled V-A-E-R-S, is an acronym for Vaccine Adverse Effect Reporting System. It was created by the FDA and the CDC.
It gets about 30,000 updates per year and it's been public since 1990 so there's quite a bit of data on it.
And it was designed as an early warning system to look for some effects of vaccines that have not previously been reported.
Now these are adverse effects, not side effects, that is they haven't been linked to the vaccination yet. It's just something that happened after the vaccination. Now let's talk about the structure.
VAERS DATA. Now there is a tool for examining the online database and it goes by the acronym of WONDER. And it is traditional search tool where you navigate the different areas of the database,
select the type of data that you want, click the drop down, and after you do that a couple of times, or a couple of dozen times,
what you do is send in the query.
And without too much latency, get a result back. But for doing exploratory data analysis and some visualizations,
there's a slight problem with that. And that is that you have to know what you want to get in the first place, or at least at the very good idea.
So that's where JMP comes in. And as I mentioned, we're going to do an EDA and some visualization on on specific set of data, that is data for the MMR vaccine for measles, mumps, and rubella. And we're going to do for the most recent full year available, which will be
2019. So let me move to a new window.
Okay, the first thing we did and which I omitted here was to download the CSVs and open them up in JMP. Now I want to select my data and JMP makes it very easy.
After I get the window open, I simply go through rows, rows selection and select where and down here is a picture that I want the VAX_TYPE and I wanted it to equal MMR. Now there's some other options here besides equals, which we'll talk about in a second.
And after we click the button, and we've selected those rows, the next thing we want to do is decide on which data that that we want. So I've highlighted some of the columns and in a minute or so you'll see why.
And then when I do that, oh, before we go there, let's note
row nine and row 18 right here. Notice we have MMRV and
MMR. MMRV is a different vaccine. And if we wanted to look at that also, we could have selected contains here from the drop down. But that's not what we wanted to do. So we click OK
and we get our table. Now what we want to do is join that VAERS VAX table which contains data about the vaccine, such as a manufacturer, the lot and so forth with the VAERS DATA table, which contains data on on the
effects of vaccine, so it's it's got things like whether or not the patient had allergies, whether or not the patient was hospitalized, number
of hospital days, that sort of thing. And it also contains demographic data such as age and sex. So what we want to do is join and simply go to tables join and we select
The VAERS VAX and VAERS DATA tables and we want to join them on the VAERS ID. And again, JMP makes it pretty easy. We just click
the column in each one at one of the separate tables and we put them here in the match window and after that we go over to the table windows and we select the columns that we want.
And this is what our results table looks like. Now let me reduce that and open up and JMP table.
There we go, and I'll expand that.
And for the purposes of this demonstration I just
selected these...these columns here. We've got the VAERS ID, which you see identification obviously, the type which are all MMR.
And looks like Merck is the manufacturer. And there's a couple of unknowns
scattered through here. And I selected VAX LOT, because that would be important if there's something the matter with one lot, you want to be able to see that.
This looks like cage underscore year, but that is calculated age in years. There are several H columns and I just selected one.
And I selected sex because we'd like to know if somebody is is more affected, if males are more affected than females or vice versa.
And HOSPDAYS is the number of days in the hospital if they had an adverse effect that was severe enough to put them into the hospital. And NUMDAYS is the number of days between vaccination and the appearance of the adverse effects and it looks like we we have quite, quite a range right here.
So let's, let's get started on our analysis.
show header graphs. So I'm going to click on that and show header graphs.
And we get some distribution, and some other information up here. We'll skip the ID and see that the VAX_TYPE is all MMR, you have no others there.
And the vax manufacturer, yes, it's either a Mercks & Co Inc or unknown and one nice feature about this is we can click on the bar
and it will highlight the rows for us and click away and it's unhighlighted. Moving on to VAX_LOT, we have quite a bit of information squeezed into this tiny little box here.
First of all, we have the top five lots in occurrence in in our data table and here they are, and
here's how many times they appear. And it also tells us that we have 413 other lots included in table, plus five by my calculation,
that's something like 418 individual lots. Now we go over the calculated age in years and in we see most of our values are between zero and whatever, they're during zero bin, which makes sense because it is a vaccination and we'll just make a note of that.
And we go over to the sex column and it looks like we have significantly more females than males. Now, that tells us right away if we want to do, side by side
group comparisons, we're going to have to randomly select from females, so that they equal the males and we also have some unknowns here, quite a few unknowns.
And we simply note that and move on. And we see hospital days. And we've see NUMDAYS. Now here's another really really nice feature. Let's say we'd like more details and we want to do a little bit of exploration to see how the age is distributed, we simply right click, right click,
select open in distribution. And here we are in the distribution windows, but quite a bit of information here.
For our purposes right now, we don't really do much here about the quantiles. So let's click close and it's still taking up some space.
So let's go down here and select outline close orientation and let's go with vertical. And we're left with a nice easy to read window. It's got some information in there. We of course see our distribution down here and we've got a box and whisker plot up here.
There's not a whole lot of time to go into that, that, that just displays data in a different way. And we see from our summary statistics that the mean happens to be 16.2, with the standard deviation 20.6.
Not an ideal situation. So if you want to do anything more with that, we may want to split the years in two groups where most of them are down here and and then
where, where this, where all the skewed data is and then the rest of them and
along the right and
examine that separately. And I will minimize that window and we can do the same with hospital days and number of days. And let me just do that real quick.
And here we see the same sorts of data and I won't bother clicking through that and reducing it. But we might note also when again we have the mean of 6.7 and standard deviation of 13.2, again, not a very ideal situation and we simply make note of that.
And I will close that. Now let's say we want to do a little bit more exploratory analysis, something caught our eye and all that. And that is simple to do here. We don't have to go back to the online database,
and select through everything, click the drop downs, or whatever. We can simply come up here to analyze and fit Y by X.
So let's say that we would like to examine the relationship between
oh, hospital days, number of days spent in the hospital and calculated age in years. We simply do that. We have two continuous variables so we're going to get a bivariate plot out of that.
We click OK.
And we get another nice display of the data. And yes, we can see that
currently, the mean is down around 5 or 6, which is a good, good thing better than 10 or 12. We can, for purposes of references,
go up here to the red triangle, select fit mean and we get the mean, right here. And we noticed there's quite a few outliers.
Let's say we want to examine them right now and decide whether or not we want to delve into them a little bit further. So if we hover over one of our outlier points or any of the points for that matter,
we see we get this pop up window and it tells us
that particular data point represents row 868.
Calculated age is in the one year bucket,
and this patient happened to spend 90 days in the hospital. Now we could right click and color this row or put some sort of marker in there. I won't bother doing that, but I will
move the cursor over here into the window, and we see this little symbol up in the right hand corner, click that and that pins it.
So we can, of course, repeat that. And we can get the detail for further examination. I found this to be quite handy when giving presentations to groups of people like to call attention to one particular point. That's a little bit overbearing so let's right click, font, not edit.
And we get the font window come up and see we're using 16 point font. Let's, I don't know, let's go down to 9.
And that's a little bit better and it gives us more room if we'd like to call attention to some of the other outliers. So in summary,
let me bring up the PowerPoint again.
In summary, we were able to import and shape two large data tables from a large online government maintained database. We were able to subset tables,
able to join the tables and select our output data all seamlessly. And we were able to generate summaries and distributions, pointing out the areas
that may be of interest and for more detailed analysis. And of course, that was all seamless and all occured with within the same software platform. Now,
supply some links
right over here to the various data site. This, this is the main site, which has all the documentation that the government did quite a good job there. And here is the actual data itself in the zip..