A Model for COVID-19 Vaccine Adverse Reaction (2021-US-30MP-920)

Level: Intermediate

Suling Lee, SMU, Singapore Management University

COVID-19 vaccines play a critical role in the attempt to curb the global pandemic that is causing surges of infections and deaths globally. However, the unprecedented rate at which they were developed and administered has raised doubts in the community about their safety. Data from the United States Vaccine Adverse Event Reporting System (VAERS) has the potential to help determine whether the safety concerns about the vaccines are founded. As such, this paper uses a combination of structured and unstructured variables from VAERS to model the adverse reactions to COVID-19 vaccines.

The severity of the adverse reaction is first derived from the VAERS variables that describe the vaccine recipient's outcome following a reaction. Next, unstructured data in the form of text describing symptoms, medical history, medication, and allergies are converted into a Document Term Matrix (DTM); these, combined with the structured variables, are used to build a model that predicts the severity of the adverse reaction. The explanatory model is built in JMP Pro 16 using Generalized Regression models and a binary DTM, with model evaluation based on the RSquare value of the validation set.

The optimal model is a Generalized Regression model using the Lasso estimation method with a binary DTM. The key determinants of the adverse reaction from the optimal model are the number of symptoms, the period between vaccination and onset, how the vaccine is administered, the age of the patient, and symptoms related to cardiopulmonary illness.

Auto-generated transcript...

Speaker	Transcript
Peter Polito	How are you doing today.
	Can you hear me.
	If you are speaking, I am unable to hear you.
	Hello.
	test test.
	Oh Hello Leo can you hear me.
	yeah sorry about that I think of the technical difficulties yeah.
Peter Polito	Oh no problem at all.
	Oh okay it's a nice Jimmy.
Peter Polito	Oh, where.
	Are you calling in from.
	Singapore.
Peter Polito	Singapore all right, how how late, is it there.
	About 909.
Peter Polito	yeah well.
	yeah kudos to you for.
	hanging on.
	so late thanks for making it.
	Possible no it's all right um yeah I hope I do a good one.
	Oh.
	No i've been like high stress about this Oh well, yeah i'm Okay, let me just put on a virtual background.
	Okay.
Peter Polito	And I gotta go through just a couple things on my end before we officially start.
	Okay.
Peter Polito	just give me a.
	minute here.
	yeah.
Peter Polito	I only bring in my checklist make sure I do everything correctly here.
	alright.
	So just to confirm.
	You are Suling Lee.
	Yeah, and your talk is titled A Model for COVID Vaccine Adverse Reaction.
	Yes, is that correct yeah that's right.
Peter Polito	All right, and then just to make sure you understand, this is being recorded for use in the JMP Discovery Summit conference and will be available publicly in the JMP User Community. Do you give permission for this recording and use?
	Yes, okay great.
	your microphone sounds good, I don't hear any background noises is your cell phone off and all that kind of thing anything that might make some random noises.
	hang on.
	i'll send it will find them yeah.
Peter Polito	Okay, and then.
	We need to check the can you go and share your screen and we'll go through and check the resolution and a few other things.
	Okay sure.
Peter Polito	Thank you.
	I'm sorry, is it Lee Suling or Suling Lee?
	My first name is Suling, yeah.
Peter Polito	got it okay.
	um is it okay.
Peter Polito	I don't see it yet oh um you know pie covering it just a moment.
	That looks good.
	And if you go to the bottom of your screen does your taskbar actually I don't see your taskbar so we're good.
	Okay.
Peter Polito	And let me make sure.
	It are any programs that might create a pop up like outlook or Skype or any of those are those all closed down and quit.
	um yeah, I think, so I bet the checking on them.
	close my kitchen.
	Okay.
	yeah good.
Peter Polito	Okay, and then, are you going to be working just from a PowerPoint or will you be showing JMP as well?
	I was worried about the transitions, so I will be presenting it from PowerPoint.
Peter Polito	Okay, great, then I am going to mute and turn off my camera we are already recording so as soon as we.
	As soon as you see my picture go away go ahead and start and i'm not going to interrupt for any reason and we'll try and go through it's a 30 minute presentation so let's go through, and I won't even be here it'll be like you're talking to yourself.
	Okay, so that the Minutes right okay.
Peter Polito	Are you ready to go.
	yeah.
Peter Polito	Okay.
	All right, and just so you know, when we actually
	have the Discovery Summit, if you realize tomorrow that you misspoke or you wanted to present something in a slightly different way,
	you can be live when your presentation is going, and you can ask the person presenting to pause it, and then you can say, you know, "I'm about to say this; what I wanted to convey is that." You can kind of
	edit in real time during the presentation, so don't stress about getting every word perfectly. Just relax and go through it, and I'm sure it will be just fine.
	All right.
	yeah.
Peter Polito	All right, i'm gonna mute and turn off my camera and then you go ahead and begin okay.
	Thanks Peter.
	Hello, I'm Suling. I'm a master's student at Singapore Management University where I'm currently pursuing a course in data analytics at the School of Computing and Information Systems.
	So I'm actually here today to present an assignment that has been submitted for my master's in IT for Business program and, more importantly, I want to share my JMP journey so far.
	So I started using JMP this year and I really fell in love with it because of the ease of use and the range of statistical methods and the visualizations that I could do on it.
	So, as a beginner using JMP, I'm really honored to be here presenting my report, and do let me have your feedback, because I feel that I have so much more to learn.
	So the motivation for my paper was to look at the COVID-19 vaccines. We know how important they are, but
	the unprecedented rate at which they were developed and administered has raised some doubts in the community regarding their safety.
	So we are using data from the United States Vaccine Adverse Event Reporting System, VAERS.
	We are using data from there because we find that it has the potential to help determine if the safety concerns about the vaccines are founded. So this paper makes use of both the structured and the unstructured data from VAERS to model the adverse reactions to COVID-19 vaccines.
	So what is VAERS? The Centers for Disease Control and Prevention and the US FDA have this system, and it is an adverse event
	reporting system that collects data. But generally, VAERS data cannot be used to determine causal links to adverse events,
	because the link between the adverse event and the vaccination is not established. What we see here is that people report the events, but there is no follow-up action
	to confirm that the symptoms and events reported have any link to the vaccine. So why do we still want to use this data? Firstly, the data is available in the public domain.
	The data is up to date and, more importantly, not all adverse events are likely to be captured during clinical trials, due to their low frequency.
	Clinical trials usually include only healthy individuals, so special populations, like those with chronic illnesses or pregnant women, are
	limited. So VAERS is known to be an important source for vaccine safety. For more information about it, you can look up this link over here.
	So the data set used for this study comes from three data tables that are extracted from VAERS. The first one is the VAERS data.
	It mainly contains information about the patient's profile and the outcome of adverse events. What I have here is a little clip from JMP
	where we have the symptoms text, and you can see that this is just one report, based on one patient. And the data is quite dirty.
	There is a lot of useful information in the narrative text, but you can see that there are spelling errors, typos, and excessively long or very brief statements.
	The next two data sets that we have are the VAERS vaccination data and the symptoms. One contains information regarding the vaccine; the other one
	is extracted from the symptoms text that we can see. Given this accessibility, VAERS data has been mined
	by quite a number of researchers, but, as you can see, the data is very challenging to use, as the quality of the reports varies.
	And there are also some reports that might not be genuine. A review of the papers shows that some form of manual screening is usually employed to extract the required information.
	However, this is also quite labor intensive and quite difficult, so this paper aims to showcase methods to extract the key information using text analysis techniques in JMP
	and to build an explanatory model that explains the most important variables involved in these events. So what we did is that, for each of these data tables over here, we cleaned them individually and then joined them using the VAERS ID.
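As a rough illustration of the join step described above (the talk does this in JMP, not in code), the following Python sketch merges three small in-memory tables on a shared `VAERS_ID` key. The column names and values are assumptions made up for the example, and the symptoms table is simplified to one symptom per row.

```python
from collections import defaultdict

# Tiny stand-ins for the three VAERS tables; all values are invented.
data_rows = [
    {"VAERS_ID": 1, "AGE_YRS": 34, "SYMPTOM_TEXT": "headache and chills"},
    {"VAERS_ID": 2, "AGE_YRS": 71, "SYMPTOM_TEXT": "cardiac arrest"},
]
vax_rows = [
    {"VAERS_ID": 1, "VAX_ROUTE": "IM"},
    {"VAERS_ID": 2, "VAX_ROUTE": "IM"},
]
symptom_rows = [
    {"VAERS_ID": 1, "SYMPTOM": "Headache"},
    {"VAERS_ID": 1, "SYMPTOM": "Chills"},
    {"VAERS_ID": 2, "SYMPTOM": "Cardiac arrest"},
]


def join_on_vaers_id(data_rows, vax_rows, symptom_rows):
    """Return one merged record per report, keyed on VAERS_ID."""
    vax_by_id = {row["VAERS_ID"]: row for row in vax_rows}
    symptoms_by_id = defaultdict(list)
    for row in symptom_rows:
        symptoms_by_id[row["VAERS_ID"]].append(row["SYMPTOM"])

    joined = []
    for row in data_rows:
        vid = row["VAERS_ID"]
        merged = dict(row)
        merged.update(vax_by_id.get(vid, {}))
        merged["SYMPTOMS"] = symptoms_by_id.get(vid, [])
        # Number of symptoms is one of the structured variables in the talk.
        merged["N_SYMPTOMS"] = len(merged["SYMPTOMS"])
        joined.append(merged)
    return joined


joined = join_on_vaers_id(data_rows, vax_rows, symptom_rows)
```

Each merged record then carries the patient profile, the vaccine information, and a symptom count side by side, which is the shape the later modeling steps need.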
	What we did, based on the patient outcomes, was to derive something called a severity rating; I'll talk about this a bit later. Once the tables are joined,
	there are four narrative texts: one each on allergies, medication, medical history, and symptoms.
	Then we use text analysis techniques to extract the vectors for the top terms that...will explain the severity rating for each of the text variables,
	and join them with the existing structured variables in the data set. Then all this is compiled together and put into model building.
	Okay, so what is the severity rating all about? It's based on the patient's outcome. The VAERS data has 12 variables that describe the status of the patient.
	Based on this, we extracted the variables and tried to make sense of them, so we came up with four levels of severity, and we call this the severity rating.
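The talk does not spell out how the outcome variables map to the four levels, so the sketch below is only a hypothetical rule to show the general shape of such a derivation. The flag names and the ordering of the rules are assumptions, not the author's actual severity rating.

```python
def severity_rating(outcome):
    """Map a dict of outcome flags to a rating from 1 (mild) to 4 (most severe).

    HYPOTHETICAL rule for illustration only; the flag names and thresholds
    are assumptions and are not taken from the talk.
    """
    if outcome.get("DIED") or outcome.get("L_THREAT"):
        return 4
    if outcome.get("HOSPITAL") or outcome.get("DISABLE"):
        return 3
    if outcome.get("ER_VISIT") or outcome.get("OFC_VISIT"):
        return 2
    return 1


ratings = [
    severity_rating({"DIED": True}),
    severity_rating({"HOSPITAL": True, "ER_VISIT": True}),
    severity_rating({"OFC_VISIT": True}),
    severity_rating({}),
]
```

Checking the most severe outcomes first means a report with several flags set receives the highest applicable level.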
	So next we will talk a little bit about how we use the
	JMP Pro Text Explorer platform for text analysis, and we start off with the data cleaning. What we wanted to do was to extract the significant terms from the text data
	and add them to the structured variables to build the model. As you can see, the text data is quite messy, so what we did was,
	first of all, decide what kind of term frequency to use, and the binary term frequency was selected, as the data shows that there's a significant advantage in using it.
	So next, a little bit about the cleaning. The text data was first organized using the JMP Pro Text Explorer, and we used a useful feature in there to add phrases and automatically identify
	the terms. So what you can see is that terms like "white blood cells" are kept as a phrase instead of being split into "white", "blood", and "cells", which would not make much sense.
	We used a few other methods as well. One is stemming for combining, which stems the words based on the word endings, and then we also sorted the list alphabetically in order to recode misspelled words, typos, or words that are similar.
	And then the next thing was to use the very handy Recode function to recode all the similar items together.
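The cleaning steps above (recoding variants and keeping multi-word phrases together) can be sketched in plain Python as follows. This is not how Text Explorer works internally; the recode table and phrase list are invented for the example, and stemming is left out to keep it short.

```python
import re

# Hypothetical recode table: misspelling or variant -> canonical term.
RECODE = {"head ache": "headache", "headach": "headache", "feaver": "fever"}
# Hypothetical phrase list: multi-word terms to keep as a single token.
PHRASES = ["white blood cells"]


def tokenize(text):
    """Lowercase, recode known variants, protect phrases, then split into terms."""
    text = text.lower()
    for variant, canonical in RECODE.items():
        # Word boundaries stop "headach" from matching inside "headache".
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return re.findall(r"[a-z_]+", text)


tokens = tokenize("Feaver, head ache; low White Blood Cells")
```

Joining the phrase with underscores before splitting is just one simple way to stop it from being broken into separate words.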
	So the next thing we did after cleaning the text was to look at the
	word cloud. The word cloud is useful for stop word exclusion and to see the effect of the target variable on the terms.
	What I did over here was to visualize the most frequent terms by size and color them based on the severity rating.
	You can see that the lighter colors belong to the less severe cases and the darker ones are the most severe ones, and you can pick them out, although the words are quite small.
	It really shows that the common symptoms are not serious, but we picked up terms like cerebrovascular incident, pulmonary embolism, and things like that. These are related to the most severe adverse events.
	The next thing we used was Term Selection. Term Selection is a new feature in JMP Pro 16, which was quite timely.
	It integrates the generalized regression model into the text analysis platform, so from the text analysis platform you can just select Term Selection.
	It allows the identification of key terms that are important to the response variable. Our response variable is the severity rating.
	So why use the generalized regression model? It is widely used for non-normally distributed or highly correlated variables,
	where the observations are independent of each other and show no other correlation. This method is useful for us because it fits our data set:
	each row that we have inside our data table is a patient, and all of those are independent of each other.
	The most important thing is also that the generalized regression model allows for variable selection, which is what we want to do, because we want to pick up the variables with the highest influence on the response variable.
	So, a little more detail about this regression model: there are a few options that we can use over here, namely the elastic net and the lasso.
	The difference between these is that the lasso tends to select one term from a group of correlated factors, whereas the elastic net will select a group of terms. So generally, I think that the elastic net is used, and this is where our choice of the binary term frequency came in.
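To make the lasso versus elastic net contrast concrete, here is a minimal coordinate-descent sketch in plain Python. It assumes a squared-error loss with the penalty `lam * (alpha * |b| + (1 - alpha) / 2 * b**2)`, which is a simplification: it is not JMP's Generalized Regression implementation, and the names `lam` and `alpha` and the toy data are assumptions for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_penalized(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for least squares with an elastic net penalty.

    alpha = 1.0 gives the lasso; 0 < alpha < 1 mixes in a ridge term.
    No intercept is fitted, so the toy data below is centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of column j with the partial residual
            z = 0.0    # mean square of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
    return beta


# Two highly correlated columns; the response depends on the first one.
X = [[1, 1.2], [2, 1.8], [3, 3.0], [-1, -1.2], [-2, -1.8], [-3, -3.0]]
y = [2, 4, 6, -2, -4, -6]

lasso_beta = fit_penalized(X, y, lam=0.1, alpha=1.0)
enet_beta = fit_penalized(X, y, lam=0.1, alpha=0.5)
```

With this toy data the lasso fit keeps only the first column and leaves the second coefficient at exactly zero, while the elastic net fit gives both correlated columns a nonzero coefficient.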
	Okay, so this is the result of the term selection. Over here it shows you the overview of the (???) and then generalized (???), but more interestingly,
	when you sort by the coefficient you can see the top positive coefficients. These are the top factors that play the biggest role in terms of our response variable.
	And these over here are the symptoms that play the least role when sorted according to the coefficient. Looking at the results, you can see that cardiac arrest, COVID-19 pneumonia, and cerebrovascular accident are among the terms that affect the response variable.
	So we can see that the terms of a more serious nature are related to the heart and lungs, versus the low-frequency ones, which are really very mild symptoms.
	Okay, so we repeat this whole process
	for all of the other text variables. We have gone through the example for symptoms; there are also the allergies, the medical history,
	and the medications that are used. What we did later was to save the document term matrix. Saving the DTM basically saves a column to the data table for each term.
	You can see an example over here. You mainly see a lot of zeros because it's a very sparse matrix, so, taking this column over here, a one will indicate the presence of (???).
	So we save it and repeat the process for the other text variables.
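A binary document term matrix of the kind described above can be sketched in a few lines of Python. The documents and vocabulary here are made up; in the talk, the vocabulary comes from the terms chosen by Term Selection.

```python
def binary_dtm(documents, vocabulary):
    """Return one row per document with 1 if the term occurs at all, else 0."""
    rows = []
    for tokens in documents:
        present = set(tokens)
        rows.append([1 if term in present else 0 for term in vocabulary])
    return rows


# Already-tokenized toy reports; note "fever" appears twice in the first one.
docs = [["fever", "headache", "fever"], ["cardiac_arrest"], []]
vocab = ["cardiac_arrest", "fever", "headache"]
dtm = binary_dtm(docs, vocab)
```

Because the weighting is binary, a term that appears several times in one report still contributes a single 1, and most cells stay 0, which is why the saved columns look so sparse.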
	Once we had all these terms saved, we moved on to modeling. The first step for modeling was to build a validation column, so we went to Predictive Modeling and Make Validation Column. Over here we selected the choices and set the validation set to 55%.
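For readers outside JMP, a validation column is simply a per-row label that assigns each record to a training or validation set. The sketch below shows one simple random assignment in Python; the function name, the fraction used, and the lack of stratification are all assumptions for illustration, not the settings used in the talk.

```python
import random


def make_validation_column(n_rows, validation_fraction, seed=0):
    """Return a shuffled list of 'Training' / 'Validation' labels, one per row."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_validation = round(n_rows * validation_fraction)
    labels = ["Validation"] * n_validation + ["Training"] * (n_rows - n_validation)
    rng.shuffle(labels)
    return labels


# Illustrative numbers only.
validation_column = make_validation_column(n_rows=1000, validation_fraction=0.3)
```

With an unbalanced response such as the severity rating, a stratified split (sampling within each rating level) would be a natural refinement of this sketch.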
	The whole aim here was to identify the important variables with severity as the response variable. All in all, we have seven structured variables and 55 that were derived from the document term matrix, and a total of about 31,000 rows.
	What we see is that there's an imbalance, because the severity rating gives you an unbalanced data set. Because of this, we based our model evaluation on comparing the RSquare and the AIC values.
	So.
	We used Fit Model in JMP, choosing the generalized regression model again.
	These are the results here. Several models from the group of generalized linear models, using the penalized regression techniques, were prepared.
	Then we tried to fit them with the various penalized estimation methods listed over here.
	Of all the models, we can see that the lasso method has the highest RSquare value, and there are other values that are quite close as well, so we are going to take a closer look at them.
	Comparing the maximum likelihood model and the lasso model based on the ROC curves, you can see that the two sets of ROC curves are quite similar.
	However, the ROC curve for the maximum likelihood model shows that
	the ROC area is higher for the highest severity rating, and you can see that there's only a slight difference between the two.
	In general, as you go down the severity rating, the area does increase, and one of the reasons is that our data set is very unbalanced.
	The severity rating of four, which is the highest level, the most severe level, is only about 5% of the total data. So overall there is very little difference between the two, and we choose the one with the slightly larger area.
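The per-level ROC areas compared above can be understood as one-vs-rest areas: for a given severity level, how often does the model score a row of that level higher than a row of any other level? Here is a minimal Python sketch of that calculation, with invented scores and labels.

```python
def auc_one_vs_rest(scores, labels, positive_class):
    """Area under the ROC curve for one class against all the others.

    Computed as the fraction of (positive, negative) pairs in which the
    positive row has the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive_class]
    neg = [s for s, y in zip(scores, labels) if y != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative row")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: predicted probability of severity level 4 for six reports.
labels = [4, 1, 1, 2, 4, 1]
scores = [0.9, 0.1, 0.4, 0.3, 0.35, 0.2]
auc_level_4 = auc_one_vs_rest(scores, labels, positive_class=4)  # 7 of 8 pairs ranked correctly
```

When a level makes up only a small share of the rows, as with the most severe rating here, its area rests on few positive rows, so small differences between models should be read with caution.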
	Okay, so next we turn to the effects test. In the report, you can choose to see the effects test. The effects test is
	the hypothesis test of the null hypothesis that the variable has no effect on the response.
	There was this very nice explanation of the effects test on the JMP Community, I think it was contributed by
	Mark Bailey. So he talks a bit about how the effects test is actually based on the Type III sum of squares for ANOVA.
	The effects test is very suitable here because of our data set: it tests every term in the model in light of every other
	term in it. So the main effects are tested in light of the interactions between the items and in light of the other main effects as well.
	What we see here is that the effects test is useful for our purpose, as it is meant for model reduction,
	and it allows us to draw inferences from the long list of significant variables.
	We look at the Prob > ChiSquare (???) lowest ChiSquare value, taking a cutoff alpha value of 0.1.
	We have a number of independent variables, so it's quite a long list, and as you look through them, most are related to cardiopulmonary illnesses.
	Some of them are the expected ones, like the number of symptoms and the number of days between vaccination and onset.
	(???) is the mode in which the vaccine is administered, and age, and then you can see that the rest of them are related somehow or other to
	cardiopulmonary illnesses. There are some strange ones; I don't come from a medical background, so I don't really understand them either, but you can see that deafness is one of them. So there are some
	strange results that we can see over here, but in general that's the picture in terms of the top variables.
	Okay, besides this, what is really interesting is to look at the model evaluation. Even though what we're doing is building an explanatory model,
	I went to look at the predictive model as well, because JMP has a very nicely laid-out report there for me to look at the parameter estimates.
	So I used the Profiler to try to understand the parameter estimates, and you can see over here that the values shown are really, really small. This is the value that you get
	immediately when you open up the Profiler, so the values here are the average of each variable, and you can see that
	each of these values is very small, which means that there's very little effect on the severity
	based on these coefficients of the predictor variables. So based on this study, you can tell that (???) symptoms and its effect has very, very little effect on severity, and this is
	within expectation. Because we are looking at the general picture of the vaccine, we can see that most of these symptoms, medical history, and allergies have very little effect on the outcome of the vaccine.
	Okay so.
	yeah.
	So, a few statements by way of conclusion. Several decisions were made in the grouping and classification of variables.
	Although these decisions were made to the best of our understanding, especially in the way in which we came up with the severity rating,
	an expert familiar with vaccine studies or clinical trials perhaps needs to be consulted as to whether or not the severity rating is sufficient
	to score the adverse event outcomes. Based on the model building with structured and unstructured data,
	we have identified key factors that vary with the severity of a reaction to a COVID-19 vaccination.
	However, we still note that the effect of these key variables on the response variable, severity, is very small; this is seen by looking at the variables.
	And then, finally, the document term matrix based on the binary term frequency was found to be the most effective in representing the weights of the terms in the document,
	and the generalized linear model with the lasso penalized regression technique produced the optimal model.
	So I hope you enjoyed the very short presentation and do let me know if you have any questions or any feedback, thank you very much.
Peter Polito	great job.
	was very.
	Oh no.
	I just realized that i'm you know mistake we wanted this life oh gosh.
Peter Polito	So that this is the exact situation where well you're.
	So at the actual Discovery Summit there's going to be a presenter.
	And so they're going to reach out to you ahead of time, and you just say, "I have a mistake on one of my slides," and so when this part comes up, he or she will pause it.
	And then you can share the slide and talk about it and then go right back to the video, so don't worry about it at all.
	This happened quite a bit during last year's Discovery Summit; it's not a problem at all.
	Okay okay.
	Well gosh oh James.
Peter Polito	Would you like to fix it and redo it would that make you feel better.
	I don't know.
	If it's actually this one here, because this is the wrong box, it will take me a while to actually fix it because I need to retype it over yeah.
Peter Polito	yeah then it didn't know don't worry about it it'll be a real easy fix and you can do it in real time.
	Okay okay.
	Okay yeah.
	Thanks for sitting, through it, though.
Peter Polito	yeah no problem is great, I really.
	Okay.
Peter Polito	All right, any other questions or comments or anything.
	Yeah, I've got one. There was supposed to be a link where I can upload all of my slides and my paper and things like that, but there was a mix-up with my
	email, and I think when Tanya replied to me about that, she missed out that link. So I thought that the link would be embedded inside one of the recordings, but I don't suppose you guys have got it, right?
Peter Polito	I don't have it, but I will reach out to Tanya and ask her to reach out to you directly.
	To help remedy that.
	Okay, sure, thanks very much. I think there was a mix-up with my email, yeah.
Peter Polito	Thanks very much.
	No problem alright well have a have a good night and good rest.
	yeah you have a good day.
Peter Polito	Thank you.
	So much bye bye.

jay1 · ‎08-03-2021

Hi, I don't see any data/model/presentation here... did you have a link you could share?

Jeff_Perkinson · ‎08-03-2021

Hi @jay1, this post, and the others in this library are placeholders for the presentations to come at Discovery Summit Americas in October. The authors will update these posts between now and then. Keep an eye out here and I hope you’ll be able to join us for the conference online October 4-7.