Obtain High Quality Information from FDA Online Label Repository (2020-US-30MP-62...

Wenjun Bao, Chief Scientist, Sr. Manager, JMP Life Sciences, SAS Institute Inc
Fang Hong, Dr., National Center for Toxicological Research, FDA
Zhichao Liu, Dr., National Center for Toxicological Research, FDA
Weida Tong, Dr., National Center for Toxicological Research, FDA
Russ Wolfinger, Director of Scientific Discovery and Genomics, JMP Life Sciences, SAS

Monitoring the post-marketing safety of drug and therapeutic biologic products is very important to the protection of public health. To help facilitate the safety monitoring process, the FDA has established several database systems including the FDA Online Label Repository (FOLP). FOLP collects the most recent drug listing information companies have submitted to the FDA. However, navigating through hundreds of drug labels and extracting meaningful information is a challenge; an easy-to-use software solution could help.

The most frequent single cause of safety-related drug withdrawals from the market during the past 50 years has been drug-induced liver injury (DILI). In this presentation we analyze 462 drug labels with DILI indicators using JMP Text Explorer. Terms and phrases from the Warnings and Precautions section of the drug labels are matched to DILI keywords and MedDRA terms. The XGBoost add-in for JMP Pro is utilized to predict DILI indicators through cross validation of XGBoost predictive models by the term matrix. The results demonstrate that a similar approach can be readily used to analyze other drug safety concerns.

Auto-generated transcript...

Speaker	Transcript
Olivia Lippincott	What
wenjba	It's my pleasure to talk about obtaining high-quality information from the FDA drug labeling system here at the JMP Discovery Summit.
	Today I'm going to cover four parts. First, I'll give some background on drug post-marketing monitoring and the efforts of the FDA, other regulatory agencies, and industry. Then I'm going to use a drug label data set
	to analyze the text with Text Explorer in JMP, and also use the
	JMP add-in
	XGBoost to analyze the DILI information, and then give the conclusion. The XGBoost tutorial by Dr. Russ Wolfinger is also being presented at this JMP Discovery Summit, so please go to his tutorial if you're interested in XGBoost.
	So, according to the FDA's description of the drug development process,
	it can be divided into five stages. The first two stages, discovery and preclinical research, are mainly
	animal studies and chemical screening, and the later three stages involve humans. JMP has three products, JMP Genomics, JMP Clinical,
	and JMP and JMP Pro, that cover every stage of drug discovery. JMP Genomics is the omics system that can be used for omics and clinical biomarker selection, and JMP Clinical is specifically for clinical trials
	and post-marketing monitoring of drug safety and efficacy. JMP Pro can be used for drug data cleaning, mining, target identification, formulation development, DOE, QbD, bioassay, etc. So it can be used at every stage of
	drug development.
	In drug development there is a most frequent single cause, called DILI, that can actually stop a clinical trial.
	The drug can be rejected for approval by the FDA or another regulatory agency, or be recalled once the drug is on the market. So DILI is the most frequent single cause, and you can find this information in the FDA guidance and in other scientific publications. So
	what is DILI? It is drug-induced liver injury,
	DILI for short. More than 10 years ago, in 2009, the FDA published a guidance for DILI
	on how to evaluate it and follow up. The FDA has offered multiple years of DILI training for clinical investigators, and that information can still be found online today.
	There are also conferences organized by the FDA, one just last year. And for deciding whether a subject or patient could be a DILI case, there is Hy's Law, which is included in the
	FDA guidance.
	So here's an example of DILI evaluation in a clinical trial,
	done in JMP Clinical using Hy's Law. Hy's Law is a combined condition on several liver enzymes: when they are elevated to a certain level,
	you would consider it a possible Hy's Law case, so you potentially have liver damage. Here we use color to identify the
	possible Hy's Law cases: red is yes, blue is no. The circles and triangles are from different treatment groups. We also use a JMP bubble plot to show the
	enzyme elevations over time during the clinical trial period. So this is typical. This is 15 days. Then you have the
	subjects, who started out pretty normal. Then the liver enzymes go to a very high level, which indicates they are potentially
	possible DILI cases.
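
The combined condition described above can be sketched as a simple screening rule. This is a minimal illustration, not the rule JMP Clinical applies: the talk does not state the thresholds, so the cutoffs below (aminotransferase at least 3 times the upper limit of normal, total bilirubin at least 2 times, alkaline phosphatase below 2 times) are commonly cited values used here as assumptions, and the `LiverPanel` and `possible_hys_law` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LiverPanel:
    """Lab values expressed as multiples of the upper limit of normal (ULN)."""
    alt_xuln: float        # alanine aminotransferase
    ast_xuln: float        # aspartate aminotransferase
    bilirubin_xuln: float  # total bilirubin
    alp_xuln: float        # alkaline phosphatase


def possible_hys_law(panel, aminotransferase_cut=3.0, bilirubin_cut=2.0, alp_cut=2.0):
    """Flag a possible Hy's Law case. All three cutoffs are assumed defaults."""
    hepatocellular = max(panel.alt_xuln, panel.ast_xuln) >= aminotransferase_cut
    high_bilirubin = panel.bilirubin_xuln >= bilirubin_cut
    not_cholestatic = panel.alp_xuln < alp_cut
    return hepatocellular and high_bilirubin and not_cholestatic


# Hypothetical subjects: the first would be colored "yes", the second "no".
print(possible_hys_law(LiverPanel(4.0, 1.0, 2.5, 1.0)))
print(possible_hys_law(LiverPanel(4.0, 1.0, 1.0, 1.0)))
```

Keeping the cutoffs as parameters makes it easy to match whatever definition a given protocol uses.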
	So, the FDA has two major databases that deal with post-marketing
	monitoring for drug safety. One is the drug label database, which is where we get our data. The other is the FDA Adverse Event Reporting System.
	Then, from the NIH and NCBI, there is a very actively
	built resource, LiverTox, which has lots of information dealing with DILI. And the FDA has another database called the Liver Toxicity
	Knowledge Base, an effort led by Dr. Tong, who is a co-author of this presentation. They have a lot of knowledge about DILI and built this specific database for public information.
	So, the drug label. Everybody has probably seen this when you get a prescription drug. You get that wordy piece of paper that comes with your drug.
	It comes in a teeny tiny font, and even though it's sometimes too small to read, it does contain a lot of useful scientific information about the drug.
	The two portions related to my presentation today are the sections called Warnings and Precautions. Basically, all the information about the drug's adverse events and anything you need to
	be warned about is in these two sections. This drug actually has
	over 2000 words describing the warnings and precautions. Fortunately, not every drug has that many side effects or adverse events. Some drugs, like this
	one, Metformin, actually have a small section for warnings and precautions. The older version of the drug label has Warnings and Precautions as separate sections, and the new version puts them together. So this one is in the new version, which has those two
	sections together. But this one has far fewer side effects. JMP and JMP Clinical have been used by the FDA to perform safety analysis, and we actually help to finalize every adverse event listed in the drug labels.
	So this is the data I'm presenting today. We are using the Warnings and Precautions sections of 462 drug labels, which were extracted by the FDA researchers and which I got from them.
	A DILI indicator was assigned to each drug: 1 is yes and 0 is no. From this distribution, you can see there are
	164 drugs that have
	potential DILI cases and 298 that don't. The original format of the drug label data is XML, and JMP can import multiple files at once.
	The DILI keyword list was a many-year effort by the FDA. Experts read hundreds of drug labels and decided what could potentially indicate DILI cases. They came up with
	about 44 words or terms to serve as keywords indicating that a drug
	could be a DILI case. You may also have heard about MedDRA, which is the Medical Dictionary for Regulatory Activities. It has different levels of standardized terms, and the most popular one is the preferred term, which I'm going to be using today.
	So in the
	Warnings and Precautions, if we pull everything together, you have over 12,000 terms in the
	Warnings and Precautions section. And you can see that "patients" and "may" are dominant, which
	should not be related to the medical cases and the medical information here. So we can remove them, and then you can see that no other words are so dominant in this word
	cloud, but it still has many medically unrelated words, like "use" and "reported," that we
	could remove from our analysis list. In Text Explorer, we can put them into the stop words, and normally we would also use the
	different Text Explorer techniques, such as stemming, tokenizing, regex, recoding, and deleting manually,
	to clean up the list. But it has 12,000 terms, so that could be very time consuming.
	But since we already have a list of the terms we are interested in, we want to take advantage of that.
	So what we're going to do, and I'm going to show you in the demo, is use only the DILI keywords plus the preferred terms from MedDRA to generate the terms and phrases of interest for the prediction.
	So here is the example using only the DILI keywords. You can see everything over here in the list. You have a count
	shown at the side for each term, which is how many times it is repeated in the
	Warnings and Precautions section, and you can also see a more colorful, more
	graphic view in the word cloud to recognize patterns. Then we add the medical terms. That brings it down from 12,000 terms to 1190 terms, which include the DILI keywords and the MedDRA preferred terms. So we think this would be a good
	term list to start our analysis with.
	So what we do in
	JMP Text Explorer is save the document term matrix. If you see 1, that means this document contains this term; if it is 0, this document
	does not contain this word.
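
To make the 1/0 coding concrete, here is a minimal Python sketch of a binary document term matrix restricted to a fixed term list. It is not how Text Explorer builds its matrix internally; the `binary_dtm` name, the whole-word matching rule, and the two sample sentences are assumptions made for illustration.

```python
import re


def binary_dtm(documents, terms):
    """Return one row per document: 1 if the term occurs in it, else 0."""
    rows = []
    for text in documents:
        lowered = text.lower()
        row = []
        for term in terms:
            # Match the term as whole words, ignoring case.
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            row.append(1 if re.search(pattern, lowered) else 0)
        rows.append(row)
    return rows


# Made-up label snippets and a tiny term list, for illustration only.
docs = [
    "Hepatitis and jaundice have been reported.",
    "May cause dizziness.",
]
terms = ["hepatitis", "jaundice", "liver failure"]
print(binary_dtm(docs, terms))  # [[1, 1, 0], [0, 0, 0]]
```

Each column of the result corresponds to one term and can be used directly as a 0/1 predictor.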
	Then, in
	XGBoost, we make k-fold columns: three k-fold columns, each one with five folds.
	So for the machine learning we use the XGBoost tree model, which is an add-in for JMP Pro, using the DILI indicator
	as the target variable, and we use the DILI keywords and the MedDRA preferred terms that show up more than 20 times
	as predictors. Then we run cross-validated XGBoost with 300 iterations. We get statistical performance metrics, we get term importance for DILI,
	and we can use the prediction profiler for interactions. We can also generate and save the prediction formula to predict new drugs.
	So I'm going to the demo.
	So this is the sample table we have
	in JMP.
	You have three columns. You have the index, which is a drug ID. Then you have the warnings and precautions text, which can contain many more
	words than appear here, so it basically has all the information for each drug. And you have the DILI indicator. So we do the
	Text Explorer first. Under Analyze, you go to Text Explorer, and you use this
	input, which is the warnings and precautions text. Normally you can do different things here. You can set the minimum characters per word, where people normally go to 2, or do other things. You could use stemming, or you could use regex to
	do all kinds of
	formulas, so you are hardly limited. For example, you can use a customized
	regex to remove all the numbers. If a token is only a number, you can remove it. But since we're going to use a list, we won't touch any of those. We can just go here and simply say OK.
	So it comes up with the whole list of everything. Now I'm going to say I only care about...
	oh, for this one, you can show the
	word cloud.
	And
	I want it centered,
	and I also want it colored.
	So you see here that "patients" is so dominant, and you can say, okay, this definitely
	should not be in the
	analysis. So I just select it, right-click, and add it as a stop word.
	You will see those being removed, no longer shown in your list and no longer shown in the
	word cloud.
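
The effect of adding a stop word can be sketched in a few lines of Python. This is only an illustration of the idea, not Text Explorer's tokenizer: the `term_counts` name, the letters-only tokenization, and the sample sentences are assumptions.

```python
import re
from collections import Counter


def term_counts(documents, stop_words=()):
    """Count lowercase word tokens across documents, skipping stop words."""
    stop = {word.lower() for word in stop_words}
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in stop:
                counts[token] += 1
    return counts


# Made-up sentences, for illustration only.
sample = [
    "Patients may develop jaundice.",
    "Monitor patients for jaundice.",
]
print(term_counts(sample).most_common(2))
print(term_counts(sample, stop_words=["patients", "may"]).most_common(2))
```

With the stop words applied, "patients" and "may" no longer appear in the counts, which is what makes the remaining terms stand out in the list and the word cloud.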
	So now I want to show you something that I
	think will speed up the cleanup, because there are so many other words in the system that I don't need. So I actually select everything
	and put it all into the stop words.
	So I removed everything, except I don't know why "action" cannot be removed.
	But it's fine if there's only one. So what I do is go here to manage phrases, and I import my keywords. The keyword file is
	very simple. It has a title and
	one column of data with the whole list of names. So I import that and paste it into local. This will be my local library. And I say OK.
	So now I have only my keywords.
	OK, so I want to do the analysis later, and I want all of them included in my analysis because they are the keywords. So I go here to the red triangle. Everything in Text Explorer, the hidden functions, is under this
	red triangle. So I say save matrix. I want to have one, and I want 44
	in my analysis. I say OK. You will see everything get saved
	to columns, the matrix.
	So now what I want to add: I go to the phrases one more time. I also want to import
	those preferred terms
	into
	my local data.
	Then also, I want to... actually, I want it in local,
	so I say OK.
	So now I have the mix of both the
	preferred terms from MedDRA and my keywords. You can see now the
	phrases have changed.
	So I can add them to my list. I do the same thing, save the term matrix, and get all the terms I want included. One thing I want to point out here is that for these terms we need to change the
	modeling type. The modeling type is continuous, and I want to change it to nominal. I will tell you why I do that later.
	So now I can go to
	XGBoost, which is under the add-ins. We can make k-fold columns, which makes sure I can do the
	cross validation. I can just use the index. By default, the number of k-fold columns to create is three and the number of folds (k) within each column is five. We just go with the defaults.
	Say OK, and it generates three columns really quickly. At the end, you see fold A, B, and C, three of them.
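
As a rough picture of what three fold columns with five folds each look like, here is a minimal Python sketch. It is not the add-in's algorithm, which may stratify or randomize differently; the `make_kfold_columns` name, the "Fold A/B/C" labels, and the seed are assumptions.

```python
import random


def make_kfold_columns(n_rows, n_columns=3, k=5, seed=0):
    """Build n_columns independent random assignments of rows to folds 1..k."""
    rng = random.Random(seed)
    columns = {}
    for c in range(n_columns):
        name = "Fold " + chr(ord("A") + c)
        # Start from a balanced assignment, then shuffle it.
        folds = [i % k + 1 for i in range(n_rows)]
        rng.shuffle(folds)
        columns[name] = folds
    return columns


fold_columns = make_kfold_columns(462)  # 462 drug labels, as in the talk
for name, folds in fold_columns.items():
    print(name, [folds.count(f) for f in range(1, 6)])
```

Each column is a separate 5-fold split, so using all three repeats the cross validation three times with different partitions.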
	So we have that. Then
	another thing I wanted to do is
	that we can
	create another
	phrase list.
	This phrase list has everything, including the keywords and the PTs, but I want to create one that
	only has
	the
	preferred terms and not the keywords. So I can add the keywords into the local exceptions and say OK. Then the list will only have preferred terms and not the
	keywords. This way I can create and save another list of
	document
	terms, which is the one I want to have. So here I
	have 1000, and this one is just 20. So what it will do is save the terms that
	either
	show up more than 20 times or until they reach 1000, whichever applies. Those will show up in my list.
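
The two limits described here, a minimum frequency of 20 and a cap of 1000 terms, can be sketched as a small filter. This is an illustration rather than the Text Explorer implementation: the `select_terms` name, the inclusive `>=` comparison, the tie-breaking by name, and the example counts are all assumptions.

```python
def select_terms(term_counts, min_count=20, max_terms=1000):
    """Keep terms seen at least min_count times, most frequent first, capped at max_terms."""
    kept = [(term, n) for term, n in term_counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically so ties are deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in kept[:max_terms]]


# Hypothetical counts, for illustration only.
example_counts = {"jaundice": 120, "hepatitis": 95, "rash": 19, "nausea": 20}
print(select_terms(example_counts))               # ['jaundice', 'hepatitis', 'nausea']
print(select_terms(example_counts, max_terms=1))  # ['jaundice']
```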
	So now I have the complete table, which has the keywords and the MedDRA terms that show up more than 20 times, and it also has the ??? columns, ready for analysis in
	XGBoost.
	So now I can go to XGBoost and run
	the analysis. What I'm going to show you is that I use the DILI indicator as the response, and the X variables are all the terms I just saved for the keywords and the
	preferred terms. I use the three validation columns,
	then click OK to run. It takes about five minutes to run, so I already have a result that I want to show you.
	So you have...
	This is what it looks like.
	The tuning design: we'll check this. It will
	actually find good conditions for you. Or, if you have
	as much experience as Russ Wolfinger has, you can go in here and manually change some conditions, and then you will probably get the best result.
	But many people like myself don't have much experience with XGBoost, so I would rather use the tuning design and have the machine
	select for me first. Then I can go in and adjust a little bit, depending on what I need to do. So this is the result we got. You can see here the
	different
	performance metrics for these models. By default only accuracy is shown, and you can sort by clicking the column. It also has many other popular performance metrics, like
	MCC, AUC,
	RMSE, and correlation. They all show up if you click them.
	They will show up here. So whatever measurement you need, you can always find it here.
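
For readers less familiar with these metrics, here is a minimal Python sketch of two of them, accuracy and the Matthews correlation coefficient (MCC), computed from 0/1 labels. It uses the standard textbook definitions and is not taken from the add-in; the `accuracy_and_mcc` name and the toy labels are assumptions.

```python
import math


def accuracy_and_mcc(actual, predicted):
    """Compute accuracy and MCC for binary labels coded as 0 and 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Toy labels, for illustration only.
acc, mcc = accuracy_and_mcc([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(round(acc, 3), round(mcc, 3))  # 0.6 0.167
```

MCC is often preferred over accuracy when the classes are unbalanced, as they are here with 164 DILI-positive labels out of 462.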
	So now, say I trust the validation accuracy more than anything else for this case. I want to see just the top models, say five models.
	So here I choose five models. Then I go here and say I want to remove all the shown models.
	You will see the five models over here, and you can see that some models, like this 19 in green,
	don't finish; it stops halfway. So something is wrong or not appropriate for this model. I definitely don't want to use that one, so I can choose among the others. Say, for this
	19, I want to remove that one. So I can say I want to remove the hidden one. Basically you do whatever you need to do. If you compare these
	metrics, they're actually not much different. So I want to rely on this graph to help me choose the best one.
	Then you choose a good one. You can go here and
	say, I like model 12, so I go here and say I want the
	profiler.
	This is a very powerful tool, I think quite unique to JMP. Not many
	tools have this function. It gives you an opportunity to look at individual
	parameters interactively and see how they change the result. For example, these two show up most frequently in the DILI cases.
	You can see the slope is quite steep, which means that if you change them, they affect the final predictions quite a bit. You can see that when
	hepatitis and jaundice are both zero, you have a very low probability of getting DILI as one. So there is a low chance of a possible DILI case. But if you change this line
	to 1, you can see the chance gets higher. And if you move both, it's
	even higher. So you have a way to
	analyze
	what the key
	parameters or predictors are that affect your result. And some of them, even though they are keywords, are pretty flat. That means if you change them, it will not affect the result that much.
	And here we also give a list so you can
	see which are the most important
	features for the prediction. You can see over here that jaundice and others are quite important.
	And for future results: once you get the data in, these are all the results we have. And you may say, well, what about new things coming in? Yes, here you can say, I want to save the prediction formula.
	You can see it's actively working on that. Then, at the end of the table, you will see the prediction. So remember, we had this:
	the first and second drugs were pretty much predicted to be
	DILI cases, and the next ones, the third, the fourth, and the fifth, were close to zero. So we go back to the DILI
	indicator, and we find that for the first five the predictions were right. So in case you don't have this indicator when new data come in, you don't have to read all the labels. You run the model, you see the prediction, and you pretty much know whether it is
	a DILI case or not.
	So my demo
	ends here, and now I'm going to give the conclusion. We use Text Explorer to extract the DILI keywords and MedDRA terms using stop words and phrase management, without
	manual selection, deletion, and recoding. We use visualization, and we create a document term matrix for prediction.
	We also use machine learning, with XGBoost modeling, and we can quickly run XGBoost to find the best model and use the prediction profiler. And we can save the
	prediction formula to predict new cases.
	Thank you. And I stop here.

Presented At Discovery Summit Americas 2020

Presenter

Wenjun Bao

Obtain High Quality Information from FDA Online Label Repository (2020-US-30MP-626)
