Hello everyone.
Thanks for tuning in to my talk.
I'm David Meintrup,
Professor at Ingolstadt University of Applied Sciences.
And today, I will talk about A Lasso Regression to Detect Risk Factors
for Fatal Outcomes in Critically Ill COVID-19 Patients.
Over the last two years,
something started to increasingly bother me,
and it is not what you probably think now, the pandemic,
or at least not only.
The topic is connected to this.
This is me giving a talk on deep learning and artificial intelligence
at the Discovery Summit 2019 in the wonderful city of Copenhagen.
And since then,
it looks like AI has become the universal tool for everything,
so let me give you an example.
In May 2021,
the General Director of the World Health Organization
said the following:
"One of the lessons of COVID-19
is that the world needs a significant leap forward
in data analysis.
This requires harnessing the potential of advanced technologies
such as artificial intelligence."
So to me, this feels a bit like: forget about the scientific method,
defining the goal, specifying the tools, stating the hypothesis, et cetera.
Just drop the magic words artificial intelligence
and you are on the good side.
So therefore, I decided to give this talk in the form of a dialogue
between an AI enthusiast on the left side and a statistician on the right side.
And I'm going to talk a bit more specifically on statistical models,
artificial intelligence, and penalized regression.
And then in the second part of the talk, I'm actually going to present
the case study about the critically ill COVID-19 patients.
So let's get started.
Here's the first question from our AI enthusiast.
In the era of artificial intelligence and deep learning,
who needs statistical regression models?
Here's my short answer that I borrowed from Juan Lavista,
Vice President at Microsoft.
"When we raise money, it's AI, when we hire, it's machine learning,
and when we do the work, it's logistic regression."
So I love this tweet because in my opinion at least,
it condenses a lot of truth in a very short statement.
AI has become a universal marketing tool.
But for the real problems, we still use traditional advanced statistical methods.
A little bit longer answer could be the following.
If I look at the typical tasks of engineers and scientists,
they include innovating, understanding, improving, and predicting.
And deep learning and artificial intelligence
are mainly prediction tools.
For everything else,
we still need advanced statistical methods like traditional machine learning,
statistical modeling, and design of experiments.
Okay, but you have to admit
that there are very successful applications of AI and deep learning.
Well, there's absolutely no doubt about that.
For example,
predicting the next move in games like chess and Go.
Deep-learning algorithms do this way better than any human being.
Or I would like to introduce my favorite artificial intelligence application,
which is solving the protein folding problem.
The protein folding problem was famously introduced in 1972
by the Nobel Prize winner Christian Anfinsen,
who said in his acceptance speech
that a protein's amino acid sequence should fully determine its 3D structure.
And over the last 50 years, this problem has basically been unsolved.
And there was very little progress
until Google's DeepMind developed an AI-based algorithm called AlphaFold
that to a very large extent solved the protein folding problem.
And you see two examples of this on the right side.
This is very impressive and beautiful work.
And I included it here because I wanted to clarify
that this is the perfect deep-learning AI problem.
We have a vast amount of data,
we have a combinatorial explosion of options,
and the result we are looking for is a prediction,
the actual 3D structure of the protein.
So for predictions, we should always use AI.
Not at all.
For example, in the dataset that I will present
about the critically ill COVID-19 patients,
we want the model to predict if the patient will survive or will die.
But the pure prediction doesn't really help.
If you know someone is going to die, what you want is to treat,
to prevent, to know the risk factors,
to be able to act,
and not just to predict death or survival.
So what we need is a really interpretable model
that will hint at the things we need,
like treatment, prevention, and risk factors.
Another way of looking at it,
let's have a look at typical data- driven modeling strategies.
What is very typically done in a deep-learning AI environment
is that you take all available data
and throw it into the deep-learning AI algorithm.
And it might predict very well the outcome that you're looking for,
but you're getting a fully non-interpretable model.
What's an alternative?
An alternative is to think carefully, already during the data collection process,
about what data you need, using advanced statistical methods.
Then apply a statistical model,
and as a result, you get a fully interpretable model.
Okay, says our AI enthusiast,
but you are missing an important point here.
Statistical models might be nice for small data sets,
but for big data, they can't be used, right?
Well, no.
For large data sets, there are several intelligent ways
to reduce the dimensionality
before you start with the fitting process of the model.
And I would like to introduce, at least shortly,
three of these intelligent ways to reduce the model dimensionality.
Number one, redundancy analysis, something you might have heard about.
If you have a large data set,
the price you pay is typically that the factors are highly correlated,
and you can measure the amount of correlation within the set of factors
by a value called the variance inflation factor.
And then you can actually eliminate the factors
with the highest variance inflation factors.
Why?
Because they don't add additional information
to the set of factors that you are already looking at.
So this is one classic way of reducing the dimensionality of your data.
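To make this concrete, here is a minimal Python sketch of such a VIF-based elimination. The talk itself works in JMP, so this is only an illustrative analog, and the data frame `X` of continuous candidate factors is a hypothetical placeholder.

```python
# Minimal sketch of VIF-based redundancy analysis (assumes a pandas DataFrame
# `X` holding only continuous candidate factors; names are hypothetical).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the factor with the highest VIF until all VIFs <= threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            return X
        # A highly collinear factor adds little information beyond the others.
        X = X.drop(columns=worst)
```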
If you have categorical data, for example,
you look at an X-ray of the lung and you can see different symptoms.
Then you can have variables that describe whether
this symptom was there, that symptom was there,
or another symptom was there: X_1, X_2, X_3.
Maybe for your analysis,
it's enough to distinguish a normal- looking lung
and a lung that has some symptoms somewhere.
In other words, you convert a row with only zeros to a zero,
and if there is at least one one, you change it to a one,
and you create a new variable capturing this information.
Or, as an alternative, you could sum X_1, X_2, and X_3
and count the number of symptoms that you see on an X-ray.
This procedure is called scoring
and is a very efficient way of reducing dimensionality.
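As a small illustration, here is a pandas sketch of this scoring idea; the data frame and the columns X1, X2, X3 are hypothetical stand-ins for the X-ray findings described above.

```python
# Sketch of "scoring" with pandas; df and the symptom columns are hypothetical.
import pandas as pd

df = pd.DataFrame({"X1": [0, 1, 0], "X2": [0, 0, 1], "X3": [0, 1, 1]})

# Any-symptom indicator: 0 if the whole row is zeros, 1 if at least one symptom is present.
df["any_symptom"] = (df[["X1", "X2", "X3"]].sum(axis=1) > 0).astype(int)

# Alternative score: count how many symptoms are visible on the X-ray.
df["symptom_count"] = df[["X1", "X2", "X3"]].sum(axis=1)
```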
Principal component analysis has exactly the same spirit.
It recombines continuous variables.
It takes a linear combination of continuous variables
with the idea of catching the variation in one newly created variable.
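A minimal sketch of this idea with scikit-learn, assuming a hypothetical matrix of continuous factors that we standardize before extracting the components:

```python
# Sketch of principal component analysis on standardized continuous variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_cont = np.random.default_rng(0).normal(size=(100, 5))  # placeholder data

X_std = StandardScaler().fit_transform(X_cont)
pca = PCA(n_components=2)             # keep the two directions with the most variation
scores = pca.fit_transform(X_std)     # new variables: linear combinations of the originals
print(pca.explained_variance_ratio_)  # how much variation each component captures
```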
I call these dimensionality reduction methods intelligent,
because when you apply them,
you already learn something about your data.
And that's the whole purpose of statistics, isn't it?
Learning things from your data.
So let me summarize the advantages of statistical models.
First, they can be used for all kinds of tasks,
not only for predictions.
Second, the model itself is useful and fully interpretable.
And third,
you can start in a large dataset with intelligent dimensionality reduction
before you actually fit the model.
I'm still not convinced.
Can you give me an example of a statistical model
that you applied to a large dataset?
Okay, so let's introduce logistic Lasso regression.
Lasso is an abbreviation
for least absolute shrinkage and selection operator,
and why it is called that, I will explain in a few moments.
Let's introduce this Lasso regression in four steps.
Step number one is to remind ourselves of the logistic regression model.
In a logistic regression model, we have a categorical response,
in the easiest case with two levels,
and the goal is to model the occurrence probability of the event.
This is typically done with this S-shaped function
that corresponds to the probability of the event actually occurring.
The functional term is given here on the left,
but the good news is that
with an easy transformation, called the logit transformation,
you can turn the original values into logit values,
and then the result is a simple linear regression
on the logit values.
So the bottom line is: logistic regression
is simply linear regression on logit values.
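Written out for a two-factor case (a standard way to state it; the exact notation on the slide may differ), the two equivalent forms of the model are:

```latex
% Logistic model for the event probability p, and its logit form
p(x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}
\qquad\Longleftrightarrow\qquad
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
```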
Step number two.
This is a classic situation of a two-factor linear regression model.
And how do we fit this to a data cloud?
Well, we do this with the help of a loss function.
For example, we take the sum of squared errors,
and then we look for the minimum of this function.
This is the very famous and standard ordinary least squares estimator
that is the result of minimizing specifically this loss function,
given by the sum of squared errors.
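For the two-factor case sketched here, the loss function and the resulting estimator can be written as:

```latex
% Ordinary least squares: minimize the sum of squared errors over the parameters
L(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\bigr)^2,
\qquad
\hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta}\, L(\beta)
```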
Thirdly, something that is maybe less known:
I would like to introduce a concept from mathematics,
the norm of a vector, which is just a formal representation
of the notion of distance, of the length of a vector.
Let's look at three examples.
The first one that you see here in the middle
is the classic distance that you all know and use.
This is the Euclidean distance.
It's calculated by taking the sum of the squares of the coordinates
and then taking the square root.
The unit circle, as you know, represents all points
that have distance one from the center.
This is the classic Euclidean norm.
We can simplify this calculation by simply taking the sum of absolute values.
So instead of taking a square and taking a square root,
we simply sum the absolute values.
This is called the L_1 norm.
And what you see here, this diamond, is the unit circle
of this L_1 norm.
In other words, all the points here on this diamond have distance one
if you measure distance with the L_1 norm.
Finally, the so-called maximum norm, where you continue to simplify:
you just take the larger of the two absolute values of x_1 and x_2.
If you think about what the unit circle is in this case,
it will actually turn into a square.
This square is the unit circle for the maximum norm.
So in summary, we can measure distance in different ways in mathematics,
and what you see here, the diamond, the circle, and the square,
are unit circles.
So points with distance one, just measured with three different norms,
three distances,
three different notions of what a length is.
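As a small numerical illustration (NumPy, with a hypothetical two-dimensional vector), the three norms are easy to compute:

```python
# The three norms from the slide for a 2D vector x = (x_1, x_2).
import numpy as np

x = np.array([3.0, -4.0])
l2   = np.linalg.norm(x, ord=2)       # Euclidean: sqrt(x_1^2 + x_2^2)   -> 5.0
l1   = np.linalg.norm(x, ord=1)       # L_1: |x_1| + |x_2|               -> 7.0
lmax = np.linalg.norm(x, ord=np.inf)  # maximum norm: max(|x_1|, |x_2|)  -> 4.0
```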
Finally, let's combine everything we've done so far.
So we start with the logistic regression model.
We add the loss function.
And now, instead of taking the ordinary least squares loss function alone,
we add an additional term.
And this term consists basically of the L_1 norm of the parameters.
You see that we add the absolute values of Beta_1 and Beta_2?
So this is the L_1 norm of the parameters
that we add to the loss function.
Of course, this is just one choice.
You could also square these.
Then, you would get what is called a Ridge regression
if you take the square of the parameters.
This first one here,
the top one that we are going to continue to use
is called the Lasso or L_1 regression
because this term here is simply the L_1 norm of the Beta,
of the parameter vector.
Now, overall, what this means is that you punish the loss function
for choosing large Beta values.
And this is why this penalty that you introduce
leads to the term penalized loss functions.
So if you have a punishment, a penalty for large Beta vectors,
then instead of doing ordinary least squares,
you do a penalized regression.
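Written out, the two penalized loss functions look like this (stated with the squared-error loss for readability; in the logistic case the first term would be the corresponding likelihood-based loss):

```latex
% Penalized loss functions: Lasso (L_1 penalty) and, for comparison, Ridge (squared penalty)
L_{\mathrm{Lasso}}(\beta) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i(\beta)\bigr)^2
  + \lambda \bigl(|\beta_1| + |\beta_2|\bigr),
\qquad
L_{\mathrm{Ridge}}(\beta) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i(\beta)\bigr)^2
  + \lambda \bigl(\beta_1^2 + \beta_2^2\bigr)
```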
Now, let's look a little bit closer at the effect that penalizing has.
So this is once again the penalized loss function
with this additional term here.
Now, the graph that you see here
shows the parameters in dependence on the so-called tuning parameter, Lambda.
The larger Lambda is, the more weight this term has,
and the more it will force the Beta values to be small.
This is why you see that the parameters shrink.
And this is why this whole procedure is called absolute shrinkage.
Secondly, in this graph, you can consider this area here,
this diamond, as the budget that you have
for the sum of the absolute values of the Betas.
And on these ellipses, the residual sum of squares is constant.
So you're looking for the smallest residual sum of squares
within the budget.
In the case drawn here, this leads to this point here.
And due to the shape of this diamond,
these two will typically touch in a corner of the diamond.
And what this means is that the corresponding parameter
is set precisely to zero.
And this is why this method is also good for selection,
because setting this parameter to zero means nothing else
than kicking it out of the model.
So this is, in summary, why we call the L_1 regression Lasso.
It has a shrinkage element and a selection element
due to these two described features.
One last practical aspect of Lasso regression
is about the tuning parameter, this Lambda here.
How do you choose it?
Well, one very common approach is the following:
you use a validation method or information criterion,
for example the Akaike information criterion (AIC),
and you plot the AIC as a function of Lambda.
And then you can pick the Lambda value
that gives you the minimal AIC.
On the left side, you see how the parameters shrink
and you can see the blue lines that correspond to the parameters
that are actually non- zero
while the other ones are already forced to be zero.
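As a rough sketch of this tuning step outside of JMP, one could scan a grid of Lambda values with scikit-learn and compute an approximate AIC, counting the nonzero terms plus the intercept as the model's degrees of freedom. Here X, y, and the Lambda grid are hypothetical, and the regularization strength C roughly plays the role of 1/Lambda.

```python
# Sketch of picking the tuning parameter by AIC for a logistic Lasso.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def aic_path(X, y, lambdas):
    """Return (lambda, AIC) pairs; AIC = 2k - 2*log-likelihood, k = nonzero terms + intercept."""
    results = []
    for lam in lambdas:
        model = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear")
        model.fit(X, y)
        # log_loss with normalize=False gives the summed negative log-likelihood.
        log_lik = -log_loss(y, model.predict_proba(X)[:, 1], normalize=False)
        k = np.count_nonzero(model.coef_) + 1
        results.append((lam, 2 * k - 2 * log_lik))
    return results
```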
I'm still not convinced.
Can you show me a concrete case study?
Of course.
So the data that I'm going to present here
consist of 739 critically ill ICU patients with COVID-19,
collected at the beginning of the pandemic,
between March and October 2020.
We have one binary response
that consists of the levels recovered and dead,
and we have 43 factors:
lab values, vitals, pre- existing conditions, et cetera.
This is the data that we are going to analyze now.
So here, you see the dataset.
It has 44 columns, as you can see down here.
And 739 patients.
Now, let's start familiarizing ourselves a little bit with the data.
We have this last known status, recovered and dead,
age, gender, and BMI.
And then we have additional baseline values,
comorbidities, vitals, lab values, symptoms, and CT results.
Let's look at some distributions.
So this is the distribution of the last known status.
So you can see that unfortunately 46 percent of these patients died.
We see here the skewed age distribution with an emphasis here above 60.
And you can see that roughly 70 percent of the patients are males.
Have a look at the additional baseline values.
You can see here the body mass index,
and you will notice that we have a tendency toward a high body mass index,
above 25.
We have quite a lot of ACE and AT inhibitors,
and also of statins, so treatments for cholesterol
and for blood pressure, and some immunosuppressives.
Next, comorbidities.
You see that almost two-thirds are hypertensive,
and we have quite a significant amount of cardiovascular disease,
of pulmonary disease,
and about 30 percent of our patients have diabetes.
Now, for the remaining four groups, vitals, lab values, symptoms, and CT,
I'm going to show you one representative from each
so that you can get a feeling of how these values are distributed.
This is the respiratory rate, a vital parameter,
and this is the number of lymphocytes, a lab parameter.
Here, you have a symptom, severe liver failure,
that can occur in the ICU.
And this is a CT result,
areas of consolidation that can be seen on the CT of the lung.
Okay, so now we are ready
and we are actually going to analyze the model.
So I go to Analyze, Fit Model.
I take the last known status as the response,
then I throw in everything else as factors,
and I go to Generalized Regression,
which is going to perform the Lasso regression.
You can see here that the Lasso estimation method
is already preselected.
So if I click on Go, the procedure is already finished.
It's very fast, and this is the result.
This is the screenshot that you saw already on the slide.
Now, I'm not going to work with this model for the following reason.
If I go here up to the Model Comparison section,
I see that I have 30 parameters in this model.
So this model is still very big.
If I go back down here,
I can see that my AIC doesn't change a lot if I move the solution further to the left.
Instead of doing this manually, what I'm going to do
is I'm going to change the settings of JMP
so that it doesn't take the best fit, the minimal AIC,
but instead the smallest model within the yellow zone.
So this is something I can do here in the Model Launch.
So I take the Advanced Controls
and I change Best Fit to Smallest in Yellow Zone.
I click on Go.
And now, actually I have a very nice model with 16 parameters.
Now, this is the model I'm going to use.
And to be able to show you what factors are in this model,
I'm just going to select them, Select Nonzero Terms.
Now, I have these 16 selected,
and I can put them in a logistic regression
and activate the odds ratios.
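For readers who do not use JMP, here is a rough Python analog of these steps, following the same logic rather than the actual JMP platform: fit the logistic Lasso, select the nonzero terms, refit an ordinary logistic regression on them, and read off odds ratios. X, y, and lam are hypothetical placeholders.

```python
# Rough analog of the workflow: 1) L1-penalized logistic regression,
# 2) keep nonzero terms ("Select Nonzero Terms"), 3) unpenalized refit with odds ratios.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def lasso_select_then_refit(X: pd.DataFrame, y, lam: float) -> pd.DataFrame:
    lasso = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear").fit(X, y)
    coefs = pd.Series(lasso.coef_[0], index=X.columns)
    selected = coefs[coefs != 0].index.tolist()            # the selected terms
    refit = sm.Logit(y, sm.add_constant(X[selected])).fit(disp=0)
    return pd.DataFrame({"odds_ratio": np.exp(refit.params),  # exp(beta) = odds ratio
                         "p_value": refit.pvalues}).drop(index="const")
```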
So on the top, we see the 16 effects in our model.
And below, we can see the odds ratios.
So for example, the odds ratio for age is 1.07.
If you take this to the power of 10, it gives you roughly a value of two,
which means that with 10 more years of age,
your odds ratio approximately doubles.
Your odds of dying are twice as high as before.
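The "ten years roughly doubles the odds" statement follows directly from exponentiating the age coefficient:

```latex
% Age odds ratio per year, and the implied odds ratio per 10 years
\mathrm{OR}_{\text{age}} = e^{\beta_{\text{age}}} \approx 1.07
\qquad\Longrightarrow\qquad
e^{10\,\beta_{\text{age}}} = 1.07^{10} \approx 1.97 \approx 2
```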
And then below here,
we have the odds ratios for the categorical variables.
So for example,
cardiovascular disease has an odds ratio of 1.62.
Now, I would like to point out some of these results.
So let's first look at factors
that are well- known from the general population.
So here, you see the dependency of the last known status on age.
And you can see how increasing age
increases your risk of dying very significantly.
As I said before, over 10 years of age the odds roughly double.
On the left side,
you see pulmonary disease and cardiovascular disease
that both also have a significant effect on the risk of dying from COVID-19.
And these factors are also valid in the general population.
Now, more interestingly,
we find that these three factors are not in our model.
So gender, BMI, and hypertension are not part of our model.
So how is that possible?
Well, it's critical to remember
that our population consists of ICU patients.
They are already critically ill.
So we have 72 percent of male patients,
we have almost 80 percent that have a BMI over 26,
and two-thirds are hypertensive.
So these factors will actually highly increase your risk
of a critical course of the COVID-19 disease.
But once you are critically ill, at that point, they don't matter anymore.
So that was a very important result for us.
Which factors carry over from the general population
and which factors lose their importance
once you are already critically ill?
Finally, I would like to point out one more aspect which is statins.
Statins were entirely insignificant when you looked at them univariately;
you see the p-value here of 25 percent.
But in our multifactorial model, they were highly significant.
And as you can see here, the odds ratio is below one,
reducing the risk of mortality.
So this is a very important lesson.
Sometimes, people preselect risk factors in a large dataset
by looking at them univariately.
And if you did that, you would be guaranteed
to have missed statins,
because univariately, they are completely irrelevant.
But in our multifactorial model,
we could show that statins have a protective effect
against dying from COVID-19 once you are critically ill.
This finding has been later confirmed by others.
And just as an example,
I included a meta-analysis from September 2021,
a very large one with almost 150,000 patients,
that indeed confirms that statins reduce mortality.
If you're interested in more details,
we published our work in the Journal of Clinical Medicine.
I would like to take the opportunity to thank my co-authors,
in particular, Stefan Borgmann and Martina Nowak-Machen
from the clinic in Ingolstadt.
And I would also like to thank you very much for your attention,
and I'm looking forward to your questions.
Thank you very much.