
Take Heart: Establishing Causality Without DOE

Design of experiments (DOE) and randomized controlled trials are the gold standard for determining causal relationships, and JMP is the gold standard for DOE software. Unfortunately, many studies with causal inference objectives cannot be run as DOEs because practical or ethical constraints make it impossible to randomly assign treatments. The challenge is to remove the impact of confounding variables that contribute to bias in the observational data.

We provide an overview of current, promising causal inference methods, including propensity scores, matching algorithms, and other statistically based approaches, as well as draft FDA guidance to industry on real-world data/evidence. We demonstrate several procedures in JMP using a retrospective study database that supports multiple clinical research efforts for the Cardiothoracic Surgery Program at the Houston VA Medical Center. We finish by providing recommendations, tips, and pitfalls for the practitioner to consider when causal association is the goal and DOE is not an option.

 

 

Hello, Team Discovery, early user edition. My name is Jim Wisniowski from Adsurgo, and I'm joined by co-presenter Dr. Lorraine Cornwell, a cardiothoracic surgeon at the Houston VA Medical Center and the Baylor College of Medicine. She also does a lot of research as the PI on many different projects. We also have a co-author, Dr. Paweł Kolodziejski, who is the research coordinator.

So today, we'd like to talk a little bit about some of the investigations we've done using a cardiothoracic database, investigations that were not done through the usual randomized controlled trials. We're actually looking at retrospective studies, and we're trying to establish some level of causality from them.

In my role as a consultant, it's not uncommon for me to go on site with clients in different industries and teach a couple of days of design of experiments and then teach predictive analytics, and there's a disconnect: I start out by saying design of experiments is how we establish causality. Then, when we transition over to the predictive analytics, artificial intelligence, and machine learning world, I have to backtrack a little bit and say, I know I said that causality comes from DOE, but there is still very much goodness in these other methodologies.

That's what we're trying to do: establish how we can come up with actionable, data-driven, evidence-based information that is not based on randomized controlled trials.

Here's the way we'll go about this. I'll talk a little bit about evidence and this whole idea of design of experiments, of which the randomized controlled trial (RCT) is a subset. I'll talk about some of the problems we're seeing in the RCT world, and then I'll do a demonstration in JMP. Since this is the early user edition, I'll show you a few things that maybe you hadn't seen or hadn't thought of in terms of design of experiments.

Then I'll talk about what happens when we don't have randomization and we go into these observational studies. One of the methods we use to control the bias is what we call propensity scoring, and I'll show you a little bit of the concepts in the demonstration. Then I'll turn it over to Dr. Cornwell, and she'll talk about an example of the work we're doing right now, how that relates to the evidence, what type of information we need, and what levels of evidence lead to action.

Then I'll do a demonstration on some of the methods that are associated with that example. Then I'll talk about some other more advanced methods that are coming in JMP 19.

By way of introduction, we've all heard that correlation does not imply causality, of course. Today it's August, probably the hottest day of the year in some parts of the country. With that, we're familiar with the very strong correlation between dog bites, drownings, and ice cream sales. It's our duty to eat less ice cream, so we have fewer dog bites and fewer drownings.

Clearly, that's not the right causal relationship; it is, in fact, a third variable. That's what I want to mention as an introduction: there's often some latent dimension out there that we aren't observing. In this particular case, we could easily record the temperature. But in much of our work, it's a latent dimension we don't know about, yet it is truly responsible for the causality. We need to recognize that.
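
To make that latent-dimension idea concrete, here is a tiny simulated sketch in Python, with made-up numbers: temperature drives both ice cream sales and drownings, so the two correlate strongly even though neither causes the other.

```python
# Hypothetical simulation: a latent confounder (temperature) drives both series.
import numpy as np

rng = np.random.default_rng(1)
temperature = rng.normal(25, 8, size=365)                    # daily temps (the confounder)
ice_cream = 10 + 2.0 * temperature + rng.normal(0, 5, 365)   # sales driven by temperature
drownings = 0.5 + 0.1 * temperature + rng.normal(0, 1, 365)  # drownings driven by temperature

# A strong correlation appears even though ice cream does not cause drownings.
print(np.corrcoef(ice_cream, drownings)[0, 1])
```

Controlling for temperature, for example by stratifying on it, makes the apparent relationship disappear, which is exactly what the matching methods later in this talk try to accomplish for observed confounders.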

In our world, there really are a couple more components. Correlation is a great start, but the cause also has to happen before the effect. The last piece is non-spuriousness: there must not be another, alternative explanation. In our case, the ice cream example is exactly such a spurious correlation.

We are told, again, that design of experiments is how we establish causality: when we have input factors, those are the ones that cause a response to change. The important point is that you have to be the one controlling those factors. On this particular run, we're going to set temperature to this level and pressure to this level, and then observe the response.

JMP has long been the software of choice for DOE. Recently, in versions 17 and 18, they added the Easy DOE platform, which I highly recommend for the early user; there are some great options in there that really help your workflow, and they have come a long way in integrating the whole functionality of JMP into the Easy DOE platform. Typically, we think of DOE in terms of the plan, design, execute, and analyze stages. Here they use define, model, design, data entry, analyze, predict, and report.

There are some great references there if you'd like to see a video on how to do that; again, we recognize the early user group here. But we're going to talk about randomized controlled designs, which are a subset of DOE. This whole idea of RCTs is the hallmark of evidence-based medicine. In fact, the world of RCTs in medical applications really took off in the '80s with the cardiology folks, who had many landmark studies that advanced not only the state of cardiothoracic procedures and interventions but also the science of how we can really believe some of these results. The whole idea is that we need to minimize bias for internal validity.

How do you do this? Well, you just randomly assign the patient to the intervention group or not. It's worth bringing up Sir Ronald Fisher, the father of statistics, who said that really all you need to do is take each person and randomize: are they going to be in the control group or the treatment group? You don't need to worry about any stratification.

For Fisher, everything was based on the null hypothesis. You collect your data and ask: what is the chance of seeing results like these if the treatment and control groups are equivalent? The evidence then tells you how implausible that equivalence is. He left it at that.

But then Jerzy Neyman and Donald Rubin expanded that, brought in the whole Type I and Type II error framework, and really established the RCT approach we have today. Now, worth mentioning: in the right-hand corner, that is Sir Ronald Fisher in his cloud of pipe smoke. It turns out that in the mid '50s, there were many studies in the British Medical Journal showing a strong link between smoking and cancer, enough that many people stopped smoking because of them.

But Fisher, because of his deeply rooted randomization beliefs, would not give credibility to any of this evidence. He would write letters to the editor saying that this was not true. He was, in fact, a great friend of the tobacco industry. He was the king of statistics in the world, and he was saying, no, there's not enough evidence. Why? Because there weren't randomized controlled trials. And he acknowledged the ethical concerns: you can't randomly assign 3,000 people to smoke and 3,000 not to.

But he was still dug in, not willing to trust the evidence that, yes, there's a lot of correlation here. Because it wasn't a randomized controlled trial, he wasn't willing to sign off on it.

Ultimately, he said, maybe there's some latent dimension in here as well. Maybe it's a gene that is responsible not only for your propensity to smoke but also for your propensity to have lung cancer. As it turned out, he never gave in, and he died of cancer shortly after that.

Some other ideas: double-blinding, where neither the patient nor the provider knows which treatment the patient is receiving, and whether or not you have blocked.

The problem, though, with randomized controlled trials today is that they're not being used as much, for a number of reasons, not the least of which is a perception that they're very burdensome. What we mean by that is that the International Conference on Harmonisation's Good Clinical Practice guidelines are really meant to be followed in spirit rather than to the letter, but they're scaring people away.

And not surprisingly, randomized controlled trials are typically very expensive, and they focus more on short-term outcomes because we don't keep the trial running longer, simply because of the cost. As we mentioned with smoking and Fisher, there are ethical concerns. It all comes down to the heterogeneity of treatment effect: in order to be in the randomized controlled trial, you have to meet certain criteria. Because of that, your results do not necessarily apply to the entire population you're hoping to serve.

That has been shown time and again. There's a 2020 study that looked at the hallmark studies and compared real-world evidence against what the RCTs said, and they came to completely opposite conclusions. We have to be very careful, and we have to expand our randomized controlled trials beyond internal validity. You randomize so that you can get rid of the confounding factors.

It's great that the internal confounding factors are accounted for, but it's the external validity that we're missing. If we purposefully leave people 75 years and older out of the trial, along with people with larger BMIs and so forth, then we run a real risk of not being able to extend those results.

There is some guidance out there for good randomized controlled trials, and I've highlighted what matters for our world: it's all based on sound statistical methods. We need to make sure we're randomizing well, that we have the right sample size, and that we remove the bias. We'll talk about a couple of ways we can do that. The folks in the cardiothoracic world, just within the last year or so, also have an initiative to improve these controlled trials and lead the way like they did in the '80s.

That really bore out during COVID-19, where these randomized controlled trials were taking too long, while more actionable, evidence-based information allowed good progress on some of these interventions.

Right now, I'll switch to a quick JMP demo on what you would do if you were running a randomized controlled trial. We're going to do three things. First, we'll look at how you would set that up in the DOE platform. Second, whether you're doing a randomized controlled trial or not, you're often asked: what's the right sample size?

We'll take a look at that, and then we'll take a look at some analysis options. Let me switch over to the journal. I have provided this journal in the materials for this session, along with the slides. Under the introduction here, this is us and how to get hold of us. Now, when I want to design a randomized controlled trial, let's say I would actually like to do a little stratification across BMI, age, and gender, because those are typically important risk factors where we do see a difference. We'd like to make sure they're equally represented.

So I have my factors saved here so that I can load them quickly. I know I mentioned that Easy DOE is a great thing for you all, but in this particular case I'm going to use the regular Design of Experiments platform, because I want to show you how, say, I might not want people with very high BMI who are also older in this trial, because that's higher risk.

We'll come under DOE and do a custom design. All we have to do is load the factors, and we are ready to go. Here you can see my three levels of BMI (low, medium, high) and three levels of age. I'll go ahead and use disallowed combinations, and I'll put in BMI as well as age. What we don't want is the high level of age and the high level of BMI simultaneously.

Now I'm going to create a design that does this for me. As it turns out, maybe there are two physicians, two surgeons, who are going to be doing this. I'll put in groups of 30, for 60 runs total. This is a random block: if we have many, many surgeons, of which we're only looking at two, that's going to be a random effect for us. I'll go ahead and make this design. JMP is going to give me a design that is balanced across the three factors: BMI, age, and gender. Then we need to figure out which of these 60 runs gets the treatment and which doesn't. It's easy to just make a new column here.

This is for the new users; you can see how easy this is. I'll create a Treatment column, and then we'll make a formula. That formula is going to be an If: if a random uniform value, which is between zero and one, is greater than 0.5, then that run goes to the treatment group. Otherwise, it will be the control.
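
Outside of JMP, the same coin-flip assignment can be sketched in a few lines of Python. This is a hypothetical analogue of the column formula, not the JMP implementation; the design table itself would come from the Custom Design platform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2024)  # arbitrary seed, just for reproducibility

# Hypothetical 60-run table standing in for the JMP custom design.
design = pd.DataFrame({"run": range(1, 61)})

# Mirrors the formula: If(Random Uniform() > 0.5, "Treatment", "Control")
design["Treatment"] = np.where(rng.uniform(size=len(design)) > 0.5,
                               "Treatment", "Control")
print(design["Treatment"].value_counts())
```

Note that a pure coin flip will not give exactly 30 and 30; if balanced arm sizes matter, a shuffled assignment vector (30 of each label, permuted) is the usual alternative.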

Okay, apply. Now I have a design I can work with. Importantly, if I shade the ages here, you can see there's no one in this trial who is old as well as high BMI. Similarly, for a high BMI, I've only got the low and medium ages, because the design obeyed that rule for me.

Next up, how many trials should we be doing? That is typically answered through what's called power analysis: DOE > Sample Size Explorers, and then Power. Let's say, for example, I'm looking at postoperative atrial fibrillation as a proportion. I know that in general it's about 25% population-wide.

I'd be interested in seeing whether this intervention helps decrease it to, say, 15%. I'll come under two independent sample proportions. I'll put in the baseline proportion of 0.25, and then what I want is 0.15. This is practically significant to us: if we see that delta of 10 percentage points, then we would consider this an intervention worth putting into our protocol.

If that's the case and I only have 60 trials like I just designed, my power is only 16.5%. We could spend a long time on power, but power says that if the treatment truly is important, this is the percentage of times my test will declare it as such, given that true 10-point delta. A rule of thumb is that we need power to be at least around 80%. If we put in 250 per group, that gets me up to around 80% power. This is how you design these. You have power curves right here, and if I start sliding down, you'll see that the power decreases.
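
As a rough cross-check of those numbers, the same two-proportion power calculation can be approximated in Python with statsmodels (assuming that package is available); it uses Cohen's h as the effect size for 25% versus 15%.

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.25, 0.15)   # Cohen's h for 25% vs. 15% POAF
power_calc = NormalIndPower()

# Power with roughly 30 patients per arm (the 60-run design above)
print(power_calc.solve_power(effect_size=effect, nobs1=30, alpha=0.05))

# Patients per arm needed for ~80% power
print(power_calc.solve_power(effect_size=effect, power=0.80, alpha=0.05))
```

These should land near the 16.5% power and roughly 250 per arm seen in the JMP Sample Size Explorer, with small differences due to the normal approximation.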

This is a great tool for having discussions and balancing some of your risk. It might very well be that your research coordinator says, "No, we can't afford 500. We can only do 100 each." Then you'll see what happens to the power in that case. Now, let me go back to the PowerPoint quickly, and we'll cover a couple of other ideas with the actual data.

If we don't do a randomized controlled trial, what should we do? Well, there's recent research suggesting a pragmatic RCT, which accepts that we can't follow the letter of the law of the RCT. But maybe we could make greater use of technology, with wearable sensors and virtual follow-ups, and reduce the overall burden.

Similarly, you have this idea of registry-based controlled trials, where you have databases that are very well curated, almost as good as (if not better than, in some cases) the data you'd collect from a randomized controlled trial. But now you could potentially randomize between, say, Facility 1 versus Facility 2, or Provider 1 versus Provider 2.

You do the randomization at a higher level, but the idea is that this database is very well validated and verified, so you now have the ability to do these investigations very quickly. Then there's where we are: looking at retrospective studies, which really have a bad reputation, and maybe not deservedly so. There are still things we can do with these observational and retrospective studies.

One thing we've run into recently that's very helpful is meta-analyses, or systematic reviews: maybe nine papers have been written on this question from various cohorts, and when we put them all together, we have a very clear message that they're all saying the same thing. Or there are some differences, so we're not going to trust it.

These meta-analyses are enormously helpful, and they have their own statistical measures: funnel plots and the Q and I² statistics. So we have hope out there. Going back to the FDA, particularly in the drug approval process, they are now pushing hard toward integrating real-world data and real-world evidence as much as possible when making some of these decisions in their Phase 2 and Phase 3 trials.
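
For reference, the I² heterogeneity statistic comes directly from Cochran's Q: I² = max(0, (Q − df) / Q) × 100%, which is read as the share of the variability across studies attributable to real heterogeneity rather than chance.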

One of our best ways of making an observational study useful is matching. If I have 1,000 patients, of which 200 received the treatment, that leaves 800. Of those 800, are there 200 who are a lot like the 200 who received the treatment? I look across the dimensions I specify: I may want a similar BMI, age, and maybe some other similar markers.

That's the idea with propensity matching: we have matched one group to the other as cleanly as possible, and that removes some of the bias. Again, our problem is twofold in terms of our variables: either there's some latent dimension out there that we're missing, or the variables are confounded. We don't know whether it's truly the effect of our intervention or just because we had many more younger folks in the treatment group.
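
For readers who want to see the mechanics, here is a minimal propensity-score matching sketch in Python: logistic-regression scores followed by 1:1 nearest-neighbor matching. The column names (`treated`, `bmi`, `age`, `copd`) are hypothetical, and this is a simplified illustration rather than the procedure used for the study.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_match(df: pd.DataFrame, covariates: list[str]) -> pd.DataFrame:
    """Return a 1:1 matched data set (treated patients plus their nearest controls)."""
    # 1. Propensity score: modeled probability of treatment given the covariates.
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
    df = df.assign(pscore=model.predict_proba(df[covariates])[:, 1])

    # 2. For each treated patient, find the control with the closest score.
    treated = df[df["treated"] == 1]
    control = df[df["treated"] == 0]
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    _, idx = nn.kneighbors(treated[["pscore"]])

    # Note: this simple version can reuse a control; real matching usually adds
    # calipers and matches without replacement.
    return pd.concat([treated, control.iloc[idx.ravel()]])

# matched = propensity_match(df, ["bmi", "age", "copd"])
```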

We have to match to make sure that everything is as clean as possible. With that, I'll turn it over to Dr. Cornwell, who will introduce one of the studies we're doing that casts some light on this causal analysis world.

Thank you so much, Jim. I'm Lorraine Cornwell. I'm a heart surgeon, a cardiothoracic surgeon, here at the VA in Houston, and I'm associated with Baylor College of Medicine. I wanted to introduce why we were interested in this particular study we've been working on: adding a posterior pericardiotomy when we do open-heart surgery and looking to see whether it decreases some of the post-op complications.

Just as background, the reason this came about is that there was some evidence that some of the post-op complications after surgery have to do with the trauma to the heart, the trauma to the chest, and then potentially retained blood: bleeding inside the pericardial sac leading to inflammation. That inflammation increases the chance that the patient will have complications, such as postoperative atrial fibrillation (an irregular heart rhythm), retained hematoma requiring reoperation, prolonged inflammation leading to more pain, Dressler syndrome (a post-op inflammation syndrome), and prolonged length of stay.

What we saw from some of the literature is that the better you drain the pericardial sac, the better the patients may do. For the last decade or so, I've been adding just a simple incision in the back of the pericardium to help the pericardial sac drain better. What we decided to do is look retrospectively, with an observational data set, to see whether those patients who had the posterior pericardiotomy had any difference in their outcomes compared to those who did not.

I know Jim is going to review how he did this with us in JMP. But we also wanted to talk about how we determine whether the level of evidence we're getting from the study is important, how we can publish it, and how we change practice because of the studies that are done. So, Jim, maybe you can go to the next slide for me.

Again, at what point is there enough information that we're going to incorporate a new intervention into a clinical protocol, into how we're taking care of patients? There are tons of factors to consider. Are there risks to this additional procedure? What are the expected improvements? Are the risks outweighed by the improvements? What do other studies say? How strong and recent is the overall body of evidence? Does it generalize to our specific patient pool? Like Jim said earlier, if it's only relevant to young patients, and we have elderly, frail, sick patients, it may not be the same.

Then, is there any empirical evidence we can go on? Does it make sense scientifically? What are the costs? This is a very simple intervention, so we thought it was fairly easy to study. We thought this retrospective data would add credibility to other studies that are out there and help standardize our protocol along those lines.

The next slide, Jim, please. This is about how we use studies to determine our guidelines and guidance. It shows how we take studies and incorporate them into guidelines, and the guidelines would say, for instance, this is a Class 1 recommendation, it's highly recommended, or it's a Class 2 or Class 3.

It goes from Class 1, the strongest recommendation, where the risks are greatly outweighed by the benefits: the intervention is recommended, should be performed, is indicated. Then Class 2a means it's a reasonable option, likely to be useful or beneficial, probably indicated; the wording changes a little because it's only moderately recommended. Class 2b is weak: something that may be reasonable to do, may be considered, but the evidence isn't really all there.

Then, of course, there's Class 3: No Benefit, where it's been proven not to have benefit, and Class 3: Harm, where the evidence strongly shows actual harm. When we sit down with our organizations to write guidelines, we consider which class any recommendation goes into. To do that, as the next slide shows, we look at all the studies that have been done around that question, and we determine the quality of the evidence.

When we say it's a Class 1 recommendation with Level A evidence, that means it's high-quality evidence: more than one randomized trial, or meta-analyses of high-quality randomized trials. Level B-R is more moderate-quality evidence, maybe one randomized trial, a few randomized trials, or small randomized trials. Then Level B-NR is not randomized: well-designed but non-randomized, observational studies.

That is the level of evidence we're talking about with this current pericardiotomy study. Of course, it goes lower from there, down to expert opinion, which is not really the evidence we're hoping for. But sometimes there are things in surgery and in medicine for which we don't yet have high-quality evidence, yet experts think they're important.

What we did for this particular question is look back through our data. We had eight years of follow-up and more than 700 patients who all underwent open-heart surgery. We compared those who had the pericardiotomy to those who didn't. Initially, we had fairly well-matched groups even without the propensity matching; then we did a propensity match as well. Both sets of data, the non-matched and the matched, showed a reduction in postoperative atrial fibrillation.

It was almost 10%, and it was statistically significant. There was also a shorter length of stay, so they got out of the hospital faster, and they had fewer readmissions as well. The Kaplan-Meier curves for long-term survival actually diverged, so the patients who had the pericardiotomy also seemed to survive longer. We think it's important information, and it could change practice along with other data that's coming out. We're actually hoping to be part of a large multi-institutional randomized controlled trial on this topic if we can get it funded. It's always complicated. But I think we'll let Jim take it from there on how he manages this data.

Great. Thank you, Dr. Cornwell. I was going to talk a little bit about how propensity scoring works, but given that we're up against time (we thought we were a 45-minute group, but we're a 30-minute one), I'll jump right to the results. You don't have this particular file, and I apologize for that, but the data are not yet released. Here we'll walk through some of the analytics and what the workflow would be.

The first thing you can see is that there are 426 columns here. For new users, I can't emphasize enough how wonderful it is to group your columns together. Here the risk factors are all grouped together, and the outcomes are grouped together. The idea is that these risk factors (there are more, but I just have a subset here) need to be relatively homogeneous across the two groups. To check that, I can do Analyze > Fit Y by X, with the risk factors as the responses and the pericardiotomy as the intervention, the X factor.

The hope is that we have approximately the same percentages. Right off the bat, I can see that I don't, so I'll turn on cell labeling (label by percent) to put the percentages on. Now we can see there is a statistically significant difference between the groups in terms of who had COPD. The whole idea with propensity matching is that we want to make the groups as even as possible, because in this particular case there are more patients with COPD who didn't get the pericardiotomy.

That tips the scales in our favor, because the treated group has fewer of these patients with emphysema. What we can do, and this is a great technique in JMP, is put a local data filter on for whether or not the patient was matched. Remember, with propensity scoring we have 286 or so who did have the pericardiotomy, and now we do the matching. Once I filter to the matched patients, it turns out that those percentages are now very even, 35% versus 32%. That's the philosophy, and we apply it all the way through.

Those are proportions, but the same idea works for continuous measures: when we look at some of the lab work, we want the lab averages to be very similar to one another. That's the first thing we look for in these studies: do you have homogeneity? It's nice if you have it at the outset with all the data, but it's better after you do your propensity match.
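
One way to quantify that homogeneity check, beyond eyeballing percentages, is the standardized mean difference (SMD) for each covariate before and after matching; values under roughly 0.1 are conventionally taken as well balanced. Here is a small Python sketch with hypothetical column names (`treated`, `matched`, and the covariates) standing in for the study table.

```python
import numpy as np
import pandas as pd

def smd(treated: pd.Series, control: pd.Series) -> float:
    """Standardized mean difference for one covariate (works for 0/1 flags too)."""
    pooled_sd = np.sqrt((treated.var() + control.var()) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

def balance_table(df: pd.DataFrame, covariates: list[str]) -> pd.Series:
    t, c = df[df["treated"] == 1], df[df["treated"] == 0]
    return pd.Series({cov: smd(t[cov], c[cov]) for cov in covariates})

covs = ["copd", "bmi", "age"]
print(balance_table(df, covs))                      # all patients
print(balance_table(df[df["matched"] == 1], covs))  # matched subset only
```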

Now, for the outcome analysis, I can do the same thing: Analyze > Fit Y by X, where the outcomes are the responses and the pericardiotomy is the X factor. Let's go down to POAF. In the interest of time, we can see that there is a statistically significant difference (or very close to it) in POAF when I use all of the data; I'll put those percentages in again. Then if I apply the propensity match with the local data filter, filtering on whether the patient matched, here are the matched versions.

Now I can see that I actually have greater statistical significance: the difference becomes even more pronounced when the groups are matched correctly. That was a categorical variable. Another important one for us, in terms of economics and certainly patient outcomes, is length of stay. Patients would like to go home as soon as possible, and it's a surrogate for how quickly you are recovering.

There we can look at a test of the means, and we can see that there is a statistically significant difference in the means: the average length of stay is eight and a half days versus nine and a half. But we also want to look at a median test. I just wanted to point out that you can do a test of the medians, and here we have statistical significance as well, because you do have these very long lengths of stay that end up skewing the data.
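
For the length-of-stay comparison, the same pair of tests can be sketched with SciPy: a Welch t-test on the means plus rank-based and median tests that are robust to the long right tail. The `los_peri` and `los_no_peri` arrays are hypothetical vectors of length of stay for the two groups.

```python
from scipy import stats

# Welch t-test on the means (sensitive to the skew from very long stays)
t_stat, t_p = stats.ttest_ind(los_peri, los_no_peri, equal_var=False)

# Rank-sum and median tests are robust to the long right tail
u_stat, u_p = stats.mannwhitneyu(los_peri, los_no_peri, alternative="two-sided")
med_stat, med_p, grand_median, table = stats.median_test(los_peri, los_no_peri)

print(f"t-test p={t_p:.3f}, rank-sum p={u_p:.3f}, median test p={med_p:.3f}")
```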

That's why we focus on the medians, and we can see there's a one-day difference in the median length of stay as well. Those are some of the ideas for analyzing the outcomes. Dr. Cornwell mentioned that we also saw much better survival rates. That's done under the Survival platform: the pericardiotomy is the grouping, the time is survival years, and then we have a censoring column.

As an introduction, censoring means that when we stopped, not everyone had died, so those patients were censored. We take a look to see whether there is a difference, and we can see right now that there is. Zero means they did not have the pericardiotomy; one means they did. This is the probability of surviving by the number of years, and the difference between those two curves is, in fact, statistically significant. We can even look at survival at four years, or whatever it happens to be: four-year survival is 81% if you didn't have the pericardiotomy versus 88% if you did, a difference of about seven percentage points in the survival rates.
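
The same survival comparison can be sketched with the lifelines package in Python (an assumption on my part; the demo uses JMP's Survival platform). The column names `survival_years`, `died` (1 = event, 0 = censored), and `pericardiotomy` are hypothetical.

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for group, frame in df.groupby("pericardiotomy"):
    kmf.fit(frame["survival_years"], event_observed=frame["died"],
            label=f"pericardiotomy={group}")
    kmf.plot_survival_function(ax=ax)   # step curves like the JMP plot

# Log-rank test for a difference between the two survival curves
g0, g1 = df[df["pericardiotomy"] == 0], df[df["pericardiotomy"] == 1]
result = logrank_test(g0["survival_years"], g1["survival_years"],
                      event_observed_A=g0["died"], event_observed_B=g1["died"])
print(result.p_value)
```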

Lastly, just to show you quickly: when we do some of our analysis, we realize it's challenging because some of the risk factors are so dominant. We know that age, gender, and BMI could attenuate some of the effects we're seeing. So we take a look at various things, including decision trees. I can't recommend these enough; they are awesome for gaining insight.

For example, I'll look at POAF as the response, with a few of these factors in the interest of time, including pericardiotomy. What we're looking at first is just the overall probability and count: about 71% do not have this post-op AFib. Then, if we figure out the single best variable to split on, we see that if they're young, less than 60, it's an 8% and 92% split, and 122 of our patients were in that group. This provides a lot of insight; we can see how important age really is.
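
A rough analogue of JMP's Partition platform can be sketched with a shallow scikit-learn decision tree. The feature and outcome column names here are hypothetical, and which splits it finds would depend on the actual data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["age", "pericardiotomy", "bmi", "copd"]
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=30, random_state=0)
tree.fit(df[features], df["poaf"])

# Print the split rules; in the demo, age is the first split and
# pericardiotomy then separates the young cohort.
print(export_text(tree, feature_names=features))
```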

But if I come up to the young cohort and split there, look at that: it's the pericardiotomy that helps us differentiate a little bit better. The overall POAF rate for the young folks is 8%; however, if you had the pericardiotomy, your rate is down to 3%. There's lots more we could do, but we are up against time, so I'm going to wrap up and just let you know that this may be the beginning, if you will, of bringing these ideas to bear, because there's a lot more coming with JMP 19.

They're doing some great work with structural equation modeling and instrumental variables, and propensity scoring is going to be automated in JMP 19, with methods beyond the standard ones we're employing. There are great things coming. I think this discussion of causality is getting a lot more traction as we look at some of these 21st-century methods and realize it's not all about the randomized controlled trial. In fact, there are many great things we can learn, and actionable results we can obtain, from the data we're looking at and the innovative ways we're exploring it.

With that, I hope that this was a helpful presentation for you, and we will be around to take questions. Thank you very much.