
JMP Add-ins and Scripts for Evaluation of Multiple Studies (2021-US-30MP-902)

Level: Intermediate

 

Stan Young, CEO, CGStat
Warren Kindzierski, Epidemiologist, University of Alberta

Paul Fogel, Consultant, Paris

 

Researchers produce thousands of studies each year in which multiple studies addressing the same or similar questions are evaluated together in a meta-analysis. There is a need to understand the reliability of these meta-analyses and of the underlying individual studies. Our idea is to look at the reliability of the individual studies, as well as the statistical methods for combining individual study information, usually a risk ratio and its confidence limits. We have now examined about 100 meta-analysis studies, plus complete or random samples of the underlying individual studies. We have developed JMP add-ins and scripts, and made them available, to facilitate evaluation of the reliability of meta-analysis studies: p-value plots (two add-ins) and Fisher's combining of p-values (one script). In short, the meta-analysis studies are not statistically reliable. Using multiple examples, our presentation shares results that support the observation that well over half of the claims made in the literature are unlikely to replicate.

 

 

Auto-generated transcript...

 

Speaker

Transcript

Stan Young haha.
Sara Doudt allow you to.
Stan Young see everything now.
Okay. Oh no, am I going to run the slides from my side?
Sara Doudt yeah yeah just like you would a normal meeting.
There we go, okay, cool. Let me make sure I've got everything and close Outlook. Okay, we are recording. I'm going to turn off my camera, just FYI, and then.
Okay.
Stan Young To make this full screen so now just sort.
of how to do that the second.
Sara Doudt it's that bottom like in that, where it says notes on their conduct to the 34% it's the one that looks like a.
Down the boat yep sorry about that one bad, on the other side of the person that.
One yes, that guy.
Stan Young haha.
Sara Doudt There you go. I mean, we're going to start in just a second, but I'm going to mute myself. Do you have any questions before we get started?
Stan Young Not really. Are you going to be recording me as well as the slides?
Sara Doudt Yeah, yeah, they asked that you have the
camera up here.
Over that's fine.
Right.
All right.
Stan Young I'm ready to go.
Sara Doudt I'm gonna mute myself, and if you have questions or need to stop, let me know but otherwise I'll let you.
Stan Young Go.
I'm going to present today JMP add-ins and scripts for the evaluation of multiple studies. Or you can call this how to catch p-hackers, people cheating with statistics, or why most science claims are wrong.
I'm first going to describe the puzzle parts and how they fit together. Most claims actually fail to replicate. These are science claims.
I'm contending, and my co-authors are contending, that this is due to p-hacking and it's a major problem. We're going to use meta analysis and P-value plots to catch them, so this is how to catch a crook.
The JMP add-ins and scripts are P-values from risk ratios and confidence limits. They come from a meta analysis.
And then Fisher's combining of P-values, an ancient technique which is similar to meta analysis technology. And then we're going to present a p-value plot, and we have a small script that will clean that up and make it into a presentable
picture.
Well, we see a bunny in the sky. There are lots of clouds in the sky, and if you sit out on a nice day and look around, you can probably find a bunny.
So this is a random event. The bunnies are not actually in the sky, of course.
Gelman and Loken actually published a small article called "The statistical crisis in science," so people are using statistics, they say, incorrectly, and that's what we'll talk about today.
Let's run an epidemiology experiment.
We have 10-sided dice, red, white, and blue. They will become digits (.046 in this particular case). Let's just actually watch this happen in front of our eyes.
Now we have a P-value. It's random. You did it yourself. It's so much fun, why don't we do it one more time?
Red, white,
and blue.
Now, if you do that 60 times, you can fill in the table here. This is a simulation that I did with 10-sided dice. And you can see the P-values in four columns and 15 rows. So that's 60 P-values. My smallest P-value is .004.
And with 60 P-values, you can work out the probabilities. About 95% of the time, you will have at least one P-value less than .05.
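To make that arithmetic concrete, here is a minimal JSL sketch (an illustration, not one of the authors' add-ins) that simulates 60 random P-values, as the dice exercise does, and computes the chance of getting at least one below .05 by luck alone.

```jsl
// Minimal JSL sketch: 60 random "P-values," as in the 10-sided dice exercise.
Names Default To Here( 1 );
n = 60;
pvals = J( n, 1, Random Uniform() );   // 60 independent uniform p-values
smallest = Min( pvals );               // smallest p-value in this run
nBelow = Sum( pvals < 0.05 );          // how many fall below .05 by chance
pAtLeastOne = 1 - 0.95 ^ n;            // about 0.95: chance of at least one p < .05
Show( smallest, nBelow, pAtLeastOne );
```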
Here is one done by my daughter. She had three P-values less than .05; they are circled here. And then running an epidemiology experiment is so easy
that even my wife Pat can do it, and she has three potential papers here that she could write based on rolling dice and spinning a good story.
P-values have expected values attached to them. So over on the right, we have P-values of .004, .016 and .012. Attached to each P-value is a normal deviate. You can see that my .004 would have a normal deviate of 2.9 and so forth. And on the left, we have
the sample size, the expected value of the smallest P-value, and the corresponding normal deviate. So if we had 400
questions that we were looking at, the expected P-value would be .00487 and the deviate would be 2.968. Now it's this deviation that is carried from the base experiments into the calculations of a meta analysis, and we'll see that as we proceed along.
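For reference, the deviate attached to a P-value is just the normal quantile; a short JSL sketch (assuming a two-sided P-value; not the authors' code) reproduces the 2.9 quoted for .004.

```jsl
// Minimal JSL sketch: normal deviate attached to a two-sided P-value.
Names Default To Here( 1 );
p = 0.004;
z = Normal Quantile( 1 - p / 2 );   // inverse standard normal CDF; about 2.88 here
Show( p, z );
```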
How many claims in epidemiology are true? I published a paper in 2011 and I took claims that had been made in observational studies.
And for each claim in the observational studies, I found a randomized clinical trial that looked at exactly the same question. So in the 12 studies, there were
52 claims that could be made and tested.
If you look at the column under positive, zero of those 52 claims replicated in the correct direction.
So the epidemiologists had gone 0 for 52 in terms of their claims. There were actually five claims that were statistically significant, but they were in the direction opposite of what had been claimed in the...in the observational studies.
We're gonna look at a
crazy question. Does cereal determine human gender? So if you eat breakfast cereal, are you more likely to have a boy baby? Well, that's what the first paper said.
This paper was published in the Royal Society B, which is their premier biology journal. So these three co-authors made the claim that if you eat breakfast cereal, you're more likely to have a boy baby, eat breakfast cereal in and around the time of conception.
Two of my cohorts and I looked at this, we asked for the data. We got the data and then we published a counter to the first paper saying, cereal-induced gender selection is most likely a multiple testing false positive.
The women filled out questionnaires at two times, just before (predicted) conception and right around the time of conception. And there were 131 foods in each of those questionnaires, making a total of 262 statistical tests.
You compute P-values for those tests, rank order them from smallest to largest, and plot them against the integers (that's the rank at the bottom),
you see what looks like a pretty good 45-degree line. So we're looking at a uniform distribution. Now their claim came from the lower-left of that and they said, well, here's a small P-value. The small P-value says eating cereal...breakfast cereal will lead to more
boy babies.
Pretty clearly a statistical false positive.
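The construction just described (compute the P-values, rank order them, and plot them against the integers) can be sketched in a few lines of JSL. This uses simulated uniform P-values as a stand-in for the 262 tests; it is an illustration, not the authors' add-in.

```jsl
// Minimal JSL sketch of a p-value plot: sorted p-values against their rank.
// Under a true null (all-uniform p-values) the points fall near a 45-degree line.
Names Default To Here( 1 );
nTests = 262;   // 2 questionnaires x 131 foods
pvals = Sort Ascending( J( nTests, 1, Random Uniform() ) );
dt = New Table( "P-value plot",
    New Column( "Rank", Numeric, Values( 1 :: nTests ) ),
    New Column( "p-value", Numeric, Values( pvals ) )
);
dt << Bivariate( Y( :Name( "p-value" ) ), X( :Rank ) );   // scatter of p-value vs. rank
```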
P-value plots. So we're going to use P-value plots a lot. On the left, we see a P-value plot for a likely true null hypothesis.
Long-term exercise training in the elderly does not affect mortality risk. On the right, we have smoking and lung cancer.
And we see a whole raft of P-values tracking pretty close to zero, all the way across the page and a few stragglers up on the right. So the right-hand picture is evidence for a real effect and the left-hand picture is support of the null hypothesis, no effect.
Let's talk about meta analysis because I am going to use those during the course of this lecture. On the left, we see a funnel and we see lots of papers dropping into the funnel.
The epidemiologist or whoever's doing the meta analysis picks what they think are high quality papers and they use those for further analysis. On the right, we see the...sort of the
evidence hierarchy. A meta analysis over many studies is considered
high-level information and then down at the bottom, expert opinion and so forth and so on. So the higher you go up the pyramid,
people contend the evidence is better.
We're going to look at two example meta analysis papers. The first paper is by Orellano.
It looked at air pollutants...nitric oxide, ozone,
small particles and so forth. And it gathered data from all over the...all over the
world, and this was sponsored by the
WHO, so this is high-quality funding, high-quality paper.
And it's a meta analysis. We're going to see if the claims in that meta analysis hold up.
The bottom one was really funny. Patterns of red and processed meat consumption and risk for basically lung cancer and heart attacks.
There's been a lot of literature in the nutrition literature, saying that you really shouldn't be eating red meat. We're going to see if that...if that makes sense.
Let's go back and look at the puzzle parts and then see how they fit together.
We know from the open literature that most claims fail to replicate. This is called a crisis in science. P-hacking is a problem. P-hacking is running lots of tests, trying this and trying that, and then, when you find a P-value less than .05,
you can write a paper.
We're going to use meta analysis and P-value plots to catch the people that are basically P-hacking, and I call P-hacking sort of cheating with statistics.
Others have, you know, described it a little differently. We're going to be using JMP add-ins. These JMP add-ins were written by Paul Fogel.
And they allow us to start with a meta analysis paper and quickly and easily produce a P-value plot. We're also going to describe Fisher's combining of P-values and then we're...
have a small script that will take the two-way plot that comes out of JMP and then clean it up, so that the left and right margins are more...more attractive.
Well, here's long-term exposure to air pollution
causing asthma, so they say. On the left, we have what's called a tree diagram. That's the...
it looks sort of like a Christmas tree.
The mean values are given as the dots and then the confidence limits are the whiskers going out on either side. On the left are the names of the paper...papers that were selected by these authors and on the right are the
risk ratios and the confidence limits. Now you can often just scrape the risk ratios and confidence limits off and drop them into a
JMP data set, and so we see the JMP data set on the right.
Now the P-value plot
add-in written by Paul Fogel is going to convert those risk ratios and confidence limits into P-values.
Here we see that it's been done.
The confidence limits can give you the standard error.
With the risk ratio and the standard error, you can compute a Z statistic. And then you can take the Z statistic and get a P-value. On the right here, we see a P-value plot coming out, just as it does with a couple of clicks in JMP.
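In outline, the conversion the add-in performs for each study looks like the following JSL sketch. The numbers are hypothetical, and it assumes 95% confidence limits and a two-sided test; the authors' actual add-in may differ in detail.

```jsl
// Minimal JSL sketch: risk ratio and 95% CI -> standard error -> Z -> p-value.
Names Default To Here( 1 );
rr  = 1.20;    // hypothetical risk ratio
lcl = 1.05;    // hypothetical lower 95% confidence limit
ucl = 1.37;    // hypothetical upper 95% confidence limit
se = (Log( ucl ) - Log( lcl )) / (2 * 1.96);      // SE of the log risk ratio
z  = Log( rr ) / se;                              // Z statistic
p  = 2 * (1 - Normal Distribution( Abs( z ) ));   // two-sided p-value
Show( se, z, p );
```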
Here again, on the left, we have the rough P-value plot, and on the right, after using a small script, we add the zero line, we add the dotted line for .05, and we clean up and expand the numbers on the
X and Y axes. And so we can now look at and judge all the studies that were in that meta analysis, but we see a rather strange thing. We see that, over
all of this, there are a few P-values under .05 and then a lot of P-values going up, and so we have an ambiguous situation.
Some of the P-values look like the claim is correct, and others look like they're simply random findings.
Let's take a look at Fisher's combining of P-values. On the left, we have the formula. You take each P-value, take the natural log, sum them up, and multiply by -2.
And that gives you a Chi squared. You look up the Chi squared in the table. So in this...this meta analysis, we have the P-values. Under P-value, we have -2ln, and you see that there are a few P-values that are small that add substantially to the summation.
If you think just a little bit, the summation, like all sums, is subject to outliers shifting the balance considerably.
And so the few small P-values (.0033, .0073) are adding dramatically to the summation of the Chi squared. In fact the summation is not robust. One outlier...an extreme outlier can tip the whole...the whole equation.
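The statistic is Chi squared = -2 * sum( ln(p_i) ), compared to a chi-square with 2k degrees of freedom, where k is the number of P-values. A minimal JSL sketch with hypothetical P-values (not the authors' script) shows how a couple of small P-values dominate the sum.

```jsl
// Minimal JSL sketch of Fisher's combining of p-values (hypothetical inputs).
Names Default To Here( 1 );
pvals = {0.0033, 0.0073, 0.21, 0.45, 0.62, 0.88};
k = N Items( pvals );
chisq = 0;
For( i = 1, i <= k, i++,
    chisq += -2 * Log( pvals[i] )     // the smallest p-values contribute the most
);
combinedP = 1 - ChiSquare Distribution( chisq, 2 * k );   // upper-tail p-value
Show( chisq, combinedP );
```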
Now keep in mind that P-hacking can lead to small P-values and also,
scientists quite often, if a study comes out not significant, they simply won't even publish it.
And if they find something significant, they typically will. So there's publication bias. The whole literature is littered with small P-values
and this probably...that's the tip of the iceberg. Under the iceberg are all the publications that could have happened, but they were not statistically significant.
We're now going to look at an air pollution study. Mustafic in 2012
published a paper in JAMA and he looked at the six typical air pollutants:
carbon monoxide, nitrous oxide, small particles (that's PM 2.5), sulfur dioxide, etc.
If you look at these P-value plots, all of them essentially look like hockey sticks. There are a number of P-values less than .05.
But then there are a substantial number of P-values that go up along a 45-degree line, indicating that there is no effect.
We're contending that we're looking at a mixture. We're looking at a mixture of significant studies and non significant studies.
And we're further contending that the significant studies largely come from P-hacking. There are other ways that can arise, but we think P-hacking is the thing. So if you look on the left-hand side here,
in the red box, we give the median number
of models or tests that could be conducted in a study, for the studies that were in Mustafic's paper.
We counted the number of
outcomes, predictors, and covariates, and if you multiply that all out, the number of possible analyses, which we call the search space, has a median of 12,288.
That means that the authors, in the median, had over 12,000 opportunities to get a P-value of less than .05. They had substantial leeway to get a statistically significant result.
We're going to look at the simple counting here for the nutrition studies. So we have here 15 studies, listed by the first authors of the papers.
And then across the top you see outcomes, predictors, covariates, tests, models, and search space. For example, Dixon looked at three health outcomes, and if they had 51
foods in their study, those would be the predictors. Now if they use covariates,
that would add substantially to the search space. And you can see for Dixon, theoretically, he had 20 million possible analyses that he could have done.
And if you look down the search space column, you can see substantial numbers of possible analyses that could be done.
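As a rough sketch of that counting (a plausible reading of the table: tests = outcomes times predictors, models = 2 raised to the number of covariates, search space = tests times models; the covariate count of 17 below is an illustrative assumption chosen to reproduce the roughly 20 million quoted for Dixon):

```jsl
// Minimal JSL sketch of the search-space count (assumed form, illustrative numbers).
Names Default To Here( 1 );
outcomes   = 3;     // health outcomes
predictors = 51;    // foods on the questionnaire
covariates = 17;    // illustrative assumption, not from the talk
tests  = outcomes * predictors;    // outcome-by-predictor tests
models = 2 ^ covariates;           // possible covariate subsets (models)
searchSpace = tests * models;      // about 20 million possible analyses
Show( tests, models, searchSpace );
```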
The nutrition studies were done with cohorts. A cohort is a group of people that is collected up;
they're measured and questioned initially, then followed over time, and then health effects are looked at.
Each of these cohorts has the name of a data set that the cohort uses.
And the cohorts, once they are assembled, can be used to ask, you know, a zillion questions.
If
the P-value for one of those questions comes out significant
and they feel like they can write a paper, quite often they do.
So, in the last two columns, we see the numbers of papers
that could arise. Or, more precisely, we used Google Scholar and we actually checked it out. So these are the numbers of papers that have appeared in the literature
associated with each of these cohorts. I'll add that we've looked at a lot of these papers and in none of these papers is there any adjustment for multiple testing or multiple modeling.
Nutritional epidemiology, environmental epidemiology, those are the ones we talked about here. Nutritional epidemiology uses a questionnaire, food frequency questionnaire (FFQ).
In an FFQ, you can have some number of foods.
Initially, they started off with 61 foods and I've seen some FFQ studies where there were 800 foods. I wouldn't want to be the one to fill out that questionnaire.
But given here are the number of papers that use FFQs over time.
Since 1985...the technique was invented in 1986...since 1985, there have been 74,000 FFQ papers, and based on our looking at them, none of these papers adjusted for multiple testing and all of them had substantially large statistical search spaces.
Environmental epidemiology, I simply did a Google search on the word "air pollution" in the title of the paper.
And here we see that over time, there've been 28-29,000 papers written about air pollution.
So far as I know, based on a lot of looking, none of these papers adjust for multiple testing or multiple modeling, and all of these papers have large...essentially all of these papers have very large search spaces.
Meta analysis goes under the name systematic review and meta analysis and, recently, starting in 2005,
journals have
used that term in the title of the paper. So starting in 2005, I asked if the words "systematic review" and "meta analysis" appeared in the title of the paper, and you can see that started off low, 1,500 papers a year.
1,500 papers in that five-year period and finally in 2021, there
have been a total of 27,000 papers, so this is a cottage industry. These papers can be turned out relatively easily.
A team, often in China, of five to 15 people can turn out one meta analysis per week.
And their pay is rated on how many papers they publish and so forth and so on.
So far as I know, about half of these studies are observational studies and half come from randomized clinical trials.
Virtually all of the ones that we've looked at so far, particularly the observational studies, have this hockey stick look, where there are some P-values that are small and there are a bunch of P-values that look completely random.
I will say that all the...essentially all of these studies are funded by your tax dollars or somebody's tax dollars. They're very lavishly funded by the public purse, is one way to say it.
Many claims have no statistical support. The base papers do not correct for multiple testing and multiple modeling. The base papers have large analysis search spaces.
And we've seen examples from environmental epidemiology and nutritional epidemiology that most papers, I would say based on the evidence, are unreliable.
I say, and others have said too, that we have a science and statistics disaster, and using meta analysis and P-value plots, these claims can be either verified as true or not.
Here we have four situations of smog. Upper left is London 1952;
the upper right is Los Angeles 1948.
The lower left is Singapore and I'm not sure...I don't remember the exact year for that. And then we have Beijing. Those are recent.
In the case of the London fog, statistical analysis of daily deaths and everything indicates that there were upwards of 4,000 deaths that occurred in a three- or four-day period in London in 1952. That instigated the interest by epidemiologists in what was the killer.
For the other three pictures, there has been no reported increase in deaths during those time periods.
The ringer is that in London, they were burning coal
for heating and everything else. There was a temperature inversion and the contention is that acid in the air was carried by particles into the lower lungs and susceptible individuals, usually the old and weak, died in increased numbers, but that is not happening now
around the world. There is all kinds of pollution around the world and we don't see spikes in death rates associated with it.
Scams? Ha ha. I love scams.
Are these scams? Air pollution kills. Any claim from an FFQ study, for example, if you drink coffee, you're more likely to have pancreatic cancer. That's one of the claims that's probably not true.
Natural levels of ozone, do they kill? I think that's not true either. Environmental estrogens. Any claim coming from meta analysis using observational studies.
Keep in mind, the evidence in the public literature is that 80% of all science claims, usually coming from universities, fail to replicate when tested rigorously, for example in randomized clinical trials.
Emotion and scare. The whole aim of practical politics is to keep the population menaced, scared and hence clamorous to be led to safety by menacing the population with an endless series of hobgoblins, all of them imaginary.
Flim flam, deception, confidence games involving...involving skillful persuasion or clever manipulation of the victim. So H.L. Mencken in the 1930s said that practical politics is largely scare politics.
We finish up with the authors of this talk. I'm Stan Young and I can be reached at genetree@bellsouth.net. Warren Kindzierski is a Canadian epidemiologist and he's been working with me very closely for the last couple of years on papers that
are based on the things that we've seen here today. Paul Fogel is a very interesting statistician, lives in Paris and he was responsible for the writing of the
add-ins and scripts that we can use. I will say that the scripts will allow you to go from a meta analysis to an evaluation in probably a little bit less than an hour.
I'd like to recommend the National Association of Scholars report, Shifting Sands Report 1, this URL will get you there.
This report is long and involved, but it is a simple statistics report talking about air quality and health effects.
And with that, I'll also make the following offer: if someone watching this wants to try out what's going on, we will give you the scripts and add-ins for JMP, and then I will even help you
look at and interpret your particular analysis of a meta analysis. So with that I'll stop and I'm prepared to answer any questions. Thank you.