Corona Virus Risk Analysis: Statistical Analysis Should Be As Simple As Possible, But No Simpler (2020-US-45MP-598)

Level: Intermediate

Roland Jones, Senior Reliability Engineer, Amazon Lab126
Larry George, Engineer who does statistics, Independent Consultant
Charles Chen SAE MBB, Quality Manager, Applied Materials
Mason Chen, Student, Stanford University OHS

Patrick Giuliano, Senior Quality Engineer, Abbott Structural Heart

The novel coronavirus pandemic is undoubtedly the most significant global health challenge of our time. Analysis of infection and mortality data from the pandemic provides an excellent example of working with real-world, imperfect data in a system with feedback that alters its own parameters as it progresses (as society changes its behavior to limit the outbreak). With a tool as powerful as JMP it is tempting to throw the data into the tool and let it do the work. However, using knowledge of what is physically happening during the outbreak allows us to see what features of the data come from its imperfections, and avoid the expense and complication of over-analyzing them. Also, understanding of the physical system allows us to select appropriate data representation, and results in a surprisingly simple way (OLS linear regression in the ‘Fit Y by X’ platform) to predict the spread of the disease with reasonable accuracy. In a similar way, we can split the data into phases to provide context for them by plotting Fitted Quantiles versus Time in Fit Y by X from Nonparametric density plots. More complex analysis is required to tease out other aspects beyond its spread, answering questions like "How long will I live if I get sick?" and "How long will I be sick if I don’t die?". For this analysis, actuarial rate estimates provide transition probabilities for Markov chain approximation to SIR models of Susceptible to Removed (quarantine, shelter etc.), Infected to Death, and Infected to Cured transitions. Survival Function models drive logistics, resource allocation, and age-related demographic changes. Predicting disease progression is surprisingly simple. Answering questions about the nature of the outbreak is considerably more complex. In both cases we make the analysis as simple as possible, but no simpler.

Auto-generated transcript...

Speaker	Transcript
Roland Jones	Hi, my name is Roland Jones. I work for Amazon Lab126 as a reliability engineer.
	When my team and I
	put together our abstracts for the proposal at the beginning of May, we were concerned that COVID 19 would be old news by October.
	At the time of recording on the 21st of August, this is far from the case. I really hope that by the time you watch this in October, things will be under control and life will be returning to normal, but I suspect that it won't.
	With all the power of JMP, it is tempting to throw the data into the tool and see what comes out. The COVID 19 pandemic is an excellent case study
	of why this should not be done. The complications of incomplete and sometimes manipulated data, changing environments, changing behavior, and changing knowledge and information, these make it particularly dangerous to just throw the data into the tool and see what happens.
	Get to know what's going on in the underlying system. Once the system is understood, the effects of the factors that I've listed can be taken into account,
	allowing the modeling and analysis to be appropriate for what is really happening in the system, and avoiding analyzing or being distracted by the imperfections in the data.
	It also makes the analysis simpler. The overriding theme of this presentation is to keep things as simple as possible, but no simpler.
	There are some areas towards the end of the presentation that are far from simple, but even here, we're still working to keep things as simple as possible.
	We started by looking at the outbreak in South Korea. It had a high early infection rate and was a trustworthy and transparent data source.
	Incidentally, all the data in the presentation comes from the Johns Hopkins database as it stood on the 21st of August when this presentation was recorded.
	This is a difficult data set to fit a trend line to.
	We know that disease naturally grows exponentially. So let's try something exponential.
	As you can see, this is not a good fit. And it's difficult to see how any function could fit the whole dataset.
	Something that looks like an exponential is visible here in the first 40 days. So let's just fit to that section.
	There is a good exponential fit.
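A minimal Python sketch of fitting an exponential to an early growth phase. The talk used JMP's Nonlinear tool; this sketch swaps in ordinary least squares on the logarithm of the counts, and the data is synthetic, so the function names and numbers are illustrative assumptions rather than the presenter's method or the South Korea data.

```python
import math

def fit_exponential(days, counts):
    """Fit counts ~ a * exp(b * day) by least squares on log(counts)."""
    logs = [math.log(c) for c in counts]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def rmse(days, counts, a, b):
    """Root mean square error of the fitted curve, in the original units."""
    residuals = [c - a * math.exp(b * d) for d, c in zip(days, counts)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Synthetic data (an assumption): 10 cases growing 20% per day for 40 days.
days = list(range(40))
counts = [10 * 1.2 ** d for d in days]

a, b = fit_exponential(days, counts)
daily_growth = math.exp(b) - 1   # fractional growth per day implied by the fit
fit_error = rmse(days, counts, a, b)
```

Fitting in log space weights small early counts more heavily than a fit in the original space would, which is one reason the two approaches can give different confidence limits.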
Roland Jones	What we can do is partition the data into different phases and fit functions to each phase separately.
	1, 2, 3, 4 and 5.
	Partitions were chosen where the curve seemed to transition to a different kind of behavior.
	Parameters in the fit function were optimized for us in JMP's Nonlinear fit tool. Details of how to use this tool are in the appendix.
	Nonlinear also produced the root mean square error results, the sigma of the residuals.
	So for the first phase, we fitted an exponential; second phase was logarithmic; third phase was linear; fourth phase, another logarithmic; fifth phase, another linear.
	You can see that we have a good fit for each phase; the root mean square error is impressively low. However, as partition points were specifically chosen where the curve changed behavior, a low root mean square error is to be expected.
	The trend lines have negligible predictive ability because the partition points were chosen by looking at existing data. This can be seen in the data present since the analysis, which was performed on the 19th of June.
	Where extra data is available, we could choose different partition points and get a better fit, but this will not help us to predict beyond the new data.
	Partition points do show where the outbreak behavior changes, but this could be seen before the analysis was performed.
	Also no indication is given as to why the different phases have a different fit function.
	This exercise does illustrate the difficulty of modeling the outbreak, but does not give us much useful information on what is happening or where the outbreak is heading. We need something simpler.
	We're dealing with a system that contains self learning.
	As we, as a society, learn more about the disease, we modify behavior to limit its spread, changing the outbreak trajectory.
	Let's look into the mechanics of what's driving the outbreak, starting with the numbers themselves and working backwards to see what is driving them.
	The news is full of COVID 19 numbers, the USA hits 5 million infections and 150,000 deaths. California has higher infections than New York. Daily infections in the US could top 100,000 per day.
	Individual numbers are not that helpful.
	Graphs help to put the numbers into context.
	The right graphs help us to see what is happening in the system.
	Disease grows exponentially. One person infects two, who infect four, who infect eight.
	Human eyes differentiate poorly between different kinds of curves but they differentiate well between curves and straight lines. Plotting on a log scale changes exponential growth and exponential decline into straight lines.
	Also on the log scale early data is now visible where it was not visible on the linear scale. Many countries show one, sometimes two plateaus, which were not visible
	in the linear graph. So you can see here for South Korea, there's one plateau, two plateaus and, more recently, it's beginning to grow for a third time.
	How can we model this kind of behavior?
	Let's keep digging.
	The slope on the log infections graph is the percentage growth.
	Plotting percentage growth gives us more useful information.
	Percentage growth helps to highlight where things changed.
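A short Python sketch of turning a cumulative infection series into percentage growth. The seven-day period is an assumption (the talk refers to weekly growth later on), and the series is synthetic.

```python
def percentage_growth(cumulative, period=7):
    """Percentage increase in cumulative infections over `period` days.

    Returns one value per day from index `period` onward; None where the
    earlier count is zero and growth is undefined.
    """
    growth = []
    for t in range(period, len(cumulative)):
        earlier = cumulative[t - period]
        if earlier > 0:
            growth.append((cumulative[t] / earlier - 1) * 100)
        else:
            growth.append(None)
    return growth

# Synthetic series (an assumption) that doubles every 7 days.
cumulative = [100 * 2 ** (d / 7) for d in range(22)]
weekly_growth = percentage_growth(cumulative)
```

A series that is a straight line on a log scale, like this one, has constant percentage growth, which is why changes in slope stand out so clearly once growth itself is plotted.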
	If you look at the decline in the US numbers, the orange line here, you can see that the decline started to slacken off sometime in mid April and can be seen to be reversing here in mid June.
	This is visible but it's not as clear in the infection graphs. It's much easier to see them in the percentage growth graph.
	Many countries show a linear decline in percentage growth when plotted on a log scale. Italy is a particularly fine example of this.
	But it can also be seen clearly in China,
	in South Korea,
	and in Russia, and also to a lesser extent in many other countries.
	Why is this happening?
	Intuitively, I expect that when behavior changes, growth would drop down to a lower percent and stay there, not exponentially decline toward zero.
	I started plotting graphs on COVID 19 back in late February, not to predict the outbreak, but because I was frustrated by the graphs that were being published.
	After seeing this linear decline in percentage growth, I started taking an interest in prediction.
	Extrapolating that percentage growth line through linear regression actually works pretty well as a predictor, but it only works when the growth is declining. It does not work at all well when the growth is increasing.
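As a hedged illustration of that extrapolation, the Python sketch below fits a straight line to the logarithm of percentage growth and projects it forward. The helper names and the synthetic data are assumptions; as the talk notes, this is only sensible while growth is declining, and the logarithm requires positive growth values.

```python
import math

def ols_line(xs, ys):
    """Ordinary least squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def extrapolate_growth(days, growth_pct, future_days):
    """Fit log(growth %) against day, then predict growth % on future days."""
    slope, intercept = ols_line(days, [math.log(g) for g in growth_pct])
    return [math.exp(intercept + slope * d) for d in future_days]

# Synthetic example (an assumption): growth starts at 50% and falls 10% per day.
days = list(range(21))
growth_pct = [50 * 0.9 ** d for d in days]
forecast = extrapolate_growth(days, growth_pct, [30])
```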
	Again, going back to the US orange line, if we extrapolate from this small section here, where it's increasing, which is from the middle of June to the beginning of July,
	we can predict that we will see a 30% increase by around the 22nd of July, that will go up to 100% weekly growth by the 26th of August, and it will keep on growing from there, up and up and up and up.
	Clearly, this model does not match reality.
	I will come back to this exponential decline in percentage growth later. For now, let's keep looking at what is physically going on as the disease spreads.
	People progress from being susceptible to disease to being infected to being contagious
	to being symptomatic to being noncontagious to being recovered.
	This is the Markov SIR model. SIR stands for susceptible, infected, recovered. The three extra stages of contagious, symptomatic and noncontagious help us to model the disease spread and relate it to what we can actually measure.
	Note the difference between infected and contagious. Infected means you have the disease; contagious means that you can spread it to others. It's easy to confuse the two, but they are different and will be used in different ways, further into this analysis.
	The timings shown are best estimates and can vary greatly. Infected to symptomatic can be from three to 14 days and for some infected people,
	they're never symptomatic.
	The only data that we have access to is confirmed infections, which usually come from test results, which usually follow from being symptomatic.
	Even if testing is performed on non symptomatic people, there's about a five-day delay from being infected to having a positive test result.
	So we're always looking at old data. We can never directly observe the true number of people infected.
	So the disease progresses through individuals from top to bottom in this diagram.
	We have a pool of people that are contagious and that pool is fed by people that are newly infected becoming contagious and the pool is drained by people that are contagious becoming non contagious.
	The disease spreads through the population from left to right.
	New infections are created when susceptible people come into contact with contagious people and become infected.
	The newly infected people join the queue waiting to become contagious and the cycle continues.
	This cycle is controlled by transmission and reproduction.
	Transmission is how likely a contagious person is to infect a susceptible person per day.
	Reproduction is the number of people that a contagious person is likely to infect while they are contagious.
	This whole cycle revolves around the number of people contagious and the transmission or reproduction.
	The time individuals stay contagious should be relatively constant unless COVID 19 starts to mutate.
	The transmission can vary dramatically depending on social behavior and the size of the susceptible population.
	Our best estimate is the days contagious averages out at about nine.
	So we can estimate people contagious as the number of people confirmed infected in the last nine days.
	In some respects, this is an underestimate because it doesn't include people that are infected, but not yet symptomatic or that are asymptomatic or that don't yet have a positive test result.
	In other respects, it's an overestimate because it includes people who were infected a long time ago, but are only now testing positive. It's an estimate.
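The contagious estimate described here can be sketched in a few lines of Python. The function name and the synthetic series are assumptions; the nine-day window is the figure quoted in the talk.

```python
def estimate_contagious(cumulative_confirmed, days_contagious=9):
    """Estimate people contagious as confirmed infections in the last N days.

    For each day t this is cumulative[t] - cumulative[t - N]; before day N
    the earlier count is taken as zero.
    """
    contagious = []
    for t, total in enumerate(cumulative_confirmed):
        if t >= days_contagious:
            earlier = cumulative_confirmed[t - days_contagious]
        else:
            earlier = 0
        contagious.append(total - earlier)
    return contagious

# Synthetic series (an assumption): 10 new confirmed infections every day.
cumulative_confirmed = [10 * (d + 1) for d in range(20)]
contagious = estimate_contagious(cumulative_confirmed)
```

With a steady 10 new infections per day, the estimate settles at 90 people contagious, which is simply nine days of new cases.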
	From the estimate of people contagious, we can derive the percentage growth in contagious. It doesn't matter if the people contagious is an overestimate or underestimate.
	As long as the percentage error in the estimate remains constant, the percentage growth in contagious will be accurate.
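The reason a constant percentage error does not matter can be written out in one line. The symbols are assumptions introduced for illustration: $C_t$ is the true number of people contagious on day $t$, $\hat{C}_t$ is the estimate, and $k$ is the constant factor by which the estimate is off.

```latex
\hat{C}_t = k\,C_t
\quad\Longrightarrow\quad
\frac{\hat{C}_{t+1} - \hat{C}_t}{\hat{C}_t}
= \frac{k\,C_{t+1} - k\,C_t}{k\,C_t}
= \frac{C_{t+1} - C_t}{C_t}
```

The factor $k$ cancels, so the percentage growth of the estimate equals the percentage growth of the true value. If $k$ drifts over time, for example because testing ramps up, the cancellation no longer holds.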
	Percentage growth in contagious is important because we can then use it to derive transmission.
	The derivation of this equation relating the two can be found in the appendix.
	Note that this equation allows you to derive transmission and then reproduction from the percentage growth in contagious, but it cannot tell you the percentage growth in contagious for a given transmission.
	This can only be found by solving numerically.
	I have outlined how to do this using JMP's fit model tool in the appendix.
	Reproduction and transmission are very closely linked, but reproduction has the advantage of ease of understanding.
	If it is greater than one, the outbreak is expanding out of control. Infections will continue to grow and there will be no end in sight.
	If it is less than one, the outbreak is contracting, coming under control. There are still new infections, but their number will gradually decline until they hit zero. The end is in sight, though it may be a long way off.
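Taking the definitions given earlier at face value (transmission is infections caused per contagious person per day, reproduction is infections caused over the whole contagious period), reproduction is roughly transmission multiplied by days contagious. That product is an assumption made for this Python sketch; the talk's own equations are in its appendix and are not reproduced here.

```python
DAYS_CONTAGIOUS = 9  # best estimate quoted in the talk

def reproduction_from_transmission(transmission_per_day,
                                   days_contagious=DAYS_CONTAGIOUS):
    """Assumed relation: reproduction = daily transmission * days contagious."""
    return transmission_per_day * days_contagious

def outbreak_status(reproduction):
    """Interpret a reproduction number the way the talk describes."""
    if reproduction > 1:
        return "expanding"
    if reproduction < 1:
        return "contracting"
    return "steady"

# Hypothetical transmission values, chosen only to show both outcomes.
status_high = outbreak_status(reproduction_from_transmission(0.2))
status_low = outbreak_status(reproduction_from_transmission(0.1))
```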
	The number of people contagious is the underlying engine that drives the outbreak.
	People contagious grows and declines exponentially. We can predict the path of the outbreak by extrapolating this growth or decline in people contagious. Here we have done it for Russia and Italy and for China.
	Remember the interesting observation from earlier: the percentage growth in infections declines exponentially. Here's why.
	If reproduction is less than one and constant, people contagious will decline exponentially towards zero.
	People contagious drives the outbreak.
	The percentage growth in infections is proportional to the number of people contagious. So if people contagious declines exponentially, the percentage growth in infections will also decline exponentially. Mystery solved.
	The slope of people contagious plotted on log scale gives us the contagious percentage growth, which then gives us transmission and reproduction through the equations on the last slide.
	Notice that there's a weekly cycle in the data. This is particularly visible in Brazil, but it's also visible in other countries as well.
	This may be due to numbers getting reported differently at the weekends or by people being more likely to get infected at the weekend. Either way, we'll have to take this seasonality into account when using people contagious to predict the outbreak.
	Because social behavior is constantly changing, transmission and reproduction change as well. So we can't use the whole distribution to generate reproduction.
	We chose 17 days as the period over which to estimate reproduction. We found that one week was a little too short to filter out all of the noise, two weeks gave better results, and two and a half weeks was even better. Having the extra half week
	evened out the seasonality that we saw in the data.
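A minimal Python sketch of the windowed regression: fit a straight line to the logarithm of people contagious over the most recent 17 days and read the daily percentage growth off the slope. The function name and synthetic data are assumptions, and the sketch makes no attempt to model the weekly cycle.

```python
import math

def contagious_growth(contagious, window=17):
    """Daily fractional growth of people contagious over the last `window` days.

    Fits log(contagious) against day by ordinary least squares; the slope b
    gives growth per day as exp(b) - 1. Requires positive counts.
    """
    recent = contagious[-window:]
    xs = list(range(len(recent)))
    ys = [math.log(c) for c in recent]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(slope) - 1

# Synthetic series (an assumption): contagious falling 2% per day.
contagious = [1000 * 0.98 ** d for d in range(30)]
growth = contagious_growth(contagious)
```

A shorter window reacts faster to changes in behavior but lets more reporting noise through, which is the trade-off behind the 17-day choice.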
	There is a time series forecast tool in JMP that will do all of this for us, including the seasonality, but because we're performing the regression on small sections of the data, we didn't find the tool helpful.
	Here are the derived transmission and reproduction numbers.
	You can see that they can change quickly.
	It is easy to get confused by these numbers. South Korea is showing a significant increase in reproduction, but it's doing well. The US, Brazil, India and South Africa are doing poorly, but seem to have a reproduction of around one or less.
	This is a little confusing.
	To help reduce the confusion around reproduction, here's a little bit of calculus.
	Driving a car, the gas pedal controls acceleration.
	To predict where the car is going to be, you need to know where you are, how fast you're traveling and how much you're accelerating or decelerating.
	In a similar way to know where the pandemic is going to be, we need to know how many infections there are, which is the equivalent of distance traveled. We need to know how fast the infections are expanding or how many people are contagious, both of which are the equivalent of speed.
	We need to know how fast people contagious is growing, which is the transmission or reproduction, the equivalent of acceleration.
	There is a slight difference. Distance grows linearly with speed and speed grows linearly with acceleration.
	Infections do grow linearly with people contagious, but people contagious grows exponentially with reproduction.
	There is a slight difference, but the principle's the same.
	The US, Brazil, India and South Africa have all traveled a long distance. They have high infections and they're traveling at high speed. They have high contagious. Even a little bit of acceleration has a very big effect on the number of infections.
	South Korea, on the other hand, is not going fast; it has low contagious. So it has the headroom to respond to the blip in acceleration and get things back under control without covering much distance.
	Also, when the number of people contagious is low, adding a small number of new contagious people produces a significant acceleration. Countries that have things under control are prone to these blips in reproduction.
	You have to take all three factors into account
	(number of infections, people contagious and reproduction) to decide if a country is doing well or doing poorly.
	Within JMP there are a couple of ways to perform the regression to get the percentage growth of contagious. There's the Fit Y by X tool and there's the nonlinear tool. I have details on how to use both these tools in the appendix. But let's compare the results they produce.
	The graphs shown compare the results from both tools. The 17 data points used to make the prediction are shown in red.
	The prediction lines from both tools are just about identical, though there are some noticeable differences in the confidence lines.
	The confidence lines for the Nonlinear tool are much better. The Fit Y by X tool transforms the data into linear space before finding the best-fit straight line.
	This results in the lower confidence line pulling closer to the prediction line after transforming back into the original space.
	Confidence lines are not that useful when parameters that define the outbreak are constantly changing. Best case, they will help you to see when the parameters have definitely changed.
	In my scripts, I use linear regression calculated in column formulas, because it's easy to adjust with variables. This allows the analysis to be adjusted on the fly without having to pull up the tool in JMP.
	I don't currently use the confidence lines in my analysis. So I'm working on a way to integrate them into the column formulas.
	Linear regression is simpler and produces almost identical results. Once again, keep it simple.
	We have seen how fitting an exponential to the number of people contagious can be used to predict where people contagious will be in the future, and also to derive transmission.
	Now that we have a prediction line for people contagious, we need to convert that back into infections.
	Remember, new infections equals people contagious multiplied by transmission.
	Transmission is the probability that a contagious person will infect a susceptible person per day.
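The conversion from a contagious forecast back to infections can be sketched as below. This Python example assumes constant daily growth in people contagious and a constant transmission, both passed in as inputs; in the talk, transmission is derived from the growth rate through an appendix equation that is not reproduced here. All names and numbers are illustrative assumptions.

```python
def project_infections(current_infections, current_contagious,
                       daily_growth, transmission, days_ahead):
    """Project cumulative infections forward day by day.

    Each day, people contagious grows by `daily_growth` (a fraction), and
    new infections = people contagious * transmission.
    """
    infections = current_infections
    contagious = current_contagious
    path = []
    for _ in range(days_ahead):
        contagious *= 1 + daily_growth
        infections += contagious * transmission
        path.append(infections)
    return path

# Hypothetical starting point: 5,000 infections, 1,000 contagious,
# no growth in contagious, transmission of 0.1 per day.
path = project_infections(5000, 1000, 0.0, 0.1, 3)
```

With zero growth the contagious pool stays at 1,000 and adds about 100 infections per day; a positive growth rate makes the daily additions compound.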
	The predicted graphs that result from this calculation are shown. Note that South Korea and Italy have low infections growth.
	However, they have a high reproduction extrapolated from the last 17 days worth of data. So, South Korea here and Italy here, low growth, but you can see them taking off because of that high reproduction number.
	The infections growth becomes significant between two and eight weeks after the prediction is made.
	For South Korea, this is unlikely to happen because they're moving slowly and have the headroom to get things back under control.
	South Korea has had several of these blips as it opens up and always manages to get things back under control.
	In the predicted growth percent graph on the right, note how the increasing percentage growth in South Korea and Italy will not carry on increasing indefinitely, but plateaus out after a while.
	Percentage growth is still seen to decline exponentially, but it does not grow exponentially.
	It plateaus out.
	So to summarize,
	the number of people contagious is what drives the outbreak.
	This metric is not normally reported, but it's close to the number of new infections over a fixed period of time.
	New infections in the past week is the closest regularly reported proxy for the number of people contagious. This is what we should be focusing on, not the number of infections or the number of daily new infections.
	Exponential regression of people contagious will predict where the contagious numbers are likely to be in the future.
	Percentage growth in contagious gives us transmission and reproduction.
	The contagious number and transmission number can be combined to predict the number of new infections in the future.
	That prediction method assumes that transmission and reproduction are constant, which they aren't. They change with behavior.
	But the predictions are still useful to show what will happen if behavior does not change or how much behavior has to change to avoid certain milestones.
	The only way to close this gap is to come up with a way to mathematically model human behavior.
	If any of you know how to do this, please get in touch. We can make a lot of money, though only for a short amount of time.
	This is the modeling. Let's check how accurate it is by looking at historical data from the US.
	As mentioned, the prediction works well when reproduction's constant but not when it's changing.
	If we take a prediction based on data from late April to early May, it's accurate as long as the reproduction number stays at around the same level of 1.0.
	After the reproduction number starts rising, you can see that the prediction underestimates the number of infections.
	The prediction based on data from late June to mid July when reproduction was at its peak as states were beginning to close down again,
	that prediction overestimates the infections as reproduction comes down.
	The model is good at predicting what will happen if behavior stays the same but not when behavior is changing.
	How can we predict deaths?
	It should be possible to estimate the delay between infection and death
	and the proportion of infections that result in deaths, and then use this to predict deaths.
	However, changes in behavior such as increasing testing and tracking skew the number of infections detected.
	So to avoid this skew also feeding into the predictions for deaths, we can use the exact same mathematics on deaths that we used on infections. As with infections, the deaths graph shows accurate predictions when deaths reproduction is stable.
	Note that contagious and reproduction numbers for deaths don't represent anything real.
	This method works because deaths follow infections and so follow the same trends and the same mathematics. Once again, keep it simple.
	We have already seen that the model assumes constant reproduction. It also does not take into account herd immunity.
	We are fitting an exponential, but the outbreak really follows the binomial distribution.
	Binomial and a fitted exponential differ by less than 2% with up to 5% of the population infected. Graphs demonstrating this are in the appendix.
	When more than 5% of the population is no longer susceptible due to previous infection or to vaccination, transmission and reproduction naturally decline.
	So predictions based on recent reproduction numbers will still be accurate; however, long-term predictions based on an old reproduction number with significantly less herd immunity will overestimate the number of infections.
	On the 21st of August, the US had per capita infections of 1.7%.
	If only 34% of infected people have been diagnosed
	as infected, and there is data that indicates that this is likely, we are already at the 5% level where herd immunity begins to have a measurable effect.
	At 5% it reduces reproduction by about 2%.
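The arithmetic behind the 5% figure is a single division, shown here in Python with the two numbers quoted in the talk. The variable names are assumptions.

```python
confirmed_share = 0.017   # US per capita confirmed infections, 21 August (from the talk)
detection_rate = 0.34     # share of infected people diagnosed (from the talk)

# If only 34% of infections are detected, the true infected share is larger.
estimated_true_share = confirmed_share / detection_rate
```

Dividing 1.7% by 0.34 gives 5%, the level at which the talk says herd immunity begins to have a measurable effect.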
	What can the model show us? Reproduction tells us whether the outbreak is expanding (greater than 1, the equivalent of accelerating) or contracting (less than 1, the equivalent of decelerating).
	Estimated number of people contagious tells us how bad the outbreak is, how fast we're traveling.
	Per capita contagious is the right metric to choose appropriate social restrictions.
	The recommendations for social restrictions listed on this slide are adapted from those published by the Harvard Global Health Institute. There's a reference in the appendix.
	What they recommend is when there's less than 12 people contagious per million, test and trace is sufficient. When we get up to 125 contagious per million, rigorous test and trace is required.
	At 320 contagious per million, we need rigorous test and trace and some stay at home restrictions.
	Greater than 320 contagious per million, stay at home restrictions are necessary.
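The thresholds above map naturally onto a small lookup, sketched here in Python. How the band edges are treated (for example, whether exactly 125 falls in the second or third band) is an assumption, since the talk only names the threshold values.

```python
def recommended_restrictions(contagious_per_million):
    """Map per capita contagious to the restriction level described in the talk.

    Band edges are an interpretation: below 12, 12 to 125, 125 to 320, above 320.
    """
    if contagious_per_million < 12:
        return "test and trace"
    if contagious_per_million <= 125:
        return "rigorous test and trace"
    if contagious_per_million <= 320:
        return "rigorous test and trace plus some stay-at-home restrictions"
    return "stay-at-home restrictions"

# 1,290 per million is the US figure quoted in the talk for 21 August.
us_level = recommended_restrictions(1290)
```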
	At the time of writing, the US had 1,290 contagious per million, down from 1,860 at the peak in late July.
	It's instructive to look at the per capita contagious in various countries and states when they decided to reopen.
	China and South Korea had just a handful of people contagious per million.
	Europe had tens of people contagious per million, except for Italy.
	The US had hundreds of people contagious per million when they decided to reopen.
	We should not really have reopened in May. This was an emotional decision not a data-driven decision.
	Some more specifics about the US reopening.
	As I said, the per capita contagious in the US at the time of writing was 1,290 per million, with a reproduction of 0.94.
	With this per capita contagious and reproduction, it will take until the ninth of December to get below 320 contagious per million.
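The general shape of this calculation is the time for an exponentially declining quantity to reach a target. The Python sketch below takes the daily growth rate as an input; the 1% daily decline in the example is hypothetical, and the sketch does not reproduce the date quoted in the talk, because the equation linking a reproduction of 0.94 to a daily rate is in the talk's appendix.

```python
import math

def days_to_threshold(current, target, daily_growth):
    """Whole days until `current` falls to `target` at a constant daily rate.

    `daily_growth` is a fraction per day and must be negative (a decline).
    """
    if daily_growth >= 0:
        raise ValueError("the level only falls to the target if growth is negative")
    return math.ceil(math.log(target / current) / math.log(1 + daily_growth))

# Hypothetical rate: contagious per million falling 1% per day from 1,290 to 320.
days_needed = days_to_threshold(1290, 320, -0.01)
```

Because the decline is exponential, small improvements in the daily rate shorten the wait considerably, which is why the reproduction number achieved during a lockdown matters so much.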
	The lowest reproduction during the April lockdown was 0.86.

PatrickGiuliano · ‎08-24-2020

Nice work @Roly I look forward to watching the presentation video in October! -@PatrickGiuliano