In this presentation, we explore the crucial role of prediction, confidence, and tolerance limits. Because they are easy to calculate and appear in most textbooks, they are commonly used; however, that also means they are often misused or misinterpreted. One open issue lies in accurately estimating these limits for mixed models, which is a serious limitation in many industries.
While JMP uses the best option available to estimate these limits (based on a paper published in 2019 in the Wiley journal Statistics in Medicine), we can run into trouble if we're not careful, as the method does have some shortcomings. As a consulting company, we need the methods we use to be solid. Therefore, we decided to delve deeper into the issue and evaluate these limits through simulation in JMP.
The presentation focuses on telling the story of the problem, making people across industries aware of these statistical dilemmas, along with a demonstration of how JMP was used to simulate different scenarios and how the models were evaluated. It showcases a graphical interface made for this purpose, as well as model building and the production of an interpretable output.

Welcome to my presentation. My name is Thomas. I come from a company called NNE, where I work as a statistical consultant in the pharma industry. Today, I'm going to talk to you about a topic that I find very interesting and that I also think is very relevant across industries. It is based on classical statistics; in particular, it's about confidence, prediction, and tolerance limits. Why don't we just get started? I want to first set the scene. I want you to imagine that you are a quality control engineer at a company that produces medical syringes. When you think about it, for a medical syringe to be safe, it has to keep some quality characteristics under control.
One of them is that it has to be the right size. If it's designed to be a certain size, you should keep it that size, because if you make it too big, you might be overdosing patients, and if you make it too small, you might be underdosing patients. That is why there are specification limits: subject-matter experts will say that a certain characteristic has to be within these values to be safe. For example, in this case, let us say that the diameter of the syringe has to be between 10 and 10.5 millimeters.
Then you, the quality engineer, come to your office and see that a batch has been delivered to you. Let's say it's 100 syringes, and you need to say something about their quality. You start off, you take a sample, you measure the diameter, and you figure out that it's inside specification. In this first case, you're inside specification, but then you start thinking: will this be enough to show that the quality has been kept in the whole batch? Making that claim based on just one sample is a bit hard to defend.
Then you do the next best thing: you take 10 samples, and you get values like these. They're all within specification. You will probably end up reporting the mean value, and you seem happy with it, but then the health authorities come. They tell you your average is fine and the observations in this plot look fine, but how do you know that the next sample would not look something like this? Or, even in your favor, it could also look like this.
That is the problem: when you report just an average, just one quantity, or just one plot like this, you have no real way of quantifying how uncertain you are about your estimates or your conclusions. That is the motivation for using statistical intervals. That's the topic of today: we will give an interval for something, and then we will be able to say how sure we are about it.
That was the motivation. As for the agenda for the rest of the presentation: I will first talk a little more about statistical intervals, a bit more theoretically, to lay down the ground rules. Then we will go into some simulations I have made in JMP. I made an interface where one can play around with different scenarios, and you'll see why this is important. Finally, we will see what a typical workflow at NNE looks like when using these intervals on client data, for example.
As I said, we start by looking at the theory of statistical intervals, so first I'm going to lay down some ground rules. Before we can go out and play, we need to specify what the rules are. I just want to mention that there are many ways of building prediction and tolerance intervals. The ones I'm going to present today are based on classical statistics. They're also the ones implemented in JMP, but there are other ways to build them; just a heads-up. These are the most common ones, I would say.
The first assumption we're going to make, the first idea, is: have we obtained a representative and random sample? When we build intervals, the first thing we do is collect some data, data from the past, from which we want to say something. We'll call that a sample. With an interval, you are going to say something about a population. Maybe it's your process, maybe it's a specific target population that you have in mind, but is your sample representative of it? Because that population is what your interval is going to talk about. What I mean is, for example, maybe every time a batch is delivered to you, it comes in a box, and you always sample only from the top because that's the easiest.
In some processes, depending on how they're set up, that may be a problem because you're never taking samples from other parts. Or maybe you have an oven where you produce something, a medical device perhaps, and you always sample from one particular place in the oven. If there is any variation in temperature, then you might have issues with the sample being representative.
In any case, in the simple case, we will assume a distribution for the data, in particular the normal distribution. To start, we will assume our response, whatever we're measuring, whether it's the diameter of a syringe or whatever your quality characteristic is, follows a normal distribution. If your data is not normally distributed, you can always try to transform it. Just keep in mind that this is one assumption that needs to be fulfilled for the intervals to be valid.
If your data is not normally distributed and transformations don't help, it could also be that you need a more complex structure, such as a model. I give this example because it's also relevant for what we're going to see later. This is some data that comes from a normal distribution and has within-batch variance.
You can see the groups spread almost the same as each other, but there's also between-batch variation. That means, for example, you could have different batches with different properties, maybe from different raw materials. Then you need to address the fact that this will happen in your data, even though it still follows an overall normal distribution. Those are the main rules of the game.
I'm going to start talking about the three main types of intervals we have today: confidence, prediction, and tolerance intervals. The confidence interval is one of the most common. It is used for a parameter or characteristic of a distribution. Since we are assuming that the data follow a certain distribution, in theory each such parameter has a true, theoretical value.
The mean and the standard deviation are among the most common examples: you may want to estimate the mean of the process, but also attach some confidence to it. You report an interval where you think the actual mean is, or the standard deviation, or a quantile, or another characteristic. An example would be the sentence, "We are 95% confident that the mean lifetime of a light bulb is between 1,000 and 1,200 hours." Instead of saying, "You know what? The mean lifetime of the light bulbs I produce is 1,100 hours," you now give a range where it can be, and you associate a certain amount of confidence with it.
What do we really mean when we talk about confidence? Confidence should be understood as how well the method performs. It's not about the specific interval; it's about the method. In other words, in the long run, 95% of the intervals that you calculate using this methodology will actually contain the true parameter. That also means that in 5% of the cases, they will not. You can choose the amount of confidence you want; 95% is very typical, but you can also use 90, 99, or 99.99. It all depends on the purpose of your interval.
Just be aware, because many people think that 95% refers to the idea that if you calculate an interval, then 95% of your future sample means will fall within it, and that is just not true, so be careful about that.
These are the formulas I will use. You can see there are two sets, and I'll tell you why. The first formula assumes a normal distribution and no random effects, and the second assumes random effects. The first is the classical one, and the second is the one that was developed later. I will not dive too deeply into the formulas right now, but anyone who wishes to dive deeper is welcome to ask about it.
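The slide formulas themselves are not reproduced in the transcript. As a sketch of the standard textbook forms (not necessarily the exact expressions on the slide), the classical two-sided confidence interval for the mean of a normal sample of size $n$ is

$$\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},$$

and, for a balanced one-way random-effects model with $b$ batches of $n$ items each, the analogous interval for the overall mean uses the between-batch mean square $MS_{\text{batch}}$ and $b-1$ degrees of freedom:

$$\bar{x} \;\pm\; t_{1-\alpha/2,\,b-1}\,\sqrt{\frac{MS_{\text{batch}}}{b\,n}}.$$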
Then we have the prediction interval. While the confidence interval is used for a mean value, a standard deviation, or some other characteristic, a prediction interval is used for a specific observation. Going back to the light bulb example, if you are the customer instead of the producer, you might want to investigate a specific light bulb instead of the average behavior, because when you buy one, you just want to know how long it will last for your specific purpose.
Prediction intervals are used to predict an interval for a single future observation. If you alter the formulas, it can be for three observations or five observations, but the typical case is just one. Be aware that when you calculate a prediction interval and say you are 95% confident about it, you are talking about one extra observation. You are not saying that all future observations should fall inside; it's just one of them.
An example: we are 95% confident that the lifetime of an individual new light bulb is between 700 and 1,500 hours. Again, the confidence means that if you repeat this process, then 95% of the time you would be right. Here are the formulas: the classical one and, again, the one that takes random effects, mixed models, into account.
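Again as a sketch of the textbook forms rather than the slide itself: the classical prediction interval for one new observation is

$$\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1}\,s\,\sqrt{1+\frac{1}{n}},$$

and in the one-random-effect case the term under the square root becomes the sum of the variance components plus the uncertainty of the estimated mean, with Satterthwaite-style effective degrees of freedom $\hat{\nu}$:

$$\bar{x} \;\pm\; t_{1-\alpha/2,\,\hat{\nu}}\,\sqrt{\hat{\sigma}_b^2+\hat{\sigma}_e^2+\widehat{\operatorname{Var}}(\bar{x})}.$$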
Then we have tolerance intervals. Tolerance intervals are more stringent, more strict, than prediction intervals. These are typically used in pharma or medical device companies, because a tolerance interval is built to contain a proportion of the future observations from the population. Now you not only want to contain one extra value, you want to contain, say, 99% of your future values. Then you can be pretty safe that it's a good enough interval.
As an example, you would say: we are 95% confident that at least 99% of the light bulbs produced in the future will last between 650 and 1,550 hours. Now you have two measures of confidence, in a way: the 95% and the 99%. These are the confidence and the coverage. Of course, you can set them to values other than 95 and 99; these are just examples. The confidence means that, in the long run, 95% of the intervals calculated this way will contain at least 99% of the future population.
You can see the first formula is the typical one, and it gets complicated quickly. The second actually comes from the literature, and it is, as far as I know, the best method we have for this. It covers the case of one random effect, but it can be generalized.
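The slide formula is not reproduced here, but the classical two-sided tolerance interval is commonly written as $\bar{x} \pm k\,s$, where one widely used closed-form approximation for the factor $k$ is Howe's method:

$$k \;\approx\; z_{(1+P)/2}\,\sqrt{\frac{(n-1)\left(1+\frac{1}{n}\right)}{\chi^2_{\alpha,\,n-1}}},$$

with $P$ the coverage, $1-\alpha$ the confidence, and $\chi^2_{\alpha,\,n-1}$ the lower $\alpha$ quantile of a chi-square distribution with $n-1$ degrees of freedom. Roughly speaking, the mixed-model version replaces $s$ with the total standard deviation $\sqrt{\hat{\sigma}_b^2+\hat{\sigma}_e^2}$ and $n-1$ with an effective degrees of freedom, which is part of what makes it complicated.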
Enough talking. Let's actually go to JMP, where it gets interesting. As I said, I will first show you the script that I have made. Let me minimize this. This is the main screen you see. There are two tabs, and we'll start with the first one. It is meant to simulate data and the intervals we just saw. We have two types of models: just the mean, or with a random effect. For the mean, you have just the normal distribution and the limits that you calculate. You can change the mean, the standard deviation, and the number of points you want to simulate. In the background, all of these calculations happen.
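The interface is a JMP/JSL tool whose code is not shown in the talk. As a rough stand-in, here is a minimal Python sketch of what the mean-model tab computes, using the textbook formulas above (Howe's approximation for the tolerance factor); the settings are illustrative, not the presenter's defaults:

```python
# Simulate normal data and compute two-sided confidence, prediction,
# and tolerance limits, mirroring the "mean model" tab of the interface.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n = 0.0, 1.0, 30        # illustrative simulation settings
alpha, coverage = 0.05, 0.95       # 95% confidence, 95% coverage

x = rng.normal(mu, sigma, n)
xbar, s = x.mean(), x.std(ddof=1)
t = stats.t.ppf(1 - alpha / 2, n - 1)

# Confidence limits for the mean
ci = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

# Prediction limits for one new observation
pi_half = t * s * np.sqrt(1 + 1 / n)
pi = (xbar - pi_half, xbar + pi_half)

# Tolerance limits via Howe's approximation of the k factor
z = stats.norm.ppf((1 + coverage) / 2)
k = z * np.sqrt((n - 1) * (1 + 1 / n) / stats.chi2.ppf(alpha, n - 1))
ti = (xbar - k * s, xbar + k * s)

print(f"confidence limits: {ci}")
print(f"prediction limits: {pi}")
print(f"tolerance limits:  {ti}")
```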
You can investigate, you can play around with how they relate to each other. You have the confidence limits, the green lines. You have the prediction limits, the red lines, and the tolerance limits. The idea for this one is just to see what happens if I, for example, change my coverage.
Then the limits of the tolerance interval become narrower. Or maybe you want to keep it at 95% coverage and change the confidence instead. The confidence is one minus alpha. If I set a higher confidence, notice, for example, the prediction interval between minus 1.8 and 2: if I want to be more confident about it, the limits become wider. They become wider because you're accepting fewer failures in your estimations. Because you're 99% confident, the price you pay is that the intervals become wider, and that is true for all of the intervals.
Of course, you always have to find a balance between what is good enough for you, what failure rate is the highest you can handle, and how precise you want the estimation to be.
As you can see, we can just click Simulate here, and you get the actual numbers; this is just meant for playing around. There's no problem with the mean model, so I'll jump over to the random effect model, which is more interesting, I believe. Here we have, again, the mean and the standard deviation. Now the standard deviation is the within, where within means within-batch, within-group variation.
Then you have between-batch variation, which you get when you specify the variance ratio. The variance ratio is how much between-batch variation you have relative to within-batch variation. Then you have the number of batches and the number of items per batch you want. For example, I can just click Simulate. Here you can see the result; let me change this back to the typical value of 5.
Here you can see that the simulated points have between-group variation: the spread within each group is fairly similar, but there's also some amount of variation between the groups. That's because we have a variance ratio. If I increase the variance ratio to 10, 10 times more, you can see now that the variation within a group is much less than the variation we see between groups.
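For readers following along outside JMP, here is a hedged Python sketch of this data-generating mechanism; the JSL implementation is not shown in the talk, so the names and defaults here are illustrative:

```python
# One-way random-effects simulation: every batch shares a common mean but
# gets its own random offset. The variance ratio sigma_b^2 / sigma_e^2
# controls between-batch spread relative to within-batch spread.
import numpy as np

rng = np.random.default_rng(7)
mu, sigma_within = 0.0, 1.0
variance_ratio = 10.0                 # try 1.0 vs. 10.0 and compare
sigma_between = np.sqrt(variance_ratio) * sigma_within
n_batches, n_per_batch = 5, 20

batch_offsets = rng.normal(0.0, sigma_between, n_batches)
data = np.array([mu + b + rng.normal(0.0, sigma_within, n_per_batch)
                 for b in batch_offsets])   # shape (n_batches, n_per_batch)

print("within-batch SDs:  ", data.std(axis=1, ddof=1).round(2))
print("SD of batch means: ", data.mean(axis=1).std(ddof=1).round(2))
```

With a variance ratio of 10, the batch means scatter much more than the points within any single batch, which is exactly what the plot in the demo shows.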
Just something to think about: this all started when we were working on some client data that had few batches and few items. Let me actually go down to two batches and two items and simulate some data. Let me simulate again, and I will tell you why in a second; you will typically see a certain misbehavior here. Let me put the variance ratio down to 1.
Here we go. Sorry about that. Now it gets very interesting. We have confidence limits, and confidence limits are for the mean of the process, so they should be narrower than the limits for individual values. But you can see that the prediction limits lie inside the confidence limits. Think about it: the prediction limits are supposed to carry the uncertainty from estimating the mean, plus some extra variation simply because a new observation is random. They should be clearly wider than the confidence limits, yet they are estimated to be narrower. This makes no sense intuitively, but this is the price we pay for having few batches and few items.
It's a problem that is not solved, because these formulas are based on that paper from Statistics in Medicine. They are, as far as I know, among the best methods we have right now, but you have to be aware that if you're pushing the boundaries, if you don't use many batches and items, or you don't have many levels of your random effect, then you will have a problem. These limits will not be very reliable if you have a low number of groups.
That is something to keep in mind. Also look at the tolerance intervals; they seem very wide. These tolerance limits also appear in that article, and they don't suffer from the same problem. In any case, I also want to show you the other tab, because when you simulate something here, you simulate it once: you click, you get one simulation; you click again, you get another.
But what if you want to do it many times to get an idea of the general behavior? Then you go to this tab, which has the same information, just for the random model. Then you can say: I want to click the button in the last tab 250 times, for example. We click Simulate, and then we wait a little bit; we should get the output soon. Now we have an output. Here there are some general plots, some histograms with smoothed kernel density estimates, and they give you an idea of the spread and shape of the distribution of the prediction limits and the tolerance limits.
You can see the tolerance limits tend to be wider than the prediction limits, and the prediction limits tend to be wider than the confidence limits. This makes sense under this scenario. Here I have also included some information that I think is very relevant. For the confidence limits: what percentage of the time do the confidence limits contain the actual mean? In this simulation it should be 95%; it's 94.4%. That's pretty good.
For the prediction limits, the estimated confidence is how often the limits contain one extra point: 94% of the time. I put the next line in here with a little star because it is very important that you don't use prediction intervals in this way, but I think it's a common enough occurrence that I found it necessary to mention.
You might think that having a prediction interval with 95% confidence means that 95% of your future samples will be inside your limits. Then you will not be very happy to find out that in this case, with 95% confidence, you will only be getting 84% of your future population inside the limits. Just a heads-up: if this is how you were planning to use prediction limits, be careful, because you might be overestimating how protected you are.
Then we have the estimated confidence and the estimated coverage of the tolerance limits. In this case, they should both be 95%, and they're fairly good. Be aware of this, especially if we pump up the variance ratio and go down in the number of batches. Let's simulate again and see what happens. Now we have only three batches with 20 items and a very high variance ratio. Let's see what the output says.
For example, here, now, in a scenario where you are thinking you will cover 95% of the population with a prediction interval, you are covering 57%. The interval itself is good enough, because what it's supposed to do, contain one extra observation with the stated confidence, it comes close to; but please don't use prediction intervals in the other way. That was my interface.
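The Monte Carlo behind that tab is not shown either. As a simplified stand-in, this Python sketch runs the same kind of repeated-simulation check, using the plain normal model rather than the random-effect model (so the numbers will differ from the demo), including the prediction-interval misuse from the previous slide:

```python
# Repeatedly simulate, then estimate: how often does the CI contain the
# true mean, how often does the PI contain ONE new point, and how much of
# the future population does the PI actually capture?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n, reps, alpha = 0.0, 1.0, 10, 2000, 0.05
t = stats.t.ppf(1 - alpha / 2, n - 1)

ci_hits = pi_hits = 0
pop_fracs = []
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    xbar, s = x.mean(), x.std(ddof=1)
    if abs(xbar - mu) <= t * s / np.sqrt(n):
        ci_hits += 1                           # CI contains the true mean
    half = t * s * np.sqrt(1 + 1 / n)
    if abs(rng.normal(mu, sigma) - xbar) <= half:
        pi_hits += 1                           # PI contains one new point
    # true fraction of the future population inside this PI
    pop_fracs.append(stats.norm.cdf(xbar + half, mu, sigma)
                     - stats.norm.cdf(xbar - half, mu, sigma))

print(f"CI contains true mean:     {ci_hits / reps:.1%}")   # ~95%
print(f"PI contains one new point: {pi_hits / reps:.1%}")   # ~95%
print(f"population inside PI, lower 5% quantile: "
      f"{np.quantile(pop_fracs, 0.05):.1%}")                # well below 95%
```

Both confidence estimates land near 95%, but the population fraction you can claim with 95% confidence is noticeably lower, which is the point the starred line in the interface makes.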
To finish, as I promised, we have an example of how we would analyze some data. I want to go back to the example of the syringes, since we've been talking about them. I want you to imagine a scenario where you have two different production lines producing the syringes, Line 1 and Line 2. There are 40 batches; you can see it right here. If we scroll all the way down, you can see there are 40 batches, 20 in each production line, and you have measured the diameter of the syringes.
This is your sample: you have 20 measurements per batch. Where would we start? We would start by looking at the distribution, because getting confidence, prediction, and tolerance intervals in JMP is fairly easy. You would go to the Distribution platform and ask for a confidence interval, 95%. Then you could ask for a prediction interval, 95% again. And you could ask for a tolerance interval. There you go: you have your intervals.
Here, I would be okay, but I'd say be a little careful, because, as we talked about, one of the assumptions we made is that the data are normally distributed. This looks fairly okay; maybe in practice you could get away with saying it's good enough. But why don't we ask JMP to fit all the different distributions it wants and tell us which it thinks is best? We'll just give it a second; it should be done fairly soon.
One has to try to respect the assumptions as much as possible, and here we see that the normal distribution is not JMP's first pick. It actually tells you, "Pick one of these instead." Besides the histogram, you can always make a normal quantile plot and see whether your points lie within the red lines; if they do, you're usually fine, and the straighter the pattern, the better. Maybe you could get away with this, but let's see how we can make it better.
The way we usually do it is to go to Fit Model. Let's start by putting syringe diameter as the Y, nothing else. If we take a look at the residuals, there's some funky stuff going on: there seem to be two levels. Let us try again. Let us go to Fit Model, model the diameter, and add batch as a random effect, because we don't care about the batches in themselves, but we acknowledge that there might be differences between them. So we add batch as a random effect, put production line in as well, and click Run.
What do we see here? The residuals now seem fairly evenly spread out; there's no particular shape. That is another way of saying we can reasonably assume that the model is correct. We see that production line is very significant, which means there are differences between Production Line 1 and Production Line 2. Why? We don't know; there must be a reason for it, but this tells us it's good to keep it in the model.
Then, down here, we look at the variance components. We see what share of the variation comes from between-batch variation and how much comes from within-batch variation. Usually we like to add the square root, to see the estimated standard deviation for each variance component. We can see they're fairly equal: there's the same amount of spread between batches as within batches.
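Outside JMP, the same kind of model can be fit with, for example, statsmodels in Python. This is a hedged sketch, not the presenter's workflow; the file and column names ("syringes.csv", "diameter", "line", "batch") are placeholders for the demo's data table:

```python
# Fit a mixed model: fixed production-line effect, random batch effect,
# then extract the variance components and their square roots (SDs).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("syringes.csv")   # hypothetical export of the JMP table

fit = smf.mixedlm("diameter ~ C(line)", df, groups=df["batch"]).fit()
print(fit.summary())

var_between = float(fit.cov_re.iloc[0, 0])   # between-batch variance
var_within = fit.scale                       # residual (within-batch) variance
print("SD between batches:", np.sqrt(var_between))
print("SD within batches: ", np.sqrt(var_within))
```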
The easiest way to get confidence intervals once you have a model in JMP is to use the Profiler, because JMP tells you with the blue lines: "My best guess, if you're on Production Line 1, is that the average is 10.29, but with 95% confidence it is between 10.27 and 10.30. For Production Line 2, you get an estimate of 10.19, but with 95% certainty it's between 10.18 and 10.20."
If you want prediction intervals, you can ask for them here. As I said, it's the complex formula. Now, for each production line, it will tell you about an individual syringe: if you take a new one from Production Line 1, the best guess is that it's somewhere between 10.21 and 10.36 with 95% confidence, and you get other values for Production Line 2. You can always save them: if you do Save Columns, for example, the individual confidence intervals, these values are saved into your data table, ready to be used.
For now, I will just delete them, since I'm going to show you how we approach this with a script that we have made. It's another script, so we open it up; it's actually a workflow. Once we're happy with a model, and that's the first thing we need, being happy with the model, we just click Play. Then the workflow should take care of giving you the right conclusions. It's just running; we need to give it a little bit of time. At the end, you end up with a control table as output.
You can see it copies the information you had in the file and then calculates a lot of new columns. The easiest way to see what you want is to click on the script that says Control Graph Limits, and it gives you an output like this. The idea is that you now have tolerance intervals, as well as confidence intervals, calculated from your model. Since you have a fixed effect of production line, you get two different sets of limits: this is for Production Line 1, and this is for Production Line 2.
If we look at an individual one, the interpretation depends on the confidence… Let me just see what confidence we have: 0.05, but that's one-sided. If I wanted it two-sided, I could just change it like this. In any case, we can now say: with 95% confidence, at least 99.73%, which was the coverage we chose, of the future population from Line 1 will be within these two lines, and the same goes for Production Line 2.
Since we are within the specification limits, we have shown, as well as we can, that we are maintaining an adequate quality level. You would have to worry if you were outside specifications, which might mean that your process is not good enough, but it could also mean that you just need to collect more data, because the more data you have, the narrower these limits become. That was my presentation. I hope you got something out of it. Thank you for listening.