Byron_JMP
A Guide to JMP's Stability Platform

 

Expiration and retest intervals are commonly assigned to pharmaceuticals as well as a wide range of other products. These dates are often extrapolated from a limited set of stability data using statistical models that regress the stability trend to some point in the future. A confidence interval around the regression provides a probability of being above a specified threshold at the extrapolated time. 

For some products that we need to rely on, like drugs or safety equipment, we generally pay close attention to those dates. Taking a medication from the back of the cabinet that expired two years ago might not be a good idea; likewise, depending on a fire extinguisher that is five years past its retest date might be equally imprudent. On the other hand, I once got a great deal on a giant spool of mil-spec nylon thread that was six months past its expiry date, and that worked out fine.

Determining the expiration or retest interval for a product or material is based on a relatively simple statistical method called Analysis of Covariance (ANCOVA). For this procedure we look at how a response is affected by a continuous variable and a categorical variable at the same time. For product stability applications like expiry and retest, the continuous variable is time, and the categorical variable is the lot or batch of the material. When we set up the model it has three terms: time, batch, and the interaction between time and batch (time*batch). And of course, since it's a statistical model, there are a handful of assumptions that more or less condense down to two things. First, if the data don't fall consistently along the regression line, you'll get unreliable results (if the model doesn't fit the data, the stats for the model won't make sense). Second, the batches must be independent; for example, the data can't be replicates of the same batch. If your trend over time is curved, either a transformation of the data or a nonlinear model might help, but that's part of a different discussion.
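The three-term model is easy to see outside JMP as well. Here is a minimal sketch in Python with statsmodels; the table and the column names (Month, Batch, Assay) are hypothetical, made up just for illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stability data: three batches assayed out to 12 months.
data = pd.DataFrame({
    "Month": [0, 3, 6, 9, 12] * 3,
    "Batch": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "Assay": [100.1, 99.4, 98.8, 98.1, 97.5,
              100.4, 99.9, 99.1, 98.6, 98.0,
              99.8, 99.2, 98.5, 97.9, 97.2],
})

# Three-term ANCOVA: time, batch, and the time*batch interaction.
# "Month * C(Batch)" expands to Month + C(Batch) + Month:C(Batch).
fit = smf.ols("Assay ~ Month * C(Batch)", data=data).fit()
print(fit.summary())
```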

Setting up the analysis in JMP is pretty easy, but it also gets a little complicated, because there are multiple ways to set up the model and each one returns a slightly different result. Fairly frequently I get questions like, "Hey, I did the analysis in the Stability platform, then did it somewhere else and got a different p-value, crossing time, slope, or intercept. Which method is wrong?" In almost every case the mismatched results are due to how the model was set up for the comparison. Before we get into how the model can be set up, let's talk about the "standard method."

Understanding the "Standard Method"

I'm not a statistics researcher, but I'm pretty sure R. A. Fisher or H. G. Sanders described the ANCOVA method in the 1930s, where it was applied to the uniformity of agricultural trials. In the pharmaceutical industry, the "standard method" for stability analysis seems to be tied to a paper describing the SAS STAB macro developed by the FDA in 1984, and that macro has long been the benchmark for determining expiry and retest intervals. It uses SAS procedures to calculate the time when the lower or upper 95% confidence interval of the regression line crosses the lower or upper specification limit. The guidance behind the macro is outdated, though: the latest FDA guidance to industry (2004) and ICH Q1E (2003) supersede the older document, and the two are closely aligned but allow some flexibility.

Despite this, many regulators still expect to see results from the "standard method." It's crucial to know that, while this method is a widely accepted practice, it is a statistical method, not a law of nature, and better, more accurate methods may exist thanks to advancements in computing power and statistical research.

There are three models considered:

  1. The whole model includes three terms: time, batch, and time*batch. This model has a different slope and intercept for each batch.
  2. The common slope model includes only time and batch. This model has a different intercept for each batch, and all the slopes are the same.
  3. The common slope and common intercept model includes only one term, time. This model has one intercept and one slope.

As the models are considered, if the p-value for the model term being tested is greater than 0.25, then a decision is made to use the next, simpler model. That threshold might seem very large, but the goal here is consumer protection: with a cutoff of 0.25 it takes only weak evidence of batch differences to keep the batches separate, so batches are pooled only when they really do look alike.

If the first model has a p-value greater than 0.25, then model 2 is used, and if model 2 has a p-value greater than 0.25, then the third model is used (a sketch of this decision sequence follows below). Everyone wants to use the third model because its confidence intervals are the narrowest, which results in a longer expiry or retest interval. In the case of either the first or second model, where the p-value is less than 0.25, the crossing time of the worst-case batch is used. No one wants to use model 1, for two reasons. First, and most obvious, it's the worst case. Second, and not so obvious, the degrees of freedom used to calculate the confidence interval of the worst-case regression line are based on the small number of points in that one regression line, so the confidence interval is wider, and the crossing time is often a lot earlier than that of the model with the pooled lots.
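To make that decision sequence concrete, here is a rough sketch in Python/statsmodels, reusing the hypothetical `data` table from the sketch above. The 0.25 threshold follows the text; this illustrates the logic and is not the STAB macro itself:

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Model 1: separate slopes and intercepts; model 2: common slope;
# model 3: common slope and common intercept.
m1 = smf.ols("Assay ~ Month * C(Batch)", data=data).fit()
m2 = smf.ols("Assay ~ Month + C(Batch)", data=data).fit()
m3 = smf.ols("Assay ~ Month", data=data).fit()

# F-test for the time*batch term: compare model 2 against model 1.
p_interaction = anova_lm(m2, m1)["Pr(>F)"].iloc[1]
if p_interaction > 0.25:
    # Slopes are poolable; now test the batch intercepts (model 3 vs. 2).
    p_batch = anova_lm(m3, m2)["Pr(>F)"].iloc[1]
    chosen = m3 if p_batch > 0.25 else m2
else:
    # Separate slopes and intercepts: use the worst-case batch.
    chosen = m1
```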

Buried deep in the Degradation platform (Analyze > Reliability and Survival > Degradation > third tab > Stability), JMP has a fantastic stability analysis report for calculating expiry and retest intervals. It follows the SAS STAB macro exactly, but it also allows for that little bit of flexibility in ICH Q1E, and that's where the comparable-results problem starts.

Addressing Common Objections to the JMP Stability Platform

The questions that arise when the p-values or crossing times in JMP don't match those from other methods can be explained by several key statistical decisions made during the analysis.

Pooled vs. Non-Pooled Variance (Mean Square Error)

When pooling batches is not an option, i.e., the p-value from model 1 above is less than 0.25, there is still an option to use the pooled mean square error mentioned in ICH Q1E. In the JMP Stability platform launch dialog, the option is called "Use pooled MSE for Nonpoolable model". The decision to pool variance will substantially impact the crossing time. A non-pooled variance model is often more conservative, resulting in an earlier crossing time and a wider confidence interval; this approach is consistent with the old SAS STAB macro. Conversely, a pooled variance model is less conservative and can result in a later crossing time and a narrower confidence interval, which is allowed by ICH Q1E guidance.

For the individual (nonpooled MSE) model in the Fit Model platform, the fixed effect is time, and batch goes into the By role, so we fit the model for each level of batch separately. To calculate crossing times, we use inverse prediction for each batch separately and choose the worst-case batch.

For the pooled MSE model in the Fit Model platform, the fixed effects are time, batch, and time*batch. To calculate crossing times, we use inverse prediction for all levels of batch and then choose the worst-case batch.
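Here is a rough sketch of both setups outside JMP, again in Python/statsmodels with the hypothetical `data` table from above. The lower spec limit of 95, the 60-month search grid, and the two-sided 95% interval are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

LSL = 95.0                                   # hypothetical lower spec limit
grid = pd.DataFrame({"Month": np.linspace(0, 60, 601)})

def crossing_time(fit, exog):
    """First time the lower 95% confidence bound on the mean crosses LSL."""
    ci_low = fit.get_prediction(exog).conf_int(alpha=0.05)[:, 0]
    below = np.nonzero(ci_low < LSL)[0]
    return exog["Month"].iloc[below[0]] if below.size else np.inf

# Nonpooled MSE: fit each batch separately (batch in the By role),
# then take the worst-case (earliest) crossing time.
nonpooled = min(
    crossing_time(smf.ols("Assay ~ Month", data=g).fit(), grid)
    for _, g in data.groupby("Batch")
)

# Pooled MSE: one model with time, batch, and time*batch, so every batch
# shares the same error variance; predict per batch, take the worst case.
m_full = smf.ols("Assay ~ Month * C(Batch)", data=data).fit()
pooled = min(
    crossing_time(m_full, grid.assign(Batch=b))
    for b in data["Batch"].unique()
)
print(nonpooled, pooled)
```

Because the pooled model estimates one error variance (and its degrees of freedom) from all the batches, its confidence band is narrower, and the worst-case crossing usually lands later than the nonpooled one.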

Full vs. Reduced ANCOVA Models

The JMP Stability platform runs all three models and then picks one using the 0.25 p-value decision threshold. Another way to interpret the model selection method is to reduce the model when a model term is rejected.

The process for this decision is as follows:

    • Step 1: If the Batch * Month interaction term's p-value is greater than or equal to 0.25, that term is removed, and the model is recalculated using only time and batch as fixed effects.
    • Step 2: Next, if the p-value for the batch term is also greater than or equal to 0.25, that term is removed, and the model is recalculated using only time as the fixed effect.

This stepwise model reduction can lead to different p-values in subsequent steps, which results in different decisions about pooling batches and, more importantly, different crossing times. In some cases, this method results in longer expiry or retest intervals.

Centered vs. Uncentered Polynomials

The SAS PROCs used in the STAB macro, as well as the JMP Stability platform, use uncentered polynomials. Centering polynomials is a data transformation that shifts the origin of the data to its mean. This is generally a good practice for most applications where parameter estimates are important. However, when the intercept of the regression model is a critical part of your analysis, it's safer to use uncentered polynomials, although this only tends to be a concern in a small number of specific cases. The JMP Fit Model platform uses centered polynomials by default; in the Fit Model launch dialog, centered polynomials can be turned off from the red triangle menu (top left).
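A quick way to see what centering does, again with the hypothetical data (patsy's built-in center() transform does the shift):

```python
import statsmodels.formula.api as smf

# Same response, two parameterizations: uncentered vs. centered time.
uncentered = smf.ols("Assay ~ Month", data=data).fit()
centered = smf.ols("Assay ~ center(Month)", data=data).fit()

print(uncentered.params["Intercept"])  # estimated response at Month = 0
print(centered.params["Intercept"])    # estimated response at mean(Month)
```

The two fits make identical predictions; only the parameterization, and therefore the meaning of the intercept, changes, which is why it matters for an analysis that hinges on where the intercept sits relative to a specification limit.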

Mixed Models and Random Effects

Yes, "batch" is a very good candidate for a random effect. It's not an option in the Stability platform, but random effects can be added in the Fit Model platform. This topic warrants its own separate discussion; I'll probably wait until after JMP 19 is released before diving into it.
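For anyone who wants a head start, here is a bare-bones sketch of batch as a random effect in Python/statsmodels. With only a handful of batches the variance components will be poorly estimated, so treat this as a teaser rather than a recipe:

```python
import statsmodels.formula.api as smf

# Random intercept and slope for each batch, fixed effect for time.
mm = smf.mixedlm("Assay ~ Month", data=data,
                 groups=data["Batch"], re_formula="~Month").fit()
print(mm.summary())
```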

Conclusion

The JMP Stability platform is a highly scrutinized tool that produces accurate results. It can reproduce the results from the "standard method" (the SAS STAB macro) as well as other analysis of covariance models. Differences in results stem from different ways of setting up the model. The key is to set up the model correctly and understand the implications of each decision. Is there a right or wrong way to set up the model? Maybe. All of the methods could be right in the correct context, but reproducible results across platforms in JMP, or across other software packages, depend on knowing the differences between the models.

References

  • ICH Q1E: Evaluation of Stability Data: This guideline provides recommendations on how to use stability data and describes when and how extrapolation can be considered when proposing a retest period or shelf life. It also specifies the statistical approach for pooling data, including the use of a 0.25 significance level for F-tests. You can find the document on the ICH website.
  • JMP Documentation: For specific implementation details and examples of how to perform stability analysis using the JMP platform, including the Degradation platform, you can refer to the official JMP documentation. The Stability Analysis in the Degradation Platform help page provides a detailed overview and a practical example.
  • American Statistical Association: The paper, "Perspectives on Pooling as Described in the ICH Q1E Guidance," discusses the statistical pooling strategy from ICH Q1E in detail, offering a critique from an empirical and scientific perspective. This is a valuable resource for understanding the technical nuances of the method.