In oil refining, a key optimization technique is to provide continuous online control of quality variables that are sampled only infrequently. This requires an empirical predictive model of the quality variable based on online process conditions. The model is then used as an inferred variable, which can be utilized in a continuous control algorithm.

There are several pitfalls that cause models with strong fit statistics to fail to accurately predict the quality variable when deployed online. Many of the causes of model failure are outside of the engineer's control, such as abnormal process conditions and input signal failures. A predictive model can be built that minimizes the impact of process externalities, but doing so often hinders model accuracy. It is therefore the engineer's challenge to construct predictive models that balance accuracy with reliability.

This paper explores several steps in the model development process that use JMP to build predictive models suited for online process control. By using a custom add-in, it's possible to aggregate raw process data into a format that filters process variability. The aggregated data is screened using such tools as Distribution and Screen Outliers to check for outliers and operating ranges. Model inputs are transformed with column formulas to further reduce process variability. Finally, the model is constructed using Generalized Regression in Fit Model to compare regression algorithms and minimize inputs, enhancing reliability. The end result is a model that provides enough accuracy to continuously optimize the process to a quality constraint, while remaining robust against process/instrumentation disturbances. 

 

 

Hi, my name is Andrew Weaver. I'm a controls application engineer with ExxonMobil. And today we're going to be exploring how to use JMP to build predictive models for online optimization control. In our chemical and refining units, we use advanced process controllers to give real time control of the units, both to keep our process units within quality limits and to push product optimization by reducing product giveaway.

These advanced controllers depend on real-time measurements to respond instantaneously to the process, based on the feedback coming in from our sensor data. One of the key challenges when we're building these control models is that a lot of our quality variables are actually taken via a lab sample and don't have continuous measurements. In the example shown here, we have a lab value that's taken every two hours, but as shown on the chart at the bottom, a process shift can occur before the sample is taken, and our controller will need information to know that that process shift is happening.

The solution is to build a predictive inferential model that uses other sensors in the process to tell the online controller when these process shifts are happening to the quality variable. This allows the controller to make real-time adjustments without waiting for lab results to see the effects of its control moves.

Building this model can be difficult at times, and it involves a careful balance between your engineering domain knowledge and your statistical modeling techniques. Today we're going to walk through how to build these models in JMP and some of the unique challenges that come along the way when building models for online process control. For today, I'd like to introduce a simplified example of a process where you might see an inferential model in online process control.

This is a distillation column that is separating a mixed stream of propane and butane into its two components. The overhead will be a primarily propane stream that's boiling up the tower, and the bottoms will be a predominantly butane stream that's condensing down the tower. The quality variable we want to measure is the propane concentration in the tower overhead, which is taken via a lab sample.

Before we begin any modeling, it's important to understand some of the physical characteristics of the process, as these should influence the modeling choices that we make. For example, the temperature effects we have on the tower are very important to the distillation that separates the butane and propane mixture. As temperature in the tower increases, we expect more of the heavier component, the butane in this case, to boil up the tower, which will dirty the propane stream we're trying to measure.

This means that the concentration of propane in the overheads will be lowered. This helps inform the model: any temperature effect we see in our predictive model should have a negative sign with respect to the propane concentration in the overheads. Additionally, we have pressure data available to us, and pressure also affects the distillation by changing the boiling points of both components. This will also be important to consider as we begin modeling.

One final note is on the flow measurements on the unit. We have several, and we're going to see later that flow is strongly correlated with the concentration of propane in the overheads, particularly the feed flow, the overhead product flow, and the reflux flow (flows one, two, and three, respectively). Using these flow measurements in the model without any modification, though, will lead to some challenges down the road, as the raw flows don't account for any accumulation in the system or the system dynamics. To use them, we're going to need to modify them from extensive variables, which they currently are, into intensive variables. We'll see later on how we can achieve this.

The key takeaway from exploring the system right at the start is that we understand the physical meaning of each of the inputs before we start the model building process, so we have a clear guide for how we want the model to look. This ensures that our model reflects real process behavior, not just a statistical correlation that we see in the data. Finally, these insights will shape which inputs we pull into JMP, how we screen them, which variables we transform, and ultimately what ends up in the final model.

Let's quickly walk through the full model building workflow from start to finish to get an idea of what it takes to build a model, and then we'll go into JMP and look at some of these steps in more detail. The first step is to select an inferential candidate. This is typically a quality variable that already has regular lab sampling, with a large amount of data behind it, and that already has defined quality limits we want to control against. In our case, this is going to be the propane concentration in the overhead, the lab that we mentioned on the previous slide.

Next, it's time to begin an engineering analysis of the process. We started this in the last slide, but there are a couple other considerations to take a look at as well when you're looking at your own data. First, evaluate any unmeasured disturbances such as feed quality shifts and what effect those might have on your final model. Also, you need to consider what operating conditions your model is expected to perform in. If you are operating under a wide variety of operating conditions, your model will need data from all those different conditions that the unit is running in to be effective. At this point, you will consider if you have enough data to begin modeling, or if you need to run a dedicated plant test to determine what data you actually need to pull.

Next, you can perform input variable selection, which is what we did on the previous slide, where we went through each of the potential inputs to the model and walked through the physical effect each should have on the inferred variable. Once the inputs are selected, it's time to collect the data. For our process engineering, this typically comes from a process historian that's storing the data for us and gives us minute-by-minute data from each of the sensors on our process. For models with broad applicability, it's good to collect between 6 and 18 months of this real-time data, depending on how frequently the lab is sampled. This ensures that you have enough data to give yourself a model that is applicable across many different operating conditions.
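As a rough illustration of that pull, here is a minimal sketch of shaping a historian export into a common one-minute table. It assumes the export lands as a long-format CSV with timestamp, tag, and value columns; real historian interfaces differ, so the file name and column names are placeholders, not anything from the talk.

```python
# Minimal sketch: shape a long-format historian export (timestamp, tag, value)
# into a wide table on a common 1-minute grid. File and column names are
# illustrative placeholders only.
import pandas as pd

raw_long = pd.read_csv("historian_export.csv", parse_dates=["timestamp"])

raw = (
    raw_long
    .pivot_table(index="timestamp", columns="tag", values="value")
    .resample("1min").mean()   # common 1-minute grid for every sensor
    .ffill(limit=5)            # bridge short gaps; leave long outages as NaN
)
```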

Next, it's time to go into your data cleaning and your pre-processing. The data we're collecting is one-minute data for all sensors, so you'll have a lot of data for each lab sample you're running. We're going to need to format this data into clear XY pairs for each lab sample, and we'll do this using an add-in later on. You also need to start looking at outlier screening to flag suspect data that you have pulled.

It's important to note that this should not be done using purely statistical tests, as your outliers should be physical effects in the process that need to be removed, such as removing data when the unit's not running, when it's experiencing severe upsets, or when you determine that your sensor data is faulty. Once the data is screened, we can begin to build the models themselves. A typical trap for someone just starting to build these models is to throw all the inputs we've collected into a least squares algorithm and generate a model that way. But this doesn't work, for a couple of different reasons.

The first is that we want to make sure the model has a physically justified effect for every input that's in there. When we run a regression algorithm over multiple effects, it's very likely that artifacts will be put into the model that are a good fit statistically but don't exist in reality. The engineer needs to look at the sign and magnitude of each input being put in the model to make sure it matches their engineering expectations.

The second and more subtle reason is that when we put these models in online control, we need to make sure they are robust and reliable. Every input we put in the model is a single point of failure if the instrument attached to that input ends up failing. For that reason, we try to minimize the number of inputs to reduce the vulnerability to instrument failure. Once a model has been built, it's time to begin testing: first offline, to compare the predictions to our new lab results, and then online, built in with the online controller. This is a very cyclical process, as you'll often find that a model may predict very well but will fail when put in closed-loop control, and the engineer will have to go through several iterations of building new models until they find one that works both as a predictor and within the control system.

Now that we've gone through the steps of building an inferential model, let's open JMP and go into some of the steps in a little more detail. This first data table shows what the raw data pull will look like. You're getting one-minute data for each of your sensors and your lab measurement, and this data is over an eight-month period. By opening a time series plot of our lab variable, we can see that the lab is taken at infrequent time intervals; it looks like the lab is taken every few days, with some periods where we're taking multiple lab samples in a short period of time.

As mentioned before, the format our data is in is not really conducive to building an inferential model, as we need clear XY relationships between each of our lab results and our inputs. We're going to use a custom-built add-in to help us build a simplified data table that builds that XY relationship for us. We're going to aggregate the data by selecting our lab sample and our date, and then we're asked to consider what type of aggregation interval we want to use on our data.

This add-in is going to take a look at our data table, detect when the lab value has changed, and then average each of the inputs around the change in the lab variable. This can be customized either at a global level or for each individual input. Additionally, this averaging can be done on a look-back basis, where you start at the time the lab sample was taken and look back X number of minutes, or on a centered approach, where you take the lab sample and average a segment of data before and after the sample was taken.

Ultimately, this is going to depend on how much you trust the recorded timing of your lab samples and what your process variability is. For our example, we're going to set a global interval of 60 minutes for all variables, and this will produce good aggregated results. This does take a couple minutes to run, so we're going to skip straight to the result, which is this data table where you can see that we have a clear, separate row for every single lab result, with clear X data for all of our inputs.
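The add-in itself is a custom JMP tool, but the idea is easy to mirror outside of it. Here is a minimal sketch, assuming the one-minute data sits in a pandas DataFrame indexed by timestamp with a sparsely updated lab column; the column names and the helper function are placeholders, not the add-in's actual code.

```python
# Minimal sketch of the aggregation idea: detect lab changes, then average each
# input over a look-back (or centered) window around the change. Column names
# are placeholders; the real add-in is a custom JMP tool.
import pandas as pd

def aggregate_around_lab(raw, lab_col="lab", window_min=60, centered=False):
    lab = raw[lab_col].ffill()
    # A lab "event" is any timestamp where the recorded lab value changes.
    events = raw.index[lab.ne(lab.shift()) & lab.notna()]

    rows = []
    for t in events:
        if centered:
            half = pd.Timedelta(minutes=window_min // 2)
            start, end = t - half, t + half
        else:  # look-back: the window_min minutes leading up to the sample
            start, end = t - pd.Timedelta(minutes=window_min), t
        rows.append(raw.loc[start:end].drop(columns=[lab_col]).mean().rename(t))

    agg = pd.DataFrame(rows)
    agg[lab_col] = lab.loc[events].values   # one clear Y per aggregated row
    return agg

# Example: the 60-minute global look-back used in the demo
# table = aggregate_around_lab(raw, window_min=60)
```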

Using this data, we can now begin our screening, our pre-processing, and ultimately our modeling. We'll start by taking a look at the distributions to see if there are any effects in our data that should be removed. Recall that we don't want to remove outliers based only on statistical tests; we want to make sure that every data point we remove has a physical meaning in the process behind it. For example, looking at our lab itself on the left, we see a couple of tail effects at the far end of the distribution, but it's too early to tell if those are real data points or simply low points in our normal processing environment.

However, if we look at flow one closer on the right, we can see a long tail of low feed flows and even some negative flows in the distribution. Since this is the feed going into the unit, it's important that we capture the feed rates we expect the unit to be operating at normally, so we need to define some threshold for what a normal feed flow into this unit is. For this example, we're going to set that threshold at 10; anything below that is the unit not operating in normal conditions or the unit being shut down. We'll select all of the data below 10 and exclude it from our data set.

Scrolling through the rest of the variables, the only other standout is flow three, which is our reflux flow coming back into the tower. It shows some very high flow rates, which are likely indicative of a severe process upset, as they don't match at all with the rest of our operating window. For that reason, we'll also hide and exclude these points.
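In JMP this screening is a few clicks in Distribution, but the same physically motivated filter is easy to express in code. A minimal sketch, reusing the placeholder column names from the aggregation sketch; only the feed threshold of 10 comes from the demo, and the flow three cutoff is an illustrative value.

```python
# Sketch of physically motivated screening on the aggregated table:
# drop rows where the unit is down (low/negative feed) or in a severe upset
# (abnormally high flow3). Only the feed cutoff of 10 is from the demo;
# the flow3 limit is a placeholder.
def screen_outliers(table, min_feed=10.0, max_flow3=None):
    keep = table["flow1"] >= min_feed          # unit running at normal feed rates
    if max_flow3 is not None:
        keep &= table["flow3"] <= max_flow3    # exclude the upset spike
    return table[keep], table[~keep]           # kept rows, excluded rows

# clean, excluded = screen_outliers(table, max_flow3=500.0)  # 500 is illustrative
```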

Now that we've screened our data, we can begin to look at the relationships between our inputs and the variable we want to model. These XY charts are ordered by best fit, and we can see that temperature four and temperature two, which are on the upper side of the distillation column, have the strongest fit with our propane lab.

This matches our engineering expectations, as we know that temperature has a key effect on our distillation. Likewise, we see strong relationships with one of our flow rates, the product flow, and with our overhead pressure. Rather than have JMP try to infer the relationship of temperature and pressure to the distillation, we can bake some known vapor-liquid equilibrium concepts into the model by combining the pressure and temperature into a calculated variable that's better suited for model building.

We can do this by using another custom-built add-in to build what's called a pressure-compensated temperature, which takes the temperature and adjusts it for the pressure effect relative to some reference pressure. This is what we'll use in the ultimate model instead of the raw temperature. Likewise, as mentioned before, we can't use this flow on its own, as it's an extensive variable and must first be transformed into an intensive variable. We'll do that by creating a new column with a product-to-feed ratio. Ratioing these two flows together turns them from extensive variables into intensive variables that better track the quality shifts in the data. So this will be our flow two divided by our flow one.
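The exact formula inside the pressure-compensation add-in isn't shown in the talk, so the sketch below assumes a simple linear correction with an illustrative slope K and reference pressure; a vapor-pressure-based correction is equally common. The ratio column is just straight column math.

```python
# Sketch of the two engineered inputs on the aggregated table. The linear
# pressure-compensated temperature form, the slope K, and P_REF are assumptions
# for illustration; the custom add-in may use a different correction.
K = 1.5        # degrees per unit of pressure (illustrative)
P_REF = 100.0  # reference pressure (illustrative)

# Pressure-compensated temperature: back out the part of the temperature
# movement that is just the boiling point shifting with pressure.
table["pct2"] = table["temp2"] + K * (P_REF - table["press1"])

# Product-to-feed ratio: two extensive flows become one intensive input.
table["prod_to_feed"] = table["flow2"] / table["flow1"]
```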

The last step before we begin modeling is to build a validation column, which we'll do real quick using our lab as the stratification column. Now we can begin modeling in earnest. The first technique I'll show is for when we don't have as clear an idea of what the inputs to our model should be. We call this a kitchen sink approach: we put in our lab as our Y, we add our validation column, we put in all of our raw inputs as Xs, and we run this in the Generalized Regression platform.
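Before looking at the regression output, a quick aside on that validation column: JMP builds it natively with stratification, but the equivalent idea in code is to bin the continuous lab value into quantiles and split so that training and validation both span the full quality range. The split fraction and bin count below are arbitrary illustrative choices.

```python
# Sketch of a validation split stratified on a continuous lab value:
# bin the lab into quantiles, then split so both sets cover the whole range.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

bins = pd.qcut(table["lab"], q=5, labels=False, duplicates="drop")
train_idx, valid_idx = train_test_split(
    table.index, test_size=0.25, stratify=bins, random_state=1
)
table["validation"] = np.where(table.index.isin(valid_idx), "Validation", "Training")
```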

We're going to look at a Best Subset estimation method to see what the best subset of variables is as it works through each combination of variables. As you can see, we have a plateau here where we're able to reduce the number of variables without a strong effect on model accuracy. As mentioned previously, we want to reduce the number of variables in our final model as much as possible to reduce our reliance on the sensors in case of instrumentation failure.
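Outside of Generalized Regression, the same scan can be approximated by brute force: fit every subset of the candidate inputs, record the best validation R-squared at each subset size, and look for the plateau. A minimal sketch with the placeholder column names used above, assuming the candidate list is small enough to enumerate.

```python
# Rough stand-in for the Best Subset scan: exhaustively fit each combination of
# candidate inputs and record the best validation R^2 at each subset size.
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

candidates = ["temp1", "temp2", "temp3", "temp4", "press1",
              "flow1", "flow2", "flow3"]          # placeholder raw inputs
train = table[table["validation"] == "Training"]
valid = table[table["validation"] == "Validation"]

best_by_size = {}
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        cols = list(subset)
        fit = LinearRegression().fit(train[cols], train["lab"])
        score = r2_score(valid["lab"], fit.predict(valid[cols]))
        if score > best_by_size.get(k, ((), float("-inf")))[1]:
            best_by_size[k] = (subset, score)

# The plateau is the smallest k whose best score is close to the overall best.
```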

Also, we notice here that we are selecting temperature two and pressure one as inputs, which we saw earlier have strong correlations, and we have good engineering reasons why these should be included in the model. This is a good start if your process is less well understood. In our case, though, we know exactly what inputs we want to put in the model based on our engineering analysis. So instead we're going to put in our lab again, and our Xs this time are going to be our calculated variables, again using our validation column.

This time we'll just run a standard least squares algorithm. This gives us what will be our final model to use in our controller. You'll notice the R-squared leaves a little to be desired, but this is normal when running models over large process systems, as there's a lot of variation and dynamics in the instrumentation we're pulling from, and it's at times very difficult to get highly accurate models. Remember, we want to avoid putting more inputs into the model just to push the accuracy further, lest we make the model susceptible to instrumentation failure.

The last thing we want to check for online control is that the inputs are independent of each other and have no collinearity between them. We can do this by taking a look at the VIF column and seeing what the VIF scores for these inputs are. If the VIF score is too high, it is a near certainty that the model will fail when put into a control algorithm. But since the VIF for these two inputs is below a reasonable threshold, we can be confident that this model will be a reasonable predictor for this quality variable, and we should be able to start building it into our control software.
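For completeness, the final fit and the collinearity check can also be reproduced outside JMP. A minimal sketch with statsmodels, again using the placeholder engineered columns from earlier; JMP reports the same VIF diagnostic in Fit Least Squares.

```python
# Sketch of the final two-input least squares model plus a VIF check.
# Column names are the placeholder engineered inputs defined above.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

train = table[table["validation"] == "Training"]
X = sm.add_constant(train[["pct2", "prod_to_feed"]])
final = sm.OLS(train["lab"], X).fit()
print(final.summary())   # check signs and magnitudes against the engineering analysis

# VIF per input (skipping the intercept): values near 1 mean the inputs are
# effectively independent; large values flag collinearity that tends to break
# the model once it is inside a closed-loop controller.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```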

Now we'll switch back to the presentation. We now have an inferential model that we can begin to use in online control. The model we just developed will be used as an input to our advanced controller, and that input will give the controller feedback as it manipulates the unit, so it can see in real time how its moves are influencing the quality variable.

Again, this real-time feedback enables us to optimize the unit against product quality giveaway and to make sure we're always keeping the product quality within our tolerances. As you can probably expect, this is a cyclical process, both for getting a model that works well in the first place and because, as your process conditions change or your equipment ages, your model will eventually decay and you'll need to start from the beginning and generate a new model.

In conclusion, these inferential models bridge the gap between our continuous control software and lab samples that are taken at infrequent intervals. This highlights the importance of domain knowledge when building statistical models. As you recall, we spent a lot of time on engineering analysis before we even opened JMP, to make sure we knew exactly how the model would work when put online. Even outside of chemical and process engineering, the principle is the same: your engineering analysis should guide the model, not the other way around. With these techniques, you can take limited data and transform it into actionable insights, regardless of your field. Thank you very much for watching.

Presented At Discovery Summit 2025

Skill level: Intermediate



