Optimizing Multivariate Models: From Data Collection to Optimal Conditions and Scoring Models (2021-US-45MP-927)

12 Kudos

Level: Intermediate

Isaac Himanga, Senior Analytics Engineer, ADM

Manufacturing processes are highly correlated systems that are difficult to optimize. This correlation provides an opportunity to identify outliers. Here, we demonstrate an entire workflow, which involves:

Obtaining data from the OSISoft historian.
Modeling and optimizing with multivariate models that account for process constraints.
Using the MDMCC platform.
Scoring models online and writing abnormal conditions back to the historian.

In practice, many processes are hard to optimize. Sometimes a system cannot be reduced to a set of independent variables that can be tested; in other cases, the process can become unstable under different conditions. To address these issues, we are using new features in JMP 16 to optimize for cost and quality in the latent variable space, while accounting for constraints on observed variables. In order to remain at those optimal conditions, we use the MDMCC platform to quickly identify deviations from normal.

Add-ins are used to minimize the time spent gathering data and scoring models: one to pull process data into JMP from the popular PI historian, one to quickly recreate calculated columns, and a third to score models online and send results to the historian, including contributions from the MDMCC.

Add-ins referenced are available on the JMP Community:

Auto-generated transcript...

Speaker	Transcript
Isaac Himanga	Hello good morning.
Janice LeBeau, JMP	hi Isaac how are you.
Isaac Himanga	doing well, how are you.
Janice LeBeau, JMP	Fine, thank you I'm Dan it's about I'm going to be producing your presentation today for you.
Isaac Himanga	Right fantastic.
Janice LeBeau, JMP	And I can't see you Isaac.
Isaac Himanga	yep I'm sorry about that I'm getting it set up here.
Janice LeBeau, JMP	Okay, and just to let you know we are recording right now so.
	Absolutely cool I feel like we've met before.
Isaac Himanga	We might have either I'm sure, one of the summit's if it was anywhere.
Janice LeBeau, JMP	Okay, great well how you doing today.
Isaac Himanga	doing well, how are you.
Janice LeBeau, JMP	Good good I love your background that's a beautiful color you've gone on for sure.
Isaac Himanga	Thank you.
	Good.
Janice LeBeau, JMP	I just am going to go over a few things and that I just need to reconfirm with you.
	Your presenter name company and abstract title.
	Okay, do you have any co presenters okay all right and you're okay about us recording this and posting it to the discovery discovery summit archives.
Isaac Himanga	Just absolutely and I guess, I was going to post play into record just my end on my site as well, just in case something happens so we'll have a copy of it and that's all right with you.
Janice LeBeau, JMP	um I don't know if that will affect the quality of anything or.
Isaac Himanga	shouldn't I do it pretty often.
Janice LeBeau, JMP	Okay I'll let you take the lead on that I'm your background looks good your microphone sounds good okay just wanted to reconfirm your cell phone and computer notifications are off.
Isaac Himanga	should be but I'm gonna check again just because when.
Janice LeBeau, JMP	I call your artwork to.
	And then.
	Energy of eight and 1920 by 10 day, do you have any pets in the background.
Isaac Himanga	So.
Janice LeBeau, JMP	It could make noise, like my cat probably will, so I referred him.
Isaac Himanga	Well, in theory, I'm the only one home today so.
Janice LeBeau, JMP	Did you close out look Skype or any other applications that make noises.
Isaac Himanga	Accessing my turn this one off right yes.
	Okay.
Isaac Himanga	yeah, so I think the only thing that should be on as you and I'm actually going to turn your volume way down when we get started here, just in case something happens I'll be able to see you but it'll just be quiet that way if anything else does make noise oh.
Janice LeBeau, JMP	Well, what I'm going to do is.
	I'm just going to give you a couple more pointers here I'm going to mute myself and hide my video you can share your screen for your presentation.
	And, before I mute myself I'll do my 54321 and then give yourself a couple of seconds, and then you can go ahead and introduce yourself.
	and start recording you can start your presentation and then, when you just tell me well this concludes today's presentation, I hope you enjoyed it whatever you want to say and then I'll stop recording for you okay.
	Good and if we're into it, at the very beginning and you're not happy, you know we can stop it and start over again so I'm totally cool with all of that.
Isaac Himanga	Okay now hopefully it won't be an issue how close to 35 Minutes do you want it, I when I practiced mean it was torn between like 34 and 36 is that do you want to read it 35 or.
Janice LeBeau, JMP	You know it's fine I mean just do whatever you have to do I know one person we were at 20 minutes one if they told you what 30 to 40 minutes long.
Isaac Himanga	yeah somewhere, I remember 35 they might have given a range there.
Janice LeBeau, JMP	that's cool don't worry about it.
Isaac Himanga	Okay, good deal perfect.
Janice LeBeau, JMP	Okay, so we are recording in the cloud.
	We can see that.
	Oh, that looks very good that looks yummy.
Isaac Himanga	perfect.
Janice LeBeau, JMP	So yeah.
	So, are you ready my video.
Isaac Himanga	Of what about let's see so you don't have a timer cameras going to set a timer on my up there, so I could do 30.
Janice LeBeau, JMP	I mean do whatever you want, because this will start.
	I mean, honestly, if you think it's 35 and you go to 32 that's fine if you go a little bit over that's fine.
	Okay.
Isaac Himanga	So I got everything here.
	All right.
Janice LeBeau, JMP	Okay, good so you feel good because I'm going to do the.
	I'll do the 54321 and mute myself.
	And let's see you froze there you go Okay, so you ready.
Isaac Himanga	All right.
Janice LeBeau, JMP	I think, by 4321.
Isaac Himanga	My name is Isaac Himanga and I'm going to demonstrate a workflow I use at ADM to optimize manufacturing processes using multivariate models.
	I'll start with pulling data and building a model, finding realistic optimal conditions, identify abnormal conditions, and finally score the model using current data.
	There's a lot of other information on the individual platforms here, so instead of discussing the details of each step, I'll only highlight a few commonly used features and instead try to show the whole workflow.
	I will say most analyses with the amount of detail this one has take a little longer than 45 minutes.
	So head over to the article for this talk in the JMP Community for more detail and information, including a journal with screenshots of steps I'll move through pretty quickly here.
	I'll start with a brief overview of ADM and the general workflow.
	Then I'll put this presentation aside to show you the process in JMP. You'll see what it looks like to get and clean data, use the profiler to find out multiple conditions,
	use the model driven multivariate control chart platform, write a script to score new data against that model, and finally, I'll return to this presentation to briefly give one method to continuously score that model using JMP.
	First, a little about ADM.
	ADM's purpose is to unlock the power of nature to enrich the quality of life.
	We transform natural products into a complete portfolio of ingredients and flavors for foods and beverages, supplements, nutrition for pets and livestock,
	and more. And with an array of unparalleled capabilities across every part of the global food chain, we give our customers and edge in solving the global challenges of today and tomorrow.
	One of those capabilities is using data analytics to improve our manufacturing processes, including the method I'm about to talk about.
	I am part of the relatively new focused improvement and analytics center of excellence.
	And our growing team is invested in techniques, like this one, to help our 800 facilities, 300 food and feed processing locations, and the rest of our company around the world make better decisions using our data.
	Now, an overview of the workflow.
	The four steps I'll review today only represent part of the complete analysis. In the interest of time, I'm going to omit some things which I consider critical for every data set,
	like visualizing data, using validation, variable reduction, corroborating findings with other models, and aligning lab and process data.
	getting data, building a model, scoring that model on demand in JMP, and then scoring the model continuously.
	JMP has tools to support each step, including an array of database connections, multivariate modeling modeling tools, like partial least squares, principal components, and the model driven multivariate control chart platform, and of course, the JMP scripting language or JSL.
	Let's start with getting data. Despite the many database connections and scripting options available, we needed a quick way to pull data from our process historian,
	a popular product called PI, without writing queries or navigating table structures. A custom add-in was the answer for most data sets. This add-in was relatively...was recently posted to the JMP Community.
	Two more add-ins assist in this process. One, generically called scripting tools, includes an option to quickly get the script to recreate calculated columns
	when combined with the save script to functionality built into most JMP platforms. Analyses can be recreated quickly and scored on demand by a JMP user.
	The last add-in, called the JMP model engine, is also the newest. It uses a configuration file and information saved to the data table from the PItool's add-in to get data.
	That makes calculations using column formulas or any other JSL script and then writes results back to the historian.
	And the interest of time, I'm going to move very quickly through this last step, but again, I encourage you to look for more details on the JMP Community using the link on the
	agenda slide of this presentation.
	Each of these add-ins were overhauled to remove sensitive information but we're shifting users to the same code that's posted on the Community. So if you have ideas and how to improve them, feel free to collaborate with us on the JMP Community over...or over on GitHub.
	With that, let's open JMP.
	Behind the scenes, the ADM...or the JMP PItool's add-in uses a couple different PIproducts to pull data, including SQL DAS, OLEDB and ODBC. The instructions to set this up can be found in the help menu for this platform.
	Today we're going to pull data for a set of PItags from May 1, starting at noon, and then we're going to pull another value every day at noon until today.
	We're going to do this for a certain set of tags, those are listed here in this...
	in this box. Notice we've actually included what's called a friendly tag or a name that's more descriptive than just the name that's in the PIhistorian that will help us identify those columns later in the data table. There's all these little question marks around the
	...around the add-in giving more information, including a description for this friendly tag format that we're going to use today.
	When I hit run query, it's going to open a data table that looks like this one. It's got a set of columns for all of the numerical values in that data table.
	It's got a set of columns for any string value, so if you had characters stored in the PItag, it will pull those. And we can also see the status for that PItag
	for each row.
	Also in the column properties for that data table, if we open the column info, we're going to see information that was used to pull that data point, including the PItag,
	the call type and the interval. For more complex queries, like an average or a maximum, we're going to see more information here.
	I will note that this is real data that's been rescaled and the column names have been changed, but the analysis that we're going to see today should look and feel very similar to a real analysis using actual data from one of our facilities.
	Finally, I'll point out that there's a script here saved to the data table called PI data source, which is the same script that's actually shown in the add-in
	and it contains the SQL that's also available here. Again behind the scenes, this uses
	SQL DAS or PI DAS in order to get that information from PI, and this is all the...all the scripts that it's going to run to get that data. We're going to come back and use this again near the end of the talk today.
	Okay, now that we've got data, we need to clean that data up. We're going to use multivariate tools today to do that, specifically the principal component analysis. I'll pull all of the numerical values from that data table and put it into the y columns list and then
	right away, we can see some values that have...quite a few values that have particularly low scores for component one. If you look at the loadings for those
	different factors, we can see the component one includes high loadings for all of the different flows in this data set.
	So that tells me that all of these values over here on the left have high flows across the board for the whole system. Using some engineering knowledge, I can say that this represents downtime for that process, so I'm going to go ahead and hide and exclude all of these values.
	Now that we've done that, we'll recalculate the principal component analysis, so we'll use redo and redo the analysis and then close the old one.
	And now we can see the loadings are perhaps a little bit more meaningful. Principal component three, for example, explains most of the variation in flow 2 and there's a little bit of separation here.
	The first three components explain the majority of the variation, so I'm going to use those three components when looking for other outliers in this data set.
	To do that, I'll open the outlier analysis and I'll change the number of components to be three.
	And then we can see both the T squared plot, and I can also open the normalized DModX plot
	to see points that either have extreme values for one or more of the scores or points that have a single
	column or a single cell that has a value that's unexpected or outside of the expected range, based on all the other columns in that data set for that particular role.
	For now, we're just going to quickly clean this data set by excluding all of the T squared and DModX values that are above the upper control limit.
	One more thing that's commonly done when cleaning a data set is transforming columns, and I want to show a feature of the scripting tool add-in that makes
	it a little bit even trying to apply and
	transfer to the new formula column menu. If I select three columns or any number of columns and go to the custom transformation option, which is again loaded as part of that scripting tools add-in,
	I can select a Savitzkey-Golay transformation and hit OK, and it will add three columns to the end of the data table with a formula containing that transformation.
	I will note that the cleaning we did could have been done directly and in PLS. I often use a PCA first, though.
	Okay now we've cleaned our data set, we need to actually build a model
	to try and predict that...our process conditions. Maybe another quick note about this data set, we have a flow target up here.
	Today, our goal is going to be to create a certain amount of this target flow using as little of flow one as possible and also taking into some...
	into account some constraints on these quality parameters. So because flow one is what we're primarily interested in, I'm going to switch over to a partial least squares model and use that flow...target flow in the y and all the other variables as X factors.
	I'll just accept all the defaults for now and I'm going to jump ahead a little bit and right away use four factors.
	When I open those four factors,
	we'll see that the first two
	represent the
	variables that the plant normally changes in orders...in order to adjust how fast they run the process. So if they need to make more or less of this target flow, they often change factor one in order to achieve that that target rate.
	Factor 2, on the other hand, relates primarily to these quality parameters, which are actually input quality parameters that we don't have control over.
	So it's not something that we can change. So even though factors three and four have relatively...explain relatively small amounts of the variation of
	our target flow and they explain relatively small amounts of the variation of our factor one, those are the ones that we actually have control over and so those are the ones that we're going to be able to use in order to optimize our process.
	In order to use this...so we've we've built a model that explains the variation in our data.
	In order to use that information, we need to save some of those columns or save the predictions from this to new columns in our data set
	that we're going to use in the prediction profiler in just a few minutes.
	We'll make use of a few new features that were added in JMP 16, allowing us to save predictions, both for the Y and the X variables, as X score formulas. And when we open the profiler, I think it'll help to illustrate why that becomes important.
	So we've saved all three of the predictions, the X predictions and the T squares, back to our data table. Those should be new columns at the bottom as functions of these X scores.
	We can also take a quick look at the distance and T squared plots within the PLS platform, and we see that while there's a few variables that have pretty high DModXs or T squareds, there's nothing too extreme.
	These scores are often saved or are always saved with variable names that can become confusing as you save more and more or go through multiple iterations of your model. So the scripting tools contains another function, called the rename columns,
	which will allow you to select an example query for PLS. It has a pretty complicated regular expression pattern here,
	but notice it outputs a standard set of names that are going to look the same for PLS, PCA and other platforms within the data table.
	So in this case I'm actually going to copy a specific prefix, we'll put before all of those columns indicating that this is the model we're going to put online for our quality control and it's revision one of that model.
	When I change names, we can see it's it's automatically changed the name for all of these columns in the data set.
	So we've built a model explaining the variation in
	these columns, but what we haven't done is our original goal of figuring out how to produce a certain amount of flow...of our target flow using as little of flow one as possible. To do that we're going to use the profiler. Notice when we open the profiler,
	we can add all of these predicted values, so not the X scores themselves, but the predicted values and the T squared, to this prediction formula section.
	And then, when it opens up, we'll see across the X axis, all of our scores in the model or our latent variables.
	And we can see when we move one of those scores, it's going to automatically move all of these observed variables together
	at the ratio of the loadings in each one of those components. So importantly, take a look at these flows three and four, they always move together. No matter which score we move,
	this model understands that these two scores are related. Perhaps one is a flow meter that they have control over, and perhaps a second one
	is a second flow meter on that same line, but regardless it doesn't look like the plant is able to move one without moving the other one in the same direction.
	So the goal is to find values for each one of these scores that are going to optimize our process.
	Before we can do that, we need to tell JMP what it is we're trying to optimize. We need to say that we have a certain flow rate we're trying to hit and certain quality parameters that we're trying to hit.
	So we're going to start by telling it that we don't care about the values for most of these columns. So we'll select all of the columns that we saved,
	we'll go to standardize attributes, and we're going to add a response limit of none for all of these different columns.
	Then we'll go back and we'll adjust that response limit. It can be more descriptive for the columns that we do care about. For example, the flow target will go back to response limit and we'll indicate that we want to match a value of 140 for that column.
	Similarly for quality one, we want to hit the value of
	20.15.
	For the flow one, we want to minimize that value.
	And finally, we need to make sure that the solution that we come to is going to be close to the normal operating range of the system, so we don't want to extrapolate outside the area where we built that PLS.
	To do that, we'll use this T squared column, and we'll indicate that we want to minimize the T squared
	such that it's below the upper control limit that was shown in that PLS platform. Here we can see the upper control limit was 9.32, so we'll use that as the minimum value here.
	What we should see in the profiler now is every value below 9.32 is equally acceptable and, as you go above 9.32, it's going to be less and less desirable.
	Under the graph menu I'll open the profiler once again, and once again take all of those predictions and T squared formulas and put them into the Y prediction formula.
	And we still see the X scores across the bottom, but now we also see these desirability functions at the right.
	And once again the desirability function for T squared starts to drop off above 9.32. The desirability is highest at low values of flow one and we've got targets for both the the target flow and the quality parameter.
	Because we've defined all of these, we can now go to maximize desirability.
	And that's going to try and find values for each one of those scores, and thus, values for all of the observed variables in our data set that are going to
	achieve this target flow and achieve the targets that are in these that we defined earlier.
	Notice it came close to hitting the full target, but it looks like it's a little bit low. It did achieve our 20.15
	and it was within the T squared bound that we gave it. Most likely JMP thought that this tiny little decrease in desirability was less important than reducing flow, so we can fix that by control clicking in this desirability function and just changing the importance to 10.
	Now, if we optimize this again, it should get a little bit closer to the flow target. Before we do that, though, I'm going to save these factor settings to this table at the bottom, so we can compare before and after. And then we'll go ahead and maximize desirability.
	Looks like it's finished and we're still within bounds on our T squared.
	It has achieved the 20.15 that we asked and it's certainly much closer to 140. So now we could once again save these factor settings to the data table.
	And we should now have factors that we can give back to the manufacturing facility and say, hey, here are the parameters that we recommend.
	The benefit of using a multivariate analysis like this, we talked about those flows three and four being related earlier,
	using this method, we should be able to give the plant reasonable values that they're actually able to run at.
	If you tell them to run a high value for three and a low value for flow four, they might look at you and say well that's just not possible.
	These should be much more reasonable. Note that not all of these variables are necessarily independent variables that the plant can control. Some of those might be
	outcomes, or they might be just related to other variables. In theory, if the plant changes all of the things that they do have control over, the other things should just fall into place.
	So now we've optimized this model, the next step is often to make sure, or to verify, are we running normally. So once the optimal conditions are are put in there, it's good to use the same model to score new rows or new data and understand, is the process acting as we expect it to?
	To do that, we'll use these same prediction scores that we had in...saved from the PLS platform, but this time we're going to use those in the model driven multivariate control chart
	under quality and process.
	I'm going to use the time and put that in time ID, so instead of seeing row numbers, we're going to see an actual date on all of our charts and we'll put the X scores into the process.
	Unfortunately, the range doesn't always work correctly on these charts, so if I just zoom in a little bit, we'll see that here are periods where we had a high T squared and that high T squared was mostly the result of flows two, three, four and one, so all flows in that...
	are the high contributors to this T squared. If we click on that and hover the mouse over one of those bars, if I click on that point again, then we'll see a control chart for that individual value...individual
	variable with the point that we selected highlighted on that chart.
	And I'm going to hide this one, though, and not only look at T squared, we also can do the same thing for DModX or SPE, if you're still looking at SPE. Once again,
	this doesn't always work out correctly.
	So we'll zoom in on the
	DModX values again.
	So DModX is going to indicate points that have a high or low value for an individual column, compared to what was expected based on the other rows, the other data in that row.
	Here we can see that this point is primarily an outlier due to flow five.
	I do find the contribution proportion heat maps to be pretty useful to look and see patterns in old data when, for example, one variable might have been a high contributing factor or acting abnormal for a long period of time or for some some section of time.
	So this is a chart that might...that we might want to look at every morning, for example, or periodically to see, is the process acting normally?
	You come in, you want to open this up and see, is there anything that I should adjust right now in order to bring our process back under control?
	So to do that, we want to recreate this whole analysis from pulling PI data to opening the MDMCC platform and have it all be available at a click of a button.
	To do that, we're going to write a JSL script that has three steps. It's going to get the data from PI, we're going to add the calculated columns, and then open the model we have in the multivariate control chart platform.
	getting data from PI. If we go back to the original data table that we...
	that was opened after the PI tools add-in was run, we can see this PI data source script saved to that table. If we edit that script and just copy it, we can paste that into the new script window.
	I'm just going to make one change. Instead of pulling data from May, we'll start pulling data from August 1 instead.
	Now we need to add those calculated columns. So remember we...in the PLS platform we use the saved score...save as X score formulas option.
	In order to recreate those, we can just select all of the columns in the data table and use the copy column script function that was added again in that scripting tools add-in.
	Once we copy the column script, we go back into this new script that we created and we'll paste what are a bunch of new column formulas to again recreate all of those columns.
	Finally, model driven multivariate control
	chart has an ??? most other platforms, where you can save the script to the clipboard and you can paste that into the same script window.
	Now, if we imagine it's a new day and we have everything closed, and I want to look at how our process is doing, I would just run this same script. Note that I could start it with
	a specific slash slash and then an exclamation point in order to run automatically when the script is opened.
	When I hit run, it's going to open a data table that looks just the same as our original table. It's got all of the same columns.
	It's added those calculated columns, so let's put these scores in and all the predictions, and it also opened the model driven multivariate control chart platform
	where we can see that this recent value for DModX is actually pretty high, so the most recent data has some abnormal values, in this case, for quality three.
	So again, quality three looks like it's not at the expected value, based on the other parameters. In this particular case, that might mean,
	since quality three is an input parameter and quality one, two and three are often related, that might mean that quality three is a bad lab sample or it could mean that this is a new material that we haven't worked with before.
	Okay, finally, let's talk about one method to run this model continuously. So this was recreated on demand,
	where we wrote a JSL script to run this, but sometimes it's beneficial
	to...or we found it beneficial to write these results back to our PI historian so that they can be used by
	the operators at our facilities. So in the last couple of minutes, I want to introduce that add-in, it's called the model engine add-in,
	which will quickly score models online. I should note that this should be used for advice only. It's not an acceptable method to provide critical information or for closed loop control.
	For that you might consider exporting the formulas from the formula depot and using some other software package to score them.
	As mentioned earlier, some of the predictions and information available in this model
	has most has the most value at the moment it's happening. So knowing what caused yesterday's problem is great, but knowing what's happening right now
	means making decisions with current model results and it allows some problems to be fixed before they become a big deal.
	Of course there's many ways to score models, but the power of JMP scripting language, JSL,
	provides a way to get predictions and anomaly information in front of operators at our manufacturing facilities using their existing suite of visualization and trending tools that they're already used to.
	A pair of model engines, or computers running JMP with a specific add-in that started the task scheduler, are set up to periodically run all the models stored in a specific directory.
	All the configuration is done via two types of configuration files, a single engine configuration file and one model config file for each model that's going to be scored.
	Let's start with that model config file. Remember how the PI tools add-in saves source information to the data table?
	Now that same information can be used to populate the model config file, which tells the model engine how to download a table with a single row containing all of the model inputs that it needs to calculate values from.
	Later, the scripting tools add-in quickly save the scripts to recreate columns saved from the PLS and any other platform,
	potentially including T squared and DModX contributions that can be saved from the model version control chart platform. These new column scripts are also saved in the model config file or in a separate file in that same directory.
	Finally, the engine config file defines how the engine communicates with the data source where PI tools
	add-in...where the PI tools add-in uses the OLDB
	and SQL queries to get data, the model engine uses the PI web API to read and write data directly to PI.
	By defining a set of functions in the engine config file, this engine can communicate with many other data sources as well.
	Notice a set of heartbeat tags are defined, which allows the data source and other model engines to know the status of this engine.
	Each model also has its own set of heartbeat tags, so if one machine stops scoring a particular model, the other engine will automatically take over.
	Again this model engine idea is not intended to be used for critical applications, but I found that it allows us to move very quickly from deployment and exploratory analysis to an online prediction or a quality control solution.
	With that, thank you all for attending. Remember that more information on each add-in and the journal I use today are available in the JMP Community.
Janice LeBeau, JMP	awesome job awesome.

Joseph_Reese · ‎01-07-2022

Really impressive work!

Benjamin_Fuerst · ‎03-23-2022

@ih : Thanks for your talk Issac. I really enjoyed it!
I actually have another question to add:
Can you share a pic on what the operator finally sees on his screen?

Are your directly showing T2 or Dmodx with the UCLs? Is the operator then acting on a drill down which shows the biggest contribution proportions?

Thanks!
Ben

Ben_Sagalovsky · ‎03-23-2022

Great talk and great work, thanks for showcasing all that these techniques can provide, and for sharing your creative use of principal components as the 'building blocks' of the variables observed.

Since you have applied this to different plants, have you come up with situations in which the PCA is best done not on simultaneous measurements but rather on some delayed measurements to take into account processing and transit times, etc.? If so, could you share how have you dealt with that? Thanks!