Understanding Variable Clustering is Crucial When Analyzing Stored Manufacturing Data (2020-US-30MP-576)

Tony Cooper, Principal Analytical Consultant, SAS
Sam Edgemon, Principal Analytical Consultant, SAS Institute

 

In process and product development, Design of Experiments (DOE) helps to answer questions like: Which X's cause Y's to change, in which direction, and by how much per unit change? How do we achieve a desired effect? And which X's might allow looser control, require tighter control, or need a different optimal setting? Information in historical data can help partially answer these questions and help run the right experiment. One of the issues with such non-DOE data is the challenge of multicollinearity. Detecting multicollinearity and understanding it allows the analyst to react appropriately. Variance Inflation Factors from the analysis of a model, and Variable Clustering without respect to a model, are two common approaches. The information from each is similar but not identical. Simple plots can add to the understanding but may not reveal enough information. Examples will be used to discuss the issues.

 

 

Auto-generated transcript...

 



Tony: Hi, my name is Tony Cooper, and I'll be presenting some work that Sam Edgemon and I did; I'll be representing both of us.
  Sam and I have been research statisticians in support of manufacturing and engineering for much of our careers. Today's topic is Understanding Variable Clustering is Crucial When Analyzing Stored Manufacturing Data.
  I'll be using a single data set to present from. The focus of this presentation will be reading output, and I won't really have time to fill out all the dialog boxes to generate the output.
  But I've saved all the scripts in the data table, which of course will be available in the JMP Community.
  Once a script is run, you can always use that red arrow to do the redo option and relaunch the analysis, and you'll see the dialog box that would have been used to produce that piece of output.
  I'll be using this single data set of manufacturing data.
  Let's have a quick look at how this data set looks. First of all, I have a timestamp, so this is a continuous process: at some interval of time I come back and I get a bunch of information on some Y variables, some major outputs, some KPIs that the customer really cares about. These have to be in spec and so forth in order for us to ship the product, so these would definitively be outputs. Then there are things like line speed, the set point for the vibration, and a bunch of other things. So you can see, all the way across, I've got 60-odd variables that we're measuring at that moment in time. Some of them are sensors and some of them are set points.
  And in fact, let's look and see that some of them are indeed set points, like the manufacturing heat set point. Some of them are commands, which means here's what the PLC told the process to do. Some of them are measures, which is the actual value it's at right now.
  Some of them are ambient conditions; I think that one is an external temperature. Some of them are raw material quality characteristics, and there's certainly some vision system output. So there are a bunch of things in the inputs right now.
  And you can imagine, for instance, that if I've got the command and the measure, what I told the zone to be at and what it actually measured, we hope those are correlated.
  We need to investigate some of that in order to think about what's going on. And that's multicollinearity: understanding the interrelationships among the inputs or, separately, the multicollinearity among the outputs. By and large we're not doing Y-cause, X-effect right now; this isn't a supervised technique, it's an unsupervised technique. I'm just trying to understand what's going on in the data in that fashion.
  All my scripts are over here. So here's the abstract we talked about; here's a review of the data. And as we just said, that may explain why there is so much multicollinearity, because I do have set points and actuals in there. But we'll learn about those things.
  What we're going to do first is fit Y as a function of X, and I'm only going to look at these two variables right now:
  the zone one command, so what I told zone one to be, if you will; and then I've got a little thermocouple in the zone one area, and it's measuring.
  And you can see that clearly as zone one temperature gets higher, this Y4 gets lower, this response variable gets lower.
  And that's true also for the measurement; you can see that in the estimates, both negative. And by the way, these are fairly predictive variables, in the sense that just this one variable is explaining 50% of what's going on in Y4.
  Now let's do a multivariable fit. So imagine I go to Fit Model and have moved both of those variables into my model, and I'm still analyzing Y4; my response is still Y4. Oh, look at this. Now it's suggesting that as the command goes up, the response does the opposite of what I expect. The measure is still negative, in the right direction, but look at some of these numbers. These aren't even in the same ballpark as what I had a moment ago, which was .04 and .07 negative. Now I have positive .4 and negative .87.
  I'm not sure I can trust this model from an engineering standpoint, and I really wonder how well it's characterizing and helping me understand this process.
  And there's a clue. This may not be on by default, but you can right-click on this parameter estimates table and ask for the VIF column.
  That stands for variance inflation factor, and it is an estimate of the instability of the model due to this multicollinearity. We need to get a little intuition on how to think about that quantity. But just to be a little clearer about what's going on, I'm going to plot...here's
  temperature zone one command, and here's the measure. As you would expect, as you tell the zone to increase in temperature, yes, the zone does increase in temperature, and by and large it's going to maybe even the right values. I've got this color coded by Y4, so it is suggesting that at low values of temperature I get high values of Y4, just the way I saw on the individual plots.
  But you can maybe see, and get some intuition about, why the multivariable model didn't work. You're trying to fit a three-dimensional surface over this knife edge, and obviously there's some instability; you can easily imagine the surface is not well pinned on the sides, so it can rock back and forward, and that's what you're getting.
  That is what the variance inflation factor is picking up. The OLS (ordinary least squares) analysis, typical regression analysis, just can't handle it; in some sense, we could also talk about the X'X matrix being almost singular in places.
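  To make that instability concrete, here is a minimal simulation sketch in Python (not the presentation's data set; the variable names and numbers are hypothetical). Each single-variable fit gives a stable negative slope, but once the nearly collinear command and measure are in the model together, the command's coefficient swings from refit to refit and can even turn positive:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical zone 1 command and its measured temperature: almost collinear.
command = rng.normal(250, 10, size=n)
measure = command + rng.normal(0, 0.5, size=n)       # the measure tracks the command closely
y4 = -0.05 * measure + rng.normal(0, 1.0, size=n)    # response falls as temperature rises

def slopes(X, y):
    """Ordinary least squares slopes (intercept fitted but not returned)."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

# Single-variable fits: stable and negative, as in the individual plots.
print(slopes(command[:, None], y4), slopes(measure[:, None], y4))

# Two-variable fit on random halves of the data: the coefficients are unstable,
# and the command's coefficient can even come out positive.
for _ in range(5):
    idx = rng.choice(n, size=n // 2, replace=False)
    print(slopes(np.column_stack([command, measure])[idx], y4[idx]))
```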
  So we've got some heuristic sense of why it's happening. Let's go back and think more about the values. We now know that the variance inflation factor actually considers more than just pairwise relationships, but intuition helps me think about the relationship between
  VIF and pairwise correlation. If two variables share an R-square of 60%, and if it were all pairwise, then the VIF would be about 2.5. And if two variables were 90% correlated, then I would get a VIF of 10. The literature says that when you have VIFs of 10 and more, you have enough instability that you should worry about your model, because in some sense you've got 90% correlation between things in your data.
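  As a rough check on those numbers, a small sketch (plain NumPy, simulated data, hypothetical names): the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors, so an R-square of 0.6 gives about 2.5 and 0.9 gives 10:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Two predictors that share most of their variance.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.35, size=n)
X = np.column_stack([x1, x2])

def vif(X, j):
    """VIF of column j = 1 / (1 - R^2) from regressing X[:, j] on the other columns."""
    y = X[:, j]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    r2 = 1 - np.var(y - others @ beta) / np.var(y)
    return 1 / (1 - r2)

print([round(vif(X, j), 2) for j in range(X.shape[1])])
print(1 / (1 - 0.6), 1 / (1 - 0.9))   # the 2.5 and 10 rules of thumb
```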
  Now, whether 10 is a good cutoff really depends, I guess, on your application. If you're making real design decisions based on it, I think a cutoff of 10 is way too high.
  Maybe even something near two is better. But if I was thinking more about which factors to put in an experiment, I want to learn how to run the right experiment, and I know I've got to pick factors to go in it. We too often go after the usual suspects; is there a way I can push myself to think of some better factors? Then maybe 10 might be useful to help you narrow down. So it really depends on where you want to go and what the purpose is. More on this idea of purpose.
  To me, there are two main purposes of modeling. One is what will happen; that's prediction. And that's sometimes different from why will it happen, which is more like explanation.
  As we just saw with a very simple command and measure on zone one, you cannot do very good explanation. I would not trust that model to think about why something's happening when the estimates seem to go in the wrong direction like that. So I wouldn't use it for explanation, and I'm suggesting I wouldn't even use it for prediction. If I can't understand the model, if it's not intuitively doing what I expect, that seems to make extrapolation dangerous; and by definition, prediction of a future event is extrapolation in time. So I would always think about this. And we've been talking about ordinary least squares so far.
  All the modeling techniques I see, like decision trees and partition analysis, are in some way affected by this issue, in different and unexpected ways. So it seems a good idea to take care of it if you can. And it isn't unique to manufacturing data.
  But it's probably exaggerated in manufacturing data, because often the variables are controlled. If we have zone one temperature and we've learned that it needs to be at this value to make good product, then we will control it as tightly as possible to that desired value. So the extrapolation gets harder and harder. This is exaggerated in manufacturing because we do control these variables.
  And there are some other things about manufacturing data you can read here, both opportunities and challenges. On the better side, you can understand manufacturing processes; there's a lot of science and engineering around how those things run. You can understand that stoichiometry, for instance, requires that the amount of chemical A you add has to be in relationship to chemical B. Or you don't want to go from this zone one temperature to something vastly different, because you need to ramp it up slowly; otherwise you'll just create stress. So you can and should understand your process, and that may be one way, even without any of this empirical evidence on multicollinearity, that you can get rid of some of it.
  There's also an advantage to manufacturing data in that it's almost always time based, so do plot the variables over time. It's often interesting: manufacturing data does seem to move in blocks of time, like here, where we think the setting should be 250 and we run it that way for months, maybe even years, and then suddenly someone says we've done something different, or we've got a new idea, let's move the temperature. And then it's very different.
  And of course, if you're thinking about why there is multicollinearity, we've talked about how it could be due to physics or chemistry, but it could also be due to some latent variable, especially when there's a concern with variables shifting in time, like we just saw. Anything that also changes at that rate could be the actual thing that's affecting the values of Y4. It could be due to a control plan.
  It could be correlated by design. And for each of these reasons for multicollinearity, the question I always ask is: is it physically possible to try other combinations, and, not just physically possible, does it make sense to try other combinations? In that case you're leaning towards doing experimentation, and this modeling and looking retrospectively at your data is very helpful for designing better experiments.
  Sometimes, though, the expectations are a little too high; in my mind, we seem to expect a definitive answer from this retrospective data.
  So we've talked about a couple of methods already to detect multicollinearity and understand it. One is the VIF, and here's the VIF on a bigger model with all the variables in.
  How would I think about which variables are correlated with which? This tells me I have a lot of problems, but it does not tell me how to deal with these problems. So this is good at helping you say, oh, this is not going to be a good model, but it's maybe not helpful in getting you out of it. And if I were to put interactions in here, it would be even worse, because they are almost always more correlated. So we need another technique.
  And that is variable clustering. This is available in JMP, and there are two ways to get to it. You can go through Analyze > Multivariate Methods > Principal Components, or you can go straight to Clustering > Cluster Variables. If you go through Principal Components, you've got to use the red triangle option to get Cluster Variables. I still prefer this method, the PCA, because I like the extra output. Either way, it is based on PCA. So we're going to talk about PCA first, and then we will talk about the output from variable clustering.
  And there's the JMP page. In order to talk about principal components, I'm actually going to work with standardized versions of the variables first. Let's remind ourselves what standardized means: the mean is now zero and the standard deviation is now 1. So I've standardized, which means the variables are all on the same scale now. And implicitly, when you do principal components on correlations in JMP, you are doing it on standardized variables.
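  For intuition on that point, here is a minimal sketch in Python (NumPy only; the two simulated columns are hypothetical stand-ins, not the manufacturing variables) showing that the eigenvalues of the correlation matrix are exactly the PCA eigenvalues you get from standardized variables:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400

# Simulated inputs on very different scales, e.g. a temperature and a line speed.
temp = rng.normal(250, 5, size=n)
speed = 3.0 * temp + rng.normal(0, 20, size=n)
X = np.column_stack([temp, speed])

# Standardize: each column gets mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the correlation matrix of the raw data ...
corr_eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
# ... equal the eigenvalues of the covariance matrix of the standardized data.
std_eig = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False, ddof=0)))[::-1]

print(np.round(corr_eig, 4))
print(np.round(std_eig, 4))
```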
  JMP is, of course, more than smart enough for you to put in the original values and for it to work in the correct way; when it outputs a formula, it will figure out what the right formula should have been, given that you've gone back to unstandardized variables. But just as a first look, just so we can see where the formulas come from, I'm going to work with standardized variables for a while. We will quickly go back, but I just want to see one formula.
  And that's this formula right here. What I'm going to do is think about the output. So what is the purpose of PCA? Well, it's called a variation reduction technique, but what it does is look for linear combinations of your variables. If it finds a linear combination that it likes, that's called a principal component. And it uses eigenanalysis to do this.
  Another way to think about it: I put 60-odd variables in as the inputs. There are not 60 independent dimensions that I can then manipulate in this process; they're correlated. What I do to the command for temperature zone one dictates hugely what happens with the actual temperature in zone one. So those aren't two independent variables; you don't have two dimensions there. So how many dimensions do we have?
  That's what the eigenvalues tell you. These are like variances; they are estimates of variance, and the cutoff is one. If I had the whole table here, there would be 60 eigenvalues going all the way down; the report here stops at .97, but the next one down is probably .965, and it will keep on going down. The guideline is that if an eigenvalue is greater than one, then that linear combination is explaining signal, and if it's less than one, then it's just explaining noise.
  So what JMP does when I go to the variable clustering is it says: you have a second dimension here. That means a second group of variables that's explaining something a little bit differently. So it separates the variables into two groups, and then it does PCA on both. The first eigenvalue in each will be big, but what does the second one look like after the split into two groups? Are they now less than one? If they're less than one, you're done. But if they're greater than one, then that group can be split even further. And it keeps splitting, iteratively and divisively, until there are no second components greater than one anymore.
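  To make that divisive splitting idea concrete, here is a rough sketch in Python of the loop just described (a simplification for intuition only; JMP's actual Cluster Variables procedure is more refined, for example in how it assigns and reassigns variables between clusters):

```python
import numpy as np

def cluster_variables(corr, cols, max_second_eig=1.0):
    """Divisively split the variables in `cols` until no cluster has a
    second eigenvalue greater than max_second_eig. `corr` is the full
    correlation matrix; `cols` is a list of column indices."""
    sub = corr[np.ix_(cols, cols)]
    eigvals, eigvecs = np.linalg.eigh(sub)           # returned in ascending order
    order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if len(cols) < 2 or eigvals[1] <= max_second_eig:
        return [cols]                                # second component is only noise: stop
    # Send each variable to whichever of the first two components it loads on more heavily.
    to_second = np.abs(eigvecs[:, 1]) > np.abs(eigvecs[:, 0])
    group1 = [c for c, s in zip(cols, to_second) if not s]
    group2 = [c for c, s in zip(cols, to_second) if s]
    if not group1 or not group2:
        return [cols]
    return cluster_variables(corr, group1) + cluster_variables(corr, group2)

# Usage sketch:
# corr = np.corrcoef(X, rowvar=False)
# clusters = cluster_variables(corr, list(range(corr.shape[0])))
```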
  So it's creating these linear combinations, and when I save principal component one, it's exactly this formula, the .0612; that's the formula for Prin1. This will be exactly right as long as you standardize. If you don't standardize, then JMP will just figure out how to do it on the unstandardized variables, which it is more than capable of doing.
  So let's start working with the initial data, and we'll do our example. You'll see this is very similar output; in fact, it's the same output except for one thing I've turned on, which is this option right here, Cluster Variables. What I get down here is that cluster variables output, and you can see these numbers are the same. You can already start to see some of the thinking it's going to have to do; let's look at these two right here.
  The amount of A I put in seems to be highly related to the amount of B I put in, and that would make sense for most chemical processes, if that's part of the chemical reaction you're trying to accomplish. If you want to put 10% more A in, you're probably going to put 10% more B in. So even in this ray plot you start to see some things that suggest the multicollinearity, and we start to get somewhere.
  But I want to put them in distinct groups, and this is a little hard. Watch this one right here, temperature zone 4. It's actually the opposite: it's at sort of the same angle, but in the opposite direction, so it's almost 180 degrees from A and B. So A and B are maybe negatively correlated with zone 4 temperature. But I also want to put the variables in exclusive groups, and that's what we get when we ask for the variable clustering. It took those very same variables and put them in distinct groups, eight distinct groups.
  And here are the standardized coefficients, so these are the formulas for the individual clusters. When I save the cluster components, I get something very similar to what we did with Prin1, except this is just for cluster 1. Notice that in this row, zone one command has a .44 and everywhere else is zero; every variable is exclusively in one cluster or another.
  So let's talk about some of the output from the variable clustering.
Tony: We get some tables in our output. I'm going to minimize this window, and we'll talk about what's in here in terms of output.
  The first thing I want to point us to is the standardized estimates, which we were just looking at. If you want to do it by hand, if you will, and reproduce how I get a .35238 here, you could run PCA on just the cluster one variables, the variables sitting in that cluster. Then you could look at the eigenvalues and eigenvectors, and these are the exact same numbers.
  So the .352 is just what I said: it keeps divisively doing multiple PCAs, and you can see that the second principal component here is less than one. So that's why it stopped at that component.
  Who's in cluster one? Well, there are the two zone one temperature measures, a zone three A measure, and the amount of water seems to belong there as well, with some of the other temperatures over here in cluster six.
  This is a very helpful color-coded table, organized by cluster. So this is cluster one; I can hover over it and read all of the variables: added water (yes, I said added water; sorry, let me get another one, that's my temperature three).
  That's a positive correlation, and it's interesting that zone 4 has a negative correlation there. You will definitely see blocks of color, because this is cluster one obviously, this is maybe cluster three, and this one, I know, is cluster six. But look over here: as you can imagine, clusters one and three are somewhat correlated. We start to see some ideas about what we might do, so we're starting to get some intuition as to what's going on in the data.
  Let's explore this table right here. This is the R-square with own cluster, R-square with next closest cluster, and 1 minus R-square ratio. I'm going to save that to my own data set and run a multivariate on it. So I've got the cluster one through eight components and all the factors, and what I'm really interested in is this part of the table: for all of the original variables, how they correlate with each cluster component. So let me save that table, and
  then what I'm going to do is delete some extra rows out of this table; it's the same table, I'm just deleting some rows so we can focus on certain ones. What I've got left is columns that are the cluster components and rows for the different variables, the 27 variables that we're now thinking about, not 60 (sorry), and it's put them in 8 different groups, and I've added that cluster number, which can come automatically. In order to keep working, I don't want these as correlations, I want them as R-squares, so I'm going to square all those numbers. So I just squared them, and here we go; now we can start thinking about it.
  So let's look at row one. This one, the temperature zone one measure we've talked about, is 96% correlated with cluster one. It's 1% correlated with cluster two, so it looks to me like it really, really wants to be in cluster one, and it doesn't have much desire to go into cluster two. What I want to do is find all of those...
  Let's color code some things here so we can find them faster. We're talking about the zone one measure, and the cluster it would like to be in next, if anything, is cluster five. It's 96% correlated with its own cluster, but if it had to leave this cluster, where would it go next? It would think about cluster five. And what's in cluster five? Well, you could certainly start looking at that: there are more temperatures there, the moisture heat set point, and so forth. So this number, the R-square with its own cluster, talks about how much a variable likes being in its cluster, and the R-square with the next cluster talks about how much it would like to be in a different one. Those are the two things reported in the table.
  JMP doesn't show all the correlations, but it may be worth plotting them, and as we just demonstrated, it's not hard to do.
  So this one tells you: I really like being in my own cluster. This one says: if I had to leave, I wouldn't mind going over here. Let's compare those two, and take one minus this number divided by one minus that number. That's the 1 minus R-square ratio: a ratio of how much you like your own cluster against how tempted you are to go to another cluster.
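  As a tiny worked version of that ratio (the vacuum set point numbers are the ones mentioned in a moment; the other pair is made up for contrast):

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster).
    Values near zero mean the variable clearly belongs to its own cluster."""
    return (1 - r2_own) / (1 - r2_next)

print(round(one_minus_r2_ratio(0.96, 0.50), 3))    # hypothetical borderline variable: 0.08
print(round(one_minus_r2_ratio(0.97, 0.0005), 3))  # vacuum set point example: 0.03, near zero
```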
  And let's plot some of those. Let me put a local data filter on there, filtering by cluster. And here's the thing.
  Lower values of this ratio are better, in some sense. These are the ones over here; let's highlight one. Well, let's highlight this one at the top here...actually, I like the one down here. This one, the vacuum set point, you can see really liked its own cluster over the next nearest, .97 versus .0005, so the ratio is near zero.
  So you wouldn't want to move that one. And you could start to do things like: let's just think about the cluster one variables; if any one of them wanted to leave, maybe it's Pct_reclaim. Maybe it got in there by fortuitous chance. And if I was looking at cluster two, I could start thinking about the line speed.
  The last table I'm going to talk about is the cluster summary table; that's this table here. It's gone through the list and said: if I look down R-square with own cluster, the highest number is .967 for cluster one, so maybe that's the most representative variable.
  To me, it may not be wise to let software decide that I'm going to keep this variable and only this variable, although certainly every software package has a button that says launch it with just the main ones. That may give you a stable model, in some sense, but I think it's shortchanging the kinds of things you can do. Hopefully, with the techniques provided, you now have the ability to explore those other things.
  You can calculate this number by again doing the PCA on just its own cluster; it's showing that the first eigenvalue, the first component, explained .69 of the variation, so that's another way of thinking about its own cluster.
  Close these and let's summarize.
  So we've given you several techniques. Failing to understand the collinearity can make your data hard to interpret and can even make the models misleading. Understanding the data as it relates to the process can explain what produces the multicollinearity, just from an SME standpoint, a subject matter expertise standpoint.
  Pairwise correlations and scatterplots are easier to interpret but can miss multicollinearity, so you need more. VIF is great at telling you you've got a problem, but it's only really available for ordinary least squares; there's no comparable measure for other prediction methods, and there are only rough guidelines for it. Variable clustering is based on principal components and shows both membership and strength of relationship within each group.
  And I hope that this review, or introduction, encourages you to grab this data set, as I say, from JMP Discovery, run the scripts yourself, go more slowly, make sure you feel good about it, and practice on your own data.
  One last plug: there is other material from other times that Sam and I have spoken at JMP Discovery conferences or written white papers for JMP, and you might want to look at what we've thought about with manufacturing data, because we have found that modeling manufacturing data is a little different from understanding, say, customer data in telecom, which is where a lot of us learned when we went through school. I thank you for your attention, and good luck with further analysis.