Level: Intermediate
Bill Worley, JMP Senior Global Enablement Engineer, SAS
In the recent past, partial least squares has been used to build predictive models for spectral data. A newer approach using Functional Data Explorer and covariate design of experiments will be shown that allows fewer spectra to be used in developing a good predictive model. This method uses one-fourth to one-third of the data that would otherwise be needed to build a predictive model from spectral data. Newer multivariate platforms like Model Driven Multivariate Control Charts (MDMCC) will also be shown as ways to enhance spectral data analysis.
Auto-generated transcript... Speaker Transcript Bill Worley Hello everyone, my name is Bill Worley, and today we're going to be talking about analyzing spectral data. I'm going to talk about a few different ways to do it. One is using Functional Data Explorer and design of experiments to help build better predictive models for your spectral data. The data set I'm going to be using is actually out of the JMP book Discovering Partial Least Squares. I will post this on our Discovery website, or community page, so everything will be out there for you to use. First and foremost, I'll talk about the different things we're going to look at. Traditionally, when you're looking at spectral data, you're going to use partial least squares to analyze it, and that's fine; it really works very well. But there are some newer approaches to try out. One is using principal components, then a covariate design of experiments, and then partial least squares to analyze the data. An even newer and more novel approach uses Functional Data Explorer, then the covariate design of experiments and partial least squares, with an opportunity to use something like generalized regression or neural networks. Okay, so I'm going to go through a PowerPoint first to give you a little bit of background. Again, we're going to be talking about using Functional Data Explorer and design of experiments to build better predictive models for your spectral data. A little bit of history: this spectral data approach is based on a QSAR-like material selection approach that was developed previously by Silvio Michio and Cy Wegman. I took it and looked for opportunities to extend this approach to other highly correlated data. The first thing that really came out was spectral data, which is truly highly correlated, almost autocorrelated, data where we can use this approach.
The data that I've got is, again, continuous response data for octane rating, but I've since added mass spectral data and near-IR data for categorical responses as well. This is where we're going to go: we're going to build these models and compare them. This is the traditional PLS approach, this is the newer approach using principal components, and the final approach here is using Functional Data Explorer. You can see that, for the most part, we really don't lose anything with these models as we build them. As a matter of fact, this slide is a little bit older; the models that I've built more recently are actually a little bit better. We'll show you that when we get there. So again, a twist on analyzing your spectral data: partial least squares has been used in the past, and we're going to be applying several multivariate techniques to show how to build good predictive models with far fewer conditions. When I say far fewer conditions, in this case I mean fewer spectra, and you'll see where that comes from. And why would you want to do this analysis differently? Well, first and foremost, there's a huge time savings. You get an as-good-or-better predictive model with 25% of the data or less; it's your choice. And then you can use penalized regression to help determine the important factors, making for simpler models. When I say important factors, I mean important wavelengths, and again, you'll see that when we get there. This is 60 gasoline near-IR spectra overlaid, and as we all know, it would be pretty hard to build a predictive model from that to determine the difference between the different spectra for their octane rating. So what we're going to do is use JMP to help get us there. Most of what I'm going to be showing can be done in regular JMP, but what I'm showing today is almost all JMP Pro, just so you know. So, how it's done.
I'm not going to read you the different steps; I'll let you do that if you so choose. But there are two important ones. First is number two, where you want to identify the prominent dimensions in the spectra. That's where we're going to use functional principal components from Functional Data Explorer. It's not used in the traditional sense, because we're not going to build models from these things. We're going to use the functional principal components to help us pick which spectra we're going to analyze, and then we're going to use those in a custom design to help us select those different spectra. And last but not least, number seven here is to use this sustainable model to determine the outcome for all future samples. A caveat on that: I'm a chemist by training and education, an analytical chemist at that, and I don't know how well instruments hold their calibration anymore. So the model that you build will hold true as long as the instrument is calibrated and good to go in that respect. Okay. Bill Worley So some important concepts, and again, I'm not going to read these. I just want you to know that we'll be looking at partial least squares, principal components, and functional data analysis. Functional data analysis is something newer in JMP that you won't see in a lot of other places. It helps you analyze data, providing information about curves, surfaces, or anything over a continuum; that's taken from Wikipedia. A newer platform that I'm going to show, which is in regular JMP, is Model Driven Multivariate Control Charts. This allows you to see where the differences in some of the spectra are, how you can pull those apart, and maybe dive a little deeper into where you're seeing those differences in the spectra and what they really are. So with that, let's go to the demo. I'll go to my home window. So, the data set.
This is, again, the gasoline data set where we're looking at octane rating: how do we determine octane rating for different gasolines? Where do the ratings come from, and how do we determine them? You really don't need this value until after you do the preprocessing or setup that I'm going to be showing you. So we'll get there; we know those numbers are there, and we'll use them as we need to. But let's look at the data first, and that's something you always want to do anyway. Whenever you get a new data set, you want to look at the data and see where things fall out. So let's go to Graph Builder. We're going to pull in octane and the wavelengths, and we're going to drop those down on our X axis. Before I do that, let me close that out. I want to color and mark by the octane rating, and I'm going to change the scale from green to black to red. Say OK, and you can see the colors in the data set. Let's go back to Graph Builder, pull this back in, and drop those. A little more colorful there. Now that we've got these in there, it's really hard to tell anything at all. What we saw before with the overlay was bad enough, but now we're looking at an even more jumbled grouping of points. So let's turn on the parallel plots. Again, that kind of pulls things in, and we can see, again, a jumbled mess, but we've got another tool that will help us investigate the data a little further, and that's the local data filter. So we're going to go there, and we're going to pull in sample number and octane rating. We'll add those, and I'll stretch this out a little bit so it's not so compressed. So now we can actually go into the single spectra, see this one over here in the green, and dive into those separately. I'm going to take that back off.
Alright, so that's grouping, and we could actually pull this in and start looking at the different octane ratings, and see which spectra are associated with the higher octane ratings or the lower; it's your choice. It just gives you a tool to investigate the data. Do you need to do any more preprocessing to get the spectra in line with each other, or set things up so you can see the different groupings better? Okay, so that's looking at Graph Builder, and I'm going to clear out the row states here. From here, we want to better understand what's going on with the data. Like I said, we're looking at spectral data, and it's very highly collinear, or multicollinear, and this is something you may want to prove to yourself. So let's go to Analyze, Multivariate Methods, Multivariate, and we're going to select all our wavelengths. Fairly quickly, we get back these Pearson correlation coefficients, and they're all close to one, for the most part, in these early wavelengths. That's just telling us that things are very highly correlated. That's the way it is, and we need to deal with it as we go forward. Okay, so we've looked at the data, we're set up, and now we can look at another piece of information. This is newer in JMP 15, and it's also in regular JMP. We're going to go to Analyze, Quality and Process, and Model Driven Multivariate Control Chart. Again, we pull in all our wavelengths and say OK. Now we're looking at the data in a different way. This is basically, for every spectrum, all 400 wavelengths, and we can see where some of these are a little bit out of what would be considered control; that's the red line. So if we highlight those, I can right click in there and say contribution plots for selected samples.
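The near-one Pearson correlations described above come from the fact that neighboring wavelengths in a spectrum move together. A small sketch makes the point, with smooth random walks standing in for real spectra (an assumption for illustration, not the gasoline data):

```python
import numpy as np

# Smooth synthetic "spectra": each wavelength is the previous one plus a
# small increment, so neighboring columns are almost perfectly correlated.
rng = np.random.default_rng(1)
spectra = np.cumsum(rng.normal(size=(60, 401)), axis=1)

corr = np.corrcoef(spectra, rowvar=False)   # 401 x 401, wavelength vs wavelength
adjacent = np.diag(corr, k=1)               # each wavelength vs its neighbor
median_r = float(np.median(adjacent))
print(f"median correlation between adjacent wavelengths: {median_r:.3f}")
```

This is exactly the multicollinearity that makes ordinary least squares on raw wavelengths unstable, and why the talk reaches for PLS and principal components instead.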
Now I can see differences in the spectra as they're compared to the other spectra in the overall data set; we can see which parts of the spectra are considered more or less out of control. And if I can get this to work, we can get there, and then for that particular wavelength, we can see that those three samples are more or less out of spec, based on this control chart, compared to the other samples. Alright, so again, that allows you to dive deeper and tells you which group it is. This is all about learning more about your data: which ones are good, which ones are bad, or which ones may be different. It's an added tool to help you better understand where you may be seeing differences. Okay. So with that, we've got things pretty much set up, and we want to go into the analysis part. As we go into Analyze, we have to set things up so we get what we want, when we want it, and how we want to analyze it. So we're going to go to Analyze, and this is where we're going to use Functional Data Explorer to help us select the samples. Alright, so go to Analyze, Functional Data Explorer; this is a JMP Pro 15 thing. Instead of stacked data, we're going to use rows as functions, and again, we're going to select all of our wavelengths. We're going to use octane as our supplementary variable, and then the sample number is our ID function. Right, so we've got it set up, ready to go. Now, for looking at the data: remember how we had everything lined up when we were looking at it before? This is all the data overlaid again. If we needed to do some more preprocessing, we could do that over here in the transform area, where we could actually center the data and standardize it. For the most part, this data is fairly clean, so we don't have to do that, and we're going to go ahead and do the analysis from here.
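The flagging step just described can be sketched with the T-squared statistic that underlies Model Driven Multivariate Control Charts: project the spectra onto a few principal components and flag samples whose scores sit far from the center. This is a minimal stand-in on synthetic data; JMP's platform adds proper control limits and the contribution plots shown in the demo.

```python
import numpy as np

# 60 synthetic spectra; the last three are deliberately shifted so they
# should show up as "out of control".
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 401))
X[57:] += 3.0

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:4].T                       # first four PC scores
t2 = ((scores ** 2) / scores.var(axis=0, ddof=1)).sum(axis=1)

flagged = sorted(np.argsort(t2)[-3:].tolist())
print("highest T^2 samples:", flagged)
```

The three shifted rows dominate the first principal component, so their T-squared values stand far above the rest, just as the red out-of-control points do in the demo.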
Okay, so there are B-splines, P-splines, and Fourier basis models. B-splines will give you a decent and fairly simple model. But spectral data is, again, so highly correlated, and the wavelengths are so close together, that we want to understand where we're seeing differences on a much finer basis, as opposed to something like a B-spline, which would spread the knots out. We want to keep the knots as close together as possible to help better understand what's going on. So this takes a few seconds to run, but I'm going to click P-splines. It's going to take, I don't know, 15 or 20 seconds, but it's going to fit all those models. And it's almost done. Alright, so now we've fit those models. If I had run a B-spline, it probably would have used around 20 knots at most; here we're looking at 200 knots. So it's basically taking those 400 wavelengths and splitting them into groups of two, looking at individual groups of two. And this is the best overall model based on the AICc and BIC scores and negative log-likelihood. It's a linear function; we could go backwards and use a simpler model if we want. We could also go forward and see how many more knots it would take to get an even better model. I can tell you from experience that around 203 to 204 knots is as good as it gets, and there's no reason to really go that far for the little bit of improvement we would get. So we've fit those now. You can see we've fit all the models for all the spectra, and let's go down to our functional principal components. This is the mean spectrum right here, and these are the functional principal components from that data. Each one of these eigenvalues, or functional principal components, is explaining some portion of the variation that we're seeing.
So you can see the first functional principal component explains about 50% of the variation, and so on and so forth. And it's additive: you can see by the second component we're at about 72%, and by number four we get to, basically, our rule of thumb or cutoff point. If we can explain 85% of the variation, that's our cutoff for the number of functional principal components we want to grab and build our DOE from. So we're going to go with four. Some other things you can look at are the score plots. It looks like spectrum number five is kind of out there, and if you really wanted to look at that one, you could; you can pop that out or pin it to the graph. But you get an idea of which spectra are out there and what they might look like. In this case we see some differences in 15 and 5, and remember, 41 was kind of out there too. But we can see some other things. The functional principal component profilers are down here. Now if you wanted to make changes, or you wanted to better understand things, you would ask: as I move my functional principal components around, how do I change my data? Well, it's hard to really visualize that. So something that's newer in JMP Pro 15 is this functional DOE analysis, and that's why I added the octane rating as our supplementary variable. Alright, so I'm going to go down here and minimize some of these things a little bit. Down here in this FDE profiler, we've actually done some generalized regression; it's built in. So we're looking at that model, and as we look through these different wavelengths, we can see what happens with the octane as we get to different wavelengths. A particular wavelength may be different for the different octane ratings, and that's what you're looking for: you want to see differences. So where can we see the biggest differences?
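The functional-principal-components step and the 85% rule of thumb can be sketched as follows. The curves here are synthetic, built from a few underlying shapes; JMP's P-spline smoothing is not reproduced, so this only illustrates the variance-accounting logic.

```python
import numpy as np

# Curves driven by four underlying shapes with decreasing weight, plus a
# little noise (all assumptions for illustration).
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 401)
shapes = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
                    np.sin(4 * np.pi * t), np.cos(4 * np.pi * t)])
weights = rng.normal(size=(60, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
curves = weights @ shapes + 0.01 * rng.normal(size=(60, 401))

# PCA on the curves: cumulative explained variation, then the 85% cutoff.
centered = curves - curves.mean(axis=0)
eigvals = np.linalg.svd(centered, compute_uv=False) ** 2
explained = np.cumsum(eigvals) / eigvals.sum()
n_fpc = int(np.searchsorted(explained, 0.85) + 1)
print(f"components needed to reach 85% of variation: {n_fpc}")
```

The number of components that crosses 85% is what the talk carries forward as the covariate factors for the custom design.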
Well, I don't know if you saw that happen out here, but right here at the end, at the higher wavelengths, we're seeing some significant changes. So I'm going to go out here, and you can see the curve is bowed a little bit there, and as I go back to the lower wavelengths, this curve actually gets a little steeper; it's not as flat as it was at the higher octane ratings. OK. So again, this is all about investigating the data. But what we're going to do is go ahead and save those functional principal components, and we'll do that through our function summaries right here. We need to customize that: I'm going to deselect all the summaries, and I'm going to put four in there, because that's the number I want to save. And just as a watch-out, make sure you say OK and Save. If you just say OK, it's fine, but it won't get you where you want to be. So we're going to say OK and Save, and we get a new table with our functional principal component scores in there: all four of those, for the different samples and the different octane ratings. Now what we have to do is get that information back to our main data table. You could do this through a virtual join; what I have done is actually copy these, and there's a way to do this fairly simply. So I need to go back over to my main data table. I don't need to keep this table; I just want to get these scores over to the main table. Alright, so you just grab them, drag them over to your main data table, and drop them. I've already done it, so I'm not going to drop them in there, but that's one way to get the data over there quickly. Let me minimize that for now. So this is the data; it's right there. I've copied it over, so we've got the scores.
Now we're going to do what I consider the most important step: we're going to pick the samples that we're going to analyze. This gets you down to a much smaller number of samples to build your models on, and this is where we're going to use design of experiments. So we're going to select DOE, Custom Design. Don't worry about the response right now; we're going to add factors, and we're going to add covariate factors. You'll see in a minute why we're doing this. So I'm going to add covariates. You have to select what your covariate factors are, and we're going to choose the functional principal components and say OK. We're going to look in this functional principal component space to figure out which samples we're going to analyze to build our model. So I select Continue, and right now it's saying, select all 60 and build the model from there. Well, we want to take that down to a much smaller number; we're going to use 15. That allows us to select a smaller number of spectra. We don't have to run as many, but you still have all of them, and you can select from them. So I'm going to say Make Design, and while this is building... Alright, we don't need this; I'm going to get rid of it. That's just some information. What we've seen now is that in our data table, 15 rows have been selected and highlighted in blue. I'm going to go ahead and right click on that blue area and put some markers on them, put a star on them, and I'm actually going to color those as well. So let's take those blue rows. Okay. And before I forget, what you do now is take these and do a table subset. So Table, Subset: we've got selected rows, all columns, say OK, and this is where we're going to be doing our modeling.
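The covariate-DOE selection step can be sketched in spirit: from 60 candidate rows of functional principal component scores, choose 15 that maximize det(X'X), the D-optimality criterion. JMP's Custom Design uses a coordinate-exchange algorithm; the greedy pass below is only a stand-in to show the principle, and the scores are random stand-ins for real FPC scores.

```python
import numpy as np

rng = np.random.default_rng(4)
fpc_scores = rng.normal(size=(60, 4))            # stand-in FPC scores
X = np.hstack([np.ones((60, 1)), fpc_scores])    # intercept + 4 covariates

chosen = []
for _ in range(15):
    best_row, best_logdet = -1, -np.inf
    for i in range(60):
        if i in chosen:
            continue
        trial = X[chosen + [i]]
        # a tiny ridge keeps the early, rank-deficient iterations well defined
        _, logdet = np.linalg.slogdet(trial.T @ trial + 1e-8 * np.eye(5))
        if logdet > best_logdet:
            best_row, best_logdet = i, logdet
    chosen.append(best_row)

print(f"selected {len(set(chosen))} distinct samples:", sorted(chosen))
```

Maximizing the determinant pushes the chosen rows toward the corners of the score space while keeping them spread apart, which is why the selected stars in the demo look like a space-filling pattern.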
But before I go there, let's go back to our main data table and go to Analyze, Multivariate Methods, Multivariate. This time, instead of using the wavelengths, I'm going to use the functional principal components. Put those in our Y role, say OK, and now look: where before we had almost complete correlation for a lot of the wavelengths, we've taken that out of play. And if you're looking at the space now, with the markers as you see them, the stars: we're pushing things out to the corners of our four-dimensional space, but we're also looking through the center of the space as well. So this is more or less a space-filling design, but it's spreading the points out to where we're hopefully going to get a decent model out of it. Okay. So we've got that, and I need to pull up my data table again. Pull this one up. Okay, so these are the samples that we're interested in, that we're going to build our model on. I'm going to slide back over here to the beginning; these are the rows that were selected. And now we're going to go to Analyze, Fit Model. Octane is what we're looking for; that's the rating we're after. And we're going to use all of our wavelengths. The next thing I'm going to show you is a JMP Pro feature, where you select partial least squares in the Fit Model platform. You can do the same partial least squares analysis in regular JMP, just so you know, but we're setting it up here because, if you wanted to, you could actually look for interactions. We're not going to worry about that in this model. Select Run, and we've got to make a few modifications. Here you can choose which method you want, NIPALS or SIMPLS. SIMPLS is probably a little more statistically rigorous, but NIPALS works for our purposes. The validation method: we do want to use one.
But we don't have very many samples, so we're going to use leave-one-out. Each row will be pulled out in turn and used as the validation. So we're going to start and just say Go. As you can see up here, on the percent variation explained for the X and the Y, we're doing very well. The model is explaining quite a bit of the variation for both the X and the Y, 90 to almost 100%. That's great, but it's using nine latent factors. Remember, we only had four functional principal components, so let's see what happens when we go to that. Change to four, select Go, and we do lose something in our model, but it's not bad; we're still getting a decent overall fit. And that's where we're going to go: we're going to use that model instead of the more complicated model with the nine latent factors. So I'm actually going to remove this fit, and then we're going to look at this four-factor partial least squares fit. What we're looking for down here is that the data isn't spread out in some wild fashion. For the score plots, the data is somewhere close to the fit line, and we're okay with that. And if we're looking at other parts of this, again, we're looking at how much of the variation is being explained, and we're looking at 97% there for X and almost 99% here for Y, and that's good. Let's look at a couple of other things while we're here. Look at the percent variation plots, which give us an idea of how these spectra are different. We can see that latent factor one is explaining a fair amount of the differences, but latent factor two is explaining the more important part of that. That's what we're dialing into; three and four are still part of the model, but they're not as important. So something else we can look at is this variable importance plot. There is a cutoff value here.
It's a value of 0.8, right here at that dotted red line. If you wanted to do variable reduction, you could do it here: you could actually lower the number of wavelengths you're looking at, but we're going to leave that as is. The way to actually make that change, to do the variable reduction, would be through this variable importance table, with the VIPs and the centered and scaled coefficients; you could make a model based on just the important factors. You can see again that the dotted line is the cutoff line, and a fair number of those wavelengths would be cut out of the model. But again, we're going to keep them, and we're going to go up here to the red hotspot, go to Save Columns, and save the prediction formula. Alright, so that saves back to the data table. I'm going to minimize that, and we've got this formula out here; that's our new formula. If I go to Analyze, Fit Y by X, go to octane, and grab that formula, say OK: great, we fit the model, and our R squares are around .99, and that's really great. But the question is, how does that work for the rest of the data? I'll show you that in a minute. But before we get there, I want to show you a separate method, another opportunity; I'll show you the setup and then the model. You would do Analyze, Fit Model, and we're going to do Recall. This time, instead of using partial least squares, we're going to use generalized regression. Select that; we've got a normal distribution, and we can go ahead and select Run. We're going to change a few things here. Instead of using lasso, we're going to use elastic net. Then under the advanced controls, we'll make a change in a second. But for this validation method, remember we used leave-one-out, so we'll change that: we're going to select the early stopping rule.
We're also going to, under the advanced controls, make this change here. This is what drives why I use generalized regression at all: it helps make a simpler model. If you blank out that elastic net alpha and then run your model, when I click Go, it steps through lasso, elastic net, and ridge regression, fits all those models or tries to, and then it gives you the best elastic net alpha. Doing that takes a little bit of time, because you're building all those different models, so I'll show you the outcome of that in this fit right here, which I had done earlier. This is the actual output that I got from that model, again with leave-one-out, and it gave me 41 nonzero parameters. If I show you the other model, that partial least squares model has all 400 wavelengths. So we've basically reduced the number of active factors by a factor of 10 with this elastic net model. We can look at the solution path, and we can change things; we could actually reduce the number of factors, or add more, but for the most part we'll just leave the model as is. We would save this model back to our data table; I've already done that. And now let's compare those. That's this model right here, the information right over here on the left. I passed it too fast. This highlighted column: if I right click there and go to Formula, I can look at that, and again, these are the important wavelengths for that model for predicting octane. If I look at the partial least squares model and go to the formula there, this is the partial least squares model, and again, all 400 wavelengths. So it's a much more complicated model. You're more than welcome to use it; it's actually a very good model.
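The penalized-regression step just described can be sketched with an elastic net fit. Synthetic data again, and sklearn's `l1_ratio` plays the role of JMP's elastic net alpha; the point is that the penalty zeroes out most wavelengths, leaving a short active list like the talk's 41 of 400.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# 60 samples, 400 "wavelengths", only 20 of which truly matter (assumed).
rng = np.random.default_rng(7)
X = rng.normal(size=(60, 400))
beta = np.zeros(400)
true_idx = rng.choice(400, size=20, replace=False)
beta[true_idx] = 2.0 * rng.normal(size=20)
y = X @ beta + 0.1 * rng.normal(size=60)

# Cross-validation over penalty strength and l1/l2 mixing, loosely
# mirroring JMP's search over the elastic net alpha.
model = ElasticNetCV(l1_ratio=[0.5, 0.9, 1.0], cv=5, random_state=0).fit(X, y)
active = int(np.count_nonzero(model.coef_))
print(f"{active} of 400 wavelengths kept as active factors")
```

The active count lands far below 400, which is the whole argument for the simpler generalized-regression model over the full-wavelength PLS formula.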
So there's no reason not to use it, but if you can build a simpler model, that's always a good thing. Alright, so we've got these formulas in our new subset table, and we want to transfer them back to the original data table. So again, right click, go to the formula, copy it, then go back to your data table, make a new formula column over here, right click, go to Formula, and paste that formula into your data set. Well, I've already done that to save some time, and I've got both models there; I've got the partial least squares model in there too. Really, what it comes down to is we're going to go to Analyze, Fit Y by X, and we're going to go to octane rating. I've previously done the PLS analysis; that model was built with 48 samples as the training set and 12 for the validation side. So that's there. I've got my generalized regression formula, and I've got my octane prediction formula; actually, this is the other PLS approach right here, and this one. We're going to add those, say OK, and compare them. And now you can see that we're doing very well overall. The models are doing well. We're still at about 97% for our generalized regression model in the end, which is still good. The PLS model beats it out a little bit, but remember, that's a much more complicated model. Overall, we've built this nice predictive model that we can share with others. And as you get new spectra to analyze, all you have to do is drop those wavelengths into your data table and see what the octane rating is. So you've made that analysis, you've made that comparison. And if nothing changes with your calibration over the course of a couple of days, this should be a good model; it should be a sustainable model for you.
So with that, let me show you one more thing. I'm going to go to another data set that I wanted to share with you; let me find it from my home window. This is a mass spec study, actually a prostate cancer study, and there's some unusual data in it. What I want to show you here, as I pull this in, is abnormal versus normal status, and the power of the tools for that. Before I go to the analysis, let me color on status, and we'll give them some markers here. Okay. We're going to go to Analyze, Quality and Process, Model Driven Multivariate Control Chart, with all of our wavelengths. It takes a second to output, but right now it looks like I've got all the normal data selected; that's what you're seeing there if I click there. The red circles are the abnormal data, and for the most part, we see that a lot more of those are out of control compared to the normal data. The nice thing about this is that if I pull up one of those, I can start seeing which portion of the data is different from what we're seeing with the so-called in-control, or normal, data. Let me show you something else. We want to look at the process a little deeper, so let's go to the red hotspot, Monitor the Process, and then we're going to go to the score plot. Now we can compare these two groups. Well, we have to do a little bit of selection here. So let's go back to the data table, right click, and say select matching cells. We're going to go back over here, and now that's all selected, so we're going to make that abnormal group our group A. Go back to the data table and scroll down.
Select normal, select matching cells, and now that's going to be our group B, so we can compare those. Now we can see where there are differences in the spectra: this peak, for example, is maybe on the more normal side, and you won't see it on the abnormal side. But there are a lot more differences on the abnormal side that you would not see on the normal side. So this allows you to, again, dig deeper and better understand that. And finally, let's do this analysis in Functional Data Explorer with this grouping. Again, rows as functions; the wavelengths as the Y output; status is our supplementary variable; sample ID is the ID function. Say OK. We'll fit this again with a P-spline model; this will take a second. While we're waiting for this to happen, I just want to show you what it's like looking at a categorical data set with Functional Data Explorer, using that functional DOE capability; it ends up being very valuable when you're looking for differences in spectra. And again, this is mass spec data, not near-IR data. We've fit it, we've looked at how the different spectra fit, and we're happy with that. We can look at the functional principal components, and we can look at the score plots. Let's look at the functional DOE. Again, where do we see differences? If I go over here, we're looking at an abnormal spectrum: it doesn't have this peak that the normal one does. So now we can look at that and see, again, what differences we might find; it helps us better understand them. Alright. And in closing, let's go back to the PowerPoint. So what this process allows you to do is compress the available information with respect to wavelengths, or mass, or whatever it happens to be.
Use the covariate DOE to help you select the so-called corners of the box, getting a good representative sample of data to analyze. Model that data with partial least squares or generalized regression; you can also use more sophisticated techniques like neural nets. And as new spectra come in, you put the data into the data table and you see where it falls out. So this is highly efficient, or helps you be more efficient, with your experimentation and your analysis. And again, you build that sustainable empirical model. Looking forward: the data that I've used is fairly clean, and we're working with our developers on how to preprocess spectral data to get even better analyses and better predictive models.