Pre-Processing Spectroscopic Data for Analysis (2021-EU-30MP-787)

Level: Intermediate


Bill Worley, JMP Senior Global Enablement Engineer, SAS


Pre-processing spectroscopic data is an important first step in preparing your data for analysis in any data analysis tool. We will review and demonstrate Standard Normal Variate (SNV) and Savitzky-Golay (SG) smoothing using JMP. These pre-processing tools are fairly standard throughout industry and are a first step in building predictive models from raw spectroscopic data. Special thanks to Ian Cox for sharing his SG add-in.



Auto-generated transcript...




Bill Worley Hello everyone, this is Bill Worley, JMP systems engineer. I've been with JMP almost seven years. Today we'll be talking about
  preprocessing spectroscopic data for analysis. I'm calling this a prequel because I've already given a talk on analyzing spectral data within JMP, but the pre processing is an important first step and we'll show you why. Let's get into it.
  So let's talk about the abstract, a little bit here. Pre processing spectroscopic data is really an important first step in preparing your data, any data, for analysis, really.
  We can, you know, talk to you about...we're going to review and demonstrate the standard normal variate and the Savitzky-Golay smoothing tools in JMP. These pre processing tools are fairly standard throughout the
  chemical world, the chemical industry and really anybody who's doing spectroscopy and analyzing spectroscopic data. They would know that...
  more about the standard normal variate and the Savitzky-Golay and how they're used. So again it's an important first step in helping you build predictive models from your raw spectroscopic data.
  So, with a little background, spectral data can be very messy. Pre processing is used to help smooth and filter and baseline correct prior to building these predictive models or any predictive models you might want to build. The Savitzky-Golay filtering and standard normal variate
  are tools that have...they're used in JMP. The Savitzky-Golay was an add-in that was developed by Ian Cox, so that is...thanks to Ian for that. Standard normal variate requires a...
  some sort of formula and we'll show you how to build that into JMP. And then the last line, this is...the Graph Builder is also an invaluable tool for visualizing your spectroscopic data and that can be used as an alternative platform for smoothing your data.
  But again we'll get into that as we get further into the talk.
  Why we do this. Why would you want to do any kind of pre processing? Well, if you look at that, first, that top spectra right there, or groups of spectra, that is the raw data from 100 spectra and you can't really tell any difference between
  you know, a mixture of 100% starch or 100% gluten. You can see some differences, you just can't tell what's what. Well,
  what we do from there, we do the Savitzky-Golay smoothing and filtering and then you take that first derivative and that gives you a much cleaner set of spectra,
  nicely grouped, but still not completely where you'd like to be. And then, finally, after you do that standard normal variate smoothing, the spectra is much cleaner, much easier to see where the peaks are.
  And you can see where the differences are in the different groupings of spectra. So again there's 20 spectra for each one of these red lines to gray lines to blue lines, and it's definitely much cleaner. You can see where the peaks are for the different mixtures.
  Okay, so with Savitzky-Golay filtering, it's a digital filter applied to your spectral data for the purpose of smoothing. The nice thing about it is it doesn't alter the overall spectra itself, where the peaks are and everything like that. So it's a nice tool there.
  It uses convolution for smoothing and that's based on a linear least squares formula.
  The first and second derivatives, we're also able to obtain those from the smoothed signal.
  And filtering of any unimportant data on either end of the spectra can be done as well. If you get into situations where, you know, you have something at the beginning, it's just unimportant and it's only going to mess you up as you go forward and same thing, on the other end.
  Graph Builder allows you to do some even internal smoothing or internal filtering that you cannot do with the Savitzky-Golay
  add-in. So you can use the local data filter with the Graph Builder and further define areas of interest.
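
  As a rough illustration outside JMP, the smoothing, derivative and end-trimming steps described above might be sketched in Python with SciPy's savgol_filter. This is not the add-in's implementation; the simulated spectra, the wavelength range, the 11-point window and the second-order polynomial are all assumptions made for the example.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical data: 5 noisy spectra sampled at 200 evenly spaced wavelengths.
rng = np.random.default_rng(0)
wavelength = np.linspace(1000.0, 2500.0, 200)
clean = np.exp(-((wavelength - 1700.0) / 120.0) ** 2)
spectra = clean + rng.normal(scale=0.02, size=(5, wavelength.size))

step = wavelength[1] - wavelength[0]  # spacing, needed to scale the derivatives
window, order = 11, 2                 # assumed settings; order must be < window

# Local least-squares polynomial fit in a sliding window, one spectrum per row.
smoothed = savgol_filter(spectra, window, order, axis=1)
first_deriv = savgol_filter(spectra, window, order, deriv=1, delta=step, axis=1)
second_deriv = savgol_filter(spectra, window, order, deriv=2, delta=step, axis=1)

# Filter out unimportant regions at either end of the spectra (assumed limits).
keep = (wavelength >= 1100.0) & (wavelength <= 2400.0)
trimmed = smoothed[:, keep]
```

  Widening the window smooths more aggressively, while a higher polynomial order follows the raw signal more closely.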
  Let's go on. With the Savitzky-Golay
  analysis, if we do that analysis, this is a grouping of spectra. So we've got the smooth,
  the first derivative, now stretched out a little bit. And we put the zero line in there. And with the first derivative, you can
  tell where the peaks are, based on where the zero points are, or the crossing of the zero points, right, so that gives you an idea of where the peaks are for any given
  group of spectra, set of spectra. And for the second derivative, now that's a little bit different. It helps flatten the baseline and then, if you have peaks that are overlapping, it helps you
  further develop or better see where those are overlapping and you go from a positive to a negative, which is an indication of overlapping peaks and where they might be. Okay, and that's how you use that tool.
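
  The zero-crossing idea can be sketched in a few lines of Python. This is a minimal illustration, not anything the add-in does: the helper name peak_positions and the two-peak test signal are hypothetical, and samples where the derivative is exactly zero are not handled.

```python
import numpy as np

def peak_positions(x, first_deriv):
    """Locate peaks as points where the first derivative crosses zero from + to -."""
    sign = np.sign(first_deriv)
    # Index i marks a crossing when the derivative is positive at i and negative at i + 1.
    idx = np.where((sign[:-1] > 0) & (sign[1:] < 0))[0]
    d0, d1 = first_deriv[idx], first_deriv[idx + 1]
    # Linear interpolation between the two samples that bracket the crossing.
    frac = d0 / (d0 - d1)
    return x[idx] + frac * (x[idx + 1] - x[idx])

# Hypothetical signal: two well-separated Gaussian peaks centred at 3 and 7.
x = np.linspace(0.0, 10.0, 200)
deriv = (-2.0 * (x - 3.0) * np.exp(-(x - 3.0) ** 2)
         - 2.0 * (x - 7.0) * np.exp(-(x - 7.0) ** 2))

peaks = peak_positions(x, deriv)
```

  The valley between the two peaks also crosses zero, but from negative to positive, so it is not reported as a peak.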
  The standard normal variate itself is again the standardization of the data with respect to the individual spectra. It can be used alone or as an added smoothing tool and we're using it as an added tool today.
  It's used...
  it's used for baseline shift correction and correcting for global intensity variations. The nice thing is, it removes baseline variation without altering the shape of the spectra, right.
  And the data, the thing about this is, and you'll see this when we do it, is that the data must be stacked, okay.
  All right, this is just one of the formulas that
  might be used in setting up the standard normal variate formula, and I'll show you how to do that. There's a couple of different ways, but it gets a little bit involved and I'll show you how to do that. The important part here is that
  we have the standard formula here for the column standardized, but we have to make sure that we use a By variable and that's the spectra itself. So
  there's a little bit of a trick to adding that.
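
  For readers who want to see the arithmetic, here is a small Python sketch of standardizing stacked data by spectrum. The column names (Spectrum, Wavelength, Data) and the numbers are made up for the example, and the sample standard deviation (n - 1 denominator) is an assumption about which convention is wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical stacked table: one row per (spectrum, wavelength) pair.
stacked = pd.DataFrame({
    "Spectrum": np.repeat(["s1", "s2", "s3"], 4),
    "Wavelength": np.tile([1100, 1200, 1300, 1400], 3),
    "Data": [0.10, 0.30, 0.50, 0.20,   # s1
             1.10, 1.30, 1.50, 1.20,   # s2: s1 with a baseline shift of 1.0
             0.20, 0.60, 1.00, 0.40],  # s3: s1 with twice the intensity
})

# Standardize within each spectrum: subtract its own mean, divide by its own
# standard deviation. The groupby plays the role of the By variable.
grouped = stacked.groupby("Spectrum")["Data"]
stacked["SNV"] = (stacked["Data"] - grouped.transform("mean")) / grouped.transform("std")
```

  Because s2 and s3 differ from s1 only by an offset and a scale factor, all three spectra end up with the same SNV values, which is the baseline and global intensity correction described above.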
  Right, so if we did the standard normal variate on the raw data only, this is what it would look like.
  We can see the groupings, we can see, you know, kind of how things break out, but if we did that standard normal variate after we do that first derivative, it's much cleaner. You get a much better idea of where the spectra or where they overlap, where they, you know,
  where you might have to do a little bit...where that second derivative would come into play, where you have overlapping peaks maybe here, right,
  and maybe back over here. So
  that's where this really helps build the better vision of what you're seeing. And that zero line, again, is saying, where, you know, as you cross that zero line, that is an indication of where the peaks are using the first derivative.
  Alright, so let's get into a demo.
  Alright, get out of there.
  So let's talk about visualizing the data...the data first. Remember, whenever you get a new set of data, the best thing to do is pull it in and visualize it. You know, plot that data, get a better idea of what it looks like what you're up against. And let's go down here and pull up
  our gluten raw data. Right, so we've imported the data. Now we need to do some visualization on it, and let's do...let's go to Graph Builder.
  Right and the first things we're going to do,
  move this up a little bit, go to raw data. Right, now we're gonna pull that in.
  And this is the point data, so this is what we...what we saw before, before we did any smoothing, before we did anything else. That was that line data and, if I can change that to
  a parallel plot, right, that gives me the line data for the spectra. Again, we can't see a lot of what's going on there, but it gives us a better idea of where things are falling out. I can make that...
  we can do some things down here, something called...if we need to, and you'll see this with other data, this parallel merged. But that didn't change anything. It didn't pull any together like you'll see with some of the other data, so I'm going to close that.
  Right, and some of the other tools that I like to use when I'm visualizing the data are the model driven multivariate control chart, and let's do that first.
  So we'll go to Analyze, Quality and Process, and Model Driven Multivariate Control Chart.
  Move this up.
  Right, I don't have any historical data, so I say Okay. And what this allows us to do is, look at the data and say, okay,
  we can see groupings of spectra, right. And for the most part, this is...just want to point this out,
  we're able to explain most of the variation that we're seeing, at least up to 85% of the variation with two principal components, okay. And that's using this ??? T squared
  information we've got here. And this also tells us where things might be out of control for a given spectra, right. Why is it different than the rest of the group?
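
  The "variation explained by two principal components" figure can be illustrated with a short NumPy sketch. The simulated data and the helper name explained_variance_ratio are assumptions, and the sketch only centers the columns; JMP's platform may scale the data differently, so the numbers need not match.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component (rows = spectra)."""
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the PC variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Hypothetical data: 100 spectra at 50 wavelengths, driven by two components plus noise.
rng = np.random.default_rng(1)
scores = rng.normal(size=(100, 2)) * [5.0, 2.0]
loadings = rng.normal(size=(2, 50))
X = scores @ loadings + rng.normal(scale=0.1, size=(100, 50))

ratio = explained_variance_ratio(X)
first_two = ratio[:2].sum()  # share of variation explained by two components
```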
  So if I hover over one of these points, I can pull that out, right. If I hover over one of the out of control points, again, pull that out,
  and now we can see, wow, those really are different, you know. There's something different about this spectra that we have to, you know, maybe better understand. In this case, we already have a better understanding, because we know that they're mixtures, but we want to see that.
  And let's pull this one out too. Alright, so that's one more again. So between those three spectra, we've got quite a variety of, you know, the mixture. So we've got three of the...three of the groups and we could put the other two in there as well, just to see. But we'll leave it at that, for now.
  Okay, so that's the model driven multivariate control chart. The other tool that I like to use is under multivariate methods and it's called multidimensional scaling.
  And the reason I like to use this is, especially for systems like this,
  is that we've got
  our raw data again, right, and we need to...
  let me set this up first and I'll tell you.
  We've got two dimensions here, right, and that's just based on what we saw before, but the reason I like to use it is that this allows us to do some grouping or visualize some grouping of the data itself. Alright,
  let's do that. This takes a second to come up.
  And we can see that we've got five distinct groups of spectra, right, so each one of those has to do with a group of our mixture of starch and gluten. If I highlight this grouping right here, that's the first group, which is all, you know, 100% starch.
  The middle group is a 50/50 blend and this group down here is 100% gluten. Right, so what that does is it also highlights the data back in the data table, just like shown there, so that
  interactivity is also nice. What this also shows us is that we're doing a fairly good job already of being able to break out what the different groups are. If we look at the Shepard Diagram,
  we have R square of 1 and
  straight line, alright. So again, a good way to visualize it, and there's more you can do with this, but we'll leave that for another time.
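
  To make the Shepard diagram idea concrete, here is a sketch of classical (eigendecomposition-based) multidimensional scaling in NumPy, followed by a comparison of original and embedded distances. JMP's platform may use a different fitting algorithm, and the simulated groups, the 30 "wavelengths" and the function names are assumptions for illustration only.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_dims=2):
    """Embed points in n_dims dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims]    # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

# Hypothetical data: 5 groups of 4 spectra that lie on a 2-D plane in 30-D space.
rng = np.random.default_rng(2)
centers = rng.normal(scale=5.0, size=(5, 2))
points_2d = np.repeat(centers, 4, axis=0) + rng.normal(scale=0.2, size=(20, 2))
basis, _ = np.linalg.qr(rng.normal(size=(30, 2)))  # orthonormal columns
X = points_2d @ basis.T

D = pairwise_distances(X)
D_embedded = pairwise_distances(classical_mds(D))

# Shepard-style check: squared correlation of original vs. embedded distances.
upper = np.triu_indices_from(D, k=1)
r_squared = np.corrcoef(D[upper], D_embedded[upper])[0, 1] ** 2
```

  When the data really do lie in two dimensions, as in this constructed example, the embedded distances reproduce the original ones and the points fall on a straight line in the Shepard plot.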
  Okay, so far,
  we've brought the data in, we visualized it, right, and now we need to do some pre processing.
  And what we're going to do, I've got these, more or less, here as placeholders, so we're going to do some Savitzky-Golay analysis and we're going to do some standard normal variate, at least on the first derivative, okay.
  Alright, so let me clean this up a little bit.
  Right, close that up.
  And again we're going to take that raw data and we're going to use that Savitzky-Golay add-in to do the analysis. So that's under add-ins for me.
  And this is a free add-in, so you can get it and we'll put it out there for you. So this is again, we're going to take the raw data,
  right, and if you want to learn more about what Savitzky-Golay is all about, we have a Wikipedia link here so you can use that. But we're going to use this to help smooth the data at every wavelength, right, and then take that first and second derivative. And I say, okay.
  And there we go, and we've got our smooth data here.
  And one thing, you know, you can adjust this as needed, but what we want to do is kind of take this,
  widen out that window a little bit.
  Do that one.
  And you can play around with the polynomials here, the order of polynomial fit.
  In this case, it really doesn't make much difference to go to, what, an eighth order polynomial. A second one...the second order does just fine. Alright, and then we can take that data from our first derivative.
  Say save smooth data, right, and we can save that back to the data table. And I've already done that, so I don't have to show you that process, but just know that the data is then taken back to the data table from there, okay.
  Let's close that out.
  Alright, and we're good to go now, but now we need to do that last step, right. So let's look at...before we go there, let's go back and look at
  our Savitzky-Golay data. So this is the first derivative data. I would again want to look at this under the
  Graph Builder, right. And see that's...just doesn't look very promising there, but let's do parallel plot,
  makes it a little better. And clean it up and then let's do this combined scales parallel merge, and this brings the data into a much more
  visual...a better visual than we had before, right. We can play with that and again we could add our own zero line here to get a better understanding of where the peaks cross.
  But that's that first derivative data. You can see the groups, but they're not, you know, they're not completely separated like you'd really like to have them to get a better idea, okay.
  So we're set there. That data is back in the data table. I'm going to close that out, and then remember I said, you have to stack the data to do the standard normal variate, so we need to do that. Let's subset out some data here first.
  Alright, so
  subset all of these out.
  Tables. Subset.
  All rows. Selected columns, and say okay. Alright, so now we're good to go there.
  Well, at least, we are...we've got the data separated out, right. Now we have to do...stack the data, right. And again, this is to be able to put the filter to all the data that we need for the formula.
  So under Tables. Stack.
  And these are our columns that we're stacking.
  Right, and stack by row.
  I'm going to select the columns that we want to keep.
  On top of that. And after,
  I'm sorry...I'm just clicking around just to get this setup, but you'll get the idea. I'll leave this up for a second for folks to see what it takes to get there. So I've saved this.
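
  For comparison, the same wide-to-long reshaping can be sketched with pandas. The table contents are hypothetical, and the output column names Label and Data are chosen here for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per spectrum, one column per wavelength.
wide = pd.DataFrame({
    "Spectrum": ["s1", "s2"],
    "Mixture": ["100% starch", "100% gluten"],
    "1100": [0.10, 0.40],
    "1200": [0.30, 0.20],
    "1300": [0.50, 0.60],
})
wavelength_cols = ["1100", "1200", "1300"]

# Stack the wavelength columns into Label/Data pairs, keeping the id columns.
stacked = wide.melt(
    id_vars=["Spectrum", "Mixture"],
    value_vars=wavelength_cols,
    var_name="Label",
    value_name="Data",
)

# Order the rows spectrum by spectrum, similar to stacking by row.
stacked = stacked.sort_values(["Spectrum", "Label"]).reset_index(drop=True)
```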
  You know, see the untitled 12 now. That's our stacked data, right, and now we can perform that standard normal variate. A quick way to do that is to make a new column,
  right click.
  And I have to build this formula from here, from the formula editor. I'm going to use this statistical tool called
  column standardize, right. And if I hover over this, over here, this gives you an idea of what you're doing with that. What does standardize mean, right. So we're going to...
  a mean of 0 and a standard deviation of 1, so that's what we're trying to get with the data and with respect to all the given spectra. So we need to take the data
  here and then, remember I said we had to add that By variable, which is, in this case, going to be the spectra. I'm going to say okay.
  Alright, and that gives us a data set. Now there's a simpler way to do this, or what I think is a simpler way, but it can also be...
  it's a little bit more time-consuming sometimes. But if I right-click on the spectra right here, and again, remember I'm doing this because we need a By variable. So I'm going to go to new column info and make that group by.
  And I'm going to take my data column and I'm going to right-click on that and I'm going to do a new formula column.
  We're going to slide over to distributional and remember, standardized. We want that
  mean of 0; standard deviation of 1. Select that and then that builds out that new column. The nice thing about this is that the data matches, so I can breathe a sigh of relief that I got it right.
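
  The same kind of cross-check can be done outside JMP. The sketch below computes the standard normal variate two different ways in Python, row-wise on a wide array and by group on stacked data, and confirms that the results agree. These are not the two JMP routes shown in the demo; the random data and the sample standard deviation convention are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 4 spectra measured at 6 wavelengths.
rng = np.random.default_rng(3)
wide = rng.normal(loc=1.0, scale=0.3, size=(4, 6))

# Route 1: standardize each row of the wide array.
row_mean = wide.mean(axis=1, keepdims=True)
row_std = wide.std(axis=1, ddof=1, keepdims=True)
snv_wide = (wide - row_mean) / row_std

# Route 2: stack the data and standardize by spectrum.
stacked = pd.DataFrame({
    "Spectrum": np.repeat(np.arange(4), 6),
    "Data": wide.ravel(),
})
snv_stacked = stacked.groupby("Spectrum")["Data"].transform(
    lambda v: (v - v.mean()) / v.std()  # pandas std uses n - 1 by default
)

# The two routes should produce the same numbers.
matches = np.allclose(snv_wide.ravel(), snv_stacked.to_numpy())
```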
  Right, so I've got that. Now, this would require from here, and I'm going to click that column,
  that we take this data, split it, and then put it back into the regular data table or the initial data table. One thing I want to show you with this is that
  once you have the data stacked, you can do some other visualizations, right. So if I go to the...back to the Graph Builder right, I want to take this data
  into the
  Y, and then I'm gonna pull the label over, and this is a cool way to see where all the variation is. So these are all the,
  you know, box plots for where we're seeing the variation in the data, so that's just a really nice way of being able to visualize what's going on.
  Another powerful way of using the Graph Builder, and the data has to be stacked in this case to do that. Now, what you can also do is
  if you pull in the point data, let's take off that, put the point data in there and then we could do some smoothing. Now that's an individual smooth so that's an average...for every...for all the data. What we would like to do is do that
  Savitzky-Golay smoothing, so I need to pull in
  the overlay, right. So that's all the data. Now it's going to be hard to see, but I'm going to do this, hopefully you can see this, if I select under smoother, we're going to switch to Savitzky-Golay.
  Right, and I think you're...hopefully, you saw that that now fits that line, that smoothing line right there.
  That's another way of smoothing the data, right, using the Savitzky-Golay. And then you can change this to a quadratic, you can do some local trimming here, and then ultimately, you could save that back to the data table.
  I'm going to leave it at that for now.
  Right and remember, we had to split this data, save it back to the data table. I'm going to forego the splitting step and then just go back to the data...the original data table.
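
  The splitting step that is skipped here could look roughly like the following in pandas: pivot the stacked SNV values back to one column per wavelength and join them to the original table. The table contents, the column names and the "SNV " prefix are hypothetical.

```python
import pandas as pd

# Hypothetical stacked SNV results: one row per (spectrum, wavelength) pair.
stacked = pd.DataFrame({
    "Spectrum": ["s1"] * 3 + ["s2"] * 3,
    "Label": ["1100", "1200", "1300"] * 2,
    "SNV": [-1.0, 0.0, 1.0, 0.0, -1.0, 1.0],
})

# Split: one row per spectrum, one SNV column per wavelength.
wide_snv = (
    stacked.pivot(index="Spectrum", columns="Label", values="SNV")
    .add_prefix("SNV ")
    .reset_index()
)
wide_snv.columns.name = None

# Join the new columns back to a (hypothetical) original table by spectrum.
original = pd.DataFrame({
    "Spectrum": ["s1", "s2"],
    "Mixture": ["100% starch", "100% gluten"],
})
combined = original.merge(wide_snv, on="Spectrum")
```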
  I don't need that.
  Right and then when I
  look at the data again under Graph Builder,
  I may take this standard normal variate group,
  drop that in. Again, doesn't look very promising but we'll do the parallel plot and then right-click.
  Combine those scales.
  And that's basically our finished smoothing, alright. So we've got it all set up. Again we could add a zero line here,
  if needed. Let's go ahead and add that.
  Zero and we'll make that some sort of green,
  there's that, add it.
  Say Okay, and then again, you would use this to help you figure out where your peaks are.
  Based on the peak maximum as I crossed that zero point, okay.
  So pretty much, that's it. Let's go back to our
  So in conclusion, pre processing...we needed to pre process the data...pre process the data to get it cleaner, to get it, you know, much easier to work with.
  We're also, with JMP...
  JMP is making this...
  we're working to make this a much easier process. Pre processing in JMP 16 is going to be supported by add-ins and built-in capability.
  We're looking forward to JMP 17 where pre processing is going to be offered in the Functional Data Explorer with much more sophisticated spectral and baseline smoothing options, along with some peak detection and selection.
  For analyzing spectral data, please...there was a talk that I did back in 2020 for the Discovery
  Munich. That was a virtual talk, but that will give you a good idea of how to use JMP to analyze your spectral data.
  And with that, I'll say thank you and, hopefully, you all found...find this useful. Please let me know.