Analyzing Spectral Data: Preprocessing and Beyond (2021-US-45MP-878)
Bill Worley, JMP Senior Global Enablement Engineer, SAS
Jeremy Ash, Analytics Software Tester, SAS Institute, Inc.
JMP is advancing its capabilities for analyzing spectral data, from preprocessing to building predictive models. This presentation covers visualization, preprocessing, and model development, with takeaways for spectroscopists, analytical chemists, and materials scientists. We review and demonstrate standard normal variate (SNV) and Savitzky-Golay (SG) smoothing using JMP. We then share more advanced preprocessing tools and show how Functional Data Explorer can be used to visualize and qualitatively analyze spectra. Using JMP, partial least squares and generalized regression are demonstrated as model development tools for spectral data.
Speaker | Transcript |
Jeremy Ash, JMP | Hello, and thanks for coming to our talk. I'm Jeremy Ash. I'm a statistician in JMP R&D and Bill Worley's here. He's a senior systems engineer. We're going to be talking about analyzing near IR spectroscopy data in JMP. |
So this isn't the first talk that's been done on this topic. Bill has several related talks which you can find on the Community. | |
I have links to a couple of these here, which are complementary material. Bill's giving a talk on using FDE and DOE to efficiently build multivariate calibration models with fewer experiments, and he's also covered spectra preprocessing in more detail in Discovery Europe earlier this year. And then Bill and I have been writing a blog series on this topic, which is on the Community. Here are the topics that we've covered so far, and we created a JMP add-in, which I'll be demonstrating later. | |
So a common use case of near IR spectroscopy is to determine the chemical composition of a mixture, and I'm showing an example of near IR spectra on the right. | |
Some measure of absorbance is collected over a range of wavelengths, and the spectrum here represents a powdered cereal sample, which is a mixture of several underlying chemical components. | |
The spectra for the pure components are at the bottom, and the spectrum for the mixture is at the top. And due to phenomena like | |
overtones and combination bands, the peaks for the individual components are often broad and overlapping, and that makes deconvoluting the spectra into the underlying components difficult. | |
So, instead, the route that people take is to build models that predict composition as a function of absorbance at many wavelengths, which makes the data high dimensional, with variables that are often highly correlated. | |
So one of the main challenges to overcome when analyzing near IR spectra is that the signal to noise ratio is often low and that's due to uncontrolled variations | |
caused by light scattering effects. And these need to be cleaned up before a reliable model can be constructed. So the data are often subjected to some major preprocessing transformations, and there are several common sources of noise, which are illustrated on the right here. | |
The true spectrum is shown in black, and the spectrum in green has a constant baseline shift, or an additive effect. | |
The spectrum in blue has an additive effect and a multiplicative effect. Multiplicative effects are where distortion is larger at larger absorbances. And then the purple spectrum has an additive effect, a multiplicative effect, and a sloped baseline shift. | |
So next I'll show some of the preprocessing methods for cleaning up these sources of noise. All of these are discussed in more detail in our blog posts, but I'll show several in the demo, and the Savitzky-Golay first derivative will clean up additive effects. | |
And then a second derivative filter will also clean up sloped baseline shifts. And these are commonly used and effective methods, but the problem is that the transform is much harder to interpret than the original spectrum. | |
So the other preprocessing methods I'll mention do a better job of maintaining this interpretability. The standard normal variate centers and scales so that there is zero mean and unit variance along each spectrum. This is a simple normalization, but it's effective at cleaning up both additive and multiplicative effects. | |
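For reference, here is a minimal numpy sketch of what SNV computes, assuming a `spectra` array with one spectrum per row (illustrative Python, not JMP):

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row)
    to zero mean and unit variance."""
    X = np.asarray(spectra, dtype=float)
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mu) / sd
```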
And then there's multiplicative scatter correction, which uses a linear model to make each spectrum look like a reference spectrum; often no reference is available, so the mean spectrum is used instead. | |
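A sketch of MSC under the same assumptions, using the mean spectrum as the reference when none is supplied:

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a
    reference spectrum, then invert the fitted offset and slope."""
    X = np.asarray(spectra, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference)
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, deg=1)  # fit x ~ a + b * ref
        corrected[i] = (x - a) / b        # remove offset, rescale slope
    return corrected
```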
And then the extended multiplicative scatter correction is the only method that will clean up all three sources of noise; this is a more advanced method that requires reference spectra for each component in a mixture. | |
And we'll see how that works later on. | |
And then since the data are high dimensional and the variables are highly correlated, many multivariate methods are applicable. All the analyses listed can be conducted in JMP. | |
Dimensionality reduction methods are essential for visualizing the data. Common methods are principal components analysis or multidimensional scaling. | |
And then the functional principal components in the functional data explorer is a particularly useful method here. | |
And JMP is unique among the software available for spectral analysis in that it has a user friendly functional data analysis platform. And since spectra have a functional form, FDE has a lot of potential uses for spectra. | |
And we're just starting to explore those capabilities. And then clustering is common; hierarchical clustering in JMP is particularly useful. | |
And then outlier analysis is usually conducted; to do this, first build a PCA or PLS model, and then identify outliers in the T squared and DModX charts. | |
And this all can be done in the model driven multivariate control chart platform. And then there are models that predict chemical composition as a function of spectra; these are called multivariate calibration models, and they need to handle cases where there are more variables than observations. | |
And this can be done well with PLS or penalized regression in GenReg or the functional data explorer as well. | |
Okay, so I'm going to switch over to JMP and start my demo. | |
First I'll show how to read in spectral data using the spectral tools add-in. | |
For text delimited files, the add-in we provide is just a wrapper for the multiple file importer, which is here. | |
But the add-in surfaces some of the most useful features. And then we also have a JCAMP-DX importer, which is something that's new to JMP. And in general, | |
most of this can already be done in JMP but my goal was to write some wrappers that streamline the workflow for spectra a bit. And I should also note that this will be improved in JMP 17, so hopefully, this add-in will be obsolete soon, but | |
this might be useful in the meantime, if you're building spectral models. | |
So I have a folder with csv files, one file per spectrum, and in these files the first column is the X variable and the second column is the Y variable. So I'm going to import this directory into JMP. | |
Select that folder. | |
And I want to read it in stacked format because that's the format that the add-in uses. | |
And then I have a header in those files; I'll select that option. And I can also use this to only select csv files. | |
So this reads in the spectra in a stacked format. | |
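To make the stacked layout concrete, here's a rough pandas equivalent of what the importer produces; the folder name is hypothetical:

```python
from pathlib import Path
import pandas as pd

# One CSV per spectrum: column 1 = wavelength (X), column 2 = absorbance (Y).
# Stack them into one table with a spectrum ID column.
frames = []
for f in sorted(Path("cereal_spectra").glob("*.csv")):  # hypothetical folder
    df = pd.read_csv(f)
    df.columns = ["Wavelength", "Absorbance"]
    df.insert(0, "Spectrum", f.stem)  # file name, extension already stripped
    frames.append(df)
stacked = pd.concat(frames, ignore_index=True)
```

Note that `f.stem` drops the file extension, which is the same cleanup done with text to columns below.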
And I also have some metadata in a JMP table here | |
that I want to add to this data table. And the way I'm going to do that is with a join on the spectra column. | |
But I've read in the file names with the file extensions, so I'll use text to columns here to strip off those extensions. | |
I'm giving this a more informative name. | |
And I'm going to use a table update and I'm going to match on the spectra columns. | |
And then the last thing I'm going to do is move this spectrum column here. | |
Okay, so that's how I set up this data table | |
and added it to the sample data in the add-in. | |
So I'm going to have a first look at the data, and to do that, I'm going to use the Launch Graph Builder command in the add-in. | |
I'm going to color by gluten. | |
And on the left here, Spectral Tools sets up some local data filters that have some helpful controls, so you can filter on wavelength to focus in on regions of interest. So if I just wanted to focus in on this peak here, and | |
if I have all of the columns selected in the data table, I can use the show subset command and that will just make a subset with that region of interest. | |
I can also plot individual spectra here with this filter, and | |
it's handy to search for spectra to plot just subsets. | |
And the last thing I'm going to show is the animation controls, | |
but to do this I'm going to lock the axes; otherwise they're going to adapt to the individual spectra. | |
Here...I'm going to speed it up a bit, so we can scan through these real quick. | |
And this feature is particularly useful if you have spectra collected at different time points, like in reaction monitoring. | |
And you can also use our 3D scatterplots in the Graph menu to visualize that data. | |
Okay, so I'm going to clear this filter. | |
One thing to remember: when you have the plot set up the way that you want it, you can save this script back to the data table, so that you can recreate the graph later if you want to swap in new data, like after you've done some preprocessing. | |
So the one other thing that I wanted to show is visualizing variability with some diagnostic plots. To do that, I'm going to remove the grouping variables, | |
and that'll give me a summary of the spectra overall. And I can use box plots. These are useful for getting an idea of whether variation is constant across wavelengths or specific to certain wavelengths. If it's constant, that indicates the need for some further preprocessing, but if it's specific to certain wavelengths, those are often real chemical effects, so these are useful. | |
And then another way to visualize variability is with | |
bar plots. I like to use floating bar plots and then you can also add grouping variables if you want to look at variability in different subgroups. | |
And so that's all I'm going to show with visualization. And the one other thing I wanted to mention is how to read in JCAMP-DX files. | |
So I have a directory that contains two JCAMP-DX files from the JCAMP-DX standard website. I wasn't comfortable sharing the files that we used to test the add-in, because those came from customers. So I've just taken a test file from that website and duplicated it to show how we can read in | |
multiple files. And we only support the most common format, which is files with an X/Y variable list without compression, and we only support one spectrum per file. I believe this covers the most common case, but if you need something else, let us know. | |
So I'm going to read that in here, with the default options. | |
And just to show you that this was read in correctly, I'll plot it in Graph Builder. | |
This shows the same spectra. These are NMR spectra actually, but this is also a common format for NIR spectra. | |
So that's all I'm going to show | |
with data import. | |
And next I'm going to show some of the simple preprocessing methods that are in spectral tools. The intent was to provide some basic preprocessing that enable users to get started on some simple calibration models with something like PLS | |
so that we could get feedback on the workflow, but we're planning a fuller suite of preprocessing methods in FDE in JMP 17. That will be Pro only, but we're encouraging users to use FDE, since that platform really specializes in modeling functional data like spectra. | |
So first I'm going to show | |
Savitzky-Golay smoothing. | |
And you can change the degree of the polynomial that's being used here. You can change the size of the sliding window; a larger size smooths more aggressively. And then the trim here will trim off imbalanced sliding windows. | |
And Savitzky-Golay is actually a special case of local kernel regression, so you might try out the other options for this | |
in Graph Builder, like different ways of defining your sliding window and different weighting functions. And once you've chosen the parameters you want, you can save the spectra back to the table. | |
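The same smoothing is available in scipy if you want to experiment outside JMP; the window length and polynomial degree mirror the controls above, and `spectra` is the assumed one-spectrum-per-row array:

```python
from scipy.signal import savgol_filter

# Savitzky-Golay: fit a low-degree polynomial in a sliding window.
# A wider window (window_length, must be odd) smooths more aggressively.
smoothed = savgol_filter(spectra, window_length=15, polyorder=2, axis=1)
```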
And so next I'm going to show the standard normal variate applied to the smoothed spectra. And this just centers and scales each spectrum individually, and you can do this with a column formula as well; Bill will show that in his preprocessing talk. | |
So now I'm going to visualize the before and after | |
using this. Launch Graph Builder. | |
Color by gluten. | |
You can see with just those simple preprocessing methods, we've done a lot to clean up the noise in the spectra. There's still some noise in the subgroups. | |
And then Bill showed in his talk earlier how you can clean up the data even further if you use a Savitzky-Golay first derivative filter. That's in another add-in that we've covered in the blog posts, but I won't go into that here. | |
The main thing to know, and Bill will show this, is to first apply a first derivative filter; then a standard normal variate will clean up any remaining multiplicative effects. | |
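Sketching that pipeline with the pieces above (a Savitzky-Golay first derivative, then the `snv` function from the earlier sketch):

```python
from scipy.signal import savgol_filter

# First-derivative filter removes additive baseline effects...
deriv1 = savgol_filter(spectra, window_length=15, polyorder=2, deriv=1, axis=1)
# ...then SNV (defined in the earlier sketch) removes remaining
# multiplicative effects.
cleaned = snv(deriv1)
```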
So it turns out that you can do even better preprocessing of these spectra using something called the extended multiplicative scatter correction. | |
So I have that set up here. | |
So these data were actually taken from the paper that introduced the extended multiplicative scatter correction (I'll just call that EMSC from here on out), and | |
to show the advantages of this preprocessing method, I'll use a scatter effects plot. But to do this, I need to compute the average absorbance at each wavelength. | |
And I can do that with a column formula here. | |
And then I'll plot this in Graph Builder. | |
So the scatter effects plot plots the average absorbance at each wavelength against the individual absorbances. Shifts on the Y axis indicate an additive effect, whereas changes in the slope of the individual spectra show up as fanning in these plots, and that indicates a multiplicative effect. And then chemical effects, interestingly, show up as loops in these plots. | |
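A scatter effects plot is easy to reproduce, again assuming the `spectra` array: plot each spectrum against the mean spectrum and look for offsets, fanning, and loops:

```python
import matplotlib.pyplot as plt

mean_spec = spectra.mean(axis=0)        # average absorbance per wavelength
for x in spectra:
    plt.plot(mean_spec, x, lw=0.5)      # one trace per spectrum
plt.plot(mean_spec, mean_spec, "k--")   # 45-degree reference line
plt.xlabel("Mean absorbance")
plt.ylabel("Individual absorbance")
plt.show()
```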
And so there's clearly a large additive effect here, so to clean that up, I can use a first derivative Savitzky-Golay filter. I have that data here. | |
So this cleans up the additive effect dramatically, but there's still some remaining fanning left in these plots. So you can clean that up with standard normal variate. | |
But instead, what the EMSC does is it removes the additive and multiplicative effects in one step, and it doesn't require a derivative transform, which is useful because that'll make the resulting spectra easier to interpret. | |
And essentially what the EMSC does is it separates out variation in the spectra due to chemical effects and the variation due to noise. | |
And then it subtracts off the variation due to noise and it does this with a linear model, so I've set up the terms I need for the model here. And | |
we cover the details of that in the blog. All I'm going to say is that you can use Fit Model to fit the EMSC model, and this is a very useful teaching tool for understanding how the preprocessing model is applied, so that it's no longer a black box. And here we have a separate model for each spectrum, and the estimated coefficients are used in the correction. You can also relaunch this model with different effects to see how that changes things. | |
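A minimal sketch of the basic EMSC regression described above; the full method in the paper adds the pure-component reference spectra as extra model terms so chemical variation is modeled explicitly, but the core idea is one linear fit per spectrum:

```python
import numpy as np

def emsc(spectra, wavelengths, reference=None):
    """Basic EMSC: model each spectrum as
       x = a + b*ref + d*wl + e*wl**2 + residual,
    then subtract the additive and sloped terms and divide by b."""
    X = np.asarray(spectra, dtype=float)
    wl = np.asarray(wavelengths, dtype=float)
    wl = (wl - wl.mean()) / wl.std()               # scale for numerical stability
    ref = X.mean(axis=0) if reference is None else np.asarray(reference)
    D = np.column_stack([np.ones_like(wl), ref, wl, wl**2])  # design matrix
    coefs, *_ = np.linalg.lstsq(D, X.T, rcond=None)          # one fit per spectrum
    a, b, d, e = coefs
    return (X - a[:, None] - np.outer(d, wl) - np.outer(e, wl**2)) / b[:, None]
```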
So I've applied the final correction here and I'll just show the before and after. | |
So the spectra are now remarkably free from scatter effects, and the effect on the spectra due to changes in mixture composition is now clear. So that is all I'm going to cover, and I'll hand it over to Bill, and he can cover some of the modeling methods. | |
Bill Worley | Hey, thank you, Jeremy. |
Let's change over screens here. | |
And you should be seeing my data table about now, I'm hoping, as soon as I click that. | |
Is everyone seeing my data table? | |
Jeremy Ash, JMP | Yes. |
Bill Worley | Okay. So here we go. This is the same data that Jeremy was using, except in this case I've got it in the wide format. Just letting you know that you can analyze the data both ways. Some folks prefer the stacked format, some folks prefer the wide format. |
So just know that it can be done. The first thing I want to show you, which Jeremy was talking about before, is how we can look at the data in the raw format, then apply that Savitzky-Golay first derivative, and then the eventual standard normal variate smoothing. And again, just know that this is done in the wide format. | |
Okay, so showing you that that can be done there. | |
The next thing I want to show is principal component analysis. This is pretty standard for most folks who are analyzing spectral data. | |
It's very doable in JMP: you go to Analyze, Multivariate Methods, and Principal Components. So I've already done that, and what we're showing here is that, for the most part, with two principal components we can explain all the variation that's going on. If we look at this score plot here, | |
you can see the subgrouping of the different mixtures that we have, the gluten and the starch. We can see that subgrouping nicely there. The other thing you can see is what Jeremy had talked about: highly correlated data. | |
We're showing that in this other component plot over here. We could turn the labels on, but it'd be kind of messy to look at. Down here below is an outlier analysis plot, a T squared plot, and we're using the two principal components from the analysis. And it's showing that, yeah, maybe we have some spectra | |
that might be outliers, but in most cases we weren't going to use all the data anyway; it's just kind of a way of scoping the data out, okay. | |
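For the curious, the T squared chart can be reproduced from a two-component PCA; `spectra` is the assumed wide-format matrix (rows = samples):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(spectra)                  # sample scores
# Hotelling's T-squared: squared scores scaled by each component's variance.
t2 = (scores**2 / pca.explained_variance_).sum(axis=1)
print(pca.explained_variance_ratio_)                 # variation explained
```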
The next thing I want to show is something called the model driven multivariate control chart, and this is virtually the same thing I did with the principal components, just in a different way. So we're looking at the same two principal components | |
up here, and we've got the grouping of the subgroups here. And if we look down here, we've got something called the normalized DModX, which is another way of looking at potential outliers within your data. Just know that that's there. The other thing that the model driven multivariate control chart offers is a way to compare your data. | |
What I've done here is, I've selected samples 13, 46, 95, and 45 to show the differences between those four spectra, right. So now we can see that, across those groups, the spectra are quite different. | |
The other way you can do this comparison is down here through the score plot. I'm going to select the letter L, which will give me my lasso tool. | |
And I'm going to circle this as best as I can without grabbing any of the other data, and make that the A group. And then down here in this last red group, I'll grab that and select it as my B group, and then compare, right. And this allows you to effectively compare the data. | |
As I'm doing this, I can see, as I select the spectra, what's being selected in the different groupings, right, so that's just kind of a cool way to visualize the data. | |
Alright. I'm going to go on to the next one. So this is raw data, okay. The next step is with the preprocessed data, so it's virtually the same data that you saw before, | |
but I've done the preprocessing, so I've gone all the way from raw data to the standard normal variate. And again, using the two principal components here, | |
we can see that the data is grouped a little bit better. There's a lot less noise, a lot less variation in that T squared plot. And I swear I didn't do this on purpose, but the data looks like it's a little happier now, right. It's a little bit of a smile. | |
Each one of these little groupings down here is 20 spectra. If I hover over one, I can actually pop that spectrum out and look at it, if I so choose, okay. I can do that same grouping comparison that I did before, so let's try that. Highlight this, | |
and this makes it a little easier, because I don't have to use that drawing tool, because the data is grouped a little bit better, right. In that same instance, I can do what I did before: highlight the data and then do comparisons as needed, okay. So that's using principal components. There's another way to do the same type of analysis, using | |
partial least squares. You save out the partial least squares X scores, and then use that data to build the model driven multivariate control chart from there. And again, this looks really clean. This is the preprocessed data. The grouping is really nice, so we can see it from there. | |
We can look at the DModX. It's still a little bit noisy, but not nearly as noisy as when we were looking at the raw data with the principal components, okay. So let's go on to | |
looking at another method of analysis, hierarchical clustering. Jeremy had mentioned this initially. Looking at how the clustering happens: if we use the raw data, we can see that there are a lot of smaller clusters. | |
And the actual data seems to be spread out in different groups; they're not grouped the way you might hope, right. So we've taken that preprocessed data, done the same analysis, and we get a really nice grouping into five clusters, as you see here. | |
So starch and gluten are separated really nicely here. If you save this data back to your data table through the save clusters capability, | |
you can actually use that in your modeling, if you so choose. You can also save the formula for closest cluster, and then any new data that comes in will be put in the cluster that it's closest to. | |
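A rough scipy equivalent of that workflow, assuming Ward linkage (JMP's default method) and a `preprocessed` wide-format matrix:

```python
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(preprocessed, method="ward")             # hierarchical clustering
clusters = fcluster(Z, t=5, criterion="maxclust")    # cut into five clusters
# Saving `clusters` back alongside the samples mirrors "save clusters" in JMP.
```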
Okay, so those are just a couple of ways to bring your data in, look at it, and get it set up for modeling, and then the next step would be to go to modeling. And what I'm going to show you next is partial least squares...I'm sorry, this is generalized regression. | |
So this is a generalized regression model that we want to show folks, right, and this is where we might see some of what's going on. And it looks like a couple of things have changed on me. I want to grab...there. This is my partial least squares, and this is showing that I've actually analyzed the data in a way that will hopefully | |
give me a good model. And I'll just show you how to do that. You go to Analyze, and this is the JMP Pro version. Like I said, a lot of the things that I'm showing you can be done in regular JMP. If I go to Multivariate Methods, | |
Partial Least Squares, I can do the same analysis, but I'm going to do the JMP Pro version, right. And I'm going to select starch and gluten, select my raw data, add that, and then run it from there, right. And I put in the wrong setup, but let's go back to the analysis that we've got set up here. And I can show you a couple of different things down here. This was done with preprocessed data. | |
Right, so the fit looks really nice, and I'll show you a comparison of those fits in a little bit. So we've got our partial least squares fits for the raw data and for the preprocessed data. And if I want to compare those, | |
I can see how those models work, right. What I'm showing here now is a comparison of the raw data on the left and the preprocessed data on the right, and we get a much better fit with the preprocessed data, so this goes back to what Jeremy was saying earlier. | |
We need to clean this data up to get the best predictive models that we can, okay. | |
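A small sketch of the same kind of calibration fit in scikit-learn; `gluten` is an assumed response vector, and the number of components would normally be chosen by cross validation:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

pls = PLSRegression(n_components=5)
# k-fold cross-validated R^2 for predicting composition from absorbances.
r2 = cross_val_score(pls, preprocessed, gluten, cv=5, scoring="r2")
print(r2.mean())
pls.fit(preprocessed, gluten)  # final model on all the data
```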
Another thing that I want to show you is...let's see where I'm at here. I want to show you the Functional Data Explorer. | |
And this is the newest way, at least in JMP, of modeling spectral data, because you can analyze the data on a continuum. | |
Alright, so I'm going to show you how to do this. But what this is showing is that, right now, if I just look at the raw data, we can analyze it with two functional principal components, just like we did with the principal component analysis, and we can get a pretty decent model. One thing you'll notice, though, is that this is a multivariate calibration method, but it's an inverse multivariate method. | |
Let's just step through this analysis: Analyze, then Specialized Modeling, and we're going to go to Functional Data Explorer. | |
Because I have the data in the wide format, I have to use rows as functions. | |
I'm going to use my starch and gluten as supplemental variables. Spectra is my ID, and I'm going to grab the raw data. | |
Okay. | |
And there's a lot you can do with the data now in Functional Data Explorer, as far as cleaning it up and doing some preprocessing, but as Jeremy said, that's only going to get better in JMP 17. | |
From here, we want to do some modeling, or build a model out of this. We want to fit all the data, so we're going to go to Models, Direct Functional PCA. | |
And if we look at this over here, it's saying that we could use five functional principal components, but that's almost a waste of effort, because with two, we're actually capturing essentially 100% of the variation, right. So I'm going to slide this back, clean this up a little bit, and then again move down. | |
And let's look at the score plot over here. | |
This is showing us the raw data in their subgroups, as we saw before, and all the other score plots. And the nice thing about this is, if I hover over this, I can pop these individual spectra out | |
and compare them just visually, okay. | |
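On a shared wavelength grid, direct functional PCA amounts to an SVD of the centered curves; a sketch, again assuming the wide-format `spectra` matrix:

```python
import numpy as np

Xc = spectra - spectra.mean(axis=0)            # center the curves
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
fpc_scores = U[:, :2] * s[:2]                  # scores on the first two FPCs
eigenfunctions = Vt[:2]                        # the functional principal components
explained = (s**2 / (s**2).sum())[:2]          # fraction of variation explained
```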
Further down here, we have the functional principal component profiler, and it's nice. We can look at what's going to happen to the curves as we change the functional principal components. But telling folks, you know, go change functional principal component 2 to get the curve that you want doesn't go over well. So a capability called functional DOE analysis has been added, okay. And with that, you can | |
look at how the curves change based on the components that I've added, the supplemental components, our mixture components. Let me add one thing here: I'm going to use Alter Linear Constraints to add a constraint and change this. | |
Alright, and say OK. And then as we look at our groupings here, as I go down in starch, the gluten is going to go up. So that constraint allows us to look at how the different groupings or mixtures compare and what's going to happen to the curves as we change the different levels of starch and gluten, okay. And again, that's on a continuum; what we've done in the data is very defined portions, so we can get an idea of what's going to happen even if the mixtures aren't exactly like the ones we were showing in the data, okay. | |
Alright, so that's with the raw data, and my next step is virtually the same thing. And I want to make sure I've got the right one here. This is done with the preprocessed data, so that's that same Functional Data Explorer analysis, | |
using the preprocessed data, excuse me, and you can see the differences already. | |
As the preprocessed data comes in and we fit the models, we get two functional principal components. And what we want to look at down here is the score plot. We can see again how the different groupings or mixtures of the gluten and starch are grouped so much more nicely here, alright. And then again, you can pop different ones out, so that's kind of looking at that. | |
The last thing I want to show or talk about is looking at something called generalized regression. | |
And if this will play nicely with me...this is the generalized regression fit of the raw data, right. And I've done this using an elastic net fit with k-fold cross validation. | |
I've set the elastic net alpha at 0.56 to get what I call a more pleasing-to-the-eye fit. What that does is allow different, or more, groups of the wavelengths to be added to the model fit; it's not just going to limit your overall model fit. | |
If I raised that elastic net alpha, I would get a fit that looks more like this. And now it's a much smaller, much simpler model, but you might be missing out on some of the additive effects that might help you get a better overall fit. So it's your choice: the simpler model will help you in the long run, | |
but if you need to find different groups, as with a multiple-component mixture, just know that the fit with the lower elastic net alpha will allow more factors, more variables, more wavelengths, to be part of the model, okay. | |
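A sketch of an analogous elastic net fit in scikit-learn. Note the naming: JMP's elastic net alpha corresponds to scikit-learn's `l1_ratio` (scikit-learn's `alpha` is the overall penalty strength, chosen here by cross validation):

```python
from sklearn.linear_model import ElasticNetCV

# Lower l1_ratio keeps more correlated wavelengths in the model;
# values near 1 approach the lasso and give a sparser fit.
enet = ElasticNetCV(l1_ratio=0.56, cv=5).fit(preprocessed, gluten)
print((enet.coef_ != 0).sum(), "wavelengths retained")
```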
And that is with the raw data. Let's see if this is going to give me...and then this is the fit with the fully preprocessed data, so it's virtually the same thing. | |
It's just that you're looking at a different overall output, right. You're looking at a different model; it looks way different. This is, again, using a different elastic net alpha, and | |
ultimately what we want to do is just compare these things. If I compare, again, the PLS raw data versus the preprocessed data, we get a much improved fit. And then here are the partial least squares fit for the preprocessed data, the preprocessed data using generalized regression at a smaller elastic net alpha, and the higher elastic net alpha. We're getting good fits all over the place, right, so it doesn't hurt to investigate using | |
generalized regression in JMP Pro versus your partial least squares. You actually get some really good models, and it's just a great way to analyze your data and build out predictive models for future comparison or when you get future data in. | |
With that, I'm going to say thank you. Jeremy, thank you very much for your part of this. And we hope you all enjoy this. Please give us feedback. We're looking for feedback. We want to know what people think of these newer capabilities. |