Analyzing spectral data is a bit like trying to decode a rainbow -- it’s beautiful but full of tricky surprises.  Spectral data presents unique challenges due to its highly correlated nature, which renders many conventional techniques ineffective.

In this talk, we identify these challenges and explore advanced methods tailored for handling such data. Specifically, we dive into three powerful techniques: principal component analysis (PCA), partial least squares (PLS), and functional data analysis (FDA). By comparing these methods, we highlight their strengths, limitations, and practical applications, offering insights into choosing the best approach for analyzing highly correlated spectral data. We show you how to transform your data into a vibrant spectrum of success, even if it means you never look at a rainbow the same way again.

Thank you for coming to view our poster on Decoding Rainbows, the best and least confusing ways to analyze spectral data. This poster is based on a research paper that used near-infrared reflectance spectroscopy to analyze Arctic soil. We're going to walk through three different techniques for analyzing the spectral data: principal component analysis, partial least squares, and functional data analysis. Then we'll compare the techniques and share how each of them performed.

Here on the main page, we have our spectrum, the three different analysis techniques, and our final conclusions. First, let's take a look at the spectrum. Here, you can see some of the challenges of analyzing spectral data. Looking at our multivariate color map on correlations, we have very highly correlated data. This makes many of the techniques you would typically use for analysis impossible to apply, and it requires a different approach: some type of technique that handles highly correlated data well.

Over here, we have a look at what our spectrum looks like in the near-IR region. You can see which wavelengths are highly correlated, less correlated, or even negatively correlated with the two responses we're looking at: SOM, which stands for soil organic matter, and ergosterol. Different wavelengths are correlated to different degrees. The first technique we wanted to look at is principal component analysis. Thomas.
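To make that correlation structure concrete, here is a minimal Python sketch (ours, not part of the poster's JMP workflow) using hypothetical stand-in arrays for the spectra and the two responses:

```python
import numpy as np

# Hypothetical stand-ins: rows are soil samples, columns are NIR wavelengths.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(100, 700))   # real spectra would go here
som = rng.normal(size=100)              # soil organic matter response
ergosterol = rng.normal(size=100)       # ergosterol response

# Wavelength-to-wavelength correlations (the "color map on correlations").
# For real spectra, neighboring wavelengths correlate near +/-1, which is
# the multicollinearity that defeats ordinary regression techniques.
wavelength_corr = np.corrcoef(spectra, rowvar=False)

# Correlation of each wavelength with each response, mirroring the
# per-wavelength correlation profile shown next to the spectrum.
def corr_with(response):
    centered = spectra - spectra.mean(axis=0)
    r = response - response.mean()
    return centered.T @ r / (np.linalg.norm(centered, axis=0) * np.linalg.norm(r))

som_corr = corr_with(som)
ergosterol_corr = corr_with(ergosterol)
```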

Thank you, Peter. Yes, let me take you through a principal component analysis of these spectra. PCA is an exploratory method that is very well suited to looking at highly correlated, highly complex, and highly multivariate data, exactly as spectral data are. It's very often the first method you would jump into because it's very simple. You are actually exploring the data, which means you can find trends, you can find outliers, and you can navigate inside spectral regions to find out which parts are actually driving the interesting patterns the model outcome will show you.

In PCA, you break your raw data down into principal components. They are extracted based on a mathematical principle called orthogonality, which means they are not chemically derived, so the scores and loadings can sometimes be a little hard to interpret. But you get a very visual representation of scores and loadings, as you can see in the three plots on the slide. The top two plots are scatter plots of the principal components: the first two components on the left, and components 3 and 4 on the right.

In PCA, the components are extracted based on variance, and you can see that they decrease in the amount of explained variance of the raw data. Component 1 explains almost 87% of the variation. That's very common in spectral data, because the spectra have more or less the same shape throughout, with some tiny differences that carry all the chemical information. If we color the first score plot on the left according to soil organic matter, we can see that there is a gradient in the score plot, meaning we have actually captured soil organic matter somewhere in the spectra.
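As an illustration only, a PCA along these lines can be run with scikit-learn, continuing the toy arrays from the sketch above; the variable names are ours:

```python
from sklearn.decomposition import PCA

# PCA decomposes the centered spectra into scores and loadings: X = T P' + E.
# Four components are kept to match the four discussed on the poster.
pca = PCA(n_components=4)
scores = pca.fit_transform(spectra)        # T: one row of scores per sample

# Explained variance per component; on the poster's data, component 1
# alone explains almost 87% of the variation.
print(pca.explained_variance_ratio_)

# Loadings (P) across the wavelength axis: the larger their absolute
# value, the more a region drives the separation in the score plot.
loadings = pca.components_                 # shape (4, n_wavelengths)

# Coloring the PC1-vs-PC2 score plot by SOM exposes the gradient, e.g.:
#   plt.scatter(scores[:, 0], scores[:, 1], c=som)
```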

We can do the same for component 3 versus 4, and there we see a gradient in ergosterol, now somewhere in between components 3 and 4. The score plot alone doesn't tell us why the samples differ. For that, we need to go to the loading plot. If we plot the loadings across the wavelength axis, the higher or lower they are in absolute value, the more important those regions are for describing how the samples are differentiated in the corresponding score plot.

With PCA, you get a very good overview of whether you have groupings or clusters. Here it's not really easy to see outliers, for example, but we clearly see a gradient. That means we have reason to believe there is chemical information in our spectra. That's important, because very often we want to do more with our data than just explore it. We also want to build predictive models, as the next two techniques will show.

PCA is then a very, very good first choice for seeing whether there are indications that those predictive models are going to be a success or a failure. Just to mention one or two pluses and minuses: PCA is a very powerful technique for spectral data, it's very easy to understand, and it gives very good visual interpretations.

On the negative side, everything is extracted based on variance. If the variance is not related to chemistry, we could have a hard time finding out what really goes on inside our spectra. But in this case, as you can see, variance actually corresponded to chemistry. Let's dig into the next method, where we take a more focused approach. Bill.

Thank you, Thomas. The next technique we wanted to talk about is partial least squares analysis. Partial least squares is a very common technique for analyzing spectral data. As with principal component analysis, you learn a lot about the chemistry of what's going on and which wavelengths are the most important for analyzing your samples. If you look over on the left, in the output you compare something called van der Voet's T², along with the Q² value, and then you can look at your cumulative R² and cumulative Q² to determine the best overall number of latent factors to use.

In this case, the overall best model we made was with 14 latent factors, but as few as six latent factors would give you a very good model, and that's based on the black line you see running along the x-axis there. We then wanted to compare this to the results from the paper, which used partial least squares as well. We felt that overall JMP performed very well in the analysis of the spectral data. Again, you want to determine the optimal number of latent factors using something like… When I say latent factors, think principal components. Those are used to determine the best overall fit for highly correlated factors.
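One common way to pick the number of latent factors, sketched here with scikit-learn rather than JMP and continuing the toy arrays from earlier, is to track a cross-validated R² (standing in for the cumulative Q² in the report) as factors are added:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

# Cross-validated R^2 rises as latent factors are added, then flattens
# (or falls) once extra factors only fit noise -- the same pattern the
# cumulative Q^2 plot shows when weighing, say, 6 factors against 14.
for a in range(1, 15):
    pls = PLSRegression(n_components=a)
    q2 = cross_val_score(pls, spectra, som, cv=10, scoring="r2").mean()
    print(f"{a:2d} latent factors: CV R^2 = {q2:.3f}")
```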

Then over on the right, you can see the results we got. The first graph at the top shows the regression coefficient vector curves, which give you an indication of where the most important wavelengths were for each material. The blue line, buried in there, is the data from the original spectra. The green and red lines are the regression coefficient curves determined for SOM and for ergosterol.
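For reference, once a factor count is chosen, coefficient curves like these come straight out of a fitted model; a small sketch continuing the example above:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Fit both responses at once and pull the regression coefficient vector:
# one coefficient per wavelength per response. Peaks in either direction
# mark the wavelengths that matter most for SOM and for ergosterol.
Y = np.column_stack([som, ergosterol])
pls = PLSRegression(n_components=6).fit(spectra, Y)
coef_curves = pls.coef_   # plot against wavelength to get the curves
```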

Down on the lower right is the comparison of the data we got out, the actual-by-predicted plots, and you can see from that that JMP performed very well overall for both ergosterol and SOM. Some pluses for PLS: it makes very good predictive models. And unlike PCA, which gives you no information back about the overall y because it's an unsupervised method, PLS does use the response.

PLS is a supervised method, and you get some understanding from that: you get to understand the y itself, what you're trying to predict. It makes good predictive models, it's widely used, and it's really good for analyzing spectral data. On the minus side, the models themselves can be very computationally intensive. When you build a model with PLS, you use all of the data, though you can do some variable reduction at some point. Finally, the resulting models can be, as in this case, very large. I'll turn it over to Pete to talk about FDA.

Thanks, Bill. All right. The last technique we looked at was functional data analysis. It's a similar technique to the first two you heard about. The basic idea with functional data analysis is that we determine the average shape, in this case the average spectral shape, which you can see as the red line in this graph, also displayed down here.

That is the average shape of all the spectral data. We then look at differences from that average shape and extract them in what we call shape functions. You can think of these shape functions as functional principal components. Just like the variance-based principal components Thomas explained, this is the same idea, but instead of explaining variance in the data, we're explaining variance in the shape.

Here you can see that with this model, we have a mean shape and then four shapes that explain the differences from that mean. Just as in Thomas's PCA, the first component explains almost 87%, and it goes down from there. Really similar numbers to what we were seeing with PCA. One of the benefits, which Bill talked about as well, is that with FDA, or functional data analysis, we have the ability to include a response, a y, that we're trying to figure out. That's what we would call supervised learning.
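In spirit (and only in spirit; JMP fits a basis-function surrogate first), the mean shape and shape components can be illustrated with a plain SVD of the mean-centered curves, again reusing the toy arrays:

```python
import numpy as np

# Mean shape = average spectrum; shape functions = leading right singular
# vectors of the centered curves (a raw-grid stand-in for functional PCs).
mean_shape = spectra.mean(axis=0)
centered = spectra - mean_shape
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

shape_functions = Vt[:4]                    # four shapes beyond the mean
fpc_scores = centered @ shape_functions.T   # per-sample shape scores
explained = s**2 / np.sum(s**2)             # poster reports ~87% for the first
print(explained[:4])
```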

Here you can see our two responses and what our spectrum would look like at given levels of each response. Down here we have our model results: we extract those shape components and then build a model to determine our two responses, SOM and ergosterol. The pluses of this technique are that we were able to create a highly predictive model, not quite as good as PLS in this case, and we get this nice visualization, making it a very easy-to-interpret technique.
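The supervised step then amounts to modeling the responses from those shape-component scores; a minimal sketch, with a plain linear model standing in for whatever JMP fits internally:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regress SOM and ergosterol on the shape-component scores extracted above.
Y = np.column_stack([som, ergosterol])
fda_model = LinearRegression().fit(fpc_scores, Y)
predicted = fda_model.predict(fpc_scores)   # columns: SOM, ergosterol
```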

You can see what to expect your spectrum to look like at different levels of your response. One of the minuses is that this is not as widely used a technique: PLS is widely used, PCA is widely used, but FDA is less documented and less understood. It takes a little explanation to understand what's going on, and it requires a surrogate model, an intermediate model used to generate the shape components. Something like a wavelet or a P-spline is required to actually fit the data and extract the shape components.
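As a concrete, simplified picture of that surrogate step, here is a smoothing-spline fit to a single spectrum with SciPy; the wavelength grid is hypothetical, and JMP's actual surrogate would be a wavelet or P-spline model fitted to every curve:

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Fit a smooth cubic B-spline to one spectrum; the shape components are
# then extracted from such smoothed curves rather than the raw grid.
wavelengths = np.linspace(1000, 2500, spectra.shape[1])  # nm, hypothetical
tck = splrep(wavelengths, spectra[0], s=1.0)             # smoothing spline
smooth_curve = splev(wavelengths, tck)                   # evaluated fit
```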

Those surrogate models can be computationally intensive and difficult to understand. I wanted to go back to our head slide to show our results. The paper, as Bill mentioned, used the PLS technique, and here are the results we got: our root mean square error and our R². The paper's models were fairly good, but as you can see, even with PCA we were able to get similar results. PLS seemed to have the best results in this case. Functional data analysis was in between the two, but it has that nice interpretability.

Thank you. Thomas, Bill, any last thoughts?

Not from me. Thank you, Peter.

No. Thank you, Peter and Bill.

Thank you, Thomas.
