An Introduction To Spectral Data Analysis With Functional Data Explorer in JMP® Pro 17 (2023-EU-30MP-1264)

Ryan Parker, Senior Research Statistician Developer, JMP
Clay Barker, Principal Research Statistician Developer, JMP

 

Since the Functional Data Explorer was introduced in JMP Pro 14, it has become a must-have tool to summarize and gain insights from shape features in sensor data. With the release of JMP Pro 17, we have added new tools that make working with spectral data easier. In particular, the new wavelets model is a fast alternative to existing models in FDE for spectral data. This presentation introduces these new tools and how to use them to analyze your data.

 

 

Hi, everyone. Thanks for coming to our video. My name is Ryan Parker, and today I'm presenting with Clay Barker some new tools we've added for analyzing spectral data with the Functional Data Explorer in JMP Pro 17. First, I want to start with some of the motivating data sets that led us to add these new tools. They're really motivated by chemometric applications, though they can definitely be applied to other areas. For example, we have this spectroscopy data, and the first thing you might notice is that we've got a lot of data points sampled, but also some very sharp peaks. That's going to be a recurring theme: we need to identify these sharp features, and the existing tools in JMP have a difficult time capturing them. For example, we're thinking about the composition of materials, or how we can detect biomarkers in data. These are three spectroscopic examples that we'll look at.

Another example of data that is of interest is mass spectrometry data. Here we're thinking about a mass-to-charge ratio that we can use to construct a spectrum, where the peaks represent proteins of interest in the area of application. One example is comparing these spectra between different patients, say a patient with cancer and a patient without, and the location of these proteins is very important for identifying differences between the two groups.

Another example is chromatography data. Here we're passing a chemical mixture over a material that helps us quantify the relative amounts of the various components in the mixture. By using the retention time in this process, we can try to identify the different components. For example, if you didn't know this (I was not aware until I started working with this data), trying to impersonate olive oil is a big deal. We can use these data sets to figure out what's a true olive oil, and what's just some other vegetable oil that someone might be trying to pass off as an olive oil.

The first thing I want to do is go through some of the new preprocessing options that we've added to help work with spectral data before we get to the modeling stage. We have a new tool called the standard normal variate; multiplicative scatter correction, for when you have light scatter in your data; the Savitzky–Golay filter, which is a smoothing step for spectral data that we'll get into; and finally, a new tool to perform a baseline correction, which removes trends in the data that you're not really interested in and want to get out first.

Okay, so what's the standard normal variate? Currently in JMP, we have the ability to standardize your data in FDE, but when you use that tool, it's just taking the mean of all of the functions and a global variance and scaling that way. With the standard normal variate, we use the individual mean and variance of each function to standardize and remove those effects before we go to analysis. On the right here, after performing the standard normal variate, we can see that there were some overall mean differences, and now the functions are all together, and any excess variance is taken out before we go to analysis.
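As a rough illustration of the idea outside JMP, here is a minimal Python sketch of the standard normal variate, assuming the spectra are stored as the rows of a matrix:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    means = spectra.mean(axis=1, keepdims=True)
    stds = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - means) / stds
```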

Multiplicative scatter correction is the next one, and it's an alternative to the standard normal variate. In many cases, when you use it, you may end up with similar results. The difference here is the motivation for using multiplicative scatter correction: you use it when you have light scatter, or you think you might have light scatter because of the way that you collected the data.

What happens is, for every function, we fit a simple linear model with a slope and an intercept, and we use those estimated coefficients to standardize the data we're going to work with. We subtract off the intercept and divide by the slope, and now we have the standardized version. Again, you can end up with results similar to the standard normal variate.
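Here is a minimal sketch of that correction in Python, assuming each spectrum is regressed on the mean spectrum as the reference (a common choice; how FDE picks the reference isn't shown in the talk):

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a
    reference (the mean spectrum by default), then subtract the fitted
    intercept and divide by the fitted slope."""
    X = np.asarray(spectra, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        slope, intercept = np.polyfit(ref, x, deg=1)  # x ≈ intercept + slope * ref
        corrected[i] = (x - intercept) / slope
    return corrected
```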

Now, the next preprocessing step I'm going to cover is the Savitzky–Golay filter. The new modeling tools we have for spectral data are developed in such a way that they try to pick up all the important pieces of the data, so if you have noise, we need a step that smooths it out first. That's where the Savitzky–Golay filter comes in. We fit an nth-degree polynomial over a specified bandwidth to help remove noise from the data. In FDE, we'll go ahead and select the best parameters for you, the degree and the width, to try to minimize the model error.

One thing I do want to point out is that we require a regular grid to perform this operation, which will come up again later, but FDE will create one for you. We also have the reduce grid option available if you want finer control before you rely on us making that choice for you. The nice thing about the Savitzky–Golay filter is that, because of the way the model is fit, we now have access to derivatives. Derivatives are something that had come up even before we worked on spectral data, and now we've got a nice way for you to access and model these derivative functions.
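For intuition, here is a small SciPy sketch of Savitzky–Golay smoothing on a regular grid; the window length and polynomial degree are hand-picked here, whereas FDE searches for them, and the same local polynomial fit also yields derivative estimates:

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy function sampled on a regular (equally spaced) grid.
x = np.linspace(0, 10, 501)
y = np.sin(x) + 0.05 * np.random.randn(x.size)

# Smooth with a cubic polynomial fit over a 21-point window.
smoothed = savgol_filter(y, window_length=21, polyorder=3)

# The same fit can return the first derivative of the smoothed function.
dy_dx = savgol_filter(y, window_length=21, polyorder=3, deriv=1, delta=x[1] - x[0])
```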

The last one I want to cover is the baseline correction. The idea behind baseline correction is that there might be overall trends in our data that we want to get rid of. The data set on the right has just very small, roughly linear differences between the functions. We don't really care about that; we want to get rid of it. This tool lets you select the baseline model that you want. In this case, it's just a really simple linear model, but you may have data with exponential or logarithmic trends that you want to remove, and we have those available. Then you can select your correction region.

For the most part, you're going to want to correct the entire function, but it may be that only the beginning or the end of the function is where you want to correct. We end up with these baseline regions, shown as pairs of blue lines: if we click the add button, it gives us a pair of blue lines, and we drag them around to the parts of the function that we believe are real signal. The peaks in these data are something we don't really want to touch; that's the part of the functions we want to keep and analyze, and it's what gives us the information we're interested in. Alternatively, if you select the within-region option, anything that's within these regions is what gets corrected. You're going to do one or the other: either leave those regions alone, or change only what's within them.

Finally, you don't see it here, but you can also add anchor points. Depending on your data, it may be easy to just specify a few points that you know describe the overall trend. When you click add, you'll get a red line, and wherever you drag that line, that point is definitely going to be included in the baseline model before the correction. When you click OK, you end up with a new data set that has the trend removed.
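As a rough sketch of the general technique (one common approach, not necessarily exactly what FDE does internally), you can fit the chosen baseline model using only the x ranges the analyst marks as trend rather than signal, then subtract that fit from the whole function:

```python
import numpy as np

def linear_baseline_correct(x, y, baseline_regions):
    """Fit a straight-line baseline using only points inside the chosen
    regions (assumed to contain trend, not peaks), then subtract that
    baseline from the entire function."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = np.zeros(x.shape, dtype=bool)
    for lo, hi in baseline_regions:
        mask |= (x >= lo) & (x <= hi)
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    return y - (slope * x + intercept)

# Example: treat the flat ends of the function as baseline regions.
# corrected = linear_baseline_correct(x, y, [(0.0, 2.0), (8.0, 10.0)])
```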

Okay, so that brings us to the modeling stage. What we've added for JMP Pro 17 are wavelet models. So what are wavelet models? They are basis function models, but unlike anything we currently have in JMP, the basis functions can have very dramatic features. Those features help us pick up the sharp peaks or the large changes in the function. We also have the simple Haar wavelet, which is just a step function, so if it turns out that something really simple like a step function fits best, we will give you that as well. You can see we have a few different options. If you think about bending these wavelets and stretching them out, that's how we model the data to pick up all these features of interest.
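If you want to see what these basis functions look like for yourself, PyWavelets can approximate and plot them; this is just for intuition and is separate from how FDE fits the model:

```python
import matplotlib.pyplot as plt
import pywt

# Approximate the scaling function (phi) and wavelet function (psi) for a
# Daubechies-4 wavelet; other families (Symlets, Coiflets, Haar) work the same way.
phi, psi, x = pywt.Wavelet("db4").wavefun(level=8)

plt.plot(x, psi, label="wavelet (psi)")
plt.plot(x, phi, label="scaling (phi)")
plt.legend()
plt.show()
```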

To motivate that, I want to show you the current go-to in JMP, which is the B-spline model; it has a very difficult time picking up on these features without any hand-tuning. The P-spline model does a little bit better: it still has some issues picking up the peaks, but among the existing models it might in some ways be the best. Direct functional PCA does almost as well as P-splines, but not quite. Then we have wavelets, which really pick up the peaks the best. In this particular data set, they don't fit the peaks perfectly, but looking at the diagnostics, the wavelet model is definitely the one we would want to go with.

Again, we have these five different wavelet model types, and what we're going to do is fit all of these for you so that you don't have to worry about picking and choosing. Outside of the Haar wavelets, all of the other wavelet types have a parameter. We have a grid that we are going to search over for you in addition to the type.

Now, there may be cases where users say, hey, this particular wavelet type is exactly how my data should be represented, so you can change the default model. But by default, we pick the model that optimizes a model selection criterion, the AICc. Really, what you can think about here is that there could be a lot of parameters in each of these wavelet models, and we're effectively using a Lasso model to remove any parameters that aren't really helping fit the data, so we get a sparse representation no matter the wavelet type. We saw earlier that we have to have the data on a regular grid; it's the same thing with wavelets. If you start fitting the wavelet models and your data are not on a grid, we'll create one for you. But again, you can use that reduce grid option for finer control.
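To give a feel for the sparse-wavelet idea outside JMP, here is a rough PyWavelets sketch, assuming data on a regular grid; it soft-thresholds the detail coefficients (the same kind of shrinkage a lasso penalty produces), with a hand-picked threshold rather than the criterion-based selection FDE performs:

```python
import numpy as np
import pywt

def sparse_wavelet_fit(y, wavelet="sym20", threshold=0.1):
    """Decompose y into wavelet coefficients, soft-threshold the detail
    coefficients so that small ones become exactly zero, and reconstruct
    a smoother, sparsely parameterized version of the function."""
    coeffs = pywt.wavedec(y, wavelet)
    shrunk = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(shrunk, wavelet)[: len(y)]
```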

Okay, so something else we show that can give you some insight into how these models work is the coefficient plot. The X axis is the usual input space of your function, and the Y axis is the resolution. At the top resolutions, you're thinking about overall means; as we get into the high resolutions, these are things happening really close together. A red line means a negative coefficient, blue means positive, and they're scaled so that they're all interpretable against each other. The largest lines show you where the largest coefficients are. Here we can see that the higher-frequency items are really at the end of the function, and we have some overall trends. The point is that these wavelet models are looking at different resolutions of your data.

Something else we've added, before we get to our demo with Clay, is wavelet DOE. In FDE, we have functional DOE, which works with functional principal components. If you don't know what those are, that's okay; all you need to know is that with wavelets, we have coefficients for all of these wavelet functions, and in this DOE analysis we model the coefficients directly. The resolution gives you an idea of whether it's a high-frequency or low-frequency item, and the number in brackets tells you the location. You can see that these items here are in the threes, and that's where some of the largest features were in the coefficient plot we saw. Those have what we're calling the highest energy.

Energy in this case is just this: if we square all the coefficients and add them up, you can think of that as the total energy. So the energy number here is a relative energy, giving you an idea of how much of the energy in the data each coefficient explains. The nice thing about using the coefficient approach is that the coefficients have a direct interpretation in terms of location and resolution. It's an alternative you can try and compare against functional PCA and functional DOE when you want this interpretability of the coefficients. Now I'll hand it over to Clay. He's got a demo for you to see how to use these models in JMP Pro.
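(A quick aside before the demo: a one-line sketch of that relative-energy idea, assuming energy is defined as each squared coefficient divided by the total sum of squared coefficients, which matches the description above.)

```python
import numpy as np

def relative_energy(coefficients):
    """Each coefficient's squared value as a share of the total
    sum of squared coefficients."""
    c = np.asarray(coefficients, dtype=float)
    return c**2 / np.sum(c**2)
```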

Thanks, Ryan. Let's take a look at an example that we found. Ryan mentioned briefly the olive oil data set. It's a sample of 120 different oils; most of them are olive oils, and some of them are blends or vegetable oils. What we wanted to see is, can we use this high-performance liquid chromatography data to classify the oil? Can we look at the spectrum and say this is an olive oil or this is not an olive oil?

These data came out of a study from a university in Spain, and Ryan and I learned a lot about olive oil in the process. For example, olive oil is actually a fruit juice, which I did not know. Let's take a look at our data. Each row in our data set is a different olive oil or other oil, and each row contains that oil's spectrum. We'll use the Functional Data Explorer, and it'll take just a second to fit the wavelet models. You'll see here that we fit our different wavelets. As Ryan mentioned earlier, we try a handful of different wavelets and give you the best one.

In this case, the Symlet 20 was the best wavelet in terms of how well it fits our data. Where we've overlaid the fitted wavelets with the data, we can see that this wavelet model fits really well. If you had a preferred wavelet function that you wanted to use instead, you can always click around in this report and it'll update which wavelet we're using. If we wanted the Symlet 10 instead, all you have to do is click on that row in the table, and we'll switch to the Symlet 10. Let's go back to the 20, and we'll take a look at our coefficients.

In the wavelet report, we have this table of wavelet coefficients. As Ryan was saying earlier, these give us information about where the peaks are in the data. The father wavelet we think of like an intercept, so that's like an overall mean. Then every one of these wavelet coefficients with a resolution lines up with a different part of the function. Resolution one is the lowest-frequency resolution, and it goes all the way up to resolution 12; those are much higher-frequency resolutions.

As you can see, we've zeroed a lot of these out. In fact, this whole block of wavelet coefficients is zeroed out. That just goes to show that we're smoothing: if we used all of these resolutions, we would recreate the function perfectly, but zeroing them out gives us a much smoother function. We fit the wavelet model to our spectra and we think we have a good model. Now let's take these coefficients and use them to predict whether or not an oil is olive oil. I've got that in a different data set.

Now I've imported all of those wavelet coefficients into a new data set and combined them with what type of oil it is, either olive oil or other, and we've got all of these wavelet coefficients that we're going to use to predict that. The way we do that is with the generalized regression platform. We model the type using all of our different wavelet coefficients. Since it's a binary response, we choose the binomial distribution, and we're interested in modeling the probability that an oil is olive oil. Because we don't want to use all of those wavelet coefficients, we use the Lasso to do variable selection.
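As a loose analogue outside JMP (this is not the Generalized Regression platform, just a sketch of the same idea with made-up stand-in data), an L1-penalized logistic regression selects a small subset of wavelet coefficients for a binary response:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Stand-ins for the real demo data: X holds wavelet coefficients exported
# from FDE (one row per oil); y is 1 for olive oil and 0 for other.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 200))
y = rng.integers(0, 2, size=120)

# The L1 (lasso) penalty zeroes out coefficients the model doesn't need.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
print(confusion_matrix(y, model.predict(X)))
```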

Now we've used the Lasso and we've got a model with just 14 parameters. Of all the wavelet coefficients we considered for the model, we only really needed 14 of them; we can see that we've zeroed out a lot of those wavelet coefficients. Let's take a look at the confusion matrix. Using our model, we actually perfectly predicted whether each of these oils is an olive oil or something else. That's pretty good. We took our wavelet coefficients and selected the 13 most important, because one of those 14 parameters is the intercept: we only needed 13 wavelet coefficients to predict which oil we had.

In fact, we can take a look at where those wavelet coefficients fall on our function. What we have here is the average olive oil spectrum in blue and the other oils in red, and each of those dashed lines lines up with a coefficient that we used. Some of these really make a lot of sense. For example, here's one of the wavelet coefficients that is important, and you can see that there's a big difference between the olive oil trace and the other oils there.

Likewise, over here, we can see that there's a big difference between the two. You can look through and see that a lot of these locations really do make sense, so it makes sense that we can use those parts of the curve to discriminate between the different types of oil. We just thought that was a really cool example of using wavelets to predict something else. Not that olive oil isn't fun, but Ryan and I both have young kids, and we're both big fans of Disney World.

We also found a Disney World data set where someone had recorded wait times for one of the popular rides at Disney World: the Seven Dwarfs Mine Train, a roller coaster. Someone had recorded wait times throughout the day for several years' worth of data. I should also mention that we used a subset of the data. One of the problems is that the parks are open for different amounts of time each day, and some of the observations are missing, so we subset it down to a more manageable data set. I would say this example is inspired by real data, but it's not exactly real data once we massaged it a little bit.

If we graph our data, the horizontal axis is the time of day and the vertical axis is the wait time. In the middle of the day, the wait time for this ride tends to be the highest. We can look at different days of the week: Sunday and Monday are a little more busy, Tuesday is a little less busy, and Saturday is the most busy. We can do the same thing looking at the years, 2015, 2016, 2017. It looks like every year the wait times get longer and longer until something happens in 2021; I think we all know why wait times at an amusement park would be lower in 2021. So we've got the idea that you can use this information, like day of the week, year, and month, to predict what that wait time curve will look like. Let's see how we do that in FDE.

I'll just run my script here. We come to the menu and ask to fit our wavelet model. It takes just a second, but really not that long to fit several years' worth of data. This time we're not using the Symlet anymore; we're using a Daubechies wavelet function. Ryan mentioned the wavelet DOE feature earlier. What I didn't show is that we've also loaded the day of the week, the year, and the month into FDE, and we're going to use those variables to predict the wavelet coefficients. Let's go to the red triangle menu and ask for wavelet DOE.

Now, what is happening behind the scenes is that we use day of the week, month, and year to predict those wavelet coefficients, and then we put it all back together so that we can see how the predicted wait time changes as a function of those supplementary variables. Of course, we summarize it in a nice profiler. We can really quickly see the effect of month; if we're just going by the average wait time for this particular ride, September tends to have the lowest wait time. We can also really quickly see the COVID effect: the wait times were here in 2019, and then when we go forward to 2020, they really drop. You can look around to see which days of the week tend to be less busy and which months are less busy. It's really a cool way to look at how these wait times change as a function of different factors. Thank you for watching. That's all we have for today, and we hope you'll give the wavelet features in FDE a try. Thanks.