Hi, my name is Ryan Parker, and I'm excited to be here today
with Clay Barker to share with you some new tools that we've added to help you
analyze spectral data with the Functional Data Explorer in JMP Pro 17.
So what do we mean by spectral data?
We have a lot of applications
from chemometrics that have motivated a lot of this work.
But I would just start off by saying
we're really worried about data that have sharp peaks.
This may not necessarily be spectral data, but these are the type of data
that we've had a hard time modeling in FDE up to this point.
And so we really wanted to focus on trying
to open up these applications and make it a lot easier to handle these sharp peaks.
Maybe potential discontinuities,
or just these large, wavy features in the data.
In this specific example, with spectroscopy data,
we're thinking about the composition of materials,
and these peaks can represent those components,
and we want to be able to quantify those.
So another application is from mass spectrometry,
and here you can see these very sharp peaks.
They're all over the place in these different functions.
But these peaks represent proteins from these spectra,
and they can help you, for example,
compare samples from patients with cancer
and patients without cancer to understand differences.
I mean, again,
it's really important that we try to model these peaks well
so that we can quantify these differences.
An example that Clay is going to show comes from chromatography.
This is where, in this case,
we want to quantify the difference between olive oils
and other vegetable oils.
And so the components of these oils are represented by all of these peaks,
which, again, we need to model well.
The first thing I want to cover is four new tools to preprocess your spectral data
before you get to the modeling stage.
The first one is the Standard Normal Variate.
So with the Standard Normal Variate, we're thinking about
standardizing each function by their individual mean and variance.
So we're going to take every function one at a time,
subtract off the mean, divide by the standard deviation,
so that they all have mean zero and variance one.
This is an alternative to the existing Standardize tool,
which is just looking at a global mean and variance
so that the data themselves have been scaled,
but certain aspects, like individual means, are still there,
whereas with the Standard Normal Variate, we want to remove that for every function.
The next tool is Multiplicative Scatter Correction.
It is similar to Standard Normal Variate,
and the results end up being similar.
But in this case, we're thinking about data where we have light scatter.
So some of these spectral data come to us where we have to worry about
the light scatter of the individual functions being different
from a reference function that we're interested in.
Usually this is the mean.
So what we'll do is we will fit a simple model
between each individual function and this mean function.
We will get coefficients
so that we can subtract off the intercept,
divide by the slope,
and get to a similar standardization of the data,
in this case focused on this light scatter.
Okay, so at this point,
we're thinking about what if we have noise in our data?
What if we need to smooth it?
So the models that we want to fit for spectral data,
these wavelets, don't smooth the data for you.
So if you have noisy data, you really want to try to handle that first,
and that's where the Savitzky-Golay Filter comes in.
What this is going to do is fit n-degree polynomials
over a specified bandwidth
to try to find the best model that will smooth your data.
So we search over a grid for you,
pick the best one, and then present the results to you.
And I do want to note that the data are required to be on a regular grid,
but if you don't have one, FDE will create one for you.
We have a reduce option
that you can use if you want some finer control over this,
but by default,
we are going to look at the longest function,
choose that as our number of grid points,
and create a regular grid from that for you.
But the nice thing about the Savitzky-Golay Filter
is because of the construction with these polynomials,
we have easy access to the first or second derivative.
Even if you don't have spectral data and you want to access derivative functions,
this will be your pathway to do that.
And if you do request, say, the second derivative,
our search gets constrained to only polynomials
that will allow us to give you a second derivative, for example.
But this would be the way to access that,
even if you weren't even worried about smoothing.
You can now get to derivatives.
The last preprocessing tool I'll cover is Baseline Correction.
So in Baseline Correction,
you are worried about having some feature of your data
that you would just consider a baseline that you want to remove
before you model your data.
So the idea here is we want to fit a baseline model.
We have linear, quadratic, exponential options for you,
so we want to fit this to our data and then subtract it off.
But we know there are important features,
typically these peaks,
so we want to not use those parts of the data
when we actually fit these baseline models.
So we have the option here for Correction Region.
I think for the most part you would likely use Entire Function.
So what this just means is, what part are we going to subtract the baseline from?
So if you select Within Regions,
only things within these blue lines are going to be corrected.
But I've already added four here.
Every time you click add on baseline regions,
you're going to get a pair of these blue lines and you can drag them around
to the important parts of your data, and what this will do is,
when you try to fit, say, a linear baseline model,
it's going to ignore the data points that are within these two blue lines.
So, for function one, we fit a linear model,
but we exclude all these sharp peaks that we really want,
that we're interested in.
And so then we take the result from this linear model
and subtract it off from the whole function.
The alternative to that is an Anchor Point,
and that's if you say, I really would like
for this specific point to be included in the baseline model.
Usually this is if you have smaller data
and you know, okay, I want these few points.
These are key. These represent the baseline.
If I were able to fit, you know, say a quadratic model to these points,
that's what I want to subtract off.
So it's an alternative.
When you click those, they'll show up as red as an alternative to this blue.
But this will allow you to correct your data
and remove the baseline before proceeding.
So that gets us to how we model spectral data now in JMP Pro 17,
and we're using wavelets.
The nice thing about wavelets is we have a variety of options to choose from.
So these graphs represent what are called mother wavelets,
and they are used to construct
the basis that we use to model the data.
So the simplest is this Haar wavelet, which is really just step functions,
maybe hard to see that here, but these are just step functions.
But this biorthogonal one has a lot of little jumps,
and you can start to imagine,
okay, I can see why these make it a lot easier
to capture peaks in my data
than the Haar wavelet.
All these have parameters that are changing the shape and the size of these,
so I've just selected a couple here to just show you the differences.
But you can really see where,
okay, if I put a lot of these together,
I can understand why this is
probably a lot better to model all these peaks of my data.
And so here's an example to illustrate that
with one of our new sample data sets,
an NMR design of experiments.
So this is just one function; let's start with B-Splines.
This is sort of the go-to place to start for most data in FDE.
But we can see that it's really having a hard time
picking up on these peaks.
Now, we have provided you tools
to change where the knots are in these B-Spline models.
So you could do some customization
and probably fit this a lot better than the default.
But the idea is that now you've had to go and move things around
and maybe it works for some functions, but not others,
and you need a more automated way.
So one alternative to that is P-Splines.
That is doing a bit of that for you,
but it's still not quite capturing the peaks maybe as well as wavelets.
It's probably doing the best for these data
relative to wavelets, and then there's an almost model-free approach
where we model the data just directly on our shape components,
this Direct Functional PCA.
It's maybe a bridge between P-Splines and B-Splines,
where it's not quite as good as P-Splines but it's better than B-Splines.
But this is just a quick example to highlight
how wavelets can really be a lot quicker and more powerful.
What are we doing in FDE?
We construct a variety of wavelet types
and their associated parameters and fit them for you.
So similar to the Savitzky-Golay Filter,
we do require that the data are on a regular grid.
And good news, we'll create one for you.
But of course you can use Reduce to control that if you would like.
But the nice thing is that once the data are on a regular grid,
we can use a transformation that's super fast.
So P-Splines would also, for these data,
be what you would really want to use,
but they can take a long time to fit,
especially if you have a lot of functions
and a lot of data points per function.
But our wavelet models are essentially going to lasso
all of the different basis functions
that construct this particular wavelet type with its parameter,
and what that's going to allow us to do is really force
a lot of those coefficients that don't really mean anything to zero
to help create a sparse representation for the model.
So those five different wavelet types that I showed before,
those are available, and once we fit them all,
we're going to construct model selection criteria
and choose the best for you by default.
You can click through these as options to see other fits.
And a lot of times these first few are going to look very similar
and it's really just a matter of,
there are certain applications where they know,
"Okay, I really want this wavelet type,"
so they'll pick the best one of that type in the end to use.
So the nice thing about these models is they happen on a resolution.
They're modeling different resolutions of the data.
So we have these coefficient plots where at the top they're showing
low-frequency, larger scale trends, like an overall mean,
and as you go down in the...
Or I guess you go up in the resolutions,
but down in the plot,
you're going to look at high-frequency items,
and these are going to be things that are happening on very short scales,
so you can see where it's picking up a lot of different features.
In this case, it's taking...
A lot of these are zero for the highest resolutions.
So it's picking up some scales that are at the very end of this function.
It's picking up some of these differences here.
But this gives you a sense of kind of where things are happening at
both location and then is that high-frequency or low-frequency
parts of the data.
So the last thing we've added to complete the wavelet models
that's a little bit different from what we have now is called Wavelets DOE.
So if you've used FDE before,
you've likely tried Functional Design of Experiments,
and that is going to allow you to take functional principal component scores
and connect design of experiments factors to the shape of your functions.
But now, for wavelet models in particular,
their coefficients, because they do represent resolutions and locations,
can be more interpretable and have a more direct connection
to understanding what's happening in the data,
which a functional principal component
isn't always as easy to connect with.
We have this energy function, and it's standardized to show you,
"Okay, this resolution at 3.5,"
just representing more of this end point.
That's where the most differences are in all of our functions,
and it's representing about 12%.
So you can scroll down.
We go up to where we get 90% of this energy,
which is just the squared coefficient values
that we just standardized here.
But energy is just how big are these coefficients
relative to everything else?
But this, similar to Functional DOE,
you can change the factors, see how the shape changes,
and we have cases where both Wavelets DOE
and Functional DOE work well.
Sometimes Wavelets DOE just gets the structure better.
It doesn't allow for some maybe negative points
that Functional DOE might allow in this example.
So it's just they're both there, they're both fast.
I mean, you can use both of them to try to analyze the results of wavelet models.
But that's my quick overview.
So now I just want to turn it over to Clay to show you some examples
of using wavelet models with some example data sets.
Thanks, Ryan.
So, as Ryan showed earlier, we found an example
where folks were trying to use chemometrics methods
to classify different vegetable oils.
So I've got the data set opened up here.
Here, each row of the data set is a function,
so each row in the data set represents a particular oil,
and as we go across the table, that's the chromatogram.
So I've opened this up in FDE just to save a few minutes.
The wavelet fitting is really fast,
but I figured we'd just go ahead and start with the fit open.
So here's what our data set looks like.
You can see those red curves are olive oils.
The green curves are not olive oils.
So we can see there's definitely some differences
between the two different kinds of oils and their chromatograms.
So, as Ryan said, we just go to the red triangle menu
and we ask for wavelets and it will pick the best wavelet for you.
But like I said, I've already done that,
so we can scroll down and look at the fit.
Here we see the best wavelet that we found was called the Symlet 20,
and we've got graphs of each fit here summarized for you.
As you can see, the wavelets have fit these data really well.
But in this case,
we're not terribly interested in fitting the individual fits.
We want to see if we can use these individual chromatograms
to predict whether or not an oil is an olive oil.
So what we can do is,
we can save out these wavelet coefficients,
which gives us a big table here, and there are thousands of them.
In fact, there's one for every point in our function,
so here we've got 4,000 points in each function.
This table is pretty huge; there are 4,000 wavelet coefficients.
But as Ryan was saying, you can see that we've zeroed some of them out.
So these wavelet coefficients drop out of the function.
So that's how we get smoothing.
We fit our data really well,
but zeroing out some of those coefficients is what smooths the function out.
So how can we use these values to predict whether or not we have an olive oil?
Well, you can come here to the Function Summaries and ask for Save Summaries.
So what it's done is it saves out the functional principal components.
But here at the end of the table, it also saves out these wavelet coefficients.
So these are the exact same values
that we saw in that wavelet coefficient table in the platform.
So let me close this one.
I've got my own queued up just so that I don't have anything unexpected happen.
So here's my version of that table.
And what we want to do is we want to use
all of these wavelet coefficients to predict whether
the curve is from an olive oil or from a different type of oil.
So what I'm going to do is,
I'm going to launch the generalized regression platform,
and if you've ever used that before,
it's the place we go to build linear models and generalize linear models
using a variety of different variable selection techniques.
So here my response is type.
I want to predict what type of oil we're looking at
and I want to predict it using all of those wavelet coefficients.
So I press run.
In this case, I'm going to use the Elastic Net
because that happens to be my favorite variable selection method.
And I'm going to press go.
So really quickly, we took all those wavelet coefficients
and we have found the ones that really do a good job of differentiating
between olive oils and non-olive oils.
So in fact, if we look at the confusion matrix,
which is, this is a way to look at how often we predict properly, right?
So for all 49 other oils,
we correctly identified those as not olive oils.
And for all 71 olive oils, we correctly identified those as olive oils.
So we actually predicted these perfectly
and we only needed a pretty small subset of those wavelet coefficients.
So I didn't count, but that looks like about a dozen.
So we started with thousands of wavelet coefficients and we boiled it down
to just the 12 or so that were useful for predicting our response.
So what I think is really cool is,
we can interpret these wavelet coefficients to an extent, right?
So this coefficient here is resolution two at location 3001.
So that tells us there's something going on in that part of the curve
that helps us differentiate between olive oils and not olive oils.
So what I've done is,
I've also created a graph of our data using...
Well, you'll see.
So what I've done is, here the blue curve is the olive oils,
the red curve is the non-olive oils,
and this is the mean chromatogram,
so averaging over all of our samples.
And these dashed lines are the locations
where the wavelet coefficients are nonzero.
So these are the ones that are useful for discriminating between oils.
And as you can see, some of these non-zero coefficients line up with peaks
in the data that really tend to make sense, right?
So here's one of the non-zero coefficients, and you can tell it's right at a peak
where olive oil is peaking, but non-olive oils are not, right?
So that may be meaningful to someone
that studies chromatography and olive oils in particular.
But so we like this example because
it's a really good example of how wavelets fit these chromatograms really well.
And then we can use the wavelet coefficients to do something else, right?
So not only have we fit the curves really well,
but then we've taken the information from those curves
and we've done a really good job of discriminating between different oils.
And so I've got one more example.
Ryan and I are both big fans of Disney World,
so this is not a chromatogram.
This is not spectroscopy.
But instead we found a data set that looks at wait times at Disney World,
so we downloaded a subset of wait times
for a ride called the Seven Dwarfs Mine Train.
And if you've ever been to Disney World,
you know it's a really cool roller coaster right there in Fantasyland.
But it also tends to have really long wait times, right?
You spend a lot of time waiting your turn.
So we wanted to see if we could use wavelets to analyze these data
and then use the Wavelet DOE function to see if we can figure out
if there are days of the week or months or years
where wait times are particularly high or low.
So we can launch FDE.
Here you can see, we've got
each day in our data set, we have the wait times
from the time that the park opens, here,
to the time that the park closes over here.
And to make this demo a little bit easier,
we've finessed the data set to clean it up some.
So this is not exactly the data,
but I think some of the trends that we're going to see are still true.
So what I'm going to do is I'm going to ask for wavelets,
and it'll take a second to run, but not too long.
So now we've found that a different basis function is the best.
It's the Daubechies 20
and I apologize if I didn't pronounce that right.
I've been avoiding pronouncing that word in public,
but now that's not the case anymore.
So we've found that's our favorite wavelet and what we're going to do
is we're going to go to the Wavelet DOE analysis,
and it's going to use these supplementary variables
that we've specified, day of the week, year, and month,
to see if we can find trends in our curves using those variables.
So we'll ask for Wavelets DOE,
and what's happening in the background is we're modeling those wavelet coefficients
using the generalized regression platform, so that's happening behind the scenes,
and then it will put it all together in a Profiler for us.
So here we've got, you know, this is our time of day variable.
We can see that in the morning,
the wait times sort of start, you know, around an hour.
They get longer throughout the day, you know,
peaking at about 80 minutes, almost an hour and a half wait.
And then, as you would expect,
as the day goes on and kids get tired and go to bed,
wait times get a little bit shorter until the end of the day.
Now, what we thought was interesting
is looking at some of these effects, like year and month.
So we can see from 2015 the wait times sort of gradually went up, right, until 2020.
And then what happened in 2020? They increased in February,
and then, shockingly, they dropped quite a bit in March and April.
And I think we all know why that might have happened in 2020.
Because of COVID, fewer people were going to Disney World.
In fact, it was shut down for a while.
So you can very clearly see a COVID effect on Disney World wait times
really quickly using Wavelet DOE.
One of the other things that's interesting
is we can look at what time of year is best to go.
It looks like wait times tend to be lower in September,
and since Disney World is in Florida, you know,
that's peak hurricane season, and kids don't tend to be out of school.
So it's really cool to see that our model picked that up pretty easily, right?
So, but don't start going to Disney World in September.
That's our time. We don't want it getting crowded.
But yeah, so with just a few clicks, we were able
to learn quite a few things about wait times
for the Seven Dwarfs Mine Train at Disney World.
But we really wanted to highlight
that while these methods were focused on chromatography and spectrometry,
there are a lot of other applications where you can use wavelets,
and I think that's all we have.
So thank you. And thank you, Ryan.