Using Functional Data Analysis: A Case Study
Many instruments and sensors generate data over a continuum. Analyzing and understanding this “curve” data can be difficult at best, especially when you are trying to analyze data over the whole continuum. Many chemometric techniques are options, but for analyzing curve data, they are basically analyzing the data point by point instead of the whole curve or group of curves at once.
Functional data analysis is one of the newer methods that has come to the forefront for analyzing curve data. This presentation uses case studies on chromatographic and spectral data that show the utility of analyzing data over the continuum in conjunction with more traditional chemometric methods. Some of the newer chemometric analysis techniques in JMP Pro 18 Functional Data Explorer are highlighted.
Hello everyone. My name is Bill Worley, and today we're going to be talking about Functional Data Analysis, more or less as a case study in using FDA to show how it's used, where it's used, or where we might be able to use it. I want to get my thank-yous out of the way early. There are contributions from a lot of people, and they're all listed there. I would like to call out Tom Donnelly and Chris Gotwalt for sure, and Pete Hersh also, folks who really helped make this presentation happen.
A little bit of backstory: not only are we looking at Functional Data Analysis, but we want to blend in something called process analytical technology (PAT). PAT is used in the pharma industry quite a bit as part of quality by design, and it helps reduce cycle times by using online, in-line and out line measurements. It's all like a continuous process.
All of this helps prevent rejects, scrap, and any reprocessing that might be needed. It also enables real-time release testing. PAT with Functional Data Analysis is, and should be, applicable to any industry. You'll see more about how it blends in, or how it can blend in, as we go on with the presentation.
Just another idea of what functional data is. We want to explore curves and shapes and extract curve features for further analysis, instead of looking at data point by point. A lot of our data comes in a continuum, and we need to capture the complex curve shapes in some form of numerical values. Then we want to analyze relationships between curve shapes and other variables.
Where do we find functional data? Like I said before, a lot of data is inherently functional in form: time series data, sensor streams from manufacturing processes, spectra, and measurements taken over a range of temperatures. JMP Pro makes it easy to solve many of these kinds of problems with functional data.
With Functional Data Analysis, you use functional principal components to break data into FPC scores and shape components, in a dimension reduction that is closely related to classical principal component analysis. FPC scores, or weights, are scalar quantities that explain function-to-function variation. Shape components explain the longitudinal variation that you see over time, distance, wavelength, things along those lines. We fit models with the FPC scores, cluster them, and graph them just like any other continuous data.
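To make the split into scores and shape components concrete, here is a minimal Python sketch. It is not how JMP Pro computes FPCs (Functional Data Explorer first fits a basis such as a spline or wavelet); it simply assumes curves sampled on a common grid and uses a plain SVD. The synthetic curves and all variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic curves (assumed for illustration): 6 curves sampled at 50 points.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
true_scores = rng.normal(size=(6, 2))
true_shapes = np.vstack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
curves = 1.0 + true_scores @ true_shapes

# Center on the mean curve, then decompose the centered curves with an SVD.
mean_curve = curves.mean(axis=0)
u, s, vt = np.linalg.svd(curves - mean_curve, full_matrices=False)

n_fpc = 2
shape_components = vt[:n_fpc]           # one shape function per row, defined along x
fpc_scores = u[:, :n_fpc] * s[:n_fpc]   # one scalar weight per curve per component

# Each curve is approximately: mean curve + sum(score * shape component).
reconstructed = mean_curve + fpc_scores @ shape_components
explained = (s[:n_fpc] ** 2).sum() / (s ** 2).sum()
```

The scores in `fpc_scores` are ordinary columns of numbers, which is why they can be modeled, clustered, and graphed like any other continuous data.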
For functional data, we fit the FPC scores as functions of the DOE factors, and we use the FPC scores times the shape components as intermediate formulas. Then we take the modeled FPC scores times the shape components as the final prediction formula.
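Those two steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not JMP Pro's implementation: the spectra are noise-free linear blends of three made-up Gaussian peaks, the scores are modeled with a linear mixture model without an intercept, and every name (`peak`, `predict_spectrum`, and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)

def peak(center):
    # Hypothetical pure-component spectrum: a single Gaussian peak.
    return np.exp(-0.5 * ((x - center) / 0.3) ** 2)

pure = np.vstack([peak(2.0), peak(5.0), peak(8.0)])
props = rng.dirichlet(np.ones(3), size=12)   # mixture proportions, each row sums to 1
spectra = props @ pure                       # assumed noise-free linear blends

# Functional principal components of the spectra (discretized, via SVD).
mean_curve = spectra.mean(axis=0)
u, s, vt = np.linalg.svd(spectra - mean_curve, full_matrices=False)
shapes = vt[:2]
scores = u[:, :2] * s[:2]

# Step 1: fit each FPC score as a function of the DOE factors
# (linear mixture model, no intercept because proportions sum to 1).
coef, *_ = np.linalg.lstsq(props, scores, rcond=None)

# Step 2: prediction formula = mean curve + modeled scores times shape components.
def predict_spectrum(p):
    modeled_scores = np.asarray(p) @ coef
    return mean_curve + modeled_scores @ shapes

pred_pure_first = predict_spectrum([1.0, 0.0, 0.0])
```

In this idealized setup the predicted spectrum at 100% of the first component reproduces its pure spectrum; with real, noisy spectra the agreement would only be approximate.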
How do we deal with functional data? Well, I kind of explained that before. But with JMP Pro you can convert a group of discrete points into a set of functional principal components in just a few clicks.
By fitting a spline, a wavelet, or a direct model to each set of functional data, and describing the curves with FPCs, you quickly gain new insights and understanding about your process or product. Below we're showing a set of curves and the model that was used to fit that data. Then here we're showing the functional principal components by batch. This is a spline fit in this case.
But the crux of the talk today is analyzing, or reanalyzing, some NMR spectral data for a three-alcohol mixture using Functional Data Analysis. This was the original paper where the data comes from. I just want to give the authors kudos for putting this out. We were able to take that data and reanalyze it.
In the original data, you have the quantitative analysis of NMR spectra with chemometrics. This is an overlay of the NMR spectra of the three-alcohol blend. We've got pentanol, butanol and propanol. We want to be able to break those three apart with our analysis and then be able to predict any mixtures that are made out of them.
Just to give you a better idea, for spectral modeling there are all kinds of multivariate methods. Dimension reduction is important, and we can use principal components analysis, multidimensional scaling, or Functional Data Explorer to do that.
Clustering is always important too. It helps you better understand what groupings you have. Then, if you need to do an outlier analysis, you can use multivariate control charts. But if you're going to build predictive models from this data, you're going to use something like partial least squares, penalized regression, Functional Data Analysis, bootstrap forests, and a couple of others, such as neural networks and support vector machines. You've got a lot of options.
But I want to do a direct comparison of a few of these methods. Let's start with principal component analysis. If you look at the chart here, we've got points at 8, 20 and 18. Those are our pure propanol, butanol and pentanol.
We're getting kind of a ternary plot there, but it's taking a lot of principal components to explain at least 85% of the variation. We're looking at about 18 principal components. The data is still highly correlated. What you could do with this is save the principal components back to your data table and do a principal component regression. That's a bit of a long way to get to an answer, but it's doable.
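For readers who want to see what "count the components, save them, then regress" looks like outside JMP, here is a rough Python sketch of principal component regression. The data are synthetic stand-ins (23 rows to mirror the number of spectra, but made-up values), so the number of components it reports has nothing to do with the 18 seen in the talk. All names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
props = rng.dirichlet(np.ones(3), size=23)               # hypothetical mixture proportions
pure = rng.random((3, 100))                              # hypothetical pure spectra
X = props @ pure + 0.05 * rng.normal(size=(23, 100))     # noisy stand-in spectra

# Principal components of the centered spectra.
x_mean = X.mean(axis=0)
u, s, vt = np.linalg.svd(X - x_mean, full_matrices=False)
cum = np.cumsum(s ** 2) / np.sum(s ** 2)                 # cumulative variation explained

# Smallest number of components that reaches 85% of the variation.
k = int(np.searchsorted(cum, 0.85)) + 1

# "Save the components back" and regress the responses on them.
pc_scores = u[:, :k] * s[:k]
design = np.column_stack([np.ones(len(X)), pc_scores])
beta, *_ = np.linalg.lstsq(design, props, rcond=None)
fitted = design @ beta
```

The two-stage nature is visible here: the components are chosen without looking at the responses, and the regression is a separate step afterward.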
Another nonfunctional platform that we can look at is partial least squares. In this case, we can actually fit a very good model, at least as far as the butanol, pentanol and propanol are concerned, with five latent factors instead of 18 principal components.
Principal components and latent factors play a similar role, but with partial least squares we at least get to explain the Y, or get the variation in the Y, whereas we get none of that with principal components.
Here, let's look at the percent variation explained. With the smaller model with five factors, we're only explaining about 28% of the variation in the X, but we are able to explain a little over 98% of the variation in the Y really quickly, with two latent factors.
Then this graph up here is showing you where we could do some variable reduction. We've got the propanol, butanol and pentanol, colored in, with each wavelength in there. This line right here at 0.8 is the rule-of-thumb cutoff, where you could actually do some variable reduction and make this model a lot simpler if we wanted to. But again, this is a nonfunctional form. You're looking at data point by point, and on top of everything else, these models are going to be really large.
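The percent of Y variation explained and the 0.8 cutoff (applied to the variable importance, or VIP, values shown later in the demo) can be illustrated with a small Python sketch. This is a single-response NIPALS PLS written from scratch on synthetic data, assumed here for illustration; JMP's PLS platform handles multiple responses and may use a different algorithm. The function name `pls1_nipals` is hypothetical.

```python
import numpy as np

def pls1_nipals(X, y, n_factors):
    """Single-response PLS via NIPALS; returns weights, fraction of Y explained, and VIP."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    ss_y_total = yk @ yk
    weights, ss_y = [], []
    for _ in range(n_factors):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)        # unit-length weight vector
        t = Xk @ w                    # latent factor scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # Y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        weights.append(w)
        ss_y.append(c * c * tt)       # Y sum of squares captured by this factor
    W = np.column_stack(weights)
    ss_y = np.array(ss_y)
    y_explained = ss_y.sum() / ss_y_total
    vip = np.sqrt(X.shape[1] * (W ** 2 @ ss_y) / ss_y.sum())
    return W, y_explained, vip

rng = np.random.default_rng(3)
props = rng.dirichlet(np.ones(3), size=23)              # hypothetical proportions
pure = rng.random((3, 60))                              # hypothetical pure spectra
X = props @ pure + 0.01 * rng.normal(size=(23, 60))
y = props[:, 0]

W, y_explained, vip = pls1_nipals(X, y, n_factors=2)
keep = vip >= 0.8     # rule-of-thumb cutoff mentioned in the talk
```

Columns where `keep` is false are the candidates for removal when simplifying the model. By construction the squared VIP values average to one, which is why a cutoff somewhat below one is a common convention.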
But then we can look at Functional Data Analysis with wavelets. We're getting a really good fit with this data using the wavelet model. The best model is the one with the lowest AICc or BIC score. You can see up here we're explaining about 86%, almost the same amount of variation that we were seeing in principal components, but we're doing it with five fewer components, functional principal components in this case.
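As a small aside on "lowest AICc or BIC", here is a Python sketch of the standard formulas for the two criteria. The candidate models and their numbers are made up for illustration and are not taken from the talk's data.

```python
import math

def aicc(neg2_loglik, n_params, n_obs):
    # AIC plus the small-sample correction term.
    aic = neg2_loglik + 2 * n_params
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

def bic(neg2_loglik, n_params, n_obs):
    return neg2_loglik + n_params * math.log(n_obs)

# Hypothetical candidates: name -> (-2 log-likelihood, number of parameters).
candidates = {"simpler fit": (120.0, 4), "richer fit": (100.0, 12)}
n_obs = 40

by_aicc = min(candidates, key=lambda name: aicc(*candidates[name], n_obs))
by_bic = min(candidates, key=lambda name: bic(*candidates[name], n_obs))
```

Both criteria reward a better likelihood and penalize extra parameters, so in this made-up case the richer fit does not improve the likelihood enough to justify its eight additional parameters.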
That's already a simpler model. If we look at the actual by predicted plots, the wavelet function is doing a great job of explaining the variation that's going on here. Now, if we look over here at the score plot on the right, we've got almost a ternary plot, with our propanol, butanol and pentanol over here.
Pentanol is a little harder to explain. It's got some overlapping peaks with some other data, so it's showing up in the score plot as not easily discernible from the other ones, but we're still getting a reasonable outcome from that.
Another option that you have for analyzing mixtures is something called multivariate curve resolution. We're looking at the data overall here. In this multivariate curve resolution, our validation data is the pure propanol, butanol and pentanol. The rest of the data are mixtures of those three alcohols, and we're using them as the training data to see if we can predict the data of the pure alcohols.
We do okay, especially for propanol. The other two are maybe not so good, but we know which functional principal components are associated with which alcohol in this case.
So, functional principal component three is propanol, butanol is functional principal component two, and pentanol is functional principal component one. We're getting a pretty good outcome on this. If you look down here at the diagnostic plots, we're doing okay. But since we used the mixtures themselves to try to predict the pure materials, we're seeing kind of a ternary plot without those three pure points on it. I'll show you in a little bit what that looks like when we go the other way.
Okay, so what comes out of all this is that we want to try something called functional design of experiments, or functional DOE. In this alcohol blending example, functional principal component three is highly correlated with the proportion of propanol, so you get some interpretability out of that. Then, like I said before, FPC 1 is highly correlated with pentanol and FPC 2 with butanol. You don't always get this kind of correlation between the functional principal components and the factors.
In this case we got some, but we're not blending the FPCs, we're blending the alcohols. That's what we want to focus on, and this is why functional DOE is so important. We're going to fit these scores as functions of the components or factors. Then this model is one we can easily interact with and use in a practical manner to ultimately predict component or factor impact on spectra.
If you'll recall, we mentioned before that the FPC scores, or weights, are scalar quantities that explain function-to-function variation, and the shape components explain the longitudinal variation: differences over time, wavelength, and things like that.
For functional DOE, we fit the FPC scores as functions of the DOE factors, and then we go through that same math we talked about before. When all is said and done, we get to something called a functional DOE profiler. This is where we can compare what the spectra look like at different proportions of the butanol, propanol and pentanol.
Okay. We'll get to that in the demo, which should be coming up now. Let me escape out of here and slide this out of the way for a moment. All right, so I'm working out of a JMP project. This is the data. We've got 23 NMR spectra by batch. That's their ID right there. Then we have the different proportions of propanol, butanol and pentanol, and a thousand columns of wavelength data.
The data is very wide and very highly correlated. We have to take all of that into account when we're doing some of this analysis. The first thing we look at is regular principal component analysis, not functional principal component analysis.
Let me widen this out a little bit so you can see what's going on. If we look at the score plot here, the scatter plot, we can see again that the data is highly correlated. If we look at the different points, the stars and the diamond over here are our pure materials. That gives us an idea that we are still looking at somewhat of a ternary plot. We are looking at mixtures here.
But again, if we look over here, we have to go down to eigenvalue 18 to explain about 86% of the variation. That's still a lot of components that we'd have to save back to the data table and then use in a principal component regression.
If you remember, we talked about partial least squares. This is showing us that we can get a really good explanation of the Y variables, being able to predict the pentanol, butanol and propanol with five latent factors. Now, we're not doing a really good job of explaining what's going on in the X variation, but we can bump that up if we want to by adding more latent factors, just to get closer to the principal components model. But I think we're still saving ourselves a lot just by using this first model here.
If you look down here at the van der Voet T², this is telling us that we really don't need to go past five latent factors, because that is where the probability for the van der Voet T² hits one. Anything with a probability above 0.1 could be considered a good model, so we could actually go down to two, three, or four factors. But we'll go with the five for now.
All right. If we look at the root mean PRESS plot down here, again this is showing that we could use five factors. That's going to be our lowest root mean PRESS, but it looks like everything after that is pretty low too. This gives you an idea of what the fit looks like. This is really nice because the data is really tight on these score plots. That's another indication that we have a pretty good model.
We're explaining 98%, almost 99%, of the variation right here with just two latent factors. We can always use more, and if we get it to 100%, that's great. Then this last one is looking at the VIP again. We talked about being able to use this for variable selection. We could actually get rid of a bunch of this data and still make a really good model.
Those two, principal component analysis and partial least squares, are ways of looking at the data in a nonfunctional platform. If we move on to Functional Data Explorer, the first thing I want to show you is the data in a raw format. We just did the analysis without any data processing. We pulled the data in, and then we fit the model, going down here a little bit.
This is an unconstrained multivariate curve resolution model. The data could go negative in this case, but we're doing okay with what we have. If we look up here at this batch, that's our propanol. We're not doing great, but we're doing okay. We have no validation and no training set; we didn't use any of that for right now. But we do have the idea that we're going to be okay if we're looking at this data as it is.
If you look at the actual by predicted plots, they're maybe not so good. We've got quite a bit of scatter here, especially in the residual by predicted plot, so we may want to try to do a little bit better with this model. Even so, we still have our propanol, butanol and pentanol over here. It's kind of a ternary plot, not perfect, but if you use your imagination, it kind of makes a triangle.
So we're doing pretty well getting that analysis. Then we look at the wavelet analysis like we showed earlier, and this is just doing really well. If you want to analyze curve data like this, wavelets are really a great way to go.
The only problem with wavelets is that it's almost exactly like principal components. You have to save data back out and then do some sort of generalized regression on it. Let's see down here.
There are some scores here that you could get. Again, if you look at this data, we've got lots of functional principal components in there. But if you look at the score plot, we've got a pretty good triangle going on, a good ternary plot. We're feeling pretty good that we're analyzing the data really well.
Then the functional DOE profiler, which I showed you earlier, is now active. You can see it update as you move. If I go to 100% propanol, the other two go to zero, and you see what the curve looks like over here. This is also a nice way to understand what the curves look like or to compare curves: using the wavelet function, or any of these functions, with the functional DOE profiler is a good way to compare those curves.
Then last but not least down here is the multivariate curve resolution itself. This only allows positive data. If we look back up here, this hit right on the propanol. I mean, this was virtually perfect for propanol. The other two, which are down here, weren't quite as good. We're explaining 93% of the variation with the butanol, and for the pentanol it was about 83%. Those should be 100%. The model's okay, not great.
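To give a feel for what a positive-only curve resolution does, here is a rough Python sketch using alternating least squares with negative values clipped to zero after each step. This is a common textbook simplification and is assumed here purely for illustration; it is not a description of the algorithm in JMP Pro. The synthetic spectra and the function name `mcr_als` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 150)
# Hypothetical pure spectra: three Gaussian peaks.
pure = np.vstack([np.exp(-0.5 * ((x - c) / 0.4) ** 2) for c in (2.5, 5.0, 7.5)])
conc_true = rng.dirichlet(np.ones(3), size=20)
D = conc_true @ pure                  # 20 synthetic mixture spectra, all non-negative

def mcr_als(D, n_components, n_iter=200, seed=0):
    """Factor D ~= C @ S with non-negative C (concentrations) and S (spectra)."""
    init = np.random.default_rng(seed)
    S = init.random((n_components, D.shape[1]))   # random starting spectra
    for _ in range(n_iter):
        # Solve for concentrations with spectra fixed, then clip negatives.
        C = np.clip(np.linalg.lstsq(S.T, D.T, rcond=None)[0].T, 0.0, None)
        # Solve for spectra with concentrations fixed, then clip negatives.
        S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0], 0.0, None)
    return C, S

C, S = mcr_als(D, n_components=3)
rel_error = np.linalg.norm(D - C @ S) / np.linalg.norm(D)
```

Dropping the two `np.clip` calls gives the unconstrained version, where the resolved concentrations and spectra are free to go negative.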
Then we added some validation capability in here. This is where we're using the pure materials to try to predict the percentages of the alcohols in the mixtures. Again, we're looking at some differences here, and I'll show you the comparison in a little bit. But if we look at this data right here, this is our batch one, and it's not bad at all. The prediction we get is pretty close to what is actually there.
Laying that all aside, just remember that when we're looking at the actual by predicted plots, we still have a ways to go, but we're doing pretty well. Then we have the functional DOE profiler. Again, we can play what-if and compare curves with that.
Last but not least is where we use the pure materials as training, so we use those to predict what's going on in the other data. We did pretty well on this one too. Again, if we look at this first row, those percentages are just about right on. I think it was 0.8, 0.1 and 0.1. Not bad.
Let's go back to the PowerPoint and finish this up for you. Let me share again. The moral of the story here is, you've got all these options to analyze your data. Functional Data Explorer is a very powerful tool. As I was saying before, when we compare actuals by predicted, if we look across here, we can get this. Let's see if we can draw this. If you look across here, we did pretty good. Right.
Just with that model itself. Even the next one below is pretty good overall on the actual by predicted. The model is good. We need to improve on this a little bit, but this is really good, especially if you're making measurements online, in line or out line. The more data you get, the better your model is going to get.
Takeaways. Functional Data Analysis using wavelets can be combined with mixture design of experiments. It allows you to predict NMR and nearly any other spectra that you want, as well as chromatographic data, as a function of formulation components and proportions. You take that information, get the FDA-DOE model, and are then able to predict what the spectra are trying to tell you.
Then, using any of the aforementioned analytical chemistry techniques to measure an unknown blend as a target, together with the FDA-DOE mixture model, you can minimize the integrated error from the target to predict the proportions of the formulation components. We did that in the second one.
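The idea of minimizing the integrated error from a target can be sketched in Python with a simple grid search over mixture proportions. Everything here is an assumption for illustration: `predict_spectrum` stands in for a fitted FDA-DOE model and is just a linear blend of three made-up peaks, and the target is generated from the 0.8, 0.1, 0.1 blend mentioned in the demo. A real workflow would use the profiler's own optimization rather than a grid.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 200)
dx = x[1] - x[0]
# Hypothetical pure spectra: three Gaussian peaks.
pure = np.vstack([np.exp(-0.5 * ((x - c) / 0.3) ** 2) for c in (2.0, 5.0, 8.0)])

def predict_spectrum(p):
    # Stand-in for a fitted functional DOE model: a linear blend of pure spectra.
    return np.asarray(p) @ pure

# "Unknown" target spectrum, generated here from a known blend so the answer can be checked.
target = predict_spectrum([0.8, 0.1, 0.1])

steps = 20                      # grid spacing of 0.05 on each proportion
best_p, best_err = None, np.inf
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        p = (i / steps, j / steps, (steps - i - j) / steps)   # proportions sum to 1
        # Integrated squared error between predicted and target curves (rectangle rule).
        err = np.sum((predict_spectrum(p) - target) ** 2) * dx
        if err < best_err:
            best_p, best_err = p, err
```

With a noise-free target the search lands exactly on the generating blend; with a measured spectrum the minimum error would be greater than zero and the recovered proportions only approximate.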
That's that. Then there's one more. Let me see if I can get rid of this pointer. There we go. One more takeaway: Functional Data Analysis breaks apart highly correlated longitudinal data, like spectra or any curve data, into two parts. We've talked about this: the shape functions explain the longitudinal variation, and the FPC scores explain the function-to-function variation.
You combine these two pieces of information and then use your design of experiments methods and any machine learning method that you want, especially, in this case, Functional Data Explorer. You can increase your predictive capability by using the information in the original sensor stream, whether that's a stream of data, curves, or spectra.
Then last but not least, you'll want to communicate and share this data. As you're getting your process analytical technology data, your measurements, your streaming data, you want to be able to update that on the fly and then share it with other folks. That's where you want to bring in JMP Live to help you do that.
Finally, I want to say thank you, but please remember that with Functional Data Explorer and process analytical technology data, you can go a long way towards getting a really good predictive model for your data, especially mixtures. Thank you all.