Hello, I'm Chris Gotwalt with JMP, and my co-presenter, Giuseppe De Martino from Topsoe, and I are giving a presentation that tells the story of a data analysis project that showcases the new wavelet analysis in the Functional Data Explorer, one of the most exciting new capabilities in JMP 17.
The case study begins with a product formulation problem where Topsoe wanted to design a catalyst that optimizes two responses, but the responses are in conflict with one another in that improving one response often comes at the expense of the other.
A candidate for the optimal tradeoff was found with models fit by the generalized regression platform, and the optimal factor settings were found using the desirability functions in the profiler. This was a fairly standard DoE analysis. But in addition to the measured responses, NIR spectra were taken from some, but not all of the sample batches. This is information that can be used to give a clue to the R&D team about what the chemical structure of the ideal formulation should look like.
In addition to the GenReg model of the responses, we also used wavelet models as the basis functions for a functional DoE analysis of the spectra, with the DoE factors as inputs. We were then able to get a synthetic spectrum of the optimal formulation by plugging in the optimal factor settings found in the initial analysis of the two critical responses.
Before going into the presentation, I want to point out that at the beginning of the project, Giuseppe was very new to JMP and didn't have a background in this type of statistical analysis. Giuseppe learned all he needed to do the analysis on his own after a couple of web meetings with me.
This obviously shows he's a clever guy, but also that JMP makes learning how to do some very sophisticated data analysis projects quick and easy. Now I'm going to hand the show over to Giuseppe.
Thank you, Chris. Here is some background about our project. It's a catalyst development project. Therefore, we are developing many different recipes. Each recipe has a unique set of production parameters, and once the sample is prepared, we characterize it in our analysis lab in Topsoe. And finally, we do a performance test.
During the performance test, we look for two values that here we call Response 1 and Response 2. In this specific case, we are trying to minimize Response 1 while maximizing Response 2. That would lead us to the ideal space in the top left corner of the graph. But as you can see, the 55 samples that we've tested get stuck in the middle of the graph. That is because Response 1 and Response 2 are intercorrelated, meaning that improving one comes at the expense of the other.
Therefore, we moved to JMP to look at our response data and our characterization data and see whether we could move away from this line in the middle of the graph. We identified two target areas that we want to reach, and together with Chris, we thought about using JMP to create a model that would connect the production parameters to the response values. Then we looked further into our infrared spectroscopy data to try to validate our model and to get some extra information about the target samples.
Here is an overview of the data set. We have produced 112 samples. Each sample has a unique set of production parameters. We have analyzed all the samples using infrared spectroscopy. We have one spectrum for each sample.
Then we have used many other characterization techniques that we have in-house, which account for 21 more columns of data. Finally, we have tested half of the samples, and that accounts for the last two columns, which we call the responses.
At the beginning of our project, we actually wanted to include the infrared spectroscopic data in our larger data set, and that's why we wanted to use JMP, because now we have this new wavelet model possibility. That would enable us to include the principal components coming from the wavelet model in our data set, and we could use them to create models in JMP. But before we start our analysis, we need to have a look at the raw data to find outliers. We do that by clicking Analyze, then Multivariate Methods, then Multivariate.
Here we can select our production parameters and characterization data as Y columns, and we get this scatterplot matrix. This is an example of just looking at the production parameters and what we would identify as an outlier. We can see that there is a set of points in production parameter 2 that is far away from all the other points. Furthermore, we have background knowledge about these samples that we know wouldn't be optimal for our catalyst development, so we decided to right click and say "Row, hide and exclude."
We did this also with other points, looking at the scatterplot matrix of all the characterization data. Now that we have cleaned up the data, we can fit a model. We click Analyze and Fit Model, and we select the production parameters as variables in our model. We click Macros and Response Surface to create a second-order polynomial combination of these variables. Then we select our responses as Y values.
Then we decided to use generalized regression. Here Chris can add some more info about why we decided to use this specific type of model.
We used a quadratic response surface model because the design had three or more levels for each factor, so I knew that we would be able to fit curvature terms if necessary and also be able to fit quite a variety of different interaction terms. We used the generalized regression platform because it does model selection with non-Gaussian distributions like the lognormal. In my opinion, there aren't many reasons not to use the generalized regression platform if you have JMP Pro, because it is so easy to use while in many ways being so much more powerful for DoE analysis than the other options in JMP and JMP Pro.
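As a rough illustration of what a full quadratic response surface contains, here is a minimal Python sketch that lists the terms for a set of factors. The factor names `X1`, `X2`, `X3` and the function `response_surface_terms` are made up for this example; this is not how JMP builds the model internally.

```python
from itertools import combinations


def response_surface_terms(factors):
    """Return the term labels of a full second-order (quadratic) model."""
    main_effects = list(factors)
    # Every pair of distinct factors gives one two-way interaction.
    interactions = [f"{a}*{b}" for a, b in combinations(factors, 2)]
    # Every factor gives one squared (curvature) term.
    squares = [f"{f}*{f}" for f in factors]
    return main_effects + interactions + squares


# Hypothetical factor names, for illustration only.
terms = response_surface_terms(["X1", "X2", "X3"])
print(terms)
```

The squared terms are the curvature terms mentioned above, and they can only be estimated when a factor has at least three levels.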
After that, we can select our distribution. We know that the responses are going to be strictly positive, so we select the lognormal distribution and then we click Run, and we say no.
In this slide, we can see that we have now created a model, but we also have the possibility of creating other types of models using different estimation methods. We decided to use best subset. Here, Chris can add some more words about it.
Well, so here we use best subset selection because the full model isn't terribly large, so why not try every possible subset of that full model and find the one that provides the absolute best tradeoff between accuracy and model simplicity?
On the other hand, had there been eight or more factors, I would have used a faster algorithm like forward selection or pruned forward selection, because with larger base models it would take a very long time to fit every possible submodel to the data. We're going to be using the AICc model selection criterion to compare GenReg models.
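To see why best subset stops being practical as the number of factors grows, this small Python sketch counts the candidate terms in a full quadratic model and the number of possible subsets of those terms. It assumes every subset is allowed, so it is an upper bound; the software may restrict the search in ways not described here, and the function names are invented for the example.

```python
def n_candidate_terms(k):
    """Terms in a full quadratic model with k factors (intercept excluded)."""
    # k main effects + k*(k-1)/2 two-way interactions + k squared terms
    return k + k * (k - 1) // 2 + k


def n_subsets(k):
    """Number of subsets of those terms, assuming any subset is allowed."""
    return 2 ** n_candidate_terms(k)


for k in (3, 5, 8):
    print(k, n_candidate_terms(k), n_subsets(k))
```

With eight factors there are 44 candidate terms, and the number of subsets is in the trillions, which is why a stepwise method becomes the practical choice.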
The AICc allows you to compare models with different effects in them as well as different response distributions. With the AICc, smaller values are better and the rule of thumb I use is that if a model has an AICc value that is within 4 of the smallest AICc value seen, then those two models are practically identical in quality of fit.
If the two models have AICc values within 10 of each other, then they are statistically similar to one another. The main point here being that if we have two models and their AICc's differ by more than 10, then the data are pretty strongly suggesting that the one with the smaller AICc is the better model to be working with.
As with any individual statistic, you should view the AICc as a suggestion. If your subject matter experience strongly suggests one model over the other, you may want to trust your instincts and ignore the recommendation of the AICc.
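The rule of thumb above can be written down in a few lines of Python. The AICc formula is the standard one; the thresholds of 4 and 10 are the ones just described. The log-likelihood values and sample size below are invented for illustration and are not results from the Topsoe data.

```python
def aicc(neg2_loglik, n_params, n_obs):
    """Corrected AIC: -2*logL + 2k + 2k(k+1)/(n - k - 1)."""
    k, n = n_params, n_obs
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    return neg2_loglik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)


def compare_aicc(aicc_a, aicc_b):
    """Apply the within-4 / within-10 rule of thumb to two AICc values."""
    diff = abs(aicc_a - aicc_b)
    if diff <= 4:
        return "practically identical"
    if diff <= 10:
        return "statistically similar"
    return "smaller AICc strongly preferred"


# Illustrative numbers only (hypothetical -2 log-likelihoods and sample size).
full = aicc(neg2_loglik=210.0, n_params=16, n_obs=55)
reduced = aicc(neg2_loglik=214.0, n_params=9, n_obs=55)
print(round(full, 2), round(reduced, 2), compare_aicc(full, reduced))
```

Note how the penalty grows quickly when the parameter count is large relative to the sample size, so a smaller model can win even with a slightly worse likelihood.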
Once we have created this new model, we can see that the non-zero parameters have now dropped from 16 to nine. To check whether the model has improved, we can look at the AICc values. We can see that there is an improvement of more than 10, which is an important difference. Therefore, we decided to go with the best subset.
We did the same for Response 2, and then we moved on. Now we can click on the red arrow and say profiler. In the profiler, we can play around with the production parameters and see how the model is expecting Response 1 and Response 2 to change. This is already a great tool for the scientist to understand how the model is expecting the responses to vary, but we can do more. We can click on optimization and desirability and desirability functions.
Since from slide one, we know that we have two targets that we want to reach, we can change the desirability function to match those targets. So we double click on the Desirability function and we say match target and select the target area that we want to reach. Finally, we can click again on optimization and desirability and say maximize desirability.
Here, the profiler will try to reach the optimal points for the production parameters. To summarize, we can say that now we have the first model, we go from production parameters to responses and we have set two targets that we want to reach. This way we can get ideal production parameters that we can communicate to the development team and they can use to move on in their research.
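For readers who want to see the idea behind matching targets and maximizing desirability, here is a minimal Python sketch. It uses a simple piecewise-linear target desirability and a geometric mean across responses, then searches a grid of settings. The response models, targets, and limits are all hypothetical, and the shape of the desirability curve is a simplification, not necessarily what the JMP profiler uses.

```python
import math


def target_desirability(y, low, target, high):
    """Piecewise-linear desirability: 1 at target, 0 at or beyond low/high."""
    if y <= low or y >= high:
        return 0.0
    if y <= target:
        return (y - low) / (target - low)
    return (high - y) / (high - target)


def overall_desirability(values):
    """Geometric mean of the individual desirabilities."""
    return math.prod(values) ** (1.0 / len(values))


def toy_responses(x):
    """Hypothetical models of two conflicting responses for one factor x."""
    return 2.0 + 6.0 * x, 1.0 + 8.0 * x


def desirability_at(x):
    r1, r2 = toy_responses(x)
    return overall_desirability([
        target_desirability(r1, low=2.0, target=4.0, high=8.0),
        target_desirability(r2, low=1.0, target=6.0, high=9.0),
    ])


# Coarse grid search over the factor range [0, 1].
grid = [i / 100 for i in range(101)]
best_x = max(grid, key=desirability_at)
print(best_x, round(desirability_at(best_x), 3))
```

Because the two targets cannot both be hit exactly in this toy case, the best setting is a compromise between them, which is the same kind of tradeoff the profiler resolves for the real responses.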
In the second part of the presentation, I'm going to talk about how we used the IR spectra. Here we have a file for each spectrum, and therefore we need to click on File and Import Multiple Files. Then we need to specify that we want the file name to be a column in the data set, and then we can use this sample name, which is the name of the file, as an ID to connect it to the other table where we have all the data. We click on the column and say Link Reference.
Now that the two tables are connected, we can click Analyze, Specialized Modeling, and Functional Data Explorer. In the Functional Data Explorer, we want to use the intensity value as the Y output. We can use the sample name, matrix name as our ID function.
Then, and this is very important, we use the production parameters as supplementary data. The wavenumber is, of course, the X-axis. And we say OK. Here we can see that the data is already clean. We've imported all the spectra, and the data looks clean because I did the preprocessing outside of JMP. I used Python because I'm more familiar with that, and there is a very nice module that is able to remove the background and reduce the range that we want to look at.
JMP was good to work with as an extra tool after this preprocessing. Then we decided to click on Models and Wavelets. We move from discrete data to continuous data. Now we can also look at the diagnostic plots. This is for you, Chris, to talk about.
It's a good idea to look at actual by predicted plots as you proceed through a functional data analysis. These have the actual observed values in the data on the Y axis and the predicted values on the X axis. We want the predicted values to be as close to the actual values as possible. A plot like this one that is tight along the 45 degree line indicates that we have a good model.
Now, some of you may be concerned about overfit since the predictions fit the data so well. In my experience, I haven't found that to be a problem in the basis function fitting and functional principal component steps of functional DoE analysis. I'd also like to point out that in JMP 17, we've added a lot of new features for spectral preprocessing, like standard normal variate, multiplicative scatter correction, and Savitzky-Golay filters. Those of you that don't know Python have access to these capabilities in JMP Pro 17.
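As a small illustration of one of these preprocessing steps, here is the textbook form of the standard normal variate in Python: each spectrum is centered by its own mean and scaled by its own standard deviation. This is a sketch of the general technique, not JMP's implementation and not the preprocessing Giuseppe applied; the intensities are made up, and the use of the sample (n - 1) standard deviation is an assumption.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and SD."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1), an assumption here
    if s == 0:
        raise ValueError("flat spectrum: standard deviation is zero")
    return [(v - m) / s for v in spectrum]


# Made-up intensities for a single spectrum.
raw = [0.10, 0.12, 0.30, 0.55, 0.31, 0.13]
corrected = snv(raw)
print([round(v, 3) for v in corrected])
```

After the correction, every spectrum has mean 0 and standard deviation 1, which removes offset and scale differences between samples before modeling.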
After that, we can also have a look at the Functional PCA analysis. Here, we'll spend some more words on it.
After the wavelet model is fit, JMP Pro automatically does a functional principal components analysis of the wavelet model. This decomposes the spectra into an overall mean spectrum and a linear combination of shape components and coefficients that are unique to each sample spectrum. When we do the functional DoE analysis, GenReg automatically fits models to these coefficients behind the scenes and combines the resulting models with the mean function and the shape components to predict the shape of the spectra at new values of the input parameters.
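The way these pieces combine can be sketched in a few lines of Python: a model per coefficient turns factor settings into scores, and the scores weight the shape components that are added to the mean function. Everything here is a toy assumption: the three-point grid, the two shape components, and the simple linear score models stand in for whatever GenReg actually fits.

```python
def predict_scores(factors, score_models):
    """Evaluate one hypothetical linear model per score: intercept + sum(coef * factor)."""
    return [
        m["intercept"] + sum(c * x for c, x in zip(m["coefs"], factors))
        for m in score_models
    ]


def reconstruct(mean_fn, shape_fns, scores):
    """spectrum[i] = mean[i] + sum over k of score_k * shape_k[i], on a shared grid."""
    return [
        mu + sum(s * shape[i] for s, shape in zip(scores, shape_fns))
        for i, mu in enumerate(mean_fn)
    ]


# Toy inputs on a three-point grid (all values invented for illustration).
mean_fn = [1.0, 2.0, 3.0]
shape_fns = [[1.0, 0.0, -1.0], [0.0, -1.0, 0.0]]
score_models = [
    {"intercept": 0.0, "coefs": [2.0, 0.0]},
    {"intercept": 0.5, "coefs": [0.0, 1.0]},
]

factors = [1.0, 0.5]  # hypothetical factor settings
scores = predict_scores(factors, score_models)
synthetic = reconstruct(mean_fn, shape_fns, scores)
print(scores, synthetic)
```

Where a shape component is negative, as the second one is at the middle grid point, a larger score lowers the predicted spectrum at that location.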
If we look at the principal component analysis, we can see that the wavelet model has created a mean function of all the spectra that we've set as input, and then it has created different shape functions. We decided to stop at six. What these shape functions describe is the variation in the data that we are analyzing.
As you can see on the left, the first shape function accounts for 72% of the variation, while the second accounts for 22%. Together, all six account for 99.5% of the variation. As an example, we can look at principal component 2, and we can see that around 3,737, there is a reduction. There is a minus. That means that increasing principal component 2 will decrease the peak at 3,737.
This is just an example to say that already from these six shape functions, we can get a lot of information if we have subject matter knowledge about infrared spectroscopy and this catalyst system. We spent quite a lot of time looking at the principal components, but this is not what we are going to focus on in the next slides.
What we want to look at instead is the functional DoE analysis. Here we have a profiler as well, but now we can plug in the production parameters that we got from the first model that we developed. Therefore, knowing the target production parameters that we want to use, we can generate a fake spectrum, or a synthetic spectrum, as we can call it. This is a spectrum of a sample that was never produced. It could be wrong, but it can give the scientists some idea of what to expect from these new production parameters.
To sum up, we can say that now we move from production parameters to infrared spectrum. We have a second model that uses the wavelet model to generate a synthetic spectrum. I imagine this to be like baking a cake: now you have the recipe, but you also have a snapshot of the final cake. It doesn't explain how to make it, but it adds information about what you want to achieve.
Finally, we can move from model to test. That means that we can give the R&D team a new recipe and also the synthetic spectrum of that recipe. Together with this information, they can try to develop the new catalyst and see if the model is validated or is wrong.
Another thing the group can do is look back at the previous samples. Now we have half of the samples that were not tested. Those are the black dots in the slide. We can look for outliers. Is there a sample that could perform really well that we haven't looked at? We actually have one. So that's another test we can do.
Looking at future work, I added this slide that was at the beginning just to say that we focused on the production parameters and the IR spectra, but we haven't really looked at these 21 more columns of characterization data.
In the future, we could spend some time trying to identify the most predictive parameters in these 21 columns and maybe create a new model from this characterization data to the responses, and use this model as a screening model to avoid testing samples that would not perform as well. That's the end of the presentation.