Innovative Approaches Using JMP: Two Case Studies on Predictive Modeling and Simulations (2023-EU-30MP-1295)

Damien Perret, R&D Scientist, CEA
François Bergeret, Senior Statistician, Ippon Innovation
Carole Soual, Chief Statistician, Ippon Innovation
Isabelle Giboire, CEA

JMP software was first implemented at CEA in 2010 by the R&D teams who develop nuclear glass formulations. Over the years, JMP has been used for multiple purposes, such as data visualization of highly complex composition domains, optimal mixture designs, and machine learning techniques to create property-to-composition predictive models. More recently, JMP enabled us to develop very innovative methodologies. Two case studies will be presented. First, we will show an original approach based on automatic and intelligent subsampling of the data, combining optimal design techniques and several predictive methods in JMP and JMP Pro to create very robust and accurate predictive models. Second, we will present a major benefit of using the Simulation platform when a response is below the limit of detection in most of the design space.

Thank you for watching this presentation for the Europe Discovery Summit Conference online. My name is Damien Perret. I am an R&D scientist at CEA in France. I am with Francois Bergeret, statistician and founder of Ippon Innovation in France. Francois and I are very happy to be here today, and we would like to thank the steering committee, who gave us the opportunity to talk about this work on innovative approaches using JMP.

We will give you two case studies. Francois will present the first case and I will present the second. Let me start with a few words about CEA. CEA is a French government organization for research, development, and innovation in four areas: low-carbon energies, technological research, fundamental research, and defense and security. CEA has about 20,000 people and we are located on nine different sites in France. We have strong relationships with the academic world and many collaborations with universities and partners, both in France and all around the world.

A few words about Ippon Innovation. Of course, we are a smaller company compared to the CEA. It was created 15 years ago. We are based in the south of France, in Toulouse. We are a team of statisticians, only statisticians, with skills in industrial statistics, for example SPC, measurement system analysis, and of course machine learning and so on.

I have been a JMP user since 1995, so for a long time. I started with JMP 3. Of course, Ippon is a JMP Partner because we use JMP a lot. For example, for yield optimization, we have a tool called Yeti, for automatic yield optimization in complex manufacturing systems. We also develop solutions based on customer requests, what we call software on demand, for example a full solution for outlier detection or statistical process control.

Using JMP and JSL, we have several JSL experts here, including Carole Soual, who is a co-author of this talk today. We also have classical consulting and training expertise based on JMP for industrial statistics. As for the content of the presentation today, we will present two real case studies with Damien. The first one is based on simulation and a computer design of experiments, and the second one, presented by Damien, is about a machine learning tool for prediction.

I will present case study number 1, based on mixtures. It is okay. You can go to the next slide. The context is a risk assessment and a probability calculation. To explain it a little bit, a mixture design of experiments was created by the CEA to evaluate the performance of a material for nuclear waste. The conditioning is done with salts in a matrix. The performance is determined by a threshold on the energy. The energy has to be higher than the threshold.

It is not so easy to estimate the probability of being below this threshold. We will use all the tools and all the data that we have to do this task. Of course, the probability has to be as small as possible. Damien, the CEA expert, will assess whether this probability is okay. Now, what methodology do we use to estimate this worst-case probability?

First of all, based on the data from the mixture design of experiments, we estimated several models. The main models used are classical linear models, classification and decision trees, and neural networks. For each model, we did three analyses.

First, a Monte Carlo simulation on the factors, so a classical random simulation. Second, what we call a space filling design, a computer design with JMP, where we try to explore the design space by computer simulation, adding noise on the response. This is done with JMP Pro. The last thing we did is a blending of Monte Carlo and space filling design. We will detail this, but it is very useful because we want to estimate a worst-case probability. We can go to the next slide.

Case study number 1, classical JMP simulation for the mixture DOE. First of all, we have to select the best model. Before doing the simulation, we need to find the best model. There is a very nice feature in JMP Pro, which is called Model Comparison. Very quickly, you can compare the models based on criteria such as RSquare or AICc. We have done this and I will do the demo right now. I share my screen now. I have to share my screen. We have to use JMP. Here, I open the data set. This is a data set from a mixture design of experiments. There are eight factors, X1 to X8, and the response is the energy.

To show you an example of the models that we built, we fit here, for example, a predictive modeling neural network. The response is the energy. The X factors are here. We click here. For information, we decided here not to divide the full sample into a learning sample, a validation sample, and a test sample, because the data come from a design of experiments, so we do not have a lot of data: 31 experiments at most here.

We use KFold validation to save samples, I would say. A very simple neural net here, one layer, and we click on Go. It is quite a good model with a correct RSquare, let's say. How does model comparison work in JMP? You need to save the prediction formula. You click on the spot and I save the formula. Doing this, you have the formulas here for the neural net with its hidden layer. You see the formula, a hyperbolic tangent of the linear predictor. The second neuron of the hidden layer has another formula, and so on. At the end, for the neural net, the final prediction formula is a linear combination of the hidden neurons. I keep this formula for the moment, and I'm going to do the same with a linear model.
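The structure of the saved neural net formula described here (hyperbolic tangent hidden nodes, then a linear combination) can be illustrated with a minimal Python sketch. All coefficients and names below are hypothetical; this is not the formula JMP saved in the demo, and JMP's exact parameterization of the activation may differ.

```python
import math

def neural_net_predict(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One-hidden-layer network: tanh of a linear predictor per hidden node,
    then a linear combination of the hidden node values."""
    hidden = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        linear_predictor = bias + sum(w * xi for w, xi in zip(weights, x))
        hidden.append(math.tanh(linear_predictor))
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))

# Hypothetical coefficients: two factors, two hidden nodes
HIDDEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]
HIDDEN_BIASES = [0.0, 0.0]
OUTPUT_WEIGHTS = [2.0, 3.0]
OUTPUT_BIAS = -100.0

example = neural_net_predict(
    [1.0, 1.0], HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS
)
```

With these made-up coefficients the first hidden node sees a linear predictor of zero, so only the second node contributes to the prediction.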

The linear model was saved here to save time. We decided to have a linear model. For a mixture design, we have some cross effects here and the linear terms. Of course, we need to clean the model. We need to remove what is not significant and so on. When the job is done, we also have to save the formula. I saved what I call the prediction formula. Here, once again, somewhere in the JMP table, you have the prediction formula for the energy by ordinary least squares. It is a classical linear formula for the linear model.

To summarize, here I have two models, neural network and ordinary least squares, and I have a formula for each. Then with these formulas, I go to Analyze, Predictive Modeling, and Model Comparison. This is the nice JMP platform for machine learning. Here, all you have to enter is the prediction formulas. For the example, I just enter two formulas, ordinary least squares and neural net, and that's all. You just click OK. Here you have a comparison of the models on several criteria, and you select the best model here. Note that I added noise to the data, so the RSquare values are not so good.
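To make the comparison criteria concrete, here is a hedged Python sketch that computes RSquare and a corrected AIC from saved predictions. The data are invented, the parameter counts are placeholders, and the AICc shown is the usual Gaussian-likelihood form, which is an assumption and not necessarily the exact quantity JMP reports.

```python
import math

def r_square(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst

def aicc_gaussian(actual, predicted, n_params):
    """Corrected AIC under Gaussian errors.

    n_params counts every estimated parameter, including the error variance.
    Requires len(actual) > n_params + 1.
    """
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    neg2_loglik = n * (math.log(2 * math.pi * sse / n) + 1)
    correction = 2 * n_params * (n_params + 1) / (n - n_params - 1)
    return neg2_loglik + 2 * n_params + correction

# Invented responses and predictions, for illustration only
energy = [-60.0, -72.0, -81.0, -65.0, -90.0, -77.0, -70.0, -85.0]
pred_ols = [-62.0, -70.0, -83.0, -66.0, -88.0, -75.0, -71.0, -84.0]
pred_nn = [-61.0, -71.5, -80.0, -66.0, -89.0, -78.0, -70.5, -86.0]

# Placeholder parameter counts (4 and 6) chosen only for the example
for name, pred, k in [("OLS", pred_ols, 4), ("Neural net", pred_nn, 6)]:
    print(name, r_square(energy, pred), aicc_gaussian(energy, pred, k))
```

A higher RSquare and a lower AICc both favor a model; AICc also penalizes the number of parameters, which matters with only a few dozen runs.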

In addition, we can compare both models. In this case, for the worst-case analysis, we decided to perform the simulation on both the linear model and the neural net. Now, we are going to do that work. Let me show you what we did. Maybe I'm going to share this slide, Damien, because I need to show something here. It will be easier with my PC. So the live demo is done. Here I'm going to use a Monte Carlo simulation. It is very easy with the JMP prediction profiler.

First of all, this is my first demo. After that, I will do the space filling design for the mixture design of experiments. I have to say that for both Monte Carlo and the space filling design, we have mixture constraints. It means that the sum of the components has to be equal to one. It's not a real space filling design, and it's not a real Monte Carlo, because you have this constraint. JMP has a smart iterative algorithm to take the constraints into account. We are going to simulate Monte Carlo with the first factor, then with the second factor taking into account the first one, and so on.
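The idea of drawing the factors one after another under a sum-to-one constraint can be sketched in Python as follows. This is only an illustration of sequential constrained sampling with hypothetical bounds; it is not JMP's algorithm, and the resulting points are not uniformly distributed over the feasible region.

```python
import random

def sample_mixture(lower, upper, rng):
    """Draw one mixture point whose components sum to 1.

    Each component is drawn uniformly from its own bounds, narrowed so that
    the remaining components can still complete the mixture.
    Assumes sum(lower) <= 1 <= sum(upper).
    """
    remaining = 1.0
    point = []
    for i in range(len(lower)):
        rest_low = sum(lower[i + 1:])    # least the later components can take
        rest_high = sum(upper[i + 1:])   # most the later components can take
        low = max(lower[i], remaining - rest_high)
        high = min(upper[i], remaining - rest_low)
        value = rng.uniform(low, high)
        point.append(value)
        remaining -= value
    return point

# Hypothetical bounds for a three-component mixture
rng = random.Random(2023)
lower = [0.20, 0.10, 0.10]
upper = [0.70, 0.50, 0.40]
samples = [sample_mixture(lower, upper, rng) for _ in range(1000)]
```

The last component has no freedom left: its narrowed interval collapses to whatever is needed to reach a total of one.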

At the end, I will present the full simulation with both Monte Carlo and space filling design. This is new because what we have done is for each run of the space filling design, we have done 1,000 Monte Carlo simulations. It's really a worst case here, but it was the objective for the CEA to have the worst case. Thanks to this, we will get a really good estimation of the worst case probability.

Now, let me jump to the demo. First of all, Monte Carlo. Here I have the result for the Monte Carlo. No, sorry, it's not the right one. Here it is. Let's use the neural network. Here, for the neural network, you can ask for the profiler. If you ask for the profiler, here you can ask for the simulator. I will randomly move the factors. But what is important here is that I'm asking for a uniform distribution, because it's more classical for mixture designs. Here, I'm going to simulate random data for the factor X1, but between this value and this value, because there is a constraint here. I continue with the second factor, with a uniform simulation between this value and this value, and so on.

To save time, I will automatically run the simulation here. Here it is. Sorry, I don't have the right data set. Sorry about that. Where is it? Model comparison and simulation. Here it is. Okay, here it is, Monte Carlo simulation. To save time, I'll show you the result directly, with a random simulation under a mixture constraint.

We have done this Monte Carlo simulation and we have this result on the energy. What is also nice is that you can put the result in a table. This is the result, 10,000 Monte Carlo simulations. For each simulation, with the model, you have the energy. Now, to calculate the probability of being lower than the spec limit, we just have to look at the distribution of the simulated energy. This is the distribution, not exactly Gaussian, closer to a Laplace distribution. But anyway, it doesn't matter. We will do the process capability, and the spec given by the CEA is minus 100. I just click on this and we have what we call the capability analysis, overall capability. Ppk is quite good, higher than 1.3. Here, the expected percentage out of spec is clearly very low.
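For a one-sided lower specification, the capability calculation described here can be sketched in Python. The sketch assumes a normal approximation for the expected fraction out of spec, and the simulated energies are invented (mean -70, standard deviation 7); they are not the values from the demo.

```python
import random
from statistics import NormalDist, mean, stdev

def lower_spec_capability(values, lower_spec):
    """Return (Ppk, expected fraction below spec, observed fraction below spec).

    With only a lower spec limit, Ppk = (mean - LSL) / (3 * sigma), where sigma
    is the overall sample standard deviation. The expected fraction uses a
    normal approximation.
    """
    mu = mean(values)
    sigma = stdev(values)
    ppk = (mu - lower_spec) / (3 * sigma)
    expected_below = NormalDist(mu, sigma).cdf(lower_spec)
    observed_below = sum(v < lower_spec for v in values) / len(values)
    return ppk, expected_below, observed_below

# Invented simulated energies standing in for the 10,000 Monte Carlo rows
rng = random.Random(1)
simulated_energy = [rng.gauss(-70.0, 7.0) for _ in range(10_000)]
ppk, expected_below, observed_below = lower_spec_capability(simulated_energy, -100.0)
```

When the simulated distribution has heavier tails than a normal, as with the Laplace-like shape mentioned above, the observed fraction is a useful cross-check on the normal-based estimate.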

We can have a look in scientific notation. It is a very small probability, and clearly it was acceptable for the [inaudible 00:15:13]. At this point, based on the neural network model and on Monte Carlo simulation, we have estimated a first probability of being out of specification, and this probability is here. Next step. What we have done here is that we are going back here. Sorry, I don't close this. I'm going here to do another simulation with what we call a simulation experiment. A simulation experiment is also sometimes called a space filling experiment. You try to explore the design space.

Here, we have to remember that there is a mixture constraint, so of course we will not explore the full design space, only a part of it. Here is the result, 128 computer runs with the simulated data. We can have a look at the result of the simulation. If we do a scatterplot matrix of the factors in the simulated experiment, here is the exploration of the factors under the mixture constraint. With this, once again, we have simulated energy, but here it's not a Monte Carlo simulation, it's a computer simulation experiment.

Same job, we can put the energy here. Where is the simulated energy? Here it is. Once again, roughly normally distributed. Here I'm going to calculate, once again, the process capability with a spec of minus 100. Here it is. Here I should have the process capability, which is good. Once again, you have a probability of being out of spec, which is close to 7 per 1,000. It is also acceptable as a result. But you can see that in this case the probability is higher than the previous one, because the calculation is different: it is a simulation experiment exploring the space, which is different from a Monte Carlo simulation.

The last thing that we did, and I'm going to open the data file. Here it is, Simulated Monte Carlo. Here it is. Not the right one, sorry. I have a lot of things open. Here it is. What is this file? What did we do here? We did a space filling design and, for each run of the space filling design, we did 1,000 Monte Carlo simulations. The total number of points here is 128,000 rows. It is both Monte Carlo and a computer design of experiments.
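The nesting of 1,000 Monte Carlo draws inside each of the 128 design runs can be sketched in Python. How the random variation enters is an assumption here: the sketch adds Gaussian noise to the predicted response at each design point, with a toy one-factor model and an invented noise level, and it does not reproduce the JSL script used in the talk.

```python
import random

def blend_design_with_monte_carlo(design_points, predict, noise_sd, n_draws, rng):
    """For every design run, draw n_draws noisy responses around the prediction.

    Returns a list of (run_id, simulated_response) rows.
    """
    rows = []
    for run_id, point in enumerate(design_points):
        predicted = predict(point)
        for _ in range(n_draws):
            rows.append((run_id, predicted + rng.gauss(0.0, noise_sd)))
    return rows

def toy_model(point):
    """Hypothetical stand-in for a saved prediction formula."""
    return -60.0 - 40.0 * point[0]

rng = random.Random(0)
design = [[i / 127.0] for i in range(128)]   # 128 evenly spread runs, one factor
rows = blend_design_with_monte_carlo(design, toy_model, noise_sd=5.0, n_draws=1000, rng=rng)
fraction_below = sum(energy < -100.0 for _, energy in rows) / len(rows)
```

The table has 128 x 1,000 = 128,000 rows, and the fraction of rows below the spec limit is a direct empirical estimate of the out-of-spec probability.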

Here we have a really good data set with all the potential variations. Some are forced by the design of experiments, others are clearly random with the Monte Carlo. Here, once again, we are going to estimate the probability. You have the nice distribution of the energy. Then we will, once again, calculate the probability of being out of spec. Entering the spec here, here it is. Here we have a probability of being out of spec which is close to 1 per 1,000. Once again, it was quite a good result. This result is quite innovative. Just for information, we had to create a little JMP script to do this, to mix the computer design and the Monte Carlo simulation. There is a little JSL code for that. That's all for my part. Damien, maybe you can go. I stopped the sharing.

Okay. This is now case study number 2. For this study, we have developed a custom tool for predictive applications. The objective here was to create a tool including statistical models in JMP Pro in order to predict a specific property, the glass viscosity, as a function of composition and temperature. To do that, experimental data come from both a commercial database and our own database at CEA. As we will see, the originality of the approach comes from the methodology for data subsampling.

We wanted the algorithms to be coded in JSL and implemented in JMP Pro. The response of the model is the glass viscosity, of course, and the factors are the weight percents of the glass components. Here is some background information. You have to know that glass is a non-crystalline solid, obtained by a rapid quench of a glass melt. From a material point of view, glass is a mixture of different oxides.

The number of oxides varies from two or three in a very simple glass to about 30, and even more, in the most complex compositions. There is a long tradition in the calculation of glass properties, and we think that the first models were created in Germany at the end of the 19th century. Since then, the amount of published literature in the field of glass property prediction has increased a lot, so that today we have a huge amount of glass data available in commercial databases.

Several challenges remain for the prediction of glass viscosity, because it is a property that is very difficult to predict. First, viscosity has a huge range of variation, spanning many orders of magnitude. Also, viscosity is very dependent on physical and chemical mechanisms that can occur in the glass melt, depending on the glass composition, like phase separation or crystallization, for example.

Here is just a short example that shows this difficulty. We have selected three compositions of what we call SBN glass, which is a very simple glass with only three oxides. We applied the best-known models from the literature to calculate the viscosity. Then we compared the predicted values with the experimental value we measured with our own device. You can see that even for a very simple glass, it is not easy to obtain one reliable value for the predicted viscosity. Here is a picture we like to use to give a view of the database, where each dot is one glass in a multidimensional view of the domain of compositions.

Data may come from different isolated studies, from studies using experimental design, or from parametric studies with variation of one component at a time. We spent a lot of time in the past applying different machine learning methods. A classical approach was used for partitioning the data into a training set and a validation set. But in the end, no statistical model with acceptable predictive capability was found for the viscosity. That's why we decided to use a different approach.

Instead of using all the data, we think it is better to create a model using data close to the composition where we want to predict the viscosity. So for example, if we want to predict here, one model will be created from the data we have in this area, and a different model will be created if we want to predict the property here, for example. That's why we say that this technique is dynamic: the model depends on the composition and it is fitted where we want to predict. We say that the model is automatic because we don't have to do this manually.

Every step is done by an algorithm implemented in the tool. One of the most important points is certainly the determination of the optimal data set to create the model. For that, we have implemented two methods of subsampling. In the first method, a theoretical or virtual design of experiments is generated around the composition of interest. Then each run of the design is replaced by the most similar experimental data point present in the database, leading to the final training data set.
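The first subsampling method, replacing each virtual design run with the most similar database entry, can be sketched in Python. Euclidean distance on the composition vector and the removal of duplicate matches are assumptions made for this illustration; the talk notes that several distance definitions were studied, and the compositions below are invented.

```python
import math

def match_design_to_database(design_runs, database):
    """Replace each virtual design run by its nearest database composition.

    Returns the indices of the selected database rows, in order of first
    selection and without duplicates.
    """
    selected = []
    for run in design_runs:
        nearest = min(range(len(database)), key=lambda i: math.dist(run, database[i]))
        if nearest not in selected:
            selected.append(nearest)
    return selected

# Invented two-component compositions (weight fractions)
database = [[0.50, 0.50], [0.60, 0.40], [0.10, 0.90]]
virtual_design = [[0.52, 0.48], [0.58, 0.42], [0.51, 0.49]]
training_rows = match_design_to_database(virtual_design, database)
```

Here two of the three virtual runs map to the same database glass, so the training set ends up with two rows rather than three.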

The second method we have implemented in the tool is based on data sets of different sizes created around the composition of interest. A small data set is generated by the tool, and models are created on this small subset to predict the viscosity. Then bigger and bigger data sets are generated automatically, and the optimal size is evaluated by several statistical criteria associated with each subset.
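The second method's growing data sets can be illustrated in Python by ranking the database by distance to the composition of interest and taking nested subsets of increasing size. The distance metric, the subset sizes, and the compositions are assumptions for the sketch; the criteria used to pick the optimal size are not shown.

```python
import math

def growing_subsets(target, database, sizes):
    """Return nested lists of database row indices, nearest rows first.

    One subset is produced per entry in sizes; each contains the `size`
    compositions closest to the target.
    """
    order = sorted(range(len(database)), key=lambda i: math.dist(target, database[i]))
    return [order[:size] for size in sizes]

# Invented compositions and a hypothetical composition of interest
database = [[0.50, 0.50], [0.60, 0.40], [0.10, 0.90], [0.45, 0.55], [0.80, 0.20]]
target = [0.48, 0.52]
subsets = growing_subsets(target, database, sizes=[2, 3, 5])
```

Each subset would then be used to fit the models, and a criterion such as PRESS or RSquare computed on each fit would indicate at which enlargement the models start to degrade.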

Finally, the construction of the models is based on three different algorithms implemented in the tool. The first is a polynomial model obtained by multi-linear regression. The second is a generalized regression (GenReg) model, and the third is a neural net model. In the end, with three algorithms and two subsampling methods, we have six different calculated values, which makes the prediction very robust.

Let's go to JMP to see how it works. Let me first show you the code. The script, as you can see, is quite long, about 700 lines, which is quite complicated JSL code. The first thing you have to do is to enter the composition of the glass whose viscosity you want to predict. To do that, you can use an interface we have created. You just have to select the oxides entering the composition. For each oxide, you just have to enter the weight percent. Or if you want, you can enter the composition directly in the script, which is a little bit quicker, I would say.

Then you launch the script. I won't do that now because it takes about one or two minutes. It's not very long. But for this demo, I have already run the script. Let me show you the results. At the end of the calculation, you have this window where you can get a statistical report. Very interesting. First, here you have the composition you have entered, it's just a reminder. We have the graph showing the predicted values. On the Y axis, we have the predicted values of the viscosity calculated by the three algorithms and for the two methods.

On the X axis, we have the number of enlargements for the second method I have described. In red, which is the most important value, I would say, is the average of all the different predictions. It is the best prediction, I would say, of the glass viscosity. If we need more statistical details, we have a lot of information in this report to study the quality of each model. For example, we can check the values of the PRESS statistic for the multi-linear model.

For example, here we can see that the PRESS values tell us that the prediction using method number 1 is a little bit better than with the second method. We also see the model degrade as the training set is enlarged. We can also check the RSquare values for the two methods and for the different algorithms, and we can compare them. We can have even more details on the designs of experiments that were created and all the prediction formulas. This is a lot of information, but the most important part is here, which is the predicted value of the viscosity.
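As a reminder of what the PRESS statistic measures, here is a small Python sketch that computes it by brute-force leave-one-out refitting. It uses a one-predictor straight-line fit as a stand-in for the multi-linear model, which is a simplification made only to keep the example self-contained.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def press(xs, ys):
    """Predicted residual sum of squares.

    Each point is left out in turn, the line is refitted on the others,
    and the squared error at the left-out point is accumulated.
    """
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        total += (ys[i] - (intercept + slope * xs[i])) ** 2
    return total

# Three points that a straight line cannot fit well
press_value = press([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
```

A smaller PRESS means the model predicts left-out points better, which is why it can be used to compare the two subsampling methods and the successive enlargements.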

Let's go back to the PowerPoint. This is the result obtained for the simple SBN glass, which is, as I said earlier, a simple glass with only three oxides. We have calculated the viscosity for three different compositions of SBN glass. We compare the results obtained from our tool here with the results from the models available in the literature, in terms of the relative error of prediction, which has to be as low as possible. We can see that the best prediction results are obtained with our tool, which is really great.

Here, we have the results obtained on more complex glasses. The predictive capability of the tool was evaluated by extracting 230 rows from the global database. In this table, we have the relative error of the viscosity prediction for different types of glass and for the global subset of data. Three quantiles are given. The median means that 50% of the predicted values have a relative error below the value indicated here. We also give the 75% and 90% quantiles. When we talk about glass viscosity, traditionally, we consider that a prediction error of around 30% is very good.
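The quantile summary of the relative error can be sketched in Python. The sketch assumes the relative error is |predicted - measured| / |measured| on the viscosity itself (the talk does not say whether a log scale was used), and the numbers are invented rather than taken from the 230-row evaluation.

```python
from statistics import quantiles

def relative_error_quantiles(measured, predicted):
    """Median, 75% and 90% quantiles of the relative prediction error."""
    errors = [abs(p - m) / abs(m) for m, p in zip(measured, predicted)]
    # n=20 yields cut points at 5%, 10%, ..., 95%
    cuts = quantiles(errors, n=20, method="inclusive")
    return {"median": cuts[9], "q75": cuts[14], "q90": cuts[17]}

# Invented measured and predicted viscosities, for illustration only
measured = [10.0] * 11
predicted = [10.0 + k for k in range(11)]
summary = relative_error_quantiles(measured, predicted)
```

Reading the result the same way as the table: a median of 0.5 would mean that half of the predictions have a relative error below 50%.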

We see that for the majority of the data, the model capability is fine, and we were very happy with the results we obtained. Here are some very important key parameters. It is very important to take into account as many inputs from the glass experts as possible. For example, we had to create specific algorithms to handle the nature and the role of the oxides in viscosity. Another point of major importance is related to the origin and the reliability of the data.

For this, a significant amount of time in this project was spent on building a reliable database. We also had to study and implement different ways of calculating the distances between glass compositions. It's time to conclude now. We have presented two different case studies. In the first case study, we created several models and compared them for the risk assessment. We have seen that it was easy with JMP to perform Monte Carlo simulations, even for mixture designs with constraints. We have seen that it was also easy to perform space filling designs, again even for mixture designs.

By combining Monte Carlo and space filling designs, the worst-case probability has been estimated. In the second study, we presented the tool we created with an original method of subsampling the data. For each composition of interest, a specific viscosity model was constructed around this composition, and we have seen that the prediction accuracy on the viscosity was very promising and much better than that of the models available in the literature. Thank you for your attention.