Hello, everyone. Thanks for joining in. In this talk, I want to show how we prepared our batch process data for modeling with JMP Pro and gained valuable insights with a team approach. My presentation has two parts: a PowerPoint part first, and then all the details shown live in JMP, such as how the data set looks and which platforms I used, like Missing Data Pattern, multicollinearity, the Functional Data Explorer, Predictor Screening, and modeling the batch data with Boosted Tree and the Profiler.
The summarized data will be analyzed with Boosted Tree as well, and then with a script that performs Boosted Tree backward selection. First, about my company: Siltronic has world-class production sites all over the world, as shown here, and about 4,000 employees. Here are some key figures. We have complex process flows, as shown here: silicon is melted in a crucible and a silicon ingot is grown from it. That is my special task, developing processes for growing silicon ingots. The ingot is then ground and sliced.
Edge rounding is done for the wafers, then laser marking, lapping, cleaning, etching, polishing, and possibly epitaxy to create the final wafer. Our portfolio: we sell 300 millimeter, 200 millimeter, and smaller-diameter silicon wafers with several specifications for different applications, as shown here.
About me: by education I am an electrical engineer, I have some Six Sigma training, and my main task is to develop processes for growing silicon crystals, as shown here. I am also responsible for the roughly 500 JMP users at Siltronic. What does the task look like? What we see here is the final table, but creating it took a lot of effort as well. There are database queries behind it to get this data out of the database.
We fetched the results into JMP data tables, enlarged the data set with archives from earlier dates, and enriched it with information such as details of the experiments and of the consumables, and we wrote some scripts for graphs and evaluations. Then we did the modeling tasks and, of course, looked at missing data and correlations to see what the most significant effects are, and did feature engineering to see which features are important for generating an optimal result.
At this point, I will switch into JMP. We can see here the journal I am working with, the JMP main window, and the abstract. We will start with some technical hints. The data set I show here is fully anonymized and standardized, and all identifiers are generic, for a better understanding of what the features are, what the result is, and so on. The aim of this presentation is to show all the steps we needed for getting an overview, restructuring, and understanding the data set, and how to build models to gain insights from its content.
I will show some results that we discussed as a team. The team is very important here, because the team drove a lot of the discussion and the work as well: how to analyze the data, which features may be interesting and which not, and what the physics behind them may be. I will start with the data set; it is also part of my contribution in the community. Here it is opened, and I will change the layout a little bit to see how it looks. We have around 80,000 rows in this data set, and it is a batch data set, so we have a batch ID.
This data set is quite challenging because it is a mixture of historical data, the POR batches shown here, which make up most of the data, and only a few special experiments. We have several features: one categorical feature, the consumable; the batch maturity, which is the time, also standardized; and several numerical features, the X values here. We have one result column, and to reduce the noise a little bit, we also calculated a moving average of it.
Let's have a look at the data set in more detail. If we do a summary on the data like this (you can also do this from the Tables menu, Summary), we get around 500 rows, that is, 500 batches. This is a summary by batch, and we see that there is no variation in the parameters X1 to X4, meaning they are constant within each batch, while the others change at different rates.
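For reference, a minimal JSL sketch of that summary step could look like the following; the column names are the generic ones from the anonymized table, and the chosen statistics are just an example.

    // One row per batch, to check which parameters vary within a batch
    dt = Current Data Table();
    dtBatch = dt << Summary(
        Group( :Batch ID ),
        Mean( :Yield ),
        Std Dev( :X1 ), Std Dev( :X2 ), Std Dev( :X3 ), Std Dev( :X4 )  // ~0 means constant per batch
    );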
To get an overall look at the data, we can plot the result parameter, the yield, over time for all the rows of the batch data set; the smoothing here is done by the Graph Builder platform. The moving average itself we implemented as a column formula, since such a function is available in JMP. We can also have a look at some individual batches: using the local data filter, we can see how the average works and how much noise is in the single data points. The blue points are the original yield data, and the orange curve is the moving average.
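A minimal Graph Builder sketch of that view, assuming the moving average is stored in a column I will call Yield MA here, might look like this:

    // Yield and its moving average over batch maturity, filterable by batch
    Graph Builder(
        Variables(
            X( :Batch Maturity ),
            Y( :Yield ),
            Y( :Yield MA, Position( 1 ) )  // overlay the moving average on the same axis
        ),
        Elements( Points( X, Y( 1 ) ), Line( X, Y( 2 ) ) ),
        Local Data Filter( Add Filter( Columns( :Batch ID ) ) )
    );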
I will close this; the next point is to look at how much data is missing. We have a platform for this in JMP as well: we can select all the columns and run the Missing Data Pattern platform like this. It shows us that out of about 80,000 rows, 178 rows have missing data in one column. This can also be shown as a graph. This was very important, at least for the data preparation steps: to see where data is missing and then to fix it as far as possible.
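As a sketch, the same check can be scripted roughly like this (listing only a few columns; in practice all columns go in):

    // Tables > Missing Data Pattern on the selected columns
    dt = Current Data Table();
    dt << Missing Data Pattern(
        Columns( :Consumable, :Batch Maturity, :X1, :X2, :X3, :X4, :Yield )
    );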
Another way to look at the data is the Columns Viewer. I put all the columns in, and we can see again, as before, that some rows are missing for parameter X2. We can also see the min, max, mean, standard deviation, and so on for all the parameters. Here we see nicely that everything is standardized; the yield, for example, is between zero and 100. From here we can also start the Distribution platform: with all the columns selected, we get the distributions for all the data with a single click.
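Scripted, the Distribution launch could look roughly like this (only a few columns shown):

    // Distributions for the categorical and continuous columns in one report
    Distribution(
        Nominal Distribution( Column( :Consumable ) ),
        Continuous Distribution( Column( :Batch Maturity ) ),
        Continuous Distribution( Column( :X1 ) ),
        Continuous Distribution( Column( :Yield ) )
    );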
We can see which consumables are used how often, and that most data comes from the historic process and only some from experiments with special settings. The time, of course, looks nicely distributed, but the others do not look as nice: there is a lot of room between some settings, the data is sparse, and most parameters are not normally distributed, which makes this data even more challenging to analyze. Let's go to the next steps; I will close these reports. Next we can look in even more detail at some things, such as how the parameters are correlated. We can see this in the Multivariate platform; it needs some time to be calculated.
You will find it under the Analyze menu, Multivariate. It takes the numeric columns and generates this correlation report, and you will see that parameters like X5 and X6 are highly correlated, as are X9 and X10. This makes it difficult to do feature engineering. What we want to know from the analysis is which parameter causes a yield drop, and if two parameters are correlated, it is not so easy to find out which one is the responsible one.
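In JSL, the correlation report and scatterplot matrix can be launched roughly like this (again with only part of the column list written out):

    // Pairwise correlations and scatterplot matrix for the numeric columns
    Multivariate(
        Y( :Batch Maturity, :X1, :X2, :X5, :X6, :X9, :X10, :Yield ),
        Estimation Method( "Row-wise" ),
        Scatterplot Matrix( Density Ellipses( 1 ) )
    );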
Here in the scatterplot matrix, you can also see which parameters change with time (X1 up to X4 are constant over time, the others are changing) and how they are distributed, and you can nicely mark some rows, like here. They are then selected in the data table, and you can see how the curves look for each parameter over time, or what any parameter-versus-parameter combination looks like. Next, I want to use the Functional Data Explorer.
The Functional Data Explorer allows us to fit a curve for each batch and extract the features of each curve. Then we can see which batches behave similarly, or spot extreme ones. The start looks like this: we launch the analysis, I put the time in as the X parameter, Yield as the output parameter Y, and the ID function is the Batch ID. Then we have some additional columns here, like Part and Group.
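A sketch of that launch in JSL, with role names as I recall them (they may differ slightly between JMP Pro versions), would be:

    // Functional Data Explorer launch (JMP Pro only)
    Functional Data Explorer(
        X( :Batch Maturity ),
        Y( :Yield ),
        ID( :Batch ID )
    );
    // The spline models (for example cubic P-splines with 20 knots) are then fit
    // from the red-triangle menu of the report.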
This platform is available in JMP Pro only. When we start it, we could do some data processing here, but in this case it is not necessary. We can have a look at how each batch looks; there are a lot of graphs here, like this. We can mark the rows and see the marked rows in the data table as well. To go on with this platform, we need to fit models, such as P-splines, for each batch. JMP does this and determines by itself which splines are used and how many supporting functions, the knots shown here, are needed.
The best result is given by a cubic spline with 20 knots. You can see how each batch is modeled by the red line shown here and how it looks. Here we have the shape functions: each curve is composed as a combination of shape functions, and for each batch we get a coefficient for each shape function. If we look at Shape Function 1, this is the main behavior of all batches, with a drop at around 0.7. Here we have Component 1, which is the coefficient for Shape Function 1.
If we select these batches, we will see that they have a pronounced shape like Shape Function 1; we can see it here. We could also use the Profiler. This was mostly for understanding the data; we did not use it for further analysis, because we did not really need the shape of each batch curve. We were more interested in the average yield of each batch, because we cannot decide to use only the first part of a batch and forget about the second part. That would not work in our case.
To see again how this fits together, we can have a look in Graph Builder at some of the batches we saw just before. Maybe you recognize this number here; we have seen it before. Here it is shown again, together with the moving average of the yield. The next step is to start modeling the batch data.
When doing modeling, it is useful to have an idea which parameters are most important for the variability of the output. For that we have the Predictor Screening platform. You can start it from here, Analyze, Predictor Screening; I wrote it as a script simply to start it by pressing a button. When we run it, a Bootstrap Forest analysis is performed, and it shows us the importance of the features in our data set.
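A minimal scripted launch could look like this (only part of the factor list written out):

    // Predictor Screening: ranks the factors by their contribution in a Bootstrap Forest
    Predictor Screening(
        Y( :Yield ),
        X( :Batch Maturity, :Consumable, :Part, :X1, :X2, :X5, :X8, :X10 )
    );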
Time is the most important, but in the end this is useless for us because we need to use the full batch. Then come X1, Part, X8, X5, and so on. From here we could also select a few rows, copy them, and put them into a model. I will stop this here. To see which model type works best, I used the Model Screening platform. I will not run it here because it takes several minutes, but we saw that the Boosted Tree platform performs well. There is maybe not such a big difference to the next ones, but that is the reason why I used the Boosted Tree platform.
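For reference, a minimal Model Screening launch (a JMP Pro platform, here with default settings and only part of the factor list written out) might be scripted like this:

    // Model Screening: fits several model types and compares their performance
    Model Screening(
        Y( :Yield ),
        X( :Consumable, :Part, :X1, :X2, :X5, :X8, :X10 )
    );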
Then we run Boosted Tree on the batch data like this, and it runs quite quickly. We see the result, and a nice feature of the Boosted Tree platform is that you get column contributions, so you can nicely do some feature engineering. We can see here that we have 71% R square for training and 66% for validation, which may be okay, and we still have all features in. But we are mostly interested in which features are reliably the most important ones.
We can then save the prediction formula as a column in the data table. We see in the data table that we now have a formula, and we can use it, for example, to look at how the model performs, or in Graph Builder, simply to see how the modeled data looks over the batch maturity. I hope Graph Builder will show the graph soon, and here it comes. Yes. We have seen that this modeling works quite well, so we now have a formula to reproduce the data, and we can work with it.
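As a sketch, this Boosted Tree fit and the saved formula could be scripted roughly as follows; the tuning values are only placeholders, and the option names follow the launch dialog as I recall it.

    // Boosted Tree on the batch data with a random holdout, then save the prediction formula
    bt = Boosted Tree(
        Y( :Yield ),
        X( :Consumable, :Part, :X1, :X2, :X5, :X8, :X10 ),
        Validation Portion( 0.3 ),
        Number of Layers( 50 ),
        Learning Rate( 0.1 )
    );
    bt << Save Prediction Formula;  // adds the prediction formula column to the data table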
But especially for the batch data modeling, we have a problem here: the validation will not work well, because for a given batch some rows may end up in the training set while the neighboring rows end up in the validation set. Training and validation are therefore not well separated with respect to the features that control the batch. Additionally, the model is not very stable, so we get different results for different runs; this is known for tree-based methods, which may give different results on high-variability data.
If we run Boosted Tree twice, we also get different column contributions, and maybe a different order, as we can see here: Part and X5 are switched between these two runs. I will show it again here as well, running Boosted Tree twice. Yes, here I should have the script; it comes later. At this point, we concluded that it may be better to model the summarized data, because we need to use the full batch anyway. Here I have a script to summarize the data so that we have only one row for each batch.
There is a nice option, the statistics column name format, so that the summarized data gets the same column names as the original table and we can use the same scripts for both. Doing so, we get this summary data table (I can close the script) with around 500 batches, which is a lot easier to model. Here I have summarized the data for batch maturity 0.6 to 0.8, which is where the yield drop was.
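A simplified JSL sketch of that summary step, with only part of the column list written out, could be:

    // Summarize the batch-maturity window 0.6 to 0.8 down to one row per batch,
    // keeping the original column names so the same scripts run on both tables
    dt = Current Data Table();
    dt << Select Where( 0.6 <= :Batch Maturity <= 0.8 );
    dtWin = dt << Subset( Selected Rows( 1 ), Selected Columns Only( 0 ) );
    dtSum = dtWin << Summary(
        Group( :Batch ID, :Consumable, :Part ),
        Mean( :Batch Maturity ), Mean( :X1 ), Mean( :X5 ), Mean( :Yield ),
        Statistics Column Name Format( "Column" )
    );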
On this summary table we can again run Predictor Screening like this. I deliberately kept time in the data set to see what the noise level is: parameters with contributions around that level are likely also just noise for the model. Then we can, of course, do some model comparison. I selected a few parameters that we found to be most probably responsible, I ran two Boosted Tree analyses, and then I compared both models. It looks like this: we get a Profiler and can compare the results for different settings, maybe like this.
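A sketch of that comparison in JSL, assuming the two saved prediction formula columns are named Pred Yield 1 and Pred Yield 2 (hypothetical names), could be:

    // Compare two saved Boosted Tree prediction formulas against the measured yield
    Model Comparison(
        Y( :Yield ),
        Predictors( :Pred Yield 1, :Pred Yield 2 )
    );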
Here we still have the problem that we see some features in one model and not in the other, like this effect for X10, which appears in this model but not in that one, so it is likely just noise. At the beginning, we discussed these differences a lot; we saw them sometimes and sometimes not, and asked what is real, what is physical, and what is not. That brought me to the next step: we needed to continue with feature selection, and that is why we created this script. It takes the summary data, and the same has been done for the batch data as well.
For each step, it builds a Boosted Tree model for the current parameter set, saves the model into the Formula Depot, writes the model performance (R square and so on) into a data table, and shows us the column contributions.
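The actual script is part of my community contribution; a much simplified sketch of the backward-selection idea could look like this. The factor ranking and the tuning values here are placeholders, not the real results.

    // Backward selection sketch: fit a Boosted Tree on the current factor set,
    // publish it to the Formula Depot (JMP Pro), then drop the least important remaining factor
    dt = Current Data Table();
    factors = {"Part", "X1", "X5", "X2", "X8", "X10"};  // assumed ranking, most important first
    While( N Items( factors ) >= 2,
        xArg = ":" || Concat Items( factors, ", :" );  // e.g. ":Part, :X1, :X5"
        Eval( Parse( Eval Insert(
            "bt = dt << Boosted Tree( Y( :Yield ), X( ^xArg^ ), Validation Portion( 0.3 ) );"
        ) ) );
        bt << Publish Prediction Formula;  // sends the model to the Formula Depot
        Remove From( factors, N Items( factors ) );  // drop the last, least important factor
    );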
Here we see something that showed up often: with the higher-numbered models, the ones with fewer parameters, we get the best result, as the column contributions show. It looks a bit different for each run, but we see the same tendency most of the time. So we can be fairly sure that Part, X1, and X5 are the most important parameters. This one may be there sometimes and sometimes not, so we will focus on these three parameters. We can also have a look in the Formula Depot and start a model comparison from there. Maybe we compare the first model; we do it like this, Model Comparison, this is our data table, take the first one. The numbers here are shifted by one, so maybe the fifth should be that one, and the last one. This will not work; I think it is number three, and the last one.
So we take the ones with the highest validation score and compare them here. We see the Model Comparison dialog, and we see that the last model is among the best models we could fit at all. We can look at the Profiler here, for example, and we can also use extrapolation control. We have seen that the data is sparse, so there is not data behind every point. Let's look for where it is; here it is: extrapolation control, warning on. It warns us when there is no data between the points.
Here we can compare the models, and we see that there is no variability from the X factors that have not been used. To sum up (let me close some tables and dialogs first), we have prepared a workflow for modeling this data, have gone through several steps, and wrote additional scripts to enhance understanding and to drive the discussion about what is important and what is not.
I have a proposal for a model and some tasks we can focus on to improve the yield of our process, and you will find the data and the presentation in the user community. If you have other ideas on how to explore this data set and how to find the final best model, you can contact me or post something on my contribution in the community for this Discovery Summit. Thanks for your attention, and bye. That's it, Martin.