
Hybrid Predictor Screening and Least Squares RSM Approach

Semiconductor operations face challenges in predicting device performance throughout the comprehensive fabrication process. This lack of predictability has resulted in decreased yield and increased costs. This project demonstrates how to establish a predictive model of inline metrology data (3 response variables (Y) and 300+ predictor variables (X), with a dataset of ~400 data points) to predict device performance.

We utilized the JMP Multivariate platform to assess the project complexity: (1) relatively weak Y-Y competition, (2) strong Y-X causation, and (3) severe X-X multicollinearity. Both standard least squares and stepwise algorithms showed severe overfitting risk when considering all 300+ predictors. Three modern data mining dimension-reduction techniques were used to select the vital few Xs for subsequent RSM modeling: (1) Partition, (2) Neural, and (3) Predictor Screening. Within regular (non-Pro) JMP, Predictor Screening, which uses a Random Forest algorithm, effectively resolved the concerns with recursive partitioning and provided a more uniform contribution pattern. Based on the predictor ranking from Predictor Screening, we ran a hybrid model that combines Fit Model, Neural, Partition, and Ensemble Models to predict device performance. We also implemented recursive combination iterations and established a stopping criterion to ensure efficient model development.

This JMP project has successfully enabled us to enhance our predictive capabilities and extract valuable insights for the design of the device. The new product development cycle time can be significantly reduced through a reliable predictive model.


Hello, everyone. I'm Chen-Chih Hsu. I'm a Process Engineer at Applied Materials. Today, I'm going to present a Hybrid Predictor Screening and Least Squares RSM Approach. The presentation is a collaborative effort between Charles and myself.

Semiconductor operations are facing challenges in predicting device performance throughout the whole fabrication process. This lack of predictability has resulted in decreased yield and increased cost. To address this issue, our motivation is to develop a statistical model to predict device performance, which is our Y, based on our inline metrology data, which is our X. In the fabrication process, we have deposition, lithography, etch, and some inline metrology measurements.

If we can build a model to predict our Y, then we can provide feedback to the previous modules to monitor whether there is any process shift, so we can improve quality control, increase yield, and even optimize the entire process flow. In our data set, we have three responses and about 300 predictors, with around 200 predictors relevant to each response. Among all these predictors, there are some independent predictors for each Y, and there are also some shared predictors across all the Ys. As you can see, this is a very complicated model.

Throughout this report, we want to explore several key questions. For example, we want to understand why Y1 and Y2 show higher correlation. Additionally, we want to know the relationship between Y and X, so that we know how to optimize our device performance. All these questions are important for a comprehensive understanding of this project. Since it is a complex model, we first use a multivariate approach to check the Y-Y, Y-X, and X-X relationships.

If we look at the Y-Y relationships in this multivariate plot, you can see that they basically have a positive relationship. These Ys are not competing with each other, which is reasonable because in our design the goal is to increase all three Ys together. Additionally, for each Y there are some independent predictors that control it, which is why the Ys are less likely to interfere with each other. This design is desirable because it gives us better device performance.

Next, we can look at the Y-X causation. If you look at the leftmost column, you can see we have a very strong Y-X relationship, with a roughly linear relationship between the Y and the X. Ideally, because of this relationship, we should be able to build a model based on all the Xs to predict our Y.

Now we can investigate the X-X dependency. You can notice that when we compare all the X values, some X-X relationships are very strong and very linear, while others are not as significant. This indicates that our Xs have severe multicollinearity concerns, so it will be very difficult to determine the effect of each predictor. Our solution is to conduct more research to understand why some of the Xs have multicollinearity concerns. Also, when we build a model, we need to remove less significant terms and perform model validation.

At the beginning of my approach, I used the Standard Least Squares model, putting all my predictors into Fit Model, to observe how my Y responds. However, we encountered several challenges. First, taking Y1 as an example, when we fit the model with all the predictors, we see an overfitting issue because the R-Square is larger than the R-Square Adjusted by more than 10%.

The second issue is multicollinearity. As I mentioned previously for the X-X relationships, some of the X terms show a very strong linear relationship with each other. If we look at the VIF column in the Parameter Estimates table, you can see that the VIF for some of the predictors is very large, which is not desirable.

The next issue is that we also observe very strong interactions in our model. Looking at the summary report, we can sum up the Main effects and the Total effects; the difference between the Main effect and the Total effect is the Interaction effect. Based on this model, the interaction is over 36%, which shows that we have strong interactions in this model. Because of the interactions, I also tried to include all the predictors in the RSM model with all the interaction terms considered, but JMP gave me an error message because we don't have enough degrees of freedom.

Here, I list the number of predictors and the degrees of freedom needed to build a model. For example, considering one Y, we have around 200 predictors. If we want to include the Intercept, the Main effects, the Quadratic terms, and the Interaction terms, we need about 20,000 degrees of freedom to build a full RSM model, which is not feasible with my data set.
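As a rough sanity check on those numbers (my own back-of-the-envelope calculation, not part of the talk), the parameter count of a full quadratic RSM with p predictors can be computed directly:

```python
from math import comb

def rsm_terms(p):
    # Full quadratic RSM: intercept + p main effects + p quadratic terms
    # + all two-way interactions (p choose 2)
    return 1 + p + p + comb(p, 2)

print(rsm_terms(200))  # 20301 -> roughly the ~20,000 degrees of freedom mentioned above
print(rsm_terms(20))   # 231   -> why ~20 predictors fit comfortably within ~400 data points
```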

Considering the total number of data points and predictors, it seems reasonable to select approximately 20 predictors for our analysis, because if we can trim the predictors down to about 20, we only need about 250 data points to build a model. Because of all these challenges, we need to move to data mining and identify the key predictors for a better model. In the JMP analysis, my approach primarily focuses on finding a better model with acceptable R-Square Adjusted and RMSE, no overfitting concerns, and good stability. We also need to identify the key predictors for model simplification. I tried three different analyses here.

The first method is Neural. This is a machine learning method that can consider nonlinear and interaction effects, and it provides both a model and the key predictors. Second, I tried the Partition method, which uses a decision tree approach to classify predictors and find the best way to predict the response. It can also give me a model and key predictors. Third, I used Predictor Screening, which uses a Random Forest approach to identify the strong and sensitive predictors. With this method, however, we can only find the key predictors; it does not provide a model.

The other topic I want to talk about is that we can still resolve this complicated model without using JMP Pro. JMP Pro offers more advanced analysis, including advanced modeling, model comparison and selection, and model cross-validation; with regular JMP, we don't have these capabilities. To overcome this limitation, we can use different approaches. For example, we can use a hybrid model to combine Predictor Screening with different modeling methods, we can manually compare each model ourselves, and we can use Ensemble models built from our current models to get a better model. For cross-validation, we can also try JMP JSL scripting, but that part will not be covered in this report.
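For readers who want to script the cross-validation step outside JMP, here is a minimal, hypothetical sketch of manual k-fold cross-validation in Python; scikit-learn and the synthetic data are stand-ins for the JMP platforms and the metrology table, and the helper name kfold_r2 is my own:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def kfold_r2(X, y, model, k=5, seed=0):
    # Validation R^2 per fold; a big gap vs. the training R^2 signals overfitting
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))
    return np.array(scores)

# Synthetic stand-in for the ~400-run metrology table
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)
print(np.round(kfold_r2(X, y, LinearRegression()), 2))
```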

Okay, let's look at my first analysis for the predictive model. We start with the Neural network analysis, again using Y1 as an example. Here, we use KFold with K=5 and Hidden Nodes=20, because the KFold option includes cross-validation to enhance stability and reduce overfitting of the Neural model. But because the Neural method uses a very intensive mathematical model, I ran it five times, and the results were very different.

Like these three highlighted runs here: you can see that the differences between the training set and the validation set are very large. It tells us that this model is not stable and that we may have an overfitting concern. Occasionally, as in this run, it can give us a high R-Square with low overfit, but by running the Neural model several times we know that most of the runs show the overfitting concern, so it is not very stable to apply the Neural model to our data set.
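The repeated-fit stability check described above can be sketched as follows (a hypothetical illustration only; scikit-learn's MLPRegressor with 20 hidden nodes stands in for the JMP Neural platform, and the data are synthetic):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in data; in practice X and y come from the metrology table
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 23))
y = X[:, 0] + 0.3 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=400)

for run in range(5):  # repeat the fit, as in the talk, to judge stability
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=run)
    nn = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=run).fit(X_tr, y_tr)
    gap = r2_score(y_tr, nn.predict(X_tr)) - r2_score(y_val, nn.predict(X_val))
    print(f"run {run}: train - validation R^2 gap = {gap:.2f}")  # large or erratic gaps -> unstable, overfit
```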

Additionally, we can use the Prediction Profiler to rank the importance of the different predictors, and it shows that the interaction effect is around 85% of the model, indicating a very nonlinear relationship between the predictors and the response. Also, the top four predictors show similar Total effects, which will make it very challenging for us to optimize the model. The Total effect is larger than the Main effect, which again shows that we have very strong interactions, because there is a very heavy mathematical transformation behind this model. In conclusion, the Neural method is not adequate to identify the key predictors due to its instability and overfitting.

Our next analysis uses Partition. Partition uses a tree-like binary classification approach to categorize predictors and find the best model for prediction. In my analysis, I use a validation portion of 0.2. Compared to the Neural model result, the R-Square here is similar, but if you look at the ranking of the predictors, the top six predictors contribute about 80% of the variance, with the first predictor contributing the most. When using this partition method, most of the contribution is allocated to the first split, so the remaining portion for the other predictors keeps decreasing.

Given the instability we just saw with the Neural model, we also performed multiple runs. I still see very different R-Square values between the training and validation sets, so this model is also not adequate for identifying the key predictors due to its instability and overfitting.
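The Partition idea can be illustrated with a single regression tree and a 20% holdout, as used above (a hypothetical sketch with synthetic data; scikit-learn's DecisionTreeRegressor stands in for the JMP Partition platform). It also shows how the predictor used for the earliest splits dominates the contribution ranking:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a few dominant predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 23))
y = 2 * X[:, 0] + X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=400)

# Single regression tree with a 20% validation holdout
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

# The predictor chosen for the first splits dominates the importance ranking
order = np.argsort(tree.feature_importances_)[::-1]
print([(int(i), round(float(tree.feature_importances_[i]), 2)) for i in order[:6]])
print("validation R^2:", round(tree.score(X_val, y_val), 2))
```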

From our previous results, here is a short summary comparing the Standard Least Squares model, the Neural network, and Partition. For the Standard Least Squares model, we have a higher R-Square, but we still suffer from overfitting and very strong multicollinearity concerns, and this method considers the Main effects only. For Neural and Partition, we also see the overfitting concern, they are not stable, and neither method can reliably identify the key predictors. We need to think about some other model to find the key predictors.

Okay, from the previous study, we know that Neural and Partition are not adequate for our model. So next, we use Predictor Screening, which uses the Bootstrap Forest (Random Forest) partition method. This method uses multiple decision trees, each built with random subsets of the predictors, to enhance stability.

When setting up this model, I used 5,000 trees because it gives a more stable result. When I ran this analysis, it took about 15 minutes of computation, which is still acceptable for my analysis. Once we have the ranking of these predictors, we also want to know the optimal number of predictors to use, so we created a cumulative contribution plot to visualize the cumulative contribution of the predictors.
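The cumulative contribution plot can be reproduced from any ranked list of contributions: sort them, take the running sum, and find how many predictors reach a target coverage. A minimal sketch (the helper name and the example contributions are hypothetical):

```python
import numpy as np

def predictors_for_coverage(contributions, target):
    # Smallest number of top-ranked predictors whose contributions
    # sum to `target` (e.g. 0.6 or 0.8) of the total contribution
    c = np.sort(np.asarray(contributions, dtype=float))[::-1]
    cumulative = np.cumsum(c) / c.sum()
    return int(np.searchsorted(cumulative, target) + 1)

# Hypothetical contributions standing in for the Predictor Screening ranking
rng = np.random.default_rng(0)
contributions = rng.pareto(1.5, size=200)
print(predictors_for_coverage(contributions, 0.60))
print(predictors_for_coverage(contributions, 0.80))
```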

For instance, if we choose 23 predictors, we cover about 60% of the contribution, and if we choose 75 predictors, we cover about 80% of the contribution, which agrees with the Pareto 80/20 principle. Since the primary goal of my project is to significantly reduce the complexity of the model, I chose to use the 23 predictors for my analysis. In JMP, the Predictor Screening feature can only help us identify the important predictors without providing a model. To address this limitation, we utilize a hybrid model that combines Predictor Screening and Fit Model to simulate the Random Forest modeling available in JMP Pro. The result of this hybrid model shows we don't have an overfitting concern.
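Conceptually, the hybrid model amounts to ranking predictors with a random forest and then refitting least squares on the top-ranked subset. A minimal sketch of that idea, with synthetic data and far fewer trees than the 5,000 used in the talk:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: ~400 runs x ~200 metrology predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 200))
y = X[:, :5] @ np.array([2.0, 1.5, 1.0, 0.8, 0.5]) + rng.normal(scale=0.5, size=400)

# Step 1: screen predictors by random forest importance
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:23]

# Step 2: refit ordinary least squares on the reduced predictor set
ols = LinearRegression().fit(X[:, top], y)
n, p = X.shape[0], len(top)
r2 = ols.score(X[:, top], y)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(adj_r2, 3))  # small R^2 vs adjusted-R^2 gap -> no obvious overfitting
```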

Moreover, if you look at the VIF, although there are still some VIF values larger than 10, most of the predictors we use here are below 10. To better understand why some of our terms still show high VIF, we performed a Cluster Variables analysis on these high-VIF terms. After the analysis, all these predictors can be divided into two groups.
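For reference, the VIF of predictor j is 1/(1 − R²_j), where R²_j comes from regressing X_j on all the other predictors; values above about 10 flag multicollinearity. A minimal sketch of that computation (an illustration, not the JMP implementation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    # VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing column j
    # on all the other columns; values above ~10 flag multicollinearity
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return np.array(out)

# Two nearly identical columns (like two predictors tied to the same CD) blow up the VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=400)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=400), rng.normal(size=400)])
print(np.round(vif(X), 1))
```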

The first group is highlighted in orange here, and the second group is highlighted in blue. It turns out that when we revisit our [inaudible 00:21:10] data, all the predictors in the first cluster have a similar CD, and these three here share another CD. It tells us that we don't actually need to include all these predictors in our model. Since the predictors within each group are highly correlated, we only need to choose one from this group to include in our model, and likewise only one from the other group.

Based on the previous Cluster Variables analysis, we were able to trim the initial 23 predictors down to about 18, after which all the VIF terms are below 10. Furthermore, since we have significantly reduced the total number of predictors from about 200 to 18, we can now consider the two-way interactions. The result here shows that for Hybrid Model 2, with the two-way interactions considered, the R-Square is improved by about 20% compared to the Main-effects-only model. This Hybrid Model 2 also outperforms the previous Neural and Partition methods. By optimizing the predictor variables and considering the two-way interactions, we have arrived at a more refined and improved model.
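With the predictor set reduced to about 18, adding all two-way interactions becomes tractable. A hypothetical sketch of how interaction terms can recover fit that a main-effects-only model misses (synthetic data, scikit-learn as a stand-in for Fit Model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic response driven partly by a two-way interaction
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 18))
y = X[:, 0] + 0.8 * X[:, 1] * X[:, 2] + rng.normal(scale=0.3, size=400)

# Main-effects-only fit vs. main effects plus all two-way interactions
main_r2 = LinearRegression().fit(X, y).score(X, y)
X2 = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)
full_r2 = LinearRegression().fit(X2, y).score(X2, y)
print(round(main_r2, 2), round(full_r2, 2))  # the interaction terms recover the missing fit
```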

Next, based on the previous learning and methodology, we can apply all of this to Y2 and Y3. Overall, you can see that by using the hybrid model combining Predictor Screening and Fit Model, for all three responses we can reduce the predictors by around 90% from the original data set. The R-Square values are all still acceptable, around 70 to 80%, and after reducing the predictors we don't have any overfitting or VIF issues.

My next step is to compare different models, including Predictor Screening combined with the Neural model and Predictor Screening combined with the Partition model. For the Neural model that considered all the predictors, we showed on a previous slide that there is a severe overfitting concern. With the hybrid Predictor Screening plus Neural model, we still have some overfitting concern, but compared to Neural with all the predictors, the concern is smaller. For Partition with all the predictors versus the hybrid Predictor Screening plus Partition model, we don't see much improvement.

My next step is to compare all these different models. Here, I list all the models we have so far: the Least Squares model and several hybrid models, namely Predictor Screening combined with Neural, Predictor Screening combined with Partition, and Predictor Screening combined with Least Squares and its interactions. You can see that for these hybrid models we only need around 18 predictors, and we still have acceptable R-Square. Overall, among the hybrid models, Hybrid Model 2 has the highest R-Square and the highest R-Square Adjusted. But does that mean this model always performs better than the other models?

I list two examples here, comparing the Hybrid Neural, Hybrid Partition, and Hybrid Least Squares models against our real data, which is the response right here. In the first example, if I select this data point, you can see the corresponding predictions from the three models; the Hybrid Neural model gives a value closer to our real result. But take this as another example: if I pick this data point, and these three values are the corresponding predictions from each model, then the Hybrid Least Squares model gives me the better prediction.

I wondered: can we combine all these models to further enhance our predictions? This is a feature of JMP Pro that I found at this link. It introduces a concept called Ensemble modeling, also known as a Model of Models. The idea is that we can select multiple high-performing models to build an Ensemble model. I was inspired by the potential of the Ensemble model and interested in exploring its application to my current data set. Following the same methodology, I wanted to know whether we can create a combined model using regular JMP.

Currently, we have the three different hybrid models I have shown before. To apply the Ensemble model methodology, we use different approaches to combine these three models. Model 5 uses an Average algorithm, simply averaging the three current models. Model 6 is a Weighted algorithm: each model is given a different weight to see if we can get a better model. Model 7 uses a Neural algorithm to combine the three models, and Model 8 uses a Partition algorithm to combine all three models.
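A minimal sketch of these four ensemble variants, assuming the three hybrid models' predictions are available as columns (all names, weights, and data below are hypothetical, and in practice the neural and partition combiners should be trained on held-out data):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# pred_ls, pred_nn, pred_part: predictions of the three hybrid models on the same runs
rng = np.random.default_rng(0)
y = rng.normal(size=400)
pred_ls, pred_nn, pred_part = (y + rng.normal(scale=s, size=400) for s in (0.3, 0.4, 0.5))
P = np.column_stack([pred_ls, pred_nn, pred_part])

ens_avg = P.mean(axis=1)                        # Model 5: simple average
w = np.array([0.5, 0.3, 0.2])                   # Model 6: weighted average (weights are hypothetical)
ens_wtd = P @ w
ens_nn = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000,
                      random_state=0).fit(P, y).predict(P)              # Model 7: neural combiner
ens_tree = DecisionTreeRegressor(max_depth=3,
                                 random_state=0).fit(P, y).predict(P)   # Model 8: partition combiner

for name, pred in [("average", ens_avg), ("weighted", ens_wtd), ("neural", ens_nn), ("partition", ens_tree)]:
    print(name, round(float(np.sqrt(np.mean((y - pred) ** 2))), 3))     # RMSE of each ensemble
```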

This table summarizes the mean of the response, RMSE, R-Square, R-Square Adjusted, and the potential risk for each model. If we use this plot to visualize the table, this point is the mean of the response and the range is the RMSE. I rank the models by R-Square Adjusted from highest to lowest using a color scheme. You can see that Model 7, which combines the three models with the Neural algorithm, gives us the best R-Square Adjusted and also the smallest RMSE.

In conclusion, our hybrid approach of Predictor Screening and Least Squares modeling can build a model with about a 90% reduction in predictors. Based on this model, we can understand the positive Y-Y relationship, co-optimize the Y-X relationships, and reduce the X-X dependency. Through the development of the several hybrid models and the Ensemble model, we have built several models to predict the device performance as well.

Looking ahead, we have some opportunities for future work. We can use Predictor Screening to compare the ranking differences between different models. We can also try conducting recursive combination iterations and determining the stopping criteria. We also want to integrate a Dashboard or Group Script to summarize and compare all the critical model KPIs.

I also want to thank my mentor, Charles, for mentoring me and giving me a lot of technical support on this project. I also want to thank the Applied Materials CTO PPB team for their great support in collecting all the data. That's all for my presentation today. Thank you.