Machine Learning-Assisted Experimental Design for Formulation Optimization
Design of experiments (DOE) is a statistical method that guides the execution of experiments, analyzes them to detect the relevant variables, and optimizes the process or phenomenon under investigation. The use of DOE in product development can result in products that are easier and cheaper to manufacture, have enhanced performance and reliability, and require shorter product design and development times.
Nowadays, machine learning (ML) is widely adopted as a data analytics tool due to the increasing availability of large and complex data sets. However, not all applications can afford big data. For example, in the pharmaceutical and chemical industries, experimental data sets are typically small due to cost constraints and the time needed to generate valuable data. Nevertheless, incorporating machine learning into experimental design has proved to be an effective way to optimize formulations with a small data set that can be collected more cheaply and quickly.
There are three parts in this presentation. First, the literature relevant to machine learning-assisted experimental design is briefly summarized. Next, an adhesive case is presented to illustrate the efficiency of combining experimental design and machine learning to reduce the number of experiments needed for identifying the design space with an optimized catalyst package. In the third part, which pertains to an industrial sealant application, we use response surface data to compare the prediction error of the RSM model with models from various machine learning algorithms (RF, SVR, Lasso, SVEM, and XGBoost) using validation data runs within and outside the design space.
Hi. I'm Stone Cheng from Henkel Corporation. Today, my presentation is about machine learning-assisted experimental design for formulation optimization. For the agenda, I will briefly discuss machine learning and its association with DOE applications. This association will be illustrated in two case studies. Case 1 is related to the use of active learning to find a design space. Case 2 uses machine learning to improve DOE prediction performance.
About my company, Henkel. It's a 22-billion chemical company. We have two business groups. One is consumer brands, consisting of laundry, home, and beauty care. The other group is adhesive technologies, which I am part of. We are a leader in the global adhesive market, serving 800 industries with more than 20,000 products.
Machine learning is a booming topic, as illustrated by the many review articles published in the last four years, covering a wide range of industrial applications, including methodology, pharmaceutical, chemical, and 3D printing, as well as uses in quality, innovation, and manufacturing.
Typical machine learning uses large data sets, which is the opposite of DOE practice, in which you want to learn the most with the smallest number of runs. Nowadays, broadly speaking, DOE is used together with machine learning in two ways, as illustrated in this 2021 review article. On one hand, machine learning is used to analyze DOE data. On the other hand, in what is called active learning, the results of the machine learning modeling step are used to suggest the next experimental configurations. In my presentation, I have examples of both cases.
Case 1 is a target-guided machine learning DOE to study the design space and formulation optimization of an adhesive. Target-guided is a fancy term; it can also be understood as supervised learning in this case. In Case 1, we need to improve adhesive shelf stability by incorporating two stabilizer additives in the formulation. Stability is monitored by two responses: one is flow, the other is adhesion retention.
Now, this chemistry is new to Henkel, and the chemist is also new to it. The chemist decided to set the initial range of additive 1 between 0.1 and 0.5, and 0.5 to 3 for additive 2. The chemist used one-factor-at-a-time and various DOE methods, including factorial, RSM, and space-filling designs, trying to find a design space that can meet both target specs.
Out of the 22 runs, the chemist identified 2 runs that satisfy both the flow and the adhesion requirements. The question is: is there a more efficient way to identify the design space?
Our approach to improving the efficiency of formulation design involves two stages. In stage 1, we utilize active learning to identify the design space. In stage 2, we apply predictive modeling for adhesive optimization. In stage 1, a space-filling design picks 5 runs as the initial trial. Then both responses are modeled with SVR, support vector regression.
Then, from the models, the contour plots are generated and overlaid to show the desired white area that meets both specs. Then the space-filling design is augmented based on the region suggested by the overlaid contour plots. This is considered one cycle.
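To make the mechanics of one cycle concrete, here is a minimal sketch outside of JMP, assuming hypothetical spec limits (`flow_max`, `adh_min`, with flow capped above and adhesion required above a floor) and generic RBF-kernel SVR settings; the actual modeling in this case was done in JMP.

```python
# Minimal sketch of one active-learning cycle: fit an SVR to each response,
# then scan a grid over the additive ranges to find the "white area" that
# meets both (hypothetical) specs. Spec limits and SVR settings are assumptions.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def feasible_region(X, y_flow, y_adh, flow_max=10.0, adh_min=80.0):
    flow_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
    adh_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
    flow_model.fit(X, y_flow)
    adh_model.fit(X, y_adh)

    # Dense grid over the additive ranges from the case study
    a1 = np.linspace(0.1, 0.5, 50)   # additive 1
    a2 = np.linspace(0.5, 3.0, 50)   # additive 2
    grid = np.array([[x1, x2] for x1 in a1 for x2 in a2])

    # Points predicted to meet both specs define the region used to set
    # the boundary for the next augmented space-filling design
    ok = (flow_model.predict(grid) <= flow_max) & (adh_model.predict(grid) >= adh_min)
    return grid[ok]
```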
This cycle is then repeated two more times, for a total of 15 runs at the end. Then, in stage 2, we handpick the design space for modeling and also the runs used for validation. Steps 5, 6, and 7 involve the chemist's input; those are what we call the target-guided steps.
JMP Space Filling Design offers seven options for selecting the run locations. Here is an example for a two-factor system. I am using the Latin hypercube since it is commonly used in the open literature. The operation of phase 1, or cycle 1, is shown here. Number 1 is the run data with the SVM prediction columns. Number 2 shows the SVM model fits for both responses. Number 3 is the contour plot from each model.
Number 4 is the overlay of the two contour plots, in which the white area meets the design targets of both responses. The tip of the white area, seen in this magnifying glass, is the new upper design boundary that is used in step 5 to augment the space-filling design and request 5 more runs.
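For reference, here is a minimal sketch of generating 5 Latin hypercube runs scaled to the two additive ranges, using SciPy as a stand-in for JMP's Space Filling Design platform; the seed and library choice are assumptions for illustration only.

```python
# Minimal sketch: 5 space-filling runs via Latin hypercube sampling,
# scaled to the additive 1 (0.1-0.5) and additive 2 (0.5-3.0) ranges.
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=1)
unit_runs = sampler.random(n=5)                                   # 5 runs in [0, 1]^2
runs = qmc.scale(unit_runs, l_bounds=[0.1, 0.5], u_bounds=[0.5, 3.0])
print(runs)                                                       # candidate additive levels
```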
In active learning phase 2, cycle 2, we now have 10 runs to use for the modeling. Following the same process as in phase 1, a new design boundary, right here, is suggested for augmenting the space-filling design for phase 3. This slide is the last phase, phase 3; we have a total of 15 data points for further modeling.
Number 13 is the summary, with the design space on the right and the response map on the left, where the upper left, this region, is the target design space that meets both the adhesion and flow requirements. The data in each phase are color-coded: blue for phase 1, red for phase 2, and green for phase 3. The runs that satisfy both design targets are shown as solid circles here, and the others are open circles, where either one or both responses fail.
Out of the 3 phases, 15 total runs, 5 runs satisfy the design targets, which is about 33% efficiency. This is much better than 2 out of 22 runs in the previously discussed non-active-learning DOE approach, where the efficiency is only 9%. The fact that it took machine learning only 10 runs (combining phase 1 and phase 2), versus 22 DOE runs, to find three versus two accepted target runs suggests that active learning gives at least a two-fold improvement in efficiency.
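Stated as simple hit-rate ratios, the comparison is:

$$
\frac{5}{15} \approx 33\% \ \ \text{(active learning)} \qquad \text{vs.} \qquad \frac{2}{22} \approx 9\% \ \ \text{(non-active DOE)}
$$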
In prediction modeling, if a data set contains more runs satisfying the target specs, the model is likely to perform better. We anticipated that the active learning approach would yield better prediction results, since it has five such runs in its space and the regular DOE has only two. This active learning example has only 15 runs, and we want to avoid any potential outliers that could skew the data.
We decided to exclude the runs that do not meet both design targets, which are runs 1.2 and 1.3. With this change, there are only 13 runs in our modeling space, which is shown in this red triangle here. In model validation, we want to know how good the models' predictions are in the region where both design targets are met.
We are not interested in prediction at the outer edge. We select two validation points in the desired design space, marked by the orange dots right here, and one point near the boundary, which is run 3.2, right here. With this, there are 10 runs in the training set and 3 runs in the validation set.
The chemist's decisions in selecting the modeling space and the validation runs here are target-driven. These steps, together with the augmentation of the space-filling design, are what we call the target-guided steps. I duplicate the response column, R10, here, with the validation runs' data removed. This column is used for the machine learning models that do not offer a validation option for parameter tuning. Here is the list of the machine learning models used in this prediction modeling.
We can apply the validation column in all models except two: one is Gaussian process regression, and the other is the SVEM neural network offered by Predictum Inc. The model for each algorithm was executed, and its prediction formula is saved in a column right here.
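As a minimal sketch of this train/validation comparison outside of JMP, here is how fitting a few regressors on 10 training runs and scoring 3 held-out validation runs might look; the placeholder arrays, model choices, and hyperparameters below are assumptions for illustration, not the actual JMP fits.

```python
# Minimal sketch: fit several regressors on the 10-run training set and
# compute RMSE on the 3 held-out validation runs. Data are placeholders.
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.uniform(size=(10, 2)), rng.uniform(size=10)   # stand-in for 10 training runs
X_val, y_val = rng.uniform(size=(3, 2)), rng.uniform(size=3)         # stand-in for 3 validation runs

models = {
    "SVR": SVR(kernel="rbf", C=10.0),
    "Lasso": LassoCV(cv=5),
    "Boosted trees": GradientBoostingRegressor(n_estimators=200),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(np.mean((y_val - model.predict(X_val)) ** 2))      # validation RMSE
    print(f"{name}: validation RMSE = {rmse:.3f}")
```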
The model comparison was done in two ways: one uses the JMP Model Comparison platform, and the other is to calculate a root average square error myself. To do so, I first stack the prediction formula columns and then create a root average square error column myself. On this slide, the JMP Model Comparison results are on the left and my own calculations are on the right. By comparing the data in the red and green boxes in both tables, I confirm that my own root average square error calculation is correct.
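For reference, the root average square error computed for each model's stacked prediction column is simply:

$$
\mathrm{RASE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

where $y_i$ are the observed responses and $\hat{y}_i$ the model's predictions over the $n$ runs being compared.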
With model validation, if we look at the table on the right here, the top three machine learning algorithms with the lowest root average square error are these: the boosted neural network, the boosted SVM, and then the RSM model, in that order. But without validation to help tune the hyperparameters, the boosted neural network and boosted SVM exhibit higher root average square errors than their validated counterparts: 1.7 here, versus 0.27 or 0.936 there.
However, the Gen_Reg SVEM_Lasso model, shown in the blue circle here, doesn't seem to be affected much by having or not having validation: 1.397 versus 1.427. Also, without validation, the GPR, Gaussian process regression, and the Predictum SVEM neural network right here both do not have better adhesion predictions than the other models, even though SVEM is highly recommended for small data sets, which is what we have in this case.
In summary, an active learning example with a target-guided focus for an adhesive is demonstrated. A space-filling design assisted by SVM machine learning is used to identify the design space in three iterations. Active learning exhibits twice the efficiency of the non-active method in identifying the design space.
We show that various machine learning models were compared, and some prove to have better prediction than the least squares model. The target-guided steps involve the three areas listed here. The chemist supervised these steps, aiming to enhance active learning and prediction for a small data set. In this case, JMP's advanced prediction capability is clearly illustrated. This is Case number 1.
Moving to Case number 2. We apply machine learning to analyze RSM DOE data from a sealant application in order to have better prediction capability. Here's the background. This case is related to a sealant with 5 key ingredients and 3 responses. The sealant was optimized with a 28-run face-centered central composite design. One of the responses, Y3, has a reasonable adjusted R² of 0.86 and a root mean square error of 7.9.
This optimized Y3 response is actually near the edge of the target boundary. The chemist did two things. First, add four runs on top of the 28 runs; these are within the DOE factorial region. Second, add six star points to cover a wider design space.
Then the customer changed the project target. The chemist was forced to alter the ingredient loadings and add 10 runs, hoping to cover the new design target. Since these 10 runs deviate greatly from the original 28-run design space, they are considered outside the design space. These are the 10 runs here. The four runs within the central composite design and the six star points are right here.
Our objective is to apply machine learning to this RSM data set to learn how good various machine learning models are at predicting the testing data sets. Those testing data sets are within, near, or outside the design space: these three data sets here.
Here is the RSM model with the R² and root mean square error highlighted. After model term reduction, there are three main effects, X2, X3, and X4, a two-factor interaction of X2 and X5, and one quadratic effect of X1. These are all found to be significant.
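As a minimal sketch of that reduced RSM model outside of JMP, the same terms (main effects X2, X3, X4, the X2*X5 interaction, and the X1 quadratic) could be fit with a least squares formula like the one below; the DataFrame here is a placeholder, since the real 28-run data and fit live in JMP.

```python
# Minimal sketch: least squares fit of the reduced RSM terms named above.
# The placeholder data stand in for the 28-run face-centered CCD table.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(-1, 1, size=(28, 5)), columns=["X1", "X2", "X3", "X4", "X5"])
df["Y3"] = rng.normal(size=28)                      # placeholder for the Y3 response

reduced_rsm = smf.ols("Y3 ~ X2 + X3 + X4 + X2:X5 + I(X1**2)", data=df).fit()
print(reduced_rsm.summary())                        # inspect R-squared and residual error
```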
Our machine learning process involves two stages. First, model screening with 7-fold cross validation is used for a quick scan of model performance. Then, for selected machine learning methods, we perform hyperparameter tuning to improve the prediction. For validation, two methods are used in the JMP Model Comparison platform: one is fixed validation with a 75/25 split, and the other is 7-fold cross validation.
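Here is a minimal sketch of those two validation schemes on a 28-run-sized data set, using scikit-learn as a stand-in for the JMP platforms; the placeholder data and the single example model are assumptions for illustration.

```python
# Minimal sketch: fixed 75/25 holdout (21 train / 7 validate) versus
# 7-fold cross validation (24 train / 4 validate per fold) on placeholder data.
import numpy as np
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X, y = rng.uniform(size=(28, 5)), rng.normal(size=28)     # stand-in for the 28 DOE runs

model = GradientBoostingRegressor(n_estimators=200)

# Fixed 75/25 holdout
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)
model.fit(X_tr, y_tr)
holdout_rmse = np.sqrt(np.mean((y_va - model.predict(X_va)) ** 2))

# 7-fold cross validation, averaged across folds
cv_rmse = -cross_val_score(model, X, y,
                           cv=KFold(n_splits=7, shuffle=True, random_state=1),
                           scoring="neg_root_mean_squared_error").mean()
print(f"holdout RMSE = {holdout_rmse:.2f}, 7-fold mean RMSE = {cv_rmse:.2f}")
```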
Phase 2 involves testing on unseen data sets. The testing data sets are added to the JMP table. They are within, near, and outside the original 28-run design space, and the prediction formula columns are saved and then stacked together. Then a root average square error column is created for the final model comparison.
The results of the JMP Model Screening are shown right here. It was performed by adding the two-way interactions and quadratic options, and also selecting the 7-fold cross validation right here. The fine-tuning of each model was conducted by selecting it and clicking Run Selected. Hyperparameter tuning and boosting are performed if those options are available. Each model's prediction formula is saved in a table column. For the neural network, its parameter tuning add-in is used. In this add-in, we choose three protocols: protocol number 2 is a 2-layer setting, and numbers 12 and 16 are one-layer settings.
Here is my JMP table. Section 1 is the original 28-run RSM data and the response Y3. Section 2 has the two validation columns we use for the model comparison. Section 3 has the saved prediction formula columns for each model. Section 4 here is the testing data: three testing data sets. Section 5 is the prediction results from each machine learning model.
In the model comparison, we are curious about the effect of the validation method, since there are differences in the number of training runs, the number of validation runs, and the averaging process. Under the 75/25 split, fixed validation has 21 runs for training and 7 runs for validation. In cross validation, the 7-fold data are averaged, and each fold uses 24 runs for training and 4 runs for validation. If we look at the output with fixed validation, JMP will output a table that can be used directly. But for K-fold cross validation, we need to do some data processing.
JMP outputs seven tables, separated by the fold number. We use a script here called Make Combined Data Table to join all seven tables into a single one here. Then the average statistics can be created in the Tabulate platform here for comparison.
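As a minimal sketch of that post-processing step, the per-fold tables can be concatenated and the fit statistics averaged per model; the table contents and column names below are placeholders standing in for the JMP Model Comparison output.

```python
# Minimal sketch: combine seven per-fold comparison tables and average
# the fit statistics per model. All values here are placeholders.
import pandas as pd

fold_tables = [
    pd.DataFrame({"Fold": k, "Model": ["RSM", "XGBoost"], "RASE": [8.0 + k, 6.0 + k]})
    for k in range(1, 8)
]
combined = pd.concat(fold_tables, ignore_index=True)   # the "Make Combined Data Table" step
summary = combined.groupby("Model")["RASE"].mean()      # average across the seven folds
print(summary)
```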
Here is the model comparison summary. On the left, it is using fixed validation. On the right, it is using the consolidated table from K-fold cross validation. To make the comparison easier, we use Graph Builder to show the results here. In this bar graph, root average square error is on the Y axis and the model is on the X axis. A model name with "-t" at the end means the model's parameters have been tuned. One with a "B" means the model has been boosted.
The root average square error from fixed validation, in blue, is in general higher than that from K-fold cross validation, in red, as shown here. The RSM model is shown at the left. There are several machine learning models that have less prediction error than RSM. They are marked with green arrows: XGBoost, the two SVM models, two neural networks, and the boosted neural network, in order of their root average square error values.
This conclusion is based on the 28 DOE runs, in which each run is used either for training or for validation. Runs used for validation are involved in tuning the hyperparameters of the model for better prediction. The models' true prediction capability will be tested using unseen runs not involved in either training or validation, and this will be shown next.
The second stage of machine learning is testing the models using the unseen data. Those three data sets were copied to the JMP table right here, and the predicted values automatically appear in the prediction formula columns. All the formula columns were stacked into a new table. A calculated root average square error column is also created. From the stacked table, Tabulate and Graph Builder are used to illustrate the model testing capability in the next two slides.
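To make that comparison step concrete, here is a minimal sketch of stacking predictions into long form and computing the root average square error per model within each testing data set; the column names and tiny placeholder table are assumptions for illustration, not the actual JMP output.

```python
# Minimal sketch: RASE per model within each testing data set,
# computed from a stacked (long-form) table of observations and predictions.
import numpy as np
import pandas as pd

stacked = pd.DataFrame({                       # placeholder for the stacked JMP table
    "Model": ["RSM", "RSM", "XGBoost", "XGBoost"],
    "TestSet": ["inside", "star", "inside", "star"],
    "Observed": [50.0, 60.0, 50.0, 60.0],
    "Predicted": [52.0, 63.0, 55.0, 70.0],
})

stacked["sq_err"] = (stacked["Observed"] - stacked["Predicted"]) ** 2
rase = np.sqrt(stacked.groupby(["TestSet", "Model"])["sq_err"].mean())
print(rase)                                     # one RASE value per model per testing set
```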
Let's look at the model performance for the inside data set and the star data set, which is near the design space. They have 4 runs and 6 runs, respectively. As seen in the highlighted table here on the right, or the graph on the left, which is this area, there are nine models for the inside data set, highlighted here, that exhibit similar predictions, with root average square errors in the range of 5.6-7.7.
Then, for the star testing data set right here, which is near the design space, there are seven models with root average square errors in the range of 18.2-21.7, with an average of 20.3. XGBoost and this tuned SVM, marked with green arrows, were the best on the DOE data set, which is this one, better than the RSM, of course, but here they have higher prediction errors.
If we look at this one across here, or this one across here, they exhibit higher prediction errors than the RSM at the top here for both the inside data set and the star data set. That is an indication of overfitting. But overall, as highlighted with the shaded blue arrow, this one, the single-layer, un-tuned, boosted neural network performed the best for the DOE data set, the inside testing data set, and also the star data set here. That is the best, as you can see from the table or from the graph shown here.
We are now looking at the testing data set outside the design space. For the graph on the left, we plotted the observations and predictions on the y-axis. Notice the testing run IDs right here for all three testing data sets. Red is the observation and blue is the prediction. The outside testing data set has the points to the right of the green dotted line here.
The four highlighted models performed really well. They are the RSM and the two Gen_Reg forward models, either pruned or SVEM. The fourth one is this, the two-layer tuned neural network. These four performed very well. There are some models with flat or very, very poor predictions, and they are the SVM models right here.
Then the un-tuned neural networks, this and this, and the tree-based methods such as XGBoost or this bootstrap forest. If we consider all four testing scenarios, meaning the DOE data set and the three testing data sets, then the RSM model and this Gen_Reg Pruned Forward model, these two models, have the best balanced prediction performance across all four different scenarios right here.
We need to take these findings with caution. They should not be generalized; they are the characteristics of this specific response, Y3. Nevertheless, this is good information for a chemist studying this Y3 response, meaning the RSM model could potentially be used to predict data outside the design space under special circumstances, such as when the project is under severe time constraints and you do not have the resources to do any more runs. Then you may consider using RSM to make predictions in the space outside the DOE space. Of course, confirmation runs are absolutely needed in that second step.
In summary, for the DOE data set and the two testing data sets with runs inside and near the design space, we see several machine learning models with better performance than RSM. The best one is the one-layer, un-tuned, boosted neural network. For the testing data set outside the design space, four models performed really well. They are RSM, the two Gen_Reg forward methods, and this 2-layer tuned neural network.
If we consider all four scenarios, then the RSM and Gen_Reg pruned forward models are the best overall. As illustrated in these two case studies, it is clear that machine learning adds capability to traditional DOE for effective formulation development.
With this, I would like to acknowledge my colleagues, Dr. Dai and Mr. Du, for their collaboration on these two case studies, the Henkel AMI management, and also the JMP support team for Henkel, Nick, Jon, and Mark, for their excellent training and support. Thank you for listening and watching this video.