Machine learning (ML) methods have been widely applied to analyze design of experiments (DOE) data in industries such as chemicals, mechanical engineering, and pharmaceuticals, yet they have received limited attention in the food industry, especially for recipe optimization.
To address this, we explored ML and sequential learning for recipe formulation, aiming to optimize product quality. We combined DOE with ML to select optimal combinations of one, two, or three ingredients from 11 candidates, adjusting ingredient dosages based on the number of combined ingredients: 0-1 for single ingredients, 0-0.5 for pairs, and 0-0.33 for triplets. After assessing the main effects of all 11 ingredients, we narrowed the focus to five key ingredients. A full factorial design was applied to two-ingredient combinations, alongside collecting one data point at maximum dosage for each triplet. Three promising combinations were further analyzed using a space-filling design to explore the full parameter space.
Subsequently, ML models were developed to predict product quality, with sequential learning guiding additional experiments to refine the model for one specific combination. This approach identified the optimal mixture with fewer than 100 lab experiments, demonstrating the efficiency of combining ML, sequential learning, and DOE in reducing experimental efforts while identifying high-performing ingredient mixtures.

Hi, everyone. Today, we'll be discussing Optimizing Recipe Formulation Using Machine Learning, Sequential Learning, and Design of Experiments. I'm Junaid Mehmood, and I work at Danone R&I in Utrecht as a Process Technologist.
Before we get into the topic, I'd like to say a little about Danone. Our mission is bringing health through food to as many people as possible, through a diverse portfolio focused on health, sourcing from local markets, sustainability, and innovation. You may be familiar with most of our global brands, such as Activia, Aptamil, Evian, or Alpro, but we also have a huge market consisting of local brands.
As you may have noticed, most of our products consist of food or food derivatives, and, of course, water. That means we deal with a lot of recipes. When we work on recipe formulation and think about innovation at Danone, we want to maintain a high standard that results in amazing taste, but we also want to maintain very good nutritional value. We want to continuously innovate based on our consumers' needs and what they like.
It's not easy, because there are numerous challenges when you want to innovate with food-based products. First of all, consumer preferences may change from one year to the next, with more focus on sustainability right now, which means we need to keep on innovating.
Secondly, we need to comply with food safety regulations, which may mean a product that is acceptable in one market is not acceptable in another. We also want our products to be inherently safe, because many of them go to babies and sometimes even to patients who cannot take food orally.
Then we want to balance the cost of our product. We want to maintain the highest quality we can, but even a really good product won't sell if it's too costly.
With this in mind, our product workflow cycle when innovating looks like this. It is quite generic, and any company making food-based products will follow a similar workflow. It starts with either new product ideation, or market research telling us we want to change certain properties of a product. Based on that, we move to ingredient and process selection.
Ingredient and process selection go hand in hand: a process that works for one type of ingredient may not work for another. It's a cyclic process where we choose ingredients and processes based on the nutritional value they bring and whether they are sustainable, and, for the process, whether it is scalable, efficient, and doesn't add to the cost of the product itself.
The next step is prototype development, which we mainly do in our labs at R&I centers. This is where we run small-scale lab trials, do flavor profiling, and optimize certain processes, or go back to the first step and choose a different ingredient. Once that is done, we go to our pilot plants, where we look at scalability of the product, and then finally to a factory trial. The last step before a product can go to market is shelf-life and stability testing. Certain types of products also require regulatory approval, but not all of them; it depends on the type of product.
At lab, pilot, and factory scale, we characterize our product by its organoleptic properties and by processability attributes such as viscosity. These become the product characteristics against which we optimize the process.
The project I'll be discussing today focuses mainly on lab-scale trials, for which we have three main components: ingredients, process, and the experiments themselves. When we focus only on ingredients and process, we are of course looking at micro- and macronutrients, which need to be part of the product, but the product also needs to taste good.
I've just discussed the shelf-life part, so there are preservatives and stabilizers, and then there are texturizers, which make the product more processable as it goes through the process plant. In terms of process selection, we have temperature control, mixing, processing speeds, and other process parameters.
These decisions are made before we go into the lab trials, or, while doing the lab trials, we come back to these ingredient and process selections and change whatever makes more sense for the particular product.
When we test these ingredients, we test them based on their functionality, that is, their role in terms of micronutrients (for example, if we want to add certain minerals or vitamins), their sensory characteristics (how they will taste), and the regulatory requirements. Based on these, we can define minimum and maximum levels of each ingredient. Then, in the lab-scale trials, we test how each ingredient affects the product characteristics written here: taste, texture, and stability.
However, the maximum level of certain ingredients that we can put in any formulation is usually constrained by regulatory requirements, especially for flavors, sugars, and preservatives, and we need to remain below those limits.
In this particular project, we used machine learning, sequential learning, and DOE together to optimize taste and texture. We were looking at 11 different ingredients and wanted to find a combination of either one, two, or three of them that enhances taste and texture.
We had seven different goals: three response variables we wanted to minimize, two we wanted to maximize, and the rest we wanted to keep constant, so that even when we add a combination of ingredients, it doesn't change those two particular molecules.
Then there were certain regulatory constraints. Out of these 11, if we use a single ingredient, we can go to a maximum level of 1. Level here is our terminology; in DOE terms, it is the maximum amount of the ingredient that we can use. If we combine two ingredients, each can go to a maximum of only 0.5, and if we combine three, the maximum is 0.33. Within these constraints, we needed to find which ingredients result in better taste or texture.
If you think about the overall combinations: there are 11 single ingredients, 55 possible combinations of two ingredients, and 165 possible combinations of three, for a total of 231 combinations. Running experiments for each of these is already too much. Maybe you work for three or four years and reach a conclusion, but that's just not desirable.
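The combination counts above are straightforward binomial coefficients and can be checked quickly:

```python
from math import comb

n = 11  # candidate ingredients
singles = comb(n, 1)   # one ingredient at a time
pairs = comb(n, 2)     # two-ingredient combinations
triplets = comb(n, 3)  # three-ingredient combinations
total = singles + pairs + triplets
print(singles, pairs, triplets, total)  # 11 55 165 231
```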
For this, we devised a sequential approach. First, we screened 5 ingredients out of the 11, based only on the main effects. Then, from those five, we screened combinations of two and three ingredients to find which combinations perform best. In step three, once we had selected those combinations, we determined which of them performed best. Finally, we reached the modeling part, where we used model predictions to find the optimal levels of the ingredients in that particular combination. We will go into detail on each of these steps.
For factor screening, we used one factor at a time. When you take a DOE class, this is something you are told not to do, but in this particular case, we couldn't use a screening design. The reason is that a screening design carries inherent assumptions: each factor can be changed independently of the others, and all factors are present in every run. That is not true in our case; we cannot put four or five of these ingredients together in this particular recipe. That meant we could only work with one ingredient at a time.
Secondly, when modeling a screening design, we build one model and look at the parameter values. That is not possible here, because as soon as we start adding more ingredients, we move away from our project objective.
Finally, we didn't consider any interactions at this stage. First, that would drastically increase the number of runs. Second, the way interactions work in food formulation, and the way our mouths perceive them, is quite different from typical process conditions: ingredients can act synergistically or against each other depending on the dosage. We already know there will be nonlinear effects there, and including them at this stage just doesn't make sense.
Considering one factor at a time, we tested each ingredient at three levels: 0.33, 0.67, and 1. That gave a total of 33 runs, plus a lot of replicates; throughout this presentation, when I talk about treatments or runs, I am not including replicates. So here we have 33 unique runs.
Once we performed those lab trials, we built a correlation matrix based on the desirability of each response. The panel on the right is just to help you visualize: three of the responses we want to minimize, two we want to maximize, and the remaining two should stay unchanged.
This is the correlation heat map. On the X-axis we have all of the ingredients, and on the Y-axis all of our responses. As written at the top right, we want to minimize three of the responses, and we can see they are being minimized; now we need to decide which ingredient does this best. Y₄ and Y₅ we want to maximize, and Y₆ and Y₇ should remain unchanged.
But correlation tells only half of the story, because you can still get a very good correlation even if the response value barely changes between level one and level three. For this reason, we also looked at a covariance heat map. Covariance tells us the change in the response variable relative to a unit change in the ingredient variable.
By combining these two pieces of information, we were able to eliminate 6 of the 11 ingredients. Looking at the ingredients that remain, the three response variables we want to minimize show good negative correlations, and the response variables we want to maximize are either maximized or at least not decreased. For Y₆ and Y₇, the correlations are weak, which means they remain essentially unchanged.
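The screening analysis described above, pairing a correlation heat map with a covariance heat map, can be sketched in Python. The data here is synthetic stand-in data (the real 33 screening runs and responses are not published in the talk):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# stand-in screening data: 33 runs, 11 ingredient levels X1..X11, 7 responses Y1..Y7
X = pd.DataFrame(rng.uniform(0, 1, (33, 11)), columns=[f"X{i}" for i in range(1, 12)])
Y = pd.DataFrame(rng.normal(0, 1, (33, 7)), columns=[f"Y{i}" for i in range(1, 8)])
df = pd.concat([X, Y], axis=1)

# cross-correlation between each ingredient and each response (the heat map)
corr = df.corr().loc[X.columns, Y.columns]
# covariance: how much a response moves per unit change in an ingredient level
cov = df.cov().loc[X.columns, Y.columns]
print(corr.shape, cov.shape)  # each is 11 ingredients x 7 responses
```

Plotting both matrices (for example with `seaborn.heatmap`) reproduces the two screening views: corr for direction and strength of association, cov for the magnitude of the response change.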
So in this first step, we reduced from 11 ingredients to five. Going forward, I'll simply call them X₁ to X₅ to make the discussion easier; they are numbered differently from the original 11.
Now we are interested in evaluating all possible combinations of these five and finding which one performs best. For two ingredients, there are 10 combinations, and for three ingredients, another 10, for 20 combinations in total. If we ran a full factorial design for main, quadratic, and interaction effects on all of these, that would be 330 runs. Once again, that is already too much.
Adding the individual components brings the total to 340 runs, and again, we don't want to do that many experiments; it would take too much time and too many resources.
To overcome this, we decided to design two-factor experiments with two ingredients each, and to combine that data to build the design matrix for the three-factor combinations. I'll go into more detail on how that looks. For two-factor experiments we can use a maximum level of 0.5, and for three-factor experiments a maximum of 0.33. If we consider one particular combination, X₁, X₂, X₃, we can combine the data from the two-factor combinations X₁X₂, X₁X₃, and X₂X₃, which means we don't have to run all of the three-factor experiments separately.
In a little more detail for one particular combination, the approach looked as follows. First, we designed the two-factor combinations using a space-filling design, treating the factors as continuous from 0 to 0.5. We did the same for the three-factor design, but with factors ranging from 0 to 0.33. We then combined all of these two-factor and three-factor runs into one table, and used the Covariate Factors function of the Custom Design module in JMP to reach our final design.
Let's go into each of these steps. In the first step, as I said, we created 30 runs for each two-factor combination. This is how the parameter space looks; it is of course the same space, only the X and Y variables change for X₁X₂, X₁X₃, and X₂X₃. We are trying to explore the full parameter space. We then did the same for the three-factor combination, again with a space-filling design, but with a maximum level of 0.33.
We can now combine all of these runs in one table. Visualized, this is how all 120 runs look. On the walls you can see the two-factor runs, and in the middle, shown with red dots, the three-factor runs. The two-factor runs span 0 to 0.5, and the three-factor runs span 0 to 0.33.
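A candidate table like this can be sketched in Python. The run counts and dosage bounds follow the talk; scipy's Latin hypercube sampler stands in for JMP's space-filling design, and in practice each pair would get its own fresh sample:

```python
import numpy as np
from scipy.stats import qmc

# 30 space-filling runs per two-factor combination, levels in [0, 0.5]
sampler2 = qmc.LatinHypercube(d=2, seed=1)
two_factor = qmc.scale(sampler2.random(n=30), [0, 0], [0.5, 0.5])

# 30 space-filling runs for the three-factor combination, levels in [0, 0.33]
sampler3 = qmc.LatinHypercube(d=3, seed=2)
three_factor = qmc.scale(sampler3.random(n=30), [0, 0, 0], [0.33, 0.33, 0.33])

def embed(design2, cols, n_cols=3):
    # place a two-factor design into the full (X1, X2, X3) space,
    # with the absent ingredient held at zero
    out = np.zeros((design2.shape[0], n_cols))
    out[:, cols] = design2
    return out

candidates = np.vstack([
    embed(two_factor, [0, 1]),   # X1-X2 runs (X3 = 0)
    embed(two_factor, [0, 2]),   # X1-X3 runs (X2 = 0)
    embed(two_factor, [1, 2]),   # X2-X3 runs (X1 = 0)
    three_factor,                # interior runs, the red dots
])
print(candidates.shape)  # (120, 3)
```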
We can then open the JMP Custom Design module and click the Covariate Factor button, selecting the three columns containing these values. The model we specified in the Custom Design module included the main effects, quadratic effects, and all interaction effects.
We then tried different numbers of runs (16, 19, 22), compared them, and ran the design evaluation to find the minimum number of runs required to get the best possible design from the 120 candidate runs for this model. We found that this design gives us good optimality criteria, and the correlations between the different model terms are quite small.
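Conceptually, the covariate-factor approach selects an optimal subset of the candidate runs for the stated model. A rough stand-in for that idea is a greedy D-optimality search; this is an illustrative sketch, not JMP's actual algorithm (JMP uses more sophisticated exchange methods):

```python
import numpy as np

def model_matrix(runs):
    # main effects, two-way interactions, and quadratics for 3 factors
    x1, x2, x3 = runs.T
    return np.column_stack([
        np.ones_like(x1), x1, x2, x3,
        x1 * x2, x1 * x3, x2 * x3,
        x1**2, x2**2, x3**2,
    ])

def greedy_d_optimal(candidates, n_runs, seed=0):
    # greedily add the candidate that most increases log det(F'F);
    # a small ridge keeps the determinant defined early on
    rng = np.random.default_rng(seed)
    F = model_matrix(candidates)
    chosen = [int(rng.integers(len(candidates)))]
    while len(chosen) < n_runs:
        best, best_val = None, -np.inf
        for i in range(len(candidates)):
            if i in chosen:
                continue
            sub = F[chosen + [i]]
            val = np.linalg.slogdet(sub.T @ sub + 1e-6 * np.eye(F.shape[1]))[1]
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return candidates[chosen]

rng = np.random.default_rng(3)
candidates = rng.uniform(0, 0.5, (120, 3))  # stand-in for the 120 pooled runs
design = greedy_d_optimal(candidates, 19)   # e.g. the 19-run option from the talk
print(design.shape)  # (19, 3)
```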
Looking at this closely, it looks quite similar to a full factorial design: on the walls, for the two-factor combinations, the points are spread out toward the corners, and the design added two points for the three-factor combinations. The design also adds center points. This was our first clue that we could actually go with a full factorial design. In hindsight, maybe we should have started there, but I'm describing our actual workflow.
Because it looked like a full factorial design, that is what we went back to. As everyone knows, for a full factorial of two factors, this is how the design matrix looks, with a maximum of 0.5 in this case; for three factors, the design matrix looks like this.
However, as discussed, there is one extra point beyond the full factorial that we included ourselves, at the maximum limit for the three-factor combinations.
In this design, these are the runs that matter for the interaction and quadratic effects, and these are the runs where only one factor changes. We had already performed the single-factor runs when we looked at main effects in step one, so instead of repeating them, we simply augmented that data here. That means in step two we only needed to perform the runs where two or three ingredients change at once. This reduced the total number of runs drastically: from 340 runs, we could now evaluate all 20 combinations with only 60 runs.
Crucially, while a full factorial design is orthogonal, in this case we do not have orthogonality. That is something we need to keep in mind during the analysis.
The total number of runs in step two then came down to just 50. Once we had performed these runs for all possible combinations, we did a multi-objective evaluation: we computed a distance metric relative to our targets. We have targets for Y₁ through Y₅, which we want to minimize or maximize, and two responses we want to keep constant. Based on this distance metric, we found which combinations come closest to the target and which do not.
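A distance metric of this kind can be written as a Euclidean distance to a target vector. The target values below are hypothetical (with all responses scaled to [0, 1], minimized responses target 0, maximized responses target 1, and held responses target their baseline):

```python
import numpy as np

# hypothetical targets: minimize Y1-Y3, maximize Y4-Y5, hold Y6-Y7 at baseline
target = np.array([0, 0, 0, 1, 1, 0.5, 0.5])

def distance_to_target(responses, target):
    # responses: (n_runs, 7) array with each column scaled to [0, 1];
    # smaller distance means the run is closer to all seven goals at once
    return np.linalg.norm(responses - target, axis=1)

responses = np.array([
    [0.1, 0.2, 0.1, 0.9, 0.8, 0.5, 0.6],  # a run near the goals
    [0.7, 0.6, 0.8, 0.3, 0.2, 0.5, 0.5],  # a run far from the goals
])
d = distance_to_target(responses, target)
print(d.round(3))  # the first run scores much closer to the target
```

Ranking all runs by this distance and looking at which ingredient combinations dominate the best-scoring runs gives the shortlist described next.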
We took the top 10 runs closest to the target and identified which combinations were most prevalent among them. They looked something like this: we found two-ingredient and three-ingredient combinations that came closest to our project objective.
Even though we knew a model was not going to work well here (we still didn't have sufficient data, and the design is not orthogonal), we wanted to see what the model looks like for each of these combinations. For the five just described, this is the R² for both the two-factor and three-factor combinations.
In this model we have main effects, interaction effects, and quadratic effects. Focusing on the two-factor case for one of these combinations, this plot shows the actual p-values. We noticed that for some response variables, none of the parameters was significant; for others, some parameters were significant, but not the same ones across different responses, even though the R² scores were good. That is to be expected, because with a limited data set the model may be overfitting.
The same was true for the three-factor combinations, except that at least one or two parameters were significant for every response. But again, the significant parameters were not the same across responses. That was our first clue that each response needs to be modeled differently: for each response variable, we need to find which factors are significant.
To illustrate this point further: even with this limited data set, when we removed the quadratic effects, the R² scores became really bad. This was a hint that quadratic effects are important for some of these responses and need to be included. Going into step three, out of the five combinations, we kept only those where three ingredients are present, because not only could we model them well, but our factors clearly had an effect on the responses. Now we needed to find which factors affect which response.
To augment the data for these combinations, we added five additional runs using the Augment Design module in JMP, again with a space-filling design. Now, for each combination, we have at least 20 data points with which to evaluate a model. But as I highlighted, the model will look different for each response variable. We could have hand-tuned a good model for each response, but instead we used a machine learning approach in which we evaluated different models for each response: ordinary least squares, LASSO, and Elastic Net, plus support vector regression with linear and polynomial kernels.
We then cross-validated using the leave-one-out method. In leave-one-out, if you are not familiar with it, the model is fitted on all of the data except one point; that point is used to make a prediction, and the error is computed. This is repeated for every data point, and you compute the average root-mean-square error. Based on this error, you can determine which model is best. This restricts overfitting, because you always leave one data point out for prediction, while still leaving enough of our limited data set to fit a good model.
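The model comparison just described maps directly onto scikit-learn. The data and hyperparameters below are illustrative stand-ins (about 20 runs per triplet, as in the talk; the actual regularization settings were not given):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 0.33, (20, 3))                          # ~20 runs for one triplet
y = 2 * X[:, 0] - X[:, 1] ** 2 + rng.normal(0, 0.05, 20)   # synthetic response

models = {
    "OLS": LinearRegression(),
    "LASSO": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    "ElasticNet": make_pipeline(StandardScaler(), ElasticNet(alpha=0.01)),
    "SVR-linear": make_pipeline(StandardScaler(), SVR(kernel="linear")),
    "SVR-poly": make_pipeline(StandardScaler(), SVR(kernel="poly", degree=2)),
}

results = {}
for name, model in models.items():
    # leave-one-out: each fold holds out a single run for prediction
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    results[name] = np.sqrt(-scores.mean())  # average RMSE over held-out points
    print(f"{name}: LOO RMSE = {results[name]:.4f}")

best = min(results, key=results.get)
print("best model:", best)
```

Repeating this per response variable gives each response its own best model, matching the observation that the significant factors differ across responses.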
Applying this, we got good models for the X₂, X₃, X₄ combination, with very low mean-squared error and quite good R² scores. Because our ability to predict this combination is good, the unexplained error for it is small, so this is the combination we moved forward with. This is how the actual versus predicted values look; you can see the models are not perfect, but they are good enough to make predictions for this combination.
In the next step, we used these models to run Monte Carlo simulations. Using the distance metric on the simulated responses, we found which simulated settings come closest to our target. We then performed clustering on the simulations closest to the target to find three different runs that we could take back to the lab and test experimentally.
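The Monte Carlo plus clustering step can be sketched as follows. The fitted model, target value, and thresholds are all illustrative stand-ins; the real project scored seven responses with the multi-objective distance metric rather than a single response:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# stand-in model fitted on synthetic data for the chosen triplet
X_train = rng.uniform(0, 0.33, (20, 3))
y_train = X_train @ np.array([1.0, -0.5, 0.8]) + rng.normal(0, 0.02, 20)
model = LinearRegression().fit(X_train, y_train)

# Monte Carlo: sample many random dosage settings within the allowed range
samples = rng.uniform(0, 0.33, (10000, 3))
pred = model.predict(samples)

# keep the samples whose predicted response lands closest to the target
target = 0.3
closest = samples[np.argsort(np.abs(pred - target))[:200]]

# cluster the near-target region into three representative recipes to verify in the lab
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(closest)
print(kmeans.cluster_centers_.round(3))  # three candidate dosage combinations
```

Clustering, rather than simply taking the three best-scoring samples, spreads the confirmation runs across distinct regions of the near-optimal space instead of three nearly identical points.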
Of those three runs, one achieved our desired goal: we were able to minimize Y₁, Y₂, and Y₃, maximize Y₄ and Y₅, and maintain Y₇. At this step we actually skipped Y₆, because our ability to predict Y₆ was still not there, even after all of these runs.
Going back to the product workflow, this was just one step in the whole cycle. We have now reached the end of the lab scale; next, the product goes to the pilot plant to test scalability, then to factory runs, and then the remaining steps.
That brings me to my summary slide. We used a stepwise, iterative DOE approach that let us find the optimal combination with only about 100 experiments instead of 500. Space-filling designs were used effectively with machine learning to explore the experimental space. Lastly, we found that combining machine learning with DOE data in this sequential approach allows you to arrive at a better model, even if your hypothesis is wrong at the start. With that, I'd like to say thank you.