DSD DOE Augmented with Robust Design/HACCP and Neural/Boosted Tree Optimization (2021-EU-45MP-761)
Patrick Giuliano, Lead Quality Engineer, Abbott
Mason Chen, Student, Stanford University OHS
Charles Chen, Principal Data Scientist/Continuous Improvement Expert, Applied Materials
As consumers recede from social eating to rely on satellite delivery, optimal food preparation is increasingly vital. Using dumplings – a simple, nutritious yet tasty dish – multidisciplinary DSD DOE, robust design, and HACCP control planning were utilized to prepare the most efficient recipe. Although a controlled experiment was initially designed using DOE, some runs had to be substituted with different meat/vegetable types after a shortage of ingredients. To study the impact of the outliers, the original DSD was modified to try to account for the substitution. Before assessing model fit, each DSD was checked for orthogonality (color map on correlations), design uniformity (scatter plot matrix with nonparametric density), and power (design diagnostics). This ensured that the response surface model (RSM) results would be attributed to scientific aspects of dumpling physics instead of problems in the data structure. Next, the optimal RSM model was selected using stepwise regression, and model robustness was probed using t-standardized residuals and global/local sensitivity. Finally, advanced modeling capabilities in JMP Pro were leveraged (multi-layered neural network, bootstrap forest, boosted tree), and, with the reapplication of the HACCP framework, the optimal parameter fitting was then proposed for use in commercial manufacturing based on the data structure and physics.
Mason Chen: So hi, I'm Mason, and today I'll be presenting the dumpling cooking project, in which we studied the relationship between data structure, dumpling physics, and RSM results.

The motivation behind this project is that COVID-19 has increased the demand for remote cooking, and artificial intelligence and robotics will play an increasingly important role in the food industry as the application of technology continues to spread. But most foods are made without precise control of cooking parameters, as we rely on the chef's expertise to create consistent dishes. Quality control and efficiency will be much more important if a robot is cooking the meal, so we decided to use steamed dumplings, a simple, nutritious but tasty dish, to prepare the optimal recipe based on the dumpling rising time.

As I stated previously, we performed a dumpling experiment, but due to an ingredient shortage, we had to substitute a different meat type during the experimental process. When we evaluated the DOE design, it was not orthogonal, as there were some major confounding issues. And when we tried to run an RSM model with this data, there was also a potential outlier, particularly Run #6, which we wanted to study further.
So our first objective in this project is to address the shortage from a corrective standpoint and see if we can save the DSD structure. We'll do this by first changing meat type, which was originally a two-level categorical variable consisting of pork or shrimp, to a continuous variable, so that we can use varying percentages of meat type for the three runs in which we used different mixtures of shrimp and pork. Then we'll assess the data structure of this new DSD design using a variety of evaluation tools.
Since the DSD structure turned out not to be problematic based on those tools, we decided to run an RSM model to see whether its results are in line with our scientific understanding of convection and conduction. Had the DSD structure been problematic, we might have obtained false interactions that cannot be explained by science, and interactions that we would expect from science might have been hidden due to the absence of orthogonality.
Our second objective, after finding a good DSD model to account for the ingredient shortage, is a more preventative approach: we want to study the impact of the potential outlier and of the DSD structure by revising the settings. The impact of the outlier can help us better understand the importance of measurement control, while studying the DSD structure will tell us how a confounded structure would affect our results before we even run a future experiment. We'll pursue this second objective by modifying and comparing the factor settings and responses of Run #6, which is the potential outlier, and Run #9, which is our center point, and studying whether the model results are due to a confounded data structure or whether they can be explained by physics.
So the original design table is on the right, and the runs that we substituted are Runs #3, #4, and #9, highlighted in orange. Run #9, again, is the center point, so substitution may affect the model orthogonality, such as the average prediction variance we'll look at later on. The original design uses a categorical meat type of pork or shrimp, so we kept the setting at shrimp even for these three runs, where we had to substitute some of the shrimp with pork.
The color map on correlations on the left-hand side shows some confounding groups: blue indicates no confounding, that is, zero correlation between two terms, and dark red indicates severe confounding and high correlation, which is why the diagonal will always be 1, since each term is 100% correlated with itself. So there are some confounding risks at different resolutions. For the categorical meat type in particular, the correlation is about 0.3, and we have some severe Resolution IV risk, which is interaction-with-interaction confounding, as some terms, such as the red blocks, have correlations greater than 0.5. So we should not run an RSM model from here, since the data itself cannot be trusted due to the severe confounding; before we proceed to run a model, we need to improve the data structure. But how can we address this problem without recollecting the data?
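As a rough illustration of what the color map on correlations computes, the sketch below expands a small design matrix into its RSM model terms (main effects, two-factor interactions, quadratics) and reports their absolute pairwise correlations. The factor names and design rows are invented placeholders, not the actual dumpling DSD.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-factor design in coded units (-1, 0, +1);
# these rows are illustrative only, not the dumpling design.
design = pd.DataFrame({
    "WaterTemp":  [-1, 1, -1, 1, 0, -1, 1, 0, 0],
    "DumplingWt": [-1, -1, 1, 1, -1, 0, 0, 1, 0],
    "BatchSize":  [1, -1, -1, 1, 0, 0, -1, 1, 0],
})

# Expand to the RSM model terms: main effects, two-factor interactions, quadratics.
terms = design.copy()
cols = list(design.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        terms[f"{a}*{b}"] = design[a] * design[b]
    terms[f"{a}^2"] = design[a] ** 2

# Absolute pairwise correlations between model terms -- the numbers behind the color map.
color_map = terms.corr().abs().round(2)
print(color_map)
```

Values near zero correspond to the blue cells (orthogonal terms), while values approaching 1 correspond to the red, heavily confounded pairs.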
So, since the categorical meat type had confounding problems and we needed to account for the substituted meat, we decided to change meat type from categorical to continuous. We essentially changed the variable to shrimp percent, so the runs that were all shrimp were changed to 100% and the ones that were all pork were changed to 0%. For the three substituted runs, we estimated the true percentage values based on what we could recall about the order history for the total meat we bought at the supermarket that day: 70%, 40%, and 65%.
We then ran the color map on correlations again and looked at the power analysis to assess whether the data structure is more orthogonal. The power is greater than 0.9 for all main effects, which means there is a greater than 90% chance that we can detect a significant effect of these variables.
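Design power for a main effect can be approximated from the design matrix alone: given an anticipated coefficient and error standard deviation, the coefficient's standard error comes from (X'X)⁻¹, and power is the probability that the resulting t statistic exceeds its critical value. The sketch below illustrates that calculation; the design, anticipated coefficient, and sigma are placeholders rather than the values used in JMP's design diagnostics.

```python
import numpy as np
from scipy import stats

def main_effect_power(X, j, beta_j=1.0, sigma=1.0, alpha=0.05):
    """Approximate power to detect coefficient j in the linear model y = X b + e."""
    n, p = X.shape
    dof = n - p
    xtx_inv = np.linalg.inv(X.T @ X)
    se = sigma * np.sqrt(xtx_inv[j, j])          # standard error of b_j
    ncp = beta_j / se                            # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, dof)     # two-sided critical value
    # Power = P(|t| > t_crit) under the noncentral t distribution.
    return (1 - stats.nct.cdf(t_crit, dof, ncp)) + stats.nct.cdf(-t_crit, dof, ncp)

# Placeholder design: intercept plus three coded main-effect columns
# (a small factorial-plus-center layout, not the actual dumpling DSD).
X = np.array([
    [1, -1, -1, -1],
    [1,  1, -1, -1],
    [1, -1,  1, -1],
    [1, -1, -1,  1],
    [1,  1,  1, -1],
    [1,  1, -1,  1],
    [1, -1,  1,  1],
    [1,  1,  1,  1],
    [1,  0,  0,  0],
], dtype=float)
print(main_effect_power(X, j=1, beta_j=1.0, sigma=1.0))
```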
Additionally, the color map on correlations indicates slightly reduced confounding for the main effects, as it is shaded bluer, especially for the continuous meat type. The Resolution IV confounding doesn't change much, so we will choose to stick with this design, since the apparent orthogonality is not bad, and later on, when we run the RSM, we will return to the color map to check whether any interactions may be due to a confounding problem.
So next we look at the prediction variance profile, which shows the prediction variance at different factor levels. The actual prediction variance depends on both the response error variance and a quantity that depends only on the design and the factor settings. For this prediction variance profile, however, the Y axis is the relative prediction variance, which is the actual variance divided by the response error variance, so that the error variance cancels out and the profile depends only on the DSD structure and not on the response.
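Concretely, the relative prediction variance at a setting x is x'(X'X)⁻¹x, where x is the model-term vector for that setting, and multiplying by the response error variance recovers the actual prediction variance. The sketch below evaluates this quantity for an invented design at the center and at a corner, showing that it depends only on the design.

```python
import numpy as np

def relative_prediction_variance(X, x):
    """Var[y_hat(x)] / sigma^2 = x' (X'X)^{-1} x for a linear model fit on design X."""
    return float(x @ np.linalg.inv(X.T @ X) @ x)

# Hypothetical coded design: intercept + two main effects (not the actual DSD).
X = np.array([
    [1, -1, -1],
    [1,  1, -1],
    [1, -1,  1],
    [1,  1,  1],
    [1,  0,  0],
], dtype=float)

center = np.array([1.0, 0.0, 0.0])   # model-term vector at the center point
corner = np.array([1.0, 1.0, 1.0])   # model-term vector at a corner
print(relative_prediction_variance(X, center))   # smallest near the center
print(relative_prediction_variance(X, corner))   # larger at the corners
```

The center point gives the smaller value, consistent with the profile being lowest near the center and largest at the corners.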
The top graph shows the variance profile at the center point and the bottom shows the maximized variance, and we need to look at both, because the average variance gives information about the entire design, while the maximum variance tells you about the worst-case point. The average variance is evaluated at Run #9's settings, the center point, where we had to substitute some of the meat. At Run #9 the relative variance is 0.06, which is not too bad, and the variance is symmetric about the center point, which means that the substitution does not severely impact the model orthogonality. The prediction variance will always be greatest at the corners, because fewer points surround them, so you don't want to make predictions at the corners; the optimal run is usually around the center, which has the least prediction variance. For the maximum variance, you can see that the curve is a bit asymmetric for dumpling weight and meat type. This worst-case point corresponds to Run #18, but there isn't anything special about it, so it shouldn't be too big of a deal. Since nothing major has stood out in this DSD structure, we will go ahead and run the RSM model with the continuous meat type.
So we ran the RSM model using stepwise regression with mixed selection based on the p-value. We chose stepwise regression instead of ordinary least squares regression because least squares needs at least as many runs as there are terms, and we don't have enough runs to estimate all the terms.
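A minimal sketch of p-value-based mixed (forward/backward) stepwise selection is shown below using ordinary least squares refits; the entry and removal thresholds, factor names, and data are placeholders, and it is only meant to illustrate the selection logic, not to reproduce JMP's stepwise platform.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise_by_pvalue(X, y, p_enter=0.10, p_remove=0.15):
    """Mixed stepwise selection: add the most significant candidate term,
    then drop any included term whose p-value has risen above p_remove."""
    included = []
    while True:
        changed = False
        # Forward step: try adding each excluded term and record its p-value.
        excluded = [c for c in X.columns if c not in included]
        pvals = pd.Series(dtype=float)
        for c in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [c]])).fit()
            pvals[c] = model.pvalues[c]
        if len(pvals) and pvals.min() < p_enter:
            included.append(pvals.idxmin())
            changed = True
        # Backward step: drop the worst included term if it is no longer significant.
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            term_pvals = model.pvalues.drop("const")
            if term_pvals.max() > p_remove:
                included.remove(term_pvals.idxmax())
                changed = True
        if not changed:
            return included

# Placeholder data: three factors and one interaction driving an invented response.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.choice([-1.0, 0.0, 1.0], size=(13, 3)),
                 columns=["WaterTemp", "DumplingWt", "BatchSize"])
X["WaterTemp*DumplingWt"] = X["WaterTemp"] * X["DumplingWt"]
y = 2 * X["WaterTemp"] - X["DumplingWt"] + X["WaterTemp*DumplingWt"] + rng.normal(0, 0.3, 13)
print(stepwise_by_pvalue(X, y))
```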
The three important main effects are water temperature, dumpling weight, and batch size, and we have the interaction term of water temperature times dumpling weight. The ANOVA is significant, which indicates that we can reject the null hypothesis and conclude we have a model. We don't have an overfitting problem, as the R squared and adjusted R squared differ by less than 10%. In the lack of fit report, the sum of the lack-of-fit error and the pure error is the total error sum of squares from the ANOVA table. The pure error is independent of the model and comes from things like experimental error, so if the total residual error is large compared to the pure error, it means we might have to fit a nonlinear model, since the linear model has a high lack of fit. When all the error is due to pure error, there is zero lack of fit, and the R squared value equals the maximum R squared. Here the p-value for the lack of fit is greater than 0.05, so we don't really have evidence that we need to switch to a nonlinear model, since the lack-of-fit error is small compared to the pure error.
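The lack of fit test rests on the decomposition SS(total error) = SS(lack of fit) + SS(pure error), where the pure error is computed from replicated factor settings. The sketch below carries out that decomposition and the corresponding F test for a simple fitted line; the data and replicate structure are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def lack_of_fit_test(df, factors, response, fitted, n_params):
    """Split residual error into pure error (within replicate groups) and lack of fit."""
    resid = df[response] - fitted
    ss_total_error = float((resid ** 2).sum())
    # Pure error: variation of the response within groups of identical factor settings.
    group_means = df.groupby(factors)[response].transform("mean")
    ss_pure = float(((df[response] - group_means) ** 2).sum())
    ss_lof = ss_total_error - ss_pure
    n_groups = df.groupby(factors).ngroups
    df_pure = len(df) - n_groups
    df_lof = n_groups - n_params
    f_stat = (ss_lof / df_lof) / (ss_pure / df_pure)
    p_value = 1 - stats.f.cdf(f_stat, df_lof, df_pure)
    return ss_lof, ss_pure, f_stat, p_value

# Invented example: one factor with replicated runs and a straight-line fit.
df = pd.DataFrame({"x": [-1, -1, 0, 0, 0, 1, 1],
                   "y": [2.1, 1.9, 3.2, 3.0, 3.1, 3.9, 4.1]})
slope, intercept = np.polyfit(df["x"], df["y"], 1)
fitted = intercept + slope * df["x"]
print(lack_of_fit_test(df, ["x"], "y", fitted, n_params=2))
```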
Now the R squared is not excellent, because of the possible outlier in Run #6 seen in the studentized residuals, which is right at the limit. Let's see why that run might be an outlier.
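For reference, an externally studentized residual rescales each raw residual by an error estimate that leaves that point out, so values near the plotted limits flag potential outliers; JMP's studentized residuals may use a slightly different convention. The sketch below computes this for a generic least-squares fit with placeholder data.

```python
import numpy as np

def externally_studentized_residuals(X, y):
    """t_i = e_i / (s_(i) * sqrt(1 - h_ii)), with s_(i) estimated without point i."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta                                  # raw residuals
    H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
    h = np.diag(H)                                    # leverages
    mse = e @ e / (n - p)
    # Leave-one-out error variance, computed without refitting n times.
    s2_i = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
    return e / np.sqrt(s2_i * (1 - h))

# Placeholder data with one response value pushed away from the trend.
X = np.column_stack([np.ones(8), np.arange(8, dtype=float)])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.0, 9.5, 7.1, 8.0])   # the 9.5 is the odd point
print(np.round(externally_studentized_residuals(X, y), 2))
```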
So, if you look at the prediction profiler at Run #6's settings, we see that there's a large sensitivity triangle, which indicates greater local sensitivity. This means that a small change in the input causes a large change in rising time, that is, a greater instantaneous slope at the factor settings of Run #6.
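Local sensitivity at a run's settings is essentially the partial derivative of the fitted response surface with respect to each factor, evaluated at that point. The sketch below approximates it with central finite differences on a toy quadratic-plus-interaction surface; the coefficients and the "Run #6" settings are placeholders, not the fitted dumpling model.

```python
import numpy as np

def predicted_rising_time(water_temp, dumpling_wt, batch_size):
    """Toy RSM-style surface (placeholder coefficients, coded units)."""
    return (10.0 - 2.0 * water_temp + 1.5 * dumpling_wt + 0.8 * batch_size
            + 1.2 * water_temp * dumpling_wt + 0.5 * water_temp**2)

def local_sensitivity(f, x, h=1e-4):
    """Central-difference partial derivatives of f at the point x."""
    x = np.asarray(x, dtype=float)
    grads = []
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grads.append((f(*(x + step)) - f(*(x - step))) / (2 * h))
    return np.array(grads)

# Placeholder "Run #6" settings in coded units.
run6 = [1.0, -1.0, 0.0]
print(np.round(local_sensitivity(predicted_rising_time, run6), 3))
```

A larger absolute slope for a factor at those settings corresponds to a larger sensitivity triangle in the profiler.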