Use of JMP for Restructuring and Analysis of DOE Data for Pigment Stability Optimisation
A small DOE was prepared by our formulation development project team. The project required the examination of a large number of different, measured responses to investigate the optimisation of pigment stability in a new, experimental paint formulation. However, the team had difficulty obtaining clear guidance from their analysis of the results. Through the use of JMP’s data restructuring tools, it was possible to reformat the existing data into something that could be easily analysed using SLS and logistics modelling to give the key impacts on stability, as well as the probability that a given combination of factors would yield a satisfactory result.
Hello, I'm John Steele. I work for AkzoNobel in their research and development department of their decorative paints business. I'm going to be going through the use of JMP for the restructuring of DoE data for a pigment stability optimization problem that was encountered by one of our teams.
The problem itself, initially, there was a DoE put together to investigate the impact of three different stabilizing materials being used in a paint system. These stabilizing materials prevent pigments from flocculation and causing issues in a final product. The initial DoE had 17 runs, examining the different interactions and levels of these stabilizing materials. As part of the pigment stability test, there were five different pigments that were tested for each of these different DoE runs. These had to be tested for both shear and mixing stability, and tested at three separate time points.
The design was set up in exactly the way that you would expect a sensible DoE to be set up. All the rules and considerations were followed. However, when it came to analyzing the data, there were some problems which we started to run into.
For this DoE, each different combination of the shear or mixing test, the different pigment material, and the time point that the testing was performed at, was entered as a separate response and analyzed separately. Overall, there was a total of 25 individual responses for these 17 individual DoE runs.
Some of these results modeled quite well, as we can see with the graphs on the top. Their linearities, fairly straight. Their R² is quite good. However, others, as we see with the two graphs at the bottom, did not model as well. There's quite a lot of noise. The R² is very poor.
Unfortunately, there wasn't any clear pattern as to which responses would analyze okay and which ones wouldn't. Part of this was down to some of the responses having very, very narrow data ranges covering a very small range of values that didn't differ all that much.
If we look at how the data initially was structured, we have our individual model runs with their levels of the three stabilizers. These, as I said, were tested on five separate pigments, and then each of these individual pigments were tested at one time point, a second time point, and a third time point. This was repeated for each and every combination of those variations of pigment and time points, and for each of the different test types.
Although these were entered as separate responses, it could be something that you can consider as almost a factor within the data that you're looking at itself. The test that this was performed for is a quality control test. They can't be treated as normal factors in a traditional design of experiment sense in that you always need to test at these specific time points with these specific colors. Whereas obviously for a design, it will not cover each and every single possible combination because of how designs work.
For the purposes of analysis, we can treat these as pseudo-hidden factors and restructure our data to get a better analysis out of it. Then there's the question, how do we go about restructuring this data?
If I go over to JMP, here we have the initial data from the design itself. We have our ID runs, we have the factors in the form of the different stabilizer levels, and then we have all the separate individual responses. This is filled with just dummy data for illustrative purposes. What we want to do is to take each of these individual responses and turn them into essentially a single response.
We can do this using some of JMP's very useful table and data manipulation tools. The first step is to use the stack function, and then we can select to collect all the columns for the individual responses that we want to combine together into a single column. If we click then onto Stack Columns, we'll put all those columns into here.
If we go down to the section at the bottom, we can select which of the initial columns in the data table we want to keep. We want to keep the ID and the factors. If we so desire, we can also change what the data and response labels and column headers are. If we click okay, we will get a new data table where we have our ID for the individual runs, our factors, and then all of the responses are now in a single column, and the corresponding response name is in a single column next to it.
From the initial data table, all of the data was set so that all of the results for ID1 would be a one, all the results for ID2 would be a two. You can see here, if we select the rows for ID1, all of that data has been transferred across. This is the thing that you can automate using JMP's script builders if it's the type process that you need to perform regularly or using some of the workflow functions. However, in this instance, this is very much just a one-off examination that we did.
If we go over to next table, you can see that currently, we have a label that is a mix of both the serial and the test time and the test type. This on its own isn't particularly useful from a factor perspective. We need to split this information out, which again, we can do using some of the built-in tools. If we go to Columns, Utilities, and Text Columns, with this selected columns, we can define a delimiter. In this case, it's going to be a comma, which we will use to separate that text string.
If we enter comma, hit okay. You can see now we have new columns which will just have material one, two, three, et cetera. The test type, and the time point label associated with it, which now they're split out. We can start treating those as separate factors, and we can entirely delete this initial column.
After that, we can go through and do some additional reformatting. We can change the column names, and in the case of the time point, we can even go through and recode that from a text string into something like a numeric values, which we can do, again, using some of JMP's built-in functions.
If we to recode under the column in the menu and then in place, we can change time 1 into 1, time 2 into 2, time 3 into 3. We recode that in place. All those values have changed. Then under Column Info, we can change that from a character to a numeric variable or factor. We can either keep this as nominal or change it to continuous or ordinal. In this case, I'm going to change it to ordinal.
Once all that reformatting has been done, I know from personal experience that this specific type of test, the actual fundamental numeric values that we're getting out, aren't that important. What is more important is whether the test is a pass or whether it is a fail. We can take these individual results and convert those into a pass or a fail condition for the purposes of modeling.
Over here in this data table, we have slightly different result values in here, but we have a separate column at the end for whether it is a pass or a fail. If we go into the Column Info section, you see under the Column Properties, we have added a value Color section and a formula section.
Here we're using some if statements to look to see whether the result is above or below a certain value, and based on that threshold, return whether the result is a fail, a pass, or if there's no data entered, whether it comes out as blank. The value Colors section also means that we can then specify what color the data points or cells associated with those values are going to be. In this case, a pass is green, a fail is red.
When it gets on to the process of modeling this data, we can either leave the data formatted as it currently is, and then do the modeling using the by variable for the test and model the sheer test and the mixed test separately, which is sometimes more convenient, or we can split this data, which is essentially the reverse of the stack process, so that we have the result for the shear test in one column and the result for the mix test in another column.
For this, we'd want to take in the split function in JMP's table tools, we want to split based on the pass or fail result and split by the test type. Then, if we add in grouping by both ID and material, this will ensure that the data is kept within the same row. Again, we can select which of the columns we want to keep. Again, we want to keep the ID and the factors. Also, we want to make sure that we put the time point in there as well. Then we will end up with a data table where we have separate columns for both the mix test and the shear test.
Then we can start looking into the actual modeling of this data. Now we have these pass or fail criteria for these results. As opposed to using the initial approach that we used for this particular problem, which was to use standard linear regression type methods. We can start looking at this using logistic regression and categorization methods, which if we run one of the models which I already have prepared for this, we get initially an R² that is much, much higher. We're up in the 0.9 range, as well as also getting the relative impact of the factors and the interactions between those factors that are having an impact on our results.
Another useful feature is if you scroll down, there is a section here called the Confusion Matrix, which is perhaps a more accurate way of describing the quality of a model for categorization type models. Based on whether we have a pass or a fail, sorry, one or a zero, it's saying how many of those results that are a one are being predicted as a one. How many that are a zero or a fail are being predicted as being a fail. In this case, there are only two results at the entire data set for this test that are being misclassified.
What we can do with this to get an even more in-depth insight into the potential behavior and interactions, is to visualize this data using some data simulation. For logistic models, the simulation tool can be found under the contour profiler rather than the profiler where it's sometimes found for standard linear regression and numeric type models.
Then we can start specifying what area we want to simulate our data over and how many data points we want to simulate. Then, when we simulate this, we not only get a range of randomized combinations of our factors and the specified material levels, the time points, what their most likely result is. We also get the probability of whether it is going to fall into a pass category or a fail category.
As you can see, some of these where we have a value of one, means that it is almost certainly going to be a pass, whereas there are others where it's only likely to be about 60% chance that is going to be a pass.
We can take all of these probability values here and then start plotting that against the factors that we're looking at, so the stabilizing materials, the pigments, and the time points to get a graph that looks something like this. I just unselect the cells. Here we have the stabilizer level of stabilizer one on one axis. We have a threshold range for stabilizer 2 and stabilizer 3. That's some of the box cell axes. We have the separate different pigment materials along the axes at the bottom.
Then all of these are colored based on the probability of whether it's going to pass or whether it's going to fail. Whereas at the start, we had very unclear ideas as to whether a certain material was going to give a certain result. It was a bit fuzzy as to what was happening. With a visualization like this from our restructured data, we can immediately see there's these areas that are very green, these areas that are very red, and we can start to get a much more in-depth understanding of potentially which areas we're going to have issues with potential failures.
You can see for the materials 4 and 5, generally they are appearing to be more red, especially at these lower levels of stabilizer 2. They are more likely to potentially be a problem. We're seeing being almost entirely green in most cases for material 1 and material 3. If you wanted more specific outputs from this graph, you can add in the local data filters and start focusing on, say, just material 1 and material 3, and start comparing them like that.
In summary, the results and outputs from your experimental design may have some hidden factors or factor-like attributes that you didn't originally consider, and without considering those, you might end up getting very poorly modeled data. But that doesn't necessarily mean that your data doesn't have value that you can use. Sometimes it's just a case of taking what you already have and structuring it in a slightly different a different way to get more value and utility out of it. Thankfully, JMP has a lot of very useful tools built into it to make this process very, very easy.
Once you've done that, you don't necessarily just need to use the very basic and more standardized linear regression and numeric type modeling to get useful information and results out of your data. That's everything that I wanted to cover in this today. Thank you very much for your time.