Monday, October 12, 2020
Laura Castro-Schilo, JMP Research Statistician Developer, SAS Institute, JMP Division
James Koepfler, Research Statistician Tester, SAS Institute, JMP Division
This presentation provides a detailed introduction to Structural Equation Modeling (SEM) by covering key foundational concepts that enable analysts from all backgrounds to use this statistical technique. We start with comparisons to regression analysis to facilitate understanding of the SEM framework. We show how to leverage observed variables to estimate latent variables, account for measurement error, improve future measurement, and improve estimates of linear models. Moreover, we emphasize key questions analysts can tackle with SEM and show how to answer those questions with examples using real data. Attendees will learn how to perform path analysis and confirmatory factor analysis, assess model fit, compare alternative models, and interpret all the results provided in the SEM platform of JMP Pro.
PRESENTATION MATERIALS
The slides and supplemental materials from this presentation are available for download here. Check out the list of resources at the end of this blog post to learn more about SEM. The article and data used for the examples of this presentation are available in this link.
Auto-generated transcript
Laura Castro-Schilo: Hello everyone. Welcome. This session is all about the ABCs of structural equation modeling, and what I'm going to try to do is leave you with enough tools to feel comfortable specifying models and interpreting the results of models fit using the structural equation modeling platform in JMP Pro.
Now what we're going to do is start by giving you an introduction to what structural equation modeling is, particularly drawing on the connections it has to factor analysis and regression analysis. Then we're going to talk a little bit about how path diagrams are used in SEM and their important role within this modeling framework.
I'm going to try to keep that intro short so that we can really spend time on our hands-on examples. After the intro, I'm going to introduce the data that I'm going to be using for demonstration. It turns out these data are about perceptions of threats of the COVID-19 virus. After introducing those data, we're going to start looking at how to specify models and interpret their results within the platform, specifically by answering a few questions. These questions are going to allow us to touch on two very important techniques that can be done with SEM: confirmatory factor analysis and multivariate regression. And to wrap it all up, I'm going to show you just a brief model in which we bring together both confirmatory factor models and regression models, so that you can really see the potential that SEM has for your own data and your own work.
Alright, so what is SEM? Structural equation modeling is a framework where factor analysis and regression analysis come together. From the factor analysis side, we get the ability to measure things that we do not observe directly. And on the regression side, we're able to examine relations across variables, whether they're observed or unobserved. So when you bring those two tools together, we end up with a very flexible framework, which is SEM, where we can fit a number of different models.
Path diagrams are a unique tool within SEM, because all structural equation models, the actual statistical models, can be depicted through a diagram. So we have to learn some notation for how those diagrams are drawn. Squares represent observed variables, circles represent latent variables, variances or covariances are represented with double-headed arrows, and regressions and loadings are represented with one-headed arrows. Now, as a side note, there's also a triangle that is used in path diagrams.
But that's outside the scope of what we're going to talk about today. The triangle is used to represent means and intercepts, and unfortunately we just don't have enough time to talk about all of the awesome things we can do with means as well.
Alright. But I also want to show you some of the fundamental building blocks of SEM. The first is a simple regression: x, which is in a box, is predicting y, which is also in a box. So we know those are observed variables, and each of them has a double-headed arrow that starts and ends on itself, meaning those are variances. For x, this arrow is simply its variance, but for y, the double-headed arrow represents a residual variance. Now, of course, you might know that in SEM a variable can be both an outcome and a predictor. So you can imagine having another variable z that y could also predict, and we can put together as many regressions as we are interested in within one model.
The second basic block for building SEMs is the confirmatory factor model, and the most basic example of that is shown right here, where we're specifying one factor, or one latent variable, which is unobserved. It's shown here with arrows pointing to w, x, and y because the latent variable is thought to cause the common variability we see across w, x, and y.
Now, this right here is called a confirmatory factor model, and it has only one factor. I think it's really important to understand what a factor is from the factor analytic perspective and to distinguish that from principal components, or principal component analysis (PCA), because the two are easy to confuse. So I'll take a quick tangent to show you what's different about a factor from a factor analytic perspective and from PCA.
So here these squares are meant to represent observed variables, the things that we're measuring, and I colored in blue different amounts of each observed variable, which represent the variance that those variables are intended to measure. So it's kind of like the signal; it's the stuff we're really interested in. And then we have these gray areas, which represent the proportion of variance that comes from other sources. It can be systematic variance, but it's simply variance that is not what we wanted to pick up with our measuring instrument. So it contains sources of variance that are unique to each of these variables, and it also contains measurement error.
So the difference between factor analysis and PCA is that in factor analysis, a latent variable is going to capture only the common variance that exists across all of these observed variables, and that is the part that the latent variable accounts for. That is in contrast to PCA, where a principal component represents the maximal amount of variance that can be explained in the dimensions of the data. You can see that the principal component is going to pick up as much variance as it can explain, and that means there's going to be an amalgamation of variance due to what we intended to measure and perhaps other sources of variance as well. So this is a very important distinction, because when we want to measure unobserved variables, factor analysis is indeed a better choice. If our goal truly is dimension reduction, then PCA is an ideal tool for that.
Also notice here that the arrows are pointing in different directions. That's because in factor analysis there really is an underlying assumption that the unobserved variable is causing the variability we observe. That is not the case in PCA, so you can see that distinction here in the diagram.
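The contrast between a factor and a principal component can be checked numerically. The sketch below assumes three standardized indicators with hypothetical loadings of 0.8, 0.7, and 0.6 on a single factor (values chosen only for illustration, not taken from the presentation). It builds the correlation matrix implied by the one-factor model and extracts the first principal component with a simple power iteration. The component's variance comes out larger than the common variance because the component also absorbs some of the unique variance.

```python
import math

# Hypothetical standardized loadings for indicators w, x, y (illustrative values).
loadings = [0.8, 0.7, 0.6]
p = len(loadings)

# One-factor model with factor variance 1: R = lambda * lambda' + diag(unique).
unique = [1.0 - l * l for l in loadings]
R = [[loadings[i] * loadings[j] + (unique[i] if i == j else 0.0)
      for j in range(p)] for i in range(p)]

def first_principal_component(matrix, iterations=500):
    """Return (largest eigenvalue, unit eigenvector) found by power iteration."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n))
                     for i in range(n))
    return eigenvalue, v

common_variance = sum(l * l for l in loadings)   # variance the factor explains
pc_variance, pc_vector = first_principal_component(R)
pc_loadings = [math.sqrt(pc_variance) * x for x in pc_vector]

print("common variance (factor):", round(common_variance, 3))
print("variance of first component:", round(pc_variance, 3))
print("component loadings:", [round(x, 3) for x in pc_loadings])
```

The power iteration is used only to keep the example free of third-party libraries; any eigenvalue routine would serve the same purpose.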
So anytime we talk about a factor or a latent variable in SEM, we're most likely talking about a latent variable from this perspective of factor analysis. Now here's a large structural equation model where I have put together a number of different elements from those building blocks I showed before. We see numbers because we've estimated this model. And you see that there are observed variables predicting other variables. We have some sequential relations, meaning that one variable leads to another, which then leads to another. This is also an interesting example because we have a latent variable here that's being predicted by an observed variable, but this latent variable also, in turn, predicts other variables.
So this diagram nicely illustrates a number of reasons we can have for using SEM. One is that we have unobserved variables that we want to model within a larger context. We may also want to account for measurement error, and that's important because latent variables are purged of measurement error: they only account for the variance that's in common across their indicators. If you also have the need to study sequential relations across variables, whether they're observed or unobserved, SEM is an excellent tool for that. And lastly, if you have missing data, SEM can also be really helpful even if you just have a regression, because all of the data that are available to you will be used during estimation, at least with the algorithms that we have in JMP Pro. So those are really great reasons for using SEM.
Now I want to use this diagram as well to introduce some important terminology that you'll for sure come across multiple times if you decide that SEM is a tool you are going to use in your own work. We talked about observed variables. Those are also called manifest variables in SEM jargon.
There are latent variables. There are also variables called exogenous; in this example, there are only two of them. Those are variables that only predict other variables, and they are in contrast to endogenous variables, which have other variables predicting them. So here, all of these other variables are endogenous. The manifest variables that a latent variable points to, that it predicts, are also called latent variable indicators. And lastly, we talked about the unique factor variance from a factor analytic perspective. Those are these residual variances on the indicators of a latent variable, which is variance that is not explained by the latent variable and represents the combination of systematic variance that's unique to that variable plus measurement error.
All right. I have found that, in order to understand structural equation modeling a bit more easily, it's important to shift our focus and realize that what we're really modeling under the hood is the covariance structure of our data. We also model the means, the mean structure, but again, that's outside the scope of the presentation today. But it's important to think about this because it has implications for what we think our data are. We're used to seeing our data tables, with observations in rows and variables in columns, and yes, those data can be used to launch our SEM platform. However, our algorithms, deep down, are actually analyzing the covariance structure of those variables. So when you think about the data, the covariance matrix is really what is being modeled under the hood. That also has implications for how we think about residuals, because residuals in SEM are with respect to variances and covariances: those that we estimate in contrast to those that we have in the sample.
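To make that shift in focus concrete, the sketch below computes a sample covariance matrix from a small, made-up data table and subtracts a model-implied covariance matrix from it to get residuals. The data values and the implied matrix are hypothetical placeholders; in practice the implied matrix would come from the estimated model.

```python
# Made-up data table: rows are observations, columns are variables x and y.
data = [
    [2.0, 1.0],
    [4.0, 3.0],
    [6.0, 4.0],
    [8.0, 8.0],
]

def sample_covariance(rows):
    """Sample covariance matrix (divisor n - 1) of a row-wise data table."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

S = sample_covariance(data)

# Hypothetical model-implied covariance matrix (would come from model estimates).
Sigma = [[6.5, 7.0],
         [7.0, 9.0]]

# Residuals in SEM are differences between sample and implied (co)variances.
residuals = [[S[i][j] - Sigma[i][j] for j in range(2)] for i in range(2)]

print("sample covariance:", S)
print("residuals:", residuals)
```

Each residual refers to one variance or covariance, not to one row of the data table, which is the difference from residuals in ordinary regression.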
So residuals, again, require a little bit of a shift in focus from what we're used to in other standard statistical techniques to really wrap our heads around what structural equation models are. And concepts such as degrees of freedom are also going to be degrees of freedom with respect to the covariance matrix. Once we make this shift in focus, I think it's a lot easier to understand structural equation models.
Now I want to go over a brief example of how SEM estimation works. Usually we start by specifying a model, and the most exciting thing about JMP Pro and our SEM platform is that we can specify that model directly through a path diagram, which makes it so much more intuitive. As we're drawing that path diagram, we're specifying a statistical model that implies the structure of the covariance matrix of the data. So the model implies a covariance structure, but of course we also have access to the sample covariance matrix. What happens during estimation is that we try to match the values in the sample covariance matrix as closely as possible, given what the model is telling us the relations between the variables are. Once we have our model estimates, we use those to get an actual model-implied covariance matrix that we can compare against the sample covariance matrix. By looking at the difference between those two matrices, we obtain residuals, which allow us to get a sense of how well our models fit or don't fit. So, in a nutshell, that is how structural equation modeling works.
Now we are done with the intro, so I want to give you a little bit of context for the data I'm going to be using for our demo. These data come from a very recently published article in the journal Social Psychological and Personality Science.
The authors of this paper wanted to answer a straightforward question: "How do perceived threats of COVID-19 impact our well-being and public health behaviors?" What's really interesting about this question is that perceived threat of COVID-19 is a very abstract concept. It is a construct for which we don't have a measuring instrument; it's something we do not observe directly. That is why, in this article, they first had to figure out how to measure those perceived threats.
The article focused on two specific types of threats. They call the first realistic threats, because they're related to physical or financial safety. The second are symbolic threats, which are threats posed to one's sociocultural identity. They went through a very rigorous process to develop this final threat scale, a survey that measures realistic threat and symbolic threat. Here you can see that people had to answer how much of a threat, if any, the coronavirus outbreak is for your personal health, the US economy, what it means to be an American, American values and traditions. So basically, these questions focus on two different aspects of the threat of the virus: one they labeled realistic, because it deals with personal and financial health issues, and the symbolic threat, which is more about that sociocultural component. You can see all of those questions here.
We got the data from one of the three studies that they did. Those data are from 550 participants who answered all of these questions in addition to a number of other surveys, so we'll be able to use those data in our demo.
And we're going to answer some very specific questions. The very first one is: how do we go about measuring perceptions of COVID-19 threats? There are two types of threats we're interested in.
And the question is, how do we do this, given that it's such an abstract concept? This will take us to confirmatory factor analysis and ways in which we can assess the validity and reliability of our surveys. One thing we're not going to talk about, though, is the very first step, which is exploratory factor analysis. That is something we do outside of SEM, and it should be done as the very first step in developing a new scale or a new survey, but here we're going to pick up from the confirmatory factor analysis step.
A second question is: do perceptions of COVID-19 threats predict well-being and public health behaviors? This will take us to talk more about regression concepts.
And lastly, are the effects of each type of threat on outcomes equal? This is where we're going to learn about a unique and helpful feature of SEM, which is that we can impose equality constraints on different effects within our models and do systematic model comparisons to answer these types of questions.
Okay, so it's time for our demo. Let me go ahead and show you the data, which I already have open right here. Notice my data table has a ton of columns, because there are a number of different surveys that participants responded to, but the first 10 columns here are the answers to the questions I showed you in that survey. So what we're going to do is go to Analyze, Multivariate Methods, Structural Equation Models, and we're going to use those answers from the survey. We select all 10 items and click Model Variables so that we can launch our platform with those data.
Now, right now I have a data table that has one observation per row. That's what the wide data format is, and that's what I'm going to use. But notice there's another tab for summarized data.
So if you have only the correlation or covariance matrix, you can input that instead. Well, you will be able to do it in JMP Pro 16, and that's another option because, remember, at the end of the day what we're really doing here is modeling covariance structures. So you can use summarized data to launch the platform.
Alright, so let's go ahead and click OK. The first thing we see is this model specification window, which allows us to do all sorts of fun things. On the far right here we have the diagram, and notice our diagram has a default specification. Our variables all have double-headed arrows, which means they all have a variance. They also have a mean: notice that if I right-click on the canvas here, I get some options to show the means or intercepts. Again, this is outside the scope of today, so I'm going to keep those hidden, but do know that the default model in the platform has variances and means for every variable that we observe.
The List tab contains the same information as the diagram, but in a list format, and it splits up your paths or effects based on their type. We also have a Status tab, which gives you a ton of information about the model that you have at that very moment. Right now, it tells us the degrees of freedom we have, given that the model we have specified here is just the default model. It also tells us data details and other useful things. Notice that this little check mark here changes if there is a problem with your model. As you're specifying your model, if we encounter something that looks problematic or an error, this tab will change in color and type, and so it will hopefully help you solve any issues with the specification of the model.
Okay, so on the far left, we have an area for the model name. And we also have From and To lists.
This makes it very easy to select variables in the From list and in the To list, wherever those might be, and connect them with single-headed arrows or double-headed arrows, which, as we know, are regressions or loadings, or variances or covariances.
For our case right now, we really want to figure out how to measure this unobserved construct of perceptions of COVID-19 threat. I know that the first five items I have here are the answers to questions that the authors labeled realistic threats. So I'm going to select those variables, and here we're going to change the default name of the latent variable, Latent1, to Realistic, because that's going to be the realistic threat latent variable. I click the plus button, and notice we automatically give you this confirmatory factor model with one factor for realistic threat. An interesting observation here is that there is a 1 on this first loading, which indicates that this path, this effect of the latent variable on the first observed variable, is fixed to one. We need to have that because otherwise our model would not be identified. So, by default, we fix the first loading to one in order to identify the model and be able to get all of your estimates. An alternative would be to fix the variance of the latent variable to one, which would also identify the model; it's a matter of choice which you do.
Alright, so we have a model for realistic threat. Now I'm going to select the items for symbolic threat, call this one Symbolic, and go ahead and add our latent variable for that. I changed my resolution, so we are seeing a little bit less than I was expecting, but here we go. There's our model. Now we might want to specify, and this is actually very important, the relation between realistic and symbolic threats.
We expect those to covary, and therefore we select them in our From and To lists and click on our double-headed arrow to add that covariance. So notice here, this is our full two-factor confirmatory factor analysis. We can name our model and we're ready to run. So let's go ahead and click Run.
And we have all of our estimates very quickly. Everything is here. What I want to draw your attention to, though, is the model comparison table. The reason is that we want to make sure our model fits well before we jump into trying to interpret the results.
So let's talk about what shows up in this table. First, let's note that there are two models here that we did not fit ourselves; the platform fits them by default upon launching. We use those as a reference, a baseline that we can compare a model against. The unrestricted model, which I will show you here, is what we would have if every single variable covaried with every other variable. In other words, it is a baseline for the best we can do in terms of fit. Now, the chi-square statistic represents the amount of misfit in a model, and because we are estimating every possible variance and covariance without any restrictions here, the chi-square is zero. We also have zero degrees of freedom because we're estimating everything. So it's a perfectly fitting model, and it serves as a baseline for understanding the very best we can do.
All right, then the independence model is actually the model that we have as a default when we first launch the platform. That is a baseline for the worst, maybe not the worst, but a pretty bad model. It's one where the variables are completely uncorrelated with each other. And you can see indeed that the chi-square statistic jumps to about 2000 units.
But of course, we now have 45 degrees of freedom because we're not estimating much at all in this model. And then lastly we have our own two-factor confirmatory factor model, and we see that the chi-square is also large: it's 147 with 34 degrees of freedom. It's a lot smaller than that of the independence model, so we're doing better, thankfully, but it's still a significant chi-square, suggesting that there's significant misfit between the model and the data.
Now, here's one big challenge with SEM. Back in the day, people realized that the chi-square is influenced by the sample size. Here we have 550 observations, so it's very likely that even well-fitting models are going to turn out to be significant because of the large sample size. So what has happened is that fit indices unique to SEM have been developed to allow us to assess model fit irrespective of the chi-square test, and that's where these baseline models come in.
The first one right here is called the comparative fit index (CFI). It ranges from zero to one. You can see here that one is the value for a perfectly fitting model and zero is the value for the really poor fitting model, the independence model. (I keep sorting this by accident.) Okay, so the value of this index for our own model is .9395. It's a proportion of how much better we are doing with our model in comparison to the independence model. So we're about 94% better than the independence model, which is pretty good. The guidelines are that a CFI of .9 or greater is acceptable. We usually want it to be as close to one as possible, and .95 is ideal.
We then also have the RMSEA, the root mean square error of approximation, which represents the amount of misfit per degree of freedom. We want this to be very small. It also ranges from zero to one.
And you see here, our model has .07, about .08, and generally speaking .1 or lower is adequate, and that's what we want to see. Then on the right, we get confidence limits around this statistic. So what this suggests is that indeed the model is an acceptable fitting model, and therefore we're going to go ahead and try to interpret it. But it's really important to assess model fit before getting into the details of the model, because we're not going to learn very much, or at least not a lot of useful information, if our model doesn't fit well from the get-go.
Alright, so, because this is a confirmatory factor analysis, we're going to find it very useful to show (I'm right-clicking here) the standardized estimates on the diagram. This means that the estimates for the factor loadings are in the correlation metric, which is really useful and important for interpreting these loadings in factor analysis. This value here is also rescaled so that it represents the actual correlation between these two latent variables, which in this case is a substantial .4 correlation. In the red triangle menu, we can also find the standardized parameter coefficients in table form, so we can get more details about standard errors, Z statistics, and so on. But you can see here that all of these values are pretty acceptable. They are the correlations between each observed variable and its latent variable, and they're generally about .48 to about .8 or so. So those are pretty decent values. We want them to be as high as possible.
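The numbers in the model comparison table can be roughly reproduced by hand. The sketch below counts degrees of freedom for the 10-item example and applies commonly published formulas for the CFI and RMSEA. The chi-square of 147, the 34 and 45 degrees of freedom, and the sample size of 550 come from the presentation. The independence chi-square is only described as "about 2000", so the CFI computed here is approximate, and the exact formulas used by JMP Pro may differ in small details (for example, N versus N - 1 in the RMSEA).

```python
import math

p = 10                                  # observed items in the threat scale
unique_elements = p * (p + 1) // 2      # variances and covariances in the matrix

# Independence model: only the 10 variances are estimated.
df_independence = unique_elements - p

# Two-factor CFA: 10 unique variances, 8 free loadings (one loading per factor
# is fixed to 1), 2 factor variances, and 1 factor covariance.
free_parameters = 10 + 8 + 2 + 1
df_cfa = unique_elements - free_parameters

def cfi(chisq_model, df_model, chisq_base, df_base):
    """Comparative fit index from model and baseline chi-square statistics."""
    misfit_model = max(chisq_model - df_model, 0.0)
    misfit_base = max(chisq_base - df_base, misfit_model)
    return 1.0 if misfit_base == 0.0 else 1.0 - misfit_model / misfit_base

def rmsea(chisq_model, df_model, n):
    """Root mean square error of approximation (N - 1 version)."""
    return math.sqrt(max(chisq_model - df_model, 0.0) / (df_model * (n - 1)))

approx_cfi = cfi(147.0, df_cfa, 2000.0, df_independence)   # 2000 is approximate
approx_rmsea = rmsea(147.0, df_cfa, 550)

print("df independence:", df_independence)
print("df two-factor CFA:", df_cfa)
print("approximate CFI:", round(approx_cfi, 3))
print("approximate RMSEA:", round(approx_rmsea, 3))
```

Means are left out of the count because, in the default specification, each observed mean is matched by a freely estimated mean, so they do not change the degrees of freedom.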
And another thing you're going to notice here about our diagrams, which is a JMP Pro 16 feature, is that these observed variables have a little bit of shading, a little gray area, which represents the proportion of variance explained by the latent variable. It's really cool, because you can see just visually from the shading that the variables for symbolic threat have more of their variance explained by the latent variable. So perhaps this is a better factor than the realistic threat factor, just based on how much variance is explained.
Now I want to draw your attention to an option called Assess Measurement Model, which is going to be super helpful for understanding whether the individual questions in that survey are actually good questions. We want our questions to be reliable, and indicator reliability is what we are plotting over on this side. Notice we give a little line here for a suggested threshold of acceptable reliability for any one of these questions, and you can see that symbolic threat seems to be doing a little better there. Its questions are more reliable in comparison to those for realistic threat. But generally speaking, they're both fairly good. The fact that some are not crossing the line is not terrible; these two questions are around the adequate threshold that we would expect for indicator reliability.
We also have statistics here that tell us about the reliability of the composite. In other words, if we were to grab all of the questions for realistic threat and compute an average score of those answers for each individual, that would be a composite for realistic threat, and we could do the same for symbolic threat. So what we have here is that index of reliability.
All of these indices, by the way, range from zero to one. We want them to be as close to one as possible because we want our measures to be very reliable, and we see here that both of these composites have adequate reliability. So it is fine to use an average score across these items in other analyses. We also have construct maximal reliability, and these are the values of reliability for the latent variables themselves rather than for averages. These values are always going to be a bit higher, because when you're using latent variables you're going to have better measures.
The construct validity matrix gives us a ton of useful information. The lower triangular simply has the correlation between the two factors in this case, but the values on the diagonal represent the average variance extracted across all of the indicators of the factor. Here you see that symbolic threat has more explained variance on average than realistic threat, but they both have substantial values, which is good. Most importantly, we want these diagonal values to be larger than the upper triangular, because the upper triangular represents the overlapping variance between the two factors. You can imagine that we want the factors to have more overlapping variance with their own indicators than with some other construct, with a different factor. So this right here, together with all of these other statistics, is good evidence that the survey is giving us valid and reliable answers and that we can in fact use it to pursue other questions.
And so that's what we're going to do here. We're going to close this and run a different model. We're going to relaunch our platform, but this time I have a couple of variables here that I created. These two are composites; they're averages of all of the questions that were related to realistic and symbolic threats.
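The measurement statistics described above have simple closed forms when computed from standardized loadings. The sketch below uses hypothetical loadings and a hypothetical factor correlation (the presentation does not list the exact estimates) and applies standard textbook formulas for indicator reliability, composite reliability, and average variance extracted (AVE). The statistics reported by JMP Pro may be computed somewhat differently.

```python
# Hypothetical standardized loadings for the two factors (illustrative only).
loadings = {
    "Realistic": [0.48, 0.60, 0.65, 0.70, 0.72],
    "Symbolic": [0.70, 0.75, 0.78, 0.80, 0.82],
}
factor_correlation = 0.40   # hypothetical correlation between the two factors

def indicator_reliability(lams):
    """Proportion of each indicator's variance explained by its factor."""
    return [l * l for l in lams]

def composite_reliability(lams):
    """Reliability of a sum or average score of the indicators."""
    true_part = sum(lams) ** 2
    error_part = sum(1.0 - l * l for l in lams)
    return true_part / (true_part + error_part)

def average_variance_extracted(lams):
    """Mean squared standardized loading across a factor's indicators."""
    return sum(l * l for l in lams) / len(lams)

shared_variance = factor_correlation ** 2   # overlap between the two factors

for name, lams in loadings.items():
    ave = average_variance_extracted(lams)
    print(name,
          "composite reliability:", round(composite_reliability(lams), 3),
          "AVE:", round(ave, 3),
          "AVE exceeds shared variance:", ave > shared_variance)
```

Comparing each AVE with the squared factor correlation mirrors the comparison of the diagonal with the upper triangular of the construct validity matrix.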
So I have those composite scores right here. I'm going to model those along with a measure of anxiety and a measure of negative affect. And we can also add CDC adherence, which is whether people are adhering to the recommendations from the CDC, the public health behaviors. So we're going to launch the platform with all of these variables.
What I want to do here is focus on fitting a regression model. I select those variables, my predictors and my outcomes, and I just click the one-headed arrow to set it up. Now, the model is not correctly specified yet, because we want to make sure that both of our COVID-19 threats are covarying with each other; we don't have any reason to assume they don't covary. Same with the outcomes: they need to covary because we don't want to impose any weird restrictions about them being orthogonal. So this right here is essentially a path analysis. It's a multivariate multiple regression analysis. We can name it multivariate regression, and we are going to go with this and run our model.
Okay. So notice that we have zero degrees of freedom because we've estimated every variance and covariance among the variables. So even though this suggests a perfect fit, all we've done so far is fit a regression model. To really emphasize that, I'm going to change the layout of my diagram to make it easier to show the associations of both types of threat with the outcomes. I want to hide variances and covariances, and you can see here that I've hidden the other edges so that we can just focus on the relations these two threats have with our three outcomes. Now, in my data table, I've already fit a regression using Fit Model, with that same anxiety variable as my outcome and the same two threats as the predictors.
And I want to put them side by side because I want to show you that in fact   the estimates for the regression of the prediction of anxiety here are exactly the same values that we have here in the SEM platform for both of those predictions.   And that's what we should expect. Fit model, I mean, this is regression. So far we're doing regression.   Technically, you could say, well, if I just, I'm comfortable running three of those fit models using these three different outcomes, then what is it that SEM is buying me that's, that's better.   Well, it might, you might not need anything else and this might be great for your needs, but one unique feature of SEM is that we can test directly whether there are   equality...whether equality constraints in the model are tenable. So what we mean by that is that I can take these two effects (for example, realistic and symbolic threat effects on anxiety) and I can use this set equal button to impose an equality constraint. Notice here,   these little labels indicate that these two paths will have an equal effect. And we can then   run the model and we can now select those models in our model comparison table and compare them with this compare selected models option.   And what we'll get is a change in chi square. So, we see just the change of chi square going from one model to the next. So this basically says, okay,   the model is going to get worse because you've now added a constraint. So you gained a degree of freedom, you have now more misfit, and the question is, is that misfit significant?   The answer in this case is, yes. Of course, this is influenced by sample size. So we also look at the difference in the CFI and RMSEA and anything that's .01 or larger suggests that   the misfit is too much to ignore. It's a significant additional misfit added by this constraint.   
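The model comparison just described can be sketched as a chi-square difference test. The fit statistics below are hypothetical (the talk does not quote the actual numbers), and the p-value helper only covers the one-degree-of-freedom case, which is what a single equality constraint adds.

```python
import math


def chi_square_p_value_1df(chi_square):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


def compare_nested(chisq_free, chisq_constrained, cfi_free, cfi_constrained):
    """Chi-square difference for one added constraint, plus the change in CFI."""
    delta_chisq = chisq_constrained - chisq_free
    return {
        "delta_chisq": delta_chisq,
        "p_value": chi_square_p_value_1df(delta_chisq),
        "delta_cfi": cfi_free - cfi_constrained,
    }


# Hypothetical fit statistics for the unconstrained and constrained models.
result = compare_nested(chisq_free=0.0, chisq_constrained=12.5,
                        cfi_free=1.000, cfi_constrained=0.985)
print(result)
print("constraint rejected:",
      result["p_value"] < 0.05 and result["delta_cfi"] >= 0.01)
```

Checking the CFI change alongside the p-value follows the point made in the talk: the chi-square test is sensitive to sample size, so a change of .01 or more in a fit index is used as a second signal.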
So now that we know that, we know that this is the better fitting model, the one that did not have that constraint, and we can assert that realistic threats have a greater positive association with anxiety in comparison to the symbolic threats, which also have a positive significant effect, but one that is statistically not as strong as that of the realistic threats. All right. And there are other interesting effects that we have here. So what I'm going to do as we are approaching the end of the session is just draw your attention to this interesting effect down here, where the two types of threats have different effects on the adherence to CDC behaviors. And the article really pays a lot of attention to this finding because, you know, these are both threats. But as you might imagine, those who feel threats to their personal health or threats to their financial safety are more likely to adhere to the CDC guidelines of social distancing and avoiding social gatherings, whereas those who are feeling that the threat is symbolic, a threat to their sociocultural values, those folks are significantly less likely to adhere to those CDC behaviors, perhaps because they are feeling those sociocultural norms being threatened. And so it's an interesting finding, and we can of course test equivalence of a number of other paths in this model. Okay, so the last thing I wanted to do is just show you (we're not going to describe this full model) what happens when you bring together both (let's make this a little bigger) regression and factor analysis.
To really use the full potential of structural equation models, you ought to model your latent variables. We have here both threats as latent variables, which allows us to really purge the measurement error from those survey items, and we model associations between latent variables, which allows us to have unbiased, unattenuated effects because we are accounting for measurement error when we model the latent variables directly. And notice, we are taking advantage of SEM because we're looking at sequential associations across a number of different factors, and so down here you can see our cool diagram, which I can move around to try and show you all the cool effects that we have and also highlight the fact that our diagrams are fully interactive, really visually appealing, and very quickly we can tell significant effects from nonsignificant effects based on the type of line; in this case, the nonsignificant ones are these dashed lines. And so again, to really have the full power of SEM, you can see how here we're looking at those threats as latent variables and looking at their associations with a number of other public health behaviors and with well-being variables. And so with that, I am going to stop the demo here and let you know that, in addition to the slides, we have a really great document that the tester for our platform, James Koepfler, put together, where he gives a bunch of really great tips on how to use our platform, from specifying the model, to tips on how to compare models, what is appropriate to look at, what's a nested model. All of this information I think you're going to find super valuable if you decide to use the platform. So I definitely suggest that you go on to the JMP Community to get these materials, which are supplementary to the presentation. And with that, I'm going to wrap it up and we are going to take questions once this all goes live. Thank you very much.
Ned Jones, Statistician, 1-alpha Solutions   Simulation has become a popular tool used to understand processes. In most cases the processes are assumed to be independent; however, many times this is not the case. A process can be viewed as physically independent, but this does not necessarily equate to stochastic independence. This is especially true when the processes are in series, such that the output of one process is the input for the next process and so forth. Using the JMP simulator, a simple series of processes is set up, represented by JMP random functions. The process parameters are assumed to have a multivariate normal distribution. By modifying the correlation matrix, the effect of independence versus dependence is examined. These differences are shown by examining the tails of the resulting distributions. When the processes are dependent, the effect of synergistic versus antagonistic process relationships is also investigated.     Auto-generated transcript...   Speaker Transcript nedjones The Good, the Bad, the Ugly or Independence and Dependence Synergistic and Antagonistic. I am Ned Jones. I have a small consulting business called 1-alpha Solutions. You can see my contact information there. Let's get into the simulation discussion. I'm going to be running the simulation, obviously, in JMP. The simulator allows you to model input random variation and the resulting output random variation, based on the inputs and any noise that you add into the simulator. The simulator is in the profiler; you define the random inputs, and you're able to run a simulation and produce output tables of the simulated variables. The next thing I want to do is talk about a couple of different types of simulation. Just a simple simulation: if you have one input and one output, there is no issue of dependence in the simulation.
The ones that we're concerned about primarily are simulations where we have multiple inputs, and we are simulating and will have one or more system outputs. The concern is that there could be dependence among the input variables. Scroll down here a little bit and we'll talk about what it means to have stochastic independence. Two events A and B are independent if and only if their joint probability equals the product of their probabilities. Well, that's what we want. That's the end result we want. I'm going to look at it a little bit differently; this should make it a little clearer. If the probability of the intersection of events A and B is equal to that product, that implies that the probability of A is equal to the probability of A given B, and similarly that the probability of B is equal to the probability of B given A. Thus the occurrence of B is not affected by the occurrence of A, and vice versa. Take the roll of a die, where A covers two of the six faces and B is a 2, 4, or 6. You can easily see that the probability of A is 2 in 6, or one third, and the probability of B is 3 in 6, or one half. But if you look at the intersection of A and B, that's 2. And so the probability of A times the probability of B is 1/6, and 2 is the only outcome you get in the intersection, so that's 1 in 6 as well. Now, the probability of A is equal to the probability of A given B, which is equal to one third. If we look at that, we're saying, okay, B has occurred, we know that we have a 2, a 4 or a 6, but there's still a one third chance that A could have occurred. So we can see that it stays at one third. And similarly, we see the same thing happening with the probability of B. Therefore, A and B are independent. Now I'm going to roll on and look at the example I have and talk about that. What I'm doing is I'm simulating the pest load and the probability of a mating pair.
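The die example above can be verified with exact fractions. One assumption is needed: the transcript does not say which two faces make up event A, so the sketch picks {1, 2}; any two faces containing 2 and no other even number give the same result.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event on one roll of a fair die."""
    return Fraction(len(event & die), len(die))


A = {1, 2}      # assumed: the transcript does not say which two faces form A
B = {2, 4, 6}   # the even faces, as in the talk

p_joint = prob(A & B)
print(prob(A), prob(B), p_joint)        # 1/3 1/2 1/6
print(p_joint == prob(A) * prob(B))     # True: A and B are independent
print(p_joint / prob(B) == prob(A))     # True: P(A | B) equals P(A)
```

Changing A to {2, 4} breaks the equality, since then knowing that B occurred raises the chance of A from one third to two thirds.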
What we have is a fruit harvest population, and from that fruit harvest population we're going to have some cultural practices that are applied in our orchard, grove or vineyard, and we'll have a pest load after those cultural practices are applied. Then the crop is harvested, we'll do some manual culling and we'll estimate a pest load there. And then after that you can see that we have cold storage, and we'll have a pest load after the cold storage. We're going to try to freeze them to death. And the final thing we do, once we get this pest load here, is break it up in a marketing distribution and split that population into several smaller pieces. And we'll be able to calculate the probability of a mating pair from that. Well, the problem, you can see immediately, is that these things become very dependent because the output of the harvest population is the input for treatment A. The output of treatment A is the input for treatment B, and so forth on to C, on down to the mating pair. Now here's a table we'll start with. We set this up for the simulator to work with, and we have a population range of 1 million to 2.5 million fruit. We have a treatment here, a range for the efficacy of mitigation that we're seeing. Here's the number of survivors we would expect from this treatment, and we have a Population A as a result of that. We're going to have a Population B as a result of that. We can take a look at the formulas here that are used. And what is being done here a little differently is that I'm putting a random variable in the formula that is going to go into the profiler. So the profiler is going to see this immediately as a random variable going in. So we're simulating the variable coming in, even before it comes in. So with that, we can go across.
You can see the rest of the table. We're going over. We have another set, we have survivors that's after Treatment C, the same type of thing. Then we have this distribution and we had a probability of a mating pair. I'll show you that formula. It's a little different. The probability of a mating pair. Well, this is just using an exponential to estimate the probability of mating pair so you know what's going on. I haven't hidden anything from you behind the curtain and so forth. Let's take a look. So to open up the profiler, we're going to go to graph and down to profiler. All right. And then from there, we're going to select our variables that have a formula that we're interested in. So we're gonna have...we're gonna have the Pop A, Pop B, survivors and the probability of a mating pair. Going to put those in and we're going to say, uh-oh, we got to extend the variation here and we're going to say, okay. We got a nice result. Very attractive graphs here. And first thing you're going to see is, you're going to see squiggly lines in this profiler that if you use a profiler that you're probably not used to seeing lines like that. It's just a little different approach and so forth that you can see how these things work and Doing a little adjustment here so you can see the screens better. Now from this point what we're gonna do is we're going to open up the simulator in the profiler. We go up here and just click on the simulator and it gives us these choices down here. First thing I want to do with this is I want to increase the number of simulation runs to 25,000. Okay. And what I'm going to do...what we do if we have independence, one of the tests, quite often for that, we use for that is that we we'll look at the correlation. So I'm going to use a correlation here. Use the correlations and set up some correlations. So for this first population, I'm going to call it multivariate and immediately you can see we get a correlation specification down below. 
And we'll set up another multivariate here and another multivariate for treatment B. Another multivariate for for treatment A. Now what this is doing is, this is taking those treatment parameters that we had up above, we had before. And it is putting those in our multivariate relationship with each other. We also got the last thing, this marketing distribution. I don't want it to be continuous so I'm going to make it random and we're going to make it an integer. We'll make that an integer and we've got that run and we can see the results. Now this is the...I'll call this the Goldilocks situation with all the zeros down here, that implies that all of these relationships are completely independent and we can run our simulator here. And see the results. Do little more adjustment here on these axes. This come to life. Please look for here. Okay. Now you see those results. But what we're going to have here and look at this, is, we have the rate at which it's exceeding a limit that's been put in there. I put those spec limits back in the variables, but the one that I'm most concerned about is the probability of a mating pair. And wouldn't you know, I've run this real time and it hasn't come out exactly the way it should. Let's try a couple more times here. See. What we got the probability of a mating pair and that is supposed to be coming up as .5, but it certainly isn't. I have something isn't...oh, here, let's try this and fix this. This would be 4 and 14 Let's try the simulation one more time. Still didn't come up. Well, the example hasn't worked quite right, but in the previous example I was having .4 here. So that was saying, the rate was creating less than time but I'm having that probability is A .15 probability of a mating pair, but that's what happens sometimes when you try to do things on on the fly. So let me go up and I have a window that I can, we can look at that result with...we can look to that result. And let's...that has a little bit differently. 
And you can see now that that probability is under 5%. That's the target we're aiming for in this simulation. So if we go up and run those simulations again, you can see those bouncing around, staying under .05, so it's happening less than 5% of the time that the probability of a mating pair is greater than .15. Now, again, I'll say this is kind of a Goldilocks scenario because we're assuming all these relationships are independent. I have examples that I can show you where the relationships are antagonistic and synergistic. So I'll pull up the first one here, and in this one the relationships are antagonistic. Now if you are creating an example like this to work with, you can't, at least I wasn't able to, make everything negative. If you notice, I have these two as being positive. The simulator wants the matrix to be positive definite, and it doesn't come out as positive definite if you set those all negative. But we can run that simulation again. And you can see here that with the negative correlations, it really makes things very, very attractive. We're getting a real low rate at which we exceed the .15 probability of a mating pair, so you can see the effect. And what I really want to show is the effect of this correlation specification, the correlation matrix down here that you specified. Now let's look at one other; we'll look at the one where it's positive. And we've got an example here where it's positive. Now, I haven't been real heroic about making those correlations very high. I've tried to keep them fairly low to be fairly realistic; after all, this is biological data. And we can run those simulations again and you can see very quickly that we're exceeding that 5% rate, which becomes a great deal of concern here.
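Two points from this part of the demo can be illustrated together: a correlation matrix has to be positive definite, and the sign of the correlations changes how often a limit is exceeded. The sketch below is not the JMP simulator and does not use the pest-load formulas; it assumes three standard normal inputs with the same pairwise correlation and tracks how often their sum passes a threshold. A Cholesky factorization serves both as the positive-definiteness check and as the way to generate correlated draws.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor, or None if the matrix is not positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 1e-12:
                    return None
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return lower


def equicorrelation(n, rho):
    """Correlation matrix with the same correlation rho between every pair."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def exceedance_rate(corr, threshold, runs=20000, seed=1):
    """Share of runs where the sum of correlated standard normals exceeds threshold."""
    lower = cholesky(corr)
    if lower is None:
        raise ValueError("correlation matrix is not positive definite")
    rng = random.Random(seed)
    n = len(corr)
    hits = 0
    for _ in range(runs):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total = sum(sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n))
        hits += total > threshold
    return hits / runs


# Threshold set so that three independent inputs exceed it about 5% of the time.
threshold = 1.645 * math.sqrt(3)
rates = {
    label: exceedance_rate(equicorrelation(3, rho), threshold)
    for label, rho in [("antagonistic", -0.3), ("independent", 0.0), ("synergistic", 0.3)]
}
print(rates)
print(cholesky(equicorrelation(3, -0.6)) is None)   # True: not positive definite
```

Even modest positive correlations push the exceedance rate well above the independent case, while negative ones pull it below. The last line shows why not every pair can be strongly negative: with three inputs, a common correlation below -0.5 is not a valid correlation matrix.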
And most of the time, simulations like this are run with no consideration of the correlation between variables, and that is kind of like covering your eyes and saying, I didn't see this. But if there is a correlation relationship, and most likely there is, because one of these outputs is the input to the next process, so it pretty well has to be dependent, then estimating what the dependencies are, estimating these correlations, will be a great task to have to come up with. Most of the time, work in this area is done based on research papers, and they don't have correlations between different types of treatments. But having some estimate of those is a good thing to have. Now the next step is to show you what else we can do here. We can create a table. And if we create this table, I'm just going to highlight everything in the table out to the mating pairs here. And then I'm just going to do Analyze, Distribution, and run all of those and say, okay. Now we get all these grand distributions, fill up the paper with it. But what we can do is go in and select the part of the distribution that is exceeding our limit out here. We can just highlight those, and it becomes very informative as you look back and you can see the mitigations, what's happening, what is affecting these things greatly. And one of the things that really stands out, first of all, is our initial population, and this has been based on what we've seen in real life: as the population gets to be higher, when we have large populations of the fruit, the tendency is that we have failures of the system, Treatment A and so forth.
So the one that I thought was most interesting was, if we look back here and we look at the marketing distribution, that if we require that, as shipments come into the country, the marketing distribution has to break these shipments up into smaller lots to be distributed, the probability of a mating pair pretty well becomes zero. With these examples, I want to go ahead and open it up for questions. But let me just say one last thing. I think of George Box. He was at one of our meetings a few years ago, and it was really interesting, two quotes that he said. He said, "Don't fall in love with a model." And he also said, "All models are wrong, but some are useful." I hope this information and these examples give you something to think about when you're doing a simulation: that you need to consider the relationship between the variables. Thank you.  
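The marketing-distribution effect described above can be illustrated with a small calculation. The formula here is an assumption, not the one used in the talk's data table (the talk only says it uses an exponential): it supposes the number of pests in a lot is Poisson distributed and each pest is male or female with equal probability, so the chance of at least one of each is (1 - exp(-mean/2)) squared. The number of surviving pests is also made up.

```python
import math


def prob_mating_pair(mean_pests_per_lot):
    """P(at least one male and one female) for Poisson pest counts and a 50/50 sex ratio."""
    half = mean_pests_per_lot / 2.0
    return (1.0 - math.exp(-half)) ** 2


surviving_pests = 4.0   # hypothetical expected pests left after all treatments
for lots in (1, 4, 14, 50):
    print(lots, round(prob_mating_pair(surviving_pests / lots), 4))
```

Under these assumptions, splitting the same surviving population across more lots drives the per-lot probability toward zero, which is the pattern highlighted in the distributions.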
Tony Cooper, Principal Analytical Consultant, SAS Sam Edgemon, Principal Analytical Consultant, SAS Institute   In process product development, Design of Experiments (DOE) helps to answer questions like: Which X’s cause Y’s to change, in which direction, and by how much per unit change? How do we achieve a desired effect? And which X’s might allow looser control, or require tighter control, or a different optimal setting? Information in historical data can help partially answer these questions and help run the right experiment. One of the issues with such non-DOE data is the challenge of multicollinearity. Detecting multicollinearity and understanding it allows the analyst to react appropriately. Variance Inflation Factors from the analysis of a model and Variable Clustering without respect to a model are two common approaches. The information from each is similar but not identical. Simple plots can add to the understanding but may not reveal enough information. Examples will be used to discuss the issues.     Auto-generated transcript...   Speaker Transcript Tony Hi, my name is Tony Cooper and I'll be presenting some work that Sam Edgemon and I did, and I'll be representing both of us.   Sam and I have been research statisticians in support of manufacturing and engineering for much of our careers. And today's topic is Understanding Variable Clustering is Crucial When Analyzing Stored Manufacturing Data.   I'll be using a single data set to present from. The focus of this presentation will be reading output. And I won't really have time to fill out all the dialog boxes to generate the output.   But I've saved all the scripts in the data table, which of course will be available in the JMP Community.   Once the script is run, you can always use that red arrow to do the redo option and relaunch the analysis, and you'll see the dialog box that would have been used to produce that piece of output.   I'll be using this single data set, which is manufacturing data. 
And let's have a quick look at how this data set looks. First of all, I have a timestamp. So this is a continuous process, and at some interval of time I come back and I get a bunch of information on some Y variables, some major outputs, some KPIs that, you know, the customer really cares about; these have to be in spec and so forth in order for us to ship effectively. So these would absolutely, definitively be outputs. Right. Then there is line speed, the set point for the vibration, and a bunch of other things. So you can see all the way across, I've got 60-odd variables that we're measuring at that moment in time. Some of them are sensors and some of them are set points. And in fact, let's look and see that some of them are indeed set points, like the manufacturing heat set point. Some of them are commands, which means, here's what the PLC told the process to do. Some of them are measures, which is the actual value that it's at right now. Some of them are ambient conditions; I think that's an external temperature. Some of them are raw material quality characteristics, and there's certainly some vision system output. So there's a bunch of things in the inputs right now. And you can imagine, for instance, if I've got the command and the measure, what I told the zone to be at and what it actually measured, that we hope those are correlated. And we need to investigate some of that in order to think about what's going on. And that's multicollinearity: understanding the interrelationships amongst the inputs or, separately, understanding the multicollinearity in the outputs. But by and large, we're not doing X cause, Y effect right now. You're not doing a supervised technique. This is an unsupervised technique, where I'm just trying to understand what's going on in the data in that fashion. And all my scripts are over here.
So yeah, here's the abstract we talked about; here's a review of the data. And as we just said, it may explain why there is so much multicollinearity, because I do have set points and actuals in there. But we'll learn about those things. What we're going to do first is fit Y equals a function of X, and I'm only going to look at these two variables right now. The zone one command, so what I told zone one to be, if you will. And then I've got a little thermocouple in the zone one area and it's measuring. And you can see that clearly, as zone one temperature gets higher, this Y4, this response variable, gets lower. And that's true also for the measurement, and you can see that in the estimates, both negative. And by the way, these are fairly predictive variables, in the sense that just this one variable is explaining 50% of what's going on in Y4. Well, let's do a multivariate model. So you imagine I'm now going to Fit Model and have moved both of those variables into my model, and I'm still analyzing Y4. My response is still Y4. Oh, look at this. Now it's suggesting that as the command goes up, Y4 goes up. That does the opposite of what I expect. This one is still negative, in the right direction, but look at some of these numbers. These aren't even in the same ballpark as what I had a moment ago, which was negative .04 and negative .07. Now I have positive .4 and negative .87. I'm not sure I can trust this model from an engineering standpoint, and I really wonder how it's characterizing and helping me understand this process. And there's a clue. This may not be on by default, but you can right click on this parameter estimates table and ask for the VIF column. That stands for variance inflation factor.
And that is an estimate of the instability of the model due to this multicollinearity. So we need to get a little intuition on how to think about that value. But just to be a little clearer about what's going on, I'm going to plot...here's temperature zone one command and here's the measure, and as you would expect, as you tell the zone to increase in temperature, yes, the zone does increase in temperature, and by and large it's going to maybe even the right values. I've got this color coded by Y4, so it is suggesting that at low values of temperature I get the high values of Y4, just the way I saw on the individual plots. But you can see the problem, and maybe get some intuition as to why the multivariate model didn't work. You're trying to fit a three-dimensional surface over this knife edge, and obviously there's some instability; you can easily imagine it's not well pinned on the sides, so it can rock back and forward here, and that's what you're getting in terms of the variance inflation factor. The OLS software, ordinary least squares, the typical regression analysis, just can't handle it. In some sense, maybe we could also talk about the X prime X matrix being, you know, almost singular in places. So we've got some heuristic of why it's happening. Let's go back and think more about the values. We now know that the variance inflation factor actually does think about more than just pairwise relationships. But some intuition helps me think about the relationship between VIF and pairwise comparison. If I have two variables where one explains 60% of the variance in the other (an R-square of .6), then, if it was all pairwise, the VIF would be about 2.5.
And if the R-square between two variables were 90%, then I would get a VIF of 10. And the literature says that when you have VIFs of 10 and more, you have enough instability that you should worry about your model, because in some sense you've got one variable explaining 90% of the variation in another in your data. Now whether 10 is a good cutoff really depends, I guess, on your application. If you're making real design decisions based on it, I think a cutoff of 10 is way too high, and maybe even near two is better. But if I was thinking more, well, let me think about what factors to put in an experiment, like I want to learn how to run the right experiment, and I know I've got to pick factors to go in it, and I know we too often go after the usual suspects, and is there a way I can push myself to think of some better factors, well, then maybe a 10 might be useful to help you narrow down. So it really depends on what the purpose is. So more on this idea of purpose. You know, there are two main purposes, if you will, to me, of modeling. One is what will happen; you know, that's prediction. And that's different, sometimes, from why will it happen, and that's more like explanation. As we just saw with a very simple command and measure on zone one, you cannot do very good explanation. I would not trust that model to think about why something's happening when the estimates seem to go in the wrong direction like that. So I wouldn't use it for explanation. I'm sort of suggesting that I wouldn't even use it for prediction, because if I can't understand the model, if it's not intuitively doing what I expect, that seems to make extrapolation dangerous. And of course, by definition, you know, I think prediction of a future event is extrapolation in time.
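The VIF figures of 2.5 and 10 quoted above follow from the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The short sketch below makes the arithmetic explicit and shows the distinction between R² and a pairwise correlation; the function names are illustrative only.

```python
import math


def vif_from_r_squared(r_squared):
    """Variance inflation factor for a predictor with the given R-square on the other predictors."""
    return 1.0 / (1.0 - r_squared)


def pairwise_vif(correlation):
    """With exactly two predictors, R-square is the squared pairwise correlation."""
    return vif_from_r_squared(correlation ** 2)


# R-square, the VIF it implies, and the pairwise correlation with that R-square.
for r_squared in (0.6, 0.9):
    print(r_squared, round(vif_from_r_squared(r_squared), 2), round(math.sqrt(r_squared), 3))

# VIF implied by a pairwise correlation of .6 or .9.
for r in (0.6, 0.9):
    print(r, round(pairwise_vif(r), 2))
```

A VIF of 2.5 corresponds to an R-square of .6 and a VIF of 10 to an R-square of .9, which for a single pair of variables means correlations of roughly .77 and .95. A pairwise correlation of .6 on its own gives a VIF of only about 1.56.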
Right, so I would always think about this. And, you know, we're talking about ordinary least squares so far. All the modeling techniques I see, like decision trees and partition analysis, are in some way affected by this issue, in different, unexpected ways. So it seems a good idea to take care of it if you can. And it isn't unique to manufacturing data. But it's probably exaggerated in manufacturing data because often the variables are controlled. If we have, you know, zone one temperature and we've learned that it needs to be this value to make good product, then we will control it as tightly as possible to that desired value. And so the extrapolation really gets harder and harder. So this is exaggerated in manufacturing because we do control them. And there are some other things about manufacturing data, which you can read here, that maybe make it better, because this is opportunities and challenges. Better is that you can understand manufacturing processes. There's a lot of science and engineering around how those things run. And you can understand that stoichiometry, for instance, requires that the amount of chemical A you add has to be in relationship to chemical B. Or, you know, you don't want to go from zone one temperature here to something vastly different because you need to ramp it up slowly; otherwise you'll just create stress. So you can and should understand your process, and that may be one way, even without any of this empirical evidence on multicollinearity, that you can get rid of some of it. There's also an advantage to manufacturing data in that it's almost always time based, so do plot the variables over time.
And it's always interesting, or not always, but often interesting: the manufacturing data does seem to move in blocks of time. Like here, we think it should be 250 and we run it for months, maybe even years, and then suddenly someone says, you know, we've done something different or we've got a new idea, let's move the temperature. And so it's very different. And of course, if you're thinking about why there is multicollinearity, we've talked about how it could be due to physics or chemistry, but it could be due to some latent variable, of course, especially when there's a concern with variables shifting in time, like we just saw. Anything that also changes at that rate could be the actual thing that's affecting the values of Y4. It could be due to a control plan. It could be correlated by design. And for each of these reasons for multicollinearity, the question I always ask is, you know, is it physically possible to try other combinations? Or not just physically possible: does it make sense to try other combinations? In which case you're leaning towards doing experimentation, and this is very helpful. This modeling and looking retrospectively at your data is very helpful for designing better experiments. Sometimes, though, the expectations are a little too high; in my mind, we seem to expect a definitive answer from this retrospective data. So we've talked about a couple of methods already to address multicollinearity and understand it. One is VIF. And here's the VIF on a bigger model with all the variables in. How would I think about which are correlated with which? This tells me I have a lot of problems. But it does not tell me how to deal with these problems. So this is good at helping you say, oh, this is not going to be a good model. But it's maybe not helpful in getting you out of it.
And if I were to put interactions in here, it'd be even worse because they are almost always more correlated. So we need another technique. And that is variable clustering. This is available in JMP and there are two ways to get to it. You can go through Analyze, Multivariate, Principal Components. Or you can go straight to Clustering, Cluster Variables. If you go through Principal Components, you've got to use the red triangle option to get the cluster variables. I still prefer this method, the PCA, because I like the extra output. But it is based on PCA. So what we're going to do is talk about PCA first and then we will talk about the output from variable clustering. And there's the JMP page. In order to talk about principal components, I'm actually going to work with standardized versions of the variables first. And let's remind ourselves what standardized is: the mean is now zero, the standard deviation is now 1. So I've standardized, which means that they're all on the same scale now, and implicitly, when you do principal components on correlations in JMP, you are doing it on standardized variables. JMP is, of course, more than smart enough for you to put in the original values and for it to work in the correct way, and then when it outputs a formula, it will figure out what the right formula should have been, given you've gone back to unstandardized. But just as a first look, so we can see where the formulas are, I'm going to work with standardized variables for a while. We will quickly go back, but I just want to see one formula. And that's this formula right here. And what I'm going to do is think about the output. So what is the purpose of PCA? Well, it's called a variable reduction technique, but what it does is look for linear combinations of your variables.
And if it finds a linear combination that it likes, that's called a principal component. And it uses eigenanalysis to do this. So another way to think about it is, you know, I put 60-odd variables into the inputs. There are not 60 independent dimensions that I can then manipulate in this process; they're correlated. What I do to the command for temperature zone one dictates almost entirely what happens with the actual temperature in zone one. So those aren't two independent variables. You don't have two dimensions there. So how many dimensions do we have? That's what the eigenvalues tell you. These are like variances. These are estimates of variance and the cutoff is one. And if I had the whole table here, there would be 60 eigenvalues. They go all the way down; JMP stops reporting at .97, but the next one down is probably .965 and it will keep on going down. The cutoff, or the guideline, says that if the eigenvalue is greater than one, then this linear combination is explaining signal. And if it's less than one, then it's just explaining noise. And so what JMP does when I go to the variable clustering is it says, you know what, you have a second dimension here. That means a second group of variables that's explaining something a little bit differently. Now I'm going to separate them into two groups. And then I'm going to do PCA on both. The first eigenvalue will be big for both, but what does the second one look like after I split into two groups? Are they now less than one? Well, if they're less than one, you're done. But if they're greater than one, it's like, oh, that group can be split even further. And it will split that even further, iteratively and divisively, until there are no second components greater than one anymore.
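The eigenvalue cutoff can be illustrated with a small sketch. This is not JMP's algorithm; it is a plain power-iteration-with-deflation routine in Python, applied to a hypothetical 4 x 4 correlation matrix in which variables 1 and 2 move together and variables 3 and 4 move together. All names and numbers are assumptions made up for the illustration.

```python
import math

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpair(m, iters=500):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    v = [1.0 + 0.1 * i for i in range(len(m))]  # start vector, not aligned with any block
    for _ in range(iters):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, mat_vec(m, v)))  # Rayleigh quotient
    return lam, v

def eigenvalues_by_deflation(m):
    """All eigenvalues, largest first, by repeatedly removing the top component."""
    work = [row[:] for row in m]
    eigs = []
    for _ in range(len(m)):
        lam, v = top_eigenpair(work)
        eigs.append(lam)
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(len(m))]
                for i in range(len(m))]
    return eigs

# Hypothetical correlation matrix: two tight pairs, weakly related to each other.
corr = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

eigs = eigenvalues_by_deflation(corr)
n_dimensions = sum(1 for e in eigs if e > 1.0)  # the "greater than one" guideline
print([round(e, 3) for e in eigs], n_dimensions)
```

Four variables go in, but only two eigenvalues exceed one, which is the signal that the set should be split into two groups before checking each group's second eigenvalue again.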
So it's creating these linear combinations, and you can see, when I save principal component one, it's exactly this formula. Oops. It's exactly this formula, .0612, and that's the formula for Prin1, and this will be exactly right as long as you standardize. If you don't standardize, then JMP is just going to figure out how to do it on the unstandardized variables, which it's more than capable of doing. So let's start working with the initial data. And we'll do our example. And you'll see this is very similar output. In fact, it's the same output except for one thing I've turned on, which is this option right here, Cluster Variables. And what I get down here is that cluster variables output. And you can see these numbers are the same. And you can already start to see some of the thinking it's going to have to do. Like, let's look at these two right here. The amount of A I put in seems to be highly related to the amount of B I put in, and that would make sense for most chemical processes, if that's part of the chemical reaction that you're trying to accomplish. You know, if you want to put 10% more A in, you're probably going to put 10% more B in. So even in this ray plot you start to see some things that suggest the multicollinearity. And so we're getting somewhere. But I want to put them in distinct groups, and this is a little hard because, watch this guy right here, temperature zone 4. He's actually the opposite. They're sort of at the same angle, but in the opposite direction, right, so he's almost 180 degrees from A and B. So maybe zone 4 temperature is negatively correlated with A and B. But I also want to put them in exclusive groups, and that's what we get when we ask for the variable clustering. So it took those very same things and put them in distinct groups, and it put them in eight distinct groups. And here are the standardized coefficients.
So these are the formulas for the individual clusters. And so when I save the cluster components, I get something very similar to what we did with Prin1, except this is just for cluster 1, because you notice that in this row that has zone one command with a .44, everywhere else is zero. Every variable is exclusively in one cluster or another. So let's talk about some of the output. And so we're doing variable clustering and... Oops. Sorry. Tony And then we've got some tables in our output. So I'm going to minimize this window and we'll talk about what's in here in terms of output. And the first thing I want to point us to is the standardized estimates, and we were doing that. And if you want to do it, quote, by hand, if you will, and ask how do I get a .35238 here, I could run PCA on just the cluster one variables. These are the variables in there. And then you could look at the eigenvalues and eigenvectors, and these are the exact numbers. So the .352 is just what I said. It keeps divisively doing multiple PCAs, and you can see that the second principal component here is, oh sorry, less than one. So that's why it stopped at that component. Who's in cluster one? Well, there's temperature: the two zone one measures, a zone three A measure, and the amount of water seems to matter compared to that. Some of the other temperatures are over here in cluster six. This is a very helpful color-coded table. This is organized by cluster. So this is cluster one; I can hover over it and I can read all of the things: added water (well, I said I should, yeah, added water. Sorry, let me get one that's my temperature three.) And that's a positive correlation; you know, it's interesting that zone 4 has a negative correlation there.
So you will definitely see blocks of color, because this is cluster one, obviously; this is maybe cluster three. This, I know, is cluster six. But look over here: as you can imagine, clusters one and three are somewhat correlated. We start to see some ideas about what we might do. So we're starting to get some intuition as to what's going on in the data. Let's explore this table right here. This is the R squared with own cluster, with next cluster, and the one minus R squared ratio. And I'm going to save that to my own data set. I'm going to run a multivariate on it. So I've got cluster one through eight components and all the factors. And what I'm really interested in is this part of the table, for all of the original variables and how they cluster with the components. So let me save that table, and then what I'm going to do is delete some extra rows out of this table, but it's the same table. Let's just delete some rows so we can focus on certain ones. So what I've got left is the columns that are the cluster components, and the rows are the different variables that we were thinking about, 27 now, not 60 (sorry), and it's put them in 8 different groups and I've added that number. That can come automatically. In order to continue to work, I don't want these as correlations. I want these as R squares. So I'm going to square all those numbers. So I just squared them, and here we go. Now we can start thinking about it. So let's look at row one. Sorry, this one, the temperature one measure we've talked about, is 96% correlated with cluster one. It's 1% correlated with cluster two, so it looks to me like it really, really wants to be in cluster one and it doesn't really have much to say about wanting to go into cluster two.
What I want to do is find all of those... let's color code some things here so we can find them faster. So we're talking zone one meas, and the one that it would like to be in, if anything, is cluster five. You know, it's 96% correlated, but if it had to leave this cluster, where would it go next? It would think about cluster five. And what's in cluster five? Well, you could certainly start looking at that. So there are more temperatures there, the moisture heat set point, and so forth. So the R squared with its own cluster tells you how much it likes being in its cluster, and the R squared with the next cluster tells you how much it would like to be in a different one. And those are the two things reported in the table. You know, JMP doesn't show all the correlations, but it may be worth plotting them and, as we just demonstrated, it's not hard to do. So this tells you: I really like being in my own cluster. This says: if I had to leave, then I don't mind going over here. Let's compare those two. Let's take one minus this number, divided by one minus this number. That's the one minus R squared ratio. So it's a ratio of how much you like your own cluster to how much you are tempted to go to another cluster. And let's plot some of those. Let me look for the local data filter on there. The cluster. And here's the thing. In some sense, lower values of this ratio are better. These are the ones over here (let's highlight him)... Well, let's highlight this one at the top here. I like the one down here. Sorry. This one, vacuum set point, you can see really, really liked its own cluster over the next nearest, .97 vs .0005, so you know that the ratio is near zero. So you wouldn't want to move that over. And you could start to do things like, let's just think about the cluster one variables.
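The one minus R squared ratio described here is simple enough to check by hand. The short Python sketch below uses the vacuum set point values quoted above (.97 with its own cluster, .0005 with the next closest); the function name and the second, weaker example are hypothetical and only there for contrast.

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)

# Values quoted in the talk for vacuum set point.
vacuum_ratio = one_minus_r2_ratio(0.97, 0.0005)

# Hypothetical variable that fits its own cluster less well and is also
# fairly close to a neighboring cluster.
weak_ratio = one_minus_r2_ratio(0.60, 0.45)

print(f"vacuum set point: {vacuum_ratio:.4f}, weaker member: {weak_ratio:.4f}")
```

A ratio near zero means the variable sits firmly in its own cluster; values approaching one mark the variables that are candidates to move.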
If anyone wanted to leave it, maybe it's the Pct_reclaim. Maybe he got in there, you know, by fortuitous chance, and if I was in cluster two, I can start thinking about the line speed. The last table I'm going to talk about is the cluster summary table. That's this table here. And it's gone through this list and said that if I look down R squared with own cluster, the highest number is .967 for cluster one. So maybe that's the most representative. To me, it may not be wise to let software decide I'm going to keep this variable and only this variable, although certainly every software has a button that says launch it with just the main ones, and that may give you a stable model, in some sense, but I think it's shortchanging the kinds of things you can do. And hopefully, with the techniques provided, you now have the ability to explore those other things. You can calculate this number by, again, doing the PCA on just its own cluster, and it's suggesting that for this cluster, eigenvalue one, one component explained .69%, so that's another way of thinking about its own cluster. Close these and let's summarize. So we've given you several techniques. Failing to understand what's correlated can make your data hard to interpret, and even make the models misleading. Understanding the data as it relates to the process can reduce multicollinearity, just from an SME, subject matter expertise, standpoint. Pairwise correlations and scatter plots are easier to interpret but miss multicollinearity, and so you need more. VIF is great at telling you you've got a problem. It's only really available for ordinary least squares. There's no comparable thing for prediction.
And there are some guidelines. Variable clustering is based on principal components and shows group membership and the strength of the relationship in each group. And I hope that this review or introduction would encourage you, maybe, to grab, as I say, this data set from JMP Discovery, and you can run the script yourself and go more slowly and make sure that you feel good, and practice on maybe your own data or something. One last plug is, you know, there's other material from other times that Sam and I have spoken at JMP Discovery conferences or written white papers for JMP, and you might want to think about all we've thought about with manufacturing data, because we have found that modeling manufacturing data is a little unique compared to understanding, like, customer data in telecom, which is where a lot of us learned when we went through school. And I thank you for your attention, and good luck with further analysis.
Monday, October 12, 2020
Charles Whitman, Reliability Engineer, Qorvo   Simulated step stress data where both temperature and power are varied are analyzed in JMP and R. The simulation mimics actual life test methods used in stressing SAW and BAW filters. In an actual life test, the power delivered to a filter is stepped up over time until failure (or censoring) occurs at a fixed ambient temperature. The failure times are fitted to a combined Arrhenius/power law model similar to Black’s equation. Although stepping power simultaneously increases the device temperature, the algorithm in R is able to separate these two effects. JMP is used to generate random lognormal failure times for different step stress patterns. R is called from within JMP to perform maximum likelihood estimation and find bootstrap confidence intervals on the model estimates. JMP is used live to plot the step patterns and demonstrate good agreement between the estimates and confidence bounds to the known true values. A safe-operating-area (SOA) is generated from the parameter estimates. The presentation will be given using a JMP journal.   The following are excerpts from the presentation.   Auto-generated transcript...   Speaker Transcript CWhitman All right. Well, thank you very much for attending my talk. My name is Charlie Whitman. I'm at Qorvo and today I'm going to talk about step stress modeling in JMP using R. So first, let me start off with an introduction. I'm going to talk a little bit about stress testing, what it is and why we do it. There are two basic kinds, constant stress and step stress; I'll talk a little bit about each. Then what we get out of the results from the step stress or constant stress test are estimates of the model parameters. That's what we need to make predictions. So in stress testing, we're stressing parts at very high stress and then taking that data and extrapolating to use conditions, and we need model parameters to do that.
But model parameters are only half the story. We also have to acknowledge that there's some uncertainty in those estimates, and we're going to do that with confidence bounds, and I'm going to talk about a bootstrapping method I used to do that. And at the end of the day, armed with our maximum likelihood estimates and our bootstrap confidence bounds, we can create something called the safe operating area, SOA, which is something of a reliability map. You can also think of it as a response surface. So what we're going to get is regions where it's safe to operate your part and regions where it's not safe. And then I'll reach some conclusions. So what is a stress test? In a stress test you stress parts until failure. Now sometimes you don't get failure; sometimes you have to stop the test and do something else. In that case, you have a censored data point, but the method of maximum likelihood, which I used in the simulations, takes censoring into account, so you don't have to have 100% failure. We can afford to have some parts not fail. So what you do is you stress these parts under various conditions, according to some designed experiment or some matrix or something like that. So your stress might be temperature or power or voltage or something like that, and you'll run your parts under various conditions, various stresses, and then take that data, fit it to your model, and then extrapolate to use conditions. In the Arrhenius model, $\mu = \ln A + E_a/(kT)$. Mu is the log mean of your distribution; we commonly use the lognormal distribution. That's going to be a constant term plus the temperature term. You can see that mu is inversely related to temperature. So as temperature goes up, mu goes down, and as temperature goes down, mu goes up. If we use the lognormal, you will also have an additional parameter, the shape factor sigma. So after we run our test, we will have run several parts under very stressed conditions and we fit them to our model.
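As a small illustration of the Arrhenius relationship just described, the Python sketch below evaluates mu = ln A + Ea/(kT) at two temperatures and converts it to a lognormal median life, exp(mu). The parameter values (ln A, Ea) and the function names are hypothetical placeholders chosen only to show the direction of the effect; they are not the talk's fitted values.

```python
import math

K_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def log_mean_life(ln_a, ea_ev, temp_c):
    """Arrhenius log mean: mu = ln A + Ea / (k T), with T in kelvin."""
    temp_k = temp_c + 273.15
    return ln_a + ea_ev / (K_EV_PER_K * temp_k)

def median_life(ln_a, ea_ev, temp_c):
    """Median of a lognormal life distribution is exp(mu)."""
    return math.exp(log_mean_life(ln_a, ea_ev, temp_c))

# Hypothetical parameters for illustration only.
LN_A = -28.7
EA_EV = 1.5

mu_hot = log_mean_life(LN_A, EA_EV, 330.0)   # stress temperature
mu_cool = log_mean_life(LN_A, EA_EV, 125.0)  # a lower temperature

print(f"mu at 330 C: {mu_hot:.2f}, mu at 125 C: {mu_cool:.2f}")
print(f"median life ratio (cool / hot): {median_life(LN_A, EA_EV, 125.0) / median_life(LN_A, EA_EV, 330.0):.3g}")
```

Lowering the temperature raises mu, and because life scales as exp(mu), even a modest change in mu translates into orders of magnitude in median life.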
It's then, when you combine those two, that you can predict behavior at use conditions, which is really the name of the game. The most common method is a constant stress test, where basically the stress is fixed for the duration of the test. So this is just showing an example of that. We have a plot here of temperature versus time. If we have a very low temperature, you could get failure times that would sometimes be very long. The failure times can be random, again according to, say, some distribution like the lognormal. If we increase the temperature to some higher level, we would again get a distribution of failure times, but on the average the failure times would be shorter. And if we increase the temperature even more, same kind of thing, but failure times are even shorter than that. So what I can do is, if I ran, say, a bunch of parts under these different temperatures, I could fit the results to a probability plot that looks like this. I have probability versus time to failure. At the highest temperature here, in this example 330 degrees C, I have my set of failure times, which I fit to a lognormal. And then as I decrease the temperature lower and lower, the failure times get longer and longer. Then I take all this data over temperature, I fit it to the Arrhenius model, and I extrapolate. And then I can get my predictions at use conditions. This is what we are after. I want to point out that when we're doing this accelerated testing, we have to run at very high stress because, for example, even though this is, say, lasting 1000 hours or so, our predictions are that the part under use conditions would last a billion hours, and there's no way that we could run a test for a billion hours. So we have to get tests done in a reasonable amount of time and that's why we're doing accelerated testing. So then, what is a step stress? Well, as you might imagine, a step stress is where you increase the stress in steps or some sort of a ramp.
The advantage is that it's a real time saver. As I showed in the previous plot, those tests could last a very long time; that could be 1000 hours. So it could be weeks or months before the test is over. A step stress test could be much shorter; you might be able to get done in hours or days. So it's a real time saver. But the analysis is more difficult, and I'll show that in a minute. So, in the work we've done at Qorvo, we're doing reliability of acoustic filters, and those are RF devices. And so the stress in RF is RF power. And so we step up power until failure. So if we're going to step up power, we can model this with this expression here. Basically, we have the same thing as the Arrhenius equation, but we're adding another term, n log p. N is our power acceleration parameter; p is our power. And for the lognormal distribution, there would be a fourth parameter, sigma, which is the shape factor. So you have 1, 2, 3, 4 parameters. Let me just give you a quick example of what this would look like. This is power versus time. Power is in dBm. You're starting off at some power like 33.5 dBm, and you step and step and step and step until hopefully you get failure. And I want to point out that you're varying power, and as you increase the power to the part, that's going to be changing the temperature. So as power is ramped, so is temperature. So power and temperature are then confounded. So you're going to have to do your experiment in such a way that you can separate the effects of temperature and power. And I want to point out that you have these two terms (temperature and power), so it's not just that I increase the power to the part and it gets hotter and it's the temperature that's driving it. Power in and of itself also increases the failure rate. Right. So now let me show a little bit more detail about that step stress plot. So here again is power versus time.
I'm running a part for, say, five hours at some power, then I increase the stress and run another five hours, and increase the stress on up until we get a failure. And as I mentioned, as the power is increasing, so is the temperature. So I have to take that into account somehow. I have to know what the temperature is: $T = T_{\mathrm{ambient}} + R_{\mathrm{th}} P$. T ambient is our ambient temperature; P is the power; and R th is called the thermal impedance, which is a constant. So that means, as I set the power, I know what the power is and then I can also estimate what the temperature is for each step. So what we'd like to do is somehow take these failure times that we get from our step stress pattern and extrapolate them to use conditions. If I was only running for time delta t here and I wanted to extrapolate that to use conditions, what I would do is get the equivalent amount of time as delta t times the acceleration factor. And here's the acceleration factor. I have an activation energy term, a temperature term, and a power term. And so what I would do is multiply by AF. And since I'm going from high stress down to low stress, AF is larger than one. This is just for purposes of illustration, so it's not that much bigger than one, but you get the idea. And as I increase the power, temperature and power are changing, so the AF changes with each step. So if I want to then get the equivalent time at use conditions, I'd have to do a sum. So I have each segment; it has its own acceleration factor and maybe its own delta t. And then I do a sum, and that gives me the equivalent time. So this is the expression that I would use to predict equivalent time: if I knew exactly what Ea was and exactly what n was, I could predict what the equivalent time was. So that's the idea. So as I said, temperature and power are confounded.
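The bookkeeping described here (estimate the temperature for each step, compute an acceleration factor for each step, then sum) can be sketched as follows. The transcript does not show the exact expression, so this Python sketch rests on stated assumptions: power enters both the thermal equation and the power-law term in watts, the power-law term is written so that a positive n means higher power shortens life, and every numeric value (Ea, n, thermal impedance, use conditions) is hypothetical.

```python
import math

K_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def dbm_to_watts(p_dbm):
    """Convert power from dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)

def device_temp_k(t_ambient_c, r_th_c_per_w, p_watts):
    """T = T_ambient + R_th * P, returned in kelvin (assumes P in watts)."""
    return t_ambient_c + r_th_c_per_w * p_watts + 273.15

def acceleration_factor(ea_ev, n, t_stress_k, p_stress_w, t_use_k, p_use_w):
    """Assumed form: Arrhenius term times a power-law term (n > 0 shortens life)."""
    thermal = math.exp((ea_ev / K_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k))
    power = (p_stress_w / p_use_w) ** n
    return thermal * power

def equivalent_use_time(steps, ea_ev, n, t_ambient_c, r_th_c_per_w, t_use_k, p_use_w):
    """Sum of delta_t * AF over the steps, each step with its own T and P."""
    total = 0.0
    for p_dbm, dt_hours in steps:
        p_w = dbm_to_watts(p_dbm)
        t_k = device_temp_k(t_ambient_c, r_th_c_per_w, p_w)
        total += dt_hours * acceleration_factor(ea_ev, n, t_k, p_w, t_use_k, p_use_w)
    return total

# Hypothetical step pattern: (power in dBm, hours at that power).
steps = [(33.5, 5.0), (33.7, 5.0), (33.9, 5.0)]
p_use_w = dbm_to_watts(27.0)                  # hypothetical use power
t_use_k = device_temp_k(45.0, 20.0, p_use_w)  # hypothetical R_th of 20 C/W

t_equiv = equivalent_use_time(steps, ea_ev=1.0, n=5.0, t_ambient_c=45.0,
                              r_th_c_per_w=20.0, t_use_k=t_use_k, p_use_w=p_use_w)
print(f"equivalent time at use conditions: {t_equiv:.3g} hours")
```

Because every step runs hotter and at higher power than the use condition, each acceleration factor exceeds one and the equivalent time is much longer than the 15 hours actually spent on test.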
So in order to estimate, what we have to do is run at two different ambient temperatures. If you have the ambient temperatures separated enough, then you can actually separate the effects of power and temperature. You also need at least two ramp rates. So at a minimum, you would need a two by two matrix of ramp rate and ambient temperature. In the simulations I did, I chose three different rates, as shown here. I have power in dBm versus stress time, and I have three different ramps with different rates. I have a fast, a medium, and a slow ramp rate. In practice, you would let this go on and on until failure, but I've just arbitrarily cut it off after a few hours. You see here also I have a ceiling. The ceiling is there because we have found that if we increase the stress or power arbitrarily, we can change the failure mechanism. And what you want to do is make sure that the failure mechanism when you're under accelerated conditions is the same as it is under use conditions. If I change the failure mechanism, then I can't do an extrapolation. The extrapolation wouldn't be valid. So we have the ceiling here drawn at 34.4 dBm, and we've even given ourselves a little buffer to make sure we don't get close to that. So here our ambient temperature is 45 degrees C and we're starting at a power of 33.5 dBm. We would also have another set of conditions at 135 degrees C. You can see the patterns here are the same. We have a ceiling and a buffer region, everything, except we are starting at a lower power. So here we're below 32 dBm, whereas before we were over 33. And the reason we do that is because if we don't lower the power at this higher temperature, you'll get failures almost immediately if you're not careful, and then you can't use the data to do your extrapolation. All right, so what we need, again, is an expression for our equivalent time, as I showed before. Here's that expression.
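A step pattern with a ceiling and a buffer, like the ones described above, could be generated along these lines. The start power (33.5 dBm) and ceiling (34.4 dBm) come from the talk; the step size, buffer width, and dwell times for the fast, medium, and slow ramps are hypothetical, as is the function itself. This sketch simply stops stepping once the buffered limit is reached.

```python
def step_pattern(start_dbm, step_dbm, hours_per_step, ceiling_dbm, buffer_dbm):
    """List of (power in dBm, dwell hours), never exceeding ceiling - buffer."""
    limit = ceiling_dbm - buffer_dbm
    steps = []
    i = 0
    # Compute each level from the index to avoid accumulating rounding error.
    while start_dbm + i * step_dbm <= limit + 1e-9:
        steps.append((round(start_dbm + i * step_dbm, 3), hours_per_step))
        i += 1
    return steps

# Three ramp rates at the 45 C ambient condition; dwell times are hypothetical.
ramps = {
    "fast": step_pattern(33.5, 0.1, 1.0, ceiling_dbm=34.4, buffer_dbm=0.2),
    "medium": step_pattern(33.5, 0.1, 2.0, ceiling_dbm=34.4, buffer_dbm=0.2),
    "slow": step_pattern(33.5, 0.1, 5.0, ceiling_dbm=34.4, buffer_dbm=0.2),
}

for name, pattern in ramps.items():
    total_hours = sum(hours for _, hours in pattern)
    print(name, pattern[0][0], "to", pattern[-1][0], "dBm over", total_hours, "hours")
```

The same function, called with a lower starting power, would give the patterns for the second ambient temperature.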
This is kind of nasty, and I would not know how to derive from first principles the expression for the distribution of the equivalent time at use conditions. So, when faced with something difficult like that, what I chose to do was use the bootstrap. So what is bootstrapping? With bootstrapping, what we're doing is resampling the data set many times with replacement. That means that, from the original data set of observations, a bootstrap sample can have replicates of an observation, or maybe an observation won't appear at all. And the approach I use is called nonparametric, because we're not assuming the distribution. We don't have to know the underlying distribution of the data. So when you generate these many bootstrap samples, what you get is an approximate distribution of the parameter, and that allows you to do statistical inference. In particular, we're interested in putting confidence bounds on things. So that's what we need to do. A simple example of bootstrapping is called the percentile bootstrap. So, for example, suppose I wanted 90% confidence bounds on some estimate. What I would do is form many, many bootstrap replicates and extract the parameter from each bootstrap sample. And then I would sort that and figure out which are the 5th and 95th percentiles of that vector, and those would form my 90% confidence bounds. What I actually did in my work was use an improvement over the percentile technique. It's called BCa, for bias corrected and accelerated. Bias, because sometimes our estimates are biased and this method would take that into account. Accelerated: unfortunately the term accelerated is confusing here. It has nothing to do with accelerated testing; it has to do with the method adjusting for the skewness of the distribution. But ultimately what you're going to get is that it's going to pick different percentile values for you.
So, again, for the percentile technique we had the 5th and 95th. The BCa bootstrap might give you something different; it might say the 3rd percentile and the 96th, or whatever. And those are the ones you would need to choose for your 90% confidence bounds. So I just want to run through a very quick example to make this clear. Suppose I have 10 observations and I want to form bootstrap samples from this, looking something like this. So, for example, the first observation here, 24.93, occurs twice in the first sample, once in the second sample, etc. 25.06 occurs twice. 25.89 does not occur at all. And I can do this, in this case, 100 times. And for each bootstrap sample, in this case I'm going to take the average; say I'm interested in the distribution of the average. Well, here I have my distribution of averages. And I can look to see what that looks like. Here we are. It looks pretty bell shaped, and I have a couple of points here highlighted, and these would be my 90% confidence bounds if I was using the percentile technique. So this is the sorted vector, and the 5th percentile is at 25.84 and the 95th percentile is 27.68. If I wanted to do the BCa method, I might get somewhat different percentiles. In this case, 25.78 and 27.73. So that's, very quickly, what the BCa method is. So in our case, we would do the bootstrap on the stress patterns. You would have multiple samples which would have been run, simulated, under those different stress patterns, and then bootstrap off those. And so we're going to get a distribution of our estimates or parameters: logA, EA, and sigma. Right. CWhitman So again, here's our equation. The version of JMP that I have does not do bootstrapping. JMP Pro does, but the version I have does not. Fortunately, R does do bootstrapping, and I can call R from within JMP. That's why I chose to do it this way.
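The percentile bootstrap walked through in the ten-observation example can be sketched in a few lines of Python. This is the simpler percentile method only, not BCa, and it is not the R code used in the talk. The data list reuses the three values mentioned (24.93, 25.06, 25.89) and fills in the rest with hypothetical numbers; the function name, the seed, and the number of replicates are also assumptions.

```python
import random

def percentile_bootstrap_ci(data, stat, n_boot=1000, level=0.90, seed=0):
    """Percentile bootstrap confidence bounds for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, compute the statistic, and sort the results.
    estimates = sorted(
        stat([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lo_index = int(alpha * (n_boot - 1))
    hi_index = int(round((1.0 - alpha) * (n_boot - 1)))
    return estimates[lo_index], estimates[hi_index]

def mean(values):
    return sum(values) / len(values)

# Three values from the talk's example plus seven hypothetical ones.
observations = [24.93, 25.06, 25.89, 26.4, 26.9, 27.1, 27.5, 27.8, 28.3, 29.0]

lower, upper = percentile_bootstrap_ci(observations, mean, n_boot=1000, level=0.90)
print(f"sample mean {mean(observations):.2f}, 90% bounds ({lower:.2f}, {upper:.2f})")
```

BCa starts from the same sorted vector of bootstrap estimates but adjusts which percentiles are read off, using a bias correction and an acceleration (skewness) term.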
So I can let R do all the hard work. So I want to show an example. What I did was choose some known true values for logA, EA and sigma. I chose them over some range randomly. And I would then choose the same values for these parameters a few times and generate samples each time I did that. So, for example, I chose minus 28.7 three times for logA true, and we get the data from this. There were a total of five parts per test level, and six test levels, if you remember: three ramps, two different temperatures, six levels; six times five is 30. So there were 30 parts total run for this test, and looking at logA hat, the maximum likelihood estimates are around minus 28 or so. So that actually worked pretty well. Now for my next sample, I did three replicates here at, for example, minus 5.7, and how did it look when I ran my method? The maximum likelihood estimates are around that minus 5.7 or so. So the method appears to be working pretty well. But let's do this in a little bit more detail. Here I ran the simulation a total of 250 times, with five times for each group. LogA true and EA true are repeated five times, and I'm getting different estimates for logA hat, EA hat, etc. I'm also using the BCa method to form confidence bounds on each of these parameters, along with the median time to failure. So let's just plot this data to see how well it did. You have logA hat versus logA true here, and we see that the slope is right around 1 and the intercept is not significantly different from 0. So this is actually doing a pretty good job. If my logA true is at minus 15, then I'm getting right around minus 15, plus or minus something, for my estimate. And the same is true for the other parameters, EA, n, and sigma, and I even did mu at a particular P zero. So this is all behaving very well. We also want to know, well, how well is the BCa method working? Well, it turns out it worked pretty well.
The question is how successful the BCa method was. And here I have a distribution. Every time I correctly bracketed the known true value, I got a 1, and if I missed it, I got a 0. So for logA I'm correctly bracketing the known true value 91% of the time. I was targeting 95%, so I'm pretty close. I'm in the low 90s, and I'm getting about the same thing for activation energy, etc. They're all in the mid to low 90s. So that's actually pretty good agreement. Let's suppose now I wanted to see what would happen if I increased the number of bootstrap iterations, n boot, from 100 to 500. What does that look like? If I plot my MLE versus the true value, you're getting about the same thing. The estimates are pretty good. The slope is always around 1 and the intercept is always around 0. So that's pretty well behaved. And then if I look at the confidence bound width, on average I'm getting something around 23 here for confidence bound width, around 20 or so for mu, and something around eight for the value n. And so these confidence bounds are actually somewhat wide. And I want to see what happens if, say, I increase my sample size to 50 instead of just using five. 50 is not a realistic sample size; we could never run that many. That would be very difficult to do, very time consuming. But this is simulation, so I can run as many parts as I want. And so, just to check, I see again that the maximum likelihood estimates agree pretty well with the known true values. Again, I'm getting a slope of 1 and an intercept around zero. And with BCa, I am bracketing right around 95% of the time, as expected. So that's pretty well behaved too, and my confidence bound width is now much lower. So, by increasing the sample size, as you might expect, the confidence bounds get correspondingly narrower. This was in the upper 20s originally; now it's around seven. 
This is also...the mu was also in the upper 20s, and is now around five; n was around 10 initially, and now it's around 2.3. So we're getting this better behavior by increasing our sample size. So this just shows what the summary of this would look like. Here I have success rate versus these different groups; n is the number of parts per test level, and boot is the number of bootstrap samples I created. So 5_100, 5_500 and 50_500, and you can see this is actually reasonably flat. You're not getting a big improvement in the coverage. We're getting something in the low to mid 90s or so, and that's about what you would expect. So by changing the number of bootstrap replicates or by changing the sample size, I'm not changing that very much. BCa is doing a pretty good job, even with five parts per test level and 100 bootstrap iterations. Now, about the width: here we are seeing a benefit. The width of the confidence bounds is going down as we increase the number of bootstrap iterations. And then on top of that, if you increase the sample size, you get a big decrease in the confidence bound width. So all this behavior is expected, but the point here is that this simulation allows you to know ahead of time how big your sample size should be. Can I get away with three parts per condition? Do I need to run five or 10 parts per condition in order to get the width of the confidence bounds that I want? Similarly, when I'm doing the analysis, how many bootstrap iterations do I have to do? Can I get away with 100? Do I need 1000? This also gives you a heads up on what you're going to need to do when you do the analysis. Alright, so finally, we are now armed with our maximum likelihood estimates and our confidence bounds. So we can summarize our results using the safe operating area, and again, what we're getting here is something of a reliability map, or a response surface, of temperature versus power. 
So you'll have an idea of how reliable the part is under various conditions. And this can be very helpful to designers or customers. Designers want to know, when they create a part, mimic a part, is it going to last? Are they designing a part to run at too high a temperature or too high a power, so that the median time to failure would be too low? Customers also want to know, when they run this part, how long is it going to last? And that information is what the SOA gives you. The metric I'm going to use here is median time to failure. You could use other metrics. You could use the FIT rate, you could use a ???, but for purposes of illustration, I'm just using median time to failure. An even better metric, as I'll show, is a lower confidence bound on the median time to failure. It gives you a more conservative estimate. So ultimately, the SOA will allow you to make trade-offs between temperature and power. So here is our contour plot showing our SOA. These contours are log base 10 of the median time to failure. We have power versus temperature, and as temperature goes down and as power goes down, these contours are getting larger and larger. So as you lower the stress, as you might expect, the median time to failure goes up. And suppose we have a corporate goal, and the corporate goal is that you want the part to have a median time to failure greater than 10 to the six hours. If you look at this map, over the range of power and temperature we have chosen, it looks like we're golden. There are no problems here. Median time to failure is easily 10 to the six hours or higher. But we have to realize that median time to failure, again, is an average, and an average only tells half the story. We have to do something that acknowledges the uncertainty in this estimate. So what we do in practice is use a lower confidence bound on the median time to failure here. 
So you can see those contours have changed. They are very much lower because we're using the lower confidence bound, and here, 10 to the six hours is given by this line. And you can see that it's only part of the region now. So over here in green, that's good. Right. You can operate safely here, but red is dangerous. It is not safe to run here. This is where the monsters are. You don't want to run your part this hot. And also, this allows you to make trade-offs. So, for example, suppose a designer wanted their part to run at 80 degrees C. That's fine, as long as they keep the power level below about 29.5 dBm. Similarly, suppose they wanted to run the part at 90 degrees C. They can; that's fine as long as they keep the power low enough, let's say 27.5 dBm. Right. So this is where you're allowed to make trade-offs between temperature and power. Alright, so now just to summarize. I showed the differences between constant and step stress testing, and I showed how we extract maximum likelihood estimates and our BCa confidence bounds from the simulated step stress data. And I demonstrated that we had pretty good agreement between the estimates and the known true values. In addition, the BCa method worked pretty well: even with n boot of only 100 and five parts per test level, we had about 95% coverage. And that coverage didn't change very much as we increased the number of bootstrap iterations or increased the sample size. However, we did see a big change in the confidence bound width, and the results there showed that we could make some sort of a trade-off. Again, from the simulation, we would know how many bootstrap iterations we need to run and how many parts per test condition we need to run. And ultimately, we then took those maximum likelihood estimates and our bootstrap confidence bounds and created the SOA, which provides guidance to customers and designers on how safe a particular T0/P0 combination is. 
And then from that reliability map, we are able to make a trade-off between temperature and power. And lastly, I showed that using the lower confidence bound on the median time to failure does provide a more conservative estimate for the SOA. So, in essence, using the lower confidence bound makes the SOA, the safe operating area, a little safer. So that ends my talk. And thank you very much for your time.  
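The coverage and width study summarized in this talk can be illustrated with a much simpler stand-in. The sketch below is not the speaker's simulation: it assumes normally distributed data and a confidence interval for the mean instead of the step stress model, it uses the plain percentile bootstrap instead of BCa, and all function names and default values (`coverage_and_width`, `true_mean`, `sigma`) are hypothetical. It only shows the bookkeeping: score a 1 when the interval brackets the known true value and a 0 otherwise, then average the scores and the interval widths.

```python
import random
import statistics


def percentile_ci(sample, n_boot, level, rng):
    """Percentile bootstrap bounds for the mean of `sample`."""
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lo_rank = max(1, round(alpha * n_boot))
    hi_rank = min(n_boot, round((1.0 - alpha) * n_boot))
    return means[lo_rank - 1], means[hi_rank - 1]


def coverage_and_width(n, n_boot, n_sim=200, level=0.95,
                       true_mean=10.0, sigma=2.0, seed=1):
    """Return (fraction of intervals bracketing true_mean, average width)."""
    rng = random.Random(seed)
    hits = 0
    total_width = 0.0
    for _ in range(n_sim):
        # Simulate one data set from the known true values.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = percentile_ci(sample, n_boot, level, rng)
        hits += int(lo <= true_mean <= hi)  # 1 if bracketed, 0 if missed
        total_width += hi - lo
    return hits / n_sim, total_width / n_sim


if __name__ == "__main__":
    for n, n_boot in [(5, 100), (5, 500), (50, 500)]:
        cov, width = coverage_and_width(n, n_boot)
        print(f"n={n:2d} boot={n_boot:3d} coverage={cov:.2f} width={width:.2f}")
```

Running a grid like this before testing is one way to judge how many parts and how many bootstrap iterations are needed to reach a target interval width.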
Astrid Ruck, Senior Specialist in Statistics, Autoliv Chris Gotwalt, JMP Director of Statistical Research and Development, SAS Laura Lancaster, JMP Principal Research Statistician Developer, SAS   Measurement Systems Analysis (MSA) is a measurement process consisting not only of the measurement system, equipment and parts, but also the operators, methods and techniques involved in the entire procedure of conducting the measurements. Automotive industry guidelines such as AIAG [1] or VDA [4] investigate a one-dimensional output per test, but they do not describe how to deal with data curves as output. In this presentation, we take a first step by showing how to perform a gauge repeatability and reproducibility (GRR) study using force versus distance output curves. The Functional Data Explorer (FDE) in JMP Pro is designed to analyze data that are functions, such as the measurement curves that were used to perform this GRR study.     Auto-generated transcript...   Speaker Transcript Astrid Ruck So my name is Astrid Ruck. I have been working for Autoliv for 15 years, and Autoliv is a worldwide leading manufacturer of automotive safety components such as seatbelts, air bags and active safety systems. So today we would like to talk about measurement system analysis for curve data. Laura, Chris, and I have also written a white paper with the same type of title, because we think it's a very urgent topic: to our knowledge, there is nothing else available on MSA for curve data for dynamic test machines. So first we will start with a short introduction to MSA and functional data analysis, then we will motivate our objective of a so-called spring guideline, which investigates fastening behavior, or comfort behavior, when fastening seatbelts, and then Laura will explain our methodology for the Gage R&R via JMP Pro. So usually measurement system analysis is assumed to be a Type 2 study. But MSA is a process. 
Here in this flow chart, you can see that it starts with the Type 1 and linearity studies, and they are done with reference parts. Reference parts are needed because we need a reference value to calculate the bias of the measurement system. And if it fits, if the bias is good enough, then we check whether we have an operator influence or not. If and only if we have an operator influence can we calculate the reproducibility. Otherwise, in all other steps, repeatability is the only variation which can be calculated. So, in this area, the Type 2 and Type 3 studies, production parts are used. The AIAG automotive industry guidelines investigate a one-dimensional output per test, but they do not describe how to deal with data curves as output, and we will give you an insight into how to do it. So if one has good accuracy, then the uncertainty will be low. And this can be seen in this graph. You see the influence of increasing uncertainty on the decision of whether parts near the tolerance are good or bad. So here on the left we have a lower specification limit, and on the right we have an upper specification limit. The better my accuracy is, the better my decision will be. And here we have a very big gray area, and it will be very hard to make the right decision. Obviously, parts in the middle of the tolerance will always be classified in the correct way. And in contrast to statistical process control, parts at the specification limits are very valuable. So the best thing that you can ever do for an MSA is to take parts at the lower specification limit or the upper specification limit, plus/minus 10% of the reference figure, and in this case the reference figure is the tolerance. If you have only one specification limit, like an upper or a lower specification limit, then you can use your process capability index, Ppk, to calculate the corresponding process variation, 6s, and that is your new reference figure. 
And this idea of how to select your parts for an MSA can also be used for output curves and their specification bounds instead of specification limits. So data often comes as a function, curve, or profile, referred to as functional data. Functional data analysis is available in JMP Pro in the Functional Data Explorer platform. We will use P-splines to model our extraction force versus distance curves, and later on we will use mixed models to analyze them. So seatbelts significantly contribute to preventing fatalities, and consequently the functionality and comfort of seatbelts have to be ensured. And here on the right hand side, you see a picture of the extraction forces of the seatbelt, given in blue, and the retraction forces over distance, given in red. Both forces are important factors that affect both safety and comfort. To test these forces, an extraction/retraction force test setup is used, and this simulates the seatbelt behavior in a vehicle. So let us have a closer look at this. Here you see a test setup, and here you see the seatbelt. And here's a little moving trolley which drives on a trolley arm, and now you see the seatbelt is retracted. And now the extraction has started, and here on the right hand side you see the seatbelt, which is fixed according to its car position. Yeah. Whoops. So, when you look at these little curves, you see that inside my extraction force curve there are some little waves, and they have a semi-periodic structure. And if you fit this semi-periodic structure with a polynomial, for example, then you will really overfit the repeatability. That's the reason why we need some flexible models. And where do these little waves come from? This will be clearer when we have a look inside my seatbelt. There is a spindle and there is a spring. And how this behaves can be seen in this video. Here inside is my spindle, and here on the right hand side there is my spring. 
And the cover of the spring is open so that we can have a look inside and see how the spring behaves. In the beginning, when the total webbing is on the spindle, my spring is totally relaxed. And now my webbing is extracted, and at the same time the spring is wound up. So now we have the retraction, and my spring becomes relaxed again. These little movements result in this wavy structure, and that is the reason why we really need flexible models based on FDE. So the creation of the spring guideline is our objective, for those who require a specific fastening behavior. The fastening behavior is given by my extraction force curves. And here in this picture you see different groups for five different seatbelt types and the corresponding spring thicknesses. If my spring thickness is small, then you can see that my extraction force is also small, and if my spring thickness is large, then my extraction force over distance will also be large. And then you see we have different colors here, and we have a dark color. The dark color results from real life measurements from three operators, where every operator made five replications per seatbelt, and you can see that they have really done a great job. And the light color is our model from the P-splines given by FDE. So, as a spring guideline is our target, we would like to know for which spring we will get the corresponding fastening behavior of the seatbelt. But before you start with the project, please always start with an MSA. So, according to Autoliv's procedure, we use five different seatbelts, three experienced operators and five replications, as you have seen in the previous graph. And our observation y is given by the actual values plus some random noise: noise from the operators, the part, the interaction of operator and part, and the corresponding repeatability. 
And then my Gage R&R is defined as six times the standard deviation of the measurement error, which is given by my reproducibility and my repeatability. And now we would like to know what my minimum tolerance is, such that my Gage R&R is acceptable, and acceptable means that the percentage Gage R&R is smaller than 20%. If my Gage R&R is 0.2 times my minimum tolerance, then we will also get a bound for the curves which you have seen, and this plus/minus 3s error bound will help us a lot to find the correct spring for a specific fastening behavior. So our methodology will be shown by Laura Lancaster. We will start by estimating a mean extraction force curve, and we will use flexible models from FDE. Then the residual extraction forces will be calculated, and after that, random effect models will be estimated via the Mixed Model platform, and finally the Gage R&R will be calculated. So, Laura, will you start? Laura Yes, let me share my screen. Yeah. Great. Laura Okay. So yes, thank you, Astrid. So I wanted to demonstrate how we use JMP Pro 15 to perform this measurement systems analysis with curve data. First I want to show you the data that we have. It looks a lot like regular MSA data, except instead of just regular measurements, we actually have curves. So we have this function of force in terms of distance. The first thing that we want to do is use the Functional Data Explorer to create the part force extraction curves. So I'm going to open up the Functional Data Explorer: I go to the Analyze menu, then Specialized Modeling, Functional Data Explorer. Force will be my output, distance is my input, and I want to fit one curve for each part, or seatbelt type. So that's going to be my ID, and I click OK. I get a bunch of summary information and summary graphs, and I'm going to enlarge this one. These are my curves, and you can see that I have very distinct curves for each part, which are in different colors. 
And I can also see that semi-periodic behavior that Astrid was talking about. Now, I've already fit this data, so I know that a 300 knot linear P-spline fits really well. So I'm going to just go ahead and fit that particular model. I go to Models, Model Controls, P-Spline Model Controls, and remove all of these knots. I add 300, because I know that's the knot structure that works well. I'm only going to do a linear fit. I click Go, and it doesn't take too long to fit this 300 knot linear P-spline to this data. And I just wanted to quickly mention that I'm only going to show a little bit of the functionality of this FDE platform. It does a lot more than I have time to show you, or than we used for this measurement systems analysis. But I highly recommend you check it out. We added a lot for JMP 15, so I highly recommend you check out other talks about it as well. Okay, so this is our fit. Once again, we get a nice graph of our curves and the fit. And I want to check and make sure that this is a good fit. So I'm going to go to the diagnostic plots and look at the actual by predicted plot, and I see it looks really great. It's nice and linear, and the residuals are really small. So I'm very happy with this fit. Now, when we did this fit, we had to combine together the operator and the replication error. So these curves have that variation averaged out, but we need that variation to actually do the Gage R&R study. So what we're going to do is create a force residual. Once we do that, by subtracting these part force functions from the original force function, we will have a residual that contains the operator and replication error. So I'm going to go back to the data table. I've already created scripts to make this easy. I've created a script to add my prediction formula to the data table, and I'll just show it to you really quickly. This just comes straight from the Functional Data Explorer. 
This is my formula for the P-spline fits for each part. And then I also have a script to create my residual column, and the formula for that is simply the difference between the force and those part force extraction functions. Right. And so now that I have this residual formula that contains the operator and replication error, I can fit a random effects model and estimate my operator, operator by part, and replication error. To do that, I'm actually going to use the Mixed Model platform, because we're not going to be fitting the part variance; we've already factored that out by creating the residual and subtracting the part force function. So I've already created a script to launch the Mixed Model platform. You can see that I have residual extraction force as my response, and I have operator and operator by part as my random effects. I'm going to run this, and I get variance components for operator, operator by part (which is zero), and the residual. To calculate the Gage R&R, I find it easy just to use a JMP data table like a spreadsheet. I can just use a column formula. So I've entered my variance components in the table, and I just create a formula to calculate the Gage R&R, which is just 6 times the square root of the total variation without the part. And then I see that my Gage R&R is .4385, and I can take that and apply it to my spring guidelines and my specification bounds. I can also back solve for the minimum tolerance. And so now I'm going to hand this back over to Astrid to continue talking about how this was applied. Astrid Ruck Yes, thank you, Laura. It's great. But for the audience, of course, these are not the original data. So, yes. Now, as Laura explained, we know the Gage R&R and we also know my minimum tolerance, such that my measurement system is capable. So here you see the part extraction force function. 
And you also see the plus/minus 3s error bound, and you see that the parts are very well selected, because the bounds are non-overlapping and therefore the parts are significantly different. And we can use it to find the right spring. And here you see, in black, the minimum tolerance; we use the green line to center it. So now we have our Gage R&R, but on the other hand, we can use FDE to load a golden curve as a target function. So I already told you, yes, we are interested in a spring guideline, but what kind of spring shall we use? We can also use FDE to load this golden function, and then the corresponding spring thickness is calculated to obtain a specific behavior. So FDE is a great tool. We also used it for a Type 3 study, which is independent of operators, and it is used for a camera. It measures the distance between the seam and the cutting edge of inflatable seatbelts in the cutting process, and that was also a great success story of using FDE. So to come to an end, I would like to say that most of our processes and tests have curves as output. Until now it has been impossible to standardize an MSA procedure using complete curve data, and therefore we had to restrict ourselves to a maximum from the extraction force curve, or the area between extraction and retraction, and therefore we reduced our data and lost a considerable amount of information. So I'm really happy that we can do MSA for curve data. And as far as we are aware, there are no other publications that discuss this type of MSA generalization for curve data with other commercial software. And at the end, I would like to show you that the corresponding paper is also available on the internet. So thank you for your attention.  
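The Gage R&R arithmetic described in this talk is short enough to write out. The sketch below assumes only what was stated: the Gage R&R is 6 times the square root of the total variance without the part, the percentage Gage R&R must stay below 20% of the tolerance, and the error bound is plus/minus 3s. The function names and the variance components in the first call are hypothetical (the talk reports only that the operator by part component was zero and that the resulting Gage R&R was .4385).

```python
import math


def gage_rr(var_operator, var_operator_by_part, var_repeatability, k=6.0):
    """Gage R&R = k * sqrt(reproducibility + repeatability variance)."""
    total_variance = var_operator + var_operator_by_part + var_repeatability
    return k * math.sqrt(total_variance)


def min_tolerance(grr, max_percent_grr=20.0):
    """Smallest tolerance for which GRR / tolerance stays at the limit."""
    return grr / (max_percent_grr / 100.0)


def error_bound_half_width(grr, k=6.0):
    """Half-width of the plus/minus 3s bound, where s = GRR / k."""
    return 3.0 * grr / k


# Hypothetical variance components, for illustration only.
example_grr = gage_rr(var_operator=0.002, var_operator_by_part=0.0,
                      var_repeatability=0.003)
print(f"example Gage R&R: {example_grr:.4f}")

# Back-solving from the Gage R&R value reported in the talk.
reported_grr = 0.4385
print(f"minimum tolerance: {min_tolerance(reported_grr):.4f}")
print(f"+/- 3s half-width: {error_bound_half_width(reported_grr):.5f}")
```

With the reported value, the minimum tolerance is .4385 divided by 0.2, and the plus/minus 3s bound is half of the Gage R&R on either side of the fitted part curve.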
Ondina Sevilla-Rovirosa, Graduate Student and Small Business Owner, Oklahoma State University   The pay gap between top executives and the average American worker continues to widen. Additionally, the gender pay gap has narrowed, but women only earn 85% of what men earn. Experts debate whether higher compensation could positively impact a firm's performance and maximize its value. Therefore, it is worth analyzing the factors that companies consider when determining the salaries of top executives.    The original dataset of 2,646 executives from 250 selected U.S. companies was gender unbalanced. I used a synthetic replication method to balance the data. Then, I ran, analyzed, and compared a decision tree, a stepwise regression, eight different neural networks, and an ensemble model. Finally, a surrogate model was used to explain the best neural net model.   From the 18 initially selected variables of companies and executives, I found that only Total Assets, Number of Employees, and Years Executive in CEO Position had a significant contribution to Salary.   Surprisingly, gender had an insignificant effect on the salaries of top executives. Instead, what predominantly affected Salary was the size of the firm (Assets and Employees), followed by a lower contribution from the number of years the executive has been in the CEO position.     Auto-generated transcript...   Speaker Transcript ondinasevilla@gmail.com My name is Ondina Sevilla and my poster is about salary gaps in corporate America, specifically how company and executive characteristics influence compensation. Starting with a little introduction, the pay gap between top executives and the average American worker continues to widen. Also, the gender pay gap has narrowed, but women only earn 85% of what men earn. Experts even debate whether higher compensation could positively impact a firm's performance and maximize its value. 
So it is worth analyzing the factors that companies consider when determining the salaries of top executives. I posed two questions for this research. Is there a salary gap for top female executives in US companies? And does the company's size influence executives' salaries? So for this research, I collected a data set of 2,646 top executives from 250 selected US companies, such as Halliburton, Southwest Airlines, Starwood Hotels, Sherwin-Williams and others. Then I applied a synthetic replication method in SAS to obtain a gender balanced database, and used 12 company and six executive variables, with salary as my target variable. The technique used was predictive modeling. I analyzed and compared in JMP Pro 15 a decision tree, a stepwise regression, eight different neural networks and an ensemble model. An outlier in the salary percentage year-over-year variable was excluded for the first analysis, and then I included it to compare. However, the root average squared error and the average absolute error were higher with the outlier. So I'm going to show you here the different models that I ran without the outlier, and the neural network comparison. From all these models, the lowest root average squared error without the outlier was for neural network number eight. This neural net has 4 inputs, two hidden layers, eight double neurons, with a TanH function.  
Daniel Sutton, Statistician - Innovation, Samsung Austin Semiconductor   Structured Problem Solving (SPS) tools were made available to JMP users through a JSL script center as a menu add-in. The SPS script center allowed JMP users to find useful SPS resources from within a JMP session, instead of having to search for various tools and templates in other locations. The current JMP Cause and Effect diagram platform was enhanced with JSL to give JMP users the ability to transform tables between wide format for brainstorming and tall format for visual representation. New branches and "parking lot" ideas are also captured in the wide format before returning to the tall format for visual representation. By using JSL, access to mind-mapping files made by open source software such as Freeplane was made available to JMP users, to go back and forth between JMP and mind-mapping. This flexibility allowed users to freeform in mind maps, then structure them back in JMP. Users could assign labels such as Experiment, Constant and Noise to the causes and identify what should go into the DOE platforms for root cause analysis. Further proposed enhancements to the JMP Cause and Effect Diagram are discussed.     Auto-generated transcript...   Speaker Transcript Rene and Dan Welcome to structured problem solving using the JMP cause and effect diagram, open source mind mapping software, and JSL. My name is Dan Sutton. I am a statistician at Samsung Austin Semiconductor, where I teach statistics and statistical software such as JMP. For the outline of my talk today, I will first discuss what structured problem solving, or SPS, is. I will show you what we have done at Samsung Austin Semiconductor using JMP and JSL to create an SPS script center. Next, I'll go over the current JMP cause and effect diagram and show how we at Samsung Austin Semiconductor use JSL to work with the JMP cause and effect diagram. 
I will then introduce you to mind mapping software such as Freeplane, a free open source software. I will then return to the cause and effect diagram and show how to use the third column option of labels for marking experiment, controlled, and noise factors. I want to show you how to extend cause and effect diagrams for five whys and cause mapping, and finally give recommendations for the JMP cause and effect platform. Structured problem solving. So everyone has been involved with problem solving at work, school or home, but what do we mean by structured problem solving? It means taking unstructured problem solving, such as in a brainstorming session, and giving it structure and documentation, as in a diagram that can be saved, manipulated and reused. Why use structured problem solving? One important reason is to avoid jumping to conclusions for more difficult problems. In the JMP Ishikawa example, there might be an increase in defects in circuit boards. Your SME, or subject matter expert, is convinced it must be the temperature controller on the solder process again. But having a saved structure, as in the cause and effect diagram, allows everyone to see the big picture and look for more clues. Maybe it is temperature control on the solder process, but a team member remembers seeing on the diagram that there was a recent change in the component insertion process, and that the team should investigate. In the free online training from JMP called Statistical Thinking in Industrial Problem Solving, or STIPS for short, the first module is titled statistical thinking and problem solving. Structured problem solving tools such as cause and effect diagrams and the five whys are introduced in this module. If you have not taken advantage of the free online training through STIPS, I strongly encourage you to check it out. Go to www.JMP.com/statisticalthinking. This is the cause and effect diagram shown during the first module. 
In this example, the team decided to focus on an experiment involving three factors. This is after creating, discussing, revisiting, and using the cause and effect diagram for the structured problem solving. Now let's look at the SPS script center that we developed at Samsung Austin Semiconductor. At Samsung Austin Semiconductor, JMP users wanted access to SPS tools and templates from within the JMP window, instead of searching through various folders, drives, saved links or other software. A floating script center was created to allow access to SPS tools throughout the workday. Over on the right side of the script center are links to other SPS templates in Excel. On the left side of the script center are JMP scripts. It is launched from a customization of the JMP menu. Instead of putting the scripts under add-ins, we chose to modify the menu to launch a variety of helpful scripts. Now let's look at the JMP cause and effect diagram. If you have never used this platform, this is what the cause and effect diagram looks like in JMP. The user selects a parent column and a child column. The result is the classic fishbone layout. Note that the branches alternate left and right and top and bottom to make the diagram more compact for viewing on the user's screen. But the classic fishbone layout is not the only layout available. If you hover over the diagram, you can select change type and then select hierarchy. This produces a hierarchical layout that, in this example, is very wide in the x direction. To make it more compact, you do have the option to rotate the text to the left, or you can rotate it to the right, as shown here in the slides. Instead of rotating just the text, it might be nice to rotate the diagram also, to run left to right. In this example, the images from the previous slide were rotated in PowerPoint to illustrate what it might look like if the user had this option in JMP. JMP developers, please take note. 
As you will see later, this has more the appearance of mind mapping software. The third layout option is called nested. This creates a nice compact diagram that may be preferred by some users. Note that you can also rotate the text in the nested option, but the result may not be as desired. Did you know the JMP cause and effect diagram can include floating diagrams? For example, parking lots that can come up in a brainstorming session. If a second parent is encountered that is not used as a child, a new diagram will be created. In this example, the team is brainstorming and someone mentions, "We should buy a new machine or used equipment." Now, this idea is not part of the current discussion on causes. So the team facilitator decides to add it to the JMP table as a new floating node called a parking lot, and the JMP cause and effect diagram will include it. Alright, so now let's look at some examples of using JSL to manipulate the cause and effect diagram. New scripts to manipulate the traditional JMP cause and effect diagram and its associated data table were added to the floating script center. You can see examples of these to the right on this PowerPoint slide. JMP is column based, and the column dialog for the cause and effect platform requires one column for the parent and one column for the child. This table is in what is called the tall format. But a wide table format might be preferable at times, such as in brainstorming sessions. With a click of a script button, our JMP users can change from a tall format to a wide format. In tall table format you would have to enter the parent each time you add a child. When done in wide format, the user can use the script button to stack the wide C&E table back to tall. Another useful script in brainstorming takes a selected cell and creates a new category. The team realizes that it may need to add more subcategories under wrong part. 
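The tall and wide table formats described above can be illustrated outside of JMP. The following is only a minimal Python sketch, not the JSL scripts used at Samsung Austin Semiconductor; the function names and the sample causes are assumptions made for illustration.

```python
def tall_to_wide(rows):
    """Group (parent, child) pairs into one list of children per parent.

    The tall format repeats the parent on every row; the wide format
    keeps one column per parent, which is easier to fill in while
    brainstorming.
    """
    wide = {}
    for parent, child in rows:
        wide.setdefault(parent, []).append(child)
    return wide


def wide_to_tall(wide):
    """Stack the wide table back into (parent, child) pairs."""
    return [(parent, child) for parent, children in wide.items() for child in children]


# Hypothetical causes, loosely modeled on the circuit board example.
tall = [
    ("Defects", "Solder process"),
    ("Defects", "Component insertion"),
    ("Solder process", "Temperature control"),
    ("Component insertion", "Wrong part"),
]

wide = tall_to_wide(tall)
for parent, children in wide.items():
    print(parent, "->", children)
```

Adding a subcause in the wide format is then a single append to one column, for example `wide.setdefault("Wrong part", []).append("Mislabeled reel")`, after which `wide_to_tall(wide)` gives back the two-column table that the platform needs.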
A script was added to create a new column from a selected cell while in the wide table format. The facilitator can select the cell, like wrong part, and then, on selecting this script button, a new column is created and subcauses can be entered below. In the diagram itself, you would hover over wrong part, right click, and select Insert Below. You can actually enter up to 10 items. The new causes appear in the diagram. And if you don't like the layout, JMP allows moving the text. For example, you can right click and move it to the other side. The JMP cause and effect diagram compacts the window by alternating branches left and right, and up and down. Some users may want the classic look of the fishbone diagram, but with all bones in the same direction. Clicking the script button labeled current C&E all bones to the left side sets them to the left and below. Likewise, you can click another script button that sets them all to the right and below. Now let's discuss mind mapping. In this section we're going to take a look at the classic JMP cause and effect diagram and see how to turn it into something that looks more like mind mapping. This is the same fishbone diagram as a mind map in Freeplane, which is open source software. Note the free form of this layout, yet it still provides an overview of causes for the effect. One capability of most mind mapping software is the ability to open and close nodes, especially when there is a lot going on in the problem solving discussion. For example, a team might want to close nodes (like components, raw card and component insertion) and focus just on the solder process and inspection branches. In Freeplane, closed nodes are represented by circles, which the user can click to open them again. The JMP cause and effect diagram already has the ability to close a node. Once closed, though, it is indicated by three dots, an ellipsis. In the current versions of JMP, there is actually no option to open it again. So what was our solution? 
We included a floating window that will open and close any parent column category. Over on the right, you can see that alignment, component insertion, components, etc., are all included as parent nodes. By clicking on the checkbox, you can close a node, and clicking again will open it. In addition, the script highlights the text in red when closed. One reason for using open source mind mapping software like Freeplane is that the source file can be accessed by anyone. It is not a proprietary format like that of other mind mapping software. You can actually open it in any text editor. The entire map can be loaded using JSL commands that access text strings. Use JSL to look for XML attributes to get the names of each node. A discussion of XML is beyond the scope of this presentation, but see the JMP Community for additional help and examples. Users at Samsung Austin Semiconductor would click on Make JMP table from a Freeplane .mm file. At this time, we do not have a direct JMP to Freeplane script. It's a little more complicated, but Freeplane does allow users to import text from the clipboard, using spaces to indent the nodes. So by placing the text in a journal (the example is on the left side of this slide), the user can then copy and paste into Freeplane, and you would see the Freeplane diagram on the right. Now let's look at adding labels of experiment, controlled, and noise to a cause and effect diagram. Another use of cause and effect diagrams is to categorize specific causes for investigation or improvement. These are often categorized as controlled or constant (C), noise (N), or experiment, which might be called X or E. For those who were taught SPC XL by Air Academy Associates, you might have used or still use the CE/CNX template. So to be able to do this in JMP, to add these characters, we would need to revisit the underlying script. 
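The Freeplane round trip described above can be sketched in a few lines. This is a hedged Python illustration rather than the JSL used in the talk: it assumes a plain .mm file in which each node stores its label in a `TEXT` attribute (nodes that use rich text content are not handled), and the sample map, function names and four-space indent are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written map in the Freeplane .mm XML layout (sample data).
SAMPLE_MM = """<map version="freeplane 1.7.0">
<node TEXT="Defects">
<node TEXT="Solder process">
<node TEXT="Temperature control"/>
</node>
<node TEXT="Component insertion">
<node TEXT="Wrong part"/>
</node>
</node>
</map>"""


def mm_to_parent_child(xml_text):
    """Return (parent, child) rows, the tall table the C&E platform needs."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(node):
        for child in node.findall("node"):
            rows.append((node.get("TEXT", ""), child.get("TEXT", "")))
            walk(child)

    for top in root.findall("node"):
        walk(top)
    return rows


def to_indented_text(xml_text, indent="    "):
    """Return the map as indented text, one node per line, for pasting."""
    root = ET.fromstring(xml_text)
    lines = []

    def walk(node, depth):
        lines.append(indent * depth + node.get("TEXT", ""))
        for child in node.findall("node"):
            walk(child, depth + 1)

    for top in root.findall("node"):
        walk(top, 0)
    return "\n".join(lines)


for parent, child in mm_to_parent_child(SAMPLE_MM):
    print(parent, "|", child)
print(to_indented_text(SAMPLE_MM))
```

The first function corresponds to the Freeplane to JMP direction, and the second produces the kind of indented text that can be copied from a journal and pasted into Freeplane.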
When you use the optional third column, the label column, the underlying display changes. When a JMP user adds a label column in the script, it changes the text edit box to a vertical list box with two new horizontal center boxes containing two text edit boxes, one with the original child and one with the value from the label column. The label has a default font color of gray, applied as illustrated here in this slide. Our solution using JSL was to add a floating window with all the children values listed. Whatever was checked could be updated for E, C or N and added to the table and the diagram. In fact, different colors could be specified by the script by changing the font color option, as shown in the slide. JMP cause and effect diagram for five whys and mind mapping causes. While exploring the cause and effect diagram, another use, as a five whys or cause mapping tool, was discovered. Although these SPS tools do not display well in the default fishbone layout, the hierarchy layout is ideal for this type of mapping. The parent and child become the why and because statements, and the label column can be used to add numbering for your whys. This is what it looks like on the right side. Sometimes there can be more than one reason for a why, and the JMP cause and effect diagram can handle it. This branching, or cause mapping, can be seen over here on the right. Even the nested layout can be used for a five whys. In this example, you can also set up a script to set the text wrap width, so users do not have to do each box one at a time. Or you can make your own interactive diagram using JSL. Here I'm just showing some example images of what that might look like. You might prompt the user in a window dialog for their whys and then fill in the table and diagram for the user, once again using the cause and effect diagram, as over on the left side of the slide. Conclusions and recommendations. All right. 
In conclusion, the JMP cause and effect diagram already has many excellent built-in features for structured problem solving. The current JMP cause and effect diagram was augmented using JSL scripts to add more options when being used for structured problem solving at Samsung Austin Semiconductor. JSL scripts were also used to make the cause and effect diagram act more like mind mapping software. So, what would be my recommendations? There are currently three layouts: fishbone, hierarchy, and nested, which use different types of display boxes in JSL. How about a fourth type of layout? How about a mind map type that allows a more flexible mind map layout? I'm going to add this to the wish list. And then finally, how about even a total mind map platform? That would be an even bigger wish. Thank you for your time, and thank you to Samsung Austin Semiconductor and JMP for this opportunity to participate in the JMP Discovery Summit 2020 online. Thank you.  
Heath Rushing, Principal, Adsurgo   DOE Gumbo: How Hybrid and Augmenting Designs Can Lead to More Effective Design Choices When my grandmother made gumbo, she never seemed to follow even her own recipe. When I questioned her about this, she told me, “Always try something different. Ya never know if you can make better gumbo unless you try something new!” This is the same with design of experiments. Too many times, we choose the same designs we’ve used in the past, unable to try something new in our gumbo. We can construct a hybrid of different types of designs or augment the original, optimal design with points using a different criterion. We can then use this for comparison to our original design choice. These approaches can lead to designs that allow you to either add relevant constraints (and/or factors) you did not think were possible or have unique design characteristics that you may not have considered in the past. This talk will present multiple design choices: a hybrid mixture-space filling design, an optimal design augmented using pre-existing, required design points, and an optimal design constructed by augmenting a D-optimal design with both I- and A-optimal design points. Furthermore, this talk presents the motivation for choosing these design alternatives as well as design choices that have been useful in practice.     Auto-generated transcript...   Speaker Transcript Heath Rushing My name is Heath Rushing. I am a principal for Adsurgo, a training and consulting company that works with a lot of different companies. This morning, I'm going to talk about some experiences that I had working with pharmaceutical and biopharmaceutical companies. A lot of scientists and engineers are doing things like process and product development and characterization, and formulation optimization. And what I found is that a lot of these scientists had designs that they had used in the past with a similar product or process or formulation. 
And what they did, going forward, is they just said, "Hey, let me just take that design that I've used in the past. It worked well enough in the past. So let's just go ahead and use that design." In each of these instances, what we did is we took the original design and came up with some sort of mechanism for doing something a little different. We either augmented it with a different sort of optimization criterion, or we augmented it before they added runs or after they added runs. In the first case, what we did was build a hybrid design. The first case was a formulation optimization problem, where a scientist in the past had a 30 run Scheffé cubic mixture design. In a mixture design, the factors in the experiment are mixture components, so each is a certain percentage, and the overall mixture adds up to 100%. They felt this worked well enough and helped them find an optimal setting for the formulation. However, one thing that they really wanted to touch on more is, they said, these designs tended to look at design points in our experiment near the edges, and what we want to do is further characterize the design space. So we took the original 30 run design, and instead of running that, we constructed an experiment where we ran 18 mixture design points and then augmented them with 12 space filling design points. A space filling design is used a lot in computer simulations. I said this at a conference one time: "You know, it's used to fill space." But really what these designs do, and I'm going to pull up the comparison of the two, is put design points throughout the design space. In this one, I try to minimize the distance between each of the design points. 
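The idea of a hybrid mixture and space filling design can be illustrated with a small sketch. This is not how JMP constructs either design; it swaps in a simple greedy maximin rule (add the candidate blend farthest from the points already in the design), and it assumes three mixture components, six boundary points and a fixed random seed purely for illustration.

```python
import math
import random


def boundary_mixture_points(q):
    """Pure blends and 50/50 binary blends: points on the simplex edges."""
    points = []
    for i in range(q):
        v = [0.0] * q
        v[i] = 1.0
        points.append(tuple(v))
    for i in range(q):
        for j in range(i + 1, q):
            v = [0.0] * q
            v[i] = v[j] = 0.5
            points.append(tuple(v))
    return points


def random_mixture(q, rng):
    """Uniform random blend: normalized exponential draws sum to 1."""
    draws = [rng.expovariate(1.0) for _ in range(q)]
    total = sum(draws)
    return tuple(d / total for d in draws)


def augment_space_filling(design, n_new, q, rng, n_candidates=2000):
    """Greedy maximin augmentation: each new run is the candidate whose
    nearest existing design point is farthest away."""
    design = list(design)
    candidates = [random_mixture(q, rng) for _ in range(n_candidates)]
    for _ in range(n_new):
        best = max(candidates, key=lambda c: min(math.dist(c, p) for p in design))
        design.append(best)
        candidates.remove(best)
    return design


rng = random.Random(2020)
base = boundary_mixture_points(3)               # 6 edge-heavy mixture runs
hybrid = augment_space_filling(base, 12, 3, rng)  # plus 12 interior-seeking runs
for point in hybrid:
    print([round(x, 3) for x in point])
```

Every added run still sums to 100%, so the mixture constraint is kept while the new points move away from the edges where the original runs sit.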
As you see, the design on the left, the one that they thought was adequate, was the 30 run mixture design. And as you see, it operates a lot near the edges and right in the center of the design. The one on the right was really 18 mixture design points augmented with 12 space filling design points. So it's really a hybrid of a mixture design and a space filling design. As you can see, based upon their objective of trying to characterize that design space a little bit better, the one on the right did a much better job of characterizing that design space, right? It had adequate prediction variance. It was the design they chose to run, and they found their optimal solution off of this. The second design choice is used a lot in process characterization. Back in the old days, before a lot of people used design of experiments for process characterization, what a lot of scientists would do was run center point runs, at the set point, and then also do what are called PAR runs, or proven acceptable range runs. So say that they had four process parameters. What they would do is keep three of the process parameters at the set point and have the fourth go to the extremes: the lowest value and the highest value. And they would do a set of experiments like that for each of the process parameters. What they're really showing is that, if everything is at set point and one of these deviates to near the edges, then the process is still well within specification. And so they still like to do a lot of these runs. The design that I started off with was from a scientist who took those PAR and center point runs and added them after building an I-optimal design. An I-optimal design is used for prediction and optimization. 
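The PAR runs described above are easy to enumerate. The sketch below is a minimal Python illustration; the four parameter names, their ranges, and the choice of three center point runs are all assumptions, since the talk does not list them.

```python
def par_runs(factors, n_center=1):
    """Build center point runs plus proven acceptable range (PAR) runs.

    factors maps a parameter name to (low, set_point, high). Each PAR run
    holds every parameter at its set point except one, which is moved to
    its low or high extreme.
    """
    center = {name: set_point for name, (low, set_point, high) in factors.items()}
    runs = [dict(center) for _ in range(n_center)]
    for name, (low, set_point, high) in factors.items():
        for extreme in (low, high):
            run = dict(center)
            run[name] = extreme
            runs.append(run)
    return runs


# Hypothetical process parameters: (low, set point, high).
factors = {
    "temperature": (30.0, 35.0, 40.0),
    "pH": (6.8, 7.0, 7.2),
    "time": (10.0, 12.0, 14.0),
    "concentration": (0.5, 1.0, 1.5),
}

runs = par_runs(factors, n_center=3)
for run in runs:
    print(run)
```

With four parameters this gives eight PAR runs plus however many center points are requested. A list like this could then be supplied as the required starting runs before an optimal design is built around them.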
And in this case, that's the kind of design that they wanted, but they added the runs after the I-optimal design. My question to them was this: why don't you just take those runs and add them before you build the I-optimal design? If that was the case, the ??? algorithm in JMP would say, "You know, I'm going to take those points and I'm going to come up with the next best set of runs." So we took the original 11 design points and augmented them with 18 I-optimal points. Whenever we did this, if you look at the design where the PAR runs were added prior, you see that the power for the main effects, the two-factor interactions, and the quadratic effects is higher than if you added the PAR runs after. If you look at the prediction variance, it is very similar. But right near the edges of the design space, you see that, whenever we had the PAR runs augmented with I-optimal points, it was a lot smaller. The key here is, whenever I was looking at the correlation, I think the correlations, especially with the main effects, are a lot better with the PAR runs augmented to an I-optimal design versus what they did before, where they took the I-optimal design and just augmented it with the PAR runs. The third design. The third design was when I had a scientist take a 17 run D-optimal design, augment it with eight runs, and go from a D- to an I-optimal design. So they started off with a D-optimal design, a screening design, and they augmented it with points to move to an I-optimal design. JMP has a design that is not really new, but is new for JMP; it's called an A-optimal design. An A-optimal design allows you to weight any of those factors. And so I had an idea. 
I just said, "You know, I have many times in the past gone from a D-optimal design augmented to an I-optimal design. What if we did this? What if we took that original 17 run D-optimal design and augmented it to an I, then an A, where we weighted those quadratic terms? Or we took the D-optimal design, augmented it to an A-optimal design where we weighted the quadratic terms, and then to an I-optimal design?" So it's really two different augmentations, going from a D to an A to an I, and from a D to an I to an A. I also went straight from D to A. And I wanted to compare these to the original design choice, which was a D to an I-optimal design. Now, I really would like to tell you that my idea worked. But I think, as a good statistician, I should tell you that I don't think it did. If I look at the prediction variance, which, in terms of response surface design, we're trying to minimize across the design region, you see the prediction variance for their original design is lower, even much lower than whenever I went straight to the A-optimal design. If you look at the fraction of design space plot, you'll see that the prediction variance is much smaller across the design space than for the A-optimal design, and it's a little bit better than when I went from D to A to I, and D to I to A. The only negative that I saw with the original design compared to the other design choices was that there were some quadratic effects that had a little bit higher correlation than I would like to see. And you see that with the A-optimal design, the quadratic effects have much lower correlation. So my original thesis: many times, scientists and engineers have designs they've done in the past. And I always say, it makes sense that we don't just want to do that same design that we've done in the past. Let's try something different. 
The product can be a little bit different. The process can be a little bit different. The formulation can be a little bit different. If you use that to compare to the original design, you can pick your best design choice. Last thing, I would like to thank my team members at Adsurgo. We always have our customers coming up with challenging problems, and our team members always working for optimal solutions for our customers. Now, the last thing that I have to say is that these designs were taken from examples from customers, but they weren't the exact examples. There's nothing with their data. So I would like to give a shout out to one of my customers, Sean Essex from Poseida Therapeutics, who often comes up with some very hard problems. Sometimes he'll come up with a problem, and I'll say, you know, this is a solution, and it's something that we really haven't even seen yet. So have a great day.  
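The design comparisons in this talk lean on prediction variance. For a linear model, the relative prediction variance at a point x is f(x)' (X'X)^-1 f(x), where f(x) is the model row at x and X is the model matrix of the design. The sketch below computes it for a deliberately small case, a quadratic model in one factor, using plain Python. The two six-run designs are assumptions chosen only to show the mechanics; they are not the designs from the talk.

```python
def model_row(x):
    """Model terms for a quadratic in one factor: intercept, x, x^2."""
    return [1.0, x, x * x]


def information_matrix(design):
    """X'X for the given list of design points."""
    rows = [model_row(x) for x in design]
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def prediction_variance(design, x):
    """Relative prediction variance f(x)' (X'X)^-1 f(x)."""
    f = model_row(x)
    z = solve(information_matrix(design), f)
    return sum(fi * zi for fi, zi in zip(f, z))


def average_variance(design, n_grid=201):
    """Average relative prediction variance over a grid on [-1, 1]."""
    grid = [-1.0 + 2.0 * k / (n_grid - 1) for k in range(n_grid)]
    return sum(prediction_variance(design, x) for x in grid) / n_grid


edges = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0]    # runs at the extremes and center
spread = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]  # runs spread across the range

for name, design in (("edges", edges), ("spread", spread)):
    print(name, round(average_variance(design), 4))
```

Averaging the variance over the region is the quantity an I-optimal design tries to keep small, and sorting the grid values instead of averaging them gives the kind of curve shown in a fraction of design space plot.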
Thor Osborn, Principal Systems Research Analyst, Sandia National Laboratories   Parametric survival analysis is often used to characterize the probabilistic transitions of entities (people, plants, products, etc.) between clearly defined categorical states of being. Such analyses model duration-dependent processes as compact, continuous distributions, with corresponding transition probabilities for individual entities as functions of duration and effect variables. The most appropriate survival distribution for a data set is often unclear, however, because the underlying physical processes are poorly understood. In such cases a collection of common parametric survival distributions may be tried (e.g., the Lognormal, Weibull, Fréchet and Loglogistic distributions) to identify the one that best fits the data. Applying a diverse set of options improves the likelihood of finding a model of adequate quality for many practical purposes, but this approach offers little insight into the processes governing the transition of interest. Each of the commonly used survival distributions is founded on a differentiating structural theme that may offer valuable perspective in framing appropriate questions and hypotheses for deeper investigation. This paper clarifies the fundamental mechanisms behind each of the more commonly used survival distributions, considering the heuristic value of each mechanism in relation to process inquiry and comprehension.     Auto-generated transcript...   Speaker Transcript   Hello, and welcome to my [...] over the past 25 years, I have performed many studies and [...] share with you a way of thinking about the distributions we [...] motivated by precedent, ease of use, or empirically demonstrated [...] about its processes. 
Further, when an excellent model fit is [...] genesis of the distributions commonly used in parametric [...] seen in the workplace as well as in the academic literature. [...] literature, including textbooks and web based articles, as well [...] reexamination that may fail to glean full value from the work. [...] the exponential. Much is often made about the [...] because they model fundamentally different system archetypes. In [...] distribution does, in fact, fit the lognormal data very well. The quality of the fit may also [...] fits much better. And secondly, there's only a modest coincident [...] the core process mechanisms these distributions represent [...] analysis, but it provides a very familiar starting point for [...] uncorrelated effects. Let's see if that is true. In order to create a good [...] 25,000. For the individual records, we'll use the random [...] see that we did indeed obtain the normal distribution. Now let's consider the [...] not able to imprint my brain with a sufficient knowledge of [...] lognormal distribution are also very simple. As you can see, the [...] this demonstration, we reuse the fluctuation data that were [...] JSL scripting because I find it much more convenient for [...] the number of records in each sample. Next, it extracts the [...] products. The outer loop tracks the [...] on the previous slide. The amplified product compensates [...] distributions may be considered as generated secondarily from [...] many similar internal processes is represented by its maximum [...] to be Fréchet distributed. 
The Weibull distribution represents [...] processes that complete when any of multiple elements have [...] using the Pareto distribution as the source. In this case, the [...] absolute value of the normal distribution as the source. Now let's have a quick look at [...] maximum is used. For the square root of the [...] is not available, you can also see that the other common [...] value of the normal distribution quite well. Incidentally, Weibull [...] distribution when its core behavior is substantially [...] the four heme containing subunits mechanically interact [...] up to now have all relied on independent samples. Professor [...] extended to produce autocorrelated data. Generation of [...] sequence autocorrelation is about .75, yet the [...] the common survival distributions. You can see that [...] good example of the relationship between real-world analytical [...] commingle single family residences with heavy industry. [...] have similar features. The landowner must apply to the [...] an opportunity to comment. Local officials then weigh the [...] parties. This example is not approached as a demonstration [...] processing time is 140 days. The fit is obviously imperfect, but [...] distributed data results from processes yielding the combined [...] ubiquitous, but the loglogistic is less frequently used. 
Without [...] multistep process may be insufficient to impart log [...] considered and the complexity of the underlying process should [...] whether a process is substantially impacted by [...] whether the cooperative element is connoted by positive terms such [...] often been said, I would sincerely appreciate your [...]
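The generating mechanisms named in this talk (sums of many small effects for the normal, products for the lognormal, the minimum of many elements for the Weibull, the maximum for the Fréchet) can be tried out with a short simulation. The sketch below is in Python rather than the JSL used in the talk, and most of its details are assumptions: the record count, the number of effects per record, the uniform fluctuations, and the pairing of the Pareto source with the maximum and the absolute normal source with the minimum are inferred from the surviving transcript fragments, not confirmed by them.

```python
import math
import random
import statistics

rng = random.Random(25)
N_RECORDS = 2000   # assumed sample size, kept small so the sketch runs quickly
N_EFFECTS = 50     # assumed number of small effects combined per record


def skewness(values):
    """Sample skewness: near 0 for symmetric, bell-shaped data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


sums, products, minima, maxima = [], [], [], []
for _ in range(N_RECORDS):
    fluctuations = [rng.uniform(0.9, 1.1) for _ in range(N_EFFECTS)]
    sums.append(sum(fluctuations))            # additive effects
    products.append(math.prod(fluctuations))  # multiplicative effects
    # Process that completes when the first of many elements does.
    minima.append(min(abs(rng.gauss(0.0, 1.0)) for _ in range(N_EFFECTS)))
    # Process governed by the largest of many heavy-tailed elements.
    maxima.append(max(rng.paretovariate(3.0) for _ in range(N_EFFECTS)))

log_products = [math.log(p) for p in products]

print("skewness of sums:         ", round(skewness(sums), 3))
print("skewness of products:     ", round(skewness(products), 3))
print("skewness of log(products):", round(skewness(log_products), 3))
print("skewness of minima:       ", round(skewness(minima), 3))
print("skewness of maxima:       ", round(skewness(maxima), 3))
```

The logarithm of a product is a sum of logarithms, which is why the logged products look roughly normal and the products themselves roughly lognormal. The four columns could then be fit with the corresponding distributions in any survival analysis tool to check how well each mechanism matches its namesake.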
James Wisnowski, Principal Consultant, Adsurgo Andrew Karl, Senior Statistical Consultant, Adsurgo Darryl Ahner, Director OSD Scientific Test and Analysis Techniques Center of Excellence   Testing complex autonomous systems such as auto-navigation capabilities on cars typically involves a simulation-based test approach with a large number of factors and responses. Test designs for these models are often space-filling and require near real-time augmentation with new runs. The challenging responses may have rapid differences in performance with very minor input factor changes. These performance boundaries provide the most critical system information. This creates the need for a design generation process that can identify these boundaries and target them with additional runs. JMP has many options to augment DOEs conducted by sequential assembly where testers must balance experiment objectives, statistical principles, and resources in order to populate these additional runs. We propose a new augmentation method that disproportionately adds samples at the large gradient performance boundaries using a combination of platforms to include Predictor Screening, K Nearest Neighbors, Cluster Analysis, Neural Networks, Bootstrap Forests, and Fast Flexible Filling designs. We will demonstrate the Boundary Explorer add-in tool with an autonomous system use-case involving both continuous and categorical responses. We provide an improved “gap filling” design that builds on the concept behind the Augment “space filling” option to fill empty spaces in an existing design.     Auto-generated transcript...   Speaker Transcript James Wisnowski Welcome, team Discovery. Andrew Karl, Darryl Ahner and I are excited to present two new adaptive sampling techniques that are really going to provide practitioners some wonderful usefulness in terms of augmenting designs of experiments. 
And what I want to do is go through a couple of our processes here and talk about how this all came about. When we think about DOE and augmenting designs, there is a robust capability already in JMP. What we have found, though, working with some very large scale simulation studies, is that we're missing a piece here: gap filling designs and adaptive sampling designs. And the key point is that the adaptive sampling designs are going to be focusing on the response. So this is quite different from a standard design where you augment by looking at the design space and the X matrix. Now we're going to actually take into account the targets, or the responses. This will end up providing a whole new capability, so that we can test additional samples where the excitement is. We want to be in a high gradient region, much like steepest ascent in response surface methodology. Now we're going to automate that, in terms of being able to do this with many variables and thousands of runs in the simulation. The good news is that this also scales down quite nicely for the practitioner with small designs. I'll do a quick run through of the add-in that we're going to show you, and then Andrew will talk a little bit about the technical details. One thing I do want to apologize for: this is going to be fairly PowerPoint centric rather than a JMP demo, for two reasons. Primarily because of our time, as we've got a lot of material to get through, but also because our JMP utilization is really in the algorithm that we're making in this adaptive sampling. Ultimately, the point and click of JMP is a very simple user interface that we've developed, but what's behind the scenes in the algorithm is really the power of JMP here. So real quick, the gap filling design, pretty clear. 
We can see there are some gaps here. Maybe this is a bit of an exaggeration, but it is demonstrative of the technique; in reality we may have a very large number of factors, where the curse of dimensionality can come into play, and you have these holes in your design. You can see we could augment it with a space filling design, which is kind of the workhorse in augmentation for a lot of our work, particularly in simulation modeling, and it doesn't look too bad. If we look at those blue points, which are the new points that we've added, it doesn't look too bad. But if you start looking a little closer, you can see that we started replicating a lot of the points that we've already done, and maybe we didn't fill in those holes as much as we thought, particularly when we take off the blue coloring and can see that there's still a fair amount of gaps in there. So, as we were developing adaptive sampling, we recognized that one piece of it was that we needed to fill in some holes in a lot of these designs. And we came up with an algorithm in our tool, our add-in, called Boundary Explorer, that will do this particular function for any design, to fill in the holes, and you can see where that might have a lot of utility in many different applications. In this particular graph, we can see that those blue points are now more concentrated in the holes, and there are some that are dispersed throughout the rest of the region. And when we take off the coloring, it looks a lot more uniform across; we have filled that space very well. Now that was more of a utility that we needed for our overall goal here, which was adaptive sampling. The primary application that we found for this was autonomous systems, which have gotten a lot of buzz and a lot of production and test, particularly in the Department of Defense. 
In autonomous systems, there are really two major items. You need some sensor to let the system know where it is, and then the algorithm or software to react to that. So it's the sensor-algorithm software integration that we're primarily focused on. What that then drives is a very complex simulation model that honestly needs to be run many, many thousands of times. But more importantly, what we have found is that in these autonomous systems, there are these boundaries in performance. For example, we have a leader-follower example from the Army. That's where a soldier would drive a very heavily armored truck in a convoy and the rest of the convoy would be autonomous; they would not have soldiers in them. Or think of maybe the latest Tesla, the pickup truck, where you have auto nav, right? So the idea is we are testing these systems, and we end up having to do a lot of testing. And what happens is, for example, maybe in this Tesla, at 30 miles an hour you may be fine avoiding an obstacle. But at 30.1 you would have to do an evasive maneuver that's outside the algorithm specifications. That's what we mean when we say these boundaries are out there. They're very steep changes in the response, very high gradient regions. And that's where we want to focus our attention. We're not as interested in where it's a flat surface; the boundary is the interesting part, and that's where we would like to test. And honestly, what we found is, the more we iterate over this, the better our solution becomes. We completely recommend doing this as an iterative process. Hence the adaptive piece of this: do your testing, then generate some new good points, then see what those responses are, and then adapt your next set of runs to them. 
So that's our adaptive sampling. The genesis of this idea came from some work that we did with the Applied Physics Lab at Johns Hopkins University. They are doing some really nice work with the military, and while reviewing it in one of their journal articles, I was thinking to myself, you know, this is fantastic in terms of what you're doing, and we could even use JMP to maybe come up with a solution that would be more accessible to many of the practitioners. Because the problem with the Johns Hopkins approach is that it's very specific and it's somewhat...to integrate; it's not something that's very accessible to the smaller test teams. So we want to put this in the hands of folks that can use it right away. So this paper from the Journal of Systems and Software is kind of the source of our boundary explorer. And as it turns out, we used a lot of the good ideas, but we were able to come up with different approaches and other methods, in particular using native capability in JMP Pro as well as some development, like the gap filling design, that we did along the way. Now, in terms of this example problem, it's probably best if I just go and explain it right in a demo here. So if I look at a graph here, I'll just go back to the Tesla example. Let's say I'm doing an auto navigation type activity and I have two input factors, and let's say we have speed and density of traffic. So we're thinking about this Tesla algorithm. It wants to merge into the lane to the left, so it wants to, I should say, pass. So it has to merge. So one factor would be the speed the Tesla is going and the other might be the density of traffic. And then maybe down in this area here we have a lower number. So we can think of these numbers, two to 10, as the responses, maybe even like a probability of a collision. 
So down at low speed/low density, we have a very low probability of collision, but up here at high speed/high density, you have a very high probability. But the point is, in what I have highlighted and selected here, you can see that there are very steep differences in performance along the boundary region. So as we do the simulation and start doing more and more software tests for the algorithm, we'll note that it really doesn't do us a lot of good to get more points down here. We know that we do well at low density and low speed. What we want to do is really work on the area at the boundaries here. So that's our problem: how can I generate 200 new points that are really going to follow my boundary conditions here? Now, what I have here is X1 and X2; again, think of the speed as well as the density. And then I just threw in a predictor variable here that doesn't mean anything. And then there's our response. So to do this, all I have to do is come into boundary explorer and, under adaptive sampling, put in my two responses (and you can have as many responses as you need), and then here are my three input factors. And then I have a few settings here: whether or not I want to target the global minimum and max, and show the boundary. And we are also ultimately going to show you that you have some control here. So what happens in this algorithm is that we're really looking at what the nearest neighbors are doing. If all the nearest neighbors have the same response, as in zero probability of having an accident, that's not a very interesting place. I want to see where there are big differences. And that's where that nearest neighbor comes into play. So I'll go ahead and run this. And what we can see right now is that the algorithm uses JMP's native capability for the predictor screening and, fortunately, is not using the normal distribution. 
You can see it's running the bootstrap forest. Andrew is going to talk about where that is used. And ultimately what we're going to do here is generate a whole set of new points that should hopefully fall along the boundary. So that took, you know, 30 seconds or so to do these points, and from here I can just go ahead and pull up my new points. So you can see my new points are sort of along those boundaries, probably easiest seen if I go ahead and put in the other ones. So right here, maybe I'll switch the color real quick. And I'll go ahead and show maybe the midpoint and the perturbation. So right now we can see where all the new points are. The ones that are shaded are the ones that were original, and now we're seeing all of my new points that have been generated on that boundary. So of course the question is, how did we do that? So what I'll do is head back to my presentation. And from there, I'll turn it over to Andrew, where he'll give a little bit more technical detail in terms of how we go about finding these boundary points, because it's not as simple as we thought.
Andrew Karl Okay. Thanks, Jim. I'm going to start out by talking about the gap filling real quick, because in addition to being integrated into the overall beast tool, we've also made it a standalone tool. It's got a rather simple interface where we select the columns that define the space that we want to fill in. For continuous factors, it reads in the coding column property to get the high and low values, and it can also take nominal factors. In addition, if you have generated this from custom design or space filling design and you have disallowed combinations, it will read in the disallowed combinations script and only do gap filling within the allowed space. So the user specifies their columns, as well as the number of new runs they want. 
And let me show a real quick example in a higher dimensional space. This is a case of three dimensions. We've got a cube where we took out a hollow cylinder and we went through the process of adding these gap filling runs, and we'll turn them on together to see how they fit together, and then also turn off the color to see what happens. So this is nice because in the higher dimensional space, we can fill in these gaps that we couldn't even necessarily see in the bivariate plots. So how do we do this? What we do is take the original points, which in this case are colored red now instead of black, and we can see where those two gaps were, and we overlay a candidate set of runs from a space filling design for the entire space. In the concatenated data table of the old and the new candidate runs, we have a continuous indicator column; we label the old points 1 and we label the candidate points 0. And in this concatenated space, we now fit a 10 nearest neighbor model to the indicator column and we save the predictions from this. So the candidate runs with the smallest predictions, in this case blue, are the gap points that we want to add into the design. Now, if we do this in a single pass, what it tends to do is overemphasize the largest gaps. So what we do is actually run this as a tenfold process, where we take a tenth of our new points, select them as we see here, add those in, and then rerun our k-nearest neighbor algorithm to pick out some new points and to fill out all the spaces more uniformly. So the gap filling is one option available within boundary explorer. Jim showed that we can use any number of responses and any number of factors, and we can have both continuous and nominal responses and continuous and nominal factors. The continuous factors that go in, we are going to normalize behind the scenes to the range 0 to 1 to put them on a more equal footing. 
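The gap filling steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the add-in's implementation: the function name `gap_fill`, its parameters, and the brute-force neighbor search are hypothetical, the factors are assumed to be continuous and already scaled to 0 to 1, and each row is counted among its own neighbors.

```python
import math

def gap_fill(existing, candidates, n_new, k=10, passes=10):
    """Pick n_new candidate points that sit in the emptiest regions of the design."""
    points = list(existing) + list(candidates)
    # Indicator column: 1 for points already in the design, 0 for candidates.
    labels = [1.0] * len(existing) + [0.0] * len(candidates)
    open_rows = list(range(len(existing), len(points)))  # candidates not yet chosen
    per_pass = max(1, math.ceil(n_new / passes))
    chosen = []

    def predicted_indicator(i):
        # k-nearest-neighbor prediction: mean indicator of the k closest rows
        # (the row itself is included, at distance 0).
        nearest = sorted(range(len(points)),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        return sum(labels[j] for j in nearest) / len(nearest)

    while len(chosen) < n_new and open_rows:
        # Lowest prediction = fewest design points nearby = largest gap.
        open_rows.sort(key=predicted_indicator)
        take = open_rows[:min(per_pass, n_new - len(chosen))]
        for i in take:
            labels[i] = 1.0  # a chosen point now counts as part of the design
            chosen.append(points[i])
        open_rows = open_rows[len(take):]
    return chosen
```

Selecting only a fraction of the new points per pass and then relabeling them as design points is what keeps the largest gap from absorbing every new run.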
And for the individual responses that go into this, we are going to loop individually over each response to find the boundaries for each of the responses within the factor space. And then at the end, we have a multivariate tool using a random forest that considers all of the responses at once. And so we'll see how each of the different options available here in the GUI, in the user interface, comes up within the algorithm. So after normalization of any of these continuous columns, the first step is predictor screening for both the continuous and nominal responses. What this does is find the predictors that are relevant for each particular response. And we have a default setting in the user interface of .05 for proportion of variance explained, or portion of contribution from each variable. So in this case, we see that X1 and X2 are retained for response Y1, and X3 noise is rejected. The next step is to run a nearest neighbor algorithm. We use a default of 5, but that's an option that the user can toggle. And we aren't so concerned with how well this predicts as we are with simply using it as a method to get to the five nearest neighbors. What are the rows of the five nearest neighbors, and how far are they? What are the distances from the current row? And we're going to use this information about the nearest neighbors to identify, for each point, the probability of that point being a boundary point. We have to split here and use a different method for continuous or nominal responses. For the nominal responses, what we do is concatenate the response from the current row along with the responses from the five nearest neighbors, in order, in this concat neighbors column. And we have a simple heuristic we use to identify the boundary probability based on that concat neighbors column. If all the responses are the same, we say it's low probability of being a boundary point. 
If just one of the responses is different, then we say it's got a medium probability of being a boundary point. And if two or more of the responses are different, it's got a high probability of being a boundary point. We also record the row used. In this case, that is the boundary pair: the closest neighbor that has a response that is different from the current row. We can plot those boundary probabilities in our original space filling design. So as Jim mentioned early on, we initially run a space filling design before running this boundary explorer tool, to explore the space and to get some responses. And now we feed that in and we've calculated the boundary probability for these points. And we can see that our boundary probabilities are matching up with the actual boundaries. For continuous responses, we take the continuous response from the five nearest neighbors, add a column for each of those, and take the standard deviation of those. The points with the largest standard deviations of neighbors are the points that lie in the steepest gradient areas, and those are more likely to be our boundary points. We also multiply the standard deviation by the mean distance in order to get our information metric, because for two points that have an equal standard deviation of neighbors, that will upweight the one that is in a more sparse region with fewer points there already. So now we've got this continuous information metric, and we have to figure out how to split it up into high, medium, and low probabilities for each point. So what we do is fit a distribution. We fit a normal 3 mixture, and we use the mean of the largest distribution as the cutoff for the high probability points. And we use the intersection of the densities of the largest and the second largest normal distributions as the cutoff for the medium probability points. 
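A rough sketch of the two neighbor-based scores just described, in plain Python. The helper names (`nearest_neighbors`, `nominal_boundary_level`, `information_metric`) are hypothetical, and reading "medium" as exactly one differing neighbor is an assumption about the heuristic; the mixture-based cutoffs for the continuous metric are not shown.

```python
import math
import statistics

def nearest_neighbors(points, i, k=5):
    """Return [(distance, row)] for the k rows closest to row i, excluding row i."""
    others = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return sorted(others)[:k]

def nominal_boundary_level(points, responses, i, k=5):
    """Low / medium / high boundary probability from how many neighbors disagree."""
    neighbors = nearest_neighbors(points, i, k)
    differing = [j for _, j in neighbors if responses[j] != responses[i]]
    if not differing:
        level = "low"
    elif len(differing) == 1:   # assumption: "medium" means exactly one differs
        level = "medium"
    else:
        level = "high"
    # Row used: the closest neighbor whose response differs (None if none do).
    row_used = differing[0] if differing else None
    return level, row_used

def information_metric(points, responses, i, k=5):
    """Std. dev. of the neighbors' responses times the mean neighbor distance."""
    neighbors = nearest_neighbors(points, i, k)
    spread = statistics.stdev([responses[j] for _, j in neighbors])
    mean_distance = statistics.fmean([d for d, _ in neighbors])
    return spread * mean_distance
```

Multiplying by the mean distance means that, of two points with the same spread in neighbor responses, the one in the sparser region scores higher.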
So once we've identified those cutoffs, we apply them to form our boundary probability column. And we also retain the row used. In this case, for the continuous responses, that is the neighbor that has the response that's the most different in absolute value from the current row. So now for both continuous and nominal responses we have the same output. We have the boundary probability and the row used. Now that we've identified the boundary points, we need to be able to use that to generate new points along the boundary. So the first and, in some ways, the best method for targeting and zooming in on the boundary is what we call the midpoint method. We apply it to each boundary pair, each row and its row used, which is not the nearest neighbor but the neighbor identified previously as most relevant, either in terms of a different nominal response or the most different continuous response. For the continuous factors, we take the average of the coordinates of those two points to form the midpoint. And that's what you see in the graph here. So we would put a point at the red circle. For nominal factors, what we do for the boundary pairs is take the levels of that factor that are present in each of the two points and randomly pick one of them. The nice thing about that is if they're both the same, then the midpoint is also going to have the same level for that nominal factor. A second method, which we call the perturbation method, is to simply add a random deviation to each of the identified boundary points. So for the high probability points, we add two such perturbation points; for the medium, we add one. And for the continuous factors, we add a random deviation. 
It is normal with mean 0 and standard deviation .075 in the normalized space, and that .075 is something that you can scale within the user interface to either increase or reduce the amount of spread around the boundary. And then for nominal factors, what we do is randomly pick out a level of each of the nominal factors. Now for the high probability boundary points that get a second perturbation point, what we do in the second one is restrict those nominal factor settings to all be equal to that of the original point. So we do this process of identifying the boundary and creating the midpoints and perturbation points for each of the responses specified in the boundary explorer. Once we do that, we concatenate everything together, and then we look at all the midpoints identified for all the responses and use a multivariate technique to generate any additional runs. That's because the user can specify how many runs they want, and these midpoint and perturbation methods only generate a fixed number of runs, depending on the lay of the land, I guess you could say, for the data. So what we do is something similar to the gap filling design, where we take all of the identified perturbation points and midpoints for all of the responses and we fill the entire space with a space filling design of candidate points. We label the candidate points 0 in a continuous indicator, the midpoints 1, and the perturbation points .01. We fit a random forest to this indicator. We save the predictions for the candidate space filling points, and then we take the candidate runs with the largest predicted values of this boundary indicator. And those are the ones that we add in using this random forest method. Now since this is a multivariate method, if you have an area of your design space that is a boundary for multiple responses, that area will receive extra emphasis and extra runs. 
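The midpoint and perturbation methods are simple enough to sketch directly. The following is a minimal illustration, not the add-in's code: the function names and arguments are hypothetical, rows are plain lists in normalized space, `perturb` handles continuous factors only, and clipping perturbed values back into 0 to 1 is an added assumption that the talk does not mention.

```python
import random

def midpoint(row_a, row_b, nominal_cols=(), rng=random):
    """Midpoint of a boundary pair: average continuous factors,
    pick one of the pair's two levels at random for nominal factors."""
    return [
        rng.choice([a, b]) if col in nominal_cols else (a + b) / 2
        for col, (a, b) in enumerate(zip(row_a, row_b))
    ]

def perturb(row, sd=0.075, rng=random):
    """Add a Normal(0, sd) deviation to each continuous factor.
    Clipping to [0, 1] is an assumption of this sketch."""
    return [min(1.0, max(0.0, x + rng.gauss(0.0, sd))) for x in row]

# Example: a boundary pair that shares nominal level "a" in column 1.
rng = random.Random(0)
new_mid = midpoint([0.2, "a"], [0.4, "a"], nominal_cols={1}, rng=rng)
new_perturbed = perturb([0.30, 0.55], rng=rng)
```

When both rows of a pair share a nominal level, the random pick necessarily returns that level, which matches the behavior described for the midpoint method.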
So here we're showing the three types of points together. Now, again, to emphasize what Jim said, this needs to happen in multiple iterations, so we would collect this information from our boundary explorer tool and then concatenate it back into the original data set. And then after we have those responses, we rerun boundary explorer, and over the iterations it's going to continuously zoom in on the boundaries and, in fact, possibly even find more boundaries. So the perturbation points are most useful for early iterations when you're still exploring the space, since they're more spread out, and the random forest method is better for later iterations, because it will have more midpoints available: it uses not only the ones from the current iteration, but also the previously recorded ones. We have a column in the data table that records the type of point that was added, so we'll use all the previous midpoints as well. So if we put up our surface plot for this example we've been looking at, a step function, we can see our new points, midpoints and perturbation points, are all falling along the cliffs, which are the boundaries, which is what we wanted to see. So the last two options in the user interface are to include those gap filling runs, and we can also ask it to target global min, max, or match target for any continuous factors, if that's set as a column property. Just to show one final example here, we have this example where we have two canyons through a plane with a kind of deep well at the intersection of them. And we've run the initial space filling points, which are the points that are shown, to get an idea of the response. And if we run two iterations of our boundary explorer tool, this is where all the new points are placed, and we can see gaps in the middle of these two lines. What are those gaps? If we take a look at the surface plot, those gaps are the canyon floors, where it's not actually steep down there. 
So it's flat, even locally over a little region, but all of these midpoints have been placed not on the planes, but on the steep cliffs, which is where we wanted them. And here we're toggling the minimum points on and off, and you can see those are hitting the bottom of the well there. So we were able to target the minimum as well. So our tool presents two distinct, two discrete options, two new tools. One is the gap filling, which can be used on any data table that has coding properties set for the continuous factors. And then there is the boundary explorer tool, which can be used to add new runs that don't look at the factor space by itself, but look at the response in order to target the high gradient, high change areas for additional runs.  
Lisa Grossman, Associate Test Engineer, JMP Mandy Chambers, Principal Test Engineer, JMP   With a growing population in Cary, it is important to understand the environmental impact that a household can incur and what sustainable options are available that can minimize an individual’s environmental footprint. Recycling, for one, is a universally known method that reduces waste that would otherwise be disposed at landfills, which already is a capacity concern around the world. According to the Recycling Coalition of Utah, only 25% of plastic produced in the U.S. gets recycled, and recycling the other 75% could mean saving up to 44 million cubic yards of landfill space annually. Using recycling data that is recorded by the Town of Cary, we will analyze the relationship between the collected waste and recyclables. We will also construct visualizations to explore the breakdowns of waste and each recycling category. The goal is to compare our analysis with statistics from other cities in the U.S. to assess recycling practices of the people of the Town of Cary and determine levels of further recycling potential. And in the midst of a pandemic, we will discover how Covid-19 has influenced waste and recycling management within the country. With our findings, we hope to communicate about environmental initiatives and inform about recycling efforts in our very own community as well as addressing some impacts of Covid-19.     Auto-generated transcript...   Speaker Transcript Lisa Grossman Okay. All right. I'll get started. Hi everyone, my name is Lisa Grossman and my partner, Mandy Chambers, and I are both testers here at JMP. And today we are excited to share with you the work that Mandy and I have done with recycling and garbage collection data. So we were interested in looking at recycling and trash data in our own community, being Cary, North Carolina. 
And we were curious about what kind of patterns we would uncover in our exploration and learn about some Covid-19 impacts, all while using JMP. So in JMP, we're going to be using Graph Builder's visual tools to see trends in Cary's trash and recycling collection categories such as paper, plastic, glass, etc. And we're going to use Text Explorer's word cloud feature to identify some challenges for waste and recycling management that may have arisen due to the Covid-19 outbreak. And from what I show you today, we hope that you'll be able to use these quick and easy steps to explore your own data. And so for those of you who may not know, Cary, North Carolina is where SAS is headquartered. And Cary has a population of approximately 175,000 current residents, which is about a 30% increase since 2010. And thanks to the town of Cary, we were able to get our hands on some of the recycling and trash collection data they had recorded from 2010 to the present. So I wanted to quickly go over some of the steps we took in our process to explore the data, which include first importing the Excel sheets that we got from the town of Cary into JMP. And I wanted to note here that the Excel wizard does offer many advanced features that you might be interested in, in case you need to import Excel sheets into JMP. And to organize our data and columns, we used table operations like transpose, and we updated column properties in the column info dialog to make our data a little easier to graph later on. And then we launched the Text Explorer and Graph Builder platforms and used those to make our basic visualizations. And then I'm going to show you a new hover label feature that's available in JMP 15 called pasting graphlets, and I will show you an example of this later on using a tabulate. And so getting to our graphs and figures for Cary. Looking at them, we can see that the first two up top are looking at the breakdown of the recycling categories. 
So getting a closer look here, we can see the bar chart on the left is showing the average capture rates of the overlaid recycling categories from 2010 to 2019. And we can see that news, glass and cardboard are the three leading categories for recycling. And then in the line graph to the right, we can see the trends of the recycling rates of each category over the years. And what's interesting, and I wanted to point out here, is that it seems that news and mixed paper are inversely related to each other. And then going back to our poster, let's look at the last two graphs we have here. In these we are looking at recycling in comparison to garbage collection in Cary. So zooming in here, we can see the stacked bar on the left shows the total percentage of waste and recycling recorded each year. And by labeling the percentage values on the bars themselves, we can see that the recycling collection volume seems to have been slightly decreasing since 2014. And then in the graph to the right, we can see the progression of both trash and recycling from 2010 to 2019. And this visual shows us how the tonnage of trash is increasing each year, which seems expected as the population increases. But what is surprising is that the tonnages for recycling have remained rather steady. So thinking about this, we were wondering if this could be due to a rise in more sustainable products, such as personal water bottles or tumblers. But now that we are in the midst of Covid-19, we were curious to learn if there were any noticeable differences in recycling and trash collection so far this year. And the town of Cary was able to provide us with some updated data that goes up to the month of June. And so we created another stacked bar here to show how 2020 has compared so far to the previous year. And at first glance, 2020 is steadily increasing, and the labeled tonnages do not show a significant spike in collection so far. 
So then we decided to break it down month by month using our side-by-side bar charts to compare 2020 to 2019. And so our top bar chart here shows recycling overlaid by curbside, drop-off, and computer recycling. And then the bottom chart shows trash collection. So in the month of March, when North Carolina first implemented stay-at-home orders, Cary saw a nearly 21% increase in garbage and a 23.8% increase in recycling collection. And just for reference, a 21% increase is about 1.1 million pounds. And in April and May, trash and recycling somewhat leveled out but then spiked again in June, so it will be interesting to see how the rest of the year will pan out. So something I wanted to point out here is the information included in the hover label that is pinned. Using the labeling feature, which can be done by right-clicking on columns in your data table and selecting label, you will be able to see that column information represented in the hover label. So you can add as many columns as you'd like, so that you can read that information in your graph. So doing some further reading, we saw that Wake County, the county that Cary is in, reportedly generated about 29% more trash, totaling about 739 tons, 45% more cardboard recycling, and 20% more recycling in the week of April 13 alone. And we also found an estimate from the World Health Organization that the world is using about 89 million masks and 76 million gloves each month. And we found an article here that gives us some insight on how the Covid-19 outbreak has affected recycling and trash collection. And so by downloading the article and importing it into JMP, we could use the Text Explorer platform to identify some themes in the word cloud. So I'll zoom in on it here. You can use some features and options in Text Explorer like manage stop words and, in the word cloud itself, you can change the coloring, the layout, and the font to really customize your word cloud. 
And so after making these customizations, we got the word cloud that is shown to the right. And notice that an increase in tonnage has been the highlight for cities like Phoenix and New York City. And because of this, we were curious to learn more about recycling and trash management in New York. And luckily, we were able to find some open data for Brooklyn, Manhattan, Queens, Bronx and Staten Island. So if we look at the first line chart, which shows the average tonnage collection for paper and for metal, glass and plastic, we can see that in this chart I have the boroughs grouped. They are scaled here by the month, and we're looking at the recycling collection tonnages here on the y axis. So we can see that boroughs like Bronx and Brooklyn are steadily increasing starting in month 4, being April, but there is more of a spike in collection in both Staten Island and Queens. But what's very interesting is that there's a noticeable decrease in collection in Manhattan. And we were curious as to why this might be. With a little research, we came to the conclusion that stay-at-home orders meant that there were fewer workers in the city, therefore leading to reduced recycling capture. A similar trend can be seen for garbage collection rates in the line graph that we have here. In the same manner, Manhattan sees a dip in garbage collection, whereas Queens and Staten Island saw an increase. But something we wanted to highlight here in this graph, as a new feature of JMP 15, is this custom tabulate graphlet in the hover label. So notice that the pinned hover label here shows us a tabulate that gives us the tonnage values of both recycling categories and garbage collection for the months of January to June, just for Queens in 2020, which is the point on the line that we have pinned. Creating this line graph with a custom tabulate graphlet was only a matter of a couple of steps. 
So first we needed to make our base graph, which is the line graph we have here. Then we separately created our tabulate table, which is shown here. For space's sake, I couldn't include the whole tabulate, but as you can see, it shows the monthly averages of recycling and trash collection for each borough in 2019 and 2020. And so all we would have to do is go into that little red triangle menu in tabulate and save the script to our clipboard. And then the next thing we would do is go back to our base graph, right-click in the background, and under the hover label menu, there's going to be a paste graphlet option. And so you don't have to worry about any filtering or anything. With the paste graphlet, there is some magic that works behind it. And so that's all you would need to do, and each point would be filtered for you. So now when you hover over a point in your line, you can see that it is complete, and the filtered part of your tabulate corresponds to your point of interest. So this concludes our presentation on our findings with trash and recycling collection from the town of Cary and New York, and as the year plays out, I think it'd be very interesting to see how this data might change, and I hope to keep looking at it and see how 2020 will pan out. So we wanted to give some special thanks to Bob Holden and Srijana Guilford from the town of Cary for helping us through, working with us on their data, and sharing their data sets. And I have linked here the open data set for the New York data. And I believe it's constantly updated. So if you are interested in playing around with that data, it's available here. And I also have linked here some more information on graphlets. There are a ton of ways that you can use graphlets, and many, many ways that you can customize them too, so please check out this link and you can meet the developer, Nascif, and get some more information there. 
Thank you.  
Monday, October 12, 2020
Mandy Chambers, JMP Principal Test Engineer, SAS Kelci Miclaus, Senior Manager Advanced Analytics R&D, JMP Life Sciences, SAS   JMP has many ways to join data tables. Using traditional Join you can easily join two tables together. JMP Query Builder enhances the ability to join, providing a rich interface allowing additional options, including inner and outer joins, combining more than two tables and adding new columns, customizations and filtering. In JMP 13, virtual joins for data tables were developed that enable you to use common keys to link multiple tables without using the time and memory necessary to create a joined (denormalized) copy of your data. Virtually joining tables gives a table access to columns from the linked tables for easy data exploration. In JMP 14 and JMP 15, new capabilities were added to allow linked tables to communicate with row state synchronization. Column options allow you to set up a link reference table to listen and/or dispatch row state changes among virtually joined tables. This feature provides an incredibly powerful data exploration interface that avoids unnecessary table manipulations or data duplications. Additionally, there are now selections to use shorter column names, auto-open your tables and a way to go a step further, using a Link & ID and Link & Reference on the same column to virtually “pass through” tables. This presentation will highlight the new features in JMP with examples using human resources data followed by a practical application of these features as implemented in JMP Clinical. We will create a review of multiple metrics on patients in a clinical trial that are virtually linked to a subject demographic table and show how a data filter on the Link ID table enables global filtering throughout all the linked clinical metric (adverse events, labs, etc.) tables.     Auto-generated transcript...   Speaker Transcript Mandy Okay, welcome to our discussion today. Let's Talk Tables. 
My name is Mandy Chambers and I'm a principal test engineer on the JMP testing team. And my coworker and friend joining me today is Kelci Miclaus. She's a senior manager in R&D on the JMP life sciences team. Kelci and I actually began working together a few years ago as a Clinical product was starting to be a great consumer of all the things that I happen to test. So I got to know her group pretty well and got to work with them closely on different things that they were trying to implement. And it was really valuable for me to be able to see a live application where a customer would really be using the things that I actually tested in JMP, and how they would put them to use in the clinical product. So in the past, we've done this presentation. It's much longer and we decided that the best thing to do here was to give you the entire document. So that's what's attached with this recording, along with two sets of data and zip files. You should have data tables, scripts, some journals and the different things you need to be able to step through each of the applications, even if I end up not showing you, or Kelci doesn't show you, something that's in there. You should be able to do that. So let me begin by sharing my screen here so that you can see what I'm going to talk about today. So as I said, the journal that I had, if I were going to show this in its entirety, would be talking about joining tables and the different ways that you can join tables. And so this is the part that I'm not going to go into great detail on, but just a basic table join. If I click on this, it opens laptop runs and laptop subjects. And under the tables menu, if you're new to JMP or maybe haven't done this before, you can do a table join, and this is a physical join. This will put the tables together. So I would be joining laptop runs to laptop subjects. Within this dialog, you select the things that you want to join together. 
You can join by matching, Cartesian join, or row join, and then you would join the table. I'm not going to do that right now, just for the sake of time, but that's what you would do. And also in here under the tables menu, something else that I would talk about would be JMP query builder. And this has the ability to join more tables together. If you have 3, 4, 5, 6, however many tables you have, you can put them together and it will make up one table that contains everything. But again, I'm actually not going to do that today. So if I go back into here and I close these tables, let's get started with how virtual join came about. So let's talk about joining tables first. You have to decide what type of join you want to use. So if your tables are small, it might be easiest to do a physical join, to just do a tables join, like the two tables I showed you, which weren't very big. If you pull in three or four, maybe more tables, JMP query builder is a wonderful tool for building a table. And you may want all of your data in the same table, so that may be exactly what you want. You just need to be mindful of disk space and performance, and just understand that if you have five or six tables that you have sitting separately and then you join them together physically, you're making duplicate copies. So those are the ways that you might determine which you would use. Virtual join came about in JMP 13 and it was added with the ability to take a common link ID and join multiple tables together. It's kind of a concept of joining without joining. It saves space and it also saves duplication of data. And so in 13 we started with that. And then in 14 and 15, we added more features, things that customers requested. Linked tables with row state synchronization. You can shorten column names. We added being able to auto open linked tables. Being able to have a link ID and a link reference on the same column. 
And we also added these little hover tips that I'll show you, which can tell you which table is your column's source table. So those are the things that we added and I'm going to try to set this up and demonstrate it for you. So I've got this data that I actually got from... it's just an imaginary high-tech firm. And it's HR data and it includes things such as compensation, and headcount, and some diversity, and compliance, education history, and other employment factors. And so if you think about it, it's a perfect kind of data to link because you usually have a unique ID variable, such as an employee ID or something that you can link together, and maybe have various data for your HR team that's in different places. So I'm going to open up these two tables and just simply walk through what you would do if you were trying to link these together. So this table here is Employee Scores 1 and then I have Compensation Master 1 in the back. Employee Scores 1 is my source table. And Compensation Master is my referencing table. So you can see this ID variable here in this table. And it's also in the compensation master table. So I'm going to set up my link ID. There's a couple of different ways to do this. You can go into column properties. And you can see down here, you have a link ID and reference. The easiest way to do this is with a right click, so there's link ID. And if I look right here, I can see this ID key has been assigned to that column. So then I'm going to go into my compensation master table. And I'm going to go into this column. And again, you can do it with column properties. But the easiest way is by going right here to link reference. The table has the ID, so it shows up in this list. I'm going to click on this and voila, there's my link reference icon right there. And I can now see that all the columns that were in this table are available to me in this table. You can see you have a large number of columns. 
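The link ID and link reference setup just described can be pictured as a keyed lookup. The following is a minimal Python sketch of the idea, not JMP's implementation: the table contents, column names and the `virtual_column` helper are all invented for illustration. It assumes the source table has exactly one row per ID and shows the referencing table reading a column through the key without storing a copy of it.

```python
# Hypothetical stand-ins for the two HR tables; names and values are invented.
employee_scores = [  # source table: one row per ID (the "link ID" side)
    {"ID": 101, "Gender": "F", "Length of Service": 7},
    {"ID": 102, "Gender": "M", "Length of Service": 3},
]
compensation_master = [  # referencing table: many rows per ID (the "link reference" side)
    {"ID": 101, "Compensation Type": "Salary"},
    {"ID": 101, "Compensation Type": "Bonus"},
    {"ID": 102, "Compensation Type": "Salary"},
]

# Index the source table by its key once; no source rows are copied.
by_id = {row["ID"]: row for row in employee_scores}
assert len(by_id) == len(employee_scores), "link ID values must be unique"


def virtual_column(ref_rows, index, key, column):
    """Look up `column` in the source table for each referencing row, on demand."""
    return [index[row[key]][column] for row in ref_rows]


genders = virtual_column(compensation_master, by_id, "ID", "Gender")
print(genders)
```

The referencing rows never gain a stored `Gender` field; the value is resolved through the key each time it is needed, which is the "joining without joining" idea in miniature.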
You can also see in here that they're kind of long column names. You have the column names, plus this identifier right here which is showing you that this is a referencing column. And so I'm going to run this simple little tabulate I've saved here and just show you very briefly that, in this report, length of service is a virtual column. And then compensation type is actually part of my compensation table and then gender is a virtual column. So I'm building this using virtual columns and also columns that reside in the table. One thing I wanted to point out to you very quickly is that under this little red triangle...let's say you're working with this data and you decide, "Oh, I really want to make this one table. I really want all the columns in one table." There is a little secret tool here called merge reference data. And a lot of people don't know this is there, exactly. But if I wanted to, I could click that and I can merge all the columns into this table. But for time's sake, I'm not going to do that right now; I just wanted to point out where that is located. And let me just show you back here in the journal, real quickly. This is possible to do with scripting: you can set the link reference property, point to your table, and then use the linked columns. So I'm going to close this real quickly and then go back to the same application, where I actually have the same two tables, and I've got some extra saved scripts in here, a couple more things I want to show. So again, I've got employee scores. This is my source table. And then I've got compensation master, and they're already linked and you can see this here. So I want to rerun that tabulate and I want to show you something. So you can see that these column names are shorter now. So I want to show what we added in JMP 14. If I right click and bring up the column info dialog, I can see here that it says use linked column names right here. 
And that sets it so that these names will be shorter. And that's really a nice feature because, at the end of the day, when you share this report with someone, they don't really care where the columns are coming from, whether they're in the main table or a virtual table. So it's a nice, clean report for you to have. The script is saved, so you can see in the script that it shows you a referencing table. So if I look at this, I can see that. So you would know where this column is coming from, but somebody you're sharing with doesn't necessarily need to know. So I want to show you this other thing that we added with this dispatching of row states. Real quick example, I'm going to run this distribution. And you notice right away that in this distribution, I've got a title that says these numbers are wrong. And so let me point out what I'm talking about. Employee scores is my employee database table and it has about 3,600 employees. This is a unique reference to employees and it's a current employee database, let's say. My compensation master table is more like a history table and it has 12,000 rows in it, so it potentially has multiple references to the same employee. Let's say an employee changed jobs and they got a raise, or they moved around. Or it could have employees that are no longer in the company. So running this report from this table doesn't render the information that I really want. I can see down here that my count is off. I've got bigger counts, and I don't exactly have what I was looking for. So this is one of the reasons why we created this row state synchronization, and Kelci is going to talk a little bit more about this in a real life application, too. But I'm just simply going to show you how you would set up dispatching row states. So what I'm doing is I'm just dispatching selection, color and marker. 
And what I'm doing is I'm actually sending from compensation master to employee scores. I'm sending the information to this table because this is the table that I want my information to be run from. So if I go back and I rerun that distribution, I now have this distribution (it's slightly different columns). And if I look at the numbers right here, I have the exact numbers of my employee database. And that's exactly what I wanted to see. So you need to be careful with dispatching and accepting, and Kelci will speak more to that. But that was just a simple case example of how you would do that. And I will show you real quickly that there is an Online Help link that shows an example of virtually joining columns and showing row states. It'll step you through that. There are some other examples out here too of using virtual join, if you need more information about setting this up. And again, just to remind you, all of this is scriptable. So you can script this right here, by setting up your row states and the different things that you want with that. So as we moved into JMP 15 we added a couple more things. We added the ability to auto open a table and also to hover over columns and figure out where they're coming from. And I'll explain what that means exactly. So if I click on these. We created some new tables for JMP 15, employeemaster.jmp, which is still part of this HR data. And so if I track this down a little bit and look, there are a couple of things I'll point out about this table. It has a link ID and a link reference. And that was the other thing we added to JMP 15, the ability to have a link ID and link reference on the same column. So if I look at this and I go and look at my home window here, I can see that there are two more tables that are open. They were opened automatically for me. 
And so I'm going to open these up because I kind of want to string them out so you can see how this works. But this employee master table is linked to...let me stack them on top of each other...it's linked to the education history table, which has been, in turn, linked to my predicted termination table. And you can see there's an employee ID that has a link reference, and the link ID, employee ID, here. Same thing, and then predicted termination has a link ID only. And if you had another table or two that had employee ID unique data and you needed to pull it in, you could continue the string by assigning a link reference here and you can keep on going. So just to show you quickly, if I right click and look at this column here, I can see that my link ID is set, and I can also see my link reference is set. And it tells me education history is a table that this table is linked to. I've got it on auto open and I've got it on the shorter names. I'm not dispatching row states, so nothing is set there. So all of the columns that are in these other two tables are available to me, for my referencing table here called employee master. And real quickly, you can see that you have a large number of columns in here that are available to you, and the linked columns show up as grouped columns down here. So another question that got asked by customers was, "Is there any way you can tell us where these columns come from, so that it's a little clearer?" So we added this nice little hover tip. If I hover over this, this tells me that this particular column, disability flag, is coming from predicted termination. So it's actually coming from the table that's last in my series. And if I go down here and I click on one of these, it says the degree program code is coming from education history. So that's a nice little feature that will kind of help you as you're picking out your columns, maybe in what you're trying to run with platforms and so forth. 
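The pass-through chain and the hover tip can be sketched as a lookup that walks from table to table until it finds the one that owns the column. This is only an illustration in Python under simplifying assumptions: the table layout (a dict of rows keyed by employee ID plus a `link` to the next table), the values, and the `resolve` helper are all hypothetical and say nothing about how JMP stores links internally.

```python
# Hypothetical miniature versions of the three chained tables.
# Each table maps employee ID -> row, and "link" points at the next table in the chain.
predicted_termination = {
    "name": "Predicted Termination",
    "rows": {1: {"Disability Flag": "N"}, 2: {"Disability Flag": "Y"}},
    "link": None,
}
education_history = {
    "name": "Education History",
    "rows": {1: {"Degree Program Code": "MBA"}, 2: {"Degree Program Code": "BS"}},
    "link": predicted_termination,
}
employee_master = {
    "name": "Employee Master",
    "rows": {1: {"Employee Level": "Professional"}, 2: {"Employee Level": "Manager"}},
    "link": education_history,
}


def resolve(table, employee_id, column):
    """Follow the links until a table that owns `column` is found.

    Returns (value, name of the table the value came from).
    """
    current = table
    while current is not None:
        row = current["rows"].get(employee_id, {})
        if column in row:
            return row[column], current["name"]
        current = current["link"]
    raise KeyError(column)


print(resolve(employee_master, 2, "Disability Flag"))
print(resolve(employee_master, 1, "Degree Program Code"))
```

The second element of the returned pair plays the role of the hover tip: it names the table a column was actually read from, even when that table is two links away.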
But if I run this distribution, this is just a simple distribution example that's showing that employee level is actually coming from my employee master table. This degree description is coming from the education history table and this performance eval is coming from my predicted termination table. And then you can look some more with some of these other examples that are in here. I did build a context window of dashboards here that shows a Graph Builder with a box plot. We have a distribution in here, a tabulate and a heat map, using virtual columns and, you know, some columns that are from the table, and it's got a filter. So if I want to look at females and look at professionals. I always like to point out the oddities here. So if I go in here and look at these two little places that are kind of hanging out here. This is very interesting to me because comp ratio shows how people are paid. Basically, whether they're paid in the right ratio or not for their job description. And it looks like these two outliers are consistently exceeding expectations, so it looks like they're maybe underpaid. And this one up here is all by itself, and it looks like they seldom meet their expectations, but they may be slightly overpaid, or they could be mistakes. But at any rate, as you zero in on those, you can also see that the selections are being made here. So, in this heat map, I can tell that there is some performance money that's being spent, and training dollars, so maybe they're training that person. So that's actually good to see. So that is about all I wanted to show. I did want to show this one thing, just to reiterate. Education history has access to the columns that are in predicted termination. And so those two tables can talk to each other separately. 
And if I run this graph script, I have similar performance and training dollars, but I'm looking at things like grade point average and class rank, as to where people fall into the limits here, using combinations of columns from just those two tables. So I'm going to pass this on. I believe that was the majority of what I wanted to share. I'm going to stop sharing my screen. And I will pass this back to Kelci and she will take it from here.
Kelci J. Miclaus Thanks, Mandy. As Mandy said, we've given this talk a couple of times now, and really it was this combined effort of me working in my group, which is life sciences for the JMP Clinical and JMP Genomics vertical solutions, and finding such perfect examples of where I could really leverage virtual joins, and working closely with the development team on how those features were released in the last few versions of JMP. And so for this section I will go through some of the examples specific to our clinical research and how we've really leveraged this talking table idea around row state synchronization. So as Mandy mentioned, and as we'll see if we have time towards the end, this idea of virtual joins with row state synchronization is now the entire architecture that drives how JMP Clinical reports and reviews are used with our customers for assessing early efficacy and safety in clinical trials. And one of the reasons it fits so well is because of the formatting of typical clinical trial data. So the data that I'm going to use for all of the examples I have around row state synchronization, or row state propagation as I sometimes call it, are example data from a clinical trial that has about 900 patients. It was a real clinical trial carried out about 20-30 years ago looking at subarachnoid hemorrhage and treatment with nicardipine in these patients. 
The great thing about clinical data is we work with very standard normalized data structures, meaning that each component of a clinical trial that is collected, similar to the HR data that Mandy showed us, is normalized, so that each table has its own content and we can track that separately, but then use virtual joins to create comprehensive stories. So the three data sets I'll walk through start with this demography table, which has a little under 900 patients, where we have one row per patient in our clinical trial. And this is called the demography table, and it will have information about their birth, age, sex, race, what treatment they were given, and any flags of occurrences that happened to them during the clinical trial. Similarly, we can have separate tables. So in a clinical trial, they're typically collecting at each visit what adverse events have happened to a patient while on a new drug or study. And so this is a table that has about 5,000 records. We still have this unique subject identifier, but we have duplications, of course. So this records every adverse event that was reported for each of the patients in our clinical trial. And finally I'll also use a laboratory data set, or labs data set, which follows a similar stacked record format to the one we saw for the adverse events. Here we're thinking of the regular visits, where they take several laboratory measurements, and we can track those across the course of the clinical trial to look for abnormalities and things like that. So these three tables are in a very standard normalized format, which is called the international CDISC standard for clinical trial data. And it suits us so well for using the virtual join. As Mandy has said, it is easy to, you know, create a merged table of labs. But here we have 6,000 records of labs, and merging in our demography would cause a duplication of all of the single instances of their demographic descriptions. 
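The duplication just described is easy to quantify on a toy example. The sketch below is an illustration in Python with invented subject IDs, column names and values (it is not CDISC data and not how JMP merges tables): it physically merges a two-subject demography table into five lab records and counts how many demographic values end up stored, compared with keeping them once in the ID table.

```python
# Hypothetical rows; subject IDs, column names and values are invented for illustration.
demography = [
    {"USUBJID": "S1", "ARM": "Placebo", "AGE": 54, "SEX": "F"},
    {"USUBJID": "S2", "ARM": "Nicardipine", "AGE": 61, "SEX": "M"},
]
labs = [
    {"USUBJID": "S1", "TEST": "ALT", "VISIT": 1},
    {"USUBJID": "S1", "TEST": "ALT", "VISIT": 2},
    {"USUBJID": "S1", "TEST": "AST", "VISIT": 1},
    {"USUBJID": "S2", "TEST": "ALT", "VISIT": 1},
    {"USUBJID": "S2", "TEST": "AST", "VISIT": 1},
]

demo_by_id = {row["USUBJID"]: row for row in demography}
demo_columns = [c for c in demography[0] if c != "USUBJID"]

# Physical merge: every lab record carries its own copy of the demographic values.
merged = [
    {**lab, **{c: demo_by_id[lab["USUBJID"]][c] for c in demo_columns}}
    for lab in labs
]

copied_values = len(merged) * len(demo_columns)    # demographic values stored after merging
stored_once = len(demography) * len(demo_columns)  # demographic values stored in the ID table alone
print(copied_values, stored_once)
```

With roughly 6,000 lab records against fewer than 900 subjects, the same arithmetic is what makes a virtual join attractive: the demographic values stay in one place and are looked up by subject identifier.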
And so we want to set up a virtual join with this, which we can do really easily. In our demography table, we're going to set up unique subject identifier as our link ID. And then very quickly, because we typically would want to look at laboratory results and use something like the treatment group they are on to see if there are differences in the laboratory results, we can now reference that data and create visualizations or reports that will actually assess and look at treatment group differences in our laboratory results. And so we didn't have to make the merge. We just gained access to this planned arm column from our demography table through that simple two-step setup of the column properties of a virtual join. It's also very easy to then look at things like lab abnormalities. So here's a plot, by each of the different arms or treatment groups, of who had abnormally high lab tests across visits in a clinical trial. We might also want to do this same type of analysis with our adverse events, where we would also want to see if there are different occurrences of the adverse events between those treatment groups. So once again we can also link this table to our referenced demography and very quickly create counts of the distribution of adverse events that occur separately for, say, nicardipine, the active treatment, versus placebo. So now we want them to really talk. And so the next two examples that I want to show with these data are the row state synchronization options we have. So you quickly saw from Mandy's portion that on the column properties we have the ability to synchronize row states now between tables. Which is really why our talk is called talking tables, because that's the way they can communicate now. And you can dispatch row states, meaning the table where you've set up the reference to some link ID can send information from that table back to its reference ID table. 
And I'll walk through a quick example, but as Mandy mentioned, this is by far the more dangerous case sometimes, because it's very easy to hit times when you might get inconclusive results. But I'm going to show a case where it works and where it's useful. As you've noticed just from this analysis, say with the adverse events, it was very easy for the table where we set up a link reference to the ID table to gain access to the columns and look at the differences of the treatment groups in this table. There's not really anything that goes the other way though. As Mandy had said, you wouldn't want to use this new joined table to look at a distribution of, say, that treatment group, because what you actually have here is numbers that don't match. It looks like there are 5,000 subjects when really, if you go back to our demography table, we have fewer than 900. So here's that true distribution of about 900 subjects by treatment group with all their other distributions. Now, there are times, though, when this table is what you want to use as your analysis table, or the goal of where you're going to create an analysis. And you want to gain information from those tables that are virtually linked to it. The laboratory data, for example, and the adverse events. So here we're going to actually use this table to create a visualization that will annotate these subjects in this table with anyone who had an abnormal lab test or a serious adverse event. And now I've cheated, because I've prepared this data. You'll notice in my adverse events data I've already done the analysis to find any adverse events that were considered serious, and I've used the row state marker to annotate those records that were a serious adverse event. Similarly, in the labs data set, I've used red color to annotate any of the lab results that were abnormally high. So for example, we can see all of those that had high abnormalities. 
I've colored red most of this just through row state selection and then controlling the row states. So with this data where I have these two row states in place, we can go back to our demography table and create a view that is a distribution by site of the ages of our patients in a clinical trial. And now if we go back to each of the linked tables, we can control bringing in this annotated information with row state synchronization. So we're going to change this option here, row states with reference table, from none to dispatch, and in this case I want to be careful. The only thing I want this table to tell that link reference table is the marker set. I'm going to click Apply. And you'll notice automatically my visualization that I created off that demography table now has the markers of any subjects who had experienced a serious adverse event from that other table. We can do the same now with labs. Choose to dispatch. In this case, we only want to dispatch color. And now, just by controlling column properties, we're at a place where we have a visualization or an analysis built off our demography table that has gained access to the information from these virtually joined tables using the dispatch row state synchronization or propagation. So that's really cool. I think it's a really powerful feature. But there are a lot of gotchas and things you should be careful about with the dispatch option. Namely, the entire way virtual joins work is that the link ID table, the data table you set up, in this case demography, has one row per ID, and you're using that to merge or virtually join into a data table that has many copies of that usage ID. So we're making a one-to-many; that's fine. Dispatch makes a many-to-one conversation. So in the resource provided with this video, there's a lot of commentary about carefully using this. It shouldn't be something that's highly interactive. 
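To see why dispatch is a many-to-one conversation, it can help to write the collapse out by hand. The following Python sketch is an assumption-laden illustration, not JMP's actual resolution rule: the `dispatch` helper, the row layout and the "first value wins, report disagreements" policy are all hypothetical. It sends one kind of row state from a many-row table back to one value per subject and flags subjects whose rows disagree.

```python
def dispatch(ref_rows, key, state):
    """Collapse one row state from many referencing rows onto one value per ID.

    Returns (state per ID, set of IDs whose rows carried conflicting values).
    The first value seen wins; this policy is an assumption for the sketch.
    """
    result = {}
    conflicts = set()
    for row in ref_rows:
        value = row.get(state)
        if value is None:
            continue  # this row has nothing to say about the state
        subject = row[key]
        if subject in result and result[subject] != value:
            conflicts.add(subject)
        result.setdefault(subject, value)
    return result, conflicts


# Hypothetical records: a marker flags a serious adverse event, red flags a high lab value.
adverse_events = [
    {"USUBJID": "S1", "marker": "x"},
    {"USUBJID": "S1", "marker": None},
    {"USUBJID": "S2", "marker": None},
]
labs = [
    {"USUBJID": "S1", "color": None},
    {"USUBJID": "S2", "color": "red"},
    {"USUBJID": "S2", "color": "red"},
]

markers, marker_conflicts = dispatch(adverse_events, "USUBJID", "marker")
colors, color_conflicts = dispatch(labs, "USUBJID", "color")
print(markers, colors)

# If a later edit colors another S2 lab row blue, the many-to-one collapse becomes ambiguous.
_, ambiguous = dispatch(labs + [{"USUBJID": "S2", "color": "blue"}], "USUBJID", "color")
print(ambiguous)
```

Dispatching only markers from one table and only colors from the other, as in the demonstration, keeps the two conversations from colliding; the last call shows how a single extra row state change is enough to create a disagreement that has no obvious answer.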
If you then decide to change row states, it can be very easy for this to get confusing or nonsensical. Say I've marked both with color and marker: it wouldn't know what to do, because some rows might be saying, "Color this red," but the other linked table might be saying color it blue or black. So you have to be very careful about not mixing and matching and not being too interactive with that many-to-one merge idea. But in this example, this was a really, really valuable tool for something that would have required quite a lot of data manipulation to get to this point. So I'm going to close down these examples of the dispatch virtual join and move on to what's likely going to be more commonly used, which is the accept row state option of the virtual join talking tables. And for this case, I'm actually going to go through this with a script. So instead of interactively walking through the virtual join and row state column properties, we're going to look at the scripting results of that. And in the example here, what we wanted to do is be able to use these three tables (again, the demography, adverse events and laboratory data in a clinical trial) to really create what they call a comprehensive safety profile. And this is really the justification and rationale of our use in JMP Clinical for our customers. This idea that we want to be able to take these data sets, keep them separate, but allow them to be used in a comprehensive single analysis so they don't feel separate. So with this example, we want to be able to open up our demography and set it up as a link ID. So this is similar to what I just did interactively; it will open the demography table and create the link ID column property on unique subject identifier. So we're done there. You see the key there that shows that that's now the link ID property. We then want to open up the labs data set. 
And we're going to set a property on the unique subject identifier in that table to use the link reference to the demography table. And there are a couple of options here. We want to show that property of using shorter names: use the linked column name to shorten the names of our columns coming from the demography table into the labs table. And here we want to set up row state synchronization as an acceptance of select, exclude and hide. And we're going to do this also for the AE table. So I'll run both of these next snippets of code, which will open up my AE and my lab table. And now you'll see that instead of dispatch, the properties here are set to accept, with select, exclude and hide. And similarly the adverse events table has the exact same acceptance. So in this case now, instead of dispatch, where we were very careful to only dispatch one type of row state from one table and another from another table back to our link ID reference table, here we're going to let our link ID reference table, demography, broadcast what happens to it to the other two tables. And that's what accept does. So it's going to accept row states from the demography table. And I've cheated a little bit in that I actually just have a script attached to our demography table here that is really just setting up, in a single window, some of the visualizations that I've already shown, which are scripts attached to each of the tables. And so here we have what you could consider your safety profile. We have distributions of the patient demographic information. So this is sourced from the demography table. You see the correct numbers of the counts of the 443 patients on placebo versus the 427 on nicardipine.
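Accept runs in the safe direction, one-to-many, and that can be sketched in a few lines. As with the earlier illustrations, this is a hypothetical Python model rather than JMP's implementation: the `accept` helper, the state names and the data are invented. It copies the select, exclude and hide states of each subject in the ID table down to every row that references that subject.

```python
def accept(id_row_states, ref_rows, key, accepted=("selected", "excluded", "hidden")):
    """Broadcast the accepted row states of each ID to all rows that reference it."""
    out = []
    for row in ref_rows:
        states = id_row_states.get(row[key], {})
        # Every referencing row receives the same states as its single ID row.
        out.append({**row, **{s: states.get(s, False) for s in accepted}})
    return out


# Hypothetical row states set on the demography (link ID) table, for example by a data filter.
demography_states = {
    "S1": {"selected": True},
    "S2": {"excluded": True, "hidden": True},
}
adverse_events = [
    {"USUBJID": "S1", "TERM": "Headache"},
    {"USUBJID": "S1", "TERM": "Nausea"},
    {"USUBJID": "S2", "TERM": "Fever"},
]

synced = accept(demography_states, adverse_events, "USUBJID")
for row in synced:
    print(row)
```

Because each referencing row has exactly one parent row in the ID table, there is never a disagreement to resolve, which is why a filter on the demography table can drive every linked report at once.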
Philip Ramsey, Senior Data Scientist and Statistical Consultant/Professor, North Haven Group and University of New Hampshire Tiffany D. Rau, Ph.D., Owner and Chief Consultant, Rau Consulting, LLC   Quality by Design (QbD) is a design and development strategy where one designs quality into the product from the beginning instead of attempting to test-in quality after the fact. QbD initiatives are primarily associated with bio-pharmaceuticals, but contain concepts that are universal and applicable to many industries. A key element of QbD for bio-process development is that processes must be fully characterized and optimized to ensure consistent high quality manufacturing and products for patients. Characterization is typically accomplished by using response surface type experimental designs combined with the full quadratic model (FQM) as a basis for building predictive models. Since its publication by Box (1950) the FQM is commonly used for process characterization and optimization. As a second order approximation to an unknown response surface, the FQM is adequate for optimization. Cornell and Montgomery (1996) showed that the FQM is generally inadequate for characterization of the entire design space, as QbD requires, given the inherent nonlinear behavior of biological systems. They proposed augmenting the FQM with higher order interaction terms to better approximate the full design regions. Unfortunately, the number of additional terms is large and often not estimable by traditional regression methods. We show that the fractionally weighted bootstrapping method of Gotwalt and Ramsey (2017) allows the estimation of these fully augmented FQMs. Using two bio-process development case studies we demonstrate that the augmented FQM models substantially outperform the traditional FQM in characterizing the full design space. The use of augmented FQMs and FWB will be thoroughly demonstrated using JMP Pro 15.     Auto-generated transcript...   
Speaker Transcript Tiffany First, thanks for joining us today. We're going to be talking about characterizing different bioprocesses, really focusing on pDNA, and seeing how fractionally weighted bootstrapping can really add to your processes. So Phil will be joining me in the second half of the presentation to go through the JMP example, as well as to give some different techniques to be used. I'm going to be talking about the CMC strategy, biotech, how do we get a drug to market and how can we use new tools like FWB in order to deliver our processes. So let's get started. So the chemistry manufacturing control journey. It's a very long journey and we'll be discussing that. Why DOE, why predictive modeling? It is very complex to get a drug to market. It's not just about the experiments, but it's also about the clinic, and having everything go together. So we'll look at systems thinking approaches as well. And then, of course, characterizing that bioprocessing, and then the case study that Phil will discuss. So what does the CMC pathway look like? This is a general example and we go from toxicology. So that's to see, does it have any efficacy? Does it work in a nonhuman trial? All the way up to commercial manufacturing, and there are a lot of things that go into this, for example, process development. But you also have to have a target product profile. What do you want the medication that you're developing to actually do? And it's important to understand that as you're going through. And then of course the quality target product profile as well. This is, what are the aspects that are necessary in order for the molecule to work as prescribed? And then we look at critical quality attributes and then go through the process, so it's a group, because of the fact that you have your process development, process characterization, process qualification and the BLA. There's a huge amount of work that goes into each one of these steps. 
And we also want to make sure that as you're going through your different phases that you're actually building these data groupings. Because when you get to process characterization and process qualification, you want to make sure that you can leverage as much of your past information as you can, so that you actually shorten your timelines. You might say, "Tiffany, I do process characterization, all the way through the process." And I'll say, "Absolutely." But the process characterization that's specific to the CMC pathway is what we need from a regulatory point of view. So everyone has probably heard in the news, you know, vaccines and cell and gene therapies. It's a very hot subject right now, and it's also bringing new treatments to patients that we've never been able to treat before. And so in the two big groupings of cell therapies and gene therapies, there's different aspects for it. Right. So we have immunotherapies. We have stem cells. We're doing regenerative medicine. So just imagine, you know, having damage in your back and being able to regenerate. There's huge emphasis on this grouping. But there's also a huge emphasis on gene therapies. Viral vectors. Do we use bacterial vectors? How do we get the DNA into the system in order to be a treatment for the patients? Well, plasmid DNA is one of those aspects and Phil has an amazing case study where they did an optimization. So you might say, well, "What is pDNA? And why is it important, other than, okay, it's part of the gene therapy space, which is very interesting right now?" Well, the fact is that it can be used in preventative vaccines, immunization agents, for preparation of hyperimmune globulin, cancer vaccines, therapeutic vaccines, doing gene replacements. Or maybe you have a child, right, that has a rare gene mutation. Can we go in and make those repairs? 
All these things are around this and as the gene therapy technology continues to grow, the regulations continue to increase as you move through the pathway closer to commercialization and the amount of data also increases. Just imagine gene therapies and cell therapies are where we were 20 plus years ago with the ??? cell culture and monoclonal antibodies. It's an amazing world where we're learning new things every day and we go, "Oh, this isn't platform." We need new equipment. We need new ways of working. We need to be able to analyze data sets that are very, very small because in cell therapies and gene therapies, the number of patients is typically smaller than in other indications. So what's next for pDNA? Well, of course, as the cell therapy and gene therapy market continues to grow, we're going to continue going on this pathway into commercialization. We need to be able to work with the FDA and work with them, hand in hand, because these are things that we've never done before. We're using raw materials that we don't use in other indications for medication. So there's a lot of things to be done. It's also critical to be able to make these products, like the pDNA, so that we get the vector in the appropriate volume, but also with the appropriate quality. So if you have the best medication in the world, but you're not able to make it, then you don't have the medication. Right? You can't deliver it to the patients. So we also need to make sure that our process is well characterized. And as I mentioned earlier, many of these indications are very small. So the clinical trials are also small. And at the same time the patients are often very, very sick. So being able to analyze our data and also respond to their needs very quickly is very key, both in the clinical aspect as well as when we become commercialized. We don't want to have this situation where, guess what, I can't make the drug, right? I want to be able to make it. 
Also manufacturing is a very important thing. So I don't know if you've noticed in the news, there's been a lot of announcements of expansions. Of course, people are expanding capacity for vaccines, but also one of the big moves is pDNA. People are spending millions of dollars, sometimes billions of dollars in increasing those manufacturing sites. And you might say, well, okay, you increase your manufacturing site. That's great. But now I need to be able to tech transfer into that manufacturing site. I need to make sure my process is robust, that it not only can be transferred and scaled up, but also that I have the statistical power to say I know that my process is in control. I might have a 2% variability but I always have a 2% variability and I have it characterized, for instance. And as more and more capacity comes online and as we also have shortages, it's like, where do I bring my product? Taking those into consideration means designing for manufacturing earlier. And you could have multiple products in your pipeline. So you want to make sure that you're learning and able to go and grab that information and say, let me do some predictive modeling on this, it might not be the exact product, but it has similar attributes. So with that, the path to commercialization is very integrated, just like the CMC strategy takes the clinical aspect, everything comes together in order to progress a molecule through. We also have to think about the systems aspect of it. Why? Because if we do something in the upstream space we might increase productivity to 200%, let's say, and we'd be going, "Yes, I've made my milestone. I can deliver to my patient." But if my downstream or my cell recovery can't actually recover the product, whether that is a cell or a protein therapeutic for instance, then we don't have a product. All of that work is somewhat thrown out the door. 
So having the systems approach, making sure you involve all the different groups from business, supply chain, QC, discovery...everyone has knowledge that they bring to the table in order to deliver to the patient in the end, which is very key. So I'm going to hand it over to Phil now. I would have loved to have spoken a lot more about how we develop drugs, but let's...let's see how we can analyze some of our data. So, Phil, I'll hand it over to you now. Philip Ramsey Okay, so thank you, Tiffany, for that discussion to set the stage for what is going to be a case study. I'm going to spend most of the time in JMP demonstrating some of the important tools that exist in JMP, tools that you may not even know are there and that are actually very important to process development, especially in the context of the CMC pathway, chemistry, manufacturing, and control, and quality by design. And two important characteristics of process development, and this is in general, are that you want to design a process, but you also need to characterize it. In fact, you have to characterize the entire operating region. And of course, we want to optimize so that we have a highly desirable production. What we often don't talk about enough is that these activities are inherently about prediction. We have to build powerful predictive models that allow us to predict future performance. That's a very important part, especially in late stage development, for regulatory agencies. You have to demonstrate that you can reliably produce a product. Well, a key paper on this issue of process characterization and prediction was a very famous paper by George Box and his cohort Wilson, who was an engineer. And in that they talked about what is the beginnings of what people know today as response surface. And the key to their work was something they called the full quadratic model. Well, what is that? Well, that's a model that contains the main effects, all two-way interactions and quadratic effects. 
And this is still probably the gold standard for building process models, especially for production. But what people may not realize is that they're good for optimization; they're good second-order approximations to these unknown response functions. What is not as well understood is that, over the entire design region, they often are a poor approximation to the response surface. And in 1996 the late John Cornell and, as of course many people know, Doug Montgomery published a paper that is really underappreciated. And in that paper they raised the fact that the full quadratic model often is inadequate to characterize a design space. Think about it from the viewpoint of a scientist and think how dynamic these biochemical processes often are. In other words, there's a great deal of nonlinearity that leads to response surfaces with pronounced compound curvature in different regions. And the full quadratic model simply can't deal with it. So what they proposed was augmenting that model, and they added things like quadratic by linear, linear by quadratic and even quadratic by quadratic interactions. It turns out these models do approximate design regions better than full quadratic models. I'm going to demonstrate that to you in a moment. But there was a problem for them. Number one, traditional statisticians didn't like the approach; that's changing dramatically these days. But there are a lot of these terms that can be added to a model, such that even a big central composite design becomes supersaturated. What does that mean? It means there are more unknowns, p, than there are observations to fit the models. It turns out that it's not really a constraint these days in the era of machine learning and new techniques for predictive modeling. So what we're going to do is, we're going to use something called fractionally weighted bootstrapping. This can be done in JMP Pro. 
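To make the term counts behind "supersaturated" concrete, here is a minimal sketch that counts the terms in a full quadratic model and in the augmentations described above. The factor count k = 5 and the run count of 15 are assumptions chosen for illustration, and the function name is hypothetical.

```python
from math import comb

def term_counts(k):
    """Count model terms for k continuous factors."""
    pairs = comb(k, 2)
    fqm = 1 + k + pairs + k      # intercept, main effects, two-way interactions, quadratics
    quad_by_lin = 2 * pairs      # x_i^2 * x_j and x_i * x_j^2 for each pair of factors
    quad_by_quad = pairs         # x_i^2 * x_j^2 for each pair of factors
    return fqm, quad_by_lin, quad_by_quad

k, n_runs = 5, 15                # illustrative assumptions, not taken from a specific design
fqm, qxl, qxq = term_counts(k)
print(fqm, fqm + qxl, fqm + qxl + qxq)   # 21 41 51
print(fqm + qxl > n_runs)                # True: more terms than runs (supersaturated)
```

With five factors, the full quadratic model alone already has more terms than a 15-run design has observations, and each layer of augmentation widens that gap, which is why ordinary least squares cannot fit these models directly.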
We're also going to use something called model averaging to build models to predict response surfaces, and I am actually going to use these large augmented models. Okay, so when you try to build these predictive models, say for quality by design, there are a number of things you have to be aware of. One, again in 1996, one of the pioneers in machine learning, the late Leo Breiman, wrote a paper that again is not nearly as appreciated as it needs to be. And he pointed out that all these model building algorithms we use for prediction (and that includes forward selection, all possible models, best subsets, lasso) are inherently unstable. What does that mean? Being unstable means small perturbations in the data can result in wildly varying models. So he did some work to point this out and he suggested a strategy. He said, "Well, if you could, in some way, simulate model fitting and somehow perturb the data on each simulation run, we could fit a large number of models and then average them." And he showed that that had potential. He didn't have a lot of tools in that era to do it. But today I'm going to show you that in JMP Pro we have a lot of tools, and we're going to show you that Breiman's idea is actually a very good one. It is now, one way or the other, commonly accepted in machine learning and deep learning; that is the idea of ensemble modeling and model averaging. By the way, I'll quickly point out that years ago, in the stepwise platform of JMP, John Sall instituted a form of model averaging. It's a hidden gem in JMP. It works nicely and is available in both versions of JMP, but I'm going to offer a more comprehensive solution that can be done in JMP Pro. And this solution is referred to as fractionally weighted bootstrapping with auto validation, and I'm going to explain what that means. When we build predictive models, we have a challenge. We need a training set to fit the model, then we need an additional or validation set of data to test the model to see how well it's going to predict. 
Well, DOEs simply don't have these additional trials available. In fact, Breiman was stuck on this point. There's no way to really generate a validation error. Well, in 2017 at Discovery Frankfurt, Chris Gotwalt, head of statistical research for JMP, and myself presented a talk on what we called fractionally weighted bootstrapping and auto validation. What does auto validation mean? It means, and this will not seem intuitive, we're going to use the training set also as a validation set. You say, "Well, that's crazy. It's the same data." But there's a secret sauce to this technique that makes it work. What we do is, we take the original data, copy it, call it the auto validation set, and then we, in a special way, assign random weights to the observations, and we do the weighting such that we drive anticorrelation between the training set and the auto validation set. And I'm going to illustrate this to you very shortly. Okay. And by the way, we have been supervising my PhD student Trent Lemkus, who has studied this method extensively in exhaustive simulations over the last year. And we will be publishing a paper to show that this method actually yields superior results to classical approaches to building predictive models from DOE. So I'm just going to move ahead here and talk about the case study. And this is what Tiffany mentioned: pDNA. It's a really hot topic, and pDNA manufacture is considered a big growth area in the biotech world. Expect big growth, maybe even 40% year over year, because of all the new therapies coming online where it'll be used. And in this case, and this is very common in the biotech world, there's not really any existing data we can use to build predictive models. So that leads us quite rightly to design of experiments. And in this case, we're going to use a definitive screening design. These are wonderful inventions of Brad Jones from JMP and Chris Nachtsheim from the University of Minnesota. 
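The anticorrelated weighting used for auto validation can be illustrated with a small sketch. The transcript does not give the weight formula, so this assumes exponential weights, -ln(U) for the training copy and -ln(1 - U) for the auto-validation copy, drawn from a shared uniform value U per observation. That is one way to obtain positive, anticorrelated weights; the function names are hypothetical.

```python
import math
import random

def fwb_weight_pairs(n, rng):
    """Return (training, auto-validation) weight lists for n observations."""
    train, valid = [], []
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep u strictly inside (0, 1)
        train.append(-math.log(u))        # large when u is near 0
        valid.append(-math.log(1 - u))    # large when u is near 1
    return train, valid

def correlation(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(2017)
wt, wv = fwb_weight_pairs(10_000, rng)
# The correlation is negative: a row weighted heavily for training
# tends to be weighted lightly for validation, and vice versa.
print(round(correlation(wt, wv), 2))
```

Because every observation keeps a positive weight in both copies, no run of a small designed experiment is ever dropped entirely, which is the practical difference from holding rows out.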
DSDs are highly efficient, and I recommend them all the time to people in the biotech world, where you have limited time and resources for experimentation. So basically I'm just showing you a schematic of what a bioprocess looks like. And we're going to focus on the fermentation step. But in practice, as Tiffany was alluding to, we would look through both the upstream and downstream aspects of this process. pH, percent dissolved oxygen, induction temperature. That's the temperature we set the fermenter at to get the genetically modified E. coli to start pumping out plasmids. And what are plasmids? Well, they're really nonchromosomal DNA. And they have a lot of uses in therapies, especially gene therapies, and they're separate from the usual chromosomal DNA that you would find in the bacteria. So our goal is to get these modified E. coli to pump out as much pDNA as possible. So we did the trial. This is an actual experiment. And because we were new to DSDs, we also ran a much larger traditional central composite design. And we did this separately. And what I plan to do for today's work is use the CCD as a validation set, and we're going to fit models using auto validation on the DSD. We'll see how it goes. Okay, so I'm going to now just switch over to JMP. And I'm going to open a data table. Here's the DSD data. We're going to do all our modeling on this data set. And oh, by the way, I am going to fit a 40 predictor model to a 15 run design using machine learning techniques. And many people are going to have to get their heads around the fact that you can do these things, and they're actually being done all the time in machine learning and deep learning. So there are a couple of add-ins I want to show you that make this easy to do. You do need JMP Pro. One of them is an add-in that sets up the table for you. This is by Michael Anderson of JMP. So I'm just going to show you what happens. The add-in is available on JMP Communities. 
So notice it took the original data, created a validation set. And as I mentioned, we also have this weighting scheme and these weights are randomly generated, and as you'll see in a moment, we do a simulation study and we constantly change the weights on every run. And this has the effect of again generating thousands of iterations of modeling. And you'll also see, as Leo Breiman warned, as you perturb the responses (we don't change the data structure), you see wild variation in the models. So I'm going to go ahead and just illustrate this for you very quickly. So I'm going to go to Fit Model. And we have to tell JMP where the weights are stored. We're going to use generalized regression, highly recommended for this. And because this is a quick demo, I'm going to use forward selection, but FWB with auto validation is a very general procedure. You can use it in many, many different prediction or modeling scenarios. I'm going to do forward selection. Okay, so I fit one model. And then I come down to the table of estimates. I'm going to right click and select Simulate. And I tell it that I want to do some number of simulation runs, and on each trial I want to swap out the weights. I want to generate new weights. And by the way, I'll just do 10 because this is a demo. So there are the results. And you can see we have 10 models and all of them are quite different. So again, in practice, I would do thousands of these iterations. And then, as I'm going to show you later, we can take these coefficients and average them together. And by the way, if you see zero, that means that term did not get into a model. Okay, so what I'm going to do now is show you another add-in. So I'm going to close some of this, so we keep the screen uncluttered. There's another add-in that we've developed at Predictum. And this one not only does the fractionally weighted bootstrapping, but it also does the model averaging. 
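The loop just described (draw new weights, refit, record coefficients with zeros for terms that did not enter, then average) can be sketched outside JMP. This is a simplified stand-in, not the add-in's or Generalized Regression's actual algorithm: it assumes exponential anticorrelated weights, a forward selection that adds whichever term most reduces the validation-weighted squared error, and hypothetical simulated data with two factors.

```python
import math
import random

def wls(X, y, w):
    """Weighted least squares: solve (X'WX) b = X'Wy by Gauss-Jordan elimination."""
    n, p = len(y), len(X[0])
    A = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(w[i] * X[i][a] * y[i] for i in range(n))] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[c][p] / A[c][c] for c in range(p)]

def forward_select(cols, y, wt, wv):
    """Add terms while the validation-weighted error keeps dropping; unselected terms get 0."""
    n = len(y)

    def fit(terms):
        X = [[cols[j][i] for j in terms] for i in range(n)]
        b = wls(X, y, wt)                               # fit with training weights
        err = sum(wv[i] * (y[i] - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2
                  for i in range(n))                    # score with validation weights
        return err, b

    active = [0]                                        # column 0 is the intercept
    best_err, best_b = fit(active)
    while len(active) < len(cols):
        err, b, j = min((fit(active + [j]) + (j,)
                         for j in range(len(cols)) if j not in active),
                        key=lambda t: t[0])
        if err >= best_err:
            break
        active, best_err, best_b = active + [j], err, b
    coef = [0.0] * len(cols)
    for j, bj in zip(active, best_b):
        coef[j] = bj
    return coef

def fwb_model_average(cols, y, n_iter, rng):
    """Refit with fresh anticorrelated weights each iteration, then average coefficients."""
    n, total = len(y), [0.0] * len(cols)
    for _ in range(n_iter):
        u = [min(max(rng.random(), 1e-12), 1 - 1e-12) for _ in range(n)]
        wt = [-math.log(ui) for ui in u]                # training weights
        wv = [-math.log(1 - ui) for ui in u]            # auto-validation weights
        coef = forward_select(cols, y, wt, wv)
        total = [t + c for t, c in zip(total, coef)]
    return [t / n_iter for t in total]

# Hypothetical data: 12 runs, two factors, true model y = 10 + 3*x1 - 2*x1*x2 + noise
rng = random.Random(1)
x1 = [rng.uniform(-1, 1) for _ in range(12)]
x2 = [rng.uniform(-1, 1) for _ in range(12)]
y = [10 + 3 * a - 2 * a * b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]
names = ["Intercept", "x1", "x2", "x1*x2", "x1^2", "x2^2"]
cols = [[1.0] * 12, x1, x2, [a * b for a, b in zip(x1, x2)],
        [a * a for a in x1], [b * b for b in x2]]
for name, c in zip(names, fwb_model_average(cols, y, 200, rng)):
    print(f"{name:10s} {c: .3f}")
```

Averaging over iterations shrinks the coefficients of terms that enter only occasionally toward zero, which is the stabilizing effect attributed to model averaging in the talk.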
In other words, with the add-in I just showed you, if you want to do model averaging, you're kind of on your own. Okay. It'll just be a lot of manual work. So I'm going to use the Predictum add-in. It creates the table and then I'm going to actually very quickly, just to define a model to illustrate how the add-in works, use a standard response surface model. We want to predict pDNA. And we're going to use gen reg. So again, as an illustration, I'm just going to go ahead and use forward selection. And again, I'd do thousands of iterations in practice, but I'm only going to do 10. Click Go. Okay. Philip Ramsey Again, this is quick; this is really, I do apologize, three talks conflated into one, but all the pieces fit together in the QbD framework. So I have a model. These are averaged coefficients. Again, I've only done 10. I'll save the prediction formula to the data table. And I'm going to try to keep the screen as uncluttered as possible. So there's my formula; the app did all the averaging for you, so you don't have to do it. And there's the formula. And here's a little trick you may not be aware of: this is a messy formula, especially if you want to deploy this formula to other data tables. In the Formula Editor, there's a really neat function called Simplify. See, it simplifies the equation and it makes it much more deployable to other data sets. Okay, so this was an illustration of the method. And what I'm going to do now is show what happened when we went through the entire procedure. So this is a data table. And here you'll notice the DSD and the CCD data combined together. And I've used the row state variable to exclude the DSD data because I want to focus on performance of my models on the validation data. Again, the models are fit to the DSD only. So here is my 41 term model. This is the augmented full quadratic done with model averaging over thousands of iterations. 
And for comparison I repeated the same process for the much smaller 21 term full quadratic model. So how did we do in terms of prediction? So let me show you a couple of actual by predicted plots. So remember, and I must strongly emphasize, this is a true validation test. The CCD was done separately. Different batches of raw material, including a new batch of the E. coli strain. Some of the fermenters were different, and there were completely different operators. So for those of you who work in biotech, you know, this is about as tough a prediction job as you're going to get. So again, the model was fit to the DSD, and on the left is the 41 term model, the augmented model, and the overall standard deviation of prediction error is about 67. On the right, where again I did use model averaging, which helps improve performance, I fit just the 21 term full quadratic model, and you can see the prediction error is about 70. In fact, without model averaging, which many people don't use with the full quadratic model, performance would be significantly worse. Okay. So then I have the model. What do I do with it? Well, our goal is typically optimization and characterization. So let me open up a profiler. I'll actually do this for you. So I'm going to go to the profiler in the Graph menu. I'm going to use my best model. And that's the one using the Predictum add-in. And by the way, if you're interested in this add-in, and even beta testing it, just contact Predictum, just send email to Wayne@predictum.com and I'm sure he'd be more than happy to talk to you. So I'm going to... I went to the model, and then using desirability, I'm just going to find settings that maximize production. And by the way, this is a major improvement over the production they were historically getting, and it gives us settings at which we should see the improved performance, and these were, by the way, somewhat unintuitive, but that's usually the case in complex systems. 
Things are never quite as intuitive as you think they are. And then there's also something really important, especially if you're doing late stage development in the CMC pathway. And that is, they want you to assess the importance of the inputs: which inputs are important? So I'm going to assess variable importance. Again, I won't get into all the technical details. So it goes through and it shows you, in terms of variation in the response, feed rate is by far the most important. That was not necessarily intuitive to people. And second is percent dissolved oxygen. So what does that tell you? Well, it tells you, number one, you better control these variables very well, or you're likely to have a lot of variation. Now, in this particular case, I don't have critical to quality attributes. There were none available. So what we have is a critical to business attribute, and that is pDNA production. But there's more we can do in JMP to fully characterize the design space. All I did was an optimization, but that's not characterization. So there's another wonderful tool in the profiler. Okay. It's called the simulator. And this is just not used as much as it should be. So what I've done is, I've defined distributions for the inputs. That is, I expect the inputs to vary. This is something the FDA wants to know about: what happens to the performance of your process as the inputs vary? There are no perfectly controlled processes, especially once you scale up. By the way, while I think of it, these more complex models, these augmented full quadratic models, from experience, I can tell you they scale up better than full quadratic models. That's another reason to fit these more complex models. So in the simulator, there's a nice tool called simulation experiment. And what that does is, it does what we call a space filling design. It distributes the points over the whole design region. So I'm going to just say I want to do 256 runs. 
It's going to do 5000 simulations at each point and calculate a mean, a standard deviation, and an overall defect rate. Right. So this actually goes pretty quickly. And I'm just showing you what the output looks like. And again, I've already done this. So, in the interest of time, I'm just going to open another data table. Minimize the other one. So these are the results of the simulation study. And I won't get into all the details, but I fit a model to the mean, I fit a model to the standard deviation, and I fit a model to the overall defect rate. And the defect rates in some areas are low, and in some of them are relatively high, and these are what we call Gaussian process models, which are commonly used with simulated data. So what can we do with these models and with these simulation results? Well, again, characterization is important. So let me just give you a quick idea. Here's a three dimensional scatter plot; we're looking at feed rate and percent DO, because they're really important. And the plotted points are weighted by defect rate; bigger spheres mean higher defect rates. So if you look around this, you can see there are some regions where we definitely do not want to operate. So we are characterizing our design spaces and finding safer regions to operate in. And of course, I could do this for some other variables and, in any case, it's just showing other regions you really want to avoid. And we can do more with this, but I think that makes the point. We can also go ahead and again use the profiler, and I'm going to re-optimize. But I'm going to do it in a different way. This way I want to maximize mean pDNA. And I want to do a dual response. And I want to minimize overall defect rate. So again, I'm going to go ahead and use desirability. This takes a few minutes. These are very complex models that we're optimizing. And notice, it comes up and says high feed rate, high DO, close to neutral pH and the induction. 
By the way, if you want to know what induction OD 600 is, that's a measure of microbial mass, and once you reach a certain mass (no one's quite sure what that is, so that's why we do the experiment) you then ramp up the temperature of the fermenter. And this actually forces the E. coli to start pumping out pDNA or plasmids, and they're engineered to do this. So we call that the induction temperature. Okay. Well, notice at these settings, we are guaranteed a low defect rate; the overall optimized response wasn't as high. But remember, we're also going to have a process less prone to generating defects. Okay, so at this point, I'll just quickly go to the end. The slide. So everything is in these slides. They've all been uploaded to JMP Communities. And at the end of this is an executive summary and basically what we're showing you is that process and product development using the CMC pathway (and a part of that is quality by design) requires a holistic or integrated approach. A lot of systems thinking needs to go into it. Process design and development is inherently a prediction problem, and that is the domain of machine learning and deep learning. It is not what you might think; it's not business as usual for building models in statistics, especially for prediction. We've shown you that fractionally weighted bootstrapping, auto validation, and model averaging can generate very effective and accurate predictive models. And again, I want to emphasize that these more complex augmented models of Cornell and Montgomery are actually quite important. They really do scale better and they do give you better characterization. And with that, I thank you and I will end my presentation.  
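Looking back at the simulator step in this talk, the idea of letting the inputs vary around a set point and recording a mean, a standard deviation, and a defect rate can be sketched in a few lines. Everything specific here is an assumption for illustration: the response function is a made-up stand-in for a saved prediction formula, and the input standard deviation, lower spec limit, and set points are hypothetical. Only input variation is simulated; no additional response noise is added.

```python
import random
import statistics

def predicted_pdna(feed_rate, pct_do):
    """Hypothetical stand-in for a fitted prediction formula (not the model from the talk)."""
    return 100 - 40 * (feed_rate - 0.8) ** 2 - 30 * (pct_do - 0.7) ** 2

def simulate_point(feed_set, do_set, n_sim, lower_spec, rng, input_sd=0.05):
    """Vary inputs around a set point; return mean, SD, and fraction below the spec limit."""
    ys = [predicted_pdna(rng.gauss(feed_set, input_sd), rng.gauss(do_set, input_sd))
          for _ in range(n_sim)]
    defects = sum(y < lower_spec for y in ys) / n_sim
    return statistics.mean(ys), statistics.stdev(ys), defects

rng = random.Random(42)
for feed_set, do_set in [(0.8, 0.7), (0.6, 0.5), (0.4, 0.3)]:
    mean, sd, defect_rate = simulate_point(feed_set, do_set, 5000, 95.0, rng)
    print(f"feed={feed_set} DO={do_set} mean={mean:.1f} sd={sd:.2f} defect rate={defect_rate:.3f}")
```

Repeating this at many set points spread over the design region, then modeling the three summaries, is the pattern the simulation experiment in the talk follows.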
Caleb King, Research Statistician Developer, JMP Division, SAS Institute Inc.   Invariably, any analyst who has been in the field long enough has heard the dreaded questions: “Is X-number of samples enough? How much data do I need for my experiment?” Ulterior motives aside, any investigation involving data must ultimately answer the question of “How many?” to avoid risking either insufficient data to detect a scientifically significant effect or having too much data leading to a waste of valuable resources. This can become particularly difficult when the underlying model is complex (e.g. longitudinal designs with hard-to-change factors, time-to-event response with censoring, binary responses with non-uniform test levels, etc.). In this talk, we will show how you can wield the "power" of one-click simulation in JMP Pro to perform power calculations in complex modeling situations. We will illustrate this technique using relevant applications across a wide range of fields.     Auto-generated transcript...   Speaker Transcript Caleb King Hello, my name is Caleb King. I'm a research statistician developer here at JMP for the design of experiments group.   And today I'll be talking to you about how you can use JMP to compute power calculations for complex modeling scenarios. So as a brief recap, power is the probability of detecting a scientifically significant difference that you think exists in the population.   And it's the probability of detecting that given the current amount of data that you've sampled from that population.   Now, most people, when they run a power calculation, they're usually doing it to determine the sample size for their study. There is, of course, a direct   tie between the two. The more samples you have, the greater chance you have of detecting that scientifically significant difference.   Of course, there are other factors that tie into that. There's the model that you're using, the response distribution type.   
And there's also, of course, the amount of noise and uncertainty present in the population, but for the most part people use power as a metric to determine sample size. Now, I'd say there are three stages   of power calculation and all of them are addressed in JMP, especially if you have JMP Pro, which is what I will be using here.   The first stage is some of those simpler modeling situations where we go here under the DOE menu under Design Diagnostics. We have the sample size and power calculators.   And these cover a wide range of very simple scenarios. So, if you're testing one or two sample means, you know, maybe an ANOVA type setting with multiple means,   proportions, standard deviations. Most of this is what people think of when they think of power calculations. So, of course, you go through and you specify again the noise,   error rates, any parameters, what difference am I trying to detect, and say, if I'm trying to reach a certain power, I can get the sample size.   Or, if I want to explore a bit more, I can leave both empty and I get a power curve. Now, of course, again, these are more of your simpler scenarios. The next stage, I would say, is what could be covered under a more general linear model, so I'll exit out of this.   In that case, we can go here under the all-encompassing custom design menu.   I'll put in my favorite number of effects.   I'll click continue.   And I'll leave everything here.   So we'll make the design.   And at this point I can do a power analysis based on the anticipated coefficients in the model. So in this case, it might say, for this particular design with 12 runs, I have roughly 80% power to detect this coefficient. If I was trying to detect, say, something a bit smaller,   I could change that value, apply the changes, and of course see that I don't have as much power. So if that's really what I'm looking for, I might need to make some changes. Maybe I need to go back and increase the run size.   
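The link between sample size and power described above can also be checked by brute force. A minimal sketch under stated assumptions: two groups with a known common standard deviation, a two-sided z-test at the 5% level, and illustrative values for the difference and sample sizes. None of this reproduces JMP's calculators; the function name is hypothetical.

```python
import math
import random

def simulated_power(n, delta, sigma, n_sim, rng, z_crit=1.959964):
    """Fraction of simulated two-sample studies whose z-test rejects 'no difference'."""
    se = sigma * math.sqrt(2.0 / n)          # standard error of the difference in means
    rejections = 0
    for _ in range(n_sim):
        mean_a = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        mean_b = sum(rng.gauss(delta, sigma) for _ in range(n)) / n
        if abs(mean_b - mean_a) / se > z_crit:
            rejections += 1
    return rejections / n_sim

rng = random.Random(7)
for n in (5, 10, 20, 40):
    # delta = 1 and sigma = 1 are illustrative choices
    print(n, simulated_power(n, delta=1.0, sigma=1.0, n_sim=2000, rng=rng))
```

The estimated power climbs with the per-group sample size, and setting delta to zero returns roughly the 5% false positive rate, which is a useful sanity check on any power simulation.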
So, those are the two most common settings where we might do a power calculation, but of course life isn't that simple. You know, you might run into more complex settings: you might have mixed effects factors, you might run into a longitudinal study that you have to compute power for.   You might run into settings where your response is no longer a normal random variable; you might have count data, you might have a binary response. You might even have sort of a bounded 0/1 type response. So a percentage type response.   So, what can you do if you can't go to the simple power calculators, and in the DOE menu it might be too complex for even this to run a power analysis? Well, JMP Pro's here to help, and it involves a tool that we call one click simulation.   So the idea here is, we'll simulate data sort of through a Monte Carlo simulation approach to try and estimate the power that you can get for your particular settings.   And it's pretty straightforward. There might be a little bit of work up front that you need to do, at least depending on the modeling platform.   But once you've got it down, it's pretty straightforward to do.   And I'll go ahead and say that this was something I didn't even know JMP could do until I started working here. So, I'm happy to share what I found with you.   Alright, so we'll start off with sort of a simpler extension of the standard linear model where we incorporate some mixed effects. Okay.   So we'll start: we have a company that's looking to improve their proton protein yield for cellular cultures. Not protons but proteins.   The process factors are temperature, time, and pH. We also have some mixture factors.   Water and two growth factors. Now, at this stage, if we stopped here, we probably would still be able to use the power calculator available in the custom design platform.   Where we start to deviate is now we introduce some random effect factors. We have three technicians, Bob, Di, and Stan, who are representative of the entire sample of technicians.   
And they will use at least one of three serum lots, which again represent all the serum lots they could use, so we treat them as random effects. We also have a random blocking effect: in this case, the test will be conducted over two days. So I'll show you how we can use one-click simulation in JMP Pro to compute power for this case. I'll click to open the design. So this is the design that I've created; let me expand my window here so you can see everything. This might represent what you typically get once you've created the design. At this point, you could have clicked Simulate Responses to simulate some responses, but even if you didn't, it's still OK. A trick that you can easily use to replicate that is to simply create a new column. We won't bother renaming it at this point; we're just going to create a simple formula. Go here to the left-hand side, click Random, Random Normal, leave everything at the default, click Apply, OK. And we've got ourselves some random noise data, some simulated response data. At this point, I'll right-click, Copy, and right-click, Paste, to fill my response column. All I need is some sort of response, so simple random noise will work fine here. We're not trying to analyze any data yet. What we want is to use the Fit Model platform to create a model for us, which we'll then use to create the simulation formula. The way we do that is we go under Model. Now, I've done a bit of work ahead of time here, so I've already created the model. Just to show you how I did that, I'll go here under Redo, Relaunch Analysis. Here you see I have my response, protein yield, and all my fixed effects, and I've got some random effects. Everything is pretty standard there. Now you see there's a lot going on here in the report. We don't need to pay attention to any of this. We are just interested in creating a model.
At this point, the way we do that is we go here under the red triangle menu and go under Save Columns. Now we need to be careful which column we select. If I select Prediction Formula, which you might be tempted to do, that's good, but it doesn't get us all the way there, as you'll see. If I go into the formula, this is the mean prediction formula. There's nothing about random effects here, so this isn't the column I want; it doesn't contain everything I need. I need to come back, go under Save Columns again, and scroll down here to the conditional prediction formula. Note from the hover help that it includes the random effect estimates, so that is the one I want. Now, you might be in a case where you don't really want to compute power for the random effects, just for the mean model, in which case you could easily have gone back to the Custom Design platform and done it that way. Let's pretend that we're interested in those random effects as well. Now we've saved the conditional prediction formula. Again, we'll go in and look at the formula, and here you can see we have the random effects. Now we need to do some tweaking to get it into the simulation formula that we want. So I'm going to double-click here, which puts me into the JMP Scripting Language view. First I'll make some changes to the main effects, and I'm just going to pick some values. Let's do 0.5 for temperature, 0.1 for time, and for pH let's do 1.2, a little bit higher. For water I'm going to go even higher, since the mixture factors might have larger coefficients, so I'll do 85 for water. I'll do 90 for the first growth factor, and let's do 50 for growth factor two. All right, so I've made my adjustments to the mean model portion. Again, these should be parameter values that you think are scientifically important. Now for the random effects, you might be tempted to replace each one with something like this.
That should be a random effect, so I'll just put a random normal here. And it kind of looks right, but not exactly. The reason is that this formula is evaluated row by row. What's going to happen is that the first time you come across the technician named Jill, you will simulate a random value here and get a value for that formula evaluation, but the next time you get to Jill, which is row six here, this will simulate a different value, which defeats the purpose of a random effect. A random effect should hold the same value every time Jill appears. Instead, it's going to behave like a random error, which I'll take this opportunity to add here; that is a value that we do want to change on every row. So how do we overcome this? I tell you this because I actually ended up doing this the first time I presented this, which was slightly embarrassing. Thankfully, my coworker came along afterward and showed me a trick for how to input the random effect appropriately, and here's the trick. We're going to go to the top here and type: if Row() equals one, then I create a variable, call it tech Jill, and here's where I place the random normal. Then we replace the random normal down in the formula with tech Jill. What this does is, if it's the first row, we simulate a random value and assign it to that variable. After the first row, we don't simulate again, which means tech Jill keeps the value it was initially given, and it will hold that value every place we put it. We'll do the same for Bob. As you can see, that accomplishes the task of the random effect. Put Bob here. For Stan, things are a little bit easier. We don't have to simulate for him, because the random effects should add up to zero in the model. So the way we do that is we make his effect the opposite sign of the sum of the other effects.
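The idea behind the trick, drawing each random effect once and reusing it on every row for that level while the error term is redrawn per row, can be sketched outside of JMP. The Python below is an illustration under assumptions, not the JSL formula from the talk: the function names, the standard deviations, and the row layout are hypothetical.

```python
import random

def draw_sum_to_zero_effects(levels, sd, rng):
    """Draw one effect per level; the last level is minus the sum of the others."""
    effects = {level: rng.gauss(0.0, sd) for level in levels[:-1]}
    effects[levels[-1]] = -sum(effects.values())
    return effects

def simulate_response(technician_by_row, tech_sd=1.0, error_sd=0.5, seed=1234):
    rng = random.Random(seed)
    # Drawn once per simulated data set (the role of the "if Row() == 1" step)
    tech_effects = draw_sum_to_zero_effects(["Jill", "Bob", "Stan"], tech_sd, rng)
    simulated = []
    for tech in technician_by_row:
        error = rng.gauss(0.0, error_sd)   # random error: a new draw on every row
        simulated.append(tech_effects[tech] + error)
    return tech_effects, simulated

rows = ["Jill", "Bob", "Stan", "Jill", "Bob", "Stan"]
effects, y = simulate_response(rows)
```

Here Jill contributes the same effect on her first and second rows, which is what distinguishes a random effect from row-level noise; the fixed-effect part of the model is left out to keep the sketch short.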
We do the same thing here for serum lot one. Now for this one I'm going to give it a bit more noise; let's say there's a bit more noise in the serum lots. This is the advantage of this approach: you get to play around with different scenarios. I input those values here. And again, this last one is the negative of the sum of the others. Before I add the other one, I'll go ahead and just add it here, since that makes it easy: day one, negative day one. And I'll add its random effect here, and I'll say that its random effect is a bit smaller. All right, at this point we should have our complete simulation formula. If I click OK, it takes me back to the Formula Editor view. We should be good to go. All right, so there's our simulation formula. What do we do next? We go back to our Fit Model report, and we go to the area where we want to simulate the power. Here I'm going to go under the Fixed Effect Tests box, to this column, which is the p-value. In this case, the original noisy simulation didn't give us any p-values. That's OK; we don't care about that. We just needed this to generate the model, which we then turned into a simulation formula. I'm going to right-click on this column. Remember, this only works if you have JMP Pro. Here at the very bottom is Simulate, so we click that. It's going to ask us which column to switch out. By default it selects the response column, and then it goes through and finds all the simulation formula columns. We want to switch in this one, because it contains our simulation model. I tell it how many samples to do, 100, I give it my favorite random seed, and I click OK. Wait about a second or two, and there we are. It has generated a table where it simulated the response, fit the model, and reported back the p-values.
Now there are some cases where there are no p-values; we ended up in a situation much like the one we started with, and that's OK. That happens in simulation. It's fine so long as we have a sufficient number to get an estimate. The nice thing about this is that JMP saw that we were simulating p-values, so it said, "I bet you're wanting to do a power analysis," and it has happily provided us a script to do that. So thanks, JMP. We run that, and you'll see it looks a lot like the Distribution platform. It has done a distribution of each of those columns, but with an added feature: a new table here that shows the simulated power. Because we simulated 100 samples, we can read the rejection counts off directly as the estimated power in percent; if it weren't 100, if it were some other number, then you would look at the rejection rate. So we see here that for our three mixture factors, it looks like we have pretty good power, given everything that we have, to detect those particular coefficients. If we go over here to the other three factors, things don't look as good. So then we'd have to go back and say, OK, maybe we'll see what's the maximum value that I can detect. I'm going to minimize these and minimize this table. I'll come back to my formula and say, let's do something different here. What if I change this? This was 0.5; what if it were higher, about one? For time, let's also make it one. And for pH, I'm going to go to three. So I'm going to bump things up a bit and ask, can I detect this? We'll keep everything else the same, because it looks like we can detect those. Click Apply, OK, and that generates some new values. Again, same thing: right-click on the column that you want to simulate, click Simulate, and switch in the simulation formula. Give it the number of samples, and I'll stick with the same seed. And we'll go. We just have to wait a few seconds for it to finish the simulation. There we are.
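The simulate-refit-count loop that one-click simulation automates can be sketched in a few lines of Python. This is only an illustration of the principle: it swaps the mixed model from the talk for a two-group z-test with known standard deviation so that it runs with the standard library alone, and the names `simulate_p_value` and `simulated_power` are hypothetical.

```python
import random
from statistics import NormalDist, mean

def simulate_p_value(effect, sigma, n_per_group, rng):
    """Simulate one data set and return the two-sided p-value for the effect."""
    a = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
    b = [rng.gauss(effect, sigma) for _ in range(n_per_group)]
    se = sigma * (2 / n_per_group) ** 0.5
    z = (mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulated_power(effect, sigma, n_per_group, n_sims=100, alpha=0.05, seed=2020):
    """Estimated power = share of simulated p-values below alpha (rejection rate)."""
    rng = random.Random(seed)
    p_values = [simulate_p_value(effect, sigma, n_per_group, rng)
                for _ in range(n_sims)]
    rejections = sum(p < alpha for p in p_values)
    return rejections / n_sims

# Same seed, two candidate effect sizes: compare what each design can detect
small = simulated_power(effect=0.5, sigma=1.0, n_per_group=12)
large = simulated_power(effect=1.0, sigma=1.0, n_per_group=12)
```

With 100 simulations the rejection count and the power in percent coincide, which is why the counts can be read off directly in that case.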
And we'll run our power analysis again. These look to be the same; we didn't change anything there. So in fact, I'm going to hide these groups. A little too much. Here we go, let's hide these three and look at the others. So we seem to have done better on pH, so that value might be near the upper range of what we can detect given this sample size. But for temperature and time, it seems we still can't detect even those high values. So, OK, what else could we change? What if we double the number of samples? I mean, we are calculating this for a sample size. So let's go back, and one way we can do that is to go to DOE and click Augment Design. We'll select all our factors, select our response, and click OK. We'll just augment the design, and this time we'll double it and make it 24 runs. So I'll make the design. It's going to take a little bit of time, so I'm actually going to stop it a bit early. And let's see, we'll make the table. OK, so now we've doubled the number of runs. It only gave us half the responses. That's OK. Since we just need a response, I'm going to take this and copy and paste. Of course, in real life you wouldn't want to do that, because hopefully you'd get different responses, but again, we just need a noisy response. Go to the model. This time we've got to fix things a little bit: I'm going to select these three, go here under Attributes, and say they are random effects. Keep everything else the same and click Run. You'll notice I don't yet have my simulation formula, but rather than walk through and rebuild it, I can create a new column, go back to the old one, right-click, Copy Column Properties, come back, right-click, Paste Column Properties, and my formula is copied and ready to go. So let's see what happens under this situation, and we'll keep the values that we initially had. I'll go back and double-click this to open up the Fit Model window.
Go under the Fixed Effect Tests, right-click on the p-value column, and click Simulate. I'm not going to change this, because there was only one simulation formula, so it selected the one I wanted, and it found the right response. So I'll just change these, and let's see what happens in this case. All right. Run the power analysis. Again, I'm not going to worry about these mixture effects because, as you can see, we just got better than what we had originally, which was already good. So I'm going to hide them again so we can more easily see the ones we're interested in. For pH, we knew we were probably going to do better, because even with the old 12 runs we had pretty good power. It looks like we have definitely improved on temperature and time. So if those represent the upper bound of the effect sizes we're interested in, or maybe a lower upper bound, this seems to indicate that doubling the sample size might help. So this illustrates, first of all, how to do the one-click simulate, and then how we can use it to do power calculations. It also encourages you to do something I often did before I came to JMP, which is to give people options. Explore your options. Doubling the sample size seemed to help with temperature and time. Changing what you're looking for seemed to help with pH, and on the mixture effects we seem to be OK. That exploration can also include going back and changing the variances of your random effects. For example, I could come back here (I won't do it) and change these values and ask what happens if the technicians were a bit noisier or the serum lots were less noisy. Try to find settings under which your test plan is more robust to unforeseen situations. OK, so let me clean up and close these all out. All right. For the remainder of the scenarios, I'm going to be exploring different takes on how you can implement this.
The general approach is the same. You create your design and simulate a response. You use Fit Model, or in this case a slightly different platform, to generate a model. Then you use that model to create a simulation formula, which you then use in the one-click simulate approach. So now let's look at a case where a company is going to conduct a survey of their employees (in this case they already have, but let's pretend that they are going to), and they want to determine which factors influence employee attrition. Maybe they have a lot of employees who are going to be leaving, so they want to conduct a survey to assess which factors matter, and they want to know how many respondents they should plan for. Now the response is years at the company, but there are two little kinks. First, an employee has to have worked at least a month before they leave for it to be considered attrition. The other is that the responses are given in years, but maybe we're more concerned about months. Maybe that's how our budgeting software works or something. And for employees, it might be easier to answer how many years they have been at the company rather than how many years and months. So in this case we have interval censoring, because we're given a number of years, but that only tells us that they've been there between that many years and a year more. We also have the situation where, if they leave before a year, the response is censored between a month and a year. So I'll open up the data table. I've set up a lot already. We've got a lot of factors here, and I'll scroll all the way to the end so you can see the responses that we're looking at. We have a years low and a years high. What this means is that if an employee were to respond that they left after six years, their actual time there is somewhere between six and seven years.
If they left before a year, then we know that they were there sometime between a month and a year. I'm going to click this dialog button here to launch the interval-censoring model. We'll use the Generalized Regression (GenReg) platform, and we're going to assume a Weibull distribution for the response. We don't put a censoring code here, because we have interval censoring; the way we handle that is we put both response columns into the Y role, as you'll see. And here are all the factors. When we click Run, JMP recognizes this as a time-to-event distribution and asks, OK, you gave me two response columns, does that mean you're doing interval censoring? In this case, yes, we are. So now we're going to go through the same thing. We're going to find the right red triangle; in this case, it's here next to Weibull Maximum Likelihood. Now here's the really nice thing about the GenReg platform. There are already a lot of nice things about it, but here's some more icing on top. If we did it like before, we'd have to go in and save the prediction formula, and then go and make some adjustments to make sure it's a random Weibull that's being simulated, adjusting things as needed. GenReg, though, is aware that you can do the one-click simulate, so it's saying, hey, would you like me to save the simulation formula for you, if that's what you're interested in? And yes, we are, so we click Save Simulation Formula. Let's go back to our table. You'll notice it only simulated one column; I'll talk a bit more about why in a moment. But let's check real quick. We'll go in, and there it is. In fact, I'll double-click to pull up the scripting language view, and you'll see it's already set up as a random Weibull, with the transformation of the model already in there. All you would have to do at this point is change these parameter values to what is scientifically significant to you.
OK, now for this purpose I won't do that; I'm just going to leave them be. I will make one change, though, because I want to try to replicate the actual situation that we're going to be in. Notice here that these are all continuous values, when in actuality what we should be getting are nice, round, whole-year numbers. Here's the way I can do that. Since this is years high, I'm going to create a simple variable, make it equal to the actual continuous time, and tell it to return the ceiling, so essentially round up. Apply, OK, and there you have it. As you can see, this tells me that I've simulated years high. Now, to see what happens when you do the one-click simulation, I'll open up the effect tests. If I right-click and then click Simulate, I can only enter one column at a time; I can't drag and select more than one. If I were to just replace years high with the years high simulation, that looks OK. The problem is years low. Years low is being brought in because it was part of the original model, but it's the years low that you originally used. If we look back, we already see an issue; let me cancel out of this real quick. For example, if we were to do that, it wouldn't be able to fit this first row, because the years high is lower than the years low. Years low is not tied to the simulated response. So how do we fix that? We need to tie it; we need to make that connection. So I'll go to years low and click Formula. There's already a formula here, and I'm just going to make a quick change. I'm going to say: if the simulation formula (I double-click to insert that) is equal to one, so if years high is one, return 1/12. Otherwise, return the simulation value minus one. Now click Apply and OK. As you can see, it's proper now; it's tied to it.
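The way the two interval-censored columns are tied together can be written down compactly. The Python sketch below is an illustration, not the saved JMP formula: it draws a continuous Weibull time directly with fixed, made-up scale and shape values (in the talk these come from the fitted model and its factors), rounds up to get years high, and derives years low from it. The function name `simulate_interval` is hypothetical.

```python
import math
import random

def simulate_interval(scale, shape, rng):
    """Return (years_low, years_high) for one simulated employee."""
    t = rng.weibullvariate(scale, shape)      # continuous time in years
    # Round up to whole years; max() guards the edge case of a draw of exactly 0
    years_high = max(1, math.ceil(t))
    # Tie the lower bound to the simulated upper bound:
    # one month (1/12 year) if they left within the first year, else one year less
    years_low = 1 / 12 if years_high == 1 else years_high - 1
    return years_low, years_high

rng = random.Random(42)
pairs = [simulate_interval(scale=7.0, shape=1.5, rng=rng) for _ in range(2000)]
```

Because years low is computed from the same draw as years high, every simulated row satisfies low < high, which is the condition the model fit needs.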
So now I can go back, right-click, and do the Simulate. I can replace years high with its simulation formula and be comfortable knowing that when I do, years low will be appropriate. It will always be one year lower, unless years high is already one year, in which case it's 1/12. It's now tied to it, so it will always be brought in when we do the simulation. I'll run a simulation real quick. There we go. It's going a bit slow, so that's a good sign. I'll let it finish out. All right, so there are our simulations, and of course we can run the power analysis. In this case we've got a lot of factors, and I believe there were 1,470 observations in this table. For a lot of the factors we have overkill, but surprisingly, for some of them we still have issues. That might be something worth investigating. Maybe we can't detect that low of a coefficient, or we might have to change something about these factors; those are things to discuss in your planning meeting. So that's how you need to work things in this case, where we had interval censoring. If you had right censoring, so you had a censoring column, it's the same thing: it would output a simulation of the actual time, and you can make some adjustments to that to ensure that it matches the type of time you're seeing in your response, or what you expect. Then you'll have to tie your censoring column to the simulation, and this is going to happen whenever you have that type of setting. OK, let's clear all this out. So let's look at one other case. What happens if we have a non-normal response? We've already seen one, a reliability-type response, so we know we can use GenReg. Let's explore another one real quick. In this case, we have a non-normal response in a test of a weapons platform, where the response is a percentage. Now, technically, you could model this as a normal distribution.
And that might be fine, so long as you expect values around the 50 percent point. But because we want this to be a very accurate weapons platform, we'd hope to see responses closer to 100%, and so something like a beta distribution for the response might be more appropriate. We do have one other wrinkle. We have these three factors of interest, but one of them, the target, is nested within fuse type, so the type of target will depend on the fuse type. OK, we'll run this real quick. Again, we've created our data. In this case I simulated some random data, and I did it so that it falls between zero and one, simply by taking the logistic transformation of a random normal. I will copy and paste, and make sure I can paste. And again, we walk through it; it's pretty simple. We're going to use the beta response distribution. We have our response, and we have our target nested within fuse type. Click Run. And again: red triangle menu, Save Columns, Save Simulation Formula. This is something you can do in GenReg; the regular Fit Model, unfortunately, cannot do that. So we have our simulation formula. I'm not going to make any changes, but you could go in. As you can see when I double-click, the structure is already there, even the logistic transformation, so you've just got to put in your model parameters. Click OK, Apply, OK. And again, we go down to the Effect Tests, right-click, Simulate, make the substitution, and go. All right, so you see how easy it is in general. Even if you have non-normal responses, you're good to go, thanks to GenReg. OK. Now, what if you have longitudinal data? This can be tricky, simply because now the responses might be correlated with one another. So how can we incorporate that? Well, it's straightforward. In this case, we have an example of a company that's producing a treatment for reducing cholesterol.
Let's say it's treatment A. We're going to run a study to compare it to a competitor, treatment B, and for the sake of completeness we'll have a control and a placebo group. We'll have five subjects per group. The longitudinal aspect is that measurements are taken in the morning and afternoon, once a month, for three months. Now, I'm not going to spend too much time on this, because I just want to show you how you incorporate the longitudinal aspect. In this case I've already created a model and created the simulation formula, so you can use it as a reference for how you might do this. Let's say we have an AR(1) model. I'll run this real quick, just to show you. So there are all the fixed effects. Notice that we've got a lot of interactions; keep that in mind as I show you the formula, which might look a bit messy. I've stated that we have a repeated structure, so I've selected AR(1), with period by days within subject. This is under the Mixed Model platform. So how do I incorporate that AR(1) into my simulation formula? I did it like this. If it's the first row or a new patient (that's what this means: the current patient does not equal the previous patient), this is the model that I saved. I changed the parameter values to something that might be of interest. It did take a bit of work, because there are a lot of interactions happening, and we've got some random noise at the end. But that's all I did: I changed some values here and made a lot of them zero, just to make things easy. If it's not the first row and it's not a new patient, how do we incorporate correlation? All I do is copy that model up to here and add this term: some value, which I believe has to be less than or equal to one, times the previous entry. If it were autoregressive of order two, then you would add something like the lag of the simulation formula by two. And you'd have to make another adjustment where, if it's the first row, we have our model.
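The AR(1) structure just described, where the first row of each patient uses the plain model and every later row adds a multiple of the previous simulated value, can be sketched as follows. This is a Python illustration of that description only: the function name `simulate_ar1`, the coefficient `phi`, and the constant means are hypothetical, and the fixed-effect model is reduced to a supplied mean per row.

```python
import random

def simulate_ar1(subjects, means, phi=0.5, noise_sd=1.0, seed=7):
    """Simulate responses row by row with an AR(1)-style lag term within subject."""
    rng = random.Random(seed)
    y = []
    for i, (subject, mu) in enumerate(zip(subjects, means)):
        base = mu + rng.gauss(0.0, noise_sd)   # saved model plus random noise
        if i == 0 or subject != subjects[i - 1]:
            y.append(base)                     # first row or new patient: no lag
        else:
            y.append(base + phi * y[i - 1])    # add phi times the previous entry
    return y

subjects = [1, 1, 1, 2, 2]
noiseless = simulate_ar1(subjects, means=[10.0] * 5, phi=0.5, noise_sd=0.0)
# noiseless == [10.0, 15.0, 17.5, 10.0, 15.0]
```

Setting the noise to zero, as in the last line, makes the carry-over visible: each value within a patient picks up half of the previous one, and the chain resets when the patient changes.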
If it's the second row, or we're two places into the new patient, it might look like an AR(1). If it's anything else, we go back to the AR(2) form. So as you can see, it's very easy to incorporate autocorrelation structures. As long as you know what your model looks like, it should be easy to implement it as a simulation formula. OK, I'll let you look at that real quick. Finally, our last scenario is a pass/fail response, which is also very common. I'm going to use this to illustrate how you can use the one-click simulate to maybe change people's minds about how they run certain types of designs, and to show you how powerful this can be. No pun intended. Let's say we have a detection system that we're creating to detect radioactive substances, and we're going to compare it to another system that's maybe already out there in the field. To compare these two detection systems, we've selected a certain amount of material in some test objects, ranging from a very low concentration at one to a very high concentration at five, and we're going to test our systems repeatedly on each concentration, a certain number of times, and see how many times each successfully alarms. I'm going to open both of these. Let's start with this one. This represents a typical design you might see: we have a balanced number of samples at each setting. In this case, we have a lot of samples; they're very fortunate at this place. Let's say we're going to do 32 balanced trials at each run, and this is a simulated response. And here I've created my simulation formula, so I'll show you what that looks like. Again, it's a random binomial. The trial counts are all the same, so I've just put the number here, but I could have referenced the trials column to be consistent; that's OK. Here's the model that maybe I'm interested in. OK. And here
I have a scenario where, instead of a balanced number at each setting, I have put most of my samples here in the middle. My reasoning might be that if it's a low concentration, I hardly expect the system to catch it; I have reasonable expectations. And if it's a high concentration, well, it should almost always catch it. So where the difference is most important to me is there in the middle, maybe at concentrations three or four. So that's where I'm going to load most of my samples, then I'll put a few more here, and put the fewest at these other settings. Let's see how each of these test plans performs in terms of power. I'll run the binomial model script here. There's only one model effect here, the system. We don't put in concentration, because we know there's an effect there; the system is what we're interested in. GenReg, binomial. Run it. OK, again, red triangle menu. I've already got my simulation formula, so actually I don't need to do that. You've already picked up the pattern: right-click, do the Simulate. OK, everything looks good there. My next favorite random seed. Here we are, power analysis. OK, now let's go over here and do the same thing. I'll fit the model, and again, when you have a binomial, you have to put in not only how many times it alarmed, but out of how many trials. Run, scroll down to the effect tests, and simulate, and you're probably already getting a hint of what's going to happen. Click OK. Here are my simulations; I'll get my power analysis, scoot it over here, minimize, minimize. So here's what you get under the balanced design. Notice that we have very low power, which seems odd, because we had 32 at each run. I mean, that's a lot of samples; I would have killed for that many samples where I previously worked. So you would expect a lot of power, but there doesn't seem to be. Whereas here, I had the same total number of samples; I just allocated them differently.
And my power has gone up dramatically. Maybe if I stacked even more here, say four and four at the ends and the rest added to each of these, I could get even more power to detect this difference. So this shows that it's not always just about changing your sample size. You might not always need more samples; in this case you had a lot of samples to begin with. How you allocate them is also important. OK. So, I hope you're as excited as I was when I discovered this very awesome tool for calculating power. I'd like to leave you with some key takeaways. Again, we use simulation. Now, ideally, we'd like a formula, and in the simple cases we do get the advantage of a nice, simple formula. Even with the regression models, we have formulas to help under the hood. But of course, in the real world things are a little more complex, and so we typically have to rely on simulation, which can be a very powerful tool, as we've seen. Now, one of the key things we have to do with simulation is balance accuracy with efficiency. I usually ran 100 simulations, mainly to save on time. But ultimately you might stick with the default of 2,500, knowing that it will take some time to run. So what I might advocate is to start with 100 or 200 simulations at the beginning, just to get an idea of what's going on. Then, if you find a situation where it looks like it's worth more investigation, bump up the number of simulations so you can increase your accuracy. So maybe you start with a couple of different situations, run a few quick simulations, narrow down to some key scenarios, and then increase the number of simulations to get more accuracy. I always argue that power calculation, just like design of experiments, is never one and done. You shouldn't just go to a calculator, plug in some numbers, and come back with a sample size.
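The trade-off between accuracy and run time mentioned above can be quantified with the usual binomial standard error of a simulated proportion. The short Python sketch below is a generic calculation, not something taken from JMP, and the function names are hypothetical.

```python
import math

def power_standard_error(power, n_sims):
    """Monte Carlo standard error of a power estimate based on n_sims simulations."""
    return math.sqrt(power * (1 - power) / n_sims)

def sims_needed(power, margin, z=1.96):
    """Simulations needed so that roughly 95% of estimates fall within +/- margin."""
    return math.ceil(power * (1 - power) * (z / margin) ** 2)

quick = power_standard_error(0.8, 100)      # about 0.04, i.e. +/- 8 points at 95%
default = power_standard_error(0.8, 2500)   # about 0.008
```

For a true power near 80%, 100 simulations are enough to tell a promising scenario from a hopeless one, while 2,500 narrow the estimate to within a couple of percentage points, which matches the suggestion to screen quickly first and then rerun the key scenarios with more simulations.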
There's a lot that can happen in the design, or that can happen in an experiment, and I think the best way to plan an experiment is to try to account for different scenarios. So explore different levels of noise in your response. With mixed effects, play around with different random effect sizes. Of course you can explore different sample sizes, but also explore different types of models. For example, in the interval-censoring case we used the Weibull model; what if we had used a lognormal model? Exploring these different scenarios and presenting them to the test planners gives you a way to plan your study to be robust to a variety of settings. So never just go calculate and come back; always present test planners with different scenarios. It's the same process I used when I designed actual experiments. I would present the test planners I worked with different options they could explore. Maybe they pick one option, or it might be a combination of options. You should always do that to make your plans more robust. All right. Well, I hope you learned something new with this. If you have any questions, you can reach out to me; they'll probably be providing my email address. So I hope you enjoyed this talk, and I hope you enjoy the rest of the conference. Thank you.
Dave Sartori, Sr. Data Scientist, PPG   A sampling tree is a simple graphical depiction of the data in a prospective sampling plan or one for which data has already been collected. In variation studies such as gage, general measurement system evaluations, or components of variance studies, the sampling tree can be a great tool for facilitating strategic thinking on: What sources of process variance can or should be included? How many levels within each factor or source of variation should be included? How many measurements should be taken for each combination of factors and settings? Strategically considering these questions before collecting any data helps define the limitations of the study, what can be learned from it, and what the overall effort to execute it will be. What’s more, there is an intimate link between the structure of the sampling plan and the associated variance component model. By way of examples, this talk will illustrate how inspection of the sampling tree facilitates selecting the correct variance component model in JMP’s variability chart platform: Crossed, Nested, Nested then Crossed or Crossed then Nested. In addition, the application will be extended to the interpretation of variance structures in control charts and split-plot experiments.     Auto-generated transcript...   Speaker Transcript Dave Hi, everybody. Thanks for joining me here today. I'd like to share with you a topic that has been part of our Six Sigma Black Belt program since 1997. So I think this is one of the tools that people really enjoy, and I think you'll enjoy it, too, and find it informative in terms of how it interfaces with some of the tools available in JMP. First, a quick slide or two in terms of a message from our sponsor. I'm with PPG Industries outside of Pittsburgh, Pennsylvania, in our Monroeville Business and Technical Center. I've been a data scientist there on and off for over 30 years, and moved in and out of technical management. 
And now, back to what I truly enjoy, which is working with data and JMP in particular. So PPG has been around for a while; it was founded in 1883. Last year we ranked 180th on the Fortune 500. And we make mostly paints. Although people think that PPG stands for Pittsburgh Plate Glass, that has not been the case since about 1968. So it's PPG now, and it's primarily a coatings company: performance coatings and industrial coatings for cars, airplanes, and, of course, houses. You may have bought PPG paint or one of PPG's brands to use on your home. But coatings are also used inside of packaging, so if you don't have a coating inside of a beer can, the beer gets skunky quite quickly. My particular business is specialty coatings and materials. In my segment we make OLED phosphors for Universal Display Corporation that you find in the Samsung phone, and also the photochromic dyes that go into the Transitions lenses, which turn dark when you head outside. So what I'm going to talk to you about today is this tool called a sampling tree. And what it is, it's really just a simple graphical depiction of the data that you're either planning to collect or maybe that you've already collected. And so in variation studies like a Gage R&R, general measurement system evaluations, or components of variance studies (or CoV, as we sometimes call them), the sampling tree is a great tool for thinking strategically about a number of things. So, for example, what sources of variance can or should be included in this study? How many levels within each factor or source of variation can you include? And how many measurements do you take for each combination of factors and settings? So you're kind of getting to a sample size question here. So strategically considering these questions before you collect any data also helps you define the limitations of the study, what you can learn from it, and what the overall effort is going to be to execute it. 
So we put this in a classification of tools that we teach in our Six Sigma program, what we call critical thinking tools, because it helps you think up front. And it is a nice sort of exercise that you can work on paper or on the whiteboard to kind of think prospectively about the data you might collect. It's also really useful for understanding the structure of factorial designs, especially when you have restrictions on randomization. So I'll give you one sort of conceptual example, towards the end here, where you can describe on a sampling tree a line of restricted randomization. And so that tells you where the whole plot factors are and where the subplot factors are. So it can provide you, again up front, with a better understanding of the data that you're planning to collect. They're also useful, and I'll share another conceptual example, where we've combined a factorial design with a components of variation study. So this is really cool because it accelerates the learning about the system under study. We're simultaneously trying to manipulate factors that we think impact the level of the response, and at the same time understanding components of variation which we think contribute to the variation of the response. So once the data is acquired, the sampling tree can really help you facilitate the analysis of the data. And this is especially true when you're trying to select the variance component model within the variability chart that you have available in JMP. And so if you've ever used that tool (and I'll demonstrate it for you here with a couple of examples), if you're asking JMP to calculate the variance components for you, you have to make a decision as to what kind of model you want. Is it nested? Is it crossed? Maybe it's crossed then nested. Maybe it's nested then crossed. 
So helping you figure out what the correct variance component model is, is really well facilitated by a good sampling tree. The other place that we've used them is where we are thinking about control charts. So the control chart application really helps you see what's changing within subgroups and what's changing between subgroups. So it helps you think critically about what you're actually seeing in the control charts. So as I mentioned, they're good for kind of showing the lines of restriction in a split plot, but they're kind of less useful for the analysis of designed experiments, so again, for DOE types of applications they are more kind of up front. So let's jump into it here with an example. So here's what I would call a general components of variance study. And in this case, this is actually from the literature. This is from Box, Hunter, and Hunter, "Statistics for Experimenters," and you'll find it towards the back of the book where they are talking about a components of variance study, and it happens to be on a paint process. And so what they have in this particular study are 15 batches of pigment paste. They're sampling each batch twice and then they're taking two moisture measurements on each of those samples. So the first sample in the first batch is physically different than the second batch, and the first sample out of the second batch is physically different from any of the other samples. And so one practice that we try to use and teach is that for nested factors, it's often helpful to list those in numerical order. So that again emphasizes that you have physically different experimental units as you're going from sample to sample throughout. And so this is a nested sampling plan. The sample is nested under the batch. So let's see how that plays out in the variability chart within JMP. Okay, so here's the data, and we find the variability chart under Quality and Process, Variability. 
And then we're going to list here as the X variables the batch and then the sample. And one thing that's very important in a nested sampling plan is that the factors get loaded in here in the same order that you have them in the sampling tree. So this is hierarchical. Otherwise the results will be a little bit confusing. So we can decide here in this launch platform what kind of variance component model we want to specify. We said this is a nested sampling plan. And so now we're ready to go. We leave the measurement out of the list of Xs because the measurement really just defines where the subgroups are. So we just leave that out. And that's going to be what goes into the variance component that JMP refers to as within variation. Okay, so here's the variability chart. One of the nice things, too, with the variability chart is there's an option to add some graphical information. So here I've connected the cell means. And so this is really indicating visually what kind of variation you have between the samples within the batch. And then we have two measurements per sample, as indicated on our sampling tree. And so the distance between the two points within the batch and the sample indicates the within-subgroup variation. So you can see, just right off the bat, it looks like there's a good bit of sample-to-sample variation. And the other thing we might want to show here are the group means. And so that shows us the batch-to-batch variation. So the purple line here is the average on a batch-to-batch basis. Okay. Now, what about the actual breakdown of the variation here? Well, that's nicely done in JMP here under variance components. Let me get that up there so we can see it, and then I'll collapse this. As we saw graphically, it looked like the sample-to-sample variation within a batch was a major contributor to the overall variation in the data. And in fact, the calculations confirm that. 
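For a balanced nested plan like the one on this sampling tree, a breakdown of this kind can be sketched with the classical expected-mean-squares (ANOVA) method. The Python below is a minimal sketch, not a reproduction of JMP's output: the function name, the convention of truncating negative estimates to zero, and the tiny data set are all hypothetical, and JMP's own estimation method may differ.

```python
from statistics import fmean


def nested_variance_components(data):
    """Variance components for a balanced nested design.

    data[i][j] is the list of repeat measurements on sample j of batch i.
    Returns {name: (variance, percent_of_total)} using expected mean squares.
    """
    a = len(data)          # number of batches
    b = len(data[0])       # samples per batch
    n = len(data[0][0])    # measurements per sample
    sample_means = [[fmean(cell) for cell in batch] for batch in data]
    batch_means = [fmean(means) for means in sample_means]
    grand_mean = fmean(batch_means)

    ms_batch = b * n * sum((m - grand_mean) ** 2 for m in batch_means) / (a - 1)
    ms_sample = n * sum(
        (sm - bm) ** 2
        for means, bm in zip(sample_means, batch_means)
        for sm in means
    ) / (a * (b - 1))
    ms_within = sum(
        (y - sm) ** 2
        for batch, means in zip(data, sample_means)
        for cell, sm in zip(batch, means)
        for y in cell
    ) / (a * b * (n - 1))

    # Negative estimates are truncated to zero (a common convention, assumed here).
    components = {
        "batch": max((ms_batch - ms_sample) / (b * n), 0.0),
        "sample[batch]": max((ms_sample - ms_within) / n, 0.0),
        "within": ms_within,
    }
    total = sum(components.values())
    return {k: (v, 100 * v / total) for k, v in components.items()}


# Hypothetical data: 3 batches, 2 samples per batch, 2 measurements per sample.
toy = [
    [[10, 12], [14, 16]],
    [[20, 22], [24, 26]],
    [[30, 32], [34, 36]],
]
for name, (variance, percent) in nested_variance_components(toy).items():
    print(f"{name:14s} variance={variance:7.2f}  {percent:5.1f}% of total")
```

The nesting shows up in the arithmetic: sample means are only ever compared with the mean of their own batch, which is why the order of the factors matters for a nested model.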
So we have about 78% of the total variation coming from the sample, about 20% of the variation coming batch to batch, and only about 2.5% of the variation coming from the measurement-to-measurement variation within the batch and sample. Notice here, too, in the variance components table, the notation that's used. This indicates that the sample is within the batch. So this is a nested study. And again, it's important that we load the factors into the variability chart in the order indicated here in the plot. It wouldn't make any sense to say that within sample one we have batch one and two. That just doesn't make any physical sense. And the tree kind of reflects that. Now let's compare that with something a little bit different. I call this a traditional Gage R&R study. And so what you have in a traditional Gage R&R study is a number of parts, samples, or batches that are being tested. And then you have a number of operators who are testing each one of those. And then each one tests the same sample or batch multiple times. So in this particular example we're showing five parts or samples or batches, and three operators measuring each one twice. Now in this case, operator one for batch number one is the same as operator number one for batch, sample, or part number five. So you can think of this as saying, well, the operator kind of crosses over between the part, sample, batch, whatever the thing is that's getting measured. So this is referred to as a crossed study. And it's important that they measure the same article, because one of the things that comes into play in a crossed study that you don't have in a nested study is a potential interaction between the operators and what they're measuring. So that's going to be reflected in the variance component analysis that we see from JMP. Now let's have a look here at this particular set of data. 
So again, we go to the handy variability chart, which again is found under Quality and Process. And in this case, I'll start by using the same order for the variables for the Xs as shown on the sampling tree. But, as I'll show you, one of the features of a crossed study is that we're no longer stuck with the hierarchical structure of the tree. We can flip these around. And so this is crossed. I'm going to be careful to change that here. Remember that we had a nested study from before. And I'm going to go ahead and click OK. And I'm going to put our cell means and group means on there. So the group means in this case are for the three samples, and we've got three operators. And now if we ask for the variance components, notice that we don't have that sample within operator notation like we had in the nested study. What we have in this case is a sample by operator interaction. And it makes sense that that's a possibility in this case, because again, they're measuring the same sample. So Matt is measuring the same sample A as the QC lab is, and as Tim is. So an interaction in this case really reflects how different this pattern is as you go from one sample to the other. You can see that it's generally the same. It looks like Matt and QC tend to measure things perhaps a little bit lower overall than Tim. This part C is a little bit the exception. So the interaction variation contribution here is relatively small. There is some operator-to-operator variation, and the within variation really is the largest contributor. And that's easy to see here because we've got some pretty wide bars here. But again, this is a crossed study, so we should be able to change the order in which we load these factors and get the same results. So that's my proposition here; let's test it. So I'm just going to relaunch this analysis and I'm going to switch these up. I'm going to put the operator first and the sample second. 
Leave everything else the same. And let's go ahead and put our cell means and group means on there. And now let's ask for the variance components. So how do they compare? I'm going to collapse that part of the report. So in the graphical part, and this is a cool thing to recognize with a crossed study, because we're not stuck with the hierarchy that we have in a nested study, we can kind of change the perspective on how we look at the data. So that perspective with loading in the operator first gives us sort of a direct operator-to-operator comparison here in terms of the group means. And again, that interaction is reflected in how this pattern changes between the operators here as we go from part A, B, or C, to A, B, or C. What about the numbers in terms of the variance components? Well, we see that the variance components table here reflects the order in which we loaded these factors into the dialog box. But the numbers come out very much the same. So for the sample on the left-hand side here, the standard deviation is 1.7. The standard deviation due to the operator is about 2.3, and it's the same value over here. The sample by operator, or operator by sample, interaction, if you like, is exactly the same. And the within is exactly the same. So, with a crossed study, we have some flexibility in how we load those factors in, and then the interpretation is a little bit different. If these were different samples, we might expect this pattern from going from operator to operator to be somewhat random because they're measuring different things. So there's no reason to expect that the pattern would repeat. If you do see a significant interaction term in a typical, kind of traditional Gage R&R study, like we have here, well, then you've got a real issue to deal with, because that's telling you that the nature of the sample is causing the operators to measure differently. 
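The order-swapping check above can also be reproduced numerically. The Python sketch below uses the classical expected-mean-squares formulas for a balanced two-factor crossed design with interaction; the function name and the small data set are hypothetical, and this is not necessarily the computation JMP performs. Swapping which factor is listed first only relabels the two main-effect components, while the interaction and within components stay the same.

```python
from statistics import fmean


def crossed_variance_components(data):
    """Variance components for a balanced, fully crossed two-factor study.

    data[i][j] holds the repeat measurements for level i of the first
    factor combined with level j of the second factor.
    """
    p, o, r = len(data), len(data[0]), len(data[0][0])
    cell = [[fmean(c) for c in row] for row in data]
    row_mean = [fmean(row) for row in cell]
    col_mean = [fmean(cell[i][j] for i in range(p)) for j in range(o)]
    grand = fmean(row_mean)

    ms_first = o * r * sum((m - grand) ** 2 for m in row_mean) / (p - 1)
    ms_second = p * r * sum((m - grand) ** 2 for m in col_mean) / (o - 1)
    ms_inter = r * sum(
        (cell[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(p) for j in range(o)
    ) / ((p - 1) * (o - 1))
    ms_within = sum(
        (y - cell[i][j]) ** 2
        for i in range(p) for j in range(o) for y in data[i][j]
    ) / (p * o * (r - 1))

    # Negative estimates are truncated to zero (a common convention, assumed here).
    return {
        "first": max((ms_first - ms_inter) / (o * r), 0.0),
        "second": max((ms_second - ms_inter) / (p * r), 0.0),
        "interaction": max((ms_inter - ms_within) / r, 0.0),
        "within": ms_within,
    }


# Hypothetical data: 3 samples (rows) x 2 operators (columns) x 2 repeats.
by_sample = [
    [[10, 12], [15, 17]],
    [[20, 22], [21, 23]],
    [[30, 32], [37, 39]],
]
# The same measurements with the factor order swapped (operators as rows).
by_operator = [list(col) for col in zip(*by_sample)]

sample_first = crossed_variance_components(by_sample)
operator_first = crossed_variance_components(by_operator)
print(sample_first)
print(operator_first)
```

In a nested model the same swap would change the answer, because the lower factor's levels are only compared within their parent level.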
So that's a bit harder of a problem to solve than if you just have a no-interaction situation there. OK. Dave So again, for your reference, I have this listed out here. So now let's get to something a little bit more juicy. Here we have sort of a blended study where we've got both crossed and nested factors. This was from the business that I work in. The purity of the materials that we make is really important, and a workhorse methodology for measuring purity is high performance liquid chromatography, or HPLC for short. So this was a product that was getting used in an FDA-approved application, so getting the purity right was really important. This is a slice from a larger study. But what I'm showing is the case where we had three samples; I'm labeling them here S1, S2, S3. We have two analysts in the study. And so each analyst is going to measure the same sample in each case. So you can see that it's similar to what we had in the previous example, what I call the traditional Gage R&R, where each operator, or analyst in this case, is measuring exactly the same part or sample. So that part is crossed. When you get down under the analyst, each analyst then takes the material and preps it two different times. And then they measure each prep twice. They do two injections into the HPLC with each preparation. So preparation one is different than preparation two, and that's physically different than the first preparation for the next analyst over here. And so again, we try to remember to label these nested factors sequentially to indicate that they're physically different units here. It doesn't really make any difference from JMP's point of view; it'll handle it fine if you were to go 1-2, 1-2, 1-2, and so on down the line, as long as you tell it the proper variance component model to start with. So this would be crossed and then nested. So let's see how that works out in JMP. 
So here's our data: sample, analyst, prep number, and then just an injection number, which is really kind of the within subgroup. So once again we go to Analyze, Quality and Process. We go to the variability chart. And here we're going to put in the factors in the same order as they were shown on the sampling tree. And then we're going to put the percent area in there as the response. And we said this was crossed and then nested, so we have a couple of other things to choose from here. And in this case, again, the sampling tree is really, really helpful for convincing us that this is the case, and for selecting the right model. This is crossed, and then nested. Let's click OK. I'm going to put the cell means and group means on there. Again, we have a second factor involved above the within. So let's pick both of them. And let's again ask for the variance components. And I'm going to just collapse this part, hopefully, and maybe I'm going to collapse the standard deviation chart, just to bring it a little bit further up onto the screen. So what we can see in the graph as we go is a good bit of sample-to-sample variation. The within variation doesn't look too bad. But we do maybe see a little bit of variation within the preparation. So the sample in this case is by far the biggest component of variation, which is really what we were hoping for. The analyst is really below that within-subgroup variation. And so this lays it out for us very nicely. So in terms of what it's showing in the variance components table here, in terms of the components, it's sample, analyst, and then, because these two are crossed, we've got a potential interaction to consider in this case. It doesn't seem to be contributing a whole lot to the overall variation. And again, that's how the pattern changes as we go from analyst to analyst and sample to sample. 
Now, the claim I made before with the fully crossed study was that we could swap the crossed factors in terms of their order and it would be okay. So let's try that in this case. I'm just going to redo this, relaunch it, and I think I can swap the crossed factors here, but again I have to be careful to leave the nested factor where it is in the tree. So notice over here in the variance components table, the way we would read this is that we have the prep nested within the sample and the analyst. So that means it has to go below those on the tree. So let's go ahead and connect some things up here. I'm going to take the standard deviation chart off and ask for the variance components. Okay, so just like we saw in the traditional Gage R&R example, we've got the analyst and the sample switching. But their values, if we look at the standard deviation over here in the last column, are identical. We have again the identical value for the interaction term and the identical value for the prep term, which again is nested within the sample and the analyst. So again, here's where the sampling tree helps us really fully understand the structure of the data, and it complements nicely what we see in the variance components chart of JMP. So those are a couple of examples that are geared towards a components of variation study. One thing you might notice too, and I forgot to point this out earlier, is to look at the sampling tree here. And if I bring this back, I'm just trying to reproduce this. Bring that back up. Dave It's interesting: if you look at the horizontal axis in the variability chart, it's actually the sampling tree upside down. So that's another way to kind of confirm that you're looking at the right structure here when you are trying to decide what variance component model to apply. So again, here are the screenshots for that. 
Here's an example where the sampling tree can help you in terms of understanding sources of variation in a control chart, of all things. So in this particular case, over a number of hours, a sample is being pulled off the line. These are actually lens samples. I mentioned that we make photochromic dyes to go into the Transitions lenses, and they will periodically check the film thickness on the lenses, and that's a destructive test. And so when they take that lens and measure the film thickness, well, they're done with that sample. And so what we would see if we were to construct an X-bar and R chart for this is, on the X-bar chart, the hour-to-hour average. And then the within-subgroup variation is going to be made up of what's going on here sample to sample and in the thickness measurement. Now in this case, notice that there are vertical lines in the sampling tree, so the tree doesn't branch in this case. So when you're drawing vertical lines on the sampling tree, that's an indication that the variability between those two levels of the tree is confounded. So I can't really separate the inherent measurement variation in the film thickness from the inherent sample-to-sample variation. I'm kind of stuck with those in terms of how this measurement system works. So let's whip up a control chart with this data. And for that, again, we're going to go to Quality and Process. And I'm going to jump into the Control Chart Builder. So again, our measurement variable here is the film thickness. And we're doing that on an hour-to-hour basis. So when we get it set up by doing that, we see that JMP smartly sees that the subgroup size is 3, just as indicated on our sampling tree. 
But what's interesting in this example is that you might, at first glance, be tempted to be concerned because we have so many points out of control on the X-bar chart. But let's think about that for a minute in terms of what the sampling tree is telling us. So the sampling tree again is telling us that what's changing within the subgroup, what's contributing to the average range, is the film-thickness-to-film-thickness measurement, along with the sample-to-sample variation. And remember how the control limits are constructed on an X-bar chart. They are constructed from the average range. So we take the overall average, and then we add or subtract the average range multiplied by a factor related to the subgroup sample size, so that the width of these control limits is driven by the magnitude of the average range. And so really what this chart is comparing is, let's consider this measurement variation down here at the bottom of the tree. So it's comparing measurement variation to the hour-to-hour variation that we're getting from the line. So that's actually a good thing, because it's telling us that we can see variation that rises above the noise that we see in the subgroup. So in this case, that's actually desirable. And so, again, a sampling tree is really helpful for reminding us what's going on in the X-bar chart in terms of the within-subgroup and between-subgroup variation. Now, just a couple of conceptual examples in the world of designed experiments. So a split-plot experiment is an experiment in which you have a restriction on the run order of the experiment. And what that does is it ends up giving a couple of different error structures, and JMP does a great job now of designing experiments for that situation where we have restrictions on randomization, and also analyzing those. 
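The construction of the X-bar limits from the average range can be written out directly. The Python sketch below uses the standard Shewhart constants for subgroups of size 3 (A2 = 1.023, D3 = 0, D4 = 2.574); the function name and the hourly film-thickness values are hypothetical and are chosen so that the within-subgroup range is small compared with the hour-to-hour shifts, as in the chart described above.

```python
from statistics import fmean

# Standard Shewhart control chart constants for subgroups of size 3.
A2, D3, D4 = 1.023, 0.0, 2.574


def xbar_r_limits(subgroups):
    """Return (LCL, center, UCL) for the X-bar chart and for the R chart."""
    xbars = [fmean(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    grand_mean = fmean(xbars)
    r_bar = fmean(ranges)  # average within-subgroup range sets the limit width
    return {
        "xbar": (grand_mean - A2 * r_bar, grand_mean, grand_mean + A2 * r_bar),
        "range": (D3 * r_bar, r_bar, D4 * r_bar),
    }


# Hypothetical film-thickness readings: one subgroup of 3 lenses per hour.
hours = [
    [10.0, 10.2, 10.1],
    [11.0, 11.1, 10.9],
    [9.0, 9.2, 9.1],
    [10.5, 10.4, 10.6],
]
limits = xbar_r_limits(hours)
lcl, center, ucl = limits["xbar"]
flagged = [i for i, g in enumerate(hours) if not lcl <= fmean(g) <= ucl]
print(f"X-bar limits: {lcl:.3f} to {ucl:.3f}; subgroups outside: {flagged}")
```

Because the limits are built only from the within-subgroup ranges, tight agreement among the three lenses in each hour produces narrow limits, and ordinary hour-to-hour movement then shows up as points outside them.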
So, nevertheless, it's sometimes helpful to understand where those error structures might be splitting, and in a split-plot design, you get into what are called whole plot factors and subplot factors. And the reason you have a restriction on randomization is typically because one or more of the factors is hard to vary. So in this particular scenario, we have a controlled environmental room where we can spray paints at different temperatures and humidities. But the issue there is you just can't randomly change the humidity in the room, because it takes too long to stabilize and it makes the experiment rather impractical. So what's shown in this sampling tree is that you really have three factors here: humidity, resin, and solvent. These are shown in blue. And so we only change humidity once because it's a difficult-to-change variable. So that's how you set up a split-plot experiment in JMP: you can specify how hard the factors are to change. So in this case, humidity is a hard, very hard to change factor. And so JMP will take that into account when it designs the experiment and when you go to analyze it. But what this shows us is that the humidity would be considered a whole plot factor because it's above the line of restriction, and then the resin and the solvent are subplot factors; they're below the line of restriction. So there's a different error structure above the line of restriction for whole plot factors than there is for subplot factors. In this case we have a whole bunch of other factors that are shown here, which really affect how a formulation, which is made up of a resin and a solvent, gets put into a coating. So this is actually a 2-to-the-3 experiment with a restriction on randomization. It's got eight different formulations in there. 
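The restriction on randomization can be illustrated with a small run-order generator. This Python sketch is only a conceptual illustration, not JMP's design algorithm: the function name and the level labels are hypothetical. Humidity is set once per whole plot, and only the resin and solvent combinations are randomized inside each humidity setting.

```python
import itertools
import random


def split_plot_run_order(seed=0):
    """Build a run order for a 2^3 split-plot with humidity as the whole-plot factor."""
    rng = random.Random(seed)
    humidity_levels = ["low", "high"]   # hard to change: whole-plot factor
    rng.shuffle(humidity_levels)        # randomize which humidity is run first
    runs = []
    for humidity in humidity_levels:    # humidity changes only between whole plots
        subplots = list(itertools.product(["resin A", "resin B"],
                                          ["solvent A", "solvent B"]))
        rng.shuffle(subplots)           # randomize only below the line of restriction
        runs.extend((humidity, resin, solvent) for resin, solvent in subplots)
    return runs


runs = split_plot_run_order()
for run_number, run in enumerate(runs, start=1):
    print(run_number, *run)
```

A completely randomized 2-to-the-3 design would instead shuffle all eight runs together, which could require changing the humidity several times.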
Each one is applied to a panel, and then that panel is measured once, so that what we see in terms of the measurement-to-measurement variation is confounded with the coating and panel variation. As I said before, when we have vertical lines on the sampling tree, then we have some confounding at those levels. So that's an example where we're using it to show us where the splitting is in the split-plot design. This particular example, again, is conceptual, but it actually comes from the days when PPG was making fiberglass; we're no longer in the fiberglass business. But in this case, what was being sought was an optimization, or at least an understanding of the impact of four controllable variables on what was called loss ???, so they basically took coated fiber mats and then measured the amount of coating that's lost when they basically burn up the sample. So what we have here at the top of the tree is actually a 2-to-the-4 design. So there are 16 combinations of the four factors in this case, and for each run in the design, the mat was split into 12 different lanes, as they're referred to here. So you're going across the mat from lane 1 to lane 12, and then we're taking out three sections within each one of those lanes, and then we're doing a destructive measurement on each one of those. So this actually combines a factorial designed experiment with a components of variation study. And so again, we've got vertical lines here at the bottom of the tree indicating that the measurement-to-measurement variation is confounded with the section-to-section variation. And so what we ended up doing here in terms of the analysis was, we treated the data from each DOE run as sort of the sample-to-sample variation like we had in the moisture example from Box, Hunter, and Hunter; instead of batches, here you have DOE run 1, 2, 3, and so on through 16, and then we're subsampling below that. 
And so we treat this part as a components of variation study, and then we basically averaged up all the data to look and see what would be the best settings for the four controllable factors involved here. So this is really a good study because it got to a lot of questions that we had about this process in a very efficient manner. So again, combining a CoV with a DOE, design of experiments with a components of variation study. So in summary, I hope you've got an appreciation for sampling trees. They're pretty simple. They're easy to understand. They're easy to construct, but yet they're great for helping us talk through what we're thinking about in terms of sampling a process or understanding a measurement system. And they also help us decide what's the best variance components model when we look to get the variance components from JMP's variability chart platform. We get a lot of use out of that particular tool, and I like to say that it's worth the price of admission to JMP for that tool in and of itself. So I've shown you some examples here where it's nested, where it's crossed, crossed then nested, and then also where we've applied this kind of thinking to control charts to help us understand what's varying within subgroups versus what's varying between subgroups. And then also, perhaps less usefully, we can use those with designed experiments as well. So thanks for sharing a few minutes with me here, and my email's on the cover slide, so if you have any questions, I'd be happy to converse with you on that. So thank you.  
Peter Hersh, JMP Global Technical Enablement Engineer, SAS Phil Kay Ph.D, Learning Manager Global Enablement, SAS Institute   In the process of designing experiments, often many potential critical factors are identified. Investigating as many of these critical factors as possible is ideal. There are many different types of screening designs that can be used to minimize the number of runs required to investigate a large number of factors. The primary goal of screening designs is to find the active factors that should be investigated further. Picking a method to analyze these designs is critical, as it can be challenging to separate the signal from the noise. This talk will explore using the auto-validation technique developed by Chris Gotwalt and Phil Ramsey to analyze different screening designs. The focus will be on group orthogonal supersaturated designs (GO-SSDs) and definitive screening designs (DSDs).  The presentation will show the results of auto-validation techniques compared to other techniques to analyze these screening designs.       Supplementary Materials   A link to Mike Anderson's Add-in To Support Auto-Validation Workflow. JMP data tables for examples 1 and 2 from the presentation.  Journal attached for an interactive demonstration of how auto-validation works for screening experiments. Other Discovery Summit papers about auto-validation from: Europe 2018, US 2018, Europe 2019, US 2019 and US 2020.  Recorded demonstration of how auto-validation works for screening experiments. Auto-generated transcript...   Speaker Transcript Peter Hersh All right, well, thanks for tuning in to watch Phil's and my Discovery presentation. We are going to be talking today about a new ??? 
technique in JMP that we are both really excited about, and that's using auto validation to analyze screening designs for DOEs. So my name is Peter Hersh. I'm a senior systems engineer and part of the global technical enablement team, and my co-author who's joining me is Phil. Do you want to introduce yourself? phil Yes. So I'm in the global technical enablement team as well. I'm learning manager. Peter Hersh Perfect.   So we're going to start out at the end and show some of the results that we got while we were working through this. We did some experiments using auto validation with some different screening DOEs and we found it a very promising technique. We were able to find more active factors than some analysis techniques.   And really when we're looking at screening DOEs, what we're trying to do is find as many active factors as we can. And Phil, maybe you can talk a little bit about why that is. phil Yeah. So the objective of a screening experiment is to find out which of all of your factors are actually important. So if we miss any factors from our analysis of the experiment that turn out to be important, then that's a big problem. You know, we're not going to fully understand our process or our system because we're neglecting some important factor. So it's really critical. 
The most important thing is to identify which factors are important.   And if we occasionally add in a factor that turns out not to be important. That's, that's less less of a problem but we really need to make sure we're capturing all of the active factors. Peter Hersh Yeah, great, great explanation there, Phil, and I think   if we look at this over here on the right-hand side, our table, we we've looked at 100 different simulations of these different techniques where we looked at different signal-to-noise ratios in screening design and we found that out of those seven different techniques,   we did a fairly good job when we had a higher signal-to-noise ratio, but as that dropped a little we   struggled to find those less   large effects. So this top one was the auto validation technique, and and we only ran that once, and we'll go into why that is, and what that running that auto validation technique did for us. But I think this was a very promising result.   And that now typically, when we do a designed experiment, we don't hold out any of the data. We want to keep it intact. Phil, can you talk a little to why we wouldn't do that? phil Yeah.   When we design an experiment, we are looking to find the smallest possible number of runs that give us the information that we need. So we deliberately keep the number of rows of data really as small as possible.   Ideally, you know, in machine learning what you can do is you hold back some of the data to...as a way of checking how good your models are and whether you need to   improve that model, whether you need to simplify that model.   With design of experiments, you don't really have the luxury of just holding back a whole chunk of your data because all of it's critically important.   You've designed it so that you've got the minimum number of rows of data, so there isn't really that luxury.   But it would be really cool if we could find some way that we could use this this validation in some in some way. 
And I guess that's really the trick of what we're talking about today. Peter Hersh Yeah, excellent lead in, Phil. So   here, this auto validation technique has been introduced by Chris Gotwalt, who is our head developer for pretty much everything under the analyze menu in JMP,   and a university professor, Phil Ramsey. And I have two QR codes here for two different Discovery talks that they gave and   if you're interested in those Discovery talks, I highly recommend them. They go into much more technical details than Phil and I are planning   to go into today about why the technique works and what they came up with. We're just trying to advertise that it's a pretty interesting technique and it's something that might be worth checking out for you and show some of our results.   The basic idea is we start with our original data,   and then we resample that data so the data down here in gray is a repeat of this data up here in white and that is used as validation data. And how we get away with this is we use a fractional weighting system.   And it's really easy to set up with an add in that Mike Anderson developed and   there's the QR code for finding that but you can find that on the JMP user community. And it just makes this setup a lot more simple and we'll go through the setup and the use of the add in when we walk through the example in JMP. But   the basic idea is it creates this validation column, this fractional weighting column, and a null factor, and we'll talk about those in a little bit.   Alright, so we have a case study that both Phil and I used here and we're trying to maximize a growth rate for a certain microbe. And we're adjusting the nutrient combination.   And for my example I'm looking at 10 different nutrient factors. And these nutrient factors, we went everywhere from not having that nutrient up to some high level of having that. 
And   this is based on a case study that you can read about here, but we just simulated the results. So we didn't have the real data.   And the case study I'm going to talk to is a DSD where we have five active effects. Actually it's four and the intercept that are active.   And we did a 25-run DSD and I am...I'm just looking at these 10 nutrient factors and I'm adjusting the signal-to-noise ratio for the simulated results. So that's, that's my case study and, Phil, do you want to talk to yours, a little bit? phil Yeah, so in mine, I look to a smaller number of the factors, just six factors, in a smaller experiment, so a 13-run definitive screening design. And what I was really interested in looking at was how well this method could identify active effects when we've got   as many active effects as we have runs in the design of experiments. So we've got 13 rows of data and we've got 13 active parameters when we include the intercept as well.   That's a really big challenge. Most the time we're not going to be able to identify the active effects using standard methods. So I was really, really interested in how this auto validation method might do in that situation. Peter Hersh Yeah, great. So we're gonna   duck out of our PowerPoint. I'm going to bring in my case study here, and we'll, we'll talk about this. So here is my 25-run DSD. I have   I have my results over here that are   simulated. And so this is my growth rate which is impacted by some of these factors and we're in a typical screening design. We're just trying to figure out which of these factors are active, which are not active.   And we might want to   follow up with an augmented design or at least some confirmation runs to to make sure that our, our initial results are confirmed. So   how we would set up this auto validation? So for now, in JMP 15, this is that add in that I mentioned that that Mike Anderson   created and it's just called auto validation setup. 
And in JMP 16 this is going to be part of the product, but in JMP 15, it's an add in.   And so when that I run that add in, what happens is it creates...   resamples that data. So it created 25 runs that are identical to   those top 25 and they're in gray. And then it added this partially...   this fractional weighting here and then it added the validation and the null factor. So basically, what we're going to do is we're going to run a model on this using   validation and you can use any modeling technique; generalized regression   is a good one. You can use any of the variable selection techniques. You want to make sure that it can do some variable selection for you. So just to give you an idea, I'm going to go under analyze, fit model.   We'll take our growth rate which is our response. We're going to take that weighting.   Actually I'll   change this to generalize regression. I'm going to put that weighting in as our frequency. I'm going to add that validation column that was created.   This null factor that's created,and we'll talk a little bit more about that null factor. And then I'm going to just add all those 10 factors. Now in Phil's example, he's going to look at interactions and quadratic effects. I could do that here as well, but this is just to show   the capability.   And I'll hit Run. We'll go ahead and launch this again. I'll use lasso, you could use forward selection or anything like that. But I'll just use a lasso fit. Hit go. And then I'm going to come down here and I'm going to look at   these estimates. So I what I want to do is simulate these estimates and I want to figure out which of these estimates get zeroed out   most often and least   often. So   I would go in here and I'd hit simulate, and I could choose my number of simulations.   In this case I had, I have done 100 and I won't make you sit here and watch it simulate.   I can go right over here to my simulated results. 
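Editor's sketch: the table layout that the add in produces, as described above, can be illustrated in a few lines of Python. This is not the add in's code. It assumes the weighting scheme usually associated with this technique, where each run draws one uniform number u and the training copy gets weight -ln(u) while the validation copy gets -ln(1 - u), so a run that counts heavily for training counts lightly for validation. The null factor here is plain random noise, which is also an assumption. The column names and the three example runs are made up for illustration.

```python
import math
import random


def make_autovalidation_table(runs, seed=0):
    """Stack each run twice (training copy, validation copy) and attach
    a validation flag, a fractional weight, and a null factor.

    Assumed scheme (not confirmed by the talk): training weight -ln(u),
    validation weight -ln(1 - u), with one uniform draw u per run.
    """
    rng = random.Random(seed)
    draws = []
    table = []
    for run in runs:
        # Keep u strictly inside (0, 1) so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        null_value = rng.gauss(0.0, 1.0)  # noise column unrelated to the response
        draws.append((u, null_value))
        table.append({**run, "Validation": 0,
                      "Weight": -math.log(u), "Null Factor": null_value})
    # The validation rows repeat the same runs with the paired weights.
    for run, (u, null_value) in zip(runs, draws):
        table.append({**run, "Validation": 1,
                      "Weight": -math.log(1.0 - u), "Null Factor": null_value})
    return table


# Hypothetical three-run example (factor settings and responses are made up).
runs = [
    {"Citric Acid": -1, "EDTA": 1, "Growth Rate": 4.2},
    {"Citric Acid": 0, "EDTA": 0, "Growth Rate": 5.0},
    {"Citric Acid": 1, "EDTA": -1, "Growth Rate": 6.1},
]
table = make_autovalidation_table(runs, seed=1)
for row in table:
    print(row)
```

Each pass of the simulation described in the talk would redraw the weights (a new seed in this sketch), refit the model with variable selection, and record which estimates were zeroed out.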
So I've done 100 simulations here and I'm looking at   the results from those hundred simulations and when I run the distribution, which automatically gets generated there, we can see   some information about this.   Now the next thing that I'm going to do is hold down control, and I'm going to customize my summary statistics.   And all I want to do is remove everything except for the proportion non zero. So what that's going to do is it's going to allow me to just see   how often a certain factor was zeroed out and how often it was kept in there. So when I hit okay, all of these are changed to proportion non zero.   And then when I   right click on here, I can make this combined data table, which I've already done.   And the combined data table is here.   And the reason I'm kind of going quick on this is because...   I've added a factor type row and I have a saved script in here, but you would get these three columns out of that.   Make a combined data table, so it would have all of your factors and then how often that factor was non zero. So the higher the number,   the more indicative that it's an active factor. So the last thing I'm going to do is run this Graph Builder and this shows how often   a factor is added to the model. That null factor is kind of our reference line, so it has no correlation to   the response. And so anything that is lower than that, we probably don't need to include in the model and then things that are higher than that, we might want to look at. And so these green ones   from the simulation were the active factors, along with the intercept and then the red ones were not active. So it did a pretty good job here.   It flagged all of the good ones. phil Yeah. Peter Hersh And we got one extra one but like Phil, you mentioned, that's not as big of a problem, right? 
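Editor's sketch: the proportion-nonzero summary and the null-factor reference line described above amount to a short calculation. The function names and the four simulated estimates per term below are hypothetical, chosen only to show the logic; they are not results from the talk.

```python
def proportion_nonzero(sim_estimates, tol=1e-12):
    """Fraction of simulation runs in which each term's estimate was kept
    (not shrunk to zero by the variable selection step)."""
    return {
        term: sum(abs(b) > tol for b in values) / len(values)
        for term, values in sim_estimates.items()
    }


def flag_active(proportions, null_name="Null Factor"):
    """Flag terms that enter the model more often than the null factor does."""
    cutoff = proportions[null_name]
    return sorted(t for t, p in proportions.items()
                  if t != null_name and p > cutoff)


# Made-up estimates from four simulation runs per term.
sim_estimates = {
    "Citric Acid": [2.1, 1.9, 2.0, 0.0],
    "EDTA": [1.0, 0.0, 0.9, 1.1],
    "Null Factor": [0.0, 0.1, 0.0, 0.0],
    "Factor X": [0.0, 0.0, 0.0, 0.0],
}
proportions = proportion_nonzero(sim_estimates)
print(proportions)
print(flag_active(proportions))  # ['Citric Acid', 'EDTA']
```

Treating the null factor's proportion as a hard cutoff is a simplification. In the talk it is used as a visual reference line, and terms close to it would still deserve judgment.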
phil Yeah, I mean, that's not the end of the world. I mean,   it's more, it's much more of a problem if you miss things that are active and your method tells you that they're not. And it's really impressive how it's picked out some factors here which had really low signal-to-noise ratios as well. Peter Hersh Yes, yeah. So just to give you an idea, this was citric acid was two, EDTA was one, that was half...so half the signal to noise, and potassium nitrate was a quarter, so very low signal and it was still able to catch that.   Yeah, so I'm gonna pass the ball over   to Phil and have him   present his case study. Yeah. phil Thanks, Pete.   Yeah.   Well, in my case study, as I said, it's a six factor experiment and we only have 13 runs here. And I've simulated the response here so that, such that   every one of my factors here is active, and the main effects are active, and the quadratic effects of each of those are active. So we've got 12 active effects, plus an intercept, to estimate or to find.   And I've made them, you know, just for simplicity, I made them really high signal-to-noise ratio.   So there's a signal-to-noise ratio of 3 to 1 for every one of those effects. So these are big, important effects basically as the...   is what I've simulated. So what we want to find out is that all of these effects, all of these factors are really important. Now if you look to analyze this using fit DSD, which would be a standard   method for this, it doesn't find the   active factors. It only finds ammonium nitrate as a as an active factor. I think fit DSD struggles when there are a lot of active effects. It's very good when there are only a small number.   And actually, you know, we probably wouldn't want to run a 13-run DSD and expect to get a definitive analysis. We would recommend adding some additional runs in this kind of situation.   
Even if we knew what the model was, so if we somehow we knew that we had six active main effects and six active quadratic effects   plus the intercept, we really can't fit that model. So this is just that model fit using JMP's fit model platform, the least squares regression. And you know there's...we've got as many parameters to estimate as we have rows of data, so we've got nothing left to estimate error.   So this is really all just to illustrate that this is a really big challenge, analyzing this experiment and getting out from it what we need to get out from it   is a real problem.   So I followed the same method as Pete. I generated this auto validation data set where we've got the repeated runs here with the fractionally weighted...   fractional weightings   Ran that through gen reg, so using the lasso as a model selection and then again resimulating. So simulating and each time changing out the   fractional weighting and around about 250 simulations, which again I won't show you actually doing that. These are the simulation results that we got, the distributions here, and you can see that it's   picking out citric acid. So the some of the times the models had a zero for the parameter estimate for citric acid, but a lot of the time   it was getting the parameter estimate to be about three, which is what it was simulated as originally, and what it should be getting. And you can see that for then, some of these other effects, which was simulated to not be active then, by and large, they are   estimated to have a parameter size of zero, which is what we want to see. And just looking at the proportion, nonzero as Pete did there.   And I've added in all the, the effect types here because here I was looking at the main effects, the quadratic and the interactions. And what the method should find is that the main effects and the quadratics are all active, but the two factor interactions were not.   
And when we look at that,   just plotting that proportion non zero for each of these effects, you can see, first of all, the null factor that we've added in there.   And anything above that, with a higher proportion non zero, is suggesting that's an active effect. And you can see, well, first of all, the intercept, which is always there.   We've got the main effects, which   we're declaring as active using this method, and the quadratics. They've got a higher proportion nonzero than the null factor.   And we can see for all of the two factor interactions, the proportion nonzero was much, much lower. So it's done an amazingly good job of finding the effects that I'd simulated to be active   in this very challenging situation, which I think is really very exciting.   That's just one little exploration of this method. To me that's a very exciting result and it makes me very excited about looking at this more. So I just wanted to finish with   some concluding remarks. And I think, Pete, it's fair to say we're not saying that everybody should go and throw away everything else that they've done in the past and only use this method now. Peter Hersh Yeah, absolutely. We've seen some exciting results. I think Chris is seeing exciting results, but auto validation is not the end all, be all to always use; it's a new tool in your tool belt. phil Yeah, I mean I think I'll certainly use it every time, but I'm not saying only use that. I think you always want to look at things from different angles using all the available tools that you've got.   And so it clearly shows a lot of promise, and we focused on the screening situation where we're trying to identify the active effects from screening experiments and we've looked at DSDs. I've looked briefly at other screening designs like group orthogonal supersaturated designs,   and it does a good job there from my quick explorations. 
I'd see no reason why it won't do well in fractional factorial and custom screening designs.   And it seems to be working in situations where the standard methods just fall down. The situation that I showed was a very extreme example; it's probably not a very realistic example.   But it really pushes the methods and the standard methods are going to fall down in that kind of situation. Whereas this auto validation method,   it seems to do what it should do. It gives you the results that you need to get from that kind of situation. And so it's very exciting. I think we're waiting for the results of some more rigorous simulation studies that are being done by   Chris Gotwalt and Phil Ramsey and a PhD student that they are supervising.   But it really does open up   a whole load of new opportunities. I think, Pete, it's just very exciting, isn't it? Peter Hersh Absolutely. Really exciting technique and thank you everyone for coming to the talk. phil Yeah, thank you.
Monday, October 12, 2020
Russ Wolfinger, Director of Scientific Discovery and Genomics, Distinguished Research Fellow, JMP Mia Stephens, Principal Product Manager, JMP   The XGBoost add-in for JMP Pro provides a point-and-click interface to the popular XGBoost open-source library for predictive modeling with extreme gradient boosted trees. Value-added functionality includes:
•    Repeated k-fold cross validation with out-of-fold predictions, plus a separate routine to create optimized k-fold validation columns
•    Ability to fit multiple Y responses in one run
•    Automated parameter search via JMP Design of Experiments (DOE) Fast Flexible Filling Design
•    Interactive graphical and statistical outputs
•    Model comparison interface
•    Profiling
•    Export of JMP Scripting Language (JSL) and Python code for reproducibility
Click the link above to download a zip file containing the journal and supplementary material shown in the tutorial.  Note that the video shows XGBoost in the Predictive Modeling menu but when you install the add-in it will be under the Add-Ins menu.   You may customize your menu however you wish using View > Customize > Menus and Toolbars.   The add-in is available here: XGBoost Add-In for JMP Pro      Auto-generated transcript...   Speaker Transcript Russ Wolfinger Okay. Well, hello everyone.   Welcome to my home here in Holly Springs, North Carolina.   With the Covid craziness going on, this is kind of a new experience to do a virtual conference online, but I'm really excited to talk with you today and offer a tutorial on a brand new add in that we have for JMP Pro   that implements the popular XGBoost functionality.   So today for this tutorial, I'm going to walk you through kind of what we've got. What I've got here is a JMP journal   that will be available in the conference materials. 
And what I would encourage you to do, if you'd like to follow along yourself,   you could pause the video right now and go to the conference materials, grab this journal. You can open it in your own version of JMP Pro,   and as well as there's a link to install. You have to install an add in, if you go ahead and install that that you'll be able to reproduce everything I do here exactly at home, and even do some of your own playing around. So I'd encourage you to do that if you can.   I do have my dog Charlie, he's in the background there. I hope he doesn't do anything embarrassing. He doesn't seem too excited right now, but he loves XGBoost as much as I do so, so let's get into it.   XGBoost is a it's it's pretty incredible open source C++ library that's been around now for quite a few years.   And the original theory was actually done by a couple of famous statisticians in the '90s, but then the University of Washington team picked up the ideas and implemented it.   And it...I think where it really kind of came into its own was in the context of some Kaggle competitions.   Where it started...once folks started using it and it was available, it literally started winning just about every tabular competition that Kaggle has been running over the last several years.   And there's actually now several hundred examples online if you want to do some searching around, you'll find them.   So I would view this as arguably the most popular and perhaps the most powerful tabular data predictive modeling methodology in the world right now.   Of course there's competitors and for any one particular data set, you may see some differences, but kind of overall, it's very impressive.   In fact, there are competitive packages out there now that do very similar kinds of things LightGBM from Microsoft and Catboost from Yandex. We won't go into them today, but pretty similar.   Let's, uh, since we don't have a lot of time today, I don't want to belabor the motivations. 
But again, you've got this journal if you want to look into them more carefully.   What I want to do is kind of give you the highlights of this journal and particularly give you some live demonstrations so you've got an idea of what's here.   And then you'll be free to explore and try these things on your own, as time goes along. You need a functioning copy of JMP Pro 15   at the earliest, but if you can get your hands on JMP 16 early adopter, the official JMP 16 won't ship until next year, 2021,   but you can obtain your early adopter version now and we are making enhancements there. So I would encourage you to get the latest JMP 16 Pro early adopter   in order to obtain the most recent functionality of this add in. Now this is kind of an unusual new setup for JMP Pro.   We have written a lot of C++ code right within Pro in order to integrate with the XGBoost C++ API. We do most of our work in C++,   but there is an add in that accompanies this that installs the dynamic library and does a little menu update for you, so you need both Pro and you need to install the add in, in order to run it.   So do grab JMP 16 Pro early adopter if you can, in order to get the latest things and that's what I'll be showing today.   Let's dive right in and do an example. And this is a brand new one that just came to my attention. It's got a very interesting story behind it.   The researcher behind these data is a professor named Jaivime Evaristo. He is an environmental scientist, a PhD, assistant professor   now at Utrecht University in the Netherlands and he's kindly provided this data with his permission, as well as the story behind it, in order to help others...there was a bit of a drama around these data. Jaivime and his colleague Jeffrey McDonnell   collected all this data. 
For these data, the purpose is to study water streamflow runoff in deforested areas around the world.   And you can see, we've got 163 places here, most of them, or at least half, in the US and then around the world, different places where they were able to collect   regions that have been cleared of trees and then they took some critical measurements   in terms of what happened with the water runoff. And this kind of study is quite important for studying the environmental impacts of tree clearing and deforestation, as well as climate change, so it's quite timely.   And happily for Jaivime at the time, they were able to publish a paper in the journal Nature, one of the top science journals in the world, a really nice   experience for him to get it in there. Unfortunately, what happened next, though, was that there was a competing lab that really became very critical of what they had done in this paper.   And it turned out after a lot of back and forth and debate, the paper ended up being retracted,   which was obviously a really difficult experience for Jaivime. I mean, he's been very gracious in sharing the story in hopes that others can   avoid this. And it turns out that what's at the root of the controversy, there were several other things, but what the main beef was from the critics is...   Jaivime had done a boosted tree analysis. And it's a pretty straightforward model. We've only got maybe a handful of predictors, each of which is important, and one of their main objectives was to determine which ones were the most important.   He ran a boosted tree model with a single holdout validation set and published a validation holdout R square of around .7. Everything looked okay,   but then the critics came along, they reanalyzed the data with a different holdout set and they got a validation holdout R square   of less than .1. So quite a huge change. 
They're going from .7 to less than .1, and the critics really jumped all over this and tried to really discredit what was going on.   Now, at this point, this happened last year and the paper was retracted earlier here in 2020...   Jaivime shared the data with me this summer and my thinking was to do a little more rigorous cross validation analysis and actually do repeated K fold,   instead of just a single holdout, in order to try to get to the bottom of this discrepancy between two different holdouts. And what I did, we've got a new routine   that comes with the XGBoost add in that creates K fold columns. And if you'll see the data set here, I've created these. For sake of time, we won't go into how to do that. But   there is a new module now that comes with the add in called make K fold columns that will let you do it. And I did it in a stratified way. And interestingly, it calls JMP DOE under the hood.   And the benefit of doing it that way is you can actually create orthogonal folds, which is not very common. Here, let me do a quick distribution.   This was the holdout set that Jaivime did originally and he did stratify, which is a good idea, I think, as the response is a bit skewed. And then this was the holdout set that the critic used,   and then here are the folds that I ended up using. I did three different schemes. And then the point I wanted to make here is that these folds are nicely kind of orthogonal,   where we're getting kind of maximal information gain by doing K fold three separate times with three orthogonal sets.   So, and then it turns out, because he was already using boosted trees, the next thing to try is the XGBoost add in. And so I was really happy to find out about this data set and talk about it here.   Now what happened...let me do another analysis here where I'm doing a one way on the validation sets. 
What I'm doing here is taking the response, this water yield corrected.   And I'm plotting that versus the validation sets. It turned out that for Jaivime,   the top four or five measurements all ended up in his training set, which I think is kind of at the root of the problem.   Whereas in the critics' set, it was balanced a little bit more, and in particular the highest scoring location was in the validation set. And so this is a natural source for error because it's going beyond anything that was in the training.   And I think this is really a case where a K fold type analysis is more compelling than just doing a single holdout set.   I would argue that both of these single holdout sets have some bias to them and it's better to do more folds in which you stratify...distribute things   differently each time and then see what happens after multiple fits. So you can see how the folds that I created here look in terms of distribution and then now let's run XGBoost.   So the add in actually has a lot of features and I don't want to overwhelm you today, but again, I would encourage you to follow along and pause the video at places if you are trying to follow along yourself   to make sure. But what we did here, I just ran a script. And by the way, JMP tables are nice, where you can save scripts. And so what I did here was run XGBoost from that script.   Just for illustration, I'm going to rerun this again right from the menu. This will be the way that you might want to do it. So when you install the add in,   you hit predictive modeling and then XGBoost. So we added it right here to the predictive modeling menu. And so the way you would set this up   is to specify the response. Here's Y. There are seven predictors, which we'll put in here as X's and then you put the fold columns in as validation.   
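Editor's sketch: a stratified, repeated K-fold assignment of the kind discussed above can be approximated in plain Python. This is a simplified stand-in, not the add in's routine: it sorts the rows by the response and deals fold labels within consecutive blocks, so every fold gets a share of the high values. It makes no attempt to produce the orthogonal folds that the add in builds through JMP DOE. The function names and the simulated skewed response are hypothetical.

```python
import random


def stratified_kfold_column(y, k=5, seed=0):
    """Assign fold labels 1..k so each fold spans the full range of y.

    Rows are sorted by response, cut into consecutive blocks of k, and the
    labels 1..k are shuffled within each block.
    """
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    folds = [0] * len(y)
    for start in range(0, len(order), k):
        block = order[start:start + k]
        labels = list(range(1, k + 1))
        rng.shuffle(labels)
        for row, label in zip(block, labels):
            folds[row] = label
    return folds


def repeated_kfold_columns(y, k=5, repeats=3, seed=0):
    """Build several fold columns, each from a different shuffle."""
    return [stratified_kfold_column(y, k, seed + r) for r in range(repeats)]


# Simulated skewed response for 163 rows (made-up values, not the study data).
rng = random.Random(42)
y = [rng.expovariate(1.0) for _ in range(163)]
fold_columns = repeated_kfold_columns(y, k=5, repeats=3, seed=7)
for column in fold_columns:
    print([column.count(label) for label in range(1, 6)])
```

With this scheme the four or five largest responses cannot all land in the same fold, which addresses the imbalance described for the single holdout sets.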
I wanted to make a point here about those of you who are experienced JMP Pro users, XGBoost handles validation columns a bit differently than other JMP platforms.   It's kind of an experimental framework at this point, but based on my experience, I find repeated K fold to be very a very compelling way to do and I wanted to set up the add in to make it easy.   And so here I'm putting in these fold columns again that we created with the utility, and XGBoost will automatically do repeated K fold just by specifying it like we have here.   If you wanted to do a single holdout like the original analyses, you can set that up just like normal, but you have to make the column   continuous. That's a gotcha. And I know some of our early adopters got tripped up by this and it's a different convention than other   Other XGBoost or other predictive modeling routines within JMP Pro, but this to me seemed to be the cleanest way to do it. And again, the recommended way would be to run   repeated K fold like this, or at least a single K fold and then you can just hit okay. You'll get this initial launch window.   And the thing about XGBoost, is it does have a lot of tuning parameters. The key ones are listed here in this box and you can play with these.   And then it turns out there are a whole lot more, and they're hidden under this advanced options, which we don't have time at all for today.   But we have tried to...these are the most important ones that you'll typically...for most cases you can just worry about them. And so what...let's let's run the...let's go ahead and run this again, just from here you can click the Go button and then XGBoost will run.   Now I'm just running on a simple laptop here. This is a relatively small data set. And so right....we just did repeated, repeated fivefold   three different things, just in a matter of seconds. XGBoost is pretty well tuned and will work well for larger data sets, but for this small example, let's see what happened.   
Now it turns out, this initial graph that comes out raises an immediate flag. What we're looking at here, over the number of fitting iterations, is a training curve, which is basically the loss function that you want to go down. But then the solid curve is the validation curve. And you can see what happened here. Just after a few iterations, this curve bottomed out and then things got much worse. So this is actually a case where you would not want to use this default model. XGBoost has already overfit, which often will happen for smaller data sets like this, and it does require tuning. There are a lot of other results at the bottom, but again, I wouldn't consider them trustworthy at this point. You would want to do a little tuning. For today, let's just do a little bit of manual tuning, but I would encourage you to explore more: we've actually implemented an entire framework for creating a tuning design, where you can specify a range of parameters and search over the design space, and again we actually use JMP DOE. So we've got two different ways we're using DOE already here, both of which have really enhanced the functionality. For now, let's just do a little bit of manual tuning based on this graph. If we zoom in on this graph, we see that the curve is bottoming out literally just after three or four iterations, so one real easy thing we can do is just stop training after four steps and see what happens. By the way, notice what happened with our overfitting: our validation R square was actually negative, so quite bad. Definitely not a recommended model. But if we run this new model where we're only going to take four steps, look at what happens. Much better validation R square. We're now up around .16, and in fact let's try three just for fun.
See what happens.   Little bit worse. So you can see this is the kind of thing where you can play around. We've tried to set up this dialogue where it's amenable to that.   And you can you can do some model comparisons on this table here at the beginning helps you. You can sort by different columns and find the best model and then down below, you can drill down on various modeling details.   Let's stick with Model 2 here, and what we can do is...   Let's only keep that one and you can clean up...you can clean up the models that you don't want, it'll remove the hidden ones.   And so now we're down, just back down to the model that that we want to look at in more depth. Notice here our validation R square is .17 or so.   So, which is, remember, this is actually falling out in between what Jaivime got originally and what the critic got.   And I would view this as a much more reliable measure of R square because again it's computed over all, we actually ran 15 different modeling fits,   fivefold three different times. So this is an average over those. So I think it's a much much cleaner and more reliable measure for how the model is performing.   If you scroll down for any model that gets fit, there's quite a bit of output to go through. Again...again, JMP is very good about...we always try to have graphics near statistics that you can   both see what's going on and attach numbers to them and these graphs are live as normal, nicely interactive.   But you can see here, we've got a training actual versus predicted and validation. And things almost always do get worse for validation, but that's really what the reality is.   And you can see again kind of where the errors are being made, and this is that this is that really high point, it often gets...this is the 45 degree line.   So that that high measurement tend...and all the high ones tend to be under predicted, which is pretty normal. 
I think for for any kind of method like this, it's going to tend to want to shrink   extreme values down and just to be conservative. And so you can see exactly where the errors are being made and to what degree.   Now for Jaivime's key interest, they were...he was mostly interested in seeing which variables were really driving   this water corrected effect. And we can see the one that leaps out kind of as number one is this one called PET.   There are different ways of assessing variable importance in XGBoost. You can look at straight number of splits, as gain measure, which I think is   maybe the best one to start with. It's kind of how much the model improves with each, each time you split on a certain variable. There's another one called cover.   In this case, for any one of the three, this PET is emerging as kind of the most important. And so basically this quick analysis that that seems to be where the action is for these data.   Now with JMP, there's actually more we can do. And you can see here under the modeling red triangle we've we've embellished quite a few new things. You can save predictive values and formulas,   you can publish to model depot or formula depot and do more things there.   We've even got routines to generate Python code, which is not just for scoring, but it's actually to do all the training and fitting, which is kind of a new thing, but will help those of you that want to transition from from JMP Pro over to Python. For here though, let's take a look at the profiler.   And I have to have to offer a quick word of apology to my friend Brad Jones in an earlier video, I had forgotten to acknowledge that he was the inventor of the profiler.   So this is, this is actually a case and kind of credit to him, where we're we're using it now in another way, which is to assess variable importance and how that each variable works. So to me it's a really compelling   framework where we can...we can look at this. 
And Charlie got right up off the couch when I mentioned that. He's right here by me now. And so, look at what happens. The interesting thing with this PET variable is that the key moment seems to be as soon as PET gets right above 1200 or so; that is when things really take off. And so it's a really nice illustration of how the profiler works. And as far as I know, this is the only software that offers plots like this, which go beyond just these statistical measures of importance and show you exactly what's going on and help you interpret the boosted tree model. So I think it's really a nice way to do the analysis, and I'd encourage you to try this out with your own data. Let's move on now to another example and back to our journal. As you can tell, there's a lot here. We don't have time, naturally, to go through everything. But for sake of time, I wanted to show you what happens when we have a binary target. What we just looked at was continuous. For that we'll use the old diabetes data set, which has been around quite a while and is available in the JMP sample library. This data set is the same data, but we've embellished it with some new scripts. And so if you get the journal and download it, you'll get this enhanced version that has quite a few XGBoost runs with a binary ordinal target. As you remember, here we've got low and high measurements, which are derived from this original Y variable, looking at a response for diabetes. And we're going to go a little bit even further here. Let's imagine we're in a kind of medical context where we actually want to use a profit matrix. And our goal is to make a decision.
We're going to predict, for each person, whether they're high or low. But then, thinking about it, we realized that if a person is actually high, the stakes are a little bit higher. And so we're going to double our profit or loss, depending on whether the actual state is high. And of course, the way this works is that typically correct decisions are here and here, and then incorrect ones are here. You want to think about all four cells when you're setting up a matrix like this. Here we're going to do a simple one, and it's doubling. I don't know if you can attach real monetary values to these or not. That's actually a good thing if you're in that kind of scenario. Who knows, maybe we can consider these each to be a Bitcoin, maybe around $10,000 each or something like that. It doesn't matter too much. It's more a matter of wanting to make an optimal decision based on our predictions. So we're going to take this profit matrix into account when we do our analysis now. It's actually only applied after the model fitting. It's not directly used in the fitting itself. So we're going to run XGBoost now here, and we have a binary target. If you'll notice, the objective function has now changed to a logistic log loss, and that's what's reflected here; this is the logistic log likelihood. And you can see that with a binary target, the metrics that we use to assess it are a little bit different. Although if you notice, we do have profit columns, which are computed from the profit matrix that we just looked at.
But if you're in a scenario, maybe where you don't want to worry about so much a profit matrix, just kind of straight binary regression, you can look at common metrics like   just accuracy, which is the reverse of misclassification rate, these F1 or Matthews correlation are good to look at, as well as an ROC analysis, which helps you balance specificity and sensitivity. So all of those are available.   And you can you can drill down. One brand new thing I wanted to show that we're still working on a bit, is we've got a way now for you to play with your decision thresholding.   And you can you can actually do this interactively now. And we've got a new ... a new thing which plots your profit by threshold.   So this is a brand new graph that we're just starting to introduce into JMP Pro and you'll have to get the very latest JMP 16 early adopter in order to get this, but it does accommodate the decision matrix...   or the profit matrix. And then another thing we're working on is you can derive an optimal threshold based on this matrix directly.   I believe, in this case, it's actually still .5. And so this is kind of adds extra insight into the kind of things you may want to do if your real goal is to maximize profit.   Otherwise, you're likely to want to balance specificity and sensitivity giving your context, but you've got the typical confusion matrices, which are shown here, as well as up here along with some graphs for both training and validation.   And then the ROC curves.   You also get the same kind of things that we saw earlier in terms of variable importances. And let's go ahead and do the profiler again since that's actually a nice...   it's also nice in this, in this case. We can see exactly what's going on with each variable.   We can see for example here LTG and BMI are the two most important variables and it looks like they both tend to go up as the response goes up so we can see that relationship directly. 
And in fact, sometimes with trees, you can get nonlinearities, like here with BMI. It's not clear if that's a real thing here. We might want to do more analyses or look at more models to make sure; maybe there is something real going on with that little bump that we see. But these are the kinds of things that you can tease out, and they're really fun and interesting to look at. So that's there to play with in the diabetes data set. The journal has a lot more to it. There are two more examples that I won't show today for sake of time, but they are documented in detail in the XGBoost documentation. This is just a PDF document that goes into a lot of written detail about the add-in and walks you step by step through these two examples. So I encourage you to check those out. And then the journal also contains several different comparisons that have been done. You saw this purple matrix that we looked at. This was a study that was done at the University of Pennsylvania, where they compared a whole series of machine learning methods to each other across a bunch of different data sets, and then counted how many times one outperformed the other. XGBoost came out as the top model in this comparison. It wasn't always the best, but on average it tended to outperform all the other ones that you see here. So, yet more evidence of the power and capabilities of this technique. Now there are a few other things here that I won't get into. This Hill Valley one is interesting. It's a case where the trees did not work well at all. It's kind of a pathological situation, but interesting to study, just to help understand what's going on. We also have done some fairly extensive testing internally within R&D at JMP, and a lot of those results are here across several different data sets. And again for sake of time, I won't go into those, but I would encourage you to check them out.
All of our results come along with the journal here, and you can play with them across quite a few different domains and data sizes. So check those out. I will say, just for fun, our conclusion in terms of speed is summarized here in this little meme. We've got two different cars. Actually, this really does scream along, and it's tuned to utilize all the threads that you have in your CPU. And if you're on Windows with an NVIDIA card, you can even tap into your GPU, which will often offer maybe another fivefold increase in speed. So a lot of fun there. So let me wrap up the tutorial at this point, and again encourage you to check it out. I did want to offer a lot of thanks. Many people have been involved, and I worry that I probably have overlooked some here, but I did at least want to acknowledge these folks. We've had a great early adopter group. And they provided really nice feedback, from Marie, Diedrich, and these guys at Villanova, who have actually already started using XGBoost in a classroom setting with success. So that was really great to hear about. And a lot of people within JMP have been helping. Of course, this is building on the entire JMP infrastructure, so I pretty much need to list the entire JMP division at some point for help with this. It's been so much fun working on it. And then I want to acknowledge our Life Sciences team, who have kept me honest on various things and have been helping out with a lot of suggestions. And Luciano actually has implemented an XGBoost add-in, a different add-in that goes with JMP Genomics, so I'd encourage you to check that out as well if you're using JMP Genomics. You can also call XGBoost directly within the predictive modeling framework there. So thank you very much for your attention, and I hope you can give XGBoost a try.
Monday, October 12, 2020
Kevin Gallagher, Scientist, PPG Industries   During the early days of Six Sigma deployment, many companies realized that there were limits to how much variation can be removed from an existing process. To get beyond those limits would require that products and processes be designed to be more robust and thus inherently less variable. In this presentation, the concept of product robustness will be explained followed by a demonstration of how to use JMP to develop robust products though case study examples. The presentation will illustrate JMP tools to: 1) visually assess robustness, 2) deploy Design of Experiments and subsequent analysis to identify the best product/process settings to achieve robustness, and 3) quantify the expected capability (via Monte Carlo simulation). The talk will also highlight why Split Plot and Definitive Screening Designs are among the most suitable designs for developing robust products.     Auto-generated transcript...   Speaker Transcript Kevin Hello, my name is Kevin Gallagher. I'll be talking about designing robust products today. I work for PPG industries which is headquartered in Pittsburgh, Pennsylvania, and our corporate headquarters is shown on the right hand side of the slide. PPG is a global leader in development of paints and coatings for a wide variety of applications, some of which are shown here. And I personally work in our Coatings Innovation Center in the northern suburb of Pittsburgh, where we have a strong focus on developing innovative new products. In the last 10 years the folks at this facility have developed over 600 US patents and we've received several customer and industry awards. I want to talk about how to develop robust products using design of experiments and JMP. So first question is, what do we mean by a robust product? And that is a product that delivers consistent results. 
And the strategy of designing a robust product is to purposely set control factors for inputs to the process, which we call X's, to desensitize the product or process to noise factors that are acting on the process. Noise factors are inputs to the process that can potentially influence the Y's, but over which we generally have little control, especially in the design phase of the product or process. When thinking about robust design, it's good to start with a process map that's augmented with variables that are associated with the inputs and outputs of each process step. So if we think about an example of developing a coating for an automotive application, we first start with designing that coating formulation, then we manufacture it. Then it goes to our customers and they apply our coating to the vehicle, and then you buy it and take it home and drive the vehicle. So when we think about robustness, we need to think about three things. We need to think about the output that's important to us. In this example, we're thinking about developing a premium appearance coating for an automotive vehicle. We need to think about some of the noise variables that cause the Y to vary. And in this particular case, I want to focus on variables that are really in our customers' facilities. Not that they can't control thickness and booth temperature and applicator settings, but there's always some natural variation around all of these parameters. And for us, we want to be able to focus on factors that we can control in the design of the product, to make the product insensitive to those variables in our customers' process so they can consistently get a good appearance. So one way to do this is to run a designed experiment around some of the factors that are known to cause that variability. In this particular example, we could design a factorial design around booth humidity, applicator setting, and thickness.
This assumes, of course, that you can simulate those noise variables in your laboratory, and in this case, we can. So we can run this same experiment on each of several prototype formulations; it could be just two as a comparison or it could be a whole design of experiments looking at different formulation designs. Once we have the data for this, one of the best ways to visualize the robustness of a product is to create a box plot. So I'm going to open up the data set comparing two prototype formulations tested over a range of application conditions, and in this case the appearance is measured so that higher values of appearance are better. So ideally we want, we'd like high values of appearance and then consistently good over all of the different noise conditions. So to look at this, we could, we can go to the Graph Builder. And we can take the appearance and make that our y value; prototype formulas are X values. And if we turn on the box plot and then add the points back, you can clearly see that one product has much less variation than the other, thus be more robust and on top of that, it has a better average. Now the box plots are nice because the box plots capture the middle 50% of the data and the whiskers go out to the maximum and minimum values, excluding the outliers. So it makes a very nice visual display of the robustness of a product. So now we want to talk about how do we use design of experiments to find settings that are best for developing a product that is robust. So as you know, when you design an experiment, the best way to analyze it is to build a model. Y is a function of x, as shown in the top right. And then once we have that model built, we can explore the relationship between the Y's and the X's with various tools in JMP, like in the bottom right hand corner, a contour plot and on and...also down there, prediction profiler. 
These allow us to explore what's called the response surface or how the response varies as a function of the changing values of the X factors. The key to finding a robust product is to find areas of that response surface where the surface is relatively flat. In that region it will be very insensitive to small variations in those input variables. An example here is a very simple example where there's just one y and one x And the relationship is shown here sort of a parabolic function. If we set the X at a higher value here where the, where the function is a little bit flatter, and we we have some sort of common cause variation in the input variable, that variation will be translated to a smaller amount of variation in the y, than if we had that x setting at a lower setting, as shown by the dotted red lines. In a similar way, we can have interactions that transmit more or less variation. This example we have an interaction between a noise variable and and a control variable x. And in this scenario, if there's again some common cause variation associated with that noise variable, if we have the X factor set at the low setting, that will transmit less variation to the y variable. So now I want to share a second case study with you where we're going to demonstrate how to build a model, explore the response surface for flat areas where we could make our settings to have a robust product, and finally to evaluate the robustness using some predictive capability analysis. This particular example, a chemist is focused on finding the variables that are contributing to unacceptable variation in yellowness of the product and that yellowness is measured with a spectrum photometer with with the metric, b*. The team did a series of experiments to identify the important factors influencing yellowing, and the two most influential factors that they found were the reaction temperature and the rate of addition of one of the important ingredients. 
So they decided to develop full factorial design with some replicated center points, as shown on the right hand corner. Now, the team would like to have the yellowness value (b*) to be set to a target value of 2 but within a specification of 1 to 3. I'm going to go back into JMP and open up the second case study example. It's a small experiment here, where the factorial runs are shown in blue and the center points in red. And again, the metric of interest (B*) is listed here as well. Now the first thing we would normally do is fit, fit the experiment to the model that is best for that design. And in this particular case, we find a very good R square between the the yellowness and the factors that we're studying, and all of the factors appear to be statistically significant. So given that's the case, we can begin to explore the response surface using some other tools within JMP. One of the tools that we often like to use is the prediction profiler, because with this tool, we can explore different settings and look to find settings where we're going to get the yellowness predicted to be where we want it to be, a value of 2. But when it comes to finding robust settings, a really good tool to use is the the contour profiler. It's under factor profiling. And I'm going to put a contour right here at 3, because we said specification limits were 1 to 3 and at the high end (3), anywhere along this contour here the predicted value will be 3 and above this value into the shaded area will be above 3, out of our specification range. That means that anything in the white is expected to be within our specification limits. So right now the way we have it set up, anything that is less than a temperature at 80 and a rate anywhere between 1.5 and 4 should give us product that meets specifications on average. But what if the temperature in in the process that, when we scale this product up is, is something that we can't control completely accurate. 
So there's going to be some amount of variation in the temperature. So how can we develop the product and come up with set points so that the product will be insensitive to temperature variation? To think about that, it's often useful to add some contour grid lines to the contour plot overlay here. And I like to round off the low value and the increment, so that the contours are at nice even numbers: 1.5, 2, 2.5, and 3, going from left to right. So anywhere along this contour here should give us a predicted value of 2. But do we want to be down here where the contours are close together, or up here where they're further apart with respect to temperature? As the contours get further apart, that's an indication that we're nearing a flat spot in the response surface. So to be most robust to temperature, we want to be near the top here. A setting of near 75 degrees and a rate of about 4 might be most ideal. And we can see this also in the prediction profiler when we have these profilers linked, because in this setting, we're predicting the b* to be 2. Notice that the relationship between b* and temperature is relatively flat, but if I click down to this lower level, now even though the b* is still 2, the relationship between b* and temperature is very steep. So if we know something about how much variation is likely to occur in temperature when we scale this product up, we can actually use the model that we've built from our DOE to simulate the process capability into the future. And the way we can do that with JMP is to open up the simulator option. It allows us to input random variation into the model in a number of different ways, and then use the model to calculate the output for those selected input conditions. We could add random noise, like common cause variation that could be due to measurement variation and such, into the design. We can also put random variation into any of the factors.
In this case we're talking about maybe having trouble controlling the temperature in production, so we might want to make that a random variable. It sets the mean to where I have it set, so I'm just going to drag it down a little bit to the very bottom. So it's about a mean of 70. And then JMP has a default of a standard deviation of 10. You can change that to whatever makes sense for the process that you're studying. But for now, I'm just going to leave that at 10, and you can choose to randomly select from any distribution that you want. I'm going to leave it at the normal distribution. I'm going to leave the rate fixed. So maybe in this scenario, we can control the rate very accurately, but the temperature, not as much. So we want to make sure we're selecting our set points for rate and temperature so that there is as little impact of temperature variation on the yellowness as possible. We can evaluate the results of this simulation by clicking the Simulate to Table, Make Table button. Now there are 5,000 rows here that have been simulated, and every row is a random selection of temperature from the distribution shown here, with the rate held fixed. We can then compare the simulated results with the specification limits that we have for this product, and we can do that with the process capability analysis. Since I already have the specification limits as a column property, they're automatically filled in, but if you didn't have them filled in, you can type them in here. Simply click OK, and now it shows us the capability analysis for this particular product. It shows us the lower spec limit, the upper spec limit, and the target value, and it overlays those on the distribution of responses from our simulation. In this particular case, the results don't look too promising, because there's a large percentage of the product that seems to be outside of the specification. In fact, 30% of it is outside.
And if we use the capability index Cpk, which compares the specification range to the range of process variation, we see that the Cpk is not very good at 0.3.
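For readers who want to see what a capability index of this kind measures, the textbook formula is Cpk = min(USL - mean, mean - LSL) / (3 * sigma). The sketch below applies it to a column of simulated b* values with the 1-to-3 specification from the talk. The data values and function names are made up for illustration, and the sample standard deviation is used for sigma, which is an assumption: JMP's capability report may estimate sigma differently.

```python
import statistics

def cpk(values, lsl, usl):
    """Textbook Cpk: distance from the mean to the nearer spec limit,
    divided by three standard deviations (sample SD used here as an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return min(usl - mean, mean - lsl) / (3 * sd)

def fraction_out_of_spec(values, lsl, usl):
    """Share of values falling below the lower or above the upper spec limit."""
    return sum(1 for v in values if v < lsl or v > usl) / len(values)

# Made-up b* values standing in for rows of a simulation table.
simulated_b_star = [1.2, 2.9, 3.4, 0.8, 2.1, 2.6, 1.7, 3.1]

# Specification limits from the talk: b* between 1 and 3.
print(round(cpk(simulated_b_star, lsl=1.0, usl=3.0), 3))
print(fraction_out_of_spec(simulated_b_star, lsl=1.0, usl=3.0))
```

A Cpk well below 1 means the nearer specification limit sits much less than three standard deviations from the mean, which goes along with a sizeable out-of-spec fraction.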
Monday, October 12, 2020
Bradley Jones, JMP Distinguished Research Fellow, SAS   JMP has been at the forefront of innovation in screening experiment design and analysis. Developments in the last decade include Definitive Screening Designs, A-optimal designs for minimizing the average variance of the coefficient estimates, and Group-orthogonal Supersaturated Designs. This tutorial will give examples of each of these approaches to screening many factors and provide rules of thumb for choosing which to apply for any specific problem type.     Auto-generated transcript...   Speaker Transcript Bradley Jones Thanks for joining me. I'm going to give a talk on 21st century screening designs. The topic of screening designs has changed over the last 20 years. I'd like to talk about the three kinds of screening designs that I would recommend, all of which are available in JMP. A-optimal screening designs, definitive screening designs (DSDs), and group orthogonal super saturated screening designs (GO SSDs). Now that might be a little surprising to you because the default in the custom designer has been D-optimal for around 20 years. However, over the last few years I've come to the conclusion that A-optimal designs are better than D-optimal designs. I'm going to be illustrating that point in the first section of this talk. Those are the three designs I'm comparing. I'm going to start with A-optimal designs. My slide has a graph showing exactly why I think that A-optimal designs might, in many cases, be better than D-optimal screening designs. The example is a four-factor 12-run design for fitting all the main effects and all the two-factor interactions of four continuous factors. You can see that for the A-optimal design here... I'm showing there are only three cells where there are non-zero correlations in the correlation cell plot. The rest of the pairwise correlations are all zero for the A-optimal design. 
For the D-optimal design, there are a lot of correlations all over the place, and that makes model selection harder. This concludes my first example. What is A-optimal, you might ask? What does the "A" in A-optimality stand for? The "A" stands for the average variance of the parameter estimates for the model. The parameter estimates are the estimates of the four main effects and the six two-factor interactions. If you minimize the average of their variances, you tend to do a reasonable job at lowering every one of them, at least doing better than some other approaches. The way to remember what A-optimal means is that the "A" stands for average, the average variance of the parameter estimates. That makes the A-optimality criterion easy to understand. Everybody understands what an average is. We have all these estimates. We want the variances of those estimates to be small in general, and that's what the A-optimality criterion does in a direct way. The other nice thing about A-optimal designs is that they allow for putting different emphasis on groups of parameters through weighting. Now you could weight D-optimal designs, but the weighting on a D-optimal design doesn't change the design, so it's kind of useless. Whereas when you weight A-optimal designs, you get differential weighting on the parameters. You might say, well, main effects are more important to me than two-factor interactions, so I want to weight them higher. We have features in JMP for doing just that. I have two different A-optimal design demonstrations. One is the one I just showed you, that is, the output for four factors and 12 runs, D-optimal versus A-optimal, for all the two-factor interactions and main effects. The second example is a five-factor experiment with 20 runs and a model having main effects and all the two-factor interactions, comparing D-optimal with a weighted A-optimal design where I put more weight on the main effects.
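As a small worked illustration of the two criteria, the sketch below computes the average variance of the estimates, trace((X'X)^-1) / p in units of the error variance, and the determinant of X'X for a model matrix X. It is not JMP's implementation: the function names and the tiny two-factor designs are assumptions chosen so the arithmetic is easy to check by hand, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def information_matrix(X):
    """Return X'X for a model matrix X given as a list of rows."""
    p = len(X[0])
    return [[sum(Fraction(row[i]) * row[j] for row in X) for j in range(p)]
            for i in range(p)]

def inverse_and_det(M):
    """Gauss-Jordan elimination with exact fractions; returns (inverse, determinant).

    Raises StopIteration if M is singular (no non-zero pivot can be found).
    """
    n = len(M)
    A = [list(row) + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    det = Fraction(1)
    for c in range(n):
        pivot = next(r for r in range(c, n) if A[r][c] != 0)
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            det = -det  # a row swap flips the sign of the determinant
        pivot_value = A[c][c]
        det *= pivot_value
        A[c] = [v / pivot_value for v in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A], det

def a_criterion(X):
    """Average variance of the estimates (relative to sigma^2); smaller is better."""
    inv, _ = inverse_and_det(information_matrix(X))
    return sum(inv[i][i] for i in range(len(inv))) / len(inv)

def d_criterion(X):
    """Determinant of X'X; larger is better."""
    _, det = inverse_and_det(information_matrix(X))
    return det

# Columns: intercept, X1, X2. A full 2x2 factorial versus a design that
# repeats one corner instead of running the fourth corner (both hypothetical).
full_factorial = [[1, -1, -1], [1, 1, -1], [1, -1, 1], [1, 1, 1]]
repeated_corner = [[1, -1, -1], [1, 1, -1], [1, -1, 1], [1, -1, -1]]

for name, design in [("full", full_factorial), ("repeated", repeated_corner)]:
    print(name, a_criterion(design), d_criterion(design))
```

For the orthogonal full factorial, X'X is 4 times the identity, so every estimate has variance 1/4 and the average is 1/4; the unbalanced design scores worse on both criteria.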
Let me switch over to JMP. Here's the D-optimal design for four factors and 12 runs and here's the A-optimal design for the same problem. And now I want to compare these two designs. If we look at the relative estimation efficiency, the A-optimal design has higher estimation efficiency for every parameter except for the parameter that is the interaction between X1 and X3. If we look at the average correlations, for the A-optimal design the average absolute correlation is only 0.02, whereas for the D-optimal design the average absolute correlation is 0.116. So there is almost six times as much correlation in the D-optimal design as in the A-optimal design. The A-optimal design isn't quite as D efficient as the D-optimal design. Of course, the D-optimal design has been chosen to be the most D efficient that you can be. The fact that the A-optimal design is still 97+% D efficient is really good. But look, the A-optimal design is 87.5% more G efficient than the D-optimal design. So the A-optimal design is reducing the worst possible variance of prediction. Of course, the A efficiency of the A-optimal design is 14.5% higher than the A efficiency of the D-optimal design. And the I efficiency of the A-optimal design is 16% higher than that of the D-optimal design. All of these efficiency measurements and all of these correlations make it pretty clear to me that the A-optimal design in this case is far better than the D-optimal design. And that's one of the reasons why A-optimality is now the default for screening including two-factor interactions in JMP 16. Let's look at the second example. Here I have a five-factor experiment. Here's the five-factor 20-run D-optimal design and then a five-factor 20-run weighted A-optimal design. Now I'm going to show you the JMP script that creates this design. And what I want to point out is these parameter weights. 
This weight factor is saying, I want to weight the intercept and the five main effects 100 times higher than the 10 two-factor interactions. Let's see what that does to the design. Now we'll compare designs. And see, here is the efficiency of the D-optimal design with respect to the weighted A, and the weighted A is estimating the five main effects all better than the D-optimal design: this 97% through 99% is the relative efficiency of the D-optimal to the weighted A-optimal for estimating the main effects. The effect of weighting the main effects is that you get a little bit less good variance for the two-factor interactions. Here you're making your estimation of the main effects better at the expense of making your estimation of the two-factor interactions a little worse. But that is what you wanted by weighting the main effects so heavily. Again, here the average absolute pairwise correlation for the weighted A-optimal design is about 50% better than that for the D-optimal design. And again, if we look at the A-optimal correlation cell plot versus the D-optimal correlation cell plot, you can see that there are a lot more white cells in the A-optimal plot, which means these pairwise correlations are zero. There are a lot of zeros here. And that means that it will be a lot easier to separate the true active effects from the inactive effects. The D-optimal design is only a half a percent better than the A-optimal design, even on its own criterion, but the A efficiency is right at nominal and the A-optimal design is more G efficient and also more I efficient than the D-optimal design. Now here, it might be a little bit harder to say which one you would choose. For me, I said I wanted to be able to estimate main effects better and I can. Those are my two demonstrations for the A-optimal design. Let me go back to my slides. 
Okay, so what I want to do for each of these three kinds of designs is give you an idea about when you would use one in preference to the others. A-optimal designs and the other designs supported in the custom designer are more flexible than other designs. They allow you to solve more different kinds of problems. In particular, you would use an A-optimal screening design if you have a lot of categorical factors, especially if there are more than two levels. If there are certain factor level combinations that are not feasible, like for instance if you couldn't do the high setting of all the factors, then you would want an A-optimal design because the other two screening designs wouldn't be able to handle that. And when you have a particular model in mind that you want to fit that's not just the main effects or the main effects plus two-factor interactions, you would use the A-optimal screening design rather than a definitive screening design or a group orthogonal supersaturated design. Moreover, you would use an A-optimal design if you want to put more weight on some group of effects than on other groups of effects, and I showed you an example of that where I wanted to put more weight on the main effects than on the two-factor interactions. Let's move on to definitive screening designs now. Definitive screening designs first appeared in a published journal article in 2011 and they appeared in JMP at roughly the same time. They are a very interesting design. You can see, looking at the correlation cell plot here, that in a definitive screening design the main effects are all orthogonal to each other. That is, there is no pairwise correlation between any pair of main effects, but also the main effects are uncorrelated with all the two-factor interactions and all the quadratic effects. So that makes it very easy to see which main effects are important. 
Because definitive screening designs have far fewer runs than a response surface design, there are correlations between two-factor interactions, between two-factor interactions and quadratic effects, and among the quadratic effects. But notice each quadratic effect is uncorrelated with all the two-factor interactions that involve its own factor. This is the quadratic effect of X1; that effect is completely orthogonal to all of these effects that have X1 in the two-factor interaction. And that's the same for all of the factors. The squared term for X2 is orthogonal to all of the effects that have X2 in them and so on. That's an interesting property that turns out to be quite useful. What does the DSD look like? Well, they have three main properties that I want to tell you about. The first property shows up if we look at Run 1 and Run 2. Whenever there's a +1 in Run 1, there's a -1 in Run 2, and vice versa: whenever there's a -1 in Run 1, there's a +1 in Run 2, so that Run 1 and Run 2 are kind of mirror images of each other. The same thing is true of Run 3 and Run 4, Run 5 and Run 6, Run 7 and Run 8, Run 9 and 10, and 11 and 12. So this design is what's called a fold-over design, and that folding over is what makes the main effects uncorrelated with the two-factor interactions. Of course, I have this design sorted in such a way that you can see this structure, but when you run this design, you should actually randomize the order of the runs. The second thing I want to point out is that each factor is at its center level in one pair of runs. For example, in the first pair of runs, A is at its center level; in the second pair of runs, B is at its center level. And in the third pair of runs, C is at its center level; in the fourth pair of runs, D is at its center level and so on. And then the last thing to notice is that there's one overall center run. We have six factors, Factors A through F, that's six. And the number of runs is 2x6+1. 
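The three structural properties just described can be checked numerically. The sketch below is an illustration under stated assumptions: it builds a six-factor, 13-run design by stacking a 6x6 conference matrix, its mirror image, and one center run, which is one known way to obtain a design with this structure. The talk does not say how JMP generates its DSDs, and the run order here is the sorted one, not a randomized one. The matrix `CONFERENCE_6` and all function names are chosen for this example.

```python
from itertools import combinations

# A symmetric 6x6 conference matrix: zero diagonal, +/-1 elsewhere,
# and pairwise orthogonal columns.
CONFERENCE_6 = [
    [0,  1,  1,  1,  1,  1],
    [1,  0,  1, -1, -1,  1],
    [1,  1,  0,  1, -1, -1],
    [1, -1,  1,  0,  1, -1],
    [1, -1, -1,  1,  0,  1],
    [1,  1, -1, -1,  1,  0],
]

def build_dsd(conference):
    """Each conference-matrix row becomes a run followed by its mirror image."""
    design = []
    for row in conference:
        design.append(list(row))
        design.append([-v for v in row])
    design.append([0] * len(conference[0]))  # the single overall center run
    return design

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

dsd = build_dsd(CONFERENCE_6)
k = len(dsd[0])
main = [[run[j] for run in dsd] for j in range(k)]
quad = [[x * x for x in col] for col in main]
inter = {(i, j): [a * b for a, b in zip(main[i], main[j])]
         for i, j in combinations(range(k), 2)}

# Main-effect and interaction columns sum to zero in this design, so a zero
# dot product below also means a zero correlation.
main_vs_main = max(abs(dot(main[i], main[j])) for i, j in combinations(range(k), 2))
main_vs_inter = max(abs(dot(m, c)) for m in main for c in inter.values())
main_vs_quad = max(abs(dot(m, q)) for m in main for q in quad)
quad_vs_own_inter = max(abs(dot(quad[f], c))
                        for f in range(k) for pair, c in inter.items() if f in pair)

print(len(dsd), "runs for", k, "factors")
print(main_vs_main, main_vs_inter, main_vs_quad, quad_vs_own_inter)
```

Each of the four printed maxima is zero for this matrix, matching the claims that main effects are orthogonal to each other, to all two-factor interactions, and to all quadratic effects, and that each quadratic effect is orthogonal to the interactions involving its own factor.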
And one model that we can fit with this design is the model that contains the intercept, all six main effects, and all six quadratic effects. So that's 13 different effects in a 13-run design. It's amazingly efficient in terms of the allocation of runs to effects. Now, what are the positive consequences of having a definitive screening design? Well first, an active two-factor interaction doesn't affect the estimate of any main effect because they are uncorrelated. That makes it the case that any single active two-factor interaction can be identified uniquely. The same thing is true of any single quadratic effect as long as that quadratic effect is large with respect to the noise. And then one really interesting consequence is that, it turns out, if only three factors are active, let's say factors A, C and E, then I can fit a full quadratic model in those three factors. A full quadratic model is the kind of model that people fit when they're doing response surface methods, or RSM. And it doesn't matter which three factors are the active ones. The DSD will always be able to fit a full quadratic model, no matter which three factors turn out to be active. That's a very powerful thing. The result is that if you're lucky enough that only three out of the six factors are important, you can skip the RSM step in some cases and do RSM and screening in one fell swoop. I've told you all the good things about DSDs. There is one bad thing, or maybe less good thing, and that is, because of these zeros in each column, the main effects are not estimated as precisely in a definitive screening design as they are in a design that would be an orthogonal design with one center run. As a result, confidence intervals on the main effects for a DSD are going to be around 10% wider than the confidence intervals you would get if you had run, say, a Plackett-Burman design with the same number of runs. 
That's a very small price to pay, in my view, for all the benefits that you get, particularly the benefit of being able to fit quadratic effects, which you can't do with a Plackett-Burman design having a single center run. There you can identify that you need to be able to fit quadratic effects because you have the center run, but you don't know which of the factors has the high curvature. Now, when would you use a DSD? Well, you use them when most of the factors are continuous, and if they're continuous, you might have the factor levels set far enough apart that you're concerned about possible nonlinearity or curvature in the effect of a factor on a response. And then you're also concerned about the possibility of two-factor interactions, although DSDs cannot promise you that you can fit all the two-factor interactions, because there just aren't enough runs. If there are a couple or three two-factor interactions, you're likely to be able to identify them with a DSD. Okay, let's go back to JMP and back to the journal here. This is a DSD. It's a DSD that has six factors, but instead of using the 13-run design, I created the eight-factor design, which has 17 runs, and just dropped the last two factors. For comparison, I also created a full factorial design for this problem and fit the model to it. Let me show you the parameter estimates. The parameter estimates that are significant in the full factorial design are A, B, and C, A*B, A*C, B*C and A squared. Now, let me show you the analysis that you would get by doing Fit Definitive Screening. I have time as my response, and A through F as my factors. And let's look at the model that comes out. The model that this is finding has A, B, C and E; E is a spurious effect. So that's a Type I error, but it identified all three two-factor interactions and the quadratic effect of A. 
The full factorial design that I showed you here had three to the sixth runs, which is 729 runs. This design here only has 17 runs. The fact that I was able to identify all the correct terms with far, far fewer runs is an eye opener for why definitive screening designs are really great. Let's go back to the slides. DSDs are great when almost all of the factors are continuous. You can accommodate a couple or three two-level categorical effects. And you can also block definitive screening designs much more flexibly than you can block fractional factorial designs. For a six-factor definitive screening design, you can have anywhere between two and six blocks. And the blocks are orthogonal to the main effects. That's another amazing thing about these designs. Let's move on to the newest of the screening designs. These have just been discovered in the last couple of years and the publication in Technometrics just came out in the last week or so. It's been online for a year, but the actual printed article came to my house in my mail just a week or so ago. This is a correlation cell plot of a group orthogonal supersaturated design and you might notice all this gray area. Most of the time, if you look at a supersaturated design, the correlation cell plot has correlations everywhere. Here we only see correlations in groups of factors. This group of factors is correlated. This other group of factors is correlated. This group of factors is correlated and this other group of factors is correlated, but there are no correlations between any pair of groups of factors. The only correlations that you see are within a group, not between groups, and that helps you with analyzing the data. Here's a picture of the first page of the published article, which, as I just said, went into print just last week or so. 
My coauthors are Chris Nachtsheim, this guy here, from the University of Minnesota; my colleague from JMP, Ryan Lekivetz; Dibyen Majumdar, who's an Associate Dean at the University of Illinois in Chicago in the statistics department; and Jon Stallrich, who is a professor at NC State. Let me talk about why you might even be interested in group orthogonal supersaturated designs or supersaturated designs at all. And then I'll show how we make a group orthogonal supersaturated design. I will show how to analyze them, except that you don't have to learn how to analyze them because there's an automatic analysis tool in JMP that's right next to the designer. And then I'll show you how the two-stage analysis of these "GO SSDs", as we call them, compares to more generic analytical approaches. Then I'll make some conclusions. What's a supersaturated design? A supersaturated design has more factors than runs. For example, you might have something like 20 factors and you only have 12 runs to investigate them in. Then the question you might ask yourself is, "Is this a good idea?" Well, a former colleague of mine, who retired about 15 years ago or so, told me supersaturated designs are evil. I do understand why he felt that way. The problem with a supersaturated design is that you can't do multiple regression, because you have more factors than runs, so the matrix that you want to be able to invert is not invertible. And then also the factor aliasing is typically complex, although in these group orthogonal supersaturated designs, it's a lot less complex. And there's this general feeling that you can't get something for nothing. It feels like you're not putting enough resources into the design to get anything good out of it. Supersaturated designs were first discussed in the literature by a mathematician by the name of Satterthwaite. 
His paper was roundly excoriated by a lot of the high-end statisticians of the day. They even, you know, laughed at him to a large degree. Then three years later, Booth and Cox discussed the possibility of systematically generating a supersaturated design, and they had a selection criterion which said: look at all of the squared pairwise correlations, so they're all positive numbers, look at the average of those, and make that as small as possible, even though the design cannot be orthogonal, because in order to have an orthogonal design you have to have more runs than factors. The criterion of Booth and Cox is trying to find the closest to an orthogonal design that it can, given that there are fewer runs than factors. We think that John Tukey was the first to use the term "supersaturated" in his discussion of Satterthwaite's paper in 1959. Here's what Tukey said, "Of course constant balance can only take us up to saturation (one of George Box's well-chosen terms) up to the situation where each degree of freedom is taken up with the main effect (or something else we are prepared to estimate)." As a result, a saturated design has no degrees of freedom left over to estimate the variance. But Tukey then says, "I think it's perfectly natural and wise to do some supersaturated experiments." But in general, the statistics community didn't take that to heart and nothing happened for 30 years after that. 30 years later, Jeff Wu, who's now a professor at Georgia Tech, wrote a paper in Biometrika talking about one way of making supersaturated designs. And the same year in Technometrics, Dennis Lin, who's now the chair of the Statistics Department at Purdue, wrote a paper about another way to create supersaturated designs, and both of these papers were very interesting and they brought supersaturated designs back into people's consciousness. When would you use a supersaturated design? Well, one time that you would use it is when runs are super expensive. 
If a run costs a million dollars to do, you don't want to do very many runs. You want to do as few as possible. And if you can do fewer runs than you have factors, all the better. I don't know about you, but I've done a lot of brainstorming exercises with stakeholders of processes, and it's very easy, when everybody writes a sticky note with a factor they think might be active, to get three dozen sticky notes on the wall, and they're all different. So, what are you supposed to do then? Well, what often happens there is, people are used to doing screening experiments with 6, 7, 8, maybe even 10 factors. But people get really nervous doing a screening experiment that has three dozen factors. And so, what happens is, after this brainstorming session, the engineers decide, well, we can get rid of maybe 20 of these factors because we know better than the people who picked those 20 factors. What I'm afraid of is that when you do that you might be throwing the baby out with the bathwater, so to speak. The most important factor might be one of those 20 that you just decided to ignore. And the factors that you end up looking at may look like there's a huge amount of noise because they're not taking account of this other more important factor that was left out. I think that eliminating factors without any data is unprincipled. It's definitely not a statistical approach. Now, how do we construct these GO SSDs? Well, we start with a Hadamard matrix, call it H, of order m, and m has to be 0 mod 4, which is just a fancy way of saying that m needs to be a multiple of four. And then we take another matrix T, which is a matrix of plus and minus ones that is w rows by q columns. Then we take the Kronecker product of H and T. By the way, that zero with an "X" in the middle of it is a symbol for the Kronecker product and I'll explain what that is in the next slide. 
That operation gives us a structure for X that has mw rows and mq columns, where w is less than q, so mw is less than mq. As a result, this is now a supersaturated design. We recommend that T be a Hadamard matrix with fewer than half of the rows removed. Here's an example. Here's my H. It's a Hadamard matrix. And one of the things about a Hadamard matrix is that the columns of a Hadamard matrix are pairwise orthogonal. If we look at the main effects, they're pairwise orthogonal. In this example, T is just H with this last row removed. And then, for this Kronecker product of H and T, every element of H that's a plus one, I replace with T. And every element of H that's a minus one, I replace with -T. Now, T is a matrix. I'm replacing a single number with a three by four matrix everywhere here. H is four by four, but this new matrix here is 12 by 16. Now I have this much bigger matrix that I've formed by taking the Kronecker product of H and T. Of course, you don't need to do this yourself. JMP will tell you which ones you can make in the GO SSD designer and it'll all happen automatically. Call the Kronecker product of H and T "X". If you look at X'X, which is what you look at to understand the correlations of the factors, X'X looks like this, where these are the blocks that I showed you before. It's block diagonal. The first group of factors is uncorrelated with all the other groups of factors. The second group of factors is uncorrelated with all the others, and so forth. That's a very nice property. Here again is the example. I have a Hadamard matrix that's 4x4. So, m is 4. My matrix T, which I was just talking about, is 3x4, where I've just removed the last row from H, and so w is 3 and q is 4. So the number of rows is m times w, or four times three, or 12. And the number of columns is m times q, which is four times four, or 16. Now the first column is all ones, so this is the constant column. 
It's what you would use to estimate the intercept. The next three columns are correlated. The next four columns, columns A through D, are correlated with each other, but not with anything else. E through H are correlated with each other, but not with anything else; and I through L are correlated with each other, but not with anything else. You get this block diagonal correlation structure. Now, we have three groups of four factors that are correlated with each other. And then we have this first group of factors (A, B and C) that are correlated with each other more, and they're also unbalanced columns. So what my colleagues and I recommend is that instead of actually assigning factors to A, B, and C, you leave them free. You don't assign them factors. And so instead of having 15 factors you only have 12 factors. Because A, B, and C are uncorrelated with all these other factors, we can use A, B and C to estimate the error variance. Now, in supersaturated designs there's never been a way to estimate the error variance within the design itself. This is a property of group orthogonal supersaturated designs that doesn't exist anywhere else in supersaturated design land. Now we have three independent groups of four factors and each factor group has rank three. Now the three fake factor columns (A-C) have rank two, so I can estimate sigma squared with two degrees of freedom, assuming that two-factor interactions are negligible. Now notice when I created this Kronecker product, this group of factors is a foldover. This is a foldover. And this is a foldover. So you have three groups of factors that are all foldover designs. Remember that the DSD was an example of a foldover design where all the factors were foldovers, but this particular structure gives you some interesting properties. Any two-factor interactions involving factors in the same group are orthogonal to the main effects in that group. 
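The 12-run, 16-column example described above can be reproduced in a few lines. This is a minimal Python sketch rather than the JSL script used in the talk. It assumes the Sylvester form of the 4x4 Hadamard matrix, since the talk does not show which 4x4 Hadamard matrix is on the slide, and the helper names are chosen for this example.

```python
# 4x4 Hadamard matrix (Sylvester form); its columns are pairwise orthogonal.
H = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def kron(a, b):
    """Kronecker product: each entry a[i][j] is replaced by the block a[i][j] * b."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

m = len(H)                        # order of H
T = [row[:] for row in H[:-1]]    # H with its last row removed
w, q = len(T), len(T[0])          # w = 3 rows, q = 4 columns
X = kron(H, T)                    # (m*w) x (m*q) = 12 x 16

cols = [list(c) for c in zip(*X)]
gram = [[dot(u, v) for v in cols] for u in cols]   # X'X

def group(c):
    """Columns 0-3 form the first block, 4-7 the second, and so on."""
    return c // q

n_cols = m * q
between_groups = max(abs(gram[i][j]) for i in range(n_cols) for j in range(n_cols)
                     if group(i) != group(j))
within_groups = min(abs(gram[i][j]) for i in range(n_cols) for j in range(n_cols)
                    if group(i) == group(j) and i != j)
column_sums = [sum(c) for c in cols]

print(len(X), "runs x", len(X[0]), "columns")
print("largest |X'X| entry between groups:", between_groups)
print("smallest |X'X| entry within a group:", within_groups)
print("column sums:", column_sums)
```

With this H, the between-group entries of X'X are all zero, which is the block-diagonal structure. The first column is the constant column, the next three columns have nonzero sums (the unbalanced "fake factor" columns), and the remaining twelve columns are balanced.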
I was looking at the main effects in group one; all the main effects in group one are uncorrelated with all the two-factor interactions in that group. But wait, there's more. All the main effects in group one are uncorrelated with two-factor interactions where one of the factors is in group one and the other factor is in group two. The last thing is that all the main effects in group one are uncorrelated with all the two-factor interactions in any other group. So this construction of a supersaturated design not only gives you a good way of estimating main effects, it also protects you from a lot of two-factor interactions. The only thing that you don't get protection from is if you have a main effect in group one and a two-factor interaction involving main effects in two different groups; then that is not necessarily uncorrelated. Together all of these properties make you want to think about how to load factors into the groups. One strategy would be to say, "Well, I want to put all the factors that I think are most likely to be highly important into one group." In that case, those factors will be uncorrelated with any of the two-factor interactions involving those factors. That's good. And then you would be more likely to have inactive groups, and if a group is inactive, then you can pool those factor estimates into the estimate of sigma squared and give the estimate of sigma squared more degrees of freedom, which means that you'll have more power for detecting other effects. The second strategy would be to put all the effects that you think are most likely to be active in different groups. That is, put what I think a priori is the most active effect in group one, my second most active effect in group two, my third most active effect in group three, and so on. Now, if you have your most likely effects in separate groups, you're less likely to have confounding of one factor effect with another factor effect. 
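The interaction-protection properties listed earlier in this passage can be checked on the 12-run example. The sketch below is an illustration in Python, again assuming the Sylvester 4x4 Hadamard matrix and treating columns 4 through 15 as the twelve real factors; the case labels in `worst` are names invented for this example.

```python
from itertools import combinations

H = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

T = H[:-1]                       # H with its last row removed
q = len(T[0])
X = kron(H, T)
cols = [list(c) for c in zip(*X)]
real = range(q, len(cols))       # skip the constant and fake-factor columns

def group(c):
    return c // q

# Largest |main effect . two-factor interaction| seen in each situation.
worst = {"same group": 0, "shares the group": 0,
         "one other group": 0, "two other groups": 0}

for main in real:
    g = group(main)
    for a, b in combinations(real, 2):
        interaction = [x * y for x, y in zip(cols[a], cols[b])]
        value = abs(dot(cols[main], interaction))
        ga, gb = group(a), group(b)
        if ga == g and gb == g:
            case = "same group"          # both factors in the main effect's group
        elif g in (ga, gb):
            case = "shares the group"    # one factor in that group, one elsewhere
        elif ga == gb:
            case = "one other group"     # both factors in a single other group
        else:
            case = "two other groups"    # factors split across two other groups
        worst[case] = max(worst[case], value)

for case, value in worst.items():
    print(case, value)
```

The first three cases come out as zero, and only the last one, an interaction between factors from two groups that are both different from the main effect's group, shows a nonzero inner product. That is the one unprotected situation mentioned in the talk.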
My coauthors and I recommend the second strategy. This is a table in the paper that just got published and it shows you all of the group orthogonal supersaturated designs up to 128 factors and 120 runs. Now, how do you analyze these things? Well, you can leverage the fake factors to get an unbiased estimate of the variance. You can use the group structure to identify active groups first, and then after you know which groups are active, you can do regression to identify active factors within each group. And as you go, you can pool the sum of squares and degrees of freedom for inactive groups and inactive factors into the estimate of sigma squared to give you more power for detecting effects. This is a mathematical slide that we can just skip. But what the slide means is that you can maximize your power for testing each group by making your a priori guess of the direction of the effect of each factor be positive. Now, if you thought that the effect of a factor was negative, you would just relabel the minus ones as plus ones and the plus ones as minus ones. And that would maximize your power for identifying the group. What we do is we identify groups first and then we identify active factors within the groups. Of course, this all happens automatically in the fitting. We're comparing our analytical method to the Lasso and the Dantzig selector. And then we're looking at power and Type I error rates using GO SSDs versus the standard selectors. We chose three different supersaturated designs, we made different numbers of groups active, and we looked at signal-to-noise ratios of one to three, and we had numbers of active factors per group of either one or two. The number of active factors can range from one to 14 in these cases. Looking at the graph of the power results: here are the Dantzig results, and here is the Lasso followed by our two-stage method. Some of the powers for the Dantzig and Lasso are low. 
In fact, these are the cases where the signal-to-noise ratio is one. Otherwise, the Dantzig selector and the Lasso are doing very well. In fact, they are doing as well as the two-stage method, except for these cases where the signal-to-noise ratio is small. However, the two-stage method is always finding all of the active factors. In the paper by Marley and Woods, where they did a simulation study looking at Bayesian D-optimal supersaturated designs and other kinds of supersaturated designs, they basically said you cannot identify more active factors than a third of the number of runs. Well, in our case, we have 12 factors in 12 runs, so we would expect to only be able to identify four active factors. However, in the case where n is 24, we identified 14 active effects, whereas n over three is only eight. You can see that we're doing a lot better than what Marley and Woods say that you should be able to do, given their simulation study. That is because of using a GO SSD. We did a case study to evaluate what makes JMP's custom design tool take longer to generate a design. The factors were: whether there are quadratic terms in the model, which makes it slower; whether the optimality criterion is A-optimal, which makes it slightly slower; whether you do 10 times as many random starts, which, yeah, makes it slower; and whether you have more factors, in which case it'll be slower. Here's our design. And here's the analysis. Let me show you this in JMP. So first, let me show you the design construction script. This is the script. m, remember, is the number of runs in the Hadamard matrix. There's a new JSL command called Hadamard. I'm going to just create a new script with this, and I can run the script and look at the Log. Here we go. Here's the new JMP 16 log. If I run the first three things in this script, you can see m is four, q is three. And here's my 4x4 Hadamard matrix. And taking the first three rows to create T, the group orthogonal supersaturated design is the 12x16 matrix here. 
That's the matrix and we can make a table out of it. And here's the table. So that's how easy it is to construct them by hand, but you don't have to do that. You can just get JMP to do all this for you. Looking at the pairwise column correlations, it has the same pattern that we showed you before. Here's the case study that we ran. We make our first three factors fake factors. Then we're going to use them to estimate the variance, and then these are all of the real factors. When I fit the group orthogonal supersaturated design, there are two factors that are active in the second group (that first group I'm using to estimate variance), one factor in the third group, and two factors in the fourth group. I have five factors in all that are active and I end up having six degrees of freedom for error. So that's kind of an amazing thing. Let me show you how you can do this in JMP. Here's the group orthogonal supersaturated design. I could say I'm willing to do 12 runs. You can either do two groups of size eight or four groups of size four. This is the same design that I just ran. So that's how easy it is to make one of these things, and then the analysis tool is right under the designer, so you can just choose your factors, choose your response and go, and you get the same analysis as I got before. Let me wrap up by going back to my slides. When do you want to use a GO SSD? It's when you have lots of factors, and runs are expensive, and you think that most of the factors are not active, but you don't know which ones are active and which are not. My final advice is to replace D-optimal with A-optimal for doing screening. If you were using D-optimal before, A-optimal is better. Use DSDs if you have mostly continuous factors, you're interested in possible curvature, and you don't have any restrictions about which levels of the factors you can use. Use GO SSDs, in preference to removing a lot of factors in advance of getting any data, if you have a lot of factors and it's expensive to run the runs.  
Tonya Mauldin, Principal Analytics Software Tester, SAS   Distribution is one of the most widely used platforms in JMP. It has been around since the first version of JMP. It's useful for all sorts of things, including data exploration, capability analysis, and of course, testing which distribution fits your data. Version 15 introduces a modernized distribution platform. This talk discusses the changes made to the distribution platform in Version 15 of JMP.       Auto-generated transcript...   Speaker Transcript tobake The Modernization of the Distribution Platform. My name is Tonya Mauldin, and I am the tester for distribution. For this presentation I will be using version 15.2 of JMP. Distribution is one of the most widely used platforms in JMP. This platform is not only used for testing which distribution fits your data, it is used for data exploration, capability analysis, and so much more. JMP 15 brings a modernization to this commonly used platform. Why did we decide to update this platform? Distribution has been around since the first version of JMP. It was time to make this platform more modern. We did this by improving the flow, providing a cleaner look, and making the product more consistent and easier to maintain. Distribution fitters now use the same code as generalized regression. Capability analysis now uses the same code as process capability. The fitters now include the negative binomial distribution, which is equivalent to the gamma Poisson, the Cauchy distribution, zero-inflated Poisson, and ZI negative binomial. Johnson fits are now a single command that selects the best fitting distribution from the Johnson system of distributions. This is the same method, quantile matching, that is used in the process capability platform. This method is more stable and faster than maximum likelihood. Users now have the ability to type a specific bandwidth parameter for the smooth curve fit. Standard errors have been added for the SHASH distribution. 
Let's use JMP to investigate some of these. Airline Delays is part of the sample data that comes with JMP. Let's perform a distribution analysis on the column Arrival Delay and fit a Johnson distribution. Notice that rather than seeing the three different Johnson fitters, there's one option called fit Johnson. After choosing this option, we see the best fit from the Johnson distribution family for this data is the Johnson SU distribution. Let's also add a smooth curve fit to this report. Prior to JMP 15, the only way to control the bandwidth parameter for the nonparametric density was via a slider. Now there is an option to specify a specific value for the bandwidth parameter within the user interface as well as via JSL. In addition to altering the fitters themselves, JMP has changed the way fit comparisons are made. As each fit is selected, it is added to a compare distributions report. In previous versions of JMP you only got a compare distributions report when using the All option under continuous fits. AICc weight and BIC have been added to this report to make it more consistent with other platforms in JMP. The histogram legend has been removed from the report. This information is now contained within the compare distributions report, which always appears directly below the histogram. Additionally, overlaid CDF plots have been added to the report. Let's look at some of these in JMP. This data table, Washers, is provided in the sample data that comes with JMP. Let's perform a distribution analysis on the #defective column. Since this is count data, let's fit the negative binomial. This is a new fitter for JMP 15. It is equivalent to the gamma Poisson fit. The compare distributions report was added automatically. Let's also fit the ZI negative binomial distribution to see how it compares. Remember the zero-inflated negative binomial was also a new fitter available in JMP 15. 
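To make the AICc weight column in the compare distributions report more concrete, here is a minimal Python sketch of how such weights are commonly computed from a set of AICc values (the usual Akaike-weight formula). The AICc numbers are hypothetical and are not taken from the Washers example; whether JMP rounds or displays them the same way is not something the talk states.

```python
import math

def aicc_weights(aicc_values):
    """Turn AICc values into normalized weights that sum to 1.

    Each weight is exp(-0.5 * (AICc_i - min AICc)) divided by the sum
    of those terms over all candidate fits.
    """
    best = min(aicc_values)
    relative = [math.exp(-0.5 * (a - best)) for a in aicc_values]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical AICc values for two candidate fits (illustration only).
fits = {"Negative Binomial": 310.2, "ZI Negative Binomial": 312.4}
weights = dict(zip(fits, aicc_weights(list(fits.values()))))

# List the fits with the smallest AICc first, as a sorted comparison would.
for name in sorted(fits, key=fits.get):
    print(f"{name}: AICc={fits[name]:.1f}, weight={weights[name]:.3f}")
```

The fit with the smallest AICc always receives the largest weight, so sorting by AICc or by weight gives the same ordering.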
Information about this fit was also added to the compare distributions report. You will also notice some changes within the compare distributions report itself. AICc weight (the corrected and weighted Akaike information criterion) and BIC (the Bayesian information criterion) have been added to this report. These statistics were already available in other parts of JMP, such as model comparison. They have now been added to the distribution platform to make JMP more consistent. Notice that the compare distributions report always appears directly underneath the histogram. This gave us the ability to remove the legend that previously appeared beneath the histograms. This information is now contained within the compare distributions report. We find that the green line is for the negative binomial fit and the blue line is for the ZI negative binomial fit. CDF plots have also changed. As fits are added, the CDFs for the fitted distributions are superimposed onto the empirical CDFs. We see that both the green and blue lines closely follow the empirical CDFs for this example. The histogram shows that the green and blue lines closely follow the data. In the compare distributions report, the negative binomial distribution appears first because it has a smaller AICc value, indicating a better fit. If we wanted to use BIC, rather than AICc, as our criterion for best fit, we can simply click on that column header to sort on that column. Something that is helpful but may not be obvious is that you can remove a fit from the report window entirely by double clicking on the fit name in the compare distributions section. Here, I double clicked on ZI negative binomial to remove it from the report. QQ and PP plots. The quantile-quantile plot shows the relationship between the observations and the quantiles obtained using the estimated parameters. 
The percentile-percentile plot shows the relationship between the empirical cumulative distribution function and the fitted CDFs obtained using the estimated parameters. Two profilers have been added. The distribution profiler is the prediction profiler of the cumulative distribution function. The quantile profiler is the prediction profiler of the quantile function. There are also options to save distribution formula and save simulation formula. The goodness of fit tests have been standardized to use Anderson-Darling and Pearson chi-square. The Pearson chi-square test has improved bin creation. Beginning with version 15 of JMP, we now satisfy the rule of thumb that there are at least five expected observations in each bin. Let's look at some of these in JMP. From the previous Washers report, select QQ plot and PP plot. For each of these plots, we are looking to see how closely the points fall to the reference line. The closer the points are to the reference line, the better the fit. Add the distribution and quantile profilers to the report. If the data follow the negative binomial distribution, the probability of having four or fewer defective is 84%. As with other profilers in JMP, you can alter the input settings to see what effect it will have on the probability. The quantile profiler works in a similar manner; it shows the relationship between the probability and the negative binomial quantile. Another new option is the save simulation formula. This option saves a new column to the data table that contains a formula that generates simulated values using the estimated parameters. This column can be used in the simulate utility as well as in other parts of JMP. Although we are dealing with discrete data, let's add the exponential distribution to the report so we can view the goodness of fit report. You can see from the histogram that the exponential distribution may not be a good fit for this data. To test this hypothesis, select goodness of fit. The Anderson-Darling test is used
for continuous distributions. Here we have a simulated p-value of less than .05, which indicates that we should reject the null hypothesis that this data comes from the exponential distribution. Process Capability was introduced in JMP version 12. Parts of this new platform were added to control chart builder at the same time. Now with JMP 15, parts of this platform have been added to the distribution platform. This makes the platforms not only easier to maintain, but also more consistent with each other. There are several differences around the launch of a distribution report. The histogram only shows spec limits if show as graph reference lines is checked in the spec limits column property. The user is given the ability to disable the capability analysis, even if there are spec limits column properties. If there is a process capability distribution column property, that distribution will be used for the capability analysis. The workflow for the quantiles option for fitted distributions has been simplified into one launch, whereas past versions required two platform calls. Normal and non-normal capability analyses now use similar dialogues. Let's look at JMP. The script I'm running opens the Process Measurements sample data table and alters some of the column properties. For Process 1, there's a spec limits column property with show as graph reference lines not checked. For Process 2, there's a spec limits column property with show as graph reference lines checked. For Process 3, there's a process capability distribution column property with Weibull defined, as well as a spec limits column property with show as graph reference lines unchecked. In the distribution dialogue, I will assign these three processes as the Y. Notice the new check box in the lower left corner, create process capability. This checkbox only appears if at least one Y column has a spec limits column property. Uncheck this box and click OK. 
In previous versions of JMP, you would get a capability analysis for each of these three processes. Additionally, there would be spec limits drawn in each of the three histograms. In this report, no capability analysis is given at all because we unchecked that box. Spec limits are only shown for Process 2 because it was the only Y whose spec limits column property had show as graph reference lines checked. Let's go back to the distribution dialogue and check the box, create process capability, to compare. Capability analysis is now given for all three process variables. In the main histogram at the top, spec limits are only shown for Process 2. The capability analysis for Process 3 is based on the Weibull distribution. The capability report looks the same. It has the same options that you would get with the individual detail reports in the process capability platform that was introduced in JMP 12. Note that Ppk labeling is now the default for the overall sigma. The report shows both within and overall sigma indices when a normal distribution is assumed. Let's investigate the case where the data table has no spec limits column properties. Thickness.jmp is a sample data table with no column properties. In the distribution dialogue, notice there is no checkbox for create process capability. This is because there are no spec limits column properties. Let's investigate the distribution report for Thickness 3 and Thickness 4. We can add a process capability report for Thickness 3 by selecting this option from the red triangle menu. These options work in the same manner as the options you would find in the process capability platform introduced in JMP 12. Let's perform a simple capability analysis and define LSL as .03, target as .045 and USL as .05. Let's also turn on show limits. A capability analysis based on the normal distribution is given for the specification limits. The limits are shown in the graph because we checked that option in the dialogue. 
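As a rough illustration of what the overall sigma indices in a normal-based capability report measure, the sketch below computes Pp and Ppk from the textbook formulas using the LSL and USL from the demo. The thickness measurements are made-up numbers, not the Thickness 3 column, and the within-sigma indices (which depend on a subgrouping or moving-range estimate) are not reproduced here.

```python
import statistics

def overall_capability(values, lsl, usl):
    """Textbook overall-sigma capability indices under a normal assumption.

    Pp  = (USL - LSL) / (6 * s)
    Ppk = min(USL - mean, mean - LSL) / (3 * s)
    where s is the ordinary sample standard deviation of all the data.
    """
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    pp = (usl - lsl) / (6 * s)
    ppk = min(usl - mean, mean - lsl) / (3 * s)
    return {"mean": mean, "sigma": s, "Pp": pp, "Ppk": ppk}

# Hypothetical thickness measurements (illustration only).
thickness = [0.041, 0.043, 0.044, 0.046, 0.042, 0.045, 0.047, 0.044]

# Spec limits as entered in the demo: LSL = .03, USL = .05.
report = overall_capability(thickness, lsl=0.03, usl=0.05)
for label, value in report.items():
    print(f"{label}: {value:.4f}")
```

Ppk can never exceed Pp; the two are equal only when the process mean sits exactly midway between the spec limits.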
You're provided with both within and overall sigma capability statistics. For Thickness 4, let's fit a beta distribution. Process Capability for the fitted distribution yields the following dialogue options. The calculate quantile spec limits section of this dialogue contains the quantiles option and the set k for k sigma options that were available in the legacy fitters. For this example, specify the LSL prob as .05, the target prob as .5, and the USL prob as .95. When we click the calculate spec limits button, the spec limits are calculated for the given probabilities using the fitted beta distribution. In previous versions of JMP, you had to fit a distribution, get the quantiles, and pass them back to the platform, which required two distribution calls. Now you can do this in a single call. The capability analysis given is based on the beta distribution. You're provided with the overall sigma capability statistics. Spec limits are not shown in the top histogram because we did not check the show limits option in the dialogue. In conclusion, the distribution platform in version 15 has been modernized to improve flow, provide a cleaner look, provide consistency and require less maintenance. Thank you for attending my presentation about the modernization of the distribution platform.  
Fabio D'Ottaviano, R&D Statistician, Dow Inc Wenzhao Yang, Dr, Dow Inc   The wide availability of undesigned data, a by-product of chemical industrial research and manufacturing, makes the venturesome use of machine learning attractive, given its plug-and-play appeal, in an attempt to extract value from this data. Often this type of data reflects not only the response to controlled variation but also the variation caused by random effects not tagged in the data. Thus, machine learning based models in this industry may easily miss active random effects. This presentation uses simulation in JMP to show the effect of missing a random effect via machine learning (vs. including it properly via mixed models as a benchmark) in a context commonly encountered in the chemical industry (mixture experiments with process variables), and as a function of relative cluster size, total variance, proportion of variance attributed to the random effect, and data size. Simulation was employed because it allows the comparison (missing vs. not missing random effects) to be made clearly and simply while avoiding unwanted confounders found in real-world data. Besides the long-established fact that machine learning performs better the larger the data, it was also observed that data lacking due specificity (i.e., without clustering information) causes critical prediction biases regardless of the data size.   This presentation is based on a published paper of the same title.     Auto-generated transcript...   Speaker Transcript Fabio D'Ottaviano Okay, thanks everybody for watching this video. As you can see, I'll be talking about missing random effects in machine learning. It's work I did together with my colleague Wenzhao Yang. We both work for Dow Chemical Company, in R&D, and we help develop new processes and mainly new products. 
What you see here on this screen is a big bingo cage, because our talk is going to be about simulation, and simulation, at least to me, has a lot to do with a bingo cage: you decide the distribution of your balls and numbers inside the bingo cage, and then you keep picking them as you want. All right. This talk also has to do with a publication we had lately, with the same name, and when you have access to this presentation, you can just click here and you'll have access to the entire paper. So this is just a summary of what we have published there. Okay, what's the context for this? Well, first of all, machine learning has a kind of plug-and-play appeal to non-statisticians. You don't have to assume anything, and that's attractive. Besides, you have very user-friendly software out there these days. So people like to do that these days. However, random effects are everywhere, and random effects are a funny thing because it's a concept that is a little bit more complex. So it tends not to be taught in basic statistics courses; it's a more advanced subject. So you're going to get a lot of people doing machine learning without a lot of understanding about random effects. And even if they have that understanding, the concept of random effects still doesn't ring a bell with people doing machine learning, because there are just a few algorithms that can use random effects. You can check the reference here, where you see that there are some trees and random forests that can take them, but they're recent and they're not spread everywhere. So you're going to have a hard time finding something that can handle random effects in machine learning. Let me talk a little bit about random effects. As you can see here, at least in the chemical industry where I come from, we typically mix, say, three components: these yellow, red and green. 
We set the percentage of each one of these components at different levels. And then we measure the responses as we change the percentages of these components, with a certain piece of equipment, and sometimes you even have an operator or lab technician that will also interfere in the result that you want to see. Okay. And when we do this kind of experiment, we want to generalize the findings, or whatever prediction we are trying to get. But the problem is that when I'm mixing this green component here, if I buy next time from the supplier that supplies me this green component, the green shade may vary, and I don't know whether the next batch that the supplier gives me is going to be exactly the same green, because there is variability in supply. On top of that, I may run my experiments using a certain piece of equipment, but if I look around in my lab or in other labs, I may find different makes of this equipment. And on top of that, maybe the measurement depends on the operator who is doing it. So that may also interfere and impoverish my prediction, my generalization to whatever I want to predict. Besides, and this is the most typical I guess in the chemical industry, there is the experimental batch variability over time, if you repeat the same thing over and over again. Let's say you have an experiment and you get your model, and your model can predict something, but then you repeat that experiment to get another model, and then another. The predictions of these three models may be considerably different, not negligible. So there is also the component of time. So what's the problem I'm talking about here? Well, typically we have historical data, and historical data is just a collection of bits and pieces of data from experiments you've done in the past. 
And people were not much concerned with generalizing the result at the time they ran that experiment. So when we collect them and call it historical data, we may or may not have tags for the random effects. And having tags, at least where I come from, is more of an exception; the rule is having no tags for random effects, or at least not for all of them. Let's say you have tags. One thing you can do is to use a machine learning technique that can handle these random effects, put them into the model, and that's it. You don't have a problem. But then, as I said, machine learning techniques that can handle random effects are not very well known. You may be tempted to use machine learning and put the random effects into the model as if they were fixed, and then you're going to run into the very well-known problem of treating a random effect as fixed. Just to say one thing, you're going to have a hard time predicting any new outcome because, for example, if your random effect is operator, you have only a few operators in your data, a few names, but if there is a new operator, you cannot predict what the effect of this new operator is going to be. So here there is no deal. And then there's one thing you can do whether you have tags or not, which is to use any machine learning technique. If you have the tags, you ignore the random effect; if you don't, you're going to be ignoring it anyway, whether you like it or not. So what I want to do is simulate this situation. We use JMP Pro for this. And I hope you enjoy the results. The simulation, basically: I will use a mixed effect model with fixed and random effects. 
And then we use that same model form to estimate the response after I simulate, and I also model the results of my simulation with a neural net. Then we use the predictive performance of the mixed model as a benchmark, and we compare it with the predictive performance of the neural net, taking their test set R squares, to see what happens if I miss the random effect. Okay. Sometimes when I talk about this, people think that I'm comparing a linear mixed effect model versus a machine learning neural net. That's not the case. Here we are comparing a model with and without the random effect, given that there is a random effect in the data. I could do, for example, a linear model with random effects versus a linear model without random effects. And I could do a neural net with random effects versus a neural net without random effects. But the problem is that today there is no neural net, for example, that can handle random effects. So I'm forced to use, for example, a linear mixed effect model. My simulation factors. Well, I'll use something that is typical in the industry, which is a mixture with process variables model. What is it? Let's say I have those three components I showed you before, the yellow, red and green, and they have certain percentages that add up to one. I have a continuous variable which, for example, can be temperature. I have a categorical variable that can be catalyst type, and I have a random effect which can be the variability from batch to batch of experiments. Okay. The simulation model. Well, it's a pretty simple one. I have here my mixture main effects M1, M2 and M3. And you will see all over this model that the fixed effects all have coefficients of either one or minus one. I just assigned one or minus one randomly to them. So I have the mixture main effects. 
Here I have the mixture two-way interactions, the interaction of the mixture with the continuous variable, and the interaction of the categorical variable with the components. And finally, the interaction of the continuous variable with the categorical variable. Plus I have this b_i here, which is my random effect, and e_ij, which is my random error. Both are normally distributed with a certain variance: the variance between batches of experiments, and the variance within a batch of experiments. Throughout this presentation, just to represent this whole formula in a neater way, I use this form here, where X represents all the fixed effects and beta represents all the parameters that I used. And my y here, the expected y, is actually X beta; it's this whole thing without my random effect and without my random error. Simulation parameters. Well, here I have one which is data size, to see what happens if I have not so much data and then more data. Here I have two levels, 100 and 1,000 rows in every set of experiments, which means 20 rows per fixed effect, then 200. The other thing I will vary is the size of the batch, or the cluster, whichever you like. I have two levels, 4% and 25%. So 4% means that if I have 100 rows, one batch of experiments will be four rows, so I'm going to have 25 batches. If I have 100 rows in total and my batch size is 25%, then I have only four batches. Then the other variable we change here is the total variance. We have two levels here, 0.5 and 2. 0.5 is half of the fixed effect size, because in the formula here the coefficients are all ones for the fixed effects. 
And the other one is 2, right. This total variance is the sum of my variance between batches and my variance within batches. And lastly, the other thing I will change is the ratio of between to within variance. So I have one and four. With one, my between-batch variance is going to be equal to the within, and with the other one it is going to be four times bigger than the within. Then, once I've settled these four factors, these parameters, you do a full factorial DOE where I have 16 runs: two levels of data size, two levels of batch size, two levels of total variance and two levels of the ratio. With that, I can calculate the between and within sigmas accordingly. And that's the setup for the simulation. Okay. Now, what I call a simulation hyperparameter, because you can change it as you wish, and I'll show you in the demo, is that I did 30 simulations for each DOE run. So every one of the 16 runs I ran 30 times. So, for example, I'll have simulation run 1, 2, 3, up to 30. And for the levels of the fixed effects, I use a space filling design. The reason why I use a space filling design is that I don't want to confound the effect of missing the random effect with the fact that possibly I have some collinearity or sparse data, which is a typical thing in historical data. I don't want that in the middle of my way. I prefer a space filling design that spreads the levels of the fixed effects across the input space, so I get rid of this problem of sparsity in the data or collinearity. And then we allocate the batches: we randomize the batch across the runs in the first simulation, and then use the same sequence across all the other 29 simulations. So all the simulations have the same batch allocation. And lastly, the set allocation will be randomized for every one of the simulation runs. 
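The simulation setup described above can be sketched in a few lines of Python: a full factorial over the four two-level factors, with the between and within variances derived from the total variance and the ratio. This is only an illustration of the arithmetic, not the authors' JMP setup; the factor levels are taken as I read them from the transcript (in particular the total variance levels 0.5 and 2), and the variable names are my own.

```python
from itertools import product

# Factor levels as read from the talk (the total-variance levels are an
# interpretation of a garbled passage in the transcript).
data_sizes = [100, 1000]        # rows per simulated data set
batch_fractions = [0.04, 0.25]  # batch size relative to the data size
total_variances = [0.5, 2.0]    # between + within variance
ratios = [1, 4]                 # between variance / within variance

def split_variance(total, ratio):
    """Split the total variance given ratio = between / within.

    total = between + within and between = ratio * within, so
    within = total / (1 + ratio).
    """
    within = total / (1 + ratio)
    between = total - within
    return between, within

runs = []
for n, frac, total, ratio in product(data_sizes, batch_fractions,
                                     total_variances, ratios):
    between, within = split_variance(total, ratio)
    runs.append({
        "rows": n,
        "rows_per_batch": round(n * frac),
        "n_batches": round(1 / frac),
        "var_between": between,
        "var_within": within,
    })

print(len(runs), "DOE runs")
print(runs[0])
```

For the first run (100 rows, 4% batches, total variance 0.5, ratio 1) this gives a between and within variance of 0.25 each, that is, a standard deviation of 0.5 for both, which matches the sigma values mentioned later in the demo.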
So let me just get out of here and start JMP. So what I do is use DOE, Special Purpose, Space Filling Design. I'll load my factors here, just because I want to be fast. So anyway, here I have my 1, 2, 3, 4, 5 fixed effects. Space filling designs don't accept mixture variables, so you need to set linear constraints here just to say, look, these three guys here need to add up to one. So that's what I'm going to do here. All right. So with that, I've satisfied that constraint. I will give an example of the first run of this DOE, which is where I have data size 100, relative batch size 4%, total variance 0.5, and the ratio is going to be one. So if I go back here, I need to generate 100 runs. Also, if I want to replicate this later, you have to set the random seed and the number of starts. Then the only option I have when I set constraints is the fast flexible filling design. And here we go, I get the table here. One thing you see is that if you use a ternary plot with your three components, you see that everything is spread out. Oops, I have a problem here. Let me go back. There's a problem with the constraints, yeah. I forgot the one here. Yep. All right, let's start all over again. Hundred. Set random seed, 21234, and number of starts. Great. Next, the filling design, Make Table. Yeah. And then I just need to check whether it is all spread out. And yes, it is. All right, so then I look at my categorical variable here. I want to see if it is spread out for all of its levels. As you can see, it is, for one and two. Great. Now let me close all that. We do 30 simulations. So this is one simulation; I have 100 rows. But now I need to do 30. So what I will do is add rows here, 2,900, at the end. And we just select these first 100 runs, all the variables here, and fill to the end of the table. Great. 
So now I have repeated the same design 30 times. Now, to make it faster, I won't keep using this table. I'll just open another table where I have everything I wanted to show already set up. So I'm back to this table. The next thing I would do is create a simulation column. With this formula here, I can tell that rows up to row 100 are simulation one, and then every hundred rows the simulation number changes. So at the end of the day I get 30 simulations with 100 rows each. Great. Then the batch allocation, just to explain what I did, and I showed that in the PowerPoint: I have a formula that creates batches of 4% of the total data size, which means I have four rows per batch here, then four, and so on. And once I get to 100 here and I'm jumping from simulation one to simulation two, then it starts all over again. So at the end of the day I have all the 25 batches here. Okay. Then the next thing I will do is create my validation column, which means I need to split the set. Back to the PowerPoint here, you see that for the solution that I'm going to create with the neural net, I divide the rows: 60% of them will belong to the training set, 20% to the validation set and the other 20% to the test set. So how do I do that? Back to JMP. There we go. Let me hide this. Okay, so here is the validation column. How do I do that? It's already there, but to explain how you do it: you go to Analyze, Make Validation Column, you use your simulation column as the stratification, you go, and then you just enter .6, .2, .2, and you use a random seed here just to make sure you can reproduce it. That's how I created that column. Then if I go back to my presentation here: all the 60% that belongs to the training set for the neural net and the 20% that belongs to the validation set, also for the neural net. 
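The three bookkeeping columns described above (simulation number, batch number, and a 60/20/20 validation column stratified by simulation) can be sketched in Python as follows. This is an illustrative stand-in for the JMP column formulas and the Make Validation Column utility, not a reproduction of them; the seed and the extra `mixed_set` recode (training plus validation becomes the modeling set for the mixed model) are written under my own naming.

```python
import random

N_ROWS, N_SIMS, ROWS_PER_BATCH = 100, 30, 4  # first DOE run in the demo

rows = []
for i in range(N_ROWS * N_SIMS):
    rows.append({
        "row": i + 1,
        # Rows 1-100 are simulation 1, rows 101-200 simulation 2, and so on.
        "simulation": i // N_ROWS + 1,
        # Batch numbering restarts at 1 at the start of each simulation.
        "batch": (i % N_ROWS) // ROWS_PER_BATCH + 1,
    })

rng = random.Random(1234)  # arbitrary seed, only for reproducibility
n_train, n_valid = round(0.6 * N_ROWS), round(0.2 * N_ROWS)
for sim in range(1, N_SIMS + 1):
    members = [r for r in rows if r["simulation"] == sim]
    # Exactly 60/20/20 within every simulation (stratified split).
    labels = (["training"] * n_train + ["validation"] * n_valid
              + ["test"] * (N_ROWS - n_train - n_valid))
    rng.shuffle(labels)  # a new random set allocation per simulation
    for r, label in zip(members, labels):
        r["set"] = label
        # For the mixed model, training and validation merge into "modeling".
        r["mixed_set"] = "test" if label == "test" else "modeling"

print(rows[0], rows[100], sep="\n")
```

Because the split is done inside each simulation, every simulated data set ends up with the same number of training, validation and test rows, while the rows that land in each set differ from one simulation to the next.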
Now they both belong to a set called the modeling set for the mixed effects model. For the mixed effects model there will be no validation. We just estimate the model with this 80%, and the test set of the mixed effects solution will be the same 20% that I use for the neural net solution. So in that case, I go back to JMP and go here, and I just created a formula where zero, which means training in the validation column of the neural net, stays zero; one, which is validation, is also going to be my modeling set; and two, which is my test set, is going to be my test set here too. So I created this, and then in the column I named the value labels: zero is going to be modeling and one is going to be test. That way you see here that whatever is test here is test here, but whatever is training or validation becomes modeling. Then finally I need to set the formulas for my response. So for the expected value of my simulation, I just have here the fixed effects formula; there is no random term here. All right. And that's my expected value, my y. And my y_ij is going to be this. If you look at the formula, you have the y, which is my expected value, plus here I have a formula for generating a random value following a normal distribution with mean zero and the between sigma. I like to set the variables here as table variables because I can change the value later as I wish without going to the formula. But anyway, this is going to generate a single value every time the batch number changes. So while my batch is one, that's going to be the same value; when it changes to batch two, it creates another value. And also when I change from one simulation to another. So I will have one value per batch for the 25 batches of simulation one, and then when I jump to batch one of simulation two, it creates another value. And here is just my normal random number with the within sigma that I set in the table here. 
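The response formula described above adds two random terms to the expected value: a batch effect that is drawn once per batch within each simulation, and an error that is drawn fresh for every row. A minimal Python sketch of that logic follows. The constant expected value is only a placeholder for the fixed-effects formula, whose exact terms are not spelled out in the transcript, and the function and variable names are hypothetical.

```python
import random

SIGMA_BETWEEN = 0.5  # between-batch standard deviation (first DOE run)
SIGMA_WITHIN = 0.5   # within-batch standard deviation (first DOE run)

def simulate_response(expected_y, simulation, batch, batch_effects, rng):
    """Return expected_y + batch effect + row-level error.

    The batch effect is drawn once for each (simulation, batch) pair and
    reused for every row in that batch; the error is new for every row.
    """
    key = (simulation, batch)
    if key not in batch_effects:
        batch_effects[key] = rng.gauss(0.0, SIGMA_BETWEEN)
    error = rng.gauss(0.0, SIGMA_WITHIN)
    return expected_y + batch_effects[key] + error

rng = random.Random(2020)  # arbitrary seed
batch_effects = {}
simulated = []
for simulation in (1, 2):          # two simulations, to keep the demo short
    for row in range(100):
        batch = row // 4 + 1       # four rows per batch, 25 batches
        expected_y = 1.0           # placeholder for the fixed-effects part
        y = simulate_response(expected_y, simulation, batch,
                              batch_effects, rng)
        simulated.append((simulation, batch, y))

print(len(batch_effects), "distinct batch effects drawn")
```

Keying the batch effect on both the simulation and the batch number is what makes batch 1 of simulation 2 get a different draw from batch 1 of simulation 1, as described in the talk.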
So since I'm replicating DOE run one, I have sigmas of 0.5 and 0.5. All right, so now I have here the solution for my mixed effects model simulation. Before that, let me go back here and show you what I am doing. For the mixed effects model, my simulation model is this, but my fitted model will be this: instead of beta, I have beta hat, and my small b here is going to be b hat. So they are estimates of the values that I simulated. And then in the mixed effects model I have two types of prediction model. One is the conditional model, where I use my estimates of the random effects, and the other one is the marginal model, where I don't. So I have these two types of model. The conditional one is good to predict things that I have in the data already. The marginal one is something I use to predict data that I don't have in my data set. For the neural net, I'm using the standard default kind of neural net in JMP, and I'm not using it just because it's the default, but because it pretty much works. You have here all the five fixed effects. I use one layer with three nodes, all hyperbolic tangent functions, as you can see here. And then you have here a function called h(x), which is the sum of these three functions plus a bias term. If I add more nodes, it doesn't get any better. If I only use two nodes, then it gets even worse. So I'm going to use this all over. And that's what I'm going to show you here. Okay, let me go back. So here I have my mixed effects model solution. How did I come up with that? Just to show you: I have here the response, I put in the validation column for the mixed effects model, by simulation, all my fixed effects here, and my random effect is my batch. And then it generates this, starting with the first simulation. 
You see simulation one, and it goes all the way to simulation thirty, right. I couldn't use, for example, the Simulate function of JMP here, because I'm changing the validation column for every batch, so I cannot, at least I don't know how to do it, incorporate the validation column in the formula of Y i j. Okay. So, oops, back here. Then now I have another script here for the neural net. It's going to take a little bit, but it shouldn't be a big problem. When I'm doing, for example, the DOE runs with 1000 rows per simulation, that can take a considerable amount of time, something like maybe 10 to 15 minutes, to do all thirty simulations at the same time. Here it should show up soon. And there we go. Okay, so here I have, again, if you look, every one of them has three nodes, right. Okay, and you have simulation one all the way to simulation 30. Right. So now I have all that done for run one of my DOE, right. So the next thing I need to understand is what types of R squares I'm going to compare. Right, there are actually five types of R squares here. So here's the R squared formula. Why do I differentiate the R squares by type here? Because depending on what you use as actual versus predicted in this formula, you know, your R square changes. So for example, I have here the types of R squares I will compare. For example, for the rows in the training set, I compare what I simulated versus the formula I got for the neural net here. And, you know, that's actually the case in all these three; they are the same thing, because what I'm comparing, my Y and my Y hat, are the same. So I have type A, right. Then I have another type, I call it type B.
There I'm not comparing the simulated value with the random effect and the random error term. I'm comparing the expected value of my simulation versus the formula I got. So this makes sense: the test set rows are always the same, it's just the way you calculate the R squared that is going to change. Because in this way here, when I have what I call the conditional test set, I see the apparent future performance, because that's exactly what we get when you have any data set, because we cannot tell the real one. We don't have the real model; for that you need to simulate. And then you have the expected test set, which is actually the same rows, but now I'm comparing to the expected value, and I can tell, for lack of a better word, a real future performance. So the apparent performance is not necessarily the real future performance. OK. For the mixed effects model it is the same. Now I have another type of R squared, because here I'm comparing the simulated value versus my conditional prediction formula here, using the estimates I have for the random effects. But when I want to predict the future, so when I want to predict the test set, I have another R square here, D, which is comparing the whole simulated value, Y i j, versus my marginal model here. I'm not using b here, right. And lastly I have a fifth type of R square, which is my expected test set. Again, the test set is always the same rows, it's just that I'm using here now the expected value versus my marginal model. So the problem is, if you're not careful there, you may calculate the wrong R squared. So what I do is, whenever I have the mixed effects model, I don't use anything that is in this report. All I do is save the columns. Here I save my marginal model prediction formula, right, and here you save the prediction formula of the conditional model, right, and I will create columns with these formulas. That's for the mixed effects model.
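The five R square types differ only in which column plays the role of "actual" and which plays "predicted." The sketch below makes that pairing explicit. It assumes the usual definition of R squared as one minus the ratio of the error sum of squares to the total sum of squares about the mean of the actual values; the slide's exact formula is not in the transcript, and the function and argument names are hypothetical.

```python
import numpy as np

def r_squared(actual, predicted):
    """R squared as 1 - SSE / SST (assumed definition, not confirmed by the talk)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse = np.sum((actual - predicted) ** 2)
    sst = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - sse / sst

def five_r_squares(y_sim, y_expected, nn_pred, mixed_cond_pred, mixed_marg_pred):
    """Pair actual and predicted columns the way the talk describes.

    In the talk, A is computed per set (training, validation, test),
    C on the modeling rows, and B, D, E on the test rows; selecting
    those rows is left to the caller in this sketch.
    """
    return {
        "A": r_squared(y_sim, nn_pred),               # simulated y_ij vs neural net
        "B": r_squared(y_expected, nn_pred),          # expected value vs neural net
        "C": r_squared(y_sim, mixed_cond_pred),       # simulated y_ij vs conditional model
        "D": r_squared(y_sim, mixed_marg_pred),       # simulated y_ij vs marginal model
        "E": r_squared(y_expected, mixed_marg_pred),  # expected value vs marginal model
    }
```

Types A and B share one prediction column, as do D and E; only the "actual" column changes, which is why the talk warns that it is easy to compute the wrong R squared.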
Now for the neural net you can also save the formulas. I like to save fast formulas, because I just want to calculate the R squares. So I save them as fast formulas, and then what you see is I create these five columns here. All right, so let me go to them. So now my type A, if you remember type A from the presentation: in type A I'm comparing the simulated Y i j versus my neural net model. So what I do here, sorry, what I do here is I go to Column Info and you see here "predicting what," and it is predicting the Y i j. Okay, here it is. Now I have here the neural net formula saved twice; as you see, this value is equal to this value. But now I just change "predicting what" here. This one is predicting the expected value. Right, so that way I can use this function in JMP here, which is Model Comparison. I can go and use this type A, and I do it by simulation and I group by validation. And then what you get is all the R squares you need, from simulation one all the way down to simulation 30, and you have it by set, so you can later do Combine Data Tables and you get everything neat, right, for those R squares. So for the other ones I also have scripts here. For example, for type B, it uses the B formula now, right, this column, and I only get the R squared for the test set, not for the modeling, not for the validation set. So simulation one is here, and then if you go all the way down here you have simulation 30. And again, you can always combine data tables, and your data comes out in the same table format for all of these R squares. And then, obviously, I can have another script here for the type C R squares, which use the modeling set of the mixed effects model, right, from the modeling set of simulation one all the way to the modeling set of simulation 30. And, you know, you do the same for D, now on the test set, where I'm predicting Y i j, right. So, the, the, say, the secret here is that you have one even sometimes in the lake.
Here again I have the same formula, because it's my marginal model of the mixed effects model solution. The only thing that changes is in your Column Info: make sure you have the right "predicting what," and then you can use this to calculate all these R squares. All right. Let me go back to the presentation. And now, since I got all those R squares together, you stack your tables and then you can do the visualization you want. But here I'm interested really in the conditional test set of both solutions and the expected test set. Here, you know, I could spend 45 minutes just talking about this display here. But I'm not really interested in the absolute values of the R squared, but more in a comparative way of looking at the R squares. But I need this just to check one thing, which is, as you can see here, let me use the pointer here to make it easier: I have all the R squares I created here versus all the DOE factors. Right. So you see that when I have a small data set, what happens is, my neural net is being trained correctly, because my training and my validation sets kind of have R square distributions that overlap. But then when you look at the conditional test set, which is actually the data we always have, right, because we never have the expected value, it's always at a lower level, right, as you see for all these when I have this small data set. But then when I have a bigger data set, the situation is different. With 1000 rows, they are all aligned, you know, they kind of overlap. So I did train these correctly. Again, the absolute value of the R square here is not of much interest. What I really care about is how, you know, if you go back to one of the earlier slides here, you see that now I want to compare the predictive performance of my benchmark mixed effects model versus my neural net, comparing the test set R square, right. So, here is what you get.
Let me get this moved. So what you get here is again all the variables I had for the DOE, and whether this is the neural net solution or the mixed effects solution. So, you see that for the conditional test set, which is the apparent performance, you always see, you know, that when the data sizes are small, the mixed effects model is always doing a better job here, when you include the random effect, right, versus the neural net, because their median or even average are all higher. But then when you have a bigger data set, you know, that difference kind of doesn't exist anymore, to a certain point that even the neural net is doing a better job here. But that's the apparent performance. Now the real one: you see now that the mixed effects model is doing a better job than it did versus the neural net. And here there is no more, you know, even ground, because at the end of the day the mixed effects model is doing a better job, especially in this scenario here, which is bigger data, fewer batches, and more between than within variability, right. Now, finally, what I have here is, for every simulation run, the difference between the mixed effects R square and the neural net R square. So, here we go. Here I have four plots, right, so let's just concentrate on one. What you have on the y axis is the difference in conditional R square, that is, the mixed effects R squared minus the neural net R square. So that's what you get here, and that is the difference in apparent future performance. And here on the x axis, what you have is this difference in the expected R square, which is your real future performance, right, or bias, if you want. Now I have four plots.
Why? Because if you think about it, when you have historical data where you don't have the tag, you know, if you're analyzing the data, you just have possibly control over two things, which are the data size and the relative batch size. Why? Because you cannot control what the variability is going to be in your data, and whether your random effects are going to be much bigger than your random error. So the only two things that you can possibly have any control over are data size and relative batch size. If you don't have the tag, you can at least have an idea whether, you know, your historical data is comprised of many, many batches or just one or two batches. Right. So that's the kind of control you have. You can at least have an idea of the batch size when you have historical data. So what I'm comparing here, then, again: I have the difference in apparent performance. If this difference is positive, it means the mixed effects model has a better performance. If this difference here is also positive, it means that the mixed effects model has a better future performance, right. And as you can see here, when you have a small data set, it doesn't matter what you do. The mixed effects model has a better performance, and sometimes way, way better, because we're talking about differences in R squared that can go way over one. Right, so it's getting much better performance, be it the apparent one or the future one. So when the data sizes are small, there's really no solution here. However, when you look at the bigger data size, right, but when you have this small amount of batches, right, here something funny happens, because here, you know, the difference in apparent future performance, on the Y axis, is negative for most of them. Which means the neural net is doing a better job in terms of apparent test set R square, right, or conditional test set R square. So, when you do it, it is going to look like the neural net is doing better.
It looks like a better job than the mixed effects model. However, when you look at the x axis, right, which is the difference in real future performance, it can be pretty misleading. Right, so here, when you have a lot of data but just a few batches, you know, you're going to get nice test set R squares. But then when you try to deploy your model in the future, you may get into trouble. But then when you look here, here we have a mitigated situation, where you have a lot of data and a lot of batches. So they tend to be not that much different. Right, so as a conclusion, if you omit a non-negligible random effect in machine learning: when the data set is small, you know, the test set predictive performance will most likely be poor, regardless of how many clusters or batches you have. And that's because machine learning requires a minimum data size for success. Right. So there's no way to win the game here. Now, when the data size is large and you just have a few clusters, that's kind of a misleading situation, because your test set predictive performance can be good, but the performance would likely be poorer later when you deploy the model. Some people tell me, well, why don't you use regularization? I say, well, even if you could, you will not do it in these situations, because your test set R squared can be good, and then you don't know you need it, right. So you won't be able to tell, you know, what your long term future performance is just by looking at your test set R square or some kind of sum of errors. But then when your data set is large and you have many clusters, the whole situation is mitigated, and the biasing effect of the clusters kind of averages out, because for the random effects, you know, the summation of all of them is zero.
So the more you have, the less bias you are going to get. On top of that, you know, I just wanted to say that what I learned from this is that when the data is not designed on purpose, there are two things I always remember. Machine learning cannot tackle a data set just because it is big. You've got to have a minimum level of design, right, to make it work. But the bigger the data, the more likely it is that this minimal level of design is already present in the data just by sheer chance. All right. And thank you. If you want to contact us, we are in the JMP Community. These are our addresses. Thank you.
Scott Wise, Senior Manager, JMP Education Team, SAS   The power of using Text Mining is a great tool in investigating all kinds of unstructured text that commonly resides in our collected data. From notes captured on warranty issues, lab testing/experimental comments, to even looking at food recipes, this new method opens a lot of opportunity to better understand our world. In this presentation, we will show how to use the latest text analytic methods to help solve a family mystery as to the regional source of my Grandma’s delicious chili recipe. Along the way, we will see how to use text mining to create leading terms and phrase lists and word cloud reports. Then we will utilize the resulting document term matrix to perform topic analysis (via latent class analysis clustering) that will enable us to find a solution to our question. You will be left with an understanding of the powerful text mining approaches that you can add to your own toolbox and start solving your own text data challenges!     Auto-generated transcript...   Speaker Transcript Scott Wise Investigative Text Exploration. My name is Scott Wise and I'm Senior Manager of the JMP Education Team.   And I've got a really fun presentation for us to view today and the goals of this presentation are to give you a little more familiarity   with the capabilities and how easy it is to do text exploration in JMP and JMP Pro, as well as show you a different way of looking at text exploration, like, can I do with to investigate something like a detective would?   Okay, so we're going to talk about my grandma's chili and how that relates to a Texas chili cook off. Before we begin,   let's just debunk some of the terminology that is around text mining, and to me, there's really five simple steps. You spend your time summarizing the data, literally, finding out what words in text occur the most often, even what combination of words occur the most often.   So the ??? 
call this looking through unstructured text right and then this, this could be anything. This could be a sentence your customer gave you about the performance of your product or your service. It could be...   it could be social media, right, where you know you figure out what people like or dislike based on comments,   something about your product or service or it could be this guy, something like a recipe.   I've even seen it done on patents when people are researching what are popular things people are applying patents for and should we be doing these.   So really, summarizing finding out what words out of that are the most important. Now there's some preparation that comes next, which is just getting down to the smaller list of the words you care about.   Then of course we wouldn't be JMP if we aren't going to visualize it and analyze it and as well we can model it. We can even do some advanced modeling. So I'm going to be using JMP Pro to do this, version 15.1,   and I will be sure to point out what you can do in JMP and where I actually throw in a little bit of Pro that help further answer my question.   document, corpus and term. Document is going to be those...   those things you're analyzing, basically the individual body of text, each row of the text. So it's my my recipes for my example.   It could be different customers who have commented back to you on your product or service. Corpus is actually that unstructured text that you're trying to handle, the body of that text which you're going to analyze.   And then the terms are going to be those words or those combination of words that you care about that can help you answer a question. So let's talk about the story. You know, why did I come up with looking at chili recipes? Well, this comes back to   almost 25 years ago when I first moved to Texas and I came to work for a big company and like most big companies in Texas, they had a big Texas chili cook off.   And I was asked to participate. 
Everybody participated. But I was also asked to do even more than that, I was asked to be a judge.   So I should have turned down the judging. I thought it was an honor for the new guy, but I think I was the only one they could give it to, I found out why.   But let's talk first about the reaction to the chili I brought. So I had a chili recipe that my family's always used and it came from my grandmother, Grandma Lillian.   This is not chili. This is not what Texans considered chili.   Now the good news was I'd have to bring any home because I enjoy eating it. They considered it something else besides chili. Their chili didn't have beans in it. Their chili was not hearty. It was very hot, very soupy.   Mine had beef. Theirs had mostly pork. They had all in. Here's the other thing that got me in trouble. Not only did they not like my grandma's chili, I almost didn't survive the judging   because the real badge of honor of a real Texan is to make the hottest bowl of chili. So you want to beat your neighbor. You want to beat your coworker. So they were throwing all kinds of ungodly hot chiles and spices in here.   And I just thought, it almost put a hole in my stomach just trying to taste it, and you're drinking all this milk, trying to put the heat out. So   it was a baptism by fire. So my recommendation is unless you like the heat, don't enter in as a judge, right. Not a good idea. But I wanted to place; that always bothered me.   What, why didn't my grandma's chili do well? Are there really different types of chili? Where do they come from?   And it turns out chili's got a really cool history. You can see some actually really cool history blogs and papers on it and it most likely came out of   Central America, mostly Mexico. And there are some light dishes, but in San Antonio, it was first observed being sold kind of in the state we would know now as chili,   on the old San Antonio square there by the Alamo. 
And it was used on cattle drives and it started to get popular and then it worked its way up the middle of the country all up through the Midwest.   And I was told that, you know, many food innovations got created at the St. Louis World's Fair, you know, really took off when they were shown, this was one of them. Chili made it into the   St. Louis World's Fair and it really got popular.   So there's different varieties. And so the idea was if I...could I use text exploration on ingredients and recipes? Just take the whole recipe and dump it in, see what happens.   And what do I want to compare it to? So I looked up what the traditional regions were; we had several, all the way from Texas up to Michigan.   So in Texas, we've got different varieties. Texas bowl red is the one I was tasting. That's chili con carne. A   Frito pie is something you'll get at a football game, but very popular. Then Louisiana has their own version. New Mexico, with green chilies and chicken, have their own version. Oklahoma, Kansas City, Springfield,   Illinois, Cincinnati with the Skyline chili, where they put it over spaghetti. Michigan. Coney Island, the hot dogs, you know, the chili sauce they put on there is serious chili.   White chili, unknown where it comes from. Vegetarian chili. But there's a lot of styles that are out there now.   So I said, if I took three recipes from each of these styles and I compared it to my control, my grandma's chili, can I find what I'm looking for?   So we're going to do this and I'm going to show you, as I walk through the steps, I'm going to show you a summary of the steps then I'll go right behind and show you how I do it in JMP.   First step's going to be summarizing the data. We want to find those words we care about.   
Now what happens when you enter in   any of this text analytics, this text data, like my ingredients, into a text explorer, it's going to run it through a library   of regular expressions, and think about the word like regular, just things that are just part of everyday speech.   And they're not that helpful. It's not helpful if I have "the" and "and" in my list. So it tries to help pull out words you don't care about. And yes, it can be customized.   And it's got a pretty strong one built in already, and that's what I used. And then stemming. Stemming is where you go and   say how you want to treat like words -- so "dice" is "dice," "dicing," you got plurals, different...different...   different versions of the root word. Do you want to just count it all in the root word or do you want it separated? So that's a consideration.   And then after we summarize. Let me get these words. You can see I've got a little list here of words. The one behind is unedited and then the one in the front   is actually one where we've gone through and kind of sized down that word list. So what you do is, you look for words you don't care about, you call them stop words, basically say, remove these. Add them back to my library of things I don't care about.   You can also bring over phrases that really matter. And once you have these, we are going to be able to visualize some.   And we're going to be able to see, in my case, I got really interested in ingredients. So we're going to be able to see what ingredients came out the most often.   And you often see word clouds here, and word clouds, you know, the bigger the word, the more frequency it is.   And you can look at it in a cloud, just as just a...just a sporadic layout or you can look at an order where, you know, the first thing up there is the biggest word. That's like the   graph that's in front of your slide. And after that, we are going to do something with it. And so we can create some basic models.   
And one of the easiest ways to do that is, once we know what words on our list that we care about,   we can add it back to our data table. It's called saving out the document term matrix. In this case, a simple way of doing it is just binaries. So I've got a separate column here, you can see to my right, where   "onion" has a one where it's in that rows, you know, recipe. It has zero if it's not, so you can get zeros and ones here in columns. And you can see what's the most important.   And a lot of times you're trying to model for something like, if I was stuck,   maybe I had the judging scores by this and I can say, well, here's a numeric score tied to each recipe. I want to see what ingredients   are common that result in a high score. That might be something I'm doing, so try to do it a little more predictive.   But in this case, I'm kind of going to look at grouping. I'm really interested in grouping, kind of, like recipes together and seeing where my grandma's chili falls. So I'll show you how we're going to do that. But let's first go to JMP and show you how we do these steps in JMP.   So here is my raw data.   In every document here, every row is is...got my unstructured text and in my cases, it's just the raw ingredients.   And if I click on any one of these cells here, you can see it is just literally the copied in ingredients. So there's the one...the first one for Cajun chili from Louisiana.   It's got words I care about, like "tomato" and "chili powder" and "honey." It's got words I don't care about like "can."   It's probably ingredient measures I don't care about so much, like "one," "two pounds," "teaspoons." So how are we going to take care of this? So what we're going to do   is we're going to go to Analyze and we're going to go to Text Explorer and we're just going to put those ingredients up into the text column.   I'm going to ask for stemming, how to stem for all terms. I find that very helpful. 
And then I'm going to use the built in regular expression.   I say okay and now here is my initial list.   So what I can do now is I can go and select those things I don't care about. I don't care about numbers. So maybe I can go in here and highlight them, right click, and say add a stop word, and then it gets added to the list of things I don't care about.   What about "chili powder"? It sounds like something that needs to be on its own. So I'm going to right click on that phrase. And I'm going to say add phrase and it adds it in.   So you go and you do this until you get a streamlined sized-down list.   I'm just going to run my   My finished list by it. And here's all the words that came off the regular expression that were found, and also   things I added   and stop words. And now, here is my finished term and phrase list. And so I've added these phrases. I care about. So "onion" came out the highest, then "salt," then "cumin," then "chili powder."   As you can guess, this would be really good to visualize and we do have a word cloud. So here is the word cloud for everything in this one. And again under my red triangle options here, I can change that to an order to make it crisper on what comes out.   If I keep it centered, something that's fun to do. You know you can add filters to your data and something you can do is, sometimes you can find your answer visually without having to do anything else. And so in that case, I'm going to go into this to my   red triangle. I'm going to get a local data filter. I'm just going to look at the type and I'm going to say, well, let's take a look at   what Grandma Lillian's chili looks like, you know "tomatoes" and "kidney beans" and "beef," that type of thing. And how would that compare to the Cajun chili?   Well they shared chili powder and beef, but, you know, there might be some different things on there. You know, how that compared to the chili verde,   you know, which is more of the green chili, you know. 
And they've got raw chilies in there and jalapenos, chicken stock, all that type of thing, chicken broth.   So this is really interesting, but probably not enough for me to figure out what's going on. So I did go   (and this is another...you've got all the options here under your red triangle) I did go make sure (and I probably need to make sure here) that I turn off my local data filter, make sure everything's selected, that I'm looking at all my terms.   I got 299 here. I'm going to right click and I'm going to say "save document term matrix."   And when I do that, it asks me what kind of way I want to save it, with weights, what kind of weighting. It's basically binary, there's frequencies I can use, how many terms, you know, the minimum term frequency to actually get a place in your data table. And I have already done that.   So if I slide across and look...   Actually   I'll go ahead and do that and show you what it looks like. So I'll just say "save document term matrix" and say "okay."   And now, as I showed you on the slide, now I get all the terms I care about. And there's that first one for "onion." Here's the one for "chili powder," and as it relates to their respective   recipe, you know, rows.   So I know where there's a one, I know this Cajun chili had chili powder.   When there's zero,   this one here looks like Oklahoman chili or no, I'm sorry, New Mexican chili did not have chili powder, so that's just, that's just how that works. And this can be used in modeling. So you can go to Analyze fit model in JMP and take...and actually apply this to some type of model.   But what I'm going to do is, since I'm not really trying to predict, you know, what ingredients will give me a higher score. I don't have any like, you know,   you know output data here. I really want to group them together and I heard in JMP Pro, I can actually do this. So I'm going to go right back to my slides here.   And I'm going to talk about analyzing the data. 
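For readers who want to see what the stop word, phrase, and binary document term matrix steps amount to, here is a small sketch in plain Python. It is not how JMP's Text Explorer is implemented; the stop word list, the phrase list, the function names, and the two example ingredient strings are all hypothetical, and stemming is left out to keep the example short.

```python
import re

# Hypothetical lists; in the talk these are built interactively in Text Explorer.
STOP_WORDS = {"can", "pound", "pounds", "teaspoon", "teaspoons", "of", "and", "the"}
PHRASES = ["chili powder", "kidney beans"]

def tokenize(text, phrases=PHRASES, stop_words=STOP_WORDS):
    """Lowercase, keep listed phrases together, drop numbers and stop words."""
    text = text.lower()
    for phrase in phrases:
        # Join phrase words with an underscore so they survive tokenizing.
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)  # digits are skipped by the pattern
    return [t.replace("_", " ") for t in tokens if t not in stop_words]

def binary_dtm(documents, min_docs=1):
    """Return (terms, matrix) where matrix[i][j] is 1 if term j is in document i."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    counts = {}
    for tokens in token_sets:
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
    # Keep terms that appear in at least min_docs documents.
    terms = sorted(t for t, c in counts.items() if c >= min_docs)
    matrix = [[1 if t in tokens else 0 for t in terms] for tokens in token_sets]
    return terms, matrix

# Made-up ingredient strings, only for illustration.
recipes = [
    "2 pounds beef, 1 can tomato, 2 teaspoons chili powder",
    "1 onion, chicken, green chilies",
]
terms, matrix = binary_dtm(recipes)
```

Each row of `matrix` corresponds to one recipe and each column to one term, which is the same zero and one layout as the columns JMP adds to the data table.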
So to do further analysis   in JMP Pro, it enables you to do some really good grouping techniques and these are multivariate methods and they're specialized for handling text analytics and working with those document term matrices about text.   And it uses something called latent class analysis. It's one of the terms. And this is similar to the principal components, if you're used to doing that technique. But basically it's going to   ask us how many groupings or clusters of data do we want to look at. It's going to look across that multidimensional space   between everything that's in those columns and your document term matrix for the important terms in your model here, the important terms we got on the word   list, right? And it's going to group them. And in my case, I was able to get it down to three groups.   So there's a cluster one, which seemed to have a lot of chili recipes with ground beef, tomato sauce, chili powder and beans.   There's a cluster two, which had a lot to do with chicken and green chilies, raw chilis here.   And then there's cluster three, which had a lot of chilies again, but they were more of the red chillies and they were kind of pork based and this made a lot of sense.   Okay, so when I created these clusters, I was able to use a cluster probability by row, this kind of gave me how strong those individual   recipes, my rows in my document, right, these original...my control recipe, where did they fall and how strong did they belong. Why did I assign them into whatever cluster? And when I did this,   22 was my grandma's cluster...was my grandma's control. And I found that she clustered in cluster one along with some other recipes, including those that came from Kansas City, Missouri.   There is one on number 24 which was very close. Now the Texas recipes for cluster three, they had hot chilis, spice...a lot of spice in them, and often pork and no beans, right? 
Beans were something that showed up in cluster one, but not in cluster three and then...   The cluster two   was more for, you know, it's more for green chili,   more for those things you see in New Mexico, you know, chicken-based chili, things with green chili.   Alright, so what happened was I was able to make the match, and I found a recipe. And one of the three representative ones from Missouri   that actually was called Kansas City chili, and it almost matched exactly Grandma Lillian's chili.   So when I asked my mother about this, I said, "Well, why could this happen. I didn't think it came from Missouri." And she said, "All this makes sense." She said, "Grandma Lillian,   she grew up on a farm in St. Joseph, Missouri, and she was the only girl and she had like 11-12 brothers. So she did a lot of the cooking."   By the way, she was the only one to get a college education and so she was quite progressive for for the time that she lived and was one of my favorite relatives, but her recipe was very indicative of this. So let's show you what this looks like live.   So if I go to...   at this point,   I go back to the data I had made and under that hotspot, I'm going to ask for these additional models that JMP Pro can give. There's a latent class analysis clusters documents using that method,   based on the binary way to document term matrix. So it does use a doc...does use that document term matrix, yeah so you don't even have to save it out, it automatically generates it.   There's also a latent semantic analysis, which does...which does a little more math, a little more advanced method, but both of these are basically doing the same thing, and I particularly liked this latent class analysis. So that's the one I selected.   And I asked for three clusters, you can play with it to see if it makes a difference. And I did try more clusters and I broke back to three.   And within its   options, you can look at the cluster probabilities by row.   
And of all the output, this is the one that made the most sense. So remember, back to my slide, this helped me look at where my grandma's chili fell,   Which was 22,   row 22. And then what else it combined well with. And so that's how I was able to do this analysis.   It's that simple.   So,   that was a really quick run through the capabilities of doing JMP text exploration in JMP and then how I was able to use JMP Pro and   find these clusters and place my grandma's chili and find a matching recipe. So if you're hungry for more, I do have a link to the blog   in the presentation that you can...you can go click on or you can just go right to the Community and you can just type in "grandma's chili" and you can find that blog. And I also will give you along with that, we will give you as well   the recipe. So you too can make Grandma Lillian's chili.   So we appreciate being able to show this to you today.   Please be sure to leave any questions in the Q&A that we can answer. And try this, try something   that you have around you at work, at home, wherever that has some unstructured text data, where you would like to explore and ask the question, and you'll find it's a fantastic, fantastic method, very powerful and really helps you attack that third dimension of data.
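As a rough illustration of the binary weighted document term matrix mentioned in the talk, the following Python sketch builds one from a few ingredient lists. The recipes, the tokenizer, and the function name are assumptions made for this example; JMP generates its matrix internally and its term handling (stemming, phrases, stop words) is not reproduced here.

```python
import re


def binary_document_term_matrix(documents, min_docs=1):
    """Return (terms, matrix); matrix[i][j] is 1 if term j occurs in document i."""
    # Lowercase each document and keep its distinct alphabetic tokens.
    tokenized = [set(re.findall(r"[a-z]+", doc.lower())) for doc in documents]

    # Count how many documents contain each term.
    doc_freq = {}
    for tokens in tokenized:
        for term in tokens:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Keep terms that appear in at least min_docs documents, in a stable order.
    terms = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    matrix = [[1 if term in tokens else 0 for term in terms] for tokens in tokenized]
    return terms, matrix


# Hypothetical ingredient lists, loosely modeled on the three clusters described.
recipes = [
    "ground beef, tomato sauce, chili powder, beans",
    "chicken, green chilies, onion",
    "pork, red chilies, onion",
]
terms, dtm = binary_document_term_matrix(recipes)
for recipe, row in zip(recipes, dtm):
    print(row, recipe)
```

Each row is one recipe and each column one term, so a clustering method such as latent class analysis only sees whether a term is present, not how often it occurs.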
Monday, October 12, 2020
Ronald Andrews, Sr. Process Engineer, Bausch + Lomb   How do we set internal process specs when there are multiple process parameters that impact a key product measure? We need a process to divide up the total variability allowed into separate and probably unequal buckets. Since variances are additive, it is much easier to allocate a variance for each process parameter than a deviation. We start with a list of potential contributors to the product parameter of interest. A cause-and-effect diagram is a useful tool. Then we gather the sensitivity information that is already known. We sort out what we know and don’t know and plan some DOEs to fill in the gaps. We can test our predictor variances by combining them to predict the total variance in the product. If our prediction of the total product variability falls short of actual experience, we need to add a TBD factor. Once we have a comprehensive model, we can start budgeting. Variance budgeting can be just as arbitrary as financial budgeting. We can look for low hanging fruit that can easily be improved. We may have to budget some financial resources to embark on projects to improve factors to meet our variance budget goals.     Auto-generated transcript...   Speaker Transcript Ronald Andrews Well, good morning or afternoon as the case may be. My name is Ron Andrews and topic of the day is variance budgeting. Oh, I need to share my screen. And there's a file, we will be getting to. And we'll get to start with PowerPoint. So variance budgeting is the topic. I'm a process engineer at Bausch and Lomb; got contact information here. My supervision requires this disclaimer. They don't necessarily want to take credit for what I say today. Overview of what we're going to talk about What is the variance budget? A little bit of history. When do we need one? We have some examples. 
We'll go through the elements of the process: cause and effect diagram, gather the foreknowledge, do some DOEs to fill in the gaps, Monte Carlo simulations, as required. And we've got a test case we'll work through. So really, what is a variance budget? Mechanical engineers like to talk about tolerance stack-up. Well, tolerance stack-up is basically a corollary of Murphy's Law, that being all tolerances will add unidirectionally in the direction that can do the most harm. Variance budget is like a tolerance stack-up, except that instead of budgeting the parameter itself, we budget the variance, sigma squared. We're relying on more or less normally shaped distributions, rather than uniform distributions. Variances are additive, which makes the budgeting process a whole lot easier than trying to budget something like standard deviations. Brief example here. If we used test-and-sort or test-and-adjust strategies, our distributions are going to look more like these uniform distributions. So if we have a distribution with a width of 1, one with a width of 2 and another with a width of 3, and we add them all together, we end up with a distribution with a width of pretty close to 6. In this case, we probably need to budget the tolerances more than the variances. If we rely on process control, our distributions will be more normal. In this case, if we have normal distributions with a standard deviation of 1, a standard deviation of 2 and a standard deviation of 3, and we add them up, we end up with a standard deviation of 3.7, a lot less than six. So we do the numbers: 1 squared plus 2 squared plus 3 squared equals essentially 3.7 squared. Now to be fair, on that previous slide, if I added up these variances, they would have added up to the variance of this one. But when you have something other than a normal distribution, you have to pay attention to the shape down near the tail. It depends on where you can set your specs. So, what is the variance budget?
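The arithmetic in the brief example above can be checked with a few lines of Python. This is only a sketch of the two cases described (independent normal sources versus worst-case stacking of uniform widths); the variable names are chosen for this illustration.

```python
import math

# Independent, roughly normal sources: variances add, standard deviations do not.
sds = [1.0, 2.0, 3.0]
combined_variance = sum(sd ** 2 for sd in sds)      # 1 + 4 + 9
combined_sd = math.sqrt(combined_variance)          # square root of 14

# Uniform (test-and-sort style) sources: the worst-case total width is the plain sum.
widths = [1.0, 2.0, 3.0]
total_width = sum(widths)

# The variances of the uniform sources still add (variance of a uniform = width^2 / 12),
# but the tails of the combined distribution reach out to the full summed width.
uniform_combined_sd = math.sqrt(sum(w ** 2 / 12 for w in widths))

print(f"combined sd of normal sources: {combined_sd:.2f}")
print(f"worst-case width of uniform sources: {total_width:.0f}")
print(f"sd of the summed uniform sources: {uniform_combined_sd:.2f}")
```

The first figure is the 3.7 quoted in the talk; the last two show why a tolerance stack-up and a variance sum can tell different stories when the inputs are not normal.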
Non normal distributions are going to require special attention and we'll get to those later. For now variance budget is kind of like a financial budget. They can be just as arbitrary. There only three basic rules. We translate everything into common currency. Now we do this for each product measure of interest, but we translate all the relevant process variables into their contribution to the product measure of interest. Rule number two is fairly simple. Don't budget more than 100% of the allowed variance. Yeah, sounds simple. I've seen this rule violated more than once in more than one company. Number three. This goes for life in general, as well as engineering, use your best judgment at all times. Little bit of history. This is not rocket science. Other people must be doing something similar. I have searched the literature and I have not been able to find published accounts of a process similar to this. I'm sure it's out there, but I have not found any published accounts yet. So for me the history came back in the 1980s, when I worked at Kodak with a challenge for management. Challenge was produce film with no variation perceived by customers. Actually what they originally said produce film with no variation. no perceivable variations. They define that as a Six Sigma shift would be less than one just noticeable difference. Kodak was pretty good on the perceptual stuff and all these just noticeable differences were defined, we knew exactly what they were. For a slide film like Kodachrome, which is what I was working on the... that's what I was working on at the time, color balance was the biggest challenge. Here, this streamline cause and effect diagram, color balance is a function of the green speed, the blue speed and the red speed. Now I've sort of fleshed out one of these legs. 
The red speed, I got the cyan absorber dye and then one of the emulsions as the factors that contribute to the speed of that, that affects the red speed, that affects the color balance. This is a very simplified version. There are actually three different emulsions in the red record, there are three more in the green record. There are two more in the blue record. Add up everything, they're 75 factors that all contribute to color balance. These are not just potential contributors. These are actually demonstrated contributors. So this is a fairly daunting task. So moving on to when we need a variance budget. Get a little tongue in cheek decision tree here. Do we have a mess in the house? If not, life is good. If so, how many kids do we have? If one, we probably know where the responsibility lies. If more than that and we probably need a budget. This is an example of some work we did a number of years ago on a contact lens project at Bausch and Lomb. This is long before it got out the door to the marketplace. We were having trouble meeting our diameter specs. plus or minus two tenths of a millimeter We were having trouble meeting that. We looked at a lot of sources of variability and we managed to characterize each one. So lot to lot. And this is with the same input materials and same set points, fairly large variability. Lens to lens within a lot, lower variability. Monomer component No. 1, we change lots occasionally, extreme variability. Monomer component No. 2, also had a fairly large variability. Now we mix our monomers together and we have a pretty good process with pretty good precision. It's not perfect and we can estimate the variability from that. That's a pretty small contributor. We put the monomer in a mold and put it under cure lamps to ??? it and the intensity of the lamps can make a difference. There we can estimate that source of variability as well. We add all these distributions up and this is our overall distribution. 
It does go beyond the spec limits on both ends. Standard deviation of .082. And as I mentioned, spec limits of plus and minus .2. That gives us a PPk of .81. Not so good. Percent out of spec estimated at 1.5%. It might have been passable if it was really that good, but it wasn't. This estimate assumes each lens is an independent event. They're not. We make the lenses in lots and every lot has a certain set of raw materials and a certain set of starting conditions. Within a lot, there's a lot of correlation. And two of the components I mentioned, two monomer components that had sizable contributions, looking here, occasionally you can see the yellow line and the pink line. These are the variability introduced by these two monomer components. When they're both on the same side of the center line, they push the diameter out towards the spec limits and we have some other sources of variability that add to the possibilities. Another problem is that our .2 limit is for an individual lens. We disposition based on lots. And so this plot predicts lot averages, though. When we get a lot average out to .175, chances are we're going to have enough lenses beyond the limit to fail the lot. So in all, added up, our estimate is 4% of the lots are going to be discarded. And they're going to come in bunches. We're going to have weeks when we can't get half of our lots through the system. So this is a nonstarter. We have to make some major improvements. The lot-to-lot variability from two monomer components contributed a good chunk of that variability. We looked and found that the overall purity of Monomer 1 was a significant factor and certain impurities in Monomer 2, when present, were contributors. Our chemists looked at the synthetic routes for these ingredients and found that there was a single starting material that contributed most of the impurities. 
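The capability figures quoted for the lens diameter (standard deviation .082, limits of plus and minus .2) can be reproduced with a short Python sketch. It assumes a centered, normally distributed process with independent lenses, which is exactly the simplification the speaker warns about, so it recovers the quoted estimate rather than the lot-level reality.

```python
from statistics import NormalDist


def ppk(mean, sd, lsl, usl):
    """Process performance index: distance to the nearer spec limit over 3 sd."""
    return min(usl - mean, mean - lsl) / (3 * sd)


def fraction_out_of_spec(mean, sd, lsl, usl):
    """Share of a normal distribution that falls outside the spec limits."""
    dist = NormalDist(mean, sd)
    return dist.cdf(lsl) + (1 - dist.cdf(usl))


# Diameter expressed as deviation from target, in millimeters.
diameter_ppk = ppk(mean=0.0, sd=0.082, lsl=-0.2, usl=0.2)
diameter_oos = fraction_out_of_spec(mean=0.0, sd=0.082, lsl=-0.2, usl=0.2)

print(f"Ppk = {diameter_ppk:.2f}")
print(f"estimated out of spec = {diameter_oos:.1%}")
```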
They recommended that our suppliers distill this starting ingredient to eliminate the impurities. That made some major improvements. We also put variacs on the cure lamps to control the intensity. Lamp intensity was not a big factor, but this was easy. And when it's easy, you make the improvement. Strictly speaking, this was a variance assessment, rather than a variance budget. We never actually assigned numeric goals for each component. We were kind of picking the low-hanging fruit. I mean, we found two factors that pretty much accounted for a large portion of the variability. Maybe we need a little bit better structure to reach the higher branches, now that we need to reach up higher. Current status on lens dimension, lens diameter: PPk is 2.1. The product's on the market now, has been for a few years. This is not the problem anymore. We've made major improvements in these monomer components. We're still working on them. They still have detectable variability; detectable, but it hasn't been a problem in a long time. So the basic question is, what do we do to apply data to a variance budget? Maybe reduce that arbitrariness a little bit. We have to start by choosing a product measure in need of improvement. We need to identify the potential contributors; cause and effect diagrams are a convenient tool. We need to gather some foreknowledge. We need to know the sensitivity: the product measure divided by the process measure; what's the slope of that line? We are going to need some DOEs to fill in the gaps. We need to estimate the degree of difficulty for improving some of these factors. And we estimate the variance from each component and then we divide that variance, the total variance goal, among the contributors. Sounds easy enough. Let's get into an example. Let's say we're working on a new project. And along the way, we have a new product measure called CMT (stands for cascaded modulation transfer) to measure overall image sharpness. 
Kind of important for contact lenses. Target is 100, plus or minus 10. We want a PPK of at least 1.33 That means standard deviation's got to be 2.5 or less. Variance has got to be 6.25 or less. What factors might be involved? Let's think about a cause and effect diagram. We can go into JMP and create a table. We start by listing CMT in the parent column. Then we list each of our manufacturing steps in the child column. And then we start listing these child factors over on the parent's side and then we start listing subfactors. These subfactors are obviously generic and arbitrary, the whole thing's hypothetical. And we can go as many levels as we want. We can have as many branches in the diagram as we care to, but we've identified 14 potential factors here. So we go into the appropriate dialog box, identify the parent column and the child column. Click the OK button and out pops the cause and effect diagram. Brief aside here. I've been using JMP for 30 years now. I have very, very few regrets. This is one of them. And my regret is, I only found this last year. I don't know, actually, when this tool was implemented. I wish I had found it earlier because this is the easiest way I found to generate a cause and effect diagram. So we need to gather the sensitivity data. Physics sometimes will give us the answer. In optics, if we know the refractive index and the radius of curvature, that can give us some information about the optical power of the lens. Sometimes physics, oftentimes we need experimental data. So, ask the subject matter experts. Maybe somebody's done some experiments that will give us an idea. We're going to need some well-designed experiments because no way have all 14 of those factors been covered. Several notches down on the list, in my opinion, is historical data. And if you've used historical data to generate models, you know, some of the concerns I'm nervous about. We need to be very cautious with this. 
Historical data, it's usually messy; it has a lot of misidentified numbers, sometimes things in the wrong column, it needs a lot of cleaning. There's also a lot, also a lot of correlation between factors. Standard practice is to reserve 25% of the data points randomly, reserve that data for confirmation, generate the model with 75% of the data, and then test it with a 25% reserve data. If it works, maybe we have something worth using. If not, don't touch it. So gathering foreknowledge, we want to ask subject matter experts independently to contribute any sensitivity data they have. I'm taking a page from a presentation last year at the Discovery Summit by Cy Wegman and Wayne Levin. This is their suggestion in gathering foreknowledge to avoid the loudest voice in the room rules syndrome. Sometimes there's a quiet engineer sitting in the back who may have important information to impart, may or may not speak up. So we want to get that information. Ask everybody independently to start with. Then get people together and discuss the discrepancies. There will be some. Where are the gaps? What parameters still need sensitivity or distribution information? What parameters can we discount? I'd like to find these. What parameters are conditional? Doesn't happen very often, but in our contact lens process, we include biological indicators in every sterilization cycle. These indicators are intentionally biased so that false positives are far more likely than false negatives. When we get a failure in this test, we sterilize again. We know our sterilization routine was probably right, but we sterilize again. So sometimes we sterilize twice. That can have a small effect on our dimensions. It's small, but measurable. So we're going to need to plan some experiments to gather the sensitivities for things we don't know about. And we'll look at production distribution data; use it with caution to generate sensitivity. 
We can use it to generate information on the variability of each of the components and the overall variability of the product measure of interest. We need to do some record keeping along the way. We can start with that table we used to generate the cause and effect diagram, add a few more columns. Fill in the sensitivities, units of measure, various columns. Any kind of table will do. Just keep the records and keep them up to date. We're going to need some DOEs to fill in the gaps. There are some newer techniques -- definitive screening designs, group orthogonal super saturated designs -- provide a good bang for the buck when the situation fits. Now in this particular situation, we got 14 factors. We asked our subject matter experts. Some of them have enough experience to predict some directional information, but nobody has a good estimate of the actual slopes. So we need to evaluate 14 factors. I'd love to run a DSD that doesn't require 33 runs, I don't have the budget for it. So we're going to resort to the custom DOE. So, go to the custom DOE function and then...been using PowerPoint for long enough now...time we demonstrated a few things live in JMP. That would go to DOE custom design. And you don't have to, but it's a very good practice to fill in the response information (if i could type it right). Match target from 90 to 110. Importance of 1, only makes a difference if we have more than one response. The factors. I have my file, so I can load these quickly there. Here we have all 14 of the factors. This factor constraints, I've never used it. But I know it's there if some combination of factors would be dangerous. I know that we can disallow it. The model specification. This is probably the most important part. This is basically a screening operation. We're just going to look at the main effects. Now our subject matter experts suggested the interactions are not likely. And nonlinearity is possible but not likely to be strong. 
So we're going to ignore those for now, at least for the screening experiment. We don't need to block this. We don't need extra center points. For 14 main effects, JMP says a minimum of 15, that's a given, default 20. I've learned that if I have a budget that can run the default, that's a good place to start. I can do 20 runs; 33 was too much. I can manage the 20. Let's make this design. I left this in design units. This is a hypothetical example. I didn't feel like replacing these arbitrary units with other arbitrary units. We've got a whole suite of design evaluation tools, and a couple that I normally look at. The power analysis. If the root mean square error estimate of 1 is somewhere in the ballpark, then these power estimates are going to be somewhere in the ballpark. .9 and above, pretty good. I like that. The other thing I normally look at is the color map on correlations. I like to actually make it a color map. And it's kind of complicated. We've got 14 main effects, and I honestly haven't counted all the two-way interactions. What we're looking for is confounding effects, where we have two red blocks in the same row. Well, I don't see that. That's good. We've got some dark blue where there's no correlation. We've got some light blue where there's a slight correlation. And we have some light pink where maybe it's a .6 correlation coefficient. This is tolerable. As long as we don't have complete confounding, we can probably estimate what's what, what's causing the effect. Now this is good. Move on, make the table. Well, this is our design. We've got the space to fill in the results. I'm going to take a page from the Julia Child school of cooking. Do the prep for you and then put it in the oven and then take a previously prepared file out of the oven that already has the executed experiment. These are the results. CMT values, we wanted them between 90 and 110. We got a couple here in the 80s. There's 110.5, we've got 111 here. Looks like we have a little work to do. 
Let's analyze the data. Everything's all done for us. There's a response. Here's all the factors. We want the screening report. Click Run. r squared .999. Yeah, you can tell this is fake data. I probably should have set the noise factor a little higher than this. The first six factors are highly significant; the next eight, not so much. I was lazy when I generated it. I put something in there for the first six. Now, typically we eliminate the insignificant factors. So we can either eliminate them all at once. I tend to do it one at a time. Eliminate the least significant factor each time and see what it does to the other numbers. Sometimes it changes, sometimes it doesn't. Eliminate this one and it looks like Cure1 slipped in under the wire, .0478. It's just under .05. I doubt that it's a big deal, but we'll leave it in there. So we look at the residuals; that's kind of random, that's good. Studentized residuals, also kind of random. We need to look at the parameter estimates. This is what we paid for. These are the the...regression coefficients are the slopes we were looking for. These are the sensitivities. That's why we did the experiment. I'm a visual kind of guy, so I always look at the prediction profiler. And one of the things I like here...well, I always look at the...the plot of the slopes and look at the... I look at the confidence intervals, which are pretty small. Here you can just barely see there's a blue shaded interval. I also like to use the simulator when I have some information about the input, that we can input the variability for each of these. Now if you'll allow me again use the Julia Child approach and go back to the previously prepared example where I've already input the variations on each one of these. From Mold Tool 1, I input an expression that results in a bimodal distribution. And for Mold Tool 2, input a uniform distribution. 
And I gotta say, in defense of my friends in the tool room, a bimodal distribution only happens in a situation like what happened last month, where the tools we wanted to use were busy on the production floor, so for the experiment, we used some old iterations. We actually mixed iterations. When that happens, we can get a bimodal distribution. This uniform distribution never happens with these guys. They're always shooting for the center point and usually come within a couple of microns. Other distributions are all normal. Various widths. In one case, we had a bit of a bias to it. These are the input distributions. Here's our predicted output. Even though we had some non-normal starting distributions, we have pretty much a normal output distribution. It does extend beyond our targets. We kind of knew that. Now, the default when you start here is 5,000 runs. I usually increase this to something like 100,000. It didn't take any extra time to execute, and it gives you slightly smoother distributions. It also produces a table here, we can make the table. Move this over here. Big advantage of this is that we can get (don't want this CMT yet)...let's look at the distributions of the input factors. This is a bigger, fancier plot. This is our bimodal distribution, uniform, these various normal distributions, various widths, this one has kind of a bias to it. So we can take all those and we add them up. We look at this and we have a distribution. It looks pretty normal, even though some of the inputs were not normal. We can use conventional techniques on this. So when we start setting the specs, it does extend beyond our spec limits. So we're going to need to make some improvements in this. Scroll down here. Look at the capability information. PPk of .6. That's a nonstarter. No way is manufacturing going to accept a process like this. So we need to make some significant improvements. So go back to the PowerPoint file. 
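The simulator step described above can be imitated outside JMP with a small Monte Carlo sketch in Python. Everything numeric here is an assumption for illustration: the slopes, the input spreads, and the generic factor roles are hypothetical and are not the values from the talk, so the resulting capability will not match the PPk of .6 shown on screen. Only the structure follows the description: one bimodal input, one uniform input, several normal inputs (one with a bias), and a linear main-effects model.

```python
import math
import random

random.seed(2020)          # fixed seed so the sketch is repeatable
N_RUNS = 100_000           # the talk raises the default 5,000 runs to about 100,000
TARGET, LSL, USL = 100.0, 90.0, 110.0


def bimodal():
    """Two tool iterations mixed together: pick a center, then add a little noise."""
    return random.gauss(random.choice([-1.0, 1.0]), 0.3)


# (hypothetical slope, sampler for the input deviation in design units)
FACTORS = [
    (1.5, bimodal),                              # stands in for Mold Tool 1
    (1.0, lambda: random.uniform(-1.0, 1.0)),    # stands in for Mold Tool 2
    (2.0, lambda: random.gauss(0.0, 1.0)),
    (-1.5, lambda: random.gauss(0.0, 0.8)),
    (1.0, lambda: random.gauss(0.2, 1.2)),       # normal input with a slight bias
    (2.5, lambda: random.gauss(0.0, 0.6)),
]

# Main-effects model: CMT = target + sum(slope * input deviation).
cmt = [TARGET + sum(slope * draw() for slope, draw in FACTORS) for _ in range(N_RUNS)]

sim_mean = sum(cmt) / N_RUNS
sim_sd = math.sqrt(sum((y - sim_mean) ** 2 for y in cmt) / (N_RUNS - 1))
sim_ppk = min(USL - sim_mean, sim_mean - LSL) / (3 * sim_sd)
out_of_spec = sum(1 for y in cmt if y < LSL or y > USL) / N_RUNS

print(f"mean {sim_mean:.2f}, sd {sim_sd:.2f}, Ppk {sim_ppk:.2f}, out of spec {out_of_spec:.2%}")
```

With these assumed inputs the analytic output variance is the sum of slope squared times input variance, about 11.9, and the simulated standard deviation should land close to its square root. A histogram of `cmt` would look close to normal even though two of the inputs are not.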
And I scroll through the slides that were my backup in case I had a problem with Live version of JMP. Because of me having the problem, not JMP. So here we have the factors. Standard deviations come from our preproduction trials, estimate the variability. The sensitivities, these are the results from our DOE.  
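The bookkeeping behind that slide (standard deviations from preproduction trials, sensitivities from the DOE) can be sketched in Python as follows. The factor list, slopes, and standard deviations are hypothetical placeholders, not the numbers from the presentation; the variance goal follows the test case stated earlier (target 100 plus or minus 10, PPk of at least 1.33).

```python
import math

HALF_TOLERANCE = 10.0        # spec is target plus or minus 10
PPK_GOAL = 4.0 / 3.0         # quoted as 1.33 in the talk

# Largest standard deviation and variance that still meet the capability goal.
max_sd = HALF_TOLERANCE / (3.0 * PPK_GOAL)      # 2.5
variance_goal = max_sd ** 2                     # 6.25

# (factor, sensitivity = change in CMT per unit of the factor, sd of the factor)
# All values below are hypothetical.
factors = [
    ("Mold Tool 1", 1.5, 1.0),
    ("Mold Tool 2", 1.0, 0.6),
    ("Cure1", 2.0, 1.0),
    ("Factor D", -1.5, 0.8),
]

# Common currency: each factor's contribution to the variance of the product measure.
contributions = {name: (slope * sd) ** 2 for name, slope, sd in factors}
predicted_variance = sum(contributions.values())
predicted_ppk = HALF_TOLERANCE / (3.0 * math.sqrt(predicted_variance))

for name, var in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} variance {var:5.2f}   share of goal {var / variance_goal:6.1%}")
print(f"predicted variance {predicted_variance:.2f} vs goal {variance_goal:.2f}")
print(f"predicted Ppk {predicted_ppk:.2f}")
if predicted_variance > variance_goal:
    print("Over budget: some contributors need a smaller allocation.")
```

Sorting by contribution makes the largest consumers of the budget visible first, which is where the talk suggests looking for improvements. If the variance observed in production exceeds `predicted_variance`, the difference would be carried as the TBD factor mentioned in the abstract.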
Level: Intermediate

Gaussian Process (GP) is one of several analysis techniques that are used to build approximation models for computer-generated experiments. Generally, a space filling design is used to guide the computer experimentation efforts because all the parameters/variables are derived from or directly pulled from first-principles physics models/equations. Space filling designs are used because the data generated by the computer experiments is deterministic and likely to be highly non-linear. This is where GP comes into play. Because the data is deterministic, GP will attempt to fit every point in the design perfectly, allowing for a close approximation of the true model. We will compare GP to Response Surface and Neural Net models. We will also compare GP models derived from different types of space filling designs.

Gaussian Process
Typically used to build models for computer simulation experiments.
Data is deterministic, so there is no need to run an experiment more than once. A given set of inputs will always produce the same answer.
Also known as kriging.
More than 100 conditions will take a long time to compute a solution.
JMP Pro has Fast GASP for larger data sets: it breaks the GP into blocks, allowing for faster computation.
You can also have categorical inputs with JMP Pro.

Model Options for Gaussian Fit
Estimate Nugget Parameter: This is useful if there is noise or randomness in the response and you would like the prediction model to smooth over the noise instead of fitting it perfectly. Highly recommended.
Correlation Type: Lets you choose the correlation structure used in the model.
  Gaussian: allows the correlation between two responses to always be non-zero, no matter the distance between the points.
  Cubic: allows the correlation between two responses to be zero for points far enough apart.
Minimum Theta Value: Allows you to set the minimum theta value used in the fitted model.

Variance vs. Bias
For most designed experiments, the goal is to minimize the variance of prediction. Because computer experiments are deterministic, there is no variance, but there is bias. Bias is the difference between the approximation model and the true mathematical function. Space filling designs are used in an effort to bound the bias.

Borehole Example

Types of Space Filling Designs in JMP
Sphere Packing: maximizes the minimum distance between design points.
Latin Hypercube: maximizes the minimum distance between design points but requires even spacing of the levels for each factor.
Uniform: minimizes the discrepancy between the design points and a theoretical uniform distribution.
Minimum Potential: spreads points inside a sphere around a centroid.
Maximum Entropy: measures the amount of information contained in the distribution of a set of data.
Gaussian Process IMSE Optimal: creates a design that minimizes the integrated mean square error (IMSE) of the Gaussian Process over the experimental region.
Fast Flexible Filling (FFF): uses clusters of random points to choose design points according to an optimization criterion. Can be constrained.

Summary of Fit
Fit the Gaussian model with and without the Nugget Parameter and check the Jackknife fit.
Neural Net models offer a good alternative to Gaussian models but can be more complicated. NN models sometimes outperform Gaussian models.
Use the smoothing function for Neural Nets (JMP Pro).
Don't rely on R2 alone when deciding on the best-fit model.
Picking the right model is about keeping the model as simple as possible while still getting reasonable prediction.

Gaussian Process Resources
Comparison of different GP packages - from 2017
Borehole model example found in JMP 14 DOE Guide Chapter 21 pg 637.
Discovery Summit 2011 Presentation: Meta-Modeling of Computational Models – Challenges and Opportunities
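To make the interpolation and nugget ideas above concrete, here is a minimal one-dimensional kriging sketch in plain Python. It is an illustration under stated assumptions, not JMP's implementation: theta is fixed by hand rather than estimated, the nugget is simply added to the diagonal of the correlation matrix, and the test function and all names are invented for this example.

```python
import math


def gaussian_corr(x1, x2, theta):
    """Gaussian correlation: never exactly zero, decays with squared distance."""
    return math.exp(-theta * (x1 - x2) ** 2)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_gp(xs, ys, theta, nugget=0.0):
    """Return a predictor: constant mean plus a correlation-weighted correction."""
    n = len(xs)
    corr = [[gaussian_corr(xs[i], xs[j], theta) + (nugget if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    # Generalized least squares estimate of the constant mean.
    mu = sum(solve(corr, ys)) / sum(solve(corr, [1.0] * n))
    weights = solve(corr, [y - mu for y in ys])

    def predict(x):
        return mu + sum(w * gaussian_corr(x, xi, theta) for w, xi in zip(weights, xs))

    return predict


# Deterministic "computer experiment": five evenly spaced runs of a known function.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [math.sin(2 * math.pi * x) for x in xs]

interpolator = fit_gp(xs, ys, theta=10.0)              # no nugget: passes through the data
smoother = fit_gp(xs, ys, theta=10.0, nugget=0.1)      # nugget: smooths over the data

for x, y in zip(xs, ys):
    print(f"x={x:.2f}  y={y:+.3f}  no nugget={interpolator(x):+.3f}  nugget={smoother(x):+.3f}")
```

Without a nugget the predictor reproduces every design point, which is the behavior wanted for deterministic data; with a nugget it no longer passes exactly through the points, which is the smoothing behavior described under Model Options.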
Level: Intermediate

Designed experiments for dry etch equipment present challenges for semiconductor engineers. First, because the total gas flow rate is often fixed, a mixture design must be used to honor the constraints imposed by this type of design. These types of designs are not commonly seen in the semiconductor industry. Second, as is often the case with these experiments, the investigator is interested in optimizing more than one variable. In this presentation, you will see an example of how to design and analyze an eight-factor experiment for a dry etch tool and simultaneously optimize an overall wafer target value while minimizing within-wafer variability.

Overview
Eight-factor experiment for a dry etch process
  Three process gases: A, B, C
  Five process factors: power, pressure, temperature, time, total flow
The experimenter was interested in both the gas ratios and the total gas flow.
To keep total flow and gas ratios as uncorrelated as possible, a mixture design was used.
In addition, the experimenter wanted to bound the ratio for two of the gases between an upper and lower value.
The third gas, C, was to make up no less than 10% and no more than 25% of the total mixture.

What is a Mixture Design?
A mixture design is used when the quantities of two or more experimental factors must sum to a fixed amount.
The inclusion of non-mixture components (i.e., factors that are not part of the mixture) makes designing this experiment challenging.
Mixture designs emphasize prediction over factor screening. For that reason, mixture factors are not removed from the experiment even when they are not significant (they may be set to 0, however).

Mixture Design Challenges
Effects are highly correlated and are harder to estimate.
Squared mixture terms are confounded with (are a linear function of) mixture factor main effects and two-factor interactions.
Main effects for non-mixture factors are correlated with the two-factor interactions between that non-mixture factor and the mixture factors.
Focusing on prediction and use of the Profiler (instead of parameter estimation and significance) makes designing and interpreting mixture experiments much easier.

Experimental Responses
Response    Goal
ER          Target=100
ER Std      Minimize

Experimental Factors
Factor       Low    High
Power        25     75
Press        100    200
Temp         25     40
time         30     45
Total Flow   80     120
Gas A        0      1
Gas B        0      1
Gas C        0.1    0.25

Experimental Constraints