Tony Cooper, Principal Analytical Consultant, SAS Sam Edgemon, Principal Analytical Consultant, SAS Institute   In process product development, Design of Experiments (DOE) helps to answer questions like: Which X's cause Y's to change, in which direction, and by how much per unit change? How do we achieve a desired effect? And which X's might allow looser control, or require tighter control, or a different optimal setting? Information in historical data can help partially answer these questions and help run the right experiment. One of the issues with such non-DOE data is the challenge of multicollinearity. Detecting multicollinearity and understanding it allows the analyst to react appropriately. Variance Inflation Factors from the analysis of a model and Variable Clustering without respect to a model are two common approaches. The information from each is similar but not identical. Simple plots can add to the understanding but may not reveal enough information. Examples will be used to discuss the issues.   Auto-generated transcript...   Speaker Transcript Tony Hi, my name is Tony Cooper and I'll be presenting some work that Sam Edgemon and I did; I'll be representing both of us. Sam and I have been research statisticians in support of manufacturing and engineering for much of our careers, and today's topic is Understanding Variable Clustering is Crucial When Analyzing Stored Manufacturing Data. I'll be using a single data set to present from. The focus of this presentation will be reading output, and I won't really have time to fill out all the dialog boxes to generate the output. But I've saved all the scripts in the data table, which of course will be available in the JMP Community. Once a script is run, you can always use the red arrow to do the redo option and relaunch the analysis, and you'll see the dialog box that would have been used to produce that piece of output. I'll be using this single data set of manufacturing data.
Let's have a quick look at how this data set looks. First of all, I have a timestamp. So this is a continuous process, and at some interval of time I come back and I get a bunch of information. On some Y variables, some major outputs, some KPIs that the customer really cares about; these have to be in spec and so forth in order for us to ship it. So these would definitively be outputs. Then line speed, the set point for the vibration, and a bunch of other things. You can see all the way across, I've got 60-odd variables that we're measuring at that moment in time. Some of them are sensors and some of them are set points. And in fact, some of them are indeed set points, like the manufacturing heat set point. Some of them are commands, which means here's what the PLC told the process to do. Some of them are measures, which is the actual value it's at right now. Some of them are ambient conditions; I think that's an external temperature. Some of them are raw material quality characteristics, and there's certainly some vision system output. So there's a bunch of things in the inputs right now. And you can imagine, for instance, if I've got the command and the measure, what I told zone one to be at and what it actually measures, we hope those are correlated. And we need to investigate some of that in order to think about what's going on. That's multicollinearity: understanding the interrelationships among the inputs or, separately, among the outputs. By and large we're not doing Y-cause-X-effect right now; this is not a supervised technique, it's an unsupervised technique. I'm just trying to understand what's going on in the data. And all my scripts are over here.
So here's the abstract we talked about; here's a review of the data. As we just said, it may explain why there is so much multicollinearity, because I do have set points and actuals in there. But we'll learn about those things. What we're going to do first is fit Y as a function of X, and I'm only going to look at these two variables right now: the zone one command, what I told zone one to be, and then a thermocouple in the zone one area that's measuring. And you can see clearly that as zone one temperature gets higher, this Y4, this response variable, gets lower. That's true also for the measurement, and you can see in the estimates that both are negative. By the way, these are fairly predictive variables in the sense that just this one variable is explaining 50% of what's going on in Y4. Well, let's do a multivariate. Imagine I now go to Fit Model and move both of those variables into my model, and I'm still analyzing Y4; my response is still Y4. Oh, look at this. Now it's suggesting that as the command goes up, Y4 does the opposite of what I expect. This one is still negative, in the right direction, but look at some of these numbers. These aren't even in the same ballpark as what I had a moment ago, which was negative .04 and negative .07; now I have positive .4 and negative .87. I'm not sure I can trust this model from an engineering standpoint, and I really wonder how it's characterizing and helping me understand this process. And there's a clue. This may not be on by default, but you can right-click on the parameter estimates table and ask for the VIF column. That stands for variance inflation factor.
And that is an estimate of the instability of the model due to this multicollinearity. So we need to get a little intuition on how to think about that value. But just to be a little clearer about what's going on, I'm going to plot temperature zone one command against the measure, and as you would expect, as you tell the zone to increase in temperature, yes, the zone does increase in temperature, and by and large maybe even to the right values. I've got this color-coded by Y4, and it suggests that at low values of temperature I get high values of Y4, just the way I saw on the individual plots. But maybe you can start to get some intuition as to why the multivariate model didn't work. You're trying to fit a surface over this knife edge, and obviously there's some instability; you can easily imagine it's not well pinned on the sides, so the surface can rock back and forward, and that's what you're getting. In terms of the variance inflation factor: the OLS (ordinary least squares) analysis, typical regression analysis, just can't handle it; in some sense, we could also talk about the X′X matrix being almost singular in places. So we've got some heuristic of why it's happening. Let's go back and think more about the values. We now know that the variance inflation factor actually considers more than just pairwise relationships, but intuition helps me think about the relationship between VIF and pairwise comparison. Like if I have two variables that are 60% correlated, and if it were all pairwise, then the VIF would be about 2.5.
And if two variables were 90% correlated, then I would get a VIF of 10. The literature says that when you have VIFs of 10 or more, you have enough instability that you should worry about your model, because in some sense you've got 90% correlation between things in your data. Now, whether 10 is a good cutoff really depends, I guess, on your application. If you're making real design decisions based on it, I think a cutoff of 10 is way too high, and maybe even near two is better. But suppose I'm thinking about what factors to put in an experiment: I want to learn how to run the right experiment, I've got to pick factors to go in it, and I know we too often go after the usual suspects. Is there a way I can push myself to think of some better factors? Well, then maybe 10 might be useful to help you narrow down. So it really depends on what the purpose is. More on this idea of purpose: there are two main purposes, if you will, of modeling. One is what will happen; that's prediction. That's different, sometimes, from why will it happen, which is more like explanation. As we just saw with a very simple command and measure on zone one, you cannot do very good explanation. I would not trust that model to think about why something's happening when the estimates seem to go in the wrong direction like that. So I wouldn't use it for explanation, and I'm even suggesting that I wouldn't use it for prediction, because if I can't understand the model, if it's not intuitively doing what I expect, that seems to make extrapolation dangerous. And of course, by definition, prediction of a future event is extrapolation in time.
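The rule of thumb in the talk (60% shared variation gives a VIF near 2.5, 90% gives 10) comes from VIF = 1/(1 - R²), where R² is from regressing one predictor on the rest. As a minimal sketch of that computation, here is a small Python/NumPy illustration (the variable names `cmd` and `meas` are hypothetical stand-ins for a command/measure pair like zone one's; the talk itself uses JMP's built-in VIF column):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X:
    regress column j on the remaining columns, then return 1/(1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + other columns
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
cmd = rng.normal(250, 5, 500)           # hypothetical zone temperature command
meas = cmd + rng.normal(0, 1, 500)      # the measured value tracks the command closely
print(vif(np.column_stack([cmd, meas])))  # both VIFs come out large (well above 10)
```

Note that 1/(1 - 0.6) = 2.5 and 1/(1 - 0.9) = 10, matching the guideline quoted above; a command/measure pair that tracks this tightly is exactly the kind of column pair that destabilizes the two-variable regression shown earlier.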
Right, so I would always think about this. We've been talking about ordinary least squares so far, but all the modeling techniques I see, like decision trees, partition analysis, are all in some way affected by this issue, in different, unexpected ways. So it seems a good idea to take care of it if you can. And it isn't unique to manufacturing data, but it's probably exaggerated in manufacturing data, because often the variables are controlled. If we have, say, zone one temperature and we've learned that it needs to be at this value to make good product, then we will control it as tightly as possible to that desired value. And so the extrapolation gets harder and harder. This is exaggerated in manufacturing because we do control the variables. And there are some other things about manufacturing data, which you can read here, that can make it better; these are the opportunities and challenges. Better is that you can understand manufacturing processes; there's a lot of science and engineering around how those things run. You can understand that stoichiometry, for instance, requires that the amount of chemical A you add has to be in relationship to chemical B. Or you don't want to go from one zone one temperature to something vastly different, because you need to ramp it up slowly, otherwise you'll just create stress. So you can and should understand your process, and that may be one way, even without any of this empirical evidence on multicollinearity, to get rid of some of it. There's also an advantage to manufacturing data in that it's almost always time-based, so do plot the variables over time.
And it's often interesting that manufacturing data does seem to move in blocks of time, like here: we think it should be 250 and we run it there for months, maybe even years, and then suddenly someone says, you know, we've done something different or we've got a new idea, let's move the temperature. And so it's very different. And of course, if you're thinking about why there is multicollinearity: we've talked about how it could be due to physics or chemistry, but it could be due to some latent variable, of course, especially when there's a concern with variables shifting in time, like we just saw. Anything that also changes at that rate could be the actual thing that's affecting the values, the Y4. It could be due to a control plan. It could be correlated by design. And for each of these reasons for multicollinearity, the question I always ask is: is it physically possible to try other combinations, and not just physically possible, does it make sense to try other combinations? In which case you're leaning towards doing experimentation, and this modeling, looking retrospectively at your data, is very helpful for designing better experiments. Sometimes, though, the expectations are a little too high, in my mind; we seem to expect a definitive answer from this retrospective data. So we've talked about methods to address multicollinearity and understand it. One is the VIF. And here's the VIF on a bigger model with all the variables in. How would I think about which are correlated with which? This tells me I have a lot of problems, but it does not tell me how to deal with these problems. So this is good at helping you say, oh, this is not going to be a good model, but it's not much help getting you out of it.
And if I were to put interactions in here, it would be even worse, because they are almost always more correlated. So we need another technique. And that is variable clustering. This is available in JMP and there are two ways to get to it: you can go through Analyze > Multivariate Methods > Principal Components, or you can go straight to Clustering > Cluster Variables. If you go through principal components, you've got to use the red triangle option to get to cluster variables. I still prefer this method, the PCA, because I like the extra output. But it is based on PCA, so we're going to talk about PCA first, and then we'll talk about the output from variable clustering. And there's the JMP page. In order to talk about principal components, I'm actually going to work with standardized versions of the variables first. Let's remind ourselves what standardized means: the mean is now zero, the standard deviation is now one. So I've standardized, which means they're all on the same scale now, and implicitly, when you do principal components on correlations in JMP, you are doing it on standardized variables. JMP is, of course, more than smart enough for you to put in the original values and for it to work the correct way; when it outputs a formula, it will figure out what the right formula should have been on the unstandardized scale. But just as a first look, just so we can see where the formulas are, I'm going to work with standardized variables for a while; we will quickly go back, but I just want to see one formula. And that's this formula right here. What I'm going to do is think about the output. So what is the purpose of PCA? Well, it's called a variation reduction technique, but what it does is look for linear combinations of your variables.
And if it finds a linear combination that it likes, that's called a principal component. It uses eigenanalysis to do this. So another way to think about it: I put 60-odd variables into the inputs. There are not 60 independent dimensions that I can then manipulate in this process; they're correlated. What I do to the command for the temperature in zone one dictates almost entirely what happens with the actual temperature in zone one. So those aren't two independent variables; you don't have two dimensions there. So how many dimensions do we have? That's what the eigenvalues tell you. These are like variances, estimates of variance, and the cutoff is one. If I had the whole table here, there would be 60 eigenvalues going all the way down; JMP stops reporting at .97, but the next one down is probably .965, and it keeps going down. The guideline is that if an eigenvalue is greater than one, then that linear combination is explaining a signal, and if it's less than one, then it's just explaining noise. So when I go to variable clustering, JMP says: you have a second dimension here. That means a second group of variables that's explaining something a little bit differently. So it separates the variables into two groups and then does PCA on both. For each group the first eigenvalue will be big, but what does the second one look like after the split? If the second eigenvalues are now less than one, you're done. But if one is greater than one, that group can be split even further. And it keeps splitting, iteratively and divisively, until there's no second component greater than one anymore.
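The divisive procedure described above can be sketched in a few lines. This is a simplified illustration, not JMP's actual algorithm (JMP splits on rotated components and then reassigns variables; here each variable simply goes with whichever of the top two principal components it loads on more heavily), but it shows the "split while the second eigenvalue exceeds 1" rule in action:

```python
import numpy as np

def cluster_variables(X, cols):
    """Simplified divisive variable clustering: if the 2nd eigenvalue of the
    correlation matrix of these columns exceeds 1, split the variables into
    two groups by which of the top two principal components each loads on
    most strongly, then recurse on each group."""
    if len(cols) < 2:
        return [cols]
    R = np.corrcoef(X[:, cols], rowvar=False)
    w, V = np.linalg.eigh(R)              # eigenvalues in ascending order
    if w[-2] <= 1.0:                      # second component is just noise: stop
        return [cols]
    load = np.abs(V[:, [-1, -2]])         # |loadings| on the top two components
    g1 = [c for c, l in zip(cols, load) if l[0] >= l[1]]
    g2 = [c for c, l in zip(cols, load) if l[0] < l[1]]
    if not g1 or not g2:
        return [cols]
    return cluster_variables(X, g1) + cluster_variables(X, g2)

rng = np.random.default_rng(7)
t = rng.normal(size=400)                  # hypothetical latent "temperature" driver
p = rng.normal(size=400)                  # hypothetical latent "pressure" driver
X = np.column_stack([t + 0.1 * rng.normal(size=400),   # three temperature-like vars
                     t + 0.1 * rng.normal(size=400),
                     t + 0.1 * rng.normal(size=400),
                     p + 0.1 * rng.normal(size=400),   # two pressure-like vars
                     p + 0.1 * rng.normal(size=400)])
print(cluster_variables(X, list(range(5))))  # e.g., [[0, 1, 2], [3, 4]]
```

With two latent drivers, the correlation matrix has two eigenvalues above 1, so one split recovers the two groups, and each group's second eigenvalue then falls below 1, which stops the recursion.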
So it's creating these linear combinations, and you can see when I save principal component one, it's exactly this formula, .0612; that's the formula for Prin1, and this will be exactly right as long as you standardize. If you don't standardize, then JMP will just figure out how to do it on the unstandardized variables, which it's more than capable of doing. So let's start working with the initial data, and we'll do our example. You'll see this is very similar output; in fact it's the same output except for one thing I've turned on, this option right here, cluster variables. What I get down here is that cluster variables output. You can see these numbers are the same, and you can already start to see some of the thinking it's going to have to do. Let's look at these two right here: the amount of A I put in seems to be highly related to the amount of B I put in, and that would make sense for most chemical processes, if that's part of the chemical reaction you're trying to accomplish. If you want to put 10% more A in, you're probably going to put 10% more B in. So even in this ray plot you start to see some things that suggest the multicollinearity. But I want to put them in distinct groups, and this is a little hard, because watch this one right here, temperature zone 4. It's actually the opposite: sort of the same angle, but in the opposite direction, almost 180 degrees from A and B. So zone 4 temperature is negatively correlated with A and B. But I also want to put them in exclusive groups, and that's what we get when we ask for the variable clustering. It took those very same variables and put them in eight distinct groups. And here are the standardized coefficients.
So these are the formulas for the individual clusters. And when I save the cluster components, I get something very similar to what we did with Prin1, except this is just for cluster 1, because you notice that in this row that has zone one command with a .44, everywhere else is zero. Every variable is exclusively in one cluster or another. So let's talk about some of the output. Tony And then we get some tables in our output. I'm going to minimize this window, and we'll talk about what's in here in terms of output. The first thing I want to point you to is the standardized estimates. If you want to do it, quote, by hand, and ask how I get a .35238 here: I could run PCA on just the cluster one variables, the variables staying in there, and then look at the eigenvalues and eigenvectors, and these are the exact same numbers. So the .352 is just what I said: it keeps divisively doing multiple PCAs, and you can see that the second principal component here is less than one, so that's why it stopped at that component. Who's in cluster one? Well, there's temperature: the two zone one measures, a zone three A measure; the amount of added water seems to matter there too, with some of the other temperatures over here in cluster six. This is a very helpful color-coded table, organized by cluster. So this is cluster one; I can hover over it and read all of the things: added water (yes, added water; sorry, let me get the one that's my temperature three). And that's a positive correlation; interestingly, zone 4 has a negative correlation there.
So you will definitely see blocks of color, because... this is cluster one obviously, this is maybe cluster three, and this, I know, is cluster six. But look over here: as you can imagine, clusters one and three are somewhat correlated. We start to see some ideas about what we might do, some intuition as to what's going on in the data. Let's explore this table right here. This is the R-squared with own cluster, with next cluster, and the 1 minus R-squared ratio. I'm going to save that to my own data set and run a multivariate on it. So I've got cluster one through eight components and all the factors. What I'm really interested in is this part of the table: for all of the original variables, how they correlate with the cluster components. So let me save that table, and then I'm going to delete some extra rows out of it, but it's the same table, just so we can focus on certain ones. What I've got left is the columns that are the cluster components, and the rows are the 27 variables (not 60, sorry) that we were thinking about; it's put them in 8 different groups, and I've added that group number; it can come automatically. In order to continue to work, I don't want these as correlations, I want these as R-squares, so I'm going to square all those numbers. So I just squared them, and here we go; now we can start thinking about it. Let's look at row one. This one, the temperature one measure we've talked about, is 96% correlated with its cluster, cluster one. It's 1% correlated with cluster two, so it looks to me like it really, really wants to be in cluster one and doesn't have much desire to go into cluster two.
What I want to do is color-code some things here so we can find them faster. So we're talking about zone one measure, and the cluster it would like to be in next, if anything, is cluster five. It's 96% correlated with its own cluster, but if it had to leave that cluster, where would it go? It would think about cluster five. And what's in cluster five? Well, you could certainly start looking at that: there are more temperatures there, the moisture heat set point, and so forth. So this number, the R-squared with its own cluster, tells you how much the variable likes being in its cluster, and the R-squared with the next cluster tells you how much it would like to be in a different one. Those are the two things reported in the table. JMP doesn't show all the correlations, but it may be worth plotting them, and as we just demonstrated, it's not hard to do. So this tells you: I really like being in my own cluster. This says: if I had to leave, I don't mind going over here. Let's compare those two. Take one minus this number divided by one minus this number; that's the 1 minus R-squared ratio. So it's a ratio of how much a variable likes its own cluster versus how tempted it is to go to another cluster. And let's plot some of those. Let me put the local data filter on there, on cluster. And here's the thing: lower values of this ratio are, in some sense, better. These are the ones over here; let's highlight this one at the top... I like the one down here, sorry. This one, vacuum set point: you can see it really, really liked its own cluster over the next nearest, .97 vs .0005, so the ratio is near zero. You wouldn't want to move that one. And you could start to do things like: let's just think about the cluster one variables.
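The ratio described above is simple arithmetic, and it's worth seeing the two extremes side by side. A minimal sketch using the talk's vacuum-set-point numbers (the second, borderline pair of values is hypothetical, just to show what a poorly placed variable looks like):

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """JMP's cluster-membership diagnostic:
    (1 - R^2 with own cluster) / (1 - R^2 with next closest cluster).
    Near 0: the variable is firmly in the right cluster.
    Near 1: it fits the next cluster almost as well as its own."""
    return (1.0 - r2_own) / (1.0 - r2_next)

# Vacuum set point from the talk: loves its own cluster (.97 vs .0005)
print(round(one_minus_r2_ratio(0.97, 0.0005), 4))   # ratio near zero

# A hypothetical borderline variable: nearly as correlated with the next cluster
print(round(one_minus_r2_ratio(0.70, 0.65), 4))     # ratio near one: candidate to move
```

So (1 − .97)/(1 − .0005) ≈ .03, which is why the vacuum set point sits at the "keep it where it is" end of the plot.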
If any variable wanted to leave it, maybe it's Pct_reclaim; maybe it got in there by fortuitous chance. And if I were in cluster two, I could start thinking about the line speed. The last table I'm going to talk about is the cluster summary table, this table here. It's saying: if I look down R-squared own cluster, the highest number is .967, for cluster one, so maybe that's the most representative variable. To me, it may not be wise to let software decide to keep this variable and only this variable, although certainly every software package has a button that says launch it with just the main ones, and that may give you a stable model in some sense. But I think it's shortchanging the kinds of things you can do, and hopefully, with the techniques provided, you now have the ability to explore those other things. You can calculate this number by again doing the PCA on just its own cluster; it's saying that eigenvalue one explained 69%, so that's another way of thinking about its own cluster. Let's close these and summarize. So we've given you several techniques. Failing to understand multicollinearity can make your data hard to interpret and even make the models misleading. Understanding the data as it relates to the process can explain why there is multicollinearity, just from an SME standpoint, a subject matter expertise standpoint. Pairwise scatterplots are easier to interpret but miss multicollinearity, so you need more. VIF is great at telling you you've got a problem, but it's only really available for ordinary least squares; there's no comparable diagnostic for other prediction methods.
And there are some guidelines. Variable clustering is based on principal components and shows both the membership and the strength of relationship in each group. I hope this review, or introduction, will encourage you to grab this data set from JMP Discovery; you can run the scripts yourself, go more slowly, make sure you feel good about it, and practice, maybe on your own data. One last plug: there is other material from other times that Sam and I have spoken at JMP Discovery conferences or written white papers for JMP, and maybe you might want to look at how we have thought about manufacturing data, because we have found that modeling manufacturing data is a little unique compared to, say, customer data in telecom, which is where a lot of us learned when we went through school. I thank you for your attention and good luck with further analysis.
Monday, October 12, 2020
Charles Whitman, Reliability Engineer, Qorvo   Simulated step stress data, where both temperature and power are varied, are analyzed in JMP and R. The simulation mimics actual life test methods used in stressing SAW and BAW filters. In an actual life test, the power delivered to a filter is stepped up over time until failure (or censoring) occurs at a fixed ambient temperature. The failure times are fitted to a combined Arrhenius/power law model similar to Black's equation. Although stepping power simultaneously increases the device temperature, the algorithm in R is able to separate these two effects. JMP is used to generate random lognormal failure times for different step stress patterns. R is called from within JMP to perform maximum likelihood estimation and find bootstrap confidence intervals on the model estimates. JMP is used live to plot the step patterns and demonstrate good agreement between the estimates and confidence bounds and the known true values. A safe-operating-area (SOA) is generated from the parameter estimates. The presentation will be given using a JMP journal.   The following are excerpts from the presentation.   Auto-generated transcript...   Speaker Transcript CWhitman All right. Well, thank you very much for attending my talk. My name is Charlie Whitman; I'm at Qorvo, and today I'm going to talk about step stress modeling in JMP using R. So first, let me start off with an introduction. I'm going to talk a little bit about stress testing, what it is and why we do it. There are two basic kinds, constant stress and step stress; I'll talk a little bit about each. What we get out of a step stress or constant stress test are estimates of the model parameters. That's what we need to make predictions. So in stress testing, we're stressing parts at very high stress, and then we're going to take that data and extrapolate to use conditions, and we need model parameters to do that.
But model parameters are only half the story. We also have to acknowledge that there's some uncertainty in those estimates, and we're going to do that with confidence bounds; I'll talk about a bootstrapping method I used to do that. At the end of the day, armed with our maximum likelihood estimates and our bootstrap confidence bounds, we can create something called the safe operating area, SOA, which is something of a reliability map. You can also think of it as a response surface. We're going to find regions where it's safe to operate your part and regions where it's not safe. And then I'll reach some conclusions. So what is a stress test? In a stress test you stress parts until failure. Now, sometimes you don't get failure; sometimes you have to stop the test and do something else. In that case, you have a censored data point, but the method of maximum likelihood, which I used in the simulations, takes censoring into account, so you don't have to have 100% failure. We can afford to have some parts not fail. So what you do is stress these parts under various conditions, according to some designed experiment or some matrix. Your stress might be temperature or power or voltage or something like that; you run your parts under various stresses, take that data, fit it to your model, and then extrapolate to use conditions. For the Arrhenius model, mu = ln(A) + Ea/(kT). Mu is the log mean of your distribution; we commonly use the lognormal distribution. That's a constant term plus the temperature term. You can see that mu is inversely related to temperature: as temperature goes up, mu goes down, and as temperature goes down, mu goes up. If we use the lognormal, we also have an additional parameter, the shape factor sigma. So after we run our test, we will have run several parts under various stress conditions and we fit them to our model.
It's when you combine those two that you can predict behavior at use conditions, which is really the name of the game. The most common method is a constant stress test, where, basically, the stress is fixed for the duration of the test. So this is just showing an example of that. We have a plot here of temperature versus time. If we have a very low temperature, the failure times could sometimes be very long. The failure times are random, again according to, say, some distribution like the lognormal. If we increase the temperature to some higher level, we would again get a distribution of failure times, but on the average the failure times would be shorter. And if we increase the temperature even more, same kind of thing, but the failure times are even shorter than that. So if I ran, say, a bunch of parts under these different temperatures, I could fit the results to a probability plot that looks like this. I have probability versus time to failure; at the highest temperature here, in this example 330 degrees C, I have my set of failure times, which I fit to a lognormal. And then as I decrease the temperature lower and lower, the failure times get longer and longer. Then I take all this data over temperature, fit it to the Arrhenius model, and extrapolate. And then I can get my predictions at use conditions. This is what we are after. I want to point out that when we're doing this accelerated testing, we have to run at very high stress because, for example, even though this test is, say, lasting 1000 hours or so, our prediction is that the part under use conditions would last a billion hours, and there's no way we could run a test for a billion hours. So we have to get tests done in a reasonable amount of time, and that's why we're doing accelerated testing. So then, what is a step stress? Well, as you might imagine, a step stress is where you increase the stress in steps or some sort of a ramp.
The advantage is that it's a real time saver. As I showed in the previous plot, those tests could last a very long time; it could be 1000 hours, so it could be weeks or months before the test is over. A step stress test could be much shorter; you might be able to get done in hours or days. So it's a real time saver, but the analysis is more difficult, and I'll show that in a minute. In the work we've done at Qorvo, we're doing reliability of acoustic filters, and those are RF devices. The stress in RF is RF power, and so we step up power until failure. If we're going to step up power, we can model this with this expression here. Basically, we have the same thing as the Arrhenius equation, but we're adding another term, n log P. n is our power acceleration parameter; P is our power. For the lognormal distribution, there would be a fourth parameter, sigma, which is the shape factor. So you have 1, 2, 3, 4 parameters. Let me just give you a quick example of what this would look like. This is power versus time; power is in dBm. You're starting off at some power like 33.5 dBm, and you step and step and step and step until hopefully you get failure. And I want to point out that you're varying power, and as you increase the power to the part, that's going to be changing the temperature. So as power is ramped, so is temperature. Power and temperature are then confounded, so you're going to have to do your experiment in such a way that you can separate the effects of temperature and power. I want to point out that you have these two terms (temperature and power), so it's not just that I increase the power to the part and it gets hotter and it's the temperature that's driving it; power in and of itself also increases the failure rate. Right. So now I'll show a little bit more detail about that step stress plot. Here again is power versus time.
I'm running a part for, say, five hours at some power, then I increase the stress, run another five hours, and increase the stress on up until I get a failure. And as I mentioned, as the power is increasing, so is the temperature, so I have to take that into account somehow. I have to know the device temperature: T = T_ambient + R_th × P. T_ambient is our ambient temperature; P is the power; and R_th is called the thermal impedance, which is a constant. That means, as I set the power, I know what the power is, and I can also estimate what the temperature is for each step. So what we'd like to do is take the failure times we get from our step stress pattern and extrapolate them to use conditions. If I were only running for time delta t here and I wanted to extrapolate that to use conditions, what I would do is compute the equivalent amount of time: delta t times the acceleration factor. And here's the acceleration factor: I have an activation energy term, a temperature term, and a power term. So I would multiply by AF. And since I'm going from high stress down to low stress, AF is larger than one; in this illustration it's not that much bigger than one, but you get the idea. And as I increase the power, temperature and power are changing, so the AF changes with each step. So if I want to get the equivalent time at use conditions, I have to do a sum. Each segment has its own acceleration factor and maybe its own delta t, and then I do a sum, and that gives me the equivalent time. So this is the expression I would use to predict equivalent time: if I knew exactly what Ea was and exactly what n was, I could predict what the equivalent time was. So that's the idea. It turns out that, as I said, temperature and power are confounded.
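The self-heating and equivalent-time bookkeeping above can be sketched in a few functions. This assumes the combined Arrhenius/power-law acceleration factor AF = exp[(Ea/k)(1/T_use − 1/T_stress)] · (P_stress/P_use)^n, consistent with the model terms named in the talk; the talk itself used JMP and R, and the R_th and parameter values below are hypothetical.

```python
import math

K_BOLTZ = 8.617e-5  # Boltzmann constant, eV/K

def device_temp(T_amb_c, p_dbm, r_th):
    """Self-heating: T = T_ambient + R_th * P, with power converted
    from dBm to watts and r_th in degrees C per watt."""
    p_watts = 10 ** ((p_dbm - 30) / 10)
    return T_amb_c + r_th * p_watts

def accel_factor(Ea, n, T_use_c, p_use_dbm, T_stress_c, p_stress_dbm):
    """Arrhenius/power-law acceleration factor from stress to use."""
    T_use, T_stress = T_use_c + 273.15, T_stress_c + 273.15
    temp_term = math.exp((Ea / K_BOLTZ) * (1 / T_use - 1 / T_stress))
    # (P_stress / P_use)^n, computed directly from the dBm difference
    power_term = 10 ** (n * (p_stress_dbm - p_use_dbm) / 10)
    return temp_term * power_term

def equivalent_time(segments, Ea, n, T_use_c, p_use_dbm, T_amb_c, r_th):
    """Sum dt_i * AF_i over the (dt, power-in-dBm) steps of a pattern."""
    total = 0.0
    for dt, p_dbm in segments:
        T_dev = device_temp(T_amb_c, p_dbm, r_th)
        total += dt * accel_factor(Ea, n, T_use_c, p_use_dbm, T_dev, p_dbm)
    return total
```

Since each step is hotter and at higher power than use conditions, each AF exceeds one, and the summed equivalent time at use conditions is much longer than the actual test time, which is the whole point of acceleration.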
So in order to estimate the parameters, what we do is run at two different ambient temperatures. If you have the ambient temperatures separated enough, then you can actually separate the effects of power and temperature. You also need at least two ramp rates. So at a minimum, you would need a two by two matrix of ramp rate and ambient temperature. In the simulations I did, I chose three different ramp rates, as shown here. I have power in dBm versus stress time, and I have three different ramps with different rates: a fast, a medium, and a slow ramp rate. In practice, you would let this go on and on until failure, but here I've just arbitrarily cut it off after a few hours. You see here also that I have a ceiling. The ceiling is there because we have found that if we increase the stress or power arbitrarily, we can change the failure mechanism. And what you want to do is make sure that the failure mechanism under accelerated conditions is the same as it is under use conditions. If I change the failure mechanism, then I can't do an extrapolation; the extrapolation wouldn't be valid. So we have the ceiling here, drawn at 34.4 dBm, and we've even given ourselves a little buffer to make sure we don't get close to that. So our ambient temperature is 45 degrees C and we're starting at a power of 33.5 dBm. We would also have another set of conditions at 135 degrees C. You can see the patterns here are the same, and we have a ceiling and a buffer region and everything, except we are starting at a lower power. Here we're below 32 dBm, whereas before we were over 33. And the reason we do that is because if we don't lower the power at this higher temperature, what will happen is you'll get failures almost immediately if you're not careful, and then you can't use the data to do your extrapolation. Alright, so what we need, again, is an expression for our equivalent time, as I showed before. Here's that expression.
This is kind of nasty, and I would not know how to derive from first principles what the expression is for the distribution of the equivalent time at use conditions. So, when faced with something difficult like that, what I chose to do was use the bootstrap. So what is bootstrapping? With bootstrapping, we are resampling the data set many times with replacement. That means from the original data set of observations, you can have replicates of an observation, or maybe an observation won't appear at all. The approach I used is called nonparametric, because we're not assuming a distribution; we don't have to know the underlying distribution of the data. When you generate these many bootstrap samples, you get an approximate distribution of the parameter, and that allows you to do statistical inference. In particular, we're interested in putting confidence bounds on things. So that's what we need to do. A simple example of bootstrapping is called the percentile bootstrap. For example, suppose I wanted 90% confidence bounds on some estimate. What I would do is form many, many bootstrap replicates and extract the parameter from each bootstrap sample. Then I would sort that vector and take the fifth and 95th percentiles, and those would form my 90% confidence bounds. What I actually used in my work was an improvement over the percentile technique called BCa, for bias corrected and accelerated. Bias, because sometimes our estimates are biased, and this method takes that into account. Accelerated is, unfortunately, a confusing term here; it has nothing to do with accelerated testing. It refers to how the method adjusts for the skewness of the distribution. But ultimately, what it's going to do is pick different percentile values for you.
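The percentile bootstrap just described is short enough to sketch directly. This Python version uses the sample average as the statistic, matching the presenter's worked example; apart from the few values quoted in the talk, the data here are made up for illustration.

```python
import random

def percentile_bootstrap_ci(data, stat, n_boot=1000, alpha=0.10, seed=1):
    """Nonparametric percentile bootstrap: resample with replacement,
    recompute the statistic, and take the alpha/2 and 1-alpha/2
    quantiles of the sorted replicates as the confidence bounds."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(data) for _ in range(len(data))])
                  for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
sample = [24.93, 25.06, 25.89, 26.11, 26.54,
          26.90, 27.02, 27.33, 27.84, 28.12]
lo, hi = percentile_bootstrap_ci(sample, mean, n_boot=100)
assert lo < mean(sample) < hi
```

The BCa refinement the presenter actually used keeps the same resampling loop but shifts which percentiles are read off, correcting for bias and skewness in the replicate distribution.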
So, again, for the percentile technique we had the fifth and 95th. The BCa bootstrap might give you something different, say the third percentile and the 96th or whatever, and those are the ones you would need to choose for your 90% confidence bounds. I just want to run through a very quick example to make this clear. Suppose I have 10 observations and I make four bootstrap samples from them, looking something like this. For example, the first observation here, 24.93, occurs twice in the first sample, once in the second sample, etc. 25.06 occurs twice. 25.89 does not occur at all. I can do this, in this case, 100 times. And for each bootstrap sample, I'm going to take the average, say, since I'm interested in the distribution of the average. Well, here I have my distribution of averages, and I can look to see what that looks like. Here we are; it looks pretty bell shaped, and I have a couple of points here highlighted, and these would be my 90% confidence bounds if I were using the percentile technique. This is the sorted vector: the fifth percentile is at 25.84 and the 95th percentile is 27.68. If I used the BCa method, I might get somewhat different percentiles, in this case 25.78 and 27.73. So that's, very quickly, what the BCa method is. In our case, we would bootstrap on the stress patterns. You would have multiple samples which would have been run, or simulated, under those different stress patterns, and then bootstrap off those. And so we're going to get a distribution of our parameter estimates: logA, Ea, n, and sigma. Right. So again, here's our equation. The version of JMP that I have does not do bootstrapping. JMP Pro does, but the version I have does not; fortunately, R does do bootstrapping, and I can call R from within JMP. That's why I chose to do it this way.
So I can let R do all the hard work. I want to show an example. What I did was choose some known true values for logA, Ea, and sigma. I chose them randomly over some range. I would then choose the same values for these parameters a few times and generate samples each time I did that. So for example, I chose minus 28.7 three times for logA true and generated the data from this. There were a total of five parts per test level over six test levels — if you remember, three ramps and two different temperatures gives six levels, and six times five is 30. So there were 30 parts total run for this test, and looking at logA hat, the maximum likelihood estimates are around minus 28 or so. So that actually worked pretty well. For my next sample, I did three replicates here, for example, at minus 5.7, and when I ran my method, the maximum likelihood estimates are around that minus 5.7 or so. So the method appears to be working pretty well. But let's look at this in a little more detail. Here I ran the simulation a total of 250 times, with five runs for each group. LogA true and Ea true are repeated five times, and I'm getting different estimates for logA hat, Ea hat, etc. I'm also using the BCa method to form confidence bounds on each of these parameters, along with the median time to failure. So let's plot this data to see how well it did. You have logA hat versus logA true here, and we see that the slope is right around 1 and the intercept is not significantly different from 0. So this is actually doing a pretty good job. If my logA true is at minus 15, then I'm getting right around minus 15, plus or minus something, for my estimate. And the same is true for the other parameters Ea, n, and sigma, and I even did the median time to failure at a particular P0. So this is all behaving very well. We also want to know how well the BCa method is working. Well, it turns out it worked pretty well.
The question is how successful was the BCa method, and here I have a distribution. Every time I correctly bracketed the known true value, I got a 1, and if I missed it, I got a 0. So for logA I'm correctly bracketing the known true value 91% of the time. I was aiming for 95%, so I'm pretty close; I'm in the low 90s, and I'm getting about the same thing for activation energy, etc. They're all in the mid to low 90s. So that's actually pretty good agreement. Now let's see what happens if I increase the number of bootstrap iterations, n boot, from 100 to 500. What does that look like? If I plot my MLE versus the true value, I'm getting about the same thing. The estimates are pretty good; the slope is always around 1 and the intercept is always around 0. So that's pretty well behaved. And if I look at the confidence bound width, I see, on the average, something around 23 here for the confidence bound width, around 20 or so for mu, and something around eight for n. So these confidence bounds are actually somewhat wide, and I want to see what happens if I increase my sample size to 50 instead of just using five. Now, 50 is not a realistic sample size; we could never run that many. That would be very difficult and very time consuming. But this is simulation, so I can run as many parts as I want. And so, just to check, I see again that the maximum likelihood estimates agree pretty well with the known true values — again a slope of 1 and an intercept around zero. And with BCa, I am getting bracketing right around 95% of the time, as expected. So that's pretty well behaved too, and my confidence bound width is now much lower. By increasing the sample size, as you might expect, the confidence bounds get correspondingly narrower. This was in the upper 20s originally; now it's around seven.
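The coverage-versus-width trade-off the presenter explores can be mimicked with a toy simulation. This Python sketch uses the mean of a normal sample as a stand-in for the MLE of the real four-parameter model, so only the qualitative behavior carries over: coverage stays roughly flat while interval width shrinks as the sample size grows. All settings here are hypothetical.

```python
import random
import statistics

def boot_ci(data, n_boot, alpha, rng):
    """Percentile bootstrap interval for the sample mean."""
    reps = sorted(statistics.fmean(rng.choices(data, k=len(data)))
                  for _ in range(n_boot))
    return reps[int(alpha / 2 * n_boot)], reps[int((1 - alpha / 2) * n_boot) - 1]

def coverage_and_width(n_parts, n_boot, true_mu=0.0, sigma=1.0,
                       trials=200, seed=2):
    """Fraction of 90% bootstrap intervals that bracket the true mean,
    and their average width, for a given parts-per-level sample size."""
    rng = random.Random(seed)
    hits, widths = 0, []
    for _ in range(trials):
        data = [rng.gauss(true_mu, sigma) for _ in range(n_parts)]
        lo, hi = boot_ci(data, n_boot, 0.10, rng)
        hits += lo <= true_mu <= hi
        widths.append(hi - lo)
    return hits / trials, statistics.fmean(widths)

cov5, w5 = coverage_and_width(n_parts=5, n_boot=100)
cov50, w50 = coverage_and_width(n_parts=50, n_boot=100)
assert w50 < w5  # bigger samples give narrower intervals
```

Running this kind of study before the real test is exactly the planning use the presenter describes: it tells you ahead of time how many parts per condition and how many bootstrap iterations you are likely to need.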
The mu was also in the upper 20s; this is now around five. n was around 10 initially; now it's around 2.3. So we're getting better behavior by increasing our sample size. This just shows what the summary looks like. Here I have the success rate versus these different groups; n is the number of parts per test level, and boot is the number of bootstrap samples I created. So 5_100, 5_500 and 50_500, and you can see this is reasonably flat. You're not getting a big improvement in the coverage; we're getting something in the low to mid 90s or so, and that's about what you would expect. So by changing the number of bootstrap replicates or by changing the sample size, I'm not changing that very much. BCa is doing a pretty good job, even with five parts per test level and 100 bootstrap iterations. What about the width? Here we are seeing a benefit. The width of the confidence bounds goes down as we increase the number of bootstrap iterations, and on top of that, if you increase the sample size, you get a big decrease in the confidence bound width. All this behavior is expected, but the point is that this simulation allows you to know ahead of time, well, how big should my sample size be? Can I get away with three parts per condition? Do I need to run five or 10 parts per condition in order to get the width of the confidence bounds that I want? Similarly, when I'm doing the analysis, how many bootstrap iterations do I need? Can I get away with 100? Do I need 1000? This also gives you some heads up on what you're going to need to do when you do the analysis. Alright, so finally, we are now armed with our maximum likelihood estimates and our confidence bounds. We can summarize our results using the safe operating area, and again, what we're getting here is something of a reliability map, or a response surface of temperature versus power.
So you'll have an idea of how reliable the part is under various conditions, and this can be very helpful to designers or customers. Designers want to know, when they create a part, is it going to last? Are they designing a part to run at too high a temperature or too high a power, so that the median time to failure would be too low? Customers also want to know, when they run this part, how long is it going to last? What the SOA gives you is that information. The metric I'm going to use here is median time to failure. You could use other metrics — you could use the FIT rate, you could use a ??? — but for purposes of illustration, I'm just using median time to failure. An even better metric, as I'll show, is a lower confidence bound on the median time to failure; it gives you a more conservative estimate. So ultimately, the SOA will allow you to make trade-offs between temperature and power. Here is our contour plot showing our SOA. These contours are log base 10 of the median time to failure. We have power versus temperature; as temperature goes down and as power goes down, these contours get larger and larger. So as you lower the stress, as you might expect, the median time to failure goes up. Suppose we have a corporate goal that the part should last, or have a median time to failure, greater than 10 to the sixth hours. If you look at this map, over the range of power and temperature we have chosen, it looks like we're golden. There are no problems here; the median time to failure is easily 10 to the sixth hours or higher. But we have to realize that the median time to failure is an average, and an average only tells half the story. We have to do something that acknowledges the uncertainty in this estimate. So what we do in practice is use a lower confidence bound on the median time to failure.
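The SOA surface can be sketched by evaluating log10 of the median time to failure over a temperature/power grid, using t50 = exp(mu) with mu = ln(A) + Ea/(kT) − n·ln(P), which matches the combined model terms named earlier in the talk. The talk draws this as a JMP contour plot; the Python grid below is a stand-in, and every parameter value is hypothetical.

```python
import math

K_BOLTZ = 8.617e-5  # eV/K

def log10_t50(lnA, Ea, n, T_c, p_dbm):
    """log10 of the median time to failure under the combined
    Arrhenius/power-law model: mu = ln(A) + Ea/(kT) - n*ln(P)."""
    T = T_c + 273.15
    p_watts = 10 ** ((p_dbm - 30) / 10)
    mu = lnA + Ea / (K_BOLTZ * T) - n * math.log(p_watts)
    return mu / math.log(10)

# A coarse SOA grid: contours of log10(t50) over temperature and power.
grid = {(T, p): log10_t50(-28.7, 1.0, 3.0, T, p)
        for T in range(40, 101, 20) for p in (27.0, 29.5, 32.0)}

# Lower stress (cooler, lower power) gives a longer median life.
assert grid[(40, 27.0)] > grid[(100, 32.0)]
```

Plotting a contour through this grid at the corporate goal (for example, log10(t50) = 6) would trace the boundary between the safe green region and the unsafe red region described next.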
So you can see those contours have changed; they're very much lower because we're using the lower confidence bound, and here, 10 to the sixth hours is given by this line. You can see that it covers only part of the region now. Over here in green, that's good; you can operate safely here. But red is dangerous; it is not safe to run here. This is where the monsters are. You don't want to run your part this hot. This also allows you to make trade-offs. For example, suppose a designer wanted their part to run at 80 degrees C. That's fine, as long as they keep the power level below about 29.5 dBm. Similarly, suppose they wanted to run the part at 90 degrees C. They can; that's fine as long as they keep the power low enough, let's say 27.5 dBm. So this is where you're allowed to make trade-offs between temperature and power. Alright, now just to summarize. I showed the differences between constant and step stress testing, and I showed how we extract maximum likelihood estimates and BCa confidence bounds from the simulated step stress data. I demonstrated that we had pretty good agreement between the estimates and the known true values. In addition, the BCa method worked pretty well: even with an n boot of only 100 and five parts per test level, we had about 95% coverage, and that coverage didn't change very much as we increased the number of bootstrap iterations or the sample size. However, we did see a big change in the confidence bound width, and the results there showed that we could make a trade-off: from the simulation, we would know how many bootstrap iterations we need to run and how many parts per test condition we need to run. And ultimately, we took those maximum likelihood estimates and our bootstrap confidence bounds and created the SOA, which provides guidance to customers and designers on how safe a particular T0/P0 combination is.
And from that reliability map, we were able to make a trade-off between temperature and power. Lastly, I showed that using the lower confidence bound on the median time to failure provides a more conservative estimate for the SOA. So, in essence, using the lower confidence bound makes the SOA, the safe operating area, a little safer. So that ends my talk, and thank you very much for your time.
Ondina Sevilla-Rovirosa, Graduate Student and Small Business Owner, Oklahoma State University   The pay gap between top executives and the average American worker continues to widen. Additionally, the gender pay gap has narrowed, but women only earn 85% of what men earn. Experts debate if higher compensations could positively impact the firm’s performance and maximize its value. Therefore, it is worth analyzing the factors that companies are considering to determine the salaries of top executives. The original dataset of 2,646 executives from 250 selected U.S. companies was gender unbalanced. I used a synthetic replication method to balance the data. Then, I ran, analyzed, and compared a decision tree, a stepwise regression, eight different neural networks, and an ensemble model. Finally, a surrogate model was used to explain the best neural net model. From the 18 initially selected variables of companies and executives, I found that only Total Assets, Number of Employees, and Years Executive in CEO Position had a significant contribution to Salary. Surprisingly, gender had an insignificant effect on the salaries of top executives. Nevertheless, what predominantly affected Salaries was the size of the firm (Assets and Employees), followed by a lower contribution from the number of years the executive has been in the CEO position.   Auto-generated transcript...   Speaker Transcript ondinasevilla@gmail.com My name is Ondina Sevilla, and my poster is about salary gaps in corporate America, specifically how company and executive characteristics influence compensations. Starting with a little introduction: the pay gap between top executives and the average American worker continues to widen. Also, the gender pay gap has narrowed, but women only earn 85% of what men earn. Experts even debate if higher compensations could positively impact the firm's performance and maximize its value.
So it is worth analyzing the factors that companies are considering to determine the salaries of top executives. I posed two questions for this research: Is there a salary gap for top female executives in US companies? And does the company's size influence executives' salaries? For this research, I collected a data set of 2,646 top executives from 250 selected US companies, such as Halliburton, Southwest Airlines, Starwood Hotels, Sherwin-Williams and others. Then I applied a synthetic replication method in SAS to obtain a gender-balanced database and used 12 company and six executive variables, with Salary being the target variable. The technique used was predictive modeling. I analyzed and compared in JMP Pro 15 a decision tree, a stepwise regression, eight different neural networks, and an ensemble model. An outlier in the salary percentage year-per-year variable was excluded for the first analysis, and then I included it to compare. However, the root average squared error and the absolute error were higher with the outlier. So I'm going to show you here the different models that I ran without the outlier and the neural networks comparison. From all these models, the lowest root average squared error without the outlier was for the number eight neural network. This neural net has four inputs and two hidden layers with eight neurons, with a TanH function.
Daniel Sutton, Statistician - Innovation, Samsung Austin Semiconductor   Structured Problem Solving (SPS) tools were made available to JMP users through a JSL script center as a menu add-in. The SPS script center allowed JMP users to find useful SPS resources from within a JMP session, instead of having to search for various tools and templates in other locations. The current JMP Cause and Effect diagram platform was enhanced with JSL to allow JMP users the ability to transform tables between wide format for brainstorming and tall format for visual representation. New branches and “parking lot” ideas are also captured in the wide format before returning to the tall format for visual representation. By using JSL, access to mind-mapping files made by open source software such as Freeplane was made available to JMP users, to go back and forth between JMP and mind-mapping. This flexibility allowed users to freeform in mind maps then structure them back in JMP. Users could assign labels such as Experiment, Constant and Noise to the causes and identify what should go into the DOE platforms for root cause analysis. Further proposed enhancements to the JMP Cause and Effect Diagram are discussed.   Auto-generated transcript...   Speaker Transcript Rene and Dan Welcome to Structured Problem Solving using the JMP cause and effect diagram, open source mind mapping software, and JSL. My name is Dan Sutton; I am a statistician at Samsung Austin Semiconductor, where I teach statistics and statistical software such as JMP. For the outline of my talk today, I will first discuss what structured problem solving, or SPS, is. I will show you what we have done at Samsung Austin Semiconductor using JMP and JSL to create an SPS script center. Next, I'll go over the current JMP cause and effect diagram and show how we at Samsung Austin Semiconductor use JSL to work with the JMP cause and effect diagram.
I will then introduce you to mind mapping software such as Freeplane, a free open source software. I will then return to the cause and effect diagram and show how to use the third column option of labels for marking experiment, controlled, and noise factors. I will show you how to extend cause and effect diagrams for five whys and cause mapping, and finally give recommendations for the JMP cause and effect platform. Structured problem solving. Everyone has been involved with problem solving at work, school or home, but what do we mean by structured problem solving? It means taking unstructured problem solving, such as in a brainstorming session, and giving it structure and documentation, as in a diagram that can be saved, manipulated and reused. Why use structured problem solving? One important reason is to avoid jumping to conclusions on more difficult problems. In the JMP Ishikawa example, there might be an increase in defects in circuit boards. Your SME, or subject matter expert, is convinced it must be the temperature controller on the solder process again. But having a saved structure, as in the cause and effect diagram, allows everyone to see the big picture and look for more clues. Maybe it is temperature control on the solder process, but a team member remembers seeing on the diagram that there was a recent change in the component insertion process, and that the team should investigate. In the free online training from JMP called Statistical Thinking in Industrial Problem Solving, or STIPS for short, the first module is titled statistical thinking and problem solving. Structured problem solving tools such as cause and effect diagrams and the five whys are introduced in this module. If you have not taken advantage of the free online training through STIPS, I strongly encourage you to check it out. Go to www.JMP.com/statisticalthinking. This is the cause and effect diagram shown during the first module.
In this example, the team decided to focus on an experiment involving three factors. This is after creating, discussing, revisiting, and using the cause and effect diagram for structured problem solving. Now let's look at the SPS script center that we developed at Samsung Austin Semiconductor. At Samsung Austin Semiconductor, JMP users wanted access to SPS tools and templates from within the JMP window, instead of searching through various folders, drives, saved links or other software. A floating script center was created to allow access to SPS tools throughout the workday. On the right side of the script center are links to other SPS templates in Excel. On the left side of the script center are JMP scripts. It is launched from a customization of the JMP menu. Instead of putting the scripts under add-ins, we chose to modify the menu to launch a variety of helpful scripts. Now let's look at the JMP cause and effect diagram. If you have never used this platform, this is what the cause and effect diagram looks like in JMP. The user selects a parent column and a child column. The result is the classic fishbone layout. Note the branches alternate left and right, top and bottom, to make the diagram more compact for viewing on the user's screen. But the classic fishbone layout is not the only layout available. If you hover over the diagram, you can select change type and then select hierarchy. This produces a hierarchical layout that, in this example, is very wide in the x direction. To make it more compact, you do have the option to rotate the text to the left, or you can rotate it to the right, as shown here in the slides. Instead of rotating just the text, it might be nice to rotate the diagram also, left to right. In this example, the images from the previous slide were rotated in PowerPoint to illustrate what it might look like if the user had this option in JMP. JMP developers, please take note.
As you will see later, this has more the appearance of mind mapping software. The third layout option is called nested. This creates a nice compact diagram that may be preferred by some users. Note, you can also rotate the text in the nested option, but maybe not as desired. Did you know the JMP cause and effect diagram can include floating diagrams? For example, parking lots that come up in a brainstorming session. If a second parent is encountered that's not used as a child, a new diagram will be created. In this example, the team is brainstorming and someone mentions, "We should buy a new machine or used equipment." Now, this idea is not part of the current discussion on causes. So the team facilitator decides to add it to the JMP table as a new floating note called a parking lot, and the JMP cause and effect diagram will include it. Alright, so now let's look at some examples of using JSL to manipulate the cause and effect diagram. New scripts to manipulate the traditional JMP cause and effect diagram and associated data table were added to the floating script center. You can see examples of these on the right of this PowerPoint slide. JMP is column based, and the column dialog for the cause and effect platform requires one column for the parent and one column for the child. This table is what is called the tall format. But a wide table format might be more desired at times, such as in brainstorming sessions. With a click of a script button, our JMP users can change from a tall format to a wide format. In the tall table format, you would have to enter the parent each time when adding a child. When done in the wide format, the user can use the script button to stack the wide C&E table back to tall. Another useful script in brainstorming might be taking a selected cell and creating a new category. The team realizes that it may need to add more subcategories under wrong part.
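The wide-to-tall stacking that the script button performs can be illustrated outside JSL. Below is a minimal Python sketch of the same idea; the cause names are hypothetical, and the real site scripts operate on JMP data tables rather than dictionaries.

```python
# Wide format: one column per parent cause, children listed beneath --
# convenient for brainstorming sessions.
wide = {
    "Solder Process": ["Temperature controller", "Flux", None],
    "Components": ["Wrong part", "Damaged", "Contaminated"],
}

def stack_wide_to_tall(wide):
    """Stack a wide brainstorming table into the tall (Parent, Child)
    row format that the C&E platform dialog expects, skipping blanks."""
    return [(parent, child)
            for parent, children in wide.items()
            for child in children
            if child is not None]

tall = stack_wide_to_tall(wide)
assert ("Components", "Wrong part") in tall
assert len(tall) == 5
```

Going the other way (tall to wide) is just the inverse grouping: collect the children under each distinct parent, which is what makes it cheap to toggle between the brainstorming view and the diagram view.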
A script was added to create a new column from a selected cell while in the wide table format. The facilitator can select the cell, like wrong part; then, selecting this script button, a new column is created and subcauses can be entered below. Alternatively, in the diagram you would hover over wrong part, right click, and select Insert below. You can actually enter up to 10 items. The new causes appear in the diagram. And if you don't like the layout, JMP allows moving the text. For example, you can right click and move it to the other side. The JMP cause and effect diagram compacts the window using left and right, up and down, and alternate. Some users may want the classic look of the fishbone diagram, but with all bones in the same direction. By clicking on the script button "current C&E all bones to the left side," it sets them all to the left and below. Likewise, you can click another script button that sets them all to the right and below. Now let's discuss mind mapping. In this section we're going to take a look at the classic JMP cause and effect diagram and see how to turn it into something that looks more like mind mapping. This is the same fishbone diagram as a mind map using Freeplane, which is open source software. Note the free form of this layout, yet it still provides an overview of causes for the effect. One capability of most mind mapping software is the ability to open and close nodes, especially when there is a lot going on in the problem solving discussion. For example, a team might want to close nodes (like components, raw card and component insertion) and focus just on the solder process and inspection branches. In Freeplane, closed nodes are represented by circles, where the user can click to open them again. The JMP cause and effect diagram already has the ability to close a node. Once closed, though, it is indicated by three dots, an ellipsis. In the current versions of JMP, there's actually no option to open it again. So what was our solution?
We included a floating window that will open and close any parent column category. So over on the right, you can see alignment, component insertion, components, etc., are all included as parent nodes. By clicking on the checkbox, you can close a node, and clicking again will open it. In addition, the script also highlights the text in red when closed. One reason for using open source mind mapping software like Freeplane is that the source file can be accessed by anyone. It's not a proprietary format like other mind mapping software; you can actually access it through any kind of text editor. The entire map can be loaded by using JSL commands that access text strings. Use JSL to look for XML attributes to get the names of each node. A discussion of XML is beyond the scope of this presentation, but see the JMP Community for additional help and examples. Users at Samsung Austin Semiconductor would click on Make JMP table from a Freeplane.mm file. At this time, we do not have a straight JMP-to-Freeplane script. It's a little more complicated, but Freeplane does allow users to import text from a clipboard using spaces to indent the nodes. So by placing the text in a journal (the example here is on the left side of this slide), the user can then copy and paste into Freeplane, and you would see the Freeplane diagram on the right. Now let's look at adding labels of experiment, controlled, and noise to a cause and effect diagram. Another use of cause and effect diagrams is to categorize specific causes for investigation or improvements. These are often categorized as controlled or constant (C), noise (N), or experiment (X or E). For those who were taught SPC XL by Air Academy Associates, you might have used or still use the CE/CNX template. To be able to do this in JMP, to add these characters, we would need to revisit the underlying script.
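The idea of reading a Freeplane .mm file by looking for XML attributes can be sketched outside JSL as well. Freeplane files are XML with nested node elements whose TEXT attributes hold the node names; this Python sketch (using a made-up miniature map, not a real exported file) walks the tree and collects parent/child pairs in the tall format the cause and effect platform expects:

```python
import xml.etree.ElementTree as ET

# Minimal Freeplane-style .mm content (a real file is written by Freeplane;
# this fragment is invented for illustration).
mm = """<map version="freeplane 1.9">
  <node TEXT="Effect">
    <node TEXT="Solder Process">
      <node TEXT="Temperature"/>
    </node>
    <node TEXT="Components"/>
  </node>
</map>"""

root = ET.fromstring(mm)
pairs = []

def walk(node):
    # Collect (parent, child) pairs from the TEXT attributes, recursively.
    for child in node.findall("node"):
        pairs.append((node.get("TEXT"), child.get("TEXT")))
        walk(child)

walk(root.find("node"))  # start at the effect (root) node
print(pairs)
```

The resulting pairs map directly onto the tall Parent/Child table used to build the diagram in JMP.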
This requires the optional third label column. When a JMP user adds a label column in the script, it changes the text edit box to a vertical list box with two new horizontal center boxes containing two text edit boxes, one with the original child, and now one with the value from the label column. It has a default font color of gray and is applied as illustrated here in this slide. Our solution using JSL was to add a floating window with all the children values specified. Whatever was checked could be updated for E, C or N and added to the table and the diagram. And in fact, different colors could be specified by the script by changing the font color option, as shown in the slide. Next, the JMP cause and effect diagram for five whys and mind mapping causes. While exploring the cause and effect diagram, another use, as a five whys or cause mapping, was discovered. Although these SPS tools do not display well in the default fishbone layout, the hierarchy layout is ideal for this type of mapping. The parent and child become the why and because statements, and the label column can be used to add numbering for your whys. Sometimes there can be more than one reason for a why, and the JMP cause and effect diagram can handle it; this branching or cause mapping can be seen over here on the right. Even the nested layout can be used for a five whys. In this example, you can also set up a script to set the text wrap width, so the users do not have to do each box one at a time. Or you can make your own interactive diagram using JSL. Here I'm just showing some example images of what that might look like. You might prompt the user in a window dialog for their whys and then fill in the table and a diagram for the user, once again using the cause and effect diagram, as shown over on the left side of the slide. Conclusions and recommendations. All right.
In conclusion, the JMP cause and effect diagram already has many excellent built-in features for structured problem solving. The current JMP cause and effect diagram was augmented using JSL scripts to add more options when being used for structured problem solving at Samsung Austin Semiconductor. JSL scripts were also used to make the cause and effect diagram act more like mind mapping software. So, what would be my recommendations? There are currently three layouts, fishbone, hierarchy, and nested, which use different types of display boxes in JSL. How about a fourth type of layout, a mind map option that would allow a more flexible mind map layout? I'm going to add this to the wish list. And then finally, how about even a full mind map platform? That would be an even bigger wish. Thank you for your time, and thank you to Samsung Austin Semiconductor and JMP for this opportunity to participate in the JMP Discovery Summit 2020 online. Thank you.
Heath Rushing, Principal, Adsurgo   DOE Gumbo: How Hybrid and Augmenting Designs Can Lead to More Effective Design Choices When my grandmother made gumbo, she never seemed to even follow her own recipe. When I questioned her about this, she told me, "Always try something different. Ya never know if you can make better gumbo unless you try something new!" This is the same with design of experiments. Too many times, we choose the same designs we've used in the past, unable to try something new in our gumbo. We can construct a hybrid of different types of designs or augment the original, optimal design with points using a different criterion. We can then use this for comparison to our original design choice. These approaches can lead to designs that allow you to either add relevant constraints (and/or factors) you did not think were possible or have unique design characteristics that you may not have considered in the past. This talk will present multiple design choices: a hybrid mixture-space filling design, an optimal design augmented using pre-existing, required design points, and an optimal design constructed by augmenting a D-optimal design with both I- and A-optimal design points. Furthermore, this talk presents the motivation for choosing these design alternatives as well as design choices that have been useful in practice.     Auto-generated transcript...   Speaker Transcript Heath Rushing My name is Heath Rushing. I am a principal for Adsurgo, and we're a training and consulting company that works with a lot of different companies. This morning, I'm going to talk about some experiences that I had working with pharmaceutical and biopharmaceutical companies. A lot of scientists and engineers are doing things like process and product development, characterization, and formulation optimization. And what I found is a lot of these scientists had designs that they used in the past with a similar product or process or formulation.
And what they did going forward is they just said, "Hey, let me just take that design that I've used in the past. It worked well enough. So let's just go ahead and use that design." In each of these instances, what we did is take the original design and come up with some sort of mechanism for doing something a little different. We either augmented it with a different sort of optimization criterion, or we augmented it before they added runs or after they added runs. In the first case, what we did was build a hybrid design. The first case was a formulation optimization problem, where a scientist in the past had run a 30 run Scheffé-cubic mixture design. In a mixture design, the factors in the experiment are components of a mixture; each is a percentage, and the overall mixture adds up to 100%. They felt this worked well enough and helped them to find an optimal setting for the formulation. However, one thing that they really wanted to touch on more, they said, is that these designs tended to put design points near the edges, and what we want to do is further characterize the design space. So we took the original 30 run design and, instead of that, we developed an experiment where we ran 18 mixture runs and then augmented it with 12 space filling runs. A space filling design is used a lot in computer simulations. And really, you know, I said this at a conference one time: it's used to fill space. But really what these designs do, and I'm going to pull up the comparison of the two, is place design points so that, in this one, I try to minimize the distance between each of the design points.
As you see in the design on the left, the one that they thought was adequate was the 30 run mixture design. And as you see, it operates a lot near the edges and right in the center of the design. The one on the right was really 18 mixture design points augmented with 12 space filling design points, so it's really a hybrid of a mixture design and a space filling design. As you can see, based upon their objective of trying to characterize that design space a little bit better, the one on the right did a much better job of characterizing that design space, right? It had adequate prediction variance. It was the design they chose to run, and they found their optimal solution off of it. The second design choice is used a lot in process characterization. Back in the old days, before a lot of people used design of experiments for process characterization, what a lot of scientists would do is run center point runs at the set point and then also do what are called PAR runs, or proven acceptable range runs. Say that they had four process parameters. They would keep three of the process parameters at the set point and have the fourth go to the extremes, the lowest value and the highest value. And they would do a set of experiments like that for each of the process parameters. What they're really showing is that if everything's at set point, and one of these deviates to the edges, it's still well within specification. Right. And so they still like to do a lot of these runs. The design that I started off with came from a scientist who took those PAR and center point runs and added them after building an I-optimal design. An I-optimal design is used for prediction and optimization.
And in this case, that's the kind of design that they wanted, but they added the runs after the I-optimal design. My question to them was this: why don't you just take those runs and add them before you build the I-optimal design? If that was the case, the algorithm in JMP would say, "You know, I'm going to take those points and I'm going to come up with the next best set of runs." Right. So we took the original 11 design points and augmented them with 18 I-optimal points. Whenever we did this, if you look at the design where the PAR runs were added prior, you see that the power for the main effects, the interactions, and the quadratic effects is higher than if you added the PAR runs after. If you look at the prediction variance, it is very similar, but right near the edge of the design space, whenever we had the PAR runs augmented with I-optimal points, the prediction variance was a lot smaller. The key here is, looking at the correlations, I think the correlations, especially with the main effects, are a lot better with the PAR runs augmented with I-optimal points versus what they did before, where they took the I-optimal design and just augmented it with the PAR runs. The third design was when I had a scientist take a 17 run D-optimal design and augment it with eight runs, going from a D- to an I-optimal design. They started off with a D-optimal design, a screening design, and augmented it with points to move to an I-optimal design. JMP has what's not really a new design, but a new design for JMP, called an A-optimal design. An A-optimal design allows you to weight any of those factors. Right. And so I had an idea.
I just said, "You know, I have many times in the past gone from a D-optimal design augmented to an I-optimal design. What if we took that original 17 run D-optimal design and augmented it to an I, then an A, where we weighted those quadratic terms? Or took the D-optimal design, augmented it to an A-optimal design where we weighted the quadratic terms, and then to an I-optimal design?" So it's really two different augmentations, going from D to A to I, and D to I to A. We also went straight from D to A. And I wanted to compare these to the original design choice, which was a D augmented to an I-optimal design. Now, I really would like to tell you that my idea worked, but I think as a good statistician, I should tell you that I don't think it was so. If I look at the prediction variance which, in a response surface design, we're trying to minimize across the design region, you see the prediction variance for their original design is lower, even much lower than whenever I went just straight to the A-optimal design. If you look at the fraction of design space, you'll see that the prediction variance is much smaller across the design space than for the A-optimal design, and it's a little bit better than when I went from D to A to I, and D to I to A. The only negative that I saw with the original design compared to the other design choices was that there were some quadratic effects that had a little bit higher correlation than I would like to see, and you see with the A-optimal design, the correlations for the quadratic effects are much lower. So, to my original thesis: many times, scientists and engineers have designs they've done in the past. And I always say, it makes sense that we just don't want to do that same design that we've done in the past. Let's try something different.
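For readers less familiar with the alphabetic criteria discussed here, the quantity each one targets can be computed directly from a model matrix. This sketch uses a toy 3x3 factorial with a full quadratic model (not one of the designs from the talk) to evaluate the D criterion (determinant of X'X, to maximize), the A criterion (trace of the inverse, to minimize), and an approximation of the I criterion (average prediction variance over a grid, to minimize):

```python
import numpy as np

# Toy design: 3^2 factorial in two coded factors with a full quadratic model.
levels = [-1.0, 0.0, 1.0]
X = np.array([[1, a, b, a * b, a * a, b * b]
              for a in levels for b in levels])

M = X.T @ X                          # information matrix X'X
d_crit = np.linalg.det(M)            # D-optimality: maximize det(X'X)
a_crit = np.trace(np.linalg.inv(M))  # A-optimality: minimize trace((X'X)^-1)

# I-optimality: minimize average prediction variance x' (X'X)^-1 x,
# approximated here on an 11 x 11 grid over the design region.
grid = np.array([[1, a, b, a * b, a * a, b * b]
                 for a in np.linspace(-1, 1, 11)
                 for b in np.linspace(-1, 1, 11)])
i_crit = np.mean(np.einsum("ij,jk,ik->i", grid, np.linalg.inv(M), grid))

print(d_crit, a_crit, i_crit)
```

Weighting particular terms, as in the A-optimal augmentation described above, amounts to replacing the plain trace with a weighted sum of the diagonal elements of (X'X)^-1.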
The product can be a little bit different. The process can be a little bit different. The formulation can be a little bit different. If you use that to compare to the original design, you can pick your best design choice. Last, I would like to thank my team members at Adsurgo. I want to thank our customers for coming up with challenging problems and our team members for always working toward optimal solutions for our customers. Now, the last thing that I have to say is that these designs were really taken from customer examples, but they weren't the exact examples; none of their data is shown. So I would like to give a shout out to one of my customers, Sean Essex from Poseida Therapeutics, who often comes up with some very hard problems. Sometimes he'll come up with a problem and I'll say, you know, here is a solution, and it's something that we really haven't even seen yet. So have a great day.
Thor Osborn, Principal Systems Research Analyst, Sandia National Laboratories   Parametric survival analysis is often used to characterize the probabilistic transitions of entities — people, plants, products, etc. — between clearly defined categorical states of being. Such analyses model duration-dependent processes as compact, continuous distributions, with corresponding transition probabilities for individual entities as functions of duration and effect variables. The most appropriate survival distribution for a data set is often unclear, however, because the underlying physical processes are poorly understood. In such cases a collection of common parametric survival distributions may be tried (e.g., the Lognormal, Weibull, Fréchet and Loglogistic distributions) to identify the one that best fits the data. Applying a diverse set of options improves the likelihood of finding a model of adequate quality for many practical purposes, but this approach offers little insight into the processes governing the transition of interest. Each of the commonly used survival distributions is founded on a differentiating structural theme that may offer valuable perspective in framing appropriate questions and hypotheses for deeper investigation. This paper clarifies the fundamental mechanisms behind each of the more commonly used survival distributions, considering the heuristic value of each mechanism in relation to process inquiry and comprehension.     Auto-generated transcript...   Speaker Transcript   Hello, and welcome to my ... over the past 25 years, I have performed many studies and ... share with you a way of thinking about the distributions we ... motivated by precedent, ease of use, or empirically demonstrated ... about its processes. Further, when an excellent model fit is ... genesis of the distributions commonly used in parametric ... seen in the workplace as well as in the academic literature. ... literature, including textbooks and web-based articles, as well ... reexamination that may fail to glean full value from the work. ... the exponential. Much is often made about the ... because they model fundamentally different system archetypes. In ... distribution does in fact fit the lognormal data very well. The quality of the fit may also ... fits much better. And secondly, there's only a modest coincident ... the core process mechanisms these distributions represent ... analysis, but it provides a very familiar starting point for ... uncorrelated effects. Let's see if that is true. In order to create a good ... 25,000. For the individual records, we'll use the random ... see that we did indeed obtain the normal distribution. Now let's consider the ... not able to imprint my brain with a sufficient knowledge of ... lognormal distribution are also very simple. As you can see, the ... this demonstration, we reuse the fluctuation data that were ... JSL scripting because I find it much more convenient for ... the number of records in each sample. Next, it extracts the ... products. The outer loop tracks the ... on the previous slide. The amplified product compensates ... distributions may be considered as generated secondarily from ... many similar internal processes is represented by its maximum ... to be Fréchet distributed. The Weibull distribution represents ... processes that complete when any of multiple elements have ... using the Pareto distribution as the source. In this case, the ... absolute value of the normal distribution as the source. Now let's have a quick look at ... maximum is used. For the square root of the ... is not available, you can also see that the other common ... value of the normal distribution quite well. Incidentally, Weibull ... distribution when its core behavior is substantially ... the four heme-containing subunits mechanically interact ... up to now have all relied on independent samples. Professor ... extended to produce autocorrelated data. Generation of ... sequence autocorrelation is about .75, yet the ... the common survival distributions. You can see that ... good example of the relationship between real-world analytical ... commingle single-family residences with heavy industry. ... have similar features. The landowner must apply to the ... an opportunity to comment. Local officials then weigh the ... parties. This example is not approached as a demonstration ... processing time is 140 days. The fit is obviously imperfect, but ... distributed data results from processes yielding the combined ... ubiquitous, but the loglogistic is less frequently used. Without ... multistep process may be insufficient to impart log ... considered, and the complexity of the underlying process should ... whether a process is substantially impacted by ... whether the cooperative element is connoted by positive terms such ... often been said, I would sincerely appreciate your ...
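The lognormal mechanism this talk demonstrates rests on a simple fact: when a quantity is the product of many small positive multiplicative effects, its log is a sum of logs, and the central limit theorem pushes that sum toward normality. A minimal Python sketch of that mechanism (the fluctuation range and sample sizes here are arbitrary choices for illustration, not the values used in the talk's JSL demonstration):

```python
import math
import random
import statistics

random.seed(0)

# Each record is the product of 40 small multiplicative fluctuations;
# the log of the product is a sum of 40 logs, so it tends toward normal,
# making the product itself approximately lognormal.
samples = []
for _ in range(5000):
    product = 1.0
    for _ in range(40):
        product *= random.uniform(0.9, 1.1)  # small positive fluctuation
    samples.append(product)

logs = [math.log(s) for s in samples]
print(statistics.mean(logs), statistics.stdev(logs))
```

Plotting a histogram of `logs` would show the familiar bell shape, while `samples` itself is right-skewed, which is the signature of the lognormal.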
James Wisnowski, Principal Consultant, Adsurgo Andrew Karl, Senior Statistical Consultant, Adsurgo Darryl Ahner, Director OSD Scientific Test and Analysis Techniques Center of Excellence   Testing complex autonomous systems such as auto-navigation capabilities on cars typically involves a simulation-based test approach with a large number of factors and responses. Test designs for these models are often space-filling and require near real-time augmentation with new runs. The challenging responses may have rapid differences in performance with very minor input factor changes. These performance boundaries provide the most critical system information. This creates the need for a design generation process that can identify these boundaries and target them with additional runs. JMP has many options to augment DOEs conducted by sequential assembly where testers must balance experiment objectives, statistical principles, and resources in order to populate these additional runs. We propose a new augmentation method that disproportionately adds samples at the large gradient performance boundaries using a combination of platforms to include Predictor Screening, K Nearest Neighbors, Cluster Analysis, Neural Networks, Bootstrap Forests, and Fast Flexible Filling designs. We will demonstrate the Boundary Explorer add-in tool with an autonomous system use-case involving both continuous and categorical responses. We provide an improved “gap filling” design that builds on the concept behind the Augment “space filling” option to fill empty spaces in an existing design.     Auto-generated transcript...   Speaker Transcript James Wisnowski Welcome, team Discovery. Andrew Karl, Darryl Ahner, and I are excited to present two new adaptive sampling techniques that are really going to provide practitioners some wonderful usefulness in terms of augmenting design of experiments.
What I want to do is go through a couple of our processes here and talk about how this all came about. When we think about DOE and augmenting designs, there is a robust capability already in JMP. What we have found, though, working with some very large scale simulation studies, is that we're missing a piece here: gap filling designs and adaptive sampling designs. The key point is going to be that the adaptive sampling designs are going to be focusing on the response. So this is quite different from a standard design, where you augment by looking at the design space and the X matrix. Now we're going to actually take into account the targets, or the responses. So this will actually end up providing a whole new capability, so that we can test additional samples where the excitement is. We want to be in a high gradient region, much like you might think of steepest ascent in response surface methodology. Now we're going to automate that in terms of being able to do this with many variables and thousands of runs in the simulation. The good news is that this does scale down quite nicely for the practitioner with small designs as well. I'll do a quick run through of the add-in that we're going to show you, and then Andrew will talk a little bit about the technical details. One thing I do want to apologize for: this is going to be fairly PowerPoint centric rather than a JMP demo, for two reasons: primarily our time, as we've got a lot of material to get through, but also because our JMP utilization is really in the algorithm that we're making for this adaptive sampling. Ultimately, the point and click of JMP is a very simple user interface that we've developed, but what's behind the scenes in the algorithm is really the power of JMP here. So, real quick, the gap filling design is pretty clear.
We can see there are some gaps here; maybe this is a bit of an exaggeration, but it's demonstrative of the technique. In reality we may have a very large number of factors, where the curse of dimensionality can come into play and you have these holes in your design. And you can see, we could augment it with a space filling design, which is kind of the workhorse in augmentation for a lot of our work, particularly in simulation, and it doesn't look too bad. If we look at those blue points, which are the new points that we've added, it doesn't look too bad. Then if you start looking a little closer, you can kind of see, though, that we started replicating a lot of the runs we've already done, and maybe we didn't fill in those holes as much as we thought, particularly when we take off the blue coloring and can see that there's still a fair amount of gaps in there. So as we were developing adaptive sampling, we recognized that one piece of it is we needed to fill in the holes in a lot of these designs. We came up with an algorithm in our add-in, called Boundary Explorer, that will do this for any design; it will fill in the holes, and you can see where that might have a lot of utility in many different applications. In this particular slide, we can see that those blue points are now more concentrated in the holes, and there are some dispersed throughout the rest of the region. And when we remove the coloring, it looks a lot more uniform across; we have filled that space very well. Now, that was more of a utility that we needed for our overall goal here, which was adaptive sampling. And the primary application that we found for this was autonomous systems, which have gotten a lot of buzz and a lot of production and test, particularly in the Department of Defense.
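The gap filling idea, concentrating new runs in the empty regions of an existing design, can be sketched with a simple greedy maximin rule: score each candidate by its distance to the nearest point already in the design, and repeatedly add the candidate that maximizes that score. (This is only a conceptual stand-in, with made-up data; the actual Boundary Explorer add-in constructs its gap filling design inside JMP.)

```python
import math
import random

random.seed(1)

# Existing design with a deliberate hole: all runs in the lower-left
# [0, 0.6] x [0, 0.6] region of the unit square.
existing = [(random.random() * 0.6, random.random() * 0.6) for _ in range(30)]

def min_dist(p, pts):
    # Distance from candidate p to the nearest point in pts.
    return min(math.dist(p, q) for q in pts)

# Greedy maximin: from a pool of random candidates, repeatedly add the
# candidate farthest from everything chosen so far, so new runs land
# in the emptiest regions first.
pool = [(random.random(), random.random()) for _ in range(500)]
design = list(existing)
new_pts = []
for _ in range(5):
    best = max(pool, key=lambda p: min_dist(p, design))
    design.append(best)
    new_pts.append(best)

print(new_pts)
```

With the hole in the upper-right, the first few added points land there rather than replicating the crowded region, which is exactly the behavior a gap filling augmentation is after.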
In autonomous systems, there are really two major items. You really need some sensor to let the system know where it is, and then the algorithm or software to react to that. So it's the sensor-algorithm software integration that we're primarily focused on. What that then drives is a very complex simulation model that honestly needs to be run many, many thousands of times. But more importantly, what we have found is that in these autonomous systems there are these boundaries in performance. For example, we have a leader-follower example from the Army. That's where a soldier would drive a very heavily armored truck in a convoy, and then the rest of the convoy would be autonomous; they would not have soldiers in them. Or think of maybe the latest Tesla, the pickup truck, where you have auto nav, right? The idea is we are testing these systems and we have to end up doing a lot of testing. And what happens, for example, maybe even in this Tesla, is that at 30 miles an hour you may be fine avoiding an obstacle, but at 30.1 you would have to do an evasive maneuver that's outside the algorithm's specifications. So that's what we talk about when we say these boundaries are out there. They're very steep changes in the response, very high gradient regions. And that's where we want to focus our attention. We're not as interested where it's kind of a flat surface; the interesting part is the boundary, and that's where we would like to test. And honestly, what we found is, the more we iterate over this, the better our solution becomes. We completely recommend doing this as an iterative process. Hence the adaptive piece of this: do your testing, generate some new good points, see what those responses are, and then adapt your next set of runs to them.
So that's our adaptive sampling. The idea, really the genesis, came from some work that we did with the Applied Physics Laboratory at Johns Hopkins University. They are doing some really nice work with the military, and while reviewing one of their journal articles, I was thinking to myself, you know, this is fantastic in terms of what you're doing, and we could even use JMP to come up with a solution that would be more accessible to many practitioners. Because the problem with the Johns Hopkins approach is that it's very specific and somewhat hard to integrate; it's not something that's very accessible to smaller test teams. So we want to put this in the hands of folks that can use it right away. This paper from the Journal of Systems and Software is the source of our Boundary Explorer. As it turns out, we used a lot of the good ideas, but we were able to come up with different approaches and other methods, in particular using native capability in JMP Pro, as well as some development, like the gap filling design, that we did along the way. Now, in terms of this example problem, it's probably best I just go and explain it right in a demo here. If I look at a graph here, I'll just go back to the Tesla example. Let's say I'm doing an auto navigation type activity and I have two input factors; let's say we have speed and density of traffic. So we're thinking about this Tesla algorithm. It wants to merge to the lane to the left; it wants to pass, so it has to merge. So one factor would be the speed the Tesla is going, and the other might be the density of traffic. And then maybe down in this area here we have a lower number. We can think of these numbers, two to 10, as the responses, maybe even like a probability of a collision.
So down at low speed/low density, we have a very low probability of collision, but up here at high speed/high density, you have a very high probability. The point is, in what I have highlighted and selected here, you can see that there are very steep differences in performance along the boundary region. So as we run the simulation and start doing more and more software tests for the algorithm, we'll note that it really doesn't do us a lot of good to get more points down here. We know that we do well at low density and low speed. What we want to do is really work on the area around the boundaries here. So that's our problem: how can I generate 200 new points that really follow my boundary conditions here? Now, here what I've done is, it's X1 and X2; again, think of the speed as well as the density. And then I just threw in a predictor variable here that doesn't mean anything. And then there's our response. So to do this, all I have to do is come into boundary explorer and, under adaptive sampling, enter my two responses (and you can have as many responses as you need) and my three input factors. And then I have a few settings here: whether or not I want to target the global minimum and max, show the boundary, and, as we'll ultimately show you, some other controls. So what happens in this algorithm is we're really looking at what the nearest neighbors are doing. If all the nearest neighbors have the same response, as in zero probability of having an accident, that's not a very interesting place. I want to see where there are big differences. And that's where that nearest neighbor idea comes into play. So I'll go ahead and run this. And what we're seeing right now is that the algorithm used JMP's native capability for the predictor screening and, fortunately, is not using the normal distribution.
You can see it's running the bootstrap forest. Andrew is going to talk about where that is used. And ultimately what we're going to do here is generate a whole set of new points that should hopefully fall along the boundary. So that took, you know, 30 seconds or so to do these points, and from here I can just go ahead and pull up my new points. So you can see my new points are sort of along those boundaries, probably easiest seen if I go ahead and put in the other ones. So right here, maybe I'll switch the color real quick, and I'll go ahead and show the midpoint and the perturbation points. So right now we can see where all the new points are. The ones that are shaded, those are the original ones, and now we're seeing all of my new points that have been generated on that boundary. So of course the question is, how did we do that? So what I'll do is head back to my presentation, and from there I'll turn it over to Andrew, where he'll give a little bit more technical detail in terms of how we go about finding these boundary points, because it's not as simple as we thought. Andrew Karl Okay. Thanks, Jim. I'm going to start out by talking about the gap filling real quick because, in addition to being integrated into the overall boundary explorer tool, it's a standalone tool as well. So it's got a rather simple interface where we select the columns that define the space we want to fill in. For continuous factors, it reads in the coding column property to get the high and low values, and it can also take nominal factors. In addition, if you have generated this from a custom design or space filling design and you have disallowed combinations, it will read in the disallowed combinations script and only do gap filling within the allowed space. So the user specifies their columns, as well as the number of new runs they want.
And let me show a real quick example in a higher dimensional space. This is a case of three dimensions. We've got a cube where we took out a hollow cylinder, and we went through the process of adding these gap filling runs; we'll turn them on together to see how they fit together, and then also turn off the color to see what happens. So this is nice because in the higher dimensional space, we can fill in these gaps that we couldn't even necessarily see in the bivariate plots. So how do we do this? We take the original points, which in this case are colored red instead of black, and we can see where those two gaps were, and we overlay a candidate set of runs from a space filling design for the entire space. For the concatenated data table of the old points and the new candidate runs, we have a continuous indicator column: we label the old points 1 and the candidate points 0. And in this concatenated space, we now fit a 10-nearest-neighbor model to the indicator column and we save the predictions from this. The candidate runs with the smallest predictions, in this case, blue, are the gap points that we want to add into the design. Now, if we do this in a single pass, it tends to overemphasize the largest gaps. So what we actually do is a tenfold process, where we take a tenth of our new points, select them as we see here, add those in, and then rerun our k-nearest-neighbor algorithm to pick out some new points and fill out all the spaces more uniformly. So that's just one option...the gap filling is one option available within boundary explorer. So Jim showed that we can use any number of responses and any number of factors, and we can have both continuous and nominal responses and continuous and nominal factors. The continuous factors that go in, we normalize behind the scenes to [0, 1] to put them on a more equal footing.
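The gap-filling idea just described can be sketched in a few lines of Python. This is a minimal illustration of the nearest-neighbor indicator trick, not JMP's actual implementation: the function name is made up, uniform random candidates stand in for a proper space filling design, and the factors are assumed to already be normalized to [0, 1].

```python
import numpy as np

def gap_fill(existing, n_new, n_candidates=2000, k=10, folds=10, rng=None):
    """Select candidate points that fall in the gaps of `existing`.

    Sketch of the described idea: label existing points 1 and candidates 0,
    score each candidate by the fraction of its k nearest neighbors (in the
    combined set) that are existing/design points, and keep the candidates
    with the smallest scores.  Selection runs in several passes so the
    largest gaps are not over-emphasized, mirroring the tenfold process
    described in the talk.
    """
    rng = np.random.default_rng(rng)
    design = np.asarray(existing, dtype=float).copy()
    candidates = rng.uniform(0.0, 1.0, size=(n_candidates, design.shape[1]))
    chosen = []
    per_pass = max(1, n_new // folds)
    while len(chosen) < n_new:
        pool = np.vstack([design, candidates])
        labels = np.concatenate([np.ones(len(design)), np.zeros(len(candidates))])
        # distance from every candidate to every pooled point
        d = np.linalg.norm(candidates[:, None, :] - pool[None, :, :], axis=2)
        nearest = np.argsort(d, axis=1)[:, 1:k + 1]   # column 0 is the candidate itself
        scores = labels[nearest].mean(axis=1)          # low score = deep inside a gap
        take = min(per_pass, n_new - len(chosen))
        picks = np.argsort(scores)[:take]
        chosen.extend(candidates[picks])
        design = np.vstack([design, candidates[picks]])  # picked points now count as filled
        candidates = np.delete(candidates, picks, axis=0)
    return np.array(chosen)
```

Running this on a design with a hole punched in the middle places the new points inside the hole, which is the behavior the cube-minus-cylinder demo shows.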
And for the individual responses that go into this, we loop individually over each response to find the boundaries for that response within the factor space. And then at the end, we have a multivariate tool using a random forest that considers all of the responses at once. And so we'll see how each of the different options available here in the GUI, in the user interface, comes up within the algorithm. So after normalization of any of these continuous columns, the first step is predictor screening for both the continuous and nominal responses. This is to find the predictors that are relevant for each particular response. We have a default setting in the user interface of .05 for the proportion of variance explained, or portion of contribution, from each variable. So in this case, we see that X1 and X2 are retained for response Y1, and the X3 noise is rejected. The next step is to run a nearest neighbor algorithm. We use a default of 5, but that's an option that the user can toggle. And we aren't so concerned with how well this predicts; we simply use it as a method to get the five nearest neighbors. What are the rows of the five nearest neighbors, and what are their distances from the current row? We're going to use this information about the nearest neighbors to identify, for each point, the probability of its being a boundary point. We have to split here and use a different method for continuous versus nominal responses. For the nominal responses, what we do is concatenate the response from the current row along with the responses from the five nearest neighbors, in order, in a concatenated neighbors column. And we have a simple heuristic we use to identify the boundary probability based on that concat neighbors column. If all the responses are the same, we say it's a low probability of being a boundary point.
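The nominal-response heuristic described here and completed in the next paragraph (0 disagreeing neighbors = low, 1 = medium, 2 or more = high, plus recording the boundary pair) can be sketched as follows. This is an illustrative reconstruction, assuming the neighbor indices and distances have already been computed; the function and argument names are hypothetical, not JMP's API.

```python
import numpy as np

def nominal_boundary_probability(y, neighbor_idx, neighbor_dist):
    """Classify each row's boundary probability for a nominal response.

    For each row, count how many of its k nearest neighbors have a different
    response: 0 -> 'low', 1 -> 'medium', 2+ -> 'high'.  Also record the
    "boundary pair": the closest neighbor whose response differs (-1 if none).
    """
    y = np.asarray(y)
    probs, pairs = [], []
    for i in range(len(y)):
        idx = neighbor_idx[i]
        differs = y[idx] != y[i]
        n_diff = int(differs.sum())
        probs.append("low" if n_diff == 0 else "medium" if n_diff == 1 else "high")
        if n_diff:
            # closest disagreeing neighbor is the boundary pair
            d = np.where(differs, neighbor_dist[i], np.inf)
            pairs.append(int(idx[np.argmin(d)]))
        else:
            pairs.append(-1)
    return probs, pairs
```

The boundary pair recorded here is what the midpoint method later uses to place new runs between disagreeing neighbors.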
If at least one of the responses is different, then we say it's got a medium probability of being a boundary point. And if two or more of the responses are different, it's got a high probability of being a boundary point. We also record the row used. In this case, that is the boundary pair, that is, the closest neighbor that has a response that is different from the current row. We can plot those boundary probabilities in our original space filling design. So as Jim mentioned early on, we initially run a space filling design before running this boundary explorer tool, to explore the space and get some responses. And now we fit that in and we've calculated the boundary probability for these. And we can see that our boundary probabilities are matching up with the actual boundaries. For continuous responses, we take the continuous response from the five nearest neighbors, add a column for each of those, and take the standard deviation of those. The ones with the largest standard deviations of neighbors are the points that lie in the steepest gradient areas, and those are more likely to be our boundary points. We also multiply the standard deviation by the mean distance to get our information metric, because what that does is, for two points that have an equal standard deviation of neighbors, it will upweight the one that is in a more sparse region with fewer points already there. So now we've got this continuous information metric, and we have to figure out how to split it up into high, medium, and low probabilities for each point. So what we do is fit a distribution, a normal 3-mixture, and we use the mean of the largest distribution as the cutoff for the high probability points. And we use the intersection of the densities of the largest and the second largest normal distributions as the cutoff for the medium probability points.
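The continuous-response information metric, as described, is just the standard deviation of the neighbors' responses times the mean neighbor distance. A minimal sketch, assuming precomputed neighbor indices and distances; the normal 3-mixture step for choosing the high/medium cutoffs is omitted, and the function name is illustrative.

```python
import numpy as np

def continuous_information_metric(y, neighbor_idx, neighbor_dist):
    """Information metric for a continuous response, per the talk.

    For each point: (std of the k nearest neighbors' responses) times
    (mean distance to those neighbors).  Large values flag points in steep,
    sparsely sampled regions; the talk then fits a normal 3-mixture to
    these values to pick the high/medium probability cutoffs.
    """
    y = np.asarray(y, dtype=float)
    sd = np.std(y[neighbor_idx], axis=1, ddof=1)   # steepness of the local gradient
    mean_dist = np.mean(neighbor_dist, axis=1)     # sparseness of the neighborhood
    return sd * mean_dist
```

On a step-function response, the metric is zero on the flat plateaus and large only for the rows straddling the step, which is exactly the boundary behavior the surface plots show.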
So once we've identified those cutoffs, we apply them to form our boundary probability column. And we also retain the row used, which in this case, for the continuous responses, is the neighbor whose response is the most different in absolute value from the current row. So now for both continuous and nominal responses we have the same output: the boundary probability and the row used. Now that we've identified the boundary points, we need to use them to generate new points along the boundary. The first and, in some ways, the best method for targeting and zooming in on the boundary is what we call the midpoint method. We do this for each boundary pair, each row and its row used, the neighbor identified previously (not necessarily the nearest neighbor, but the neighbor that is most relevant, either in terms of a different nominal response or the most different continuous response). For the continuous factors, we take the average of the coordinates of each of those two points to form the midpoint, and that's what you see in the graph here; we would put a point at the red circle. For nominal factors, for the boundary pairs, we take the levels of that factor that are present in each of the two points and we randomly pick one of them. The nice thing about that is if they're both the same, then the midpoint is also going to be the same level for that nominal factor. A second method, which we call the perturbation method, is to simply add a random deviation to each of the identified boundary points. So for the high probability points, we add two such perturbation points; for the medium, we add one. And for that, for the continuous factors, we add a random deviation.
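The midpoint rule for a boundary pair with mixed factor types might look like the following sketch; the function and arguments are illustrative, not JMP's API, but the rule follows the description: average the continuous coordinates, randomly pick one of the two levels for nominal factors (so a shared level is automatically kept).

```python
import numpy as np

def midpoint(point_a, point_b, is_continuous, rng=None):
    """Midpoint of a boundary pair, per the talk's description.

    Continuous factors: average of the two coordinates.
    Nominal factors: randomly pick one of the two points' levels;
    if both points share a level, that level is kept by construction.
    `is_continuous` is a boolean mask over the factor columns.
    """
    rng = np.random.default_rng(rng)
    a = np.asarray(point_a, dtype=object)
    b = np.asarray(point_b, dtype=object)
    mid = np.empty(len(a), dtype=object)
    for j in range(len(a)):
        if is_continuous[j]:
            mid[j] = (float(a[j]) + float(b[j])) / 2.0
        else:
            mid[j] = a[j] if rng.random() < 0.5 else b[j]  # shared levels stay shared
    return mid
```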
It's normal, mean 0, standard deviation .075 in the normalized space, and that .075 is something that you can scale within the user interface to either increase or reduce the amount of spread around the boundary. And then for nominal factors, we randomly pick a level of each of the nominal factors. Now for the high probability boundary points that get a second perturbation point, in that second one we restrict the nominal factor settings to all be equal to those of the original point. So we do this process of identifying the boundary and creating the midpoints and perturbation points for each of the responses specified in the boundary explorer. Once we do that, we concatenate everything together, and then we look at all the midpoints identified for all the responses and use a multivariate technique to generate any additional runs. Because the user can specify how many runs they want, and these midpoint and perturbation methods only generate a fixed number of runs, depending on the lay of the land, I guess you could say, of the data. So what we do is something similar to the gap filling design, where we take all of the identified perturbation points and midpoints for all of the responses and fill the entire space with a space filling design of candidate points. We label the candidate points 0 in a continuous indicator, the midpoints 1, and the perturbation points .01. We fit a random forest to this indicator. We save the predictions for the candidate space filling points, and then we take the candidate runs with the largest predicted values of this boundary indicator. Those are the ones that we add in using this random forest method. Now since this is a multivariate method, if you have an area of your design space that is a boundary for multiple responses, that area will receive extra emphasis and extra runs.
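A perturbation point, as described, could be sketched like this, assuming factors normalized to [0, 1]. The clipping back into range and the argument names are my assumptions, not part of the talk; the N(0, .075) noise, the random nominal level, and the keep-original-levels option for the second high-probability perturbation follow the description.

```python
import numpy as np

def perturb(point, is_continuous, nominal_levels, scale=0.075,
            keep_nominal=False, rng=None):
    """One perturbation point around an identified boundary point.

    Continuous factors: add N(0, scale) noise in the normalized [0, 1]
    space (the talk's default spread is .075, user-scalable) and clip back
    into range.  Nominal factors: pick a random level, unless `keep_nominal`
    is set, as the talk does for the second perturbation of high-probability
    points, where the original levels are retained.
    """
    rng = np.random.default_rng(rng)
    out = np.array(point, dtype=object)
    for j in range(len(out)):
        if is_continuous[j]:
            out[j] = float(np.clip(float(out[j]) + rng.normal(0.0, scale), 0.0, 1.0))
        elif not keep_nominal:
            out[j] = rng.choice(nominal_levels[j])  # random level for this factor
    return out
```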
So here's showing the three types of points together. Now, again, to emphasize what Jim said, this needs to happen in multiple iterations, so we would collect this information from our boundary explorer tool and then concatenate it back into the original data set. And then after we have those responses, we rerun boundary explorer, and over the iterations it's going to continually zoom in on the boundaries and possibly even find more boundaries. The perturbation points are most useful for early iterations, when you're still exploring the space, because they're more spread out; the random forest method is better for later iterations, because it will have more midpoints available, since it uses not only the ones from the current iteration but also the previously recorded ones. We have a column in the data table that records the type of point that was added, so we'll use all the previous midpoints as well. If we pull up our surface plot for this step function example we've been looking at, we can see our new points, midpoints, and perturbation points are all falling along the cliffs, which are the boundaries, which is what we wanted to see. The last two options in the user interface are to include those gap filling runs, and to ask it to target a global min, max, or match target for any continuous factors, if that's set as a column property. Just to show one final example here, we have these two canyons through a plane with kind of a deep well at the intersection of them. And we've run the initial space filling points, which are the points that are shown, to get an idea of the response. And if we run two iterations of our boundary explorer tool, this is where all the new points are placed, and we can see the gaps kind of in the middle of these two lines. What are those gaps? If we take a look at the surface plot, those gaps are the canyon floors, where it's not actually steep down there.
So it's flat, even locally over a little region, but all of these midpoints have been placed not on the planes but on the steep cliffs, which is where we wanted them. And here we're toggling the minimum points on and off, and you can see those are hitting the bottom of the well there, so we were able to target the minimum as well. So our tool presents two distinct options, two new tools: the gap filling, which can be used on any data table that has coding properties set for the continuous factors, and the boundary explorer tool, which adds runs that don't look at the factor space by itself but look at the response, in order to target the high gradient, high change areas with additional runs.
Lisa Grossman, Associate Test Engineer, JMP Mandy Chambers, Principal Test Engineer, JMP   With a growing population in Cary, it is important to understand the environmental impact that a household can incur and what sustainable options are available that can minimize an individual’s environmental footprint. Recycling, for one, is a universally known method that reduces waste that would otherwise be disposed of at landfills, which already are a capacity concern around the world. According to the Recycling Coalition of Utah, only 25% of plastic produced in the U.S. gets recycled, and recycling the other 75% could mean saving up to 44 million cubic yards of landfill space annually. Using recycling data that is recorded by the Town of Cary, we will analyze the relationship between the collected waste and recyclables. We will also construct visualizations to explore the breakdowns of waste and each recycling category. The goal is to compare our analysis with statistics from other cities in the U.S. to assess recycling practices of the people of the Town of Cary and determine levels of further recycling potential. And in the midst of a pandemic, we will discover how Covid-19 has influenced waste and recycling management within the country. With our findings, we hope to communicate about environmental initiatives and inform about recycling efforts in our very own community, as well as address some impacts of Covid-19.     Auto-generated transcript...   Speaker Transcript Lisa Grossman Okay. All right. I'll get started. Hi everyone, my name is Lisa Grossman and my partner, Mandy Chambers, and I are both testers here at JMP. And today we are excited to share with you the work that Mandy and I have done with recycling and garbage collection data. So we were interested in looking at recycling and trash data in our own community, being Cary, North Carolina.
And we were curious about what kind of patterns we would uncover in our exploration, and what we would learn about some Covid 19 impacts, all while using JMP. So in JMP, we're going to be using Graph Builder's visual tools to see trends in Cary's trash and recycling collection categories such as paper, plastic, glass, etc. And we're going to use Text Explorer's word cloud feature to identify some challenges for waste and recycling management that may have arisen due to the Covid 19 outbreak. And from what I show you today, we hope that you'll be able to use these quick and easy steps to explore your own data. And so for those of you who may not know, Cary, North Carolina, is where SAS is headquartered. And Cary has a population of approximately 175,000 current residents, which is about a 30% increase since 2010. And thanks to the town of Cary, we were able to get our hands on some of the recycling and trash collection data they had recorded from 2010 to the present. So I wanted to quickly go over some of the steps we took in our process to explore the data, which include first importing Excel sheets that we got from the town of Cary into JMP. And I wanted to note here that the Excel wizard does offer many advanced features that you might be interested in, in the case that you need to import Excel sheets into JMP. To organize our data and columns, we used table operations like transpose and updated column properties in the column info dialog to make our data a little easier to graph later on. And then, launching the Text Explorer and Graph Builder platforms, we used those to make our basic visualizations. And then I'm going to show you a new hover label feature that's available in JMP 15 called pasting graphlets, and I will show you an example of this later on using a tabulate. And so, getting to our graphs and figures for Cary: looking at them, we can see that the first two up top are looking at the breakdown of the recycling categories.
So getting a closer look here, we can see the bar chart on the left is showing the average capture rates of the overlaid recycling categories from 2010 to 2019. And we can see that news, glass, and cardboard are the three leading categories for recycling. Then in the line graph to the right, we can see the trends of the recycling rates of each category over the years. What's interesting, and I wanted to point out here, is that it seems that news and mixed paper are inversely related to each other. And then going back to our poster, let's look at the last two graphs we have here. In these, we are looking at recycling in comparison to garbage collection in Cary. So zooming in here, we can see the stacked bar on the left shows the total percentage of waste and recycling recorded each year. And labeling the percentage values on the bars themselves, we can see that the recycling collection volume seems to have been slightly decreasing since 2014. Then in the graph to the right, we can see the progression of both trash and recycling from 2010 to 2019. And this visual shows us how the tonnage of trash is increasing each year, which seems expected as the population increases. But what is surprising is that the tonnages for recycling have remained rather steady. So thinking about this, we were wondering if this could be due to a rise in more sustainable products, such as using personal water bottles or tumblers. But now that we are in the midst of Covid 19, we were curious to learn if there were any noticeable differences in recycling and trash collection so far this year. And the town of Cary was able to provide us with some updated data that goes up to the month of June. And so we created another stacked bar here to show how 2020 has compared so far to the previous years. And at first glance, 2020 is steadily increasing, and the labeled tonnages do not show a significant spike in the collection so far.
So then we decided to break it down month by month, using our side by side bar charts to compare 2020 to 2019. Our top bar chart here shows recycling overlaid by curbside, drop off, and computer recycling, and the bottom chart shows trash collection. So in the month of March, when North Carolina first implemented stay at home orders, Cary saw a nearly 21% increase in garbage and a 23.8% increase in recycling collection. And just for reference, a 21% increase is about 1.1 million pounds. In April and May, trash and recycling somewhat leveled out, but then spiked again in June, so it will be interesting to see how the rest of the year will pan out. So something I wanted to point out here is, notice the information included in the hover label that is pinned. Using the labeling feature, which can be done by right clicking on columns in your data table and selecting label, you will be able to see that column information represented in the hover label. You can add as many columns as you'd like, so that you can read that information in your graph. Doing some further reading, we saw that Wake County, the county that Cary is in, reportedly generated about 29% more trash, totaling about 739 tons; 45% more cardboard recycling; and 20% more recycling in the week of April 13 alone. And we also found an estimate from the World Health Organization that the world is using about 89 million masks and 76 million gloves each month. And we found an article here that gives us some insight on how the Covid 19 outbreak has affected recycling and trash collection. So by downloading the article and importing it into JMP, we could use the Text Explorer platform to identify some themes in the word cloud. So I'll zoom in on it here. You can use some features and options in Text Explorer, like manage stop words, and in the word cloud itself you can change the coloring, the layout, and the font, to really customize your word cloud.
And so after making these customizations, we got the word cloud shown to the right. Notice that an increase in tonnage has been the highlight for cities like Phoenix and New York City. Because of this, we were curious to learn more about recycling and trash management in New York, and luckily, we were able to find some open data for Brooklyn, Manhattan, Queens, the Bronx, and Staten Island. So if we look at the first line chart, which shows the average tonnage collection for paper and for metal, glass, and plastic, we can see here in this chart I have the boroughs grouped. They are scaled here by the month, and we're looking at the recycling collection tonnages here on the y axis. So we can see that boroughs like the Bronx and Brooklyn are steadily increasing starting in month 4, being April, but we can see that there is more of a spike in collection in both Staten Island and Queens. What's very interesting is that there's a noticeable decrease in collection in Manhattan, and we were curious as to why this might be. With a little research, we have come to the conclusion that it seems stay at home orders meant there were fewer workers in the city, therefore leading to reduced recycling capture. A similar trend can be seen for garbage collection rates in the line graph that we have. And so we can see, in the same manner, Manhattan sees a dip in garbage collection, whereas Queens and Staten Island saw an increase. But something we wanted to highlight in this graph, as a new feature of JMP 15, is this custom tabulate graphlet in the hover label. So notice that the pinned hover label here shows us a tabulate that gives us the tonnage values of both recycling categories and garbage collection for the months of January to June, just for Queens in 2020, which is the point on the line that we have pinned. Creating this line graph with a custom tabulate graphlet was only a matter of a couple of steps.
So first we needed to make our base graph, which is the line graph we have here. Then we separately created our tabulate table, which is shown here. For space's sake, I couldn't include the whole tabulate, but as you can see, it shows the monthly averages of recycling and trash collection for each borough in 2019 and 2020. And so all we would have to do is go into that little red triangle menu in tabulate and save the script to our clipboard. The next thing we would do is go back to our base graph and right click in the background, and under the hover label menu there's going to be a paste graphlet option. You don't have to worry about any filtering or anything; there is some magic that works behind the paste graphlet. So that's all you would need to do, and each point would be filtered for you. Now when you hover over a point in your line, you can see the complete, filtered parts of your tabulate corresponding to your point of interest. So this concludes our presentation on our findings with trash and recycling collection for the town of Cary and New York, and as the year plays out, I think it'd be very interesting to see how this data might change, and I hope to keep looking at it and see how 2020 will pan out. We wanted to give some special thanks to Bob Holden and Srijana Guilford from the town of Cary for working with us and sharing their data sets. And I have linked here the open data set for the New York data; I believe it's constantly updated, so if you are interested in playing around with that data, it's available here. I have also linked here some more information on graphlets. There are a ton of ways that you can use graphlets, and many, many ways that you can customize them too, so please check out this link, where you can meet the developer, Nascif, and get some more information.
Thank you.  
Philip Ramsey, Senior Data Scientist and Statistical Consultant/Professor, North Haven Group and University of New Hampshire Tiffany D. Rau, Ph.D., Owner and Chief Consultant, Rau Consulting, LLC   Quality by Design (QbD) is a design and development strategy where one designs quality into the product from the beginning instead of attempting to test in quality after the fact. QbD initiatives are primarily associated with bio-pharmaceuticals, but contain concepts that are universal and applicable to many industries. A key element of QbD for bio-process development is that processes must be fully characterized and optimized to ensure consistently high quality manufacturing and products for patients. Characterization is typically accomplished by using response surface type experimental designs combined with the full quadratic model (FQM) as a basis for building predictive models. Since its publication by Box (1950), the FQM has been commonly used for process characterization and optimization. As a second order approximation to an unknown response surface, the FQM is adequate for optimization. Cornell and Montgomery (1996) showed that the FQM is generally inadequate for characterization of the entire design space, as QbD requires, given the inherent nonlinear behavior of biological systems. They proposed augmenting the FQM with higher order interaction terms to better approximate the full design region. Unfortunately, the number of additional terms is large and often not estimable by traditional regression methods. We show that the fractionally weighted bootstrapping (FWB) method of Gotwalt and Ramsey (2017) allows the estimation of these fully augmented FQMs. Using two bio-process development case studies, we demonstrate that the augmented FQM models substantially outperform the traditional FQM in characterizing the full design space. The use of augmented FQMs and FWB will be thoroughly demonstrated using JMP Pro 15.     Auto-generated transcript...   
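To make the fractional weighting idea concrete, here is a deliberately simplified Python sketch: each iteration draws positive fractional case weights and fits a weighted least-squares model, and the coefficients are averaged over iterations. The actual Gotwalt and Ramsey (2017) method is richer, pairing anti-correlated training and validation weights with model selection in JMP Pro, which is what allows fitting augmented FQMs with more terms than runs; none of that machinery is shown here, and the function name and defaults are illustrative.

```python
import numpy as np

def fwb_fit(X, y, n_boot=200, rng=None):
    """Average coefficients over fractionally weighted refits, a sketch.

    Each iteration draws standard-exponential (mean 1) fractional weights
    for the rows and fits a weighted least-squares model; the returned
    coefficients are the average over all iterations.  This shows only the
    core reweighting idea of FWB, not the full selection-based method.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    coefs = np.zeros(X.shape[1])
    for _ in range(n_boot):
        w = rng.exponential(1.0, size=len(y))   # fractional case weights
        sw = np.sqrt(w)                         # WLS via row rescaling
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        coefs += beta
    return coefs / n_boot
```

Because every reweighted fit uses all of the runs with continuous weights (rather than resampling rows, as the classical bootstrap does), no observation is ever fully dropped, which is what makes fractional weighting attractive for the small designed experiments typical of process characterization.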
Speaker Transcript Tiffany First, thanks for joining us today. We're going to be talking about characterizing bioprocesses, really focusing on pDNA, and seeing how fractionally weighted bootstrapping can really add to your processes. Phil will be joining me in the second half of the presentation to go through the JMP example, as well as to give some different techniques to be used. I'm going to be talking about the CMC strategy, biotech, how we get a drug to market, and how we can use new tools like FWB in order to deliver our processes. So let's get started. The chemistry, manufacturing, and controls journey: it's a very long journey, and we'll be discussing that. Why DOE, why predictive modeling? It is very complex to get a drug to market. It's not just about the experiments, but it's also about the clinic, and having everything go together. So we'll look at systems thinking approaches as well. And then, of course, characterizing that bioprocessing, and then the case study that Phil will discuss. So what does the CMC pathway look like? This is a general example, and we go from toxicology, which is to see whether the drug has any efficacy, does it work in a nonhuman trial, all the way up to commercial manufacturing. And there are a lot of things that go into this, for example, process development. But you also have to have a target product profile: what do you want the medication that you're developing to actually do? It's important to understand that as you're going through. And then of course the quality target product profile as well: what are the aspects that are necessary in order for the molecule to work as prescribed? And then we look at critical quality attributes and then go through the process. It's a group, because you have your process development, process characterization, process qualification, and the BLA. There's a huge amount of work that goes into each one of these steps.
And we also want to make sure that as you're going through your different phases, you're actually building these data groupings, because when you get to process characterization and process qualification, you want to make sure that you can leverage as much of your past information as you can, so that you've actually shortened your timelines. You might say, "Tiffany, I do process characterization all the way through the process." And I'll say, "Absolutely." But the process characterization that is specific to the CMC pathway is what we need from a regulatory point of view. So everyone has probably heard in the news about vaccines and cell and gene therapies. It's a very hot subject right now, and it's also bringing new treatments to patients that we've never been able to treat before. And within the two big groupings of cell therapies and gene therapies, there are different aspects. Right? So we have immunotherapies. We have stem cells. We're doing regenerative medicine. So just imagine having damage in your back and being able to regenerate. There's a huge emphasis on this grouping. But there's also a huge emphasis on gene therapies. Viral vectors. Do we use bacterial vectors? How do we get the DNA into the system in order to be a treatment for the patients? Well, plasmid DNA is one of those aspects, and Phil has an amazing case study where they did an optimization. So you might say, "What is pDNA, and why is it important, other than being part of the gene therapy space, which is very interesting right now?" Well, the fact is that it can be used in preventative vaccines, immunization agents, preparation of hyperimmune globulin, cancer vaccines, therapeutic vaccines, and gene replacements. Or maybe you have a child that has a rare gene mutation. Can we go in and make those repairs? 
All these things are around this, and as gene therapy technology continues to grow, the regulations continue to increase as you move through the pathway closer to commercialization, and the amount of data also increases. Just imagine: gene therapies and cell therapies are where we were 20-plus years ago with ??? cell culture and monoclonal antibodies. It's an amazing world where we're learning new things every day, and we go, "Oh, this isn't platform." We need new equipment. We need new ways of working. We need to be able to analyze data sets that are very, very small, because in cell therapies and gene therapies, the number of patients is typically smaller than in other indications. So what's next for pDNA? Well, of course, as the cell therapy and gene therapy market continues to grow, we're going to continue on this pathway into commercialization. We need to be able to work with the FDA hand in hand, because these are things that we've never done before. We're using raw materials that we don't use in other indications for medication. So there's a lot to be done. It's also critical to be able to make these products, like the pDNA, the way that we get the vector, at the appropriate volume but also with the appropriate quality. If you have the best medication in the world but you're not able to make it, then you don't have a medication, right? You can't deliver it to the patients. So we also need to make sure that our process is well characterized. And as I mentioned earlier, many of these indications are very small, so the clinical trials are also small. And at the same time, the patients are often very, very sick. So being able to analyze our data and respond to their needs very quickly is key, both in the clinical aspect as well as when we become commercialized. We don't want to have the situation where, guess what, I can't make the drug, right? I want to be able to make it. 
Also, manufacturing is a very important thing. I don't know if you've noticed in the news, there have been a lot of announcements of expansions. Of course, people are expanding capacity for vaccines, but one of the big moves is also pDNA. People are spending millions of dollars, sometimes billions of dollars, expanding those manufacturing sites. And you might say, well, okay, you increase your manufacturing site. That's great. But now I need to be able to tech transfer into that manufacturing site. I need to make sure my process is robust, and not only that it can be transferred and scaled up, but that I have the statistical power to say I know that my process is in control. I might have a 2% variability, but I always have a 2% variability and I have it characterized, for instance. And as more and more capacity comes online, and as we also have shortages, the question becomes where do I bring my product, and taking those considerations into account means designing for manufacturing earlier. And you could have multiple products in your pipeline. So you want to make sure that you're learning and able to go and grab that information and say, let me do some predictive modeling on this; it might not be the exact product, but it has similar attributes. So the path to commercialization is very integrated: just like the CMC strategy takes in the clinical aspect, everything comes together in order to progress a molecule through. We also have to think about the systems aspect of it. Why? Because if we do something in the upstream space, we might increase productivity by 200%, let's say, which would have us going, "Yes, I've made my milestone. I can deliver to my patient." But if my downstream or my cell recovery can't actually recover the product, whether that is a cell or a protein therapeutic, for instance, then we don't have a product. All of that work is somewhat thrown out the door. 
So having the systems approach, making sure you involve all the different groups, business, supply chain, QC, discovery, is very key; everyone has knowledge that they bring to the table in order to deliver to the patient in the end. So I'm going to hand it over to Phil now. I would have loved to have spoken a lot more about how we develop drugs, but let's see how we can analyze some of our data. So, Phil, I'll hand it over to you now. Philip Ramsey Okay, so thank you, Tiffany, for that discussion to set the stage for what is going to be a case study. I'm going to spend most of the time in JMP demonstrating some of the important tools that exist in JMP. You may not even know they are there, but they are actually very important to process development, especially in the context of the CMC pathway, chemistry, manufacturing and controls, and quality by design. Two important characteristics of process development, in general, are that you want to design a process, but you also need to characterize it. In fact, you have to characterize the entire operating region. And of course, we want to optimize so that we have highly desirable production. What we often don't talk about enough is that these activities are inherently about prediction. We have to build powerful predictive models that allow us to predict future performance. That's a very important part, especially in late-stage development, for regulatory agencies. You have to demonstrate that you can reliably produce a product. Well, a key paper on this issue of process characterization and prediction was the very famous paper by George Box and his coauthor Wilson, who was an engineer. In it they talked about the beginnings of what people know today as response surface methodology. And the key to their work was something they called the full quadratic model. Well, what is that? That's a model that contains the main effects, all two-way interactions and quadratic effects. 
And this is still probably the gold standard for building process models, especially for production. And they are good for optimization; they are good second-order approximations to these unknown response functions. What is not as well understood is that, over the entire design region, they often are a poor approximation to the response surface. In 1996 the late John Cornell and, as many people know, Doug Montgomery published a paper that is really underappreciated. In that paper they raised the fact that the full quadratic model often is inadequate to characterize a design space. Think about it from the viewpoint of a scientist, and think how dynamic these biochemical processes often are. In other words, there's a great deal of nonlinearity that leads to response surfaces with pronounced compound curvature in different regions, and the full quadratic model simply can't deal with it. So what they proposed was augmenting that model: they added terms like quadratic-by-linear, linear-by-quadratic and even quadratic-by-quadratic interactions. It turns out these models do approximate design regions better than full quadratic models. I'm going to demonstrate that to you in a moment. But there was a problem for them. Number one, traditional statisticians didn't like the approach; that's changing dramatically these days. But also, there are a lot of these terms that can be added to a model, so many that even a big central composite design becomes supersaturated. What does that mean? It means there are more unknown parameters p than there are observations n with which to fit the model. It turns out that's not really a constraint these days, in the era of machine learning and new techniques for predictive modeling. So what we're going to do is use something called fractionally weighted bootstrapping. This can be done in JMP Pro. 
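To make the size of these augmented models concrete, here is a small counting sketch (in Python, purely as an illustration; the talk itself works in JMP). Assuming five process factors, which matches the fermentation factors mentioned later in the talk (pH, percent dissolved oxygen, induction temperature, feed rate, induction OD600), the full quadratic model has 21 terms and the quadratic-by-linear augmentation brings it to the 41-term model fit in the case study; the factor count of five is my inference, not stated outright.

```python
from math import comb

def fqm_terms(k: int) -> int:
    """Full quadratic model: intercept + k main effects
    + C(k,2) two-way interactions + k pure quadratic terms."""
    return 1 + k + comb(k, 2) + k

def augmented_terms(k: int) -> int:
    """Add the quadratic-by-linear and linear-by-quadratic terms,
    x_i^2 * x_j and x_i * x_j^2 for each unordered pair {i, j}:
    2 * C(k,2) = k*(k-1) extra terms. Quadratic-by-quadratic terms
    x_i^2 * x_j^2 would add another C(k,2) on top of this."""
    return fqm_terms(k) + k * (k - 1)

k = 5  # assumed number of process factors
print(fqm_terms(k), augmented_terms(k))
```

For k = 5 this gives 21 and 41, lining up with the two models compared later; the point is how quickly the augmented term count outruns the run budget of even a large central composite design.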
And we'll use something called model averaging to build models that predict response surfaces, and I am actually going to use these large augmented models. Okay, so when you try to build these predictive models, say for quality by design, there are a number of things you have to be aware of. Again in 1996, one of the pioneers in machine learning, the late Leo Breiman, wrote a paper that is not nearly as appreciated as it needs to be. He pointed out that all these model-building algorithms we use for prediction (and that includes forward selection, all possible models, best subsets, the lasso) are inherently unstable. What does that mean? Being unstable means small perturbations in the data can result in wildly varying models. So he did some work to point this out and he suggested a strategy. He said, "Well, if you could in some way simulate model fitting, and somehow perturb the data on each simulation run, we could fit a large number of models and then average them." And he showed that that had potential. He didn't have a lot of tools in that era to do it. But today I'm going to show you that in JMP Pro we have a lot of tools, and that Breiman's idea is actually a very good one. It is now, one way or the other, commonly accepted in machine learning and deep learning; that is the idea of ensemble modeling and model averaging. By the way, I'll quickly point out that years ago, in the stepwise platform of JMP, John Sall instituted a form of model averaging. It's a hidden gem in JMP. It works nicely and is available in both versions of JMP, but I'm going to offer a more comprehensive solution that can be done in JMP Pro. This solution is referred to as fractionally weighted bootstrapping with auto-validation, and I'm going to explain what that means. When we build predictive models, we have a challenge. We need a training set to fit the model, then we need an additional, or validation, set of data to test the model to see how well it's going to predict. 
Well, with DOE we simply don't have these additional trials available. In fact, Breiman was stuck on this point: there's no way to really generate a validation error. Well, in 2017 at Discovery Summit Frankfurt, Chris Gotwalt, head of statistical research for JMP, and I presented a talk on what we called fractionally weighted bootstrapping and auto-validation. What does auto-validation mean? It means, and this will not seem intuitive, that we're going to use the training set also as a validation set. You say, "Well, that's crazy. It's the same data." But there's a secret sauce to this technique that makes it work. What we do is take the original data, copy it, call the copy the auto-validation set, and then in a special way assign random weights to the observations, and we do the weighting such that we drive anticorrelation between the training set and the auto-validation set. I'm going to illustrate this for you very shortly. And by the way, we have been supervising my PhD student Trent Lemkus, who has studied this method extensively in exhaustive simulations over the last year, and we will be publishing a paper to show that this method actually yields superior results to classical approaches to building predictive models from DOE. So I'm just going to move ahead here and talk about the case study. This is what Tiffany mentioned: pDNA. It's a really hot topic, and pDNA manufacturing is considered a big growth area; the biotech world expects big growth, maybe even 40% year over year, because of all the new therapies coming online where it'll be used. In this case, and this is very common in the biotech world, there's not really any existing data we can use to build predictive models. So that leads us quite rightly to design of experiments. And in this case, we're going to use a definitive screening design. These are wonderful inventions of Brad Jones from JMP and Chris Nachtsheim from the University of Minnesota. 
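Before going on, the anticorrelated weighting just described can be sketched in a few lines. This is one plausible scheme, not necessarily the exact one in JMP Pro (the talk does not spell out the weight distribution): pass a single uniform draw per row through opposite tails of an exponential transform, so a row weighted heavily in the training copy is weighted lightly in the auto-validation copy.

```python
import numpy as np

rng = np.random.default_rng(2021)
n = 1000  # rows; large here only so the correlation estimate is stable

u = rng.uniform(size=n)
w_train = -np.log(u)        # fractional weight for the training copy
w_valid = -np.log1p(-u)     # weight for the auto-validation copy, from 1 - u

# A row that dominates training barely counts in validation, and vice
# versa, so the two weight vectors are strongly negatively correlated.
corr = np.corrcoef(w_train, w_valid)[0, 1]
```

In the actual procedure these weights would be regenerated on every bootstrap iteration, the model refit each time, and the weighted error on the validation copy used to decide how far the selection should proceed.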
They're highly efficient, and I recommend them all the time to people in the biotech world, where you have limited time and resources for experimentation. So basically I'm just showing you a schematic of what a bioprocess looks like. We're going to focus on the fermentation step, but in practice, as Tiffany was alluding to, we would look at both the upstream and downstream aspects of this process. The factors include pH, percent dissolved oxygen, and induction temperature. That's the temperature we set the fermentor at to get the genetically modified E. coli to start pumping out plasmids. And what are plasmids? Well, they're really non-chromosomal DNA. They have a lot of uses in therapies, especially gene therapies, and they're separate from the usual chromosomal DNA that you would find in the bacteria. So our goal is to get these modified E. coli to pump out as much pDNA as possible. So we did the trial. This is an actual experiment. And because we were new to DSDs, we also ran a much larger traditional central composite design, and we did this separately. What I plan to do for today's work is use the CCD as a validation set, and we're going to fit models using auto-validation on the DSD. We'll see how it goes. Okay, so I'm going to now just switch over to JMP, and I'm going to open a data table. Here's the DSD data. We're going to do all our modeling on this data set. And oh, by the way, I am going to fit a 40-predictor model to a 15-run design using machine learning techniques. You're going to have to get your head around the fact that you can do these things; they're actually being done all the time in machine learning and deep learning. So there are a couple of add-ins I want to show you that make this easy to do. You do need JMP Pro. One of them is an add-in that sets up the table for you. This is by Michael Anderson of JMP. So I'm just going to show you what happens. The add-in is available on the JMP Community. 
So notice it took the original data and created a validation set. And as I mentioned, we also have this weighting scheme. These weights are randomly generated, and as you'll see in a moment, we do a simulation study in which we constantly change the weights on every run. This has the effect of generating thousands of iterations of modeling. And you'll also see, as Leo Breiman warned, that as you perturb the responses (we don't change the data structure), you see wild variation in the models. So I'm going to go ahead and just illustrate this for you very quickly. I'm going to go to Fit Model, and we have to tell JMP where the weights are stored. We're going to use generalized regression, which is highly recommended for this. And because this is a quick demo, I'm going to use forward selection, but this is a very general procedure; FWB with auto-validation can be used in many, many different prediction or modeling scenarios. I'm going to do forward selection. Okay, so I fit one model. Then I come down to the table of estimates, right-click and select Simulate. I tell it that I want to do some number of simulation runs, and on each trial I want to swap out the weights; I want to generate new weights. And by the way, I'll just do 10 because this is a demo. So there are the results. You can see we have 10 models, and all of them are quite different. Again, in practice I would do thousands of these iterations. And as I'll show you later, we can then take these coefficients and average them together. And by the way, if you see zero, that means that term did not get into a model. Okay, so what I'm going to do now is show you another add-in. I'm going to close some of this so we keep the screen uncluttered. There's another add-in that we've developed at Predictum. This one not only does the fractionally weighted bootstrapping, but it also does the model averaging. 
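The simulate-and-average workflow just demonstrated can be sketched outside JMP as follows. This is a simplified stand-in, assuming a toy data set, plain weighted forward selection, and exponential fractional weights; JMP Pro's generalized regression is far more capable, but the mechanics are the same: redraw the weights, refit, record the coefficients (zero for unselected terms), and then average across all the fits.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 15, 6                      # 15 runs, 6 candidate terms (toy scale)
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

def weighted_forward_select(X, y, w, max_terms=3):
    """Greedy forward selection under observation weights w; returns a
    length-p coefficient vector with zeros for unselected terms."""
    sw = np.sqrt(w)
    active, beta = [], np.zeros(0)
    resid = y.copy()
    for _ in range(max_terms):
        # weighted correlation of each candidate column with the residual;
        # already-active columns score ~0 because the weighted fit makes
        # the residual orthogonal to them
        scores = np.abs((w * resid) @ X)
        j = int(np.argmax(scores))
        if j in active:
            break
        active.append(j)
        beta, *_ = np.linalg.lstsq(X[:, active] * sw[:, None], y * sw, rcond=None)
        resid = y - X[:, active] @ beta
    out = np.zeros(X.shape[1])
    out[active] = beta
    return out

reps = 200
coefs = np.array([
    weighted_forward_select(X, y, -np.log(rng.uniform(size=n)))
    for _ in range(reps)
])
avg = coefs.mean(axis=0)   # model-averaged coefficients; zeros shrink weak terms
```

The averaged vector recovers the two active terms while the zeros from reps that skipped a spurious term pull its averaged coefficient toward zero, which is exactly the stabilizing effect Breiman argued for.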
In other words, with the add-in I just showed you, if you want to do model averaging, you're kind of on your own; it would be a lot of manual work. So I'm going to use the Predictum add-in. It creates the table, and then, just to find a model quickly to illustrate how the add-in works, I'll use a standard response surface model. We want to predict pDNA, and we're going to use gen reg. So again, as an illustration, I'm just going to go ahead and use forward selection. And again, I'd do thousands of iterations in practice, but I'm only going to do 10. Click Go. Okay. Philip Ramsey Again, this is quick, and I do apologize, this is really three talks conflated into one, but all the pieces fit together in the QbD framework. So I have a model. These are averaged coefficients; again, I've only done 10. I'll save the prediction formula to the data table, and I'm going to try to keep the screen as uncluttered as possible. So there's my formula; the add-in did all the averaging for you, so you don't have to do it. And there's the formula. A little trick you may not be aware of: this is a messy formula, especially if you want to deploy it to other data tables. In the Formula Editor, there's a really neat function called Simplify. See, it simplifies the equation and makes it much more deployable to other data sets. Okay, so that was an illustration of the method. What I'm going to do now is show what happened when we went through the entire procedure. So this is a data table, and here you'll notice the DSD and the CCD data have been combined. I've used row states to exclude the DSD data, because I want to focus on the performance of my models on the validation data; again, the models were fit to the DSD only. So here is my 41-term model. This is the augmented full quadratic done with model averaging over thousands of iterations. 
For comparison, I repeated the same process for the much smaller 21-term full quadratic model. So how did we do in terms of prediction? Let me show you a couple of actual-by-predicted plots. Remember, and I must strongly emphasize this, this is a true validation test. The CCD was done separately: different batches of raw material, including a new batch of the E. coli strain; some of the fermenters were different; and there were completely different operators. So for those of you who work in biotech, you know this is about as tough a prediction job as you're going to get. Again, the model was fit to the DSD. On the left is the 41-term model, the augmented model, and the overall standard deviation of prediction error is about 67. On the right, and again I did use model averaging, which helps improve performance, I fit just the 21-term full quadratic model, and you can see the prediction error is about 70. In fact, without model averaging, which is how many people fit the full quadratic model, performance would be significantly worse. Okay. So then I have the model. What do I do with it? Well, our goal is typically optimization and characterization. So let me open up a profiler. I'll actually do this for you. I'm going to go to the Profiler in the Graph menu, and I'm going to use my best model, the one built with the Predictum add-in. And by the way, if you're interested in this add-in, or even in beta testing it, just contact Predictum; send email to Wayne@predictum.com and I'm sure he'd be more than happy to talk to you. So I'm going to go to the model and then, using desirability, I'm just going to find settings that maximize production. And by the way, this is a major improvement over the production they were historically getting, and it gives us settings at which we should see improved performance. These settings were, by the way, somewhat unintuitive, but that's usually the case in complex systems. 
Things are never quite as intuitive as you think they are. And then there's something else really important, especially if you're doing late-stage development in the CMC pathway: they want you to assess the importance of the inputs, which inputs matter. We can assess variable importance; again, I won't get into all the technical details. It goes through and shows you that, in terms of variation in the response, feed rate is by far the most important. That was not necessarily intuitive to people. Second is percent dissolved oxygen. So what does that tell you? Well, number one, you'd better control these variables very well, or you're likely to have a lot of variation. Now, in this particular case, I don't have critical-to-quality attributes; there were none available. So what we have is a critical-to-business attribute, and that is pDNA production. But there's more we can do in JMP to fully characterize the design space. All I did was an optimization, and that's not characterization. So there's another wonderful tool in the Profiler called the simulator, and it's just not used as much as it should be. What I've done is define distributions for the inputs; that is, I expect the inputs to vary. This is something the FDA wants to know about: what happens to the performance of your process as the inputs vary? There are no perfectly controlled processes, especially once you scale up. By the way, while I think of it, these more complex models, these augmented full quadratic models, from experience I can tell you they scale up better than full quadratic models. That's another reason to fit them. So in the simulator, there's a nice tool called simulation experiment. What that does is what we call a space-filling design: it distributes the points over the whole design region. So I'm going to say I want to do 256 runs. 
And it's going to do 5,000 simulations at each point, calculating a mean, a standard deviation, and an overall defect rate. This actually goes pretty quickly, and I'm just showing you what the output looks like. Again, I've already done this, so in the interest of time I'm just going to open another data table and minimize the other one. So this is the result of the simulation study. I won't get into all the details, but I fit a model to the mean, a model to the standard deviation, and a model to the overall defect rate. The defect rates in some areas are low and in some are relatively high, and these are what we call Gaussian process models, which are commonly used with simulated data. So what can we do with these models and with these simulation results? Well, again, characterization is important. So let me just give you a quick idea. Here's a three-dimensional scatterplot; we're looking at feed rate and percent DO, because they're really important, and the plotted points are weighted by defect rate; bigger spheres mean higher defect rates. So if you look around this, you can see there are some regions where we definitely do not want to operate. So we are characterizing our design space and finding safer regions to operate in. And of course, I could just pick some other variables, and in any case it would show other regions you really want to avoid. We could do more with this, but I think that makes the point. We can also go ahead and again use the Profiler, and I'm going to re-optimize, but in a different way. This time I want to maximize mean pDNA, and I want a dual response: I also want to minimize the overall defect rate. So again, I'm going to go ahead and use desirability. This takes a few minutes; these are very complex models that we're optimizing. And notice, it comes up and says high feed rate, high DO, close to neutral pH and the induction... 
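Stepping back for a moment, the simulator's per-point calculation can be sketched as a plain Monte Carlo study. Everything below is illustrative: the response function, the input setpoints and standard deviations, and the lower spec limit are invented stand-ins for the fitted JMP model and real process limits.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(feed_rate, pct_do):
    """Hypothetical fitted response surface (stand-in for the averaged model)."""
    return 300 + 40 * feed_rate + 25 * pct_do - 8 * feed_rate**2

# One design point: setpoints plus assumed transmitted input variation
n_sim = 5000
feed = rng.normal(loc=2.0, scale=0.15, size=n_sim)
do = rng.normal(loc=3.0, scale=0.20, size=n_sim)

yhat = predict(feed, do)
spec_lo = 420.0                        # hypothetical lower spec limit
mean_y = yhat.mean()
sd_y = yhat.std()
defect_rate = np.mean(yhat < spec_lo)  # fraction of simulated lots out of spec
```

A simulation experiment then repeats this at a few hundred space-filling points across the design region and treats mean_y, sd_y, and defect_rate as responses to be modeled in their own right, which is where the Gaussian process fits come in.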
By the way, if you want to know what induction OD600 is, it's a measure of microbial mass, and once you reach a certain mass (no one's quite sure what that is, which is why we do the experiment) you then ramp up the temperature of the fermentor. This actually forces the E. coli to start pumping out pDNA, or plasmids, which they're engineered to do. So we call that the induction temperature. Okay. Well, notice that at these settings we are guaranteed a low defect rate; the overall optimized response wasn't as high, but remember, we also get a process less prone to generating defects. Okay, so at this point, I'll just quickly go to the end, to the last slide. Everything is in these slides; they've all been uploaded to the JMP Community, and at the end there is an executive summary. Basically, what we're showing you is that process and product development using the CMC pathway (and a part of that is quality by design) requires a holistic or integrated approach. A lot of systems thinking needs to go into it. Process design and development is inherently a prediction problem, and that is the domain of machine learning and deep learning. It's not what you might think; it's not business as usual for building models in statistics, especially for prediction. We've shown you that fractionally weighted bootstrapping, auto-validation and model averaging can generate very effective and accurate predictive models. And again, I want to emphasize that these more complex augmented models of Cornell and Montgomery are actually quite important. They really do scale better and they do give you better characterization. And with that, I thank you and I will end my presentation.  
Caleb King, Research Statistician Developer, JMP Division, SAS Institute Inc.   Invariably, any analyst who has been in the field long enough has heard the dreaded questions: “Is X-number of samples enough? How much data do I need for my experiment?” Ulterior motives aside, any investigation involving data must ultimately answer the question of “How many?” to avoid risking either insufficient data to detect a scientifically significant effect or having too much data, leading to a waste of valuable resources. This can become particularly difficult when the underlying model is complex (e.g., longitudinal designs with hard-to-change factors, time-to-event responses with censoring, binary responses with non-uniform test levels, etc.). In this talk, we will show how you can wield the "power" of one-click simulation in JMP Pro to perform power calculations in complex modeling situations. We will illustrate this technique using relevant applications across a wide range of fields.     Auto-generated transcript...   Speaker Transcript Caleb King Hello, my name is Caleb King. I'm a research statistician developer here at JMP in the design of experiments group. And today I'll be talking to you about how you can use JMP to compute power calculations for complex modeling scenarios. As a brief recap, power is the probability of detecting a scientifically significant difference that you think exists in the population, given the current amount of data that you've sampled from that population. Now, most people, when they run a power calculation, are usually doing it to determine the sample size for their study. There is, of course, a direct tie between the two: the more samples you have, the greater chance you have of detecting that scientifically significant difference. Of course, there are other factors that tie into that. There's the model that you're using, the response distribution type.   
And there's also, of course, the amount of noise and uncertainty present in the population, but for the most part people use power as a metric to determine sample size. Now, I'd say there are three stages of power calculation, and all of them are addressed in JMP, especially if you have JMP Pro, which is what I will be using here. The first stage covers the simpler modeling situations. We go here under the DOE menu, under Design Diagnostics, where we have the sample size and power calculators. These cover a wide range of very simple scenarios: testing one or two sample means, maybe an ANOVA-type setting with multiple means, proportions, standard deviations. Most of this is what people think of when they think of power calculations. So, of course, you go through and specify the noise, the error rates, any parameters, and the difference I'm trying to detect, and if I'm targeting a certain power I can get the sample size. Or, if I want to explore a bit more, I can leave both empty and get a power curve. Now, again, these are the simpler scenarios. The next stage, I would say, is what can be covered under a more general linear model, so I'll exit out of that. In that case, we can go here under the all-encompassing Custom Design menu. I'll put in my favorite number of effects, click Continue, and leave everything else as is. So we'll make the design. At this point I can do a power analysis based on the anticipated coefficients in the model. In this case, it might say that for this particular design, under 12 runs, I have roughly 80% power to detect this coefficient. If I were trying to detect something a bit smaller, I could change that value and apply the changes; of course, I see I don't have as much power. So if that's really what I'm looking for, I might need to make some changes; maybe I need to go back and increase the run size. 
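Before moving to the complex cases, the basic idea behind every simulation-based power calculation fits in a few lines. This is a generic Monte Carlo sketch, not JMP's calculator: it assumes a two-sample comparison with a known standard deviation (so a z-test is a fair approximation), and all the numbers are illustrative.

```python
import math
import random

random.seed(42)

def mc_power(delta, sd=1.0, n=50, reps=2000, crit=1.96):
    """Estimate power to detect a mean difference `delta` between two
    groups of size n, by repeatedly simulating a two-sample z-test."""
    se = sd * math.sqrt(2.0 / n)      # std. error of the difference in means
    hits = 0
    for _ in range(reps):
        g1 = [random.gauss(0.0, sd) for _ in range(n)]
        g2 = [random.gauss(delta, sd) for _ in range(n)]
        diff = sum(g2) / n - sum(g1) / n
        if abs(diff) / se > crit:     # reject H0: no difference
            hits += 1
    return hits / reps

power = mc_power(delta=0.5)   # theory gives roughly 0.70 for these settings
```

The closed-form calculators do this arithmetic analytically; the value of the simulation version is that the same loop keeps working when you swap in a model no calculator covers, which is exactly the jump the rest of the talk makes.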
So, those are the two most common settings in which we might do a power calculation, but of course life isn't that simple. You might run into more complex settings: you might have mixed-effect factors, or a longitudinal study that you have to compute power for. You might run into settings where your response is no longer a normal random variable: you might have count data, a binary response, or even a bounded 0/1-type response, that is, a percentage-type response. So what can you do if you can't use the simple power calculators, and it's too complex even for the DOE menu to run a power analysis? Well, JMP Pro's here to help, with a tool that we call one-click simulation. The idea here is that we'll simulate data, through a Monte Carlo simulation approach, to try to estimate the power that you can get for your particular settings. And it's pretty straightforward. There might be a little bit of work up front that you need to do, depending on the modeling platform, but once you've got it down, it's pretty straightforward. And I'll go ahead and say that this was something I didn't even know JMP could do until I started working here, so I'm happy to share what I found with you. All right, so we'll start off with a simple extension of the standard linear model where we incorporate some mixed effects. We have a company that's looking to improve their protein yield for cell cultures (not protons, but proteins). We have continuous factors: temperature, time, pH. We also have some mixture factors: water and two growth factors. Now, if we stopped here, we probably would still be able to use the power calculator available in the Custom Design platform. Where we start to deviate is when we introduce some random-effect factors: we have three technicians, Bob, Di, and Stan, who are representative of the entire population of technicians. 
And they will use at least one of three serum lots, which are again a representation of all the serum lots they could use, so we treat them as random effects. We also have a random blocking effect; in this case, the test will be conducted over two days. And so I'll show you how we can use one-click simulate in JMP Pro to compute power for this case. So I'll click to open the design. So this is the design that I've created; let me expand my window here so you can see everything. Now, this might represent what you typically get once you've created the design. Again, at this point, you could have clicked Simulate Response to simulate some of the responses. But even if you didn't, it's still okay. A trick that you can easily use to replicate that is simply to create a new column. We'll go in, we won't bother renaming it at this point, we're just going to create a simple formula. Go here to the left-hand side, click Random, Random Normal, leave everything default, click Apply. Okay. And we've got ourselves some random noise data, some simulated response data. Okay. At this point, I'll right-click, Copy, and right-click, Paste, to get my response column. Now all I need is just some sort of response, so simple random noise will work fine here. We're not trying to analyze any data yet. What we want is to use the Fit Model platform to create a model for us, which we'll then use to create the simulation formula. The way we do that: we'll go under Fit Model. Now, I've done a bit of prep work here, so I've already created the model. And just to show you how I did that, I'll go here under Redo, Relaunch Analysis. So here you see I have my response, protein yield, all my fixed effects, and I've got some random effects. Everything's pretty standard there. Now, you see there's a lot going on here. We don't need to pay attention to any of this; we are just interested in creating a model.
At this point, the way we do that is we go here under the red triangle menu. We'll go under Save Columns. Now we need to be careful which column we select. If I select Prediction Formula, which you might be tempted to do, that's good, but it doesn't get us all the way there, as you'll see. If I go into the formula, this is the mean prediction formula; there's nothing about random effects here. So this isn't the column I want; it's not complete, it doesn't contain everything I need. I need to come back, go under Save Columns again, and scroll down here to Conditional Prediction Formula, and note from the hover help that it includes the random effect estimates, which is the one I want. Now, you might be in a case where you don't really want to compute power for the random effects, you want it just for the mean model, in which case you could have easily gone back to the custom design platform and done it that way. Let's pretend that we're interested in those random effects as well. Now we've saved the conditional prediction formula. Again, we'll go in and look at the formula. And here you can see we have the random effects. Now we need to do some tweaking here to get it into the simulation formula that we want. So I'm going to double-click here. This puts me into the JMP Scripting Language formatting. Now, first I'll make some changes to the main effects, and I'm just going to pick some values. So let's see, let's do 0.5 for temperature, 0.1 for time, and for pH let's do 1.2, a little bit higher. For water, I'm going to go even higher; these might have larger coefficients. So I'll do 85 for water, I'll do 90 for the first growth factor, and let's do 50 for growth factor two. Okay. Alright, so I've made my adjustments to the mean model portion. Now again, these are parameters that you think are scientifically important. Now for the random effects, you might be tempted to replace them with something like this. Okay.
That should be a random effect, so I'll just put a random normal here. And it kind of looks right, but not exactly. And the reason is that this formula is evaluated row by row. What's going to happen is, the first time you come across a technician named Jill, you will simulate a random value here, and you'll get a value for that formula evaluation. But the next time you get to Jill, in row six here, this will simulate a different value, which defeats the purpose of a random effect; a random effect should hold the same value every time Jill appears. Instead, it's going to take on the behavior of something like a random error, which I'll take this opportunity to put here; that is a value that we do want to change every row. So how do we overcome this? Well, I tell you this because I actually ended up doing this the first time I presented; slightly embarrassing. And thankfully, my coworker came along afterward and showed me a trick for how to actually input the random effect appropriately, and here's the trick. We're going to go to the top here and type If Row() equals one. I'm going to create a variable; call it techJill. And now here's where I place it. What this trick does: we'll replace this random normal with techJill. What this will do is, if it's the first row, we simulate a random value and assign it to that variable. After the first row, we don't simulate again, which means techJill keeps the value it was initially given, and it will hold every place we put it. So we will do the same for Bob. As you can see, that will accomplish the task of the random effect; put techBob here. For Stan, things are a little bit easier. We don't have to simulate for him, because random effects should add up to zero in the model. And so the way we do that: we make his the negative of the sum of the other effects.
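The If(Row() == 1, ...) trick is easier to see in a standalone sketch. In this Python analogue (technician names from the example; the variances and the intercept are invented), each technician's effect is drawn once and reused on every row, the residual error is drawn fresh per row, and the last level is forced to make the effects sum to zero:

```python
import random

random.seed(1)

technicians = ["Bob", "Jill", "Stan"]
row_tech = ["Bob", "Jill", "Bob", "Stan", "Jill", "Stan"]  # one entry per run

# Draw one value per technician, fixed for the whole simulated data set --
# the analogue of assigning techJill inside If(Row() == 1, ...) in JSL.
tech_effect = {t: random.gauss(0.0, 2.0) for t in technicians[:-1]}
# Random effects sum to zero, so the last level is minus the sum of the rest.
tech_effect["Stan"] = -sum(tech_effect.values())

intercept = 10.0
# Residual noise, by contrast, is re-drawn on every single row.
y = [intercept + tech_effect[t] + random.gauss(0.0, 1.0) for t in row_tech]
```

Every row for Jill shares the same `tech_effect["Jill"]`, which is exactly what the per-row random normal in the naive formula fails to do.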
Do the same thing here for serum lot one. Now for this one, I'm going to give it a bit more noise; let's say there's a bit more noise in the serum lots. And this is the advantage of this approach: you get to play around with different scenarios. Input those values here. Okay. Caleb King And again, this one is the negative of the sum of the others. And before I add the other one, I'll go ahead and just add it here, since that makes it easy: day one, negative day one. And I'll add its random effect here, and I'll say that its random effect, I can type, is a bit smaller. Alright, well, at this point we should have our complete simulation formula. If I click OK, it takes me back to the Formula Editor view. We should be good to go. Alright, so there's our simulation formula. Now, what do we do next? We'll go back to our Fit Model, and we're going to go to the area where we want to simulate the power. Here, I'm going to go under the Fixed Effect Tests box. I'm going to go here to this column; it's the p-value. In this case, the original noisy response didn't give us any p-values. That's okay; we don't care about that. We just needed it to generate the model, which we then turned into a simulation formula. I'm going to right-click under this column. Now remember, this only works if you have JMP Pro. And here at the very bottom is Simulate. So we click that. And it's going to ask us which column to switch out. By default it selects the response column, and then it's going to go through and find all the simulation formula columns. So we want to switch in this one, because this one contains our simulation model. Tell it how many samples; I'll do 100. I'll give it my favorite random seed. And I click OK. Wait about a second or two. And there we are. So it's generated a table where it simulates the response, fits the model, and reports back the p-values.
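The loop JMP just ran, simulate a response, refit the model, collect the p-values, is easy to sketch outside JMP. Here is a minimal Monte Carlo power estimate for the simplest possible case, a two-group mean comparison, in plain Python; a normal approximation stands in for the exact t-test, and all names and values are illustrative:

```python
import random
from math import sqrt, erf

random.seed(2020)

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def simulated_power(delta, sigma, n, n_sim=500, alpha=0.05):
    """Monte Carlo power: repeatedly simulate data under the assumed
    effect size, run the test, and count how often we reject."""
    rejections = 0
    for _ in range(n_sim):
        a = [random.gauss(0.0, sigma) for _ in range(n)]
        b = [random.gauss(delta, sigma) for _ in range(n)]
        mean_a, mean_b = sum(a) / n, sum(b) / n
        var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
        var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
        z = (mean_b - mean_a) / sqrt((var_a + var_b) / n)
        p_value = 2.0 * (1.0 - norm_cdf(abs(z)))  # approximate two-sided p
        if p_value < alpha:
            rejections += 1
    return rejections / n_sim

power = simulated_power(delta=1.0, sigma=1.0, n=20)
```

The estimated power is just the rejection rate across simulations, which is the same number the Power Analysis table reads off the simulated p-value column.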
Now, there are some cases where there are no p-values; we ended up in a situation much like the one we started with. And that's okay; that happens in simulation, so long as we have a sufficient number to get us an estimate. Now, the nice thing about this is that JMP saw that we were simulating p-values, so it said, I bet you're wanting to do a power analysis, and it's happily provided us a script to do that. So thanks, JMP. We run that, and you'll see it looks a lot like the Distribution platform. It's done a distribution of each of those rows, excuse me, columns, but with an added feature: a new table here that shows the simulated power. And because we simulated it, we can read these off as the estimated power; if it weren't 100, if we ran some other number, then you can look at the rejection rate. So we see here, for our three mixture factors, it looks like we have pretty good power, given everything that we have, to detect those particular coefficients. If we go over here to the other three factors, things don't look as good. So then we'd have to go back and say, okay, maybe we'll go back and see what's the maximum value that I can detect. So I'm going to minimize these, minimize this table. I'll come back to my formula and say, let's do something different here. What if I changed this? So this was 0.5; what if it were higher, about one? For time, let's also make it one. And for pH, I'm going to go to three. So I'm going to bump things up a bit, so, you know, hey, can I detect this? We'll keep everything else the same, because we know we can detect those. Click Apply, okay, and we've generated some new data. Again, same thing: right-click under the column that you want to simulate, click Simulate, we'll switch in the simulation formula, give it a certain number of samples, so stick with 100, same seed, and we'll go. Just have to wait a few seconds for it to finish the simulation. There we are.
And we'll run our power analysis again. These look to be the same here; we didn't change anything there. So, in fact, I'm going to hide these groups; that's a little too much, here we go, let's hide these three. Let's look at these. So we seem to have done better on pH, so a value of one might be the upper range of what we can detect given this sample size. But for temperature and time, it seems we still can't detect even those higher values. So, okay, what else could we change? What if we doubled the number of samples? I mean, we are calculating this for a sample size. So let's go back, and one way we can do that: we can go to DOE, and we can click Augment Design. We'll select all our factors, select our response, click OK. We'll just augment the design, and this time we'll double it; we'll make it 24. So I'll make the design. And it's going to take a little bit of time, so I'm actually going to stop it a bit early. And let's see, we'll make the table. Okay, so now we've doubled the number of runs. And so it only gave us half the responses. That's okay; since we just need a response, I'm going to take this and copy and paste. Of course, in real life you wouldn't want to do that, because hopefully you'd get different responses. But again, we just need a noisy response. Go to the model. Now, this time we've got to fix things a little bit: I'm going to select these three, go here under Attributes, and say they are random effects. Keep everything else the same, click Run. I notice I don't yet have my simulation formula, but rather than have to walk through and rebuild it, I can actually create a new column, go back to the old one, right-click, Copy Column Properties, come back, right-click, Paste Column Properties, and my formula is now ready to go. So let's say, what if we do it under this situation, and we'll keep the values that we initially had? So I'll go back; I'll double-click this to open up the Fit Model window.
Go under the Fixed Effect Tests, right-click under the p-value column, Simulate, and I'm not going to change this, because there was only one simulation formula; we left the one I wanted, and it found the right response. So I'll just change these. Let's see what happens in this case. Alright. Run the power analysis. Now again, I'm not going to worry about these mixture effects, because as you can see, we just got better than what we had originally, which was already good. So I'm going to hide them again, so we can more easily see the ones we're interested in. In this case, pH we knew we were probably going to do better on, because even with the old 12 runs we had pretty good power. It looks like we have definitely improved on temperature and time. So if those represent the upper bound of effect sizes we're interested in, maybe a lower upper bound, then this seems to indicate that doubling the sample size might help. So these illustrate, first of all, how to do the one-click simulate, and then how we can use it to do power calculations. And it encourages you to do something I often did before I came to JMP, which is give people options; explore your options. Doubling the sample size seemed to help with temperature and time. Changing what you're looking for seemed to help with pH, and the mixture effects we seem to be okay on. So explore your options. That can also include going back and changing the variances of your random effect estimates. So, for example, I could come back here, I won't do it, but I could change these values and say, what happens if the technicians were a bit noisier, or the serum lots were less noisy? Try and find situations so that your test plan is more robust to unforeseen settings. Okay, so let me clean up and close these all out. Alright. So for the remainder of the scenarios, I'm going to be exploring different takes on how you can implement this.
So the general approach is the same: you create your design, you simulate a response, you use Fit Model, or in this case a slightly different platform, to generate a model, and then you use that model to create a simulation formula, which you will then use in the one-click simulate approach. So now let's look at a case where we have a company; in this case, let's pretend that they are going to conduct a survey of their employees, and they want to determine which factors influence employee attrition. Maybe they have a lot of employees that are going to be leaving, and so they want to conduct a survey to assess which factors matter, and they want to know how many respondents they should plan for. Now, the response is in years at the company, but there are two little kinks. First, an employee has to have worked at least a month before they leave for it to be considered attrition. And the other is that the responses are given in years, but maybe we're more concerned about months; how many months. Maybe that's how our budgeting software works, or something, and, you know, for employees it might be easier to answer how many years they've been at the company rather than how many years and months. So in this case we have interval censoring, because we're given how many years, but that only tells us that they've been there between that many years and a year later. We also have the situation where, if they leave before a year, it will be censored between a month and a year. So open up the data table. I've set up a lot already. We've got a lot of factors here, and scroll all the way to the end so you can see the responses that we're looking at. So again, we have a years low and a years high. What this means is that if an employee were to respond that they left after six years, their actual time there, in terms of months, is somewhere between six and seven years.
If they left before a year, then we know that they were there sometime between a month and a year. I'm going to click this dialog button here to launch interval censoring. Here we'll use the Generalized Regression platform. We're going to assume a Weibull distribution for the response. We don't put a censoring code here, because we have interval censoring; the way we handle that is we put both response columns into the Y, which you'll see. Okay. And here are all the factors. What you'll see when we click Run is that JMP recognizes this as a time-to-event distribution and says, okay, you gave me two response columns; does that mean you're doing interval censoring? In this case, yes we are. So now we're going to go through the same thing. We're going to find the right red triangle; in this case, it's here next to Weibull maximum likelihood. Now, here's the really nice thing about the Generalized Regression platform. There are already a lot of nice things about it, but here's just some more icing on top. When I click this, if we did like before, we'd have to go in and save the prediction formula, and we'd have to go and make some adjustments, make sure it's a random Weibull that's being simulated, adjust things as needed. Generalized Regression, though, is aware that you can do the one-click simulate, and so it's saying, hey, would you like me to actually save the simulation formula for you, if that's what you're interested in? And yes, we are. So we click Save Simulation Formula. Let's go back to our table. And you'll notice it only simulated one column; I'll talk a bit more about why in a moment. But let's check real quick; we'll go in, and there it is. In fact, I'll double-click to pull up the scripting language, and you'll see it's already got it set up as a random Weibull; it's got the transformation of the model already in there. All you would have to do at this point is change these parameter values to what is scientifically significant to you.
Okay, now for this purpose I won't do that; I'm just going to leave them be. I will make one change, though, because I want to try and replicate the actual situation that we're going to be seeing. Notice here these are all continuous values, when in actuality what we should be getting are nice round whole-year numbers. So the way I can do that: this is years high, so I'm going to create a simple variable, make it equal to the actual continuous time, but tell it to return the ceiling, so round up, essentially. Apply, okay. And there you have it. As you can see, this would tell me that I've simulated years high. Now, to see when you do the one-click simulate: they're all here; I'll open up the effect tests. If I right-click and then click Simulate, I can only enter one column at a time; I can't drag and select more than one. Now, if I were to just replace the years high with the years high simulation, that looks okay. The problem is the years low. Now, this years low is being brought in because it was part of the original model, but it's the years low that you originally used. If we look back, we already see an issue; let me cancel out of this real quick. For example, if we were to do that, it wouldn't be able to fit this first one, because the years high is lower than the years low; the years low is not tied to the simulated response. So how do we fix that? We need to tie it; we need to make that connection. So I'll go to years low. I'm going to click Formula; there's already a formula here, and I'm just going to make a quick change. I'm going to say, if the simulation formula, I double-clicked to add that, is equal to one, so if years high is one, return 1/12. Otherwise, return the simulation value minus one. Now click OK and Apply. As you can see, it's proper now; it's tied to it.
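Taken together, the ceiling adjustment and the tied years-low formula amount to this recipe, sketched here in standalone Python; the Weibull scale and shape are arbitrary placeholders, not values from the demo:

```python
import random
from math import log, ceil

random.seed(7)

def simulate_interval(scale=5.0, shape=1.5):
    """Simulate one Weibull tenure (in years) and report it the way the
    survey would: interval-censored to whole years, with anyone leaving
    inside the first year known only to have stayed 1/12 to 1 year."""
    # Inverse-CDF Weibull draw: t = scale * (-ln U)^(1/shape)
    t = scale * (-log(random.random())) ** (1.0 / shape)
    years_high = max(1, ceil(t))                         # round up to whole years
    years_low = 1.0 / 12.0 if years_high == 1 else years_high - 1
    return years_low, years_high

intervals = [simulate_interval() for _ in range(1000)]
```

Because the low end is derived from the high end, the two columns can never contradict each other, which is the point of tying the years-low formula to the simulated years-high.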
So now I can go back, I can right-click, do the Simulate, and replace the years high with its simulation formula, and be comfortable knowing that when I do, the years low will be appropriate. It will always be one year lower, unless it's already one year, in which case it's 1/12. So it's now tied to it; it'll always be brought in when we do the simulation. I'll run a quick simulation real quick. There we go; it's going a bit slow, so that's a good sign. I'll let it finish out. Alright, so there are our simulations. And of course we can run the power analysis; in this case we've got a lot of factors. For a lot of them, we have overkill. But surprisingly, for some of them we still have issues. And so that might be something worth investigating; maybe we can't detect that low a coefficient. We might have to change something about those factors; things to discuss in your planning meeting. So that's how you need to work things when you have, in this case, interval censoring. If you had right censoring, so you had a censoring column, same thing: it would output a simulation of the actual time. You can make some adjustments to that, to ensure that it matches the type of time you're seeing in your response, or what you expect, and then you'll have to tie your censoring column to the simulation. And this is going to happen whenever you have that type of setting. Okay. Let's clear all this out. So let's look at one other one. What happens if we have a non-normal response? We've already seen one, a reliability-type response, so we know we can use Generalized Regression. Let's explore another one real quick. In this case, we have a non-normal response in a test. The system is going to be a weapons platform, and the response is a percentage. Now, technically, you could model this as a normal distribution.
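Since a percentage is bounded, placeholder responses need to land in (0, 1), and pushing random normals through a logistic transform is a simple way to get them there. A standalone Python sketch; the 2.0 offset is an arbitrary choice of mine that skews the fake accuracies toward the high end:

```python
import random
from math import exp

random.seed(3)

def logistic(x):
    """Map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

# Placeholder percentage responses: random normals pushed through the
# logistic transform, so every value is a legal proportion.
y = [logistic(2.0 + random.gauss(0.0, 1.0)) for _ in range(500)]
mean_y = sum(y) / len(y)
```

Any strictly increasing map from the reals onto (0, 1) would do; the logistic is convenient because it is also the inverse link a beta regression typically uses.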
And that might be fine, so long as you expect values around the 50-percent point. But because we want this to be a very accurate weapons platform, we'd hope to see responses closer to 100%. And so maybe something like a beta-distributed response might be more appropriate. We do have one other wrinkle: we have these three factors of interest, but one of them, the target, is nested within fuse type. So the target factor will depend on the fuse type. In this case, we'll run this real quick. Again, we've created our data. In this case I simulated some random data, and I did it so that it lands between zero and one. I did that simply by taking the logistic transformation of a random normal. OK. Caleb King I will copy, paste, and make sure I can paste. And again, walk through it. Pretty simple. We're going to use the beta response. We have our response; we have our target nested within fuse type. Click Run. And again, red triangle menu, Save Columns, Save Simulation Formula. This is something you can do in Generalized Regression; the regular Fit Model unfortunately cannot do that. But we have our simulation formula. I'm not going to make any changes, but you could; you could go in. As you can see, the structure, double-click, is already there, even the logistic transformation. So you've just got to put in your model parameters. Excuse me. Caleb King Okay. And again, we'll go down. And that's how you do that. So we go down, Effect Tests, right-click, Simulate, make the substitution, and go. Alright, so see how easy it is, in general? So even if you have non-normal responses, you're good to go, thanks to Generalized Regression. Okay. Now, what if you have longitudinal data? This can be tricky, simply because now the responses might be correlated with one another. So how can we incorporate that? Well, it's straightforward. In this case, we have an example of a company that's producing a treatment for reducing cholesterol.
Let's say it's treatment A. We're going to run a study to compare it to a competitor, treatment B, and for the sake of completeness we'll have a control and a placebo group. We'll have five subjects per group. The longitudinal aspect is that measurements are taken in the morning and afternoon, once a month, for three months. Now, I'm not going to spend too much time on this, because I just want to show you how you incorporate the longitudinal aspect. So in this case I've already created a model and created the simulation formula, so you can use it as a reference for how you might do this. Let's say we have an AR(1) model. I'll run this real quick, just to show you. So there are all the fixed effects; notice here we've got a lot of interactions. Keep that in mind as I show you the formula; it might look a bit messy. I've stated that we have a repeated structure, so I've selected AR(1), period by days within subject, under the Mixed Model platform. And so how do I incorporate that AR(1) into my simulation formula? I did it like this. If it's the first row or a new patient, that's what this means, the current patient does not equal the previous patient: this is the model that I saved. I changed the parameter values to something that might be of interest. It did take a bit of work, because there's a lot going on here; there are a lot of interactions happening. We've got some random noise at the end. But that's all I did. So I changed some values here; I made a lot of them zeros, just to make things easy. If it's not the first row, or if it's not a new patient, how do we incorporate correlation? All I do is copy that model up to here and add this term: just some value, which I believe has to be less than one in absolute value, times the previous entry. If it were autoregressive of order two, then you would add something like a lag-two term on the simulation formula, and you'd have to make another adjustment where, if it's the first row, we have our model.
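That AR(1) carry-over, each residual equals rho times the previous residual plus fresh noise, restarting with every new patient, can be sketched standalone in Python; rho, sigma, the flat mean of 180, and the group sizes are invented values:

```python
import random

random.seed(11)

def simulate_subject(n_times, mean, rho=0.6, sigma=1.0):
    """AR(1) errors within one subject: each new residual is rho times
    the previous residual plus fresh noise, so measurements close in
    time are more alike. A new subject restarts the recursion."""
    err = random.gauss(0.0, sigma)
    y = [mean + err]
    for _ in range(n_times - 1):
        err = rho * err + random.gauss(0.0, sigma)
        y.append(mean + err)
    return y

# 20 independent subjects, 6 correlated measurements each
data = {pid: simulate_subject(6, mean=180.0) for pid in range(1, 21)}
```

The "restart on a new subject" branch is exactly the role of the current-patient-not-equal-to-previous-patient condition in the JSL formula.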
If it's the second row, or we're two places into a new patient, it might look like an AR(1), and if it's anything else, we go back to the full form. So as you can see, it's very easy to incorporate autocorrelation structures; as long as you know what your model looks like, it should be easy to implement it as a simulation formula. Okay. Caleb King I'll let you look at that real quick. Finally, our final scenario is a pass/fail response, which is also very common. I'm going to use this to illustrate how you can use one-click simulate to maybe change people's minds about how they run certain types of designs, and show you how powerful this can be. Pun not intended. Let's say we have a detection system that we're creating to detect radioactive substances, and we're going to compare it to another system that's maybe already out there in the field. So we're going to compare these two detection systems. We've selected a certain amount of material in some test objects, ranging from very low concentrations at one, to a concentration of five, very high, and we're going to test our systems repeatedly on each concentration a certain number of times and see how many times each successfully alarms. I'm going to open these both. Let's start with this one. This represents a typical design you might see: we have a balanced number of samples at each setting. In this case, we have a lot of samples; they're very fortunate at this place. So let's say we're going to do 32 balanced trials at each run, and this is a simulated response, okay. And then here I've created my simulation formula. So I'll show you what that looks like. Again, a random binomial. The trial counts are all the same, so I've kept the number here, but I could have referenced the trials column to keep it consistent; but that's okay. Here's my model that maybe I'm interested in. Okay. And here,
I have a scenario where, instead of a balanced number at each setting, I have put most of my samples here at the middle. My reasoning might be that if it's a low concentration, I hardly expect it to catch it; I have reasonable expectations. And if it's a high concentration, well, it should almost always catch it. So where the difference is most important to me is there in the middle, maybe at the three or four concentrations. And so that's where I'm going to load most of my samples, and then I'll put a few more here, and the fewest at these other settings. Let's see how each of these test plans performs in terms of power. So run the binomial model script here, which will run the binomial model. There's only one model effect here, the system. We don't put in concentration, because we know there's an effect there; the system is what we're interested in. Generalized Regression, binomial. Run it, okay; again, red triangle menu. I've already got my simulation formula, so actually I don't need to do that step; we've already built up a pattern. Right-click, do the Simulate; okay, everything looks good there. My next favorite random seed. Here we are: power analysis. Okay, now let's go over here and do the same thing. I'll fit the model, and again, when you have a binomial, you have to put in not only how many times it alarmed, but out of how many trials. Run, scroll down to the effect tests, go down; the p-value already gives a hint of what's going to happen. Click OK. Here are my simulations; I'll get my power analyses scooted over here, minimize, minimize. So here's what you get under the balanced design. Notice that we have very low power, which seems odd, because we had 32 at each run. I mean, that's a lot of samples; I would have killed for that many samples where I previously worked. So you would expect a lot of power, but there doesn't seem to be, whereas here, I had the same total number of samples; I just allocated them differently.
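That balanced-versus-middle-loaded contrast can be reproduced in a small standalone simulation. This Python sketch is not the JMP analysis: the logistic detection curve, the system shift, the specific allocations, and the pooled two-proportion z-test are all simplified stand-ins, but it shows how allocation alone changes estimated power for the same 160 trials per system:

```python
import random
from math import sqrt, exp, erf

random.seed(5)

def p_alarm(conc, shift):
    """Hypothetical detection curve: logistic in concentration;
    `shift` is the assumed advantage of the second system."""
    return 1.0 / (1.0 + exp(-(conc - 3.0 + shift)))

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def mc_power(allocation, shift=0.5, n_sim=400, alpha=0.05):
    """Monte Carlo power of a pooled two-proportion z-test comparing
    two systems under a given trials-per-concentration allocation."""
    n = sum(allocation.values())
    rejections = 0
    for _ in range(n_sim):
        a = sum(random.random() < p_alarm(c, 0.0)
                for c, m in allocation.items() for _ in range(m))
        b = sum(random.random() < p_alarm(c, shift)
                for c, m in allocation.items() for _ in range(m))
        pooled = (a + b) / (2.0 * n)
        se = sqrt(2.0 * pooled * (1.0 - pooled) / n)
        z = abs(b - a) / n / se if se > 0 else 0.0
        if 2.0 * (1.0 - norm_cdf(z)) < alpha:
            rejections += 1
    return rejections / n_sim

balanced = {1: 32, 2: 32, 3: 32, 4: 32, 5: 32}  # 32 trials at every concentration
weighted = {1: 8, 2: 24, 3: 48, 4: 48, 5: 32}   # same 160 trials, loaded toward the middle
power_balanced = mc_power(balanced)
power_weighted = mc_power(weighted)
```

Under these made-up curves the middle-loaded allocation tends to come out ahead, echoing the point of the demo, though the size of the gap depends entirely on the assumed detection curve.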
And my power level has gone up dramatically. Maybe if I stacked even more here, maybe if I did four and four and then added those to each of these, I could get even more power to detect this difference. So this shows that it's not always just about changing your sample size; you might not always need more samples. In this case you had a lot of samples to begin with; how you allocate them is also important. Okay. So, I hope you're as excited as I was when I discovered this very awesome tool for calculating power. I'd like to leave you with some key takeaways. So again, we use simulation. Now, ideally, you know, we'd like a formula, and in the simple cases we do get the advantage of a nice simple formula. Even with the regression models, we kind of have formulas helping under the hood. But of course, in the real world, things are a little more complex, and so we typically have to rely on simulation, which can be a very powerful tool, as we've seen. Now, of course, one of the key things we have to do with simulation is balance accuracy with efficiency. I usually ran 100 simulations, mainly to save on time. But ultimately, you might stick with the default of 2500, knowing that it will take some time to run. So what I might advocate is, maybe start with 100 or 200 simulations at the beginning, just to get an idea of what's going on, and then if you find a situation where it looks like it's worth more investigation, bump up the number of simulations so you can increase your accuracy. OK, so maybe you start with a couple of different situations, run a few quick simulations, and then narrow down to some key scenarios, and then you can increase the number of simulations to get more accuracy. I always argue that power calculation, just like design of experiments, is never one and done. You shouldn't just go to a calculator, plug in some numbers, and come back with a sample size.
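The accuracy-versus-efficiency tradeoff, 100 quick simulations versus the default 2500, comes down to the binomial standard error of a simulated power estimate:

```python
from math import sqrt

def mc_se(power, n_sim):
    """Standard error of a simulated power estimate: the rejection
    rate is a binomial proportion over n_sim independent runs."""
    return sqrt(power * (1.0 - power) / n_sim)

# Near 80% power, 100 runs pin the estimate down only to about
# +/- 0.08 (two standard errors); 2500 runs tighten that to about +/- 0.016.
rough = mc_se(0.80, 100)    # 0.04
fine = mc_se(0.80, 2500)    # 0.008
```

That is why a few hundred runs are plenty for screening scenarios, and the larger default is worth the wait only once you have narrowed down to the scenarios you actually care about.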
There's a lot that can happen in the design, or that can happen in an experiment, and I think that the best way to plan an experiment is to try and account for different scenarios. So explore different levels of noise in your response; with the mixed effects, play around with different mixed effect sizes. Of course you can explore different sample sizes, but also explore maybe different types of models. So, for example, in the interval censoring case we used the Weibull model; what if we had done a lognormal model? Explore these different scenarios, and presenting them to the test planners gives you a way to plan your study to be robust to a variety of settings. So never just go calculate and come back; always present the test planners with different scenarios. It's the same process I used when I created actual designed experiments. I would present the test planners I worked with different options they could explore; maybe they'd pick an option, or it might be a combination of options. You should always do that to make your plans more robust. Alright. Well, I hope you learned something new with this. If you have any questions, you can reach out to me; they'll probably be providing my email address. So I hope you enjoyed this talk, and I hope you enjoy the rest of the conference. Thank you.
John Powell, Principal Software Developer, SAS/JMP Division   Novel ideas often come from combining approaches used in totally different industries. The JMP Discovery Summit and the JMP User Community provide excellent ways to cross pollinate ideas from one industry to another. This talk originated from a marine biologist’s request on the JMP User Community to display many variables in a tight space. Techniques used in video games provide possible solutions. I’ll demonstrate how JMP’s customizable visualization capabilities rise to the occasion to make these potential solutions a reality. Another use of video game technology is the 3D scatterplot capability available in JMP. This approach requires powerful graphics capability commonly available on modern desktop computers. But what if you need to share these scatterplots on the web? The range of graphics power on devices people use to view web content varies greatly. So, we need to use techniques that work even on less powerful devices. Once again, the games industry offers a solution — particle systems! I’ll cover particle system basics and how to export 3D data to a 3D Scatterplot application I built using particle systems on the web.     Auto-generated transcript...   Speaker Transcript jopowe Hi, welcome to my talk. The motivation for this talk was based on a discussion on the JMP User Community. Dr. Anderson Mayfield is a marine biologist at NOAA and University of Miami. He studies health of reef corals. He presented at our special Earth Day episode of JMP On Air. And he's also presenting at this conference. So in this JMP User Community post, he posted this picture here. And if you could see, it has all these graphics that are representing corals or coral reefs and they're representing multiple pieces of data, all in one little graphic, which is kind of like a video game when you have player stats and they're hovering around the player as they run around the scene. 
So now you probably get an idea why I called my talk, you know, What Do You Get When You Combine a Video Game, a Marine Biologist and JMP? So let's move on. What I'm going to talk about is these game-based, or game-inspired, solutions that are possible in JMP, including using custom graphics scripts and custom maps. And then I'm going to talk about a 3D web-based scatterplot application that I built and how I integrated JMP data sets into that application. Here's an example of the graphics script. And here's an example of using custom maps. And here's my 3D scatterplot application. So let's get started. To see how custom graphics scripts are used, I've got a little sample here. And when I open up JMP and run this little graphic... and actually, I'll talk about this. I use this fictitious data set I created from scratch. And it's basically about a race of fish going all the way down from the Great Barrier Reef down to Sydney. So to get to a graphics script, what you can do is go to Customize. And then this little plus button lets you add a new script. It starts off empty, but we've got a few samples in JMP. So I like the polygon script. It's very simple. All it does is draw a triangle. So if I apply that to my graph, you can see a triangle in there. And all it took was three lines: setting the transparency, setting a fill color, and then drawing the polygon. But what we're after is, how do you actually embed this in a script? So the easiest way to see that is you hit the answer button and you save the script to the script window. And here's the script. It starts off with my little bivariate platform being launched with latitude and longitude. And down below, there's a dispatch that has an add graphics script. And then the three lines of the sample. So there's a real simple example of using a graphics script. Now this is a graphics script I'm going to embed. It draws these bars floating over each point.
And what I do first in this application is set the transparency, and then I draw the background of my little graphic, and then three bars: one for hunger, one for strength, and one for speed. Now I'm going to show you the whole script, because it relies on other things. It relies on these constants that I set up, mostly for doing colors and a few things like the background. This is my function, draw bar. It takes a position, a length and a color. So that can be used for each bar. So if I run that script, there it is: you've got the graphic with the little kind of game-inspired health monitors hovering above each fish as they swim down and race down the East Australian Current to Sydney. This is going to give you an idea, basically a step by step, of how that function works. We start with just a marker on screen, and for each marker, we set the pixel origin. And then we start to draw the background, but that takes a few steps. The first time we set the line thickness, you don't actually see anything. Then we set the color, which is going to be black. And then we do a move to and a line to that draws the background as a very thick line. And for each bar, we're going to set the color of the bar, move to the position and then do a line to. Then we draw the next background, set its color, and continue with that bar. And of course we do that for the other two bars as well. Now you can use many different graphics functions. And here's the graphics functions section of the JSL syntax reference, and you can build whatever you want. So go at it. Next, I'm going to talk about using custom maps. Excuse me. And to use custom maps, basically you provide two tables. The first table is a definition of all the shapes you want to use. And the second one names the shapes in the table. And for this name column, basically what you need to add is a map role column property, and it needs to be of the type shape name definition. I'll show you that in a minute.
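To make the move-to/line-to geometry concrete, here is a rough sketch in Python of the coordinate arithmetic only; the actual talk uses JSL drawing commands, and `bar_segments`, its arguments, and the layout constants are all hypothetical. It computes the background and bar endpoints for a stack of status bars above one marker, where a value of 1.0 fills the full bar width:

```python
def bar_segments(origin_x, origin_y, stats, max_len=30, spacing=4):
    """Compute (background, bar) line segments for a stack of status bars
    above a marker, mimicking a Move To / Line To graphic script.
    `stats` maps a stat name to a value scaled into [0, 1]."""
    segments = []
    for i, (name, value) in enumerate(stats.items()):
        y = origin_y - (i + 1) * spacing  # stack upward (pixel y grows downward)
        background = ((origin_x, y), (origin_x + max_len, y))        # full-width black line
        bar = ((origin_x, y), (origin_x + max_len * value, y))       # colored portion
        segments.append((name, background, bar))
    return segments

segs = bar_segments(100, 200, {"hunger": 0.5, "strength": 0.8, "speed": 0.25})
for name, bg, bar in segs:
    print(name, bg, bar)
```

Drawing the background first and the colored bar second, as the script above does, is what produces the "health monitor" look: the unfilled remainder of each bar stays black.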
So just to do a real simple example, of course, what we need is Big Class. And I'm going to have a real simple map for it as well that just has two rectangles. Let me get those files open. So it's a very tiny map, as I said. And just to show you these coordinates, I've got a little script here that basically shows the points. If I highlight these top four, that's the top rectangle, and the next four is the bottom rectangle. The top one is weight, the bottom one is height. And if you look at the column property, it's map role, shape name definition. So once we have this map ready, we can open up Big Class. We can't use it directly. What we need to do first is take the height and weight columns and stack them. So we go to Tables and stack them. And now I've got a table that has these columns stacked. You see that the label has height and weight alternating, and what we need to do in order to hook that up to the map file is add another column property. This one is all the way down here, map role again. But we won't say shape name definition, we'll say shape name use, and we need to choose the map file, the name map file, and also set the column to name. And then we're done. So now that we have that, I can actually use that in a graph. The graph we're going after basically uses the label column for the map shape, which sets up the top and bottom shapes, and then we can drag that data column that has all the data into the color role. And there we have it. This is a summary of all of the students in Big Class, but we really want to get one per student. If I drag that to the wrap role, it's a little tight, but I'm going to spread it out by doing a levels per row and set that to 10. There we go. And that's pretty much what I was going after. And this can be a useful graph on its own. So let's do something a little bit more complicated now. Well, actually, if you're doing something more complicated, you might want to use the Custom Map Creator add-in.
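The stacking step used above reshapes each row's height/weight pair into two label/data rows. As a rough, hypothetical illustration of the same reshaping in Python (not JMP's implementation; column names are made up):

```python
def stack(rows, columns, label_name="Label", data_name="Data"):
    """Mimic a Tables > Stack operation: one output row per (input row,
    stacked column), with the column name in a label column and its
    value in a data column; other columns are carried along."""
    stacked = []
    for row in rows:
        for col in columns:
            out = {k: v for k, v in row.items() if k not in columns}
            out[label_name] = col
            out[data_name] = row[col]
            stacked.append(out)
    return stacked

rows = [{"name": "KATIE", "height": 59, "weight": 95},
        {"name": "LOUISE", "height": 61, "weight": 123}]
print(stack(rows, ["height", "weight"]))
```

The resulting label column is what gets the shape-name-use map role, so each stacked row can point at its rectangle in the map file.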
And this is a great add-in. It allows you to drag in an image and then trace over it. And when you're finished tracing over all the shapes, you click on Finish with random data, and it will generate the two shape files you need, as well as a random data set that allows you to test your custom map. And here's one where I just did four variables and one in the center. So it's basically a square version of what you saw in the original slide that I had. Now that wasn't exactly what we were shooting for. We wanted something round, and I believe it was Dr. Mayfield that called it a complex pie. I don't know if that's an official term, but I decided I was going to build these things with script. So what I wanted to do is build shape files and make sure that I got what the doctor ordered. And that was four shapes with something in the center. And then I thought, well, I'm a programmer, I like to do a little bit more and make it a little more flexible. So I thought maybe some people would want to do the same thing but only have two variables plus an average or center variable, or more. And I thought maybe it would be nice to also be able to do different shapes, like a diamond or square. So, I'm no genius. I got lucky. And Xan Gregg actually answered a post that was, how can I make this polar plot in JMP? And here it comes. There we go. They were looking for a shape like this, which looks an awful lot like what I needed. What Xan did was really great. It's a flexible script that does this and generates these wedges around a circle, but the only things missing were the center and also naming the files in the particular way that I wanted. But it wasn't too difficult. And that is what my script does. And when you run complex pie, there's the recipe for this pie: you open the complex pie maker. Then you add some ingredients. First of all, you need to say the number of shapes, and then whether you're going to center it, either at the top or off to the side a little bit.
Then there's a variable for smoothness. And then you would also want to supply the inner radius, whether you want more filling or less, and the outer radius. Of course, the next step is to run the script, and it will generate these shape files for you and also do an example test, just like the Custom Map Creator add-in. So here's an example of a complex pie where the number of shapes is five and the smoothness is set to seven. You could use five or six and that would probably still be okay if they're going to be drawn really small. The size doesn't really matter; it's more the relative inner radius and outer radius that matter. And this is approximately what Dr. Mayfield was going for, so I stuck with four for the inner radius and nine for the outer radius. So let's see how we can actually use these things. Just like we did with the Big Class demo, we're going to have to...we're gonna have to stack the columns. But one thing different is that I don't really have the strength and health variables in my shape file. In the shape file, they're named wedge 1, 2, 3 and center. So I'm going to need to build a file that will link the two together. So first, I've got this stacked small school example, which I did by stacking those health, strength, speed columns together. And here's the shape file that...the name shape file that I'm going to use. Notice that it has a column property of map role, shape name definition. And that's required when you make your own custom shapes. So the linked file, that maps columns to shapes, I built basically by listing the labels within my stacked file. And then in another column, shape name, I listed the shape names. Now, it's important that the shape name here have a property that is map role, but this one is shape name use. And this data label column will have a link ID so it can be virtually joined back to my stacked table.
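The wedge generation can be approximated with basic trigonometry. The following is a hypothetical Python sketch, not Xan Gregg's JSL or the complex pie maker itself; it builds one annular wedge polygon from the number of shapes, the inner and outer radii, and a smoothness value (the number of points used along each arc):

```python
import math

def wedge(index, n_shapes, inner_r, outer_r, smoothness=7, rotate=-90):
    """Approximate one annular wedge of a 'complex pie' as a closed polygon.
    The circle is split into n_shapes equal wedges; each wedge is the outer
    arc followed by the inner arc traversed backward."""
    start = rotate + 360.0 * index / n_shapes
    end = rotate + 360.0 * (index + 1) / n_shapes
    angles = [math.radians(start + (end - start) * i / (smoothness - 1))
              for i in range(smoothness)]
    outer = [(outer_r * math.cos(a), outer_r * math.sin(a)) for a in angles]
    inner = [(inner_r * math.cos(a), inner_r * math.sin(a)) for a in reversed(angles)]
    return outer + inner  # outer arc out, inner arc back, closing the wedge

# Four wedges with inner radius 4 and outer radius 9, as in the demo.
shapes = [wedge(i, 4, 4, 9) for i in range(4)]
print(len(shapes), len(shapes[0]))
```

As the speaker notes, the absolute size is irrelevant; only the ratio of inner to outer radius changes the look, and a smoothness of 5-7 points per arc is plenty when the shapes are drawn small.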
So now that I have these tables all set up, the next thing to do is actually build a graph. Now, it'll be similar to what I did before. The one difference is that instead of just dragging label down to map shape, now I use this virtually joined column, drag that down into map shape, and there's the shape I want. What I need to do next is add the data to the color column. And then add name to wrap. And there are all my fish, with a graphic for each one. To try to get the right gradient, I go to gradient here, and that is this one right here. I want to make sure that the green is good, so I'm going to reverse the colors. And there we have it. This is a useful map on its own, because you can look at each fish and see how they're doing. But I want to do a little bit more with that, of course. What I want to do is be able to put those images into a tooltip. In order to do that, we're going to do Make Into Data Table. Bring back the file that I had and that graphic. So what I'll do here is, under the red triangle menu there is a Make Into Data Table. And what this will do is produce a new data table with all these images, and that's really useful, especially if we link it back. And I would take this and set it as a link ID so that I can point to it from my simple small school file, so that for each marker I'll be able to find the graph for each character. All right, so I've got this actually stored in another file and I'm going to open that one. I called it health images. So we don't need this anymore. And we don't need the stacked file anymore. But what we do need is the small school example that I started with. Now the first thing you need to do in order to get these to show up in the graph is to set that image column, that virtual column, as labeled. You can do this in a simple graph as well. So we already have an example...well, I'm going to open this again. Why not? My race script, just to show you that I'm starting from scratch.
But since I've had this column labeled, now when I hover over one of these graphics, you get the three pieces of information for this graphic, and you can do that and pin any number of these characters. So that's one way you can use these graphics. Another way is to use Use for Marker. And the graphics that you're going to get will be floating over the points and be used instead of the diamond shapes that I had before. I'll bring back those health images, small school, and I'll even bring back that graphic. So right now, we've got those diamonds. In order to turn them into these shapes over here, we have to add yet another badge to this virtual join column (it's really getting popular): Use for Marker. And so there's a new badge that shows up. And behind me, you'll see that these images were put in the scene. One last detail is I don't really like this white around the image. And that's actually built into the graph image. And one way I can take care of that is, I've got a script that will find that white area and set the alpha channel so that it will make them transparent. So now we have graphs with a nice round shape, not the square background. One other thing you might want to do is increase the marker size. And we can go up to 11. How does that look? That pretty much looks like what I had; 10 probably would have been better. But I like going to 11. Okay, so that's Use for Marker. Now we want to make a little more complicated graph. And this is what Dr. Anderson Mayfield was going for, a heat map behind the markers. And that's pretty complicated, but JMP can handle that. That's actually a contour. So one thing I have here is another version of small school. But this time, I've got ocean temperatures down at the bottom, and they're hidden so that they won't show up as actual points. So I've already built images for this one. The same images, actually, as I built before. And I'll start with the end here so we can actually see what I'm going for.
So I basically want to have these graphics hovering over the contour and a couple of legends here that show what things are. That shouldn't be too hard. I'll start part of the way, with a map already placed in, and lat longs already on the x and y. So the first thing I want to do is add in a contour. And we need to give that a little bit of color, so I'm dragging the temp column over the color role, and I'm going to adjust the number of levels on this contour, change the alpha a little bit and add some smoothness. The next thing I want to do is make it a little transparent. Let's put a .6 on that. That's about the same as what I had before. And now to add the markers, there are two ways you can go about doing that. You can shift-click on the points or you can just drag that in. If you drag it in, then you can set the marker size again, doing that by going to here and doing other, and let's use 10 this time. I think 11 was just a little too big. All right, so that's looking good. That's almost there. We want to work on this legend. How do we get these colors in there? Well, we already have the health role. And that you can drag right over here to the corner, to just add a second color role. It kind of messes up your graph initially. We can take care of that. The problem is that the contour doesn't really need to use the health role. So if we disable that, then we're back to what we want. So the next thing you want to do is add the color to the actual legend here. And we can just customize that again like we did before. We're going to do a little bit more this time. I want to have four labels. I want to have it start at five and go up to 20. Oh, I wanted to reverse the colors as well. There we have it; it looks pretty much like that, so almost done. I just want to pretty it up a little bit more. There are a couple of things in the legend I would like to fix up. So go to legend settings and take away these things that don't seem to be adding too much, and move the health legend up to the top.
There we have it. So I think I'm matching what I was going for, as well as I can. So I can get rid of that one. I don't believe I'll need this anymore. But I do want to add a little bit more. How about adding this legend, because there wasn't really a way to know which area of this graphic did what, you know, hunger, speed, strength and health. Luckily we have that mapping file. It goes from column name to shape. And I'm just going to open that up for a second. So you can see here that I've labeled all the rows, and I labeled the data label column as well. That means that when you create a graph, it will all be labeled. So this one's really simple. You just take the shape name and drag it into the map role, and we're really done with that graphic. So if I want to get that onto my scene, the best way to do that is to just select it, hit the plus sign and then copy it. Then you open up some image editing application and save it to disk. And of course, I've done that already. So I'm going to show, once you have that, and I need to bring back my graphic again, now I can just drag this in. And there it is. Oh, it's not there. Well, so there's a way to find it. Go to customize; it dropped it in the background, because normally when you drag in an image, you're dragging in a background image. So all we need to do now...let's move this out of the way so you can see what's happening...is move it to the front. Here it comes, and there we go. It's at the front. And I'd like to actually add a little bit of transparency on this, because it's a little too bright for my liking. So let's put .8. And now we just have to drag it into the corner and we're pretty much done with that. Okay, so that was a lot and it involved a lot of files. So let's summarize all the files that were used. I used my complex pie maker to generate the two map files that are needed, the shape files. I took small school and I stacked it.
I created this column to shape mapping file that needed to point to the name file with the map role. I used a link reference to do a virtual join back to small school stacked. Then I made a graphic that I wanted to make into a data table and saved that to my health images.jmp. Of course, that had to be virtually joined back to the original small school so that I could make the graphic with that. And then I did the same thing with small school ocean temps, with my link reference to health images. I took my column to shape file and drew the legend for the graphic that I tried to do with the ocean temperatures, and I just used that by dragging it in, basically. So that's it for the 2D demonstrations I wanted to do. Let's take a little breather here, and now we're going to go into the world of 3D, and even the web application. Okay. So this is my web 3D scatterplot application. And what I'm going to show you is how I got data from JMP into it. My application was based on JMP's Scatterplot 3D. So this is just to remind you how that looks. And now I've got a little demo of that. So basically, I'm going to show you some of the features. It starts off with the ability to select different data sets. Here's the diamonds data set. And of course it's drawing points. That's the whole idea. And they're drawn with this little ball shape, which I'll explain how that's done in a minute. But the other things you can do is rotate it, just like in JMP; you can zoom, move it back and forth. And then it has hover labels as well. You can see that in the bottom right corner. I draw axes and a grid and add a little bit of customization. Not too much. You can change the background, set the color of the walls. Let's see if we can give it a little blue color. Yeah, maybe that's not to your liking. But this is just a demo. And then you can change the size of the markers. That's up on the web.
Okay, so when I started this application, the first thing I wanted to know is, could it perform on the web? Would it work well on my iPhone or my iPad? Because those machines are not as powerful as my desktop. So I created three built-in random data sets of increasing numbers of points. I went and used... 25,000 was what I thought was worth trying on lower power devices, but you can go a lot higher actually with high powered devices. Then I thought, well, I should try to bring some real data in, and I found the famous iris data set, and it was in CSV, comma-separated values, format. And I brought it in. But I had to write some code just to convert it over to my internal structures. And I thought, well, JMP has a lot of data sets. I'd rather just bring those in. So I brought in cars, diamonds and cytometry. The only difference was cytometry; it didn't really have a good category for using color, and I actually had to change my application to accept a graph that didn't have any color. So the application is pretty simple. It's a one-page web app. So there's one HTML file. I've got a couple of JavaScript files, and I use a few third-party libraries and one font and one texture, and the texture is the ball that is used for the shapes. So the technique that I use from the games industry is particles, or particle systems. They're used very commonly for simulation games and 3D visualization. You can do some cool effects like fire, smoke, fireworks, and clouds. And I had spent some time working on a commercial game engine in the past, and my responsibility was actually to take the particle system and improve it. So I did have some experience with this already. Um, that was in the C++ world, not JavaScript. But since I've worked on interactive HTML5 in JMP for quite a while, I thought it was time to see if I could take two of my passions, marry them, and come up with a web-based version using particle systems. I got lucky. I found this library, Three.js.
It does hardware-accelerated 3D drawing and it has excellent documentation, and there are many code examples including particle systems, so that made it quite easy to build my application. And actually, the difficult part for me was figuring out how to get data into it from JMP. But I'll share the script for that. The one thing I did do is make sure that I made it easy to use these objects in my application. So one thing that I did was I just created an array of JavaScript objects, and for the numeric columns I actually add in a minimum and a maximum, so I don't have to calculate that in JavaScript. And of course, JMP is really good at calculating that kind of stuff. Um, for the category, for the color column, what I needed to do is use like a category dictionary. And basically, all that is, is the names of the categories in one list, and then you just point to those values by a zero-based index. So that's how that structure works. I thought it would be nice to do a user interface. And actually that was an easy thing to do. And I'll show you that in a second. But basically all I needed to do was limit it to numerical columns for the x, y, and z, and then limit it to a categorical column for the color role. So let's have a look at that code. All right, so I give an example of a very simple data set first of all. And then the dialog really was easy to do. I used Column Dialog and I specified what I wanted for the numeric columns and what I wanted for the color column, and made sure that the modeling type was categorical, so it's ordinal or nominal. Next, I take the columns' data and I build up the JavaScript objects that I need. Here are the min and max strings. And this is just building a string of these objects. And then if there's a color column, I'll do the same. It's a little more complicated because I need to get those character strings out and stuff them into the JavaScript object. And then finally, all it needs to do is save this to a text file.
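The exported structure described here, numeric columns with precomputed min/max plus a category dictionary with zero-based indices for the color column, can be mimicked in a few lines. This is a hypothetical Python sketch of that structure (the talk's actual export is written in JSL, and the function and key names here are made up):

```python
import json

def export_columns(numeric, categorical):
    """Build the JavaScript-friendly structure described in the talk:
    numeric columns carry precomputed min/max so the web app never has
    to scan the data; a categorical column is encoded as a list of
    category names plus zero-based indices into that list."""
    out = {}
    for name, values in numeric.items():
        out[name] = {"values": values, "min": min(values), "max": max(values)}
    for name, values in categorical.items():
        cats = sorted(set(values))
        out[name] = {"categories": cats,
                     "indices": [cats.index(v) for v in values]}
    return out

data = export_columns({"height": [59, 61, 55], "weight": [95, 123, 74]},
                      {"sex": ["F", "M", "F"]})
print(json.dumps(data))
```

Serializing the result as text is the equivalent of the JSL script's final save-to-text-file step; the web app can then consume it directly without any per-column computation.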
So let's give this a shot. Of course, I'm going to use Big Class, so it's a nice small file. And so I run the script. And it asks me what I want for my X, Y, Z. I've already got height and weight selected, and age is a numeric column, so I'll use that as well, and sex is a good one for color. And then we're done. And actually, the output here is telling me where the file is. I happen to have that on the next slide. So let's go to that. So I bet you always wondered what Big Class would look like if it was in JavaScript. And this is it. And so you can see the three numerical columns with their min and max, and then the color column has this dictionary of F and M for the categories, and a zero means female and a one means male. Well, that's my exciting finish to this talk. I hope you enjoyed it. So thanks for watching, and if there are any questions, please ask them.
Dave Sartori, Sr. Data Scientist, PPG   A sampling tree is a simple graphical depiction of the data in a prospective sampling plan or one for which data has already been collected. In variation studies such as gage R&R, general measurement system evaluations, or components of variance studies, the sampling tree can be a great tool for facilitating strategic thinking on: What sources of process variance can or should be included? How many levels within each factor or source of variation should be included? How many measurements should be taken for each combination of factors and settings? Strategically considering these questions before collecting any data helps define the limitations of the study, what can be learned from it, and what the overall effort to execute it will be. What’s more, there is an intimate link between the structure of the sampling plan and the associated variance component model. By way of examples, this talk will illustrate how inspection of the sampling tree facilitates selecting the correct variance component model in JMP’s variability chart platform: Crossed, Nested, Nested then Crossed or Crossed then Nested. In addition, the application will be extended to the interpretation of variance structures in control charts and split-plot experiments.     Auto-generated transcript...   Speaker Transcript Dave Hi, everybody. Thanks for joining me here today. I'd like to share with you a topic that has been part of our Six Sigma Black Belt program since 1997. So I think this is one of the tools that people really enjoy, and I think you'll enjoy it too and find it informative in terms of how it interfaces with some of the tools available in JMP. The first quick slide or two in terms of a message from our sponsor. I'm with PPG Industries outside of Pittsburgh, Pennsylvania, in our Monroeville Business and Technical Center. I've been a data scientist there on and off for over 30 years, moved in and out of technical management.
And now, back to what I truly enjoy, which is working with data, and JMP in particular. So PPG has been around for a while; it was founded in 1883. Last year we ranked 180th on the Fortune 500. And we make mostly paints; although people think that PPG stands for Pittsburgh Plate Glass, that has no longer been the case since about 1968. So it's PPG now, and it's primarily a coatings company: performance coatings and industrial coatings for cars, airplanes and, of course, houses. You may have bought PPG paint, or a brand of PPG's, to use on your home. But it's also used inside of packaging, so if you don't have a coating inside of a beer can, the beer gets skunky quite quickly. My particular business is specialty coatings and materials. So in my segment we make OLED phosphors for Universal Display Corporation that you find in the Samsung phone, and also the photochromic dyes that go into transition lenses, which turn dark when you head outside. So what I'm going to talk to you about today is this tool called the sampling tree. And what it is, is really just a simple graphical depiction of the data that you're either planning to collect or maybe that you've already collected. And so in variation studies like a gage R&R, general measurement system evaluations, or components of variance studies (or CoV, as we sometimes call them), the sampling tree is a great tool for thinking strategically about a number of things. So, for example, what sources of variance can or should be included in this study? How many levels within each factor or source of variation can you include? And how many measurements to take for each combination of factors and settings? So you're kind of getting to a sample size question here. So strategically considering these questions before you collect any data helps you also define the limitations of the study, what you can learn from it, and what the overall effort is going to be to execute.
So we put this in a classification of tools that we teach in our Six Sigma program that we call critical thinking tools, because it helps you think up front. And it is a nice sort of whiteboard exercise that you can work on paper or the whiteboard to kind of think prospectively about the data you might collect. It's also really useful for understanding the structure of factorial designs, especially when you have restrictions on randomization. So I'll give you one sort of conceptual example towards the end here, where you can describe, on a sampling tree, a line of restricted randomization. And so that tells you where the whole plot factors are and where the split plot factors are. So it can provide you, again up front, with a better understanding of the data that you're planning to collect. They're also useful where, and I'll share another conceptual example, we've combined a factorial design with a components of variance study. So this is really cool, because it accelerates the learning about the system under study. So we're simultaneously trying to manipulate factors that we think impact the level of the response, and at the same time understanding components of variation which we think contribute to variation in the response. So once the data is acquired, the sampling tree can really help you facilitate the analysis of the data. And this is especially true when you're trying to select the variance component model within a variance chart...variability chart that you have available in JMP. And so if you've ever used that tool (and I'll demonstrate it for you here with a couple of examples), if you're asking JMP to calculate the variance components for you, you have to make a decision as to what kind of model you want. Is it nested? Is it crossed? Maybe it's crossed then nested. Maybe it's nested then crossed.
So helping you figure out what the correct variance component model is, is really well facilitated by a good sampling tree. The other place that we've used them is where we are thinking about control charts. The control chart application really helps you see what's changing within subgroups and what's changing between subgroups. So it helps you think critically about what you're actually seeing in the control charts. As I mentioned, they're good for showing the lines of restriction in split plots, but they're less useful for the analysis of designed experiments; for DOE types of applications they are more of an up-front tool. So let's jump into it here with an example. Here's what I would call a general components of variance study. This one is actually from the literature, from Box, Hunter, and Hunter, "Statistics for Experimenters," and you'll find it towards the back of the book where they are talking about components of variance studies; it happens to be on a paint process. What they have in this particular study are 15 batches of pigment paste. They're sampling each batch twice and then they're taking two moisture measurements on each of those samples. So the first sample from the first batch is physically different from the first sample out of the second batch, and that one is physically different from any of the other samples. One practice that we try to use and teach is that for nested factors, it's often helpful to list those in numerical order. That again emphasizes that you have physically different experimental units as you go from sample to sample throughout. And so this is a nested sampling plan: the sample is nested under the batch. So let's see how that plays out in the variability chart within JMP. Okay, so here's the data, and we find the variability chart under Analyze, Quality and Process.
And then we're going to list here as the X variables the batch and then the sample. One thing that's very important in a nested sampling plan is that the factors get loaded in here in the same order that you have them in the sampling tree, because this is hierarchical; otherwise the results will be a little bit confusing. We can decide here in this launch platform what kind of variance component model we want to specify, and we said this is a nested sampling plan. So now we're ready to go. We leave the measurement out of the list of Xs because the measurement really just defines where the subgroups are. And that's going to be what goes into the variance component that JMP refers to as within variation. Okay, so here's the variability chart. One of the nice things with the variability chart is there's an option to add some graphical information. Here I've connected the cell means, and this indicates visually what kind of variation you have between the samples within the batch. And then we have two measurements per sample, as indicated on our sampling tree, and so the distance between the two points within the batch and the sample indicates the within subgroup variation. You can see, just right off the bat, it looks like there's a good bit of sample to sample variation. The other thing we might want to show here are the group means, and that shows us the batch to batch variation. So the purple line here is the average on a batch to batch basis. Okay. Now, what about the actual breakdown of the variation here? Well, that's nicely done in JMP here under variance components. Once we get that up there, we can see it; then I'll collapse this. As we saw graphically, it looked like the sample to sample variation within a batch was a major contributor to the overall variation in the data, and in fact the calculations confirm that.
We have about 78% of the total variation coming from the sample; about 20% of the variation coming batch to batch; and only about 2.5% of the variation coming from the measurement to measurement variation within the batch and sample. Notice here, too, in the variance components table, the notation that's used: it indicates that the sample is within the batch. So this is a nested study. And again, it's important that we load the factors into the variability chart in the order indicated here in the plot. It wouldn't make any sense to say that within sample one we have batches one and two; that just doesn't make any physical sense, and the tree reflects that. Now let's compare that with something a little bit different. I call this a traditional Gage R&R study. What you have in a traditional Gage R&R study is a number of parts, samples, or batches that are being tested, and then you have a number of operators who are testing each one of those, and each one tests the same sample or batch multiple times. In this particular example we're showing five parts (or samples, or batches), with three operators measuring each one twice. Now in this case, operator one for batch number one is the same person as operator one for batch or sample number five. So you can think of this as saying, well, the operator crosses over between the parts, samples, batches, whatever the thing is that's getting measured. So this is referred to as a crossed study. And it's important that they measure the same article, because one of the things that comes into play in a crossed study that you don't have in a nested study is a potential interaction between the operators and what they're measuring. That's going to be reflected in the variance component analysis that we see from JMP. Now let's have a look at this particular set of data.
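For readers who want to see the arithmetic behind a nested variance components table like the one just described, here is a hedged Python sketch (assuming numpy; this is not JMP's code, and the simulated effect sizes are invented) that recovers batch, sample-within-batch, and within components from the classical expected mean squares for a balanced nested design:

```python
import numpy as np

rng = np.random.default_rng(1)
b, s, n = 15, 2, 2  # batches, samples per batch, measurements per sample

# Simulated data in the spirit of the pigment-paste study (effect sizes invented)
batch_eff  = rng.normal(0, 2.0, size=(b, 1, 1))   # batch-to-batch
sample_eff = rng.normal(0, 4.0, size=(b, s, 1))   # sample-within-batch
noise      = rng.normal(0, 0.7, size=(b, s, n))   # measurement-to-measurement
y = 25 + batch_eff + sample_eff + noise

# Mean squares for the balanced nested ANOVA
grand        = y.mean()
batch_means  = y.mean(axis=(1, 2))                # one mean per batch
sample_means = y.mean(axis=2)                     # one mean per batch/sample cell

ms_batch  = s * n * ((batch_means - grand) ** 2).sum() / (b - 1)
ms_sample = n * ((sample_means - batch_means[:, None]) ** 2).sum() / (b * (s - 1))
ms_error  = ((y - sample_means[..., None]) ** 2).sum() / (b * s * (n - 1))

# Method-of-moments variance components (negative estimates truncated at zero)
var_error  = ms_error
var_sample = max((ms_sample - ms_error) / n, 0.0)
var_batch  = max((ms_batch - ms_sample) / (s * n), 0.0)

total = var_batch + var_sample + var_error
print({k: round(100 * v / total, 1)
       for k, v in [("batch", var_batch), ("sample(batch)", var_sample),
                    ("within", var_error)]})
```

The printed percentages play the same role as JMP's percent-of-total column; with these invented effect sizes the sample-within-batch component should dominate, in the spirit of the 78/20/2.5 split described above.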
So again, we go to the handy variability chart, which again is found under Quality and Process. In this case, I'll start by using the same order for the X variables as shown on the sampling tree. But, as I'll show you, one of the features of a crossed study is that we're no longer stuck with the hierarchical structure of the tree; we can flip these around. So this is crossed, and I'm going to be careful to change that here; remember that we had a nested study before. And I'm going to go ahead and click OK, and put our cell means and group means on there. The group means in this case are the samples (three), and we've got three operators. And now if we ask for the variance components, notice that we don't have that sample-within-operator notation like we had in the nested study. What we have in this case is a sample by operator interaction. And it makes sense that that's a possibility in this case, because again, they're measuring the same sample. Matt is measuring the same sample as the QC lab is, as is Tim. So an interaction in this case really reflects how different this pattern is as you go from one sample to the other. You can see that it's generally the same; it looks like Matt and QC tend to measure things perhaps a little bit lower overall than Tim, and part C is a little bit of an exception. So the interaction variation contribution here is relatively small. There is some operator to operator variation, and the within variation really is the largest contributor. That's easy to see here because we've got some pretty wide bars. But again, this is a crossed study, so we should be able to change the order in which we load these factors and get the same results. That's my proposition here; let's test it. I'm just going to relaunch this analysis and I'm going to switch these up: I'm going to put the operator first and the sample second.
Leave everything else the same. And let's go ahead and put our cell means and group means on there, and now let's ask for the variance components. So how do they compare? I'm going to collapse that part of the report. In the graphical part, and this is a cool thing to recognize with a crossed study, because again we're not stuck with the hierarchy that we have in a nested study, we can change the perspective on how we look at the data. That perspective, with the operator loaded in first, gives us a direct operator to operator comparison here in terms of the group means. And again that interaction is reflected in how this pattern changes between the operators here as we go from part A, B, and C. What about the numbers in terms of the variance components? Well, we see that the variance components table here reflects the order in which we loaded these factors into the dialog box, but the numbers come out very much the same. For the sample on the left-hand side here, the standard deviation is 1.7. The standard deviation due to the operator is about 2.3, and it's the same value over here. The sample by operator (or operator by sample, if you like) interaction is exactly the same, and the within is exactly the same. So, with a crossed study, we have some flexibility in how we load those factors in, and then the interpretation is a little bit different. If these were different samples, we might expect this pattern from going from operator to operator to be somewhat random, because they're measuring different things, so there's no reason to expect that the pattern would repeat. If you do see a significant interaction term in a typical, traditional Gage R&R study like we have here, well, then you've got a real issue to deal with, because that's telling you that the nature of the sample is causing the operators to measure differently.
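The order-invariance demonstrated here falls straight out of the expected mean squares for a crossed design, which are symmetric in the two factors. As a hedged Python sketch (assuming numpy; the 5-part, 3-operator, 2-repeat layout matches the example, but the simulated effect sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
p, o, r = 5, 3, 2  # parts/samples, operators, repeat measurements

# Simulated crossed Gage R&R data (effect sizes invented)
part_eff = rng.normal(0, 1.7, size=(p, 1, 1))
oper_eff = rng.normal(0, 2.3, size=(1, o, 1))
inter    = rng.normal(0, 0.5, size=(p, o, 1))   # part-by-operator interaction
noise    = rng.normal(0, 2.5, size=(p, o, r))   # within (repeatability)
y = 50 + part_eff + oper_eff + inter + noise

grand = y.mean()
pm = y.mean(axis=(1, 2))   # part means
om = y.mean(axis=(0, 2))   # operator means
cm = y.mean(axis=2)        # cell (part-by-operator) means

ms_p  = o * r * ((pm - grand) ** 2).sum() / (p - 1)
ms_o  = p * r * ((om - grand) ** 2).sum() / (o - 1)
ms_po = r * ((cm - pm[:, None] - om[None, :] + grand) ** 2).sum() / ((p - 1) * (o - 1))
ms_e  = ((y - cm[..., None]) ** 2).sum() / (p * o * (r - 1))

# Note the symmetry: swapping the roles of the two crossed factors leaves
# these formulas intact, which is why loading order does not change the
# estimates in the variability chart.
var_e  = ms_e
var_po = max((ms_po - ms_e) / r, 0.0)
var_p  = max((ms_p - ms_po) / (o * r), 0.0)
var_o  = max((ms_o - ms_po) / (p * r), 0.0)
```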
So that's a bit harder of a problem to solve than if you just have a no-interaction situation there. OK. Dave So again, for your reference, I have this listed out here. Um, so now let's get to something a little bit more juicy. Here we have sort of a blended study where we've got both crossed and nested factors. This was from the business that I work in. The purity of the materials that we make is really important, and a workhorse methodology for measuring purity is high performance liquid chromatography, or HPLC for short. This was a product getting used in an FDA-approved application, so getting the purity right was really important. This is a slice from a larger study, but what I'm showing is the case where we had three samples; I'm labeling them here S1, S2, S3. We have two analysts in the study, and each analyst is going to measure the same sample in each case. So you can see that, similar to what we had in the previous example, what I call a traditional Gage R&R, each operator, or analyst in this case, is measuring exactly the same part or sample. So that part is crossed. When you get down under the analyst, each analyst then takes the material and preps it two different times, and then they measure each prep twice: they do two injections into the HPLC with each preparation. So preparation one is different than preparation two, and that's physically different than the first preparation for the next analyst over here. And so again, we try to remember to label these nested factors sequentially to indicate that they're physically different units here. It doesn't really make any difference from JMP's point of view; it'll handle it fine if you were to go 1-2, 1-2, 1-2, and so on down the line, as long as you tell it the proper variance component model to start with. So this would be crossed and then nested. Let's see how that works out in JMP.
So here's our data: sample, analyst, prep number, and then just an injection number, which is really the within-subgroup part. Once again we go to Analyze, Quality and Process, and we go to the variability chart. Here we're going to put in the factors in the same order as they were shown on the sampling tree, and then we're going to put the area in there, as the percent area is the response. We said this was crossed and then nested, and we have a couple of other things to choose from here. In this case, again, the sampling tree is really helpful for convincing us that this is the case, and for selecting the right model: this is crossed, and then nested. Let's click OK. I'm going to put the cell means and group means on there. Again, we have a second factor involved above the within, so let's pick both of them. And let's again ask for the variance components. I'm going to just collapse this part, and maybe collapse the standard deviation chart, just to bring it a little bit further up onto the screen. What we can see in the graph is a good bit of sample to sample variation. The within variation doesn't look too bad, but we do maybe see a little bit of variation within the preparation. So the sample in this case is by far the biggest component of variation, which is really what we were hoping for. The analyst is really below that within-subgroup variation. And so this lays it out for us very nicely. In terms of what it's showing in the variance components table, it's sample, analyst, and then, because these two are crossed, we've got a potential interaction to consider in this case. It doesn't seem to be contributing a whole lot to the overall variation. And again, that's how the pattern changes as we go from analyst to analyst and sample to sample.
Now, the claim I made before with the fully crossed study was that we could swap the crossed factors in terms of their order and it would be okay. So let's try that in this case. I'm just going to relaunch this, and I think I can swap the crossed factors here, but again I have to be careful to leave the nested factor where it is in the tree. Notice over here in the variance components table, the way we would read this is that we have the prep nested within the sample and the analyst; that means it has to go below those on the tree. So let's go ahead and connect some things up here. I'm going to take the standard deviation chart off and ask for the variance components. Okay, so just like we saw in the traditional Gage R&R example, we've got the analyst and the sample switching places, but if we look at the standard deviation over here in the last column, their values are identical. We have again the identical value for the interaction term and the identical value for the prep term, which again is nested within the sample and the analyst. So again, here's where the sampling tree helps us really fully understand the structure of the data and complements nicely what we see in the variance components chart of JMP. So those are a couple of examples geared towards components of variation studies. One thing you might notice too, and I forgot to point this out earlier, is to look at the sampling tree here. Let me bring this back; I'm just trying to reproduce this. Let me back that up. Dave It's interesting: if you look at the horizontal axis in the variability chart, it's actually the sampling tree upside down. So that's another way to confirm that you're looking at the right structure here when you are trying to decide what variance component model to apply. So again, here are the screenshots for that.
Here's an example where the sampling tree can help you in terms of understanding sources of variation in a control chart, of all things. In this particular case, over a number of hours, a sample is being pulled off the line. These are actually lens samples. I mentioned that we make photochromic dyes to go into the Transitions lenses, and they will periodically check the film thickness on the lenses, and that's a destructive test. So when they take that lens and measure the film thickness, well, they're done with that sample. What we would see if we were to construct an x-bar and R chart for this is, on the x-bar chart, the hour to hour average, and then the within subgroup variation is going to be made up of what's going on here sample to sample, plus the thickness measurement. Now in this case, notice that there are vertical lines in the sampling tree, so the tree doesn't branch. When you see vertical lines drawn on the sampling tree, that's an indication that the variability between those two levels of the tree is confounded. So I can't really separate the inherent measurement variation in the film thickness from the inherent sample to sample variation; I'm stuck with those in terms of how this measurement system works. So let's whip up a control chart with this data. For that, again, we're going to go to Quality and Process, and I'm going to jump into the control chart builder. So again, our measurement variable here is the film thickness, and we're doing that on an hour to hour basis. When we get it set up by doing that, we see that JMP smartly sees that the subgroup size is 3, just as indicated on our sampling tree.
But what's interesting in this example is that you might, at first glance, be tempted to be concerned because we have so many points out of control on the x-bar chart. But let's think about that for a minute in terms of what the sampling tree is telling us. The sampling tree is telling us that what's changing within the subgroup, what's contributing to the average range, is the film thickness to film thickness measurement, along with the sample to sample variation. And remember how the control limits are constructed on an x-bar chart: they are constructed from the average range. We take the overall average, and then we add plus or minus the average range times a factor related to the subgroup sample size, so that the width of these control limits is driven by the magnitude of the average range. And so really what this chart is comparing is, if we consider the measurement variation down here at the bottom of the tree, it's comparing measurement variation to the hour to hour variation that we're getting from the line. That's actually a good thing, because it's telling us that we can see variation that rises above the noise that we see in the subgroup. So in this case, that's actually desirable. And so again, a sampling tree is really helpful for reminding us what's going on in the x-bar chart in terms of the within subgroup and between subgroup variation. Now, just a couple of conceptual examples in the world of designed experiments. A split plot experiment is an experiment in which you have a restriction on the run order of the experiment. What that does is it ends up giving a couple of different error structures, and JMP does a great job now of designing experiments for that situation where we have restrictions on randomization, and also of analyzing those.
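To make the limit construction concrete, here is a small, hypothetical Python sketch of x-bar limits built from the average range, with invented film-thickness readings in subgroups of 3 and the standard tabulated control-chart constant A2 = 1.023 for subgroup size 3:

```python
# Sketch of x-bar chart limits from the average range (subgroup size 3).
subgroups = [
    [4.1, 4.3, 3.9],
    [4.0, 4.4, 4.2],
    [4.6, 4.5, 4.8],
    [4.2, 4.1, 4.0],
]  # made-up film-thickness readings, one subgroup per hour

A2 = 1.023  # tabulated control-chart constant for subgroup size n = 3

xbars  = [sum(g) / len(g) for g in subgroups]      # hour-to-hour averages
ranges = [max(g) - min(g) for g in subgroups]      # within-subgroup ranges

grand_mean = sum(xbars) / len(xbars)
rbar = sum(ranges) / len(ranges)                   # the average range

ucl = grand_mean + A2 * rbar   # limits widen with within-subgroup variation
lcl = grand_mean - A2 * rbar
```

Because the limits are driven entirely by `rbar`, a tiny within-subgroup (measurement plus sample) variation gives tight limits, which is exactly why real hour-to-hour variation can plot out of control, as described above.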
Nevertheless, though, it's sometimes helpful to understand where those error structures split, and in a split plot design you get into what are called whole plot factors and subplot factors. The reason you have a restriction on randomization is typically because one or more of the factors is hard to vary. In this particular scenario, we have a controlled environmental room where we can spray paints at different temperatures and humidities. But the issue there is you just can't randomly change the humidity in the room, because it takes too long to stabilize and it makes the experiment rather impractical. What's shown in this sampling tree is that you really have three factors here: humidity, resin, and solvent. These are shown in blue. And we only change humidity once, because it's a difficult-to-change variable. That's how you set up a split plot experiment in JMP: you can specify how hard the factors are to change. In this case, humidity is a very hard to change factor, and JMP will take that into account when it designs the experiment and when you go to analyze it. But what this shows us is that the humidity would be considered a whole plot factor, because it's above the line of restriction, and then the resin and the solvent are subplot factors; they're below the line of restriction. So there's a different error structure above the line of restriction for whole plot factors than there is for subplot factors. In this case we have a whole bunch of other factors that are shown here, which really affect how a formulation, which is made up of a resin and a solvent, gets put into a coating. So this is actually a 2 to the 3 experiment with a restriction on randomization; it's got eight different formulations in there.
Each one is applied to a panel, and then that panel is measured once, so that what we see in terms of the measurement to measurement variation is confounded with the coating and panel variation. As I said before, when we have vertical lines on the sampling tree, we have some confounding at those levels. So that's an example where we're using the tree to show us where the splitting is in the split plot design. This next example, again, is conceptual, but it actually comes from the days when PPG was making fiberglass; we're no longer in the fiberglass business. In this case, what was being sought was an optimization, or at least an understanding of the impact of four controllable variables on what was called loss on ignition, so they basically took coated fiber mats and then measured the amount of coating that's lost when they basically burn up the sample. What we have here at the top of the tree is actually a 2 to the 4 design, so there are 16 combinations of the four factors in this case, and for each run in the design the mat was split into 12 different lanes, as they're referred to here. So you go across the mat from lane 1 to lane 12, and then we're taking three sections from within each one of those lanes, and then we're doing a destructive measurement on each one of those. So this actually combines a factorial design experiment with a components of variation study. And again, we've got vertical lines here at the bottom of the tree indicating that the measurement to measurement variation is confounded with the section to section variation. What we ended up doing here in terms of the analysis was, we treated the data from each DOE run like the sample to sample variation we had in the moisture example from Box, Hunter, and Hunter; instead of batches, here you have DOE runs 1, 2, 3, and so on through 16, and then we're sub-sampling below that.
And so we treat this part as a components of variation study, and then we basically averaged up all the data to look and see what would be the best settings for the four controllable factors involved here. This was really a good study, because it got to a lot of questions that we had about this process in a very efficient manner. So again, combining a COV with a DOE, a design of experiments with a components of variation study. In summary, I hope you've got an appreciation for sampling trees. They're pretty simple; they're easy to understand and easy to construct; but yet they're great for helping us talk through what we're thinking about in terms of sampling a process or understanding a measurement system. And they also help us decide what's the best variance components model when we look to get the variance components from JMP's variability chart platform. We get a lot of use out of that particular tool; I like to say that it's worth the price of admission to JMP in and of itself. So I've shown you some examples here where it's nested, where it's crossed, where it's crossed then nested, and then also where we've applied this kind of thinking to control charts to help us understand what's varying within subgroups versus what's varying between subgroups. And then also, though perhaps less usefully, we can use those with designed experiments as well. So thanks for sharing a few minutes with me here. My email's on the cover slide, so if you have any questions, I'd be happy to converse with you on that. So thank you.
Peter Hersh, JMP Global Technical Enablement Engineer, SAS Phil Kay Ph.D., Learning Manager Global Enablement, SAS Institute   In the process of designing experiments, many potential critical factors are often identified. Investigating as many of these critical factors as possible is ideal. There are many different types of screening designs that can be used to minimize the number of runs required to investigate a large number of factors. The primary goal of screening designs is to find the active factors that should be investigated further. Picking a method to analyze these designs is critical, as it can be challenging to separate the signal from the noise. This talk will explore using the auto-validation technique developed by Chris Gotwalt and Phil Ramsey to analyze different screening designs. The focus will be on group orthogonal supersaturated designs (GO-SSDs) and definitive screening designs (DSDs). The presentation will show the results of auto-validation techniques compared to other techniques used to analyze these screening designs.       Supplementary Materials   A link to Mike Anderson's Add-in To Support Auto-Validation Workflow. JMP data tables for examples 1 and 2 from the presentation.  Journal attached for an interactive demonstration of how auto-validation works for screening experiments. Other Discovery Summit papers about auto-validation from: Europe 2018, US 2018, Europe 2019, US 2019 and US 2020.  Recorded demonstration of how auto-validation works for screening experiments: (view in My Videos) Auto-generated transcript...   Speaker Transcript Peter Hersh All right, well thanks for tuning in to watch Phil's and my Discovery presentation. We are going to be talking today about a new technique in JMP that we are both really excited about, and that's using auto validation to analyze screening designs for DOEs. So my name is Peter Hersh. I'm a senior systems engineer and part of the global technical enablement team, and my co-author who's joining me is Phil. Do you want to introduce yourself? phil Yes. So I'm in the global technical enablement team as well. I'm a learning manager. Peter Hersh Perfect.   So we're going to start out at the end and show some of the results that we got while we were working through this. We did some experiments using auto validation with some different screening DOEs, and we found it a very promising technique; we were able to find more active factors than with some other analysis techniques.   And really, when we're looking at screening DOEs, what we're trying to do is find as many active factors as we can. And Phil, maybe you can talk a little bit about why that is. phil Yeah. So the objective of a screening experiment is to find out which of all of your factors are actually important. So if we miss any factors   from our analysis of the experiment that turn out to be important, then that's a big problem. You know, we're not going to fully understand our process or our system   because we're neglecting some important factor. So it's really critical.
The most important thing is to identify which factors are important.   And if we occasionally add in a factor that turns out not to be important, that's less of a problem, but we really need to make sure we're capturing all of the active factors. Peter Hersh Yeah, great explanation there, Phil, and I think   if we look at this over here on the right-hand side, our table, we've looked at 100 different simulations of these different techniques, where we looked at different signal-to-noise ratios in a screening design, and we found that out of those seven different techniques,   we did a fairly good job when we had a higher signal-to-noise ratio, but as that dropped a little we   struggled to find those   smaller effects. So this top one was the auto validation technique, and we only ran that once; we'll go into why that is and what running that auto validation technique did for us. But I think this was a very promising result.   Now typically, when we do a designed experiment, we don't hold out any of the data; we want to keep it intact. Phil, can you talk a little about why we wouldn't do that? phil Yeah.   When we design an experiment, we are looking to find the smallest possible number of runs that give us the information that we need. So we deliberately keep the number of rows of data as small as possible.   Ideally, you know, in machine learning what you can do is you hold back some of the data as a way of checking how good your models are and whether you need to   improve that model, or whether you need to simplify that model.   With design of experiments, you don't really have the luxury of just holding back a whole chunk of your data, because all of it's critically important.   You've designed it so that you've got the minimum number of rows of data, so there isn't really that luxury.   But it would be really cool if we could find some way that we could use this validation in some way.
And I guess that's really the trick of what we're talking about today. Peter Hersh Yeah, excellent lead-in, Phil. So   here, this auto validation technique was introduced by Chris Gotwalt, who is our head developer for pretty much everything under the Analyze menu in JMP,   and a university professor, Phil Ramsey. And I have two QR codes here for two different Discovery talks that they gave, and   if you're interested in those Discovery talks, I highly recommend them. They go into much more technical detail than Phil and I are planning   to go into today about why the technique works and what they came up with. We're just trying to advertise that it's a pretty interesting technique and it's something that might be worth checking out for you, and show some of our results.   The basic idea is we start with our original data,   and then we resample that data. So the data down here in gray is a repeat of this data up here in white, and that is used as validation data. And how we get away with this is we use a fractional weighting system.   And this has been made really easy to set up with an add-in that Mike Anderson developed, and   there's the QR code for finding that, but you can also find it on the JMP user community. It just makes this setup a lot simpler, and we'll go through the setup and the use of the add-in when we walk through the example in JMP. But   the basic idea is it creates this validation column, this fractional weighting column, and a null factor, and we'll talk about those in a little bit.   Alright, so we have a case study that both Phil and I used here, and we're trying to maximize a growth rate for a certain microbe. And we're adjusting the nutrient combination.   And for my example I'm looking at 10 different nutrient factors. And for these nutrient factors, we went everywhere from not having that nutrient up to some high level of having it.
And this is based on a case study that you can read about here, but we simulated the results, so we didn't have the real data. The case study I'm going to talk to is a DSD where we have five active effects; actually, it's four plus the intercept that are active. We did a 25-run DSD, and I'm looking at these 10 nutrient factors and adjusting the signal-to-noise ratio for the simulated results. So that's my case study. Phil, do you want to talk to yours a little bit? Phil Yeah, so in mine I looked at a smaller number of factors, just six factors, in a smaller experiment, a 13-run definitive screening design. What I was really interested in was how well this method could identify active effects when we've got as many active effects as we have runs in the design of experiments. So we've got 13 rows of data and we've got 13 active parameters when we include the intercept as well. That's a really big challenge. Most of the time we're not going to be able to identify the active effects using standard methods, so I was really interested in how this auto validation method might do in that situation. Peter Hersh Yeah, great. So we're going to duck out of our PowerPoint. I'm going to bring in my case study here and we'll talk about this. So here is my 25-run DSD. I have my results over here, which are simulated. This is my growth rate, which is impacted by some of these factors, and we're in a typical screening design. We're just trying to figure out which of these factors are active and which are not. We might want to follow up with an augmented design, or at least some confirmation runs, to make sure that our initial results are confirmed. So how would we set up this auto validation? For now, in JMP 15, this is that add-in that I mentioned that Mike Anderson created, and it's just called Auto-validation Setup.
And in JMP 16 this is going to be part of the product, but in JMP 15 it's an add-in. When I run that add-in, it resamples the data. It created 25 runs that are identical to those top 25, shown in gray. Then it added this fractional weighting column here, along with the validation column and the null factor. So basically, what we're going to do is run a model on this using validation, and you can use any modeling technique; generalized regression is a good one. You can use any of the variable selection techniques; you just want to make sure it can do variable selection for you. So to give you an idea, I'm going to go under Analyze, Fit Model. We'll take our growth rate, which is our response. We're going to take that weighting. Actually, I'll change this to generalized regression. I'm going to put that weighting in as our frequency. I'm going to add that validation column that was created, and this null factor that was created, and we'll talk a little bit more about that null factor. Then I'm going to add all 10 factors. Now, in Phil's example, he's going to look at interactions and quadratic effects. I could do that here as well, but this is just to show the capability. And I'll hit Run. We'll go ahead and launch this again. I'll use lasso; you could use forward selection or anything like that, but I'll just use a lasso fit. Hit Go. Then I'm going to come down here and look at these estimates. What I want to do is simulate these estimates and figure out which of them get zeroed out most often and least often. So I would go in here and hit Simulate, and I could choose my number of simulations. In this case, I've done 100, and I won't make you sit here and watch it simulate. I can go right over here to my simulated results.
So I've done 100 simulations here, and I'm looking at the results from those hundred simulations. When I run the distribution, which automatically gets generated there, we can see some information about this. Now, the next thing I'm going to do is hold down Control, and I'm going to customize my summary statistics. All I want to do is remove everything except for the proportion nonzero. What that's going to do is allow me to see how often a given factor was zeroed out and how often it was kept in the model. So when I hit OK, all of these change to proportion nonzero. And then when I right-click on here, I can make this combined data table, which I've already done. The combined data table is here. The reason I'm going kind of quick on this is because I've added a factor type column and have a saved script in here, but you would get these three columns out of that. Make a combined data table, so it has all of your factors and then how often each factor was nonzero. The higher the number, the more indicative that it's an active factor. So the last thing I'm going to do is run this Graph Builder, and this shows how often a factor is added to the model. That null factor is kind of our reference line; it has no correlation to the response. So anything that is lower than that, we probably don't need to include in the model, and things that are higher than that, we might want to look at. These green ones from the simulation were the active factors, along with the intercept, and the red ones were not active. So it did a pretty good job here. It flagged all of the good ones. Phil Yeah. Peter Hersh And we got one extra one, but like you mentioned, Phil, that's not as big of a problem, right?
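The workflow Pete just walked through can be sketched outside of JMP as well. The fragment below is a rough Python illustration of the idea, not JMP's implementation: the data are stacked on a copy of themselves, each original/copy pair gets anti-correlated fractional weights, a weighted lasso is fit, and the proportion of simulations in which each coefficient is nonzero is tallied. The uniform weighting scheme, the lasso penalty, and the toy data are all assumptions for illustration; JMP's exact weighting may differ.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Toy screening data: 25 runs, 10 factors, 4 truly active effects
# (coefficients echo the signal-to-noise ratios mentioned in the talk).
n, p = 25, 10
X = rng.uniform(-1, 1, size=(n, p))
beta = np.zeros(p)
beta[:4] = [2.0, 1.0, 0.5, 0.25]
y = X @ beta + rng.normal(size=n)

# Auto-validation copy: stack the table on itself; the copy plays the
# role of the gray "validation" rows in JMP.
X2, y2 = np.vstack([X, X]), np.concatenate([y, y])

n_sims, nonzero = 100, np.zeros(p)
for _ in range(n_sims):
    # Fractional weights: original row gets u, its copy gets 1 - u
    # (an assumed scheme for illustration).
    u = rng.uniform(size=n)
    w = np.concatenate([u, 1.0 - u])
    model = Lasso(alpha=0.05)
    model.fit(X2, y2, sample_weight=w)
    nonzero += model.coef_ != 0.0

# Analogous to the "proportion nonzero" summary statistic in JMP.
prop_nonzero = nonzero / n_sims
print(np.round(prop_nonzero, 2))
```

In this sketch, truly active factors should show a high proportion nonzero across the simulated reweightings, while inert factors hover near whatever baseline a null factor would give.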
Phil Yeah, I mean, that's not the end of the world. It's much more of a problem if you miss things that are active and your method tells you that they're not. And it's really impressive how it's picked out some factors here which had really low signal-to-noise ratios as well. Peter Hersh Yes. So just to give you an idea, citric acid was at two, EDTA at one, so half the signal-to-noise, and potassium nitrate at a quarter, so very low signal, and it was still able to catch that. So I'm going to pass the ball over to Phil and have him present his case study. Phil Thanks, Pete. Well, in my case study, as I said, it's a six-factor experiment and we only have 13 runs. I've simulated the response such that every one of my factors is active: the main effects are active, and the quadratic effects of each of those are active. So we've got 12 active effects, plus an intercept, to estimate or to find. And just for simplicity, I made them really high signal-to-noise: a signal-to-noise ratio of 3 to 1 for every one of those effects. So these are big, important effects, basically, as simulated. What we want to find is that all of these factors are really important. Now, if you analyze this using Fit DSD, which would be a standard method for this, it doesn't find the active factors. It only finds ammonium nitrate as an active factor. I think Fit DSD struggles when there are a lot of active effects; it's very good when there are only a small number. And actually, we probably wouldn't want to run a 13-run DSD and expect to get a definitive analysis. We would recommend adding some additional runs in this kind of situation.
Even if we knew what the model was, so if somehow we knew that we had six active main effects and six active quadratic effects plus the intercept, we really can't fit that model. This is just that model fit using JMP's Fit Model platform, least squares regression, and we've got as many parameters to estimate as we have rows of data, so we've got nothing left with which to estimate error. This is all just to illustrate that analyzing this experiment, and getting out of it what we need, is a really big challenge. So I followed the same method as Pete. I generated this auto validation data set, where we've got the repeated runs here with the fractional weightings. I ran that through Gen Reg, using the lasso for model selection, and then resimulated, each time changing out the fractional weighting, for around 250 simulations, which again I won't show you actually running. These are the simulation results that we got, the distributions here, and you can see that it's picking out citric acid. Some of the time the models had a zero for the parameter estimate for citric acid, but a lot of the time the parameter estimate was about three, which is what it was simulated as originally and what the method should be getting. And you can see that some of these other effects, which were simulated to not be active, are by and large estimated to have a parameter of zero, which is what we want to see. And just looking at the proportion nonzero, as Pete did there, I've added in all the effect types here, because I was looking at the main effects, the quadratics, and the interactions. What the method should find is that the main effects and the quadratics are all active, but the two-factor interactions are not.
And when we look at that, just plotting that proportion nonzero for each of these effects, you can see, first of all, the null factor that we've added in there. Anything with a higher proportion nonzero than that is suggested to be an active effect. You can see the intercept, which is always there. We've got the main effects, which we're declaring as active using this method; they've got a higher proportion nonzero than the null factor, as do the quadratics. And we can see that for all of the two-factor interactions, the proportion nonzero was much, much lower. So it's done an amazingly good job of finding the effects that I'd simulated to be active in this very challenging situation, which I think is really very exciting. That's just one little exploration of this method. To me that's a very exciting result, and it makes me very excited about looking at this more. So I just wanted to finish with some concluding remarks. I think, Pete, it's fair to say we're not saying that everybody should go and throw away everything else they've done in the past and only use this method now. Peter Hersh Yeah, absolutely. We've seen some exciting results, and I think Chris is seeing exciting results, but this is not the end-all, be-all to always use auto validation; it's a new tool in your tool belt. Phil Yeah, I think I'll certainly use it every time, but I'm not saying only use it. You always want to look at things from different angles using all the available tools that you've got. It clearly shows a lot of promise. We've focused on the screening situation, where we're trying to identify the active effects from screening experiments, and we've looked at DSDs. I've looked briefly at other screening designs, like group orthogonal supersaturated designs, and it does a good job there from my quick explorations.
I'd see no reason why it won't do well in fractional factorial and custom screening designs. It seems to work in situations where the standard methods just fall down. The situation I showed was a very extreme example, probably not a very realistic one, but it really pushes the methods, and the standard methods are going to fall down in that kind of situation. Whereas this auto validation method seems to do what it should do: it gives you the results that you need from that kind of situation. So it's very exciting. I think we're waiting for the results of more rigorous simulation studies that are being done by Chris Gotwalt and Phil Ramsey and a PhD student they are supervising. But it really does open up a whole load of new opportunities. I think, Pete, it's just very exciting, isn't it? Peter Hersh Absolutely. Really exciting technique, and thank you, everyone, for coming to the talk. Phil Yeah, thank you.
Monday, October 12, 2020
Russ Wolfinger, Director of Scientific Discovery and Genomics, Distinguished Research Fellow, JMP Mia Stephens, Principal Product Manager, JMP   The XGBoost add-in for JMP Pro provides a point-and-click interface to the popular XGBoost open-source library for predictive modeling with extreme gradient boosted trees. Value-added functionality includes: •    Repeated k-fold cross validation with out-of-fold predictions, plus a separate routine to create optimized k-fold validation columns •    Ability to fit multiple Y responses in one run •    Automated parameter search via JMP Design of Experiments (DOE) Fast Flexible Filling Design •    Interactive graphical and statistical outputs •    Model comparison interface •    Profiling Export of JMP Scripting Language (JSL) and Python code for reproducibility   Click the link above to download a zip file containing the journal and supplementary material shown in the tutorial.  Note that the video shows XGBoost in the Predictive Modeling menu, but when you install the add-in it will be under the Add-Ins menu.   You may customize your menu however you wish using View > Customize > Menus and Toolbars.   The add-in is available here: XGBoost Add-In for JMP Pro      Auto-generated transcript...   Speaker Transcript Russ Wolfinger Okay. Well, hello everyone. Welcome to my home here in Holly Springs, North Carolina. With the Covid craziness going on, this is kind of a new experience to do a virtual conference online, but I'm really excited to talk with you today and offer a tutorial on a brand new add-in that we have for JMP Pro that implements the popular XGBoost functionality. So for this tutorial, I'm going to walk you through what we've got. What I have here is a JMP journal that will be available in the conference materials.
And what I would encourage you to do, if you'd like to follow along yourself: you could pause the video right now and go to the conference materials and grab this journal. You can open it in your own version of JMP Pro, and there's also a link to install an add-in. If you go ahead and install that, you'll be able to reproduce everything I do here exactly at home, and even do some of your own playing around. So I'd encourage you to do that if you can. I do have my dog Charlie; he's in the background there. I hope he doesn't do anything embarrassing. He doesn't seem too excited right now, but he loves XGBoost as much as I do, so let's get into it. XGBoost is a pretty incredible open-source C++ library that's been around now for quite a few years. The original theory was actually done by a couple of famous statisticians in the '90s, but then the University of Washington team picked up the ideas and implemented it. I think where it really came into its own was in the context of some Kaggle competitions. Once folks started using it and it was available, it literally started winning just about every tabular competition that Kaggle has run over the last several years. There are actually now several hundred examples online; if you do some searching around, you'll find them. So I would view this as arguably the most popular, and perhaps the most powerful, tabular data predictive modeling methodology in the world right now. Of course there are competitors, and for any one particular data set you may see some differences, but overall it's very impressive. In fact, there are competing packages out there now that do very similar kinds of things, LightGBM from Microsoft and CatBoost from Yandex. We won't go into them today, but they're pretty similar. Since we don't have a lot of time today, I don't want to belabor the motivations.
But again, you've got this journal if you want to look into them more carefully. What I want to do is give you the highlights of this journal, and particularly give you some live demonstrations, so you've got an idea of what's here. Then you'll be free to explore and try these things on your own as time goes along. You will need a functioning copy of JMP Pro 15 at the earliest, but if you can get your hands on the JMP 16 early adopter version, the official JMP 16 won't ship until next year, 2021, but you can obtain your early adopter version now, and we are making enhancements there. So I would encourage you to get the latest JMP 16 Pro early adopter in order to obtain the most recent functionality of this add-in. Now, this is kind of an unusual setup for JMP Pro. We have written a lot of C++ code right within Pro in order to integrate the XGBoost C++ API, and that's where we do most of our work. But there is an add-in that accompanies this that installs the dynamic library and does a little menu update for you, so you need both Pro and the add-in installed in order to run it. So do grab JMP 16 Pro early adopter if you can, in order to get the latest things, and that's what I'll be showing today. Let's dive right into an example. This is a brand new one that just came to my attention, and it's got a very interesting story behind it. The researcher behind these data is a professor named Jaivime Evaristo, an environmental scientist and now an assistant professor at Utrecht University in the Netherlands. He has kindly provided this data with his permission, as well as the story behind it, in order to help others; there was a bit of drama around these data. His colleague Jeffrey McDonnell collected all this data.
The purpose is to study water streamflow runoff in deforested areas around the world. You can see we've got 163 places here, at least half in the US and the rest around the world, where they were able to collect data on regions that have been cleared of trees, and then they took some critical measurements in terms of what happened with the water runoff. This kind of study is quite important for understanding the environmental impacts of tree clearing and deforestation, as well as climate change, so it's quite timely. And happily for Jaivime at the time, they were able to publish a paper in the journal Nature, one of the top science journals in the world, a really nice experience for him. Unfortunately, what happened next is that a competing lab became very critical of what they had done in this paper. After a lot of back and forth and debate, the paper ended up being retracted, which was obviously a really difficult experience for Jaivime. He's been very gracious in sharing the story, in hopes of helping others avoid this. It turns out that at the root of the controversy, among several other things, the critics' main beef was with the boosted tree analysis. It's a pretty straightforward model; we've only got maybe a handful of predictors, each of which is important, and one of their main objectives was to determine which ones were the most important. He ran a boosted tree model with a single holdout validation set and published a validation holdout R square of around .7. Everything looked okay, but then the critics came along, reanalyzed the data with a different holdout set, and got a validation holdout R square of less than .1. So quite a huge change.
Going from .7 to less than .1, the critics really jumped all over this and tried to discredit what was going on. Now, this happened last year, and the paper was retracted earlier here in 2020. Jaivime shared the data with me this summer, and my thinking was to do a more rigorous cross validation analysis, actually doing repeated K fold instead of just a single holdout, to try to get to the bottom of this discrepancy between two different holdouts. And what I did: we've got a new routine that comes with the XGBoost add-in that creates K fold columns. If you look at the data set here, I've created these. For the sake of time, we won't go into how to do that, but there is a new routine that comes with the add-in, called Make K Fold Columns, that will let you do it. And I did it in a stratified way. Interestingly, it calls JMP DOE under the hood, and the benefit of doing it that way is that you can actually create orthogonal folds, which is not very common. Here, let me do a quick distribution. This was the holdout set that Jaivime did originally, and he did stratify, which is a good idea, I think, as the response is a bit skewed. And then this was the holdout set that the critics used, and here are the folds that I ended up using. I did three different schemes. The point I want to make here is that these folds are nicely orthogonal, so we're getting maximal information gain by doing K fold three separate times with three orthogonal sets. And since he was already using boosted trees, the next thing to try is the XGBoost add-in. So I was really happy to find out about this data set and talk about it here. Now let me do another analysis here, where I'm doing a oneway on the validation sets.
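To make the fold-column idea concrete, here is a small Python sketch of building three repeated, stratified 5-fold columns for a 163-row table, analogous in spirit to what Make K Fold Columns produces. The gamma-distributed toy response and the quartile binning are assumptions for illustration; also note that JMP's routine can additionally make the repeats orthogonal via DOE, whereas this sketch just uses independent random repeats.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(7)

# Toy stand-in for the 163-site runoff table: a skewed response,
# binned into quartiles so the folds can be stratified on it,
# as was done in the talk.
n = 163
y = rng.gamma(shape=2.0, scale=1.5, size=n)
strata = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))

# Three independent stratified 5-fold columns (repeated K fold).
fold_cols = []
for rep in range(3):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
    col = np.empty(n, dtype=int)
    for k, (_, test_idx) in enumerate(skf.split(np.zeros(n), strata)):
        col[test_idx] = k  # row's fold assignment for this repeat
    fold_cols.append(col)

fold_cols = np.column_stack(fold_cols)  # one fold column per repeat
print(fold_cols.shape)
```

Each column assigns every row to exactly one of five folds, so fitting against all three columns yields fifteen holdout fits whose average validation statistic is far less sensitive to any single lucky or unlucky split.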
What I'm doing here is plotting the response, this water yield corrected, versus the validation sets. It turned out that in Jaivime's split, the top four or five measurements all ended up in his training set, which I think is at the root of the problem. Whereas the critics' set was balanced a little bit more, and in particular the highest-scoring location was in their validation set. That's a natural source of error, because it's going beyond anything seen in training. And I think this is really a case where a K fold type analysis is more compelling than just doing a single holdout set. I would argue that both of these single holdout sets have some bias to them, and it's better to do more folds, in which you distribute things differently each time, and then see what happens after multiple fits. So you can see how the folds that I created look in terms of distribution, and now let's run XGBoost. The add-in actually has a lot of features, and I don't want to overwhelm you today, but again, I would encourage you to follow along and pause the video in places if you are trying to follow along yourself. What I did here was run XGBoost from a script; and by the way, everything in the journal has JMP tables, which are nice because you can save scripts. Just for illustration, I'm going to rerun this right from the menu, which is the way you might want to do it. When you install the add-in, you hit Predictive Modeling and then XGBoost; we added it right here to the Predictive Modeling menu. The way you set this up is to specify the response, here's Y; there are seven predictors, which we'll put in as X's; and then you put the fold columns in Validation.
I wanted to make a point here for those of you who are experienced JMP Pro users: XGBoost handles validation columns a bit differently than other JMP platforms. It's kind of an experimental framework at this point, but based on my experience, I find repeated K fold to be a very compelling way to do it, and I wanted to set up the add-in to make it easy. So here I'm putting in these fold columns, again created with the utility, and XGBoost will automatically do repeated K fold just by specifying them like we have here. If you want to do a single holdout like the original analyses, you can set that up just like normal, but you have to make the column continuous. That's a gotcha; I know some of our early adopters got tripped up by this, and it's a different convention than other predictive modeling routines within JMP Pro, but this seemed to me to be the cleanest way to do it. Again, the recommended way would be to run repeated K fold like this, or at least a single K fold, and then you can just hit OK. You'll get this initial launch window. The thing about XGBoost is that it does have a lot of tuning parameters. The key ones are listed here in this box, and you can play with these. It turns out there are a whole lot more hidden under Advanced Options, which we don't have time for at all today, but these are the most important ones; for most cases you can just worry about them. So let's go ahead and run this again; from here you can click the Go button and XGBoost will run. Now, I'm just running on a simple laptop here, and this is a relatively small data set, so we just did repeated fivefold, three different times, in a matter of seconds. XGBoost is pretty well tuned and will work well for larger data sets, but for this small example, let's see what happened.
Now, it turns out this initial graph that comes out raises an immediate flag. What we're looking at here, over the number of fitting iterations, is a training curve, which is basically the loss function that you want to go down, and then the solid curve is the validation curve. You can see what happened: after just a few iterations this curve bottomed out, and then things got much worse. So this is actually a case where you would not want to use the default model. XGBoost has already overfitted, which will often happen for smaller data sets like this, and it does require tuning. There are a lot of other results at the bottom, but I wouldn't consider them trustworthy at this point; you would want to do a little tuning first. For today, let's just do a bit of manual tuning, but I would encourage you to explore further: we've actually implemented an entire framework for creating a tuning design, where you can specify a range of parameters and search over the design space, and again we actually use JMP DOE. So we've got two different ways we're using DOE here, both of which have really enhanced the functionality. For now, let's do a little manual tuning based on this graph. If we zoom in, we see that the curve bottoms out literally after just three or four iterations, so one really easy thing we can do is simply stop training after four steps and see what happens. By the way, notice what happened with our overfitting: our validation R square was actually negative, so quite bad, definitely not a recommended model. But if we run this new model where we take only four steps, look at what happens. Much better validation R square; we're now up around .16. In fact, let's try three, just for fun.
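The tuning move Russ describes, watching the validation loss per boosting iteration and stopping where it bottoms out, can be sketched in a few lines. The example below uses scikit-learn's gradient boosting as a stand-in for XGBoost (the data, learning rate, and tree depth are illustrative assumptions), tracing the per-iteration validation loss with `staged_predict` and picking the iteration count at the minimum.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Small, noisy data set split into training and validation parts.
X = rng.uniform(-1, 1, size=(120, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=120)
X_tr, X_va, y_tr, y_va = X[:80], X[80:], y[:80], y[80:]

# Boosted trees with a deliberately aggressive learning rate, so the
# validation curve typically bottoms out early and then worsens.
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.3,
                                max_depth=3, random_state=0)
gbm.fit(X_tr, y_tr)

# Validation loss after each boosting iteration.
val_curve = [mean_squared_error(y_va, pred)
             for pred in gbm.staged_predict(X_va)]

# Manual "early stopping": keep only the iterations up to the minimum.
best_iter = int(np.argmin(val_curve)) + 1
print("stop training after", best_iter, "iterations")
```

This is exactly the shape of the decision made in the demo: the default run overfits, and truncating the number of boosting steps at the validation minimum recovers a usable model.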
See what happens. A little bit worse. So you can see this is the kind of thing where you can play around; we've tried to set up this dialog to be amenable to that. You can do some model comparisons; this table here at the beginning helps you. You can sort by different columns and find the best model, and then down below you can drill down on various modeling details. Let's stick with Model 2 here. We'll keep only that one; you can clean up the models you don't want, and it'll remove the hidden ones. So now we're back down to just the model that we want to look at in more depth. Notice that our validation R square is .17 or so, which, remember, falls in between what Jaivime got originally and what the critics got. And I would view this as a much more reliable measure of R square, because it's computed over all of the fits; we actually ran 15 different model fits, fivefold, three different times, so this is an average over those. It's a much cleaner and more reliable measure of how the model is performing. If you scroll down, for any model that gets fit there's quite a bit of output to go through. JMP is very good about keeping graphics near statistics, so you can both see what's going on and attach numbers to it, and these graphs are live as normal, nicely interactive. You can see here we've got actual versus predicted for training and for validation, and things almost always do get worse for validation, but that's really what the reality is. You can also see where the errors are being made; this is the 45-degree line, and that really high point, and all the high ones, tend to be underpredicted, which is pretty normal.
Any method like this is going to tend to shrink extreme values down, just to be conservative, so you can see exactly where the errors are being made and to what degree. Now, for Jaivime's key interest: he was mostly interested in seeing which variables were really driving this water yield corrected effect. And we can see the one that leaps out as number one is this one called PET. There are different ways of assessing variable importance in XGBoost. You can look at the straight number of splits; there's a gain measure, which I think is maybe the best one to start with (it's how much the model improves each time you split on a certain variable); and there's another one called cover. In this case, for any one of the three, PET emerges as the most important. So basically, from this quick analysis, that seems to be where the action is for these data. Now, with JMP there's more we can do, and you can see here under the modeling red triangle we've embellished quite a few new things. You can save predicted values and formulas, and you can publish to the Formula Depot and do more things there. We've even got routines to generate Python code, which is not just for scoring; it actually does all the training and fitting, which is kind of a new thing, but it will help those of you who want to transition from JMP Pro over to Python. For here, though, let's take a look at the profiler. And I have to offer a quick word of apology to my friend Brad Jones: in an earlier video, I had forgotten to acknowledge that he was the inventor of the profiler. So this is actually a case, and credit to him, where we're using it in another way, to assess variable importance and how each variable works. To me it's a really compelling framework.
And Charlie...Charlie got right up off the couch when I mentioned that. He's right here by me now.   And so, look at what happens. We can see the interesting thing is with this PET variable: the key moment seems to be as soon as PET   gets right above 1200 or so; that's when things really take off. So it's a really nice illustration of how the profiler works.   And as far as I know, this is the only software that offers   plots like this, which go beyond just these statistical measures of importance and show you exactly what's going on, and help you interpret   the boosted tree model. So really, I think, a nice way to do the analysis, and I'd encourage you to try this out with your own data.   Let's move on now to another example and back to our journal.   As you can tell, there's a lot here. We don't have time, naturally, to go through everything.   But for the sake of time, I wanted to show you what happens when we have a binary target; what we just looked at was continuous.   For that we'll use the old diabetes data set, which has been around quite a while and is available in the JMP sample library. This data set is the same data, but we've embellished it with some new scripts.   And so if you get the journal and download it, you'll get this kind of enhanced version that has   quite a few XGBoost runs with a binary ordinal target. As you remember, here we've got low and high measurements which are derived from the original Y variable,   looking at a response for diabetes. And we're going to go a little bit further here. Let's imagine we're in a kind of medical context where we actually want to use a profit matrix, and our goal is to make a decision. 
We're going to predict, for each person,   whether they're high or low. But then, thinking about it, we realized that if a person is actually high, the stakes are a little bit higher.   And so we're going to double our profit or loss, depending on whether the actual state is high. And of course,   the way this works is, typically, correct decisions are here and here.   And then incorrect ones are here; you want to think about all four cells when you're setting up a matrix like this.   Here we're going to do a simple one, and it's doubling, and I don't know if you can attach real monetary values to these or not. That's actually a good thing if you're in that kind of scenario.   Who knows, maybe we can consider these each to be a Bitcoin, maybe around $10,000 each or something like that.   It doesn't matter too much. It's more a matter of wanting to make an optimal decision based on our   predictions. So we're going to take this profit matrix into account when we do our analysis now. It's actually only applied after the model fitting; it's not directly used in the fitting itself.   So we're going to run XGBoost now here, and we have a binary target. If you'll notice,   the objective function has now changed to a logistic, or log loss, and that's what's reflected here: this is the logistic log likelihood.   And you can see that with a binary target, the metrics that we use to assess it are a little bit different.   Although, if you notice, we do have profit columns, which are computed from the profit matrix that we just looked at.   
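The decision logic behind a profit matrix like this can be sketched in a few lines. The matrix values below are invented (correct calls earn 1 or 2 units, wrong calls lose the same amounts, doubling the stakes when the actual state is high), so the breakeven threshold that falls out is specific to this toy matrix, not the one in the demo.

```python
# Hypothetical profit matrix: PROFIT[(actual, decision)].
# Stakes are doubled when the actual state is "high", as in the talk's example.
PROFIT = {
    ("low", "low"): 1.0, ("low", "high"): -1.0,
    ("high", "high"): 2.0, ("high", "low"): -2.0,
}

def expected_profit(p_high, decision):
    # Expected profit of a decision, given the model's predicted P(actual = high).
    return (p_high * PROFIT[("high", decision)]
            + (1 - p_high) * PROFIT[("low", decision)])

def best_decision(p_high):
    # Pick whichever call has the higher expected profit.
    return max(("low", "high"), key=lambda d: expected_profit(p_high, d))

def breakeven_threshold():
    # Solve E[profit | call high] = E[profit | call low] for p:
    # 2p - (1 - p) = -2p + (1 - p)  =>  p = 1/3 for this particular matrix.
    return 1.0 / 3.0
```

Note how the asymmetric losses pull the optimal threshold away from a naive 0.5; the talk's own matrix happened to leave it near 0.5, but in general the matrix determines the threshold.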
But if you're in a scenario where maybe you don't want to worry so much about a profit matrix, just kind of straight binary regression, you can look at common metrics like   accuracy, which is the reverse of misclassification rate; F1 or Matthews correlation are good to look at, as well as an ROC analysis, which helps you balance specificity and sensitivity. So all of those are available.   And you can drill down. One brand new thing I wanted to show, that we're still working on a bit, is we've got a way now for you to play with your decision thresholding.   And you can actually do this interactively now. We've got a new graph which plots your profit by threshold.   This is a brand new graph that we're just starting to introduce into JMP Pro, and you'll have to get the very latest JMP 16 early adopter in order to get it, but it does accommodate the decision matrix,   or the profit matrix. And then another thing we're working on is that you can derive an optimal threshold based on this matrix directly.   I believe, in this case, it's actually still .5. And so this adds extra insight into the kinds of things you may want to do if your real goal is to maximize profit.   Otherwise, you're likely to want to balance specificity and sensitivity given your context, but you've got the typical confusion matrices, which are shown here, as well as up here, along with some graphs for both training and validation.   And then the ROC curves.   You also get the same kind of variable importances that we saw earlier. And let's go ahead and do the profiler again, since that's actually   nice in this case too. We can see exactly what's going on with each variable.   We can see, for example, here LTG and BMI are the two most important variables, and it looks like they both tend to go up as the response goes up, so we can see that relationship directly. 
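The binary metrics named here (accuracy, F1, Matthews correlation) all come straight from the 2x2 confusion matrix. A small sketch of the standard formulas, with an invented function name:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    # Standard summaries of a 2x2 confusion matrix.
    acc = (tp + tn) / (tp + fp + fn + tn)          # 1 - misclassification rate
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation: balanced even when the classes are very unequal.
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": acc, "f1": f1, "mcc": mcc}
```

F1 ignores true negatives entirely, while MCC uses all four cells, which is one reason both are worth inspecting alongside plain accuracy.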
And in fact, sometimes with trees you can get nonlinearities, like here with BMI.   It's not clear if that's a real thing here; we might want to do more analyses or look at more models to make sure. Maybe there is something real going on with that little   bump that we see. But these are the kinds of things that you can tease out, really fun and interesting to look at.   So that's there to play with in the diabetes data set. The journal has a lot more to it. There are two more examples that I won't show today for the sake of time, but they are documented in detail in the XGBoost documentation.   This is just a PDF document that goes into a lot of written detail about the add-in and walks you step by step through these two examples. So I encourage you to check those out.   And then the journal also contains several different comparisons that have been done.   You saw this purple matrix that we looked at. This was a study that was done at the University of Pennsylvania,   where they compared a whole series of machine learning methods to each other across a bunch of different data sets, and then compared how many times one outperformed the other. And XGBoost   came out as the top model. In this comparison it wasn't always the best, but on average it tended to outperform all the other ones that you see here. So, yet some more evidence of the   power and capabilities of this technique. Now, there are a few other things here that I won't get into.   This Hill Valley one is interesting. It's a case where the trees did not work well at all; it's kind of a pathological situation, but interesting to study, just to help understand what's going on.   We also have done some fairly extensive testing internally within R&D at JMP, and a lot of those results are here across several different data sets.   And again, for the sake of time, I won't go into those, but I would encourage you to check them out. 
All of our results come along with the journal here, and you can play with them across quite a few different domains and data sizes.   So check those out. I will say, just for fun, our conclusion in terms of speed is summarized here in this little meme. We've got two different cars.   This really does scream along, and it's tuned to utilize all the threads that you have in your CPU.   And if you're on Windows with an NVIDIA card, you can even tap into your GPU, which will often offer maybe another fivefold increase in speed. So a lot of fun there.   So let me wrap up the tutorial at this point and, again, encourage you to check it out. I did want to offer a lot of thanks. Many people have been involved,   and I worry that I probably have overlooked some here, but I did at least want to acknowledge these folks. We've had a great early adopter group,   and they provided really nice feedback, from Marie, Diedrich, and these guys at Villanova, who have actually already started using XGBoost in a classroom setting with success.   So that was really great to hear about. And a lot of people within JMP have been helping.   Of course, this is building on the entire JMP infrastructure, so I pretty much need to list the entire JMP division at some point for their help with this; it's been so much fun working on it.   And then I want to acknowledge our Life Sciences team, who have kind of kept me honest on various things and have been helping out with a lot of suggestions.   And Luciano actually has implemented a different XGBoost add-in that goes with JMP Genomics, so I'd encourage you to check that out as well if you're using JMP Genomics. You can also call XGBoost directly within the predictive modeling framework there.   So thank you very much for your attention, and I hope you'll give XGBoost a try.
Monday, October 12, 2020
Kevin Gallagher, Scientist, PPG Industries   During the early days of Six Sigma deployment, many companies realized that there were limits to how much variation can be removed from an existing process. To get beyond those limits would require that products and processes be designed to be more robust and thus inherently less variable. In this presentation, the concept of product robustness will be explained, followed by a demonstration of how to use JMP to develop robust products through case study examples. The presentation will illustrate JMP tools to: 1) visually assess robustness, 2) deploy Design of Experiments and subsequent analysis to identify the best product/process settings to achieve robustness, and 3) quantify the expected capability (via Monte Carlo simulation). The talk will also highlight why Split Plot and Definitive Screening Designs are among the most suitable designs for developing robust products.     Auto-generated transcript...   Speaker Transcript Kevin Hello, my name is Kevin Gallagher. I'll be talking about designing robust products today. I work for PPG Industries, which is headquartered in Pittsburgh, Pennsylvania; our corporate headquarters is shown on the right-hand side of the slide. PPG is a global leader in the development of paints and coatings for a wide variety of applications, some of which are shown here. And I personally work in our Coatings Innovation Center in a northern suburb of Pittsburgh, where we have a strong focus on developing innovative new products. In the last 10 years, the folks at this facility have developed over 600 US patents, and we've received several customer and industry awards. I want to talk about how to develop robust products using design of experiments and JMP. So the first question is, what do we mean by a robust product? And that is a product that delivers consistent results. 
And the strategy of designing a robust product is to purposely set the control factors, inputs to the process that we call X's, to desensitize the product or process to the noise factors that are acting on it. So noise factors are inputs to the process that can potentially influence the Y's, but over which we generally have little control, especially in the design phase of the product or process. When thinking about robust design, it's good to start with a process map that's augmented with the variables associated with the inputs and outputs of each process step. So if we think about an example of developing a coating for an automotive application, we first start with designing that coating formulation, then we manufacture it. Then it goes to our customers, and they apply our coating to the vehicle, and then you buy it and take it home and drive the vehicle. So when we think about robustness, we need to think about three things. We need to think about the output that's important to us. In this example, we're thinking about developing a premium appearance coating for an automotive vehicle. We need to think about the noise variables that can cause the Y to vary. And in this particular case, I want to focus on variables that are really in our customers' facilities. Not that they can't control thickness and booth temperature and applicator settings, but there's always some natural variation around all of these parameters. And for us, we want to focus on factors that we can control in the design of the product, to make the product insensitive to those variables in our customers' process, so they can consistently get a good appearance. So one way is to run a designed experiment around some of the factors that are known to cause that variability. In this particular example, we could design a factorial design around booth humidity, applicator setting, and thickness. 
This assumes, of course, that you can simulate those noise variables in your laboratory, and in this case, we can. So we can run this same experiment on each of several prototype formulations; it could be just two as a comparison, or it could be a whole design of experiments looking at different formulation designs. Once we have the data, one of the best ways to visualize the robustness of a product is to create a box plot. So I'm going to open up the data set comparing two prototype formulations tested over a range of application conditions, and in this case appearance is measured so that higher values are better. So ideally, we'd like high values of appearance, consistently good over all of the different noise conditions. To look at this, we can go to the Graph Builder. We can make appearance our Y value and the prototype formulas our X values. And if we turn on the box plot and then add the points back, you can clearly see that one product has much less variation than the other, and thus is more robust; on top of that, it has a better average. Now, the box plots are nice because they capture the middle 50% of the data, and the whiskers go out to the maximum and minimum values, excluding the outliers. So they make a very nice visual display of the robustness of a product. So now we want to talk about how to use design of experiments to find settings that are best for developing a product that is robust. As you know, when you design an experiment, the best way to analyze it is to build a model, Y as a function of X, as shown in the top right. And then once we have that model built, we can explore the relationship between the Y's and the X's with various tools in JMP, like the contour plot shown in the bottom right-hand corner, and the prediction profiler. 
These allow us to explore what's called the response surface, or how the response varies as a function of the changing values of the X factors. The key to finding a robust product is to find areas of that response surface where the surface is relatively flat. In that region, it will be very insensitive to small variations in those input variables. Here is a very simple example where there's just one Y and one X, and the relationship, shown here, is sort of a parabolic function. If we set the X at a higher value here, where the function is a little bit flatter, and we have some sort of common cause variation in the input variable, that variation will be translated into a smaller amount of variation in the Y than if we had that X at a lower setting, as shown by the dotted red lines. In a similar way, we can have interactions that transmit more or less variation. In this example, we have an interaction between a noise variable and a control variable X. And in this scenario, if there's again some common cause variation associated with that noise variable, having the X factor set at the low setting will transmit less variation to the Y variable. So now I want to share a second case study with you, where we're going to demonstrate how to build a model, explore the response surface for flat areas where we could make our settings to have a robust product, and finally evaluate the robustness using a predictive capability analysis. In this particular example, a chemist is focused on finding the variables that are contributing to unacceptable variation in the yellowness of the product, and that yellowness is measured with a spectrophotometer via the metric b*. The team did a series of experiments to identify the important factors influencing yellowing, and the two most influential factors they found were the reaction temperature and the rate of addition of one of the important ingredients. 
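The idea that a flat spot on the response surface transmits less variation can be quantified with a first-order (delta-method) approximation: the standard deviation transmitted to Y is roughly the local slope times the standard deviation of X. A small sketch, with an invented parabola standing in for the surface in the slide:

```python
def transmitted_sd(f, x0, sd_x, h=1e-6):
    # First-order approximation: sd(y) ~ |f'(x0)| * sd(x).
    # The derivative is taken numerically with a central difference.
    slope = (f(x0 + h) - f(x0 - h)) / (2 * h)
    return abs(slope) * sd_x

# A parabola like the one sketched in the talk, flat near its vertex at x = 4.
f = lambda x: -(x - 4) ** 2 + 10

flat = transmitted_sd(f, 3.9, sd_x=0.5)   # near the flat top: slope 0.2
steep = transmitted_sd(f, 1.0, sd_x=0.5)  # on the steep flank: slope 6.0
```

The same input variation (sd 0.5) produces thirty times less output variation at the flat setting, which is exactly the robustness argument being made.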
So they decided to develop a full factorial design with some replicated center points, as shown in the right-hand corner. Now, the team would like the yellowness value (b*) to be at a target value of 2, but within a specification of 1 to 3. I'm going to go back into JMP and open up the second case study example. It's a small experiment, where the factorial runs are shown in blue and the center points in red. And again, the metric of interest (b*) is listed here as well. Now, the first thing we would normally do is fit the experiment to the model that is best for that design. In this particular case, we find a very good R square between the yellowness and the factors that we're studying, and all of the factors appear to be statistically significant. So given that's the case, we can begin to explore the response surface using some other tools within JMP. One of the tools that we often like to use is the prediction profiler, because with this tool we can explore different settings and look to find settings where the yellowness is predicted to be where we want it, a value of 2. But when it comes to finding robust settings, a really good tool to use is the contour profiler, under factor profiling. I'm going to put a contour right here at 3, because we said the specification limits were 1 to 3. At the high end (3), anywhere along this contour the predicted value will be 3, and above this value, in the shaded area, it will be above 3, out of our specification range. That means that anything in the white is expected to be within our specification limits. So right now, the way we have it set up, anything with a temperature less than 80 and a rate anywhere between 1.5 and 4 should give us product that meets specifications on average. But what if the temperature in the process, when we scale this product up, is something that we can't control completely accurately? 
So there's going to be some amount of variation in the temperature. So how can we develop the product and come up with set points so that the product will be insensitive to temperature variation? In order to think about that, it's often useful to add some contour grid lines to the contour plot overlay here. And I like to round off the low value and the increment, so that the contours are at nice even numbers: 1.5, 2, 2.5, and 3, going from left to right. So anywhere along this contour here should give us a predicted value of 2. But do we want to be down here, where the contours are close together, or up here, where they're further apart with respect to temperature? As the contours get further apart, that's an indication that we're nearing a flat spot in the response surface. So to be most robust to temperature, we want to be near the top here: a setting near 75 degrees and a rate of about 4 might be most ideal. And we can see this also in the prediction profiler when we have these profilers linked, because at this setting we're predicting the b* to be 2, and notice the relationship between b* and temperature is relatively flat. But if I click down to this lower level, now, even though the b* is still 2, the relationship between b* and temperature is very steep. So if we know something about how much variation is likely to occur in temperature when we scale this product up, we can actually use the model that we've built from our DOE to simulate the process capability into the future. The way we can do that with JMP is to open up the simulator option. It allows us to input random variation into the model in a number of different ways, and then use the model to calculate the output for those selected input conditions. We could add random noise, like common cause variation that could be due to measurement variation and such, into the design. We can also put random variation into any of the factors. 
In this case, we're talking about maybe having trouble controlling the temperature in production, so we might want to make that a random variable. It sets the mean to wherever I have the setting, so I'm just going to drag it down a little bit to the very bottom, so it's at a mean of about 70. JMP has a default standard deviation of 10. You can change that to whatever makes sense for the process that you're studying, but for now I'm just going to leave it at 10. And you can choose to randomly select from any distribution that you want; I'm going to leave it at the normal distribution. I'm going to leave the rate fixed. So maybe in this scenario we can control the rate very accurately, but the temperature not as much. So we want to make sure we're selecting our set points for rate and temperature so that there is as little impact of temperature variation on the yellowness as possible. We can evaluate the results of this simulation by clicking the Make Table button to simulate to a table. Now what we have is 5,000 rows that have been simulated; every row has a random selection of temperature from the distribution shown here, with the rate held fixed. Next we want to compare the simulated results to the specification limits that we have for this product, and we can do that with the process capability analysis. And since I already have the specification limits as a column property, they're automatically filled in, but if you didn't have them filled in, you could type them in here. Simply click OK, and now it shows us the capability analysis for this particular product. It shows us the lower spec limit, the upper spec limit, and the target value, and it overlays those on the distribution of responses from our simulation. In this particular case, the results don't look too promising, because a large percentage of the product seems to be outside of the specification. In fact, 30% of it is outside. 
And if we use the capability index Cpk, which compares the specification range to the range of process variation, we see that the Cpk is not very good, at 0.3.  
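The simulate-then-assess-capability step can be sketched directly. The model coefficients below are invented stand-ins for the fitted DOE model (so the Cpk and fraction out of spec will not match the talk's 0.3 and 30%); the spec limits of 1 to 3 and the temperature distribution, mean 70 and standard deviation 10, come from the case study. Cpk is the standard min(USL - mean, mean - LSL) / (3 * sd).

```python
import random
import statistics

LSL, USL = 1.0, 3.0  # yellowness (b*) spec limits from the case study

def b_star(temp, rate):
    # Hypothetical stand-in for the fitted DOE model; these coefficients
    # are invented for illustration, not taken from the talk.
    return 2.0 + 0.04 * (temp - 70.0) + 0.1 * (rate - 2.5)

def simulate_capability(n=5000, temp_mean=70.0, temp_sd=10.0, rate=2.5, seed=1):
    # Monte Carlo: propagate random temperature variation through the model
    # at a fixed rate, then summarize with Cpk and the fraction out of spec.
    rng = random.Random(seed)
    ys = [b_star(rng.gauss(temp_mean, temp_sd), rate) for _ in range(n)]
    mean, sd = statistics.mean(ys), statistics.stdev(ys)
    cpk = min(USL - mean, mean - LSL) / (3.0 * sd)
    frac_out = sum(1 for y in ys if y < LSL or y > USL) / n
    return cpk, frac_out
```

Rerunning the simulation at different candidate set points is how you would compare the robustness of one operating point against another before scale-up.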
Monday, October 12, 2020
Bradley Jones, JMP Distinguished Research Fellow, SAS   JMP has been at the forefront of innovation in screening experiment design and analysis. Developments in the last decade include Definitive Screening Designs, A-optimal designs for minimizing the average variance of the coefficient estimates, and Group-orthogonal Supersaturated Designs. This tutorial will give examples of each of these approaches to screening many factors and provide rules of thumb for choosing which to apply for any specific problem type.     Auto-generated transcript...   Speaker Transcript Bradley Jones Thanks for joining me. I'm going to give a talk on 21st century screening designs. The topic of screening designs has changed over the last 20 years. I'd like to talk about the three kinds of screening designs that I would recommend, all of which are available in JMP: A-optimal screening designs, definitive screening designs (DSDs), and group orthogonal supersaturated screening designs (GO SSDs). Now, that might be a little surprising to you, because the default in the custom designer has been D-optimal for around 20 years. However, over the last few years, I've come to the conclusion that A-optimal designs are better than D-optimal designs, and I'm going to be illustrating that point in the first section of this talk. Those are the three designs I'm comparing. I'm going to start with A-optimal designs. My slide has a graph showing exactly why I think that A-optimal designs might, in many cases, be better than D-optimal screening designs. The example is a four-factor, 12-run design for fitting all the main effects and all the two-factor interactions of four continuous factors. You can see that for the A-optimal design here, there are only three cells with non-zero correlations in the correlation cell plot. The rest of the pairwise correlations are all zero for the A-optimal design. 
For the D-optimal design, there are a lot of correlations all over the place, and that makes model selection harder. This concludes my first example. What is A-optimal, you might ask? What does the 'A' in A-optimality stand for? The 'A' stands for the average variance of the parameter estimates for the model. The parameter estimates are the estimates of the four main effects and the six two-factor interactions. If you minimize the average of them, you tend to do a reasonable job of lowering every one of them, at least doing better than some other approaches. The way to remember what A-optimal means is that the 'A' stands for average: the average variance of the parameter estimates. Bradley Jones That makes the A-optimality criterion easy to understand. Everybody understands what an average is. We have all these estimates, we want the variances of those estimates to be small in general, and that's what the A-optimality criterion does in a direct way. The other nice thing about A-optimal designs is that they allow for putting different emphasis on groups of parameters through weighting. Now, you could weight D-optimal designs, but the weighting on a D-optimal design doesn't change the design, so it's kind of useless. Whereas when you weight A-optimal designs, you get differential weighting on the parameters. You might say, well, main effects are more important to me than two-factor interactions, so I want to weight them higher. We have features in JMP for doing just that. I have two different A-optimal design demonstrations. One is the one I just showed you: the output for four factors, 12 runs, D-optimal versus A-optimal, for all the main effects and two-factor interactions. The second example is a five-factor experiment with 20 runs and a model having the main effects and all the two-factor interactions, with D-optimal versus a weighted A-optimal where I put more weight on the main effects. 
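The A-criterion just described can be computed directly: for a model matrix X, the parameter variances (in units of sigma squared) are the diagonal of (X'X) inverse, and the criterion is their average, optionally weighted. A minimal pure-Python sketch with an invented 4-run example; the design and helper names are for illustration only.

```python
def mat_inv(a):
    # Gauss-Jordan inverse of a small square matrix, with partial pivoting.
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv_row] = aug[piv_row], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def a_criterion(X, weights=None):
    # (Optionally weighted) average variance of the parameter estimates:
    # the diagonal of (X'X)^(-1), averaged with the given weights.
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    inv = mat_inv(xtx)
    var = [inv[i][i] for i in range(p)]
    w = weights or [1.0] * p
    return sum(wi * vi for wi, vi in zip(w, var)) / sum(w)

# Orthogonal 4-run design for intercept + 2 main effects: X'X = 4I,
# so every variance is 1/4 and the A-criterion is 0.25.
ortho = [[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1]]
```

Unequal weights change which design minimizes the criterion, which is exactly why weighted A-optimality, unlike weighted D-optimality, actually produces a different design.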
Let me switch over to JMP. Here's the D-optimal design for four factors and 12 runs, and here's the A-optimal design for the same problem. And now I want to compare these two designs. If we look at the relative estimation efficiency, the A-optimal design has higher estimation efficiency for every parameter except the interaction between X1 and X3. If we look at the average correlations, for the A-optimal design the average absolute correlation is only 0.02, whereas for the D-optimal design it is 0.116. So there is almost six times as much correlation in the D-optimal design as in the A-optimal design. The A-optimal design isn't quite as D-efficient as the D-optimal design; of course, the D-optimal design has been chosen to be the most D-efficient you can be. The fact that the A-optimal design is still 97+% D-efficient is really good. But look, the A-optimal design is 87.5% more G-efficient than the D-optimal design, so the A-optimal design is reducing the worst possible variance of prediction. And of course, the A efficiency of the A-optimal design is 14.5% better than that of the D-optimal design, and its I efficiency is 16% better. All of these efficiency measurements and all of these correlations make it pretty clear to me that the A-optimal design in this case is far better than the D-optimal design. And that's one of the reasons why A-optimality is now the default for screening including two-factor interactions in JMP 16. Let's look at the second example. Here I have a five-factor experiment: here's the five-factor 20-run D-optimal design, and then a five-factor 20-run weighted A-optimal design. Now I'm going to show you the JMP script that creates this design, and what I want to point out is these parameter weights. 
This weight factor is saying, I want to weight the intercept and the five main effects 100 times higher than the 10 two-factor interactions. Let's see what that does to the design. Now we'll compare designs. And see, here is the efficiency of the D-optimal design with respect to the weighted A, and the weighted A is estimating the five main effects all better than the D-optimal design: this 97% through 99% is the relative efficiency of the D-optimal to the weighted A-optimal for estimating the main effects. The effect of weighting the main effects is that you get slightly worse variance for the two-factor interactions. Here you're making your estimation of the main effects better at the expense of making your estimation of the two-factor interactions a little worse, but that is what you wanted by weighting the main effects so heavily. Again, the average absolute pairwise correlations for the weighted A are about 50% better than those for the D-optimal design. And again, if we look at the A-optimal correlation cell plot versus the D-optimal correlation cell plot, you can see that there are a lot more white cells in the A-optimal plot, which means those pairwise correlations are zero. There are a lot of zeros here, and that means it will be a lot easier to separate the true active effects from the inactive effects. The D-optimal design is only half a percent better than the A-optimal design, even on its own criterion; the A efficiency is right at nominal, and the A-optimal design is more G-efficient and also more I-efficient than the D-optimal design. Now here, it might be a little bit harder to say which one you would choose. For me, I said I wanted to be able to estimate main effects better, and I can. Those are my two demonstrations for the A-optimal design. Let me go back to my slides. 
Okay, so what I want to do for each of these three kinds of designs is give you an idea about when you would use one in preference to the others. A-optimal designs, and the other designs supported in the custom designer, are more flexible than the other two kinds; they allow you to solve more different kinds of problems. In particular, you would use an A-optimal screening design if you have a lot of categorical factors, especially if they have more than two levels. If there are certain factor level combinations that are not feasible, for instance if you couldn't run the high setting of all the factors together, then you would want an A-optimal design, because the other two screening designs wouldn't be able to handle that. And when you have a particular model in mind that you want to fit, one that's not just the main effects or the main effects plus two-factor interactions, you would use the A-optimal screening design rather than a definitive screening design or a group orthogonal supersaturated design. Moreover, you would use an A-optimal design if you want to put more weight on some groups of effects than others, and I showed you an example of that, where I wanted to put more weight on the main effects rather than the two-factor interactions. Let's move on to definitive screening designs now. Definitive screening designs first appeared in a published journal article in 2011, and they appeared in JMP at roughly the same time. They are a very interesting design. You can see, looking at the correlation cell plot here, that in a definitive screening design the main effects are all orthogonal to each other; that is, there is no pairwise correlation between any pair of main effects. But also, the main effects are uncorrelated with all the two-factor interactions and all the quadratic effects. So that makes it very easy to see which main effects are important. 
Because definitive screening designs have far fewer runs than a response surface design, there are correlations between two-factor interactions, between two-factor interactions and quadratic effects, and among the quadratic effects. But notice that each quadratic effect is uncorrelated with all the two-factor interactions that involve that factor. This is the quadratic effect of X1, X1 squared; that effect is completely orthogonal to all of the effects that have X1 in the two-factor interaction. And that's the same for all of the factors: the squared term for X2 is orthogonal to all of the effects that have X2 in them, and so on. That's an interesting property that turns out to be quite useful. What does the DSD look like? Well, they have three main properties that I want to tell you about. The first property is that if we look at Run 1 and Run 2, wherever there's a +1 in Run 1 there's a -1 in Run 2, and vice versa: wherever there's a -1 in Run 1, there's a +1 in Run 2, so that Run 1 and Run 2 are mirror images of each other. The same thing is true of Run 3 and Run 4, Run 5 and Run 6, Run 7 and Run 8, Run 9 and Run 10, and Run 11 and Run 12. So this design is what's called a fold-over design, and that folding over is what makes the main effects uncorrelated with the two-factor interactions. Of course, I have this design sorted in such a way that you can see this structure, but when you run this design, you should actually randomize the order of the runs. The second thing I want to point out is that for each factor, there is a pair of runs where that factor is at its center level. For example, in the first pair of runs, A is at its center level; in the second pair of runs, B is at its center level; in the third pair, C is at its center level; in the fourth pair, D is at its center level; and so on. And the last thing to notice is that there's one overall center run. We have six factors, Factors A through F, and the number of runs is 2x6+1.
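The fold-over property the talk describes can be verified directly: stack any plus/minus-one half on top of its mirror image, add a center run, and every main-effect column is exactly orthogonal to every two-factor-interaction column. This sketch uses a random half purely for illustration; a real DSD builds the half from a conference matrix with the per-pair center values.

```python
import numpy as np

def max_me_x_2fi(D):
    """Largest |inner product| between any main-effect column of D and any
    two-factor-interaction column (elementwise product of two columns)."""
    k = D.shape[1]
    worst = 0.0
    for i in range(k):
        for j in range(k):
            for l in range(j + 1, k):
                worst = max(worst, abs(float(D[:, i] @ (D[:, j] * D[:, l]))))
    return worst

# Generic fold-over: F stacked with -F, plus one overall center run.
rng = np.random.default_rng(7)
F = rng.choice([-1.0, 1.0], size=(6, 6))
D = np.vstack([F, -F, np.zeros((1, 6))])  # 2*6 + 1 = 13 runs

assert max_me_x_2fi(D) == 0.0
```

The reason is one line of algebra: each interaction term from a row of F is exactly canceled by its negated mirror row, so the sum over all runs is zero regardless of what F is.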
And one model that we can fit with this design is the model that contains the intercept, all six main effects, and all six quadratic effects. That's 13 different effects in a 13-run design. It's amazingly efficient in terms of the allocation of runs to effects. Now, what are the positive consequences of having a definitive screening design? Well, first, an active two-factor interaction doesn't affect the estimate of any main effect, because they are uncorrelated, which makes it the case that any single active two-factor interaction can be identified uniquely. The same thing is true of any single quadratic effect, as long as that quadratic effect is large with respect to the noise. And then one really interesting consequence: it turns out that if only three factors are active, let's say factors A, C, and E, then I can fit a full quadratic model in those three factors. A full quadratic model is the kind of model that people fit when they're doing response surface methods, or RSM. And it doesn't matter which three factors are the active ones; the DSD will always be able to fit a full quadratic model, no matter which three factors turn out to be active. That's a very powerful thing. The result is that if you're lucky enough that only three out of the six factors are important, you can skip the separate RSM step in some cases and do RSM and screening in one fell swoop. I've told you all the good things about DSDs. There is one bad thing, or maybe less good thing, and that is that, because of those zeros in each column, the main effects are not estimated as precisely in a definitive screening design as they are in an orthogonal design with one center run. As a result, confidence intervals on the main effects for a DSD are going to be around 10% longer than the confidence intervals if you had run, say, a Plackett-Burman design with the same number of runs.
That's a very small price to pay, in my view, for all the benefits that you get, particularly the benefit of being able to fit quadratic effects, which you can't do with a Plackett-Burman design having a single center run. With the center run you can identify that you need to fit quadratic effects, but you don't know which of the factors has the high curvature. Now, when would you use a DSD? Well, you use them when most of the factors are continuous, and if they're continuous, you might have the factor levels set far enough apart that you're concerned about possible nonlinearity or curvature in the effect of a factor on a response. And then you're also concerned about the possibility of two-factor interactions, although DSDs cannot promise that you can fit all the two-factor interactions, because there just aren't enough runs. But if there are two or three active two-factor interactions, you're likely to be able to identify them with a DSD. Okay, let's go back to JMP and back to the journal here. This is a DSD. It's the DSD that has six factors; instead of the 13-run design, I created the eight-factor design and just dropped the last two factors. Now, I also created a full factorial design for this problem and fit the full factorial model. Let me show you the parameter estimates. The parameter estimates that are significant in the full factorial design are A, B, and C, A*B, A*C, B*C, and A squared. Now, let me show you the analysis that you would get by doing Fit Definitive Screening. I have time as my response and A through F as my factors. And let's look at the model that comes out. The model that this is finding has A, B, C, and E; E is a spurious effect, so that's a Type I error, but it identified all three two-factor interactions and the quadratic effect of A.
The full factorial design that I showed you had three to the sixth runs, which is 729 runs. This design here only has 17 runs. The fact that I was able to identify all the correct terms with far, far fewer runs is an eye-opener for why definitive screening designs are really great. Let's go back to the slides. DSDs are great when almost all of the factors are continuous. You can accommodate two or three two-level categorical factors. And you can also block definitive screening designs much more flexibly than you can block fractional factorial designs. For a six-factor definitive screening design, you can have anywhere between two and six blocks, and the blocks are orthogonal to the main effects. That's another amazing thing about these designs. Let's move on to the newest of the screening designs. These have just been discovered in the last couple of years, and the publication in Technometrics just came out in the last week or so. It's been online for a year, but the actual printed article came to my house in the mail just a week or so ago. This is a correlation cell plot of a group orthogonal supersaturated design, and you might notice all this gray area. Most of the time, if you look at a supersaturated design, the correlation cell plot has correlations everywhere. Here we only see correlations within groups of factors. This group of factors is correlated, this other group of factors is correlated, this group is correlated, and this other group is correlated, but there are no correlations between any pair of groups of factors. The only correlations that you see are within a group, not between groups, and that helps you with analyzing the data. Here's a picture of the first page of the published article, which, as I just said, went into print last week or so.
My coauthors are Chris Nachtsheim, this guy here, from the University of Minnesota; my colleague from JMP, Ryan Lekivetz; Dibyen Majumdar, who's an Associate Dean at the University of Illinois at Chicago, in the statistics department; and Jon Stallrich, who is a professor at NC State. Let me talk about why you might even be interested in group orthogonal supersaturated designs, or supersaturated designs at all. Then I'll show how we make a group orthogonal supersaturated design. I will show how to analyze them, except that you don't have to learn how to analyze them, because there's an automatic analysis tool in JMP that's right next to the designer. And then I'll show you how the two-stage analysis of these "GO SSDs," as we call them, compares to more generic analytical approaches. Then I'll make some conclusions. What's a supersaturated design? A supersaturated design has more factors than runs. For example, you might have something like 20 factors and only 12 runs to investigate them in. Then the question you might ask yourself is, "Is this a good idea?" Well, a former colleague of mine, who retired about 15 years ago, told me supersaturated designs are evil. I do understand why he felt that way. The problem with a supersaturated design is that you can't do multiple regression, because you have more factors than runs, so the matrix that you want to be able to invert is not invertible. And the factor aliasing is typically complex, although in these group orthogonal supersaturated designs, it's a lot less complex. And there's this general feeling that you can't get something for nothing; it feels like you're not putting enough resources into the design to get anything good out of it. Supersaturated designs were first discussed in the literature by a mathematician by the name of Satterthwaite.
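The "can't invert the matrix" point is easy to see numerically: with 20 factors in 12 runs, X'X is a 20x20 matrix whose rank can be at most 12, so ordinary least squares has no unique solution. A small sketch (the 12x20 design here is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 20                              # 12 runs, 20 factors
X = rng.choice([-1.0, 1.0], size=(n, p))   # a random +/-1 "design", for illustration

# X'X is p x p but its rank is at most n, so it is singular and
# (X'X)^-1, which multiple regression needs, does not exist.
rank = np.linalg.matrix_rank(X.T @ X)
assert rank <= n < p
```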
His paper was roundly excoriated by a lot of the high-end statisticians of the day; some even, you know, laughed at him to a large degree. Then, three years later, Booth and Cox discussed the possibility of systematically generating a supersaturated design, and they had a selection criterion which said: look at all of the squared pairwise correlations, so they're all positive numbers, look at the average of those, and make that as small as possible, even though the design cannot be orthogonal, because in order to have an orthogonal design, you have to have at least as many runs as factors. The criterion of Booth and Cox is trying to find as close to an orthogonal design as it can, given that there are fewer runs than factors. We think that John Tukey was the first to use the term "supersaturated," in his discussion of Satterthwaite in 1959. Here's what Tukey said: "Of course constant balance can only take us up to saturation (one of George Box's well-chosen terms), up to the situation where each degree of freedom is taken up with the main effect (or something else we are prepared to estimate)." So a saturated design has no degrees of freedom left over to estimate the variance. But Tukey then says, "I think it's perfectly natural and wise to do some supersaturated experiments." In general, though, the statistics community didn't take that to heart, and nothing happened for 30 years after that. Thirty years later, Jeff Wu, who's now a professor at Georgia Tech, wrote a paper in Biometrika talking about one way of making supersaturated designs. And the same year in Technometrics, Dennis Lin, who's now the chair of the Statistics Department at Purdue, wrote a paper about another way to create supersaturated designs. Both of these papers were very interesting, and they brought supersaturated designs back into people's consciousness. When would you use a supersaturated design? Well, one time you would use it is when runs are super expensive.
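The Booth-and-Cox idea described above, average the squared pairwise column inner products and make that as small as possible, can be written in a few lines. This is a sketch of that style of criterion, not their exact published formula:

```python
import numpy as np

def avg_sq_offdiag(X):
    """Average squared off-diagonal entry of X'X, i.e. the squared pairwise
    column inner products; smaller means closer to orthogonal."""
    S = X.T @ X
    mask = ~np.eye(S.shape[0], dtype=bool)
    return float(np.mean(S[mask] ** 2))

H = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)

assert avg_sq_offdiag(H) == 0.0         # orthogonal design: criterion is zero
assert avg_sq_offdiag(H[:3, :]) == 1.0  # 3 runs, 4 columns: supersaturated, nonzero
```

Dropping a row from the orthogonal 4x4 Hadamard matrix makes the criterion jump from zero to one, which is exactly the "can't be orthogonal with fewer runs than factors" point.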
If a run costs a million dollars to do, you don't want to do very many runs. You want to do as few as possible, and if you can do fewer runs than you have factors, all the better. I don't know about you, but I've done a lot of brainstorming exercises with stakeholders of processes, and it's very easy, when everybody writes a sticky note with a factor they think might be active, to end up with three dozen sticky notes on the wall, all different. So, what are you supposed to do then? Well, what often happens is that people are used to doing screening experiments with 6, 7, 8, maybe even 10 factors, but people get really nervous doing a screening experiment that has three dozen factors. And so, after this brainstorming session, the engineers decide, well, we can get rid of maybe 20 of these factors, because we know better than the people who picked those 20 factors. What I'm afraid of is that when you do that, you might be throwing the baby out with the bathwater, so to speak. The most important factor might be one of those 20 that you just decided to ignore. And the factors that you end up looking at may look like there's a huge amount of noise, because they're not taking account of this other, more important factor that was left out. I think that eliminating factors without any data is unprincipled; it's definitely not a statistical approach. Now, how do we construct these GO SSDs? Well, we start with a Hadamard matrix, call it H, of order m, and m has to be zero mod 4, which is just a fancy way of saying that m needs to be a multiple of four. Then we take another matrix T, which is a matrix of plus and minus ones that is w rows by q columns. Then we take the Kronecker product of H and T. By the way, that circle with an "x" in the middle of it is the symbol for the Kronecker product, and I'll explain what that is in the next slide.
That operation gives us a structure for X that's mw rows by mq columns, and since w is less than q, mw is less than mq. As a result, this is now a supersaturated design. We recommend that T be a Hadamard matrix with fewer than half of the rows removed. Here's an example. Here's my H; it's a Hadamard matrix, and one of the things about a Hadamard matrix is that its columns are pairwise orthogonal. In this example, T is just H with the last row removed. Then, to form H cross T (the Kronecker product), every element of H that's a plus one is replaced with T, and every element of H that's a minus one is replaced with -T. Now, T is a matrix; I'm replacing a single number with a three-by-four matrix everywhere here. H is four by four, but this new matrix is 12 by 16. So I have this much bigger matrix that I've formed by taking the Kronecker product of H and T. Of course, you don't need to do this yourself; JMP will tell you which ones you can make in the GO SSD designer, and it will all happen automatically. Call the Kronecker product of H and T "X." If you look at X'X, which is what you look at to understand the correlations of the factors, X'X looks like this, with the blocks that I showed you before. It's block diagonal. The first group of factors is uncorrelated with all the other groups of factors, the second group of factors is uncorrelated with all the others, and so forth. That's a very nice property. Here again is the example. I have a Hadamard matrix that's 4x4, so m is 4. My matrix T is 3x4, where I've just removed the last row from H, so w is 3 and q is 4. So the number of rows is m times w, or four times three, or 12, and the number of columns is m times q, which is four times four, or 16. Now the first column is all ones, so this is the constant column.
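The construction just described takes three lines in Python/NumPy (a sketch of the same example, not the JMP designer's output): build H, drop its last row to get T, and take the Kronecker product. The block-diagonal X'X follows from the identity kron(H,T)' kron(H,T) = kron(H'H, T'T) = kron(4I, T'T).

```python
import numpy as np

# The talk's example: H is a 4x4 Hadamard matrix (m = 4) and
# T is H with its last row removed (w = 3, q = 4).
H = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)
T = H[:3, :]

X = np.kron(H, T)  # every +1 of H becomes a copy of T, every -1 becomes -T

# X'X = kron(4*I, T'T): four groups of four columns, correlated within a
# group but exactly orthogonal across groups.
G = X.T @ X
assert X.shape == (12, 16)
assert np.allclose(G, np.kron(4.0 * np.eye(4), T.T @ T))
```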
It's what you would use to estimate the intercept. The next three columns are correlated with each other. The next four columns, columns A through D, are correlated with each other but not with anything else; E through H are correlated with each other but not with anything else; and I through L are correlated with each other but not with anything else. You get this block-diagonal correlation structure. So we have three groups of four factors that are correlated within each group. And then we have this first group of three columns that are correlated with each other more, and they're also unbalanced columns. So what my colleagues and I recommend is that, instead of actually assigning factors to those three columns, you leave them free; you don't assign them factors. So instead of having 15 factors, you only have 12 factors. Because those columns are uncorrelated with all the other factors, we can use them to estimate the error variance. Now, in supersaturated designs there has never been a way to estimate the error variance from within the design itself. This is a property of group orthogonal supersaturated designs that doesn't exist anywhere else in supersaturated design land. So we have three independent groups of four factors, and each factor group has rank three. The three fake-factor columns have rank two beyond the intercept, so I can estimate sigma squared with two degrees of freedom, assuming that two-factor interactions are negligible. Now, notice that when I created this Kronecker product, this group of factors is a fold-over, this one is a fold-over, and this one is a fold-over. So you have three groups of factors that are all fold-over designs. Remember that the DSD was an example of a fold-over design where all the factors folded over together, but this particular structure gives you some interesting properties. Any two-factor interaction involving factors in the same group is orthogonal to the main effects in that group.
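The "two degrees of freedom for error from the fake columns" claim can be checked by simulation. This is a hedged sketch, not the paper's published estimator: it projects the response onto the span of the fake columns beyond the intercept (rank 2, and exactly orthogonal to all real factors) and divides by that rank.

```python
import numpy as np

rng = np.random.default_rng(42)

H = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)
X = np.kron(H, H[:3, :])   # the 12x16 GO SSD; column 0 is the intercept
fake = X[:, :4]            # intercept + the three unassigned "fake" columns
real = X[:, 4:]            # the twelve real-factor columns

def proj(A):
    """Orthogonal projector onto the column space of A (rank-safe via pinv)."""
    return A @ np.linalg.pinv(A)

# Fake-column space beyond the intercept: rank 2, orthogonal to every real factor.
P = proj(fake) - proj(np.ones((12, 1)))

beta = np.zeros(12)
beta[[1, 5]] = 3.0         # two active real factors, chosen arbitrarily
sigma = 1.0

est = []
for _ in range(2000):
    y = real @ beta + rng.normal(0.0, sigma, 12)
    est.append(float(y @ P @ y) / 2.0)   # 2-df variance estimate
```

Because `P` annihilates every real-factor column, the active effects never leak into the estimate, and the average of `est` sits close to the true sigma squared of 1.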
If we look at the main effects in group one, all the main effects in group one are uncorrelated with all the two-factor interactions within that group. But wait, there's more. All the main effects in group one are also uncorrelated with two-factor interactions where one of the factors is in group one and the other factor is in group two. And the last thing is, all the main effects in group one are uncorrelated with all the two-factor interactions within any other group. So this construction gives you a supersaturated design that not only gives you a good way of estimating main effects, but also protects you from a lot of two-factor interactions. The only thing you don't get protection from is a main effect in group one paired with a two-factor interaction involving factors from two different groups, neither of them group one; that pair is not necessarily uncorrelated. Together, all of these properties make you want to think about how to load factors into the groups. One strategy would be to say, "Well, I want to put all the factors that I think are most likely to be highly important into one group." In that case, those factors will be uncorrelated with any of the two-factor interactions involving those factors. That's good. And then you would be more likely to have inactive groups, and if a group is inactive, then you can pool those factor estimates into the estimate of sigma squared and give the estimate of sigma squared more degrees of freedom, which means that you'll have more power for detecting other effects. The second strategy would be to put the effects that you think are most likely to be active in different groups. That is, put what I think a priori is the most active effect in group one, my second most active effect in group two, my third most active effect in group three, and so on. If you have your most likely effects in separate groups, you're less likely to have confounding of one factor effect with another factor effect.
My coauthors and I recommend the second strategy. This is a table in the paper that just got published, and it shows you all of the group orthogonal supersaturated designs up to 128 factors and 120 runs. Now, how do you analyze these things? Well, you can leverage the fake factors to get an unbiased assessment of the variance. You can use the group structure to identify active groups first, and then, after you know which groups are active, you can do regression to identify active factors within each group. And as you go, you can pool the sums of squares and degrees of freedom for inactive groups and inactive factors into the estimate of sigma squared, to give you more power for detecting effects. This is a mathematical slide that we can just skip, but what the slide means is that you can maximize your power for testing each group by making your a priori guess of the direction of the effect of each factor be positive. If you thought that the effect of a factor was negative, you would just relabel the minus ones as plus ones and the plus ones as minus ones, and that would maximize your power for identifying the group. What we do is identify active groups first and then identify active factors within the groups. Of course, this all happens automatically in the fitting. We compared our analytical method to the lasso and the Dantzig selector, looking at power and Type I error rates using GO SSDs versus the standard selectors. We chose three different supersaturated designs, we made different numbers of groups active, we looked at signal-to-noise ratios from one to three, and we had either one or two active factors per group. The number of active factors can range from one to 14 in these cases. Looking at the graph of the power results: here are the Dantzig results, and here is the lasso, followed by our two-stage method. Some of the powers for the Dantzig selector and the lasso are low.
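The two-stage idea, find active groups first, then regress within them, can be sketched in a few lines. This is a simplified illustration under stated assumptions (known sigma squared, one active factor), not the paper's actual procedure: each group's sum of squares comes from projecting the response onto that group's column space, which is orthogonal to every other group's.

```python
import numpy as np

rng = np.random.default_rng(3)

H = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)
X = np.kron(H, H[:3, :])                             # 12x16; columns 0-3 = intercept+fakes
groups = [X[:, 4 * g:4 * g + 4] for g in (1, 2, 3)]  # the three real-factor groups

def proj(A):
    return A @ np.linalg.pinv(A)

# Simulated response: one large active factor in the second real group.
beta = np.zeros(16)
beta[9] = 5.0
y = X @ beta + rng.normal(0.0, 1.0, 12)

# Stage 1: each group's (3-df) sum of squares per df, compared here against a
# known sigma^2 = 1; in practice the fake-factor estimate would go here.
ss_per_df = [float(y @ proj(G) @ y) / 3.0 for G in groups]
active = int(np.argmax(ss_per_df))                   # the second group stands out

# Stage 2: regression on the active group's columns only. Within a group the
# four columns have rank 3, so effects are partially aliased; lstsq returns
# the minimum-norm fit, and the paper's method resolves the aliasing.
coef, *_ = np.linalg.lstsq(groups[active], y, rcond=None)
```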
In fact, these are the cases where the signal-to-noise ratio is one. Otherwise, the Dantzig selector and the lasso are doing very well, in fact as well as the two-stage method, except in those cases where the signal-to-noise ratio is small. However, the two-stage method is always finding all of the active factors. In the paper by Marley and Woods, where they did a simulation study looking at Bayesian D-optimal supersaturated designs and other kinds of supersaturated designs, they basically said you cannot identify more active factors than a third of the number of runs. Well, in our case, we have 12 factors in 12 runs, so we would expect to be able to identify only four active factors. However, in the case where n is 24, we identified 14 active effects, whereas n over three is only eight. You can see that we're doing a lot better than what Marley and Woods say you should be able to do, given their simulation study. That is because of using a GO SSD. We did a case study to evaluate what makes JMP's custom design tool take longer to generate a design: quadratic terms in the model make it slower; if the optimality criterion is A-optimal, it's slightly slower; if you do 10 times as many random starts, it's slower; and if you have more factors, it'll be slower. Here's our design, and here's the analysis. Let me show you this in JMP. First, let me show you the design construction script. m, remember, is the order of the Hadamard matrix. There's a new JSL command called Hadamard. I'm going to create a new script with this, and I can run the script and look at the log. Here we go; here's the new JMP 16 log. If I run the first three things in this script, you can see m is four and w is three. And here's my 4x4 Hadamard matrix. Taking the first three rows creates T, and the group orthogonal supersaturated design is the 12x16 matrix here.
That's the matrix, and we can make a table out of it. And here's the table. So that's how easy it is to construct them by hand, but you don't have to do that; you can just get JMP to do all this for you. Looking at the pairwise column correlations, the table has the same pattern that I showed you before. Here's the case study that we ran. We make our first three columns fake factors, which we're going to use to estimate the variance, and these are all the real factors. When I fit the group orthogonal supersaturated design, there are two factors that are active in the second group; the first group I'm using to estimate the variance. There's one active factor in the third group and two in the fourth group. I have five active factors in all, and I end up having six degrees of freedom for error. That's kind of an amazing thing. Let me show you how you can do this in JMP. Here's the group orthogonal supersaturated design dialog. I could say I'm willing to do 12 runs, and you can do either two groups of size eight or four groups of size four. This is the same design that I just ran. So that's how easy it is to make one of these things, and the analysis tool is right under the designer, so you can just choose your factors, choose your response, and go, and you get the same analysis as I got before. Let me wrap up by going back to my slides. When do you want to use a GO SSD? It's when you have lots of factors, and runs are expensive, and you think that most of the factors are not active, but you don't know which ones are active and which are not. My final advice is to replace D-optimal with A-optimal for screening; if you were using D-optimal before, A-optimal is better. Use DSDs if you have mostly continuous factors, you're interested in possible curvature, and you don't have any restrictions on which levels of the factors you can use. Use GO SSDs in preference to removing a lot of factors in advance of getting any data.
Use them when you have a lot of factors and it's expensive to run the runs.
Fabio D'Ottaviano, R&D Statistician, Dow Inc Wenzhao Yang, Dr, Dow Inc   The large availability of undesigned data, a by-product of chemical industrial research and manufacturing, makes the venturesome use of machine learning attractive, for its plug-and-play appeal, in an attempt to extract value out of this data. Often this type of data reflects not only the response to controlled variation but also that caused by random effects not tagged in the data. Thus, machine learning based models in this industry may easily miss active random effects. This presentation uses simulation in JMP to show the effect of missing a random effect via machine learning — vs. including it properly via mixed models as a benchmark — in a context commonly encountered in the chemical industry — mixture experiments with process variables — and as a function of relative cluster size, total variance, proportion of variance attributed to the random effect, and data size. Simulation was employed because it allows the comparison — missing vs. not missing random effects — to be made clearly and in a simple manner while avoiding unwanted confounders found in real world data. Besides the long-established fact that machine learning performs better the larger the size of the data, it was also observed that data lacking due specificity — i.e., without clustering information — causes critical prediction biases regardless of the data size.   This presentation is based on a published paper of the same title.     Auto-generated transcript...   Speaker Transcript Fabio D'Ottaviano Okay, thanks everybody for watching this video. As you can see, I'll be talking about missing random effects in machine learning. It's work I did together with my colleague Wenzhao Yang; we both work for Dow in R&D and help develop new processes and, mainly, new products.
What you see here on this screen is a big bingo cage, because our talk is going to be about simulation, and simulation, at least to me, has a lot to do with a bingo cage: you decide the distribution of the balls and numbers inside the bingo cage, and then you just keep picking them as you want. All right. This talk also has to do with the publication we just released lately, with the same name, and when you have access to this presentation, you can just click here and you'll have access to the entire paper. So here is just a summary of what we have published there. Okay, what's the context for this? Well, first of all, machine learning has a kind of plug-and-play appeal; you know, you don't have to assume anything, and that's attractive. Besides, you have very user-friendly software out there these days, so, you know, people like to do that. However, random effects are everywhere, and a random effect is a funny thing, because it's a concept that is a little bit more complex, so it tends not to be taught in basic statistics courses; it shows up as a more advanced subject. So you're going to get a lot of people doing machine learning without much understanding of random effects. And even if they have that understanding, the concept of random effects still doesn't, you know, ring a loud bell with people doing machine learning, because there are just a few algorithms that can use random effects. You can check this reference here, where you see that there are some trees and random forests that can take random effects, but they're recent and they're not, you know, spread everywhere. So you're going to have a hard time finding something that can handle random effects in machine learning. Let me talk a little bit about random effects. As you can see here, at least in the chemical industry where I come from, we typically mix, let's say, three components, right? These yellow, red, and green ones.
We vary, you know, the percentage of each one of these components across different levels, and then we measure the response as we change the percentages of these components, using a certain piece of equipment, and sometimes you even have an operator or lab technician who will also interfere with the result that you want to see. Okay. And when we do this kind of experiment, we want to generalize the findings, right, or whatever prediction we are trying to get here. But the problem is that, you know, when I'm mixing this green component here, the next time I buy this green component from the supplier, the green shade may, you know, vary, and I don't know whether the batch the supplier gives me next time is going to be exactly the same green, because there is variability in supply. On top of that, I may run my experiment using a certain piece of equipment, but if I look around in my lab, or in other labs, I may have different makes of this equipment. And on top of that, the measurement may also depend on the operator who is doing it, right? So that may also interfere and kind of impoverish my prediction here, my generalization to whatever I want to predict. Besides, there is what is most typical in the chemical industry, I guess, which is experimental batch variability over time, if you repeat the same thing over and over again. Let's say you run an experiment here and you get your model, and your model can predict something; but then you repeat that experiment and get another model, and then yet another model, and the predictions of these three models may be considerably different, right? Not negligible. So there is also the component of time. So what's the problem I'm talking about here? Well, typically we have stored data, historical data, let's say, you know, a collection of bits and pieces of data from experiments done in the past.
And people were not that much concerned with generalizing the result at the time they ran the experiment. So when we collect it and call it historical data, we may or may not have tags for the random effects, right? And having tags, at least where I come from, is more the exception than the rule; usually there are no tags for the random effects, or at least not for all of them. Let's say you have tags. One thing you can do is use a machine learning technique that can handle these random effects: let them into the model, and that's it, you don't have a problem. But then, as I said, there are not many well-known machine learning techniques that can handle random effects. You may be tempted to use machine learning and let the random effects into the model as if they were fixed, and then you're going to run into this, you know, very well-known problem of treating a random effect as fixed. Just to mention one thing: you're going to have a hard time predicting any new outcome, because, for example, if your random effect is the operator, you have only a few operators in your data, right, a few names; but if a new operator does the work, you cannot predict what the effect of this new operator is going to be. So here there is no deal. And then there's the other thing you can do, whether you have tags or you don't: you fit, again, any machine learning technique, and if you have the tags, you ignore the random effect, and if you don't, you're going to be ignoring it anyway, whether you like it or not. So what I want to do is simulate this scenario. We use JMP Pro for this, and, you know, I hope you enjoy the results. The simulation, basically: I will use a mixed effect model, right, with fixed and random effects.
And then we use that same model to estimate — with the simulated response — to fit the model after I simulate, and we also model the results of my simulation here with a neural net, right? Then we use the predictive performance of this mixed model as a benchmark, and we compare it against the predictive performance of the neural net — you know, comparing their test-set R-squares to see what's going to happen. The neural net, of course, omits the random effect here, right? Then, okay — sometimes when I talk about this, people think that I'm comparing a linear mixed effects model versus, you know, a machine learning neural net. And that's not the case. Here we are comparing a model with and without random effects, given that there is a random effect in the data. I could do, for example, a linear model with random effects versus a linear model without random effects, and I could do a neural net with random effects versus a neural net without random effects. But the problem is that today there is no neural net, for example, that can handle random effects. So I'm forced to use, for example, a linear mixed effects model. My simulation factors? Well, I'll use something that is typical in the industry, which is a mixture-with-process-variables model. What is that? Let's say I have those three components I showed you before — you know, the yellow, red, green — and they each have a certain percentage, and they add up to one. I have a continuous variable, which, for example, can be temperature; I have a categorical variable, which can be catalyst type; and I have a random effect, which can be the variability from batch to batch of experiments. Okay. The simulation model — well, it's a pretty simple one. I have here my mixture main effects, M1, M2, and M3, right? And you will see all over this model that the fixed effects all have a coefficient of either one or minus one; I just assigned plus one or minus one randomly to them. So I have the mixture main effects.
Here I have the mixture two-way interactions, the interactions of the mixture with the continuous variable, and the interactions of the categorical variable with the components. And finally, the interaction of the continuous variable with the categorical variable. Plus I have this b_i here, which is my random effect, and the e_ij, which is my random error, right? And both are normally distributed, with certain variances here: I said this is the variance between batches of experiments, right, and this is the variance within a batch of experiments. Throughout this presentation, just to write this whole formula in a, let's say, neat way, I use this form here, where X actually represents all the fixed effects and beta represents all the parameters I used, right? And my y here — actually, the expected y is Xβ, right? It's this whole thing without my random effect here and without my random error. Simulation parameters: well, here I have one, which is data size, right — what happens if I have not so much data versus more data? Here I have two levels, 100 and 1,000 rows; at every set of experiments that's actually 20 rows per fixed effect, or 200. The other thing I will vary is going to be the size of the batch — or the cluster, whatever you'd like to call it. I have two levels, 4% and 25%. So 4% means, if I have 100 rows, that one batch of experiments — my batch — is going to be actually four rows; so I'm going to have 25 batches. If I have 100 rows in total and my batch size is 25%, then I have only four batches. Then the other variable we change here is going to be the total variance, right? And, well, we have two levels here, 0.5 and 2.5 — 0.5 is half the effect size, right, to choose, since the formula here is all ones for the fixed effects.
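The simulation model just described — mixture main effects, two-way blending terms, mixture-by-process interactions, plus a batch effect b_i and error e_ij — can be sketched in Python. Note the particular +1/−1 signs below are illustrative placeholders: the talk only says each fixed-effect coefficient was randomly assigned +1 or −1.

```python
import random

def fixed_part(m1, m2, m3, t, c):
    """Fixed-effects surface X*beta for the mixture-with-process-
    variables model. Every coefficient is +1 or -1; the specific
    signs here are placeholders, not the ones drawn in the talk."""
    return (m1 - m2 + m3                    # mixture main effects
            + m1*m2 - m1*m3 + m2*m3         # mixture two-way blending
            + t * (m1 - m2 + m3)            # mixture x continuous
            + c * (m1 + m2 - m3)            # mixture x categorical (c = +/-1)
            + t * c)                        # continuous x categorical

def simulate_y(m1, m2, m3, t, c, b_i, sigma_within, rng):
    """y_ij = X*beta + b_i + e_ij, where b_i is the batch random effect
    (drawn once per batch elsewhere) and e_ij ~ N(0, sigma_within)."""
    return fixed_part(m1, m2, m3, t, c) + b_i + rng.gauss(0.0, sigma_within)
```

The expected value E[y] is just `fixed_part(...)` — the whole formula without the random effect and the random error, exactly as described above.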
And the other level is 2.5, right? And this total variance is the summation of my between-batch and my within-batch variances, right? And lastly, the other thing I will change is the ratio of the between to the within variability, right? So I have one and four: in one, my between-batch variation is going to be equal to the within, and in the other it's going to be four times bigger than the within. Then, once I've settled these four factors here — I'd say parameters — I do a full factorial DOE, where I have 16 runs, right: two levels of data size, two levels of batch size, two levels of total variance, and two levels of the ratio. With that, I can calculate the within and between sigmas accordingly, right? And that's the setup for the simulation. Okay. Now, I call this a simulation hyperparameter, because you can change it as well, and I'll show you in the demo: it's 30 simulations per DOE run. So for every one of the 16 runs, I run 30 times each, right? So, for example, I'll have simulation runs 1, 2, 3, up to 30. And for the levels of the fixed effects, I use a space-filling design. The reason I use a space-filling design is that I don't want to confound the effect of a missing random effect with the fact that I possibly have some collinearity or sparse data — which is a typical thing in historical data, right? I don't want that in the middle of my way. I prefer a designed, space-filling design that will spread the levels of the fixed effects across the input space, so I get rid of this problem of sparsity in the data or collinearity, right? And then we allocate the batches: we randomize batches across the runs in the first simulation run, and then use the same sequence across all the other 29 runs, so all the simulation runs have the same batch allocation. And lastly, the random values will be redrawn for every one of the simulation runs.
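The 16-run setup and the variance partition can be sketched as follows. One assumption worth flagging: the talk doesn't say whether the 1:1 and 1:4 ratios apply to the sigmas or the variances; the sketch below assumes variances, which is consistent with the "sigma 0.5 and 0.5" he reads off later for run 1 (total variance 0.5, ratio 1).

```python
from itertools import product

def variance_split(total_var, ratio):
    """Partition total variance into between- and within-batch
    components, assuming sigma2_between = ratio * sigma2_within."""
    var_within = total_var / (1 + ratio)
    return total_var - var_within, var_within

# 2^4 full factorial over the four simulation parameters
levels = {"n_rows": [100, 1000], "batch_frac": [0.04, 0.25],
          "total_var": [0.5, 2.5], "ratio": [1, 4]}
runs = [dict(zip(levels, combo)) for combo in product(*levels.values())]
```

For run 1 (total variance 0.5, ratio 1) this gives variances of 0.25 and 0.25, i.e. sigmas of 0.5 and 0.5.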
So let me just get out of here and start JMP. What I do is use DOE > Special Purpose > Space Filling Design. I'll load my factors here, just to be fast. So anyway, here I have my five fixed effects. Space-filling designs don't accept mixture variables, so you need to set a linear constraint here, just to tell it: look, these three guys here need to add up to one. So that's what I'm going to do here. All right. So with that, the constraint is satisfied. I'll give the example of the first run of this DOE, which is where I have data size 100, relative batch size 4%, total variance 0.5, and the ratio is going to be one. So, back here, I need to generate 100 runs. Also, if you want to replicate this, you have to set the random seed and the number of starts, right? Then, the only option I have when I set a constraint is the Fast Flexible Filling design. And here we go — I get the table, right? So you can explore this table. One thing you'll see, if you use a ternary plot with your three components, is that everything is spread out. Oops — I have a problem here. Let me go back. There's a problem with the constraint, yeah — I forgot the one here. Yep. All right, let's start all over again: 100 runs, set random seed 1234, and the number of starts. Great. Next: Fast Flexible Filling, Make Table. Yeah. And then I just need to check that it is all spread out. And it is. All right, so then I look at my categorical variable here — I want to see if it's spread out across its levels. As you can see, it is, for one and two. Great. Now, let me close all that; we'll do 30 simulations. So this is one simulation, right — I have 100 rows — but now I need to do 30. So what I will do is add rows here, 2,900 at the end, and then just select these first 100 runs — sorry, 100 rows — all the variables here, and Fill to End of Table. Great.
So now I've repeated the same design 30 times, right? Now, to make it faster, I'll just open — I'm not using this table again; I'll just use another table where I have everything I want to show already set up. So, yeah, I'm back to this table, right? The next thing I do is just create a simulation column here, with a formula that says: look, up to row 100 this is simulation 1, and then every 100 rows the simulation number changes. So at the end of the day I get, obviously, 30 simulations with 100 rows each. Great. Then the batch allocation — just to explain what I did; I showed that in the PowerPoint. I have a formula that will create batches of 4% of the total data size, which means I have four rows per batch here, then four more, and so on. And once it gets to row 100 and I'm jumping from simulation 1 to simulation 2, it starts all over again, right? So I have, at the end of the day, all 25 batches here. Okay. Then the next thing I will do is create my validation column, which means I need to split the data set, right? So, back to the demonstration — back to the PowerPoint here — you see that for the neural net solution I'm going to create, I divide the rows: 60% of them will belong to the training set, 20% to the validation set, and the other 20% to the test set. So how do I do that? In that case, again, back to JMP. There we go — let me hide this. Okay, so here's the validation column. How do I do that? It's already there, but let me explain how you do it: you go to Analyze > Predictive Modeling > Make Validation Column, you use your simulation column as the stratification, and then you just do 0.6, 0.2, 0.2, and use a random seed here, just to make reproducibility sure. That's how I create that column, right? Then, if I go back to my presentation here: 60% of the rows belong to the training set for the neural net, and 20% belong to the validation set, also for the neural net.
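As a rough Python sketch, the three bookkeeping columns just described — simulation number, batch number, and the 60/20/20 validation code — might look like this, assuming run 1's settings (100 rows per simulation, 4-row batches):

```python
import random

N_PER_SIM, BATCH_ROWS = 100, 4   # DOE run 1: 4% of 100 rows per batch

def sim_number(row):
    """Simulation column: rows 0-99 -> 1, rows 100-199 -> 2, and so on."""
    return row // N_PER_SIM + 1

def batch_number(row):
    """Batch column: four rows per batch, restarting at 1 at the top
    of every simulation (so 25 batches per 100-row simulation)."""
    return (row % N_PER_SIM) // BATCH_ROWS + 1

def make_validation_column(n, seed=0):
    """60/20/20 split for the neural net: 0 = training, 1 = validation,
    2 = test. (In the talk this is stratified on the simulation column;
    here it is a plain shuffle for one simulation of n rows.)"""
    codes = [0] * int(n * 0.6) + [1] * int(n * 0.2) + [2] * int(n * 0.2)
    random.Random(seed).shuffle(codes)
    return codes
```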
Now, both of those belong to a set called the modeling set for the mixed effects model. For the mixed effects model there will be no validation: we just estimate the model with this 80%, and the test set of the mixed effects solution will be the same 20% that I use for the neural net solution. So, in that case, I go back to JMP — go here — and I just create a formula where, in the validation column for the neural net, zero means training (so training is still training), one is validation, and two is going to be my test set; and for the mixed model, zero and one become my modeling set, and two becomes my test set here too. So I created this, and then you use value labels: zero is going to be labeled "modeling" and one is going to be labeled "test." That way you see here that whatever is test here stays test here, but what was training and validation becomes modeling, right? Then, finally, I need to set the formulas for my response. So, for the expected value of my simulation, I just have the fixed-effects formula here, right — there is no random term here. All right, and that's my expected value, my E[y]. And my y_ij is going to be this — and if you look at the formula, you have the y, which is my expected value, plus a formula here generating a random value following a normal distribution with mean zero and the between sigma. The between sigma — I like to set the two sigmas here as table variables, because then I can change the values later without going into the formula. But anyway, this is going to generate a single value every time the batch number changes. So if my batch is 1 here, it's going to be the same value; when it changes to batch 2, it creates another value; and also when I change from one simulation to another. So I will have one value per batch — 25 for simulation 1 — and then, when I jump to batch 1 of simulation 2, it creates another value, right? And here is just my normal random error, with the within sigma that I set in the table here, right?
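Two pieces of that bookkeeping can be sketched in Python: collapsing the neural net's train/validation codes into the mixed model's "modeling" set, and drawing one normal value per (simulation, batch) pair, the way the table-variable formula does. This is an illustrative sketch, not the JSL formula from the talk.

```python
import random

def batch_effects(sim_ids, batch_ids, sigma_between, seed=0):
    """One N(0, sigma_between) draw per (simulation, batch) pair,
    broadcast to every row of that batch -- mimicking the formula
    that produces a fresh value whenever the batch number or the
    simulation number changes."""
    rng = random.Random(seed)
    draws, out = {}, []
    for key in zip(sim_ids, batch_ids):
        if key not in draws:
            draws[key] = rng.gauss(0.0, sigma_between)
        out.append(draws[key])
    return out

# Mixed-model set: training (0) and validation (1) collapse to 'modeling'
nn_codes = [0, 1, 2, 0, 2]                           # 0=train, 1=valid, 2=test
mixed_codes = [0 if c < 2 else 1 for c in nn_codes]  # 0=modeling, 1=test
```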
So, replicating DOE run 1, I have sigma 0.5 and 0.5. All right — so now I have here the solution of my mixed effects model for the simulation. Before that, let me go back here and show you what I am doing. For example, for the mixed effects model: my simulation model is this, but my fitted model will have the beta-hat from the analysis, and my b here is going to be b-hat, right — the estimated values for what I simulated. And in the mixed effects model I have, let's say, two prediction models: one is the conditional model, where I use my estimates of the random effects, and the other one is my marginal model, where I don't, right? So I have these two types of model. This one is good for predicting things that are already in the data, and this one is what I use to predict data I don't have in the data set, right? For the neural net, I'm using, you know, the standard default kind of neural net in JMP — not because it's the best, but because it pretty much works. You have here all five fixed effects; I use one layer with three nodes, all hyperbolic tangent functions, as you can see here. And then you have here a function called h(x), which is the summation of these tanh functions plus a bias term here, right? If I add more nodes, it doesn't make it any better; if I use only two nodes, then it gets even worse. So I'm going to use this all over. And that's what I'm going to show you here. Okay. So here I have my mixed effects model solution. How did I come up with that? Just to show you: I have here the response; I put the validation — the validation of the mixed effects, by simulation; all my fixed effects here; and my random effect, which is my batch, right? And then I ran this first simulation.
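The h(x) function just described — one hidden layer of three tanh nodes plus a linear output with a bias term — can be sketched as plain Python. All the weights here are placeholders for illustration, not fitted values from any run in the talk.

```python
import math

def nn_predict(x, W, b, w_out, b_out):
    """JMP-default-style net from the talk: x holds the five fixed-effect
    inputs; W and b parameterize three tanh hidden nodes; the output is
    a weighted sum of the hidden nodes plus a bias term."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
              for row, bi in zip(W, b)]
    return sum(wo * h for wo, h in zip(w_out, hidden)) + b_out
```

With all hidden weights zero, every tanh node outputs zero and the prediction collapses to the output bias, which is a quick sanity check on the structure.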
You see simulation 1, and it goes all the way to simulation 30, right? I couldn't use, for example, the Simulate function of JMP here, because I'm changing the validation column for every batch — so I cannot, or at least I don't know how to, incorporate the validation column into the formula for y_ij. Okay. So — oops, back here — now I have another script here for the neural net. It's going to take a little bit, but it shouldn't be a big problem. When I'm doing, for example, the DOE runs with 1,000 rows per simulation, it can take quite a lot of time — something like maybe 10 to 15 minutes — to do all thirty simulations. There we go. Okay, so here I have — again, if you look, every one of them has five inputs and three nodes, right — and you have simulation 1 all the way to simulation 30, right? So now I have all that done for run 1 of my DOE. So the next thing I need to understand is which types of R-squares I'm going to compare. There are actually five types of R-square here, right? So here's the R-square formula. Why do I differentiate R-squares by type here? Because, depending on what you use as "actual" versus "predicted" in this formula, you know, your R-square changes. So, for example, I have here one type of R-square where I compare, say, the rows of the training set — what I simulated versus the formula I got from the neural net. And, you know, that's actually the case for all three sets: they are the same thing, because the y and the y-hats I'm comparing are the same. So I have type A, right?
Then I have another type — call it type B — where I'm not comparing the simulated value, with the random effect and the random error terms; I'm comparing the expected value of my simulation versus the formula I got. So the test-set rows are always the same; it's just the way you calculate the R-square that's going to change. Because in this one here, what I call the conditional test set, I see the apparent future performance — because that's exactly what we get when we have any real data set, since we cannot tell the real model; for that, you need to simulate. And then you have the expected test set, which is actually the same rows, but now I'm comparing against the expected value, and I can tell — for the lack of a better word — the real future performance. So the apparent performance is not necessarily the real future performance. Okay. For the mixed effects model, it's the same. Now I have another type of R-square, type C, because here I'm comparing the simulated value versus my conditional prediction formula, using the estimated b-hats for the random effects. But when I want to predict the future — well, I break the test set out both ways: I have another R-square here, type D, which compares the whole simulated value y_ij versus my marginal model here — I'm not using the b's here, right? And lastly, I have a fifth type of R-square, which is my expected test set: again, the test set is always the same rows; it's just that I'm now using the expected value versus my marginal model. So, the problem is, if you're not careful there, you may calculate the wrong R-square. So what I do is, whenever I fit the mixed effects model, I don't use anything that is in this report; all I do is save the columns. I save my marginal model prediction formula here, and I save the prediction formula of the conditional model, right, and I create columns with those formulas. That's for the mixed effects model.
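The five "types" all share one R-square formula; only the choice of "actual" versus "predicted" changes. A minimal sketch (the A–E labels in the comment follow the talk's scheme loosely):

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot. Which of the five 'types' you get
    depends entirely on what you plug in as actual vs. predicted."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# On the same test rows:
#   apparent performance (types A/C/D): r_squared(simulated_y_ij, prediction)
#   real performance (types B/E):       r_squared(expected_y,     prediction)
```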
Now, for the neural net, you can also save the formulas. I like to save "fast formulas," because I just want to calculate the R-squares. So I save those as fast formulas, and then, as you see, I create these five columns here. All right, so let me go to them. So now, my type A — if you remember type A from the presentation — is comparing the simulated y_ij versus my neural net model. So what I do here — sorry, what I do here — is go to Column Info, and you see here "Predicting" set to y_ij. Okay, here it is. Now, I have here the saved neural net formula twice — this value is equal to this value — but I just change "predicting what" here: this one is predicting the expected value. Right, so that way I can use this function in JMP called Model Comparison: I go and use this type A column, and I do By simulation, and I group by validation. And then what you get is all the R-squares you need to see, from simulation 1 all the way down to simulation 30, and you have it by set, so you can later do Combine Data Tables and you get everything neat, right, for those R-squares. So for the other ones I have scripts here too. For example, for type B, it uses the same formula column — now, with this column, I only predict the test set: the R-square for the test set, not for the modeling set, not for the validation set. So simulation 1 is here, and then if you go all the way down, there you have simulation 30. And again, you can always combine data tables, and your data comes out in the same table format for all of these R-squares. And obviously, I have another script here for the type C and D R-squares, which use the modeling set of the mixed effects model, right — simulation 1 all the way to simulation 30 — and do the same; and then on the test set, but now I'm predicting y_ij, right? So the, say, secret here is that sometimes you have the same formula in more than one column.
Here again I have the same formula, because it's the marginal model of the mixed effects solution; the only thing that changes is the Column Info — make sure you set "predicting what" — and then you can use this to calculate all these R-squares. All right, let me go back to the presentation. And now, since I got all those R-squares together, you stack your tables, and then you can do whatever visualization you want. But here I'm interested really in the conditional test set of both solutions, and the expected test set. Here, you know, I could spend 45 minutes just talking about this table display, but I'm not really interested in the absolute values of the R-squares — it's more a comparative kind of way of looking at the R-squares. But I need to check one thing first, which is, as you can see here, the data size. Let me use the pointer here, to make it easier. I have all the R-squares I created here versus all the DOE factors, right? So you see that when I have a small data set, what happens is that my neural net appears to be trained correctly, because my training and my validation sets kind of have R-square distributions that overlap. But then, when you look at the conditional test set — which is actually the data we always have, right, because we never have the expected value — it's always at a lower level, right? As you see, for all of these, when I have the small data set. But then, when I have a bigger data set, the situation is different: with 1,000 rows, they are all aligned, you know, they kind of overlap. So it did train correctly. Again, the absolute value of the R-square here is not of much interest; what I really care about — if you go back to one of the earlier slides here — you see that now I want to compare the predictive performance of my benchmark mixed effects model versus my neural net, comparing the test-set R-squares, right? So, here's what you get.
Let me get this pointer. So what you get here is, again, all the runs I had for the DOE, and whether they are the neural net solution or the mixed effects solution. So, you see that for the conditional test set — which is the apparent performance, what you'd see if you were running this — when the data sizes are small, the mixed effects model is always doing a better job here, when you include the random effect, versus the neural net: their medians, or even averages, are all higher. But then, when you have a bigger data set, that difference kind of doesn't exist anymore, to the point that even the neural net is doing a better job here. But that's the apparent performance. Now, the real one: you see now that the mixed effects model has indeed been doing a better job versus the neural net, and here there is no more, you know, even ground, because at the end of the day the mixed effects model is doing a better job, especially in this scenario here, which is the bigger data, larger variability, and more between- than within-batch variance, right? Now, for the plots I have here, what I'm going to do for every simulation run is take the difference between the mixed effects R-square and the neural net R-square. So, here we go — here I have four plots, right? So let's just concentrate on one. What you have on the y-axis is the difference in conditional test-set R-square — sorry, the mixed effects R-square minus the neural net R-square. So that's what you get here, and that is the difference in apparent future performance. And here on the x-axis, what you have is the difference in the expected test-set R-square, which is your real future performance, right — or bias, if you want. Now, I have four plots.
Why? Because, if you think about it, when you have historical data where you don't have the tags, you know, if you're analyzing the data, you possibly have control over just two things, which are the data size and the relative batch size. Why? Because you cannot control what the variability is going to be in your data, or whether your random effects are going to be much bigger than your random error. So the only two things you can possibly have any control over are data size and relative batch size. Even if you don't have the tags, you can at least have an idea, if you know your historical data, whether it should be comprised of many, many batches or just one or two batches, right? So that's the kind of control you have: you can at least have an idea of the batch size when you have historical data. So what I'm comparing here: again, I have the difference in apparent performance — if this difference is positive, it means the mixed effects model has the better apparent performance; and if this difference here is also positive, it means the mixed effects model has the better future performance, right? And as you can see here, when you have a small data set, it doesn't matter what you do: the mixed effects model has the better performance, and sometimes much, much better — because we're talking about differences in R-square that can go way over one. Right, so it's getting much better performance, both the apparent one and the future one. So when the data sizes are small, there's really no contest here. However, look at the bigger data size, right, but with the small number of batches. Here, something funny happens, because here, you know, the difference in apparent future performance on the y-axis is negative for most of them — which means the neural net is doing a better job in terms of the apparent test-set R-square, right, the conditional test-set R-square. So, when you do it, it's going to look like the neural net
did a great job — better than the mixed effects model. However, when you look at the x-axis, right, which is the difference in real future performance, it can be pretty misleading. Right — so here, when you have a lot of data but just a few batches, you know, you're going to get nice test-set R-squares. But then, when you try to deploy your model in the future, you may get into trouble. But then, when you look here, here we have a mitigated situation, where you have a lot of data and a lot of batches, so they tend to be not that much different. Right, so, as a conclusion: if you ignore a non-negligible random effect in machine learning when the data set is small, you know, the test-set predictive performance will most likely be poor, regardless of how many clusters or batches you have. And that's because machine learning requires a minimum data size for success, right? So there's no way to win the game here. Now, when the data size is large and you just have a few clusters — that's the kind of misleading situation, because your test-set predictive performance can be good, but the performance will likely be poorer later, when you deploy the model. Some people tell me, "Well, why don't you use regularization?" I said, well, even if you would, you would not do it in these situations, because your test-set R-square is going to be good, and then you don't know you need it, right? So you won't be able to tell, you know, what your long-term future performance is just by looking at your test-set R-square, or some sum of errors. But then, when your data set is large and you have many clusters, the whole situation is mitigated, and the biasing effect of the clusters kind of averages out, because for the random effects, you know, the expected summation of all of them is zero.
So the more batches you have, the less bias you tend to get. On top of that, you know, I just wanted to say that what I learned from this is that when the data is not designed on purpose, there are two things I always remember: machine learning cannot tackle data just because it is big — you've got to have a minimum level of design, right, to make it work; but the bigger the data, the more likely it is that this minimal level of design is already present in the data just by sheer chance. All right. And thank you — if you want to contact us, we are in the JMP Community. These are our addresses. Thank you.
Scott Wise, Senior Manager, JMP Education Team, SAS   A picture is said to be worth a thousand words, and the visuals that can be created in JMP Graph Builder can be considered fine works of art in their ability to convey compelling information to the viewer. This journal presentation features how to build popular and captivating advanced graph views using JMP Graph Builder. Based on the popular Pictures from the Gallery journals, this Gallery 5 presentation and journal features new views available in the latest versions of JMP. We will feature several popular industry graph formats that you may not have known could be easily built within JMP. Views such as Ridgeline Plots, Contour Bag Plots, Informative Box Plots and more will be included that can help breathe life into your graphs and provide a compelling platform to help manage up your results.     Auto-generated transcript...   Speaker Transcript Greetings and welcome to Pictures from the Gallery Five: More Advanced Graph Builder Views. In past presentations many weird things have occurred, like being interrupted by visits from aliens. We apologize, and we will show a more serious graph demo to start our presentation.   Let's use Graph Builder to answer the question: what country has the tallest buildings? To do this we'll use wooden blocks that represent every 150 meters of structural height. On the back of each block is a barcode that we can use to directly scan info into JMP.   By physically using the blocks, we now have a physical scale model.   This automatically populated the data in JMP and generated a bar graph in descending order.   Wait, do you hear that?   Okay, Godzilla. Would you rather we showed a graph about you?   So let's do that. Scott Wise Welcome, everybody. Welcome to Pictures from the Gallery. My name is Scott Wise, and hopefully you enjoyed that little video.
And our whole idea is to show you some advanced things in Graph Builder you probably didn't know you could automatically do.   So we're going to show you some smarter things we can do, maybe more compelling ways we can show Godzilla's story. And there was an article that came out, and it had an alarming trend in it. It said, since Godzilla began, back in 1954, he has grown bigger and bigger and bigger.   On...to the point he's much larger than he used to be.   That's pretty disconcerting. He's pretty destructive from the very start. So we'd like to show this. And of course, we've got to get the data into JMP; that was fairly easy to do.   So here's the data in JMP. We have the meters high. So we see he's grown from 50 meters high all the way to 150 meters high in his last picture, which was in 2019.   Now, a couple of things we can do. This has been an option in JMP for quite a while, but you might not have seen it before: you can put pictures into JMP.   And if I add a column here, you can see it's got this "empty" with the brackets. And if I go to the column info, this is an expression data type.   And a picture is one of the things it can take. And so if I take a look at   just any picture — and I've got these as separate pictures here in my PowerPoint — all I have to do is grab it,   drop it directly into my JMP table, and it will size itself. So that's pretty cool. And you can as well   label this column so it'll show up when I hover over points in my graphs. So I'm going to delete that row, but here are those pictures I have.   OK, now let's go back to our data. Let's take a look at just building out a simple bar chart. I'm just going to put meters high on the Y. I'll put the year on the X. I'll ask for bars up here from the bar element.   Now this doesn't have the same view I had before. There's a lot of space in between these bars. If you want to change them, right-click into the bar area. Go to Customize.   
Click on Bar, and on the width proportion, type in something like .99 or something below one here, and you can see that filled out the space, and there's not a lot of open gap in between them. So this is pretty cool. And so I have this information.   This has given me the relative height of Godzilla over the years of making movies. And of course, if you hover over them, because I have the label turned on, the picture will show up. But what if I wanted to make the picture be the bar? Is there a way to do that? Yes, there is. So I copied this bar chart into my PowerPoint presentation.   And kind of used it as a template. And then I said, well, gee, can I just take the pictures and, using that JMP bar chart as a template, can I get them onto the same size scale? And yes, you can. You can see here I just massaged them into place.   So I got them all just kind of oriented into place.   And then, of course, we were able to use that as a group picture. So now I have a relatively scaled group picture.   So this was very useful, because what I can do   is, if I come back   into JMP,   let's bring back up my Graph Builder.   And I just take this grouping,   I can put it into my graph.   It just kind of snaps right in there.   Then you can work on positioning it, get it into the right format. And of course, if you go and make this transparent, really large,   that really helps.   You can build it out and shrink the graph and get it down to the right size. So it takes a little bit of finagling to do, but the result is   you can then match that sized graph. Now, another graph I'd like to make is a bar and needle chart.   So I kind of like these: you can make the circle kind of stand out. You can even size something by the circle, kind of like pins, right — the top of a pin. And then you've got a long line there, kind of connecting it to the label.
So that's, that's kind of a nice view.   So that's very easy to do within JMP. If I go back, go ahead, put meters high, put year, go to bar, but this time under bar style, select needle.   And maybe I will also add some points and, of course, I can make the size be the meters high.   And I can as well give it some color. Maybe I'll color it by the...where...where Godzilla attacked. And if I want to make the circles show more representation   I can right click on this marker size there, where it showed me that meters high is the size of the circle, maybe increase that to something like 12. Now I get a much more separation from 1954 and 2019.   And again, the pictures will surface.   And you can even stop there. I went a little further.   I created this chart. And on this chart,   on the axis settings, I put in a reference line by where the maximum depth of the respective target harbors were.   And why we did this was, I saw a funny article where they were given a hypothetical to the to the emergency coordinator of New York City.   This is not long after Godzilla had visited there. And so very dinosaurish looking Godzilla in 1998.   And he said if Godzilla ever comes back, "Are you worried? Are you prepared?", and he said, "No problem. We'll evacuate the city, very quickly." And he said, "why is that?" And he says, "Well, gee, the actual   you know, maximum depth of New York Harbor is only so big here and Godzilla is way up here. So we'll see him coming along way out before he ever hits land." So I thought that was   pretty amusing. Of course you can combine these charts. We can do just what we did and put that scaled picture, scaled bar chart in with the needle chart so I can have, you know, a lot of information here, including the harbor depth to targets. And now of course see in the picture of Godzilla.   All right, so hopefully you enjoyed that, uh, that little demonstration here. And that's really what   this whole presentations about. 
this whole presentation is about. We call this Pictures from the Gallery, and we're challenged every year to come up with a handful of advanced views that maybe folks had done   with a lot of pain in spreadsheets and other packages, or that really challenged JMP to create.   And JMP is so flexible and so interactive that there are a lot of great views you can get that can make your data even more compelling.   And so here, without further ado, is the version of Pictures from the Gallery for this year. We're on our fifth edition, and we've got six beautiful views here.   Number one is an informative box plot.   Number two is a ridge plot density chart. Number three is   having multiple ranges as an area range plot in between my lines.   Number four is an informative points plot; it's kind of a unique view to look at points and size points.   Number five is a bag plot with outlier boxes --   excuse me, a bag plot, not a box plot. The bag plot is new functionality,   a two-dimensional way of seeing outliers, and I even included some outlier boxes on the edges. And number six is a components effects plot,   which helps you when you have mixture components that have to add up to 100%: it helps you figure out a way of showing them on a graph where you can look at your mixture settings and see how they respond to an output,   maybe something you try to experiment on or model. So that's very handy.   So these are the six views. Now, we probably don't have time to go through all six. If I were doing this presentation live I might take a vote, but I can tell you, I've got a pretty good idea from doing this before that   I'm going to show you the most popular views first. And whatever we don't cover, I'll be glad to cover   later for you. I'm going to leave you with instructions and the script to generate it yourself.   And the beauty of this is you're going to get a gift from us.
This gift is going to be this Pictures from the Gallery journal that you can always go back to and use when you want to replicate one of these views or practice.   Let's start with the ridgeline plot. This was a new view in JMP 15,   and it is showing you a lot of stacked histograms on top of each other, against a bunch of categorical levels   on my y axis. And it's very useful, especially if you're plotting signal data or growth data and you want to look at it in comparison to a reference. So this was some   real medical data, and we have this DMSO drug;   they had some measure of area, where they took the log of that measure, and I want to find how things are the same or different than my reference, my red   distribution. So let me show you how we set this up. Again, you have good information in your journal, tips on how to make it. We're going to use that data, and we're going to use a kernel density and a bunch of ordering commands.   And here are the steps we're going to follow to make this view, so you can go through and see these yourself. Always attached is the data, and always attached is the finished script. So we're going to try to generate this chart. Let me start from scratch.   I'm going to put the drug on the Y axis. I'm going to put the area log   on the X axis, and it gives me box plots, and that's fine. Now, something I'll do: I'll take this DMSO   and put it in the overlay. That's kind of setting up my red and blue -- this is my reference, and these are the things I'm comparing it to.   Now before we begin, anything you have in the chart can help you order things in your Graph Builder. If I go over the Y axis and right click, I can order by, and I can order by the area log10 descending.   So I do that, and it orders by something. What is it ordering by? If I right click on there, now that I've activated the order by, it shows me all the statistics. It defaults to the mean.
I have a feeling median is better, because not all these are normal; some of them have quite long tails, so I might do the median here. And does that change things? A little bit.   So now it is ordering from top to bottom according to the median, and you can kind of tell that with the median lines there, the 50% quantiles of these box plots.   But I don't want box plots, I want bars. So all I have to do is come up here and click on the bar icon.   Now we're looking pretty good, but how do I get those smooth lines, and how do I get an overlay? All you have to do, under histogram style, down here in the little control panel for histograms, is select kernel density.   And you get two smoothers: you get an overlap and you get a smoothness. The smoothness controls how bumpy or smooth you want to make these lines, and I like them just slightly bumpy.   And the overlap is controlling how much they overlap with the next level, and you can give it a lot of overlap or a little overlap -- whatever makes the most sense.   And that's it. Additionally, you can add some reference lines, by right-clicking and going to axis settings, and add some reference lines down here to help your view.   If I go to the one in our script, you can see that for the DMSO, I've added where its median is and where its min and max are.   And you can kind of get an idea which ones are very similar in center, which ones are similar in shape, and which ones are very different from each other. So that's the ridge plot density.   Again, we'll give you this journal so you can replicate it.   And let's move on to the next view. The next view we're going to look at, probably the second most popular out of all of these, is the bag plot with outlier boxes. The bag plot is a new kind   of chart that gives you a two-dimensional view.   So I've got (this is pollution data) ozone on my Y, and I've got particulates (PM10s)   on my x axis.
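As an editor's aside on the ridgeline view just described: the kernel density smoothing behind those curves is easy to sketch outside JMP. This is a minimal plain-Python sketch, not JMP's implementation; the `readings` values are made up, and the `bandwidth` argument plays the role of the smoothness control (smaller bandwidth, bumpier curve).

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a density estimate built from one Gaussian bump per data point.

    bandwidth acts like Graph Builder's smoothness slider: small values
    give a bumpy curve, large values give a smooth one.
    """
    n = len(values)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                   for v in values) / norm

    return density

# Hypothetical log10(area) readings for one drug level.
readings = [1.0, 1.1, 1.2, 1.2, 1.3, 1.8]
smooth = gaussian_kde(readings, bandwidth=0.3)   # smoother curve
bumpy = gaussian_kde(readings, bandwidth=0.05)   # bumpier curve
```

Evaluating one such density on a grid of x values gives one ridge; stacking one curve per drug with a vertical offset reproduces the ridgeline layout.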
So now I've got this bag plot here   that's going to find a center, and that's this little asterisk -- a little hard to see; I'll make it a little bigger when we do this.   That little asterisk there is really the center of the two-dimensional space between ozone and PM10.   And it draws some fences: it draws a little   area closest to that two-dimensional grouping, and then it draws a fence outside, and it says, if any point falls beyond the fence, like this point right here,   it is truly an outlier in a two-dimensional sense, with respect to both PM10 and ozone, and that's kind of cool. And what I did was, on the edges, I put in some box plots,   because I wanted to see, from a one-dimensional standpoint, if I just looked at PM10, what would have been out?   Well, this point right here, which was St. Louis, was not outside the bag plot fence, but it was outside for PM10.   And why it's not outside the bag plot fence is that it's almost on the median of ozone. It's right in there.   But Los Angeles would have been out for ozone, and it is not out for PM10. So I thought that was a really interesting view. So let's see how we can do that. Again, I've got these Graph Builder steps. We are going to be using contour plots   to help us do this, and we're going to be using a bag plot in the contour plot and a dummy variable.   So let me pull up the pollutants map.   Here we go.   All right, so I've got my PM10.   I've got my ozone.   I've got my city, of course.   Now I've got a dummy variable. And it really is a dummy variable: you see, all I have in here are x's. Just something categorical but repeating that   can be used to open a new section of your Graph Builder. So this is a trick that shows up a lot in Pictures from the Gallery.   So we're going to take my ozone on the Y, and we'll take my PM10 on the X.
I'll turn off that smoother line, but I'm going to add in the contour plot up here from those graph elements.   Now here, if I look at the elements for contour, here's where I can do the bag plot.   This sets up that bag plot here, which is pretty cool.   I think I'll color this one purple, just so I can point out that's where the middle of the two-dimensional space is.   And that's very cool. We can see that point fallen out there, which was Los Angeles. That's well away. Now, what if I want to throw those box plots in as well?   Well, I'm going to take my dummy variable and drag it right down here, to almost the start of the leftmost X axis. Then it tries to do something with it; it says, oh, you've got a category, it's x. Well, I can right click in here and I can change that to a box plot.   And I'm going to do the same thing on the very top of the y axis: come right in here and change it.   I just right clicked in here; that's all I'm doing. From points, change that to box plot. There it is. There we go.   This section in here doesn't really mean anything, so I'm going to right click, and I don't want a box plot in this little top square; I want nothing, so I just removed it.   Now, to size these a little better, I'm going to hover over the dummy label and pull it all the way down until I get that little diagonal line, and then this lets me move the width. There's the width I can move here. And I can do the same thing   on the x axis and make it look a little prettier.   Which is kind of cool.   And of course we can come in here, and we don't have to use the word "dummy" in here -- you might not want dummy in a smart graph. We can take the label out   by double clicking on that axis and taking out the label, just kind of ghosting it over there. And now I've got my box plot, and now I've got, as well, my bag plot, all in one.
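An editor's sketch of the idea behind the bag plot: JMP's bag plot is built on halfspace depth, but as a rough stand-in for the same concept -- flagging a point that is jointly unusual even when neither coordinate is extreme on its own -- the Mahalanobis distance works in a few lines. The data below are made up for illustration, not the pollution data from the demo.

```python
import statistics

def mahalanobis_sq(point, xs, ys):
    """Squared Mahalanobis distance of (x, y) from the cloud's center,
    accounting for the correlation between the two variables."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    n = len(xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mx, point[1] - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Made-up correlated data plus one point, (2, 9.0), that runs against
# the correlation: unremarkable on either axis alone, extreme jointly.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2]
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 8.0, 8.8, 10.1, 9.0]

scores = [mahalanobis_sq(p, xs, ys) for p in zip(xs, ys)]
# A chi-square-style cutoff (97.5% point for 2 df) flags only that point.
flagged = [p for p, s in zip(zip(xs, ys), scores) if s > 7.38]
```

This mirrors the Los Angeles story: a point can sit inside both one-dimensional ranges yet still fall outside the two-dimensional fence.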
And this is another good one for reference lines. I showed you how to do reference lines earlier, but here you see I drew in some reference lines.   And there's hover help as well if you've got things you can label here, and I think we did put a label on city.   And of course we get ozone and PM10 because they're in the graph. You can pin these, and you see I pinned St. Louis and I pinned Los Angeles. They can be moved all over the place.   But I drew some lines and dashed lines, so I can make my point that Los Angeles is truly an outlier in two-dimensional space, where St. Louis is only an outlier with respect to PM10.   So that is how we do a bag plot with outlier boxes.   And again, as we go through this presentation, feel free to leave a question in the chat, and we'll get back with you, glad to reproduce some of these views for you or answer any questions. This usually generates a lot of other ideas on graphs that maybe you've been itching to make.   All right, so the next one we're going to take a look at is number six. This is the components effects plot. As I mentioned when I was going through the   Pictures from the Gallery, this one is dealing with mixtures, and I have a whole bunch of diluents. For example, they're all in the same vat.   And this is a vat of solutions, chemicals, and I can't have them add up over 100%. And so there are mixture designs and mixture modeling that help you put in the constraint so that no one   ingredient dominates, and when they add up, they all have to add to 100%. That's kind of why you're seeing all five of them here at only 20%   of the amount that would add to 100%. So this is showing you something that was very difficult to do: it was easy to do this type of analysis in JMP, but it was hard to graph and show   exactly how
the different settings, the different ratios of the mixture here, actually are affecting the output, in this case tablet hardness.   And this is a beautiful chart, and it's making use of smoother lines to really help us. So what we're going to do, we're going to look at some stacked data.   And I do want to point out that   there's a great book if you're dealing with formulations and things like mixtures: Ron Snee and Roger Hoerl have a book called "Strategies for Formulation Development."   They do use JMP in it, and this is an example of the ABCD mixture screening design. So these are some results that came out of a mixture screening experiment. This is pretty good: I've got my tablet hardness, and I've got my different diluent amounts.   So what we're going to do is just go to Graph Builder. Got that set up; I'll put my tablet hardness here, put my diluent amount here. Now, they did take multiple measures here, so I'm   not surprised to see what's going on, and there are several points. But what I'm going to do is overlay   by the diluents,   and for the summary statistic, let's just use the mean.   Okay, now here's the beauty of the smoother. You have a smoother control box here in the smoother element, and you can play with the amount of straightness in the curve.   And I'm going to pull it down until they agree. And what I was looking at, I was really focusing on 20%, because it makes sense to me that all five of these at 20% have to add up to 100%.   Here's something I probably haven't shown before: I really love grid lines. You can turn these on when you double click on the axis.   There we go -- you can just turn on grid lines here, and that really helps my eyes. But now you can see something in the results of this experiment: something like cornstarch,   where when it was low, it didn't make much of a difference.
Not many of them made much of a difference when they were low in concentration, in terms of   how much they made up of the vat. Here you can see that the more we put in, the more tablet hardness went down, and we can see something else where hardness   went up.   So that is a cool graph, and probably something you didn't think we could do, but it's actually quite easy to do within JMP.   All right, I'm checking our time; we are doing pretty good, so I will keep going until we are out of time. I will go with the next most popular view. The next most popular view   was these informative point plots. So this is jittering of points, but it's making a nice kind of cluster of them, a kind of packed circular grid.   And you see we've even got this one sized by carbs and colored by calories. So this is a bunch of beers.   This year I went on a pretty extreme diet, and I did real well with it. One of the things I had to do was pay attention to the kind of beer I was drinking;   I couldn't drink the really high-calorie or high-carb stouts and porters that I usually do. So I did change what I was drinking, but I would love to make this type of chart against some of my favorite breweries, so I can find new things to drink.   So, pretty easy to do.   What we're going to do is go down here to beer calories.   I'm going to put the brewery on the x axis,   and then the color and size.   It's a little bit of a different graph, and we're not going to mess with the y axis. So I've got brewery down here, and it's got a bunch of points in there. Now let's go ahead   and   size   by the amount of carbs and color by the   calories.   And maybe I'll do them all per ounce; maybe that's a fair way of showing it. So here's one version of the graph that we have.   Now,   this looks pretty interesting.
But you might be wondering what exactly is going on   without that grouping. They're all kind of in a line, slightly jittered.   Well, if I go to my local data filter, let's size down. Let's not look at all the breweries; let's just take a look at some of the top ones. So here's   what I asked for, again, just under that hotspot: this was asking for a local data filter down here at the bottom.   And then this red triangle will let me order by count and find the biggest ones. Now just find the biggest four. Now you're seeing the grouping. In fact, I'll click in here and add a grid line   to make that easy to see. And now I can see, you know, Sierra Nevada has got   a really high-calorie, really big-carb Bigfoot, which is a really delicious beer. Okay, but I can see there's something else much smaller: Anheuser Busch has the Budweiser Select 55, which   has no carbs, or very few carbs per ounce, and just 4.6 calories per ounce -- something close to zero.   So that's a pretty cool view. Now, if I add too many of these I lose it, and you're like, "Well, Scott, how do I get that back?"   Don't fear. Of course you can always size down your list, and you can also make your graph bigger, but under points there is a jitter limit. And if I increase that jitter limit to two, you can see it's going to allow   a little more space to create these jitters. Now you can see what you want; you can get a smaller subset and get just the view you like.   All right, so hopefully you enjoyed that one. That's one of my favorites.   All right, so the next one we're going to take a look at, let's take a look at the informative box plot. This is a quick one. This is another new one in JMP 15:   these box plots,   this bag plot, as well as the ridge plot -- these are the ones that were new to 15 or had new functionality in 15. So this is a different kind of style,   which is kind of cool,   that you get that kind of view on there. Plus, I'm able to color these box plots, because they're solid. Now I can give them a color, and here I colored them by a process capability measurement. So this gives you another chance to make your box plot stand out, which is kind of cool.   So what we're going to do is look at this fan supplier stack. This is a bunch of fan suppliers, looking at their revolutions per minute.   I'm going to go into Graph Builder. I am going to put the fan supplier on the X, put the fan RPM on the Y, and ask for box plots. Pretty easy. Now what we're going to do, under the style, is say: give me solid.   Okay, and that's a brand new style. There's also a thin style, in case you'd like that one. I like this solid style because, with the solid style, now I can take something like Cpk and I can color by it.   And that's what we did here. And if I right click here right on the gradient, click on that gradient, maybe I will choose a green to white to red, kind of a   go-stop kind of situation. I'll reverse them to make sure that makes sense. And now I can see the things with the worst capability are indeed getting out   beyond my upper spec limit. Those things with higher Cpk, higher capability, are a lot closer to being centered and closer to my target, not spread out all that much.   So that's a pretty cool view, pretty easy to use.   All right, so we actually have time to run through the last one, which is fantastic. So the last one we're going to look at   is this area range chart. This is not just the lines. Everybody knows how to do lines and kind of do a trend chart, but you can see I've got area shaded in here between the lines.
What we were doing in this one (I did this chart with Bill Worley; we've got a good blog on it as well) was looking at the different   ages at which you can start to pull your US Social Security: you can take it early at 62, you can take it next at 66 and 8 months, or you can wait all the way till you're 70, and there'll be higher payouts each year.   So if I take it at 62, I get a lower payout, but of course I start earlier. So we wanted a good chart that kind of shows you the trade-offs.
Monday, October 12, 2020
Ronald Andrews, Sr. Process Engineer, Bausch + Lomb   How do we set internal process specs when there are multiple process parameters that impact a key product measure? We need a process to divide up the total variability allowed into separate and probably unequal buckets. Since variances are additive, it is much easier to allocate a variance for each process parameter than a deviation. We start with a list of potential contributors to the product parameter of interest. A cause-and-effect diagram is a useful tool. Then we gather the sensitivity information that is already known. We sort out what we know and don't know and plan some DOEs to fill in the gaps. We can test our predictor variances by combining them to predict the total variance in the product. If our prediction of the total product variability falls short of actual experience, we need to add a TBD factor. Once we have a comprehensive model, we can start budgeting. Variance budgeting can be just as arbitrary as financial budgeting. We can look for low-hanging fruit that can easily be improved. We may have to budget some financial resources to embark on projects to improve factors to meet our variance budget goals.     Auto-generated transcript...   Speaker Transcript Ronald Andrews Well, good morning or afternoon, as the case may be. My name is Ron Andrews, and the topic of the day is variance budgeting. Oh, I need to share my screen. There's a file we will be getting to, but we'll start with PowerPoint. So variance budgeting is the topic. I'm a process engineer at Bausch and Lomb; I've got contact information here. My supervision requires this disclaimer: they don't necessarily want to take credit for what I say today. Here's an overview of what we're going to talk about: What is a variance budget? A little bit of history. When do we need one? We have some examples.
We'll go through the elements of the process: cause-and-effect diagram, gather the foreknowledge, do some DOEs to fill in the gaps, Monte Carlo simulations as required. And we've got a test case we'll work through. So really, what is a variance budget? Mechanical engineers like to talk about tolerance stack-up. Well, tolerance stack-up is basically a corollary of Murphy's Law, that being: all tolerances will add unidirectionally, in the direction that can do the most harm. A variance budget is like a tolerance stack-up, except that instead of budgeting the parameter itself, we budget the variance -- sigma squared. We're relying on more or less normally shaped distributions, rather than uniform distributions. Variances are additive, which makes the budgeting process a whole lot easier than trying to budget something like standard deviations. Brief example here. If we use test-and-sort or test-and-adjust strategies, our distributions are going to look more like these uniform distributions. So if we have a distribution with a width of 1, one with a width of 2, and another with a width of 3, and we add them all together, we end up with a distribution with a width of pretty close to 6. In this case, we probably need to budget the tolerances more than the variances. If we rely on process control, our distributions will be more normal. In this case, if we have a normal distribution with a standard deviation of 1, a standard deviation of 2, and a standard deviation of 3, and we add them up, we end up with a standard deviation of 3.7 -- a lot less than 6. So we do the numbers: 1 squared plus 2 squared plus 3 squared equals essentially 3.7 squared. Now, to be fair, on that previous slide, if I added up those variances, they would have added up to the variance of this one. But when you have something other than a normal distribution, you have to pay attention to the shape down near the tail. It depends on where you can set your specs. So, what is the variance budget?
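The arithmetic on that slide is easy to check. A minimal sketch (not from the talk itself): three independent, roughly normal sources with sigmas of 1, 2, and 3 combine by adding variances, versus a worst-case tolerance stack-up that adds the widths directly.

```python
import math

sigmas = [1.0, 2.0, 3.0]

# Independent, roughly normal sources: variances add, deviations do not.
# 1^2 + 2^2 + 3^2 = 14, and sqrt(14) is about 3.74 -- "essentially 3.7 squared".
combined_sigma = math.sqrt(sum(s * s for s in sigmas))

# Worst-case tolerance stack-up adds the half-widths unidirectionally.
worst_case = sum(sigmas)  # 6.0
```

The gap between 3.7 and 6 is exactly the budgeting headroom the talk is pointing at.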
Non-normal distributions are going to require special attention, and we'll get to those later. For now, a variance budget is kind of like a financial budget; they can be just as arbitrary. There are only three basic rules. One: we translate everything into a common currency. We do this for each product measure of interest, but we translate all the relevant process variables into their contribution to the product measure of interest. Rule number two is fairly simple: don't budget more than 100% of the allowed variance. Yeah, sounds simple. I've seen this rule violated more than once, in more than one company. Number three, and this goes for life in general as well as engineering: use your best judgment at all times. A little bit of history. This is not rocket science; other people must be doing something similar. I have searched the literature, and I have not been able to find published accounts of a process similar to this. I'm sure it's out there, but I have not found any published accounts yet. So for me the history came back in the 1980s, when I worked at Kodak, with a challenge from management. The challenge was: produce film with no variation perceived by customers. Actually, what they originally said was produce film with no variation, then no perceivable variations. They defined that as: a six-sigma shift would be less than one just-noticeable difference. Kodak was pretty good on the perceptual stuff, and all these just-noticeable differences were defined; we knew exactly what they were. For a slide film like Kodachrome, which is what I was working on at the time, color balance was the biggest challenge. In this streamlined cause-and-effect diagram, color balance is a function of the green speed, the blue speed, and the red speed. Now I've sort of fleshed out one of these legs.
For the red speed, I've got the cyan absorber dye and one of the emulsions as the factors that contribute to the red speed, which affects the color balance. This is a very simplified version. There are actually three different emulsions in the red record, three more in the green record, and two more in the blue record. Add up everything, and there are 75 factors that all contribute to color balance. These are not just potential contributors; these are actually demonstrated contributors. So this is a fairly daunting task. So, moving on to when we need a variance budget. I've got a little tongue-in-cheek decision tree here. Do we have a mess in the house? If not, life is good. If so, how many kids do we have? If one, we probably know where the responsibility lies. If more than that, we probably need a budget. This is an example of some work we did a number of years ago on a contact lens project at Bausch and Lomb, long before it got out the door to the marketplace. We were having trouble meeting our diameter specs, plus or minus two tenths of a millimeter. We looked at a lot of sources of variability, and we managed to characterize each one. So, lot to lot -- and this is with the same input materials and same set points -- fairly large variability. Lens to lens within a lot: lower variability. Monomer component No. 1, where we change lots occasionally: extreme variability. Monomer component No. 2 also had a fairly large variability. Now, we mix our monomers together, and we have a pretty good process with pretty good precision. It's not perfect, and we can estimate the variability from that; that's a pretty small contributor. We put the monomer in a mold and put it under cure lamps to cure it, and the intensity of the lamps can make a difference. There we can estimate that source of variability as well. We add all these distributions up, and this is our overall distribution.
It does go beyond the spec limits on both ends. Standard deviation of .082, and as I mentioned, spec limits of plus and minus .2; that gives us a PPk of .81. Not so good. Percent out of spec is estimated at 1.5%. It might have been passable if it was really that good, but it wasn't. This estimate assumes each lens is an independent event. They're not. We make the lenses in lots, and every lot has a certain set of raw materials and a certain set of starting conditions, so within a lot there's a lot of correlation. And the two monomer components I mentioned had sizable contributions; looking here, occasionally you can see the yellow line and the pink line. These are the variability introduced by those two monomer components. When they're both on the same side of the center line, they push the diameter out towards the spec limits, and we have some other sources of variability that add to the possibilities. Another problem is that our .2 limit is for an individual lens, but we disposition based on lots. This plot predicts lot averages, and when we get a lot average out to .175, chances are we're going to have enough lenses beyond the limit to fail the lot. So all added up, our estimate is that 4% of the lots are going to be discarded. And they're going to come in bunches: we're going to have weeks when we can't get half of our lots through the system. So this is a nonstarter. We have to make some major improvements. The lot-to-lot variability from the two monomer components contributed a good chunk of that variability. We looked and found that the overall purity of Monomer 1 was a significant factor, and certain impurities in Monomer 2, when present, were contributors. Our chemists looked at the synthetic routes for these ingredients and found that there was a single starting material that contributed most of the impurities.
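The capability numbers quoted above can be reproduced in a few lines. This sketch assumes a process centered on the target diameter (the talk doesn't state the mean, so centering is an assumption), with the standard deviation and spec limits from the slide.

```python
from statistics import NormalDist

sd = 0.082            # overall standard deviation from the talk
lsl, usl = -0.2, 0.2  # spec limits, as deviation from target (mm)
mean = 0.0            # assumed: process centered on target

# Ppk: distance from the mean to the nearer spec limit, in units of 3 sigma.
ppk = min(usl - mean, mean - lsl) / (3 * sd)   # about 0.81

# Fraction outside spec if lenses really were independent and normal.
nd = NormalDist(mean, sd)
frac_out = nd.cdf(lsl) + 1 - nd.cdf(usl)       # about 0.015, i.e. ~1.5%
```

As the talk notes, the real scrap rate was worse, because lenses within a lot are correlated and lots are dispositioned as a whole; this independent-and-normal estimate is the optimistic floor.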
They recommended that our suppliers distill this starting ingredient to eliminate the impurities. That made some major improvements. We also put variacs on the cure lamps to control the intensity. Lamp intensity was not a big factor, but this was easy, and when it's easy, you make the improvement. Strictly speaking, this was a variance assessment rather than a variance budget; we never actually assigned numeric goals for each component. Back then we were kind of picking the low-hanging fruit. I mean, we found two factors that pretty much accounted for a large portion of the variability. Maybe we need a little bit better structure to reach the higher branches, now that we need to reach up higher. Current status on lens diameter: PPk is 2.1. The product's been on the market for a few years now; this is not the problem anymore. We've made major improvements in these monomer components, and we're still working on them. They still have detectable variability -- detectable, but it hasn't been a problem in a long time. So the basic question is, what do we do to apply data to a variance budget? Maybe reduce that arbitrariness a little bit. We have to start by choosing a product measure in need of improvement. We need to identify the potential contributors; a cause-and-effect diagram is a convenient tool. We need to gather some foreknowledge. We need to know the sensitivity -- the change in the product measure divided by the change in the process measure; what's the slope of that line? We are going to need some DOEs to fill in the gaps. We need to estimate the degree of difficulty for improving some of these factors. And we estimate the variance from each component, and then we divide the total variance goal among the contributors. Sounds easy enough. Let's get into an example. Let's say we're working on a new project, and along the way we have a new product measure called CMT (stands for cascaded modulation transfer) to measure overall image sharpness.
Kind of important for contact lenses. Target is 100, plus or minus 10. We want a PPk of at least 1.33. That means the standard deviation has got to be 2.5 or less, so the variance has got to be 6.25 or less. What factors might be involved? Let's think about a cause and effect diagram. We can go into JMP and create a table. We start by listing CMT in the parent column. Then we list each of our manufacturing steps in the child column. Then we start listing those child factors over on the parent side and listing subfactors under them. These subfactors are obviously generic and arbitrary; the whole thing's hypothetical. We can go as many levels as we want and have as many branches in the diagram as we care to, but we've identified 14 potential factors here. So we go into the appropriate dialog box, identify the parent column and the child column, click the OK button, and out pops the cause and effect diagram. Brief aside here. I've been using JMP for 30 years now, and I have very, very few regrets. This is one of them. My regret is, I only found this last year. I don't know, actually, when this tool was implemented. I wish I had found it earlier, because this is the easiest way I've found to generate a cause and effect diagram. So we need to gather the sensitivity data. Physics sometimes will give us the answer. In optics, if we know the refractive index and the radius of curvature, that can give us some information about the optical power of the lens. Sometimes physics; oftentimes we need experimental data. So, ask the subject matter experts. Maybe somebody's done some experiments that will give us an idea. We're going to need some well-designed experiments, because no way have all 14 of those factors been covered. Several notches down on the list, in my opinion, is historical data. And if you've used historical data to generate models, you know some of the concerns I'm nervous about. We need to be very cautious with this.
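The 2.5 figure falls straight out of the Ppk formula for a centered process; a small sketch (assuming the Ppk goal is exactly 4/3, which the talk rounds to 1.33):

```python
def required_sigma(half_tolerance, ppk_goal):
    """Back-solve Ppk = half_tolerance / (3 * sigma) for sigma,
    assuming the process is centered on target."""
    return half_tolerance / (3 * ppk_goal)

# Spec is 100 +/- 10 and we want Ppk >= 4/3.
sigma = required_sigma(10, 4 / 3)
print(round(sigma, 2), round(sigma ** 2, 2))  # 2.5 and 6.25
```

That variance ceiling of 6.25 is the total the budget has to divide among the 14 candidate contributors.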
Historical data is usually messy; it has a lot of misidentified numbers, sometimes things in the wrong column, and it needs a lot of cleaning. There's also a lot of correlation between factors. Standard practice is to randomly reserve 25% of the data points for confirmation, generate the model with the other 75% of the data, and then test it with the 25% reserved data. If it works, maybe we have something worth using. If not, don't touch it. So, gathering foreknowledge: we want to ask subject matter experts independently to contribute any sensitivity data they have. I'm taking a page from a presentation last year at the Discovery Summit by Cy Wegman and Wayne Levin. This is their suggestion in gathering foreknowledge to avoid the loudest-voice-in-the-room-rules syndrome. Sometimes there's a quiet engineer sitting in the back who may have important information to impart and may or may not speak up. So we want to get that information. Ask everybody independently to start with, then get people together and discuss the discrepancies. There will be some. Where are the gaps? What parameters still need sensitivity or distribution information? What parameters can we discount? I'd like to find these. What parameters are conditional? It doesn't happen very often, but in our contact lens process, we include biological indicators in every sterilization cycle. These indicators are intentionally biased so that false positives are far more likely than false negatives. When we get a failure in this test, we sterilize again. We know our sterilization routine was probably right, but we sterilize again. So sometimes we sterilize twice. That can have a small effect on our dimensions; it's small, but measurable. So we're going to need to plan some experiments to gather the sensitivities for things we don't know about. And we'll look at production distribution data, used with caution, to generate sensitivity.
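The 75/25 reserve-and-confirm practice can be sketched generically in plain Python (this is the generic holdout idea, not JMP's own row-hold-back feature; `split_holdout` is a name invented for this sketch):

```python
import random

def split_holdout(rows, frac_reserve=0.25, seed=0):
    """Randomly reserve a fraction of historical rows for confirmation.
    Fit the model on the remainder, then test it on the reserved rows."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_reserve = round(len(rows) * frac_reserve)
    reserve = [rows[i] for i in idx[:n_reserve]]
    train = [rows[i] for i in idx[n_reserve:]]
    return train, reserve

train, reserve = split_holdout(list(range(100)))
print(len(train), len(reserve))  # 75 25
```

Fixing the seed makes the split reproducible, which matters when the confirmation step is the only defense against a model fitted to messy, correlated historical data.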
We can use it to generate information on the variability of each of the components and the overall variability of the product measure of interest. We need to do some record keeping along the way. We can start with that table we used to generate the cause and effect diagram and add a few more columns: fill in the sensitivities, units of measure, various columns. Any kind of table will do. Just keep the records and keep them up to date. We're going to need some DOEs to fill in the gaps. There are some newer techniques -- definitive screening designs, group orthogonal supersaturated designs -- that provide a good bang for the buck when the situation fits. Now in this particular situation, we've got 14 factors. We asked our subject matter experts; some of them have enough experience to predict some directional information, but nobody has a good estimate of the actual slopes. So we need to evaluate 14 factors. I'd love to run a DSD, but it requires 33 runs and I don't have the budget for it. So we're going to resort to the custom DOE. So, go to the custom DOE function... we've been using PowerPoint for long enough now; it's time we demonstrated a few things live in JMP. We go to DOE > Custom Design. You don't have to, but it's a very good practice to fill in the response information (if I could type it right). Match target from 90 to 110. Importance of 1, which only makes a difference if we have more than one response. The factors: I have my file, so I can load these quickly. Here we have all 14 of the factors. Factor constraints I've never used, but I know it's there if some combination of factors would be dangerous; I know that we can disallow it. The model specification: this is probably the most important part. This is basically a screening operation, so we're just going to look at the main effects. Our subject matter experts suggested the interactions are not likely, and nonlinearity is possible but not likely to be strong.
So we're going to ignore those for now, at least for the screening experiment. We don't need to block this, and we don't need extra center points. For 14 main effects, JMP says a minimum of 15 runs; that's a given. The default is 20. I've learned that if I have a budget that can run the default, that's a good place to start. I can do 20 runs; 33 was too much, but I can manage 20. Let's make this design. I left this in design units; this is a hypothetical example, and I didn't feel like replacing these arbitrary units with other arbitrary units. We've got a whole suite of design evaluation tools, and there are a couple that I normally look at. The power analysis: if the root mean square error estimate of 1 is somewhere in the ballpark, then these power estimates are going to be somewhere in the ballpark. .9 and above; pretty good, I like that. The other thing I normally look at is the color map on correlations. I like to actually make it a color map. And it's kind of complicated; we've got 14 main effects, and I honestly haven't counted all the two-way interactions. What we're looking for is confounded effects, where we have two red blocks in the same row. Well, I don't see that. That's good. We've got some dark blue where there's no correlation, some light blue where there's a slight correlation, and some light pink where maybe it's a .6 correlation coefficient. This is tolerable. As long as we don't have complete confounding, we can probably estimate what's causing the effect. Now this is good; move on, make the table. Well, this is our design, with the space to fill in the results. I'm going to take a page from the Julia Child school of cooking: do the prep for you, put it in the oven, and then take a previously prepared file out of the oven that already has the executed experiment. These are the results. CMT values: we wanted them between 90 and 110. We've got a couple here in the 80s, there's 110.5, and we've got 111 here. Looks like we have a little work to do.
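The idea behind the color map on correlations can be sketched in plain Python: build the main-effect and two-factor-interaction columns of the model matrix and compute the absolute pairwise correlations. This is a rough, text-only stand-in for JMP's graphical diagnostic; `design_correlation_map` is a name invented for this sketch:

```python
from itertools import combinations, product

def corr(u, v):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = sum((x - mu) ** 2 for x in u) ** 0.5
    sv = sum((y - mv) ** 2 for y in v) ** 0.5
    if su == 0 or sv == 0:
        return 0.0
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / (su * sv)

def design_correlation_map(runs, names):
    """|correlation| between every pair of model columns (main effects
    plus two-factor interactions) for a coded -1/+1 design."""
    cols = {name: [r[i] for r in runs] for i, name in enumerate(names)}
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        cols[f"{a}*{b}"] = [r[i] * r[j] for r in runs]
    labels = list(cols)
    return {(p, q): abs(corr(cols[p], cols[q])) for p, q in combinations(labels, 2)}

# Tiny illustration: a 2^3 full factorial is orthogonal, so every
# off-diagonal entry is 0 -- no confounding among these terms.
runs = list(product((-1, 1), repeat=3))
cmap = design_correlation_map(runs, ["A", "B", "C"])
print(max(cmap.values()))  # 0.0
```

In a 20-run custom design for 14 factors, the analogue of the talk's "light pink" cells would be entries around 0.6: tolerable partial correlation, as opposed to an entry of 1.0, which would mean complete confounding.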
Let's analyze the data. Everything's all done for us: there's the response, here are all the factors. We want the screening report. Click Run. R squared of .999. Yeah, you can tell this is fake data; I probably should have set the noise factor a little higher than this. The first six factors are highly significant; the next eight, not so much. I was lazy when I generated it; I put something in there for the first six. Now, typically we eliminate the insignificant factors. We can either eliminate them all at once or, as I tend to do, one at a time: eliminate the least significant factor each time and see what it does to the other numbers. Sometimes it changes, sometimes it doesn't. Eliminate this one, and it looks like Cure1 slipped in under the wire at .0478, just under .05. I doubt that it's a big deal, but we'll leave it in there. So we look at the residuals; they're kind of random, that's good. Studentized residuals, also kind of random. We need to look at the parameter estimates. This is what we paid for: the regression coefficients are the slopes we were looking for. These are the sensitivities. That's why we did the experiment. I'm a visual kind of guy, so I always look at the prediction profiler. I look at the plot of the slopes, and I look at the confidence intervals, which are pretty small; here you can just barely see there's a blue shaded interval. I also like to use the simulator when I have some information about the inputs, so that we can enter the variability for each of these. Now, if you'll allow me to again use the Julia Child approach and go back to the previously prepared example where I've already entered the variation on each one of these: for Mold Tool 1, I input an expression that results in a bimodal distribution, and for Mold Tool 2, a uniform distribution.
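For an orthogonal two-level design, the arithmetic behind those coefficient estimates is simple: each main-effect coefficient is just the dot product of the coded factor column with the response, divided by the number of runs. A minimal sketch with made-up data (these are not the talk's actual factors or results):

```python
from itertools import product

def main_effect_coefficients(runs, y):
    """Least-squares fit for an orthogonal -1/+1 coded design:
    intercept = mean(y); each slope = dot(column, y) / n."""
    n, k = len(runs), len(runs[0])
    intercept = sum(y) / n
    betas = [sum(r[i] * yi for r, yi in zip(runs, y)) / n for i in range(k)]
    return intercept, betas

# Hypothetical response: depends on factors A and B, not on C.
runs = list(product((-1, 1), repeat=3))
y = [100 + 4 * a + 2 * b + 0 * c for a, b, c in runs]
intercept, betas = main_effect_coefficients(runs, y)
print(intercept, betas)  # 100.0 [4.0, 2.0, 0.0]
```

The recovered slopes are exactly the sensitivities the variance budget needs; in a real (noisy, only near-orthogonal) design JMP's least-squares fit does the same job with proper standard errors attached.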
And I've got to say, in defense of my friends in the tool room, the bimodal distribution only happens in a situation like what happened last month, where the tools we wanted to use were busy on the production floor, so for the experiment we used some old iterations. We actually mixed iterations, and when that happens, we can get a bimodal distribution. This uniform distribution never happens with these guys; they're always shooting for the center point and usually come within a couple of microns. The other distributions are all normal, with various widths; in one case, there's a bit of a bias. These are the input distributions, and here's our predicted output. Even though we had some non-normal starting distributions, we have a pretty much normal output distribution. It does extend beyond our targets; we kind of knew that. Now, the default when you start here is 5,000 runs. I usually increase this to something like 100,000. It doesn't take any extra time to execute, and it gives you slightly smoother distributions. It also produces a table here; we can make the table and move it over here. The big advantage of this is that we can look at the distributions of the input factors (I don't want this CMT yet). This is a bigger, fancier plot: this is our bimodal distribution, the uniform, and these various normal distributions of various widths, one with kind of a bias to it. So we take all those, add them up, and we have a distribution that looks pretty normal, even though some of the inputs were not normal. We can use conventional techniques on this. When we start setting the specs, it does extend beyond our spec limits, so we're going to need to make some improvements. Scroll down here and look at the capability information: PPk of .6. That's a nonstarter. No way is manufacturing going to accept a process like this. So we need to make some significant improvements. So, go back to the PowerPoint file.
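The simulator's propagation step can be imitated in a few lines of plain Python: draw each input from its distribution, push the draws through the fitted model, and look at the output distribution. The distributions and sensitivities below are invented for illustration and are not the talk's actual numbers:

```python
import random
import statistics

def simulate_cmt(n=100_000, seed=1):
    """Monte Carlo propagation of input variation through a fitted model.
    All distributions and slopes here are hypothetical."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # Bimodal mold dimension: mixture of two normals (mixed tool iterations).
        mold1 = rng.gauss(-1.0, 0.3) if rng.random() < 0.5 else rng.gauss(1.0, 0.3)
        mold2 = rng.uniform(-1.0, 1.0)   # uniform: tool room aims for center
        cure1 = rng.gauss(0.0, 0.5)      # normal, one of several widths
        # Hypothetical linear model from a screening DOE:
        out.append(100 + 1.5 * mold1 + 2.0 * mold2 + 1.0 * cure1)
    return out

cmt = simulate_cmt()
print(round(statistics.mean(cmt), 1), round(statistics.stdev(cmt), 1))
```

Even with bimodal and uniform inputs, the sum of several independent contributions comes out close to normal, which is the same effect the talk observes in the simulator output.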
And I'll scroll through the slides that were my backup in case I had a problem with the live version of JMP (because of me having the problem, not JMP). So here we have the factors. The standard deviations come from our preproduction trials and estimate the variability. The sensitivities are the results from our DOE.
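Once sensitivities and component standard deviations are in hand, the variance budget arithmetic is just summing (sensitivity x sigma)^2 over independent contributors and comparing against the total variance goal. A sketch with hypothetical values (the names and numbers are invented, not the slide's actual figures):

```python
def variance_budget(components, goal_variance):
    """components: {name: (sensitivity, sigma)}. Each contribution is
    (sensitivity * sigma)**2; contributions add if inputs are independent."""
    contrib = {k: (s * sd) ** 2 for k, (s, sd) in components.items()}
    total = sum(contrib.values())
    return contrib, total, total <= goal_variance

# Hypothetical sensitivities and sigmas, against the 6.25 variance goal.
components = {
    "Mold Tool 1": (1.5, 0.8),
    "Mold Tool 2": (2.0, 0.6),
    "Cure 1":      (1.0, 0.5),
}
contrib, total, ok = variance_budget(components, goal_variance=6.25)
print(round(total, 2), ok)
```

Sorting `contrib` largest-first also shows where tightening control buys the most, which is the whole point of dividing the total variance goal among the contributors.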
Level: Intermediate

Gaussian Process (GP) is one of several analysis techniques used to build approximation models for computer-generated experiments. Generally, a space-filling design is used to guide the computer experimentation because all the parameters/variables are derived from, or directly pulled from, first-principles physics models/equations. Space-filling designs are used because the data generated by the computer experiments is deterministic and likely to be highly nonlinear. This is where GP comes into play: because the data is deterministic, GP will attempt to fit every point in the design perfectly, allowing for a close approximation of the true model. We will compare GP to Response Surface and Neural Net models. We will also compare GP models derived from different types of space-filling designs.

Gaussian Process
Typically used to build models for computer simulation experiments. The data is deterministic, so there is no need to run an experiment more than once: a given set of inputs will always produce the same answer. Also known as kriging. More than 100 conditions will take a long time to compute a solution. JMP Pro has Fast GASP for larger data sets, which breaks the GP into blocks, allowing for faster computation. You can also have categorical inputs with JMP Pro.

Model Options for Gaussian Fit
Estimate Nugget Parameter – useful if there is noise or randomness in the response and you would like the prediction model to smooth over the noise instead of perfectly fitting it. Highly recommended.
Correlation Type – lets you choose the correlation structure used in the model:
Gaussian – allows the correlation between two responses to always be non-zero, no matter the distance between the points.
Cubic – allows the correlation between two responses to be zero for points far enough apart.
Minimum Theta Value – lets you set the minimum theta value used in the fitted model.

Variance vs. Bias
For most designed experiments, the goal is to minimize the variance of prediction. Because computer experiments are deterministic, there is no variance, but there is bias: the difference between the approximation model and the true mathematical function. Space-filling designs are used in an effort to bound the bias.

Borehole Example

Types of Space Filling Designs in JMP
Sphere Packing – maximizes the minimum distance between design points.
Latin Hypercube – maximizes the minimum distance between design points but requires even spacing of the levels for each factor.
Uniform – minimizes the discrepancy between the design points and a theoretical uniform distribution.
Minimum Potential – spreads points inside a sphere around a centroid.
Maximum Entropy – maximizes the amount of information contained in the distribution of the set of design points.
Gaussian Process IMSE Optimal – creates a design that minimizes the integrated mean square error (IMSE) of the Gaussian process over the experimental region.
Fast Flexible Filling (FFF) – uses clusters of random points to choose design points according to an optimization criterion. Can be constrained.

Summary of Fit
Do Gaussian with and without the Nugget Parameter and check the jackknife fit.
Neural Net models offer a good alternative to Gaussian models but can be more complicated; NN models sometimes outperform Gaussian models.
Use the smoothing function for Neural Nets (JMP Pro).
Don't rely on R² alone when deciding on the best-fit model. Picking the right model is about keeping the model as simple as possible while still getting reasonable prediction.

Gaussian Process Resources
Comparison of different GP packages – from 2017
Borehole model example found in the JMP 14 DOE Guide, Chapter 21, p. 637.
Discovery Summit 2011 presentation: Meta-Modeling of Computational Models – Challenges and Opportunities
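The interpolation-versus-smoothing behavior of the nugget can be illustrated with a bare-bones GP predictor in plain Python. This is a teaching sketch (zero-mean GP, squared-exponential kernel, fixed hyperparameters), not JMP's Gaussian Process platform, which also estimates the theta parameters:

```python
import math

def rbf(x1, x2, length=1.0):
    """Squared-exponential (Gaussian) correlation between two points."""
    return math.exp(-((x1 - x2) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_predict(xs, ys, x_new, length=1.0, nugget=0.0):
    """Zero-mean GP prediction k*^T K^-1 y. With nugget=0 the model passes
    through every training point (deterministic computer experiments);
    a positive nugget smooths over noise instead of fitting it exactly."""
    K = [[rbf(a, b, length) + (nugget if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    return sum(a * rbf(x, x_new, length) for a, x in zip(alpha, xs))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.sin(x) for x in xs]
# With no nugget, the prediction at a training point reproduces it exactly.
print(abs(gp_predict(xs, ys, 1.0) - math.sin(1.0)) < 1e-8)  # True
```

Calling `gp_predict(xs, ys, 1.0, nugget=0.1)` instead gives a value that no longer matches the training point exactly, which is the smoothing behavior the Estimate Nugget Parameter option provides for noisy responses.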
Level: Intermediate

Designed experiments for dry etch equipment present challenges for semiconductor engineers. First, because the total gas flow rate is often fixed, a mixture design must be used to honor the constraints imposed by this type of design. These types of designs are not commonly seen in the semiconductor industry. Second, as is often the case with these experiments, the investigator is interested in optimizing more than one variable. In this presentation, you will see an example of how to design and analyze a seven-factor experiment for a dry etch tool and simultaneously optimize an overall wafer target value while minimizing within-wafer variability.

Overview
Eight-factor experiment for a dry etch process:
Three process gases: A, B, C
Five process factors: power, pressure, temperature, time, total flow
The experimenter was interested in both the gas ratios and the total gas flow. To keep total flow and gas ratios as uncorrelated as possible, a mixture design was used. In addition, the experimenter wanted to bound the ratio for two of the gases between an upper and lower value. The third gas, C, was to make up no less than 10% and no more than 25% of the total mixture.

What is a Mixture Design?
A mixture design is used when the quantities of two or more experimental factors must sum to a fixed amount. The inclusion of non-mixture components (i.e., factors that are not part of the mixture) makes designing this experiment challenging. Mixture designs emphasize prediction over factor screening. For that reason, mixture factors are not removed from the experiment even when they are not significant (they may be set to 0, however).

Mixture Design Challenges
Effects are highly correlated and are harder to estimate.
Squared mixture terms are confounded with (are a linear function of) mixture factor main effects and two-factor interactions.
Main effects for non-mixture factors are correlated with the two-factor interactions between that non-mixture factor and the mixture factors. Focusing on prediction and use of the Profiler (instead of parameter estimation and significance) makes designing and interpreting mixture experiments much easier.

Experimental Responses
Response    Goal
ER          Target = 100
ER Std      Minimize

Experimental Factors
Factor      Low    High
Power       25     75
Press       100    200
Temp        25     40
Time        30     45
Total Flow  80     120
Gas A       0      1
Gas B       0      1
Gas C       0.1    0.25

Experimental Constraints
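The gas constraints described above are easy to state in code; a minimal sketch (a hypothetical helper, expressing each gas as a fraction of the total mixture so the three proportions must sum to 1):

```python
def valid_mixture(gas_a, gas_b, gas_c, tol=1e-9):
    """Check a candidate run against the mixture constraints described above:
    proportions sum to 1, and Gas C makes up 10-25% of the total mixture."""
    return abs(gas_a + gas_b + gas_c - 1.0) < tol and 0.10 <= gas_c <= 0.25

print(valid_mixture(0.50, 0.35, 0.15))  # True
print(valid_mixture(0.60, 0.35, 0.05))  # False: Gas C below 10%
```

A candidate-filtering step like this is conceptually what the design tool does when it restricts the mixture region; at a given total flow, multiplying each proportion by the flow gives the individual gas flow rates.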