JMP User Community: JMP Discovery Summit Series Abstracts
Using JMP® to Compare Models from Various Environments (2020-US-45MP-593)
Monday, October 12, 2020
Lucas Beverlin, Statistician, Intel, Corp. The Model Comparison platform is an excellent tool for comparing various models fit within JMP Pro. However, it also has the ability to compare models fit from other software as well. In this presentation, we will use the Model Comparison platform to compare various models fit to the well-known Boston housing data set in JMP Pro15, Python, MATLAB, and R. Although JMP can interact with those environments, the Model Comparison platform can be used to analyze models fit from any software that can output its predictions. Auto-generated transcript... Speaker Transcript Lucas Okay, thanks everyone for coming and listen to my talk. My name is Lucas Beverlin. I'm a statistician at Intel. And today I'm going to talk about using JMP to compare models from various environments. Okay so currently JMP 15 Pro is the latest and greatest that JMP has out and if you want to fit the model and they've got several different tools to do that. There's the fit model platform. There's the neural platform on neural network partition platform. If you want classification and regression trees. The nonlinear platform for non linear modeling and there's several more. And so within JMP 15 I think it came out in 12 or 13 but this model comparison platform is a very nifty tool that you can use to compare model fits from various platforms within JMP. So if you have a tree and a neural network and you're not really sure which one's better. Okay, you could flip back and forth between the two. But now with this, you have everything on the same screen. It's very quick and easy to tell, is this better that that, so on so forth. So that being said, JMP can fit a lot of things, but, alas, it can't fit everything. So just give a few ideas of some things that can't fit. So, for example, those that do a lot of machine learning and AI might fit something like an auto encoder or convolutional neural network that generally requires lots of activation functions or yes, lots of hidden layers nodes, other activation functions than what's offered by JMP so JMP's not going to be able to do a whole lot of that within JMP. Another one is something called projection pursuit regression. Another one is called multivariate adaptive regression splines. So there are a few things unfortunately JMP can't do. R, Python, and MATLAB. There's several more out there, but I'm going to focus on those three. Now that being said, the ideas I'm going to discuss here, you want to go fit them in C++ or Java or Rust or whatever other language comes to mind, you should be able to use a lot of those. So we can use the model comparison platform, as I've said, to compare from other software as well. So what will you need? So the two big things you're going to need are the model predictions from whatever software you use to fit the model. And generally, when we do model fitting, particularly with larger models, you may split the data into training validation and/or test sets. You're going to need something that tells all the software which is training, which is validation, which is test, because you're going to want those to be consistent when you're comparing the fits. OK, so the biggest reason I chose R, Python and MATLAB to focus on for this talk is that turns out JMP and scripting language can actually create their own sessions of R and run code from it. So this picture here just shows very quickly if I wanted to fit a linear regression model to some output To to the Boston housing data set. 
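For reference, here is a minimal JSL sketch of the kind of round trip just described: opening Boston Housing, sending it to an R session, fitting a linear model, and pulling the predictions back into the JMP table. This is a sketch of the idea, not the presenter's actual script; the response column name (mvalue) and the name the data frame takes in R are assumptions, so adjust them to your table.

// Minimal JSL sketch (not the presenter's script): fit lm() in R and bring the predictions back.
// The column name "mvalue" and the R-side data frame name are assumptions.
dt = Open( "$SAMPLE_DATA/Boston Housing.jmp" );

R Init();                                 // start an R session from JMP
R Send( dt );                             // send the table; it is assumed to arrive in R as a data frame named dt
R Submit( "
    fit  <- lm( mvalue ~ ., data = dt )   # ordinary least squares on all predictors
    pred <- predict( fit, newdata = dt )  # predictions for every row
" );
pred = R Get( pred );                     // pull the prediction vector back into JSL
R Term();                                 // close the R session

dt << New Column( "R Linear Reg Pred", Numeric, Continuous, Set Values( pred ) );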
I'll work a lot more with that data set later. But if you wanted to just very quickly fit a linear regression model in R and spit out the predictive values, you can do that. Of course you can do that JMP. But just to give a very simple idea. So, one thing to note so I'm using R 3.6.3 but JMP can handle anything as long as it's greater than 2.9. And then similarly, Python, you can call your own Python session. So here the picture shows I fit the linear regression with Python. I'm not going to step through all the lines of code here but you get the basic idea. Now, of course, with Python be a little bit careful in that the newest version of Python 3.8.5. But if you use Anaconda to install things, JMP has problems talking to it when it's greater than 3.6 so since I'm using 3.6.5 for this demonstration. And then lastly, we can create our own MATLAB session as well. So here I'm using MATLAB 2019b. But basically, as long as your MATLAB version has come out in the last seven or eight years, it should work just fine. Okay, so how do we tie everything together? So really, there's kind of a four-step process we're going to look at here. So first off, we want to fit each model. So we'll send each software the data and which set each observation resides. Once we have the models fit, we want to output those fits and their predictions and add them to a data table that JMP can look at. So of course my warning is, be sure you name things that you can tell where did you fit the model or how did you fit the model. I've examples of both coming up. So step three, depending on the model and you may want to look at some model diagnostics. Just because a model fits...appears to fit well based on the numbers, one look at your residual plot, for example, and you may find out real quickly the area of biggest interest is not fit very well. Or there's something wrong with residuals so on so forth. So we'll show how to output that sort of stuff as well. And then lastly we'll use the model comparison platform, really, to bring everything into one big table to compare numbers, much more easily as opposed to flipping back and forth and forth and back. Okay, so we'll break down the steps into a little more detail now. So for the first step where we do model fitting, we essentially have two options. So first off, we can tell JMP via JSL to call your software of choice. Send it the data and the code to fit it. And so, in fact, I'm gonna jump out of this for a moment and do exactly that. So you see here, I have some code for actually calling R. And then once it's done, I'll call Python and once it's done, I'll call MATLAB and then I'll tie everything together. Now I'll say more about the code here in a little bit, but it will take probably three or four minutes to run. So I'm going to do that now. And we'll come back to him when we're ready. So our other option is we create a data set with the validation. Well, we create a data set with the validation column and and/or a test column, depending on how many sets were splitting our data into. We're scheduled to run on whatever software, we need to run on, of course output from that whatever it is we need. So of course a few warnings. Make sure you're...whatever software you're using actually has what you need to fit the model. Make sure the model is finished fitting before you try to compare it to things. Make sure the output format is something JMP can actually read. Thankfully JMP can read quite a few things, so that's not the biggest of the four warnings. 
But as I've warned you earlier, make sure the predictions from each model correspond to the correct observations from the original data set. And so that comes back to the if it's training, if it's a training observation, when you fit it in JMP, it better be a training observation when you fit it in whatever software using. If it's validation in JMP, it is the validation elsewhere. It's test in JMP, it's going to be test elsewhere. So make sure things correspond correctly because the last thing we want to find out is to look at test sets and say, "Oh, this one fit way better." Well, it's because the observations fit in it were weighted different and didn't have any real outliers. So that ends up skewing your thinking. So a word of caution, excuse me, a word of caution there. Okay. So as I've alluded to, I have an example I'm currently running in the background. And so I want to give a little bit of detail as far as what I'm doing. So it turns out I'm going to fit neural networks in R and Python and MATLAB. So if I want to go about doing that, within R, two packages I need to install in R on top of whatever base installing have and that's the Keras package and the Tensorflow package. numpy, pandas and matplotlib. So numpy to do some calculations pretty easily; pandas, pandas to do data...some data manipulation; and matplotlib should be pretty straightforward to create some plots. And then in MATLAB I use the deep learning tool box, whether you have access to that are not. Okay. So step two, we want to add predictions to the JMP data table. So if you use JMP to call the software, you can use JSL code to retrieve those predictions and add them into a data table so then you can compare them later on. So then the other way you can go about doing is that the software ran on its own and save the output, you can quickly tell JMP, hey go pull that output file and then do some manipulation to bring the predictions into whatever data table you have storing your results. So now that we can also read the diagnostic plots. In this case what I generally am going to do is, I'm going to save those diagnostic plots as graphics files. So for me, it's going to be PNG files. But of course, whichever graphics you can use. Now JMP can't hit every single one under the sun, but I believe PNG that maps jpgs and some and they...they have your usual ones covered. So the second note I use this for the model comparison platform, but to help identify what you...what model you fit and where you fit it, I generally recommend adding the following property for each prediction column that you add. And so we see here, we're sending the data table of interest, this property called predicting. And so here we have the whatever it is you're using to predict things (now here in value probably isn't the best choice here) but but with this creator, this tells me, hey, what software did I use to actually create this model. And so here I used R. It shows R so this would actually fit on the screen. Python and MATLAB were a little too long, but we can put whatever string we want here. You'll see those when I go through the code in a little more detail here shortly. So, and this comes in handy because I'm going to fit multiple models within R later as well. So if I choose the column names properly and I have multiple ones where R created it, I still know what model I'm actually looking at. Okay, so this is what the typical model comparison dialog box looks like. 
So one thing I'll note is that this is roughly what the dialog would look like if I did the point and click at the end of all the model fitting. You can see I have several predictors: neural nets from MATLAB, Python, and R, plus various prediction formulas from models I fit in JMP. Now, oftentimes what folks will do is put the validation column in as a Group variable, so that it groups the training, validation, and test sets. I actually like the output a bit better when I put it in the By statement instead, and I'll show that a little later. You can do either, but I prefer the output this way, and here's the biggest reason why: now I can clearly see these are all the training results, these are all the validation results (you can tell by the headers), and these are all the test results. If you use validation as a Group variable, you get one big table with 21 entries in it. There will be an extra column that says training, validation, or test, or in my case 0, 1, 2, but with the words I don't have to think as hard and I don't have to explain to anyone what 0, 1, 2 means. So that's why I made the choice I did. Okay, so in the example I'm going to break down here, I'm using the classic Boston housing data set. It's included with JMP, which is why I didn't include it as a file with my presentation; if you have JMP, you already have it. Harrison and Rubinfeld used several predictors of the median home value, such as the per capita crime rate, the proportion of non-retail business acres per town, the average number of rooms in the dwelling, the pupil-teacher ratio by town (a lot of teachers and not quite as many students generally means better education is what a lot of people have found), and several others. I'm not going to try to go through all 13 of them here. Okay, so let me give a little background on the models I looked at, and then I'll dig into the JSL code and how I fit everything. First off, the quintessential linear regression model. Here you just see a simple linear regression; I fit the median value by, it looks like, the tax rate, but of course I'll use multiple linear regression with all of the predictors. With 13 different predictors, though, and knowing some of them might be correlated with one another, I decided a few other types of regression were worth looking at. One of them is ridge regression. Really, it's just linear regression with an added constraint that the squared values of my parameters can't sum to more than some constant. It turns out I can rewrite that as an optimization problem, where some value of lambda corresponds to some value of C, and I'm minimizing the usual least squares criterion plus this extra penalty term (a standard way to write that objective is sketched below). This is called a shrinkage method because, as I make the constraint smaller and smaller, it pushes all of the coefficients closer and closer to zero, so some thought needs to go into how restrictive I want it to be. Now, shrinkage pushes everything slowly toward zero, but with another type of penalty term I can actually eliminate some terms altogether, and for that I can use something called the lasso.
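To pin down the two penalties just described, here are the standard textbook objectives (the usual formulation, not something taken from the slides). Ridge adds a squared (L2) penalty; the lasso, discussed next, swaps it for an absolute-value (L1) penalty that can set weak coefficients exactly to zero. Larger values of lambda correspond to smaller values of the constraint constant C.

% Ridge regression: least squares plus a squared (L2) penalty
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2

% Lasso: the same criterion with an absolute-value (L1) penalty
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert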
And the constraint here is, instead of the squared parameter estimates, I take the sum of the absolute values of the parameter estimates. It turns out that with this penalty the estimates of very weak terms get set exactly to zero, which effectively serves as an elimination of unimportant terms. To give a little bit of a visual of what lasso and ridge regression are doing: for ridge regression, the circle here represents the penalty term, and we're looking at the parameter space. The true least squares estimates would be here, but we're not quite getting there because of the additional constraint. In the end, we find where the minimum value touches the circle, and that point gives our ridge regression parameter estimates. For the lasso it's a similar drawing, but with the absolute value the region is a diamond rather than a circle. (This is two dimensions, of course; in higher dimensions we get hyperspheres and related shapes.) Notice the solution touches right at the tip of the diamond, so in this case beta one is exactly zero. That's how the lasso eliminates terms. Okay, another thing we'll look at is a regression tree; JMP uses the Partition platform to fit these. To give a very quick idea of what this shows: I have all of my data, and the first question I ask is how many rooms are in the dwelling. Since I can't have 6.943 of a room, it's basically six rooms or fewer down this branch of the tree, seven or more down the other. If I have seven rooms or more, this tells me immediately to predict the median value to be 37; remember it's in tens of thousands of dollars, so make that $370,000. If it's less than seven, the next question I ask is how big lstat is. If it's greater than or equal to 14.43, I look at this node and my median housing estimate is about 150 grand; if I come over here, it's about 233 grand. So what regression trees really do is partition the input space into different areas, and we give the same prediction to every point within an area. You can see here I've partitioned the space; in this case I'm using a two-dimensional example because it's easier to draw. This tree first asks a question about x1, then about x2, then another question about x1 and another about x2, and that's how the input space gets partitioned. Each of these five regions gets its own prediction value, and that's essentially what this surface looks like: viewed from above I get exactly this partition, but the prediction differs depending on which of the five areas you're in. I'm not going to get into the details of how exactly to fit one of these; James, Witten, Tibshirani, and Friedman give a little bit of it, and Leo Breiman wrote the seminal book on it, so you can take a look there. Next I'll come to neural networks, which are being used a lot in machine learning these days. This gives a visual of what a neural network looks like; here the picture just uses five of the 13 inputs and passes them to this hidden layer, and each node transforms its input via an activation function.
And for each of these activation functions you get an output, and oftentimes we just use a linear regression of those outputs to predict the median value. So really, neural networks are nonlinear models at the end of the day; they're called neural networks because the representation resembles how neurons were thought to work in the human brain. Each input can be passed to nodes in a hidden layer; at the hidden layer the inputs are pushed through an activation function and an output is calculated, and each output can be passed to a node in another hidden layer or become an output of the network. Now, within JMP you're only allowed two hidden layers. Truth be told, as far as creating a neural network goes, there's nothing that says you can't have 20, but there's statistical theory suggesting that, given a few boundary conditions, we can approximate any continuous function with two hidden layers, so that's likely why JMP made the decision it did. JMP offers three activation functions: linear, hyperbolic tangent, and Gaussian radial basis. In fact, on these nodes here, notice the little curve; I believe that denotes the hyperbolic tangent function. Linear would be a straight line going up, and Gaussian radial basis, I believe, looks more like a normal curve. So that's the neural network platform. The last one we'll look at is something called projection pursuit regression. I wanted to pull something that JMP simply can't do, just to give an example. Projection pursuit regression is a model originally proposed by Jerome Friedman and Stuetzle over at Stanford. Their model makes predictions of the form y equals a sum of terms beta_i times f_i applied to a linear transformation of the inputs, that is, y = sum_i beta_i * f_i(alpha_i' x). So this is somewhat analogous to a neural network with one hidden layer of k nodes, each with activation function f_i. It turns out that with projection pursuit regression we actually estimate the f_i as well; generally each is some sort of smoother or spline fit, and the f_i are typically called ridge functions. So now we have alphas, betas, and f's to optimize over, and generally a stagewise fitting is done; I'm not going to get too deep into the details at this point. Okay, I've gone through all my models, so now I'm going to show some output, and hopefully things look good. One thing I'll note before I get into JMP is that it's really hard to set seeds for the neural networks in R and Python with Keras, so if you take my code and run it, you're probably not going to get exactly what I got, but it should be pretty close. So with that said, let's see what we have. This is the output I got, and unfortunately things do not appear to have run perfectly. So what do I have here? I have my training, my validation, and my test results, and we see very quickly that one of these models didn't fit well: the neural net from R. Something went wrong there; it must have started from a bad spot in the input space, and it just didn't fit a very good model. Unfortunately, starting values matter with nonlinear models, and in some cases we get bitten by them. But if we look at everything else, everything else seems to fit decently well. Now, we can argue over what decently well means, but I'm seeing R-squared values above .5.
I'm seeing root average squared errors here around five or so, and even our average absolute errors are in the three range. Now for training, it looks like projection pursuit regressions did best. If I come down to the validation data set, it still looks like R projection pursuit did best. But if we look at the test data set, all of a sudden, no, projection pursuit regression was second, assuming we're gonna ignore the neural net from R, second worst. Oftentimes in a framework like this, we're going to look at the test data set the closest because it wasn't used in any way, shape, or form to determine the model fit. And we see based on that, It looks like the ridge regression from JMP fit best. We can see up here, it's R squared was .71 here before was about .73, and about .73 here, so we can see it's consistently fitting the same thing through all three data sets. So if I were forced to make a decision, just based on what I see at the moment, I would probably go with the ridge regression. So that being said, we have a whole bunch of diagnostics and whatnot down here. So if I want to look at what happened with that neural network from R, I can see very quickly, something happened just a few steps into there. As you can see, it's doing a very lousy job of fitting because pretty much everything is predicted to be 220 some thousand. So we know something went wrong during the fitting of this. So we saw the ridge regression looked like the best one. So let's take a look at what it spits out. So I'll show in a moment my JSL code real quick that shows how I did all this but, um, we can see here's the parameter estimates from the ridge regression. We can see the ridge diagnostic plots, so things hadn't really shrunk too much from the original estimates. You can see from validation testing with log like didn't whatnot. And over here on the right, we have our essentially residual plots. These are actual by predicted. So you can see from the training, looks like there was a few that were rather expensive that didn't get predicted very well. We see fewer here than in the test set, it doesn't really look like we had too much trouble. We have a couple of points here a little odd, but we can see for generally when we're in lower priced houses, it fits all three data sets fairly well. Again, we may want to ask ourselves what happened on these but, at the moment, this appears to be the best of the bunch. So we can see from others. See here. So we'll look at MATLAB for a moment. So you can see training test validation here as well. So here we're spitting out...MATLAB spits out one thing of diagnostics and you can see it took a few epochs to finish so. But thankfully MATLAB runs pretty quickly as we can tell. And then the actual by predicted here. We can see all this. Okay, so I'm going to take a few minutes now to take a look at the code. So of course, a few notes, make sure things are installed so you can actually run all this because if not, JMP's just going to fail miserably, not spit out predictions and then it's going to fail because it can't find the predictions. So JMP has the ability to create a validation column with code. So I did that I chose 60 20 20. I did choose that random seed here so that you can use the same training validation test sets that I do. So actually, for the moment, what I end up doing is I save what which ones are training, validation and test. I'm actually going to delete that column for a little bit. 
The reason I do that here is that I'm sending the data set to R, Python, and MATLAB, and it's easier to code when every column is either the output or one of the inputs. I didn't want a validation column that wasn't either, because that makes things a little more difficult. So what I end up doing is sending the data set, sending which rows are training, validation, and test, and then calling the R code to run. Now, you can put the actual R code itself in here; I chose to write one line that calls a script so I don't have to scroll forever, but there's nothing stopping you. If it's only a few lines of code, like what you saw earlier in the presentation, I would just paste it right in here. Once it's done, this code spits out a picture of the diagnostics (we saw it stopped after six or seven iterations), so I save that out. It also fits the ridge regression in the same script, so we get two pictures; I save that one as well into an outline box. These all get put together at the end of all the code, and then I get the output and add it to the data table a little later. I also give a little bit of code here for the case where, say, you have 6 million observations and it's going to take 24 hours to fit the model; you're probably not going to want to run that from within JMP. In that case you can say: I'm going to open the data table I care about, tell R to go run the fit somewhere else in the meantime, and once my system gives me the green light that it's done, go open the output from that run and bring it into my data table. So that's one way you could go about it, and of course you'd want to save those picture files somewhere and use this code as well; it's the exact same idea. For Python it's very similar: I pass it the same things, run some Python code, spit out the diagnostic plot, and spit out the predictions. The Python code, as you can see, is very similar to what we did with R; I just open a CSV file in this case, copy the column in, and close it because I don't need it anymore. And then MATLAB is the exact same game: I set things up, run the MATLAB script, get the PNG file it spat out, save it where I need to, and save the predictions. And if you need to grab the results rather than run it from within here, a little bit of sample code will do that too. Okay, so now that I'm done calling R, Python, and MATLAB, I bring back my validation column so that JMP can use it, since I remembered which rows were which. By default, JMP looks at the values within the validation column: the smallest value is training, the next largest is validation, and the largest is test. (If you do k-fold cross validation, it tells it which fold it is, so of course 0, 1, 2, 3, and so forth.) So I create that column, and I also create value labels for it, so that instead of 0, 1, 2 it actually says training, validation, and test in my output, which is a little clearer to understand. If I show this to someone who's never run JMP before, they're not going to know what 0, 1, 2 means, but they should know what training, validation, and test are. Okay, so now I start adding the predictions to my data table. Here's that set property I alluded to earlier in my talk.
So my creator is MATLAB, and I've given the column a name so I know it's the neural net prediction from MATLAB. I may not strictly need the Creator, but it helps in case I'm a little sloppy in naming things. We do the same for the projection pursuit regression, the neural nets, and so on. I also noted that I fit a ridge regression, a lasso, and a linear regression in JMP, so I did all of that here: I run Fit Model with the Generalized Regression personality, get all of those fits, save my prediction formulas, and plot my actual by predicted for the full output at the end. Then I fit my neural network. I set the validation column and I transform my covariates; neural networks generally tend to fit a little better when we scale things around zero rather than leaving the inputs at their original scale. My first hidden layer has three nodes and my second hidden layer has two nodes, both with linear activation functions. It turns out that for the three fits above I used the rectified linear unit (ReLU) activation function, so slightly different, but I found they seemed to fit about the same regardless. I also set the number of tours to 5, which means JMP tries five different sets of starting values and keeps whichever does best. As you can tell from my code, I probably should have done the same with R: run a for loop over several fits and keep the one that does best. For future work, that's one spot I would improve. Then I save that output, and now I'm ready for the model comparison. I bring all those new columns into the Model Comparison launch and, scrolling over a bit, you can see I'm using validation in the By role, as I alluded to earlier. (A condensed JSL sketch of these steps appears at the end of this transcript.) Lastly, I do a bit of coding to make the report look the way I want: I relabel these outline boxes to say training diagnostics, validation diagnostics, and test diagnostics instead of the usual titles JMP gives; I pull in the diagnostic plots (here I only want part of the output, so I grab a handle to it); I make some residual plots quickly, since not all of the fits spit those out automatically, particularly the ones from MATLAB, Python, and R; I set the titles; and then I create the big dialog and journal everything so it's nice and clean. I close a bunch of windows so I don't have to worry about them. The last thing I do is make it so that when I open one of these outline nodes, everything below it opens immediately rather than my having to click on six or seven different things; you can see otherwise I'd have to click here and here, and over here there are a few more. This way they're automatically open, and that's what this last bit of code does. This is just different output than if I ran it live, but this is what it can also look like. So to wrap everything up, the Model Comparison platform is really a very nice tool for comparing the predictive ability of multiple models in one place. You don't have to flip back and forth between various windows; you can look at everything right in front of you. It's flexible enough that it can even be used to compare models that weren't fit in JMP at all.
And so with this, if we need to fit very large models that take a long time to fit, we can tell them to go fit. Pull everything in JMP and very easily look at all the results to try to determine next steps. And with that, thank you for your time.
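To tie the workflow together, here is a condensed JSL sketch of the JMP-side steps described above: a 0/1/2 validation column with value labels, a two-hidden-layer neural network fit, the Predicting column property on an externally generated prediction column, and the Model Comparison launch with validation in the By role. Column names (mvalue, rooms, and so on), the placeholder prediction column, and some platform keywords (the Fit arguments, Number of Tours, Y Predictors) are assumptions for illustration rather than the presenter's exact script; check Help > Scripting Index in your JMP version for the precise arguments.

// Condensed JSL sketch of the workflow above; names and some keywords are illustrative assumptions.
dt = Open( "$SAMPLE_DATA/Boston Housing.jmp" );

// 1) 60/20/20 validation column coded 0/1/2, labeled so reports read Training/Validation/Test.
Random Reset( 123 );
v = J( N Rows( dt ), 1, 0 );
For( i = 1, i <= N Rows( dt ), i++,
    u = Random Uniform();
    v[i] = If( u < 0.6, 0, u < 0.8, 1, 2 );
);
dt << New Column( "Validation", Numeric, Nominal, Set Values( v ) );
dt:Validation << Set Property( "Value Labels", {0 = "Training", 1 = "Validation", 2 = "Test"} );

// 2) Two-hidden-layer neural net as described: 3 nodes then 2, transformed covariates, five tours.
//    Save its prediction formula from the report to get a column for the comparison.
nn = dt << Neural(
    Y( :mvalue ),
    X( :crim, :rooms, :lstat ),          // a few of the 13 inputs, for brevity
    Validation( :Validation ),
    Transform Covariates( 1 ),
    Fit( NLinear( 3 ), NLinear2( 2 ), Number of Tours( 5 ) )
);

// 3) A column of predictions imported from another environment; here a placeholder
//    filled with noisy values so the sketch runs on its own.
pcol = dt << New Column( "Pred NN MATLAB", Numeric, Continuous );
pcol << Set Each Value( :mvalue + Random Normal( 0, 3 ) );
pcol << Set Property( "Predicting", {:mvalue, Creator( "MATLAB" )} );   // property structure assumed

// 4) Compare everything, with the validation column in the By role.
dt << Model Comparison(
    Y Predictors( :Pred NN MATLAB ),     // add the saved JMP prediction formula columns here as well
    By( :Validation )
);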
Labels (8): Advanced Statistical Modeling, Automation and Scripting, Basic Data Analysis and Modeling, Data Access, Design of Experiments, Predictive Modeling and Machine Learning, Quality and Process Engineering, Reliability Analysis
Creating a Reliability Modeling & Report Generation App using JSL + JMP Reliability Capabilities (2020-US-30MP-591)
Monday, October 12, 2020
Shamgar McDowell, Senior Analytics and Reliability Engineer, GE Gas Power Engineering Faced with the business need to reduce project cycle time and to standardize the process and outputs, the GE Gas Turbine Reliability Team turned to JMP for a solution. Using the JMP Scripting Language and JMP’s built-in Reliability and Survival platform, GE and a trusted third party created a tool to ingest previous model information and new empirical data which allows the user to interactively create updated reliability models and generate reports using standardized formats. The tool takes a task that would have previously taken days or weeks of manual data manipulation (in addition to tedious copying and pasting of images into PowerPoint) and allows a user to perform it in minutes. In addition to the time savings, the tool enables new team members to learn the modeling process faster and to focus less on data manipulation. The GE Gas Turbine Reliability Team continues to update and expand the capabilities of the tool based on business needs. Auto-generated transcript... Speaker Transcript Shamgar McDowell Maya Angelou famously said, "Do the best you can, until you know better. Then when you know better, do better." Good morning, good afternoon, good evening. I hope you're enjoying the JMP Discovery Summit, you're learning some better way ways of doing the things you need to do. I'm Shamgar McDowell, senior reliability and analytics engineer at GE Gas Power. I've been at GE for 15 years and have worked in sourcing, quality, manufacturing and engineering. Today I'm going to share a bit about our team's journey to automating reliability modeling using JMP. Perhaps your organization faces a similar challenge to the one I'm about to describe. As I walk you through how we approach this challenge, I hope our time together will provide you with some things to reflect upon as you look to improve the workflows in your own business context. So by way of background, I want to spend the next couple of slides, explain a little bit about GE Gas Power business. First off, our products. We make high tech, very large engines that have a variety of applications, but primarily they're used in the production of electricity. And from a technology standpoint, these machines are actually incredible feats of engineering with firing temperatures well above the melting point of the alloys used in the hot section. A single gas turbine can generate enough electricity to reliably power hundreds of thousands of homes. And just to give an idea of the size of these machines, this picture on the right you can see there's four adult human beings, which just kind of point to how big these machines really are. So I had to throw in a few gratuitous JMP graph building examples here. But the bubble plot and the tree map really underscore the global nature of our customer base. We are providing cleaner, accessible energy that people depend upon the world over, and that includes developing nations that historically might not have had access to power and the many life-changing effects that go with it. So as I've come to appreciate the impact that our work has on everyday lives of so many people worldwide, it's been both humbling and helpful in providing a purpose for what I do and the rest of our team does each day. So I'm part of the reliability analytics and data engineering team. Our team is responsible for providing our business with empirical risk and reliability models that are used in a number of different ways by internal teams. 
So in that context, we count on the analysts on our team to be able to focus on engineering tasks, such as understanding the physics that affect our components, the quality and applicability of the data we use, and the trade-offs in the modeling approaches and the best way to extract value from our data. Those are all value-added tasks. Our process also entails a rigorous review with the chief engineers, so a PowerPoint pitch containing the models is part of that process, and previously creating this presentation entailed significant copying and pasting across a variety of tools, which was both time consuming and prone to errors. So that's not value added. We needed a solution that would give our engineers more time to focus on the value-added tasks and would further standardize the process: those two things, greater productivity with the ability to focus on what matters, and further standardization. To that end, we use the mantra Automate the Boring Stuff. I wanted to give you a feel for the scale of the data sets we use, because the volume of data you're dealing with can often dictate the direction you go in terms of solutions. In our case there's some variation, but as a general rule we're dealing with thousands of gas turbines in the field, hundreds of tracked components in each unit, and tens of inspections or reconditionings per component. So in all, there are millions of records we're dealing with. Typically, though, our models are targeted at specific configurations and thus are built on more limited data sets, with tens of thousands or fewer records. The other thing I'll point out here is that we often have over 100 columns in our data sets. So there are challenges with this data size that made JMP a much better fit than something like an Excel-based approach to the same tasks. Now, the first version of this tool GE built with a third party using the JMP Scripting Language, and the name of the tool is Computer Aided Reliability Modeling Application, or CARMA, with a C. The amount of effort involved in building it out to what we have today is not trivial; this is a representation of that, and you can see the number of scripts and lines of code, which testifies to the scope and size of the tool as it stands today. But it has also proven to be a very useful tool for us. As time has gone on, we've seen the need to continue to develop and improve CARMA, and in order to do that we've had to grow and foster some in-house expertise in JSL coding; I oversee the work of developers who focus on this and some related tools. My message to you is that even after you create something like CARMA, there's going to be an ongoing investment required to maintain the app, keep it relevant, and evolve it as your business needs evolve. But it's both doable and the benefits are very real. A survey of our users this summer pointed to a net promoter score of 100% and at least a 25% reduction in the cycle time to do a model update, so that's real time being saved. Anecdotally, we also see where CARMA has surfaced issues in our process that we've been able to address, issues that otherwise might have remained hidden. And I have a quote, it's kind of long, but I wanted to pass on this caveat on automation from Bill Gates, who knows a thing or two about software development:
"The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." So that's the end of the quote, but it's a great reminder that automation is not a silver bullet that will fix a broken process; we still need people to do that today. Okay, so before we do a demonstration of the tool, I just wanted to give a high-level overview of the inputs and outputs in CARMA. The user has to point the tool to the input files. Over here on the left, you see we have an active models file, which is essentially the already-approved models, and then we have the empirical data. In the user interface the user does some modeling activities, and then the outputs are the running models (that is, updates to the active models) and a PowerPoint presentation; we'll look at that as well. As background for the data I'll be using in the demo, I started with the locomotive data set that ships with JMP as sample data, which gives one population, and then I added in two additional populations of models. The big message here is that everything we're going to see is made-up data. It's not real; it doesn't represent the functionality or behavior of any of our parts in the field. It's all contrived, so keep that in mind as we go through the results, but it should give us a way to look at the tool nonetheless. So I'm going to switch over to JMP for a second; I'm using JMP 15.2 for the demo. This data set is simplified compared to what we normally see, but like I said, it should exercise the core functionality in CARMA. First, I'm just going to go to the Help menu, Sample Data, and you'll see the Reliability and Survival menu here; that's where we're going. One of the nice things about JMP is that it has a lot of different disciplines, functionality, and specialized tools, and for my case with reliability there's a lot here, which also lends to the value of using JMP as a home for CARMA. I wanted to point you to the locomotive data set and show you that it originally came out of a textbook, Applied Life Data Analysis. In that text there's a problem that asks what the risk is at 80,000 exposures, and we're going to model that today in our data set, in what we've called an oxidation model; essentially CARMA will give us the answer. Again, a really simple answer, but I was going to show you that you can get the same thing by clicking in the analysis menu: go to Analyze, Reliability and Survival, Life Distribution; put the time and censor columns where they need to go; and use a Weibull distribution, so it creates a fit for that data. Two parameters I want to point out are the beta, 2.3, and what's called the Weibull alpha here; in our tool it will be called eta, and it's 183. Okay, so we've seen how to do that here. Now I just want to jump over and look at a couple of the other files, the input files, so I'll pull those up. Okay, this is the model file. I mentioned I made three models, and these are the active models that we're going to be comparing the data against. You'll see that oxidation is the first one, and, in addition to having model parameters, it has some configuration information.
This is just two simple things here (combustion system, fuel capability) I use for examples, but there's many, many more columns, like it. But essentially what CARMA does, one of the things I like about it is when you have a large data set with a lot of different varied configurations, it can go through and find which of those rows of records applies to your model and do the sorting real time, and you know, do that for all the models that you need to do in the data set. And so that's what we're going to use that to demonstrate. Excuse me. Also, just look, jump over to the empirical data for a minute. And just a highlight, we have a sensor, we have exposures, we have the interval that we're going to evaluate those exposures at, modes, and then these are the last two columns I just talked about, combustion system and fuel capability. Okay, so let's start up CARMA. As an add in, so I'll just get it going. And you'll see I already have it pointing to the location I want to use. And today's presentation, I'm not gonna have time to talk through all the variety of features that are in here. But these are all things that can help you take and look at your data and decide the best way to model it, and to do some checks on it before you finalize your models. For the purposes of time, I'm not going to explain all that and demonstrate it, but I just wanted to take a minute to build the three models we talked about create a presentation so you can see that that portion of the functionality. Excuse me, my throat is getting dry all the sudden so I have to keep drinking; I apologize for that. So we've got oxidation. We see the number of failures and suspensions. That's the same as what you'll see in the text. Add that. And let's just scroll down for a second. That's first model added Oxidation. We see the old model had 30 failures, 50 suspensions. This one has 37 and 59. The beta is 2.33, like we saw externally and the ADA is 183. And the answer to the textbook question, the risk of 80,000 exposures is about 13.5% using a Weibull model. So that's just kind of a high level of a way to do that here. Let's look at also just adding the other two models. Okay, we've got cracking, I'm adding in creep. And you'll see in here there's different boxes presented that represent like the combustion system or the fuel capability, where for this given model, this is what the LDM file calls for. But if I wanted to change that, I could select other configurations here and that would result in changing my rows for FNS as far as what gets included or doesn't. And then I can create new populations and segment it accordingly. Okay, so we've gotten all three models added and I think, you know, we're not going to spend more time on that, just playing with the models as far as options, but I'm gonna generate a report. And I have some options on what I want to include into the report. And I have a presentation and this LDM input is going to be the active models, sorry, the running models that come out as a table. All right, so I just need to select the appropriate folder where I want my presentation to go And now it's going to take a minute here to go through and and generate this report. This does take a minute. But I think what I would just contrast it to is the hours that it would take normally to do this same task, potentially, if you were working outside of the tool. And so now we're ready to finalize the report. Save it. And save the folder and now it's done. It's, it's in there and we can review it. 
The other thing I'll point out, as I pull up, I'd already generated this previously, so I'll just pull up the file that I already generated and we can look through it. But there's, it's this is a template. It's meant for speed, but this can be further customized after you make it, or you can leave placeholders, you can modify the slides after you've generated them. It's doing more than just the life distribution modeling that I kind of highlighted initially. It's doing a lot of summary work, summarizing the data included in each model, which, of course, JMP is very good for. It, it does some work comparing the models, so you can do a variety of statistical tests. Use JMP. And again, JMP is great at that. So that, that adds that functionality. Some of the things our reviewers like to see and how the models have changed year over year, you have more data, include less. How does it affect the parameters? How does it change your risk numbers? Plots of course you get a lot of data out of scatter plots and things of that nature. There's a summary that includes some of the configuration information we talked about, as well as the final parameters. And it does this for each of the three models, as well as just a risk roll up at the end for for all these combined. So that was a quick walkthrough. The demo. I think we we've covered everything I wanted to do. Hopefully we'll get to talk a little more in Q&A if you have more questions. It's hard to anticipate everything. But I just wanted to talk to some of the benefits again. I've mentioned this previously, but we've seen productivity increases as a result of CARMA, so that's a benefit. Of course standardization our modeling process is increased and that also allows team members who are newer to focus more on the process and learning it versus working with tools, which, in the end, helps them come up to speed faster. And then there's also increased employee engagement by allowing engineers to use their minds where they can make the biggest impact. So I also wanted to be sure to thank Melissa Seely, Brad Foulkes, Preston Kemp and Waldemar Zero for their contributions to this presentation. I owe them a debt of gratitude for all they've done in supporting it. And I want to thank you for your time. I've enjoyed sharing our journey towards improvement with you all today. I hope we have a chance to connect in the Q&A time, but either way, enjoy the rest of the summit.
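As a point of reference for the locomotive example walked through above, here is a minimal JSL sketch of the interactive steps (open the sample table, launch Life Distribution, request a Weibull fit). The file path, the column names Time and Censor, the censor coding, and the Fit Weibull message are assumptions about the sample table and platform keywords rather than the presenter's script; the risk estimate at 80,000 exposures is then read from the report's distribution profiler rather than computed here.

// Minimal JSL sketch of the Life Distribution steps shown in the demo.
// Path, column names, and the Fit Weibull keyword are assumptions; adjust to your JMP version.
loco = Open( "$SAMPLE_DATA/Reliability/Locomotive.jmp" );

ld = loco << Life Distribution(
    Y( :Time ),          // exposure column (assumed name)
    Censor( :Censor )    // censor column; 1 = suspension assumed as the default coding
);
ld << Fit Weibull;       // keyword assumed; equivalent to checking Weibull under Compare Distributions
// In the demo this fit gives roughly beta = 2.3 and a Weibull scale (alpha/eta) of about 183.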
Labels (12): Advanced Statistical Modeling, Automation and Scripting, Basic Data Analysis and Modeling, Consumer and Market Research, Content Organization, Data Blending and Cleanup, Data Exploration and Visualization, Design of Experiments, Mass Customization, Predictive Modeling and Machine Learning, Quality and Process Engineering, Reliability Analysis
ABCs of Structural Equations Models (2020-US-45MP-590)
Monday, October 12, 2020
Laura Castro-Schilo, JMP Research Statistician Developer, SAS Institute, JMP Division James Koepfler, Research Statistician Tester, SAS Institute, JMP Division This presentation provides a detailed introduction to Structural Equation Modeling (SEM) by covering key foundational concepts that enable analysts, from all backgrounds, to use this statistical technique. We start with comparisons to regression analysis to facilitate understanding of the SEM framework. We show how to leverage observed variables to estimate latent variables, account for measurement error, improve future measurement and improve estimates of linear models. Moreover, we emphasize key questions analysts’ can tackle with SEM and show how to answer those questions with examples using real data. Attendees will learn how to perform path analysis and confirmatory factor analysis, assess model fit, compare alternative models and interpret all the results provided in the SEM platform of JMP Pro. PRESENTATION MATERIALS The slides and supplemental materials from this presentation are available for download here. Check out the list of resources at the end of this blog post to learn more about SEM. The article and data used for the examples of this presentation are available in this link. Auto-generated transcript... Speaker Transcript Laura Castro-Schilo Hello everyone. Welcome. This session is all about the ABCs of Structural Equation modeling and what I'm going to try is to leave you with enough tools to be able to feel comfortable to specify models and interpret the results of models fit using the structural equation modeling platform and JMP Pro. Now what we're going to do is start by giving you an introduction of what structural equation modeling is and particularly drawing on the connections it has to factor analysis and regression analysis. And then we're going to talk a little bit about how path diagrams are used in SEM and their important role within this modeling framework. I'm going to try to keep that intro short so that we can really spend time on our hands-on examples. So after the intro, I'm going to introduce the data that I'm going to be using for demonstration. It turns out these data are about perceptions of threats of the Covid 19 virus. So after introducing those data, we're going to start looking at how to specify models and interpret their results within the platform, specifically by answering a few questions. Now, these, these questions are going to allow us to touch on two very important techniques that can be done with an SEM. One is confirmatory factor analysis and also multivariate regression. And to wrap it all up, I'm going to show you just a brief model in which we bring together both the confirmatory factor model, regression models and that way you can really see the potential that SEM has for using it with your own data for your own work. Alright, so what is SEM? Structural equation modeling is a framework where factor analysis and regression analysis come together. And from the factor analysis side, what we're able to get is the ability to measure things that we do not observe directly. And on the regression side, we're able to examine relations across variables, whether they're observed or unobserved. So when you bring those two tools together, we end up with a very flexible framework, which is SEM, where we can fit a number of different models. Path diagrams are a unique tool within SEM, because all statistical structural equation models, the actual models, can be depicted through a diagram. 
And so we have to learn just some notation for how those diagrams are drawn. Squares represent observed variables, circles represent latent variables, variances and covariances are represented with double-headed arrows, and regressions and loadings are represented with one-headed arrows. As a side note, there's also a triangle used in path diagrams, but that's outside the scope of what we're going to talk about today; the triangle is used to represent means and intercepts, and unfortunately we just don't have enough time to talk about all of the awesome things we can do with means as well. I also want to show you the fundamental building blocks of these models. The first is a simple regression: x, which is in a box, is predicting y, which is also in a box, so we know those are observed variables, and each of them has a double-headed arrow that starts and ends on itself, meaning those are variances. For x, this arrow is simply its variance, but for y, the double-headed arrow represents a residual variance. Now, of course, in SEM a variable can be both an outcome and a predictor, so you can imagine having another variable z that y in turn predicts; we can put together as many regressions as we're interested in within one model. The second basic building block for SEM is the confirmatory factor model, and the most basic example of that is shown right here, where we specify one factor, or one latent variable, which is unobserved. It's shown here with arrows pointing to w, x, and y, because the latent variable is thought to cause the common variability we see across w, x, and y. This is a confirmatory factor model with just one factor, and I think it's really important to understand what a factor is from the factor-analytic perspective and to distinguish it from principal components, or principal component analysis, which is easy to confuse with it. So I'll take a quick tangent to show what's different about a factor, from the factor-analytic perspective, compared with PCA. Here these squares represent observed variables, the things we're measuring, and I've colored in blue a portion of each observed variable, which represents the variance those variables are intended to measure; it's the signal, the part we're really interested in. Then we have these gray areas, which represent the proportion of variance that comes from other sources. It can be systematic variance, but it's variance that is not what we wanted to pick up with our measuring instrument: sources of variance that are unique to each of these variables, plus measurement error. The difference between factor analysis and PCA is that in factor analysis a latent variable captures only the common variance that exists across all of the observed variables, and that is the part the latent variable accounts for. That is in contrast to PCA, where a principal component represents the maximal amount of variance that can be explained in the dimensions of the data. So you can see that the principal component is going to pick up as much variance as it can explain, which means it will be an amalgamation of the variance due to what we intended to measure and, perhaps, other sources of variance as well.
So this is a very important distinction, because when we want to measure unobserved variables, factor analysis is indeed a better choice, unless you know if our goal truly is dimension reduction, then PCA is an ideal tool for that. Also notice here that the arrows are pointing in different directions. And that's because in factor analysis, there really is a an underlying assumption that that unobserved variable is causing the variability we observe. And so that is not the case in PCA, so you can see that distinction here from the diagram. So anytime we talk about a factor or a latent variable in SEM, we're most likely talking about a latent variable from this perspective of factor analysis. Now here's a large structural equation model where I have put together a number of different elements from those building blocks I showed before. When we see numbers because we've estimated this model. And you see that there's observed variables that are predicting other variables. We have some sequential relations, meaning that one variable leads to another, which then leads to another. This is also an interesting example because we have a latent variable here that's being predicted by some observed variable. But this latent variable also in turn, predicts other variables. And so it illustrates, this diagram illustrates nicely, a number of uses that we can have in, for basically reasons that we can have for using SEM, including the five that we have unobserved variables that we want to model within a larger context. We want to perhaps account for measurement error. And that's important because latent variables are purged of measurement error because they only account for the variance that's in common across their indicators. We...if you also have the need to study sequential relations across variables, whether they're observed or unobserved, SEM is an excellent tool for that. And lastly, if you have missing data, SEM can also be really helpful even if you just have a regression, because all of the data that are available to you will be used in SEM during estimation, at least within the algorithms that we have in JMP pro. So that...those are really great reasons for using SEM. Now I want to use this diagram as well to introduce some important terminology that you'll for sure come across multiple times if you decide that SEM is a a tool that you are going to use in your own work. So we talked about observed variables. Those are also called manifest variables in the SEM jargon. There's latent variables. There are also variables called exogenous. In this example, there's only two of them. And those are variables that only predict other variables and they are in contrast to endogenous variables, which actually have variables predicting them. So here, all of these other variables are endogenous. Latent variables, the manifest variables that they point to, that they predict, those are also called latent variable indicators. And lastly, we talked about the unique factor variance from a factor analytic perspective. And those are these residual variances from a factor, from a latent variable, which is variance that is not explained by the latent variable, and represents both the combination of systematic variance that's unique to that variable plus measurement error. All right. I have found that in order to understand structural equation modeling in a bit easier way, it's important to shift our focus into realizing that the data that we're really modeling under the hood is the covariance structure of our data. 
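To make the "data under the hood" point concrete, here is a minimal sketch with pandas and numpy. The column names and values are hypothetical stand-ins, not the Covid threat data, and the point is only that SEM works on the covariance structure of the wide table.

```python
# Minimal sketch: the wide data table is what launches the platform, but what an
# SEM algorithm actually analyzes is the covariance matrix of those columns.
# Column names here are hypothetical stand-ins for survey items.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(550, 4)),
                    columns=["realistic_1", "realistic_2", "symbolic_1", "symbolic_2"])

S = data.cov()          # the sample covariance structure being modeled
p = S.shape[0]
# With p observed variables there are p*(p+1)/2 unique variances and covariances,
# which is the budget that SEM degrees of freedom are counted against.
print(S.round(2))
print("unique moments:", p * (p + 1) // 2)
```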
We also model the means, the means structure, but again, that's outside the scope of the presentation today. But it's important to think about this because it has implications for what we think our data are. You know, we're used to seeing our data tables and we have rows and variables are our columns, and and yes that is... these data can be used to launch our SEM platform. However, our algorithms, deep down, are actually analyzing the covariance structure of those variables. So when you think about the data, these are really the data that are being modeled under the hood. That also has implications for when we think about residuals, because residuals in SEM are with respect to variances and covariances of you know, those that we estimate, in contrast to those that we have in the sample. So residuals, again, we need a little bit of a shift in focus to what we're used to, from other standard statistical techniques to really wrap our heads around what structural equation models are. And things...concepts such as degrees of freedom are also going to be degrees of freedom with respect to the covariance matrix. And so once we make this shift in focus, I think it's a lot easier to understand structural equation models. Now I want to go over a brief example here of how is it that SEM estimation works. And so usually what we do is we start by specifying a model and the most exciting thing about JMP Pro and our SEM platform is that we can specify that model directly through a path diagram, which makes it so much more intuitive. And so that path diagram, as we're drawing it, what we're doing is we're specifying a statistical model that implies the structure of the covariance matrix of the data. So the model implies that covariance structure, but then of course we also have access to the sample covariance matrix. And so what happens during estimation is that we try to match the values from the sample covariance as close as possible, given what the model is telling us the relations between the data are, not within the variables. And once we have our model estimates, then we use those to get an actual model-implied covariance matrix that we can compare against the sample covariance matrix. So, by looking at the difference between those two matrices, we are able to obtain residuals, which allows us to get a sense of how good our models fit or don't fit. So in a nutshell, that is how structural equation modeling works. Now we are done with the intro. And so I want to introduce to you, tell you a little bit of context for the data I'm going to be using for our demo. These data come from a very recently published article in the Journal of Social, Psychological and Personality Science. And the authors in this paper wanted to answer a straightforward question. They said, "How do perceived threats of Covid 19 impact our well being and public health behaviors?" Now what's really interesting about this question is that perceived threats of Covid 19 is a very abstract concept. It is a construct for which we don't have an instrument to measure it. It's something we do not observe directly. And that is why in this article, they had to figure out first, how to measure those perceived threats. And the article focused on two specific types of threats. They call the first realistic threats, because they're related to physical or financial safety. And also symbolic threats, which are those threats that are posed on one's sociocultural identity. 
And so they actually came up with this final threat scale, they went through a very rigorous process to develop this survey, this scale to measure realistic threat and symbolic threat. Here you can see here that people had to answer how much of a threat, if any, is the corona virus outbreak for your personal health, you know, the US economy, what it means to be an American, American values and traditions. So basically, these questions are focusing on two different aspects of the threat of the virus, one that is they labeled realistic, because it deals with personal and financial health issues. And then the symbolic threat is more about that social cultural component. So you can see all of those questions here. And we got the data from one of the three studies that they that they did. And those data from 550 participants who answered all of these questions in addition to a number of other surveys and so we'll be able to use those data in our demo. And we're going to answer some very specific questions. The very first one is, how do we go about measuring perceptions of Covid 19 threats. There's two types of threats, we're interested in. And the question is, how do we do this, given that it's such an abstract concept. And this will take us to talk about confirmatory factor analysis and ways in which we can assess the validity and reliability of our surveys. One thing we're not going to talk about though is the very first step, which is exploratory factor analysis. That is something that we do outside of SEM and it is something that should be done as the very first step to develop a new scale or a new survey, but here we're going to pick up from the confirmatory factor analysis step. A second question is do perceptions of Covid 19 threats predict well being and public health behaviors? And this will take us to talk more about regression concepts. And lastly, are effects of each type of threat on outcomes equal? And this is where we're going to learn about a very unique helpful feature of SEM, which is allowing us to impose equality constraints on different effects within our models and being able to do systematic model comparisons to answer these types of questions. Okay, so it's time for our demo. So let me go ahead and show you the data, which I've already have open right here. Notice my data tables has a ton of columns, because there's a number of different surveys that participants responded to, but the first 10 columns here are those questions, or the answers to the questions I showed you in that survey. And so what we're going to do is go to analyze, multivariate methods, structural equation models and we're going to use those answers from the survey. All of the 10 items, we're going to click model variables so that we can launch our platform with those data. Now, right now I have a data table that has one observation per row. That's what the wide data format is, and so that's what I'm going to going to use. But notice there's another tab for summarize data. So if you have only the correlation or covariance matrix, you can now input that in...well, you will be able to do it in JMP 16, JMP Pro 16 and so that's another option because, remember that at the end of the day, what we're really doing here is modeling covariance structures. So you can use summarize data to launch the platform. Alright, so let's go ahead and click OK. And the first thing we see is this model specification window which allows us to do all sorts of fun things. 
Let's see, on the far right here we have the diagram and notice our diagram has a default specification. So our variables all have double headed arrows, which means they all have a variance They also have a mean, but notice if I'm right clicking on the canvas here and I get some options to show the means or intercepts. So again, this is outside the scope of today, so I'm going to keep those hidden but do know that the default model in the platform has variances and means for every variable that we observe. The list tab contains the same information as the diagram, but in a list format and it will split up your paths or effects based on their type. We also have a status step which gives you a ton of information about the model that you have at that very moment. So right now, it tells us, you know, the degrees of freedom we have, given that this is the model we have specified here is just the default model. And it also tells us, you know, data details and other useful things. Notice that this little check mark here changes if there is a problem with your model. So as you're specifying your model, if we encounter something that looks problematic or an error, this tab will change in color and type and so it will be helpful to hopefully help you solve any type of issues with the specification of the model. Okay, so on the far left, we have an area for having a model name. And we also have from and to lists. And so this makes it very easy to select variables here, in the from and then in a to role, wherever those might be. And we can connect them with single-headed arrows or double-headed arrows, which we know, they are regressions or loadings or variances or covariances. Now for our case right now, we really want to figure out how do we measure this unobserved construct of perceptions of Covid 19 threat. And I know that the first five items that I have here are the answers to questions that the authors labeled realistic threats. So I'm going to select those variables and now here we're going to change the default name of latent one to realistic because that's going to be the realistic threat latent variable. So I click the plus button. And notice, we automatically give you this confirmatory factor model with one factor for realistic threat. An interesting observation here is that there is a 1 on this first loading that indicates that this path, this effect of the latent variable on the first observed variable is fixed to one. And we need to have that, because otherwise our model would not be identified. So we will, by default, fix what your first loading to one in order to identify the model and be able to get all of your estimates. An alternative would be to fix the variance of the latent variable to one, which would also help identify the model, but it's a matter of choice which you do. Alright, so we have a model for realistic threat. And now I'm going to select those that are symbolic threat and I will call this symbolic and we're going to hit go ahead and add our latent variable for that. I changed my resolution. And so now we are seeing a little bit less than what I was expecting. But here we go. There's our model. Now we might want to specify, and this is actually very important, realistic and symbolic threats. We expect those to vary, to co-vary and therefore, we would select them in our from and to list and click on our double-headed arrow to add that covariance. And so notice here, this is our full two factor confirmatory factor analysis. 
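Under the hood, the two-factor CFA just described implies a covariance matrix with a specific structure. The sketch below writes out that algebra in numpy with made-up loadings and variances; it is illustrative only and not the platform's estimates.

```python
# Algebra behind a two-factor CFA (illustrative numbers only): the model-implied
# covariance matrix is Sigma(theta) = Lambda @ Phi @ Lambda.T + Theta, and
# estimation adjusts theta so Sigma gets as close as possible to the sample
# covariance S.
import numpy as np

# Five realistic-threat items and five symbolic-threat items, each loading only
# on its own factor; the first loading of each factor is fixed to 1 so the
# model is identified.
Lambda = np.zeros((10, 2))
Lambda[:5, 0] = [1.0, 0.9, 0.8, 0.85, 0.7]    # realistic threat loadings
Lambda[5:, 1] = [1.0, 1.1, 0.95, 0.9, 1.0]    # symbolic threat loadings

Phi = np.array([[0.50, 0.20],                 # factor variances and covariance
                [0.20, 0.60]])
Theta = np.diag(np.full(10, 0.40))            # unique (residual) variances

Sigma = Lambda @ Phi @ Lambda.T + Theta       # model-implied covariance matrix
print(Sigma.shape)

# Given a sample covariance matrix S from the data table, the covariance
# residuals the talk refers to are simply S - Sigma at the fitted estimates.
```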
So we can name our model and we're ready to run. So let's go ahead and click Run. And we have all of our estimates very quickly. Everything is here. Now what I want to draw your attention to, though, is the model comparison table, and the reason is that we want to make sure our model fits well before we jump into trying to interpret the results. So let's talk about what shows up in this table. First let's note that there are two models here that we did not fit; the platform fits them by default upon launching, and we use those as a reference, a baseline that we can compare a model against. The unrestricted model, I will show you here what it is: if we had every single variable covarying with every other variable, that right there is the unrestricted model. In other words, it's a baseline for the best we can do in terms of fit. Now, the chi square statistic represents the amount of misfit in a model, and because we are estimating every possible variance and covariance without any restrictions here, that's why the chi square is zero. We also have zero degrees of freedom because we're estimating everything. So it's a perfectly fitting model, and it serves as a baseline for understanding the very best we can do. All right, then the independence model is actually the model that we have here as a default when we first launch the platform. That is a baseline for a pretty bad model, maybe not the worst, but a pretty bad one. It's one where the variables are completely uncorrelated with each other. And you can see indeed that the chi square statistic jumps to about 2,000 units, but of course we now have 45 degrees of freedom because we're not estimating much at all in this model. And then lastly we have our own two-factor confirmatory factor model, and we also see that the chi square is large, 147 with 34 degrees of freedom. It's a lot smaller than the independence model, so we're doing better, thankfully, but it's still a significant chi square, suggesting that there's significant misfit in the data. Now, here's one big challenge with SEM: back in the day, people realized that the chi square is influenced by the sample size. And here we have 550 observations, so it's very likely that even well-fitting models are going to turn out to have a significant chi square because of the large sample size. So what has happened is that fit indices unique to SEM have been developed to allow us to assess model fit irrespective of the chi square, and that's where these baseline models come in. The first one right here is called the comparative fit index (CFI). It ranges from zero to one. You can see here that one is the value for a perfectly fitting model and zero is the value for the really poorly fitting model, the independence model. And I keep sorting this by accident. Okay, so what this index means for our own model, at .9395, is the proportion of how much better we are doing with our model in comparison to the independence model. So we're about 94% better than the independence model, which is pretty good. The guidelines are that a CFI of .9 or greater is acceptable; we usually want it to be as close to one as possible, and .95 is ideal. We also have the RMSEA, the root mean square error of approximation, which represents the amount of misfit per degree of freedom, and so we want this to be very small. It also ranges from zero to one.
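Both of these indices can be written down directly from the chi-square values quoted in the talk. The sketch below is illustrative only, using the approximate figures mentioned above (chi-square of roughly 147 with 34 df for the CFA, roughly 2,000 with 45 df for the independence model, N = 550); it is not JMP output.

```python
# Fit indices computed from chi-square statistics; the inputs are the rough
# values quoted in the talk, so treat the results as approximate.
import math

def cfi(chisq, df, chisq_base, df_base):
    """Comparative fit index: improvement over the independence baseline."""
    num = max(chisq - df, 0.0)
    den = max(chisq_base - df_base, num)
    return 1.0 - num / den

def rmsea(chisq, df, n):
    """Root mean square error of approximation: misfit per degree of freedom."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

print(round(cfi(147, 34, 2000, 45), 3))   # about 0.94
print(round(rmsea(147, 34, 550), 3))      # about 0.08
```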
And you see here, our model is at about .08, and generally speaking .1 or lower is adequate, and that's what we want to see. And then on the right, we get some confidence limits around this fit statistic. So what this suggests is that the model is an acceptably fitting model, and therefore we're going to go ahead and try to interpret it. But it's really important to assess model fit before getting into the details of the model, because we're not going to learn very much, or at least not a lot of useful information, if our model doesn't fit well from the get-go. Alright, so, because this is a confirmatory factor analysis, we're going to find it very useful to show (I'm right-clicking here) the standardized estimates on the diagram. All right. This means that we're going to have estimates here for the factor loadings that are in the correlation metric, which is really useful and important for interpreting these loadings in factor analysis. This value here is also going to be rescaled so that it represents the actual correlation between these two latent variables, which in this case is a substantial correlation. In the red triangle menu, we can also find the standardized parameter coefficients in table form, so we can get more details about standard errors, Z statistics and so on. But you can see here that all of these values are pretty acceptable. They are the correlations between the observed variables and the latent variable, and they're generally about .48 to about .8 or so around here. So those are pretty decent values; we want them to be as high as possible. Another thing you're going to notice here about our diagrams, which is a JMP Pro 16 feature, is that these observed variables have a little bit of shading, a little gray area here, which represents the portion of variance explained by the latent variable. So it's really cool, because you can see just visually from the shading that the variables for symbolic threats actually have more of their variance explained by the latent variable. So perhaps this is a better factor than the realistic threat factor, just based on looking at how much variance is explained. Now I want to draw your attention to an option called Assess Measurement Model, and that is going to be super helpful for understanding whether the individual questions in that survey are actually good questions. We want our questions to be reliable, and that indicator reliability is what we are plotting over on this side. So notice we give a little line here for a suggested threshold of good, acceptable reliability for any one of these questions, and you can see that the symbolic threat factor seems to be doing a little better there; its questions are more reliable in comparison to the realistic threat. But generally speaking, they're both fairly good. You know, the fact that these two questions are not crossing the line is not terrible; they're around the adequate threshold that we would expect for indicator reliability. We also have statistics here that tell us about the reliability of the composite.
In other words, if we were to grab all of these questions and maybe grab all of these questions for a realistic threat and we get an average score for all of those answers per individual, that would be a composite for realistic threat and we could do the same for symbolic. And so what we have here is that index of reliability. All of these indices, by the way, range from zero to one. And so we want them to be as close to one as possible because we want them to be very reliable and we see here that both of these composites have adequate reliability. So they are good in terms of using an average score across them for other analyses. We also have construct maximal reliability and these are more the values of reliability for the latent variables themselves rather than creating averages. So we're always going to have these values be a bit higher because when you're using latent variables, you're going to have better measures. The construct validity matrix gives us a ton of useful information. The key here is that the lower triangular simply has the correlation between the two factors in this case. But the values in the diagonal represent the average variance extracted across all of the indicators of the factor. And so here you see that symbolic threats have more explained variance on average than realistic threat, but they both have substantial values here, which is good. And most importantly, we want these diagonal values to be larger than the upper triangular because the upper triangular represents the overlapping barriers between the two factors. And you can imagine, we want the factors to have more overlap and variance with their own indicators than with some other construct with a different factor. And so this right here, together with all of these other statistics are good evidence that the survey is giving us valid and reliable answers and that we can in fact use it to pursue other questions. And so that's what we're going to do here. We're going to close this and I'm going to run a different model, we're going to relaunch our platform, but this time I'm going to use... I have a couple of variables here that I created. These two are composites, they're averages for all of the questions that were related to realistic and symbolic threats. So I have those composite scores right here. And I'm going to model those along with...I have a measure for anxiety. I have a measure for negative affect. And we can also add a little bit of the CDC adherence. So these are whether people are adhering to the recommendations from the CDC, the public health behaviors. And so we're going to launch the platform with all of these variables. And what I want to do here is focus perhaps on fitting a regression model. So I selected those variables, my predictors, my outcomes. And I just click the one-headed arrow to set it up. Now the model's not fully... correctly specified yet because we want to make sure that both our. Covid threats here, we want to make sure that those are covarying with each other, because we don't have any reason to assume they don't covary. Same with the outcome, they need to covary because we don't want to impose any weird restrictions about them being orthogonal. And so this right here is essentially a path analysis. It's a multivariate multiple regression analysis. And so we can just put it here, multivariate regression, and we are going to go with this and run our model. Okay. So notice that because we fit every... 
we have zero degrees of freedom because we've estimated every variance and covariance amongst the data. So even though this suggests a perfect fit, all we've done so far is fit a regression model. And what I want to do to really emphasize that is show you...I'm going to change the layout of my diagram to make it easier to see the associations between both types of threat. I want to hide variances and covariances, and you can see here I've hidden the other edges so that we can focus on the relations these two threats have with our three outcomes. Now, in my data table, I've already fit a model using Fit Model, with that same anxiety variable as my outcome and the same two threats as the predictors. And I want to put them side by side, because I want to show you that the estimates for the regression predicting anxiety are exactly the same values that we have here in the SEM platform for both of those predictions. And that's what we should expect; Fit Model, I mean, this is regression. So far we're doing regression. Technically, you could say, well, if I'm comfortable running three of those Fit Model analyses, one for each outcome, then what is SEM buying me that's better? Well, you might not need anything else, and that might be great for your needs, but one unique feature of SEM is that we can test directly whether equality constraints in the model are tenable. What we mean by that is that I can take these two effects (for example, the realistic and symbolic threat effects on anxiety) and use this Set Equal button to impose an equality constraint. Notice here, these little labels indicate that these two paths will have an equal effect. We can then run the model, select those models in our model comparison table, and compare them with the Compare Selected Models option. And what we'll get is a change in chi square, the change in chi square going from one model to the next. So this basically says, okay, the model is going to get worse because you've now added a constraint. You gained a degree of freedom, you now have more misfit, and the question is, is that misfit significant? The answer in this case is yes. Of course, this is influenced by sample size, so we also look at the difference in the CFI and RMSEA, and anything that's .01 or larger suggests that the misfit is too much to ignore; it's significant additional misfit added by this constraint. So now that we know that, we know that the better fitting model is the one that did not have that constraint, and we can assert that realistic threats have a greater positive association with anxiety in comparison to symbolic threats, which also have a positive, significant effect, but one that is statistically not as strong as that of the realistic threats. All right. And there are other interesting effects that we have here. So what I'm going to do, as we are approaching the end of the session, is just draw your attention to this interesting effect down here, where both types of threats have different effects on adherence to CDC behaviors. And the article really pays a lot of attention to this finding because, you know, these are both threats. But as you might imagine, those who feel threats to their personal health or threats to their financial safety are more likely to adhere to the CDC guidelines of social distancing and avoiding social gatherings, whereas those who are feeling that the threat is symbolic, a threat to their sociocultural values, those folks are significantly less likely to adhere to those CDC behaviors, perhaps because they are feeling those sociocultural norms being threatened. So it's an interesting finding, and we can of course test the equivalence of a number of other paths in this model. Okay, so the last thing I wanted to do is just show you (we're not going to describe this full model) what happens when you bring together both regression and factor analysis (let's make this a little bigger). To really use the full potential of structural equation models, you ought to model your latent variables. We have here both threats as latent variables, which allows us to really purge the measurement error from those survey items, and to model associations between latent variables, which gives us unbiased, unattenuated effects, because we are accounting for measurement error when we model the latent variables directly. And notice, we are taking advantage of SEM because we're looking at sequential associations across a number of different factors. Down here you can see our diagram, which I can move around to show you all the effects that we have, and also to highlight the fact that our diagrams are fully interactive, really visually appealing, and very quickly we can identify significant effects based on the type of line versus non-significant effects, in this case these dashed lines. And so again, to really have the full power of SEM, you can see how here we're looking at those threats as latent variables and looking at their associations with a number of other public health behaviors and well-being variables. And with that, I am going to stop the demo here and let you know that, in addition to the slides, we have a really useful document that the tester for our platform, James Koepfler, put together, where he gives a bunch of really great tips on how to use the platform, from specifying the model to tips on how to compare models, what is appropriate to look at, what a nested model is; all of this information I think you're going to find super valuable if you decide to use the platform. So I definitely suggest that you go to the JMP Community to get these materials, which are supplementary to the presentation. And with that, I'm going to wrap it up, and we are going to take questions once this all goes live. Thank you very much.
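As a supplement to the equality-constraint comparison described in this talk: the Compare Selected Models step amounts to a chi-square difference test for nested models. Below is a minimal sketch with scipy; the numeric inputs are placeholders, not the values from this demo.

```python
# Chi-square difference test for nested SEMs: a model with a constraint added
# can only fit worse, and the test asks whether the added misfit is significant.
from scipy.stats import chi2

def chisq_difference(chisq_free, df_free, chisq_constrained, df_constrained):
    d_chisq = chisq_constrained - chisq_free   # additional misfit from the constraint
    d_df = df_constrained - df_free            # degrees of freedom gained
    return d_chisq, d_df, chi2.sf(d_chisq, d_df)

# Placeholder values: an unconstrained model vs. the same model with two
# regression paths set equal (one extra degree of freedom).
print(chisq_difference(chisq_free=10.0, df_free=4,
                       chisq_constrained=18.5, df_constrained=5))
```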
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Simulation, the Good, the Bad and the Ugly or Independence? Dependence, Synergistic, Antagonistic (2020-US-30MP-584)
Monday, October 12, 2020
Ned Jones, Statistician, 1-alpha Solutions Simulation has become a popular tool used to understand processes. In most cases the processes are assumed to be independent; however, many times this is not the case. A process can be viewed as physically independent, but this does not necessarily equate to stochastic independence. This is especially true when the processes are in series, such that the output of one process is the input for the next process and so forth. Using the JMP simulator, a simple series of processes is set up, represented by JMP random functions. The process parameters are assumed to have a multivariate normal distribution. By modifying the correlation matrix, the effect of independence versus dependence is examined. These differences are shown by examining the tails of the resulting distributions. When the processes are dependent, the effect of synergistic versus antagonistic process relationships is also investigated. Auto-generated transcript... Speaker Transcript nedjones The Good, the Bad, the Ugly or Independence and Dependence, Synergistic and Antagonistic. I am Ned Jones. I have a small consulting business called 1-alpha Solutions. You can see my contact information there. Let's get into the simulation discussion. I'm going to be running the simulation in JMP, obviously. The simulator lets you discover and model input random variation and output random variation, based on the inputs and any noise that you add into the simulator. The simulator lives in the profiler: the random inputs are defined from the random functions that you have, and you're able to run a simulation and produce output tables of simulated variables. The next thing I want to do is talk about a couple of different types of simulation. First, a simple simulation: if you have one input and one output, there is no issue of dependence in the simulation. The ones that we're concerned about primarily are simulations where we have multiple inputs and one or more system outputs. The concern is that there could be dependence among the input variables. Scrolling down here a little bit, I want to talk about what it means to have stochastic independence. Two events A and B are independent if and only if their joint probability equals the product of their probabilities. Well, that's the end result we want. I'm going to define it and look at it a little bit differently; this should make it a little clearer. If the probability of the intersection of events A and B is equal to that joint product, that implies that the probability of A is equal to the probability of A given B, and similarly, the probability of B is equal to the probability of B given A. Thus the occurrence of A is not affected by the occurrence of B, and vice versa. Take a simple die-rolling example where event B is rolling a 2, 4, or 6. You can easily see that the probability of A is 2 in 6, or one third, and the probability of B is 3 in 6, or one half. But if you look at the intersection of A and B, the only outcome you get is a 2, so its probability is 1 in 6. And the probability of A times the probability of B, one third times one half, is also 1 in 6. Now, the probability of A is equal to the probability of A given B, which is equal to one third.
And if we look at that and we realize that we're saying, okay, B has occurred, we know that we have a 2, a 4 or a 6, but there's still a one third chance that A could have occurred. So we can see that still, it stays at one third. And similarly, we see the same thing happening with the probability of B. Therefore, A and B are independent. Now I'm a role on in and look at the example I have and talk about that. What I'm doing is I'm simulating the pest load And the probability of a mating pair. What we have is we have a fruit harvest population and from that fruit harvest population, we're going to have some cultural practices that are applied in the... in our orchard Grove or vineyard to get a pest load...will have a pest load after those cultural practices are applied. Then we have to the harvest...the crop is harvested, we'll do some manual culling and will estimate a pest load there. And then after that we can...you can see that we have a cold storage and we'll have a pest load after the cold storage. We're going to try to freeze them to death. And the final thing we do is once we get this pest load here, we're going to break it up in a marketing distribution and split that population into several smaller pieces. And we'll be able to calculate the probability of a mating pair from that. Well, the problem, you can see immediately is that these things become very dependent because the output of the harvest population is the input for treatment A. The output of the treatment A is the input for treatment B and so forth on to C on down to the meeting pair. Now here's, here's a table I...here's a table we'll work with and we'll start with. And here is what we have is we set this up in for the simulator to work with and we have a population range of 1 million to 2.5 million fruit. We have a treatment here, a treatment range of the efficacy of mitigation that we're seeing. Here's the number of survivors we would expect from this treatment population and we have a Population A is a result of that. We're going to have a population B as a result of that, we can take a look at the formulas here that are used. And what is what is being done here, this a little differently is, I'm putting a random variable in the...that is going to go into the profiler. So the profiler is going to see this immediately as a random variable going in. So we're simulating the variable coming in, even before it comes in. So with that, you can go...we can go across. You can see the rest of the table. We're going over. We have another set, we have survivors that's after Treatment C, the same type of thing. Then we have this distribution and we had a probability of a mating pair. I'll show you that formula. It's a little different. The probability of a mating pair. Well, this is just using an exponential to estimate the probability of mating pair so you know what's going on. I haven't hidden anything from you behind the curtain and so forth. Let's take a look. So to open up the profiler, we're going to go to graph and down to profiler. All right. And then from there, we're going to select our variables that have a formula that we're interested in. So we're gonna have...we're gonna have the Pop A, Pop B, survivors and the probability of a mating pair. Going to put those in and we're going to say, uh-oh, we got to extend the variation here and we're going to say, okay. We got a nice result. Very attractive graphs here. 
And first thing you're going to see is, you're going to see squiggly lines in this profiler that if you use a profiler that you're probably not used to seeing lines like that. It's just a little different approach and so forth that you can see how these things work and Doing a little adjustment here so you can see the screens better. Now from this point what we're gonna do is we're going to open up the simulator in the profiler. We go up here and just click on the simulator and it gives us these choices down here. First thing I want to do with this is I want to increase the number of simulation runs to 25,000. Okay. And what I'm going to do...what we do if we have independence, one of the tests, quite often for that, we use for that is that we we'll look at the correlation. So I'm going to use a correlation here. Use the correlations and set up some correlations. So for this first population, I'm going to call it multivariate and immediately you can see we get a correlation specification down below. And we'll set up another multivariate here and another multivariate for treatment B. Another multivariate for for treatment A. Now what this is doing is, this is taking those treatment parameters that we had up above, we had before. And it is putting those in our multivariate relationship with each other. We also got the last thing, this marketing distribution. I don't want it to be continuous so I'm going to make it random and we're going to make it an integer. We'll make that an integer and we've got that run and we can see the results. Now this is the...I'll call this the Goldilocks situation with all the zeros down here, that implies that all of these relationships are completely independent and we can run our simulator here. And see the results. Do little more adjustment here on these axes. This come to life. Please look for here. Okay. Now you see those results. But what we're going to have here and look at this, is, we have the rate at which it's exceeding a limit that's been put in there. I put those spec limits back in the variables, but the one that I'm most concerned about is the probability of a mating pair. And wouldn't you know, I've run this real time and it hasn't come out exactly the way it should. Let's try a couple more times here. See. What we got the probability of a mating pair and that is supposed to be coming up as .5, but it certainly isn't. I have something isn't...oh, here, let's try this and fix this. This would be 4 and 14 Let's try the simulation one more time. Still didn't come up. Well, the example hasn't worked quite right, but in the previous example I was having .4 here. So that was saying, the rate was creating less than time but I'm having that probability is A .15 probability of a mating pair, but that's what happens sometimes when you try to do things on on the fly. So let me go up and I have a window that I can, we can look at that result with...we can look to that result. And let's...that has a little bit differently. And you can see now that that probability is under 5%. That's the target we're aiming for. in this thing, in this simulation. So if we go up and we can run those simulations, again you can see those bouncing around, staying under .5, so it's happening less than 5% of the time that the probability of a mating pair is greater than 1.5. Now because now, again, I'll say this is kind of a Goldilocks scenario because we're assuming all these relationships are independent. 
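Outside of JMP, the same independence-versus-dependence comparison can be sketched in a few lines of numpy. The population range of 1 to 2.5 million fruit comes from the demo table; the treatment efficacy means, standard deviations, and correlation values below are made-up placeholders, and the code simply chains the steps so each stage's survivors feed the next.

```python
# Sketch of the independence vs. dependence comparison (hypothetical numbers):
# three treatment efficacies drawn from a multivariate normal, chained so the
# survivors of one step are the input to the next.
import numpy as np

rng = np.random.default_rng(7)
n_sims = 25_000
mean_efficacy = np.array([0.95, 0.90, 0.98])
sd_efficacy = np.array([0.02, 0.03, 0.01])

def simulate(corr):
    # Build the covariance matrix from the correlation specification; the Cholesky
    # factorization doubles as a check that the matrix is positive definite.
    cov = np.outer(sd_efficacy, sd_efficacy) * corr
    np.linalg.cholesky(cov)
    eff = rng.multivariate_normal(mean_efficacy, cov, size=n_sims).clip(0, 1)
    population = rng.uniform(1e6, 2.5e6, size=n_sims)
    survivors = population * np.prod(1 - eff, axis=1)   # each step's output feeds the next
    return survivors

independent = np.eye(3)                            # the "Goldilocks" scenario
positively_correlated = np.array([[1.0, 0.3, 0.3],  # treatments that fail together
                                  [0.3, 1.0, 0.3],
                                  [0.3, 0.3, 1.0]])

for label, corr in [("independent", independent), ("correlated", positively_correlated)]:
    surv = simulate(corr)
    print(label, "99th percentile of survivors:", round(np.percentile(surv, 99), 1))
```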
I have an example that I can show you that we have, where we have one that is antagonistic and synergistic. So I'll pull up the first one here and in this one we have that the relationships are antagonistic. Now when you...if you are are creating an example like this to work with it, you can't, at least I wasn't able to make everything negative. If you notice I have these two as being positive. This wants that the matrix to be positive definite. And it doesn't come out as positive definite if you set the...if you set those all to zero, but we can run that simulation again. And you can see that... you can see here that those simulations with a negative, it really makes things very, very attractive. We're getting a low, real low rate of... that we have...we have 1., .15 probability of a mating pair so that you can see just the effect. And what I really want to show is the effect of this correlation specifications, correlation matrix down here, covariance matrix that you specified. Now let's look at one other, we'll look at the one if it's positive. And we've got we've got an example here where it's positive. And you can see I have down here. I've said here. Now, I haven't been real heroic about making those correlations very high. I've tried to keep them fairly low and so forth to be fairly realistic, after all this is biological data. And we can run those simulations again and you can see very quickly that we're exceeding that 5% rate which is...becomes a great deal of concern here and so forth. And if you were...if most of the time these simulations like this are run with no consideration of the correlation between variables and that is kind of like covering your eyes and saying, I didn't see this and so forth. But it really if there is, if there is a correlation relationship and most likely there is, because one of these in...one of these outputs is the input to the next process, so pretty well has to be dependent, and what the dependencies are, estimating these correlations will be a great task to have to come up with most of the time. Work in this area is done based on research papers and they don't have correlations between different types of treatments so. But having some estimate of those is a good thing, a good thing to have. Now the next step is to show you the what else we can do here. We can create a table. And if we create this table and... Well, we'll create this table and I'm just going to highlight everything in the table out to the mating pairs here. And then I'm just going to do an analysis distribution. And run all of those and say, okay. Now we get all these grand distributions, fill up the paper with it. But what we can do is we can go in and we can select these distributions that are exceeding our limit out here. We can just highlight those and it becomes very informative as you look back and you can see the mitigations, what's happening, and so forth. What is affecting these things greatly. And one of the things that really ...first of all, our initial population, and this has been based on what we've seen in real life, is as the population gets to be higher, when we have large, large populations of the fruit, the tendency is that we have failures of the system, Treatment A and so forth. 
So what what the one that I thought was most interesting was, if we look back here and we look at the marketing distribution, That if we push them out, if we require that as shipments come into the country and that marketing distribution has to break these shipments up out into smaller lots to be distributed, the probability of mating pair pretty well becomes zero. With with these these examples and so forth, I want to go ahead and open it up for questions. But let me just say one last thing. I think of George Box. He was at one of our meetings a few years ago. And it was really interesting what...his two quotes that he said. He said, "Don't fall in love with a model." And he also said, "All models are wrong, but some are useful." I hope this information and these examples to give you something to think about when you're doing the simulation that you need to consider the relationship between the variables. Thank you.
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
A system to distribute, manage, and monitor add-ins. (2020-US-45MP-581)
Monday, October 12, 2020
John Moore, Elanco Animal Health, Elanco Animal Health Elanco has several custom JMP add-ins. Moore has created a system in JSL that allows: • Distribution of add-ins • Version control of add-ins • Tracking of add-in usage • Easy construction and publishing of add-ins This is a complete redo of the add-in management system that Moore presented a few years ago. He has focused on making the new system more user-friendly and easier to manage. Auto-generated transcript... Speaker Transcript John Moore Hi, thanks for joining me. My name is John Moore and today and we'll be talking about Lupine, a framework I've developed for creating, publishing, managing and monitoring JMP add-ins. So first just a tiny little bit about me. My name is John Moore and my degrees are in engineering and management, but really I'm a data geek at heart. I've worked at Elanco Animal Health as an analyst for the last 19 years. I have one son, one husband, one dog, and at the moment, approximately about 80,000 bees. So, you know, how did we get here? Well at Elanco, we have TCs, which are technical consultants typically like vets or nutritionists spread across the globe, and made them use JMP and use JMP scripting. And many of these TCs only have intermittent access to the internet because just because they're so remote. So we kind of had a state where we had ad hoc scripting passing from person to person. There was no version control. There's no idea of who is using what script. And getting script out was difficult, updating scripts was difficult. So I created Lupine to try to get us to a state where we had standardized scripts, we had version control, we could easily know who was doing what, who's using which scripts. We could easily distribute updates and as an added bonus, we threw in a couple of formulas into the Formula Editor. So let's just take a look at Lupine here. Okay, so once you install Lupine, you'll see if there's not much to it. There's just one Lupine menu up on JMP and you only have check for update. When you look at check for updates, you'll see it'll list all of the add ins that are controlled by Lupine. And if it's not up to date, it'll say up to date. If it's up to date. It'll just say, okay, so it's very simple. Just to say, here are the add ins that are available. Do you have the most recent version or not? Also in the Formula Editor, just because we found ourselves using these particular formulas a lot, you know, like, month/year, which returns the first day of the of the month of the of the date has passed to it. We have year quarter, we have season. Also we have counter week where you can do things like week starting, week ending, doing those sorts of things. Also in Lupine we have usage logging. So every time a user uses a script managed by Lupine, we create a log file. So we know who used what script and what add in at what timestamp. And this really helps us figure out which of our scripts are useful and which of them aren't. So let's say you want to set up Lupine in your organization. Okay, first step we're going to download some zip files. We have a Lupine zip file, a Lupine manager, and Lupine template. And we also have Lupine example. So I suggest that you find one folder someplace and just download the three of those there. The next thing you need to decide as a manager setting this up is, where am I going to put my network folders? There are two network folders that are really important for Lupine. The first is, you're add in depot folders. 
So this is a folder that's going to contain all your add ins, and the PublishedAddins.jmp file which contains metadata about those add ins. The other one we need to think about is our central logs folder. So this is the folder that will contain all of the uploaded log files that people create every time that they execute a Lupine managed script. So you can see here, we have an example of all the add ins we have in our published add ins file. And here's just an example of the log files. So every log file has a unique ID. And we'll talk about this later, but whenever we go in and grab those, we only grab the new one so you don't have to download everything, every time. So once you've decided where you want to put those, then we need to tell Lupine where those are. And the way we do that is to go into that Lupine folder, after you've unzipped it, and under the add in management subfolder, you'll see this file called setup and this is the part to look for in setup. So here is where you define our add in depot path and our central logs path. Now the in production, you'll need to have these the network locations, but just for this little simple example, I created these just on my own machine here, but this is where you would define your network path. Once you've done that, now is the time to build Lupine for your own organization. So in the Lupine folder, you want to grab everything except if you have a gitattributes file in there. I use GitHub Desktop for managing my scripts. I strongly recommend it. It's made my life so much better. But you don't want put those in your script, so grab everything but the gitattributes file. You're going to right click on those, say send to and send those to a compressed zip folder. This will create a zip file of all those. Now JMP add in files are really just zip files. So what we're going to do is rename that Lupine.jmpaddin. And once you name it, you just have to double click on that thing, and it'll install it. Then you'll have Lupine on your machine. Now we're not gonna do anything with it yet, but it's there. If you look up on your menu bar, you'll see Lupine. Next, let's talk about Lupine Manager. This is a separate add in that's designed for the person who's administering, and keeping track of, and updating all the add ins within an organization. So this is probably not going to be on your typical user's machine, just because it's not useful. So what it has is a collection of tools that will help your...you as a manager manage the add ins. The first thing you do when you open up Lupine Manager is just go down to setup and then we're going to say our unique users file, which gets a all the unique users in your organization, and log data. OK, so the unique user table links user IDs. So we're grabbing user IDs from your computer's identity, but usually those don't say anything too particularly meaningful. Like for instance, mine is 00126545. We'd like to link that something that says John Moore. So that's what user unique users does. The log data file has a row for each log. So this is a summary of all those log files that we created before. So we can analyze these to our heart's content. Okay, so let's say you want to start using Lupine Manager. First thing you need to do is tell it which add ins you want to manage. To do that, we're going to go to build add in, then head on down to manage available add ins. And then this will be blank. 
When you get to it, click Add add in, and you can select the folders that contain the files for the add ins you want to put under control. Once you do, you'll see that those will be listed in the available add ins, click on build add ins. Okay, so you have two options here. Build add in is just going to create an add in file in the folder where the scripts are, so this is great for, like, testing. I want to do a quick build so I can test to see if it works. Build and publish will do the same thing, but also it will take that add in, copy it to your add in depot folder and update your published add ins file. So this is when you're ready to distribute it to the company, you've done your testing, it's ready to go. Now if you want to import all that log data, all you have to do is go to import and process log data and it will bring in all the new log files and add them to your log data file. We have a user summary here, which is just a quick tabulate to say who's using which add in. So this gives you a really quick view of who are my biggest users of which add in. You know here we can see, for this example, Lupine and LupineManager used a lot; LupineExample1 some; LupinExample, not so much. Okay, so let's say you've got Lupine installed, you've got Lupine Manager installed. Now you have some add ins that you would like to manage under Lupine. So let's talk about the steps for that. First thing you need to do is go out to that add in template, Lupine template file, we have and copy that someplace. So this contains all the necessary scripts and all the things that make Lupine work with your add in. So go ahead and copy that. Next we're going to kind of work through some steps here. The first is we're going to edit the addin.def file. We're going to create all the scripts that you're going to have in your add in or update them if you already have the scripts. Build the add in will customize the menus, so that the menu looks right. Assuming you click on it, it looks like the way you want it to. We'll build the add in again, once we get the menu fixed so that we can test it, make sure it's right. After we're done testing, we can publish it. So let's talk about this addin.def file; addin.def is really a tiny little text file, but it's required for every add in. This contains the core important information about the add in, what's its ID and what's its name. So you'll need to go in and edit that to change it to what the name of your add in will be. And this is a one-time thing. Once you've done this, you shouldn't have to change this for the add in going forward. So this is just once...you do it once when you set it up. Next, you need to decide which script am I going to put in this add in? Now I've created a, Lupine make log file. And this is what's going to create that log file for any add in that's under using Lupine. So this is what actually creates logs and allows you to do the version monitoring. So I recommend putting this header at the end...at the beginning of all of your files. Your script you're going to use, because that's what's going to allow you to monitor things. Now, so you've got all your scripts in there, next thing you do is make the menu file look the way you want it to. Right now, it's going to just say templates. So we're going to build this thing first. So just like we built Lupine before, where we did the select everything except the gitattributes files, right click, zip, change the name. We're going to do the same thing. 
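The zip-and-rename step can also be scripted outside of JMP, since an add-in file is just a zip archive. Below is a minimal Python sketch (not Lupine's actual JSL implementation; folder and file names are hypothetical placeholders) that packages everything except the gitattributes file into a .jmpaddin.

```python
# Minimal sketch of the "zip everything except .gitattributes, then rename to
# .jmpaddin" step described above. Paths are hypothetical placeholders.
import zipfile
from pathlib import Path

def build_addin(source_folder: str, addin_path: str) -> None:
    source = Path(source_folder)
    with zipfile.ZipFile(addin_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in source.rglob("*"):
            if file.is_file() and file.name != ".gitattributes":
                zf.write(file, file.relative_to(source))  # keep paths relative to the add-in root

build_addin("Lupine", "Lupine.jmpaddin")
```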
And then we'll have the wrong menu, because right now, it just still has that template menu. So we need to go in and update that template menu to do what we want to do. Okay, so once you've installed it, so, you know, you'll kind of see your new script up here, your new add in up here, template. We're going to right click up on the menu and you'll see this long list. And way at the bottom, we'll see customize and all we want to do is go to menus and toolbars. Okay, once we're there you'll see something like this. We, the first thing we need to do is tell JMP, well, what do we want to edit. We want to edit the menu that goes with our particular add in, which right now is called template. So we'll go up to this change button. Click on that and we have a whole host of choices here. But what we want to do is go to click on JMP add ins radio button here and then go down to the add in we're interested in. Okay. Now here I just left this as template, but if you...since you've already changed it in the addin.def file, it will be what that is. Once you do that, what you're going to see is the bits and pieces that belong to your add in are going to show up in blue on the left here. So this is what we're going to edit and change to make it what we want for our new add in. So let's take a look at this. I included the About in the What's New as a standard scripts within this, but when you get it, what you'll see is it's pointing to this template add in, which isn't what you want. You've got your own add in name that you want. So what we're going to do is click on the run add in this file. Click on this use add in folder and then select this add in that you want here. So that's going to point JMP to the new scripts that you just created. And then we can click Browse. And then we can browse and navigate and go to that particular script we want and say, "No, no, I don't want to use it in the template one. I want you to use it in my new add in." Likewise, we can do the same thing with the what is new item in the menu. And once you've done those, really it's just putting in the rest of them we want. If you do a right click over on the blue area, you'll see these options for insert before, insert after, where you can add things or delete things from your menu. So you can add things in, most of them are going to be commands. You can also do a submenu so it can go deeper, or you can put a separator where it makes it nice little line across there. And so you're going to build your menu, put all the items that you...for the scripts that you just created in there. And then we're going to save it, but we're going to save it twice. When you save it here, what that's going to do is just change it in the version of the add in that's installed on your machine. And that's good, but what we would also need to do is save it to the folder where you're building the add in. So we need to save it there so the next time we build the add in, it has the right menu file. So after we click save, we're going to click save as and go to the folder that contains our add in files. Okay, so we have our scripts. We have our addin.def defined, we've got our menu worked out. The next thing we need to do is actually build this thing and test it. So typically, I'll send it out to users. They can tell me what's working, what's not working. After that, I can go in and actually publish it. So when you go in and publish, it's going to prompt you to say, update the information. 
The most important thing here, and this is in Lupine Manager, is to change the release date. This is what lets people know that there's a more current version available, right. What's going to happen is this gets compared to the date that's published, and that's what Lupine is going to use to say, hey, there's a more current version for you to download. You can also add notes to it, like revision notes: these are the things that I changed this time, these are the great updates and things that I did to my add in. And once you do that, if you click Publish, then Lupine Manager will take a copy of that add in and put it in the folder where you have the source code. It will also publish it to the add in depot and update the published add ins file to have the most recent version, which, most importantly, has your release date in it. And then, if a user were to click on check for updates, it would say, hey, this add in has a new version available, would you like to download it? They can click on that and it'll install it on their machine. Okay, that was a really brief introduction. I hope there's enough material in here for you to do this yourself. If not, please contact me at john_moore@elanco.com. I'm happy to help you set this up. Many thanks, and thanks again. Bye bye.
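For reference, the "check for updates" behavior described above comes down to a date comparison between the installed add ins and the published add ins file. A rough sketch in R with made-up IDs and dates (the real Lupine files and column names will differ):

    # Hypothetical tables: one row per add-in, with an ID and its release date.
    installed <- data.frame(id = c("com.acme.lupine", "com.acme.reports"),
                            release_date = as.Date(c("2020-06-01", "2020-08-15")))
    published <- data.frame(id = c("com.acme.lupine", "com.acme.reports"),
                            release_date = as.Date(c("2020-09-01", "2020-08-15")))

    chk <- merge(installed, published, by = "id", suffixes = c("_installed", "_published"))
    chk$update_available <- chk$release_date_published > chk$release_date_installed
    chk$id[chk$update_available]   # add-ins the user should be offered an update for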
Labels (6):
Automation and Scripting
Basic Data Analysis and Modeling
Data Exploration and Visualization
Design of Experiments
Quality and Process Engineering
Reliability Analysis
At the corner of Lean Street and Statistics Road (2020-US-45MP-578)
Monday, October 12, 2020
Stephen Czupryna, Process Engineering Consultant & Instructor, Objective Experiments Manufacturing companies invest huge sums of money in lean principles education for their shop floor staff, yielding measurable process improvements and better employee morale. However, many companies indicate a need for a higher return on investment on their lean investments and a greater impact on profits. This paper will help lean-thinking organizations move to the next logical step by combining lean principles with statistical thinking. However, effectively teaching statistical thinking to shop floor personnel requires a unique approach (like not using the word “statistics”) and an overwhelming emphasis on data visualization and hands-on work. To that end, here’s the presentation outline: A) The Prime Directive (of shop floor process improvement) B) The Taguchi Loss Function , unchained C) The Statistical Babelfish D) Refining Lean metrics like cycle time, inventory turns, OEE and perishable tool lifetime E) Why JMP’s emphasis on workflow, rather than rote statistical tools, is the right choice for the shop floor F) A series of case studies in a what-we-did versus what-we-should-have-done format. Attendee benefits include guidance on getting-it-right with shop floor operators, turbo-charged process improvement efforts, a higher return on their Lean training and statware investments and higher bottom line profits. Auto-generated transcript... Speaker Transcript Stephen Czupryna Okay. Welcome, everyone. Welcome to at the corner of Lean Street and Statistics Road. Thank you for attending. My name is Stephen Czupryna. I work as a contractor for a small consulting company in Bellingham, Washington. Our name is Objective Experiments. We teach design of experiments, we teach reliability analysis and statistical process control, and I have a fairly long history of work in manufacturing. So here's the presentation framework for the next 35 odd minutes. I'm going to first talk about the Lean foundation of the presentation, about how Lean is an important part of continuous improvement. And then in the second section, we'll take Lean to what I like to call the next logical step, which is to teach and help operators and in particularly teach them and helping them using graphics and, in particular, JMP graphics. And we'll talk about refining some of the common Lean metrics and we'll end with a few case studies. But first, a little bit of background, what I'm about to say in the next 35 odd minutes is based on my experience. It's my opinion. And it will work, I believe, often, but not always. There are some companies that that may not agree with my philosophy, particularly companies that are, you know, really focused on pushing stuff out the door and the like, probably wouldn't work in that environment, but in the right environment, I think a lot of what I what I say will work fine. All the data you're about to see is simulated and I will post or have posted detailed paper at JMP.com. You can collect it there, or you're welcome to email me at Steve@objexp.com and I'll email you a copy of it or you can contact me with some questions. Again, my view. My view, real simple, boil it all down production workers, maintenance workers are the key to continuous improvement. Spent my career listening carefully to production operators, maintenance people learning from them, and most of all, helping them. So my view is a little bit odd. 
I believe that an engineer, process engineer, quality engineer really needs to earn the right to enter the production area, need to earn the support of the people that are working there, day in, day out, eight hours a day. Again, my opinion. So who's the presentation for? Yeah, the shortlist is people in production management, supervisors, manufacturing managers and the like, process engineers, quality engineers, manufacturing engineers, folks that are supposed to be out on the shop floor working on things. And this presentation is particularly for people who, who, like who like the the view in the in the photograph that the customer, the internal customer, if you will, is, is the production operator and that the engineer or the supervisor is really a supplier to that person. And to quote, Dr. Deming, "Bad system beats a good person, every time." And the fact is the production operators aren't responsible for the for the system. They work within the system. So the goals of the presentation is to help you work with your production people, \
Labels (8):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Content Organization
Design of Experiments
Mass Customization
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Biological Surveillance Techniques Developed with JMP (2020-US-30MP-577)
Monday, October 12, 2020
Sam Edgemon, Analyst, SAS Institute Tony Cooper, Principal Analytical Consultant, SAS The Department of Homeland Security asked the question, “how can we detect acts of biological terrorism?” After discussion and consideration, our answer was “If we can effectively detect an outbreak of a naturally occurring event such as influenza, then we can find an attack in which anthrax was used because both present with similar symptoms.” The tools that were developed became much more relevant to the detection of naturally occurring outbreaks, and JMP was used as the primary communication tool for almost five years of interactions with all levels of the U.S. Government. In this presentation, we will demonstrate how those tools developed then could have been used to defer the affects of the Coronavirus COVID-19. The data that will be used for demonstration will be from Emergency Management Systems, Emergency Departments and the Poison Centers of America. Auto-generated transcript... Speaker Transcript Sam Edgemon Hello. This is Sam Edgemon. I worked for the SAS Institute, you know, work for the SAS Institute, because I get to work on so many different projects. And we're going to tell you about one of those projects that we worked on today. Almost on all these projects I work on I work with Tony Cooper, who's on the screen. We've worked together really since since we met at University of Tennessee a few years ago. And the things we learned at the University of Tennessee we've we've applied throughout this project. Now this project was was done for the Department of Homeland Security. The Department of Homeland Security was very concerned about biological terrorism and they came to SAS with the question of how will we detect acts of biological terrorism. Well you know that's that's quite a discussion to have, you know, if you think about the things we might come back with. You know, one of those things was well what do you, what are you most concerned with what does, what do the things look like that you're concerned with? And they they talked about things like anthrax, and ricin and a number of other very dangerous elements that terrorists could use to hurt the American population. Well, we took the question and and their, their immediate concerns and researched as best we could concerning anthrax and ricin, in particular. You know, our research involved, you know, involved going to websites and studying what the CDC said were symptoms of anthrax, and the symptoms of ricin and and how those, those things might present in a patient that walks into the emergency room or or or or takes a ride on an ambulance or calls a poison center or something like that happens. So what we realized in going through this process was was that the symptoms look a lot like influenza if you've been exposed to anthrax. And if you've been exposed to ricin, that looks a lot like any type of gastrointestinal issue that you might might experience. So we concluded and what our response was to Homeland Security was that was that if we can detect an outbreak of influenza or an outbreak of the, let's say the norovirus or some gastrointestinal issue, then we think we can we can detect when when some of these these bad elements have been used out in the public. And so that's the path we took. 
So we we took data from EMS and and emergency rooms, emergency departments and poison centers and we've actually used Google search engine data as well or social media data as well to detect things that are you know before were thought as undetectable in a sense. But but we developed several, several tools along the way. And you can see from the slide I've got here some of the results of the questions that that we that we put together, you know, these different methods that we've talked about over here. I'll touch on some of those methods in the brief time we've got to talk today, but let's let's dive into it. What I want to do is just show you the types of conversations we had using JMP. We use JMP throughout this project to to communicate our ideas and communicate our concerns, communicate what we were seeing. An example of that communication could start just like this, we, we had taken data from from the EMS system, medical system primarily based in North Carolina. You know, SAS is based in North Carolina, JMP is based in North Carolina in Cary and and some of them, some of the best data medical data in the country is housed in North Carolina. The University of North Carolina's got a lot to do that. In fact, we formed a collaboration between SAS and the University of North Carolina and North Carolina State University to work on this project for Homeland Security that went on for almost five years. But what what I showed them initially was you know what data we could pull out of those databases that might tell us interesting things. So let's just walk, walk through some of those types of situations. One of the things I initially wanted to talk about was, okay let's let's look at cases. you know, can we see information in cases that occur every, every day? So you know this this was one of the first graphs I demonstrated. You know, it's hard to see anything in this and I don't think you really can see anything in this. This is the, you know, how many cases in the state of North Carolina, on any given day average averages, you know, 2,782 cases a day and and, you know, that's a lot of information to sort through. So we can look at diagnosis codes, but some of the guys didn't like the idea that this this not as clear as we want want it to be so so we we had to find ways to get into that data and study and study what what what ways we could surface information. One of those ways we felt like was to identify symptoms, specific symptoms related to something that we're interested in, which goes back to this idea that, okay we've identified what anthrax looks like when someone walks in to the emergency room or takes a ride on an ambulance or what have you. So we have those...if we identify those specific symptoms, then we can we can go and search for that in the data. Now a way that we could do that, we could ask professionals. There was there's rooms full of of medical professionals on this, on this project and and lots of physicians. And kind of an odd thing that I observed very quickly was when you asked a roomful of really, really smart people question like, what what is...what symptoms should I look for when I'm looking for influenza or the norovirus, you get lots and lots of different answers. So I thought, well, I would really like to have a way to to get to this information, mathematically, rather than just use opinion. And what I did was I organized the data that I was working with to consider symptoms on specific days and and the diagnosis. 
I was going to use those diagnosis diagnosis codes. And what I ended up coming out with, and I set this up where I could run it over and over, was a set of mathematically valid symptoms that we could go into data and look and look for specific things like influenza, like the norovirus or like anthrax or like ricin or like the symptoms of COVID 19. This project surfaced again with with many asks about what we might...how we might go about finding the issues of COVID 19 in this. This is exactly what I started showing again, these types of things. How can we identify the symptoms? Well, this is a way to do that. Now, once we find these symptoms, one of the things that we do is we will write code that might look something similar to this code that will will look into a particular field in one of those databases and look for things that we found in those analyses that we've that we've just demonstrated for you. So here we will look into the chief complaint field in one of those databases to look for specific words that we might be interested in doing. Now that the complete programs would also look for terms that someone said, Well, someone does not have a fever or someone does not have nausea. So we'd have to identify essentially the negatives, as well as the the pure quote unquote symptoms in the words. So once we did that, we could come back to JMP and and think about, well, let's, let's look at, let's look at this information again. We've got we've got this this number of cases up here, but what if we took a look at it where we've identified specific symptoms now and see what that would look like. So what I'm actually looking for is any information regarding gastrointestinal issues. I could have been looking for the flu or anything like that, but this is this is what the data looks like. It's the same data. It's just essentially been sculpted to look like you know something I'm interested in. So in this case, there was an outbreak of the norovirus that we told people about that they didn't know about that, you know, we started talking about this on January 15. And and you know the world didn't know that there was a essentially an outbreak of the norovirus until we started talking about it here. And that was, that was seen as kind of a big deal. You know, we'd taken data, we'd cleaned that data up and left the things that we're really interested in But we kept going. You know that the strength of what we were doing was not simply just counting cases or counting diagnosis codes, we're looking at symptoms that that describe the person's visit to the emergency room or what they called about the poison center for or they or they took a ride on the ambulance for. chief complaint field, symptoms fields, and free text fields. We looked into the into the fields that described the words that an EMS tech might use on the scene. We looked in fields that describe the words that a nurse might use whenever someone first comes into the emergency room, and we looked at the words that a physician may may use. Maybe not what they clicked on the in in the boxes, but the actual words they used. And we we developed a metric around that as well. This metric was, you know, it let us know you know, another month in advance that something was was odd in a particular area in North Carolina on a particular date. So I mentioned this was January 15 and this, this was December 6 and it was in the same area. 
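The keyword search described here can be sketched in a few lines of R. This is only a toy version with hypothetical field text, a tiny symptom list, and very crude negation handling; the production code looked across chief complaint, symptom, and free-text fields and handled many more negation patterns:

    # Hypothetical chief-complaint text from ED/EMS records.
    cc <- c("pt with fever and cough", "denies fever, c/o nausea", "no nausea, mild headache")

    symptoms <- c("fever", "nausea", "cough")
    flag_symptom <- function(text, sym) {
      has_sym <- grepl(sym, text, ignore.case = TRUE)
      # Crude negation handling: "no fever", "denies fever", "without fever".
      negated <- grepl(paste0("\\b(no|denies|without)\\s+", sym), text, ignore.case = TRUE)
      has_sym & !negated
    }
    sapply(symptoms, function(s) flag_symptom(cc, s))   # one row per record, one column per symptom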
And what is really registering is is the how much people are talking about a specific thing and if one person is talking about it, it's not weighted very heavily, therefore, it wouldn't be a big deal. If two people are talking about it, if a nurse and an EMS tech are talking about a specific set of symptoms, or mentioning a symptom several times, then, then we're measuring that and we're developing a metric from that information. So if three people, you know, the, the doctor, the nurse and the EMS tech if that's what information we have is, if they're all talking about it, then it's probably a pretty big deal. So that's what's happened here on December 6, a lot of people are talking about symptoms that would describe something like the norovirus. This, this was related to an outbreak that the media started talking about in the middle of February. So, so this is seen as...as us telling the world about something that the media started talking about, you know, in a month later. And specific specifically you know, we were drawn to this Cape Fear region because a lot of the cases were we're in that area of North Carolina around Wilson, Wilson County and that sort of thing. So, so that that was seen as something of interest that we could we could kind of drill in that far in advance of, you know, talk about something going on. Now we carried on with that type of work concerning um, you know, using those tools for bio surveillance. But what what we did later was, you know, after we set up systems that would that would, you know, was essentially running every day, you know every hour, every day, that sort of thing. And then so whenever we would be able to say, well, the system has predicted an outbreak, you know if this was noticed. The information was providing...was was really noise free in a sense. We we look back over time and we was predicting let's say, between 20 and 30 alerts a year, total alerts a year. So there was 20 or 30 situations where we had just given people, the, the, the notice that they might should look into something, you know, look, check something out. There might be you know a situation occurring. But in one of these instances, the fellow that we worked with so much at Homeland Security came to us and said, okay, we believe your alert, so tell us something more about it. Tell us what what it's made up of. That's that's that's how he put the question. So, so what we we did was was develop a model, just right in front of him. And the reason we were able to do that (and here's, here's the results of that model), the reason we were able to do that was by now, we realized the value of of keeping data concerning symptoms relative to time and place and and all the different all the different pieces of data we could keep in relation to that, like age, like ethnicity. So when we were asked, What's it made up of, then then we could... Let's put this right in the middle of the screen, close some of the other information around us here so you can just focus on that. So when we're asked, okay, what's this outbreak made up of, you know, we, we built a model in front of them (Tony actually did that) and that that seemed to have quite an impact when he did this, to say, Okay, you're right. Now we've told you today there there's there's an alert. And you should pay attention to influenza cases in this particular area because it appears to be abnormal. But we could also tell them now that, okay these cases are primarily made up of young people, people under the age of 16. 
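A toy version of that weighting, with hypothetical columns and records: count how many distinct roles (EMS tech, nurse, physician) mention a symptom on a given day and place, so one voice scores 1 and three independent voices score 3.

    # Hypothetical long-format data: one row per record in which a symptom was mentioned.
    mentions <- data.frame(
      date    = as.Date(c("2019-12-06", "2019-12-06", "2019-12-06", "2019-12-07")),
      region  = c("Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear"),
      source  = c("EMS", "nurse", "physician", "nurse"),
      symptom = "vomiting")

    # Number of distinct roles talking about each symptom per day and region.
    tap <- aggregate(source ~ date + region + symptom, data = mentions,
                     FUN = function(x) length(unique(x)))
    names(tap)[names(tap) == "source"] <- "n_sources"
    tap   # Dec 6 scores 3 (all three roles agree), Dec 7 scores 1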
The symptoms, they're talking about when they go into emergency room or get on an ambulance is fever, coughing, respiratory issues. There's pain. and there's gastrointestinal issues. The, the key piece of information we feel like is is the the interactions between age groups and the symptoms themselves. While this one may, you know, it may not be seen as important is because it's down the list, we think it is, and even these on down here. We talked about young people and dyspnea, and young people and gastro issues, and then older people. So there was, you know, starting to see older people come into the data here as well. So we could talk about younger people, older people and and people in their 20s, 30s, 40s and 50s are not showing up in this outbreak at this time. So there's a couple of things here. When we could give people you know intel on the day of of an alert happening and we could give them a symptom set to look for. You know when COVID 19 was was well into our country, you know you you still seem to turn on the news everyday and hear of a different symptom. This is how we can deal with those types of things. You know, we can understand you know, what what symptoms are surfacing such that people may may actually have, you know, have information to recognize when a problem is actually going to occur and exist. So, so this is some of the things that you know we're talking about here, you'll think about how we can apply it now. Using the the systems of alerting that I showed you earlier that, you know, I generally refer to as the TAP method as just using text analytics and proportional charting. Well, you know, that's we're probably beyond that now, it's it's on us. So we didn't have the tool in place to to go looking then. But these types of tools may still help us to be able to say, you know, this is these are the symptoms we're looking for. These are the these are the age groups were interested in learning about as well. So, so let's let's keep walking through some ways that we could use what we learned back on that project to to help the situation with COVID 19. One of the things that we did of course we've we've talked about building this this the symptoms database. The symptoms database is giving us information on a daily basis about symptoms that arise. And and you know who's, who's sick and where they're sick at. So here's an extract from that database that we talked about, where it it has information on a date, it has information about gender, ethnicity, in regions of North Carolina. We could you take this down to towns and and the zip codes or whatever was useful. This I mentioned TAP in that text analytics information, well now we've got TAP information on symptoms. You know, so if people are talking about this, say for example, nausea, then we we know how many people are talking about nausea on a day, and eventually in a place. And so this is just an extract of symptoms from from this this database. So, so let's take a look at how we could use this this. Let's say you wanted to come to me, an ER doctor, or some someone investigating COVID 19 might come to me and say, well, where are people getting sick at. You know, that's where are people getting sick now, or where might an outbreak be occurring in a particular area. Well, this is the type of thing we might do to demonstrate that. I use Principal Components Analysis a lot. In this case because we've got this data set up, I can use this tool to identify the stuff I'm interested in analyzing. 
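One hedged way to sketch the "what is it made up of" model is a count model with an age-group-by-symptom interaction. This is not the presenters' actual model, just an illustration with simulated counts of how interaction terms surface combinations like young patients with dyspnea:

    # Simulated daily case counts by age group and symptom (5 replicate days per cell).
    set.seed(1)
    d <- expand.grid(day = 1:5,
                     age_group = c("under16", "16to60", "over60"),
                     symptom   = c("fever", "cough", "dyspnea", "GI"))
    d$cases <- rpois(nrow(d), lambda = 5) +
               ifelse(d$age_group == "under16" & d$symptom == "dyspnea", 8, 0)

    fit <- glm(cases ~ age_group * symptom, family = poisson, data = d)
    summary(fit)   # the age_group:symptom interaction terms flag the driving combinations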
In this case it's the regions, they asked, you know, the question was where, where and what. Okay what what are you interested in knowing about? So I hear people talk about respiratory issues concerning COVID and I hear people talking about having a fever and and these are kind of elevated symptoms. These are issues that people are talking about even more than they're writing things down. That's the idea of TAP is, is we're getting into those texts fields and understanding understanding interesting things. So once we we we run this analyses, JMP creates this wonderful graph for us. It's great for communicating what's going on. And what's going on in this case is that Charlotte, North Carolina, is really maybe inundated with with with physicians and nurses and maybe EMS techs talking about their patients having a fever and respiratory issues. If you want to get as far as you can away from that, you might spend time in Greensboro or Asheville, and if you're in Raleigh Durham, you might be aware of what's on the way. So that this is this is a way that we can use this type of information for for essentially intelligence, you know, intelligence into what what might be happening next in specific areas. We could also talk about severity in the same, in the same instance. We could talk about severity of cases and measure where they are the same way. So you know the the keys here is is getting the symptoms database organized and utilized. We've we use JMP to communicate these ideas. A graph like this may may have been shown to Homeland Security and we talked about it for two hours easily just with, not just questions about even validity, you know, is where the data come from and so forth. We could talk about that and and we could also talk about okay, this, this is the information that that you need to know, you know. This is information that will help you understand where people are getting sick at, such that warnings can be given and essentially life...lives saved. So, so that's that in a sense is the system that we've we put together. The underlying key is, is the data. Again, the data we've used is EMS, ED, poison center data. I don't have an example of the poison center data here, but I've got a long talk about how we how we use poison center data to surface foodborne illness, just in similar ways than what we've shown here. And then the ability to, to, to be fairly dynamic with developing our story in front of people and talking to them in, you know, selling belief in what we do. JMP helps us do that; SAS code helps us do that. That's a good combination tools and that's all I have for this this particular topic. I appreciate your attention and hope you find it useful, and hope we can help you with this type of stuff. Thank you.
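The "where and what" view described above is essentially a principal components biplot of a region-by-symptom table of those text/TAP scores. A hedged sketch with made-up numbers (the real analysis used the symptoms database, not these values):

    # Hypothetical region-by-symptom matrix of TAP-style scores.
    scores <- data.frame(
      row.names   = c("Charlotte", "Raleigh-Durham", "Greensboro", "Asheville", "Wilmington"),
      fever       = c(40, 22, 8, 6, 15),
      respiratory = c(35, 20, 9, 5, 12),
      GI          = c(10, 12, 14, 9, 25),
      pain        = c(12, 11, 10, 8, 9))

    pc <- prcomp(scores, scale. = TRUE)
    biplot(pc)   # regions plotting near the fever/respiratory arrows are the ones to watch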
Labels (5):
Basic Data Analysis and Modeling
Data Blending and Cleanup
Mass Customization
Quality and Process Engineering
Reliability Analysis
Understanding Variable Clustering is Crucial When Analyzing Stored Manufacturing Data (2020-US-30MP-576)
Monday, October 12, 2020
Tony Cooper, Principal Analytical Consultant, SAS Sam Edgemon, Principal Analytical Consultant, SAS Institute In process product development, Design of Experiments (DOE), helps to answer questions like: Which X’s cause Y’s to change, in which direction, and by how much per unit change? How do we achieve a desired effect? And, which X’s might allow loser control, or require tighter control, or a different optimal setting? Information in historical data can help partially answer these questions and help run the right experiment. One of the issues with such non-DoE data is the challenge of multicollinearity. Detecting multicollinearity and understanding it allows the analyst to react appropriately. Variance Inflation Factors from the analysis of a model and Variable Clustering without respect to a model are two common approaches. The information from each is similar but not identical. Simple plots can add to the understanding but may not reveal enough information. Examples will be used to discuss the issues. Auto-generated transcript... Speaker Transcript Tony Hi, my name is Tony Cooper and I'll be presenting some work that Sam Edgemon and I did and I'll be representing both of us. Sam and I are research statisticians in support of manufacturing and engineering for much of our careers. And today's topic is Understanding Variable Clustering is Crucial When Analyzing Stored Manufacturing Data. I'll be using a single data set to present from. The focus of this presentation will be reading output. And I won't really have time to fill out all the dialog boxes to generate the output. But I've saved all the scripts in the data table, which of course will avail be available in the JMP community. The data...once the script is run, you can always use that red arrow to do the redo option and relaunch analysis and you'll see the dialog box that would have been used to present that piece of output. I'll be using this single data set that is on manufacturing data. And let's have a quick look at how this data looks And Sorry for having a Quick. Look how this dataset looks. First of all, I have a timestamp. So this is a continuous process and some interval of time I come back and I get a bunch of information. On some Y variables, some major output, some KPIs that, you know, the customer really cares about, these have to be in spec and so forth in order for us to ship it effectively. And then...so these are... would absolutely definitively be outputs. Right. line speed, how at the set point for the vibration, And and a bunch of other things. So you can see all the way across, I've got 60 odd variables that we're measuring at that moment in time. Some of them are sensors and some of them are a set points. And in fact, let's look and see that some of them are indeed set points like the manufacturing heat set point. Some of them are commands which means like, here's what the PLC told the process to do. Some of them are measures which is the actual value that it's at right now. Some of them are ambient conditions, maybe I think that's an external temperature. Some of them are in a row material quality characteristics, and there's certainly some vision system output. So there's a bunch of things in the inputs right now. And you can imagine, for instance, if I'm, if I've got the command and the measure the, the, what I told the the the zone want to be at and what it actually measure, that we hope that are correlated. And we need to investigate some of that in order to think about what's going on. 
And that's multicolinnearity, understanding the interrelationships amongst the inputs or separately, the understanding the multicollinearity and the outputs. But by and large, not doing Y cause X effect right now. You're not doing a supervised technique. This is an unsupervised technique, but I'm just trying to understand what's going on in the data in that fashion. And all my scripts are over here. So yeah, here's the abstract we talked about; here's a review of the data. And as we just said it may explain why there is so much, multicollinearity, because I do have a set points and an actuals in there. So, but we'll, we'll learn about those things. What we're going to do first is we're going to fit Y equals a function of X and we're going to set it and I'm only going to look at these two variables right now. The zone one command, so what I told zone one to to to be, if you will. And then I'm I've got a little therma couple that are, a little therma couple in the zone one area and it's measuring. And you can see that clearly as zone one temperature gets higher, this Y4 gets lower, this response variable gets lower. And that's true also for the measurement and you can see that in the in the in the estimates both negative and, by and large, and by the way, these are fairly predictive variables in the sense that just this one variable is explaining 50% of what's going on in in the Y4. Well, let's let's do a multivariate. So you imagine I'm now gonna fit model and have moved both of those variables into into into my model and I'm still analyzing Y4. My response is still Y4. Oh, look at this. Now it's suggesting that as Y, as the command goes up. Yeah, that that does that does the opposite of what I expect. This is still negative in the right direction, but look at look at some of these numbers. These aren't even in the same the same ballpark as what I had a moment ago, which was .04 and .07 negative. Now I have positive .4 and negative .87. I'm not sure I can trust these this model from an engineering standpoint, and I really wonder how it's characterizing and helping me understand this process. And there's a clue. And this may not be on by default, but you can right click on this parameter estimates table and you can ask for the VIF column. That stands for variation inflation factor. And that is an estimate of the...of the the instability of the model due to this multicollinearity. So you can...so we need to get little intuition on what that...how to think about that variable. But just to be a little clearer what's going on, I'm going to plot...here's temperature zone one command and here's measure, and as you would expect, as you tell...you tell the zone to increase in temperature. Yes, the zone does increase in temperature and by and larges, it's, it's going to maybe even the right values. I've got this color coded by Y4 so it is suggesting at lower temperatures, I get the high values of Y4 and that the... sorry...yeah, at low valleys of temperature, I got high values of Y4 and just the way I saw on the individual plots. But you can see maybe the problem gets some intuition as to why the multivariate model didn't work. You're trying to fit a three dom... a surface. over this knife edge and it can obviously...there's some instability and if...you can easily imagine it's it's not well pinneed on the side so I can rock back and forward here and that's what you're getting. It is in terms of how that is that the in in terms of that the variation inflation factor. 
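The sign flip and the VIF column can be reproduced with a small simulation (not the presenter's data): a "command" and a "measure" that are almost perfectly correlated, each negatively related to the response on its own. VIF is computed here straight from its definition, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

    set.seed(42)
    cmd  <- runif(200, 240, 260)          # stand-in for the zone 1 temperature command
    meas <- cmd + rnorm(200, sd = 0.5)    # the measured temperature tracks the command closely
    y4   <- 100 - 0.05 * meas + rnorm(200, sd = 0.3)

    coef(lm(y4 ~ cmd))          # negative slope, as expected
    coef(lm(y4 ~ meas))         # negative slope, as expected
    coef(lm(y4 ~ cmd + meas))   # slopes become unstable; one may even change sign

    # VIF from its definition: 1 / (1 - R^2 of each X regressed on the other X's).
    1 / (1 - summary(lm(cmd ~ meas))$r.squared)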
The OLS software...the OLS ordinary least squares analysis typical regression analysis just can't can't handle it in the, in some sense, maybe we could also talk about the X prime X matrix being, you know, being almost singlular in places. So we've got some heurustic of why it's happening. Let's go back and think about more About About The about about the values and We know that You know we are we now know that variation inflation factor actually does think about more than just pairwise. But if I want to source...so intuition helps me think about the relationship between VIF and pairwise comparison. Like if I have two variables that are 60% correlated then it's you know if it was all it was all pairwise then the VIF would be about 2.5. And if they were 90% correlated two variables, then I would get a VIF of 10. And the literature says That when you have VIFs of 10 and more you have enough instability which way you should worry about your model. So, you know, because in some sense, you've got 90% correlation between things in your data. Now whether 10 is a good cut off really depends,I guess, on your application. If if you're making real design decisions based on it, I think a cut off of 10 is way too high. And maybe even near two is better. But if I was thinking more. Well, let me think about what factors to put in a run in an experiment like I want to learn how to run the right experiment. And I knew I've got to pick factors to go in it. And I know we too often go after the usual suspects. And is there a way I can push myself to think of some better factors. Well, that then maybe a 10 might be useful to help you narrow down. So it really depends on where you want to go in terms of doing thinking about thinking about what what the purpose is. So more on this idea of purpose. You know, there's two main purposes, if you will, to me of modeling. One is what will happen, you know, that's prediction. And but and that's different, sometimes from why will it happen and that's more like explanation. As we just saw with a very simple command and measure on zone one, you can, you cannot do very good explanation. I would not trust that model to think about why something's happening when I have when I when the, when the responses when the, when the estimate seem to go in the wrong direction like that. So it's very so I wouldn't use it for explanation. I'm sort of suggesting that I wouldn't even use it for prediction. Because if I can't understand the model, it's doing, it's not intuitively doing what I expect that seems to make extrapolation dangerous, of course, by definition, you know, I think prediction to a future event is extrapolation in time. Right, so I would always think about this. And, you know, we're talking about ordinary least squares so far. All my modeling techniques I see, like decision trees, petition analysis, are all in some way affected by this by this issue, in different unexpected ways. So it seems a good idea to take care of it if you can. And it isn't unique to manufacturing data. But it's probably worse exaggerated in manufacturing data because often the variables can are not controlled. If we if we have, you know, zone one temperature and we've learned that it needs to be this value to make good product, then we will control it as tightly as possible to that desired value. And so it's it's so the the extrapolation really come gets harder and harder. So this is exaggerated manufacturing because we do can control them. 
And there's some other things about manufacturing data you can read here that make it maybe make it better, because this is opportunities and challenges. Better is you can understand manufacturing processes. There's a lot of science and engineering around how those things run. And you can understand that stoichiometry for instance, requires that the amount of A you add has, chemical A has to be in relationship to chemical B. Or you know you don't want to go from zone one temperature here to something vastly different because you need to ramp it up slowly, otherwise you'll just create stress. So you can and should understand your process and that may be one way even without any of this empirical evidence on on multicollinearity, you can get rid of some of it. There's also an advantage to manufacturing data is in that it's almost always time based, so do plot the variables over time. And it's always interesting or not always, but often interesting, the manufacturing data does seem to move in blocks of times like here...we think it should be 250 and we run it for months, maybe even years, and then suddenly, someone says, you know, we've done something different or we've got a new idea. Let's move the temperature. And so very different. And of course, if you're thinking about why is there multicollinearity, we've talked about it could be due to physics or chemistry, but it could be due to some latent variable, of course, especially when there's a concern with variable shifting in time, like we just saw. Anything that also changes at that rate could be the actual thing that's affecting the, the, the values, the, the Y4. Could be due to a control plan. It could be correlated by design. And each of these things, you know, each of these reasons for multicollinearity and the question I always ask is, you know, is this is it physically possible to try other combinations or not just physically possible, does it make sense to try other combinations? In which case you're leaning towards doing experimentation and this is very helpful. This modeling and looking retrospectively at your data to is very helpful at helping you design better experiments. Sometimes though the expectations are a little too high, in my mind, we seem to expect a definitive answer from this retrospective data. So we've talked about two methods already two and a half a few to to address multi variable clustering and understand it. One is the is VIF. And here's the VIF on a bigger model with all the variables in. How would I think about which are correlated with which? This is tells me I have a lot of problems. But it does not tell me how to deal with these problems. So this is good at helping you say, oh, this is not going to be a good model. But it's not maybe helpful getting you out of it. And if I was to put interactions in here, it'd be even worse because they are almost always more correlated. So we need another technique. And that is variable clustering. And this, this is available in JMP and there's two ways to get to it. You can go through the Analyze multivariate principal components. Or you can go straight to clustering cluster variables. If you go through the principal components, you got to use the red option...red triangle option to get the cluster variables. I still prefer this method, the PCA because I like the extra output. But it is based on PCA. So what we're going to do is we're going to talk about PCA first and then we will talk about the output from variables clustering. And there's and there's the JMP page. 
In order to talk about principal components, I'm actually going to work with standardized versions of the variables first. And let's think remind ourselves what is standardized. the mean is now zero, standard deviation is now 1. So I've standardized what that means that they're all on the same scale now and implicitly when you do when you do principal components on correlations in JMP, implicitly you are doing on standardized variables. JMP is, of course, more than capable, a more than smart enough for you to put in the original values and for it to work in the correct way and then when it outputs, it, it will outputs, you know, formula, it will figure out what the right formula should have been, given, you know, you've gone back unstandardized. But just as a first look, just so we can see where the formula are and some things, I'm going to work with standardized variables for a while, but we will quickly go back, but I just want to see one formula. And that's this formula right here. And what I'm going to do is think about the output. So what is the purpose of of of PCA? Well it's it's called a variation reduction technique, but what it does, it looks for linear combinations of your variables. And if it finds a linear combination that it likes, it...that's called a principal component. And it uses Eigen analysis to do this. So another way to think about it is, you know I want it, I put in 60 odd variables into the...into the inputs. There are not 60 independent dimensions that I can then manipulate in this process, they're correlated. What I do to the command for time, for temperatures on one dictates almost hugely what happens with the actual in temperature one. So those aren't two independent variables. Those don't keep you say you don't you don't have two dimensions there. So how many dimensions do we have? That's what this thing, the eigenvalues, tell you. These are like variance. These are estimates of variance and the cutoff is one. And if I had the whole table here, there'll be 60 eigenvalues. Go all the way down, you know, jumpstart reporting at .97 but the next one down is, you know, probably .965 and it will keep on going down. The cutoff says that...or some guideline is that if it's greater than one, then this linear combination is explaining a signal. And if it's less than one, then it's just explaining noise. And so what what JMP does is it...when I go to the variable clustering, it says, you know what you have a second dimension here. That means a second group of variables, right, that's explaining something a little bit differently. Now I'm going to separate them into two groups. And then I'm going to do PCA on both, and if and the eigenvalues for both...the first one will be big, but what's the second one look like after I split in two groups? Are they now less than one? Well, if they're less than one, you're done. But if they're greater than one, it's like, oh, that group can be split even further. And it was split that even further interactively and directly and divisively until there's no second components greater than one anymore. So it's creating these linear combinations and you can see the...you know when I save principal component one, it's exactly these formula. Oops. It's exactly this formula, .0612 and that's the formula for Prin1 and this will be exactly right as long as you standardize. If you don't standardize then JMP is just going to figure out how to do it on the unstandardized which is to say it's more than capable of doing. 
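The divisive idea described here can be sketched directly in R. This is not JMP's exact Cluster Variables algorithm (which, like PROC VARCLUS, rotates the first two components before reassigning variables); it only shows the recursion: split a group whenever its second eigenvalue exceeds 1, assign each variable to the component it loads on most strongly, and repeat.

    # Rough divisive variable clustering; assumes X is a data frame of numeric,
    # non-constant columns. Returns a list of character vectors (one per cluster).
    cluster_vars <- function(X) {
      pc <- prcomp(X, scale. = TRUE)
      ev <- pc$sdev^2
      if (length(ev) < 2 || ev[2] < 1) return(list(colnames(X)))  # second eigenvalue < 1: stop
      grp <- ifelse(abs(pc$rotation[, 1]) >= abs(pc$rotation[, 2]), 1L, 2L)
      if (length(unique(grp)) < 2) return(list(colnames(X)))      # refuse an empty split
      c(cluster_vars(X[, grp == 1L, drop = FALSE]),
        cluster_vars(X[, grp == 2L, drop = FALSE]))
    }

    # Usage on the numeric columns of a table dt (hypothetical):
    # clusters <- cluster_vars(dt[, sapply(dt, is.numeric)])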
So let's start working with the, the initial data. And we'll do our example. And you'll see this is very similar output. It's in fact it's the same output except one thing I've turned on. is this option right here, cluster variables. And what I get down here is that cluster variables output. And you can see these numbers are the same. And you can already start to see some of the thinking it's going to have to do, like, let's look at these two right here. It's the amount of A I put in seems to be highly related to the amount of B I put in, so...and that would make sense for most chemical processes, if that's part of the chemical reaction that you're trying to accomplish. You know, if you want to put 10% more in, probably going to put 10% more B in. So even in this ray plot you start to see some things that the suggest the multicollinearity. And so to get somewhere. But I want to put them in distinct groups and this is a little hard because watch this guy right here, temperature zone 4. He's actually the opposite. They're sort of the same angle, but in the opposite direction right, so he's 100 or 180 degrees almost from A and B. So maybe ...negatively correlated to zone 4 temperature and A B and not...but I also want to put them in exclusive groups and that's what that's what we get when we when we asked for the variable clustering. So it took those very same things and put them in distinct groups and it put them in eight distinct groups. And here are the standardized coefficients. So these are the formula that the for the, you know, for the individual clusters. And so when I save the cluster components I get a very similar to what we did with Prin1 except this just for cluster 1, because you notice that in this row that has zone one command with a ... with a .44, everywhere else is zero. Every variable is exclusively in one cluster or another. So let me...let's talk about some of the output. And so we're doing variable clustering and Oops. Sorry. Tony And then we got some tables in our output. So I'm going to close...minimize this window and we're talking about what's in here in terms of output. And the first thing you guys that you know I want to point us to is the standardized estimates and we were doing that. And if you want to do it, quote, you know, by hand, if you will, repeat and how do I get a .35238 here, I could run PCA on just cluster one variables. These are the variables staying in there. And then you could look at the eigenvalues, eigenvectors, and these are the exact exact numbers. So, So the .352 is so it's just what I said. It keeps devisively doing multiple PCAs and you can see that the second principle component here is, oh sorry, is less than one. So that's why it stopped at that component Who's in there cluster one, well there's temperature is ...the two zone one measures, a zone three A measure, the amount of water seems to matter compared to that. With some of the other temperature over here and in what...in cluster six. This is a very helpful color coded table. This is organized by cluster. So this is cluster one; I can hover over it and I can read all of the things added water (while I said I should, yeah, added water. Sorry, let me get one that's my temperature three.) And what that's a positive correlation, you know, the is interesting zone 4 has a negative correlation there. So you will definitely see blocks of of color to...because these are...so this is cluster one obviously, this is maybe cluster three. This, I know it's cluster six. Butlook over here... 
that this...as you can imagine, cluster one and three are somewhat correlated. We start to see some ideas about what might what we might do. So we're starting to get some intuition as to what's going on in the data. Let's explore this table right here. This is the with own cluster, with next cluster and one minus r squared ratio. And what...I'm going to save that to my own data set. I'm going to run do a multivariate on it. So I've got cluster one through eight components and all the factors. And what I'm really interested in is this part of a table, like for all of the original variables and how they cluster with the component. So let me save that table and then what I'm gonna do is, I'm gonna delete, delete some extra rows out of this table, but it's the same table. Let's just delete some rows. And I'll turn and...so we can focus on certain ones. So what I've got left is the columns that are the cluster components and the rows... row column...the role of the different...and is 27 now variables that we were thinking about, not 60 (sorry) and it's put them in 8 different groups and I've added that number. They can come automatically. I'm going to start and in order to continue to work, what I'm gonna do is, I don't want these as correlations. I want these as R squares. So I'm going to square all those numbers. So I just squared them and and here we go. Now we can look at, now we can start thinking about it. And I've sort...so let's look at row one. Sorry, this one that the temperature one measure we've talked about, is 96% correlated with its...with cluster one. It's 1% correlated with cluster two so it looks to me like it really, really wants to be in cluster one and it doesn't really have much to say I want to go into cluster two. What I want to do is let's find all of those... lets color code some things here so we can find them faster. So we're talking zone one meas and the one that would like to be in, if anything, is cluster five. You know it's 96% correlated, but you know, it's, it wouldn't be, if it had to leave this cluster, where would it go next? It would think about cluster five. And what's in cluster five? Well, you could certainly start looking at that. So there's more temperatures there. The moisture heat set point and so forth. So if it hadn't...so this number, The with the cl...with its own cluster talks about how much it likes being in its cluster and the what...and the R squared to the next cluster talks to you about how much it would like to be in a different one. And those are the two things reported in the table. You know, JMP doesn't show all the correlation, but it may be worth plotting and as we just demonstrated, it's not hard to do. So this tells you...I really like being in my own cluster. This says if I had to leave, then I don't mind going over here. Let's compare those two. And let's take one minus this number, divided by one minus this number. That's the one minus r squared ratio. So it's a ratio of how much you like your own cluster divided by how much you attempted to go to another cluster. And let's plot some of those. And Let me look for the local data filter on there. The cluster. And and here's the thing. So Values, in some sense, lower values of this ratio are better. These are the ones over here (let's highlight him)... Well, let's highlight the very...this one of the top here. I like the one down here. Sorry. 
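Those three columns (R² with own cluster, R² with next closest, and the 1 - R² ratio) can be recomputed by hand: correlate every variable with every cluster's first principal component, square the correlations, and take (1 - R²_own) / (1 - R²_next). A hedged sketch, assuming `clusters` is a list of column-name vectors such as the one returned by the sketch above, that the clusters partition the columns of X, and that there are at least two clusters:

    cluster_report <- function(X, clusters) {
      # First principal component score of each cluster (n rows x k clusters).
      comps <- sapply(clusters, function(v) prcomp(X[, v, drop = FALSE], scale. = TRUE)$x[, 1])
      r2    <- cor(X, comps)^2                     # rows = variables, columns = cluster components
      own   <- rep(seq_along(clusters), lengths(clusters))[match(colnames(X), unlist(clusters))]
      own_r2  <- r2[cbind(seq_len(nrow(r2)), own)]
      next_r2 <- sapply(seq_len(nrow(r2)), function(i) max(r2[i, -own[i]]))
      data.frame(variable = colnames(X), cluster = own,
                 R2_own = own_r2, R2_next = next_r2,
                 ratio_1_minus_R2 = (1 - own_r2) / (1 - next_r2))
    }

    # Usage (hypothetical): cluster_report(dt[, sapply(dt, is.numeric)], clusters)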
This one, vacuum set point, you can see really really liked its own cluster over the next nearest, .97 vs .0005, so you know that the ratio is near zero. So those wouldn't... with it...you wouldn't want to move that over. And you could start to do things like let's just think about the cluster one variables. If anyone wanted to leave it, maybe it's the Pct_reclaim. Maybe he got in there, like, you know by, you know, by fortuitous chance and I've got, you know, and if I was in cluster two, I can start thinking about the line speed. The last table I'm going to talk about is the cluster summary table. That's this table here. And it's saying...it's gone through this list and said that if I look down r squared own cluster, the highest number is .967 for cluster one. So maybe that's the most representative. To me, it may not be wise to let software decide I'm going to keep this late...this variable and only this variable, although certainly every software has a button that says launch it with just the main ones that you think, and that's that may give you a stable model, in some sense, but I think it's short changing the kinds of things you can do. And you and, hopefully with the techniques provided, we can...you have now the ability to explore those other things. This is...you can calculate this number by again doing the just the PCA on its own cluster and it and it it's suggesting it cluster...eigenvalue one one complaint with explained .69% so that's another...just thinking about it, it's own cluster. Close these and let's summarize. So we've given you several techniques. Failing to understand what's clear, you can make your data hard to interpret, even misleading the models. Understanding the data as it relates to the process can produce multicollinearity. So just an understanding from an SME standpoint, subject matter expertise standpoint. Pairwise and scatter plot are easier to interpret but miss multicollinearity and so you need more. VIF is great at telling you you've got a problem. It's only really available for ordinary least squares. There's no, there's no comparative thing for prediction. And there's some guidelines and variable clustering is based on principal components and shows better membership and strengthens relationship in each group. And I hope that that review or the introduction would encourage you, maybe, to grab, as I say, this data set from JMP Discovery and you can run the script yourself and you can go obviously more slowly and make sure that you feel good and practice on maybe your own data or something. One last plug is, you know, there was there's other material and other times that Sam and I have spoken at JMP Discovery conferences and or at ...or written white papers for JMP and you know maybe you might want to think about all we thought about with manufacturing it because we have found that modeling manufacturing data takes...is a little unique compared to understanding like customer data in telecom where a lot of us learn when we went through school. And I thank you for your attention and good luck with further analysis.
Labels (9):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Step Stress Modeling in JMP using R (2020-US-30MP-574)
Monday, October 12, 2020
Charles Whitman, Reliability Engineer, Qorvo Simulated step stress data where both temperature and power are varied are analyzed in JMP and R. The simulation mimics actual life test methods used in stressing SAW and BAW filters. In an actual life test, the power delivered to a filter is stepped up over time until failure (or censoring) occurs at a fixed ambient temperature. The failure times are fitted to a combined Arrhenius/power law model similar to Black’s equation. Although stepping power simultaneously increases the device temperature, the algorithm in R is able to separate these two effects. JMP is used to generate random lognormal failure times for different step stress patterns. R is called from within JMP to perform maximum likelihood estimation and find bootstrap confidence intervals on the model estimates. JMP is used live to plot the step patterns and demonstrate good agreement between the estimates and confidence bounds to the known true values. A safe-operating-area (SOA) is generated from the parameter estimates. The presentation will be given using a JMP journal. The following are excerpts from the presentation. Auto-generated transcript... Speaker Transcript CWhitman All right. Well, thank you very much for attending my talk. My name is Charlie Whitman. I'm at Corvo and today I'm going to talk about steps stress modeling in JMP using R. So first, let me start off with an introduction. I'm going to talk a little bit about stress testing and what it is and why we do it. There are two basic kinds. There's constant stress and step stress; talk a little bit about each. Then when we get out of the results from the step stress or constant stress test are estimates of the model parameters. That's what we need to make predictions. So in the stress testing, we're stressing parts of very high stress and then going to take that data and extrapolate to use conditions, and we need model parameters to do that. But model parameters are only half the story. We also have to acknowledge that there's some uncertainty in those estimates and we're going to do that with confidence bounds and I'm gonna talk about a bootstrapping method I used to do that. And at the end of the day, armed with our maximum likelihood estimates and our bootstrap confidence bounds, we can create something called the safe operating area, SOA, which is something of a reliability map. You can also think of it as a response surface. So we're going to get is...find regions where it's safe to operate your part and regions where it's not safe. And then I'll reach some conclusions. So what is a stress test? In a stress test you stress parts until failure. Now sometimes you don't get failure; sometimes parts, you have to start, stop the test and do something else. And that case, you have a sensor data point, but the method of maximum likelihood, which are used in the simulations takes sensoring into account so you don't have to have 100% failure. We can afford to have some parts not fail. So what you, what you do is you stress these parts under various conditions, according to some designed experiment or some matrix or something like that. So you might run your stress might be temperature or power or voltage or something like that and you'll run your parts under various conditions, various stresses and then take those that data fitted to your model and then extrapolate to use conditions. mu = InA + ea/kT. Mu is the log mean of your distribution; we commonly use the lognormal distribution. 
That's a constant term plus the temperature term. You can see that mu is inversely related to temperature: as temperature goes up, mu goes down, and as temperature goes down, mu goes up. If we use the lognormal, we also have an additional parameter, the shape factor sigma. So in our test we run several parts under very high stress conditions and we fit them to our model. It's when you combine those two that you can predict behavior at use conditions, which is really the name of the game. The most common method is a constant stress test, where basically the stress is fixed for the duration of the test. This is just showing an example of that. We have a plot here of temperature versus time. If we have a very low temperature, the failure times can be very long. The failure times are random, again according to some distribution like the lognormal. If we increase the temperature to some higher level, we would get another distribution of failure times, but on average the failure times would be shorter. And if we increase the temperature even more, same kind of thing, but the failure times are even shorter than that. So if I ran a bunch of parts at these different temperatures, I could fit the results to a probability plot that looks like this: probability versus time to failure. At the highest temperature here, 330 degrees C in this example, I have my set of failure times, which I fit to a lognormal. And as I decrease the temperature lower and lower, the failure times get longer and longer. Then I take all this data over temperature, fit it to the Arrhenius model, and extrapolate, and I can get my predictions at use conditions. This is what we are after. I want to point out that when we're doing accelerated testing, we have to run at very high stress because, for example, even though this test lasts 1,000 hours or so, our prediction is that the part under use conditions would last a billion hours, and there's no way we could run a test for a billion hours. We have to get tests done in a reasonable amount of time, and that's why we're doing accelerated testing. So then, what is a step stress? Well, as you might imagine, a step stress is where you increase the stress in steps or some sort of a ramp. The advantage is that it's a real time saver. As I showed in the previous plot, a constant stress test could last a very long time, maybe 1,000 hours, so it could be weeks or months before the test is over. A step stress test could be much shorter; you might be able to get done in hours or days. But the analysis is more difficult, and I'll show that in a minute. In the work we've done at Qorvo, we're doing reliability of acoustic filters, and those are RF devices, so the stress in RF is RF power. We step up power until failure. If we're going to step up power, we can model this with the expression here. Basically, we have the same thing as the Arrhenius equation, but we're adding another term, n*log(P): n is our power acceleration parameter, and P is our power. For the lognormal distribution there is a fourth parameter, sigma, which is the shape factor. So you have 1, 2, 3, 4 parameters. Let me just give you a quick example of what this would look like. This is power versus time; power is in dBm.
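As the abstract notes, JMP is used to generate the random lognormal failure times for the simulation. The following is only a minimal JSL sketch of that single step, not the presenter's script: the parameter values are illustrative assumptions, power is treated in watts rather than dBm, and the sign convention assumes higher power shortens life.

Names Default To Here( 1 );
k = 8.617e-5;                              // Boltzmann constant, eV/K
lnA = -28; Ea = 1.0; n = 6; sigma = 0.5;   // assumed "true" model parameters
Tamb = 45; Rth = 50; P = 1.0;              // ambient degC, thermal impedance degC/W, power W
Tk = Tamb + Rth * P + 273.15;              // device temperature in kelvin
mu = lnA + Ea / (k * Tk) - n * Log( P );   // log-mean of the lognormal
nParts = 30;
vals = J( nParts, 1, 0 );
For( i = 1, i <= nParts, i++,
	vals[i] = Exp( mu + sigma * Random Normal( 0, 1 ) )   // one lognormal failure time
);
dt = New Table( "Simulated Failures", Add Rows( nParts ),
	New Column( "Time to Failure", Numeric, Continuous, Set Values( vals ) )
);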
You're starting off at some power like 33.5 dBm, and you step and step and step until hopefully you get failure. I want to point out that you're varying power, and as you increase the power to the part, that changes the temperature. So as power is ramped, so is temperature; power and temperature are confounded. You're going to have to do your experiment in such a way that you can separate the effects of temperature and power. And note that you have these two terms, temperature and power, so it's not just that I increase the power to the part and it gets hotter and it's the temperature that's driving it; power in and of itself also increases the failure rate. Now let me show a little bit more detail about that step stress plot. Here again is power versus time. I run a part for, say, five hours at some power, then I increase the stress and run another five hours, and increase the stress on up until I get a failure. As I mentioned, as the power is increasing, so is the temperature, so I have to take that into account somehow. I have to know what the temperature is, and that's given by T = Tambient + Rth*P, where Tambient is the ambient temperature, P is the power, and Rth is called the thermal impedance, which is a constant. That means that as I set the power, I know what the power is, and I can also estimate what the temperature is for each step. What we'd like to do is take the failure times we get from our step stress pattern and extrapolate them to use conditions. If I were only running for time delta t here and I wanted to extrapolate that to use conditions, I would get the equivalent amount of time by multiplying delta t by the acceleration factor. And here's the acceleration factor: it has an activation energy term, a temperature term, and a power term. Since I'm going from high stress down to low stress, AF is larger than one; this is just for purposes of illustration, it's not that much bigger than one, but you get the idea. And as I increase the power, temperature and power are changing, so the AF changes with each step. So if I want to get the equivalent time at use conditions, I have to do a sum: each segment has its own acceleration factor and its own delta t, and summing AF times delta t over the segments gives me the equivalent time. This is the expression I would use to predict equivalent time: if I knew exactly what Ea was and exactly what n was, I could predict what the equivalent time was. So that's the idea. Now, as I said, temperature and power are confounded. In order to estimate their effects, we have to run at two different ambient temperatures. If you have the ambient temperatures separated enough, then you can actually separate the effects of power and temperature. You also need at least two ramp rates, so at a minimum you would need a two-by-two matrix of ramp rate and ambient temperature. In the simulations I did, I chose three different rates, as shown here. I have power in dBm versus stress time, and I have three different ramps with different rates: a fast, a medium, and a slow ramp rate. In practice, you would let this go on until failure, but I've just arbitrarily cut it off after a few hours. You see here also that I have a ceiling.
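Before getting to the ceiling, here is a small JSL sketch of the equivalent-time calculation just described. All numbers are illustrative assumptions, power is kept in watts, and the acceleration-factor form follows the Arrhenius/power-law model above; it is not the presenter's script.

Names Default To Here( 1 );
k = 8.617e-5;                        // Boltzmann constant, eV/K
Ea = 1.0; n = 6; Rth = 50;           // assumed model parameters (Rth in degC/W)
TambUse = 45; Puse = 0.5;            // use conditions: ambient degC, power W
TuseK = TambUse + Rth * Puse + 273.15;
TambStress = 45;
stepPowers = {1.0, 1.1, 1.2, 1.3};   // stepped power levels, W
stepHours  = {5, 5, 5, 5};           // time spent at each step, hours
tEquiv = 0;
For( i = 1, i <= N Items( stepPowers ), i++,
	TstressK = TambStress + Rth * stepPowers[i] + 273.15;
	// acceleration factor from this step's stress down to use conditions
	AF = Exp( (Ea / k) * (1 / TuseK - 1 / TstressK) ) * (stepPowers[i] / Puse) ^ n;
	tEquiv += AF * stepHours[i];
);
Show( tEquiv );   // equivalent hours at use conditions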
The ceiling is there because we have found that if we increase the stress or power arbitrarily, we can change the failure mechanism. What you want to do is make sure that the failure mechanism under accelerated conditions is the same as it is under use conditions. If I change the failure mechanism, then I can't do an extrapolation; the extrapolation wouldn't be valid. So we have the ceiling here drawn at 34.4 dBm, and we've even given ourselves a little buffer to make sure we don't get close to it. Our ambient temperature is 45 degrees C and we're starting at a power of 33.5 dBm, and we would also have another set of conditions at 135 degrees C. You can see the patterns here are the same: we have a ceiling and a buffer region and everything, except we are starting at a lower power. Here we're below 32 dBm, whereas before we were over 33. The reason we do that is that if we don't lower the power at this higher temperature, you'll get failures almost immediately if you're not careful, and then you can't use the data to do your extrapolation. Alright, so what we need, again, is an expression for our equivalent time, as I showed before. Here's that expression. This is kind of nasty, and I would not know how to derive from first principles what the expression is for the distribution of the equivalent time at use conditions. So, when faced with something that's difficult like that, what I chose to do was use the bootstrap. So what is bootstrapping? With bootstrapping, we resample the data set many times with replacement. That means a bootstrap sample can contain replicates of observations from the original data set, or an observation may not appear at all. The approach I used is called nonparametric because we're not assuming a distribution; we don't have to know the underlying distribution of the data. When you generate many of these bootstrap samples, you get an approximate distribution of the parameter, and that allows you to do statistical inference. In particular, we're interested in putting confidence bounds on things. A simple example of bootstrapping is the percentile bootstrap. For example, suppose I wanted 90% confidence bounds on some estimate. What I would do is form many, many bootstrap replicates, extract the parameter from each bootstrap sample, sort those values, and take the fifth and 95th percentiles from that vector; those would form my 90% confidence bounds. What I actually did in my work was use an improvement over the percentile technique called BCa, for bias corrected and accelerated. Bias corrected because sometimes our estimates are biased, and this method takes that into account. Accelerated is unfortunately a confusing term here; it has nothing to do with accelerated testing. It has to do with the method adjusting for the skewness of the distribution. But ultimately, what it's going to do is pick different percentile values for you. So, again, for the percentile technique we had the fifth and 95th; the BCa bootstrap might give you something different, say the third and 96th percentiles or whatever, and those are the ones you would choose for your 90% confidence bounds.
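A tiny JSL illustration of the plain percentile bootstrap just outlined (the talk's actual analysis uses R's BCa bootstrap; the data here are simulated and the statistic is simply the sample mean):

Names Default To Here( 1 );
nObs = 10; nBoot = 100;
x = J( nObs, 1, 0 );
For( i = 1, i <= nObs, i++, x[i] = 25 + Random Normal( 0, 1 ) );   // fake observations
boots = J( nBoot, 1, 0 );
For( b = 1, b <= nBoot, b++,
	samp = J( nObs, 1, 0 );
	For( i = 1, i <= nObs, i++, samp[i] = x[Random Integer( 1, nObs )] );   // resample with replacement
	boots[b] = Sum( samp ) / nObs;                                          // statistic for this replicate
);
sorted = Sort Ascending( boots );
lower = sorted[Ceiling( 0.05 * nBoot )];
upper = sorted[Ceiling( 0.95 * nBoot )];
Show( lower, upper );   // approximate 90% percentile bounds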
So I just want to run through a very quick example just to make this clear. Suppose I have 10 observations and I want to do for bootstrap samples from this, looking something like this. So, for example, the first observation here 24.93 occurs twice in the first sample, once in the second sample, etc. 25.06 occurs twice. 25.89 does not occur at all and I can do this, in this case, 100 times And for each bootstrap sample then, I'm going to find, in this case I'm gonna take the average, say, I'm interested in the distribution of the average. Well, here I have my distribution of averages. And I can look to see what that looks like. Here we are. It looks pretty bell shaped and I have a couple points here, highlighted and these would be my 90% confidence bounds if I was using the percentile technique. So here's this is the sorted vector and the fifth percentile is at 25.84 and the 95th percentile is 27.68. If I wanted to do the BCa method, I would might just get some sort of different percentile. So this case, 25.78 and 27.73. So that's very quickly, what the BCa method is. So in our case, we'd have samples of... we would do bootstrap on the stress patterns. You would have multiple samples which would have been run, simulated under those different stress patterns and then bootstrap off those. And so we're going to get a distribution of our previous estimates or previous parameters, logA, EA, and sigma Right. CWhitman So again, here's our equation. So again, JMP The version of JMP that I have does not do bootstrapping. JMP Pro does, but the version I have does not, but fortunately R does do bootstrapping. And I can call R from within JMP. That's why I chose to do it this way. So I have I can but R do all the hard work. So I want to show an example, what I did was I chose some known true values for logA, EA and sigma. I chose them over some range randomly. And I would then choose that choose the same values for these parameters of a few times and generate samples each time I did that. So for example, I chose minus 28.7 three times for logA true and we get the data from this. There were a total of five parts per test level or six test levels, if you remember, three ramps, two different temperatures, six levels, six times five is 30. So there were 30 parts total run for this test and looking at the logA hat, the maximum likelihood estimates are around 28 or so. So that actually worked pretty well. I can look at...now for my next sample, I did three replicates here, for example, minus 5.7 and how did it look when I ran my method of the maximum that are around that minus 5.7 or so. So the method appears to be working pretty well. But let's do this a little bit more detail. Here I ran the simulation a total of 250 times with five times for each group. LogA true, EA true are repeated five times and I'm getting different estimates for logA hat, EA, etc. I'm also putting...using BCa method to form confidence bounds on each of these parameters, along with the median time to failure. So let's look and just plot this data to see how well it did. You have logA hat versus logA true here and we see that the slope is about right around 1 and the intercept is not significantly different than 0, So this is actually doing a pretty good job. If my logA true is at minus 15 then I'm getting right around minus 15 plus or minus something for my estimate. And the same is true for the other parameters EA, n and sigma, and I even did my at a particular p zero P zero. So this is all behaving very well. 
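The transcript does not show the actual JMP-to-R round trip, so the following is only a hedged sketch of what calling R from JSL for a bootstrap can look like, using JMP's R integration functions (R Init, R Send, R Submit, R Get, R Term) and the R boot package. The statistic, the object names, and the assumption that R Send exposes the matrix in R under the same name are placeholders, not the presenter's code.

Names Default To Here( 1 );
dt = Current Data Table();                    // table of simulated failure times
times = dt:Time to Failure << Get Values;     // column values as a matrix
R Init();
R Send( times );
R Submit( "\[
	library(boot)
	stat <- function(d, idx) mean(log(d[idx]))               # placeholder statistic
	b    <- boot(as.vector(times), stat, R = 500)
	ci   <- boot.ci(b, conf = 0.90, type = "bca")$bca[4:5]   # lower, upper bounds
]\" );
bounds = R Get( ci );
R Term();
Show( bounds );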
We also want to know, well how well is the BCa method working? Well, turns out, it worked pretty well. I want to...the question is how successful was the BCa method. And here I have a distribution. Every time I correctly correctly bracketed the known true value, I got a 1. And if I missed it, I got a 0. So for logA I'm correctly bracketing the known true value 91% of the time. I was choosing 95% of the time, so I'm pretty close. I'm in the low 90s and I'm getting about the same thing for activation, energy and etc. They're all in the mid to low 90s. So that's actually pretty good agreement. Let's suppose now I wanted to see what would happen if I increase the number of bootstrap iterations and boot from 100 to 500. What does that look like? If I plot my MLE versus the true value, you're getting about the same thing. The estimates are pretty good. The slope is all always around 1 and the intercept is always around 0. So that's pretty well behaved. And then if I look at the confidence bound width, See, on the average, I'm getting something around 23 here for confidence bound width, and around 20 or so for mu and getting something around eight for value n. And so these confidence bands are actually somewhat wide. And I want to see what happens. Well, suppose I increase my sample size to 50 instead of just using five? 50 is not a realistic sample size, we could never run that many. That would be very difficult to do, very time consuming. But this is simulation, so I can run as many parts as I want. And so just to check, I see again that the maximum likelihood estimates agree pretty well with the known true values. Again, getting a slope of 1 and intercept around zero. And BCa, I am getting bracketing right around 95% of the time as expected. So that's pretty well behaved too and my confidence bound width, now it's much lower. So, by increasing the sample size, as you might expect, the conference bounds get correspondingly lower. This was in the upper 20s originally, now it's around seven. This is also...the mu was also in the upper 20s, this is now around five; n was around 10 initially, now it's around 2.3, so we're getting this better behavior by increasing our sample size. So this just shows what the summary, this would look like. So here I have a success rate versus these different groups; n is the number of parts per test level. And boot is the number of bootstrap samples I created. So 5_100, 5_500 and 50_500 and you can see actually this is reasonably flat. You're not getting big improvement in the coverage. We're getting something in the low to mid 90s, or so. And that's about what you would expect. So by changing the number of bootstrap replicates or by changing the sample size, I'm not changing that very much. BCa is equal to doing a pretty good job, even with five parts per test level and 100 bootstrap iterations. About the width. But here we are seeing a benefit. So the width of the confidence bounds is going down as we increase the number of bootstrap iterations. And then on top of that, if you increase the sample size, you get a big decrease in the confidence bound width. So all this behavior is expected, but the point here is, this simulation allows you to do is to know ahead of time, well, how big should my sample size be? Can I get away with three parts per condition? Do I need to run five or 10 parts per condition in order to get the width of the confidence bounds that I want? 
Similarly, when I'm doing analysis, well, how many bootstrap iterations do I have to do to kind of get away with 110? Do I need 1000? This also gives you some heads up of what you're going to need to do when you do the analysis. Alright, so finally, we are now armed with our maximum likelihood estimates and our confidence bounds. So we can do We can summarize our results using the safe operating area and, again what we're getting here is something of a reliability map or a response surface of temperature versus power. So you'll have an idea of how reliable the part is under various conditions. And this can be very helpful to designers or customers. Designers want to know when they create a part, mimic a part, is it going to last? Are they designing a part to run at to higher temperature or to higher power so that the median time to failure would be too low. Also customers want to know when they run this part, how long is the part going to last? And so what the SOA gives you is that information. The metric I'm going to give here is median time to failure. You could use other metrics. You could use the fit rate you could use a ???, but for purposes of illustration, I'm just using median time to failure. An even better metric, as I'll show, is a lower confidence bound on the median time to failure. It's a gives you a more conservative estimate So ultimately, the SOA then will allow you to make trade offs then between temperature and power. So here is our contour plot showing our SOA. These contours are log base 10 of the median time to failure. So we have power versus temperature, as temperature goes down and as power goes down, these contours are getting larger and larger. So as you lower the stress as you might expect, and median time to failure goes up. And suppose we have a corporate goal and the corporate goal was, you want the part to last or have a median time to failure greater than 10 to six hours. If you look at this map, over the range of power and temperature we have chosen, it looks like we're golden. There's no problems here. Median time to failure is easily 10 to six hours or higher. So that tells us we have to realize that median time failure again is an average, an average is only tell half the story. We have to do something that acknowledges the uncertainty in this estimate. So what we do in practice is use a lower conference bound on the median time to failure here. So you can see those contours have changed, very much lower because we're using the lower confidence bound, and here, 10 to the six hours is given by this line. And you can see that it's only part of the reach now. So over here at green, that's good. Right. You can operate safely here but red is dangerous. It is not safe to run here. This is where the monsters are. You don't want to run your part this hot. And also, this allows you to make trade offs. So, for example, suppose a designer wanted to their part to run at 80 degrees C. That's fine, as long as they keep the power level below about 29.5 dBm. Similarly, suppose they wanted to run the part at 90 degrees C. They can, that's fine as long as they keep the power low enough, let's say 27.5 dBm. Right. So this is where you're allowed to make trade offs for between temperature and power. Alright, so now just to summarize. So I showed the differences between constant and step stress testing and I showed how we extract extract maximum likelihood estimates and our BCa confidence bounds from the simulated step stress data. 
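As a rough illustration of how such an SOA map could be assembled in JSL once parameter estimates are in hand, the sketch below fills a temperature-by-power grid with log10 of the predicted median life and hands it to the Contour Plot platform. The parameter values, the grid ranges, and the use of watts for power are assumptions, and a real SOA like the one in the talk would use the lower confidence bound rather than the point estimate.

Names Default To Here( 1 );
k = 8.617e-5; lnA = -28; Ea = 1.0; n = 6; Rth = 50;   // assumed parameter estimates
dt = New Table( "SOA Grid",
	New Column( "Temp C", Numeric, Continuous ),
	New Column( "Power W", Numeric, Continuous ),
	New Column( "Log10 Median TTF", Numeric, Continuous )
);
For( tC = 40, tC <= 120, tC += 5,
	For( p = 0.2, p <= 1.2, p += 0.05,
		dt << Add Rows( 1 );
		r = N Rows( dt );
		Tk = tC + Rth * p + 273.15;                    // device temperature, K
		dt:Temp C[r] = tC;
		dt:Power W[r] = p;
		// median of a lognormal is Exp( mu ), so log10( t50 ) = mu / ln( 10 )
		dt:Log10 Median TTF[r] = (lnA + Ea / (k * Tk) - n * Log( p )) / Log( 10 );
	)
);
dt << Contour Plot( X( :Temp C, :Power W ), Y( :Log10 Median TTF ) );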
And I demonstrated that we had pretty good agreement between the estimates and the known true values. In addition, the BCa method worked pretty well: even with n boot of only 100 and five parts per test level, we had about 95% coverage, and that coverage didn't change very much as we increased the number of bootstrap iterations or increased the sample size. However, we did see a big change in the confidence bound width, and the results there showed that we could make a trade-off. From the simulation, we would know how many bootstrap iterations we need to run and how many parts per test condition we need to run. Ultimately, we took those maximum likelihood estimates and our bootstrap confidence bounds and created the SOA, which provides guidance to customers and designers on how safe a particular T0/P0 combination is, and from that reliability map we are able to make a trade-off between temperature and power. Lastly, I showed that using the lower confidence bound on the median time to failure provides a more conservative estimate for the SOA. So, in essence, using the lower confidence bound makes the SOA, the safe operating area, a little safer. That ends my talk, and thank you very much for your time.
Labels (11):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Measurement Systems Analysis for Curve Data (2020-US-30MP-573)
Monday, October 12, 2020
Astrid Ruck, Senior Specialist in Statistics, Autoliv Chris Gotwalt, JMP Director of Statistical Research and Development, SAS Laura Lancaster, JMP Principal Research Statistician Developer, SAS Measurement Systems Analysis (MSA) is a measurement process consisting not only of the measurement system, equipment and parts, but also the operators, methods and techniques involved in the entire procedure of conducting the measurements. Automotive industry guidelines such as AIAG [1] or VDA [4], investigate a one-dimensional output per test, but they do not describe how to deal with data curves as output. In this presentation, we take a first step by showing how to perform a gauge repeatability and reproducibility (GRR) study using force versus distance output curves. The Functional Data Explorer (FDE) in JMP Pro is designed to analyze data that are functions such as measurement curves, as those which were used to perform this GRR study. Auto-generated transcript... Speaker Transcript Astrid Ruck So my name is Astrid Ruck. I'm working for Autoliv since 15 years and Autoliv is a worldwide leading manufacturer for automotive safety components such as seatbelts, air bags and active safety systems. So today we would like to talk about measurement system analysis for curve data. And Laura, Chris, and me have also written a white paper and it has the same type of title because we think it's a very urgent topic because there is nothing else in our own knowledge for for MSA for curve data available for dynamic test machines. So first we will start with a short introduction for MSA and functional data analysis, and then we will motivate our objective of a so called spring guideline which investigates fastening behavior or comfort behavior fastening seatbelts and then Laura will explain our methodology of the Gage R&R via JMP Pro. So usually measurement system analysis is an astute is a Type 2 study. So, MSA is a process. Here in this flow chart, you can see that it starts with Type 1 and linearity study and they are done with reference parts. Reference parts are needed, because we need a reference way you to calculate the bias of the measurement system. And if it fits, if the bias is good enough, then we check if we have an operator influence or not. If and only if we have an operator influence, then we can calculate the reproducibility. Otherwise, in all other steps, repeatability is is the only variation which can be calculated. So, in this area, Type 2 and Type 3 study, production parts are used. The AIAG, automotive industry guidelines, investigate a one dimensional output per test, but they do not describe how to deal with data curves as output and we will give you an insight how to how to do it. So if one has good accuracy then the uncertainty will be low. And this can be seen in this graph. And you see the influence of the increasing uncertainty on the decision of whether good or bad parts are near the tolerance. So here we have a lower specification limit, at the right we have an upper specification limit. And so the better my accuracy is the better my decision will be. And here we have a very big gray area and it will be very hard to make a right decision. Obviously parts in the middle of the tolerance will always be classified in the correct way. And in contrast to statistical process control parts at a specification limits are very valuable. 
So the best thing that you can ever do for an MSA is to take parts at the lower specification limit, all the upper specification limit, plus/minus 10% of the reference figure, and in this case the reference figure is a tolerance. If you have only one specific specification limit, like an upper or lower specification limit, then you can use your process capability index, Ppk, to calculate the corresponding process variation, 6s, and that is your new reference figure. And this idea how to select your parts for an MSA can also be used for output curves and their specifications bounds instead of specification limits. So data often comes as a function, curve, or profile referred to functional data. Functional data analysis available in JMP Pro in the functional data explorer platform. We will use P splines to model our extraction of false versus distant curves and later on we will use mixed models to analyze them. So seatbelts significantly contribute to preventing fatalities, and consequently functionality and comfort of seatbelts have to be ensured. And here on the right hand side, you see a picture from extraction forces of the seatbelt, given in blue, and retraction forces over distance, given in red. And both forces are important factors that affect both safety and comfort. And to test these forces and extraction/retraction force test setup is used and this simulates the seatbelt behavior in a vehicle. So let us have a closer look at this. So here you see a test set up and here you see the seatbelt. And here's a little moving trolley which drives on a trolley arm and now you see the seatbelt is retracted. And now the extraction started and here at the right hand side you see the seatbelt, which is fixed, according to its car position. Yeah. Whoops. So, When you look at these little curves and you see that inside my extraction force curve, there are some little waves and they have a semi-periodic structure. And if you would fit this semi-periodic structure with a polynomial, for example, then you will really overfit the repeatability. That's the reason why we need some flexible models. And where do these little waves come from? This will be more clearer when we have a look inside my seatbelt. And there is a spindle and there is a spring. And how's this behaves can be seen in this video. Here inside is my is my spindle and here at the right hand side there is my spring. And the cover of the spring is open so that we can have a look inside the cover, how the spring behaves. And in the beginning when the total webbing is on the spinner, then my spring is totally relaxed. And now my webbing is extracted and at the same time the spring is wounded up. So now we have the retraction and my spring becomes relaxed again. And these little movements result into this wavy structure and that is the reason why we really need flexible models based on FDE. So the creation of the spring guideline is our objective, who require a specific fastening behavior. So the fastening behavior is given by my extraction force curves. And here you see in this picture, different groups of five different seatbelt types and the corresponding spring thicknesses. If my spring thickness is small, then you can see that my extraction force is also small and if my spring thickness is large than my extraction force over distance will also be large. And then you see we have here different colors and we have a dark color. And the dark color results from real life measurements from three operators with each... 
every operator made five replications per seatbelt and and you can see that they have really made a great job. And the light color, that is our model from the p splines given by by FDE. So, as a spring guideline is our target, we would like to know, for which spring we will get the corresponding fastening behavior of the seatbelt, but before you start with the project, please always start with an MSA So, according to Autoliv's procedure, we use five different seatbelts, three experienced operators and five replications as you have seen in the previous graph. And our observation y is given by the actual values plus some random noise, so noise from the operators, the part, the interaction of operator and part and the corresponding repeatability. And then my Gage R&R is defined by six times the process variation of the measurement error, which is given by my reproducibility and the repeatability. And now we would like to know what is my minimum tolerance, such that my Gage R&R is acceptable, and acceptable means that the percentage Gage R&R is smaller than 20%. If my Gage R&R is 0.2 times my minimum tolerance, then we will also get a bound for my curves which you have seen, and this plus/minus 3s error bound will help us a lot to find the correct spring for a specific fastening behavior. So our methodology will be shown by Laura Langcaster. So, we will start to estimate a mean extraction force curve and we will use flexible models by FDE. Then the residual extraction forces will be calculated and after that, random effect models will be estimated via the platform mix models and finally the Gage R&R will be calculated. So, Laura, will you start? Laura Yes, let me share my screen. Yeah. Great. Laura Okay. So yes, thank you, Astrid. So I wanted to demonstrate how we use JMP Pro 15 to perform this measurement systems analysis with curve data. So first I want to show you the data that we have. And it looks a lot like regular MSA data, except instead of just regular measurements, we actually have curves. So we have this function of force in terms of distance. So the first thing that we want to do is to to use the functional data explorer to create the part force extraction curves. So I'm going to open up functional data explorer, I'm going to go to the analyze menu, then specialized modeling, functional data explorer. So force would be my output, distance is my input, and I want to fit one for each part or seatbelt type. And so that's going to be my ID and I click OK. And I get a bunch of summary information, summary graphs, and I'm going to enlarge this one. These are my curves and you can see that I have very distinct curves for each part, which are different colors. And I can also see that semi-periodic behavior that Astrid was talking about. Now I've already fit this data, so I know that a 300 node linear p spline fits really well. So I'm going to just go ahead and fit that particular model. So I go to models, model controls, p spline model controls, and remove all of these nodes. Add 300 because I know that's the node structure that works well. I'm only going to do a linear fits. I click go, and it doesn't take too long to fit this 300 node linear p spline to this data. And I just wanted to quickly mention that I'm only going to show a little bit of the functionality from this this FDE platform. It is, it does a lot of things a lot more than I have time to show you, or that we used for this measurement systems analysis. But, I highly recommend you check it out. We added a lot for JMP 15. 
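In symbols, the definition just described (with reproducibility split into its operator and operator-by-part components, and the part-to-part variance excluded) can be reconstructed as:

\mathrm{GRR} = 6\sqrt{\sigma^2_{\mathrm{operator}} + \sigma^2_{\mathrm{operator\times part}} + \sigma^2_{\mathrm{repeatability}}}, \qquad \%\mathrm{GRR} = 100\,\frac{\mathrm{GRR}}{\mathrm{tolerance}} \le 20\% \;\Longrightarrow\; \mathrm{tolerance}_{\min} = \mathrm{GRR}/0.2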
And so I highly recommend you check out other talks about it as well. Okay, so this is our fit. And once again, we get a nice graph of our curves and the fit. And I want to check and make sure that this is a good fit. So I'm going to go to the diagnostic plots and I'm gonna look at the actual by predicted plot, I see it looks really great. Nice linear and the residuals are really small. So I'm very happy with this fit. Now, when we did this fit, we had to combined together the operator and the replication error. And so these curves have that that variation average out, but we need that variation to actually do the Gage R&R study. So what we're going to do is actually create a force residual and once we do that by subtracting off these part force functions from the original force function, we will have residual that will contain the operator and their replication error. So I'm going to go back to data table. And I've already created scripts to make this easy. So I've created a script to add my prediction formula to the data table. And I'll just show it to you really quick. This just come straight from the functional data explorer. This is my formula for the p spline fits for each part. And then I also have a script to create my residual column and so the formula for that is simply the difference between the force and those part force extraction functions. Right. And so now that I have this residual formula that contains the operator and replication error, I can fit a random effects model and estimate my operator...operator by part and my replication error. And to do that, I'm actually going to use the mixed model platform because we're not going to be fitting part variance, because we've already factored that out by creating the residual and subtracting off the part force function. So I've already created a script to launch the mixed model platform. And you can see that I have residual extraction force as my response. And I have operator and operator by part as my random effects. I'm going to run this. operator, operator by part (which is zero), and the residual. And to calculate the Gage R&R, I find it easy just to use a JMP data table like a spreadsheet. Makes it easy. I can just use a column formula. So I've entered my variance components in the table. And I just create a formula to calculate the Gage R&R, which is just 6 times the square root of the total variation without the part. And then I see that my Gage R&R is .4385 and I can take that and apply it to my spring guidelines in my specification bounds. And I can also back solve for the minimum tolerance. And so now I'm going to hand this back over to Astrid to continue with talking about how this got applied. Astrid Ruck Yes, thank you, Laura. It's, it's great. But for the audience, of course, this are not the original of data. So, Yes. Now as as Laura explained, now we know the Gage R&R and we also know my minimum tolerance, such that my measurement system is capable. So here you see the part extraction force function. And you also see the plus/minus 3s error bound and you see that the parts are very good selected because of the bounds are non overlapping and therefore they are significant different. And we can use it to find the right spring. And here you see in black, the minimum tolerance which is... we use the green line to center it around it. So now we have our Gage R&R but on the other hand, we can use FDE to load a golden curve as a target function. 
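A hedged JSL sketch of the spreadsheet-style step Laura describes next, with the variance components typed into a small table (the numbers below are placeholders, not the study's values) and the Gage R&R computed by a column formula:

Names Default To Here( 1 );
vc = New Table( "Variance Components", Add Rows( 1 ),
	New Column( "Var Operator", Numeric, Continuous, Set Values( [0.0012] ) ),
	New Column( "Var Operator by Part", Numeric, Continuous, Set Values( [0] ) ),
	New Column( "Var Residual", Numeric, Continuous, Set Values( [0.0041] ) )
);
// 6 times the square root of the total variation without the part component
vc << New Column( "Gage R&R", Numeric, Continuous,
	Formula( 6 * Sqrt( :Var Operator + :Var Operator by Part + :Var Residual ) )
);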
So I already told you, yes, we are interested in a spring guideline, but what kind of spring shall we use? So we also can use FDE to load this golden function and then the corresponding spring's thickness is calculated to obtain a specific behavior. So FDE is a great tool. And we used it also for a Type 3 study which is independent from operators and it is used for camera. And it measures the distance between seam and cutting edge of inflatable seatbelts and cutting process and that was also a great success story of using FDE. So to come to an end, I would like to say that most of our processes and tests have curves as output. And until now it has been impossible to standardize an MSA procedure using complete curve data and therefore, we had to restrict ourselves on a maximum from the extraction force curve, all the area between extract and retraction and therefore we reduced ourselves and lost a lot of a lot of considerable amount of information. So I'm really happy that we can make MSA for curve data. And as far as we are aware, there are no other publications that discuss this type of MSA generalization for curve data with other commercial software. And at the end, I would like to show you that the corresponding paper is also available in the in the internet. So thank you for the attention.
Labels (7):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Most Common JSL Mistakes and How to Avoid Them (2020-US-30MP-571)
Monday, October 12, 2020
Justin Chilton, JMP Senior Associate Software Development Engineer in Test, SAS Wendy Murphrey, Senior Principal Technical Support Analyst, SAS Have you been working on a scripting project and become frustrated by errors or unexpected results? In this session, we will identify some of the pitfalls that scripters of various experience levels encounter and help you to understand the cause so that you can move forward with accomplishing your scripting goals. We will help you through the simple mistakes, like not reviewing the log, to the foundational topics, such as scripting data tables and platforms, to the more complex topics, including name resolution and expressions. We will finish the session by providing resources available to you for learning more about the JMP Scripting Language. Auto-generated transcript... Speaker Transcript wendymurphrey Hello and thank you for joining our presentation today. We're excited to talk with you about common scenarios that script authors at various experience levels encounter. Our intent is to explain what's happening so that you have the tools and knowledge to avoid these mistakes. So let's get started. Perhaps because it's not open by default, it's easy to forget to check the log for messages. But the log contains useful information about recent script executions. So let's take a closer look at this example in JMP. Here we can see that the script simply opens a data table and attempts to make a row selection. When we run this script, we get an interactive error message for the fatal error, that is, those situations where your script stops due to the error. The pop up window is new behavior for JMP 15, and this can be turned off in preferences. When we dismiss that message, we can check the log and we see the same message from the window, but we also see the line number nd the name of the script file to help guide us to the trouble area. Now, don't solely rely on these pop up windows because other important messages will appear in the log, as well, and they don't necessarily cause a pop up window. So let's take a look at another example. In this script, the author is attempting to open a table, create an analysis, and print a message to the log. And when we run this script, we notice right away that our analysis was not generated. So again, we check the log and we see that there was a column that was not found in the data table, but the script continued and printed our message to the log. So the important takeaway here is always check the log, especially when what happens does not meet your expectation. It might be as simple as a misspelled column name. The concept of current data table is often described as the active table. But what makes the table the active or current data table and why does it matter to your script? When a table is opened, created, or otherwise acted upon, it becomes the current data table. But there are other ways that a table can become current at runtime that you as the script author might not have anticipated. Their tasks that JMP processes in the background. For example, JMP may need to evaluate a formula or make an update to a window. It's these background processes that can cause a different table than the one you intended to become active at runtime. And this can cause your script to fail, or worse, you don't even realize that JMP used the wrong table. So here we've provided some best practices to craft your scripts using data table-specific references. 
This is the way you can ensure that the correct table is used, regardless of what table is active at runtime. In this example, we're demonstrating how you can send a bivariate call to a specific data table to ensure that the data in that table is used for your analysis. You can also use the new column as a message to a specific data table, and this will ensure that your new column is created in the desired table and it does not go to whatever is current. You can also use optional data table references. Here, we're just demonstrating how you can use that for referencing columns and we're going to talk in more detail about referencing columns in a little bit. You can also use optional data table references in functions; for each row and summarize are two where this is available. We'd also like to suggest that you avoid using a data table...assigning a data table reference using the current data table function in your production scripts. In this example we're joining two data tables and we would like to assign a reference to that table that's created from the join. Now, join returns a reference to the newly created table. So instead of assigning to the current data table, we want to assign the reference when the join is created. So here, we simply make our assignment when we call the join, and when we run this we can see that indeed our new DT variable references the desire table and it is not affected by what table is current. Being able to identify a column using a JSL variable is a common task and which method you use depends upon your intended use. The column function returns a reference to the column as a whole unit. It accepts a variety of arguments, such as the column name is a string or a string variable, an index or a numeric variable representing an index. It can be subscripted to reference the column's value on a specific row. Some places that you can use the column function are in platform calls, sending messages to a column, or assigning a variable reference to a column. The as column function returns a reference to the value on the current row of the specified column. It accepts all the same arguments as the column function, but it's important to understand that in JMP, the current row is always zero unless otherwise managed. So some great places to use the as column function are in formulas, for each row, and select where, because they control the row as they evaluate on on each row by their design. So let's take a look at an example of what we mean. Here we're opening a sample data table. We've assigned a JSL variable to the name of a column, and we're attempting to use the column function in our select where. Now, when we run this script, we find that no rows were selected. Why is that? Because select where evaluates on each row, using the column function without any subscript will not select any rows. And we remember that the column function returns a reference to the column as a whole unit, so a whole column of data will never equal 14. So the solutions are to use the column function with a row subscript (and this row subscript simply means the current row, whatever it is while the select where is evaluating), or you can use the as column function to select your rows. The use of a JSL variable in formulas can be problematic for a few reasons. The lifespan of a JSL variable is limited to the current JMP session, and as the value of the variable changes, so can the result of the formula. 
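To make the patterns above concrete, here is a small hedged recap in JSL, using the Big Class sample table as a stand-in for the presenters' data:

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
colName = "age";
// Selects nothing: Column() returns the whole column, which never equals 14.
dt << Select Where( Column( dt, colName ) == 14 );
// Works: subscript with the current row, or use As Column().
dt << Select Where( Column( dt, colName )[Row()] == 14 );
dt << Select Where( As Column( dt, colName ) == 14 );
// Table-specific platform call, independent of whichever table is "current":
obj = dt << Bivariate( Y( :height ), X( :weight ) );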
And unless you look at the formula in the formula editor, you may not realize you have a problem. So let's take a look at our example. Here, I'm simply opening a data table and establishing a list of columns that I would like to sum. I'm creating my new formula column and I'm using my JSL variable in my sum function. So let's run this. And we can see that our new column was created and populated with data. But let's take a look at the formula editor. We never said our JSL variable was used in the sum function. And when we dismiss that, and we make a change to one of the dependent columns, we also noticed that our, our column formula was not automatically updated. So let's save and close JMP, save this table and close JMP. And we'll start again. Now remember that the lifespan of that JSL variable was limited to the prior JMP session. So let's open the table and take a look at what happens. We noticed that our variable...column formulas still did not evaluate again. When we take a look at that, we still see our JSL variable in there, but now we have a red line under it. And we hover over it, we see the name unresolved error. If we try to re evaluate the formula we do get the error message and ignoring the errors will cause all of our data to go to be changed to missing. So how do we fix this? When you use a variable in a formula, you'll need to do something to instruct JMP to replace the variable with its value before evaluating the formula. You can use substitute for this, but today we're going to demonstrate expression functions. So let's close this table. We'll open a fresh copy of the table and reestablish our list of columns. And now let's take a look at how we work around this. So the eval expression function tells JMP to evaluate and replace any variables in there...in its argument that are wrapped in an EXPR function. And then it will return the unevaluated expression. So in this code JMP is going to replace the expression SAT scores and replace it with our list of columns. So let's run just this section and see what happens. And we do see that the sum function now has our list of columns that we want to sum instead of the variable. But it didn't actually create the column for us. It's simply returned this expression. So to evaluate that, we add an eval function around our eval expression function. And this will cause JMP to evaluate the entire new column expression after the replacement has occurred. So essentially, the email says, run... run what's returned by the eval expression, which is this. So let's run it and we can see that our column was now created and we have the data as we expect. And our formula does not have our variable any longer, it has our list of columns. And when we make a change to the data in a dependent column, our formula now updates automatically. So now I'm going to pass this over to Justin to talk about a few more mistakes to avoid. Justin Chilton Alright, so another mistake that we see people make is when they're going to set column properties and the syntax can be a little bit tricky. So before we get into the details, let's talk about what a column is in terms of column properties. So essentially a column is just a repository for this property information. It doesn't do any syntax checking to make sure that what you give it is correct and this enables it to store anything that you'd like. 
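A compact hedged version of the Eval / Eval Expr fix Wendy walks through, with Big Class columns standing in for her list of columns to sum; Name Expr is used so the stored expression is substituted without being evaluated early (this is a sketch of the pattern, not her exact script):

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
sumCols = Expr( :height + :weight );        // expression held in a JSL variable
Eval( Eval Expr(
	dt << New Column( "Total", Numeric, Continuous,
		// Expr( Name Expr( sumCols ) ) is replaced by :height + :weight
		// before the New Column expression is evaluated
		Formula( Expr( Name Expr( sumCols ) ) )
	)
) );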
And this means that this is because the consumer of the property expects something to be in a specific format, whether that be a JMP platform, using a column property or even a JSL application. So let's take a look at a quick example. And here, we are going to open up a table that we have called column properties, just for testing. We have one column called NPN1 and one column called test. And this column has spec limits and an access property, but we want to add the spec limit properties to test. So we have here, what we think is the correct syntax to set the spec limits column properties. We run this and doesn't give us an error but when we look at the column, there is spec limits on that column. And you can, but you can tell that it's not correct. When we asked for process capability analysis with our spec limits, when we run that, we do get one for NPN1 and but we don't get one for the test column. That's because what's stored in the spec limits property for the test column is not correct. So one way to do this if you already know how to use the set properties spec limits, you can just send the property message to the column NPN1, which is the column that we know to be correct. So if we run just that one line, we get the correct syntax. And this actually shows us that this should really be in a list form instead of separate arguments. So another way, if you're not familiar with the get property, you can go to the column that you know is correct and copy column properties. And then you can paste that into your script and you'll see that you have add column properties as well as a couple of set property. So add column properties is just a way to batch up multiple set properties. We can remove that and we can remove the extra access property. So once we're done removing that, we can use this set property to send it to our test column with the send operator. So now when we run this, we...it's...it has overwritten our spec limits, but we can confirm that by running our same distribution using our spec limits. So, and you can see here that we have our process capability analysis for both of our columns. So that's just one example of how to get spec limits, but the same can apply to any column property that you need to apply and get the syntax for in JSL. So the next mistake that we're going to talk about is manipulating dispatch commands. So what is a dispatch command? It's the thing that appears in your saved script when you do the red triangle menu's save to script window or save to clipboard. And it appears when you add something like a reference line or you change something else like closing an outline box. And this command was developed as an internal way for JMP to be able to recreate graphs with the customization that you made. And it wasn't planned for users to customize this with dynamic variables. So that means that some of the arguments in the dispatch command do not accept JSL variables. So let's take a look at an example of where that might happen. So here's a script. We have a table and a string variable which will use a little later. And then this is essentially what would be returned by the save to script window. So when we run this, we can see that our dispatch command that sets the min max and increment for measurement axis here is set from zero to 1.4. Now we can see that the dispatch command is relying on the outline box title here, but that outline box title is actually based on our y input. 
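A hedged sketch of the two-step spec-limits pattern Justin describes: copy the syntax from a column known to be correct with Get Property, then send Set Property in the expected list form to the target column. The limit values below are placeholders.

Names Default To Here( 1 );
dt = Current Data Table();     // the presenters' test table with columns NPN1 and TEST
// Reveal the expected {LSL(...), USL(...), ...} list form from a known-good column:
good = Column( dt, "NPN1" ) << Get Property( "Spec Limits" );
Show( good );
// Set the property on the target column using that list form:
Column( dt, "TEST" ) << Set Property( "Spec Limits",
	{LSL( 90 ), USL( 110 ), Target( 100 ), Show Limits( 1 )}
);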
So say that we wanted this to be dynamic input from a user and all we have this this variable. We could use the variable for the Y. But then we would also need to use that that the name of that column in here. So when we run this, you'll actually see that we don't get the correct axis settings 0.4 to 1.2, which is not...so that means that what we sent was not processed. So there are a couple ways to address this. And the first is going to be using eval expression solution that Wendy talked about a few minutes ago. So here you see we have the same eval expression. We actually want to wrap the concatenation that we tried in the previous example here. And so eval expression will say evaluate this whole thing. So when we run eval EXPR, it just replaces that whole thing with our concatenated string and then, remember when we do the eval, it runs that expression. So here we have 0 to 1.4. So that's one way. But sometimes it can still get a little bit tricky. And what how you should really manipulate those dispatch commands because you're still using the dispatch commands and you have to know how they work. So the most common way of doing this is to remove this this dispatch command altogether. So, and use what we call display box subscripting. So here I'm going to run this with the removed dispatch command and we have our original axis settings. And then what we're going to do is, before we don't really know how to write this, just from looking at it. So we're going to go to this icon, this outline box, right click edit and show tree structure. So, this shows us the display box tree structure within this outline box and within this outline box, you can see this axix box is the one that we want to change. So within this outline box is axis box one. So then we can go and write our script. We use this report function to get the report layer of the platform, OBJ1. Then we subscript to get the outline box title that we know it should be, based on our column name. And then within that we want...and we want to be as specific as possible, so that if things change in a future version of JMP we're less likely to have issues. So then within that, we want axis box 1 and that's how we get that axis box, and then we can take the same pieces of information from the dispatch command and just convert them the messages that we send to the axis box. And then you can see over here that now our axis settings are from zero to 1.4. So another more powerful way to do this is using XPath. It does the same thing, but can be a little bit more complicated, but also allows for more flexibility for things in future versions of JMP because you can use underlying data about a box. So the last mistake that we're going to discuss is only using the global namespace. So you may not know it, but when you're when you create a symbol without specifying where it goes, it goes into the global namespace. So this is great for quick scripts when you're just mocking something up for your personal use, but it may not be great idea for scripts that you send to colleagues or scripts that you publish or add ins that you publish on the Community. So little bit more details about how scoping and and name lookup works in JMP. You start at the bottom here in this diagram. The local scope can can be any number of scopes, with which are functions and local box. 
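For reference, here is a hedged sketch of the display-box subscripting approach described above. Outline-box titles and box numbers vary by platform and version, so they are best confirmed with the tree structure view, as in the talk; the Oneway example below is an assumption, not the presenters' script.

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
yName = "height";
obj = dt << Oneway( Y( Column( dt, yName ) ), X( :sex ) );
rep = Report( obj );
// Set the axis directly instead of editing a saved Dispatch() command.
rep[Outline Box( "Oneway Analysis of " || yName || " By sex" )][Axis Box( 1 )]
	<< Min( 50 ) << Max( 80 ) << Inc( 5 );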
So above that, if it doesn't find the symbol or you didn't specify it to go in local depending on if you're writing or reading, then there's the here scope, and that exists essentially one per file. And there's global scope, which essentially is one scope for your session of JMP. And then over to the right, we have namespace scope, which is something that you really have to explicitly seek out in order to access. Let's take a look at an example. This is just going to delete all of my symbols and since I didn't specify where they should go, it's going to go into globals. So x is 5. But what if someone else has a script out there that's like, well, I also have a variable x that I want to use. And I'm not going to specify where to put it. So I'm going to set it to 10. But then when you come back to your script and maybe this is some variable that you've set and then you have some callback on a button...button action that is relying on that variable, then you go to use it and the variable has changed on you and that's probably not desired. So what are some ways that we can address this? Well, the easiest way to do this is to use names default to here(1). This means that nothing will end up in the global scope, unless you explicitly tell it to go there. So let's take a look at this example. So all we've done is add this name default to here(1). It can go with...generally it goes at the top of your script, but it can really go anywhere. I'm going, I'm going to clean up my symbols and set this variable x equal to 5. So then we have some other scripture over here that's says that says I want to access your, your x. But really this x can can never access this x in this here namespace. So let's copy part of the script and do X equals 50. And then so the X in the right hand script is 50, but that had no effect on the X in our left hand script. So that's what, that's a great way to easily separate your symbols from other scripts and applications. The other option is to use a custom namespace. This is a little bit more advanced, but it can be very powerful. So here I'm again...clear my symbols and then creating a new namespace, setting up a variable x within the namespace and showing it. So it has a value of five. And you have another script out here. Maybe this is one of your scripts and you know that myNS exists. So you can try and access that, and you can do that just fine, because you know the name of this. But it's much less likely that someone nefarious is going to change your symbol without your knowledge. So I can change this symbol and it's still going to be 10 but if I just do X, it's, it's not going to really have any have any effect on the result of the the X from the namespace. I'm going to throw it back over to Wendy, who's going to tell us about some overlooked resources. wendymurphrey Thank you, Justin. There are many resources available to you to learn more about JSL, and many of them are free. Here we've linked for you some of the more common places to look and learn more about JSL. We also offer technical support that you can reach out to, and there are a couple of books on the topic as well. Of course, we always encourage you to take a look at our training resources and the plans that they have there for you. Be sure to take a look at the paper associated with this presentation on the JMP community, as there are a few more mistakes that we covered there for you, which we didn't have time to discuss today. Thank you for watching.
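A minimal hedged sketch of the two scoping options just described:

// Option 1: keep symbols out of the global scope.
Names Default To Here( 1 );
x = 5;                       // lives in this script's namespace, not in globals
// Option 2: an explicit named namespace for state you intend to share.
myNS = New Namespace( "myNS" );
myNS:x = 10;
Show( x, myNS:x );           // 5 and 10; neither assignment clobbers the other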
Labels:
Automation and Scripting
Basic Data Analysis and Modeling
Content Organization
Data Exploration and Visualization
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Salary Gaps in Corporate America: How Company and Executive Characteristics Influence Compensations (2020-US-EPO-570)
Monday, October 12, 2020
Ondina Sevilla-Rovirosa, Graduate Student and Small Business Owner, Oklahoma State University The pay gap between top executives and the average American worker continues to widen. Additionally, the gender pay gap has narrowed, but women only earn 85% of what men earn. Experts debate if higher compensations could positively impact the firm’s performance and maximize its value. Therefore, it is worth analyzing the factors that companies are considering to determine the salaries of top executives. The original dataset of 2,646 executives from 250 selected U.S. companies was gender unbalanced. I used a synthetic replication method to balance the data. Then, I ran, analyzed, and compared a decision tree, a stepwise regression, eight different neural networks, and an ensemble model. Finally, a surrogate model was used to explain the best neural net model. From the 18 initially selected variables of companies and executives, I found that only Total Assets, Number of Employees, and Years Executive in CEO Position had a significant contribution to Salary. Surprisingly gender had an insignificant effect on the salaries of top executives. Nevertheless, predominantly what affected Salaries was the size of the firm (Assets and Employees), followed by a lower contribution from the number of years the executive has been in the CEO position. Auto-generated transcript... Speaker Transcript ondinasevilla@gmail.com My name is Ondina Sevilla and my poster is about salary gaps in corporate America, specifically how do the company and executive characteristics influence compensations. Something with a little introduction, the pay gap between top executives and the average American worker continues to widen. Also the gender pay gap has narrowed, but women only earn 85% of what men earn. Even experts debate if higher compensations could positively impact the firm's performance and maximize its value. So it is worth it to analyze the factors that companies are considering to determine the salaries of top executives. I made two questions for this research. Are...is there a salary gap for top female executive in US companies? And does the company's size influence executives' salaries? So for this research, I collected a data set of 2046 from top executives from 250 selected US companies, such as Halliburton, Southwest Airlines, Starwood Hotels, Sherwin-Williams and others. Then I applied a synthetic replication method in SAS to obtain a gender balance database and used 12 companies and six executive variables, being salary by input variable. The technique used was predictive modeling. I analyzed and compared in JMP Pro 15 a decision tree, stepwise regression, eight different neural networks and ensemble model. And outlier... oh, a salary percentage year per year variable was excluded from from...for the first analysis and then I included it to compare. However, the rules are adjusted for error and the outer absolute error were higher with the outlier. So, I'm going to show you here the model, the different models that I ran without the outlier and the neural networks comparison. So from all these models, the lowest root average squared error without the outlier was the number eight neural network. These ...this neural net has 4 inputs two hidden layers, eight double neurons, with a TanH function.
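For reference, the root average squared error used above to rank the candidate models is just the square root of the mean squared difference between actual and predicted salaries. A minimal JSL sketch with hypothetical column names (Salary and Predicted Salary), not the poster's actual script:

Names Default To Here( 1 );
dt = Current Data Table();
actual = Column( dt, "Salary" ) << Get Values;            // assumed response column
pred = Column( dt, "Predicted Salary" ) << Get Values;    // assumed prediction column
err = actual - pred;
rase = Sqrt( Sum( err :* err ) / N Row( err ) );          // root average squared error
Show( rase );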
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Structured Problem Solving using JMP Cause and Effect Diagram, mind-mapping software, and JSL (2020-US-30MP-569)
Monday, October 12, 2020
Daniel Sutton, Statistician - Innovation, Samsung Austin Semiconductor Structured Problem Solving (SPS) tools were made available to JMP users through a JSL script center as a menu add-in. The SPS script center allowed JMP users to find useful SPS resources from within a JMP session, instead of having to search for various tools and templates in other locations. The current JMP Cause and Effect diagram platform was enhanced with JSL to allow JMP users the ability to transform tables between wide format for brainstorming and tall format for visual representation. New branches and “parking lot” ideas are also captured in the wide format before returning to the tall format for visual representation. By using JSL, access to mind-mapping files made by open source software such as Freeplane was made available to JMP users, to go back and forth between JMP and mind-mapping. This flexibility allowed users to freeform in mind maps then structure them back in JMP. Users could assign labels such as Experiment, Constant and Noise to the causes and identify what should go into the DOE platforms for root cause analysis. Further proposed enhancements to the JMP Cause and Effect Diagram are discussed. Auto-generated transcript... Speaker Transcript Rene and Dan Welcome to structured problem solving using the JMP cause and effect diagram, open source mind mapping software, and JSL. My name is Dan Sutton and I am a statistician at Samsung Austin Semiconductor, where I teach statistics and statistical software such as JMP. For the outline of my talk today, I will first discuss what is structured problem solving, or SPS. I will show you what we have done at Samsung Austin Semiconductor using JMP and JSL to create an SPS script center. Next, I'll go over the current JMP cause and effect diagram and show how we at Samsung Austin Semiconductor use JSL to work with the JMP cause and effect diagram. I will then introduce you to mind mapping software such as Freeplane, a free open source software. I will then return to the cause and effect diagram and show how to use the third column option of labels for marking experiment, controlled, and noise factors. I will also show you how to extend cause and effect diagrams for five why's and cause mapping, and finally give recommendations for the JMP cause and effect platform. Structured problem solving. So everyone has been involved with problem solving at work, school or home, but what do we mean by structured problem solving? It means taking unstructured problem solving, such as in a brainstorming session, and giving it structure and documentation, as in a diagram that can be saved, manipulated and reused. Why use structured problem solving? One important reason is to avoid jumping to conclusions for more difficult problems. In the JMP Ishikawa example, there might be an increase in defects in circuit boards. Your SME, or subject matter expert, is convinced it must be the temperature controller on the solder process again. But having a saved structure, as in the cause and effect diagram, allows everyone to see the big picture and look for more clues. Maybe it is temperature control on the solder process, but a team member remembers seeing on the diagram that there was a recent change in the component insertion process and that the team should investigate. In the free online training from JMP called Statistical Thinking in Industrial Problem Solving, or STIPS for short, the first module is titled Statistical Thinking and Problem Solving.
Structured problem solving tools such as cause and effect diagrams and the five why's are introduced in this module. If you have not taken advantage of the free online training through STIPS, I strongly encourage you to check it out. Go to www.JMP.com/statisticalthinking. This is the cause and effect diagram shown during the first module. In this example, the team decided to focus on an experiment involving three factors. This is after creating, discussing, revisiting, and using the cause and effect diagram for the structured problem solving. Now let's look at the SPS script center that we developed at the Samsung Austin Semiconductor. At Samsung Austin Semiconductor, JMP users wanted access to SPS tools and templates from within the JMP window, instead of searching through various folders, drives, saved links or other software. A floating script center was created to allow access to SPS tools throughout the workday. Over on the right side of the script center are links to other SPS templates in Excel. On the left side of the script center are JMP scripts. It is launched from a customization of the JMP menu. Instead of putting the scripts under add ins, we chose to modify the menu to launch a variety of helpful scripts. Now let's look at the JMP cause and effect diagram. If you have never used this platform, this is what's called the cause and effect diagram looks like in JMP. The user selects a parent column and a child column. The result is the classic fishbone layout. Note the branches alternate left and right and top and bottom to make the diagram more compact for viewing on the user's screen. But the classic fishbone layout is not the only layout available. If you hover over the diagram, you can select change type and then select hierarchy. This produces a hierarchical layout that, in this example, is very wide in the x direction. To make it more compact, you do have the option to rotate the text to the left or you can rotate it to the right, as shown in here in the slides. Instead of rotating just the text, it might be nice to rotate the diagram also to left to right. In this example, the images from the previous slide were rotated in PowerPoint. To illustrate what it might look like if the user had this option in JMP. JMP developers, please take note. As you will see you later, this has more the appearnce of mind mapping software. The third layout option is called nested. This creates a nice compact diagram that may be preferred by some users. Note, you can also rotate the text in the nested option, but maybe not as desired. Did you know the JMP cause and effect diagram can include floating diagrams? For example, parking lots that can come up in a brainstorming session. If a second parent is encountered that's not used as a child, a new diagram will be created. In this example, the team is brainstorming and someone mentions, "We should buy a new machine or used equipment." Now, this idea is not part of the current discussion on causes. So the team facilitator decides to add to the JMP table as a new floating note called a parking lot, the JMP cause and effect diagram will include it. Alright, so now let's look at some examples of using JSL to manipulate the cause and effect diagram. So new scripts to manipulate the traditional JMP cause and effect diagram and associated data table were added to the floating script center. You can see examples of these to the right on this PowerPoint slide. 
JMP is column based, and the column dialogue for the cause and effect platform requires one column for the parent and one column for the child. This table is what is called the tall format. But a wide table format might be more desired at times, such as in brainstorming sessions. With a click of a script button, our JMP users can change from a tall format to a wide format, which changes the table's width and depth. In tall table format you would have to enter the parent each time you added a child. When done in wide format, the user can use the script button to stack the wide C&E table back to tall. Another useful script in brainstorming might be taking a selected cell and creating a new category. The team realizes that it may need to add more subcategories under wrong part. A script was added to create a new column from a selected cell while in the wide table format. The facilitator can select the cell, like wrong part, then by selecting this script button, a new column is created and subcauses can be entered below. Otherwise, in the diagram you would hover over wrong part, right click, and select Insert Below. You can actually enter up to 10 items. The new causes appear in the diagram. And if you don't like the layout, JMP allows moving the text. For example, you can right click and move it to the other side. The JMP cause and effect diagram compacts the window using left and right, up and down, and alternating sides. Some users may want the classic look of the fishbone diagram, but with all bones in the same direction. By clicking on this script button, current C&E all bones to the left side, it sets them all to the left and below. Likewise, you can click another script button that sets them all to the right and below. Now let's discuss mind mapping. In this section we're going to take a look at the classic JMP cause and effect diagram and see how to turn it into something that looks more like mind mapping. This is the same fishbone diagram as a mind map using Freeplane software, which is an open source software. Note the free form of this layout, yet it still provides an overview of causes for the effect. One capability of most mind mapping software is the ability to open and close nodes, especially when there is a lot going on in the problem solving discussion. For example, a team might want to close nodes (like components, raw card and component insertion) and focus just on the solder process and inspection branches. In Freeplane, closed nodes are represented by circles, where the user can click to open them again. The JMP cause and effect diagram already has the ability to close a node. Once closed though, it is indicated by three dots or three periods or ellipses. In the current versions of JMP, there's actually no option to open it again. So what was our solution? We included a floating window that will open and close any parent column category. So over on the right, you can see alignment, component insertion, components, etc., are all included as the parent nodes. By clicking on the checkbox, you can close a node, and then clicking again will open it. In addition, the script also highlights the text in red when closed. One reason for using open source mind mapping software like Freeplane is that the source file can be accessed by anyone. And it's not a proprietary format like other mind mapping software. You can actually access it through any kind of text editor. Okay, the entire map can be loaded by using JSL commands that access text strings. Use JSL to look for XML attributes to get the names of each node.
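As a minimal sketch of the wide-to-tall reshaping described earlier in this section (not Samsung's actual add-in script), the built-in Stack table operation can turn a wide brainstorming table, with one column per parent category, into the Parent/Child format the cause and effect platform expects. The column and table names here are assumptions for illustration.

Names Default To Here( 1 );
wide = New Table( "C&E wide",
	New Column( "Solder Process", Character, Set Values( {"Flux", "Temperature"} ) ),
	New Column( "Inspection", Character, Set Values( {"Measurement", ""} ) )
);
tall = wide << Stack(
	Columns( :Name( "Solder Process" ), :Inspection ),
	Source Label Column( "Parent" ),    // the parent category comes from the column name
	Stacked Data Column( "Child" ),     // each cell becomes a child cause
	Output Table( "C&E tall" )
);
// Blank cells can be deleted from the tall table before launching the platform.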
A discussion of XML is beyond the scope of this presentation, but see the JMP Community for additional help and examples. And users at Samsung Austin Semiconductor would click on Make JMP table from a Freeplane.mm file. At this time, we do not have a straight JMP to Freeplane script. It's a little more complicated, but Freeplane does allow users to import text from a clipboard using spaces to indent the nodes. So by placing the text in the journal, the example here is on the left side of this slide, the user can then copy and paste into Freeplane, and you would see the Freeplane diagram on the right. Now let's look at adding labels of experiment, controlled, and noise to a cause and effect diagram. Another use of cause and effect diagrams is to categorize specific causes for investigation or improvements. These are often categorized as controlled or constant (C), noise (N), or experiment (X or E). For those who were taught SPC Excel by Air Academy Associates, you might have used or still use the CE/CNX template. So to be able to do this in JMP, to add these characters, we would need to revisit the underlying script that is generated when the optional third label column is used. When a JMP user adds a label column in the script, it changes the text edit box to a vertical list box with two new horizontal center boxes containing two text edit boxes, one with the original child and one with the value from the label column. It actually has a default font color of gray and is applied as illustrated here in this slide. Our solution using JSL was to add a floating window with all the children values specified. Whatever was checked could be updated for E, C or N and added to the table and the diagram. And in fact, different colors could be specified by the script by changing the font color option as shown in the slide. Now let's look at the JMP cause and effect diagram for five why's and cause mapping. While exploring the cause and effect diagram, another use, as a five why's or cause mapping, was discovered. Although these SPS tools do not display well on the default fishbone layout, the hierarchy layout is ideal for this type of mapping. The parent and child become the why and because statements, and the label column can be used to add numbering for your why's. This is what it looks like on the right side. Sometimes there can be more than one reason for a why, and the JMP cause and effect diagram can handle it. This branching or cause mapping can be seen over here on the right. Even the nested layout can be used for a five why. In this example, you can also set up a script to set the text wrap width, so the users do not have to do each box one at a time. Or you can make your own interactive diagram using JSL. Here I'm just showing some example images of what that might look like. You might prompt the user in a window dialogue for their why's and then fill in the table and a diagram for the user. Once again, using the cause and effect diagram as shown over on the left side of the slide. Conclusions and recommendations. All right. In conclusion, the JMP cause and effect diagram has many excellent built in features already for structured problem solving. The current JMP cause and effect diagram was augmented using JSL scripts to add more options when being used for structured problem solving at Samsung Austin Semiconductor.
JSL scripts were also used to make the cause and effect diagram act more like mind mapping software. So, what would be my recommendations? There are currently three layouts...fishbone, hierarchy, and nested...which use different types of display boxes in JSL. How about a fourth type of layout, a mind map type that would allow a more flexible mind map layout? I'm going to add this to the wish list. And then finally, how about even a total mind map platform? That would be an even bigger wish. Thank you for your time, and thank you to Samsung Austin Semiconductor and JMP for this opportunity to participate in the JMP Discovery Summit 2020 online. Thank you.
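The "Make JMP table from a Freeplane .mm file" step mentioned above can be sketched with basic JSL text handling. This is a simplified illustration, not the presenters' add-in: it assumes each Freeplane node element sits on its own line of the .mm file and pulls out the TEXT="..." attribute of each node; the file path is hypothetical.

Names Default To Here( 1 );
mm = Load Text File( "$DOCUMENTS/causes.mm" );        // hypothetical Freeplane file
lines = Words( mm, "\!N\!r" );                        // split the XML text into lines
nodeNames = {};
For( i = 1, i <= N Items( lines ), i++,
	txt = Regex( lines[i], "TEXT=\!"([^\!"]*)\!"", "\1" );   // capture the node's TEXT attribute
	If( !Is Missing( txt ), Insert Into( nodeNames, txt ) );
);
New Table( "Freeplane nodes",
	New Column( "Node", Character, Set Values( nodeNames ) )
);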
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Access
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
DOE Gumbo: How Hybrid and Augmenting Designs Can Lead To More Effective Design Choices (2020-US-EPO-568)
Monday, October 12, 2020
Heath Rushing, Principal, Adsurgo DOE Gumbo: How Hybrid and Augmenting Designs Can Lead to More Effective Design Choices When my grandmother made gumbo, she never seemed to even follow her own recipe. When I questioned her about his, she told me, “Always try something different. Ya never know if you can make better gumbo unless you try something new!” This is the same with design of experiments. Too many times, we choose the same designs we’ve used in the past, unable to try something new in our gumbo. We can construct a hybrid of different types of designs or augment the original, optimal design with points using a different criterion. We can then use this for comparison to our original design choice. These approaches can lead to designs that allow you to either add relevant constraints (and/or factors) you did not think were possible or have unique design characteristics that you may not have considered in the past. This talk will present multiple design choices: a hybrid mixture-space filling design, an optimal design augmented using pre-existing, required design points, and an optimal design constructed by augmenting a D-optimal design with both I- and A-optimal design points. Furthermore, this talk presents the motivation for choosing these design alternatives as well as design choices that have been useful in practice. Auto-generated transcript... Speaker Transcript Heath Rushing My name is Heath Rushing. I am a principal for Adsurgo and we're a we're a training and consulting company that works with a lot of different companies. This morning, I'm going to talk about some experiences that I had working with pharmaceutical and biopharmaceutical companies. A lot of scientists and engineers are doing things like process and product development and characterization formulation optimization. And what I found is is is a lot of these...a lot of these scientists had designs that they use in the past with a similar product or process or formulation. And what they did is is going forward, they just said, "Hey, let me just take that design that I've used in the past. It worked. You know, it worked well enough in the past. So let's just go ahead and use that design." In each of these instances, what we did is we took the original design and we came up with some sort of mechanism for doing something a little different. Right. We either augmented it with a with a with a different sort of optimization criteria or we augmented it before they added runs or after they added runs. In the first case is what we did is, is was we built a hybrid design. Right. And then the first case was a product formulazation...I'm sorry... a formulation optimization problem, where a scientist in the past was run...had a 30 run Scheffe-Cubic mixture design. In a mixture design, the process parameters are variables are factors in the experiment are mixtures. And then, so there is certain percentage where the overall mixture adds up to 100%. Right, they they felt this work well enough and help them to find an optimal setting for the for the formulation. However, one thing that they really wanted to touch more on is, they said, you know, these designs tended to to to look at design points in our experiment near the edges. And what we want to do is is further characterize the design space. 
So we took the original 30 run design, and instead of doing that, what we did is we run a we...we developed an experiment constructing the experiment where we ran 18 mixture experiments and then we augmented it with 12 space filling design. And a space filling design is, it's used a lot in computer simulations. And really, you know, I said this at a conference one time, I said, "You know it's used to fill space." But really what these designs do, and I'm going to pull up the the the comparison of the two, is it's going to put design points. In this one, I try to minimize the distance between each of the design points. As you see as the design on the left, the, the one that they thought was well enough or was adequate was the 30 run mixture design. And as you see, it operates a lot near the edges and right in the center of the design. The one on the right was really 18 mixture design points augmented with 12 space filling design points. So it's really a hybrid design, it's really a hybrid of a mixture design and a space filling design. As you can see, you know, based upon their objective trying to characterize that design space a little bit better, as you can see, the one on the right did a much better job of characterizing that design space, right? It had adequate prediction variance. It was a it was a design they chose to run and they found a and they found their optimal solution off of this The second design choice was, and this is used a lot, in a process characterization is, back in the old days back before a lot of people used design of experiments in terms of process characterization, what a lot of scientists would do was, was they would run center point runs like its set point and then also do what are called PAR runs, or proven acceptable range, right. So say that they had four process parameters. What they would do is is they would keep three of the process parameters at the set point and have the fourth go to the extremes. The lowest value and the highest value. And they would do it for each...they would do a set of experiments like that for each the process parameters. What they're really showing is that, you know, if everything's at set point, and one of these deviate near the edges, then we're just going to prove that it's well within specification. Right. And then so they still like to do a lot of these runs. The design that I started off with was, I had a had a scientist that took those PAR and those centerpoint runs and they added them after they built an I optimal design. And I optimal design is used for for for prediction and optimization. And in this case is is that's the kind of design that they wanted, but they added them after the I optimal design. My question to them was this, why don't you just take those runs and add them before you built I optimal design? If that was the case, the ??? algorithm in JMP would say, "You know, I'm going to take those points and I'm going to come up with the, the next best set of runs." Right. So we took those 18 design points and we augmented them with with 11 more...I'm sorry, the 11 to...the original 11 design points and 18 I optimal points. Whenever we did this, if you look in the design, the, the, this is where the PAR runs were added... were added prior to, and you see that the power of the main effects, in factor interactions, the quadratic effects are higher than if you added the PAR runs after. You see that the production variance, if you, if you look at the prediction variance is, the prediction variance is very similar. 
But you see, is like right near the edge of the design spaces, you see that those PAR runs, whenever we had the PAR runs augmented with I optimal, were a lot smaller. The key here is is whenever I was looking at the correlation is I think the correlation, especially with the main effects are a lot better with with the PAR augmenting and two I optimal versus what they did before, where they took the I optimal and just augmented those with the PAR runs. The third design. The third design was was was when I had a scientist take a 17 run D optimal design and they augment it with eight runs and went from a D to an I optimal design. Now they started off with D optimal design, a screening design, they augmented it with points to move to an I optimal design. JMP has a has a...it's not a really a new design, but it's new design for JMP; it's called A optimal design. And A optimal design allows you to to weight any of those factors. Right. And so I had an idea. I just said, "You know, I have many times in the past, went from a D augmented to an I optimal design. What if we did this? Really, what if we took that original 17 run D optimal design and augmented it to an I, then an A, where we weighted those quadratic terms, Or we took the D optimal design, augmented it to an A optimal design where we where we weighted the quadratic terms and then to an I optimal design?" So it's really two different augmentations, going from a D to an A to an I, and D to an I to an A. Also went to straight D to A. Right. And I wanted to compare it to the original design choice, which was a D versus an I optimal design. Now, I really would like to tell you that my idea worked. But I think as a good statistician, I should tell you that I don't think it was so. If I look at the prediction variance, which, in terms of response surface design, we're trying to minimize the prediction variance across the design region, is you see the prediction variance for their original design is is lower. Okay, even even much lower than whenever I did the A optimal design, just straight to the A optimal design. If you look at the fraction of design space, you'll see that the prediction variance is much smaller across the design space than the than the A optimal design and it's a little bit better than when I went from D to A to I, and D to I to A. The only negative that I saw with the original design compared to the other design choices was, you know, there was there was some quadratic effects, right, there were some quadratic effects that had a little bit of higher correlation, little bit higher correlation than I would like to see. And and you see what the A optimal design, it has much lower quadratic effects. So my my original thesis many times, scientists and engineers have designs they've done in the past. And I always say is, it makes sense that we just don't want to do that same design that we've done in the past. Let's try something different. The product can be a little bit different. The process can be a little bit different. The formulation can be a little bit different. If you use that to compare to the original design is you can pick your best design choice. I would like to, you know, last thing I would like to thank my my team members at Adsurgo. We always have, you know, team members and also our customers...our customers coming up with challenging problems and our team members for always working for for optimal solutions for our customers. 
Now, last thing that I have to do is, is these these designs were really, really taken from examples from customers, but they weren't the exact examples. There's nothing with their data. So I would like to give a give a shout out to one of my customers Sean Essex from Poseida Therapeutics that often comes up with some very hard problems and sometimes he'll come up with a problem. And I'll say, you know, this is this is a solution and it's something that we really haven't even seen yet. So have a great day.
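The center-point-plus-PAR runs described above are easy to generate directly in JSL before handing them to the augmentation step. A minimal sketch in coded units, assuming four factors scaled to -1/+1 (not the author's actual script):

Names Default To Here( 1 );
k = 4;                                   // assumed number of process parameters
design = J( 1, k, 0 );                   // one center-point run at coded 0
For( j = 1, j <= k, j++,
	low = J( 1, k, 0 );
	low[j] = -1;                         // factor j at its proven acceptable low
	high = J( 1, k, 0 );
	high[j] = 1;                         // factor j at its proven acceptable high
	design = design |/ low |/ high;      // append the two one-factor-at-a-time excursions
);
dtPAR = As Table( design );              // 1 + 2k runs to supply as pre-existing points when augmenting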
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Heuristic Perspectives on Parametric Survival Analysis (2020-US-30MP-567)
Monday, October 12, 2020
Thor Osborn, Principal Systems Research Analyst, Sandia National Laboratories Parametric survival analysis is often used to characterize the probabilistic transitions of entities — people, plants, products, etc. — between clearly defined categorical states of being. Such analyses model duration-dependent processes as compact, continuous distributions, with corresponding transition probabilities for individual entities as functions of duration and effect variables. The most appropriate survival distribution for a data set is often unclear, however, because the underlying physical processes are poorly understood. In such cases a collection of common parametric survival distributions may be tried (e.g., the Lognormal, Weibull, Frechét and Loglogistic distributions) to identify the one that best fits the data. Applying a diverse set of options improves the likelihood of finding a model of adequate quality for many practical purposes, but this approach offers little insight into the processes governing the transition of interest. Each of the commonly used survival distributions is founded on a differentiating structural theme that may offer valuable perspective in framing appropriate questions and hypotheses for deeper investigation. This paper clarifies the fundamental mechanisms behind each of the more commonly used survival distributions, considering the heuristic value of each mechanism in relation to process inquiry and comprehension. Auto-generated transcript (fragmentary)... Speaker Transcript Hello, and welcome to my ... over the past 25 years, I have performed many studies and ... share with you a way of thinking about the distributions we ... motivated by precedent, ease of use, or empirically demonstrated ... about its processes. Further, when an excellent model fit is ... genesis of the distributions commonly used in parametric ... seen in the workplace as well as in the academic literature. ... literature, including textbooks and web based articles, as well ... reexamination that may fail to glean full value from the work. ... the exponential. Much is often made about the ... because they model fundamentally different system archetypes. In ... distribution does in fact fit the lognormal data very well. The quality of the fit may also ... fits much better. And secondly, there's only a modest coincident ... the core process mechanisms these distributions represent ... analysis, but it provides a very familiar starting point for ... uncorrelated effects. Let's see if that is true. In order to create a good ... 25,000. For the individual records, we'll use the random ... see that we did indeed obtain the normal distribution. Now let's consider the ... not able to imprint my brain with a sufficient knowledge of ... lognormal distribution are also very simple. As you can see, the ... this demonstration, we reuse the fluctuation data that were ... JSL scripting because I find it much more convenient for ... the number of records in each sample. Next, it extracts the ... products. The outer loop tracks the ... on the previous slide. The amplified product compensates ... distributions may be considered as generated secondarily from ... many similar internal processes is represented by its maximum ... to be Frechet distributed.
The Weibull distribution represents ... processes that complete when any of multiple elements have ... using the Pareto distribution as the source. In this case, the ... absolute value of the normal distribution as the source. Now let's have a quick look at ... maximum is used. For the square root of the ... is not available, you can also see that the other common ... value of the normal distribution quite well. Incidentally, Weibull ... distribution when its core behavior is substantially ... the four heme containing subunits mechanically interact ... up to now have all relied on independent samples. Professor ... extended to produce auto correlated data. Generation of ... sequence autocorrelation is about .75, yet the ... the common survival distributions. You can see that ... good example of the relationship between real-world analytical ... commingle single family residences with heavy industry. ... have similar features. The landowner must apply to the ... an opportunity to comment. Local officials then weigh the ... parties. This example is not approached as a demonstration ... processing time is 140 days. The fit is obviously imperfect, but ... distributed data results from processes yielding the combined ... ubiquitous, but the loglogistic is less frequently used. Without ... multistep process may be insufficient to impart log ... considered and the complexity of the underlying process should ... whether a process is substantially impacted by ... whether the cooperative element is connoted by positive terms such ... often been said, I would sincerely appreciate your ...
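The multiplicative mechanism behind the lognormal distribution, referenced in the abstract and the demonstration above, can be reproduced with a few lines of JSL. This is a minimal sketch under assumed settings (5,000 records, 40 small positive fluctuations each), not the author's simulation script: the product of many small, independent, positive fluctuations is approximately lognormal, just as the sum of many small independent effects is approximately normal.

Names Default To Here( 1 );
n = 5000;                                // simulated records (assumed)
k = 40;                                  // independent multiplicative fluctuations per record (assumed)
y = J( n, 1, 1 );
For( i = 1, i <= n, i++,
	For( j = 1, j <= k, j++,
		y[i] = y[i] * Random Uniform( 0.9, 1.1 )   // one small positive fluctuation
	)
);
dt = New Table( "Multiplicative process",
	New Column( "Y", Numeric, "Continuous", Set Values( y ) )
);
dt << Distribution( Column( :Y ) );      // a lognormal fit from the red triangle menu should describe Y well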
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
JMP BEAST Mode: Boundary Exploration through Adaptive Sampling Techniques (2020-US-30MP-562)
Monday, October 12, 2020
James Wisnowski, Principal Consultant, Adsurgo Andrew Karl, Senior Statistical Consultant, Adsurgo Darryl Ahner, Director OSD Scientific Test and Analysis Techniques Center of Excellence Testing complex autonomous systems such as auto-navigation capabilities on cars typically involves a simulation-based test approach with a large number of factors and responses. Test designs for these models are often space-filling and require near real-time augmentation with new runs. The challenging responses may have rapid differences in performance with very minor input factor changes. These performance boundaries provide the most critical system information. This creates the need for a design generation process that can identify these boundaries and target them with additional runs. JMP has many options to augment DOEs conducted by sequential assembly where testers must balance experiment objectives, statistical principles, and resources in order to populate these additional runs. We propose a new augmentation method that disproportionately adds samples at the large gradient performance boundaries using a combination of platforms to include Predictor Screening, K Nearest Neighbors, Cluster Analysis, Neural Networks, Bootstrap Forests, and Fast Flexible Filling designs. We will demonstrate the Boundary Explorer add-in tool with an autonomous system use-case involving both continuous and categorical responses. We provide an improved “gap filling” design that builds on the concept behind the Augment “space filling” option to fill empty spaces in an existing design. Auto-generated transcript... Speaker Transcript James Wisnowski Welcome, Team Discovery. Andrew Karl, Darryl Ahner, and I are excited to present two new adaptive sampling techniques that are really going to provide practitioners some wonderful usefulness in terms of augmenting design of experiments. And what I want to do is kind of go through a couple of our processes here and talk about how this all came about. But when we think about DOE and augmenting designs, there is a robust capability already in JMP. What we have found, though, working with some very large scale simulation studies, is that we're missing a piece here...gap filling designs and adaptive sampling designs. And the key point is going to be that the adaptive sampling designs are going to be focusing on the response. So this is quite different from when you think of maybe a standard design where you augment and you look at the design space and look at the X matrix. So now we're going to actually take into account the targets or the responses. So this will actually end up providing a whole new capability so that we can test additional samples where the excitement is. So we want to be in a high gradient region, much like you might think of steepest ascent in response surface methodology. Now we're going to automate that in terms of being able to do this with many variables and thousands of runs in the simulation. The good news is that this does scale down quite nicely for the practitioner with small designs as well. And I'll do a quick run through of our add-in that we're going to show you, and then Andrew will talk a little bit about the technical details of this.
So one thing I do want to apologize, this is going to be fairly PowerPoint centric rather than JMP add in for two reasons...I should say, rather than JMP demo...for two reasons, primarily because our time, we've got a lot of material to get through, but also our JMP utilization is really in the algorithm that we're making in this adaptive sampling. So ultimately, the point and click of JMP is a very simple user interface that we've developed, but what's behind the scenes and the algorithm, it's really the power of JMP here, so. So real quick, the gap filling design, pretty clear. We can see there's some gaps here, maybe this is a bit of an exaggeration puts demonstrative of technique, though in reality we may have the very large number of factors with that curse of dimensionality can come into play and you have these holes in your design. And you can see, we could augment it with this a space filling design, which is kind of the work horse in the augmentation for a lot of our work, particularly in stimulation calling and it doesn't look too bad. If we look at those blue points which are the new points, the points that we've added, it doesn't look too bad. And then if you start looking maybe a little closer, you can kind of see though, we started replicating a lot of the ones that we've already done and maybe we didn't fill in those holes as much as we thought, particularly when we take off the blue coloring and we can see that there's still a fair amount of gaps in there. So we, as we're developing adaptive sampling, recognize one piece of that is we needed to fill in some holes in a lot of these designs. And we came up with an algorithm in our tool, our add in, called boundary explorer that will just do this particular... for any design, it will do this particular function to fill in the holes and you can see where that might have a lot of utility in many different applications. So in this particular slide or graph, we can see that those blue points are now maybe more concentrated for the holes and there are some that are dispersed throughout the rest of the region. But even when we go to the... you can color that looks a lot more uniform across, we have filled that space very well. Now that was more of a utility that we needed for our overall goal here, which was an active sampling. And the primary application that we found for this was autonomous systems, which have gotten a lot of buzz and a lot of production and test, particularly in the Department of Defense. So in autonomous systems, you may think of there's really two major items when you think of it. In autonomous systems really what you're looking at is, is you really need some sensor to kind of let the system know where it is. And then the algorithm or software to react to that. So it's kind of the sensor- algorithm software integration that we're primarily focused on. And what that then drives is a very complex simulation model that honestly needs to be run many, many thousands of times. But more importantly, what we have found is in these autonomous systems, there's there's these boundaries that we have in performance. So for example, we have a leader-follower example from the from the army. That's where a soldier would drive a very heavily armored truck in a convoy and then the rest of the convoy would be autonomous, they would not have soldiers in them. Or think of maybe the latest Tesla, the pickup truck, where you have auto nav, right? 
So the idea is we are looking for testing these systems and we have to end up doing a lot of testing. And what happens is for example, maybe even in this Tesla, that you could be at 30 miles an hour, you may be fine and avoiding an obstacle. But at 30.1 you would have to do an evasive maneuver that's out of the algorithm specifications. So that's what we talk about when we say these boundaries are out there. They're very steep changes in the response, very high gradient regions. And that's where we want to focus our attention. We're not as interested as where it's kind of a flat surface, it's really where the interesting part is, that's where we would like to do it. And honestly, what we found is, the more we iterate over this, the better our solution becomes. We completely recommend do this as an iterative process. So hence, that's the adaptive piece of this is, do your testing and then generate some new good points and then see what those responses are and then adapt your next set of runs to them. So that's our adaptive sampling. Kind of the idea of this really, the genesis, came from some work that we did with applied physics labs at Johns Hopkins University. They are doing some really nice work with the military and while reviewing it in one of their journal articles, I was thinking to myself, you know, this is fantastic in terms of what you're doing, and we could even use JMP to maybe come up with a solution that would be more accessible to many of the practitioners. Because the problem with Johns Hopkins is is that it's very specific and it's somewhat...to integrate, it's not something that's very accessible to the smaller test teams. So we want to give...put this in the hands of folks that can use it right away. So this paper from the Journal of Systems and Software, this is kind of the source of our boundary explorer. And as it turns out, we used a lot of the good ideas but we were able to come up with different approaches and and other methods. In particular, using native capability in JMP Pro as well as some development, like the gap filling design that we did along the way. Now, In terms of this example problem, probably best I'll just go and kind of explain it right in a demo here. So if I look at a graph here, I can see that...I'll use this...I'll just go back to the Tesla example. So let's say I'm doing an auto navigation type activity and I have two input factors and let's say maybe we have speed and density of traffic. So we're thinking about this Tesla algorithm. It wants to merge to the lane to the left so it wants to, I should say, you know, pass. So it has to merge. So one of them would be the speed the Tesla is going and then the other might be the density of traffic. And then maybe down in this area here we have a lower number. So we can think of these numbers two to 10, we could maybe even think of the responses, maybe even like a probability of a collision. So down at low speed/low density, we have a very low probability of of collision, but up here at the high speed/ high density, then you have a very high probablity. But the point is it what I have highlighted and selected here, you can see that there's very steep differences in performance along the boundary region. So it would, as we do the simulation to start doing more and more software test for the algorithm, we'll note that it really doesn't do us a lot of good to get more points down here. We know that we do well in low density and low speed. 
What we want to do is really work on the area in the boundaries here. So that's our problem, how can I generate 200 new points that are really going to be following my boundary conditions here. Now, here what I've done is I have really, it's X1 and X2, again, think of the speed and... our speed as well as the density. And then I just threw in a predictor variable here that doesn't mean anything. And then there's, there's our response. So to do this, all I have to do is come into boundary explorer and under adaptive sampling, my two responses (and you can have as many responses as you need) and then here are my three input factors. And then I have a few settings here, whether or not I want to target the global minimum and max, show the boundry. And we also ultimately are going to show you that you have some control here. So what happens is in this algorithm is we're really looking for, what are the nearest neighbors doing? If all the nearest neighbors have the same response, as in zero probability of having an accident, that's not a very interesting place. I want to see where there's big differences. And that's where that nearest neighbor comes into play. So I'll go ahead and and run this. And what we're seeing on there is we can see right now that the algorithm, it used JMP's native capability for the prediction screening and fortunately, is not using the normal distribution. You can see it's running the bootstrap forest. Andrew is going to talk about where that was used. And ultimately what we're going to do here, is we're going to generate a whole set of new points that should hopefully fall along the boundary. So that took, you know, 30 seconds or so to do these these points and from here I can just go ahead and pull up my new points. So you can see my new points are sort of along those boundaries, probably easiest seen if I go ahead and put in the other ones. So right here, maybe I'll switch the color here real quick. And I'll go ahead and show maybe the midpoint in the perturbation. So right now we can kind of see where all the new points are. So, the ones that are kind of shaded, those are the ones that were original and now we're kind of seeing all of my new points that have been generated in that boundry. So of course the question is, how, how did we do that? So what I'll do is I'll head back to my presentation. And from there, I'll kind of turn it over to Andrew, where he'll give a little bit more technical detail in terms of how we go about finding these boundry points because it's not as simple as we thought. Andrew Karl Okay. Thanks, Jim. I'm going to start out by talking about the the gap filling real quick because we've also put this in addition to being integrated into the overall beast tool. It's a standalone tool as well. So it's got a rather simple interface where we select the columns that we define the space that we went to fill in. And for continuous factors, it reads in the coding column property to get the high and low values and it can also take nominal factors as well. In addition, if you have generated this from custom design or space filling design and you have disallowed combinations, it will read in the disallowed combination script and only do gap filling within the allowed space. So the user specifies their columns, as well as the number of new runs they want. And let me show a real quick example in a higher dimensional space. This is a case of three dimensions. 
We've got a cube where we took out a hollow cylinder and we went through the process of adding these gap filling runs, and we'll turn them on together to see how they fit together. And then also turn off the color and to see what happens. So this is nice because in the higher dimensional space, we can fill in these gaps that we couldn't even necessarily see in the by variate plots. So how do we do this? So what we do is, we take the original points, which in this case is colored red now instead of black and we can see where those two gaps were, and we overlay a candidate set of runs from a space filling design for the entire space. We add for the concatonated data tables of the old and the new candidate runs, we have an indicator column, continuous indicator column, we label the old points 1 and the label the candidate point 0. And in this concatenated space, we now fit a 10 nearest neighbor model to the to the indicator column and we save the predictions from this. So the candidate runs with the smallest predictions, in this case, blue, are the gap points that we want to add into the design. Now, if we do this in a single pass, what it tends to do is overemphasize the largest gaps. So we do is we actually do this in a tenfold process, where we will take a tenth of our new points, select them as we see here, and then we will add those in and then rerun our k-nearest neighbor algorithm to pick out some new points and to fill out all the spaces more uniformly. So that's just one option...the gap filling is one option available within boundary explorer. So Jim showed that we can use any number of responses, any number of factors and we can have both continuous and nominal responses and continuous and nominal factors. The fact...the continuous factors that go in, we are going to normalize those behind the scenes to 01 to put them on a more equal footing. And for the individual responses that go into this, we are going to loop individually over each response to find the boundaries for each of the responses within the factor space. And then at the end, we have a multivariate tool using a random forest that considers all of the responses at once. And so we'll see how each of the different options available here in the GUI, in the user interface, comes up within the algorithm. So after after normalization for any of these continuous columns, the first step is predictor screening for all the both continuous and nominal responses. And this is to do is to find out the predictors, they're relevant for each particular response. And we have a default setting in the user interface of .05 for proportion of variants explained, or portion of contribution from each variable. So in this case, we see that X1 and X2 are retained for response Y1, and X3 noise is rejected. The next step is to run a nearest neighbor algorithm. And we use the default to 5, but that's an option that the user can toggle. And we aren't so concerned with how well this predicts as we are to just simply use this as a method to get to the five nearest neighbors. What are the rows of the five neighbors neighbors and how far are they? What is the distances from the current row? And we're going to use this information of the nearest neighbors to identify each point, the probability of each point being a boundary point. We have to use split here and do a different method for continuous or nominal responses. 
For the nominal responses, what we do is we concatenate the response from the current column along with the responses from the five nearest neighbors in order, in this concatenate concatenate neighbors column. And we have a simple heuristic we use to identify the boundary probability based on that concat neighbors column. If all the responses are the same, we say it's low probability of being a boundary point. If, at least one of the responses is different, then we say it's got a medium probability of being a boundary, excuse me, a boundary point. And if two or more of the responses are different, it's got a high probability of being a boundary point. We also record the row used. In this case, that is the the boundary pair. So that is the closest neighbor that has a response that is different from the current row. We can plot those boundary probabilities in our original space filling design. So as Jim mentioned early on, we have a...we initially run a space filling design before running this boundary explore tool to get...to explore the space and to get some responses. And now we fit that in and we've calculated the boundary probability for for these. And we can see that our boundary probabilities are matching up with the actual boundaries. For continuous responses we take the continuous response from the five nearest neighbors, and add a column for each of those, and we take the standard deviation of those. The ones with the largest standard deviations of neighbors are the points that lie in the steepest gradient areas and those are more likely to be our boundary points. We also multiply the standard deviation by the mean distance in order to get our information metric, because what that does is for two points that have an equal standard deviation of neighbors, it will upweight the one that is in a more sparse region with fewer points that are there already. So now we've got this continuous information metric and we have to figure out how to split that up into high, medium, and low probabilities for each point. So what we do is we fit in distribution. We fit in normal three mixture and we use the mean as the largest distribution as the cutoff for the high probability points. And we use the intersection of the densities of the largest and the second largest normal distributions as the cutoff for the medium probability points. So once we've identified those those cut offs, we apply that to form our boundary probability column. And we also retain the row used, which is the closest. In this case for the continuous responses, that is the neighbor that has the response that's the most different in absolute value from the current role. So now for both continuous and nominal responses we have the same output. We have the boundary probability and the row used. Now that we've identified the boundary points, we need to be able to use that to generate new points along the boundary. So the first and, in some ways, the best method for targeting and zooming in on the boundary is what we call the midpoint method. And what we do for each boundary pair, each row and its row use, its nearest neighbor identified previously...I'm sorry, so not nearest neighbor but neighbor that is most relevant either in terms of difference in response nominal or most different in terms of continuous response. For the continuous factors we take the average of the coordinates for each of those two points to form the mid point. And that's what you see in the graph here. So we would put a point at the red circle. 
For nominal factors, what we do for the boundary pairs is take the levels of that factor that are present in each of the two points and randomly pick one of them. The nice thing about that is that if they're both the same level, then the midpoint is also going to have that same level for that nominal factor. A second method, which we call the perturbation method, is to simply add a random deviation to each of the identified boundary points. For the high-probability points we add two such perturbation points; for the medium-probability points we add one. For the continuous factors, we add a random normal deviation with mean 0 and standard deviation .075 in the normalized space, and that .075 is something you can scale within the user interface to either increase or reduce the amount of spread around the boundary. And then for nominal factors, we randomly pick a level of each of the nominal factors. Now, for the high-probability boundary points that get a second perturbation point, in that second one we restrict the nominal factor settings to all be equal to those of the original point. So we do this process of identifying the boundary and creating the midpoints and perturbation points for each of the responses specified in Boundary Explorer. Once we do that, we concatenate everything together, and then we look at all the midpoints identified for all the responses and use a multivariate technique to generate any additional runs, because the user can specify how many runs they want, and the midpoint and perturbation methods only generate a fixed number of runs, depending on the lay of the land, I guess you could say, of the data. So what we do is something similar to the gap-filling design: we take all of the identified perturbation points and midpoints for all of the responses, and we fill the entire space with a space-filling design of candidate points. We label the candidate points 0 in a continuous indicator, the midpoints 1, and the perturbation points .01. We fit a random forest to this indicator, save the predictions for the candidate space-filling points, and then take the candidate runs with the largest predicted values of this boundary indicator. Those are the ones we add in using this random forest method. Now, since this is a multivariate method, if you have an area of your design space that is a boundary for multiple responses, that area will receive extra emphasis and extra runs. So here we're showing the three types of points together. Now, again, to emphasize what Jim said, this needs to happen in multiple iterations, so we would collect this information from our Boundary Explorer tool and then concatenate it back into the original data set. And then after we have those responses, rerun Boundary Explorer, and over the iterations it will keep zooming in on the boundaries and possibly even find more boundaries. The perturbation points are most useful for early iterations, when you're still exploring the space, because they're more spread out, and the random forest method is better for later iterations, because it will have more midpoints available, since it uses not only the ones from the current iteration but also the previously recorded ones. We have a column in the data table that records the type of point that was added, so we'll use all the previous midpoints as well.
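As a rough sketch of the midpoint and perturbation steps for continuous factors in the normalized [0, 1] space, here is a short Python illustration. Nominal factors and the random-forest top-up step are omitted, and the names are hypothetical rather than taken from the add-in.

# Midpoint and perturbation point generation for continuous factors (illustrative).
import numpy as np

def midpoints_and_perturbations(X, prob, row_used, scale=0.075, seed=None):
    """X: n x d normalized factor settings; prob: array of 'high'/'medium'/'low';
    row_used: index of each row's boundary-pair neighbor."""
    rng = np.random.default_rng(seed)
    pairs = np.flatnonzero(np.isin(prob, ["high", "medium"]))

    # Midpoint method: average each boundary point with its boundary-pair row.
    mids = (X[pairs] + X[row_used[pairs]]) / 2.0

    # Perturbation method: one jittered copy for medium points, two for high points.
    reps = np.where(prob[pairs] == "high", 2, 1)
    base = np.repeat(X[pairs], reps, axis=0)
    perturbed = np.clip(base + rng.normal(0.0, scale, size=base.shape), 0.0, 1.0)

    return mids, perturbed

The scale argument plays the role of the .075 spread setting mentioned above: increasing it spreads the perturbation points farther from the identified boundary.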
So if we pull up our surface plot for this example we've been looking at, a step function, we can see our new points, midpoints, and perturbation points are all falling along the cliffs, which are the boundaries, which is what we wanted to see. The last two options in the user interface are to include those gap-filling runs, and to ask it to target a global minimum or maximum or match a target, if that goal is set as a column property. Just to show one final example here, we have these two canyons running through a plane with kind of a deep well at their intersection. We've run the initial space-filling points, which are the points that are shown, to get an idea of the response. If we run two iterations of our Boundary Explorer tool, this is where all the new points are placed, and we can see the gaps kind of in the middle of these two lines. What are those gaps? If we take a look at the surface plot, those gaps are the canyon floors, where it's not actually steep; it's flat, at least locally over a little region. But all of these midpoints have been placed not on the planes but on the steep cliffs, which is where we wanted them. And here we're toggling the minimum points on and off, and you can see those are hitting the bottom of the well there, so we were able to target the minimum as well. So our tool presents two distinct new options: the gap filling, which can be used on any data table that has coding properties set for the continuous factors, and the Boundary Explorer tool, which adds runs that don't just look at the factor space by itself but look at the response in order to target the high-gradient, high-change areas with additional runs.
Labels (8):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Rubbish or Recycle: An Exploration of Waste & Recycling Data and Covid-19 Impacts (2020-US-EPO-559)
Monday, October 12, 2020
Lisa Grossman, Associate Test Engineer, JMP Mandy Chambers, Principal Test Engineer, JMP With a growing population in Cary, it is important to understand the environmental impact that a household can incur and what sustainable options are available that can minimize an individual’s environmental footprint. Recycling, for one, is a universally known method that reduces waste that would otherwise be disposed at landfills, which already is a capacity concern around the world. According to the Recycling Coalition of Utah, only 25% of plastic produced in the U.S. gets recycled, and recycling the other 75% could mean saving up to 44 million cubic yards of landfill space annually. Using recycling data that is recorded by the Town of Cary, we will analyze the relationship between the collected waste and recyclables. We will also construct visualizations to explore the breakdowns of waste and each recycling category. The goal is to compare our analysis with statistics from other cities in the U.S. to assess recycling practices of the people of the Town of Cary and determine levels of further recycling potential. And in the midst of a pandemic, we will discover how Covid-19 has influenced waste and recycling management within the country. With our findings, we hope to communicate about environmental initiatives and inform about recycling efforts in our very own community as well as addressing some impacts of Covid-19. Auto-generated transcript... Speaker Transcript Lisa Grossman Okay. All right. I'll get started. Hi everyone, my name is Lisa Grossman and my partner, Mandy Chambers, and I are both testers here at JMP. And today we are excited to share with you the work that Mandy and I have done with recycling and garbage collection data. So we were interested in looking at recycling and trash data in our own community, being Cary, North Carolina. And we were curious about what kind of patterns we would uncover inour exploration and learn about some Covid 19 impacts all while using JMP. So in JMP, we're going to be using Graph Builder's visual tools to see trends in Cary's trash and recycling collection categories such as paper, plastic, glass, etc. And using Text Explorer's word cloud feature, we're going to use that to identify some challenges for waste and recycling management that may have arisen due to the Covid 19 outbreak. And from what I show you today, we hope that you'll be able to use these quick and easy steps to explore your own data. And so for those of you who may not know Cary North Carolina is the home where SAS is headquartered. And Cary has a population of approximately 175,000 current residents, which is about a 30% increase since 2010. And thanks to the town of Caryt, we were able to get our hands on some of the recycling and trash collection data they had recorded from 2010 to the present. So I wanted to quickly go over some of the steps we took in our process to explore the data, which include first importing Excel sheets that we got from the town of Cary into JMP. And I wanted to note here that the Excel wizard does offer many advanced features that you might be interested in the case that you would need to import Excel sheets to JMP. And to organize our data and columns, we use table operations like transpose and updated column properties and the column info dialogue to make our data a little easier to graph later on. And then launching Text Explorer and Graphs Builder platforms, we used those to make our basic visualizations. 
And then I'm going to show you a new hardware label feature that's available in JMP 15 called pasting graphlets and I will show you an example of this later on using a tabulate. And so getting to our graphs and figures for Cary. So looking at them, we can see that the first two up top are looking at the breakdown. So they're recycling categories. So getting a closer look here, we can see the bar chart on the left is showing the average capture rates of the overlaid recycling categories from 2010 to 2019. And we can see that the news, glass and cardboard are the three leading categories for recycling. And then in the line graph to the right, we can see that the trends of the recycling rates of each category over the years. And what's interesting, that I wanted to point out here is that it seems that news and mixed paper are inversely related to each other. And then going back to our poster, let's look at the last two graphs we have here. And so these we are looking at recycling in comparison to garbage collection in Cary. So zooming in here, we can see the stacked bar on the left. shows the total percentage of waste and recycling recorded each year. And labeling the percentage values on the bar themselves, we can see that the recycling collection volume seems to have been slightly decreasing since 2014. And then the graph to the right, we can see the progression of both trash and recycling from 2010 to 2019. And this visual shows us how the tonnage of trash is increasing Each year, which seems expected for as the population increases. But what is surprising is that the tonnages for recycling have remained rather steady. So thinking about this, we were wondering if this could be due to a rise in more sustainable products such as using personal water bottles or tumblers. So, but Now that we are in the midst of Covid 19 we were curious to learn if there were any noticeable differences in recycling and trash collection so far this year. And the town of Cary was able to provide us some with some updated data that goes up to the month of June. And so we created another stacked bar here to show how 2020 has compared so far to the previous year. And at first glance 2020 is steadily increasing and the labeled tonnages do not show a significant spike in the collection so far. So then we decided to break it down month by month using our side by side bar charts to compare 2020 to 2019. And so our top bar chart here shows recycling overlaid by curbside drop off on computer recycling. And then the bottom chart shows trash collection. So in the month of March when North Carolina first implemented stay at home orders Cary saw a nearly 21% increase in garbage and 23.8% increase in recycling collection. And just for reference 21% increase is about 1.1 million pounds. And in April and May trash and recycling have somewhat leveled out but then spiked again in June, so it will be interesting to see how the rest of the year will pan out. So something I wanted to point out here is, notice the information included in the hover label that is pinned. So using the labeling feature, which can be done by right clicking on columns in your data table and selecting label, you will be able to see that column information represented in the hover label. So you can add as many columns as you'd like to...so that you can read in that information in your graph. So doing some further reading, we saw that Wake County, the county that Cary is in, reportedly generated about 29% more trash. 
So, totaling about 739 tons 45% more cardboard recycling and 20% more recycling in the week of April 13 alone. And we also found an estimate that the World Health Organization said that they're using, or that the world is using about 89 million masks and 76 million gloves each month. And we found an article here that gives us some insight on how the Covid 19 outbreak has affected recycling and trash collection. And so by downloading the article and importing it to JMP, we could use a Text Explorer platform to identify some themes in the word cloud. So I'll zoom in on it here. So you can use some features and options in Text Explorer like manage stop words and, in the word cloud itself, you can change the coloring and the layout and the font, so to really customize your word cloud. And so after making these customizations, we got a word cloud here on what is shown to the right. And notice that an increase in tonnage has been the highlight for cities like Phoenix and New York City. And because of this, we were curious to learn more about recycling and trash management in New York. And luckily, we were able to find some open data for Brooklyn, Manhattan, Queens, Bronx and Staten Island. So if we look at the first line chart that shows the average tonnage collection for paper and metal and glass and plastic, we can see here in this chart I have I have the boroughs grouped. And then they are scaled here by the month and we're looking at the recycling collection and tonnages here on the y axis. So we can see that boroughs like Bronx and Brooklyn are steadily increasing starting in the month 4, being April, but we can see that there is more of a spike collection that is in both Staten Island and Queens. But what's very interesting is that there's a noticeable decrease in collection in Manhattan. And we were curious as to why this might be. And with a little research, we have come to the conclusion, it seems that stay at home orders meant that there were fewer workers in the city, so therefore, leading to reduced recycling capture. At a similar trend here can be shown for garbage collection rates in the line graph that we have. And so we can see in the same manner, Manhattan sees a dip in garbage collection, whereas Queens and Staten Island saw an increase. But something we wanted to highlight here in this graph as a new feature of JMP 15 is this custom tabulate graphlet in the hover label. So notice that the pinned hover label here shows us at tabulate that gives us the tonnage values of both recycling categories and garbage collection for the months of January to June, just for Queens in 2020, which is the point on the line, which we have pinned. So creating this line graph with a custom tabulate graphlet, it was only a matter of a couple steps. So first we needed to make our base graph, which is the line graph we have here. But then we separately created our tabulate ...our tabulate table which is shown here. And for space sake, I couldn't include the whole tabulate, but as you can see, it shows the monthly averages of recycling and trash collection for each borough in 2019 and 2020. And so all we would have to do is go into that little right triangle menu to tabulate and save the script to our clipboard. And then the next thing we would do is go back to our base graph and right click in the background and under the hover label menu, there's going to be a paste graphlet option. And so you don't have to worry about any filtering or anything. Doing the paste graphlet, takes the... 
there is some magic that works behind it. And so that's that's all you would need to do and each point would be filtered for you. So, Now when you hover over a point in your line, you can see that it is complete and the filter parts of your tabulate corresponds to your point of interest. So this concludes our presentation on our findings with trash and recycling collection from the town of Cary and New York and as the year plays out, I think it'd be very interesting to see how this data might change and I hope to keep looking at it and see how 2020 will pan out. So we wanted to give some special thanks to Bob Holden and Srijana Guilford, especially from the town of Cary for helping us through and working with us with their data and sharing their data sets. And I have here linked the open data set from... for the New York data. And it's, I think, I believe it's constantly updated. So if you are interested in playing around with that data, it's available here. And I also have linked here some more information on graphlets. There's a ton of ways that you can use graphlets, and many, many ways that you can customize them too, so please check out this link and you can meet the developer, Nascif, and get some more information there. Thank you.
Labels (7):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Let's talk tables (2020-US-45MP-549)
Monday, October 12, 2020
Mandy Chambers, JMP Principal Test Engineer, SAS Kelci Miclaus, Senior Manager Advanced Analytics R&D, JMP Life Sciences, SAS, JMP LIfe Sciences JMP has many ways to join data tables. Using traditional Join you can easily join two tables together. JMP Query Builder enhances the ability to join, providing a rich interface allowing additional options, including inner and outer joins, combining more than two tables and adding new columns, customizations and filtering. In JMP 13, virtual joins for data tables were developed that enable you to use common keys to link multiple tables without using the time and memory necessary to create a joined (denormalized) copy of your data. Virtually joining tables gives a table access to columns from the linked tables for easy data exploration. In JMP 14 and JMP 15, new capabilities were added to allow linked tables to communicate with row state synchronization. Column options allow you to set up a link reference table to listen and/or dispatch row state changes among virtually joined tables. This feature provides an incredibly powerful data exploration interface that avoids unnecessary table manipulations or data duplications. Additionally, there are now selections to use shorter column names, auto-open your tables and a way to go a step further, using a Link & ID and Link & Reference on the same column to virtually “pass through” tables. This presentation will highlight the new features in JMP with examples using human resources data followed by a practical application of these features as implemented in JMP Clinical. We will create a review of multiple metrics on patients in a clinical trial that are virtually linked to a subject demographic table and show how a data filter on the Link ID table enables global filtering throughout all the linked clinical metric (adverse events, labs, etc.) tables. Auto-generated transcript... Speaker Transcript Mandy Okay, welcome to our discussion today. Let's Talk Tables. My name is Mandy Chambers and I'm a principal test engineer on the JMP testing team. And my coworker and friend joining me today is Kelci Miclaus. She's a senior manager in R&D on the JMP life sciences team. Kelci and I actually began working together a few years ago as a Clinical product was starting to be a great consumer of all the things that I happen to test. So I got to know her group pretty well and got to work with them closely on different things that they were trying to implement. And it was really valuable for me to be able to see a live application that a customer would really be using the things that I actually tested in JMP and how they would put them to use them in the clinical product. So in the past, we've done this presentation. It's much longer and we decided that the best thing to do here was to give you the entire document. So that's what's attached with this recording, along with two sets of data and zip files. You should have data tables, scripts, some journals and different things you need and be able to step through each of the applications, even if I end up not showing you or Kecli doesn't show you something that's in there. You should be able to do that. So let me begin by sharing my screen here. So that you can see what what I'm going to talk about today. So as I said, the, the journal that I had, if I were going to show this in its entirety, would be talking about joining tables and the different ways that you can join tables. 
And so this is the part that I'm not going to go into great detail on but just a basic table join. If I click on this, laptop runs and laptop subjects. And under the tables menu, if you're new to JMP or maybe haven't done this before, you can do a table join and this is a for physical join. This will put the tables together. So I would be joining laptop runs to laptops subjects. Within this dialogue, you select the things that you want to join together. You can join by matching, Cartesian join, row join and then you would join the table. I'm not going to do that right now, just for time consumption but that's that's what you would do. And also in here under the tables menu, something else that I would talk about would be JMP query builder. And this has the ability to be able to join more tables together. It will, if you have 3, 4, 5, 6 however many tables you have, you can put them together and we'll make up one table that contains everything. But again, I'm actually not going to do that today. So if I go back into here and I close these tables. Let's get started with how virtual join came about. So let's talk about joining tables first. You have to decide what type of join you want to use. So your...if you're tables are small, it might be easiest to do a physical join. To just do a tables join, like the two tables I showed you weren't very big. If you pull in three or four maybe more tables, JMP query builder is a wonderful tool for building a table. And you may want all of your data in the same table so that may be exactly what you want. You just need to be mindful of disk space and performance, and just understand if you have five or six tables that you have sitting separately and then you join them together physically, you're making duplicate copies. So those are the ways that you might determine which which you would use. Virtual join came about in JMP 13 and it was added with the ability to take a link, a common link ID, and join multiple tables together. It's kind of a concept of joining without joining. It saves space and it also saves duplication of data. And so that...in 13 we we started with that. And then in 14 to 15, we added more features, things that customers requested. Link tables with rows synchronize...rows states synchronization. You can shorten column names. We added being able to auto open linked tables. Being able to have a link ID and a link reference on the same column. And we also added these little hover tips that I'll show you where it can tell you which source is your column source table. So those are the things that we added and I'm going to try to set this up and demonstrate it for you. So I've got this data that I actually got from a... it's just an imaginary high-tech firm. And it's it's HR data and it includes things such as compensation, and headcount, and some diversity, and compliance, education history, and other employment factors. And so if you think about it, it's a perfect kind of data to link because you have usually a unique ID variable, such as an employee ID or something that you can link together and maybe have various data for your HR team that's in different places. So I'm going to open up these two tables and just simply walk through what you would do if you were trying to link these together. So this table here is Employee Scores 1 and then I have Compensation Master 1 in the back. These tables, Employees Scores 1 is my source table. And Compensation Master is my referencing table. 
So you can see these ID, this ID variable here in this table. And it's also in the compensation master table. So I'm going to set up my link ID. So it's a couple of different ways to do this. You can go into column properties. And you can see down here, you have a link ID and reference. The easiest way to do this is with a right click, so there's link ID. And if I look right here, I can see this ID key has been assigned to that column. So then I'm going to go into my compensation master table. And I'm going to go into this column. And again, you can do it with column properties. But you can do the easiest way by going right here to link reference, the table has the ID. So it shows up in this list. I'm going to click on this and voila, there's my link reference icon right there. And I can now see that all the columns that were in this table are...are available to me in this table. You can see you have a large number of columns. You can also see in here that you have...they're kind of long column names, you have the column names, plus this identifier right here which is showing you that this is a referencing column. And so I'm going to run this little simple tabulate I've saved here and just show you very briefly that this is a report and just to simply show you this is a virtual column length of service. And then compensation type is actually part of my compensation table and then gender is a virtual column. So I'm using...building this using virtual columns and also columns that reside in the table. One thing I wanted to point out to you very quickly is that under this little red triangle...let's say you're working with this data and you decide, "Oh, I really want to make this one table. I really want all the columns in one table." There is a little secret tool here called merge reference data. And a lot of people don't know this is there, exactly. But if I wanted to, I could click that and I can merge all the columns into this table. And so, but for time sake, I'm not going to do that right now, but I wanted to point out where that is located. And let me just show you back here in the journal, real quickly. This is possible to do with scripting, so you can set the property link reference and point to your table and list that to use the link columns. So I'm going to close this real quickly and then go back to the same application where I actually had same two tables that I've got some extra saved scripts in here, a couple more things I want to show. So again, I've got employee scores. This is my source table. And then I've got compensation master and they're already linked and you can see this here. So I want to rerun that tabulate and I want to show you something. So you can see that these column names are shorter now. So I want to show what we added in JMP 14. If I right click and bring up the column info dialog, I can see here that it says use linked column names right here. And that sets that these these names will be shorter And that's really a nice feature because when, at the end of the day, when you share this report with someone, they don't really care where the columns are coming from, whether they're in the main table or virtual table. So it's a nice, clean report for you to have. The script is saved so that you can see in the script that it's... it saves the script that shows you a referencing table. So if I look at this, I can see. So you would know where this column is coming from but somebody you're sharing with doesn't necessarily need to know. 
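In JMP this is all driven by the Link ID and Link Reference column properties (or the equivalent JSL), but as a loose analogy for readers who think in code, here is a small pandas sketch of the "joining without joining" idea: the tables stay separate, and columns from the source table are resolved through the key only when an analysis needs them. The table and column names here are made up for illustration.

# Rough pandas analogy (not JMP) of a virtual join: look up source-table columns
# by the link key at analysis time instead of materializing a merged copy.
import pandas as pd

employees = pd.DataFrame({                 # source table: one row per employee (link ID)
    "employee_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "length_of_service": [4, 11, 7],
}).set_index("employee_id")

compensation = pd.DataFrame({              # referencing table: many rows per employee
    "employee_id": [1, 1, 2, 3, 3, 3],
    "compensation_type": ["Base", "Bonus", "Base", "Base", "Bonus", "Equity"],
    "amount": [70, 5, 85, 60, 4, 10],
})

# "Virtual" column: resolved through the key whenever it's needed, no merged table kept.
gender = compensation["employee_id"].map(employees["gender"])
report = compensation.assign(gender=gender).groupby(
    ["compensation_type", "gender"])["amount"].mean()
print(report)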
So I want to show you this other thing that that that we added with this dispatching of row states. Real quick example, I'm going to run this distribution. And you notice right away that in this distribution, I've got a title that says these numbers are wrong. And so let me point out what I'm talking about. Employee scores is my employee database table and it has about 3,600 employees. This is a unique reference to employees and it's a current employee database, let's say. My compensation master table is more like a history table and it has 12,000 rows in it, so it has potentially in this table, multiple references to the same employee, let's say, an employee changed jobs and they got a raise, or they moved around. Or it could have employees that are no longer in the company. So running this report from this table doesn't render the information that I really want. I can see down here that my count is off, got bigger counts, I don't exactly have what I was looking for. So this is one of the reasons why we created this row states synchronization and Kelci is going to talk a little bit more about this in a real life application, too. But I'm just simply going to show you this is how you would set up dispatching row states. So what I'm doing is I'm just batching, selection color marker. And what I'm doing is I'm actually sending from compensation master to employee scores, I'm sending the information to this table because (I'm sorry), this is the table that I want my information to be run from. So if I go back and I rerun that distribution, I now have this distribution (it's a little bit different columns), but I have this distribution. And if I look at the numbers right here, I have the exact numbers of my employee database. And that's exactly what I wanted to see. So you need to be careful with dispatching and accepting and Kelci will speak more to that. But that was just a simple case example of how you would do that. And I will show you real quickly, that there is a Online Help link that shows an example of virtually joining columns and showing row states. It'll step you through that. There's some other examples out here too of using virtual join. If you need more information about setting this up. And again, just to remind you, all of this is scriptable. So you can script this right here, by setting up your row states and the different things that you want with that. So as we moved into JMP 15 we added a couple more things. And so what we added was we we added the ability to auto open a table and also to hover over columns and figure out where they're coming from. And I'll explain what that what that means exactly. So if I click on these. We created some new tables for JMP 15, employeemaster.jmp, which is still part of this HR data. And so if I track this down a little bit and look, a couple things I'll point out about this table. It has a link ID and a link reference. And that was the other thing we added to to JMP 15, the ability to be able to have a link ID and link reference on the same column. So if I look at this and I go and look at my home window here, I can see that there's two more tables that are open. They were opened automatically for me. And so I'm going to open these up because I kind of want to string them out so you can see how this works. But this employee master table is linked to a...stack them on top of each other...it's linked to the education history table, which has been, in turn, linked to my predicted termination table. 
And you can see there's an employee ID that has a link reference and the link ID, employee ID here. Same thing, and then predict determination has an ID only. And if you had another table or two that had employee ID unique data and you needed to pull it in, you could continue the string on by assigning a link reference here and you can keep on keep on going. So I'm...just to show you quickly, if I right click and look at this column here, I can see that my link ID is set, I can also see my link reference is set. And it tells me education history is a table that this table is linked to. I've got it on auto open and I've got on the shorter names. I'm not dispatching row states, so nothing is set there. So all of the columns that are in these other two tables are available to me, for my referencing table here called employee master. And real quickly, you can see that you have a large number of columns in here that are available to you, and the link columns show up as grouped columns down here. So another question that got asked from customers, as they say, is there any way you can tell us where these columns come from so that is a little clearer? So we added this nice little hover tip. If I hover over this, this tells me that this particular column disability flag is coming from predicted termination. So it's actually coming from the table that's last in my series. And if I go down here and I click on one of these, it says the degree program code is coming from education history. So that's, that's a nice little feature that will kind of help you as you're picking out your columns, maybe in what you're trying to run with platforms and so forth. But if I run this distribution, this is just a simple distribution example that's showing that employee level is actually coming from my employee master table. This degree description is coming from education history table and this performance eval is coming from my predictive termination table. And then you can look some more with some of these other examples that are in here. I did build a context window of dashboards here that shows a Graph Builder showing a box plot. We have a distribution in here, a tabulate and a heat map, using all virtual columns, some, you know, some columns that are from the table, but also virtual columns got a filter. So if I want to look at females and look at professionals. I always like to point out the the oddities here. So if I go in here and look at these two little places that are kind of hanging out here. This is very interesting to me because comp ratios shows how people are paid. Basically, whether they're paid in in the right ratio or not it for their job description. And it looks like these two outliers are consistently exceeding expectations, that looks like they're maybe underpaid. So just like this one up here is all by itself and it looks like they seldom meet their expectations, but they may be slightly overpaid and or they could be mistakes. But at any rate, as you zero in on those, you can also see that the selections are being made here. So, in this heat map, I can tell that there is some performance money that's being spent and training dollars. so maybe train that person. So that's actually good good good to see So that is about all I wanted to show. I did want to show this one thing, just to remind, just to reiterate. Education history has access to the columns that are in predicted termination. And so those two tables can talk to each other separately. 
And if I run this graph script, I have similar performance and training dollars, but I'm looking at like grade point average, class rank, as to where people fall into the limits here using combinations of columns from just those two tables. So I'm going to pass this on. I believe that was the majority of what I wanted to share. I'm going to stop sharing my screen. And I will pass this back to Kelci and she will take it from here. Kelci J. Miclaus Thanks, Mandy. Mandy said we've given this talk now a couple times and, really it was this combined effort of me working in my group, which is life sciences for the JMP Clinical and JMP Genomics vertical solutions, and finding such perfect examples of where I could really leverage virtual joins and working closely with the development team on how those features were released in the last few versions of JMP. And so for this section I will go through some of the examples, specific to our clinical research and how we've really leveraged this talking table idea around row state synchronization. So as as Mandy mentioned this is now, and if we have time towards the end, this, this idea of virtual joins with row state synchronization is now the entire architecture that drives how JMP Clinical reports and reviews are used for assessing early efficacy and safety and clinical trials reports with our customers. And one of the reasons it fits so well is because of the formatting of typical clinical trial data. So the data example that I'm going to use for all of the examples I have around row state synchronization or row state propagation as I sometimes call it, are example data from a clinical trial that has about 900 patients. It was a real clinical trial carried out about 20-30 years ago looking at subarachnoid hemorrhage and treatment of nicardipine on these patients. The great thing about clinical data is we work with very standard normalized data structures, meaning that each component of a clinical trial is collected, similar to the HR data that Mandy showed...show...showed us is normalized, so that each table has its own content and we can track that separately, but then use virtual joins to create comprehensive stories. So the three data sets I'll walk through are this demography table which has about a little under 900 patients of clinical trials, where here we have one row per patient in our clinical trial. And this is called the demography, that will have information about their birth, age, sex, race, what treatment they were given, any certain flags of occurrences that happened to them during the clinical trial. Similarly, we can have separate tables. So in a clinical trial, they're typically collecting at each visit what adverse events have happened to a patient while on on a new drug or study. And so this is a table that has about 5,000 records. We still have this unique subject identifier, but we have duplications, of course. So this records every event or adverse event that was reported for each of the patients in our clinical trial. And finally I'll also use a laboratory data set or labs data set, which also follows the similar type of record stacked format that we saw on the adverse events. Here we're thinking of the regular visits, where they take several laboratory measurements and we can track those across the course of the clinical trial to look for abnormalities and things like that. So these three tables are very a standard normalized format of what's called the international CDISC standard for clinical trial data. 
And it suits us so well towards using the virtual join. Aas Mandy has said, it is easy to, you know, create a merge table of labs. But here we have 6,000 records of labs and merging in our demography, it would cause a duplication of all of their single instances of their demographic descriptions. And so we want to set up a virtual join with this, which we can do really easily. If we create in our demography table, we're going to set up unique subject identifier as our link ID. And then very quickly, because we typically would want to look at laboratory results and use something like the treatment group they are on to see if there's differences in the laboratories, we can now reference that data and create visualizations or reports that will actually assess and look at treatment group differences in our laboratory results. And so we didn't have to make the merge. We just gained access to these...this planned arm column from our demography table through that simple two-step setting up the column properties of a virtual join. It's also very easy to then look at like lab abnormalities. So here's a plot by each of the different arms or treatment groups who had abnormally high lab tests across visits in a clinical trial. We might also want to do this same type of analysis with our adverse event, which we would also want to see if there's different occurrences in the adverse events between those treatment groups. So once again we can also link this table to our referenced demography and very quickly create counts of the distribution of adverse events that occur separately for, say, a nicardipine, the active treatment, versus a placebo. So now we want them to really talk. And so the next two examples that I want to show with these data are the row state synchronization options we have. So you quickly saw from Mandy's portion that she showed that on the column properties we have the ability to synchronize row states now between tables. Which is really why our talk is called talking tables, because that's the way they can communicate now. And you can either dispatch row states, meaning the table that you're set up the reference to some link ID can send information from that table back to its reference ID table. And I'll walk through a quick example, but as mentioned...as Mandy mentioned, this is by far the more dangerous case sometimes because it's very easy to hit times when you might get inconclusive results, but I'm going to show a case where it works and where it's useful. As you've noticed, just from this analysis, say with the adverse events, it was very easy as the table that we set up a link reference to (the ID table) to gain access to the columns and look at the differences of the treatment groups in this table. There's not really anything that goes the other way though. As Mandy had said, you wouldn't want to use this new join table to look at a distribution of, say, that treatment group, because what you actually have here is numbers that don't match. It looks like there's 5,000 subjects when really, if you go back to our demography table, we have less than 900. So here's that true distribution of about the 900 subjects by treatment group with all their other distributions. Now, there is the time, though, that this table is what you want to use as your analysis table or the goal of where you're going to create an analysis. And you want to gain information from those tables that are virtually linked to it. The laboratory, for example, and the adverse events. 
So here we're going to actually use this table to create a visualization that will annotate these subjects in this table with anyone who had an abnormal lab test or a serious adverse event. And now I've cheated, because I've prepared this data. You'll notice in my adverse events data I've already done the analysis to find any case of subjects that were...any adverse events that were considered serious and I've used the row state marker to annotate those records that had...were a serious adverse event. Similarly, in the labs data set, I've used red color to annotate...annotate any of the lab results that were abnormally high. So for example, we can see all of those that had high abnormalities. I've colored red most of this through, just row state selection and then controlling the row states. So with this data where I have these two row states in place, we can go back to our demography table and create a view that is a distribution by site of the ages of our patients in a clinical trial. And now if we go back to each of the linked tables, we can control bringing in this annotated information with row state synchronization. So we're going to change this option here from row states with reference table to none, to actually to dispatch and in this case I want to be careful. The only thing I want this table to tell that link reference table is a marker set. I'm going to click Apply And you'll notice automatically my visualization that I created off that demography table now has the markers of any subjects who had experienced an adverse event from that other table. We can do the same now with labs. Choose to dispatch. In this case, we only want to dispatch color. And now, just by controlling column properties, we're at a place where we have a visualization or an analysis built off our demography table that has gained access to the information from these virtually joined tables using the dispatch row state synchronization or propagation. So that's really cool. I think it's a really powerful feature. But there are a lot of gotchas and things you should be careful with with the dispatch option. Namely the entire way virtual joins work is the link ID table, the date...the data table you set up, in this case demography, is one row per ID and you're using that to merge or virtually join into a data table that has many copies of that usage ID. So we're making a one-to-many; that's fine. Dispatch makes a many-to-one conversation. So in in the document we have an ...in the resource provided with this video, there's a lot of commentary about carefully using this. It shouldn't be something that's highly interactive. If you then decide to change row states, it can be very easy for this to get confusing or nonsensical, that, say if I've marked both with color and marker, it wouldn't know what to do because it was some rows might be saying, "Color this red," but the other linked table might be saying color it blue or black. So you have to be very careful about not mixing and matching and not being too interactive with with that many-to-one merge idea. But in this example, this was a really, really valuable tool that would have required quite a lot of data manipulation to get to this point. So I'm going to close down these examples of the dispatch virtual join example and move on to likely what's going to be more commonly used is the accept... acceptance row state of the virtual join talking tables. And for this case, I'm actually going to go through this with a script. 
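To make the dispatch idea concrete in code terms, here is a hedged pandas sketch of the effect being described: subject-level flags derived from the many-row adverse event and lab tables land on the one-row-per-subject demography table, which then drives the plot. The file names and the AESER/LBNRIND column names are assumptions in the spirit of CDISC-style data and are not taken from the talk's tables.

# Sketch (pandas, not JMP row states) of dispatching marker/color information from
# linked record-level tables back to the one-row-per-subject demography table.
import pandas as pd

dm = pd.read_csv("dm.csv")       # hypothetical: one row per USUBJID
ae = pd.read_csv("ae.csv")       # hypothetical: many AE records per USUBJID
lb = pd.read_csv("lb.csv")       # hypothetical: many lab records per USUBJID

serious_subjects = set(ae.loc[ae["AESER"] == "Y", "USUBJID"])
abnormal_subjects = set(lb.loc[lb["LBNRIND"] == "HIGH", "USUBJID"])

# These subject-level flags play the role of the dispatched marker and color row
# states, so a plot built from the demography table can show them directly.
dm["serious_ae_marker"] = dm["USUBJID"].isin(serious_subjects)
dm["abnormal_lab_color"] = dm["USUBJID"].isin(abnormal_subjects)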
So instead of interactively walking me through the virtual join and row state column properties, we're going to look at this scripting results of that. And the example here, what we wanted to do, is be able to use these three tables (again, the demography, adverse events and laboratory data in a clinical trial) to really create what they call a comprehensive safety profile. And this is really the justification and rationale of our use in JMP Clinical for our customers. This idea that we want to be able to take these data sets, keep them separate but allow them to be used in a comprehensive single analysis so they don't feel separate. So with this example, we want to be able to open up our demography and set it up as a link ID. So this is similar to what I just did interactively that will create the demographic table and create the link ID column property on unique subject identifier. So we're done there. You see the key there that shows that that's now the link ID property. We then want to open up the labs data set. And we're going to set a property on the unique subject identifier in that table to use the link reference to the demography table. And a couple of the options and the options here. We want to show that that property of using shorter names. Use the linked column name to shorten the name of our columns coming from the demography table into the labs table. And here we want to set up row state synchronization as an acceptance of select, exclude and hide. And we're going to do this also for the AE table. So I'll run both of these next snippets of code, which will open up my AE and my lab table. And now you'll see that instead of that dispatch the properties here are said to set to accept with these select, exclude and hide. And similarly the adverse events table has the exact same acceptance. So in this case now, instead of this dispatch, which we were very careful to only dispatch one type of row state from one table and another from another table back to our link ID reference table. Here we're going to let our link ID reference table demography broadcast what happens to it to the other two tables And that's what accept does. So it's going to accept row states from the demography table. And I've cheated a little bit that I actually just have a script attached to our demography table here that is really just setting up some of the visualizations that I've already shown that are scripts attached to each of the table in a single window. And so here we have what you could consider your safety profile. We have distributions of the patient demographic information. So this is sourced from the demography table. You see the correct numbers of the counts of the 443 patients on placebo versus the 427 on nicardipine.
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Blending and Cleanup
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Characterizing Bio-processes With Augmented Full Quadratic Models and FWB+AV (2020-US-45MP-548)
Monday, October 12, 2020
Philip Ramsey, Senior Data Scientist and Statistical Consultant/Professor, North Haven Group and University of New Hampshire Tiffany D. Rau, Ph.D., Owner and Chief Consultant, Rau Consulting, LLC Quality by Design (QbD) is a design and development strategy where one designs quality into the product from the beginning instead of attempting to test-in quality after the fact. QbD initiatives are primarily associated with bio-pharmaceuticals, but contain concepts that are universal and applicable to many industries. A key element of QbD for bio-process development is that processes must be fully characterized and optimized to ensure consistent high quality manufacturing and products for patients. Characterization is typically accomplished by using response surface type experimental designs combined with the full quadratic model (FQM) as a basis for building predictive models. Since its publication by Box (1950) the FQM is commonly used for process characterization and optimization. As a second order approximation to an unknown response surface, the FQM is adequate for optimization. Cornell and Montgomery (1996) showed that the FQM is generally inadequate for characterization of the entire design space, as QbD requires, given the inherent nonlinear behavior of biological systems. They proposed augmenting the FQM with higher order interaction terms to better approximate the full design regions. Unfortunately, the number of additional terms is large and often not estimable by traditional regression methods. We show that the fractionally weighted bootstrapping method of Gotwalt and Ramsey (2017) allows the estimation of these fully augmented FQMs. Using two bio-process development case studies we demonstrate that the augmented FQM models substantially outperform the traditional FQM in characterizing the full design space. The use of augmented FQMs and FWB will be thoroughly demonstrated using JMP Pro 15. Auto-generated transcript... Speaker Transcript Tiffany First, thanks for joining us today. We're going to be talking about characterizing different bio processing...seeing...and really focusing on pDNA, seeing how fractionally weighted bootstrapping can really add to your processes. So Phil will be joining me at the second half of the presentation to go through the JMP example, as well as to give some different techniques. to be used. I'm going to be talking about the CMC strategy, biotech, how do we get a drug to market and how can we use new tools like FWB in order to deliver our processes. So let's get started. So the chemistry manufacturing control journey. So it's a very long journey and we'll be discussing that. Why DOE, why predictive modeling? It is very complex to get a drug to market. It's not just about the experiments, but it's also about the clinic. and having everything go together. So we'll look at systems thinking approaches as well. And then, of course, characterizing that bioprocessing and then the case study that Phil will discuss. So what does the CMC pathway look like? This is a general example and we go from toxicology. So that's to see does doesn't have any efficacy. Does it work. in a nonhuman trial. All the way up to commercial manufacturing and there's a lot of things that go into this, for example, process development. But you also have to have a target product profile. What do you want the medication that you're developing to actually do? And it's important to understand that as you're going through. And then of course the quality target product profile as well. 
This is...what are the aspects that are necessary in order for the molecule to work as prescribed? And then we look at critical quality attributes and then go through the process so it's...it's a...it's a group because of the fact that you have your process development, process characterization, process qualification and the BLA. There's a huge amount of work that goes into each one of these steps. And we also want to make sure that as you're going through your different phases that you're actually building these data groupings. Because when you get to process characterization and process qualification, you want to make sure that you can leverage as much of your past information as you can, so that you've actually shorten your timelines. You might say, "Tiffany, I do process characterization, all the way through the process." And I'll say, "Absolutely." But the process characterization that specific for the CMC pathway is what we need from a regulatory point of view. So everyone has probably heard in the news, you know, vaccines and cell and gene therapies. It's a very hot subject right now, and it's also bringing new treatments to patients that we've never been able to treat before. And so in the two big groupings of cell therapies and gene therapies, there's different aspects for it. Right. So we have immunotherapies. We have stem cells. We're doing regenerative medicine. So just imagine, you know, having damage in your back and being able to regenerate. There's huge emphasis on this grouping. But there's also a huge emphasis on gene therapies. Viral vectors. Do we use bacterial vectors? How do we get the DNA into the system in order to be a treatment for the patients? Well, plasma DNA is one of those aspects and Phil has an amazing case study where they did an optimization. So you might say, well, "What is pDNA. And why is it important, other than, okay, it's part of the gene therapy space, which is very interesting right now?" Well, the fact is, is that it can be used for...in preventative vaccines, immunization agents, for...Prepare...preparation of hyper immune globulin cancer vaccines, therapeutic vaccines, doing gene replacements. Or maybe you have a child, right, that has a rare gene mutation. Can we go in and make those repairs? All these things are around this and as the gene therapy technology continues to grow, the regulations continue to increase as you move through the pathway closer to commercialization and the amount of data also increases. Just imagine gene therapies and cell therapies are where we were 20 plus years ago with the ??? cell culture and monoclonal antibodies It's an amazing world where we're learning new things every day and we go, "Oh, this isn't platform." We need new equipment. We need new ways of working. We need to be able to analyze data sets that are very, very small because in cell therapies and gene therapies, the number of patients are typically smaller than in other indications. So what's next for pDNA? Well, of course, as the cell therapy and gene therapy market continues to grow, We're going to continue going on this pathway into commercialization. We need to be able to work with the FDA and work with them, hand in hand, because these are things that we've never done before. We're using raw materials that we don't use and other ...other indications for medication. So there's a lot of things to be done. 
It's also critical to be able to make these products like the pDNA so the way that we get the vector in our appropriate volume, but also quality aspect. So if you have the best medication in the world, but you're not able to make it, then you don't have the medication. Right? You can't deliver it to the patients. So we also need to make sure that our process is well characterized. And as I mentioned earlier, many of these indications are very small. So the clinical trials are also small. And at the same time the patients are often very, very sick. So being able to analyze our data and also respond to their needs very quickly is very key. Both in the clinical aspect as well as when we become commercialized. We don't want to have this situation where, guess what, I can't make the drug, right? I want to be able to make it. Also manufacturing is a very important thing. So don't know if you've noticed in the news, there's been a lot of announcements of expansions. Of course, people are expanding capacity for vaccines, but also one of the big moves is pDNA. People are spending millions of dollars, sometimes billions of dollars in increasing those manufacturing sites. And you might say, well, okay, you increase your manufacturing site. That's great. But now I need to be able to tech transfer into that manufacturing site. I need to make sure my process is robust...robust and it not only can be transferred and scaled up but making sure that I have the statistical power to say I know that my process is in control. I might have a 2% variability but I always have a 2% variability and I have it characterized, for instance. And as more and more capacity comes online and as we also have shortages, it's like, where do I bring my product and taking those into consideration, so designing for manufacturing earlier. And you could have multiple products in your pipeline. So you want to make sure that you're learning and able to go and grab that information and say, let me do some predictive modeling on this, it might not be the exact product, but it has similar attributes. So with that, the path to commercialization is very integrated, just like the CMC strategy takes the clinical aspect, everything comes together in order to progress a molecule through. We also have to think about the systems aspect of it. Why? Because if we do something in the upstream space we might increase productivity to 200%, let's say, which we be going, "yes I've made my milestone. I can deliver to my patient." But if my downstream or my cell recovery can't actually recover the product, whether that is a cell or a protein therapeutic for instance, then we don't have a product. All of that work is somewhat thrown out the door. So having the systems approach, making sure you involve all the different groups from business, supply chain, QC, discovering...everyone has knowledge that they bring to the table in order to deliver to the patient in the end, which is very key. So I'm going to hand it over to Phil now. I would have loved to have spoken a lot more about how we developed drugs, but let's...let's see how we can analyze some of our data. So, Phil, I'll hand it over to you now. Philip Ramsey Okay, so thank you, Tiffany, for that discussion to set the stage for what is going to be a case study. I'm going to spend most of the time in JMP demonstrating some of the important tools that exist in JMP. 
You may not even know they are there, yet they are very important to process development, especially in the context of the CMC pathway (chemistry, manufacturing, and controls) and quality by design. Two things characterize process development in general: you want to design a process, but you also need to characterize it — in fact, you have to characterize the entire operating region — and of course you want to optimize so that production is highly desirable. What we often don't say clearly enough is that these activities are inherently about prediction. We have to build powerful predictive models that let us predict future performance; that is a very important part of late-stage development for the regulatory agencies, because you have to demonstrate that you can reliably produce the product. A key paper on this issue of process characterization and prediction is the very famous one by George Box and his collaborator Wilson, an engineer, which contains the beginnings of what people know today as response surface methodology. Central to their work is something they called the full quadratic model: the model that contains the main effects, all two-way interactions, and the quadratic effects. It is still probably the gold standard for building process models, especially for production, and it is good for optimization: it is a good second-order approximation to an unknown response function. What is not as well understood is that, over the entire design region, it is often a poor approximation to the response surface. In 1996 the late John Cornell and, as many people know, Doug Montgomery published a paper that is really underappreciated. In that paper they pointed out that the full quadratic model is often inadequate to characterize a design space. Think about it from the viewpoint of a scientist and how dynamic these biochemical processes often are: there is a great deal of nonlinearity, which leads to response surfaces with pronounced compound curvature in different regions, and the full quadratic model simply can't deal with it. What they proposed was augmenting the model with terms like quadratic-by-linear, linear-by-quadratic, and even quadratic-by-quadratic interactions. It turns out these models do approximate design regions better than full quadratic models, and I'm going to demonstrate that in a moment. But there were problems. Number one, traditional statisticians didn't like the approach — that is changing dramatically these days. And there are so many terms that can be added that even a big central composite design becomes supersaturated, meaning there are more unknown parameters p than there are observations with which to fit the model. That is no longer really a constraint in the era of machine learning and new techniques for predictive modeling. So we're going to use something called fractionally weighted bootstrapping, which can be done in JMP Pro, together with model averaging, to build models that predict the response surface, and I am actually going to use these large augmented models.
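To make the two model forms concrete, here is the distinction in notation (this summary is added for clarity; the symbols are not from the talk). The full quadratic model in k factors is
\[ y = \beta_0 + \sum_{i=1}^{k}\beta_i x_i + \sum_{i<j}\beta_{ij} x_i x_j + \sum_{i=1}^{k}\beta_{ii} x_i^2 + \varepsilon , \]
and the Cornell-Montgomery style augmentation adds higher-order cross terms of the form
\[ x_i^2 x_j \ (\text{quadratic} \times \text{linear}), \qquad x_i x_j^2 \ (\text{linear} \times \text{quadratic}), \qquad x_i^2 x_j^2 \ (\text{quadratic} \times \text{quadratic}), \]
which is why a 15-run definitive screening design can end up with far more candidate terms than runs.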
When you try to build these predictive models, say for quality by design, there are a number of things you have to be aware of. One: also in 1996, one of the pioneers of machine learning, the late Leo Breiman, wrote a paper that again is not nearly as appreciated as it should be. He pointed out that all of the model-building algorithms we use for prediction — forward selection, all-possible-models, best subsets, the lasso — are inherently unstable. Unstable means that small perturbations in the data can result in wildly varying models. He did some work to demonstrate this and suggested a strategy: if you could somehow simulate the model fitting, perturb the data on each simulation run, fit a large number of models, and then average them, that should help, and he showed the idea had potential. He didn't have many tools in that era to do it, but today, in JMP Pro, we do, and I'm going to show you that Breiman's idea is a very good one. It is now commonly accepted in machine learning and deep learning in one form or another: the idea of ensemble modeling and model averaging. By the way, years ago John Sall built a form of model averaging into the stepwise platform of JMP. It's a hidden gem — it works nicely and is available in both versions of JMP — but I'm going to offer a more comprehensive solution that can be done in JMP Pro. That solution is fractionally weighted bootstrapping with autovalidation, and I'm going to explain what that means. When we build predictive models we have a challenge: we need a training set to fit the model, and then an additional validation set to test how well it will predict. Designed experiments simply don't have those additional trials available; in fact, Breiman was stuck on exactly this point — there is no obvious way to generate a validation error. In 2017 at Discovery Summit Frankfurt, Chris Gotwalt, head of statistical research for JMP, and I presented what we called fractionally weighted bootstrapping with autovalidation. What does autovalidation mean? It means — and this will not seem intuitive — that we use the training set as the validation set as well. You might say that's crazy, it's the same data, but there is a secret sauce that makes it work. We take the original data, copy it, call the copy the autovalidation set, and then assign random fractional weights to the observations in a special way, such that the weighting drives anticorrelation between the training set and the autovalidation set. I'll illustrate this shortly. By the way, my PhD student Trent Lemkus, whom we have been supervising, has studied this method extensively in exhaustive simulations over the last year, and we will be publishing a paper showing that it yields superior results to classical approaches for building predictive models from DOE. Moving ahead to the case study: this is the pDNA that Tiffany mentioned. It's a really hot topic, and pDNA manufacturing is considered a big growth area — the biotech world expects perhaps 40% growth year over year because of all the new therapies coming online that will use it. And, as is very common in biotech, there is no existing data we can use to build predictive models, which leads us quite rightly to design of experiments. In this case we used a definitive screening design.
Definitive screening designs are wonderful inventions of Brad Jones of JMP and Chris Nachtsheim of the University of Minnesota. They are highly efficient, and I recommend them all the time to people in the biotech world, where time and resources for experimentation are limited. Here is a schematic of what the bioprocess looks like. We're going to focus on the fermentation step, although in practice, as Tiffany was alluding to, we would look at both the upstream and downstream parts of the process. The factors include pH, percent dissolved oxygen, and induction temperature — the temperature we set the fermentor to in order to get the genetically modified E. coli to start pumping out plasmids. And what are plasmids? They are non-chromosomal DNA, separate from the usual chromosomal DNA you would find in the bacteria, and they have a lot of uses in therapies, especially gene therapies. Our goal is to get the modified E. coli to produce as much pDNA as possible. So we ran the trial — this is an actual experiment — and because we were new to DSDs we also separately ran a much larger traditional central composite design. For today's work we're going to use the CCD as a validation set and fit models with autovalidation on the DSD, and we'll see how it goes. Now let me switch over to JMP and open a data table. Here is the DSD data; we're going to do all our modeling on this data set. And, by the way, I am going to fit a 40-predictor model to a 15-run design using machine learning techniques. You'll have to get your head around the fact that you can do these things — they are being done all the time in machine learning and deep learning. There are a couple of add-ins that make this easy to do; you do need JMP Pro. One of them, by Michael Anderson of JMP and available on the JMP Community, sets up the table for you. Notice that it took the original data and created an autovalidation copy, and, as I mentioned, it also created the weighting scheme. The weights are randomly generated, and as you'll see in a moment, in the simulation study we constantly change the weights on every run, which has the effect of generating thousands of modeling iterations. You'll also see, as Leo Breiman warned, that as you perturb the responses (we don't change the data structure), you get wild variation in the models. Let me illustrate this quickly. I go to Fit Model, tell JMP where the weights are stored, and use generalized regression, which I highly recommend for this. Because this is a quick demo I'm going to use forward selection, but the procedure — fractionally weighted bootstrapping with autovalidation — is very general; you can use it in many different prediction and modeling scenarios. So I fit one model. Then I come down to the table of parameter estimates, right-click, and select Simulate. I tell it how many simulation runs I want, and that on each run I want to swap out the weights — generate new weights. I'll do just 10 because this is a demo. There are the results: 10 models, and all of them are quite different. Again, in practice I would do thousands of these iterations.
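The talk doesn't spell out how the anticorrelated weights are constructed, so the following is purely an illustrative JSL sketch, assuming one common fractional-weight construction; it is not the add-in's actual code, and the real scheme may differ:

// Illustrative sketch only -- assumed weighting scheme.
n = 15;                              // runs in the original DSD
u = J( n, 1, Random Uniform() );     // one uniform draw per original run
wTrain = -Log( u );                  // fractional weights for the training copy
wValid = -Log( 1 - u );              // weights for the autovalidation copy; negatively correlated with wTrain by construction
weights = wTrain |/ wValid;          // stacked to match training rows followed by autovalidation rows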
Later I'll show you that we can take these coefficients and average them together. By the way, a zero means that term did not get into that model. Now let me show you another add-in — I'll close some of this to keep the screen uncluttered. This one we developed at Predictum, and it not only does the fractionally weighted bootstrapping but also does the model averaging. In other words, with the add-in I just showed you, if you want to do model averaging you're on your own; it's a lot of manual work. So I'm going to use the Predictum add-in. It creates the table, and then — just to find a model quickly to illustrate how the add-in works — I'll use a standard response surface model. We want to predict pDNA, and we're going to use generalized regression; again, as an illustration, forward selection. In practice I would do thousands of iterations, but I'm only going to do 10. Click Go. Philip Ramsey Again, I do apologize — this is really three talks conflated into one, but all the pieces fit together in the QbD framework. So I have a model, and these are averaged coefficients; again, I've only done 10 iterations. I save the prediction formula to the data table, keeping the screen as uncluttered as possible. There's my formula; the add-in did all the averaging for you, so you don't have to do it yourself. And here's a little trick you may not be aware of: this is a messy formula, especially if you want to deploy it to other data tables. In the Formula Editor there is a really neat function called Simplify. It simplifies the equation and makes it much more deployable to other data sets. Okay, so that was an illustration of the method; now let me show what happened when we went through the entire procedure. This is a data table in which the DSD and the CCD data are combined, and I've used a row state variable to exclude the DSD rows, because I want to focus on the performance of the models on the validation data — remember, the models are fit to the DSD only. Here is my 41-term model: the augmented full quadratic, fit with model averaging over thousands of iterations. For comparison, I repeated the same process for the much smaller 21-term full quadratic model. So how did we do in terms of prediction? Let me show you a couple of actual-by-predicted plots. And I must strongly emphasize that this is a true validation test: the CCD was run separately, with different batches of raw material, including a new batch of the E. coli strain; some of the fermenters were different; and the operators were completely different. For those of you who work in biotech, that is about as tough a prediction job as you're going to get. Again, the models were fit to the DSD. On the left is the 41-term augmented model, with an overall standard deviation of prediction error of about 67. On the right — again using model averaging, which helps improve performance — is the 21-term full quadratic model, with a prediction error of about 70. In fact, without model averaging, which is how most people fit a full quadratic model, performance would be significantly worse. Okay. So now I have the model.
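As a small illustration of the averaging step itself (the table and column names here are hypothetical, and this is not the Predictum add-in's code), the per-term average over the bootstrap iterations can be computed directly from the table that Simulate produces:

// Hypothetical names; zeros from iterations where a term was not selected stay in the average.
simTable = Data Table( "Parameter Estimates Simulate Results" );  // table produced by right-click > Simulate
betaFeed = Column( simTable, "Feed Rate" ) << Get Values;         // bootstrapped estimates for one term
avgFeed = Mean( betaFeed );                                       // model-averaged coefficient for that term
Show( avgFeed );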
What do I do with the model? Our goal is typically optimization and characterization. So let me open a profiler: I go to Profiler in the Graph menu and use my best model, the one built with the Predictum add-in. (By the way, if you're interested in this add-in, or even in beta testing it, just contact Predictum — send an email to Wayne@predictum.com and I'm sure he'd be more than happy to talk to you.) With the model loaded, I use desirability to find settings that maximize production. This is a major improvement over the production they were getting historically, and it gives us settings at which we should see improved performance. These settings were somewhat unintuitive, but that's usually the case in complex systems; things are never quite as intuitive as you think they are. There's something else that's really important, especially in late-stage development on the CMC pathway: regulators want you to assess the importance of the inputs — which inputs matter. So I assess variable importance (I won't get into the technical details). It shows that, in terms of variation in the response, feed rate is by far the most important — which was not necessarily intuitive to people — and second is percent dissolved oxygen. What does that tell you? Number one, you had better control these variables very well, or you're likely to see a lot of variation. Now, in this particular case I don't have critical-to-quality attributes — none were available — so what we have is a critical-to-business attribute, namely pDNA production. But there's more we can do in JMP to fully characterize the design space. All I did was an optimization, and optimization is not characterization. There's another wonderful tool in the profiler called the Simulator, which is just not used as much as it should be. What I've done is define distributions for the inputs — that is, I expect the inputs to vary. This is the kind of thing the FDA wants to know: what happens to the performance of your process as the inputs vary? There are no perfectly controlled processes, especially once you scale up. (By the way, while I think of it, from experience these more complex augmented full quadratic models scale up better than full quadratic models — another reason to fit them.) In the Simulator there's a nice tool called Simulation Experiment. It runs what we call a space-filling design, distributing points over the whole design region. I'm going to ask for 256 runs, and at each point it does 5,000 simulations and calculates a mean, a standard deviation, and an overall defect rate. This actually goes pretty quickly, and I'm just showing you what the output looks like. Since I've already done this, in the interest of time I'll open another data table with the results of the simulation study. I won't get into all the details, but I fit a model to the mean, a model to the standard deviation, and a model to the overall defect rate. The defect rates are low in some regions and relatively high in others, and these are Gaussian process models, which are commonly used with simulated data.
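To show the kind of summary the Simulation Experiment computes at each space-filling point, here is a conceptual JSL sketch; the stand-in prediction function, input noise levels, and spec limit below are made up for illustration and are not from the case study — the Profiler's Simulator does this for the real fitted model automatically:

// Conceptual sketch only, with made-up surface and limits.
predYield = Function( {feed, do2}, 50 + 8 * feed + 3 * do2 - 0.5 * feed ^ 2 );  // stand-in response surface
lsl = 55;                                  // hypothetical lower spec limit on pDNA yield
nsim = 5000;
sims = J( nsim, 1, predYield( 2 + Random Normal( 0, 0.1 ), 30 + Random Normal( 0, 2 ) ) );  // propagate input noise
Show( Mean( sims ), Std Dev( sims ), Mean( sims < lsl ) );  // mean, std dev, and defect rate at this design point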
So what can we do with these models and these simulation results? Again, characterization is important. Here's a quick idea: a three-dimensional scatterplot of feed rate and percent DO, because they are the really important factors, with the plotted points sized by defect rate — bigger spheres mean higher defect rates. If you look around the plot, you can see there are regions where we definitely do not want to operate. So we are characterizing our design space and finding safer regions to operate in. Of course, I could do the same thing for other variables, and in any case it would show other regions you really want to avoid. We could do more with this, but I think that makes the point. We can also go back to the profiler and re-optimize, but in a different way: this time I want to maximize mean pDNA and, as a dual response, minimize the overall defect rate, again using desirability. This takes a few minutes — these are very complex models we're optimizing. Notice that it comes up with high feed rate, high DO, close to neutral pH, and the induction setting. By the way, if you want to know what induction OD600 is, it's a measure of microbial mass; once you reach a certain mass (no one is quite sure what that is, which is why we do the experiment), you ramp up the temperature, which forces the engineered E. coli to start pumping out pDNA, or plasmids. That's why we call it the induction temperature. Notice that at these settings we are essentially guaranteed a low defect rate; the overall optimized response isn't as high, but remember, we also get a process that is less prone to generating defects. Okay, at this point I'll go quickly to the end. Everything is in the slides, which have all been uploaded to the JMP Community, and at the end there is an executive summary. Basically, what we're showing you is that process and product development using the CMC pathway (of which quality by design is a part) requires a holistic, integrated approach; a lot of systems thinking needs to go into it. Process design and development is inherently a prediction problem, and that is the domain of machine learning and deep learning. It is not what you might think; it's not business as usual for building models in statistics, especially for prediction. We've shown that fractionally weighted bootstrapping, autovalidation, and model averaging can generate very effective and accurate predictive models. And again, I want to emphasize that the more complex augmented models of Cornell and Montgomery are quite important: they really do scale better and they give you better characterization. With that, I thank you, and I will end my presentation.
Labels (10):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
0 attendees
0
0
I Have the Power!!: Power Calculation in Complex Modeling Situations (2020-US-30MP-544)
Monday, October 12, 2020
Caleb King, Research Statistician Developer, JMP Division, SAS Institute Inc. Invariably, any analyst who has been in the field long enough has heard the dreaded questions: “Is X-number of samples enough? How much data do I need for my experiment?” Ulterior motives aside, any investigation involving data must ultimately answer the question of “How many?” to avoid risking either insufficient data to detect a scientifically significant effect or having too much data, leading to a waste of valuable resources. This can become particularly difficult when the underlying model is complex (e.g. longitudinal designs with hard-to-change factors, time-to-event response with censoring, binary responses with non-uniform test levels, etc.). In this talk, we will show how you can wield the "power" of one-click simulation in JMP Pro to perform power calculations in complex modeling situations. We will illustrate this technique using relevant applications across a wide range of fields. Auto-generated transcript... Speaker Transcript Caleb King Hello, my name is Caleb King. I'm a research statistician developer here at JMP in the design of experiments group, and today I'll be talking to you about how you can use JMP to compute power calculations for complex modeling scenarios. As a brief recap, power is the probability of detecting a scientifically significant difference that you think exists in the population, given the amount of data you've sampled from that population. Most people who run a power calculation do it to determine the sample size for their study; there is, of course, a direct tie between the two — the more samples you have, the greater your chance of detecting that scientifically significant difference. Other factors play into that as well: the model you're using, the response distribution type, and the amount of noise and uncertainty present in the population. But for the most part, people use power as a metric to determine sample size. I'd say there are three stages of power calculation, and all of them are addressed in JMP, especially if you have JMP Pro, which is what I will be using here. The first stage covers the simpler modeling situations: under the DOE menu, under Design Diagnostics, we have the Sample Size and Power calculators. These cover a wide range of simple scenarios — testing one or two sample means, an ANOVA-type setting with multiple means, proportions, standard deviations. This is what most people think of when they think of power calculations. You go through and specify the noise, the error rates, any parameters, and the difference you're trying to detect; if you specify a target power, you can get the sample size, or, if you want to explore a bit more, you can leave both empty and get a power curve. Again, those are the simpler scenarios. The next stage, I would say, is anything that can be covered by a more general linear model. For that I close this out and go to the all-encompassing Custom Design menu. I'll put in my favorite number of effects, click Continue, leave everything at the defaults, and make the design. At that point I can do a power analysis based on the anticipated coefficients in the model.
In this case it might say that, with this particular design at 12 runs, I have roughly 80% power to detect this coefficient. If I were trying to detect something a bit smaller, I could change that value and apply the change — and, of course, I don't have as much power. If that's really what I'm looking for, I might need to make some changes; maybe I need to go back and increase the run size. So those are the two most common settings in which we might do a power calculation. But life isn't that simple. You might run into more complex settings: mixed-effects factors, a longitudinal study you have to compute power for, or settings where your response is no longer a normal random variable — count data, a binary response, or even a bounded zero-to-one, percentage-type response. So what can you do when the simple power calculators, and even the power analysis in the DOE menu, can't handle your problem? JMP Pro is here to help, with a tool we call one-click simulation. The idea is that we simulate data — a Monte Carlo simulation approach — to estimate the power for your particular settings. It's pretty straightforward; there may be a little bit of work up front depending on the modeling platform, but once you've got it down it's easy to do. I'll go ahead and say this is something I didn't even know JMP could do until I started working here, so I'm happy to share what I found with you. We'll start with a fairly simple extension of the standard linear model where we incorporate some mixed effects. We have a company that's looking to improve protein yield for its cell cultures — proteins, not protons. The continuous factors are temperature, time, and pH, and we also have some mixture factors: water and two growth factors. If we stopped there, we could probably still use the power calculator available in the Custom Design platform. Where we start to deviate is that we now introduce random-effect factors: we have three technicians — Jill, Bob, and Stan — who are representative of the entire population of technicians, and each will use at least one of three serum lots, which are again representative of all the serum lots they could use, so we treat both as random effects. We also have a random blocking effect, because the test will be conducted over two days. So let me show you how we can use one-click simulation in JMP Pro to compute power for this case. I click to open the design — this is the design I created; let me expand my window so you can see everything. This might represent what you typically have once you've created the design. At that point you could have clicked Simulate Responses to simulate some responses, but even if you didn't, it's still okay: a trick you can easily use is to create a new column (we won't bother renaming it), open a simple formula, go to the function list, click Random, then Random Normal, leave the defaults, and click Apply. Now we've got some random noise — some simulated response data. At this point I right-click, copy, and right-click, paste, to get my response column.
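The same placeholder-response step can also be scripted rather than built in the Formula Editor; as a small sketch (the table reference and column name here are just examples, not from the demo):

// JSL sketch: add a noise column so Fit Model can be launched before any real data exist.
dt = Current Data Table();
dt << New Column( "Protein Yield", Numeric, "Continuous",
	Formula( Random Normal() )   // any placeholder noise works; we only need the model structure
);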
Now, all I need is some sort of response, so simple random noise works fine here — we're not trying to analyze any data yet. What we want is to use the Fit Model platform to create a model that we'll then use to create the simulation formula. I've done a bit of the work ahead of time, so I've already created the model; just to show you how I did that, I'll go to Redo > Relaunch Analysis. Here you see my response, protein yield, my fixed effects, and my random effects — everything is set up in the standard way. Now, there's a lot going on in the report, but we don't need to pay attention to any of it; we are just interested in creating a model. To do that, we go to the red triangle menu and choose Save Columns. We need to be careful which column we select. If I select Prediction Formula, which you might be tempted to do, that's good, but it doesn't get us all the way there, as you'll see: if I go into that formula, it's the mean prediction formula — there's nothing about random effects in it — so it isn't the column I want; it isn't complete and doesn't contain everything I need. I go back to Save Columns and scroll down to Conditional Prediction Formula, and note from the hover help that it includes the random effect estimates, which is the one I want. (Now, you might have a case where you don't really want to compute power for the random effects, just for the mean model, in which case you could easily have gone back to the Custom Design platform and done it that way. Let's pretend we're interested in the random effects as well.) Now that we've saved the conditional prediction formula, we look at it, and here you can see we have the random effects. Next we need to do some tweaking to turn it into the simulation formula we want, so I double-click the formula, which puts me into the JMP Scripting Language view. First I'll make some changes to the main effects, picking values that I think are scientifically important: 0.5 for temperature, 0.1 for time, and, a little bit higher, 1.2 for pH. For the mixture factors I'm going to go even higher, since these might have larger coefficients: 85 for water, 90 for the first growth factor, and 50 for the second growth factor. Now for the random effects. You might be tempted to replace each random-effect estimate with something like a Random Normal. That kind of looks right, but not exactly, and here's why: the formula is evaluated row by row. The first time you come across the technician named Jill, you'll simulate a random value and get a value for that formula evaluation, but the next time you get to Jill — at, say, row six — it will simulate a different value, which defeats the purpose of a random effect. A random effect should hold the same value every time Jill appears. Otherwise it takes on the character of random error, which — I'll take this opportunity to add it here — is a value that we do want to change on every row. So how do we overcome this?
I'll tell you, because I actually ended up doing it the wrong way the first time I presented this — slightly embarrassing — and thankfully a coworker came along afterward and showed me the trick for entering the random effect properly. Here's the trick. At the top of the formula I type If Row() equals 1, and inside that If I create a variable — call it techJill — and assign it a random value. Then, wherever Jill's effect appears in the formula, I replace the Random Normal with techJill. What this does is: on the first row we simulate a random value and assign it to that variable; after the first row we don't simulate again, which means techJill keeps the value it was initially given and holds it every place we use it. We do the same for Bob. For Stan, things are a little easier: random effects should sum to zero in the model, so we make his effect the negative of the sum of the other two. We do the same thing for serum lot 1 — and here I'm going to give it a bit more noise, say there's more variability in the serum lots; that's the advantage of this approach, you get to play around with different scenarios — and likewise for the other lot. And while I'm at it, I'll add the day block the same way: day one and negative day one, with a somewhat smaller random effect. At this point we have our complete simulation formula; clicking OK takes me back to the Formula Editor view, and we should be good to go. (A JSL sketch of what this finished formula looks like follows this passage.) So what do we do next? We go back to our Fit Model report, to the area where we want to simulate for power. I go to the Fixed Effect Tests box and to the p-value column. The original noisy simulation didn't give us any p-values — that's okay, we don't care about that; we just needed that fit to generate the model, which we then turned into a simulation formula. I right-click in that column — remember, this only works if you have JMP Pro — and at the very bottom is Simulate. We click that, and it asks which column to switch out: by default it selects the response column, and then it finds the columns that contain simulation formulas. We want to switch in this one, because it contains our simulation model. I tell it how many samples — I'll do 100 — give it my favorite random seed, and click OK. After a second or two, there we are: it has generated a table where it simulated the response, refit the model, and reported back the p-values. There are some cases with no p-values — we ended up in a situation where the fit didn't fully converge — and that's okay, that happens in simulation, so long as we have a sufficient number of runs to get an estimate. Now, the nice thing is that JMP saw we were simulating p-values, so it bet we'd want to do a power analysis and has happily provided a script to do that. Thanks, JMP.
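Here is roughly what that finished simulation formula looks like in JSL. The structure — the If Row() == 1 trick, the sum-to-zero constraints, and the row-level error term — follows the description above, but the random-effect standard deviations, the Match() form of the categorical terms, and the omission of an intercept are simplifications of my own, not the demo's exact formula:

// Sketch of the simulation formula described above (values and names are illustrative).
If( Row() == 1,
	techJill = Random Normal( 0, 0.5 );      // drawn once, then reused on every row for Jill
	techBob = Random Normal( 0, 0.5 );
	techStan = -(techJill + techBob);        // technician effects sum to zero
	serumLot1 = Random Normal( 0, 1 );       // serum lots given more noise, as in the demo
	serumLot2 = Random Normal( 0, 1 );
	serumLot3 = -(serumLot1 + serumLot2);
	day1 = Random Normal( 0, 0.25 );
	day2 = -day1                             // day block effects sum to zero
);
0.5 * :Temperature + 0.1 * :Time + 1.2 * :pH
 + 85 * :Water + 90 * :Growth Factor 1 + 50 * :Growth Factor 2
 + Match( :Technician, "Jill", techJill, "Bob", techBob, "Stan", techStan )
 + Match( :Serum Lot, "Lot 1", serumLot1, "Lot 2", serumLot2, "Lot 3", serumLot3 )
 + Match( :Day, 1, day1, 2, day2 )
 + Random Normal( 0, 1 )                     // pure-error term, redrawn on every row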
We run that script, and it looks a lot like the Distribution platform: it has produced a distribution for each of those columns, plus a new table showing the simulated power. Because we simulated, we can read these off as estimated power — and since we did 100 runs, the rejection rate is the power estimate (if we had used some other number, you would read the rejection rate the same way). For our three mixture factors it looks like we have pretty good power, given everything we specified, to detect those particular coefficients. For the other three factors, things don't look as good. So we go back and ask: how large would those coefficients have to be before I could detect them? (The power numbers in that table are just rejection rates; a short JSL version of the calculation follows below.) I minimize these tables, come back to my formula, and try something different: temperature was 0.5 — what if it were higher, about 1? Let's also make time 1, and for pH I'm going to go to 3. So I'm bumping things up to ask, can I detect this? I'll keep everything else the same, because we know we can detect those. Click Apply, and we've generated a new formula. Then it's the same thing: right-click in the column you want to simulate, click Simulate, switch in the simulation column, give it the same number of samples and the same seed, and go. We just have to wait a few seconds for the simulation to finish, and then we run the power analysis again. The mixture factors look the same — we didn't change anything there — so in fact I'll hide those three and look at the others. We seem to have done better on pH, so a coefficient around that size may be near the upper end of what we can detect at this sample size, but for temperature and time it seems we still can't detect even those higher values. Okay, what else could we change? What if we double the number of samples? We are, after all, calculating this for a sample size. One way to do that is to go to DOE > Augment Design, select all our factors, select our response, and click OK; we'll just augment the design, and this time we'll double it to 24 runs. I make the design — it takes a little bit of time, so I actually started it a bit early — and then make the table. So now we've doubled the number of runs, but it only gave us responses for half of them. That's okay, since we just need a response: I'm going to copy and paste the existing values. Of course, in real life you wouldn't do that, because you'd hopefully have different responses, but again, we just need a noisy response. Go to Fit Model. This time we have to fix things a little: I select the three random-effect factors and, under Attributes, say they are random effects; keep everything else the same; click Run. Now, I notice I don't yet have my simulation formula in this new table, but rather than walk through and rebuild it, I can create a new column, go back to the old table, right-click and Copy Column Properties, come back, and right-click Paste Column Properties — and my formula is ready to go. So let's see what happens in this situation, keeping the coefficient values we started with. I go back and double-click to open the Fit Model window.
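The simulated power values in that table are nothing more than rejection rates over the simulated fits. As a sketch (the results-table and column names are hypothetical, and rows with missing p-values would need to be excluded first), the calculation for one term is:

// Hypothetical names; this is what the generated Power Analysis script is estimating.
simResults = Data Table( "Fixed Effect Tests Simulate Results" );
pTemp = Column( simResults, "Prob > F Temperature" ) << Get Values;
powerTemp = Mean( pTemp < 0.05 );   // fraction of the simulated fits that rejected at the 0.05 level
Show( powerTemp );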
I go to the Fixed Effect Tests, right-click on the p-value column, and choose Simulate. I'm not going to change the column selection this time, because there is only one simulation formula column and it found the right response, so I just set the same number of samples and seed and see what happens in this case. All right — run the power analysis. Again, I'm not going to worry about the mixture effects, because as you can see we just did even better than before, and they were already good, so I'll hide them so we can more easily see the ones we're interested in. pH we knew would probably do well, because even with the old 12 runs we had pretty good power. And it looks like we have definitely improved on temperature and time. So if those values represent the upper bound of the effect sizes we're interested in — or maybe a lower upper bound — this seems to indicate that doubling the sample size might help. These examples illustrate, first of all, how to do the one-click Simulate, and then how to use it for power calculation. It also encourages something I often did before I came to JMP, which is to give people options — explore your options. Doubling the sample size seemed to help with temperature and time; changing what you're looking for seemed to help with pH; and the mixture effects seem to be okay. Exploring your options can also include going back and changing the variances of your random-effect estimates. For example, I could come back here — I won't do it — and change those values to ask what happens if the technicians were a bit noisier, or the serum lots were less noisy. Try different situations so that your test plan is more robust to unforeseen settings. Okay, let me clean up and close all of these out. For the remainder of the scenarios, I'm going to explore different takes on how you can implement this, but the general approach is the same: you create your design, you simulate a response, you use Fit Model — or, in one case, a slightly different platform — to generate a model, and then you use that model to create a simulation formula, which you then use with the one-click Simulate approach. Now let's look at a case where a company is going to conduct a survey of its employees to determine which factors influence employee attrition. Maybe they have a lot of employees who are leaving, so they want to conduct a survey to assess which factors matter, and they want to know how many respondents they should plan for. The response is years at the company, but there are two little kinks. First, an employee has to have worked at least a month before leaving for it to be considered attrition. Second, the responses are given in years, but maybe we're more concerned about months — perhaps that's how our budgeting software works, and for employees it's easier to answer how many years they've been at the company rather than years and months. So in this case we have interval censoring: a response in years only tells us the employee was there between that many years and a year longer, and if they left before a year, the time is censored between a month and a year. So I open the data table — I've set up a lot already.
We've got a lot of factors here; scroll all the way to the end and you can see the responses we're looking at: Years Low and Years High. What this means is that if an employee responded that they left after six years, their actual time there, in months, is somewhere between six and seven years; if they left before a year, we know they were there somewhere between a month and a year. I'm going to click the dialog button here to launch the interval-censoring analysis. We'll use the Generalized Regression platform and assume a Weibull distribution for the response. We don't put a censoring code here, because with interval censoring the way you handle it is to put both response columns into the Y role, along with all the factors. When we click Run, JMP recognizes a time-to-event distribution and asks: you gave me two response columns — does that mean you're doing interval censoring? In this case, yes, we are. So now we go through the same steps and find the right red triangle — in this case, the one next to Weibull Maximum Likelihood. Here's a really nice thing about the Generalized Regression platform (there are already a lot of nice things about it, but this is more icing on top): if we did it like before, we'd have to go in, save the prediction formula, and make adjustments to make sure it's a random Weibull being simulated. But Generalized Regression is aware that you can do the one-click Simulate, so it asks, would you like me to save the simulation formula for you? Yes, we would, so we click Save Simulation Formula and go back to our table. You'll notice it simulated only one column — I'll talk a bit more about why in a moment — but let's check it real quick. Double-clicking to pull up the scripting language, you'll see it's already set up as a Random Weibull, with the transformation of the model already in there. All you would have to do at this point is change the parameter values to what is scientifically significant to you. For this demonstration I won't do that; I'm just going to leave them be. I will make one change, though, because I want to replicate the actual situation we'll be in. Notice these simulated values are all continuous, when in actuality what we should be getting are nice, round, whole-year numbers. These represent Years High, so the way I handle it is to create a simple column whose formula equals the continuous simulated time but returns the ceiling — rounding up, essentially. Apply, and there you have it: as you can see, this now simulates Years High in whole years. Now, to do the one-click simulation: I open the Effect Tests report, and if I right-click and choose Simulate, I can only enter one column at a time; I can't drag and select more than one. If I were simply to replace Years High with the Years High simulation, that looks okay — but the problem is Years Low. Years Low is being brought in because it was part of the original model, but it's the original Years Low you collected, and if we look back, we can already see an issue. Let me cancel out of this real quick and show you what would happen if we did that.
It wouldn't be able to fit — in this first row, for example, the simulated Years High is lower than the original Years Low, because Years Low is not tied to the simulated response. So how do we fix that? We need to make that connection. I go to Years Low and open its formula — there's already a formula there, and I just make a quick change: if the simulated Years High is 1, return 1/12 (one month); otherwise, return the simulated value minus one. Click OK and Apply, and as you can see it's proper now — it's tied to the simulation. Now I can go back, right-click, choose Simulate, replace Years High with its simulation formula, and be comfortable knowing that Years Low will be appropriate: it will always be one year lower, unless Years High is one year, in which case it's 1/12. It's tied to the simulation, so it will always be brought along correctly whenever the simulation runs. (A JSL sketch of these two formula edits follows this passage.) I'll run a quick simulation — it's going a bit slowly, which is actually a good sign — and let it finish. There are our simulations, and of course we can run the power analysis. In this case we have a lot of factors — I believe there were about 1,470 employees in the table — and for a lot of the factors we have overkill. But surprisingly, for some of them we still have issues, and that might be worth investigating: maybe we can't detect a coefficient that low, or maybe we'd have to change something about those factors — things to discuss in your planning meeting. So that's how you work things when you have interval censoring. If instead you had right censoring — that is, a censoring column — it's the same idea: the platform will output a simulation of the actual time, you make adjustments so it matches the type of time you see in your response (or expect to see), and then you tie your censoring column to the simulation. That's going to be necessary whenever you have that type of setting. Okay, let's clear all this out and look at one more scenario: what happens if we have a non-normal response? We've already seen one — a reliability-type response — so we know we can use Generalized Regression; let's explore another one real quick. In this case we have a non-normal response from a weapons platform test, where the response is a percentage. Technically, you could model this as a normal distribution, and that might be fine as long as you expect values around the 50% point, but because we want this to be a very accurate weapons platform, we hope to see responses closer to 100%, so something like a beta-distributed response might be more appropriate. We do have one other wrinkle: we have three factors of interest, but one of them, target, is nested within fuse type, so the target factor depends on the fuse type. Let's run this real quick. We've created our data; in this case I simulated some random data and made it fall between zero and one, simply by taking the logistic transformation of a random normal. Caleb King I'll copy and paste to make my response column, and again walk through the launch — pretty simple: we're going to use the beta response distribution, we have our response, we have target nested within fuse type, and we click Run. And again, from the red triangle menu, Save Columns, Save Simulation Formula.
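The two formula edits described above — rounding the simulated time up to whole years, and tying Years Low to the simulated Years High — look roughly like this in JSL (the column names are paraphrased from the demo, not taken from it verbatim):

// 1) Years High (simulated): report whole years by rounding the continuous simulated time up.
Ceiling( :Simulated Time )

// 2) Years Low: keep the censoring interval valid relative to the simulated Years High.
If( :Years High Simulated == 1, 1 / 12, :Years High Simulated - 1 )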
Saving a simulation formula directly is something you can do in Generalized Regression; the regular Fit Model platform unfortunately cannot do it. But we have our simulation formula. I'm not going to make any changes, but you could: as you can see when you double-click, the structure is already there, even the logistic transformation, so you just have to put in your model parameters. Then we go down to the Effect Tests, right-click, Simulate, make the substitution, and go. So you see how easy it is in general — even if you have non-normal responses, you're good to go, thanks to Generalized Regression. Okay. Now, what if you have longitudinal data? This can be tricky, simply because the responses may be correlated with one another. So how can we incorporate that? It's fairly straightforward. In this case we have a company producing a treatment for reducing cholesterol — let's say it's treatment A — and we're going to run a study comparing it to a competitor's treatment B, and for the sake of completeness we'll have control and placebo groups, with five subjects per group. The longitudinal aspect is that measurements are taken in the morning and afternoon, once a month, for three months. I'm not going to spend too much time on this, because I just want to show you how to incorporate the longitudinal aspect, so I've already created the model and the simulation formula; you can use this as a reference for how you might do it. Let's say we have an AR(1) model. I'll run this real quick just to show you: there are all the fixed effects — notice we have a lot of interactions; keep that in mind when the formula looks a bit messy — and we've stated that we have a repeated structure, so I've selected AR(1) with period-by-day within subject in the Mixed Model platform. So how did I incorporate that AR(1) into my simulation formula? Like this: if it's the first row, or a new patient (that is, the current patient does not equal the previous patient), use the model that I saved, with the parameter values changed to something of interest — it did take a bit of work because there's a lot going on with all those interactions, so I made a lot of the values zero just to keep things easy — plus some random noise at the end. If it's not the first row and not a new patient, how do we incorporate the correlation? All I do is copy that same model into this branch and add one term: some coefficient — I believe it has to be between minus one and one — times the previous entry, that is, a Lag of the simulation formula. If it were autoregressive of order two, you would add something like a lag-two term of the simulation formula and make another adjustment, so that the first row of a patient uses the plain model, the second row looks like an AR(1), and anything after that uses the AR(2) form. As you can see, it's easy to incorporate autocorrelation structures: as long as you know what your model looks like, it should be easy to implement as a simulation formula. (A compact sketch of an AR(1)-style formula follows below.) I'll let you look at that for a moment.
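Here is a compact JSL sketch of an AR(1)-style simulation formula in the spirit of the one just described. The mean of 200, the carry-over coefficient of 0.6, the noise level, and the column names are placeholders of my own; in the real formula the full fixed-effect model, with its interactions, would take the place of the simple mean:

// Placeholder values throughout; the branching structure is the point.
If( Row() == 1 | :Patient != Lag( :Patient, 1 ),
	200 + Random Normal( 0, 10 ),                                             // first record for a patient
	200 + 0.6 * (Lag( :Cholesterol Sim, 1 ) - 200) + Random Normal( 0, 10 )   // AR(1) carry-over from the previous record
)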
Finally, our last scenario is a pass/fail response, which is also very common. I'm going to use this one to illustrate how one-click Simulate might change people's minds about how they run certain types of designs — to show you how powerful this can be (no pun intended). Let's say we have a detection system we're developing to detect radioactive substances, and we're going to compare it to another system that's already out in the field. We've selected a certain amount of material in some test objects, ranging from a very low concentration of 1 up to a very high concentration of 5, and we're going to test each system repeatedly at each concentration, a certain number of times, and see how many times it successfully alarms. I'll open both tables. Let's start with this one, which represents a typical design you might see: a balanced number of samples at each setting — in this case a lot of samples; they're very fortunate at this place. Let's say we do 32 balanced trials at each run, and this is a simulated response. Here I've created my simulation formula, so I'll show you what it looks like: a Random Binomial. The trial counts are all the same, so I've kept the number here — I could have referenced the trials column to make it internally consistent, but that's okay — and here's the model that I might be interested in. In the other table I have a scenario where, instead of a balanced number at each setting, I've put most of my samples in the middle. My reasoning might be that at a low concentration I hardly expect the system to catch it — I have reasonable expectations — and at a high concentration it should almost always catch it, so where the difference matters most to me is in the middle, maybe at concentrations three and four. So that's where I load most of my samples, I put a few more next to them, and I put the fewest at the other settings. Let's see how each of these test plans performs in terms of power. I run the binomial model script, which fits the binomial model; there's only one model effect here, the system — we don't put concentration in, because we know there's an effect there; the system is what we're interested in. Generalized regression, binomial, run. Again, the red triangle menu — but I've already got my simulation formula, so I don't need to save one; you're already building up the pattern. Right-click, Simulate — everything looks good there — my next favorite random seed, and here we are: run the power analysis. Now let's go over to the other table and do the same thing. I'll fit the model — and again, when you have a binomial, you have to put in not only how many times it alarmed, but out of how many trials — run, scroll down to the effect tests (you can peek at the p-value to get a hint of what's going to happen), and Simulate. Here are my simulations; I get my power analyses, scoot them over here, and minimize the rest. Here's what you get under the balanced design: notice that we have very low power, which seems odd because we had 32 trials at each run. That's a lot of samples — I would have killed for that many samples where I previously worked — so you would expect a lot of power, but there doesn't seem to be any. Whereas here, with the same total number of samples just allocated differently, the power level has gone up dramatically. Maybe if I stacked even more in the middle and put only four at each of the other settings, I could get even more power to detect this difference.
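Stepping back for a moment, the simulation formula in a pass/fail case like this is just a Random Binomial wrapped around a logistic model. The coefficients below are made up, and the 32-trial count is the balanced plan's value; the real formula would use whatever effect sizes matter to you, and could reference the trials column instead of a constant:

// Illustrative only: number of alarms out of 32 trials at this row's settings.
p = 1 / (1 + Exp( -(-4 + 1.5 * :Concentration + 0.5 * (:System == "New")) ));  // logistic detection model
Random Binomial( 32, p )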
So this shows that it's not always about changing your sample size — you might not need more samples; in this case we had a lot of samples to begin with — but that how you allocate them is also important. I hope you're as excited as I was when I discovered this very useful tool for calculating power. I'd like to leave you with some key takeaways. First, we're using simulation. Ideally we'd like a formula, and in the simple cases we do get the advantage of a nice simple formula; even with the regression models we have formulas helping under the hood. But in the real world things are a little more complex, so we typically have to rely on simulation, which, as we've seen, can be a very powerful tool. One of the key things with simulation is balancing accuracy with efficiency. I usually ran 100 simulations here, mainly to save time, but ultimately you might stick with the default of 2,500, knowing that it will take some time to run. What I'd advocate is starting with 100 or 200 simulations to get an idea of what's going on, and then, if you find a situation that's worth more investigation, bumping up the number of simulations to increase your accuracy. So maybe you start with a couple of different scenarios and quick simulations, narrow down to some key settings, and then increase the number of simulations for more accuracy. I always argue that power calculation, just like design of experiments, is never one and done. You shouldn't just go to a calculator, plug in some numbers, and come back with a sample size. A lot can happen in an experiment, and I think the best way to plan one is to try to account for different scenarios. Explore different levels of noise in your response; for the mixed effects, play around with different random-effect sizes. Of course you can explore different sample sizes, but also explore different types of models: for example, in the interval-censoring case we used the Weibull model — what if we had used a lognormal? Exploring these different scenarios and presenting them to the test planners gives you a way to plan a study that is robust to a variety of settings. So never just calculate and come back; always present the test planners with different scenarios. It's the same process I used when I designed actual experiments: I would present the test planners I worked with different options they could explore, and maybe they'd pick one option, or a combination of options. You should always do that to make your plans more robust. All right, I hope you learned something new from this. If you have any questions you can reach out to me — my email address should be provided. I hope you enjoyed this talk, and I hope you enjoy the rest of the conference. Thank you.
Labels
(9)
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
0 attendees
0
0
What do you get when you combine a Marine Biologist, a Video Game, and JMP? (2020-US-45MP-543)
Monday, October 12, 2020
John Powell, Principal Software Developer, SAS/JMP Division Novel ideas often come from combining approaches used in totally different industries. The JMP Discovery Summit and the JMP User Community provide excellent ways to cross pollinate ideas from one industry to another. This talk originated from a marine biologist’s request on the JMP User Community to display many variables in a tight space. Techniques used in video games provide possible solutions. I’ll demonstrate how JMP’s customizable visualization capabilities rise to the occasion to make these potential solutions a reality. Another use of video game technology is the 3D scatterplot capability available in JMP. This approach requires powerful graphics capability commonly available on modern desktop computers. But what if you need to share these scatterplots on the web? The range of graphics power on devices people use to view web content varies greatly. So, we need to use techniques that work even on less powerful devices. Once again, the games industry offers a solution — particle systems! I’ll cover particle system basics and how to export 3D data to a 3D Scatterplot application I built using particle systems on the web. Auto-generated transcript... Speaker Transcript jopowe Hi, welcome to my talk. The motivation for this talk was based on a discussion on the JMP User Community. Dr. Anderson Mayfield is a marine biologist at NOAA and University of Miami. He studies health of reef corals. He presented at our special Earth Day episode of JMP On Air. And he's also presenting at this conference. So in this JMP User Community post, he posted this picture here. And if you could see, it has all these graphics that are representing corals or coral reefs and they're representing multiple pieces of data, all in one little graphic, which is kind of like a video game when you have player stats and they're hovering around the player as they run around the scene. So now you probably get an idea why I called my talk, you know, What Do You Get When Uou Combine a Video Game, a Marine Biologist and JMP? So let's move on. What I'm going to talk about is this game based, or game inspired solutions that are possible in JMP, including using custom graphics scripts, custom maps. And then I'm going to talk about a 3D web based scatter plot application that I built and how I integrated JMP data sets into that application. Here's an example of the graphic script. And here's an example of using custom maps. And here's my 3D scatter plot application. So let's get started. To see how custom graphics scripts are used, I've got a little sample here. And when I open up JMP and run this little graphic, and actually, I'll talk about this. I use this fictitious data set I created from scratch. And it's basically about a race of fish going all the way down from the Great Barrier Reef down to Sydney. So to get to a graphic scripts, what you can do is go to customize. And then this little plus button lets you add a new script. It starts off empty, but we've got a few samples in JMP. So I like the polygon script. It's very simple. All it is, is it draws a triangle. So if I apply that to my graph, you can see a triangle in there. And all it took us a three lines, setting the transparency, setting a fill color, and then drawing the polygon. But what we're after is, how do you actually embed this in a script? So the easiest way to see that is you hit the answer button and you save script to the script window. And here's the script. 
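What that saved script looks like is roughly the following. This is a hedged reconstruction, not the exact saved output: the frame name, coordinates, and column names are illustrative.

```
// Approximate shape of a Bivariate script with an embedded graphics script:
// the platform launch, then a Dispatch to the plot frame carrying Add Graphics Script
// with the three-line polygon sample.
Bivariate(
	Y( :latitude ),
	X( :longitude ),
	SendToReport(
		Dispatch( {}, "Bivar Plot", FrameBox,
			{Add Graphics Script(
				Transparency( 0.25 );
				Fill Color( "red" );
				Polygon( [150 152 154], [-20 -25 -20] )   // x coordinates, then y, in axis units
			)}
		)
	)
);
```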
It starts off, there's my little bivariate platform being launched with latitude and longitude. And down below, there's a dispatch that has an add graphic script. And then the three lines of the sample. So there's a real simple example of using a graphic script. Now this is a graphic script I'm going to embed. It draws these bars floating over each point. And what I do first in this application, I set the transparency also and then I draw a background of my little graphic. one for the hunger, one for the strength, and one for the speed. Now I'm going to show you the whole script, because it relies on other things. It relies on these constants that I set up mostly for doing colors and in the background and a few things like the background like. This is my function draw bar. It takes a position, a length and a color. So that can be used for each bar. So if I run that script. There it is, you've got the graphic with the little kind of game-inspired health monitors hovering above each fish as they swim down and race down the east Australian current to Sydney. This is going to give you an idea, basically a step by step of how that function works. We started with just a marker on screen, and for each marker, we set the pixel origin. And then we start to draw the background, but that takes a few steps. The first time we set the line thickness, you don't actually see anything. Then we set the color, which is going to be black. And then we do a move to and line to that draws the background and very thick line. And for each bar, we're going to set the color of the bar, move to the position and then do a line to. Then we draw the background, we set its color and it continues the bar. And of course we do that for the other two lines as well. Now you can use many different graphics functions. And here's the graphic functions section of the JSL syntax reference and you can build whatever you want. So go at it. Next, I'm going to talk about using custom maps. Excuse me. And to use custom maps, basically you provide two tables. The first table is a definition of all the shapes you want to use. And the second one names the shapes in the table. And for this name column, basically what you need to add is a map role column property, and it needs to be of the type shape name definition. I'll show you that in a minute. So just to do a real simple example, of course, what we need is Big Class. And I'm going to have a real simple. map for it as well that just has two rectangles. Let me get those files open. So it's very tiny map, as I said. And just to show you these coordinates, I've got a little script here that basically shows the points. If I highlight these top four, that's the top rectangle and the next four is the bottom rectangle. the top ones is weight, the bottom one is height. And if you look at the column property, it's map role, shape name definition. So once we have this map ready, we can open up Big Class. We can't reuse it directly. What we need to do first is take the height and weight columns and stack them. So we go to tables. Let me stack them. And now I've got a table that has these columns stacked. You see that the label has height, weight alternating, and what we need to do in order to hook that up to the map file is add another column property. This one is all the way down here, map role again. But we won't say shaped name definition, we'll say shaped name use and we need to choose the map file, the name map file. And also set the column to name. And then we're done. 
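The bar-drawing helper described above can be sketched as a small JSL function. The pixel offsets, track length, and bar widths here are illustrative guesses, not the presenter's values; it is meant to be called from a frame's graphics script after Pixel Origin has been set at each marker, so the pixel moves and lines draw relative to that point.

```
// Hedged sketch of a "draw bar" helper: position (which bar), length, and color.
drawBar = Function( {pos, len, color},
	{Default Local},
	Pen Size( 8 );
	Pen Color( "black" );             // dark background track
	Pixel Move To( 0, -10 * pos );
	Pixel Line To( 40, -10 * pos );
	Pen Size( 6 );
	Pen Color( color );               // filled portion, scaled by the statistic being shown
	Pixel Move To( 0, -10 * pos );
	Pixel Line To( len, -10 * pos );
);
```

One call per statistic (hunger, strength, speed) reproduces the little game-style monitors. Back to the custom map workflow: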
So now that we have that, I can actually use that in a graph. The graph we're going after is basically using the label column for the map shape, which sets up the top and bottom shapes, and then we can drag in that data column that has all the data into the color role. And there we have it, this is a summary of all of the students in Big Class, but we really want to get one per student. If I drag that to the wrap role, it's a little tight, but I'm going to spread it out by doing a levels per row and set that to 10. There we go. And that's pretty much what I was going after. And this can be useful graph on its own. So let's do something a little bit more complicated now. Well, actually, if you're doing something more complicated, you might want to use the custom map creator add in. And this is a great add in. It allows you to drag in an image and then trace over it. And when you're finished tracing over all the shapes, you click on Finish with random data, and it will generate the two shape files you need, as well as a random data set that allows you to test your custom map. And here's one where I just did four variables and one in the center. So it's basically a square version of what you saw in the original slide that I had. Now that wasn't exactly what we're shooting for. We wanted something round, and I believe it was Dr Mayfield that called it complex pie. I don't know if that's an official term but I decided I was going to build these things with script. So what I wanted to do is build shape files and make sure that I got what the doctor ordered. And that was four shapes with something in the center. And then I thought, Well, I'm a programmer. I like to do a little bit more and make a little more flexible. So I thought maybe some people would want to do the same thing, but only have two variables plus an average or center variable or more. And I thought maybe it would be nice to be able to also do different shapes like a diamond or square. So, I'm no genius. I got lucky. And Xan Gregg actually answered a post that was how can I make this polar plot in JMP. And here it comes. There we go and They were looking for a shape like this, which was looks an awful lot like what I needed. What Xan did was really great. It's a flexible script that does this, and generates these wedges around a circle, but the only thing missing was the center and also naming of files in a particular way that I wanted to do. But it wasn't too difficult. And that is what my script does. And when you run complex pie, there's the recipe for this pie, you open the complex pie maker. Then you add some ingredients. First of all, you need to say the number of shapes. And then whether you're going to center it, either at the top or off to the side a little bit. Then there's a variable for smoothness. And then you would want to also supply the inner radius, whether you want more filling or less, and the outer radius. Of course, the next step is to run the script and it will generate these shape files for you and also do an example test, just like the custom pie maker or custom map add in. So here's an example of complex pie for five and shapes is five and the smoothness to set to seven. You could use five or six and that would probably still be okay if they're going to be drawn really small. The size doesn't really matter. It's more the relative inner radius and outer radius that matter. And this is approximately what Dr Mayfield was going with, so I stuck with four for the inner radius and nine for the outer radius. 
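For reference, the Big Class portion of that workflow can be scripted roughly as below. This is a sketch from memory: the Map Role > Shape Name Use property on the stacked Label column is assumed to already point at the name map file, as set interactively in the demo, and the Graph Builder role and element names may vary slightly by JMP version.

```
// Stack height and weight, then plot the stacked data against the two-rectangle custom map.
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
stacked = dt << Stack(
	Columns( :height, :weight ),
	Source Label Column( "Label" ),
	Stacked Data Column( "Data" )
);
// assumes Label already carries the Map Role > Shape Name Use column property
stacked << Graph Builder(
	Variables( Map Shape( :Label ), Color( :Data ), Wrap( :name ) ),
	Elements( Map Shapes( Legend( 1 ) ) )
);
```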
So let's see how we can actually use these things. Just like we did with the Big Class demo, we're going to have to...we're gonna have to stack the columns. But one thing different is that I don't really have the strength and health variables in my shape file. They're actually in the shape file, they're named wedge 1, 2 3 and center. So I'm going to need to build a file that will link the two together. So first, I've got this stacked small school example, which I did by stacking those health, strengths, speed columns together. And here's the shape file that...the name shape file that I'm going to use. Notice that it has a column property of map role, right, map role definition. And that's required when you make your own custom shapes. So the linked file, the maps columns to shape, I built basically by listing the labels within within my stacked file. And then in another column, shape name, I listed the shapes the shape names. Now, it's important that the shape name here have a property that is map role, but this one is shape name use. And this data label column will have a link ID so it can be virtually joined back to my stacked table. So now that I have these tables all set up, the next thing to do is actually build a graph. Now, it'll be similar to what I did before. The one difference is that instead of just dragging label down to map shape, now I use this virtually join column, drag that down into map shape and there's the shape I want. What I need to do next is add the data to the color column. And then add name to wrap. And there are all my fish with a graphic for each one. To try to get the right gradient, I go to gradient here and that is this one right here. I want to make sure that the green is good, so I'm going to reverse the colors. And there we have it. This is the useful map on itself, because you can look at each fish and see how they're doing. But I want to do a little bit more with that, of course. What I want to do is be able to put those images into a tool tip. In order to do that, we're going to do make into data table. Bring back the file that I had and that graphic. So what it'll do here is, under the red triangle men there is a making to data table. And what this will do will produce a new data table with all these images, and that's really useful, especially if we link it back. And I would turn this and set it as a link ID so that I can point to it from my my simple small school file so that I'll be able to have each marker, find the graph for each character. Alright, so I've got this actually stored in another file and I'm going to open that one. I called it health images. So we don't need this anymore. And we don't need the stacked file anymore. But what we do need is the small school example that I started with. Now the first thing you need to do in order to get these to show up in the graph is to set that image column, that virtual column, as labeled. You can do this in a simple graph as well. So we already have an example...well, I'm going to open this again. Why not? My race script, just to show you that I'm starting from scratch. But since I've had this column labeled, now when I hover over one of these graphics, you get a...the three pieces of information for this graphic and you can do that and pin any number of these characters. So that's one way you can use these graphics. Another way is to use use for marker. And the graphics that you're going to get will be floating over the points and be used instead of the diamond shapes that I had before. 
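The virtual-join wiring behind those hover labels and markers looks roughly like this in JSL. This is a hedged sketch: the table, column, and file names are illustrative, both tables are assumed to already be open, and the exact property arguments may differ by JMP version.

```
// Link the image table to the main fish table so each row can find its graphic.
imgTable = Data Table( "Health Images" );      // table produced by Make Into Data Table
fishTable = Data Table( "Small School" );      // main table with one row per fish

// the key column in the image table gets a Link ID...
Column( imgTable, "name" ) << Set Property( "Link ID", 1 );

// ...and the matching column in the main table points back at it with a Link Reference
Column( fishTable, "name" ) << Set Property(
	"Link Reference",
	Reference Table( "Health Images.jmp" )
);
```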
I'll bring back those health images, small school, and I'll even bring back that graphic. So right now, we've got those diamonds. In order to turn them into these shapes over here, we had to add yet another badge to this virtual join column. It's really getting popular so use for marker. And so there's new badge that shows up. And behind me, you'll see that these images were put in the scene. One last detail is I don't really like this white around the image. And that's actually built into the graph image. And one way I can take care of that is, it's got a script that will find that white area and set the alpha channel so that it will make them transparent. So now we have graphs with a nice round shape, not the square background. One other thing you might want to do is increase the marker size. And we can go up to 11. How does that look? That's pretty much looks like what I had; 10 would probably would have been better. But I like going to 11. Okay, so that's use from marker. We want to make a little more complicated graph. And this is what Dr. Anderson Mayfield was going for, is a heat map behind the markers. And that's pretty complicated but JMP can handle that. That's actually a contour. So one thing I have here is another version of small school. But this time, I've got ocean temperatures down at the bottom and they're hidden so that they won't show up as actual points. So I've already built images for this one. The same images, actually, as I built before. And I'll start with the end here so we can actually see what I'm going for. So I basically want to have these graphics hovering over the contour and a couple of legends here that show what things are. That shouldn't be too hard. I'll start part of the way with a map already placed in, and lat longs already on the x and y. So the first thing I want to do is add in a contour. And we need to give that little bit of color, so I'm dragging the temp role over the color role and I'm going to adjust the number of levels on this contour. Change the alpha little bit and add some smoothness. Next thing I want to do is make it a little transparent. Let's put a .6 on that. That's about the same as what I had before. And now to add the markers, there's two ways you can go about doing that. You can shift click on the points or you can just drag that in. To drag it in, then you can set the marker size again, doing that by going to here and do other and let's use 10 this time. I think 11 was just a little too big. Alright, so that's looking good. That's almost there. We want to work on this legend. How do we get these colors in there? Well, we already have the health role. And that you can drag right over here to the corner, just add a second color role. Kind of messes up your graph initially. We can take care of that. The problem is that the contour doesn't really need to use the health role. So if we disable that then we're back to what we want. So the next thing you want to do is add the color to the actual legend here. And we can just customize that again like we did before. We're going to do a little bit more tis time. I want to have four labels. I want to have it start at five and go up to 20. Oh, I wanted to reverse the colors as well. There we have it, looks pretty much like that, so almost done. I just want to pretty it up a little bit more. There's a couple of things in the legend I would like to fix up. So go to legend settings and take away these things that don't seem to be adding too much and move the health legend up to the top. 
There we have it. So I think I'm matching what I was going for, as good as I can. So I can get rid of that one. I don't believe all need this anymore. But I do want to add a little bit more so. How about adding this legend, because it wasn't really a way to know which area of this graphic did what, you know hunger, speed, strength and health. Luckily we have that mapping file. It goes from column name to shape. And I'm just going to open that up for a second. So you can see here that I've labeled all the rows and I labeled the data label column as well. That means that when you create a graph, it will all be labeled. So this one's really simple. You just take the shape name and drag it into the map role and we're really done with that graphic. So if I want to get that on to my scene, the best way to do that is to just select it, the plus sign and then copy it. Then you open up some image editing application and save it to disk. And of course, I've done that already. So I'm going to show once you have that, and I need to bring back my graphic again, wow I can just drag this in. And there it is. Oh, it's not there. Well, so there's a way to find it. Go to customize, it dropped it in the background because normally when you drag in an image, you're dragging in a background image. So all we need to do now...let's move this out of the way so you can see what's happening...is move to the front. Here it comes and there we go. It's at the front. And I'd like to actually add a little bit of transparency on this because a little too bright for my liking. So let's put .8. And now we just have to drag it into the corner and we're pretty much done with that. Okay, so there was a lot and it involved a lot of files. So let's summarize all the files that were used. I used my complex pie maker to generate the two map files that are needed, shape files. I took small school and I stacked it. I created this column to shape mapping file that needed to point to the name file with the map role. I used a link reference to do a virtual join back to small school stacked. Then I made a graphic that I wanted to make into a data table and save that to my health images.jmp. Of course, that had to be virtually joined back to the original small school so that I can make the graphic with that. And then I did the same thing with small school ocean temps, my link reference to health images. I only...I took my column to shape file and drew the legend for the graphic that I tried to do with the ocean temperatures and I used...I just use that by dragging it in basically. So that's it for the 2D demonstrations I wanted to do. Let's take a little breather here and now we're going to go into the world of 3D, and even the web application. Okay. So this is my web 3D scatter plot application. And what I'm going to show you is how I got data from JMP into it. My application was based on JMP's scatterplot 3D. So this is just to remind you how that looks. And now I've got a little demo of that. So basically, it had...I'm going to show you some of the features, starts off with the ability to select different data sets. Here's the diamonds data set. And of course this drawing points. That's the whole idea. And they're drawn with this little ball shape which is I'll explain how that's done in a minute. But the other things you can do is rotate it, just like in JMP, you can zoom, move it back and forth. And then it has hover labels as well. You can see that bottom right corner I draw axes and a grid and add a little bit of customization. 
Not too much. You can change the background, set the color of the walls. Let's see if we can give it a little blue color. Yeah, maybe that's not to your liking. But this is just a demo. And then you can change the size of the markers. That's up on the web. Okay, so when I started this application, first thing I wanted to do is, could it perform on the web? Would it work well on my iPhone or my iPad? Because those machines are not as powerful as my desktop. So I created built in 3D or three random data sets of increasing numbers of points. I went and used a... 25,000 was what I thought was worth trying on lower power devices, but you can go a lot higher actually with high powered devices. Then I thought, well, I should try to bring some real data in and I found the famous iris data set, and it was in a CSV, comma separated variable, format. And I brought it in. But I had to write some code just to convert it over to my internal structures. And I thought, well JMP has a lot of data sets. I'd rather just bring those in. So I brought in cars, diamonds and cytometry. The only difference was cytometry, it didn't really have a good category for using color and they actually had to change my application to accept a graph that didn't have any color. So the application is pretty simple. There's only one...it's a one page web app. So there's one HTML file. I've got a couple of JavaScript files and i use a few third party libraries and one font and one texture, and the texture is the ball that is used for the shapes. So the technique that I use from the games industry is particles or particle systems. They're used very commonly for simulation games and 3D visualization. You can do some cool effects like fire, smoke, fireworks, and clouds. And I had spent some time working on a commercial game engine in the past and my responsibility was actually take the particle system and improve it. So I did have some experience with this already. Um, that was in C++ world, not JavaScript. But since I worked on interactive HTML5 and JMP for quite a while, I thought it was time to see if I can take two of my passions and marry them and come up with a web based version of using particle systems. I got lucky. I found this library, Three.js. It does hardware accelerated 3D drawing and it has excellent documentation and there are many code examples including particle systems, so that made it quite easy to build my application. And actually, the difficult part for me was figuring out how to get data into it from JMP. But I will have...I'll share script for that. The one thing I did do is make sure that I made it easy to use these objects in my application. So one thing that I did was I just created an array JavaScript objects and for the numeric columns actually add in a minimum and a maximum, so I don't have to calculate that in JavaScript. And of course, JMP is really good at calculating that kind of stuff. Um, for the category, for the color column, what I needed to do is use like a category dictionary. And basically, all that is, the names of the categories in one list and then just point to those values by a zero based index. So that's how that structure works. I thought it would be nice to do a user interface. And actually that was an easy thing to do. And I'll show you that in a second. But basically all I need to do is limit it to numerical columns for the x, y, and z. And then limit it to a categorical column for the color role. So let's have a look at that code. 
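Before walking through it, here is a stripped-down sketch of the same export idea, simplified to a single numeric column. This is not the presenter's actual script; the variable names and output path are illustrative.

```
// Export one numeric column as a JavaScript object with min, max, and values,
// then save it as a text file the web application can load.
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
vals = :height << Get Values;                  // numeric matrix

items = {};
For( i = 1, i <= N Rows( vals ), i++,
	Insert Into( items, Char( vals[i] ) )
);

txt = "var bigClass = { height: { min: " || Char( Min( vals ) ) ||
	", max: " || Char( Max( vals ) ) ||
	", values: [" || Concat Items( items, ", " ) || "] } };";

Save Text File( "$TEMP/bigclass.js", txt );    // hand this file to the web application
```

The real script repeats this for the x, y, and z columns and adds a category dictionary for the color column, as described next.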
Alright, so I give an example of a very simple data set first of all. And then the dialogue really was easy to do. I use column dialogue and I specified what I wanted for the numeric columns and what I wanted for the color column, and made sure that the modeling type was categorical so its ordinal and nominal. Next, I take the columns data and I build up the JavaScript objects that I need. Here's the min and max strings. And this is just building a string of these objects. And then if there's a color column, I'll do the same. It's a little more complicated because I need to get those character strings out and stuff them into the JavaScript object. And then finally, all it needs to do is save this to a text file. So let's give this a shot. Of course, I'm going to use Big Class, so it's a nice small file. And so I run the script. And it asked me, What do I want for my XYZ. I've already got a height and weight selected and age is a numeric column, so I'll use that as well and sex is a good one for color. And then we're done. And actually, the output here is telling me where the file is. I happen to have that on the next slide. So let's go to that. So I bet you always wondered what would Big Class looked like it was in JavaScript. And this is it. And so you can see the three numerical columns with your min and the max, and then the color column has this dictionary of F and M for the categories, and the zero, it means female and the one means male. Well, that's my exciting finish to this talk, I hope you enjoyed it. So thanks for watching, and if there are any questions, please ask them.
Labels
(8)
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Content Organization
Data Exploration and Visualization
Design of Experiments
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
0 attendees
0
0
The Sampling Tree: A Strategic Sampling and Analysis Tool (2020-US-30MP-542)
Monday, October 12, 2020
Dave Sartori, Sr. Data Scientist, PPG A sampling tree is a simple graphical depiction of the data in a prospective sampling plan or one for which data has already been collected. In variation studies such as gage, general measurement system evaluations, or components of variance studies, the sampling tree can be a great tool for facilitating strategic thinking on: What sources of process variance can or should be included? How many levels within each factor or source of variation should be included? How many measurements should be taken for each combination of factors and settings? Strategically considering these questions before collecting any data helps define the limitations of the study, what can be learned from it, and what the overall effort to execute it will be. What’s more, there is an intimate link between the structure of the sampling plan and the associated variance component model. By way of examples, this talk will illustrate how inspection of the sampling tree facilitates selecting the correct variance component model in JMP’s variability chart platform: Crossed, Nested, Nested then Crossed or Crossed then Nested. In addition, the application will be extended to the interpretation variance structures in control charts and split-plot experiments. Auto-generated transcript... Speaker Transcript Dave Hi, everybody. Thanks for joining me here today, I'd like to share with you a topic that has been part of our Six Sigma Black Belt program since 1997. 1997. So I think this is one of the tools that people really enjoy and I think you'll enjoy it, too, and find it informative in terms of how it interfaces with some of the tools available in JMP. The first quick slide or two in terms of a message from our sponsor. I'm with PPG Industries outside of Pittsburgh, Pennsylvania, in our Monroeville Business and Technical Center. I've been a data scientist there on and off for over 30 years, moved in and out of technical management. And now, back to what I truly enjoy, which is working with data and JMP in particular. So PPG has been around for a while, was founded in 1883. Last year we ranked 180th on the Fortune 500. And we made mostly paints, although people think that PPG stands for Pittsburgh Plate Glass, that was no longer the case as of about 1968. So it's a...it's PPG now and it's primarily a coatings company. performance coatings and industrial coatings. cars, airplanes, of course, houses. You may have bought PPG paint or a brand of PPG's to to use on your home. But it's also used inside of packaging, so if you don't have a coating inside of a beer can, the beer gets skunky quite quickly. My particular business is the specialty coatings and materials. So my segment we make OLED phosphors for universal display corporation that you find in the Samsung phone and also the photochromic dyes that go into the transition lenses, which turn dark when you head outside. So what I'm going to talk to you today about is this this tool called sampling tree. And what it is, it's really just a simple graphical depiction of the data that you're either planning to collect or maybe that you've already collected. And so in variation studies like a Gage R&R general measurement system evaluations, components and various studies (or CoV, as we sometimes call them), the sampling tree is a great tool for for thinking strategically about a number of things. So, for example, what sources of variance can or should be included in this study? 
How many levels within each factor or source of variation can you include? And how many measurements to take for each combination factors and settings? So you're kind of getting to a sample size question here. So strategically considering these questions before you collect any data helps you also define the limitation of the study, what you can learn from it, and what the overall effort is going to be to execute. So we put this in a classification tools that we teach in our Six Sigma program, what we call critical thinking tools because it helps you think up front. And it is a nice sort of whiteboard exercise that you can work on paper or the whiteboard to to kind of think prospectively about the the data, you might collect. It's also really useful for understanding the structure of factorial designs, especially when you have restrictions on randomization. So I'll give you one sort of conceptual example, towards the end here, where you can describe on a sampling tree, a line of restricted randomization. And so that tells you where the whole plot factors are and where the split plot of factors are. So it can provide you again upfront with a better understanding of the of the data that you're planning to collect. They're also useful in where, I'll share another conceptual example, where we've combined a factorial design with a component of variations study. So this, this is really cool because it accelerates the learning about the system under study. So we're simultaneously trying to manipulate factors that we think impact the level of the response, and at the same time understanding components of variation which we think contributes a variation of response. So once the data is acquired, the sampling tree can really help you facilitate the analysis of the data. And this is especially true when you're trying to select the variance component model within a variance chart...variability chart that you have available in JMP. And so if you've ever used that tool (and I'll demonstrate it for you here in a couple...with a couple of examples), if you're asking JMP to calculate for you the various components, you have to make a decision as to what kind of model do you want. Is it nested? Is it crossed? Maybe it's crossed then nested. Maybe it's nested then crossed. So helping you figure out what the correct variance component model is, is really well facilitated by by good sampling tree. The other place that we've used them is where we are thinking about control charts. So the the control chart application really helps you see what's changing within subgroups and what's changing between subgroups. So it helps you think critically about what you're actually seeing in the control charts. So as I mentioned, they're they're good for kind of showing the lines of restrictions in split plot but they're kind of less useful for the analysis of designed experiments, so again for for DOE types of applications aremore kind of kind of up front. So let's jump into it here with an example. So here's a what I would call a general components of variance studies. And so in this case, this is actually from the literature. This is from Box Hunter and Hunter, "Statistics for Experimenters," and you'll find it towards the back of the book where they are talking about components of variance study and it happens to be on a paint process. And so what they have in this particular study are 15 batches of pigment paste. They're sampling each batch twice and then they're taking two moisture measurements on each of those samples. 
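In textbook notation, that nested sampling plan corresponds to the standard components-of-variance model below; this is the usual form, while the talk itself sticks with the sampling-tree picture.

```latex
% Nested variance-components model for the moisture data
% (batch i = 1..15, sample j = 1,2 within batch, measurement k = 1,2 within sample)
y_{ijk} = \mu + b_i + s_{j(i)} + e_{ijk}, \qquad
b_i \sim N(0,\sigma^2_B), \quad s_{j(i)} \sim N(0,\sigma^2_{S(B)}), \quad e_{ijk} \sim N(0,\sigma^2_E)
```

The variance components reported later are the estimates of these three sigma-squared terms, expressed as percentages of their total.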
So the first sample in the first batch is physically different than the second batch, and the first sample out of the second batch is physically different from any of the other samples. And so one practice that we tried to use and teach is that for nested factors, it's often helpful to list those in numerical order. So that again emphasizes that you have physically different experimental units you're going from sample to sample throughout. And so this is a this is a nested sampling plan. So the sample is nested under the batch. So let's see how that plays out in variability chart within JMP. Okay, so here's the data and we find the variability chart under quality and process variability. And then we're going to list here as the x variables the batch and then the sample. And one thing that's very important in a nested sampling plan is that the factors get loaded in here in the same order that you have them in a sampling tree. So this is hierarchical. So, otherwise the results will be a little bit confusing. So we can decide here in this this launch platform what kind of variance component model we want to specify. So we said this is a nested sampling plan. And so now we're ready to go. We leave the the measurement out of the...out of the list of axes because the measurement really just defines where the, where the sub groups are. So we just we leave that out. And that's going to be what goes into the variant component that JMP refers to as within variation. Okay, so here's the variability chart. One of the nice things too with the variability chart is there's an option to add some some graphical information. So here I've connected the cell mean. And so this is really indicating the kind of visually what kind of variation you have between the samples within the batch. And then we have two measurements per batch, as indicated on our sampling tree. And so the the distance between the two points within the batch and the sample indicates the within subgroup variation. So you can see it looks like just right off the bat it there's a good bit of of sample to sample variation. And the other thing we might want to show here are the group means. And so that shows us the batch to batch variations. So the purple line here is the, the average on a batch to batch basis. Okay. Now, what about the actual breakdown of the variation here. Well that's nicely done in JMP here under variance components. And Get that up there, we can see it then I'll collapse this. As we saw graphically, it looked like the sample to sample variation within a batch was a major contributor to the overall variation in the data. And in fact, the calculations confirm that. So we have about 78% of the total variation coming from the sample; about 20% of variations coming batch to batch and only about 2.5% of the variation is coming from the measurement to measurement variation within the batch and sample. I noticed here to in the variance components table, the the notation that's that used here. So this is indicated that the sample is within the batch. So this is an nested study. And again, it's important that we load the factors into the into the variability chart in the order indicated here in the in the plot. So wouldn't make any sense to say that within sample one we have batch one and two. That just doesn't make any physical sense. And so it kind of reflects that in the in the tree. And just Now let's compare that with something a little bit different. I call this a traditional Gage R&R study. 
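Before that comparison, here is a hedged JSL sketch of the nested launch just walked through. The column names follow the example, and the Model argument and option names are from memory, so they may differ slightly by JMP version; swapping in Model( "Crossed" ) or Model( "Crossed then Nested" ) covers the later examples.

```
// Nested variability chart: factors loaded in the same top-down order as the sampling tree.
Variability Chart(
	Y( :Moisture ),
	X( :Batch, :Sample ),
	Model( "Nested" ),
	Connect Cell Means( 1 ),
	Show Group Means( 1 ),
	Variance Components( 1 )
);
```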
And so what you have in a traditional Gage R&R study is you have a number of parts sample batches that are being tested. And then you have a number of operators who are testing each one of those. And then each one test the same sample or batch multiple times. So in this particular example we're showing five parts or samples or batches, three operators measuring each one twice. Now in this case, operator one for the for batch number one is the same as operator number one for batch or sample report number five. So you can think of this as saying, well, the operator kind of crosses over between the part, sample, batch whatever the whatever the thing is that's getting getting measured. So this is referred to as a as a crossed study. And it's important that they measure the same article because one of the things that comes into play in a crossed study is that you don't have in a nested study is a potential interaction between the operators and what they're measuring. So that's going to be reflected in the in the variance component analysis that we see from JMP. Now let's have a look here. at this particular set of data. So again, we go to the handy variability chart, which again is found under the quality and process. And in this case, I'll start by using the same order for the variables for the Xs as shown on the sampling tree. But, as I'll show you one of the features of a of a crossed study is that we're no longer stuck with the hierarchical structure of the tree. We can we can flip these around. And so this is crossed. I'm going to be careful to change that here. Remember that we had a nested study from before. And I'm going to go ahead and click okay. And I'm going to put our cell means and group means on there. So the group means in this case are the samples (three) and we've got three operators. And now if we asked for the variance components. Notice that we don't have that sample within operator notation like we had in the in the nested study. What we have in this case is a sample by operator interaction. And it makes sense that that's a possibility in this case, because again, they're measuring the same sample. So Matt is measuring the same sample a as the QC lab is, as is is as Tim. So an interaction in this case really reflects the how different this pattern is as you go from one sample to the other. So you can see that it's generally the same It looks like Matt and QC tend to measure things perhaps a little bit lower overall than Tim. This part C is a little bit the exception. So the, the interaction variation contribution here is is relatively small. There is some operator to operator variation, and the within variation really is the largest contributor. And that's easy to see here because we've got some pretty pretty wide bars here. But again, this is a is a crossed study so we should be able to change the order in which we load these factors and and get the same results. So that's my proposition here; let's test it. So I'm just going to relaunch this analysis and I'm going to switch these up. I'm going to put the operator first and the sample second. Leave everything else the same. And let's go ahead and put our cell means and group means on there. And now let's ask for the variance components. So how do they compare? I'm going to collapse that part of the report. 
So in the graphical part and this is a cool thing to recognize with a crossed study is because again, we're not stuck with the hierarchy that we have in a nested study, we can kind of change the perspective on how we look at the data. So that perspective with loading in the operator first gives us sort of a direct operator to operator comparison here in terms of the group means. And again that interaction is reflected of how this pattern changes between the operators here as we go from Part A, B, or C, A, B, or C. What about the numbers in terms of the variance components? Well, we see that the variance components table here reflects the order in which we loaded these factors into the into the dialog box and... But the numbers come out very much the same. So the sample on the lefthand side here, the standard deviation is 1.7. Standard deviation due to the operator is about 2.3 and it's the same value over here. The sample by operator or operator by sample interaction, if you like, is exactly the same. And the within is exactly the same. So, with a crossed study, we have some flexibility in how we load those factors in and then the interpretation is a little bit different. If these were different samples, we might expect this pattern from going from operator to operator, to be somewhat random because they're they're measuring different things. So there's no reason to expect that the pattern would repeat. If you do see a significant interaction term in a typical kind of a traditional Gage R&R study, like we have here, well, then you've got a real issue to deal with because that's telling you that the nature of the sample is is causing the operators to measure differently. So that's a bit harder of a problem to solve than if you just have a no interaction situation there. OK. Dave So again, this, for your reference, I have this listed out here. Um, so now let's get to something a little bit more juicy. So here we have sort of a blended study where we've got both crossed and nested factors. So this was the business that I work in. The purity of the materials that we make is really important and a workhorse methodology for measuring purity is a high performance liquid chromatography or HPLC for short. So this was a...this was a product and it was getting used in an FDA approved application so the purity getting that right was was really important. So this is a slice from a larger study. But what I'm showing is the case where we had three samples; I'm labeling them here S1, S2, S3. We have two analysts in the study. And so each analyst is going to measure the same sample in each case. So you can see that similar to what we had in the previous example there that what I call traditional Gage R&R, where each operator or analyst in this case is measuring exactly the same part or sample. So that part is crossed. When you get down under the analyst, each analyst then takes the material and preps it two different times. And then they measure each prep twice. They do two injections into the HPLC with with each preparation. So preparation one is different than preparation two and that's physically different than the first preparation for the next analyst over here. And so again, we try to remember to label these nested factors sequentially to indicate that they're they're physically different units here. 
It doesn't really make any difference from JMP's point of view, it'll handle it fine, if you were to go 1-2, 1-2, 1-2, and so on down the line, that's fine, as long as you tell it the proper variance component model to start with. So this would be crossed and then nested. So let's see how that works out in JMP. So here's our data sample, analyst prep number, and then just an injection number which is really kind of within subgroup. So once again we go to analyze, quality and process. We go to the variability chart. And here we're going to put in the factors in the same order as they were showing on the sampling tree. And then we're going to put the area in there as the percent area is the response. And we said this was crossed and then nested, so we have some couple of other things to choose from here. And in this case, again, the sampling tree is really, really helpful for for helping us be convinced that this is the case, and selecting the right model. This is crossed, and then nested. Let's click OK. I'm going to put the cell means and group means on there. Again, we have a second factor involved above the within. So let's pick both of them. And let's again ask for the variance components. And I'm going to just collapse this part, hopefully, and maybe I'm going to collapse the standard deviation chart, just bringing a little bit further up onto the screen. So what we can see in the in the graph as we go, we see a good bit of sample to sample variation. The within variation doesn't look too bad. But we do maybe see a little bit of a variation of within the preparation. So, um, the sample in this case is by far the biggest component of variation, which is really what we were hoping for. The analyst is is really below that, within subgroup variation. And so this this lays it out for us very nicely. So in terms of what it's showing in the variance components here table in terms of the components, is it's sample analyst and then because these two are crossed, we've got a potential interactions to consider in this case. Doesn't seem to be contributing a whole lot to the to the overall variation. And again, that's the how the pattern changes as we go from analyst to analyst and sample to sample. Now, the claim I made before with the fully crossed study was that we could swap out the the crossed factors in terms of their in terms of their order and and it would be okay. So let's let's try that in this case. So I'm just going to redo this, relaunch it and I can I think I can swap out the crossed factors here but again I have to be careful to leave the nested factor where it is in the tree. So I notice over here in the variance components table, the way we would read this as we have the prep nested within the sample and the analyst. So that means it has to go below those on the tree. So let's go ahead and connect some things up here. I'm going to take the standard deviation chart off and asked for the variance components. Okay, so just like we saw in the traditional Gage R&R example we've got the analyst and the sample switching. But their values for the, if we look at the standard deviation over here in the last column, they're identical. We have again the identical value for the interaction term and interact on the identical value for the prep term, which again is nested within the, within the sample and the analyst. 
So again, here's where the, where the sampling tree helps us really fully understand the structure of the data and complements nicely what with what we see in the variance components chart of JMP. So, those, those are a couple of examples where these are geared towards components of variation study. One thing you might notice too, I forgot to point this out earlier, is look at the sampling tree here. And if I bring this back and I'm just trying to reproduce this. That backup. Dave It's interesting if you look at the horizontal axis in the variability chart, it's actually the sampling tree upside down. So that's another way to kind of confirm that you're you're looking at the right structure here when you are trying to decide what variance component component model to to apply. So again, here are the screenshots for that. Here's an example where the sampling tree can help you in terms of understanding sources of variation in a in a control chart of all things. So in this particular case, over a number of hours, a sample is being pulled off the line. These are actually lens samples. I mentioned that we we make photochromatic dyes to go into the transitions lenses and they will periodically check the film thickness on the lenses and that's a destructive test. And so when they take that lens and measure the film thickness, well, they're they're done with that with that sample. And so what we would see if we were to construct an x bar and R chart for this is you're going to see on the x bar chart as an average, the hour to hour average. And then within subgroup variation is going to be made up of what's going on here sample to sample and the thickness, the thickness measurement. Now in this case, notice that there's vertical lines in the sampling tree, so that the tree doesn't branch in this case. So when you see vertical lines when you're drawing a vertical lines on to the sampling tree, that's an indication that the variability between those two levels of the tree are confounded. So, I can't really separate the inherent measurement variation in the film thickness from the inherent variation of the sample to sample variation. So I'm kind of stuck with those in terms of how this measurement system works. So let's let's whip up a control chart with this data. And for that are, again, we're going to go to quality and process. And I'm going to jump into the control chart builder. So again, our measurement variable here is the film thickness. And we're doing that on an hour to hour basis. So when we get it set up by by doing that, we see that JMP smartly sees that the subgroup size is 3, just as indicated on our, on our sampling tree. But what's interesting in this example is that you might at first glance, be tempted to be concerned because we have so many points out of control on the x bar chart. But let's think about that for a minute in terms of what the sampling tree is telling us. So the sampling tree again is telling us that's what's changing within the subgroup, what's contributing to the average range, is the film thickness to film thickness measurement, along with the sample to sample variation. And remember how the control limits are constructed on an x bar chart. They are constructed from the average range. So we take the overall average. And then we add that range plus or minus a factor related to the subgroup sample size so that the width of these control limits is driven by the magnitude of the average range. 
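For reference, the launch just described needs only a minimal Control Chart Builder call; the column names are illustrative, and the comment restates the textbook limit formula the discussion is pointing at.

```
// XBar/R chart with hourly subgroups of size 3.
// XBar limits are grand mean plus or minus A2 x Rbar (A2 is about 1.023 for subgroups of 3),
// so the limit width is driven entirely by the average within-subgroup range.
Control Chart Builder(
	Variables( Subgroup( :Hour ), Y( :Thickness ) )
);
```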
And so really what this chart is comparing is, let's consider this measurement variation down here at the bottom of the tree. So it's comparing measurement variation to the hour to hour variation that we're getting from the, from the line. So that's actually a good thing because it's telling us that we can see variation that rises above the noise that that we see in the in the subgroup. So in this case, that's, that's actually desirable. And so, that's again, a sampling tree is really helpful for reminding us what's what's going on in the Xbar chart in terms of the within subgroup and between subgroup variation. Now, just a couple of conceptual examples in the world of designed experiments. So split plot experiment is an experiment in which you have a restriction on the run order of the of the experiment. And what that does is it ends it ends up giving a couple of different error structures, and JMP does a great job now of designing experiments for for that situation where we have restrictions on randomization and also analyzing those. So, nevertheless, though it's sometimes helpful to understand where those error structures might be splitting, and in a split plot design, you get into what are called whole plot factors and subplot factors. And the reason you have a restriction on randomization is typically because one or more of the factors is hard to vary. So in this particular scenario, we have a controlled environmental room where we can spray paints at different temperatures and humidities. But the issue there is you just can't randomly change the humidity in the room because it just takes too long to stabilize and it makes the experiment rather impractical. So what's shown in this sampling tree is you really have three factors here humidity, resin and solvant. These are shown in blue. And so we only change humidity once because it's a difficult to change variable. So that's how you set up a split plot experiment in JMP is you can specify how hard the factors are to change. So in this case, humidity is a hard, very hard to change factor. And so, JMP will take that into account when it designs the experiment and when you go to analyze it. But what this shows us is that the the humidity would be considered a whole plot factor because it's above the line restriction and then the resin and the solvent are subplot factors; they're below the line of restriction. So there's a there's a different error structure above the line of restriction for whole plot factors than there is for subplot factors. In this case we have a whole bunch of other factors that are shown here, which really affect how a formulation which is made up of a resin and a solvent gets put into a coating. So this, this is actually a 2 to the 3 experiment with a restriction randomization. It's got eight different formulations in there. Each one is applied to a panel and then that panel is measured once so that what we see in terms of the measurement to measurement variation is confounded with the coating in the in the panel variation. As, as I said before, when we have vertical lines on the on the sampling tree, then we have then we have some confounding at those levels. So that's, that's an example where we're using it to show us where the, where the splitting is in the split plot design. This particular example again it's conceptual, but it actually comes from the days when PPG was making fiberglass; we're no longer in the fiberglass business. 
But in this case, what what was being sought was a an optimization, or at least understanding the impact of four controllable variables on what was called loss ???, so they basically took coat fiber mats and then measure their the amount of coating that's lost when they basically burn up the sample. So what we have here is at the top of the tree is actually a 2 to the 4 design. So there's 16 combinations of the four factors in this case and for each run in the design, the mat was split into 12 different lanes as they're referred to as here. So you're going to cross the mat from 1 to 12 lanes and then we're taking out three sections which within each one of those lanes and then we're doing a destructive measurement on each one of those. So this actually combines a factorial design experiment. with a components of variations study. And so again, we've got vertical lines here at the bottom of the tree indicating that the measurement to measurement variation is confounded with the section to section variation. And so what we ended up doing here in terms of the analysis was, we treated the data from each DOE run as sort of the sample to sample variation like we had in the moisture example from Box Hunter and Hunter, to have instead of batches, here you have DOE run 1, 2, 3 and so on through 16 and then we're sub sampling below that. And so we treat this part as a components of variation study and then we basically averaged up all the data to look and see what would be the best settings for the four controllable factors involved here. So this is really a good study because it got to a lot of questions that we had about this process in a very efficient manner. So again, combining a COV with a DOE, design of experiments with components of variations study. So in summary, I hope you've got an appreciation for sampling trees that are, they're pretty simple. They're easy to understand. They're easy to construct, but yet they're great for helping us talk through maybe what we're thinking about in terms of sampling of process or understanding a measurement system. And they also help us decide what's the best variance components model when we we look to get the various components from JMP's variability chart platform, which we get a lot of use out of that particular tool, which I like to say that it's worth the price of admission that JMP for that for that tool in and of itself. So I've shown you some examples here where it's nested, where it's crossed, crossed then nested, and then also where we've applied this kind of thinking to control charts to help us understand what's varying within subgroups versus was varying between subgroups. And then also, perhaps less useful...less we can use those with designed experiments as well. So thanks for sharing a few minutes with me here and my email's on the cover slide so if you have any questions, I'd be happy to converse with you on that. So thank you.
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Using Auto-Validation to Analyze Screening DOEs (2020-US-30MP-541)
Monday, October 12, 2020
Peter Hersh, JMP Global Technical Enablement Engineer, SAS Phil Kay, PhD, Learning Manager, Global Enablement, SAS Institute In the process of designing experiments, many potential critical factors are often identified. Investigating as many of these critical factors as possible is ideal. There are many different types of screening designs that can be used to minimize the number of runs required to investigate a large number of factors. The primary goal of screening designs is to find the active factors that should be investigated further. Picking a method to analyze these designs is critical, as it can be challenging to separate the signal from the noise. This talk will explore using the auto-validation technique developed by Chris Gotwalt and Phil Ramsey to analyze different screening designs. The focus will be on group orthogonal supersaturated designs (GO-SSDs) and definitive screening designs (DSDs). The presentation will show the results of auto-validation techniques compared to other techniques for analyzing these screening designs. Supplementary Materials A link to Mike Anderson's Add-in To Support Auto-Validation Workflow. JMP data tables for examples 1 and 2 from the presentation. Journal attached for an interactive demonstration of how auto-validation works for screening experiments. Other Discovery Summit papers about auto-validation from: Europe 2018, US 2018, Europe 2019, US 2019 and US 2020. Recorded demonstration of how auto-validation works for screening experiments: (view in My Videos) Auto-generated transcript... Speaker Transcript Peter Hersh All right, well, thanks for tuning in to watch Phil's and my Discovery presentation. We are going to be talking today about a new ??? technique in JMP that we are both really excited about, and that's using auto validation to analyze screening designs for DOEs. So my name is Peter Hersh. I'm a senior systems engineer and part of the global technical enablement team, and my co-author who's joining me is Phil. Do you want to introduce yourself? phil Yes. So I'm in the global technical enablement team as well. I'm a learning manager. Peter Hersh Perfect. So we're going to start out at the end and show some of the results that we got while we were working through this. We did some experiments using auto validation with several different screening DOEs and found it a very promising technique. We were able to find more active factors than with some other analysis techniques. And really, when we're looking at screening DOEs, we're trying to find as many active factors as we can. And Phil, maybe you can talk a little bit about why that is. phil Yeah.
So the objective of a screening experiment is to find out which of all of your factors are actually important. So if we miss any factors from our analysis of the experiment that turned out to be important, then that's a big problem. You know, we're not going to fully understand our process or our system because we're we're neglecting some important factor. So it's really critical. The most important thing is to identify which factors are important. And if we occasionally add in a factor that turns out not to be important. That's, that's less less of a problem but we really need to make sure we're capturing all of the active factors. Peter Hersh Yeah, great, great explanation there, Phil, and I think if we look at this over here on the right-hand side, our table, we we've looked at 100 different simulations of these different techniques where we looked at different signal-to-noise ratios in screening design and we found that out of those seven different techniques, we did a fairly good job when we had a higher signal-to-noise ratio, but as that dropped a little we struggled to find those less large effects. So this top one was the auto validation technique, and and we only ran that once, and we'll go into why that is, and what that running that auto validation technique did for us. But I think this was a very promising result. And that now typically, when we do a designed experiment, we don't hold out any of the data. We want to keep it intact. Phil, can you talk a little to why we wouldn't do that? phil Yeah. When we design an experiment, we are looking to find the smallest possible number of runs that give us the information that we need. So we deliberately keep the number of rows of data really as small as possible. Ideally, you know, in machine learning what you can do is you hold back some of the data to...as a way of checking how good your models are and whether you need to improve that model, whether you need to simplify that model. With design of experiments, you don't really have the luxury of just holding back a whole chunk of your data because all of it's critically important. You've designed it so that you've got the minimum number of rows of data, so there isn't really that luxury. But it would be really cool if we could find some way that we could use this this validation in some in some way. And I guess that's that's really the trick of what we're talking about today. Peter Hersh Yeah, excellent lead in, Phil. So here, here, this auto validation technique has been introduced by Chris Gotwalt, who is our head developer for pretty much everything under the analyze menu in JMP, and a university professor, Phil Ramsey. And I have two QR codes here for two different Discovery talks that they gave and if you're interested in those Discovery talks, I highly recommend them. They go into much more technical details than Phil and I are planning to go into to day about why the technique works and what they came up with. We're just trying to advertise that it's it's a pretty interesting technique and it's something that might be worth checking out for for you and show some of our results. The basic idea is we start with our original data, And then we resample that data so the data down here in gray is a repeat of this data up here in white and that is used as validation data. And how we get away with this is we used a fractional weighting system. 
And this has really been...it's really easy to set up with an add in that that Mike Anderson developed and there's the QR code for finding that but you can find that on the JMP user community. And it just makes this setup a lot more simple and we'll go through the setup and the use of the add in when we walk through the example in JMP. But the basic idea is it creates this validation column, this fractional waiting column, and a null factor, and we'll talk about those in a little bit. Alright, so we have a case study that both Phil and I used here and we're trying to maximize a growth rate for a certain microbe. And we're adjusting the nutrient combination. And for my example I'm looking at 10 different nutrient factors. And this nutrient factors, we went in everywhere from not having that nutrient up to some high level of having that. And this is based on a case study that you can read about here, but we just simulated the results. So we didn't have the real data. And the case study I'm going to talk to is a DSD where we have five active effects. Actually it's four and the intercept that are active. And we did a 25-run DSD and I am...I'm just looking at these 10 nutrient factors and I'm adjusting the signal-to-noise ratio for the simulated results. So that's, that's my case study and, Phil, do you want to talk to yours, a little bit? phil Yeah, so in mine, I look to a smaller number of the factors, just six factors, in a smaller experiment, so a 13-run definitive screening design. And what I was really interested in looking at was how well this method could identify active effects when we've got as many active effects as we have runs in the design of experiments. So we've got 13 rows of data and we've got 13 active parameters when we include the intercept as well. That's a really big challenge. Most the time we're not going to be able to identify the active effects using standard methods. So I was really, really interested in how this auto validation method might do in that situation. Peter Hersh Yeah, great. So we're gonna duck out of our PowerPoint. I'm going to bring in my case study here, and we'll, we'll talk about this. So here is my 25-run DSD. I have I have my results over here that are simulated. And so this is my growth rate which is impacted by some of these factors and we're in a typical screening design. We're just trying to figure out which of these factors are active, which are not active. And we might want to follow up with an augmented design or at least some confirmation runs to to make sure that our, our initial results are confirmed. So how we would set up this auto validation? So for now, in JMP 15, this is that add in that I mentioned that that Mike Anderson created and it's just called auto validation setup. And in JMP 16 this is going to be part of the product, but in JMP 15, it's an add in. And so when that I run that add in, what happens is it creates... resamples that data. So it created 25 runs that are identical to those top 25 and they're in gray. And then it added this partially... this fractional weighting here and then it added the validation and the null factor. So basically, what we're going to do is we're going to run a model on this using validation and you can use any modeling technique; generalized regression is a good one. You can use any of the variable selection techniques. You want to make sure that it can do some variable selection for you. So just to give you an idea, I'm going to go under analyze, fit model. 
We'll take our growth rate which is our response. We're going to take that weighting. Actually I'll change this to generalize regression. I'm going to put that weighting in as our frequency. I'm going to add that validation column that was created. This null factor that's created,and we'll talk a little bit more about that null factor. And then I'm going to just add all those 10 factors. Now in Phil's example, he's going to look at interactions and quadratic effects. I could do that here as well, but this is just to show the capability. And I'll hit Run. We'll go ahead and launch this again. I'll use lasso, you could use forward selection or anything like that. But I'll just use a lasso fit. Hit go. And then I'm going to come down here and I'm going to look at these estimates. So I what I want to do is simulate these estimates and I want to figure out which of these estimates get zeroed out most often and least often. So I would go in here and I'd hit simulate, and I could choose my number of simulations. In this case I had, I have done 100 and I won't make you sit here and watch it simulate. I can go right over here to my simulated results. So I've done 100 simulations here and I'm looking at the results from those hundred simulations and when I run the distribution, which automatically gets generated there, we can see some information about this. Now the next thing that I'm going to do is hold down control, and I'm going to customize my summary statistics. And all I want to do is remove everything except for the proportion non zero. So what that's going to do is it's going to allow me to just see the factors that were that were zeroed out or how often a factor...a certain factor was zeroed out and how often it was kept in there. So when I hit okay, all of these are changed to proportion non zero. And when I and then when I right click on here, I can make this combine data table, which I've, I've already done. And the combined data table is here. And the reason I I'm kind of going quick on this is because we can... I've added a a factor type row and just just showing, have a saved script in here, but this is...you would get these three columns out of that. Make a combined data table, so it would have your all of your factors and then how often that factor was non zero. So the higher the number, the more indicative that it's an active factor. So the last thing I'm going to do is run this Graph Builder and this shows how often a factor is added to the model. That null factor is kind of our, our line...reference line, so it has no correlation to the response. And so anything that is lower than that. we probably don't need to include in the model and then things that are higher than that, we might want to look at. And so these green ones from the simulation were the act of factors, along with the intercept and then the red ones were not active. So it did a pretty good job here. It flagged all of the good ones. phil Yeah. Peter Hersh And we got one extra one but like Phil, you mentioned, that's not as big of a problem, right? phil Yeah, I mean, that's not the end of the world. I mean, it's more, it's much more of a problem if you miss things that are active and your method tells you that they're not. And it's really impressive how it's picked out some factors here which had really low signal-to-noise ratios as well. Peter Hersh Yes, yeah. 
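For readers who want to experiment outside JMP, here is a rough Python analogue of the auto-validation workflow just described: copy the rows, give the original and copied rows anti-correlated fractional weights, fit a weighted lasso, repeat, and track how often each term is kept. This is only a sketch of the idea; it is not the JMP add-in, and the uniform weighting scheme, the lasso penalty, and the simulated 25-run stand-in design are all assumptions made for illustration.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 25, 10
    X = rng.choice([-1.0, 0.0, 1.0], size=(n, p))          # stand-in for a 25-run DSD
    beta = np.array([2, 1, 0.5, 0.25, 0, 0, 0, 0, 0, 0])   # assumed active factors
    y = X @ beta + rng.normal(0, 1, n)

    X_aug = np.vstack([X, X])                                 # original rows plus copied "validation" rows
    X_aug = np.column_stack([X_aug, rng.normal(size=2 * n)])  # null factor as a reference column
    y_aug = np.concatenate([y, y])

    keep_count = np.zeros(p + 1)
    n_sim = 200
    for _ in range(n_sim):
        w = rng.uniform(0.05, 0.95, n)
        weights = np.concatenate([w, 1.0 - w])             # fractional, anti-correlated weights
        fit = Lasso(alpha=0.1).fit(X_aug, y_aug, sample_weight=weights)
        keep_count += (np.abs(fit.coef_) > 1e-8)

    print("proportion nonzero per term (last term is the null factor):")
    print(np.round(keep_count / n_sim, 2))
    # Terms selected more often than the null factor are candidate active effects.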
So just to give you an idea, this was citric acid was two, EDTA was one, that was half...so half the signal to noise, and potassium nitrate was a quarter, so very low signal and it was still able to catch that. Yeah, so I'm gonna pass the ball over to Phil and have him present his case study. Yeah. phil Thanks, Pete. Yeah. Well, in my case study, as I said, it's a six factor experiment and we only have 13 runs here. And I've simulated the response here so that, such that every one of my factors here is active, and the main effects are active, and the quadratic effects of each of those are active. So we've got 12 active effects, plus an intercept, to estimate or to find. And I've made them, you know, just for simplicity, I made them really high signal-to-noise ratio. So there's a signal-to-noise ratio of 3 to 1 for every one of those effects. So these are big, important effects basically as the... is what I've simulated. So what we want to find out is that all of these effects, all of these factors are really important. Now if you look to analyze this using fit DSD, which would be a standard method for this, it doesn't find the active factors. It only finds ammonium nitrate as a as an active factor. I think fit DSD struggles when there are a lot of active effects. It's very good when there are only a small number. And actually, you know, we probably wouldn't want to run a 13-run DSD and expect to get a definitive analysis. We would recommend adding some additional runs in this kind of situation. Even if we knew what the model was, so if we somehow we knew that we had six active main effects and six active quadratic effects plus the intercept, we really can't fit that model. So this is just that model fit using JMP's fit model platform, the least squares regression. And you know there's...we've got as many parameters to estimate as we have rows of data, so we've got nothing left to estimate error. So this is really all just to illustrate that this is a really big challenge, analyzing this experiment and getting out from it what we need to get out from it is a real problem. So I followed the same method as Pete. I generated this auto validation data set where we've got the repeated runs here with the fractionally weighted... fractional weightings Ran that through gen reg, so using the lasso as a model selection and then again resimulating. So simulating and each time changing out the fractional weighting and around about 250 simulations, which again I won't show you actually doing that. These are the simulation results that we got, the distributions here, and you can see that it's picking out citric acid. So the some of the times the models had a zero for the parameter estimate for citric acid, but a lot of the time it was getting the parameter estimate to be about three, which is what it was simulated as originally, and what it should be getting. And you can see that for then, some of these other effects, which was simulated to not be active then, by and large, they are estimated to have a parameter size of zero, which is what we want to see. And just looking at the proportion, nonzero as Pete did there. And I've added in all the, the effect types here because here I was looking at the main effects, the quadratic and the interactions. And what the method should find is that the main effects and the quadratics are all active, but the two factor interactions were not. 
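The proportion-nonzero summary that both presenters rely on takes only a few lines of matplotlib to visualize. The sketch below uses invented proportions and a mix of factor names from the talk purely for illustration; it does not reproduce the actual simulation results.

    import matplotlib.pyplot as plt

    prop_nonzero = {
        "Intercept": 1.00, "Citric acid": 0.78, "EDTA": 0.81, "Ammonium nitrate": 0.90,
        "Citric acid^2": 0.70, "EDTA^2": 0.66, "Citric acid*EDTA": 0.12,
        "EDTA*Ammonium nitrate": 0.09, "Null factor": 0.22,
    }
    null_rate = prop_nonzero["Null factor"]

    names = list(prop_nonzero)
    plt.barh(names, [prop_nonzero[k] for k in names])
    plt.axvline(null_rate, linestyle="--", label="null factor reference")
    plt.xlabel("proportion nonzero across simulations")
    plt.legend()
    plt.tight_layout()
    plt.show()
    # Effects well above the dashed line are flagged as likely active;
    # effects at or below it behave like noise.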
And when we look at that, just plotting that proportion non zero for each of these effects, you can see, first of all, the, the null factor that we've added in there. And anything above that, that's with a higher proportion non zero was suggesting, that's an active effect. And you can see, well, first of all, the intercept, which is always there. We've got the main effects, which we're declaring as active using this method. They've got a higher proportion nonzero than the null factor and the quadratics. And we can see the all of the two factor interactions, the proportion nonzero was was much, much lower. So it's done an amazingly good job of finding the effects that I'd simulated to be active in this very challenging situation, which I think is, is really very exciting. That's just one little exploration of this method. To me that that's a very exciting result and it makes me very excited about looking at this more. So I just wanted to finish with some of the concluding remarks. And I think, Pete, it's fair to say we're not saying that everybody should go and throw away everything else that they've done in the past and only use this method now. Peter Hersh Yeah, absolutely. We've seen some exciting results. I think, Chris, Chris is seeing exciting results, but this is not the end all, be all to always use auto validation, but it's a new tool in your tool belt. phil Yeah, I mean I think I'll certainly use it every time, but I'm not saying only use that. I think there's always...you always want to look at things from different angles using all the available tools that you've got. And so it clearly shows a lot of promise, and we focused on the screening situation where we're trying to identify the active effects from screening experiments and we've looked at DSDs. I've looked be briefly other screen designs like group orthogonal super saturated designs, and it does a good job there from my quick explorations. I'd see no reason why it won't do well in fractional factorial, custom screening designs. And it seems to be working in situations where the standard methods just fall down. The situation that I showed was a very extreme example, it's probably not a very realistic example. But it really pushes the methods and the standard methods are going to fall down in that kind of situation. Whereas this auto validation method, it seems to do what it should do. It gives you the results that you you need to get from that kind of situation. And so it's very exciting. I think we're waiting for some the results of more rigorous simulation studies that are being done by Chris Gotwalt and Phil Ramsay and a PhD student that are supervising. But it really does open up a whole a whole load of new opportunities. I think, Pete, it's just very exciting, isn't it? Peter Hersh Absolutely. Really exciting technique and thank you everyone for coming to the talk. phil Yeah, thank you.
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
XGBoost Add-In for JMP Pro (2020-US-45MP-540)
Monday, October 12, 2020
Russ Wolfinger, Director of Scientific Discovery and Genomics, Distinguished Research Fellow, JMP Mia Stephens, Principal Product Manager, JMP The XGBoost add-in for JMP Pro provides a point-and-click interface to the popular XGBoost open-source library for predictive modeling with extreme gradient boosted trees. Value-added functionality includes: • Repeated k-fold cross validation with out-of-fold predictions, plus a separate routine to create optimized k-fold validation columns • Ability to fit multiple Y responses in one run • Automated parameter search via JMP Design of Experiments (DOE) Fast Flexible Filling Design • Interactive graphical and statistical outputs • Model comparison interface • Profiling Export of JMP Scripting Language (JSL) and Python code for reproducibility Click the link above to download a zip file containing the journal and supplementary material shown in the tutorial. Note that the video shows XGBoost in the Predictive Modeling menu but when you install the add-in it will be under the Add-Ins menu. You may customize your menu however you wish using View > Customize > Menus and Toolbars. The add-in is available here: XGBoost Add-In for JMP Pro Auto-generated transcript... Speaker Transcript Russ Wolfinger Okay. Well, hello everyone. Welcome to my home here in Holly Springs, North Carolina. With the Covid craziness going on, this is kind of a new experience to do a virtual conference online, but I'm really excited to talk with you today and offer a tutorial on a brand new add in that we have for JMP Pro that implements the popular XGBoost functionality. So today for this tutorial, I'm going to walk you through kind of what we've got we've got. What I've got here is a JMP journal that will be available in the conference materials. And what I would encourage you to do, if you'd like to follow along yourself, you could pause the video right now and go to the conference materials, grab this journal. You can open it in your own version of JMP Pro, and as well as there's a link to install. You have to install an add in, if you go ahead and install that that you'll be able to reproduce everything I do here exactly at home, and even do some of your own playing around. So I'd encourage you to do that if you can. I do have my dog Charlie, he's in the background there. I hope he doesn't do anything embarrassing. He doesn't seem too excited right now, but he loves XGBoost as much as I do so, so let's get into it. XGBoost is a it's it's pretty incredible open source C++ library that's been around now for quite a few years. And the original theory was actually done by a couple of famous statisticians in the '90s, but then the University of Washington team picked up the ideas and implemented it. And it...I think where it really kind of came into its own was in the context of some Kaggle competitions. Where it started...once folks started using it and it was available, it literally started winning just about every tabular competition that Kaggle has been running over the last several years. And there's actually now several hundred examples online if you want to do some searching around, you'll find them. So I would view this as arguably the most popular and perhaps the most powerful tabular data predictive modeling methodology in the world right now. Of course there's competitors and for any one particular data set, you may see some differences, but kind of overall, it's very impressive. 
In fact, there are competitive packages out there now that do very similar kinds of things LightGBM from Microsoft and Catboost from Yandex. We won't go into them today, but pretty similar. Let's, uh, since we don't have a lot of time today, I don't want to belabor the motivations. But again, you've got this journal if you want to look into them more carefully. What I want to do is kind of give you the highlights of this journal and particularly give you some live demonstrations so you've got an idea of what's here. And then you'll be free to explore and try these things on your own, as time goes along. You will need...you need a functioning copy of JMP Pro 15 at the earliest, but if you can get your hands on JMP 16 early adopter, the official JMP 16 won't ship until next year, 2021, but you can obtain your early adopter version now and we are making enhancements there. So I would encourage you to get the latest JMP 16 Pro early adopter in order to obtain the most recent functionality of this add in...of this functionality. Now it's, it's, this is kind of an unusual new frame setup for JMP Pro. We have written a lot of C++ code right within Pro in order to integrate to get the XGBoost C++ API. And that's why...we do most of our work in C++ but there is an add in that accompanies this that installs the dynamic library and does does a little menu update for you, so you need...you need both Pro and you need to install the add in, in order to run it. So do grab JMP 16 pro early adopter if you can, in order to get the latest things and that's what I'll be showing today. Let's, let's dive right to do an example. And this is a brand new one that just came to my attention. It's got an a very interesting story behind it. The researcher behind these data is a professor named Jaivime Evarito. He is a an environmental scientist expert, a PhD, professor, assistant professor now at Utrecht University in the Netherlands and he's kindly provided this data with his permission, as well as the story behind it, in order to help others that were so...a bit of a drama around these data. I've made his his his colleague Jeffrey McDonnell collected all this data. These are data. The purpose is to study water streamflow run off in deforested areas around the world. And you can see, we've got 163 places here, most of the least half in the US and then around the world, different places when they were able to collect regions that have been cleared of trees and then they took some critical measurements in terms of what happened with the water runoff. And this kind of study is quite important for studying the environmental impacts of tree clearing and deforestation, as well as climate change, so it's quite timely. And happily for Jaivime at the time, they were able to publish a paper in the journal Nature, one of the top science journals in the world, a really nice experience for him to get it in there. Unfortunately, what happened next, though there was a competing lab that really became very critical of what they had done in this paper. And it turned out after a lot of back and forth and debate, the paper ended up being retracted, which was obviously a really difficult experience for Jaivime. I mean, he's been very gracious and let and sharing a story and hopes. to avoid this. And it turns out that what's at the root of the controversy, there were several, several other things, but what what the main beef was from the critics is... I may have done a boosted tree analysis. And it's a pretty straightforward model. 
There's only...we've only got maybe a handful of predictors, each of which are important but, and one of their main objectives was to determine which ones were the most important. He ran a boosted tree model with a single holdout validation set and published a validation hold out of around .7. Everything looked okay, but then the critics came along, they reanalyzed the data with a different hold out set and they get a validation hold out R square of less than .1. So quite a huge change. They're going from .7 to less than .1 and and this, this was used the critics really jumped all over this and tried to really discredit what was going on. Now, Jaivime, at this point, Jaivime, this happened last year and the paper was retracted earlier here in 2020... Jaivime shared the data with me this summer and my thinking was to do a little more rigorous cross validation analysis and actually do repeated K fold, instead of just a single hold out, in order to try to get to the bottom of this this discrepancy between two different holdouts. And what I did, we've got a new routine that comes with the XGBoost add in that creates K fold columns. And if you'll see the data set here, I've created these. For sake of time, we won't go into how to do that. But there is there is a new module now that comes with the heading called make K fold columns that will let you do it. And I did it in a stratified way. And interestingly, it calls JMP DOE under the hood. And the benefit of doing it that way is you can actually create orthogonal folds, which is not not very common. Here, let me do a quick distribution. That this was the, the holdout set that Jaivime did originally and he did stratify, which is a good idea, I think, as the response is a bit skewed. And then this was the holdout set that the critic used, and then here are the folds that I ended up using. I did three different schemes. And then the point I wanted to make here is that these folds are nicely kind of orthogonal, where we're getting kind of maximal information gain by doing K fold three separate times with kind of with three orthogonal sets. So, and then it turns out, because he was already using boosted trees, so the next thing to try is the XGBoost add in. And so I was really happy to find out about this data set and talk about it here. Now what happened...let me do another analysis here where I'm doing a one way on the on the validation sets. It turns out that I missed what I'm doing here is the responses, this water yield corrected. And I'm plotting that versus the the validation sets. it turned out that Jaivime in his training set, the top of the top four or five measurements all ended up in his training set, which I think this is kind of at the root of the problem. Whereas in the critics' set, they did...it was balanced, a little bit more, and in particular the worst...that the highest scoring location was in the validation set. And so this is a natural source for error because it's going beyond anything that was doing the training. And I think this is really a case where the K fold, a K fold type analysis is more compelling than just doing a single holdout set. I would argue that both of these single holdout sets have some bias to them and it's better to do more folds in which you stratify...distribute things differently each time and then see what happens after multiple fits. So you can see how the folds that I created here look in terms of distribution and then now let's run XGBoost. 
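As an aside, a rough Python analogue of building several fold columns for repeated K-fold is sketched below. It uses plain stratified random folds rather than the add-in's DOE-based orthogonal construction, and the response column is a simulated stand-in for the streamflow data.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(42)
    df = pd.DataFrame({"water_yield": rng.gamma(2.0, 150.0, size=163)})   # stand-in response

    # Stratify on a binned response so each fold sees both low and high yields.
    strata = pd.qcut(df["water_yield"], q=5, labels=False)

    for rep in range(1, 4):                                # three independent 5-fold schemes
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
        fold_col = np.empty(len(df), dtype=int)
        for fold_id, (_, test_idx) in enumerate(skf.split(df, strata), start=1):
            fold_col[test_idx] = fold_id
        df[f"Fold{rep}"] = fold_col

    print(df.head())
    print(pd.crosstab(df["Fold1"], df["Fold2"]))           # roughly balanced across schemes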
So the add in actually has a lot of features and I don't want to overwhelm you today, but again, I would encourage you to follow along and pause the video at places if you if you are trying to follow along yourself to make sure. But what we did here, I just ran a script. And by the way, everything in the journal has...JMP tables are nice, where you can save scripts. And so what I did here was run XGBoost from that script. Let me just for illustration, I'm going to rerun this again right from the menu. This will be the way that you might want to do it. So the when you install the add in, you hit predictive modeling and then XGBoost. So we added it right here to the predictive modeling menu. And so the way you would set this up is to specify the response. Here's Y. There are seven predictors, which we'll put in here as x's and then you put their fold columns and validation. I wanted to make a point here about those of you who are experienced JMP Pro users, XGBoost handles validation columns a bit differently than other JMP platforms. It's kind of an experimental framework at this point, but based on my experience, I find repeated K fold to be very a very compelling way to do and I wanted to set up the add in to make it easy. And so here I'm putting in these fold columns again that we created with the utility, and XGBoost will automatically do repeated K fold just by specifying it like we have here. If you wanted to do a single holdout like the original analyses, you can set that up just like normal, but you have to make the column continuous. That's a gotcha. And I know some of our early adopters got tripped up by this and it's a different convention than other Other XGBoost or other predictive modeling routines within JMP Pro, but this to me seemed to be the cleanest way to do it. And again, the recommended way would be to run repeated K fold like this, or at least a single K fold and then you can just hit okay. You'll get this initial launch window. And the thing about XGBoost, is it does have a lot of tuning parameters. The key ones are listed here in this box and you can play with these. And then it turns out there are a whole lot more, and they're hidden under this advanced options, which we don't have time at all for today. But we have tried to...these are the most important ones that you'll typically...for most cases you can just worry about them. And so what...let's let's run the...let's go ahead and run this again, just from here you can click the Go button and then XGBoost will run. Now I'm just running on a simple laptop here. This is a relatively small data set. And so right....we just did repeated, repeated fivefold three different things, just in a matter of seconds. XGBoost is pretty well tuned and will work well for larger data sets, but for this small example, let's see what happened. Now it turns out, this initial graph that comes out raises an immediate flag. What we're looking at here is the...over the number of iterations, the fitting iterations, we've got a training curve which is the basically the loss function that you want to go down. But then the solid curve is the validation curve. And you can see what happened here. Just after a few iterations occurred this curve bottomed out and then things got much worse. So this is actually a case where you would not want to use this default model. XGBoost is already overfited, which often will happen for smaller data sets like this and it does require the need for tuning. 
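The same overfitting diagnostic can be reproduced with the open-source xgboost package. This is a hedged sketch on synthetic data, not the add-in's output: track training and validation error per boosting iteration and find where the validation curve bottoms out.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(163, 7))                          # 7 predictors, as in the example
    y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(0, 1.5, 163)

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

    model = XGBRegressor(n_estimators=200, learning_rate=0.3, max_depth=6)
    model.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_va, y_va)], verbose=False)

    history = model.evals_result()
    val_rmse = history["validation_1"]["rmse"]
    best_iter = int(np.argmin(val_rmse))
    print(f"validation RMSE bottoms out at iteration {best_iter}: {val_rmse[best_iter]:.3f}")
    # If that minimum comes after only a few iterations, the default model is
    # overfitting and n_estimators should be cut back (or the learning rate lowered).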
There's a lot of other results at the bottom, but again, they wouldn't...I wouldn't consider them trustworthy. At this point, you would need...you would want to do a little tuning. For today, let's just do a little bit of manual tuning, but I would encourage you. We've actually implemented an entire framework for creating a tuning design, where you can specify a range of parameters and search over the design space and we again actually use JMP DOE. So it's a...we've got two different ways we're using DOE already here, both of which have really enhanced the functionality. For now, let's just do a little bit of manual tuning based on this graph. You can see if we can...zoom in on this graph and we see that the curve is bottoming out. Let me just have looks literally just after three or four iterations, so one thing, one real easy thing we can do is literally just just, let's just stop, which is stop training after four steps. And see what happens. By the way, notice what happened for our overfitting, our validation R square was actually negative, so quite bad. Definitely not a recommended model. But if we run we run this new model where we're just going to take four...only four steps, look at what happens. Much better validation R square. We're now up around .16 and in fact we let's try three just for fun. See what happens. Little bit worse. So you can see this is the kind of thing where you can play around. We've tried to set up this dialogue where it's amenable to that. And you can you can do some model comparisons on this table here at the beginning helps you. You can sort by different columns and find the best model and then down below, you can drill down on various modeling details. Let's stick with Model 2 here, and what we can do is... Let's only keep that one and you can clean up...you can clean up the models that you don't want, it'll remove the hidden ones. And so now we're down, just back down to the model that that we want to look at in more depth. Notice here our validation R square is .17 or so. So, which is, remember, this is actually falling out in between what Jaivime got originally and what the critic got. And I would view this as a much more reliable measure of R square because again it's computed over all, we actually ran 15 different modeling fits, fivefold three different times. So this is an average over those. So I think it's a much much cleaner and more reliable measure for how the model is performing. If you scroll down for any model that gets fit, there's quite a bit of output to go through. Again...again, JMP is very good about...we always try to have graphics near statistics that you can both see what's going on and attach numbers to them and these graphs are live as normal, nicely interactive. But you can see here, we've got a training actual versus predicted and validation. And things almost always do get worse for validation, but that's really what the reality is. And you can see again kind of where the errors are being made, and this is that this is that really high point, it often gets...this is the 45 degree line. So that that high measurement tend...and all the high ones tend to be under predicted, which is pretty normal. I think for for any kind of method like this, it's going to tend to want to shrink extreme values down and just to be conservative. And so you can see exactly where the errors are being made and to what degree. 
Now for Jaivime's key interest, they were...he was mostly interested in seeing which variables were really driving this water corrected effect. And we can see the one that leaps out kind of as number one is this one called PET. There are different ways of assessing variable importance in XGBoost. You can look at straight number of splits, as gain measure, which I think is maybe the best one to start with. It's kind of how much the model improves with each, each time you split on a certain variable. There's another one called cover. In this case, for any one of the three, this PET is emerging as kind of the most important. And so basically this quick analysis that that seems to be where the action is for these data. Now with JMP, there's actually more we can do. And you can see here under the modeling red triangle we've we've embellished quite a few new things. You can save predictive values and formulas, you can publish to model depot or formula depot and do more things there. We've even got routines to generate Python code, which is not just for scoring, but it's actually to do all the training and fitting, which is kind of a new thing, but will help those of you that want to transition from from JMP Pro over to Python. For here though, let's take a look at the profiler. And I have to have to offer a quick word of apology to my friend Brad Jones in an earlier video, I had forgotten to acknowledge that he was the inventor of the profiler. So this is, this is actually a case and kind of credit to him, where we're we're using it now in another way, which is to assess variable importance and how that each variable works. So to me it's a really compelling framework where we can...we can look at this. And Charlie...Charlie got right up off the couch when I mentioned that. He's right here by me now. And so, look at what happens...we can see the interesting thing is with this PET variable, we can see the key moment, it seems to be as soon as PET gets right above 1200 or so is when things really take off. And so it's a it's a really nice illustration of how the profiler works. And as far as I know, this is the first time...this is the first...this is the only software that offers plots like this, which kind of go beyond just these statistical measures of importance and show you exactly what's going on and help you interpret the boosted tree model. So really a nice, I think, kind of a nice way to do the analysis and I'd encourage that...and I'd encourage you try this out with your own data. Let's move on now to a other example and back to our journal. There's, as you can tell, there's a lot here. We don't have time naturally to go through everything. But we've we've just for sake of time, though, I wanted to kind of show you what happens when we have a binary target. What we just looked at was continuous. For that will use the old the diabetes data set, which has been around quite a while and it's available in the JMP sample library. And what this this data set is the same data but we've embellished it with some new scripts. And so if you get the journal and download it, you'll, you'll get this kind of enhanced version that has quite a few XGBoost runs with different with both a binary ordinal target and, as you remember, what this here we've got low and high measurements which are derived from this original Y variable, looking at looking at a response for diabetes. And we're going to go a little bit even further here. 
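Before moving on to the binary example, here is a rough Python sketch of the two interpretation steps just shown: variable importance by gain from xgboost, and a partial-dependence curve as a loose stand-in for JMP's profiler. The data and model below are synthetic placeholders.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.inspection import PartialDependenceDisplay
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(163, 7))
    y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(0, 1.5, 163)
    model = XGBRegressor(n_estimators=20, learning_rate=0.3, max_depth=4).fit(X, y)

    gain = model.get_booster().get_score(importance_type="gain")   # importance by gain
    print(sorted(gain.items(), key=lambda kv: -kv[1]))

    # Partial dependence of the prediction on feature 0, a rough analogue of
    # dragging that factor in the profiler to see how the prediction responds.
    PartialDependenceDisplay.from_estimator(model, X, features=[0])
    plt.show()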
Let's imagine we're in a kind of a medical context where we actually want to use a profit matrix. And our goal is to make a decision. We're going to predict each person, whether they're high or low but then I'm thinking about it, we realized that if a person is actually high, the stakes are a little bit higher. And so we're going to kind of double our profit or or loss, depending on whether the actual state is high. And of course, this is a way this works is typically correct...correct decisions are here and here. And then incorrect ones are here, and those are the ones...you want to think about all four cells when you're setting up a matrix like this. And here is going to do a simple one. And it's doubling and I don't know if you can attach real monetary values to these or not. That's actually a good thing if you're in that kind of scenario. Who knows, maybe we can consider these each to be a BitCoic, to be maybe around $10,000 each or something like that. Doesn't matter too much. It's more a matter of, we want to make an optimal decision based on our our predictions. So we're going to take this profit matrix into account when we, when we do our analysis now. It's actually only done after the model fitting. It's not directly used in the fitting itself. So we're going to run XGBoost now here, and we have a binary target. If you'll notice the the objective function has now changed to a logistic of log loss and that's what reflected here is this is the logistic log likelihood. And you can see...we can look at now with a binary target the the metrics that we use to assess it are a little bit different. Although if you notice, we do have profit columns which are computed from the profit matrix that we just looked at. But if you're in a scenario, maybe where you don't want to worry about so much a profit matrix, just kind of straight binary regression, you can look at common metrics like just accuracy, which is the reverse of misclassification rate, these F1 or Matthews correlation are good to look at, as well as an ROC analysis, which helps you balance specificity and sensitivity. So all of those are available. And you can you can drill down. One brand new thing I wanted to show that we're still working on a bit, is we've got a way now for you to play with your decision thresholding. And you can you can actually do this interactively now. And we've got a new ... a new thing which plots your profit by threshold. So this is a brand new graph that we're just starting to introduce into JMP Pro and you'll have to get the very latest JMP 16 early adopter in order to get this, but it does accommodate the decision matrix... or the profit matrix. And then another thing we're working on is you can derive an optimal threshold based on this matrix directly. I believe, in this case, it's actually still .5. And so this is kind of adds extra insight into the kind of things you may want to do if your real goal is to maximize profit. Otherwise, you're likely to want to balance specificity and sensitivity giving your context, but you've got the typical confusion matrices, which are shown here, as well as up here along with some graphs for both training and validation. And then the ROC curves. You also get the same kind of things that we saw earlier in terms of variable importances. And let's go ahead and do the profiler again since that's actually a nice... it's also nice in this, in this case. We can see exactly what's going on with each variable. 
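The profit-matrix thresholding described above can be sketched in a few lines of Python: score the four outcomes, sweep the decision threshold, and keep the one with the highest total profit. The probabilities and profit values here are illustrative only and do not come from the diabetes example.

    import numpy as np

    rng = np.random.default_rng(5)
    p_high = rng.uniform(0, 1, 400)                        # predicted P(high) from some model
    actual_high = rng.uniform(0, 1, 400) < p_high          # simulated true states

    # profit[(actual, decision)]: stakes are doubled when the true state is "high"
    profit = {(True, True): 2.0, (True, False): -2.0,
              (False, False): 1.0, (False, True): -1.0}

    def total_profit(threshold):
        decide_high = p_high >= threshold
        return sum(profit[(a, d)] for a, d in zip(actual_high, decide_high))

    thresholds = np.linspace(0.05, 0.95, 19)
    profits = [total_profit(t) for t in thresholds]
    best = thresholds[int(np.argmax(profits))]
    print(f"most profitable threshold in this sweep: {best:.2f}")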
We can see for example here LTG and BMI are the two most important variables and it looks like they both tend to go up as the response goes up so we can see that relationship directly. And in fact, sometimes with trees, you can get nonlinearities, like here with BMI. It's not clear if that's that's a real thing here, we might want to do more, more analyses or look at more models to make sure, maybe there is something real going on here with that little bump that we see. But these are kind of things that you can tease out, really fun and interesting to look at. So, so that's there to play with the diabetes data set. The journal has a lot more to it. There's two more examples that I won't show today for sake of time, but they are documented in detail in the XGBoost documentation. This is a, this is just a PDF document that goes into a lot of written detail about the add in and walk you step by step through these two examples. So, I encourage you to check those out. And then, the the journal also contains several different comparisons that have been done. You saw this purple purple matrix that we looked at. This was a study that was done at University of Pennsylvania, where they compare a whole series of machine learning methods to each other across a bunch of different data sets, and then compare how many times one, one outperform the other. And XGBoost came out as the top model and this this comparison wasn't always the best, but on average it tended to outperform all the other ones that you see here. So, yet some more evidence of the power and capabilities of this of this technique. Now there's some there's a few other things here that I won't get into. This Hill Valley one is interesting. It's a case where the trees did not work well at all. It's kind of a pathological situation but interesting to study, just so you just to help understand what's going on. We also have done some fairly extensive testing internally within R&D at JMP and a lot of those results are here across several different data sets. And again for sake of time, I won't go into those, but I would encourage you to check them out. They do...all of our results come along with the journal here and you can play with them across quite a few different domains and data sizes. So check those out. I will say just just for fun, our conclusion in terms of speed is summarized here in this little meme. We've got two different cars. Actually, this really does scream along and it it's tuned to utilize all of the...all the threads that you have in your GPU...in your CPU. And if you're on Windows, with an NVIDIA card, you can even tap into your GPU, which will often offer maybe another fivefold increase in speed. So a lot of fun there. So let me wrap up the tutorial at this point. And again, encourage you to check it out. I did want to offer a lot of thanks. Many people have been involved and I worry that actually, I probably I probably have overlooked some here, but I did at least want to acknowledge these folks. We've had a great early adopter group. And they provided really nice feedback from Marie, Diedrich and these guys at Villanova have actually already started using XGBoost in a classroom setting with success. So, so that was really great to hear about that. And a lot of people within JMP have been helping. Of course, this is building on the entire JMP infrastructure. So pretty much need to list the entire JMP division at some point with help with this, it's been so much fun working on it. 
And then I want to acknowledge our Life Sciences team, who have kept me honest on various things and have been helping out with a lot of suggestions. Luciano has actually implemented a different XGBoost add-in that goes with JMP Genomics, so I'd encourage you to check that out as well if you're using JMP Genomics; you can also call XGBoost directly within the predictive modeling framework there. So thank you very much for your attention, and I hope you can give XGBoost a try.
Labels:
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Designing Robust Products (2020-US-30MP-539)
Monday, October 12, 2020
Kevin Gallagher, Scientist, PPG Industries During the early days of Six Sigma deployment, many companies realized that there were limits to how much variation can be removed from an existing process. To get beyond those limits would require that products and processes be designed to be more robust and thus inherently less variable. In this presentation, the concept of product robustness will be explained followed by a demonstration of how to use JMP to develop robust products though case study examples. The presentation will illustrate JMP tools to: 1) visually assess robustness, 2) deploy Design of Experiments and subsequent analysis to identify the best product/process settings to achieve robustness, and 3) quantify the expected capability (via Monte Carlo simulation). The talk will also highlight why Split Plot and Definitive Screening Designs are among the most suitable designs for developing robust products. Auto-generated transcript... Speaker Transcript Kevin Hello, my name is Kevin Gallagher. I'll be talking about designing robust products today. I work for PPG industries which is headquartered in Pittsburgh, Pennsylvania, and our corporate headquarters is shown on the right hand side of the slide. PPG is a global leader in development of paints and coatings for a wide variety of applications, some of which are shown here. And I personally work in our Coatings Innovation Center in the northern suburb of Pittsburgh, where we have a strong focus on developing innovative new products. In the last 10 years the folks at this facility have developed over 600 US patents and we've received several customer and industry awards. I want to talk about how to develop robust products using design of experiments and JMP. So first question is, what do we mean by a robust product? And that is a product that delivers consistent results. And the strategy of designing a robust product is to purposely set control factors for inputs to the process, that we call X's, to desensitize the product or process to noise factors that are acting on the process. So noise factors are factors that are inputs to the process that can potentially influence the Y's, but for which we generally have little control, especially in the design of the product or process phase. Think about robust design. It's good to start with a process map that's augmented with variables that are associated with inputs and outputs of each process step. So if we think about an example of developing a coating for an automotive application, we first start with designing that coating formulation, then we manufacture it. Then it goes to our customers and they apply our coating to the vehicle and then you buy it and take it home and drive the vehicle. So when we think about robustness, we need to think about three things. We need to think about the output that's important to us. In this example, we're thinking about developing a premium appearance coating for an automotive vehicle. We need to think about some of the noise variables for which the Y due to the noise variable. And in this particular case, I want to focus on variables that are really in our customers' facilities. Not that they can't control thickness and booth temperature and an applicator settings, but there's always some natural variation around all of these parameters. 
And for us, we want to be able to focus on factors that we can control in the design of the product to make the product insensitive to those variables in our customers' process so they can consistently get a good appearance. So one way to really run a designed experiment around some of the factors that are known to cause that variability. This particular example, we could design a factorial design around booth humidity, applicator setting, and thickness. This assumes, of course, that you can simulate those noise variables in your laboratory, and in this case, we can. So we can run this same experiment on each of several prototype formulations; it could be just two as a comparison or it could be a whole design of experiments looking at different formulation designs. Once we have the data for this, one of the best ways to visualize the robustness of a product is to create a box plot. So I'm going to open up the data set comparing two prototype formulations tested over a range of application conditions, and in this case the appearance is measured so that higher values of appearance are better. So ideally we want, we'd like high values of appearance and then consistently good over all of the different noise conditions. So to look at this, we could, we can go to the Graph Builder. And we can take the appearance and make that our y value; prototype formulas are X values. And if we turn on the box plot and then add the points back, you can clearly see that one product has much less variation than the other, thus be more robust and on top of that, it has a better average. Now the box plots are nice because the box plots capture the middle 50% of the data and the whiskers go out to the maximum and minimum values, excluding the outliers. So it makes a very nice visual display of the robustness of a product. So now we want to talk about how do we use design of experiments to find settings that are best for developing a product that is robust. So as you know, when you design an experiment, the best way to analyze it is to build a model. Y is a function of x, as shown in the top right. And then once we have that model built, we can explore the relationship between the Y's and the X's with various tools in JMP, like in the bottom right hand corner, a contour plot and on and...also down there, prediction profiler. These allow us to explore what's called the response surface or how the response varies as a function of the changing values of the X factors. The key to finding a robust product is to find areas of that response surface where the surface is relatively flat. In that region it will be very insensitive to small variations in those input variables. An example here is a very simple example where there's just one y and one x And the relationship is shown here sort of a parabolic function. If we set the X at a higher value here where the, where the function is a little bit flatter, and we we have some sort of common cause variation in the input variable, that variation will be translated to a smaller amount of variation in the y, than if we had that x setting at a lower setting, as shown by the dotted red lines. In a similar way, we can have interactions that transmit more or less variation. This example we have an interaction between a noise variable and and a control variable x. And in this scenario, if there's again some common cause variation associated with that noise variable, if we have the X factor set at the low setting, that will transmit less variation to the y variable. 
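Here is a small numeric illustration of that "transmitted variation" idea, outside JMP: the same amount of noise in an input produces less variation in the response where the surface is flat. The quadratic function below is made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)

    def f(x):
        return 10 - 0.02 * (x - 80) ** 2       # parabolic response, flattest near x = 80

    noise = rng.normal(0, 2.0, 10000)           # common-cause variation in the input
    for set_point in (60, 80):
        y = f(set_point + noise)
        print(f"x set at {set_point}: std dev of y = {y.std():.3f}")
    # The flatter region (x = 80) transmits far less of the input variation to y.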
So now I want to share a second case study with you where we're going to demonstrate how to build a model, explore the response surface for flat areas where we could make our settings to have a robust product, and finally to evaluate the robustness using some predictive capability analysis. This particular example, a chemist is focused on finding the variables that are contributing to unacceptable variation in yellowness of the product and that yellowness is measured with a spectrum photometer with with the metric, b*. The team did a series of experiments to identify the important factors influencing yellowing, and the two most influential factors that they found were the reaction temperature and the rate of addition of one of the important ingredients. So they decided to develop full factorial design with some replicated center points, as shown on the right hand corner. Now, the team would like to have the yellowness value (b*) to be set to a target value of 2 but within a specification of 1 to 3. I'm going to go back into JMP and open up the second case study example. It's a small experiment here, where the factorial runs are shown in blue and the center points in red. And again, the metric of interest (B*) is listed here as well. Now the first thing we would normally do is fit, fit the experiment to the model that is best for that design. And in this particular case, we find a very good R square between the the yellowness and the factors that we're studying, and all of the factors appear to be statistically significant. So given that's the case, we can begin to explore the response surface using some other tools within JMP. One of the tools that we often like to use is the prediction profiler, because with this tool, we can explore different settings and look to find settings where we're going to get the yellowness predicted to be where we want it to be, a value of 2. But when it comes to finding robust settings, a really good tool to use is the the contour profiler. It's under factor profiling. And I'm going to put a contour right here at 3, because we said specification limits were 1 to 3 and at the high end (3), anywhere along this contour here the predicted value will be 3 and above this value into the shaded area will be above 3, out of our specification range. That means that anything in the white is expected to be within our specification limits. So right now the way we have it set up, anything that is less than a temperature at 80 and a rate anywhere between 1.5 and 4 should give us product that meets specifications on average. But what if the temperature in in the process that, when we scale this product up is, is something that we can't control completely accurate. So there's gonna be some amount of variation in the temperature. So how can we develop the product and come up with these set points so that the product will be insensitive to temperature variation? So in order to do that, or to think about that, it's often useful to add some contour grid lines to the contour plot overlay here. And I like to round off the low value in the increment, so that the the contours are at nice even numbers 1.5. 2, 2.5, and 3, going from left to right. So anywhere along this contour here should give us a predicted value of 2. But we want to be down here where the contours are close together or up here where they're further apart with respect to temperature. As the contours get further apart, that's an indication that we're nearing a flat spot in the in response surface. 
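The reason the contours spread apart can be seen directly from the interaction term of a fitted model. As a hedged sketch (the coefficients below are invented for illustration and are not the actual estimates from the talk's experiment), if the model is b* = b0 + b1*Temp + b2*Rate + b3*Temp*Rate, then the temperature sensitivity is b1 + b3*Rate, and the most robust rate setting is the one that drives that slope toward zero.

```python
# Hypothetical fitted model from a factorial with center points
# (coefficients are assumptions, not the talk's fit):
#   b* = b0 + b1*Temp + b2*Rate + b3*Temp*Rate
b0, b1, b2, b3 = -2.0, 0.20, 1.0, -0.05

# Temperature sensitivity depends on the rate setting: d(b*)/dTemp = b1 + b3*Rate
for rate in (1.5, 2.5, 4.0):
    print(f"rate = {rate}: d(b*)/dTemp = {b1 + b3 * rate:+.3f} per degree")

# The rate that makes the response flat with respect to temperature:
print("flattest rate =", -b1 / b3)   # 4.0 with these made-up coefficients
```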
So to be most robust to temperature, that's where we want to be, near the top here. A setting near 75 degrees and a rate of about 4 might be most ideal. We can see this also in the prediction profiler when we have these profilers linked, because at this setting we're predicting the b* to be 2. But notice the relationship between b* and temperature is relatively flat, whereas if I click down to this lower level, now even though the b* is still 2, the relationship between b* and temperature is very steep. So if we know something about how much variation is likely to occur in temperature when we scale this product up, we can actually use the model that we've built from our DOE to simulate the process capability into the future. The way we can do that with JMP is to open up the simulator option. It allows us to input random variation into the model in a number of different ways and then use the model to calculate the output for those selected input conditions. We could add random noise, like common cause variation that could be due to measurement variation and such, into the design. We can also put random variation into any of the factors. In this case we're talking about maybe having trouble controlling the temperature in production, so we might want to make that a random variable. It sets the mean to where I have it set, so I'm just going to drag it down a little bit to the very bottom, so it's about a mean of 70. And then JMP has a default of a standard deviation of 10. You can change that to whatever makes sense for the process that you're studying, but for now I'm just going to leave it at 10. You can choose to randomly select from any distribution that you want; I'm going to leave it at the normal distribution. I'm going to leave the rate fixed. So maybe in this scenario we can control the rate very accurately, but the temperature not as much. So we want to make sure we're selecting our set points for rate and temperature so that there is as little impact of temperature variation on the yellowness as possible. We can evaluate the results of this simulation by clicking the simulate-to-table (Make Table) button. Now what we have is 5,000 simulated rows; every row has a random selection of temperature from the distribution shown here, with the rate held fixed. Next we want to compare these simulated responses to the specification limits that we have for this product, and we can do that with the process capability analysis. Since I already have the specification limits as a column property, they're automatically filled in, but if you didn't have them filled in, you can type them in here. Simply click OK, and now it shows us the capability analysis for this particular product. It shows us the lower spec limit, the upper spec limit, and the target value, and it overlays those over the distribution of responses from our simulation. In this particular case, the results don't look too promising, because a large percentage of the product seems to be outside of the specification. In fact, 30% of it is outside. And if we use the capability index Cpk, which compares the specification range to the range of process variation, we see that the Cpk is not very good at 0.3.
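The mechanics behind that simulator step can be sketched outside JMP as well. The following Python snippet is illustrative only: the local temperature slope, the residual noise level, and therefore the resulting Cpk and percent out of spec are assumptions and will not reproduce the talk's exact numbers, but the idea is the same: propagate the temperature noise through the model, then compare the simulated distribution to the spec limits.

```python
import numpy as np

rng = np.random.default_rng(7)
LSL, USL = 1.0, 3.0   # spec limits on b* from the talk

# Hypothetical local model at a non-robust set point: b* changes about
# 0.125 units per degree, temperature wanders N(70, 10), plus small noise.
temp = rng.normal(70, 10, 100_000)
b_star = 2.0 + 0.125 * (temp - 70) + rng.normal(0, 0.1, temp.size)

mean, sd = b_star.mean(), b_star.std(ddof=1)
cpk = min(USL - mean, mean - LSL) / (3 * sd)      # capability vs spec range
pct_out = 100 * np.mean((b_star < LSL) | (b_star > USL))
print(f"mean={mean:.2f}  sd={sd:.2f}  Cpk={cpk:.2f}  out of spec={pct_out:.1f}%")
```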
Labels (10):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
0 attendees
0
0
On Missing Random Effects in Machine Learning (2020-US-45MP-534)
Monday, October 12, 2020
Fabio D'Ottaviano, R&D Statistician, Dow Inc Wenzhao Yang, Dr, Dow Inc The large availability of undesigned data, a by-product of chemical industrial research and manufacturing, makes the venturesome use of machine learning, with its plug-and-play appeal, attractive in an attempt to extract value out of this data. Often this type of data does not only reflect the response to controlled variation but also to that caused by random effects not tagged in the data. Thus, machine learning based models in this industry may easily miss active random effects. This presentation uses simulation in JMP to show the effect of missing a random effect via machine learning, versus including it properly via mixed models as a benchmark, in a context commonly encountered in the chemical industry (mixture experiments with process variables), and as a function of relative cluster size, total variance, proportion of variance attributed to the random effect, and data size. Simulation was employed because it allows the comparison of missing vs. not missing random effects to be made clearly and simply while avoiding unwanted confounders found in real-world data. Besides the long-established fact that machine learning performs better the larger the size of the data, it was also observed that data lacking due specificity (i.e., without clustering information) causes critical prediction biases regardless of the data size. This presentation is based on a published paper of the same title. Auto-generated transcript... Speaker Transcript Fabio D'Ottaviano Okay, thanks everybody for watching this video. As you can see, I'll be talking about missing random effects in machine learning. This is work I did together with my colleague Wenzhao Yang; we both work for Dow Chemical Company in R&D, where we help develop new processes and, mainly, new products. What you see on this screen is a big bingo cage, because this talk is going to be about simulation, and simulation, at least to me, has a lot to do with a bingo cage: you decide the distribution of the balls and numbers inside the cage, and then you keep picking them as you want. All right. This talk also relates to a publication we made recently with the same title, and if you have access to this presentation, you can just click here and you'll have access to the entire paper. So here is just a summary of what we published there. Okay, what's the context for this? Well, first of all, machine learning has a kind of plug-and-play appeal to non-statisticians: you don't have to assume anything, and that's attractive. Besides, there is very user-friendly software out there these days, so people like to use it. However, random effects are everywhere, and random effects are a funny thing, because the concept is a little bit more complex, so it tends not to be touched in basic statistics courses; it shows up as a more advanced subject. So you're going to get a lot of people doing machine learning without a lot of understanding of random effects. And even when they have that understanding, the concept of random effects still doesn't carry much weight with people doing machine learning, because there are just a few algorithms that can use random effects.
You can check the reference here, where you see there are some trees and random forests that can take random effects, but they are recent and not spread everywhere. So you're going to have a hard time finding something that can handle random effects in machine learning. Let me talk a little bit about random effects. As you can see here, at least in the chemical industry where I come from, we typically mix, say, three components, right? These yellow, red, and green ones. We set the percentage of each one of these components to different levels, and then we measure the responses as we change the percentages of these components, with a certain piece of equipment, and sometimes you even have an operator or lab technician who will also interfere with the result that you want to see. Okay. And when we do this kind of experiment, we want to generalize the findings, or whatever prediction we are trying to get here. But the problem is that when I'm mixing this green component here, if I buy it next time from the supplier that supplies me this green component, the green shade may vary, and I don't know whether the next batch of this green component the supplier gives me is going to be exactly the same green, because there is variability in supply. On top of that, I may run my experiment using a certain piece of equipment, but if I look around in my lab, or in other labs, I may have different makes of this equipment. And on top of that, maybe the measurement depends on the operator who is doing it, right? So that may also interfere and kind of impoverish my prediction, or my generalization to whatever I want to predict. Besides that, there is what is most typical, I guess, in the chemical industry, which is experimental batch variability over time, if you repeat the same thing over and over again. Let's say you run an experiment here and you get your model, and your model can predict something; but then you repeat that experiment and get another model, and then another model, and the predictions of these three models may be considerably different, not negligibly so. So there is also the component of time. So what's the problem I'm talking about here? Well, typically we have stored data, historical data, let's say a collection of bits and pieces of data produced in the past, and people were not that much concerned with generalizing the results beyond the time they ran those experiments. So when we collect them and call it historical data, we may or may not have tags for the random effects, right? And having tags, at least where I come from, is more the exception than the rule; having no tags for the random effects, at least not for all of them, is typical. Let's say you have tags. One thing you can do is to use a machine learning technique that can handle these random effects and let them into the model. And that's it, you don't have a problem. But then, as I said, there are not very many machine learning techniques that can handle random effects. You may be tempted to use machine learning.
And let the random effects into the model as if they were fixed and then you're going to run into these you know very well known problem that you should treat the random effect this fixed Just to say one thing you're going to have a hard time to predict any new all to come because, for example, if your random effect is European later you have only a few Operators in your data, right, a few names, but if there is a new operator doing days, you don't. You cannot predict what the effect of this new operator is going to be. So, here there is no deal And then there's one thing you can do. You do. You should have or you should don't have tax revenue we sacrifice to us again any machine learning technique. And if you have random, you should have the tags you ignore the random effect or if you don't. Anyway, you're going to be ignoring it. Whether you like it or not. So what I want to do is less simulating shooter. We use jump rope fishing. And you know, I hope you enjoyed the results. The simulation, basically. So I will use a mixed effect model right with fixed and random effect. And then we use that same model to estimate With the response to this to make the model after I simulate and also model, the results of my simulation here with a neural net right Then we use this model here as the predictive performance of this model here. As a benchmark and we use it to predict performance of the near on the edge to, you know, compare later they're taking their test set are squares to see what's going to happen. You find meets the random effect here, right. Then, okay. Sometimes I when I talk about these people sometimes think that I'm comparing a linear mixed effect model versus, you know, machine learning neural net. And that's not the case, you know, here we are comparing a model with and without random effect. Even that there is a random effect in the data. I could do. For example, a Linear Model with run them effects versus a linear model without to bring them effects. And I could do a neural net with random effects versus in urine that without random effect. But the problem is that today there is no wonder and that, for example, that can handle random effect. So I forced to use. For example, a linear mixed effect of all My simulation factors. Well, I'll use something that is typically in the industry, which is a mixture with process variable model what it is. Let's say I have those three components. I showed you before. Know the yellow, red, green, and they have percent certain percentage and they get up to one. Have a continuous variable which, for example, can be temperature. I have a categorical variable that can be capitalist type and I have a random effect which can be very ugly from batch to batch of experiments. Okay. The simulation model. Well, it's a pretty simple one I have here my mixture main effects M one M two and three. Right. And you will see all over this model that the fixed effects all have either one or minus one. I just assigned one minus one randomly to them. So I have the mixture main effects. Here I have the mixture two ways interaction, the interaction of the mixture would be continuous variable. And the interaction of the categorical variable with the components. And finally, the introduction of the continuous variable with the categorical variable. Plus I have these be, Why here we choose my random effect and the Ej, which is my random error. Right. And both are normally distributed with certain variance here. 
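As a rough sketch of this simulation setup, here is a Python version (rather than JSL, which the talk uses). The plus/minus one coefficients and the two variance components follow the description above, but the specific model terms, the Dirichlet sampling of the mixture, and the function and column names are my own simplified stand-ins, not the published setup.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def simulate(n_rows, n_batches, sigma_between, sigma_within):
    """Simplified stand-in for a mixture-with-process-variables simulation
    with a random batch effect (assumed structure, not the paper's exact model)."""
    m = rng.dirichlet([1, 1, 1], n_rows)       # 3 mixture components summing to 1
    z = rng.uniform(-1, 1, n_rows)             # continuous process variable
    c = rng.integers(0, 2, n_rows)             # categorical process variable
    # assumes n_rows is divisible by n_batches
    batch = np.repeat(np.arange(n_batches), n_rows // n_batches)

    # Fixed part with +/-1 coefficients: main effects plus a few interactions
    fixed = (m[:, 0] - m[:, 1] + m[:, 2]
             + m[:, 0] * m[:, 1] - m[:, 0] * z + np.where(c == 1, 1, -1) * z)
    b = rng.normal(0, sigma_between, n_batches)[batch]   # random batch effect
    e = rng.normal(0, sigma_within, n_rows)              # random error
    return pd.DataFrame(dict(m1=m[:, 0], m2=m[:, 1], m3=m[:, 2], z=z, c=c,
                             batch=batch, y_expected=fixed, y=fixed + b + e))

# e.g. 100 rows, 25 batches, equal between/within variance components
dt = simulate(n_rows=100, n_batches=25, sigma_between=0.5, sigma_within=0.5)
```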
I said, the variance between a better of experiments, right, and uses divergence within the match of batch of experiment. From all over this presentation, just to make this whole formula in represent a forming the more I say neat way or use this form a hero X actually represents all the fixes effects and beta Represents all the parameters that I used. Right. And my why here. Actually, the expected. Why is actually XP right it's this whole thing without my random effect here and we dealt my Renault mayor. Simulation parameters. Well, here I have one which is data size, right, the one that she. What happens if I have no, not so much data and More data layers and more data right here I have two levels 110,000 roles at every set of experiment here have actually 20 rows perfect effect than 200. The other thing I will vary is going to be D decides of the badge for the cluster, whatever you like it. Sometimes is going to be. It's, I have two levels 4% and 25% so 4% means if I have 100 rolls of these one batch of experience my batch, we're going to Will be actually for roles. So I'm going to have 2525 batches. If I have 100 rolls in total out in my batch sizes 25% and I have only four batches. Then the other variable we change here is going to be the total variance. Right. And well, we have two levels here, point, five and 2.5 is half of effect effect size right to choose. So the formula here. It is all ones for the fixed effect. And the other one is to write and the summation this total variance is the summation of my variation between batches and within batches very a variance. Right. And lastly, the other thing I will change is the ratio of between two within very Similar segments. Right. So I have one and four. So in one my Between batch variation is going to be equal to winning and the other one is going to be four times bigger than winning Then, once I settled is for four factors here. I say parameters and then you do a full factorial, do we wear our have 16 runs right to To two levels of data size two levels of batch size two levels of total variance Angelo's was the desert. With that, I can calculate it within between within segments accordingly. Right. And that's the setup for simulation. Okay. Now, I call it simulation hyper parameter, because you can change in as we What I would do, and I'll show you in the demo. It's would 30 simulation risk or do we run. So every one of the 16 what I did is I run 30 times each. Right. So, for example, I'll have a simulation run 123 up to dirty and for the fixed effects. The, the level of difficulty effects. I use the space feeling design. And the reason why I use this space filling design is that don't want to confound the effect of missing there and then effect with the fact that possibly I have some some calling the charity or sparse data. Which is typical thing in historical data. Right. I don't want that in the middle of my way. I want to, I prefer to design and space feeling design that we spread the Levels of the fixed effects across the input space. So if I get rid of this problem of sparse hitting the data or clean the oddity right and then we you allocate to the batch we randomize batch cross the runs in the first round, and then use the same Sequence across all the other 29 runs. So all the runs, we have the same batch of location. In late. And lastly, all the Santa location will be randomized for every one of the simulation runs. So let me just get out of the air and start to jump. 
So he would do is, I used to do we special purposed spilling design. I'll load my factor here just for being I want to be fast here. So anyway, here I have my 12345 fixed effects. Space feeling designs don't accept a mixture of variables. So you need to set linear constraints here just to tell look these three guys here and need to add up to one. So that's what I'm going to do here. Alright. So with that, a satisfied that constraint will give an example of the first run of this d which is where I have data size 100 relative batch size 4% total variation total variance point five and the ratio is going to be one. So if I go back here and need to put I need to generate 100 runs. Also, if I want to replicate this theory and you have to set the random seed and the number of starts right Then the only option I have when I said constraint is the fast faster flexible filling design. And here we go I get the table here right so you can include this table. One thing you see is if you use a ternary blocked and you use your three components. You see that everything is a spread out. Oops. I have a problem here that she Didn't Let me go back. There's a problem with the constraints, yeah. I forgot the one here. Yep. All right, let's start all over again. Hundred Set Random seat 21234 And number starts. Great. Next book feeling make table. Yeah. And then I need just to check if it is all spread out. And find out. Yes. Alright, so then I look at my categorical variable here. I want to see if it spread out for all of them. As you can see this for one and two. Great. Now I said let me close all that we do 30 simulations. So this is one SIMULATION RIGHT. I HAVE 100 roles. But now I need to do 30. So what I will do is to add roles here. 2900 At the end And we just select this first 1000 runs sorry 100 runs all the variables, a year. And feel to the end of the table. Great. So now I repeat the same design 30 times right Now, To make it faster would just open. I'm using this table again though. Just use another table to where I have everything I wanted to show already set up. So yeah, I have back to this table right what I have. Next thing I would do is just to create a simulation column here just to tell look With this formula here. I can tell that simulations up to 100 row hundred this simulation one and then every hundred you change the simulation numbers. So at the end of the day I get obviously 30 simulations with 100 rolls. Each great Then the batch location, just to explain what they did. I just showed that in the PowerPoint. I have a farm. Now you will create two batches of 4% the size of the total data size, which means I have four rows per batch here than four and so on. And once the I get to 100 here and I'm jumping from simulation. Want to simulation to then it starts all over again. Right. So I have at the end of the day. All the 25 batches here. Okay. Then the next one thing I will do is to create my validation Column, which means I need to split the set right so Back to this demonstration. Back to the PowerPoint here, you see that for the solution that I'm going to create but the neuron that I had Is divided the roles 60% of them will belong to the training set 20% to the validation set and the other 22 the test set. So how do I do that in that case again back to john There we go. Let me hide this Okay, so here's to validate the validation come. How do I do that. 
It's already there, but don't explain how you do that you go to Analyze Make validation column you use your simulation column has a certification, you go and then you just do Point 6.5 2.2 and a user a random seed here just to make sure you see that's how I create that column right Then if I go back to my presentation here. All the 60% that belongs to the training set for the new year and that 10 to 20% that Belongs to the validation set also for the near net. Now they both belong to a set called modeling set for the mix effects. So the mix effect. Model, there will be no validation. There were just estimate the model with this 80% and the test set of the mixed effect solution that will be the same 20% that I use for the new year on that solution. So in that case, I go back to jump and Go here this And it just great to hear formula where you know zero means the validation of the neural net to zero means training so training, be still training. One is validation setting is going to be my modeling such and two is my test set, and it's going to be my test set here too. So I created these and then you column name for all your labels zero is going to be modeling and one is going to be desk so that way you see here that whatever stashed here he says here, but what you're straining a validation becomes modeling right then finally I need to set the formulas for my response. So for the expected value of my simulation. I just have here to fix effects formula right there is no random term here. All right. And that's my expected value my wife and my wife i j is going to be This and you look at the formula you have the y which is my expected value plus here I have a formula for generating A random Value following a normal distribution with mean zero and between sigma sigma Between sigma. It's a i i like to set to the variables here and as a table variable because I can change the value later as a week without going to the formula. But anyway, this is going to generate a single value every time you change the batch number. So if my batch is Here that's going to be the same value when I change it to 22 it creates another value. And you when I change from one simulation to another. So I will have one value for 25 per batch 25 our simulation one and then when I jumped to batch one of simulation to then it creates another value. Right. And here is just my normal random number with with things sigma that I set on the table here, right. So see some replicating the deal we run one I have sigma 05 and 05 Alright, so then now I have here solution for my mix effects model simulation right before that, let me go back here and show you what I am doing. For example, for the mix effect model. My simulation mode is this, but my feet that model will be the issue might be the analysis be the hat. And my small be here is going to be be hacked, right. So, it is our beach to meet the values for whatever I simulated And then in the mix effect model. I have to to less a prediction model. One is fitting conditional model when I use The my estimation of the random effects and the other one is my marginal when I don't. Right. So I have these two types of model. This is good to predict things that I have in the data already. And this is something I used to predict data around don't have an entire data set, right. For the new year and that I'm using a, you know, the standard default kind of near and at the end and jump, which you know i'm not just using because it's difficult because he pretty much works. 
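Continuing the same sketch, a simplified (unstratified) version of the split just described, 60/20/20 for the neural net with training and validation pooled into one "modeling" set for the mixed model, could look like this. In the talk this is done per simulation with JMP's Make Validation Column; the snippet below is only a rough stand-in and the column names are mine.

```python
import numpy as np

rng = np.random.default_rng(3)

# 0 = training, 1 = validation, 2 = test for the neural net;
# the mixed model pools 0 and 1 into one "modeling" set and shares the test set.
nn_set = rng.choice([0, 1, 2], size=len(dt), p=[0.6, 0.2, 0.2])
dt["nn_set"] = nn_set
dt["mixed_set"] = np.where(nn_set == 2, "test", "modeling")
```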
You have here all the five fixed effects. I use one layer with three nodes all hyperbolic tangent functions as you can see here And then you have here a function called h x which is the summation of district functions, plus a bias term here, right. So if I add more nodes. It wouldn't make it any better. You find only use two nodes then it gets even worse. So I'm going to use this all over. And that's what I'm going to show you here. My show. Oops. Okay. Show. Me go back to you. So Here I have my Mixed effect model solution. How did I come up with that. Just to show you. I have here the response I put validation validation of the mix effects by simulation. All my fixed effects a year and my random effect is my batch right and then a genetic this first simulation. You see simulation one and it goes all the way the simulation Turkey right I couldn't use, for example, that simulate function of jump here because I'm changing the validation column for every batch, so I cannot, at least I don't know how to do it, how to incorporate the validation column in the formula of I white G. And. Okay. So, oops, back here, then now I have another script here for it and you're in that it's going to take a little bit Shouldn't be a big problem. When I'm doing for example. The runs the do you runs. We've 1000 rows per simulation that can take courses from all of time something like maybe 10 to 15 minutes To do the auditory simulations at the same time here should throw up some And There we go. Okay, so here I have, again, my if you look every one of them have five three notes right okay and you have simulation one all the way to simulation 30 Right. So now I have all that done for for run one of my year, right, so next thing I need to understand is what type of our squares. I'm going to compare right there are actually five types of our squares here, right, so here's the r squared formula. Why, why do I differentiate di squares by type here because it depends on what you use this actual versus predicted in this formula. You know your square change. So for example, I have here. Oh, these are type of r squared away or compare For example, The Rosa turning the training set. What I simulated versus the form I got for the neural net here because when you're in it. And you know, that's actually The case, for example, in all these three are the same thing because I'm comparing my wife and my wife hats are the same. So I have type eight right Then I have another type of call it type be where I compare. I don't, I'm not comparing the simulated value with the Random effect in the random error term I'm comparing the expected value of my simulation versus the form they go So these makes me sent to the test set rules are always the same as just the way you calculate the r squared is going to change because in this way here. When I have what I call the conditional test set. I see the parent future performance because that's exactly what we get when you have any data set, because we cannot tell the real We don't have the real model that's for that you need to simulate and then you have the expected test set, which is actually the same rose, but now I'm comparing the Expected value. And I can tell like for the lack of a better word, a real future performance. So the apparent performed is not necessarily the real future performance. OK. For the mix effect model is the same. 
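A rough Python equivalent of the two solutions described, a linear mixed model with batch as a random intercept as the benchmark and a single-hidden-layer neural net with three tanh nodes, might look like the sketch below (using statsmodels and scikit-learn). The fixed-effect formula is a simplified stand-in, and the neural net here is fit on the pooled modeling set rather than with a separate validation hold-out, so this is not the presenter's exact workflow.

```python
import statsmodels.formula.api as smf
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

model_rows = dt[dt["mixed_set"] == "modeling"]
test_rows = dt[dt["mixed_set"] == "test"]
features = ["m1", "m2", "m3", "z", "c"]

# Benchmark: linear mixed model with a random intercept per batch.
# m3 is dropped from the formula because the mixture components sum to 1.
mm = smf.mixedlm("y ~ m1 + m2 + z + C(c) + z:C(c)", model_rows,
                 groups=model_rows["batch"]).fit()
marginal_pred = mm.predict(test_rows)      # fixed effects only (marginal model)

# Neural net with one hidden layer of three tanh nodes, as in the talk.
nn = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                  max_iter=5000, random_state=0)
nn.fit(model_rows[features], model_rows["y"])
nn_pred = nn.predict(test_rows[features])

print("mixed model test R^2:", r2_score(test_rows["y"], marginal_pred))
print("neural net  test R^2:", r2_score(test_rows["y"], nn_pred))
```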
Now I have another type of r squared, because here I'm comparing the simulated value versus my Conditional prediction farmland here and using the estimate to have for the random effects, but when I want to predict the future. So, well, no one to break the test sets both conditions here, I have another square here d which is comparing the whole simulation value for I Why i j versus my marginal model here. I'm not using be here right and Leslie to have a fifth type of our square which is my expected that sent Again, the test set is always the same roles is just that I'm using here. Now, the expected value versus my margin. A lot. So the problem is If you're not careful there. You may calculate wrong guy r squared. So what I do is, whenever I have here. And if I had to mix effect mode. I don't use anything that is in this report, all I do is to save the columns here I saved my Marginal model right prediction farm in here you have saved the prediction formula of the conditional model right and I will create columns with this formless that's for the mix effect model. Now for the Near in that or you can also save the formulas. I like to say fest formulas, because I just want to calculate to our squares. So I was saved as fast formulas and then what you see as I create this five columns here. Alright so let me go to them. So now my type A, if you remember Taipei from the presentation here type am comparing the simulated why i j versus my near in that model. So what I do here. Sorry, what I do here is I go to call them the info and you see here. Predicting what and predicting d y AJ Okay, here it is. Now I have here saves De Niro and that from the twice as he does value is equal to this value. But now I would just change predicting what here. This is predicting the expected value. Right, so that way I can use this formula is functioning jump here which is model comparison I can go to use this type A and I do buy a simulation and I grew up by validation And then what you get. It's all Dr squares you need To see from From simulation one All down to simulation 30 and you have it by set so you can later do combine data table and you get everything neat, right, for those are squares. So for the other ones I have also script here for example for to be He was the peace formula. Now right this column, and I only predicted that set the r squared for the test set, not for the modeling, not for the validation set. So the simulation. One is a year. And then if you go all the way down here you have simulation 30. And again, you can always combine data table and your data comes out like the same table format for all of these are squares. And Daniel, obviously I can have another script here for the type co fire squares we choose the modeling set of the mixed effect model right and simulation one all the way to modern set to simulation 30 and you know the do the same. See now on test set, but now I'm predicting why i j, right. So, the, the, say, the secret here is that you have one even sometimes in the lake. Here again I have the same formula because it's my marginal model of the Mix effect model solution. The only thing that changed is in your column me for you. Make sure you have predicting what and then you can use this to calculate all these are squares. All right. 
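Continuing the sketch above, the distinction between the "conditional" and "expected" test-set R^2, which is only observable in a simulation where the noise-free expected value is known, reduces to which column is used as the actual value. This covers just two of the five R^2 types described in the talk, and the naming below is mine, not JMP's.

```python
from sklearn.metrics import r2_score

# "Conditional" test R^2: predictions vs the simulated y (random effect and
# error included) -- the apparent future performance.
# "Expected" test R^2: predictions vs the noise-free expected value -- the
# real future performance, available only because this is a simulation.
for name, pred in [("mixed model (marginal)", marginal_pred),
                   ("neural net", nn_pred)]:
    r2_cond = r2_score(test_rows["y"], pred)
    r2_exp = r2_score(test_rows["y_expected"], pred)
    print(f"{name}: conditional R^2 = {r2_cond:.2f}, expected R^2 = {r2_exp:.2f}")
```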
Let me go back to presentation and now since I got all those are squares together you stack your tables and then you can do the visualization, you want But here I'm interested really in the conditional test set of both solutions and the expected that said here, you know, I can spend a could spend 45 minutes just talking about this table display here. But all I I'm not really interested in the absolute values of the r squared, but more comparatively kind of a way of comparing our square But I need days just to check one thing which is, as you can see here when the data size here as you can see Them use the pointer here. Make it easier. I have all the are squares. I created here versus all the do you factors. Right. So you see that when I have a small data set, what happens is, I'm my near and that's being trained correctly because my training. And my validation sets they kind of have our square distribution that overlap. But then when you look at the conditional test set, which is actually the data we always have right because we never have the expected value. It's always at the lower level right as you see all for all these when I have this small data set, but then when I have a bigger data set. The situation is different with 1000 roles, then the are all aligned, you know, kind of overlap. So I did train these correctly. Again, the absolute value of our square here is not have much interest what they really care is how, you know, if you go back to One of the earlier slide here, you see that now I want to compare. You see, I have to, I want you to compare the predictive the performance of my benchmark mixed effect model versus my neural net compared to test that dark square, right. So, here what you get. Let me get this mold. So what you get here is again all the verbs. I had for to do we, and whether they Disease and your net solution or didn't mix effects solution. So, you see that for the conditional test set, which is the one part in performance. So if you're in during that you always you know when the data sizes are small. The mix effect Maltese always doing a better job here when you include the the The random effect right versus then you're and that's because there's always this Their median or or even average are all higher, but then when you have a bigger data set, then You know that difference kind of doesn't exist anymore to to a certain point, that even the new one that just doing a better job here, but that's the current performance. Now the real one. You see now that the mix effect model has been given a better job than a deed versus internet And here there is no more, you know, even grounds, you know, because at the end of the day, direct effect model. He's doing a better job, especially in this scenario here which is big data bigger data. Last sets or variability and more between den with invited right now find those lines I have here is going to do this is going to do for every simulation run is going to do the difference between the mix effect. R square versus the new year and that are square. So, here we go. Here I have four plots right so let's just concentrating one what you have in the y axis is the air conditioner square. Then mix minus the difference in conditioner square, sorry, the The mix effect r squared minus then you're in that square. So, that's what you get here, and that is the difference in a pattern to future performance. 
And here in the x axis, what you have is this difference in the expected R square we choose your real future performance, right, or bias. If you want Now I have four blocks. Why, because if you think about that when you have historical data where you don't have the tech, you know, You know, if you're analyzing the data, you just have possibly control over two things, which is the data size and the relative batch size. Why, because you cannot control what the variability is going to be in your data. And if your random effects are going to be much bigger than your random air. So the only two things that you can possibly Have any control over is data size and relative batch size. If you don't have the tag you can at least have an idea if, you know, use your historical data. Should be comprised of many, many bedrooms, just one or two batch. Right. So that's the kind of control you have You can at least have an idea of the batch size when you'll have statistical data. So what I'm comparing here then again I have the difference in apparent performance issue just differences positive, it means the mixed effect model has a better performance. If this difference here is also positive. It means that mix effect as a better future performance, right. And as you can see here when you have a small data set. It doesn't matter what you do. And mix effect model has a better performance and sometimes way much better because will come, talking about differences in R squared, that can go way over ones. Right, so he's getting much better performance you do pot into one or the future one. So when the data sizes are small, there's really no No, no solution here. However, when you look at the data size bigger data sides right but when you have this small amount of batches. Right. Here it's something funny happens because here on you know the difference enough funding future performance Y axis is negative, most of them. Which means the near and that to doing a better job in terms of patent at test set Tahrir Square right or conditional Touch that dark square. So, when you do it. Who's going to look like the new urine. That'd be great job better than The mix effect model. However, when you look at the lot of the the x axis, right, which is the difference in real future performance, it can be pretty much misleading. Right, so here you when you have a lot of data, but to just a few batches, you know, you're going to get nice Test, test set are squares. But then when you try to deploy your mold in the future. You may get into trouble. But then when you look here. Here we have a mitigation situation where you have a lot of data and a lot of batches. So they tend to be not that much different. Right, so As a conclusion, you should use a non negligible random effecting machine learning when the data set is a small, you know, the test set predictive performance will most likely be poor. Regardless, how many clusters of batches, you have. And that's because machine learning requires the minimum data size for success. Right. So there's no No way to win the game here. Now, when the data size is large and you just have a few clusters. And that's kind of misleading situation because your test set predict the performance can be good, but the performance, we would likely be Brewer later when you deploy the model. 
Some people tell me, Well, why don't you use regularization said what even if you will, you will you will not do it in these situations because Your test set R squared is going to be can be good but and then you don't know you need it right so you won't be able to tell You know, what is your long term future performance, just by looking at your tests at dark square or some of some kind of some of errors. But then when your data set is large in you have many clusters day and the whole situation is mitigated and the biasing effect of the closer kind of average out because every random effect, you know, the summation of all of them. It's zero. So the more you have the latest by as you can to get On top, you know, just wanted to say that one that learned what I learned from that is that when the data is not designed on purpose, there's two things I always remember Machine land cannot tackle at data just because it is big. You got to have a minimum level of design right to make it work. But the bigger the data, the more likely it is minimal level of design is already present in the data just by sheer chance. All right. And thank you, if you want to contact us. We are in the jump community. These are our addresses. Thank you.
Labels (9):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
0 attendees
0
0
Variance Budgets (2020-US-45MP-531)
Monday, October 12, 2020
Ronald Andrews, Sr. Process Engineer, Bausch + Lomb How do we set internal process specs when there are multiple process parameters that impact a key product measure? We need a process to divide up the total variability allowed into separate and probably unequal buckets. Since variances are additive, it is much easier to allocate a variance for each process parameter than a deviation. We start with a list of potential contributors to the product parameter of interest. A cause-and-effect diagram is a useful tool. Then we gather the sensitivity information that is already known. We sort out what we know and don’t know and plan some DOEs to fill in the gaps. We can test our predictor variances by combining them to predict the total variance in the product. If our prediction of the total product variability falls short of actual experience, we need to add a TBD factor. Once we have a comprehensive model, we can start budgeting. Variance budgeting can be just as arbitrary as financial budgeting. We can look for low hanging fruit that can easily be improved. We may have to budget some financial resources to embark on projects to improve factors to meet our variance budget goals. Auto-generated transcript... Speaker Transcript Ronald Andrews Well, good morning or afternoon as the case may be. My name is Ron Andrews and topic of the day is variance budgeting. Oh, I need to share my screen. And there's a file, we will be getting to. And we'll get to start with PowerPoint. So variance budgeting is the topic. I'm a process engineer at Bausch and Lomb; got contact information here. My supervision requires this disclaimer. They don't necessarily want to take credit for what I say today. Overview of what we're going to talk about What is the variance budget? A little bit of history. When do we need one? We have some examples. We'll go through the elements of the process, cause and effect diagram, gather the foreknowledge, do some DOEs to fill in the gaps, Monte Carlo simulations, as required. And we've got a test case will work through. So really, what is a variance budget? Mechanical engineers like to talk about tolerance stack-up. Well tolerance stack-up is basically a corollary Murphy's Law, that being all tolerances will add unit directionally in the direction that can do the most harm. Variance budget is like a tolerance stack-up, except that instead of budgeting the parameter itself, we budget the variance -- sigma squared. We're relying on more or less normal shape distributions, rather than uniform distributions. Variances are additive, makes the budgeting process a whole lot easier than trying to budget something like standard deviations. Brief example here. If we used test-and-sort or test-and-adjust strategies, our distributions are going to look more like these uniform distributions. So if we have the distribution with the width of 1 and one with a width of 2 and other with a 3, we add them all together, we end up with a distribution with a width of pretty close to 6. In this case, we probably need to budget the tolerances more than the variances. ...If we rely on process control, our distributions will be more normal. In this case, if we have a normal distribution with a standard deviation of 1, standard deviation of 2, standard deviation of 3, we add them up, we end up with standard deviation of 3.7, lot less than six. So we do the numbers 1 squared plus 2 squared plus 3 squared equals essentially 3.7 squared. 
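The arithmetic behind "variances are additive" is worth a two-line check. This is a trivial sketch using the slide's 1/2/3 example values, nothing more.

```python
import math

# Independent, roughly normal contributors add in variance, not standard deviation:
sds = [1.0, 2.0, 3.0]
print(math.sqrt(sum(s ** 2 for s in sds)))   # ~3.74, not 6

# Worst-case (tolerance stack-up) thinking adds the widths directly:
widths = [1.0, 2.0, 3.0]
print(sum(widths))                           # 6
```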
Now to be fair, on that previous slide, if I added up these variances, they would have added up to the variance of this one. But when you have something other than a normal distribution, you have to pay attention to the shape down near the tail. It depends on where you can set your specs. So, What is the variance budget? Non normal distributions are going to require special attention and we'll get to those later. For now variance budget is kind of like a financial budget. They can be just as arbitrary. There only three basic rules. We translate everything into common currency. Now we do this for each product measure of interest, but we translate all the relevant process variables into their contribution to the product measure of interest. Rule number two is fairly simple. Don't budget more than 100% of the allowed variance. Yeah, sounds simple. I've seen this rule violated more than once in more than one company. Number three. This goes for life in general, as well as engineering, use your best judgment at all times. Little bit of history. This is not rocket science. Other people must be doing something similar. I have searched the literature and I have not been able to find published accounts of a process similar to this. I'm sure it's out there, but I have not found any published accounts yet. So for me the history came back in the 1980s, when I worked at Kodak with a challenge for management. Challenge was produce film with no variation perceived by customers. Actually what they originally said produce film with no variation. no perceivable variations. They define that as a Six Sigma shift would be less than one just noticeable difference. Kodak was pretty good on the perceptual stuff and all these just noticeable differences were defined, we knew exactly what they were. For a slide film like Kodachrome, which is what I was working on the... that's what I was working on at the time, color balance was the biggest challenge. Here, this streamline cause and effect diagram, color balance is a function of the green speed, the blue speed and the red speed. Now I've sort of fleshed out one of these legs. The red speed, I got the cyan absorber dye and then one of the emulsions as the factors that contribute to the speed of that, that affects the red speed, that affects the color balance. This is a very simplified version. There are actually three different emulsions in the red record, there are three more in the green record. There are two more in the blue record. Add up everything, they're 75 factors that all contribute to color balance. These are not just potential contributors. These are actually demonstrated contributors. So this is a fairly daunting task. So moving on to when we need a variance budget. Get a little tongue in cheek decision tree here. Do we have a mess in the house? If not, life is good. If so, how many kids do we have? If one, we probably know where the responsibility lies. If more than that and we probably need a budget. This is an example of some work we did a number of years ago on a contact lens project at Bausch and Lomb. This is long before it got out the door to the marketplace. We were having trouble meeting our diameter specs. plus or minus two tenths of a millimeter We were having trouble meeting that. We looked at a lot of sources of variability and we managed to characterize each one. So lot to lot. And this is with the same input materials and same set points, fairly large variability. Lens to lens within a lot, lower variability. Monomer component No. 
1, we change lots occasionally, extreme variability. Monomer component No. 2, also had a fairly large variability. Now we mix our monomers together and we have a pretty good process with pretty good precision. It's not perfect and we can estimate the variability from that. That's a pretty small contributor. We put the monomer in a mold and put it under cure lamps to ??? it and the intensity of the lamps can make a difference. There we can estimate that source of variability as well. We add all these distributions up and this is our overall distribution. It does go belong...beyond the spec limits on both ends. Standard deviation of .082 And as I mentioned, spec elements of plus and minus .2 that gives us a PPk of .81. Not so good. Percent out of spec estimated at 1.5% It might have been passable if it was really that good, but it wasn't. This estimate assumes each lens is an independent event. They're not. We make the lenses in lots and there's...every lot has a certain set of raw materials in a certain set of starting conditions. That within a lot, there's a lot of the correlation. And two of the components I mentioned, two monomer components that had sizable contributions, there's looking here, occasionally you can see the yellow line and the pink line. These are the variability introduced by these two monomer components. When they're both on the same side of the center line, they push the diameter out towards the spec limits and we have some other sources of variability that add to the possibilities. Another problem is that our .2 limit is for an individual lens. We did this...we disposition based on lots. And so this plot predicts lot averages, though, when we get a lot average out to .175, chances are we're going to have enough lenses beyond the limit that failed a lot. So in all, added up our estimate is 4% of the lots are going to be discarded. And they're going to come in bunches. We're going to have weeks when we can't get half of our lots through the system. So this is non starter. We have to make some major improvements. To the lot-to-lot variability from two monomer components contributed a good chunk of that variability. We looked and found that the overall purity of Monomer 1 was a significant factor and certain impurities in Monomer 2, when present, were contributors. Our chemists looked at the synthetic routes for these ingredients and found that there was a single starting material that contributed most of the impurities. They recommended that our suppliers distill this starting ingredient to eliminate the impurities. That made some major improvements. We also put variacs on the cure lamps to control the intensity. Lamp intensity was not a big factor, but this was easy. And when it's easy, you make the improvement. Strictly speaking, this was a variance assessment, rather than a variance budget. We never actually assigned numeric goals for each component. This is back...we're kind of picking the low-hanging fruit. I mean, we found two factors that pretty much accounted for a large portion of the variability Maybe we need a little bit better structure to reach the higher branches, now that we need to reach up higher. Current status on lens dimension, lens diameter. PPk is 2.1. The product's on the market now, has been for a few years. This is not the problem anymore. We've made major...major improvements in these momoer components. We're still working on them. They still have detectable variability; detectable, but it hasn't been a problem in a long time. 
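For the lens-diameter example, the quoted Ppk of about 0.81 and roughly 1.5% out of spec follow directly from the overall standard deviation of 0.082 and the plus/minus 0.2 mm limits, assuming a centered, roughly normal distribution and independent lenses (the independence assumption is exactly what the talk points out breaks down when lenses are made in lots). A quick sketch of that calculation:

```python
from scipy.stats import norm

mean, sd = 0.0, 0.082      # lens diameter deviation from target, mm
LSL, USL = -0.2, 0.2       # spec limits of +/- 0.2 mm around target

ppk = min(USL - mean, mean - LSL) / (3 * sd)
pct_out = 100 * (norm.cdf(LSL, mean, sd) + norm.sf(USL, mean, sd))
print(f"Ppk = {ppk:.2f}, expected out of spec = {pct_out:.1f}%")
# ~0.81 and ~1.5%, consistent with the talk under the stated assumptions.
```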
So the basic question is, what do we do to apply data to a variance budget? Maybe reduce that arbitrariness a little bit. We have to start by choosing a product measure in need of improvement. We need to identify the potential contributors, cause and effect diagrams, a convenient tool. We need to gather some foreknowledge. We need to know the sensitivity. The product measure divided by the process measure; what's the slope of that line? We, we are going to need some DOEs to fill in the gaps. We need to estimate the degree of difficulty for improving some of these factors. And we estimate the variance from each component and then we divide that variance, the total variance goal among the contributors. Sounds easy enough. Let's get into an example. let's say we're we're working on a new project. And along the way, we have a new product measure called CMT (stands for cascaded modulation transfer) to measure overall image sharpness. Kind of important for contact lenses. Target is 100, plus or minus 10. We want a PPK of at least 1.33 That means standard deviation's got to be 2.5 or less. Variance has got to be 6.25 or less. What factors might be involved? Let's think about a cause and effect diagram. We can go into JMP and create a table. We start by listing CMT in the parent column. Then we list each of our manufacturing steps in the child column. And then we start listing these child factors over on the parent's side and then we start listing subfactors. These subfactors are obviously generic and arbitrary, the whole thing's hypothetical. And we can go as many levels as we want. We can have as many branches in the diagram as we care to, but we've identified 14 potential factors here. So we go into the appropriate dialog box, identify the parent column and the child column. Click the OK button and out pops the cause and effect diagram. Brief aside here. I've been using JMP for 30 years now. I have very, very few regrets. This is one of them. And my regret is, I only found this last year. I don't know, actually, when this tool was implemented. I wish I had found it earlier because this is the easiest way I found to generate a cause and effect diagram. So we need to gather the sensitivity data. Physics sometimes will give us the answer. In optics, if we know the refractive index and the radius of curvature, that can give us some information about the optical power of the lens. Sometimes physics, oftentimes we need experimental data. So, ask the subject matter experts. Maybe somebody's done some experiments that will give us an idea. We're going to need some well-designed experiments because no way have all 14 of those factors been covered. Several notches down on the list, in my opinion, is historical data. And if you've used historical data to generate models, you know, some of the concerns I'm nervous about. We need to be very cautious with this. Historical data, it's usually messy; it has a lot of misidentified numbers, sometimes things in the wrong column, it needs a lot of cleaning. There's also a lot, also a lot of correlation between factors. Standard practice is to reserve 25% of the data points randomly, reserve that data for confirmation, generate the model with 75% of the data, and then test it with a 25% reserve data. If it works, maybe we have something worth using. If not, don't touch it. So gathering foreknowledge, we want to ask subject matter experts independently to contribute any sensitivity data they have. 
I'm taking a page from a presentation last year at the Discovery Summit by Cy Wegman and Wayne Levin. This is their suggestion in gathering foreknowledge to avoid the loudest voice in the room rules syndrome. Sometimes there's a quiet engineer sitting in the back who may have important information to impart, may or may not speak up. So we want to get that information. Ask everybody independently to start with. Then get people together and discuss the discrepancies. There will be some. Where are the gaps? What parameters still need sensitivity or distribution information? What parameters can we discount? I'd like to find these. What parameters are conditional? Doesn't happen very often, but in our contact lens process, we include biological indicators in every sterilization cycle. These indicators are intentionally biased so that false positives are far more likely than false negatives. When we get a failure in this test, we sterilize again. We know our sterilization routine was probably right, but we sterilize again. So sometimes we sterilize twice. That can have a small effect on our dimensions. It's small, but measurable. So we're going to need to plan some experiments to gather the sensitivities for things we don't know about. And we'll look at production distribution data; use it with caution to generate sensitivity. We can use it to generate information on the variability of each of the components and the overall variability of the product measure of interest. We need to do some record keeping along the way. We can start with that table we used to generate the cause and effect diagram, add a few more columns. Fill in the sensitivities, units of measure, various columns. Any kind of table will do. Just keep the records and keep them up to date. We're going to need some DOEs to fill in the gaps. There are some newer techniques -- definitive screening designs, group orthogonal super saturated designs -- provide a good bang for the buck when the situation fits. Now in this particular situation, we got 14 factors. We asked our subject matter experts. Some of them have enough experience to predict some directional information, but nobody has a good estimate of the actual slopes. So we need to evaluate 14 factors. I'd love to run a DSD that doesn't require 33 runs, I don't have the budget for it. So we're going to resort to the custom DOE. So, go to the custom DOE function and then...been using PowerPoint for long enough now...time we demonstrated a few things live in JMP. That would go to DOE custom design. And you don't have to, but it's a very good practice to fill in the response information (if i could type it right). Match target from 90 to 110. Importance of 1, only makes a difference if we have more than one response. The factors. I have my file, so I can load these quickly there. Here we have all 14 of the factors. This factor constraints, I've never used it. But I know it's there if some combination of factors would be dangerous. I know that we can disallow it. The model specification. This is probably the most important part. This is basically a screening operation. We're just going to look at the main effects. Now our subject matter experts suggested the interactions are not likely. And nonlinearity is possible but not likely to be strong. So we're going to ignore those for now, at least for the screening experiment. We don't need to block this. We don't need extra center points. For 14 main effects, JMP says a minimum of 15, that's a given, default 20. 
I've learned that if I have the budget to run the default, that's a good place to start. I can do 20 runs; 33 was too much, but I can manage 20. Let's make this design. I left this in design units; it's a hypothetical example, and I didn't feel like replacing these arbitrary units with other arbitrary units. We've got a whole suite of design evaluation tools, and there are a couple that I normally look at. First, the power analysis. If the root mean square error estimate of 1 is somewhere in the ballpark, then these power estimates are going to be somewhere in the ballpark. They're .9 and above; pretty good, I like that. The other thing I normally look at is the color map on correlations, and I like to actually make it a color map. It's kind of complicated: we've got 14 main effects, and I honestly haven't counted all the two-way interactions. What we're looking for is confounded effects, where we have two red blocks in the same row. Well, I don't see that; that's good. We've got some dark blue where there's no correlation, some light blue where there's a slight correlation, and some light pink where maybe it's a .6 correlation coefficient. This is tolerable. As long as we don't have complete confounding, we can probably estimate what's causing the effect. This is good, so we move on and make the table.

Well, this is our design, with space to fill in the results. I'm going to take a page from the Julia Child school of cooking: do the prep for you, put it in the oven, and then take out of the oven a previously prepared file that already has the executed experiment. These are the results. The CMT values, we wanted them between 90 and 110. We've got a couple here in the 80s, there's a 110.5, and we've got a 111 here. Looks like we have a little work to do.

Let's analyze the data. Everything's all set up for us: there's the response, here are all the factors, we want the screening report, click Run. An R-squared of .999; yeah, you can tell this is fake data. I probably should have set the noise factor a little higher than this. The first six factors are highly significant; the next eight, not so much. I was lazy when I generated it; I put something in there for the first six. Now, typically we eliminate the insignificant factors. We can either eliminate them all at once or one at a time; I tend to do it one at a time, eliminating the least significant factor each time and seeing what it does to the other numbers. Sometimes they change, sometimes they don't. Eliminate this one, and it looks like Cure1 slipped in under the wire at .0478, just under .05. I doubt that it's a big deal, but we'll leave it in there. So we look at the residuals; those look kind of random, and that's good. The studentized residuals are also kind of random.

We need to look at the parameter estimates; this is what we paid for. These regression coefficients are the slopes we were looking for. These are the sensitivities. That's why we did the experiment. I'm a visual kind of guy, so I always look at the prediction profiler. I look at the plot of the slopes and at the confidence intervals, which are pretty small; here you can just barely see that there's a blue shaded interval. I also like to use the simulator when I have some information about the inputs, so that we can enter the variability for each of these. Now, if you'll allow me to use the Julia Child approach again, I'll go back to the previously prepared example where I've already input the variation on each one of these.
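Before moving on to the simulator, here is a rough stand-in for the screening fit just described. This is not the JMP Fit Model platform: it is a short Python sketch that fits a main-effects regression to simulated data and drops the least significant factor one at a time, with the surviving coefficients playing the role of the sensitivities. The factor names, design, and response are all made up.

```python
# Minimal sketch (not the JMP workflow itself): a main-effects screening fit with
# one-at-a-time backward elimination, i.e., drop the least significant factor and
# watch what happens to the remaining estimates. Factors and response are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
factors = [f"X{i}" for i in range(1, 15)]                        # 14 hypothetical factors
X = pd.DataFrame(rng.choice([-1.0, 1.0], size=(20, 14)), columns=factors)
true_beta = np.r_[[2.0, 1.5, 1.2, 0.9, 0.8, 0.6], np.zeros(8)]   # only 6 active effects
y = 100 + X.values @ true_beta + rng.normal(0, 0.5, 20)

keep = list(factors)
while True:
    fit = sm.OLS(y, sm.add_constant(X[keep])).fit()
    pvals = fit.pvalues.drop("const")
    worst = pvals.idxmax()
    if pvals[worst] < 0.05:
        break
    keep.remove(worst)                                           # drop least significant

print(fit.params.round(3))     # remaining coefficients ~ the sensitivities (slopes)
print(round(fit.rsquared, 4))
```

The coefficients that survive elimination are the analogue of the parameter estimates the talk treats as sensitivities for the variance budget.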
For Mold Tool 1, I input an expression that results in a bimodal distribution, and for Mold Tool 2, I input a uniform distribution. And I've got to say, in defense of my friends in the tool room, the bimodal distribution only happens in a situation like what happened last month, where the tools we wanted to use were busy on the production floor, so for the experiment we used some old iterations. We actually mixed iterations; when that happens, we can get a bimodal distribution. The uniform distribution never happens with these guys; they're always shooting for the center point and usually come within a couple of microns. The other distributions are all normal, with various widths; in one case there's a bit of a bias to it. These are the input distributions, and here's our predicted output. Even though we had some non-normal starting distributions, we have pretty much a normal output distribution. It does extend beyond our targets; we kind of knew that. Now, the default when you start here is 5,000 runs. I usually increase this to something like 100,000. It didn't take any appreciable extra time to execute, and it gives you slightly smoother distributions.

It also produces a table; we can make the table and move it over here. The big advantage of this is that we can get (I don't want the CMT yet)... let's look at the distributions of the input factors. This is a bigger, fancier plot: this is our bimodal distribution, the uniform, these various normal distributions of various widths, and this one with a bit of a bias to it. So we take all of those and add them up, and we get a distribution that looks pretty normal, even though some of the inputs were not normal, so we can use conventional techniques on it. When we start setting the specs, it does extend beyond our spec limits, so we're going to need to make some improvements in this. Scroll down here and look at the capability information: a Ppk of .6. That's a nonstarter; no way is manufacturing going to accept a process like this. So we need to make some significant improvements.

So, back to the PowerPoint file, and I'll scroll through the slides that were my backup in case I had a problem with the live version of JMP (because of me having the problem, not JMP). Here we have the factors. The standard deviations come from our preproduction trials estimating the variability, and the sensitivities are the results from our DOE.
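For readers who want to see the simulation idea outside of JMP, here is a hedged Python sketch that mimics, but does not reproduce, the Profiler simulator: it draws 100,000 values for each input from an assumed distribution, pushes them through a hypothetical fitted linear model, and computes the resulting Ppk. Every distribution and coefficient below is an illustrative assumption, not the talk's actual data.

```python
# Minimal sketch (not the JMP Profiler simulator): push assumed input distributions
# through a hypothetical fitted linear model and check the capability of the
# simulated output. All sensitivities and input distributions are stand-ins.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000                                    # like raising the default 5,000 runs

# Hypothetical input draws
mold1 = np.where(rng.random(n) < 0.5,          # bimodal: mixture of two normals
                 rng.normal(-2.0, 0.5, n),
                 rng.normal(2.0, 0.5, n))
mold2 = rng.uniform(-2.0, 2.0, n)              # uniform
cure1 = rng.normal(0.0, 1.5, n)                # normal
coat1 = rng.normal(0.3, 1.0, n)                # normal with a slight bias

# Hypothetical fitted model: intercept plus the screening-DOE sensitivities
cmt = 100 + 2.0 * mold1 + 1.5 * mold2 + 0.9 * cure1 + 1.2 * coat1

lsl, usl = 90.0, 110.0
mu, sigma = cmt.mean(), cmt.std(ddof=1)
ppk = min(usl - mu, mu - lsl) / (3 * sigma)
print(f"mean {mu:.2f}, sd {sigma:.2f}, Ppk {ppk:.2f}")
print(f"fraction outside spec: {np.mean((cmt < lsl) | (cmt > usl)):.3f}")
```

With these made-up inputs the simulated output looks roughly normal even though two of the inputs are not, and the Ppk lands well below 1.33, which is the same kind of picture that motivates reallocating the variance budget.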
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Anabolic, Aphrodisiac or Analgesic?
Monday, November 30, 2020
Level: All  When developing a new analysis tool in JMP, we inevitably have to decide among the following priorities: 1) add more muscle to make the product more powerful; 2) make it sexier and more exciting; or 3) focus on relieving pain so it is less frustrating and less laborious. Put in long words that start with A, the choices are anabolic, aphrodisiac, or analgesic; which one should we pick? John Sall says the analgesic is the answer: pain relief should be the central motivation in development. Of course, these three are not mutually exclusive; adding an exciting and powerful feature can also relieve pain. But pain relief is the key, because pain paralyzes us, saps our motivation, and lets us extract only a fraction of the information our data could give us.
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Case Study – The Use of Gaussian Process for Analyzing Computer Generated Experiments
Monday, November 30, 2020
Level: Intermediate

Gaussian Process (GP) is one of several analysis techniques used to build approximation models for computer-generated experiments. Generally, a space-filling design is used to guide the computer experimentation because all the parameters/variables are derived from, or pulled directly from, first-principles physics models/equations. Space-filling designs are used because the data generated by computer experiments is deterministic and likely to be highly nonlinear. This is where GP comes into play: because the data is deterministic, GP will attempt to fit every point in the design exactly, allowing for a close approximation of the true model. We will compare GP to Response Surface and Neural Net models. We will also compare GP models derived from different types of space-filling designs.

Gaussian Process
- Typically used to build models for computer simulation experiments. The data is deterministic, so there is no need to run an experiment more than once; a given set of inputs will always produce the same answer.
- Also known as kriging.
- More than 100 conditions will take a long time to compute a solution. JMP Pro has Fast GASP for larger data sets, which breaks the GP into blocks, allowing for faster computation. You can also have categorical inputs with JMP Pro.

Model Options for Gaussian Fit
- Estimate Nugget Parameter: useful if there is noise or randomness in the response and you would like the prediction model to smooth over the noise instead of fitting it perfectly. Highly recommended.
- Correlation Type: lets you choose the correlation structure used in the model. Gaussian allows the correlation between two responses to always be non-zero, no matter the distance between the points; Cubic allows the correlation between two responses to be zero for points far enough apart.
- Minimum Theta Value: lets you set the minimum theta value used in the fitted model.

Variance vs. Bias
For most designed experiments the goal is to minimize the variance of prediction. Because computer experiments are deterministic there is no variance, but there is bias: the difference between the approximation model and the true mathematical function. Space-filling designs are used in an effort to bound the bias.

Borehole Example

Types of Space-Filling Designs in JMP
- Sphere Packing: maximizes the minimum distance between design points.
- Latin Hypercube: maximizes the minimum distance between design points but requires even spacing of the levels for each factor.
- Uniform: minimizes the discrepancy between the design points and a theoretical uniform distribution.
- Minimum Potential: spreads points inside a sphere around a centroid.
- Maximum Entropy: measures the amount of information contained in the distribution of a set of data.
- Gaussian Process IMSE Optimal: creates a design that minimizes the integrated mean square error (IMSE) of the Gaussian process over the experimental region.
- Fast Flexible Filling (FFF): uses clusters of random points to choose design points according to an optimization criterion; can be constrained.

Summary of Fit
- Fit the Gaussian process with and without the Nugget parameter and check the jackknife fit.
- Neural Net models offer a good alternative to Gaussian models but can be more complicated; NN models sometimes outperform Gaussian models. Use the smoothing function for Neural Nets (JMP Pro).
- Don't rely on R² alone when deciding on the best-fit model.
Picking the right model is about keeping the model as simple as possible while still getting reasonable prediction.

Gaussian Process Resources
- Comparison of different GP packages (from 2017)
- Borehole model example in the JMP 14 DOE Guide, Chapter 21, p. 637
- Discovery Summit 2011 presentation: Meta-Modeling of Computational Models – Challenges and Opportunities
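As a rough companion to this abstract, and not a substitute for JMP's Gaussian Process platform, here is a small Python sketch that samples a made-up deterministic function with a Latin hypercube design and fits a GP surrogate with scikit-learn. The WhiteKernel term stands in for the nugget parameter described above, and the hold-out check is a simple stand-in for the jackknife fit; the toy function and all settings are assumptions for illustration only.

```python
# Minimal sketch (not JMP's Gaussian Process platform): fit a GP surrogate to a
# deterministic "computer experiment" sampled with a Latin hypercube design.
# The toy simulator below is made up; WhiteKernel plays the role of the nugget.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def simulator(X):
    """Deterministic stand-in for a physics-based computer model."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.sin(3 * x1) + x2 ** 2 + 0.5 * x1 * x3

# Space-filling (Latin hypercube) design in [0, 1]^3
X_train = qmc.LatinHypercube(d=3, seed=0).random(n=30)
y_train = simulator(X_train)

# GP with a Gaussian (RBF) correlation; a small WhiteKernel ~ estimated nugget
kernel = 1.0 * RBF(length_scale=[0.3, 0.3, 0.3]) + WhiteKernel(noise_level=1e-5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)

# Check prediction quality on fresh space-filling points (a stand-in for jackknife)
X_test = qmc.LatinHypercube(d=3, seed=1).random(n=200)
print("R^2 on held-out points:", round(gp.score(X_test, simulator(X_test)), 3))
```

Checking the surrogate on held-out space-filling points, rather than relying on R² at the training points alone, follows the spirit of the summary-of-fit advice above.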
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Design and Analysis of a Multiple Response Mixture Experiment for a Dry Etch Wafer Process
Monday, November 30, 2020
Level: Intermediate

Designed experiments for dry etch equipment present challenges for semiconductor engineers. First, because the total gas flow rate is often fixed, a mixture design must be used to honor the constraint imposed by the fixed total flow; these types of designs are not commonly seen in the semiconductor industry. Second, as is often the case with these experiments, the investigator is interested in optimizing more than one response. In this presentation, you will see an example of how to design and analyze a seven-factor experiment for a dry etch tool and simultaneously optimize an overall wafer target value while minimizing within-wafer variability.

Overview
- Eight-factor experiment for a dry etch process: three process gases (A, B, C) and five process factors (power, pressure, temperature, time, total flow).
- The experimenter was interested in both the gas ratios and the total gas flow. To keep total flow and gas ratios as uncorrelated as possible, a mixture design was used.
- In addition, the experimenter wanted to bound the ratios for two of the gases between an upper and a lower value. The third gas, C, was to make up no less than 10% and no more than 25% of the total mixture.

What is a Mixture Design?
A mixture design is used when the quantities of two or more experimental factors must sum to a fixed amount. The inclusion of non-mixture components (i.e., factors that are not part of the mixture) makes designing this experiment challenging. Mixture designs emphasize prediction over factor screening; for that reason, mixture factors are not removed from the experiment even when they are not significant (they may be set to 0, however).

Mixture Design Challenges
- Effects are highly correlated and are harder to estimate.
- Squared mixture terms are confounded with (are a linear function of) mixture factor main effects and two-factor interactions.
- Main effects for non-mixture factors are correlated with the two-factor interactions between that non-mixture factor and the mixture factors.
- Focusing on prediction and use of the Profiler (instead of parameter estimation and significance) makes designing and interpreting mixture experiments much easier.

Experimental Responses
Response | Goal
ER | Target = 100
ER Std | Minimize

Experimental Factors
Factor | Low | High
Power | 25 | 75
Press | 100 | 200
Temp | 25 | 40
Time | 30 | 45
Total Flow | 80 | 120
Gas A | 0 | 1
Gas B | 0 | 1
Gas C | 0.1 | 0.25

Experimental Constraints
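To illustrate the mixture constraint described in this abstract (this is not the JMP Custom Design workflow), the short Python sketch below generates candidate runs in which the three gas fractions sum to 1, gas C stays between 0.10 and 0.25, and the non-mixture process factors vary independently over the listed ranges. The rejection-sampling approach and the run count are illustrative assumptions, not an optimal design.

```python
# Minimal sketch (not JMP Custom Design): candidate runs that respect the mixture
# constraint A + B + C = 1 with 0.10 <= C <= 0.25, plus independent non-mixture
# process factors over the ranges from the abstract. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def random_gas_mixture(n):
    """Draw gas fractions on the simplex, keeping only rows with C in [0.10, 0.25]."""
    out = []
    while len(out) < n:
        frac = rng.dirichlet([1.0, 1.0, 1.0], size=4 * n)   # points on A+B+C=1
        keep = (frac[:, 2] >= 0.10) & (frac[:, 2] <= 0.25)
        out.extend(frac[keep].tolist())
    return np.array(out[:n])

n_runs = 24
gases = random_gas_mixture(n_runs)              # columns: Gas A, Gas B, Gas C
process = np.column_stack([
    rng.uniform(25, 75, n_runs),                # Power
    rng.uniform(100, 200, n_runs),              # Pressure
    rng.uniform(25, 40, n_runs),                # Temperature
    rng.uniform(30, 45, n_runs),                # Time
    rng.uniform(80, 120, n_runs),               # Total flow
])

candidates = np.hstack([gases, process])
assert np.allclose(candidates[:, :3].sum(axis=1), 1.0)   # mixture sums to 1
print(candidates[:3].round(3))                            # first few candidate runs
```

These candidates only honor the constraints; choosing an efficient subset of runs (and keeping total flow and the gas ratios nearly uncorrelated) is the part the custom mixture design in the presentation handles.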
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis