Jeremy Ash, JMP Analytics Software Tester, JMP

The Model Driven Multivariate Control Chart (MDMVCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high-dimensional data sets. We demonstrate MDMVCC monitoring of a PLS model using the simulation of a real-world industrial chemical process, the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts, and diagnostic plots. MDMVCC provides a user-friendly way to move between these plots. Next, we demonstrate how MDMVCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMVCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults.

Example Files

Download and extract streaming_example.zip. There is a README file with some additional setup instructions that you will need to perform before following along with the example in the video. There are also additional fault diagnosis examples provided. Message me on the Community if you find any issues or have any questions.

Auto-generated transcript

Jeremy Ash: Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology. Today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP.
I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP. I'm really excited about this platform, and these data provided a new opportunity to showcase some of its features.

First, I'm assuming some knowledge of statistical process control in this talk. The main thing you need to know about is control charts. If you're not familiar with these, they are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not going to have much time to go into the methodology of Model Driven Multivariate Control Chart, so I'll refer you to these other great talks, which are freely available, for more details. I should also mention that Jianfeng Ding was the primary developer of Model Driven Multivariate Control Chart, in collaboration with Chris Gotwalt, and Tonya Mauldin and I were testers.

The focus of this talk will be using multivariate control charts to monitor a real-world chemical process. Another novel aspect of this talk will be using control charts for online process monitoring. This means we'll be monitoring data continuously as it's added to a database and detecting faults in real time.

So I'm going to start with the obligatory slide on the advantages of multivariate control charts. Why not use univariate control charts? There are a number of excellent options in JMP. Univariate control charts are excellent tools for analyzing a few variables at a time. However, quality control data sets are often high dimensional, and the number of charts that you need to look at can quickly become overwhelming. Multivariate control charts summarize a high-dimensional process in just a few charts, and that's a key advantage.
But that's not to say that univariate control charts aren't useful in this setting. You'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate control charts. Multivariate control charts give you a sense of the overall health of a process, while univariate control charts allow you to look at specific aspects. So the information is complementary, and one of the main goals of Model Driven Multivariate Control Chart was to provide some tools that make it easy to switch between those two types of charts.

One disadvantage of univariate control charts is that observations can appear to be in control when they're actually out of control in the multivariate sense. Here I have two IR control charts for oil and density, and these two observations in red are in control in both. But oil and density are highly correlated, and these observations are outliers in the multivariate sense. In particular, observation 51 severely violates the correlation structure. Multivariate control charts can pick up on these types of outliers when univariate control charts can't.

Model Driven Multivariate Control Chart uses projection methods to construct its control charts. I'm going to start by explaining PCA because it's easy to build up from there. PCA reduces the dimensionality of your process variables by projecting into a low-dimensional space. This is shown in the picture to the right: we have p process variables and n observations, and we want to reduce the dimensionality of the process to a, where a is much less than p. To do this we use the loading matrix P, which provides the coefficients for linear combinations of our X variables that give the score variables. This is shown in the equations on the left. T times P transpose will give you predicted values for your process variables with the low-dimensional representation. There's some prediction error, and your score variables are selected in a way that minimizes this squared prediction error. Another way to think about it is that you're maximizing the amount of variance explained in X.

PLS is more suitable when you have a set of process variables and a set of quality variables, and you really want to ensure that the quality variables are kept in control, but these variables are often expensive or time consuming to collect. A plant can be producing out-of-control quality for a long time before a fault is detected. PLS models allow you to monitor your quality variables as a function of your process variables. You can see here that PLS will find score variables that maximize the variance explained in the Y variables. The process variables are often cheaper and more readily available, so PLS models can allow you to detect quality faults early and can make process monitoring cheaper. From here on out, I'm just going to focus on PLS models because that's more appropriate for our example.

PLS partitions your data into two components. The first component is your model component. This gives you the predicted values. Another way to think about this is that your data have been projected into a model plane defined by your score variables, and T² charts monitor variation in this model plane. The second component is your error component. This is the distance between your original data and the predicted data, and squared prediction error (SPE) charts monitor variation in this component. We also provide an alternative, distance to model X plane (DModX), which is just a normalized version of SPE.

The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process is known to be in control. These data are used to build the PLS model and define normal process variation.
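As a rough illustration of the model/error split described above, the following sketch fits a one-component PCA (PCA rather than PLS, to keep the arithmetic short) to two correlated variables and computes T² and SPE by hand. The data values, variable names, and helper functions are all made up for illustration, and no control limits are computed; this is a sketch of the idea, not of how JMP builds its charts.

```python
import math

# Made-up historical data: two strongly correlated variables (roughly y = 2x).
historical = [
    (1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
    (5.0, 10.1), (6.0, 11.9), (7.0, 14.2), (8.0, 15.8),
]


def fit_one_component_pca(rows):
    """Return (mean, loading, score_variance) for a 2-variable, 1-component PCA."""
    n = len(rows)
    mean = (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)
    cx = [r[0] - mean[0] for r in rows]
    cy = [r[1] - mean[1] for r in rows]
    sxx = sum(a * a for a in cx) / (n - 1)
    syy = sum(b * b for b in cy) / (n - 1)
    sxy = sum(a * b for a, b in zip(cx, cy)) / (n - 1)
    # Largest eigenvalue of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Matching eigenvector (valid because sxy != 0 here), scaled to unit length.
    norm = math.hypot(sxy, lam - sxx)
    loading = (sxy / norm, (lam - sxx) / norm)
    # The variance of the scores along this direction equals the eigenvalue.
    return mean, loading, lam


def t2_and_spe(x, mean, loading, score_var):
    """T2 measures distance within the model; SPE measures distance from it."""
    c = (x[0] - mean[0], x[1] - mean[1])
    t = c[0] * loading[0] + c[1] * loading[1]  # score on the single component
    residual = (c[0] - t * loading[0], c[1] - t * loading[1])
    spe = residual[0] ** 2 + residual[1] ** 2
    return t * t / score_var, spe


mean, loading, score_var = fit_one_component_pca(historical)
hist_stats = [t2_and_spe(x, mean, loading, score_var) for x in historical]
max_hist_t2 = max(s[0] for s in hist_stats)
max_hist_spe = max(s[1] for s in hist_stats)

# Each coordinate is inside its historical range, but the pair breaks the correlation.
t2_break, spe_break = t2_and_spe((2.0, 14.0), mean, loading, score_var)
# This point follows the correlation but lies far outside the historical range.
t2_far, spe_far = t2_and_spe((12.0, 24.0), mean, loading, score_var)

print(f"historical max:     T2={max_hist_t2:.2f}, SPE={max_hist_spe:.3f}")
print(f"breaks correlation: T2={t2_break:.2f}, SPE={spe_break:.2f}")
print(f"far along model:    T2={t2_far:.2f}, SPE={spe_far:.2f}")
```

In this toy setup, the point that breaks the correlation has an unremarkable T² but a very large SPE, while the point far along the model direction shows the opposite pattern, which is why the two charts are read together.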
This allows a control limit to be obtained. Current data are assigned scores based on the model but are independent of the model. Another way to think about this is that we have a training set and a test set. The T² control limit is lower for the training data because we expect lower variability for observations used to train the model, whereas there's greater variability in T² when the model generalizes to a test set. Fortunately, there's some theory that's been worked out for the variance of T² that allows us to obtain control limits based on some distributional assumptions.

In the demo, we'll be monitoring the Tennessee Eastman process. I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical. It was originally written in Fortran, but there are wrappers for it in MATLAB and Python now. The simulation was based on a real industrial process, but it was manipulated to protect proprietary information. The simulated process is the production of two liquids from gaseous reactants, and F is a byproduct that needs to be siphoned off from the desired product. The Tennessee Eastman process is pervasive in the literature on benchmarking multivariate process control methods.

So this is the process diagram. It looks complicated, but it's really not that bad, so I'm going to walk you through it. The gaseous reactants A, D, and E flow into the reactor here. The reaction occurs, and the product leaves as a gas. It's then cooled and condensed into a liquid in the condenser.
Then we have a vapor-liquid separator that removes any remaining vapor and recycles it back to the reactor through the compressor. There's also a purge stream here that vents byproduct and an inert chemical to prevent them from accumulating. Then the liquid product is pumped through a stripper, where the remaining reactants are stripped off, and the final purified product leaves here in the exit stream.

The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves. The manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to specific values within limits and have some Gaussian noise. The manipulated variables can be sampled at any rate; we're using the default three-minute sampling interval. Some examples of the manipulated variables are the flow rate of the reactants into the reactor, the flow rate of steam into the stripper, and the flow of coolant into the reactor.

The next set of variables are measurement variables. These are shown as circles in the diagram, and they're also sampled at three-minute intervals. The difference is that the measurement variables can't be manipulated in the simulation.

Our quality variables will be the percent composition of two liquid products. You can see the analyzer measuring the composition here. These variables are collected with a considerable time delay, so we're looking at the product in this stream because these variables can be measured more readily than the product leaving in the exit stream. We'll also be building a PLS model to monitor our quality variables by means of our process variables, which have substantially less delay and a faster sampling rate.

Okay, so that's the background on the data. In total, there are 33 process variables and two quality variables.
The process of collecting the variables is simulated with a series of differential equations. So this is just a simulation, but you can see that a considerable amount of care went into modeling this as a real-world process.

Here's an overview of the demo I'm about to show you. We'll collect data on our process and then store these data in a database. I wanted to have an example that was easy to share, so I'll be using a SQLite database, but this workflow is relevant to most types of databases. Most databases support ODBC connections. Once JMP connects to the database, it can periodically check for new observations and update the JMP table as they come in. Then, if we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in.

The whole process of adding data to a database will likely be going on on a separate computer from the computer doing the monitoring. So I have two sessions of JMP open to emulate this. Both sessions have their own journal in the materials provided on the Community. The first session will add simulated data to the database, and it's called the streaming session. The next session will update reports as data come into the database, and I'm calling that the monitoring session.

One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the trade-offs among possible control strategies and techniques involved much more than a mathematical expression." Here are some of the goals they listed in their paper that are relevant to our problem: maintain the process variables at desired values, minimize variability of the product quality during disturbances, and recover quickly and smoothly from disturbances.
So we will assess how well our process achieves these goals using our monitoring methods.

Okay. To start off, I'm in the monitoring session journal, and I'll show you our first data set. The data table contains all the variables I introduced earlier. The first set are the measurement variables, the next set are the composition variables, and the last set are the manipulated variables. The first script attached here will fit a PLS model. It excludes the last hundred rows as a test set. Just as a reminder, this model is predicting our two product composition variables as a function of our process variables. But PLS is not the focus of the talk, so I've already fit the model and output score columns here.

If we look at the column properties, you can see that there's an MDMCC Historical Statistics property that contains all the information on your model that you need to construct the multivariate control charts. One of the reasons why Model Driven Multivariate Control Chart was designed this way is this: imagine you're a statistician and you want to share your model with an engineer so they can construct control charts. All you need to do is provide the data table with these formula columns. You don't need to share all the gory details of how you fit your model.

Next, I will use the score columns to create our control charts. On the left, I have two control charts, T² and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical. Then I have 100 observations that were held out as a test set. You can see in the limit summaries down here that I performed a Bonferroni correction for multiple testing, based on the historical data. I did this up here in the red triangle menu, where you can set the alpha level to anything you want. I did this correction because the data are known to be at normal operating conditions.
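As a small aside on the Bonferroni correction mentioned above: the idea is to divide the overall alpha by the number of tests so that the family-wise false alarm rate stays at or below alpha. The sketch below assumes an overall alpha of 0.05 and one test per historical observation; both are assumptions for illustration, not settings taken from the demo or from JMP.

```python
def bonferroni_alpha(overall_alpha, n_tests):
    """Per-test alpha that keeps the family-wise error rate at or below overall_alpha."""
    return overall_alpha / n_tests


overall_alpha = 0.05  # assumed value; the talk does not state the alpha used
n_historical = 860    # historical observations used to estimate the model

per_observation_alpha = bonferroni_alpha(overall_alpha, n_historical)

# Expected number of false alarms if every observation is truly in control.
expected_false_alarms_unadjusted = overall_alpha * n_historical
expected_false_alarms_adjusted = per_observation_alpha * n_historical

print(f"per-observation alpha: {per_observation_alpha:.2e}")
print(f"expected false alarms without adjustment: {expected_false_alarms_unadjusted:.1f}")
print(f"expected false alarms with adjustment:    {expected_false_alarms_adjusted:.2f}")
```

Under these assumptions, an unadjusted limit would be expected to flag dozens of in-control points, which is why a multiplicity adjustment is reasonable when the historical data are believed to be clean.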
So we expect no observations to be out of control, and after this multiplicity adjustment, there are zero false alarms.

On the right are the contribution proportion heat maps. These indicate how much each variable contributes to the out-of-control signal. Each observation is on the Y axis, and the contributions are expressed as a proportion. You can see in both of these plots that the contributions are spread pretty evenly across the variables. At the bottom, I have a score plot. Right now we're just plotting the first score dimension versus the second score dimension, but you can look at any combination of the score dimensions using these drop-down menus or this arrow.

Okay, so now that we're pretty oriented to the report, I'm going to switch over to the streaming session, which will stream data into the database. In order to do anything for this example, you'll need to have a SQLite ODBC driver installed. It's easy to do; you can just follow this link here. I don't have time to talk about this, but I created the SQLite database I'll be using in JMP. I have instructions on how to do this and how to connect JMP to the database on my Community web page. This example might be helpful if you want to try this out on data of your own.

I've already created a connection to this database, and I've shared the database on the Community. So I'm going to take a peek at the data tables in Query Builder. I can do that with Table Snapshot. The first data set is the historical data. I've used this to construct a PLS model. There are 960 observations that are in control. The next data table is a monitoring data table. It just contains the historical data at first, but I'll gradually add new data to it, and this is what our multivariate control chart will be monitoring.
I've simulated the new data already and added it to this data table here. You can see it starts at timestamp 961, and there are another 960 observations, but I've introduced a fault at some time point. I wanted to have something easy to share, so I'm not going to run my simulation script and add to the database that way. I'm just going to take observations from this new data table and move them over to the monitoring data table using some JSL with SQL statements. This is just a simple example emulating the process of new data coming into a database. You might not actually do this with JMP, but this is an opportunity to show how you can do it with JSL.

Next, I'll show you the script we'll use to stream in the data. This is a simple script, so I'm just going to walk you through it real quick. The first set of commands will open the new data table from the SQLite database. It opens up in the background, so I don't have to deal with the window. Then I'm going to take pieces from this new data table and move them to the monitoring data table. I'm calling the pieces bites, and the bite size is 20. Then this will create a database connection, which will allow me to send the database SQL statements. This last bit of code will iteratively construct SQL statements that insert new data into the monitoring data. So I'm going to initialize, okay, and show you the first iteration of this loop. This is just a simple SQL INSERT INTO statement that inserts the first 20 observations. I'll comment that out so it runs faster. There's a wait statement down here. This will just slow down the stream so that we have enough time to see the progression of the data in the control charts. If I didn't have this, the streaming example would just be over too quickly.

Okay, so I'm going to switch back to the monitoring session and show you some scripts that will update the report. I'll move this over to the right.
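The streaming loop described above (move rows in bites of 20 with an INSERT INTO statement, then wait) can be sketched outside JMP as well. The version below uses Python's built-in sqlite3 module rather than JSL, with an in-memory database; the table names `new_data` and `monitoring_data`, the column names, and the `stream_rows` helper are hypothetical stand-ins, not the names used in the shared example files.

```python
import sqlite3
import time

BITE_SIZE = 20  # rows moved per iteration, matching the bite size in the talk


def stream_rows(conn, bite_size=BITE_SIZE, wait_seconds=0.0):
    """Copy rows from new_data into monitoring_data one bite at a time.

    Returns the number of bites sent.
    """
    total = conn.execute("SELECT COUNT(*) FROM new_data").fetchone()[0]
    bites = 0
    for offset in range(0, total, bite_size):
        conn.execute(
            "INSERT INTO monitoring_data "
            "SELECT * FROM new_data ORDER BY timestamp LIMIT ? OFFSET ?",
            (bite_size, offset),
        )
        conn.commit()             # make the new rows visible to other connections
        bites += 1
        time.sleep(wait_seconds)  # slows the stream so a monitor can keep up
    return bites


# Demo on an in-memory database with a made-up two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_data (timestamp INTEGER PRIMARY KEY, x REAL)")
conn.execute("CREATE TABLE monitoring_data (timestamp INTEGER PRIMARY KEY, x REAL)")
conn.executemany(
    "INSERT INTO new_data VALUES (?, ?)",
    [(961 + i, float(i)) for i in range(50)],
)
conn.commit()

n_bites = stream_rows(conn)
n_rows = conn.execute("SELECT COUNT(*) FROM monitoring_data").fetchone()[0]
print(f"sent {n_bites} bites, monitoring_data now has {n_rows} rows")
```

Setting `wait_seconds` to a positive value plays the role of the wait statement in the script: it only paces the stream and does not change what is inserted.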
Now you can see the report and the scripts at the same time.

This "read from monitoring data" script is a simple script that checks the database every 0.2 seconds and adds new data to the JMP table. Since the report has automatic recalc turned on, the report will update whenever new data are added. I should add that, realistically, you probably wouldn't use a script that just iterates like this. You'd probably use Task Scheduler on Windows or Automator on Mac to better schedule the runs. The next script here will push the report to JMP Public whenever the report is updated. I was really excited that this is possible in JMP. It enables any computer with a web browser to view updates to the control chart. You can even view the report on your smartphone, so this makes it easy to share results across organizations. You can also use JMP Live if you want the reports to be on a restricted server. Then this script will recreate the historical data in the data table in case you want to run the example multiple times.

Okay, so let's run the streaming script and look at how the report updates. You can see the data are in control at first, but then a fault is introduced. There's a large out-of-control signal, but there's a plant-wide control system that's been implemented in the simulation, which brings the system to a new equilibrium. I'll give this a second to finish.

Now that I've updated the control chart, I'm going to push the results to JMP Public. On my JMP Public page, I have at first the control chart with the data in control at the beginning, and this should be updated with the addition of the new data.

If we zoom in on when the process first went out of control, it looks like that was sample 1125. I'm going to color that and label it so that it shows up in other plots. In the SPE plot, it looks like this observation is still in control.
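The polling script described earlier in this part of the demo (check the database every 0.2 seconds and append any new rows to the local table) follows a common pattern that can be sketched in a few lines. This is a Python stand-in using the built-in sqlite3 module, not the JSL from the example files; the table name, columns, and the `poll` and `fetch_new_rows` helpers are hypothetical.

```python
import sqlite3
import time


def fetch_new_rows(conn, last_seen):
    """Return rows whose timestamp is newer than the last one already loaded."""
    return conn.execute(
        "SELECT timestamp, x FROM monitoring_data "
        "WHERE timestamp > ? ORDER BY timestamp",
        (last_seen,),
    ).fetchall()


def poll(conn, local_rows, interval_seconds=0.2, max_polls=2):
    """Append new database rows to local_rows, checking every interval_seconds."""
    for _ in range(max_polls):
        last_seen = local_rows[-1][0] if local_rows else 0
        new_rows = fetch_new_rows(conn, last_seen)
        local_rows.extend(new_rows)  # a real monitor would refresh its charts here
        time.sleep(interval_seconds)
    return local_rows


# Demo: the database already holds rows 1-5, the local table only rows 1-3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monitoring_data (timestamp INTEGER PRIMARY KEY, x REAL)")
conn.executemany(
    "INSERT INTO monitoring_data VALUES (?, ?)",
    [(t, t * 0.5) for t in range(1, 6)],
)
conn.commit()

local_table = [(1, 0.5), (2, 1.0), (3, 1.5)]
poll(conn, local_table)
print(f"local table now has {len(local_table)} rows")
```

Tracking the last timestamp seen keeps each poll cheap, since only rows added since the previous check are read. As noted in the talk, a scheduled job is usually a better fit than a loop like this outside of a demo.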
Which chart will catch faults earlier depends on your model and how many factors you've chosen. We can also zoom in on that time point in the contribution plot. You can see that when the process first goes out of control, there's a large number of variables contributing to the out-of-control signal. But then, when the system reaches a new equilibrium, only two variables have large contributions.

I'm going to remove these heat maps so that I have more room in the diagnostic section. I've made everything pretty large so that the text will show up on your screen. If I hover over the first point that's out of control, you can get a peek at the top 10 contributing variables. This is great for quickly identifying which variables are contributing the most to the out-of-control signal. I can also click on that plot and append it to the diagnostic section, and you can see that there's a large number of variables contributing to the out-of-control signal. I'll zoom in here a little bit. If one of the bars is red, this means that the variable is out of control in a univariate control chart, and you can see this by hovering over the bars. I'm going to pin a couple of those. These graphlets are IR charts for the individual variables with three-sigma control limits. You can see for the stripper pressure variable that the observation is out of control in the univariate control chart, but the variable is eventually brought back under control by our control system. That's true for most of the large contributing variables. I'll also show you one of the variables where the observation is in control.

So once the control system responds, many variables are brought back under control, and the process reaches a new equilibrium. But there's obviously a shift in the process. To identify the variables that are contributing to the shift, one thing you can look at is a mean contribution plot. If I sort this and look at the variables that are contributing the most, it looks like just two variables have large contributions. Both of these are measuring the flow rate of reactant A in stream 1, which is coming into the reactor. These are measuring essentially the same thing, except one is a measurement variable and one is a manipulated variable. You can see in the univariate control chart that there's a large step change in the flow rate, and in this one as well. This is the step change that I programmed in the simulation. So these contributions allow us to quickly identify the root cause.

I'm going to present a few alternate methods to identify the same cause of the shift. The reason is that in real data, process shifts are often more subtle, and some of the tools may be more useful in identifying them than others. We'll consistently arrive at the same conclusion with these alternate methods, so this will show some of the ways that these methods are connected.

Down here, I have a score plot, which can provide supplementary information about shifts in the T² plot. It's more limited in its ability to capture high-dimensional shifts because only two dimensions of the model are visualized at a time. However, it can provide a more intuitive visualization of the process, as it visualizes it in a low-dimensional representation. In fact, one of the main reasons why multivariate control charts are split into T² and SPE in the first place is that it provides enough dimensionality reduction to easily visualize the process in a scatter plot.

We want to identify the variables that are causing the shift, so I'm going to color the points before and after the shift so that they show up in the score plot.
Typically, we would look through all combinations of the six factors, but that's a lot of score plots to look through. So something that's very handy is the ability to cycle through all combinations quickly with this arrow down here. We can look through the factor combinations and find one where there's large separation. If we wanted to identify where the shift first occurred in the score plots, we can connect the dots and see that the shift occurred around 1125 again.

Another useful tool, if you want to identify score dimensions where an observation shows the largest separation from the historical data and you don't want to look through all the score plots, is the normalized score plot. So I'm going to select a point after the shift and look at the normalized score plot. I'm actually going to choose another one, because I want to look at dimensions five and six. These plots show the magnitude of the score in each dimension, normalized so that the dimensions are on the same scale. Since the mean of the historical data is at zero for each score dimension, the dimensions with the largest magnitude will show the largest separation between the selected point and the historical data. It looks like dimensions five and six show the greatest separation here, so I'm going to move to those.

There's large separation here between our shifted data and the historical data. Score plot visualizations can also be more interpretable because you can use the variable loadings to assign meaning to the factors. Here we have too many variables to see all the labels for the loading vectors, but you can hover over them to see them. You can see, if I look in the direction of the shift, that the two variables that were the cause show up there as well.
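To make the normalized score idea above concrete, here is a minimal sketch. It assumes that "normalized" means dividing each score by the standard deviation of the historical scores in that dimension; that is one plausible scaling, not necessarily the exact one JMP uses, and all score values below are made up.

```python
import statistics

# Made-up historical scores for three score dimensions, each centered at zero.
historical_scores = {
    1: [-2.0, -1.0, 0.0, 1.0, 2.0],
    2: [-0.4, -0.2, 0.0, 0.2, 0.4],
    3: [-1.0, -0.5, 0.0, 0.5, 1.0],
}

# Made-up scores of one selected observation after the shift.
selected_point = {1: 3.0, 2: 1.5, 3: 0.5}

# Put every dimension on the same scale (assumed: historical standard deviation).
normalized = {
    dim: selected_point[dim] / statistics.stdev(scores)
    for dim, scores in historical_scores.items()
}

largest_raw = max(selected_point, key=lambda d: abs(selected_point[d]))
largest_normalized = max(normalized, key=lambda d: abs(normalized[d]))

for dim in sorted(normalized):
    print(f"dimension {dim}: raw={selected_point[dim]:.2f}, normalized={normalized[dim]:.2f}")
print(f"largest raw score: dimension {largest_raw}")
print(f"largest normalized score: dimension {largest_normalized}")
```

In this toy example the raw score is largest in dimension 1, but after scaling, dimension 2 stands out, because the historical data vary very little there. That is the sense in which the normalized plot points to the dimensions with the greatest separation.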
We can also explore differences between subgroups in the process with the group comparisons tool. To do that, I'll select all the points before the shift and call that the reference group, and select everything after and call that the group I'm comparing to the reference. This contribution plot will give me the variables that are contributing the most to the difference between these two groups. You can see that this also identifies the variables that caused the shift. The group comparisons tool is particularly useful when there are multiple shifts in a score plot or when you can see more than two distinct subgroups in your data.

In our case, as we're comparing a group in our current data to the historical data, we could also just select the data after the shift and look at a mean contribution score plot. This will give us the average contributions of each variable to the scores in the orange group. Since large scores indicate a large difference from the historical data, these contribution plots can also identify the cause. These use the same formula as the contribution formula for T², but now we're just using the two factors from the score plot.

Okay, I'm going to find my PowerPoint again. So real quick, I'm going to summarize the key features of Model Driven Multivariate Control Chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis. There are many methods provided in the platform for drilling down to the root cause of faults. I'm showing you here some plots from the popular book Fault Detection and Diagnosis in Industrial Systems. Throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in the process. One particularly useful feature in Model Driven Multivariate Control Chart is how interactive and user friendly it is to switch between these types of charts.

So that's my talk. Here's my email if you have any further questions, and thanks to everyone who tuned in to watch this.
John Cromer, Sr. Research Statistician Developer, JMP

While the value of a good visualization in summarizing research results is difficult to overstate, selection of the right medium for sharing with colleagues, industry peers and the greater community is equally important. In this presentation, we will walk through the spectrum of formats used for disseminating data, results and visualizations, and discuss the benefits and limitations of each. A brief overview of JMP Live features sets the stage for an exciting array of potential applications. We will demonstrate how to publish JMP graphics to JMP Live using the rich interactive interface and scripting methods, providing examples and guidance for choosing the best approach. The presentation culminates with a showcase of a custom JMP Live publishing interface for JMP Clinical results, including the considerations made in designing the dialog, the mechanics of the publishing framework, the structure of JMP Live reports and their relationship to the JMP Clinical client reports, and a discussion of potential consumption patterns for published reviews.

Auto-generated transcript

John Cromer: Hello everyone. Today I'd like to talk about two powerful products that extend JMP in exciting ways. One of them, JMP Clinical, offers rich visualization, analytical and data management capabilities for ensuring clinical trial safety and efficacy. The other, JMP Live, extends these visualizations to a secure and convenient platform that allows a wider group of users to interact with them from a web browser. As data analysis and visualization become increasingly collaborative, it is important that both creating and sharing are easy. By the end of this talk, you'll see just how easy it is. First, I'd like to introduce the term collaborative visualization.
Isenberg et al. define it as the shared use of computer-supported interactive visual representations of data by more than one person with the common goal of contribution to joint information processing activities. As I'll later demonstrate, this definition captures the essence of what JMP, JMP Clinical and JMP Live can provide.

When thinking about the various situations in which collaborative visualization occurs, it is useful to consult the Space Time Matrix. In the upper left of this matrix, we have the traditional model of classroom learning and office meetings, with all participants at the same place at the same time. Next, in the upper right, we have participants at different places interacting with the visualization at the same time. In the lower left, we have participants interacting at different times at the same location, such as in the case of shift workers. And finally, in the lower right, we have flexibility in both space and time, with participants potentially located anywhere around the globe and interacting with the visualization at any time of day. JMP Live can facilitate this scenario.

A second way to slice through the modes of collaborative visualization is by thinking about the necessary level of engagement for participants. When simply browsing a few high-level graphs or tables, sometimes simple viewing can be sufficient. But with more complex graphics, and for those in which the data connections have been preserved between the graphs and underlying data tables, users can greatly benefit by also having the ability to interact with and explore the data. This may include choosing a different column of interest, selecting different levels in a data filter and exposing detailed data point hover text. Finally, authors who create visualizations often have a need to share them with others and by necessity will also have the ability to view, interact with and explore the data. JMP and JMP Clinical are for authors who require all of these abilities.
A third way to think about formats and solutions is by the interactivity spectrum. Static reports, such as PDFs, are perhaps the simplest and most portable, but generally the least interactive. Interactive HTML, also known as HTML5, offers responsive graphics and hover text. JMP Live is built on an HTML5 foundation, but also offers server-side computations for regenerating the analysis. While the features of JMP Live will continue to grow over time, JMP offers even more interactivity. And finally, there are industry-specific solutions such as JMP Clinical, which are built on a framework of JMP and SAS and offer all of JMP's interactivity, but with some additional specialization. So when we lay these out on the interactivity spectrum, we can see that JMP Live fills the sweet spot of being portable enough for those with only a web browser to access, while offering many of the prime interactive features that JMP provides. So the product that I'll use to demonstrate creating a visualization is JMP Clinical. JMP Clinical, as I mentioned before, offers a way to conveniently assess clinical trial safety and efficacy. With several role-based workflows for medical monitors, writers, clinical operations and data managers, and three review templates, predefined or custom workflows can be conveniently reused on multiple studies, producing results that allow for easy exploration of trends and outliers. Several formats are available for sharing these results: static reports, an in-product review viewer and, new to JMP Clinical ???, JMP Live reports. The product I'll use to demonstrate interacting on a shared platform is JMP Live. JMP Live allows users with only a web browser to securely and conveniently interact with the visualizations, and they can specify access restrictions for who can view both the graphics and the underlying data tables, with the ability to publish a local data filter and column switcher. 
The view can be refreshed in just a matter of seconds. Users can additionally organize their web reports through titles, descriptions and thumbnails and leave comments that facilitate discussion between all interested parties. So explore the data on your desktop with JMP or JMP Clinical, publish to JMP Live with just a few quick steps, share the results with colleagues across your organization, and enrich the shared experience through communication and automation. So now I would like to demonstrate how to publish a simple graphic from JMP to JMP Live. I'm going to open the demographics data set from the sample study Nicardipine, which is included with JMP Clinical. I can do this either through the File, Open menu, where I can navigate to my data set, or with a script: dt = Open, then the path to my data table. So I'm going to click Run Script to open that data table. Okay. So now I'd like to create a simple visualization. Let's say I'd like to create a simple box plot. I'll click Graph, Graph Builder. And here I have a dialog for moving variables into roles. I'm going to move the study site identifier into the X role and age into Y, click box plot, and click Done. So here's one quick and easy way to create a visualization in JMP. Alternatively, I can do the same thing with a script. This block of code I have here encapsulates a data filter and a Graph Builder box plot into a data filter context box. So I'm going to run this block of code. And here you see I have some filters and a box plot. Now, notice how interactive this filter is, along with the corresponding graph. I can select a different lower bound for age; I can type in a precise value. Let's say I'd like to exclude those under 30, and suppose I am interested in only the first 10 study site identifiers. OK. So now I'd like to share this visualization with some of my colleagues who don't have JMP but do have JMP Live. 
So one way to publish this to JMP Live is interactively through the File, Publish menu. And here I have options for my web report. You can see I have options for specifying a title and description. I can add images. I can choose who to share this report with. So at this point, I could publish this, but I'd like to show you how to do so using a script. So I have this chunk of code where I create a new web report object. I add my JMP report to the web report object. I issue the publish message to the web report, and then I automatically open the URL. So let me go ahead and run that. You can see that I'm automatically taken to JMP Live, with a very similar structure to my client report. My filter selections have been preserved. I can make filter selection changes. For example, I can move the lower bound for age down, and notice I also have detailed data point hover text. I have filter-specific options. And I also have platform-specific options. So any time you see these menus, you can further explore them to see what options are available. Alright, so now you've seen how to publish a simple graphic from JMP to JMP Live. How about a complex one, as in the case of a JMP Clinical report? So what I'm going to do is open a new review. I will add the adverse events distribution report to this review. I will run it with all default settings. And now I have my adverse events distribution report, which consists of column switchers for demographic grouping and stacking, report filters, an adverse events counts graph, a tabulate object for counts and some distributions. Suppose I'm interested in stacking my adverse events by severity. I've selected that, and now I have the stoplight colors that I've set for my adverse events for mild, moderate and severe. 
At this point I'd like to share these results with a colleague who maybe in this case has JMP, but there are certain times where they prefer to work through a web browser to inspect and take a look at the visualizations. So at this point, I will click this report-level Create Live Report button. And now I have my dialog. I can choose to publish to either file or JMP Live. I can choose whether to publish the data tables or not, but I would always recommend publishing them for maximum interactivity. I can also specify whether to allow my colleagues to download the data tables from JMP Live. In addition to the URL, you can specify whether to share the results only with yourself, everyone at your organization or with specific groups. So for demonstration purposes, I will only publish for myself. I'll click OK. I got a notification to say that my web report has been published. Over on JMP Live, I have a very similar structure. I have my report filters and my column switchers, with my column of interest preserved. You can see my axes and legends and colors have also carried over. Within this web report, I can easily collapse or expand particular report sections, and many of the sections also offer detailed data point hover text and responsive updates for data filter changes. Another thing I'd like to point out is this Details button in the upper right of the live report, where I can get detailed creation information, a list of the data tables that were published, as well as the script. And because I've given users the ability to download these tables and scripts, there are download buttons for that purpose. I can also leave comments for my colleagues that they can then read and take further action on, for example, to follow up on an analysis. All right, so for my final demo, I would simply like to extend my single clinical report to a review that adds two other reports: enrollment patterns and findings bubble plot. 
So I'm going to run these reports. Enrollment patterns plots patient enrollment over the course of a study by things like start date of disposition event, study day and study site identifier. Findings bubble plot, I will run on the laboratory test results domain. And this report features a prominent animated bubble plot, in which you can launch this animation. You can see how specific test results change over the course of a study. You can pause the animation. You can scroll to specific, precise values for study day and you can also hover over data points to reveal the detailed information for each of those points. I'll click Create Live Report for the review. I have the same dialog that you've seen earlier, with the same options, and I'm just going to go ahead and publish this now so you can see what it looks like when I have three clinical reports bundled together in one publication. So when this operation completes, you will see that we are taken to an index page corresponding to report sections. Each thumbnail on this page corresponds to a report section, and the binoculars icon on the lower left indicates how many views each page has had. I have a three-dot menu, where you can get back to that details view. If you click Edit, from here you can also see creation information and a list of data tables and scripts. And by clicking any of these thumbnails, I can get down to the specific web report of interest. So just because this is one of my favorite interactive features, I've chosen to show you the findings bubble plot on JMP Live. Notice that it has carried over our study day, where we left off on the client, on study day 7. I can continue this animation. You can see study day counting up and you can see how our test results change over time. I can pause this again. I can get to a specific study day. I can do things like change bubble size to suit my preference. 
Again, I have data point hover text, I can select multiple data points, and I have numerous platform-specific options that will vary, but I encourage you to take a look at these any time you see this three-dot menu. So to wrap up, let me just jump to my second-last slide. So how was all this possible? Well, behind the scenes, the code to publish a complex clinical report is simply a JSL script that systematically analyzes a list of graphical report object references and pairs them with the appropriate data filters, column switchers, and report sections into a web report object. The JSL publish command takes care of a lot of the work for you, bundling the appropriate data tables into the web report and ensuring that the desired visibility is met. Power users who have both products can use the download features on JMP Live to conveniently make changes on their clients and to update the report that was initially published, even if they were not the original authors. And then the cycle of collaboration can continue between those on the client and those on JMP Live. So, as you can see, both creating and sharing are easy. With JMP and JMP Clinical, collaborative visualization is truly possible. I hope you've enjoyed this presentation, and I look forward to any questions that you may have.  
Mike Anderson, SAS Institute, SAS Institute Anna Morris, Lead Environmental Educator, Vermont Institute of Natural Science Bren Lundborg, Wildlife Keeper, Vermont Institute of Natural Science   Since 1994, the Vermont Institute of Natural Science’s (VINS) Center for Wild Bird Rehabilitation (CWBR), has been working to rehabilitate native wild birds in the northeastern United States. One of the most common raptor patients CWBR treats is the Barred Owl. Barred Owls are fairly ubiquitous east of the Rocky Mountains. Their call is the familiar “Who cooks for you, who cooks for you all.” They have adapted swiftly to living alongside people and, because of this, are commonly presented to CWBR for treatment. As part of a collaboration with SAS, technical staff from JMP and VINS have been analyzing the admission records from the rehabilitation center. Recently we have used a combination of Functional Data Analysis, Bootstrap Forest Modeling, and other techniques to explore how climate and weather patterns can affect the number of Barred Owls that arrive at VINS for treatment — specifically for malnutrition and related ailments. We found that a combination of temperature and precipitation patterns results in an increase in undernourished Barred Owls being presented for treatment. This session will discuss our findings, how we developed them, and potential implications in the broader context of climate change in the Northeastern United States.       Auto-generated transcript...   Speaker Transcript Mike Anderson Welcome, everyone, and thank you for joining us. My name is Anna Morris and I'm the lead environmental Educator at the Vermont Institute of Natural Science or VINS in Quechee, Vermont. I'm Bren Lundborg, wildlife keeper at VINS center for wildlife rehabilitation and I'm Mike Anderson JMP systems engineer at SAS. We're excited to present to you today our work on the effects of local weather patterns on the malnutrition and death rates of wild barred owls in Vermont. 
This study represents 18 years of data collected on wild owls presented for care at one avian rehabilitation clinic and a unique collaboration between our organization and the volunteer efforts of Mike Anderson at JMP. Let's first get to know our study species, the barred owl, with the help of a non-releasable rehabilitated bird serving as an education ambassador at the VINS nature center. Yep, this owl was presented for rehabilitation in Troy, New Hampshire in 2013 and suffered eye damage from a car collision from which she was unable to recover. Barred owls like this one are year-round residents of the mixed deciduous forests of New England, subsisting on a diet that includes mammals, birds, reptiles, amphibians, fish and a variety of terrestrial and aquatic invertebrates. However, the prey they consume differs seasonally, with small mammals composing a larger portion of the diet in the winter. Their hunting styles differ in winter as well, due to the presence of snowpack, which can shelter small mammals from predation. Barred owls are known to use the behavior of snow plunging, or pouncing downward through layers of snow to catch prey detected by sound. Here's a short video demonstrating this snow plunging behavior. As seen in that quick clip, barred owls can be quite tolerant of human-altered landscapes, with nearly one quarter of barred owl nests utilizing human structures. They are also the owl species most frequently observed by members of the public in Vermont, according to the citizen science project iNaturalist, with 468 research-grade observations of wild owls logged. As such, barred owls are commonly presented to wildlife rehabilitation clinics by people who discover injured animals. The Vermont Institute of Natural Science's Center for Wild Bird Rehabilitation, or CWBR, is the federal and state licensed wildlife rehabilitation facility located in Quechee, Vermont. 
All wild living avian species that are legal to rehabilitate in the state are admitted as patients to CWBR, and we received an average of 405 patients yearly from 2001 to 2019, representing 193 bird species. 90% of patients presented at CWBR come from within 86 kilometers of the facility. Of the patients admitted during the 18 year period of the study, 11% were owls of the order Strigiformes, comprising six species, with barred owls being the most common. However, year to year, the number of barred owls received as patients by CWBR has varied widely compared to another commonly received species, the American Robin. Certain years, such as the winter of 2018 to 2019, have been anecdotally considered big barred owl years by CWBR staff and other rehabilitation centers in the Northeastern US for the large number presented as patients. One explanation proposed by local naturalists attributes the big year phenomenon to shifts in weather patterns. When freeze/thaw cycles occur over short time scales, these milder, wetter winters are thought to pose challenges to barred owls relying on snow plunging for prey capture. Specifically, the formation of a layer of ice on top of the snow can prevent owls from capturing prey using this snow plunging technique, as the owls may not be able to penetrate this ice layer. In order to feed, the animals may therefore use alternative hunting locations or styles, or suffer from weakness due to malnutrition, which could lead to adverse interactions with humans, resulting in injury. This study was undertaken to determine if a relationship exists between higher than average winter precipitation and the number of barred owls presented during those years at CWBR for rehabilitation. 
Though there are several possible explanations for the variation in the number of patients associated with regional weather, we sought to determine if there was support for the ice layer hypothesis by further investigating whether barred owls presented during wetter winters exhibited malnutrition as part of the intake diagnosis in greater proportion than in drier winters. This would suggest that obtaining food was a primary difficulty, leading to the need for rehabilitation, rather than a general population increase, which would likely lead to a proportional increase in all intake categories. Initially we expected that there would be a fairly simple time series relationship to this. We went and looked at the original data for the admissions, just to compare, as Bren said, the data between the barred owls and the American robins. The bad years are marked here in blue, except for the gray one, which actually had a hurricane involved. We can see there's a very strong periodic signal associated with the robins. We would expect that the year-round resident barred owls should have something resembling a fairly steady intake rate, but we see some significant changes in that year to year. Looking at the contingency analysis, we can see that the green bands, the starvation cases, correlate fairly nicely with those years where we have big barred owl years. Again, pointing out 2008, 2015, 2019, these being ski season years instead, which I'll make clear in a moment. 2017 doesn't show up, but it does have a big band of unknown trauma and cause, and that was from a difference in how they were triaging the incoming animals that year. The one trick to working with this is that we needed to use functional data analysis to be able to take the year over year trends and turn them into a signal that we can analyze effectively against weather patterns and other data that we were able to find. 
Looking here, it's fairly easy to see that those years that we would call bad years have a very distinctive dogleg-type pattern. You can see 2008, 2017, 2019, 2015. Again, most importantly, those signals tend to correlate most strongly with this first eigenfunction in our principal component analysis. You can see quite clearly here that component one does a great job of discriminating between the good years and the bad years, with that odd hurricane year right in the middle where it should be. You can also look at the profiler for FPC one and you can see that as we raise and lower that profiler, we see that dogleg pattern become more pronounced. The next question is, how do we get the data for that kind of an analysis? How do we get the weather data that we think is important? Well, it turns out that there's a great organization, a ski resort about 20 miles away from here, that has been collecting data from as far back as the 50s. They've also been working with naturalists and conservation efforts, providing their environmental data to researchers for different projects, and they gave us access to their database. This is an example of base mountain temperature at Killington, Vermont, and you can see that the bad years, again colored in blue here, tend to have a flatter belly in their low temperature. You can see, for instance, looking at 2007, the first one in the upper left corner, that there's a steep drop down, followed by a steep incline back up. Whereas in 2008, which is one of the bad years for owl admissions, we have a fairly flat, maybe even slightly inverted, peak in the middle. And that's fairly consistent, with the exception of maybe 2015, throughout the other years. So I took all of that data and used Functional Data Explorer to get the principal components for our responses. 
We end up having, therefore, a functional component on the response and a functional component on the factors. This is an example of one of those for what turns out to be one of the driving factors of this analysis, and you can see it does a very nice job of pulling out the principal components. The one we're going to be interested in in a moment is this Eigenfunction4. It doesn't look like much right now, but it turns out to be quite important. So let's put all this together. I used a combination of generalized regression and the autovalidation strategy that was pioneered by Gotwalt and Ramsey a few years ago to build a model of the behavior. We can see we get a fairly good actual by predicted plot for that. We get a nice R square around 99%, and looking at the reduced model, we see that we have four primary drivers. The cumulative rain shows up. That makes sense. We can't have ice without rain. Also a temperature factor: we need temperature to have ice. But we also have the sum of the daily snowfall, that's the max total snowfall per year, and the sum of the daily rainfall as well. And taking all of this, we can start to put together a picture of what bad barred owl years look like from a data-driven standpoint. I'm going to show you first, again, what a bad year looks like from the standpoint of the admission rates. That's a bad year; that's a good year, a fairly dramatic difference. Now we're going to have to pay fairly close attention to the other factors, because it's a very subtle change in the temperatures and in the rainfalls that triggers this good year/bad year. It's kind of interesting how tiny the effects are. 
So first, this is the total snowfall per year. And we're going to pay attention to the slope of this curve for a good year and then for a bad year. Fairly tiny change, year over year. So it's a subtle change, but that subtle change is one of the big drivers. We need to have a certain amount of snowfall present in order to facilitate the snow diving. The other thing, if we look at rain, we're going to look at the belly of this rainfall right here, around week 13 in the ski season. There's a good year. And there's a bad year. Slightly more rain earlier in the year, and with a flatter profile going into spring. And again, looking at the cumulative rain over the season, a good year tends to be a little bit drier than a bad year. And lastly, most importantly, the temperature. This is that belly effect that we were seeing before. We see in early years or in good years that we have that strong decline down and strong climb out in the temperature, but for bad years we get a slightly more bowl-shaped effect overall. And I'm going to turn it over to Bren to talk about what that means in terms of barred owl malnutrition. Malnutrition has a significant negative impact upon survival of both free-ranging owls and those receiving treatment at a rehabilitation facility. Detrimental effects include reduced hunting success, lessened ability to compete with other animals or predator species for food, and reduced immunocompetence. Some emaciated birds are found too weak to fly and are at high risk for complications such as refeeding syndrome during care. For birds in care, the stress of captivity, as well as healing from injuries such as fractures and traumatic brain injuries, can double the caloric needs of the patient, thus putting further metabolic stress on an already malnourished bird. 
Additionally, scarcity or unavailability of food may push owls closer to human populated areas, leading to increased risk for human-related causes of mortality. Vehicle strikes are the most common cause of intakes for barred owls in all years, and hunting near roads and human-occupied habitats increases that risk. In the winter of 2018 to 2019, reports of barred owls hunting at bird feeders and stalking domestic animals, such as poultry, were common. Hunting at bird feeders potentially increases exposure to pathogens, as feeders are common sites of disease transmission, and it may lead to higher rates of infectious diseases such as salmonellosis and trichomoniasis. Difficult winters also provide extra challenges for first year barred owls. Fledgling barred owls are highly dependent on their parents and will remain with them long after being able to fly and hunt. And once parental support ends, they are still relatively inexperienced hunters facing less prey availability and harsher conditions in their first winter. Additionally, the lack of established territories may lead them to be more likely to hunt near humans, predisposing them to risks such as vehicle collision related injuries. Previous research on a close relative of the barred owl, the northern spotted owl of the Pacific Northwest, shows a decline in northern spotted owl fecundity and survival associated with cold, wet weather in winter and early spring. In Vermont, the National Oceanic and Atmospheric Administration has projected an increase in winter precipitation of up to 15% by the middle of the 21st century, which may have specific impacts on populations of barred owls and their prey sources. The findings of this study provide important implications for the management of barred owl populations and those of related species in the wake of a changing climate. 
Predicted changes to regional weather patterns in Vermont and New England forecast that cases of malnourished barred owls will only increase in frequency over the next 20 to 30 years as we continue to see unusually wet winters. Barred owls, currently listed by the International Union for Conservation of Nature as a species of least concern with a population trend that is increasing, will likely not find themselves threatened with extinction rapidly. However, ignoring this clear threat to local populations may cascade through the species at large and exacerbate the effects of other conservation concerns, such as accidental poisoning and nest site loss. These findings also highlight the need for protocols to be established on the part of wildlife rehabilitators and veterinarians for the treatment of severe malnourishment in barred owls, such as to avoid refeeding syndrome, and provide the right balance of nutrients for recovery from an often lethal condition. Rehabilitation clinics would benefit from a pooling of knowledge and resources to combat this growing issue. Finally, this study shows yet another way in which climate change is currently affecting the health of wildlife species around us. Individual and community efforts to reduce human impacts on the climate will not be sufficient to reduce greenhouse gas emissions at the scale necessary to halt or reverse the damage that has been done. Action on the part of governments and large corporations must be taken, and individuals and communities have the responsibility to continue to demand that action. We would like to thank the staff and volunteers at the Vermont Institute of Natural Science, as well as at JMP, who helped collect and analyze the data presented here, especially Gray O'Tool. We'd also like to thank the Killington Ski Resort for providing us with the detailed weather data. Thank you.  
Roland Jones, Senior Reliability Engineer, Amazon Lab126 Larry George, Engineer who does statistics, Independent Consultant Charles Chen SAE MBB, Quality Manager, Applied Materials Mason Chen, Student, Stanford University OHS Patrick Giuliano, Senior Quality Engineer, Abbott Structural Heart   The novel coronavirus pandemic is undoubtedly the most significant global health challenge of our time. Analysis of infection and mortality data from the pandemic provides an excellent example of working with real-world, imperfect data in a system with feedback that alters its own parameters as it progresses (as society changes its behavior to limit the outbreak). With a tool as powerful as JMP it is tempting to throw the data into the tool and let it do the work. However, using knowledge of what is physically happening during the outbreak allows us to see what features of the data come from its imperfections, and avoid the expense and complication of over-analyzing them. Also, understanding of the physical system allows us to select appropriate data representation, and results in a surprisingly simple way (OLS linear regression in the ‘Fit Y by X’ platform) to predict the spread of the disease with reasonable accuracy. In a similar way, we can split the data into phases to provide context for them by plotting Fitted Quantiles versus Time in Fit Y by X from Nonparametric density plots. More complex analysis is required to tease out other aspects beyond its spread, answering questions like "How long will I live if I get sick?" and "How long will I be sick if I don’t die?". For this analysis, actuarial rate estimates provide transition probabilities for Markov chain approximation to SIR models of Susceptible to Removed (quarantine, shelter etc.), Infected to Death, and Infected to Cured transitions. Survival Function models drive logistics, resource allocation, and age-related demographic changes. Predicting disease progression is surprisingly simple. 
Answering questions about the nature of the outbreak is considerably more complex. In both cases we make the analysis as simple as possible, but no simpler.     Auto-generated transcript...   Speaker Transcript Roland Jones Hi, my name is Roland Jones. I work for Amazon Lab126 as a reliability engineer.   When my team and I   put together our abstracts for the proposal at the beginning of May, we were concerned that COVID-19 would be old news by October.   At the time of recording on the 21st of August, this is far from the case. I really hope that by the time you watch this in October, things will be under control and life will be returning to normal, but I suspect that it won't.   With all the power of JMP, it is tempting to throw the data into the tool and see what comes out. The COVID-19 pandemic is an excellent case study   of why this should not be done. The complications of incomplete and sometimes manipulated data, changing environments, changing behavior, and changing knowledge and information make it particularly dangerous to just throw the data into the tool and see what happens.   Get to know what's going on in the underlying system. Once the system is understood, the effects of the factors that I've listed can be taken into account,   allowing the modeling and analysis to be appropriate for what is really happening in the system, and avoiding analyzing or being distracted by the imperfections in the data.   It also makes the analysis simpler. The overriding theme of this presentation is to keep things as simple as possible, but no simpler.   There are some areas towards the end of the presentation that are far from simple, but even here, we're still working to keep things as simple as possible.   We started by looking at the outbreak in South Korea. It had a high early infection rate and was a trustworthy and transparent data source.   
Incidentally, all the data in the presentation comes from the Johns Hopkins database as it stood on the 21st of August when this presentation was recorded.   This is a difficult data set to fit a trend line to.   We know that disease naturally grows exponentially. So let's try something exponential.   As you can see, this is not a good fit. And it's difficult to see how any function could fit the whole dataset.   Something that looks like an exponential is visible here in the first 40 days. So let's just fit to that section.   There is a good exponential fit. What we can do is partition the data into different phases and fit functions to each phase separately.   1, 2, 3, 4 and 5.   Partitions were chosen where the curve seemed to transition to a different kind of behavior.   Parameters in the fit function were optimized for us in JMP's Nonlinear fit tool. Details of how to use this tool are in the appendix.   Nonlinear also produced the root mean square error results, the sigma of the residuals.   So for the first phase, we fitted an exponential; second phase was logarithmic; third phase was linear; fourth phase, another logarithmic; fifth phase, another linear.   You can see that we have a good fit for each phase; the root mean square error is impressively low. However, as partition points were specifically chosen where the curve changed behavior, a low root mean square error is to be expected.   The trend lines have negligible predictive ability because the partition points were chosen by looking at existing data. This can be seen in the data collected since the analysis, which was performed on the 19th of June.   Where extra data is available, we could choose different partition points and get a better fit, but this will not help us to predict beyond the new data.   Partition points do show where the outbreak behavior changes, but this could be seen before the analysis was performed.   
Also no indication is given as to why the different phases have a different fit function.   This exercise does illustrate the difficulty of modeling the outbreak, but does not give us much useful information on what is happening or where the outbreak is heading. We need something simpler.   We're dealing with a system that contains self learning.   As we, as a society, learn more about the disease, we modify behavior to limit its spread, changing the outbreak trajectory.   Let's look into the mechanics of what's driving the outbreak, starting with the numbers themselves and working backwards to see what is driving them.   The news is full of COVID 19 numbers: the USA hits 5 million infections and 150,000 deaths. California has higher infections than New York. Daily infections in the US could top 100,000 per day.   Individual numbers are not that helpful.   Graphs help to put the numbers into context.   The right graphs help us to see what is happening in the system.   Disease grows exponentially. One person infects two, who infect four, who infect eight.   Human eyes differentiate poorly between different kinds of curves but they differentiate well between curves and straight lines. Plotting on a log scale changes the exponential growth and exponential decline into straight lines.   Also on the log scale early data is now visible where it was not visible on the linear scale. Many countries show one, sometimes two plateaus, which were not visible   in the linear graph. So you can see here for South Korea, there's one plateau, two plateaus and, more recently, it's beginning to grow for a third time.   How can we model this kind of behavior?   Let's keep digging.   The slope on the log infections graph is the percentage growth.   Plotting percentage growth gives us more useful information.   Percentage growth helps to highlight where things changed.   
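The point about log scales can be checked with a few lines of Python. This is only an illustrative sketch using made-up counts that grow 10% per day (the variable names and numbers are assumptions, not data from the talk): on a log scale the day-to-day differences are constant, and that constant slope converts back into the percentage growth.

```python
import math

# Made-up cumulative counts growing 10% per day (illustrative only).
daily_growth = 0.10
counts = [100 * (1 + daily_growth) ** day for day in range(10)]

# On a log scale, exponential growth becomes a straight line:
# successive differences of log(count) are all the same.
log_counts = [math.log(c) for c in counts]
slopes = [b - a for a, b in zip(log_counts, log_counts[1:])]

# The slope of the log curve converts back into a percentage growth rate.
recovered_growth = math.exp(slopes[0]) - 1

print(f"slope per day: {slopes[0]:.4f}")
print(f"growth per day: {recovered_growth:.1%}")
```

For small growth rates the slope of the natural-log curve and the percentage growth are nearly the same number, which is why the slope can be read directly as growth.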
If you look at the decline in the US numbers, the orange line here, you can see that the decline started to slacken off sometime in mid April and can be seen to be reversing here in mid June.   This is visible but it's not as clear in the infection graphs. It's much easier to see in the percentage growth graph.   Many countries show a linear decline in percentage growth when plotted on a log scale. Italy is a particularly fine example of this.   But it can also be seen clearly in China,   in South Korea,   and in Russia, and also to a lesser extent in many other countries.   Why is this happening?   Intuitively, I expect that when behavior changes, growth would drop down to a lower percent and stay there, not exponentially decline toward zero.   I started plotting graphs on COVID 19 back in late February, not to predict the outbreak, but because I was frustrated by the graphs that were being published.   After seeing this linear decline in percentage growth, I started taking an interest in prediction.   Extrapolating that percentage growth line through linear regression actually works pretty well as a predictor, but it only works when the growth is declining. It does not work at all well when the growth is increasing.   Again, going back to the US orange line, if we extrapolate from this small section here, where it's increasing, which is from the middle of June to the beginning of July,   we can predict that we will see 30% increase by around the 22nd of July, that will go up to 100% weekly growth by the 26th of August, and it will keep on growing from there, up and up and up and up.   Clearly, this model does not match reality.   I will come back to this exponential decline in percentage growth later. For now, let's keep looking at what is physically going on as the disease spreads.   
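The extrapolation described above can be sketched in Python as a straight-line fit to the log of percentage growth. This is a minimal illustration, not the JMP analysis from the talk: the helper names and both growth series are made up. It shows why the method behaves sensibly when growth is declining and runs away when growth is increasing.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate_growth(days, growth, future_day):
    """Fit a straight line to log(growth) and read it off at future_day."""
    slope, intercept = fit_line(days, [math.log(g) for g in growth])
    return math.exp(slope * future_day + intercept)

days = list(range(14))
# Made-up weekly growth fractions: one shrinking 5% a day, one growing 5% a day.
declining = [0.30 * 0.95 ** d for d in days]
rising = [0.10 * 1.05 ** d for d in days]

# Declining growth extrapolates toward zero, which is plausible.
print(f"declining, day 60: {extrapolate_growth(days, declining, 60):.1%}")
# Rising growth extrapolates upward without limit, which is not.
print(f"rising, day 60: {extrapolate_growth(days, rising, 60):.1%}")
```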
People progress from being susceptible to disease to being infected to being contagious   to being symptomatic to being noncontagious to being recovered.   This is the Markov SIR model. SIR stands for susceptible, infected, recovered. The three extra stages of contagious, symptomatic and noncontagious help us to model the disease spread and relate it to what we can actually measure.   Note the difference between infected and contagious. Infected means you have the disease; contagious means that you can spread it to others. It's easy to confuse the two, but they are different and will be used in different ways further into this analysis.   The timings shown are best estimates and can vary greatly. Infected to symptomatic can be from three to 14 days and some infected people   are never symptomatic.   The only data that we have access to is confirmed infections, which usually come from test results, which usually follow from being symptomatic.   Even if testing is performed on non symptomatic people, there's about a five-day delay from being infected to having a positive test result.   So we're always looking at old data. We can never directly observe the true number of people infected.   So the disease progresses through individuals from top to bottom in this diagram.   We have a pool of people that are contagious, and that pool is fed by people that are newly infected becoming contagious and the pool is drained by people that are contagious becoming non contagious.   The disease spreads through the population from left to right.   New infections are created when susceptible people come into contact with contagious people and become infected.   The newly infected people join the queue waiting to become contagious and the cycle continues.   This cycle is controlled by transmission:   how likely a contagious person is to infect a susceptible person per day.   
Reproduction is the number of people that a contagious person is likely to infect while they are contagious.   This whole cycle revolves around the number of people contagious and the transmission or reproduction.   The time individuals stay contagious should be relatively constant unless COVID 19 starts to mutate.   The transmission can vary dramatically depending on social behavior and the size of the susceptible population.   Our best estimate is that the days contagious averages out at about nine.   So we can estimate people contagious as the number of people confirmed infected in the last nine days.   In some respects, this is an underestimate because it doesn't include people that are infected but not yet symptomatic, or that are asymptomatic, or that don't yet have a positive test result.   In other respects, it's an overestimate because it includes people who were infected a long time ago, but are only now being tested as positive. It's an estimate.   From the estimate of people contagious, we can derive the percentage growth in contagious. It doesn't matter if the people contagious is an overestimate or underestimate.   As long as the percentage error in the estimate remains constant, the percentage growth in contagious will be accurate.   Percentage growth in contagious is important because we can use it to derive transmission.   The derivation of the equation relating the two can be found in the appendix.   Note that this equation allows you to derive transmission and then reproduction from the percentage growth in contagious, but it cannot tell you the percentage growth in contagious for a given transmission.   This can only be found by solving numerically.   I have outlined how to do this using JMP's fit model tool in the appendix.   Reproduction and transmission are very closely linked, but reproduction has the advantage of ease of understanding.   If it is greater than one, the outbreak is expanding out of control. 
Infections will continue to grow and there will be no end in sight.   If it is less than one, the outbreak is contracting, coming under control. There are still new infections, but their number will gradually decline until they hit zero. The end is in sight, though it may be a long way off.   The number of people contagious is the underlying engine that drives the outbreak.   People contagious grows and declines exponentially. We can predict the path of the outbreak by extrapolating this growth or decline in people contagious. Here we have done it for Russia and Italy and for China.   Remember the interesting observation from earlier: the infections percentage growth declines exponentially. Here's why.   If reproduction is less than one and constant, people contagious will decline exponentially towards zero.   People contagious drives the outbreak.   The percentage growth in infections is proportional to the number of people contagious. So if people contagious declines exponentially, then percentage growth in infections will also decline exponentially. Mystery solved.   The slope of people contagious plotted on log scale gives us the contagious percentage growth, which then gives us transmission and reproduction through the equations on the last slide.   Notice that there's a weekly cycle in the data. This is particularly visible in Brazil, but it's also visible in other countries as well.   This may be due to numbers getting reported differently at the weekends or by people being more likely to get infected at the weekend. Either way, we'll have to take this seasonality into account when using people contagious to predict the outbreak.   Because social behavior is constantly changing, transmission and reproduction change as well. So we can't use the whole distribution to generate reproduction.   We chose 17 days as the period over which to estimate reproduction. 
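The two estimation steps described so far (people contagious as confirmed infections over the last nine days, and growth estimated from a 17-day window on a log scale) can be sketched in Python. This is a minimal sketch under stated assumptions: the function names and the made-up case series are illustrative, and the step from growth to transmission and reproduction is left out because that equation is only given in the appendix.

```python
import math

DAYS_CONTAGIOUS = 9   # the talk's best estimate of days contagious
WINDOW = 17           # regression window chosen in the talk

def people_contagious(new_cases):
    """Rolling sum of new confirmed cases over the last DAYS_CONTAGIOUS days."""
    return [sum(new_cases[i - DAYS_CONTAGIOUS + 1:i + 1])
            for i in range(DAYS_CONTAGIOUS - 1, len(new_cases))]

def daily_growth(contagious):
    """Least-squares slope of log(contagious) over the last WINDOW points,
    converted to a daily growth fraction."""
    ys = [math.log(c) for c in contagious[-WINDOW:]]
    xs = list(range(len(ys)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return math.exp(sxy / sxx) - 1

# Made-up daily new cases declining 3% per day (not real data).
new_cases = [1000 * 0.97 ** d for d in range(40)]
contagious = people_contagious(new_cases)
growth = daily_growth(contagious)

print(f"people contagious today: {contagious[-1]:.0f}")
print(f"daily growth in contagious: {growth:.1%}")
```

With real data the weekly reporting cycle adds noise to the fit, which is the reason the talk gives for a window longer than two weeks.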
We found that one week was a little too short to filter out all of the noise, two weeks gave better results, two and a half weeks was even better. Having the extra half week   evened out the seasonality that we saw in the data.   There is a time series forecast tool in JMP that will do all of this for us, including the seasonality, but because we're performing the regression on small sections of the data, we didn't find the tool helpful.   Here are the derived transmission and reproduction numbers.   You can see that they can change quickly.   It is easy to get confused by these numbers. South Korea is showing a significant increase in reproduction, but it's doing well. The US, Brazil, India and South Africa are doing poorly, but seem to have a reproduction of around one or less.   This is a little confusing.   To help reduce the confusion around reproduction, here's a little bit of calculus.   Driving a car, the gas pedal controls acceleration.   To predict where the car is going to be, you need to know where you are, how fast you're traveling and how much you're accelerating or decelerating.   In a similar way, to know where the pandemic is going to be, we need to know how many infections there are, which is the equivalent of distance traveled. We need to know how fast the infections are expanding or how many people are contagious, both of which are the equivalent of speed.   We need to know how fast the people contagious is growing, which is transmission or reproduction, which is the equivalent of acceleration.   There is a slight difference. Distance grows linearly with speed and speed grows linearly with acceleration.   Infections do grow linearly with people contagious, but people contagious grows exponentially with reproduction.   There is a slight difference, but the principle's the same.   The US, Brazil, India and South Africa have all traveled a long distance. They have high infections and they're traveling at high speed. They have high contagious. 
Even a little bit of acceleration has a very big effect on the number of infections.   South Korea, on the other hand, is not going fast; it has low contagious. So it has the headroom to respond to the blip in acceleration and get things back under control without covering much distance.   Also, when the number of people contagious is low, adding a small number of new contagious people produces a significant acceleration. Countries that have things under control are prone to these blips in reproduction.   You have to take all three factors into account   (number of infections, people contagious and reproduction) to decide if a country is doing well or doing poorly.   Within JMP there are a couple of ways to perform the regression to get the percentage growth of contagious. There's the Fit Y by X tool and there's the nonlinear tool. I have details on how to use both these tools in the appendix. But let's compare the results they produce.   The graphs shown compare the results from both tools. The 17 data points used to make the prediction are shown in red.   The prediction lines from both tools are just about identical, though there are some noticeable differences in the confidence lines.   The confidence lines for the nonlinear tool are much better. The Fit Y by X tool transforms the data into linear space before finding the best fit straight line.   This results in the lower confidence line pulling closer to the prediction line after transforming back into the original space.   Confidence lines are not that useful when parameters that define the outbreak are constantly changing. Best case, they will help you to see when the parameters have definitely changed.   In my scripts, I use linear regression calculated in column formulas, because it's easy to adjust with variables. This allows the analysis to be adjusted on the fly without having to pull up the tool in JMP.   I don't currently use the confidence lines in my analysis. 
So I'm working on a way to integrate them into the column formulas.   Linear regression is simpler and produces almost identical results. Once again, keep it simple.   We have seen how fitting an exponential to the number of people contagious can be used to predict where people contagious will be in the future, and also to derive transmission.   Now that we have a prediction line for people contagious, we need to convert that back into infections.   Remember, new infections equals people contagious multiplied by transmission.   Transmission is the probability that a contagious person will infect a susceptible person per day.   The predicted graphs that result from this calculation are shown. Note that South Korea and Italy have low infections growth.   However, they have a high reproduction extrapolated from the last 17 days' worth of data. So, South Korea here and Italy here, low growth, but you can see them taking off because of that high reproduction number.   The infections growth becomes significant between two and eight weeks after the prediction is made.   For South Korea, this is unlikely to happen because they're moving slowly and have the headroom to get things back under control.   South Korea has had several of these blips as it opens up and always manages to get things back under control.   In the predicted growth percent graph on the right, note how the increasing percentage growth in South Korea and Italy will not carry on increasing indefinitely, but plateaus out after a while.   Percentage growth is still seen to decline exponentially, but it does not grow exponentially.   It plateaus out.   So to summarize,   the number of people contagious is what drives the outbreak.   This metric is not normally reported, but it's close to the number of new infections over a fixed period of time.   New infections in the past week is the closest regularly reported proxy for the number of people contagious. 
This is what we should be focusing on, not the number of infections or the number of daily new infections.   Exponential regression of people contagious will predict where the contagious numbers are likely to be in the future.   Percentage growth in contagious gives us transmission and reproduction.   The contagious number and transmission number can be combined to predict the number of new infections in the future.   That prediction method assumes that transmission and reproduction are constant, which they aren't. They change with behavior.   But the predictions are still useful to show what will happen if behavior does not change or how much behavior has to change to avoid certain milestones.   The only way to close this gap is to come up with a way to mathematically model human behavior.   If any of you know how to do this, please get in touch. We can make a lot of money, though only for a short amount of time.   This is the modeling. Let's check how accurate it is by looking at historical data from the US.   As mentioned, the prediction works well when reproduction's constant but not when it's changing.   If we take a prediction based on data from late April to early May, it's accurate as long as the reproduction number stays at around the same level of 1.0.   After the reproduction number starts rising, you can see that the prediction underestimates the number of infections.   The prediction based on data from late June to mid July, when reproduction was at its peak as states were beginning to close down again,   overestimates the infections as reproduction comes down.   The model is good at predicting what will happen if behavior stays the same but not when behavior is changing.   How can we predict deaths?   It should be possible to estimate the delay between infection and death   and the proportion of infections that result in deaths and then use this to predict deaths.   
However, changes in behavior such as increasing testing and tracking skew the number of infections detected.   So to avoid this skew also feeding into the predictions for deaths, we can use the exact same mathematics on deaths that we used on infections. As with infections, the deaths graph shows accurate predictions when deaths reproduction is stable.   Note that contagious and reproduction numbers for deaths don't represent anything real.   This method works because deaths follow infections and so follow the same trends and the same mathematics. Once again, keep it simple.   We have already seen that the model assumes constant reproduction. It also does not take into account herd immunity.   We are fitting an exponential, but the outbreak really follows the binomial distribution.   Binomial and a fitted exponential differ by less than 2% with up to 5% of the population infected. Graphs demonstrating this are in the appendix.   When more than 5% of the population is no longer susceptible due to previous infection or to vaccination, transmission and reproduction naturally decline.   So predictions based on recent reproduction numbers will still be accurate; however, long-term predictions based on an old reproduction number with significantly less herd immunity will overestimate the number of infections.   On the 21st of August, the US had per capita infections of 1.7%.   If only 34% of infected people have been diagnosed   as infected, and there is data that indicates that this is likely, we are already at the 5% level where herd immunity begins to have a measurable effect.   At 5% it reduces reproduction by about 2%.   What can the model show us? Reproduction tells us whether the outbreak is expanding (it's greater than 1, which is the equivalent of accelerating) or contracting (it's less than 1, the equivalent of decelerating).   Estimated number of people contagious tells us how bad the outbreak is, how fast we're traveling.   
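The herd-immunity arithmetic above (1.7% confirmed infections, with only 34% of infections diagnosed) works out as follows. The function and variable names are illustrative; the two input figures are the ones quoted in the talk.

```python
def estimated_true_infection_fraction(confirmed_fraction, diagnosed_share):
    """Scale confirmed per capita infections up by the share of infections diagnosed."""
    return confirmed_fraction / diagnosed_share

confirmed_fraction = 0.017  # US per capita confirmed infections on 21 August (from the talk)
diagnosed_share = 0.34      # assumed share of infected people who were diagnosed (from the talk)

true_fraction = estimated_true_infection_fraction(confirmed_fraction, diagnosed_share)
print(f"estimated share of population infected: {true_fraction:.1%}")
```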
Per capita contagious is the right metric to choose appropriate social restrictions.   The recommendations for social restrictions listed on this slide are adapted from those published by the Harvard Global Health Institute. There's a reference in the appendix.   What they recommend is when there's fewer than 12 people contagious per million, test and trace is sufficient. When we get up to 125 contagious per million, rigorous test and trace is required.   At 320 contagious per million, we need rigorous test and trace and some stay at home restrictions.   Greater than 320 contagious per million, stay at home restrictions are necessary.   At the time of writing, the US had 1,290 contagious per million, down from 1,860 at the peak in late July.   It's instructive to look at the per capita contagious in various countries and states when they decided to reopen.   China and South Korea had just a handful of people contagious per million.   Europe had tens of people contagious per million, except for Italy.   The US had hundreds of people contagious per million when they decided to reopen.   We should not really have reopened in May. This was an emotional decision, not a data-driven decision.   Some more specifics about the US reopening.   As I said, the per capita contagious in the US at the time of writing was 1,290 per million, with a reproduction of .94.   With this per capita contagious and reproduction, it will take until the ninth of December to get below 320 contagious per million.   The lowest reproduction during the April lockdown was .86.
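The restriction bands described above can be written as a small lookup. This is a sketch only: the thresholds are the ones quoted in the talk, while the function name, the wording of the labels, and the handling of values exactly on a boundary are assumptions.

```python
def recommended_restriction(contagious_per_million):
    """Map per capita contagious to the restriction band quoted in the talk.

    Boundary handling (which band 125 or 320 falls in) is an assumption.
    """
    if contagious_per_million < 12:
        return "test and trace"
    if contagious_per_million <= 125:
        return "rigorous test and trace"
    if contagious_per_million <= 320:
        return "rigorous test and trace plus some stay-at-home restrictions"
    return "stay-at-home restrictions"

# Figures quoted in the talk for the US: 1,290 at the time of writing, 1,860 at the peak.
for level in (5, 100, 300, 1290, 1860):
    print(level, "->", recommended_restriction(level))
```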
Shamgar McDowell, Senior Analytics and Reliability Engineer, GE Gas Power Engineering   Faced with the business need to reduce project cycle time and to standardize the process and outputs, the GE Gas Turbine Reliability Team turned to JMP for a solution. Using the JMP Scripting Language and JMP’s built-in Reliability and Survival platform, GE and a trusted third party created a tool to ingest previous model information and new empirical data which allows the user to interactively create updated reliability models and generate reports using standardized formats. The tool takes a task that would have previously taken days or weeks of manual data manipulation (in addition to tedious copying and pasting of images into PowerPoint) and allows a user to perform it in minutes. In addition to the time savings, the tool enables new team members to learn the modeling process faster and to focus less on data manipulation. The GE Gas Turbine Reliability Team continues to update and expand the capabilities of the tool based on business needs.       Auto-generated transcript...   Speaker Transcript Shamgar McDowell Maya Angelou famously said, "Do the best you can, until you know better. Then when you know better, do better." Good morning, good afternoon, good evening. I hope you're enjoying the JMP Discovery Summit, you're learning some better way ways of doing the things you need to do. I'm Shamgar McDowell, senior reliability and analytics engineer at GE Gas Power. I've been at GE for 15 years and have worked in sourcing, quality, manufacturing and engineering. Today I'm going to share a bit about our team's journey to automating reliability modeling using JMP. Perhaps your organization faces a similar challenge to the one I'm about to describe. As I walk you through how we approach this challenge, I hope our time together will provide you with some things to reflect upon as you look to improve the workflows in your own business context. 
So by way of background, I want to spend the next couple of slides, explain a little bit about GE Gas Power business. First off, our products. We make high tech, very large engines that have a variety of applications, but primarily they're used in the production of electricity. And from a technology standpoint, these machines are actually incredible feats of engineering with firing temperatures well above the melting point of the alloys used in the hot section. A single gas turbine can generate enough electricity to reliably power hundreds of thousands of homes. And just to give an idea of the size of these machines, this picture on the right you can see there's four adult human beings, which just kind of point to how big these machines really are. So I had to throw in a few gratuitous JMP graph building examples here. But the bubble plot and the tree map really underscore the global nature of our customer base. We are providing cleaner, accessible energy that people depend upon the world over, and that includes developing nations that historically might not have had access to power and the many life-changing effects that go with it. So as I've come to appreciate the impact that our work has on everyday lives of so many people worldwide, it's been both humbling and helpful in providing a purpose for what I do and the rest of our team does each day. So I'm part of the reliability analytics and data engineering team. Our team is responsible for providing our business with empirical risk and reliability models that are used in a number of different ways by internal teams. So in that context, we count on the analyst in our team to be able to focus on engineering tasks, such as understanding the physics that affect our components' quality and applicability of the data we use, and also trade offs in the modeling approaches and what's the best way to extract value from our data. These are, these are all value added tasks. 
Our process also entails that we go through a rigorous review with the chief engineers. So having a PowerPoint pitch containing the models is part of that process. And previously creating this presentation entailed significant copying and pasting and a variety of tools, and this was both time consuming and more prone to errors. So that's not value added. So we needed a solution that would provide our engineers greater time to focus on the value added tasks. It would also further standardize the process. So those are the two things: greater productivity and ability to focus on what matters, and further standardization. And so to that end, we use the mantra Automate the Boring Stuff. So I wanted to give you a feel for the scale of the data sets we used. Often the volume of the data that you're dealing with can dictate the direction you go in terms of solutions. And in our case, there's some variation but just as a general rule, we're dealing with thousands of gas turbines in the field, hundreds of tracked components in each unit, and then there's tens of inspections or reconditioning per component. So in all, there's millions of records that we're dealing with. But typically, our models are targeted at specific configurations and thus, they're built on more limited data sets with tens of thousands or fewer records. The other thing I was going to point out here is we often have over 100 columns in our data set. So there are challenges with this data size that made JMP a much better fit than something like an Excel based approach to doing these same tasks. So, the first version of this tool, GE worked with a third party to develop using the JMP Scripting Language. And the name of the tool is Computer Aided Reliability Modeling Application or CARMA, with a c. And the amount of effort involved with building this out to what we have today is not trivial. This is a representation of that. 
You can see the number of scripts and code lines that testify to the scope and size of the tool as it's come to today. But it's also been proven to be a very useful tool for us. So as time has gone on, we've seen the need to continue to develop and improve CARMA over time. And so in order to do this, we've had to grow and foster some in-house expertise in JSL coding and I oversee the work of developers that focus on this and some related tools. The message on this to you is that even after you create something like CARMA, there's going to be an ongoing investment required to maintain and keep the app relevant and evolve it as your business needs evolve. But it's doable and the benefits are very real. A survey of our users this summer actually pointed to a net promoter score of 100% and at least 25% reduction in the cycle time to do a model update. So that's real time that's being saved. And then anecdotally, we also see where CARMA has surfaced issues in our process that we've been able to address that otherwise might have remained hidden and unable to address. And I have a quote, it's kind of long. But I wanted to just pass on this caveat on automation from Bill Gates, who knows a thing or two about software development. "The first rule of any technology used in business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." So that's the end of the quote, but this is just a great reminder that automation is not a silver bullet that will fix a broken process; we still need people to do that today. Okay, so before we do a demonstration of the tool, I just wanted to give a high level overview of the tool and the inputs and outputs in CARMA. The user has to point the tool to the input files. So over here on the left, you see we have an active models file that's essentially the already approved models. 
And then we have empirical data. And then in the user interface, the user does some modeling activities. And then the outputs are running models, so updates to the active models, and a PowerPoint presentation. And we'll also look at that. As background for the data I'll be using in the demo, I just wanted to pass on that I started with the locomotive data set. And we'll see that JMP provides some sample data. So this is the case there, and that gives one population. Then I also added in two additional populations of models. And the big message here I wanted to pass on is that what we're going to see is all made-up data. It's not real; it doesn't represent the functionality or the behavior of any of our parts in the field; it's just all contrived. So keep that in mind as we go through the results, but it should give us a way to look at the tool, nonetheless. So I'm going to switch over to JMP for a second and I'm using JMP 15.2 for the demo. And this data set is simplified compared to what we normally see. But like I said, it should exercise the core functionality in CARMA. So first, I'm just going to go to the Help menu, Sample Data, and you'll see the Reliability and Survival menu here. So that's where we're going. One of the nice things about JMP is that it has a lot of different disciplines and functionality and specialized tools for that. And so for my case with reliability, there's a lot here, which also lends to the value of using JMP as a home for CARMA. But I wanted to point you to the locomotive data set and just show you... this originally came out of a textbook, and it talks to it here, Applied Life Data Analysis. So, in that, there's a problem that asks you what the risk is at 80,000 exposures and we're going to model that today in our data set in an oxidation model, is what we've called it, but essentially CARMA will give us an answer. Again, a really simple answer, but I was just going to show you, you can get the same answer by clicking through the Analyze menu. 
So we go down to Analyze, Reliability and Survival, Life Distribution. Put the time and censor where they need to go. We're going to use Weibull and just the two...so it creates a fit for that data. Two parameters I was going to point out are the beta, 2.3, and then it's called a Weibull alpha here. In our tool, it'll be called eta, but 183. Okay, so we see how to do that here. Now just to jump over, I want to look at a couple of the other files, the input files, so I will pull those up. Okay, this is the model file. I mentioned I made three models. And so these are the active models that we're going to be comparing the data against. You'll see that oxidation is the first one, I mentioned that, and then each one also, in addition to having model parameters, has some configuration information. This is just two simple things here (combustion system, fuel capability) I use for examples, but there's many, many more columns like it. But essentially what CARMA does, one of the things I like about it is when you have a large data set with a lot of different varied configurations, it can go through and find which of those rows of records applies to your model and do the sorting real time, and you know, do that for all the models that you need to do in the data set. And so that's what we're going to use that to demonstrate. Excuse me. Also, just to jump over to the empirical data for a minute. And just to highlight, we have a censor, we have exposures, we have the interval that we're going to evaluate those exposures at, modes, and then these are the last two columns I just talked about, combustion system and fuel capability. Okay, so let's start up CARMA. It's an add-in, so I'll just get it going. And you'll see I already have it pointing to the location I want to use. And in today's presentation, I'm not gonna have time to talk through all the variety of features that are in here. 
But these are all things that can help you take and look at your data and decide the best way to model it, and to do some checks on it before you finalize your models. For the purposes of time, I'm not going to explain all that and demonstrate it, but I just wanted to take a minute to build the three models we talked about and create a presentation so you can see that portion of the functionality. Excuse me, my throat is getting dry all of a sudden so I have to keep drinking; I apologize for that. So we've got oxidation. We see the number of failures and suspensions. That's the same as what you'll see in the text. Add that. And let's just scroll down for a second. That's the first model added, Oxidation. We see the old model had 30 failures, 50 suspensions. This one has 37 and 59. The beta is 2.33, like we saw externally, and the eta is 183. And the answer to the textbook question, the risk at 80,000 exposures, is about 13.5% using a Weibull model. So that's just kind of a high-level way to do that here. Let's look at also just adding the other two models. Okay, we've got cracking, and I'm adding in creep. And you'll see in here there are different boxes presented that represent things like the combustion system or the fuel capability, where for this given model, this is what the LDM file calls for. But if I wanted to change that, I could select other configurations here and that would result in changing my rows for F and S, as far as what gets included or doesn't. And then I can create new populations and segment the data accordingly. Okay, so we've gotten all three models added and I think, you know, we're not going to spend more time on that, just playing with the models as far as options, but I'm going to generate a report. And I have some options on what I want to include in the report. And I have a presentation, and this LDM input is going to be the active models, sorry, the running models that come out as a table. 
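The 13.5% risk quoted for the oxidation model can be checked by hand from the two Weibull parameters. What follows is a minimal sketch in Python, not CARMA or JMP code. It assumes the standard two-parameter Weibull cumulative distribution and assumes that eta = 183 is expressed in thousands of exposures; the function name `weibull_risk` is made up for illustration.

```python
import math

def weibull_risk(t, beta, eta):
    """Cumulative failure probability F(t) = 1 - exp(-(t / eta) ** beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

# Parameters quoted in the demo; exposures are assumed to be in thousands.
risk = weibull_risk(t=80.0, beta=2.33, eta=183.0)
print(f"Risk at 80,000 exposures: {risk:.1%}")
```

Under those assumptions the result is roughly 13.5%, which agrees with the figure read off in the demo.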
All right, so I just need to select the appropriate folder where I want my presentation to go. And now it's going to take a minute here to go through and generate this report. This does take a minute. But I think what I would just contrast it to is the hours that it would normally take to do this same task, potentially, if you were working outside of the tool. And so now we're ready to finalize the report. Save it. And save the folder and now it's done. It's in there and we can review it. The other thing I'll point out, as I pull it up: I'd already generated this previously, so I'll just pull up the file that I already generated and we can look through it. But this is a template. It's meant for speed, but this can be further customized after you make it, or you can leave placeholders; you can modify the slides after you've generated them. It's doing more than just the life distribution modeling that I kind of highlighted initially. It's doing a lot of summary work, summarizing the data included in each model, which, of course, JMP is very good for. It does some work comparing the models, so you can do a variety of statistical tests using JMP. And again, JMP is great at that. So that adds that functionality. Some of the things our reviewers like to see are how the models have changed year over year: you have more data, or include less. How does it affect the parameters? How does it change your risk numbers? Plots, of course; you get a lot of information out of scatter plots and things of that nature. There's a summary that includes some of the configuration information we talked about, as well as the final parameters. And it does this for each of the three models, as well as just a risk roll-up at the end for all these combined. So that was a quick walkthrough of the demo. I think we've covered everything I wanted to do. Hopefully we'll get to talk a little more in Q&A if you have more questions. It's hard to anticipate everything. 
But I just wanted to talk to some of the benefits again. I've mentioned this previously, but we've seen productivity increases as a result of CARMA, so that's a benefit. Of course, standardization of our modeling process has increased, and that also allows team members who are newer to focus more on the process and learning it versus working with tools, which, in the end, helps them come up to speed faster. And then there's also increased employee engagement by allowing engineers to use their minds where they can make the biggest impact. So I also wanted to be sure to thank Melissa Seely, Brad Foulkes, Preston Kemp and Waldemar Zero for their contributions to this presentation. I owe them a debt of gratitude for all they've done in supporting it. And I want to thank you for your time. I've enjoyed sharing our journey towards improvement with you all today. I hope we have a chance to connect in the Q&A time, but either way, enjoy the rest of the summit.  
Stephen Czupryna, Process Engineering Consultant & Instructor, Objective Experiments   Manufacturing companies invest huge sums of money in lean principles education for their shop floor staff, yielding measurable process improvements and better employee morale. However, many companies indicate a need for a higher return on investment on their lean investments and a greater impact on profits. This paper will help lean-thinking organizations move to the next logical step by combining lean principles with statistical thinking. However, effectively teaching statistical thinking to shop floor personnel requires a unique approach (like not using the word “statistics”) and an overwhelming emphasis on data visualization and hands-on work. To that end, here’s the presentation outline:   A)    The Prime Directive (of shop floor process improvement) B)    The Taguchi Loss Function, unchained C)    The Statistical Babelfish D)    Refining Lean metrics like cycle time, inventory turns, OEE and perishable tool lifetime E)    Why JMP’s emphasis on workflow, rather than rote statistical tools, is the right choice for the shop floor F)    A series of case studies in a what-we-did versus what-we-should-have-done format.   Attendee benefits include guidance on getting-it-right with shop floor operators, turbo-charged process improvement efforts, a higher return on their Lean training and statware investments and higher bottom line profits.     Auto-generated transcript...   Speaker Transcript Stephen Czupryna Okay. Welcome, everyone. Welcome to "At the Corner of Lean Street and Statistics Road." Thank you for attending. My name is Stephen Czupryna. I work as a contractor for a small consulting company in Bellingham, Washington. Our name is Objective Experiments. We teach design of experiments, we teach reliability analysis and statistical process control, and I have a fairly long history of work in manufacturing. So here's the presentation framework for the next 35-odd minutes. 
I'm going to first talk about the Lean foundation of the presentation, about how Lean is an important part of continuous improvement. And then in the second section, we'll take Lean to what I like to call the next logical step, which is to teach and help operators, and in particular to teach and help them using graphics, specifically JMP graphics. And we'll talk about refining some of the common Lean metrics and we'll end with a few case studies. But first, a little bit of background: what I'm about to say in the next 35-odd minutes is based on my experience. It's my opinion. And it will work, I believe, often, but not always. There are some companies that may not agree with my philosophy, particularly companies that are, you know, really focused on pushing stuff out the door and the like. It probably wouldn't work in that environment, but in the right environment, I think a lot of what I say will work fine. All the data you're about to see is simulated, and I will post or have posted a detailed paper at JMP.com. You can collect it there, or you're welcome to email me at Steve@objexp.com and I'll email you a copy of it, or you can contact me with some questions. Again, my view. My view, real simple, boil it all down: production workers and maintenance workers are the key to continuous improvement. I've spent my career listening carefully to production operators and maintenance people, learning from them, and most of all, helping them. So my view is a little bit odd. I believe that an engineer, a process engineer or quality engineer, really needs to earn the right to enter the production area, and needs to earn the support of the people that are working there, day in, day out, eight hours a day. Again, my opinion. So who's the presentation for? 
Yeah, the short list is people in production management, supervisors, manufacturing managers and the like, process engineers, quality engineers, manufacturing engineers, folks that are supposed to be out on the shop floor working on things. And this presentation is particularly for people who like the view in the photograph: that the customer, the internal customer, if you will, is the production operator and that the engineer or the supervisor is really a supplier to that person. And to quote Dr. Deming, "A bad system beats a good person, every time." And the fact is the production operators aren't responsible for the system. They work within the system. So the goal of the presentation is to help you work with your production people,
Sam Edgemon, Analyst, SAS Institute Tony Cooper, Principal Analytical Consultant, SAS   The Department of Homeland Security asked the question, “how can we detect acts of biological terrorism?” After discussion and consideration, our answer was “If we can effectively detect an outbreak of a naturally occurring event such as influenza, then we can find an attack in which anthrax was used because both present with similar symptoms.” The tools that were developed became much more relevant to the detection of naturally occurring outbreaks, and JMP was used as the primary communication tool for almost five years of interactions with all levels of the U.S. Government. In this presentation, we will demonstrate how those tools developed then could have been used to defer the effects of the Coronavirus COVID-19. The data that will be used for demonstration will be from Emergency Management Systems, Emergency Departments and the Poison Centers of America.     Auto-generated transcript...   Speaker Transcript Sam Edgemon Hello. This is Sam Edgemon. I work for the SAS Institute and, you know, I work for the SAS Institute because I get to work on so many different projects.   And we're going to tell you today about one of those projects that we worked on. On almost all these projects I work on, I work with Tony Cooper, who's on the screen. We've worked together really since we met at the University of Tennessee a few years ago.   And the things we learned at the University of Tennessee we've applied throughout this project. Now this project was done for the Department of Homeland Security.   The Department of Homeland Security was very concerned about biological terrorism and they came to SAS with the question of how will we detect acts of biological terrorism.   Well, you know, that's quite a discussion to have, if you think about   the things we might come back with. 
You know, one of those things was well what do you, what are you most concerned with what does, what do the things look like   that you're concerned with? And they they talked about things like anthrax, and ricin and a number of other very dangerous elements that terrorists could use to hurt the American population.   Well, we took the question and and their, their immediate concerns and researched as best we could concerning anthrax and ricin, in particular.   You know, our research involved, you know, involved going to websites and studying what the CDC said were symptoms of anthrax, and the symptoms of   ricin and and how those, those things might present in a patient that walks into the emergency room or or or or takes a ride on an ambulance or calls a poison center or something like that happens. So what we realized in going through this process was   was that the symptoms look a lot like influenza if you've been exposed to anthrax. And if you've been exposed to ricin, that looks a lot like any type of gastrointestinal issue that you might might experience. So we concluded and what our response was to Homeland Security was that   was that if we can detect an outbreak of influenza or an outbreak of the, let's say the norovirus or some gastrointestinal issue,   then we think we can we can detect when when some of these these bad elements have been used out in the public. And so that's the path we took. So we we took data from EMS and and   emergency rooms, emergency departments and poison centers and we've actually used Google search engine data as well or social media data as well   to detect things that are you know before were thought as undetectable in a sense. But but we developed several, several tools along the way. And you can see from the slide I've got here some of the results of the questions   that that we that we put together, you know, these different methods that we've talked about over here. 
I'll touch on some of those methods in the brief time we've got to talk today, but let's let's dive into it. What I want to do is just show you the types of conversations we had   using JMP. We use JMP throughout this project to to communicate our ideas and communicate our concerns, communicate what we were seeing. An example of that communication could start just like this, we, we had taken data from from the EMS   system, medical system primarily based in North Carolina. You know, SAS is based in North Carolina, JMP is based in North Carolina in Cary and   and some of them, some of the best data medical data in the country is housed in North Carolina. The University of North Carolina's got a lot to do that.   In fact, we formed a collaboration between SAS and the University of North Carolina and North Carolina State University to work on this project for Homeland Security that went on for almost five years.   But what what I showed them initially was you know what data we could pull out of those databases that might tell us interesting things.   So let's just walk, walk through some of those types of situations. One of the things I initially wanted to talk about was, okay let's let's look at cases. you know,   can we see information in cases that occur every, every day? So you know this this was one of the first graphs I demonstrated. You know, it's hard to see anything in this   and I don't think you really can see anything in this. This is the, you know, how many cases   in the state of North Carolina, on any given day average averages, you know, 2,782 cases a day and and, you know, that's a lot of information to sort through.   So we can look at diagnosis codes, but some of the guys didn't like the idea that this this not as clear as we want want it to be so so we we had to find ways to get into that data and study   and study what what what ways we could surface information. 
One of those ways we felt like was to identify symptoms, specific symptoms related to something that we're interested in,   which goes back to this idea that, okay we've identified what anthrax looks like when someone walks in to the emergency room or takes a ride on an ambulance or what have you.   So we have those...if we identify those specific symptoms, then we can we can go and search for that in the data.   Now a way that we could do that, we could ask professionals. There was there's rooms full of of medical professionals on this, on this project and and lots of physicians. And kind of an odd thing that   I observed very quickly was when you asked a roomful of really, really smart people question like, what what is...what symptoms should I look for when I'm looking for influenza or the norovirus, you get lots and lots of different answers.   So I thought, well, I would really like to have a way to to get to this information, mathematically, rather than just use opinion. And what I did was I organized the data that I was working with   to consider symptoms on specific days and and the diagnosis. I was going to use those diagnosis diagnosis codes.   And what I ended up coming out with, and I set this up where I could run it over and over, was a set of mathematically valid symptoms   that we could go into data and look and look for specific things like influenza, like the norovirus or like anthrax or like ricin or like the symptoms of COVID 19.   This project surfaced again with with many asks about what we might...how we might go about finding the issues   of COVID 19 in this. This is exactly what I started showing again, these types of things. How can we identify the symptoms? Well, this is a way to do that.   
Now, once we find these symptoms, one of the things that we do is we will write code that might look something similar to this code that will will look into a particular field in one of those databases and look for things that we found in those analyses that we've   that we've just demonstrated for you. So here we will look into the chief complaint field in one of those databases to look for specific words   that we might be interested in doing. Now that the complete programs would also look for terms that someone said, Well, someone does not have a fever or someone does not have nausea. So we'd have to identify   essentially the negatives, as well as the the pure quote unquote symptoms in the words. So once we did that, we could come back to   JMP and and think about, well, let's, let's look at, let's look at this information again. We've got we've got this this number of cases up here, but what if we took a look at it   where we've identified specific symptoms now   and see what that would look like.   So what I'm actually looking for is any information regarding   gastrointestinal issues. I could have been looking for the flu or anything like that, but this is this is what the data looks like. It's the same data. It's just essentially been sculpted to look like you know something I'm interested in. So in this case, there was an outbreak   of the norovirus that we told people about that they didn't know about that, you know, we started talking about this on January 15.   And and you know the world didn't know that there was a essentially an outbreak of the norovirus until we started talking about it here.   And that was, that was seen as kind of a big deal. You know, we'd taken data, we'd cleaned that data up and left the things that we're really interested in   But we kept going. 
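The code shown on screen is not reproduced in the transcript. As a rough illustration of the idea described (searching a chief complaint field for symptom words while setting aside negated mentions such as "does not have a fever"), here is a minimal Python sketch. The term list, the negation cues, the rule that a negation only applies within the same punctuation-delimited clause, and the function name `find_symptoms` are all assumptions made for this example, not the project's actual logic.

```python
import re

# Hypothetical term lists for illustration only.
SYMPTOM_TERMS = ["fever", "nausea", "vomiting", "diarrhea"]
NEGATION_CUES = r"(?:no|not|denies|denied|without)"

def find_symptoms(chief_complaint):
    """Return the set of symptom terms mentioned without a preceding negation."""
    text = chief_complaint.lower()
    found = set()
    for term in SYMPTOM_TERMS:
        for match in re.finditer(rf"\b{re.escape(term)}\b", text):
            # Only look back as far as the previous punctuation mark.
            clause = re.split(r"[.,;:]", text[:match.start()])[-1]
            if re.search(rf"\b{NEGATION_CUES}\b", clause):
                continue  # negated mention, e.g. "denies fever"
            found.add(term)
    return found

print(find_symptoms("Pt c/o nausea and vomiting, denies fever"))
```

A production version would need a much richer vocabulary (misspellings, abbreviations) and a more careful treatment of negation scope; the sketch only shows why both the positive terms and the negatives have to be identified.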
You know that the strength of what we were doing was not simply just counting cases or counting diagnosis codes, we're looking at symptoms that that describe the person's visit to   the emergency room or what they called about the poison center for or they or they took a ride on the ambulance for.   chief complaint field, symptoms fields,   and free text fields. We looked into the into the fields that described the words that an EMS tech might use on the scene. We looked in fields that describe   the words that a nurse might use whenever someone first comes into the emergency room, and we looked at the words that a physician may may use. Maybe not what they clicked on the in in the boxes, but the actual words they used. And we we developed a metric around that as well.   This metric   was, you know, it let us know   you know, another month in advance that something was was odd in a particular area in North Carolina on a particular date. So I mentioned this was January 15 and this, this was December 6   and it was in the same area. And what is really registering is is the how much people are talking about a specific thing and if one person is talking about it,   it's not weighted very heavily, therefore, it wouldn't be a big deal. If two people are talking about it, if a nurse   and an EMS tech are talking about a specific set of symptoms, or mentioning a symptom several times, then, then we're measuring that and we're developing a metric from that information.   So if three people, you know, the, the doctor, the nurse and the EMS tech if that's what information we have is, if they're all talking about it,   then it's probably a pretty big deal. So that's what's happened here on December 6, a lot of people are talking about symptoms that would describe something like the norovirus.   This, this was related to an outbreak that the media started talking about in the middle of February. 
So this is seen as us telling the world about something that the media started talking about, you know, a month later.   And   specifically, you know, we were drawn to this Cape Fear region because a lot of the cases were in that area of North Carolina around Wilson,   Wilson County and that sort of thing. So that was seen as something of interest, that we could kind of drill in that far in advance and, you know, talk about something going on. Now   we carried on with that type of work concerning, you know, using those tools for biosurveillance.   But what we did later was, you know, after we set up systems that were essentially running   every day, you know, every hour, every day, that sort of thing, then we would be able to say, well,   the system has predicted an outbreak, you know, if this was noticed. The information it was providing was really noise free in a sense. We looked back over time and we were   predicting, let's say, between 20 and 30 alerts a year,   total alerts a year. So there were 20 or 30 situations where we had just given people the notice that they should look into something, you know, check something out. There might be, you know, a situation occurring. But in one of these instances,   the fellow that we worked with so much at Homeland Security came to us and said, okay, we believe your alert, so tell us something more about it. Tell us   what it's made up of. That's how he put the question. So what we did   was develop a model, just right in front of him.   And the reason we were able to do that (and here are the results of that model), the reason we were able to do that was that by now we realized the value   of keeping data concerning symptoms relative to time and place and all the different pieces of data we could keep in relation to that, like age, like ethnicity.   
So when we were asked, What's it made up of, then then we could... Let's put this right in the middle of the screen, close some of the other information around us here so you can just focus on that.   So when we're asked, okay, what's this outbreak made up of, you know, we, we built a model in front of them (Tony actually did that)   and that that seemed to have quite an impact when he did this, to say, Okay, you're right. Now we've told you today there there's there's an alert.   And you should pay attention to influenza cases in this particular area because it appears to be abnormal. But we could also tell them now that, okay   these cases are primarily made up of young people, people under the age of 16.   The symptoms, they're talking about when they go into emergency room or get on an ambulance is fever, coughing, respiratory issues. There's pain.   and there's gastrointestinal issues. The, the key piece of information we feel like is is the the interactions between age groups and the symptoms themselves.   While this one may, you know, it may not be seen as important is because it's down the list, we think it is,   and even these on down here. We talked about young people and dyspnea, and young people and gastro issues, and then older people.   So there was, you know, starting to see older people come into the data here as well. So we could talk about younger people, older people and and people in their   20s, 30s, 40s and 50s are not showing up in this outbreak at this time. So there's a couple of things here. When we could give people you know intel on the day of   of an alert happening and we could give them a symptom set to look for. You know when COVID 19 was was well into our country, you know you you still seem to turn on the news everyday and hear of a different symptom.   This is how we can deal with those types of things. 
You know, we can understand   you know, what what symptoms are surfacing such that people may may actually have, you know, have information to recognize when a problem is actually going to occur and exist.   So, so this is some of the things that you know we're talking about here, you'll think about how we can apply it now.   Using the the systems of alerting that I showed you earlier that, you know, I generally refer to as the TAP method as just using text analytics and proportional charting.   Well, you know, that's we're probably beyond that now, it's it's on us. So we didn't have the tool in place to to go looking then.   But these types of tools may still help us to be able to say, you know, this is these are the symptoms we're looking for. These are the   these are the age groups were interested in learning about as well. So, so let's let's keep walking through some ways that we could use what we learned back on that project to to help the situation with COVID 19.   One of the things that we did of course we've we've talked about building this this the symptoms database. The symptoms database is giving us information on a daily basis about symptoms that arise.   And and you know who's, who's sick and where they're sick at. So here's an extract from that database that we talked about, where it it has information on a date,   it has information about gender, ethnicity, in regions of North Carolina. We could you take this down to towns and and the zip codes or whatever was useful.   This I mentioned TAP in that text analytics information, well now we've got TAP information on symptoms. You know, so if people are talking about   this, say for example, nausea, then we we know how many people are talking about nausea on a day, and eventually in a place. And so this is just an extract of symptoms from   from this   this database. So, so let's take a look at how we could use this this. 
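The talk names the TAP method (text analytics and proportional charting) without giving its formulas. One conventional way to chart a proportion is a p-chart, so the sketch below assumes that form: each day's share of records mentioning a symptom is compared with a 3-sigma upper limit computed from a baseline period. The counts are made up, and the function name `p_chart_alerts`, the 3-sigma rule and the baseline window are assumptions for illustration rather than the project's actual alerting rule.

```python
import math

def p_chart_alerts(mentions, totals, baseline_days):
    """Return indices of days whose symptom proportion exceeds the upper control limit."""
    # Baseline proportion pooled over the first `baseline_days` days.
    p_bar = sum(mentions[:baseline_days]) / sum(totals[:baseline_days])
    alerts = []
    for day, (x, n) in enumerate(zip(mentions, totals)):
        ucl = p_bar + 3.0 * math.sqrt(p_bar * (1.0 - p_bar) / n)
        if x / n > ucl:
            alerts.append(day)
    return alerts

# Made-up daily counts: records whose text mentions a symptom, and total records.
mentions = [50, 55, 48, 52, 51, 49, 90]
totals = [1000, 1000, 1000, 1000, 1000, 1000, 1000]
print(p_chart_alerts(mentions, totals, baseline_days=6))
```

With these invented numbers only the last day is flagged. Charting a proportion rather than a raw count keeps the signal comparable across days and regions with different case volumes.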
Let's say you wanted to come to me, an ER doctor, or some someone investigating COVID 19 might come to me and say,   well, where are people getting sick at. You know, that's where are people getting sick   now, or where might an outbreak be occurring in a particular area. Well, this is the type of thing we might do to demonstrate that.   I use Principal Components Analysis a lot. In this case because we've got this data set up, I can use this tool to identify   the stuff I'm interested in analyzing. In this case it's the regions, they asked, you know, the question was where, where and what. Okay what what are you interested in knowing about? So I hear people talk about respiratory issues   concerning COVID and I hear people talking about having a fever and and these are kind of elevated symptoms. These are issues that people are talking about   even more than they're writing things down. That's the idea of TAP is, is we're getting into those texts fields and understanding understanding interesting things. So once we we   we run this analyses,   JMP creates this wonderful graph for us. It's great for communicating what's going on. And what's going on in this case is that Charlotte, North Carolina,   is really maybe inundated with with with physicians and nurses and maybe EMS techs talking about their patients having a fever   and respiratory issues. If you want to get as far as you can away from that, you might spend time in Greensboro or Asheville, and if you're in Raleigh Durham, you might be aware of what's on the way.   So that this is this is a way that we can use this type of information for   for essentially intelligence, you know, intelligence into what what might be happening next in specific areas. We could also talk about severity in the same, in the same instance. We could talk about severity of cases and measure where they are the same way.   So you know the the keys here is is getting the symptoms database organized and utilized.   
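To make the principal components step concrete, here is a small Python sketch that scores regions on the first principal component of two symptom rates. It is not the JMP analysis from the talk: the region names and rates are invented, the component is found by power iteration on the covariance matrix (JMP's platform typically works on correlations by default), and the function name `first_principal_component` is hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component of `rows`."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix of the centered columns.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration converges to the leading eigenvector of the covariance.
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(k)) for c in centered]
    return v, scores

# Made-up share of records mentioning each symptom: [fever, respiratory].
regions = ["region_A", "region_B", "region_C", "region_D"]
symptom_rates = [[0.30, 0.28], [0.18, 0.17], [0.08, 0.09], [0.07, 0.06]]
loadings, scores = first_principal_component(symptom_rates)
for name, score in sorted(zip(regions, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:+.3f}")
```

Regions with the highest scores are the ones where both symptoms are being mentioned most, which is the kind of separation the graph in the talk is used to communicate.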
We use JMP to communicate these ideas. A graph like this may have been shown to Homeland Security and we talked about it for two hours easily, and not just with questions about validity,   you know, where the data come from and so forth. We could talk about that, and we could also talk about,   okay, this is the information that you need to know, you know. This is information that will help you understand where people are getting sick, such that warnings can be given and essentially lives saved.   So that, in a sense, is the system that we put together. The underlying key is the data.   Again, the data we've used is EMS, ED, and poison center data. I don't have an example of the poison center data here, but I've got a long talk about how we use poison center data to surface foodborne illness, in ways similar to what we've shown here.   And then there's the ability to be fairly dynamic with developing our story in front of people and talking to them,   you know, selling belief in what we do. JMP helps us do that; SAS code helps us do that. That's a good combination of tools, and that's all I have for this particular   topic. I appreciate your attention and hope you find it useful, and hope we can help you with this type of stuff. Thank you.
Tony Cooper, Principal Analytical Consultant, SAS Sam Edgemon, Principal Analytical Consultant, SAS Institute   In process product development, Design of Experiments (DOE) helps to answer questions like: Which X’s cause Y’s to change, in which direction, and by how much per unit change? How do we achieve a desired effect? And, which X’s might allow looser control, or require tighter control, or a different optimal setting? Information in historical data can help partially answer these questions and help run the right experiment. One of the issues with such non-DOE data is the challenge of multicollinearity. Detecting multicollinearity and understanding it allows the analyst to react appropriately. Variance Inflation Factors from the analysis of a model and Variable Clustering without respect to a model are two common approaches. The information from each is similar but not identical. Simple plots can add to the understanding but may not reveal enough information. Examples will be used to discuss the issues.     Auto-generated transcript...   Speaker Transcript Tony Hi, my name is Tony Cooper and I'll be presenting some work that Sam Edgemon and I did and I'll be representing both of us.   Sam and I have been research statisticians in support of manufacturing and engineering for much of our careers. And today's topic is Understanding Variable Clustering is Crucial When Analyzing Stored Manufacturing Data.   I'll be using a single data set to present from. The focus of this presentation will be reading output. And I won't really have time to fill out all the dialog boxes to generate the output.   But I've saved all the scripts in the data table, which of course will be available in the JMP Community.   Once the script is run, you can always use that red arrow to do the Redo option and Relaunch Analysis, and you'll see the dialog box that would have been used to produce that piece of output.   I'll be using this single data set that is on manufacturing data. 
  And let's have a quick look at how this data set looks. First of all, I have a timestamp. So this is a continuous process, and at some interval of time I come back and I get a bunch of information   on some Y variables, some major outputs, some KPIs that, you know, the customer really cares about. These have to be in spec and so forth in order for us to ship it effectively. So these would absolutely, definitively be outputs. Right.   Then there are things like line speed, the set point for the vibration,   and a bunch of other things. So you can see all the way across, I've got 60-odd variables that we're measuring at that moment in time. Some of them are sensors and some of them are set points.   And in fact, let's look and see that some of them are indeed set points, like the manufacturing heat set point. Some of them are commands, which means, like, here's what the PLC told the process to do. Some of them are measures, which is the actual value that it's at right now.   Some of them are   ambient conditions; I think that's an external temperature.   Some of them   are raw material quality characteristics, and there's certainly some vision system output. So there's a bunch of things in the inputs right now.   And you can imagine, for instance, if I've got the command and the measure, what I told the zone I want it to be at and what it actually measures, we'd hope that those are correlated.   And we need to investigate some of that in order to think about what's going on. And that's multicollinearity: understanding the interrelationships amongst   the inputs or, separately, understanding the multicollinearity in the outputs. But by and large, I'm not doing X cause, Y effect right now. I'm not doing a supervised technique. This is an unsupervised technique; I'm just trying to understand what's going on in the data in that fashion.   And all my scripts are over here. 
Here's the abstract we talked about, and here's a review of the data. As we just said, that may explain why there is so much multicollinearity, because I do have set points and actuals in there. We'll learn about those things. What we're going to do first is fit Y as a function of X, and I'm only going to look at two variables right now. One is the zone one command, what I told zone one to be, if you will. The other is the zone one measure: I've got a little thermocouple in the zone one area, and it's measuring. You can see clearly that as the zone one temperature gets higher, this response variable Y4 gets lower. That's true also for the measurement, and you can see it in the estimates, which are both negative. By the way, these are fairly predictive variables, in the sense that just this one variable is explaining 50% of what's going on in Y4. Well, let's do a multivariate model. Imagine I've gone to Fit Model and moved both of those variables into my model, and my response is still Y4. Oh, look at this. Now it's suggesting that as the command goes up, Y4 goes up, which is the opposite of what I expect. The measure is still negative, in the right direction, but look at some of these numbers. They aren't even in the same ballpark as what I had a moment ago, which was negative .04 and negative .07. Now I have positive .4 and negative .87. I'm not sure I can trust this model from an engineering standpoint, and I really wonder how well it's characterizing the process and helping me understand it. And there's a clue. It may not be on by default, but you can right-click on the parameter estimates table and ask for the VIF column. That stands for variance inflation factor.
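The sign flip described above can be reproduced on simulated data. The following is a minimal sketch, not the talk's data set: the variables `command`, `measure`, and `y`, and every number in the simulation, are assumptions chosen only to mimic a set point and a thermocouple that tracks it closely.

```python
import numpy as np

# Simulated stand-ins (assumptions, not the talk's data):
rng = np.random.default_rng(0)
n = 200
command = rng.normal(250.0, 5.0, n)             # hypothetical zone 1 command
measure = command + rng.normal(0.0, 0.5, n)     # thermocouple tracks the command
y = -0.05 * command + rng.normal(0.0, 0.5, n)   # hypothetical response

def ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# One predictor at a time: both slopes come out negative, as expected.
slope_command = ols(command, y)[1]
slope_measure = ols(measure, y)[1]

# Both predictors together: the two slopes are poorly determined and
# can move far from the single-variable estimates, or change sign.
joint = ols(np.column_stack([command, measure]), y)

# With only two predictors, each VIF is 1 / (1 - r^2).
r = np.corrcoef(command, measure)[0, 1]
vif = 1.0 / (1.0 - r**2)

print("single-variable slopes:", slope_command, slope_measure)
print("joint slopes:", joint[1], joint[2])
print("VIF:", vif)
```

The joint slopes depend on the random draw, which is the point: with a VIF this large, the individual coefficients in the two-variable model carry little engineering meaning even though each variable alone gives a stable estimate.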
It is an estimate of the instability of the model due to this multicollinearity. We need to get a little intuition on how to think about that number. But just to be a little clearer about what's going on, I'm going to plot temperature zone one command against measure. As you would expect, as you tell the zone to increase in temperature, yes, the zone does increase in temperature, and by and large it's going to the right values. I've got this color coded by Y4, so it is suggesting that at low values of temperature I get high values of Y4, just the way I saw on the individual plots. But here you can maybe get some intuition as to why the multivariate model didn't work. You're trying to fit a three-dimensional surface over this knife edge, and obviously there's some instability. You can easily imagine that it's not well pinned on the sides, so it can rock back and forth, and that's what you're getting in terms of the variance inflation factor. OLS, ordinary least squares, the typical regression analysis, just can't handle it. In some sense, we could also talk about the X prime X matrix being almost singular in places. So we've got some heuristic for why it's happening. Let's go back and think more about the values. We now know that the variance inflation factor actually considers more than just pairwise relationships. But some intuition helps me think about the relationship between VIF and pairwise comparison. If I have two variables that are 60% correlated (an R squared of .6), and it was all pairwise, then the VIF would be about 2.5.
And if two variables were 90% correlated (an R squared of .9), I would get a VIF of 10. The literature says that when you have VIFs of 10 or more, you have enough instability that you should worry about your model, because in some sense you've got 90% correlation between things in your data. Now, whether 10 is a good cutoff really depends, I guess, on your application. If you're making real design decisions based on the model, I think a cutoff of 10 is way too high, and maybe even something near two is better. But suppose I'm thinking more about which factors to put in an experiment. I want to learn how to run the right experiment, I know I've got to pick factors to go in it, and I know we too often go after the usual suspects. Is there a way I can push myself to think of some better factors? Then maybe a cutoff of 10 is useful to help you narrow things down. So it really depends on what the purpose is. More on this idea of purpose: to me there are two main purposes of modeling. One is what will happen; that's prediction. That's sometimes different from why it will happen, which is more like explanation. As we just saw with a very simple command and measure on zone one, you cannot do very good explanation. I would not trust that model to think about why something's happening when the estimates seem to go in the wrong direction like that. So I wouldn't use it for explanation, and I'm sort of suggesting that I wouldn't even use it for prediction. If I can't understand the model, if it's not intuitively doing what I expect, that seems to make extrapolation dangerous, and by definition, I think prediction of a future event is extrapolation in time.
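The rule-of-thumb numbers quoted earlier (a VIF of about 2.5 and a VIF of 10) follow from the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. A small sketch of that arithmetic is below; the function names are made up for illustration. Note that the 60% and 90% figures have to be read as R² values, which correspond to Pearson correlations of roughly 0.77 and 0.95.

```python
import numpy as np

def pairwise_vif(r_squared):
    """VIF for a predictor whose R^2 against the other predictor(s) is r_squared."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

def vif_from_correlation(corr):
    """All VIFs at once: the diagonal of the inverse predictor correlation matrix."""
    return np.diag(np.linalg.inv(corr))

vif_60 = pairwise_vif(0.60)   # the "60%" case
vif_90 = pairwise_vif(0.90)   # the "90%" case

# Same answer from the matrix form for two predictors with r = sqrt(0.90).
r = np.sqrt(0.90)
vifs = vif_from_correlation(np.array([[1.0, r], [r, 1.0]]))

print(vif_60, vif_90, vifs)
```

The matrix form is what generalizes beyond pairs: with many predictors, a variable can have a large VIF even when none of its pairwise correlations looks alarming.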
So I would always think about this. We've been talking about ordinary least squares so far. All the modeling techniques I see, like decision trees and partition analysis, are in some way affected by this issue, in different and unexpected ways. So it seems a good idea to take care of it if you can. It isn't unique to manufacturing data, but it's probably exaggerated in manufacturing data because the variables are often controlled. If we have zone one temperature and we've learned that it needs to be at this value to make good product, then we will control it as tightly as possible to that desired value. So the extrapolation gets harder and harder. This is exaggerated in manufacturing because we can and do control the variables. There are some other things about manufacturing data you can read here, because there are opportunities and challenges. On the better side, you can understand manufacturing processes. There's a lot of science and engineering around how those things run. You can understand that stoichiometry, for instance, requires that the amount of chemical A you add has to be in relationship to chemical B. Or you don't want to go from zone one temperature here to something vastly different, because you need to ramp it up slowly; otherwise you'll just create stress. So you can and should understand your process, and that may be one way, even without any of this empirical evidence on multicollinearity, to get rid of some of it. Another advantage of manufacturing data is that it's almost always time based, so do plot the variables over time.
It's often interesting that manufacturing data does seem to move in blocks of time. Here we think the temperature should be 250, and we run it for months, maybe even years, and then suddenly someone says, we've done something different or we've got a new idea, let's move the temperature. And so it becomes very different. Of course, if you're thinking about why there is multicollinearity, we've talked about how it could be due to physics or chemistry, but it could also be due to some latent variable, especially when there's a concern with variables shifting in time, like we just saw. Anything else that changes at that rate could be the actual thing that's affecting Y4. It could be due to a control plan. It could be correlated by design. For each of these reasons for multicollinearity, the question I always ask is: is it physically possible to try other combinations? Or, not just physically possible, does it make sense to try other combinations? In that case you're leaning towards doing experimentation. This modeling, and looking retrospectively at your data, is very helpful in designing better experiments. Sometimes, though, the expectations are a little too high; in my mind, we seem to expect a definitive answer from this retrospective data. So we've talked about a couple of methods already to address multicollinearity and understand it. One is VIF, and here's the VIF on a bigger model with all the variables in. How would I think about which variables are correlated with which? This tells me I have a lot of problems, but it does not tell me how to deal with them. So this is good at helping you say, oh, this is not going to be a good model. But it's maybe not helpful in getting you out of it.
And if I were to put interactions in here, it would be even worse, because they are almost always more correlated. So we need another technique, and that is variable clustering. This is available in JMP, and there are two ways to get to it. You can go through Analyze > Multivariate Methods > Principal Components, or you can go straight to Analyze > Clustering > Cluster Variables. If you go through principal components, you've got to use the red triangle option to get Cluster Variables. I still prefer this method, going through PCA, because I like the extra output. But variable clustering is based on PCA, so we're going to talk about PCA first and then about the output from variable clustering. And there's the JMP page. In order to talk about principal components, I'm actually going to work with standardized versions of the variables first. Let's remind ourselves what standardized means: the mean is now zero and the standard deviation is now one, so the variables are all on the same scale. Implicitly, when you do principal components on correlations in JMP, you are doing it on standardized variables. JMP is, of course, more than smart enough for you to put in the original values and have it work in the correct way, and when it outputs a formula, it will figure out what the right formula should be, given that you've gone back to the unstandardized variables. But as a first look, just so we can see where the formulas come from, I'm going to work with standardized variables for a while. We will quickly go back; I just want to see one formula, and that's this formula right here. Now let's think about the output. What is the purpose of PCA? It's called a variable reduction technique, and what it does is look for linear combinations of your variables.
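The point about PCA on correlations being PCA on standardized variables can be checked numerically. This is a sketch on simulated data; the matrix `X` and its scales are assumptions for illustration and have nothing to do with the talk's data table.

```python
import numpy as np

# Simulated data with three columns on very different scales (an assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([5.0, 50.0, 500.0])
X[:, 1] += 3.0 * X[:, 0]   # induce some correlation between two columns

# Standardize: mean zero, standard deviation one in every column.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The correlation matrix of X equals the covariance matrix of Z.
corr = np.corrcoef(X, rowvar=False)
cov_z = np.cov(Z, rowvar=False)

# Eigenanalysis of the correlation matrix, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(corr)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# First principal component score: a linear combination of the standardized
# columns whose coefficients are the first eigenvector.
prin1 = Z @ eigvecs[:, 0]

print("eigenvalues:", eigvals)
print("variance of Prin1:", prin1.var(ddof=1))
```

Two things are worth noticing in the output: the eigenvalues sum to the number of variables, and the variance of the first component score equals the first eigenvalue, which is why the eigenvalues can be read as estimates of variance.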
If it finds a linear combination that it likes, that's called a principal component, and it uses eigenanalysis to do this. Another way to think about it: I put 60-odd variables into the inputs, but there are not 60 independent dimensions that I can manipulate in this process. They're correlated. What I do to the command for temperature zone one dictates almost entirely what happens with the actual temperature in zone one. So those aren't two independent variables; you don't have two dimensions there. So how many dimensions do we have? That's what the eigenvalues tell you. These are like variances; they are estimates of variance, and the cutoff is one. If I had the whole table here, there would be 60 eigenvalues. JMP stops reporting at .97, but the next one down is probably .965 or so, and they keep on going down. The guideline is that if an eigenvalue is greater than one, then that linear combination is explaining signal, and if it's less than one, it's just explaining noise. So when I go to variable clustering, JMP says, you know what, you have a second dimension here. That means there is a second group of variables that's explaining something a little bit differently, so I'm going to separate the variables into two groups. Then I'm going to do PCA on both groups. The first eigenvalue for each will be big, but what does the second one look like after I split into two groups? Is it now less than one? If it's less than one, you're done. But if it's greater than one, then that group can be split even further. It keeps splitting iteratively and divisively until there are no second components greater than one anymore.
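The divisive loop described above (split a group while its second eigenvalue exceeds one) can be sketched in a few lines. This is a simplified illustration, not JMP's implementation: it assigns each variable to whichever of the top two unrotated components it loads on more heavily, with no rotation or reassignment step, and the function names and the toy correlation matrix are assumptions.

```python
import numpy as np

def second_eigenvalue(corr):
    """Second-largest eigenvalue of a correlation matrix (0.0 for one variable)."""
    if corr.shape[0] < 2:
        return 0.0
    return float(np.linalg.eigvalsh(corr)[-2])   # eigvalsh sorts ascending

def split_in_two(corr, members):
    """Split a group by comparing loadings on its two largest components.

    Simplification (assumption): unrotated components, no reassignment pass.
    """
    sub = corr[np.ix_(members, members)]
    vals, vecs = np.linalg.eigh(sub)
    loadings = vecs[:, -2:] * np.sqrt(np.clip(vals[-2:], 0.0, None))
    on_first = np.abs(loadings[:, 1]) >= np.abs(loadings[:, 0])
    group_a = [m for m, flag in zip(members, on_first) if flag]
    group_b = [m for m, flag in zip(members, on_first) if not flag]
    return group_a, group_b

def cluster_variables(corr, cutoff=1.0):
    """Keep splitting groups until no group has a second eigenvalue above cutoff."""
    pending = [list(range(corr.shape[0]))]
    finished = []
    while pending:
        members = pending.pop()
        sub = corr[np.ix_(members, members)]
        if second_eigenvalue(sub) <= cutoff:
            finished.append(members)
            continue
        group_a, group_b = split_in_two(corr, members)
        if not group_a or not group_b:       # could not separate; stop here
            finished.append(members)
            continue
        pending.extend([group_a, group_b])
    return sorted(finished)

def block(size, r):
    """Correlation block in which every pair of variables has correlation r."""
    return np.full((size, size), r) + (1.0 - r) * np.eye(size)

# Toy example: variables 0-2 move together, variables 3-4 move together.
corr = np.zeros((5, 5))
corr[:3, :3] = block(3, 0.9)
corr[3:, 3:] = block(2, 0.9)

clusters = cluster_variables(corr)
print(clusters)
```

On real data, groups are correlated with each other and the unrotated components mix them, so a production algorithm needs more care in the split step. The sketch is only meant to make the stopping rule concrete.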
So it's creating these linear combinations, and when I save principal component one, it's exactly this formula: .0612 and so on. That's the formula for Prin1, and it will be exactly right as long as you standardize. If you don't standardize, then JMP is just going to figure out how to do it on the unstandardized variables, which it is more than capable of doing. So let's start working with the initial data and do our example. You'll see this is very similar output. In fact, it's the same output, except for one thing I've turned on, which is this option right here, Cluster Variables. What I get down here is the cluster variables output, and you can see these numbers are the same. You can already start to see some of the thinking it's going to have to do. Let's look at these two right here. The amount of A I put in seems to be highly related to the amount of B I put in, and that would make sense for most chemical processes, if that's part of the chemical reaction you're trying to accomplish. If you want to put 10% more A in, you're probably going to put 10% more B in. So even in this loading plot you start to see some things that suggest the multicollinearity. But I want to put the variables in distinct groups, and this is a little hard. Watch this guy right here, temperature zone 4. He's actually the opposite: about the same angle, but in the opposite direction, so he's almost 180 degrees from A and B. So maybe zone 4 temperature is negatively correlated with A and B. I also want to put the variables in exclusive groups, and that's what we get when we ask for the variable clustering. It took those very same variables and put them in distinct groups, eight distinct groups. And here are the standardized coefficients.
These are the formulas for the individual clusters. When I save the cluster components, I get something very similar to what we did with Prin1, except this is just for cluster one. Notice that in this row, which has zone one command with a .44, everywhere else is zero. Every variable is exclusively in one cluster or another. So let's talk about some of the output. We're doing variable clustering, and we get some tables in our output. I'm going to minimize this window, and we'll talk about what's in here. The first thing I want to point to is the standardized estimates, which we were just looking at. If you want to do it, quote, by hand, and ask how I get a .35238 here, you could run PCA on just the cluster one variables. These are the variables in that cluster. Then you could look at the eigenvalues and eigenvectors, and these are the exact same numbers. So the .352 is just what I said. It keeps divisively doing multiple PCAs, and you can see that the second principal component here is less than one. That's why it stopped at that component. Who's in cluster one? There are the two zone one measures and a zone three A measure, and the amount of water seems to matter along with those. Some of the other temperatures are over here in cluster six. This is a very helpful color coded table, organized by cluster. This is cluster one; I can hover over it and read all of the variables: added water, temperature three, and so on. That's a positive correlation, and, interestingly, zone 4 has a negative correlation there.
So you will definitely see blocks of color. This is cluster one, obviously; this is maybe cluster three; and this, I know, is cluster six. But look over here: as you can imagine, clusters one and three are somewhat correlated. We start to see some ideas about what we might do, and we're starting to get some intuition as to what's going on in the data. Let's explore this table right here. This is the R squared with own cluster, the R squared with next cluster, and the one minus R squared ratio. I'm going to save the cluster components to my own data set and run a multivariate on it, so I've got the cluster one through eight components and all the factors. What I'm really interested in is this part of the table: all of the original variables and how they correlate with each cluster component. So let me save that table, and then I'm going to delete some extra rows out of it. It's the same table with some rows deleted so we can focus on certain ones. What I've got left is columns that are the cluster components and rows that are the different variables. It's now 27 variables that we're thinking about, not 60, and it has put them in eight different groups. I've added that number. It can come automatically. In order to continue, I don't want these as correlations; I want them as R squares, so I'm going to square all those numbers. Now we can start thinking about it. Let's look at row one. This one, the temperature one measure we've talked about, has an R squared of 96% with cluster one. It has 1% with cluster two. So it looks to me like it really, really wants to be in cluster one, and it doesn't have much reason to go into cluster two.
Let's color code some things here so we can find them faster. We're talking about the zone one measure, and the cluster it would next like to be in, if anything, is cluster five. It's 96% with its own cluster, but if it had to leave this cluster, where would it go next? It would think about cluster five. And what's in cluster five? You could certainly start looking at that. There are more temperatures there, the moisture heat set point, and so forth. So the R squared with its own cluster tells you how much a variable likes being in its cluster, and the R squared with the next cluster tells you how much it would like to be in a different one. Those are the two things reported in the table. JMP doesn't show all the correlations, but they may be worth plotting, and as we just demonstrated, it's not hard to do. So this one tells you, I really like being in my own cluster. This one says, if I had to leave, I wouldn't mind going over here. Let's compare those two. Take one minus this number, divided by one minus this number. That's the one minus R squared ratio. It's a ratio of how much you like your own cluster to how much you are tempted to go to another cluster. Let's plot some of those, and let me put a local data filter on the cluster. Here's the thing: in some sense, lower values of this ratio are better. Let's highlight the one down here. This one, vacuum set point, really liked its own cluster over the next nearest, .97 versus .0005, so the ratio is near zero. You wouldn't want to move that one. And you could start to do things like thinking about just the cluster one variables.
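The one minus R squared ratio described above is a one-line calculation. The sketch below uses the vacuum set point values quoted in the talk (.97 with its own cluster, .0005 with the next closest); the function name is made up for illustration.

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster).

    Values near zero mean the variable sits firmly in its own cluster.
    """
    return (1.0 - r2_own) / (1.0 - r2_next)

# Vacuum set point from the talk: .97 with own cluster, .0005 with the next one.
ratio = one_minus_r2_ratio(0.97, 0.0005)
print(round(ratio, 4))
```

A variable with a modest R squared with its own cluster and a sizable R squared with the next one would give a ratio much closer to one, which is the signal that it might reasonably belong elsewhere.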
If any variable wanted to leave cluster one, maybe it's Pct_reclaim. Maybe he got in there by fortuitous chance. And if I were in cluster two, I could start thinking about the line speed. The last table I'm going to talk about is the cluster summary table, this table here. It has gone through this list and said that if I look down the R squared with own cluster, the highest number is .967 for cluster one. So maybe that variable is the most representative. To me, it may not be wise to let software decide that I'm going to keep this variable and only this variable, although certainly every software package has a button that says launch the model with just the main variables. That may give you a stable model, in some sense, but I think it's shortchanging the kinds of things you can do. Hopefully, with the techniques provided, you now have the ability to explore those other things. You can calculate this number, again, by doing the PCA on just its own cluster, and it's suggesting that the first eigenvalue, the first component, explained .69 of the variation. So that's another way of thinking about a variable within its own cluster. Let's close these and summarize. We've given you several techniques. Failing to understand multicollinearity can make your data hard to interpret and your models even misleading. Understanding the data as it relates to the process can reduce multicollinearity, just from an SME, subject matter expertise, standpoint. Pairwise correlations and scatterplots are easier to interpret but can miss multicollinearity, so you need more. VIF is great at telling you you've got a problem, but it's really only available for ordinary least squares. There's no comparable thing for prediction.
There are some guidelines. Variable clustering is based on principal components and shows group membership and the strength of the relationship in each group. I hope that this review, or introduction, will encourage you to grab this data set from JMP Discovery, run the scripts yourself, go more slowly, make sure you feel good about it, and practice on your own data. One last plug: there is other material from other times that Sam and I have spoken at JMP Discovery conferences or written white papers for JMP. You might want to look at what we've thought about with manufacturing data, because we have found that modeling manufacturing data is a little unique compared to understanding, say, customer data in telecom, which is where a lot of us learned when we went through school. I thank you for your attention, and good luck with further analysis.
Daniel Sutton, Statistician - Innovation, Samsung Austin Semiconductor   Structured Problem Solving (SPS) tools were made available to JMP users through a JSL script center as a menu add-in. The SPS script center allowed JMP users to find useful SPS resources from within a JMP session, instead of having to search for various tools and templates in other locations. The current JMP Cause and Effect diagram platform was enhanced with JSL to allow JMP users the ability to transform tables between wide format for brainstorming and tall format for visual representation. New branches and “parking lot” ideas are also captured in the wide format before returning to the tall format for visual representation. By using JSL, access to mind-mapping files made by open source software such as Freeplane was made available to JMP users, to go back and forth between JMP and mind-mapping. This flexibility allowed users to freeform in mind maps then structure them back in JMP. Users could assign labels such as Experiment, Constant and Noise to the causes and identify what should go into the DOE platforms for root cause analysis. Further proposed enhancements to the JMP Cause and Effect Diagram are discussed.   Auto-generated transcript...   Speaker Transcript Rene and Dan Welcome to Structured Problem Solving Using the JMP Cause and Effect Diagram, Open Source Mind Mapping Software and JSL. My name is Dan Sutton. I am a statistician at Samsung Austin Semiconductor, where I teach statistics and statistical software such as JMP. For the outline of my talk today, I will first discuss what structured problem solving, or SPS, is. I will show you what we have done at Samsung Austin Semiconductor using JMP and JSL to create an SPS script center. Next, I'll go over the current JMP cause and effect diagram and show how we at Samsung Austin Semiconductor use JSL to work with the JMP cause and effect diagram.
I will then introduce you to mind mapping software such as Freeplane, a free open source software. I will then return to the cause and effect diagram and show how to use the third column option of labels for marking experiment, controlled, and noise factors. I want to show you how to extend cause and effect diagrams for five whys and cause mapping, and finally give recommendations for the JMP cause and effect platform. Structured problem solving. Everyone has been involved with problem solving at work, school or home, but what do we mean by structured problem solving? It means taking unstructured problem solving, such as in a brainstorming session, and giving it structure and documentation, as in a diagram that can be saved, manipulated and reused. Why use structured problem solving? One important reason is to avoid jumping to conclusions for more difficult problems. In the JMP Ishikawa example, there might be an increase in defects in circuit boards. Your SME, or subject matter expert, is convinced it must be the temperature controller on the solder process again. But having a saved structure, as in the cause and effect diagram, allows everyone to see the big picture and look for more clues. Maybe it is temperature control on the solder process, but a team member remembers seeing on the diagram that there was a recent change in the component insertion process, and that the team should investigate. In the free online training from JMP called Statistical Thinking for Industrial Problem Solving, or STIPS for short, the first module is titled statistical thinking and problem solving. Structured problem solving tools such as cause and effect diagrams and the five whys are introduced in this module. If you have not taken advantage of the free online training through STIPS, I strongly encourage you to check it out. Go to www.JMP.com/statisticalthinking. This is the cause and effect diagram shown during the first module.
In this example, the team decided to focus on an experiment involving three factors. This is after creating, discussing, revisiting, and using the cause and effect diagram for the structured problem solving. Now let's look at the SPS script center that we developed at Samsung Austin Semiconductor. At Samsung Austin Semiconductor, JMP users wanted access to SPS tools and templates from within the JMP window, instead of searching through various folders, drives, saved links or other software. A floating script center was created to allow access to SPS tools throughout the workday. Over on the right side of the script center are links to other SPS templates in Excel. On the left side of the script center are JMP scripts. It is launched from a customization of the JMP menu. Instead of putting the scripts under Add-Ins, we chose to modify the menu to launch a variety of helpful scripts. Now let's look at the JMP cause and effect diagram. If you have never used this platform, this is what the cause and effect diagram looks like in JMP. The user selects a parent column and a child column. The result is the classic fishbone layout. Note the branches alternate left and right and top and bottom to make the diagram more compact for viewing on the user's screen. But the classic fishbone layout is not the only layout available. If you hover over the diagram, you can select Change Type and then select Hierarchy. This produces a hierarchical layout that, in this example, is very wide in the x direction. To make it more compact, you do have the option to rotate the text to the left, or you can rotate it to the right, as shown here in the slides. Instead of rotating just the text, it might be nice to rotate the diagram also, to run left to right. In this example, the images from the previous slide were rotated in PowerPoint to illustrate what it might look like if the user had this option in JMP. JMP developers, please take note.
As you will see later, this has more the appearance of mind mapping software. The third layout option is called nested. This creates a nice compact diagram that may be preferred by some users. Note, you can also rotate the text in the nested option, but maybe not as desired. Did you know the JMP cause and effect diagram can include floating diagrams? For example, parking lots that can come up in a brainstorming session. If a second parent is encountered that's not used as a child, a new diagram will be created. In this example, the team is brainstorming and someone mentions, "We should buy a new machine or used equipment." Now, this idea is not part of the current discussion on causes. So the team facilitator decides to add it to the JMP table as a new floating node called a parking lot, and the JMP cause and effect diagram will include it. Alright, so now let's look at some examples of using JSL to manipulate the cause and effect diagram. New scripts to manipulate the traditional JMP cause and effect diagram and associated data table were added to the floating script center. You can see examples of these to the right on this PowerPoint slide. JMP is column based, and the column dialog for the cause and effect platform requires one column for the parent and one column for the child. This table is what is called the tall format. But a wide table format might be more desirable at times, such as in brainstorming sessions. With a click of a script button, our JMP users can change from a tall format to a wide format, which lets the team work in both width and depth. In tall table format you would have to enter the parent each time you add a child. When done in wide format, the user can use the script button to stack the wide C&E table back to tall. Another useful script in brainstorming might be taking a selected cell and creating a new category. The team realizes that it may need to add more subcategories under wrong part.
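The tall and wide table formats described above can be illustrated outside JMP. The talk's scripts are written in JSL and are not shown, so this is a Python sketch of the same idea under assumed names: the tall table is a list of (parent, child) pairs, the wide table has one column per parent listing its children, and the example rows are only loosely based on the circuit board example.

```python
def tall_to_wide(rows):
    """Group (parent, child) pairs into one 'column' of children per parent."""
    wide = {}
    for parent, child in rows:
        wide.setdefault(parent, []).append(child)
    return wide

def wide_to_tall(wide):
    """Stack the wide table back into (parent, child) pairs."""
    return [(parent, child) for parent, children in wide.items() for child in children]

# Illustrative rows (assumptions), loosely following the circuit board example.
tall = [
    ("Defects", "Components"),
    ("Defects", "Solder process"),
    ("Defects", "Component insertion"),
    ("Components", "Wrong part"),
    ("Solder process", "Temperature controller"),
]

wide = tall_to_wide(tall)
round_trip = wide_to_tall(wide)

for parent, children in wide.items():
    print(parent, "->", children)
```

The wide form is convenient for brainstorming because a new cause is typed once under its parent's column; the tall form is what a parent/child diagram needs. A cell in the wide table that later gets its own column (such as "Wrong part" above) becomes a new parent with its own children.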
A script was added to create a new column from a selected cell while in the wide table format. The facilitator can select the cell, like wrong part, and then, on selecting this script button, a new column is created and subcauses can be entered below. In the diagram itself, you would hover over wrong part, right click, and select Insert Below. You can actually enter up to 10 items. The new causes appear in the diagram. And if you don't like the layout, JMP allows moving the text. For example, you can right click and move it to the other side. The JMP cause and effect diagram compacts the window using left and right, up and down, and alternate. Some users may want the classic look of the fishbone diagram, but with all bones in the same direction. By clicking on this script button, current C&E all bones to the left side, it sets them to the left and below. Likewise, you can click another script button that sets them all to the right and below. Now let's discuss mind mapping. In this section we're going to take a look at the classic JMP cause and effect diagram and see how to turn it into something that looks more like mind mapping. This is the same fishbone diagram as a mind map using Freeplane software, which is open source software. Note the free form of this layout, yet it still provides an overview of causes for the effect. One capability of most mind mapping software is the ability to open and close nodes, especially when there is a lot going on in the problem solving discussion. For example, a team might want to close nodes (like components, raw card and component insertion) and focus just on the solder process and inspection branches. In Freeplane, closed nodes are represented by circles, which the user can click to open them again. The JMP cause and effect diagram already has the ability to close a node. Once closed, though, it is indicated by three dots, an ellipsis. In the current versions of JMP, there are actually no options to open it again. So what was our solution?
We included a floating window that will open and close any parent column category. So over on the right, you can see alignment, component insertion, components, etc., are all included as the parent nodes. By clicking on the checkbox, you can close a node and then clicking again will open it. In addition, the script also highlights the text in red when closed. One reason for using open source mind mapping software like Freeplane is that the source file can be accessed by anyone. And it's not a proprietary format like other mind mapping software. You can actually access it through any kind of text editor. Okay, the entire map can be loaded by using JSL commands that access text strings. Use JSL to look for XML attributes to get the names of each node. A discussion of XML is beyond the scope of this presentation, but see the JMP Community for additional help and examples. And users at Samsung Austin Semiconductor would click on Make JMP table from a Freeplane.mm file. At this time, we do not have a straight JMP to Freeplane script. It's a little more complicated, but Freeplane does allow users to import text from a clipboard using spaces to indent the nodes. So by placing the text in the journal (the example here is on the left side of this slide), the user can then copy and paste into Freeplane and you would see the Freeplane diagram on the right. Now let's look at adding labels of experiment, controlled, and noise to a cause and effect diagram. Another use of cause and effect diagrams is to categorize specific causes for investigation or improvements. These are often categorized as controlled or constant (C), noise (N), or experiment, which might be called X or E. For those who were taught SPC Excel by Air Academy Associates, you might have used or still use the CE/CNX template. So to be able to do this in JMP, to add these characters, we would need to revisit the underlying script. 
This is where the optional third column, the label column, is used. When a JMP user adds a label column in the script, it changes the text edit box to a vertical list box with two new horizontal center boxes containing two text edit boxes, one with the original child, and now one with the value from the label column. It actually has a default font color of gray and is applied as illustrated here in this slide. Our solution using JSL was to add a floating window with all the children values specified. Whatever was checked could be updated for E, C or N and added to the table and the diagram. And in fact, different colors could be specified by the script by changing the font color option as shown in the slide. JMP cause and effect diagram for five why's and mind mapping causes. While exploring the cause and effect diagram, another use as a five why's or cause mapping was discovered. Although these SPS tools do not display well on the default fishbone layout, the hierarchy layout is ideal for this type of mapping. The parent and child become the why and because statements, and the label column can be used to add numbering for your why's. This is what it looks like on the right side. Sometimes there can be more than one reason for a why and the JMP cause and effect diagram can handle it. This branching or cause mapping can be seen over here on the right. Even the nested layout can be used for a five why. In this example, you can also set up a script to set the text wrap width, so the users do not have to do each box one at a time. Or you can make your own interactive diagram using JSL. Here I'm just showing some example images of what that might look like. You might prompt the user in a window dialogue for their why's and then fill in the table and a diagram for the user. Once again, this uses the cause and effect diagram, as shown over on the left side of the slide. Conclusions and recommendations. All right. 
In conclusion, the JMP cause and effect diagram has many excellent built in features already for structured problem solving. The current JMP cause and effect diagram was augmented using JSL scripts to add more options when being used for structured problem solving at Samsung Austin Semiconductor. JSL scripts were also used to make the cause and effect diagram act more like mind mapping software. So, what would be my recommendations? There are currently three layouts, fishbone, hierarchy, and nested, which use different types of display boxes in JSL. How about a fourth type of layout? How about a mind map that will allow a more flexible mind map layout? I'm going to add this to the wish list. And then finally, how about even a total mind map platform? That would be an even bigger wish. Thank you for your time and thank you to Samsung Austin Semiconductor and JMP for this opportunity to participate in the JMP Discovery Summit 2020 online. Thank you.  
Heath Rushing, Principal, Adsurgo   DOE Gumbo: How Hybrid and Augmenting Designs Can Lead to More Effective Design Choices When my grandmother made gumbo, she never seemed to even follow her own recipe. When I questioned her about this, she told me, “Always try something different. Ya never know if you can make better gumbo unless you try something new!” This is the same with design of experiments. Too many times, we choose the same designs we’ve used in the past, unable to try something new in our gumbo. We can construct a hybrid of different types of designs or augment the original, optimal design with points using a different criterion. We can then use this for comparison to our original design choice. These approaches can lead to designs that allow you to either add relevant constraints (and/or factors) you did not think were possible or have unique design characteristics that you may not have considered in the past. This talk will present multiple design choices: a hybrid mixture-space filling design, an optimal design augmented using pre-existing, required design points, and an optimal design constructed by augmenting a D-optimal design with both I- and A-optimal design points. Furthermore, this talk presents the motivation for choosing these design alternatives as well as design choices that have been useful in practice.     Auto-generated transcript...   Speaker Transcript Heath Rushing My name is Heath Rushing. I am a principal for Adsurgo and we're a training and consulting company that works with a lot of different companies. This morning, I'm going to talk about some experiences that I had working with pharmaceutical and biopharmaceutical companies. A lot of scientists and engineers are doing things like process and product development and characterization, and formulation optimization. And what I found is a lot of these scientists had designs that they used in the past with a similar product or process or formulation. 
And what they did is, going forward, they just said, "Hey, let me just take that design that I've used in the past. It worked. You know, it worked well enough in the past. So let's just go ahead and use that design." In each of these instances, what we did is we took the original design and we came up with some sort of mechanism for doing something a little different. Right. We either augmented it with a different sort of optimization criterion or we augmented it before they added runs or after they added runs. In the first case, what we did was build a hybrid design. Right. The first case was a formulation optimization problem, where a scientist in the past had a 30 run Scheffe-Cubic mixture design. In a mixture design, the factors in the experiment are mixture components, each a certain percentage, where the overall mixture adds up to 100%. Right, they felt this worked well enough and helped them to find an optimal setting for the formulation. However, one thing that they really wanted to touch more on is, they said, you know, these designs tended to look at design points in our experiment near the edges. And what we want to do is further characterize the design space. So we took the original 30 run design, and instead of doing that, we constructed an experiment where we ran 18 mixture runs and then augmented them with 12 space filling design points. A space filling design is used a lot in computer simulations. And really, you know, I said this at a conference one time, I said, "You know it's used to fill space." But really what these designs do, and I'm going to pull up the comparison of the two, is put design points throughout the space. In this one, I try to minimize the distance between each of the design points. 
As you see, the design on the left, the one that they thought was adequate, was the 30 run mixture design. And as you see, it operates a lot near the edges and right in the center of the design. The one on the right was really 18 mixture design points augmented with 12 space filling design points. So it's really a hybrid design, a hybrid of a mixture design and a space filling design. As you can see, based upon their objective of trying to characterize that design space a little bit better, the one on the right did a much better job of characterizing that design space, right? It had adequate prediction variance. It was a design they chose to run and they found their optimal solution off of this. The second design choice is used a lot in process characterization. Back in the old days, before a lot of people used design of experiments for process characterization, what a lot of scientists would do was run center point runs at set point and then also do what are called PAR runs, or proven acceptable range, right. So say that they had four process parameters. What they would do is keep three of the process parameters at the set point and have the fourth go to the extremes, the lowest value and the highest value. And they would do a set of experiments like that for each of the process parameters. What they're really showing is that, you know, if everything's at set point, and one of these deviates near the edges, then we're just going to prove that it's well within specification. Right. And so they still like to do a lot of these runs. The design that I started off with was from a scientist who took those PAR and those center point runs and added them after they built an I-optimal design. An I-optimal design is used for prediction and optimization. 
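The PAR runs described above follow a simple pattern: one run at set point, then each parameter in turn pushed to its low and high extreme while the others stay at set point. A minimal sketch of that pattern, assuming four made-up process parameters and a hypothetical helper name (this is not how JMP builds or augments designs):

```python
def par_runs(set_points, ranges):
    """Build a center-point run plus one low and one high run per parameter.

    set_points: dict of parameter name -> set point value
    ranges:     dict of parameter name -> (low, high) proven acceptable range
    """
    runs = [dict(set_points)]  # everything at set point
    for name, (low, high) in ranges.items():
        for extreme in (low, high):
            run = dict(set_points)  # other parameters stay at set point
            run[name] = extreme     # only this parameter goes to its extreme
            runs.append(run)
    return runs


# Hypothetical parameters and ranges, for illustration only.
set_points = {"temperature": 37.0, "pH": 7.0, "agitation": 200.0, "feed_rate": 1.0}
ranges = {
    "temperature": (35.0, 39.0),
    "pH": (6.8, 7.2),
    "agitation": (150.0, 250.0),
    "feed_rate": (0.8, 1.2),
}
runs = par_runs(set_points, ranges)
```

With four parameters this gives one center-point run and eight PAR runs. Runs like these are the required points that the speaker suggests supplying before the optimal design is built rather than after.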
And in this case, that's the kind of design that they wanted, but they added them after the I-optimal design. My question to them was this: why don't you just take those runs and add them before you build the I-optimal design? If that was the case, the ??? algorithm in JMP would say, "You know, I'm going to take those points and I'm going to come up with the next best set of runs." Right. So we took the original 11 design points and augmented them with 18 I-optimal points. Whenever we did this, if you look in the design, this is where the PAR runs were added prior to, and you see that the power of the main effects, two-factor interactions, and the quadratic effects is higher than if you added the PAR runs after. If you look at the prediction variance, the prediction variance is very similar. But right near the edges of the design space, you see that the prediction variances, whenever we had the PAR runs augmented with I-optimal, were a lot smaller. The key here is, whenever I was looking at the correlation, I think the correlations, especially with the main effects, are a lot better with the PAR runs augmented to I-optimal versus what they did before, where they took the I-optimal and just augmented it with the PAR runs. The third design. The third design was when I had a scientist take a 17 run D-optimal design and augment it with eight runs, going from a D- to an I-optimal design. Now they started off with a D-optimal design, a screening design, and they augmented it with points to move to an I-optimal design. JMP has a design that's not really a new design, but it's a new design for JMP; it's called the A-optimal design. An A-optimal design allows you to weight any of those factors. Right. And so I had an idea. 
I just said, "You know, I have many times in the past gone from a D augmented to an I-optimal design. What if we did this? Really, what if we took that original 17 run D-optimal design and augmented it to an I, then an A, where we weighted those quadratic terms? Or we took the D-optimal design, augmented it to an A-optimal design where we weighted the quadratic terms, and then to an I-optimal design?" So it's really two different augmentations, going from a D to an A to an I, and D to an I to an A. I also went straight from D to A. Right. And I wanted to compare it to the original design choice, which was a D augmented to an I-optimal design. Now, I really would like to tell you that my idea worked. But I think as a good statistician, I should tell you that I don't think it did. If I look at the prediction variance, which, in terms of response surface design, we're trying to minimize across the design region, you see the prediction variance for their original design is lower. Okay, even much lower than whenever I went just straight to the A-optimal design. If you look at the fraction of design space, you'll see that the prediction variance is much smaller across the design space than the A-optimal design and it's a little bit better than when I went from D to A to I, and D to I to A. The only negative that I saw with the original design compared to the other design choices was, you know, there were some quadratic effects that had a little bit higher correlation than I would like to see. And you see that with the A-optimal design, the quadratic effects have much lower correlation. So my original thesis: many times, scientists and engineers have designs they've done in the past. And I always say, it makes sense that we don't just want to do that same design that we've done in the past. Let's try something different. 
The product can be a little bit different. The process can be a little bit different. The formulation can be a little bit different. If you use that to compare to the original design, you can pick your best design choice. Last thing, I would like to thank my team members at Adsurgo. We always have our customers coming up with challenging problems and our team members always working for optimal solutions for our customers. Now, the last thing that I have to say is these designs were really taken from examples from customers, but they weren't the exact examples. There's nothing with their data. So I would like to give a shout out to one of my customers, Sean Essex from Poseida Therapeutics, who often comes up with some very hard problems. Sometimes he'll come up with a problem, and I'll say, you know, this is a solution and it's something that we really haven't even seen yet. So have a great day.  
James Wisnowski, Principal Consultant, Adsurgo Andrew Karl, Senior Statistical Consultant, Adsurgo Darryl Ahner, Director OSD Scientific Test and Analysis Techniques Center of Excellence   Testing complex autonomous systems such as auto-navigation capabilities on cars typically involves a simulation-based test approach with a large number of factors and responses. Test designs for these models are often space-filling and require near real-time augmentation with new runs. The challenging responses may have rapid differences in performance with very minor input factor changes. These performance boundaries provide the most critical system information. This creates the need for a design generation process that can identify these boundaries and target them with additional runs. JMP has many options to augment DOEs conducted by sequential assembly where testers must balance experiment objectives, statistical principles, and resources in order to populate these additional runs. We propose a new augmentation method that disproportionately adds samples at the large gradient performance boundaries using a combination of platforms to include Predictor Screening, K Nearest Neighbors, Cluster Analysis, Neural Networks, Bootstrap Forests, and Fast Flexible Filling designs. We will demonstrate the Boundary Explorer add-in tool with an autonomous system use-case involving both continuous and categorical responses. We provide an improved “gap filling” design that builds on the concept behind the Augment “space filling” option to fill empty spaces in an existing design.     Auto-generated transcript...   Speaker Transcript James Wisnowski Welcome, team Discovery. Andrew Karl and Darryl Ahner and I are excited to present two new adaptive sampling techniques that are really going to provide practitioners some wonderful usefulness in terms of augmenting design of experiments. 
And what I want to do is kind of go through a couple of our processes here and talk about how this all came about. When we think about DOE and augmenting designs, there is a robust capability already in JMP. What we have found, though, working with some very large scale simulation studies, is that we're missing a piece here: gap filling designs and adaptive sampling designs. And the key point is going to be that the adaptive sampling designs are going to be focusing on the response. So this is quite different from when you think of maybe a standard design where you augment and you look at the design space and look at the X matrix. So now we're going to actually take into account the targets or the responses. So this will actually end up providing a whole new capability so that we can test additional samples where the excitement is. So we want to be in a high gradient region, much like you might think of steepest ascent in response surface methodology. Now we're going to automate that in terms of being able to do this with many variables and thousands of runs in the simulation. The good news is that this does scale down quite nicely for the practitioner with small designs as well. And I'll do a quick run through of our add-in that we're going to show you, and then Andrew will talk a little bit about the technical details of this. So one thing I do want to apologize for: this is going to be fairly PowerPoint centric rather than a JMP demo, for two reasons, primarily because of our time, we've got a lot of material to get through, but also our JMP utilization is really in the algorithm that we're making in this adaptive sampling. So ultimately, the point and click of JMP is a very simple user interface that we've developed, but what's behind the scenes, the algorithm, is really the power of JMP here. So real quick, the gap filling design, pretty clear. 
We can see there are some gaps here. Maybe this is a bit of an exaggeration, but it is demonstrative of the technique, though in reality we may have a very large number of factors where that curse of dimensionality can come into play and you have these holes in your design. And you can see, we could augment it with a space filling design, which is kind of the workhorse in the augmentation for a lot of our work, particularly in simulation modeling, and it doesn't look too bad. If we look at those blue points, which are the new points that we've added, it doesn't look too bad. And then if you start looking maybe a little closer, you can kind of see, though, that we started replicating a lot of the ones that we've already done and maybe we didn't fill in those holes as much as we thought, particularly when we take off the blue coloring and we can see that there's still a fair amount of gaps in there. So as we were developing adaptive sampling, we recognized one piece of that is we needed to fill in some holes in a lot of these designs. And we came up with an algorithm in our tool, our add-in, called boundary explorer, that will do this particular function for any design, to fill in the holes, and you can see where that might have a lot of utility in many different applications. So in this particular slide or graph, we can see that those blue points are now maybe more concentrated in the holes and there are some that are dispersed throughout the rest of the region. But even when we take off the coloring, it looks a lot more uniform across; we have filled that space very well. Now that was more of a utility that we needed for our overall goal here, which was adaptive sampling. And the primary application that we found for this was autonomous systems, which have gotten a lot of buzz and a lot of production and test, particularly in the Department of Defense. 
So in autonomous systems, there are really two major items when you think of it. In autonomous systems, you really need some sensor to let the system know where it is, and then the algorithm or software to react to that. So it's the sensor-algorithm software integration that we're primarily focused on. And what that then drives is a very complex simulation model that honestly needs to be run many, many thousands of times. But more importantly, what we have found is that in these autonomous systems, there are these boundaries that we have in performance. So for example, we have a leader-follower example from the Army. That's where a soldier would drive a very heavily armored truck in a convoy and then the rest of the convoy would be autonomous; they would not have soldiers in them. Or think of maybe the latest Tesla, the pickup truck, where you have auto nav, right? So the idea is we are looking at testing these systems and we have to end up doing a lot of testing. And what happens is, for example, maybe even in this Tesla, at 30 miles an hour you may be fine avoiding an obstacle. But at 30.1 you would have to do an evasive maneuver that's out of the algorithm specifications. So that's what we talk about when we say these boundaries are out there. They're very steep changes in the response, very high gradient regions. And that's where we want to focus our attention. We're not as interested in where it's kind of a flat surface; it's really where the interesting part is, that's where we would like to do it. And honestly, what we found is, the more we iterate over this, the better our solution becomes. We completely recommend doing this as an iterative process. So hence, that's the adaptive piece of this: do your testing and then generate some new good points and then see what those responses are and then adapt your next set of runs to them. 
So that's our adaptive sampling. The genesis of this idea really came from some work that we did with the Applied Physics Lab at Johns Hopkins University. They are doing some really nice work with the military and while reviewing it in one of their journal articles, I was thinking to myself, you know, this is fantastic in terms of what you're doing, and we could even use JMP to maybe come up with a solution that would be more accessible to many of the practitioners. Because the problem with the Johns Hopkins approach is that it's very specific and, to integrate, it's not something that's very accessible to the smaller test teams. So we want to put this in the hands of folks that can use it right away. So this paper from the Journal of Systems and Software is kind of the source of our boundary explorer. And as it turns out, we used a lot of the good ideas but we were able to come up with different approaches and other methods, in particular using native capability in JMP Pro as well as some development, like the gap filling design that we did along the way. Now, in terms of this example problem, it's probably best if I just go and explain it right in a demo here. So if I look at a graph here, I'll just go back to the Tesla example. So let's say I'm doing an auto navigation type activity and I have two input factors, and let's say maybe we have speed and density of traffic. So we're thinking about this Tesla algorithm. It wants to merge to the lane to the left so it wants to, I should say, you know, pass. So it has to merge. So one of them would be the speed the Tesla is going and then the other might be the density of traffic. And then maybe down in this area here we have a lower number. So we can think of these numbers two to 10; we could maybe even think of the responses as something like a probability of a collision. 
So down at low speed/low density, we have a very low probability of collision, but up here at high speed/high density, you have a very high probability. But the point is, in what I have highlighted and selected here, you can see that there are very steep differences in performance along the boundary region. So as we do the simulation to start doing more and more software tests for the algorithm, we'll note that it really doesn't do us a lot of good to get more points down here. We know that we do well in low density and low speed. What we want to do is really work on the area in the boundaries here. So that's our problem: how can I generate 200 new points that are really going to be following my boundary conditions here? Now, here what I've done is I have X1 and X2, again, think of the speed as well as the density. And then I just threw in a predictor variable here that doesn't mean anything. And then there's our response. So to do this, all I have to do is come into boundary explorer and, under adaptive sampling, put in my two responses (and you can have as many responses as you need) and then here are my three input factors. And then I have a few settings here, such as whether or not I want to target the global minimum and maximum, and show the boundary. And we also ultimately are going to show you that you have some control here. So what happens in this algorithm is we're really looking at what the nearest neighbors are doing. If all the nearest neighbors have the same response, as in zero probability of having an accident, that's not a very interesting place. I want to see where there are big differences. And that's where that nearest neighbor comes into play. So I'll go ahead and run this. And what we're seeing there is that the algorithm used JMP's native capability for the predictor screening and fortunately, is not using the normal distribution. 
You can see it's running the bootstrap forest. Andrew is going to talk about where that was used. And ultimately what we're going to do here is generate a whole set of new points that should hopefully fall along the boundary. So that took, you know, 30 seconds or so to do these points and from here I can just go ahead and pull up my new points. So you can see my new points are sort of along those boundaries, probably easiest seen if I go ahead and put in the other ones. So right here, maybe I'll switch the color here real quick. And I'll go ahead and show maybe the midpoint and the perturbation. So right now we can kind of see where all the new points are. The ones that are kind of shaded are the ones that were original and now we're seeing all of my new points that have been generated in that boundary. So of course the question is, how did we do that? So what I'll do is I'll head back to my presentation. And from there, I'll turn it over to Andrew, where he'll give a little bit more technical detail in terms of how we go about finding these boundary points because it's not as simple as we thought. Andrew Karl Okay. Thanks, Jim. I'm going to start out by talking about the gap filling real quick because, in addition to being integrated into the overall beast tool, it's a standalone tool as well. So it's got a rather simple interface where we select the columns that define the space that we want to fill in. And for continuous factors, it reads in the coding column property to get the high and low values and it can also take nominal factors as well. In addition, if you have generated this from custom design or space filling design and you have disallowed combinations, it will read in the disallowed combination script and only do gap filling within the allowed space. So the user specifies their columns, as well as the number of new runs they want. 
And let me show a real quick example in a higher dimensional space. This is a case of three dimensions. We've got a cube where we took out a hollow cylinder and we went through the process of adding these gap filling runs, and we'll turn them on together to see how they fit together. And then also turn off the color to see what happens. So this is nice because in the higher dimensional space, we can fill in these gaps that we couldn't even necessarily see in the bivariate plots. So how do we do this? What we do is, we take the original points, which in this case are colored red now instead of black, and we can see where those two gaps were, and we overlay a candidate set of runs from a space filling design for the entire space. To the concatenated data table of the old and the new candidate runs, we add a continuous indicator column; we label the old points 1 and the candidate points 0. And in this concatenated space, we now fit a 10 nearest neighbor model to the indicator column and we save the predictions from this. So the candidate runs with the smallest predictions, in this case blue, are the gap points that we want to add into the design. Now, if we do this in a single pass, what it tends to do is overemphasize the largest gaps. So what we do is we actually do this in a tenfold process, where we will take a tenth of our new points, select them as we see here, and then we will add those in and then rerun our k-nearest neighbor algorithm to pick out some new points and to fill out all the spaces more uniformly. So the gap filling is one option available within boundary explorer. So Jim showed that we can use any number of responses, any number of factors and we can have both continuous and nominal responses and continuous and nominal factors. The continuous factors that go in, we are going to normalize those behind the scenes to [0, 1] to put them on a more equal footing. 
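The gap filling steps described above (overlay candidates, label old points 1 and candidates 0, score each candidate with a nearest-neighbor model, keep the lowest scores, repeat in batches) can be sketched roughly as follows. This is only an illustration in plain Python under several assumptions: uniform random candidates stand in for the space filling candidate set, factors are already scaled to [0, 1], the nearest-neighbor prediction is taken as the mean indicator of the k nearest other points, and the function names and defaults are hypothetical rather than taken from the add-in.

```python
import heapq
import math
import random


def knn_score(point, labeled, k):
    """Mean indicator over the k labeled points nearest to `point` (itself excluded)."""
    nearest = heapq.nsmallest(
        k, ((math.dist(point, p), label) for p, label in labeled if p is not point)
    )
    return sum(label for _, label in nearest) / len(nearest)


def gap_fill(existing, n_new, n_candidates=300, k=10, passes=10, seed=0):
    """Pick n_new candidate points lying in the emptiest parts of the unit cube."""
    rng = random.Random(seed)
    dim = len(existing[0])
    # Assumption: uniform random candidates instead of a space filling design.
    candidates = [tuple(rng.random() for _ in range(dim)) for _ in range(n_candidates)]
    design = list(existing)  # points carrying indicator 1
    added = []
    batch = max(1, n_new // passes)  # roughly a tenth of the new points per pass
    while len(added) < n_new and candidates:
        labeled = [(p, 1) for p in design] + [(c, 0) for c in candidates]
        # Low score = few existing points nearby = candidate sits in a gap.
        ranked = sorted(candidates, key=lambda c: knn_score(c, labeled, k))
        take = ranked[: min(batch, n_new - len(added))]
        added.extend(take)
        design.extend(take)  # chosen points now count as existing (indicator 1)
        candidates = ranked[len(take):]
    return added
```

Relabeling the chosen points as existing before the next pass is what keeps later batches from piling into the same large gap, which is the motivation the speaker gives for the tenfold process.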
And for the individual responses that go into this, we are going to loop individually over each response to find the boundaries for each of the responses within the factor space. And then at the end, we have a multivariate tool using a random forest that considers all of the responses at once. And so we'll see how each of the different options available here in the GUI, in the user interface, comes up within the algorithm. So after normalization for any of these continuous columns, the first step is predictor screening for both the continuous and nominal responses. And what this does is find the predictors that are relevant for each particular response. And we have a default setting in the user interface of .05 for proportion of variance explained, or portion of contribution from each variable. So in this case, we see that X1 and X2 are retained for response Y1, and X3 noise is rejected. The next step is to run a nearest neighbor algorithm. And we use a default of 5, but that's an option that the user can toggle. And we aren't so concerned with how well this predicts as we are to just simply use this as a method to get to the five nearest neighbors. What are the rows of the five nearest neighbors and how far are they? What are the distances from the current row? And we're going to use this information about the nearest neighbors to identify, for each point, the probability of it being a boundary point. We have to split here and use a different method for continuous or nominal responses. For the nominal responses, what we do is we concatenate the response from the current row along with the responses from the five nearest neighbors in order, in this concatenated neighbors column. And we have a simple heuristic we use to identify the boundary probability based on that concat neighbors column. If all the responses are the same, we say it's low probability of being a boundary point. 
If one of the responses is different, then we say it's got a medium probability of being a boundary point. And if two or more of the responses are different, it's got a high probability of being a boundary point. We also record the row used. In this case, that is the boundary pair. So that is the closest neighbor that has a response that is different from the current row. We can plot those boundary probabilities in our original space filling design. So as Jim mentioned early on, we initially run a space filling design before running this boundary explorer tool to explore the space and to get some responses. And now we've fit that and we've calculated the boundary probability for these. And we can see that our boundary probabilities are matching up with the actual boundaries. For continuous responses, we take the continuous response from the five nearest neighbors and add a column for each of those, and we take the standard deviation of those. The ones with the largest standard deviations of neighbors are the points that lie in the steepest gradient areas, and those are more likely to be our boundary points. We also multiply the standard deviation by the mean distance in order to get our information metric, because what that does is, for two points that have an equal standard deviation of neighbors, it will upweight the one that is in a more sparse region with fewer points that are there already. So now we've got this continuous information metric and we have to figure out how to split that up into high, medium, and low probabilities for each point. So what we do is we fit a distribution. We fit a normal three mixture and we use the mean of the largest distribution as the cutoff for the high probability points. And we use the intersection of the densities of the largest and the second largest normal distributions as the cutoff for the medium probability points.
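A rough Python sketch of the two scoring rules just described may help. It makes several assumptions that the talk does not confirm: the helper names are hypothetical, "medium" is taken to mean exactly one differing neighbor, the standard deviation is computed over the neighbors' responses only (not the current row), and the normal mixture step that turns the continuous metric into cutoffs is left out.

```python
import math
import statistics


def nearest_neighbors(i, points, k=5):
    """Return (row, distance) pairs for the k rows closest to row i."""
    others = [(j, math.dist(points[i], points[j])) for j in range(len(points)) if j != i]
    return sorted(others, key=lambda pair: pair[1])[:k]


def nominal_boundary(i, points, responses, k=5):
    """Boundary probability level and boundary pair for a nominal response."""
    neighbors = nearest_neighbors(i, points, k)
    differing = [j for j, _ in neighbors if responses[j] != responses[i]]
    if not differing:
        level = "low"
    elif len(differing) == 1:
        level = "medium"
    else:
        level = "high"
    # Row used: the closest neighbor whose response differs from row i.
    row_used = differing[0] if differing else None
    return level, row_used


def information_metric(i, points, responses, k=5):
    """SD of the neighbors' responses times their mean distance from row i."""
    neighbors = nearest_neighbors(i, points, k)
    sd = statistics.stdev([responses[j] for j, _ in neighbors])
    mean_distance = statistics.fmean([d for _, d in neighbors])
    return sd * mean_distance
```

On a step function, rows next to the step get a positive metric while rows whose neighbors all share one response get zero, and the distance factor raises the score of points in sparse regions.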
So once we've identified those cutoffs, we apply them to form our boundary probability column. And we also retain the row used. In this case, for the continuous responses, that is the neighbor that has the response that's the most different in absolute value from the current row. So now for both continuous and nominal responses we have the same output. We have the boundary probability and the row used. Now that we've identified the boundary points, we need to be able to use that to generate new points along the boundary. So the first and, in some ways, the best method for targeting and zooming in on the boundary is what we call the midpoint method. And what we do is work with each boundary pair, each row and its row used, the neighbor identified previously...I'm sorry, so not the nearest neighbor but the neighbor that is most relevant, either the closest one with a different nominal response or the one most different in terms of continuous response. For the continuous factors we take the average of the coordinates for each of those two points to form the midpoint. And that's what you see in the graph here. So we would put a point at the red circle. For nominal factors, what we do for the boundary pairs is we take the levels of that factor that are present in each of the two points and we randomly pick one of them. The nice thing about that is if they're both the same, then that means the midpoint is also going to be the same level for that nominal factor for those two points. A second method, which we call the perturbation method, is to simply add a random deviation to each of the identified boundary points. So for the high probability points, we add two such perturbation points; for the medium, we add one. And for each one, for the continuous factors, we add a random deviation.
It's normal with mean 0 and standard deviation .075 in the normalized space, and that .075 is something that you can scale within the user interface to either increase or reduce the amount of spread around the boundary. And then for nominal factors, what we do is we randomly pick out a level of each of the nominal factors. Now for the high probability boundary points that get a second perturbation point, what we do is in the second one we restrict those nominal factor settings to all be equal to that of the original point. So we do this process of identifying the boundary and creating the midpoints and perturbation points for each of the responses specified in the boundary explorer. Once we do that, we concatenate everything together and then we are going to look at all the midpoints identified for all the responses, and now use a multivariate technique to generate any additional runs. Because the user can specify how many runs they want, and these midpoint and perturbation methods only generate a fixed number of runs, depending on the lay of the land, I guess you could say, for the data. So what we do is something similar to the gap filling design, where we take all of the identified perturbation points and midpoints for all of the responses and we fill the entire space with a space filling design of candidate points. We label the candidate points 0 in a continuous indicator, the midpoints 1, and the perturbation points .01. We fit a random forest to this indicator. And then we save the predictions for the candidate space filling points, and we take the candidate runs with the largest predicted values of this boundary indicator. And those are the ones that we add in using this random forest method. Now since this is a multivariate method, if you have an area of your design space that is a boundary for multiple responses, that will receive extra emphasis and extra runs.
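The midpoint and perturbation methods described above can be illustrated with a short Python sketch. The function names and the dictionary row layout are assumptions made for the example, not the add-in's implementation, and the sketch does not clip perturbed values back into the 0 to 1 range because the talk does not say whether that is done.

```python
import random


def midpoint(row_a, row_b, nominal_factors, rng):
    """Midpoint of a boundary pair; each row maps factor name to value."""
    new_row = {}
    for name in row_a:
        if name in nominal_factors:
            # Pick one of the two observed levels; identical levels stay put.
            new_row[name] = rng.choice([row_a[name], row_b[name]])
        else:
            new_row[name] = (row_a[name] + row_b[name]) / 2
    return new_row


def perturb(row, nominal_levels, rng, sd=0.075, keep_nominal=False):
    """Add N(0, sd) noise to continuous factors in the normalized space."""
    new_row = {}
    for name, value in row.items():
        if name in nominal_levels:
            new_row[name] = value if keep_nominal else rng.choice(nominal_levels[name])
        else:
            new_row[name] = value + rng.gauss(0.0, sd)
    return new_row


def perturbation_points(row, probability, nominal_levels, rng, sd=0.075):
    """Two points for high probability rows, one for medium, none for low."""
    if probability == "high":
        # The second point keeps the original nominal settings.
        return [perturb(row, nominal_levels, rng, sd),
                perturb(row, nominal_levels, rng, sd, keep_nominal=True)]
    if probability == "medium":
        return [perturb(row, nominal_levels, rng, sd)]
    return []
```

Passing a seeded `random.Random` instance as `rng` makes the generated runs reproducible, which is convenient when the tool is rerun over several iterations.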
So here's showing the three types of points together. Now, again, to emphasize what Jim said, this needs to happen in multiple iterations, so we would collect this information from our boundary explorer tool and then concatenate it back into the original data set. And then after we have those responses, we rerun boundary explorer, and over the iterations it's going to continuously zoom in on the boundaries and, in fact, possibly even find more boundaries. So the perturbation points are most useful for early iterations when you're still exploring the space, since they're more spread out, and the random forest method is better for later iterations, because it will have more midpoints available: it uses not only the ones from the current iteration, but also the previously recorded ones. We have a column in the data table that records the type of point that was added, so we'll use all the previous midpoints as well. So if we pull up our surface plot for this example we've been looking at, for a step function, we can see our new points, midpoints and perturbation points, are all falling along the cliffs, which are the boundaries, which is what we wanted to see. So the last two options in the user interface are to include those gap filling runs, and we can also ask it to target a global min, max, or match target for any continuous responses, if that's set as a column property. Just to show one final example here, we have this example where we have these two canyons through a plane with kind of a deep well at the intersection of these. And we've run the initial space filling points, which are the points that are shown, to get an idea of the response. And if we run two iterations of our boundary explorer tool, this is where all the new points are placed, and we can see the gaps kind of in the middle of these two lines. What are those gaps? If we take a look at the surface plot, those gaps are the canyon floors, where it's not actually steep down there.
So it's flat, even locally over a little region, but all of these points, all of these midpoints, have been placed not on the plane, but on the steep cliffs, which is where we wanted them. And here we're toggling the minimum points on and off, and you can see those are hitting the bottom of the well there. So we were able to target the minimum as well. So we present two distinct options, two new tools. One is the gap filling, which can be used on any data table that has coding properties set for the continuous factors. And then there's the boundary explorer tool, which can be used to add runs that don't look at the factor space by itself, but look at the response in order to target the high gradient...high change areas.
Monday, October 12, 2020
Mandy Chambers, JMP Principal Test Engineer, SAS; Kelci Miclaus, Senior Manager Advanced Analytics R&D, JMP Life Sciences, SAS   JMP has many ways to join data tables. Using traditional Join you can easily join two tables together. JMP Query Builder enhances the ability to join, providing a rich interface allowing additional options, including inner and outer joins, combining more than two tables and adding new columns, customizations and filtering. In JMP 13, virtual joins for data tables were developed that enable you to use common keys to link multiple tables without using the time and memory necessary to create a joined (denormalized) copy of your data. Virtually joining tables gives a table access to columns from the linked tables for easy data exploration. In JMP 14 and JMP 15, new capabilities were added to allow linked tables to communicate with row state synchronization. Column options allow you to set up a link reference table to listen and/or dispatch row state changes among virtually joined tables. This feature provides an incredibly powerful data exploration interface that avoids unnecessary table manipulations or data duplications. Additionally, there are now selections to use shorter column names, auto-open your tables and a way to go a step further, using a Link ID and Link Reference on the same column to virtually “pass through” tables. This presentation will highlight the new features in JMP with examples using human resources data followed by a practical application of these features as implemented in JMP Clinical. We will create a review of multiple metrics on patients in a clinical trial that are virtually linked to a subject demographic table and show how a data filter on the Link ID table enables global filtering throughout all the linked clinical metric (adverse events, labs, etc.) tables.     Auto-generated transcript...   Speaker Transcript Mandy Okay, welcome to our discussion today. Let's Talk Tables.
My name is Mandy Chambers and I'm a principal test engineer on the JMP testing team. And my coworker and friend joining me today is Kelci Miclaus. She's a senior manager in R&D on the JMP life sciences team. Kelci and I actually began working together a few years ago as the Clinical product was starting to be a great consumer of all the things that I happen to test. So I got to know her group pretty well and got to work with them closely on different things that they were trying to implement. And it was really valuable for me to be able to see a live application, how a customer would really be using the things that I actually tested in JMP and how they would put them to use in the clinical product. So in the past, when we've done this presentation, it's been much longer, and we decided that the best thing to do here was to give you the entire document. So that's what's attached with this recording, along with two sets of data in zip files. You should have data tables, scripts, some journals and the different things you need to be able to step through each of the applications, even if I end up not showing you or Kelci doesn't show you something that's in there. You should be able to do that. So let me begin by sharing my screen here, so that you can see what I'm going to talk about today. So as I said, the journal that I had, if I were going to show this in its entirety, would be talking about joining tables and the different ways that you can join tables. And so this is the part that I'm not going to go into great detail on, but just a basic table join. If I click on this, I get laptop runs and laptop subjects. And under the tables menu, if you're new to JMP or maybe haven't done this before, you can do a table join, and this is for a physical join. This will put the tables together. So I would be joining laptop runs to laptop subjects. Within this dialog, you select the things that you want to join together.
You can join by matching, Cartesian join, or row join, and then you would join the table. I'm not going to do that right now, in the interest of time, but that's what you would do. And also in here under the tables menu, something else that I would talk about would be JMP query builder. And this has the ability to join more tables together. If you have 3, 4, 5, 6, however many tables you have, you can put them together and it will make up one table that contains everything. But again, I'm actually not going to do that today. So if I go back into here and I close these tables, let's get started with how virtual join came about. So let's talk about joining tables first. You have to decide what type of join you want to use. So if your tables are small, it might be easiest to do a physical join, to just do a tables join; the two tables I showed you weren't very big. If you pull in three or four or maybe more tables, JMP query builder is a wonderful tool for building a table. And you may want all of your data in the same table, so that may be exactly what you want. You just need to be mindful of disk space and performance, and just understand that if you have five or six tables that you have sitting separately and then you join them together physically, you're making duplicate copies. So those are the ways that you might determine which you would use. Virtual join came about in JMP 13, and it was added with the ability to take a common link ID and join multiple tables together. It's kind of a concept of joining without joining. It saves space and it also avoids duplication of data. And so in 13 we started with that. And then in 14 and 15, we added more features, things that customers requested: linked tables with row state synchronization. You can shorten column names. We added being able to auto open linked tables, and being able to have a link ID and a link reference on the same column.
And we also added these little hover tips that I'll show you where it can tell you which source is your column source table. So those are the things that we added and I'm going to try to set this up and demonstrate it for you. So I've got this data that I actually got from a... it's just an imaginary high-tech firm. And it's it's HR data and it includes things such as compensation, and headcount, and some diversity, and compliance, education history, and other employment factors. And so if you think about it, it's a perfect kind of data to link because you have usually a unique ID variable, such as an employee ID or something that you can link together and maybe have various data for your HR team that's in different places. So I'm going to open up these two tables and just simply walk through what you would do if you were trying to link these together. So this table here is Employee Scores 1 and then I have Compensation Master 1 in the back. These tables, Employees Scores 1 is my source table. And Compensation Master is my referencing table. So you can see these ID, this ID variable here in this table. And it's also in the compensation master table. So I'm going to set up my link ID. So it's a couple of different ways to do this. You can go into column properties. And you can see down here, you have a link ID and reference. The easiest way to do this is with a right click, so there's link ID. And if I look right here, I can see this ID key has been assigned to that column. So then I'm going to go into my compensation master table. And I'm going to go into this column. And again, you can do it with column properties. But you can do the easiest way by going right here to link reference, the table has the ID. So it shows up in this list. I'm going to click on this and voila, there's my link reference icon right there. And I can now see that all the columns that were in this table are...are available to me in this table. You can see you have a large number of columns. 
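To make the idea of "joining without joining" concrete, here is a small Python sketch of how a referencing table can resolve columns from a source table through a key lookup instead of copying them. It is an analogy only, not how JMP implements virtual joins; the `VirtualJoin` class, its method names, and the uniqueness check on the link ID are assumptions made for the illustration.

```python
class VirtualJoin:
    """Expose columns of a keyed source table without copying them."""

    def __init__(self, referencing_rows, key, source_rows):
        self.rows = referencing_rows   # table holding the link reference
        self.key = key
        self.index = {}                # link ID value -> source row
        for row in source_rows:
            if row[key] in self.index:
                raise ValueError(f"duplicate link ID: {row[key]!r}")
            self.index[row[key]] = row

    def get(self, row_number, column):
        """Read a local column, or follow the key to the source table."""
        row = self.rows[row_number]
        if column in row:
            return row[column]
        linked = self.index.get(row[self.key])
        return None if linked is None else linked.get(column)


# Hypothetical data in the spirit of the HR example.
employee_scores = [{"ID": 1, "Gender": "F"}, {"ID": 2, "Gender": "M"}]
compensation = [
    {"ID": 1, "Compensation Type": "Base"},
    {"ID": 1, "Compensation Type": "Bonus"},
    {"ID": 2, "Compensation Type": "Base"},
]
linked = VirtualJoin(compensation, "ID", employee_scores)
```

Here `linked.get(1, "Gender")` follows the ID from the second compensation row back to the employee table, so the gender value is stored once no matter how many compensation rows refer to that employee.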
You can also see in here that you have...they're kind of long column names, you have the column names, plus this identifier right here which is showing you that this is a referencing column. And so I'm going to run this little simple tabulate I've saved here and just show you very briefly that this is a report and just to simply show you this is a virtual column length of service. And then compensation type is actually part of my compensation table and then gender is a virtual column. So I'm using...building this using virtual columns and also columns that reside in the table. One thing I wanted to point out to you very quickly is that under this little red triangle...let's say you're working with this data and you decide, "Oh, I really want to make this one table. I really want all the columns in one table." There is a little secret tool here called merge reference data. And a lot of people don't know this is there, exactly. But if I wanted to, I could click that and I can merge all the columns into this table. And so, but for time sake, I'm not going to do that right now, but I wanted to point out where that is located. And let me just show you back here in the journal, real quickly. This is possible to do with scripting, so you can set the property link reference and point to your table and list that to use the link columns. So I'm going to close this real quickly and then go back to the same application where I actually had same two tables that I've got some extra saved scripts in here, a couple more things I want to show. So again, I've got employee scores. This is my source table. And then I've got compensation master and they're already linked and you can see this here. So I want to rerun that tabulate and I want to show you something. So you can see that these column names are shorter now. So I want to show what we added in JMP 14. If I right click and bring up the column info dialog, I can see here that it says use linked column names right here. 
And that sets it so that these names will be shorter. And that's really a nice feature because, at the end of the day, when you share this report with someone, they don't really care where the columns are coming from, whether they're in the main table or a virtual table. So it's a nice, clean report for you to have. The script is saved so that you can see in the script that it shows you the referencing table. So if I look at this, I can see that. So you would know where this column is coming from, but somebody you're sharing with doesn't necessarily need to know. So I want to show you this other thing that we added with this dispatching of row states. Real quick example, I'm going to run this distribution. And you notice right away that in this distribution, I've got a title that says these numbers are wrong. And so let me point out what I'm talking about. Employee scores is my employee database table and it has about 3,600 employees. This is a unique reference to employees and it's a current employee database, let's say. My compensation master table is more like a history table and it has 12,000 rows in it, so it potentially has multiple references to the same employee; let's say an employee changed jobs and they got a raise, or they moved around. Or it could have employees that are no longer in the company. So running this report from this table doesn't render the information that I really want. I can see down here that my count is off; I've got bigger counts, and I don't exactly have what I was looking for. So this is one of the reasons why we created this row state synchronization, and Kelci is going to talk a little bit more about this in a real life application, too. But I'm just simply going to show you how you would set up dispatching row states. So what I'm doing is I'm dispatching selection, color, and marker.
And what I'm doing is I'm actually sending from compensation master to employee scores, I'm sending the information to this table because (I'm sorry), this is the table that I want my information to be run from. So if I go back and I rerun that distribution, I now have this distribution (it's a little bit different columns), but I have this distribution. And if I look at the numbers right here, I have the exact numbers of my employee database. And that's exactly what I wanted to see. So you need to be careful with dispatching and accepting and Kelci will speak more to that. But that was just a simple case example of how you would do that. And I will show you real quickly, that there is a Online Help link that shows an example of virtually joining columns and showing row states. It'll step you through that. There's some other examples out here too of using virtual join. If you need more information about setting this up. And again, just to remind you, all of this is scriptable. So you can script this right here, by setting up your row states and the different things that you want with that. So as we moved into JMP 15 we added a couple more things. And so what we added was we we added the ability to auto open a table and also to hover over columns and figure out where they're coming from. And I'll explain what that what that means exactly. So if I click on these. We created some new tables for JMP 15, employeemaster.jmp, which is still part of this HR data. And so if I track this down a little bit and look, a couple things I'll point out about this table. It has a link ID and a link reference. And that was the other thing we added to to JMP 15, the ability to be able to have a link ID and link reference on the same column. So if I look at this and I go and look at my home window here, I can see that there's two more tables that are open. They were opened automatically for me. 
And so I'm going to open these up because I kind of want to string them out so you can see how this works. But this employee master table is linked to a...I'll stack them on top of each other...it's linked to the education history table, which has been, in turn, linked to my predicted termination table. And you can see there's an employee ID that has a link reference and the link ID, employee ID here. Same thing here, and then predicted termination has a link ID only. And if you had another table or two that had employee ID unique data and you needed to pull it in, you could continue the string by assigning a link reference here, and you can keep on going. So just to show you quickly, if I right click and look at this column here, I can see that my link ID is set, and I can also see my link reference is set. And it tells me education history is the table that this table is linked to. I've got it on auto open and I've got it on the shorter names. I'm not dispatching row states, so nothing is set there. So all of the columns that are in these other two tables are available to me in my referencing table here, called employee master. And real quickly, you can see that you have a large number of columns in here that are available to you, and the linked columns show up as grouped columns down here. So another question that we got from customers was, is there any way you can tell us where these columns come from, so that it's a little clearer? So we added this nice little hover tip. If I hover over this, this tells me that this particular column, disability flag, is coming from predicted termination. So it's actually coming from the table that's last in my series. And if I go down here and I click on one of these, it says the degree program code is coming from education history. So that's a nice little feature that will kind of help you as you're picking out your columns, maybe in what you're trying to run with platforms and so forth.
But if I run this distribution, this is just a simple distribution example that's showing that employee level is actually coming from my employee master table. This degree description is coming from education history table and this performance eval is coming from my predictive termination table. And then you can look some more with some of these other examples that are in here. I did build a context window of dashboards here that shows a Graph Builder showing a box plot. We have a distribution in here, a tabulate and a heat map, using all virtual columns, some, you know, some columns that are from the table, but also virtual columns got a filter. So if I want to look at females and look at professionals. I always like to point out the the oddities here. So if I go in here and look at these two little places that are kind of hanging out here. This is very interesting to me because comp ratios shows how people are paid. Basically, whether they're paid in in the right ratio or not it for their job description. And it looks like these two outliers are consistently exceeding expectations, that looks like they're maybe underpaid. So just like this one up here is all by itself and it looks like they seldom meet their expectations, but they may be slightly overpaid and or they could be mistakes. But at any rate, as you zero in on those, you can also see that the selections are being made here. So, in this heat map, I can tell that there is some performance money that's being spent and training dollars. so maybe train that person. So that's actually good good good to see So that is about all I wanted to show. I did want to show this one thing, just to remind, just to reiterate. Education history has access to the columns that are in predicted termination. And so those two tables can talk to each other separately. 
And if I run this graph script, I have similar performance and training dollars, but I'm looking at like grade point average, class rank, as to where people fall into the limits here using combinations of columns from just those two tables. So I'm going to pass this on. I believe that was the majority of what I wanted to share. I'm going to stop sharing my screen. And I will pass this back to Kelci and she will take it from here. Kelci J. Miclaus Thanks, Mandy. Mandy said we've given this talk now a couple times and, really it was this combined effort of me working in my group, which is life sciences for the JMP Clinical and JMP Genomics vertical solutions, and finding such perfect examples of where I could really leverage virtual joins and working closely with the development team on how those features were released in the last few versions of JMP. And so for this section I will go through some of the examples, specific to our clinical research and how we've really leveraged this talking table idea around row state synchronization. So as as Mandy mentioned this is now, and if we have time towards the end, this, this idea of virtual joins with row state synchronization is now the entire architecture that drives how JMP Clinical reports and reviews are used for assessing early efficacy and safety and clinical trials reports with our customers. And one of the reasons it fits so well is because of the formatting of typical clinical trial data. So the data example that I'm going to use for all of the examples I have around row state synchronization or row state propagation as I sometimes call it, are example data from a clinical trial that has about 900 patients. It was a real clinical trial carried out about 20-30 years ago looking at subarachnoid hemorrhage and treatment of nicardipine on these patients. 
The great thing about clinical data is we work with very standard normalized data structures, meaning that each component of a clinical trial that is collected, similar to the HR data that Mandy showed us, is normalized, so that each table has its own content and we can track that separately, but then use virtual joins to create comprehensive stories. So the three data sets I'll walk through start with this demography table, which has a little under 900 patients of a clinical trial, where we have one row per patient in our clinical trial. And this is called the demography table; it will have information about their birth, age, sex, race, what treatment they were given, and any certain flags of occurrences that happened to them during the clinical trial. Similarly, we can have separate tables. So in a clinical trial, they're typically collecting at each visit what adverse events have happened to a patient while on a new drug or study. And so this is a table that has about 5,000 records. We still have this unique subject identifier, but we have duplications, of course. So this records every event or adverse event that was reported for each of the patients in our clinical trial. And finally I'll also use a laboratory data set or labs data set, which also follows the similar type of stacked record format that we saw in the adverse events. Here we're thinking of the regular visits, where they take several laboratory measurements and we can track those across the course of the clinical trial to look for abnormalities and things like that. So these three tables are in a very standard normalized format, what's called the international CDISC standard for clinical trial data. And it suits us so well for using the virtual join. As Mandy has said, it is easy to, you know, create a merged table of labs. But here we have 6,000 records of labs, and merging in our demography would cause a duplication of all of their single instances of their demographic descriptions.
And so we want to set up a virtual join with this, which we can do really easily. If we create in our demography table, we're going to set up unique subject identifier as our link ID. And then very quickly, because we typically would want to look at laboratory results and use something like the treatment group they are on to see if there's differences in the laboratories, we can now reference that data and create visualizations or reports that will actually assess and look at treatment group differences in our laboratory results. And so we didn't have to make the merge. We just gained access to these...this planned arm column from our demography table through that simple two-step setting up the column properties of a virtual join. It's also very easy to then look at like lab abnormalities. So here's a plot by each of the different arms or treatment groups who had abnormally high lab tests across visits in a clinical trial. We might also want to do this same type of analysis with our adverse event, which we would also want to see if there's different occurrences in the adverse events between those treatment groups. So once again we can also link this table to our referenced demography and very quickly create counts of the distribution of adverse events that occur separately for, say, a nicardipine, the active treatment, versus a placebo. So now we want them to really talk. And so the next two examples that I want to show with these data are the row state synchronization options we have. So you quickly saw from Mandy's portion that she showed that on the column properties we have the ability to synchronize row states now between tables. Which is really why our talk is called talking tables, because that's the way they can communicate now. And you can either dispatch row states, meaning the table that you're set up the reference to some link ID can send information from that table back to its reference ID table. 
And I'll walk through a quick example, but as mentioned...as Mandy mentioned, this is by far the more dangerous case sometimes because it's very easy to hit times when you might get inconclusive results, but I'm going to show a case where it works and where it's useful. As you've noticed, just from this analysis, say with the adverse events, it was very easy as the table that we set up a link reference to (the ID table) to gain access to the columns and look at the differences of the treatment groups in this table. There's not really anything that goes the other way though. As Mandy had said, you wouldn't want to use this new join table to look at a distribution of, say, that treatment group, because what you actually have here is numbers that don't match. It looks like there's 5,000 subjects when really, if you go back to our demography table, we have less than 900. So here's that true distribution of about the 900 subjects by treatment group with all their other distributions. Now, there is the time, though, that this table is what you want to use as your analysis table or the goal of where you're going to create an analysis. And you want to gain information from those tables that are virtually linked to it. The laboratory, for example, and the adverse events. So here we're going to actually use this table to create a visualization that will annotate these subjects in this table with anyone who had an abnormal lab test or a serious adverse event. And now I've cheated, because I've prepared this data. You'll notice in my adverse events data I've already done the analysis to find any case of subjects that were...any adverse events that were considered serious and I've used the row state marker to annotate those records that had...were a serious adverse event. Similarly, in the labs data set, I've used red color to annotate...annotate any of the lab results that were abnormally high. So for example, we can see all of those that had high abnormalities. 
I've colored most of this red just through row state selection and then controlling the row states. So with this data where I have these two row states in place, we can go back to our demography table and create a view that is a distribution by site of the ages of our patients in a clinical trial. And now if we go back to each of the linked tables, we can control bringing in this annotated information with row state synchronization. So we're going to change this option here, row states with reference table, from none to dispatch, and in this case I want to be careful. The only thing I want this table to tell that link reference table is a marker set. I'm going to click Apply. And you'll notice automatically my visualization that I created off that demography table now has the markers of any subjects who had experienced an adverse event from that other table. We can do the same now with labs. Choose to dispatch. In this case, we only want to dispatch color. And now, just by controlling column properties, we're at a place where we have a visualization or an analysis built off our demography table that has gained access to the information from these virtually joined tables using the dispatch row state synchronization or propagation. So that's really cool. I think it's a really powerful feature. But there are a lot of gotchas and things you should be careful with with the dispatch option. Namely, the entire way virtual joins work is that the link ID table, the data table you set up (in this case demography), is one row per ID, and you're using that to merge or virtually join into a data table that has many copies of that ID. So we're making a one-to-many; that's fine. Dispatch makes a many-to-one conversation. So in the resource provided with this video, there's a lot of commentary about carefully using this. It shouldn't be something that's highly interactive. 
If you then decide to change row states, it can be very easy for this to get confusing or nonsensical. Say I've marked both with color and marker: it wouldn't know what to do, because some rows might be saying, "Color this red," but the other linked table might be saying color it blue or black. So you have to be very careful about not mixing and matching and not being too interactive with that many-to-one merge idea. But in this example, this was a really, really valuable tool that would otherwise have required quite a lot of data manipulation to get to this point. So I'm going to close down these examples of the dispatch virtual join and move on to what's likely going to be more commonly used, the accept row state option of the virtual join talking tables. And for this case, I'm actually going to go through this with a script. So instead of interactively walking through the virtual join and row state column properties, we're going to look at the scripting results of that. And the example here, what we wanted to do, is be able to use these three tables (again, the demography, adverse events and laboratory data in a clinical trial) to really create what they call a comprehensive safety profile. And this is really the justification and rationale of our use in JMP Clinical for our customers: this idea that we want to be able to take these data sets, keep them separate, but allow them to be used in a comprehensive single analysis so they don't feel separate. So with this example, we want to be able to open up our demography and set it up as a link ID. So this is similar to what I just did interactively; it will open the demography table and create the link ID column property on unique subject identifier. So we're done there. You see the key there that shows that that's now the link ID property. We then want to open up the labs data set. 
And we're going to set a property on the unique subject identifier in that table to use the link reference to the demography table. And there are a couple of options here. We want to set the property for using shorter names: use the linked column name to shorten the names of the columns coming from the demography table into the labs table. And here we want to set up row state synchronization as an acceptance of select, exclude and hide. And we're going to do this also for the AE table. So I'll run both of these next snippets of code, which will open up my AE and my lab table. And now you'll see that instead of dispatch, the properties here are set to accept with select, exclude and hide. And similarly the adverse events table has the exact same acceptance. So in this case, instead of dispatch, where we were very careful to only dispatch one type of row state from one table and another from another table back to our link ID reference table, here we're going to let our link ID reference table, demography, broadcast what happens to it to the other two tables. And that's what accept does. So it's going to accept row states from the demography table. And I've cheated a little bit in that I actually just have a script attached to our demography table here that is really just setting up, in a single window, some of the visualizations that I've already shown, which are scripts attached to each of the tables. And so here we have what you could consider your safety profile. We have distributions of the patient demographic information. So this is sourced from the demography table. You see the correct numbers of the counts of the 443 patients on placebo versus the 427 on nicardipine.  
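The one-to-many relationship and the "accept" direction described in this talk can be pictured with a small sketch. This is plain Python on made-up toy data, not JMP's virtual join mechanism or its scripting API; the subject IDs, event names, and function names are all hypothetical. It shows why treatment-arm counts must come from the one-row-per-subject table, and how a selection made there can be broadcast to every matching row of a linked table.

```python
from collections import Counter

# Toy data (hypothetical): one row per subject in the link ID table,
# many rows per subject in the linked adverse events table.
demography = {"S1": "Placebo", "S2": "Nicardipine", "S3": "Placebo"}
adverse_events = [
    ("S1", "Headache"), ("S1", "Nausea"), ("S2", "Headache"),
    ("S3", "Fever"), ("S3", "Cough"), ("S3", "Rash"),
]

def arm_counts_per_subject(demog):
    """True distribution of treatment arm: one count per subject."""
    return Counter(demog.values())

def arm_counts_per_row(rows, demog):
    """Arm looked up on every linked row: counts events, not subjects."""
    return Counter(demog[subject] for subject, _ in rows)

def accept_selection(selected_subjects, rows):
    """Broadcast a selection from the ID table to all matching linked rows."""
    return [subject in selected_subjects for subject, _ in rows]

print(arm_counts_per_subject(demography))
print(arm_counts_per_row(adverse_events, demography))
print(accept_selection({"S1"}, adverse_events))
```

The per-row counts are inflated by subjects with many events, which is the mismatch the speaker warns about. Broadcasting from the ID table is unambiguous because each linked row has exactly one parent, whereas the reverse (dispatch) direction has to reconcile many rows into one.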
Caleb King, Research Statistician Developer, JMP Division, SAS Institute Inc.   Invariably, any analyst who has been in the field long enough has heard the dreaded questions: “Is X-number of samples enough? How much data do I need for my experiment?” Ulterior motives aside, any investigation involving data must ultimately answer the question of “How many?” to avoid risking either insufficient data to detect a scientifically significant effect or having too much data leading to a waste of valuable resources. This can become particularly difficult when the underlying model is complex (e.g. longitudinal designs with hard-to-change factors, time-to-event response with censoring, binary responses with non-uniform test levels, etc.). In this talk, we will show how you can wield the "power" of one-click simulation in JMP Pro to perform power calculations in complex modeling situations. We will illustrate this technique using relevant applications across a wide range of fields.     Auto-generated transcript...   Speaker Transcript Caleb King Hello, my name is Caleb King. I'm a research statistician developer here at JMP for the design of experiments group.   And today I'll be talking to you about how you can use JMP to compute power calculations for complex modeling scenarios. So as a brief recap, power is the probability of detecting a scientifically significant difference that you think exists in the population.   And it's the probability of detecting that given the current amount of data that you've sampled from that population.   Now, most people, when they run a power calculation, are usually doing it to determine the sample size for their study. There is, of course, a direct tie between the two: the more samples you have, the greater chance you have of detecting that scientifically significant difference.   Of course, there are other factors that tie into that. There's the model that you're using and the response distribution type.   
And there's also, of course, the amount of noise and uncertainty present in the population, but for the most part people use power as a metric to determine sample size. Now, I'd say there are kind of three stages of power calculation, and all of them are addressed in JMP, especially if you have JMP Pro, which is what I will be using here.   The first stage is some of those simpler modeling situations, where we go here under the DOE menu, under Design Diagnostics. We have the sample size and power calculators.   And these cover a wide range of very simple scenarios. So, if you're testing one or two sample means, maybe an ANOVA-type setting with multiple means, proportions, standard deviations. Most of this is what people think of when they think of power calculations. So, of course, you go through and you specify, again, the noise, error rates, any parameters, and what difference you're trying to detect, and, say, if I'm trying to reach a certain power, I can get the sample size.   Or, if I want to explore a bit more, I can leave both empty and I get a power curve. Now, of course, again, these are your simpler scenarios. The next stage, I would say, is what could be covered under a more general linear model, so I'll exit out of this.   In that case, we can go here under the all-encompassing Custom Design menu.   I'll put in my favorite number of effects.   I'll click Continue.   And I'll leave everything here.   So we'll make the design.   And at this point I can do a power analysis based on the anticipated coefficients in the model. So in this case, it might say that for this particular design with 12 runs, I have roughly 80% power to detect this coefficient. If I was trying to detect, say, something a bit smaller, I could change that value, apply the changes, and of course see I don't have as much power. So if that's really what I'm looking for, I might need to make some changes. Maybe I need to go back and increase the run size.   
So, those are the two most common settings where we might do a power calculation, but of course life isn't that simple. You might run into more complex settings: you might have mixed effects factors, or you might run into a longitudinal study that you have to compute power for.   You might run into settings where your response is no longer a normal random variable. You might have count data, you might have a binary response, you might even have a sort of bounded 0/1 type response, so a percentage-type response.   So, what can you do if you can't go to the simple power calculators, and it might be too complex even for the DOE menu to run a power analysis? Well, JMP Pro is here to help, and it involves a tool that we call one-click simulation.   So the idea here is, we'll simulate data through a Monte Carlo simulation approach to try and estimate the power that you can get for your particular settings.   There might be a little bit of work up front that you need to do, at least depending on the modeling platform.   But once you've got it down, it's pretty straightforward to do.   And I'll go ahead and say that this was something I didn't even know JMP could do until I started working here. So, I'm happy to share what I found with you.   Alright, so we'll start off with a simpler extension of the standard linear model, where we incorporate some mixed effects.   So we have a company that's looking to improve their protein yield for cellular cultures.   The factors are temperature, time, and pH. We also have some mixture factors: water and two growth factors. Now, at this stage, if we stopped here, we probably would still be able to use the power calculator available in the Custom Design platform.   Where we start to deviate is that we now introduce some random effect factors. We have three technicians, Bob, Jill, and Stan, who are representative of the entire sample of technicians.   
And they will use at least one of three serum lots, which is again a representation of all the serum lots they could use, and so we treat them as random effects.   We also have a random blocking effect. In this case, the test will be conducted over two days. And so I'll show you how we can use one-click simulation in JMP Pro to compute power for this case. So I'll click to open the design.   So this was the design that I've created. Let me expand my window here so you can see everything. Now this might represent what you typically get once you've created the design.   Again, at this point, you could have clicked Simulate Responses to simulate some of the responses. But even if you didn't, it's still okay.   A trick that you can easily use to replicate that is to simply create a new column. We won't bother renaming it at this point; we're just going to create a simple formula.   Go here to the left-hand side. Click Random, then Random Normal.   Leave everything at the default, click Apply, OK.   And we've got ourselves some random noise data, some simulated response data.   At this point, I'll right-click, Copy, and right-click, Paste to get my response column.   Now all I need is just some sort of response, so simple random noise will work fine here. We're not trying to analyze any data yet. What we want is to use the Fit Model platform to create a model for us that we'll then use to create the simulation formula. The way we do that, we'll go under Model. Now I've done a bit of work ahead of time here.   So, I've already created the model here. And just to show you how I did that, I'll go here under Redo, under Relaunch Analysis.   So, here you see I have my response, protein yield, all my fixed effects, and I've got some random effects.   I kept everything pretty standard there.   Now you see there's a lot going on here. We don't need to pay attention to any of this. We are just interested in creating a model. 
At this point, the way we do that is we go in here under the red triangle menu.   We'll go under Save Columns. Now we need to be careful which column we select. If I select the prediction formula, which you might be tempted to do, that's good, but it doesn't get us all the way there, as you'll see.   If I go into the formula, this is the mean prediction formula. There's nothing about random effects here. So this isn't the column I want. It's not complete; it doesn't contain everything I need. I need to come back, go under Save Columns again, and scroll down here to the conditional prediction formula. Note from the hover help that it includes the random effect estimates, so it's the one I want.   Now, you might be in a case where you don't really want to compute power for the random effects. You want it just for the mean model, in which case you could have easily gone back to the Custom Design platform and done it that way. Let's pretend that we're interested in those random effects as well.   Now we've saved the conditional prediction formula.   Again, we'll go in and look at the formula.   And here you can see we have the random effects. Now we need to do some tweaking here to get it into the simulation formula that we want. So I'm going to double-click here.   This puts me into the JMP scripting language formatting.   Now, first I'll make some changes to the main effects. And I'm just going to pick some values. So let's do 0.5 for temperature, 0.1 for time, and for pH, let's do 1.2, a little bit higher.   For water, I'm going to go even higher. These might have larger coefficients, so I'll do 85 for water.   I'll do 90 for the first growth factor, and let's do 50 for growth factor two.   Alright, so I've made my adjustments to the mean model portion. Now again, these are parameter values that you think are scientifically important.   Now for the random effects, you might be tempted to replace them with something like this. 
That should be a random effect, so I'll just put a Random Normal here.   And it kind of looks right, but not exactly. The reason is that this formula is evaluated row by row. What's going to happen is, the first time you come across a technician named Jill, you will simulate a random value here and you'll get a value for that formula evaluation, but the next time you get to Jill, at row six here, this will simulate a different value, which defeats the purpose of a random effect. A random effect should hold the same value every time Jill appears.   Instead, it's going to take on the effect of something like a random error, which I'll take this opportunity to put in here; that is a value that we do want to change every row. So how do we overcome this?   I tell you this because I actually ended up doing this the first time I presented this, which is slightly embarrassing.   Thankfully, my coworker came along afterward and showed me a trick for how to actually input the random effect appropriately, and here's the trick.   We're going to go to the top here and type: if Row equals one, I'm going to create a variable, call it tech Jill.   And now here's where I place it.   We'll replace this Random Normal with tech Jill.   What this will do is, if it's the first row, we simulate a random value and assign it to that variable.   After the first row, we don't simulate again, which means tech Jill keeps the value it was initially given, and it will hold everywhere we put it.   So we will do the same for Bob.   As you can see, that will accomplish the task of the random effect.   Put Bob here. For Stan, things are a little bit easier. We don't have to simulate for him, because random effects should add up to zero in the model.   And so the way we do that, we make his be the opposite sign of the sum of the other effects.   
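The trick described above can be sketched outside of JMP. This is a minimal Python illustration, not the JSL formula from the talk; the function name, the standard deviations, and the seed are assumptions. Each technician's effect is drawn once and reused on every row where that technician appears, the last technician takes the negative sum of the others so the effects add up to zero, and only the residual error is redrawn per row.

```python
import random

def simulate_random_effects(levels, sd, rng):
    """Draw one effect per level, once; the last level is set to the
    negative sum of the others so the effects add up to zero."""
    effects = {}
    for level in levels[:-1]:
        effects[level] = rng.gauss(0.0, sd)
    effects[levels[-1]] = -sum(effects.values())
    return effects

rng = random.Random(1234)  # arbitrary seed, for reproducibility only

# Drawn once, before any rows are evaluated (the "if Row equals one" step).
tech_effects = simulate_random_effects(["Jill", "Bob", "Stan"], sd=1.0, rng=rng)

# Technician assignment per run (hypothetical design rows).
rows = ["Jill", "Bob", "Stan", "Jill", "Bob", "Stan"]

# Same technician effect on every row for that technician,
# plus a fresh random error on every row.
responses = [tech_effects[tech] + rng.gauss(0.0, 0.5) for tech in rows]

for tech, y in zip(rows, responses):
    print(tech, round(y, 3))
```

Drawing inside the per-row expression instead would give Jill a different "effect" on each of her rows, which is the mistake the speaker describes.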
Do the same thing here for serum lot one.   Now for this one I'm going to give it a bit more noise.   Let's say there's a bit more noise in the serum lots.   And this is the advantage of this approach: you get to play around with different scenarios.   Input those values here.   Okay. And again, this one is the negative of the sum of the others. And before I add the other one, I'll go ahead and just add it here, since that makes it easy: day one, negative day one.   And I'll add its random effect here, and I'll say that its random effect, if I can type, is a bit smaller.   Alright, well, at this point, we should have our complete simulation formula. If I click OK, it takes me back to the Formula Editor view.   We should be good to go.   Alright, so there's our simulation formula.   Now, what do we do next? We'll go back to our Fit Model report.   And we're going to go to the area where we want to simulate the power.   Here I'm going to go under the Fixed Effects Tests box. I'm going to go here to this column, which is the p-value. In this case the original noisy simulation didn't give us any p-values. That's okay. We don't care about that.   We just needed this to generate the model, which we then turned into a simulation formula. I'm going to right-click under this column. Now remember, this only works if you have JMP Pro.   And here at the very bottom is Simulate. So we click that.   And it's going to ask us which column to switch out. By default it selects the response column, and then it's going to go through and find all the simulation formula columns. So we want to switch in this one, because this one contains our simulation model.   I'll tell it how many samples; I'm going to do 100.   I'll give it my favorite random seed.   And I click OK.   Wait about a second or two.   And there we are.   So it's generated a table where it simulated the response, fit the model, and is reporting back the p-values. 
Now there are some cases where there are no p-values; we ended up in a situation much like what we started with, and that's okay. That happens in simulation, so long as we have a sufficient number to get us an estimate.   Now the nice thing about this is that JMP saw that we were simulating p-values. So it said, "I bet you're wanting to do a power analysis," and it's happily provided us a script to do that. So thanks, JMP.   We run that and you'll see it looks a lot like the Distribution platform. So it's done a distribution of each of those rows, excuse me, columns, but with an added feature: a new table here that shows the simulated power. And because we simulated 100, we can read these off as sort of the estimated power. If it weren't 100, if it were some other number, then you can look at the rejection rate. So we see here that for our three mixture factors, it looks like we have pretty good power, given everything that we have, to detect those particular coefficients. If we go over here to the other three factors, things don't look as good.   So then we'd have to go back and say, okay, what's the maximum value that I can detect? So I'm going to minimize these, minimize this table. I'll come back to my formula and say, let's do something different here.   What if I change this? So this was 0.5; what if it were higher, about one?   For time, let's also make it one.   And for pH, I'm going to go to three. So I'm going to bump things up a bit. So, you know, hey, can I detect this?   We'll keep everything else the same, because we know we can detect those, it looks like. Click Apply, OK. It generated some new values.   Again, same thing. Right-click under the column that you want to simulate, click Simulate, switch in the simulation formula, and give it a certain number of samples. I'll stick with the same number and the same seed.   And we'll go.   We just have to wait a few seconds for it to finish the simulation.   There we are.   
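The "simulated power" the speaker reads off is a rejection rate over the simulated p-values. A minimal sketch of that arithmetic in Python follows; it is not JMP's power analysis script. The 0.05 significance level and the choice to drop missing p-values (represented here as `None`) before dividing are assumptions, and the p-values are made-up numbers.

```python
def estimated_power(p_values, alpha=0.05):
    """Fraction of usable simulated p-values that fall below alpha.

    Missing p-values (None), from simulations where the test could not
    be computed, are dropped before the rate is taken (an assumption).
    """
    usable = [p for p in p_values if p is not None]
    if not usable:
        raise ValueError("no usable p-values to estimate power from")
    rejections = sum(1 for p in usable if p < alpha)
    return rejections / len(usable)

# Hypothetical p-values for one effect across six simulated data sets.
simulated = [0.01, 0.20, None, 0.03, 0.049, 0.5]
print(estimated_power(simulated))  # 3 rejections out of 5 usable -> 0.6
```

With exactly 100 usable simulations the rejection count and the percentage coincide, which is why the count can be read directly as the estimated power in that case.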
And we'll run our power analysis again.   These look to be the same here. We didn't change anything there. So in fact, I'm going to hide these groups. A little too much. Here we go. Let's hide these three.   Let's look at these. So we seem to have done better on pH, so that value might be the upper range of what we can detect given this sample size.   But for temperature and time, it seems we still can't detect even those high values. So, okay, what else could we change? What if we double the number of samples? I mean, we are calculating this for a sample size. So let's go back, and one way we can do that is to go to DOE and click Augment Design.   We'll select all our factors.   Select our response.   Click OK.   We'll just augment the design.   And this time we'll double it; we'll make it 24.   So I'll make the design.   And it's going to take a little bit of time. So I'm actually going to... a bit early.   And let's see, we'll make the table.   Okay, so now we've doubled the number of runs.   And so it only gave us half the responses. That's okay. Since we just need a response, I'm just going to take this and I'm going to copy and paste.   Of course, in real life you wouldn't want to do that, because hopefully you'd get different responses. But again, we just need a noisy response. Go to the model. Now this time, we've got to fix things a little bit. I'm going to select these three, go here under Attributes, and say they're random effects.   Keep everything else the same. Click Run.   You'll notice I don't yet have my simulation formula, but rather than have to walk through and rebuild it, I can actually create a new column, go back to the old one, right-click, Copy Column Properties, come back, right-click, Paste Column Properties, and my formula is now ready to go. So, let's say, what if we do it under this situation, and we'll keep the values that we initially had.   So I'll go back. I'll double-click this to open up the Fit Model window.   
Go under the Fixed Effect Tests, right-click on the Prob > F p-value column, Simulate.   I'm not going to change this, because there was only one simulation formula, so it selected the one I wanted, and it found the right response.   So I'll just change these.   Let's see what happens in this case.   Alright.   Run the power analysis. Now again, I'm not going to worry about these mixture effects, because as you can see, we just got better than what we had originally, which was already good. So I'm going to hide them again, so we can more easily see the ones we're interested in. In this case, pH we knew we were probably going to do better on, because even with the old 12 runs we had pretty good power.   It looks like we have definitely improved on temperature and time. So if those represent sort of the upper bound of the effect sizes we're interested in, maybe a lower upper bound, then this seems to indicate that doubling the sample size might help.   So these illustrate, first of all, how to do the one-click Simulate, and then how we can use it to do power calculations. And it encourages you to do something I often did before I came to JMP, which is give people options. Explore your options. Doubling the sample size seemed to help with temperature and time.   Changing what you're looking for seemed to help with pH, and then the mixture effects we seem to be okay on. So explore your options.   That can also include going back and changing the variances of your random effect estimates.   So for example, I could come back here. I won't do it, but I could change these values and ask what happens if the technicians were a bit noisier or the serum lots were less noisy. Try and find situations so that your test plan is more robust to unforeseen settings.   Okay, so let me clean up. I'll go through and close these all out.   Alright.   So for the remainder of the scenarios, I'm going to be exploring different takes on how you can implement this. 
So the general approach is the same. You create your design, you simulate a response, use Fit Model, or in this case a slightly different platform, to generate a model, and then use that model to create a simulation formula, which you will then use in the one-click Simulate approach.   So now let's look at a case where we have a company that's going to conduct a survey of their employees. In this case they already have one, but let's pretend that they are going to conduct it, and they want to determine which factors influence employee attrition. So maybe they have a lot of employees that are going to be leaving, and they want to conduct a survey to assess which factors matter, and so they want to know how many respondents they should plan for.   Now the response is years at the company, but there are two little kinks. First, an employee has to have worked at least a month before they leave for it to be considered attrition.   And the other is that the responses are given in years, but maybe we're more concerned about months. Maybe that's how our budgeting software works or something.   And, you know, for employees, it might be easier for them to answer in how many years they've been at the company rather than how many years and months.   So in this case we have interval censoring, because we're given how many years, but that only tells us that they've been there between that many years and a year later. We also have the situation where, if they leave before a year, it will be censored between a month and a year.   So I'll open up the data table. I've set up a lot already. We've got a lot of factors here, and I'll scroll all the way to the end so you can see the responses that we're looking at.   So again, we have a Years Low and a Years High. What this means is that if an employee were to respond that they left after six years, their actual time there, in terms of months, is somewhere between six and seven years.   
If they left before a year, then we know that they were there sometime between a month and a year.   I'm going to click this dialog button here to launch it. For interval censoring here, we'll use the GenReg (Generalized Regression) platform. We're going to assume a Weibull distribution for the response.   We don't put a censoring code here because we have interval censoring. The way we handle that is we put both response columns into the Y role, which you'll see. Okay. And here are all the factors. When we click Run, JMP recognizes this as a time-to-event distribution and says, okay, you gave me two response columns; does that mean you're doing interval censoring? In this case, yes, we are.   So now we're going to go through the same thing. We're going to find the right red triangle. In this case, it's here next to Weibull Maximum Likelihood. Now here's the really nice thing about the GenReg platform. There's already a lot of nice things about it, but here's just some more icing on top.   If we did it like before, we'd have to go in and save the prediction formula, and we'd have to go and make some adjustments to, you know, make sure it's a random Weibull that's being simulated, and adjust things as needed.   GenReg, though, is aware that you can do the one-click Simulate, and so it's saying, hey, would you like me to actually save the simulation formula for you, if that's what you're interested in? And yes, we are. So we click Save Simulation Formula.   Let's go back to our table.   And you'll notice it only simulated one column. I'll talk a bit more about why in a moment. But let's real quick check; we'll go in, and there it is. In fact, I'll double-click to pull up the scripting language. You'll see it's already got it set up as a random Weibull, and it's got the transformation of the model already in there.   All you would have to do at this point is change these parameter values to what is scientifically significant to you.   
Okay, now for this purpose I won't do that. I'm just going to leave them be. I will make one change though, because I want to try and replicate the actual situation that we're going to be in. Notice here, these are all continuous values, when in actuality what we should be getting are nice round whole-year numbers. So here's the way I can do that.   Since this is Years High, I'm going to create a simple variable, make it equal to the actual continuous time, but tell it to return the ceiling, so it rounds up, essentially.   Apply, OK. And there you have it.   As you can see, this tells me that I've simulated Years High.   Now, to see the issue when you do the one-click simulation, I'll open up the Effect Tests.   If I right-click and then click Simulate, I can only enter one column at a time. I can't drag and select more than one.   Now, if I were to just replace Years High with the Years High simulation, that looks okay. The problem is this Years Low. Years Low is being brought in because it was part of the original model, but it's the Years Low that you originally used. If we look back, we already see an issue. Let me cancel out of this real quick.   For example, if we were to do that, it wouldn't be able to fit this first one, because the Years High is lower than Years Low. This Years Low is not tied to the simulated response. So how do we fix that? We need to make that connection. So I'll go to Years Low and click Formula. There's already a formula here; I'm just going to make a quick change.   I'm going to say, if the simulation formula (I double-click to insert that) is equal to one, so if Years High is one, return 1/12.   Otherwise, return the simulation value minus one.   Now click OK and Apply.   As you can see, it's proper now; it's tied to it.   
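The two column formulas just described amount to a small piece of arithmetic, sketched here in Python. This is an illustration of the rule as stated in the talk, not the JSL column formulas themselves; the function name is hypothetical, and the guard against non-positive times is an added assumption (Weibull draws are always positive).

```python
import math

def censoring_interval(time_in_years):
    """Turn a simulated continuous time (in years) into the reported
    interval (years_low, years_high) for interval censoring."""
    if time_in_years <= 0:
        raise ValueError("time must be positive")
    # Years High: round the continuous time up to a whole year.
    years_high = math.ceil(time_in_years)
    if years_high == 1:
        # Left within the first year: censored between one month and a year.
        years_low = 1 / 12
    else:
        # Otherwise the lower bound is always one year below Years High.
        years_low = years_high - 1
    return years_low, years_high

print(censoring_interval(6.3))  # (6, 7): between six and seven years
print(censoring_interval(0.4))  # lower bound is one month, upper bound is 1
```

Because the lower bound is computed from the simulated upper bound, the pair stays consistent on every simulation run, which is the point of tying Years Low to the simulation formula.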
So now I can go back, right-click, do the Simulate, and replace Years High with its simulation formula, and be comfortable knowing that when I do, the Years Low will be appropriate. It will always be one year lower, unless it's already one year, in which case it's 1/12.   So it's now tied to it, and it'll always be brought in when I do the simulation.   I'll run a quick simulation.   There we go. It's going a bit slow. So that's a good sign.   I'll let it finish out.   Alright. So there are our simulations.   And of course we can run the power analysis. In this case we've got a lot of factors, and I believe there were 1,470 [inaudible].   For a lot of them, we have overkill.   But surprisingly, for some of them we still have issues. And so that might be something worth investigating. Maybe we can't detect that low of a coefficient, and we might have to change something about these factors. Things to discuss in your planning meeting.   So that's how you need to work things when you have this. In this case we had interval censoring. If you had right censoring, so you had a censoring column, it's the same thing. It would output a simulation of the actual time, and you can make some adjustments to that to ensure that it matches the type of time you're seeing in your response or what you expect. And then you'll have to tie your censoring column to the simulation, and this is going to happen whenever you have that type of setting.   Okay.   Let's clear all this out.   So let's look at one other one.   What happens if we have a non-normal response? We've already seen one; we've seen a reliability-type response. So we know we can use GenReg. Let's explore another one real quick. In this case, we have a non-normal response in a test.   The system is going to be a weapons platform, and the response is a percentage. Now, technically, you could model this as a normal distribution.   
And that might be fine, so long as you expect values around, you know, the 50 percent point. But because we want this to be a very accurate weapons platform, we'd hope to see responses closer to 100%. And so maybe something like a beta distribution response might be more appropriate. We do have one other wrinkle. We have these three factors of interest, but one of them, the target, is nested within fuse type. So the type of target will depend on the fuse type. Okay, we'll run this real quick. Again, we've created our data. In this case I simulated some random data, and I did it so that it falls between zero and one. I did that simply by taking the logistic transformation of a random normal. Okay. I will copy, paste (make sure I can paste), and again, walk through it. Pretty simple. We're going to use the beta response. We have our response. We have our target nested within fuse type. Click Run. And again, red triangle menu, Save Columns, Save Simulation Formula. This is what you can do in GenReg; in the regular Fit Model, unfortunately, you cannot do that. But we have our simulation formula. I'm not going to make any changes, but you could go in. As you can see, the structure is already there, even the logistic transformation, so you've just got to put in your model parameters. Excuse me. Click OK. Apply. Okay. And again, we'll go down to Effect Tests, right-click, Simulate, make the substitution, and go. All right, so see how easy it is in general. Even if you have non-normal responses, you're good to go, thanks to GenReg. Okay.

Now, what if you have longitudinal data? This can be tricky, simply because now the responses might be correlated with one another. So how can we incorporate that? Well, it's straightforward. In this case, we have an example of a company that's producing a treatment for reducing cholesterol.
Let's say it's treatment A. We're going to run a study to compare it to a competitor, treatment B, and for the sake of completeness we'll have a control and placebo group. We'll have five subjects per group. The longitudinal aspect is that measurements are taken in the morning and afternoon, once a month, for three months. Now, I'm not going to spend too much time on this, because I just want to show you how you incorporate the longitudinal aspect. So in this case I've already created a model and created the simulation formula, so you can use it as a reference for how you might do this. Let's say we have an AR(1) model. I'll run this real quick, just to show you. So there are all the fixed effects. Notice here we've got a lot of interactions. Keep that in mind when I show you the formula; it might look a bit messy. I've stated that we have a repeated structure, so I've selected AR(1), with period by days within subject. Okay. This is under the Mixed Model platform. And so how do I incorporate that AR(1) into my simulation formula? I did it like this. If it's the first row or a new patient (that's what this means: the current patient does not equal the previous patient), this is the model that I saved. I changed the parameter values to something that might be of interest. It did take a bit of work, because there's a lot going on here; there are a lot of interactions happening. We've got some random noise at the end. But that's all I did. I changed some values here and made a lot of things zero, just to make things easy. If it's not the first row, or if it's not a new patient, how do we incorporate correlation? All I did was copy that model up to here and add this term: just some value (I believe it has to be less than or equal to one) times the previous entry. If it were autoregressive of order 2, then you would add something like Lag(sim formula, 2). And you'd have to make another adjustment where, now, if it's the first row, we have our model.
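The AR(1) branch just described (first row or new patient: model plus noise; otherwise: model plus noise plus a coefficient times the previous simulated entry) can be sketched in Python. This is only an illustration of the logic as spoken, not the JMP formula itself; `simulate_ar1`, `fixed_part`, `rho` and `sigma` are hypothetical names and values.

```python
import random


def simulate_ar1(subjects, fixed_part, rho=0.6, sigma=1.0, seed=0):
    """Simulate one response per row, rows ordered in time within subject.

    subjects   : subject id for each row
    fixed_part : function of the row index returning the model mean
    rho        : coefficient on the previous simulated entry (assumed value)
    sigma      : standard deviation of the random noise (assumed value)
    """
    rng = random.Random(seed)
    sim = []
    for i, subject in enumerate(subjects):
        value = fixed_part(i) + rng.gauss(0.0, sigma)
        new_subject = i == 0 or subject != subjects[i - 1]
        if not new_subject:
            # Same subject as the previous row: add rho times the previous entry.
            value += rho * sim[-1]
        sim.append(value)
    return sim


subjects = [1, 1, 1, 2, 2, 2]
print(simulate_ar1(subjects, fixed_part=lambda i: 200.0))
```

An order-2 version would add a second coefficient times the entry two rows back, with an extra branch for the second row of each subject, where only one previous entry exists.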
If it's the second row, or we're two places into the new patient, it might look like an AR(1). If it's anything else, we go back to [unclear]. So as you can see, it's very easy to incorporate autocorrelation structures. As long as you know what your model looks like, it should be easy to implement it as a simulation formula. Okay. I'll let you look at that real quick.

Finally, our final scenario is a pass/fail response, which is also very common. I'm going to use this to illustrate how you can use the one-click Simulate to maybe change people's minds about how they run certain types of designs, and show you how powerful this can be (no pun intended). Let's say we have a detection system that we're creating to detect radioactive substances, and we're going to compare it to another system that's maybe already out there in the field. So we're going to compare these two detection systems. We've selected a certain amount of material in some test objects, ranging from a very low concentration at one to a very high concentration at five, and we're going to test our systems repeatedly on each concentration, a certain number of times, and see how many times each successfully alarms. I'm going to open both of these. Let's start with this one. This represents a typical design you might see: we have a balanced number of samples at each setting. In this case, we have a lot of samples; they're very fortunate at this place. Let's say we're going to do 32 balanced trials at each run, and this is a simulated response. And then here I've created my simulation formula, so I'll show you what that looks like. Again, random binomial. They're all the same, so I've kept the number here, but I could have referenced the trials column instead, to be consistent. But that's okay. Here's the model that maybe I'm interested in. Okay. And here
I have a scenario where, instead of a balanced number at each setting, I have put most of my samples here in the middle. My reasoning might be that, well, if it's a low concentration, I hardly expect it to catch it; I have reasonable expectations. And if it's a high concentration, well, it should almost always catch it. So where the difference is most important to me is there in the middle, maybe at concentrations three or four. And so that's where I'm going to load most of my samples, and then I'll put a few more here, but put the fewest at these other settings. Let's see how each of these test plans performs in terms of power. So I'll run the binomial model script here, which will run the binomial model. There's only one model effect here, the system. We don't put in concentration because we know that there's an effect there; the system is what we're interested in. GenReg, binomial. Run it. Okay. Again, red triangle menu. I've already got my simulation formula, so actually I don't need to do that. So you've already built up a pattern. Right-click, do the Simulate. Okay, everything looks good there. My next favorite random seed. Here we are: power analysis. Okay, now let's go over here and do the same thing. I'll fit the model, and again, when you have a binomial, you have to put in not only how many times it alarmed, but out of how many trials. Run, scroll down to the Effect Tests, and you might already get a hint of what's going to happen. Click OK. Here are my simulations. Let me get my power analysis scooted over here; minimize, minimize. So here's what you get under the balanced design. Notice that we have very low power, which seems odd, because we had 32 at each run. I mean, that's a lot of samples; I would have killed for that many samples where I previously worked. So you would expect a lot of power, but there doesn't seem to be, whereas here I had the same total number of samples. I just allocated them differently.
And my power level has gone up dramatically. Maybe if I stacked even more here, maybe if I did four and four and then added four to each of these, I could get even more power to detect this difference. So this shows that it's not always just about changing your sample size. You might not always need more samples; in this case you had a lot of samples to begin with. How you allocate them is also important. Okay.

So, I hope you're as excited as I was when I discovered this very awesome tool for calculating power. I'd like to leave you with some key takeaways. So again, we use simulation. Now, ideally, you know, we'd kind of like a formula, and in the simple cases we do get the advantage of a nice, simple formula. Even with the regression models, we kind of have formulas to help under the hood. But of course, in the real world, things are a little more complex, and so we typically have to rely on simulation, which can be a very powerful tool, as we've seen. Now, of course, one of the key things we have to do with simulation is balance accuracy with efficiency. I usually ran 100, mainly to save on time. But ultimately you might stick with the default of 2,500, knowing that it will take some time to run. So what I might advocate for is to start with 100 or 200 simulations at the beginning, just to give an idea of what's going on. And then, if you find a situation where it looks like it's worth more investigation, bump up the number of simulations so you can increase your accuracy. So maybe you start with a couple of different situations, run a few quick simulations, narrow down to some key settings, key scenarios, and then increase the number of simulations to get more accuracy. I always argue that a power calculation, just like design of experiments, is never one and done. You shouldn't just go to a calculator, plug in some numbers, and come back with a sample size.
There's a lot that can happen in the design, or that can happen in an experiment. And I think that the best way to plan an experiment is to try to account for different scenarios. So explore different levels of noise in your response. With mixed effects, maybe play around with different mixed-effect sizes. Of course you can explore different sample sizes, but also explore maybe different types of models. For example, in the interval-censoring case we used the Weibull model; what if we had used a lognormal model? Exploring these different scenarios and presenting them to the test planners gives you a way to plan your study to be robust to a variety of settings. So never just go calculate and come back. Always present the test planners with different scenarios. It's the same process I used when I designed actual experiments. I would present the test planners I worked with different options they could explore. It may be that they pick one option, or it might be a combination of options. You should always do that to make your plans more robust, as I say there. All right. Well, I hope you learned something new with this. If you have any questions, you can reach out to me; they'll probably be providing my email address. So I hope you enjoyed this talk, and I hope you enjoy the rest of the conference. Thank you.
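The allocation comparison from the pass/fail scenario can be sketched outside JMP as a small simulation. This is a hedged Python illustration, not the workflow from the demo: the detection probabilities, the "focused" allocation, and the function names are made up for the example, and a pooled two-proportion z-test stands in for the effect test JMP reports for a binomial model with system as the only effect.

```python
import math
import random


def p_value_two_proportions(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_alarms(rng, trials, probs):
    """Total alarms over all concentrations for one system."""
    return sum(rng.random() < p for p, m in zip(probs, trials) for _ in range(m))


def simulated_power(trials, p_ref, p_new, n_sims=500, alpha=0.05, seed=1):
    """Fraction of simulated studies in which the system effect is detected."""
    rng = random.Random(seed)
    n = sum(trials)
    hits = 0
    for _ in range(n_sims):
        x_ref = simulate_alarms(rng, trials, p_ref)
        x_new = simulate_alarms(rng, trials, p_new)
        if p_value_two_proportions(x_ref, n, x_new, n) < alpha:
            hits += 1
    return hits / n_sims


# Assumed detection probabilities at concentrations 1..5 (illustrative only).
p_ref = [0.05, 0.20, 0.50, 0.80, 0.95]
p_new = [0.05, 0.30, 0.70, 0.90, 0.95]
balanced = [32] * 5            # 32 trials at every concentration
focused = [8, 24, 64, 56, 8]   # same total, loaded toward the middle

power_balanced = simulated_power(balanced, p_ref, p_new)
power_focused = simulated_power(focused, p_ref, p_new)
print("balanced:", power_balanced, "focused:", power_focused)
```

With these assumed probabilities the two systems differ only at the middle concentrations, so moving trials there raises the simulated power without changing the total sample size. Starting with a few hundred simulations and increasing `n_sims` for the promising scenarios follows the accuracy-versus-efficiency advice given in the takeaways.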
Dave Sartori, Sr. Data Scientist, PPG   A sampling tree is a simple graphical depiction of the data in a prospective sampling plan or one for which data has already been collected. In variation studies such as gage, general measurement system evaluations, or components of variance studies, the sampling tree can be a great tool for facilitating strategic thinking on: What sources of process variance can or should be included? How many levels within each factor or source of variation should be included? How many measurements should be taken for each combination of factors and settings? Strategically considering these questions before collecting any data helps define the limitations of the study, what can be learned from it, and what the overall effort to execute it will be. What’s more, there is an intimate link between the structure of the sampling plan and the associated variance component model. By way of examples, this talk will illustrate how inspection of the sampling tree facilitates selecting the correct variance component model in JMP’s variability chart platform: Crossed, Nested, Nested then Crossed or Crossed then Nested. In addition, the application will be extended to the interpretation of variance structures in control charts and split-plot experiments.     Auto-generated transcript...   Speaker Transcript Dave Hi, everybody. Thanks for joining me here today. I'd like to share with you a topic that has been part of our Six Sigma Black Belt program since 1997. I think this is one of the tools that people really enjoy, and I think you'll enjoy it too and find it informative in terms of how it interfaces with some of the tools available in JMP. First, a quick slide or two as a message from our sponsor. I'm with PPG Industries outside of Pittsburgh, Pennsylvania, in our Monroeville Business and Technical Center. I've been a data scientist there on and off for over 30 years, having moved in and out of technical management.
And now I'm back to what I truly enjoy, which is working with data, and JMP in particular. PPG has been around for a while; it was founded in 1883. Last year we ranked 180th on the Fortune 500. We make mostly paints. Although people think that PPG stands for Pittsburgh Plate Glass, that has not been the case since about 1968. So it's PPG now, and it's primarily a coatings company: performance coatings and industrial coatings for cars, airplanes, and, of course, houses. You may have bought PPG paint or one of PPG's brands to use on your home. But coatings are also used inside of packaging: if you don't have a coating inside of a beer can, the beer gets skunky quite quickly. My particular business is specialty coatings and materials. In my segment we make OLED phosphors for Universal Display Corporation, which you find in the Samsung phone, and also the photochromic dyes that go into Transitions lenses, which turn dark when you head outside. So what I'm going to talk to you about today is this tool called a sampling tree. It's really just a simple graphical depiction of the data that you're either planning to collect or maybe have already collected. And so in variation studies like a Gage R&R, general measurement system evaluations, or components of variance studies (or CoV, as we sometimes call them), the sampling tree is a great tool for thinking strategically about a number of things. For example, what sources of variance can or should be included in the study? How many levels within each factor or source of variation can you include? And how many measurements should you take for each combination of factors and settings? So you're kind of getting to a sample size question here. Strategically considering these questions before you collect any data also helps you define the limitations of the study, what you can learn from it, and what the overall effort is going to be to execute.
So we put this in a classification of tools that we teach in our Six Sigma program, what we call critical thinking tools, because it helps you think up front. It's a nice sort of whiteboard exercise that you can work on paper or on the whiteboard to think prospectively about the data you might collect. It's also really useful for understanding the structure of factorial designs, especially when you have restrictions on randomization. I'll give you one conceptual example towards the end here, where you can draw on a sampling tree a line of restricted randomization. That tells you where the whole-plot factors are and where the split-plot factors are, so it can provide you, again up front, with a better understanding of the data that you're planning to collect. They're also useful, and I'll share another conceptual example, where we've combined a factorial design with a components of variation study. This is really cool because it accelerates the learning about the system under study. We're simultaneously manipulating factors that we think impact the level of the response, and at the same time understanding the components of variation that we think contribute to variation in the response. Once the data is acquired, the sampling tree can really help facilitate the analysis of the data. This is especially true when you're trying to select the variance component model within the variability chart that you have available in JMP. If you've ever used that tool (and I'll demonstrate it for you here with a couple of examples), if you're asking JMP to calculate the variance components for you, you have to make a decision as to what kind of model you want. Is it nested? Is it crossed? Maybe it's crossed then nested. Maybe it's nested then crossed.
So helping you figure out what the correct variance component model is, is really well facilitated by a good sampling tree. The other place that we've used them is where we are thinking about control charts. The control chart application really helps you see what's changing within subgroups and what's changing between subgroups, so it helps you think critically about what you're actually seeing in the control charts. As I mentioned, they're good for showing the lines of restriction in split-plot designs, but they're less useful for the analysis of designed experiments. So for DOE types of applications, they're more of an up-front tool. So let's jump into it here with an example. Here's what I would call a general components of variance study. This one is actually from the literature: it's from Box, Hunter, and Hunter, "Statistics for Experimenters," and you'll find it towards the back of the book, where they talk about a components of variance study, and it happens to be on a paint process. What they have in this particular study are 15 batches of pigment paste. They're sampling each batch twice, and then they're taking two moisture measurements on each of those samples. So the first sample in the first batch is physically different from the second sample, and the first sample out of the second batch is physically different from any of the other samples. And so one practice that we try to use and teach is that for nested factors, it's often helpful to list those in numerical order. That again emphasizes that you have physically different experimental units as you go from sample to sample throughout. So this is a nested sampling plan: the sample is nested under the batch. Let's see how that plays out in the variability chart within JMP. Okay, so here's the data, and we find the variability chart under Quality and Process, Variability.
And then we're going to list here, as the X variables, the batch and then the sample. One thing that's very important in a nested sampling plan is that the factors get loaded in here in the same order that you have them in the sampling tree. This is hierarchical, so otherwise the results will be a little bit confusing. We can decide here in this launch platform what kind of variance component model we want to specify. We said this is a nested sampling plan, so now we're ready to go. We leave the measurement out of the list of Xs because the measurement really just defines where the subgroups are. So we leave that out, and that's going to be what goes into the variance component that JMP refers to as within variation. Okay, so here's the variability chart. One of the nice things with the variability chart is that there's an option to add some graphical information. Here I've connected the cell means, and this really indicates visually what kind of variation you have between the samples within the batch. And then we have two measurements per sample, as indicated on our sampling tree, so the distance between the two points within the batch and the sample indicates the within-subgroup variation. You can see, right off the bat, that there looks to be a good bit of sample-to-sample variation. The other thing we might want to show here is the group means, which show us the batch-to-batch variation. So the purple line here is the average on a batch-to-batch basis. Okay. Now, what about the actual breakdown of the variation here? Well, that's nicely done in JMP here under Variance Components. Let me get that up there where we can see it, and then I'll collapse this. As we saw graphically, it looked like the sample-to-sample variation within a batch was a major contributor to the overall variation in the data. And in fact, the calculations confirm that.
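As a rough illustration of the kind of calculation behind a nested variance components table, here is a minimal Python sketch using the classical expected-mean-squares (ANOVA) estimators for a balanced batch, sample, measurement plan. JMP may use a different estimation method, the function name is hypothetical, and the tiny data set is made up; it is not the pigment paste data.

```python
from statistics import mean


def nested_variance_components(tree):
    """Variance components for a balanced nested plan.

    tree maps batch -> sample -> list of repeat measurements, mirroring
    the sampling tree from top to bottom.
    """
    batches = list(tree.values())
    a = len(batches)                           # number of batches
    b = len(batches[0])                        # samples per batch
    n = len(next(iter(batches[0].values())))   # measurements per sample

    cell_means = [[mean(ys) for ys in samples.values()] for samples in batches]
    batch_means = [mean(row) for row in cell_means]
    grand = mean(batch_means)

    ss_within = sum((y - mean(ys)) ** 2
                    for samples in batches
                    for ys in samples.values()
                    for y in ys)
    ss_sample = n * sum((c - bm) ** 2
                        for row, bm in zip(cell_means, batch_means)
                        for c in row)
    ss_batch = b * n * sum((bm - grand) ** 2 for bm in batch_means)

    ms_within = ss_within / (a * b * (n - 1))
    ms_sample = ss_sample / (a * (b - 1))
    ms_batch = ss_batch / (a - 1)

    # Negative estimates are truncated at zero.
    return {
        "batch": max(0.0, (ms_batch - ms_sample) / (b * n)),
        "sample[batch]": max(0.0, (ms_sample - ms_within) / n),
        "within": ms_within,
    }


# Made-up data: 2 batches, 2 samples per batch, 2 measurements per sample.
tree = {
    "B1": {"S1": [10, 12], "S2": [14, 16]},
    "B2": {"S3": [20, 22], "S4": [24, 26]},
}
components = nested_variance_components(tree)
total = sum(components.values())
for name, value in components.items():
    print(name, value, round(100 * value / total, 1), "% of total")
```

The samples are keyed with unique labels (S1 to S4) to reflect that each one is a physically different unit, and the result key `sample[batch]` echoes the "sample within batch" reading of a nested model.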
So we have about 78% of the total variation coming from the sample, about 20% of the variation coming batch to batch, and only about 2.5% of the variation coming from the measurement-to-measurement variation within the batch and sample. Notice here, too, in the variance components table, the notation that's used. It indicates that the sample is within the batch, so this is a nested study. And again, it's important that we load the factors into the variability chart in the order indicated here in the plot. It wouldn't make any sense to say that within sample one we have batch one and two. That just doesn't make any physical sense, and the tree reflects that. Now let's compare that with something a little bit different. I call this a traditional Gage R&R study. What you have in a traditional Gage R&R study is a number of parts, samples, or batches that are being tested, and a number of operators who are testing each one of those. And then each one tests the same sample or batch multiple times. In this particular example we're showing five parts, samples, or batches, with three operators measuring each one twice. Now in this case, operator one for batch number one is the same as operator number one for batch, sample, or part number five. So you can think of this as saying, well, the operator kind of crosses over between the parts, samples, batches, whatever the thing is that's getting measured. So this is referred to as a crossed study. And it's important that they measure the same article, because one of the things that comes into play in a crossed study that you don't have in a nested study is a potential interaction between the operators and what they're measuring. That's going to be reflected in the variance components analysis that we see from JMP. Now let's have a look here at this particular set of data.
So again, we go to the handy variability chart, which again is found under Quality and Process. In this case, I'll start by using the same order for the X variables as shown on the sampling tree. But, as I'll show you, one of the features of a crossed study is that we're no longer stuck with the hierarchical structure of the tree. We can flip these around. And so this is crossed. I'm going to be careful to change that here; remember that we had a nested study before. I'm going to go ahead and click OK, and I'm going to put our cell means and group means on there. The group means in this case are the samples (three of them), and we've got three operators. And now, if we ask for the variance components, notice that we don't have that sample-within-operator notation like we had in the nested study. What we have in this case is a sample-by-operator interaction. And it makes sense that that's a possibility in this case because, again, they're measuring the same sample. So Matt is measuring the same sample A as the QC lab is, and as Tim is. An interaction in this case really reflects how different this pattern is as you go from one sample to the other. You can see that it's generally the same. It looks like Matt and QC tend to measure things perhaps a little bit lower overall than Tim; part C is a little bit the exception. So the interaction variance contribution here is relatively small. There is some operator-to-operator variation, and the within variation really is the largest contributor. That's easy to see here because we've got some pretty wide bars. But again, this is a crossed study, so we should be able to change the order in which we load these factors and get the same results. That's my proposition here; let's test it. I'm just going to relaunch this analysis, and I'm going to switch these up. I'm going to put the operator first and the sample second.
Leave everything else the same. Let's go ahead and put our cell means and group means on there. And now let's ask for the variance components. So how do they compare? I'm going to collapse that part of the report. In the graphical part, and this is a cool thing to recognize with a crossed study, because we're not stuck with the hierarchy that we have in a nested study, we can change the perspective on how we look at the data. The perspective with the operator loaded in first gives us a direct operator-to-operator comparison here in terms of the group means. And again, that interaction is reflected in how this pattern changes between the operators as we go from part A, B, or C to A, B, or C. What about the numbers in terms of the variance components? Well, we see that the variance components table here reflects the order in which we loaded these factors into the dialog box, but the numbers come out very much the same. For the sample, on the left-hand side here, the standard deviation is 1.7. The standard deviation due to the operator is about 2.3, and it's the same value over here. The sample-by-operator, or operator-by-sample, interaction, if you like, is exactly the same. And the within is exactly the same. So with a crossed study, we have some flexibility in how we load those factors in, and then the interpretation is a little bit different. If these were different samples, we might expect this pattern, going from operator to operator, to be somewhat random, because they're measuring different things, so there's no reason to expect that the pattern would repeat. If you do see a significant interaction term in a traditional Gage R&R study, like we have here, well, then you've got a real issue to deal with, because that's telling you that the nature of the sample is causing the operators to measure differently.
So that's a bit harder of a problem to solve than if you just have a no-interaction situation there. OK. So again, for your reference, I have this listed out here. Now let's get to something a little bit more juicy. Here we have a sort of blended study, where we've got both crossed and nested factors. This was from the business that I work in. The purity of the materials that we make is really important, and a workhorse methodology for measuring purity is high-performance liquid chromatography, or HPLC for short. This was a product that was getting used in an FDA-approved application, so getting the purity right was really important. This is a slice from a larger study, but what I'm showing is the case where we had three samples; I'm labeling them here S1, S2, S3. We have two analysts in the study, and each analyst is going to measure the same sample in each case. So you can see that this is similar to what we had in the previous example, what I call a traditional Gage R&R, where each operator, or analyst in this case, is measuring exactly the same part or sample. So that part is crossed. When you get down under the analyst, each analyst then takes the material and preps it two different times, and then they measure each prep twice: they do two injections into the HPLC with each preparation. Preparation one is different from preparation two, and that's physically different from the first preparation for the next analyst over here. And so again, we try to remember to label these nested factors sequentially to indicate that they're physically different units. It doesn't really make any difference from JMP's point of view. It'll handle it fine if you were to go 1-2, 1-2, 1-2, and so on down the line, as long as you tell it the proper variance component model to start with. So this would be crossed and then nested. Let's see how that works out in JMP.
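Before the JMP walkthrough, the crossed-then-nested layout just described can be written out as a table of runs. The following Python sketch is only an illustration: the analyst labels and the function name are hypothetical, and the sequential prep numbering follows the labeling practice mentioned above rather than anything JMP requires.

```python
from itertools import product


def crossed_then_nested(samples, analysts, preps_per_cell=2, injections=2):
    """Build (sample, analyst, prep, injection) rows for the sampling plan."""
    rows = []
    prep_id = 0
    # Crossed part: every analyst measures every sample.
    for sample, analyst in product(samples, analysts):
        # Nested part: each prep belongs to exactly one sample/analyst pair,
        # so it gets its own sequential label.
        for _ in range(preps_per_cell):
            prep_id += 1
            for injection in range(1, injections + 1):
                rows.append((sample, analyst, prep_id, injection))
    return rows


plan = crossed_then_nested(["S1", "S2", "S3"], ["Analyst 1", "Analyst 2"])
for row in plan[:8]:
    print(row)
print(len(plan), "injections in total")
```

Reading the columns from left to right gives the sampling tree from top to bottom: sample and analyst crossed, prep nested within that pair, and the repeat injections forming the within-subgroup level.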
So here's our data: sample, analyst, prep number, and then just an injection number, which is really the within subgroup. Once again we go to Analyze, Quality and Process, and then to the variability chart. Here we're going to put in the factors in the same order as they were shown on the sampling tree, and then we're going to put the percent area in there as the response. We said this was crossed and then nested, and we have a couple of other things to choose from here. In this case, again, the sampling tree is really helpful for convincing us that this is the case and for selecting the right model. This is crossed, then nested. Let's click OK. I'm going to put the cell means and group means on there. Again, we have a second factor involved above the within, so let's pick both of them. And let's again ask for the variance components. I'm going to collapse this part, and maybe collapse the standard deviation chart, just to bring things a little bit further up onto the screen. What we can see in the graph is a good bit of sample-to-sample variation. The within variation doesn't look too bad, but we do maybe see a little bit of variation within the preparation. The sample in this case is by far the biggest component of variation, which is really what we were hoping for. The analyst is really below that within-subgroup variation. And so this lays it out for us very nicely. In terms of what the variance components table is showing here, it's sample, analyst, and then, because these two are crossed, we've got a potential interaction to consider. It doesn't seem to be contributing a whole lot to the overall variation. And again, that's how the pattern changes as we go from analyst to analyst and sample to sample.
Now, the claim I made before with the fully crossed study was that we could swap the crossed factors in terms of their order and it would be okay. So let's try that in this case. I'm just going to relaunch this, and I think I can swap the crossed factors here, but again, I have to be careful to leave the nested factor where it is in the tree. Notice over here in the variance components table, the way we would read this is that we have the prep nested within the sample and the analyst. So that means it has to go below those on the tree. Let's go ahead and connect some things up here. I'm going to take the standard deviation chart off and ask for the variance components. Okay, so just like we saw in the traditional Gage R&R example, we've got the analyst and the sample switching. But if we look at the standard deviation over here in the last column, their values are identical. We have, again, the identical value for the interaction term and the identical value for the prep term, which again is nested within the sample and the analyst. So again, here's where the sampling tree helps us really fully understand the structure of the data, and it nicely complements what we see in the variance components output of JMP. So those are a couple of examples that are geared towards components of variation studies. One thing you might notice too, and I forgot to point this out earlier, is to look at the sampling tree here. Let me bring this back up. It's interesting: if you look at the horizontal axis in the variability chart, it's actually the sampling tree upside down. So that's another way to confirm that you're looking at the right structure when you are trying to decide what variance component model to apply. So again, here are the screenshots for that.
Here's an example where the sampling tree can help you in terms of understanding sources of variation in a in a control chart of all things. So in this particular case, over a number of hours, a sample is being pulled off the line. These are actually lens samples. I mentioned that we we make photochromatic dyes to go into the transitions lenses and they will periodically check the film thickness on the lenses and that's a destructive test. And so when they take that lens and measure the film thickness, well, they're they're done with that with that sample. And so what we would see if we were to construct an x bar and R chart for this is you're going to see on the x bar chart as an average, the hour to hour average. And then within subgroup variation is going to be made up of what's going on here sample to sample and the thickness, the thickness measurement. Now in this case, notice that there's vertical lines in the sampling tree, so that the tree doesn't branch in this case. So when you see vertical lines when you're drawing a vertical lines on to the sampling tree, that's an indication that the variability between those two levels of the tree are confounded. So, I can't really separate the inherent measurement variation in the film thickness from the inherent variation of the sample to sample variation. So I'm kind of stuck with those in terms of how this measurement system works. So let's let's whip up a control chart with this data. And for that are, again, we're going to go to quality and process. And I'm going to jump into the control chart builder. So again, our measurement variable here is the film thickness. And we're doing that on an hour to hour basis. So when we get it set up by by doing that, we see that JMP smartly sees that the subgroup size is 3, just as indicated on our, on our sampling tree. 
But what's interesting in this example is that you might at first glance, be tempted to be concerned because we have so many points out of control on the x bar chart. But let's think about that for a minute in terms of what the sampling tree is telling us. So the sampling tree again is telling us that's what's changing within the subgroup, what's contributing to the average range, is the film thickness to film thickness measurement, along with the sample to sample variation. And remember how the control limits are constructed on an x bar chart. They are constructed from the average range. So we take the overall average. And then we add that range plus or minus a factor related to the subgroup sample size so that the width of these control limits is driven by the magnitude of the average range. And so really what this chart is comparing is, let's consider this measurement variation down here at the bottom of the tree. So it's comparing measurement variation to the hour to hour variation that we're getting from the, from the line. So that's actually a good thing because it's telling us that we can see variation that rises above the noise that that we see in the in the subgroup. So in this case, that's, that's actually desirable. And so, that's again, a sampling tree is really helpful for reminding us what's what's going on in the Xbar chart in terms of the within subgroup and between subgroup variation. Now, just a couple of conceptual examples in the world of designed experiments. So split plot experiment is an experiment in which you have a restriction on the run order of the of the experiment. And what that does is it ends it ends up giving a couple of different error structures, and JMP does a great job now of designing experiments for for that situation where we have restrictions on randomization and also analyzing those. 
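The X-bar limit construction described above (overall average plus or minus a subgroup-size factor times the average range) can be sketched as follows. This is a minimal illustration, not what Control Chart Builder does internally; the thickness readings are made-up values, and `A2` holds the standard Shewhart table constants for small subgroup sizes.

```python
# Standard Shewhart A2 constants for X-bar charts, keyed by subgroup size n.
A2 = {2: 1.880, 3: 1.023, 4: 0.729, 5: 0.577}

def xbar_limits(subgroups):
    """Return (LCL, center line, UCL) for an X-bar chart from equal-size subgroups."""
    n = len(subgroups[0])
    if any(len(g) != n for g in subgroups):
        raise ValueError("all subgroups must have the same size")
    means = [sum(g) / n for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    grand_mean = sum(means) / len(means)
    avg_range = sum(ranges) / len(ranges)
    # The width of the limits is driven entirely by the within-subgroup range.
    half_width = A2[n] * avg_range
    return grand_mean - half_width, grand_mean, grand_mean + half_width

# Hypothetical film-thickness readings: one subgroup of 3 lenses per hour.
hourly = [
    [10.1, 10.3, 10.2],
    [11.0, 11.1, 10.9],
    [9.5, 9.6, 9.4],
    [10.6, 10.4, 10.5],
]
lcl, center, ucl = xbar_limits(hourly)
# Hours whose average falls outside the limits (0-based index).
out_of_control = [i for i, g in enumerate(hourly) if not lcl <= sum(g) / len(g) <= ucl]
```

Because the within-subgroup ranges here are small relative to the hour-to-hour shifts, several hourly averages fall outside the limits, which is the situation the talk describes: the chart is comparing hour-to-hour variation against the (confounded) sample and measurement noise.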
So, nevertheless, it's sometimes helpful to understand where those error structures might be splitting, and in a split plot design, you get into what are called whole plot factors and subplot factors. And the reason you have a restriction on randomization is typically because one or more of the factors is hard to vary. So in this particular scenario, we have a controlled environmental room where we can spray paints at different temperatures and humidities. But the issue there is you just can't randomly change the humidity in the room because it just takes too long to stabilize and it makes the experiment rather impractical. So what's shown in this sampling tree is you really have three factors here: humidity, resin and solvent. These are shown in blue. And so we only change humidity once because it's a difficult-to-change variable. So the way you set up a split plot experiment in JMP is you can specify how hard the factors are to change. So in this case, humidity is a very hard to change factor. And so, JMP will take that into account when it designs the experiment and when you go to analyze it. But what this shows us is that the humidity would be considered a whole plot factor because it's above the line of restriction, and then the resin and the solvent are subplot factors; they're below the line of restriction. So there's a different error structure above the line of restriction for whole plot factors than there is for subplot factors. In this case we have a whole bunch of other factors that are shown here, which really affect how a formulation, which is made up of a resin and a solvent, gets put into a coating. So this is actually a 2 to the 3 experiment with a restriction on randomization. It's got eight different formulations in there. 
Each one is applied to a panel and then that panel is measured once so that what we see in terms of the measurement to measurement variation is confounded with the coating in the in the panel variation. As, as I said before, when we have vertical lines on the on the sampling tree, then we have then we have some confounding at those levels. So that's, that's an example where we're using it to show us where the, where the splitting is in the split plot design. This particular example again it's conceptual, but it actually comes from the days when PPG was making fiberglass; we're no longer in the fiberglass business. But in this case, what what was being sought was a an optimization, or at least understanding the impact of four controllable variables on what was called loss ???, so they basically took coat fiber mats and then measure their the amount of coating that's lost when they basically burn up the sample. So what we have here is at the top of the tree is actually a 2 to the 4 design. So there's 16 combinations of the four factors in this case and for each run in the design, the mat was split into 12 different lanes as they're referred to as here. So you're going to cross the mat from 1 to 12 lanes and then we're taking out three sections which within each one of those lanes and then we're doing a destructive measurement on each one of those. So this actually combines a factorial design experiment. with a components of variations study. And so again, we've got vertical lines here at the bottom of the tree indicating that the measurement to measurement variation is confounded with the section to section variation. And so what we ended up doing here in terms of the analysis was, we treated the data from each DOE run as sort of the sample to sample variation like we had in the moisture example from Box Hunter and Hunter, to have instead of batches, here you have DOE run 1, 2, 3 and so on through 16 and then we're sub sampling below that. 
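The structure of the fiberglass study described above (a 2 to the 4 design with 12 lanes per run and 3 sections per lane) can be laid out as a table of planned measurements. This is only a sketch of the sampling tree; the factor names `A` through `D` are placeholders, since the talk does not name the four controllable variables.

```python
from itertools import product

# Placeholder names for the four controllable variables (not given in the talk).
factors = ["A", "B", "C", "D"]

# Top of the tree: all 2^4 = 16 combinations of low (-1) and high (+1) settings.
runs = list(product([-1, 1], repeat=len(factors)))

# Below each DOE run: 12 lanes across the mat, 3 sections per lane,
# and one destructive measurement per section.
rows = []
for run_id, levels in enumerate(runs, start=1):
    for lane in range(1, 13):
        for section in range(1, 4):
            row = {"run": run_id, "lane": lane, "section": section}
            row.update(zip(factors, levels))
            rows.append(row)

print(len(runs), "runs,", len(rows), "measurements")
```

Each row carries exactly one measurement, which is why the section-to-section and measurement-to-measurement variation cannot be separated: there is no branching below the section level.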
And so we treated this part as a components of variation study and then we basically averaged up all the data to look and see what would be the best settings for the four controllable factors involved here. So this was really a good study because it got to a lot of questions that we had about this process in a very efficient manner. So again, combining a COV with a DOE, a design of experiments with a components of variation study. So in summary, I hope you've got an appreciation for sampling trees. They're pretty simple. They're easy to understand. They're easy to construct, and yet they're great for helping us talk through what we're thinking about in terms of sampling a process or understanding a measurement system. And they also help us decide what's the best variance components model when we look to get the variance components from JMP's variability chart platform. We get a lot of use out of that particular tool, and I like to say that tool in and of itself is worth the price of admission to JMP. So I've shown you some examples here where it's nested, where it's crossed, crossed then nested, and then also where we've applied this kind of thinking to control charts to help us understand what's varying within subgroups versus what's varying between subgroups. And then also, perhaps less usefully, we can use those with designed experiments as well. So thanks for sharing a few minutes with me here, and my email's on the cover slide, so if you have any questions, I'd be happy to converse with you on that. So thank you.  
Monday, October 12, 2020
Russ Wolfinger, Director of Scientific Discovery and Genomics, Distinguished Research Fellow, JMP Mia Stephens, Principal Product Manager, JMP   The XGBoost add-in for JMP Pro provides a point-and-click interface to the popular XGBoost open-source library for predictive modeling with extreme gradient boosted trees. Value-added functionality includes: •    Repeated k-fold cross validation with out-of-fold predictions, plus a separate routine to create optimized k-fold validation columns •    Ability to fit multiple Y responses in one run •    Automated parameter search via JMP Design of Experiments (DOE) Fast Flexible Filling Design •    Interactive graphical and statistical outputs •    Model comparison interface •    Profiling Export of JMP Scripting Language (JSL) and Python code for reproducibility   Click the link above to download a zip file containing the journal and supplementary material shown in the tutorial.  Note that the video shows XGBoost in the Predictive Modeling menu but when you install the add-in it will be under the Add-Ins menu.   You may customize your menu however you wish using View > Customize > Menus and Toolbars.   The add-in is available here: XGBoost Add-In for JMP Pro      Auto-generated transcript...   Speaker Transcript Russ Wolfinger Okay. Well, hello everyone.   Welcome to my home here in Holly Springs, North Carolina.   With the Covid craziness going on, this is kind of a new experience to do a virtual conference online, but I'm really excited to talk with you today and offer a tutorial on a brand new add in that we have for JMP Pro   that implements the popular XGBoost functionality.   So today for this tutorial, I'm going to walk you through kind of what we've got we've got. What I've got here is a JMP journal   that will be available in the conference materials. 
And what I would encourage you to do, if you'd like to follow along yourself,   you could pause the video right now and go to the conference materials, grab this journal. You can open it in your own version of JMP Pro,   and as well as there's a link to install. You have to install an add in, if you go ahead and install that that you'll be able to reproduce everything I do here exactly at home, and even do some of your own playing around. So I'd encourage you to do that if you can.   I do have my dog Charlie, he's in the background there. I hope he doesn't do anything embarrassing. He doesn't seem too excited right now, but he loves XGBoost as much as I do so, so let's get into it.   XGBoost is a it's it's pretty incredible open source C++ library that's been around now for quite a few years.   And the original theory was actually done by a couple of famous statisticians in the '90s, but then the University of Washington team picked up the ideas and implemented it.   And it...I think where it really kind of came into its own was in the context of some Kaggle competitions.   Where it started...once folks started using it and it was available, it literally started winning just about every tabular competition that Kaggle has been running over the last several years.   And there's actually now several hundred examples online if you want to do some searching around, you'll find them.   So I would view this as arguably the most popular and perhaps the most powerful tabular data predictive modeling methodology in the world right now.   Of course there's competitors and for any one particular data set, you may see some differences, but kind of overall, it's very impressive.   In fact, there are competitive packages out there now that do very similar kinds of things LightGBM from Microsoft and Catboost from Yandex. We won't go into them today, but pretty similar.   Let's, uh, since we don't have a lot of time today, I don't want to belabor the motivations. 
But again, you've got this journal if you want to look into them more carefully.   What I want to do is kind of give you the highlights of this journal and particularly give you some live demonstrations so you've got an idea of what's here.   And then you'll be free to explore and try these things on your own, as time goes along. You will need a functioning copy of JMP Pro 15   at the earliest, but if you can get your hands on JMP 16 early adopter, the official JMP 16 won't ship until next year, 2021,   but you can obtain your early adopter version now and we are making enhancements there. So I would encourage you to get the latest JMP 16 Pro early adopter   in order to obtain the most recent functionality of this add-in. Now, this is kind of an unusual new setup for JMP Pro.   We have written a lot of C++ code right within Pro in order to integrate with the XGBoost C++ API. And that's why...we do most of our work in C++,   but there is an add-in that accompanies this that installs the dynamic library and does a little menu update for you, so you need both Pro and you need to install the add-in, in order to run it.   So do grab JMP 16 Pro early adopter if you can, in order to get the latest things, and that's what I'll be showing today.   Let's dive right in and do an example. And this is a brand new one that just came to my attention. It's got a very interesting story behind it.   The researcher behind these data is a professor named Jaivime Evaristo. He is an environmental scientist, a PhD, an assistant professor   now at Utrecht University in the Netherlands, and he's kindly provided this data with his permission, as well as the story behind it, in order to help others. There was a bit of a drama around these data. Jaivime and his colleague Jeffrey McDonnell   collected all this data. These are data. 
The purpose is to study water streamflow runoff in deforested areas around the world.   And you can see, we've got 163 places here, at least half in the US and then around the world, different places where they were able to collect   regions that have been cleared of trees and then they took some critical measurements   in terms of what happened with the water runoff. And this kind of study is quite important for studying the environmental impacts of tree clearing and deforestation, as well as climate change, so it's quite timely.   And happily for Jaivime at the time, they were able to publish a paper in the journal Nature, one of the top science journals in the world, a really nice   experience for him to get it in there. Unfortunately, what happened next, though, was that there was a competing lab that became very critical of what they had done in this paper.   And it turned out after a lot of back and forth and debate, the paper ended up being retracted,   which was obviously a really difficult experience for Jaivime. He's been very gracious in sharing his story in hopes   of helping others avoid this. And it turns out that what's at the root of the controversy, there were several other things, but what the main beef was from the critics is...   Jaivime had done a boosted tree analysis. And it's a pretty straightforward model. We've only got maybe a handful of predictors, each of which are important, and one of their main objectives was to determine which ones were the most important.   He ran a boosted tree model with a single holdout validation set and published a validation holdout R square of around .7. Everything looked okay,   but then the critics came along, they reanalyzed the data with a different holdout set and they got a validation holdout R square   of less than .1. So quite a huge change. 
They're going from .7 to less than .1 and and this, this was used the critics really jumped all over this and tried to really discredit what was going on.   Now, Jaivime, at this point, Jaivime, this happened last year and the paper was retracted earlier here in 2020...   Jaivime shared the data with me this summer and my thinking was to do a little more rigorous cross validation analysis and actually do repeated K fold,   instead of just a single hold out, in order to try to get to the bottom of this this discrepancy between two different holdouts. And what I did, we've got a new routine   that comes with the XGBoost add in that creates K fold columns. And if you'll see the data set here, I've created these. For sake of time, we won't go into how to do that. But there is   there is a new module now that comes with the heading called make K fold columns that will let you do it. And I did it in a stratified way. And interestingly, it calls JMP DOE under the hood.   And the benefit of doing it that way is you can actually create orthogonal folds, which is not not very common. Here, let me do a quick distribution.   That this was the, the holdout set that Jaivime did originally and he did stratify, which is a good idea, I think, as the response is a bit skewed. And then this was the holdout set that the critic used,   and then here are the folds that I ended up using. I did three different schemes. And then the point I wanted to make here is that these folds are nicely kind of orthogonal,   where we're getting kind of maximal information gain by doing K fold three separate times with kind of with three orthogonal sets.   So, and then it turns out, because he was already using boosted trees, so the next thing to try is the XGBoost add in. And so I was really happy to find out about this data set and talk about it here.   Now what happened...let me do another analysis here where I'm doing a one way on the on the validation sets. 
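To make the repeated, stratified K-fold idea above concrete, here is a minimal sketch of building several fold columns for a skewed continuous response. It is not the add-in's routine (which, per the talk, calls JMP DOE to get near-orthogonal folds); it simply sorts the rows by the response and deals fold ids out within consecutive blocks, so every fold spans the full range of the response. The function names and the simulated response are assumptions for illustration.

```python
import random

def stratified_kfold_column(y, k, seed):
    """Assign each row a fold id in 1..k so that every fold spans the range of y."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    folds = [0] * len(y)
    # Walk the sorted rows in blocks of k; within each block,
    # deal out the k fold ids in a random order.
    for start in range(0, len(order), k):
        block = order[start:start + k]
        ids = list(range(1, k + 1))
        rng.shuffle(ids)
        for row, fold in zip(block, ids):
            folds[row] = fold
    return folds

def repeated_kfold_columns(y, k=5, repeats=3, seed=0):
    """Build `repeats` independent stratified fold columns (one seed per repeat)."""
    return [stratified_kfold_column(y, k, seed + r) for r in range(repeats)]

# Hypothetical skewed response for 163 sites (simulated, not the real data).
sim = random.Random(42)
y = [sim.expovariate(1.0) for _ in range(163)]
fold_columns = repeated_kfold_columns(y, k=5, repeats=3, seed=1)
```

With this scheme the few largest responses are guaranteed to land in different folds, so no single fit has all of the extreme sites in training or all of them in validation, which is the imbalance the talk identifies in the two single holdout sets.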
What I'm doing here is, the response is this water yield corrected,   and I'm plotting that versus the validation sets. It turned out that for Jaivime,   the top four or five measurements all ended up in his training set, which I think is kind of at the root of the problem.   Whereas in the critics' set, it was balanced a little bit more, and in particular the highest scoring location was in the validation set. And so this is a natural source for error because it's going beyond anything that was seen in the training.   And I think this is really a case where a K fold type analysis is more compelling than just doing a single holdout set.   I would argue that both of these single holdout sets have some bias to them and it's better to do more folds in which you distribute things   differently each time and then see what happens after multiple fits. So you can see how the folds that I created here look in terms of distribution, and now let's run XGBoost.   So the add-in actually has a lot of features and I don't want to overwhelm you today, but again, I would encourage you to pause the video at places if you are trying to follow along yourself   to make sure. But what we did here, I just ran a script. And by the way, everything in the journal has...JMP tables are nice, where you can save scripts. And so what I did here was run XGBoost from that script.   Let me just for illustration rerun this again right from the menu. This will be the way that you might want to do it. So when you install the add-in,   you hit predictive modeling and then XGBoost. So we added it right here to the predictive modeling menu. And so the way you would set this up   is to specify the response. Here's Y. There are seven predictors, which we'll put in here as X's, and then you put the fold columns in validation.   
I wanted to make a point here about those of you who are experienced JMP Pro users, XGBoost handles validation columns a bit differently than other JMP platforms.   It's kind of an experimental framework at this point, but based on my experience, I find repeated K fold to be very a very compelling way to do and I wanted to set up the add in to make it easy.   And so here I'm putting in these fold columns again that we created with the utility, and XGBoost will automatically do repeated K fold just by specifying it like we have here.   If you wanted to do a single holdout like the original analyses, you can set that up just like normal, but you have to make the column   continuous. That's a gotcha. And I know some of our early adopters got tripped up by this and it's a different convention than other   Other XGBoost or other predictive modeling routines within JMP Pro, but this to me seemed to be the cleanest way to do it. And again, the recommended way would be to run   repeated K fold like this, or at least a single K fold and then you can just hit okay. You'll get this initial launch window.   And the thing about XGBoost, is it does have a lot of tuning parameters. The key ones are listed here in this box and you can play with these.   And then it turns out there are a whole lot more, and they're hidden under this advanced options, which we don't have time at all for today.   But we have tried to...these are the most important ones that you'll typically...for most cases you can just worry about them. And so what...let's let's run the...let's go ahead and run this again, just from here you can click the Go button and then XGBoost will run.   Now I'm just running on a simple laptop here. This is a relatively small data set. And so right....we just did repeated, repeated fivefold   three different things, just in a matter of seconds. XGBoost is pretty well tuned and will work well for larger data sets, but for this small example, let's see what happened.   
Now it turns out, this initial graph that comes out raises an immediate flag.   What we're looking at here is, over the number of fitting iterations, we've got a training curve, which is basically the loss function that you want to go down.   But then the solid curve is the validation curve. And you can see what happened here. Just after a few iterations occurred, this curve bottomed out and then things got much worse.   So this is actually a case where you would not want to use this default model. XGBoost is already overfitted,   which often will happen for smaller data sets like this, and it does require tuning.   There's a lot of other results at the bottom, but again, I wouldn't consider them trustworthy. At this point, you would want to do a little tuning.   For today, let's just do a little bit of manual tuning, but I would encourage you to go further. We've actually implemented an entire framework for creating a tuning design,   where you can specify a range of parameters and search over the design space, and we again actually use JMP DOE.   So we've got two different ways we're using DOE already here, both of which have really enhanced the functionality. For now, let's just do a little bit of manual tuning based on this graph.   You can see if we zoom in on this graph that the curve is bottoming out. It looks like literally just after three or four iterations, so one real easy thing we can do is literally just stop training after four steps.   And see what happens. By the way, notice what happened for our overfitting: our validation R square was actually negative, so quite bad. Definitely not a recommended model. But if we run this new model where we're only going to take four steps,   look at what happens.   Much better validation R square. We're now up around .16, and in fact let's try three just for fun. 
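Two ideas from the passage above lend themselves to a small sketch: picking the iteration where the validation curve bottoms out, and seeing how a validation R square can come out negative. The loss values and predictions below are invented for illustration; they are not the values from the demo, and the helper names are assumptions.

```python
def best_iteration(validation_loss):
    """Return the 1-based iteration with the lowest validation loss (earliest on ties)."""
    best = min(range(len(validation_loss)), key=lambda i: validation_loss[i])
    return best + 1

def r_squared(actual, predicted):
    """1 - SS_residual / SS_total; negative when predictions are worse than the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical validation loss per boosting iteration: it bottoms out early,
# then climbs as the model starts to overfit.
valid_loss = [1.00, 0.90, 0.86, 0.85, 0.88, 0.95, 1.05, 1.20]
stop_at = best_iteration(valid_loss)

# Hypothetical overfitted predictions that do worse than predicting the mean.
actual = [1.0, 2.0, 3.0]
overfit_pred = [3.0, 2.0, 1.0]
bad_r2 = r_squared(actual, overfit_pred)
```

Here `stop_at` plays the role of the "stop training after four steps" setting chosen by eye in the demo, and `bad_r2` shows why a negative validation R square is a clear sign the model should not be used as fitted.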
See what happens.   Little bit worse. So you can see this is the kind of thing where you can play around. We've tried to set up this dialogue where it's amenable to that.   And you can you can do some model comparisons on this table here at the beginning helps you. You can sort by different columns and find the best model and then down below, you can drill down on various modeling details.   Let's stick with Model 2 here, and what we can do is...   Let's only keep that one and you can clean up...you can clean up the models that you don't want, it'll remove the hidden ones.   And so now we're down, just back down to the model that that we want to look at in more depth. Notice here our validation R square is .17 or so.   So, which is, remember, this is actually falling out in between what Jaivime got originally and what the critic got.   And I would view this as a much more reliable measure of R square because again it's computed over all, we actually ran 15 different modeling fits,   fivefold three different times. So this is an average over those. So I think it's a much much cleaner and more reliable measure for how the model is performing.   If you scroll down for any model that gets fit, there's quite a bit of output to go through. Again...again, JMP is very good about...we always try to have graphics near statistics that you can   both see what's going on and attach numbers to them and these graphs are live as normal, nicely interactive.   But you can see here, we've got a training actual versus predicted and validation. And things almost always do get worse for validation, but that's really what the reality is.   And you can see again kind of where the errors are being made, and this is that this is that really high point, it often gets...this is the 45 degree line.   So that that high measurement tend...and all the high ones tend to be under predicted, which is pretty normal. 
I think for for any kind of method like this, it's going to tend to want to shrink   extreme values down and just to be conservative. And so you can see exactly where the errors are being made and to what degree.   Now for Jaivime's key interest, they were...he was mostly interested in seeing which variables were really driving   this water corrected effect. And we can see the one that leaps out kind of as number one is this one called PET.   There are different ways of assessing variable importance in XGBoost. You can look at straight number of splits, as gain measure, which I think is   maybe the best one to start with. It's kind of how much the model improves with each, each time you split on a certain variable. There's another one called cover.   In this case, for any one of the three, this PET is emerging as kind of the most important. And so basically this quick analysis that that seems to be where the action is for these data.   Now with JMP, there's actually more we can do. And you can see here under the modeling red triangle we've we've embellished quite a few new things. You can save predictive values and formulas,   you can publish to model depot or formula depot and do more things there.   We've even got routines to generate Python code, which is not just for scoring, but it's actually to do all the training and fitting, which is kind of a new thing, but will help those of you that want to transition from from JMP Pro over to Python. For here though, let's take a look at the profiler.   And I have to have to offer a quick word of apology to my friend Brad Jones in an earlier video, I had forgotten to acknowledge that he was the inventor of the profiler.   So this is, this is actually a case and kind of credit to him, where we're we're using it now in another way, which is to assess variable importance and how that each variable works. So to me it's a really compelling   framework where we can...we can look at this. 
And Charlie...Charlie got right up off the couch when I mentioned that. He's right here by me now.   And so, look at what happens...we can see the interesting thing is with this PET variable, we can see the key moment, it seems to be as soon as PET   gets right above 1200 or so is when things really take off. And so it's a it's a really nice illustration of how the profiler works.   And as far as I know, this is the first time...this is the first...this is the only software that offers   plots like this, which kind of go beyond just these statistical measures of importance and show you exactly what's going on and help you interpret   the boosted tree model. So really a nice, I think, kind of a nice way to do the analysis and I'd encourage that...and I'd encourage you try this out with your own data.   Let's move on now to a other example and back to our journal.   There's, as you can tell, there's a lot here. We don't have time naturally to go through everything.   But we've we've just for sake of time, though, I wanted to kind of show you what happens when we have a binary target. What we just looked at was continuous.   For that will use the old the diabetes data set, which has been around quite a while and it's available in the JMP sample library. And what this this data set is the same data but we've embellished it with some new scripts.   And so if you get the journal and download it, you'll, you'll get this kind of enhanced version that has   quite a few XGBoost runs with different with both a binary ordinal target and, as you remember, what this here we've got low and high measurements which are derived from this original Y variable,   looking at looking at a response for diabetes. And we're going to go a little bit even further here. Let's imagine we're in a kind of a medical context where we actually want to use a profit matrix. And our goal is to make a decision. 
We're going to predict each person,   whether they're high or low, but then in thinking about it, we realized that if a person is actually high, the stakes are a little bit higher.   And so we're going to kind of double our profit or loss, depending on whether the actual state is high. And of course,   the way this works is typically the correct decisions are here and here.   And then incorrect ones are here, and you want to think about all four cells when you're setting up a matrix like this.   Here we're going to do a simple one. It's doubling, and I don't know if you can attach real monetary values to these or not. That's actually a good thing if you're in that kind of scenario.   Who knows, maybe we can consider these each to be a Bitcoin, to be maybe around $10,000 each or something like that.   Doesn't matter too much. It's more a matter of, we want to make an optimal decision based on   our predictions. So we're going to take this profit matrix into account when we do our analysis. Now, it's actually only used after the model fitting. It's not directly used in the fitting itself.   So we're going to run XGBoost now here, and we have a binary target. If you'll notice,   the objective function has now changed to a logistic log loss, and that's what's reflected here; this is the logistic log likelihood.   And you can see, with a binary target the metrics that we use to assess it are a little bit different.   Although if you notice, we do have profit columns which are computed from the profit matrix that we just looked at.   
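The profit matrix idea above, applied after fitting to turn predicted probabilities into decisions, can be sketched as follows. The matrix values (plus or minus 1 when the actual state is low, plus or minus 2 when it is high) are an assumption based on the "doubling" description, and the labels and probabilities are invented; none of this reproduces the numbers in the journal.

```python
# Hypothetical profit matrix, indexed as PROFIT[actual][decision].
# Correct decisions earn profit, incorrect ones lose it; stakes double when actual is high.
PROFIT = {
    "low": {"low": 1, "high": -1},
    "high": {"low": -2, "high": 2},
}

def total_profit(actual, prob_high, threshold, profit=PROFIT):
    """Sum the profit of calling 'high' whenever P(high) >= threshold."""
    total = 0
    for a, p in zip(actual, prob_high):
        decision = "high" if p >= threshold else "low"
        total += profit[a][decision]
    return total

def best_threshold(actual, prob_high, candidates, profit=PROFIT):
    """Return the candidate threshold with the largest total profit (first on ties)."""
    return max(candidates, key=lambda t: total_profit(actual, prob_high, t, profit))

# Hypothetical actual states and model probabilities of 'high' for six people.
actual = ["low", "low", "low", "high", "high", "high"]
prob_high = [0.10, 0.20, 0.55, 0.45, 0.70, 0.90]

candidates = [0.25, 0.50, 0.75]
profits = {t: total_profit(actual, prob_high, t) for t in candidates}
chosen = best_threshold(actual, prob_high, candidates)
```

Note that the model's probabilities never change in this sketch; only the cutoff for calling someone high does, which matches the point that the profit matrix is applied after, not during, model fitting.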
But if you're in a scenario where you don't want to worry so much about a profit matrix, just straight binary regression, you can look at common metrics like   accuracy, which is one minus the misclassification rate. F1 and Matthews correlation are good to look at, as well as an ROC analysis, which helps you balance specificity and sensitivity. So all of those are available.   And you can drill down. One brand-new thing I wanted to show, which we're still working on a bit, is that we've got a way now for you to play with your decision thresholding.   You can actually do this interactively now. And we've got a new thing which plots your profit by threshold.   So this is a brand-new graph that we're just starting to introduce into JMP Pro, and you'll have to get the very latest JMP 16 early adopter version in order to get this, but it does accommodate the decision matrix,   or the profit matrix. And then another thing we're working on is that you can derive an optimal threshold based on this matrix directly.   I believe, in this case, it's actually still .5. So this adds extra insight into the kinds of things you may want to do if your real goal is to maximize profit.   Otherwise, you're likely to want to balance specificity and sensitivity given your context. You've also got the typical confusion matrices, which are shown here, as well as up here, along with some graphs for both training and validation.   And then the ROC curves.   You also get the same kinds of things that we saw earlier in terms of variable importances. And let's go ahead and do the profiler again, since   it's also nice in this case. We can see exactly what's going on with each variable.   We can see, for example, that LTG and BMI are the two most important variables, and it looks like the response tends to go up as each of them goes up, so we can see that relationship directly. 
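The binary-target metrics mentioned above (accuracy, F1, Matthews correlation, sensitivity, specificity) can all be computed from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name and the example counts are made up for illustration and are not taken from the diabetes results.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Common binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts. Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
        "f1": f1,
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


if __name__ == "__main__":
    # Made-up counts, purely to show the call.
    for name, value in binary_metrics(tp=40, fp=10, fn=5, tn=45).items():
        print(f"{name}: {value:.3f}")
```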
And in fact, sometimes with trees, you can get nonlinearities, like here with BMI.   It's not clear if that's a real thing here. We might want to do more analyses or look at more models to make sure. Maybe there is something real going on with that little   bump that we see. But these are the kinds of things that you can tease out, and they're really fun and interesting to look at.   So that's there to play with in the diabetes data set. The journal has a lot more to it. There are two more examples that I won't show today for the sake of time, but they are documented in detail in the XGBoost documentation.   This is just a PDF document that goes into a lot of written detail about the add-in and walks you step by step through these two examples. So I encourage you to check those out.   And then the journal also contains several different comparisons that have been done.   You saw this purple matrix that we looked at. This was a study done at the University of Pennsylvania,   where they compared a whole series of machine learning methods to each other across a bunch of different data sets, and then counted how many times one outperformed the other. And XGBoost   came out as the top model in this comparison. It wasn't always the best, but on average it tended to outperform all the other ones that you see here. So, yet more evidence of the   power and capabilities of this technique. Now, there are a few other things here that I won't get into.   This Hill Valley one is interesting. It's a case where the trees did not work well at all. It's kind of a pathological situation, but interesting to study, just to help understand what's going on.   We have also done some fairly extensive testing internally within R&D at JMP, and a lot of those results are here across several different data sets.   And again, for the sake of time, I won't go into those, but I would encourage you to check them out. 
All of our results come along with the journal here, and you can play with them across quite a few different domains and data sizes.   So check those out. I will say, just for fun, our conclusion in terms of speed is summarized here in this little meme. We've got two different cars.   Actually, this really does scream along, and it's tuned to utilize all the threads that you have in your CPU.   And if you're on Windows with an NVIDIA card, you can even tap into your GPU, which will often offer maybe another fivefold increase in speed. So a lot of fun there.   So let me wrap up the tutorial at this point and, again, encourage you to check it out. I did want to offer a lot of thanks. Many people have been involved,   and I worry that I have probably overlooked some here, but I did at least want to acknowledge these folks. We've had a great early adopter group,   and they provided really nice feedback, from Marie, Diedrich, and these guys at Villanova, who have actually already started using XGBoost in a classroom setting with success.   So that was really great to hear about. And a lot of people within JMP have been helping.   Of course, this is building on the entire JMP infrastructure, so I pretty much need to list the entire JMP division at some point for help with this. It's been so much fun working on it.   And then I want to acknowledge our Life Sciences team, who have kept me honest on various things and have been helping out with a lot of suggestions.   And Luciano has actually implemented an XGBoost add-in, a different add-in that goes with JMP Genomics, so I'd encourage you to check that out as well if you're using JMP Genomics. You can also call XGBoost directly within the predictive modeling framework there.   So thank you very much for your attention, and I hope you can give XGBoost a try.
Monday, October 12, 2020
Kevin Gallagher, Scientist, PPG Industries   During the early days of Six Sigma deployment, many companies realized that there were limits to how much variation can be removed from an existing process. To get beyond those limits would require that products and processes be designed to be more robust and thus inherently less variable. In this presentation, the concept of product robustness will be explained followed by a demonstration of how to use JMP to develop robust products though case study examples. The presentation will illustrate JMP tools to: 1) visually assess robustness, 2) deploy Design of Experiments and subsequent analysis to identify the best product/process settings to achieve robustness, and 3) quantify the expected capability (via Monte Carlo simulation). The talk will also highlight why Split Plot and Definitive Screening Designs are among the most suitable designs for developing robust products.     Auto-generated transcript...   Speaker Transcript Kevin Hello, my name is Kevin Gallagher. I'll be talking about designing robust products today. I work for PPG industries which is headquartered in Pittsburgh, Pennsylvania, and our corporate headquarters is shown on the right hand side of the slide. PPG is a global leader in development of paints and coatings for a wide variety of applications, some of which are shown here. And I personally work in our Coatings Innovation Center in the northern suburb of Pittsburgh, where we have a strong focus on developing innovative new products. In the last 10 years the folks at this facility have developed over 600 US patents and we've received several customer and industry awards. I want to talk about how to develop robust products using design of experiments and JMP. So first question is, what do we mean by a robust product? And that is a product that delivers consistent results. 
And the strategy of designing a robust product is to purposely set control factors, the inputs to the process that we call X's, to desensitize the product or process to noise factors that are acting on the process. Noise factors are inputs to the process that can potentially influence the Y's, but over which we generally have little control, especially in the design phase of the product or process. When thinking about robust design, it's good to start with a process map that's augmented with the variables associated with the inputs and outputs of each process step. So if we think about an example of developing a coating for an automotive application, we first start with designing that coating formulation, then we manufacture it. Then it goes to our customers and they apply our coating to the vehicle, and then you buy it, take it home, and drive the vehicle. So when we think about robustness, we need to think about three things. We need to think about the output that's important to us. In this example, we're thinking about developing a premium appearance coating for an automotive vehicle. We need to think about some of the noise variables that cause variation in the Y. And in this particular case, I want to focus on variables that are really in our customers' facilities. Not that they can't control thickness and booth temperature and applicator settings, but there's always some natural variation around all of these parameters. And for us, we want to be able to focus on factors that we can control in the design of the product to make the product insensitive to those variables in our customers' process, so they can consistently get a good appearance. So one way to do that is to run a designed experiment around some of the factors that are known to cause that variability. In this particular example, we could design a factorial design around booth humidity, applicator setting, and thickness. 
This assumes, of course, that you can simulate those noise variables in your laboratory, and in this case, we can. So we can run this same experiment on each of several prototype formulations; it could be just two as a comparison or it could be a whole design of experiments looking at different formulation designs. Once we have the data for this, one of the best ways to visualize the robustness of a product is to create a box plot. So I'm going to open up the data set comparing two prototype formulations tested over a range of application conditions, and in this case the appearance is measured so that higher values of appearance are better. So ideally we want, we'd like high values of appearance and then consistently good over all of the different noise conditions. So to look at this, we could, we can go to the Graph Builder. And we can take the appearance and make that our y value; prototype formulas are X values. And if we turn on the box plot and then add the points back, you can clearly see that one product has much less variation than the other, thus be more robust and on top of that, it has a better average. Now the box plots are nice because the box plots capture the middle 50% of the data and the whiskers go out to the maximum and minimum values, excluding the outliers. So it makes a very nice visual display of the robustness of a product. So now we want to talk about how do we use design of experiments to find settings that are best for developing a product that is robust. So as you know, when you design an experiment, the best way to analyze it is to build a model. Y is a function of x, as shown in the top right. And then once we have that model built, we can explore the relationship between the Y's and the X's with various tools in JMP, like in the bottom right hand corner, a contour plot and on and...also down there, prediction profiler. 
These allow us to explore what's called the response surface, or how the response varies as a function of the changing values of the X factors. The key to finding a robust product is to find areas of that response surface where the surface is relatively flat. In that region it will be very insensitive to small variations in those input variables. Here is a very simple example where there's just one Y and one X, and the relationship shown here is sort of a parabolic function. If we set the X at a higher value here, where the function is a little bit flatter, and we have some sort of common cause variation in the input variable, that variation will be translated to a smaller amount of variation in the Y than if we had that X at a lower setting, as shown by the dotted red lines. In a similar way, we can have interactions that transmit more or less variation. In this example we have an interaction between a noise variable and a control variable X. And in this scenario, if there's again some common cause variation associated with that noise variable, then having the X factor at the low setting will transmit less variation to the Y variable. So now I want to share a second case study with you, where we're going to demonstrate how to build a model, explore the response surface for flat areas where we could make our settings to have a robust product, and finally evaluate the robustness using some predictive capability analysis. In this particular example, a chemist is focused on finding the variables that are contributing to unacceptable variation in yellowness of the product, and that yellowness is measured with a spectrophotometer using the metric b*. The team did a series of experiments to identify the important factors influencing yellowing, and the two most influential factors that they found were the reaction temperature and the rate of addition of one of the important ingredients. 
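The transmitted-variation idea from the one-factor example (a flatter part of the response surface passes less input variation through to the Y) can be checked numerically. The sketch below assumes a made-up parabolic response, `y = 10 - (x - 8)**2 / 4`, which is not the curve from the slide, and compares the spread of Y when the same common cause variation in X is applied at a steep set point and at a set point near the flat top. All names and numbers are illustrative assumptions.

```python
import random
import statistics


def response(x):
    """Hypothetical parabolic response; flat near x = 8, steep at low x."""
    return 10.0 - (x - 8.0) ** 2 / 4.0


def transmitted_sd(f, x_mean, x_sd, n=20000, seed=1):
    """Standard deviation of f(x) when x ~ Normal(x_mean, x_sd)."""
    rng = random.Random(seed)
    ys = [f(rng.gauss(x_mean, x_sd)) for _ in range(n)]
    return statistics.pstdev(ys)


if __name__ == "__main__":
    x_sd = 0.5  # same common cause variation in X at both set points
    steep = transmitted_sd(response, x_mean=3.0, x_sd=x_sd)
    flat = transmitted_sd(response, x_mean=7.5, x_sd=x_sd)
    print(f"SD of Y at x = 3.0 (steep region): {steep:.3f}")
    print(f"SD of Y at x = 7.5 (flat region):  {flat:.3f}")
```

To first order the transmitted standard deviation is the local slope times the input standard deviation, which is why the flat region wins.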
So they decided to develop full factorial design with some replicated center points, as shown on the right hand corner. Now, the team would like to have the yellowness value (b*) to be set to a target value of 2 but within a specification of 1 to 3. I'm going to go back into JMP and open up the second case study example. It's a small experiment here, where the factorial runs are shown in blue and the center points in red. And again, the metric of interest (B*) is listed here as well. Now the first thing we would normally do is fit, fit the experiment to the model that is best for that design. And in this particular case, we find a very good R square between the the yellowness and the factors that we're studying, and all of the factors appear to be statistically significant. So given that's the case, we can begin to explore the response surface using some other tools within JMP. One of the tools that we often like to use is the prediction profiler, because with this tool, we can explore different settings and look to find settings where we're going to get the yellowness predicted to be where we want it to be, a value of 2. But when it comes to finding robust settings, a really good tool to use is the the contour profiler. It's under factor profiling. And I'm going to put a contour right here at 3, because we said specification limits were 1 to 3 and at the high end (3), anywhere along this contour here the predicted value will be 3 and above this value into the shaded area will be above 3, out of our specification range. That means that anything in the white is expected to be within our specification limits. So right now the way we have it set up, anything that is less than a temperature at 80 and a rate anywhere between 1.5 and 4 should give us product that meets specifications on average. But what if the temperature in in the process that, when we scale this product up is, is something that we can't control completely accurate. 
So there's gonna be some amount of variation in the temperature. So how can we develop the product and come up with these set points so that the product will be insensitive to temperature variation? So in order to do that, or to think about that, it's often useful to add some contour grid lines to the contour plot overlay here. And I like to round off the low value in the increment, so that the the contours are at nice even numbers 1.5. 2, 2.5, and 3, going from left to right. So anywhere along this contour here should give us a predicted value of 2. But we want to be down here where the contours are close together or up here where they're further apart with respect to temperature. As the contours get further apart, that's an indication that we're nearing a flat spot in the in response surface. So to be most robust at temperature, that's where we want to be near the top here. So a setting of near 75 degrees and rate of about 4 might be most ideal. And we can see this also in the prediction profiler when we have these profilers linked, because in this setting, we're predicting the b* to be 2. But notice the the relationship between b* and temperature is relatively flat, but if I click down to this lower level, now even though the b* is still 2, the relationship between b* and temperature is very steep. So if we know something about how much variation is likely to occur in temperature when we scale this product up, we can actually use a model that we've built from our DOE to simulate the process capability into the future. And the way we can do that with JMP is to open up the simulator option. And it allows us to input random variation into the model in a number of different ways. And then use the model to calculate the output for those selected input conditions. We could add random noise, like common cause variation that could be due to measurement variation and such, into the design. We can also put random variation into any of the factors. 
In this case we're talking about maybe having trouble controlling the temperature in production, so we might want to make that a random variable. And it sets the mean to where I have it set. So I'm just going to drag it down a little bit to the very bottom, so it's at a mean of about 70. And then JMP has a default standard deviation of 10. You can change that to whatever makes sense for the process that you're studying, but for now I'm just going to leave it at 10. You can choose to randomly select from any distribution that you want, and I'm going to leave it at the normal distribution. I'm going to leave the rate fixed. So maybe in this scenario we can control the rate very accurately, but the temperature, not as much. So we want to make sure we're selecting our set points for rate and temperature so that there is as little impact of temperature variation on the yellowness as possible. We can evaluate the results of this simulation by clicking the Simulate to Table, Make Table button. Now there are 5,000 rows here that have been simulated, and every row has a random selection of temperature from the distribution shown here, and then the rate, which is fixed. We can compare the simulated results to the specification limits that we have for this product, and we can do that with the process capability. Since I already have the specification limits as a column property, they're automatically filled in, but if you didn't have them filled in, you can type them in here. Simply click OK, and now it shows us the capability analysis for this particular product. It shows us the lower spec limit, the upper spec limit, and the target value, and it overlays that on the distribution of responses from our simulation. In this particular case, the results don't look too promising, because there's a large percentage of the product that seems to be outside of the specification. In fact, 30% of it is outside. 
And if we use the capability index Cpk, which compares the specification range to the range in process variation, we see that the Cpk is not very good at .3.  
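The simulate-then-assess workflow above can be sketched outside JMP as follows. The specification limits (1 to 3), the temperature distribution (mean 70, standard deviation 10) and the 5,000 simulated rows come from the talk; the prediction formula for b* is a made-up stand-in, since the fitted model is not given in the transcript, so the printed numbers are not expected to match the ones on screen. Cpk is computed with the usual definition, min(USL - mean, mean - LSL) / (3 x standard deviation).

```python
import random
import statistics

LSL, USL = 1.0, 3.0  # specification limits for b* from the talk


def predicted_b_star(temperature, rate):
    """Hypothetical stand-in for the fitted DOE model (coefficients assumed)."""
    slope = 0.12 - 0.02 * rate  # assumed: temperature sensitivity depends on rate
    return 2.0 + slope * (temperature - 70.0)


def simulate_b_star(n=5000, temp_mean=70.0, temp_sd=10.0, rate=1.5, seed=1234):
    """Draw random temperatures, hold rate fixed, and push them through the model."""
    rng = random.Random(seed)
    return [predicted_b_star(rng.gauss(temp_mean, temp_sd), rate) for _ in range(n)]


def cpk(values, lsl, usl):
    """Capability index: distance from the mean to the nearer limit, in 3-sigma units."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return min(usl - mean, mean - lsl) / (3.0 * sd)


def fraction_out_of_spec(values, lsl, usl):
    """Share of simulated responses falling outside the specification limits."""
    return sum(1 for v in values if v < lsl or v > usl) / len(values)


if __name__ == "__main__":
    sim = simulate_b_star()
    print(f"Cpk: {cpk(sim, LSL, USL):.2f}")
    print(f"Out of spec: {fraction_out_of_spec(sim, LSL, USL):.1%}")
```

Rerunning `simulate_b_star` at a different rate or temperature set point shows how a flatter temperature dependence improves the capability under this assumed model.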
Fabio D'Ottaviano, R&D Statistician, Dow Inc Wenzhao Yang, Dr, Dow Inc   The large availability of undesigned data, a by-product of chemical industrial research and manufacturing, makes the venturesome use of machine learning attractive, for its plug-and-play appeal, in attempts to extract value out of this data. Often this type of data reflects not only the response to controlled variation but also the response to variation caused by random effects not tagged in the data. Thus, machine learning based models in this industry may easily miss active random effects. This presentation uses simulation in JMP to show the effect of missing a random effect via machine learning (vs. including it properly via mixed models as a benchmark) in a context commonly encountered in the chemical industry (mixture experiments with process variables) and as a function of relative cluster size, total variance, proportion of variance attributed to the random effect, and data size. Simulation was employed because it allows the comparison (missing vs. not missing random effects) to be made clearly and simply while avoiding unwanted confounders found in real world data. Besides the long-established fact that machine learning performs better the larger the size of the data, it was also observed that data lacking due specificity, i.e. without clustering information, causes critical prediction biases regardless of the data size.   This presentation is based on a published paper of the same title.     Auto-generated transcript...   Speaker Transcript Fabio D'Ottaviano Okay, thanks everybody for watching this video. As you can see, I'll be talking about missing random effects in machine learning. It's work I did together with my colleague Wenzhao Yang. We both work for the Dow Chemical Company, in R&D, and help develop new processes and, mainly, new products. 
What you see here on this screen is a big bingo cage, because our talk here is going to be about simulation, and simulation has a lot to do, at least to me, with a bingo cage: you decide the distribution of your balls and numbers inside the bingo cage, and then you just keep picking them as you want. All right. This talk also has to do with a publication we just had lately, with the same name. When you have access to this presentation, you can just click here and you'll have access to the entire paper. So here is just a summary of what we have published in there. Okay, what's the context for this? Well, first of all, machine learning has a kind of plug-and-play appeal to non-statisticians. You don't have to assume anything, and that's attractive. Besides, you have very user-friendly software out there these days. So, you know, people like to do that these days. However, random effects are everywhere, and a random effect is a funny thing because it's a concept that is a little bit more complex. So it tends not to be taught in basic statistics courses; it's a more advanced subject. So you're going to get a lot of people doing machine learning without a lot of understanding of random effects. And even if they have that understanding, the concept of random effects still doesn't ring a bell with people doing machine learning, because there are just a few algorithms that can use random effects. You can check this reference here, where you see that there are some trees and random forests that can take them, but they're recent and they're not, you know, spread everywhere. So you're going to have a hard time finding something that can handle random effects in machine learning. Let me just talk a little bit about random effects. As you can see here, at least in the chemical industry where I come from, we typically mix, let's say, three components, right? These yellow, red, and green. 
We set the percentage of each one of these components at different levels, and then we measure the responses as we change the percentages of these components, with a certain piece of equipment, and sometimes you even have an operator or lab technician who will also interfere in the result that you want to see. Okay. And when we do this kind of experiment, we want to generalize the findings, right, or whatever prediction we are trying to get here. But the problem is that, you know, when I'm mixing this green component here, if I buy next time from the supplier that supplies me this green component, the shade of green may vary, and I don't know whether, the next time I buy this green component, the batch that the supplier is giving me is going to be exactly the same green, because there is variability in supply. On top of that, I may run my experiments using a certain piece of equipment, but if I look around in my lab, or in other labs, I may find different makes of this equipment. And on top of that, the measurement may also depend on the operator who is doing it, right? So that may also interfere and kind of impoverish my prediction here, or my generalization to whatever I want to predict. Besides, there is the most typical one, I guess, in the chemical industry, which is experimental batch variability over time, if you repeat the same thing over and over again. Let's say you run an experiment here and you get your model, and your model can predict something. But then you repeat that experiment and get another model, and then another model. The predictions of these three models may be considerably different, right? Not negligible. So there is also the component of time. So what's the problem I'm talking about here? Well, typically we have historical data, and historical data is just, let's say, a collection of bits and pieces of data you've generated in the past. 
And people were not then much concerned with generalizing that result, the result at the time they ran that experiment. So when we collect them and call it historical data, we may or may not have tags for the random effects, right? And if you have tags, which, at least where I come from, is more the exception (the rule is having no tags for random effects, or at least not for all of them)... let's say you have tags. One thing you can do is use a machine learning technique that can handle these random effects, put them into the model, and that's it. You don't have a problem. But then, as I said, machine learning techniques that can handle random effects are not very well known. You may be tempted to use machine learning and put the random effects into the model as if they were fixed, and then you're going to run into the very well-known problems of treating a random effect as fixed. Just to mention one thing, you're going to have a hard time predicting any new outcome because, for example, if your random effect is the operator, you have only a few operators in your data, right, a few names, but if there is a new operator these days, you cannot predict what the effect of this new operator is going to be. So here there is no deal. And then there's one more thing you can do, whether you have or don't have tags for the random effects, which is to use any machine learning technique. If you have the tags, you ignore the random effect; and if you don't, you're going to be ignoring it anyway, whether you like it or not. So what I want to do is simulate this situation. We use JMP Pro. And, you know, I hope you enjoy the results. The simulation, basically: I will use a mixed effect model, right, with fixed and random effects. 
And then we use that same model to estimate the response, to fit the model after I simulate, and I also model the results of my simulation here with a neural net, right? Then we use the predictive performance of this model here as a benchmark, and we compare it to the predictive performance of the neural net, taking their test set R squares, to see what's going to happen if I miss the random effect here, right? Okay. Sometimes when I talk about this, people think that I'm comparing a linear mixed effect model versus, you know, a machine learning neural net. And that's not the case. Here we are comparing a model with and without the random effect, given that there is a random effect in the data. I could do, for example, a linear model with random effects versus a linear model without random effects. And I could do a neural net with random effects versus a neural net without random effects. But the problem is that today there is no neural net, for example, that can handle random effects. So I'm forced to use, for example, a linear mixed effect model. My simulation factors: well, I'll use something that is typical in the industry, which is a mixture with process variables model. What is it? Let's say I have those three components I showed you before, the yellow, red, and green, and they each have a certain percentage, and they add up to one. I have a continuous variable, which, for example, can be temperature. I have a categorical variable, which can be catalyst type. And I have a random effect, which can be the variability from batch to batch of experiments. Okay. The simulation model: well, it's a pretty simple one. I have here my mixture main effects, M1, M2, and M3, right? And you will see all over this model that the fixed effect coefficients are all either one or minus one. I just assigned one and minus one randomly to them. So I have the mixture main effects. 
Here I have the mixture two-way interactions, the interaction of the mixture with the continuous variable, and the interaction of the categorical variable with the components. And finally, the interaction of the continuous variable with the categorical variable. Plus I have this b_i here, which is my random effect, and the e_ij, which is my random error, right? Both are normally distributed with a certain variance: here is the variance between batches of experiments, right, and this is the variance within a batch of experiments. All over this presentation, just to represent this whole formula in a more, let's say, neat way, I'll use this form here, where X actually represents all the fixed effects and beta represents all the parameters that I used, right? And my expected Y is actually X beta, right? It's this whole thing without my random effect here and without my random error. Simulation parameters: well, here I have one, which is data size, right? I want to see what happens if I have not so much data, and then more data. Here I have two levels, 100 and 1,000 rows. At each setting here I have actually 20 rows per fixed effect, then 200. The other thing I will vary is going to be the size of the batch, or the cluster, whatever you like to call it. I have two levels, 4% and 25%. So 4% means that if I have 100 rows, my batch will actually be four rows, so I'm going to have 25 batches. If I have 100 rows in total and my batch size is 25%, then I have only four batches. Then the other variable we change here is going to be the total variance, right? And we have two levels here, 0.5 and 2. 0.5 is half of the fixed effect size, right, because in the formula here it is all ones for the fixed effects. 
And the other one is two, right? This total variance is the sum of my between-batch variance and my within-batch variance, right? And lastly, the other thing I will change is the ratio of between to within variance, right? So I have one and four. With one, my between-batch variance is going to be equal to the within, and with the other one it is going to be four times bigger than the within. Then, once I have settled these four factors here, let's say parameters, I do a full factorial DOE, where I have 16 runs, right: two levels of data size, two levels of batch size, two levels of total variance, and two levels of the ratio. With that, I can calculate the between and within variances accordingly, right? And that's the setup for the simulation. Okay. Now, what I call a simulation hyperparameter, because you can change it as you wish. What I did, and I'll show you in the demo, is 30 simulations per DOE run. So every one of the 16 runs I ran 30 times, right? So, for example, I'll have simulation run 1, 2, 3, up to 30. And for the levels of the fixed effects, I use a space filling design. The reason why I use a space filling design is that I don't want to confound the effect of missing the random effect with the fact that I possibly have some collinearity or sparse data, which is a typical thing in historical data, right? I don't want that in my way. I prefer a space filling design that spreads the levels of the fixed effects across the input space, so I get rid of this problem of sparsity in the data or collinearity, right? And then for the allocation to batches, we randomize the batches across the runs in the first simulation, and then use the same sequence across all the other 29 runs. So all the runs have the same batch allocation. And lastly, all the set allocation will be randomized for every one of the simulation runs. 
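The simulation setup described above can be written down compactly. If the total variance is the sum of the between-batch and within-batch variances, and the ratio is between divided by within, then within = total / (1 + ratio) and between = total - within. The sketch below enumerates the 16-run full factorial and derives those variances for each run. The factor levels are taken as heard in the talk (100 and 1,000 rows, 4% and 25% batch size, total variance 0.5 and 2, ratio 1 and 4); because the transcript is noisy they should be treated as assumptions, and every name in the code is illustrative.

```python
from itertools import product

# Factor levels as heard in the talk (assumed, see lead-in).
DATA_SIZES = (100, 1000)
BATCH_FRACTIONS = (0.04, 0.25)
TOTAL_VARIANCES = (0.5, 2.0)
RATIOS = (1.0, 4.0)  # between-batch variance / within-batch variance


def variance_components(total, ratio):
    """Split a total variance into (between, within) for a given ratio."""
    within = total / (1.0 + ratio)
    between = total - within
    return between, within


def full_factorial():
    """All 2 x 2 x 2 x 2 = 16 combinations of the simulation parameters."""
    runs = []
    for n_rows, frac, total, ratio in product(
        DATA_SIZES, BATCH_FRACTIONS, TOTAL_VARIANCES, RATIOS
    ):
        rows_per_batch = round(n_rows * frac)
        between, within = variance_components(total, ratio)
        runs.append(
            {
                "n_rows": n_rows,
                "rows_per_batch": rows_per_batch,
                "n_batches": n_rows // rows_per_batch,
                "total_variance": total,
                "ratio": ratio,
                "var_between": between,
                "var_within": within,
            }
        )
    return runs


if __name__ == "__main__":
    for run in full_factorial():
        print(run)
```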
So let me just get out of here and start JMP. What I do is use DOE, Special Purpose, Space Filling Design. I'll load my factors here, just because I want to be fast. So here I have my 1, 2, 3, 4, 5 fixed effects. Space-filling designs don't accept mixture variables, so you need to set linear constraints here just to say, look, these three guys here need to add up to one. So that's what I'm going to do here. All right. So with that, I have satisfied that constraint. I'll give an example of the first run of this DOE, which is where I have data size 100, relative batch size 4%, total variance 0.5, and the ratio is going to be one. So here I need to generate 100 runs. Also, if you want to replicate this, you have to set the random seed and the number of starts. Then the only option I have when I set constraints is the Fast Flexible Filling design. And here we go, I get the table here. One thing you can do is use a ternary plot with your three components, and you see whether everything is spread out. Oops, I have a problem here. Let me go back. There's a problem with the constraints, yeah: I forgot the one here. All right, let's start all over again. One hundred runs. Set Random Seed to 1234, and the number of starts. Great. Next, Fast Flexible Filling, Make Table. And then I just need to check if it is all spread out. Yes. All right, then I look at my categorical variable here. I want to see if it is spread out for all of them, as you can see here for one and two. Great. Now let me close all that. We do 30 simulations, and this is one simulation, where I have 100 rows. But now I need 30. So what I will do is add rows here, 2,900, at the end, and then just select these first 100 runs, all the variables here, and fill to the end of the table. Great.
So now I have repeated the same design 30 times. Now, to make it faster, I won't keep using this table; I'll just use another table where I have everything I want to show already set up. So I'm back to this table. The next thing I do is create a simulation column here. With this formula, I can tell that up to row 100 this is simulation 1, and then every hundred rows you change the simulation number. So at the end of the day I get 30 simulations with 100 rows each. Great. Then the batch allocation, just to explain what I showed in the PowerPoint. I have a formula that creates batches of 4% of the total data size, which means I have four rows per batch here, then four, and so on. And once I get to row 100 here and I'm jumping from simulation 1 to simulation 2, it starts all over again. So at the end of the day I have all 25 batches here. Then the next thing I do is create my validation column, which means I need to split the set. Back to the PowerPoint here, you see that for the neural net solution I divided the rows: 60% of them belong to the training set, 20% to the validation set, and the other 20% to the test set. So how do I do that? Back to JMP. Let me hide this. Okay, so here is the validation column. It's already there, but to explain how you do that: you go to Analyze, Make Validation Column, you use your simulation column as a stratification column, you click Go, and then you just enter 0.6, 0.2, 0.2 and use a random seed here. That's how I create that column. Then I go back to my presentation here. Take the 60% that belongs to the training set for the neural net and the 20% that belongs to the validation set, also for the neural net.
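The three columns just described (simulation number, batch number, and a 60/20/20 split stratified by simulation) can be sketched as follows. This is a rough Python analogue under stated assumptions, not the JMP column formulas from the demo: the function names, the 1-based row index, the consecutive-row batch assignment, and the seed value are all illustrative choices.

```python
import random


def simulation_number(row, rows_per_sim=100):
    """1-based simulation number for a 1-based row index."""
    return (row - 1) // rows_per_sim + 1


def batch_number(row, rows_per_sim=100, rows_per_batch=4):
    """1-based batch number; the count restarts at 1 in every simulation."""
    return ((row - 1) % rows_per_sim) // rows_per_batch + 1


def validation_column(n_sims=30, rows_per_sim=100, fractions=(0.6, 0.2, 0.2), seed=1):
    """Return 0 (training), 1 (validation) or 2 (test) for every row.

    Labels are shuffled separately inside each simulation, so every
    simulation gets the same 60/20/20 proportions (a stratified split).
    """
    rng = random.Random(seed)
    n_train = round(fractions[0] * rows_per_sim)
    n_valid = round(fractions[1] * rows_per_sim)
    n_test = rows_per_sim - n_train - n_valid
    labels = []
    for _ in range(n_sims):
        block = [0] * n_train + [1] * n_valid + [2] * n_test
        rng.shuffle(block)
        labels.extend(block)
    return labels


labels = validation_column()
print(simulation_number(100), simulation_number(101))  # last row of sim 1, first row of sim 2
print(batch_number(100), batch_number(101))            # last batch of sim 1, first batch of sim 2
print(labels[:100].count(0), labels[:100].count(1), labels[:100].count(2))
```

Because the split is reshuffled per simulation while the batch formula depends only on the row position, the batch allocation stays identical across simulations and only the set allocation changes.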
Now they both belong to a set called the modeling set for the mixed effects model. So for the mixed effects model there will be no validation; I'll just estimate the model with this 80%, and the test set of the mixed effects solution will be the same 20% that I use for the neural net solution. So in that case, I go back to JMP, and here I just create a formula. In the validation column of the neural net, zero means training and one is validation, and both are going to be my modeling set; two is my test set, and it's going to be my test set here too. So I created this, and then in the column info, for the value labels, zero is going to be modeling and one is going to be test. That way you see here that whatever is test here is test here, but what was training or validation becomes modeling. Then finally I need to set the formulas for my response. For the expected value of my simulation, I just have here the fixed effects formula; there is no random term here. That's my expected value of y. And my y_ij is going to be this. If you look at the formula, you have the y, which is my expected value, plus here I have a formula for generating a random value following a normal distribution with mean zero and the between sigma. I like to set these variables here as table variables, because I can change the value later as I wish without going into the formula. But anyway, this is going to generate a single value every time the batch number changes. So while my batch is the same here, that's going to be the same value; when it changes to 22, it creates another value. And the same when I change from one simulation to another: I will have one value for batch 25 of simulation 1, and then when I jump to batch 1 of simulation 2, it creates another value. And here is just my normal random number with the within sigma that I set in the table here.
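A minimal sketch of the two formulas above, assuming plain Python lists in place of the JMP table columns. The names `simulate_response` and `to_modeling_set` and the seed are hypothetical; the point is only that one batch effect is drawn per (simulation, batch) pair and reused for every row of that batch, while the within-batch error is drawn fresh for every row.

```python
import random


def simulate_response(expected_y, simulation, batch, sigma_between, sigma_within, seed=1234):
    """Return (y, batch_effects) with y_ij = expected y + b_i + e_ij.

    b_i is drawn once per (simulation, batch) pair from N(0, sigma_between)
    and shared by all rows of that batch; e_ij is drawn per row from
    N(0, sigma_within).
    """
    rng = random.Random(seed)
    batch_effects = {}
    y = []
    for mu, sim, b in zip(expected_y, simulation, batch):
        key = (sim, b)
        if key not in batch_effects:
            batch_effects[key] = rng.gauss(0.0, sigma_between)
        y.append(mu + batch_effects[key] + rng.gauss(0.0, sigma_within))
    return y, batch_effects


def to_modeling_set(validation_label):
    """Neural net labels 0/1/2 (training/validation/test) -> 0 (modeling) or 1 (test)."""
    return 0 if validation_label in (0, 1) else 1


# Tiny illustration with made-up expected values: two batches of four rows.
expected = [1.0] * 8
sims = [1] * 8
batches = [1, 1, 1, 1, 2, 2, 2, 2]
y, effects = simulate_response(expected, sims, batches, sigma_between=0.5, sigma_within=0.5)
print(len(effects))  # number of distinct batch effects drawn
print([to_modeling_set(v) for v in (0, 1, 2)])
```

Keeping the sigmas as function arguments plays the same role as the table variables in the demo: the values can change without touching the formula.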
So since I'm replicating DOE run one, I have sigmas of 0.5 and 0.5. All right, so now I have here the solution for my mixed effects model simulation. Before that, let me go back here and show you what I am doing. For the mixed effects model, my simulation model is this, but my fitted model will be this, with y-hat and beta-hat, and my small b here is going to be b-hat. So these are the estimated values for whatever I simulated. And then in the mixed effects model, I have two types of prediction model. One is the conditional model, where I use my estimates of the random effects, and the other one is my marginal model, where I don't. So I have these two types of model. This one is good for predicting things that I already have in the data, and this one is something I use to predict data that I don't have in the data set. For the neural net, I'm using the standard default kind of neural net in JMP, which I'm not using just because it's the default but because it pretty much works. You have here all five fixed effects. I use one layer with three nodes, all hyperbolic tangent functions, as you can see here. And then you have here a function called h(x), which is the summation of these three functions plus a bias term. If I add more nodes, it doesn't get any better. If I only use two nodes, it gets even worse. So I'm going to use this all over. And that's what I'm going to show you here. Oops. Okay. Let me go back. So here I have my mixed effects model solution. How did I come up with that? Just to show you: I have here the response, I put in the validation column of the mixed effects model, by simulation, all my fixed effects here, and my random effect is my batch. And then I generate this.
You see simulation 1, and it goes all the way to simulation 30. I couldn't use, for example, the Simulate function of JMP here because I'm changing the validation column for every batch, so I cannot, or at least I don't know how to, incorporate the validation column in the formula of y_ij. Okay. So, back here, now I have another script for the neural net. It's going to take a little bit, but it shouldn't be a big problem. When I'm doing, for example, the DOE runs with 1,000 rows per simulation, that can take quite some time, something like 10 to 15 minutes, to do all 30 simulations at the same time. Here it should show up soon. There we go. Okay, so here, again, if you look, every one of them has three nodes, and you have simulation 1 all the way to simulation 30. So now I have all that done for run one of my DOE. The next thing I need to understand is what types of R square I'm going to compare. There are actually five types of R square here. So here's the R square formula. Why do I differentiate the R squares by type? Because depending on what you use as actual versus predicted in this formula, your R square changes. So, for example, here is one type of R square, where I compare, for the rows of the training set, what I simulated versus the formula I got for the neural net. That's actually the case in all three of these; they are the same thing because the y and the y-hat I'm comparing are the same. So I call that type A. Then I have another type, call it type B, where I compare something else.
I'm not comparing the simulated value, with the random effect and the random error term; I'm comparing the expected value of my simulation versus the formula I got. So this makes sense for the test set. The rows are always the same; it's just the way you calculate the R square that changes. In this way here, when I have what I call the conditional test set, I see the apparent future performance, because that's exactly what you get when you have any data set, because we don't have the real model; for that you need to simulate. And then you have the expected test set, which is actually the same rows, but now I'm comparing against the expected value, and I can tell, for lack of a better word, the real future performance. So the apparent performance is not necessarily the real future performance. For the mixed effects model it is the same. Now I have another type of R square, because here I'm comparing the simulated value versus my conditional prediction formula, using the estimates I have for the random effects. But when I want to predict the future, that is, when I want to predict the test set, for the conditional test set I have another R square here, type D, which compares the whole simulated value y_ij versus my marginal model; I'm not using b here. And lastly, I have a fifth type of R square, which is for my expected test set. Again, the test set is always the same rows; it's just that I'm using here the expected value versus my marginal model. So the problem is, if you're not careful, you may calculate the wrong R square. So what I do, whenever I fit the mixed effects model, is I don't use anything that is in this report. All I do is save the columns: I save the prediction formula of my marginal model, I save the prediction formula of the conditional model, and I create columns with these formulas. That's for the mixed effects model.
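The five R square types differ only in which column plays "actual" and which plays "predicted". The sketch below assumes the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares about the mean of the actual values; the letters C and E, the dictionary, and all the numbers are illustrative assumptions, not values from the talk.

```python
def r_square(actual, predicted):
    """R square = 1 - SS_residual / SS_total, with SS_total about the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Which columns each type compares (actual, predicted). Labels C and E are assumed names.
R_SQUARE_TYPES = {
    "A": ("simulated y_ij", "neural net prediction"),
    "B": ("expected y", "neural net prediction"),
    "C": ("simulated y_ij", "mixed model, conditional prediction"),
    "D": ("simulated y_ij", "mixed model, marginal prediction"),
    "E": ("expected y", "mixed model, marginal prediction"),
}

# Hypothetical test set rows, for illustration only.
simulated_y = [1.2, 0.7, 2.1, 1.6, 0.4]
expected_y = [1.0, 0.9, 1.8, 1.5, 0.6]
nn_pred = [1.1, 0.8, 1.9, 1.4, 0.7]
marginal_pred = [1.0, 0.8, 1.8, 1.6, 0.6]

# Same rows, same predictions; only the "actual" column changes.
apparent_nn = r_square(simulated_y, nn_pred)       # conditional test set (type A)
real_nn = r_square(expected_y, nn_pred)            # expected test set (type B)
apparent_mixed = r_square(simulated_y, marginal_pred)  # type D
real_mixed = r_square(expected_y, marginal_pred)       # type E

# Differences of the kind plotted later in the talk: mixed effects minus neural net.
print(apparent_mixed - apparent_nn, real_mixed - real_nn)
```

Computing the statistic from saved prediction columns, rather than reading it off a fit report, keeps the choice of "actual" explicit for every type.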
Now for the neural net, you can also save the formulas. I like to save the fast formulas, because I just want to calculate the R squares. So I saved them as fast formulas, and then, as you see, I create these five columns here. So let me go through them. My type A, if you remember type A from the presentation, compares the simulated y_ij versus my neural net model. So what I do here is go to Column Info, and you see here the Predicting property, and it is predicting y_ij. Okay. Now, here I have the saved neural net formula a second time; as you can see, this value is equal to this value. But now I just change the Predicting property here: this one is predicting the expected value. That way I can use this functionality in JMP, which is Model Comparison. I use this type A, I do it by simulation, and I group by validation. And then what you get is all the R squares you need, from simulation 1 all the way down to simulation 30, and you have them by set, so you can later make a combined data table and get everything neat for those R squares. For the other ones I also have scripts here. For example, for type B, I use this formula now, this column, and I only report the R square for the test set, not for the modeling, not for the validation set. So simulation 1 is here, and if you go all the way down here you have simulation 30. And again, you can always make a combined data table, and your data comes out in the same table format for all of these R squares. And then, obviously, I have another script here for the type C R squares, which use the modeling set of the mixed effects model, from simulation 1 all the way to the modeling set of simulation 30. And type D does the same thing on the test set, but now I'm predicting y_ij. So the secret here is this.
Here again I have the same formula, because it's my marginal model of the mixed effects model solution. The only thing that changes is in your Column Info: make sure you have the right Predicting property, and then you can use this to calculate all these R squares. All right, let me go back to the presentation. Now, since I have all those R squares together, you stack your tables and then you can do the visualization you want. Here I'm really interested in the conditional test set of both solutions and the expected test set. I could spend 45 minutes just talking about this display here, but I'm not really interested in the absolute values of the R square; I'm more interested in comparing R squares. But I need this just to check one thing. Let me use the pointer here to make it easier. I have all the R squares I created here versus all the DOE factors. You see that when I have a small data set, my neural net is being trained correctly, because my training and my validation sets have R square distributions that overlap. But then when you look at the conditional test set, which is actually the data we always have, because we never have the expected value, it's always at a lower level, as you see for all of these when I have the small data set. When I have a bigger data set, the situation is different: with 1,000 rows, they are all aligned and kind of overlap. So I did train these correctly. Again, the absolute value of R square here is not of much interest. What I really care about is, if you go back to one of the earlier slides, that I want to compare the predictive performance of my benchmark mixed effects model versus my neural net by comparing the test set R square. So, here is what you get.
Let me move this. So what you get here is again all the variables I had for the DOE, and whether it is the neural net solution or the mixed effects solution. You see that for the conditional test set, which is the apparent performance, when the data sizes are small the mixed effects model, when you include the random effect, is always doing a better job than the neural net, because the medians or even the averages are all higher. But when you have a bigger data set, that difference kind of doesn't exist anymore, to the point that the neural net is even doing a better job here. But that's the apparent performance. Now the real one: you see now that the mixed effects model is doing a better job than the neural net. And here there is no more even ground, because at the end of the day the mixed effects model is doing a better job, especially in this scenario here, which is bigger data and more between than within variability. Now, for the next slides I have here, what I'm going to do, for every simulation run, is take the difference between the mixed effects R square and the neural net R square. So, here we go. Here I have four plots, so let's just concentrate on one. What you have on the y axis is the difference in conditional R square, that is, the mixed effects R square minus the neural net R square. That is the difference in apparent future performance. And on the x axis, what you have is this difference in the expected R square, which is your real future performance, or bias, if you want. Now I have four plots.
Why? Because when you have historical data and you're analyzing it, you possibly have control over just two things, which are the data size and the relative batch size. You cannot control what the variability is going to be in your data, or whether your random effects are going to be much bigger than your random error. So the only two things that you can possibly have any control over are data size and relative batch size. Even if you don't have the batch tag, you can at least have an idea whether your historical data should be comprised of many, many batches or just one or two batches. So that's the kind of control you have: you can at least have an idea of the batch size when you have historical data. So what am I comparing here? Again, I have the difference in apparent performance. If this difference is positive, it means the mixed effects model has a better apparent performance. If this difference here is also positive, it means that the mixed effects model has a better real future performance. And as you can see here, when you have a small data set, it doesn't matter what you do: the mixed effects model has a better performance, and sometimes much better, because we're talking about differences in R square that can go way over one. So it's getting much better performance, be it the apparent one or the future one. So when the data sizes are small, there's really no solution here. However, look at the bigger data size when you have a small number of batches. Here something funny happens, because the difference in apparent future performance on the y axis is negative for most of them, which means the neural net is doing a better job in terms of apparent test set R square, or conditional test set R square. So when you do it, it's going to look like the neural net
is doing a better job than the mixed effects model. However, when you look at the x axis, which is the difference in real future performance, this can be pretty misleading. So when you have a lot of data but just a few batches, you're going to get nice test set R squares, but when you try to deploy your model in the future, you may get into trouble. But then when you look here, we have a mitigated situation where you have a lot of data and a lot of batches, so they tend to be not that much different. So, as a conclusion: if you miss a non-negligible random effect in machine learning, then when the data set is small, the test set predictive performance will most likely be poor, regardless of how many clusters or batches you have. And that's because machine learning requires a minimum data size for success. So there's no way to win the game here. Now, when the data size is large and you have just a few clusters, that's kind of a misleading situation, because your test set predictive performance can be good, but the performance will likely be poorer later when you deploy the model. Some people tell me, well, why don't you use regularization? I say that you will not do it in these situations, because your test set R square can be good, and then you don't know that you need it. So you won't be able to tell what your long-term future performance is just by looking at your test set R square or some kind of sum of errors. But when your data set is large and you have many clusters, then the whole situation is mitigated, and the biasing effect of the clusters kind of averages out, because for the random effects, the summation of all of them is zero.
So the more you have, the less bias you get. On top of that, I just wanted to say what I learned from this: when the data is not designed on purpose, there are two things I always remember. Machine learning cannot tackle data just because it is big; you've got to have a minimum level of design to make it work. But the bigger the data, the more likely it is that this minimal level of design is already present in the data just by sheer chance. All right. Thank you. If you want to contact us, we are in the JMP Community. These are our addresses. Thank you.
Scott Wise, Senior Manager, JMP Education Team, SAS   A picture is said to be worth a thousand words, and the visuals that can be created in JMP Graph Builder can be considered fine works of art in their ability to convey compelling information to the viewer. This journal presentation features how to build popular and captivating advanced graph views using JMP Graph Builder. Based on the popular Pictures from the Gallery journals, the Gallery 5 presentation and journal features new views available in the latest versions of JMP. We will feature several popular industry graph formats that you may not have known could be easily built within JMP. Views such as incorporating Ridgeline & Plots, Contour Bag Plots, Informative Box Plots and more will be included that can help breathe life into your graphs and provide a compelling platform to help manage up your results.     Auto-generated transcript...   Speaker Transcript Greetings and welcome to pictures from the Gallery Five More Advanced Graph Builder Views. In past presentations many weird things have occurred, like being interrupted by visits from aliens. We apologize and we will show a more serious graph demo to start our presentation.   Let's use graph builder to answer the question, what country has the tallest buildings. To do this we'll use wooden blocks that represent every 150 meters of structural height. On the back of each block is a barcode that we can use to directly scan info into JMP.   By physically using the blocks, we now have a physical scale model.   This automatically populated the data in JMP and generated a bar graph in descending order.   Wait, do you hear that?   Okay, Godzilla. Would you rather we showed a graph about you?   So let's do that. Scott Wise Welcome, everybody. Welcome to pictures from the gallery. My name is Scott Wise and hopefully you enjoyed that little video. 
And our whole idea is to show you some advanced things in Graph Builder you probably didn't know you could automatically do.   So we're going to show you some smarter things we can do, maybe more compelling ways we can show Godzilla's story. And there was an article that came out and it had an alarming trend on it. It said that since Godzilla began, in about 1954,   he has grown bigger and bigger and bigger,   to the point that he's much larger than he used to be.   That's pretty disconcerting. He was pretty destructive from the very start. So we'd like to show this. And of course, we've got to get the data into JMP; that was fairly easy to do.   So here's the data in JMP. We have the meters high. So we see he's grown from 50 meters high all the way to 150 meters high in his last picture, which was in 2019.   Now, a couple of things we can do. This has been an option in JMP for quite a while, but you might not have seen it before. You can put pictures into JMP.   And if I add a column here, you can see it's got this empty cell with the brackets. And if I go to the column info, this is an expression data type.   And pictures are one of the things it can take. And so if I go and I take a look at   just any picture, and I've got these as kind of separate pictures here in my PowerPoint, all I have to do is grab it,   drop it directly into my JMP table and it will size. So that's pretty cool. And you can as well   label this column so it'll show up when I hover over points in my graphs. So I'm going to delete that row, but here are those pictures I have.   OK, now let's go back to our data. Let's take a look at just building out a simple bar chart. I'm just going to put meters high on the Y. I'll put the year on the X. I'll ask for bars up here from the bar element.   Now this doesn't have the same view I had before. There's a lot of space in between these bars. If you want to change them, right click in the bar area. Go to Customize.   
Click on bar and, for the width proportion, type in something like .99, something below one, and you can see that filled out the space and there's not a lot of open gap in between them. So this is pretty cool. And so I have this information.   This has given me the relative height of Godzilla over the years of making movies. And of course if you hover over them,   with the label turned on, the picture will show up. But what if I wanted to make the picture be the bar? Is there a way to do that? Yes, there is. So I copied this bar chart into my PowerPoint presentation   and kind of used it as a template. And then I said, well, gee, can I just take the pictures and, using that JMP bar chart as a template, can I get them to the right size on the same scale? And yes, you can. You can see here I just massaged them into place.   So I got them all just kind of oriented into place.   And then, of course, we were able to use that as a group picture. So now I have a relatively scaled group picture.   So this was very useful because what I can do   is if I come back   into JMP.   Let's bring back up my Graph Builder.   And I just take this grouping.   I can put it into my graph.   It just kind of snaps right in there.   Then you can work on positioning it, get it into the right format. And of course, if you go and you make this transparent, really large,   that really helps.   You can build it out and shrink the graph and get it down to the right size. So it takes a little bit of finagling to do, but the result is   you then can match that size graph. Now another graph I'd like to make is the bar and needle chart.   So I kind of like these. You can make the circle kind of stand out. You can even size something by the circle, kind of like pins, right, the top of a pin. And then you've got, you know,   a long line there, kind of connecting it to the label. 
So that's kind of a nice view.   So that's very easy to do within JMP. If I go back, go ahead, put meters high, put year, go to bar, but this time under bar style, select needle.   And maybe I will also add some points and, of course, I can make the size be the meters high.   And I can as well give it some color. Maybe I'll color it by where Godzilla attacked. And if I want to make the circles show more separation,   I can right click on this marker size there, where it showed me that meters high is the size of the circle, and maybe increase that to something like 12. Now I get much more separation between 1954 and 2019.   And again, the pictures will surface.   And you can even stop there. I went a little further.   I created this chart. And on this chart,   in the axis settings, I put in a reference line at the maximum depth of the respective target harbors.   And why we did this was, I saw a funny article where they were giving a hypothetical to the emergency coordinator of New York City.   This was not long after Godzilla had visited there, a very dinosaurish-looking Godzilla, in 1998.   And they asked him, if Godzilla ever comes back, "Are you worried? Are you prepared?", and he said, "No problem. We'll evacuate the city very quickly." And they said, "Why is that?" And he said, "Well, gee, the actual   maximum depth of New York Harbor is only so big here and Godzilla is way up here. So we'll see him coming a long way out before he ever hits land." So I thought that was   pretty amusing. Of course you can combine these charts. We can do just what we did and put that scaled picture, the scaled bar chart, in with the needle chart so I can have a lot of information here, including the harbor depth of the targets, and now of course also see the picture of Godzilla.   All right, so hopefully you enjoyed that little demonstration here. And that's really what   this whole presentation's about. 
We call this Pictures from the Gallery and we're challenged every year to come up with a handful of advanced views that maybe folks had done   with a lot of pain in spreadsheets and other packages, or that really challenged JMP to be able to create.   And JMP is so flexible and so interactive that there are a lot of great views you can get that can make your data even more compelling.   And so here, without further ado, is the version of Pictures from the Gallery for this year. We're on our fifth edition and we've got six beautiful views here.   Number one is an informative box plot.   Number two is a ridge plot density chart. Number three is actually   having multiple ranges as an area range plot in between my lines.   Number four is an informative points plot. It's kind of a unique view to look at points and size points.   Now number five is a box plot with outlier boxes.   Not box plot, excuse me, bag plot with outlier boxes, so the bag plot is a new functionality,   a two dimensional way of seeing outliers, and I even included some outlier boxes on the edges. And number six is a components effects plot,   and when you have mixture components that have to add up to 100%, it helps you figure out a way of showing them on a graph where you can look at your mixture settings and see how they respond to an output,   maybe something you're trying to experiment on or model. So that's very handy.   So these are the six views. Now, we probably don't have time to go through all six. If I'm doing this presentation live I might take a vote, but I can tell you, I've got a pretty good idea from doing this before, so   I'm going to show you the most popular views first. And whatever we don't cover, I'll be glad to cover   later for you. I'm going to leave you with instructions and the script to generate it yourself.   So the beauty of this is that you're going to get a gift from us. 
This gift is going to be this Pictures from the Gallery journal that you can always go back to and use when you want to replicate one of these views or practice.   First, the ridgeline plot. This was a new view in JMP 15   and it is showing you a lot of stacked histograms on top of each other, against a bunch of categorical levels   on my y axis. And it's very useful, especially if you're plotting signal data or growth data and you want to look at it in comparison to a reference. So this was some   real medical data and we have this DMSO drug and   they had some measure of area where they took the log of that measure, and I want to find how things are the same or different than my reference, my red   distribution. So let me show you how we set this up. So again you have good information in your journal, right, tips on how to make it. We're going to use that data and we're going to use a kernel density and a bunch of ordering commands.   And now, here are the steps we're going to follow to make this view. So you can go through and see these yourself. Always attached is the data and always attached is the finished script. So we're going to try to generate this chart. So let me start from scratch.   So I'm going to put the drug on the Y axis. I'm going to put the area log   on the X axis and it gives me box plots and that's fine. Now something I'll do, I'll take this DMSO   and I'm going to put it in the overlay. That's kind of setting up my red and blue: this is my reference, and these are the things I'm comparing it to.   Now before we begin, anything you have in the chart can help you order things in your Graph Builder. And if I go over the Y axis and right click, I can order by, and I can order by the area log10, descending.   So I do that and it orders by something. What does it order by? If I right click on there, now that I've activated the order by, it shows me all the statistics. It defaults to the mean.   
I have a feeling, median, because not all these are normal, some of them have quite long tails, I might do the median here. And does that change things? A little bit.   So now it is ordering from top to bottom according to the median. And you can kind of tell that with the median lines there, the 50% quantiles of these box plots.   But I don't want box plots. I want bar. So, all I have to do is come up here and click on the bar icon.   Now we're looking pretty good, but how do I get those smooth lines and how do I get them to overlay? All you have to do under histogram style, down here in the little control panel for histograms, select kernel density.   And you get two sliders, you get an overlap and you get a smoothness. The smoothness controls how bumpy or how smooth you want to make these lines, and I like them just slightly bumpy.   And the overlap controls how much each one overlaps with the next level, and you can give it a little...give it a lot of overlap. Give it a little overlap. Whatever makes the most sense.   And that's it. And you can additionally add some reference lines, you know, by right clicking, going to axis settings, and adding some reference lines down here to help your view.   If I go to the one in our script, you can see I've added for the DMSO, I've added where its median is and where its min and max are.   And you can kind of get an idea which ones are very similar in center, which ones are similar in shape, which ones are very different from each other. So that's the ridgeplot density.   So again, we'll give you this journal so you can replicate it.   And let's move on to the next view, the next view we're going to look at, that's probably the second most popular out of all these, is the bag plot with outlier boxes. So the bag plot is a new kind of   chart that gives you a two dimensional view.   So I've got (this is pollution data) so I've got ozone on my Y and I've got particulates (PM10s)   on my x axis.   
So now I've got this bag plot here.   that's going to allow me, it's going to...it's going to find a center and that's this little asterisk, little hard to see. I'll make it a little bigger when we do this. But   that little asterisk there is really the center of the two dimensional space in between ozone and PM10.   And it's drawing some fences, draws and...draws a little...   little area closest to the to that two dimensional grouping and then it draws a fence outside and it says, if any point falls beyond the fence, like this point right here,   it is truly an outlier in a two dimensional fence, in respect to both PM10 and ozone and that's kind of cool. And what I did was on the edges, I put in some box plots.   Because I wanted to see if on a one dimensional standpoint, I just looked at PM10 what would have been out?   Well, this point right here, which was St. Louis, which was not outside the bag plot fence, but it was outside for PM10.   But why it's not outside the bag plot fence is it's almost on the median of ozone. It's right in there.   Right but Los Angeles would have been out for ozone, but it is not out for PM10. So I thought that was really interesting view. So let's see how we can do that. So again, I've got these Graph Builder steps. We are going to be using contour plots.   to help us do this and we're going to be using a bag plot in the contour plot and a dummy variable.   So let me pull up the pollutants map.   Here we go.   Alright, so I've got my PM10.   I've got my ozone.   I've got my city, of course.   Now I've got a dummy variable. And it really is a dummy variable. You see, all I have in here are x's. So, a whole bunch of x's. Just something...something categorical but repeating that   can be used to open a new section of your Graph Builder. So this is a trick that shows up a lot in pictures from the gallery.   So we're going to take my ozone on the Y, we'll take my PM10 on the X. 
I'll turn off that smoother line, but I'm going to add in the contour plot up here from the...from those graph elements.   Now here if I look at the elements for contour, here's where I can do the bag plot.   Now this sets up that bag plot here, which is pretty cool.   I think I'll color this one purple, just so I can just point out, that's where the middle of the two dimensional space is.   And that's very cool. We can see that point fallen out there, which was Los Angeles. That's well away. Now what if I want to throw those on box plots in as well.   Well, I'm going to take my dummy variable. I'm just going to drag it right down here to like almost the start of the leftmost X axis, then it tries to do something with it says, Oh, you got a category. It's x. Well I can right click in here and I can change that to a box plot.   And I'm going to do the same thing on the very top of the y axis, come right in here and change it.   Just right clicked in here. It's all I'm doing. From points, change that to box plot. There it is. There we go.   This section in here doesn't really mean anything. So I'm going to right click and I don't want a box plot in this little top square; I want nothing, so I just removed it.   Now to size these a little better, what I'm going to do, I'm going to hover over the dummy label. I'm going to pull all the way down to I kind of get that little diagonal, you know, line, and then I...this lets me move the width. There's the width, I can move here. And I can do the same thing   on the x axis and make it look a little prettier.   Which is kind of cool.   And of course we can come in here. We can we can take...we don't have to use the word "dummy" in here. You might not want dummy in a smart graph, we can take the label out   by double clicking on that axis and taking out the label and just making it kind of ghosting it over there. And now I've got my box plot. And now I've got, as well, my bag plot all in one.   
And this is another good one to do reference lines. I've showed you how to do reference lines earlier. But here you see I drew in some reference lines.   And there's hover help as well if you've got things you can label here and I think we did put a label on city.   And of course, we get ozone and PM10 because they're in the graph. You can pin these and you see I pin St. Louis, and I pin Los Angeles. They can be moved all over the place.   But I drew some lines and dashed lines. And so I can make my points, you know, basically, that, you know, Los Angeles is truly an outlier in two dimensional space, where St. Louis is only outlier in respect to PM10.   So that is how we do a bag plot with outlier boxes.   And again, as we go through this presentation, feel free to lose...let's see, leave a Q&A in the chat and we'll get back with you and glad to reproduce some of these views for you or answer any questions. Usually generates a lot of other ideas on graphs that maybe you've been itching to make.   Alright, so the next one we're going to take a look at is number six. This is the components effects plot. As I mentioned, when I was going through the   pictures from the gallery, this one is dealing with mixtures and I have a whole bunch of diluents. They are all in just...just for example, they're all in the same vat.   And this is a vat of solutions, chemicals and I can't have any of them add up over 100%. And so there's mixture designs and mixture modeling that helps you make sure you put the constraint in there that no one   ingredient in there, so that when they add up, they all have to add to 100%. So that's kind of why you're seeing that if all five of them here are only 20%   of the amount that would add 100%. So these are just...so this is showing you something that was very difficult to do. It was easy to do this type of analysis in JMP but it was hard to graph and show   exactly how they... 
how the different settings, the different ratios of the mixture here actually are affecting the output, in this case tablet hardness.   And this is a beautiful chart and it's making use of smoother lines to really help us. So what we're going to do, we're going to look at some stacked data.   And I do want to point out that   there's a great book if you're dealing with formulations and things like mixtures: Ron Snee and Roger Hoerl have a book called "Strategies for Formulations Development."   They do use JMP to do this. And this is an example of the ABCD mixture screening design. So these are some results that came out of a mixture screening experiment. So this is pretty good. I got my tablet hardness. I've got my different diluent amounts that I have.   So what we're going to do is just go to Graph Builder. Got that set up, I'll put my tablet hardness here, put my diluent amount here. Now they did take multiple measures here so   I'm not surprised to see what's going on. And there's several points. But what I'm going to do is I'm going to overlay   by the diluents   and the summary statistic we're going to use. Let's just use the mean.   Okay, now here's the beauty of the smoother. You have a smoother control box here in the smoother element, and you can play with the amount of straightness in the curve.   And I'm going to pull it down until they agree. And what I was looking at, I was really focusing on 20% because it makes sense to me that all five of these will have to add up to 100%.   It's something I probably haven't shown before. My eyes...I really love grid lines. You can turn these on when you double click on the axis.   There we go, can just turn on grid lines here and that really helps my eyes. But now you can see the results of this experiment showing something like cornstarch where   when it was low, it didn't make much of a difference. 
Not many of them made much of a difference when they were low in concentration in terms of   how much they made up of the vat. Here you can see the more we put, the more tablet hardness went down and we can see something like mannitol...   ...went up.   So that is a cool graph and probably something that you probably didn't know we could do, but it's actually quite easy to do within JMP.   All right, I'm checking our time, we are doing pretty good. So I will keep going until we are out of time. I will go with the next most popular view. The next most popular view   was these informative point plots. So this is jittering of points, but it's making a nice kind of cluster of them and it kind of makes this kind of packed circular grid.   And you see we've even got it...got this one sized by carbs and colored by calories. So this is a bunch of beer, so I...   this year I went on a pretty extreme diet, I did real well with it. One of the things I had to do was pay attention to the kind of beer I was drinking,   couldn't drink the really high calorie or high carb stouts and porters that I usually do. So did change what I was drinking, but I would love to make this type of chart against some of my favorite breweries, so I can find new things to drink.   So, pretty easy to do.   What we're going to do is we're going to go down here to beer calories.   Right now I'm going to put the brewery on the x axis, calories on the, um,   and then, and then the color and size.   So it's a little bit of a different graph and we're not going to mess with the y axis. So I got brewery down here. It's got a bunch of points in there. Now let's go ahead   and   go ahead and size   by the amount of carbs and color by the   calories.   And maybe I'll do them all per ounce. Maybe that's a fair way of showing it. So here's one version of the graph that we have.   Now,   this looks pretty interesting. 
But you might be wondering what's going on, what exactly is going on with   not having that grouping. They're all kind of in line. They're slightly jittered.   Well, if I go to my local data filter, let's size down. Let's just not look at all the breweries, let's just take a look at some of the top ones. So here's...   I asked for, again, just under that hotspot. This was asking for a local data filter down here at the bottom.   And then this red triangle, it'll let me order by count, find the biggest ones. Now just find the biggest four. Now you're seeing the grouping. In fact I'll click in here and add a grid line   to make that easy to see. And now I can see, you know, Sierra Nevada has got   you know, a really high calorie and really high carb Bigfoot, which is a really delicious beer. Okay, but I can see there's something else, much smaller, like Anheuser Busch has the Budweiser Select 55 which   says no carbs or very little carbs per ounce, and then just 4.6 calories per ounce, you know, so something close to zero.   So that's a pretty cool view. Now if I add too many of these I lose it, you're like, "Well, Scott, how do I get that back?"   Don't fear. Of course you can always size down your list. You can also make your graph bigger, but under points there is a jitter limit. And if I increase that jitter limit to two, you can see it's going to allow you to   take a little more space to create these jitters. Now you can see what you want. Now you can get a smaller subset and get just the view you like.   All right, so hopefully you enjoyed that one. That's one of my favorites.   Alright, so the next one we're gonna take a look at. Let's take a look at the informative box plot. This is a quick one. This is another new one in JMP 15 so the box plots, the bag plots and as well...the...   
the...this box plot,   this bag plot, as well as the ridge plot, these are the ones that were new to 15, had new functionality in 15. So this is a different kind of style,   which is kind of cool   that you get that kind of view on there. Plus, I'm able to color these box plots, because they're solid. Now I can give them a color. And here I colored them by a process capability measurement. So this gives you another chance to make your box plot stand out, which is kind of cool.   So what we're gonna do is we're gonna look at this fan supplier stack. This is a bunch of fan suppliers, looking at their revolutions per minute.   I'm going to go in the Graph Builder. I am going to put the fan supplier on the x, put the fan RPM on the Y, ask for box plots. Pretty easy. Now what we're going to do now is under the style, I'm going to say, give me solid.   Okay, and that's a brand new style. There's also a thin style, in case you'd like that one. I like this solid style because with the solid style, now I can take something like Cpk, and I can color by it.   And that's what we did here. And if I right click here right on the gradient, click on that gradient, maybe I will choose a green to white to red, kind of a   go-stop kind of situation. I'll reverse them to make sure that makes sense. And now I can see the things with the worst capability are indeed getting out   beyond my upper spec limit. Those things with higher Cpk, higher capability, are a lot closer to being centered and closer to my target, not spread out all that much.   So that's a pretty cool view, pretty easy to use.   Alright, so we actually got time to run through the last one, which is fantastic. So the last one we're going to look at   is this area range chart. This is not just the lines. Everybody knows how to do lines and kind of do a trend chart. But you can see I've got area shaded in here between the lines.   
What we were doing in this one, I did this chart with Bill Worley, we've got a good blog on it as well. We were looking at some different   ages at which you can start to pull your US social security and you can take it early at 62, you can take it next at 68 and 8 months, and you can take it...you can wait all the way til you're 70 and there'll be higher payouts each year.   So if I take it at 62, I get a lower payout. But of course, I start earlier. So we wanted a good chart that kind of shows you the trade offs.
Monday, October 12, 2020
Ronald Andrews, Sr. Process Engineer, Bausch + Lomb   How do we set internal process specs when there are multiple process parameters that impact a key product measure? We need a process to divide up the total variability allowed into separate and probably unequal buckets. Since variances are additive, it is much easier to allocate a variance for each process parameter than a deviation. We start with a list of potential contributors to the product parameter of interest. A cause-and-effect diagram is a useful tool. Then we gather the sensitivity information that is already known. We sort out what we know and don’t know and plan some DOEs to fill in the gaps. We can test our predictor variances by combining them to predict the total variance in the product. If our prediction of the total product variability falls short of actual experience, we need to add a TBD factor. Once we have a comprehensive model, we can start budgeting. Variance budgeting can be just as arbitrary as financial budgeting. We can look for low hanging fruit that can easily be improved. We may have to budget some financial resources to embark on projects to improve factors to meet our variance budget goals.     Auto-generated transcript...   Speaker Transcript Ronald Andrews Well, good morning or afternoon as the case may be. My name is Ron Andrews and topic of the day is variance budgeting. Oh, I need to share my screen. And there's a file, we will be getting to. And we'll get to start with PowerPoint. So variance budgeting is the topic. I'm a process engineer at Bausch and Lomb; got contact information here. My supervision requires this disclaimer. They don't necessarily want to take credit for what I say today. Overview of what we're going to talk about What is the variance budget? A little bit of history. When do we need one? We have some examples. 
We'll go through the elements of the process, cause and effect diagram, gather the foreknowledge, do some DOEs to fill in the gaps, Monte Carlo simulations, as required. And we've got a test case we'll work through. So really, what is a variance budget? Mechanical engineers like to talk about tolerance stack-up. Well, tolerance stack-up is basically a corollary of Murphy's Law, that being all tolerances will add unidirectionally in the direction that can do the most harm. Variance budget is like a tolerance stack-up, except that instead of budgeting the parameter itself, we budget the variance -- sigma squared. We're relying on more or less normal shape distributions, rather than uniform distributions. Variances are additive, which makes the budgeting process a whole lot easier than trying to budget something like standard deviations. Brief example here. If we used test-and-sort or test-and-adjust strategies, our distributions are going to look more like these uniform distributions. So if we have a distribution with a width of 1 and one with a width of 2 and another with a width of 3, we add them all together, we end up with a distribution with a width of pretty close to 6. In this case, we probably need to budget the tolerances more than the variances. ...If we rely on process control, our distributions will be more normal. In this case, if we have a normal distribution with a standard deviation of 1, standard deviation of 2, standard deviation of 3, we add them up, we end up with a standard deviation of 3.7, a lot less than six. So we do the numbers: 1 squared plus 2 squared plus 3 squared equals essentially 3.7 squared. Now to be fair, on that previous slide, if I added up these variances, they would have added up to the variance of this one. But when you have something other than a normal distribution, you have to pay attention to the shape down near the tail. It depends on where you can set your specs. So, what is the variance budget? 
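The arithmetic in the two slides above can be checked with a few lines of Python. This is only a sketch using the standard library: the standard deviations 1, 2, 3 and the widths 1, 2, 3 come from the talk, while the sample size and random seed are arbitrary choices made for illustration.

```python
import math
import random
import statistics

# Standard deviations of the three independent normal contributors (from the talk).
sds = [1.0, 2.0, 3.0]

# Variances add, so the combined standard deviation is the root of the summed squares.
combined_sd = math.sqrt(sum(s ** 2 for s in sds))  # sqrt(1 + 4 + 9) = sqrt(14)

# Monte Carlo check: add one draw from each contributor, many times over.
rng = random.Random(0)  # seed chosen arbitrarily for repeatability
totals = [sum(rng.gauss(0.0, s) for s in sds) for _ in range(100_000)]
simulated_sd = statistics.pstdev(totals)

# For the uniform (test-and-sort) case the worst-case widths add linearly.
widths = [1.0, 2.0, 3.0]
total_width = sum(widths)

print(f"combined sd (root sum of squares): {combined_sd:.2f}")
print(f"simulated sd:                      {simulated_sd:.2f}")
print(f"stacked uniform width:             {total_width:.0f}")
```

The first two numbers should agree closely, and both sit well below the linear stack-up of 6.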
Non normal distributions are going to require special attention and we'll get to those later. For now, a variance budget is kind of like a financial budget. They can be just as arbitrary. There are only three basic rules. We translate everything into common currency. Now we do this for each product measure of interest, but we translate all the relevant process variables into their contribution to the product measure of interest. Rule number two is fairly simple. Don't budget more than 100% of the allowed variance. Yeah, sounds simple. I've seen this rule violated more than once in more than one company. Number three. This goes for life in general, as well as engineering: use your best judgment at all times. Little bit of history. This is not rocket science. Other people must be doing something similar. I have searched the literature and I have not been able to find published accounts of a process similar to this. I'm sure it's out there, but I have not found any published accounts yet. So for me the history came back in the 1980s, when I worked at Kodak, with a challenge from management. The challenge was produce film with no variation perceived by customers. Actually what they originally said was produce film with no variation. Then no perceivable variations. They defined that as a Six Sigma shift would be less than one just noticeable difference. Kodak was pretty good on the perceptual stuff and all these just noticeable differences were defined, we knew exactly what they were. For a slide film like Kodachrome, which is what I was working on at the time, color balance was the biggest challenge. Here, in this streamlined cause and effect diagram, color balance is a function of the green speed, the blue speed and the red speed. Now I've sort of fleshed out one of these legs. 
The red speed, I got the cyan absorber dye and then one of the emulsions as the factors that contribute to the red speed, and that affects the color balance. This is a very simplified version. There are actually three different emulsions in the red record, there are three more in the green record. There are two more in the blue record. Add up everything, there are 75 factors that all contribute to color balance. These are not just potential contributors. These are actually demonstrated contributors. So this is a fairly daunting task. So moving on to when we need a variance budget. Got a little tongue in cheek decision tree here. Do we have a mess in the house? If not, life is good. If so, how many kids do we have? If one, we probably know where the responsibility lies. If more than that, then we probably need a budget. This is an example of some work we did a number of years ago on a contact lens project at Bausch and Lomb. This is long before it got out the door to the marketplace. We were having trouble meeting our diameter specs, plus or minus two tenths of a millimeter. We were having trouble meeting that. We looked at a lot of sources of variability and we managed to characterize each one. So lot to lot, and this is with the same input materials and same set points, fairly large variability. Lens to lens within a lot, lower variability. Monomer component No. 1, we change lots occasionally, extreme variability. Monomer component No. 2 also had a fairly large variability. Now we mix our monomers together and we have a pretty good process with pretty good precision. It's not perfect and we can estimate the variability from that. That's a pretty small contributor. We put the monomer in a mold and put it under cure lamps to cure it, and the intensity of the lamps can make a difference. There we can estimate that source of variability as well. We add all these distributions up and this is our overall distribution. 
It does go beyond the spec limits on both ends. Standard deviation of .082. And as I mentioned, spec limits of plus and minus .2, that gives us a PPk of .81. Not so good. Percent out of spec estimated at 1.5%. It might have been passable if it was really that good, but it wasn't. This estimate assumes each lens is an independent event. They're not. We make the lenses in lots and every lot has a certain set of raw materials and a certain set of starting conditions. So within a lot, there's a lot of correlation. And two of the components I mentioned, two monomer components that had sizable contributions, looking here, occasionally you can see the yellow line and the pink line. These are the variability introduced by these two monomer components. When they're both on the same side of the center line, they push the diameter out towards the spec limits and we have some other sources of variability that add to the possibilities. Another problem is that our .2 limit is for an individual lens. We disposition based on lots. And so this plot predicts lot averages, though when we get a lot average out to .175, chances are we're going to have enough lenses beyond the limit to fail the lot. So in all, added up, our estimate is 4% of the lots are going to be discarded. And they're going to come in bunches. We're going to have weeks when we can't get half of our lots through the system. So this is a non-starter. We have to make some major improvements. The lot-to-lot variability from two monomer components contributed a good chunk of that variability. We looked and found that the overall purity of Monomer 1 was a significant factor and certain impurities in Monomer 2, when present, were contributors. Our chemists looked at the synthetic routes for these ingredients and found that there was a single starting material that contributed most of the impurities. 
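The capability figures quoted at the start of this passage (standard deviation .082, spec limits of plus and minus .2) can be reproduced with a short calculation. The sketch below assumes a normal distribution centered exactly on target, which the talk does not state explicitly; the variable names are illustrative only.

```python
from statistics import NormalDist

# Figures from the talk: overall standard deviation and symmetric spec limits (mm).
overall_sd = 0.082
lower_spec, upper_spec = -0.2, 0.2

# Assumption: the process mean sits exactly on target (deviation from nominal = 0).
mean = 0.0

# Ppk uses the distance from the mean to the nearer spec limit, in units of 3 sigma.
ppk = min(upper_spec - mean, mean - lower_spec) / (3.0 * overall_sd)

# Fraction of individual lenses outside the limits under a normal model.
dist = NormalDist(mu=mean, sigma=overall_sd)
fraction_out = dist.cdf(lower_spec) + (1.0 - dist.cdf(upper_spec))

print(f"Ppk = {ppk:.2f}")
print(f"estimated out of spec = {fraction_out:.1%}")
```

As the speaker notes, this treats every lens as independent, so it understates the real lot-level loss when lenses within a lot are correlated.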
They recommended that our suppliers distill this starting ingredient to eliminate the impurities. That made some major improvements. We also put variacs on the cure lamps to control the intensity. Lamp intensity was not a big factor, but this was easy. And when it's easy, you make the improvement. Strictly speaking, this was a variance assessment, rather than a variance budget. We never actually assigned numeric goals for each component. This is back...we're kind of picking the low-hanging fruit. I mean, we found two factors that pretty much accounted for a large portion of the variability. Maybe we need a little bit better structure to reach the higher branches, now that we need to reach up higher. Current status on lens dimension, lens diameter. PPk is 2.1. The product's on the market now, has been for a few years. This is not the problem anymore. We've made major improvements in these monomer components. We're still working on them. They still have detectable variability; detectable, but it hasn't been a problem in a long time. So the basic question is, what do we do to apply data to a variance budget? Maybe reduce that arbitrariness a little bit. We have to start by choosing a product measure in need of improvement. We need to identify the potential contributors; cause and effect diagrams are a convenient tool. We need to gather some foreknowledge. We need to know the sensitivity. The product measure divided by the process measure; what's the slope of that line? We are going to need some DOEs to fill in the gaps. We need to estimate the degree of difficulty for improving some of these factors. And we estimate the variance from each component and then we divide the total variance goal among the contributors. Sounds easy enough. Let's get into an example. Let's say we're working on a new project. And along the way, we have a new product measure called CMT (stands for cascaded modulation transfer) to measure overall image sharpness. 
Kind of important for contact lenses. Target is 100, plus or minus 10. We want a PPK of at least 1.33 That means standard deviation's got to be 2.5 or less. Variance has got to be 6.25 or less. What factors might be involved? Let's think about a cause and effect diagram. We can go into JMP and create a table. We start by listing CMT in the parent column. Then we list each of our manufacturing steps in the child column. And then we start listing these child factors over on the parent's side and then we start listing subfactors. These subfactors are obviously generic and arbitrary, the whole thing's hypothetical. And we can go as many levels as we want. We can have as many branches in the diagram as we care to, but we've identified 14 potential factors here. So we go into the appropriate dialog box, identify the parent column and the child column. Click the OK button and out pops the cause and effect diagram. Brief aside here. I've been using JMP for 30 years now. I have very, very few regrets. This is one of them. And my regret is, I only found this last year. I don't know, actually, when this tool was implemented. I wish I had found it earlier because this is the easiest way I found to generate a cause and effect diagram. So we need to gather the sensitivity data. Physics sometimes will give us the answer. In optics, if we know the refractive index and the radius of curvature, that can give us some information about the optical power of the lens. Sometimes physics, oftentimes we need experimental data. So, ask the subject matter experts. Maybe somebody's done some experiments that will give us an idea. We're going to need some well-designed experiments because no way have all 14 of those factors been covered. Several notches down on the list, in my opinion, is historical data. And if you've used historical data to generate models, you know, some of the concerns I'm nervous about. We need to be very cautious with this. 
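Returning to the CMT target stated at the start of this passage (100 plus or minus 10, PPk of at least 1.33), the allowed variance and a budget check can be sketched as follows. The target and PPk goal come from the talk; the three factors, their slopes, and their standard deviations are hypothetical values invented purely to illustrate the "common currency" and "don't budget more than 100%" rules.

```python
import math

# From the talk: spec half-width of 10 around a target of 100, PPk goal of 1.33 (4/3).
half_width = 10.0
target_ppk = 4.0 / 3.0

# Largest standard deviation and variance that still meet the PPk goal.
max_sd = half_width / (3.0 * target_ppk)
max_variance = max_sd ** 2

# Hypothetical factors: (sensitivity in CMT units per input unit, input standard deviation).
factors = {
    "factor_a": (1.5, 1.0),
    "factor_b": (0.8, 1.2),
    "factor_c": (0.5, 2.0),
}

# Common currency: each factor's contribution to CMT variance is (slope * sd) squared.
contributions = {name: (slope * sd) ** 2 for name, (slope, sd) in factors.items()}
total_variance = sum(contributions.values())
within_budget = total_variance <= max_variance

print(f"allowed sd = {max_sd:.2f}, allowed variance = {max_variance:.2f}")
for name, value in contributions.items():
    print(f"{name}: {value:.3f} ({value / max_variance:.0%} of budget)")
print(f"total = {total_variance:.3f}, within budget: {within_budget}")
```

With real data, the slopes would come from the DOE parameter estimates and the standard deviations from production distribution data.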
Historical data, it's usually messy; it has a lot of misidentified numbers, sometimes things in the wrong column, it needs a lot of cleaning. There's also a lot, also a lot of correlation between factors. Standard practice is to reserve 25% of the data points randomly, reserve that data for confirmation, generate the model with 75% of the data, and then test it with a 25% reserve data. If it works, maybe we have something worth using. If not, don't touch it. So gathering foreknowledge, we want to ask subject matter experts independently to contribute any sensitivity data they have. I'm taking a page from a presentation last year at the Discovery Summit by Cy Wegman and Wayne Levin. This is their suggestion in gathering foreknowledge to avoid the loudest voice in the room rules syndrome. Sometimes there's a quiet engineer sitting in the back who may have important information to impart, may or may not speak up. So we want to get that information. Ask everybody independently to start with. Then get people together and discuss the discrepancies. There will be some. Where are the gaps? What parameters still need sensitivity or distribution information? What parameters can we discount? I'd like to find these. What parameters are conditional? Doesn't happen very often, but in our contact lens process, we include biological indicators in every sterilization cycle. These indicators are intentionally biased so that false positives are far more likely than false negatives. When we get a failure in this test, we sterilize again. We know our sterilization routine was probably right, but we sterilize again. So sometimes we sterilize twice. That can have a small effect on our dimensions. It's small, but measurable. So we're going to need to plan some experiments to gather the sensitivities for things we don't know about. And we'll look at production distribution data; use it with caution to generate sensitivity. 
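The 75/25 holdout practice described above for historical data can be sketched in plain Python. Everything here is synthetic: the data, the single-predictor straight-line model, and the seed are assumptions made for illustration, not anything taken from the speaker's process.

```python
import math
import random

rng = random.Random(1)  # arbitrary seed for repeatability

# Synthetic stand-in for cleaned historical data: one process input, one product measure.
n = 200
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [5.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Randomly reserve 25% of the rows for confirmation; fit on the other 75%.
indices = list(range(n))
rng.shuffle(indices)
n_holdout = n // 4
holdout_idx = indices[:n_holdout]
train_idx = indices[n_holdout:]

def fit_line(idx):
    """Ordinary least squares fit of y on x over the given row indices."""
    mean_x = sum(x[i] for i in idx) / len(idx)
    mean_y = sum(y[i] for i in idx) / len(idx)
    sxx = sum((x[i] - mean_x) ** 2 for i in idx)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in idx)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(idx, slope, intercept):
    """Root mean square prediction error over the given row indices."""
    squared = sum((y[i] - (intercept + slope * x[i])) ** 2 for i in idx)
    return math.sqrt(squared / len(idx))

slope, intercept = fit_line(train_idx)
train_rmse = rmse(train_idx, slope, intercept)
holdout_rmse = rmse(holdout_idx, slope, intercept)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"training RMSE = {train_rmse:.2f}, holdout RMSE = {holdout_rmse:.2f}")
```

If the holdout error is much worse than the training error, that is the "don't touch it" case the speaker describes.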
We can use it to generate information on the variability of each of the components and the overall variability of the product measure of interest. We need to do some record keeping along the way. We can start with that table we used to generate the cause and effect diagram, add a few more columns. Fill in the sensitivities, units of measure, various columns. Any kind of table will do. Just keep the records and keep them up to date. We're going to need some DOEs to fill in the gaps. There are some newer techniques -- definitive screening designs, group orthogonal super saturated designs -- that provide a good bang for the buck when the situation fits. Now in this particular situation, we got 14 factors. We asked our subject matter experts. Some of them have enough experience to predict some directional information, but nobody has a good estimate of the actual slopes. So we need to evaluate 14 factors. I'd love to run a DSD, but that requires 33 runs; I don't have the budget for it. So we're going to resort to the custom DOE. So, go to the custom DOE function and then...been using PowerPoint for long enough now...time we demonstrated a few things live in JMP. That would be DOE, Custom Design. And you don't have to, but it's a very good practice to fill in the response information (if I could type it right). Match target from 90 to 110. Importance of 1, only makes a difference if we have more than one response. The factors. I have my file, so I can load these quickly there. Here we have all 14 of the factors. The factor constraints, I've never used. But I know it's there if some combination of factors would be dangerous. I know that we can disallow it. The model specification. This is probably the most important part. This is basically a screening operation. We're just going to look at the main effects. Now our subject matter experts suggested the interactions are not likely. And nonlinearity is possible but not likely to be strong. 
So we're going to ignore those for now, at least for the screening experiment. We don't need to block this. We don't need extra center points. For 14 main effects, JMP says a minimum of 15 runs (that's a given) and a default of 20. I've learned that if I have a budget that can run the default, that's a good place to start. I can do 20 runs; 33 was too much. I can manage the 20. Let's make this design. I left this in design units. This is a hypothetical example; I didn't feel like replacing these arbitrary units with other arbitrary units. We've got a whole suite of design evaluation tools; there are a couple that I normally look at. First, the power analysis. If the root mean square error estimate of 1 is somewhere in the ballpark, then these power estimates are going to be somewhere in the ballpark. .9 and above is pretty good. I like that. The other thing I normally look at is the color map on correlations. I like to actually make it a color map. And it's kind of complicated. We've got 14 main effects, and I honestly haven't counted all the two-way interactions. What we're looking for is confounded effects, where we have two red blocks in the same row. Well, I don't see that. That's good. We've got some dark blue where there's no correlation. We've got some light blue where there's a slight correlation. And we have some light pink where maybe it's a .6 correlation coefficient. This is tolerable. As long as we don't have complete confounding, we can probably estimate what's what, what's causing the effect. Now this is good. Move on, make the table. Well, this is our design. We've got the space to fill in the results. I'm going to take a page from the Julia Child school of cooking: do the prep for you, put it in the oven, and then take a previously prepared file out of the oven that already has the executed experiment. These are the results. CMT values: we wanted them between 90 and 110. We've got a couple here in the 80s. There's 110.5, and we've got 111 here. Looks like we have a little work to do.
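The idea behind the color map on correlations can be illustrated with a small, hedged sketch: compute the correlation between each main-effect column and each two-way interaction column of a coded design. The example below does not reproduce the 20-run custom design from the talk; it uses two textbook three-factor designs (a full factorial and a half fraction) chosen as assumptions because their answers are known exactly. It also counts the two-way interactions among 14 factors.

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_effect_interaction_corr(design):
    """Largest |correlation| between any main effect and any two-way interaction.

    design: list of runs, each a tuple of coded factor levels (-1 or +1).
    """
    k = len(design[0])
    cols = [[run[j] for run in design] for j in range(k)]
    worst = 0.0
    for a, b in itertools.combinations(range(k), 2):
        interaction = [u * v for u, v in zip(cols[a], cols[b])]
        for main in cols:
            worst = max(worst, abs(pearson(main, interaction)))
    return worst

full = list(itertools.product([-1, 1], repeat=3))          # 2^3 full factorial
half = [run for run in full if run[2] == run[0] * run[1]]  # half fraction, C = A*B

print(math.comb(14, 2))                   # 91 two-way interactions for 14 factors
print(max_effect_interaction_corr(full))  # 0.0: nothing is confounded
print(max_effect_interaction_corr(half))  # 1.0: complete confounding
```

A value of 1.0 corresponds to the solid red off-diagonal block the speaker is looking for; values around .6 correspond to the light pink cells that are tolerable for screening.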
Let's analyze the data. Everything's all done for us. There's the response. Here are all the factors. We want the screening report. Click Run. R squared is .999. Yeah, you can tell this is fake data. I probably should have set the noise factor a little higher than this. The first six factors are highly significant; the next eight, not so much. I was lazy when I generated it: I put something in there for the first six. Now, typically we eliminate the insignificant factors. We can eliminate them all at once, but I tend to do it one at a time: eliminate the least significant factor each time and see what it does to the other numbers. Sometimes it changes, sometimes it doesn't. Eliminate this one, and it looks like Cure1 slipped in under the wire at .0478. It's just under .05. I doubt that it's a big deal, but we'll leave it in there. So we look at the residuals; they're kind of random, that's good. Studentized residuals, also kind of random. We need to look at the parameter estimates. This is what we paid for. The regression coefficients are the slopes we were looking for. These are the sensitivities. That's why we did the experiment. I'm a visual kind of guy, so I always look at the prediction profiler. I look at the plot of the slopes, and I look at the confidence intervals, which are pretty small. Here you can just barely see there's a blue shaded interval. I also like to use the simulator when I have some information about the inputs, so that we can input the variability for each of these. Now if you'll allow me to again use the Julia Child approach, I'll go back to the previously prepared example where I've already input the variations on each one of these. For Mold Tool 1, I input an expression that results in a bimodal distribution. And for Mold Tool 2, I input a uniform distribution.
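To make "the regression coefficients are the slopes" concrete, here is a small sketch under a simplifying assumption: for a balanced, orthogonal two-level design in coded units, each least-squares main-effect slope reduces to the sum of (level times response) divided by the number of runs. The custom design in the talk is not perfectly orthogonal, and JMP fits the full least-squares model, so this is an illustration of the idea rather than the calculation JMP performs. The three-factor design and the noise-free response are made up.

```python
import itertools

def main_effect_slopes(design, response):
    """Main-effect slopes for an orthogonal, balanced design coded -1/+1.

    Under those assumptions, slope_j = sum_i(x_ij * y_i) / n.
    """
    n = len(design)
    k = len(design[0])
    return [sum(run[j] * y for run, y in zip(design, response)) / n
            for j in range(k)]

# Hypothetical example: 2^3 full factorial with known sensitivities 3, -2, 0.
design = list(itertools.product([-1, 1], repeat=3))
response = [100.0 + 3.0 * a - 2.0 * b + 0.0 * c for a, b, c in design]

slopes = main_effect_slopes(design, response)
print(slopes)  # [3.0, -2.0, 0.0]
```

The third factor comes back with a slope of zero, which is the kind of factor that would be eliminated from the model as insignificant.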
And I gotta say, in defense of my friends in the tool room, a bimodal distribution only happens in a situation like what happened last month, where the tools we wanted to use were busy on the production floor, so for the experiment, we used some old iterations. We actually mixed iterations. When that happens, we can get a bimodal distribution. This uniform distribution never happens with these guys. They're always shooting for the center point and usually come within a couple of microns. The other distributions are all normal, with various widths. In one case, we had a bit of a bias to it. These are the input distributions. Here's our predicted output. Even though we had some non-normal starting distributions, we have pretty much a normal output distribution. It does extend beyond our targets. We kind of knew that. Now, the default when you start here is 5,000 runs. I usually increase this to something like 100,000. It doesn't take any extra time to execute, and it gives you somewhat smoother distributions. It also produces a table here; we can make the table. Move this over here. A big advantage of this is that we can get (I don't want this CMT yet)... let's look at the distributions of the input factors. This is a bigger, fancier plot. This is our bimodal distribution, the uniform, these various normal distributions with various widths, and this one has kind of a bias to it. So we take all those and add them up. We look at this and we have a distribution. It looks pretty normal, even though some of the inputs were not normal. We can use conventional techniques on this. So when we start setting the specs, it does extend beyond our spec limits. So we're going to need to make some improvements in this. Scroll down here. Look at the capability information. Ppk of .6. That's a nonstarter. No way is manufacturing going to accept a process like this. So we need to make some significant improvements. So go back to the PowerPoint file.
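The simulation step can be sketched outside JMP as a plain Monte Carlo run: draw each input from its distribution, push the draws through a linear model, and compute Ppk against the 90 to 110 limits mentioned in the talk. Everything else here is an assumption for illustration: the three inputs, their slopes, and their distribution parameters are invented, so the resulting Ppk will not match the .6 reported by the speaker.

```python
import random
import statistics

LSL, USL = 90.0, 110.0  # lower and upper limits for CMT, from the talk

def ppk(values, lsl, usl):
    """Ppk = min(USL - mean, mean - LSL) / (3 * overall standard deviation)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return min(usl - mean, mean - lsl) / (3.0 * sd)

def simulate_cmt(n_runs, seed=1):
    """Simulate a hypothetical linear model with non-normal and normal inputs."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        # Bimodal input: an equal mixture of two normals (two tool iterations).
        center = -1.0 if rng.random() < 0.5 else 1.0
        mold_tool_1 = rng.gauss(center, 0.3)
        # Uniform input.
        mold_tool_2 = rng.uniform(-1.0, 1.0)
        # Normal input.
        cure = rng.gauss(0.0, 0.5)
        # Hypothetical sensitivities (slopes) of 3, 2, and 4.
        results.append(100.0 + 3.0 * mold_tool_1 + 2.0 * mold_tool_2 + 4.0 * cure)
    return results

cmt = simulate_cmt(100_000)
print(f"mean={statistics.fmean(cmt):.2f}  sd={statistics.stdev(cmt):.2f}  "
      f"Ppk={ppk(cmt, LSL, USL):.2f}")
```

Raising the number of runs mostly smooths the simulated histogram; the mean and standard deviation settle down quickly.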
And I'll scroll through the slides that were my backup in case I had a problem with the live version of JMP (because of me having the problem, not JMP). So here we have the factors. The standard deviations come from our preproduction trials, which estimate the variability. The sensitivities are the results from our DOE.
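One common way to combine a table of sensitivities and standard deviations is root-sum-square propagation: if the inputs are independent and the response is linear, the output variance is the sum of (sensitivity times standard deviation) squared. That is an assumption about how such a table is typically used, not necessarily the calculation on the speaker's slide, and the two-factor numbers below are made up.

```python
import math

def stack_up(sensitivities, std_devs):
    """Root-sum-square propagation for independent inputs and a linear model.

    Returns (output standard deviation, share of variance per factor).
    """
    contributions = [(s * sd) ** 2 for s, sd in zip(sensitivities, std_devs)]
    total_variance = sum(contributions)
    shares = [c / total_variance for c in contributions]
    return math.sqrt(total_variance), shares

# Hypothetical factors with slopes 3 and 4 and unit standard deviations.
sd_out, shares = stack_up([3.0, 4.0], [1.0, 1.0])
print(sd_out)  # 5.0
print([round(s, 2) for s in shares])  # [0.36, 0.64]
```

The variance shares show which factor to tighten first: reducing the variability of the largest contributor has the most effect on the output spread.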
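At Discovery Summit Japan Online, held in November 2020, the tutorial session "Special Tutorial: Introduction to Statistical Machine Learning with JMP" drew many requests to hear more, so we held it as a two-part series on Thursday, May 13 and Thursday, May 20.   The video of this seminar is available for a limited time, until 5:00 p.m. on Tuesday, June 15, 2021. *The video is no longer available.   Seminar title: Special Tutorial: Introduction to Statistical Machine Learning with JMP (two sessions)     Overview: With JMP (Pro), you can enjoy analysis more easily than with R, Python, and similar tools. It does not offer fully made-to-order analysis, but it handles semi-custom analysis well. With JMP, the following are easy to do:   (1) Analysis with just a mouse, without typing commands. (2) Graphs paired with statistics. (3) The analysis process can be saved as a script. (4) Reports can be output that follow the flow of the analysis process. (5) Because statistical thinking is at its foundation, it is well suited to systematic understanding and learning. And so on.   This talk uses numerical examples to discuss the prediction and classification that can be done with JMP. The methods covered include kernel smoothing, SVM, and neural discrimination. We also contrast them with conventional statistical multivariate analysis to deepen understanding.   Instructor: Motohisa Hirono (廣野 元久). He joined Ricoh Company, Ltd. in 1984 and has since worked on quality management and reliability management within the company and on promoting statistics. After serving as head of the QM Promotion Office in the Quality Division and as director of the SF Business Center, he is now in the Regulatory Affairs and Quality Assurance Group (ethics review committee member) of the Healthcare Business Support Office, Biomedical Business Center, and teaches and lectures widely inside and outside the company.   He was a part-time lecturer at the Faculty of Engineering, Tokyo University of Science (1997-1998) and the Faculty of Policy Management, Keio University (2000-2004). His main fields are statistical quality control and reliability engineering. His major books include 「グラフィカルモデリングの実際」 (Graphical Modeling in Practice), 「JMPによる多変量データの活用術」 (Making Use of Multivariate Data with JMP), 「アンスコム的な数値例で学ぶ統計的計算方法23講」 (23 Lectures on Statistical Computation Methods Learned Through Anscombe-Style Numerical Examples), 「JMPによる技術者のための多変量解析」 (Multivariate Analysis for Engineers with JMP), and 「目からウロコの多変量解析」 (Eye-Opening Multivariate Analysis).   Session titles: Session 1: JMP graphing features that are useful for big data. Session 2: Supervised classification in practice.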
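Level: Intermediate. The hover label extensions introduced in JMP 15 go beyond the traditional details-on-demand functionality and open up exciting new possibilities. Until now, hover labels displayed only a limited set of information taken from the current graph element, and the available customization was limited to things like setting label attributes on columns. This presentation shows how the JMP 15 extensions let users not only fully customize hover label content but also build new exploration patterns and integrated workflows. First, we use high-level commands that make it easy to add dynamic data visualization thumbnails to hover labels. This is the starting point for an exploratory workflow called "data drilling" or "drill down." Next, we look at the low-level infrastructure that makes this possible: extensions to the JMP Scripting Language that power users can use to customize these new workflows. We also explain what might be called "drill out," with examples of connecting to external systems to retrieve images, and how to create an add-in that displays multiple images in a single hover label.   Presenter profile: Nascif Abousalh-Neto is a software developer in the research and development group of the JMP Division at SAS. He joined SAS in 2004 and has 25 years of software development experience. In his current work, he is passionate about data visualization and about the pursuit of quality in every aspect of software development.   *This video is dubbed in Japanese. To watch this presentation with the original audio, click here. You will be taken to the session page under Discovery Summit Americas. (Please note that you will be asked to log in with your SAS Profile.)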
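Level: All. When we on the JMP development team develop analytical features, we have to decide which of the following to emphasize: making JMP more powerful by building muscle; making JMP sexier (getting users more excited?); or using JMP to relieve users' pain (easing users' frustrations and burdens?). Three longish English words that begin with "A" are anabolic (muscle builder), aphrodisiac, and analgesic (painkiller). Which of these three should we pursue? The speaker believes we should aim for the analgesic, and that the analgesic should be the central driving force in JMP development. Of course, the three are not mutually exclusive. Adding sexy, powerful features can also relieve pain. But the analgesic is what matters, because when pain occurs, we become unable to move and lose motivation. When there is pain, we cannot fully read what the data is telling us, and we stop where we are.    John Sall is co-founder and Executive Vice President of SAS Institute Inc., a world-leading business analytics software vendor, and the creator of JMP, the interactive and highly visual statistical discovery software developed for scientists and engineers. He still leads the JMP business today. Sall is also a fellow of the American Statistical Association and of the American Association for the Advancement of Science (AAAS), one of the world's largest academic societies. In addition, out of his strong interest in international conservation, he serves on the board of the World Wildlife Fund (WWF) and on the advisory board of the Smithsonian National Museum of Natural History. He has also served as a trustee of North Carolina State University.   *This video is dubbed in Japanese. To watch this presentation with the original audio, click here. You will be taken to the session page under Discovery Summit Americas. (Please note that you will be asked to log in with your SAS Profile.)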
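Level: Intermediate (machine translated). It is hard to overstate the value of good visualization when summarizing research results, but choosing the right medium for sharing with colleagues, industry peers, and the larger community is just as important. This presentation discusses the various formats used to disseminate data, results, and visualizations, along with the benefits and limitations of each. A brief overview of JMP Live features sets the stage for an exciting set of potential applications. We show how to publish JMP graphics to JMP Live using the rich interactive interface and scripting methods, and we provide examples and guidance for choosing the best approach. The presentation closes with a showcase of a custom JMP Live publishing interface for JMP Clinical results, including the considerations made in the dialog design, how the publishing framework works, the structure of JMP Live reports and their relationship to JMP Clinical client reports, and a discussion of potential consumption patterns for published reviews.   To watch this presentation with a Japanese transcript, click here. You will be taken to the session page under Discovery Summit Americas. (Please note that you will be asked to log in with your SAS Profile.)   English subtitles can be selected in the video below.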
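Level: Intermediate (machine translated). Monitoring the postmarket safety of drugs and therapeutic biologics is critical to protecting public health. To facilitate the safety monitoring process, FDA provides the FDA Online Label Repository (FOLP). FOLP collects the most recent drug listing information that companies have submitted to FDA. However, navigating hundreds of drug labels and extracting meaningful information is difficult, and an easy-to-use software solution can help.   The most frequent single cause of safety-related drug withdrawals from the market over the past 50 years is drug-induced liver injury (DILI). In this presentation, we use JMP Text Explorer to analyze 462 drug labels with a DILI indicator. Terms and phrases in the "Warnings and Precautions" section of the drug labels are matched to DILI keywords and MedDRA terms. The XGBoost add-in for JMP Pro is used to predict the DILI indicator through cross-validation of XGBoost predictive models on the term matrix. The results show that a similar approach can readily be used to analyze other drug safety concerns.    To watch this presentation with a Japanese transcript, click here. You will be taken to Discovery Summit Americas. (Please note that you will be asked to log in with your SAS Profile.)   English subtitles can be selected in the video below.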